How can we analyse data generated by Go Correct to discover which aspects of English are most challenging for learners?
Go Correct is a service that English learners use to practise writing and identify their mistakes. Learners write a short text every day in response to a question. The text is corrected by a human and each error is categorised by grammar point. Students can click on mistakes for more information and see statistics about where they make the most errors.
As a by-product of providing this service, Big Languages has a database of texts written by non-native speakers and information about the numbers and types of mistakes they make. This information has the potential to be used in interesting ways that go beyond providing personal feedback to the individual learner.
When taken as a whole, this data can provide insights into the most problematic aspects of English for all learners, and the most problematic aspects for speakers of particular native languages. (Speakers of a given language tend to make similar errors, resulting from rules carried over from their native language.)
Help from Manchester University students
Big Languages wanted to better understand what the data currently shows and how it could be used. In October 2019, two maths and physics students from Manchester University came to do a short work placement in which they extracted data from the database and visualised it in a way that would allow some insights to be drawn from it. At the time of the placement, the database contained 53,000 words of English, written by 75 English learners with 13 different native languages.
Having the Manchester University students work on this project was very valuable as they brought skills and knowledge that did not otherwise exist within the company.
A closer look at the data
In the database, mistakes were grouped into categories and sub-categories. For example, the category ‘Pronouns’ had sub-categories such as ‘Subject pronouns’, ‘Direct object pronouns’ and ‘Indirect object pronouns’.
However, some categories didn’t have sub-categories, which meant the students had to do some awkward joining of tables to produce a full list of individual mistakes. Discovering this problem was useful in itself, as it showed the data should be structured differently in future.
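To make the joining problem concrete, here is a minimal sketch using an in-memory SQLite database. The schema below (table and column names, sample rows) is an assumption for illustration, not Go Correct’s actual database: a LEFT JOIN keeps mistakes whose category has no sub-categories, and COALESCE falls back to the top-level category name.

```python
import sqlite3

# Illustrative schema -- names and rows are assumptions, not the real database.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
    CREATE TABLE categories    (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE subcategories (id INTEGER PRIMARY KEY, category_id INTEGER, name TEXT);
    CREATE TABLE mistakes      (id INTEGER PRIMARY KEY, category_id INTEGER, subcategory_id INTEGER);

    INSERT INTO categories VALUES (1, 'Pronouns'), (2, 'Spelling');
    INSERT INTO subcategories VALUES (10, 1, 'Subject pronouns');
    -- One mistake has a sub-category; the other belongs to a category with none:
    INSERT INTO mistakes VALUES (100, 1, 10), (101, 2, NULL);
""")

# LEFT JOIN keeps the mistake even when subcategory_id is NULL;
# COALESCE labels it with the top-level category name instead.
rows = cur.execute("""
    SELECT m.id, c.name, COALESCE(s.name, c.name) AS label
    FROM mistakes m
    JOIN categories c ON c.id = m.category_id
    LEFT JOIN subcategories s ON s.id = m.subcategory_id
    ORDER BY m.id
""").fetchall()

print(rows)  # every mistake now has a usable label
```

With an inner join on sub-categories, the second mistake would silently disappear, which is exactly the kind of gap the students had to work around.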
Most common mistakes overall
We started by looking at the most common mistakes across all non-native speakers.
This was interesting because it reveals the aspects of English that are most challenging for any learner, regardless of their native language. (It’s worth noting this is not new information. There have of course been many studies of this over the years.)
This is what our data showed. The first graph shows the top 10 mistakes.
The second graph shows all mistakes and you can click it to view a larger version.
Most common mistakes by native language
We then wanted to go one step further and compare the error ‘profiles’ by language. This could be useful not only for creating language-specific English courses but also for ‘authorship profiling’ – the task of analysing a text to infer information about its author. The non-native errors in a text can indicate the writer’s native language.
This type of work can be useful in forensic analysis, for example to gain information about the authors of written threats, ransom demands, false confessions and cybercrime correspondence.
For this step, we decided to look only at top-level categories (not sub-categories). Given the relatively small amount of data, sub-categories might have made the numbers too small to reveal any patterns.
Some languages had far more data than others. Only three – Arabic, Russian and Spanish – had enough mistakes to make comparison worthwhile; with too few mistakes, it isn’t reasonable to assume you are seeing a genuine pattern.
There were texts written by 7 Arabic speakers, 12 Russian speakers and 16 Spanish speakers. Therefore, the Spanish data represents the most diverse set of data.
We first visualised using graphs, with a set of bars for each error. The graph below shows the number of mistakes as a percentage of all mistakes made in that language. View a larger version.
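As a rough sketch of how those per-language percentages can be computed (the counts below are invented for illustration, not the real Go Correct figures):

```python
from collections import Counter

# Invented counts of mistakes per (native language, error category) --
# NOT the real data, just an illustration of the normalisation step.
counts = {
    "Arabic":  Counter({"Articles": 30, "Unnecessary word": 25, "Spelling": 15}),
    "Russian": Counter({"Articles": 60, "Unnecessary word": 10, "Spelling": 12}),
    "Spanish": Counter({"Articles": 20, "Unnecessary word": 12, "Spelling": 28}),
}

def percentages(counter):
    """Express each category as a share of all mistakes for that language."""
    total = sum(counter.values())
    return {category: 100 * n / total for category, n in counter.items()}

shares = {lang: percentages(c) for lang, c in counts.items()}
```

Normalising to percentages matters because the languages contributed very different volumes of text; raw counts would mostly reflect how much each group wrote, not where they struggle.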
One of the students then had the idea that a heatmap would be an effective way to reveal ‘problem’ areas for specific languages.
The ‘stand out’ number on the heatmap above is Russian speakers’ problems with using articles. (An ‘article’ is a word like ‘a’, ‘an’ or ‘the’; articles are used constantly in English and don’t exist in Russian.)
To identify other points of interest, it helps to look horizontally across each row. That reveals:
- Arabic speakers use more unnecessary words, i.e. they add words that are not needed
- Spanish speakers have relatively little trouble with using singular and plural correctly
- Spanish speakers have more trouble with spelling and pronouns
- Arabic and Russian speakers miss out the verb ‘be’ more than Spanish speakers
A lot more data is needed before this type of analysis and visualisation can be reliable or useful, but it is exciting to glimpse what could be possible in the future.
With a relatively small number of speakers, there is the potential for one speaker’s problems to skew the data for their language. For example, this data shows spelling errors were common among Spanish speakers. That could be due to one or two particularly weak spellers, or perhaps Spanish speakers in general find English spelling difficult. It would be interesting to look at the distribution of this error across individuals.
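One simple way to check for that kind of skew is to look at each speaker’s share of the language’s total for one error type. The per-speaker counts below are hypothetical, invented purely to illustrate the check:

```python
# Hypothetical spelling-error counts for individual Spanish speakers --
# invented numbers, not real Go Correct data.
spelling_errors = {"speaker_1": 2, "speaker_2": 3, "speaker_3": 21, "speaker_4": 4}

total = sum(spelling_errors.values())
shares = {speaker: count / total for speaker, count in spelling_errors.items()}

# If one speaker accounts for more than half of the language's spelling
# errors, the language-level figure mostly reflects that individual.
dominant = max(shares, key=shares.get)
skewed_by_one_speaker = shares[dominant] > 0.5
```

Here one speaker produces 21 of the 30 errors (a 70% share), so a language-level claim about spelling would be unsafe; a more even distribution would support a genuine language-wide pattern.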
There are also improvements that can be made to the structuring and categorisation of the data.
We’d like more data!
Do you know a non-native English speaker who wants to improve their English? Tell them to try Go Correct. The more users, the more texts we can gather to add to this analysis.