
NLP-progress

Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.

Grammatical Error Correction

Grammatical Error Correction (GEC) is the task of correcting different kinds of errors in text such as spelling, punctuation, grammatical, and word choice errors.

GEC is typically formulated as a sentence correction task. A GEC system takes a potentially erroneous sentence as input and is expected to transform it to its corrected version. See the example given below:

| Input (Erroneous) | Output (Corrected) |
| --- | --- |
| She see Tom is catched by policeman in park at last night. | She saw Tom caught by a policeman in the park last night. |
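To make the input/output formulation concrete, the following is a minimal sketch of how a correction can be viewed as a set of token-span edits. It uses Python's standard `difflib` on whitespace tokens, which is an assumption made for illustration; dedicated GEC tools use their own, linguistically informed alignments. The helper names `extract_edits` and `apply_edits` are hypothetical.

```python
from difflib import SequenceMatcher


def extract_edits(source: str, corrected: str):
    """Return (start, end, replacement) edits over whitespace tokens of `source`.

    Illustrative only: this is a plain difflib alignment, not the alignment
    used by any official GEC scorer.
    """
    src = source.split()
    cor = corrected.split()
    matcher = SequenceMatcher(a=src, b=cor, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # src[i1:i2] is rewritten as cor[j1:j2]; an empty span is an insertion,
            # an empty replacement is a deletion.
            edits.append((i1, i2, " ".join(cor[j1:j2])))
    return edits


def apply_edits(source: str, edits):
    """Apply edits right to left so that earlier token offsets stay valid."""
    tokens = source.split()
    for start, end, replacement in reversed(edits):
        tokens[start:end] = replacement.split()
    return " ".join(tokens)


if __name__ == "__main__":
    src = "She see Tom is catched by policeman in park at last night ."
    cor = "She saw Tom caught by a policeman in the park last night ."
    edits = extract_edits(src, cor)
    for edit in edits:
        print(edit)
    # Applying the extracted edits to the source reproduces the correction.
    assert apply_edits(src, edits) == cor
```

Representing a correction as span edits rather than as a full output sentence is what lets span-based metrics compare a system's edits against an annotator's edits.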

CoNLL-2014 Shared Task

The CoNLL-2014 shared task test set is the most widely used dataset to benchmark GEC systems. The test set contains 1,312 English sentences with error annotations by 2 expert annotators. Models are evaluated with the MaxMatch (M2) scorer (Dahlmeier and Ng, 2012), which computes a span-based Fβ-score (β set to 0.5, so that precision is weighted twice as much as recall).
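As a rough illustration of the Fβ computation, the sketch below scores a set of system edits against a set of gold edits by exact match. This is a simplification and not the MaxMatch scorer itself, which additionally searches for the system edit sequence that best matches the gold annotations. The function names are hypothetical, and treating precision or recall as 1.0 when its denominator is zero is an assumption about the edge-case convention.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """F-beta from true positive, false positive and false negative edit counts."""
    # Assumed convention: no proposed edits gives precision 1.0,
    # no gold edits gives recall 1.0.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def score_edits(system_edits, gold_edits, beta: float = 0.5) -> float:
    """Exact-match scoring of system edits against gold edits (simplified)."""
    system, gold = set(system_edits), set(gold_edits)
    tp = len(system & gold)   # edits the system got right
    fp = len(system - gold)   # edits the system proposed but gold lacks
    fn = len(gold - system)   # gold edits the system missed
    return f_beta(tp, fp, fn, beta)


if __name__ == "__main__":
    gold = {(1, 2, "saw"), (4, 5, "caught")}
    system = {(1, 2, "saw")}  # precise but incomplete
    print(score_edits(system, gold, beta=0.5))
    print(score_edits(system, gold, beta=1.0))
```

With β = 0.5, a system that proposes few but correct edits scores higher than it would under F1, which reflects the preference for precision in GEC evaluation.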

The shared task setting restricts systems to publicly available training datasets to ensure a fair comparison between systems. The highest published scores on the CoNLL-2014 test set are given below. A distinction is made between papers that report results in the restricted CoNLL-2014 shared task setting, training on publicly available datasets only (Restricted), and those that made use of large, non-public datasets (Unrestricted).

Restricted:

| Model | F0.5 | Paper / Source | Code |
| --- | --- | --- | --- |
| Transformer + Pre-train with Pseudo Data (Kiyono et al., EMNLP 2019) | 65.0 | An Empirical Study of Incorporating Pseudo Data into Grammatical Error Correction | NA |
| Copy-Augmented Transformer + Pre-train (Zhao and Wang, NAACL 2019) | 61.15 | Improving Grammatical Error Correction via Pre-Training a Copy-Augmented Architecture with Unlabeled Data | Official |
| CNN Seq2Seq + Quality Estimation (Chollampatt and Ng, EMNLP 2018) | 56.52 | Neural Quality Estimation of Grammatical Error Correction | Official |
| SMT + BiGRU (Grundkiewicz and Junczys-Dowmunt, 2018) | 56.25 | Near Human-Level Performance in Grammatical Error Correction with Hybrid Machine Translation | NA |
| Transformer (Junczys-Dowmunt et al., 2018) | 55.8 | Approaching Neural Grammatical Error Correction as a Low-Resource Machine Translation Task | Official |
| CNN Seq2Seq (Chollampatt and Ng, 2018) | 54.79 | A Multilayer Convolutional Encoder-Decoder Neural Network for Grammatical Error Correction | Official |

Unrestricted:

| Model | F0.5 | Paper / Source | Code |
| --- | --- | --- | --- |
| CNN Seq2Seq + Fluency Boost (Ge et al., 2018) | 61.34 | Reaching Human-level Performance in Automatic Grammatical Error Correction: An Empirical Study | NA |

Restricted: uses only publicly available datasets. Unrestricted: uses non-public datasets.

CoNLL-2014 10 Annotations

Bryant and Ng (2015) released 8 additional annotations for the CoNLL-2014 shared task test set, bringing the total to 10 when combined with the two official annotations.

Restricted:

| Model | F0.5 | Paper / Source | Code |
| --- | --- | --- | --- |
| SMT + BiGRU (Grundkiewicz and Junczys-Dowmunt, 2018) | 72.04 | Near Human-Level Performance in Grammatical Error Correction with Hybrid Machine Translation | NA |
| CNN Seq2Seq (Chollampatt and Ng, 2018) | 70.14 (measured by Ge et al., 2018) | A Multilayer Convolutional Encoder-Decoder Neural Network for Grammatical Error Correction | Official |

Unrestricted:

| Model | F0.5 | Paper / Source | Code |
| --- | --- | --- | --- |
| CNN Seq2Seq + Fluency Boost (Ge et al., 2018) | 76.88 | Reaching Human-level Performance in Automatic Grammatical Error Correction: An Empirical Study | NA |


JFLEG

The JFLEG test set, released by Napoles et al. (2017), consists of 747 English sentences with 4 references for each sentence. Models are evaluated with the GLEU metric (Napoles et al., 2016).

Restricted:

| Model | GLEU | Paper / Source | Code |
| --- | --- | --- | --- |
| SMT + BiGRU (Grundkiewicz and Junczys-Dowmunt, 2018) | 61.50 | Near Human-Level Performance in Grammatical Error Correction with Hybrid Machine Translation | NA |
| Transformer (Junczys-Dowmunt et al., 2018) | 59.9 | Approaching Neural Grammatical Error Correction as a Low-Resource Machine Translation Task | NA |
| CNN Seq2Seq (Chollampatt and Ng, 2018) | 57.47 | A Multilayer Convolutional Encoder-Decoder Neural Network for Grammatical Error Correction | Official |

Unrestricted:

| Model | GLEU | Paper / Source | Code |
| --- | --- | --- | --- |
| CNN Seq2Seq + Fluency Boost and inference (Ge et al., 2018) | 62.42 | Reaching Human-level Performance in Automatic Grammatical Error Correction: An Empirical Study | NA |


BEA Shared Task - 2019

The BEA-2019 dataset, released for the BEA Shared Task on Grammatical Error Correction, provides a newer and larger dataset for evaluating GEC models in 3 tracks (Restricted, Unrestricted, and Low Resource), which differ in the datasets allowed for training.

Training and development sets are released publicly, and a GEC model's performance is evaluated by F0.5 score. Model outputs on the test set have to be uploaded to CodaLab (publicly available), where category-wise error metrics are displayed. The test set consists of 4477 sentences (larger and more diverse than the CoNLL-2014 test set), and the outputs are scored with the ERRANT toolkit. The released data are collected from 2 sources: Cambridge English Write & Improve (W&I) and the LOCNESS corpus.

The description of tracks from the BEA site is given below:

Restricted Track: In the restricted track, participants may only use the following learner datasets: FCE, Lang-8 Corpus of Learner English, NUCLE, and W&I+LOCNESS.

Unrestricted Track: In the unrestricted track, participants may use anything and everything to build their systems. This includes proprietary datasets and software.

Low Resource Track (formerly Unsupervised Track): In the low resource track, participants may only use the following learner dataset: W&I+LOCNESS development set.

Since current state-of-the-art systems rely on as much annotated learner data as possible to reach the best performance, the goal of the low resource track is to encourage research into systems that do not rely on large amounts of learner data. This track should be of particular interest to researchers working on GEC for languages where large learner corpora do not exist.

Results on the W&I+LOCNESS test set:

Restricted track:

| Model | F0.5 | Paper / Source | Code |
| --- | --- | --- | --- |
| BEA Combination | 73.18 | Learning to Combine Grammatical Error Corrections | NA |
| Transformer + Pre-train with Pseudo Data (Kiyono et al., EMNLP 2019) | 70.2 | An Empirical Study of Incorporating Pseudo Data into Grammatical Error Correction | NA |
| Transformer | 69.47 | Neural Grammatical Error Correction Systems with Unsupervised Pre-training on Synthetic Data | Official: Code to be updated soon |
| Transformer | 69.00 | A Neural Grammatical Error Correction System Built On Better Pre-training and Sequential Transfer Learning | Official |
| Ensemble of models | 66.78 | The LAIX Systems in the BEA-2019 GEC Shared Task | NA |

Low-resource track:

| Model | F0.5 | Paper / Source | Code |
| --- | --- | --- | --- |
| Transformer | 64.24 | Neural Grammatical Error Correction Systems with Unsupervised Pre-training on Synthetic Data | Official: Code to be updated soon |
| Transformer | 58.80 | A Neural Grammatical Error Correction System Built On Better Pre-training and Sequential Transfer Learning | Official |
| Ensemble of models | 51.81 | The LAIX Systems in the BEA-2019 GEC Shared Task | NA |

Reference: