
NLP-progress

Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.

Simplification

Simplification consists of modifying the content and structure of a text in order to make it easier to read and understand, while preserving its main idea and approximating its original meaning. A simplified version of a text could benefit low literacy readers, English learners, children, and people with aphasia, dyslexia or autism. Also, simplifying a text automatically could improve performance on other NLP tasks, such as parsing, summarisation, information extraction, semantic role labeling, and machine translation.

Sentence Simplification

Research on automatic simplification has traditionally been limited to executing transformations at the sentence level. What should we expect from a sentence simplification model? Let’s take a look at how humans simplify (from here):

| Original Sentence | Simplified Sentence |
| --- | --- |
| Owls are the order Strigiformes, comprising 200 bird of prey species. | An owl is a bird. There are about 200 kinds of owls. |
| Owls hunt mostly small mammals, insects, and other birds though some species specialize in hunting fish. | Owls’ prey may be birds, large insects (such as crickets), small reptiles (such as lizards) or small mammals (such as mice, rats, and rabbits). |

Notice the simplification transformations performed:

When the set of transformations is limited to replacing a word or phrase by a simpler synonym, we are dealing with Lexical Simplification (an overview of that area can be found here). In this section, we consider research that attempts to develop models that learn as many text transformations as possible.

Evaluation

The ideal method for determining the quality of a simplification is human evaluation. Traditionally, a simplified output is judged in terms of grammaticality (or fluency), meaning preservation (or adequacy) and simplicity, using Likert scales (1-3 or 1-5). Warning: Are these criteria (at the sentence level) the most appropriate for assessing a simplified sentence? It has been suggested (Siddharthan, 2014) that a task-oriented evaluation (e.g., through reading comprehension tests (Angrosh et al., 2014)) could be more informative of the usefulness of the generated simplification. However, this is not general practice.

For tuning and comparing models, the most commonly used automatic metrics are BLEU and SARI (Xu et al., 2016).

The previous two metrics will be used to rank the models in the following sections. Despite popular practice, we refrain from using Flesch Reading Ease or Flesch-Kincaid Grade Level. Because of the way these metrics are computed, short sentences could get good scores even if they are ungrammatical or do not preserve meaning (Wubben et al., 2012), resulting in a misleading ranking.
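The weakness of the Flesch metrics can be seen in a minimal sketch. The two formulas below are the standard published ones; the syllable counter is a rough vowel-group heuristic and the "broken" sentence is a made-up example, both assumptions for illustration only.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def readability(text: str) -> tuple[float, float]:
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level) for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)  # numbers are ignored in this sketch
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    fre = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    fkgl = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return fre, fkgl


original = "Owls are the order Strigiformes, comprising 200 bird of prey species."
broken = "Owls the are. Bird of prey."  # hypothetical ungrammatical, meaning-losing output

fre_original, fkgl_original = readability(original)
fre_broken, fkgl_broken = readability(broken)
print(f"original: FRE={fre_original:.2f} FKGL={fkgl_original:.2f}")
print(f"broken:   FRE={fre_broken:.2f} FKGL={fkgl_broken:.2f}")
```

Both formulas depend only on words per sentence and syllables per word, so the broken output is scored as easier to read (higher Reading Ease, lower Grade Level) even though it is not a valid simplification.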

Finally, as seen in the previous section, a simplification could involve text transformations beyond paraphrasing (which SARI intends to assess). For these cases, it could be more suitable to use SAMSA (Sulem et al., 2018a), a recently introduced metric for measuring structural simplicity (i.e., sentence splitting). However, it has not yet been used in papers other than the one that introduced it.

IMPORTANT NOTE: In the tables of the following sections, a score with a * means that it was not reported by the original authors but by later research that re-implemented and/or re-trained and re-tested the model. In these cases, the originally reported score (if there is one) is shown in parentheses.
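For readers who want to process the tables programmatically, the notation above can be parsed with a short sketch like the following. The `Score` type and `parse_score` function are hypothetical names introduced here, not part of the repository.

```python
import re
from typing import NamedTuple, Optional


class Score(NamedTuple):
    value: float               # the score shown in the table
    reimplemented: bool        # True if the cell carries a "*"
    original: Optional[float]  # score reported by the original authors, if shown


# Matches cells such as "36.32", "30.46*" and "53.94* (53.6)".
_SCORE = re.compile(r"^(\d+(?:\.\d+)?)(\*)?(?:\s*\((\d+(?:\.\d+)?)\))?$")


def parse_score(cell: str) -> Optional[Score]:
    """Parse one BLEU/SARI cell; an empty cell means the score was not reported."""
    cell = cell.strip()
    if not cell:
        return None
    match = _SCORE.match(cell)
    if match is None:
        raise ValueError(f"unrecognised score cell: {cell!r}")
    value, star, original = match.groups()
    return Score(float(value), star is not None, float(original) if original else None)


print(parse_score("53.94* (53.6)"))
print(parse_score("36.32"))
```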

Main - Simple English Wikipedia

Simple English Wikipedia is an online encyclopedia aimed at English learners. Its articles are expected to contain fewer words and simpler grammar structures than those in their Main English Wikipedia counterpart. Much of the popularity of using Wikipedia for research in Simplification comes from publicly available sentence alignments between “equivalent” articles in Main and Simple English Wikipedia.

PWKP / WikiSmall

Zhu et al. (2010) compiled a parallel corpus with more than 108K sentence pairs from 65,133 Wikipedia articles, allowing 1-to-1 and 1-to-N alignments. The latter type of alignment represents instances of sentence splitting. The original full corpus can be found here. The test set is composed of 100 instances, with one simplification reference per original sentence. Zhang and Lapata (2017) released a more standardised split of this dataset called WikiSmall, with 89,042 instances for training, 205 for development and the same original 100 instances for testing.
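The difference between 1-to-1 and 1-to-N alignments can be made concrete with a small sketch. The `Alignment` class is a hypothetical in-memory representation, not the file format of the corpus; the example sentences are the ones from the table at the top of this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """One original sentence aligned to one or more simplified sentences."""
    original: str
    simplified: tuple[str, ...]

    @property
    def is_split(self) -> bool:
        # A 1-to-N alignment (N > 1) is an instance of sentence splitting.
        return len(self.simplified) > 1


pairs = [
    Alignment(
        "Owls are the order Strigiformes, comprising 200 bird of prey species.",
        ("An owl is a bird.", "There are about 200 kinds of owls."),
    ),
    Alignment(
        "Owls hunt mostly small mammals, insects, and other birds though some "
        "species specialize in hunting fish.",
        ("Owls’ prey may be birds, large insects (such as crickets), small reptiles "
         "(such as lizards) or small mammals (such as mice, rats, and rabbits).",),
    ),
]

num_splits = sum(pair.is_split for pair in pairs)
print(f"{num_splits} of {len(pairs)} alignments involve sentence splitting")
```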

We present the models tested on this dataset ranked by BLEU score. SARI cannot be reliably computed on this dataset since it does not contain multiple simplification references per original sentence. In addition, there are instances of more advanced simplification transformations (e.g., splitting) which SARI does not assess by definition.
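As a reminder of what a BLEU ranking measures when there is one reference per sentence, here is a minimal corpus-level sketch: clipped n-gram precision up to 4-grams combined with a brevity penalty. It assumes whitespace-tokenised input and no smoothing, and it is not the scoring script used by the papers in the table, so its numbers are not comparable to the reported ones.

```python
import math
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: list[list[str]], references: list[list[str]], max_n: int = 4) -> float:
    """Single-reference corpus BLEU on a 0-100 scale, without smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if min(matches) == 0:
        return 0.0  # some n-gram order has no match; unsmoothed BLEU is zero
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


reference = "an owl is a bird .".split()
exact = corpus_bleu([reference], [reference])
truncated = corpus_bleu(["an owl is a".split()], [reference])
print(f"exact copy of reference: {exact:.2f}, truncated output: {truncated:.2f}")
```

Because BLEU only rewards overlap with the reference, it says nothing directly about whether the output is simpler than the input.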

| Model | BLEU | SARI | Paper / Source | Code |
| --- | --- | --- | --- | --- |
| Hybrid (Narayan and Gardent, 2014) | 53.94* (53.6) | 30.46* | Hybrid Simplification using Deep Semantics and Machine Translation | Official |
| NSELSTM-B (Vu et al., 2018) | 53.42 | 17.47 | Sentence Simplification with Memory-Augmented Neural Networks | |
| PBMT-R (Wubben et al., 2012) | 46.31* (43.0) | 15.97* | Sentence Simplification by Monolingual Machine Translation | |
| RevILP (Woodsend and Lapata, 2011) | 42.0 | | Learning to Simplify Sentences with Quasi-Synchronous Grammar and Integer Programming | |
| UNSUP (Narayan and Gardent, 2016) | 38.47 | | Unsupervised Sentence Simplification Using Deep Semantics | |
| TSM (Zhu et al., 2010) | 38.0 | | A Monolingual Tree-based Translation Model for Sentence Simplification | |
| DRESS-LS (Zhang and Lapata, 2017) | 36.32 | 27.24 | Sentence Simplification with Deep Reinforcement Learning | Official |
| DRESS (Zhang and Lapata, 2017) | 34.53 | 27.48 | Sentence Simplification with Deep Reinforcement Learning | Official |
| NSELSTM-S (Vu et al., 2018) | 29.72 | 29.75 | Sentence Simplification with Memory-Augmented Neural Networks | |
| Pointer + Multi-task Entailment and Paraphrase Generation (Guo et al., 2018) | 27.23 | 29.58 | Dynamic Multi-Level Multi-Task Learning for Sentence Simplification | Official |

Coster and Kauchak (2011)

Coster and Kauchak (2011) automatically aligned 137K sentence pairs from 10K Wikipedia articles, considering 1-to-1 and 1-to-N alignments, with one simplification reference per original sentence. The corpus was split into 124K instances for training, 12K for development, and 1.3K for testing. The dataset is available here. As before, models tested on this dataset are ranked by BLEU score and not SARI.

| Model | BLEU | SARI | Paper / Source | Code |
| --- | --- | --- | --- | --- |
| Moses-Del (Coster and Kauchak, 2011b) | 60.46 | | Learning to Simplify Sentences Using Wikipedia | |
| Moses (Coster and Kauchak, 2011a) | 59.87 | | Simple English Wikipedia: A New Text Simplification Task | |
| SimpleTT (Feblowitz and Kauchak, 2013) | 56.4 | | Sentence Simplification as Tree Transduction | |
| PBMT-R (Wubben et al., 2012) | 54.3* | | Sentence Simplification by Monolingual Machine Translation | |

Turk Corpus

Together with defining SARI, Xu et al. (2016) released a dataset collected specifically for computing this simplicity metric: 1-to-1 alignments focused on paraphrasing transformations (extracted from PWKP), with multiple (8) simplification references per original sentence (collected through Amazon Mechanical Turk). The dataset contains 2,350 sentences split into 2,000 instances for tuning and 350 for testing. For training, most models use WikiLarge, which was compiled by Zhang and Lapata (2017) using alignments from other Wikipedia-based datasets (Zhu et al., 2010; Woodsend and Lapata, 2011; Kauchak, 2013). WikiLarge contains 296K instances, not all of which are 1-to-1 alignments.

We present the models tested in this dataset ranked by SARI score.

| Model | BLEU | SARI | Paper / Source | Code |
| --- | --- | --- | --- | --- |
| DMASS + DCSS (Zhao et al., 2018) | | 40.45 | Integrating Transformer and Paraphrase Rules for Sentence Simplification | Official |
| SBSMT + PPDB + SARI (Xu et al., 2016) | 73.08* (72.36) | 39.96* (37.91) | Optimizing Statistical Machine Translation for Text Simplification | Official |
| PBMT-R (Wubben et al., 2012) | 81.11* | 38.56* | Sentence Simplification by Monolingual Machine Translation | |
| Pointer + Multi-task Entailment and Paraphrase Generation (Guo et al., 2018) | 81.49 | 37.45 | Dynamic Multi-Level Multi-Task Learning for Sentence Simplification | Official |
| DRESS-LS (Zhang and Lapata, 2017) | 80.12 | 37.27 | Sentence Simplification with Deep Reinforcement Learning | Official |
| NTS + SARI (Nisioi et al., 2017) | 80.69 | 37.25 | Exploring Neural Text Simplification Models | Official |
| DRESS (Zhang and Lapata, 2017) | 77.18 | 37.08 | Sentence Simplification with Deep Reinforcement Learning | Official |
| NSELSTM-S (Vu et al., 2018) | 80.43 | 36.88 | Sentence Simplification with Memory-Augmented Neural Networks | |
| SEMoses (Sulem et al., 2018) | 74.49 | 36.70 | Simple and Effective Text Simplification Using Semantic and Neural Methods | Official |
| NSELSTM-B (Vu et al., 2018) | 92.02 | 33.43 | Sentence Simplification with Memory-Augmented Neural Networks | |
| Hybrid (Narayan and Gardent, 2014) | 48.97* | 31.40* | Hybrid Simplification using Deep Semantics and Machine Translation | Official |

Other Datasets

Hwang et al. (2015) released a dataset of 392K instances, while Kajiwara and Komachi (2016) collected the sscorpus of 493K instances, also from Main - Simple English Wikipedia article pairs. Both datasets contain only 1-to-1 alignments with one simplification reference per original sentence. Despite their bigger sizes and the more sophisticated sentence alignment algorithms used to collect them, these datasets are not commonly used in simplification research.

Newsela

Xu et al. (2015) introduced the Newsela corpus, which contains 1,130 news articles with four simplification versions each. The original article is considered version 0, and each simplification version goes from 1 to 4 (the highest being the simplest). These simplifications were produced manually by professional editors, considering children of different grade levels as target audience. Through manual evaluation on a subset of the data, Xu et al. (2015) showed that there is a better presence and distribution of simplification transformations in Newsela than in PWKP.

The dataset can be requested here. However, researchers are not allowed to publicly share splits of the data. This is not ideal for proper reproducibility and comparison among models.

Splits by Zhang and Lapata (2017)

Xu et al. (2015) generated sentence alignments between all versions of each article in the Newsela corpus. Zhang and Lapata (2017) imply that they used those alignments but removed some sentence pairs that are “too similar”. In the end, they used a dataset composed of 94,208 instances for training, 1,129 instances for development, and 1,076 instances for testing. Their test set, in particular, contains only 1-to-1 alignments with one simplification reference per original sentence.
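To make "all versions of each article" concrete, the sketch below enumerates the (more complex, simpler) version pairs available for a single article with versions 0 to 4. It only lists version pairs; it does not implement any sentence alignment algorithm or the "too similar" filter, whose exact criterion is not stated here.

```python
from itertools import combinations

VERSIONS = range(5)  # 0 = original article, 4 = simplest version


def version_pairs(versions=VERSIONS) -> list[tuple[int, int]]:
    """All (more complex, simpler) version pairs for one article."""
    return list(combinations(versions, 2))


pairs = version_pairs()
print(len(pairs), pairs)
```

Each article therefore yields 10 version pairs, from (0, 1) up to (3, 4), within which sentences can be aligned.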

Using their splits, Zhang and Lapata (2017) trained and tested several models, which we include in our ranking. Other research that claims to have used the same dataset splits is also considered. Despite not being the ideal scenario, the models tested in this dataset are commonly ranked by SARI score.

| Model | BLEU | SARI | Paper / Source | Code |
| --- | --- | --- | --- | --- |
| Pointer + Multi-task Entailment and Paraphrase Generation (Guo et al., 2018) | 11.14 | 33.22 | Dynamic Multi-Level Multi-Task Learning for Sentence Simplification | Official |
| Hybrid (Narayan and Gardent, 2014) | 14.46* | 30.00* | Hybrid Simplification using Deep Semantics and Machine Translation | Official |
| NSELSTM-S (Vu et al., 2018) | 22.62 | 29.58 | Sentence Simplification with Memory-Augmented Neural Networks | |
| NSELSTM-B (Vu et al., 2018) | 26.31 | 27.42 | Sentence Simplification with Memory-Augmented Neural Networks | |
| DRESS (Zhang and Lapata, 2017) | 23.21 | 27.37 | Sentence Simplification with Deep Reinforcement Learning | Official |
| DMASS + DCSS (Zhao et al., 2018) | | 27.28 | Integrating Transformer and Paraphrase Rules for Sentence Simplification | Official |
| DRESS-LS (Zhang and Lapata, 2017) | 24.30 | 26.63 | Sentence Simplification with Deep Reinforcement Learning | Official |
| PBMT-R (Wubben et al., 2012) | 18.19* | 15.77* | Sentence Simplification by Monolingual Machine Translation | |

As mentioned before, a big disadvantage of the Newsela corpus is that a unique train/dev/test split of the data is not publicly available (and possibly cannot be made so). In addition, due to its characteristics, it is not clear what the best way to generate sentence alignments and split the data would be:
