Simplification
Simplification consists of modifying the content and structure of a text in order to make it easier to read and understand, while preserving its main idea and approximating its original meaning. A simplified version of a text could benefit low-literacy readers, English learners, children, and people with aphasia, dyslexia or autism. In addition, simplifying a text automatically could improve performance on other NLP tasks, such as parsing, summarisation, information extraction, semantic role labeling, and machine translation.
Sentence Simplification
Research on automatic simplification has traditionally been limited to executing transformations at the sentence level. What should we expect from a sentence simplification model? Let’s take a look at how humans simplify (from here):
| Original Sentence | Simplified Sentence |
| --- | --- |
| Owls are the order Strigiformes, comprising 200 bird of prey species. | An owl is a bird. There are about 200 kinds of owls. |
| Owls hunt mostly small mammals, insects, and other birds though some species specialize in hunting fish. | Owls’ prey may be birds, large insects (such as crickets), small reptiles (such as lizards) or small mammals (such as mice, rats, and rabbits). |
Notice the simplification transformations performed:
- Unusual concepts are explained: insects (such as crickets), small reptiles (such as lizards) or small mammals (such as mice, rats, and rabbits).
- Uncommon words are replaced with a more familiar term or phrase: “comprising” → “There are about”.
- Complex syntactic structures are replaced with simpler patterns. For example, the first sentence is split into two.
- Some unimportant information is removed: the clause “though some species specialize in hunting fish” in the second sentence does not appear in its simplified version.
When the set of transformations is limited to replacing a word or phrase by a simpler synonym, we are dealing with Lexical Simplification (an overview of that area can be found here). In this section, we consider research that attempts to develop models that learn as many text transformations as possible.
Evaluation
The ideal method for determining the quality of a simplification is human evaluation. Traditionally, a simplified output is judged in terms of grammaticality (or fluency), meaning preservation (or adequacy), and simplicity, using Likert scales (1-3 or 1-5). Warning: Are these criteria (at the sentence level) the most appropriate for assessing a simplified sentence? It has been suggested (Siddharthan, 2014) that a task-oriented evaluation (e.g. through reading comprehension tests (Angrosh et al., 2014)) could be more informative of the usefulness of the generated simplification. However, this is not general practice.
For tuning and comparing models, the most commonly-used automatic metrics are:
- BLEU (Papineni et al., 2002), borrowed from Machine Translation. The metric is not without problems for text generation tasks in general. However, simplification studies (Stajner et al., 2014; Wubben et al., 2012; Xu et al., 2016) have shown that it correlates with human judgments of grammaticality and meaning preservation. BLEU is not well suited, though, for assessing simplicity from either a lexical (Xu et al., 2016) or a structural (Sulem et al., 2018b) point of view.
- SARI (Xu et al., 2016) is a lexical simplicity metric that measures how good the words added, deleted and kept by a simplification model are (a toy sketch of this add/keep/delete scoring follows this list). The metric compares the model’s output to multiple simplification references and the original sentence. SARI has shown high correlation with human judgements of simplicity gain (Xu et al., 2016). Currently, this is the main metric used for evaluating sentence simplification models.
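To make the add/keep/delete intuition behind SARI concrete, here is a minimal, unigram-only sketch in Python. This is an illustration only, not the official metric: real SARI operates on n-grams up to n = 4 and weights counts by how many references agree; use an established implementation (e.g. EASSE, below) to report results.

```python
# Toy, unigram-only sketch of SARI's add/keep/delete scoring (illustration
# only; the real metric uses n-grams up to n=4 and reference-weighted counts).

def sari_sketch(orig: str, sys_out: str, refs: list) -> float:
    o = set(orig.lower().split())
    s = set(sys_out.lower().split())
    r = set(w for ref in refs for w in ref.lower().split())

    def f1(p, rec):
        return 2 * p * rec / (p + rec) if p + rec else 0.0

    def ratio(num, den):
        return len(num) / len(den) if den else 0.0

    # ADD: words the system introduced that the references also introduced.
    add = f1(ratio((s - o) & (r - o), s - o), ratio((s - o) & (r - o), r - o))
    # KEEP: original words retained by both the system and the references.
    keep = f1(ratio(s & o & r, s & o), ratio(s & o & r, r & o))
    # DELETE: original words dropped by both (precision only, as in SARI).
    delete = ratio((o - s) & (o - r), o - s)

    return (add + keep + delete) / 3

orig = "Owls are the order Strigiformes, comprising 200 bird of prey species."
simp = "An owl is a bird. There are about 200 kinds of owls."
print(f"{sari_sketch(orig, simp, [simp]):.2f}")  # 1.00: output equals the reference
```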
BLEU and SARI will be used to rank the models in the following sections. Despite popular practice, we refrain from using Flesch Reading Ease or Flesch-Kincaid Grade Level: because of the way these metrics are computed, short sentences can get good scores even if they are ungrammatical or do not preserve meaning (Wubben et al., 2012), resulting in a misleading ranking. The sketch below illustrates this failure mode.
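The Flesch-Kincaid Grade Level formula, 0.39 * (words per sentence) + 11.8 * (syllables per word) - 15.59, looks only at sentence and word length. In this sketch (which uses a deliberately crude vowel-count syllable heuristic), a short, ungrammatical, non-meaning-preserving string scores as far "easier" than the grammatical original:

```python
# Flesch-Kincaid Grade Level rewards short sentences and short words, no
# matter how ungrammatical; the syllable counter below is a crude heuristic
# (one count per vowel letter) used purely for illustration.

def fkgl(text: str, n_sentences: int) -> float:
    words = text.split()
    syllables = sum(max(1, sum(c in "aeiouy" for c in w.lower())) for w in words)
    return 0.39 * (len(words) / n_sentences) + 11.8 * (syllables / len(words)) - 15.59

print(fkgl("Owls are the order Strigiformes, comprising 200 bird of prey species.", 1))
# ~11.2: roughly a high-school grade level for the original sentence
print(fkgl("Owl bird mice eat.", 1))
# ~3.7: ungrammatical and non-meaning-preserving, yet scored as much "simpler"
```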
A simplification could also involve text transformations beyond the paraphrasing that SARI intends to assess. For these cases, it could be more suitable to use SAMSA (Sulem et al., 2018a), a metric designed to measure structural simplicity (i.e. sentence splitting). However, it has not (yet) been used in papers besides the one where it was introduced.
EASSE: Alva-Manchego et al. (2019) released a tool that provides easy access to all of the above metrics (and several others) through the command line and as a Python package; a usage sketch follows. EASSE also contains commonly-used test sets for the task. Its aim is to help standardise automatic evaluation for sentence simplification.
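For instance, computing corpus-level SARI could look like the sketch below. The function name and argument shapes are our recollection of the EASSE API and may differ between versions; check the package documentation.

```python
# Hedged sketch of scoring a system output with EASSE's Python API.
# corpus_sari and its keyword arguments are assumed from the EASSE docs
# and may differ across versions.
from easse.sari import corpus_sari

orig_sents = ["Owls are the order Strigiformes, comprising 200 bird of prey species."]
sys_sents = ["An owl is a bird. There are about 200 kinds of owls."]
# Assumed shape: one inner list per reference set, each with one entry per sentence.
refs_sents = [["An owl is a bird. There are about 200 kinds of owls."]]

print(corpus_sari(orig_sents=orig_sents, sys_sents=sys_sents, refs_sents=refs_sents))
```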
IMPORTANT NOTE: In the tables of the following sections, a score marked with a * was not reported by the original authors but by later research that re-implemented and/or re-trained and re-tested the model. In these cases, the original reported score (if there is one) is shown in parentheses.
Main - Simple English Wikipedia
Simple English Wikipedia is an online encyclopedia aimed at English learners. Its articles are expected to contain fewer words and simpler grammatical structures than their Main English Wikipedia counterparts. Much of the popularity of using Wikipedia for research in simplification comes from publicly available sentence alignments between “equivalent” articles in Main and Simple English Wikipedia.
PWKP / WikiSmall
Zhu et al. (2010) compiled a parallel corpus with more than 108K sentence pairs from 65,133 Wikipedia articles, allowing 1-to-1 and 1-to-N alignments. The latter type of alignments represents instances of sentence splitting. The original full corpus can be found here. The test set is composed of 100 instances, with one simplification reference per original sentence. Zhang and Lapata (2017) released a more standardised split of this dataset called WikiSmall, with 89,042 instances for training, 205 for development and the same original 100 instances for testing.
We present the models tested on this dataset ranked by BLEU score (or SARI when BLEU is not available). SARI cannot be reliably computed on this dataset, since it does not contain multiple simplification references per original sentence. In addition, there are instances of more advanced simplification transformations (e.g. splitting), which SARI by definition does not assess.
Coster and Kauchak (2011)
Coster and Kauchak (2011) automatically aligned 137K sentence pairs from 10K Wikipedia articles, considering 1-to-1 and 1-to-N alignments, with one simplification reference per original sentence. The corpus was split into 124K instances for training, 12K for development, and 1.3K for testing. The dataset is available here. As before, models tested on this dataset are ranked by BLEU score and not SARI.
| Model | BLEU | SARI | Paper / Source | Code |
| --- | --- | --- | --- | --- |
| Moses-Del (Coster and Kauchak, 2011b) | 60.46 | | Learning to Simplify Sentences Using Wikipedia | |
| Moses (Coster and Kauchak, 2011a) | 59.87 | | Simple English Wikipedia: A New Text Simplification Task | |
| SimpleTT (Feblowitz and Kauchak, 2013) | 56.4 | | Sentence Simplification as Tree Transduction | |
| PBMT-R (Wubben et al., 2012) | 54.3* | | Sentence Simplification by Monolingual Machine Translation | |
TurkCorpus
Together with defining SARI, Xu et al. (2016) released TurkCorpus, a dataset collected specifically so that this simplicity metric can be computed: 1-to-1 alignments focused on paraphrasing transformations (extracted from PWKP), with multiple (8) simplification references per original sentence (collected through Amazon Mechanical Turk). The dataset contains 2,350 sentences, split into 2,000 instances for tuning and 350 for testing. For training, most models use WikiLarge, which was compiled by Zhang and Lapata (2017) using alignments from other Wikipedia-based datasets (Zhu et al., 2010; Woodsend and Lapata, 2011; Kauchak, 2013), and contains 296K instances, not all of which are 1-to-1 alignments.
We present the models tested in this dataset ranked by SARI score.
ASSET
Alva-Manchego et al. (2020) released a dataset aligned with TurkCorpus: it contains the same set of original sentences, but with manual references in which multiple simplification operations could have been applied, namely lexical paraphrasing, compression and/or sentence splitting. The authors showed that human judges found this type of simplification simpler than those in TurkCorpus. Due to its multi-operation nature, ASSET contains 1-to-1 and 1-to-N alignments, with 10 simplification references per original sentence (collected through Amazon Mechanical Turk). Like TurkCorpus, ASSET contains 2,350 sentences, split into 2,000 instances for tuning and 350 for testing.
We present the models tested in this dataset ranked by SARI score.
| Model | BLEU | SARI | Paper / Source | Code |
| --- | --- | --- | --- | --- |
| MUSS (Martin et al., 2020) | 72.98 | 44.15 | Multilingual Unsupervised Sentence Simplification | |
| TST (Omelianchuk et al., 2021) | | 43.21 | Text Simplification by Tagging | |
| Trans-SS (Lu et al., 2021) | 71.83 | 42.69 | An Unsupervised Method for Building Sentence Simplification Corpora in Multiple Languages | Official |
| ACCESS (Martin et al., 2019) | 75.99* | 40.13* | Controllable Sentence Simplification | Official |
| DMASS + DCSS (Zhao et al., 2018) | 71.44* | 38.67* | Integrating Transformer and Paraphrase Rules for Sentence Simplification | Official |
| DRESS-LS (Zhang and Lapata, 2017) | 86.39* | 36.59* | Sentence Simplification with Deep Reinforcement Learning | Official |
| UnsupNTS (Surya et al., 2019) | 76.14* | 35.19* | Unsupervised Neural Text Simplification | Official |
| PBMT-R (Wubben et al., 2012) | 79.39* | 34.63* | Sentence Simplification by Monolingual Machine Translation | |
Other Datasets
Hwang et al. (2015) released a dataset of 392K instances, while Kajiwara and Komachi (2016) collected the sscorpus of 493K instances, also from Main - Simple English Wikipedia article pairs. Both datasets contain only 1-to-1 alignments, with one simplification reference per original sentence. Despite their larger sizes and the more sophisticated sentence alignment algorithms used to collect them, these datasets are not commonly used in simplification research.
Newsela
Xu et al. (2015) introduced the Newsela corpus, which contains 1,130 news articles, each with four simplified versions. The original article is version 0, and the simplified versions go from 1 to 4 (the highest being the simplest). These simplifications were produced manually by professional editors, with children of different grade levels as the target audience. Through manual evaluation of a subset of the data, Xu et al. (2015) showed that simplification transformations have a better presence and distribution in Newsela than in PWKP.
The dataset can be requested here. However, researchers are not allowed to publicly share splits of the data, which is not ideal for reproducibility and for comparisons among models.
Splits by Zhang and Lapata (2017)
Xu et al. (2015) generated sentence alignments between all versions of each article in the Newsela corpus. Zhang and Lapata (2017) imply that they used those alignments but removed some sentence pairs that are “too similar”. In the end, they used a dataset composed of 94,208 instances for training, 1,129 instances for development, and 1,076 instances for testing. Their test set, in particular, contains only 1-to-1 alignments with one simplification reference per original sentence.
Using their splits, Zhang and Lapata (2017) trained and tested several models, which we include in our ranking. Other research that claims to have used the same dataset splits is also considered. Despite not being the ideal scenario, the models tested in this dataset are commonly ranked by SARI score.
As mentioned before, a big disadvantage of the Newsela corpus is that a unique train/dev/test split of the data is not publicly available (and perhaps cannot be made so). In addition, due to its characteristics, it is not clear what the best way is to generate sentence alignments and split the data:
- Zhang and Lapata (2017) removed sentences from version pairs 0–1, 1–2, and 2–3 because they are “too similar to each other”. This could prevent a model from learning when a sentence should not be simplified. In addition, their test set only considers 1-to-1 sentence alignments, even though it is possible to generate 1-to-N and N-to-1 sentence pairs, as shown by other researchers (Scarton et al., 2018; Stajner et al., 2018).
- Alva-Manchego et al. (2017), Scarton et al. (2018), and Stajner and Nisioi (2018) generate sentence alignments (using different algorithms) only between adjacent article versions (i.e. 0-1, 1-2, 2-3, and 3-4), while Scarton and Specia (2018) generate alignments between all versions (i.e. 0-{1,2,3,4}, 1-{2,3,4}, 2-{3,4}, and 3-4). The assumption behind using only adjacent versions is that, to write an article’s simplification, an editor takes the immediately previous simplified version as a basis (i.e. 0→1, 1→2, etc.). However, since the simplification manual followed by the Newsela editors is not public, it is not possible to corroborate this hypothesis. A toy similarity-based aligner is sketched after this list for illustration.
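For intuition about what such alignment involves, here is a minimal Python sketch of greedy similarity-based sentence alignment between two article versions. It is an illustration only: the works cited above use considerably more sophisticated aligners and similarity measures, and the Jaccard threshold here is an arbitrary choice.

```python
# Toy greedy sentence aligner between a complex and a simple article version.
# Real studies use more sophisticated algorithms and similarity measures;
# the Jaccard similarity and 0.3 threshold here are arbitrary choices.

def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def align(complex_sents, simple_sents, threshold=0.3):
    """Pair each simple sentence with its most similar complex sentence."""
    pairs = []
    for simp in simple_sents:
        best = max(complex_sents, key=lambda comp: jaccard(comp, simp))
        if jaccard(best, simp) >= threshold:  # drop unrelated sentences
            pairs.append((best, simp))
    return pairs

v0 = ["Owls hunt mostly small mammals, insects, and other birds though some species specialize in hunting fish."]
v1 = ["Owls hunt small mammals, insects, and other birds"]
print(align(v0, v1))  # one 1-to-1 alignment between the two versions
```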