
NLP-progress

Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.

Lexical Normalization

Lexical normalization is the task of translating/transforming non-standard text into a standard register.

Example:

new pix comming tomoroe
new pictures coming tomorrow

Datasets usually consist of tweets, since these naturally contain a fair amount of such phenomena.

For lexical normalization, only replacements on the word level are annotated. Some corpora include annotation for 1-N and N-1 replacements. However, word insertion/deletion and reordering are not part of the task.

LexNorm

The LexNorm corpus was originally introduced by Han and Baldwin (2011). Several annotation mistakes were later resolved by Yang and Eisenstein (2013); on this page, we only report results on the revised dataset. For this dataset, the 2,577 tweets from Li and Liu (2014) are often used as training data, because of their similar annotation style.

This dataset is commonly evaluated with accuracy on the non-standard words, meaning that the system knows in advance which words are in need of normalization.
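The metric above can be sketched as follows. This is a minimal illustration, not the official evaluation script; it assumes position-aligned token lists, where the non-standard words are simply the positions at which the raw and gold tokens differ.

```python
# Minimal sketch (not the official scorer): word-level accuracy restricted
# to the tokens annotated as non-standard.
def accuracy_on_nonstandard(raw, gold, predicted):
    """raw, gold and predicted are position-aligned token lists; only
    positions where raw differs from gold (words needing normalization)
    count towards accuracy."""
    positions = [i for i in range(len(raw)) if raw[i] != gold[i]]
    correct = sum(1 for i in positions if predicted[i] == gold[i])
    return correct / len(positions)

raw = "new pix comming tomoroe".split()
gold = "new pictures coming tomorrow".split()
pred = "new pictures coming tomorow".split()  # wrong form for "tomoroe"
print(accuracy_on_nonstandard(raw, gold, pred))  # 2 of 3 non-standard words correct
```

Standard words the system leaves untouched (or changes) do not affect this score, which is why knowing the to-be-normalized positions in advance makes the task easier.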

| Model | Accuracy | Paper / Source |
| ----- | -------- | -------------- |
| MoNoise by Rob van der Goot and Gertjan van Noord (2017) | 87.63 | MoNoise: Modeling Noise Using a Modular Normalization System |
| Joint POS + Norm in a Viterbi decoding by Chen Li and Yang Liu (2015) | 87.58* | Joint POS Tagging and Text Normalization for Informal Text |
| Syllable based by Ke Xu, Yunqing Xia and Chin-Hui Lee (2015) | 86.08 | Tweet Normalization with Syllables |
| unLOL by Yi Yang and Jacob Eisenstein (2013) | 82.06 | A Log-Linear Model for Unsupervised Text Normalization |

* used a slightly different version of the data

LexNorm2015

The LexNorm2015 dataset was introduced for the shared task on lexical normalization, hosted at WNUT2015 (Baldwin et al. (2015)). In this dataset, 1-N and N-1 replacements are included in the annotation. The evaluation metrics used are precision, recall and F1 score. However, these are calculated somewhat unusually:

Precision: out of all normalizations proposed by the system, how many are correct

Recall: out of all necessary replacements, how many the system found correctly

This means that if the system replaces a word which is in need of normalization, but chooses the wrong normalization, it is penalized twice: once in precision and once in recall.
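The double penalty can be made concrete with a small sketch. This is an assumed reconstruction of the metric from the definitions above, not the official shared-task scorer, and it assumes position-aligned token lists (so 1-N/N-1 replacements are treated as single aligned slots).

```python
# Sketch of WNUT2015-style normalization metrics (assumed reconstruction,
# not the official scorer). All token lists are position-aligned.
def norm_prf(raw, gold, predicted):
    needed = {i for i, (r, g) in enumerate(zip(raw, gold)) if r != g}
    proposed = {i for i, (r, p) in enumerate(zip(raw, predicted)) if r != p}
    correct = {i for i in proposed if predicted[i] == gold[i]}
    precision = len(correct) / len(proposed) if proposed else 1.0
    recall = len(correct) / len(needed) if needed else 1.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

raw = "new pix comming tomoroe".split()
gold = "new pictures coming tomorrow".split()
pred = "new pictures coming tomorow".split()  # wrong form for "tomoroe"
p, r, f1 = norm_prf(raw, gold, pred)
# The wrong normalization of "tomoroe" is an incorrect proposal (lowering
# precision) AND a missed necessary replacement (lowering recall).
```

A system that left "tomoroe" untouched would only be penalized in recall; replacing it with the wrong form hurts both terms.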

| Model | F1 | Paper / Source |
| ----- | -- | -------------- |
| MoNoise by Rob van der Goot and Gertjan van Noord (2017) | 86.39 | MoNoise: Modeling Noise Using a Modular Normalization System |
| Random Forest + novel similarity metric by Ning Jin (2017) | 84.21 | NCSU-SAS-Ning: Candidate Generation and Feature Engineering for Supervised Lexical Normalization |

Go back to the README