NLP-progress

Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.

Language modeling

Language modeling is the task of predicting the next word or character in a document.

* indicates models using dynamic evaluation, where, at test time, models may adapt to seen tokens in order to improve performance on following tokens (Mikolov et al., 2010; Krause et al., 2017).
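As a rough illustration of the idea, the sketch below scores each test token first and only then takes a gradient step on it, so later tokens benefit from what has already been seen. It is a toy under stated assumptions: the model is a context-free softmax over a hypothetical 10-word vocabulary, and the names `evaluate`, `lr`, and `vocab_size` are made up for this example. The cited papers adapt the parameters of trained neural networks, which this sketch does not attempt to reproduce.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def evaluate(tokens, vocab_size, lr=0.0):
    """Perplexity of a toy context-free softmax model over `tokens`.

    With lr > 0, the logits take one gradient step on each token *after*
    that token has been scored (dynamic evaluation). With lr = 0 the
    model stays fixed (static evaluation).
    """
    logits = [0.0] * vocab_size  # assumption: untrained, uniform model
    total_nll = 0.0
    for t in tokens:
        probs = softmax(logits)
        total_nll += -math.log(probs[t])  # score before adapting
        if lr > 0.0:
            # Gradient of -log p[t] w.r.t. the logits is (probs - one_hot(t)).
            for i in range(vocab_size):
                grad = probs[i] - (1.0 if i == t else 0.0)
                logits[i] -= lr * grad
    return math.exp(total_nll / len(tokens))


tokens = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
static_ppl = evaluate(tokens, vocab_size=10)
dynamic_ppl = evaluate(tokens, vocab_size=10, lr=0.5)
print(static_ppl, dynamic_ppl)
```

On this repetitive sequence the adapted model ends up with a lower perplexity than the static one, because the two recurring tokens gain probability mass as they are observed.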

Word Level Models

Penn Treebank

A common evaluation dataset for language modeling is the Penn Treebank, as pre-processed by Mikolov et al. (2011). The dataset consists of 929k training words, 73k validation words, and 82k test words. As part of the pre-processing, words were lower-cased, numbers were replaced with N, newlines were replaced with <eos>, and all other punctuation was removed. The vocabulary is the most frequent 10k words, with the rest of the tokens replaced by an <unk> token. Models are evaluated based on perplexity, which is the exponential of the average negative per-word log-probability (lower is better).
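For reference, perplexity over N test words is usually computed as exp(-(1/N) Σ log p(w_i)). The snippet below is a minimal sketch of that calculation. The function name `perplexity` and the uniform example are illustrative assumptions, and the log-probabilities are assumed to be natural logs.

```python
import math


def perplexity(log_probs):
    """Perplexity from the natural-log probabilities a model assigned
    to each word of the evaluation text."""
    if not log_probs:
        raise ValueError("need at least one token")
    avg_nll = -sum(log_probs) / len(log_probs)
    return math.exp(avg_nll)


# A model that spreads probability uniformly over a 10k-word vocabulary
# assigns log(1/10000) to every word.
uniform = [math.log(1 / 10000)] * 5
print(perplexity(uniform))
```

A uniform model over a 10k-word vocabulary therefore has a perplexity of about 10,000, which gives a sense of scale for the numbers in the tables.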

| Model | Validation perplexity | Test perplexity | Number of params | Paper / Source | Code |
| --- | --- | --- | --- | --- | --- |
| FRAGE + AWD-LSTM-MoS + dynamic eval (Gong et al., 2018) | 47.38 | 46.54 | 22M | FRAGE: Frequency-Agnostic Word Representation | Official |
| AWD-LSTM-DOC x5 (Takase et al., 2018) | 48.63 | 47.17 | 185M | Direct Output Connection for a High-Rank Language Model | Official |
| AWD-LSTM-MoS + dynamic eval (Yang et al., 2018)* | 48.33 | 47.69 | 22M | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model | Official |
| AWD-LSTM + dynamic eval (Krause et al., 2017)* | 51.6 | 51.1 | 24M | Dynamic Evaluation of Neural Sequence Models | Official |
| AWD-LSTM-DOC + Partial Shuffle (Press, 2019) preprint | 53.79 | 52.00 | 23M | Partially Shuffling the Training Data to Improve Language Models | Official |
| AWD-LSTM-DOC (Takase et al., 2018) | 54.12 | 52.38 | 23M | Direct Output Connection for a High-Rank Language Model | Official |
| AWD-LSTM + continuous cache pointer (Merity et al., 2017)* | 53.9 | 52.8 | 24M | Regularizing and Optimizing LSTM Language Models | Official |
| Trellis Network (Bai et al., 2019) | - | 54.19 | 34M | Trellis Networks for Sequence Modeling | Official |
| AWD-LSTM-MoS + finetune (Yang et al., 2018) | 56.54 | 54.44 | 22M | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model | Official |
| Transformer-XL (Dai et al., 2018) under review | 56.72 | 54.52 | 24M | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Official |
| AWD-LSTM-MoS (Yang et al., 2018) | 58.08 | 55.97 | 22M | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model | Official |
| AWD-LSTM 3-layer with Fraternal dropout (Zołna et al., 2018) | 58.9 | 56.8 | 24M | Fraternal dropout | Official |
| AWD-LSTM (Merity et al., 2017) | 60.0 | 57.3 | 24M | Regularizing and Optimizing LSTM Language Models | Official |

WikiText-2

WikiText-2 has been proposed as a more realistic benchmark for language modeling than the pre-processed Penn Treebank. WikiText-2 consists of around 2 million words extracted from Wikipedia articles.

| Model | Validation perplexity | Test perplexity | Number of params | Paper / Source | Code |
| --- | --- | --- | --- | --- | --- |
| FRAGE + AWD-LSTM-MoS + dynamic eval (Gong et al., 2018) | 40.85 | 39.14 | 35M | FRAGE: Frequency-Agnostic Word Representation | Official |
| AWD-LSTM-MoS + dynamic eval (Yang et al., 2018)* | 42.41 | 40.68 | 35M | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model | Official |
| AWD-LSTM + dynamic eval (Krause et al., 2017)* | 46.4 | 44.3 | 33M | Dynamic Evaluation of Neural Sequence Models | Official |
| AWD-LSTM + continuous cache pointer (Merity et al., 2017)* | 53.8 | 52.0 | 33M | Regularizing and Optimizing LSTM Language Models | Official |
| AWD-LSTM-DOC x5 (Takase et al., 2018) | 54.19 | 53.09 | 185M | Direct Output Connection for a High-Rank Language Model | Official |
| AWD-LSTM-DOC + Partial Shuffle (Press, 2019) preprint | 60.16 | 57.85 | 37M | Partially Shuffling the Training Data to Improve Language Models | Official |
| AWD-LSTM-DOC (Takase et al., 2018) | 60.29 | 58.03 | 37M | Direct Output Connection for a High-Rank Language Model | Official |
| AWD-LSTM-MoS (Yang et al., 2018) | 63.88 | 61.45 | 35M | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model | Official |
| AWD-LSTM 3-layer with Fraternal dropout (Zołna et al., 2018) | 66.8 | 64.1 | 34M | Fraternal dropout | Official |
| AWD-LSTM (Merity et al., 2017) | 68.6 | 65.8 | 33M | Regularizing and Optimizing LSTM Language Models | Official |

WikiText-103

The WikiText-103 corpus contains 267,735 unique words, and each word occurs at least three times in the training set.

| Model | Validation perplexity | Test perplexity | Number of params | Paper / Source | Code |
| --- | --- | --- | --- | --- | --- |
| Transformer-XL + RMS dynamic eval (Krause et al., 2019)* arxiv preprint | 15.8 | 16.4 | 257M | Dynamic Evaluation of Transformer Language Models | Official |
| Transformer-XL Large (Dai et al., 2018) under review | 17.7 | 18.3 | 257M | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Official |
| Transformer with tied adaptive embeddings (Baevski and Auli, 2018) | 19.8 | 20.5 | 247M | Adaptive Input Representations for Neural Language Modeling | Link |
| Transformer-XL Standard (Dai et al., 2018) under review | 23.1 | 24.0 | 151M | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Official |
| LSTM + Hebbian + Cache + MbPA (Rae et al., 2018) | 29.0 | 29.2 | | Fast Parametric Learning with Activation Memorization | |
| Trellis Network (Bai et al., 2019) | - | 30.35 | 180M | Trellis Networks for Sequence Modeling | Official |
| LSTM + Hebbian (Rae et al., 2018) | 34.1 | 34.3 | | Fast Parametric Learning with Activation Memorization | |
| LSTM (Rae et al., 2018) | 36.0 | 36.4 | | Fast Parametric Learning with Activation Memorization | |
| Gated CNN (Dauphin et al., 2016) | - | 37.2 | | Language modeling with gated convolutional networks | |
| Neural cache model (size = 2,000) (Grave et al., 2017) | - | 40.8 | | Improving Neural Language Models with a Continuous Cache | Link |
| Temporal CNN (Bai et al., 2018) | - | 45.2 | | Convolutional sequence modeling revisited | |
| LSTM (Grave et al., 2017) | - | 48.7 | | Improving Neural Language Models with a Continuous Cache | Link |

1B Words / Google Billion Word benchmark

The One-Billion Word benchmark is a large dataset derived from a news-commentary site. The dataset consists of 829,250,940 tokens over a vocabulary of 793,471 words. Importantly, sentences in this dataset are shuffled, so the context available to a model is limited to a single sentence.

| Model | Test perplexity | Number of params | Paper / Source | Code |
| --- | --- | --- | --- | --- |
| Transformer-XL Large (Dai et al., 2018) under review | 21.8 | 0.8B | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Official |
| Transformer-XL Base (Dai et al., 2018) under review | 23.5 | 0.46B | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Official |
| Transformer with shared adaptive embeddings - Very large (Baevski and Auli, 2018) | 23.7 | 0.8B | Adaptive Input Representations for Neural Language Modeling | Link |
| 10 LSTM+CNN inputs + SNM10-SKIP (Jozefowicz et al., 2016) ensemble | 23.7 | 43B? | Exploring the Limits of Language Modeling | Official |
| Transformer with shared adaptive embeddings (Baevski and Auli, 2018) | 24.1 | 0.46B | Adaptive Input Representations for Neural Language Modeling | Link |
| Big LSTM+CNN inputs (Jozefowicz et al., 2016) | 30.0 | 1.04B | Exploring the Limits of Language Modeling | |
| Gated CNN-14Bottleneck (Dauphin et al., 2017) | 31.9 | ? | Language Modeling with Gated Convolutional Networks | |
| BIGLSTM baseline (Kuchaiev and Ginsburg, 2018) | 35.1 | 0.151B | Factorization tricks for LSTM networks | Official |
| BIG F-LSTM F512 (Kuchaiev and Ginsburg, 2018) | 36.3 | 0.052B | Factorization tricks for LSTM networks | Official |
| BIG G-LSTM G-8 (Kuchaiev and Ginsburg, 2018) | 39.4 | 0.035B | Factorization tricks for LSTM networks | Official |

Character Level Models

Hutter Prize

The Hutter Prize Wikipedia dataset, also known as enwik8, is a byte-level dataset consisting of the first 100 million bytes of a Wikipedia XML dump. For simplicity we shall refer to it as a character-level dataset. Within these 100 million bytes are 205 unique tokens.
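The character-level tables report bits per character (BPC), which is the average negative log2-probability per character. Frameworks typically report loss in nats, so a conversion is needed. The sketch below shows that conversion along with a simple way to count distinct byte values in a byte string. Both helper names are hypothetical and the inputs are toy values, not the actual dataset.

```python
import math


def bits_per_character(total_nll_nats, num_chars):
    """Convert a summed negative log-likelihood in nats into
    average bits per character."""
    if num_chars <= 0:
        raise ValueError("num_chars must be positive")
    return total_nll_nats / (num_chars * math.log(2))


def count_unique_bytes(data: bytes) -> int:
    """Number of distinct byte values occurring in `data`."""
    return len(set(data))


# A model that is uniform over all 256 byte values pays log(256) nats,
# i.e. 8 bits, for every character.
uniform_bpc = bits_per_character(1000 * math.log(256), 1000)
print(uniform_bpc)
print(count_unique_bytes(b"hello"))
```

A uniform model over raw bytes costs 8 BPC, so the values near 1.0 in the table correspond to roughly an eightfold compression relative to that baseline.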

| Model | Bits per Character (BPC) | Number of params | Paper / Source | Code |
| --- | --- | --- | --- | --- |
| Transformer-XL + RMS dynamic eval (Krause et al., 2019)* arxiv preprint | 0.94 | 277M | Dynamic Evaluation of Transformer Language Models | Official |
| 24-layer Transformer-XL (Dai et al., 2018) under review | 0.99 | 277M | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Official |
| 18-layer Transformer-XL (Dai et al., 2018) under review | 1.03 | 88M | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Official |
| 12-layer Transformer-XL (Dai et al., 2018) under review | 1.06 | 41M | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Official |
| 64-layer Character Transformer Model (Al-Rfou et al., 2018) | 1.06 | 235M | Character-Level Language Modeling with Deeper Self-Attention | |
| mLSTM + dynamic eval (Krause et al., 2017)* | 1.08 | 46M | Dynamic Evaluation of Neural Sequence Models | Official |
| 12-layer Character Transformer Model (Al-Rfou et al., 2018) | 1.11 | 44M | Character-Level Language Modeling with Deeper Self-Attention | |
| 3-layer AWD-LSTM (Merity et al., 2018) | 1.232 | 47M | An Analysis of Neural Language Modeling at Multiple Scales | Official |
| Large mLSTM +emb +WN +VD (Krause et al., 2017) | 1.24 | 46M | Multiplicative LSTM for sequence modelling | Official |
| Large FS-LSTM-4 (Mujika et al., 2017) | 1.245 | 47M | Fast-Slow Recurrent Neural Networks | Official |
| Large RHN (Zilly et al., 2016) | 1.27 | 46M | Recurrent Highway Networks | Official |
| FS-LSTM-4 (Mujika et al., 2017) | 1.277 | 27M | Fast-Slow Recurrent Neural Networks | Official |

Text8

The text8 dataset is also derived from Wikipedia text, but has all XML removed and is lowercased so that it contains only the 26 letters of the English alphabet plus spaces.
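To make the resulting 27-symbol alphabet concrete, here is a small sketch of this kind of normalization. It is not the script used to build text8, and the function name `normalize_text8_style` is made up for illustration. The sketch simply lowercases the input and collapses every run of characters outside a-z into a single space.

```python
import re


def normalize_text8_style(text: str) -> str:
    """Lowercase `text` and replace every run of characters outside
    a-z with a single space, then trim the ends."""
    lowered = text.lower()
    letters_only = re.sub(r"[^a-z]+", " ", lowered)
    return letters_only.strip()


sample = normalize_text8_style("Hello, World! 42")
print(sample)
```

Every character in the output belongs to the set of 26 lowercase letters plus the space, so a character-level model over such text needs only a 27-way output layer.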

| Model | Bits per Character (BPC) | Number of params | Paper / Source | Code |
| --- | --- | --- | --- | --- |
| Transformer-XL + RMS dynamic eval (Krause et al., 2019)* arxiv preprint | 1.038 | 277M | Dynamic Evaluation of Transformer Language Models | Official |
| Transformer-XL Large (Dai et al., 2018) under review | 1.08 | 277M | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Official |
| 64-layer Character Transformer Model (Al-Rfou et al., 2018) | 1.13 | 235M | Character-Level Language Modeling with Deeper Self-Attention | |
| 12-layer Character Transformer Model (Al-Rfou et al., 2018) | 1.18 | 44M | Character-Level Language Modeling with Deeper Self-Attention | |
| mLSTM + dynamic eval (Krause et al., 2017)* | 1.19 | 45M | Dynamic Evaluation of Neural Sequence Models | Official |
| Large mLSTM +emb +WN +VD (Krause et al., 2016) | 1.27 | 45M | Multiplicative LSTM for sequence modelling | Official |
| Large RHN (Zilly et al., 2016) | 1.27 | 46M | Recurrent Highway Networks | Official |
| LayerNorm HM-LSTM (Chung et al., 2017) | 1.29 | 35M | Hierarchical Multiscale Recurrent Neural Networks | |
| BN LSTM (Cooijmans et al., 2016) | 1.36 | 16M | Recurrent Batch Normalization | Official |
| Unregularised mLSTM (Krause et al., 2016) | 1.40 | 45M | Multiplicative LSTM for sequence modelling | Official |

Penn Treebank

The vocabulary of the words in the character-level dataset is limited to 10,000, the same vocabulary as used in the word-level dataset. This vastly simplifies the task of character-level language modeling, as character transitions will be limited to those found within the limited word-level vocabulary.
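The effect of a word-level vocabulary cap on a character stream can be sketched as follows. This is only an illustration under assumptions: the helper names, the tiny corpus, the cap of two words, and the choice to spell out `<unk>` character by character with a plain space as word separator are all hypothetical and not taken from the dataset's actual files.

```python
from collections import Counter


def cap_vocabulary(words, max_size, unk="<unk>"):
    """Keep the `max_size` most frequent words and map the rest to `unk`."""
    counts = Counter(words)
    keep = {w for w, _ in counts.most_common(max_size)}
    return [w if w in keep else unk for w in words]


def to_characters(words):
    """Flatten a word sequence into a list of characters,
    using a single space between words."""
    return list(" ".join(words))


corpus = "the cat sat on the mat the cat".split()
capped = cap_vocabulary(corpus, max_size=2)
chars = to_characters(capped)
print("".join(chars))
```

Because rare words are replaced before the text is split into characters, the character model only ever sees spellings of in-vocabulary words and the unknown-word marker, which is why this benchmark is easier than unrestricted character-level modeling.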

| Model | Bits per Character (BPC) | Number of params | Paper / Source | Code |
| --- | --- | --- | --- | --- |
| Trellis Network (Bai et al., 2019) | 1.159 | 13.4M | Trellis Networks for Sequence Modeling | Official |
| 3-layer AWD-LSTM (Merity et al., 2018) | 1.175 | 13.8M | An Analysis of Neural Language Modeling at Multiple Scales | Official |
| 6-layer QRNN (Merity et al., 2018) | 1.187 | 13.8M | An Analysis of Neural Language Modeling at Multiple Scales | Official |
| FS-LSTM-4 (Mujika et al., 2017) | 1.190 | 27M | Fast-Slow Recurrent Neural Networks | Official |
| FS-LSTM-2 (Mujika et al., 2017) | 1.193 | 27M | Fast-Slow Recurrent Neural Networks | Official |
| NASCell (Zoph & Le, 2016) | 1.214 | 16.3M | Neural Architecture Search with Reinforcement Learning | |
| 2-layer Norm HyperLSTM (Ha et al., 2016) | 1.219 | 14.4M | HyperNetworks | |
