Language modeling
Language modeling is the task of predicting the next word or character in a document.
* indicates models using dynamic evaluation, in which the model adapts at test time to the tokens it has already seen in order to improve its predictions for subsequent tokens (Mikolov et al., 2010; Krause et al., 2017).
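Conceptually, basic dynamic evaluation scores each test segment with the current weights and then takes a gradient step on that segment before moving on, so no prediction ever sees its own target. The following is a minimal PyTorch-style sketch of that idea; the generic `model`, the segment length, and the plain SGD update are illustrative assumptions (Krause et al. (2017) actually use an RMS-normalised update with decay toward the original parameters), not any paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def dynamic_eval_nll(model, token_ids, seg_len=128, lr=1e-4):
    """Minimal sketch of basic dynamic evaluation.

    Each segment is scored first and only then used to update the model.
    `model` is assumed to map a 1-D LongTensor of input ids to logits of
    shape (len(inputs), vocab_size); the RMS-style update and the decay
    toward the original weights from Krause et al. (2017) are omitted.
    """
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    total_nll, total_tokens = 0.0, 0
    for start in range(0, token_ids.size(0) - 1, seg_len):
        targets = token_ids[start + 1:start + 1 + seg_len]
        inputs = token_ids[start:start + targets.size(0)]
        logits = model(inputs)                       # (seg, vocab)
        loss = F.cross_entropy(logits, targets)      # mean NLL in nats
        total_nll += loss.item() * targets.size(0)
        total_tokens += targets.size(0)
        opt.zero_grad()
        loss.backward()                              # adapt to the segment just scored
        opt.step()
    return total_nll / total_tokens                  # exp() of this gives perplexity
```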
Word Level Models
Penn Treebank
A common evaluation dataset for language modeling is the Penn Treebank,
as pre-processed by Mikolov et al. (2011).
The dataset consists of 929k training words, 73k validation words, and
82k test words. As part of the pre-processing, words were lower-cased, numbers
were replaced with N, newlines were replaced with <eos>,
and all other punctuation was removed. The vocabulary is
the most frequent 10k words with the rest of the tokens replaced by an <unk>
token.
Models are evaluated based on perplexity, the exponential of the average
negative per-word log-probability (lower is better).
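As a concrete illustration of the metric, here is a minimal, framework-free sketch of corpus perplexity; `model_log_prob` is a hypothetical stand-in for whatever function returns the model's natural-log probability of the next token given its history.

```python
import math

def perplexity(model_log_prob, tokens):
    """Perplexity: exp of the average negative log-probability per token.

    `model_log_prob(history, token)` is a hypothetical stand-in returning
    ln p(token | history); under the standard PTB setup, <eos> and <unk>
    are scored like any other vocabulary item.
    """
    nll = -sum(model_log_prob(tokens[:i], tok) for i, tok in enumerate(tokens))
    return math.exp(nll / len(tokens))
```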
Model | Validation perplexity | Test perplexity | Number of params | Paper / Source | Code |
---|---|---|---|---|---|
Mogrifier RLSTM + dynamic eval (Melis, 2022) | 42.9 | 42.9 | 24M | Circling Back to Recurrent Models of Language | Official |
Mogrifier LSTM + dynamic eval (Melis et al., 2019) | 44.9 | 44.8 | 24M | Mogrifier LSTM | Official |
AdvSoft + AWD-LSTM-MoS + dynamic eval (Wang et al., 2019) | 46.63 | 46.01 | 22M | Improving Neural Language Modeling via Adversarial Training | Official |
FRAGE + AWD-LSTM-MoS + dynamic eval (Gong et al., 2018) | 47.38 | 46.54 | 22M | FRAGE: Frequency-Agnostic Word Representation | Official |
AWD-LSTM-DOC x5 (Takase et al., 2018) | 48.63 | 47.17 | 185M | Direct Output Connection for a High-Rank Language Model | Official |
AWD-LSTM-MoS + dynamic eval (Yang et al., 2018)* | 48.33 | 47.69 | 22M | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model | Official |
Mogrifier RLSTM (Melis, 2022) | 48.9 | 47.9 | 24M | Circling Back to Recurrent Models of Language | Official |
Mogrifier LSTM (Melis et al., 2019) | 51.4 | 50.1 | 24M | Mogrifier LSTM | Official |
AWD-LSTM + dynamic eval (Krause et al., 2017)* | 51.6 | 51.1 | 24M | Dynamic Evaluation of Neural Sequence Models | Official |
AWD-LSTM-DOC + Partial Shuffle (Press, 2019) preprint | 53.79 | 52.00 | 23M | Partially Shuffling the Training Data to Improve Language Models | Official |
AWD-LSTM-DOC (Takase et al., 2018) | 54.12 | 52.38 | 23M | Direct Output Connection for a High-Rank Language Model | Official |
AWD-LSTM + continuous cache pointer (Merity et al., 2017)* | 53.9 | 52.8 | 24M | Regularizing and Optimizing LSTM Language Models | Official |
Trellis Network (Bai et al., 2019) | - | 54.19 | 34M | Trellis Networks for Sequence Modeling | Official |
AWD-LSTM-MoS + ATOI (Kocher et al., 2019) | 56.44 | 54.33 | 22M | Alleviating Sequence Information Loss with Data Overlapping and Prime Batch Sizes | Official |
AWD-LSTM-MoS + finetune (Yang et al., 2018) | 56.54 | 54.44 | 22M | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model | Official |
Transformer-XL (Dai et al., 2018) under review | 56.72 | 54.52 | 24M | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Official |
AWD-LSTM-MoS (Yang et al., 2018) | 58.08 | 55.97 | 22M | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model | Official |
AWD-LSTM 3-layer with Fraternal dropout (Zołna et al., 2018) | 58.9 | 56.8 | 24M | Fraternal dropout | Official |
AWD-LSTM (Merity et al., 2017) | 60.0 | 57.3 | 24M | Regularizing and Optimizing LSTM Language Models | Official |
WikiText-2
WikiText-2 has been proposed as a more realistic benchmark for language modeling than the pre-processed Penn Treebank. WikiText-2 consists of around 2 million words extracted from Wikipedia articles.
Model | Validation perplexity | Test perplexity | Number of params | Paper / Source | Code |
---|---|---|---|---|---|
Mogrifier RLSTM + dynamic eval (Melis, 2022) | 39.3 | 38.0 | 24M | Circling Back to Recurrent Models of Language | Official |
Mogrifier LSTM + dynamic eval (Melis et al., 2019) | 40.2 | 38.6 | 35M | Mogrifier LSTM | Official |
AdvSoft + AWD-LSTM-MoS + dynamic eval (Wang et al., 2019) | 40.27 | 38.65 | 35M | Improving Neural Language Modeling via Adversarial Training | Official |
FRAGE + AWD-LSTM-MoS + dynamic eval (Gong et al., 2018) | 40.85 | 39.14 | 35M | FRAGE: Frequency-Agnostic Word Representation | Official |
AWD-LSTM-MoS + dynamic eval (Yang et al., 2018)* | 42.41 | 40.68 | 35M | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model | Official |
AWD-LSTM + dynamic eval (Krause et al., 2017)* | 46.4 | 44.3 | 33M | Dynamic Evaluation of Neural Sequence Models | Official |
AWD-LSTM + continuous cache pointer (Merity et al., 2017)* | 53.8 | 52.0 | 33M | Regularizing and Optimizing LSTM Language Models | Official |
AWD-LSTM-DOC x5 (Takase et al., 2018) | 54.19 | 53.09 | 185M | Direct Output Connection for a High-Rank Language Model | Official |
Mogrifier RLSTM (Melis, 2022) | 56.7 | 55.0 | 24M | Circling Back to Recurrent Models of Language | Official |
Mogrifier LSTM (Melis et al., 2019) | 57.3 | 55.1 | 35M | Mogrifier LSTM | Official |
AWD-LSTM-DOC + Partial Shuffle (Press, 2019) preprint | 60.16 | 57.85 | 37M | Partially Shuffling the Training Data to Improve Language Models | Official |
AWD-LSTM-DOC (Takase et al., 2018) | 60.29 | 58.03 | 37M | Direct Output Connection for a High-Rank Language Model | Official |
AWD-LSTM-MoS (Yang et al., 2018) | 63.88 | 61.45 | 35M | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model | Official |
AWD-LSTM 3-layer with Fraternal dropout (Zołna et al., 2018) | 66.8 | 64.1 | 34M | Fraternal dropout | Official |
AWD-LSTM + ATOI (Kocher et al., 2019) | 67.47 | 64.73 | 33M | Alleviating Sequence Information Loss with Data Overlapping and Prime Batch Sizes | Official |
AWD-LSTM (Merity et al., 2017) | 68.6 | 65.8 | 33M | Regularizing and Optimizing LSTM Language Models | Official |
WikiText-103
The WikiText-103 corpus contains 267,735 unique words, and each word occurs at least three times in the training set.
Model | Validation perplexity | Test perplexity | Number of params | Paper / Source | Code |
---|---|---|---|---|---|
Routing Transformer (Roy et al., 2020)* arxiv preprint | - | 15.8 | - | Efficient Content-Based Sparse Attention with Routing Transformers | - |
Transformer-XL + RMS dynamic eval (Krause et al., 2019)* arxiv preprint | 15.8 | 16.4 | 257M | Dynamic Evaluation of Transformer Language Models | Official |
Compressive Transformer (Rae et al., 2019)* arxiv preprint | 16.0 | 17.1 (16.1 with basic dynamic evaluation) | ~257M | Compressive Transformers for Long-Range Sequence Modelling | - |
SegaTransformer-XL (Bai et al., 2020) | - | 17.1 | 257M | Segatron: Segment-Aware Transformer for Language Modeling and Understanding | Official |
Transformer-XL Large (Dai et al., 2018) under review | 17.7 | 18.3 | 257M | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Official |
Transformer with tied adaptive embeddings (Baevski and Auli, 2018) | 19.8 | 20.5 | 247M | Adaptive Input Representations for Neural Language Modeling | Link |
TaLK Convolutions (Lioutas et al., 2020) | - | 23.3 | 240M | Time-aware Large Kernel Convolutions | Official |
Transformer-XL Standard (Dai et al., 2018) under review | 23.1 | 24.0 | 151M | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Official |
AdvSoft + 4 layer QRNN + dynamic eval (Wang et al., 2019) | 27.2 | 28.0 | - | Improving Neural Language Modeling via Adversarial Training | Official |
LSTM + Hebbian + Cache + MbPA (Rae et al., 2018) | 29.0 | 29.2 | - | Fast Parametric Learning with Activation Memorization | |
Trellis Network (Bai et al., 2019) | - | 30.35 | 180M | Trellis Networks for Sequence Modeling | Official |
AWD-LSTM-MoS + ATOI (Kocher et al., 2019) | 31.92 | 32.85 | - | Alleviating Sequence Information Loss with Data Overlapping and Prime Batch Sizes | Official |
LSTM + Hebbian (Rae et al., 2018) | 34.1 | 34.3 | - | Fast Parametric Learning with Activation Memorization | |
LSTM (Rae et al., 2018) | 36.0 | 36.4 | - | Fast Parametric Learning with Activation Memorization | |
Gated CNN (Dauphin et al., 2016) | - | 37.2 | - | Language modeling with gated convolutional networks | |
Neural cache model (size = 2,000) (Grave et al., 2017) | - | 40.8 | - | Improving Neural Language Models with a Continuous Cache | Link |
Temporal CNN (Bai et al., 2018) | - | 45.2 | - | Convolutional sequence modeling revisited | |
LSTM (Grave et al., 2017) | - | 48.7 | - | Improving Neural Language Models with a Continuous Cache | Link |
1B Words / Google Billion Word benchmark
The One-Billion Word benchmark is a large dataset derived from a news-commentary site. The dataset consists of 829,250,940 tokens over a vocabulary of 793,471 words. Importantly, sentences in this dataset are shuffled, and hence context is limited.
Model | Test perplexity | Number of params | Paper / Source | Code |
---|---|---|---|---|
Transformer-XL Large (Dai et al., 2018) under review | 21.8 | 0.8B | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Official |
Transformer-XL Base (Dai et al., 2018) under review | 23.5 | 0.46B | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Official |
Transformer with shared adaptive embeddings - Very large (Baevski and Auli, 2018) | 23.7 | 0.8B | Adaptive Input Representations for Neural Language Modeling | Link |
10 LSTM+CNN inputs + SNM10-SKIP (Jozefowicz et al., 2016) ensemble | 23.7 | 43B? | Exploring the Limits of Language Modeling | Official |
Transformer with shared adaptive embeddings (Baevski and Auli, 2018) | 24.1 | 0.46B | Adaptive Input Representations for Neural Language Modeling | Link |
Big LSTM+CNN inputs (Jozefowicz et al., 2016) | 30.0 | 1.04B | Exploring the Limits of Language Modeling | |
Gated CNN-14Bottleneck (Dauphin et al., 2017) | 31.9 | ? | Language Modeling with Gated Convolutional Networks | |
BIGLSTM baseline (Kuchaiev and Ginsburg, 2018) | 35.1 | 0.151B | Factorization tricks for LSTM networks | Official |
BIG F-LSTM F512 (Kuchaiev and Ginsburg, 2018) | 36.3 | 0.052B | Factorization tricks for LSTM networks | Official |
BIG G-LSTM G-8 (Kuchaiev and Ginsburg, 2018) | 39.4 | 0.035B | Factorization tricks for LSTM networks | Official |
Character Level Models
Hutter Prize
The Hutter Prize Wikipedia dataset, also known as enwik8, is a byte-level dataset consisting of the first 100 million bytes of a Wikipedia XML dump. For simplicity we shall refer to it as a character-level dataset. Within these 100 million bytes are 205 unique tokens.
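The tables below report bits per character (BPC): the average negative log2-probability the model assigns to each byte, which equals the cross-entropy in nats divided by ln 2. A minimal sketch, again assuming a hypothetical `model_log_prob` that returns natural-log probabilities:

```python
import math

def bits_per_character(model_log_prob, chars):
    """BPC: average negative log2-probability per character (byte).

    `model_log_prob(history, ch)` is a hypothetical stand-in returning
    ln p(ch | history); dividing the total NLL in nats by ln 2 converts
    it to bits. Note that 2 ** bpc is the per-character perplexity.
    """
    nll_nats = -sum(model_log_prob(chars[:i], ch) for i, ch in enumerate(chars))
    return nll_nats / (len(chars) * math.log(2))
```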
Model | Bits per Character (BPC) | Number of params | Paper / Source | Code |
---|---|---|---|---|
Mogrifier RLSTM + dynamic eval (Melis, 2022) | 0.935 | 96M | Circling Back to Recurrent Models of Language | Official |
Transformer-XL + RMS dynamic eval (Krause et al., 2019)* arxiv preprint | 0.94 | 277M | Dynamic Evaluation of Transformer Language Models | Official |
Compressive Transformer (Rae et al., 2019) arxiv preprint | 0.97 | - | Compressive Transformers for Long-Range Sequence Modelling | - |
Mogrifier LSTM + dynamic eval (Melis et al., 2019) | 0.988 | 96M | Mogrifier LSTM | Official |
24-layer Transformer-XL (Dai et al., 2018) under review | 0.99 | 277M | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Official |
Longformer Large (Beltagy, Peters, and Cohan; 2020) | 0.99 | 102M | Longformer: The Long-Document Transformer | Official |
Longformer Small (Beltagy, Peters, and Cohan; 2020) | 1.00 | 41M | Longformer: The Long-Document Transformer | Official |
18-layer Transformer-XL (Dai et al., 2018) under review | 1.03 | 88M | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Official |
Mogrifier RLSTM (Melis, 2022) | 1.042 | 96M | Circling Back to Recurrent Models of Language | Official |
12-layer Transformer-XL (Dai et al., 2018) under review | 1.06 | 41M | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Official |
64-layer Character Transformer Model (Al-Rfou et al., 2018) | 1.06 | 235M | Character-Level Language Modeling with Deeper Self-Attention | |
mLSTM + dynamic eval (Krause et al., 2017)* | 1.08 | 46M | Dynamic Evaluation of Neural Sequence Models | Official |
12-layer Character Transformer Model (Al-Rfou et al., 2018) | 1.11 | 44M | Character-Level Language Modeling with Deeper Self-Attention | |
Mogrifier LSTM (Melis et al., 2019) | 1.122 | 96M | Mogrifier LSTM | Official |
3-layer AWD-LSTM (Merity et al., 2018) | 1.232 | 47M | An Analysis of Neural Language Modeling at Multiple Scales | Official |
Large mLSTM +emb +WN +VD (Krause et al., 2016) | 1.24 | 46M | Multiplicative LSTM for sequence modelling | Official |
Large FS-LSTM-4 (Mujika et al., 2017) | 1.245 | 47M | Fast-Slow Recurrent Neural Networks | Official |
Large RHN (Zilly et al., 2016) | 1.27 | 46M | Recurrent Highway Networks | Official |
FS-LSTM-4 (Mujika et al., 2017) | 1.277 | 27M | Fast-Slow Recurrent Neural Networks | Official |
Text8
The text8 dataset is also derived from Wikipedia text, but has all XML removed and is lowercased so that it contains only the 26 letters of English plus spaces.
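For reference, the character normalization can be approximated as follows; the official preprocessing is Matt Mahoney's wikifil.pl script, which additionally strips wiki markup and spells out digits, so treat this as a rough sketch rather than the exact pipeline.

```python
import re

def text8_normalize(text):
    """Rough approximation of text8-style normalization: lowercase the
    text, keep only the letters a-z, and map every other character
    (digits, punctuation, newlines) to a single space. The official
    wikifil.pl script also removes markup and spells out digits.
    """
    return re.sub(r"[^a-z]+", " ", text.lower()).strip()
```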
Model | Bits per Character (BPC) | Number of params | Paper / Source | Code |
---|---|---|---|---|
Transformer-XL + RMS dynamic eval (Krause et al., 2019)* arxiv preprint | 1.038 | 277M | Dynamic Evaluation of Transformer Language Models | Official |
Mogrifier RLSTM + dynamic eval (Melis, 2022) | 1.044 | 96M | Circling Back to Recurrent Models of Language | Official |
Transformer-XL Large (Dai et al., 2018) under review | 1.08 | 277M | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Official |
Mogrifier RLSTM (Melis, 2022) | 1.096 | 96M | Circling Back to Recurrent Models of Language | Official |
Longformer Small (Beltagy, Peters, and Cohan; 2020) | 1.10 | 41M | Longformer: The Long-Document Transformer | Official |
64-layer Character Transformer Model (Al-Rfou et al., 2018) | 1.13 | 235M | Character-Level Language Modeling with Deeper Self-Attention | |
12-layer Character Transformer Model (Al-Rfou et al., 2018) | 1.18 | 44M | Character-Level Language Modeling with Deeper Self-Attention | |
mLSTM + dynamic eval (Krause et al., 2017)* | 1.19 | 45M | Dynamic Evaluation of Neural Sequence Models | Official |
Large mLSTM +emb +WN +VD (Krause et al., 2016) | 1.27 | 45M | Multiplicative LSTM for sequence modelling | Official |
Large RHN (Zilly et al., 2016) | 1.27 | 46M | Recurrent Highway Networks | Official |
LayerNorm HM-LSTM (Chung et al., 2017) | 1.29 | 35M | Hierarchical Multiscale Recurrent Neural Networks | |
BN LSTM (Cooijmans et al., 2016) | 1.36 | 16M | Recurrent Batch Normalization | Official |
Unregularised mLSTM (Krause et al., 2016) | 1.40 | 45M | Multiplicative LSTM for sequence modelling | Official |
Penn Treebank
The vocabulary of the character-level dataset is limited to the same 10,000 words used in the word-level dataset. This vastly simplifies the task of character-level language modeling, as character transitions are restricted to those found within that word-level vocabulary.
Model | Bits per Character (BPC) | Number of params | Paper / Source | Code |
---|---|---|---|---|
Mogrifier RLSTM + dynamic eval (Melis, 2022) | 1.061 | 24M | Circling Back to Recurrent Models of Language | Official |
Mogrifier LSTM + dynamic eval (Melis et al., 2019) | 1.083 | 24M | Mogrifier LSTM | Official |
Mogrifier RLSTM (Melis, 2022) | 1.096 | 24M | Circling Back to Recurrent Models of Language | Official |
Mogrifier LSTM (Melis et al., 2019) | 1.120 | 24M | Mogrifier LSTM | Official |
Trellis Network (Bai et al., 2019) | 1.159 | 13.4M | Trellis Networks for Sequence Modeling | Official |
3-layer AWD-LSTM (Merity et al., 2018) | 1.175 | 13.8M | An Analysis of Neural Language Modeling at Multiple Scales | Official |
6-layer QRNN (Merity et al., 2018) | 1.187 | 13.8M | An Analysis of Neural Language Modeling at Multiple Scales | Official |
FS-LSTM-4 (Mujika et al., 2017) | 1.190 | 27M | Fast-Slow Recurrent Neural Networks | Official |
FS-LSTM-2 (Mujika et al., 2017) | 1.193 | 27M | Fast-Slow Recurrent Neural Networks | Official |
NASCell (Zoph & Le, 2016) | 1.214 | 16.3M | Neural Architecture Search with Reinforcement Learning | |
2-layer Norm HyperLSTM (Ha et al., 2016) | 1.219 | 14.4M | HyperNetworks |
Multilingual Wikipedia Corpus
The character-based MWC dataset is a collection of Wikipedia pages available in a number of languages. Markup and rare characters were removed, but otherwise no preprocessing was applied.
MWC English in the single text, large setting.
Model | Validation BPC | Test BPC | Number of params | Paper / Source | Code |
---|---|---|---|---|---|
Mogrifier LSTM + dynamic eval (Melis et al., 2019) | 1.200 | 1.187 | 24M | Mogrifier LSTM | Official |
Mogrifier LSTM (Melis et al., 2019) | 1.312 | 1.298 | 24M | Mogrifier LSTM | Official |
HCLM with Cache (Kawakami et al., 2017) | 1.591 | 1.538 | 8M | Learning to Create and Reuse Words in Open-Vocabulary Neural Language Modeling | |
LSTM (Kawakami et al., 2017) | 1.793 | 1.736 | 8M | Learning to Create and Reuse Words in Open-Vocabulary Neural Language Modeling | |
MWC Finnish in the single text, large setting.
Model | Validation BPC | Test BPC | Number of params | Paper / Source | Code |
---|---|---|---|---|---|
Mogrifier LSTM + dynamic eval (Melis et al., 2019) | 1.202 | 1.191 | 24M | Mogrifier LSTM | Official |
Mogrifier LSTM (Melis et al., 2019) | 1.327 | 1.313 | 24M | Mogrifier LSTM | Official |
HCLM with Cache (Kawakami et al., 2017) | 1.754 | 1.711 | 8M | Learning to Create and Reuse Words in Open-Vocabulary Neural Language Modeling | |
LSTM (Kawakami et al., 2017) | 1.943 | 1.913 | 8M | Learning to Create and Reuse Words in Open-Vocabulary Neural Language Modeling | |