

Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.

Natural language inference

Natural language inference is the task of determining whether a “hypothesis” is true (entailment), false (contradiction), or undetermined (neutral) given a “premise”.


| Premise | Label | Hypothesis |
| --- | --- | --- |
| A man inspects the uniform of a figure in some East Asian country. | contradiction | The man is sleeping. |
| An older and younger man smiling. | neutral | Two men are smiling and laughing at the cats playing on the floor. |
| A soccer game with multiple males playing. | entailment | Some men are playing a sport. |


The Stanford Natural Language Inference (SNLI) Corpus contains around 550k hypothesis/premise pairs. Models are evaluated based on accuracy.
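Accuracy here is the fraction of premise/hypothesis pairs whose predicted label matches the gold label. A minimal sketch in Python, assuming labels are plain strings; the function name `accuracy` and the predictions below are made up for illustration and do not come from any published model.

```python
from typing import Sequence


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Return the fraction of examples whose predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Gold labels of the three example pairs shown earlier on this page.
gold = ["contradiction", "neutral", "entailment"]
# Hypothetical model output: the second pair is misclassified.
predicted = ["contradiction", "entailment", "entailment"]

score = accuracy(gold, predicted)
print(f"{score:.3f}")  # 0.667
```

Leaderboards typically report this value as a percentage, so 0.667 would appear as 66.7.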

State-of-the-art results can be seen on the SNLI website.


The Multi-Genre Natural Language Inference (MultiNLI) corpus contains around 433k hypothesis/premise pairs. It is similar to the SNLI corpus, but covers a range of genres of spoken and written text and supports cross-genre evaluation. The data can be downloaded from the MultiNLI website.

Public leaderboards for in-genre (matched) and cross-genre (mismatched) evaluation are available, but entries do not correspond to published models.
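Matched and mismatched scores are the same accuracy metric computed on two separate evaluation sets. A rough sketch of reporting them side by side, assuming each prediction is stored as a dictionary; the field names `split`, `gold`, and `pred` are hypothetical and are not the MultiNLI file format.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def accuracy_by_split(records: Iterable[Mapping[str, str]]) -> Dict[str, float]:
    """Compute accuracy separately for each value of the 'split' field."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for record in records:
        split = record["split"]
        totals[split] += 1
        if record["gold"] == record["pred"]:
            correct[split] += 1
    return {split: correct[split] / totals[split] for split in totals}


# Made-up predictions for illustration only.
records = [
    {"split": "matched", "gold": "entailment", "pred": "entailment"},
    {"split": "matched", "gold": "neutral", "pred": "contradiction"},
    {"split": "mismatched", "gold": "contradiction", "pred": "contradiction"},
    {"split": "mismatched", "gold": "neutral", "pred": "neutral"},
]

scores = accuracy_by_split(records)
print(scores)  # {'matched': 0.5, 'mismatched': 1.0}
```

Keeping the two numbers separate, rather than pooling all examples, is what makes the gap between in-genre and cross-genre performance visible.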

| Model | Matched | Mismatched | Paper / Source | Code |
| --- | --- | --- | --- | --- |
| RoBERTa (Liu et al., 2019) | 90.8 | 90.2 | RoBERTa: A Robustly Optimized BERT Pretraining Approach | Official |
| XLNet-Large (ensemble) (Yang et al., 2019) | 90.2 | 89.8 | XLNet: Generalized Autoregressive Pretraining for Language Understanding | Official |
| MT-DNN-ensemble (Liu et al., 2019) | 87.9 | 87.4 | Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding | Official |
| Snorkel MeTaL (ensemble) (Ratner et al., 2018) | 87.6 | 87.2 | Training Complex Models with Multi-Task Weak Supervision | Official |
| Finetuned Transformer LM (Radford et al., 2018) | 82.1 | 81.4 | Improving Language Understanding by Generative Pre-Training | |
| Multi-task BiLSTM + Attn (Wang et al., 2018) | 72.2 | 72.1 | GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding | |
| GenSen (Subramanian et al., 2018) | 71.4 | 71.3 | Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning | |


The SciTail entailment dataset consists of around 27k hypothesis/premise pairs. In contrast to SNLI and MultiNLI, it was not crowd-sourced but created from sentences that already exist “in the wild”. Hypotheses were created from science questions and the corresponding answer candidates, while relevant web sentences from a large corpus were used as premises. Models are evaluated based on accuracy.

| Model | Accuracy | Paper / Source |
| --- | --- | --- |
| Finetuned Transformer LM (Radford et al., 2018) | 88.3 | Improving Language Understanding by Generative Pre-Training |
| Hierarchical BiLSTM Max Pooling (Talman et al., 2018) | 86.0 | Natural Language Inference with Hierarchical BiLSTM Max Pooling |
| CAFE (Tay et al., 2018) | 83.3 | A Compare-Propagate Architecture with Alignment Factorization for Natural Language Inference |
