NLP-progress

Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.

Paraphrase Generation

Paraphrase generation is the task of generating an output sentence that is semantically identical to the input sentence but differs from it in lexicon or syntax. See the example given below:

| Input | Output |
| --- | --- |
| The need for investors to earn a commercial return may put upward pressure on prices | The need for profit is likely to push up prices |

PARANMT-50M

PARANMT-50M is a dataset for training paraphrastic sentence embeddings. It consists of more than 50 million English-English sentential paraphrase pairs.

| Model | BLEU | Paper / Source | Code |
| --- | --- | --- | --- |
| Trigram (baseline) | 47.4 | Wieting and Gimpel, 2018 | Unavailable |
| Unsupervised BART w/ Dynamic Blocking | 20.9 | Niu et al., 2020 | Unavailable |
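
The BLEU column reports n-gram overlap between generated and reference paraphrases. As a rough illustration of what the metric measures, here is a minimal sketch of unsmoothed, single-reference, sentence-level BLEU. The function names and the whitespace tokenization are assumptions made for this example. Published scores are usually computed at corpus level with standard tooling, so this sketch is not expected to reproduce the numbers above.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed single-reference BLEU on whitespace tokens, scaled to 0-100."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU = 0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return 100 * bp * math.exp(sum(log_precisions) / max_n)
```

Scoring the example output at the top of this page against its own input gives 0 under this definition, because the two sentences share no 4-gram. BLEU rewards surface overlap, so a heavily reworded paraphrase can score low even when its meaning is preserved.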

QQP-Pos

The QQP-Pos dataset is a dataset for paraphrase generation with 400K source-target pairs. Each pair is labelled negative if the two questions are not duplicates and positive otherwise.
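
The labelling scheme can be sketched in code. The example below assumes, as the dataset name suggests, that only the positive (duplicate) pairs serve as source-target examples for generation. The record layout `(question1, question2, is_duplicate)`, the function name, and the sample questions are all hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, Tuple

# Hypothetical record layout: (question1, question2, is_duplicate),
# where is_duplicate is 1 for a positive pair and 0 for a negative pair.
LabelledPair = Tuple[str, str, int]


def positive_pairs(rows: Iterable[LabelledPair]) -> List[Tuple[str, str]]:
    """Keep only the positive (duplicate) pairs as (source, target) examples."""
    return [(q1, q2) for q1, q2, label in rows if label == 1]


# Made-up sample rows, not taken from the dataset.
sample = [
    ("How do I learn Python?", "What is the best way to learn Python?", 1),
    ("How do I learn Python?", "How do I cook rice?", 0),
]
pairs = positive_pairs(sample)
```

Here `pairs` holds only the first row's two questions, since the second row is labelled as a non-duplicate.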

| Model | BLEU | Paper / Source | Code |
| --- | --- | --- | --- |
| Unsupervised BART w/ Dynamic Blocking | 26.76 | Niu et al., 2020 | Unavailable |
| ParafraGPT-UC | 35.9 | Bui et al., 2020 | Code |