NLP-progress

Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.

Hindi

Chunking

Model	Dev accuracy	Test F1	Paper / Source	Code
Dalal et al. (2006)	87.40	82.40	Hindi Part-of-Speech Tagging and Chunking: A Maximum Entropy Approach

Part-of-speech tagging

Model	Dev accuracy	Test F1	Paper / Source	Code
Jha et al. (2018)	99.30	99.06	Multi-Task Deep Morphological Analyzer: Context-Aware Joint Morphological Tagging and Lemma Prediction	mt-dma
Dalal et al. (2006)	89.35	82.22	Hindi Part-of-Speech Tagging and Chunking: A Maximum Entropy Approach

Machine Translation

The IIT Bombay English-Hindi Parallel Corpus, used by Kunchukuttan et al. (2018), is available online. A live leaderboard covering more translation directions involving Hindi is available at the evaluation website for the Workshop on Asian Translation (WAT).

Hindi -> English

WAT: HINDEN hi-en

Model	BLEU	Paper / Source	Code
Philip et al. (2020)	24.85	Revisiting Low Resource Status of Indian Languages in MT	ilmulti
Siripragada et al. (2020)	22.91	A Multilingual Parallel Corpora Collection Effort for Indian Languages	ilmulti
Goyal et al. (2019)	19.06	LTRC-MT Simple & Effective Hindi-English Neural Machine Translation Systems at WAT 2019

English -> Hindi

WAT: HINDEN en-hi

Model	BLEU	Paper / Source	Code
Philip et al. (2018)	21.57	CVIT-MT Systems for WAT-2018
Philip et al. (2020)	21.20	Revisiting Low Resource Status of Indian Languages in MT	ilmulti
Saini et al. (2018)	18.215	Neural Machine Translation for English to Hindi

G2P Conversion

Schwa Deletion

Because of diachronic sound changes, the inherent vowel of Hindi (the schwa, implicitly attached to any consonant that carries neither a vowel diacritic nor the vowel-killer diacritic, the virama) is sometimes dropped in pronunciation even though the orthography still implies it. This process is known as schwa deletion. No known set of linguistic rules predicts consistently and accurately whether a given inherent vowel is pronounced, so schwa deletion remains an open problem in grapheme-to-phoneme conversion.
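To make the orthographic side of the problem concrete, here is a minimal Python sketch that lists the consonants in a Devanagari word that carry an implied schwa, meaning they are not followed by a vowel diacritic or the vowel-killer diacritic (virama). The function name `orthographic_schwas` and the character sets are assumptions made for illustration and are not taken from any of the papers listed on this page. The sketch only finds candidate schwas; it does not predict which of them are deleted in speech.

```python
# Illustrative sketch: locate consonants whose inherent vowel (schwa) is
# implied by the orthography. All names here are hypothetical.

# Consonants in the Unicode Devanagari block: U+0915..U+0939 and U+0958..U+095F.
CONSONANTS = {chr(c) for c in range(0x0915, 0x093A)} | {
    chr(c) for c in range(0x0958, 0x0960)
}
# Dependent vowel signs (matras): U+093E..U+094C.
VOWEL_SIGNS = {chr(c) for c in range(0x093E, 0x094D)}
VIRAMA = "\u094D"  # the vowel-killer diacritic
NUKTA = "\u093C"   # modifies the consonant; a vowel sign may follow it


def orthographic_schwas(word: str) -> list[int]:
    """Return the indices of consonants that carry an implied schwa."""
    positions = []
    for i, ch in enumerate(word):
        if ch not in CONSONANTS:
            continue
        j = i + 1
        # Skip a nukta so that the character after it is inspected instead.
        if j < len(word) and word[j] == NUKTA:
            j += 1
        following = word[j] if j < len(word) else ""
        # The schwa is implied unless a vowel sign or a virama follows.
        if following not in VOWEL_SIGNS and following != VIRAMA:
            positions.append(i)
    return positions


if __name__ == "__main__":
    for word in ["कमल", "नमकीन", "हिन्दी"]:
        print(word, orthographic_schwas(word))
```

In कमल all three consonants carry an orthographic schwa, yet the word is pronounced "kamal": the final schwa is deleted. Deciding which of the returned positions are actually pronounced is the prediction task the systems below address.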

Each paper below used a different dataset, so the reported accuracies are not directly comparable. The dataset of Arora et al. (2020), extracted from the Oxford Hindi-English Dictionary, is the largest, and future work should ideally evaluate on it.

Model	Schwa-level accuracy	Word-level accuracy	Paper / Source	Code
Arora et al. (2020)	98.00	97.78	Supervised Grapheme-to-Phoneme Conversion of Orthographic Schwas in Hindi and Punjabi	schwa-deletion
Tyson and Nagar (2009)		95.00	Prosodic rules for schwa-deletion in Hindi text-to-speech synthesis
Narasimhan et al. (2004)		88.97	Schwa-Deletion in Hindi Text-to-Speech Synthesis
Choudhury et al. (2004)		99.89	A Diachronic Approach for Schwa Deletion in Indo Aryan Languages
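The two accuracy columns can be read as follows, under the assumption (not confirmed by this page) that schwa-level accuracy is the fraction of individual schwa decisions predicted correctly and word-level accuracy is the fraction of words in which every schwa decision is correct. The sketch below computes both; the function name and the boolean encoding (`True` for a retained schwa, `False` for a deleted one) are hypothetical and chosen only for illustration.

```python
# Illustrative sketch of the two metrics under the assumed definitions above.
# Each word is a list of booleans, one per orthographic schwa:
# True = schwa pronounced, False = schwa deleted.

def schwa_and_word_accuracy(
    gold: list[list[bool]], pred: list[list[bool]]
) -> tuple[float, float]:
    """Return (schwa-level accuracy, word-level accuracy) as fractions."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of words")
    total_schwas = 0
    correct_schwas = 0
    correct_words = 0
    for g, p in zip(gold, pred):
        if len(g) != len(p):
            raise ValueError("each word needs one prediction per schwa")
        matches = sum(a == b for a, b in zip(g, p))
        total_schwas += len(g)
        correct_schwas += matches
        # A word counts as correct only if all of its schwas are correct.
        correct_words += int(matches == len(g))
    if total_schwas == 0:
        raise ValueError("no schwa decisions to evaluate")
    return correct_schwas / total_schwas, correct_words / len(gold)


if __name__ == "__main__":
    gold = [[True, True, False], [True, False, False]]
    pred = [[True, True, False], [True, True, False]]
    schwa_acc, word_acc = schwa_and_word_accuracy(gold, pred)
    print(f"schwa-level: {schwa_acc:.4f}, word-level: {word_acc:.4f}")
```

Under these definitions a single wrong schwa makes the whole word wrong, so word-level accuracy can never exceed schwa-level accuracy on the same data.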