
NLP-progress

Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.

Hindi

Chunking

| Model | Dev accuracy | Test F1 | Paper / Source | Code |
| --- | --- | --- | --- | --- |
| Dalal et al. (2006) | 87.40 | 82.40 | Hindi Part-of-Speech Tagging and Chunking: A Maximum Entropy Approach | |

Part-of-speech tagging

| Model | Dev accuracy | Test F1 | Paper / Source | Code |
| --- | --- | --- | --- | --- |
| Jha et al. (2018) | 99.30 | 99.06 | Multi-Task Deep Morphological Analyzer: Context-Aware Joint Morphological Tagging and Lemma Prediction | mt-dma |
| Dalal et al. (2006) | 89.35 | 82.22 | Hindi Part-of-Speech Tagging and Chunking: A Maximum Entropy Approach | |

Machine Translation

The IIT Bombay English-Hindi Parallel Corpus used by Kunchukuttan et al. (2018) is publicly available. A live leaderboard covering additional translation directions involving Hindi is available on the evaluation website of the Workshop on Asian Translation (WAT).

Hindi -> English

| Model | BLEU | Paper / Source | Code |
| --- | --- | --- | --- |
| Philip et al. (2020) | 24.85 | Revisiting Low Resource Status of Indian Languages in MT | ilmulti |
| Siripragada et al. (2020) | 22.91 | A Multilingual Parallel Corpora Collection Effort for Indian Languages | ilmulti |
| Goyal et al. (2019) | 19.06 | LTRC-MT Simple & Effective Hindi-English Neural Machine Translation Systems at WAT 2019 | |

English -> Hindi

| Model | BLEU | Paper / Source | Code |
| --- | --- | --- | --- |
| Philip et al. (2018) | 21.57 | CVIT-MT Systems for WAT-2018 | |
| Philip et al. (2020) | 21.20 | Revisiting Low Resource Status of Indian Languages in MT | ilmulti |
| Saini et al. (2018) | 18.215 | Neural Machine Translation for English to Hindi | |

G2P Conversion

Schwa Deletion

Due to diachronic processes, the inherent vowel of Hindi (the schwa, which is implicitly attached to any consonant that carries no other vowel diacritic or vowel-killer diacritic) is sometimes dropped in pronunciation even though the orthography implies it. This phenomenon is known as schwa deletion. No known set of linguistic rules consistently and accurately predicts whether a given inherent vowel is pronounced, so schwa deletion remains an open problem in the field.
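
To make the phenomenon concrete, here is a minimal Python sketch. It is not taken from any of the papers listed on this page. The character tables cover only the few letters needed for the two example words, and `drop_final_schwa` is a hypothetical helper implementing a deliberately naive simplification (delete only a word-final schwa). It is included to show why a single rule is insufficient, not as a real solution.

```python
# Illustrative only: a tiny, incomplete Devanagari-to-phoneme mapping.
CONSONANTS = {"क": "k", "म": "m", "ल": "l", "न": "n"}
VOWEL_SIGNS = {"ा": "aː", "ि": "ɪ", "ी": "iː"}
VIRAMA = "्"  # the "vowel-killer" diacritic
SCHWA = "ə"


def orthographic_phonemes(word):
    """Read a word literally, keeping every inherent schwa the script implies."""
    phonemes = []
    i = 0
    while i < len(word):
        ch = word[i]
        if ch not in CONSONANTS:
            raise ValueError(f"character {ch!r} is outside this toy mapping")
        phonemes.append(CONSONANTS[ch])
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if nxt in VOWEL_SIGNS:
            # An explicit vowel diacritic replaces the inherent schwa.
            phonemes.append(VOWEL_SIGNS[nxt])
            i += 1
        elif nxt == VIRAMA:
            # The virama suppresses the inherent schwa.
            i += 1
        else:
            # No diacritic: the inherent schwa applies.
            phonemes.append(SCHWA)
        i += 1
    return phonemes


def drop_final_schwa(phonemes):
    """Naive assumption: only a word-final schwa is ever deleted."""
    if len(phonemes) > 1 and phonemes[-1] == SCHWA:
        return phonemes[:-1]
    return phonemes


print("".join(orthographic_phonemes("कमल")))                      # kəmələ
print("".join(drop_final_schwa(orthographic_phonemes("कमल"))))    # kəməl
print("".join(drop_final_schwa(orthographic_phonemes("नमकीन"))))  # nəməkiːn
```

The naive rule happens to give the usual pronunciation of कमल (kəməl), but for नमकीन it yields nəməkiːn, whereas the word is normally pronounced nəmkiːn with the medial schwa dropped as well. Deciding which medial schwas to delete is the hard part that the systems below address.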

Each paper below uses a different dataset, so the reported figures are not directly comparable. The dataset of Arora et al. (2020), extracted from the Oxford Hindi-English Dictionary, is the largest, and future work should ideally evaluate against it.

| Model | Schwa-level accuracy | Word-level accuracy | Paper / Source | Code |
| --- | --- | --- | --- | --- |
| Arora et al. (2020) | 98.00 | 97.78 | Supervised Grapheme-to-Phoneme Conversion of Orthographic Schwas in Hindi and Punjabi | schwa-deletion |
| Tyson and Nagar (2009) | | 95.00 | Prosodic rules for schwa-deletion in Hindi text-to-speech synthesis | |
| Narasimhan et al. (2004) | | 88.97 | Schwa-Deletion in Hindi Text-to-Speech Synthesis | |
| Choudhury et al. (2004) | | 99.89 | A Diachronic Approach for Schwa Deletion in Indo Aryan Languages | |
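
The two accuracy columns measure different things. The sketch below shows one plausible way to compute them, and rests on assumptions: schwa-level accuracy is taken to be the share of individual keep-or-delete decisions that match the gold standard, and word-level accuracy the share of words in which every decision matches. The cited papers may define the metrics differently. The function name and the boolean encoding (`True` means the schwa is pronounced) are hypothetical.

```python
def schwa_and_word_accuracy(gold, pred):
    """Return (schwa-level accuracy, word-level accuracy) as percentages.

    gold and pred are lists of words; each word is a list of booleans,
    one per orthographic schwa (True = pronounced, False = deleted).
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of words")

    total_schwas = correct_schwas = correct_words = 0
    for gold_word, pred_word in zip(gold, pred):
        if len(gold_word) != len(pred_word):
            raise ValueError("each word must have the same number of schwas")
        matches = sum(g == p for g, p in zip(gold_word, pred_word))
        total_schwas += len(gold_word)
        correct_schwas += matches
        # A word counts as correct only if all of its schwas are correct.
        correct_words += matches == len(gold_word)

    if total_schwas == 0:
        raise ValueError("no schwas to evaluate")
    return 100 * correct_schwas / total_schwas, 100 * correct_words / len(gold)


# Toy data: the second word has one wrong decision out of three.
gold = [[True, True, False], [True, False, False]]
pred = [[True, True, False], [True, True, False]]
schwa_acc, word_acc = schwa_and_word_accuracy(gold, pred)
print(f"{schwa_acc:.2f} {word_acc:.2f}")  # 83.33 50.00
```

Under these definitions a single wrong decision invalidates the whole word, so word-level accuracy can never exceed schwa-level accuracy on the same data.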