
NLP-progress

Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.

Vietnamese NLP tasks

Dependency parsing

VnDT v1.1:

| POS tags | Model | LAS | UAS | Paper | Code |
| --- | --- | --- | --- | --- | --- |
| Predicted POS | PhoBERT-base (2020) | 78.77 | 85.22 | PhoBERT: Pre-trained language models for Vietnamese | Official |
| Predicted POS | PhoBERT-large (2020) | 77.85 | 84.32 | PhoBERT: Pre-trained language models for Vietnamese | Official |
| Predicted POS | Biaffine (2017) | 74.99 | 81.19 | Deep Biaffine Attention for Neural Dependency Parsing | |
| Predicted POS | jointWPD (2018) | 73.90 | 80.12 | A neural joint model for Vietnamese word segmentation, POS tagging and dependency parsing | |
| Predicted POS | jPTDP-v2 (2018) | 73.12 | 79.63 | An improved neural network model for joint POS tagging and dependency parsing | |
| Predicted POS | VnCoreNLP (2018) | 71.38 | 77.35 | VnCoreNLP: A Vietnamese Natural Language Processing Toolkit | Official |

VnDT v1.0:

| POS tags | Model | LAS | UAS | Paper | Code |
| --- | --- | --- | --- | --- | --- |
| Predicted POS | VnCoreNLP (2018) | 70.23 | 76.93 | VnCoreNLP: A Vietnamese Natural Language Processing Toolkit | Official |
| Gold POS | VnCoreNLP (2018) | 73.39 | 79.02 | VnCoreNLP: A Vietnamese Natural Language Processing Toolkit | Official |
| Gold POS | BIST BiLSTM graph-based parser (2016) | 73.17 | 79.39 | Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations | Official |
| Gold POS | BIST BiLSTM transition-based parser (2016) | 72.53 | 79.33 | Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations | Official |
| Gold POS | MSTparser (2006) | 70.29 | 76.47 | Online large-margin training of dependency parsers | |
| Gold POS | MaltParser (2007) | 69.10 | 74.91 | MaltParser: A language-independent system for data-driven dependency parsing | |
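
The LAS and UAS columns above are attachment scores: UAS counts tokens whose predicted head is correct, and LAS additionally requires the correct dependency label. The following is a minimal sketch of that computation, not the evaluation script used by any of the listed papers. The function name `attachment_scores`, the `(head, label)` token representation, and the choice to score every token (including punctuation) are assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: each token is (head_index, dependency_label),
# where head_index 0 denotes the artificial root.
Token = Tuple[int, str]


def attachment_scores(
    gold: List[List[Token]], pred: List[List[Token]]
) -> Tuple[float, float]:
    """Return (LAS, UAS) as percentages over all tokens in the corpus."""
    total = correct_heads = correct_labeled = 0
    for gold_sent, pred_sent in zip(gold, pred):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("gold and predicted sentences differ in length")
        for (g_head, g_label), (p_head, p_label) in zip(gold_sent, pred_sent):
            total += 1
            if g_head == p_head:
                correct_heads += 1  # counts towards UAS
                if g_label == p_label:
                    correct_labeled += 1  # counts towards LAS
    if total == 0:
        raise ValueError("no tokens to score")
    return 100.0 * correct_labeled / total, 100.0 * correct_heads / total


# Toy example: one three-token sentence.
gold = [[(2, "nsubj"), (0, "root"), (2, "obj")]]
pred = [[(2, "nmod"), (0, "root"), (1, "obj")]]
las, uas = attachment_scores(gold, pred)
print(f"LAS = {las:.2f}, UAS = {uas:.2f}")
```

In the toy example two of three heads are correct and one of three tokens has both head and label correct. Published evaluations may differ in details such as punctuation handling, so scores from this sketch are not directly comparable with the tables.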

Machine translation

English-Vietnamese translation

English-to-Vietnamese

The tst2015 set is used for testing.

| Model | BLEU | Paper | Code |
| --- | --- | --- | --- |
| Stanford (2015) | 26.4 | Stanford Neural Machine Translation Systems for Spoken Language Domains | |

The tst2013 set is used for testing.

| Model | BLEU | Paper | Code |
| --- | --- | --- | --- |
| Nguyen and Salazar (2019) | 32.8 | Transformers without Tears: Improving the Normalization of Self-Attention | Official |
| Provilkov et al. (2019) | 33.27 (uncased) | BPE-Dropout: Simple and Effective Subword Regularization | |
| Xu et al. (2019) | 31.4 | Understanding and Improving Layer Normalization | Official |
| CVT (2018) | 29.6 (SST) | Semi-Supervised Sequence Modeling with Cross-View Training | |
| ELMo (2018) | 29.3 (SST) | Deep contextualized word representations | |
| Transformer (2017) | 28.9 | Attention is all you need | Link |
| Kudo (2018) | 28.5 | Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates | |
| Google (2017) | 26.1 | Neural machine translation (seq2seq) tutorial | Official |
| Stanford (2015) | 23.3 | Stanford Neural Machine Translation Systems for Spoken Language Domains | |

Vietnamese-to-English

The tst2013 set is used for testing.

| Model | BLEU | Paper | Code |
| --- | --- | --- | --- |
| Provilkov et al. (2019) | 32.99 (uncased) | BPE-Dropout: Simple and Effective Subword Regularization | |
| Kudo (2018) | 26.31 | Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates | |

Named entity recognition

| Model | F1 | Paper | Code | Note |
| --- | --- | --- | --- | --- |
| PhoBERT-large (2020) | 94.7 | PhoBERT: Pre-trained language models for Vietnamese | Official | |
| PhoBERT-base (2020) | 93.6 | PhoBERT: Pre-trained language models for Vietnamese | Official | |
| VnCoreNLP (2018) [1] | 91.30 | VnCoreNLP: A Vietnamese Natural Language Processing Toolkit | Official | Used ETNLP embeddings |
| BiLSTM-CRF + CNN-char (2016) [1] | 91.09 | End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF | Official / Link | Used ETNLP embeddings |
| VNER (2019) | 89.58 | Attentive Neural Network for Named Entity Recognition in Vietnamese | | |
| VnCoreNLP (2018) | 88.55 | VnCoreNLP: A Vietnamese Natural Language Processing Toolkit | Official | Pre-trained embeddings learned from Baomoi corpus |
| BiLSTM-CRF + CNN-char (2016) [2] | 88.28 | End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF | Official / Link | Pre-trained embeddings learned from Baomoi corpus |
| BiLSTM-CRF + LSTM-char (2016) [2] | 87.71 | Neural Architectures for Named Entity Recognition | Link | Pre-trained embeddings learned from Baomoi corpus |
| BiLSTM-CRF (2015) [2] | 86.48 | Bidirectional LSTM-CRF Models for Sequence Tagging | Link | Pre-trained embeddings learned from Baomoi corpus |

Part-of-speech tagging

| Model | Accuracy | Paper | Code |
| --- | --- | --- | --- |
| PhoBERT-large (2020) | 96.8 | PhoBERT: Pre-trained language models for Vietnamese | Official |
| PhoBERT-base (2020) | 96.7 | PhoBERT: Pre-trained language models for Vietnamese | Official |
| jointWPD (2018) | 95.97 | A neural joint model for Vietnamese word segmentation, POS tagging and dependency parsing | |
| VnCoreNLP-VnMarMoT (2017) | 95.88 | From Word Segmentation to POS Tagging for Vietnamese | Official |
| jPTDP-v2 (2018) | 95.70 | An improved neural network model for joint POS tagging and dependency parsing | |
| BiLSTM-CRF + CNN-char (2016) | 95.40 | End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF | Official / Link |
| BiLSTM-CRF + LSTM-char (2016) | 95.31 | Neural Architectures for Named Entity Recognition | Link |
| BiLSTM-CRF (2015) | 95.06 | Bidirectional LSTM-CRF Models for Sequence Tagging | Link |
| RDRPOSTagger (2014) | 95.11 | RDRPOSTagger: A Ripple Down Rules-based Part-Of-Speech Tagger | Official |

Word segmentation

| Model | F1 | Paper | Code |
| --- | --- | --- | --- |
| VnCoreNLP-RDRsegmenter (2018) | 97.90 | A Fast and Accurate Vietnamese Word Segmenter | Official |
| UETsegmenter (2016) | 97.87 | A hybrid approach to Vietnamese word segmentation | Official |
| jointWPD (2018) | 97.81 | A neural joint model for Vietnamese word segmentation, POS tagging and dependency parsing | |
| vnTokenizer (2008) | 97.33 | A Hybrid Approach to Word Segmentation of Vietnamese Texts | |
| JVnSegmenter (2006) | 97.06 | Vietnamese Word Segmentation with CRFs and SVMs: An Investigation | |
| DongDu (2012) | 96.90 | Ứng dụng phương pháp Pointwise vào bài toán tách từ cho tiếng Việt (Applying the Pointwise Method to Vietnamese Word Segmentation) | |
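
Word segmentation F1 is typically computed over word boundaries: a predicted word counts as correct only if it covers exactly the same syllables as a gold word. The sketch below illustrates one such computation; it is not taken from any of the listed systems. The names `word_spans` and `segmentation_f1`, and the representation of a sentence as a list of words that are each a list of syllables, are assumptions made for illustration.

```python
from typing import List, Set, Tuple

# Hypothetical representation: a sentence is a list of words,
# and each word is a list of its syllables.
Sentence = List[List[str]]


def word_spans(sentence: Sentence) -> Set[Tuple[int, int]]:
    """Map a segmented sentence to the (start, end) syllable offsets of its words."""
    spans = set()
    start = 0
    for word in sentence:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_f1(gold: List[Sentence], pred: List[Sentence]) -> float:
    """Return word-level F1 (as a percentage) over the whole corpus."""
    true_pos = n_gold = n_pred = 0
    for gold_sent, pred_sent in zip(gold, pred):
        gold_spans = word_spans(gold_sent)
        pred_spans = word_spans(pred_sent)
        true_pos += len(gold_spans & pred_spans)  # words with identical boundaries
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    if true_pos == 0:
        return 0.0
    precision = true_pos / n_pred
    recall = true_pos / n_gold
    return 100.0 * 2 * precision * recall / (precision + recall)


# Toy example: the same five syllables segmented two different ways.
gold = [[["học", "sinh"], ["học"], ["sinh", "học"]]]
pred = [[["học", "sinh"], ["học", "sinh"], ["học"]]]
print(f"F1 = {segmentation_f1(gold, pred):.2f}")
```

Here only the first word matches in both segmentations, so precision and recall are each one third. Real evaluations may apply additional normalization, so this sketch should be read as an illustration of the metric rather than a replacement for the official scoring tools.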