
NLP-progress

Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.

Vietnamese NLP tasks

Dependency parsing

VnDT v1.1:

|  | Model | LAS | UAS | Paper | Code |
| --- | --- | --- | --- | --- | --- |
| Predicted POS | PhoNLP (2021) | 79.11 | 85.47 | PhoNLP: A joint multi-task learning model for Vietnamese part-of-speech tagging, named entity recognition and dependency parsing | Official |
| Predicted POS | PhoBERT-base (2020) | 78.77 | 85.22 | PhoBERT: Pre-trained language models for Vietnamese | Official |
| Predicted POS | PhoBERT-large (2020) | 77.85 | 84.32 | PhoBERT: Pre-trained language models for Vietnamese | Official |
| Predicted POS | Biaffine (2017) | 74.99 | 81.19 | Deep Biaffine Attention for Neural Dependency Parsing | |
| Predicted POS | jointWPD (2018) | 73.90 | 80.12 | A neural joint model for Vietnamese word segmentation, POS tagging and dependency parsing | |
| Predicted POS | jPTDP-v2 (2018) | 73.12 | 79.63 | An improved neural network model for joint POS tagging and dependency parsing | |
| Predicted POS | VnCoreNLP (2018) | 71.38 | 77.35 | VnCoreNLP: A Vietnamese Natural Language Processing Toolkit | Official |

VnDT v1.0:

|  | Model | LAS | UAS | Paper | Code |
| --- | --- | --- | --- | --- | --- |
| Predicted POS | VnCoreNLP (2018) | 70.23 | 76.93 | VnCoreNLP: A Vietnamese Natural Language Processing Toolkit | Official |
| Gold POS | VnCoreNLP (2018) | 73.39 | 79.02 | VnCoreNLP: A Vietnamese Natural Language Processing Toolkit | Official |
| Gold POS | BIST BiLSTM graph-based parser (2016) | 73.17 | 79.39 | Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations | Official |
| Gold POS | BIST BiLSTM transition-based parser (2016) | 72.53 | 79.33 | Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations | Official |
| Gold POS | MSTparser (2006) | 70.29 | 76.47 | Online large-margin training of dependency parsers | |
| Gold POS | MaltParser (2007) | 69.10 | 74.91 | MaltParser: A language-independent system for data-driven dependency parsing | |
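
For reference, UAS (unlabeled attachment score) counts a token as correct when its predicted head matches the gold head, and LAS (labeled attachment score) additionally requires the dependency label to match. A minimal sketch of both metrics; the data layout is illustrative, and real evaluation scripts add dataset-specific conventions such as how punctuation is counted:

```python
def attachment_scores(gold, pred):
    """Compute UAS and LAS (as percentages) over parallel parsed sentences.

    gold, pred: lists of sentences; each sentence is a list of
    (head_index, dependency_label) tuples, one tuple per token.
    """
    total = uas_hits = las_hits = 0
    for gold_sent, pred_sent in zip(gold, pred):
        for (g_head, g_label), (p_head, p_label) in zip(gold_sent, pred_sent):
            total += 1
            if g_head == p_head:
                uas_hits += 1
                if g_label == p_label:
                    las_hits += 1
    return 100.0 * uas_hits / total, 100.0 * las_hits / total


# Toy example: a 3-token sentence where the last token gets the wrong head.
gold = [[(2, "nsubj"), (0, "root"), (2, "obj")]]
pred = [[(2, "nsubj"), (0, "root"), (1, "obj")]]
print(attachment_scores(gold, pred))  # (66.66..., 66.66...)
```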

Intent detection and slot filling

PhoATIS

| Model | Intent Acc. | Slot F1 | Sentence Acc. | Paper | Code | Note |
| --- | --- | --- | --- | --- | --- | --- |
| JointIDSF (2021) | 97.62 | 94.98 | 86.25 | Intent Detection and Slot Filling for Vietnamese | Official | Texts are automatically word-segmented using RDRSegmenter |
| JointBERT (2019) with PhoBERT encoder | 97.40 | 94.75 | 85.55 | Intent Detection and Slot Filling for Vietnamese | Official | Texts are automatically word-segmented using RDRSegmenter |
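
Intent accuracy is per-utterance classification accuracy, slot F1 is chunk-level F1 over the predicted slot spans, and sentence accuracy is the fraction of utterances where both the intent and all slots are correct. A minimal sketch of those three numbers, assuming BIO slot tags and using seqeval for the chunk-level F1; the official evaluation scripts may differ in details:

```python
from seqeval.metrics import f1_score  # pip install seqeval

def joint_metrics(gold_intents, pred_intents, gold_slots, pred_slots):
    """gold_slots / pred_slots: one BIO tag sequence per utterance."""
    n = len(gold_intents)
    intent_acc = sum(g == p for g, p in zip(gold_intents, pred_intents)) / n
    slot_f1 = f1_score(gold_slots, pred_slots)
    sentence_acc = sum(
        gi == pi and gs == ps
        for gi, pi, gs, ps in zip(gold_intents, pred_intents, gold_slots, pred_slots)
    ) / n
    return intent_acc, slot_f1, sentence_acc
```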

Machine translation

PhoMT Dataset

| Model | EN-VI (BLEU) | VI-EN (BLEU) | Paper | Code |
| --- | --- | --- | --- | --- |
| mBART (2020) | 43.46 | 39.78 | Multilingual Denoising Pre-training for Neural Machine Translation | Link |
| Transformer-big (2017) | 42.94 | 37.83 | Attention is all you need | Link |
| Transformer-base (2017) | 42.12 | 37.19 | Attention is all you need | Link |
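
BLEU scores such as these are normally computed with a standard corpus-level tool; tokenization and casing choices affect comparability across rows (some IWSLT numbers below are reported uncased, i.e. after lowercasing). A minimal sketch using sacrebleu, which is an assumption here since the papers above use their own evaluation setups:

```python
import sacrebleu  # pip install sacrebleu

# Toy system outputs and references (one reference set, aligned by sentence).
hypotheses = ["Tôi thích dịch máy .", "Mô hình này hoạt động tốt ."]
references = [["Tôi thích dịch máy .", "Mô hình này chạy tốt ."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```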

IWSLT2015 Dataset

English-to-Vietnamese

tst2015 is used as the test set.

| Model | BLEU | Paper | Code |
| --- | --- | --- | --- |
| Stanford (2015) | 26.4 | Stanford Neural Machine Translation Systems for Spoken Language Domains | |

tst2013 is used as the test set.

| Model | BLEU | Paper | Code |
| --- | --- | --- | --- |
| Nguyen and Salazar (2019) | 32.8 | Transformers without Tears: Improving the Normalization of Self-Attention | Official |
| Provilkov et al. (2019) | 33.27 (uncased) | BPE-Dropout: Simple and Effective Subword Regularization | |
| Xu et al. (2019) | 31.4 | Understanding and Improving Layer Normalization | Official |
| CVT (2018) | 29.6 (SST) | Semi-Supervised Sequence Modeling with Cross-View Training | |
| ELMo (2018) | 29.3 (SST) | Deep contextualized word representations | |
| Transformer (2017) | 28.9 | Attention is all you need | Link |
| Kudo (2018) | 28.5 | Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates | |
| Google (2017) | 26.1 | Neural machine translation (seq2seq) tutorial | Official |
| Stanford (2015) | 23.3 | Stanford Neural Machine Translation Systems for Spoken Language Domains | |

Vietnamese-to-English

tst2013 is used as the test set.

| Model | BLEU | Paper | Code |
| --- | --- | --- | --- |
| Provilkov et al. (2019) | 32.99 (uncased) | BPE-Dropout: Simple and Effective Subword Regularization | |
| Kudo (2018) | 26.31 | Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates | |

Named entity recognition

PhoNER_COVID19

| Model | F1 | Paper | Code | Note |
| --- | --- | --- | --- | --- |
| PhoBERT-large (2020) | 94.5 | PhoBERT: Pre-trained language models for Vietnamese | Official | |
| PhoBERT-base (2020) | 94.2 | PhoBERT: Pre-trained language models for Vietnamese | Official | |
| XLM-R-large (2019) | 93.8 | Unsupervised Cross-lingual Representation Learning at Scale | Official | |
| XLM-R-base (2019) | 92.5 | Unsupervised Cross-lingual Representation Learning at Scale | Official | |
| BiLSTM-CRF + CNN-char (2016) + Word Segmentation | 91 | End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF | Link | Texts are automatically word-segmented using RDRSegmenter |
| BiLSTM-CRF + CNN-char (2016) | 90.6 | End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF | Link | No word segmentation |
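
The F1 reported for NER is entity-level: a predicted entity counts only if both its span and its type exactly match a gold entity. A from-scratch sketch of that metric over BIO-tagged sequences; standard tools such as seqeval implement the same idea with more careful handling of malformed tag sequences:

```python
def bio_spans(tags):
    """Extract (entity_type, start, end) spans from one BIO tag sequence."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        starts_new = tag.startswith("B-")
        continues = tag.startswith("I-") and tag[2:] == etype
        if not continues:  # the current span (if any) ends here
            if start is not None:
                spans.append((etype, start, i))
            start, etype = (i, tag[2:]) if starts_new else (None, None)
    return spans

def entity_f1(gold_seqs, pred_seqs):
    """Micro-averaged entity-level F1 over parallel lists of BIO sequences."""
    gold = {(i, s) for i, seq in enumerate(gold_seqs) for s in bio_spans(seq)}
    pred = {(i, s) for i, seq in enumerate(pred_seqs) for s in bio_spans(seq)}
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0

gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "B-ORG"]]
print(entity_f1(gold, pred))  # 0.5: the PER span matches, the LOC/ORG span does not
```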

VLSP

| Model | F1 | Paper | Code | Note |
| --- | --- | --- | --- | --- |
| PhoBERT-large (2020) | 94.7 | PhoBERT: Pre-trained language models for Vietnamese | Official | |
| PhoNLP (2021) | 94.41 | PhoNLP: A joint multi-task learning model for Vietnamese part-of-speech tagging, named entity recognition and dependency parsing | Official | |
| vELECTRA (2020) | 94.07 | Improving Sequence Tagging for Vietnamese Text Using Transformer-based Neural Models | Official | |
| PhoBERT-base (2020) | 93.6 | PhoBERT: Pre-trained language models for Vietnamese | Official | |
| VnCoreNLP (2018) [1] | 91.30 | VnCoreNLP: A Vietnamese Natural Language Processing Toolkit | Official | Used ETNLP embeddings |
| BiLSTM-CRF + CNN-char (2016) [1] | 91.09 | End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF | Official / Link | Used ETNLP embeddings |
| VNER (2019) | 89.58 | Attentive Neural Network for Named Entity Recognition in Vietnamese | | |
| VnCoreNLP (2018) | 88.55 | VnCoreNLP: A Vietnamese Natural Language Processing Toolkit | Official | Pre-trained embeddings learned from the Baomoi corpus |
| BiLSTM-CRF + CNN-char (2016) [2] | 88.28 | End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF | Official / Link | Pre-trained embeddings learned from the Baomoi corpus |
| BiLSTM-CRF + LSTM-char (2016) [2] | 87.71 | Neural Architectures for Named Entity Recognition | Link | Pre-trained embeddings learned from the Baomoi corpus |
| BiLSTM-CRF (2015) [2] | 86.48 | Bidirectional LSTM-CRF Models for Sequence Tagging | Link | Pre-trained embeddings learned from the Baomoi corpus |
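
Several of the strongest rows above fine-tune PhoBERT with a token-classification head. A minimal Hugging Face transformers sketch of that setup; the checkpoint name vinai/phobert-base and the label set are assumptions, the input must be word-segmented (e.g. with RDRSegmenter) as PhoBERT expects, and the original papers use their own training code:

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

MODEL_NAME = "vinai/phobert-base"  # assumed checkpoint name
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG", "B-MISC", "I-MISC"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, num_labels=len(labels))

# A word-segmented sentence: syllables of a multi-syllable word are joined by "_".
text = "Ông Nguyễn_Văn_A đang làm_việc tại Hà_Nội ."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, num_subword_tokens, num_labels)
pred_ids = logits.argmax(dim=-1)[0].tolist()
print([labels[i] for i in pred_ids])  # random until the classification head is fine-tuned
```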

Part-of-speech tagging

| Model | Accuracy | Paper | Code |
| --- | --- | --- | --- |
| PhoBERT-large (2020) | 96.8 | PhoBERT: Pre-trained language models for Vietnamese | Official |
| vELECTRA (2020) | 96.77 | Improving Sequence Tagging for Vietnamese Text Using Transformer-based Neural Models | Official |
| PhoNLP (2021) | 96.76 | PhoNLP: A joint multi-task learning model for Vietnamese part-of-speech tagging, named entity recognition and dependency parsing | Official |
| PhoBERT-base (2020) | 96.7 | PhoBERT: Pre-trained language models for Vietnamese | Official |
| jointWPD (2018) | 95.97 | A neural joint model for Vietnamese word segmentation, POS tagging and dependency parsing | |
| VnCoreNLP-VnMarMoT (2017) | 95.88 | From Word Segmentation to POS Tagging for Vietnamese | Official |
| jPTDP-v2 (2018) | 95.70 | An improved neural network model for joint POS tagging and dependency parsing | |
| BiLSTM-CRF + CNN-char (2016) | 95.40 | End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF | Official / Link |
| BiLSTM-CRF + LSTM-char (2016) | 95.31 | Neural Architectures for Named Entity Recognition | Link |
| BiLSTM-CRF (2015) | 95.06 | Bidirectional LSTM-CRF Models for Sequence Tagging | Link |
| RDRPOSTagger (2014) | 95.11 | RDRPOSTagger: A Ripple Down Rules-based Part-Of-Speech Tagger | Official |
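
Accuracy here is per-token tagging accuracy on the test set. A trivial sketch, assuming the gold and predicted tag sequences are aligned token by token:

```python
def tagging_accuracy(gold_seqs, pred_seqs):
    """Per-token accuracy (%) over parallel lists of tag sequences."""
    pairs = [(g, p) for gs, ps in zip(gold_seqs, pred_seqs) for g, p in zip(gs, ps)]
    return 100.0 * sum(g == p for g, p in pairs) / len(pairs)

print(tagging_accuracy([["N", "V", "P"]], [["N", "V", "N"]]))  # 66.66...
```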

Semantic parsing

ViText2SQL

| Model | Exact Match Accuracy | Paper | Code | Note |
| --- | --- | --- | --- | --- |
| IRNet (2019) | 53.2 | A Pilot Study of Text-to-SQL Semantic Parsing for Vietnamese | Link | Using PhoBERT as encoder |
| EditSQL (2019) | 52.6 | A Pilot Study of Text-to-SQL Semantic Parsing for Vietnamese | Link | Using PhoBERT as encoder |
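
Exact match accuracy counts a prediction as correct only when the generated SQL query matches the gold query. A rough string-level sketch; real text-to-SQL evaluation (e.g. Spider-style exact set match) compares query components rather than raw strings, so this simplified version is illustrative only:

```python
import re

def normalize_sql(query):
    """Very rough normalization: lowercase and collapse whitespace."""
    return re.sub(r"\s+", " ", query.strip().lower())

def exact_match_accuracy(gold_queries, pred_queries):
    hits = sum(normalize_sql(g) == normalize_sql(p)
               for g, p in zip(gold_queries, pred_queries))
    return 100.0 * hits / len(gold_queries)

gold = ["SELECT name FROM city WHERE population > 1000000"]
pred = ["select name  FROM city WHERE population > 1000000"]
print(exact_match_accuracy(gold, pred))  # 100.0
```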

Word segmentation

| Model | F1 | Paper | Code |
| --- | --- | --- | --- |
| UITws-v1 (2019) | 98.06 | Vietnamese Word Segmentation with SVM: Ambiguity Reduction and Suffix Capture | Official |
| VnCoreNLP-RDRsegmenter (2018) | 97.90 | A Fast and Accurate Vietnamese Word Segmenter | Official |
| UETsegmenter (2016) | 97.87 | A hybrid approach to Vietnamese word segmentation | Official |
| jointWPD (2018) | 97.81 | A neural joint model for Vietnamese word segmentation, POS tagging and dependency parsing | |
| vnTokenizer (2008) | 97.33 | A Hybrid Approach to Word Segmentation of Vietnamese Texts | |
| JVnSegmenter (2006) | 97.06 | Vietnamese Word Segmentation with CRFs and SVMs: An Investigation | |
| DongDu (2012) | 96.90 | Applying the Pointwise method to Vietnamese word segmentation (original title: Ứng dụng phương pháp Pointwise vào bài toán tách từ cho tiếng Việt) | |
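
Word segmentation F1 is computed over word spans: a predicted word is correct only if its span of syllables exactly matches a gold word. A minimal sketch, assuming both segmentations are given as words (syllables joined by underscores) over the same underlying syllable sequence:

```python
def word_spans(words):
    """Map words (syllables joined by '_') to (start, end) syllable-index spans."""
    spans, pos = set(), 0
    for w in words:
        n = len(w.split("_"))
        spans.add((pos, pos + n))
        pos += n
    return spans

def segmentation_f1(gold_words, pred_words):
    gold, pred = word_spans(gold_words), word_spans(pred_words)
    tp = len(gold & pred)
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall) if tp else 0.0

# "học sinh học sinh học" segmented two different ways over the same 5 syllables.
gold = ["học_sinh", "học", "sinh_học"]
pred = ["học_sinh", "học_sinh", "học"]
print(round(segmentation_f1(gold, pred), 3))  # 0.333
```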