Vietnamese NLP tasks
Dependency parsing
- Experiments employ the benchmark Vietnamese dependency treebank VnDT of 10K+ sentences, using 1,020 sentences for test, 200 sentences for development and the remaining sentences for training. LAS and UAS scores are computed on all tokens (i.e. including punctuation).
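As a rough illustration of how the two scores differ, the sketch below computes UAS and LAS over all tokens, punctuation included. The function name `attachment_scores`, the `(head, label)` tuple format and the toy labels are assumptions made for this example; this is not the evaluation script used by any of the listed papers.

```python
def attachment_scores(gold, pred):
    """Return (LAS, UAS) in percent over all tokens, punctuation included.

    gold, pred: lists of sentences; each sentence is a list of
    (head_index, dependency_label) tuples, one per token.
    """
    total = uas_hits = las_hits = 0
    for gold_sent, pred_sent in zip(gold, pred):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("gold and predicted sentences differ in length")
        for (g_head, g_label), (p_head, p_label) in zip(gold_sent, pred_sent):
            total += 1
            if g_head == p_head:
                # UAS only requires the correct head.
                uas_hits += 1
                # LAS additionally requires the correct dependency label.
                if g_label == p_label:
                    las_hits += 1
    if total == 0:
        raise ValueError("no tokens to score")
    return 100.0 * las_hits / total, 100.0 * uas_hits / total


# Toy example: 4 tokens; the last one is punctuation and is still counted.
gold = [[(2, "sub"), (0, "root"), (2, "dob"), (2, "punct")]]
pred = [[(2, "sub"), (0, "root"), (2, "vmod"), (3, "punct")]]
las, uas = attachment_scores(gold, pred)
```

In the toy example three of four heads are correct but only two of those also carry the correct label, so UAS is higher than LAS, as in every row of the tables below.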
VnDT v1.1:
POS tags | Model | LAS | UAS | Paper | Code |
---|---|---|---|---|---|
Predicted POS | PhoNLP (2021) | 79.11 | 85.47 | PhoNLP: A joint multi-task learning model for Vietnamese part-of-speech tagging, named entity recognition and dependency parsing | Official |
Predicted POS | PhoBERT-base (2020) | 78.77 | 85.22 | PhoBERT: Pre-trained language models for Vietnamese | Official |
Predicted POS | PhoBERT-large (2020) | 77.85 | 84.32 | PhoBERT: Pre-trained language models for Vietnamese | Official |
Predicted POS | Biaffine (2017) | 74.99 | 81.19 | Deep Biaffine Attention for Neural Dependency Parsing | |
Predicted POS | jointWPD (2018) | 73.90 | 80.12 | A neural joint model for Vietnamese word segmentation, POS tagging and dependency parsing | |
Predicted POS | jPTDP-v2 (2018) | 73.12 | 79.63 | An improved neural network model for joint POS tagging and dependency parsing | |
Predicted POS | VnCoreNLP (2018) | 71.38 | 77.35 | VnCoreNLP: A Vietnamese Natural Language Processing Toolkit | Official |
- Results on the VnDT v1.1 for Biaffine, jPTDP-v2 and VnCoreNLP are reported in the jointWPD paper “A neural joint model for Vietnamese word segmentation, POS tagging and dependency parsing.”
VnDT v1.0:
POS tags | Model | LAS | UAS | Paper | Code |
---|---|---|---|---|---|
Predicted POS | VnCoreNLP (2018) | 70.23 | 76.93 | VnCoreNLP: A Vietnamese Natural Language Processing Toolkit | Official |
Gold POS | VnCoreNLP (2018) | 73.39 | 79.02 | VnCoreNLP: A Vietnamese Natural Language Processing Toolkit | Official |
Gold POS | BIST BiLSTM graph-based parser (2016) | 73.17 | 79.39 | Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations | Official |
Gold POS | BIST BiLSTM transition-based parser (2016) | 72.53 | 79.33 | Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations | Official |
Gold POS | MSTparser (2006) | 70.29 | 76.47 | Online large-margin training of dependency parsers | |
Gold POS | MaltParser (2007) | 69.10 | 74.91 | MaltParser: A language-independent system for data-driven dependency parsing | |
- Results for the BIST graph/transition-based parsers, MSTparser and MaltParser are reported in “An empirical study for Vietnamese dependency parsing.”
Intent detection and Slot filling
PhoATIS
- The first dataset for intent detection and slot filling for Vietnamese, based on the common ATIS benchmark in the flight booking domain. Data is localized (e.g. replacing slot values with Vietnamese-specific entities) to fit the context of flight booking in Vietnam.
- Training set: 4478 sentences
- Development set: 500 sentences
- Test set: 893 sentences
Model | Intent Acc. | Slot F1 | Sentence Acc. | Paper | Code | Note |
---|---|---|---|---|---|---|
JointIDSF (2021) | 97.62 | 94.98 | 86.25 | Intent Detection and Slot Filling for Vietnamese | Official | Text is automatically word-segmented using RDRSegmenter |
JointBERT (2019) with PhoBERT encoder | 97.40 | 94.75 | 85.55 | Intent Detection and Slot Filling for Vietnamese | Official | Text is automatically word-segmented using RDRSegmenter |
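To make the relation between the metrics concrete, here is a minimal sketch of intent accuracy and sentence accuracy. It assumes the common convention that a sentence counts as correct only when its intent and every slot tag match the gold annotation; the function name `intent_and_sentence_accuracy` and the toy labels are hypothetical, and span-level slot F1 is left out for brevity.

```python
def intent_and_sentence_accuracy(gold, pred):
    """Return (intent accuracy, sentence accuracy) in percent.

    gold, pred: lists of (intent, slot_tags) pairs, one per utterance.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("need two non-empty lists of equal length")
    intent_hits = sentence_hits = 0
    for (g_intent, g_slots), (p_intent, p_slots) in zip(gold, pred):
        intent_ok = g_intent == p_intent
        intent_hits += intent_ok
        # A sentence counts only if the intent and all slot tags are right.
        sentence_hits += intent_ok and list(g_slots) == list(p_slots)
    n = len(gold)
    return 100.0 * intent_hits / n, 100.0 * sentence_hits / n


# Toy example: both intents are right, but one slot tag is missed.
gold = [
    ("flight", ["O", "B-fromloc.city_name", "O", "B-toloc.city_name"]),
    ("airfare", ["O", "B-toloc.city_name"]),
]
pred = [
    ("flight", ["O", "B-fromloc.city_name", "O", "O"]),
    ("airfare", ["O", "B-toloc.city_name"]),
]
intent_acc, sentence_acc = intent_and_sentence_accuracy(gold, pred)
```

Under this convention sentence accuracy can never exceed intent accuracy, which is consistent with the gap between the two columns in the table above.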
Machine translation
PhoMT Dataset
- A large-scale and high-quality dataset for Vietnamese-English Machine Translation with 3.02M sentence pairs, available at https://github.com/VinAIResearch/PhoMT.
- Consists of 6 domains: TED Talks, WikiHow, MediaWiki, OpenSubtitles, News and Blog.
- Training set: 2.9M sentence pairs
- Validation set: 18719 sentence pairs
- Test set: 19151 sentence pairs
Model | EN-VI (BLEU) | VI-EN (BLEU) | Paper | Code |
---|---|---|---|---|
mBART (2020) | 43.46 | 39.78 | Multilingual Denoising Pre-training for Neural Machine Translation | Link |
Transformer-big (2017) | 42.94 | 37.83 | Attention is all you need | Link |
Transformer-base (2017) | 42.12 | 37.19 | Attention is all you need | Link |
IWSLT2015 Dataset
- The dataset is from the IWSLT 2015 Evaluation Campaign and contains 150K sentence pairs. It can also be obtained from https://github.com/tensorflow/nmt.
English-to-Vietnamese
- tst2015 is used for test.
Model | BLEU | Paper | Code |
---|---|---|---|
Stanford (2015) | 26.4 | Stanford Neural Machine Translation Systems for Spoken Language Domains | |
- tst2013 is used for test.
Model | BLEU | Paper | Code |
---|---|---|---|
Nguyen and Salazar (2019) | 32.8 | Transformers without Tears: Improving the Normalization of Self-Attention | Official |
Provilkov et al. (2019) | 33.27 (uncased) | BPE-Dropout: Simple and Effective Subword Regularization | |
Xu et al. (2019) | 31.4 | Understanding and Improving Layer Normalization | Official |
CVT (2018) | 29.6 (SST) | Semi-Supervised Sequence Modeling with Cross-View Training | |
ELMo (2018) | 29.3 (SST) | Deep contextualized word representations | |
Transformer (2017) | 28.9 | Attention is all you need | Link |
Kudo (2018) | 28.5 | Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates | |
Google (2017) | 26.1 | Neural machine translation (seq2seq) tutorial | Official |
Stanford (2015) | 23.3 | Stanford Neural Machine Translation Systems for Spoken Language Domains | |
- The ELMo score is reported in Semi-Supervised Sequence Modeling with Cross-View Training. The Transformer score is available at https://github.com/duyvuleo/Transformer-DyNet.
Vietnamese-to-English
- tst2013 is used for test.
Model | BLEU | Paper | Code |
---|---|---|---|
Provilkov et al. (2019) | 32.99 (uncased) | BPE-Dropout: Simple and Effective Subword Regularization | |
Kudo (2018) | 26.31 | Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates | |
Named entity recognition
PhoNER_COVID19
- A named entity recognition dataset for Vietnamese with 10 newly defined entity types in the context of the COVID-19 pandemic. Data is extracted from news articles and manually annotated. In total, there are 34,984 entities over 10,027 sentences.
- Training set: 5027 sentences
- Development set: 2000 sentences
- Test set: 3000 sentences
Model | F1 | Paper | Code | Note |
---|---|---|---|---|
PhoBERT-large (2020) | 94.5 | PhoBERT: Pre-trained language models for Vietnamese | Official | |
PhoBERT-base (2020) | 94.2 | PhoBERT: Pre-trained language models for Vietnamese | Official | |
XLM-R-large (2019) | 93.8 | Unsupervised Cross-lingual Representation Learning at Scale | Official | |
XLM-R-base (2019) | 92.5 | Unsupervised Cross-lingual Representation Learning at Scale | Official | |
BiLSTM-CRF + CNN-char (2016) + Word Segmentation | 91 | End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF | Link | Text is automatically word-segmented using RDRSegmenter |
BiLSTM-CRF + CNN-char (2016) | 90.6 | End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF | Link | No word segmentation |
VLSP
- 16,861 sentences for training and development from the VLSP 2016 NER shared task:
- 14,861 sentences are used for training.
- 2k sentences are used for development.
- Test data: 2,831 test sentences from the VLSP 2016 NER shared task.
- NOTE that in the VLSP 2016 NER data, each word representing a full personal name is separated into the syllables that constitute the word. The VLSP 2016 NER data also includes gold POS and chunking tags, as reconfirmed by the VLSP 2016 organizers. This scheme results in an unrealistic scenario for a pipeline evaluation:
- The standard annotation for Vietnamese word segmentation and POS tagging treats each full name as a single word token, so all word segmenters are trained to output a full name as one word and all POS taggers are trained to assign a POS label to the entire full name.
- Gold POS and chunking tags are NOT available in a real-world application.
- For a realistic scenario, contiguous syllables constituting a full name are merged to form a word. POS/chunking tags, if used, have to be automatically predicted.
- [1] denotes that scores are reported in “ETNLP: a visual-aided systematic approach to select pre-trained embeddings for a downstream task”
- [2] denotes that BiLSTM-CRF-based scores are reported in “VnCoreNLP: A Vietnamese Natural Language Processing Toolkit”
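The merging step described in the note above (contiguous syllables of a full name joined into one word) might look like the following sketch. It assumes BIO-style tags with `B-PER`/`I-PER` for person names and underscore-joined word tokens, as is conventional for Vietnamese word segmentation output; the function name `merge_person_syllables` is hypothetical and the sentence is a made-up example.

```python
def merge_person_syllables(tokens, tags):
    """Merge contiguous B-PER/I-PER syllables into one underscore-joined word.

    Returns (merged_tokens, merged_tags); a merged name keeps the B-PER tag.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    merged_tokens, merged_tags = [], []
    for token, tag in zip(tokens, tags):
        # An I-PER syllable continues the name opened by the previous B-PER.
        if tag == "I-PER" and merged_tags and merged_tags[-1] == "B-PER":
            merged_tokens[-1] += "_" + token
        else:
            merged_tokens.append(token)
            merged_tags.append(tag)
    return merged_tokens, merged_tags


# Toy sentence: the three syllables of the name become a single word token.
tokens = ["Ông", "Nguyễn", "Văn", "An", "đến", "Hà_Nội"]
tags = ["O", "B-PER", "I-PER", "I-PER", "O", "B-LOC"]
words, labels = merge_person_syllables(tokens, tags)
```

After merging, the token sequence matches what a standard word segmenter would produce, so predicted POS or chunking tags can be attached one per word.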
Part-of-speech tagging
- 27,870 sentences for training and development from the VLSP 2013 POS tagging shared task:
- 27k sentences are used for training.
- 870 sentences are used for development.
- Test data: 2120 test sentences from the VLSP 2013 POS tagging shared task.
- Result for jPTDP-v2 is reported in “A neural joint model for Vietnamese word segmentation, POS tagging and dependency parsing.”
- Results for BiLSTM-CRF-based models and RDRPOSTagger are reported in “From Word Segmentation to POS Tagging for Vietnamese.”
Semantic parsing
ViText2SQL
- The first public large-scale Text-to-SQL semantic parsing dataset for Vietnamese, consisting of about 10K question and SQL query pairs.
- Training set: 6831 question and query pairs
- Development set: 954 question and query pairs
- Test set: 1906 question and query pairs
Model | Exact Match Accuracy | Paper | Code | Note |
---|---|---|---|---|
IRNet (2019) | 53.2 | A Pilot Study of Text-to-SQL Semantic Parsing for Vietnamese | Link | Using PhoBERT as encoder |
EditSQL (2019) | 52.6 | A Pilot Study of Text-to-SQL Semantic Parsing for Vietnamese | Link | Using PhoBERT as encoder |
Word segmentation
- Training & development data: 75k manually word-segmented training sentences from the VLSP 2013 word segmentation shared task.
- Test data: 2120 test sentences from the VLSP 2013 POS tagging shared task.
Model | F1 | Paper | Code |
---|---|---|---|
UITws-v1 (2019) | 98.06 | Vietnamese Word Segmentation with SVM: Ambiguity Reduction and Suffix Capture | Official |
VnCoreNLP-RDRsegmenter (2018) | 97.90 | A Fast and Accurate Vietnamese Word Segmenter | Official |
UETsegmenter (2016) | 97.87 | A hybrid approach to Vietnamese word segmentation | Official |
jointWPD (2018) | 97.81 | A neural joint model for Vietnamese word segmentation, POS tagging and dependency parsing | |
vnTokenizer (2008) | 97.33 | A Hybrid Approach to Word Segmentation of Vietnamese Texts | |
JVnSegmenter (2006) | 97.06 | Vietnamese Word Segmentation with CRFs and SVMs: An Investigation | |
DongDu (2012) | 96.90 | Ứng dụng phương pháp Pointwise vào bài toán tách từ cho tiếng Việt | |
- Results for vnTokenizer, JVnSegmenter and DongDu are reported in “A hybrid approach to Vietnamese word segmentation.”
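For reference, word segmentation F1 is typically computed over whole words rather than individual syllables. The sketch below assumes space-separated words whose syllables are joined by underscores; `word_spans` and `segmentation_f1` are hypothetical names, and this is not the official scoring script of the shared task.

```python
def word_spans(segmented):
    """Map a segmented sentence to a set of (start, end) syllable spans."""
    spans, start = set(), 0
    for word in segmented.split():
        length = len(word.split("_"))
        spans.add((start, start + length))
        start += length
    return spans


def segmentation_f1(gold_sents, pred_sents):
    """Micro-averaged word-level F1 (percent) over parallel sentences."""
    n_gold = n_pred = n_correct = 0
    for gold, pred in zip(gold_sents, pred_sents):
        g, p = word_spans(gold), word_spans(pred)
        n_gold += len(g)
        n_pred += len(p)
        # A predicted word is correct only if both boundaries match.
        n_correct += len(g & p)
    if n_correct == 0:
        return 0.0
    precision = n_correct / n_pred
    recall = n_correct / n_gold
    return 100.0 * 2 * precision * recall / (precision + recall)


# Toy example: the prediction wrongly splits "sinh_viên" into two words.
gold = ["Tôi là sinh_viên đại_học"]
pred = ["Tôi là sinh viên đại_học"]
f1 = segmentation_f1(gold, pred)
```

Here three of the five predicted words match the four gold words, giving a precision of 0.6 and a recall of 0.75.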