Chinese Word Segmentation
Task
Chinese word segmentation is the task of splitting Chinese text (a sequence of Chinese characters) into words.
Example:
'上海浦东开发与建设同步' → ['上海', '浦东', '开发', '与', '建设', '同步']
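A segmentation can be written as one boundary label per character, which is how sequence-labeling systems commonly frame the task. The sketch below assumes the B/M/E/S tag scheme (begin, middle, end, single-character word); the scheme and the helper names `words_to_tags` and `tags_to_words` are illustrative assumptions, not something prescribed by this page or by any listed system.

```python
def words_to_tags(words):
    """Map a list of words to one B/M/E/S tag per character (assumed scheme)."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character, any middle characters, last character
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild the word list from characters and their B/M/E/S tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at this character
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


words = ['上海', '浦东', '开发', '与', '建设', '同步']
tags = words_to_tags(words)
print("".join(tags))  # BEBEBESBEBE
print(tags_to_words("".join(words), tags) == words)  # True
```

The round trip shows that the character sequence plus the tags carries the same information as the word list.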
Systems
♠ marks systems that use character unigrams as input. ♣ marks systems that use character bigrams as input.
- Tian et al. (2020): ZEN + key-value memory networks ♠
- Huang et al. (2019): BERT + model compression + multi-criteria learning ♠
- Yang et al. (2018): Lattice LSTM-CRF + BPE subword embeddings ♠♣
- Ma et al. (2018): BiLSTM-CRF + hyperparameter search ♠♣
- Yang et al. (2017): Transition-based + beam search + rich pretraining ♠♣
- Zhou et al. (2017): Greedy search + word context ♠
- Chen et al. (2017): BiLSTM-CRF + adversarial loss ♠♣
- Cai et al. (2017): Greedy search + span representation ♠
- Kurita et al. (2017): Transition-based + joint model ♠
- Liu et al. (2016): Neural semi-CRF ♠
- Cai and Zhao (2016): Greedy search ♠
- Chen et al. (2015a): Gated Recursive NN ♠♣
- Chen et al. (2015b): BiLSTM-CRF ♠♣
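To make the unigram and bigram markers concrete, here is a minimal sketch of how the two input views can be read off a raw sentence. The function names are hypothetical, and real systems differ in details such as boundary padding and how each unit is mapped to an embedding.

```python
def char_unigrams(text):
    """One input unit per character."""
    return list(text)


def char_bigrams(text):
    """One input unit per pair of adjacent characters (no boundary padding)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


sentence = "上海浦东"
print(char_unigrams(sentence))  # ['上', '海', '浦', '东']
print(char_bigrams(sentence))   # ['上海', '海浦', '浦东']
```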
Evaluation
Metrics
F1-score
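As a rough illustration, word-level F1 can be computed by comparing the character spans of predicted words against those of gold words: a predicted word counts as correct only if both of its boundaries match. The sketch below assumes this span-matching definition and uses hypothetical helper names; the official scoring scripts for each dataset may differ in details.

```python
def to_spans(words):
    """Convert a word list into a set of (start, end) character offsets."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_f1(gold_words, pred_words):
    """Return (precision, recall, F1) over exactly matching word spans."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0, 0.0, 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = ['上海', '浦东', '开发', '与', '建设', '同步']
pred = ['上海', '浦东', '开发与', '建设', '同步']  # merges '开发' and '与'
p, r, f = segmentation_f1(gold, pred)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.8 0.667 0.727
```

Here 4 of the 5 predicted words match 4 of the 6 gold words, so a single wrong merge costs both precision and recall.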
Dataset
Chinese Treebank 6
Model | F1 | Paper / Source | Code |
---|---|---|---|
Huang et al. (2019) | 97.6 | Toward Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning | |
Tian et al. (2020) | 97.3 | Improving Chinese Word Segmentation with Wordhood Memory Networks | Github |
Ma et al. (2018) | 96.7 | State-of-the-art Chinese Word Segmentation with Bi-LSTMs | |
Yang et al. (2018) | 96.3 | Subword Encoding in Lattice LSTM for Chinese Word Segmentation | Github |
Yang et al. (2017) | 96.2 | Neural Word Segmentation with Rich Pretraining | Github |
Zhou et al. (2017) | 96.2 | Word-Context Character Embeddings for Chinese Word Segmentation | |
Chen et al. (2017) | 96.2 | Adversarial Multi-Criteria Learning for Chinese Word Segmentation | Github |
Liu et al. (2016) | 95.5 | Exploring Segment Representations for Neural Segmentation Models | Github |
Chen et al. (2015b) | 96.0 | Long Short-Term Memory Neural Networks for Chinese Word Segmentation | Github |
Chinese Treebank 7
Model | F1 | Paper / Source | Code |
---|---|---|---|
Ma et al. (2018) | 96.6 | State-of-the-art Chinese Word Segmentation with Bi-LSTMs | |
Kurita et al. (2017) | 96.2 | Neural Joint Model for Transition-based Chinese Syntactic Analysis | |
AS
Model | F1 | Paper / Source | Code |
---|---|---|---|
Tian et al. (2020) | 96.6 | Improving Chinese Word Segmentation with Wordhood Memory Networks | Github |
Huang et al. (2019) | 96.6 | Toward Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning | |
Ma et al. (2018) | 96.2 | State-of-the-art Chinese Word Segmentation with Bi-LSTMs | |
Yang et al. (2017) | 95.7 | Neural Word Segmentation with Rich Pretraining | Github |
Cai et al. (2017) | 95.3 | Fast and Accurate Neural Word Segmentation for Chinese | Github |
Chen et al. (2017) | 94.8 | Adversarial Multi-Criteria Learning for Chinese Word Segmentation | Github |
CityU
Model | F1 | Paper / Source | Code |
---|---|---|---|
Tian et al. (2020) | 97.9 | Improving Chinese Word Segmentation with Wordhood Memory Networks | Github |
Huang et al. (2019) | 97.6 | Toward Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning | |
Ma et al. (2018) | 97.2 | State-of-the-art Chinese Word Segmentation with Bi-LSTMs | |
Yang et al. (2017) | 96.9 | Neural Word Segmentation with Rich Pretraining | Github |
Cai et al. (2017) | 95.6 | Fast and Accurate Neural Word Segmentation for Chinese | Github |
Chen et al. (2017) | 95.6 | Adversarial Multi-Criteria Learning for Chinese Word Segmentation | Github |
PKU
Model | F1 | Paper / Source | Code |
---|---|---|---|
Huang et al. (2019) | 96.6 | Toward Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning | |
Tian et al. (2020) | 96.5 | Improving Chinese Word Segmentation with Wordhood Memory Networks | Github |
Yang et al. (2017) | 96.3 | Neural Word Segmentation with Rich Pretraining | Github |
Ma et al. (2018) | 96.1 | State-of-the-art Chinese Word Segmentation with Bi-LSTMs | |
Yang et al. (2018) | 95.9 | Subword Encoding in Lattice LSTM for Chinese Word Segmentation | Github |
Cai et al. (2017) | 95.8 | Fast and Accurate Neural Word Segmentation for Chinese | Github |
Chen et al. (2017) | 94.3 | Adversarial Multi-Criteria Learning for Chinese Word Segmentation | Github |
Liu et al. (2016) | 95.7 | Exploring Segment Representations for Neural Segmentation Models | Github |
Cai and Zhao (2016) | 95.7 | Neural Word Segmentation Learning for Chinese | Github |
MSR
Model | F1 | Paper / Source | Code |
---|---|---|---|
Tian et al. (2020) | 98.4 | Improving Chinese Word Segmentation with Wordhood Memory Networks | Github |
Ma et al. (2018) | 98.1 | State-of-the-art Chinese Word Segmentation with Bi-LSTMs | |
Huang et al. (2019) | 97.9 | Toward Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning | |
Yang et al. (2018) | 97.8 | Subword Encoding in Lattice LSTM for Chinese Word Segmentation | Github |
Yang et al. (2017) | 97.5 | Neural Word Segmentation with Rich Pretraining | Github |
Cai et al. (2017) | 97.1 | Fast and Accurate Neural Word Segmentation for Chinese | Github |
Chen et al. (2017) | 96.0 | Adversarial Multi-Criteria Learning for Chinese Word Segmentation | Github |
Liu et al. (2016) | 97.6 | Exploring Segment Representations for Neural Segmentation Models | Github |
Cai and Zhao (2016) | 96.4 | Neural Word Segmentation Learning for Chinese | Github |