Word Segmentation

Language processing applications such as machine translation, language analysis, language understanding, and information retrieval must identify the words in a text before they can process it. A Chinese sentence contains no delimiters, such as spaces, to separate words. A typical word segmentation system therefore looks up the possible word compositions of a sentence in a lexicon, and this often yields more than one candidate segmentation, that is, segmentation ambiguity. Most Chinese word segmentation systems concentrate on resolving this ambiguity rather than on identifying unknown words, even though unknown words make up 3% to 5% of all the words in an article. Unknown word identification is therefore an important issue for a word segmentation algorithm. High-frequency keywords are easier to extract and identify offline, while low-frequency keywords must be extracted on the fly using morphological rules, morphemes, and word collocations.
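To make the ambiguity concrete, here is a minimal Python sketch that enumerates every way a sentence can be split into lexicon words. The toy lexicon, the function name `all_segmentations`, and the `max_word_len` parameter are assumptions made for illustration; this is not the CKIP implementation.

```python
from functools import lru_cache

# Toy lexicon (hypothetical entries chosen only to illustrate ambiguity).
LEXICON = {"研究", "研究生", "生命", "生", "命", "的", "起源"}


def all_segmentations(sentence, lexicon, max_word_len=4):
    """Return every split of `sentence` whose pieces are all lexicon words."""

    @lru_cache(maxsize=None)
    def split_from(start):
        # Reaching the end of the sentence yields one empty segmentation.
        if start == len(sentence):
            return [[]]
        results = []
        last_end = min(len(sentence), start + max_word_len)
        for end in range(start + 1, last_end + 1):
            word = sentence[start:end]
            if word in lexicon:
                # Extend each segmentation of the remainder with this word.
                for rest in split_from(end):
                    results.append([word] + rest)
        return results

    return split_from(0)


candidates = all_segmentations("研究生命的起源", LEXICON)
for words in candidates:
    print(" / ".join(words))
```

With this toy lexicon the sentence has several competing segmentations, including 研究 / 生命 / 的 / 起源 and 研究生 / 命 / 的 / 起源. A character sequence with no lexicon match at all produces no candidates, which is where unknown word identification becomes necessary.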

Our system performs Chinese word segmentation with unknown word identification and part-of-speech (POS) tagging. It includes a 100,000-entry lexicon with POS tags, word frequencies, POS tag bigram information, etc. Word segmentation is based on the lexicon and on morphological rules for quantifier words and reduplicated words. POS tagging is applied to both known and unknown words.
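As a rough illustration of how lexicon word frequencies can help choose among candidate segmentations, the sketch below picks the split with the highest total log-probability under a unigram model, using dynamic programming. This is a generic textbook technique swapped in for illustration, not the system's actual algorithm; the counts in `WORD_COUNTS`, the single-character fallback for out-of-lexicon characters, and the function name `best_segmentation` are all assumptions.

```python
import math

# Hypothetical word counts; a real lexicon would hold corpus-derived frequencies.
WORD_COUNTS = {
    "研究": 50, "研究生": 10, "生命": 40, "生": 3, "命": 2, "的": 500, "起源": 15,
}


def best_segmentation(sentence, counts=WORD_COUNTS, max_word_len=4):
    """Return the segmentation with the highest unigram log-probability."""
    total_count = sum(counts.values())
    # Assumed penalty for a single character that is missing from the lexicon.
    unknown_log_prob = math.log(0.5 / total_count)

    n = len(sentence)
    # best[i] = (best score for sentence[:i], start index of its last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = sentence[start:end]
            if word in counts:
                score = math.log(counts[word] / total_count)
            elif end - start == 1:
                score = unknown_log_prob
            else:
                continue  # multi-character strings must be in the lexicon
            candidate = best[start][0] + score
            if candidate > best[end][0]:
                best[end] = (candidate, start)

    # Walk the back-pointers from the end of the sentence to recover the words.
    words = []
    pos = n
    while pos > 0:
        start = best[pos][1]
        words.append(sentence[start:pos])
        pos = start
    return words[::-1]


print(best_segmentation("研究生命的起源"))
```

The single-character fallback only keeps the search from failing on out-of-lexicon text. It does not identify unknown words, which in the system described here is handled separately with morphological rules and other evidence.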

Research Results

Our word segmentation system ranked first in the Traditional Chinese word segmentation evaluation at the First International Chinese Word Segmentation Bakeoff, held by ACL SIGHAN. It is the first word segmentation system with out-of-vocabulary word identification and syntactic category prediction capabilities.

Online Demos

CKIP CoreNLP

CKIP CoreNLP provides a set of human language technology tools: word segmentation, sentence parsing, named-entity recognition, and coreference detection.

Demo
CKIP Transformers

An open-source word segmentation, POS tagging, and named-entity recognition system built on transformer models.

Demo
CKIP Tagger

A new open-source word segmentation, POS tagging, and named-entity recognition system.

Demo
Chinese Word Segmentation (Old)

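Enter an article (the simplest way is to copy a news story), and the system will extract unknown words and then segment and tag the text, unknown words included.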

Demo

Resources

Publications

References

Researchers and Developers

馬偉雲、劉興寰、蔡瑜方、戴嘉宏、白明弘、范嘉仁、謝佑明、李朋軒、楊慕