Word Segmentation
Language processing applications such as machine translation, language analysis, language understanding, and information retrieval must first identify the words in a text before they can process it. Chinese sentences contain no delimiters, such as spaces, to separate words. A typical word segmentation system therefore matches the sentence against a lexicon to find its possible word compositions, and this matching produces segmentation ambiguities. Most Chinese word segmentation systems concentrate on resolving these ambiguities rather than on identifying unknown words, even though unknown words make up 3% to 5% of all the words in an article. Unknown word identification is therefore an important issue for a word segmentation algorithm. High-frequency keywords are easier to extract and identify offline, while low-frequency keywords must be extracted on the fly using morphological rules, morphemes, and word collocations.
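To make the ambiguity concrete, the following minimal sketch enumerates every way a short string can be split into lexicon words. The toy lexicon and the function name `all_segmentations` are invented for illustration and are not part of the CKIP system.

```python
from typing import List, Set


def all_segmentations(text: str, lexicon: Set[str]) -> List[List[str]]:
    """Return every way to split `text` into words found in `lexicon`."""
    if not text:
        return [[]]  # the empty string has exactly one segmentation: no words
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            # Keep this word and segment the remainder recursively.
            for rest in all_segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results


# Hypothetical toy lexicon, chosen only to produce overlapping matches.
TOY_LEXICON = {"研究", "研究生", "生命", "命", "生", "的", "起源"}

if __name__ == "__main__":
    for seg in all_segmentations("研究生命的起源", TOY_LEXICON):
        print(" / ".join(seg))
```

With this lexicon the string has several competing segmentations, because 研究 and 研究生 both match at the start. A string containing a character sequence that the lexicon cannot cover yields no segmentation at all, which is the unknown word problem in miniature.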
Our system performs Chinese word segmentation with unknown word identification and part-of-speech (POS) tagging. It contains a 100,000-entry lexicon annotated with POS tags, word frequencies, POS tag bigram information, and other data. Segmentation is based on the lexicon and on morphological rules for quantifier words and reduplicated words. POS tags are assigned to both known and unknown words.
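One generic way to use lexicon word frequencies for disambiguation is to choose the segmentation with the highest probability under a unigram model. The sketch below illustrates that textbook technique only; it is an assumption for illustration, not a description of how the CKIP system actually scores candidates. The frequencies, the `max_len` parameter, and the name `best_segmentation` are all hypothetical.

```python
import math
from typing import Dict, List


def best_segmentation(text: str, freq: Dict[str, int], max_len: int = 4) -> List[str]:
    """Pick the split of `text` that maximizes the product of unigram word probabilities."""
    total = sum(freq.values())
    n = len(text)
    # best[i] = (log-probability, start index of the last word) for the prefix text[:i]
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            word = text[j:i]
            if word in freq and best[j][0] > -math.inf:
                score = best[j][0] + math.log(freq[word] / total)
                if score > best[i][0]:
                    best[i] = (score, j)
    if best[n][0] == -math.inf:
        # Some stretch of the text matches no lexicon entry (an unknown word).
        raise ValueError("text cannot be covered by the lexicon")
    # Follow the back-pointers from the end to recover the chosen words.
    words, i = [], n
    while i > 0:
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return words[::-1]


# Hypothetical frequency counts, for illustration only.
TOY_FREQ = {"研究": 50, "研究生": 10, "生命": 40, "命": 2, "生": 3, "的": 200, "起源": 8}

if __name__ == "__main__":
    print(best_segmentation("研究生命的起源", TOY_FREQ))
```

A lexicon-only scorer like this one fails outright on text it cannot cover, which is why a practical system needs a separate unknown word identification step such as the one described above.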
Research Results
Our word segmentation system ranked first in the traditional Chinese word segmentation evaluation at the First International Chinese Word Segmentation Bakeoff, held by ACL SIGHAN. It is the first word segmentation system with out-of-vocabulary word identification and syntactic category prediction capabilities.
Online Demos
CKIP CoreNLP
CKIP CoreNLP provides a set of human language technology tools: word segmentation, sentence parsing, named-entity recognition, and coreference detection.
CKIP Transformers
An open-source word segmentation, POS tagging, and named-entity recognition system using transformer models.
CKIP Tagger
A new open-source word segmentation, POS tagging, and named-entity recognition system.
Resources
Publications
- Yu-Fang Tsai, Keh-Jiann Chen. “Reliable and Cost-Effective PoS-Tagging”. IJCLCLP, Vol. 9, No. 1, pp. 83–96, Feb 2004.
- Yu-Fang Tsai, Keh-Jiann Chen. “Context-rule Model for PoS Tagging”. PACLIC, Oct 2003.
- Yu-Fang Tsai, Keh-Jiann Chen. “Reliable and Cost-Effective PoS-Tagging”. ROCLING, Sep 2003.
- Wei-Yun Ma, Keh-Jiann Chen. “Introduction to CKIP Chinese Word Segmentation System for the First International Chinese Word Segmentation Bakeoff”. SIGHAN, Jul 2003.
- Wei-Yun Ma, Keh-Jiann Chen. “A Bottom-Up Merging Algorithm for Chinese Unknown Word Extraction”. SIGHAN, Jul 2003.
- Keh-Jiann Chen, Wei-Yun Ma. “Unknown Word Extraction for Chinese Documents”. COLING, Aug 2002.
- Keh-Jiann Chen, Ming-Hong Bai. “Unknown Word Detection for Chinese by a Corpus-based Learning Method”. IJCLCLP, Vol. 3, No. 1, pp. 27–44, Feb 1998.
Researchers and Developers
馬偉雲、劉興寰、蔡瑜方、戴嘉宏、白明弘、范嘉仁、謝佑明、李朋軒、楊慕