
Language processing applications, such as machine translation, language analysis, language understanding, and information retrieval, must identify the words in a text before the text can be processed. A Chinese sentence contains no delimiters, such as spaces, to separate words. A typical word segmentation system therefore looks for the possible word compositions of a sentence by matching it against a lexicon, which gives rise to word segmentation ambiguities. Most Chinese word segmentation systems concentrate on resolving these ambiguities rather than on identifying unknown words, even though unknown words make up 3% to 5% of all the words in an article. Unknown word identification is therefore an important issue for a word segmentation algorithm. High-frequency keywords are easier to extract and identify offline, while low-frequency keywords must be extracted on the fly using morphological rules, morphemes, and word collocations.
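As a rough illustration of how lexicon matching produces segmentation ambiguity, the sketch below enumerates every way to split a sentence into lexicon words. The function name `all_segmentations`, the toy lexicon, and the example sentences are assumptions made for illustration only; they are not taken from the CKIP system.

```python
def all_segmentations(sentence, lexicon):
    """Return every way to split `sentence` into words found in `lexicon`."""
    if not sentence:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(sentence) + 1):
        word = sentence[:end]
        if word in lexicon:
            # Keep this prefix as a word and segment the remainder recursively.
            for rest in all_segmentations(sentence[end:], lexicon):
                results.append([word] + rest)
    return results


# Hypothetical toy lexicon, invented for this example.
TOY_LEXICON = {"研究", "研究生", "生命", "命", "起源"}

# Two competing analyses of the same string: a segmentation ambiguity.
ambiguous = all_segmentations("研究生命起源", TOY_LEXICON)

# "火星" is not in the toy lexicon, so no complete segmentation exists:
# pure lexicon matching cannot handle an unknown word.
with_unknown = all_segmentations("研究火星", TOY_LEXICON)

for words in ambiguous:
    print(" / ".join(words))
```

The first call yields two candidate segmentations, which a real system must choose between. The second yields none, which is why unknown word identification needs a separate mechanism.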

Our system is a Chinese word segmentation system with unknown word identification and part-of-speech (POS) tagging. It contains a 100,000-entry lexicon with POS tags, word frequencies, POS tag bigram information, etc. The word segmentation process is based on the lexicon and on morphological rules for quantifier words and reduplicated words. POS tagging covers both known and unknown words.
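One way word frequencies and a morphological rule might be combined is sketched below: a dynamic program picks the segmentation with the highest product of unigram probabilities, and a simple pattern check admits reduplicated words that are missing from the lexicon. This is a minimal sketch under stated assumptions, not the CKIP implementation. The names `best_segmentation` and `is_reduplicated`, the frequency counts, the AA/AABB patterns, and the low prior given to rule-generated words are all hypothetical. POS tags and POS tag bigrams are left out for brevity.

```python
import math

# Hypothetical toy lexicon: word -> frequency count (all values invented).
TOY_FREQ = {"研究": 500, "研究生": 50, "生命": 300, "命": 20, "起源": 100}


def is_reduplicated(word):
    """Match two simple reduplication shapes: AA (看看) and AABB (高高興興)."""
    if len(word) == 2:
        return word[0] == word[1]
    if len(word) == 4:
        return word[0] == word[1] and word[2] == word[3]
    return False


def best_segmentation(sentence, freq, max_len=4):
    """Return the highest-scoring word list, or None if no split exists."""
    total = sum(freq.values())
    # best[i] holds (log-probability, words) for the best split of sentence[:i].
    best = [(0.0, [])] + [(-math.inf, None)] * len(sentence)
    for end in range(1, len(sentence) + 1):
        for start in range(max(0, end - max_len), end):
            prev_score, prev_words = best[start]
            if prev_words is None:
                continue  # sentence[:start] cannot be segmented
            word = sentence[start:end]
            if word in freq:
                score = math.log(freq[word] / total)
            elif is_reduplicated(word):
                # Assumed low prior for words licensed only by the rule.
                score = math.log(1 / total)
            else:
                continue
            if prev_score + score > best[end][0]:
                best[end] = (prev_score + score, prev_words + [word])
    return best[len(sentence)][1]


resolved = best_segmentation("研究生命起源", TOY_FREQ)
with_rule = best_segmentation("研究看看", TOY_FREQ)
unsegmentable = best_segmentation("研究火星", TOY_FREQ)
print(resolved, with_rule, unsegmentable)
```

With these invented counts, the frequent words 研究 and 生命 outweigh the rarer 研究生 and 命, so the ambiguity is resolved in favor of 研究 / 生命 / 起源. The reduplicated word 看看 is accepted through the rule even though it is not in the lexicon, while 火星 still has no analysis.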


Our word segmentation system was ranked first in the Traditional Chinese word segmentation evaluation at the First International Chinese Word Segmentation Bakeoff, held by ACL SIGHAN. It is the first word segmentation system with out-of-vocabulary word identification and syntactic category prediction capabilities.


 A simplified version of the word segmentation server is available to the public at


Tsai, Yu-Fang and Keh-Jiann Chen, 2004, "Reliable and Cost-Effective Pos-Tagging", International Journal of Computational Linguistics & Chinese Language Processing, vol. 9, no. 1, pp. 83-96.

Ma, Wei-Yun and Keh-Jiann Chen, 2003, "A Bottom-up Merging Algorithm for Chinese Unknown Word Extraction", Proceedings of ACL, Second SIGHAN Workshop on Chinese Language Processing, pp. 31-38.

Ma, Wei-Yun and Keh-Jiann Chen, 2003, "Introduction to CKIP Chinese Word Segmentation System for the First International Chinese Word Segmentation Bakeoff", Proceedings of ACL, Second SIGHAN Workshop on Chinese Language Processing, pp. 168-171.

Tsai, Yu-Fang and Keh-Jiann Chen, 2003, "Reliable and Cost-Effective Pos-Tagging", Proceedings of ROCLING XV, pp. 161-174.

Tsai, Yu-Fang and Keh-Jiann Chen, 2003, "Context-rule Model for POS Tagging", Proceedings of PACLIC 17, pp. 146-151.

Chen, Keh-Jiann and Wei-Yun Ma, 2002, "Unknown Word Extraction for Chinese Documents", Proceedings of Coling 2002, pp. 169-175.

Chen, Keh-Jiann and Ming-Hong Bai, 1998, "Unknown Word Detection for Chinese by a Corpus-based Learning Method", International Journal of Computational Linguistics & Chinese Language Processing, vol. 3, no. 1, pp. 27-44.


Wei-Yun Ma, Huan-Hsing Liu, Yu-Fang Tsai, Chia-Hung Tai, Ming-Hong Bai, Jia-Zen Fan, Yu-Ming Hsieh
