Language processing applications such as machine translation, language analysis, language understanding, and information retrieval must identify the words in a text before the text can be processed. A Chinese sentence contains no delimiters, such as spaces, to separate words. A typical word segmentation system therefore looks for the possible word compositions of a sentence by comparing it against a lexicon, which gives rise to word segmentation ambiguities. Most Chinese word segmentation systems concentrate on resolving these ambiguities rather than on identifying unknown words, even though unknown words make up 3% to 5% of all the words in an article. Unknown word identification is therefore an important issue for a word segmentation algorithm. High-frequency keywords are easier to extract and identify offline, while low-frequency keywords must be extracted on the fly by using morphological rules, morphemes, and word collocations.
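To make the ambiguity concrete, the following minimal sketch lists every way a short sentence can be split into words from a lexicon. The lexicon, the example sentence, and the function name `segmentations` are hypothetical and chosen only for illustration; this is not the CKIP implementation.

```python
from functools import lru_cache

# Toy lexicon (hypothetical, for illustration only).
LEXICON = frozenset({"研究", "研究生", "生", "命", "生命", "起源"})


def segmentations(sentence, lexicon=LEXICON, max_len=4):
    """Return every way to split `sentence` into words found in `lexicon`."""

    @lru_cache(maxsize=None)
    def split_from(start):
        # Reaching the end of the sentence yields one empty segmentation.
        if start == len(sentence):
            return [[]]
        results = []
        last_end = min(len(sentence), start + max_len)
        for end in range(start + 1, last_end + 1):
            word = sentence[start:end]
            if word in lexicon:
                # Extend every segmentation of the remainder with this word.
                for rest in split_from(end):
                    results.append([word] + rest)
        return results

    return split_from(0)


candidates = segmentations("研究生命起源")
for words in candidates:
    print(" / ".join(words))
```

With this toy lexicon the sentence has several competing segmentations, for example one beginning with 研究 and another beginning with 研究生. A real system has to choose among such candidates, and a word that is missing from the lexicon yields no complete candidate at all, which is why unknown word identification matters.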
Our system performs Chinese word segmentation with unknown word identification and part-of-speech (POS) tagging. It contains a 100,000-entry lexicon with POS tags, word frequencies, POS tag bigram information, etc. The word segmentation process is based on the lexicon and on morphological rules for quantifier words and reduplicated words. POS tagging is applied to both known and unknown words.
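As a rough illustration of what a morphological rule for reduplicated words might look like, the sketch below checks a candidate string against three common reduplication shapes. The pattern inventory (AA, AABB, ABAB) and the function name `reduplication_pattern` are assumptions made for this example; the rules actually used by the system are not listed here.

```python
def reduplication_pattern(word):
    """Return 'AA', 'AABB', 'ABAB', or None for a candidate string.

    The three patterns are an assumed, simplified rule set.
    """
    if len(word) == 2 and word[0] == word[1]:
        return "AA"
    if len(word) == 4:
        a, b, c, d = word
        if a == b and c == d and a != c:
            return "AABB"
        if a == c and b == d and a != b:
            return "ABAB"
    return None


for candidate in ["看看", "高高興興", "研究研究", "研究"]:
    print(candidate, reduplication_pattern(candidate))
```

A rule of this kind lets a segmenter treat a form such as 高高興興 as a single word even when only its base form appears in the lexicon.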
Our word segmentation system ranked first in the traditional Chinese word segmentation evaluation at the First International Chinese Word Segmentation Bakeoff, held by ACL SIGHAN. It is the first word segmentation system with out-of-vocabulary word identification and syntactic category prediction capabilities.
CKIP CoreNLP provides a set of human language technology tools: word segmentation, sentence parsing, named-entity recognition, and coreference detection.
An open-source word segmentation, POS tagging, and named-entity recognition system using transformer models.
A new open-source word segmentation, POS tagging, and named-entity recognition system.
Chinese Word Segmentation (Old)
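Enter an article (the simplest way is to copy a news story), and the system will extract unknown words and then perform word segmentation and tagging that includes those unknown words.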
- Yu-Fang Tsai, Keh-Jiann Chen. “Reliable and Cost-Effective PoS-Tagging”. IJCLCLP, Vol. 9, No. 1, pp. 83–96, Feb 2004.
- Yu-Fang Tsai, Keh-Jiann Chen. “Context-rule Model for PoS Tagging”. PACLIC, Oct 2003.
- Yu-Fang Tsai, Keh-Jiann Chen. “Reliable and Cost-Effective PoS-Tagging”. ROCLING, Sep 2003.
- Wei-Yun Ma, Keh-Jiann Chen. “Introduction to CKIP Chinese Word Segmentation System for the First International Chinese Word Segmentation Bakeoff”. SIGHAN, Jul 2003.
- Wei-Yun Ma, Keh-Jiann Chen. “A Bottom-Up Merging Algorithm for Chinese Unknown Word Extraction”. SIGHAN, Jul 2003.
- Keh-Jiann Chen, Wei-Yun Ma. “Unknown Word Extraction for Chinese Documents”. COLING, Aug 2002.
- Keh-Jiann Chen, Ming-Hong Bai. “Unknown Word Detection for Chinese by a Corpus-based Learning Method”. IJCLCLP, Vol. 3, No. 1, pp. 27–44, Feb 1998.