  The online systems below are partial research results of our laboratory and are available for everyone to use.
Chinese Parser

When a user inputs a Chinese sentence in this demo version, the system automatically tags and parses the sentence and assigns roles to it. The text, POS tags, and tree structure of the sentence are then displayed.

Sinica TreeBank

Sinica TreeBank 3.0 contains 6 files, 61,087 syntactic tree structures, and 361,834 words. The tree structures were extracted from the Sinica Corpus, and every structure is segmented and parsed. Each segmented word of a tree structure is tagged with its part-of-speech and argument.

Sinica TreeBank 3.0 is provided free on the website for syntactic and semantic research use. 1,000 syntactic tree structures are available.

A Chinese Word Segmentation System with Unknown Word Extraction and POS Tagging

This demo system provides an interface where users can input some text, e.g., a news article. The input text is processed through unknown word detection and extraction, and the final segmentation result is returned. The result includes not only the word segmentation and the unknown word list, but also the detailed processing steps of unknown word detection and extraction.

Academia Sinica Balanced Corpus of Modern Chinese

"Academia Sinica Balanced Corpus of Modern Chinese", abbreviated as Sinica Corpus, is the first balanced corpus of Modern Chinese with part-of-speech tagging. The preliminary version of Sinica Corpus was developed on a small scale and opened to the academic community in 1994, with the major purpose of obtaining feedback. In 1997 we released Sinica Corpus 3.0, with 5 million words and a user-friendly search interface. The new version, Sinica Corpus 4.0, targeted at 10 million words, became available for licensing in 2010, and its web search interface opened to the public in 2013.

The web-interface address for Sinica Corpus:

Affix Database

This sub-corpus is composed of the following high-frequency initial and final morphemes retrieved from Sinica Corpus.

  • Initial Morpheme in Noun Compound: 1,135 (words with ambivalent meanings: 1,197)

  • Final Morpheme in Noun Compound: 1,427 (words with ambivalent meanings: 1,610)

  • Initial Morpheme in Verb Compound: 735 (words with ambivalent meanings: 918)

  • Final Morpheme in Verb Compound: 282 (words with ambivalent meanings: 300)

Counting words with ambivalent meanings, there are 4,025 morphemes in total.
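The total of 4,025 matches the sum of the parenthetical counts (words with ambivalent meanings) rather than the sum of the base counts. A quick tally can confirm this. It is only a sketch: the figures are copied from the list above, and the dictionary and variable names are illustrative, not part of the database.

```python
# Counts copied from the list above, as
# (base count, count of words with ambivalent meanings).
# All names here are illustrative, not part of the Affix Database itself.
affix_counts = {
    "initial morpheme, noun compound": (1135, 1197),
    "final morpheme, noun compound": (1427, 1610),
    "initial morpheme, verb compound": (735, 918),
    "final morpheme, verb compound": (282, 300),
}

base_total = sum(base for base, _ in affix_counts.values())
ambivalent_total = sum(amb for _, amb in affix_counts.values())

print(base_total)        # 3579
print(ambivalent_total)  # 4025
```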

The English meaning, POS, Cilin, and examples are provided for each morpheme.

For verb compounds, the English meaning, morphological rules, and examples are provided for each morpheme. The number of morphological rules varies among verb compound morphemes.

E-HowNet Ontology

Extended-HowNet (E-HowNet) is a lexical knowledge base that evolved from HowNet and was created by the CKIP (Chinese Knowledge and Information Processing) group. It consists of definitions for lexical senses and an ontology. The ontology was built by modifying the HowNet taxonomy of sememes so that it denotes taxonomic relations between concepts and attributes of concepts, with the aim of constructing a lexical knowledge database. It is important groundwork for the E-HowNet project.