Sinica Corpus

The “Academia Sinica Balanced Corpus of Modern Chinese”, abbreviated as Sinica Corpus, is the first balanced corpus of Modern Chinese with part-of-speech tagging. A small-scale preliminary version was opened to the academic community in 1994, mainly to obtain feedback. In 1997 we released Sinica Corpus 3.0, with 5 million words and a user-friendly search interface. Sinica Corpus 4.0, which targets 10 million words, became available for licensing in 2010, and its web search interface opened to the public in 2013.

In addition to data collection and data cleaning, the construction of a balanced Chinese corpus involves three further concerns: 1) balancing and classifying the collected data, 2) Chinese word segmentation, and 3) the design of the POS tagset (Chen 1994).

1. Data extraction and classification for a Balanced Corpus

Topical distribution of the Sinica corpus:


2. Issues of Chinese word segmentation

The Sinica Corpus adopts “The word segmentation standard” for Chinese information processing, issued by the Central Standards Bureau, as its guideline for segmenting words.

3. The Part-of-Speech tagging system and its interpretation

Based on the tagset of 178 syntactic categories from the CKIP lexicon (CKIP 1993), the Sinica Corpus uses a reduced tagset of 46 tags (43 tags plus 3 features).

4. Part-of-speech analysis

A detailed POS analysis and the corresponding argument structures are given in Technical Report no. 93-05.

Research Results

  1. The Sinica corpus, a Balanced Corpus of Modern Chinese with 10 million words:
    • 10 million words collected, primarily since 1996.
    • Texts in the corpus are collected from different areas and classified according to five criteria: genre, style, mode, topic, and source.
    • Every text is segmented, and each segmented word is tagged with its POS.
    • The Sinica Corpus web interface is designed for statistical comparison according to the topics, genres, etc. that users specify.
  2. The web-interface address for Sinica Corpus:

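The design described above (texts classified by five criteria, every word segmented and POS-tagged, statistics compared across user-selected subsets) can be illustrated with a minimal sketch. This is not the Sinica Corpus data format or its interface: the `Text` class, the `tag_counts` function, the metadata values, and the tag labels are all hypothetical and serve only to show the idea.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Text:
    # The five classification criteria named above; all values below are invented.
    genre: str
    style: str
    mode: str
    topic: str
    source: str
    tokens: tuple  # (word, pos_tag) pairs produced by segmentation and tagging


def tag_counts(texts, **criteria):
    """Count POS tags over the texts whose metadata match every given criterion."""
    counts = Counter()
    for text in texts:
        if all(getattr(text, key) == value for key, value in criteria.items()):
            counts.update(tag for _, tag in text.tokens)
    return counts


# Two toy texts; the tag labels are placeholders, not an authoritative tagset.
corpus = [
    Text("news", "narrative", "written", "society", "newspaper",
         (("我們", "Nh"), ("喜歡", "VK"), ("書", "Na"))),
    Text("essay", "expository", "written", "literature", "magazine",
         (("書", "Na"), ("很", "Dfa"), ("好", "VH"))),
]

news = tag_counts(corpus, genre="news")        # tags in news texts only
written = tag_counts(corpus, mode="written")   # tags in all written texts
```

Comparing `news` with `written` is the kind of subset-by-subset statistical comparison the bullet on the web interface refers to; a real query would run over the full corpus rather than an in-memory list.
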
Online Demos

