Sinica Corpus

The “Academia Sinica Balanced Corpus of Modern Chinese”, Sinica Corpus for short, is the first balanced corpus of Modern Chinese with part-of-speech tagging. A small-scale preliminary version was opened to the academic community in 1994, mainly to obtain feedback. In 1997 we released Sinica Corpus 3.0, with 5 million words and a user-friendly search interface. Sinica Corpus 4.0, which targets 10 million words, became available for licensing in 2010, and its web search interface opened to the public in 2013.

In addition to data collection and data cleaning, constructing a balanced Chinese corpus raises three further concerns: 1) balancing and classifying the collected data, 2) Chinese word segmentation, and 3) the design of the POS tagset (Chen 1994).

1. Data extraction and classification for a Balanced Corpus

[Figure: topical distribution of the Sinica Corpus, six categories with shares of 8%, 13%, 28%, 38%, 8%, and 5%]

2. Issues of Chinese word segmentation

“The word segmentation standard” for Chinese information processing, issued by the Central Standards Bureau, was adopted as the guideline for segmenting words in the Sinica Corpus.

3. The Part-of-Speech tagging system and its interpretation

The Sinica Corpus uses a reduced tagset of 46 tags (43 tags plus 3 features), based on the tagset of 178 syntactic categories in the CKIP lexicon (CKIP 1993).
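As a rough illustration of how segmented, POS-tagged text can be processed, here is a minimal Python sketch. It assumes each token is written as a word immediately followed by its tag in parentheses, with tokens separated by spaces. That layout, the helper names `parse_tagged` and `tag_counts`, and the sample sentences are assumptions made for this example, not a description of the corpus's distribution format.

```python
import re
from collections import Counter

# Assumed token layout (hypothetical): word immediately followed by (TAG).
TOKEN_RE = re.compile(r"(\S+?)\(([A-Za-z0-9_]+)\)")


def parse_tagged(line):
    """Return a list of (word, tag) pairs from one segmented, tagged line."""
    return TOKEN_RE.findall(line)


def tag_counts(lines):
    """Count how often each tag occurs across an iterable of tagged lines."""
    counts = Counter()
    for line in lines:
        counts.update(tag for _, tag in parse_tagged(line))
    return counts


# Illustrative sentences only; the tags shown are examples, not verified corpus data.
sample = [
    "我(Nh) 喜歡(VK) 語言學(Na)",
    "他(Nh) 研究(VC) 語料庫(Na)",
]

for tag, n in tag_counts(sample).most_common():
    print(tag, n)
```

Counting tags this way is one simple starting point for the kind of frequency comparison a tagged corpus supports; a real workflow would read lines from the licensed data files rather than an in-memory list.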

4. Part-of-speech analysis

This technical report includes a detailed POS analysis and the corresponding argument structures. See Technical Report no. 93-05.

Research Results

  1. The Sinica corpus, a Balanced Corpus of Modern Chinese with 10 million words:
    • 10 million words collected, primarily since 1996.
    • Texts in the corpus are being collected from different areas and classified according to five criteria: genre, style, mode, topic, and source.
    • Every text is segmented, and each segmented word is tagged with its part of speech.
    • The Sinica Corpus web-interface is designed for statistical comparison according to users' specification of topics, genres, etc.
  2. The web-interface address for Sinica Corpus:

Online Demos

Sinica Corpus (500k words)

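Designed specifically for linguistic analysis: every sentence is segmented into words, and each word is tagged with its part of speech. The texts were collected so as to be distributed as evenly as possible across different topics and styles, making the corpus a representative sample of the infinitely many sentences of modern Chinese.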

Sinica Corpus (1000k words)

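Designed specifically for linguistic analysis: every sentence is segmented into words, and each word is tagged with its part of speech. The texts were collected so as to be distributed as evenly as possible across different topics and styles, making the corpus a representative sample of the infinitely many sentences of modern Chinese.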


Resources

Publications

  • Chih-Ming Chiu, Ji-Chin Lo, Keh-Jiann Chen. “Compositional Semantics of Mandarin Affix Verbs”. ROCLING, Sep 2004.
  • Wei-Yun Ma, Yu-Ming Hsieh, Chang-Hua Yang, Keh-Jiann Chen. “Design of Management System for Chinese Corpus Construction”. ROCLING, Aug 2001.
  • 黃居仁, 陳克健, 陳鳳儀, 魏文真, 張麗麗. “資訊用中文分詞規範設計理念及規範內容”. 語言文字應用學刊, Vol. 6, No. 1, pp. 92–100, 1997.
  • 詞庫小組. “『搜』文解字:中文詞界研究與資訊用分詞標準”. No. 96-01, Jan 1996.
  • 張麗麗, 黃居仁. “漢語數量詞後置”. NAACL, Jul 1995.
  • 黃居仁. “科際整合與整合科技-談計算語言學與語料庫語言學之角色與發展”. 「語言學研究之現況與發展」研討會, Jul 1995.
  • 陳克健. “素材語言學與文本處理”. 漢語語言學國際會議, Jul 1994.
  • 詞庫小組. “中文詞類分析”. No. 93-05, May 1993.
  • Marie Meili Yeh, Chih-Chen Tang, Chu-Ren Huang, Keh-Jiann Chen. “A Preliminary Study on Nominalization in Mandarin Chinese — Argument-Taking Deverbal Nouns”. ROCLING, Sep 1992.
  • 魏文真, 莫若萍. “「是」的語法表達模式”. 民國八十年國科會報告, 1991.
  • 魏文真, 葉美利, 莫若萍. “「有」的語法表達模式”. 民國八十年國科會報告, 1991.
  • Wen-Jen Wei, Keh-Jiann Chen. “The Grammar Representation of Conjunctions — A Representation Based on ICG”. ROCLING, Aug 1991.
  • 陳克健. “中文詞知識庫計劃與中文電子辭典”. 中日雙邊資訊研討會論文集, 1991.

Researchers and Developers

林素朱、邱智銘