Sinica Treebank

The goal of Sinica Treebank is to provide a syntactically annotated, structure-tagged corpus for Chinese natural language processing. Grammatical information extracted from the treebank can be used to improve parser performance and to learn more about syntactic structure.

Sinica Treebank was built by CKIP in 1997 from texts taken from the Sinica Corpus. The texts are first parsed automatically using ICG (Information-based Case Grammar) and then checked manually. The present version, Sinica Treebank v3.0, contains 61,087 trees (361,834 words). Of these, 1,000 tree structures are open to the public for researchers to download. A search interface on the website is also available for users interested in Chinese syntax and semantic relations.

The structural framework of Sinica Treebank is based on the Head-Driven Principle; that is, a sentence or phrase is composed of a core Head together with its arguments or adjuncts. The Head determines the category of the phrase and its relations with the other constituents. For example, the Head of a sentence (S) or verb phrase (VP) is a verb (V). See “中文句結構樹資料庫 (Sinica Treebank) 的構建” (Chen et al. 1999) for details of supplementary principles, symbol illustrations, semantic roles, and phrasal structures.
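
The Head-Driven Principle can be illustrated with a minimal Python sketch. This is not the Sinica Treebank file format or the CKIP implementation. The class names, the `HEAD_TO_PHRASE` mapping, the role labels ("agent", "goal"), and the bracketed output format are all assumptions made for illustration. Only the idea that a verb Head projects a VP comes from the description above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

# Assumed mapping from a Head's part of speech to the phrase it projects.
# V -> VP follows the text above; N -> NP is added only for illustration.
HEAD_TO_PHRASE = {"V": "VP", "N": "NP"}


@dataclass
class Word:
    form: str  # surface form
    pos: str   # part-of-speech tag, e.g. "V" or "N"


@dataclass
class Phrase:
    head: Word
    # Each dependent is a (role, constituent) pair; role names are hypothetical.
    arguments: List[Tuple[str, "Constituent"]] = field(default_factory=list)
    adjuncts: List[Tuple[str, "Constituent"]] = field(default_factory=list)

    @property
    def category(self) -> str:
        # The phrasal category is derived from the Head, never stored separately.
        return HEAD_TO_PHRASE[self.head.pos]


Constituent = Union[Word, Phrase]


def bracket(node: Constituent) -> str:
    """Render a constituent as a bracketed string, Head first (illustrative format)."""
    if isinstance(node, Word):
        return f"{node.pos}:{node.form}"
    parts = [f"Head:{bracket(node.head)}"]
    parts += [f"{role}:{bracket(child)}" for role, child in node.arguments + node.adjuncts]
    return f"{node.category}({'|'.join(parts)})"


# Toy example: 他 吃 飯 ("he eats rice"), with the verb 吃 as Head.
subject = Phrase(head=Word("他", "N"))
obj = Phrase(head=Word("飯", "N"))
vp = Phrase(head=Word("吃", "V"), arguments=[("agent", subject), ("goal", obj)])

print(vp.category)
print(bracket(vp))
```

Because `category` is computed from the Head rather than assigned by hand, changing the Head of a phrase changes its category automatically, which mirrors the principle stated above.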

Research Results

  • Online interface for searching Sinica Treebank.
  • Licensing information for Sinica Treebank v2.1.
  • Online Demos

    Sinica Treebank

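    Sentences extracted by the CKIP group at Academia Sinica from the Academia Sinica Balanced Corpus (Sinica Corpus), automatically parsed into tree structures, and then manually corrected and verified.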




    Researchers and Developers