The goal of Sinica Treebank is to provide a syntactically annotated, structure-tagged corpus for Chinese natural language processing. Grammatical information extracted from the Treebank can be used to improve parser performance and to learn more about Chinese syntax.

Sinica Treebank was built by CKIP in 1997 from texts taken from the Sinica Corpus. The texts are parsed automatically on the basis of Information-based Case Grammar (ICG) and then checked manually. The present version, Sinica Treebank v3.0, contains 61,087 trees (361,834 words). 1,000 tree structures are open to the public for researchers to download. A search interface on the website is also available for users interested in Chinese syntax and semantic relations.

The structural frame of Sinica Treebank is based on the Head-Driven Principle: a sentence or phrase is composed of a core Head together with its arguments or adjuncts. The Head determines the phrasal category and its relations with the other constituents. For example, the Head of a sentence (S) or verb phrase (VP) is a verb (V). See “中文句結構樹資料庫 (Sinica Treebank) 的構建” ("The Construction of Sinica Treebank", Chen et al. 1999) for details of the supplementary principles, symbol illustrations, semantic roles, and phrasal structures.
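The Head-Driven Principle described above can be illustrated with a small Python sketch. It is not the Treebank's own data format or tooling: the `Node` class, the simplified category labels (`S`, `NP`, `V`, `N`), and the role names `agent` and `theme` are assumptions chosen only for illustration. The sketch shows each phrase holding one child marked as Head alongside its other constituents, and how the lexical Head of a phrase can be found by following Head children downward.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One constituent in a head-driven tree (hypothetical structure)."""
    category: str                 # simplified label, e.g. "S", "NP", "V", "N"
    role: str = ""                # relation to the parent: "Head" or an illustrative role
    word: Optional[str] = None    # surface word, set on leaves only
    children: List["Node"] = field(default_factory=list)

    def head(self) -> Optional["Node"]:
        """Return the immediate child marked as Head, or None if there is none."""
        for child in self.children:
            if child.role == "Head":
                return child
        return None

    def lexical_head(self) -> Optional["Node"]:
        """Follow Head children down to a leaf; a leaf is its own lexical head."""
        node = self
        while node.children:
            next_node = node.head()
            if next_node is None:
                return None
            node = next_node
        return node


# Illustrative sentence "他 看 書" ("he reads books"); labels are simplified assumptions.
sentence = Node("S", children=[
    Node("NP", role="agent", children=[Node("N", role="Head", word="他")]),
    Node("V", role="Head", word="看"),
    Node("NP", role="theme", children=[Node("N", role="Head", word="書")]),
])

print(sentence.head().category)        # category of the Head of S
print(sentence.lexical_head().word)    # word of the lexical Head of S
```

In this sketch the Head of S is the verb, in line with the example in the text, while the two noun phrases attach to it as arguments under illustrative role labels.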

Research Results

