Sinica Corpus
The “Academia Sinica Balanced Corpus of Modern Chinese”, abbreviated as Sinica Corpus, is the first balanced corpus of Modern Chinese with part-of-speech tagging. A preliminary, small-scale version was opened to the academic community in 1994, mainly to obtain feedback. In 1997 we released Sinica Corpus 3.0, with 5 million words and a user-friendly search interface. The new version, Sinica Corpus 4.0, targeting 10 million words, became available for licensing in 2010, and its web search interface opened to the public in 2013.
In addition to data collection and data cleaning, constructing a balanced Chinese corpus raises three further concerns: 1) balancing and classifying the collected data, 2) Chinese word segmentation, and 3) the design of part-of-speech (POS) tag sets (Chen 1994).
1. Data extraction and classification for a Balanced Corpus
Topical distribution of the Sinica Corpus:
8% | 13% | 28% | 38% | 8% | 5% |
2. Issues of Chinese word segmentation
3. The part-of-speech tagging system and its interpretation
4. Part-of-speech analysis
Research Results
- The Sinica Corpus, a balanced corpus of Modern Chinese with 10 million words:
- 10 million words collected, primarily since 1996.
- Texts in the corpus are collected from different areas and classified according to five criteria: genre, style, mode, topic, and source.
- Every text is segmented into words, and each word is tagged with its part of speech (POS).
- The Sinica Corpus web-interface is designed for statistical comparison according to users' specification of topics, genres, etc.
- The web-interface address for Sinica Corpus:
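To make the segmentation, tagging, and statistical comparison described in the list above more concrete, here is a minimal Python sketch. It rests on several assumptions that do not come from this page: that tagged text is stored as whitespace-separated `word(TAG)` tokens, that each text carries metadata fields named `topic` and `genre`, and that the sample sentences and tags are purely illustrative and not real corpus data. The function names `parse_tagged` and `tag_counts_by_group` are hypothetical and are not part of any Sinica Corpus tool.

```python
import re
from collections import Counter

# Assumed token format: word(TAG), tokens separated by whitespace.
TOKEN_RE = re.compile(r"(\S+?)\(([A-Za-z0-9_]+)\)")


def parse_tagged(line):
    """Return a list of (word, tag) pairs from a 'word(TAG) word(TAG)' line."""
    return TOKEN_RE.findall(line)


def tag_counts_by_group(texts, key):
    """Count POS tags separately for each value of a metadata field.

    `texts` is a list of dicts with a 'body' string plus metadata fields;
    `key` names the metadata field to group by (e.g. 'topic' or 'genre').
    """
    counts = {}
    for text in texts:
        counter = counts.setdefault(text[key], Counter())
        for _word, tag in parse_tagged(text["body"]):
            counter[tag] += 1
    return counts


# Illustrative sample only; not taken from the corpus.
sample = [
    {"topic": "science", "genre": "report",
     "body": "我們(Nh) 研究(VE) 語言(Na)"},
    {"topic": "life", "genre": "essay",
     "body": "他(Nh) 喜歡(VK) 音樂(Na) 和(Caa) 電影(Na)"},
]

if __name__ == "__main__":
    for group, counter in tag_counts_by_group(sample, "topic").items():
        print(group, dict(counter))
```

Grouping by a different criterion only requires changing the `key` argument, for example `tag_counts_by_group(sample, "genre")`. The actual web interface may compute its statistics differently.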
Online Demos
Resources
Publications
Researchers and Developers
林素朱、邱智銘