Balanced corpus of contemporary written Japanese

Kikuo Maekawa, Yasuharu Den, Toshinobu Ogiso, Makiro Tanaka, Hideki Ogura, Masaya Yamaguchi, Hanae Koiso, Wakako Kashino, Takehiko Maruyama, Makoto Yamazaki

doi:10.1007/s10579-013-9261-0

Abstract

The Balanced Corpus of Contemporary Written Japanese (BCCWJ) is Japan's first 100-million-word balanced corpus. It consists of three subcorpora (publication subcorpus, library subcorpus, and special-purpose subcorpus) and covers a wide range of text registers, including books in general, magazines, newspapers, governmental white papers, best-selling books, an internet bulletin board, a blog, school textbooks, minutes of the National Diet, publicity newsletters of local governments, laws, and poetry verses. Random sampling is used wherever possible to maximize the representativeness of the corpus. The corpus is annotated with dual POS analysis, document structure, and bibliographical information. The BCCWJ is currently accessible in three different ways, including Chunagon, a web-based interface to the dual POS analysis data. Lastly, results of a pilot evaluation of the corpus with respect to textual diversity are reported. The analyses include POS distribution, word-class distribution, entropy of orthography, sentence length, and variation of the adjective predicate. High textual diversity is observed in all these analyses.

Highlights

One serious problem in corpus-based analyses of present-day Japanese is the lack of a balanced corpus.
The Kyoto University Text Corpus (Kurohashi and Nagao 1998), which played an important role in the development of an annotated corpus of Japanese natural language processing (NLP), consists of 40 thousand sentences taken from articles of the Mainichi newspaper published in 1995.
Whenever it is possible to divide a sample into separate parts that are written by different authors, each part is called an ‘article.’ Article information in the Balanced Corpus of Contemporary Written Japanese (BCCWJ) consists of fields such as Article_ID, Directory_ID, First_appearance, and First_published.

Summary

Introduction

One serious problem in corpus-based analyses of present-day Japanese is the lack of a balanced corpus. Although it is possible to estimate the meta-information by means of up-to-date statistical clustering and classification methods, this approach requires a certain amount of training data for supervised learning, derived from reliable reference corpora that include the types of data mentioned above and cover various text types. To solve these problems, the authors launched a corpus compilation project in the spring of 2006, aiming at the public release of Japan’s first 100-million-word balanced corpus in 2011. The corpus was named the Balanced Corpus of Contemporary Written Japanese (BCCWJ, hereafter).

Corpus design

Balance and representativeness

Publication subcorpus

Library subcorpus

Special-purpose subcorpus

Missing registers

Temporal coverage

Sample length

Treatment of copyright

Dual POS analysis and the UniDic

The performance

The BCCWJ-core

Document structure annotation

Notes Abstract

Meta-information

Shonagon

Chunagon

DVD-release

POS distribution

Distribution of word-classes

Entropy of orthographic variations

Sentence length

Adjective predicate

Automatic classification of registers

Findings

Conclusion

Journal: Computers and the Humanities
Publication Date: Dec 29, 2013
Citations: 211
License type: CC BY 2.0
