Abstract

The number of people learning Chinese as a Foreign Language (CFL) has increased in recent years. Newcomers, international students, and immigrant spouses all need to improve their Chinese reading fluency and listening comprehension for daily communication and work. However, not everyone has the opportunity to receive formal education in a language school, so informal learning is very important for CFL learners in Taiwan. Novice Chinese learners should first master the skill of grouping Chinese characters into meaningful chunks, i.e. Chinese word segmentation. For instance, after word segmentation the sentence “老師對教育的貢獻” (teachers’ contribution to education) changes from the character sequence “老/師/對/教/育/的/貢/獻” to “老師(teachers)/對(P)/教育(education)/的(DE)/貢獻(contribution)”. This study therefore used two Chinese segmentation methods to highlight meaningful and important word chunks in the subtitles of Chinese videos and evaluated their usability for CFL learners. The first method adopted the top 800 and top 1600 high-frequency words from an analysis report based on the Academia Sinica Balanced Corpus of Modern Chinese to identify proper word segmentation in video subtitles, and analyzed its performance using the forward maximum matching method. The statistical results show that most Chinese subtitles remain unsegmented (62.3%) when the lexicon contains only the top 800 high-frequency words, which means these subtitles are not appropriately segmented. With the top 1600 high-frequency words, approximately 60% of the subtitles in each video are effectively segmented, although numerous unknown words still remain.
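
The forward maximum matching used by the first method can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `forward_maximum_match` and the tiny lexicons below are hypothetical stand-ins for the top 800 or 1600 high-frequency word lists, and characters not covered by the lexicon are simply emitted one by one as unknown.

```python
def forward_maximum_match(text, lexicon, max_len=None):
    """Greedy left-to-right segmentation: at each position, take the
    longest lexicon word that starts there; otherwise emit one character."""
    if max_len is None:
        # Longest word in the lexicon bounds the search window.
        max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        matched = None
        # Try the longest candidate first, then shrink the window.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in lexicon:
                matched = candidate
                break
        if matched is None:
            # Unknown character: not covered by the (hypothetical) lexicon.
            matched = text[i]
        tokens.append(matched)
        i += len(matched)
    return tokens


# Toy lexicons (assumptions for illustration, not the real frequency lists).
full_lexicon = {"老師", "對", "教育", "的", "貢獻"}
small_lexicon = {"老師", "教育"}

print("/".join(forward_maximum_match("老師對教育的貢獻", full_lexicon)))
print("/".join(forward_maximum_match("老師對教育的貢獻", small_lexicon)))
```

With the smaller lexicon, the characters of 貢獻 are left as separate single-character tokens, which is the kind of unsegmented residue that a limited high-frequency word list produces.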
Active phrases, idioms, and short phrases in Chinese subtitles may make word segmentation difficult; moreover, the usability testing result for segmentation based on high-frequency words is not significant. The second method used a natural language processing technique to split Chinese subtitles into their separate morphemes. The study adopted the CKIP Chinese parser, a word segmentation tool for Chinese, to split subtitles according to their part-of-speech tagging (i.e. grammatical tagging). The statistical results show that 97.26% of the subtitles are split, but the usability testing shows that subjective satisfaction is not good enough. To investigate further, we asked subjects to identify “improper” word segmentations. For instance, the subtitle “接受治療很久了” (treated for a long time) is split into “接受/治療/很/久/了”, but most novices think that the proper segmentation should be “接受/治療/很久了”. The “improper” rate is about 22.30% on average. In other words, the segmentation results from a Chinese parser based on natural language processing are not the best scaffolding for novice Chinese learners watching videos with Chinese subtitles. The preliminary results of usability testing show that the second method can provide effective scaffolding for novices, but the granularity of the chunked words is sometimes too fine to read fluently (in less than thirty percent of the results). Consequently, an adaptation mechanism is required so that learners can reach a balance between the scaffolding provided by the two methods. For example, Chinese function words such as 很 and 了 serve only grammatical functions (i.e. they have no meaning by themselves). Such function words should not be separated out from subtitles for learning purposes.
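
One way to picture such an adaptation step is a post-processing pass that re-attaches function words to their neighbours. The sketch below is an assumption for illustration only, not a mechanism described in the study: the sets `ATTACH_FORWARD` and `ATTACH_BACKWARD`, the function name `merge_function_words`, and the attachment directions (很 joins the following token, 了 joins the preceding one) are all hypothetical.

```python
# Hypothetical function-word lists; a real system would need a fuller,
# linguistically motivated inventory.
ATTACH_FORWARD = {"很"}   # assumed to join the token that follows
ATTACH_BACKWARD = {"了"}  # assumed to join the token that precedes


def merge_function_words(tokens):
    """Coarsen a fine-grained segmentation by gluing function words
    onto adjacent tokens."""
    merged = []
    pending_prefix = ""
    for token in tokens:
        if token in ATTACH_FORWARD:
            # Hold the word until the next token arrives.
            pending_prefix += token
        elif token in ATTACH_BACKWARD and merged and not pending_prefix:
            # Glue onto the previous chunk.
            merged[-1] += token
        else:
            merged.append(pending_prefix + token)
            pending_prefix = ""
    if pending_prefix:
        # A forward-attaching word at the very end stays on its own.
        merged.append(pending_prefix)
    return merged


parser_output = ["接受", "治療", "很", "久", "了"]
print("/".join(merge_function_words(parser_output)))
```

Applied to the parser output from the example above, this pass yields the coarser chunking that most novices preferred.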
Further work is necessary to determine the proper granularity for chunking words, design an adaptation mechanism for segmentation, and prevent segmentation errors on new or unknown words.

Keywords

Chinese as a foreign language, Chinese segmentation, subtitle manipulation, natural language processing, computer-assisted language learning
