Abstract

Corpus construction is foundational work in the field of Tibetan information processing. This paper combines web crawling, preprocessing, and real-time acquisition from websites to collect a large Tibetan corpus in a short time. Hot words reflect the topics that hold Tibetan people's attention during a given period. To extract them, the paper adapts TFIDF to Tibetan text and gives words different weights according to where they appear in the article. Implemented in self-developed software, the approach proves effective for both building the raw Tibetan corpus and extracting hot words.

Highlights

  • With China's reform and opening up, Tibetan regions have witnessed rapid development

  • The hot-word extraction algorithm draws on TFIDF feature extraction [13]: word strings are weighted according to their location in the article, and word strings that appear in the title receive double weight

  • The software continuously gathers Tibetan text by crawling multiple mainstream Tibetan websites. Using web information acquisition techniques, it captures the main news content and stores it as a structured Tibetan corpus
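
The position-weighted scoring described in the highlights can be illustrated with a minimal sketch. Only the double weight for title occurrences comes from the text above. The body weight of 1.0, the smoothed IDF variant, the summing of scores across documents, and the names `weighted_tf` and `hot_words` are assumptions made for illustration, not the paper's actual implementation. Input is assumed to be already segmented into tokens with stop words removed.

```python
import math
from collections import Counter

TITLE_WEIGHT = 2.0  # from the text: title occurrences receive double weight
BODY_WEIGHT = 1.0   # assumption: baseline weight for body occurrences


def weighted_tf(title_tokens, body_tokens):
    """Position-weighted, normalized term frequency for one document."""
    counts = Counter()
    for tok in title_tokens:
        counts[tok] += TITLE_WEIGHT
    for tok in body_tokens:
        counts[tok] += BODY_WEIGHT
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tok: c / total for tok, c in counts.items()}


def hot_words(docs, top_k=5):
    """Rank words over a corpus of (title_tokens, body_tokens) pairs.

    Each word's score is the sum over documents of weighted TF times IDF.
    """
    n = len(docs)
    # Document frequency: number of documents containing each word.
    df = Counter()
    for title, body in docs:
        df.update(set(title) | set(body))

    scores = Counter()
    for title, body in docs:
        for tok, tf in weighted_tf(title, body).items():
            # Smoothed IDF (an assumed variant; the paper may use another).
            idf = math.log((1 + n) / (1 + df[tok])) + 1
            scores[tok] += tf * idf
    return scores.most_common(top_k)


# Placeholder English tokens stand in for segmented Tibetan words.
docs = [
    (["flood"], ["flood", "river", "rain"]),
    (["market"], ["price", "flood"]),
    (["school"], ["exam", "rain"]),
]
top = hot_words(docs, top_k=3)
```

Because title tokens are counted twice, a word that appears in a headline and again in the body outranks a word with the same raw count that appears only in the body.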


Summary

Introduction

With China's reform and opening up, Tibetan regions have witnessed rapid development. How to extract Tibetan information and hot words effectively is a topic well worth studying. Information processing techniques for Chinese and English have achieved good results, but research on China's minority languages is still at an early stage. Over the past few years, the number of websites in Tibetan and other minority languages has grown rapidly, providing ample material for the study of these languages. Through rapid identification and directional tracking of hot words [2], we can quickly understand public sentiment, follow social dynamics and development trends, and grasp the direction of public opinion faster and more comprehensively, thereby guiding public opinion and publicity correctly.

Background
Information Gathering
Preprocessing
Word Segmentation and Stop Word Removal
Hot Word Extraction
Experiment
Conclusion
