Research on the Extraction Technology of Hot-words in Tibetan WebPages

Chang-Zhi Wang,Hui Wang,Gui-Xian Xu,T Gong,J Xu,T Yang

doi:10.1051/itmconf/20160701005

Abstract

Building a Tibetan corpus is basic work in the field of Tibetan information processing. This paper uses web crawling, preprocessing, and real-time acquisition of websites to obtain a large Tibetan corpus in a short time. Hot words reflect the topics that attract Tibetan people's attention in a given period. The paper draws on TFIDF for Tibetan text information extraction, and words in different locations are given different weights to extract the hot words. The self-made software proves effective for constructing the raw Tibetan corpus and extracting hot words.

Highlights

With China's reform and opening up, Tibetan regions have witnessed rapid development
The hot-word extraction algorithm draws on the feature extraction of TFIDF [13], gives word strings different weights according to their location in the article, and gives double weight to word strings that appear in the title
The software continuously obtains relevant Tibetan corpus material by crawling multiple mainstream Tibetan websites. It captures the main news material and stores it as a structured Tibetan corpus through web information acquisition technology
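
The location-weighted scoring described in the highlights can be illustrated with a minimal sketch. It is not the authors' implementation. The only detail taken from the text is that word strings in the title receive double weight. The body weight of 1.0, the smoothed IDF formula, the aggregation by summing scores over documents, and the function names `weighted_tf` and `hot_words` are all assumptions made for illustration. The input is assumed to be already segmented and stripped of stop words.

```python
import math
from collections import Counter

TITLE_WEIGHT = 2.0  # from the highlights: title occurrences get double weight
BODY_WEIGHT = 1.0   # assumption: body occurrences get unit weight


def weighted_tf(title_tokens, body_tokens):
    """Location-weighted term frequency for one document, normalized to sum to 1."""
    counts = Counter()
    for tok in title_tokens:
        counts[tok] += TITLE_WEIGHT
    for tok in body_tokens:
        counts[tok] += BODY_WEIGHT
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tok: c / total for tok, c in counts.items()}


def hot_words(docs, top_k=10):
    """Rank words over a corpus of (title_tokens, body_tokens) pairs.

    Assumption: a word's corpus score is the sum of its per-document
    weighted TF times a smoothed IDF; the paper may aggregate differently.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    df = Counter()
    for title, body in docs:
        df.update(set(title) | set(body))
    scores = Counter()
    for title, body in docs:
        for tok, tf in weighted_tf(title, body).items():
            idf = math.log((1 + n_docs) / (1 + df[tok])) + 1.0
            scores[tok] += tf * idf
    return [tok for tok, _ in scores.most_common(top_k)]


# Placeholder tokens stand in for segmented Tibetan word strings.
corpus = [
    (["a"], ["a", "b"]),
    (["c"], ["b", "c", "c"]),
]
print(hot_words(corpus, top_k=2))
```

Normalizing the weighted counts keeps long articles from dominating the ranking; whether the original software normalizes in this way is not stated in the text.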

Summary

Introduction

With China's reform and opening up, Tibetan regions have witnessed rapid development. How to extract Tibetan information and hot words effectively is a topic well worth studying. At present, information-processing research for both Chinese and English has achieved good results, but research on Chinese minority languages is still at an early stage. Over the past few years, the number of websites in Tibetan and other minority languages has increased rapidly, providing sufficient material for the study of minority languages. Through rapid identification and directional tracking of hot words [2], we can quickly understand public sentiment, follow social dynamics and development trends, and grasp the trend of public opinion faster and more comprehensively, thereby guiding public opinion and propaganda correctly.

Background

Information Gathering

Preprocessing

Word Segmentation and Stop-Word Removal

Hot words Extraction

Experiment

Conclusion


Journal: ITM Web of Conferences
Publication Date: Jan 1, 2016
License type: cc-by
