The Internet is an important platform for spreading public opinion among Tibetan people, so content analysis of Tibetan Web pages is valuable for public opinion monitoring. Detecting sensitive words helps in understanding the public opinion of the minority. In this paper, we present a novel sensitive information classification algorithm and a topic tracking algorithm for Web page content. First, we propose a text sensitive information classification method based on the vector space model and cosine similarity. The main idea is that a sensitive word receives a different degree of importance in the term weight computation depending on where it occurs in the text. The sensitive word list, which is built manually, is the foundation of the classification: Web texts are classified by comparing them against this sensitive thesaurus. After each text is classified, a new topic tracking algorithm monitors sensitive words over a period of time. The first step computes the weights of the sensitive words within a fixed period and selects the top 10. The second step selects the top 3 of these 10 sensitive words to track. Experiments show that the sensitive information classification is very effective and that the topic tracking results are satisfactory.
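The two stages described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the location names and their weights (`POSITION_WEIGHTS`), the all-ones reference vector standing in for the sensitive thesaurus, the `threshold` value, and all function names are assumptions made for illustration, since the abstract does not give them.

```python
import math
from collections import Counter

# Hypothetical location weights; the abstract does not state actual values.
POSITION_WEIGHTS = {"title": 3.0, "body": 1.0}


def weighted_vector(doc, sensitive_words):
    """Build a term-weight vector over the sensitive word list.

    `doc` maps a location name ("title", "body") to a list of tokens.
    Each occurrence of a sensitive word adds the weight of its location.
    """
    vocabulary = set(sensitive_words)
    weights = Counter()
    for location, tokens in doc.items():
        location_weight = POSITION_WEIGHTS.get(location, 1.0)
        for token in tokens:
            if token in vocabulary:
                weights[token] += location_weight
    return [weights[word] for word in sensitive_words]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_sensitive(doc, sensitive_words, threshold=0.3):
    """Classify a text by comparing it against the sensitive word list.

    Assumption: the thesaurus is represented as an all-ones reference
    vector, and `threshold` is an arbitrary illustrative cut-off.
    """
    reference = [1.0] * len(sensitive_words)
    return cosine(weighted_vector(doc, sensitive_words), reference) >= threshold


def track_topics(docs, sensitive_words, top_n=10, track_n=3):
    """Sum word weights over the texts of one period, then pick words to track.

    Returns (top_n candidate words, first track_n of those candidates).
    """
    totals = Counter()
    for doc in docs:
        vector = weighted_vector(doc, sensitive_words)
        for word, weight in zip(sensitive_words, vector):
            totals[word] += weight
    candidates = [w for w, score in totals.most_common(top_n) if score > 0]
    return candidates, candidates[:track_n]
```

With `docs` holding the tokenized texts collected during one monitoring period, `track_topics(docs, sensitive_words)` returns the candidate words of that period together with the subset to track. Here the tracked words are simply the highest-weighted candidates, which is one plausible reading of the second step.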