Automatically generation and evaluation of Stop words list for Chinese Patents

Deng Na, Chen Xu

doi:10.12928/telkomnika.v13i4.2389

Abstract

Stop word elimination is an important preprocessing step in information retrieval and text mining, and its accuracy directly influences the final retrieval and mining results. In information retrieval, eliminating stop words reduces the storage space of the index; in text mining, it greatly reduces the dimension of the vector space, saves storage, and speeds up computation. However, Chinese patents are legal documents that contain technical information, and the general Chinese stop word list is not applicable to them. This paper proposes two methodologies for generating stop word lists for Chinese patents: one based on word frequency and the other on statistics. Through experiments on real patent data, the accuracy of the two methodologies is compared across several corpora of different scales and against the general stop word list. The experimental results indicate that both methodologies can extract stop words suitable for Chinese patents, and that the accuracy of the statistics-based methodology is slightly higher than that of the frequency-based one.
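The abstract does not spell out either methodology, so the following is only a minimal sketch of the general word-frequency idea, not the authors' procedure. It assumes the patents have already been segmented into tokens; the function name `frequency_stop_words`, the `top_k` and `min_doc_ratio` parameters, and the sample tokens are all hypothetical choices made for illustration.

```python
from collections import Counter


def frequency_stop_words(docs, top_k=3, min_doc_ratio=0.8):
    """Return up to top_k candidate stop words from pre-segmented documents.

    docs: list of token lists (segmentation is assumed to be done elsewhere).
    A token qualifies only if it appears in at least min_doc_ratio of the
    documents; qualifying tokens are ranked by total frequency.
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    term_freq = Counter()  # total occurrences of each token
    doc_freq = Counter()   # number of documents containing each token
    for tokens in docs:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    n_docs = len(docs)
    candidates = [
        word
        for word, _ in term_freq.most_common()
        if doc_freq[word] / n_docs >= min_doc_ratio
    ]
    return candidates[:top_k]


# Hypothetical, hand-segmented patent-style sentences.
docs = [
    ["本", "发明", "涉及", "一种", "电池", "的", "制备", "方法"],
    ["本", "实用新型", "公开", "一种", "轴承", "的", "结构"],
    ["本", "发明", "提供", "一种", "涂料", "的", "配方"],
]
stop_words = frequency_stop_words(docs)
```

In this toy corpus, tokens that recur in every document are selected, while technical terms such as "电池" that occur in a single document are not. A real pipeline would need a Chinese word segmenter and thresholds tuned on actual patent data.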

Journal: TELKOMNIKA (Telecommunication Computing Electronics and Control)
Publication Date: Dec 1, 2015
Citations: 32
License type: cc-by-sa

Similar Papers

Automatic Identification of Stop Words in Chinese Text Classification
Lili Hao ... Lizhu Hao
01 Jan 2008

Generating Stopword List for Sanskrit Language
Jaideepsinh K Raulji ... Jatinderkumar R Saini
01 Jan 2017

Automatic Construction of Generic Stop Words List for Hindi Text
Ruby Rani ... D.K Lobiyal
Procedia Computer Science | VOL. 132
01 Jan 2018

Lexical classes based stop words categorization for Gujarati language
Rajnish M Rakholia ... Jatinderkumar R Saini
01 Sep 2016
