Abstract

In the era of Data Technology, data is characterized by huge scale, modal diversity, and rapid growth, and the value of Chinese-language corpora has multiplied accordingly. Building on the Language Technology Platform (LTP), a Chinese language processing system, we use data mining and machine learning to extract and apply Chinese sentence features, which offers a new perspective and entry point for Chinese information processing. In this paper, dependency grammar is selected for sentence pattern analysis, and a text representation model consisting of sequences and vectors is established. A specialized “Chinese Sentence Pattern Retrieve Library” containing 1,032,480 sentences and 92,451 kinds of sentence patterns is built to provide a sentence pattern database service for further studies of special sentence patterns. On the basis of this database, statistics and a preliminary analysis are produced for the sentence patterns of articles in different genres. It is found that Chinese has about 2,000 core sentence patterns and that commonly used sentence patterns are relatively concentrated, with the 10 most frequent sentence patterns accounting for 50% of the total frequency. The proportion of some sentence patterns in certain articles is much higher or lower than in other articles. These results provide a basis for establishing sentence pattern feature vectors for articles and for later feature extraction and application.
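
The two quantities mentioned above, the share of sentences covered by the most frequent patterns and a per-article sentence pattern feature vector, can be illustrated with a minimal Python sketch. It is not the paper's implementation. It assumes a sentence pattern is encoded as the sequence of dependency relation labels of a parsed sentence (the labels are written in the style of LTP's dependency tags), and the function names `pattern_of`, `top_k_coverage`, and `feature_vector`, as well as the toy data, are hypothetical.

```python
from collections import Counter


def pattern_of(relations):
    """Encode one parsed sentence as a pattern.

    Assumption: a pattern is simply the tuple of dependency labels
    in sentence order; the paper's exact encoding may differ.
    """
    return tuple(relations)


def top_k_coverage(patterns, k):
    """Fraction of sentences whose pattern is among the k most frequent."""
    counts = Counter(patterns)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for _, n in counts.most_common(k))
    return covered / total


def feature_vector(patterns, vocabulary):
    """Relative frequency of each vocabulary pattern within one article."""
    counts = Counter(patterns)
    total = sum(counts.values()) or 1  # avoid division by zero
    return [counts[p] / total for p in vocabulary]


# Toy input: dependency labels per sentence (illustrative, not real data).
parsed = [
    ["SBV", "HED", "VOB"],
    ["SBV", "HED", "VOB"],
    ["SBV", "ADV", "HED"],
    ["ATT", "SBV", "HED", "VOB"],
]
patterns = [pattern_of(r) for r in parsed]
vocabulary = [("SBV", "HED", "VOB"), ("SBV", "ADV", "HED")]

coverage_top1 = top_k_coverage(patterns, 1)
article_vector = feature_vector(patterns, vocabulary)
```

With a real corpus, `top_k_coverage(patterns, 10)` would correspond to the concentration figure reported above, and comparing `feature_vector` outputs across articles would show which patterns are over- or under-represented in a given genre.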
