Abstract

Extracting similar text fragments from weakly formalized data is a task of natural language processing and intelligent data analysis, used to solve the problem of automatically identifying connected knowledge fields. To search for such common communities in Wikipedia, we propose a logical-algebraic model for extracting similar collocations as an additional processing stage. Using the Stanford Part-Of-Speech tagger and the Stanford Universal Dependencies parser, we identify the grammatical characteristics of collocation words, and with WordNet synsets we choose their synonyms. Our dataset includes Wikipedia articles from different portals and projects. The experimental results show the frequencies of synonymous text fragments in Wikipedia articles that form common information spaces. The number of highly frequent synonymous collocations can provide an indication of key common up-to-date Wikipedia communities.
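The pipeline described in the abstract compares collocations word by word, using grammatical characteristics and synonym sets. A minimal sketch of that idea is shown below; the POS tags and the small synonym table are hypothetical stand-ins for the Stanford tagger output and WordNet synset lookups used in the paper, not the authors' actual model.

```python
# Toy sketch: two collocations are treated as similar when their words
# match position by position, either exactly or via a synonym set, and
# their part-of-speech patterns agree. SYNONYMS is a hypothetical
# stand-in for WordNet synsets; the POS tags stand in for the output
# of the Stanford Part-Of-Speech tagger.

SYNONYMS = {
    "big": {"large", "huge"},
    "large": {"big", "huge"},
    "picture": {"image"},
    "image": {"picture"},
}

def words_match(w1: str, w2: str) -> bool:
    """True if the words are identical or listed as synonyms."""
    return w1 == w2 or w2 in SYNONYMS.get(w1, set())

def collocations_similar(c1, c2) -> bool:
    """Compare two collocations given as lists of (word, pos_tag) pairs."""
    if len(c1) != len(c2):
        return False
    return all(
        pos1 == pos2 and words_match(w1, w2)
        for (w1, pos1), (w2, pos2) in zip(c1, c2)
    )

# Example: "big picture" (ADJ NOUN) vs "large image" (ADJ NOUN)
print(collocations_similar(
    [("big", "ADJ"), ("picture", "NOUN")],
    [("large", "ADJ"), ("image", "NOUN")],
))  # → True
```

In practice the synonym table would be replaced by WordNet synset queries and the tags by real parser output; the point here is only the position-wise comparison of words and grammatical patterns.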

Highlights

  • Wikipedia, the largest and most popular Web-based free encyclopedia, covers various fields of knowledge

  • We propose an information technology for identifying the semantic proximity of short text fragments in Wikipedia articles, which allows the formation of common information spaces and thereby provides relevant search and access to Wikipedia articles written on related topics

  • We confirm the hypothesis that many synonymous collocations from texts, especially those related to similar topics, can form common information spaces in Wikipedia communities


Summary

Introduction

Wikipedia, the largest and most popular Web-based free encyclopedia, covers various fields of knowledge. Thanks to Wikipedia authors, the number of Wikiprojects representing different directions of scientific research is growing exponentially, and the task of identifying common information spaces in Wikipedia is becoming more important. Owing to the constant changes in the information community, the heterogeneity of information spaces is complemented by constant dynamism. To adequately identify the common information spaces of Wikipedia communities, it is necessary to raise the level of text processing, including solving problems of semantic processing of sources. In contrast to individual words, short text fragments (i.e., collocations) carry more specific semantic information about certain Wikiprojects. Extracting the similarity of text fragments with Natural Language Processing approaches makes it possible to identify common


