Abstract

Extracting similar text fragments from weakly formalized data is a task of natural language processing and intelligent data analysis, used to solve the problem of automatically identifying connected knowledge fields. To search for such common communities in Wikipedia, we propose a logical-algebraic model for extracting similar collocations as an additional processing stage. Using the Stanford Part-Of-Speech tagger and the Stanford Universal Dependencies parser, we identify the grammatical characteristics of collocation words, and with WordNet synsets we choose their synonyms. Our dataset includes Wikipedia articles from different portals and projects. The experimental results show the frequencies of synonymous text fragments in Wikipedia articles that form common information spaces. The number of highly frequent synonymous collocations can provide an indication of key common up-to-date Wikipedia communities.
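The pipeline described in the abstract compares collocations word by word, using grammatical characteristics and synonym sets. A minimal sketch of that idea is shown below; the POS tags and the small synonym table are hypothetical stand-ins for the Stanford tagger output and WordNet synset lookups used in the paper, not the authors' actual model.

```python
# Toy sketch: two collocations are treated as similar when their words
# match position by position, either exactly or via a synonym set, and
# their part-of-speech patterns agree. SYNONYMS is a hypothetical
# stand-in for WordNet synsets; the POS tags stand in for the output
# of the Stanford Part-Of-Speech tagger.

SYNONYMS = {
    "big": {"large", "huge"},
    "large": {"big", "huge"},
    "picture": {"image"},
    "image": {"picture"},
}

def words_match(w1: str, w2: str) -> bool:
    """True if the words are identical or listed as synonyms."""
    return w1 == w2 or w2 in SYNONYMS.get(w1, set())

def collocations_similar(c1, c2) -> bool:
    """Compare two collocations given as lists of (word, pos_tag) pairs."""
    if len(c1) != len(c2):
        return False
    return all(
        pos1 == pos2 and words_match(w1, w2)
        for (w1, pos1), (w2, pos2) in zip(c1, c2)
    )

# Example: "big picture" (ADJ NOUN) vs "large image" (ADJ NOUN)
print(collocations_similar(
    [("big", "ADJ"), ("picture", "NOUN")],
    [("large", "ADJ"), ("image", "NOUN")],
))  # → True
```

In practice the synonym table would be replaced by WordNet synset queries and the tags by real parser output; the point here is only the position-wise comparison of words and grammatical patterns.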

Highlights

  • Wikipedia, the largest and most popular Web-based free encyclopedia, covers various fields of knowledge

  • We propose an information technology for identifying the semantic proximity of short text fragments in Wikipedia articles, which allows the formation of common information spaces and thereby provides relevant search and access to Wikipedia articles written on related topics

  • We confirm the hypothesis that many synonymous collocations from texts, especially those related to similar topics, can form common information spaces in Wikipedia communities


Summary

Introduction

Wikipedia, the largest and most popular Web-based free encyclopedia, covers various fields of knowledge. Thanks to Wikipedia authors, the number of Wikiprojects representing different directions of scientific research is growing exponentially, and the task of identifying common information spaces in Wikipedia is becoming more important. Owing to the constant changes in the information community, the heterogeneity of information spaces is complemented by constant dynamism. To adequately identify the common information spaces of Wikipedia communities, it is necessary to raise the level of text processing, including solving problems of semantic processing of sources. In contrast to individual words, short text fragments (i.e., collocations) carry more specific semantic information about certain Wikiprojects. Extracting the similarity of text fragments with Natural Language Processing approaches makes it possible to identify common


