Algorithms for Searching Verbal Markers of Identity in Modern Scientific Discourse

Oksana V Goncharova, Zaur A Zavrumov, Svetlana Khaleeva

doi:10.29025/2079-6021-2024-2-18-29

Open Access

Abstract

This article examines how identity is verbalized, using data mining techniques. The research material consists of English-language texts from online scientific repositories and e-libraries that address various concepts of youth identity. A methodology based on modern natural language processing and machine learning tools was developed and tested as part of the research. The Natural Language Toolkit (NLTK) library was used for tokenization and POS tagging, and the frequency of tokens occurring in the context of "identity" was calculated. Word embeddings from a pre-trained Word2Vec model and the K-means algorithm were then used to analyze and cluster words by semantic proximity; the Gensim and Scikit-learn libraries were used to work with the Word2Vec model. The results show that in English scientific discourse a young person's identity is verbalized within 9 semantic categories: behavior, communities, communication, education, identity, language, practice, complexity, and science. The most common of these are education (33%), language (21%), and communities (18%). N-gram analysis made it possible to identify semantic fields, establish their attributes, and evaluate the similarity of texts, which provided the most accurate vector-space search for semantically close n-grams. Optimization made it possible to establish a similarity measure for ranking phrases against a query and to assign each n-gram a ranking weight. Further improvements could be achieved by adding statistical word weighting, such as TF-IDF. The proposed system can search a large text array for related phrases with similar meanings.
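Two of the steps named in the abstract, counting tokens in the environment of "identity" and ranking phrases against a query by a similarity measure, can be illustrated with a minimal sketch. This is not the authors' pipeline: it uses only the Python standard library, a regex tokenizer stands in for NLTK, and TF-IDF weighted bag-of-words vectors stand in for Word2Vec embeddings. The sentences, the window size, and the function names (`context_counts`, `tfidf_vectors`, `rank_phrases`) are hypothetical and chosen for illustration only.

```python
import math
import re
from collections import Counter

# Toy corpus: hypothetical sentences, not the study's data.
DOCS = [
    "Youth identity is shaped by language and education in school communities.",
    "Language practice in online communities supports identity formation.",
    "Education policy rarely addresses the identity of young people.",
]


def tokenize(text):
    """Lowercase and keep alphabetic runs; a simple stand-in for NLTK tokenization."""
    return re.findall(r"[a-z]+", text.lower())


def context_counts(docs, target="identity", window=3):
    """Count tokens that occur within `window` positions of `target`."""
    counts = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        for i, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, i - window), i + window + 1
                # Skip the target occurrence itself (position i).
                counts.update(
                    t for j, t in enumerate(tokens[lo:hi], lo) if j != i
                )
    return counts


def tfidf_vectors(phrases):
    """Return one sparse TF-IDF vector (a dict) per phrase."""
    tokenized = [tokenize(p) for p in phrases]
    n = len(tokenized)
    df = Counter(t for toks in tokenized for t in set(toks))
    # Smoothed IDF, so no term receives a zero weight.
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    return [{t: c * idf[t] for t, c in Counter(toks).items()} for toks in tokenized]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_phrases(query, phrases):
    """Rank candidate phrases by TF-IDF cosine similarity to the query."""
    vectors = tfidf_vectors(phrases + [query])
    query_vec = vectors[-1]
    scored = [(cosine(query_vec, v), p) for v, p in zip(vectors, phrases)]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    print(context_counts(DOCS).most_common(5))
    candidates = ["language practice", "education policy", "youth identity"]
    for score, phrase in rank_phrases("identity of youth", candidates):
        print(f"{score:.3f}  {phrase}")
```

Because the vectors here are lexical, phrases that share no words with the query score zero. Replacing `tfidf_vectors` with embedding-based vectors, as the abstract describes, would let semantically close n-grams match without word overlap.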

Journal: Current Issues in Philology and Pedagogical Linguistics
Publication Date: Jun 25, 2024
Citations: 1
License type: CC BY
