Topic modeling is a method of determining the topic of a text document by analyzing its semantics and syntax. When analyzing text, the method determines the internal structure of a document or a set of documents and uses this information to classify or group similar words by topic. It also helps to identify the main trends of interest or information in a text document. For example, many people are interested in online shopping, politics, sports, economics, society, and so on. Various online and offline data mining methods and algorithms are used to determine the topic of a text. Most of them rely on a mechanism based on the semantic characteristics of the language and the subject of the text. The main idea of this study is to develop a methodology that can be used effectively for topic modeling of text in different languages. First, the model preprocesses the text, which includes tokenization, stop-word removal, and lemmatization. Preprocessing the text and filtering out inappropriate text elements reduce the size of the text and improve classification performance. The algorithm also assumes the presence of n topics in a text document and, based on this assumption, generates the processed document term matrix (PDTM) for the document. The PDTM is a two-dimensional matrix that assigns a numerical value to each word in the text based on the frequency of its occurrence in the document, and then correlates this word with each of the assumed topics. The PDTM is generated to store the tokenized words. The proposed model and its results are described in detail in the methodology and discussion sections of this article.
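
The preprocessing and term-counting steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the stop-word list, the `naive_lemma` rule (which only strips a trailing "s"), and all function names are assumptions made for the example. The step that relates each word to the n assumed topics is left out because the abstract does not specify how it is done.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list used only for this illustration.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}


def naive_lemma(token: str) -> str:
    """Crude stand-in for lemmatization (an assumption, not a real lemmatizer):
    strips a single trailing plural 's' from longer tokens."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(text: str) -> list[str]:
    """Tokenize on ASCII letters, drop stop words, apply the naive lemma rule."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_lemma(t) for t in tokens if t not in STOP_WORDS]


def term_matrix(docs: list[str]) -> tuple[list[str], list[list[int]]]:
    """Return the sorted vocabulary and one row of term counts per document."""
    counts = [Counter(preprocess(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[w] for w in vocab] for c in counts]


docs = ["The markets are falling.", "Sports fans and the markets."]
vocab, matrix = term_matrix(docs)
print(vocab)   # ['falling', 'fan', 'market', 'sport']
print(matrix)  # [[1, 0, 1, 0], [0, 1, 1, 1]]
```

Each row of `matrix` corresponds to one document and each column to one vocabulary term, so the entry is the frequency of that term in that document. A real pipeline would replace the stop-word set and the lemma rule with language-specific resources, which matters for the multilingual setting the study targets.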