Dynamic Topic-Guided Deep Learning for Scalable and Interpretable Dark Web Text Analysis

Abstract

The Dark Web is a hidden portion of the internet accessible only via specialized software such as Tor. It offers anonymity for legitimate privacy needs as well as for illegal activities such as drug sales, malware trading, hacking forums, and illicit marketplaces, and its noisy, voluminous text complicates classification. Existing methods integrate Latent Dirichlet Allocation (LDA) topic modeling weights with TextCNN: Dark Web texts are preprocessed to derive class-specific keywords, which reduces vector dimensions approximately 300-fold and yields higher accuracy on DUTA-10k (25 classes) and CoDA (10 classes) than SVM, Naive Bayes, and prior benchmarks. Despite outperforming baselines, these methods have limitations: they depend on static datasets and so neglect shifts in content over time; keyword tuning varies because classes overlap; real-time processing is absent; and keeping the topic model and the neural network as separate components obscures neural interpretability. This paper proposes a unified deep learning architecture that embeds topic modeling directly into TextCNN for real-time classification, dynamically pruning irrelevant terms while exposing neural influences through integrated keyword analysis. Key benefits include rapid threat detection for operational cybersecurity, enhanced explainability that bridges probabilistic weights and deep features, reduced hyperparameter sensitivity for robust generalization, and scalable deployment across evolving Dark Web landscapes, advancing automated intelligence gathering.

Key Words: Dark Web, Latent Dirichlet Allocation (LDA), real-time classification, generalization, TextCNN, operational cybersecurity.
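
The keyword-based dimension reduction mentioned in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the per-class word weights below are invented toy values standing in for LDA topic-word weights, and the function names `select_keywords` and `prune_tokens` are hypothetical. The idea shown is only that keeping the top-weighted words of each class gives a much smaller vocabulary, and tokens outside it are dropped before a classifier such as TextCNN sees the text.

```python
from typing import Dict, List, Set


def select_keywords(class_word_weights: Dict[str, Dict[str, float]], top_k: int) -> Set[str]:
    """Collect the top_k highest-weighted words of every class into one vocabulary."""
    vocab: Set[str] = set()
    for weights in class_word_weights.values():
        # Rank this class's words by weight, highest first.
        ranked = sorted(weights, key=lambda word: weights[word], reverse=True)
        vocab.update(ranked[:top_k])
    return vocab


def prune_tokens(tokens: List[str], vocab: Set[str]) -> List[str]:
    """Drop every token that is not a selected keyword, preserving order."""
    return [token for token in tokens if token in vocab]


# Toy weights (assumed values, standing in for LDA topic-word weights).
toy_weights = {
    "drugs": {"pills": 0.40, "shipping": 0.25, "the": 0.05},
    "hacking": {"exploit": 0.50, "malware": 0.30, "the": 0.04},
}

keywords = select_keywords(toy_weights, top_k=2)
pruned = prune_tokens("the malware exploit ships with the pills".split(), keywords)
print(sorted(keywords))
print(pruned)
```

In a real pipeline the weights would come from a trained topic model, and `top_k` is exactly the kind of hyperparameter whose tuning the abstract describes as sensitive to class overlap.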

Similar Papers
  • Research Article
  • Cite Count: 25
  • 10.1186/s40537-022-00605-3
An intelligent literature review: adopting inductive approach to define machine learning applications in the clinical domain
  • Apr 28, 2022
  • Journal of Big Data
  • Renu Sabharwal + 1 more

Big data analytics utilizes different techniques to transform large volumes of big datasets. The analytics techniques utilize various computational methods such as Machine Learning (ML) for converting raw data into valuable insights. The ML assists individuals in performing work activities intelligently, which empowers decision-makers. Since academics and industry practitioners have growing interests in ML, various existing review studies have explored different applications of ML for enhancing knowledge about specific problem domains. However, in most of the cases existing studies suffer from the limitations of employing a holistic, automated approach. While several researchers developed various techniques to automate the systematic literature review process, they also seemed to lack transparency and guidance for future researchers. This research aims to promote the utilization of intelligent literature reviews for researchers by introducing a step-by-step automated framework. We offer an intelligent literature review to obtain in-depth analytical insight of ML applications in the clinical domain to (a) develop the intelligent literature framework using traditional literature and Latent Dirichlet Allocation (LDA) topic modeling, (b) analyze research documents using traditional systematic literature review revealing ML applications, and (c) identify topics from documents using LDA topic modeling. We used a PRISMA framework for the review to harness samples sourced from four major databases (e.g., IEEE, PubMed, Scopus, and Google Scholar) published between 2016 and 2021 (September). The framework comprises two stages—(a) traditional systematic literature review consisting of three stages (planning, conducting, and reporting) and (b) LDA topic modeling that consists of three steps (pre-processing, topic modeling, and post-processing). The intelligent literature review framework transparently and reliably reviewed 305 sample documents.

  • Conference Article
  • Cite Count: 6
  • 10.1109/iccst50977.2020.00094
Sentiment Analysis of Consumer-Generated Online Reviews of Physical Bookstores Using Hybrid LSTM-CNN and LDA Topic Model
  • Oct 1, 2020
  • Yan Wang + 2 more

The physical bookstore is a leader of cultural trends, a carrier of national reading and a provider of public cultural services, and it embodies the cultural soft power of a city. The wide use of Internet e-commerce platforms and changes in people's reading habits have had a great impact on physical bookstores, resulting in poor overall profitability. To support the sustainable development of physical bookstores, we mine and analyze consumer-generated online reviews. In this paper, a method of sentiment analysis based on Hybrid LSTM-CNN (Hybrid Long Short-Term Memory-Convolutional Neural Network) and the LDA (Latent Dirichlet Allocation) topic model is proposed. First, the Hybrid LSTM-CNN model is used to classify reviews, and then the LDA topic model is used to extract features of positive and negative reviews. The results show that the Hybrid LSTM-CNN model performs better than the classic LSTM and CNN in sentiment classification. The LDA model reveals that consumers have a positive attitude towards the products, context and ambiance of physical bookstores, and a negative attitude towards price and service. This method studies consumer-generated online reviews of physical bookstores from two aspects, sentiment classification and topic mining, which can help physical bookstore operators learn of consumer feedback in time.

  • Preprint Article
  • 10.2196/preprints.69983
Discovering Topics and Trends in Artificial Intelligence Chatbots in Medicine: Using Latent Dirichlet Allocation Topic Modeling (Preprint)
  • Dec 12, 2024
  • Ming Yue Ni + 5 more

BACKGROUND With the widespread adoption of the internet and smart devices, chatbots have emerged as significant auxiliary tools for public health activities. Despite the increasing application of chatbots in the medical field, comprehensive assessments of research topics and trends in this area remain relatively scarce. OBJECTIVE This study analyzed the application topics of chatbot technology in the medical field and explored the trends of these topics across different time periods, various journals, and different countries. METHODS In this study, a bibliometric approach was used to systematically search the PubMed, CINAHL, Web of Science and Embase databases for literature on medicine and chatbots between 2004 and 2024. By applying Latent Dirichlet Allocation (LDA) topic modeling, the study identified and analyzed the thematic applications of chatbots in the medical field, and explored the temporal evolution of these topics as well as their distribution characteristics across journals and countries. RESULTS We ultimately identified 3,029 articles for analysis. Utilizing the Latent Dirichlet Allocation (LDA) topic modeling technique, we identified nine core topics from the abstracts: ChatGPT medical quiz accuracy research, digital healthcare support assistants, mental health intervention research, epidemic health conversation application research, cancer patient diagnosis and treatment care, artificial intelligence (AI) healthcare education potential research, natural language processing models, human-computer interaction emotion research, and AI reading assistance systems. This study also found that these topics have shown diverse developmental trajectories over time, reflecting the evolution of research interests. In addition, researchers from different journals and countries have shown significant differences in the topics they focus on. 
CONCLUSIONS This study analyzed the topic distribution, temporal trends, journal, and country distribution characteristics of chatbots in the medical field. The results revealed popular and less researched topics, as well as emerging directions and trends, providing researchers with a tool for rapid identification. These findings not only provide guidance for researchers in selecting research directions but also offer references for journals and countries in determining research priorities, formulating strategic plans, and promoting international collaborative research.

  • Conference Article
  • Cite Count: 21
  • 10.2991/sekeie-14.2014.47
Text Similarity Computing Based on LDA Topic Model and Word Co-occurrence
  • Jan 1, 2014
  • Minglai Shao + 1 more

The LDA (Latent Dirichlet Allocation) topic model has been widely applied to text clustering owing to its efficient dimension reduction. The prevalent method is to model the text set with an LDA topic model, to make inferences by Gibbs sampling, and to calculate text similarity with the JS (Jensen-Shannon) distance. However, the JS distance cannot distinguish semantic associations among text topics. To address this defect, a new text similarity computing algorithm based on a hidden topic model and word co-occurrence analysis is introduced. Tests are carried out to verify the clustering effect of this improved algorithm. Results show that this method can effectively improve text similarity computation and text clustering accuracy. Keywords: topic model; LDA (Latent Dirichlet Allocation); JS (Jensen-Shannon) distance; word co-occurrence; similarity
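
As a point of reference for the baseline this abstract describes, the following is a minimal sketch of a Jensen-Shannon distance between two document-topic distributions. It assumes base-2 logarithms and the square root of the JS divergence as the distance, which is one common convention; the abstract does not say which formulation the authors use, and the topic vectors in the example are made up.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Square root of the Jensen-Shannon divergence between two distributions."""
    # The mixture m is positive wherever p or q is positive, so KL(. || m) is finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(jsd, 0.0))


# Hypothetical topic distributions for three documents over three topics.
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.6, 0.3, 0.1]
doc_c = [0.1, 0.1, 0.8]

print(js_distance(doc_a, doc_b))  # small: similar topic mix
print(js_distance(doc_a, doc_c))  # larger: different dominant topic
```

With base-2 logarithms the distance lies between 0 (identical distributions) and 1 (disjoint support). Because it compares only topic proportions, it cannot tell whether two different topics are semantically related, which is the limitation the abstract sets out to address.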

  • Research Article
  • Cite Count: 2
  • 10.2139/ssrn.3708327
Trends in COVID-19 Publications: Streamlining Research Using NLP and LDA
  • Jan 1, 2020
  • SSRN Electronic Journal
  • Akash Gupta + 3 more

Research publications related to the novel coronavirus disease COVID-19 are rapidly growing in number. However, current online literature hubs, even with artificial intelligence, are inadequate for identifying the relative strength of research topics. Hence, we aimed to develop a comprehensive Latent Dirichlet Allocation (LDA) topic model using natural language processing (NLP) techniques, provide visualisations for temporal trends, and apply our methodology to improve existing online literature hubs. Using the search term “COVID”, abstracts were extracted from PubMed®, from January to July 2020 (N=16346). An LDA topic model was trained on 81% of abstracts. Weekly temporal trends were visualised as a heatmap on all abstracts. Then, we tested our methodology on over 23,000 abstracts gathered from January 2020 to September 2020 from LitCovid, a literature hub from the National Center for Biotechnology Information. We use our topic model to subdivide LitCovid’s eight categories into corresponding LDA topics. The optimised LDA topic model, created using PubMed® data, produced 25 comprehensive topics with no significant overlap. There were temporal changes for topics: prominence of “Mental Health” and “Socioeconomic Impact” increased, “Genome Sequence” decreased, and “Epidemiology” remained relatively constant. We identified inadequate representation of “Airborne Transmission Protection”. Importantly, research on masks and PPE is skewed towards clinical applications with a lack of population-based epidemiological research. Our methodology, when applied to LitCovid, identified important topics within each LitCovid category. For example, “Case Report” was split into topics such as “Pulmonary” and “Oncology” as well as the under-represented topics “Haematology” and “Gastroenterology”. Our work allows for comprehensive topic identification and intuitive visualisation of temporal trends in COVID-19 research. 
Implementation of the methodology complements existing online literature hubs and identifies underrepresented topics such as population-based studies on masks that may be of significant public interest. Funding Statement: None to declare. Declaration of Interests: There are no conflicts of interest.

  • Research Article
  • Cite Count: 2
  • 10.54517/esp.v8i3.1958
Analyze IMDb movies by sentiment and topic analysis
  • Oct 25, 2023
  • Environment and Social Psychology
  • Ningjing Ouyang

Movies are an important cultural form, carrying multiple levels of meaning such as artistic, entertainment and social value. Movie review and rating data sets are huge, and deep learning and natural language processing methods are widely used today. Advances in big data and deep learning offer unprecedented opportunities to understand moviegoer behavior and preferences while providing a cost-effective way to gain insights relevant to the entertainment industry. This project conducts sentiment analysis, topic modeling, and visual statistical analysis based on the IMDb movie data set to identify key factors and deeper insights that influence successful decision-making in film production. This project first uses a word embedding method to vectorize the movie review text, and then uses Bidirectional Long Short-Term Memory (Bi-LSTM) to perform sentiment classification. In addition, statistical methods such as visualization were used to reach conclusions such as that November has the highest average number of movie releases, and to identify trends, patterns and relationships between the variables of IMDb movies. Finally, a Latent Dirichlet Allocation (LDA) topic model was constructed, which found that the important topic with increased demand is light entertainment movies, highlighting the commercial feasibility of comedy movies as a profitable business model. In summary, this project uses an emotion-topic fusion analysis method based on the Bi-LSTM emotion classification method and the LDA topic modeling method. The results show that the Bi-LSTM model can better identify positive and negative emotions in movie reviews, and the LDA topic model performs well in mining popular topics.

  • Research Article
  • Cite Count: 25
  • 10.1016/j.foodcont.2020.107435
A topic model approach to identify and track emerging risks from beeswax adulteration in the media
  • Jul 2, 2020
  • Food Control
  • Agnes Rortais + 8 more

  • Research Article
  • Cite Count: 6
  • 10.1186/s13634-021-00761-3
Intelligent radar software defect classification approach based on the latent Dirichlet allocation topic model
  • Jul 20, 2021
  • EURASIP Journal on Advances in Signal Processing
  • Xi Liu + 6 more

Existing intelligent software defect classification approaches do not consider radar characteristics and prior statistical information. Thus, when these approaches are applied to radar software testing and validation, the precision and recall rates of defect classification are poor, which affects how effectively software defects can be reused. To solve this problem, a new intelligent defect classification approach based on the latent Dirichlet allocation (LDA) topic model is proposed for radar software in this paper. The proposed approach includes the defect text segmentation algorithm based on the dictionary of radar domain, the modified LDA model combining radar software requirement, and the top acquisition and classification approach of radar software defect based on the modified LDA model. The proposed approach is applied to typical radar software defects to validate its effectiveness and applicability. The application results illustrate that the prediction precision rate and recall rate of the proposed approach are improved by up to 15 ~ 20% compared with the other defect classification approaches. Thus, the proposed approach can be applied effectively in the segmentation and classification of radar software defects to improve the adequacy of defect identification in radar software.

  • Research Article
  • Cite Count: 14
  • 10.1111/jjns.12520
Latent Dirichlet allocation topic modeling of free-text responses exploring the negative impact of the early COVID-19 pandemic on research in nursing.
  • Nov 30, 2022
  • Japan Journal of Nursing Science
  • Madoka Inoue + 4 more

To derive latent topics from free-text responses on the negative impact of the pandemic on research activities and determine similarities and differences in the resulting themes between academic-based and clinical-based researchers. We performed a secondary analysis of free-text responses from a cross-sectional online survey conducted by the Japan Academy of Nursing Science of its members in early 2020. The participants were categorized into two groups by workplace (academic-based and clinical-based researchers). Latent Dirichlet allocation (LDA) topic modeling was used to extract latent topics statistically and list important keywords/text associated with the topics. After organizing similar topics by principal component analysis (PCA), we finally derived topic-associated themes by reading the keywords/texts and determining the similarity and differences of the themes between the two groups. A total of 201 respondents (163 academic-based and 38 clinical-based researchers) provided free-text responses. LDA identified eight and three latent topics for the academic-based and clinical-based researchers, respectively. While PCA re-grouped the eight topics derived from the former group into four themes, no merging of the topics from the latter group was performed resulting in three themes. The only theme common to the two groups was "barriers to conducting research," with the remaining themes differing between the groups. Using LDA topic modeling with PCA, we identified similarities and differences in the themes described in free-text responses about the negative impact of the pandemic between academic-based and clinical-based researchers. Measures to mitigate the negative impact of pandemics on nursing research may need to be tailored separately.

  • Supplementary Content
  • Cite Count: 1
  • 10.20381/ruor-5492
Content Management and Hashtag Recommendation in a P2P Social Networking Application
  • Jan 1, 2015
  • uO Research (University of Ottawa)
  • Keerthi Nelaturu

This thesis focuses on developing an online social network application with a Peer-to-Peer infrastructure motivated by the BestPeer++ architecture and the BATON overlay structure. BestPeer++ is a data processing platform which enables data sharing between enterprise systems. BATON is an open-source project which implements a peer-to-peer network with a balanced-tree topology. We designed and developed the components for users to manage their accounts, maintain friend relationships, and publish their contents with privacy control, newsfeed, and notification requests in this social networking application. We also developed a Hashtag Recommendation system for this social networking application. A user may invoke a recommendation procedure while writing content. After being invoked, the recommendation procedure returns a list of candidate hashtags, and the user may select one hashtag from the list and embed it into the content. The proposed approach uses the Latent Dirichlet Allocation (LDA) topic model to derive the latent or hidden topics of different content. The LDA topic model is a well-developed data mining algorithm and is generally effective in analyzing text documents of different lengths. The topic model is further used to identify the candidate hashtags that are associated with the texts in the published content through their association with the derived hidden topics. We considered different recommendation methods for the procedure to select candidate hashtags from different content. Some methods consider the hashtags contained in the contents of the whole social network or of the user's own content. These are content-based recommendation techniques which match the user's own profile with the profiles of items. Some methods consider the hashtags contained in contents of the friends or of the similar users. These are collaborative filtering based recommendation

  • Research Article
  • Cite Count: 1
  • 10.1177/21582440251390678
Identify Future Trending Topics by Thematic Mapping of the Cinema Phenomenon Using Machine Learning and LDA
  • Oct 1, 2025
  • Sage Open
  • Türker Elitaş + 2 more

This study aims to evaluate 16,891 academic publications in the field of cinema between 1980 and 2024 using bibliometric analysis and topic modeling methods. Based on data obtained from the Web of Science (WOS) and Scopus databases, bibliometric findings were received, including the distribution of publications by year, the annual number and rate of citations per article, the most productive authors in the field, the production status of authors over time, the countries of authors and the number of articles they published, and the journals with the highest number of publications. Data obtained from the Web of Science (WOS) and Scopus databases were also used to identify prominent word groups and themes in the articles using text mining and Latent Dirichlet Allocation (LDA) topic modeling. As a result of the analysis, 12 main themes emerged based on word-text relationships and the weight of publications. The findings show that cinema studies have developed with increasing momentum over the years and that there has been a growing focus on certain topics. This study systematically examines the development of cinema studies literature through descriptive content analysis and LDA topic modeling. In this respect, it is important in that it systematically reveals the structural and thematic transformation of academic production in the field of cinema and provides a theoretical and methodological basis for future research. It also makes a current and multidimensional contribution to the discipline in terms of revealing the increasingly important digital trends, cultural representations, and interdisciplinary developments in cinema studies.

  • Research Article
  • Cite Count: 16
  • 10.3390/make5020029
Evaluating the Coverage and Depth of Latent Dirichlet Allocation Topic Model in Comparison with Human Coding of Qualitative Data: The Case of Education Research
  • May 14, 2023
  • Machine Learning and Knowledge Extraction
  • Gaurav Nanda + 5 more

Fields in the social sciences, such as education research, have started to expand the use of computer-based research methods to supplement traditional research approaches. Natural language processing techniques, such as topic modeling, may support qualitative data analysis by providing early categories that researchers may interpret and refine. This study contributes to this body of research and answers the following research questions: (RQ1) What is the relative coverage of the latent Dirichlet allocation (LDA) topic model and human coding in terms of the breadth of the topics/themes extracted from the text collection? (RQ2) What is the relative depth or level of detail among identified topics using LDA topic models and human coding approaches? A dataset of student reflections was qualitatively analyzed using LDA topic modeling and human coding approaches, and the results were compared. The findings suggest that topic models can provide reliable coverage and depth of themes present in a textual collection comparable to human coding but require manual interpretation of topics. The breadth and depth of human coding output is heavily dependent on the expertise of coders and the size of the collection; these factors are better handled in the topic modeling approach.

  • Research Article
  • Cite Count: 25
  • 10.1093/jamiaopen/ooad112
Topic modeling on clinical social work notes for exploring social determinants of health factors
  • Jan 4, 2024
  • JAMIA Open
  • Shenghuan Sun + 4 more

Objective: Existing research on social determinants of health (SDoH) predominantly focuses on physician notes and structured data within electronic medical records. This study posits that social work notes are an untapped, potentially rich source for SDoH information. We hypothesize that clinical notes recorded by social workers, whose role is to ameliorate social and economic factors, might provide a complementary information source of data on SDoH compared to physician notes, which primarily concentrate on medical diagnoses and treatments. We aimed to use word frequency analysis and topic modeling to identify prevalent terms and robust topics of discussion within a large cohort of social work notes including both outpatient and in-patient consultations. Materials and methods: We retrieved a diverse, deidentified corpus of 0.95 million clinical social work notes from 181 644 patients at the University of California, San Francisco. We conducted word frequency analysis related to ICD-10 chapters to identify prevalent terms within the notes. We then applied Latent Dirichlet Allocation (LDA) topic modeling analysis to characterize this corpus and identify potential topics of discussion, which was further stratified by note types and disease groups. Results: Word frequency analysis primarily identified medical-related terms associated with specific ICD-10 chapters, though it also detected some subtle SDoH terms. In contrast, the LDA topic modeling analysis extracted 11 topics explicitly related to social determinants of health risk factors, such as financial status, abuse history, social support, risk of death, and mental health. 
The topic modeling approach effectively demonstrated variations between different types of social work notes and across patients with different types of diseases or conditions. Discussion: Our findings highlight LDA topic modeling’s effectiveness in extracting SDoH-related themes and capturing variations in social work notes, demonstrating its potential for informing targeted interventions for at-risk populations. Conclusion: Social work notes offer a wealth of unique and valuable information on an individual’s SDoH. These notes present consistent and meaningful topics of discussion that can be effectively analyzed and utilized to improve patient care and inform targeted interventions for at-risk populations.

  • Research Article
  • Cite Count: 1
  • 10.16980/jitc.16.5.202010.779
Deriving the Determinants of Hotel Service Quality According to Hotel Class in Korea by Using LDA Topic Modeling
  • Oct 30, 2020
  • Korea International Trade Research Institute
  • Won-Sik Kim + 1 more

Purpose - This study aims to derive the determinant factors of hotel service quality according to hotel class by using OTA (online travel agency) reviews. Design/Methodology/Approach - This study used OTA reviews as big data and grouped online reviews according to the hotel’s star rating: low-class hotels (1-3 stars) and high-class hotels (4-5 stars). A total of 378,339 online reviews were extracted from 4,695 hotels on the OTA site (Hotels.com), and the factors of service quality based on hotel class were categorized by the LDA (Latent Dirichlet Allocation) topic modeling technique. Findings - There were remarkable differences in factors of service quality between low-class and high-class hotels. The service quality factors (topics) of low-class hotels (1-3 stars) were categorized as human service, cleanliness, accessibility to hotel, accessibility to surrounding resources, room noise, breakfast availability, and cost-effectiveness. The determinants of high-class hotels (4-5 stars) were grouped into family-friendliness, amenities, cleanliness, room service, human service, excellence in room view, and room convenience. Research Implications - Theoretically, this study suggested a differentiated approach to deriving the determinant factors of hotel service quality by using LDA topic modeling. As a practical implication, because this study showed different service quality factors according to hotel class, hotels are expected to develop differentiated strategies to attract customers.

  • Research Article
  • Cite Count: 12
  • 10.1142/s0219649224500771
Library Similar Literature Screening System Research Based on LDA Topic Model
  • Jul 12, 2024
  • Journal of Information & Knowledge Management
  • Liang Gao + 2 more

Science and technology are highly inheritable undertakings, and no scientific and technological worker can make good progress without the experience and achievements of predecessors or others. In the face of an ever-expanding pool of literature, the ability to efficiently and accurately search for similar works is a major challenge in current research. This paper uses the Latent Dirichlet Allocation (LDA) topic model to construct feature vectors for the title and abstract, and the bag-of-words model to construct feature vectors for publication type. The similarity between the feature vectors is measured by calculating their cosine values. The experiment demonstrated that the precision, recall and WSS95 scores of the algorithm proposed in the study were 90.55%, 98.74% and 52.45% under the literature title element, and 91.78%, 99.58% and 62.47% under the literature abstract element, respectively. Under the literature publication type element, the precision, recall and WSS95 scores of the proposed algorithm were 90.77%, 98.05% and 40.14%, respectively. Under the combination of literature title, abstract and publication type elements, the WSS95 score of the proposed algorithm was 79.03%. In summary, the study proposes a literature screening (LS) algorithm based on the LDA topic model that performs robustly, and a similar LS system designed on this basis can effectively improve the efficiency of LS.
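
The cosine measure mentioned in this abstract can be sketched in a few lines. The feature vectors below are invented for illustration and the helper names are hypothetical; in the system described, they would be LDA topic vectors for titles and abstracts and bag-of-words vectors for publication type, and how the authors combine those elements is not stated here.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(
    query: Sequence[float], candidates: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Score every candidate against the query and sort from most to least similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical topic vectors: one query paper and two candidate papers.
query_vector = [0.7, 0.2, 0.1]
candidate_vectors = {
    "paper_a": [0.6, 0.3, 0.1],
    "paper_b": [0.1, 0.1, 0.8],
}

ranking = rank_by_similarity(query_vector, candidate_vectors)
for name, score in ranking:
    print(name, round(score, 3))
```

Cosine similarity ignores vector length, so it compares the relative emphasis across topics rather than absolute weights.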
