Developing Predictive and Explainable Models for Cryptocurrency Delistings: A Case Study of Binance Exchange
This study develops an explainable machine learning model to predict cryptocurrency delistings on Binance by integrating quantitative indicators with qualitative data from news and Reddit, using LDA for trend extraction and comparing XGBoost, LightGBM, and CatBoost; XGBoost achieves the best performance, highlighting key predictors such as price drops and community reactions for early warning and investor protection.
Abstract This study develops an explainable machine learning model to predict cryptocurrency delistings using Binance data. It combines quantitative indicators (price, volume) with qualitative data from real‐time news and Reddit. Latent Dirichlet Allocation (LDA) is used to extract topic trends and community reactions, which are transformed into time‐series features. XGBoost, LightGBM, and CatBoost are compared, with SHAP applied for model interpretability. Results show that sharp price drops, repeated risk‐topic exposure, and Reddit responses strongly predict delisting. XGBoost achieves the best performance, offering practical insights for early warning systems and investor protection.
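As a rough illustration of how per-document topic proportions might be turned into time-series features, the sketch below averages the weight of one topic per day and counts recent high-exposure days. The function names, the 0.5 threshold, the window length, and the toy data are all assumptions made for illustration; the abstract does not describe the study's actual feature pipeline.

```python
from collections import defaultdict
from datetime import date, timedelta


def daily_topic_share(docs, topic):
    """Average the weight of one topic over all documents posted on each day.

    docs: iterable of (date, topic_weights) pairs, where topic_weights is a
    list of per-topic proportions (for example, from a fitted LDA model).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for day, weights in docs:
        sums[day] += weights[topic]
        counts[day] += 1
    return {day: sums[day] / counts[day] for day in sums}


def rolling_exposure(share_by_day, end, window=7, threshold=0.5):
    """Count days in the window ending at `end` whose share exceeds threshold.

    Days with no documents are treated as zero exposure.
    """
    days = (end - timedelta(days=i) for i in range(window))
    return sum(1 for d in days if share_by_day.get(d, 0.0) > threshold)


# Toy data: topic 1 plays the role of a hypothetical "delisting risk" topic.
docs = [
    (date(2024, 1, 1), [0.9, 0.1]),
    (date(2024, 1, 1), [0.7, 0.3]),
    (date(2024, 1, 2), [0.2, 0.8]),
    (date(2024, 1, 3), [0.4, 0.6]),
]
risk_share = daily_topic_share(docs, topic=1)
risk_days = rolling_exposure(risk_share, end=date(2024, 1, 3), window=3)
```

Features of this kind (a daily share and a rolling exposure count) could then be joined with price and volume series by date before being passed to a gradient-boosted classifier.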
- Research Article
1
- 10.2458/jmmss.5397
- Oct 1, 2022
- Journal of Methods and Measurement in the Social Sciences
Qualitative methods can enhance our understanding of constructs that have not been well portrayed and enable nuanced depiction of experience from study participants who have not been broadly studied. However, qualitative data require time and effort to train raters to achieve validity and reliability. This study compares recent advances in Natural Language Processing (NLP) models with human coding. This web-based study (N=1,253; 3,046 free-text entries, averaging 64 characters per entry) included people with Duchenne Muscular Dystrophy (DMD), their siblings, and a representative comparison group. Human raters (n=6) were trained over multiple sessions in content analysis as per a comprehensive codebook. Three prompts addressed distinct aspects of participants’ aspirations. Unsupervised NLP was implemented using Latent Dirichlet Allocation (LDA), which extracts latent topics across all the free-text entries. Supervised NLP was done using a Bidirectional Encoder Representations from Transformers (BERT) model, which requires training the algorithm to recognize relevant human-coded themes across free-text entries. We compared the human-, LDA-, and BERT-coded themes. Study sample contained 286 people with DMD, 355 DMD siblings, and 997 comparison participants, age 8-69. Human coders generated 95 codes across the three prompts and had an average inter-rater reliability (Fleiss’s kappa) of 0.77, with minimal rater-effect (pseudo R2=4%). Compared to human coders, LDA does not yield easily interpretable themes. BERT correctly classified only 61-70% of the validation set. LDA and BERT required technical expertise to program and took approximately 1.15 minutes per open-text entry, compared to 1.18 minutes for human raters including training time. LDA and BERT provide potentially viable approaches to analyzing large-scale qualitative data, but both have limitations. When text entries are short, LDA yields latent topics that are hard to interpret. 
BERT accurately identified only about two thirds of new statements. Humans provided reliable and cost-effective coding in the web-based context. The upfront training enables BERT to process enormous quantities of text data in future work, which should examine NLP’s predictive accuracy given different quantities of training data.
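For readers unfamiliar with the inter-rater reliability statistic reported above, the following is a minimal sketch of Fleiss's kappa computed from a table of category counts. The function name and the toy tables are illustrative assumptions; they are not the study's data or code.

```python
def fleiss_kappa(ratings):
    """Fleiss's kappa for a table of category counts.

    ratings[i][j] is the number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    if any(sum(row) != n_raters for row in ratings):
        raise ValueError("each item needs the same number of ratings")

    # Per-item agreement: share of rater pairs that agree on the item.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    n_categories = len(ratings[0])
    totals = [sum(row[j] for row in ratings) for j in range(n_categories)]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)

    return (p_bar - p_e) / (1 - p_e)


# Toy tables: 4 items, 3 raters, 2 categories.
perfect = [[3, 0], [0, 3], [3, 0], [0, 3]]
split = [[2, 1], [1, 2], [2, 1], [1, 2]]
kappa_perfect = fleiss_kappa(perfect)
kappa_split = fleiss_kappa(split)
```

The sketch does not handle the degenerate case in which every rating falls into a single category, where chance agreement equals 1 and kappa is undefined.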
- Conference Article
5
- 10.52842/conf.caadria.2022.1.383
- Jan 1, 2022
This paper aims to improve the usability of qualitative urban big data sources by utilizing Natural Language Processing (NLP) as a promising AI-based technique. In this research, we designed a digital participation experiment by deploying an open-source and customizable asynchronous participation tool, "Consul Project", with 47 participants in the campus transformation process of the Singapore University of Technology and Design (SUTD). At the end of the data collection process with several debate topics and proposals, we analysed the qualitative data in entry scale, topic scale, and module scale. We investigated the impact of sentiment scores of each entry on the overall discussion and the sentiment scores of each introduction text on the ongoing discussions to trace the interaction and engagement. Furthermore, we used Latent Dirichlet Allocation (LDA) topic modelling to visualize the abstract topics that occurred in the participation experiment. The results revealed the links between different debates and proposals, which allow designers and decision makers to identify the most interacted arguments and engaging topics throughout participation processes. Eventually, this research presented the potentials of qualitative data while highlighting the necessity of adopting new methods and techniques, e.g., NLP, sentiment analysis, LDA topic modelling, to analyse and represent the collected qualitative data in asynchronous digital participation processes.
- Conference Article
3
- 10.1109/dsaa.2018.00069
- Oct 1, 2018
As we begin to leverage Big Data in health care settings and particularly in assessing patient-reported outcomes, there is a need for novel analytics to address unique challenges. One such challenge is in coding transcribed interview data, typically free-text entries of statements made by interviewees during face-to-face interviews. Conventional coding of such qualitative data into themes is labor-intensive and prone to inconsistencies. Latent Dirichlet Allocation (LDA) may offer statistical rigor in summarizing patients' concerns and coping strategies in a life-threatening illness. We aim to apply LDA to interview data collected as part of a prospective, longitudinal study of QOL in patients undergoing radical cystectomy and urinary diversion for bladder cancer. LDA showed that, prior to surgery, patients' priorities were primarily in cancer surgery and recovery. Six months after the surgery, however, their goals shifted to a desire to spend more time with family, resume work, and enjoy life to its fullest extent. Novel analytics such as LDA offer the possibility of summarizing personal goals in real time without the need for conventional fixed-length measures and qualitative data coding.
- Research Article
7
- 10.59298/nijep/2024/41916.1.1100
- Mar 11, 2024
- NEWPORT INTERNATIONAL JOURNAL OF ENGINEERING AND PHYSICAL SCIENCES
In an era dominated by an unprecedented deluge of textual information, the need for effective methods to make sense of large datasets is more pressing than ever. This article takes a pragmatic approach to unraveling the intricacies of topic modeling, with a specific focus on the widely used Latent Dirichlet Allocation (LDA) algorithm. The initial segment of the article lays the groundwork by exploring the practical relevance of topic modeling in real-world scenarios. It addresses the everyday challenges faced by researchers and professionals dealing with vast amounts of unstructured text, emphasizing the potential of topic modeling to distill meaningful insights from seemingly chaotic data. Moving beyond theoretical abstraction, the article then delves into the mechanics of Latent Dirichlet Allocation. Developed in 2003 by Blei, Ng, and Jordan, LDA provides a probabilistic framework to identify latent topics within documents. The article takes a step-by-step approach to demystify LDA, offering a practical understanding of its components and the Bayesian principles governing its operation. A significant portion of the article is dedicated to the practical implementation of LDA. It provides insights into preprocessing steps, parameter tuning, and model evaluation, offering readers a hands-on guide to applying LDA in their own projects. Real-world examples and case studies showcase how LDA can be a valuable tool for tasks such as document clustering, topic summarization, and sentiment analysis. However, the journey through LDA is not without challenges, and the article candidly addresses these hurdles. Topics such as determining the optimal number of topics, the sensitivity of results to parameter settings, and the interpretability of outcomes are discussed. This realistic appraisal adds depth to the article, helping readers navigate the nuances and potential pitfalls of employing LDA in practice. 
Beyond the technical intricacies, the article explores the broad spectrum of applications where LDA has proven its efficacy. From text mining and information retrieval to social network analysis and healthcare informatics, LDA has left an indelible mark on diverse domains. Through practical examples, the article illustrates how LDA can be adapted to different contexts, showcasing its versatility as a tool for uncovering latent patterns. Keywords: Topic Modeling, Latent Dirichlet Allocation, Text Mining, Natural Language Processing, Document Clustering, Bayesian Inference.
- Research Article
- 10.2139/ssrn.3452058
- Oct 1, 2019
- SSRN Electronic Journal
Defining Geographic Markets from Probabilistic Clusters: A Machine Learning Algorithm Applied to Supermarket Scanner Data
- Book Chapter
2
- 10.1007/978-981-10-2338-5_49
- Jan 1, 2016
Latent Dirichlet Allocation (LDA) is an unsupervised learning model that has been used in recent years to extract hidden topics from text. In this paper, we propose Activity-topic LDA, a novel model that extends the traditional LDA model by integrating text category information. On the basis of LDA, the proposed method adds tourism activity information and obtains a probability distribution model of tourism activities. Based on this model, we can identify and discover the themes of tourism activities.
- Research Article
129
- 10.1016/j.jbi.2019.103364
- Dec 28, 2019
- Journal of Biomedical Informatics
Unsupervised machine learning for the discovery of latent disease clusters and patient subgroups using electronic health records.
- Conference Article
13
- 10.1109/bibe.2017.00-81
- Oct 1, 2017
Understanding the role of differential gene expression in the development of, and molecular response to, cancer is a complex problem that remains challenging, in part due to the sheer number of genes, gene products, and metabolites involved. In this paper, we employ an unsupervised topic model, Latent Dirichlet Allocation (LDA), to explore patterns of gene expression in healthy and cancer tissues. An important advantage of LDA compared to alternative statistical and machine learning methods is its proven ability to handle sparse inputs over an extremely large number of features in an unsupervised manner. LDA has recently been applied for clustering and exploring genomic data but not for classification and prediction. In this paper, we try to optimize the protocol and parameters for efficient implementation of LDA. Here, messenger RNA (mRNA) sequence data from breast cancer and healthy tissue is used to determine an effective approach for the application of LDA to classification of cancer versus healthy tissue. We describe our study in two phases: First, various parameters like the number of topics, bins, and passes were optimized for LDA. Next, we developed a novel LDA-based classification approach to classify unknown samples based on similarity of co-expression patterns. Evaluation to assess the effectiveness of this approach shows that LDA can achieve high accuracy compared to alternative approaches. Overall, our results project LDA as a promising approach for classification of tissue types based on gene expression data in cancer studies.
- Research Article
2
- 10.54364/aaiml.2022.1117
- Jan 1, 2022
- Advances in Artificial Intelligence and Machine Learning
Not only can online hate content spread easily between social media platforms, but its focus can also evolve over time. Machine learning and other artificial intelligence (AI) tools could play a key role in helping human moderators understand how such hate topics are evolving online. Latent Dirichlet Allocation (LDA) has been shown to be able to identify hate topics from a corpus of text associated with online communities that promote hate. However, applying LDA to each day’s data is impractical since the inferred topic list from the optimization can change abruptly from day to day, even though the underlying text and hence topics do not typically change this quickly. Hence, LDA is not well suited to capture the way in which hate topics evolve and morph. Here we solve this problem by showing that a dynamic version of LDA can help capture this evolution of topics surrounding online hate. Specifically, we show how standard and dynamical LDA models can be used in conjunction to analyze the topics over time emerging from extremist communities across multiple moderated and unmoderated social media platforms. Our dataset comprises material that we have gathered from hate-related communities on Facebook, Telegram, and Gab during the time period January-April 2021. We demonstrate the ability of dynamic LDA to shed light on how hate groups use different platforms in order to propagate their cause and interests across the online multiverse of social media platforms.
- Research Article
1
- 10.5909/jeb.2012.17.2.270
- Mar 30, 2012
- Journal of Broadcast Engineering
With the advent of multi-channel TV, IPTV, and smart TV services, an excessive amount of TV program content has become available to users, which makes it difficult for TV viewers to find and consume their preferred programs. Automatic TV program recommendation, which improves access to preferred content, is therefore an important issue for future intelligent TV services. In this paper, we present a recommendation model based on statistical machine learning that uses a collaborative filtering concept, taking into account both public and personal preferences for TV program content. To this end, users' preferences for TV programs are modeled as latent topic variables using LDA (Latent Dirichlet Allocation), which has recently been applied in various application domains. To apply LDA to TV recommendation appropriately, the topics that interest TV viewers are regarded as latent topics in LDA, and an asymmetric Dirichlet distribution, which can reflect the diversity of viewers' topic interests, is applied to the model based on an analysis of real TV usage history data. The experimental results show that, for user groups with similar viewing tastes, the proposed LDA-based TV recommendation method yields an average precision of 66.5% for weekly recommendation and 77.9% for recommendation over two-month periods, in both cases with the top 5 ranked TV programs.
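The precision figures above are reported for the top 5 ranked programs. A minimal sketch of how top-k precision is commonly computed follows; the function name and the toy program lists are assumptions for illustration and do not reproduce the paper's evaluation protocol.

```python
def precision_at_k(recommended, watched, k=5):
    """Share of the top-k recommended items that the user actually watched."""
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for item in top_k if item in watched)
    return hits / len(top_k)


# Hypothetical ranked recommendations and viewing history for one user group.
recommended = ["news", "drama_a", "sports", "quiz", "drama_b", "movie"]
watched = {"drama_a", "sports", "drama_b", "documentary"}
score = precision_at_k(recommended, watched, k=5)
```

Averaging this score over all user groups and recommendation periods would give a figure comparable in form to the averages quoted in the abstract.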
- Conference Article
7
- 10.1145/3307681.3325407
- Jun 17, 2019
Latent Dirichlet Allocation (LDA) is a popular topic model. Given the fact that the input corpus of LDA algorithms consists of millions to billions of tokens, the LDA training process is very time-consuming, which prevents the adoption of LDA in many scenarios, e.g., online service. GPUs have benefited modern machine learning algorithms and big data analysis as they can provide high memory bandwidth and tremendous computation power. Therefore, many frameworks, e.g., TensorFlow, Caffe, and CNTK, support GPUs for accelerating various data-intensive machine learning algorithms. However, we observe that the performance of existing LDA solutions on GPUs is not satisfying. In this paper, we present CuLDA, a GPU-based efficient and scalable approach to accelerate large-scale LDA problems. CuLDA is designed to efficiently solve LDA problems at high throughput. To this end, we first carefully design workload partitioning and synchronization mechanisms to exploit multiple GPUs. Then, we offload the LDA sampling process to each individual GPU by optimizing from the sampling algorithm, parallelization, and data compression perspectives. Experimental evaluations show that compared with the state-of-the-art LDA solutions, CuLDA outperforms them by a large margin (up to 7.3X) on a single GPU. CuLDA is able to achieve an extra 7.5X speedup on 8 GPUs for large data sets.
- Research Article
43
- 10.1186/1471-2105-7-250
- May 8, 2006
- BMC Bioinformatics
Background: The statistical modeling of biomedical corpora could yield integrated, coarse-to-fine views of biological phenomena that complement discoveries made from analysis of molecular sequence and profiling data. Here, the potential of such modeling is demonstrated by examining the 5,225 free-text items in the Caenorhabditis Genetic Center (CGC) Bibliography using techniques from statistical information retrieval. Items in the CGC biomedical text corpus were modeled using the Latent Dirichlet Allocation (LDA) model. LDA is a hierarchical Bayesian model which represents a document as a random mixture over latent topics; each topic is characterized by a distribution over words. Results: An LDA model estimated from CGC items had better predictive performance than two standard models (unigram and mixture of unigrams) trained using the same data. To illustrate the practical utility of LDA models of biomedical corpora, a trained CGC LDA model was used for a retrospective study of nematode genes known to be associated with life span modification. Corpus-, document-, and word-level LDA parameters were combined with terms from the Gene Ontology to enhance the explanatory value of the CGC LDA model, and to suggest additional candidates for age-related genes. A novel, pairwise document similarity measure based on the posterior distribution on the topic simplex was formulated and used to search the CGC database for "homologs" of a "query" document discussing the life span-modifying clk-2 gene. Inspection of these document homologs enabled and facilitated the production of hypotheses about the function and role of clk-2. Conclusion: Like other graphical models for genetic, genomic and other types of biological data, LDA provides a method for extracting unanticipated insights and generating predictions amenable to subsequent experimental validation.
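The abstract does not define its pairwise similarity measure, so the sketch below substitutes a standard alternative, the Hellinger distance between per-document topic distributions, to show how a "query" document could be ranked against a corpus on the topic simplex. The function names, document ids, and topic proportions are hypothetical, and this is not the measure the paper formulated.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical)."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return math.sqrt(
        0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    )


def rank_by_similarity(query, corpus):
    """Return document ids ordered from most to least similar to the query."""
    return sorted(corpus, key=lambda doc_id: hellinger(query, corpus[doc_id]))


# Hypothetical per-document topic proportions over three topics.
corpus = {
    "doc_a": [0.70, 0.20, 0.10],
    "doc_b": [0.10, 0.10, 0.80],
    "doc_c": [0.60, 0.30, 0.10],
}
query = [0.65, 0.25, 0.10]
ranking = rank_by_similarity(query, corpus)
```

The documents at the head of such a ranking play the role of the "homologs" described in the abstract: items whose topic mixtures most closely resemble that of the query.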
- Research Article
9
- 10.1371/journal.pdig.0000305
- Aug 2, 2023
- PLOS Digital Health
The emergence of new digital technologies has enabled a new way of doing research, including active collaboration with the public ('citizen science'). Innovation in machine learning (ML) and natural language processing (NLP) has made automatic analysis of large-scale text data accessible to study individual perspectives in a convenient and efficient fashion. Here we blend citizen science with innovation in NLP and ML to examine (1) which categories of life events persons with multiple sclerosis (MS) perceived as central for their MS; and (2) associated emotions. We subsequently relate our results to standardized individual-level measures. Participants (n = 1039) took part in the 'My Life with MS' study of the Swiss MS Registry, which involved telling their story through self-selected life events using text descriptions and a semi-structured questionnaire. We performed topic modeling ('latent Dirichlet allocation') to identify high-level topics underlying the text descriptions. Using a pre-trained language model, we performed a fine-grained emotion analysis of the text descriptions. A topic modeling analysis of a total of 4293 descriptions revealed eight underlying topics. Five topics are common in clinical research: 'diagnosis', 'medication/treatment', 'relapse/child', 'rehabilitation/wheelchair', and 'injection/symptoms'. However, three topics, 'work', 'birth/health', and 'partnership/MS', represent domains that are of great relevance for participants but are generally understudied in MS research. While emotions were predominantly negative (sadness, anxiety), emotions linked to the topics 'birth/health' and 'partnership/MS' were also positive (joy). Designed in close collaboration with persons with MS, the 'My Life with MS' project explores the experience of living with the chronic disease of MS using NLP and ML.
Our study thus contributes to the body of research demonstrating the potential of integrating citizen science with ML-driven NLP methods to explore the experience of living with a chronic condition.
- Research Article
53
- 10.1007/s11336-021-09820-y
- Nov 10, 2021
- Psychometrika
The past few years were marked by increased online offensive strategies perpetrated by state and non-state actors to promote their political agenda, sow discord, and question the legitimacy of democratic institutions in the US and Western Europe. In 2016, the US Congress identified a list of Russian state-sponsored Twitter accounts that were used to try to divide voters on a wide range of issues. Previous research used latent Dirichlet allocation (LDA) to estimate latent topics in data extracted from these accounts. However, LDA has characteristics that may limit the effectiveness of its use on data from social media: The number of latent topics must be specified by the user, interpretability of the topics can be difficult to achieve, and it does not model short-term temporal dynamics. In the current paper, we propose a new method to estimate latent topics in texts from social media termed Dynamic Exploratory Graph Analysis (DynEGA). In a Monte Carlo simulation, we compared the ability of DynEGA and LDA to estimate the number of simulated latent topics. The results show that DynEGA is substantially more accurate than several different LDA algorithms when estimating the number of simulated topics. In an applied example, we performed DynEGA on a large dataset with Twitter posts from state-sponsored right- and left-wing trolls during the 2016 US presidential election. DynEGA revealed topics that were pertinent to several consequential events in the election cycle, demonstrating the coordinated effort of trolls capitalizing on current events in the USA. This example demonstrates the potential power of our approach for revealing temporally relevant information from qualitative text data.
- Conference Article
6
- 10.1109/asew.2019.00037
- Nov 1, 2019
Much research in Software Engineering (SE) automatically extracts topics from text data and uses the results directly or as features for a machine learning method. Research has shown that the majority of studies in SE use Latent Dirichlet Allocation (LDA) as the topic modeling approach. Similarly, much work applies LDA to GitHub data. However, no study explores whether LDA is a good choice compared to other algorithms, nor does any investigate the effects of specific pre-processing steps on its performance. In this paper, we explore a large dataset of GitHub repositories and apply two main topic modeling algorithms, LDA (3 variants) and Non-Negative Matrix Factorization (NMF), in several experiments with different experimental settings. The results show that LDA achieves a higher coherence score than NMF. However, care should be taken in choosing the LDA algorithm, setting its parameters, and selecting the text pre-processing steps. The results of this paper benefit SE researchers who apply intelligent techniques using LDA.
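The abstract compares models by coherence score without saying which coherence measure was used. As one possibility, the sketch below computes the UMass coherence of a topic's top words from document co-occurrence counts; the choice of UMass, the function name, and the toy corpus are assumptions for illustration only.

```python
import math


def umass_coherence(top_words, documents):
    """UMass coherence of a topic, given its top words in rank order.

    Sums log((D(w_m, w_l) + 1) / D(w_l)) over every pair in which w_l is
    ranked above w_m, where D counts the documents containing all the
    given words. Values closer to zero indicate a more coherent topic.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            denom = doc_freq(top_words[l])
            if denom == 0:
                raise ValueError(f"word not in corpus: {top_words[l]!r}")
            joint = doc_freq(top_words[m], top_words[l])
            score += math.log((joint + 1) / denom)
    return score


# Toy tokenized corpus and two hypothetical topics.
documents = [
    ["price", "drop", "risk"],
    ["price", "drop"],
    ["community", "reddit"],
    ["price", "risk"],
]
coherent = umass_coherence(["price", "drop"], documents)
incoherent = umass_coherence(["price", "reddit"], documents)
```

Words that frequently co-occur in the same documents score near zero, while words that never co-occur are penalized, which is the intuition behind comparing topic models by coherence.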