Dendroid: A text mining approach to analyzing and classifying code structures in Android malware families


Similar Papers
  • Conference Article
  • Cite Count Icon 23
  • 10.1109/dsc.2017.77
Analysis of Android Malware Family Characteristic Based on Isomorphism of Sensitive API Call Graph
  • Jun 1, 2017
  • Hao Zhou + 3 more

The analysis of multiple Android malware families indicates that malware instances within a common malware family always have similar call graph structures. Based on the isomorphism of sensitive API call graphs, we propose a method for constructing malware family features by combining a static analysis approach with a graph similarity metric. The experiment is performed on a malware dataset which contains 1326 malware samples from 16 different malware families. The result shows that the method can differentiate distinct malware family features and assign suspect malware samples to the corresponding families with a high overall accuracy of 96.77%, and can even withstand a certain degree of obfuscation.
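
As a rough illustration of comparing call graph structures, the sketch below scores two samples by the overlap of their call edges. It is only a minimal sketch under stated assumptions: each call graph is assumed to be a set of (caller, callee) pairs, Jaccard similarity is used as a simple stand-in rather than the isomorphism-based metric of the paper, and the function and sample names are hypothetical.

```python
def jaccard_similarity(edges_a, edges_b):
    """Jaccard similarity between two call graphs given as edge sets.

    This is a stand-in measure, not the paper's isomorphism-based metric.
    """
    a, b = set(edges_a), set(edges_b)
    if not a and not b:
        # Two empty graphs are treated as identical by convention.
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical call edges (caller, callee) for two samples.
sample_1 = {("onCreate", "sendTextMessage"), ("onCreate", "getDeviceId")}
sample_2 = {("onCreate", "sendTextMessage"), ("onStart", "getDeviceId")}

print(jaccard_similarity(sample_1, sample_2))
```

A score near 1 means the two samples share most of their call edges; a real family classifier would need a structural comparison that tolerates renamed methods.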

  • Research Article
  • Cite Count Icon 120
  • 10.1109/tdsc.2017.2739145
EC2: Ensemble Clustering and Classification for Predicting Android Malware Families
  • Oct 23, 2019
  • IEEE Transactions on Dependable and Secure Computing
  • Tanmoy Chakraborty + 2 more

As the most widely used mobile platform, Android is also the biggest target for mobile malware. Given the increasing number of Android malware variants, detecting malware families is crucial so that security analysts can identify situations where signatures of a known malware family can be adapted as opposed to manually inspecting behavior of all samples. We present EC2 (Ensemble Clustering and Classification), a novel algorithm for discovering Android malware families of varying sizes, ranging from very large to very small families (even if previously unseen). We present a performance comparison of several traditional classification and clustering algorithms for Android malware family identification on DREBIN, the largest public Android malware dataset with labeled families. We use the output of both supervised classifiers and unsupervised clustering to design EC2. Experimental results on both the DREBIN and the more recent Koodous malware datasets show that EC2 accurately detects both small and large families, outperforming several comparative baselines. Furthermore, we show how to automatically characterize and explain unique behaviors of specific malware families, such as FakeInstaller, MobileTx, Geinimi. In short, EC2 presents an early warning system for emerging new malware families, as well as a robust predictor of the family (when it is not new) to which a new malware sample belongs, and the design of novel strategies for data-driven understanding of malware behaviors.

  • Conference Article
  • Cite Count Icon 17
  • 10.1145/1183535.1183539
Towards applying text mining and natural language processing for biomedical ontology acquisition
  • Nov 10, 2006
  • Tasha R Inniss + 5 more

The use of text mining and natural language processing can extend into the realm of knowledge acquisition and management for biomedical applications. In this paper, we describe how we implemented natural language processing and text mining techniques on the transcribed verbal descriptions from retinal experts of biomedical disease features. The feature-attribute pairs generated were then incorporated within a user interface for a collaborative ontology development tool. This tool, IDOCS, is being used in the biomedical domain to help retinal specialists reach a consensus on a common ontology for describing age-related macular degeneration (AMD). We compare the use of traditional text mining and natural language processing techniques with that of a retinal specialist's analysis and discuss how we might integrate these techniques for future biomedical ontology and user interface development.

  • Book Chapter
  • Cite Count Icon 2
  • 10.1007/978-3-030-93677-8_52
In Search of Insight from Unstructured Text Data: Towards an Identification of Text Mining Techniques
  • Jan 1, 2022
  • Sunet Eybers + 1 more

The availability of massive sets of unstructured data has opened up new opportunities for businesses to gain meaningful insights into what is currently happening in their organization and market place. Unfortunately, unstructured data is often challenging to work with, as it requires specialized toolsets, techniques, knowledge and skills to engage with the data. This study used a systematic review (SLR) process to explore the current and conversant state of affairs in the field of text mining (TM) techniques that exist for the processing of natural language in unstructured text datasets across a broad expanse of applications and domains. A comprehensive literature search yielded 1022 eligible articles from five prominent bibliographic databases, narrowed down to 89 articles for review. Information related to TM techniques, the TM process, applications, challenges and recommendations for the improvement of TM results were extracted and synthesized. Eighteen TM techniques were identified and used to complete a variety of tasks such as data pre-processing, information retrieval, information extraction, text classification, clustering, topic modeling, text summarization and sentiment analysis or opinion mining. No single TM technique may be suitable for all text representation and extraction requirements. Therefore, context and applicability are key factors for TM technique selection.

Keywords: Text mining; Text mining techniques; Unstructured data; Knowledge discovery

  • Research Article
  • Cite Count Icon 23
  • 10.1080/0267257x.2021.2003421
Customer engagement behaviours in a social media context revisited: using both the formative measurement model and text mining techniques
  • Dec 9, 2021
  • Journal of Marketing Management
  • Chaang-Iuan Ho + 2 more

Most research in the field of customer engagement employs multi-item measures that are conceptualised as three-dimensional and reflective models. This leaves room to suggest an alternative approach for defining and operationalising the construct. From a behavioural standpoint, we propose a higher-order formative measurement model (FMM) underlying customer engagement behaviours (CEBs) in the context of Facebook fan pages. Data from 259 restaurant customers show that the FMM works well, both theoretically and empirically, and that CEBs include eight dimensions and 16 indices. We also apply text mining (TM) techniques to analyse customers’ Facebook posts. The findings indicate that some dimensions identified by the FMM could not be extracted using TM, and the TM analysis provided clues regarding the FMM indices; the two approaches complement rather than compete with each other. These results serve as a basis for scale development in future research, and provide guidelines for managers to enhance long-term customer relationships.

  • Research Article
  • Cite Count Icon 21
  • 10.1109/tc.2022.3143439
Lightweight, Effective Detection and Characterization of Mobile Malware Families
  • Nov 1, 2022
  • IEEE Transactions on Computers
  • Karim O Elish + 2 more

Android malware is an ongoing threat to billions of smart devices’ security, ranging from mobile phones to car infotainment systems. Despite numerous approaches and previous studies to develop solutions for detecting and preventing Android malware, the rapid continuous development of new malware variants requires a careful reconsideration and the development of effective methods to identify malware families given a meager number of malware instances. In this paper, we present DroidMalVet, a novel Android malware family classification and detection approach that does not require performing complex program analyses or utilizing large feature sets. DroidMalVet is the first to use a promising, diverse, and small set of software metrics as features in a supervised learning platform to classify and detect various Android malware families. Our extensive empirical evaluations on two large public malware datasets show that DroidMalVet accurately detects both small and large malware families with F-Score accuracy of 94.4% and 96%, and AUC equal to 99.5% and 99.7% on the malware families in Drebin and AMD datasets, respectively. Moreover, our results demonstrate the superior performance of DroidMalVet in detecting small families (i.e., families with few samples). DroidMalVet complements existing approaches and presents an early warning tool for detecting known and emerging malware families.

  • Conference Article
  • 10.1109/hicss.2012.337
Introduction to Advanced Analytics Services for Managerial Decision Support Minitrack
  • Jan 1, 2012
  • Dursun Delen + 1 more

This minitrack consists of ten papers involving theory and practice of service-based analytics (i.e., data, text and Web mining) in support of managerial decision making. These ten papers illustrate a diverse set of approaches, demonstrating the variety of ways in which modern information technologies can be applied to today’s complex decision situations. The papers in this minitrack offer some insight into the efforts to more effectively and efficiently use information technology tools to extract knowledge and better understand our rapidly changing world. The ten papers are grouped into three topic areas (as three sessions): (1) Text mining applications, (2) Data mining applications, and (3) Advanced analytics applications. In the first group (text mining applications) we have two papers that apply unique capabilities of text mining to understanding and analyzing financial markets. The paper by Siering proposes using text mining and its derivative, sentiment analysis, to support intraday investment decisions, while the paper by Hagenau et al. proposes prediction of stock prices based on automatically “reading” financial news by using context-specific features. The third paper in this group (by Napoletano et al.) proposes the use of a novel method, “mixed graph of terms”, as opposed to the traditional “bag of words” representation of a text, in document retrieval and in making better sense of unstructured/textual data sources. In the second group (data mining applications) we have two papers that apply knowledge discovery techniques to medical datasets. The paper by Zurada describes a study in which seven different prediction algorithms were applied to predict the risk of low back disorders. In this study Zurada claims to have proposed a more systematic and reliable approach to creating and validating classifiers to better distinguish between low and high risk manual lifting jobs that contribute to low back disorders. The second paper summarizes a data-and-text mining study in which Erraguntla et al. developed algorithms to handle missing ICD-9 codes in medical datasets. Their approach involved developing a prediction model for the ICD-9 codes based on other associated attributes like medical diagnosis, medical remarks, and patient statements. They used text mining methods to extract key concepts from textual patient records, and used nearest neighborhood based classification algorithms to predict the missing ICD-9 codes. The third paper in this group (by Kim et al.) proposes an ensemble model, which is based on multiple SVM classifiers, to address churner identification problems in the mobile telecommunication industry, a sector in which the role of customer retention has become increasingly important due to its very competitive business environment. According to their comparison results, the performance of the ensemble model was superior to all single and ensemble models. In the third group we have four very diverse studies. The paper by Seng and Ling proposes a pool-based cost sensitive active learning framework that requires fewer examples yet produces a smaller total cost compared to the previous methods. The paper by Mair et al. reports on an interesting empirical study of software project managers using a case-based reasoning tool. Their aim was to explore the interaction of cognitive processes and personality of software project managers undertaking tool-supported estimation tasks such as effort and cost prediction. The paper by Soper et al. reports on a study where they mined institutional identities using n-grams (a text mining technique). They demonstrated the utility of their n-gram analysis tool in revealing the identity of an academic journal, namely Communications of the ACM. The last paper in this group (by Nuhn et al.) was about using clustering methods for the processing of complex landslide simulation results to support decision making and learning.
2012 45th Hawaii International Conference on System Sciences
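
The n-gram technique mentioned in the minitrack summary above can be illustrated with a short sketch. This is not the analysis tool described by Soper et al.; it is a minimal example, assuming whitespace tokenization and word-level n-grams, with a hypothetical function name and sample sentence.

```python
from collections import Counter


def word_ngrams(text, n):
    """Count word-level n-grams in `text` using lowercase whitespace tokens."""
    tokens = text.lower().split()
    # Slide a window of length n over the token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Hypothetical sample text, used for illustration only.
sample = "text mining and data mining support decision making"
bigrams = word_ngrams(sample, 2)
print(bigrams.most_common(3))
```

Counting the most frequent n-grams across a large corpus (for example, all titles of a journal) gives a simple profile of its recurring phrases.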

  • Research Article
  • Cite Count Icon 59
  • 10.3390/electronics9060942
Android Malware Family Classification and Analysis: Current Status and Future Directions
  • Jun 5, 2020
  • Electronics
  • Fahad Alswaina + 1 more

Android receives major attention from security practitioners and researchers due to the influx of malicious applications. For the past twelve years, Android malicious applications have been grouped into families. In the research community, detecting new malware families is a challenge. Our investigation shows that most literature reviews focus on surveying malware detection. Characterizing malware families can improve the detection process and help in understanding malware patterns. For this reason, we conduct a comprehensive survey of the state-of-the-art Android malware familial detection, identification, and categorization techniques. We categorize the literature based on three dimensions: type of analysis, features, and methodologies and techniques. Furthermore, we report the datasets that are commonly used. Finally, we highlight the limitations that we identify in the literature, challenges, and future research directions regarding Android malware families.

  • Research Article
  • Cite Count Icon 28
  • 10.3390/app11052323
Text Mining for Supply Chain Risk Management in the Apparel Industry
  • Mar 5, 2021
  • Applied Sciences
  • Sayed Mehdi Shah + 2 more

Text mining tools are now widely used for the efficient management of information and resources in business, academic and research organizations. This paper provides a comprehensive overview of research articles on the application of text mining techniques in the field of Supply Chain Risk Management and the apparel industry. Research articles published between 2000 and 2020, were obtained from various journals through two online databases, i.e., SCOPUS and IEEE Xplore. Through a systematic approach following PRISMA guidelines, 370 research papers were screened, filtered and finally classified into three main areas: Supply Chain Risk Management and outsourcing in the apparel industry, application of text mining in Supply Chain Risk Management and application of text mining in the apparel industry. In this study, we have identified a comprehensive list of various available data sources for text mining, methodologies and risks associated with outsourcing in the apparel industry. We classify the gaps in expanding the application of text mining in the apparel industry’s Supply Chain Risk Management. Extracting useful information from online newspapers through text mining could vividly enhance the ability to monitor supply chain risks and provide the ability to link data to provide decision makers with the right information at the right time.

  • Research Article
  • Cite Count Icon 39
  • 10.1145/1921656.1921657
An accuracy-enhanced light stemmer for arabic text
  • Feb 24, 2010
  • ACM Transactions on Speech and Language Processing
  • Samhaa R El-Beltagy + 1 more

Stemming is a key step in most text mining and information retrieval applications. Information extraction, semantic annotation, as well as ontology learning are but a few examples where using a stemmer is a must. While the use of light stemmers in Arabic texts has proven highly effective for the task of information retrieval, this class of stemmers falls short of providing the accuracy required by many text mining applications. This can be attributed to the fact that light stemmers employ a set of rules that they apply indiscriminately and that they do not address stemming of broken plurals at all, even though this class of plurals is very commonly used in Arabic texts. The goal of this work is to overcome these limitations. The evaluation of the work shows that it significantly improves stemming accuracy. It also shows that by improving stemming accuracy, tasks such as automatic annotation and keyphrase extraction can also be significantly improved.

  • Conference Article
  • Cite Count Icon 8
  • 10.1109/cis.workshops.2007.212
Research and Realization of Text Mining Algorithm on Web
  • Dec 15, 2007
  • Shiqun Yin + 2 more

It is recognized that text information on the Web is growing at an astounding pace. Research and application of text mining on the Web is an important branch of data mining. People now mainly use information retrieval (IR) or search engines to look up Web information. But IR focuses on searching for information that is explicitly present rather than latent knowledge in a document; search engines can hardly adapt to the different needs of different customers and provide individual service, and it is very difficult to mine the data further. Text mining on the Web aims to resolve this problem. This paper discusses an algorithm for following an appointed website or Web page according to the user's request using text mining techniques, for extracting and expressing text characteristics, and for classifying the data with feedback judgement combined with the Web page text contents for later use. We present experiments on different data sets that demonstrate that our algorithm is more effective than the traditional algorithm. The process of Web text mining, the information extraction method, the mining algorithm and the realization technique are discussed in detail.

  • Research Article
  • Cite Count Icon 9
  • 10.1300/j104v37n01_08
Text Mining and Data Mining in Knowledge Organization and Discovery: The Making of Knowledge-Based Products
  • Jul 1, 2003
  • Cataloging & Classification Quarterly
  • L J Haravu + 1 more

SUMMARY Discusses the importance of knowledge organization in the context of the information overload caused by the vast quantities of data and information accessible on internal and external networks of an organization. Defines the characteristics of a knowledge-based product. Elaborates on the techniques and applications of text mining in developing knowledge products. Presents two approaches, as case studies, to the making of knowledge products: (1) steps and processes in the planning, designing and development of a composite multilingual multimedia CD product, with the potential international, inter-cultural end users in view, and (2) application of natural language processing software in text mining. Using a text mining software, it is possible to link concept terms from a processed text to a related thesaurus, glossary, schedules of a classification scheme, and facet structured subject representations. Concludes that the products of text mining and data mining could be made more useful if the features of a faceted scheme for subject classification are incorporated into text mining techniques and products.

  • Research Article
  • Cite Count Icon 39
  • 10.1016/j.accinf.2023.100624
The application of text mining in accounting
  • Jun 3, 2023
  • International Journal of Accounting Information Systems
  • Elseline Senave + 2 more

By facilitating the derivation of knowledge and qualitative measures from textual data, text mining techniques have come into vogue in various domains and industries. Namely in accounting, text mining outputs can elucidate, complement, and validate the customary quantitative data. This study creates an up-to-date view of text mining applications in accounting practice. Through a critical review of text mining literature, insight is given into the stages of a typical text mining process, contemporary text mining techniques that have been named valuable in an accounting context, and the information that can be obtained by applying these techniques.

  • Research Article
  • Cite Count Icon 16
  • 10.3390/info14040201
Applications of Text Mining in the Transportation Infrastructure Sector: A Review
  • Mar 23, 2023
  • Information
  • Sudipta Chowdhury + 1 more

Transportation infrastructure is vital to the well-functioning of economic activities in a region. Due to the digitalization of data storage, ease of access to large databases, and advancement of social media, large volumes of text data that relate to different aspects of transportation infrastructure are generated. Text mining techniques can explore any large amount of textual data within a limited time and with limited resource allocation for generating easy-to-understand knowledge. This study aims to provide a comprehensive review of the various applications of text mining techniques in transportation infrastructure research. The scope of this research ranges across all forms of transportation infrastructure-related problems or issues that were investigated by different text mining techniques. These transportation infrastructure-related problems or issues may involve issues such as crashes or accidents investigation, driving behavior analysis, and construction activities. A Preferred Reporting Items for Systematic Reviews and Meta-Analysis (PRISMA)-based structured methodology was used to identify relevant studies that implemented different text mining techniques across different transportation infrastructure-related problems or issues. A total of 59 studies from both the U.S. and other parts of the world (e.g., China, and Bangladesh) were ultimately selected for review after a rigorous quality check. The results show that apart from simple text mining techniques for data pre-processing, the majority of the studies used topic modeling techniques for a detailed evaluation of the text data. Other techniques such as classification algorithms were also later used to predict and/or project future scenarios/states based on the identified topics. 
The findings from this study will hopefully provide researchers and practitioners with a better understanding of the potential of text mining techniques under different circumstances to solve different types of transportation infrastructure-related problems. They will also provide a blueprint to better understand the ever-evolving area of transportation engineering and infrastructure-focused studies.

  • Preprint Article
  • 10.7490/f1000research.1118323.1
Literature mining for rare disease phenotype genotype associations
  • Sep 17, 2020
  • F1000Research
  • Erica Lyons + 6 more

Although individually uncommon, collectively, rare diseases affect 6-8% of the world’s population. Diagnosing rare diseases through phenotypes is a difficult task because symptoms overlap; based solely on phenotype, there may be multiple causes for a single disease, and a single cause may be associated with multiple diseases. Genomic investigations are leading to precise molecular-level characterization, allowing for a systematic discovery of therapies that either target a specific disease or, more broadly, multiple related diseases. Currently, text mining techniques have lower accuracy and more coverage gaps than manual curation, but the potential for higher productivity. Recently, there has been some progress in developing text mining applications to tackle this enormous problem. However, there is still a need to assess these text mining applications and integrate the findings with data. Towards these ends, we manually created a rare disease variant data set which can be used to test and refine a text mining algorithm. We developed a manual curation workflow which incorporated searching for genetic variants on PubMed, listing symptoms and phenotypes associated with each variant, converting cDNA or protein to standardized genetic notation, obtaining annotations from public data sources, and presenting the details in an online interface for future release. Our project aims include summation of pathogenic variant frequency in populations to estimate birth prevalence for each rare disease. We created test data sets of known accuracy, coverage, and genotype-phenotype associations that can be used to validate text mining approaches. Enhanced text mining will significantly decrease the amount of time necessary to gather data to molecularly characterize a rare disease and render it possible to mine rare disease phenotype-genotype associations for the more than 7,000 rare disease genes in a timely fashion.
