Pattern Mining with Natural Language Processing: An Exploratory Approach
Pattern mining derives from the need to discover hidden knowledge in very large amounts of data, regardless of the form in which it is presented. Natural Language Processing (NLP), in turn, arose from the human need to be understood by computers. In this paper we present an exploratory approach that aims at bringing together the best of both worlds. Our goal is to discover patterns in linguistically processed texts, through the usage of state-of-the-art NLP tools and traditional pattern mining algorithms. Articles from a Portuguese newspaper are the input of a series of tests described in this paper. First, they are processed by an NLP chain, which performs a deep linguistic analysis of text; afterwards, the pattern mining algorithms Apriori and GenPrefixSpan are used. Results showed the applicability of sequential pattern mining techniques to structured textual data, and also provided several pieces of evidence about the structure of the language.
Keywords (machine-generated, not supplied by the authors): Association Rule, Natural Language Processing, Minimum Support, Pattern Mining, Parse Tree
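As a rough illustration of the Apriori step mentioned in the abstract, the sketch below mines frequent itemsets level by level. The abstract does not say how the parsed text is encoded, so the input here is an assumption: each sentence is reduced to the set of part-of-speech tags it contains, and the function name `apriori` and the toy sentences are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset found in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Join step: unite frequent (k-1)-itemsets into candidates of size k.
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in frequent for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        # Count step: one pass over the transactions per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


# Hypothetical toy input: one set of part-of-speech tags per sentence.
sentences = [
    {"DET", "NOUN", "VERB"},
    {"DET", "NOUN", "ADJ"},
    {"NOUN", "VERB", "ADV"},
    {"DET", "NOUN", "VERB", "ADJ"},
]
patterns = apriori(sentences, min_support=3)
```

With a minimum support of 3, the pair {DET, VERB} is dropped at level 2, which in turn prunes the triple {DET, NOUN, VERB} before it is ever counted.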
- Research Article
2
- 10.3233/ida-230672
- Mar 1, 2025
- Intelligent Data Analysis: An International Journal
Periodic high-utility sequential pattern (PHUSP) mining is one of the research hotspots in data mining; it aims to discover patterns that not only have high utility but also appear regularly in sequence datasets. Traditional PHUSP mining mainly focuses on mining patterns from a single sequence, which often results in some interesting patterns being discarded due to strict constraints, and most of the discovered patterns are unstable and difficult to use for decision-making. In response to this issue, a novel algorithm called TKSPUS (top-k stable periodic high-utility sequential pattern mining) is proposed to discover stable top-k periodic high-utility sequential patterns that co-occur in multi-sequences. TKSPUS extends traditional periodic high-utility sequential pattern mining and designs two new metrics, namely the utility stability coefficient (usc) and the periodic stability coefficient (sr), to determine the utility stability and periodic stability of patterns in multi-sequences, respectively. Additionally, the TKSPUS algorithm adopts the projection mechanism to mine stable periodic high-utility patterns over multi-sequences, while a new data structure called pusc and two corresponding pruning strategies are also introduced to boost the mining process. Experiments show that, compared with four related algorithms, the TKSPUS algorithm performs better in memory consumption and execution time, and the stability of the mining results is improved by 47% on average compared with the traditional periodic high-utility pattern mining algorithm.
- Book Chapter
3
- 10.4018/979-8-3693-9694-0.ch012
- Jan 17, 2025
In the past decade, Natural Language Processing (NLP) has emerged as an important tool for extracting valuable insights from vast amounts of textual data. This work discusses the integration of NLP techniques with AI and robotics to enhance pattern mining capabilities. Using AI algorithms such as machine learning and deep learning, coupled with robotics for physical data gathering, enables the creation of sophisticated systems capable of understanding and interpreting human language in diverse contexts. In this study, we present a comprehensive framework that combines NLP intelligence with AI and robotics to extract meaningful patterns from textual data sources. We discuss the utilization of techniques such as sentiment analysis, named entity recognition, and topic modeling to analyze text data. We also discuss the integration of these NLP capabilities with AI algorithms for pattern identification and prediction. Moreover, the incorporation of robotics adds a tangible dimension to the pattern mining process, allowing for real-time data collection in various environments.
- Discussion
29
- 10.1161/circoutcomes.115.002125
- Aug 18, 2015
- Circulation: Cardiovascular Quality and Outcomes
The promise of big data has captured healthcare’s imagination. Although the term lacks a consensus definition, it generally refers to electronic health data sets characterized by the 3 Vs: volume, variety, and velocity.1,2 Volume refers to the sheer amount of healthcare data currently generated by clinical operations, administration, and patients themselves. By one estimate, ≈25 000 petabytes of healthcare data will be available by 2020, an amount that could fill 500 billion file cabinets.2 Variety refers to the wide range of healthcare data formats. For example, electronic health records (EHRs) contain both structured and unstructured (or free-text) data, diagnostic images come in a variety of multimedia formats, and patient data are generated from wearables, mobile devices, medical devices, and social media, each with its own format. Velocity refers to the rapidity with which new data are generated, and thus the speed at which it needs incorporation into data sets and analyses to provide real-time insights into health care. The potential of such data is enormous. Insights from big data could fuel innovation and improvement in clinical operations, research and development, and public health.1 However, the potential of big data to realize these lofty aspirations is matched by the challenge of organizing, analyzing, and generating actionable insights from it. One of the biggest challenges in realizing the potential of big data is in abstracting it. With the passage of the HITECH (The Health Information Technology for Economic and Clinical Health) Act in 2009, the adoption of EHRs in clinical practice has accelerated, and now over half of office-based practices and hospitals are using some form of EHR.3,4 As a result, more point-of-care clinical data, previously inaccessible in its paper format, is potentially available. However, the variety aspect of EHR data, its mix …
- Research Article
14
- 10.1155/2023/8110588
- Jan 1, 2023
- Computational Intelligence and Neuroscience
Recommender systems are chiefly renowned for their applicability in e-commerce sites and social media. For system optimization, this work introduces a method of behaviour pattern mining to analyze a person's mental stability. With the utilization of the sequential pattern mining algorithm, efficient extraction of frequent patterns from the database is achieved. A candidate sub-sequence generation-and-test method is adopted in conventional sequential mining algorithms like the Generalized Sequential Pattern Algorithm (GSP). However, since this approach yields a huge candidate set, it is not ideal when a large amount of data from social media analysis is involved. Since the data is composed of numerous features, not all of which are related to one another, feature selection helps remove unrelated features from the data with minimal information loss. In this work, Frequent Pattern (FP) mining operations employ the Systolic tree. The systolic tree-based reconfigurable architecture offers various benefits such as high throughput and cost-effective performance. The database's frequently occurring item sets can be found by using the FP mining algorithms. Feature selection has attracted interest in numerous research areas related to machine learning and data mining, since it enables classifiers to be swift, more accurate, and cost-effective. Over the last ten years or so, there have been significant technological advancements in heuristic techniques. These techniques are beneficial because they improve the search procedure's efficiency, albeit at the potential sacrifice of completeness claims. A new recommender system for mental illness detection, based on features selected using River Formation Dynamics (RFD), Particle Swarm Optimization (PSO), and a hybrid RFD-PSO algorithm, is proposed in this paper.
The experiments use the depressive patient datasets for evaluation, and the results demonstrate the improved performance of the proposed technique.
- Research Article
3
- 10.1080/01969722.2020.1871225
- Jan 11, 2021
- Cybernetics and Systems
Closed sequential pattern (CSP) mining is an optimization technique in sequential pattern mining because it produces more compact representations. Additionally, the runtime and memory usage required for mining CSPs are much lower than for general sequential pattern mining. This task has fascinated numerous researchers. In this study, we propose a novel approach for closed clickstream pattern mining using the C-List data structure (CCPC). Closed clickstream pattern mining is a more specific task of CSP mining that has received little research attention; nevertheless, it has promising applications in various fields. CCPC consists of two key steps: it initially builds the SPPC-tree and the C-List for each frequent 1-pattern and then determines all frequent closed clickstream 1-patterns; next, it constructs the C-List for each frequent k-pattern and mines the remaining frequent closed k-patterns. The proposed method is optimized by modifying the SPPC-tree structure, and a new property is added to each node element in both the SPPC-tree and the C-Lists to quickly prune nonclosed clickstream patterns. Experimental results on several datasets show that the proposed method is better than the previous techniques and improves the runtime and memory usage in most cases, especially when using low minimum support thresholds on huge databases.
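To make the notion of a closed clickstream pattern concrete, here is a minimal brute-force sketch. It does not use the SPPC-tree or C-List structures named in the abstract; it simply enumerates frequent patterns by appending one page at a time and then keeps those with no longer super-pattern of equal support. All function names and the toy sessions are hypothetical, and `min_support` is assumed to be at least 1.

```python
def is_subsequence(pattern, sequence):
    """True if the pages of pattern occur in sequence in the same order (gaps allowed)."""
    it = iter(sequence)
    return all(page in it for page in pattern)


def support(pattern, database):
    """Number of sessions that contain pattern as a subsequence."""
    return sum(1 for seq in database if is_subsequence(pattern, seq))


def frequent_sequences(database, min_support):
    """Enumerate frequent patterns by extending frequent prefixes one page at a time."""
    pages = sorted({p for seq in database for p in seq})
    found = {}
    stack = [()]
    while stack:
        prefix = stack.pop()
        for page in pages:
            candidate = prefix + (page,)
            count = support(candidate, database)
            if count >= min_support:  # support never grows when a pattern is extended
                found[candidate] = count
                stack.append(candidate)
    return found


def closed_only(frequent):
    """Keep patterns that have no longer super-pattern with the same support."""
    return {
        p: s
        for p, s in frequent.items()
        if not any(
            len(q) > len(p) and t == s and is_subsequence(p, q)
            for q, t in frequent.items()
        )
    }


# Hypothetical clickstream sessions.
sessions = [
    ("home", "search", "item", "cart"),
    ("home", "search", "item"),
    ("home", "item", "cart"),
    ("search", "item"),
]
frequent = frequent_sequences(sessions, min_support=3)
closed = closed_only(frequent)
```

Here `("home",)` is frequent but not closed, because `("home", "item")` appears in exactly the same three sessions; the closed set carries the same support information in fewer patterns.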
- Conference Article
49
- 10.1109/icdmw.2006.98
- Jan 1, 2006
Incremental mining of sequential patterns from data streams is one of the most challenging problems in mining data streams. However, previous work on mining sequential patterns from data streams has focused almost exclusively on streams of item-sequences, not streams of itemset-sequences. In this paper, we propose an efficient single-pass algorithm, called IncSPAM, to maintain the set of sequential patterns from itemset-sequence streams with a transaction-sensitive sliding window. An effective bit-sequence representation of items is used in the proposed algorithm to reduce the time and memory needed to slide the windows. Experiments show that the proposed IncSPAM algorithm is efficient for mining sequential patterns over data streams.
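The abstract does not spell out IncSPAM's bit-sequence layout, so the following is only a generic sketch of the idea under simplifying assumptions: a single stream of itemsets, one integer bitmask per item, and the lowest bit standing for the newest transaction. The class name `SlidingBitWindow` and its methods are hypothetical.

```python
class SlidingBitWindow:
    """Track which of the last window_size transactions contained each item."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.mask = (1 << window_size) - 1  # keeps only the newest window_size bits
        self.bits = {}  # item -> bitmask; bit 0 is the most recent transaction

    def add_transaction(self, itemset):
        # Sliding the window is one shift per item: the oldest bit falls off the mask.
        for item in list(self.bits):
            shifted = (self.bits[item] << 1) & self.mask
            if shifted:
                self.bits[item] = shifted
            else:
                del self.bits[item]  # item no longer occurs anywhere in the window
        # Mark the items of the new transaction in the lowest bit.
        for item in itemset:
            self.bits[item] = self.bits.get(item, 0) | 1

    def count(self, item):
        """Number of transactions in the window that contain item."""
        return bin(self.bits.get(item, 0)).count("1")

    def co_occurrence(self, a, b):
        """Number of transactions in the window that contain both a and b."""
        return bin(self.bits.get(a, 0) & self.bits.get(b, 0)).count("1")


window = SlidingBitWindow(window_size=3)
for transaction in [{"a", "b"}, {"a"}, {"b", "c"}, {"a", "c"}]:
    window.add_transaction(transaction)
```

After the fourth transaction the first one has slid out of the window, so the counts reflect only the last three itemsets. Support counting reduces to shifts, bitwise AND, and a population count, which is the kind of saving the abstract attributes to the bit-sequence representation.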
- Research Article
3
- 10.26593/jrsi.v13i1.6790.117-130
- Apr 26, 2024
- Jurnal Rekayasa Sistem Industri
Customer satisfaction is a key success factor for a business. To provide products that meet customer satisfaction, companies must be able to understand the customers’ needs and desires. Technological developments have helped companies to understand customer desires more easily so that they can provide products that satisfy their customers. Natural Language Processing (NLP) is a technology that allows computers to process human language. NLP is also commonly referred to as text mining. NLP has been utilized in the New Product Development (NPD) process. We compiled studies related to NLP and NPD and conducted a literature review to map out how far NLP has been utilized in NPD processes. We found that in this era of Big Data, current NLP studies most often aim to process text data from online reviews on e-commerce sites and from social media. By using NLP, large amounts of data can produce valuable Voice of Customer (VOC) information for product development. We also found that NLP technology has been utilized in other NPD processes that do not involve VOC, such as the design stage, document processing, and extraction of requirements in the NPD process.
- Research Article
16
- 10.1177/16094069231214144
- Oct 1, 2023
- International Journal of Qualitative Methods
Background: Electronic health systems contain large amounts of unstructured data (UD) which are often unanalyzed due to the time and costs involved. Unanalyzed data creates missed opportunities to improve health outcomes. Natural language processing (NLP) is the foundation of generative artificial intelligence (GAI), which is the basis for large language models, such as ChatGPT. NLP and GAI are machine learning methods that analyze large amounts of data in a short time at minimal cost. The ability of NLP to conduct qualitative analyses is increasing, yet the results can lack context and nuance in their findings, requiring human intervention. Methods: Our study compared outcomes, time, and costs of a previously published qualitative study. Our approach partnered an NLP model and a qualitative researcher (NLP+). UD from behavioral health patients were analyzed using NLP and latent Dirichlet allocation to identify the topics using probability of word coherence scores. The topics were then analyzed by a qualitative researcher, translated into themes, and compared with the original findings. Results: The NLP+ method results aligned with the original, qualitatively derived themes. Our model also identified two additional themes which were not originally detected. The NLP+ method required 6 hours of labor, 3 minutes for transcription, and a transcription cost of $1.17. The original, qualitative-researcher-only method required more than 36 hours ($2,250) of time and $1,100 for transcription. Conclusions: While natural language processing analyzes voluminous amounts of data in seconds, context and nuance in human language are regularly missed. Combining a qualitative researcher with NLP (NLP+) could be deployed in many settings, reducing time and costs, and improving context. Until large language models are more prevalent, a human interaction can help translate the patient experience by contextualizing data rich in social determinant indicators which may otherwise go unanalyzed.
- Conference Article
24
- 10.1109/fskd.2007.402
- Jan 1, 2007
Mining frequent patterns is an important data mining task and has been widely studied. However, traditional frequent pattern mining does not address the ordering problem, which widely exists in the real world. Many approaches have been proposed to solve the ordering problem, including sequential pattern mining, item sequence mining, temporal feature extraction, web log study and ordered pattern mining. Most of these papers used an APRIORI-based algorithm and hence did not adopt the ideas and advanced techniques of traditional frequent pattern mining. This paper introduces a data structure called FOP-tree, a modified version of the FP-tree, to solve ordered pattern mining. The performance study shows that the FOP-tree is efficient and scalable for mining both long and short frequent ordered patterns, and is much faster than the traditional APRIORI-based algorithms in several situations.
- Book Chapter
5
- 10.1007/978-3-642-27872-3_16
- Jan 1, 2012
The concept of Pattern Mining has obtained significant focus in Telecommunications Network Management Systems (NMS). A large volume of work has been dedicated to this field and valuable progress has been observed. Both sequential and structured pattern mining techniques have been applied to NMS. In particular, NMS logs (Performance and Alarm) pose several interesting issues for pattern mining, and pattern mining can help in various NMS activities such as alarm correlation, alarm associations, self-healing or pro-active fault management. In this paper, we present an overview of the different pattern mining techniques used in NMSs, compare them and present the most beneficial ones to NMS for Radio over Fiber (RoF) like convergent networks.
Keywords: Pattern Mining, Radio-over-Fiber, Network Management Systems
- Conference Article
4
- 10.1109/cisim.2010.5643456
- Oct 1, 2010
Data mining is the task of discovering interesting patterns from large amounts of data. There are many data mining tasks, such as classification, clustering, association rule mining, and sequential pattern mining. Many frequent sequential traversal pattern mining algorithms have been developed which mine the set of frequent traversal subsequences satisfying a minimum support constraint in a session database. However, previous frequent sequential traversal pattern mining algorithms give equal weight to all sequential traversal patterns, although the pages in those patterns differ in importance and weight. Another main problem in most frequent sequential traversal pattern mining algorithms is that they produce a large number of sequential traversal patterns when the minimum support is lowered, and they do not provide ways to adjust the number of sequential traversal patterns other than increasing the minimum support. In this paper, we propose frequent sequential traversal pattern mining with a weight constraint. Our main approach is to add weight constraints to the sequential traversal pattern while maintaining the downward closure property. A weight range is defined to maintain the downward closure property: pages are given different weights, and traversal sequences are assigned a minimum and a maximum weight. While scanning a session database, the maximum and minimum weights in the database are used to prune infrequent sequential traversal subsequences, so that the downward closure property is maintained. Our method produces few but important sequential traversal patterns in session databases with a low minimum support, by adjusting the weight range of pages and sequences.
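The pruning idea described above can be sketched as follows. The abstract does not give the exact weighting formula, so this example makes an assumption: the weighted support of a pattern is its support multiplied by the average weight of its pages, and a pattern is pruned only when even the maximum page weight in the database could not lift it (or any extension of it) over the threshold. Function names, page weights, and sessions are all hypothetical.

```python
def is_subsequence(pattern, session):
    """True if the pages of pattern occur in session in the same order (gaps allowed)."""
    it = iter(session)
    return all(page in it for page in pattern)


def support(pattern, sessions):
    """Number of sessions that contain pattern as a subsequence."""
    return sum(1 for s in sessions if is_subsequence(pattern, s))


def weighted_support(pattern, sessions, weights):
    """Assumed definition: support scaled by the average weight of the pattern's pages."""
    avg_weight = sum(weights[p] for p in pattern) / len(pattern)
    return support(pattern, sessions) * avg_weight


def can_prune(pattern, sessions, weights, min_wsupport):
    """Safe pruning test that preserves downward closure.

    Any extension has support <= support(pattern) and average weight <= the
    maximum page weight, so support * max_weight bounds every extension.
    """
    max_weight = max(weights.values())
    return support(pattern, sessions) * max_weight < min_wsupport


# Hypothetical page weights and sessions.
weights = {"home": 0.2, "product": 0.8, "checkout": 1.0}
sessions = [
    ("home", "product", "checkout"),
    ("home", "product"),
    ("home",),
    ("product", "checkout"),
]
threshold = 1.5
```

In this toy data `("home",)` is not weighted-frequent on its own (weighted support about 0.6), yet it cannot be pruned because its upper bound of 3.0 exceeds the threshold; this is why the bound uses the maximum weight instead of the pattern's own weight.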
- Research Article
14
- 10.1155/2023/9761154
- Nov 3, 2023
- Structural Control and Health Monitoring
Condition rating of bridges is specified in many countries since it provides a basis for the decision-making of maintenance actions such as repair, strengthening, or limitation of passing vehicle weight. In practice, professional engineers check the textual description of damages to bridge members, such as girders, bearings, expansion joints, and piers, that are acquired from periodic inspections, and then make a rating of the bridge condition. The task is time-consuming and labor-intensive due to the large amount of detailed data buried in the inspection reports. In this paper, a natural language processing (NLP)-based machine learning (ML) approach is proposed for automated and fast bridge condition rating, which can efficiently extract the information of deficiencies in bridge members. The proposed approach involves three major steps, namely data repository establishment, NLP-based textual data processing, and ML-based bridge condition rating prediction. The data repository is established with the inspection reports of 263 concrete bridges, and there are four condition levels in total for the bridges. Then, the NLP-based textual data processing approach is implemented to calculate the word frequency and generate word clouds to visualize the characteristics of bridges in different condition levels. Finally, four typical ML techniques are adopted to generate the predictive model of the bridge condition rating. The results indicate that the NLP-based ML prediction model has an accuracy of 89% and is efficient enough to be used for large-scale applications such as condition rating for regional-level bridges.
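The word-frequency step can be illustrated with a short sketch. The report texts, condition levels, and stopword list below are invented for illustration and are not taken from the study's data repository; the function name `word_frequencies` is likewise hypothetical.

```python
import re
from collections import Counter, defaultdict

# Small illustrative stopword list (an assumption, not the study's list).
STOPWORDS = {"the", "of", "in", "and", "at", "on", "a", "is"}


def word_frequencies(reports):
    """Count word occurrences per condition level from (level, text) pairs."""
    freq = defaultdict(Counter)
    for level, text in reports:
        tokens = re.findall(r"[a-z]+", text.lower())
        freq[level].update(t for t in tokens if t not in STOPWORDS)
    return freq


# Invented inspection notes, labelled with a hypothetical condition level.
reports = [
    (1, "Minor cracking in the girder"),
    (3, "Severe corrosion of the bearing and cracking at the pier"),
    (3, "Severe spalling on the girder"),
]
freq = word_frequencies(reports)
top_level_3 = freq[3].most_common(1)
```

Per-level counters like these are the kind of input a word cloud or a bag-of-words feature vector for a classifier would be built from.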
- Research Article
39
- 10.5815/ijitcs.2014.03.09
- Feb 8, 2014
- International Journal of Information Technology and Computer Science
The process of data mining produces various patterns from a given data source. The most recognized data mining tasks are the discovery of frequent itemsets, frequent sequential patterns, frequent sequential rules and frequent association rules. Numerous efficient algorithms have been proposed for these tasks. Frequent pattern mining has been a focused topic in data mining research with a good number of references in the literature, and for that reason important progress has been made, ranging from performant algorithms for frequent itemset mining in transaction databases to complex algorithms for sequential pattern mining, structured pattern mining, and correlation mining. Association Rule Mining (ARM) is one of the most current data mining techniques, designed to group objects together from large databases with the aim of extracting interesting correlations and relations among huge amounts of data. In this article, we provide a brief review and analysis of the current status of frequent pattern mining and discuss some promising research directions. Additionally, this paper includes a comparative study of the performance of the described approaches.
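For readers unfamiliar with how an association rule is scored, the sketch below computes the standard support, confidence, and lift of a rule over a small transaction list. The transactions and the function names are hypothetical and only serve to make the definitions concrete.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    sup_a = support(antecedent, transactions)
    sup_c = support(consequent, transactions)
    sup_ac = support(set(antecedent) | set(consequent), transactions)
    confidence = sup_ac / sup_a if sup_a else 0.0  # P(consequent | antecedent)
    lift = confidence / sup_c if sup_c else 0.0    # > 1 suggests positive correlation
    return sup_ac, confidence, lift


# Hypothetical market-basket transactions.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]
sup, conf, lift = rule_metrics({"bread"}, {"milk"}, transactions)
```

In this toy data the rule bread -> milk holds in two of the three bread transactions, but its lift is below 1 because milk is already present in three of the four transactions overall.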
- Research Article
16
- 10.1007/s40747-020-00226-4
- Nov 11, 2020
- Complex & Intelligent Systems
Pattern mining has emerged as a compelling field of data mining over the years. The literature offers ample work in this field, ranging from frequent pattern mining to rare pattern mining. A precise and impartial analysis of the existing pattern mining techniques has therefore become essential to widen the scope of data analysis using the notion of pattern mining. This paper is therefore an attempt to provide a comparative scrutiny of the fundamental algorithms in the field of pattern mining through performance analysis based on several decisive parameters. The paper provides a structural classification of the widely referenced techniques in four pattern mining categories: frequent, maximal frequent, closed frequent and rare. It provides an analytical comparison of these techniques based on computational time and memory consumption using benchmark real and synthetic data sets. The results illustrate that tree-based approaches perform exceptionally well compared with level-wise approaches on dense data sets for all the categories. However, for sparse data sets, level-wise approaches performed better than tree-based ones. This study has been carried out with the aim of analyzing the pros and cons of the well-known pattern mining techniques under different categories. Through this empirical study, an endeavor has been made to enable researchers to identify some fruitful and promising research directions in one of the most remarkable areas of research, pattern mining.
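The difference between the frequent, closed frequent, and maximal frequent categories can be shown on a small support table. The sketch assumes the frequent itemsets and their counts have already been mined; the table is a hypothetical example and `classify` is not a function from any of the surveyed algorithms.

```python
def classify(frequent):
    """Split frequent itemsets into closed and maximal ones.

    closed:  no proper superset has the same support
    maximal: no proper superset is frequent at all
    Only frequent supersets need checking: a superset with equal support
    would itself be frequent.
    """
    closed, maximal = set(), set()
    for itemset, sup in frequent.items():
        supersets = [s for s in frequent if itemset < s]
        if not supersets:
            maximal.add(itemset)
        if all(frequent[s] < sup for s in supersets):
            closed.add(itemset)
    return closed, maximal


# Hypothetical support counts, consistent with the transactions
# AB, AB, AB, AC, AC and a minimum support of 2.
frequent = {
    frozenset("A"): 5,
    frozenset("B"): 3,
    frozenset("C"): 2,
    frozenset("AB"): 3,
    frozenset("AC"): 2,
}
closed, maximal = classify(frequent)
```

Every maximal itemset is also closed, but not the other way round: {A} is closed because no superset matches its support of 5, yet it is not maximal because {A, B} and {A, C} are still frequent.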
- Research Article
33
- 10.1007/s10489-021-02912-3
- Jan 10, 2022
- Applied Intelligence
Nonoverlapping sequential pattern mining, as a kind of repetitive sequential pattern mining with gap constraints, can find more valuable patterns. Traditional algorithms focused on finding all frequent patterns and found many redundant short patterns. This not only reduces the mining efficiency, but also makes it harder to obtain the required information. To reduce the number of frequent patterns while retaining their expressive ability, this paper focuses on Nonoverlapping Maximal Sequential Pattern (NMSP) mining, which refers to finding frequent patterns whose super-patterns are infrequent. In this paper, we propose an effective mining algorithm, Nettree for NMSP mining (NetNMSP), which has three key steps: calculating the support, generating the candidate patterns, and determining NMSPs. To efficiently calculate the support, NetNMSP employs the backtracking strategy to obtain a nonoverlapping occurrence from the leftmost leaf to its root with the leftmost parent node method in a Nettree. To reduce the candidate patterns, NetNMSP generates candidate patterns by the pattern join strategy. Furthermore, to determine NMSPs, NetNMSP adopts the screening method. Experiments on biological sequence datasets verify that not only does NetNMSP outperform the state-of-the-art algorithms, but also NMSP mining has better compression performance than closed pattern mining. On sales datasets, we validate that our algorithm guarantees the best scalability on large-scale datasets. Moreover, we mine NMSPs and frequent patterns in SARS-CoV-1, SARS-CoV-2 and MERS-CoV. The results show that the three viruses are similar in the short patterns but different in the long patterns. More importantly, NMSP mining makes it easier to find the differences between the virus sequences.