Clustering Similar Data Research Articles

Electronic invoices (NF-e) are the most common way to document monetary transactions in Brazil. Auditing these invoices is essential and can be improved with data mining techniques such as clustering and anomaly detection. Applying these techniques is not simple, however, because NF-e data comprises millions of records with noisy fields and nonstandard documents, especially short text descriptions. In addition, extracting information from short texts to identify traces of mismanagement, embezzlement, commercial fraud, or tax evasion is costly. Such data can be analyzed more effectively when it is divided into well-defined groups, yet efficient solutions for clustering data with characteristics similar to NF-e data have not been proposed in the literature. We developed ELINAC, a service that uses an autoencoder to cluster short-text data in NF-es. ELINAC aids in auditing transactions documented in NF-es by clustering similar records according to their short-text descriptions, which makes anomaly detection in numeric fields easier. To this end, ELINAC explores how to model the autoencoder without increasing the computational cost of compressing a large volume of short-text data. The results show that, even in the worst case, ELINAC groups data efficiently while running three times faster than solutions previously adopted in the literature.
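
The point that grouping makes anomaly detection in numeric fields easier can be illustrated with a minimal sketch. It is not ELINAC itself: the cluster labels are assumed to be given already, the field names (`description`, `cluster`, `unit_price`) and the sample records are hypothetical, and the modified z-score based on the median absolute deviation is a swapped-in, generic outlier test with a commonly used cutoff of 3.5.

```python
from collections import defaultdict
from statistics import median


def flag_price_anomalies(records, threshold=3.5):
    """Return records whose unit price is an outlier within its own cluster.

    Assumption: each record already carries a 'cluster' label produced by
    some short-text clustering step (not shown here).
    """
    by_cluster = defaultdict(list)
    for rec in records:
        by_cluster[rec["cluster"]].append(rec)

    flagged = []
    for items in by_cluster.values():
        prices = [r["unit_price"] for r in items]
        med = median(prices)
        # Median absolute deviation: a robust estimate of spread.
        mad = median([abs(p - med) for p in prices])
        if mad == 0:
            continue  # no spread in this cluster; nothing to compare against
        for r in items:
            # Modified z-score; 0.6745 rescales MAD to match a normal std dev.
            score = 0.6745 * abs(r["unit_price"] - med) / mad
            if score > threshold:
                flagged.append(r)
    return flagged


# Hypothetical invoice items, already grouped by description similarity.
records = [
    {"description": "paper a4 500 sheets", "cluster": 0, "unit_price": 21.0},
    {"description": "a4 paper ream", "cluster": 0, "unit_price": 22.5},
    {"description": "ream paper a4 white", "cluster": 0, "unit_price": 20.0},
    {"description": "paper a4 pack", "cluster": 0, "unit_price": 23.0},
    {"description": "a4 paper 500 sh", "cluster": 0, "unit_price": 95.0},
    {"description": "diesel s10 liter", "cluster": 1, "unit_price": 5.9},
    {"description": "diesel oil s10", "cluster": 1, "unit_price": 6.1},
    {"description": "s10 diesel", "cluster": 1, "unit_price": 6.0},
]

flagged = flag_price_anomalies(records)
for rec in flagged:
    print(rec["description"], rec["unit_price"])
```

In this toy data the 95.0 price stands out only because it is compared with other A4 paper items; compared against all records at once, the much cheaper diesel items would shift the reference values and make the comparison less meaningful.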

Data mining is a set of techniques and technologies for extracting, automatically or semi-automatically, useful information, models, and trends from large data sets. Techniques such as clustering, classification, association, and regression; statistics and Bayesian calculations; and artificial intelligence algorithms such as neural networks are used to extract patterns from data, with the main goal of explaining and predicting behavior. Data thus become the source of relevant information. Research data are gathered as numbers (quantitative data) as well as symbolic values (qualitative data), and useful knowledge is extracted (mined) from large amounts of them. This knowledge makes it possible to establish relationships among attributes or data sets, cluster similar data, classify attribute relationships, and reveal information that could remain hidden or lost in a vast quantity of data if data mining were not used. Combining quantitative and qualitative data is the essence of mixed methods: on the one hand, coherently integrating the interpretation of results from separate analyses, and on the other, transforming data from qualitative to quantitative and vice versa. This study shows how data mining techniques can be a useful complement to mixed methods, because they can work with qualitative and quantitative data together, obtaining numeric analyses from qualitative data through Bayesian probability calculations or transforming quantitative data into qualitative data using discretization techniques.
As a case study, the Psychological Inventory of Sports Performance (IPED) has been mined and decision trees have been developed to check for relationships between the “Self-confidence” (AC), “Negative Coping Control” (CAN), “Attention Control” (CAT), “Visuoimaginative Control” (CVI), “Motivational Level” (NM), “Positive Coping Control” (CAP), and “Attitudinal Control” (CACT) factors and the gender and age of athletes. These decision trees can also be used for future data predictions or assumptions.
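
The transformation of quantitative data into qualitative data through discretization can be sketched as follows. This is a minimal illustration that assumes equal-width binning, which is only one of several discretization techniques and is not confirmed as the one used in the study; the function name `discretize`, the labels, and the sample scores are hypothetical.

```python
def discretize(values, labels=("low", "medium", "high")):
    """Map numeric values to ordered labels using equal-width bins."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # No spread: every value falls into the first bin.
        return [labels[0]] * len(values)
    width = (hi - lo) / len(labels)
    result = []
    for v in values:
        # The maximum value would index one past the last bin, so clamp it.
        idx = min(int((v - lo) / width), len(labels) - 1)
        result.append(labels[idx])
    return result


# Hypothetical questionnaire factor scores.
scores = [10, 14, 20, 25, 31, 40]
categories = discretize(scores)
print(categories)
# prints ['low', 'low', 'medium', 'medium', 'high', 'high']
```

Once scores are expressed as categories, they can be used alongside qualitative attributes such as gender in methods that expect categorical inputs.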

Articles published on Clustering Similar Data

DIDS: Double Indices and Double Summarizations for Fast Similarity Search

ELINAC: Autoencoder Approach for Electronic Invoices Data Clustering

Alignment of Microarray Data.

Data Mining in the Mixed Methods: Application to the Study of the Psychological Profiles of Athletes.

Fuzzy granular gravitational clustering algorithm for multivariate data
