Zgli: A Pipeline for Clustering by Compression with Application to Patient Stratification in Spondyloarthritis.

Diogo Azevedo, Alexandra M. Carvalho, Ana Maria Rodrigues, André Souto, Helena Canhão

doi:10.3390/s23031219

Open Access

Abstract

The normalized compression distance (NCD) is a compression-based similarity measure between a pair of finite objects. Clustering methods usually rely on distances (e.g., Euclidean distance, Manhattan distance) to measure the similarity between objects. The NCD is another such distance, with particular characteristics, that can be used to build the initial distance matrix for methods such as hierarchical clustering or K-medoids. In this work, we propose Zgli, a novel Python module that enables the user to compute the NCD between files inside a given folder. Inspired by the CompLearn Linux command line tool, this module extends it with new text file compressors, a new compression-by-column option for tabular data such as CSV files, and an encoder for small files made up of categorical data. Our results demonstrate that compression by column can yield better results than previous methods in the literature when clustering tabular data. Additionally, the categorical encoder can augment categorical data, allowing the NCD to be used on new data types. One advantage of this new feature is that it requires no knowledge of the data or its context. Furthermore, because the proposed module is written in Python, one of the most popular programming languages for machine learning, developers can readily use it to tackle problems with a new approach based on compression. The pipeline was tested on clinical data and proved to be a promising computational strategy, providing cluster-based patient stratification that can aid precision medicine.
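The pairwise distance matrix described in the abstract can be illustrated with a minimal sketch. It uses the standard NCD definition, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed size in bytes. This is not the Zgli API: the function names `compressed_size`, `ncd`, and `ncd_matrix` are hypothetical, and `zlib` is assumed here only as a readily available compressor.

```python
import zlib
from itertools import combinations


def compressed_size(data: bytes) -> int:
    """Size in bytes of `data` after zlib compression (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation of both objects
    return (cxy - min(cx, cy)) / max(cx, cy)


def ncd_matrix(objects: list[bytes]) -> list[list[float]]:
    """Symmetric pairwise NCD matrix; the diagonal is fixed at 0.0 by convention."""
    n = len(objects)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = ncd(objects[i], objects[j])
    return matrix


if __name__ == "__main__":
    # Illustrative inputs only: two near-identical texts and one unrelated byte pattern.
    docs = [
        b"the quick brown fox jumps over the lazy dog " * 20,
        b"the quick brown fox jumps over the lazy cat " * 20,
        bytes(range(256)) * 4,
    ]
    for row in ncd_matrix(docs):
        print(["%.3f" % d for d in row])
```

A matrix built this way could then be passed to any clustering routine that accepts precomputed distances, such as hierarchical clustering or K-medoids. With a real compressor the NCD is only approximately a metric, so values slightly above 1 or a nonzero self-distance are possible.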

Journal: Sensors	Publication Date: Jan 20, 2023
Citations: 4	License type: CC BY 4.0
