Machine learning based imputation techniques for estimating phylogenetic trees from incomplete distance matrices

Md. Shamsuzzoha Bayzid, Ananya Bhattacharjee

doi:10.1186/s12864-020-06892-5


Open Access



Abstract

Background: With the rapid growth rate of newly sequenced genomes, species tree inference from genes sampled throughout the whole genome has become a basic task in comparative and evolutionary biology. However, substantial challenges remain in leveraging these large scale molecular data. One of the foremost challenges is to develop efficient methods that can handle missing data. Popular distance-based methods, such as NJ (neighbor joining) and UPGMA (unweighted pair group method with arithmetic mean), require complete distance matrices without any missing data.

Results: We introduce two highly accurate machine learning based distance imputation techniques. These methods are based on matrix factorization and autoencoder based deep learning architectures. We evaluated these two methods on a collection of simulated and biological datasets. Experimental results suggest that our proposed methods match or improve upon the best alternate distance imputation techniques. Moreover, these methods are scalable to large datasets with hundreds of taxa, and can handle a substantial amount of missing data.

Conclusions: This study shows, for the first time, the power and feasibility of applying deep learning techniques for imputing distance matrices. Thus, this study advances the state-of-the-art in phylogenetic tree construction in the presence of missing data. The proposed methods are available in open source form at https://github.com/Ananya-Bhattacharjee/ImputeDistances.

Highlights

With the rapid growth rate of newly sequenced genomes, species tree inference from genes sampled throughout the whole genome has become a basic task in comparative and evolutionary biology
With moderate to high numbers of missing values (50 to 110), LASSO achieved the best performance in recovering true bipartitions, while matrix factorization (MF) and the autoencoder (AE) performed well in some cases
Because DAMBE and LASSO cannot handle distance matrices with more than 50% missing entries, only MF and AE were able to run on the distance matrices with 342 (∼50%) missing entries, although the resulting Robinson-Foulds (RF) rates were very high
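
To make the matrix factorization idea concrete, here is a minimal sketch of imputing missing entries in a distance matrix with a low-rank fit. It is an illustration under stated assumptions, not the authors' implementation (which is available in the repository linked in the abstract): the function name `impute_distances_mf`, the rank, the learning rate, and the toy matrix are all hypothetical, missing entries are assumed to be marked with NaN, and the plain gradient-descent settings are only intended for small matrices.

```python
import numpy as np

def impute_distances_mf(D, rank=2, lr=0.01, reg=1e-3, epochs=2000, seed=0):
    """Fill NaN entries of a symmetric distance matrix with a low-rank fit.

    Hypothetical helper for illustration; not the API of the authors' tool.
    """
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    observed = ~np.isnan(D)
    np.fill_diagonal(observed, False)   # the diagonal is known to be 0; do not fit it
    scale = np.nanmax(D) or 1.0         # rescale so the largest observed distance is 1
    target = np.where(observed, D / scale, 0.0)

    rng = np.random.default_rng(seed)
    U = rng.uniform(0.1, 1.0, size=(n, rank))
    V = rng.uniform(0.1, 1.0, size=(n, rank))

    for _ in range(epochs):
        # Residual on observed off-diagonal entries only; missing entries contribute 0.
        E = np.where(observed, target - U @ V.T, 0.0)
        # Simultaneous gradient step on the regularized squared error.
        U, V = U + lr * (E @ V - reg * U), V + lr * (E.T @ U - reg * V)

    P = U @ V.T
    P = np.clip((P + P.T) / 2.0, 0.0, None) * scale  # symmetric, non-negative estimate
    filled = np.where(np.isnan(D), P, D)             # keep every observed distance
    np.fill_diagonal(filled, 0.0)
    return filled

# Toy 4-taxon matrix (made-up values) with one missing pair, marked as NaN.
D = np.array([
    [0.0, 0.3, 0.5, 0.6],
    [0.3, 0.0, 0.4, 0.5],
    [0.5, 0.4, 0.0, np.nan],
    [0.6, 0.5, np.nan, 0.0],
])
filled = impute_distances_mf(D)
print(filled)
```

The completed matrix can then be passed to a distance-based method such as NJ or UPGMA, which require a matrix without missing entries. Observed distances are left untouched; only the NaN positions receive estimates.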

Summary

Introduction

With the rapid growth rate of newly sequenced genomes, species tree inference from genes sampled throughout the whole genome has become a basic task in comparative and evolutionary biology. Distance-based methods represent an attractive class of methods for large-scale analyses due to their computational efficiency. Although these methods are generally not as accurate as the computationally demanding Bayesian or likelihood-based methods, several studies [10, 11, 15, 16, 17, 18, 19] have supported the ability of distance-based methods to estimate accurate phylogenetic trees. Notable progress has been made towards developing various distance-based methods [1, 10, 11, 16, 17, 19, 28, 29, 30, 31, 32, 33, 34, 35]. Some of these methods can be used to analyze large-scale single nucleotide polymorphism (SNP) data [36, 37].

Objectives

Methods

Results

Conclusion