Motif Discovery Tools Research Articles

Background: Phenotype information in electronic health records (EHRs) is mainly recorded in unstructured free text, which cannot be directly used for clinical research. EHR-based deep-phenotyping methods can structure phenotype information in EHRs with high fidelity, making them a focus of medical informatics. However, developing a deep-phenotyping method for non-English EHRs (ie, Chinese EHRs) is challenging. Although numerous EHR resources exist in China, fine-grained annotation data suitable for developing deep-phenotyping methods are limited, so any method for Chinese EHRs must be developed in a low-resource scenario.

Objective: In this study, we aimed to develop a deep-phenotyping method with good generalization ability for Chinese EHRs based on limited fine-grained annotation data.

Methods: The core of the methodology was to identify linguistic patterns of phenotype descriptions in Chinese EHRs with a sequence motif discovery tool and to perform deep phenotyping of Chinese EHRs by recognizing these linguistic patterns in free text. Specifically, 1000 Chinese EHRs were manually annotated based on a fine-grained information model, PhenoSSU (Semantic Structured Unit of Phenotypes). The annotation data set was randomly divided into a training set (n=700, 70%) and a testing set (n=300, 30%). The process for mining linguistic patterns was divided into three steps. First, free text in the training set was encoded as single-letter sequences (P: phenotype, A: attribute). Second, a biological sequence analysis tool, MEME (Multiple Expectation Maximization for Motif Elicitation), was used to identify motifs in the single-letter sequences. Finally, the identified motifs were reduced to a series of regular expressions representing linguistic patterns of PhenoSSU instances in Chinese EHRs. Based on the discovered linguistic patterns, we developed a deep-phenotyping method for Chinese EHRs, including a deep learning–based method for named entity recognition and a pattern recognition–based method for attribute prediction.

Results: In total, 51 statistically significant sequence motifs were mined from the 700 Chinese EHRs in the training set and were combined into six regular expressions. These six regular expressions could be learned from a mean of 134 (SD 9.7) annotated EHRs in the training set. The deep-phenotyping algorithm for Chinese EHRs could recognize PhenoSSU instances with an overall accuracy of 0.844 on the test set. For the subtask of entity recognition, the algorithm achieved an F1 score of 0.898 with the Bidirectional Encoder Representations from Transformers–bidirectional long short-term memory and conditional random field model; for the subtask of attribute prediction, the algorithm achieved a weighted accuracy of 0.940 with the linguistic pattern–based method.

Conclusions: We developed a simple but effective strategy to perform deep phenotyping of Chinese EHRs with limited fine-grained annotation data. Our work will promote the secondary use of Chinese EHRs and may inspire similar work in other non–English-speaking countries.
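The encoding and pattern-recognition steps described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `encode` and `find_patterns` helpers, the `O` code for tokens that are neither phenotype nor attribute, and both regular expressions are assumptions made for illustration. The six expressions reported in the study are not reproduced here.

```python
import re

# Assumption: each token already carries a label, e.g. from manual
# annotation or a named entity recognition model.
CODES = {"phenotype": "P", "attribute": "A"}


def encode(labels):
    """Encode token labels as a single-letter sequence (P, A, or O for other)."""
    return "".join(CODES.get(label, "O") for label in labels)


# Hypothetical patterns, NOT the six regular expressions from the study.
PATTERNS = {
    "attribute_before_phenotype": re.compile(r"A+P"),
    "phenotype_before_attribute": re.compile(r"PA+"),
}


def find_patterns(sequence):
    """Return (pattern name, start, end) for each match, ordered by start."""
    hits = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(sequence):
            hits.append((name, match.start(), match.end()))
    return sorted(hits, key=lambda hit: hit[1])


labels = ["attribute", "phenotype", "other",
          "phenotype", "attribute", "attribute"]
sequence = encode(labels)
hits = find_patterns(sequence)
```

Each match span maps back to token positions, so the attributes inside a span can be attached to the phenotype in the same span. In the study the patterns came from MEME motifs rather than being written by hand as they are here.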

Background: A strong focus of the post-genomic era is mining of the non-coding regulatory genome in order to unravel the function of regulatory elements that coordinate gene expression (Nat 489:57–74, 2012; Nat 507:462–70, 2014; Nat 507:455–61, 2014; Nat 518:317–30, 2015). Whole-genome approaches based on next-generation sequencing (NGS) have provided insight into the genomic location of regulatory elements throughout different cell types, organs and organisms. These technologies are now widespread and commonly used in laboratories from various fields of research. This highlights the need for fast and user-friendly software tools dedicated to extracting cis-regulatory information contained in these regulatory regions; for instance transcription factor binding site (TFBS) composition. Ideally, such tools should not require prior programming knowledge to ensure they are accessible for all users.

Results: We present TrawlerWeb, a web-based version of the Trawler_standalone tool (Nat Methods 4:563–5, 2007; Nat Protoc 5:323–34, 2010), to allow for the identification of enriched motifs in DNA sequences obtained from next-generation sequencing experiments in order to predict their TFBS composition. TrawlerWeb is designed for online queries with standard options common to web-based motif discovery tools. In addition, TrawlerWeb provides three unique new features: 1) TrawlerWeb allows the input of BED files directly generated from NGS experiments, 2) it automatically generates an input-matched biologically relevant background, and 3) it displays resulting conservation scores for each instance of the motif found in the input sequences, which assists the researcher in prioritising the motifs to validate experimentally. Finally, to date, this web-based version of Trawler_standalone remains the fastest online de novo motif discovery tool compared to other popular web-based software, while generating predictions with high accuracy.

Conclusions: TrawlerWeb provides users with a fast, simple and easy-to-use web interface for de novo motif discovery. This will assist in rapidly analysing NGS datasets that are now being routinely generated. TrawlerWeb is freely available and accessible at: http://trawler.erc.monash.edu.au.
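For readers unfamiliar with the BED input mentioned above, the following Python sketch shows how genomic intervals are typically read from BED text. It is an illustration only and says nothing about how TrawlerWeb processes its input; the `parse_bed` helper is hypothetical. It relies on the standard BED convention that the first three fields are chromosome, 0-based start, and exclusive end.

```python
def parse_bed(text):
    """Parse BED text into (chrom, start, end) tuples.

    Header lines ("track", "browser", "#") and blank lines are skipped.
    Only the first three columns are used; extra columns are ignored.
    """
    intervals = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if end <= start:
            raise ValueError(f"empty or inverted interval: {line!r}")
        intervals.append((chrom, start, end))
    return intervals


example = "track name=peaks\nchr1\t100\t250\tpeak1\n\nchr2\t5\t10\n"
regions = parse_bed(example)
# With a 0-based start and an exclusive end, length is simply end - start.
lengths = [end - start for _, start, end in regions]
```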

Articles published on Motif Discovery Tools

Discovering DNA shape motifs with multiple DNA shape features: generalization, methods, and validation.

Trie-PMS8: A trie-tree based robust solution for planted motif search problem

Comparative analysis of computational based motif refinement methods

Thermodynamic and structural characterization of an EBV infected B-cell lymphoma transcriptome.

Transcriptomic Response of the Diazotrophic Bacteria Gluconacetobacter diazotrophicus Strain PAL5 to Iron Limitation and Characterization of the fur Regulatory Network.

Deep Phenotyping of Chinese Electronic Health Records by Recognizing Linguistic Patterns of Phenotypic Narratives With a Sequence Motif Discovery Tool: Algorithm Development and Validation.

Prediction and Experimental Validation of a New Salinity-Responsive Cis-Regulatory Element (CRE) in a Tilapia Cell Line.

Bayesian Markov models improve the prediction of binding motifs beyond first order.

Locating transcription factor binding sites by fully convolutional neural network.

CEMD: A Cluster-based Ensemble Motif Discovery Tool

A survey of RNA secondary structural propensity encoded within human herpesvirus genomes: global comparisons and local motifs.

Matrix profile goes MAD: variable-length motif and discord discovery in data series

Regmex: a statistical tool for exploring motifs in ranked sequence lists from genomics experiments

MCAT: Motif Combining and Association Tool.

Sequential Integration of Fuzzy Clustering and Expectation Maximization for Transcription Factor Binding Site Identification.

TrawlerWeb: an online de novo motif discovery tool for next-generation sequencing datasets

An efficient method for significant motifs discovery from multiple DNA sequences.

RSAT matrix-clustering: dynamic exploration and redundancy reduction of transcription factor binding motif collections.

KpLogo: positional k-mer analysis reveals hidden specificity in biological sequences.

SLiMSearch: a framework for proteome-wide discovery and annotation of functional modules in intrinsically disordered regions.
