Digitizing DNA Sequences Using Multiset-Based Nucleotide Frequencies for Machine Learning-Based Mutation Detection

Abstract

Investigating algebraic structures in a non-conventional framework extends mathematics toward practical applications in theoretical biology and computer science. One such structure is the multigroup, whose underlying set is a multiset. The genome is the entire set of DNA instructions found within a cell, and it contains all the information an individual needs to develop and function. DNA and RNA are the hereditary materials that play a vital role in the metabolic processes of living things, especially protein synthesis. In genomic databases, DNA sequences are stored as strings or text. Machine learning algorithms, however, operate on numerical data, so DNA sequence strings must be transformed into numerical values. This article is organized as follows. First, we prove that the standard genetic code is a multigroup and that the genome architecture of a whole population can be interpreted as a sum of multisets. Next, we describe how a numerical representation of DNA bases relates to its algebraic representation. We then employ Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), and Bidirectional Long Short-Term Memory (BiLSTM) networks to identify changes between DNA sequences. Experimental results show that a GRU with multiset-based numerical values for DNA bases achieves 98% accuracy on the test data. This novel technique will aid in detecting mutations in complex diseases.
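As a rough illustration of the digitization step, the sketch below treats a DNA string as a multiset of bases and replaces each base with its relative frequency. The abstract does not give the exact numerical mapping, so the relative-frequency encoding and the function names (`base_multiset`, `encode`, `population_multiset`) are assumptions made for this example, not the authors' method. The multiset sum over several sequences is shown with `collections.Counter` addition.

```python
from collections import Counter

BASES = "ACGT"


def base_multiset(seq):
    """Return the multiset of bases in seq as a Counter (base -> multiplicity)."""
    seq = seq.upper()
    unknown = set(seq) - set(BASES)
    if unknown:
        raise ValueError(f"unexpected symbols: {sorted(unknown)}")
    return Counter(seq)


def encode(seq):
    """Replace each base with its relative frequency in seq.

    Assumed encoding for illustration only; the paper's mapping may differ.
    """
    counts = base_multiset(seq)
    total = len(seq)
    return [counts[base] / total for base in seq.upper()]


def population_multiset(seqs):
    """Sum the base multisets of several sequences (multiset addition)."""
    total = Counter()
    for seq in seqs:
        total += base_multiset(seq)
    return total


if __name__ == "__main__":
    print(base_multiset("AACG"))
    print(encode("AACG"))
    print(population_multiset(["AACG", "TTGA"]))
```

The resulting lists of floats have one value per base, so they could be padded to a common length and fed to a recurrent model such as a GRU. That downstream step is not shown here.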

Similar Papers
  • Research Article
  • 10.1088/1757-899x/567/1/012019
Vector space of codons sequence over galois field GF(73)
  • Jul 1, 2019
  • IOP Conference Series: Materials Science and Engineering
  • I Aisah + 3 more

DNA and RNA are the genetic materials that play an important role in the metabolic processes of living things, in particular protein synthesis. DNA has four nucleobases: adenine (A), guanine (G), cytosine (C), and thymine (T). Protein synthesis is closely related to the standard genetic code, a set of rules that defines how the order of nucleotide bases in DNA or RNA determines the order of amino acids in protein synthesis. The standard genetic code is built from combinations of three nitrogenous bases, or base triplets. This code can be represented mathematically by an algebraic structure. In this paper we give such a representation using an extended set of seven elements built from the four DNA bases, N = {D, A, C, 0, G, T, P}. We then construct a new triplet set from N, called the extended triplet set. Finally, we analyze its vector space structure and find the significant field that corresponds to that structure.

  • Research Article
  • Cite Count: 41
  • 10.2353/jmoldx.2009.090022
Sensitive Detection of KRAS Mutations in Archived Formalin-Fixed Paraffin-Embedded Tissue Using Mutant-Enriched PCR and Reverse-Hybridization
  • Nov 1, 2009
  • The Journal of Molecular Diagnostics
  • Christoph Ausch + 8 more

  • Research Article
  • Cite Count: 1122
  • 10.1137/0210055
The NP-Completeness of Edge-Coloring
  • Nov 1, 1981
  • SIAM Journal on Computing
  • Ian Holyer

We show that it is NP-complete to determine the chromatic index of an arbitrary graph. The problem remains NP-complete even for cubic graphs.

  • Single Book
  • Cite Count: 4
  • 10.1385/0896032051
Protocols in Human Molecular Genetics
  • Nov 1, 1991
  • Christopher G Mathew

The Polymerase Chain Reaction: Getting Started. Direct DNA Sequencing of Complementary DNA Amplified by the Polymerase Chain Reaction. Direct-Sequencing of PCR-Amplified DNA. Rapid DNA Sequence Analysis Using Fluorescent Labels. Detection of Mutations in DNA and RNA by Chemical Cleavage. Rapid Methods for Detection of Polymorphic Markers in Genomic DNA. The Analysis of Point Mutations Using Synthetic Oligonucleotide Probes. Detection of Mutations by the Amplification Refractory Mutation System (ARMS). Automated Gene Detection Using the Oligonucleotide Ligation Assay. Detection of Point Mutations by Denaturing-Gradient Gel Electrophoresis. The Detection and Mapping of Point Mutations by RNase A Cleavage. Discontinuous Polyacrylamide Gel Electrophoresis of DNA Fragments. Extraction and Enzymatic Amplification of DNA from Paraffin-Embedded Specimens. The Use of the Polymerase Chain Reaction in the Mapping of Human Genes Using Somatic Cell Hybrids. The Southern Blot: An Update. The Detection of Specific DNA Sequences by Enhanced Chemiluminescence. Pulsed-Field Gel Electrophoresis. Cloning from Gels Following Pulse-Field Gel Electrophoresis. Yeast Artificial-Chromosome (YAC) Cloning Systems. Gene Targeting for Somatic Cell Manipulation. In Situ Hybridization of Chromosomes. DNA Fingerprinting Analysis: Methodology and Its Applications. DNA Fingerprinting and Forensic Medicine. The Detection of Point Mutations in Hemoglobin Defects Using Allele-Specific Oligonucleotide Probes. Detection of Gene Deletions Using Multiplex Polymerase Chain Reactions. Application of Pulsed-Field Gel Electrophoresis to Genetic Diagnosis. Molecular Diagnostics of Cancer. The Detection of Latent Virus Infection by Polymerase Chain Reaction. Mapping Inherited Diseases by Linkage Analysis. Diagnosis of Genetic Disorders with Linked DNA Markers. Software for Genetic Linkage Analysis. Creating Animal Models of Genetic Diseases. Molecular Biology and Medicine: Ethical Implications. Appendix. Index.

  • Research Article
  • Cite Count: 43
  • 10.1074/jbc.m113.467977
Genetic Code-guided Protein Synthesis and Folding in Escherichia coli
  • Oct 1, 2013
  • Journal of Biological Chemistry
  • Shaoliang Hu + 3 more

Universal genetic codes are degenerate, with 61 codons specifying 20 amino acids, thus creating synonymous codons for a single amino acid. Synonymous codons have been shown to affect protein properties in a given organism. To address this issue and explore how Escherichia coli selects its "codon-preferred" DNA template(s) for synthesis of proteins with required properties, we have designed synonymous codon libraries based on an antibody (scFv) sequence and carried out bacterial expression and screening for variants with altered properties. As a result, 342 codon variants have been identified, differing significantly in protein solubility and functionality while retaining the identical original amino acid sequence. The soluble expression level varied from completely insoluble aggregates to a soluble yield of ~2.5 mg/liter, whereas the antigen-binding activity changed from no binding at all to a binding affinity of > 10^-8 M. Not only does our work demonstrate the involvement of genetic codes in regulating protein synthesis and folding, but it also provides a novel screening strategy for producing improved proteins without the need to substitute amino acids.

  • Research Article
  • Cite Count: 40
  • 10.2353/jmoldx.2008.080024
Rapid Screening Assay for KRAS Mutations by the Modified Smart Amplification Process
  • Nov 1, 2008
  • The Journal of Molecular Diagnostics
  • Kenji Tatsumi + 22 more

  • Front Matter
  • 10.1088/1742-6596/1255/1/011002
Preface
  • Aug 1, 2019
  • Journal of Physics: Conference Series

Assalamu’alaikum Wr. Wb. and Greetings. I am pleased and honored to welcome you to the First International Conference on Computer Science and Applied Mathematic (ICCSAM 2018), which was held this year by AMIK and STIKOM Tunas Bangsa Pematangsiantar in collaboration with the Indonesian Computer and Information Professional Association (IPKIN), the Indonesian Mathematical Society (INDOMS) and Bank Muamalat. As Chair of the Foundation, I would like to welcome to Parapat City, Lake Toba, the presenters from various countries, and especially the scientists in the field of Computer and Mathematical Sciences and the industry practitioners whose presence at the 2018 ICCSAM event I respect. I see that this activity is designed to enhance the exchange of knowledge and new discoveries in computer science, mathematics and related fields in industry. I hope that scientists in the field of computer science and mathematics and those who work in industry can share knowledge and work together as a team with mutual relations. I am very pleased to say that the theme of this conference, “Advancing Computability Innovation”, is very much in line with the objectives of AMIK and STIKOM conferences and missions, namely “to become a Science and Technology based Study Program that meets industrial needs and functions as a research center in Information Data Science”. The conference was held in response to output and outcome that had a significant contribution in the field of computer science and mathematics as an inseparable unit of science towards the development of local and global industries. It is our happiness and honor to welcome the distinguished professors present to convey their expertise at this conference. I hope this meeting will enable the development of productive dialogue between participants from various countries. It also provides invaluable opportunities for networking among participants, institutions and industries.
I also hope that diversity in these fields can reveal more opportunities for researchers and practitioners from all over the world to start a lot of research related to the industry in the future. I would like to congratulate AMIK and STIKOM Tunas Bangsa for starting this conference with a synergistic contribution from dedicated partners, namely the USU Faculty of Mathematics and Natural Sciences, USU’s Faculty of Computer and Information Technology, the Indonesian Computer and Informatics Professional Association (IPKIN), Indonesian Mathematical Society (INDOMS), Bank Muamalat, Malaysia Pahang University (UMP), University of Essex, United Kingdom, Ankara, Turkey and Institute of Applied Mathematics Middle East Technical University, and to thank all delegates for their full support, cooperation and contribution to the 2018 ICCSAM. I also want to thank the Organizing Committee for their perseverance and extraordinary efforts. The various sponsors are also thanked for their contributions. I really hope that all participants will have a pleasant stay here at NIAGARA HOTEL and bring back unforgettable experiences and valuable knowledge from this conference. Thanks. H. Ahmad Ridwansyah Putra

  • Research Article
  • Cite Count: 303
  • 10.1137/0222038
Nondeterminism within $P^*$
  • Jun 1, 1993
  • SIAM Journal on Computing
  • Jonathan F Buss + 1 more

Classes of machines using very limited amounts of nondeterminism are studied. The P=? NP question is related to questions about classes lying within P. Complete sets for these classes are given.

  • Research Article
  • 10.1158/1538-7445.am2012-3975
Abstract 3975: Detecting patient mutomes by integrating DNA and RNA sequencing
  • Apr 15, 2012
  • Cancer Research
  • Matthew D Wilkerson + 7 more

Personalized cancer medicine, the matching of therapies to a given patient's somatic alterations, depends on highly accurate and complete identification of patients’ somatic alterations, or their mutome. Advances in sequencing technologies (exome sequencing, RNAseq, and whole genome sequencing) have provided a means to examine large portions of the genetic content of patients’ cancers. Computational tools have arisen that make somatic mutation predictions utilizing particular sequencing assays; however, each sequencing assay has limitations and existing mutation detection tools exhibit less than ideal agreement when analyzing the same data. The task of identifying all somatic mutations in one patient's cancer remains a challenge to personalized cancer medicine. Typically, somatic mutation detection is performed utilizing DNA sequencing. Because RNA sequencing is often a component of genome characterization projects along with DNA sequencing, we sought to evaluate the possible added value of RNA sequencing in somatic mutation detection. We have developed an original computational method, UNCeqR, that makes patient-specific somatic mutation predictions utilizing RNA sequencing combined with DNA sequencing. DNA mutations and RNA mutations are statistically modeled separately and results are combined in a meta-analytic fashion, resulting in up to three predictions for a locus: DNA-only, RNA-only, and DNA+RNA. In addition to de novo genomewide mutation predictions, UNCeqR can query specific a priori mutations. UNCeqR was applied to The Cancer Genome Atlas (TCGA) lung squamous cell carcinoma sequencing data, consisting of Illumina RNAseq and Illumina exome sequencing. Of annotated exons, 20% had very low to zero coverage in RNA and 5% had very low to zero coverage in DNA, indicating that both sequencing assays add new genomic territory for mutation detection.
Limiting to regions with both DNA and RNA coverage, 56% of mutations detected from DNA were also predicted by RNA, providing an independent validation of these mutations. To evaluate if mutation detection using DNA+RNA is superior to detection using DNA-only, cancer specimen DNA and RNA reads were randomly split into subsamples. UNCeqR was executed on each of the subsamples and mutation agreement was compared among pairs of subsamples within regions of DNA and RNA coverage. Compared with the DNA-only method, DNA+RNA mutation detection exhibited a 42% relative increase in percent agreement across subsamples and a 230% relative increase in the number of mutations detected. Therefore, RNA sequencing adds positive value to somatic mutation detection via UNCeqR. Citation Format: {Authors}. {Abstract title} [abstract]. In: Proceedings of the 103rd Annual Meeting of the American Association for Cancer Research; 2012 Mar 31-Apr 4; Chicago, IL. Philadelphia (PA): AACR; Cancer Res 2012;72(8 Suppl):Abstract nr 3975. doi:1538-7445.AM2012-3975

  • Front Matter
  • Cite Count: 11
  • 10.1111/bjh.17621
British Society for Haematology guidelines for the diagnosis and evaluation of prognosis of Adult Myelodysplastic Syndromes.
  • Jun 16, 2021
  • British Journal of Haematology
  • Sally B Killick + 18 more

  • Book Chapter
  • Cite Count: 2
  • 10.5772/9153
On The Combination of Feature and Instance Selection
  • Feb 1, 2010
  • Jerffeson Teixeira de Souza + 2 more

In the last decades, huge amounts of data have become omnipresent in diverse areas of knowledge, such as business, astronomy, biology, and so on. Machine Learning and Knowledge Discovery in Databases (KDD) are fields in Computer Science that focus on the task of transforming these data into useful knowledge. In (Fayyad et al., 1996), KDD is defined as “the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data”. Feature and Instance Selection belong to the practice of data preparation (or pre-processing), which is a preliminary process that transforms raw data into a format that is convenient to the data mining (or machine learning) algorithm. Usually, data is stored in a table-like format: the columns of these tables are the attributes or features (they describe the data) and the rows, or lines, are the records or instances (they are the examples of the concept stored in the data). Feature and Instance selection processes allow applications, such as classification or clustering, to focus only on the attributes and records that are important (or relevant) to the specific concept under study. As important machine learning problems, Feature and Instance Selection have been studied systematically over the last decades, during which several algorithms for solving them individually have been proposed. Such selection problems play a fundamental role in the pre-processing step of any learning task. By removing noise, irrelevant and redundant features and instances, and reducing the overall dimensionality of a dataset, feature and instance selection have been demonstrated to improve the performance of most machine learning algorithms, speed up the output of models and allow algorithms to deal with datasets whose sizes are gigantic.
Even though the specialized literature has exhibited remarkable results in solving both the feature and instance selection problems individually, little work has been done to make these solutions work together in order to solve these related problems simultaneously, or even to understand the relationship between features and instances. This chapter initially discusses the feature and instance selection problems and their relevance to machine learning, giving an accurate definition of both problems. Next, it surveys different approaches for dealing with feature selection and instance selection separately and some works that tried to integrate the solutions for these two problems, …

  • Research Article
  • Cite Count: 185
  • 10.1137/0208008
Total Ordering Problem
  • Feb 1, 1979
  • SIAM Journal on Computing
  • J Opatrny

The problem of finding a total ordering of a finite set satisfying a given set of in-between restrictions is considered. It is shown that the problem is $NP$-complete.

  • Research Article
  • Cite Count: 9
  • 10.2353/jmoldx.2009.090061
Ultrasensitive Detection of KRAS2 Mutations in Bile and Serum from Patients with Biliary Tract Carcinoma Using LigAmp Technology
  • Nov 1, 2009
  • The Journal of Molecular Diagnostics
  • Chanjuan Shi + 7 more

  • Research Article
  • Cite Count: 16
  • 10.1016/s1525-1578(10)60531-4
Microsphere Bead Arrays and Sequence Validation of 5/7/9T Genotypes for Multiplex Screening of Cystic Fibrosis Polymorphisms
  • Nov 1, 2004
  • The Journal of Molecular Diagnostics
  • Andrew G Hadd + 6 more

  • Research Article
  • Cite Count: 36
  • 10.1161/circulationaha.111.027300
Functional genomics applied to cardiovascular medicine.
  • Jul 5, 2011
  • Circulation
  • Thomas P Cappola + 1 more

Since completion of the draft sequence of the human genome in 2000, the landscape of biomedical research has undergone a rapid transformation. Growing knowledge of genome structure and variation has spawned the development of technologies that allow researchers to study thousands of genes, transcripts, and proteins simultaneously. This has expanded biomedical science beyond reductionist approaches that test the function of individual genes to less biased approaches that study the behavior of many or all genes in homeostasis and disease. Such studies have been grouped under the broad label of functional genomics, which can be defined as the branch of biology that seeks to uncover the properties and function of the entirety of the genes and gene products of an organism.1 Functional genomics is fueling an explosion of new insights in biology and medicine, and many of these insights were completely unanticipated. Because these advances have begun to influence clinical practice,2,3 physicians will be expected to understand the potential uses and limitations of functional genomics in clinical settings. The purpose of this review is to provide a conceptual overview of functional genomics applied to the practice of cardiovascular medicine. We begin with a review of commonly used terms and approaches and then describe examples of their use for screening, diagnosis, and treatment selection in clinical cardiology. We also highlight emerging trends and speculate about where the field is headed in the near term. Although some predictions will be overly optimistic and some major advances unanticipated, we hope this review will help prepare cardiologists for their role in the application of genome science to the diagnosis and treatment of disease. This review is part of a series that introduces several related areas of cardiovascular genetics and genomics.4 We refer to these other contributions to guide further reading …
