An ontology for very large numbers of longitudinal health records to facilitate data mining and machine learning

In the authors' extensive industry experience with a variety of electronic health records, each of which worked well in its intended context, none currently available in reasonably large numbers has an ontology and format that will scale well to very large numbers of detailed cradle-to-grave longitudinal health records and so facilitate knowledge extraction. By that we mean data mining, Deep Learning neural nets, and all related analytic and predictive methods for biomedical research and clinical decision support, potentially applied to the health records of an entire nation. Most existing formats are far too complicated to support frequent high-dimensional analysis, which is required because such records will (or should) update dynamically on a regular basis, will in future include new tests and measurements acquired daily by translational medical research, and must, not least, allow public health, research, and diagnostic, vaccine, and drug development teams to respond quickly to emergent epidemics such as COVID-19. A 2010 call by a Presidential Advisory team for interoperability and ease of data mining of medical records is discussed; the situation appears still not fully resolved. The solution appears to lie between efficient comma-separated-value files and the ability to embellish these with a moderate degree of more elaborate ontology. One recommendation is made here, with discussion and analysis that should guide alternative and future approaches. It combines demographic, comorbidity, genomic, diagnostic, interventional, and outcomes information with a time/date-stamping method appropriate to analysis, and with facilities for special research studies. By using a "metadata operator", a suitable balance between a comma-separated-values file and an ontological structure is possible.
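
As a purely illustrative sketch of the kind of balance described (the tag syntax, field names, and codes below are hypothetical, not the paper's actual metadata-operator or Q-UEL specification), a comma-separated-values record embellished with a modest amount of ontology might be handled roughly as follows:

```python
# Minimal sketch, assuming a hypothetical scheme in which each CSV column label
# can carry an optional metadata "operator" of the form name{key=value;...}.
# None of the field names, units, or codes below come from the paper; they are
# placeholders for demographic, test, and outcome information.
import csv
import io
import re

HEADER = 'patient_id,dob{format=ISO8601},hba1c{unit=%;loinc=4548-4},outcome{coding=ICD-10}'
ROW = 'P000123,1957-04-02,7.9,E11.9'

META = re.compile(r'^(?P<name>\w+)(\{(?P<meta>[^}]*)\})?$')

def parse_header(header_line):
    """Split each column label into a bare name plus a metadata dictionary."""
    columns = []
    for label in header_line.split(','):
        m = META.match(label.strip())
        meta = dict(item.split('=', 1) for item in (m.group('meta') or '').split(';') if item)
        columns.append((m.group('name'), meta))
    return columns

def parse_row(columns, row_line):
    """Pair each CSV value with its column name and metadata."""
    values = next(csv.reader(io.StringIO(row_line)))
    return [{'name': n, 'value': v, 'meta': meta} for (n, meta), v in zip(columns, values)]

if __name__ == '__main__':
    for field in parse_row(parse_header(HEADER), ROW):
        print(field)
```

The design intent in such a scheme is that the file remains readable by any ordinary CSV tool, while analysis code that understands the metadata can attach units, coding systems, and time/date semantics without a heavyweight ontology.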

Open Access
Using actor maps and AcciMaps for road safety investigations: Development of taxonomies and meta-analyses

The number of road collisions has plateaued over the past decade, both in the UK and worldwide. One problem is that most road safety investigations focus on the immediate, sharp-end factors rather than delving into the deeper, blunt-end, systems factors. It is argued that the contributions of both types of factor need to be understood before further reductions in road collisions can be made. The study reported in this paper developed taxonomies for Actor Maps and AcciMaps from 37 road collision investigation reports undertaken in the UK. The meta-analysis of the Actor Maps showed that relatively few categories of actors (35 of 256) are associated with the majority of collisions. Similarly, the meta-analysis of the AcciMaps showed that all of the 1656 actions, events and decisions (or lack thereof) could be placed into 19 categories. Across the eight AcciMap levels there were 11 categories that appeared most frequently. Together, these taxonomies and the meta-analysis enabled a summary of the analysis and the derivation of interventions at a national level. The study also points toward a common contributory (and protective) network for road collisions. The meta-analysis showed that the sharp end accounted for approximately 40% of the factors (those normally investigated in collisions) whilst the blunt end accounted for approximately 60%. These blunt-end factors are often not addressed in traditional police-led collision investigations, which typically focus on establishing individual criminal culpability. Any future road collision investigation that aims to identify no-blame safety learning should seek to understand blunt-end factors, as they create the preconditions for incidents to occur.

De novo protein folding on computers. Benefits and challenges

There has been recent success in the prediction of the three-dimensional folded native structures of proteins, most famously by the AlphaFold algorithm from Google/Alphabet's DeepMind. However, this largely involves machine learning from known protein structures and is not a de novo method that predicts three-dimensional structures from amino acid residue sequences alone. A de novo approach would be based almost entirely on the general principles of energy and entropy that govern protein folding energetics and, importantly, would do so without using the amino acid sequences and structural features of other proteins. Most consider that problem still unsolved, even though it has occupied leading scientists for decades, and many consider it one of the major outstanding issues in modern science. There is crucial continuing help from experimental findings on protein unfolding and refolding in the laboratory, but only to a limited extent, because many researchers consider that the speed with which real proteins fold, often in milliseconds to minutes, is itself still not fully understood. This is unfortunate, because a practical solution to the problem would probably have a major effect on personalized medicine, the pharmaceutical industry, biotechnology, and nanotechnology, including "smaller" tasks such as better modeling of the flexible "unfolded" regions of the SARS-CoV-2 spike glycoprotein when interacting with its cell receptor, antibodies, and therapeutic agents. Some important ideas from earlier studies are given before moving on to lessons from periodic and aperiodic crystals and a possible role for quantum phenomena. The conclusion is that better computation of entropy should be the priority, though that conclusion is presented guardedly.
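
As a rough illustration of what "computation of entropy" means in this setting (a toy Boltzmann ensemble with invented energies, not the paper's method), the conformational entropy of a discrete set of sampled states can be estimated as S = -R Σ p_i ln p_i:

```python
# Minimal sketch: molar conformational entropy of a toy ensemble of conformers.
# The energies below are invented for illustration only.
import math

R = 0.0083145  # gas constant (Boltzmann's constant per mole) in kJ/(mol*K)
T = 298.0      # temperature in kelvin

energies = [0.0, 1.2, 2.5, 2.6, 4.0]  # hypothetical conformer energies, kJ/mol

# Boltzmann weights and their normalisation (the partition function).
weights = [math.exp(-e / (R * T)) for e in energies]
z = sum(weights)
probs = [w / z for w in weights]

# Entropy of the discrete ensemble, S = -R * sum(p * ln p).
entropy = -R * sum(p * math.log(p) for p in probs)
print(f"S = {entropy:.4f} kJ/(mol*K), -T*S = {-T * entropy:.2f} kJ/mol")
```

The hard part alluded to in the abstract is not this arithmetic but obtaining adequate sampling of the astronomically large conformational space so that the probabilities are meaningful.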

Searching for the principles of a less artificial A.I.

What would it take to build a computer physician that can take its place amongst human peers? Currently, Neural Nets, especially so-called "Deep Learning" nets, dominate what is popularly called "Artificial Intelligence", but to many critics they seem little more than powerful data-analytic tools inspired by some of the more basic functions and regions of the human brain, such as those involved in the early stages of biological vision, classification, and categorization. The deeper nature of human intelligence as the term is normally meant, including its relation to consciousness, has been the domain of philosophers, psychologists, and some neuroscientists. Now, attention is turning to neuronal mechanisms in humans and simpler organisms as the basis of a truer AI with far greater potential. Arguably, the approach required should be rooted in information theory and algorithmic science. But, as discussed in this paper, caution is required: "just any old information" might not do. The information might need to be of a particular dynamical and actioning nature, and that might significantly affect the kind of computation and computer hardware required. Overall, however, the authors do not favor appeals to emergent properties such as those based on complexity and quantum effects. Despite the possible difficulties, such studies could, in return, have substantial benefits for biology and medicine beyond the computational tools that they produce to serve those disciplines.

Open Access
Mining real-world high dimensional structured data in medicine and its use in decision support. Some different perspectives on unknowns, interdependency, and distinguishability

There are many difficulties in extracting and using knowledge for medical analytic and predictive purposes from Real-World Data, even when the data is already well structured in the manner of a large spreadsheet. Preparative curation and standardization or "normalization" of such data involve a variety of chores, but underlying them is an interrelated set of fundamental problems that can in part be dealt with automatically during the datamining and inference processes. These fundamental problems are reviewed here, illustrated, and investigated with examples. They concern the treatment of unknowns, the need to avoid independence assumptions, and the appearance of entries that may not be fully distinguishable from each other. Unknowns include errors detected as implausible (e.g., out-of-range) values, which are subsequently converted to unknowns. These problems are further impacted by high dimensionality and by the sparse-data problems that inevitably arise in high-dimensional datamining even when the data is extensive. All these considerations are different aspects of incomplete information, though they also relate to problems that arise if care is not taken to avoid or ameliorate the consequences of including the same information twice or more, or of combining misleading or inconsistent information. This paper addresses these aspects from a slightly different perspective, using the Q-UEL language and inference methods based on it, borrowing some ideas from the mathematics of quantum mechanics and information theory. It takes the view that the probabilistic elements of knowledge subsequently used in inference need only be tested and corrected so that they satisfy certain extended notions of coherence between probabilities. This is by no means the only possible view; it is explored here and later compared with a related notion of consistency.
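
To make two of the curation ideas concrete (a minimal sketch, not the Q-UEL implementation; the plausible ranges and probabilities are invented, and the Fréchet bounds shown are a simple rather than "extended" notion of coherence), implausible values can be converted to unknowns and probability estimates checked against basic coherence bounds:

```python
# Minimal sketch: (a) convert implausible out-of-range values to unknowns, and
# (b) test a pair of marginal probabilities and their joint against the Frechet
# bounds, max(0, P(A)+P(B)-1) <= P(A,B) <= min(P(A), P(B)).
# All ranges and numbers below are invented for illustration.

PLAUSIBLE_RANGE = {"systolic_bp": (60.0, 260.0), "hba1c": (3.0, 20.0)}

def to_unknown_if_implausible(name, value):
    """Return None (unknown) when a recorded value is outside its plausible range."""
    lo, hi = PLAUSIBLE_RANGE[name]
    return value if lo <= value <= hi else None

def frechet_coherent(p_a, p_b, p_ab, tol=1e-9):
    """Check that the joint probability lies within the Frechet bounds."""
    lower = max(0.0, p_a + p_b - 1.0)
    upper = min(p_a, p_b)
    return lower - tol <= p_ab <= upper + tol

if __name__ == "__main__":
    print(to_unknown_if_implausible("systolic_bp", 400.0))  # None: implausible
    print(to_unknown_if_implausible("hba1c", 7.9))          # 7.9: kept
    print(frechet_coherent(0.30, 0.50, 0.25))               # True
    print(frechet_coherent(0.30, 0.50, 0.45))               # False: exceeds min(P(A), P(B))
```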

Computers and preventative diagnosis. A survey with bioinformatics examples of mitochondrial small open reading frame peptides as portents of a new generation of powerful biomarkers

The present brief survey aims to alert developers in datamining, machine learning, inference methods, and other approaches relevant to diagnostic, predictive, and risk-assessment medicine to a relatively new class of bioactive messaging peptides in which there is escalating interest. They provide patterns of communication and cross-chatter about states of health and disease within and, importantly, between cells (they also appear extracellularly in biological fluids). This chatter needs to be analyzed somewhat in the manner of the decryption of the Enigma code in the Second World War. It could lead not only to improved diagnosis but to predictive diagnosis, prediction of organ failure, and preventative medicine. The peptides concerned are products of small open reading frames that have previously been somewhat neglected as unlikely gene products, with probably many examples in nuclear DNA and certainly several known in mitochondrial DNA. A great deal of knowledge is now becoming available about the latter, and it is believed that their mRNA can be translated by both the standard cytosolic and the mitochondrial genetic codes, resulting in different peptides. This adds a further level of complexity to the applications of bioinformatics and computational biology, but a higher level of detail and sophistication to preventative diagnosis. The code to crack could be sophisticated and combinatorially complex to analyze by computer. Mitochondria may have combined with proto-eukaryotic cells some 2 billion years ago, only about a seventh of the age of the universe; cells appeared some 2 billion years before that, also with possible signaling based on similar ideas. This makes life small in space but huge in time, the refinement of which centrally involves these signaling processes.
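
To illustrate the dual-code translation point (a minimal sketch with an invented toy sequence; only the codons it contains are tabled here, and real work would use the full NCBI translation tables 1 and 2), the same open reading frame yields different peptides under the cytosolic and vertebrate mitochondrial codes:

```python
# Minimal sketch: translating the same toy open reading frame with the standard
# code (NCBI table 1) and the vertebrate mitochondrial code (NCBI table 2).
# Only the codons occurring in the toy sequence are included below.

STANDARD = {"ATG": "M", "ATA": "I", "TGA": "*", "GCA": "A", "AGA": "R", "TAA": "*"}

# Vertebrate mitochondrial differences: ATA -> Met, TGA -> Trp, AGA/AGG -> Stop.
MITOCHONDRIAL = {**STANDARD, "ATA": "M", "TGA": "W", "AGA": "*"}

def translate(orf, code):
    """Translate codon by codon, stopping at the first stop codon ('*')."""
    peptide = []
    for i in range(0, len(orf) - 2, 3):
        aa = code[orf[i:i + 3]]
        if aa == "*":
            break
        peptide.append(aa)
    return "".join(peptide)

if __name__ == "__main__":
    orf = "ATGATATGAGCAAGATAA"  # invented toy sequence, not a real sORF
    print("cytosolic     :", translate(orf, STANDARD))       # MI   (TGA read as stop)
    print("mitochondrial :", translate(orf, MITOCHONDRIAL))  # MMWA (TGA read as Trp)
```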

Testing machine learning techniques for general application by using protein secondary structure prediction. A brief survey with studies of pitfalls and benefits using a simple progressive learning approach

Many researchers have recently used the prediction of protein secondary structure (the local conformational states of amino acid residues) to test advances in predictive and machine learning technology such as Neural Net Deep Learning. Protein secondary structure prediction continues to be a helpful tool in research in biomedicine and the life sciences, but it is also extremely enticing for testing predictive methods, such as neural nets, that are intended for different or more general purposes. A complication is highlighted here for researchers testing their methods for other applications. Modern protein databases inevitably contain important, though often obscure, clues to the answer, so-called "strong buried clues", and they are hard to avoid. This is because most proteins, or parts of proteins, in a modern protein database are related to others by biological evolution. For researchers developing machine learning and predictive methods, this can overstate the apparent quality of a predictive method and so confuse assessment of its true quality. However, for researchers using the algorithms as tools, understanding strong buried clues is of great value, because they need to make maximum use of all available information. A simple method, related to the GOR methods but with some features of neural nets in the sense of progressive learning of large numbers of weights, is used to explore this. It can acquire tens of millions of weights, and hence gigabytes of them, but they are learned stably by exhaustive sampling. The significance of the findings is discussed in the light of promising recent results from AlphaFold, developed by Google's DeepMind.
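
As a hedged sketch of the kind of GOR-related, progressively learned scoring the abstract describes (not the authors' actual program; the window size, smoothing, and the "training" sequences with their H/E/C state strings are invented for illustration), window counts can be accumulated as weights and positions scored by summed log-odds:

```python
# Minimal GOR-I-style information-theoretic predictor: for each secondary-
# structure state s and each (offset, residue) pair in a window, accumulate
# counts, then score a central residue by sum over the window of
# log[ P(pair | s) / P(pair) ] and take the best-scoring state.
import math
from collections import defaultdict

WINDOW = 2  # residues either side of the position being predicted

pair_given_state = defaultdict(lambda: defaultdict(int))  # state -> (offset, aa) -> count
pair_total = defaultdict(int)                             # (offset, aa) -> count
state_total = defaultdict(int)                            # state -> count

def learn(sequence, states):
    """Progressively accumulate window counts (the 'weights') from one example."""
    for i, s in enumerate(states):
        state_total[s] += 1
        for off in range(-WINDOW, WINDOW + 1):
            if 0 <= i + off < len(sequence):
                key = (off, sequence[i + off])
                pair_given_state[s][key] += 1
                pair_total[key] += 1

def predict(sequence, i):
    """Return the state with the highest summed log-odds score at position i."""
    n = sum(state_total.values())
    best = None
    for s, total in state_total.items():
        score = 0.0
        for off in range(-WINDOW, WINDOW + 1):
            if 0 <= i + off < len(sequence):
                key = (off, sequence[i + off])
                p_pair_s = (pair_given_state[s][key] + 1) / (total + 20)  # Laplace-style smoothing
                p_pair = (pair_total[key] + 1) / (n + 20)
                score += math.log(p_pair_s / p_pair)
        if best is None or score > best[1]:
            best = (s, score)
    return best[0]

if __name__ == "__main__":
    learn("MKVLAAGELK", "CCEEEHHHCC")  # invented example, not real data
    learn("AVKLGEHAKV", "CEEECHHHHC")
    print(predict("MKVLG", 2))         # predicted state at the central residue
```

The "strong buried clues" issue arises because such counts, and equally the weights of a neural net, silently absorb evolutionary relatedness between training and test proteins unless homologous sequences are rigorously separated.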
