Abstract
Background: Recent years have seen an explosion in the availability of data in the chemistry domain. With this information explosion, however, retrieving relevant results from the available information, and organising those results, become even harder problems. Computational processing is essential to filter and organise the available resources so as to better facilitate the work of scientists. Ontologies encode expert domain knowledge in a hierarchically organised machine-processable format. One such ontology for the chemical domain is ChEBI. ChEBI provides a classification of chemicals based on their structural features and a role or activity-based classification. An example of a structure-based class is 'pentacyclic compound' (compounds containing five-ring structures), while an example of a role-based class is 'analgesic', since many different chemicals can act as analgesics without sharing structural features. Structure-based classification in chemistry exploits elegant regularities and symmetries in the underlying chemical domain. As yet, there has been neither a systematic analysis of the types of structural classification in use in chemistry nor a comparison to the capabilities of available technologies.
Results: We analyze the different categories of structural classes in chemistry, presenting a list of patterns for features found in class definitions. We compare these patterns of class definition to tools which allow for automation of hierarchy construction within cheminformatics and within logic-based ontology technology, going into detail in the latter case with respect to the expressive capabilities of the Web Ontology Language and recent extensions for modelling structured objects. Finally, we discuss the relationships and interactions between cheminformatics approaches and logic-based approaches.
Conclusion: Systems that perform intelligent reasoning tasks on chemistry data require a diverse set of underlying computational utilities including algorithmic, statistical and logic-based tools. For the task of automatic structure-based classification of chemical entities, essential to managing the vast swathes of chemical data being brought online, systems which are capable of hybrid reasoning combining several different approaches are crucial. We provide a thorough review of the available tools and methodologies, and identify areas of open research.
Highlights
Recent years have seen an explosion in the availability of data in the chemistry domain
Logic-based knowledge representation can be contrasted with algorithmic ‘knowledge representation’, in which software algorithms procedurally define outputs based on stated inputs, and with statistical ‘knowledge representation’, in which complex statistical models are trained to produce outputs based on a given set of inputs by learning weights for a complex set of internal parameters
Analysis of structural features used in class definitions: By examination of the definitions of higher-level structural classes included in the Chemical Entities of Biological Interest (ChEBI) ontology, we have identified the following categories of elementary features used in structural chemical class definitions: 1. Interesting parts (IP), such as the carboxy group or the cholestane scaffold; 2. …
Summary
Recent years have seen an explosion in the availability of data in the chemistry domain. In biomedicine and the natural sciences more generally, hierarchical organisation and large-scale data management are being facilitated by formal ontologies: machine-understandable encodings of human domain knowledge. Such ontologies are used in several different ways [2,3,4]. They ensure standardisation of terminology and identification across all entities in a domain so that multiple sources of data can be aggregated through comparable reference terms. They provide hierarchical organisation so that such aggregation can be performed at different levels for novel data-driven scientific discovery. An advantage of logic-based knowledge representation is that it allows the knowledge to be explicitly expressed as knowledge, i.e. as statements that are comprehensible, true and self-contained, and available for modification by persons without a computational background such as domain experts; this is in contrast to statistical methods that operate as black boxes and to procedural methods that require a programmer in order to manipulate or extend them.
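The idea of aggregating data at different levels of a hierarchy can be illustrated with a small sketch. This is not taken from ChEBI or from any ontology toolkit: the `IS_A` table, the entity names `compound_A` and `compound_B`, the record values, and the helper functions `ancestors` and `aggregate` are all hypothetical, chosen only to show how explicitly stated is_a links let counts attached to specific entities be rolled up to more general classes.

```python
from collections import defaultdict

# Hypothetical is_a links (child -> list of parents); a toy stand-in
# for an ontology hierarchy, not actual ChEBI content.
IS_A = {
    "compound_A": ["pentacyclic compound"],
    "compound_B": ["tricyclic compound"],
    "pentacyclic compound": ["polycyclic compound"],
    "tricyclic compound": ["polycyclic compound"],
    "polycyclic compound": ["cyclic compound"],
}


def ancestors(term, is_a):
    """Return every class reachable from `term` by following is_a links."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


def aggregate(records, is_a):
    """Sum each record's value at its own term and at all ancestor classes."""
    totals = defaultdict(int)
    for term, value in records:
        for cls in {term} | ancestors(term, is_a):
            totals[cls] += value
    return dict(totals)


# Hypothetical data: (entity, number of observations).
records = [("compound_A", 3), ("compound_B", 2)]
totals = aggregate(records, IS_A)
```

Here `totals` holds the observations for each entity and for every class above it, so the same data can be read at the level of 'pentacyclic compound' or at the broader level of 'polycyclic compound'. Because the hierarchy is a plain table of statements, a domain expert could extend it without touching the aggregation code.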