Abstract
BackgroundDiscovering gene interactions and their characterizations from biological text collections is a crucial issue in bioinformatics. Indeed, text collections are large and it is very difficult for biologists to fully take benefit from this amount of knowledge. Natural Language Processing (NLP) methods have been applied to extract background knowledge from biomedical texts. Some of existing NLP approaches are based on handcrafted rules and thus are time consuming and often devoted to a specific corpus. Machine learning based NLP methods, give good results but generate outcomes that are not really understandable by a user.ResultsWe take advantage of an hybridization of data mining and natural language processing to propose an original symbolic method to automatically produce patterns conveying gene interactions and their characterizations. Therefore, our method not only allows gene interactions but also semantics information on the extracted interactions (e.g., modalities, biological contexts, interaction types) to be detected. Only limited resource is required: the text collection that is used as a training corpus. Our approach gives results comparable to the results given by state-of-the-art methods and is even better for the gene interaction detection in AIMed.ConclusionsExperiments show how our approach enables to discover interactions and their characterizations. To the best of our knowledge, there is few methods that automatically extract the interactions and also associated semantics information. The extracted gene interactions from PubMed are available through a simple web interface at https://bingotexte.greyc.fr/. The software is available at https://bingo2.greyc.fr/?q=node/22.
Highlights
Literature on biology and medicine represents a huge amount of knowledge: more than 24 million publications are currently listed in the PubMed repository [1]
Biologists know from the context if the sentence is about protein or gene
Application: detection and characterization of gene interactions We have evaluated the quality of the sequential patterns found in the previous section as information extraction rules
Summary
Literature on biology and medicine represents a huge amount of knowledge: more than 24 million publications are currently listed in the PubMed repository [1] These text collections are large and it is difficult for biologists to fully take benefit from this incredible amount of knowledge. NLP, and Information Extraction (IE) in particular, aim to provide accurate processing to extract specific knowledge such as named entities (e.g., gene, protein) and relationships between the recognized entities (e.g., gene-gene interactions, biological functions). Databases such as BioGRID [2] or STRING [3] store a large collection of interactions derived from different sources and indicate which gene. Machine learning based NLP methods, give good results but generate outcomes that are not really understandable by a user
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.