Abstract
BackgroundThe explosive growth of biological data provides opportunities for new statistical and comparative analyses of large information sets, such as alignments comprising tens of thousands of sequences. In such studies, sequence annotations frequently play an essential role, and reliable results depend on metadata quality. However, the semantic heterogeneity and annotation inconsistencies in biological databases greatly increase the complexity of aggregating and cleaning metadata. Manual curation of datasets, traditionally favoured by life scientists, is impractical for studies involving thousands of records. In this study, we investigate quality issues that affect major public databases, and quantify the effectiveness of an automated metadata extraction approach that combines structural and semantic rules. We applied this approach to more than 90,000 influenza A records, to annotate sequences with protein name, virus subtype, isolate, host, geographic origin, and year of isolation.ResultsOver 40,000 annotated Influenza A protein sequences were collected by combining information from more than 90,000 documents from NCBI public databases. Metadata values were automatically extracted, aggregated and reconciled from several document fields by applying user-defined structural rules. For each property, values were recovered from ≥88.8% of records, with accuracy exceeding 96% in most cases. Because of semantic heterogeneity, each property required up to six different structural rules to be combined. Significant quality differences between databases were found: GenBank documents yield values more reliably than documents extracted from GenPept. Using a simple set of semantic rules and a reasoner, we reconstructed relationships between sequences from the same isolate, thus identifying 7640 isolates. Validation of isolate metadata against a simple ontology highlighted more than 400 inconsistencies, leading to over 3,000 property value corrections.ConclusionTo overcome the quality issues inherent in public databases, automated knowledge aggregation with embedded intelligence is needed for large-scale analyses. Our results show that user-controlled intuitive approaches, based on combination of simple rules, can reliably automate various curation tasks, reducing the need for manual corrections to approximately 5% of the records. Emerging semantic technologies possess desirable features to support today's knowledge aggregation tasks, with a potential to bring immediate benefits to this field.
Highlights
The recent availability of high-throughput experimental technologies has produced an explosive growth of biological data
Leveraging on Resource Description Framework (RDF)'s structural flexibility, we developed a reasoning task based on semantic rules, to reconstruct the relationships between sequences and isolates
Yield was defined as the fraction of documents from a given database that produces a value from structural rules, while accuracy was computed as the percentage of extracted values that matches the manually curated property value
Summary
The recent availability of high-throughput experimental technologies has produced an explosive growth of biological data. Because of limited technology resources and programming knowledge, most biologists manage data in an ad-hoc fashion, often involving manual retrieval from Web-based databases, and storage in spreadsheets These information management methods are only practical for small-scale studies; for very large datasets, the time and effort necessary for manual curation would make them prohibitively expensive. The explosive growth of biological data provides opportunities for new statistical and comparative analyses of large information sets, such as alignments comprising tens of thousands of sequences In such studies, sequence annotations frequently play an essential role, and reliable results depend on metadata quality. We investigate quality issues that affect major public databases, and quantify the effectiveness of an automated metadata extraction approach that combines structural and semantic rules We applied this approach to more than 90,000 influenza A records, to annotate sequences with protein name, virus subtype, isolate, host, geographic origin, and year of isolation
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.