Understanding the biotransformation of xenobiotics in the human body is critical for a comprehensive assessment of drug effects, because pharmacologically active drug metabolites may exhibit a range of biological effects that often differ from those of the original pharmaceutical agent. Studies of the biotransformation mechanisms of xenobiotics have resulted in numerous publications. Extracting information about the parent compounds (substrates) and their metabolites from these texts allows retrieval of information on their biological activities, molecular mechanisms of action, and toxicity. Manual curation of the names of xenobiotics, their metabolites, and biotransformation reactions in the text is a challenging task because of the large number of publications on the metabolism of pharmaceutical agents. Our aim is to create an annotated corpus of texts, XenoMet, that can be used for automated extraction of the names of xenobiotics, including pharmaceutical agents that undergo biotransformation, and of their metabolites. To create XenoMet, we automatically extracted relevant texts from PubMed using a query based on MeSH terms. Before manual annotation, the corpus was annotated semiautomatically using a previously developed rule-based method for extracting parent compounds and their metabolites. The names of biotransformation reactions were recognized using a dictionary developed in-house. We then manually verified the extracted data by correcting errors in the named entity annotation, and we identified the associations between substrates and metabolites. We tested the applicability of XenoMet for the reconstruction of a metabolic tree and for the automated extraction of the chemical names of substrates, metabolites, and biotransformation reactions. A conditional random fields classifier trained on XenoMet as the training set recognizes the named entities of metabolites, substrates, and biotransformation reactions with an F1-score of 0.79.
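As an illustration only, the dictionary-based recognition of reaction names and the F1-score used for evaluation can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the term list `REACTION_TERMS`, the helper names `build_pattern`, `find_reactions`, and `f1_score`, and the example sentence are all hypothetical, and the actual in-house dictionary and evaluation protocol are not described in the abstract.

```python
import re

# Hypothetical mini-dictionary; the real in-house dictionary is not given in the text.
REACTION_TERMS = ["hydroxylation", "glucuronidation", "N-demethylation", "sulfation"]


def build_pattern(terms):
    """Compile one case-insensitive pattern matching any dictionary term as a whole word."""
    # Longer terms come first so they are preferred when alternatives overlap.
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    return re.compile(
        r"(?<![A-Za-z0-9])(?:" + alternatives + r")(?![A-Za-z0-9])",
        re.IGNORECASE,
    )


def find_reactions(text, pattern):
    """Return (start, end, matched_text) for every dictionary hit in the text."""
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]


def f1_score(predicted, gold):
    """Harmonic mean of precision and recall over two collections of entities."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    sentence = "Codeine undergoes N-demethylation and glucuronidation in the liver."
    for start, end, name in find_reactions(sentence, build_pattern(REACTION_TERMS)):
        print(start, end, name)
```

Character offsets are returned alongside each match because span-level annotations are what a sequence labeller such as a conditional random fields model would typically be trained and scored against.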