Information Retrieval Analysis Research Articles

Clinical drug–drug interactions (DDIs) are a major cause of both medical errors and adverse drug events (ADEs). The published literature on DDI clinical toxicity continues to grow significantly, and high-performance DDI information retrieval (IR) text mining methods are in high demand. The effectiveness of IR and its underlying machine learning (ML) algorithms depends on the availability of a large amount of training and validation data that have been manually reviewed and annotated. In this study, we investigated how active learning (AL) might improve ML performance in clinical safety DDI IR analysis. We recognized that a direct application of AL would not address several primary challenges in DDI IR from the literature. For instance, the vast majority of abstracts in PubMed will be negative, existing positive and negative labeled samples do not represent the general sample distributions, and potentially biased samples may arise during uncertainty sampling in an AL algorithm. Therefore, we developed several novel sampling and ML schemes to improve AL performance in DDI IR analysis. In particular, random negative sampling was added as a part of AL since it incurs no manual labeling expense. We also used two ML algorithms in an AL process to differentiate random negative samples from manually labeled negative samples, and updated both the training and validation samples during the AL process to avoid or reduce biased sampling. Two supervised ML algorithms, support vector machine (SVM) and logistic regression (LR), were used to investigate the consistency of our proposed AL algorithm. Because the ultimate goal of clinical safety DDI IR is to retrieve all DDI toxicity–relevant abstracts, a recall rate of 0.99 was set in developing the AL methods.
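As a rough illustration of how uncertainty sampling and random negative sampling might be combined in one AL round, the following is a minimal sketch, not the authors' implementation. The function name `select_batch`, its parameters, the 0.5 decision point, and the choice to draw random negatives from the unlabeled pool are all assumptions made for this example.

```python
import random


def select_batch(scores, n_uncertain, n_random_neg, seed=0):
    """Pick abstracts for one active-learning round (hypothetical helper).

    scores: dict mapping an abstract id to the model's predicted
            probability that the abstract is DDI-relevant.
    Returns (to_label, random_negatives).
    """
    rng = random.Random(seed)

    # Uncertainty sampling: the abstracts whose score is closest to 0.5
    # are sent for manual review.
    by_uncertainty = sorted(scores, key=lambda i: abs(scores[i] - 0.5))
    to_label = by_uncertainty[:n_uncertain]

    # Random negative sampling (assumption): draw from the remaining pool
    # and treat the draws as negatives without manual review, relying on
    # the fact that most abstracts in the pool are negative.
    chosen = set(to_label)
    remaining = sorted(i for i in scores if i not in chosen)
    k = min(n_random_neg, len(remaining))
    random_negatives = rng.sample(remaining, k)
    return to_label, random_negatives


pool_scores = {"a": 0.95, "b": 0.52, "c": 0.10, "d": 0.45, "e": 0.02}
to_label, random_negatives = select_batch(pool_scores, 2, 2)
print(to_label)  # the two most uncertain abstracts
```

Keeping the two groups separate matters here because the abstract treats random negatives and manually labeled negatives as distinct sample types during training and validation.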
When we used our newly proposed AL method with SVM, the precision in differentiating positive samples from manually labeled negative samples improved from 0.45 in the first round to 0.83 in the second round, and the precision in differentiating positive samples from random negative samples improved from 0.70 in the first round to 0.82 in the second. When our proposed AL method was used with LR, the improvements in precision followed a similar trend. However, the other AL algorithms tested did not show improved precision, largely because of biased samples caused by uncertainty sampling or by differences between the training and validation data sets.
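The precision figures above are reported at a fixed recall of 0.99. One way to compute precision at a fixed recall from classifier scores is sketched below, under stated assumptions: the function name `precision_at_recall` is hypothetical, labels are 0 or 1, and tied scores are not given any special treatment.

```python
def precision_at_recall(labels, scores, target_recall=0.99):
    """Precision at the shortest ranked prefix that reaches target_recall.

    labels: sequence of 0/1 ground-truth labels (1 = DDI-relevant).
    scores: classifier scores, higher meaning more likely relevant.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")

    # Rank abstracts from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)

    true_pos = 0
    for k, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        # Stop at the first cutoff whose recall meets the target.
        if true_pos / total_pos >= target_recall:
            return true_pos / k
    raise ValueError("target_recall must not exceed 1.0")


labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
print(precision_at_recall(labels, scores))  # 0.75
```

In this toy example all three positives must be retrieved to reach a recall of 0.99, which requires keeping the top four abstracts, one of which is negative.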


We have designed and implemented a software system, named PhyloID™, that can be used to detect putative adventitious agents in biological samples characterized by next-generation sequencing. PhyloID is run in two steps, each being a self-contained automated process amenable to GMP validation. The first module, MiLY, is responsible for assembling individual sequence reads into contigs, and annotating all sequences with a unique sequence identifier, the number of reads in each contig, and the length of the sequence. The trimmed, assembled and annotated data are then processed by PhyloID's second module, NGmapper. NGmapper takes the FASTA-formatted output from MiLY and identifies the taxonomic origins of the contigs and singletons therein. It compares each sequence's BLASTN hit profile against the patterns of evolutionary relationships described within phylogenomic distance matrices for all of the various taxonomic groups, in order to find the best fit. NGmapper then produces lists of taxonomic assignments in both summarized and detailed form, and tree files for viewing results graphically. We optimized PhyloID's parameters and measured its performance using simulated metagenomic data and subsets of the reference phylogenies. PhyloID's precision and recall in identifying simulated sequences were measured by information retrieval analysis, focusing on read length, read number, sequence accuracy, background complexity, taxonomy and reference data coverage. We found PhyloID to be highly accurate and quantitative in its taxonomic mapping of sequences, with excellent precision, sensitivity and robustness. The degree of taxonomic representation in public databases remains an issue, as expected, for any sequence classifier, but community sequencing efforts are poised to overcome this problem.
To illustrate real-world usage of the application, we also describe some simple spike-recovery experiments as well as a multi-site comparative characterization of a viral suspension. These data help to illustrate, corroborate, and extend the results obtained using simulated data.

To address gaps in the detection of contaminating viruses and microorganisms in vaccines and other biologicals, manufacturers are exploring the use of new technologies that promise greater sensitivity and breadth of coverage. One challenge in implementing such new methods is the complexity of analyzing the "big data" generated by these new instruments: hundreds of millions of sequence reads (segments of genetic material from viruses and cells) need to be compared against a vast and growing number of entries in genetic databases in order to arrive at a confident identification. These large-scale analyses must furthermore be carried out within the strict regulatory environment that governs the industry. We have developed an automated software pipeline named PhyloID™ that is capable of identifying viruses and microorganisms from large-scale sequence data. Using simulated data as well as real samples, we show that PhyloID is both sensitive and accurate in identifying any type of potential contaminant. Such a powerful new assay will be an important addition to the adventitious agent testing package, providing further assurance about product safety.



Articles published on Information Retrieval Analysis

Multiple sampling schemes and deep learning improve active learning performance in drug-drug interaction information retrieval analysis from the literature

Integrated Random Negative Sampling and Uncertainty Sampling in Active Learning Improve Clinical Drug Safety Drug–Drug Interaction Information Retrieval

Net ergonomics: information retrieval in health care domains.

Cataloguing the Taxonomic Origins of Sequences from a Heterogeneous Sample Using Phylogenomics: Applications in Adventitious Agent Detection

User Profiling on a Pilot Digital Library with the Final Result of a New Adaptive Knowledge Management Solution

The cost‐effectiveness analysis of information retrieval and dissemination systems
