Finding relevant data in the biomedical literature can sometimes be difficult. To select a few neuroscience examples, suppose you would like to know (a) which serotonin receptor subunits are expressed in dentate gyrus mossy cells; (b) the average volume of the amygdala in adult male chimpanzees; (c) whether there is an EEG signature of Creutzfeldt-Jakob disease; or (d) which studies report bilateral fMRI activity in Brodmann area 38. These are fairly simple questions, for which standard search engines should fare relatively well. Yet even for these kinds of questions, securing the relevant information takes much longer than googling the local weather forecast for the weekend or tomorrow’s commuter train schedule. The actual data required to build biologically realistic computational models are often more detailed: what is the time constant of the excitatory synaptic current between a specified pair of neuron types? Finding the answer in this case might require many hours or even days of queries over multiple search engines. Most importantly, the results of these queries must typically be followed by at least a cursory reading of dozens of papers. When the graduate student triumphantly brings the needed reference to the lab, the adviser might mumble without lifting their eyes from the keyboard, “that’s in young animals, and it was recorded at room temperature”. Another major limitation is that, until and unless a definitive answer is found, it is usually impossible to know whether the information is available or not. In other words, existing biomedical search engines are ill-equipped to inform users that something is not yet known. Standard search engines such as PubMed are less than ideal for identifying data, because they are ultimately based on matching strings or concepts that appear in the title, abstract, and keywords. These texts, however, are written with narrow scientific agendas in mind.
The authors cannot possibly provide a list of keywords that would encompass all research projects for which some data in their articles might be relevant. If the topic of a report is the molecular phenotyping of a new genetic model of schizophrenia, the technical details of the deconvolution algorithm used to deblur the optical micrographs would be nearly impossible to pick up through keyword searches. Could we devise a procedure to interrogate the scientific literature so as to extract, accurately and efficiently, most if not all of the relevant data? Is there a literature-mining protocol that could give us the confidence that, if the query returns a blank, the sought data are not yet available? Although common to all of biomedical science, this issue is particularly critical in neuroscience because of its unmatched diversity of dimensions, scales, questions, approaches, and techniques. Thus, effective tagging of publications with relevant metadata remains an outstanding neuroinformatics challenge. Full-text searches provide half of the solution, in that they eliminate many of the issues related to false negatives. Many of the terms that help identify relevant data, for example, appear in the Materials and Methods sections of published articles rather than in their titles, abstracts, and keywords. Search engines that scan the entire main text of publications include early visionary projects such as Textpresso, which started within the limited domain of C. elegans, then expanded to a