Abstract

Ensuring the quality of a medical thesaurus is a non-trivial task because of the inherent complexity of medical terminology. The peculiarities of the medical sublanguage and the subjectivity of lexicographers' choices complicate the thesaurus construction process. Our experience is based on the MorphoSaurus lexicon, the basis of a biomedical cross-language indexing and retrieval system. We describe two complementary maintenance approaches: i) corpus-based error detection and ii) thesaurus anomaly detection. These techniques were developed to detect so-called dynamic and static errors, which lexicographers introduce during the construction and maintenance process. In multilingual parallel corpora, the distribution of semantic identifiers should be similar across related texts in different languages. The first approach therefore identifies the semantic identifiers whose frequencies vary most between paired texts. A manual review of these candidates is intended to reveal content errors, which the lexicographers then classify and fix. The second approach analyses transaction-based anomalies, which are identified by interpreting the log of lexicographers' actions during thesaurus maintenance. This methodology highlights the four most common types of such anomalies and evaluates the effectiveness of the corpus-based detection techniques. The overall quality improvement of the thesaurus was evaluated using the OHSUMED IR benchmark.
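
The corpus-based idea can be illustrated with a minimal sketch. It assumes each text of a parallel pair has already been mapped to a sequence of semantic identifiers, and it ranks identifiers by the summed absolute difference of their relative frequencies across pairs. The function names, the identifier strings, and the deviation measure itself are assumptions made for illustration; the abstract does not specify the measure actually used.

```python
from collections import Counter


def relative_frequencies(identifiers):
    """Map each semantic identifier to its share of the given sequence."""
    counts = Counter(identifiers)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {sid: n / total for sid, n in counts.items()}


def rank_frequency_deviations(text_pairs, top_n=10):
    """Rank identifiers by summed absolute frequency difference over all pairs.

    text_pairs: iterable of (ids_lang_a, ids_lang_b), where each element is
    the identifier sequence of one text in a parallel pair (hypothetical
    input format).
    """
    deviation = Counter()
    for ids_a, ids_b in text_pairs:
        freq_a = relative_frequencies(ids_a)
        freq_b = relative_frequencies(ids_b)
        # An identifier missing from one side counts as frequency 0 there.
        for sid in freq_a.keys() | freq_b.keys():
            deviation[sid] += abs(freq_a.get(sid, 0.0) - freq_b.get(sid, 0.0))
    return deviation.most_common(top_n)


# Illustrative identifiers only; real ones come from the lexicon.
pairs = [
    (["#heart", "#heart", "#heart", "#inflam"],
     ["#heart", "#inflam", "#inflam", "#muscle"]),
]
candidates = rank_frequency_deviations(pairs)
```

The highest-ranked identifiers are only candidates for manual review; a large deviation may also reflect a legitimate translation difference rather than a lexicon error.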
