Abstract

When processing written German, it is helpful to use the base form (lemma) of possibly inflected words, such as verbs, nouns, or named entities. However, for German text from the (bio)medical domain, such as discharge letters or entries in electronic medical or health records (EMR, EHR), finding the correct lemma is difficult, for instance because medical language has roots in Latin and Greek. In such cases, stemming techniques may yield inaccurate results. This study demonstrates a machine learning approach for training Apache OpenNLP-based lemmatizer models from publicly available German treebanks. The resulting four "DE-Lemma" models were evaluated against a sample of (bio)medical nouns randomly selected from real-world discharge letters. The most promising DE-Lemma model achieved an accuracy of 88.0% (F1 = .936).
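
The following toy sketch illustrates why suffix-stripping stemmers can miss the lemma of German medical nouns with Latin or Greek roots. It is not the study's method: the suffix list, the function names, and the four hand-picked word pairs are assumptions made purely for illustration, and the small dictionary serves only as a gold standard for scoring.

```python
# Toy illustration (assumption: NOT the study's pipeline or data).
# A naive stemmer strips common German inflection suffixes; the gold
# lemmas below are hand-picked examples of Latin/Greek-rooted nouns.

SUFFIXES = ("en", "e", "n", "s")  # hypothetical, ordered longest first

# Inflected form -> expected lemma (illustrative gold standard)
GOLD_LEMMAS = {
    "Viren": "Virus",
    "Thromben": "Thrombus",
    "Diagnosen": "Diagnose",
    "Karzinome": "Karzinom",
}


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def accuracy(predict, gold: dict) -> float:
    """Share of inflected forms for which predict() returns the gold lemma."""
    hits = sum(1 for form, lemma in gold.items() if predict(form) == lemma)
    return hits / len(gold)


stem_accuracy = accuracy(naive_stem, GOLD_LEMMAS)
```

In this sketch the stemmer recovers only "Karzinom"; forms such as "Viren" are cut to "Vir" rather than mapped to "Virus", because the lemma ending is replaced, not merely extended, by the plural ending.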
