Abstract

In languages with diacritic marks, such as French, some textual or terminological resources are still available in electronic form only without diacritics, which hinders their use in natural language interfaces. In a specialized domain such as medicine, some words are often missing from the available electronic lexicons. This raises the issue of accenting unknown words, which is the subject of this work. We propose two internal methods for accenting unknown words. Both learn, from a reference set of accented words, the contexts in which the various accented forms of a given letter occur. One method is adapted from part-of-speech tagging; the other is based on finite-state transducers. We report experimental results for the letter e on the French version of the Medical Subject Headings thesaurus. With the best training set, the tagging method obtains a precision-recall breakeven point of 84.2±4.4% and the transducer method 83.8±4.5% (against a baseline of 64%) for the unknown words that contain this letter. A consensus combination of the two increases precision to 92.0±3.7% at a recall of 75%. We perform an error analysis and discuss further steps that might improve on the current performance.
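
To make the idea of learning letter contexts concrete, the following is a minimal Python sketch, not the tagging or transducer method evaluated in this work. It assumes a much simpler stand-in: a frequency table of the accented forms of e indexed by the k surrounding unaccented letters, with backoff to shorter contexts and finally to the overall majority form. The function names (`strip_accents`, `contexts`, `train`, `accent_e`, `consensus`), the context width, and the toy training words are all hypothetical. The `consensus` helper illustrates one plausible reading of a consensus combination, namely keeping a proposal only when two methods agree and abstaining otherwise, which would trade recall for precision.

```python
import unicodedata
from collections import Counter, defaultdict

E_FORMS = "eéèêë"  # accented variants of e considered here (assumption)

def strip_accents(s):
    # Decompose characters and drop combining marks (é -> e, ç -> c).
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def contexts(word, i, k):
    # k letters to the left and right of position i, padded with '#'.
    padded = "#" * k + word + "#" * k
    j = i + k
    return padded[j - k:j], padded[j + 1:j + 1 + k]

def train(accented_words, k=2):
    # Count which form of e occurs in each unaccented context,
    # for every context width from 0 (no context) up to k.
    table = defaultdict(Counter)
    for w in accented_words:
        w = unicodedata.normalize("NFC", w)
        plain = strip_accents(w)
        for i, ch in enumerate(w):
            if ch in E_FORMS:
                for width in range(k + 1):
                    table[contexts(plain, i, width)][ch] += 1
    return table

def accent_e(word, table, k=2):
    # Replace each unaccented e by the most frequent form seen in the
    # widest matching context, backing off to narrower contexts.
    out = []
    for i, ch in enumerate(word):
        if ch != "e":
            out.append(ch)
            continue
        for width in range(k, -1, -1):
            counts = table.get(contexts(word, i, width))
            if counts:
                out.append(counts.most_common(1)[0][0])
                break
        else:
            out.append(ch)  # nothing learned at all: leave unchanged
    return "".join(out)

def consensus(proposal_a, proposal_b):
    # Keep a proposal only when both methods agree; otherwise abstain.
    return proposal_a if proposal_a == proposal_b else None

# Toy reference set (hypothetical, far too small for real use).
table = train(["hépatite", "hépatique", "cellule", "tumeur", "muscle"])
print(accent_e("hepatome", table))
```

The width-0 context plays the role of a majority baseline: when no learned context matches, the most frequent form of e in the training set is used.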
