Abstract

Multiword expressions (MWEs) are syntactic and/or semantic units of language in which the meaning of the whole is only loosely connected to the meanings of its constituent words. The most prominent property that distinguishes MWEs from random word combinations is recurrence, which is commonly measured by the occurrence frequencies of the MWE and of its constituent words. Although occurrence frequency measures are known to perform best in distinguishing MWEs from random combinations, their performance depends mainly on the quality and size of the data source from which the frequencies are obtained. The main goal of this study is to provide a detailed analysis of how the performance of frequency-based measures changes when the traditional frequency source, a corpus, is replaced with a massive and dynamic data source, the World Wide Web. To use the web as a frequency source, the constituent words and word combinations are submitted as queries to a popular search engine, and the number of results returned for each query is taken as the web-based frequency of the corresponding word or word combination. In this study, web-based frequencies are employed in three different MWE detection experiments on a Turkish data set. In the first group of experiments, the individual performance of 20 well-known frequency metrics in ranking MWE candidates by their tendency to be an MWE is examined. Secondly, the most successful frequency metrics are determined by a feature selection method: filtering. Lastly, MWE detection is treated as a classification problem, and eight supervised methods are applied to show the combined performance of the frequency metrics when the frequencies are obtained from the web. In all experiments, the performance of web-based frequencies in identifying MWEs is compared with that of traditional corpus-based frequencies.
The experimental results show that web-based frequencies yield promising results in the identification of MWEs.
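
As an illustration of how a frequency metric can rank MWE candidates from raw counts, the following is a minimal sketch rather than the method used in the study. It assumes pointwise mutual information (PMI) and the Dice coefficient as example metrics (the abstract does not name the 20 metrics evaluated), and it assumes a `total` parameter standing in for the size of the frequency source, which a search engine does not report directly. The function names and all counts are hypothetical.

```python
import math


def pmi(pair_count, count_x, count_y, total):
    """Pointwise mutual information of a two-word candidate.

    pair_count: frequency of the word combination (e.g. a hit count)
    count_x, count_y: frequencies of the two constituent words
    total: assumed size of the frequency source (hypothetical parameter)
    """
    if min(pair_count, count_x, count_y) <= 0 or total <= 0:
        return float("-inf")  # undefined for zero counts; rank such pairs last
    return math.log2(pair_count * total / (count_x * count_y))


def dice(pair_count, count_x, count_y):
    """Dice coefficient; needs no estimate of the source size."""
    if count_x + count_y <= 0:
        return 0.0
    return 2 * pair_count / (count_x + count_y)


def rank_candidates(candidates, total):
    """Sort candidates by PMI, highest (most MWE-like) first.

    candidates maps a label to (pair_count, count_x, count_y).
    """
    scored = [
        (label, pmi(f_xy, f_x, f_y, total))
        for label, (f_xy, f_x, f_y) in candidates.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical counts, for illustration only (not data from the study).
example = {
    "candidate_a": (50, 100, 100),        # the words mostly occur together
    "candidate_b": (50, 10_000, 10_000),  # the words mostly occur apart
}
ranking = rank_candidates(example, total=1_000_000)
```

The same functions apply whether the counts come from a corpus or from search engine result counts, which is the comparison the study makes. The Dice coefficient is included because it avoids the `total` estimate that PMI requires.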
