Abstract

Dictionaries are not only a source of word meanings but also help readers understand the contexts in which words are used. For this purpose, comprehensive print dictionaries, and more recently online dictionaries, accompany a word with a short example sentence. Lexicographers take great care in eliciting Good Dictionary EXamples (GDEX): sentences that best illustrate a word's definition in a dictionary. The rules for eliciting GDEX are demanding, and applying them manually takes a great deal of time. This paper therefore focuses on two major tasks: the development of labelled corpora for the top 3K English words using a distant supervision approach, and the design of a state-of-the-art, artificial intelligence-based automated procedure for discriminating Good Dictionary EXamples from bad ones. The proposed methodology involves a suite of five machine learning (ML) and five word embedding-based deep learning (DL) architectures. A thorough analysis of the results shows that both ML and DL models can perform GDEX elicitation; however, DL-based models show only a marginal improvement of 3.5% over the conventional ML models. We find that random forests with parts-of-speech information and a word2vec-based bidirectional LSTM are the best-performing ML and DL configurations for automated GDEX elicitation; on the test set, these models achieved balanced accuracies of 73% and 77%, respectively.

Highlights

  • Authors: Muhammad Yaseen Khan, Abdul Qayoom, Muhammad Suffian Nizami, Muhammad Shoaib Siddiqui, Shaukat Wasi, and Syed Muhammad Khaliq-ur-Rahman Raazi

  • This paper focuses on two major tasks: the development of labelled corpora for the top 3K English words using a distant supervision approach, and the design of a state-of-the-art, artificial intelligence-based automated procedure for discriminating Good Dictionary EXamples from bad ones. The proposed methodology involves a suite of five machine learning (ML) and five word embedding-based deep learning (DL) architectures.

  • We find that random forests with parts-of-speech information and a word2vec-based bidirectional long short-term memory (LSTM) network are the best-performing ML and DL configurations for automated Good Dictionary EXamples (GDEX) elicitation; on the test set, these models achieved balanced accuracies of 73% and 77%, respectively.
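Balanced accuracy, the metric quoted in the highlights, is conventionally the mean of the per-class recalls. The sketch below shows that computation on a small invented set of labels; the function name, the label strings, and the data are illustrative assumptions, not taken from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall (the usual definition of balanced accuracy)."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        # Positions of all items whose true label is class c
        idx = [i for i, t in enumerate(y_true) if t == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Invented toy labels, for illustration only
y_true = ["good"] * 4 + ["bad"] * 4
y_pred = ["good", "good", "good", "bad", "bad", "bad", "good", "good"]

score = balanced_accuracy(y_true, y_pred)
print(score)
```

In this toy case the recall is 3/4 for "good" and 2/4 for "bad", so the balanced accuracy is their mean.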


Summary

Literature Review

Researchers have proposed several significant methodologies for the problem under study; however, compared with other classification tasks in NLP, the amount of work on GDEX classification is small. The group used the web corpus of etTenTen; in their approach, they focus on sentence length, word length, the number of subordinate clauses, and keyword position. In another similar study, Uprety and Shakya [14] conducted a test to analyse the effectiveness of context clue sentences among Nepalese students.

The corpus C is a dictionary of key-value pairs: each word w is a key mapped to a list of tuples, and each tuple holds an example sentence Swi together with its thumbs-up votes (Ui) and thumbs-down votes (Di), where the subscript i is the index of the sentence. The dataset for every scoring function is balanced, i.e., each class contains 20K records (so 40K sentences in total are used in the experiments). One key observation from the table is that the average sentence length of good examples is approximately half that of the opposite class. This further suggests that the distant supervision (or nearly crowdsourced data) aligns with rule #1 (stated in Subsection 2.2).
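The corpus structure described above can be sketched in Python as a dictionary from each word to a list of (sentence, thumbs-up, thumbs-down) tuples. The summary does not define the paper's scoring functions, so `vote_share`, the 0.5 threshold, the label strings, and the sample sentences and vote counts below are all hypothetical stand-ins chosen only to illustrate how votes could be turned into distant-supervision labels.

```python
from typing import Dict, List, Tuple

# word -> list of (example sentence, thumbs-up votes, thumbs-down votes)
Corpus = Dict[str, List[Tuple[str, int, int]]]

# Invented sample data, for illustration only
corpus: Corpus = {
    "meticulous": [
        ("She kept meticulous notes.", 12, 1),
        ("The meticulous approach that the committee adopted after the "
         "review, which many had doubted, was later abandoned.", 2, 9),
    ],
}


def vote_share(up: int, down: int) -> float:
    """Hypothetical score: fraction of votes that are thumbs-up."""
    total = up + down
    # With no votes at all, fall back to a neutral 0.5
    return up / total if total else 0.5


def label_examples(c: Corpus, threshold: float = 0.5) -> List[Tuple[str, str, str]]:
    """Turn crowd votes into 'good'/'bad' labels (assumed rule, not the paper's)."""
    labelled = []
    for word, examples in c.items():
        for sentence, up, down in examples:
            label = "good" if vote_share(up, down) > threshold else "bad"
            labelled.append((word, sentence, label))
    return labelled


labels = label_examples(corpus)
for word, sentence, label in labels:
    print(word, label, len(sentence.split()))
```

Keeping the raw vote counts in the tuples, and not a precomputed label, means different scoring functions can be applied to the same corpus without rebuilding it.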

Machine Learning-Based Classification
Result
TF-IDF Vectorization
Results and Discussion
Evaluation metrics
