Abstract

Vocabulary plays a major role in Natural Language Processing (NLP) applications. Such applications rely on lists such as stop-word lists, general service lists, academic word lists, and technical domain word lists. A technical domain word list is specific to its domain; although such lists exist for fields like medicine, biology, computer science, physics, and law, the domain of databases has not yet been explored. For the first time, we propose a technical vocabulary comprising POS-tagged unigram tokens and POS-tagged unigram lemmata for the technical domain of databases. We call this vocabulary DBTechVoc. Notably, multi-word phrases are kept intact, without further tokenization, to preserve their semantics. Empirical results on more than 1000 high-quality research papers, collected over a period of 45 years from 1976 to 2021, show that the general technical word list of computer science differs from the specific technical word list of the database domain: the overlap was found to be less than 2%. Of the words used in research paper titles, 6% are Rainbow stop words and 13% are inflectional forms of lemmata.

Highlights

  • Liu and Nation [1] showed empirically that a reader must recognize at least 95% of the words in a text in order to comprehend it

  • The present research work is the first formal attempt to create a technical vocabulary for the domain of databases

  • This vocabulary, called DBTechVoc, consists of a POS-tagged token list with 1758 multi-word phrase unigrams and a POS-tagged lemmata list with 1530 multi-word phrase unigrams


Summary

Introduction

Liu and Nation [1] showed empirically that a reader must recognize at least 95% of the words in a text in order to comprehend it. The same principle could apply to listeners of a natural language as well. Several works, such as those of Dewan and Gupta [2], Tullu [3], Mack [4], and Karagel and Karagel [5], have advocated and elaborated on the importance of a research paper's title as the gist of its contents. Hengl and Gould [7] emphasized that the title of a research paper should clearly indicate the main contents of the paper as well as the actual discoveries it discusses.
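The 95% recognition threshold attributed to Liu and Nation [1] can be made concrete with a small lexical-coverage calculation on a paper title. The following is only an illustrative sketch: the function name `lexical_coverage`, the simple regular-expression tokenizer, the sample title, and the reader vocabulary are all assumptions made here, not the authors' method or data.

```python
import re


def lexical_coverage(text, known_words):
    """Return the fraction of word tokens in `text` found in `known_words`.

    Tokenization is a simplifying assumption: lowercase alphabetic runs,
    with internal hyphens kept so that forms like "stop-word" stay whole.
    """
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in known_words)
    return known / len(tokens)


# Hypothetical reader vocabulary and hypothetical paper title.
known_words = {"a", "survey", "of", "query", "optimization", "in", "databases"}
title = "A Survey of Query Optimization in Distributed Databases"

coverage = lexical_coverage(title, known_words)
print(f"{coverage:.1%}")   # 7 of 8 tokens are known -> 87.5%
print(coverage >= 0.95)    # False: below the 95% threshold
```

In this toy case a single unknown technical word ("distributed") already pushes coverage of the title below 95%, which suggests why a domain-specific word list could matter for short texts such as titles.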
