Part-of-Speech (POS) Tagging Using Deep Learning-Based Approaches on the Designed Khasi POS Corpus

Abstract

Part-of-speech (POS) tagging is one of the challenging research fields in natural language processing (NLP). It requires good knowledge of the particular language and large amounts of data or corpora for feature engineering, both of which contribute to good tagger performance. Our main contribution in this research work is the designed Khasi POS corpus. To date, no Khasi corpus of any kind has been formally developed. In the designed Khasi POS corpus, each word is tagged manually using the designed tagset. Deep learning methods have been used to experiment with this corpus. POS taggers based on BiLSTM, on a combination of BiLSTM with CRF, and on character-based embedding with BiLSTM are presented. We anticipate the main challenges that arise in understanding and handling natural language from a computational linguistics perspective. In the designed corpus, we have tried to resolve the ambiguity of words with respect to their contextual usage, as well as the orthography problems that arise in the corpus. The designed Khasi corpus contains around 96,100 tokens and 6,616 distinct words. Initially, when running the experiment on the first sets of data, of around 41,000 tokens, the taggers were found to yield considerably accurate results. When the Khasi corpus size was increased to 96,100 tokens, we saw an increase in accuracy and the analyses became more pertinent. As a result, an accuracy of 96.81% is achieved for the BiLSTM method, 96.98% for BiLSTM with CRF, and 95.86% for the character-based approach with LSTM. To place this work in the context of NLP research on Khasi, we also present some recent POS taggers and other NLP works on the Khasi language for comparison.

Similar Papers
  • Conference Article
  • Cite count: 31
  • 10.1109/isacc.2015.7377332
Hidden Markov Model based Part of Speech Tagging for Nepali language
  • Sep 1, 2015
  • Abhijit Paul + 2 more

Natural Language Processing (NLP) is mainly concerned with the development of computational models and tools for aspects of human (natural) language processing. Part of Speech (POS) tagging is a well-studied topic and one of the most fundamental preprocessing steps for any language in NLP. Natural language processing of Nepali still lacks significant research effort in India. POS tagging of Nepali is a necessary component of most NLP applications in Nepali; it analyses the construction and behavior of the language and can be used to develop automated tools for language processing. The literature survey and related works show that little work has been done on POS tagging for the Nepali language in India, owing to the lack of a comprehensive tagged corpus or correct hand-written rules. In this paper, Hidden Markov Model (HMM) based Part of Speech (POS) tagging for the Nepali language is discussed. HMM is the most popularly used statistical model for POS tagging; it uses little knowledge about the language apart from its contextual information. The tagger has been evaluated using corpora collected from TDIL (Technology Development for Indian Languages) and the BIS tagset of 42 tags. The tagset has been designed to meet the morpho-syntactic requirements of the Nepali language. Apart from the corpora and the tagset, the Python programming language and the NLTK (Natural Language Toolkit) library have been used for implementation. The tagger achieves accuracy over 96% for known words; for unknown words, the research is still continuing.
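To make the HMM idea in the abstract above concrete, here is a minimal sketch of a bigram HMM tagger with Viterbi decoding. It is not the authors' NLTK-based implementation: the toy English corpus, the tag names, the add-one smoothing, and the function names `train_hmm` and `viterbi` are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data (not from the paper): (word, tag) pairs.
TRAIN = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]

def train_hmm(sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][word]
    for sent in sentences:
        prev = "<s>"              # sentence-start pseudo-tag
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return trans, emit

def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability (an assumed smoothing choice)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))

def viterbi(words, trans, emit):
    """Return the most probable tag sequence for a non-empty word list."""
    tags = sorted(emit)
    # Vocabulary size plus one slot reserved for unknown words.
    vocab = len({w for c in emit.values() for w in c}) + 1
    # best[i][t] = (log score of best path ending in tag t, previous tag)
    best = [{t: (log_prob(trans["<s>"], t, len(tags))
                 + log_prob(emit[t], words[0], vocab), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            e = log_prob(emit[t], words[i], vocab)
            column[t] = max(
                (best[i - 1][p][0] + log_prob(trans[p], t, len(tags)) + e, p)
                for p in tags
            )
        best.append(column)
    # Follow the back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return path[::-1]

trans, emit = train_hmm(TRAIN)
print(viterbi(["the", "cat", "barks"], trans, emit))
```

The smoothing is what lets the sketch assign some tag to a word it has never seen; the abstract notes that unknown words remain the open problem, and a real tagger would need a better unknown-word model than add-one counts.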

  • Conference Article
  • 10.1109/estc.2012.6485604
An approach to reduce part of speech ambiguity using semantically annotated lexicon definitions
  • Sep 1, 2012
  • Andrei Minca + 1 more

In computational linguistics, the problem of word-sense disambiguation (WSD) is a difficult one, and methods using a flat topology of the tokens are not very effective. One solution is to use a Part of Speech (POS) tagger before starting the WSD process. However, POS taggers show their limitations when high-precision tagging is required or large texts are processed. This paper presents a technique to reduce POS ambiguity using semantic information. As benchmarks we use the following standard WSD corpora: Senseval2, Senseval3 and Semcor. Moreover, we tested our approach on WordNet semantically tagged glosses for English and on our own semantically tagged lexicon glosses for the Romanian language.

  • Conference Article
  • Cite count: 1
  • 10.2991/msie-13.2013.94
An Approach to Reduce Part of Speech Ambiguity Using Semantically Annotated Lexicon Definitions
  • Jan 1, 2013
  • Andrei Minc + 1 more

In computational linguistics, the problem of word-sense disambiguation (WSD) is a difficult one, and methods using a flat topology of the tokens are not very effective. One solution is to use a Part of Speech (POS) tagger before starting the WSD process. However, POS taggers show their limitations when high-precision tagging is required or large texts are processed. This paper presents a technique to reduce POS ambiguity using semantic information. As benchmarks we use the following standard WSD corpora: Senseval2, Senseval3 and Semcor. Moreover, we tested our approach on WordNet semantically tagged glosses for English and on our own semantically tagged lexicon glosses for the Romanian language.

  • Research Article
  • Cite count: 1
  • 10.11591/eecsi.v7.2034
Combination of Genetic Algorithm and Brill Tagger Algorithm for Part of Speech Tagging Bahasa Madura
  • Oct 1, 2020
  • Proceeding of the Electrical Engineering Computer Science and Informatics
  • Nindian Puspa Dewi + 3 more

Part of speech (POS) is commonly known as the word types in a sentence, such as verbs, adjectives, nouns, and so on. Part of Speech (POS) Tagging is the process of marking the word class or part of speech of every word in a sentence. Part of Speech Tagging plays an important role as a basis for research in Natural Language Processing. That is why research on Part of Speech Tagging for Bahasa Madura is carried out as an effort to preserve and develop the use of regional languages. In this research, POS Tagging is done using the Brill Tagger Algorithm combined with the Genetic Algorithm. Brill Tagger is a POS Tagging Algorithm that has the best level of accuracy when implemented in other languages. The Genetic Algorithm is used in the contextual learner process because previous studies indicate that it can speed up the training process and make it more efficient. The results of this study are then compared with the results of the previous study to find out which algorithms are suitable for the development of text processing in Bahasa Madura. From a series of experiments, the average accuracy obtained by using Brill Tagger is 86.4% with the highest accuracy of 86.7%, while GA Brill Tagger shows an average accuracy of 86.5% with the highest accuracy of 86.6%. Testing on OOV (Out of Vocabulary) words achieves an average accuracy of 67.7% for Brill Tagger and 64.6% for GA Brill Tagger. Testing that considers multiple POS produces an average accuracy of 73.3% with Brill Tagger and 90.9% with GA Brill Tagger. This shows that the accuracy of GA Brill Tagger is better than that of Brill Tagger, especially when considering multiple POS. This is because GA Brill Tagger can generate more rules for handling multiple POS than pure Brill Tagger.

  • Book Chapter
  • Cite count: 2
  • 10.25215/8119070682.09
A REVIEW ON DIFFERENT APPROACHES OF POS TAGGING IN NLP
  • Jan 1, 2020
  • K Aparna + 2 more

Natural language processing (NLP) techniques have piqued the curiosity of many as information and communication technology has advanced rapidly. As a result, several NLP tools are being developed. However, there are several obstacles to building effective and efficient NLP systems that analyze natural languages accurately. Part of speech (POS) tagging is a technique for tagging a specific phrase or words in a paragraph based on the context of the sentence/words inside the paragraph. Despite tremendous research efforts, POS tagging continues to encounter hurdles in boosting accuracy while minimizing false-positive rates and in identifying unfamiliar terms. Furthermore, ambiguity must be handled when tagging terms with distinct contextual meanings inside a phrase. Deep learning (DL) and machine learning (ML)-based POS taggers have recently been deployed as promising methods for identifying words in a particular phrase throughout a paragraph. This review first defines part-of-speech (POS) tagging. It then provides a comprehensive classification based on the well-known ML and DL approaches used in the design and implementation of part-of-speech taggers. A complete assessment of the most recent POS tagging publications is offered, with the weaknesses and merits of the suggested methodologies discussed. Then, current trends and developments in DL and ML-based part-of-speech taggers are given in terms of the proposed techniques used and their performance assessment criteria. Using the limitations of the offered techniques, we highlight some research gaps and present future research recommendations for developing DL and ML-based POS tagging.

  • Book Chapter
  • Cite count: 12
  • 10.1007/978-3-030-68154-8_93
Towards POS Tagging Methods for Bengali Language: A Comparative Analysis
  • Jan 1, 2021
  • Fatima Jahara + 6 more

Part of Speech (POS) tagging is recognized as a significant research problem in the field of Natural Language Processing (NLP). It has considerable importance in several NLP technologies. However, developing an efficient POS tagger is a challenging task for resource-scarce languages like Bengali. This paper presents an empirical investigation of various POS tagging techniques for the Bengali language. An extensively annotated corpus of around 7390 sentences has been used for 16 POS tagging techniques, including eight stochastic methods and eight transformation-based methods. The stochastic methods are uni-gram, bi-gram, tri-gram, unigram+bigram, unigram+bigram+trigram, Hidden Markov Model (HMM), Conditional Random Field (CRF), and Trigrams ‘n’ Tags (TnT), whereas the transformation-based methods combine Brill with each of the previously mentioned stochastic techniques. A comparative analysis of the tagging methods is performed using two tagsets (30-tag and 11-tag) with accuracy measures. Brill combined with CRF shows the highest accuracy among all the tagging techniques: 91.83% for the 11-tag tagset and 84.5% for the 30-tag tagset.
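The "unigram+bigram" combination named in the abstract above is commonly realized as a backoff chain, where a context-sensitive tagger defers to a simpler one when it has no evidence. The sketch below assumes that reading; the toy English sentences, the default tag, and the names `train_ngram` and `tag_with_backoff` are hypothetical and not taken from the paper.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus (not from the paper): "can" is ambiguous.
TRAIN = [
    [("the", "DET"), ("can", "NOUN"), ("rusts", "VERB")],
    [("they", "PRON"), ("can", "AUX"), ("swim", "VERB")],
    [("the", "DET"), ("can", "NOUN"), ("leaks", "VERB")],
]

def train_ngram(sentences):
    """Collect tag counts per word and per (previous tag, word) context."""
    uni = defaultdict(Counter)  # word -> tag counts
    bi = defaultdict(Counter)   # (prev_tag, word) -> tag counts
    for sent in sentences:
        prev = None             # None marks the sentence start
        for word, tag in sent:
            uni[word][tag] += 1
            bi[(prev, word)][tag] += 1
            prev = tag
    return uni, bi

def tag_with_backoff(words, uni, bi, default="NOUN"):
    """Try the bigram context first, then the unigram, then a default tag."""
    out, prev = [], None
    for w in words:
        if (prev, w) in bi:
            t = bi[(prev, w)].most_common(1)[0][0]
        elif w in uni:
            t = uni[w].most_common(1)[0][0]
        else:
            t = default         # assumed fallback for unseen words
        out.append(t)
        prev = t
    return out

uni, bi = train_ngram(TRAIN)
print(tag_with_backoff(["they", "can", "leaks"], uni, bi))
print(tag_with_backoff(["the", "can", "swim"], uni, bi))
```

In both test sentences the bigram context resolves "can", while the final word is tagged by the unigram fallback because its (previous tag, word) pair never occurred in training.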

  • Research Article
  • Cite count: 199
  • 10.1186/s40537-022-00561-y
Part of speech tagging: a systematic review of deep learning and machine learning approaches
  • Jan 24, 2022
  • Journal of Big Data
  • Alebachew Chiche + 1 more

Natural language processing (NLP) tools have sparked a great deal of interest due to rapid improvements in information and communications technologies. As a result, many different NLP tools are being produced. However, there are many challenges in developing efficient and effective NLP tools that accurately process natural languages. One such tool is part of speech (POS) tagging, which tags a particular sentence or words in a paragraph by looking at the context of the sentence/words inside the paragraph. Despite enormous efforts by researchers, POS tagging still faces challenges in improving accuracy while reducing false-positive rates and in tagging unknown words. Furthermore, the presence of ambiguity when tagging terms with different contextual meanings inside a sentence cannot be overlooked. Recently, deep learning (DL) and machine learning (ML)-based POS taggers have been implemented as potential solutions to efficiently identify words in a given sentence across a paragraph. This article first clarifies the concept of part-of-speech (POS) tagging. It then provides a broad categorization based on the well-known ML and DL techniques employed in designing and implementing part-of-speech taggers. A comprehensive review of the latest POS tagging articles is provided by discussing the weaknesses and strengths of the proposed approaches. Then, recent trends and advancements in DL and ML-based part-of-speech taggers are presented in terms of the proposed approaches deployed and their performance evaluation metrics. Using the limitations of the proposed approaches, we emphasize various research gaps and present future recommendations for research in advancing DL and ML-based POS tagging.

  • Book Chapter
  • Cite count: 30
  • 10.1007/978-3-642-36543-0_6
A Ruled-Based Part of Speech (RPOS) Tagger for Malay Text Articles
  • Jan 1, 2013
  • Rayner Alfred + 2 more

The Malay language is an Austronesian language spoken in most countries in the South East Asia region, including Malaysia, Indonesia, Singapore, Brunei and Thailand. Traditional linguistics is well developed for Malay, but very limited resources and tools are available or accessible for computational linguistic analysis of the Malay language. Assigning part of speech (POS) to running words in a sentence is one of the pipeline processes in Natural Language Processing (NLP) tasks, and it is not well investigated for Malay. This paper outlines an approach to Part of Speech (POS) tagging for Malay text articles. We apply a simple Rule-based Part of Speech (RPOS) tagger to perform the tagging operation on Malay text articles. POS tagging can be described as the task of automatically annotating each word in a text document with its syntactic category. A rule-based POS tagger generally involves a POS tag dictionary and a set of rules in order to identify the words that are considered parts of speech. In this paper, we propose a framework that applies Malay affixing rules to identify the Malay POS tag, and uses the relation between words to select the best POS tag for words that have two or more valid POS tags. The results show that the accuracy of the rule-based POS tagger is higher than that of a statistical POS tagger. This indicates that the proposed RPOS tagger is able to predict an unknown word’s POS with promising accuracy.

  • Book Chapter
  • Cite count: 3
  • 10.1007/978-3-319-63645-0_56
Building Machine Learning System with Deep Neural Network for Text Processing
  • Aug 17, 2017
  • Shashi Pal Singh + 5 more

This paper provides the method and process to build a machine learning system using a Deep Neural Network (DNN) for lexicon analysis of text. Part of Speech (POS) tagging of words is important in natural language processing, whether in speech technology or machine translation. Recent advances in Deep Neural Networks can help achieve better results in POS tagging of words and phrases. The Word2vec tool of the Dl4j library is very popular for representing words in a continuous vector space, and these vectors capture the syntactic and semantic meaning of the corresponding words. If we have a database of sample words with their POS category, it is possible to assign a POS tag to a word, but this fails when the word is not present in the database. Cosine similarity plays an important role in finding the POS tags of words and phrases that were not previously trained or POS tagged. With the help of cosine similarity, the system assigns appropriate POS tags to words by finding their nearest similar words using the vectors trained with Word2vec. A deep neural network such as an RNN outperforms traditional state-of-the-art approaches because it deals with the issue of word sense disambiguation. Semi-supervised learning is used to train the network. This approach is applicable to Indian languages as well as foreign languages. In this paper, an RNN is implemented to build a machine learning system for POS tagging of words in English language sentences.
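The nearest-neighbour step described in the abstract above can be sketched as follows. This is only an illustration of cosine-similarity lookup, not the paper's Dl4j pipeline: the three-dimensional vectors are invented stand-ins for trained Word2vec embeddings, and `cosine` and `tag_by_nearest` are hypothetical helper names.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical 3-dimensional embeddings; real Word2vec vectors would have
# hundreds of dimensions and come from a trained model.
VECTORS = {
    "run":   [0.90, 0.10, 0.00],
    "walk":  [0.80, 0.20, 0.10],
    "table": [0.10, 0.90, 0.20],
    "chair": [0.00, 0.80, 0.30],
    "jog":   [0.85, 0.15, 0.05],  # has a vector but no POS tag yet
}

# Words whose POS category is already known (the "database" of the abstract).
TAGGED = {"run": "VERB", "walk": "VERB", "table": "NOUN", "chair": "NOUN"}

def tag_by_nearest(word, vectors, tagged):
    """Copy the tag of the tagged word whose vector is closest by cosine."""
    nearest = max(tagged, key=lambda w: cosine(vectors[word], vectors[w]))
    return tagged[nearest], nearest

tag, neighbour = tag_by_nearest("jog", VECTORS, TAGGED)
print(tag, neighbour)
```

The approach still requires the untagged word to have an embedding, so it covers words missing from the tag database but not words missing from the embedding vocabulary.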

  • Conference Article
  • Cite count: 52
  • 10.1109/icaicta.2016.7803103
InaNLP: Indonesia natural language processing toolkit, case study: Complaint tweet classification
  • Aug 1, 2016
  • Ayu Purwarianti + 4 more

This research discusses how a natural language processing (NLP) toolkit for Indonesian formal text and social media text, named InaNLP, has been developed. Several NLP modules were integrated into InaNLP to make it easier to build an NLP system for the Indonesian language. The toolkit contains NLP modules such as a sentence splitter, tokenization, a Part of Speech (POS) tagger, a phrase chunker, a named entity (NE) tagger, a syntactic parser, a semantic analyzer, and word normalization. Several modules were built using a rule-based approach, whereas others implement a statistical approach. Here, the accuracy of several modules, namely the POS tagger, NE tagger, syntactic parser and semantic analyzer, is shown. In the NE tagger, a five (5) word window with features of POS, orthography, and word list is used. In the NE tagger experiment for evaluating the features, using the SMO algorithm and 1500 sentences for 15 NE classes, a token classification accuracy of 93.419% was achieved, which outperforms related work. For the POS tagger, using 12,000 tokens as the training data and 3,000 tokens as the testing data, an accuracy of 96.50% was achieved. For the syntactic parser, using the CYK algorithm with 100 sentences as the training data and 36 sentences as the testing data, the experiment achieved an accuracy of 47.22%. For the semantic analyzer, using 200 sentences as the training data, the experiment achieved an accuracy of 62.50%. This research also shows an example of building an Indonesian NLP system using InaNLP for complaint tweet classification. In the complaint classification experiment, using 7440 data items, the experiment achieved an average F-measure score of 0.892.

  • Research Article
  • 10.5614/itbj.ict.res.appl.2013.7.3.1
Implementation of Kadazan Tagger Based on Brill's Method
  • Dec 1, 2013
  • Journal of ICT Research and Applications
  • Marylyn Alex + 1 more

We present and evaluate the implementation of Part of Speech (POS) tagging for the Kadazan language using the transformation-based approach. The main purpose of this study is to develop an automatic POS tagger for the Kadazan language, which had never been developed before. POS tagging can tag the Kadazan corpus automatically and can help with the disambiguation problem of this language. The approach is implemented in this study with the aim of achieving accuracy better than, or at least similar to, that of other tagging approaches such as the statistical and the original rule-based approach. This approach can transform the tags based on a prescribed set of rules. A number of objectives were set in order to achieve the main purpose of this study: firstly, to apply the lexical and contextual rules for this language; secondly, to implement Brill's algorithm based on the set of rules; and finally, to determine the effectiveness of Kadazan Part of Speech tagging using this approach. The tagging system was trained using four Kadazan corpora containing 5663 words in all. Based on the evaluation results, the tagging system achieved around 93% accuracy.
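The transformation-based idea in the abstract above (an initial lexical tagging that contextual rules then correct) can be sketched in a few lines. The lexicon, the single rule, and the function names `initial_tags` and `apply_rules` are hypothetical English examples, not the Kadazan rules from the study, and the rule-learning phase of Brill's algorithm is omitted.

```python
# Hypothetical lexicon mapping each word to its most frequent tag.
LEXICON = {"to": "TO", "the": "DET", "run": "NOUN", "we": "PRON", "like": "VERB"}

# Contextual rules as (from_tag, to_tag, required_previous_tag) triples:
# "change from_tag to to_tag when the preceding tag is required_previous_tag".
RULES = [("NOUN", "VERB", "TO")]

def initial_tags(words, lexicon, default="NOUN"):
    """Lexical step: look each word up, falling back to an assumed default."""
    return [lexicon.get(w, default) for w in words]

def apply_rules(tags, rules):
    """Contextual step: apply each rule in order, scanning left to right."""
    tags = list(tags)
    for from_tag, to_tag, prev_tag in rules:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return tags

words = ["we", "like", "to", "run"]
print(apply_rules(initial_tags(words, LEXICON), RULES))
```

Here "run" starts as NOUN from the lexicon and is corrected to VERB because it follows TO, while "the run" would keep its NOUN tag since no rule matches that context.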

  • Conference Article
  • Cite count: 7
  • 10.1109/icws.2019.00068
A Novel Part of Speech Tagging Framework for NLP Based Business Process Management
  • Jul 1, 2019
  • Xue Han + 5 more

Natural Language Processing (NLP) is a key technique to automate Business Process Management (BPM) at different levels. The performance of existing NLP-based BPM methods suffers from the limited accuracy of Part of Speech (POS) tagging, which is a key step in NLP pipelines. Note that the performance of POS tagging depends heavily on the domain of the annotated training data. However, most state-of-the-art POS taggers are trained on corpora from the newswire domain, which usually have different syntax features from business process descriptions (BPD). The syntax features of the BPD domain include usually starting with an imperative verb and containing numerous out-of-vocabulary (OOV) words. In this paper, we propose a novel POS tagging framework to tackle these problems. The main idea is that the syntax feature of starting with an imperative verb can be learned by increasing the proportion of correctly POS-annotated imperative sentences in the training data. The trained POS tagger could reduce the overall POS tagging error by nearly 12% compared with a newswire-trained POS tagger. For verbs, which are key words in BPD, the tagging precision could be increased by 27%. The lexical ambiguity caused by OOV words is resolved by extracting local contextual knowledge from the images that are attached to help users understand the process better. Experimental results show that the overall POS tagging accuracy could be increased by nearly 10% with contextual OOV knowledge.

  • Conference Article
  • Cite count: 16
  • 10.1109/ubmk.2018.8566272
Deep Neural Network Architecture for Part-of-Speech Tagging for Turkish Language
  • Sep 1, 2018
  • Cenk Anil Bahcevan + 2 more

Parts of Speech (POS) tagging is one of the most well-studied problems in the field of Natural Language Processing (NLP). In this paper, Neural Network Language Models (NNLM) such as the Recurrent Neural Network (RNN) and Long-Short Term Memory (LSTM) have been trained and assessed to address the POS tagging problem for the Turkish language. The performance is compared to state-of-the-art methods. The results show that LSTM outperforms RNN with an 88.7% F1-score. This is the first study to contribute to the literature by utilizing word embedding and NNLM for the Turkish language.

  • Conference Article
  • Cite count: 4
  • 10.1109/ict4da56482.2022.9971153
Design and Develop A Part of Speech Tagging for Ge’ez Language using Deep Learning Approach
  • Nov 28, 2022
  • Asnak Yihunie Kassahun + 1 more

Part of Speech (POS) tagging is one of the basic and important applications of Natural Language Processing (NLP). Different POS tagging systems have been developed for various languages. In some languages, POS tagging works well with high accuracy, but in the Ge’ez language it is still an unsolved problem. Ge’ez is a morphologically rich, free word order language and is very ambiguous, since every word has many variants based on its suffixes and prefixes. This paper proposes deep learning-based POS tagging for the Ge’ez language. Gated Recurrent Unit (GRU), Bidirectional Gated Recurrent Unit (Bi-GRU), Long-Short Term Memory (LSTM), and Bidirectional Long-Short Term Memory (Bi-LSTM) models were developed by varying the number of training epochs and important hyperparameters. The Ge’ez language has no standard POS corpus. Therefore, we took 2,552 sentences from the Holy Bible and divided them into 1,786 sentences as a training data set and 766 as a test data set. The corpus has 26,607 words in total, and the experimental results show that the Bi-GRU model achieved an accuracy of 86.70% with 10 epochs, 2 hidden layers, 128 neurons per hidden layer, and a learning rate of 0.01. The second-best model is the GRU model, with an accuracy of 86.15% using the same hyperparameters as the Bi-GRU model. The final results of the experiment show that the GRU and Bi-GRU Ge’ez POS taggers performed better according to both the tag-wise classification results and the overall accuracy on the training and testing data sets. This implies that the GRU and Bi-GRU based POS taggers outperform the LSTM and Bi-LSTM taggers when used separately. Overall, the deep learning approach is better than other traditional approaches for developing a POS tagger for the low-resource Ge’ez language.

  • Book Chapter
  • Cite count: 1
  • 10.1007/978-3-030-63128-4_50
Bangla Part of Speech Tagging Using Contextual Embeddings and Oversampling Techniques
  • Oct 31, 2020
  • Koushik Roy + 7 more

Part of Speech (PoS) Tagging has been a customary research area in the field of Natural Language Processing. The popularization of Neural Networks has opened substantially more scope for research on Bangla PoS Tagging, especially with sequential models, particularly Recurrent Neural Networks such as Long Short Term Memory (LSTM) and Gated Recurrent Units (GRU). Our contribution in this paper is that we transformed the overall sequential modeling problem into an inconsequent model using BERT embeddings, in order to leverage the existing, well-understood oversampling algorithms for improving PoS Tagging with a shallow feed-forward Neural Network. Our experimental results indicate that the Synthetic Minority Over-sampling Technique (SMOTE) works well as an oversampling algorithm for BERT embeddings.
