Improved Part-of-Speech Prediction in Suffix Analysis

Mario Fruzangohar, David L Adelson, Trent A Kroeger

doi:10.1371/journal.pone.0076042

Abstract

Motivation: Predicting the part of speech (POS) tag of an unknown word in a sentence is a significant challenge. This is particularly difficult in biomedicine, where POS tags serve as an input to training sophisticated literature summarization techniques, such as those based on Hidden Markov Models (HMM). Different approaches have been taken to deal with the POS tagger challenge, but with one exception, the TnT POS tagger, previous publications on POS tagging have omitted details of the suffix analysis used for handling unknown words. The suffix of an English word is a strong predictor of a POS tag for that word. As a prerequisite for an accurate HMM POS tagger for biomedical publications, we present an efficient suffix prediction method for integration into a POS tagger.

Results: We have implemented a fully functional HMM POS tagger using experimentally optimised suffix-based prediction. Our simple suffix analysis method significantly outperformed the probability-interpolation-based TnT method. We have also shown how important suffix analysis can be for probability estimation of a known word (in the training corpus) with an unseen POS tag, a common scenario with a small training corpus. We then integrated this simple method in our POS tagger and determined an optimised parameter set for both methods, which can help developers to optimise their current algorithm based on our results. We also introduce the concept of counting methods in maximum likelihood estimation for the first time and show how counting methods can affect the prediction result. Finally, we describe how machine-learning techniques were applied to identify words for which the predicted POS tag was always incorrect, and propose a method to handle words of this type.

Availability and Implementation: Java source code, binaries and setup instructions are freely available at http://genomes.sapac.edu.au/text_mining/pos_tagger.zip.

Highlights

Hidden Markov Models (HMM) have been used in Part-of-Speech (POS) tagging of text for 30 years.
We have shown how important suffix analysis can be for probability estimation of a known word with an unseen POS tag, a common scenario with a small training corpus.
We introduce the concept of counting methods in maximum likelihood estimation for the first time and show how counting methods can affect the prediction result.

Summary

Introduction

Hidden Markov Models (HMM) have been used in Part-of-Speech (POS) tagging of text for 30 years. The presence of special characters (capitals, numbers, hyphens or symbols) is the first characteristic used to predict a word tag. If a new word contains no special characters, that is, it consists entirely of lowercase alphabetic characters, the best method to predict its tag is to examine the lexical structure of the word, such as the suffix and postfix. In English and some other languages, the suffix is a strong predictive feature for word tagging. We first used TnT’s suffix analysis method to handle new words. Subsequent testing of the TnT system gave an unsatisfactory result for suffix analysis, prompting us to design and implement a novel method, which increased accuracy from 66 to 95 percent.
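The two-stage idea described above (check for special characters first, then fall back on the word ending) can be illustrated with a minimal sketch. This is not the authors' method and not TnT's probability interpolation; it is a plain longest-matching-suffix lookup with maximum likelihood counts. The function names, the toy corpus, the Penn-Treebank-style tags, the `max_len=4` suffix limit and the `"NN"` default are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict

# Toy tagged corpus (hypothetical data, Penn-Treebank-style tags).
TOY_CORPUS = [
    ("running", "VBG"), ("walking", "VBG"),
    ("quickly", "RB"), ("slowly", "RB"),
    ("happiness", "NN"), ("kindness", "NN"),
]

def surface_class(word):
    """Coarse special-character class, checked before any suffix lookup."""
    if any(ch.isdigit() for ch in word):
        return "has_digit"
    if "-" in word:
        return "has_hyphen"
    if any(ch.isupper() for ch in word):
        return "has_capital"
    if not word.isalpha():
        return "has_symbol"
    return "lowercase_alpha"

def train_suffix_counts(tagged_words, max_len=4):
    """Count how often each suffix (1..max_len characters) occurs with each tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        # Only plain lowercase words contribute to the suffix statistics.
        if surface_class(word) != "lowercase_alpha":
            continue
        for n in range(1, min(max_len, len(word)) + 1):
            counts[word[-n:]][tag] += 1
    return counts

def predict_tag(word, counts, max_len=4, default="NN"):
    """Return the most frequent tag of the longest suffix seen in training."""
    for n in range(min(max_len, len(word)), 0, -1):
        tag_counts = counts.get(word[-n:])
        if tag_counts:
            return tag_counts.most_common(1)[0][0]
    return default  # no suffix of the word was seen in training

if __name__ == "__main__":
    counts = train_suffix_counts(TOY_CORPUS)
    for unseen in ["jumping", "sadness", "boldly"]:
        print(unseen, surface_class(unseen), predict_tag(unseen, counts))
```

In this sketch the longest suffix wins outright; a real tagger would more likely combine evidence from several suffix lengths and tune the maximum length experimentally, which is the kind of parameter optimisation the paper reports.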



Journal: PLoS ONE	Publication Date: Oct 4, 2013
Citations: 20	License type: CC BY 4.0

Similar Papers

Improving Persian POS tagging using the maximum entropy model
Ahmad A Kardan ... Maryam Bahojb Imani
01 Feb 2014

Part of speech tagging: a systematic review of deep learning and machine learning approaches
Alebachew Chiche ... Betselot Yitagesu
Journal of Big Data | VOL. 9
24 Jan 2022

Arabic Part Of Speech (POS) Tagging Analysis using Bee Colony Optimization (BCO) Algorithm on Quran Corpus
Arief Fatchul Huda ... Dian Rachmat Gumelar
19 Aug 2021

Hidden Markov Model based Part of Speech Tagging for Nepali language
Abhijit Paul ... Bipul Syam Purkayastha
01 Sep 2015
