Small Training Corpus Research Articles

Despite the potential of social media for environmental monitoring, concerns remain about the quality and reliability of the information automatically extracted. Notably, there are many observations of wildlife on Twitter, but their automated detection is a challenge due to the frequent use of wildlife-related words in messages that have no connection with wildlife observation. We investigate whether, and which types of, supervised machine learning methods can be used to create a fully automated text classification model to identify genuine wildlife observations on Twitter, irrespective of species type or whether Tweets are geo-tagged. We perform experiments with various techniques for building feature vectors that serve as input to the classifiers, and consider how they affect classification performance. We compare three classification approaches and analyse the types of features that are indicative of genuine wildlife observations on Twitter. In particular, we compare some classical machine learning algorithms, widely used in ecology studies, with state-of-the-art neural network models. Results showed that the neural network-based model Bidirectional Encoder Representations from Transformers (BERT) outperformed the classical methods. Notably, this was the case even for a relatively small training corpus of fewer than 3000 instances. This reflects the fact that the BERT classifier uses a transfer learning approach that benefits from prior learning on a much larger collection of generic text. BERT performed particularly well even for Tweets that employed specialised language relating to wildlife observations. The analysis of possible indicative features for wildlife Tweets revealed interesting trends in the usage of hashtags that are unrelated to official citizen science campaigns. The findings from this study facilitate more accurate identification of wildlife-related data on social media, which can in turn be used for enriching citizen science data collections.
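The abstract mentions building feature vectors as classifier input without saying which techniques were used. As an illustration only, the sketch below shows one common option, a bag-of-words count vector that keeps hashtags as separate tokens. The function names and the two example messages are made up for this sketch and are not taken from the study.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens, keeping a leading '#'."""
    return re.findall(r"#?\w+", text.lower())


def build_vocab(texts):
    """Map each distinct token to a column index, in order of first appearance."""
    vocab = {}
    for text in texts:
        for token in tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab


def vectorize(text, vocab):
    """Return a count vector for `text`; tokens outside the vocabulary are ignored."""
    vector = [0] * len(vocab)
    for token, count in Counter(tokenize(text)).items():
        if token in vocab:
            vector[vocab[token]] = count
    return vector


# Hypothetical messages: the word "kite" appears in both, but only the
# first one describes a wildlife observation.
example_texts = [
    "Saw a red kite over the field #birding",
    "This kite festival is fun",
]
vocab = build_vocab(example_texts)
features = [vectorize(text, vocab) for text in example_texts]
```

Both vectors have a non-zero count in the "kite" column, which is the ambiguity the abstract describes: the shared word alone does not separate the two messages, so a classifier has to rely on the surrounding features.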

Motivation: Predicting the part of speech (POS) tag of an unknown word in a sentence is a significant challenge. This is particularly difficult in biomedicine, where POS tags serve as an input to training sophisticated literature summarization techniques, such as those based on Hidden Markov Models (HMM). Different approaches have been taken to deal with the POS tagger challenge, but with one exception (the TnT POS tagger), previous publications on POS tagging have omitted details of the suffix analysis used for handling unknown words. The suffix of an English word is a strong predictor of a POS tag for that word. As a prerequisite for an accurate HMM POS tagger for biomedical publications, we present an efficient suffix prediction method for integration into a POS tagger.

Results: We have implemented a fully functional HMM POS tagger using experimentally optimised suffix-based prediction. Our simple suffix analysis method significantly outperformed the probability-interpolation-based TnT method. We have also shown how important suffix analysis can be for probability estimation of a known word (one present in the training corpus) with an unseen POS tag, a common scenario with a small training corpus. We then integrated this simple method into our POS tagger and determined an optimised parameter set for both methods, which developers can use to tune their own algorithms. We also introduce, for the first time, the concept of counting methods in maximum likelihood estimation and show how counting methods can affect the prediction result. Finally, we describe how machine-learning techniques were applied to identify words for which the predicted POS tag was always incorrect, and propose a method to handle words of this type.

Availability and Implementation: Java source code, binaries and setup instructions are freely available at http://genomes.sapac.edu.au/text_mining/pos_tagger.zip.
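The abstract does not spell out the suffix analysis itself. As a rough illustration of the general idea that a word's suffix predicts its POS tag, the sketch below counts suffix and tag co-occurrences in a tagged corpus and assigns an unknown word the most frequent tag of its longest known suffix. This is neither the authors' method nor TnT's probability interpolation. The function names, the maximum suffix length of 3, the default tag, and the toy corpus are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def train_suffix_model(tagged_words, max_suffix_len=3):
    """Count how often each word-final suffix co-occurs with each POS tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        word = word.lower()
        for n in range(1, min(max_suffix_len, len(word)) + 1):
            counts[word[-n:]][tag] += 1
    return counts


def predict_tag(word, counts, max_suffix_len=3, default="NN"):
    """Return the most frequent tag of the longest suffix seen in training.

    Falls back to `default` when no suffix of the word was seen.
    """
    word = word.lower()
    for n in range(min(max_suffix_len, len(word)), 0, -1):
        suffix = word[-n:]
        if suffix in counts:
            return counts[suffix].most_common(1)[0][0]
    return default


# Toy tagged corpus (hypothetical, Penn Treebank style tags).
toy_corpus = [
    ("walking", "VBG"),
    ("running", "VBG"),
    ("quickly", "RB"),
    ("slowly", "RB"),
    ("protein", "NN"),
    ("kinase", "NN"),
]
model = train_suffix_model(toy_corpus)

print(predict_tag("binding", model))  # matches the suffix "ing"
print(predict_tag("rapidly", model))  # backs off from "dly" to "ly"
```

Picking the single most frequent tag is the simplest possible choice. A tagger embedded in an HMM would instead need a probability for every candidate tag, which is where the estimation and counting choices discussed in the abstract come in.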

Articles published on Small Training Corpus

Identifying wildlife observations on twitter

A deep-learning model for semantic role labelling in medical documents

Part-of-Speech Tagging via Deep Neural Networks for Northern-Ethiopic Languages

Knowledge-enhanced neural networks for sentiment analysis of Chinese reviews

New instances classification framework on Quran ontology applied to question answering system

Low-Cost Implementation of a Named Entity Recognition System for Voice-Activated Human-Appliance Interfaces in a Smart Home

Improving the Collocation Extraction Method Using an Untagged Corpus for Persian Word Sense Disambiguation

Computational Methods for Coptic: Developing and Using Part-of-Speech Tagging for Digital Scholarship in the Humanities

A systematic comparison of feature space effects on disease classifier performance for phenotype identification of five diseases

Automatic machine translation error identification

Improved Part-of-Speech Prediction in Suffix Analysis

Towards Generalizing Classification Based Speech Separation

Introducing nativization to Spanish TTS systems

Empirically evaluating the application of reinforcement learning to the induction of effective and adaptive pedagogical strategies

A novel prosody adaptation method for Mandarin concatenation-based text-to-speech system

Genre as noise: noise in genre

Dialect/Accent Classification Using Unrestricted Audio

Research of Pinyin-To-Character conversion based on Maximum Entropy model

English Syntactic Disambiguation Using Parser's Ambiguity Type Information

ORCHID: building linguistic resources in Thai
