Maximum Entropy Classifier Research Articles

In the multiple myeloma clinical registry at Heidelberg University Hospital, most data are extracted from discharge letters. Our aim was to analyze whether the manual documentation process can be made more efficient by using natural language processing (NLP) methods for multiclass classification of free-text diagnostic reports, so that the diagnosis and disease state of myeloma patients are documented automatically. The first objective was to create a corpus of free-text diagnosis paragraphs of patients with multiple myeloma from German diagnostic reports, manually annotated for relevant data elements by documentation specialists. The second objective was to construct and evaluate a framework using different NLP methods to enable automatic multiclass classification of relevant data elements from free-text diagnostic reports. The main diagnosis paragraph was extracted from the clinical reports of one third of the patients in the multiple myeloma research database of Heidelberg University Hospital, selected at random (737 patients in total). An electronic data capture (EDC) system was set up, and two data entry specialists independently documented at least nine specific data elements for multiple myeloma characterization. Both data entries were compared and assessed by a third specialist, and an annotated text corpus was created. A framework was constructed, consisting of a self-developed package to split multiple diagnosis sequences into several subsequences, four different preprocessing steps to normalize the input data, and two classifiers: a maximum entropy classifier (MEC) and a support vector machine (SVM). In total, 15 different pipelines were examined and assessed by ten-fold cross-validation, repeated 100 times. As quality indicators, the average error rate and the average F1-score were computed. The approximate randomization test was used for significance testing.
The annotated corpus consists of 737 different diagnosis paragraphs with a total of 865 coded diagnoses. The dataset is publicly available in the supplementary online files for training and testing of further NLP methods. Both classifiers showed low average error rates (MEC: 1.05; SVM: 0.84) and high F1-scores (MEC: 0.89; SVM: 0.92). However, the results varied widely depending on the classified data element. Preprocessing methods increased this effect and had a significant impact on the classification, both positive and negative. The automatic diagnosis splitter increased the average error rate significantly, even though the F1-score decreased only slightly. The low average error rates and high average F1-scores of each pipeline demonstrate the suitability of the investigated NLP methods. However, it was also shown that there is no best practice for automatic classification of data elements from free-text diagnostic reports.
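
The abstract names the approximate randomization test as its significance test but does not describe how it was applied. The following is a minimal sketch of the general technique for two classifiers scored on the same items, not the authors' implementation. The function name, the 0/1 per-item scoring, the number of trials and the toy data are all assumptions made for illustration.

```python
import random


def approximate_randomization(scores_a, scores_b, trials=2000, seed=0):
    """Two-sided paired approximate randomization test.

    scores_a, scores_b: per-item scores of two systems on the same test
    items (here assumed to be 1 for a correct label and 0 otherwise).
    Returns an estimated p-value for the observed difference in totals.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same items")
    rng = random.Random(seed)
    observed = abs(sum(a - b for a, b in zip(scores_a, scores_b)))
    at_least_as_extreme = 0
    for _ in range(trials):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            # Under the null hypothesis the two outputs are exchangeable,
            # so each pair is swapped with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            diff += a - b
        if abs(diff) >= observed - 1e-12:
            at_least_as_extreme += 1
    # Add-one smoothing avoids reporting an impossible p-value of zero.
    return (at_least_as_extreme + 1) / (trials + 1)


# Hypothetical per-item correctness of two classifiers on ten test items.
system_a = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
system_b = [1, 0, 1, 0, 1, 0, 1, 1, 0, 0]
p_value = approximate_randomization(system_a, system_b)
```

A small p-value suggests that the difference between the two systems is unlikely to come from chance assignment of outputs to systems. With cross-validated results, the per-item scores would have to be collected across folds before being passed in.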

Background: The acquisition of knowledge about relations between bacteria and their locations (habitats and geographical locations) in short texts about bacteria, as defined in the BioNLP-ST 2013 Bacteria Biotope task, depends on the detection of co-reference links between mentions of entities of each of these three types. To our knowledge, no participant in this task has investigated this aspect of the situation. The present work specifically addresses issues raised by this situation: (i) how to detect these co-reference links and associated co-reference chains; (ii) how to use them to prepare positive and negative examples to train a supervised system for the detection of relations between entity mentions; (iii) what context around which entity mentions contributes to relation detection when co-reference chains are provided.
Results: We present experiments and results obtained both with gold entity mentions (task 2 of BioNLP-ST 2013) and with automatically detected entity mentions (end-to-end system, in task 3 of BioNLP-ST 2013). Our supervised mention detection system uses a linear-chain Conditional Random Fields (CRF) classifier, and our relation detection system relies on a Logistic Regression (also known as Maximum Entropy) classifier. They use a set of morphological, morphosyntactic and semantic features. To minimize false inferences, co-reference resolution applies a set of heuristic rules designed to optimize precision. They take into account the types of the detected entity mentions, and take advantage of the didactic nature of the texts of the corpus, where a large proportion of bacteria naming is fairly explicit (although natural referring expressions such as "the bacteria" are common). The resulting system achieved a 0.495 F-measure on the official test set when taking as input the gold entity mentions, and a 0.351 F-measure when taking as input entity mentions predicted by our CRF system, both of which are above the best BioNLP-ST 2013 participant system.
Conclusions: We show that co-reference resolution substantially improves over a baseline system which does not use co-reference information: about 3.5 F-measure points on the test corpus for the end-to-end system (5.5 points on the development corpus) and 7 F-measure points on both development and test corpora when gold mentions are used. While this outperforms the best published system on the BioNLP-ST 2013 Bacteria Biotope dataset, we consider that it provides mostly a stronger baseline from which more work can be started. We also emphasize the importance and difficulty of designing a comprehensive gold standard co-reference annotation, which we explain is a key point to further progress on the task.
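
The abstract equates Logistic Regression with a Maximum Entropy classifier. For two classes, the maximum entropy model reduces to a sigmoid over a weighted sum of features. The sketch below illustrates that with a small binary classifier trained by batch gradient ascent on the log-likelihood. It is an assumption-laden toy and not the authors' system: the feature names (`same_sentence`, `coref_link`, `far_apart`), the training data and the hyperparameters are hypothetical, and regularization is left out for brevity.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(weights, bias, feats):
    # Weighted sum over a sparse feature dict {feature_name: value}.
    return bias + sum(weights.get(f, 0.0) * v for f, v in feats.items())


def train_logistic_regression(examples, labels, epochs=200, lr=0.5):
    """Fit a binary maximum entropy (logistic regression) model.

    examples: list of sparse feature dicts
    labels:   list of 0/1 values (1 = the candidate pair is a relation)
    """
    weights, bias = {}, 0.0
    n = len(examples)
    for _ in range(epochs):
        grad, grad_bias = {}, 0.0
        for feats, y in zip(examples, labels):
            # Gradient of the log-likelihood: (label - predicted probability).
            err = y - sigmoid(score(weights, bias, feats))
            grad_bias += err
            for f, v in feats.items():
                grad[f] = grad.get(f, 0.0) + err * v
        bias += lr * grad_bias / n
        for f, g in grad.items():
            weights[f] = weights.get(f, 0.0) + lr * g / n
    return weights, bias


def predict_proba(weights, bias, feats):
    return sigmoid(score(weights, bias, feats))


# Hypothetical candidate (bacterium, location) pairs with toy features.
train_x = [
    {"same_sentence": 1.0, "coref_link": 1.0},
    {"same_sentence": 1.0},
    {"coref_link": 1.0},
    {},
    {"far_apart": 1.0},
    {"far_apart": 1.0},
]
train_y = [1, 1, 1, 0, 0, 0]
weights, bias = train_logistic_regression(train_x, train_y)
p_related = predict_proba(weights, bias, {"same_sentence": 1.0, "coref_link": 1.0})
p_unrelated = predict_proba(weights, bias, {"far_apart": 1.0})
```

In this toy setup, a feature such as `coref_link` stands in for the idea that a co-reference chain can connect a mention to an entity introduced elsewhere in the text. How the published system encodes such information is not specified in the abstract.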

Articles published on Maximum Entropy Classifier

On the effects of using word2vec representations in neural networks for dialogue act recognition

Dropped personal pronoun recovery in Chinese SMS

Paraphrase identification and semantic text similarity analysis in Arabic news tweets using lexical, syntactic, and semantic features

Twitter Influenza Surveillance: Quantifying Seasonal Misdiagnosis Patterns and their Impact on Surveillance Estimates.

Assessing Quality of Care and Elder Abuse in Nursing Homes via Google Reviews.

A Value Classification of Electronic Product Reviews Based on Maximum Entropy

Temporal replication of the national land cover database using active machine learning

Extending the geographic extent of existing land cover data using active machine learning and covariate shift corrective sampling

Combining Naïve Bayes and Modified Maximum Entropy Classifiers for Text Classification

AN EFFECTIVE APPROACH FOR TEXT CLASSIFICATION

Automated Classification of Selected Data Elements from Free-text Diagnostic Reports for Clinical Research.

Simulated and Self-Sustained Classification of Twitter Data based on its Sentiment

Combining Overall and Target Oriented Sentiment Analysis over Portuguese Text from Social Media

The contribution of co-reference resolution to supervised relation detection between bacteria and biotopes entities.

Topic-Based Coherence Modeling for Statistical Machine Translation

METSP: a maximum-entropy classifier based text mining tool for transporter-substrate identification with semistructured text.

HPS: High precision stemmer

Combining Probabilistic Classifiers for Text Classification

Identifying Abbreviations in Biomedical Literature Based on Maximum Entropy with Web Features

Learning to Rank for Review Rating Prediction
