Information Extraction System Research Articles

Textual databases are ubiquitous in many application domains. Examples of textual data range from names and addresses of customers to social media posts and bibliographic records. With online services, individuals are increasingly required to enter their personal details, for example when purchasing products online or registering for government services, while many social network and e-commerce sites allow users to post short comments. Many online sites leave open the possibility for people to enter unintended or malicious abnormal values, such as names with errors, bogus values, profane comments, or random character sequences. In other applications, such as online bibliographic databases or comparative online shopping sites, databases are increasingly populated in (semi-)automatic ways through Web crawls. This practice can result in low-quality data being added automatically into a database. In this article, we develop three techniques to automatically discover abnormal (unexpected or unusual) values in large textual databases. Following recent work in categorical outlier detection, our assumption is that “normal” values are those that occur frequently in a database, while an individual abnormal value is rare. Our techniques are unsupervised and treat the discovery of abnormal values as an outlier detection problem. Our first technique is a basic but efficient q-gram set based technique, the second is based on a probabilistic language model, and the third employs morphological word features to train a one-class support vector machine classifier. Our aim is to investigate and develop techniques that are fast, efficient, and automatic. The output of our techniques can help in the development of rule-based data cleaning and information extraction systems, or be used as training data for further supervised data cleaning procedures.
We evaluate our techniques on four large real-world datasets from different domains: two US voter registration databases containing personal details, the 2013 KDD Cup dataset of bibliographic records, and the SNAP Memetracker dataset of phrases from social networking sites. Our results show that our techniques can efficiently and automatically discover abnormal textual values, allowing an organization to conduct efficient data exploration and improve the quality of its textual databases without requiring explicit training data.
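
The abstract does not spell out how the q-gram set based technique scores values, so the following is only a minimal sketch of the general idea it states: values built from frequently seen character q-grams are treated as "normal", and values built from rare q-grams as candidates for abnormality. The function names (`qgrams`, `rarity_scores`), the padding character, the averaging rule, and the toy name list are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter


def qgrams(value, q=2):
    """Return the set of character q-grams of a lower-cased, padded string."""
    pad = "#" * (q - 1)  # hypothetical padding so prefixes and suffixes count too
    padded = pad + value.lower() + pad
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def rarity_scores(values, q=2):
    """Score each distinct value by the average share of records that
    contain each of its q-grams. Lower scores mean rarer q-grams."""
    record_freq = Counter()
    for v in values:
        # Each record contributes at most once per q-gram (set semantics).
        record_freq.update(qgrams(v, q))

    n = len(values)
    scores = {}
    for v in set(values):
        grams = qgrams(v, q)
        if not grams:  # e.g. empty string with q=1
            scores[v] = 0.0
            continue
        scores[v] = sum(record_freq[g] / n for g in grams) / len(grams)
    return scores


# Toy data (invented for this sketch): five plausible surnames and one
# random character sequence.
names = ["smith", "smyth", "smithe", "smith", "smit", "xq7zkv"]
scores = rarity_scores(names)
most_abnormal = min(scores, key=scores.get)
print(most_abnormal)
```

In this toy list every q-gram of the random sequence occurs in only one record, so it receives the lowest score. On real data a threshold or a ranked review list would have to be chosen, which this sketch does not address.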

Medicinal chemistry patents contain rich information about chemical compounds. Although much effort has been devoted to extracting chemical entities from scientific literature, only a limited number of patent mining systems are publicly available, probably due to the lack of large manually annotated corpora. To accelerate the development of information extraction systems for medicinal chemistry patents, the 2015 BioCreative V challenge organized a track on Chemical and Drug Named Entity Recognition from patent text (CHEMDNER patents). This track included three individual subtasks: (i) Chemical Entity Mention Recognition in Patents (CEMP), (ii) Chemical Passage Detection (CPD) and (iii) Gene and Protein Related Object task (GPRO). We participated in the two subtasks of CEMP and CPD using machine learning-based systems. Our machine learning-based systems employed the algorithms of conditional random fields (CRF) and structured support vector machines (SSVMs), respectively. To improve the performance of the NER systems, two strategies were proposed for feature engineering: (i) domain knowledge features of dictionaries, chemical structural patterns and semantic type information present in the context of the candidate chemical and (ii) unsupervised feature learning algorithms to generate word representation features by Brown clustering and a novel binarized word embedding to enhance the generalizability of the system. Further, the system output for the CPD task was generated from the patent titles and abstracts with chemicals recognized in the CEMP task. The effects of the proposed feature strategies on both the machine learning-based systems were investigated.
Our best system achieved the second best performance among 21 participating teams in CEMP with a precision of 87.18%, a recall of 90.78% and an F-measure of 88.94% and was the top performing system among nine participating teams in CPD with a sensitivity of 98.60%, a specificity of 87.21%, an accuracy of 94.75%, a Matthews correlation coefficient (MCC) of 88.24%, a precision at full recall (P_full_R) of 66.57% and an area under the precision-recall curve (AUC_PR) of 0.9347. The SSVM-based CEMP systems outperformed the CRF-based CEMP systems when using the same features. Features generated from both the domain knowledge and unsupervised learning algorithms significantly improved the chemical NER task on patents. Database URL: http://database.oxfordjournals.org/content/2016/baw049
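
As a quick sanity check on the CEMP figures above, the reported F-measure can be reproduced from the reported precision and recall, assuming it is the balanced F1 score (the harmonic mean of the two). The abstract does not state which F-measure variant was used, so that is an assumption, and the helper name `f_measure` is illustrative only.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the CEMP subtask, in percent.
cemp_f = f_measure(87.18, 90.78)
print(round(cemp_f, 2))  # 88.94
```

The result matches the reported 88.94%, which is consistent with the balanced F1 reading.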

Articles published on Information Extraction System

Using automatically extracted information from mammography reports for decision-support

Annotation Approach for Document with Recommendation

@MInter: automated text-mining of microbial interactions.

Extracting Databases from Dark Data with DeepDive.

MONGOOSE-Monitoring Global Online Opinions via Semantic Extraction

Learning to extract domain-specific relations from complex sentences

Automatic Discovery of Abnormal Values in Large Textual Databases

Declarative Cleaning of Inconsistencies in Information Extraction

PDF text classification to leverage information extraction from publication reports

A hybrid integrated architecture for energy consumption prediction

Extracting genetic alteration information for personalized cancer therapy from ClinicalTrials.gov

Name identification and extraction with formal concept analysis

Statistic Supported Cooperative Creation of Training Corpora for the Extraction of Traffic Information from Microblogs

Chemical named entity recognition in patents by domain knowledge and unsupervised feature learning.

Information extraction for personalised services based on conference alerts

Temporal Expressions in Polish Corpus KPWr

A Federated Network for Translational Cancer Research Using Clinical Data and Biospecimens.

Survey of Stages of Developing the Information Extraction Systems from the Web

Fine-grained information extraction from German transthoracic echocardiography reports.

An Approach for Ontology-Based Information Extraction System Selection and Evaluation
