Textual Paragraphs Research Articles

Geocoding aims to assign unambiguous locations (i.e., geographic coordinates) to place names (i.e., toponyms) referenced within documents (e.g., within spreadsheet tables or textual paragraphs). This task comes with multiple challenges, such as dealing with referent ambiguity (multiple places with the same name) or reference database completeness. In this work, we propose a geocoding approach based on modeling pairs of toponyms, which returns latitude-longitude coordinates. One of the input toponyms is geocoded, and the second is used as context to reduce ambiguity. The proposed approach is based on a deep neural network that uses Long Short-Term Memory (LSTM) units to produce representations from sequences of character n-grams. To train our model, we use toponym co-occurrences collected from different contexts, namely textual (i.e., co-occurrences of toponyms in Wikipedia articles) and geographical (i.e., inclusion and proximity of places based on Geonames data). We conducted experiments on multiple geographical areas of interest (France, the United States, Great Britain, Nigeria, Argentina and Japan). Results show that models trained with co-occurrence data obtained a higher geocoding accuracy, and that proximity relations in combination with co-occurrences can help to obtain a slightly higher accuracy in geographical areas with fewer places in the data sources.
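As a rough illustration of the input representation described in this abstract, the sketch below splits a pair of toponyms into sequences of overlapping character n-grams. It is not the authors' implementation: the function names `char_ngrams` and `encode_pair`, the lowercasing step, and the n-gram size of 4 are assumptions made for the example, since the abstract does not state them.

```python
from typing import List, Tuple


def char_ngrams(toponym: str, n: int = 4) -> List[str]:
    """Split a place name into overlapping character n-grams.

    The default n=4 is an assumed value for illustration only.
    Names no longer than n characters are returned as a single unit.
    """
    text = toponym.lower()
    if len(text) <= n:
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def encode_pair(target: str, context: str, n: int = 4) -> Tuple[List[str], List[str]]:
    """Build the two n-gram sequences for a (toponym to geocode, context toponym) pair."""
    return char_ngrams(target, n), char_ngrams(context, n)


# Hypothetical pair: "Paris" is the toponym to geocode, "Texas" is the context.
target_grams, context_grams = encode_pair("Paris", "Texas")
print(target_grams)
print(context_grams)
```

In a model of the kind the abstract describes, each n-gram sequence would then be mapped to embeddings and passed through LSTM units; that part is omitted here because the abstract gives no architectural details.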


The emergence of "big data" initiatives has led to the need for tools that can automatically extract valuable chemical information from large volumes of unstructured data, such as the scientific literature. Since chemical information can be present in figures, tables, and textual paragraphs, successful information extraction often depends on the ability to interpret all of these domains simultaneously. We present a complete toolkit for the automated extraction of chemical entities and their associated properties, measurements, and relationships from scientific documents that can be used to populate structured chemical databases. Our system provides an extensible, chemistry-aware, natural language processing pipeline for tokenization, part-of-speech tagging, named entity recognition, and phrase parsing. Within this scope, we report improved performance for chemical named entity recognition through the use of unsupervised word clustering based on a massive corpus of chemistry articles. For phrase parsing and information extraction, we present the novel use of multiple rule-based grammars that are tailored for interpreting specific document domains such as textual paragraphs, captions, and tables. We also describe document-level processing to resolve data interdependencies and show that this is particularly necessary for the autogeneration of chemical databases, since captions and tables commonly contain chemical identifiers and references that are defined elsewhere in the text. The toolkit's ability to correctly extract various types of data was evaluated, yielding F-scores of 93.4%, 86.8%, and 91.5% for extracting chemical identifiers, spectroscopic attributes, and chemical property attributes, respectively; set against the CHEMDNER chemical name extraction challenge, ChemDataExtractor yields a competitive F-score of 87.8%. All tools have been released under the MIT license and are available to download from http://www.chemdataextractor.org.




Articles published on Textual Paragraphs

Modelling the Archipelago: Corfu as a Case Study for a Digital Edition of Cristoforo Buondelmonti's Liber Insularum.

Deep Learning for Toponym Resolution: Geocoding Based on Pairs of Toponyms

Exploring Racism, and Anxieties of Identity in Aslam's Selected Work

Neo-Colonialist critique of Hamid's The Reluctant Fundamentalist and Kincaid's A Small Place: A Comparative Postcolonial Study

Oppression and Female Body: A Feminist Critique of the Novel 'Half the Sky'

Random Multiple Choice Questions Generation using Nlp

Bootstrapping Knowledge Graphs From Images and Text.

Deklinowanie wiosny albo zgrywa z Schulza

ChemDataExtractor: A Toolkit for Automated Extraction of Chemical Information from the Scientific Literature.

OntoVerbal: a Generic Tool and Practical Application to SNOMED CT

A practical prototypic system for psychiatric diagnosis: The ICD-11 Clinical Descriptions and Diagnostic Guidelines

Análise bioética do Código de Ética Odontológica brasileiro

