Corpus-based Models Research Articles

The existence of annotated text corpora is essential for the development of public health services and tools based on natural language processing (NLP) and text mining. Recently organized biomedical NLP shared tasks have provided annotated corpora related to different biomedical entities such as genes, phenotypes, drugs, diseases and chemical entities. These are needed to develop named-entity recognition (NER) models that are used for extracting entities from text and finding their relations. However, to the best of our knowledge, there are limited annotated corpora that provide information about food entities despite food and dietary management being an essential public health issue. Hence, we developed a new annotated corpus of food entities, named FoodBase. It was constructed using recipes extracted from Allrecipes, which is currently the largest food-focused social network. The recipes were selected from five categories: ‘Appetizers and Snacks’, ‘Breakfast and Lunch’, ‘Dessert’, ‘Dinner’ and ‘Drinks’. Semantic tags used for annotating food entities were selected from the Hansard corpus. To extract and annotate food entities, we applied a rule-based food NER method called FoodIE. Since FoodIE provides a weakly annotated corpus, by manually evaluating the obtained results on 1000 recipes, we created a gold standard of FoodBase. It consists of 12 844 food entity annotations describing 2105 unique food entities. Additionally, we provided a weakly annotated corpus on an additional 21 790 recipes. It consists of 274 053 food entity annotations, 13 079 of which are unique. The FoodBase corpus is necessary for developing corpus-based NER models for food science, as a new benchmark dataset for machine learning tasks such as multi-class classification, multi-label classification and hierarchical multi-label classification. FoodBase can be used for detecting semantic differences/similarities between food concepts, and after all we believe that it will open a new path for learning food embedding space that can be used in predictive studies.

Read full abstract

This paper introduces a novel family of ontology-based similarity measures based on the Information Content (IC) theory, a detailed state of the art, a large experimental survey into ontology-based similarity measures on WordNet, and a new comparison between intrinsic and corpus-based IC models. Our experiments are based on our implementation of a large set of similarity measures, intrinsic and corpus-based IC models, which are evaluated on two known datasets and three different WordNet versions. The new measures are called weighted Jiang–Conrath distance (wJ&Cdist) and similarity (wJ&Csim), cosine-normalized Jiang–Conrath similarity (cosJ&Csim) and cosine-normalized weighted Jiang–Conrath similarity (coswJ&Csim). Two of our similarity measures outperform the state-of-the-art measures on the RG65 dataset, and one of them obtains the third overall score on all the datasets and evaluated WordNet versions. The cosine-normalized similarity measures are a non-linear normalization of the classic Jiang–Conrath (J&C) distance and the new wJ&C distance. On the other hand, the wJ&C distance is a generalization of the classic J&C distance which is based on the length of the shortest path between concepts within an IC-based weighted graph. Our measures are based on two not previously considered notions: (1) a generalization of the classic J&C distance to any type of taxonomy, based on an IC-based weighted graph derived from the conditional probabilities between child and parent concepts, and (2) a non-linear normalization function that converts the ontology-based semantic distances into similarity functions. Finally, the corpus-based IC models based on the Resnik method obtain rivaling results as regards the state-of-the-art intrinsic IC models, when they are used with some unexplored WordNet-based frequency files. Therefore, this latter fact allows us to reconsider some previous conclusions about the outperformance of the intrinsic IC models over the corpus-based ones.

Read full abstract

Corpus-based Models Research Articles

Articles published on Corpus-based Models

A case study on decompounding in Indian language IR

Animacy effects in the English genitive alternation: comparing native speakers and EFL learner judgments with corpus data

Determining the importance of frequency and contextual diversity in the lexical organization of multiword expressions.

A corpus-based versus experimental examination of word- and character-frequency effects in Chinese reading: Theoretical implications for models of reading.

How Document Sampling and Vocabulary Pruning Affect the Results of Topic Models

Corpus-Based Paraphrase Detection Experiments and Review

FoodBase corpus: a new resource of annotated food entities.

A Lexical Resource-Constrained Topic Model for Word Relatedness

CLASENTI

Embodied sound design

The language of smell: Connecting linguistic and psychophysical properties of odor descriptors

Pitting corpus-based classification models against each other: a case study for predicting constructional choice in written Estonian

Corpus-Based Semantic Models of the Noun Phrases Containing Words with ‘Person’ Marker

Quantitative computational syntax: some initial results

Machine Meets Man: Evaluating the Psychological Reality of Corpus-based Probabilistic Models

A novel family of IC-based similarity measures with a detailed experimental survey on WordNet

A new family of information content models with an experimental survey on WordNet

Integrating words that refer to typical sequences of events.

Murder in the Arboretum: Comparing Character Models to Personality Models

Las tecnologías del habla en las lenguas románicas ibéricas

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Corpus-based Models Research Articles

Articles published on Corpus-based Models

A case study on decompounding in Indian language IR

Animacy effects in the English genitive alternation: comparing native speakers and EFL learner judgments with corpus data

Determining the importance of frequency and contextual diversity in the lexical organization of multiword expressions.

A corpus-based versus experimental examination of word- and character-frequency effects in Chinese reading: Theoretical implications for models of reading.

How Document Sampling and Vocabulary Pruning Affect the Results of Topic Models

Corpus-Based Paraphrase Detection Experiments and Review

FoodBase corpus: a new resource of annotated food entities.

A Lexical Resource-Constrained Topic Model for Word Relatedness

CLASENTI

Embodied sound design

The language of smell: Connecting linguistic and psychophysical properties of odor descriptors

Pitting corpus-based classification models against each other: a case study for predicting constructional choice in written Estonian

Corpus-Based Semantic Models of the Noun Phrases Containing Words with ‘Person’ Marker

Quantitative computational syntax: some initial results

Machine Meets Man: Evaluating the Psychological Reality of Corpus-based Probabilistic Models

A novel family of IC-based similarity measures with a detailed experimental survey on WordNet

A new family of information content models with an experimental survey on WordNet

Integrating words that refer to typical sequences of events.

Murder in the Arboretum: Comparing Character Models to Personality Models

Las tecnologías del habla en las lenguas románicas ibéricas