Large Text Collections Research Articles

The article presents a corpus-based analysis of the concept FRANCE. Analyzing concepts through the lens of corpus linguistics makes it possible to determine how a particular reality is generally perceived. Given the current political context and the development of diplomatic relations, the concept FRANCE has become significant and warrants analysis. The material for the study is the General Regionally Annotated Corpus of Ukrainian (GRAC), a large representative collection of Ukrainian texts accompanied by software that supports building custom subcorpora, searching for words, grammatical forms and their combinations, and post-processing query results. Journalistic and literary texts dated from 1991 to 2022 were selected for the analysis. The lexeme “France”, representing the concept FRANCE, appears 189,178 times in GRAC between 1991 and 2022, with the majority of occurrences in journalistic texts. Other lexical representatives of the concept, such as “French” and “Paris”, were also analyzed. The article pays particular attention to the contexts in which the concept FRANCE is realized. Ten main thematic groups related to the concept were identified and analyzed: FRANCE – PRESTIGE; FRANCE – REFUGE; FRANCE – HISTORY; FRANCE – LAW; FRANCE – POLITICS; FRANCE – LANGUAGE; FRANCE – ECONOMY; FRANCE – SPORT; FRANCE – FOOD; FRANCE – STYLE. The key adjectives and verbs that verbalize the concept FRANCE in the corpus were identified. These words often evoke images of well-known politicians and the names of European countries. The most important collocates were also determined. Thirty collocates of the lexeme “France” were identified, including: Germany, Macron (Emmanuel), Francois (Hollande), President, Italy, Britain, Ministry of Foreign Affairs, Spain, Merkel, Sarkozy, Championship, Leaders, Paris, Ambassador, Team, Elections, PSG, Finance, Embassy, Canada, Government, Lady, Great, Match, Ukraine, Protests, Authority, Visit. These collocates predominantly align with the themes of politics, international relations and sports. The extensive use of the concept FRANCE in the Ukrainian corpus indicates a strengthening of political relations between Ukraine and France.
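
The collocate step described above can be illustrated with a minimal Python sketch. It does not use GRAC or its query interface: the toy sentence, the function name `collocates`, and the window size are all assumptions made for illustration, and it ranks by raw co-occurrence counts rather than by whatever association measure the study applied.

```python
from collections import Counter


def collocates(tokens, node, window=3):
    """Count tokens occurring within `window` positions of each `node` occurrence."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the node word itself
                counts[tokens[j]] += 1
    return counts


# Hypothetical toy text, already lowercased and tokenized on whitespace.
toy_text = (
    "the president of france met the chancellor of germany "
    "while france and germany discussed elections"
).split()

counts = collocates(toy_text, "france", window=2)
print(counts.most_common(3))
```

On a real corpus the tokens would first be lemmatized, so that all inflected forms of a word are counted together.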

Pretrained language models (PLMs) have demonstrated strong performance on many natural language processing (NLP) tasks. Despite this success, PLMs are typically pretrained only on unstructured free text and do not leverage the structured knowledge bases that are readily available for many domains, especially scientific ones. As a result, they may not achieve satisfactory performance on knowledge-intensive tasks such as biomedical NLP. Comprehending a complex biomedical document without domain-specific knowledge is challenging, even for humans. Inspired by this observation, we propose a general framework for incorporating various types of domain knowledge from multiple sources into biomedical PLMs. We encode domain knowledge using lightweight adapter modules, bottleneck feed-forward networks that are inserted at different locations of a backbone PLM. For each knowledge source of interest, we pretrain an adapter module to capture the knowledge in a self-supervised way. We design a wide range of self-supervised objectives to accommodate diverse types of knowledge, ranging from entity relations to description sentences. Once a set of pretrained adapters is available, we employ fusion layers to combine the knowledge encoded within these adapters for downstream tasks. Each fusion layer is a parameterized mixer of the available trained adapters that can identify and activate the most useful adapters for a given input. Our method diverges from prior work by including a knowledge consolidation phase, during which we use a large collection of unannotated texts to teach the fusion layers to combine the knowledge of the original PLM with the newly acquired external knowledge. After the consolidation phase, the complete knowledge-enhanced model can be fine-tuned for any downstream task of interest. Extensive experiments on many biomedical NLP datasets show that the proposed framework consistently improves the performance of the underlying PLMs on downstream tasks such as natural language inference, question answering, and entity linking. These results demonstrate the benefits of using multiple sources of external knowledge to enhance PLMs and the effectiveness of the framework for incorporating that knowledge. Although this work focuses primarily on the biomedical domain, the framework is highly adaptable and can be applied to other domains, such as the bioenergy sector.
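
The adapter and fusion components described above can be sketched in plain Python. The abstract does not give the exact formulation, so this is only an illustration under stated assumptions: the class names `Adapter` and `Fusion`, the ReLU bottleneck with a residual connection, the dot-product scoring with softmax weights, and all dimensions are hypothetical choices, and the weights are random rather than pretrained.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def make_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix standing in for trained weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class Adapter:
    """Bottleneck feed-forward block: project down, ReLU, project up, add residual."""

    def __init__(self, hidden, bottleneck, rng):
        self.down = make_matrix(bottleneck, hidden, rng)
        self.up = make_matrix(hidden, bottleneck, rng)

    def __call__(self, h):
        z = [max(0.0, a) for a in matvec(self.down, h)]
        delta = matvec(self.up, z)
        return [a + b for a, b in zip(h, delta)]


class Fusion:
    """Input-dependent mixer: softmax weights over the adapter outputs."""

    def __init__(self, hidden, rng):
        self.query = make_matrix(hidden, hidden, rng)

    def __call__(self, h, adapter_outputs):
        q = matvec(self.query, h)
        scores = [sum(qi * oi for qi, oi in zip(q, out)) for out in adapter_outputs]
        weights = softmax(scores)
        mixed = [
            sum(w * out[k] for w, out in zip(weights, adapter_outputs))
            for k in range(len(h))
        ]
        return mixed, weights


rng = random.Random(0)
hidden = 8
adapters = [Adapter(hidden, 2, rng) for _ in range(3)]  # one per knowledge source
fusion = Fusion(hidden, rng)
h = [rng.uniform(-1, 1) for _ in range(hidden)]  # stand-in for a PLM hidden state
mixed, weights = fusion(h, [adapter(h) for adapter in adapters])
print(weights)
```

Because the weights depend on the hidden state `h`, different inputs can favor different adapters, which is the behavior the abstract attributes to the fusion layers.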

Articles published on Large Text Collections

"My Very Subjective Human Interpretation": Domain Expert Perspectives on Navigating the Text Analysis Loop for Topic Models

Computing MEMs and Relatives on Repetitive Text Collections

Exploring Qualitative Geographies in Large Volumes of Digital Text: Placing Tourists, Travelers, and Inhabitants in the English Lake District

GRAAL: Graph-Based Retrieval for Collecting Related Passages across Multiple Documents

An STS analysis of a digital humanities collaboration: trading zones, boundary objects, and interactional expertise in the DECRYPT project

Computational linguistics at the crossroads: A comprehensive review of NLP advancements

Corpus linguistics and the social sciences

Topic models with elements of neural networks: investigation of stability, coherence, and determining the optimal number of topics.

CIDER: Context-sensitive polarity measurement for short-form text.

Corpus-Based Analysis of the Concept FRANCE

Extracting information on virus-human interactions and on antiviral compounds based on automated analysis of large text collections

Applied corpus linguistics and legal interpretation: A rapidly developing field of interdisciplinary scholarship

A Vector Space Approach for Measuring Relationality and Multidimensionality of Meaning in Large Text Collections

Corpus Analysis with spaCy

PEDL+: protein-centered relation extraction from PubMed at your fingertip.

Automatic Genre Identification for Robust Enrichment of Massive Text Collections: Investigation of Classification Methods in the Era of Large Language Models

Topic modeling methods for short texts: A survey

Context-aware Transliteration of Romanized South Asian Languages

KEBLM: Knowledge-Enhanced Biomedical Language Models

Networks of Migrants’ Narratives: A Post-authentic Approach to Heritage Visualisation
