Large-scale Corpus Research Articles

The rapid proliferation of artificial intelligence has led to the development of sophisticated systems in natural language processing and computational linguistics. These systems rely heavily on high-quality datasets and corpora to train deep-learning algorithms into precise models. Preparing a high-quality, gold-standard corpus for natural language processing at large scale is challenging because it requires huge computational resources, accurate language identification models, and precise content parsing tools. The task is even harder for regional languages because web content is scarce. In this article, we propose Corpulyzer (Corpus Analyzer), a novel generic framework for building low-resource language corpora. Our framework consists of a corpus generation module and a corpus analyzer module. We demonstrate the efficacy of our framework by creating a high-quality, large-scale corpus for the Urdu language as a case study. Leveraging data from the Common Crawl Corpus (CCC), we first prepare a list of seed URLs by filtering for Urdu-language webpages. Next, we use Corpulyzer to crawl the World Wide Web (WWW) over a period of four years (2016–2020). We build the Urdu web corpus “UrduWeb20”, which consists of 8.0 million Urdu webpages crawled from 6,590 websites. In addition, we propose a Low-Resource Language (LRL) website scoring algorithm and a content-size filter for language-focused crawling to achieve optimal use of computational resources. Moreover, we analyze UrduWeb20 using a variety of traditional metrics such as web traffic rank, URL depth, duplicate documents, and vocabulary distribution, along with our newly defined content-richness metrics. Furthermore, we compare different characteristics of our corpus with three datasets of CCC. In general, we observe that, contrary to CCC, which focuses on crawling a limited number of webpages from highly ranked Urdu websites, Corpulyzer performs in-depth crawling of content-rich Urdu websites. Finally, we have made the Corpulyzer framework available to the research community for corpus building.
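
Two of the traditional metrics named in this abstract, URL depth and duplicate documents, can be illustrated with a short Python sketch. This is not the Corpulyzer implementation: the function names are hypothetical, URL depth is assumed to mean the number of non-empty path segments, and duplicates are assumed to be pages whose whitespace-normalized text is identical.

```python
from __future__ import annotations

import hashlib
from urllib.parse import urlparse


def url_depth(url: str) -> int:
    """Count non-empty path segments; the site root has depth 0 (assumed definition)."""
    path = urlparse(url).path
    return sum(1 for segment in path.split("/") if segment)


def content_fingerprint(text: str) -> str:
    """Hash whitespace-normalized text so trivially reformatted copies collide."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_exact_duplicates(pages: dict[str, str]) -> list[tuple[str, str]]:
    """Return (duplicate_url, first_seen_url) pairs for pages with identical text."""
    first_seen: dict[str, str] = {}
    duplicates: list[tuple[str, str]] = []
    for url, text in pages.items():
        fingerprint = content_fingerprint(text)
        if fingerprint in first_seen:
            duplicates.append((url, first_seen[fingerprint]))
        else:
            first_seen[fingerprint] = url
    return duplicates
```

Exact hashing only catches identical text after whitespace normalization; near-duplicate detection would need a different technique, and the abstract does not say which one the authors use.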

Few-shot learning under the $N$-way $K$-shot setting (i.e., $K$ annotated samples for each of $N$ classes) has been widely studied in relation extraction (e.g., FewRel) and image classification (e.g., Mini-ImageNet). Named entity recognition (NER) is typically framed as a sequence labeling problem in which the entity classes are inherently entangled, because the number and classes of entities in a sentence are not known in advance. As a result, the $N$-way $K$-shot NER problem has so far remained unexplored. In this paper, we first formally define a more suitable $N$-way $K$-shot setting for NER. Then we propose FewNER, a novel meta-learning approach for few-shot NER. FewNER separates the entire network into a task-independent part and a task-specific part. During training, the task-independent part is meta-learned across multiple tasks and the task-specific part is learned for each individual task in a low-dimensional space. At test time, FewNER keeps the task-independent part fixed and adapts to a new task via gradient descent by updating only the task-specific part, which makes it less prone to overfitting and more computationally efficient. Compared with pre-trained language models (e.g., BERT and ELMo), which obtain transferability implicitly (i.e., by relying on large-scale corpora), FewNER explicitly optimizes the capability of “learning to adapt quickly” through meta-learning. The results demonstrate that FewNER achieves state-of-the-art performance against nine baseline methods by significant margins in three adaptation experiments (i.e., intra-domain cross-type, cross-domain intra-type, and cross-domain cross-type).
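
The classification-style $N$-way $K$-shot setting described in the first sentence of this abstract can be sketched in Python as below. This is only an illustration of that standard setting, not the NER-specific definition the paper proposes and not part of FewNER; the function name `sample_episode` and its parameters are hypothetical.

```python
import random
from collections import defaultdict


def sample_episode(labeled_examples, n_way, k_shot, seed=None):
    """Sample a support set with exactly k_shot examples for each of n_way classes.

    labeled_examples: iterable of (example, label) pairs.
    Returns a dict mapping each chosen label to a list of k_shot examples.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for example, label in labeled_examples:
        by_label[label].append(example)

    # Only classes with at least k_shot examples can take part in an episode.
    eligible = sorted(
        label for label, items in by_label.items() if len(items) >= k_shot
    )
    if len(eligible) < n_way:
        raise ValueError("not enough classes with at least k_shot examples")

    chosen = rng.sample(eligible, n_way)
    return {label: rng.sample(by_label[label], k_shot) for label in chosen}
```

Sampling per class works when each example carries a single label. A sentence in NER can contain several entities of different classes, which is the entanglement the abstract points to and the reason this simple scheme does not transfer directly.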

Articles published on Large-scale Corpus

Improved Distant Supervision Relation Extraction Based on Edge-Reasoning Hybrid Graph Model

Wav2KWS: Transfer Learning From Speech Representations for Keyword Spotting

GAN-GLS: Generative Lyric Steganography Based on Generative Adversarial Networks

Corpulyzer: A Novel Framework for Building Low Resource Language Corpora

Summary of Research Methods on Pre-Training Models of Natural Language Processing

Research on Named Entity Recognition Technology for Chinese Titles

Building Chinese Request Pattern Graphs for Chinese-Korean Translation

Acoustic properties of word and phrasal prominence in Uzbek

Ethnic and gender variation in the use of Colloquial Singapore English discourse particles

Selection of In-Domain Bilingual Sentence Pairs Based on Topic Information

The linguistic and cultural community of a Slavic village: research project assumptions

Hierarchical state recurrent neural network for social emotion ranking

The position of the genitive in Old English prose: Intertextual differences and the role of Latin

A hybrid classical-quantum workflow for natural language processing

A BERT Fine-tuning Model for Targeted Sentiment Analysis of Chinese Online Course Reviews

Few-Shot Named Entity Recognition via Meta-Learning

Postnominal relative clauses in Chinese

Cartolabe: A Web-Based Scalable Visualization of Large Document Collections

Domain-specific meta-embedding with latent semantic structures

Formant Frequencies of Adult Speakers of Australian English and Effects of Sex, Age, Geographical Location, and Vowel Quality
