Text Corpus Research Articles

This article examines the specific features of subdialectal discourse and its mental continuum through the prism of modern linguistic theories. The purpose of the research is to describe the functioning and structural features of subdialectal discourse, as well as the specifics of its linguarium. The object of the study is the subdialectal discourse, which covers speech practices recorded in the village of Pishchanyi Brid, Novoukrainskyi district, Kirovohrad region. The subject of the study is the mental continuum of the subdialect speakers, which includes knowledge, perceptions and assessments related to various aspects of their lives. The source material was speech recordings of native speakers, i.e. the textual corpus of the subdialect. The methods used were modelling, the hypothetical method, deductive and inductive methods, and the descriptive, observational, taxonomic and continuous sampling methods. What distinguishes the study of subdialectal speech within discourse studies is its analysis of the composition of usages, the specifics of the linguarium and its influence on the structure of discourse, the expression of thoughts, the interaction of speakers and the formation of the ethnographic identity of the peasants who speak the subdialect. The discourse approach allows us to better understand the speech process in a specific socio-cultural and geographical setting, as well as the interaction of speakers within the dialect environment. It involves analysing the discourse of the language group in question and identifying the lexical, grammatical and phonetic features that are characteristic of particular socio-cultural groups. The metalinguistic interpretation of discourse types in the linguistic paradigm presents dialectal discourse as a reflection of real language use, a set of idiolects and an indicator of changes in the language.
The household, ceremonial, religious and existential knowledge contained in the dialect discourse is critical to the well-being and cultural integrity of rural communities. Household knowledge encompasses practical skills in housekeeping, agriculture and crafts. Ceremonial and religious knowledge includes the traditions, rituals and beliefs that support the spiritual life and cultural identity of the community. Existential knowledge relates to the understanding of life, death and moral values, and helps community members navigate difficult life situations. Prospects for further research include a meta-analysis of dialectal discourse, which can be expressed in different ways in individual subdialects, the development of new methods for studying dialectal language in the digital environment, and the integration of interdisciplinary approaches for a deeper understanding of the speech practices and cultural characteristics of speakers.

We introduce our recently developed systems for imaging herbarium specimens and for automatically extracting data from specimen labels at the Herbarium of the Museum of Nature and Human Activities, Hyogo, Japan (HYO). First, we designed a low-cost but high-quality specimen imaging system that allows non-professional photographers to obtain images rapidly (Takano et al. 2019). The system uses a mass-produced mirrorless interchangeable-lens camera (SONY ILCE6300) with a 35 mm lens (Samyang Optics SYIO35AF-E 35 mm F/2.8). We built the photo stand ourselves to reduce costs, and adopted an LED (light-emitting diode) lighting system with high color rendering. This imaging system has been introduced, with some improvements or adjustments for the available space, to various herbaria in Japan (e.g., University of Tokyo (TI), Kyoto University (KYO)), contributing to the digitization of herbarium specimens across Japan. Next, we developed a system to extract label information from specimen images. Initially, the specimen image was uploaded to Google OCR and the data were extracted as text. Uploading the whole specimen image decreased the reading accuracy of the software, because the plant itself acted as noise for optical character recognition (OCR). Therefore, the label was cropped from the whole specimen image using D-Lib*1 and passed to tesseract OCR*2 to extract the label information (Aoki 2019, Takano et al. 2020). When installing this system for HYO, we designed it as an application accessible externally via the internet, which proved very useful during the coronavirus pandemic: part-time workers checked and entered label data from home. Finally, although the text could be extracted from specimen images, a human was still needed to enter it into the database, so we decided to develop a system that would automatically label the text extracted by OCR and enter it into the appropriate cells of the database.
Therefore, we adopted Named Entity Recognition (NER), a technique that identifies named entities, such as place names and other proper nouns, in unstructured text. It enables the information recorded on herbarium specimens to be tagged as named entities. We tried text matching at first, but the results were not satisfactory, so we turned to machine learning instead. We compared three natural language processing tools for Japanese: BERT (Bidirectional Encoder Representations from Transformers), ALBERT (A Lite BERT), and spaCy. Although BERT and spaCy returned similarly high F-scores (indicating good performance), we chose spaCy because it runs better on ordinary PCs and servers. After creating a text corpus (a specialised dataset) specific to herbarium specimen labels and training on it sufficiently, we successfully developed the application. The project files are available on GitHub*3 (Takano et al. 2024). We then examined whether this system could be applied to images of non-plant specimens, i.e., fishes or birds, and found that it could extract data efficiently. Therefore, we decided to make this system available on a cloud server and share it with other natural history museums in Japan*4. Curators can obtain a unique ID and password and upload specimen images from their collections to extract label data. The digitization of natural history collections in Japan has long lagged behind that of other countries, and this system will help to accelerate it. The system described above is specialized for the natural history collections of Japan, but we believe similar programs can be built in other countries, and we hope our experience will contribute to the mobilization of the world's natural history collections.
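
The F-score comparison mentioned in this abstract can be illustrated with a minimal sketch. It assumes entity-level scoring in which a predicted entity counts as correct only if both its text and its label match the reference annotation; the abstract does not state how the authors scored their models, and the labels (PLACE, DATE, COLLECTOR) and the sample values below are made up for illustration.

```python
from typing import Set, Tuple

# An entity is a (surface text, label) pair. The labels are hypothetical.
Entity = Tuple[str, str]


def f_score(gold: Set[Entity], predicted: Set[Entity]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for exact-match entity scoring."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up reference annotation and model output for a single label.
gold = {("Hyogo", "PLACE"), ("2019-05-01", "DATE"), ("A. Collector", "COLLECTOR")}
predicted = {("Hyogo", "PLACE"), ("2019-05-01", "DATE"), ("Kyoto", "PLACE")}

precision, recall, f1 = f_score(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here two of the three predicted entities match the reference, so precision, recall and F1 all come out at two thirds. Comparing models this way requires scoring them against the same annotated labels.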

Articles published on Text Corpus

Dystopian Visions and the Making of Privacy Law

Research on Intelligent Mining Algorithm of English Translation Text Big Data Based on Deep Learning

Konfiks Derivasi Per-/-an dalam Debat Capres-Cawapres Tahun 2024: Metode Linguistik Korpus

Research Topics and Trends in Gifted Education: A Structural Topic Model

The anthropomorphic pursuit of AI-generated journalistic texts: limits to expressing subjectivity

Using ChatGPT and determinologisation to enhance understanding of lung cancer information

Textual studies of the era of big data and neural networks

Claiming rock art, claiming indigeneity: Spaces and scales of recognition in Khoe and San identity claims

Structural reading: Developing the method of Structural Collocation Analysis using a case study on parliamentary reporting

SUBDIALECTAL DISCOURSE AND PECULIARITIES OF ITS MENTAL CONTINUUM IN THE LENS OF THE MODERN LINGUISTIC PARADIGM

A comprehensive review of existing corpora and methods for creating annotated corpora for event extraction tasks

Characterizing pituitary adenomas in clinical notes: Corpus construction and its application in LLMs.

Adoption of Life Cycle Thinking: Impact-driven comparative assessment of Japanese construction corporations’ trends in practices

Produktverpackungen als alltägliche Manifestationen des Umweltdiskurses

Harvesting Ancient Wisdom: Mathematical Modeling of Agricultural Technology in Sangam Literature and its Comparative Study with Modern Techniques

Do discurso à ação política: análise de acontecimentos extremistas que violam uma ética discursiva

Development of an Automated Label Data Entry System from Herbarium Specimen Images at Hyogo Herbarium (HYO)

It Is Time to Take Complaints Seriously? An Exploratory Analysis of Communications Sent by Users to a Public Healthcare Agency before, during and after the COVID-19 Pandemic

Ultimate Fantasy. Subwersyjne strategie queerowania Sztucznej Inteligencji w projektach transliterackich

Media Representation of Tutoring as a Phenomenon of Pedagogical Discourse
