EXTRACTION OF ANGLICISMS FROM A CORPUS OF MACEDONIAN MAGAZINE TEXTS

Abstract

The present article describes the stages involved in compiling a specialized corpus of Macedonian magazine texts and the software tools used to extract anglicisms from it. The texts were collected from the magazine Kapital and cover two distinct periods: the years 2000 and 2020. The corpus comprises about 2 million tokens and 141,852 types. The software produced word lists which, in combination with other statistical techniques, yielded a refined anglicism headword list from which new anglicisms were extracted. In addition to the software tools, careful manual inspection was necessary in both the extraction and analysis stages. The research identified a total of 220 completely new anglicisms, most of which are not yet included in existing Macedonian dictionaries.
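The abstract does not name the software or the statistical techniques used, so the following is only a minimal sketch of the general idea: build a word list per period, report corpus size in tokens and types, and list words that appear in the later period but not in the earlier one as candidates for manual inspection. The function names, the `min_freq` threshold, and the sample sentences are all hypothetical and are not taken from the article.

```python
import re
from collections import Counter

# Runs of letters in any script (Latin or Cyrillic); digits, underscores
# and punctuation are not part of a token.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Split text into lowercased word tokens."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def word_list(texts):
    """Build a frequency word list (word -> count) from an iterable of texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def corpus_size(counts):
    """Return (tokens, types): total running words and distinct word forms."""
    return sum(counts.values()), len(counts)


def new_in_later_period(earlier, later, min_freq=1):
    """Words attested in the later word list but absent from the earlier one.

    This is only a candidate list: it still needs manual checking to decide
    which items are actually anglicisms.
    """
    return sorted(w for w, n in later.items() if n >= min_freq and w not in earlier)


# Invented sample sentences standing in for the two periods (assumption).
texts_2000 = ["Банката одобри нов кредит.", "Пазарот бележи раст."]
texts_2020 = [
    "Стартап компанијата бара инвеститор.",
    "Пазарот бележи раст, а стартап сцената расте.",
]

list_2000 = word_list(texts_2000)
list_2020 = word_list(texts_2020)

print(corpus_size(list_2000))
print(corpus_size(list_2020))
print(new_in_later_period(list_2000, list_2020, min_freq=2))
```

A frequency threshold such as `min_freq` is one simple way to shorten the candidate list before manual inspection; the article's own refinement steps may differ.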

Similar Papers
  • Research Article
  • Citations: 3
  • 10.15622/ia.21.6.4
Method and Models of Extraction of Knowledge from Medical Documents
  • Nov 24, 2022
  • Информатика и автоматизация
  • Rustem Zulkarneev + 4 more

The paper analyzes the problem of extracting knowledge from clinical recommendations presented in the form of semi-structured corpora of text documents in natural language, taking into account their periodic updating. The considered methods of intellectual analysis of the accumulated arrays of medical data make it possible to automate a number of tasks aimed at improving the quality of medical care due to significant decision support in the treatment process. A brief review of well-known publications has been made, highlighting approaches to automating the construction of ontologies and knowledge graphs in the problems of semantic modeling of a problem-oriented text corpus. The structural and functional organization of the system of knowledge extraction and automatic construction of an ontology and a knowledge graph of a problem-oriented corpus for a specific subject area is presented. The main stages of knowledge extraction and dynamic updating of the knowledge graph are considered: named entity extraction, semantic annotation, term and keyword extraction, topic modeling, topic identification, and relationship extraction. The formalized representation of texts was obtained using a pre-trained BERT transformer model. The automatic selection of triplets "object" - "action" - "subject" based on part-of-speech markup of the text corpus was used to construct fragments of the knowledge graph. An experiment was carried out on a corpus of medical texts on a given topic (162 documents of depersonalized case histories of patients of a pediatric center) without preliminary markup in order to test the proposed solution for extracting triplets and constructing a knowledge graph based on them. An analysis of the experimental results confirms the need for a deeper markup of the corpus of text documents to take into account the specifics of medical text documents. 
For an unmarked corpus of texts, the proposed solution demonstrates satisfactory performance in view of the selection of atomic fragments included in the automatically generated ontology.

  • Research Article
  • Citations: 1
  • 10.3406/scoli.2007.1106
L’anaphore lexicale démonstrative dans les langues endo-et exocentriques : langue, texte, discours
  • Jan 1, 2007
  • Scolia
  • Lita Lundquist

The present article studies the use of demonstrative NP anaphors, demN, in a corpus of French and Norwegian texts belonging to the domain of scientific economic discourse. The introduction of the concept of discourse makes it possible to enlarge the traditional lexical and syntactic perspective in order to evaluate to what degree differences between the uses of demN in the two corpora can be explained by linguistic and/or discursive factors. Among the linguistic factors is the tendency in French, and Romance languages in general, to lexicalise nouns, and hence anaphors, at a more specific level than Norwegian, i.e., Germanic languages, and also the preference for hierarchical, hypotactic presentation of information in desententialised constructions in French as compared to linear, paratactic ordering in Norwegian, which has an impact on the number of realised syntactic subjects.

  • Conference Article
  • Citations: 4
  • 10.1109/icsda.2018.8693013
Development of Text and Speech Corpus for Designing the Multilingual Recognition System
  • May 1, 2018
  • Shweta Bansal + 1 more

Creating a multilingual speech and text corpus manually is a very difficult and time-consuming task. This paper presents the overall methodology and experiences of text and speech data collection for three under-resourced languages, i.e., Hindi, Manipuri and Urdu. The text data were collected through web crawling in three domains, i.e., general, news and travel, to capture the versatility of the database among these languages. The main objective of this project is to collect a text and speech database which can be used for training multilingual spoken language identification systems. In total, we collected a text corpus of three million words and an audio corpus of 150 speakers (50 native speakers) of each language. Each speaker recorded 300 phonetically rich sentences created through text analysis. The speech utterances were recorded at a rate of 16 kHz through a microphone using the GOLDWAVE software tool in a sound-treated room. The collected speech data sets were annotated manually at the phonemic level for each language and made available for the development of a multilingual recognition system.

  • Research Article
  • 10.36809/2309-9380-2024-42-113-118
Ценностные установки в дискурсе поколения Y (на материале стендап-выступлений)
  • Jan 1, 2024
  • Review of Omsk State Pedagogical University. Humanitarian research
  • T.P. Rogozhnikova + 1 more

The present article is carried out in the context of the modern communicative-discursive paradigm and is devoted to the identification and description of the value and attitudinal characteristics of the representatives of generation Y in relation to the Russian reality. The object of the study is the corpus of texts in the genre of stand-up performances of famous Russian performers of the age category from 26 to 33 years old, the subject is the axiosphere of the authors of the performance, as well as the system of factors influencing the ways of expressing value attitudes (socio-cultural background, the theme of concerts, the phenomenon of “black humour”).

  • Book Chapter
  • Citations: 11
  • 10.1163/9789401204347_020
The retrieval of false anglicisms in newspaper texts
  • Jan 1, 2007
  • Cristiano Furiassi + 1 more

The present article is the description of a project aimed at building a specialized corpus of Italian newspaper texts and at developing a computational technique to retrieve new false anglicisms from it. Texts were collected along a ten-month span from three Italian newspapers: La Stampa, La Repubblica, and Il Corriere della Sera. The size of the corpus is about 20 million tokens and approximately 230,000 types. The system was automatically updated on a daily basis and a list of words was obtained at the end of the collection period. This procedure originated a refined word list in which false anglicisms were searched. Along with computational techniques, careful manual scanning proved to be indispensable to extract new false anglicisms. The corpus is available for future work and may be exploited not only to find false anglicisms but also to retrieve anglicisms, neologisms, and to analyse lexical features of Italian newspaper language.

  • Book Chapter
  • Citations: 2
  • 10.1007/978-3-662-45701-6_7
Analysis of the Chinese – Portuguese Machine Translation of Chinese Localizers Qian and Hou
  • Jan 1, 2014
  • Chunhui Lu + 3 more

The focus of the present article is the two Chinese localizers qian (front) and hou (back), in their function of time, in the process of Chinese-Portuguese machine translation; the work is integrated in the project Autema SynTree (Annotation and Analysis of Bilingual Syntactic Trees for Chinese/Portuguese). The text corpus used in the research is composed of 46 Chinese texts, extracted from The International Chinese Newsweekly, identified as source text (ST), and the target texts (TT) are composed of translations into Portuguese produced by the Portuguese-Chinese Translator (PCT) and by humans. In Portuguese, the prepositions of the transversal axis, such as antes de and depois de, are used to indicate the time before and after, corresponding to qian and hou in Chinese. Nevertheless, inconsistencies related to the translation of the localizers are found in the output of the PCT when comparing it with the human translation (HT). On this basis, the present article shows the syntax rules developed to solve the inconsistencies found in the PCT output. The translations and the proposed rules were evaluated through the application of BLEU metrics.

Keywords: Machine translation, Chinese, Portuguese, Chinese localizer, BLEU

  • Research Article
  • Citations: 1
  • 10.1093/llc/fqab107
Tracking causal relations in the news: data, tools, and models for the analysis of argumentative statements in online media
  • Jan 17, 2022
  • Digital Scholarship in the Humanities
  • Tom Willaert + 3 more

Online debates and debate spheres challenge our assumptions about democracy, politics, journalism, trust, and truth in ways that make them a necessary object of study. In the present article, we argue that the study of online arguments can benefit from an interdisciplinary approach that combines computational methods for text analysis with conceptual models of opinion dynamics. The article thereby seeks to make a conceptual and methodological contribution to the field by highlighting the role of domain-crossing causal statements in debates of societal interest, and by providing a method for automatically mining such statements from textual corpora on the web. The article illustrates the relevance of this approach for the study of online debates by means of a case study in which we analyse cross-cutting statements on climate change and energy technologies from the comment section of the online newspaper The Guardian. In support of this case study, we use data and methods that are made openly available through the Penelope ecosystem of tools and techniques for computational social science.

  • Research Article
  • 10.1524/slaw.2009.0021
Zur Kombination von pronominaler Nah- und nominaler Distanzanrede – gezeigt am Beispiel verschiedener slawischer Sprachen
  • Aug 1, 2009
  • Zeitschrift für Slawistik
  • Claudia Radünzel

The present article is a contribution to the analysis of systems of address in the Slavonic languages. It deals with the special problem of a possible combination of distant nominal and non-distant pronominal address forms (i.e. constructions like German “Frau Müller, kannst du mal herkommen?”). The first part of the paper contains an overview of such combinations which have been described in the relevant specialist literature. The second part presents a detailed analysis of Croatian, Bosnian, and Czech examples taken from text corpora available on the internet in which nominal address forms containing the vocatives “gospodine” or “pane” are used together with pronouns or verb forms of the second person singular or plural. In keeping with the aim of the present study, special attention is paid to the “irregular” combinations with the non-distant second person singular.

  • Research Article
  • Citations: 27
  • 10.3758/bf03195536
TPL—KATS-card sort: A tool for assessing structural knowledge
  • Nov 1, 2003
  • Behavior Research Methods, Instruments, & Computers
  • Michelle E Harper + 5 more

The study of how individuals organize knowledge has been a popular endeavor for several decades. As a result, techniques have been developed to assess how individuals represent and organize knowledge internally. Although several conceptual knowledge elicitation methods have been developed and used to assess the organization of knowledge, their use is often labor intensive and time consuming. Presented here is a software tool that was developed to reduce the problems associated with manually administering the conceptual knowledge elicitation technique, or card sorting. The TPL-KATS-card sort software not only simplifies the administration of the task, but also adds features to the card-sorting task such as media insertion, time stamping, and instructorless administration. In the present article, an introduction to the card-sorting technique is provided, the new software tool is described, and the advantages of the software are detailed.

  • Research Article
  • Citations: 1
  • 10.37892/2313-5816-2024-1-16-43
Языковая ситуация у татышлинских удмуртов: наблюдения экспедиций ОТиПЛа МГУ
  • Jun 1, 2024
  • Rodnoy Yazyk. Linguistic journal
  • Egor Kashkin + 1 more

The present article deals with the sociolinguistics of a subdialect of the Udmurt language spoken primarily in the Tatyshly district of Bashkortostan (Russia). The data was collected via questionnaires filled out by native speakers during fieldwork by the linguistics department of Moscow State University in 2019-2023. Data from a corpus of oral texts that was recorded and transcribed during the fieldwork is included as well. After briefly outlining the population dynamics in the Tatyshly district and our sample of native speakers, we discuss the domains in which Tatyshly Udmurt is used (taking into account variation across generations), code-switching to Russian (especially involving the younger generation), the role of Standard Udmurt in the community (interference between Tatyshly Udmurt and Standard Udmurt, communication with speakers from other areas, education), and language contact with Turkic varieties (knowledge of Bashkir and Tatar among the Udmurt community, communication with Turkic-speaking people). Practical activities aimed at Udmurt language maintenance are also outlined.

  • Research Article
  • Citations: 3
  • 10.18255/1818-1015-2021-3-280-291
Text Classification by Genre Based on Rhythm Features
  • Oct 14, 2021
  • Modeling and Analysis of Information Systems
  • Ksenia Vladimirovna Lagutina + 2 more

The article is devoted to the analysis of the rhythm of texts of different genres: fiction novels, advertisements, scientific articles, reviews, tweets, and political articles. The authors identified lexico-grammatical figures in the texts (anaphora, epiphora, diacope, aposiopesis, etc.) that are markers of the text rhythm. On their basis, statistical features were calculated that describe these rhythm features quantitatively and structurally. The resulting text model was visualized for statistical analysis using boxplots and heat maps that showed differences in the rhythm of texts of different genres. The boxplots showed that almost all genres differ from each other in terms of the overall density of rhythm features. The heat maps showed different rhythm patterns across genres. Further, the rhythm features were successfully used to classify texts into six genres. The classification was carried out in two ways: a binary classification for each genre in order to separate a particular genre from the remaining genres, and a multi-class classification of the text corpus into six genres at once. Two text corpora in English and Russian were used for the experiments. Each corpus contains 100 fiction novels, scientific articles, advertisements and tweets, 50 reviews and political articles, i.e. a total of 500 texts. The high quality of the classification with neural networks showed that rhythm features are a good marker for most genres, especially fiction. The experiments were carried out using the ProseRhythmDetector software tool for the Russian and English languages. Text corpora contain 300 texts for each language.

  • Book Chapter
  • Citations: 5
  • 10.1007/978-3-642-10308-7_4
Improving Cognitive Abilities and e-Inclusion in Children with Cerebral Palsy
  • Jan 1, 2009
  • Chiara Martinengo + 1 more

Besides overcoming the motor barriers to accessing computers and the Internet, ICT tools can provide very useful, and often necessary, support for the cognitive development of motor-impaired children with cerebral palsy. In fact, software tools for computation and communication allow teachers to put into effect, in a more complete and efficient way, the learning methods and the educational plans designed for the child. In the present article, after a brief analysis of the general objectives to be pursued in order to support learning in children with cerebral palsy, we take account of some specific difficulties in the logical-linguistic and logical-mathematical fields, and we show how they can be overcome using general ICT tools and specifically implemented software programs.

Keywords: e-inclusion, cognitive development, mathematical learning, motor disability, cerebral palsy, educational software

  • Research Article
  • 10.1556/062.2015.68.3.1
Kalmyk and Khalkha Ethnographica in Gábor Bálint of Szentkatolna’s manuscripts (1871–1873)
  • Sep 1, 2015
  • Acta Orientalia Academiae Scientiarum Hungaricae
  • Ágnes Birtalan

The Hungarian (Székely) Gábor Bálint of Szentkatolna (1844–1913) was one of the first researchers of Kalmyk and Khalkha vernacular language, folklore and ethnography. His valuable records are written in a very accurate transcription and include the specimens of Kalmyk and Khalkha spoken languages, folklore material and ethnographic narratives, and a comparative grammar of western and eastern Mongolian languages. Bálint’s manuscripts had not been released until recent years when Ágnes Birtalan published his Comparative Grammar in 2009 and the Kalmyk corpus with a comprehensive analysis in 2011. The present article aims to give an introduction to Bálint’s ethnographic materials recorded among the Kalmyks (1871–1872) and Khalkhas (1873). Despite the similar economic and cultural milieu the two ethnic groups lived in, there is considerable difference between the Kalmyk and Khalkha text corpora. Besides presenting and systematising Bálint’s ethnographic material, I shall try to clarify the reason why this significant divergence emerges between the two text corpora. Specimens of a particular phase of the wedding ceremony are represented as examples from both text corpora.

  • Research Article
  • Citations: 6
  • 10.3917/cca.221.0085
Business Model et normalisation comptable : quelle intégration du modèle économique par les IFRS ?
  • Mar 30, 2016
  • Comptabilité Contrôle Audit
  • Charlotte Disle + 4 more

The use of the Business Model (BM) concept in the academic literature and in professional and institutional exchanges has increased markedly in recent years. Starting from the observation that the literature contains several levels of characterization of the Business Model concept, this article sets out to analyse the extent to which the IFRS framework takes this concept into account in its most sophisticated sense. An in-depth analysis of the texts that make up the IFRS framework allows us to show that a large part of the components of the Business Model is poorly integrated into the standards. Considering that the inclusion of the Business Model in IFRS could take the form either of mandatory disclosure of additional information on the BM or of accounting treatment conditioned by the BM, the article then examines the potential routes by which the IFRS standards could integrate this concept.

  • Research Article
  • 10.17770/sie2020vol3.5143
CORPUS ANALYSIS OF HIGH SCHOOL LEARNERS’ RESEARCH PAPERS IN HEALTH SCIENCE (2016-2019)
  • May 20, 2020
  • SOCIETY. INTEGRATION. EDUCATION. Proceedings of the International Scientific Conference
  • Anita Pastare + 2 more

In Latvia, secondary school learners are required to become acquainted with the basics of research: selection of literature, data collection and processing, and communication and presentation skills. The present article deals with an analysis of the themes and texts (corpus) of the research papers (RP) in Health Science written by the authors of the top-ranked RP presented at the Scientific Conference of High School Learners of Latvia from 2016 to 2019. A logical inductive content analysis of features specific to each RP, and sequential categorization and grouping into a higher level of components, were performed. The quantitative data were processed using AntConc, IBM SPSS Statistics 22 and Microsoft Excel software. The aim of the research was to find out the themes and content variations of RP, and the characteristics of the language: tokens, types and keywords. The results show that there is a thematic uniformity in RP. The statistical characteristics of language differ in terms of lexical diversity and the frequency of keywords and their collocations. Individual keywords mostly dominate. The results obtained can be used to develop recommendations for learners and teachers as a model for theme selection, presentation of research and criteria for evaluating RP.
