Comparative Statistical Analysis of Word Frequencies in Human-Written and AI-Generated Texts

Abstract
We classify texts using relative word frequencies. The task is to distinguish human-written texts from those generated by a computer using modern algorithms. We study two essay datasets, each containing an equal number of human-written and computer-generated essays. Zipf diagrams show that the generated texts have a significantly smaller vocabulary than the human-written ones. However, the relative frequency of rare words (those not among the 1000 most common) does not, on its own, allow us to classify the texts confidently. As additional features, we used the relative frequencies of the four most frequent words, as well as the ratio of the number of hapax legomena to the total number of different words. These features significantly improve the classification. Using all six features allows us to determine fairly confidently whether a text is computer-generated.
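
The six features named in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `six_features`, the regex tokenizer, and the `common_words` reference list (standing in for the 1000 most common words) are hypothetical, and the "four most frequent words" are taken here to be the most frequent words within each text.

```python
import re
from collections import Counter


def six_features(text, common_words, top_k=4):
    """Return [rare-word share, top_k relative frequencies..., hapax ratio].

    Assumptions (not from the source): lowercase regex tokenization, and
    `common_words` is a caller-supplied set of common words.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        raise ValueError("text contains no word tokens")
    counts = Counter(tokens)
    total = len(tokens)

    # Share of tokens that fall outside the list of common words.
    rare_share = sum(c for w, c in counts.items() if w not in common_words) / total

    # Relative frequencies of the top_k most frequent words, zero-padded.
    top = [c / total for _, c in counts.most_common(top_k)]
    top += [0.0] * (top_k - len(top))

    # Hapax legomena (words seen exactly once) over the number of different words.
    hapax_ratio = sum(1 for c in counts.values() if c == 1) / len(counts)

    return [rare_share, *top, hapax_ratio]
```

The resulting six-element vector could then be passed to any standard classifier; which classifier the authors used is not stated in the abstract.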

Similar Papers
  • Research Article
  • Citations: 53
  • 10.1186/s12859-017-1473-7
Prediction of virus-host infectious association by supervised learning methods
  • Mar 1, 2017
  • BMC Bioinformatics
  • Mengge Zhang + 5 more

Background: The study of virus-host infectious association is important for understanding the functions and dynamics of microbial communities. Both cellular and fractionated viral metagenomic data generate a large number of viral contigs with missing host information. Although relatively simple methods based on the similarity between the word frequency vectors of viruses and bacterial hosts have been developed to study virus-host associations, the problem is significantly understudied. We hypothesize that machine learning methods based on word frequencies can be efficiently used to study virus-host infectious associations.
Methods: We investigate four different representations of word frequencies of viral sequences, including the relative word frequency and three normalized word frequencies obtained by subtracting the expected from the observed word counts. We also study five machine learning methods, including logistic regression, support vector machine, random forest, Gaussian naive Bayes and Bernoulli naive Bayes, for separating infectious from non-infectious viruses for nine bacterial host genera with at least 45 infecting viruses. Area under the receiver operating characteristic curve (AUC) is used to compare the performance of different machine learning methods and feature combinations. We then evaluate the performance of the best method for the identification of the hosts of contigs in metagenomic studies. We also develop a maximum likelihood method to estimate the fraction of true infectious viruses for a given host in viral tagging experiments.
Results: Based on nine bacterial host genera with at least 45 infectious viruses, we show that random forest together with the relative word frequency vector performs the best in identifying viruses infecting particular hosts. For all nine host genera, the AUC is over 0.85, and for five of them the AUC is higher than 0.98 when the word size is 6, indicating the high accuracy of machine learning approaches for identifying viruses infecting particular hosts. We also show that our method can predict the hosts of viral contigs of length at least 1 kbp in metagenomic studies with high accuracy. The random forest together with the word frequency vector outperforms currently available methods based on Manhattan and d_{2}^{*} dissimilarity measures. Based on word frequencies, we estimate that about 95% of the identified T4-like viruses in a viral tagging experiment infect Synechococcus, while only about 29% of the identified non-T4-like viruses and 30% of the contigs in the study potentially infect Synechococcus.
Conclusions: The random forest machine learning method together with the relative word frequencies as features of viruses can be used to predict viruses and viral contigs for specific bacterial hosts. The maximum likelihood approach can be used to estimate the fraction of true infectious associated viruses in viral tagging experiments.
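
As a rough illustration of the "relative word frequency" representation mentioned above, the sketch below counts overlapping words of length k in a DNA sequence and normalizes by the number of counted words. It is a minimal sketch under assumptions: the function name is hypothetical, only one strand is counted, and windows containing ambiguous bases are skipped, none of which is confirmed by the abstract.

```python
from collections import Counter
from itertools import product


def relative_word_frequencies(sequence, k=6):
    """Relative frequency of every DNA word of length k, in lexicographic order.

    Assumptions (not from the source): single strand only, and windows with
    characters other than A, C, G, T are ignored.
    """
    seq = sequence.upper()
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    valid = [w for w in windows if set(w) <= set("ACGT")]
    counts = Counter(valid)
    total = len(valid)
    # One vector entry per possible word, so vectors are comparable across sequences.
    words = ("".join(p) for p in product("ACGT", repeat=k))
    return [counts[w] / total if total else 0.0 for w in words]
```

With k = 6 the vector has 4^6 = 4096 entries; such vectors could serve as input features for a classifier such as a random forest.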

  • Research Article
  • Citations: 17
  • 10.1016/j.jneuroling.2007.06.003
Do children with Williams syndrome have unusual vocabularies?
  • Aug 10, 2007
  • Journal of Neurolinguistics
  • Vesna Stojanovik + 1 more

  • Research Article
  • Citations: 7
  • 10.1002/(sici)1097-4571(1999)50:3<280::aid-asi11>3.3.co;2-8
A model for estimating the occurrence of same‐frequency words and the boundary between high‐ and low‐frequency words in texts
  • Jan 1, 1999
  • Journal of the American Society for Information Science
  • Qinglan Sun + 2 more

A simpler model is proposed for estimating the frequency of any same-frequency words and identifying the boundary point between high-frequency words and low-frequency words in a text. The model, based on a “maximum ranking method,” assigns ranks to the words and estimates word frequency by the formula: $\mathrm{Int}[(-1 + (1 + 4D/I_{n+1})^{1/2})/2] > n^{*} \ge \mathrm{Int}[(-1 + (1 + 4D/I_{n})^{1/2})/2]$. The boundary value between high-frequency and low-frequency words is obtained by taking the square root of the number of different words in the text: $n^{*} = D^{1/2}$. This straightforward model was used successfully with both English and Chinese texts, demonstrating that the frequency of words and the number of same-frequency words are dependent only on the vocabulary of a text (the number of different words) but not on its length. Like Zipf's Law, the model may be universally applicable.
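
The boundary rule stated in this abstract (the square root of the number of different words) is simple enough to try directly. The sketch below is an illustration only: the function names are hypothetical, and treating the boundary as a frequency threshold, with words occurring strictly more often counted as high-frequency, is an assumption about how the value is applied.

```python
import math
from collections import Counter


def frequency_boundary(tokens):
    """Return (D, n_star): the number of different words and its square root."""
    d = len(set(tokens))
    return d, math.sqrt(d)


def split_by_boundary(tokens):
    """Split the vocabulary into high- and low-frequency words.

    Assumption (not from the source): a word is high-frequency when its count
    is strictly greater than the boundary value.
    """
    counts = Counter(tokens)
    _, n_star = frequency_boundary(tokens)
    high = {w for w, c in counts.items() if c > n_star}
    low = set(counts) - high
    return high, low
```

For a text with 10,000 different words, for example, the boundary would sit at a frequency of 100.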

  • Research Article
  • Citations: 16
  • 10.1002/(sici)1097-4571(1999)50:3<280::aid-asi11>3.0.co;2-h
A model for estimating the occurrence of same-frequency words and the boundary between high- and low-frequency words in texts
  • Jan 1, 1999
  • Journal of the American Society for Information Science
  • Qinglan Sun + 2 more

A simpler model is proposed for estimating the frequency of any same-frequency words and identifying the boundary point between high-frequency words and low-frequency words in a text. The model, based on a “maximum ranking method,” assigns ranks to the words and estimates word frequency by the formula: $\mathrm{Int}[(-1 + (1 + 4D/I_{n+1})^{1/2})/2] > n^{*} \ge \mathrm{Int}[(-1 + (1 + 4D/I_{n})^{1/2})/2]$. The boundary value between high-frequency and low-frequency words is obtained by taking the square root of the number of different words in the text: $n^{*} = D^{1/2}$. This straightforward model was used successfully with both English and Chinese texts, demonstrating that the frequency of words and the number of same-frequency words are dependent only on the vocabulary of a text (the number of different words) but not on its length. Like Zipf's Law, the model may be universally applicable.

  • Research Article
  • 10.21296/jls.2016.06.77.169
Effects of the Relative Word Frequency and Spelling-to-Sound Correspondency of the Heterographic Homophone Priming on Semantically Related Word Recognition
  • Jun 30, 2016
  • The Journal of Linguistics Science
  • Sangeun Lee + 2 more

The aim of this study was to examine the effects of relative word frequency and the spelling-to-sound correspondence of heterographic homophone priming on the recognition of semantically related words. A cross-modal semantic priming paradigm was applied, in which homophone primes were presented auditorily and targets were presented visually. The difference in lexical decision latencies for the visual targets between the semantically related and semantically unrelated prime conditions was calculated to measure the size of facilitation. A facilitatory effect was shown only when the spelling corresponded to the sound, and no such effect was observed for the relative word frequency condition. These findings indicate that an auditorily presented homophone activates all of its corresponding orthographic words and that the spelling-to-sound correspondence is more influential than the relative word frequency for homophone recognition.

  • Research Article
  • Citations: 9
  • 10.1080/23279095.2015.1089505
Caution warranted in extrapolating from Boston Naming Test item gradation construct
  • Mar 15, 2016
  • Applied Neuropsychology: Adult
  • Robert A Beattey + 6 more

The Boston Naming Test (BNT) was designed to present items in order of difficulty based on word frequency. Changes in word frequencies over time, however, would frustrate extrapolation in clinical and research settings based on the theoretical construct because performance on the BNT might reflect changes in ecological frequency of the test items, rather than performance across items of increasing difficulty. This study identifies the ecological frequency of BNT items at the time of publication using the American Heritage Word Frequency Book and determines changes in frequency over time based on the frequency distribution of BNT items across a current corpus, the Corpus of Contemporary American English. Findings reveal an uneven distribution of BNT items across 2 corpora and instances of negligible differentiation in relative word frequency across test items. As BNT items are not presented in order from least to most frequent, clinicians and researchers should exercise caution in relying on the BNT as presenting items in increasing order of difficulty. A method is proposed for distributing confrontation-naming items to be explicitly measured against test items that are normally distributed across the corpus of a given language.

  • Research Article
  • Citations: 2
  • 10.3724/sp.j.1042.2022.00333
Neural mechanisms and time course of the age-related word frequency effect in language production
  • Feb 1, 2022
  • Advances in Psychological Science
  • Lina Zhang + 1 more

The word frequency effect refers to the phenomenon of processing high-frequency words faster and more accurately than low-frequency words in language production. As age increases in adulthood, the word frequency effect also changes. The Transmission Deficit Hypothesis suggests that ageing weakens the connections between stored language information nodes, so that the word frequency effect increases further. In contrast, according to the Rank Frequency Account, the relative word frequency and the word frequency effect remain unchanged; thus, it can be deduced that ageing does not change the word frequency effect. Furthermore, the Logogen Model assumes that the word frequency decreases with an increase in experience and contact, thereby predicting a reduction of the word frequency effect as age increases.
There are differences in the age-related word frequency effect in language production. Numerous studies have demonstrated that the word frequency effect is greater among older adults than among young adults. However, some studies have shown that older adults differ little or not at all from young adults in the word frequency effect, suggesting that the age-related differences in the word frequency effect may be caused by differences in tasks, stimuli, and individual cognitive abilities. However, the word frequency effect is relatively stable throughout the life cycle. More precisely, it is more difficult to name low-frequency words than high-frequency words for both young and older adults. There is an age difference between young and older adults in processing low-frequency words. Compared with older adults, younger people have higher accuracy and higher activation levels in language-related brain areas (e.g., insula and middle temporal gyrus) and cognitive control-related brain areas (e.g., cingulate cortex). Moreover, there are differences in the word frequency effect between the verb naming task and the noun naming task. In the action picture naming task, high-frequency words activate specific brain regions, whereas in the object naming task, only low-frequency words activate specific brain regions.
The word frequency effect may occur in different stages of language production. Previous studies have demonstrated that, in spoken language production, the word frequency effect may occur in lemma selection, in the connection stage between lemma selection and phonological code retrieval, or in phonological code retrieval. The word frequency effect in written production appears later than in spoken language production and may occur in the orthographic word form. In addition, there are differences in the time course of the word frequency effect between younger and older individuals. For older adults, with increasing age and a decline in general cognitive ability, the time course of the word frequency effect is slightly delayed compared with that of the young.
Some studies have also explored the change in the word frequency effect in patients with degenerative diseases, suggesting that it can be used as a sensitive indicator for early detection and diagnosis of related diseases. The word frequency effect and the age of acquisition effect both affect word processing, but at different stages. The age of acquisition affects the visual and semantic processing of vocabulary, whereas the word frequency effect acts only on vocabulary retrieval. Future work can further distinguish the influence of the word frequency effect from that of the age of acquisition effect on the ageing of language production, and extend the studies to patients with neurodegenerative diseases.

  • Research Article
  • 10.1044/2025_jslhr-24-00053
Conversational Latency in Autistic Children With Heterogeneous Spoken Language Abilities.
  • Apr 3, 2025
  • Journal of speech, language, and hearing research : JSLHR
  • Lue Shen + 6 more

Conversational latency entails the temporal feature of turn-taking, which is understudied in autistic children. The current study investigated the influences of child-based and parental factors on conversational latency in autistic children with heterogeneous spoken language abilities. Participants were 46 autistic children aged 4-7 years. We remotely collected 15-min naturalistic language samples in the context of parent-child interactions to characterize both child and parent conversational latency. Conversational latency was operationally defined as the time it took for one individual to respond to their conversational partner using spoken language. Naturalistic language samples were transcribed following the Systematic Analysis for Language Transcripts convention to characterize autistic children's spoken language and parental spoken language input. Autistic children's spoken language was measured using number of different words (NDW). The quality and quantity of parental spoken language input was assessed using NDW, mean length of utterance in morphemes (MLUm), and frequency of words per minute (WPM). Additional child-based factors, including receptive language and socialization skills, were evaluated using the Vineland Adaptive Behavior Scales. Spearman correlation and regression analyses were conducted to investigate the relationships between those child-based and parental factors and child conversational latency. Older autistic children showed longer conversation latencies. Longer parent conversational latency was associated with longer child conversational latency after controlling for age. Greater parental WPM was associated with shorter child conversational latency after controlling for age. Child conversational latency was not associated with their spoken language, receptive language, or socialization skills. Child conversational latency was not associated with parental NDW and MLUm. 
Our findings highlight the interaction loop between autistic children and their parents in everyday interactions. Parents adjusted their timing and quantity of spoken language input to ensure smooth conversational turn-taking when interacting with their autistic children.

  • Conference Article
  • Citations: 1
  • 10.1109/aicas59952.2024.10595962
Enhancing ASR Performance through Relative Word Frequency in OCR and Normal Word Frequency Analysis
  • Apr 22, 2024
  • Kyudan Jung + 5 more

With the growing interest in Conversational AI, a system that enables machines to engage in human-like dialogues, there has been an increased focus on Automatic Speech Recognition (ASR) as an essential component of Conversational AI. Despite ongoing research, ASR performance still falls short in real-life applications such as academic lectures with technical terms. This paper proposes methods to enhance the recognition of technical terms frequently used in academic lectures, thereby improving overall ASR performance. The proposed method improves on the approach of analyzing the ratio between the frequency of words extracted by Optical Character Recognition (OCR) and the frequency of common words in order to recognize technical terms accurately. It is based on the power law, which is widely used in the scientific community. The experimental results showed a reduction in Word Error Rate (WER) of up to 3.22% on 108 hours of ‘Advanced Compiler’ lectures.

  • Research Article
  • Citations: 154
  • 10.1037/0096-1523.21.6.1297
Phonological priming between monosyllabic spoken words.
  • Jan 1, 1995
  • Journal of Experimental Psychology: Human Perception and Performance
  • Monique Radeau + 2 more

Phonological priming between 3-phoneme monosyllabic spoken words was examined as a function of the early or late position of the phonological overlap between the words and of prime-target relative frequency. The pairs of words had either the 2 beginning or the 2 final phonemes in common. Four experiments were conducted, each using a different combination of interstimulus interval (ISI; either 20 ms or 500 ms) and task (either lexical decision or shadowing). Facilitation was consistently found between words with final overlap in both tasks and was not affected by either absolute or relative word frequency. The size of the effect decreased as the ISI increased. Significant priming effects were not obtained between words with initial overlap, although an inhibitory trend was found in the shadowing task at the short ISI for the low-high relative frequency condition. It is suggested that the facilitatory effect of final overlap is prelexical.

  • Research Article
  • Citations: 19
  • 10.1016/j.ijporl.2010.05.005
Lexical effects on spoken word recognition performance among Mandarin-speaking children with normal hearing and cochlear implants
  • Jun 3, 2010
  • International Journal of Pediatric Otorhinolaryngology
  • Nan Mai Wang + 2 more

  • Research Article
  • Citations: 152
  • 10.1111/j.1467-8659.2009.01439.x
DocuBurst: Visualizing Document Content using Language Structure
  • Jun 1, 2009
  • Computer Graphics Forum
  • Christopher Collins + 2 more

Textual data is at the forefront of information management problems today. One response has been the development of visualizations of text data. These visualizations, commonly based on simple attributes such as relative word frequency, have become increasingly popular tools. We extend this direction, presenting the first visualization of document content which combines word frequency with the human‐created structure in lexical databases to create a visualization that also reflects semantic content. DocuBurst is a radial, space‐filling layout of hyponymy (the IS‐A relation), overlaid with occurrence counts of words in a document of interest to provide visual summaries at varying levels of granularity. Interactive document analysis is supported with geometric and semantic zoom, selectable focus on individual words, and linked access to source text.

  • Conference Article
  • 10.1109/fskd.2013.6816269
Measuring domain similarity for statistical machine translation
  • Jul 1, 2013
  • Lin Liu + 2 more

It is well known that the statistical machine translation (SMT) performance suffers when a model is applied to out-of-domain data. It is also known that the more similar the test domain and the training domain are, the more efficient the training data are for SMT performance. Hence, measuring the similarity of domains is an important task to select appropriate training data. The most widely used method uses the cosine similarity function and word frequency. The lack of exploring other approaches motivates us to propose and compare several similarity measures. Aiming for better SMT performance, we compared 10 similarity measures, which are a combination of 2 feature representations and 5 similarity functions. The results show that using the relative word frequency as the feature representation and using the skew divergence as the similarity function performs the best amongst the 10 measures and outperforms random data selection.
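
The best-performing combination reported here, relative word frequency with skew divergence, can be sketched as follows. This is a hedged illustration: the function names are hypothetical, the definition used is the common form KL(p || αq + (1 − α)p), and both that form and the value α = 0.99 are assumptions, since the abstract does not give the exact variant.

```python
import math
from collections import Counter


def relative_frequencies(tokens):
    """Map each word to its share of all tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}


def skew_divergence(p, q, alpha=0.99):
    """KL(p || alpha*q + (1 - alpha)*p) over two word-frequency dictionaries.

    Mixing a little of p into q keeps the value finite even when q lacks
    words that occur in p. The form and alpha are assumptions.
    """
    div = 0.0
    for w, pw in p.items():
        mix = alpha * q.get(w, 0.0) + (1 - alpha) * pw
        div += pw * math.log(pw / mix)
    return div
```

A lower value means the second domain's word distribution is closer to the first, so candidate training sets could be ranked by their divergence from the test domain.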

  • Conference Article
  • Citations: 4
  • 10.1109/icitec.2014.7105615
A hot topic detection method for Chinese Microblog based on topic words
  • Dec 1, 2014
  • Jun Zheng + 1 more

Microblog is a new kind of network medium that has grown rapidly. Detecting and tracking hot topics through Microblog has attracted wide attention from scholars at home and abroad in recent years. Algorithms designed to find topics in long text messages, such as those on traditional news websites and blogs, cannot effectively handle Microblog data, which is sparse. This paper contributes a method for identifying hot topics in Microblog based on topic words. After pre-treating the Microblog data and dividing it into time windows, the method extracts topic words in every time window according to two factors: the rate of increase of word frequency and the relative word frequency. It then clusters the topic words according to the similarity among them and selects a suitable cluster of topic words to describe the hot topic, thereby detecting hot topics in Microblog. Experimental verification shows that this method can improve detection efficiency to a certain extent and raise recall and precision, so that hot topics in Microblog are found effectively and in a timely manner.

  • Research Article
  • Citations: 255
  • 10.1076/jqul.8.3.165.4101
Two Regimes in the Frequency of Words and the Origins of Complex Lexicons: Zipf’s Law Revisited*
  • Dec 1, 2001
  • Journal of Quantitative Linguistics
  • Ramon Ferrer I Cancho + 1 more

Zipf’s law states that the frequency of a word is a power function of its rank. The exponent of the power is usually accepted to be close to (-)1. Great deviations between the predicted and real number of different words of a text, disagreements between the predicted and real exponent of the probability density function and statistics on a big corpus, make evident that word frequency as a function of the rank follows two different exponents, ≈(-)1 for the first regime and ≈(-)2 for the second. The implications of the change in exponents for the metrics of texts and for the origins of complex lexicons are analyzed.
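
The rank-frequency exponent discussed here can be estimated with an ordinary least-squares fit in log-log space. The sketch below is a minimal illustration, not the method of the cited paper: the function name is hypothetical, and it fits a single exponent over all ranks.

```python
import math
from collections import Counter


def zipf_exponent(tokens):
    """Least-squares slope of log(frequency) against log(rank)."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct words")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Ordinary least-squares slope: covariance over variance.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx
```

To look for two regimes as described in the abstract, one could apply the same fit separately to the low-rank and high-rank parts of the sorted frequency list; where to place the split is left open here.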
