- Research Article
- 10.1080/09296174.2026.2646836
- Apr 23, 2026
- Journal of Quantitative Linguistics
- Michaela Nogolová + 2 more
ABSTRACT The Menzerath-Altmann law describes an inverse relationship between the size of a linguistic construct and the average size of its constituents. While its validity has been widely confirmed for lower-level language units, its application to higher levels remains less explored and sometimes inconclusive. For the first time, this study not only investigates but also corroborates the validity of the Menzerath-Altmann law across a fine-grained hierarchical structure of linguistic units. In particular, the following units are considered: sentence – independent clause – clause – phrase – subphrase – chunk – word – syllable – phoneme. Using a corpus of written Czech, we confirm the validity of the law across this hierarchy.
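As an illustration of the inverse relationship described above, here is a minimal sketch that fits the commonly used truncated form of the Menzerath-Altmann law, y = a·x^b, by least squares on log-transformed data. The functional form, the function name `fit_power_law`, and the numbers are assumptions made for illustration; they are not taken from the study or its Czech corpus.

```python
import math

def fit_power_law(sizes, mean_constituent_sizes):
    """Least-squares fit of log y = log a + b * log x; returns (a, b)."""
    xs = [math.log(x) for x in sizes]
    ys = [math.log(y) for y in mean_constituent_sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b

# Made-up illustration data (NOT from the paper): construct size in
# constituents (e.g. clauses per sentence) and the mean constituent size
# (e.g. words per clause).
construct_sizes = [1, 2, 3, 4, 5]
mean_sizes = [10.0, 8.1, 7.2, 6.6, 6.2]

a, b = fit_power_law(construct_sizes, mean_sizes)
# A negative exponent b corresponds to the inverse relationship.
print(f"a = {a:.2f}, b = {b:.3f}")
```

Repeating such a fit for each adjacent pair of levels in a hierarchy (sentence and clause, clause and phrase, and so on) is one way the law could be checked level by level.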
- Research Article
- 10.1080/09296174.2026.2634477
- Mar 1, 2026
- Journal of Quantitative Linguistics
- Yiyang Hu + 2 more
ABSTRACT Research on translation universals has traditionally focused on isolated linguistic features along paradigmatic dimensions due to ease of interpretation. However, syntagmatic approaches, which examine how linguistic elements combine sequentially, remain underexplored. This corpus-based study addresses this gap by analysing R-motifs, defined as recurring sequences of part-of-speech tags, across four genres in translated and native Chinese texts. We investigate both the rank-frequency distributions of R-motif types and motif lengths as potential indicators of translation universals. Our analysis shows that R-motif frequencies in both text types follow the right-truncated Zeta distribution, whereas motif length distributions conform to the Pólya model. A Random Forest text classification model is then built, with texts represented by the parameters and attributes of their POS R-motif distributions. The experiments show that combining features from distribution parameters and attributes can detect translationese efficiently. Future research may extend this approach by exploring more granular features beyond part-of-speech sequences.
- Research Article
- 10.1080/09296174.2026.2612931
- Feb 12, 2026
- Journal of Quantitative Linguistics
- George Mikros + 3 more
ABSTRACT Machine translation (MT) systems are typically evaluated by comparing outputs to human references using metrics that approximate adequacy and fluency, but these metrics are not designed to measure stylistic fidelity, i.e. how closely an output matches the target-language stylistic profile of a high-quality human literary translation. We test whether stylometric distance, operationalized with Burrows’ Delta over the 500 most frequent words, can serve as a convergent validator of adequacy signals while providing interpretable, reference-free diagnostics. Using nine contemporary Greek short stories with author-produced English self-translations and MT outputs, segmented into non-overlapping five-sentence windows, we compare an inverted, min–max normalized Burrows’ Delta score (invΔ_B) against standard reference-based MT metrics (BLEU, chrF2, TER, BLEURT, COMET, BERTScore) and against an adequacy composite (TQI_win). We find strong convergence between stylometric proximity and adequacy signals, particularly at decision-relevant extremes, but stylometry underperforms adequacy metrics when used alone and provides no incremental predictive benefit beyond semantic-embedding baselines. We conclude that stylometry is best used as a complementary, explainable diagnostic and as a constrained reference-free monitor, not as a substitute for adequacy-oriented MT evaluation.
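To make the stylometric score above concrete, the following is a rough sketch of Burrows' Delta in its textbook form (mean absolute difference of z-scored relative frequencies over the most frequent words), followed by an inverted min-max normalization. The function names, the use of the population standard deviation, and the toy texts are assumptions for illustration; the study's exact preprocessing and windowing are not reproduced here.

```python
from collections import Counter
from statistics import mean, pstdev

def relative_freqs(tokens, vocab):
    """Relative frequency of each vocabulary word in one tokenized text."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[w] / total for w in vocab]

def burrows_delta(texts, i, j, n_mfw=500):
    """Burrows' Delta between texts[i] and texts[j] (lists of tokens)."""
    corpus_counts = Counter(tok for t in texts for tok in t)
    vocab = [w for w, _ in corpus_counts.most_common(n_mfw)]
    freqs = [relative_freqs(t, vocab) for t in texts]
    diffs = []
    for k in range(len(vocab)):
        column = [f[k] for f in freqs]
        sd = pstdev(column)  # assumption: population standard deviation
        if sd == 0:
            continue  # a word with constant frequency cannot discriminate
        mu = mean(column)
        z_i = (freqs[i][k] - mu) / sd
        z_j = (freqs[j][k] - mu) / sd
        diffs.append(abs(z_i - z_j))
    return mean(diffs) if diffs else 0.0

def inverted_minmax(values):
    """Map distances to [0, 1] so that 1 means closest and 0 means farthest."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]
    return [1 - (v - lo) / (hi - lo) for v in values]

# Toy texts, invented for illustration only.
texts = [
    "the cat sat on the mat".split(),
    "the cat sat on the mat".split(),
    "a dog ran in a park".split(),
]
distances = [burrows_delta(texts, 0, 1), burrows_delta(texts, 0, 2)]
print(inverted_minmax(distances))
```

In this sketch a higher inverted score means the candidate text is stylistically closer to the reference, which is the direction needed for comparison against quality metrics where higher is better.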
- Research Article
- 10.1080/09296174.2026.2617705
- Jan 31, 2026
- Journal of Quantitative Linguistics
- Ramon Ferrer-i-Cancho
ABSTRACT The frequency of the preferred order for a noun phrase formed by demonstrative, numeral, adjective and noun has received significant attention over the last two decades. We investigate the actual distribution of the 24 possible orders. There is no consensus on whether it is better fitted by an exponential or a power-law distribution. We find that an exponential distribution is a much better model. This finding and other circumstances where an exponential-like distribution is found challenge the view that power-law distributions, e.g. Zipf’s law for word frequencies, are inevitable. We also investigate which of two exponential distributions gives a better fit: an exponential model where the 24 orders have non-zero probability (a geometric distribution truncated at rank 24) or an exponential model where the number of orders that can have non-zero probability is variable (a right-truncated geometric distribution). When consistency and generalizability are prioritized, we find higher support for the exponential model where all 24 orders have non-zero probability. These findings strongly suggest that there is no hard constraint on word order variation and that unattested orders merely result from undersampling, consistent with Cysouw’s view.
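The comparison between an exponential and a power-law model can be sketched as a log-likelihood contest between a geometric distribution truncated at a maximum rank and a right-truncated zeta distribution. The parameterizations, the grid search, and the synthetic counts below are assumptions chosen for illustration; the paper's actual data and model selection criteria are not reproduced.

```python
import math

def geometric_truncated_pmf(q, n_ranks=24):
    """p(r) proportional to (1 - q)^(r - 1) for ranks r = 1..n_ranks."""
    weights = [(1 - q) ** (r - 1) for r in range(1, n_ranks + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def zeta_truncated_pmf(alpha, n_ranks=24):
    """p(r) proportional to r^(-alpha) for ranks r = 1..n_ranks."""
    weights = [r ** -alpha for r in range(1, n_ranks + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def log_likelihood(counts, pmf):
    """Multinomial log-likelihood up to a constant; zero counts contribute 0."""
    return sum(c * math.log(p) for c, p in zip(counts, pmf) if c > 0)

def best_fit(counts, pmf_family, grid):
    """Grid search; returns (max log-likelihood, best parameter)."""
    return max(
        (log_likelihood(counts, pmf_family(theta, len(counts))), theta)
        for theta in grid
    )

# Synthetic rank-ordered counts for 24 orders (NOT real typological data).
counts = [round(1000 * p) for p in geometric_truncated_pmf(0.3)]

ll_geo, q_hat = best_fit(counts, geometric_truncated_pmf,
                         [i / 100 for i in range(1, 100)])
ll_zeta, alpha_hat = best_fit(counts, zeta_truncated_pmf,
                              [i / 100 for i in range(1, 500)])
print(f"geometric: q = {q_hat:.2f}, logL = {ll_geo:.1f}")
print(f"zeta:      alpha = {alpha_hat:.2f}, logL = {ll_zeta:.1f}")
```

Changing `n_ranks` gives the variable-support variant; a penalized criterion such as AIC or BIC would then be needed to compare models with different numbers of parameters.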
- Research Article
- 10.1080/09296174.2025.2611604
- Jan 19, 2026
- Journal of Quantitative Linguistics
- Thomas Mccauley + 1 more
ABSTRACT Collocation analysis is a widespread method in corpus linguistics. A key metric used for collocation discovery is pointwise mutual information (PMI), determined by how frequently a collocation occurs relative to its expected frequency under the assumption of independence. However, PMI suffers from several limitations, especially its well-known bias for collocations involving low-frequency words. In this paper, we propose a method to determine the significance of the PMI statistic by calculating its p-value, following two probability models for collocations involving the binomial distribution and the Poisson distribution. We demonstrate the effectiveness of this method by investigating collocations involving the Greek word θεός, ‘god’, in ancient historiography. This example illustrates that the PMI statistic alone does not reveal the significance of a collocation, but rather that p-values provide a consistent threshold of statistical significance and thereby overcome many of the well-known limitations of PMI.
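The idea of attaching a p-value to a PMI score can be sketched as follows: under independence, the co-occurrence count is modelled as Poisson or binomial, and the p-value is the upper-tail probability of the observed count. This is one plausible reading of the approach; the function names, the one-sided test, and the counts are assumptions and may differ from the paper's formulation.

```python
import math

def pmi(joint, freq_x, freq_y, n):
    """Pointwise mutual information (base 2) from raw counts."""
    expected = freq_x * freq_y / n
    return math.log2(joint / expected)

def poisson_p_value(joint, freq_x, freq_y, n):
    """P(X >= joint) for X ~ Poisson(expected co-occurrence count)."""
    lam = freq_x * freq_y / n
    cdf_below = sum(
        math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
        for k in range(joint)
    )
    return max(0.0, 1.0 - cdf_below)

def binomial_p_value(joint, freq_x, freq_y, n):
    """P(X >= joint) for X ~ Binomial(n, p_x * p_y)."""
    p = (freq_x / n) * (freq_y / n)

    def log_pmf(k):
        log_comb = (math.lgamma(n + 1) - math.lgamma(k + 1)
                    - math.lgamma(n - k + 1))
        return log_comb + k * math.log(p) + (n - k) * math.log1p(-p)

    cdf_below = sum(math.exp(log_pmf(k)) for k in range(joint))
    return max(0.0, 1.0 - cdf_below)

# Two hypothetical collocations with the same PMI in a corpus of 10,000 tokens.
rare = dict(joint=1, freq_x=10, freq_y=10, n=10_000)
frequent = dict(joint=4, freq_x=20, freq_y=20, n=10_000)
for name, c in [("rare", rare), ("frequent", frequent)]:
    print(name, round(pmi(**c), 3), poisson_p_value(**c), binomial_p_value(**c))
```

Both pairs have the same PMI, yet the pair observed once has a much larger p-value than the pair observed four times, which illustrates the low-frequency bias mentioned in the abstract. Computing the tail as one minus a partial sum loses precision for extremely small p-values, so a production version would sum the upper tail directly.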
- Research Article
- 10.1080/09296174.2025.2603705
- Jan 12, 2026
- Journal of Quantitative Linguistics
- Jan Andres + 3 more
ABSTRACT This pilot study introduces methods from dynamical systems theory to the analysis of sign language, highlighting their potential to reveal patterns of stability and variability in linguistic signals. We present two complementary measures: local Lyapunov exponents, which indicate how sensitive sequences of linguistic units are to small changes, and topological entropy, which quantifies the overall temporal complexity (chaoticity) of the sign language. The methods are illustrated using a single sign language text, analysed at two levels, sentences and individual signs, measured both in numbers of signs or pseudosyllables and in seconds. Results show higher complexity and lower stability at the level of individual signs. Lyapunov exponents capture local fluctuations and sensitivity in linguistic structure, suggesting moments where planning or motor execution may influence production, while topological entropy reflects the broader organization and predictability of discourse. Together, these measures provide a dynamic, multi-level perspective on language organization, indicating how micro-level variability interacts with macro-level structure, and offering new insights into the temporal and structural dynamics of sign language communication.
- Research Article
- 10.1080/09296174.2025.2585611
- Jan 10, 2026
- Journal of Quantitative Linguistics
- Víctor Franco-Sánchez + 2 more
ABSTRACT Consider a linguistic structure formed by n elements, for instance, subject, direct object and verb (n=3) or subject, direct object, indirect object and verb (n=4). We investigate whether the frequency of the n! possible orders is constrained by two principles. First, entropy minimization, a principle that has been suggested to shape natural communication systems at distinct levels of organization. Second, swap distance minimization, namely a preference for word orders that require fewer swaps of adjacent elements to be produced from a source order. We present average swap distance, a novel score for research on swap distance minimization. We find strong evidence of pressure for entropy minimization and swap distance minimization with respect to a die rolling experiment in distinct linguistic structures with n=3 or n=4. Evidence with respect to a Polya urn process is strong for n=4 but weaker for n=3. We still find evidence consistent with the action of swap distance minimization when word order frequencies are shuffled, indicating that swap distance minimization effects are beyond pressure to reduce word order entropy.
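The two quantities discussed above can be sketched in a few lines. The swap distance between two orders is the minimum number of adjacent swaps needed to turn one into the other, which equals the number of inverted pairs. The definition of average swap distance used here (the probability-weighted mean distance to a source order) is an assumption for illustration and may not match the paper's score exactly; the uniform distribution mimics a fair die roll over orders.

```python
import math
from itertools import permutations

def swap_distance(order, source):
    """Minimum number of adjacent swaps turning `source` into `order`."""
    position = {element: i for i, element in enumerate(source)}
    ranks = [position[e] for e in order]
    return sum(
        1
        for i in range(len(ranks))
        for j in range(i + 1, len(ranks))
        if ranks[i] > ranks[j]  # each inverted pair costs one adjacent swap
    )

def average_swap_distance(probabilities, source):
    """Probability-weighted mean swap distance to the source order."""
    return sum(p * swap_distance(order, source)
               for order, p in probabilities.items())

def entropy(probabilities):
    """Shannon entropy (bits) of the word order distribution."""
    return -sum(p * math.log2(p) for p in probabilities.values() if p > 0)

# Baseline: every one of the 3! orders of S, O, V is equally likely.
source = ("S", "O", "V")
uniform = {order: 1 / 6 for order in permutations(source)}
print(average_swap_distance(uniform, source), entropy(uniform))
```

A language whose observed order frequencies yield a lower entropy and a lower average swap distance than such a random baseline would be consistent with the two minimization pressures.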
- Research Article
- 10.1080/09296174.2025.2611528
- Jan 5, 2026
- Journal of Quantitative Linguistics
- Jiamiao Song + 1 more
ABSTRACT Zipf’s law, the Zipf-Mandelbrot law and Heaps’ law have been validated across languages and are viewed as universal linguistic principles. Recent studies have increasingly investigated the implications of their parameters. However, their applicability to ancient languages remains underexplored, and the exponent of Heaps’ law has received limited attention. Our study explores how well these laws hold in Classical Chinese and whether their exponents can serve as diachronic indicators of lexical diversity. The results indicate that Classical Chinese exhibits distributional patterns consistent with the laws. The exponent of Zipf’s law decreases diachronically, whereas the exponent of Heaps’ law and lexical diversity increase. The exponent of Zipf’s law correlates negatively with lexical diversity, that of Heaps’ law positively, and all three show strong pairwise correlations. The parameters of the Zipf-Mandelbrot law exhibit no clear monotonic trend and correlate only internally. Our findings provide support for the three laws in Classical Chinese and demonstrate that the exponents of both Zipf’s law and Heaps’ law function as diachronic indicators of lexical diversity in Classical Chinese. However, the study is limited by its language scope, metric choice, missing polysemy analysis, untested mechanisms and unit-related issues. Future research could address these limitations.
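For readers unfamiliar with the two exponents, the sketch below estimates them from a token list by ordinary least squares on log-log data: the Zipf exponent from the rank-frequency curve and the Heaps exponent from the vocabulary growth curve. Log-log regression is a simple but crude estimator, and the function names and toy tokens are assumptions; the study's own estimation procedure is not specified here.

```python
import math
from collections import Counter

def loglog_slope(points):
    """Least-squares slope of log y against log x."""
    xs = [math.log(x) for x, _ in points]
    ys = [math.log(y) for _, y in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def zipf_exponent(tokens):
    """Exponent alpha in f(r) ~ r^(-alpha), from the rank-frequency list."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return -loglog_slope(list(enumerate(freqs, start=1)))

def heaps_exponent(tokens, step=1):
    """Exponent beta in V(N) ~ N^beta, from the vocabulary growth curve."""
    seen = set()
    points = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        if n % step == 0:
            points.append((n, len(seen)))
    return loglog_slope(points)

# Toy token sequence, invented for illustration only.
tokens = "a b a c b a d e a b".split()
print(zipf_exponent(tokens), heaps_exponent(tokens))
```

Applying both functions to texts from successive periods would give one exponent pair per period, which could then be correlated with a lexical diversity measure.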
- Research Article
- 10.1080/09296174.2025.2587380
- Jan 4, 2026
- Journal of Quantitative Linguistics
- Jacek Bąkowski
ABSTRACT Synonymy is a widespread yet puzzling linguistic phenomenon. Absolute synonyms should theoretically not exist, as they do not expand language’s expressive potential. However, it has been suggested that even if synonyms denote the same concept, they may reflect different perspectives or carry distinct cultural associations, claims that have rarely been tested quantitatively. In Hindi, prolonged contact with Persian produced many Perso-Arabic loanwords coexisting with their Sanskrit counterparts, forming numerous synonym pairs. This study investigates whether, centuries after these borrowings appeared in the Subcontinent, their origin can still be distinguished using distributional data alone and regardless of their semantic content. A Random Forest trained on word embeddings of Hindi synonyms successfully classified words by Sanskrit or Perso-Arabic origin, even when they were semantically unrelated, suggesting that usage patterns preserve traces of etymology. These findings provide quantitative evidence that context encodes etymological signals and that synonymy may reflect subtle but systematic distinctions linked to origin. They support the idea that synonymous words can offer different perspectives and that etymologically related words may form distinct conceptual subspaces, creating a new type of semantic frame shaped by historical origin. Overall, the results highlight the power of context in capturing nuanced distinctions beyond traditional semantic similarity.
- Research Article
- 10.1080/09296174.2025.2607850
- Dec 29, 2025
- Journal of Quantitative Linguistics
- Yosuke Takubo + 2 more
ABSTRACT Evaluating statistical fluctuations in natural language data is essential for assessing the consistency between observed data and the predictions of language models. In this study, fluctuations in word frequencies in Japanese texts are quantified, revealing that the fluctuations expected from a Poisson distribution underestimate them. The evaluated fluctuations are incorporated into the data points. The consistency with Zipf’s law is then examined using χ² and Kolmogorov–Smirnov (KS) tests, in order to investigate how the outcomes of these statistical tests differ from those obtained under the assumption of Poisson errors. The results indicate that the fluctuations evaluated in this study should be used for precise comparisons between natural language data and language models.
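The effect of the error model on a χ² test can be sketched as follows: the statistic divides each squared residual by the assumed variance, so variances larger than the Poisson value (variance equal to the mean) shrink the statistic. The observed counts, the Zipf exponent of 1, and the inflation factor below are hypothetical values for illustration, not estimates from the study.

```python
def zipf_expected(total, n_ranks, exponent=1.0):
    """Expected counts under Zipf's law, normalized to `total` tokens."""
    weights = [r ** -exponent for r in range(1, n_ranks + 1)]
    norm = sum(weights)
    return [total * w / norm for w in weights]

def chi_square(observed, expected, variances):
    """Sum of squared residuals, each scaled by its assumed variance."""
    return sum((o - e) ** 2 / v
               for o, e, v in zip(observed, expected, variances))

# Made-up word counts by rank (NOT from the Japanese corpus).
observed = [520, 240, 170, 130, 95]
expected = zipf_expected(sum(observed), len(observed))

poisson_var = expected        # Poisson assumption: variance equals the mean
inflation = 2.0               # hypothetical factor for larger fluctuations
measured_var = [inflation * v for v in poisson_var]

print("chi2 with Poisson errors: ", chi_square(observed, expected, poisson_var))
print("chi2 with inflated errors:", chi_square(observed, expected, measured_var))
```

If real fluctuations exceed the Poisson expectation, a test based on Poisson errors will reject a model more readily than the data warrant, which is why the choice of error estimate matters for the comparison.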