Hapax legomena via stochastic processes

Abstract

We study the number of words that occur exactly once since the beginning of a text. We model it as a stochastic process over the length of the text. The elementary probability model, going back to Bahadur and Karlin, states that the number of words that occur exactly once should grow according to a power law, like the number of different words. The final value of the number of words occurring exactly once is the number of hapaxes of this text. We construct two statistical tests to test Karlin’s model under the assumption that the probabilities of words in this model satisfy the generalized Zipf’s law. These statistical tests show that some texts fit the model well, but many texts deviate significantly from it. This deviation is that the number of hapaxes is too small relative to the number of different words.
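The process described in the abstract can be illustrated with a minimal sketch. It tracks, for every prefix of a text, the number of different words and the number of words seen exactly once so far. The tokenizer and the function names `tokenize` and `hapax_process` are assumptions made for illustration and are not taken from the paper.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())

def hapax_process(words):
    """Return two paths indexed by text length:
    the number of different words and the number of words seen exactly once."""
    counts = Counter()
    distinct = 0
    hapaxes = 0
    distinct_path, hapax_path = [], []
    for w in words:
        counts[w] += 1
        if counts[w] == 1:
            # First occurrence: a new word that is, for now, a hapax.
            distinct += 1
            hapaxes += 1
        elif counts[w] == 2:
            # Second occurrence: the word stops being a hapax.
            hapaxes -= 1
        distinct_path.append(distinct)
        hapax_path.append(hapaxes)
    return distinct_path, hapax_path

words = tokenize("the cat sat on the mat and the dog sat too")
distinct_path, hapax_path = hapax_process(words)
print(distinct_path)  # number of different words after each token
print(hapax_path)     # number of once-only words after each token
```

The last value of `hapax_path` is the number of hapaxes of the whole text, and the last value of `distinct_path` is its vocabulary size.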

Similar Papers
  • Research Article
  • Cite count: 20
  • 10.1016/j.physa.2011.05.003
The growth statistics of Zipfian ensembles: Beyond Heaps’ law
  • May 12, 2011
  • Physica A: Statistical Mechanics and its Applications
  • Iddo Eliazar

  • Research Article
  • Cite count: 163
  • 10.1103/physrevx.3.021006
Stochastic Model for the Vocabulary Growth in Natural Languages
  • May 14, 2013
  • Physical Review X
  • Martin Gerlach + 1 more

We propose a stochastic model for the number of different words in a given database which incorporates the dependence on the database size and historical changes. The main feature of our model is the existence of two different classes of words: (i) a finite number of core words, which have higher frequency and do not affect the probability of a new word to be used, and (ii) the remaining virtually infinite number of noncore words, which have lower frequency and, once used, reduce the probability of a new word to be used in the future. Our model relies on a careful analysis of the Google Ngram database of books published in the last centuries, and its main consequence is the generalization of Zipf’s and Heaps’ law to two-scaling regimes. We confirm that these generalizations yield the best simple description of the data among generic descriptive models and that the two free parameters depend only on the language but not on the database. From the point of view of our model, the main change on historical time scales is the composition of the specific words included in the finite list of core words, which we observe to decay exponentially in time with a rate of approximately 30 words per year for English.

  • Research Article
  • Cite count: 4
  • 10.1080/09296174.2023.2275342
Text Segmentation Via Processes that Count the Number of Different Words Forward and Backward
  • Nov 10, 2023
  • Journal of Quantitative Linguistics
  • Berhane Abebe + 2 more

The paper develops a new statistical approach to the automatic partitioning of texts into parts written by different authors. It is based on the analysis of processes that count the number of different words forward and backward. The theoretical study of the processes is based on the assumptions of an elementary probability model with a change point. We prove consistency of our statistical estimate of the point of concatenation in the case when the concatenated texts have different Zipf exponents. The method is tested on the Brown corpus and on newspaper texts in different languages. Testing shows a good estimate of the concatenation point. This method can be used in parallel with other text segmentation methods.

  • Research Article
  • Cite count: 3
  • 10.33048/semi.2020.17.132
A statistical test for correspondence of texts to the Zipf—Mandelbrot law
  • Nov 27, 2020
  • Sibirskie Elektronnye Matematicheskie Izvestiya
  • A Chakrabarty + 5 more

We analyse correspondence of texts to a simple probabilistic model. The model assumes that the words are selected independently from an infinite dictionary, and the probability distribution of words corresponds to the Zipf-Mandelbrot law. We count the numbers of different words in the text sequentially and get the process of the numbers of different words. Then we estimate the Zipf-Mandelbrot law's parameters using the same sequence and construct an estimate of the expectation of the number of different words in the text. After that we subtract the corresponding values of the estimate from the sequence and normalize along the coordinate axes, obtaining a random process on a segment from 0 to 1. We prove that this process (the empirical text bridge) converges weakly in the uniform metric on C(0, 1) to a centered Gaussian process with continuous a.s. paths. We develop and implement an algorithm for calculating the probability distribution of the integral of the square of this process. We present several examples of application of the algorithm for analysis of the homogeneity of texts in English, French, Russian, and Chinese.

  • Research Article
  • Cite count: 3
  • 10.33048/semi.2019.16.129
A statistical test for the Zipf's law by deviations from the Heaps' law
  • Dec 4, 2019
  • Sibirskie Elektronnye Matematicheskie Izvestiya
  • M G Chebunin + 1 more

We explore a probabilistic model of an artistic text: words of the text are chosen independently of each other in accordance with a discrete probability distribution on an infinite dictionary. The words are enumerated 1, 2, $\ldots$, and the probability of the $i$th word is asymptotically a power function of $i$. Bahadur proved that in this case the number of different words, as a function of the length of the text, is asymptotically a power function, too. On the other hand, in the applied statistics community, there exist statements supported by empirical observations: Zipf's and Heaps' laws. We highlight the links between Bahadur's results and Zipf's/Heaps' laws, and introduce and analyse a corresponding statistical test.
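The model in this abstract can be explored with a small simulation. The sketch below draws words independently with probabilities proportional to $i^{-a}$ and estimates the log-log slope of the number of different words against text length. The truncation of the infinite dictionary to `dict_size` words, the parameter values, and the function names are assumptions made for illustration; this is not the authors' test. For $p_i \propto i^{-a}$ with $a > 1$, the classical asymptotics suggest a slope close to $1/a$.

```python
import math
import random

def simulate_distinct_counts(n_words, dict_size, exponent, seed=0):
    """Draw n_words i.i.d. word indices with P(i) proportional to i**(-exponent)
    from a dictionary truncated to dict_size words (an assumption), and return
    the path of the number of different words seen so far."""
    rng = random.Random(seed)
    weights = [i ** (-exponent) for i in range(1, dict_size + 1)]
    sample = rng.choices(range(dict_size), weights=weights, k=n_words)
    seen = set()
    path = []
    for w in sample:
        seen.add(w)
        path.append(len(seen))
    return path

def loglog_slope(path, checkpoints):
    """Least-squares slope of log(number of different words) against log(text length)."""
    xs = [math.log(n) for n in checkpoints]
    ys = [math.log(path[n - 1]) for n in checkpoints]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

path = simulate_distinct_counts(n_words=20000, dict_size=50000, exponent=2.0, seed=1)
checkpoints = [2 ** k for k in range(7, 15)]  # text lengths 128, 256, ..., 16384
slope = loglog_slope(path, checkpoints)
print(round(slope, 2))  # empirical Heaps-type exponent for this simulated text
```

A single simulated path is noisy, so the estimated slope should be read only as a rough check of the power-law growth, not as a statistical test.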

  • Research Article
  • 10.53482/2025_58_423
Comparative Statistical Analysis of Word Frequencies in Human-Written and AI-Generated Texts
  • Jan 1, 2025
  • Glottometrics
  • Anna Kudryavtseva + 1 more

We classify texts using relative word frequencies. The task is to distinguish human-written texts from those generated by a computer using modern algorithms. We study two essay datasets, each containing an equal number of human-written and computer-generated essays. Studying Zipf diagrams shows that the generated texts have a significantly smaller vocabulary compared to human ones. However, the relative frequency of rare words (not included in the 1000 most common) does not allow us to confidently classify the texts. As additional features, we used the relative frequencies of the four most frequent words, as well as the ratio of the number of hapax legomena to the total number of different words. This last feature significantly improves the classification. Using these six features allows us to fairly confidently determine whether the text is computer-generated.
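The features named in this abstract can be sketched as a small feature extractor. The abstract does not say how the 1000 most common words are determined; the sketch below ranks words within the text itself, which is an assumption, as are the function name `frequency_features` and its parameters.

```python
from collections import Counter

def frequency_features(words, top_k=4, common_cutoff=1000):
    """Return top_k + 2 features for a tokenized text:
    relative frequencies of the top_k most frequent words,
    the ratio of hapax legomena to different words,
    and the share of tokens whose word ranks below common_cutoff.
    Ranking words inside the text itself is an assumption of this sketch."""
    if not words:
        raise ValueError("words must be non-empty")
    counts = Counter(words)
    total = len(words)
    ranked = counts.most_common()
    top = [c / total for _, c in ranked[:top_k]]
    top += [0.0] * (top_k - len(top))  # pad when the vocabulary is tiny
    hapax_ratio = sum(1 for c in counts.values() if c == 1) / len(counts)
    rare_share = sum(c for _, c in ranked[common_cutoff:]) / total
    return top + [hapax_ratio, rare_share]

features = frequency_features("a a a b b c d".split(), top_k=4, common_cutoff=2)
print(features)
```

With the default parameters this yields six numbers per text, which could then be passed to any standard classifier.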

  • Research Article
  • Cite count: 13
  • 10.1016/j.physa.2005.06.049
Fractal power law in literary English
  • Jul 18, 2005
  • Physica A: Statistical Mechanics and its Applications
  • L.L Gonçalves + 1 more

  • Research Article
  • Cite count: 1
  • 10.1155/2017/9143747
Calculation of Precise Constants in a Probability Model of Zipf’s Law Generation and Asymptotics of Sums of Multinomial Coefficients
  • Jan 1, 2017
  • International Journal of Mathematics and Mathematical Sciences
  • Vladimir Bochkarev + 1 more

Let $\omega_0, \omega_1, \ldots, \omega_n$ be a full set of outcomes (symbols) and let positive $p_i$, $i = 0, \ldots, n$, be their probabilities ($\sum_{i=0}^{n} p_i = 1$). Let us treat $\omega_0$ as a stop symbol; it can occur in sequences of symbols (we call them words) only once, at the very end. The probability of a word is defined as the product of probabilities of its symbols. We consider the list of all possible words sorted in the nonincreasing order of their probabilities. Let $p_r$ be the probability of the $r$th word in this list. We prove that if at least one of the ratios $\log p_i / \log p_j$, $i, j \in \{1, \ldots, n\}$, is irrational, then the limit $\lim_{r \to \infty} p_r / r^{-1/\gamma}$ exists and differs from zero; here $\gamma$ is the root of the equation $\sum_{i=1}^{n} p_i^{\gamma} = 1$. The limit constant can be expressed (rather easily) in terms of the entropy of the distribution $(p_1^{\gamma}, \ldots, p_n^{\gamma})$.
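The exponent $\gamma$ in this abstract is defined only implicitly, so a short numerical sketch may help. The function below solves $\sum_{i=1}^{n} p_i^{\gamma} = 1$ by bisection on $(0, 1)$, which works because the left-hand side is strictly decreasing in $\gamma$, equals $n$ at $\gamma = 0$ and is below 1 at $\gamma = 1$ (the stop symbol has positive probability). The function name `solve_gamma` and the example probabilities are assumptions made for illustration.

```python
def solve_gamma(probs, tol=1e-12):
    """Solve sum(p ** gamma for p in probs) == 1 for gamma by bisection.
    probs holds p_1, ..., p_n (the stop symbol's p_0 is excluded), so the
    entries are positive and sum to less than 1; n must be at least 2."""
    if len(probs) < 2 or any(p <= 0 for p in probs) or sum(probs) >= 1:
        raise ValueError("need at least two positive probabilities summing to less than 1")

    def f(g):
        return sum(p ** g for p in probs) - 1.0

    lo, hi = 0.0, 1.0  # f(0) = n - 1 > 0 and f(1) = sum(probs) - 1 < 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Solver check only: equal probabilities give log p_1 / log p_2 = 1, which is
# rational, so this case is outside the hypothesis of the theorem.
gamma_check = solve_gamma([0.25, 0.25])  # 2 * 0.25 ** 0.5 == 1, so gamma is 0.5

# A case with distinct symbol probabilities (p_0 = 0.2 is the stop symbol).
gamma = solve_gamma([0.5, 0.3])
print(gamma_check, gamma)
```

Under the theorem, the resulting $\gamma$ gives the Zipf-type decay $p_r \sim C r^{-1/\gamma}$ of the ranked word probabilities.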

  • Research Article
  • 10.35120/sciencej0304029m
APPLICATION OF PARAMETER STATISTICAL TESTS AND DATA ENVELOPMENT ANALYSIS METHODS IN MODERN BUSINESS
  • Dec 10, 2024
  • SCIENCE International Journal
  • Branka Marković + 2 more

In the absence of a sufficient amount of information for quality business decision-making, i.e. successful performance of activities without unnecessary losses in the consumption of inputs, recently the non-parametric DEA method (Data Envelopment Analysis) is most often used through the linear programming technique. In the event that company managers have enough information to make business decisions, parametric statistical tests are used that compare the company's current performance with optimal performance, i.e. those that are on the edge of efficiency. However, this situation is very rare, so before making business decisions, non-parametric and then parametric statistical tests are carried out in detail. The subject of research of this paper is primarily focused on the simultaneous application of parametric and non-parametric statistical tests in the assessment of the economic efficiency of an economic entity. After the conducted research and analysis of the obtained results, it was determined that the null hypothesis, which claims that the relative efficiency of the warehouse obtained by parametric statistical tests and the DEA method is identical and that the trends have the same direction, could not be fully accepted. Namely, it was found out that the results of one and the same economic situation using the mentioned two types of analysis differ to the extent that they are not adequate for economic decision-making, however identical results were obtained in the assessment of the trend. It can be concluded that the simultaneous application of both methods, as well as its implementation in several iterations, can provide enough quality information for effective decision-making. Stochastic processes that occur during the implementation of business decisions using the DEA technique can be minimized through the simultaneous application of statistical parametric methods and tests for evaluating the expected efficiency of DEA. 
The effectiveness of this method in any case depends on the size of the sample implemented in the aforementioned statistical analysis. The aforementioned statistical tests enable the measurement and detection of those input parameters that will most effectively contribute to the efficiency of business systems.

  • Preprint Article
  • Cite count: 1
  • 10.5194/egusphere-egu24-3412
Unravelling the spatial structure of regular environmental spatial patterns
  • Nov 27, 2024
  • Karl Kästner + 3 more

Spatial patterns where patches of high biomass alternate with bare ground occur in many resource-limited ecosystems. Especially fascinating are regular patterns, which are self-similar at a lag distance corresponding to the typical distance between patches. Regular patterns are understood to form autogenously through self-organization, which can be generated with deterministic reaction-diffusion models. Such models generate highly regular patterns, which repeat at the characteristic wavelength and are therefore periodic. Natural patterns do not repeat, as they are noisy and as the patch size and spacing vary. Natural patterns are therefore usually perceived as perturbed periodic patterns. However, the self-similarity of natural patterns decreases at longer lag distances, which indicates that their spatial structure is not a perturbed periodic structure originating through deterministic processes. Here, we provide an overview of our recent work on the spatial structure and formation of natural environmental spatial patterns as a basis for discussion: First, we develop a statistical periodicity test and compile a large dataset of more than 10,000 regular environmental spatial patterns. We find that neither isotropic (spotted) nor anisotropic (banded) patterns are periodic. Instead, we find that their spatial structure can be well described as random fields originating through stochastic processes. Second, we recognize the regularity as a gradually varying property, rather than a dichotomous property of being periodic or not. We develop a method for quantifying the regularity and apply it in a metastudy to a set of natural and model-generated patterns found in the literature. We find that patterns generated with deterministic reaction-diffusion models do not reproduce the spatial structure of natural environmental patterns well, as they are too regular.
Third, we develop an understanding of pattern formation through stochastic reaction-diffusion processes, which incorporate random environmental heterogeneities. We find that regular patterns form through filtering of the environmental heterogeneities and identify stochastic processes which reproduce both isotropic and anisotropic patterns.

  • Conference Article
  • Cite count: 13
  • 10.1145/1569901.1570044
Evolving stochastic processes using feature tests and genetic programming
  • Jul 8, 2009
  • Brian J Ross + 1 more

The synthesis of stochastic processes using genetic programming is investigated. Stochastic process behaviours take the form of time series data, in which quantities of interest vary over time in a probabilistic, and often noisy, manner. A suite of statistical feature tests are performed on time series plots from example processes, and the resulting feature values are used as targets during evolutionary search. A process algebra, the stochastic π-calculus, is used to denote processes. Investigations consider variations of GP representations for a subset of the stochastic π-calculus, for example, the use of channel unification, and various grammatical constraints. Target processes of varying complexity are studied. Results show that the use of grammatical GP with statistical feature tests can successfully synthesize stochastic processes. Success depends upon a selection of appropriate feature tests for characterizing the target behaviour, and the complexity of the target process.

  • Research Article
  • Cite count: 9
  • 10.2307/3520045
Relevant Distributions for Insurance Prices in an Arbitrage Free Equilibrium
  • Mar 1, 1991
  • The Journal of Risk and Insurance
  • Patrick L Brockett + 1 more

The increased volatility of economic and financial-risk factors such as inflation, interest rates, investment returns, and exchange rates during the past decade has forced consideration of more financial factors along with underwriting risk factors in insurance pricing models. Moreover, competition in financial services markets has forced insurers to move more into the financial arena, for example, in terms of products tied to investment performance, discounting of loss reserves, actuarial modeling of investment strategies, hedging interest-rate risks, and the internationalization of insurance operations. As a consequence, some of the research in finance and in risk management and insurance has started to converge, as noted by Smith (1986) and Buhlmann (1987). Hence, more of the intertemporal models utilized in insurance and actuarial applications involving financial linkages attempt to incorporate the financial concepts of market efficiency and the equilibrium notions underlying competitive market structures (e.g., Kraus and Ross 1982 and Cummins 1988). The same continuous time, stochastic process models are being used for insurance and asset pricing by scholars in risk management and insurance and by researchers in finance. One reason for this convergence is that insurers have most of their assets in financial instruments (e.g. bonds, stocks and mortgages for life insurance companies and stocks and bonds for property casualty companies) and their liabilities consist of interest sensitive components, such as reserves in both life-health and property-liability insurance, which are discounted to a specific valuation date. Consequently, this article analyzes the probabilistic implications of efficiency and equilibrium from the perspective of potential stochastic models pertinent to actuarial calculations or insurance pricing involving financial transactions in an efficient capital market equilibrium.
Intuitively, an efficient capital market is the manifestation of a market system that works in a cost-effective manner, and the study of efficient markets is a study of the (stochastic) process of price formation, or equivalently of the return generating stochastic process, and the market's adjustment to a sequence of relevant subsets. However, the primitive notion from finance that in equilibrium, price efficiency implies that prices reflect all relevant information is too general to have any practical quantitative applications for actuarial modeling of insurance products affected by financial prices. To quantitatively formalize and model this intuitive notion of an efficient market, scholars in insurance, actuarial science and finance have developed several approaches to describing the stochastic process of prices. Two of these are the traditional independent increments or random walk model familiar to actuaries from risk theory, and the more general fair game or martingale model. Some scholars, such as Cummins (1988), Boyle (1977), Black and Scholes (1972), and Boyle and Schwartz (1977), assume that rates of return, for example on stocks or bonds, follow a Brownian motion process. While there is some empirical support for the implied lognormality of the corresponding prices at any fixed point in time, it would be desirable and preferable to complement this with an economically based theoretical argument showing why such continuous time probability models arise as a consequence of basic economic notions. Grossman and Shiller (1982, p. 197) also appeal for even a further basic economic rationale for the Brownian motion models which they use. The Brownian motion and stochastic calculus models referred to above are widely used in insurance and actuarial research, for example see: Emanuel, Harrison and Taylor (1975), Boyle (1977), Martin-Lof (1986), Cummins (1988), and Sharp (1989). …

  • Research Article
  • Cite count: 1
  • 10.22052/mir.2020.210794.1186
Random Walk Modeling for Retrieving Information on Semantic Networking
  • Mar 1, 2020
  • SHILAP Revista de lepidopterología
  • Meghdad Abarghouei Nejad + 3 more

In this article, the well-known random walk model is used as a model of stochastic processes to retrieve specific words that users employ in social media. By spreading activation on a semantic network, this model can predict the probability of the words' activation, including all probabilities at different steps. The trend of the probability across steps is shown, and the results for two different weights are compared as the number of steps tends to infinity. In addition, it is shown that the results of the random walk model are aligned with experimental psychological tests, showing that, as a model for semantic memory, it is suitable for retrieval in social media.

  • Book Chapter
  • 10.1007/978-1-4614-6234-7_5
Temporal Risk Processes
  • Dec 26, 2012
  • Charles S Tapiero

This chapter provides an introduction to inter-temporal probability processes commonly used in modeling risk processes. The chapter begins with the questions “what is time?” and “what is memory?”, and asks how their definitions are used to construct quantitative and temporal models. Elementary models such as probability models with the Markov property, random (binomial) walks, Poisson processes, and continuous state and time stochastic processes are presented intuitively and applied to many risk problems. These stochastic processes are then extended to more complex situations, including long run memory models (fractal models), short memory models, and models departing from the basic Markov property and random walk models. While some of these models require a more advanced quantitative background than assumed for Chaps. 3 and 4, their applications are used to highlight both their importance and their implications for financial and risk models.

  • Book Chapter
  • 10.1007/3-540-36283-5_4
FTCS with Markovian Parameters (FTCSMP)
  • Jan 1, 2003

In Chapter 2, it was mentioned that faults may occur randomly at any instant of time, in any form, with various degrees of severity, and in any system component. To represent the stochastic behavior of faults and fault-induced changes, a stochastic process, η(t), is defined. The decisions of the FDI process are based on statistical tests, and they are not deterministic. A stochastic process, Ψ(t), is defined to represent these decisions. Both η(t) and Ψ(t) are assumed to have Markovian characteristics with finite state spaces S = {1, 2, ..., s} and R = {1, 2, ..., r}, respectively.

Keywords: Dynamical System, State Space, Stochastic Process, System Theory, Feedback Control
