Stylometry and Numerals Usage: Benford’s Law and Beyond

Stats ◽  
2021 ◽  
Vol 4 (4) ◽  
pp. 1051-1068
Author(s):  
Andrei V. Zenkov

We suggest two approaches to the statistical analysis of texts, both based on the occurrence of numerals in literary texts. The first approach is related to Benford’s Law and analyses the frequency distribution of the leading digits of numerals contained in the text. In coherent literary texts, the share of the leading digit 1 is even larger than the roughly 30 percent prescribed by Benford’s Law and can reach 50 percent. The frequencies of occurrence of the digit 1 and, to a lesser extent, of the digits 2 and 3 are usually a characteristic feature of the author’s style, manifested in all (sufficiently long) literary texts of a given author. This approach is convenient for testing whether a group of texts has common authorship: common authorship is doubtful if the frequency distributions are sufficiently different. The second approach extends the first and studies the frequency distribution of the numerals themselves (not their leading digits). It yields non-trivial information about the author and about the stylistic and genre peculiarities of the texts, and is suited to advanced stylometric analysis. The proposed approaches are illustrated by examples of computer analysis of literary texts in English and Russian.
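The first approach can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: the function name `leading_digit_shares` and the sample sentence are hypothetical, and the sketch assumes that only numerals written with digits are counted (numerals spelled out in words would need a separate mapping step).

```python
import math
import re
from collections import Counter

# Benford's Law: the expected share of leading digit d is log10(1 + 1/d).
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit_shares(text):
    """Return the share of each leading digit 1-9 among digit-written numerals."""
    # Assumption: a numeral is any run of digits; leading zeros are ignored,
    # and runs consisting only of zeros are skipped.
    numerals = re.findall(r"\d+", text)
    digits = [int(n.lstrip("0")[0]) for n in numerals if n.lstrip("0")]
    counts = Counter(digits)
    total = sum(counts.values())
    return {d: (counts[d] / total if total else 0.0) for d in range(1, 10)}

# Hypothetical sample sentence, used only to exercise the function.
sample = "In 1812 he had 3 horses, 12 dogs and 150 acres; by 1820 only 1 horse remained."
shares = leading_digit_shares(sample)

for d in range(1, 10):
    print(d, round(shares[d], 3), round(BENFORD[d], 3))
```

Comparing the observed shares of the digits 1, 2 and 3 across several texts gives a simple first check of whether common authorship is plausible.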

2021 ◽  
Vol 270 ◽  
pp. 01038
Author(s):  
Andrei Zenkov ◽  
Eugene Zenkov ◽  
Miroslav Zenkov ◽  
Larisa Sazanova

Two approaches to the statistical analysis of texts are suggested, both based on the occurrence of numerals in coherent texts. The first approach studies the frequency distribution of the leading digits of numerals occurring in the text. These frequencies are unequal: the digit 1 strongly dominates, and the incidence of subsequent digits usually decreases monotonically. The frequencies of occurrence of the digit 1 and, to a lesser extent, of the digits 2 and 3 are usually a characteristic feature of the author’s style, manifested in all (sufficiently long) texts of a given author. This approach is convenient for testing whether a group of texts has common authorship: common authorship is doubtful if the frequency distributions are sufficiently different. The second approach extends the first and studies the frequency distribution of the numerals themselves (not their leading digits). It yields non-trivial information about the author and about the stylistic and genre peculiarities of the texts, and is suited to advanced discourse analysis. This paper deals with the application of the second approach to literary texts in Turkish. We have analysed almost the whole corpus of works by O. Pamuk and Y. Kemal, two of Turkey’s most prominent novelists. Hierarchical cluster analysis based on the occurrence of numerals in the texts by Pamuk and Kemal reveals author, genre, and chronology differences in the use of numerals in the literary texts of these authors.
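A hierarchical cluster analysis of numeral usage might look like the sketch below. It is illustrative only: the abstract does not state the distance measure or linkage rule, so Euclidean distance and average linkage are assumptions, and the text labels and numeral lists are invented toy data, not results from the paper.

```python
import math
from collections import Counter

def numeral_profile(numerals, vocabulary):
    """Relative frequency of each numeral in `vocabulary` for one text."""
    counts = Counter(numerals)
    total = sum(counts[v] for v in vocabulary)
    return [counts[v] / total if total else 0.0 for v in vocabulary]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(profiles):
    """Average-linkage agglomerative clustering; returns the merges in order."""
    clusters = {name: [name] for name in profiles}
    merges = []
    while len(clusters) > 1:
        best = None
        keys = sorted(clusters)
        for i, a in enumerate(keys):
            for b in keys[i + 1:]:
                # Average pairwise distance between members of the two clusters.
                dist = sum(euclidean(profiles[x], profiles[y])
                           for x in clusters[a] for y in clusters[b])
                dist /= len(clusters[a]) * len(clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        clusters[a + "+" + b] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, round(dist, 3)))
    return merges

# Toy data (hypothetical): numerals extracted from four texts by two authors.
vocabulary = [1, 2, 3, 10, 100]
texts = {
    "A1": [1, 1, 1, 2, 3, 1, 2],
    "A2": [1, 1, 2, 1, 3, 1, 1],
    "B1": [10, 100, 2, 100, 10, 3],
    "B2": [100, 10, 10, 2, 100, 100],
}
profiles = {name: numeral_profile(nums, vocabulary) for name, nums in texts.items()}
merges = agglomerate(profiles)
for step in merges:
    print(step)
```

In this toy example the texts labelled with the same letter merge first, which is the kind of grouping by author that a dendrogram would make visible.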


2021 ◽  
Vol 93 ◽  
pp. 03026
Author(s):  
Andrei Zenkov ◽  
Eugene Zenkov ◽  
Ansgar Belke

Two approaches to the statistical analysis of texts are suggested, both based on the study of numerals occurring in literary texts. The first approach studies the frequency distribution of the leading digits of numerals occurring in the text. This approach is convenient for testing whether a group of texts has common authorship: common authorship is doubtful if the frequency distributions are sufficiently different. The second approach studies the frequencies of the numerals themselves. It yields information about the author and about the stylistic and genre peculiarities of the texts, and is suited to the advanced study of authorial texts. We check the hypothesis that I. Ilf and E. Petrov are not the true authors of the novels "The Twelve Chairs" and "The Little Golden Calf", and that these novels were ghostwritten by M. Bulgakov. Neither the frequency distribution of numerals nor its cluster analysis confirms this hypothesis.


2014 ◽  
Vol 53 (1) ◽  
pp. 87-98
Author(s):  
Anna J. Kwiatkowska

The paper presents the results of a statistical analysis of the type of frequency distribution of species occurring in the field layers of two forest phytocoenoses. In both cases, frequency distributions were determined for surface areas of 1, 2, 4, 8, 16, 32 and 64 m². The types of frequency distributions were determined from the values of Fisher's and Pearson's K coefficients. The analysed distributions were classified within Pearson's system. The sample plot sizes at which the empirical frequency distributions were statistically symmetrical, and those at which they were U-shaped, were also determined.
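One way such a classification could be computed is sketched below. The abstract does not give its formulas, so this assumes the standard moment ratios β1 = m3²/m2³ and β2 = m4/m2² and Pearson's criterion κ = β1(β2 + 3)² / [4(4β2 − 3β1)(2β2 − 3β1 − 6)]; the function names and sample data are hypothetical.

```python
import math

def central_moment(values, order):
    """Central moment of the given order (population form, divisor n)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)

def pearson_betas(values):
    """Return (beta1, beta2): squared skewness and kurtosis ratios."""
    m2 = central_moment(values, 2)
    m3 = central_moment(values, 3)
    m4 = central_moment(values, 4)
    return m3 ** 2 / m2 ** 3, m4 / m2 ** 2

def pearson_kappa(values):
    """Pearson's criterion kappa; NaN when the denominator vanishes."""
    b1, b2 = pearson_betas(values)
    denom = 4 * (4 * b2 - 3 * b1) * (2 * b2 - 3 * b1 - 6)
    return b1 * (b2 + 3) ** 2 / denom if denom != 0 else math.nan

# Hypothetical frequency data: one symmetric sample and one skewed sample.
symmetric = [1, 2, 3, 4, 5]
skewed = [1, 1, 1, 2, 5]

print(pearson_betas(symmetric), pearson_kappa(symmetric))
print(pearson_betas(skewed), pearson_kappa(skewed))
```

A value of β1 close to zero indicates a symmetrical distribution, which is the condition the abstract reports checking at each plot size.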


2019 ◽  
Vol 69 (2) ◽  
pp. 217-239
Author(s):  
Vladan Pavlović ◽  
Goranka Knežević ◽  
Marijana Joksimović ◽  
Dušan Joksimović

Benford's Law is a useful tool for detecting fraud in financial statements. In this paper we apply it to the financial item named ‘Work performed by the undertaking for its own purpose and capitalised’. The data are taken from the financial reports that all companies submitted to the Serbian Business Register Agency for the period 2008–2013. We conclude that, with very high probability, the frequency distribution of the second digit does not satisfy Benford's Law. This implies that the second digit of this item in the financial statements has usually been manipulated. The result confirms our hypothesis that financial statement fraud is usually carried out through the second digit.


2014 ◽  
Vol 54 (4) ◽  
pp. 465-475
Author(s):  
Anna J. Kwiatkowska ◽  
Ewa Symonides

Homogeneity of the Leucobryo-Pinetum phytocoenosis was assessed on the basis of the species frequency distribution and the frequency distributions of the total ground-layer biomass and of the biomass of individual species. It was confirmed that: 1) the species frequency distribution and the frequency distribution of biomass, as well as their statistical characteristics, depended on the area size, and 2) for the analysed phytocoenosis, the area at which the frequency distributions of both measures were symmetrical could be determined. The studies showed that phytocoenosis homogeneity was related only to a definite area size, i.e. to a definite scale of its spatial differentiation.

