Stability of the syntagmatic probability distributions

The aim of the present study is to establish criteria for the optimal size of a corpus that can provide stable conditional probabilities of morphological and/or syntagmatic types. The optimal corpus size is defined as the smallest sample that generates a probability distribution equal to the distribution derived from a large sample that generates stable probabilities. We refer to the latter as the 'target distribution'. To establish these criteria we varied the sample size, the word sequence size (bigrams and trigrams), the sampling procedure (randomly chosen words and continuous text) and the position of the target word in a sequence. The distributions of conditional probabilities derived from smaller samples were correlated with the target distributions. The sample size at which a probability distribution reaches maximal correlation (r=1) with the target distribution was taken as optimal. The research was done on the Corpus of Serbian language. For bigrams, the optimal sample size for random word selection is 65,000 words; for trigrams it is 281,000 words. In contrast, continuous text sampling requires much larger samples to reach stability: 810,000 words for bigrams and 868,000 words for trigrams. The factors that caused these differences remain unclear and need additional empirical investigation.
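The comparison of a small-sample distribution with a target distribution can be illustrated with a minimal sketch. It is not the authors' procedure: the toy token stream, the function names and the choice to score bigrams missing from the sample as probability 0 are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def bigram_conditional_probs(tokens):
    """Return {(w1, w2): P(w2 | w1)} estimated from consecutive token pairs."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    first_counts = Counter(tokens[:-1])  # every token that has a successor
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}


def pearson(xs, ys):
    """Pearson correlation; returns NaN when either input has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def correlation_with_target(sample_tokens, target_probs):
    """Correlate sample conditional probabilities with the target ones.

    Assumption: a bigram absent from the sample counts as probability 0.
    """
    sample_probs = bigram_conditional_probs(sample_tokens)
    keys = sorted(target_probs)
    return pearson([sample_probs.get(k, 0.0) for k in keys],
                   [target_probs[k] for k in keys])


# Hypothetical toy token stream standing in for a real corpus.
corpus = "a b a c a b a b a c".split() * 50
target = bigram_conditional_probs(corpus)
for size in (20, 100, 500):
    # Continuous-text sampling: take a prefix of the stream.
    print(size, correlation_with_target(corpus[:size], target))
```

Taking a prefix mimics continuous-text sampling; drawing tokens at random positions would be the analogue of the random-word condition.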

Regression equations of probability plot correlation coefficient test statistics using machine learning

10.5194/egusphere-egu2020-12315 ◽

2020 ◽

Author(s):

Hyunjun Ahn ◽

Sunghun Kim ◽

Joohyung Lee ◽

Jun-Haeng Heo

Keyword(s):

Machine Learning ◽

Probability Distribution ◽

Sample Size ◽

Correlation Coefficient ◽

Goodness Of Fit ◽

Probability Distributions ◽

Distribution Model ◽

Regression Equations ◽

Probability Plot ◽

Sample Data

In the field of hydrological extremes, it is essential to find the probability distribution model that is most appropriate for the sample data in order to estimate reasonable probability quantiles. Depending on the assumed probability distribution model, the estimated quantile can take quite different values. The probability plot correlation coefficient (PPCC) test is one of the goodness-of-fit tests for finding suitable probability distributions for a given sample. The PPCC test determines whether an assumed probability distribution is acceptable for the sample data using the correlation coefficient between the sample data and the theoretical quantiles of the assumed distribution. For a three-parameter probability distribution, the critical values are presented as a two-dimensional table that depends on the sample size and the shape parameter of the model. In this study, the applicability and utility of machine learning in the hydrology field were examined. To make the PPCC test easier to use, a regression equation was derived using a machine learning algorithm with two variables: sample size and shape parameter.
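The PPCC statistic itself can be sketched as follows. This is an illustrative sketch only: it assumes a normal distribution (which has no shape parameter, unlike the three-parameter models in the abstract) and Blom's plotting positions, neither of which is stated in the source, and it does not include critical values.

```python
from math import exp, sqrt
from statistics import NormalDist


def plotting_positions(n, a=0.375):
    """Blom-type plotting positions (i - a) / (n + 1 - 2a); a=0.375 is an assumption."""
    return [(i - a) / (n + 1 - 2 * a) for i in range(1, n + 1)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def ppcc_normal(sample):
    """Correlation between the ordered sample and standard normal quantiles."""
    ordered = sorted(sample)
    quantiles = [NormalDist().inv_cdf(p) for p in plotting_positions(len(ordered))]
    return pearson(ordered, quantiles)


# Hypothetical data: one sample lying exactly on normal quantiles, one skewed.
positions = plotting_positions(20)
normal_like = [NormalDist(10, 2).inv_cdf(p) for p in positions]
skewed = [exp(NormalDist().inv_cdf(p)) for p in positions]
print(ppcc_normal(normal_like), ppcc_normal(skewed))
```

A value close to 1 supports the assumed distribution; in the actual test the statistic is compared against a critical value that depends on the sample size and, for three-parameter models, the shape parameter.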

Probability estimate and the optimal text size

Psihologija ◽

10.2298/psi0801035k ◽

2008 ◽

Vol 41 (1) ◽

pp. 35-51 ◽

Cited By ~ 1

Author(s):

Aleksandar Kostic ◽

Svetlana Ilic ◽

Petar Milin

Keyword(s):

Probability Distribution ◽

Sample Size ◽

Probability Distributions ◽

A Priori ◽

Optimal Sample Size ◽

Minimal Sample ◽

Optimal Sample ◽

Geometrical Progression ◽

The One ◽

Stable Probability Distributions

A reliable language corpus implies a text sample of size n that provides stable probability distributions of linguistic phenomena. The question is what the minimal (i.e. optimal) text size is at which the probabilities of linguistic phenomena become stable. Specifically, we were interested in the probabilities of grammatical forms. We started with the a priori assumption that a text of 1,000,000 words is sufficient to provide stable probability distributions. We treated a text of this size as a "quasi-population". The probability distribution derived from the "quasi-population" was then correlated with the probability distribution obtained from a minimal sample (32 items) for a given linguistic category (e.g. nouns). The correlation coefficient was treated as a measure of similarity between the two probability distributions. The minimal sample was increased by geometrical progression, up to the size at which the correlation between the distribution derived from the quasi-population and the one derived from the increased sample reached its maximum (r=1). The optimal sample size was established for grammatical forms of nouns, adjectives and verbs. A general formalism is proposed that allows the optimal sample size to be estimated from the minimal sample (i.e. 32 items).
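The growth of the sample by geometrical progression can be sketched as below. This is a hedged illustration, not the authors' formalism: the doubling ratio, the threshold of 0.999 standing in for r=1, the random seed and the toy "quasi-population" of case labels are all assumptions.

```python
import random
from collections import Counter
from math import sqrt


def distribution(items, categories):
    """Relative frequency of each category, in a fixed category order."""
    counts = Counter(items)
    return [counts[c] / len(items) for c in categories]


def pearson(xs, ys):
    """Pearson correlation; returns NaN when either input has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def optimal_sample_size(population, start=32, ratio=2, threshold=0.999, seed=0):
    """Grow the sample geometrically until it correlates with the population.

    Returns (size, r) for the first size reaching the threshold, else None.
    The ratio and threshold are assumptions, not values from the study.
    """
    rng = random.Random(seed)
    categories = sorted(set(population))
    target = distribution(population, categories)
    size = start
    while size <= len(population):
        sample = rng.sample(population, size)
        r = pearson(distribution(sample, categories), target)
        if r >= threshold:
            return size, r
        size *= ratio
    return None


# Hypothetical quasi-population of noun case labels.
population = ["nom"] * 5000 + ["gen"] * 3000 + ["acc"] * 1500 + ["dat"] * 500
print(optimal_sample_size(population))
```

A threshold slightly below 1 is used because a finite random sample rarely reproduces the population distribution exactly.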

WIGNER FUNCTION OF PULSED FIELDS RECONSTRUCTED BY DIRECT DETECTION

International Journal of Quantum Information ◽

10.1142/s0219749911006995 ◽

2011 ◽

Vol 09 (supp01) ◽

pp. 39-47

Author(s):

ALESSIA ALLEVI ◽

MARIA BONDANI ◽

ALESSANDRA ANDREONI

Keyword(s):

Probability Distribution ◽

Wigner Function ◽

Beam Splitter ◽

Direct Detection ◽

Probability Distributions ◽

Linear Regime ◽

Wigner Functions ◽

Intensity Measurements ◽

Pulsed Fields

We present the experimental reconstruction of the Wigner function of some optical states. The method is based on direct intensity measurements by non-ideal photodetectors operated in the linear regime. The signal state is mixed at a beam-splitter with a set of coherent probes of known complex amplitudes and the probability distribution of the detected photons is measured. The Wigner function is given by a suitable sum of these probability distributions measured for different values of the probe. For comparison, the same data are analyzed to obtain the number distributions and the Wigner functions for photons.

Analysis of Magnitude and Frequency of Floods in the Damanganga Basin: Western India

Hydrospatial Analysis ◽

10.21523/gcj3.2021050101 ◽

2021 ◽

Vol 5 (1) ◽

pp. 1-11

Author(s):

Vitthal Anwat ◽

Pramodkumar Hire ◽

Uttam Pawar ◽

Rajendra Gunjal

Keyword(s):

Probability Distribution ◽

Probability Distributions ◽

Flood Frequency ◽

Flood Frequency Analysis ◽

Western India ◽

Type I ◽

Return Periods ◽

Pearson Type ◽

Kolmogorov Smirnov ◽

Anderson Darling

The Flood Frequency Analysis (FFA) method was introduced by Fuller in 1914 to understand the magnitude and frequency of floods. The present study uses the two most widely accepted probability distributions for FFA, namely Gumbel Extreme Value type I (GEVI) and Log Pearson type III (LP-III). The Kolmogorov-Smirnov (KS) and Anderson-Darling (AD) methods were used to select the most suitable probability distribution at sites in the Damanganga Basin. Moreover, discharges were estimated for various return periods using GEVI and LP-III. The recurrence interval of the largest peak flood on record (Qmax) is 107 years (at Nanipalsan) and 146 years (at Ozarkhed) as per LP-III. The Flood Frequency Curves (FFC) indicate that LP-III is the best-fitting probability distribution for FFA of the Damanganga Basin. Therefore, the discharges and return periods estimated by the LP-III probability distribution are more reliable and can be used for designing hydraulic structures.
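How a discharge is estimated for a given return period can be sketched for the Gumbel (GEVI) case. The sketch assumes method-of-moments parameter estimates and uses made-up annual peak discharges; the study's own fitting method and data are not given in the abstract, and LP-III is not covered here.

```python
from math import exp, log, pi, sqrt
from statistics import mean, stdev

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def gumbel_fit_moments(peaks):
    """Method-of-moments estimates (location mu, scale beta) for Gumbel type I."""
    beta = stdev(peaks) * sqrt(6) / pi
    mu = mean(peaks) - EULER_GAMMA * beta
    return mu, beta


def gumbel_quantile(mu, beta, return_period):
    """Discharge exceeded on average once every `return_period` years."""
    reduced_variate = -log(-log(1.0 - 1.0 / return_period))
    return mu + beta * reduced_variate


def gumbel_return_period(mu, beta, discharge):
    """Return period (years) of a given discharge under the fitted model."""
    non_exceedance = exp(-exp(-(discharge - mu) / beta))
    return 1.0 / (1.0 - non_exceedance)


# Hypothetical annual peak discharges in m^3/s (not data from the study).
peaks = [820, 1150, 640, 2300, 980, 1500, 730, 1900, 1120, 860]
mu, beta = gumbel_fit_moments(peaks)
for years in (10, 50, 100):
    print(years, round(gumbel_quantile(mu, beta, years), 1))
```

The same fitted parameters give the recurrence interval of an observed peak through `gumbel_return_period`, which is the inverse of `gumbel_quantile`.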

Analysis and Synthesis of Mechanical Error in Universal Joints

16th Design Automation Conference: Volume 2 — Optimal Design and Mechanical Systems Analysis ◽

10.1115/detc1990-0090 ◽

1990 ◽

Author(s):

J. L. Cagney ◽

S. S. Rao

Keyword(s):

Probability Distribution ◽

Real World ◽

Probability Distributions ◽

Manufacturing Cost ◽

Universal Joint ◽

Accuracy Requirement ◽

Output Error ◽

Manufacturing Errors ◽

Analysis And Synthesis ◽

Limiting Value

The modeling of manufacturing errors in mechanisms is a significant task in validating practical designs. The use of probability distributions for errors can simulate manufacturing variations and real-world operations. This paper presents the mechanical error analysis of universal joint drivelines. Each error is simulated using a probability distribution, i.e., a design of the mechanism is created by assigning random values to the errors. Each design is then evaluated by comparing the output error with a limiting value, and the reliability of the universal joint is estimated. For this, the design is considered a failure whenever the output error exceeds the specified limit. In addition, the problem of synthesis, which involves the allocation of tolerances (errors) for minimum manufacturing cost without violating a specified accuracy requirement of the output, is also considered. Three probability distributions (normal, Weibull and beta) were used to simulate the random values of the errors. The similarity of the results given by the three distributions suggests that the use of the normal distribution would be acceptable for modeling the tolerances in most cases.
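The reliability estimate described above can be sketched as a small Monte Carlo simulation. The sketch is illustrative only: the linear output-error model, the sensitivities, the tolerances and the convention that a tolerance equals three standard deviations are all hypothetical and do not represent the universal joint kinematics used in the paper.

```python
import random


def output_error(errors, sensitivities):
    """Hypothetical linear model: output error as a weighted sum of part errors."""
    return sum(s * e for s, e in zip(sensitivities, errors))


def estimate_reliability(tolerances, sensitivities, limit, trials=10_000, seed=0):
    """Fraction of randomly generated designs whose output error stays within limit.

    Each part error is drawn from a normal distribution with mean 0 and
    standard deviation tolerance / 3 (an assumed convention).
    """
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        errors = [rng.gauss(0.0, t / 3.0) for t in tolerances]
        # A design fails when the output error exceeds the specified limit.
        if abs(output_error(errors, sensitivities)) > limit:
            failures += 1
    return 1.0 - failures / trials


# Hypothetical tolerances and sensitivities for three error sources.
tolerances = [0.1, 0.2, 0.05]
sensitivities = [1.0, 0.5, 2.0]
print(estimate_reliability(tolerances, sensitivities, limit=0.1))
```

Swapping `rng.gauss` for `rng.weibullvariate` or `rng.betavariate` (suitably shifted and scaled) would allow the kind of comparison between distributions that the abstract reports.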

Field-theoretic density estimation for biological sequence space with applications to 5′ splice site diversity and aneuploidy in cancer

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.2025782118 ◽

2021 ◽

Vol 118 (40) ◽

pp. e2025782118

Author(s):

Wei-Chia Chen ◽

Juannan Zhou ◽

Jason M. Sheltzer ◽

Justin B. Kinney ◽

David M. McCandlish

Keyword(s):

Probability Distribution ◽

Maximum Entropy ◽

Density Estimation ◽

Sequence Space ◽

Chromosomal Abnormalities ◽

Fundamental Problem ◽

Probability Distributions ◽

Biological Sequence ◽

Point Estimates ◽

Site Diversity

Density estimation in sequence space is a fundamental problem in machine learning that is also of great importance in computational biology. Due to the discrete nature and large dimensionality of sequence space, how best to estimate such probability distributions from a sample of observed sequences remains unclear. One common strategy for addressing this problem is to estimate the probability distribution using maximum entropy (i.e., calculating point estimates for some set of correlations based on the observed sequences and predicting the probability distribution that is as uniform as possible while still matching these point estimates). Building on recent advances in Bayesian field-theoretic density estimation, we present a generalization of this maximum entropy approach that provides greater expressivity in regions of sequence space where data are plentiful while still maintaining a conservative maximum entropy character in regions of sequence space where data are sparse or absent. In particular, we define a family of priors for probability distributions over sequence space with a single hyperparameter that controls the expected magnitude of higher-order correlations. This family of priors then results in a corresponding one-dimensional family of maximum a posteriori estimates that interpolate smoothly between the maximum entropy estimate and the observed sample frequencies. To demonstrate the power of this method, we use it to explore the high-dimensional geometry of the distribution of 5′ splice sites found in the human genome and to understand patterns of chromosomal abnormalities across human cancers.

Estimating the Effect of the Shape Parameter and the Sample Size on Probability Distributions Using the Maximum Likehood

رماح للبحوث و الدراسات ◽

10.12816/0054882 ◽

2019 ◽

pp. 278-301

Author(s):

Hamza, Hamza Ibrahim

Keyword(s):

Sample Size ◽

Shape Parameter ◽

Probability Distributions

A best-fit probability distribution for the estimation of rainfall in northern regions of Pakistan

Open Life Sciences ◽

10.1515/biol-2016-0057 ◽

2016 ◽

Vol 11 (1) ◽

pp. 432-440 ◽

Cited By ~ 10

Author(s):

M. T. Amin ◽

M. Rizwan ◽

A. A. Alazba

Keyword(s):

Probability Distribution ◽

Goodness Of Fit ◽

Probability Distributions ◽

Future Research ◽

Maximum Rainfall ◽

Type Iii ◽

Goodness Of Fit Tests ◽

Pearson Type ◽

Northern Regions ◽

Best Fit

This study was designed to find the best-fit probability distribution of annual maximum rainfall based on a twenty-four-hour sample in the northern regions of Pakistan using four probability distributions: normal, log-normal, log-Pearson type-III and Gumbel max. Based on the scores of goodness-of-fit tests, the normal distribution was found to be the best-fit probability distribution at the Mardan rainfall gauging station. The log-Pearson type-III distribution was found to be the best-fit probability distribution at the rest of the rainfall gauging stations. The maximum values of expected rainfall were calculated using the best-fit probability distributions and can be used by design engineers in future research.

Constraints on probability distributions of grammatical forms

Psihologija ◽

10.2298/psi0701005k ◽

2007 ◽

Vol 40 (1) ◽

pp. 5-35

Author(s):

Aleksandar Kostic ◽

Milena Bozic

Keyword(s):

Probability Distribution ◽

Relative Entropy ◽

Probability Distributions ◽

The Third ◽

Singular Form ◽

And Gender ◽

Grammatical Number ◽

Morphological System

In this study we investigate the constraints on the probability distribution of grammatical forms within morphological paradigms of the Serbian language, where a paradigm is specified as a coherent set of elements with defined criteria for inclusion. Thus, for example, in Serbian all feminine nouns that end with the suffix "a" in their nominative singular form belong to the third declension, the declension being a paradigm. The notion of a paradigm can be extended to other criteria as well; hence, we can also think of noun cases, irrespective of grammatical number and gender, or noun gender, irrespective of case and grammatical number, as paradigms. We took the relative entropy as a measure of the homogeneity of the probability distribution within paradigms. The analysis was performed on 116 morphological paradigms of typical Serbian, and the relative entropy was calculated for each paradigm. The results indicate that for most paradigms the relative entropy values fall within the range 0.75 to 0.9. The nonhomogeneous distribution of relative entropy values allows the relative entropy of the morphological system as a whole to be estimated. This value is 0.69 and can tentatively be taken as an index of stability of the morphological system.
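A relative entropy value for a single paradigm can be computed as in the sketch below. It assumes that "relative entropy" here means Shannon entropy divided by its maximum for the given number of forms (so that values lie between 0 and 1), which matches the reported range but is not stated explicitly in the abstract; the example probabilities are invented.

```python
from math import log2


def relative_entropy(probs):
    """Shannon entropy of `probs` divided by log2 of the number of forms.

    Assumed definition: 1.0 means a perfectly homogeneous (uniform)
    distribution, 0.0 means all probability falls on one form.
    """
    if len(probs) < 2:
        raise ValueError("a paradigm needs at least two forms")
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    entropy = -sum(p * log2(p) for p in probs if p > 0)
    return entropy / log2(len(probs))


# Hypothetical probabilities of seven case forms within one paradigm.
case_probs = [0.30, 0.25, 0.05, 0.20, 0.02, 0.08, 0.10]
print(round(relative_entropy(case_probs), 3))
```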

Posterior Probability on Finite Set

Formalized Mathematics ◽

10.2478/v10037-012-0030-0 ◽

2012 ◽

Vol 20 (4) ◽

pp. 257-263

Author(s):

Hiroyuki Okazaki

Keyword(s):

Probability Distribution ◽

Probability Measure ◽

Posterior Probability ◽

Probability Distributions ◽

Sample Space ◽

Finite Sample ◽

Finite Set

In [14] we formalized probability and probability distribution on a finite sample space. In this article we first propose a formalization of the class of finite sample spaces whose elements' probability distributions are equivalent to each other. Next, we formalize the probability measure of the class of sample spaces formalized above. Finally, we formalize sampling and posterior probability.
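Posterior probability on a finite sample space can be illustrated numerically. This is a plain computational sketch of P(A | B) = P(A ∩ B) / P(B), not the formalization developed in the article; the fair-die example and the function names are assumptions.

```python
from fractions import Fraction


def probability(event, dist):
    """Probability of an event (a set of outcomes) under a finite distribution."""
    return sum(dist[outcome] for outcome in event)


def posterior(event_a, event_b, dist):
    """Posterior probability P(A | B) = P(A and B) / P(B)."""
    p_b = probability(event_b, dist)
    if p_b == 0:
        raise ValueError("conditioning event has probability zero")
    return probability(event_a & event_b, dist) / p_b


# Hypothetical finite sample space: one roll of a fair six-sided die.
die = {face: Fraction(1, 6) for face in range(1, 7)}
even = {2, 4, 6}
high = {4, 5, 6}
print(posterior(even, high, die))  # 2/3
```

Exact rational arithmetic with `Fraction` avoids rounding error, which suits a finite sample space with rational probabilities.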
