Probability estimate and the optimal text size

Psihologija ◽  
2008 ◽  
Vol 41 (1) ◽  
pp. 35-51 ◽  
Author(s):  
Aleksandar Kostic ◽  
Svetlana Ilic ◽  
Petar Milin

A reliable language corpus implies a text sample of size n that provides stable probability distributions of linguistic phenomena. The question is what the minimal (i.e. optimal) text size is at which the probabilities of linguistic phenomena become stable. Specifically, we were interested in the probabilities of grammatical forms. We started with the a priori assumption that a text size of 1,000,000 words is sufficient to provide stable probability distributions, and treated a text of this size as a "quasi-population". The probability distribution derived from the "quasi-population" was then correlated with the probability distribution obtained from a minimal sample (32 items) for a given linguistic category (e.g. nouns). The correlation coefficient was treated as a measure of similarity between the two probability distributions. The minimal sample was increased by geometric progression, up to the size at which the correlation between the distribution derived from the quasi-population and the one derived from the increased sample reached its maximum (r = 1). The optimal sample size was established for grammatical forms of nouns, adjectives and verbs. A general formalism is proposed that allows estimation of the optimal sample size from the minimal sample (i.e. 32 items).
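A minimal sketch of the procedure described above, assuming a tokenized corpus with grammatical-form labels; the function names and the near-1 correlation threshold (used instead of exactly r = 1 for numerical reasons) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def form_probabilities(tokens, forms):
    """Relative frequencies of the given grammatical forms in a token sample."""
    counts = np.array([tokens.count(f) for f in forms], dtype=float)
    total = counts.sum()
    return counts / total if total > 0 else counts

def optimal_sample_size(corpus_tokens, forms, start=32, r_target=0.999):
    """Grow the sample geometrically (32, 64, 128, ...) until its form-probability
    distribution correlates (Pearson r) near-perfectly with the quasi-population's."""
    reference = form_probabilities(corpus_tokens, forms)   # "quasi-population" distribution
    n = start
    while n <= len(corpus_tokens):
        sample = corpus_tokens[:n]                          # or a random sample of size n
        r = np.corrcoef(form_probabilities(sample, forms), reference)[0, 1]
        if r >= r_target:
            return n, r
        n *= 2                                              # geometric progression
    # Fall back to the whole corpus, where r = 1 by construction.
    return len(corpus_tokens), 1.0
```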

2020 ◽  
Vol 29 (10) ◽  
pp. 2958-2971 ◽  
Author(s):  
Maria Stark ◽  
Antonia Zapf

Introduction: In a confirmatory diagnostic accuracy study, sensitivity and specificity are considered as co-primary endpoints. For the sample size calculation, the prevalence of the target population must be taken into account to obtain a representative sample. In this context, a general problem arises: with a low or high prevalence, the study may be overpowered in one subpopulation. A further issue is the correct pre-specification of the true prevalence; an incorrect assumption about the prevalence results in an over- or underestimated sample size.
Methods: To obtain the desired power independently of the prevalence, a method for optimal sample size calculation is proposed for comparing an experimental diagnostic test against a pre-specified minimum sensitivity and specificity. To address the problem of an incorrectly pre-specified prevalence, a blinded one-time re-estimation design and a blinded repeated re-estimation design of the sample size based on the prevalence are evaluated in a simulation study. Both designs are compared to a fixed design and, additionally, to each other.
Results: The type I error rates of both blinded re-estimation designs are not inflated. Their empirical overall power equals the desired theoretical power, and both designs yield unbiased estimates of the prevalence. The repeated re-estimation design shows no advantage over the one-time re-estimation design with respect to the mean squared error of the re-estimated prevalence or sample size. The appropriate size of the internal pilot study in the one-time re-estimation design is 50% of the initially calculated sample size.
Conclusions: A one-time re-estimation design of the prevalence based on the optimal sample size calculation is recommended in single-arm diagnostic accuracy studies.
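A minimal sketch of a prevalence-aware sample size calculation for co-primary sensitivity and specificity, assuming a standard normal-approximation formula for a one-sided test of a single proportion against a minimum threshold; the function names, default alpha/beta values and the max-over-subgroups rule are illustrative assumptions, not necessarily the authors' exact formulas.

```python
from math import ceil, sqrt
from scipy.stats import norm

def n_one_proportion(p0, p1, alpha=0.025, beta=0.2):
    """Normal-approximation sample size for a one-sided test of a proportion
    against the threshold p0, assuming the true value is p1 (> p0)."""
    z_a, z_b = norm.ppf(1 - alpha), norm.ppf(1 - beta)
    num = (z_a * sqrt(p0 * (1 - p0)) + z_b * sqrt(p1 * (1 - p1))) ** 2
    return ceil(num / (p1 - p0) ** 2)

def optimal_total_n(se0, se1, sp0, sp1, prevalence, alpha=0.025, beta=0.2):
    """Overall sample size such that both the diseased and the non-diseased
    subgroup reach their required sizes at the assumed prevalence."""
    n_diseased = n_one_proportion(se0, se1, alpha, beta)       # drives sensitivity power
    n_nondiseased = n_one_proportion(sp0, sp1, alpha, beta)    # drives specificity power
    return max(ceil(n_diseased / prevalence), ceil(n_nondiseased / (1 - prevalence)))

# e.g. minimum Se/Sp of 0.80, expected Se/Sp of 0.90, assumed prevalence 0.30
print(optimal_total_n(0.80, 0.90, 0.80, 0.90, prevalence=0.30))
```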


Kybernetes ◽  
2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Yang Liu ◽  
Yi Chen ◽  
Kefan Xie ◽  
Jia Liu

Purpose: This research aims to determine whether the pool testing method for SARS-CoV-2 (COVID-19) is effective and what the optimal sample size in one bunch is. Additionally, since the infection rate was unknown at the beginning of the pandemic, this research proposes a multiple sampling approach that enables the pool testing method to be applied successfully.
Design/methodology/approach: The authors verify, based on probabilistic modeling, that the pool testing method for SARS-CoV-2 is effective under a shortage of nucleic acid detection kits. In this method, several samples are tested together as one bunch. If the test result of the bunch is negative, none of the cases in the bunch has been infected with the novel coronavirus. Conversely, if the test result of the bunch is positive, the samples are tested one by one to confirm which cases are infected.
Findings: If the infection rate is extremely low, then with the same number of detection kits the expected number of cases that can be tested by the pool testing method is far greater than by one-by-one testing. The pool testing method is effective only when the infection rate is less than 0.3078. The higher the infection rate, the smaller the optimal sample size in one bunch. If N samples are tested by the pool testing method with a bunch size of G, the number of detection kits required lies in the interval (N/G, N).
Originality/value: This research proves that the pool testing method is suitable not only for a shortage of detection kits but also for overall or sampling-based detection in a large population. More importantly, it calculates the optimal sample size in one bunch for different infection rates. Additionally, a multiple sampling approach is proposed, in which the whole testing process is divided into several rounds with different bunch sizes, and the actual infection rate is estimated with increasing precision by sampling inspection in each round.
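A minimal sketch of the two-stage pooling scheme the abstract describes (one test per bunch, followed by individual retests when the pool is positive), in the standard Dorfman form; the function names and the search range for G are illustrative assumptions. The threshold near 0.31 and the (N/G, N) kit interval follow from this expected-tests calculation.

```python
def expected_tests_per_person(p, g):
    """Dorfman pooling: 1 pooled test per bunch of g, plus g individual
    retests when the pool is positive (probability 1 - (1 - p)**g)."""
    return 1.0 / g + 1.0 - (1.0 - p) ** g

def optimal_group_size(p, g_max=100):
    """Bunch size minimizing the expected number of tests per person."""
    best_g = min(range(2, g_max + 1), key=lambda g: expected_tests_per_person(p, g))
    return best_g, expected_tests_per_person(p, best_g)

for p in (0.001, 0.01, 0.05, 0.1, 0.3):
    g, e = optimal_group_size(p)
    pooling_pays_off = e < 1.0   # otherwise one-by-one testing needs fewer kits
    print(f"p={p:.3f}  optimal G={g}  tests/person={e:.3f}  effective={pooling_pays_off}")
```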


2019 ◽  
Vol 12 (08) ◽  
pp. 1950086 ◽
Author(s):  
Carlos N. Bouza-Herrera ◽  
Sira M. Allende-Alonso ◽  
Gajendra K. Vishwakarma ◽  
Neha Singh

In many medical research studies, it is necessary to determine the optimal sample size allocation in a heterogeneous population. This paper proposes an algorithm for optimal sample size allocation. We formulate the optimal allocation problem as an optimization problem and obtain the solution using the Bisection, Secant, Regula–Falsi and other numerical methods. The performance of the algorithm with the different numerical methods is analyzed and evaluated in terms of computing time, number of iterations and gain in accuracy due to stratification. The efficacy of the algorithm is evaluated for the response, in terms of body mass index (BMI), to a dietetic supplement in patients with diabetes mellitus, HIV/AIDS and post-operative cancer recovery.
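A minimal sketch of solving a stratified allocation problem by root-finding, assuming Neyman allocation and a target variance for the stratified mean; the strata sizes, standard deviations and variance target are illustrative, and bisection stands in for the paper's broader comparison of numerical methods.

```python
def stratified_variance(n, N, S):
    """Variance of the stratified sample mean under Neyman allocation of a
    total sample size n (with finite-population correction)."""
    pop = float(sum(N))
    total = sum(Nh * Sh for Nh, Sh in zip(N, S))
    var = 0.0
    for Nh, Sh in zip(N, S):
        Wh = Nh / pop
        nh = n * (Nh * Sh) / total        # Neyman share of the total size n
        var += Wh ** 2 * Sh ** 2 * (1.0 / nh - 1.0 / Nh)
    return var

def bisection_sample_size(N, S, target_var, lo=2.0, hi=None, tol=1e-6):
    """Total sample size n with stratified_variance(n) == target_var, found by
    bisection; the variance is monotone decreasing in n, so the root is unique."""
    hi = hi if hi is not None else float(sum(N))
    f = lambda n: stratified_variance(n, N, S) - target_var
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return hi

# Illustrative strata sizes and standard deviations (e.g. BMI by patient group).
N, S = [500, 300, 200], [4.0, 6.0, 3.0]
print(round(bisection_sample_size(N, S, target_var=0.05), 1))
```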


2017 ◽  
Vol 60 (1) ◽  
pp. 155-173 ◽  
Author(s):  
Pier Francesco Perri ◽  
María del Mar Rueda García ◽  
Beatriz Cobo Rodríguez
