correlated binomial
Recently Published Documents


Total documents: 31 (five years: 4)
H-index: 8 (five years: 1)

2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Samir Rachid Zaim ◽  
Colleen Kenost ◽  
Joanne Berghout ◽  
Wesley Chiu ◽  
Liam Wilson ◽  
...  

Background: In this era of data science-driven bioinformatics, machine learning research has focused on feature selection as users want more interpretation and post-hoc analyses for biomarker detection. However, when there are more features (i.e., transcripts) than samples (i.e., mice or human samples) in a study, it poses major statistical challenges in biomarker detection tasks as traditional statistical techniques are underpowered in high dimension. Second and third order interactions of these features pose a substantial combinatoric dimensional challenge. In computational biology, random forest (RF) classifiers are widely used due to their flexibility, powerful performance, their ability to rank features, and their robustness to the “P ≫ N” high-dimensional limitation that many matrix regression algorithms face. We propose binomialRF, a feature selection technique in RFs that provides an alternative interpretation for features using a correlated binomial distribution and scales efficiently to analyze multiway interactions.
Results: In both simulations and validation studies using datasets from the TCGA and UCI repositories, binomialRF showed computational gains (up to 5 to 300 times faster) while maintaining competitive variable precision and recall in identifying biomarkers’ main effects and interactions. In two clinical studies, the binomialRF algorithm prioritizes previously-published relevant pathological molecular mechanisms (features) with high classification precision and recall using features alone, as well as with their statistical interactions alone.
Conclusion: binomialRF extends upon previous methods for identifying interpretable features in RFs and brings them together under a correlated binomial distribution to create an efficient hypothesis testing algorithm that identifies biomarkers’ main effects and interactions. Preliminary results in simulations demonstrate computational gains while retaining competitive model selection and classification accuracies. Future work will extend this framework to incorporate ontologies that provide pathway-level feature selection from gene expression input data.


Author(s):  
Nien Fan Zhang

In the branch of forensic science known as firearm evidence identification, estimating error rates is a fundamental challenge. Recently, a new quantitative approach known as the congruent matching cells (CMC) method was developed to improve the accuracy of ballistic identifications and provide a basis for estimating error rates. To estimate error rates, the key is to find an appropriate probability distribution for the relative frequency distribution of observed CMCs overlaid on a relevant measured firearm surface such as the breech face of a cartridge case. Several probability models based on the assumption of independence between cell pair comparisons have been proposed, but the assumption of independence among the cell pair comparisons from the CMC method may not be valid. This article proposes statistical models based on dependent Bernoulli trials, along with corresponding methodology for parameter estimation. To demonstrate the potential improvement from the use of the dependent Bernoulli trial model, the methodology is applied to an actual data set of fired cartridge cases.
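
To see why the independence assumption matters for a count of matching cells, it may help to look at a simplified case. The sketch below assumes n Bernoulli trials with a common success probability p and a common pairwise correlation ρ; these symbols and the equal-correlation assumption are illustrative and are not taken from the article, whose dependent Bernoulli models may differ.

```latex
\begin{align*}
X &= \sum_{i=1}^{n} Y_i, \qquad Y_i \sim \mathrm{Bernoulli}(p), \qquad
     \mathrm{Corr}(Y_i, Y_j) = \rho \quad (i \neq j) \\
\mathrm{Var}(X) &= \sum_{i=1}^{n} \mathrm{Var}(Y_i)
                 + \sum_{i \neq j} \mathrm{Cov}(Y_i, Y_j) \\
                &= n\,p(1-p) + n(n-1)\,\rho\,p(1-p) \\
                &= n\,p(1-p)\,\bigl[\,1 + (n-1)\rho\,\bigr]
\end{align*}
```

With ρ = 0 this reduces to the ordinary binomial variance. With ρ > 0 the variance is inflated by the factor 1 + (n − 1)ρ, so tail probabilities computed under independence can be too small.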


2019 ◽  
Author(s):  
Samir Rachid Zaim ◽  
Colleen Kenost ◽  
Joanne Berghout ◽  
Wesley Chiu ◽  
Liam Wilson ◽  
...  

Background: In this era of data science-driven bioinformatics, machine learning research has focused on feature selection as users want more interpretation and post-hoc analyses for biomarker detection. However, when there are more features (i.e., transcripts) than samples (i.e., mice or human samples) in a study, this poses major statistical challenges in biomarker detection tasks as traditional statistical techniques are underpowered in high dimension. Second and third order interactions of these features pose a substantial combinatoric dimensional challenge. In computational biology, random forest [1] (RF) classifiers are widely used [2–7] due to their flexibility, powerful performance, and robustness to “P predictors ≫ subjects N” difficulties and their ability to rank features. We propose binomialRF, a feature selection technique in RFs that provides an alternative interpretation for features using a correlated binomial distribution and scales efficiently to analyze multiway interactions.
Methods: binomialRF treats each tree in a RF as a correlated but exchangeable binary trial. It determines importance by constructing a test statistic based on a feature’s selection frequency to compute its rank, nominal p-value, and multiplicity-adjusted q-value using a one-sided hypothesis test with a correlated binomial distribution. A distributional adjustment addresses the co-dependencies among trees as these trees subsample from the same dataset. The proposed algorithm efficiently identifies multiway nonlinear interactions by generalizing the test statistic to count sub-trees.
Results: In simulations and in studies of the Madelon benchmark datasets, binomialRF showed computational gains (up to 30 to 600 times faster) while maintaining competitive variable precision and recall in identifying biomarkers’ main effects and interactions. In two clinical studies, the binomialRF algorithm prioritizes previously-published relevant pathological molecular mechanisms (features) with high classification precision and recall using features alone, as well as with their statistical interactions alone.
Conclusion: binomialRF extends upon previous methods for identifying interpretable features in RFs and brings them together under a correlated binomial distribution to create an efficient hypothesis testing algorithm that identifies biomarkers’ main effects and interactions. Preliminary results in simulations demonstrate computational gains while retaining competitive model selection and classification accuracies. Future work will extend this framework to incorporate ontologies that provide pathway-level feature selection from gene expression input data.
Availability: GitHub: https://github.com/SamirRachidZaim/binomialRF
Supplementary information: Supplementary analyses and results are available at https://github.com/SamirRachidZaim/binomialRF_simulationStudy
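
The testing idea in the abstract above (count how often a feature is selected across trees, then run a one-sided test against a correlated binomial null and adjust for multiplicity) can be sketched roughly as follows. This is not the binomialRF implementation: the abstract does not say which correlated binomial model is used, so the sketch substitutes a beta-binomial, one standard exchangeable model, and the function names, the null probability `p0`, and the correlation `rho` are all hypothetical inputs chosen for illustration.

```python
import math


def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_pmf(k, n, p, rho):
    """P(X = k) for n exchangeable Bernoulli(p) trials with pairwise
    correlation rho (0 < rho < 1), modelled as a beta-binomial.
    ASSUMPTION: beta-binomial stands in for the paper's correlated binomial."""
    s = (1.0 - rho) / rho              # alpha + beta, since rho = 1 / (alpha + beta + 1)
    a, b = p * s, (1.0 - p) * s        # mean a / (a + b) equals p
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_choose + log_beta(k + a, n - k + b) - log_beta(a, b))


def upper_tail_p_value(k, n, p0, rho):
    """One-sided p-value P(X >= k) under the null selection probability p0."""
    tail = sum(beta_binomial_pmf(j, n, p0, rho) for j in range(k, n + 1))
    return min(1.0, tail)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted q-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):       # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q


if __name__ == "__main__":
    n_trees = 500                      # hypothetical forest size
    p0, rho = 0.05, 0.02               # hypothetical null probability and tree correlation
    selection_counts = {"gene_a": 80, "gene_b": 30, "gene_c": 22}  # made-up counts
    p_vals = [upper_tail_p_value(k, n_trees, p0, rho) for k in selection_counts.values()]
    for name, p, q in zip(selection_counts, p_vals, benjamini_hochberg(p_vals)):
        print(f"{name}: p = {p:.4g}, q = {q:.4g}")
```

A positive `rho` widens the null distribution relative to an independent binomial, so a given selection count yields a larger p-value. That is the intended effect of adjusting for trees that subsample the same dataset.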


Author(s):  
Cheng-Ta Yeh ◽  
Lance Fiondella ◽  
Ping-Chen Chang

Transmission quality of a communication/computer system is a high-level objective of system supervisors. Therefore, transmission reliability improvement or optimization is an important issue for many organizations. One way to maximize transmission reliability is to model the system as a stochastic communication network including arcs and nodes and then determine the optimal component redundancy allocation. However, modern components are highly reliable. Thus, a decision maker may be more concerned about cost than reliability. This article considers cost-oriented component allocation subject to a reliability threshold and correlated failures characterized by a correlated binomial distribution model. To solve this problem, we employ a genetic algorithm to search for the optimal component redundancy allocation possessing minimal allocation cost. The computational efficiency of the genetic algorithm–based method is demonstrated through several benchmark networks and compared against several popular soft computing algorithms.


2013 ◽  
Vol 144 (1) ◽  
pp. 248-255 ◽  
Author(s):  
Walid W. Nasr ◽  
Bacel Maddah ◽  
Moueen K. Salameh

2012 ◽  
Vol 56 (8) ◽  
pp. 2513-2525 ◽  
Author(s):  
Rubiane M. Pires ◽  
Carlos A.R. Diniz

2012 ◽  
Vol 2012 ◽  
pp. 1-10 ◽  
Author(s):  
N. Rao Chaganty ◽  
Roy Sabo ◽  
Yihao Deng

While univariate instances of binomial data are readily handled with generalized linear models, cases of multivariate or repeated measure binomial data are complicated by the possibility of correlated responses. Likelihood-based estimation can be applied by using mixture distribution models, though this approach can present computational challenges. The logistic transformation can be used to bypass these concerns and allow for alternative estimating procedures. One popular alternative is the generalized estimating equation (GEE) method, though systematic errors can lead to infeasible correlation estimates or nonconvergence problems. Our approach couples the quasi-least squares (QLS) method with a rarely used matrix factorization, which yields a simpler estimation platform than the mixture model approach and does not suffer from the convergence problems of the GEE method. A noncontrived example is provided that shows the mechanical breakdown of GEE using several statistical software packages and highlights the usefulness of the QLS approach.


2011 ◽  
Vol 11 (3) ◽  
pp. 391-405
Author(s):  
S. Mori ◽  
K. Kitsukawa ◽  
M. Hisakado
