Efficient Identification of Approximate Best Configuration of Training in Large Datasets

Author(s):  
Silu Huang ◽  
Chi Wang ◽  
Bolin Ding ◽  
Surajit Chaudhuri

A training configuration refers to a combination of feature engineering, a learner, and its associated hyperparameters. Given a set of configurations and a large dataset randomly split into training and testing sets, we study how to efficiently identify the best configuration, i.e., the one with approximately the highest testing accuracy when trained on the training set. To guarantee small accuracy loss, we develop a solution using a confidence interval (CI)-based progressive sampling and pruning strategy. Compared to using the full data to find the exact best configuration, our solution achieves more than two orders of magnitude speedup, while the returned top configuration has identical or close test accuracy.
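The abstract does not spell out the pruning rule, but the general shape of a CI-based progressive sampling and pruning loop can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm: the `evaluate` callback, the Hoeffding-style interval, and the doubling schedule are all assumptions.

```python
import math

def hoeffding_radius(n, delta=0.05):
    # CI half-width for a mean of n i.i.d. accuracy observations in [0, 1].
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def progressive_best_config(configs, evaluate, n0=100, growth=2, n_max=100_000):
    """CI-based progressive sampling and pruning (sketch, not the paper's scheme).

    configs  : candidate training configurations.
    evaluate : evaluate(config, n) -> estimated test accuracy when the
               configuration is trained on a sample of n rows
               (hypothetical callback supplied by the caller).
    """
    alive = list(configs)
    n = n0
    while len(alive) > 1 and n <= n_max:
        scores = {c: evaluate(c, n) for c in alive}
        r = hoeffding_radius(n)
        best_lower = max(s - r for s in scores.values())
        # Prune configurations whose optimistic accuracy (upper CI bound)
        # cannot beat the best pessimistic lower bound.
        alive = [c for c in alive if scores[c] + r >= best_lower]
        n *= growth  # progressively enlarge the training sample
    return max(alive, key=lambda c: evaluate(c, min(n, n_max)))
```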

2021 ◽  
Author(s):  
Xuan Deng ◽  
Silvia Tanumiharjo ◽  
Qianyin Chen ◽  
Shengnan Li ◽  
Huimin Lin ◽  
...  

Aims: To investigate the evaluation indices (diagnostic test accuracy and agreement) of 15 combinations of ultrawide field scanning laser ophthalmoscopy (UWF SLO) images in myopic retinal changes (MRC) screening, and to determine the combination of imaging that yields the highest evaluation indices in screening for MRC. Methods: This is a retrospective study of UWF SLO images obtained from myopes and analyzed independently by two retinal specialists. 5-field UWF SLO images covering the posterior (B), superior (S), inferior (I), nasal (N) and temporal (T) regions were obtained for analysis, and their results were used as the reference standard. The evaluation indices of different combinations comprising one to four fields of the retina were compared to determine the ability of each combination to screen for MRC. Results: UWF SLO images obtained from 823 myopic patients (1646 eyes) were included in the study. Sensitivities ranged from 50.0% to 98.9% (95% confidence interval (CI), 43.8-99.7%); the combinations of B+S+I (97.3%; 95% CI, 94.4-98.8%), B+T+S+I (98.5%; 95% CI, 95.9-99.5%), and B+S+N+I (98.9%; 95% CI, 96.4-99.7%) ranked highest. Furthermore, the combinations of B+S+I, B+T+S+I and B+S+N+I also showed the highest accuracy (97.7%; 95% CI, 95.1-100.0%; 98.6%; 95% CI, 96.7-100.0%; 98.8%; 95% CI, 96.9-100.0%) and agreement (Kappa = 0.968, 0.980 and 0.980). Across the various combinations, specificities were all higher than 99.5% (95% CI, 99.3-100.0%). Conclusion: In our study, the screening combinations B+S+I, B+T+S+I and B+S+N+I stood out with the highest evaluation indices. However, when time is limited, B+S+I may be more applicable for primary screening of MRC.
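For reference, the evaluation indices used above (sensitivity, specificity, accuracy and Cohen's kappa) all derive from a 2x2 confusion table against the reference standard. A minimal sketch with illustrative counts, not the study's data:

```python
def evaluation_indices(tp, fp, fn, tn):
    # Standard diagnostic-test indices from a 2x2 table vs. the reference standard.
    n = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / n
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_obs = accuracy
    p_chance = ((tp + fp) / n) * ((tp + fn) / n) + ((fn + tn) / n) * ((fp + tn) / n)
    kappa = (p_obs - p_chance) / (1.0 - p_chance)
    return sensitivity, specificity, accuracy, kappa

# Illustrative counts only -- not the study's data.
print(evaluation_indices(tp=360, fp=5, fn=10, tn=1271))
```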


Author(s):  
K Sooknunan ◽  
M Lochner ◽  
Bruce A Bassett ◽  
H V Peiris ◽  
R Fender ◽  
...  

Abstract With the advent of powerful telescopes such as the Square Kilometer Array and the Vera C. Rubin Observatory, we are entering an era of multiwavelength transient astronomy that will lead to a dramatic increase in data volume. Machine learning techniques are well suited to address this data challenge and rapidly classify newly detected transients. We present a multiwavelength classification algorithm consisting of three steps: (1) interpolation and augmentation of the data using Gaussian processes; (2) feature extraction using wavelets; (3) classification with random forests. Augmentation provides improved performance at test time by balancing the classes and adding diversity to the training set. In the first application of machine learning to the classification of real radio transient data, we apply our technique to the Green Bank Interferometer and other radio light curves. We find we are able to accurately classify most of the eleven classes of radio variables and transients after just eight hours of observations, achieving an overall test accuracy of 78%. We fully investigate the impact of the small sample size of 82 publicly available light curves and use data augmentation techniques to mitigate the effect. We also show that, on a significantly larger and representative simulated training set, the algorithm achieves an overall accuracy of 97%, illustrating that the method is likely to provide excellent performance on future surveys. Finally, we demonstrate the effectiveness of simultaneous multiwavelength observations by showing how incorporating just one optical data point into the analysis improves the accuracy of the worst performing class by 19%.
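A minimal sketch of the three-step pipeline (Gaussian-process interpolation, wavelet features, random forest) is given below, using scikit-learn and PyWavelets on synthetic light curves. The kernel, wavelet family, grid size, and data are placeholders; the authors' exact choices and augmentation scheme are not reproduced here.

```python
import numpy as np
import pywt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.ensemble import RandomForestClassifier

def interpolate_light_curve(t, flux, n_grid=64):
    # Step 1: Gaussian-process interpolation onto a regular time grid.
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-2)
    gp.fit(t.reshape(-1, 1), flux)
    grid = np.linspace(t.min(), t.max(), n_grid)
    return gp.predict(grid.reshape(-1, 1))

def wavelet_features(series, wavelet="sym2", level=2):
    # Step 2: wavelet decomposition; concatenated coefficients act as features.
    coeffs = pywt.wavedec(series, wavelet, level=level)
    return np.concatenate(coeffs)

# Step 3: classification with a random forest (toy usage on fake curves).
rng = np.random.default_rng(0)
X = np.array([wavelet_features(interpolate_light_curve(
        np.sort(rng.uniform(0, 8, 30)), rng.normal(size=30)))
      for _ in range(20)])
y = rng.integers(0, 2, size=20)  # fake class labels
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```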


2020 ◽  
Author(s):  
Bingbing Cao ◽  
Li Li ◽  
Xiangfei Su ◽  
Jianfeng Zeng ◽  
Guo Weibing

Abstract Background: Laparoscopic cholecystectomy (LC) is a common surgical procedure for managing gallbladder disease. Prolonged length of stay (LOS) in the postanesthesia care unit (PACU) may lead to overcrowding and a decline in medical resource utilization. In this work, we aimed to develop and validate a predictive nomogram for identifying patients who require prolonged PACU LOS. Methods: Data from 913 patients undergoing LC at a single institution in China between 2018 and 2019 were collected and grouped into a training set (cases from 2018) and a test set (cases from 2019). Optimal features were selected using the least absolute shrinkage and selection operator (LASSO) regression model, and multivariable logistic regression analysis was used to build the prolonged PACU LOS risk model. The C-index, calibration plot, and decision curve analysis were used to assess the model's discrimination, calibration, and clinical application value, respectively. The test set data were evaluated for external validation. Results: The predictive nomogram had 8 predictor variables for prolonged PACU LOS, including age, ASA grade, active smoking, gastrointestinal disease, liver disease, and cardiovascular disease. The model displayed efficient calibration and moderate discrimination, with a C-index of 0.662 (95% confidence interval, 0.603 to 0.721) for the training set and 0.609 (95% confidence interval, 0.549 to 0.669) for the test set. Decision curve analysis demonstrated that the prolonged PACU LOS nomogram was reliable for clinical application when an intervention was decided at a possible threshold of 7%. Conclusions: We developed and validated a predictive nomogram with efficient calibration and moderate discrimination that can be applied to identify patients most likely to experience prolonged PACU LOS. This novel tool may help prevent overcrowding in the PACU and optimize medical resource utilization.
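A rough sketch of the LASSO-then-logistic-regression workflow described above, using scikit-learn on synthetic stand-in data (the real predictors and cohorts are not available here, so the coefficients and sizes are fabricated for illustration):

```python
import numpy as np
from sklearn.linear_model import LassoCV, LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in data: rows are patients, columns candidate predictors.
rng = np.random.default_rng(42)
beta = np.zeros(20)
beta[:3] = [1.2, -0.8, 0.5]  # only the first three predictors carry signal

def make_cohort(n):
    X = rng.normal(size=(n, 20))
    y = (X @ beta + rng.normal(size=n) > 0).astype(int)
    return X, y

X_train, y_train = make_cohort(500)  # stands in for the 2018 cases
X_test, y_test = make_cohort(400)    # stands in for the 2019 cases

# Step 1: LASSO keeps the predictors with non-zero coefficients.
lasso = LassoCV(cv=5, random_state=0).fit(X_train, y_train)
selected = np.flatnonzero(lasso.coef_)

# Step 2: multivariable logistic regression on the selected predictors
# supplies the coefficients a nomogram would be drawn from.
logit = LogisticRegression(max_iter=1000).fit(X_train[:, selected], y_train)

# For a binary outcome the C-index equals the area under the ROC curve.
c_index = roc_auc_score(y_test, logit.predict_proba(X_test[:, selected])[:, 1])
print(f"external C-index: {c_index:.3f}")
```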


2016 ◽  
Vol 144 (16) ◽  
pp. 3531-3539 ◽  
Author(s):  
W. BEAUVAIS ◽  
M. ORYNBAYEV ◽  
J. GUITIAN

SUMMARY Estimation of farm-level prevalence is common in veterinary research. Typically, not all animals within the farm are sampled, and imperfect tests are used. Often, assumptions about herd sizes and sampling proportions are made, which may be invalid in smallholder settings. We propose an alternative method for estimating farm-level prevalence in the context of Brucella seroprevalence estimation in an endemic region of Kazakhstan. We collected 210 milk samples from Otar district, which has a population of about 1000 cattle and 16 000 small ruminants, and tested them using an indirect ELISA. Individual-level prevalence and 95% confidence intervals were estimated using Taylor series linearization. A model was developed to estimate the smallholding-level prevalence, taking into account variable sampling proportions and uncertainty in the test accuracy. We estimate that 73% of the households we sampled had at least one Brucella-seropositive animal (95% credible interval 68–82). We estimate that 58% (95% confidence interval 40–76) of lactating small ruminants and 14% (95% confidence interval 1–28) of lactating cows were seropositive. Our results suggest that brucellosis is highly endemic in the area and conflict with those of the official brucellosis-testing programme, which found that in 2013, 0% of cows and 1·7% of small ruminants were seropositive.
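The paper's model handles variable sampling proportions and test-accuracy uncertainty; as a simpler illustration of the core correction for an imperfect test, the standard Rogan-Gladen adjustment (not the authors' model) is sketched below, with made-up sensitivity and specificity values:

```python
def rogan_gladen(apparent_prevalence, sensitivity, specificity):
    # True prevalence adjusted for an imperfect test:
    # p_true = (p_apparent + Sp - 1) / (Se + Sp - 1)
    p = (apparent_prevalence + specificity - 1.0) / (sensitivity + specificity - 1.0)
    return min(max(p, 0.0), 1.0)  # clamp to [0, 1]

# Illustrative values only -- not the study's ELISA characteristics.
print(rogan_gladen(apparent_prevalence=0.58, sensitivity=0.95, specificity=0.98))
```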


Brain ◽  
2020 ◽  
Author(s):  
Daniel B Rubin ◽  
Brigid Angelini ◽  
Maryum Shoukat ◽  
Catherine J Chu ◽  
Sahar F Zafar ◽  
...  

Abstract Intravenous third-line anaesthetic agents are typically titrated in refractory status epilepticus to achieve either seizure suppression or burst suppression on continuous EEG. However, the optimum treatment paradigm is unknown and little data exist to guide the withdrawal of anaesthetics in refractory status epilepticus. Premature withdrawal of anaesthetics risks the recurrence of seizures, whereas the prolonged use of anaesthetics increases the risk of treatment-associated adverse effects. This study sought to measure the accuracy of features of EEG activity during anaesthetic weaning in refractory status epilepticus as predictors of successful weaning from intravenous anaesthetics. We prespecified a successful anaesthetic wean as the discontinuation of intravenous anaesthesia without developing recurrent status epilepticus, and a wean failure as either recurrent status epilepticus or the resumption of anaesthesia for the purpose of treating an EEG pattern concerning for incipient status epilepticus. We evaluated two types of features as predictors of successful weaning: spectral components of the EEG signal, and spatial-correlation-based measures of functional connectivity. The results of these analyses were used to train a classifier to predict wean outcome. Forty-seven consecutive anaesthetic weans (23 successes, 24 failures) were identified from a single-centre cohort of patients admitted with refractory status epilepticus from 2016 to 2019. Spectral components of the EEG revealed no significant differences between successful and unsuccessful weans. Analysis of functional connectivity measures revealed that successful anaesthetic weans were characterized by the emergence of larger, more densely connected, and more highly clustered spatial functional networks, yielding 75.5% (95% confidence interval: 73.1–77.8%) testing accuracy in a bootstrap analysis using a hold-out sample of 20% of data for testing and 74.6% (95% confidence interval 73.2–75.9%) testing accuracy in a secondary external validation cohort, with an area under the curve of 83.3%. Distinct signatures in the spatial networks of functional connectivity emerge during successful anaesthetic liberation in status epilepticus; these findings are absent in patients with anaesthetic wean failure. Identifying features that emerge during successful anaesthetic weaning may allow faster and more successful anaesthetic liberation after refractory status epilepticus.
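As an illustration of the spatial-correlation connectivity measures described above, the sketch below builds a thresholded correlation graph from multichannel EEG and reports its density and clustering. The channel count, threshold, and random data are assumptions, not the study's settings.

```python
import numpy as np
import networkx as nx

def connectivity_network(eeg, threshold=0.6):
    """eeg: array of shape (n_channels, n_samples). Returns a graph whose
    edges link channel pairs with |Pearson correlation| above `threshold`
    (a sketch of spatial-correlation functional connectivity)."""
    corr = np.corrcoef(eeg)
    n = corr.shape[0]
    g = nx.Graph()
    g.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if abs(corr[i, j]) >= threshold:
                g.add_edge(i, j)
    return g

# Larger, denser, more clustered networks were the signature of successful
# weans; random data gives a near-empty graph, real EEG yields structure.
eeg = np.random.default_rng(1).normal(size=(19, 5000))  # fake 19-channel EEG
g = connectivity_network(eeg)
print(nx.density(g), nx.average_clustering(g))
```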


2005 ◽  
Vol 51 (1) ◽  
pp. 16-24 ◽  
Author(s):  
Dirk Stengel ◽  
Kai Bauwens ◽  
Didier Keh ◽  
Herwig Gerlach ◽  
Axel Ekkernkamp ◽  
...  

Abstract Background: After severe trauma, decreased plasma concentrations of the soluble adhesion molecule L-selectin (sCD62L) have been linked to an increased incidence of lung failure and multiorgan dysfunction syndrome (MODS). Individual studies have had conflicting results, however. We examined multiple studies in an attempt to determine whether early sCD62L concentrations are predictive of major complications after severe trauma. Methods: We performed a systematic review of six electronic databases and a manual search for clinical studies comparing outcomes of multiply injured patients (Injury Severity Score ≥16) depending on their early sCD62L blood concentrations. Because of various outcome definitions, acute lung injury (ALI) and adult respiratory distress syndrome (ARDS) were studied as a composite endpoint. Weighted mean differences (WMDs) in sCD62L concentrations were calculated between individuals with and without complications by fixed- and random-effects models. Results: Altogether, 3370 citations were identified. Seven prospective studies including 350 patients were eligible for data synthesis. Published data showed the discriminatory features of sCD62L but did not allow for calculation of measures of test accuracy. Three of four studies showed lower early sCD62L concentrations among individuals progressing to ALI and ARDS (WMD = −229 μg/L; 95% confidence interval, −476 to 18 μg/L). No differences in sCD62L concentrations were noted among patients with or without later MODS. Nonsurvivors had significantly lower early sCD62L plasma concentrations (WMD = 121 μg/L; 95% confidence interval, 63–179 μg/L), but little information was available on potential confounders in this group. Conclusions: Early decreased soluble L-selectin concentrations after multiple trauma may signal an increased likelihood of lung injury and ARDS. The findings of this meta-analysis warrant a large cohort study to develop selectin-based models targeting the risk of inflammatory complications.
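For readers unfamiliar with the pooling behind a WMD, a minimal inverse-variance fixed-effect sketch follows (the random-effects variant adds a between-study variance term). The numbers are illustrative, not the seven studies' data:

```python
import math

def fixed_effect_wmd(studies):
    """Inverse-variance fixed-effect pooling of per-study mean differences.
    `studies` is a list of (mean_difference, standard_error) pairs."""
    weights = [1.0 / se ** 2 for _, se in studies]
    pooled = sum(w * md for (md, _), w in zip(studies, weights)) / sum(weights)
    se_pooled = math.sqrt(1.0 / sum(weights))
    ci = (pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled)
    return pooled, ci

# Illustrative numbers only -- not the studies in this review.
print(fixed_effect_wmd([(-250, 120), (-180, 90), (-300, 150)]))
```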


2015 ◽  
Vol 143 (16) ◽  
pp. 3538-3545 ◽  
Author(s):  
M. TREMBLAY ◽  
J. S. DAHM ◽  
C. N. WAMAE ◽  
W. A. DE GLANVILLE ◽  
E. M. FÈVRE ◽  
...  

SUMMARY Large datasets are often not amenable to analysis using traditional single-step approaches. Here, our general objective was to apply imputation techniques, principal component analysis (PCA), elastic net and generalized linear models to a large dataset in a systematic approach to extract the most meaningful predictors for a health outcome. To demonstrate these techniques, we extracted predictors of Plasmodium falciparum infection from a large covariate dataset with a limited number of observations, using data from the People, Animals, and their Zoonoses (PAZ) project: data collected from 415 homesteads in western Kenya contained over 1500 variables describing the health, environment, and social factors of the humans, livestock, and the homesteads in which they reside. The wide, sparse dataset was simplified to 42 predictors of P. falciparum malaria infection, and wealth rankings were produced for all homesteads. The 42 predictors make biological sense and are supported by previous studies. The systematic data-mining approach we used would make many large datasets more manageable and informative for decision-making processes and health policy prioritization.
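A condensed sketch of the kind of pipeline described above, using scikit-learn on synthetic data: imputation feeding an elastic-net GLM for predictor selection, and imputation feeding PCA for a wealth-style ranking. The missingness level, penalties, and dimensions are placeholders, not the PAZ analysis settings.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Fake wide, sparse covariate table: 415 homesteads x 300 variables with gaps.
rng = np.random.default_rng(7)
X = rng.normal(size=(415, 300))
X[rng.random(X.shape) < 0.2] = np.nan  # 20% missingness
y = rng.integers(0, 2, size=415)       # P. falciparum infection status

# Imputation -> scaling -> elastic-net GLM: predictors with non-zero
# coefficients are the ones retained.
selector = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("glm", LogisticRegression(penalty="elasticnet", solver="saga",
                               l1_ratio=0.5, C=0.05, max_iter=5000)),
]).fit(X, y)
n_kept = np.count_nonzero(selector.named_steps["glm"].coef_)

# Imputation -> PCA: the first principal component of asset variables is a
# common way to build the kind of wealth ranking described above.
wealth = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=1)),
]).fit_transform(X)
print(n_kept, wealth.shape)
```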


2002 ◽  
Vol 19 (3) ◽  
pp. 381-390 ◽  
Author(s):  
Jingjia Luo ◽  
Leland Jameson

Abstract Wavelet analysis offers a new approach for viewing and analyzing various large datasets by dividing information according to scale and location. Here a new method is presented that is designed to characterize time-evolving structures in large datasets from computer simulations and from observational data. An example of the use of this method to identify, classify, label, and track eddylike structures in a time-evolving dataset is presented. The initial target application is satellite data from the TOPEX/Poseidon satellite, but the technique can be used with any large dataset that might contain time-evolving or stationary structures.
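The paper's eddy identification and tracking scheme is not reproduced here, but a toy sketch with PyWavelets shows the underlying idea of isolating structures by keeping only the strongest detail coefficients at a given scale and location. The wavelet, level, and quantile cutoff are arbitrary choices for illustration.

```python
import numpy as np
import pywt

def structure_map(field, wavelet="haar", level=1, keep=0.99):
    """Keep only the largest wavelet detail coefficients of a 2D field,
    a crude stand-in for scale-and-location structure detection."""
    coeffs = pywt.wavedec2(field, wavelet, level=level)
    details = np.concatenate([np.abs(d).ravel() for d in coeffs[1]])
    cutoff = np.quantile(details, keep)
    # Reconstruct using the approximation plus only the strongest details.
    filtered = [coeffs[0]] + [
        tuple(np.where(np.abs(d) >= cutoff, d, 0.0) for d in lvl)
        for lvl in coeffs[1:]
    ]
    return pywt.waverec2(filtered, wavelet)

# Fake sea-surface-height snapshot with one eddy-like Gaussian bump.
y, x = np.mgrid[0:64, 0:64]
field = np.exp(-((x - 40) ** 2 + (y - 20) ** 2) / 30.0)
print(np.abs(structure_map(field)).max())
```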


2013 ◽  
Vol 457-458 ◽  
pp. 1334-1337
Author(s):  
Xu Guang Sun ◽  
Chang Hai Wang ◽  
Shi Yan Shan ◽  
Cheng Long Feng

Following a brief introduction to the construction and working principle of the pneumatic shape-measuring roller, static and dynamic experiments on the roller were carried out to establish the distribution and characteristics of the measured differential pressure along the circumferential direction. A dynamic phase correction method is then proposed to improve the measuring accuracy of the roller, and experiments confirm the feasibility of the method.


Author(s):  
Javier Quinteros ◽  
Jerry A. Carter ◽  
Jonathan Schaeffer ◽  
Chad Trabant ◽  
Helle A. Pedersen

Abstract New data acquisition techniques are generating data at much finer temporal and spatial resolution than traditional seismic experiments, which is a challenge for data centers and users. As the amount of data potentially flowing into data centers increases by one or two orders of magnitude, data management challenges arise throughout all stages of the data flow. The Incorporated Research Institutions for Seismology, Réseau sismologique et géodésique français, and GEOForschungsNetz data centers carried out a survey and conducted interviews of users working with very large datasets to understand their needs and expectations. One of the conclusions is that existing data formats and services are not well suited to users of large datasets. Data centers are exploring storage solutions, data formats, and data delivery options to meet large-dataset user needs. New approaches will need to be discussed within the community to establish large-dataset standards and best practices, perhaps through participation of stakeholders and users in discussion groups and forums.

