Comparison of Cross-Validation and Test Sets Approaches to Evaluation of Classifiers in Authorship Attribution Domain

Author(s):  
Grzegorz Baron
2020 ◽  
Author(s):  
Yusuke Okuda ◽  
Takaya Shimura ◽  
Hiroyasu Iwasaki ◽  
Shigeki Fukusada ◽  
Ruriko Nishigaki ◽  
...  

Abstract Background: Esophageal cancer (EC) including esophageal squamous cell carcinoma (ESCC) and adenocarcinoma (EAC) generally exhibits poor prognosis; hence, a noninvasive biomarker enabling early detection is necessary. Methods: Age- and sex-matched 150 healthy controls (HCs) and 43 patients with ESCC were randomly divided into two groups: 9 patients in the discovery cohort for microarray analysis and 184 patients in the training/test cohort with cross-validation for qRT-PCR analysis. Using 152 urine samples (144 HCs and 8 EACs), we validated the urinary miRNA biomarkers for EAC diagnosis. Results: Among eight miRNAs selected in the discovery cohort, urinary levels of five miRNAs (miR-1273f, miR-619-5p, miR-150-3p, miR-4327, and miR-3135b) were significantly higher in the ESCC group than in the HC group, in the training/test cohort. Consistently, these five urinary miRNAs were significantly different between HC and ESCC in both training and test sets. Especially, urinary miR-1273f and miR-619-5p showed excellent values of area under the curve (AUC) ≥ 0.80 for diagnosing stage I ESCC. Similarly, the EAC group had significantly higher urinary levels of these five miRNAs than the HC group, with AUC values of approximately 0.80. Conclusion: The present study established novel urinary miRNA biomarkers that can early detect ESCC and EAC.


2013 ◽  
Vol 12 (01) ◽  
pp. 1250106 ◽  
Author(s):  
ALI MEHDIKHANI ◽  
HAMID REZA LOTFIZADEH ◽  
KAMYAR ARMAN ◽  
HADI NOORIZADEH

Thermal desorption-comprehensive two-dimensional gas chromatography high-resolution time-of-flight mass spectrometry (TD–GC×GC–HRTOF-MS) is one of the most powerful tools for analyzing nanoparticle compounds. Genetic algorithm and partial least squares (GA-PLS) and kernel PLS (GA-KPLS) models were used to investigate the correlation between reverse factor (RF) and descriptors for 50 nanoparticles fraction with a diameter of 29–58 nm in roadside atmosphere, which were obtained by TD–GC×GC–HRTOF-MS. The leave-group-out cross-validation correlation coefficients (LGO-CV, Q2) of prediction of the GA-PLS and GA-KPLS models for the training and test sets were (0.761 and 0.718) and (0.825 and 0.814), respectively, indicating the reliability of these models. This is the first research on the quantitative structure-property relationship (QSPR) of the nanoparticles in roadside atmosphere using GA-PLS and GA-KPLS.


2019 ◽  
Vol 36 (7) ◽  
pp. 2025-2032
Author(s):  
Yuwei Zhang ◽  
Tianfei Yi ◽  
Huihui Ji ◽  
Guofang Zhao ◽  
Yang Xi ◽  
...  

Abstract Motivation: Long noncoding RNA (lncRNA) has been verified to interact with other biomolecules, especially protein-coding genes (PCGs), thus playing essential regulatory roles in life activities and disease development. However, the inner mechanisms of most lncRNA–PCG relationships are still unclear. Our study investigated the characteristics of true lncRNA–PCG relationships and constructed a novel predictor with machine learning algorithms. Results: We obtained 307 true lncRNA–PCG pairs from a database and found significant differences in multiple characteristics between true and random lncRNA–PCG sets. In addition, 3-fold cross-validation and prediction results on independent test sets show high AUC values for LR, SVM and RF, among which RF has the best performance, with an average AUC of 0.818 for cross-validation and 0.823 and 0.853 for two independent test sets, respectively. In a case study, some candidate lncRNA–PCG relationships in colorectal cancer were found, and the HOTAIR–COMP interaction was specially exemplified. The proportion of the reported pairs in the predicted positive results was significantly higher than that in negative results (P < 0.05). Supplementary information: Supplementary data are available at Bioinformatics online.
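The AUC figures quoted in this abstract can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The following is a minimal sketch of that pairwise definition in plain Python; the function name `roc_auc` and the toy scores are illustrative assumptions, not the authors' code or data.

```python
def roc_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count as 0.5.

    Illustrative helper: labels are 1 (positive) or 0 (negative).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): three of the four positive/negative pairs are ranked correctly.
print(roc_auc([1, 1, 0, 0], [0.9, 0.2, 0.3, 0.1]))
```

In a k-fold setting, one would typically compute this value on each held-out fold and report the mean, which is presumably what "average AUC" refers to here.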


2017 ◽  
Author(s):  
Colby Redfield ◽  
Abdulhakim Tlimat ◽  
Yoni Halpern ◽  
David Schoenfeld ◽  
Edward Ullman ◽  
...  

Abstract Background: Linking EMS electronic patient care reports (ePCRs) to ED records can provide clinicians access to vital information that can alter management. It can also create rich databases for research and quality improvement. Unfortunately, previous attempts at ePCR-ED record linkage have had limited success. Objective: To derive and validate an automated record linkage algorithm between EMS ePCRs and ED records using supervised machine learning. Methods: All consecutive ePCRs from a single EMS provider between June 2013 and June 2015 were included. A primary reviewer matched ePCRs to a list of ED patients to create a gold standard. Age, gender, last name, first name, social security number (SSN), and date of birth (DOB) were extracted. Data was randomly split into 80%/20% training and test data sets. We derived missing indicators, identical indicators, edit distances, and percent differences. A multivariate logistic regression model was trained using 5k fold cross-validation, using label k-fold, L2 regularization, and class re-weighting. Results: A total of 14,032 ePCRs were included in the study. Inter-rater reliability between the primary and secondary reviewer had a Kappa of 0.9. The algorithm had a sensitivity of 99.4%, a PPV of 99.9% and AUC of 0.99 in both the training and test sets. DOB match had the highest odds ratio (16.9), followed by last name match (10.6). SSN match had an odds ratio of 3.8. Conclusions: We were able to successfully derive and validate a probabilistic record linkage algorithm from a single EMS ePCR provider to our hospital EMR.
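The derived features named in the Methods (missing indicators, identical indicators, edit distances) can be illustrated for a single pair of text fields. This is a hedged sketch only: the helper names `edit_distance` and `field_features` are hypothetical, Levenshtein distance is assumed as the edit distance, and the abstract does not say how the authors computed these quantities.

```python
def edit_distance(a, b):
    """Levenshtein distance using the standard two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def field_features(left, right):
    """Comparison features for one field of a candidate record pair.

    A value of None or "" is treated as missing (an assumption of this sketch).
    """
    missing = int(not left or not right)
    identical = int(bool(left) and left == right)
    return {
        "missing": missing,
        "identical": identical,
        "edit_distance": edit_distance(left or "", right or ""),
    }


print(field_features("Smith", "Smyth"))
```

Features of this kind, computed per field, would then form the input vector of a classifier such as the logistic regression model the abstract describes.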


Water ◽  
2020 ◽  
Vol 12 (6) ◽  
pp. 1743 ◽  
Author(s):  
Jeongwoo Lee ◽  
Chul-Gyum Kim ◽  
Jeong Eun Lee ◽  
Nam Won Kim ◽  
Hyeonjun Kim

In this study, artificial neural network (ANN) models were constructed to predict the rainfall during May and June for the Han River basin, South Korea. This was achieved using the lagged global climate indices and historical rainfall data. Monte-Carlo cross-validation and aggregation (MCCVA) was applied to create an ensemble of forecasts. The input-output patterns were randomly divided into training, validation, and test datasets. This was done 100 times to achieve diverse data splitting. In each data splitting, ANN training was repeated 100 times using randomly assigned initial weight vectors of the network to construct 10,000 prediction ensembles and estimate their prediction uncertainty interval. The optimal ANN model that was used to forecast the monthly rainfall in May had 11 input variables of the lagged climate indices such as the Arctic Oscillation (AO), East Atlantic/Western Russia Pattern (EAWR), Polar/Eurasia Pattern (POL), Quasi-Biennial Oscillation (QBO), Sahel Precipitation Index (SPI), and Western Pacific Index (WP). The ensemble of the rainfall forecasts exhibited the values of the averaged root mean squared error (RMSE) of 27.4, 33.6, and 39.5 mm, and the averaged correlation coefficient (CC) of 0.809, 0.725, and 0.641 for the training, validation, and test sets, respectively. The estimated uncertainty band has covered 58.5% of observed rainfall data with an average band width of 50.0 mm, exhibiting acceptable results. The ANN forecasting model for June has 9 input variables, which differed from May, of the Atlantic Meridional Mode (AMM), East Pacific/North Pacific Oscillation (EPNP), North Atlantic Oscillation (NAO), Scandinavia Pattern (SCAND), Equatorial Eastern Pacific SLP (SLP_EEP), and POL. The averaged RMSE values are 39.5, 46.1, and 62.1 mm, and the averaged CC values are 0.853, 0.771, and 0.683 for the training, validation, and test sets, respectively. The estimated uncertainty band for June rainfall forecasts generally has a coverage of 67.9% with an average band width of 83.0 mm. It can be concluded that the neural network with MCCVA enables us to provide acceptable medium-term rainfall forecasts and define the prediction uncertainty interval.
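The repeated random splitting and the reported band coverage can be sketched in a few lines. The split fractions (60/20/20), the function names, and the toy numbers below are assumptions made for illustration; the abstract states only that the data were split randomly 100 times into training, validation, and test sets.

```python
import random


def monte_carlo_splits(n, n_repeats, frac_train=0.6, frac_val=0.2, seed=0):
    """Yield n_repeats random (train, validation, test) index partitions.

    The fractions are hypothetical defaults, not values from the study.
    """
    rng = random.Random(seed)
    n_train = int(n * frac_train)
    n_val = int(n * frac_val)
    for _ in range(n_repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        yield idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


def band_coverage(observed, lower, upper):
    """Fraction of observations that fall inside their [lower, upper] band."""
    inside = sum(lo <= y <= hi for y, lo, hi in zip(observed, lower, upper))
    return inside / len(observed)


splits = list(monte_carlo_splits(n=10, n_repeats=100))
print(len(splits))                                       # number of random partitions
print(band_coverage([1, 5, 9], [0, 6, 8], [2, 7, 10]))   # toy bands
```

In the study's setup, each partition would additionally be trained on 100 times with different initial weights, and the band limits would come from the spread of the resulting ensemble.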


2019 ◽  
Vol 9 (17) ◽  
pp. 3538 ◽  
Author(s):  
Hailong Hu ◽  
Zhong Li ◽  
Arne Elofsson ◽  
Shangxin Xie

The prediction of protein secondary structure continues to be an active area of research in bioinformatics. In this paper, a Bi-LSTM based ensemble model is developed for the prediction of protein secondary structure. The ensemble model with dual loss function consists of five sub-models, which are finally joined by a Bi-LSTM layer. In contrast to existing ensemble methods, which generally train each sub-model and then join them as a whole, this ensemble model and sub-models can be trained simultaneously and the performance of each model can be observed and compared during the training process. Three independent test sets (data1199, the 513-protein Cuff & Barton set (CB513), and 203 proteins from the Critical Assessment of protein Structure Prediction (CASP203)) are employed to test the method. On average, the ensemble model achieved 84.3% in Q3 accuracy and 81.9% in segment overlap measure (SOV) score by using 10-fold cross validation. There is an improvement of up to 1% over some state-of-the-art prediction methods of protein secondary structure.
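Q3 accuracy is the percentage of residues whose three-state label (commonly helix H, strand E, coil C) is predicted correctly. A minimal sketch under that definition follows; the function name and the toy sequences are illustrative assumptions, and the SOV score, which evaluates segment overlap, is not covered here.

```python
def q3_accuracy(predicted, observed):
    """Percentage of positions where the predicted three-state label matches."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Toy example (hypothetical labels): three of four residues agree.
print(q3_accuracy("HHEC", "HEEC"))
```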


Sensors ◽  
2020 ◽  
Vol 20 (19) ◽  
pp. 5673
Author(s):  
Feifei Liu ◽  
Weihao Zhang ◽  
Yu Sun ◽  
Jianwei Liu ◽  
Jungang Miao ◽  
...  

Metamaterials, artificially engineered structures with extraordinary physical properties, offer multifaceted capabilities in interdisciplinary fields. To address the looming threat of stealthy monitoring, the detection and identification of metamaterials is the next research frontier but has not yet been explored. Here, we show that the crypto-oriented convolutional neural network (CNN) makes possible the secure intelligent detection of metamaterials in mixtures. Terahertz signals were encrypted by homomorphic encryption and the ciphertext was submitted to the CNN directly for results, which can only be decrypted by the data owner. The experimentally measured terahertz signals were augmented and further divided into training sets and test sets using 5-fold cross-validation. Experimental results illustrated that the model achieved an accuracy of 100% on the test sets, far outperforming humans and traditional machine learning. The CNN took 9.6 s to run inference on 92 encrypted test signals with a homomorphic encryption backend. The proposed method, with its accuracy and security, provides a privacy-preserving paradigm for artificial intelligence-based material identification.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Yusuke Okuda ◽  
Takaya Shimura ◽  
Hiroyasu Iwasaki ◽  
Shigeki Fukusada ◽  
Ruriko Nishigaki ◽  
...  

Abstract Esophageal cancer (EC) including esophageal squamous cell carcinoma (ESCC) and adenocarcinoma (EAC) generally exhibits poor prognosis; hence, a noninvasive biomarker enabling early detection is necessary. Age- and sex-matched 150 healthy controls (HCs) and 43 patients with ESCC were randomly divided into two groups: 9 individuals in the discovery cohort for microarray analysis and 184 individuals in the training/test cohort with cross-validation for qRT-PCR analysis. Using 152 urine samples (144 HCs and 8 EACs), we validated the urinary miRNA biomarkers for EAC diagnosis. Among eight miRNAs selected in the discovery cohort, urinary levels of five miRNAs (miR-1273f, miR-619-5p, miR-150-3p, miR-4327, and miR-3135b) were significantly higher in the ESCC group than in the HC group, in the training/test cohort. Consistently, these five urinary miRNAs were significantly different between HC and ESCC in both training and test sets. Especially, urinary miR-1273f and miR-619-5p showed excellent values of area under the curve (AUC) ≥ 0.80 for diagnosing stage I ESCC. Similarly, the EAC group had significantly higher urinary levels of these five miRNAs than the HC group, with AUC values of approximately 0.80. The present study established novel urinary miRNA biomarkers that can early detect ESCC and EAC.


2021 ◽  
Author(s):  
Tuğba Alp Tokat ◽  
Burçin Türkmenoğlu ◽  
Yahya Güzel

Abstract According to the descriptors in the pharmacophore model, dividing molecules into training and test sets serves to create a good model. It is difficult to track the Local Reactive Descriptor (LRD) effect of the pharmacophore at each interaction point in the 3D metric system. A subset of clusters of atoms can correspond to all or part of the pharmacophore structure. In this study, the multidimensional system of the subset was reduced to a one-dimensional index and the Vector Fingerprint Functions (VFF) of the molecules were created. Models were established by dividing molecules with close and similar VFFs into training and test sets. Sub-clusters were examined for all molecules by applying the Genetic Algorithm (GA). The model was predicted using the Leave One Out-Cross Validation (LOO-CV) method and verified with an external test set. The statistical results of the model obtained according to the division in the new method we developed (Q2 = 0.604 and R2 = 0.760 for training-80 and external test-20 sets, respectively) were compared with random and manual division results.
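A leave-one-out cross-validated Q2 is commonly computed as 1 − PRESS / SS_tot, where PRESS sums the squared errors of predictions made for each sample by a model fitted without it. The sketch below uses a one-descriptor least-squares line purely as a stand-in model; the function names, the stand-in model, and the use of the overall mean in SS_tot are assumptions, not the authors' procedure.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one descriptor."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def loo_q2(xs, ys):
    """Q2 = 1 - PRESS / SS_tot, refitting the line with each sample left out."""
    mean_y = sum(ys) / len(ys)
    press = 0.0
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / ss_tot


# Toy data (hypothetical): a nearly linear relationship gives Q2 close to 1.
print(loo_q2([1.0, 2.0, 3.0, 4.0, 5.0], [2.1, 3.9, 6.2, 7.8, 10.1]))
```

Because every prediction in PRESS comes from a model that never saw the sample, Q2 is usually lower than the fitted R2, which is consistent with the pair of values reported above.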


2017 ◽  
Vol 29 (2) ◽  
pp. 519-554 ◽  
Author(s):  
Ruibo Wang ◽  
Yu Wang ◽  
Jihong Li ◽  
Xingli Yang ◽  
Jing Yang

A cross-validation method based on m replications of two-fold cross validation is called an m × 2 cross validation. An m × 2 cross validation is used in estimating the generalization error and comparing algorithms’ performance in machine learning. However, the variance of the estimator of the generalization error in m × 2 cross validation is easily affected by random partitions. Poor data partitioning may cause a large fluctuation in the number of overlapping samples between any two training (test) sets in m × 2 cross validation. This fluctuation results in a large variance in the m × 2 cross-validated estimator. The influence of the random partitions on variance becomes serious as m increases. Thus, in this study, the partitions with a restricted number of overlapping samples between any two training (test) sets are defined as a block-regularized partition set. The corresponding cross validation is called block-regularized m × 2 cross validation (m × 2 BCV). It can effectively reduce the influence of random partitions. We prove that the variance of the m × 2 BCV estimator of the generalization error is smaller than the variance of the m × 2 cross-validated estimator and reaches the minimum in a special situation. An analytical expression of the variance can also be derived in this special situation. This conclusion is validated through simulation experiments. Furthermore, a practical construction method of m × 2 BCV by a two-level orthogonal array is provided. Finally, a conservative estimator is proposed for the variance of the estimator of the generalization error.
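The fluctuation in overlap that this abstract describes is easy to observe with plain random partitioning. The sketch below takes m as the number of replications of two-fold cross validation, draws m random half/half partitions, and counts the shared samples between the first halves of each pair of replications. It illustrates the random baseline only, not the block-regularized construction; the function names are hypothetical.

```python
import random
from itertools import combinations


def m_by_2_partitions(n, m, seed=0):
    """Draw m random two-fold partitions of range(n) as pairs of index sets."""
    rng = random.Random(seed)
    half = n // 2
    partitions = []
    for _ in range(m):
        idx = list(range(n))
        rng.shuffle(idx)
        partitions.append((set(idx[:half]), set(idx[half:])))
    return partitions


def training_overlaps(partitions):
    """Overlap counts between the first halves of every pair of replications."""
    first_halves = [p[0] for p in partitions]
    return [len(a & b) for a, b in combinations(first_halves, 2)]


parts = m_by_2_partitions(n=20, m=3)
print(training_overlaps(parts))  # one count per pair of replications
```

With random partitions the expected overlap between two halves is n/4, but individual counts scatter around that value; restricting this scatter is the idea the abstract attributes to the block-regularized partition set.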

