Comparison of CART and discriminant analysis of morphometric data in foraminiferal taxonomy

2006 ◽  
Vol 29 (1) ◽  
pp. 153-162
Author(s):  
Pratul Kumar Saraswati ◽  
Sanjeev V Sabnis

Paleontologists use statistical methods for the prediction and classification of taxa. Over the years, statistical analyses of morphometric data have been carried out under the assumption of multivariate normality. In an earlier study, three closely resembling species of the biostratigraphically important genus Nummulites were discriminated by multi-group discriminant analysis. Two discriminant functions that used the diameter and thickness of the tests and the height and length of chambers in the final whorl accounted for nearly 100% discrimination. In this paper, Classification and Regression Tree (CART), a non-parametric method, is used for classification and prediction on the same data set. In all, 111 iterations of the CART methodology are performed by splitting the data set of 55 observations into training, validation and test data sets in varying proportions. Among the validation data sets, 40% of the iterations are classified without error and 49% contain only a single misclassification. As regards the test data sets, nearly 70% contain no misclassifications, whereas about 25% contain only one. The results suggest that the method is highly successful in assigning an individual to a particular species. The key variables on which the tree models are built are combinations of the thickness of the test (T), the height of the chambers in the final whorl (HL) and the diameter of the test (D). Discriminant analysis and CART thus appear comparable in discriminating the three species. However, CART reduces the number of requisite variables without increasing the misclassification error. The method is very useful to professional geologists for quick identification of species.
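A minimal sketch of one such CART iteration in Python with scikit-learn is shown below. The column names (D, T, HL, L), the placeholder measurements and the split proportions are illustrative assumptions, not the original 55-specimen data set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Placeholder measurements standing in for the 55-specimen morphometric table:
# D = test diameter, T = test thickness, HL = chamber height, L = chamber length.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.random((55, 4)), columns=["D", "T", "HL", "L"])
species = rng.choice(["sp. A", "sp. B", "sp. C"], size=55)

# One iteration: split into training, validation and test sets (proportions illustrative).
X_train, X_rest, y_train, y_rest = train_test_split(data, species, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("validation accuracy:", tree.score(X_val, y_val))
print("test accuracy:", tree.score(X_test, y_test))
```

Repeating such splits with varying proportions (as in the 111 iterations described above) gives a distribution of misclassification counts rather than a single estimate.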

Genes ◽  
2019 ◽  
Vol 10 (10) ◽  
pp. 778 ◽  
Author(s):  
Liu ◽  
Liu ◽  
Pan ◽  
Li ◽  
Yang ◽  
...  

Many DNA methylation markers have been identified for cancer diagnosis. However, few studies have tried to identify DNA methylation markers that diagnose diverse cancer types simultaneously, i.e., pan-cancer markers. In this study, we sought DNA methylation markers that differentiate cancer samples from the respective normal samples across pan-cancers. We collected whole-genome methylation data of 27 cancer types comprising 10,140 cancer samples and 3386 normal samples, and divided all samples into five data sets: one training data set, one validation data set and three test data sets. We applied machine learning to identify DNA methylation markers and, specifically, constructed diagnostic prediction models by deep learning. We identified two categories of markers: 12 CpG markers and 13 promoter markers. Three of the 12 CpG markers and four of the 13 promoter markers are located at cancer-related genes. With the CpG markers, our model achieved an average sensitivity and specificity on the test data sets of 92.8% and 90.1%, respectively. With the promoter markers, the average sensitivity and specificity on the test data sets were 89.8% and 81.1%, respectively. Furthermore, on cell-free DNA methylation data of 163 prostate cancer samples, the CpG markers achieved a sensitivity of 100%, and the promoter markers achieved 92%. For both marker types, the specificity in normal whole blood was 100%. In conclusion, we identified methylation markers that diagnose pan-cancers, which might be applied to liquid biopsy of cancers.
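A minimal sketch of the evaluation step described above (sensitivity and specificity of a cancer-vs-normal classifier on a held-out test set) is given below. The random arrays stand in for methylation beta values of the 12 CpG markers, and the logistic regression is only a stand-in for the deep learning model, which is not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# X_*: placeholder beta values (samples x 12 CpG markers); y_*: 1 = cancer, 0 = normal.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((200, 12)), rng.integers(0, 2, 200)
X_test, y_test = rng.random((100, 12)), rng.integers(0, 2, 100)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)   # stand-in for the deep model
tn, fp, fn, tp = confusion_matrix(y_test, clf.predict(X_test)).ravel()
print("sensitivity:", tp / (tp + fn))
print("specificity:", tn / (tn + fp))
```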


1994 ◽  
Vol 24 (10) ◽  
pp. 2068-2077 ◽  
Author(s):  
Valerie M. Lemay ◽  
David E. Tait ◽  
Bart J. van der Kamp

Trees are classed as decayed if there is structural damage as a result of internal decay and as sound otherwise. This classification is often performed on standing trees, using tree and stand measurements to predict the class. For this study, three general rule systems were proposed and compared. Two different methods, discriminant analysis and classification-tree analysis, were used to derive the various rules. These rules were developed for three species, true fir (Abies lasiocarpa (Hook.) Nutt.), western red cedar (Thuja plicata Donn), and trembling aspen (Populus tremuloides Michx.), growing in British Columbia (B.C.). The success of each of the three rule systems developed using each of the two analytical methods was evaluated using the misclassification error rates calculated for a reserved portion of the data available for each species (test data sets). The ease of using and interpreting the rules was also considered in the evaluation. The results indicated that the developed rules were reasonably accurate in predicting the class of trees in the test data sets. Results were more accurate for cedar and aspen than for fir. The use of classification-tree analysis to develop rules is recommended over the use of discriminant analysis. Misclassification error rates were similar for the two methods, but the dichotomous trees that result from classification-tree analysis are much easier to use and interpret.
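A minimal sketch of the comparison described above, scoring discriminant analysis against a classification tree by misclassification rate on a reserved test portion, is shown below. The synthetic features are placeholders for the tree and stand measurements used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in for tree/stand measurements with a binary decayed/sound label.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for name, model in [("discriminant analysis", LinearDiscriminantAnalysis()),
                    ("classification tree", DecisionTreeClassifier(max_depth=4, random_state=0))]:
    model.fit(X_train, y_train)
    error = 1.0 - model.score(X_test, y_test)   # misclassification rate on the reserved data
    print(f"{name}: misclassification rate {error:.3f}")
```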


2005 ◽  
Vol 23 (19) ◽  
pp. 4322-4329 ◽  
Author(s):  
Mark Garzotto ◽  
Tomasz M. Beer ◽  
R. Guy Hudson ◽  
Laura Peters ◽  
Yi-Ching Hsieh ◽  
...  

Purpose To build a decision tree for patients suspected of having prostate cancer using classification and regression tree (CART) analysis. Patients and Methods Data were uniformly collected on 1,433 referred men with a serum prostate-specific antigen (PSA) level of ≤ 10 ng/mL who underwent a prostate biopsy. Factors analyzed included demographic, laboratory, and ultrasound data (ie, hypoechoic lesions and PSA density [PSAD]). Twenty percent of the data was randomly selected and reserved for study validation. CART analysis was performed in two steps, initially using PSA and digital rectal examination (DRE) alone and subsequently using the remaining variables. Results CART analysis selected a PSA cutoff of more than 1.55 ng/mL for further work-up, regardless of DRE findings. CART then selected the following subgroups at risk for a positive biopsy: (1) PSAD more than 0.165 ng/mL/cc; (2) PSAD ≤ 0.165 ng/mL/cc and a hypoechoic lesion; (3) PSAD ≤ 0.165 ng/mL/cc, no hypoechoic lesions, age older than 55.5 years, and prostate volume ≤ 44.0 cc; and (4) PSAD ≤ 0.165 ng/mL/cc, no hypoechoic lesions, age older than 55.5 years, and prostate volume more than 50.25 cc but ≤ 80.8 cc. In the validation data set, specificity and sensitivity were 31.3% and 96.6%, respectively. Cancers that were missed by the CART were Gleason score 6 or less in 93.4% of cases. Receiver operating characteristic curve analysis showed that the CART and logistic regression models had similar accuracy (area under the curve = 0.74 v 0.72, respectively). Conclusion Application of CART analysis to the prostate biopsy decision results in a significant reduction in unnecessary biopsies while retaining a high degree of sensitivity, compared with the standard of performing a biopsy in all patients with an abnormal PSA or DRE.
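The reported rules translate directly into a small decision function; a sketch is given below. It is a plain transcription of the published cutoffs for illustration only, not clinical software.

```python
def biopsy_recommended(psa, psad, hypoechoic_lesion, age, prostate_volume):
    """Return True if the CART-derived rules above would flag the patient for biopsy."""
    if psa <= 1.55:                       # below the PSA cutoff: no further work-up
        return False
    if psad > 0.165:                      # subgroup 1
        return True
    if hypoechoic_lesion:                 # subgroup 2
        return True
    if age > 55.5 and prostate_volume <= 44.0:           # subgroup 3
        return True
    if age > 55.5 and 50.25 < prostate_volume <= 80.8:   # subgroup 4
        return True
    return False

# Example: a 63-year-old with PSA 4.2 ng/mL, PSAD 0.12 ng/mL/cc, no hypoechoic lesion, 40 cc gland.
print(biopsy_recommended(psa=4.2, psad=0.12, hypoechoic_lesion=False, age=63, prostate_volume=40.0))
```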


2021 ◽  
Author(s):  
David Cotton ◽  

Introduction
HYDROCOASTAL is a two-year project funded by ESA, with the objective to maximise exploitation of SAR and SARin altimeter measurements in the coastal zone and inland waters, by evaluating and implementing new approaches to process SAR and SARin data from CryoSat-2, and SAR altimeter data from Sentinel-3A and Sentinel-3B. Optical data from Sentinel-2 MSI and Sentinel-3 OLCI instruments will also be used in generating River Discharge products.
New SAR and SARin processing algorithms for the coastal zone and inland waters will be developed, implemented and evaluated through an initial Test Data Set for selected regions. From the results of this evaluation, a processing scheme will be implemented to generate global coastal zone and river discharge data sets.
A series of case studies will assess these products in terms of their scientific impacts.
All the produced data sets will be available on request to external researchers, and full descriptions of the processing algorithms will be provided.
Objectives
The scientific objectives of HYDROCOASTAL are to enhance our understanding of interactions between inland waters and the coastal zone, between the coastal zone and the open ocean, and the small-scale processes that govern these interactions. The project also aims to improve our capability to characterise the variation at different time scales of inland water storage, exchanges with the ocean and the impact on regional sea-level changes.
The technical objectives are to develop and evaluate new SAR and SARin altimetry processing techniques in support of the scientific objectives, including stack processing, filtering and retracking. An improved Wet Troposphere Correction will also be developed and evaluated.
Project Outline
There are four tasks in the project:
- Scientific Review and Requirements Consolidation: review the current state of the art in SAR and SARin altimeter data processing as applied to the coastal zone and to inland waters.
- Implementation and Validation: new processing algorithms will be implemented to generate a Test Data Set, which will be validated against models, in-situ data, and other satellite data sets. Selected algorithms will then be used to generate global coastal zone and river discharge data sets.
- Impacts Assessment: the impact of these global products will be assessed in a series of case studies.
- Outreach and Roadmap: outreach material will be prepared and distributed to engage with the wider scientific community and provide recommendations for development of future missions and future research.
Presentation
The presentation will provide an overview of the project, present the different SAR altimeter processing algorithms that are being evaluated in the first phase of the project, and show early results from the evaluation of the initial test data set.


2021 ◽  
Author(s):  
Louise Bloch ◽  
Christoph M. Friedrich

Abstract Background: Predicting whether subjects with Mild Cognitive Impairment (MCI) will prospectively develop Alzheimer's Disease (AD) is important for the recruitment and monitoring of subjects in therapy studies. Machine Learning (ML) is suitable to improve early AD prediction. The etiology of AD is heterogeneous, which leads to noisy data sets. Additional noise is introduced by multicentric study designs and varying acquisition protocols. This article examines whether an automatic and fair data valuation method based on Shapley values can identify subjects with noisy data. Methods: An ML workflow was developed and trained for a subset of the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort. The validation was executed for an independent ADNI test data set and for the Australian Imaging, Biomarker and Lifestyle Flagship Study of Ageing (AIBL) cohort. The workflow included volumetric Magnetic Resonance Imaging (MRI) feature extraction, subject sample selection using data Shapley, Random Forest (RF) and eXtreme Gradient Boosting (XGBoost) for model training, and Kernel SHapley Additive exPlanations (SHAP) values for model interpretation. This model interpretation enables clinically relevant explanation of individual predictions. Results: The XGBoost models that excluded 116 of the 467 subjects from the training data set, based on their Logistic Regression (LR) data Shapley values, outperformed the models trained on the entire training data set, which reached a mean classification accuracy of 58.54 % on the independent ADNI test data set, by 14.13 % (8.27 percentage points). The XGBoost models trained on the entire training data set reached a mean accuracy of 60.35 % for the AIBL data set. An improvement of 24.86 % (15.00 percentage points) could be reached for the XGBoost models if the 72 subjects with the smallest RF data Shapley values were excluded from the training data set. Conclusion: The data Shapley method was able to improve the classification accuracies for the test data sets. Noisy data were associated with the number of ApoE ϵ4 alleles and volumetric MRI measurements. Kernel SHAP showed that the black-box models learned biologically plausible associations.
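A minimal sketch of the subject-filtering idea described above is shown below: subjects with the lowest (pre-computed) data Shapley values are dropped from the training set before an XGBoost model is fitted. The feature arrays and the Shapley values are random placeholders; the data-valuation procedure itself is not reproduced here.

```python
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X_train, y_train = rng.random((467, 20)), rng.integers(0, 2, 467)   # stand-in for volumetric MRI features
X_test, y_test = rng.random((100, 20)), rng.integers(0, 2, 100)
data_shapley = rng.normal(size=len(X_train))                        # placeholder per-subject valuations

keep = np.argsort(data_shapley)[116:]          # drop the 116 lowest-valued subjects
model = XGBClassifier(n_estimators=200, eval_metric="logloss")
model.fit(X_train[keep], y_train[keep])
print("test accuracy:", model.score(X_test, y_test))
```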


Heart ◽  
2018 ◽  
Vol 104 (23) ◽  
pp. 1921-1928 ◽  
Author(s):  
Ming-Zher Poh ◽  
Yukkee Cheung Poh ◽  
Pak-Hei Chan ◽  
Chun-Ka Wong ◽  
Louise Pun ◽  
...  

Objective To evaluate the diagnostic performance of a deep learning system for automated detection of atrial fibrillation (AF) in photoplethysmographic (PPG) pulse waveforms. Methods We trained a deep convolutional neural network (DCNN) to detect AF in 17 s PPG waveforms using a training data set of 149 048 PPG waveforms constructed from several publicly available PPG databases. The DCNN was validated using an independent test data set of 3039 smartphone-acquired PPG waveforms from adults at high risk of AF at a general outpatient clinic against ECG tracings reviewed by two cardiologists. Six established AF detectors based on handcrafted features were evaluated on the same test data set for performance comparison. Results In the validation data set (3039 PPG waveforms) consisting of three sequential PPG waveforms from 1013 participants (mean (SD) age, 68.4 (12.2) years; 46.8% men), the prevalence of AF was 2.8%. The area under the receiver operating characteristic curve (AUC) of the DCNN for AF detection was 0.997 (95% CI 0.996 to 0.999) and was significantly higher than that of all the other AF detectors (AUC range: 0.924–0.985). The sensitivity of the DCNN was 95.2% (95% CI 88.3% to 98.7%), specificity was 99.0% (95% CI 98.6% to 99.3%), positive predictive value (PPV) was 72.7% (95% CI 65.1% to 79.3%) and negative predictive value (NPV) was 99.9% (95% CI 99.7% to 100%) using a single 17 s PPG waveform. Using the three sequential PPG waveforms in combination (<1 min in total), the sensitivity was 100.0% (95% CI 87.7% to 100%), specificity was 99.6% (95% CI 99.0% to 99.9%), PPV was 87.5% (95% CI 72.5% to 94.9%) and NPV was 100% (95% CI 99.4% to 100%). Conclusions In this evaluation of PPG waveforms from adults screened for AF in a real-world primary care setting, the DCNN had high sensitivity, specificity, PPV and NPV for detecting AF, outperforming other state-of-the-art methods based on handcrafted features.
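A minimal sketch of a 1D convolutional network for binary AF detection from a fixed-length PPG waveform, in the spirit of the DCNN described above, is shown below. The input length (17 s assumed at 100 Hz, i.e. 1700 samples) and the layer configuration are illustrative assumptions, not the authors' architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(1700, 1)),                 # one 17 s PPG waveform per example
    layers.Conv1D(16, kernel_size=9, activation="relu"),
    layers.MaxPooling1D(4),
    layers.Conv1D(32, kernel_size=9, activation="relu"),
    layers.GlobalAveragePooling1D(),
    layers.Dense(1, activation="sigmoid"),         # probability of AF
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(name="auc")])
model.summary()
```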


2019 ◽  
Vol 15 (1) ◽  
Author(s):  
Görkem Sariyer ◽  
Ceren Öcal Taşar ◽  
Gizem Ersoy Cepe

Abstract Emergency departments (EDs) are the largest departments of hospitals, encountering a high variety of cases as well as high patient volumes. Thus, an efficient classification of these patients at the time of their registration is very important for operations planning and management. Using secondary data from the ED of an urban hospital, we examine the significance of factors in classifying patients according to their length of stay. Random Forest, Classification and Regression Tree, Logistic Regression (LR), and Multilayer Perceptron (MLP) classifiers were trained on the data set of July 2016 and tested on the data set of August 2016. Besides training and testing the algorithms on the whole data sets, patients in these sets were grouped into 21 subgroups based on similarities in their diagnoses, and the algorithms were also applied within these subgroups. Performance of the classifiers was evaluated based on sensitivity, specificity, and accuracy. The sensitivity, specificity, and accuracy values of the classifiers were similar, with LR and MLP attaining somewhat higher values. In addition, for each classifier, the average performance of classifying patients within the subgroups outperformed classification based on the whole data set.
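A minimal sketch of the comparison above is given below: the four classifiers are fitted on one month's ED records and scored on the next month's. The feature and label arrays are random placeholders, and the within-subgroup repetition described above is omitted.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_july, y_july = rng.random((1000, 8)), rng.integers(0, 2, 1000)     # training month (placeholder)
X_august, y_august = rng.random((900, 8)), rng.integers(0, 2, 900)   # test month (placeholder)

models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "CART": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "MLP": MLPClassifier(max_iter=500, random_state=0),
}
for name, model in models.items():
    model.fit(X_july, y_july)
    print(f"{name}: accuracy {model.score(X_august, y_august):.3f}")
```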


2019 ◽  
Vol 7 (3) ◽  
pp. SE113-SE122 ◽  
Author(s):  
Yunzhi Shi ◽  
Xinming Wu ◽  
Sergey Fomel

Salt boundary interpretation is important for the understanding of salt tectonics and for velocity model building for seismic migration. Conventional methods consist of computing salt attributes and extracting salt boundaries. We have formulated the problem as 3D image segmentation and evaluated an efficient approach based on deep convolutional neural networks (CNNs) with an encoder-decoder architecture. To train the model, we design a data generator that extracts randomly positioned subvolumes from a large-scale 3D training data set, followed by data augmentation, and then feed a large number of subvolumes into the network, using salt/nonsalt binary labels generated by thresholding the velocity model as ground truth. We test the model on validation data sets and compare the blind-test predictions with the ground truth. Our results indicate that our method is capable of automatically capturing subtle salt features from the 3D seismic image with little or no need for manual input. We further test the model on a field example to demonstrate the generalization of this deep CNN method across different data sets.
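A minimal sketch of the training-data generator idea described above is shown below: randomly positioned subvolumes are cut from a large 3D seismic image, and binary salt labels are produced by thresholding the co-located velocity model. The array shapes, the 64-voxel subvolume size and the 4.5 km/s threshold are illustrative assumptions, and the augmentation step is omitted.

```python
import numpy as np

def sample_subvolumes(seismic, velocity, n, size=64, salt_velocity=4.5, rng=None):
    """Yield (image, label) pairs of shape (size, size, size) for network training."""
    rng = rng or np.random.default_rng()
    for _ in range(n):
        i, j, k = (rng.integers(0, d - size) for d in seismic.shape)
        image = seismic[i:i+size, j:j+size, k:k+size]
        label = (velocity[i:i+size, j:j+size, k:k+size] > salt_velocity).astype(np.float32)
        yield image, label

seismic = np.random.rand(128, 128, 128)                  # stand-in for the 3D seismic volume
velocity = np.random.uniform(1.5, 6.0, seismic.shape)    # stand-in for the velocity model (km/s)
for image, label in sample_subvolumes(seismic, velocity, n=2):
    print(image.shape, label.mean())                     # fraction of salt voxels in the label
```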


Author(s):  
Gihong Kim ◽  
Bonghee Hong

The testing of RFID information services requires a test data set of business events comprising object, aggregation, quantity and transaction events. To generate business events, we need to address the performance issues in creating a large volume of event data. This paper proposes a new model for the tag life cycle and a fast generation algorithm for this model. We present the results of experiments with the generation algorithm, showing that it outperforms previous methods.
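As a rough illustration of generating synthetic business events for a tag life cycle, a minimal sketch is given below. The life-cycle stages (commission, aggregation, transaction, decommission) and the event fields are assumptions for illustration, not the model or algorithm proposed in the paper.

```python
import uuid
from datetime import datetime, timedelta

def generate_tag_events(start_time):
    """Produce a simple sequence of business events for one tag's life cycle."""
    epc = f"urn:epc:id:sgtin:{uuid.uuid4().hex[:12]}"
    stages = ["commission", "aggregation", "transaction", "decommission"]
    return [
        {"epc": epc, "event": stage, "time": (start_time + timedelta(hours=i)).isoformat()}
        for i, stage in enumerate(stages)
    ]

for event in generate_tag_events(datetime(2024, 1, 1)):
    print(event)
```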


2021 ◽  
Vol 79 (1) ◽  
Author(s):  
Romana Haneef ◽  
Sofiane Kab ◽  
Rok Hrzic ◽  
Sonsoles Fuentes ◽  
Sandrine Fosse-Edorh ◽  
...  

Abstract Background The use of machine learning techniques is increasing in healthcare, allowing health outcomes to be estimated and predicted from large administrative data sets more efficiently. The main objective of this study was to develop a generic machine learning (ML) algorithm to estimate the incidence of diabetes based on the number of reimbursements over the last 2 years. Methods We selected a final data set from a population-based epidemiological cohort (i.e., CONSTANCES) linked with the French National Health Database (i.e., SNDS). To develop this algorithm, we adopted a supervised ML approach with the following steps: i. selection of the final data set, ii. target definition, iii. coding of variables for a given window of time, iv. splitting of the final data into training and test data sets, v. variable selection, vi. model training, vii. validation of the model with the test data set, and viii. selection of the model. We used the area under the receiver operating characteristic curve (AUC) to select the best algorithm. Results The final data set used to develop the algorithm included 44,659 participants from CONSTANCES. Of the 3468 coded variables from the SNDS linked to the CONSTANCES cohort, 23 variables were selected to train different algorithms. The final algorithm to estimate the incidence of diabetes was a Linear Discriminant Analysis model based on the number of reimbursements of selected variables related to biological tests, drugs, medical acts and hospitalization without a procedure over the last 2 years. This algorithm has a sensitivity of 62%, a specificity of 67% and an accuracy of 67% [95% CI: 0.66–0.68]. Conclusions Supervised ML is an innovative tool for the development of new methods to exploit large health administrative databases. In the context of the InfAct project, we have developed and applied for the first time a generic ML algorithm to estimate the incidence of diabetes for public health surveillance. The ML algorithm we have developed has moderate performance. The next step is to apply this algorithm to the SNDS to estimate the incidence of type 2 diabetes cases. More research is needed to apply various machine learning techniques to estimate the incidence of various health conditions.
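A minimal sketch of the supervised pipeline described above is shown below: split into training and test sets, fit a Linear Discriminant Analysis model on reimbursement-count variables, and score it with the AUC. The Poisson-distributed feature matrix and binary target are random placeholders for the 23 selected SNDS variables and the diabetes label.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.poisson(2, size=(44659, 23))            # placeholder reimbursement counts per variable
y = rng.integers(0, 2, size=44659)              # placeholder: 1 = incident diabetes, 0 = otherwise

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
print("AUC:", roc_auc_score(y_test, lda.predict_proba(X_test)[:, 1]))
```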

