User-Guided Global Explanations for Deep Image Recognition: A User Study

Author(s):  
Mandana Hamidi-Haines ◽  
Zhongang Qi ◽  
Alan Paul Fern ◽  
Li Fuxin ◽  
Prasad Tadepalli

We study a user-guided approach for producing global explanations of deep networks for image recognition. The global explanations are produced with respect to a test data set and give the overall frequency of different “recognition reasons” across the data. Each reason corresponds to a small number of the most significant human-recognizable visual concepts used by the network. The key challenge is that the visual concepts cannot be predetermined, and they often do not correspond to existing vocabulary or have labelled data sets. We address this issue via an interactive-naming interface, which allows users to freely cluster significant image regions in the data into visually similar concepts. Our main contribution is a user study on two visual recognition tasks. The results show that the participants were able to produce a small number of visual concepts sufficient for explanation and that there was significant agreement among the concepts, and hence global explanations, produced by different participants.
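A minimal sketch of the pipeline this abstract describes, under stated assumptions: in the paper the concepts come from manual user clustering through the interactive-naming interface, so the k-means step below is only a programmatic stand-in, and `region_features` and `image_ids` are hypothetical inputs (one feature vector per significant image region).

```python
# Sketch only: k-means stands in for the paper's *manual* user clustering.
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

def global_explanation(region_features, image_ids, n_concepts=8):
    """Cluster significant image regions into visual concepts and
    report how often each combination serves as a recognition reason."""
    # Group visually similar regions into candidate concepts.
    labels = KMeans(n_clusters=n_concepts, n_init=10).fit_predict(region_features)
    # A recognition reason = the set of concepts used for one image.
    reasons = {}
    for img, concept in zip(image_ids, labels):
        reasons.setdefault(img, set()).add(concept)
    # Global explanation: frequency of each reason across the test set.
    return Counter(frozenset(r) for r in reasons.values())
```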

2021 ◽  
Author(s):  
David Cotton ◽  

Introduction

HYDROCOASTAL is a two-year project funded by ESA, with the objective of maximising exploitation of SAR and SARin altimeter measurements in the coastal zone and inland waters, by evaluating and implementing new approaches to process SAR and SARin data from CryoSat-2 and SAR altimeter data from Sentinel-3A and Sentinel-3B. Optical data from the Sentinel-2 MSI and Sentinel-3 OLCI instruments will also be used in generating River Discharge products.

New SAR and SARin processing algorithms for the coastal zone and inland waters will be developed, implemented, and evaluated through an initial Test Data Set for selected regions. From the results of this evaluation a processing scheme will be implemented to generate global coastal zone and river discharge data sets.

A series of case studies will assess these products in terms of their scientific impacts.

All the produced data sets will be available on request to external researchers, and full descriptions of the processing algorithms will be provided.

Objectives

The scientific objectives of HYDROCOASTAL are to enhance our understanding of interactions between inland waters and the coastal zone, between the coastal zone and the open ocean, and the small-scale processes that govern these interactions. The project also aims to improve our capability to characterise the variation at different time scales of inland water storage, exchanges with the ocean, and the impact on regional sea-level changes.

The technical objectives are to develop and evaluate new SAR and SARin altimetry processing techniques in support of the scientific objectives, including stack processing, filtering, and retracking. An improved Wet Troposphere Correction will also be developed and evaluated.

Project Outline

The project comprises four tasks:

- Scientific Review and Requirements Consolidation: review the current state of the art in SAR and SARin altimeter data processing as applied to the coastal zone and to inland waters.
- Implementation and Validation: new processing algorithms will be implemented to generate a Test Data Set, which will be validated against models, in-situ data, and other satellite data sets. Selected algorithms will then be used to generate global coastal zone and river discharge data sets.
- Impacts Assessment: the impact of these global products will be assessed in a series of case studies.
- Outreach and Roadmap: outreach material will be prepared and distributed to engage the wider scientific community and provide recommendations for the development of future missions and future research.

Presentation

The presentation will provide an overview of the project, present the different SAR altimeter processing algorithms being evaluated in the first phase of the project, and show early results from the evaluation of the initial test data set.


2021 ◽  
Author(s):  
Louise Bloch ◽  
Christoph M. Friedrich

Abstract Background: The prediction of whether subjects with Mild Cognitive Impairment (MCI) will prospectively develop Alzheimer's Disease (AD) is important for the recruitment and monitoring of subjects for therapy studies. Machine Learning (ML) is suitable to improve early AD prediction. The etiology of AD is heterogeneous, which leads to noisy data sets. Additional noise is introduced by multicentric study designs and varying acquisition protocols. This article examines whether an automatic and fair data valuation method based on Shapley values can identify subjects with noisy data. Methods: An ML workflow was developed and trained for a subset of the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort. The validation was executed for an independent ADNI test data set and for the Australian Imaging, Biomarker and Lifestyle Flagship Study of Ageing (AIBL) cohort. The workflow included volumetric Magnetic Resonance Imaging (MRI) feature extraction, subject sample selection using data Shapley, Random Forest (RF) and eXtreme Gradient Boosting (XGBoost) for model training, and Kernel SHAP (SHapley Additive exPlanations) values for model interpretation. This model interpretation enables clinically relevant explanation of individual predictions. Results: The XGBoost models that excluded 116 of the 467 subjects from the training data set, based on their Logistic Regression (LR) data Shapley values, outperformed the models trained on the entire training data set, which reached a mean classification accuracy of 58.54%, by 14.13% (8.27 percentage points) on the independent ADNI test data set. The XGBoost models trained on the entire training data set reached a mean accuracy of 60.35% for the AIBL data set. An improvement of 24.86% (15.00 percentage points) was reached for the XGBoost models when the 72 subjects with the smallest RF data Shapley values were excluded from the training data set. Conclusion: The data Shapley method was able to improve the classification accuracies for the test data sets. Noisy data were associated with the number of ApoE ε4 alleles and volumetric MRI measurements. Kernel SHAP showed that the black-box models learned biologically plausible associations.
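A hedged sketch of the subject-valuation step, not the authors' code: a Monte-Carlo data Shapley with a logistic-regression value function, followed by excluding the lowest-valued subjects and training XGBoost, as the abstract describes. `X`, `y`, `X_val`, `y_val` stand for the volumetric MRI features and MCI-to-AD conversion labels and are assumptions.

```python
# Sketch under stated assumptions; the paper's exact valuation may differ.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

def data_shapley(X, y, X_val, y_val, n_perm=50, seed=0):
    """Monte-Carlo estimate of each subject's data Shapley value,
    using validation accuracy of a logistic regression as the value function."""
    rng = np.random.default_rng(seed)
    n = len(y)
    phi = np.zeros(n)
    for _ in range(n_perm):
        order = rng.permutation(n)
        prev = 0.0  # value before any subject is added
        for k in range(1, n + 1):
            idx = order[:k]
            if len(np.unique(y[idx])) < 2:
                continue  # need both classes to fit a model
            model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
            cur = accuracy_score(y_val, model.predict(X_val))
            phi[order[k - 1]] += cur - prev  # marginal contribution
            prev = cur
    return phi / n_perm

# Exclude the lowest-valued (noisiest) subjects, then train the final model:
# phi = data_shapley(X, y, X_val, y_val)
# keep = phi >= np.quantile(phi, 0.25)
# clf = XGBClassifier().fit(X[keep], y[keep])
# Kernel SHAP can then explain individual predictions, e.g.:
# import shap; shap.KernelExplainer(clf.predict_proba, shap.kmeans(X[keep], 10))
```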


Author(s):  
Gihong Kim ◽  
Bonghee Hong

The testing of RFID information services requires a test data set of business events comprising object, aggregation, quantity and transaction events. To generate business events, we need to address the performance issues in creating a large volume of event data. This paper proposes a new model for the tag life cycle and a fast generation algorithm for this model. We present the results of experiments with the generation algorithm, showing that it outperforms previous methods.
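The paper's tag life-cycle model is not reproduced here; the following is a hedged sketch of the general idea of generating EPCIS-style business events (object, aggregation, quantity, transaction) along a simple commission→pack→ship life cycle. All field formats and stages are illustrative.

```python
# Illustrative sketch only; not the paper's generation algorithm.
import random
from datetime import datetime, timedelta

def generate_tag_events(n_tags=1000, start=datetime(2024, 1, 1)):
    events = []
    for tag_id in range(n_tags):
        epc = f"urn:epc:id:sgtin:0614141.{tag_id:06d}"
        t = start + timedelta(minutes=random.randint(0, 10_000))
        # Object event: tag commissioned.
        events.append(("ObjectEvent", epc, "ADD", t))
        # Aggregation event: tag packed onto a pallet of 50.
        pallet = f"urn:epc:id:sscc:0614141.{tag_id // 50:07d}"
        events.append(("AggregationEvent", epc, pallet, t + timedelta(hours=1)))
        # Transaction event: pallet shipped against a purchase order.
        events.append(("TransactionEvent", epc, f"PO-{tag_id // 50}", t + timedelta(hours=2)))
        # Quantity event: class-level count observed once per pallet.
        if tag_id % 50 == 0:
            events.append(("QuantityEvent", "urn:epc:idpat:sgtin:0614141.*", 50, t + timedelta(hours=3)))
    return events
```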


2021 ◽  
Vol 79 (1) ◽  
Author(s):  
Romana Haneef ◽  
Sofiane Kab ◽  
Rok Hrzic ◽  
Sonsoles Fuentes ◽  
Sandrine Fosse-Edorh ◽  
...  

Abstract Background The use of machine learning techniques is increasing in healthcare, allowing health outcomes to be estimated and predicted from large administrative data sets more efficiently. The main objective of this study was to develop a generic machine learning (ML) algorithm to estimate the incidence of diabetes based on the number of reimbursements over the last 2 years. Methods We selected a final data set from a population-based epidemiological cohort (i.e., CONSTANCES) linked with the French National Health Database (i.e., SNDS). To develop this algorithm, we adopted a supervised ML approach. The following steps were performed: (i) selection of the final data set, (ii) target definition, (iii) coding of variables for a given window of time, (iv) splitting the final data into training and test data sets, (v) variable selection, (vi) model training, (vii) validation of the model with the test data set, and (viii) model selection. We used the area under the receiver operating characteristic curve (AUC) to select the best algorithm. Results The final data set used to develop the algorithm included 44,659 participants from CONSTANCES. Of the 3468 coded variables from the SNDS linked to the CONSTANCES cohort, 23 were selected to train the different algorithms. The final algorithm to estimate the incidence of diabetes was a Linear Discriminant Analysis model based on the number of reimbursements of selected variables related to biological tests, drugs, medical acts and hospitalization without a procedure over the last 2 years. This algorithm has a sensitivity of 62%, a specificity of 67% and an accuracy of 67% [95% CI: 0.66–0.68]. Conclusions Supervised ML is an innovative tool for the development of new methods to exploit large health administrative databases. In the context of the InfAct project, we developed and applied for the first time a generic ML algorithm to estimate the incidence of diabetes for public health surveillance. The ML algorithm we developed has moderate performance. The next step is to apply this algorithm on the SNDS to estimate the incidence of type 2 diabetes cases. More research is needed to apply various ML techniques to estimate the incidence of various health conditions.
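A minimal sklearn sketch of steps (iv) to (viii) under stated assumptions: `X` holds the reimbursement counts over the last 2 years for the 23 selected variables, `y` the incident-diabetes target, and the candidate list (LDA against a logistic-regression baseline) is illustrative; the study compared its algorithms by AUC.

```python
# Sketch of the split/train/validate/select loop; variable names are assumptions.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def select_best_model(X, y):
    # Step iv: split the final data into training and test data sets.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0)
    candidates = {"LDA": LinearDiscriminantAnalysis(),
                  "LR": LogisticRegression(max_iter=1000)}
    scores = {}
    for name, model in candidates.items():
        model.fit(X_tr, y_tr)                       # step vi: training
        # Steps vii-viii: validate on the test set; AUC drives selection.
        scores[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    best = max(scores, key=scores.get)
    return candidates[best], scores
```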


Genes ◽  
2019 ◽  
Vol 10 (10) ◽  
pp. 778 ◽  
Author(s):  
Liu ◽  
Liu ◽  
Pan ◽  
Li ◽  
Yang ◽  
...  

For cancer diagnosis, many DNA methylation markers have been identified. However, few studies have tried to identify DNA methylation markers to diagnose diverse cancer types simultaneously, i.e., pan-cancers. In this study, we tried to identify DNA methylation markers to differentiate cancer samples from the respective normal samples in pan-cancers. We collected whole-genome methylation data of 27 cancer types containing 10,140 cancer samples and 3386 normal samples, and divided all samples into five data sets, including one training data set, one validation data set and three test data sets. We applied machine learning to identify DNA methylation markers, and specifically, we constructed diagnostic prediction models by deep learning. We identified two categories of markers: 12 CpG markers and 13 promoter markers. Three of the 12 CpG markers and four of the 13 promoter markers are located at cancer-related genes. With the CpG markers, our model achieved an average sensitivity and specificity on the test data sets of 92.8% and 90.1%, respectively. For the promoter markers, the average sensitivity and specificity on the test data sets were 89.8% and 81.1%, respectively. Furthermore, in cell-free DNA methylation data of 163 prostate cancer samples, the CpG markers achieved a sensitivity of 100%, and the promoter markers achieved 92%. For both marker types, the specificity of normal whole blood was 100%. To conclude, we identified methylation markers to diagnose pan-cancers, which might be applied to liquid biopsy of cancers.
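A hedged sketch of the evaluation described above, with a small sklearn neural network standing in for the paper's deep learning model: `X_*` are matrices of beta values for the 12 CpG markers and `y_*` the cancer/normal labels, both assumed inputs.

```python
# Stand-in classifier and the sensitivity/specificity metrics reported above.
from sklearn.metrics import confusion_matrix
from sklearn.neural_network import MLPClassifier

def sens_spec(model, X_test, y_test):
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
    return tp / (tp + fn), tn / (tn + fp)

# clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500).fit(X_train, y_train)
# sensitivity, specificity = sens_spec(clf, X_test, y_test)
```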


2020 ◽  
Author(s):  
David Cotton ◽  
Thomas Moreau ◽  
Mònica Roca ◽  
Christine Gommenginger ◽  
Mathilde Cancet ◽  
...  

SCOOP (SAR Altimetry Coastal & Open Ocean Performance) is a project funded under the ESA SEOM (Scientific Exploitation of Operational Missions) Programme Element, to characterise the expected performance of Sentinel-3 SRAL SAR mode altimeter products, and then to develop and evaluate enhancements to the baseline processing scheme in terms of improvements to ocean measurements. Another objective is to develop and evaluate an improved Wet Troposphere Correction (WTC) for Sentinel-3.

The SCOOP studies are based on two 2-year test data sets derived from CryoSat-2 FBR data, produced for 10 regions. The first Test Data Set was processed with algorithms equivalent to the Sentinel-3 baseline, and the second with algorithms expected to provide improved performance.

We present results from the SCOOP project that demonstrate the excellent performance of SRAL at the coast in terms of measurement precision, with noise in 20 Hz Sea Surface Height measurements of less than 5 cm to within 5 km of the coast.

We then report the development and testing of new processing approaches designed to improve performance (the windowing and zero-padding steps are sketched in code after this abstract), including, from FBR to L1B:

- Application of zero-padding
- Application of intra-burst Hamming windowing
- Exact beam forming in the azimuthal direction
- Restriction of stack processing to within a specified range of look angles
- Along-track antenna compensation

And from L1B to L2:

- Application of alternative re-trackers for SAR and RDSAR

Based on the results of this assessment, a second test data set was generated; we present an assessment of its performance and compare it to that of the original Test Data Set.

Regarding the WTC for Sentinel-3A, the correction from the on-board MWR has been assessed by comparison with independent data sets such as the GPM Microwave Imager (GMI), Jason-2, Jason-3 and Global Navigation Satellite Systems (GNSS) derived WTC at coastal stations. GNSS-derived Path Delay Plus (GPD+) corrections have been derived for S3A. Results indicate good overall performance of the S3A MWR, and GPD+ WTC improvements over the MWR-derived WTC, particularly in coastal and polar regions.

Based on the outcomes of this study we provide recommendations for improving SAR mode altimeter processing and priorities for future research.
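As a rough illustration (not the SCOOP processing chain itself), the numpy sketch below shows the generic form of two of the options listed above: intra-burst Hamming windowing before the azimuth FFT, and zero-padding of the range FFT for finer waveform sampling. Array shapes and the processing order are simplifying assumptions.

```python
# Generic illustration only; real SAR altimeter L1B processing has many more steps.
import numpy as np

def burst_ffts(burst, zero_pad=2):
    """burst: complex echoes, shape (n_pulses, n_range_gates)."""
    # Hamming window across the pulses of the burst suppresses azimuth sidelobes.
    windowed = burst * np.hamming(burst.shape[0])[:, None]
    # Azimuth FFT along the pulse axis forms the Doppler beams.
    beams = np.fft.fft(windowed, axis=0)
    # Zero-padding the range FFT refines the sampling of the waveform peak.
    n_fft = zero_pad * burst.shape[1]
    return np.fft.fft(beams, n=n_fft, axis=1)
```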


2021 ◽  
Vol 11 (4) ◽  
pp. 1529
Author(s):  
Xiaohong Sun ◽  
Jinan Gu ◽  
Meimei Wang ◽  
Yanhua Meng ◽  
Huichao Shi

In the wheel hub industry, the quality control of the product surface determines the subsequent processing, which can be realized through hub defect image recognition based on deep learning. Although existing deep-learning methods have reached human-level performance, they rely on large-scale training sets and are completely unable to cope with classes for which no samples are available. Therefore, in this paper, a generalized zero-shot learning framework for hub defect image recognition was built. First, a reverse mapping strategy was adopted to reduce the hubness problem; then a domain adaptation measure was employed to alleviate the projection domain shift problem; and finally, a scaling calibration strategy was used to avoid the recognition preference for seen defects. The proposed model was validated on two data sets, VOC2007 and a self-built hub defect data set, and the results showed that the method performs better than current popular methods.
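A minimal numpy sketch of the scaling-calibration idea for avoiding seen-class preference, in the spirit of calibrated stacking: the scores of seen defect classes are reduced by a constant so unseen classes can compete at prediction time. The paper's exact calibration may differ; `gamma` and the score layout are assumptions.

```python
# Calibrated-stacking-style sketch; gamma is an illustrative hyperparameter.
import numpy as np

def calibrated_predict(scores, seen_mask, gamma=0.7):
    """scores: (n_samples, n_classes) compatibility scores;
    seen_mask: boolean per class, True for classes seen in training."""
    # Subtract gamma from every seen class's score before taking the argmax.
    adjusted = scores - gamma * seen_mask.astype(float)
    return adjusted.argmax(axis=1)
```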


2021 ◽  
pp. 002203452110357
Author(s):  
T. Chen ◽  
P.D. Marsh ◽  
N.N. Al-Hebshi

An intuitive, clinically relevant index of microbial dysbiosis as a summary statistic of subgingival microbiome profiles is needed. Here, we describe a subgingival microbial dysbiosis index (SMDI) based on machine learning analysis of published periodontitis/health 16S microbiome data. The raw sequencing data, split into training and test sets, were quality filtered, taxonomically assigned to the species level, and centered log-ratio transformed. The training data set was subjected to random forest analysis to identify discriminating species (DS) between periodontitis and health. DS lists, compiled with various Gini importance score cutoffs, were used to compute the SMDI for samples in the training and test data sets as the mean centered log-ratio abundance of periodontitis-associated species minus that of health-associated ones. Diagnostic accuracy was assessed with receiver operating characteristic analysis. An SMDI based on 49 DS provided the highest accuracy, with areas under the curve of 0.96 and 0.92 in the training and test data sets, respectively, and ranged from −6 (most normobiotic) to 5 (most dysbiotic), with a value around zero discriminating most of the periodontitis and healthy samples. The top periodontitis-associated DS were Treponema denticola, Mogibacterium timidum, Fretibacterium spp., and Tannerella forsythia, while Actinomyces naeslundii and Streptococcus sanguinis were the top health-associated DS. The index was highly reproducible across hypervariable regions. Applying the index to additional test data sets in which nitrate had been used to modulate the microbiome demonstrated that nitrate has dysbiosis-lowering properties in vitro and in vivo. Finally, 3 genera (Treponema, Fretibacterium, and Actinomyces) were identified that could be used for calculation of a simplified SMDI with comparable accuracy. In conclusion, we have developed a nonbiased, reproducible, and easy-to-interpret index that can be used to identify patients/sites at risk of periodontitis, to assess the microbial response to treatment, and, importantly, as a quantitative tool in microbiome modulation studies.
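A minimal sketch of the SMDI computation as the abstract defines it: a centered log-ratio (CLR) transform of species abundances, then the mean CLR abundance of periodontitis-associated species minus that of health-associated ones. The pseudocount and index lists are illustrative assumptions.

```python
# SMDI sketch; counts: (n_samples, n_species) abundance matrix.
import numpy as np

def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform per sample."""
    x = counts + pseudocount                       # avoid log(0)
    logx = np.log(x)
    return logx - logx.mean(axis=1, keepdims=True)

def smdi(counts, perio_idx, health_idx):
    """Mean CLR abundance of periodontitis-associated species
    minus that of health-associated species, per sample."""
    z = clr(counts)
    return z[:, perio_idx].mean(axis=1) - z[:, health_idx].mean(axis=1)
```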


2006 ◽  
Vol 29 (1) ◽  
pp. 153-162
Author(s):  
Pratul Kumar Saraswati ◽  
Sanjeev V Sabnis

Paleontologists use statistical methods for prediction and classification of taxa. Over the years, statistical analyses of morphometric data have been carried out under the assumption of multivariate normality. In an earlier study, three closely resembling species of the biostratigraphically important genus Nummulites were discriminated by multi-group discrimination. Two discriminant functions that used diameter and thickness of the tests and height and length of chambers in the final whorl accounted for nearly 100% discrimination. In this paper Classification and Regression Tree (CART), a non-parametric method, is used for classification and prediction of the same data set. In all, 111 iterations of the CART methodology are performed by splitting the data set of 55 observations into training, validation and test data sets in varying proportions. In 40% of the iterations the validation data sets are classified without error, and in another 49% only a single case is misclassified. As regards the test data sets, nearly 70% contain no misclassified cases, whereas about 25% contain only one. The results suggest that the method is highly successful in assigning an individual to a particular species. The key variables on the basis of which the tree models are built are combinations of thickness of the test (T), height of the chambers in the final whorl (HL) and diameter of the test (D). Both discriminant analysis and CART thus appear to be comparable in discriminating the three species. However, CART reduces the number of requisite variables without increasing the misclassification error. The method is very useful for professional geologists for quick identification of species.
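A hedged sklearn sketch of one CART iteration as described: a random split of the 55 observations into training, validation, and test sets, and a tree grown on the morphometric variables (e.g., D, T, HL). The 60/20/20 proportions are one example of the paper's "varying proportions".

```python
# One iteration of the repeated split-and-grow procedure; repeated 111 times in the paper.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def cart_iteration(X, y, seed):
    # Split into 60% training and 40% held out, stratified by species.
    X_tr, X_rest, y_tr, y_rest = train_test_split(
        X, y, train_size=0.6, random_state=seed, stratify=y)
    # Split the remainder evenly into validation and test sets.
    X_val, X_te, y_val, y_te = train_test_split(
        X_rest, y_rest, test_size=0.5, random_state=seed, stratify=y_rest)
    tree = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
    # Return classification accuracy on validation and test sets.
    return tree.score(X_val, y_val), tree.score(X_te, y_te)

# accuracies = [cart_iteration(X, y, s) for s in range(111)]
```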


1997 ◽  
Vol 81 (3_suppl) ◽  
pp. 1387-1391
Author(s):  
J. Gabriel Molina ◽  
Pedro M. Valero ◽  
Jaime Sanmartín

We discuss pairwise algorithms, a kind of computational algorithm useful for dynamically updating statistics as new samples of data are collected. Since test data are usually collected over time as individual data sets, these algorithms can be profitably used in computer programs that handle this situation. Pairwise algorithms are presented for calculating the sum of products of deviations about the mean when adding a sample of data to (or removing one from) the whole data set.
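A minimal sketch of the pairwise idea for the statistic named above: combining the sum of products of deviations about the mean, S_xy, of two samples from their summary statistics alone, so a newly collected data set can be merged without revisiting the raw data. The function name is illustrative, not the paper's.

```python
# Pairwise combination of S_xy = sum((x - mean_x) * (y - mean_y)) for two samples.
def combine(nA, mxA, myA, sxyA, nB, mxB, myB, sxyB):
    """Each quadruple: sample size, means of x and y, and S_xy of that sample.
    Returns the size, means, and S_xy of the pooled sample."""
    n = nA + nB
    dx, dy = mxB - mxA, myB - myA
    mx = mxA + dx * nB / n                       # combined mean of x
    my = myA + dy * nB / n                       # combined mean of y
    sxy = sxyA + sxyB + dx * dy * nA * nB / n    # combined S_xy
    return n, mx, my, sxy

# Run with nB = 1 to add a single observation; solving the same identity
# for the smaller sample's terms removes a sample instead.
```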

