CATI: An Extensible Platform Supporting Assisted Classification of Large Datasets

Author(s):  
Gabriela Bosetti ◽  
Előd Egyed-Zsigmond


2020 ◽
Vol 72 (1) ◽
Author(s):
Ashok Balasubramanyam

An etiologically based classification of diabetes is needed to account for the heterogeneity of type 1 and type 2 diabetes (T1D and T2D) and emerging forms of diabetes worldwide. It may be productive for both classification and clinical discovery to consider variant forms of diabetes as a spectrum. Maturity-onset diabetes of the young and neonatal diabetes serve as models for etiologically defined, rare forms of diabetes in the spectrum. Ketosis-prone diabetes is a model for more complex forms, amenable to phenotypic dissection. Bioinformatic approaches such as clustering analyses of large datasets and multi-omics investigations of rare and atypical phenotypes are promising avenues to explore and define new subgroups of diabetes. Expected final online publication date for the Annual Review of Medicine, Volume 72 is January 27, 2021. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.


Data Mining ◽  
2013 ◽  
pp. 734-750
Author(s):  
T. Ravindra Babu ◽  
M. Narasimha Murty ◽  
S. V. Subrahmanya

Data mining deals with efficient algorithms for processing large data. When such algorithms are combined with data compaction, they can deliver superior performance. Approaches to dealing with large data include working with representatives of the data instead of the entire dataset. The representatives should preferably be generated with minimal data scans. In the current chapter we discuss methods of lossy and non-lossy data compression combined with clustering and classification of large datasets. We demonstrate the working of such schemes on two large datasets.
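To make the scheme concrete, here is a minimal sketch in Python of the general "representatives instead of full data" idea: each class is compressed to a handful of k-means centroids (a lossy compaction), and new points are classified against the centroids alone. The dataset, cluster counts, and nearest-centroid rule are illustrative assumptions, not the chapter's specific algorithms.

```python
# Sketch: compress each class to a few k-means centroids (lossy compaction),
# then classify test points against the centroids only.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

reps, rep_labels = [], []
for label in np.unique(y_train):
    # Compress each class to 10 representative centroids.
    km = KMeans(n_clusters=10, n_init=10, random_state=0)
    km.fit(X_train[y_train == label])
    reps.append(km.cluster_centers_)
    rep_labels.append(np.full(10, label))

reps = np.vstack(reps)
rep_labels = np.concatenate(rep_labels)

# Nearest-representative classification: 20 centroids stand in for 3750 points.
dists = ((X_test[:, None, :] - reps[None, :, :]) ** 2).sum(-1)
pred = rep_labels[dists.argmin(axis=1)]
print("accuracy:", (pred == y_test).mean())
```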


Agronomy ◽  
2019 ◽  
Vol 9 (12) ◽  
pp. 833 ◽  
Author(s):  
Saeed Khaki ◽  
Zahra Khalilzadeh ◽  
Lizhi Wang

Environmental stresses, such as drought and heat, can cause substantial yield loss in agriculture. As such, hybrid crops that are tolerant to drought and heat stress would produce more consistent yields than hybrids that are not tolerant to these stresses. In the 2019 Syngenta Crop Challenge, Syngenta released several large datasets recording the yield performance of 2452 corn hybrids planted in 1560 locations between 2008 and 2017, and asked participants to classify the corn hybrids as either tolerant or susceptible to drought stress, heat stress, and combined drought and heat stress. However, no labels were provided classifying any set of hybrids as tolerant or susceptible to any type of stress. In this paper, we present an unsupervised approach to solving this problem, which was recognized as one of the winners of the 2019 Syngenta Crop Challenge. Our results labeled 121 hybrids as drought tolerant, 193 as heat tolerant, and 29 as tolerant to both stresses.
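The abstract does not disclose the clustering method used, so the following is only a hedged sketch of one plausible unsupervised labeling scheme: summarize each hybrid by its relative yield drop under stress, cluster the hybrids into two groups, and label the group with the smaller drop as tolerant. All column names and values are hypothetical, not from the Syngenta data.

```python
# Hedged sketch of unsupervised tolerance labeling via two-cluster k-means
# on per-hybrid relative yield drop. Toy data, hypothetical column names.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({
    "hybrid": ["H1", "H1", "H2", "H2", "H3", "H3"],
    "stress": ["none", "drought", "none", "drought", "none", "drought"],
    "yield":  [10.0, 9.5, 10.0, 6.0, 9.0, 8.8],
})

# Relative yield drop per hybrid: (normal - stressed) / normal.
wide = df.pivot(index="hybrid", columns="stress", values="yield")
drop = ((wide["none"] - wide["drought"]) / wide["none"]).to_frame("drop")

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(drop)
# The cluster whose centroid has the smaller yield drop is labeled tolerant.
tolerant_cluster = km.cluster_centers_.ravel().argmin()
drop["label"] = np.where(km.labels_ == tolerant_cluster,
                         "tolerant", "susceptible")
print(drop)
```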


2020 ◽  
Vol 7 (1) ◽  
Author(s):  
Phuong Pho ◽  
Alexander V. Mantzaris

Classification of data points which correspond to complex entities such as people or journal articles is an ongoing research task. Notable applications are recommendation systems that act upon customer features or past purchases, and, in academia, the labeling of relevant research papers to reduce the reading time required. The features that can be extracted are many, and they result in large datasets that are a challenge to process with complex machine learning methodologies. There is also the question of how results are presented and how the parameterizations can be interpreted beyond the classification accuracies. This work shows how the network information contained in an adjacency matrix allows improved classification of entities through their associations, and how the simple graph convolution (SGC) framework provides an expressive and fast approach. The proposed regularized SGC incorporates shrinkage upon three different aspects of the projection vectors: the number of parameters, the size of the parameters, and the directions between the vectors, producing more meaningful interpretations.
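As a rough illustration, the sketch below implements plain SGC: node features are propagated K times with the symmetrically normalized adjacency matrix (with self-loops), and a linear classifier is fitted on the smoothed features. The paper's three specific shrinkage penalties are not reproduced here; a generic elastic-net penalty stands in for them, and the toy graph and features are assumptions.

```python
# Sketch of SGC: smooth node features with S^K, then fit a linear classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

def sgc_features(adj, X, K=2):
    """Compute S^K X with S = D^-1/2 (A + I) D^-1/2."""
    A = adj + np.eye(adj.shape[0])
    d = A.sum(axis=1)
    S = A / np.sqrt(np.outer(d, d))
    for _ in range(K):
        X = S @ X
    return X

# Toy graph: two 3-cliques joined by a single edge.
adj = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adj[i, j] = adj[j, i] = 1
X = np.random.default_rng(0).normal(size=(6, 4))
y = np.array([0, 0, 0, 1, 1, 1])

H = sgc_features(adj, X, K=2)
# Elastic-net shrinkage as a generic stand-in for the paper's penalties.
clf = LogisticRegression(penalty="elasticnet", solver="saga",
                         l1_ratio=0.5, C=1.0, max_iter=5000).fit(H, y)
print("train accuracy:", clf.score(H, y))
```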


Metabolites ◽  
2021 ◽  
Vol 11 (4) ◽  
pp. 211
Author(s):  
Fernando Perez-Sanz ◽  
Victoria Ruiz-Hernández ◽  
Marta I. Terry ◽  
Sara Arce-Gallego ◽  
Julia Weiss ◽  
...  

Metabolomes comprise constitutive and non-constitutive metabolites produced due to physiological, genetic or environmental effects. However, finding constitutive and non-constitutive metabolites in large datasets is technically challenging. We developed gcProfileMakeR, an R package that uses standard Excel output files from an Agilent Chemstation GC-MS for automatic data analysis based on CAS numbers. gcProfileMakeR has two filters for data preprocessing that remove contaminants and low-quality peaks. The first function, NormalizeWithinFiles, processes samples, assigning retention times to CAS numbers. The second function, NormalizeBetweenFiles, reaches a consensus between files, grouping together compounds with close retention times. The third function, getGroups, establishes what is considered the Constitutive Profile, Non-constitutive by Frequency (i.e., not present in all samples) and Non-constitutive by Quality. Results can be plotted with the plotGroup function. We used the package to analyse floral scent emissions in four snapdragon genotypes: a wild type, Deficiens nicotianoides and compacta (affecting floral identity), and RNAi:AmLHY (targeting a circadian clock gene). We identified differences in constitutive and non-constitutive scent profiles as well as in the timing of emission. gcProfileMakeR is a very useful tool for defining constitutive and non-constitutive scent profiles; it also allows genotype and circadian datasets to be analysed to identify differing metabolites.
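gcProfileMakeR itself is an R package; purely to illustrate the consensus idea behind NormalizeBetweenFiles, here is a minimal Python sketch (Python is used for all examples on this page) in which peaks whose retention times fall within a tolerance are grouped onto a common consensus list. The tolerance and the toy peak table are assumptions, not the package's actual defaults.

```python
# Sketch: greedy grouping of GC-MS peaks by close retention time, so files
# can be compared on a consensus peak list. Toy values throughout.
peaks = [  # (file, retention_time_min, cas_number) -- made-up values
    ("s1", 5.02, "80-56-8"), ("s2", 5.05, "80-56-8"),
    ("s1", 7.41, "78-70-6"), ("s2", 7.48, "78-70-6"),
    ("s2", 9.90, "98-55-5"),
]

def consensus_groups(peaks, tol=0.1):
    """Single-pass grouping of peaks whose retention times are within tol."""
    groups = []
    for file, rt, cas in sorted(peaks, key=lambda p: p[1]):
        if groups and rt - groups[-1]["rts"][-1] <= tol:
            groups[-1]["rts"].append(rt)
            groups[-1]["members"].append((file, cas))
        else:
            groups.append({"rts": [rt], "members": [(file, cas)]})
    return groups

for g in consensus_groups(peaks):
    print(round(sum(g["rts"]) / len(g["rts"]), 2), g["members"])
```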


Materials ◽  
2021 ◽  
Vol 14 (22) ◽  
pp. 7027
Author(s):  
Stephania Kossman ◽  
Maxence Bigerelle

High-speed nanoindentation rapidly generates large datasets, opening the door to advanced data analysis methods such as those available in artificial intelligence. The present study addresses the problem of differentiating load–displacement curves presenting pop-in, slope changes, or instabilities from curves exhibiting a typical loading path in large nanoindentation datasets. Classification of the curves was achieved with a deep learning model, specifically a convolutional neural network (CNN) implemented in Python using the TensorFlow and Keras libraries. Load–displacement curves (with and without pop-in) from various materials were input to train and validate the model. The curves were converted into square matrices (50 × 50) and then used as inputs for the CNN model. The model successfully differentiated between pop-in and non-pop-in curves with approximately 93% accuracy on the training and validation datasets, indicating that the risk of overfitting the model was negligible. These results confirm that artificial intelligence and computer vision models represent a powerful tool for analyzing nanoindentation data.
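The abstract names TensorFlow/Keras and a 50 × 50 matrix input but not the architecture, so the layer sizes in the following sketch are assumptions for illustration only; dummy random data stands in for the rasterized curves.

```python
# Sketch of a Keras CNN for binary pop-in classification of 50x50 matrices.
import numpy as np
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(50, 50, 1)),          # one curve rasterized to 50x50
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),    # pop-in vs. non-pop-in
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Dummy stand-in data: 32 random matrices with random binary labels.
X = np.random.rand(32, 50, 50, 1).astype("float32")
y = np.random.randint(0, 2, size=(32,))
model.fit(X, y, epochs=1, batch_size=8, verbose=0)
print(model.predict(X[:2], verbose=0))
```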


2020 ◽  
Author(s):  
Gihad N. Sohsah ◽  
Ali Reza Ibrahimzada ◽  
Huzeyfe Ayaz ◽  
Ali Cakmak

The taxonomy of living organisms is of major importance in making the study of vastly heterogeneous living things easier. In addition, various fields of applied biology (e.g., agriculture) depend on the classification of living creatures. Specific fragments of the DNA sequence of a living organism have been defined as DNA barcodes and can be used as markers to identify species efficiently and effectively. The existing DNA barcode-based classification approaches suffer from three major issues: (i) most of them assume that the classification is done within a given taxonomic class and/or that input sequences are prealigned, (ii) high-performing classifiers, such as SVMs, cannot scale to large taxonomies due to high memory requirements, and (iii) mutations and noise in input DNA sequences greatly reduce the taxonomic classification accuracy. In order to address these issues, we propose a multi-level hierarchical classifier framework to automatically assign taxonomy labels to DNA sequences. We utilize an alignment-free approach called the spectrum kernel method for feature extraction. We built a proof-of-concept hierarchical classifier with two levels and evaluated it on real DNA sequence data from BOLD systems. We demonstrate that the proposed framework provides higher accuracy than regular classifiers. Moreover, the hierarchical framework scales better to large datasets, enabling researchers to employ classifiers with high accuracy and high memory requirements on large datasets. Furthermore, we show that the proposed framework is more robust to mutations and noise in sequence data than non-hierarchical classifiers.
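A minimal sketch of the alignment-free spectrum (k-mer) feature map is given below: each DNA sequence is represented by its k-mer count vector, so unaligned sequences of different lengths map into a common feature space. The hierarchical framework then amounts to training one classifier per taxonomic level on such vectors and routing each sequence down the tree; the choice k = 3 is illustrative, not the authors' setting.

```python
# Sketch of the spectrum (k-mer count) feature map for DNA sequences.
from itertools import product
from collections import Counter

K = 3
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]  # all 64 3-mers

def spectrum(seq, k=K):
    """k-mer count vector of a DNA sequence (the spectrum feature map)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts.get(km, 0) for km in KMERS]

print(spectrum("ACGTACGT")[:8])  # first 8 entries of the 64-dim vector
```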

