Bio-informatics and psychiatric epidemiology

Author(s):  
Nicola Voyle ◽  
Maximilian Kerz ◽  
Steven Kiddle ◽  
Richard Dobson

This chapter highlights the methodologies which are increasingly being applied to large datasets or ‘big data’, with an emphasis on bio-informatics. The first stage of any analysis is to collect data from a well-designed study. The chapter begins by looking at the raw data that arises from epidemiological studies and highlighting the first stages in creating clean data that can be used to draw informative conclusions through analysis. The remainder of the chapter covers data formats, data exploration, data cleaning, missing data (i.e. the lack of data for a variable in an observation), reproducibility, classification versus regression, feature identification and selection, method selection (e.g. supervised versus unsupervised machine learning), training a classifier, and drawing conclusions from modelling.

2017 ◽  
Vol 10 (13) ◽  
pp. 485
Author(s):  
Darshan Barapatre ◽  
Vijayalakshmi A

According to interviews with practitioners and domain experts, data scientists spend 50–80% of their valuable time on the mundane task of collecting and preparing structured or unstructured data before it can be explored for useful analysis. Restructuring and refining raw data into more meaningful datasets that can be used for analytics is therefore highly valuable. Hence, the idea is to build a tool that gathers the required data preparation techniques to make data well-structured, offering greater flexibility and an easy-to-use UI. The tool will cover data cleaning, data structuring, data transformation, data compression, and data profiling, together with implementations of related machine learning algorithms.
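The abstract does not prescribe an implementation, but the kind of preparation pipeline the tool is meant to bundle can be sketched in Python with pandas; the column names and cleaning rules below are purely illustrative:

```python
import pandas as pd

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative data-preparation pipeline: clean, structure, transform."""
    out = df.copy()
    # Data structuring: normalize column names.
    out.columns = [c.strip().lower().replace(" ", "_") for c in out.columns]
    # Data cleaning: strip stray whitespace and normalize casing.
    out["city"] = out["city"].str.strip().str.title()
    # Missing data: impute numeric gaps with the column median.
    out["age"] = out["age"].fillna(out["age"].median())
    # Transformation: derive a feature more useful for analytics.
    out["age_group"] = pd.cut(out["age"], bins=[0, 18, 65, 120],
                              labels=["minor", "adult", "senior"])
    # Deduplication: drop exact duplicate records.
    return out.drop_duplicates().reset_index(drop=True)

raw = pd.DataFrame({" Age ": [34, None, 34],
                    "City": [" london", "Paris ", " london"]})
clean = prepare(raw)
print(clean)
```

Each step here stands in for one of the technique families the tool is said to contain; a real pipeline would be configured per dataset rather than hard-coded.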


2016 ◽  
Author(s):  
Camille M. Sultana ◽  
Gavin Cornwell ◽  
Paul Rodriguez ◽  
Kimberly A. Prather

Abstract. Single particle mass spectrometer (SPMS) analysis of aerosols has become increasingly popular since its invention in the 1990s. Today many iterations of commercial and lab-built SPMS instruments are in use worldwide. However, the supporting analysis toolkits for these powerful instruments are either outdated, have limited functionality, or are not available to the scientific community at large. In an effort to advance this field and enable better communication and collaboration between scientists, we have developed FATES (Flexible Analysis Toolkit for the Exploration of SPMS data), a MATLAB toolkit easily extensible to an array of SPMS designs and data formats. FATES was developed to minimize the computational demands of working with large datasets while still allowing easy maintenance, modification, and use by novice programmers. FATES permits scientists to explore complex SPMS data without constraint using simple scripts in a language popular for scientific numerical analysis. In addition, FATES contains an array of data-visualization GUIs that can aid both novice and expert users in the calibration of raw data; the exploration of how mass spectral characteristics depend on size, time, and peak intensity; and investigations of clustered data sets.


Mathematics ◽  
2021 ◽  
Vol 9 (11) ◽  
pp. 1226
Author(s):  
Saeed Najafi-Zangeneh ◽  
Naser Shams-Gharneh ◽  
Ali Arjomandi-Nezhad ◽  
Sarfaraz Hashemkhani Zolfani

Companies always seek ways to retain their professional employees to reduce extra recruiting and training costs. Predicting whether a particular employee may leave helps the company make preventive decisions. Unlike physical systems, human resource problems cannot be described by a scientific-analytical formula; machine learning approaches are therefore the best tools for this task. This paper presents a three-stage (pre-processing, processing, post-processing) framework for attrition prediction, with an IBM HR dataset as the case study. Since the dataset has many features, the "max-out" feature selection method is proposed for dimension reduction in the pre-processing stage and is applied to the IBM HR dataset. The coefficient of each feature in the logistic regression model indicates that feature's importance for attrition prediction. The results show an improvement in the F1-score due to the "max-out" feature selection method. Finally, the validity of the parameters is checked by training the model on multiple bootstrap datasets; the mean and standard deviation of the parameters are then analyzed to assess the confidence and stability of the model's parameters. A small standard deviation of the parameters indicates that the model is stable and is more likely to generalize well.
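The abstract does not spell out the "max-out" method itself, but the bootstrap stability check it describes for logistic regression coefficients can be sketched as follows (Python with scikit-learn; the data are synthetic stand-ins for the IBM HR features, with two informative features and one noise feature):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic binary-attrition data: features 0 and 1 matter, feature 2 is noise.
n = 500
X = rng.normal(size=(n, 3))
logit = 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
X = StandardScaler().fit_transform(X)

# Refit the model on bootstrap resamples and record the coefficients.
coefs = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)
    model = LogisticRegression().fit(X[idx], y[idx])
    coefs.append(model.coef_[0])
coefs = np.array(coefs)

mean, std = coefs.mean(axis=0), coefs.std(axis=0)
# A small std relative to the mean suggests a stable, generalizable parameter.
for j in range(3):
    print(f"feature {j}: coef = {mean[j]:+.2f} ± {std[j]:.2f}")
```

On standardized features, the coefficient magnitudes double as importance scores, which is the reading of the logistic regression model the abstract relies on.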


Author(s):  
Daniel R. Cassar ◽  
Saulo Martiello Mastelini ◽  
Tiago Botari ◽  
Edesio Alcobaça ◽  
André C. P. L. F. de Carvalho ◽  
...  

2020 ◽  
Vol 22 (Supplement_2) ◽  
pp. ii158-ii158
Author(s):  
Nicholas Nuechterlein ◽  
Beibin Li ◽  
James Fink ◽  
David Haynor ◽  
Eric Holland ◽  
...  

Abstract BACKGROUND Previously, we have shown that combined whole-exome sequencing (WES) and genome-wide somatic copy number alteration (SCNA) information can separate IDH1/2-wildtype glioblastoma into two prognostic molecular subtypes (Group 1 and Group 2) and that these subtypes cannot be distinguished by epigenetic or clinical features. However, the potential for radiographic features to discriminate between these molecular subtypes has not been established. METHODS Radiogenomic features (n=35,400) were extracted from 46 multiparametric, pre-operative magnetic resonance imaging (MRI) scans of IDH1/2-wildtype glioblastoma patients from The Cancer Imaging Archive, all of whom have corresponding WES and SCNA data in The Cancer Genome Atlas. We developed a novel feature selection method that leverages the structure of extracted radiogenomic MRI features to mitigate the dimensionality challenge posed by the disparity between the number of features and patients in our cohort. Seven traditional machine learning classifiers were trained to distinguish Group 1 versus Group 2 using our feature selection method. Our feature selection was compared to lasso feature selection, recursive feature elimination, and variance thresholding. RESULTS We are able to classify Group 1 versus Group 2 glioblastomas with a cross-validated area under the curve (AUC) score of 0.82 using ridge logistic regression and our proposed feature selection method, which reduces the size of our feature set from 35,400 to 288. An interrogation of the selected features suggests that features describing contours in the T2 abnormality region on the FLAIR MRI modality may best distinguish these two groups from one another. CONCLUSIONS We successfully trained a machine learning model that allows for relevant targeted feature extraction from standard MRI to accurately predict molecularly-defined risk-stratifying IDH1/2-wildtype glioblastoma patient groups. This algorithm may be applied to future prospective studies to assess the utility of MRI as a surrogate for costly prognostic genomic studies.


2021 ◽  
Vol 3 (1) ◽  
Author(s):  
Nicholas Nuechterlein ◽  
Beibin Li ◽  
Abdullah Feroze ◽  
Eric C Holland ◽  
Linda Shapiro ◽  
...  

Abstract Background Combined whole-exome sequencing (WES) and somatic copy number alteration (SCNA) information can separate isocitrate dehydrogenase (IDH)1/2-wildtype glioblastoma into two prognostic molecular subtypes, which cannot be distinguished by epigenetic or clinical features. The potential for radiographic features to discriminate between these molecular subtypes has yet to be established. Methods Radiologic features (n = 35 340) were extracted from 46 multisequence, pre-operative magnetic resonance imaging (MRI) scans of IDH1/2-wildtype glioblastoma patients from The Cancer Imaging Archive (TCIA), all of whom have corresponding WES/SCNA data. We developed a novel feature selection method that leverages the structure of extracted MRI features to mitigate the dimensionality challenge posed by the disparity between a large number of features and the limited patients in our cohort. Six traditional machine learning classifiers were trained to distinguish molecular subtypes using our feature selection method, which was compared to least absolute shrinkage and selection operator (LASSO) feature selection, recursive feature elimination, and variance thresholding. Results We were able to classify glioblastomas into two prognostic subgroups with a cross-validated area under the curve score of 0.80 (±0.03) using ridge logistic regression on the 15-dimensional principal component analysis (PCA) embedding of the features selected by our novel feature selection method. An interrogation of the selected features suggested that features describing contours in the T2 signal abnormality region on the T2-weighted fluid-attenuated inversion recovery (FLAIR) MRI sequence may best distinguish these two groups from one another. Conclusions We successfully trained a machine learning model that allows for relevant targeted feature extraction from standard MRI to accurately predict molecularly-defined risk-stratifying IDH1/2-wildtype glioblastoma patient groups.
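The novel feature selection method is not reproducible from the abstract, but the classifier stage it feeds, a 15-dimensional PCA embedding followed by ridge (L2-penalized) logistic regression scored by cross-validated AUC, can be sketched with scikit-learn. The data here are synthetic; the sample and feature counts only loosely mirror the cohort:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: many correlated imaging features, few patients.
X, y = make_classification(n_samples=46, n_features=288, n_informative=10,
                           n_redundant=50, random_state=0)

# Classifier stage as described: a 15-dimensional PCA embedding of the
# (already selected) features, fed to L2 ("ridge") logistic regression.
clf = make_pipeline(
    StandardScaler(),
    PCA(n_components=15, random_state=0),
    LogisticRegression(penalty="l2", C=1.0, max_iter=1000),
)

auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print(f"cross-validated AUC: {auc.mean():.2f} ± {auc.std():.2f}")
```

Fitting the PCA inside the cross-validation pipeline, rather than on the full dataset, avoids leaking test-fold information into the embedding, which matters with so few patients.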


Author(s):  
Kai-Chao Yao ◽  
Shih-Feng Fu ◽  
Wei-Tzer Huang ◽  
Cheng-Chun Wu

This article uses LabVIEW to develop whitefly feature-identification and counting technology, together with machine learning algorithms, for whitefly monitoring, identification, and counting applications. A high-magnification CCD camera is used for on-demand image capture, and the functional programs of the LabVIEW NI-DAQ VI library and the LabVIEW NI Vision Development Module are used to develop the image-recognition functions. A grayscale-value pyramid-matching algorithm performs image conversion and recognition in the machine learning mode. The graphical user interface and device hardware provide convenient and effective whitefly feature identification and sample counting, and the monitoring technology supports remote monitoring, counting, data storage, and statistical analysis.
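The paper's implementation lives in LabVIEW VIs, so the following is only a language-agnostic sketch of the grayscale pyramid-matching idea, written in Python/NumPy: match a template at a downsampled (coarse) pyramid level first, then refine the location at full resolution with normalized cross-correlation. The synthetic "scene" and all names are illustrative.

```python
import numpy as np

def downsample(img):
    """Halve resolution by averaging 2x2 blocks (one pyramid level)."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def best_match(img, tmpl):
    """Locate the template via normalized cross-correlation (brute force)."""
    th, tw = tmpl.shape
    t = tmpl - tmpl.mean()
    best, pos = -np.inf, (0, 0)
    for y in range(img.shape[0] - th + 1):
        for x in range(img.shape[1] - tw + 1):
            w = img[y:y + th, x:x + tw]
            w = w - w.mean()
            denom = np.sqrt((w * w).sum() * (t * t).sum()) or 1.0
            score = (w * t).sum() / denom
            if score > best:
                best, pos = score, (y, x)
    return pos

def pyramid_match(img, tmpl):
    """Coarse-to-fine: match on downsampled copies, refine at full size."""
    cy, cx = best_match(downsample(img), downsample(tmpl))
    # Search a small full-resolution window around the coarse estimate.
    y0, x0 = max(0, 2 * cy - 2), max(0, 2 * cx - 2)
    th, tw = tmpl.shape
    region = img[y0:y0 + th + 4, x0:x0 + tw + 4]
    ry, rx = best_match(region, tmpl)
    return y0 + ry, x0 + rx

rng = np.random.default_rng(1)
scene = rng.random((64, 64))
target = scene[20:28, 30:38].copy()   # an 8x8 "whitefly" patch
print(pyramid_match(scene, target))   # → (20, 30)
```

The coarse pass shrinks the search space roughly fourfold per pyramid level, which is what makes this approach practical for on-demand high-resolution camera frames.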


2021 ◽  
Vol 6 ◽  
pp. 309
Author(s):  
Paul Mwaniki ◽  
Timothy Kamanu ◽  
Samuel Akech ◽  
M. J. C Eijkemans

Introduction: Epidemiological studies that involve interpretation of chest radiographs (CXRs) suffer from inter-reader and intra-reader variability. This variability hinders comparison of results across studies or centres, which undermines efforts to track the burden of chest diseases or evaluate the efficacy of interventions such as vaccines. This study explores machine learning models that could standardize interpretation of CXRs across studies, and the utility of incorporating individual reader annotations when training models on CXR datasets annotated by multiple readers. Methods: Convolutional neural networks were used to classify CXRs from seven low- and middle-income countries into five categories according to the World Health Organization's standardized methodology for interpreting paediatric CXRs. We compared models trained to predict the final/aggregate classification with models trained to predict how each reader would classify an image, aggregating the per-reader predictions with an unweighted mean. Results: Incorporating individual readers' annotations during model training improved classification accuracy by 3.4% in relative terms (multi-class accuracy 61% vs. 59%). Model accuracy was higher for children above 12 months of age (68% vs. 58%). The accuracy of the models in different countries ranged between 45% and 71%. Conclusions: Machine learning models can annotate CXRs in epidemiological studies, reducing inter-reader and intra-reader variability. In addition, incorporating individual reader annotations can improve the performance of machine learning models trained on CXRs annotated by multiple readers.
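The aggregation step described above, predicting each reader's classification separately and then combining the predictions with an unweighted mean, can be sketched as follows (Python/NumPy; the probability values are invented for illustration):

```python
import numpy as np

# Hypothetical output of a CNN trained to predict each reader's label:
# shape (n_readers, n_images, n_classes), softmax probabilities over the
# five WHO paediatric CXR categories.
per_reader = np.array([
    [[0.7, 0.1, 0.1, 0.05, 0.05],
     [0.2, 0.4, 0.2, 0.10, 0.10]],
    [[0.6, 0.2, 0.1, 0.05, 0.05],
     [0.1, 0.3, 0.4, 0.10, 0.10]],
    [[0.5, 0.3, 0.1, 0.05, 0.05],
     [0.1, 0.2, 0.5, 0.10, 0.10]],
])

# Aggregate with an unweighted mean over readers, then take the argmax.
mean_probs = per_reader.mean(axis=0)
final = mean_probs.argmax(axis=1)
print(final)   # image 0 → class 0; image 1 → class 2
```

Averaging probabilities rather than hard labels lets a confident minority reader-head outweigh two uncertain ones, which is one plausible reason the per-reader formulation improved accuracy.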

