Bio-informatics and psychiatric epidemiology

Author(s):  
Nicola Voyle ◽  
Maximilian Kerz ◽  
Steven Kiddle ◽  
Richard Dobson

This chapter highlights the methodologies which are increasingly being applied to large datasets or ‘big data’, with an emphasis on bio-informatics. The first stage of any analysis is to collect data from a well-designed study. The chapter begins by looking at the raw data that arises from epidemiological studies and highlighting the first stages in creating clean data that can be used to draw informative conclusions through analysis. The remainder of the chapter covers data formats, data exploration, data cleaning, missing data (i.e. the lack of data for a variable in an observation), reproducibility, classification versus regression, feature identification and selection, method selection (e.g. supervised versus unsupervised machine learning), training a classifier, and drawing conclusions from modelling.

2017 ◽  
Vol 10 (13) ◽  
pp. 485
Author(s):  
Darshan Barapatre ◽  
Vijayalakshmi A

According to interviews with practitioners and domain experts, data scientists spend 50–80% of their valuable time on the mundane task of collecting and preparing structured or unstructured data before it can be explored for useful analysis. Restructuring and refining raw data into more meaningful datasets that can be used for analytics is therefore highly valuable. Hence, the idea is to build a tool that gathers the required data preparation techniques to make data well-structured, offering greater flexibility and an easy-to-use UI. The tool will cover data cleaning, data structuring, data transformation, data compression, and data profiling, together with implementations of related machine learning algorithms.
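The abstract does not prescribe an implementation, but the kind of preparation pipeline the tool is meant to bundle can be sketched in Python with pandas; the column names and cleaning rules below are purely illustrative:

```python
import pandas as pd

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative data-preparation pipeline: clean, structure, transform."""
    out = df.copy()
    # Data structuring: normalize column names.
    out.columns = [c.strip().lower().replace(" ", "_") for c in out.columns]
    # Data cleaning: strip stray whitespace and normalize casing.
    out["city"] = out["city"].str.strip().str.title()
    # Missing data: impute numeric gaps with the column median.
    out["age"] = out["age"].fillna(out["age"].median())
    # Transformation: derive a feature more useful for analytics.
    out["age_group"] = pd.cut(out["age"], bins=[0, 18, 65, 120],
                              labels=["minor", "adult", "senior"])
    # Deduplication: drop exact duplicate records.
    return out.drop_duplicates().reset_index(drop=True)

raw = pd.DataFrame({" Age ": [34, None, 34],
                    "City": [" london", "Paris ", " london"]})
clean = prepare(raw)
print(clean)
```

Each step here stands in for one of the technique families the tool is said to contain; a real pipeline would be configured per dataset rather than hard-coded.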


2016 ◽  
Author(s):  
Camille M. Sultana ◽  
Gavin Cornwell ◽  
Paul Rodriguez ◽  
Kimberly A. Prather

Abstract. Single particle mass spectrometer (SPMS) analysis of aerosols has become increasingly popular since its invention in the 1990s. Today many iterations of commercial and lab-built SPMS instruments are in use worldwide. However, the supporting analysis toolkits for these powerful instruments are either outdated, have limited functionality, or are not available to the scientific community at large. In an effort to advance this field and enable better communication and collaboration between scientists, we have developed FATES (Flexible Analysis Toolkit for the Exploration of SPMS data), a MATLAB toolkit easily extensible to an array of SPMS designs and data formats. FATES was developed to minimize the computational demands of working with large datasets while still allowing easy maintenance, modification, and use by novice programmers. FATES permits scientists to explore complex SPMS data without constraint using simple scripts in a language popular for scientific numerical analysis. In addition, FATES contains an array of data-visualization GUIs that can aid both novice and expert users in the calibration of raw data; the exploration of how mass spectral characteristics depend on size, time, and peak intensity; and investigations of clustered data sets.


Mathematics ◽  
2021 ◽  
Vol 9 (11) ◽  
pp. 1226
Author(s):  
Saeed Najafi-Zangeneh ◽  
Naser Shams-Gharneh ◽  
Ali Arjomandi-Nezhad ◽  
Sarfaraz Hashemkhani Zolfani

Companies always seek ways to retain their professional employees to reduce extra recruiting and training costs. Predicting whether a particular employee may leave helps the company make preventive decisions. Unlike physical systems, human resource problems cannot be described by a scientific-analytical formula; machine learning approaches are therefore the best tools for this task. This paper presents a three-stage (pre-processing, processing, post-processing) framework for attrition prediction, with an IBM HR dataset as the case study. Since the dataset has many features, the "max-out" feature selection method is proposed for dimension reduction in the pre-processing stage and is applied to the IBM HR dataset. The coefficient of each feature in the logistic regression model indicates that feature's importance for attrition prediction. The results show an improvement in the F1-score due to the "max-out" feature selection method. Finally, the validity of the parameters is checked by training the model on multiple bootstrap datasets; the mean and standard deviation of the parameters are then analyzed to assess the confidence and stability of the model's parameters. A small standard deviation of the parameters indicates that the model is stable and is more likely to generalize well.
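The abstract does not spell out the "max-out" method itself, but the bootstrap stability check it describes for logistic regression coefficients can be sketched as follows (Python with scikit-learn; the data are synthetic stand-ins for the IBM HR features, with two informative features and one noise feature):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic binary-attrition data: features 0 and 1 matter, feature 2 is noise.
n = 500
X = rng.normal(size=(n, 3))
logit = 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
X = StandardScaler().fit_transform(X)

# Refit the model on bootstrap resamples and record the coefficients.
coefs = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)
    model = LogisticRegression().fit(X[idx], y[idx])
    coefs.append(model.coef_[0])
coefs = np.array(coefs)

mean, std = coefs.mean(axis=0), coefs.std(axis=0)
# A small std relative to the mean suggests a stable, generalizable parameter.
for j in range(3):
    print(f"feature {j}: coef = {mean[j]:+.2f} ± {std[j]:.2f}")
```

On standardized features, the coefficient magnitudes double as importance scores, which is the reading of the logistic regression model the abstract relies on.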


Author(s):  
Daniel R. Cassar ◽  
Saulo Martiello Mastelini ◽  
Tiago Botari ◽  
Edesio Alcobaça ◽  
André C. P. L. F. de Carvalho ◽  
...  

2020 ◽  
Vol 22 (Supplement_2) ◽  
pp. ii158-ii158
Author(s):  
Nicholas Nuechterlein ◽  
Beibin Li ◽  
James Fink ◽  
David Haynor ◽  
Eric Holland ◽  
...  

Abstract BACKGROUND Previously, we have shown that combined whole-exome sequencing (WES) and genome-wide somatic copy number alteration (SCNA) information can separate IDH1/2-wildtype glioblastoma into two prognostic molecular subtypes (Group 1 and Group 2) and that these subtypes cannot be distinguished by epigenetic or clinical features. However, the potential for radiographic features to discriminate between these molecular subtypes has not been established. METHODS Radiogenomic features (n=35,400) were extracted from 46 multiparametric, pre-operative magnetic resonance imaging (MRI) scans of IDH1/2-wildtype glioblastoma patients from The Cancer Imaging Archive, all of whom have corresponding WES and SCNA data in The Cancer Genome Atlas. We developed a novel feature selection method that leverages the structure of extracted radiogenomic MRI features to mitigate the dimensionality challenge posed by the disparity between the number of features and patients in our cohort. Seven traditional machine learning classifiers were trained to distinguish Group 1 versus Group 2 using our feature selection method. Our feature selection was compared to lasso feature selection, recursive feature elimination, and variance thresholding. RESULTS We are able to classify Group 1 versus Group 2 glioblastomas with a cross-validated area under the curve (AUC) score of 0.82 using ridge logistic regression and our proposed feature selection method, which reduces the size of our feature set from 35,400 to 288. An interrogation of the selected features suggests that features describing contours in the T2 abnormality region on the FLAIR MRI modality may best distinguish these two groups from one another. CONCLUSIONS We successfully trained a machine learning model that allows for relevant targeted feature extraction from standard MRI to accurately predict molecularly-defined risk-stratifying IDH1/2-wildtype glioblastoma patient groups. This algorithm may be applied to future prospective studies to assess the utility of MRI as a surrogate for costly prognostic genomic studies.


2021 ◽  
Vol 3 (1) ◽  
Author(s):  
Nicholas Nuechterlein ◽  
Beibin Li ◽  
Abdullah Feroze ◽  
Eric C Holland ◽  
Linda Shapiro ◽  
...  

Abstract Background Combined whole-exome sequencing (WES) and somatic copy number alteration (SCNA) information can separate isocitrate dehydrogenase (IDH)1/2-wildtype glioblastoma into two prognostic molecular subtypes, which cannot be distinguished by epigenetic or clinical features. The potential for radiographic features to discriminate between these molecular subtypes has yet to be established. Methods Radiologic features (n = 35 340) were extracted from 46 multisequence, pre-operative magnetic resonance imaging (MRI) scans of IDH1/2-wildtype glioblastoma patients from The Cancer Imaging Archive (TCIA), all of whom have corresponding WES/SCNA data. We developed a novel feature selection method that leverages the structure of extracted MRI features to mitigate the dimensionality challenge posed by the disparity between a large number of features and the limited patients in our cohort. Six traditional machine learning classifiers were trained to distinguish molecular subtypes using our feature selection method, which was compared to least absolute shrinkage and selection operator (LASSO) feature selection, recursive feature elimination, and variance thresholding. Results We were able to classify glioblastomas into two prognostic subgroups with a cross-validated area under the curve score of 0.80 (±0.03) using ridge logistic regression on the 15-dimensional principal component analysis (PCA) embedding of the features selected by our novel feature selection method. An interrogation of the selected features suggested that features describing contours in the T2 signal abnormality region on the T2-weighted fluid-attenuated inversion recovery (FLAIR) MRI sequence may best distinguish these two groups from one another. Conclusions We successfully trained a machine learning model that allows for relevant targeted feature extraction from standard MRI to accurately predict molecularly-defined risk-stratifying IDH1/2-wildtype glioblastoma patient groups.
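The novel feature selection method is not reproducible from the abstract, but the classifier stage it feeds, a 15-dimensional PCA embedding followed by ridge (L2-penalized) logistic regression scored by cross-validated AUC, can be sketched with scikit-learn. The data here are synthetic; the sample and feature counts only loosely mirror the cohort:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: many correlated imaging features, few patients.
X, y = make_classification(n_samples=46, n_features=288, n_informative=10,
                           n_redundant=50, random_state=0)

# Classifier stage as described: a 15-dimensional PCA embedding of the
# (already selected) features, fed to L2 ("ridge") logistic regression.
clf = make_pipeline(
    StandardScaler(),
    PCA(n_components=15, random_state=0),
    LogisticRegression(penalty="l2", C=1.0, max_iter=1000),
)

auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print(f"cross-validated AUC: {auc.mean():.2f} ± {auc.std():.2f}")
```

Fitting the PCA inside the cross-validation pipeline, rather than on the full dataset, avoids leaking test-fold information into the embedding, which matters with so few patients.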


Author(s):  
Kai-Chao Yao ◽  
Shih-Feng Fu ◽  
Wei-Tzer Huang ◽  
Cheng-Chun Wu

This article uses LabVIEW to develop whitefly feature-identification and counting technology, together with machine learning algorithms, for whitefly monitoring, identification, and counting applications. A high-magnification CCD camera is used for on-demand image capture, and the functional programs of the LabVIEW NI-DAQ VI library and the LabVIEW NI Vision Development Module are used to develop the image-recognition functions. A grayscale-value pyramid-matching algorithm performs image conversion and recognition in the machine learning mode. The graphical user interface and device hardware provide convenient and effective whitefly feature identification and sample counting, and the monitoring technology supports remote monitoring, counting, data storage, and statistical analysis.
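The paper's implementation lives in LabVIEW VIs, so the following is only a language-agnostic sketch of the grayscale pyramid-matching idea, written in Python/NumPy: match a template at a downsampled (coarse) pyramid level first, then refine the location at full resolution with normalized cross-correlation. The synthetic "scene" and all names are illustrative.

```python
import numpy as np

def downsample(img):
    """Halve resolution by averaging 2x2 blocks (one pyramid level)."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def best_match(img, tmpl):
    """Locate the template via normalized cross-correlation (brute force)."""
    th, tw = tmpl.shape
    t = tmpl - tmpl.mean()
    best, pos = -np.inf, (0, 0)
    for y in range(img.shape[0] - th + 1):
        for x in range(img.shape[1] - tw + 1):
            w = img[y:y + th, x:x + tw]
            w = w - w.mean()
            denom = np.sqrt((w * w).sum() * (t * t).sum()) or 1.0
            score = (w * t).sum() / denom
            if score > best:
                best, pos = score, (y, x)
    return pos

def pyramid_match(img, tmpl):
    """Coarse-to-fine: match on downsampled copies, refine at full size."""
    cy, cx = best_match(downsample(img), downsample(tmpl))
    # Search a small full-resolution window around the coarse estimate.
    y0, x0 = max(0, 2 * cy - 2), max(0, 2 * cx - 2)
    th, tw = tmpl.shape
    region = img[y0:y0 + th + 4, x0:x0 + tw + 4]
    ry, rx = best_match(region, tmpl)
    return y0 + ry, x0 + rx

rng = np.random.default_rng(1)
scene = rng.random((64, 64))
target = scene[20:28, 30:38].copy()   # an 8x8 "whitefly" patch
print(pyramid_match(scene, target))   # → (20, 30)
```

The coarse pass shrinks the search space roughly fourfold per pyramid level, which is what makes this approach practical for on-demand high-resolution camera frames.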


2021 ◽  
Vol 6 ◽  
pp. 309
Author(s):  
Paul Mwaniki ◽  
Timothy Kamanu ◽  
Samuel Akech ◽  
M. J. C Eijkemans

Introduction: Epidemiological studies that involve interpretation of chest radiographs (CXRs) suffer from inter-reader and intra-reader variability. This variability hinders comparison of results across studies or centres, which undermines efforts to track the burden of chest diseases or evaluate the efficacy of interventions such as vaccines. This study explores machine learning models that could standardize interpretation of CXRs across studies, and the utility of incorporating individual reader annotations when training models on CXR datasets annotated by multiple readers. Methods: Convolutional neural networks were used to classify CXRs from seven low- and middle-income countries into five categories according to the World Health Organization's standardized methodology for interpreting paediatric CXRs. We compared models trained to predict the final/aggregate classification with models trained to predict how each reader would classify an image, aggregating the per-reader predictions with an unweighted mean. Results: Incorporating individual readers' annotations during model training improved classification accuracy by 3.4% in relative terms (multi-class accuracy 61% vs. 59%). Model accuracy was higher for children above 12 months of age (68% vs. 58%). The accuracy of the models in different countries ranged between 45% and 71%. Conclusions: Machine learning models can annotate CXRs in epidemiological studies, reducing inter-reader and intra-reader variability. In addition, incorporating individual reader annotations can improve the performance of machine learning models trained on CXRs annotated by multiple readers.
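The aggregation step described above, predicting each reader's classification separately and then combining the predictions with an unweighted mean, can be sketched as follows (Python/NumPy; the probability values are invented for illustration):

```python
import numpy as np

# Hypothetical output of a CNN trained to predict each reader's label:
# shape (n_readers, n_images, n_classes), softmax probabilities over the
# five WHO paediatric CXR categories.
per_reader = np.array([
    [[0.7, 0.1, 0.1, 0.05, 0.05],
     [0.2, 0.4, 0.2, 0.10, 0.10]],
    [[0.6, 0.2, 0.1, 0.05, 0.05],
     [0.1, 0.3, 0.4, 0.10, 0.10]],
    [[0.5, 0.3, 0.1, 0.05, 0.05],
     [0.1, 0.2, 0.5, 0.10, 0.10]],
])

# Aggregate with an unweighted mean over readers, then take the argmax.
mean_probs = per_reader.mean(axis=0)
final = mean_probs.argmax(axis=1)
print(final)   # image 0 → class 0; image 1 → class 2
```

Averaging probabilities rather than hard labels lets a confident minority reader-head outweigh two uncertain ones, which is one plausible reason the per-reader formulation improved accuracy.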

