R package for animal behaviour classification from accelerometer data - rabc

Mapping Intimacies ◽

10.22541/au.160594403.36425983/v1 ◽

2020 ◽

Author(s):

Hui Yu ◽

Marcel Klaassen

Keyword(s):

Feature Selection ◽

Data Visualization ◽

R Package ◽

Animal Behaviour ◽

Accelerometer Data ◽

Raw Data ◽

Ciconia Ciconia ◽

Model Training ◽

Feature Calculation ◽

Feature Visualization

Increasingly, animal behaviour studies are enhanced through the use of accelerometry. Translating raw accelerometer data into animal behaviours requires the development of classifiers. Here, we present the “rabc” package to assist researchers with the interactive development of such animal-behaviour classifiers based on datasets consisting of accelerometer data and their corresponding animal behaviours. Using an accelerometer dataset and a corresponding behavioural dataset collected on white storks (Ciconia ciconia), we illustrate the workflow of this package, including raw data visualization, feature calculation, feature selection, feature visualization, extreme gradient boosting model training, validation, and, finally, a demonstration of the behaviour classification results.
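
As a rough illustration of the feature-calculation step mentioned in this abstract, the sketch below computes a few summary statistics per axis for one accelerometer segment. It is written in Python rather than R, it does not use or reproduce the rabc API, and the function name `window_features` and the chosen features are assumptions made for this example only.

```python
from statistics import mean, pstdev

def window_features(x, y, z):
    """Summary features for one fixed-length accelerometer segment.

    x, y, z are equal-length sequences of acceleration values.
    The feature set here (mean, SD, range) is illustrative only.
    """
    features = {}
    for name, axis in (("x", x), ("y", y), ("z", z)):
        features[f"{name}_mean"] = mean(axis)
        features[f"{name}_sd"] = pstdev(axis)  # population standard deviation
        features[f"{name}_range"] = max(axis) - min(axis)
    return features

# Hypothetical four-sample segment, not data from the white stork study.
segment = window_features(
    [0.0, 0.2, -0.2, 0.0],
    [1.0, 1.0, 1.0, 1.0],
    [0.5, 0.7, 0.3, 0.5],
)
```

Each labelled segment would yield one such feature row, and those rows are what a classifier such as a gradient-boosted tree model is trained on.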


Feature Selection for Unsupervised Machine Learning of Accelerometer Data Physical Activity Clusters – A Systematic Review

Gait & Posture ◽

10.1016/j.gaitpost.2021.08.007 ◽

2021 ◽

Author(s):

Petra J. Jones ◽

Mike Catt ◽

Melanie J. Davies ◽

Charlotte L. Edwardson ◽

Evgeny M. Mirkes ◽

...

Keyword(s):

Physical Activity ◽

Machine Learning ◽

Systematic Review ◽

Feature Selection ◽

Accelerometer Data ◽

Unsupervised Machine Learning ◽

Selection For


FORESEE: a tool for the systematic comparison of translational drug response modeling pipelines

10.7287/peerj.preprints.27256v1 ◽

2018 ◽

Author(s):

Lisa-Katrin Turnhoff ◽

Ali Hadizadeh Esfahani ◽

Maryam Montazeri ◽

Nina Kusch ◽

Andreas Schuppert

Keyword(s):

Drug Response ◽

Drug Efficacy ◽

Response Prediction ◽

R Package ◽

Supplementary Information ◽

Supplementary File ◽

Data Sets ◽

Training Algorithms ◽

Model Training

Translational models that utilize omics data generated in in vitro studies to predict the drug efficacy of anti-cancer compounds in patients are highly distinct, which complicates the benchmarking process for new computational approaches. In reaction to this, we introduce the uniFied translatiOnal dRug rESponsE prEdiction platform FORESEE, an open-source R-package. FORESEE not only provides a uniform data format for public cell line and patient data sets, but also establishes a standardized environment for drug response prediction pipelines, incorporating various state-of-the-art preprocessing methods, model training algorithms and validation techniques. The modular implementation of individual elements of the pipeline facilitates a straightforward development of combinatorial models, which can be used to re-evaluate and improve already existing pipelines as well as to develop new ones. Availability and Implementation: FORESEE is licensed under GNU General Public License v3.0 and available at https://github.com/JRC-COMBINE/FORESEE. Supplementary Information: Supplementary Files 1 and 2 provide detailed descriptions of the pipeline and the data preparation process, while Supplementary File 3 presents basic use cases of the package. Contact: [email protected]


modelBuildR: an R package for model building and feature selection with erroneous classifications

PeerJ ◽

10.7717/peerj.10849 ◽

2021 ◽

Vol 9 ◽

pp. e10849

Author(s):

Maximilian Knoll ◽

Jennifer Furkel ◽

Juergen Debus ◽

Amir Abdollahi

Keyword(s):

Feature Selection ◽

Cross Validation ◽

Model Building ◽

Linear Models ◽

Binary Classification ◽

Ground Truth ◽

R Package ◽

Methylation Array ◽

Survival Difference ◽

Error Probabilities

Background: Model building is a crucial part of omics-based biomedical research to transfer classifications and obtain insights into underlying mechanisms. Feature selection is often based on minimizing error between model predictions and a given classification (maximizing accuracy). Human ratings/classifications, however, might be error prone, with discordance rates between experts of 5–15%. We therefore evaluate whether a feature pre-filtering step might improve identification of features associated with true underlying groups. Methods: Data were simulated for up to 100 samples and up to 10,000 features, 10% of which were associated with the ground truth comprising 2–10 normally distributed populations. Binary and semi-quantitative ratings with varying error probabilities were used as classification. For feature preselection, standard cross-validation (V2) was compared to a novel heuristic (V1) applying univariate testing, multiplicity adjustment and cross-validation on switched dependent (classification) and independent (features) variables. Preselected features were used to train logistic regression/linear models (backward selection, AIC). Predictions were compared against the ground truth (ROC, multiclass-ROC). As a use case, multiple feature selection/classification methods were benchmarked against the novel heuristic to identify prognostically different G-CIMP negative glioblastoma tumors from the TCGA-GBM 450k methylation array data cohort, starting from a fuzzy UMAP-based rough and erroneous separation. Results: V1 yielded higher median AUC ranks for two true groups (ground truth), with smaller differences for true graduated differences (3–10 groups). Lower fractions of models were successfully fit with V1. Median AUCs for binary classification and two true groups were 0.91 (range: 0.54–1.00) for V1 (Benjamini-Hochberg) and 0.70 (0.28–1.00) for V2; 13% (n = 616) of V2 models showed AUCs ≤ 50% for 25 samples and 100 features. For larger numbers of features and samples, median AUCs were 0.75 (range 0.59–1.00) for V1 and 0.54 (range 0.32–0.75) for V2. In the TCGA-GBM data, modelBuildR allowed the best prognostic separation of patients, with the highest median overall survival difference (7.51 months), followed by a difference of 6.04 months for a random forest based method. Conclusions: The proposed heuristic is beneficial for the retrieval of features associated with two true groups classified with errors. We provide the R package modelBuildR to simplify (comparative) evaluation/application of the proposed heuristic (http://github.com/mknoll/modelBuildR).


IPDfromKM: Reconstruct Individual Patient Data from Published Kaplan-Meier Survival Curves

10.21203/rs.3.rs-117525/v1 ◽

2020 ◽

Author(s):

Na Liu ◽

Yanhong Zhou ◽

J. Jack Lee

Keyword(s):

Survival Data ◽

Individual Patient Data ◽

Secondary Analysis ◽

R Package ◽

Patient Data ◽

Survival Curves ◽

Raw Data ◽

Kaplan Meier ◽

Number Of Patients ◽

Shiny Application

Background: When applying secondary analysis to published survival data, it is critical to obtain each patient’s raw data, because the individual patient data (IPD) approach is considered the gold standard of data analysis. However, researchers often lack access to the IPD. We aim to propose a straightforward and robust approach to help researchers obtain IPD from published survival curves with a friendly software platform. Results: Improving upon the existing methods, we proposed an easy-to-use, two-stage approach to reconstruct IPD from published Kaplan-Meier (K-M) curves. Stage 1 extracts raw data coordinates and Stage 2 reconstructs IPD using the proposed method. To facilitate the use of the proposed method, we developed the R package IPDfromKM and an accompanying web-based Shiny application. Both the R package and the Shiny application can be used to extract raw data coordinates from published K-M curves, reconstruct IPD from the extracted coordinates, visualize the reconstructed IPD, assess the accuracy of the reconstruction, and perform secondary analysis on the IPD. We illustrate the use of the R package and the Shiny application with K-M curves from published studies. Extensive simulations and real-world data applications demonstrate that the proposed method has high accuracy and great reliability in estimating the number of events, number of patients at risk, survival probabilities, median survival times, and hazard ratios. Conclusions: IPDfromKM has great flexibility and accuracy in reconstructing IPD from published K-M curves with different shapes. We believe that the R package and the Shiny application will greatly facilitate the potential use of quality IPD and advance the use of secondary data to make informed decisions in medical research.
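
For context on what a published K-M curve encodes, the sketch below computes the Kaplan-Meier product-limit estimate from individual patient data. This is the forward direction only; it is not the reconstruction method of IPDfromKM, and the function name `kaplan_meier` and the example data are assumptions for illustration.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times  : follow-up time per patient
    events : 1 if the event was observed at that time, 0 if censored
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed  # events and censored patients both leave the risk set
    return curve

# Hypothetical data: five patients, two of them censored.
curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
```

Reconstructing IPD amounts to inverting this mapping: given curve coordinates and numbers at risk, find event and censoring times that would reproduce the steps.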


Performance Analysis of Classifiers on Filter-Based Feature Selection Approaches on Microarray Data

Bio-Inspired Computing for Information Retrieval Applications - Advances in Knowledge Acquisition, Transfer, and Management ◽

10.4018/978-1-5225-2375-8.ch002 ◽

2017 ◽

pp. 41-70 ◽

Cited By ~ 5

Author(s):

Arunkumar Chinnaswamy ◽

Ramakrishnan Srinivasan

Keyword(s):

Feature Selection ◽

Microarray Data ◽

Classification Accuracy ◽

Information Gain ◽

Feature Subset ◽

Classification Problems ◽

Raw Data ◽

Correlation Based Feature Selection ◽

Feature Selection Approach ◽

Gene Expression Levels

Feature selection in machine learning reduces the number of features (here, genes) while retaining an acceptable level of classification accuracy. This paper discusses filter-based feature selection methods such as Information Gain and the Correlation coefficient. After feature selection, the selected genes are passed to five classifiers: Naïve Bayes, Bagging, Random Forest, J48 and Decision Stump. The same experiment is performed on the raw data as well. Experimental results show that the filter-based approaches effectively reduce the number of gene expression levels, yielding a reduced feature subset that produces higher classification accuracy than the same experiment performed on the raw data. In addition, Correlation Based Feature Selection uses far fewer genes and produces higher accuracy than the Information Gain based Feature Selection approach.
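
The Information Gain filter named in this abstract can be sketched as follows for a single discretised feature. This is a minimal illustration, not the chapter's implementation; the function names and the toy labels are assumptions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy conditional on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Hypothetical discretised expression levels for two genes.
labels = ["tumour", "tumour", "normal", "normal"]
informative = information_gain(["high", "high", "low", "low"], labels)
uninformative = information_gain(["high", "low", "high", "low"], labels)
```

A filter method ranks all features by such a score and keeps the top-scoring subset before any classifier is trained.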


DATA VISUALIZATION TOOLS FOR CONFOUNDING AND SELECTION BIAS IN LONGITUDINAL DATA: THE %LENGTHEN, %BALANCE, AND %MAKEPLOT (CONFOUNDR) MACROS AND R PACKAGE

American Journal of Epidemiology ◽

10.1093/aje/kwaa143 ◽

2020 ◽

Vol 189 (12) ◽

pp. 1633-1636

Author(s):

Erin M Schnellinger ◽

Linda Valeri ◽

John W Jackson

Keyword(s):

Longitudinal Data ◽

Selection Bias ◽

Data Visualization ◽

R Package ◽

Visualization Tools


Comparative Analysis on Machine Learning and Deep Learning to Predict Post-Induction Hypotension

Sensors ◽

10.3390/s20164575 ◽

2020 ◽

Vol 20 (16) ◽

pp. 4575 ◽

Cited By ~ 1

Author(s):

Jihyun Lee ◽

Jiyoung Woo ◽

Ah Reum Kang ◽

Young-Seob Jeong ◽

Woohyun Jung ◽

...

Keyword(s):

Neural Network ◽

Machine Learning ◽

Feature Selection ◽

Deep Learning ◽

Random Forest ◽

Tracheal Intubation ◽

Feature Engineering ◽

Learning Models ◽

Raw Data ◽

Vital Records

Hypotensive events in the initial stage of anesthesia can cause serious, potentially fatal complications in patients after surgery. In this study, we aimed to predict hypotension after tracheal intubation one minute in advance using machine learning and deep learning techniques. Meta learning models, such as random forest and extreme gradient boosting (Xgboost), and deep learning models, especially the convolutional neural network (CNN) model and the deep neural network (DNN), were trained to predict hypotension occurring between tracheal intubation and incision, using data from four minutes to one minute before tracheal intubation. Vital records and electronic health records (EHR) for 282 of 319 patients who underwent laparoscopic cholecystectomy from October 2018 to July 2019 were collected. Among the 282 patients, 151 developed post-induction hypotension. Our experiments had two scenarios: using raw vital records and using feature engineering on vital records. The experiments on raw data showed that CNN had the best accuracy of 72.63%, followed by random forest (70.32%) and Xgboost (64.6%). The experiments on feature engineering showed that random forest combined with feature selection had the best accuracy of 74.89%, while CNN had an accuracy of 68.95%, lower than in the experiment on raw data. Our study extends previous studies by detecting hypotension one minute before intubation. To improve accuracy, we built a model using state-of-the-art algorithms. We found that CNN performed well, but that random forest performed better when combined with feature selection. In addition, we found that the examination period (data period) is also important.


Online Feature Selection for Robust Classification of the Microbiological Quality of Traditional Vanilla Cream by Means of Multispectral Imaging

Sensors ◽

10.3390/s19194071 ◽

2019 ◽

Vol 19 (19) ◽

pp. 4071 ◽

Cited By ~ 1

Author(s):

Alexandra Lianou ◽

Arianna Mencattini ◽

Alexandro Catini ◽

Corrado Di Natale ◽

George-John E. Nychas ◽

...

Keyword(s):

Feature Selection ◽

Multispectral Imaging ◽

Dairy Product ◽

Real Life ◽

Microbiological Quality ◽

Support Vector ◽

Isothermal Conditions ◽

Online Feature Selection ◽

Model Training

The performance of an Unsupervised Online feature Selection (UOS) algorithm was investigated for the selection of training features of multispectral images acquired from a dairy product (vanilla cream) stored under isothermal conditions. The selected features were further used as input in a support vector machine (SVM) model with linear kernel for the determination of the microbiological quality of vanilla cream. Model training (n = 65) was based on two batches of cream samples provided directly by the manufacturer and stored at different isothermal conditions (4, 8, 12, and 15 °C), whereas model testing (n = 132) and validation (n = 48) were based on real life conditions by analyzing samples from different retail outlets as well as expired samples from the market. Qualitative analysis was performed for the discrimination of cream samples in two microbiological quality classes based on the values of total viable counts [TVC ≤ 2.0 log CFU/g (fresh samples) and TVC ≥ 6.0 log CFU/g (spoiled samples)]. Results exhibited good performance with an overall accuracy of classification for the two classes of 91.7% for model validation. Further on, the model was extended to include the samples in the TVC range 2–6 log CFU/g, using 1 log step to define the microbiological quality of classes in order to assess the potential of the model to estimate increasing microbial populations. Results demonstrated that high rates of correct classification could be obtained in the range of 2–5 log CFU/g, whereas the percentage of erroneous classification increased in the TVC class (5,6) that was close to the spoilage level of the product. Overall, the results of this study demonstrated that the UOS algorithm in tandem with spectral data acquired from multispectral imaging could be a promising method for real-time assessment of the microbiological quality of vanilla cream samples.


Data Visualization and Feature Selection Methods in Gel-based Proteomics

Current Protein and Peptide Science ◽

10.2174/1389203715666140221112334 ◽

2014 ◽

Vol 15 (1) ◽

pp. 4-22 ◽

Cited By ~ 11

Author(s):

Tome Silva ◽

Nadege Richard ◽

Jorge Dias ◽

Pedro Rodrigues

Keyword(s):

Feature Selection ◽

Data Visualization ◽

Selection Methods


Missing value imputation for physical activity data measured by accelerometer

Statistical Methods in Medical Research ◽

10.1177/0962280216633248 ◽

2016 ◽

Vol 27 (2) ◽

pp. 490-506 ◽

Cited By ~ 11

Author(s):

Jung Ae Lee ◽

Jeff Gill

Keyword(s):

Physical Activity ◽

Count Data ◽

R Package ◽

Predictive Distribution ◽

Epidemiological Studies ◽

Accelerometer Data ◽

Activity Data ◽

Health And Nutrition ◽

Log Normal ◽

Over Dispersion

An accelerometer, a wearable motion sensor on the hip or wrist, is becoming a popular tool in clinical and epidemiological studies for measuring physical activity. Such data provide a series of activity counts at every minute or even more often and display a person’s activity pattern throughout a day. Unfortunately, the collected data can include irregular missing intervals because of participant noncompliance, which makes the statistical analysis more challenging. The purpose of this study is to develop a novel imputation method to handle multivariate count data, motivated by the accelerometer data structure. We specify the predictive distribution of the missing data with a mixture of zero-inflated Poisson and log-normal distributions, which is shown to be effective in dealing with the minute-by-minute autocorrelation as well as under- and over-dispersion of count data. The imputation is performed at the minute level and follows the principles of multiple imputation using a fully conditional specification with the chained algorithm. To facilitate the practical use of this method, we provide an R package, accelmissing. Our method is demonstrated using 2003–2004 National Health and Nutrition Examination Survey data.
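
To make the zero-inflated Poisson component of that predictive distribution concrete, the sketch below evaluates its probability mass function. It covers only the zero-inflated Poisson part, not the log-normal component or the imputation procedure, and it is not the accelmissing API; the function name `zip_pmf` and the parameter values are assumptions.

```python
from math import exp, factorial

def zip_pmf(k, pi_zero, lam):
    """Zero-inflated Poisson probability of observing count k.

    pi_zero : probability of a structural zero (e.g. a fully sedentary minute)
    lam     : mean of the Poisson component
    """
    poisson = exp(-lam) * lam**k / factorial(k)
    if k == 0:
        # A zero can come from the structural-zero state or from the Poisson part.
        return pi_zero + (1 - pi_zero) * poisson
    return (1 - pi_zero) * poisson

# Hypothetical parameters: 30% structural zeros, Poisson mean of 2 counts.
p_zero = zip_pmf(0, 0.3, 2.0)
total = sum(zip_pmf(k, 0.3, 2.0) for k in range(60))
```

The extra mass at zero is what lets the model represent the many zero-count minutes typical of accelerometer records.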
