R packages | ScienceGate

Classification tree is a widely used machine learning method. It has multiple implementations as R packages; rpart, ctree, evtree, tree and C5.0. The details of these implementations are not the same, and hence their performances differ from one application to another. We are interested in their performance in the classification of cells using the single-cell RNA-Sequencing data. In this paper, we conducted a benchmark study using 22 Single-Cell RNA-sequencing data sets. Using cross-validation, we compare packages’ prediction performances based on their Precision, Recall, F1-score, Area Under the Curve (AUC). We also compared the Complexity and Run-time of these R packages. Our study shows that rpart and evtree have the best Precision; evtree is the best in Recall, F1-score and AUC; C5.0 prefers more complex trees; tree is consistently much faster than others, although its complexity is often higher than others.

Download Full-text

Detecting multiple generalized change-points by isolating single ones

Metrika ◽

10.1007/s00184-021-00821-6 ◽

2021 ◽

Author(s):

Andreas Anastasiou ◽

Piotr Fryzlewicz

Keyword(s):

Model Selection ◽

Linear Trend ◽

Information Criterion ◽

Change Points ◽

Isolation Technique ◽

New Approach ◽

Piecewise Constant ◽

Practical Performance ◽

R Packages ◽

The Times

AbstractWe introduce a new approach, called Isolate-Detect (ID), for the consistent estimation of the number and location of multiple generalized change-points in noisy data sequences. Examples of signal changes that ID can deal with are changes in the mean of a piecewise-constant signal and changes, continuous or not, in the linear trend. The number of change-points can increase with the sample size. Our method is based on an isolation technique, which prevents the consideration of intervals that contain more than one change-point. This isolation enhances ID’s accuracy as it allows for detection in the presence of frequent changes of possibly small magnitudes. In ID, model selection is carried out via thresholding, or an information criterion, or SDLL, or a hybrid involving the former two. The hybrid model selection leads to a general method with very good practical performance and minimal parameter choice. In the scenarios tested, ID is at least as accurate as the state-of-the-art methods; most of the times it outperforms them. ID is implemented in the R packages IDetect and breakfast, available from CRAN.

Download Full-text

ImbTreeEntropy and ImbTreeAUC: Novel R Packages for Decision Tree Learning on the Imbalanced Datasets

Electronics ◽

10.3390/electronics10060657 ◽

2021 ◽

Vol 10 (6) ◽

pp. 657

Author(s):

Krzysztof Gajowniczek ◽

Tomasz Ząbkowski

Keyword(s):

Imbalanced Data ◽

Misclassification Cost ◽

Learning Time ◽

Misclassification Costs ◽

Split Point ◽

Speed Up ◽

R Packages ◽

Class Labels ◽

Entropy Functions ◽

Support Cost

This paper presents two R packages ImbTreeEntropy and ImbTreeAUC to handle imbalanced data problems. ImbTreeEntropy functionality includes application of a generalized entropy functions, such as Rényi, Tsallis, Sharma–Mittal, Sharma–Taneja and Kapur, to measure impurity of a node. ImbTreeAUC provides non-standard measures to choose an optimal split point for an attribute (as well the optimal attribute for splitting) by employing local, semi-global and global AUC (Area Under the ROC curve) measures. Both packages are applicable for binary and multiclass problems and they support cost-sensitive learning, by defining a misclassification cost matrix, and weighted-sensitive learning. The packages accept all types of attributes, including continuous, ordered and nominal, where the latter type is simplified for multiclass problems to reduce the computational overheads. Both applications enable optimization of the thresholds where posterior probabilities determine final class labels in a way that misclassification costs are minimized. Model overfitting can be managed either during the growing phase or at the end using post-pruning. The packages are mainly implemented in R, however some computationally demanding functions are written in plain C++. In order to speed up learning time, parallel processing is supported as well.

Download Full-text

Evaluating Unit Testing Practices in R Packages

2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE) ◽

10.1109/icse43902.2021.00136 ◽

2021 ◽

Author(s):

Melina Vidoni

Keyword(s):

Unit Testing ◽

R Packages ◽

Testing Practices

Download Full-text

Evaluation of Cell Type Annotation R Packages on Single-cell RNA-seq Data

Genomics Proteomics & Bioinformatics ◽

10.1016/j.gpb.2020.07.004 ◽

2020 ◽

Author(s):

Qianhui Huang ◽

Yu Liu ◽

Yuheng Du ◽

Lana X. Garmire

Keyword(s):

Single Cell ◽

Rna Seq ◽

Cell Type ◽

R Packages

Download Full-text

37 Sibs: an R toolkit for computation of relatedness measures using large pedigrees

Journal of Animal Science ◽

10.1093/jas/skaa054.074 ◽

2020 ◽

Vol 98 (Supplement_3) ◽

pp. 41-42

Author(s):

B Victor Oribamise ◽

Lauren L Hulsman Hanna

Keyword(s):

Family Structure ◽

Genetic Relatedness ◽

Current User ◽

Memory Allocation ◽

Relationship Matrix ◽

Additive Effects ◽

Numerator Relationship Matrix ◽

Significant Difference ◽

R Packages ◽

User Friendly

Abstract Without appropriate relationships present in a given population, identifying dominance effects in the expression of desirable traits is challenging. Including non-additive effects is desirable to increase accuracy of breeding values. There is no current user-friendly tool package to investigate genetic relatedness in large pedigrees. The objective was to develop and implement efficient algorithms in R to calculate and visualize measures of relatedness (e.g., sibling and family structure, numerator relationship matrices) for large pedigrees. Comparisons to current R packages (Table 1) are also made. Functions to assign animals to families, summary of sibling counts, calculation of numerator relationship matrix (NRM), and NRM summary by groups were created, providing a comprehensive toolkit (Sibs package) not found in other packages. Pedigrees of various sizes (n = 20, 4,035, 120,000 and 132,833) were used to test functionality and compare to current packages. All runs were conducted on a Windows-based computer with an 8 GB RAM, 2.5 GHz Intel Core i7 processor. Other packages had no significant difference in runtime when constructing the NRM for small pedigrees (n = 20) compared to Sibs (0 to 0.05 s difference). However, packages such as ggroups, AGHmatrix, and pedigree were 10 to 15 min slower than Sibs for a 4,035-individual pedigree. Packages nadiv and pedigreemm competed with Sibs (0.30 to 60 s slower than Sibs), but no package besides Sibs was able to complete the 132,833-individual pedigree due to memory allocation issues in R. The nadiv package was closest with a pedigree of 120,000 individuals, but took 37 min to complete (13 min slower than Sibs). This package also provides easier input of pedigrees and is more encompassing of such relatedness measures than other packages (Table 1). Furthermore, it can provide an option to utilize other packages such as GCA for connectedness calculations when using large pedigrees.

Download Full-text

Co-mention network of R packages: Scientific impact and clustering structure

Journal of Informetrics ◽

10.1016/j.joi.2017.12.001 ◽

2018 ◽

Vol 12 (1) ◽

pp. 87-100 ◽

Cited By ~ 6

Author(s):

Kai Li ◽

Erjia Yan

Keyword(s):

Scientific Impact ◽

R Packages

Download Full-text

PROsetta: An R Package for Linking Patient-Reported Outcome Measures

Applied Psychological Measurement ◽

10.1177/01466216211013106 ◽

2021 ◽

pp. 014662162110131

Author(s):

S. W. Choi ◽

S. Lim ◽

B. D. Schalet ◽

A. J. Kaat ◽

D. Cella

Keyword(s):

Patient Reported Outcomes ◽

R Package ◽

Patient Reported Outcome Measures ◽

Patient Reported Outcome ◽

Measurement Information ◽

Short History ◽

Rosetta Stone ◽

Patient Reported ◽

R Packages ◽

Measurement Information System

A common problem when using a variety of patient-reported outcomes (PROs) for diverse populations and subgroups is establishing a harmonized scale for the incommensurate outcomes. The lack of comparability in metrics (e.g., raw summed scores vs. scaled scores) among different PROs poses practical challenges in studies comparing effects across studies and samples. Linking has long been used for practical benefit in educational testing. Applying various linking techniques to PRO data has a relatively short history; however, in recent years, there has been a surge of published studies on linking PROs and other health outcomes, owing in part to concerted efforts such as the Patient-Reported Outcomes Measurement Information System (PROMIS®) project and the PRO Rosetta Stone (PROsetta Stone®) project ( www.prosettastone.org ). Many R packages have been developed for linking in educational settings; however, they are not tailored for linking PROs where harmonization of data across clinical studies or settings serves as the main objective. We created the PROsetta package to fill this gap and disseminate a protocol that has been established as a standard practice for linking PROs.

Download Full-text

Development of R packages: ‘NonCompart’ and ‘ncar’ for noncompartmental analysis (NCA)

Translational and Clinical Pharmacology ◽

10.12793/tcp.2018.26.1.10 ◽

2018 ◽

Vol 26 (1) ◽

pp. 10 ◽

Cited By ~ 1

Author(s):

Hyungsub Kim ◽

Sungpil Han ◽

Yong-Soon Cho ◽

Seok-Kyu Yoon ◽

Kyun-Seop Bae

Keyword(s):

Noncompartmental Analysis ◽

R Packages

Download Full-text

Meffil: efficient normalisation and analysis of very large DNA methylation samples

10.1101/125963 ◽

2017 ◽

Cited By ~ 17

Author(s):

Josine Min ◽

Gibran Hemani ◽

George Davey Smith ◽

Caroline Relton ◽

Matthew Suderman

Keyword(s):

Dna Methylation ◽

Association Studies ◽

R Package ◽

Individual Level ◽

Technological Advances ◽

Level Data ◽

Fixed And Random Effects ◽

R Packages ◽

Meta Analyses ◽

Dramatic Growth

AbstractBackgroundTechnological advances in high throughput DNA methylation microarrays have allowed dramatic growth of a new branch of epigenetic epidemiology. DNA methylation datasets are growing ever larger in terms of the number of samples profiled, the extent of genome coverage, and the number of studies being meta-analysed. Novel computational solutions are required to efficiently handle these data.MethodsWe have developed meffil, an R package designed to quality control, normalize and perform epigenome-wide association studies (EWAS) efficiently on large samples of Illumina Infinium HumanMethylation450 and MethylationEPIC BeadChip microarrays. We tested meffil by applying it to 6000 450k microarrays generated from blood collected for two different datasets, Accessible Resource for Integrative Epigenomic Studies (ARIES) and The Genetics of Overweight Young Adults (GOYA) study.ResultsA complete reimplementation of functional normalization minimizes computational memory requirements to 5% of that required by other R packages, without increasing running time. Incorporating fixed and random effects alongside functional normalization, and automated estimation of functional normalisation parameters reduces technical variation in DNA methylation levels, thus reducing false positive associations and improving power. We also demonstrate that the ability to normalize datasets distributed across physically different locations without sharing any biologically-based individual-level data may reduce heterogeneity in meta-analyses of epigenome-wide association studies. However, we show that when batch is perfectly confounded with cases and controls functional normalization is unable to prevent spurious associations.Conclusionsmeffil is available online (https://github.com/perishky/meffil/) along with tutorials covering typical use cases.

Download Full-text