Data imputation algorithms for mixed variable types in large scale educational assessment: a comparison of random forest, multivariate imputation using chained equations, and MICE with recursive partitioning

Abstract Mass spectrometry is a high-throughput analytical technique that enables large-scale metabolomic analyses. It yields a high-dimensional matrix (samples × metabolites) of quantified data that often contains missing cells as well as outliers arising from several sources, both technical and biological. Although several missing data imputation techniques are described in the literature, conventional techniques address only the missing value problem and do not handle outliers, so outliers in the dataset decrease the accuracy of the imputation. We developed a new kernel weight function-based missing data imputation technique that addresses both missing values and outliers. We evaluated the proposed method against conventional and recently developed imputation techniques using both artificially generated and experimentally measured data, in the absence and presence of different rates of outliers. Results on both artificial and real metabolomics data indicate that the proposed kernel weight-based technique outperforms the existing alternatives. For user convenience, an R package implementing the proposed technique is available at https://github.com/NishithPaul/tWLSA.

Random Forest Missing Data Imputation Methods: Implications for Predicting At-Risk Students

Advances in Intelligent Systems and Computing - Intelligent Systems Design and Applications ◽

10.1007/978-3-030-49342-4_29 ◽

2020 ◽

pp. 298-308

Author(s):

Bevan I. Smith ◽

Charles Chimedza ◽

Jacoba H. Bührmann

Keyword(s):

At Risk ◽

Missing Data ◽

Random Forest ◽

At Risk Students ◽

Data Imputation ◽

Missing Data Imputation ◽

Imputation Methods

A MapReduce-Based Parallel Random Forest Approach for Predicting Large-Scale Protein-Protein Interactions

Intelligent Computing Methodologies - Lecture Notes in Computer Science ◽

10.1007/978-3-030-60796-8_34 ◽

2020 ◽

pp. 400-407

Author(s):

Bo-Ya Ji ◽

Zhu-Hong You ◽

Long Yang ◽

Ji-Ren Zhou ◽

Peng-Wei Hu

Keyword(s):

Random Forest ◽

Protein Interactions ◽

Large Scale ◽

Protein Protein Interactions

Predicting ligand binding affinity: A comparative study on the use of docking vs. Bayesian categorization and random forest recursive partitioning

Medicinal Chemistry ◽

10.4172/2161-0444.s1.011 ◽

2014 ◽

Vol 04 (12) ◽

Author(s):

David C Kombo

Keyword(s):

Random Forest ◽

Ligand Binding ◽

Comparative Study ◽

Binding Affinity ◽

Recursive Partitioning ◽

Bayesian Categorization

Downscaling of GRACE-Derived Groundwater Storage Based on the Random Forest Model

Remote Sensing ◽

10.3390/rs11242979 ◽

2019 ◽

Vol 11 (24) ◽

pp. 2979 ◽

Cited By ~ 3

Author(s):

Li Chen ◽

Qisheng He ◽

Kun Liu ◽

Jinyang Li ◽

Chenlin Jing

Keyword(s):

Random Forest ◽

Spatial Resolution ◽

Large Scale ◽

Snow Water Equivalent ◽

Water Storage ◽

Research Area ◽

Groundwater Storage ◽

Coarse Spatial Resolution ◽

Long Time ◽

Local Water

Groundwater is an important part of water storage and one of the main sources of water for agricultural irrigation, urban living, and industrial use. The recent launch of the Gravity Recovery and Climate Experiment (GRACE) satellite mission has provided a new way to study large-scale water storage. However, the application of GRACE to local water resources has been greatly limited by its coarse spatial resolution and low temporal resolution. It is therefore of great significance to improve the spatial resolution of groundwater storage estimates for regional water management. Using the random forest (RF) method, this study combined six hydrological variables (precipitation, evapotranspiration, runoff, soil moisture, snow water equivalent, and canopy water) to downscale total water storage and groundwater storage from a resolution of 1° (110 km) to 0.25° (approximately 25 km). Over the long time series, the predictions of the RF model agree well with the reference data both across the whole research area and in the observation wells area. Spatially, changes in water storage are captured in greater detail after downscaling. The verification results show that the correlation between the downscaled results and the observation wells is 0.78 on the monthly scale and 0.94 on the annual scale, and both are significant at the 0.01 level. The RF downscaling model therefore has great potential for predicting groundwater storage.

Modeling Posidonia oceanica shoot density and rhizome primary production

Scientific Reports ◽

10.1038/s41598-020-73722-9 ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Elena Catucci ◽

Michele Scardi

Keyword(s):

Random Forest ◽

Primary Production ◽

Large Scale ◽

Posidonia Oceanica ◽

Shoot Density ◽

Large Scale Assessment ◽

Goods And Services ◽

Predictive Variables ◽

Management Perspective ◽

Predicted Values

Abstract Posidonia oceanica meadows rank among the most important and most productive ecosystems in the Mediterranean basin, due to their ecological role and to the goods and services they provide. Estimates of crucial ecological processes such as meadow productivity could play a major role in environmental management and in the assessment of P. oceanica ecosystem services. In this study, a Machine Learning approach, Random Forest, was used to model P. oceanica shoot density and rhizome primary production, using as predictive variables only environmental factors retrieved from indirect measurements, such as maps. Our predictive models showed a good level of accuracy in modeling both shoot density and rhizome productivity (R² = 0.761 and R² = 0.736, respectively). Furthermore, as shoot density is an essential parameter in the estimation of P. oceanica productivity, we proposed a cascaded approach that estimates productivity using predicted values of shoot density rather than observed measurements. In spite of the complexity of the problem, the cascaded Random Forest performed quite well (R² = 0.637). While direct measurements will always play a fundamental role, our estimates could support large-scale assessment of the expected condition of P. oceanica meadows, providing valuable information about the way this crucial ecosystem works.
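The accuracy figures in this abstract are coefficients of determination. The following is a minimal sketch of how such a value can be computed, assuming the common definition 1 − SS_res / SS_tot; the abstract does not state which definition the authors used, and the function name and sample numbers are hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination, assuming R^2 = 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: squared gaps between observations and predictions.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: squared gaps between observations and their mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted values (not data from the study).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
score = r_squared(observed, predicted)
print(round(score, 3))
```

Under this definition a value of 1 means the predictions match the observations exactly, and a value of 0 means they do no better than predicting the observed mean everywhere.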

Large-Scale Malicious Software Classification with Fuzzified Features and Boosted Fuzzy Random Forest

IEEE Transactions on Fuzzy Systems ◽

10.1109/tfuzz.2020.3016023 ◽

2020 ◽

pp. 1-1

Author(s):

Fangqi Li ◽

Shilin Wang ◽

Alan Wee-Chung Liew ◽

Weiping Ding ◽

Gong Shen Liu

Keyword(s):

Random Forest ◽

Large Scale ◽

Malicious Software ◽

Software Classification ◽

Fuzzy Random

3145 An Evaluation of Machine Learning and Traditional Statistical Methods for Discovery in Large-Scale Translational Data

Journal of Clinical and Translational Science ◽

10.1017/cts.2019.8 ◽

2019 ◽

Vol 3 (s1) ◽

pp. 2-2

Author(s):

Megan C Hollister ◽

Jeffrey D. Blume

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Random Forest ◽

Gene Expression Data ◽

Large Scale ◽

Second Generation ◽

A Priori ◽

Expression Data ◽

P Values ◽

Machine Learning Methods

OBJECTIVES/SPECIFIC AIMS: To examine and compare the claims in Bzdok, Altman, and Krzywinski under a broader set of conditions by using unbiased methods of comparison. To explore how to accurately use various machine learning and traditional statistical methods in large-scale translational research by estimating their accuracy statistics. We then identify the methods with the best performance characteristics.

METHODS/STUDY POPULATION: We conducted a simulation study with a microarray of gene expression data. We maintained the original structure proposed by Bzdok, Altman, and Krzywinski. The structure for gene expression data includes a total of 40 genes from 20 people, of whom 10 are phenotype positive and 10 are phenotype negative. To create a detectable statistical difference, 25% of the genes were set to be dysregulated across phenotype. This dysregulation forced the positive and negative phenotypes to have different mean population expressions. Additional variance was included to simulate genetic variation across the population. We also allowed for within-person correlation across genes, which was not done in the original simulations. The following methods were used to determine the number of dysregulated genes in each simulated data set: unadjusted p-values, Benjamini-Hochberg adjusted p-values, Bonferroni adjusted p-values, random forest importance levels, neural net prediction weights, and second-generation p-values.

RESULTS/ANTICIPATED RESULTS: Results vary depending on whether a pre-specified significance level is used or the top 10 ranked values are taken. When all methods are given the same prior information of 10 dysregulated genes, the Benjamini-Hochberg adjusted p-values and the second-generation p-values generally outperform all other methods. We were not able to reproduce or validate the finding that random forest importance levels via a machine learning algorithm outperform classical methods. Almost uniformly, the machine learning methods did not yield improved accuracy statistics, and they depend heavily on the a priori chosen number of dysregulated genes.

DISCUSSION/SIGNIFICANCE OF IMPACT: In this context, machine learning methods do not outperform standard methods. Because of this and their additional complexity, machine learning approaches would not be preferable. Of all the approaches, the second-generation p-value appears to offer significant benefit for the cost of defining a priori a region of trivially null effect sizes. The choice of an analysis method for large-scale translational data is critical to the success of any statistical investigation, and our simulations clearly highlight the various tradeoffs among the available methods.
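Two of the classical methods compared in this abstract, the Bonferroni and Benjamini-Hochberg adjustments, can be sketched in a few lines. This is an illustrative sketch only, not the authors' simulation code; the function names, the p-values, and the 0.05 threshold are assumptions chosen for the example.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping the adjusted values monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for five genes (not data from the study).
p = [0.001, 0.008, 0.012, 0.041, 0.60]
alpha = 0.05  # assumed significance threshold
n_bonf = sum(q <= alpha for q in bonferroni(p))
n_bh = sum(q <= alpha for q in benjamini_hochberg(p))
print(n_bonf, n_bh)  # prints "2 3"
```

On these made-up values Bonferroni flags two genes and Benjamini-Hochberg flags three, which shows the usual pattern: controlling the false discovery rate is less conservative than controlling the family-wise error rate.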

Tracking the Land Use/Land Cover Change in an Area with Underground Mining and Reforestation via Continuous Landsat Classification

Remote Sensing ◽

10.3390/rs11141719 ◽

2019 ◽

Vol 11 (14) ◽

pp. 1719 ◽

Cited By ~ 7

Author(s):

Jiaxin Mi ◽

Yongjun Yang ◽

Shaoliang Zhang ◽

Shi An ◽

Huping Hou ◽

...

Keyword(s):

Land Use ◽

Random Forest ◽

Land Cover ◽

Large Scale ◽

Underground Mining ◽

Mining Area ◽

Random Forest Classifier ◽

Land Use Land Cover ◽

Lulc Change ◽

Mining Areas

Understanding changes in land use/land cover (LULC) is important for environmental assessment and land management. However, tracking the dynamics of LULC has proved difficult, especially in large-scale underground mining areas with extensive LULC heterogeneity and a history of multiple disturbances, and further methodological research in this field is still needed. In this study, we tracked the LULC change between 1987 and 2017 in the Nanjiao mining area, Shanxi Province, China, where years of underground mining and reforestation projects have occurred, using a random forest classifier and continuous Landsat imagery. We applied a Savitzky–Golay filter and a normalized difference vegetation index (NDVI)-based approach to detect the temporal and spatial change, respectively. The accuracy assessment shows that the random forest classifier performs well in this heterogeneous area, with an accuracy ranging from 81.92% to 86.6%, which is higher than that of the support vector machine (SVM), neural network (NN), and maximum likelihood (ML) algorithms. The LULC classification results reveal that cultivated forest in the mining area increased significantly after 2004, while the spatial extent of natural forest, buildings, and farmland decreased significantly after 2007. Areas where vegetation was significantly reduced resulted mainly from the transformation of natural forest and shrubs into grasslands and bare lands, respectively, whereas areas with an obvious increase in NDVI resulted mainly from the conversion of grasslands and buildings into cultivated forest, especially where villages were abandoned after mining subsidence. A partial correlation analysis demonstrated that the extent of LULC change was significantly related to coal production and reforestation, which indicates the effects of underground mining and reforestation projects on LULC changes. This study suggests that continuous Landsat classification with a random forest classifier could be effective in monitoring the long-term dynamics of LULC change and could provide crucial information and data for understanding the driving forces of LULC change, for environmental impact assessment, and for ecological protection planning in large-scale mining areas.
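The two signal-processing steps named in this abstract, NDVI and Savitzky-Golay filtering, can be illustrated with a small sketch. It assumes the standard NDVI definition, (NIR − Red) / (NIR + Red), and a 5-point quadratic Savitzky-Golay window; the abstract does not give the window length or polynomial order the authors used, and the reflectance values below are hypothetical.

```python
def ndvi(nir, red):
    """NDVI = (NIR - Red) / (NIR + Red); returns 0.0 when both bands are zero."""
    total = nir + red
    return (nir - red) / total if total != 0 else 0.0


def savgol_5pt(series):
    """5-point quadratic Savitzky-Golay smoothing.

    The two points at each end are left unchanged for simplicity.
    """
    coeffs = (-3, 12, 17, 12, -3)  # standard smoothing weights, normalised by 35
    out = list(series)
    for i in range(2, len(series) - 2):
        window = series[i - 2:i + 3]
        out[i] = sum(c * v for c, v in zip(coeffs, window)) / 35
    return out


# Hypothetical band reflectances for one pixel over seven dates (not study data).
nir_series = [0.42, 0.45, 0.30, 0.47, 0.50, 0.52, 0.55]
red_series = [0.10, 0.09, 0.15, 0.08, 0.08, 0.07, 0.07]
ndvi_series = [ndvi(n, r) for n, r in zip(nir_series, red_series)]
smoothed = savgol_5pt(ndvi_series)
print([round(v, 3) for v in smoothed])
```

The filter fits a local quadratic to each window, so it damps isolated dips such as the third value above while leaving a genuinely quadratic trend unchanged.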
