Nonparametric Variable Selection Using Machine Learning Algorithms in High Dimensional (Large P, Small N) Biomedical Applications

Forest biomass is a major store of carbon and plays a crucial role in the regional and global carbon cycle. Accurate forest biomass assessment is important for monitoring and mapping the status of and changes in forests. However, while remote sensing-based forest biomass estimation in general is well developed and extensively used, improving the accuracy of biomass estimation remains challenging. In this paper, we used China’s National Forest Continuous Inventory data and Landsat 8 Operational Land Imager data in combination with three algorithms, either the linear regression (LR), random forest (RF), or extreme gradient boosting (XGBoost), to establish biomass estimation models based on forest type. In the modeling process, two methods of variable selection, e.g., stepwise regression and variable importance-base method, were used to select optimal variable subsets for LR and machine learning algorithms (e.g., RF and XGBoost), respectively. Comfortingly, the accuracy of models was significantly improved, and thus the following conclusions were drawn: (1) Variable selection is very important for improving the performance of models, especially for machine learning algorithms, and the influence of variable selection on XGBoost is significantly greater than that of RF. (2) Machine learning algorithms have advantages in aboveground biomass (AGB) estimation, and the XGBoost and RF models significantly improved the estimation accuracy compared with the LR models. Despite that the problems of overestimation and underestimation were not fully eliminated, the XGBoost algorithm worked well and reduced these problems to a certain extent. (3) The approach of AGB modeling based on forest type is a very advantageous method for improving the performance at the lower and higher values of AGB. Some conclusions in this paper were probably different as the study area changed. The methods used in this paper provide an optional and useful approach for improving the accuracy of AGB estimation based on remote sensing data, and the estimation of AGB was a reference basis for monitoring the forest ecosystem of the study area.

Download Full-text

Deep learning for predicting disease status using genomic data

10.7287/peerj.preprints.27123 ◽

2018 ◽

Cited By ~ 1

Author(s):

Qianfan Wu ◽

Adel Boueiz ◽

Alican Bozkurt ◽

Arya Masoomi ◽

Allan Wang ◽

...

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Rapid Development ◽

Learning Algorithms ◽

Genomic Data ◽

Disease Status ◽

Machine Learning Algorithms ◽

High Dimensional ◽

Learning Approach ◽

Low Dimensional

Predicting disease status for a complex human disease using genomic data is an important, yet challenging, step in personalized medicine. Among many challenges, the so-called curse of dimensionality problem results in unsatisfied performances of many state-of-art machine learning algorithms. A major recent advance in machine learning is the rapid development of deep learning algorithms that can efficiently extract meaningful features from high-dimensional and complex datasets through a stacked and hierarchical learning process. Deep learning has shown breakthrough performance in several areas including image recognition, natural language processing, and speech recognition. However, the performance of deep learning in predicting disease status using genomic datasets is still not well studied. In this article, we performed a review on the four relevant articles that we found through our thorough literature review. All four articles used auto-encoders to project high-dimensional genomic data to a low dimensional space and then applied the state-of-the-art machine learning algorithms to predict disease status based on the low-dimensional representations. This deep learning approach outperformed existing prediction approaches, such as prediction based on probe-wise screening and prediction based on principal component analysis. The limitations of the current deep learning approach and possible improvements were also discussed.

Download Full-text

Soil Erosion Assessment of Alpine Grassland in the Source Park of the Yellow River on the Qinghai-Tibetan Plateau, China

Frontiers in Ecology and Evolution ◽

10.3389/fevo.2021.771439 ◽

2022 ◽

Vol 9 ◽

Author(s):

Huilong Lin ◽

Yuting Zhao

Keyword(s):

Machine Learning ◽

Soil Erosion ◽

Tibetan Plateau ◽

Variable Selection ◽

Yellow River ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

The Yellow River ◽

Optimal Model ◽

Erosion Loss

The source park of the Yellow River (SPYR), as a vital ecological shelter on the Qinghai-Tibetan Plateau, is suffering different degrees of degradation and desertification, resulting in soil erosion in recent decades. Therefore, studying the mechanism, influencing factors and current situation of soil erosion in the alpine grassland ecosystems of the SPYR are significant for protecting the ecological and productive functions. Based on the 137Cs element tracing technique and machine learning algorithms, five strategic variable selection algorithms based on machine learning algorithms are used to identify the minimal optimal set and analyze the main factors that influence soil erosion in the SPYR. The optimal model for estimating soil erosion in the SPYR is obtained by comparisons model outputs between the RUSLE and machine learning algorithms combined with variable selection models. We identify the spatial distribution pattern of soil erosion in the study area by the optimal model. The results indicated that: (1) A comprehensive set of variables is more objective than the RUSLE model. In terms of verification accuracy, the simulated annealing -Cubist model (R = 0.67, RMSD = 1,368 t km–2⋅a–1) simulation results represents the best while the RUSLE model (R = 0.49, RMSD = 1,769 t⋅km–2⋅a–1) goes on the worst. (2) The soil erosion is more severe in the north than the southeast of the SPYR. The average erosion modulus is 6,460.95 t⋅km–2⋅a–1 and roughly 99% of the survey region has an intensive erosion modulus (5,000–8,000 t⋅km–2⋅a–1). (3) Total erosion loss is relatively 8.45⋅108 t⋅a–1 in the SPYR, which is commonly 12.64 times greater than the allowable soil erosion loss. The economic monetization of SOC loss caused by soil erosion in the entire research area was almost $47.90 billion in 2014. These results will help provide scientific evidences not only for farmers and herdsmen but also for environmental science managers and administrators. In addition, a new ecological policy recommendation was proposed to balance grassland protection and animal husbandry economic production based on the value of soil erosion reclassification.

Download Full-text

Analysis of dual-stage filtration and validation of high-dimensional real process data for creation of machine learning algorithms

10.1109/iceccme52200.2021.9591094 ◽

2021 ◽

Author(s):

Dusan Strusnik ◽

Jurij Avsec

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

High Dimensional ◽

Process Data ◽

Real Process ◽

Dual Stage

Download Full-text

Teaching a Machine to Feel Postoperative Pain: Combining High-Dimensional Clinical Data with Machine Learning Algorithms to Forecast Acute Postoperative Pain

Pain Medicine ◽

10.1111/pme.12713 ◽

2015 ◽

Vol 16 (7) ◽

pp. 1386-1401 ◽

Cited By ~ 22

Author(s):

Patrick J. Tighe ◽

Christopher A. Harle ◽

Robert W. Hurley ◽

Haldun Aytug ◽

Andre P. Boezaart ◽

...

Keyword(s):

Machine Learning ◽

Postoperative Pain ◽

Clinical Data ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

High Dimensional ◽

Acute Postoperative Pain

Download Full-text

Survey on Clustering High-Dimensional data using Hubness

International Journal of Scientific Research in Computer Science Engineering and Information Technology ◽

10.32628/cseit195671 ◽

2020 ◽

pp. 01-07

Author(s):

Miss. Archana Chaudahri ◽

Mr. Nilesh Vani

Keyword(s):

Machine Learning ◽

Nearest Neighbor ◽

Learning Algorithms ◽

High Dimensional Data ◽

Algorithm Design ◽

Machine Learning Algorithms ◽

High Dimensional ◽

K Nearest Neighbor ◽

Weighted Voting ◽

Conventional Machine

Most data of interest today in data-mining applications is complex and is usually represented by many different features. Such high-dimensional data is by its very nature often quite difficult to handle by conventional machine-learning algorithms. This is considered to be an aspect of the well known curse of dimensionality. Consequently, high-dimensional data needs to be processed with care, which is why the design of machine-learning algorithms needs to take these factors into account. Furthermore, it was observed that some of the arising high-dimensional properties could in fact be exploited in improving overall algorithm design. One such phenomenon, related to nearest-neighbor learning methods, is known as hubness and refers to the emergence of very influential nodes (hubs) in k-nearest neighbor graphs. A crisp weighted voting scheme for the k-nearest neighbor classifier has recently been proposed which exploits this notion.

Download Full-text

High-dimensional sample selection models: Machine-learning algorithms in the Heckman two-step

10.14264/3105ff5 ◽

2018 ◽

Author(s):

Tharani Ransimala Weerasooriya

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Sample Selection ◽

Machine Learning Algorithms ◽

High Dimensional ◽

Selection Models

Download Full-text

Localizing category-related information in speech with multi-scale analyses

PLoS ONE ◽

10.1371/journal.pone.0258178 ◽

2021 ◽

Vol 16 (10) ◽

pp. e0258178

Author(s):

Sam Tilsen ◽

Seung-Eun Kim ◽

Claire Wang

Keyword(s):

Machine Learning ◽

Vocal Tract ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Acoustic Energy ◽

High Dimensional ◽

Linear Discriminant ◽

Multi Scale ◽

Related Information ◽

Unseen Data

Measurements of the physical outputs of speech—vocal tract geometry and acoustic energy—are high-dimensional, but linguistic theories posit a low-dimensional set of categories such as phonemes and phrase types. How can it be determined when and where in high-dimensional articulatory and acoustic signals there is information related to theoretical categories? For a variety of reasons, it is problematic to directly quantify mutual information between hypothesized categories and signals. To address this issue, a multi-scale analysis method is proposed for localizing category-related information in an ensemble of speech signals using machine learning algorithms. By analyzing how classification accuracy on unseen data varies as the temporal extent of training input is systematically restricted, inferences can be drawn regarding the temporal distribution of category-related information. The method can also be used to investigate redundancy between subsets of signal dimensions. Two types of theoretical categories are examined in this paper: phonemic/gestural categories and syntactic relative clause categories. Moreover, two different machine learning algorithms were examined: linear discriminant analysis and neural networks with long short-term memory units. Both algorithms detected category-related information earlier and later in signals than would be expected given standard theoretical assumptions about when linguistic categories should influence speech. The neural network algorithm was able to identify category-related information to a greater extent than the discriminant analyses.

Download Full-text

Deep learning for predicting disease status using genomic data

10.7287/peerj.preprints.27123v1 ◽

2018 ◽

Author(s):

Qianfan Wu ◽

Adel Boueiz ◽

Alican Bozkurt ◽

Arya Masoomi ◽

Allan Wang ◽

...

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Rapid Development ◽

Learning Algorithms ◽

Genomic Data ◽

Disease Status ◽

Machine Learning Algorithms ◽

High Dimensional ◽

Learning Approach ◽

Low Dimensional

Predicting disease status for a complex human disease using genomic data is an important, yet challenging, step in personalized medicine. Among many challenges, the so-called curse of dimensionality problem results in unsatisfied performances of many state-of-art machine learning algorithms. A major recent advance in machine learning is the rapid development of deep learning algorithms that can efficiently extract meaningful features from high-dimensional and complex datasets through a stacked and hierarchical learning process. Deep learning has shown breakthrough performance in several areas including image recognition, natural language processing, and speech recognition. However, the performance of deep learning in predicting disease status using genomic datasets is still not well studied. In this article, we performed a review on the four relevant articles that we found through our thorough literature review. All four articles used auto-encoders to project high-dimensional genomic data to a low dimensional space and then applied the state-of-the-art machine learning algorithms to predict disease status based on the low-dimensional representations. This deep learning approach outperformed existing prediction approaches, such as prediction based on probe-wise screening and prediction based on principal component analysis. The limitations of the current deep learning approach and possible improvements were also discussed.

Download Full-text

High-dimensional hepatopath data analysis by machine learning for predicting HBV-related fibrosis

Scientific Reports ◽

10.1038/s41598-021-84556-4 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Xiangke Pu ◽

Danni Deng ◽

Chaoyi Chu ◽

Tianle Zhou ◽

Jianhong Liu

Keyword(s):

Machine Learning ◽

Liver Fibrosis ◽

Clinical Data ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Health Concern ◽

High Dimensional ◽

Non Invasive ◽

Predictive Algorithms ◽

Invasive Method

AbstractChronic HBV infection, the main cause of liver cirrhosis and hepatocellular carcinoma, has become a global health concern. Machine learning algorithms are particularly adept at analyzing medical phenomenon by capturing complex and nonlinear relationships in clinical data. Our study proposed a predictive model on the basis of 55 routine laboratory and clinical parameters by machine learning algorithms as a novel non-invasive method for liver fibrosis diagnosis. The model was further evaluated on the accuracy and rationality and proved to be highly accurate and efficient for the prediction of HBV-related fibrosis. In conclusion, we suggested a potential combination of high-dimensional clinical data and machine learning predictive algorithms for the liver fibrosis diagnosis.

Download Full-text