Study on a prediction of P2P network loan default based on the machine learning LightGBM and XGboost algorithms according to different high dimensional data cleaning

Machine learning has been proven to be effective in various application areas, such as object and speech recognition on mobile systems. Since a critical key to machine learning success is the availability of large training data, many datasets are being disclosed and published online. From a data consumer or manager point of view, measuring data quality is an important first step in the learning process. We need to determine which datasets to use, update, and maintain. However, not many practical ways to measure data quality are available today, especially when it comes to large-scale high-dimensional data, such as images and videos. This paper proposes two data quality measures that can compute class separability and in-class variability, the two important aspects of data quality, for a given dataset. Classical data quality measures tend to focus only on class separability; however, we suggest that in-class variability is another important data quality factor. We provide efficient algorithms to compute our quality measures based on random projections and bootstrapping with statistical benefits on large-scale high-dimensional data. In experiments, we show that our measures are compatible with classical measures on small-scale data and can be computed much more efficiently on large-scale high-dimensional datasets.
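As a rough illustration of how random projections and bootstrapping can be combined to estimate class separability, the sketch below projects the data onto random directions, resamples with replacement, and averages a one-dimensional Fisher ratio. The abstract does not define the paper's measures, so the Fisher ratio is a classical stand-in and not the authors' method. The names `separability`, `fisher_ratio`, and `blob` and the toy data are hypothetical, and in-class variability is not covered.

```python
import random
import statistics

def random_unit_vector(dim, rng):
    """Draw a random direction on the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = sum(a * a for a in v) ** 0.5
    return [a / norm for a in v]

def fisher_ratio(values, labels):
    """Between-class over within-class sum of squares for 1-D values."""
    overall = statistics.fmean(values)
    between = within = 0.0
    for c in set(labels):
        group = [v for v, l in zip(values, labels) if l == c]
        mean_c = statistics.fmean(group)
        between += len(group) * (mean_c - overall) ** 2
        within += sum((v - mean_c) ** 2 for v in group)
    return between / within if within > 0 else float("inf")

def separability(X, y, n_projections=50, seed=0):
    """Average the Fisher ratio over random projections of bootstrap samples."""
    rng = random.Random(seed)
    n, dim = len(X), len(X[0])
    scores = []
    for _ in range(n_projections):
        direction = random_unit_vector(dim, rng)
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap resample
        values = [sum(a * b for a, b in zip(X[i], direction)) for i in idx]
        scores.append(fisher_ratio(values, [y[i] for i in idx]))
    return statistics.fmean(scores)

# Hypothetical toy data: two Gaussian blobs in three dimensions.
data_rng = random.Random(1)

def blob(center, n=60, dim=3):
    return [[data_rng.gauss(center, 1.0) for _ in range(dim)] for _ in range(n)]

labels = [0] * 60 + [1] * 60
well_separated = blob(0.0) + blob(10.0)
overlapping = blob(0.0) + blob(0.0)
score_separated = separability(well_separated, labels)
score_overlapping = separability(overlapping, labels)
```

Each projection costs one pass over the sample, so the work grows linearly with the number of records and dimensions. This is the kind of saving the abstract attributes to random projections.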


Scalable hierarchical clustering by composition rank vector encoding and tree structure

10.1101/2020.04.12.038026 ◽

2020 ◽

Author(s):

Xiao Lai ◽

Pu Tian

Keyword(s):

Machine Learning ◽

Hierarchical Clustering ◽

Clustering Algorithm ◽

High Dimensional Data ◽

Machine Learning Algorithms ◽

Tree Structure ◽

Supervised Machine Learning ◽

High Dimensional ◽

Rank Vector ◽

Nonlinear Correlations

Abstract Supervised machine learning, especially deep learning based on a wide variety of neural network architectures, has contributed tremendously to fields such as marketing, computer vision and natural language processing. However, development of unsupervised machine learning algorithms has been a bottleneck of artificial intelligence. Clustering is a fundamental unsupervised task in many different subjects. Unfortunately, no present algorithm is satisfactory for clustering of high dimensional data with strong nonlinear correlations. In this work, we propose a simple and highly efficient hierarchical clustering algorithm based on encoding by composition rank vectors and tree structure, and demonstrate its utility with clustering of protein structural domains. No record comparison, which is an expensive and essential step common to all present clustering algorithms, is involved. Consequently, it achieves hierarchical clustering with linear time and space complexity and is thus applicable to arbitrarily large datasets. The key factor in this algorithm is the definition of composition, which depends on the physical nature of the target data and therefore needs to be constructed case by case. Nonetheless, the algorithm is general and applicable to any high dimensional data with strong nonlinear correlations. We hope this algorithm will inspire a rich research field of encoding-based clustering well beyond composition rank vector trees.
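One way to read the idea of encoding records as rank vectors and grouping them in a tree is sketched below. This is an interpretation for illustration only, not the authors' algorithm. Here "composition" is assumed to be the count of each symbol in a string, and `composition_rank_vector`, `build_tree`, and `clusters_at_depth` are hypothetical names. Each record is inserted into a prefix tree along its rank vector, so no pair of records is ever compared.

```python
from collections import Counter

def composition_rank_vector(record, alphabet):
    """Order the alphabet's symbols by how often they occur in the record."""
    counts = Counter(record)
    # Descending count; ties broken by alphabet order for determinism.
    return tuple(sorted(alphabet, key=lambda s: (-counts[s], alphabet.index(s))))

def build_tree(records, alphabet):
    """Insert each record into a prefix tree keyed by its rank vector."""
    root = {"children": {}, "members": []}
    for idx, record in enumerate(records):
        node = root
        for symbol in composition_rank_vector(record, alphabet):
            node = node["children"].setdefault(
                symbol, {"children": {}, "members": []}
            )
            node["members"].append(idx)
    return root

def clusters_at_depth(tree, depth):
    """Collect the member lists of all nodes at a given depth (1 = coarsest)."""
    level = [tree]
    for _ in range(depth):
        level = [child for node in level for child in node["children"].values()]
    return sorted(node["members"] for node in level)

# Hypothetical records over a three-symbol alphabet.
records = ["AAAB", "AABA", "BBBA", "CCCA", "CCAC"]
tree = build_tree(records, alphabet="ABC")
coarse_clusters = clusters_at_depth(tree, 1)
```

For a fixed alphabet, each insertion costs a constant number of steps beyond reading the record, so the total work is linear in the size of the input. Deeper tree levels give finer clusters.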


Machine Learning and High-Dimensional Data Analysis

Principles of Clinical Cancer Research ◽

10.1891/9781617052392.0017 ◽

2018 ◽

Author(s):

Sanjay Aneja ◽

James B. Yu

Keyword(s):

Machine Learning ◽

Data Analysis ◽

High Dimensional Data ◽

High Dimensional ◽

High Dimensional Data Analysis


Finding causative genes from high-dimensional data: an appraisal of statistical and machine learning approaches

Statistical Applications in Genetics and Molecular Biology ◽

10.1515/sagmb-2015-0072 ◽

2016 ◽

Vol 15 (4) ◽

Author(s):

Chamont Wang ◽

Jana L. Gevertz

Keyword(s):

Machine Learning ◽

High Dimensional Data ◽

High Dimensional ◽

Learning Approaches


Sparse Boosting Based Machine Learning Methods for High-Dimensional Data

10.5772/intechopen.100506 ◽

2021 ◽

Author(s):

Mu Yue

Keyword(s):

Machine Learning ◽

Parameter Estimation ◽

Variable Selection ◽

Survival Data ◽

High Dimensional Data ◽

High Dimensional ◽

Learning Methods ◽

Require Time ◽

Machine Learning Methods ◽

Boosting Method

In high-dimensional data, penalized regression is often used for variable selection and parameter estimation. However, these methods typically require time-consuming cross-validation to select tuning parameters and retain more false positives under high dimensionality. This chapter discusses sparse boosting based machine learning methods for the following high-dimensional problems. First, a sparse boosting method to select important biomarkers is studied for right-censored survival data with high-dimensional biomarkers. Then, a two-step sparse boosting method to carry out variable selection and model-based prediction is studied for high-dimensional longitudinal observations measured repeatedly over time. Finally, a multi-step sparse boosting method to identify patient subgroups that exhibit different treatment effects is studied for high-dimensional dense longitudinal observations. This chapter intends to solve the problem of how to improve the accuracy and calculation speed of variable selection and parameter estimation in high-dimensional data. It aims to expand the application scope of sparse boosting and develop new methods of high-dimensional survival analysis, longitudinal data analysis, and subgroup analysis, which have great application prospects.
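To show how boosting can select variables, here is a minimal sketch of plain componentwise L2 boosting, the generic procedure that sparse boosting builds on. It is not the chapter's sparse boosting methods, whose selection and stopping criteria are not given in the abstract. The function name, the step size `nu`, and the toy data are assumptions.

```python
def componentwise_l2_boost(X, y, steps=100, nu=0.1):
    """Fit a linear model by repeatedly updating the single best predictor."""
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    residual = list(y)
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(steps):
        best_j, best_beta, best_gain = None, 0.0, 0.0
        for j in range(p):
            if col_sq[j] == 0:
                continue
            dot = sum(X[i][j] * residual[i] for i in range(n))
            gain = dot * dot / col_sq[j]  # drop in residual sum of squares
            if gain > best_gain:
                best_j, best_beta, best_gain = j, dot / col_sq[j], gain
        if best_j is None:  # nothing reduces the residual any further
            break
        # Shrunken update of the selected coefficient only.
        coef[best_j] += nu * best_beta
        residual = [
            r - nu * best_beta * X[i][best_j] for i, r in enumerate(residual)
        ]
    return coef

# Hypothetical data: the response depends on the first column only.
X = [
    [1.0, 0.5, -1.0],
    [2.0, -1.0, 0.3],
    [3.0, 0.2, 0.8],
    [4.0, 1.5, -0.4],
    [5.0, -0.7, 0.1],
    [6.0, 0.9, 1.2],
]
y = [3.0 * row[0] for row in X]
coef = componentwise_l2_boost(X, y)
```

Predictors that are never selected keep a coefficient of exactly zero, which is how the procedure performs variable selection without a separate penalty search.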


Human and machine learning pipelines for responsible clinical prediction using high-dimensional data

10.21203/rs.3.pex-1655/v1 ◽

2021 ◽

Author(s):

Herdiantri Sufriyana ◽

Yu Wei Wu ◽

Emily Chia-Yu Su

Keyword(s):

Machine Learning ◽

High Dimensional Data ◽

Model Development ◽

Healthcare Providers ◽

Predictive Performance ◽

High Dimensional ◽

Clinical Prediction ◽

Data Collection And Analysis ◽

Medical Histories ◽

Feature Discovery

Abstract This protocol aims to develop, validate, and deploy a prediction model using high dimensional data by both human and machine learning. It is intended for clinical prediction by healthcare providers, including but not limited to those using medical histories from electronic health records. This protocol applies diverse approaches to improve both predictive performance and interpretability while maintaining the generalizability of model evaluation. However, some steps require expensive computational capacity; without it, they will take longer. The key stages consist of the design of data collection and analysis, feature discovery and quality control, and model development, validation, and deployment.


Improving the Accuracy of Convolutional Neural Networks by Identifying and Removing Outlier Images in Datasets Using t-SNE

Mathematics ◽

10.3390/math8050662 ◽

2020 ◽

Vol 8 (5) ◽

pp. 662 ◽

Cited By ~ 3

Author(s):

Husein Perez ◽

Joseph H. M. Tah

Keyword(s):

Machine Learning ◽

Density Distribution ◽

Image Classification ◽

High Dimensional Data ◽

Supervised Machine Learning ◽

Learning Problems ◽

High Dimensional ◽

Feature Engineering ◽

Outlier Data

In the field of supervised machine learning, the quality of a classifier model is directly correlated with the quality of the data that is used to train the model. The presence of unwanted outliers in the data could significantly reduce the accuracy of a model or, even worse, result in a biased model leading to an inaccurate classification. Identifying the presence of outliers and eliminating them is, therefore, crucial for building good quality training datasets. Pre-processing procedures for dealing with missing and outlier data, commonly known as feature engineering, are standard practice in machine learning problems. They help to make better assumptions about the data and also prepare datasets in a way that best exposes the underlying problem to the machine learning algorithms. In this work, we propose a multistage method for detecting and removing outliers in high-dimensional data. Our proposed method uses a technique called t-distributed stochastic neighbour embedding (t-SNE) to reduce a high-dimensional map of features into a lower, two-dimensional, probability density distribution, and then uses a simple descriptive statistical method called the interquartile range (IQR) to identify any outlier values from the density distribution of the features. t-SNE is a machine learning algorithm and a nonlinear dimensionality reduction technique well-suited for embedding high-dimensional data for visualisation in a low-dimensional space of two or three dimensions. We applied this method to a dataset containing images for training a convolutional neural network model (ConvNet) for an image classification problem. The dataset contains four different classes of images: three classes contain defects in construction (mould, stain, and paint deterioration) and one is a no-defect class (normal). We used the transfer learning technique to modify a pre-trained VGG-16 model. We used this model as a feature extractor and as a benchmark to evaluate our method.
We have shown that, when using this method, we can identify and remove the outlier images in the dataset. After removing the outlier images from the dataset and re-training the VGG-16 model, the results have also shown that the accuracy of the classification has significantly improved and the number of misclassified cases has also dropped. While many feature engineering techniques for handling missing and outlier data are common in predictive machine learning problems involving numerical or categorical data, there is little work on developing techniques for handling outliers in high-dimensional data which can be used to improve the quality of machine learning problems involving images such as ConvNet models for image classification and object detection problems.
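The IQR step can be sketched as follows, assuming a two-dimensional embedding has already been computed (for example by t-SNE). The embedding itself is not reproduced here. The 1.5 multiplier is the conventional Tukey fence, an assumption the abstract does not state, and the function names and sample points are hypothetical.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(points, k=1.5):
    """Return indices of 2-D points outside the fences on either axis."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_lo, x_hi = iqr_bounds(xs, k)
    y_lo, y_hi = iqr_bounds(ys, k)
    return [
        i
        for i, (x, y) in enumerate(points)
        if not (x_lo <= x <= x_hi and y_lo <= y <= y_hi)
    ]

# Hypothetical embedding: a tight cluster plus one distant point.
embedding = [
    (0.0, 0.1), (0.2, -0.1), (-0.1, 0.0), (0.1, 0.2),
    (-0.2, -0.2), (0.0, 0.0), (0.15, 0.05), (8.0, 9.0),
]
outliers = find_outliers(embedding)  # [7]
```

In a pipeline like the one described, the flagged indices would identify images to drop before re-training the classifier.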


Predicting future dynamics from short-term time series using an Anticipated Learning Machine

National Science Review ◽

10.1093/nsr/nwaa025 ◽

2020 ◽

Vol 7 (6) ◽

pp. 1079-1091 ◽

Cited By ~ 3

Author(s):

Chuan Chen ◽

Rui Li ◽

Lin Shu ◽

Zhiyu He ◽

Jining Wang ◽

...

Keyword(s):

Machine Learning ◽

Time Series ◽

High Dimensional Data ◽

Small Sample ◽

High Dimensional ◽

Target Variable ◽

Short Term ◽

Linear Dynamical Systems ◽

Non Linear ◽

Learning Machine

Abstract Predicting time series has significant practical applications over different disciplines. Here, we propose an Anticipated Learning Machine (ALM) to achieve precise future-state predictions based on short-term but high-dimensional data. From non-linear dynamical systems theory, we show that ALM can transform recent correlation/spatial information of high-dimensional variables into future dynamical/temporal information of any target variable, thereby overcoming the small-sample problem and achieving multistep-ahead predictions. Since the training samples generated from high-dimensional data also include information of the unknown future values of the target variable, it is called anticipated learning. Extensive experiments on real-world data demonstrate significantly superior performances of ALM over all of the existing 12 methods. In contrast to traditional statistics-based machine learning, ALM is based on non-linear dynamics, thus opening a new way for dynamics-based machine learning.


Statistical Machine Learning for Structured and High Dimensional Data

10.21236/ada610544 ◽

2014 ◽

Author(s):

Larry Wasserman ◽

John Lafferty

Keyword(s):

Machine Learning ◽

High Dimensional Data ◽

High Dimensional ◽

Statistical Machine Learning


Feature selection using autoencoders with Bayesian methods to high-dimensional data

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-211348 ◽

2021 ◽

pp. 1-10

Author(s):

Lei Shu ◽

Kun Huang ◽

Wenhao Jiang ◽

Wenming Wu ◽

Hongling Liu

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Bayesian Methods ◽

Large Scale ◽

High Dimensional Data ◽

Hybrid Approach ◽

High Dimensional ◽

Real World Data ◽

Learning Tasks ◽

Low Dimensional

Using real-world data directly in machine learning tasks easily leads to poor generalization, since such data is usually high-dimensional and limited. By learning low-dimensional representations of high-dimensional data, feature selection can retain the features that are useful for machine learning tasks, and these features can then be used to train machine learning models effectively. Feature selection from high-dimensional data is therefore a challenge. To address this issue, this paper proposes a novel feature selection method: a hybrid approach consisting of an autoencoder and Bayesian methods. Firstly, Bayesian methods are embedded in the proposed autoencoder as a special hidden layer. The purpose of doing so is to increase precision when selecting non-redundant features. Then, the other hidden layers of the autoencoder are used for non-redundant feature selection. Finally, in a comparison with mainstream feature selection approaches, the proposed method outperforms them. We find that combining autoencoders with probabilistic correction methods is more meaningful for feature selection than stacking architectures or adding constraints to autoencoders. We also demonstrate that stacked autoencoders are more suitable for large-scale feature selection, whereas sparse autoencoders are better suited to selecting a smaller number of features. We indicate that the proposed method provides a theoretical reference for analyzing the optimality of feature selection.
