Survey on Feature Transformation Techniques for Data Streams

Author(s):  
Maroua Bahri ◽  
Albert Bifet ◽  
Silviu Maniu ◽  
Heitor Murilo Gomes

Mining high-dimensional data streams poses a fundamental challenge to machine learning, as a large number of attributes can markedly degrade the performance of any mining task. In the past several years, dimension reduction (DR) approaches have been successfully applied for different purposes (e.g., visualization). Because of their high computational costs and the multiple passes they require over large data, these approaches are a hindrance when processing potentially high-dimensional, infinite data streams. High dimensionality also increases the resource usage of algorithms, which may suffer from the curse of dimensionality. To cope with these issues, some techniques for incremental DR have been proposed. In this paper, we provide a survey of reduction approaches designed to handle data streams and highlight the key benefits of using these approaches for stream mining algorithms.
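As a minimal sketch of the incremental DR idea the survey covers, scikit-learn's `IncrementalPCA` can reduce a stream one mini-batch at a time in a single pass; the synthetic stream and its dimensions below are illustrative assumptions, not from the paper:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
ipca = IncrementalPCA(n_components=10)

# Each mini-batch is seen once and then discarded, as in a data stream.
for _ in range(5):
    batch = rng.normal(size=(50, 100))  # 50 instances, 100 attributes
    ipca.partial_fit(batch)

# Project a new instance into the reduced space without revisiting old data.
reduced = ipca.transform(rng.normal(size=(1, 100)))
print(reduced.shape)  # (1, 10)
```

Unlike batch PCA, this never needs to hold the full stream in memory, which is the key benefit for stream mining algorithms.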

2021 ◽  
Vol 37 (3) ◽  
pp. 585-617
Author(s):  
Teresa Bono ◽  
Karen Croxson ◽  
Adam Giles

Abstract The use of machine learning as an input into decision-making is on the rise, owing to its ability to uncover hidden patterns in large data and improve prediction accuracy. Questions have been raised, however, about the potential distributional impacts of these technologies, with one concern being that they may perpetuate or even amplify human biases from the past. Exploiting detailed credit file data for 800,000 UK borrowers, we simulate a switch from a traditional (logit) credit scoring model to ensemble machine-learning methods. We confirm that machine-learning models are more accurate overall. We also find that they do as well as the simpler traditional model on relevant fairness criteria, where these criteria pertain to overall accuracy and error rates for population subgroups defined along protected or sensitive lines (gender, race, health status, and deprivation). We do observe some differences in the way credit-scoring models perform for different subgroups, but these manifest under a traditional modelling approach and switching to machine learning neither exacerbates nor eliminates these issues. The paper discusses some of the mechanical and data factors that may contribute to statistical fairness issues in the context of credit scoring.
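The subgroup fairness criteria described above (overall accuracy and error rates per protected group) can be sketched as follows; the toy labels and groups are illustrative assumptions, not the paper's credit file data:

```python
import numpy as np

def subgroup_metrics(y_true, y_pred, group):
    """Accuracy, false-positive rate, and false-negative rate per subgroup."""
    out = {}
    for g in np.unique(group):
        m = group == g
        t, p = y_true[m], y_pred[m]
        out[g] = {
            "accuracy": float(np.mean(t == p)),
            "fpr": float(np.mean(p[t == 0] == 1)),  # wrongly flagged as default
            "fnr": float(np.mean(p[t == 1] == 0)),  # missed defaults
        }
    return out

# Toy example with two subgroups of four borrowers each.
y_true = np.array([1, 0, 1, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1, 1, 0])
group = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])
metrics = subgroup_metrics(y_true, y_pred, group)
print(metrics)
```

Comparing these per-group rates between a logit and an ensemble model is the kind of check the paper performs at scale.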


Econometrica ◽  
2019 ◽  
Vol 87 (4) ◽  
pp. 1307-1340 ◽  
Author(s):  
Matthew Gentzkow ◽  
Jesse M. Shapiro ◽  
Matt Taddy

We study the problem of measuring group differences in choices when the dimensionality of the choice set is large. We show that standard approaches suffer from a severe finite‐sample bias, and we propose an estimator that applies recent advances in machine learning to address this bias. We apply this method to measure trends in the partisanship of congressional speech from 1873 to 2016, defining partisanship to be the ease with which an observer could infer a congressperson's party from a single utterance. Our estimates imply that partisanship is far greater in recent years than in the past, and that it increased sharply in the early 1990s after remaining low and relatively constant over the preceding century.


Author(s):  
S.K.Komagal Yallini ◽  
Dr. B. Mukunthan

Multi-Label Learning (MLL) addresses the challenge of characterizing each sample by a set of labels simultaneously; that is, a sample has multiple views, each represented by a Class Label (CL). Over the past decades, a significant body of research has been devoted to this promising machine learning concept. Most of this work on MLL assumes a predetermined set of CLs. In many applications, however, the setting is dynamic and new views may emerge in a Data Stream (DS). In this scenario, an MLL technique should be able to detect and classify features with newly evolving labels in order to maintain good predictive performance. Several MLL techniques have been introduced for this purpose in recent decades. This article presents a survey of the field, with emphasis on conventional MLL techniques. First, MLL techniques proposed by various researchers are reviewed. Then, a comparative analysis of the merits and demerits of these techniques concludes the survey and recommends future enhancements to MLL techniques.
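As a minimal illustration of the MLL setting with a fixed label set (the scikit-learn components and synthetic data are assumptions for this sketch, not from the article), each sample is mapped to a binary indicator vector, one entry per class label:

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Synthetic multi-label data: 200 samples, 10 features, 4 possible labels.
X, Y = make_multilabel_classification(
    n_samples=200, n_features=10, n_classes=4, random_state=0
)

# One binary classifier per label — the simplest MLL decomposition.
clf = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
pred = clf.predict(X[:5])
print(pred.shape)  # (5, 4): one 0/1 indicator per class label
```

The stream setting the survey focuses on is harder: the label set itself can grow over time, which this fixed-output decomposition cannot handle on its own.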


Author(s):  
EGEMEN ERTUĞRUL ◽  
ZAKİR BAYTAR ◽  
ÇAĞATAY ÇATAL ◽  
ÖMER CAN MURATLI

Software development effort estimation is a critical activity of the project management process. In this study, machine learning algorithms were investigated in conjunction with feature transformation, feature selection, and parameter tuning techniques to estimate the development effort accurately, and a new model was proposed as part of an expert system. We selected the most general-purpose algorithms, applied a parameter optimization technique (GridSearch), feature transformation techniques (binning and one-hot encoding), and a feature selection algorithm (principal component analysis). All the models were trained on the ISBSG datasets and implemented using the scikit-learn package in the Python language. The proposed model uses a multilayer perceptron as its underlying algorithm, applies binning to transform continuous features and one-hot encoding to transform categorical data into numerical values, performs feature selection based on the principal component analysis method, and tunes parameters with the GridSearch algorithm. We demonstrate that our effort prediction model mostly outperforms the other existing models in terms of prediction accuracy based on the mean absolute residual parameter.


Author(s):  
Rohit A Nitnaware ◽  
Prof. Vijaya Kamble

In disease diagnosis, recognizing patterns accurately is essential for identifying a disease correctly. Machine learning is the field concerned with building models that predict an output from input data, learned from past data. Disease identification is the most essential task in treating any disease, and classification algorithms are used to categorize diseases. Several classification algorithms and dimensionality reduction algorithms are in use. Machine learning enables computers to learn without being explicitly programmed. Using a classification algorithm, the hypothesis that best fits a set of observations can be selected from a set of candidates. Machine learning is well suited to high-dimensional and multi-dimensional data, and better, automated algorithms can be built with it.


2021 ◽  
Author(s):  
Georgios Sarailidis ◽  
Thorsten Wagener ◽  
Francesca Pianosi

Decision trees (DTs) are a machine learning method that has been widely used in the environmental sciences to automatically extract patterns from complex and high-dimensional data. However, like any data-driven method, DTs are hindered by data limitations and can produce physically unrealistic results. We develop interactive DTs (iDTs) that put the human in the loop and integrate experts' scientific knowledge with the power of algorithms that automatically learn patterns from large data. We created a toolbox of methods and visualization techniques that allow users to interact with the DT: users can create new composite variables, manually change the splitting variable and threshold, manually prune, and group variables based on physical meaning. We demonstrate with three case studies that iDTs help experts incorporate their knowledge into DT models, achieving higher interpretability and realism in a physical sense.
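One of the interactions described above — letting an expert add a physically meaningful composite variable before the tree is grown — can be sketched like this; the ratio variable and synthetic data are illustrative assumptions, not from the paper's case studies:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
# Two raw variables; suppose the expert knows their ratio is what matters.
X = rng.uniform(0.1, 1.0, size=(200, 2))
y = (X[:, 0] / X[:, 1] > 1.0).astype(int)

# iDT-style step: augment the data with the expert-defined composite variable.
X_aug = np.column_stack([X, X[:, 0] / X[:, 1]])

tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_aug, y)
# The tree selects the composite variable (index 2) with a threshold near 1.0,
# because no single raw variable can separate the classes on its own.
print(tree.tree_.feature[0], tree.tree_.threshold[0])
```

A one-level tree on the raw variables alone would be forced into a physically meaningless split, which is exactly the failure mode iDTs let experts correct.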


2019 ◽  
Vol 21 (9) ◽  
pp. 662-669 ◽  
Author(s):  
Junnan Zhao ◽  
Lu Zhu ◽  
Weineng Zhou ◽  
Lingfeng Yin ◽  
Yuchen Wang ◽  
...  

Background: Thrombin is the central protease of the vertebrate blood coagulation cascade and is closely related to cardiovascular diseases. The inhibitory constant Ki is the most significant property of thrombin inhibitors. Method: This study was carried out to predict the Ki values of thrombin inhibitors from a large data set using machine learning methods. Exploiting its ability to find non-intuitive regularities in high-dimensional datasets, machine learning can be used to build effective predictive models. A total of 6554 descriptors for each compound were collected, and an efficient descriptor selection method was chosen to find the appropriate descriptors. Four different methods, including multiple linear regression (MLR), K Nearest Neighbors (KNN), Gradient Boosting Regression Tree (GBRT) and Support Vector Machine (SVM), were implemented to build prediction models with the selected descriptors. Results: The SVM model was the best among these methods, with R2=0.84, MSE=0.55 for the training set and R2=0.83, MSE=0.56 for the test set. Several validation methods, such as the y-randomization test and applicability domain evaluation, were adopted to assess the robustness and generalization ability of the model. The final model shows excellent stability and predictive ability and can be employed for rapid estimation of the inhibitory constant, which is helpful for designing novel thrombin inhibitors.
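The overall workflow — descriptor selection followed by an SVM regression model evaluated with R2 — can be sketched as follows; the synthetic "compounds" and "descriptors", the selection method (univariate F-scores), and the SVR hyperparameters are assumptions, not the paper's setup:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Toy stand-in: 200 "compounds" x 50 "descriptors"; Ki depends on 3 of them.
X = rng.normal(size=(200, 50))
y = X[:, 0] - X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = Pipeline([
    ("select", SelectKBest(f_regression, k=10)),  # descriptor selection
    ("scale", StandardScaler()),
    ("svr", SVR(kernel="rbf", C=10.0)),
]).fit(X_tr, y_tr)

r2 = r2_score(y_te, model.predict(X_te))
print(round(r2, 2))
```

Fitting the selector inside the pipeline keeps descriptor selection out of the test set, which matters for honest R2 estimates.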


2019 ◽  
Vol 19 (1) ◽  
pp. 4-16 ◽  
Author(s):  
Qihui Wu ◽  
Hanzhong Ke ◽  
Dongli Li ◽  
Qi Wang ◽  
Jiansong Fang ◽  
...  

Over the past decades, peptides as therapeutic candidates have received increasing attention in drug discovery, especially antimicrobial peptides (AMPs), anticancer peptides (ACPs) and anti-inflammatory peptides (AIPs). It is considered that peptides can regulate various complex diseases that were previously untreatable. In recent years, the critical problem of antimicrobial resistance has driven the pharmaceutical industry to look for new therapeutic agents. Compared to small organic drugs, peptide-based therapy exhibits high specificity and minimal toxicity. Thus, peptides are widely used in the design and discovery of new potent drugs. Currently, large-scale screening of peptide activity with traditional approaches is costly, time-consuming and labor-intensive. Hence, in silico methods, mainly machine learning approaches, have been introduced for their accuracy and effectiveness to predict peptide activity. In this review, we document recent progress in machine learning-based prediction of peptides, which will be of great benefit to the discovery of potentially active AMPs, ACPs and AIPs.


2020 ◽  
Vol 10 (5) ◽  
pp. 1797 ◽  
Author(s):  
Mera Kartika Delimayanti ◽  
Bedy Purnama ◽  
Ngoc Giang Nguyen ◽  
Mohammad Reza Faisal ◽  
Kunti Robiatul Mahmudah ◽  
...  

Manual classification of sleep stages is a time-consuming but necessary step in the diagnosis and treatment of sleep disorders, and its automation has been an area of active study. Previous works have applied low-dimensional fast Fourier transform (FFT) features with many machine learning algorithms. In this paper, we demonstrate that features extracted from EEG signals via the FFT can improve the performance of automated sleep stage classification with machine learning methods. Unlike previous works using the FFT, we incorporated thousands of FFT features in order to classify the sleep stages into 2–6 classes. Using the expanded version of the Sleep-EDF dataset with 61 recordings, our method outperformed other state-of-the-art methods. This result indicates that high-dimensional FFT features, in combination with a simple feature selection, are effective for improving automated sleep stage classification.
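The basic feature extraction step — turning one EEG epoch into a high-dimensional FFT feature vector — can be sketched with NumPy alone; the sampling rate, epoch length, and the synthetic 10 Hz signal are assumptions for illustration, not the Sleep-EDF specifics:

```python
import numpy as np

def fft_features(epoch, fs):
    """Magnitude spectrum of one EEG epoch, usable as a flat feature vector."""
    spectrum = np.abs(np.fft.rfft(epoch))
    freqs = np.fft.rfftfreq(len(epoch), d=1.0 / fs)
    return freqs, spectrum

fs = 100  # Hz
# Synthetic 30-second "epoch": a pure 10 Hz oscillation (alpha-band-like).
epoch = np.sin(2 * np.pi * 10 * np.arange(30 * fs) / fs)
freqs, spec = fft_features(epoch, fs)
print(freqs[np.argmax(spec)])  # 10.0 — the 10 Hz component dominates
```

A 30-second epoch at 100 Hz already yields 1501 spectral features per channel, which is how "thousands of FFT features" arise once several channels or epochs are stacked.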


Entropy ◽  
2021 ◽  
Vol 23 (3) ◽  
pp. 300
Author(s):  
Mark Lokanan ◽  
Susan Liu

Protecting financial consumers from investment fraud has been a recurring problem in Canada. The purpose of this paper is to predict the demographic characteristics of investors who are likely to be victims of investment fraud. Data for this paper came from the Investment Industry Regulatory Organization of Canada’s (IIROC) database between January of 2009 and December of 2019. In total, 4575 investors were coded as victims of investment fraud. The study employed a machine-learning algorithm to predict the probability of fraud victimization. The machine learning model deployed in this paper predicted the typical demographic profile of fraud victims as investors who classify as female, have poor financial knowledge, know the advisor from the past, and are retired. Investors who are characterized as having limited financial literacy but a long-time relationship with their advisor have reduced probabilities of being victimized. However, male investors with low or moderate-level investment knowledge were more likely to be preyed upon by their investment advisors. While not statistically significant, older adults, in general, are at greater risk of being victimized. The findings from this paper can be used by Canadian self-regulatory organizations and securities commissions to inform their investors’ protection mandates.

