Survey on Feature Transformation Techniques for Data Streams

Author(s):  
Maroua Bahri ◽  
Albert Bifet ◽  
Silviu Maniu ◽  
Heitor Murilo Gomes

Mining high-dimensional data streams poses a fundamental challenge to machine learning, as a large number of attributes can markedly degrade the performance of any mining task. In the past several years, dimension reduction (DR) approaches have been successfully applied for different purposes (e.g., visualization). Because of their high computational costs and the multiple passes they require over large data, these approaches are a hindrance when processing potentially high-dimensional, infinite data streams. High dimensionality also increases the resource usage of algorithms, which may suffer from the curse of dimensionality. To cope with these issues, some techniques for incremental DR have been proposed. In this paper, we provide a survey of reduction approaches designed to handle data streams and highlight the key benefits of using these approaches for stream mining algorithms.
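As a minimal sketch of the incremental DR idea the survey covers, scikit-learn's `IncrementalPCA` can reduce a stream one mini-batch at a time in a single pass; the synthetic stream and its dimensions below are illustrative assumptions, not from the paper:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
ipca = IncrementalPCA(n_components=10)

# Each mini-batch is seen once and then discarded, as in a data stream.
for _ in range(5):
    batch = rng.normal(size=(50, 100))  # 50 instances, 100 attributes
    ipca.partial_fit(batch)

# Project a new instance into the reduced space without revisiting old data.
reduced = ipca.transform(rng.normal(size=(1, 100)))
print(reduced.shape)  # (1, 10)
```

Unlike batch PCA, this never needs to hold the full stream in memory, which is the key benefit for stream mining algorithms.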

2021 ◽  
Vol 37 (3) ◽  
pp. 585-617
Author(s):  
Teresa Bono ◽  
Karen Croxson ◽  
Adam Giles

Abstract The use of machine learning as an input into decision-making is on the rise, owing to its ability to uncover hidden patterns in large data and improve prediction accuracy. Questions have been raised, however, about the potential distributional impacts of these technologies, with one concern being that they may perpetuate or even amplify human biases from the past. Exploiting detailed credit file data for 800,000 UK borrowers, we simulate a switch from a traditional (logit) credit scoring model to ensemble machine-learning methods. We confirm that machine-learning models are more accurate overall. We also find that they do as well as the simpler traditional model on relevant fairness criteria, where these criteria pertain to overall accuracy and error rates for population subgroups defined along protected or sensitive lines (gender, race, health status, and deprivation). We do observe some differences in the way credit-scoring models perform for different subgroups, but these manifest under a traditional modelling approach and switching to machine learning neither exacerbates nor eliminates these issues. The paper discusses some of the mechanical and data factors that may contribute to statistical fairness issues in the context of credit scoring.
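The subgroup fairness criteria described above (overall accuracy and error rates per protected group) can be sketched as follows; the toy labels and groups are illustrative assumptions, not the paper's credit file data:

```python
import numpy as np

def subgroup_metrics(y_true, y_pred, group):
    """Accuracy, false-positive rate, and false-negative rate per subgroup."""
    out = {}
    for g in np.unique(group):
        m = group == g
        t, p = y_true[m], y_pred[m]
        out[g] = {
            "accuracy": float(np.mean(t == p)),
            "fpr": float(np.mean(p[t == 0] == 1)),  # wrongly flagged as default
            "fnr": float(np.mean(p[t == 1] == 0)),  # missed defaults
        }
    return out

# Toy example with two subgroups of four borrowers each.
y_true = np.array([1, 0, 1, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1, 1, 0])
group = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])
metrics = subgroup_metrics(y_true, y_pred, group)
print(metrics)
```

Comparing these per-group rates between a logit and an ensemble model is the kind of check the paper performs at scale.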


Econometrica ◽  
2019 ◽  
Vol 87 (4) ◽  
pp. 1307-1340 ◽  
Author(s):  
Matthew Gentzkow ◽  
Jesse M. Shapiro ◽  
Matt Taddy

We study the problem of measuring group differences in choices when the dimensionality of the choice set is large. We show that standard approaches suffer from a severe finite‐sample bias, and we propose an estimator that applies recent advances in machine learning to address this bias. We apply this method to measure trends in the partisanship of congressional speech from 1873 to 2016, defining partisanship to be the ease with which an observer could infer a congressperson's party from a single utterance. Our estimates imply that partisanship is far greater in recent years than in the past, and that it increased sharply in the early 1990s after remaining low and relatively constant over the preceding century.


Author(s):  
S.K.Komagal Yallini ◽  
Dr. B. Mukunthan

Multi-Label Learning (MLL) addresses the challenge of characterizing each sample by a set of labels simultaneously; that is, a sample has multiple views, each represented by a Class Label (CL). Over the past decades, a significant body of research has been devoted to this promising machine learning concept. Most of this work on MLL assumes a predetermined set of CLs. In many applications, however, the setting is dynamic and new views may emerge in a Data Stream (DS). In this scenario, an MLL technique should be able to detect and classify features with newly evolving labels in order to maintain good predictive performance. Several MLL techniques have been introduced for this purpose in recent decades. This article presents a survey of the field, with emphasis on conventional MLL techniques. First, MLL techniques proposed by various researchers are reviewed. Then, a comparative analysis of the merits and demerits of these techniques concludes the survey and recommends future enhancements to MLL techniques.
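As a minimal illustration of the MLL setting with a fixed label set (the scikit-learn components and synthetic data are assumptions for this sketch, not from the article), each sample is mapped to a binary indicator vector, one entry per class label:

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Synthetic multi-label data: 200 samples, 10 features, 4 possible labels.
X, Y = make_multilabel_classification(
    n_samples=200, n_features=10, n_classes=4, random_state=0
)

# One binary classifier per label — the simplest MLL decomposition.
clf = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
pred = clf.predict(X[:5])
print(pred.shape)  # (5, 4): one 0/1 indicator per class label
```

The stream setting the survey focuses on is harder: the label set itself can grow over time, which this fixed-output decomposition cannot handle on its own.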


Author(s):  
EGEMEN ERTUĞRUL ◽  
ZAKİR BAYTAR ◽  
ÇAĞATAY ÇATAL ◽  
ÖMER CAN MURATLI

Software development effort estimation is a critical activity of the project management process. In this study, machine learning algorithms were investigated in conjunction with feature transformation, feature selection, and parameter tuning techniques to estimate the development effort accurately, and a new model was proposed as part of an expert system. We selected the most general-purpose algorithms, applied a parameter optimization technique (GridSearch), feature transformation techniques (binning and one-hot encoding), and a feature selection algorithm (principal component analysis). All the models were trained on the ISBSG datasets and implemented using the scikit-learn package in the Python language. The proposed model uses a multilayer perceptron as its underlying algorithm, applies binning to transform continuous features and one-hot encoding to transform categorical data into numerical values, performs feature selection based on the principal component analysis method, and tunes parameters with the GridSearch algorithm. We demonstrate that our effort prediction model mostly outperforms the other existing models in terms of prediction accuracy based on the mean absolute residual parameter.


Author(s):  
Rohit A Nitnaware ◽  
Prof. Vijaya Kamble

In disease diagnosis, recognizing patterns accurately is essential for identifying a disease correctly. Machine learning is the field concerned with building models that predict an output from input data, learned from past data. Disease identification is the most essential task in treating any disease, and classification algorithms are used to categorize diseases. Several classification algorithms and dimensionality reduction algorithms are in use. Machine learning enables computers to learn without being explicitly programmed. Using a classification algorithm, the hypothesis that best fits a set of observations can be selected from a set of candidates. Machine learning is well suited to high-dimensional and multi-dimensional data, and better, automated algorithms can be built with it.


2021 ◽  
Author(s):  
Georgios Sarailidis ◽  
Thorsten Wagener ◽  
Francesca Pianosi

Decision trees (DTs) are a machine learning method that has been widely used in the environmental sciences to automatically extract patterns from complex and high-dimensional data. However, like any data-driven method, DTs are hindered by data limitations and can produce physically unrealistic results. We develop interactive DTs (iDTs) that put the human in the loop and integrate experts' scientific knowledge with the power of algorithms that automatically learn patterns from large data. We created a toolbox of methods and visualization techniques that allow users to interact with the DT: users can create new composite variables, manually change the splitting variable and threshold, manually prune, and group variables based on physical meaning. We demonstrate with three case studies that iDTs help experts incorporate their knowledge into DT models, achieving higher interpretability and realism in a physical sense.
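One of the interactions described above — letting an expert add a physically meaningful composite variable before the tree is grown — can be sketched like this; the ratio variable and synthetic data are illustrative assumptions, not from the paper's case studies:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
# Two raw variables; suppose the expert knows their ratio is what matters.
X = rng.uniform(0.1, 1.0, size=(200, 2))
y = (X[:, 0] / X[:, 1] > 1.0).astype(int)

# iDT-style step: augment the data with the expert-defined composite variable.
X_aug = np.column_stack([X, X[:, 0] / X[:, 1]])

tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_aug, y)
# The tree selects the composite variable (index 2) with a threshold near 1.0,
# because no single raw variable can separate the classes on its own.
print(tree.tree_.feature[0], tree.tree_.threshold[0])
```

A one-level tree on the raw variables alone would be forced into a physically meaningless split, which is exactly the failure mode iDTs let experts correct.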


2019 ◽  
Vol 21 (9) ◽  
pp. 662-669 ◽  
Author(s):  
Junnan Zhao ◽  
Lu Zhu ◽  
Weineng Zhou ◽  
Lingfeng Yin ◽  
Yuchen Wang ◽  
...  

Background: Thrombin is the central protease of the vertebrate blood coagulation cascade and is closely related to cardiovascular diseases. The inhibitory constant Ki is the most significant property of thrombin inhibitors. Method: This study was carried out to predict the Ki values of thrombin inhibitors from a large data set using machine learning methods. Exploiting its ability to find non-intuitive regularities in high-dimensional datasets, machine learning can be used to build effective predictive models. A total of 6554 descriptors for each compound were collected, and an efficient descriptor selection method was chosen to find the appropriate descriptors. Four different methods, including multiple linear regression (MLR), K Nearest Neighbors (KNN), Gradient Boosting Regression Tree (GBRT) and Support Vector Machine (SVM), were implemented to build prediction models with the selected descriptors. Results: The SVM model was the best among these methods, with R2=0.84, MSE=0.55 for the training set and R2=0.83, MSE=0.56 for the test set. Several validation methods, such as the y-randomization test and applicability domain evaluation, were adopted to assess the robustness and generalization ability of the model. The final model shows excellent stability and predictive ability and can be employed for rapid estimation of the inhibitory constant, which is helpful for designing novel thrombin inhibitors.
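The overall workflow — descriptor selection followed by an SVM regression model evaluated with R2 — can be sketched as follows; the synthetic "compounds" and "descriptors", the selection method (univariate F-scores), and the SVR hyperparameters are assumptions, not the paper's setup:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Toy stand-in: 200 "compounds" x 50 "descriptors"; Ki depends on 3 of them.
X = rng.normal(size=(200, 50))
y = X[:, 0] - X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = Pipeline([
    ("select", SelectKBest(f_regression, k=10)),  # descriptor selection
    ("scale", StandardScaler()),
    ("svr", SVR(kernel="rbf", C=10.0)),
]).fit(X_tr, y_tr)

r2 = r2_score(y_te, model.predict(X_te))
print(round(r2, 2))
```

Fitting the selector inside the pipeline keeps descriptor selection out of the test set, which matters for honest R2 estimates.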


2019 ◽  
Vol 19 (1) ◽  
pp. 4-16 ◽  
Author(s):  
Qihui Wu ◽  
Hanzhong Ke ◽  
Dongli Li ◽  
Qi Wang ◽  
Jiansong Fang ◽  
...  

Over the past decades, peptides as therapeutic candidates have received increasing attention in drug discovery, especially antimicrobial peptides (AMPs), anticancer peptides (ACPs) and anti-inflammatory peptides (AIPs). It is considered that peptides can regulate various complex diseases that were previously untreatable. In recent years, the critical problem of antimicrobial resistance has driven the pharmaceutical industry to look for new therapeutic agents. Compared to small organic drugs, peptide-based therapy exhibits high specificity and minimal toxicity. Thus, peptides are widely used in the design and discovery of new potent drugs. Currently, large-scale screening of peptide activity with traditional approaches is costly, time-consuming and labor-intensive. Hence, in silico methods, mainly machine learning approaches, have been introduced for their accuracy and effectiveness to predict peptide activity. In this review, we document recent progress in machine learning-based prediction of peptides, which will be of great benefit to the discovery of potentially active AMPs, ACPs and AIPs.


2020 ◽  
Vol 10 (5) ◽  
pp. 1797 ◽  
Author(s):  
Mera Kartika Delimayanti ◽  
Bedy Purnama ◽  
Ngoc Giang Nguyen ◽  
Mohammad Reza Faisal ◽  
Kunti Robiatul Mahmudah ◽  
...  

Manual classification of sleep stages is a time-consuming but necessary step in the diagnosis and treatment of sleep disorders, and its automation has been an area of active study. Previous works have applied low-dimensional fast Fourier transform (FFT) features with many machine learning algorithms. In this paper, we demonstrate that features extracted from EEG signals via the FFT can improve the performance of automated sleep stage classification with machine learning methods. Unlike previous works using the FFT, we incorporated thousands of FFT features in order to classify the sleep stages into 2–6 classes. Using the expanded version of the Sleep-EDF dataset with 61 recordings, our method outperformed other state-of-the-art methods. This result indicates that high-dimensional FFT features, in combination with a simple feature selection, are effective for improving automated sleep stage classification.
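The basic feature extraction step — turning one EEG epoch into a high-dimensional FFT feature vector — can be sketched with NumPy alone; the sampling rate, epoch length, and the synthetic 10 Hz signal are assumptions for illustration, not the Sleep-EDF specifics:

```python
import numpy as np

def fft_features(epoch, fs):
    """Magnitude spectrum of one EEG epoch, usable as a flat feature vector."""
    spectrum = np.abs(np.fft.rfft(epoch))
    freqs = np.fft.rfftfreq(len(epoch), d=1.0 / fs)
    return freqs, spectrum

fs = 100  # Hz
# Synthetic 30-second "epoch": a pure 10 Hz oscillation (alpha-band-like).
epoch = np.sin(2 * np.pi * 10 * np.arange(30 * fs) / fs)
freqs, spec = fft_features(epoch, fs)
print(freqs[np.argmax(spec)])  # 10.0 — the 10 Hz component dominates
```

A 30-second epoch at 100 Hz already yields 1501 spectral features per channel, which is how "thousands of FFT features" arise once several channels or epochs are stacked.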


Entropy ◽  
2021 ◽  
Vol 23 (3) ◽  
pp. 300
Author(s):  
Mark Lokanan ◽  
Susan Liu

Protecting financial consumers from investment fraud has been a recurring problem in Canada. The purpose of this paper is to predict the demographic characteristics of investors who are likely to be victims of investment fraud. Data for this paper came from the Investment Industry Regulatory Organization of Canada’s (IIROC) database between January of 2009 and December of 2019. In total, 4575 investors were coded as victims of investment fraud. The study employed a machine-learning algorithm to predict the probability of fraud victimization. The machine learning model deployed in this paper predicted the typical demographic profile of fraud victims as investors who classify as female, have poor financial knowledge, know the advisor from the past, and are retired. Investors who are characterized as having limited financial literacy but a long-time relationship with their advisor have reduced probabilities of being victimized. However, male investors with low or moderate-level investment knowledge were more likely to be preyed upon by their investment advisors. While not statistically significant, older adults, in general, are at greater risk of being victimized. The findings from this paper can be used by Canadian self-regulatory organizations and securities commissions to inform their investors’ protection mandates.

