Feature Selection Algorithms as One of the Python Data Analytical Tools

With the rapidly growing popularity of the Python programming language for machine learning applications, the gap between the needs of machine learning engineers and the existing Python tools is widening. This is especially noticeable in more classical machine learning fields, such as feature selection, because community attention over the last decade has mainly shifted to neural networks. This paper has two main purposes. First, we survey existing open-source Python and Python-compatible feature selection libraries, show their problems, if any, and demonstrate the gap between these libraries and the modern state of the feature selection field. Then, we present a new open-source, scikit-learn compatible ITMO FS (Information Technologies, Mechanics and Optics University feature selection) library that is currently under development, explain how its architecture covers modern views on feature selection, and provide some code examples showing how to use it with Python, along with a comparison of its performance against other Python feature selection libraries.

CancerDiscover: A configurable pipeline for cancer prediction and biomarker identification using machine learning framework

10.1101/182998 ◽

2017 ◽

Author(s):

Akram Mohammed ◽

Greyson Biegert ◽

Jiri Adamec ◽

Tomáš Helikar

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Open Source ◽

High Throughput ◽

High Throughput Screening ◽

Learning Algorithms ◽

Supplementary Information ◽

Supplementary File ◽

Biomarker Identification ◽

Selection Algorithms

Motivation: Use of various high-throughput screening techniques has resulted in an abundance of data, whose complete utility is limited by the tools available for processing and analysis. Machine learning holds great potential for deciphering these data in the context of cancer classification and biomarker identification. However, current machine learning tools require manual processing of raw data from various sequencing platforms, which is both tedious and time-consuming. The current classification tools lack flexibility in choosing the best feature selection algorithm from a range of algorithms and, most importantly, cannot compare various learning algorithms. Results: We developed CancerDiscover, an open-source software pipeline that allows users to efficiently and automatically integrate large high-throughput datasets, preprocess and normalize them, and select the best-performing features from multiple feature selection algorithms. The pipeline lets users apply various learning algorithms and generates multiple classification models and evaluation reports that distinguish cancer from normal samples, as well as different types and subtypes of cancer. Availability and Implementation: The open source pipeline is freely available for download at https://github.com/HelikarLab/[email protected] Supplementary Information: Please refer to the CancerDiscover README (Supplementary File 1) for detailed instructions on installation and operation of the pipeline. For a list of available feature selection methods, see Supplementary File 2.

Customer Churn Prediction in Telecom Sector with Machine Learning and Information Gain Filter Feature Selection Algorithms

10.1109/icdabi53623.2021.9655792 ◽

2021 ◽

Author(s):

Yakub K. Saheed ◽

Moshood A. Hambali

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Information Gain ◽

Churn Prediction ◽

Customer Churn ◽

Customer Churn Prediction ◽

Telecom Sector ◽

Selection Algorithms

Streaming feature selection algorithms for big data: A survey

Applied Computing and Informatics ◽

10.1016/j.aci.2019.01.001 ◽

2020 ◽

Vol ahead-of-print (ahead-of-print) ◽

Cited By ~ 5

Author(s):

Noura AlNuaimi ◽

Mohammad Mehedy Masud ◽

Mohamed Adel Serhani ◽

Nazar Zaki

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Big Data ◽

Real Time ◽

Relevant Literature ◽

Heterogeneous Data ◽

Proper Solution ◽

Exact Figure ◽

Selection Algorithms ◽

Over Time

Organizations in many domains generate a considerable amount of heterogeneous data every day. Such data can be processed to enhance these organizations’ decisions in real time. However, storing and processing large and varied datasets (known as big data) in real time is challenging. In machine learning, streaming feature selection has long been considered a superior technique for selecting the relevant subset of features from high-dimensional data and thus reducing learning complexity. In the relevant literature, streaming feature selection refers to the setting in which features arrive consecutively over time: the total number of features is not known in advance, whereas the number of instances is well established. Many scholars in the field have proposed streaming feature selection algorithms in attempts to find a proper solution to this problem. This paper presents an exhaustive and methodological introduction to these techniques. The study reviews the traditional feature selection algorithms and then scrutinizes the current algorithms that use streaming feature selection to determine their strengths and weaknesses. The survey also sheds light on the ongoing challenges in big-data research.
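A rough sketch of the setting described above, in which candidate features arrive one at a time while the set of instances stays fixed. The `StreamingSelector` class, its two thresholds, and the toy data are hypothetical and chosen only for illustration; the Pearson-correlation test is a simple stand-in and is not one of the algorithms reviewed in the survey.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


class StreamingSelector:
    """Hypothetical online filter: decide on each feature as it arrives."""

    def __init__(self, target, min_relevance=0.5, max_redundancy=0.9):
        self.target = list(target)            # fixed set of instances
        self.min_relevance = min_relevance    # assumed threshold
        self.max_redundancy = max_redundancy  # assumed threshold
        self.selected = {}                    # name -> values of kept features

    def offer(self, name, values):
        """Consider a newly arrived feature; return True if it was kept."""
        # Discard features that are only weakly related to the target.
        if abs(pearson(values, self.target)) < self.min_relevance:
            return False
        # Discard features that nearly duplicate one already kept.
        for kept in self.selected.values():
            if abs(pearson(values, kept)) > self.max_redundancy:
                return False
        self.selected[name] = list(values)
        return True


# Invented toy stream: the total number of features is not known up front.
target = [1, 2, 3, 4, 5, 6]
stream = [
    ("f1", [1.1, 1.9, 3.2, 3.8, 5.1, 6.0]),     # relevant
    ("f2", [2.2, 3.8, 6.4, 7.6, 10.2, 12.0]),   # rescaled copy of f1
    ("f3", [5, 1, 4, 2, 6, 3]),                 # weakly related to the target
    ("f4", [1, 1, 1, 6, 6, 6]),                 # relevant, distinct from f1
]

selector = StreamingSelector(target)
decisions = [selector.offer(name, values) for name, values in stream]
print(list(selector.selected))
```

With these thresholds the redundant copy and the weakly related feature are discarded as they arrive, without revisiting earlier decisions.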

A NOVEL FEATURE SELECTION ALGORITHM WITH SUPERVISED MUTUAL INFORMATION FOR CLASSIFICATION

International Journal of Artificial Intelligence Tools ◽

10.1142/s0218213013500279 ◽

2013 ◽

Vol 22 (04) ◽

pp. 1350027

Author(s):

JAGANATHAN PALANICHAMY ◽

KUPPUCHAMY RAMASAMY

Keyword(s):

Machine Learning ◽

Data Mining ◽

Feature Selection ◽

Mutual Information ◽

Selection Algorithm ◽

Feature Selection Algorithm ◽

Class A ◽

Selection Algorithms ◽

The Relationship ◽

Class Variable

Feature selection is essential in data mining and pattern recognition, especially for database classification. In past years, several feature selection algorithms have been proposed to measure the relevance of various features to each class. A suitable feature selection algorithm normally maximizes the relevancy and minimizes the redundancy of the selected features. The mutual information measure can successfully estimate the dependency of features on the entire sampling space, but it cannot exactly represent the redundancies among features. In this paper, a novel feature selection algorithm is proposed based on the maximum relevance and minimum redundancy criterion. Mutual information is used to measure the relevancy of each feature to the class variable, and redundancy is calculated from the relationship between candidate features, selected features, and class variables. The effectiveness is tested on ten benchmark datasets available in the UCI Machine Learning Repository. The experimental results show better performance when compared with some existing algorithms.
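The maximum-relevance, minimum-redundancy idea can be illustrated with a minimal sketch. This is the classic greedy "relevance minus mean redundancy" criterion on discrete features, not the novel supervised variant the abstract proposes; the function names and the toy data are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr(features, target, k):
    """Greedily pick k feature names by relevance minus mean redundancy."""
    selected = []
    remaining = list(features)
    relevance = {
        name: mutual_information(features[name], target) for name in remaining
    }

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Invented toy data: "a" predicts the class well, "a_copy" duplicates it,
# "b" carries weaker but different information, "noise" is unrelated.
target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "a":      [0, 0, 0, 1, 1, 1, 1, 1],
    "a_copy": [0, 0, 0, 1, 1, 1, 1, 1],
    "b":      [0, 0, 1, 0, 1, 1, 0, 1],
    "noise":  [0, 1, 0, 1, 0, 1, 0, 1],
}
chosen = mrmr(features, target, k=2)
print(chosen)
```

A ranking by relevance alone would pick `a` and `a_copy`; the redundancy term is what steers the second choice toward `b`.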

Latest Tools for Data Mining and Machine Learning

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.i1003.0789s19 ◽

2019 ◽

Vol 8 (9S) ◽

pp. 18-23 ◽

Cited By ~ 2

Keyword(s):

Machine Learning ◽

Data Mining ◽

Decision Making ◽

Feature Selection ◽

Open Source ◽

Predictive Analysis ◽

Learning Tools ◽

Pros And Cons ◽

Selection For ◽

Extract Information

Nowadays, Data Mining is used everywhere to extract information from data and, in turn, to acquire knowledge for decision making. Data Mining analyzes patterns, which are used to extract information and knowledge for making decisions. Many open-source and licensed tools, such as Weka, RapidMiner, KNIME, and Orange, are available for Data Mining and predictive analysis. This paper discusses the different tools available for Data Mining and Machine Learning, followed by a description of these tools and their pros and cons. The article provides details of the algorithms these tools offer, including classification, regression, characterization, discretization, clustering, visualization, and feature selection. It aims to help readers make decisions efficiently and suggests which tool is suitable for their requirements.

Research Approach With Machine Learning Underpinned

Machine Learning in Cancer Research With Applications in Colon Cancer and Big Data Analysis - Advances in Medical Technologies and Clinical Practice ◽

10.4018/978-1-7998-7316-7.ch003 ◽

2021 ◽

pp. 63-97

Keyword(s):

Machine Learning ◽

Colon Cancer ◽

Feature Selection ◽

High Performance ◽

Classification Algorithms ◽

Research Approach ◽

Research Questions ◽

Selection Algorithms ◽

Gene Feature

This chapter describes several methodologies and proposed models used to examine the accuracy and efficiency of high-performance colon-cancer feature selection and classification algorithms to solve the problems identified in Chapter 2. An elaboration of the diverse methods of gene/feature selection algorithms and the related classification algorithms implemented throughout this study are presented. A prototypical methodology blueprint for each experiment is developed to answer the research questions in Chapter 1. Each system model is also presented, and the measures used to validate the performance of the model's outcome are discussed.

Machine Learning Applications in Head and Neck Radiation Oncology: Lessons From Open-Source Radiomics Challenges

Frontiers in Oncology ◽

10.3389/fonc.2018.00294 ◽

2018 ◽

Vol 8 ◽

Cited By ~ 11

Author(s):

Hesham Elhalawani ◽

Timothy A. Lin ◽

Stefania Volpe ◽

Abdallah S. R. Mohamed ◽

Aubrey L. White ◽

...

Keyword(s):

Machine Learning ◽

Head And Neck ◽

Open Source ◽

Radiation Oncology ◽

Machine Learning Applications ◽

Head And Neck Radiation

Machine learning application identifies novel gene signatures from transcriptomic data of spontaneous canine hemangiosarcoma

Briefings in Bioinformatics ◽

10.1093/bib/bbaa252 ◽

2020 ◽

Author(s):

Nuojin Cheng ◽

Ashley J Schulte ◽

Fadil Santosa ◽

Jong Hyuk Kim

Keyword(s):

Machine Learning ◽

Feature Selection ◽

High Throughput Sequencing ◽

Feature Selection Method ◽

Soft Tissue Sarcomas ◽

Gene Signatures ◽

Transcriptomic Data ◽

Novel Gene ◽

Microscopic Evaluation ◽

Machine Learning Applications

Angiosarcomas are soft-tissue sarcomas that form malignant vascular tissues. Angiosarcomas are very rare, and due to their aggressive behavior and high metastatic propensity, they have poor clinical outcomes. Hemangiosarcomas commonly occur in domestic dogs, and share pathological and clinical features with human angiosarcomas. Typical pathognomonic features of this tumor are irregular vascular channels that are filled with blood and are lined by a mixture of malignant and nonmalignant endothelial cells. The current gold standard is the histological diagnosis of angiosarcoma; however, microscopic evaluation may be complicated, particularly when tumor cells are undetectable due to the presence of excessive amounts of nontumor cells or when tissue specimens have insufficient tumor content. In this study, we implemented machine learning applications from next-generation transcriptomic data of canine hemangiosarcoma tumor samples (n = 76) and nonmalignant tissues (n = 10) to evaluate their training performance for diagnostic utility. The 10-fold cross-validation test and multiple feature selection methods were applied. We found that extra trees and random forest learning models were the best classifiers for hemangiosarcoma in our testing datasets. We also identified novel gene signatures using the mutual information and Monte Carlo feature selection method. The extra trees model revealed high classification accuracy for hemangiosarcoma in validation sets. We demonstrate that high-throughput sequencing data of canine hemangiosarcoma are trainable for machine learning applications. Furthermore, our approach enables us to identify novel gene signatures as reliable determinants of hemangiosarcoma, providing significant insights into the development of potential applications for this vascular malignancy.
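As an illustration of what 10-fold cross-validation looks like with class sizes like these (76 tumor samples, 10 nonmalignant tissues), the sketch below builds the folds by hand. Stratifying the folds by class is an assumption made here, since the abstract does not say how the folds were formed, and `stratified_kfold` is a hypothetical helper rather than the authors' code.

```python
import random


def stratified_kfold(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs with each class spread over k folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    offset = 0
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal this class's samples round-robin across the folds.
        for j, i in enumerate(idx):
            folds[(offset + j) % k].append(i)
        offset += len(idx)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


# Class sizes taken from the abstract; the labels themselves are placeholders.
labels = ["tumor"] * 76 + ["nonmalignant"] * 10
splits = list(stratified_kfold(labels, k=10))

for train, test in splits:
    n_normal = sum(labels[i] == "nonmalignant" for i in test)
    print(len(train), len(test), n_normal)
```

With only ten nonmalignant samples, stratification leaves exactly one of them in each test fold; an unstratified split could produce test folds with none, which makes per-fold metrics for the minority class undefined.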

An integrated machine learning framework for a discriminative analysis of schizophrenia using multi-biological data

Scientific Reports ◽

10.1038/s41598-021-94007-9 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Peng-fei Ke ◽

Dong-sheng Xiong ◽

Jia-hui Li ◽

Zhi-lin Pan ◽

Jing Zhou ◽

...

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Gut Microbiota ◽

Cross Sectional Study ◽

Biological Data ◽

Support Vector ◽

Cross Sectional ◽

Integrated Framework ◽

Learning Framework ◽

Selection Algorithms

Finding effective and objective biomarkers to inform the diagnosis of schizophrenia is of great importance yet remains challenging. Relatively little work has been conducted on multi-biological data for the diagnosis of schizophrenia. In this cross-sectional study, we extracted multiple features from three types of biological data, including gut microbiota data, blood data, and electroencephalogram data. Then, an integrated framework of machine learning consisting of five classifiers, three feature selection algorithms, and four cross validation methods was used to discriminate patients with schizophrenia from healthy controls. Our results show that the support vector machine classifier without feature selection using the input features of multi-biological data achieved the best performance, with an accuracy of 91.7% and an AUC of 96.5% (p < 0.05). These results indicate that multi-biological data showed better discriminative capacity for patients with schizophrenia than single biological data. The top 5% discriminative features selected from the optimal model include the gut microbiota features (Lactobacillus, Haemophilus, and Prevotella), the blood features (superoxide dismutase level, monocyte-lymphocyte ratio, and neutrophil count), and the electroencephalogram features (nodal local efficiency, nodal efficiency, and nodal shortest path length in the temporal and frontal-parietal brain areas). The proposed integrated framework may be helpful for understanding the pathophysiology of schizophrenia and developing biomarkers for schizophrenia using multi-biological data.

ReactionCode: a new versatile format for searching, analysis, classification, transform, and encoding/decoding of reactions

10.26434/chemrxiv.12058971.v1 ◽

2020 ◽

Author(s):

Victorien Delannée ◽

Marc Nicklaus

Keyword(s):

Machine Learning ◽

Open Source ◽

Data Exchange ◽

Similarity Searching ◽

Classification Analysis ◽

The Past ◽

Machine Learning Applications ◽

Database Organization ◽

Machine Readable ◽

Condensed Graph Of Reaction

In the past two decades, many different formats for molecules and reactions have been created. These formats were mostly developed for the purposes of identification, representation, classification, analysis, and data exchange. Considerable effort has gone into molecule formats, but much less into reaction formats, where most of the work has been done by companies and has led to proprietary formats. Here, we developed a new open-source format that allows a reaction to be encoded into, and decoded from, a multi-layer machine-readable code, which aggregates reactants and products into a condensed graph of reaction (CGR). This format is flexible and can be used for reaction similarity searching and classification. It is also designed for database organization, machine learning applications, and as a new reaction transform language.
