Distributed Collaborative Feature Selection Based on Intermediate Representation

Feature selection is an efficient dimensionality reduction technique for artificial intelligence and machine learning. Many feature selection methods learn the data structure to select the most discriminative features for distinguishing different classes. However, the data is sometimes distributed in multiple parties and sharing the original data is difficult due to the privacy requirement. As a result, the data in one party may be lack of useful information to learn the most discriminative features. In this paper, we propose a novel distributed method which allows collaborative feature selection for multiple parties without revealing their original data. In the proposed method, each party finds the intermediate representations from the original data, and shares the intermediate representations for collaborative feature selection. Based on the shared intermediate representations, the original data from multiple parties are transformed to the same low dimensional space. The feature ranking of the original data is learned by imposing row sparsity on the transformation matrix simultaneously. Experimental results on real-world datasets demonstrate the effectiveness of the proposed method.

Download Full-text

An Adaptive Unsupervised Feature Selection Algorithm Based on MDS for Tumor Gene Data Classification

Sensors ◽

10.3390/s21113627 ◽

2021 ◽

Vol 21 (11) ◽

pp. 3627

Author(s):

Bo Jin ◽

Chunling Fu ◽

Yong Jin ◽

Wei Yang ◽

Shengbin Li ◽

...

Keyword(s):

Feature Selection ◽

Local Structure ◽

Gene Selection ◽

Dimensional Space ◽

Original Data ◽

Global Structure ◽

Biological Data ◽

Special Treatment ◽

Selection Scheme ◽

Unsupervised Feature Selection

Identifying the key genes related to tumors from gene expression data with a large number of features is important for the accurate classification of tumors and to make special treatment decisions. In recent years, unsupervised feature selection algorithms have attracted considerable attention in the field of gene selection as they can find the most discriminating subsets of genes, namely the potential information in biological data. Recent research also shows that maintaining the important structure of data is necessary for gene selection. However, most current feature selection methods merely capture the local structure of the original data while ignoring the importance of the global structure of the original data. We believe that the global structure and local structure of the original data are equally important, and so the selected genes should maintain the essential structure of the original data as far as possible. In this paper, we propose a new, adaptive, unsupervised feature selection scheme which not only reconstructs high-dimensional data into a low-dimensional space with the constraint of feature distance invariance but also employs ℓ2,1-norm to enable a matrix with the ability to perform gene selection embedding into the local manifold structure-learning framework. Moreover, an effective algorithm is developed to solve the optimization problem based on the proposed scheme. Comparative experiments with some classical schemes on real tumor datasets demonstrate the effectiveness of the proposed method.

Download Full-text

Network Embedding via a Bi-Mode and Deep Neural Network Model

10.20944/preprints201712.0156.v1 ◽

2017 ◽

Author(s):

Yang Fang ◽

Xiang Zhao ◽

Zhen Tan

Keyword(s):

Neural Network ◽

Deep Neural Network ◽

Semantic Information ◽

Dimensional Space ◽

Relation Extraction ◽

Network Embedding ◽

Structure Information ◽

Second Mode ◽

Real World Datasets ◽

Low Dimensional

Network Embedding (NE) is an important method to learn the representations of network via a low-dimensional space. Conventional NE models focus on capturing the structure information and semantic information of vertices while neglecting such information for edges. In this work, we propose a novel NE model named BimoNet to capture both the structure and semantic information of edges. BimoNet is composed of two parts, i.e., the bi-mode embedding part and the deep neural network part. For bi-mode embedding part, the first mode named add-mode is used to express the entity-shared features of edges and the second mode named subtract-mode is employed to represent the entity-specific features of edges. These features actually reflect the semantic information. For deep neural network part, we firstly regard the edges in a network as nodes, and the vertices as links, which will not change the overall structure of the whole network. Then we take the nodes' adjacent matrix as the input of the deep neural network as it can obtain similar representations for nodes with similar structure. Afterwards, by jointly optimizing the objective function of these two parts, BimoNet could preserve both the semantic and structure information of edges. In experiments, we evaluate BimoNet on three real-world datasets and task of relation extraction, and BimoNet is demonstrated to outperform state-of-the-art baseline models consistently and significantly.

Download Full-text

Relation Structure-Aware Heterogeneous Information Network Embedding

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33014456 ◽

2019 ◽

Vol 33 ◽

pp. 4456-4463 ◽

Cited By ~ 8

Author(s):

Yuanfu Lu ◽

Chuan Shi ◽

Linmei Hu ◽

Zhiyuan Liu

Keyword(s):

Real World ◽

Dimensional Space ◽

Structural Characteristics ◽

Information Network ◽

Network Embedding ◽

Heterogeneous Information Network ◽

Heterogeneous Information ◽

Real World Datasets ◽

Low Dimensional ◽

Embedding Methods

Heterogeneous information network (HIN) embedding aims to embed multiple types of nodes into a low-dimensional space. Although most existing HIN embedding methods consider heterogeneous relations in HINs, they usually employ one single model for all relations without distinction, which inevitably restricts the capability of network embedding. In this paper, we take the structural characteristics of heterogeneous relations into consideration and propose a novel Relation structure-aware Heterogeneous Information Network Embedding model (RHINE). By exploring the real-world networks with thorough mathematical analysis, we present two structure-related measures which can consistently distinguish heterogeneous relations into two categories: Affiliation Relations (ARs) and Interaction Relations (IRs). To respect the distinctive characteristics of relations, in our RHINE, we propose different models specifically tailored to handle ARs and IRs, which can better capture the structures and semantics of the networks. At last, we combine and optimize these models in a unified and elegant manner. Extensive experiments on three real-world datasets demonstrate that our model significantly outperforms the state-of-the-art methods in various tasks, including node clustering, link prediction, and node classification.

Download Full-text

Complex Moment-Based Supervised Eigenmap for Dimensionality Reduction

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33013910 ◽

2019 ◽

Vol 33 ◽

pp. 3910-3918 ◽

Cited By ~ 1

Author(s):

Akira Imakura ◽

Momo Matsuda ◽

Xiucai Ye ◽

Tetsuya Sakurai

Keyword(s):

Dimensionality Reduction ◽

Parallel Implementation ◽

Dimensional Space ◽

Recognition Performance ◽

Optimization Methods ◽

Original Data ◽

Dimensional Subspace ◽

Reduction Methods ◽

Low Dimensional ◽

Matrix Trace

Dimensionality reduction methods that project highdimensional data to a low-dimensional space by matrix trace optimization are widely used for clustering and classification. The matrix trace optimization problem leads to an eigenvalue problem for a low-dimensional subspace construction, preserving certain properties of the original data. However, most of the existing methods use only a few eigenvectors to construct the low-dimensional space, which may lead to a loss of useful information for achieving successful classification. Herein, to overcome the deficiency of the information loss, we propose a novel complex moment-based supervised eigenmap including multiple eigenvectors for dimensionality reduction. Furthermore, the proposed method provides a general formulation for matrix trace optimization methods to incorporate with ridge regression, which models the linear dependency between covariate variables and univariate labels. To reduce the computational complexity, we also propose an efficient and parallel implementation of the proposed method. Numerical experiments indicate that the proposed method is competitive compared with the existing dimensionality reduction methods for the recognition performance. Additionally, the proposed method exhibits high parallel efficiency.

Download Full-text

Unsupervised Deep Video Hashing with Balanced Rotation

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2017/429 ◽

2017 ◽

Cited By ~ 10

Author(s):

Gengshen Wu ◽

Li Liu ◽

Yuchen Guo ◽

Guiguang Ding ◽

Jungong Han ◽

...

Keyword(s):

State Of The Art ◽

Dimensional Space ◽

Image Hashing ◽

Neighborhood Structure ◽

Function Learning ◽

Video Hashing ◽

Real World Datasets ◽

Low Dimensional ◽

Balanced Code ◽

Hash Codes

Recently, hashing video contents for fast retrieval has received increasing attention due to the enormous growth of online videos. As the extension of image hashing techniques, traditional video hashing methods mainly focus on seeking the appropriate video features but pay little attention to how the video-specific features can be leveraged to achieve optimal binarization. In this paper, an end-to-end hashing framework, namely Unsupervised Deep Video Hashing (UDVH), is proposed, where feature extraction, balanced code learning and hash function learning are integrated and optimized in a self-taught manner. Particularly, distinguished from previous work, our framework enjoys two novelties: 1) an unsupervised hashing method that integrates the feature clustering and feature binarization, enabling the neighborhood structure to be preserved in the binary space; 2) a smart rotation applied to the video-specific features that are widely spread in the low-dimensional space such that the variance of dimensions can be balanced, thus generating more effective hash codes. Extensive experiments have been performed on two real-world datasets and the results demonstrate its superiority, compared to the state-of-the-art video hashing methods. To bootstrap further developments, the source code will be made publically available.

Download Full-text

Fault Identification Based on Local Feature Correlation

Intelligent Control and Learning Systems - Data-Driven Fault Detection and Reasoning for Industrial Monitoring ◽

10.1007/978-981-16-8044-1_8 ◽

2022 ◽

pp. 119-146

Author(s):

Jing Wang ◽

Jinglin Zhou ◽

Xiaolu Chen

Keyword(s):

Data Processing ◽

High Dimension ◽

Dimensional Space ◽

Original Data ◽

Mapping Method ◽

Kernel Functions ◽

High Dimensional ◽

Monitoring Methods ◽

Multivariate Statistical ◽

Low Dimensional

AbstractIndustrial data variables show obvious high dimension and strong nonlinear correlation. Traditional multivariate statistical monitoring methods, such as PCA, PLS, CCA, and FDA, are only suitable for solving the high-dimensional data processing with linear correlation. The kernel mapping method is the most common technique to deal with the nonlinearity, which projects the original data in the low-dimensional space to the high-dimensional space through appropriate kernel functions so as to achieve the goal of linear separability in the new space. However, the space projection from the low dimension to the high dimension is contradictory to the actual requirement of dimensionality reduction of the data. So kernel-based method inevitably increases the complexity of data processing.

Download Full-text

M3Drop: dropout-based feature selection for scRNASeq

Bioinformatics ◽

10.1093/bioinformatics/bty1044 ◽

2018 ◽

Vol 35 (16) ◽

pp. 2865-2867 ◽

Cited By ~ 61

Author(s):

Tallulah S Andrews ◽

Martin Hemberg

Keyword(s):

Feature Selection ◽

Single Cell ◽

R Package ◽

Supplementary Information ◽

Supplementary Data ◽

Selection Methods ◽

Functional Responses ◽

Technical Noise ◽

New Methods ◽

Selection For

Abstract Motivation Most genomes contain thousands of genes, but for most functional responses, only a subset of those genes are relevant. To facilitate many single-cell RNASeq (scRNASeq) analyses the set of genes is often reduced through feature selection, i.e. by removing genes only subject to technical noise. Results We present M3Drop, an R package that implements popular existing feature selection methods and two novel methods which take advantage of the prevalence of zeros (dropouts) in scRNASeq data to identify features. We show these new methods outperform existing methods on simulated and real datasets. Availability and implementation M3Drop is freely available on github as an R package and is compatible with other popular scRNASeq tools: https://github.com/tallulandrews/M3Drop. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

A Systematic Study of Feature Selection Methods for Learning to Rank Algorithms

International Journal of Information Retrieval Research ◽

10.4018/ijirr.2018070104 ◽

2018 ◽

Vol 8 (3) ◽

pp. 46-67 ◽

Cited By ~ 1

Author(s):

Mehrnoush Barani Shirzad ◽

Mohammad Reza Keyvanpour

Keyword(s):

Feature Selection ◽

State Of The Art ◽

Learning To Rank ◽

Future Research ◽

Selection Methods ◽

Ranking Models ◽

Ranking Systems ◽

Efficiency And Effectiveness ◽

Selection For

This article describes how feature selection for learning to rank algorithms has become an interesting issue. While noisy and irrelevant features influence performance, and result in an overfitting problem in ranking systems, reducing the number of features by illuminating irrelevant and noisy features is a solution. Several studies have applied feature selection for learning to rank, which promote efficiency and effectiveness of ranking models. As the number of features and consequently the number of irrelevant and noisy features is increasing, systematic a review of Feature selection for learning to rank methods is required. In this article, a framework to examine research on feature selection for learning to rank (FSLR) is proposed. Under this framework, the authors review the most state-of-the-art methods and suggest several criteria to analyze them. FSLR offers a structured classification of current algorithms for future research to: a) properly select strategies from existing algorithms using certain criteria or b) to find ways to develop existing methodologies.

Download Full-text

Explaining three-dimensional dimensionality reduction plots

Information Visualization ◽

10.1177/1473871615600010 ◽

2015 ◽

Vol 15 (2) ◽

pp. 154-172 ◽

Cited By ~ 11

Author(s):

Danilo B Coimbra ◽

Rafael M Martins ◽

Tácito TAT Neves ◽

Alexandru C Telea ◽

Fernando V Paulovich

Keyword(s):

Dimensionality Reduction ◽

Dimensional Space ◽

Three Dimensional ◽

Original Data ◽

Reduction Technique ◽

High Dimensional ◽

Dimensionality Reduction Technique ◽

Visualization Techniques ◽

High Dimensional Datasets ◽

Three Dimensional Space

Understanding three-dimensional projections created by dimensionality reduction from high-variate datasets is very challenging. In particular, classical three-dimensional scatterplots used to display such projections do not explicitly show the relations between the projected points, the viewpoint used to visualize the projection, and the original data variables. To explore and explain such relations, we propose a set of interactive visualization techniques. First, we adapt and enhance biplots to show the data variables in the projected three-dimensional space. Next, we use a set of interactive bar chart legends to show variables that are visible from a given viewpoint and also assist users to select an optimal viewpoint to examine a desired set of variables. Finally, we propose an interactive viewpoint legend that provides an overview of the information visible in a given three-dimensional projection from all possible viewpoints. Our techniques are simple to implement and can be applied to any dimensionality reduction technique. We demonstrate our techniques on the exploration of several real-world high-dimensional datasets.

Download Full-text

An Automatic Blocking Keys Selection For Efficient Record Linkage

International Journal of Organizational and Collective Intelligence ◽

10.4018/ijoci.2021010104 ◽

2021 ◽

Vol 11 (1) ◽

pp. 53-70

Author(s):

Hamid Naceur Benkhlaed ◽

Djamal Berrabah ◽

Nassima Dif ◽

Faouzi Boufares

Keyword(s):

Feature Selection ◽

Data Quality ◽

Optimization Algorithm ◽

Real World ◽

Record Linkage ◽

Domain Expert ◽

Bald Eagles ◽

Unsupervised Approach ◽

Selection For ◽

Real World Datasets

One of the important processes in the data quality field is record linkage (RL). RL (also known as entity resolution) is the process of detecting duplicates that refer to the same real-world entity in one or more datasets. The most critical step during the RL process is blocking, which reduces the quadratic complexity of the process by dividing the data into a set of blocks. By that way, matching is done only between the records in the same block. However, selecting the best blocking keys to divide the data is a hard task, and in most cases, it's done by a domain expert. In this paper, a novel unsupervised approach for an automatic blocking key selection is proposed. This approach is based on the recently proposed meta-heuristic bald eagles search (bes) optimization algorithm, where the problem is treated as a feature selection case. The obtained results from experiments on real-world datasets showed the efficiency of the proposition where the BES for feature selection outperformed existed approaches in the literature and returned the best blocking keys.

Download Full-text