Hierarchical Feature Selection with Recursive Regularization

In the big data era, datasets have grown dramatically in the number of samples, features, and classes. In particular, the classes usually have a hierarchical structure; classification over such classes is called hierarchical classification. Various algorithms have been developed to select informative features for flat classification. However, these algorithms ignore the semantic hyponymy among the hierarchical classes and select a single, uniform subset of features for all classes. In this paper, we propose a new technique for hierarchical feature selection based on recursive regularization. The algorithm takes the hierarchical information of the class structure into account. As opposed to flat feature selection, we select a different feature subset for each node of the hierarchical tree, using the parent-children and sibling relationships for hierarchical regularization. By imposing $\ell_{2,1}$-norm regularization on different parts of the class hierarchy, we learn a sparse matrix for the feature ranking of each node. Extensive experiments on public datasets demonstrate the effectiveness of the proposed algorithm.
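
The abstract does not give the objective function, so the following is only a minimal sketch of the generic $\ell_{2,1}$ norm and of how a learned weight matrix could be turned into a feature ranking for one node. The function names and the toy matrix `W` (rows are features, columns are the child classes of a node) are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def l21_norm(W):
    """Sum of the Euclidean norms of the rows of W (the l2,1 norm)."""
    return float(np.sqrt((W ** 2).sum(axis=1)).sum())

def rank_features(W):
    """Order feature indices by decreasing row norm of W.

    A row that is driven to zero by the regularizer corresponds to a
    feature that is unused by every class at this node.
    """
    row_norms = np.sqrt((W ** 2).sum(axis=1))
    return np.argsort(-row_norms, kind="stable")

# Toy weight matrix (hypothetical): 3 features, 2 child classes.
W = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [1.0, 0.0]])

print(l21_norm(W))        # row norms are 5, 0, 1
print(rank_features(W))   # strongest feature first
```

Because the penalty sums row norms rather than individual entries, it tends to zero out whole rows, which is why it selects features jointly for all classes handled at a node.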

scFlow: A Scalable and Reproducible Analysis Pipeline for Single-Cell RNA Sequencing Data

DOI: 10.1101/2021.08.16.456499, 2021

Author(s): Combiz Khozoie, Nurun Fancy, Mahdi Moradi Marjaneh, Alan E. Murphy, Paul M. Matthews, ...

Keyword(s): Single Cell, RNA Sequencing, Sparse Matrix, Data Generation, Sequencing Data, Alternative Analysis, Analysis Pipeline, Matrix Quality, Single Cell RNA Sequencing, Public Datasets

Advances in single-cell RNA-sequencing technology over the last decade have enabled exponential increases in throughput: datasets with over a million cells are becoming commonplace. The burgeoning scale of data generation, combined with the proliferation of alternative analysis methods, led us to develop the scFlow toolkit and the nf-core/scflow pipeline for reproducible, efficient, and scalable analyses of single-cell and single-nuclei RNA-sequencing data. The scFlow toolkit provides a higher level of abstraction on top of popular single-cell packages within an R ecosystem, while the nf-core/scflow Nextflow pipeline is built within the nf-core framework to enable compute infrastructure-independent deployment across all institutions and research facilities. Here we present our flexible pipeline, which leverages the advantages of containerization and the potential of cloud computing for easy orchestration and scaling of the analysis of large case/control datasets, even by non-expert users. We demonstrate the functionality of the analysis pipeline, from sparse-matrix quality control through to insight discovery, with example analyses of four recently published public datasets, and we describe the extensibility of scFlow as a modular, open-source tool for single-cell and single-nuclei bioinformatic analyses.

Effective and Efficient Browsing of Large Image Databases

Handbook of Research on Digital Libraries, DOI: 10.4018/978-1-59904-879-6.ch014, 2009, pp. 142-148

Author(s): Gerald Schaefer

Keyword(s): Spherical Surface, Image Databases, Tree Structure, Retrieval Method, Hierarchical Tree, Large Databases, The Moment, Hierarchical Tree Structure, Large Image Databases, Interesting Alternative

As image databases grow, efficient and effective methods for managing such large collections are highly sought after. Content-based approaches have shown large potential in this area, as they do not require textual annotation of images. However, while the query-by-example concept is at the moment the most commonly adopted retrieval method for image databases, it is of only limited practical use. Techniques that allow human-centred navigation and visualization of complete image collections therefore provide an interesting alternative. In this chapter we present an effective and efficient approach for user-centred navigation of large image databases. Image thumbnails are projected onto a spherical surface so that visually similar images are located close to each other in the visualization space. To avoid overlapping and occlusion effects, images are placed on a regular grid structure, while large databases are handled through a clustering technique paired with a hierarchical tree structure, which allows for an intuitive real-time browsing experience.

Joint sparse matrix regression and nonnegative spectral analysis for two-dimensional unsupervised feature selection

Pattern Recognition, DOI: 10.1016/j.patcog.2019.01.014, 2019, Vol 89, pp. 119-133, Cited By ~ 7

Author(s): Haoliang Yuan, Junyu Li, Loi Lei Lai, Yuan Yan Tang

Keyword(s): Feature Selection, Spectral Analysis, Sparse Matrix, Two Dimensional, Unsupervised Feature Selection

Modeling Web Crawler Wrappers to Collect User Reviews on Shopping Mall with Various Hierarchical Tree Structure

2009 International Conference on Web Information Systems and Mining, DOI: 10.1109/wism.2009.22, 2009, Cited By ~ 4

Author(s): Hanhoon Kang, Seong Joon Yoo, Dongil Han

Keyword(s): Tree Structure, Shopping Mall, Web Crawler, Hierarchical Tree, User Reviews, Hierarchical Tree Structure

Profiles identification on hierarchical tree structure data sets

Journal of Applied Statistics, DOI: 10.1080/02664763.2018.1442423, 2018, Vol 45 (15), pp. 2848-2863

Author(s): Conceição Rocha, Pedro Quelhas Brito

Keyword(s): Structure Data, Tree Structure, Data Sets, Hierarchical Tree, Hierarchical Tree Structure

SICE: an improved missing data imputation technique

Journal of Big Data, DOI: 10.1186/s40537-020-00313-w, 2020, Vol 7 (1)

Author(s): Shahidul Islam Khan, Abu Sayed Md Latiful Hoque

Keyword(s): Missing Data, Binary Data, Missing Values, Hybrid Approach, Data Imputation, Missing Data Imputation, Wrong Prediction, A New Technique, Numeric Data, Public Datasets

In data analytics, missing data is a factor that degrades performance. Incorrect imputation of missing values can lead to wrong predictions. In this era of big data, when a massive volume of data is generated every second and the utilization of these data is a major concern to stakeholders, handling missing values efficiently becomes more important. In this paper, we propose a new technique for missing data imputation, which is a hybrid of single and multiple imputation techniques. We propose an extension of the popular Multivariate Imputation by Chained Equations (MICE) algorithm in two variations, to impute categorical and numeric data. We have also implemented twelve existing algorithms to impute binary, ordinal, and numeric missing values. We collected sixty-five thousand real health records from different hospitals and diagnostic centers in Bangladesh, maintaining the privacy of the data. We also collected three public datasets from the UCI Machine Learning Repository, ETH Zurich, and Kaggle. We compared the performance of our proposed algorithms with existing algorithms using these datasets. Experimental results show that our proposed algorithm achieves a 20% higher F-measure for binary data imputation and 11% less error for numeric data imputation than its competitors, with similar execution time.
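
To illustrate the chained-equations idea that MICE builds on, here is a deliberately simplified sketch: each column with missing entries is regressed on the other columns in turn, and its missing values are replaced by the predictions. This is a deterministic, single-imputation toy for numeric data; it is neither the SICE algorithm nor full MICE, which draws imputations stochastically and produces multiple completed datasets. The function name and the `n_iter` parameter are assumptions for illustration.

```python
import numpy as np

def chained_imputation(X, n_iter=10):
    """Fill NaNs by cycling linear regressions over the columns."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()

    # Initial guess: replace each missing entry with its column mean.
    col_means = np.nanmean(X, axis=0)
    filled[missing] = np.take(col_means, np.where(missing)[1])

    for _ in range(n_iter):
        for j in range(X.shape[1]):
            rows = missing[:, j]
            if not rows.any():
                continue
            # Regress column j on all other columns (plus an intercept),
            # fitting only on rows where column j was actually observed.
            others = np.delete(filled, j, axis=1)
            A = np.column_stack([np.ones(len(others)), others])
            coef, *_ = np.linalg.lstsq(A[~rows], filled[~rows, j], rcond=None)
            filled[rows, j] = A[rows] @ coef
    return filled

# Toy data (hypothetical): the second column is twice the first.
X = [[1.0, 2.0],
     [2.0, 4.0],
     [3.0, 6.0],
     [4.0, np.nan]]
result = chained_imputation(X)
print(result[3, 1])  # close to 8, recovered from the linear relation
```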

Word Similarity Algorithm Based on WordNet And HowNet

Applied Mechanics and Materials, DOI: 10.4028/www.scientific.net/amm.155-156.375, 2012, Vol 155-156, pp. 375-380, Cited By ~ 1

Author(s): Wu Ling Ren, Jin Ju Guo

Keyword(s): Tree Structure, Hierarchical Tree, Word Similarity, Similarity Calculation, Similarity Algorithm, Precision And Accuracy, Hierarchical Tree Structure

To make word similarity results more reasonable and accurate, a new word similarity algorithm is proposed. It uses the HowNet primitive hierarchical tree structure and calculates the distance between two primitives with the method used to compute WordNet node distance, which considers tree depth, density, path, connecting intensity, etc. Moreover, the algorithm improves the method that converts distance into similarity. Finally, the algorithm is compared with related algorithms experimentally. The results show that the proposed algorithm effectively improves the precision and accuracy of word similarity calculation.
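
The abstract does not state its formulas, so the sketch below shows only the general shape of such a computation: a path distance between two nodes of a hierarchical tree, followed by a distance-to-similarity conversion. It uses plain edge counts (ignoring depth, density, and connecting intensity) and the simple conversion `alpha / (d + alpha)`; the toy tree, the function names, and the value of `alpha` are all assumptions for illustration, not the paper's method.

```python
def path_to_root(node, parent):
    """Return the list of nodes from `node` up to the root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def tree_distance(a, b, parent):
    """Number of edges on the path between a and b through their
    lowest common ancestor."""
    steps_from_a = {n: i for i, n in enumerate(path_to_root(a, parent))}
    for j, n in enumerate(path_to_root(b, parent)):
        if n in steps_from_a:
            return steps_from_a[n] + j
    raise ValueError("nodes are not in the same tree")

def distance_to_similarity(d, alpha=1.6):
    """Map a distance in [0, inf) to a similarity in (0, 1].

    alpha is a tunable smoothing parameter (assumed value).
    """
    return alpha / (d + alpha)

# Toy hierarchy (hypothetical): child -> parent.
parent = {"dog": "mammal", "cat": "mammal",
          "mammal": "animal", "bird": "animal"}

for pair in [("dog", "dog"), ("dog", "cat"), ("dog", "bird")]:
    d = tree_distance(*pair, parent)
    print(pair, d, round(distance_to_similarity(d), 3))
```

Identical nodes get similarity 1, and similarity falls monotonically as the path between two nodes grows longer.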