HCsnip: An R Package for Semi-Supervised Snipping of the Hierarchical Clustering Tree

2015 ◽  
Vol 14 ◽  
pp. CIN.S22080 ◽  
Author(s):  
Askar Obulkasim ◽  
Mark A. Van De Wiel

Hierarchical clustering (HC) is one of the most frequently used methods in computational biology for the analysis of high-dimensional genomics data. Given a data set, HC outputs a binary tree whose leaves are the data points and whose internal nodes represent clusters of various sizes. Normally, a fixed-height cut on the HC tree is chosen, and each contiguous branch of data points below that height is considered a separate cluster. However, the fixed-height branch cut may not be ideal in situations where one expects a complicated tree structure with nested clusters. Furthermore, because related background information is not used in selecting the cutoff, the induced clusters are often difficult to interpret. This paper describes a novel procedure that aims to automatically extract meaningful clusters from the HC tree in a semi-supervised way. The procedure is implemented in the R package HCsnip, available from Bioconductor. Rather than cutting the HC tree at a fixed height, HCsnip probes the various ways of snipping, possibly at variable heights, to tease out hidden clusters ensconced deep down in the tree. The cluster extraction process utilizes, along with the data set from which the HC tree is derived, commonly available background information. Consequently, the extracted clusters are highly reproducible and robust against the various sources of variation that “haunt” high-dimensional genomics data. Since the clustering process is guided by the background information, the clusters are easy to interpret. Unlike existing packages, no constraint is placed on the data type on which clustering is desired. In particular, the package accepts patient follow-up data for guiding the cluster extraction process. To our knowledge, HCsnip is the first package able to decompose the HC tree into clusters by piecewise snipping under the guidance of patient time-to-event information.
Our implementation of the semi-supervised HC tree snipping framework is generic and can be combined with other algorithms that operate on detected clusters.
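The fixed-height branch cut that HCsnip improves upon can be sketched with standard tools. This is a minimal illustration of the baseline, not HCsnip's snipping procedure; the toy data, linkage method, and cut height are all assumptions for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two well-separated groups of points in 2-D.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.3, (10, 2)),
                  rng.normal(5, 0.3, (10, 2))])

# Build the HC tree (average linkage) ...
tree = linkage(data, method="average")

# ... and apply the classical fixed-height branch cut: every
# contiguous branch below the threshold becomes one cluster.
labels = fcluster(tree, t=2.0, criterion="distance")
print(len(set(labels)))  # → 2
```

HCsnip instead evaluates cuts at variable heights, guided by background information such as survival data, so nested clusters like these need not sit below one common threshold.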

2021 ◽  
Vol 50 (1) ◽  
pp. 138-152
Author(s):  
Mujeeb Ur Rehman ◽  
Dost Muhammad Khan

Anomaly detection has recently attracted considerable attention from data mining researchers, as its importance has grown steadily across practical domains such as product marketing, fraud detection, medical diagnosis, and fault detection. High-dimensional data pose exceptional challenges for outlier detection because of the curse of dimensionality and the resulting resemblance of distant and adjoining points. Traditional algorithms and techniques perform outlier detection on the full feature space. Such customary methodologies concentrate largely on low-dimensional data and are therefore ineffective at discovering anomalies in data sets comprising a high number of dimensions. Digging out the anomalies present in a high-dimensional data set becomes a difficult and tiresome job when all subsets of projections need to be explored. All data points in high-dimensional data behave like similar observations because of an intrinsic property of such data: the distance between observations approaches zero as the number of dimensions tends towards infinity. This research work proposes a novel technique that explores the deviation among all data points and embeds its findings inside well-established density-based techniques. It is a state-of-the-art technique, as it opens a new breadth of research towards resolving the inherent problems of high-dimensional data, where outliers reside within clusters having different densities. A high-dimensional data set from the UCI Machine Learning Repository is chosen to test the proposed technique, and its results are compared with those of density-based techniques to evaluate its efficiency.
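The distance-concentration effect the abstract appeals to is easy to demonstrate numerically. The sketch below (generic, not the paper's method) measures the relative contrast between the farthest and nearest neighbor of a query point; it collapses as dimensionality grows, which is what makes full-space density-based outlier detection unreliable.

```python
import numpy as np

rng = np.random.default_rng(42)

def relative_contrast(dim, n=200):
    """(d_max - d_min) / d_min over distances from one query point
    to n uniform random points in the unit hypercube [0, 1]^dim."""
    points = rng.random((n, dim))
    query = rng.random(dim)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()

low, high = relative_contrast(2), relative_contrast(1000)
print(low > high)  # contrast collapses in high dimensions → True
```

In 2 dimensions the nearest point is typically much closer than the farthest; in 1000 dimensions all pairwise distances concentrate around a common value, so "nearest" and "farthest" lose discriminative power.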


Author(s):  
Monica Chis

Clustering is an important technique for discovering inherent structure present in data. The purpose of cluster analysis is to partition a given data set into a number of groups such that data in a particular cluster are more similar to each other than to objects in different clusters. Hierarchical clustering refers to the formation of a recursive clustering of the data points: a partition into many clusters, each of which is itself hierarchically clustered. Hierarchical structures solve many problems in a large area of interest. In this paper a new evolutionary algorithm for detecting the hierarchical structure of an input data set is proposed. The approach could be very useful in economics, market segmentation, management, biological taxonomy, and other domains. A new linear representation of the cluster structure within the data set is proposed. An evolutionary algorithm evolves a population of clustering hierarchies, using mutation and crossover as (search) variation operators. The final goal is to present a data clustering representation that finds a hierarchical clustering structure quickly.
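One common way to encode a hierarchy linearly, so that mutation and crossover can act on a flat genome, is a parent array. The sketch below is a hypothetical illustration of that general idea, not the paper's actual representation or operators; the encoding, the acyclicity convention (a node's parent always has a smaller index), and the mutation rule are all assumptions.

```python
import random

random.seed(1)

# Hypothetical linear encoding: entry i is the parent of node i;
# node 0 is the root (its own parent). Internal nodes are clusters,
# the remaining nodes are data points.
hierarchy = [0, 0, 0, 1, 1, 2, 2]  # root 0 with children 1 and 2

def mutate(h):
    """Reattach one non-root node to a random earlier node.
    Restricting the new parent to a smaller index keeps the
    genome a valid tree (no cycles) without extra repair steps."""
    child = random.randrange(1, len(h))
    out = list(h)
    out[child] = random.randrange(0, child)
    return out

print(mutate(hierarchy))
```

Crossover on such a genome can likewise splice parent arrays from two individuals, with the same index constraint guaranteeing validity.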


2013 ◽  
Vol 12 (3-4) ◽  
pp. 291-307 ◽  
Author(s):  
Ilir Jusufi ◽  
Andreas Kerren ◽  
Falk Schreiber

Ontologies and hierarchical clustering are both important tools in biology and medicine to study high-throughput data such as transcriptomics and metabolomics data. Enrichment of ontology terms in the data is used to identify statistically overrepresented ontology terms, giving insight into relevant biological processes or functional modules. Hierarchical clustering is a standard method to analyze and visualize data to find relatively homogeneous clusters of experimental data points. Both methods support the analysis of the same data set but are usually considered independently. However, often a combined view is desired: visualizing a large data set in the context of an ontology under consideration of a clustering of the data. This article proposes new visualization methods for this task. They allow for interactive selection and navigation to explore the data under consideration as well as visual analysis of mappings between ontology- and cluster-based space-filling representations. In this context, we discuss our approach together with specific properties of the biological input data and identify features that make our approach easily usable for domain experts.


2021 ◽  
Author(s):  
Jin Kim

This article presents Exploratory Only: an intuitive tool for conducting large-scale exploratory analyses easily and quickly. Available in three forms (as a web application, standalone program, and R Package) and launched as a point-and-click interface, Exploratory Only allows researchers to conduct all possible correlation, moderation, and mediation analyses among selected variables in their data set with minimal effort and time. Compared to a popular alternative, SPSS, Exploratory Only is shown to be orders of magnitude easier and faster at conducting exploratory analyses. The article demonstrates how to use Exploratory Only and discusses the caveat to using it. As long as researchers use Exploratory Only as intended—to discover novel hypotheses to investigate in follow-up studies, rather than to confirm nonexistent a priori hypotheses (i.e., p-hacking)—Exploratory Only can promote progress in behavioral science by encouraging more exploratory analyses and therefore more discoveries.
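The exhaustive pairwise-correlation pass that a tool like this automates can be sketched in a few lines. This is a generic illustration, not Exploratory Only's implementation; the variable names and the engineered correlation are invented for the example.

```python
from itertools import combinations
import numpy as np

rng = np.random.default_rng(7)

# Toy data set: four variables, one engineered correlation.
n = 100
data = {
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x3": rng.normal(size=n),
}
data["x4"] = data["x1"] * 0.9 + rng.normal(scale=0.2, size=n)

# All possible pairwise correlations among the selected variables.
results = {
    (a, b): float(np.corrcoef(data[a], data[b])[0, 1])
    for a, b in combinations(data, 2)
}
strongest = max(results, key=lambda k: abs(results[k]))
print(strongest)  # → ('x1', 'x4')
```

As the abstract cautions, such a sweep surfaces candidate hypotheses for follow-up studies; treating the strongest hit as a confirmed a priori finding would be p-hacking.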


Author(s):  
Xiaoou Wang ◽  
Yingying Liu ◽  
Erik K. Antonsson

Abstract One approach to rapidly exploring large design spaces is to evaluate the performance of a small number of representative points in the space, then based on those points, construct an approximation to the response over a region of interest. Linear, piecewise linear, quadratic and multivariate adaptive regression splines (MARS) models are fit to an example 5-dimensional data set representative of information available in preliminary engineering design. When the number of data points representing a high-dimensional response is small, all of the approximation models appear to perform nearly equally.
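The linear and quadratic response-surface fits compared in the abstract amount to ordinary least squares on different design matrices. The sketch below uses an invented 2-D example rather than the paper's 5-dimensional data set, and omits the piecewise-linear and MARS models.

```python
import numpy as np

rng = np.random.default_rng(3)

# Small sample of a smooth 2-D "design space" response.
X = rng.random((15, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + 0.5 * X[:, 0] * X[:, 1]

def design(X, quadratic=False):
    """Design matrix: intercept + linear terms, optionally all
    second-order terms (squares and the interaction)."""
    cols = [np.ones(len(X)), X[:, 0], X[:, 1]]
    if quadratic:
        cols += [X[:, 0] ** 2, X[:, 1] ** 2, X[:, 0] * X[:, 1]]
    return np.column_stack(cols)

for quad in (False, True):
    A = design(X, quad)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    rmse = np.sqrt(np.mean((A @ coef - y) ** 2))
    print("quadratic" if quad else "linear", rmse)
```

With only a handful of sample points, both surrogates are cheap to fit, which is what makes the "evaluate few points, approximate the rest" strategy attractive for rapid design-space exploration.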


Anomaly detection is an important task in data mining, helping to increase scalability, accuracy, and efficiency. During the extraction process, an outside source may damage the original data set; this is defined as an intrusion. Avoiding such intrusions while maintaining anomaly detection in a highly densely populated environment is another difficult task. For that purpose, Grid Partitioning for Anomaly Detection (GPAD) has been proposed for high-density environments. This technique detects outliers using a grid partitioning approach together with a density-based outlier detection scheme. Initially, the data set is split into a grid format, allocating an equal amount of data points to each grid cell. The density of each cell is then compared with that of its neighboring cells in a zigzag manner. Based on the comparison, a cell with lower density is detected as containing outliers and is eliminated. The proposed GPAD reduces complexity and increases accuracy, as demonstrated in the simulation results.
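The core grid-partitioning idea can be sketched as follows: bin the points into a grid, count the points per cell, and flag points in cells whose density falls far below that of the occupied cells. This is a simplified illustration under assumed data and threshold choices, not GPAD's actual zigzag neighbor comparison.

```python
import numpy as np

rng = np.random.default_rng(0)

# One dense cluster plus two isolated points in the unit square.
dense = rng.normal(0.25, 0.05, (200, 2))
outliers = np.array([[0.9, 0.9], [0.8, 0.1]])
points = np.vstack([dense, outliers])

# Split the space into a 4x4 grid and count points per cell.
bins = 4
cells = np.clip((points * bins).astype(int), 0, bins - 1)
counts = np.zeros((bins, bins), dtype=int)
for i, j in cells:
    counts[i, j] += 1

# Cells far below the mean occupied density hold candidate outliers
# (the 10% cutoff is an assumption for this toy example).
occupied = counts[counts > 0]
threshold = occupied.mean() * 0.1
sparse = counts[cells[:, 0], cells[:, 1]] < threshold
print(int(sparse.sum()))  # points flagged as outliers
```

The grid pass is cheap (one binning plus per-cell counts), which is where the claimed complexity reduction over pointwise density estimation comes from.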


Author(s):  
Tuomas Kärnä ◽  
Amaury Lendasse

High-dimensional data are becoming more and more common in data analysis. This is especially true in fields that deal with spectrometric data, such as chemometrics. Owing to the development of more accurate spectrometers, one can obtain spectra of thousands of data points. Such high-dimensional data are problematic in machine learning due to increased computational time and the curse of dimensionality (Haykin, 1999; Verleysen & François, 2005; Bengio, Delalleau, & Le Roux, 2006). It is therefore advisable to reduce the dimensionality of the data. In the case of chemometrics, the spectra are usually rather smooth and low on noise, so function fitting is a convenient tool for dimensionality reduction. The fitting is obtained by fixing a set of basis functions and computing the fitting weights according to the least squares error criterion. This article describes an unsupervised method for finding a good function basis that is specifically built to suit the data set at hand. The basis consists of a set of Gaussian functions that are optimized for an accurate fit. The obtained weights are further scaled using the Delta Test (DT) to improve prediction performance. A Least Squares Support Vector Machine (LS-SVM) model is used for estimation.
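The fitting step itself, projecting a long spectrum onto a small Gaussian basis by least squares, can be sketched directly. Here the centers and width are fixed by hand, whereas the article's method optimizes them for the data set; the synthetic spectrum and basis size are assumptions for the example.

```python
import numpy as np

# Wavelength grid and a smooth synthetic "spectrum" of 500 points.
x = np.linspace(0, 1, 500)
spectrum = (np.exp(-((x - 0.3) ** 2) / 0.02)
            + 0.5 * np.exp(-((x - 0.7) ** 2) / 0.01))

# Fixed Gaussian basis: 10 centers, common width.
centers = np.linspace(0, 1, 10)
width = 0.05
basis = np.exp(-((x[:, None] - centers[None, :]) ** 2)
               / (2 * width ** 2))

# Fitting weights via the least squares error criterion: the
# 500-dimensional spectrum is reduced to 10 weights.
weights, *_ = np.linalg.lstsq(basis, spectrum, rcond=None)
recon = basis @ weights
print(float(np.max(np.abs(recon - spectrum))))  # reconstruction error
```

Downstream models (the article uses LS-SVM) then train on the 10 weights rather than the 500 raw intensities, sidestepping the curse of dimensionality.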


2013 ◽  
Vol 2013 ◽  
pp. 1-12 ◽  
Author(s):  
Singh Vijendra ◽  
Sahoo Laxman

Clustering high-dimensional data has been a major challenge due to the inherent sparsity of the points. Most existing clustering algorithms become substantially inefficient if the required similarity measure is computed between data points in the full-dimensional space. In this paper, we present a robust multi-objective subspace clustering (MOSCL) algorithm for the challenging problem of high-dimensional clustering. The first phase of MOSCL performs subspace relevance analysis by detecting dense and sparse regions and their locations in the data set. After detecting dense regions, it eliminates outliers. MOSCL then discovers subspaces in the dense regions of the data set and produces subspace clusters. In thorough experiments on synthetic and real-world data sets, we demonstrate that MOSCL is superior to the PROCLUS clustering algorithm for subspace clustering. Additionally, we investigate the effects of the first phase's dense-region detection on the results of subspace clustering. Our results indicate that removing outliers improves the accuracy of subspace clustering. The clustering results are validated by the clustering error (CE) distance on various data sets. MOSCL can discover the clusters in all subspaces with high quality, and it outperforms PROCLUS in efficiency.
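A simple per-dimension version of dense-region detection, flagging dimensions whose histograms show cells well above the uniform expectation, can be sketched as follows. This is a generic illustration of subspace relevance analysis, not MOSCL's actual procedure; the data, bin count, and threshold are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

# 3 dimensions: dim 0 carries two dense clusters, dims 1-2 are noise.
n = 300
data = np.column_stack([
    np.concatenate([rng.normal(0.2, 0.02, n // 2),
                    rng.normal(0.8, 0.02, n // 2)]),
    rng.random(n),
    rng.random(n),
])

def dense_regions(col, bins=20):
    """Count histogram bins whose occupancy clearly exceeds the
    uniform expectation (factor 2 is an assumed threshold)."""
    counts, _ = np.histogram(col, bins=bins, range=(0, 1))
    return int((counts > 2 * len(col) / bins).sum())

relevant = [d for d in range(data.shape[1]) if dense_regions(data[:, d]) > 0]
print(relevant)  # → [0]
```

Restricting the subsequent clustering to the relevant dimensions is what lets subspace methods avoid computing similarities in the full-dimensional space.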


SLEEP ◽  
2020 ◽  
Vol 43 (Supplement_1) ◽  
pp. A269-A269
Author(s):  
L Kazaglis ◽  
J Hsia ◽  
K Green ◽  
C Iber

Abstract Introduction Upper Airway Stimulation (UAS) and Continuous Positive Airway Pressure (CPAP) are trackable therapies for obstructive sleep apnea. We used recent big-data cohorts to compare changes in sleepiness versus usage. Methods ADHERE is an international registry of real-world UAS outcomes from 2016 to date. General UAS criteria are CPAP intolerance, AHI 15-65 (<25% central+mixed), and suggested BMI≤35. Baseline ESS is collected from the medical record, and follow-up ESS and usage are collected 2-4 months after therapy activation. M Health Fairview maintains a database of cross-linked CPAP and EHR data. All new adult sleep patients from 2015 onward were included, paralleling ADHERE: BMI≤35, AHI 15-65, and daily CPAP-EHR data starting at least 60 days prior to the 2nd ESS measurement. Baseline ESS was collected at consult, and follow-up ESS was collected approximately 6 months later. Device-reported usage hours were compared with the changes in ESS from baseline. Results UAS (n=690) and CPAP (n=514) groups were similar: age 59.7±10.8 versus 59.7±13.6, 78% versus 75% male, and AHI 35.3±14.4 versus 33.8±14.0. The UAS group was slightly less obese, BMI 29.3±3.9 versus 30.0±3.4 (p=0.001), with higher baseline ESS, 11.4±5.6 versus 8.6±5.3 (p<0.001). UAS usage was higher at 6.4±2.0 hours/night versus 5.2±2.0 hours/night with CPAP (p<0.001). UAS group average ESS decreased 2.5 points for patients with 0-4 hours of use (n=81) and 3.8 points for patients with 4 or more hours of use (n=609). CPAP group average ESS decreased 2.5 points for patients with 0-4 hours of use (n=125) and 3.3 points for patients with 4 or more hours of use (n=389). Conclusion Compared to prior work and the UAS cohort, this CPAP cohort was more likely to have normal ESS at baseline. UAS and CPAP both demonstrate a dose-response curve associating increasing hourly usage with larger ESS reductions.
Support Kent Lee of Inspire Medical Systems provided background information and access to a de-identified ADHERE data set for analysis.


Author(s):  
Quanxue Li ◽  
Wentao Dai ◽  
Jixiang Liu ◽  
Qingqing Sang ◽  
Yi-Xue Li ◽  
...  

Abstract Summary Dysfunctional regulation of gene expression programs relevant to fundamental cell processes can drive carcinogenesis. Therefore, systematically identifying dysregulation events is an effective path toward understanding carcinogenesis and provides insightful clues for building predictive signatures with mechanistic interpretability for cancer precision medicine. Here, we implemented a machine learning-based gene dysregulation analysis framework in an R package, DysRegSig, which is capable of exploring gene dysregulations from high-dimensional data and building mechanistic signatures based on gene dysregulations. DysRegSig can serve as an easy-to-use tool to facilitate gene dysregulation analysis and follow-up analyses. Availability and implementation The source code and user’s guide of DysRegSig are freely available at GitHub: https://github.com/SCBIT-YYLab/DysRegSig. Supplementary information Supplementary data are available at Bioinformatics online.

