Exploring Similarities Across High-Dimensional Datasets

Very often, related data may be collected by a number of sources, which may be unable to share their entire datasets for reasons like confidentiality agreements, dataset size, and so forth. However, these sources may be willing to share a condensed model of their datasets. If some substructures of the condensed models of such datasets, from different sources, are found to be unusually similar, policies successfully applied to one may be successfully applied to the others. In this chapter, we propose a framework for constructing condensed models of datasets and algorithms to find similar substructure in pairs of such models. The algorithms are based on the tensor product. We test our framework on pairs of synthetic datasets and compare our algorithms with an existing one. Finally, we apply it to basketball player statistics for two National Basketball Association (NBA) seasons, and to breast cancer datasets. The results are statistically more interesting than results obtained from independent analysis of the datasets.


An Efficient Density Biased Sampling Algorithm for Clustering Large High-Dimensional Datasets

International Journal of Pattern Recognition and Artificial Intelligence, DOI 10.1142/s0218001415500263, 2015, Vol 29 (08), pp. 1550026, Cited By ~ 3

Author(s): Xue-Zhong Qian, Jie Deng, Heng Qian, Qin Wu

Keyword(s): Large Scale, Feature Selection Method, Simple Random Sampling, High Dimensional, Biased Sampling, Sampling Algorithm, Efficiency And Effectiveness, Synthetic Datasets, High Dimensional Datasets, Grid Division

As one of the most popular data reduction techniques for large-scale data mining, simple random sampling (SRS) often loses small clusters when applied to unevenly distributed datasets. A grid-based density biased sampling algorithm can avoid this problem. However, the granularity of the grid division affects both the efficiency and the effectiveness of the algorithm. To overcome this drawback, variable-grid density biased sampling has been proposed for large-scale, unevenly distributed datasets, but its efficiency is restricted by dimensionality. To address this, an efficient density biased sampling algorithm is proposed for large high-dimensional datasets. First, an efficient feature selection method is designed to obtain the feature subsets. Second, the variable grid division is carried out on the selected feature subsets. Finally, the sample is drawn from the grid space. Experiments on synthetic datasets and UCI datasets show that the proposed algorithm achieves higher quality than SRS. The proposed algorithm also consumes less sampling time than both grid-based density biased sampling and density biased sampling based on variable grid division.
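
The idea of grid-based density biased sampling can be sketched in a few lines. The version below is a minimal illustration under stated assumptions, not the algorithm from the paper: it uses a fixed, equal-width grid (no variable grid division and no feature selection), and it gives each occupied cell a quota proportional to n ** (1 - e), which is one common way to formulate the bias. The function name `density_biased_sample` and the parameters `cell_width`, `sample_size`, and `e` are hypothetical names chosen for the example.

```python
import random
from collections import defaultdict


def density_biased_sample(points, cell_width, sample_size, e=0.5, seed=0):
    """Grid-based density biased sample (illustrative sketch only)."""
    rng = random.Random(seed)
    # Assign each point to a grid cell by integer-dividing its coordinates.
    cells = defaultdict(list)
    for p in points:
        key = tuple(int(x // cell_width) for x in p)
        cells[key].append(p)
    # A cell holding n points gets a quota proportional to n ** (1 - e):
    # e = 0 mimics uniform sampling, e = 1 gives every cell the same quota.
    weights = {k: len(v) ** (1.0 - e) for k, v in cells.items()}
    total = sum(weights.values())
    sample = []
    for key, members in cells.items():
        quota = round(sample_size * weights[key] / total)
        # Clamp so every occupied cell contributes at least one point;
        # the final sample size is therefore only approximately sample_size.
        quota = max(1, min(len(members), quota))
        sample.extend(rng.sample(members, quota))
    return sample


# Toy data: one dense cluster of 1000 points and one small cluster of 10.
_rng = random.Random(42)
dense = [(_rng.uniform(0, 1), _rng.uniform(0, 1)) for _ in range(1000)]
small = [(_rng.uniform(5, 6), _rng.uniform(5, 6)) for _ in range(10)]
sample = density_biased_sample(dense + small, cell_width=1.0,
                               sample_size=50, e=1.0)
small_kept = [p for p in sample if p[0] >= 5]
```

With `e=1.0` on this toy data, both occupied cells receive the same quota, so the small cluster is not lost, whereas a simple random sample of similar size would be expected to contain almost none of its points.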


Iterative Variable Selection for High-Dimensional Data: Prediction of Pathological Response in Triple-Negative Breast Cancer

Mathematics, DOI 10.3390/math9030222, 2021, Vol 9 (3), pp. 222

Author(s): Juan C. Laria, M. Carmen Aguilera-Morillo, Enrique Álvarez, Rosa E. Lillo, Sara López-Taruella, ...

Keyword(s): Breast Cancer, Variable Selection, Triple Negative Breast Cancer, Triple Negative, A Priori, Simulated Data, Point Of View, High Dimensional, Whole Genome, Genome Context

Over the last decade, regularized regression methods have offered alternatives for performing multi-marker analysis and feature selection in a whole genome context. The process of defining a list of genes that will characterize an expression profile remains unclear. It currently relies upon advanced statistics and can use an agnostic point of view or include some a priori knowledge, but overfitting remains a problem. This paper introduces a methodology to deal with the variable selection and model estimation problems in the high-dimensional set-up, which can be particularly useful in the whole genome context. Results are validated using simulated data and a real dataset from a triple-negative breast cancer study.
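
As a rough illustration of how regularized regression performs variable selection, the sketch below applies the soft-thresholding rule, which gives the lasso solution in the special case of an orthonormal design. It is not the iterative methodology introduced in the paper; the function name `soft_threshold`, the penalty `lam`, and the coefficient values are assumptions made for the example.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam and set it to zero if |z| <= lam.

    For the objective 0.5 * ||y - X b||^2 + lam * ||b||_1 with an
    orthonormal design (X^T X = I), applying this rule to each
    least-squares coefficient gives the lasso estimate.
    """
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Hypothetical least-squares coefficients for four candidate variables.
ols_coefficients = [3.0, -0.5, 0.2, -2.0]
lasso_coefficients = [soft_threshold(z, lam=1.0) for z in ols_coefficients]

# Variables whose coefficient survives the penalty are the selected ones.
selected = [i for i, b in enumerate(lasso_coefficients) if b != 0.0]
```

Coefficients smaller in magnitude than the penalty are set exactly to zero, which is why an L1 penalty yields a short list of selected variables rather than merely shrinking all of them.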


A new method for mining disjunctive emerging patterns in high-dimensional datasets using hypergraphs

Information Systems, DOI 10.1016/j.is.2013.09.001, 2014, Vol 40, pp. 1-10, Cited By ~ 8

Author(s): Renato Vimieiro, Pablo Moscato

Keyword(s): New Method, High Dimensional, Emerging Patterns, High Dimensional Datasets


Unstructured borderline self-organizing map: Learning highly imbalanced, high-dimensional datasets for fault detection

Expert Systems with Applications, DOI 10.1016/j.eswa.2021.116028, 2022, Vol 188, pp. 116028

Author(s): Jaeyeon Jang, Chang Ouk Kim

Keyword(s): Fault Detection, High Dimensional, Self Organizing Map, Map Learning, High Dimensional Datasets, Self Organizing


Fast Algorithms for LS and LAD-Collaborative Regression

Asia Pacific Journal of Operational Research, DOI 10.1142/s0217595922500014, 2021

Author(s): Jun Sun, Lingchen Kong, Mei Li

Keyword(s): Numerical Experiments, Modern Science, Alternating Direction Method, Least Square, High Dimensional, Statistical Interpretation, Absolute Deviation, Linear Rate, Alternating Direction, High Dimensional Datasets

With the development of modern science and technology, it is easy to obtain a large number of high-dimensional datasets that are related but different. Classical unimodel analysis is less likely to capture potential links between the different datasets. Recently, a collaborative regression model based on the least squares (LS) method has been proposed for this problem. In this paper, we propose a robust collaborative regression based on the least absolute deviation (LAD). We give the statistical interpretation of the LS-collaborative regression and the LAD-collaborative regression. We then design an efficient symmetric Gauss–Seidel-based alternating direction method of multipliers algorithm to solve the two models; the algorithm has global convergence and a Q-linear rate of convergence. Finally, we report numerical experiments to illustrate the efficiency of the proposed methods.
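
To make the robustness argument for LAD concrete, the sketch below compares the two loss functions in the simplest possible setting, a single location parameter, where the LS estimate is the mean and the LAD estimate is the median. This is only an illustration of the LS versus LAD contrast, not the collaborative regression models or the ADMM solver from the paper; the function names and the data values are assumptions made for the example.

```python
from statistics import mean, median


def ls_location(values):
    """Minimizer of sum((v - c) ** 2) over c: the sample mean."""
    return mean(values)


def lad_location(values):
    """A minimizer of sum(abs(v - c)) over c: the sample median."""
    return median(values)


# Hypothetical measurements clustered around 10, plus one gross outlier.
clean = [9.8, 10.1, 10.0, 9.9, 10.2]
contaminated = clean + [100.0]

ls_estimate = ls_location(contaminated)    # pulled far toward the outlier
lad_estimate = lad_location(contaminated)  # stays near the bulk of the data
```

A single outlier moves the LS estimate well away from the bulk of the data, while the LAD estimate barely changes, which is the usual motivation for replacing a squared loss with an absolute one.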


(Auto)Biographical reflections on the contributions of William F. Loomis (1940-2016) to Dictyostelium biology

The International Journal of Developmental Biology, DOI 10.1387/ijdb.190224ak, 2019, Vol 63 (8-9-10), pp. 343-357

Author(s): Adam Kuspa, Gad Shaulsky

Keyword(s): Cell Differentiation, Molecular Biology, Genetic Control, Dictyostelium Discoideum, University Of California, High Dimensional, Social Amoeba, The Social, The University, High Dimensional Datasets

William Farnsworth Loomis studied the social amoeba Dictyostelium discoideum for more than fifty years as a professor of biology at the University of California, San Diego, USA. This biographical reflection describes Dr. Loomis’ major scientific contributions to the field within a career arc that spanned the early days of molecular biology up to the present day, when the acquisition of high-dimensional datasets drives research. Dr. Loomis explored the genetic control of social amoeba development, delineated mechanisms of cell differentiation, and significantly advanced genetic and genomic technology for the field. The details of Dr. Loomis’ multifaceted career are drawn from his published work, from an autobiographical essay that he wrote near the end of his career, and from extensive conversations between him and the two authors, many of which took place on the deck of his beachfront home in Del Mar, California.


Probing Multiscale Disorder in Pyrochlore and Related Complex Oxides in the Transmission Electron Microscope: A Review

Frontiers in Chemistry, DOI 10.3389/fchem.2021.743025, 2021, Vol 9

Author(s): Jenna L. Wardini, Hasti Vahidi, Huiming Guo, William J. Bowman

Keyword(s): Complex Oxides, Atomic Scale, High Dimensional, Detection Systems, Spatially Resolved, Chemical Ordering, Transmission Electron, The Many, High Dimensional Datasets

Transmission electron microscopy (TEM), and its counterpart, scanning TEM (STEM), are powerful materials characterization tools capable of probing crystal structure, composition, charge distribution, electronic structure, and bonding down to the atomic scale. Recent (S)TEM instrumentation developments such as electron beam aberration-correction as well as faster and more efficient signal detection systems have given rise to new and more powerful experimental methods, some of which (e.g., 4D-STEM, spectrum-imaging, in situ/operando (S)TEM) facilitate the capture of high-dimensional datasets that contain spatially-resolved structural, spectroscopic, time- and/or stimulus-dependent information across the sub-angstrom to several micrometer length scale. Thus, through the variety of analysis methods available in the modern (S)TEM and its continual development towards high-dimensional data capture, it is well-suited to the challenge of characterizing isometric mixed-metal oxides such as pyrochlores, fluorites, and other complex oxides that reside on a continuum of chemical and spatial ordering. In this review, we present a suite of imaging and diffraction (S)TEM techniques that are uniquely suited to probe the many types, length-scales, and degrees of disorder in complex oxides, with a focus on disorder common to pyrochlores, fluorites, and the expansive library of intermediate structures they may adopt. The application of these techniques to various complex oxides will be reviewed to demonstrate their capabilities and limitations in resolving the continuum of structural and chemical ordering in these systems.


Diabetes and its Complication Prediction using Multi-Task Learning

International Journal of Innovative Technology and Exploring Engineering - Special Issue, DOI 10.35940/ijitee.e2821.039520, 2020, Vol 9 (5), pp. 1426-1430

Keyword(s): Risk Factors, Diabetes Patient, Prediction Performance, Multitask Learning, High Dimensional, Disease Prediction, Healthcare Applications, Future Health, High Dimensional Datasets

Diabetes is a long-term disease that leads to multiple complications. It has become a silent killer because it often shows no signs to patients until it is too late. It can lead to complications in other organs and systems, such as the kidneys, the cardiovascular system, the liver, or blood pressure [1]. This work applies multitask learning [2] to simultaneously model the relationships among multiple complications, where each task corresponds to modelling the risk of one complication [3]. It also uses feature selection to reduce the set of risk factors from high-dimensional datasets. Then, using correlation, it measures the degree of relatedness among the various complications. The proposed method is able to identify possible future health hazards for a diabetes patient. This can help explain medical conditions and improve healthcare applications by improving disease prediction performance.
