Exploring Similarities Across High-Dimensional Datasets

Author(s):  
Karlton Sequeira ◽  
Mohammed J. Zaki

Very often, related data may be collected by a number of sources, which may be unable to share their entire datasets for reasons like confidentiality agreements, dataset size, and so forth. However, these sources may be willing to share a condensed model of their datasets. If some substructures of the condensed models of such datasets, from different sources, are found to be unusually similar, policies successfully applied to one may be successfully applied to the others. In this chapter, we propose a framework for constructing condensed models of datasets and algorithms to find similar substructure in pairs of such models. The algorithms are based on the tensor product. We test our framework on pairs of synthetic datasets and compare our algorithms with an existing one. Finally, we apply it to basketball player statistics for two National Basketball Association (NBA) seasons, and to breast cancer datasets. The results are statistically more interesting than results obtained from independent analysis of the datasets.

Author(s):  
Xue-Zhong Qian ◽  
Jie Deng ◽  
Heng Qian ◽  
Qin Wu

As one of the most popular data reduction approaches for large-scale data mining, simple random sampling (SRS) often loses small clusters when dealing with unevenly distributed datasets. A grid-based density biased sampling algorithm can avoid this problem. However, the granularity of the grid division affects both the efficiency and the effectiveness of the algorithm. To overcome this drawback, a variable-grid density biased sampling method is proposed for large-scale unevenly distributed datasets. However, its efficiency is restricted by dimensionality. To address this, an efficient density biased sampling algorithm is proposed for large high-dimensional datasets. First, an efficient feature selection method is designed to obtain the feature subsets. Second, the variable grid division is executed in the selected feature subsets. Finally, the sample is obtained from the grid space. Experiments on synthetic datasets and UCI datasets show that the proposed algorithm achieves higher quality than SRS. Meanwhile, the proposed algorithm consumes less sampling time than the density biased sampling algorithms based on a grid and on variable grid division.
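The general idea of grid-based density biased sampling can be sketched as follows. This is a minimal illustration, not the authors' algorithm: it assumes a fixed equal-width grid (no variable grid division and no feature selection), and it assumes the commonly used rule that a point in a cell holding n points is kept with probability proportional to n^(-e). The function name `density_biased_sample` and the parameters `bins`, `target`, and `e` are hypothetical.

```python
import numpy as np

def density_biased_sample(X, bins=4, target=100, e=0.5, seed=0):
    """Return indices of a density biased sample of the rows of X.

    Assumptions (not taken from the paper): equal-width grid with `bins`
    cells per dimension; a point in a cell with n points is kept with
    probability min(1, target * n**(-e) / sum_j n_j**(1 - e)).
    e = 0 reduces to simple random sampling; e = 1 gives every occupied
    cell the same expected number of sampled points.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    # Integer cell coordinates in [0, bins - 1] for every dimension.
    cells = np.minimum(((X - lo) / span * bins).astype(int), bins - 1)
    # Group points by occupied cell and count the points in each cell.
    _, inverse, counts = np.unique(
        cells, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.ravel()
    n = counts.astype(float)
    # Per-cell inclusion probability, clipped to 1.
    prob_cell = np.minimum(1.0, target * n ** (-e) / np.sum(n ** (1.0 - e)))
    keep = rng.random(len(X)) < prob_cell[inverse]
    return np.flatnonzero(keep)

# Usage: one large cluster and one small cluster far away from it.
data_rng = np.random.default_rng(1)
big = data_rng.uniform(0.0, 0.1, size=(5000, 2))
small = data_rng.uniform(0.9, 1.0, size=(50, 2))
X = np.vstack([big, small])
idx = density_biased_sample(X, bins=4, target=100, e=1.0, seed=0)
print("sampled from small cluster:", int(np.sum(idx >= 5000)))
```

With e = 1 the small cluster is sampled as heavily as the large one, which is the behaviour that protects small clusters from being lost. A full grid has bins^d cells in d dimensions, which illustrates why dimensionality restricts efficiency.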


Mathematics ◽  
2021 ◽  
Vol 9 (3) ◽  
pp. 222
Author(s):  
Juan C. Laria ◽  
M. Carmen Aguilera-Morillo ◽  
Enrique Álvarez ◽  
Rosa E. Lillo ◽  
Sara López-Taruella ◽  
...  

Over the last decade, regularized regression methods have offered alternatives for performing multi-marker analysis and feature selection in a whole genome context. The process of defining a list of genes that will characterize an expression profile remains unclear. It currently relies upon advanced statistics and can use an agnostic point of view or include some a priori knowledge, but overfitting remains a problem. This paper introduces a methodology to deal with the variable selection and model estimation problems in the high-dimensional set-up, which can be particularly useful in the whole genome context. Results are validated using simulated data and a real dataset from a triple-negative breast cancer study.


Author(s):  
Jun Sun ◽  
Lingchen Kong ◽  
Mei Li

With the development of modern science and technology, it is easy to obtain a large number of high-dimensional datasets, which are related but different. Classical unimodel analysis is less likely to capture potential links between the different datasets. Recently, a collaborative regression model based on the least squares (LS) method has been proposed for this problem. In this paper, we propose a robust collaborative regression based on the least absolute deviation (LAD). We give the statistical interpretation of the LS-collaborative regression and the LAD-collaborative regression. Then we design an efficient symmetric Gauss–Seidel-based alternating direction method of multipliers algorithm to solve the two models, which has global convergence and a Q-linear rate of convergence. Finally, we report numerical experiments to illustrate the efficiency of the proposed methods.
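Why an LAD loss is more robust than an LS loss can be seen in the simplest possible case. The sketch below is not the collaborative regression model or the ADMM solver from the paper. It assumes an intercept-only model, where the LS fit is the sample mean and the LAD fit is the sample median. The function names `ls_location` and `lad_location` are hypothetical.

```python
import numpy as np

def ls_location(y):
    # The constant c minimizing sum((y_i - c)**2) is the sample mean.
    return float(np.mean(y))

def lad_location(y):
    # A constant c minimizing sum(|y_i - c|) is the sample median.
    return float(np.median(y))

y_clean = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_outlier = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # one corrupted value

for name, y in [("clean", y_clean), ("with outlier", y_outlier)]:
    print(name, "LS:", ls_location(y), "LAD:", lad_location(y))
```

A single corrupted observation moves the LS estimate a long way, while the LAD estimate stays where it was.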


2019 ◽  
Vol 63 (8-9-10) ◽  
pp. 343-357
Author(s):  
Adam Kuspa ◽  
Gad Shaulsky

William Farnsworth Loomis studied the social amoeba Dictyostelium discoideum for more than fifty years as a professor of biology at the University of California, San Diego, USA. This biographical reflection describes Dr. Loomis’ major scientific contributions to the field within a career arc that spanned the early days of molecular biology up to the present day, in which the acquisition of high-dimensional datasets drives research. Dr. Loomis explored the genetic control of social amoeba development, delineated mechanisms of cell differentiation, and significantly advanced genetic and genomic technology for the field. The details of Dr. Loomis’ multifaceted career are drawn from his published work, from an autobiographical essay that he wrote near the end of his career, and from extensive conversations between him and the two authors, many of which took place on the deck of his beachfront home in Del Mar, California.


2021 ◽  
Vol 9 ◽  
Author(s):  
Jenna L. Wardini ◽  
Hasti Vahidi ◽  
Huiming Guo ◽  
William J. Bowman

Transmission electron microscopy (TEM) and its counterpart, scanning TEM (STEM), are powerful materials characterization tools capable of probing crystal structure, composition, charge distribution, electronic structure, and bonding down to the atomic scale. Recent (S)TEM instrumentation developments, such as electron beam aberration correction and faster, more efficient signal detection systems, have given rise to new and more powerful experimental methods, some of which (e.g., 4D-STEM, spectrum-imaging, in situ/operando (S)TEM) facilitate the capture of high-dimensional datasets that contain spatially-resolved structural, spectroscopic, time- and/or stimulus-dependent information across length scales from sub-angstrom to several micrometers. Thus, through the variety of analysis methods available in the modern (S)TEM and its continual development towards high-dimensional data capture, it is well-suited to the challenge of characterizing isometric mixed-metal oxides such as pyrochlores, fluorites, and other complex oxides that reside on a continuum of chemical and spatial ordering. In this review, we present a suite of imaging and diffraction (S)TEM techniques that are uniquely suited to probe the many types, length scales, and degrees of disorder in complex oxides, with a focus on disorder common to pyrochlores, fluorites, and the expansive library of intermediate structures they may adopt. The application of these techniques to various complex oxides will be reviewed to demonstrate their capabilities and limitations in resolving the continuum of structural and chemical ordering in these systems.


Diabetes is a long-term disease that leads to multiple side effects. It has become a silent killer in society because it shows no signs to patients until it is too late. It leads to many complications, such as kidney, cardiovascular, liver, or blood pressure complications [1]. This work applies multitask learning [2] to simultaneously model the relations among multiple complications, where each task corresponds to modelling the risk of one complication [3]. It also uses feature selection to reduce the set of risk factors from high-dimensional datasets. Then, using correlation, it finds the degree of relatedness among the various side effects. The proposed method can identify possible future health hazards for a diabetes patient. This can help explain medical conditions and improve healthcare applications, which would in turn improve disease prediction performance.

