Determination of System Dimensionality from Observing Near-Normal Distributions

2015
Vol 2015
pp. 1-17
Author(s):
Shahid Razzaq
Shehzad Khalid

This paper identifies a previously unreported behavior of uniformly distributed data points, or vectors, in high-dimensional ellipsoidal models: such models yield a near-normal distribution along each of their dimensions. The converse may also hold; that is, a normal-like distribution of an observed variable may be the result of a uniform distribution of data points in a high-dimensional ellipsoidal model to which the observed variable belongs. Given the currently held view of normal distributions, this behavior raises many interesting questions, some of which this paper attempts to answer. We cover both volume-based (filled) and surface-based (shell) ellipsoidal models, and demonstrate the phenomenon using statistical as well as mathematical approaches. We also show that the dimensionality of the latent model, that is, the number of hidden variables in a system, can be calculated from the observed distribution. We call the new distribution "Tanazur" and show through experiments that it is observed in at least one real-world scenario: the motion of particles in an ideal gas. We show that the Maxwell-Boltzmann distribution of particle speeds can be explained on the basis of Tanazur distributions.
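
A minimal simulation sketch of the phenomenon described above (not the authors' code; the sample sizes and the use of a unit ball are assumptions): points drawn uniformly from a high-dimensional ball ("filled" model) or sphere ("shell" model) have single-coordinate marginals that look increasingly normal as the dimension grows.

```python
# Hypothetical illustration (not the authors' code): sample points uniformly
# in a d-dimensional ball ("filled" model) and on a d-sphere ("shell" model),
# then inspect the marginal distribution of a single coordinate.
import numpy as np

rng = np.random.default_rng(0)
n, d = 100_000, 50

# Uniform on the shell: normalize Gaussian vectors to unit length.
g = rng.standard_normal((n, d))
shell = g / np.linalg.norm(g, axis=1, keepdims=True)

# Uniform in the ball: push shell points inward by radius U**(1/d).
ball = shell * rng.uniform(0, 1, (n, 1)) ** (1.0 / d)

for name, pts in [("shell", shell), ("ball", ball)]:
    x = pts[:, 0]  # one observed coordinate
    excess_kurtosis = ((x - x.mean()) ** 4).mean() / x.var() ** 2 - 3
    print(name, "marginal std %.4f" % x.std(),
          "excess kurtosis %.3f" % excess_kurtosis)
```

For the shell model the marginal variance is 1/d and the excess kurtosis is -6/(d + 2), so the marginal approaches a Gaussian as d grows; that the observed variance and shape depend on d hints at how the latent dimensionality could be read off an observed distribution.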

2018
Vol 8 (2)
pp. 377-406
Author(s):
Almog Lahav
Ronen Talmon
Yuval Kluger

Abstract: A fundamental question in data analysis, machine learning and signal processing is how to compare data points. The choice of distance metric is especially challenging for high-dimensional data sets, where the problem of meaningfulness is more prominent (e.g. the Euclidean distance between images). In this paper, we propose to exploit a property of high-dimensional data that is usually ignored: the structure stemming from the relationships between the coordinates. Specifically, we show that organizing similar coordinates into clusters can be exploited for the construction of the Mahalanobis distance between samples. When the observable samples are generated by a nonlinear transformation of hidden variables, the Mahalanobis distance allows the recovery of the Euclidean distances in the hidden space. We illustrate the advantage of our approach on a synthetic example, where the discovery of clusters of correlated coordinates improves the estimation of the principal directions of the samples. Our method was applied to real gene-expression data for lung adenocarcinomas (lung cancer). Using the proposed metric, we found a partition of subjects into risk groups with good separation between their Kaplan–Meier survival plots.
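
As a rough illustration of the role the Mahalanobis distance plays here, the sketch below computes pairwise Mahalanobis distances from an empirically estimated covariance; the paper's actual construction additionally organizes correlated coordinates into clusters, which this simplified version does not reproduce.

```python
# Minimal sketch, assuming the covariance is estimated directly from the
# data; the paper refines this step by first clustering similar coordinates.
import numpy as np

def mahalanobis_pairwise(X):
    """Pairwise Mahalanobis distances between the rows of X (n_samples, n_features)."""
    cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))  # pseudo-inverse for stability
    diffs = X[:, None, :] - X[None, :, :]              # all pairwise differences
    # d(i, j)^2 = (x_i - x_j)^T  C^{-1}  (x_i - x_j)
    return np.sqrt(np.einsum('ijk,kl,ijl->ij', diffs, cov_inv, diffs))

X = np.random.default_rng(1).normal(size=(40, 10))
D = mahalanobis_pairwise(X)
print(D.shape, D[0, :3])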


2019
Vol 1 (2)
pp. 715-744
Author(s):
Oliver Chikumbo
Vincent Granville

The sensitivity of the elbow rule in determining an optimal number of clusters in high-dimensional spaces characterized by tightly distributed data points is demonstrated. The high-dimensional data samples are not artificially generated; they are taken from a real-world evolutionary many-objective optimization and comprise Pareto fronts from the last 10 generations of an evolutionary optimization computation with 14 objective functions. The choice to analyze Pareto fronts is strategic, as it is squarely intended to benefit the user, who needs to implement only one solution from the Pareto set and therefore requires a systematic means of reducing the cardinality of solutions. As such, clustering the data and identifying the cluster from which to pick the desired solution is covered in this manuscript, highlighting the implementation of the elbow rule and the use of hyper-radial distances for cluster identity. The Calinski-Harabasz statistic was favored for determining the criterion used in the elbow rule because of its robustness: the statistic takes into account both the variance within clusters and the variance between clusters. This exercise also opened an opportunity to revisit the justification of using the highest Calinski-Harabasz criterion for determining the optimal number of clusters for multivariate data. The elbow rule predicted the upper end of the optimal number of clusters, and the highest-Calinski-Harabasz-criterion method favored a number of clusters at the lower end. Both results are used in a unique way for understanding high-dimensional data, despite being inconclusive as to which of the two methods determines the true optimal number of clusters.
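
A hedged sketch of the scoring loop described above, using synthetic blobs as a stand-in for the Pareto-front samples: k-means partitions are scored with the Calinski-Harabasz criterion, whose maximum can then be compared against the elbow of the curve.

```python
# Sketch: score k-means partitions with the Calinski-Harabasz criterion and
# inspect both the maximum and the "elbow" of the curve. Synthetic 14-feature
# blobs stand in for the Pareto-front data used in the paper.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=500, centers=5, n_features=14, random_state=0)

scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = calinski_harabasz_score(X, labels)  # between-variance / within-variance

for k, s in scores.items():
    print(f"k={k}: CH={s:,.1f}")
print("highest-CH choice:", max(scores, key=scores.get))
```

On well-separated synthetic blobs the maximum and the elbow tend to coincide; the paper's point is that on tightly distributed real Pareto fronts they can disagree, bracketing the optimal number of clusters from below and above.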


1997
Vol 78 (02)
pp. 855-858
Author(s):
Armando Tripodi
Veena Chantarangkul
Marigrazia Clerici
Barbara Negri
Pier Mannuccio Mannucci

Summary: A key issue for the reliable use of new devices for the laboratory control of oral anticoagulant therapy with the INR is their conformity to the calibration model. In the past, their adequacy has mostly been assessed empirically, without reference to the calibration model and the use of International Reference Preparations (IRP) for thromboplastin. In this study we reviewed the requirements to be fulfilled and applied them to the calibration of a new near-patient testing device (TAS, Cardiovascular Diagnostics), which uses thromboplastin-containing test cards for determination of the INR. On each of 10 working days, citrated whole blood and plasma samples were obtained from 2 healthy subjects and 6 patients on oral anticoagulants. PT testing on whole blood and plasma was done with the TAS, with parallel testing for plasma by the manual technique with the IRP CRM 149S. Conformity to the calibration model was judged satisfactory if the following requirements were met: (i) there was a linear relationship between paired log-PTs (TAS vs CRM 149S); (ii) the regression line drawn through the patient data points passed through those of the normals; (iii) the precision of the calibration, expressed as the CV of the slope, was <3%. A good linear relationship was observed for the calibration plots for both plasma and whole blood (r = 0.98). Regression lines drawn through the patient data points passed through those of the normals. The CV of the slope was 2.2% in both cases, and the ISIs were 0.965 and 1.000 for whole blood and plasma, respectively. In conclusion, our study shows that near-patient testing devices can be considered reliable tools to measure the INR in patients on oral anticoagulants, and it provides guidelines for their evaluation.
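
The three requirements translate directly into a small computation. Below is an illustrative check, assuming paired prothrombin times from the device and the reference method; the numbers are invented, not the study's data.

```python
# Illustrative conformity check for the calibration model described above,
# on made-up paired prothrombin times (seconds).
import numpy as np
from scipy.stats import linregress

pt_device = np.array([12.1, 14.0, 18.5, 22.3, 27.9, 31.2, 35.8, 40.1])
pt_reference = np.array([12.5, 14.6, 19.8, 24.0, 30.5, 34.1, 39.7, 45.0])

# Requirement (i): linear relationship between paired log-PTs.
fit = linregress(np.log(pt_device), np.log(pt_reference))

# Requirement (iii): precision of the calibration as the CV of the slope.
slope_cv = 100 * fit.stderr / fit.slope

print(f"r = {fit.rvalue:.3f} (linearity)")
print(f"slope CV = {slope_cv:.2f}% (requirement: < 3%)")
```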


1985
Vol 231 (1)
pp. 171-177
Author(s):
L Matyska
J Kovář

The known jackknife methods (i.e. standard jackknife, weighted jackknife, linear jackknife and weighted linear jackknife) for the determination of parameters (as well as of their confidence regions) were tested and compared with the simple Marquardt technique (comprising the calculation of confidence intervals from the variance-covariance matrix). Simulated data corresponding to the Michaelis-Menten equation, with a defined structure and magnitude of error in the dependent variable, were used for fitting. There were no essential differences between the point and interval parameter estimates produced by the tested methods. The Marquardt procedure yielded slightly better results than the jackknives for five scattered data points (the use of this method is advisable for routine analyses). The classical jackknife was slightly superior to the other methods for 20 data points (this method can be recommended for very precise calculations if large numbers of data are available). Weighting does not seem to be necessary for this type of equation, because the parameter estimates obtained with all methods using constant weights were comparable with those calculated with weights corresponding exactly to the real error structure, whereas relative weighting led to rather worse results.
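
A sketch of the two approaches being compared, under assumed data: a direct Levenberg-Marquardt fit of the Michaelis-Menten equation via scipy's curve_fit, and a standard (unweighted) leave-one-out jackknife built from the same fitting routine.

```python
# Sketch (invented data): Marquardt-style fit of the Michaelis-Menten
# equation plus a standard jackknife over the five data points.
import numpy as np
from scipy.optimize import curve_fit

def michaelis_menten(s, vmax, km):
    return vmax * s / (km + s)

s = np.array([0.5, 1.0, 2.0, 5.0, 10.0])      # substrate concentrations
v = np.array([0.42, 0.68, 0.95, 1.28, 1.43])  # simulated reaction rates

# Marquardt fit; confidence info comes from the variance-covariance matrix.
p_full, pcov = curve_fit(michaelis_menten, s, v, p0=(1.5, 1.0))

# Standard jackknife: refit with each point left out, form pseudo-values.
n = len(s)
pseudo = []
for i in range(n):
    mask = np.arange(n) != i
    p_i, _ = curve_fit(michaelis_menten, s[mask], v[mask], p0=p_full)
    pseudo.append(n * p_full - (n - 1) * p_i)
pseudo = np.array(pseudo)

print("full fit  (Vmax, Km):", p_full, "+/-", np.sqrt(np.diag(pcov)))
print("jackknife (Vmax, Km):", pseudo.mean(axis=0),
      "+/-", pseudo.std(axis=0, ddof=1) / np.sqrt(n))
```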


2021
Vol 50 (1)
pp. 138-152
Author(s):
Mujeeb Ur Rehman
Dost Muhammad Khan

Anomaly detection has recently attracted serious attention from data mining researchers, and its prominence has grown steadily in practical domains such as product marketing, fraud detection, medical diagnosis and fault detection. Outlier detection in high-dimensional data poses exceptional challenges, owing to the curse of dimensionality and the growing resemblance between distant and adjacent points. Traditional algorithms and techniques operate on the full feature space and concentrate largely on low-dimensional data, so they prove ineffective at discovering anomalies in data sets with a high number of dimensions. Digging out the anomalies present in a high-dimensional data set becomes a difficult and tiresome job when all subsets of projections need to be explored. All data points in high-dimensional data behave like similar observations because of an intrinsic property of such data: the contrast between the distances to near and distant observations approaches zero as the number of dimensions extends towards infinity. This research work proposes a novel technique that explores the deviation among all data points and embeds its findings inside well-established density-based techniques. It opens a new breadth of research towards resolving the inherent problems of high-dimensional data, where outliers reside within clusters having different densities. A high-dimensional dataset from the UCI Machine Learning Repository is chosen to test the proposed technique, and its results are compared with those of density-based techniques to evaluate its efficiency.
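
The paper's technique itself is novel, but the density-based baseline it is compared against is standard; a minimal sketch of such a baseline (Local Outlier Factor on random stand-in data, with parameters chosen arbitrarily) is shown below.

```python
# Baseline sketch: a standard density-based detector (LOF) of the kind the
# proposed technique is benchmarked against, on random high-dimensional data.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 60))  # 60-dimensional stand-in data
X[:5] += 4.0                    # inject a few outliers

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)     # -1 = outlier, 1 = inlier
print("flagged outliers:", np.flatnonzero(labels == -1))
```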


2021
Author(s):
R. Priyadarshini
K. Anuratha
N. Rajendran
S. Sujeetha

An anomaly is an uncommon observation that represents an outlier, i.e., a nonconforming case. The Oxford Dictionary of Mathematics defines an anomaly as an unusual and erroneous observation that does not follow the general pattern of the drawn population. Anomaly detection is a data mining process that aims to find data points or patterns that do not conform to the overall pattern of the data. Anomalous behavior and its impact have been studied in areas such as network security, finance, healthcare and earth sciences. Proper detection and prediction of anomalies is of great importance, as these rare observations may carry significant information. In today's financial world, enterprise data is digitized and stored in the cloud, so there is a significant need to detect anomalies in financial data; doing so helps enterprises cope with the huge volume of auditing. Corporations audit large numbers of ledgers and journal entries, and the monitoring of such audits is still performed mostly by hand. Proper anomaly detection is therefore needed for the high-dimensional data published in ledger format for auditing purposes. This work aims at analyzing and predicting unusual, fraudulent financial transactions by employing several machine learning and deep learning methods. When an anomaly such as manipulation or tampering of data is detected, it can be identified and flagged with supporting evidence by the machine-learning-based algorithms. The proposed prediction models increase prediction accuracy by 7%.
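
As a hypothetical sketch of the kind of model such a pipeline might employ (the feature set, values and parameters below are invented for illustration and are not from the paper), an Isolation Forest can flag unusual journal entries in unlabeled ledger data:

```python
# Hypothetical sketch: Isolation Forest flagging unusual journal entries.
# Features and values are invented for illustration.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)
# columns: amount, hour-of-posting, account code, days-to-period-end
entries = rng.normal(loc=[500, 14, 40, 10], scale=[200, 3, 5, 5], size=(1000, 4))
entries[:3, 0] *= 25  # a few implausibly large amounts

model = IsolationForest(contamination=0.01, random_state=0).fit(entries)
flags = model.predict(entries)  # -1 = anomalous entry
print("suspicious journal entries:", np.flatnonzero(flags == -1))
```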


Author(s):  
Ping Deng
Qingkai Ma
Weili Wu

Clustering can be considered the most important unsupervised learning problem. It has been discussed thoroughly by both the statistics and database communities due to its numerous applications in problems such as classification, machine learning, and data mining. A summary of clustering techniques can be found in (Berkhin, 2002). Most known clustering algorithms, such as DBSCAN (Ester, Kriegel, Sander, & Xu, 1996) and CURE (Guha, Rastogi, & Shim, 1998), cluster data points based on the full set of dimensions. As the dimensionality of the space grows, these algorithms lose efficiency and accuracy because of the so-called "curse of dimensionality". It is shown in (Beyer, Goldstein, Ramakrishnan, & Shaft, 1999) that computing distances over all dimensions is not meaningful in high-dimensional space, since the distance of a point to its nearest neighbor approaches the distance to its farthest neighbor as dimensionality increases. In fact, natural clusters might exist in subspaces: data points in different clusters may be correlated with respect to different subsets of dimensions. To address this problem, feature selection (Kohavi & Sommerfield, 1995) and dimension reduction (Raymer, Punch, Goodman, Kuhn, & Jain, 2000) have been proposed to find the closely correlated dimensions for all the data and the clusters in those dimensions. Although both methods reduce the dimensionality of the space before clustering, they do not handle well the case where clusters exist in different subspaces of the full dimensions. Projected clustering has recently been proposed to deal effectively with high dimensionality. Finding clusters and their relevant dimensions is the objective of projected clustering algorithms. Instead of projecting the entire dataset onto the same subspace, projected clustering finds a specific projection for each cluster such that similarity is preserved as much as possible.
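
The Beyer et al. observation cited above is easy to verify numerically; the sketch below (assumed uniform data, not from any of the cited papers) shows the ratio of nearest to farthest neighbor distance drifting toward 1 as the dimensionality increases, which is what motivates clustering in subspaces rather than in full dimensions.

```python
# Quick numerical check of the Beyer et al. result: in high dimensions the
# nearest and farthest neighbors of a point become almost equidistant.
import numpy as np

rng = np.random.default_rng(4)
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(1000, d))
    dists = np.linalg.norm(X[1:] - X[0], axis=1)  # distances from one point
    print(f"d={d:5d}  nearest/farthest = {dists.min() / dists.max():.3f}")
```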


Author(s):  
Samuel Melton
Sharad Ramanathan

Abstract
Motivation: Recent technological advances produce a wealth of high-dimensional descriptions of biological processes, yet extracting meaningful insight and mechanistic understanding from these data remains challenging. For example, in developmental biology, the dynamics of differentiation can now be mapped quantitatively using single-cell RNA sequencing, yet it is difficult to infer molecular regulators of developmental transitions. Here, we show that discovering informative features in the data is crucial for statistical analysis as well as making experimental predictions.
Results: We identify features based on their ability to discriminate between clusters of the data points. We define a class of problems in which linear separability of clusters is hidden in a low-dimensional space. We propose an unsupervised method to identify the subset of features that define a low-dimensional subspace in which clustering can be conducted. This is achieved by averaging over discriminators trained on an ensemble of proposed cluster configurations. We then apply our method to single-cell RNA-seq data from mouse gastrulation, and identify 27 key transcription factors (out of 409 total), 18 of which are known to define cell states through their expression levels. In this inferred subspace, we find clear signatures of known cell types that eluded classification prior to discovery of the correct low-dimensional subspace.
Availability and implementation: https://github.com/smelton/SMD.
Supplementary information: Supplementary data are available at Bioinformatics online.
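
A rough sketch of the idea as described in the abstract, not the released SMD implementation at https://github.com/smelton/SMD: linear discriminators are trained on an ensemble of proposed cluster configurations and their weights are averaged to score features. The data, subset sizes and number of trials below are all invented.

```python
# Sketch under stated assumptions: score features by averaging the weights of
# discriminators trained on many proposed cluster configurations.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 50))
X[:100, :5] += 2.0  # 5 informative features out of 50

scores = np.zeros(X.shape[1])
for trial in range(30):
    # propose a cluster configuration on a random feature subset
    subset = rng.choice(X.shape[1], size=10, replace=False)
    labels = KMeans(n_clusters=2, n_init=5, random_state=trial).fit_predict(X[:, subset])
    # train a discriminator on the full feature space for this configuration
    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    scores += np.abs(clf.coef_[0])  # accumulate discriminator weights

print("top-ranked features:", np.argsort(scores)[::-1][:5])
```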


2003
Vol 7 (4)
pp. 305-326
Author(s):
Tao Li
Shenghuo Zhu
Mitsunori Ogihara
