Topological data analysis : an overview

Background Information ◽

High Dimensional ◽

Data Sets ◽

Complex Data ◽

Topological Network ◽

A growing area of mathematics topological data analysis (TDA) uses fundamental concepts of topology to analyze complex, high-dimensional data. A topological network represents the data, and the TDA uses the network to analyze the shape of the data and identify features in the network that correspond to patterns in the data. These patterns extract knowledge from the data. TDA provides a framework to advance machine learning’s ability to understand and analyze large, complex data. This paper provides background information about TDA, TDA applications for large data sets, and details related to the investigation and implementation of existing tools and environments.

Topological data analysis to uncover the shape of immune responses during co-infection

10.1101/723957 ◽

2019 ◽

Author(s):

Karin Sasaki ◽

Dunja Bruder ◽

Esteban Hernandez-Vargas

Keyword(s):

Immune Response ◽

Data Analysis ◽

Bacterial Infection ◽

Viral Infection ◽

High Dimensional ◽

Data Sets ◽

Contributing Factors ◽

Transition Points ◽

AbstractCo-infections by multiple pathogens have important implications in many aspects of health, epidemiology and evolution. However, how to disentangle the contributing factors of the immune response when two infections take place at the same time is largely unexplored. Using data sets of the immune response during influenza-pneumococcal co-infection in mice, we employ here topological data analysis to simplify and visualise high dimensional data sets.We identified persistent shapes of the simplicial complexes of the data in the three infection scenarios: single viral infection, single bacterial infection, and co-infection. The immune response was found to be distinct for each of the infection scenarios and we uncovered that the immune response during the co-infection has three phases and two transition points. During the first phase, its dynamics is inherited from its response to the primary (viral) infection. The immune response has an early (few hours post co-infection) and then modulates its response to finally react against the secondary (bacterial) infection. Between 18 to 26 hours post co-infection the nature of the immune response changes again and does no longer resembles either of the single infection scenarios.Author summaryThe mapper algorithm is a topological data analysis technique used for the qualitative analysis, simplification and visualisation of high dimensional data sets. It generates a low-dimensional image that captures topological and geometric information of the data set in high dimensional space, which can highlight groups of data points of interest and can guide further analysis and quantification.To understand how the immune system evolves during the co-infection between viruses and bacteria, and the role of specific cytokines as contributing factors for these severe infections, we use Topological Data Analysis (TDA) along with an extensive semi-unsupervised parameter value grid search, and k-nearest neighbour analysis.We find persistent shapes of the data in the three infection scenarios, single viral and bacterial infections and co-infection. The immune response is shown to be distinct for each of the infections scenarios and we uncover that the immune response during the co-infection has three phases and two transition points, a previously unknown property regarding the dynamics of the immune response during co-infection.

The Application of Topological Data Analysis in Practice and Its Effectiveness

E3S Web of Conferences ◽

10.1051/e3sconf/202021403034 ◽

2020 ◽

Vol 214 ◽

pp. 03034

Author(s):

Liang Cheng

Keyword(s):

Data Analysis ◽

Data Science ◽

High Dimensional ◽

Data Sets ◽

Relevant Feature ◽

Mathematical Algorithm ◽

Working Efficiency ◽

Analyze Data ◽

Topological Data Analysis(TDA) is a new and fast growing field in data science. TDA provides an approach to analyze data sets and derive their relevant feature out of complex high-dimensional data, which greatly improves the working efficiency in many fields. In this paper, the author mainly discusses some mathematics concepts about topology, methods in TDA and the relation between these topological concepts and data sets (how to apply topological concepts on data). The problems of TDA, mathematical algorithm using in TDA and two application-examples are introduced in this paper. In addition, the advantages, limitations, and the direction of future development of TDA are discussed.

Understanding High Dimensional and Large Data Sets: Some Mathematical Challenges and Opportunities

Data Mining for Scientific and Engineering Applications - Massive Computing ◽

10.1007/978-1-4615-1733-7_2 ◽

2001 ◽

pp. 23-34

Author(s):

Jagdish Chandra

Keyword(s):

Large Data ◽

Large Data Sets ◽

High Dimensional ◽

Data Sets ◽

Challenges And Opportunities

Intelligent Data Analysis

Intelligent Information Technologies ◽

10.4018/978-1-59904-941-0.ch015 ◽

2011 ◽

pp. 308-314 ◽

Author(s):

Xiaohui Liu

Keyword(s):

Data Analysis ◽

High Performance ◽

Large Data ◽

Large Data Sets ◽

Data Sets ◽

Intelligent Data Analysis ◽

Statistical Knowledge ◽

Interdisciplinary Study ◽

Performance Computing ◽

Effective Analysis

Intelligent Data Analysis (IDA) is an interdisciplinary study concerned with the effective analysis of data. IDA draws the techniques from diverse fields, including artificial intelligence, databases, high-performance computing, pattern recognition, and statistics. These fields often complement each other (e.g., many statistical methods, particularly those for large data sets, rely on computation, but brute computing power is no substitute for statistical knowledge) (Berthold & Hand 2003; Liu, 1999).

Data segmentation based on the local intrinsic dimension

Scientific Reports ◽

10.1038/s41598-020-72222-0 ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Michele Allegra ◽

Elena Facco ◽

Francesco Denti ◽

Alessandro Laio ◽

Antonietta Mira

Keyword(s):

High Dimensional Data ◽

Large Data ◽

Large Data Sets ◽

High Dimensional ◽

Data Sets ◽

Imaging Data ◽

Unsupervised Segmentation ◽

Real World Data ◽

Data Set ◽

Intrinsic Dimension

Abstract One of the founding paradigms of machine learning is that a small number of variables is often sufficient to describe high-dimensional data. The minimum number of variables required is called the intrinsic dimension (ID) of the data. Contrary to common intuition, there are cases where the ID varies within the same data set. This fact has been highlighted in technical discussions, but seldom exploited to analyze large data sets and obtain insight into their structure. Here we develop a robust approach to discriminate regions with different local IDs and segment the points accordingly. Our approach is computationally efficient and can be proficiently used even on large data sets. We find that many real-world data sets contain regions with widely heterogeneous dimensions. These regions host points differing in core properties: folded versus unfolded configurations in a protein molecular dynamics trajectory, active versus non-active regions in brain imaging data, and firms with different financial risk in company balance sheets. A simple topological feature, the local ID, is thus sufficient to achieve an unsupervised segmentation of high-dimensional data, complementary to the one given by clustering algorithms.

Topological Data Analysis with Applications

10.1017/9781108975704 ◽

2021 ◽

Author(s):

Gunnar Carlsson ◽

Mikael Vejdemo-Johansson

Keyword(s):

Data Analysis ◽

Data Science ◽

Persistent Homology ◽

Data Sets ◽

Advanced Learners ◽

Computer Scientists ◽

Dramatic Rise ◽

Modeling Data ◽

The continued and dramatic rise in the size of data sets has meant that new methods are required to model and analyze them. This timely account introduces topological data analysis (TDA), a method for modeling data by geometric objects, namely graphs and their higher-dimensional versions: simplicial complexes. The authors outline the necessary background material on topology and data philosophy for newcomers, while more complex concepts are highlighted for advanced learners. The book covers all the main TDA techniques, including persistent homology, cohomology, and Mapper. The final section focuses on the diverse applications of TDA, examining a number of case studies drawn from monitoring the progression of infectious diseases to the study of motion capture data. Mathematicians moving into data science, as well as data scientists or computer scientists seeking to understand this new area, will appreciate this self-contained resource which explains the underlying technology and how it can be used.

Singular Value Decomposition, Clustering, and Indexing for Similarity Search for Large Data Sets in High-Dimensional Spaces

Big Data ◽

10.1201/b18050-8 ◽

2015 ◽

pp. 76-107

Keyword(s):

Singular Value Decomposition ◽

Similarity Search ◽

Large Data ◽

Singular Value ◽

Large Data Sets ◽

High Dimensional ◽

Data Sets ◽

Value Decomposition

Tensor Numerical Methods: Actual Theory and Recent Applications

Computational Methods in Applied Mathematics ◽

10.1515/cmam-2018-0014 ◽

2019 ◽

Vol 19 (1) ◽

pp. 1-4 ◽

Author(s):

Ivan Gavrilyuk ◽

Boris N. Khoromskij

Keyword(s):

Numerical Modeling ◽

Big Data ◽

Data Analysis ◽

Numerical Methods ◽

Differential Equations ◽

Numerical Solution ◽

Large Data ◽

High Dimensional ◽

Data Sets ◽

Integral Differential Equations

AbstractMost important computational problems nowadays are those related to processing of the large data sets and to numerical solution of the high-dimensional integral-differential equations. These problems arise in numerical modeling in quantum chemistry, material science, and multiparticle dynamics, as well as in machine learning, computer simulation of stochastic processes and many other applications related to big data analysis. Modern tensor numerical methods enable solution of the multidimensional partial differential equations (PDE) in {\mathbb{R}^{d}} by reducing them to one-dimensional calculations. Thus, they allow to avoid the so-called “curse of dimensionality”, i.e. exponential growth of the computational complexity in the dimension size d, in the course of numerical solution of high-dimensional problems. At present, both tensor numerical methods and multilinear algebra of big data continue to expand actively to further theoretical and applied research topics. This issue of CMAM is devoted to the recent developments in the theory of tensor numerical methods and their applications in scientific computing and data analysis. Current activities in this emerging field on the effective numerical modeling of temporal and stationary multidimensional PDEs and beyond are presented in the following ten articles, and some future trends are highlighted therein.

Statistical Methods in Topological Data Analysis for Complex, High-Dimensional Data

Conference on Applied Statistics in Agriculture ◽

10.4148/2475-7772.1130 ◽

2015 ◽

Author(s):

Patrick S Medina ◽

R W Doerge

Keyword(s):

Data Analysis ◽

Statistical Methods ◽

High Dimensional Data ◽

High Dimensional ◽

A Semiautomated Framework for Integrating Expert Knowledge into Disease Marker Identification

Disease Markers ◽

10.1155/2013/613529 ◽

2013 ◽

Vol 35 ◽

pp. 513-523 ◽

Cited By ~ 3

Author(s):

Jing Wang ◽

Bobbie-Jo M. Webb-Robertson ◽

Melissa M. Matzke ◽

Susan M. Varnum ◽

Joseph N. Brown ◽

...

Keyword(s):

Biomarker Discovery ◽

Expert Knowledge ◽

Large Data ◽

Large Data Sets ◽

Biological Information ◽

Data Sets ◽

Complex Data ◽

Selection Scheme ◽

Biomarker Identification ◽

Biomarker Selection

Background. The availability of large complex data sets generated by high throughput technologies has enabled the recent proliferation of disease biomarker studies. However, a recurring problem in deriving biological information from large data sets is how to best incorporate expert knowledge into the biomarker selection process.Objective. To develop a generalizable framework that can incorporate expert knowledge into data-driven processes in a semiautomated way while providing a metric for optimization in a biomarker selection scheme.Methods. The framework was implemented as a pipeline consisting of five components for the identification of signatures from integrated clustering (ISIC). Expert knowledge was integrated into the biomarker identification process using the combination of two distinct approaches; a distance-based clustering approach and an expert knowledge-driven functional selection.Results. The utility of the developed framework ISIC was demonstrated on proteomics data from a study of chronic obstructive pulmonary disease (COPD). Biomarker candidates were identified in a mouse model using ISIC and validated in a study of a human cohort.Conclusions. Expert knowledge can be introduced into a biomarker discovery process in different ways to enhance the robustness of selected marker candidates. Developing strategies for extracting orthogonal and robust features from large data sets increases the chances of success in biomarker identification.