Understanding High Dimensional and Large Data Sets: Some Mathematical Challenges and Opportunities

Author(s):  
Jagdish Chandra
2020
Vol 10 (1)
Author(s):  
Michele Allegra ◽  
Elena Facco ◽  
Francesco Denti ◽  
Alessandro Laio ◽  
Antonietta Mira

Abstract: One of the founding paradigms of machine learning is that a small number of variables is often sufficient to describe high-dimensional data. The minimum number of variables required is called the intrinsic dimension (ID) of the data. Contrary to common intuition, there are cases where the ID varies within the same data set. This fact has been highlighted in technical discussions, but seldom exploited to analyze large data sets and obtain insight into their structure. Here we develop a robust approach to discriminate regions with different local IDs and segment the points accordingly. Our approach is computationally efficient and can be proficiently used even on large data sets. We find that many real-world data sets contain regions with widely heterogeneous dimensions. These regions host points differing in core properties: folded versus unfolded configurations in a protein molecular dynamics trajectory, active versus non-active regions in brain imaging data, and firms with different financial risk in company balance sheets. A simple topological feature, the local ID, is thus sufficient to achieve an unsupervised segmentation of high-dimensional data, complementary to the one given by clustering algorithms.
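As a minimal illustration of the local-ID idea (and not the authors' segmentation method), the sketch below computes a TwoNN-style maximum-likelihood ID estimate from the ratio of each point's first two nearest-neighbour distances; the synthetic data set and function name are ours.

```python
# Minimal sketch of a TwoNN-style intrinsic-dimension (ID) estimate: the ID is
# inferred from the ratio of each point's second to first nearest-neighbour distance.
# Illustration only; not the segmentation method described in the abstract.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def twonn_id(points):
    """Maximum-likelihood ID estimate for an (n_samples, n_features) point cloud."""
    dist, _ = NearestNeighbors(n_neighbors=3).fit(points).kneighbors(points)
    mu = dist[:, 2] / dist[:, 1]              # dist[:, 0] is the point itself
    return len(points) / np.sum(np.log(mu))

rng = np.random.default_rng(0)
cloud = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 20))  # a 3-D subspace embedded in 20-D
print(round(twonn_id(cloud), 1))              # expected to be close to 3
```

Estimating this quantity separately in different neighbourhoods of a data set is what reveals the heterogeneous IDs discussed in the abstract.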


Author(s):  
Abhishek Bajpai ◽  
Dr. Sanjiv Sharma

As the volume of data produced in our society grows day by day, the exploration of big data in healthcare is increasing at an unprecedented rate. Nowadays, big data is a popular concept across many areas. This paper makes an effort to establish that the healthcare industry, too, is stepping into the big data pool to take advantage of its advanced tools and technologies. It reviews research carried out in the healthcare realm using big data approaches and methodologies. Big data methodologies can be applied to healthcare data analytics (characterized by the four V's) to support better decisions that accelerate business profit and customer satisfaction, to acquire a better understanding of market behaviours and trends, and to provide e-health services using Digital Imaging and Communications in Medicine (DICOM). Big data techniques such as MapReduce and machine learning can be applied to develop systems for the early diagnosis of disease, i.e. the analysis of chronic diseases such as heart disease, diabetes, and stroke. The data analysis is performed using the big data analytics framework Hadoop, which is designed to process large data sets. The paper further presents various big data tools, challenges, opportunities, and hurdles, followed by the conclusion.
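Since the abstract leans on the MapReduce model, a small in-memory sketch of that pattern may help; the patient records and field names below are purely illustrative, and a real job would run the same mapper and reducer as a Hadoop job over HDFS.

```python
# Minimal in-memory sketch of the MapReduce pattern: count diagnosis codes
# across patient records. Illustrative only; not tied to any specific data set.
from collections import defaultdict

records = [
    {"patient": "p1", "diagnosis": "diabetes"},
    {"patient": "p2", "diagnosis": "heart disease"},
    {"patient": "p3", "diagnosis": "diabetes"},
]

def map_phase(record):                  # mapper: emit one (key, 1) pair per record
    yield record["diagnosis"], 1

def reduce_phase(key, values):          # reducer: sum the counts for one key
    return key, sum(values)

shuffled = defaultdict(list)            # shuffle: group mapper output by key
for record in records:
    for key, value in map_phase(record):
        shuffled[key].append(value)

print(dict(reduce_phase(k, v) for k, v in shuffled.items()))
# {'diabetes': 2, 'heart disease': 1}
```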


2020
Vol 6 (1)
Author(s):  
Yuri Amorim Coutinho ◽  
Nico Vervliet ◽  
Lieven De Lathauwer ◽  
Nele Moelans

Abstract: Multicomponent alloys show intricate microstructure evolution, providing materials engineers with a nearly inexhaustible variety of solutions to enhance material properties. Multicomponent microstructure evolution simulations are indispensable to exploit these opportunities. These simulations, however, require the handling of high-dimensional and prohibitively large data sets of thermodynamic quantities, of which the size grows exponentially with the number of elements in the alloy, making it virtually impossible to handle the effects of four or more elements. In this paper, we introduce the use of tensor completion for high-dimensional data sets in materials science as a general and elegant solution to this problem. We show that we can obtain an accurate representation of the composition dependence of high-dimensional thermodynamic quantities, and that the decomposed tensor representation can be evaluated very efficiently in microstructure simulations. This realization enables true multicomponent thermodynamic and microstructure modeling for alloy design.
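To make the storage argument concrete, here is a small sketch (with illustrative shapes and a random model, not the paper's actual alloys or rank) of why a canonical polyadic (CP) decomposition helps: the factor matrices are tiny compared with the full composition grid, and a single entry can be evaluated without ever forming the full tensor.

```python
# Minimal sketch: evaluate one entry of a CP-decomposed thermodynamic table.
# The factor matrices would normally come from tensor completion; here they are random.
import numpy as np

rng = np.random.default_rng(0)
grid, rank, n_axes = 50, 4, 5                     # 50 grid points per composition axis, rank-4 model
factors = [rng.normal(size=(grid, rank)) for _ in range(n_axes)]

def cp_entry(factors, index):
    """Value of the decomposed tensor at a multi-index, summed over its rank-1 terms."""
    rows = np.stack([f[i] for f, i in zip(factors, index)])   # shape (n_axes, rank)
    return rows.prod(axis=0).sum()

dense_entries = grid ** n_axes                    # 312,500,000 values if stored as a full grid
stored_entries = sum(f.size for f in factors)     # 1,000 values in CP form
print(dense_entries, stored_entries, cp_entry(factors, (3, 1, 4, 1, 5)))
```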


Author(s):  
Zeyi Wen ◽  
Qinbin Li ◽  
Bingsheng He ◽  
Bin Cui

In the last few years, Gradient Boosting Decision Trees (GBDTs) have been widely used in various applications such as online advertising and spam filtering. However, GBDT training is often a key performance bottleneck for such data science pipelines, especially for training a large number of deep trees on large data sets. Thus, many parallel and distributed GBDT systems have been researched and developed to accelerate the training process. In this survey paper, we review the recent GBDT systems with respect to accelerations with emerging hardware as well as cluster computing, and compare the advantages and disadvantages of the existing implementations. Finally, we present the research opportunities and challenges in designing fast next generation GBDT systems.
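For readers unfamiliar with the training step being accelerated, the sketch below trains a histogram-based GBDT classifier (feature binning is one of the standard accelerations in such systems) on synthetic data; the data set and hyper-parameters are illustrative and not taken from the surveyed systems.

```python
# Minimal sketch of GBDT training with a histogram-based (binned-feature) implementation.
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=50_000, n_features=40, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = HistGradientBoostingClassifier(max_iter=200, max_depth=8, learning_rate=0.1)
model.fit(X_tr, y_tr)          # each boosting iteration grows one tree on binned features
print(f"held-out accuracy: {model.score(X_te, y_te):.3f}")
```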


2021 ◽  
Author(s):  
Amy Bednar

A growing area of mathematics, topological data analysis (TDA) uses fundamental concepts of topology to analyze complex, high-dimensional data. A topological network represents the data, and TDA uses this network to analyze the shape of the data and identify features in the network that correspond to patterns in the data. These features are then used to extract knowledge from the data. TDA provides a framework to advance machine learning's ability to understand and analyze large, complex data. This paper provides background information about TDA, TDA applications for large data sets, and details related to the investigation and implementation of existing tools and environments.
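A minimal sketch of the first step in such a pipeline is given below, assuming nothing beyond standard scientific Python: points are connected when they lie within a scale epsilon, and the 0-dimensional topological features (connected components) of the resulting network are counted. A full TDA tool would track how these features persist as the scale varies; the two-blob data and the chosen scale are illustrative.

```python
# Minimal sketch: build an epsilon-neighbourhood network and count its connected
# components (0-dimensional topological features). Illustrative data and scale.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
cloud = np.vstack([rng.normal(0.0, 0.3, (100, 5)),   # two well-separated blobs in 5-D
                   rng.normal(4.0, 0.3, (100, 5))])

epsilon = 1.5
adjacency = squareform(pdist(cloud)) < epsilon        # edge wherever two points are epsilon-close
n_features, _ = connected_components(csr_matrix(adjacency), directed=False)
print(n_features)                                     # -> 2, one component per blob
```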


2014
Vol 2014
pp. 1-9
Author(s):  
Jiangyuan Mei ◽  
Jian Hou ◽  
Jicheng Chen ◽  
Hamid Reza Karimi

Classification of large data sets is widely used in many industrial applications. It is a challenging task to classify large data sets efficiently, accurately, and robustly, as large data sets always contain numerous instances in a high-dimensional feature space. In order to deal with this problem, in this paper we present an online Logdet divergence based metric learning (LDML) model that makes use of the power of metric learning. We first generate a Mahalanobis matrix by learning the training data with the LDML model. Meanwhile, we propose a compressed representation of the high-dimensional Mahalanobis matrix to reduce the computational complexity of each iteration. The final Mahalanobis matrix obtained this way measures the distances between instances accurately and serves as the basis of classifiers, for example, the k-nearest neighbors classifier. Experiments on benchmark data sets demonstrate that the proposed algorithm compares favorably with the state-of-the-art methods.
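As a hint of how the learned matrix is consumed downstream, the sketch below runs a k-nearest-neighbors classifier under a Mahalanobis metric; here the matrix is simply the inverse sample covariance as a stand-in, whereas the paper learns and compresses it with the online LogDet-divergence model.

```python
# Minimal sketch: k-NN classification under a Mahalanobis metric. The matrix M below
# is the inverse sample covariance, standing in for the matrix an LDML model would learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2_000, n_features=15, n_informative=10,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

M = np.linalg.inv(np.cov(X_tr, rowvar=False))         # stand-in for the learned Mahalanobis matrix
knn = KNeighborsClassifier(n_neighbors=5, algorithm="brute",
                           metric="mahalanobis", metric_params={"VI": M})
knn.fit(X_tr, y_tr)
print(f"k-NN accuracy under the Mahalanobis metric: {knn.score(X_te, y_te):.3f}")
```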


Author(s):  
Mohammad Atique ◽  
Leena Homraj Patil

Attribute reduction and feature selection are central issues in rough set theory. Researchers have proposed several rough-set-based attribute reduction methods; however, these methods are time consuming for large data sets. Since the key lies in reducing the attributes and selecting the relevant features, the main aim is to reduce the dimensionality of huge amounts of data to obtain a smaller subset that provides the useful information. Feature selection reduces the dimensionality of the feature space and improves overall performance. The challenge in feature selection is dealing with high-dimensional data. To overcome these issues and challenges, this chapter describes a feature selection method based on the proposed neighborhood positive approximation approach, together with attribute reduction for data sets. The proposed system implements attribute reduction and finds the relevant features. Evaluation shows that the proposed neighborhood positive approximation algorithm is effective and feasible for large data sets and also reduces the feature space.
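For orientation, the sketch below performs classical positive-region (dependency-based) attribute reduction on a toy decision table; the chapter's neighborhood positive approximation generalizes this idea to numeric, high-dimensional data. The table values are invented for illustration.

```python
# Minimal sketch of greedy attribute reduction using the rough-set positive region:
# keep adding the conditional attribute that raises the decision dependency the most.
# Toy decision table; rows are (conditional attribute values, decision).
table = [((0, 1, 0), "yes"), ((0, 1, 1), "yes"), ((1, 0, 0), "no"),
         ((1, 1, 0), "no"),  ((0, 0, 1), "yes"), ((1, 0, 1), "no")]

def dependency(attrs):
    """Fraction of objects whose equivalence class under `attrs` has a single decision."""
    groups = {}
    for values, decision in table:
        groups.setdefault(tuple(values[a] for a in attrs), set()).add(decision)
    consistent = sum(1 for values, _ in table
                     if len(groups[tuple(values[a] for a in attrs)]) == 1)
    return consistent / len(table)

reduct, remaining = [], {0, 1, 2}
while dependency(reduct) < 1.0 and remaining:
    best = max(remaining, key=lambda a: dependency(reduct + [a]))
    reduct.append(best)
    remaining.remove(best)

print(reduct, dependency(reduct))      # e.g. [0] 1.0: attribute 0 alone determines the decision
```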

