Understanding High Dimensional and Large Data Sets: Some Mathematical Challenges and Opportunities

Author(s):  
Jagdish Chandra
2020
Vol 10 (1)
Author(s):  
Michele Allegra ◽  
Elena Facco ◽  
Francesco Denti ◽  
Alessandro Laio ◽  
Antonietta Mira

Abstract: One of the founding paradigms of machine learning is that a small number of variables is often sufficient to describe high-dimensional data. The minimum number of variables required is called the intrinsic dimension (ID) of the data. Contrary to common intuition, there are cases where the ID varies within the same data set. This fact has been highlighted in technical discussions, but seldom exploited to analyze large data sets and obtain insight into their structure. Here we develop a robust approach to discriminate regions with different local IDs and segment the points accordingly. Our approach is computationally efficient and can be proficiently used even on large data sets. We find that many real-world data sets contain regions with widely heterogeneous dimensions. These regions host points differing in core properties: folded versus unfolded configurations in a protein molecular dynamics trajectory, active versus non-active regions in brain imaging data, and firms with different financial risk in company balance sheets. A simple topological feature, the local ID, is thus sufficient to achieve an unsupervised segmentation of high-dimensional data, complementary to the one given by clustering algorithms.
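As a minimal illustration of the local-ID idea (and not the authors' segmentation method), the sketch below computes a TwoNN-style maximum-likelihood ID estimate from the ratio of each point's first two nearest-neighbour distances; the synthetic data set and function name are ours.

```python
# Minimal sketch of a TwoNN-style intrinsic-dimension (ID) estimate: the ID is
# inferred from the ratio of each point's second to first nearest-neighbour distance.
# Illustration only; not the segmentation method described in the abstract.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def twonn_id(points):
    """Maximum-likelihood ID estimate for an (n_samples, n_features) point cloud."""
    dist, _ = NearestNeighbors(n_neighbors=3).fit(points).kneighbors(points)
    mu = dist[:, 2] / dist[:, 1]              # dist[:, 0] is the point itself
    return len(points) / np.sum(np.log(mu))

rng = np.random.default_rng(0)
cloud = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 20))  # a 3-D subspace embedded in 20-D
print(round(twonn_id(cloud), 1))              # expected to be close to 3
```

Estimating this quantity separately in different neighbourhoods of a data set is what reveals the heterogeneous IDs discussed in the abstract.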


Author(s):  
Abhishek Bajpai ◽  
Dr. Sanjiv Sharma

As the volume of data produced in our society grows day by day, the exploration of big data in healthcare is increasing at an unprecedented rate. Nowadays, big data is a popular concept across many areas. This paper makes an effort to establish that the healthcare industry, too, is stepping into the big data pool to take advantage of its advanced tools and technologies. It reviews research carried out in the healthcare realm using big data approaches and methodologies. Big data methodologies can be applied to healthcare data analytics (characterized by the four V's) to support better decisions that accelerate business profit and customer satisfaction, to acquire a better understanding of market behaviours and trends, and to provide e-health services using Digital Imaging and Communications in Medicine (DICOM). Big data techniques such as MapReduce and machine learning can be applied to develop systems for the early diagnosis of disease, i.e. the analysis of chronic diseases such as heart disease, diabetes, and stroke. The data analysis is performed using the big data analytics framework Hadoop, which is designed to process large data sets. The paper further presents various big data tools, challenges, opportunities, and hurdles, followed by the conclusion.
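Since the abstract leans on the MapReduce model, a small in-memory sketch of that pattern may help; the patient records and field names below are purely illustrative, and a real job would run the same mapper and reducer as a Hadoop job over HDFS.

```python
# Minimal in-memory sketch of the MapReduce pattern: count diagnosis codes
# across patient records. Illustrative only; not tied to any specific data set.
from collections import defaultdict

records = [
    {"patient": "p1", "diagnosis": "diabetes"},
    {"patient": "p2", "diagnosis": "heart disease"},
    {"patient": "p3", "diagnosis": "diabetes"},
]

def map_phase(record):                  # mapper: emit one (key, 1) pair per record
    yield record["diagnosis"], 1

def reduce_phase(key, values):          # reducer: sum the counts for one key
    return key, sum(values)

shuffled = defaultdict(list)            # shuffle: group mapper output by key
for record in records:
    for key, value in map_phase(record):
        shuffled[key].append(value)

print(dict(reduce_phase(k, v) for k, v in shuffled.items()))
# {'diabetes': 2, 'heart disease': 1}
```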


2020
Vol 6 (1)
Author(s):  
Yuri Amorim Coutinho ◽  
Nico Vervliet ◽  
Lieven De Lathauwer ◽  
Nele Moelans

Abstract: Multicomponent alloys show intricate microstructure evolution, providing materials engineers with a nearly inexhaustible variety of solutions to enhance material properties. Multicomponent microstructure evolution simulations are indispensable to exploit these opportunities. These simulations, however, require the handling of high-dimensional and prohibitively large data sets of thermodynamic quantities, of which the size grows exponentially with the number of elements in the alloy, making it virtually impossible to handle the effects of four or more elements. In this paper, we introduce the use of tensor completion for high-dimensional data sets in materials science as a general and elegant solution to this problem. We show that we can obtain an accurate representation of the composition dependence of high-dimensional thermodynamic quantities, and that the decomposed tensor representation can be evaluated very efficiently in microstructure simulations. This realization enables true multicomponent thermodynamic and microstructure modeling for alloy design.
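To make the storage argument concrete, here is a small sketch (with illustrative shapes and a random model, not the paper's actual alloys or rank) of why a canonical polyadic (CP) decomposition helps: the factor matrices are tiny compared with the full composition grid, and a single entry can be evaluated without ever forming the full tensor.

```python
# Minimal sketch: evaluate one entry of a CP-decomposed thermodynamic table.
# The factor matrices would normally come from tensor completion; here they are random.
import numpy as np

rng = np.random.default_rng(0)
grid, rank, n_axes = 50, 4, 5                     # 50 grid points per composition axis, rank-4 model
factors = [rng.normal(size=(grid, rank)) for _ in range(n_axes)]

def cp_entry(factors, index):
    """Value of the decomposed tensor at a multi-index, summed over its rank-1 terms."""
    rows = np.stack([f[i] for f, i in zip(factors, index)])   # shape (n_axes, rank)
    return rows.prod(axis=0).sum()

dense_entries = grid ** n_axes                    # 312,500,000 values if stored as a full grid
stored_entries = sum(f.size for f in factors)     # 1,000 values in CP form
print(dense_entries, stored_entries, cp_entry(factors, (3, 1, 4, 1, 5)))
```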


Author(s):  
Zeyi Wen ◽  
Qinbin Li ◽  
Bingsheng He ◽  
Bin Cui

In the last few years, Gradient Boosting Decision Trees (GBDTs) have been widely used in various applications such as online advertising and spam filtering. However, GBDT training is often a key performance bottleneck for such data science pipelines, especially for training a large number of deep trees on large data sets. Thus, many parallel and distributed GBDT systems have been researched and developed to accelerate the training process. In this survey paper, we review the recent GBDT systems with respect to accelerations with emerging hardware as well as cluster computing, and compare the advantages and disadvantages of the existing implementations. Finally, we present the research opportunities and challenges in designing fast next generation GBDT systems.
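For readers unfamiliar with the training step being accelerated, the sketch below trains a histogram-based GBDT classifier (feature binning is one of the standard accelerations in such systems) on synthetic data; the data set and hyper-parameters are illustrative and not taken from the surveyed systems.

```python
# Minimal sketch of GBDT training with a histogram-based (binned-feature) implementation.
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=50_000, n_features=40, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = HistGradientBoostingClassifier(max_iter=200, max_depth=8, learning_rate=0.1)
model.fit(X_tr, y_tr)          # each boosting iteration grows one tree on binned features
print(f"held-out accuracy: {model.score(X_te, y_te):.3f}")
```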


2021 ◽  
Author(s):  
Amy Bednar

A growing area of mathematics, topological data analysis (TDA) uses fundamental concepts of topology to analyze complex, high-dimensional data. A topological network represents the data, and TDA uses this network to analyze the shape of the data and identify features in the network that correspond to patterns in the data. These features are then used to extract knowledge from the data. TDA provides a framework to advance machine learning's ability to understand and analyze large, complex data. This paper provides background information about TDA, TDA applications for large data sets, and details related to the investigation and implementation of existing tools and environments.
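A minimal sketch of the first step in such a pipeline is given below, assuming nothing beyond standard scientific Python: points are connected when they lie within a scale epsilon, and the 0-dimensional topological features (connected components) of the resulting network are counted. A full TDA tool would track how these features persist as the scale varies; the two-blob data and the chosen scale are illustrative.

```python
# Minimal sketch: build an epsilon-neighbourhood network and count its connected
# components (0-dimensional topological features). Illustrative data and scale.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
cloud = np.vstack([rng.normal(0.0, 0.3, (100, 5)),   # two well-separated blobs in 5-D
                   rng.normal(4.0, 0.3, (100, 5))])

epsilon = 1.5
adjacency = squareform(pdist(cloud)) < epsilon        # edge wherever two points are epsilon-close
n_features, _ = connected_components(csr_matrix(adjacency), directed=False)
print(n_features)                                     # -> 2, one component per blob
```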


2014
Vol 2014
pp. 1-9
Author(s):  
Jiangyuan Mei ◽  
Jian Hou ◽  
Jicheng Chen ◽  
Hamid Reza Karimi

Classification of large data sets is widely used in many industrial applications. It is a challenging task to classify large data sets efficiently, accurately, and robustly, as large data sets always contain numerous instances in a high-dimensional feature space. In order to deal with this problem, in this paper we present an online Logdet divergence based metric learning (LDML) model that makes use of the power of metric learning. We first generate a Mahalanobis matrix by learning the training data with the LDML model. Meanwhile, we propose a compressed representation of the high-dimensional Mahalanobis matrix to reduce the computational complexity of each iteration. The final Mahalanobis matrix obtained this way measures the distances between instances accurately and serves as the basis of classifiers, for example, the k-nearest neighbors classifier. Experiments on benchmark data sets demonstrate that the proposed algorithm compares favorably with the state-of-the-art methods.
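As a hint of how the learned matrix is consumed downstream, the sketch below runs a k-nearest-neighbors classifier under a Mahalanobis metric; here the matrix is simply the inverse sample covariance as a stand-in, whereas the paper learns and compresses it with the online LogDet-divergence model.

```python
# Minimal sketch: k-NN classification under a Mahalanobis metric. The matrix M below
# is the inverse sample covariance, standing in for the matrix an LDML model would learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2_000, n_features=15, n_informative=10,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

M = np.linalg.inv(np.cov(X_tr, rowvar=False))         # stand-in for the learned Mahalanobis matrix
knn = KNeighborsClassifier(n_neighbors=5, algorithm="brute",
                           metric="mahalanobis", metric_params={"VI": M})
knn.fit(X_tr, y_tr)
print(f"k-NN accuracy under the Mahalanobis metric: {knn.score(X_te, y_te):.3f}")
```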


Author(s):  
Mohammad Atique ◽  
Leena Homraj Patil

Attribute reduction and feature selection are central issues in rough set theory. Researchers have proposed several rough-set-based attribute reduction methods; however, these methods are time consuming for large data sets. Since the key lies in reducing the attributes and selecting the relevant features, the main aim is to reduce the dimensionality of huge amounts of data to obtain a smaller subset that provides the useful information. Feature selection reduces the dimensionality of the feature space and improves overall performance. The challenge in feature selection is dealing with high-dimensional data. To overcome these issues and challenges, this chapter describes a feature selection method based on the proposed neighborhood positive approximation approach, together with attribute reduction for data sets. The proposed system implements attribute reduction and finds the relevant features. Evaluation shows that the proposed neighborhood positive approximation algorithm is effective and feasible for large data sets and also reduces the feature space.
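For orientation, the sketch below performs classical positive-region (dependency-based) attribute reduction on a toy decision table; the chapter's neighborhood positive approximation generalizes this idea to numeric, high-dimensional data. The table values are invented for illustration.

```python
# Minimal sketch of greedy attribute reduction using the rough-set positive region:
# keep adding the conditional attribute that raises the decision dependency the most.
# Toy decision table; rows are (conditional attribute values, decision).
table = [((0, 1, 0), "yes"), ((0, 1, 1), "yes"), ((1, 0, 0), "no"),
         ((1, 1, 0), "no"),  ((0, 0, 1), "yes"), ((1, 0, 1), "no")]

def dependency(attrs):
    """Fraction of objects whose equivalence class under `attrs` has a single decision."""
    groups = {}
    for values, decision in table:
        groups.setdefault(tuple(values[a] for a in attrs), set()).add(decision)
    consistent = sum(1 for values, _ in table
                     if len(groups[tuple(values[a] for a in attrs)]) == 1)
    return consistent / len(table)

reduct, remaining = [], {0, 1, 2}
while dependency(reduct) < 1.0 and remaining:
    best = max(remaining, key=lambda a: dependency(reduct + [a]))
    reduct.append(best)
    remaining.remove(best)

print(reduct, dependency(reduct))      # e.g. [0] 1.0: attribute 0 alone determines the decision
```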

