Quantitative and Visual Exploratory Data Analysis for Machine Intelligence

Author(s):  
Dharmendra Trikamlal Patel

Exploratory data analysis (EDA) is a technique for analyzing data sets to summarize their main characteristics using quantitative and visual methods. The chapter begins by introducing EDA, discusses the conventional view of it, and describes its main limitations. It explores the features of quantitative and visual EDA in detail, covers the statistical techniques relevant to EDA, and emphasizes the main visual techniques for representing data efficiently. R is well suited to summarizing the main characteristics of a data set both quantitatively and visually, and the chapter provides practical exposure to various plotting systems in R. Finally, the chapter covers current research and future trends in EDA.

1980 ◽  
Vol 37 (2) ◽  
pp. 290-294 ◽  
Author(s):  
K. H. Reckhow

Water quality sampling and data analysis are undertaken to acquire and convey information. Therefore, when data are presented, the form of the presentation should be such that information transfer is high. For example, a graph or table of average values is often an inadequate summary of batches of data. As an alternative, a technique developed for exploratory data analysis is presented that can be used to display several sets of data on a single graph, indicating median, spread, skew, size of data set, and statistical significance of the median. This technique is useful in the study of phosphorus concentration variability in lakes. Additions to, and modifications of, this procedure are easily made and will often enhance the analysis of a particular problem. Some suggestions are made for useful modifications of the plots in the study and display of phosphorus lake data and models.

Key words: limnology, exploratory data analysis, statistics, phosphorus, water quality, models, lakes
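The abstract does not spell out how the significance of the median is drawn. As a minimal sketch, assuming the display follows the common notched box plot convention (a notch of median ± 1.57 × IQR / √n, with box width often scaled by √n to show batch size), the quantities could be computed as follows. The function name `notch_interval` and the sample values are hypothetical and are not taken from the paper.

```python
import math
import statistics


def notch_interval(values):
    """Return (low, high) of the notch around the median.

    Assumes the usual notched box plot convention:
    median +/- 1.57 * IQR / sqrt(n).
    """
    xs = sorted(values)
    n = len(xs)
    # Quartiles computed with the "inclusive" method of the standard library.
    q1, med, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    half_width = 1.57 * (q3 - q1) / math.sqrt(n)
    return med - half_width, med + half_width


# Hypothetical batch of nine measurements (illustrative values only).
batch = [1, 2, 3, 4, 5, 6, 7, 8, 9]
low, high = notch_interval(batch)

# A relative box width proportional to sqrt(n) conveys the batch size.
relative_width = math.sqrt(len(batch))
```

Under this convention, two batches whose notches do not overlap are usually read as having medians that differ at roughly the 5% level.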


1990 ◽  
Vol 83 (2) ◽  
pp. 108-112
Author(s):  
James L. Mullenex

Box plots are used for the purpose of analyzing and displaying important features of sets of data. More specifically, box plots are used as graphical representations of five-number summaries. Box plots and five-number summaries are new statistical techniques that were developed by John W. Tukey of Bell Telephone Laboratories. They are parts of a larger set of modern statistical techniques known collectively as exploratory data analysis, or EDA.
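A five-number summary can be sketched in a few lines. The version below is a minimal illustration that assumes Tukey-style hinges (the medians of the lower and upper halves, with the overall median included in both halves when the batch size is odd); other quartile conventions give slightly different values. The function name `five_number_summary` is hypothetical.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, lower hinge, median, upper hinge, maximum)."""
    xs = sorted(values)
    n = len(xs)
    # Each half holds ceil(n / 2) values, so the middle value is shared
    # by both halves when n is odd.
    half = (n + 1) // 2
    lower, upper = xs[:half], xs[n - half:]
    return (xs[0], median(lower), median(xs), median(upper), xs[-1])


summary_odd = five_number_summary([9, 1, 8, 2, 7, 3, 6, 4, 5])
summary_even = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8])
```

A box plot draws the box from the lower hinge to the upper hinge, marks the median inside it, and extends whiskers toward the extremes.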


2006 ◽  
Vol 3 (4) ◽  
pp. 1487-1516 ◽  
Author(s):  
L. Peeters ◽  
F. Bação ◽  
V. Lobo ◽  
A. Dassargues

Abstract. The use of unsupervised artificial neural network techniques like the self-organizing map (SOM) algorithm has proven to be a useful tool in exploratory data analysis and clustering of multivariate data sets. In this study a variant of the SOM-algorithm is proposed, the GEO3DSOM, capable of explicitly incorporating three-dimensional spatial knowledge into the algorithm. The performance of the GEO3DSOM is compared to the performance of the standard SOM in analyzing an artificial data set and a hydrochemical data set. The hydrochemical data set consists of 141 groundwater samples collected in two detritic, phreatic, Cenozoic aquifers in Central Belgium. The standard SOM proves to be more adequate in representing the structure of the data set and to explore relationships between variables. The GEO3DSOM on the other hand performs better in creating spatially coherent groups based on the data.


2014 ◽  
Vol 53 (1) ◽  
pp. 1-14 ◽  
Author(s):  
Saima Naeem ◽  
Asad Zaman

Razzaque (2009) studied the role of gender in the ultimatum game by running experiments on students in various cities in Pakistan. He used standard confirmatory data analysis techniques, which work well in familiar contexts, where relevant hypotheses of interest are known in advance. Our goal in this paper is to demonstrate that exploratory data analysis is much better suited to the study of experimental data where the goal is to discover patterns of interest. Our exploratory re-analysis of the original data set of Razzaque (2009) leads to several new insights. While we re-confirm the main finding of Razzaque regarding the greater generosity of males, additional analysis suggests that this is driven by student subculture in Pakistan, and would not generalise to the population at large. In addition, we find a strong effect of urbanisation. Our exploratory data analysis also offers considerable additional insights into the learning process that takes place over the course of a sequence of games.

JEL Classification: C78, C81, C91, J16

Keywords: Ultimatum Game, Gender Differences, Exploratory Data Analysis


2007 ◽  
Vol 11 (4) ◽  
pp. 1309-1321 ◽  
Author(s):  
L. Peeters ◽  
F. Bação ◽  
V. Lobo ◽  
A. Dassargues

Abstract. The use of unsupervised artificial neural network techniques like the self-organizing map (SOM) algorithm has proven to be a useful tool in exploratory data analysis and clustering of multivariate data sets. In this study a variant of the SOM-algorithm is proposed, the GEO3DSOM, capable of explicitly incorporating three-dimensional spatial knowledge into the algorithm. The performance of the GEO3DSOM is compared to the performance of the standard SOM in analyzing an artificial data set and a hydrochemical data set. The hydrochemical data set consists of 131 groundwater samples collected in two detritic, phreatic, Cenozoic aquifers in Central Belgium. Both techniques succeed very well in providing more insight in the groundwater quality data set, visualizing the relationships between variables, highlighting the main differences between groups of samples and pointing out anomalous wells and well screens. The GEO3DSOM however has the advantage to provide an increased resolution while still maintaining a good generalization of the data set.


Antibiotics ◽  
2019 ◽  
Vol 8 (4) ◽  
pp. 225 ◽  
Author(s):  
Antonio Gnoni ◽  
Emanuele De Nitto ◽  
Salvatore Scacco ◽  
Luigi Santacroce ◽  
Luigi Leonardo Palese

Sepsis is a life-threatening condition that accounts for numerous deaths worldwide, usually as a complication of common community infections (e.g., pneumonia) or of infections acquired during a hospital stay. Sepsis and septic shock, its most severe evolution, involve the whole organism, recruiting and producing many molecules, mostly proteins. Proteins are dynamic entities, and a large number of techniques and studies have been devoted to elucidating the relationship between the conformations adopted by proteins and their function. Although molecular dynamics has a key role in understanding these relationships, the number of protein structures available in the databases is so high that it is currently possible to build data sets from experimentally determined structures. Techniques for dimensionality reduction and clustering can be applied in exploratory data analysis to obtain information on the function of these molecules, and this may be very useful in immunology to better understand the structure-activity relationship of the numerous proteins involved in host defense, particularly in septic patients. The large number of degrees of freedom that characterizes biomolecules requires special techniques able to analyze this kind of data set (with a small number of entries relative to the number of degrees of freedom). In this work we analyzed the ability of two different types of algorithms to provide information on the structures present in three data sets built using the experimental structures of allosteric proteins involved in sepsis. The results obtained by means of a principal component analysis algorithm and those obtained by a random projection algorithm are largely comparable, proving the effectiveness of random projection methods in structural bioinformatics. The usefulness of random projection in exploratory data analysis is discussed, including validation of the obtained clusters.
We chose these proteins because of their involvement in sepsis and septic shock, with the aim of highlighting the potential of bioinformatics to identify new diagnostic and prognostic tools for patients.
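To illustrate the idea behind random projection in the setting the abstract describes (few entries, many degrees of freedom), here is a minimal sketch of a Gaussian random projection. It is not the authors' implementation: the dimensions, the synthetic data, and the names `random_projection_matrix` and `project` are assumptions made for the example. The sketch checks how well pairwise distances survive the reduction.

```python
import math
import random


def random_projection_matrix(d, k, rng):
    """k x d matrix with i.i.d. N(0, 1/k) entries (Gaussian random projection)."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]


def project(matrix, x):
    """Map a d-dimensional point to k dimensions."""
    return [sum(r * v for r, v in zip(row, x)) for row in matrix]


def distance(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


rng = random.Random(0)
d, k, n = 1000, 200, 5  # many degrees of freedom, few entries (assumed sizes)

# Synthetic stand-in for structural data; real input would be coordinates.
points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]

matrix = random_projection_matrix(d, k, rng)
low = [project(matrix, p) for p in points]

# Ratio of projected to original distance for every pair of points;
# values near 1 mean the geometry is approximately preserved.
ratios = [
    distance(low[i], low[j]) / distance(points[i], points[j])
    for i in range(n)
    for j in range(i + 1, n)
]
```

Unlike principal component analysis, the projection matrix here does not depend on the data, which is why it is cheap to build even when the number of degrees of freedom is large.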


Author(s):  
M. D. Edge

R is a powerful, free software package for performing statistical tasks. It will be used to simulate data, analyze data, and make data displays. More details about R are given in Appendix B.

