Complexity curve: a graphical measure of data complexity and classifier performance

2016 ◽  
Vol 2 ◽  
pp. e76 ◽  
Author(s):  
Julian Zubek ◽  
Dariusz M. Plewczynski

We describe a method for assessing data set complexity based on the estimation of the underlying probability distribution and the Hellinger distance. In contrast to some popular complexity measures, it is not focused on the shape of a decision boundary in a classification task but on the amount of available data with respect to the attribute structure. Complexity is expressed in terms of a graphical plot, which we call a complexity curve. It demonstrates the relative increase of available information with the growth of sample size. We perform a theoretical and experimental examination of the properties of the introduced complexity measure and show its relation to the variance component of classification error. We then compare it with popular data complexity measures on 81 diverse data sets and show that it can contribute to explaining the performance of specific classifiers on these sets. We also apply our methodology to a panel of simple benchmark data sets, demonstrating how it can be used in practice to gain insights into data characteristics. Moreover, we show that the complexity curve is an effective tool for reducing the size of the training set (data pruning), allowing the learning process to be sped up significantly without compromising classification accuracy. The associated code is available to download at https://github.com/zubekj/complexity_curve (open-source Python implementation).
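
A minimal sketch of the idea, assuming per-attribute histogram estimates of the distribution and averaging Hellinger distances over attributes and random subsamples; the binning and averaging choices here are illustrative assumptions rather than the authors' exact implementation (which is available from the repository above):

import numpy as np

def hellinger(p, q):
    # Hellinger distance between two discrete distributions on the same bins.
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def complexity_curve(X, sizes, n_bins=10, n_repeats=10, seed=0):
    # For each subsample size, compare per-attribute histograms estimated from
    # the subsample against those estimated from the full data set, averaging
    # the Hellinger distances over attributes and repeated draws.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    edges = [np.histogram_bin_edges(X[:, j], bins=n_bins) for j in range(d)]
    full = [np.histogram(X[:, j], bins=edges[j])[0] / n for j in range(d)]
    curve = []
    for s in sizes:
        dists = []
        for _ in range(n_repeats):
            idx = rng.choice(n, size=s, replace=False)
            sub = [np.histogram(X[idx, j], bins=edges[j])[0] / s for j in range(d)]
            dists.append(np.mean([hellinger(full[j], sub[j]) for j in range(d)]))
        curve.append(np.mean(dists))
    return np.array(curve)

# Distances shrink towards zero as the subsample approaches the full data set.
X = np.random.default_rng(1).normal(size=(1000, 5))
print(complexity_curve(X, sizes=[10, 50, 100, 500, 1000]))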

2016 ◽  
Author(s):  
Julian Zubek ◽  
Dariusz M Plewczynski

We describe a method for assessing data set complexity based on the estimation of the underlying probability distribution and the Hellinger distance. In contrast to some popular measures, it is not focused on the shape of the decision boundary in a classification task but on the amount of available data with respect to the attribute structure. Complexity is expressed in terms of a graphical plot, which we call a complexity curve. We use it to propose a new variant of the learning curve plot called the generalisation curve. The generalisation curve is a standard learning curve with the x-axis rescaled according to the data set complexity curve. It is a classifier performance measure which shows how well the information present in the data is utilised. We perform a theoretical and experimental examination of the properties of the introduced complexity measure and show its relation to the variance component of classification error. We compare it with popular data complexity measures on 81 diverse data sets and show that it can contribute to explaining the performance of specific classifiers on these sets. We then apply our methodology to a panel of benchmarks of standard machine learning algorithms on typical data sets, demonstrating how it can be used in practice to gain insights into data characteristics and classifier behaviour. Moreover, we show that the complexity curve is an effective tool for reducing the size of the training set (data pruning), allowing the learning process to be sped up significantly without reducing classification accuracy. The associated code is available to download at https://github.com/zubekj/complexity_curve (open-source Python implementation).
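
A small sketch of the rescaling idea, assuming the complexity curve is available as an array of Hellinger distances aligned with the training-set sizes of an ordinary learning curve; using 1 minus that distance as the "information captured" axis is an illustrative assumption, not necessarily the paper's exact construction:

import numpy as np
import matplotlib.pyplot as plt

def plot_generalisation_curve(accuracies, complexity_values):
    # x-axis: information captured at each training-set size (taken here as
    # 1 - Hellinger distance from the complexity curve);
    # y-axis: classifier accuracy from the ordinary learning curve.
    info = 1.0 - np.asarray(complexity_values)
    plt.plot(info, accuracies, marker="o")
    plt.xlabel("information captured (1 - Hellinger distance)")
    plt.ylabel("classification accuracy")
    plt.title("generalisation curve (sketch)")
    plt.show()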


Author(s):  
Jung Hwan Oh ◽  
Jeong Kyu Lee ◽  
Sae Hwang

Data mining, which is defined as the process of extracting previously unknown knowledge and detecting interesting patterns from a massive set of data, has been an active research area. As a result, several commercial products and research prototypes are available nowadays. However, most of these studies have focused on corporate data, typically stored in alpha-numeric databases, and relatively little work has been pursued on the mining of multimedia data (Zaïane, Han, & Zhu, 2000). Digital multimedia differs from previous forms of combined media in that the bits representing text, images, audio, and video can be treated as data by computer programs (Simoff, Djeraba, & Zaïane, 2002). One facet of these diverse data, in terms of underlying models and formats, is that they are synchronized and integrated and can hence be treated as integrated data records. The collection of such integrated data records constitutes a multimedia data set. The challenge of extracting meaningful patterns from such data sets has led to research and development in the area of multimedia data mining. This is a challenging field due to the non-structured nature of multimedia data. Such ubiquitous data are required in many applications, such as financial, medical, advertising, and Command, Control, Communications and Intelligence (C3I) applications (Thuraisingham, Clifton, Maurer, & Ceruti, 2001). Multimedia databases are widespread, and multimedia data sets are extremely large. There are tools for managing and searching within such collections, but the need for tools to extract hidden and useful knowledge embedded within multimedia data is becoming critical for many decision-making applications.


Author(s):  
Brian Hoeschen ◽  
Darcy Bullock ◽  
Mark Schlappi

Historically, stopped delay was used to characterize the operation of intersection movements because it was relatively easy to measure. During the past decade, the traffic engineering community has moved away from using stopped delay and now uses control delay. That measurement is more precise but quite difficult to extract from large data sets if strict definitions are used to derive the data. This paper evaluates two procedures for estimating control delay. The first is based on a historical approximation that control delay is 30% larger than stopped delay. The second is new and based on segment delay. The procedures are applied to a diverse data set collected in Phoenix, Arizona, and compared with control delay calculated by using the formal definition. The new approximation was observed to be better than the historical stopped delay procedure; it provided an accurate prediction of control delay. Because it is an approximation, this methodology would be most appropriately applied to large data sets collected from travel time studies for ranking and prioritizing intersections for further analysis.
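
Both approximations reduce to one-line calculations; the sketch below spells them out, with the segment-delay variant written as observed segment travel time minus free-flow travel time, which is an assumption about the general idea rather than the paper's exact procedure:

def control_delay_from_stopped(stopped_delay_s):
    # Historical approximation: control delay taken to be 30% larger than the
    # measured stopped delay (seconds per vehicle).
    return 1.3 * stopped_delay_s

def control_delay_from_segment(segment_travel_time_s, free_flow_travel_time_s):
    # Segment-delay approximation (sketch of the idea only): delay is the
    # observed travel time over a segment spanning the intersection minus the
    # free-flow travel time for the same segment.
    return segment_travel_time_s - free_flow_travel_time_s

# Example: 25 s of stopped delay implies roughly 32.5 s of control delay.
print(control_delay_from_stopped(25.0))
print(control_delay_from_segment(95.0, 60.0))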


2020 ◽  
Vol 45 (4) ◽  
pp. 737-763 ◽  
Author(s):  
Anirban Laha ◽  
Parag Jain ◽  
Abhijit Mishra ◽  
Karthik Sankaranarayanan

We present a framework for generating natural language descriptions from structured data such as tables; the problem falls under the category of data-to-text natural language generation (NLG). Modern data-to-text NLG systems typically use end-to-end statistical and neural architectures that learn from a limited amount of task-specific labeled data, and therefore exhibit limited scalability, domain-adaptability, and interpretability. Unlike these systems, ours is a modular, pipeline-based approach that does not require task-specific parallel data. Rather, it relies on monolingual corpora and basic off-the-shelf NLP tools. This makes our system more scalable and easily adaptable to newer domains. Our system utilizes a three-stage pipeline that: (i) converts entries in the structured data to canonical form, (ii) generates simple sentences for each atomic entry in the canonicalized representation, and (iii) combines the sentences to produce a coherent, fluent, and adequate paragraph description through sentence compounding and co-reference replacement modules. Experiments on a benchmark mixed-domain data set curated for paragraph description from tables reveal the superiority of our system over existing data-to-text approaches. We also demonstrate the robustness of our system in accepting other popular data sets covering diverse data types such as knowledge graphs and key-value maps.
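
A toy sketch of the three-stage pipeline, with trivial stand-ins for each stage; the record keys, the template, and the naive sentence joining are illustrative assumptions, whereas the actual system uses richer canonicalization, simple-sentence generation, sentence compounding, and co-reference replacement modules:

from typing import Dict, List, Tuple

def canonicalize(record: Dict[str, str]) -> List[Tuple[str, str, str]]:
    # Stage (i), sketched: flatten a table row into atomic
    # (subject, attribute, value) entries.
    subject = record.get("name", "the entity")  # assumed key for illustration
    return [(subject, attr, val) for attr, val in record.items() if attr != "name"]

def realize_sentence(entry: Tuple[str, str, str]) -> str:
    # Stage (ii), sketched: a single template in place of the real generator.
    subject, attr, val = entry
    return f"{subject} has {attr} {val}."

def compose_paragraph(sentences: List[str]) -> str:
    # Stage (iii), sketched: naive joining instead of sentence compounding and
    # co-reference replacement.
    return " ".join(sentences)

row = {"name": "Marie Curie", "birth year": "1867", "field": "physics"}
print(compose_paragraph([realize_sentence(e) for e in canonicalize(row)]))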


2008 ◽  
pp. 1631-1637
Author(s):  
Jung Hwan Oh ◽  
Jeong Kyu Lee ◽  
Sae Hwang

Data mining, which is defined as the process of extracting previously unknown knowledge and detecting interesting patterns from a massive set of data, has been an active research area. As a result, several commercial products and research prototypes are available nowadays. However, most of these studies have focused on corporate data, typically stored in alpha-numeric databases, and relatively little work has been pursued on the mining of multimedia data (Zaïane, Han, & Zhu, 2000). Digital multimedia differs from previous forms of combined media in that the bits representing text, images, audio, and video can be treated as data by computer programs (Simoff, Djeraba, & Zaïane, 2002). One facet of these diverse data, in terms of underlying models and formats, is that they are synchronized and integrated and can hence be treated as integrated data records. The collection of such integrated data records constitutes a multimedia data set. The challenge of extracting meaningful patterns from such data sets has led to research and development in the area of multimedia data mining. This is a challenging field due to the non-structured nature of multimedia data. Such ubiquitous data are required in many applications, such as financial, medical, advertising, and Command, Control, Communications and Intelligence (C3I) applications (Thuraisingham, Clifton, Maurer, & Ceruti, 2001). Multimedia databases are widespread, and multimedia data sets are extremely large. There are tools for managing and searching within such collections, but the need for tools to extract hidden and useful knowledge embedded within multimedia data is becoming critical for many decision-making applications.


2020 ◽  
Vol 2 (4) ◽  
pp. 443-486 ◽  
Author(s):  
Sabbir M. Rashid ◽  
James P. McCusker ◽  
Paulo Pinheiro ◽  
Marcello P. Bax ◽  
Henrique O. Santos ◽  
...  

It is common practice for data providers to include text descriptions for each column when publishing data sets, in the form of data dictionaries. While these documents are useful in helping an end-user properly interpret the meaning of a column in a data set, existing data dictionaries typically are not machine-readable and do not follow a common specification standard. We introduce the Semantic Data Dictionary, a specification that formalizes the assignment of a semantic representation to data, enabling standardization and harmonization across diverse data sets. In this paper, we present the Semantic Data Dictionary in the context of our work with biomedical data; however, the approach can be, and has been, used in a wide range of domains. The rendition of data in this form helps promote improved discovery, interoperability, reuse, traceability, and reproducibility. We present the associated research and describe how the Semantic Data Dictionary can help address existing limitations in the related literature. We discuss our approach, present an example by annotating portions of the publicly available National Health and Nutrition Examination Survey data set, present modeling challenges, and describe the use of this approach in sponsored research, including our work on a large National Institutes of Health (NIH)-funded exposure and health data portal and in the RPI-IBM collaborative Health Empowerment by Analytics, Learning, and Semantics project. We evaluate this work in comparison with traditional data dictionaries, mapping languages, and data integration tools.
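
As a toy illustration of the general idea of machine-readable column annotations (not the Semantic Data Dictionary specification itself), metadata can be attached to each column of a tabular data set instead of a free-text description; the NHANES-style column names and the annotation keys below are assumptions made for illustration:

# Placeholder annotation structure; in a real Semantic Data Dictionary the
# attributes, entities, and units would be identified by ontology terms.
column_annotations = {
    "RIDAGEYR": {
        "attribute": "age",
        "attribute_of": "study participant",
        "unit": "year",
    },
    "BMXWT": {
        "attribute": "body weight",
        "attribute_of": "study participant",
        "unit": "kilogram",
    },
}

def describe(column):
    a = column_annotations[column]
    return f"{column}: {a['attribute']} of a {a['attribute_of']}, in {a['unit']}s"

print(describe("RIDAGEYR"))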


Geophysics ◽  
2006 ◽  
Vol 71 (6) ◽  
pp. G301-G312 ◽  
Author(s):  
Ross Brodie ◽  
Malcolm Sambridge

We have developed a holistic method for simultaneously calibrating, processing, and inverting frequency-domain airborne electromagnetic data. A spline-based, 3D, layered conductivity model covering the complete survey area was recovered through inversion of the entire raw airborne data set and available independent conductivity and interface-depth data. The holistic inversion formulation includes a mathematical model to account for systematic calibration errors such as incorrect gain and zero-level drift. By taking these elements into account in the inversion, the need to preprocess the airborne data prior to inversion is eliminated. Conventional processing schemes involve the sequential application of a number of calibration corrections, with data from each frequency treated separately. This is followed by inversion of each multifrequency sample in isolation from other samples. By simultaneously considering all of the available information in a holistic inversion, we are able to exploit interfrequency and spatial-coherency characteristics of the data. The formulation ensures that the conductivity and calibration models are optimal with respect to the airborne data and prior information. Introduction of interfrequency inconsistency and multistage error propagation stemming from the sequential nature of conventional processing schemes is also avoided. We confirm that accurate conductivity and calibration parameter values are recovered from holistic inversion of synthetic data sets. We demonstrate that the results from holistic inversion of raw survey data are superior to the output of conventional 1D inversion of final processed data. In addition to the technical benefits, we expect that holistic inversion will reduce costs by avoiding the expensive calibration-processing-recalibration paradigm. Furthermore, savings may also be made because specific high-altitude zero-level observations, needed for conventional processing, may not be required.
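
A generic regularized least-squares formulation consistent with this description (though not necessarily the authors' exact objective) jointly estimates the spline-based conductivity model $\mathbf{m}$ and the calibration parameters $\mathbf{c}$ (gains and zero-level drifts) that enter the forward model:

$$
\min_{\mathbf{m},\,\mathbf{c}} \;
\lVert \mathbf{W}_d \left( \mathbf{d} - f(\mathbf{m},\mathbf{c}) \right) \rVert^2
+ \lambda_m \lVert \mathbf{W}_m \left( \mathbf{m} - \mathbf{m}_{\mathrm{ref}} \right) \rVert^2
+ \lambda_c \lVert \mathbf{W}_c \left( \mathbf{c} - \mathbf{c}_{\mathrm{ref}} \right) \rVert^2 ,
$$

where $\mathbf{d}$ collects the raw airborne observations across all frequencies and flight lines, $f$ is the layered-earth forward model composed with the calibration model, and the reference terms encode the independent conductivity and interface-depth data together with prior calibration values.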


2020 ◽  
Author(s):  
David S. Fischer ◽  
Leander Dony ◽  
Martin König ◽  
Abdul Moeed ◽  
Luke Zappia ◽  
...  

Exploratory analysis of single-cell RNA-seq data sets is currently based on statistical and machine learning models that are adapted to each new data set from scratch. A typical analysis workflow includes a choice of dimensionality reduction, selection of clustering parameters, and mapping of prior annotation. These steps typically require several iterations and can take up significant time in many single-cell RNA-seq projects. Here, we introduce sfaira, a single-cell data and model zoo that houses data sets as well as pre-trained models. The data zoo is designed to facilitate the fast and easy contribution of data sets, interfacing to a large community of data providers. Sfaira currently includes 233 data sets across 45 organs and 3.1 million cells in both human and mouse. Using these data sets, we have trained eight different example model classes, such as autoencoders and logistic cell type predictors. The infrastructure of sfaira is model agnostic and allows the training and usage of many previously published models. Sfaira directly aids exploratory data analysis by replacing embedding and cell type annotation workflows with end-to-end pre-trained parametric models. As further example use cases for sfaira, we demonstrate the extraction of gene-centric data statistics across many tissues, improved usage of cell type labels at different levels of coarseness, and an application for learning interpretable models through data regularization on extremely diverse data sets.
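
As a concept sketch of what replacing an annotation workflow with an end-to-end pre-trained parametric model means in practice, the snippet below fits a logistic cell-type predictor on a reference expression matrix and applies it directly to a new, unannotated data set; it uses scikit-learn on synthetic data as a stand-in and is not sfaira's API:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_ref = rng.poisson(1.0, size=(500, 2000)).astype(float)  # reference counts
y_ref = rng.integers(0, 3, size=500)                      # reference cell type labels

# "Pre-trained" parametric predictor (conceptually, what a model zoo would supply).
pretrained = LogisticRegression(max_iter=1000).fit(np.log1p(X_ref), y_ref)

X_new = rng.poisson(1.0, size=(100, 2000)).astype(float)  # new, unannotated data set
predicted_types = pretrained.predict(np.log1p(X_new))
print(predicted_types[:10])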


2021 ◽  
Author(s):  
Mohammad D. Ashkezari ◽  
Norland R. Hagen ◽  
Michael Denholtz ◽  
Andrew Neang ◽  
Tansy C. Burns ◽  
...  

Simons Collaborative Marine Atlas Project (Simons CMAP) is an open-source data portal that interconnects large, complex, and diverse public data sets currently dispersed in different formats across different oceanography discipline-specific databases. Simons CMAP is designed to streamline the retrieval of custom subsets of data, the generation of data visualizations, and the analysis of diverse data, thus expanding the power of these potentially underutilized data sets for cross-disciplinary studies of ocean processes. We describe a unified architecture that allows numerical model outputs, satellite products, and field observations to be readily shared, mined, and integrated regardless of data set size or resolution. A current focus of Simons CMAP is the integration of physical, chemical, and biological data sets essential for characterizing the biogeography of key marine microbes across ocean basins and seasonal cycles. Using a practical example, we demonstrate how our unifying data harmonization plan significantly simplifies and allows for systematic data integration across all Simons CMAP data sets.
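
As a small illustration of the harmonization idea of matching data sets with different types and resolutions on a common basis (a concept sketch only, not Simons CMAP's implementation or API), the snippet below samples a synthetic gridded field at the positions of in-situ observations so the two sources can be compared directly:

import numpy as np
from scipy.interpolate import griddata

# Synthetic gridded field standing in for a model or satellite product.
lat_grid, lon_grid = np.meshgrid(np.linspace(-80, 80, 81),
                                 np.linspace(0, 358, 180), indexing="ij")
sst_grid = 28.0 - 0.25 * np.abs(lat_grid)

# Positions of hypothetical in-situ (cruise) observations.
obs_lat = np.array([10.5, -23.1, 47.8])
obs_lon = np.array([140.2, 15.7, 322.4])

grid_points = np.column_stack([lat_grid.ravel(), lon_grid.ravel()])
sst_at_obs = griddata(grid_points, sst_grid.ravel(),
                      np.column_stack([obs_lat, obs_lon]), method="linear")
print(sst_at_obs)  # gridded values co-located with the observations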

