Complexity curve: a graphical measure of data complexity and classifier performance

2016 ◽  
Vol 2 ◽  
pp. e76 ◽  
Author(s):  
Julian Zubek ◽  
Dariusz M. Plewczynski

We describe a method for assessing data set complexity based on the estimation of the underlying probability distribution and the Hellinger distance. In contrast to some popular complexity measures, it is not focused on the shape of a decision boundary in a classification task but on the amount of available data with respect to the attribute structure. Complexity is expressed in terms of a graphical plot, which we call a complexity curve. It demonstrates the relative increase of available information with the growth of sample size. We perform a theoretical and experimental examination of the properties of the introduced complexity measure and show its relation to the variance component of classification error. We then compare it with popular data complexity measures on 81 diverse data sets and show that it can contribute to explaining the performance of specific classifiers on these sets. We also apply our methodology to a panel of simple benchmark data sets, demonstrating how it can be used in practice to gain insights into data characteristics. Moreover, we show that the complexity curve is an effective tool for reducing the size of the training set (data pruning), allowing the learning process to be sped up significantly without compromising classification accuracy. The associated code is available to download at https://github.com/zubekj/complexity_curve (open-source Python implementation).
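
A minimal sketch of the idea, assuming per-attribute histogram estimates of the distribution and averaging Hellinger distances over attributes and random subsamples; the binning and averaging choices here are illustrative assumptions rather than the authors' exact implementation (which is available from the repository above):

import numpy as np

def hellinger(p, q):
    # Hellinger distance between two discrete distributions on the same bins.
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def complexity_curve(X, sizes, n_bins=10, n_repeats=10, seed=0):
    # For each subsample size, compare per-attribute histograms estimated from
    # the subsample against those estimated from the full data set, averaging
    # the Hellinger distances over attributes and repeated draws.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    edges = [np.histogram_bin_edges(X[:, j], bins=n_bins) for j in range(d)]
    full = [np.histogram(X[:, j], bins=edges[j])[0] / n for j in range(d)]
    curve = []
    for s in sizes:
        dists = []
        for _ in range(n_repeats):
            idx = rng.choice(n, size=s, replace=False)
            sub = [np.histogram(X[idx, j], bins=edges[j])[0] / s for j in range(d)]
            dists.append(np.mean([hellinger(full[j], sub[j]) for j in range(d)]))
        curve.append(np.mean(dists))
    return np.array(curve)

# Distances shrink towards zero as the subsample approaches the full data set.
X = np.random.default_rng(1).normal(size=(1000, 5))
print(complexity_curve(X, sizes=[10, 50, 100, 500, 1000]))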

2016 ◽  
Author(s):  
Julian Zubek ◽  
Dariusz M Plewczynski

We describe a method for assessing data set complexity based on the estimation of the underlying probability distribution and the Hellinger distance. In contrast to some popular measures, it is not focused on the shape of the decision boundary in a classification task but on the amount of available data with respect to the attribute structure. Complexity is expressed in terms of a graphical plot, which we call a complexity curve. We use it to propose a new variant of the learning curve plot called the generalisation curve. The generalisation curve is a standard learning curve with the x-axis rescaled according to the data set complexity curve. It is a classifier performance measure which shows how well the information present in the data is utilised. We perform a theoretical and experimental examination of the properties of the introduced complexity measure and show its relation to the variance component of classification error. We compare it with popular data complexity measures on 81 diverse data sets and show that it can contribute to explaining the performance of specific classifiers on these sets. We then apply our methodology to a panel of benchmarks of standard machine learning algorithms on typical data sets, demonstrating how it can be used in practice to gain insights into data characteristics and classifier behaviour. Moreover, we show that the complexity curve is an effective tool for reducing the size of the training set (data pruning), allowing the learning process to be sped up significantly without reducing classification accuracy. The associated code is available to download at https://github.com/zubekj/complexity_curve (open-source Python implementation).
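
A small sketch of the rescaling idea, assuming the complexity curve is available as an array of Hellinger distances aligned with the training-set sizes of an ordinary learning curve; using 1 minus that distance as the "information captured" axis is an illustrative assumption, not necessarily the paper's exact construction:

import numpy as np
import matplotlib.pyplot as plt

def plot_generalisation_curve(accuracies, complexity_values):
    # x-axis: information captured at each training-set size (taken here as
    # 1 - Hellinger distance from the complexity curve);
    # y-axis: classifier accuracy from the ordinary learning curve.
    info = 1.0 - np.asarray(complexity_values)
    plt.plot(info, accuracies, marker="o")
    plt.xlabel("information captured (1 - Hellinger distance)")
    plt.ylabel("classification accuracy")
    plt.title("generalisation curve (sketch)")
    plt.show()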


Author(s):  
Jung Hwan Oh ◽  
Jeong Kyu Lee ◽  
Sae Hwang

Data mining, which is defined as the process of extracting previously unknown knowledge and detecting interesting patterns from a massive set of data, has been an active research area. As a result, several commercial products and research prototypes are available nowadays. However, most of these studies have focused on corporate data, typically stored in alpha-numeric databases, and relatively little work has been pursued on the mining of multimedia data (Zaïane, Han, & Zhu, 2000). Digital multimedia differs from previous forms of combined media in that the bits representing text, images, audio, and video can be treated as data by computer programs (Simoff, Djeraba, & Zaïane, 2002). One facet of these diverse data, in terms of underlying models and formats, is that they are synchronized and integrated and can hence be treated as integrated data records. The collection of such integrated data records constitutes a multimedia data set. The challenge of extracting meaningful patterns from such data sets has led to research and development in the area of multimedia data mining. This is a challenging field due to the non-structured nature of multimedia data. Such ubiquitous data are required in many applications, such as financial, medical, advertising, and Command, Control, Communications and Intelligence (C3I) applications (Thuraisingham, Clifton, Maurer, & Ceruti, 2001). Multimedia databases are widespread, and multimedia data sets are extremely large. There are tools for managing and searching within such collections, but the need for tools to extract hidden and useful knowledge embedded within multimedia data is becoming critical for many decision-making applications.


Author(s):  
Brian Hoeschen ◽  
Darcy Bullock ◽  
Mark Schlappi

Historically, stopped delay was used to characterize the operation of intersection movements because it was relatively easy to measure. During the past decade, the traffic engineering community has moved away from using stopped delay and now uses control delay. That measurement is more precise but quite difficult to extract from large data sets if strict definitions are used to derive the data. This paper evaluates two procedures for estimating control delay. The first is based on a historical approximation that control delay is 30% larger than stopped delay. The second is new and based on segment delay. The procedures are applied to a diverse data set collected in Phoenix, Arizona, and compared with control delay calculated by using the formal definition. The new approximation was observed to be better than the historical stopped delay procedure; it provided an accurate prediction of control delay. Because it is an approximation, this methodology would be most appropriately applied to large data sets collected from travel time studies for ranking and prioritizing intersections for further analysis.
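
Both approximations reduce to one-line calculations; the sketch below spells them out, with the segment-delay variant written as observed segment travel time minus free-flow travel time, which is an assumption about the general idea rather than the paper's exact procedure:

def control_delay_from_stopped(stopped_delay_s):
    # Historical approximation: control delay taken to be 30% larger than the
    # measured stopped delay (seconds per vehicle).
    return 1.3 * stopped_delay_s

def control_delay_from_segment(segment_travel_time_s, free_flow_travel_time_s):
    # Segment-delay approximation (sketch of the idea only): delay is the
    # observed travel time over a segment spanning the intersection minus the
    # free-flow travel time for the same segment.
    return segment_travel_time_s - free_flow_travel_time_s

# Example: 25 s of stopped delay implies roughly 32.5 s of control delay.
print(control_delay_from_stopped(25.0))
print(control_delay_from_segment(95.0, 60.0))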


2020 ◽  
Vol 45 (4) ◽  
pp. 737-763 ◽  
Author(s):  
Anirban Laha ◽  
Parag Jain ◽  
Abhijit Mishra ◽  
Karthik Sankaranarayanan

We present a framework for generating natural language descriptions from structured data such as tables; the problem falls under the category of data-to-text natural language generation (NLG). Modern data-to-text NLG systems typically use end-to-end statistical and neural architectures that learn from a limited amount of task-specific labeled data, and therefore exhibit limited scalability, domain-adaptability, and interpretability. Unlike these systems, ours is a modular, pipeline-based approach that does not require task-specific parallel data. Rather, it relies on monolingual corpora and basic off-the-shelf NLP tools. This makes our system more scalable and easily adaptable to newer domains. Our system utilizes a three-stage pipeline that: (i) converts entries in the structured data to canonical form, (ii) generates simple sentences for each atomic entry in the canonicalized representation, and (iii) combines the sentences to produce a coherent, fluent, and adequate paragraph description through sentence compounding and co-reference replacement modules. Experiments on a benchmark mixed-domain data set curated for paragraph description from tables reveal the superiority of our system over existing data-to-text approaches. We also demonstrate the robustness of our system in accepting other popular data sets covering diverse data types such as knowledge graphs and key-value maps.
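
A toy sketch of the three-stage pipeline, with trivial stand-ins for each stage; the record keys, the template, and the naive sentence joining are illustrative assumptions, whereas the actual system uses richer canonicalization, simple-sentence generation, sentence compounding, and co-reference replacement modules:

from typing import Dict, List, Tuple

def canonicalize(record: Dict[str, str]) -> List[Tuple[str, str, str]]:
    # Stage (i), sketched: flatten a table row into atomic
    # (subject, attribute, value) entries.
    subject = record.get("name", "the entity")  # assumed key for illustration
    return [(subject, attr, val) for attr, val in record.items() if attr != "name"]

def realize_sentence(entry: Tuple[str, str, str]) -> str:
    # Stage (ii), sketched: a single template in place of the real generator.
    subject, attr, val = entry
    return f"{subject} has {attr} {val}."

def compose_paragraph(sentences: List[str]) -> str:
    # Stage (iii), sketched: naive joining instead of sentence compounding and
    # co-reference replacement.
    return " ".join(sentences)

row = {"name": "Marie Curie", "birth year": "1867", "field": "physics"}
print(compose_paragraph([realize_sentence(e) for e in canonicalize(row)]))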


2008 ◽  
pp. 1631-1637
Author(s):  
Jung Hwan Oh ◽  
Jeong Kyu Lee ◽  
Sae Hwang

Data mining, which is defined as the process of extracting previously unknown knowledge and detecting interesting patterns from a massive set of data, has been an active research area. As a result, several commercial products and research prototypes are available nowadays. However, most of these studies have focused on corporate data, typically stored in alpha-numeric databases, and relatively little work has been pursued on the mining of multimedia data (Zaïane, Han, & Zhu, 2000). Digital multimedia differs from previous forms of combined media in that the bits representing text, images, audio, and video can be treated as data by computer programs (Simoff, Djeraba, & Zaïane, 2002). One facet of these diverse data, in terms of underlying models and formats, is that they are synchronized and integrated and can hence be treated as integrated data records. The collection of such integrated data records constitutes a multimedia data set. The challenge of extracting meaningful patterns from such data sets has led to research and development in the area of multimedia data mining. This is a challenging field due to the non-structured nature of multimedia data. Such ubiquitous data are required in many applications, such as financial, medical, advertising, and Command, Control, Communications and Intelligence (C3I) applications (Thuraisingham, Clifton, Maurer, & Ceruti, 2001). Multimedia databases are widespread, and multimedia data sets are extremely large. There are tools for managing and searching within such collections, but the need for tools to extract hidden and useful knowledge embedded within multimedia data is becoming critical for many decision-making applications.


2020 ◽  
Vol 2 (4) ◽  
pp. 443-486 ◽  
Author(s):  
Sabbir M. Rashid ◽  
James P. McCusker ◽  
Paulo Pinheiro ◽  
Marcello P. Bax ◽  
Henrique O. Santos ◽  
...  

It is common practice for data providers to include text descriptions for each column when publishing data sets, in the form of data dictionaries. While these documents are useful in helping an end-user properly interpret the meaning of a column in a data set, existing data dictionaries typically are not machine-readable and do not follow a common specification standard. We introduce the Semantic Data Dictionary, a specification that formalizes the assignment of a semantic representation to data, enabling standardization and harmonization across diverse data sets. In this paper, we present the Semantic Data Dictionary in the context of our work with biomedical data; however, the approach can be, and has been, used in a wide range of domains. The rendition of data in this form helps promote improved discovery, interoperability, reuse, traceability, and reproducibility. We present the associated research and describe how the Semantic Data Dictionary can help address existing limitations in the related literature. We discuss our approach, present an example by annotating portions of the publicly available National Health and Nutrition Examination Survey data set, present modeling challenges, and describe the use of this approach in sponsored research, including our work on a large National Institutes of Health (NIH)-funded exposure and health data portal and in the RPI-IBM collaborative Health Empowerment by Analytics, Learning, and Semantics project. We evaluate this work in comparison with traditional data dictionaries, mapping languages, and data integration tools.
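
As a toy illustration of the general idea of machine-readable column annotations (not the Semantic Data Dictionary specification itself), metadata can be attached to each column of a tabular data set instead of a free-text description; the NHANES-style column names and the annotation keys below are assumptions made for illustration:

# Placeholder annotation structure; in a real Semantic Data Dictionary the
# attributes, entities, and units would be identified by ontology terms.
column_annotations = {
    "RIDAGEYR": {
        "attribute": "age",
        "attribute_of": "study participant",
        "unit": "year",
    },
    "BMXWT": {
        "attribute": "body weight",
        "attribute_of": "study participant",
        "unit": "kilogram",
    },
}

def describe(column):
    a = column_annotations[column]
    return f"{column}: {a['attribute']} of a {a['attribute_of']}, in {a['unit']}s"

print(describe("RIDAGEYR"))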


Geophysics ◽  
2006 ◽  
Vol 71 (6) ◽  
pp. G301-G312 ◽  
Author(s):  
Ross Brodie ◽  
Malcolm Sambridge

We have developed a holistic method for simultaneously calibrating, processing, and inverting frequency-domain airborne electromagnetic data. A spline-based, 3D, layered conductivity model covering the complete survey area was recovered through inversion of the entire raw airborne data set and available independent conductivity and interface-depth data. The holistic inversion formulation includes a mathematical model to account for systematic calibration errors such as incorrect gain and zero-level drift. By taking these elements into account in the inversion, the need to preprocess the airborne data prior to inversion is eliminated. Conventional processing schemes involve the sequential application of a number of calibration corrections, with data from each frequency treated separately. This is followed by inversion of each multifrequency sample in isolation from other samples. By simultaneously considering all of the available information in a holistic inversion, we are able to exploit interfrequency and spatial-coherency characteristics of the data. The formulation ensures that the conductivity and calibration models are optimal with respect to the airborne data and prior information. Introduction of interfrequency inconsistency and multistage error propagation stemming from the sequential nature of conventional processing schemes is also avoided. We confirm that accurate conductivity and calibration parameter values are recovered from holistic inversion of synthetic data sets. We demonstrate that the results from holistic inversion of raw survey data are superior to the output of conventional 1D inversion of final processed data. In addition to the technical benefits, we expect that holistic inversion will reduce costs by avoiding the expensive calibration-processing-recalibration paradigm. Furthermore, savings may also be made because specific high-altitude zero-level observations, needed for conventional processing, may not be required.
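
A generic regularized least-squares formulation consistent with this description (though not necessarily the authors' exact objective) jointly estimates the spline-based conductivity model $\mathbf{m}$ and the calibration parameters $\mathbf{c}$ (gains and zero-level drifts) that enter the forward model:

$$
\min_{\mathbf{m},\,\mathbf{c}} \;
\lVert \mathbf{W}_d \left( \mathbf{d} - f(\mathbf{m},\mathbf{c}) \right) \rVert^2
+ \lambda_m \lVert \mathbf{W}_m \left( \mathbf{m} - \mathbf{m}_{\mathrm{ref}} \right) \rVert^2
+ \lambda_c \lVert \mathbf{W}_c \left( \mathbf{c} - \mathbf{c}_{\mathrm{ref}} \right) \rVert^2 ,
$$

where $\mathbf{d}$ collects the raw airborne observations across all frequencies and flight lines, $f$ is the layered-earth forward model composed with the calibration model, and the reference terms encode the independent conductivity and interface-depth data together with prior calibration values.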


2020 ◽  
Author(s):  
David S. Fischer ◽  
Leander Dony ◽  
Martin König ◽  
Abdul Moeed ◽  
Luke Zappia ◽  
...  

Exploratory analysis of single-cell RNA-seq data sets is currently based on statistical and machine learning models that are adapted to each new data set from scratch. A typical analysis workflow includes a choice of dimensionality reduction, selection of clustering parameters, and mapping of prior annotation. These steps typically require several iterations and can take up significant time in many single-cell RNA-seq projects. Here, we introduce sfaira, a single-cell data and model zoo that houses data sets as well as pre-trained models. The data zoo is designed to facilitate the fast and easy contribution of data sets, interfacing to a large community of data providers. Sfaira currently includes 233 data sets across 45 organs and 3.1 million cells in both human and mouse. Using these data sets, we have trained eight different example model classes, such as autoencoders and logistic cell type predictors. The infrastructure of sfaira is model agnostic and allows the training and usage of many previously published models. Sfaira directly aids exploratory data analysis by replacing embedding and cell type annotation workflows with end-to-end pre-trained parametric models. As further example use cases for sfaira, we demonstrate the extraction of gene-centric data statistics across many tissues, improved usage of cell type labels at different levels of coarseness, and an application for learning interpretable models through data regularization on extremely diverse data sets.
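
As a concept sketch of what replacing an annotation workflow with an end-to-end pre-trained parametric model means in practice, the snippet below fits a logistic cell-type predictor on a reference expression matrix and applies it directly to a new, unannotated data set; it uses scikit-learn on synthetic data as a stand-in and is not sfaira's API:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_ref = rng.poisson(1.0, size=(500, 2000)).astype(float)  # reference counts
y_ref = rng.integers(0, 3, size=500)                      # reference cell type labels

# "Pre-trained" parametric predictor (conceptually, what a model zoo would supply).
pretrained = LogisticRegression(max_iter=1000).fit(np.log1p(X_ref), y_ref)

X_new = rng.poisson(1.0, size=(100, 2000)).astype(float)  # new, unannotated data set
predicted_types = pretrained.predict(np.log1p(X_new))
print(predicted_types[:10])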


2021 ◽  
Author(s):  
Mohammad D. Ashkezari ◽  
Norland R. Hagen ◽  
Michael Denholtz ◽  
Andrew Neang ◽  
Tansy C. Burns ◽  
...  

Simons Collaborative Marine Atlas Project (Simons CMAP) is an open-source data portal that interconnects large, complex, and diverse public data sets currently dispersed in different formats across different oceanography discipline-specific databases. Simons CMAP is designed to streamline the retrieval of custom subsets of data, the generation of data visualizations, and the analysis of diverse data, thus expanding the power of these potentially underutilized data sets for cross-disciplinary studies of ocean processes. We describe a unified architecture that allows numerical model outputs, satellite products, and field observations to be readily shared, mined, and integrated regardless of data set size or resolution. A current focus of Simons CMAP is the integration of physical, chemical, and biological data sets essential for characterizing the biogeography of key marine microbes across ocean basins and seasonal cycles. Using a practical example, we demonstrate how our unifying data harmonization plan significantly simplifies and allows for systematic data integration across all Simons CMAP data sets.
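
As a small illustration of the harmonization idea of matching data sets with different types and resolutions on a common basis (a concept sketch only, not Simons CMAP's implementation or API), the snippet below samples a synthetic gridded field at the positions of in-situ observations so the two sources can be compared directly:

import numpy as np
from scipy.interpolate import griddata

# Synthetic gridded field standing in for a model or satellite product.
lat_grid, lon_grid = np.meshgrid(np.linspace(-80, 80, 81),
                                 np.linspace(0, 358, 180), indexing="ij")
sst_grid = 28.0 - 0.25 * np.abs(lat_grid)

# Positions of hypothetical in-situ (cruise) observations.
obs_lat = np.array([10.5, -23.1, 47.8])
obs_lon = np.array([140.2, 15.7, 322.4])

grid_points = np.column_stack([lat_grid.ravel(), lon_grid.ravel()])
sst_at_obs = griddata(grid_points, sst_grid.ravel(),
                      np.column_stack([obs_lat, obs_lon]), method="linear")
print(sst_at_obs)  # gridded values co-located with the observations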

