Evaluation of Synthetic Datasets Generation for Intent Classification Tasks in Portuguese

A chatbot is an artificial intelligence based system aimed at chatting with users, commonly used as a virtual assistant to help people or answer questions. Intent classification is an essential task for chatbots that aims to identify what the user wants in a given dialogue. However, for many domains, little data are available to properly train those systems. In this work, we evaluate the performance of two methods to generate synthetic data for chatbots, one based on template questions and another based on neural text generation. We build four datasets that are used to train chatbot components for the intent classification task. We intend to simulate the task of migrating a search-based portal to an interactive dialogue-based information service by using artificial datasets for initial model training. Our results show that template-based datasets are slightly superior to neural-generated ones in our application domain; however, neural-generated datasets yield good results and are a viable option when one has limited access to domain experts to hand-code text templates.
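As a rough illustration of the template-based approach mentioned in this abstract, the sketch below expands hand-written question templates into labeled training examples. The intents, templates, slot values, and the `expand` helper are all hypothetical and written in English for readability; they are not taken from the paper.

```python
from itertools import product

# Hypothetical templates and slot values; a domain expert would write these.
TEMPLATES = {
    "opening_hours": [
        "what time does the {place} open",
        "when is the {place} open",
    ],
    "contact": [
        "how do I contact the {place}",
    ],
}
SLOTS = {"place": ["library", "cafeteria"]}


def expand(templates, slots):
    """Fill every template with every combination of slot values.

    Returns a list of (utterance, intent) pairs.
    """
    examples = []
    for intent, patterns in templates.items():
        for pattern in patterns:
            # Only the slots that actually occur in this pattern.
            names = [n for n in slots if "{" + n + "}" in pattern]
            for values in product(*(slots[n] for n in names)):
                text = pattern.format(**dict(zip(names, values)))
                examples.append((text, intent))
    return examples


dataset = expand(TEMPLATES, SLOTS)
```

Each pair in `dataset` can then be fed to any intent classifier; the quality of the result depends on how well the templates cover real user phrasing.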

G-Tric: generating three-way synthetic datasets with triclustering solutions

BMC Bioinformatics ◽

10.1186/s12859-020-03925-4 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

João Lobo ◽

Rui Henriques ◽

Sara C. Madeira

Keyword(s):

State Of The Art ◽

Synthetic Data ◽

Ground Truth ◽

Real Data ◽

Three Dimensions ◽

Additional Advantage ◽

Urban Dynamics ◽

Data Generator ◽

Real World Datasets ◽

Synthetic Datasets

Background: Three-way data started to gain popularity due to their increasing capacity to describe inherently multivariate and temporal events, such as biological responses, social interactions along time, urban dynamics, or complex geophysical phenomena. Triclustering, subspace clustering of three-way data, enables the discovery of patterns corresponding to data subspaces (triclusters) with values correlated across the three dimensions (observations × features × contexts). With an increasing number of algorithms being proposed, effectively comparing them with state-of-the-art algorithms is paramount. These comparisons are usually performed using real data, without a known ground truth, thus limiting the assessments. In this context, we propose a synthetic data generator, G-Tric, allowing the creation of synthetic datasets with configurable properties and the possibility to plant triclusters. The generator is prepared to create datasets resembling real three-way data from biomedical and social data domains, with the additional advantage of further providing the ground truth (triclustering solution) as output. Results: G-Tric can replicate real-world datasets and create new ones that match researchers' needs across several properties, including data type (numeric or symbolic), dimensions, and background distribution. Users can tune the patterns and structure that characterize the planted triclusters (subspaces) and how they interact (overlapping). Data quality can also be controlled by defining the amount of missing values, noise, or errors. Furthermore, a benchmark of datasets resembling real data is made available, together with the corresponding triclustering solutions (planted triclusters) and generating parameters. Conclusions: Triclustering evaluation using G-Tric provides the possibility to combine both intrinsic and extrinsic metrics to compare solutions that produce more reliable analyses. A set of predefined datasets, mimicking widely used three-way data and exploring crucial properties, was generated and made available, highlighting G-Tric's potential to advance the triclustering state of the art by easing the process of evaluating the quality of new triclustering approaches.
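The idea of planting a tricluster and returning it as ground truth can be sketched in a few lines. This is a minimal illustration, not G-Tric's actual interface: the function name, the uniform background, and the constant-valued pattern are assumptions chosen for brevity.

```python
import random


def generate_with_planted_tricluster(shape, rows, cols, ctxs, value, seed=0):
    """Build an observations x features x contexts array with one planted tricluster.

    Background cells are drawn uniformly from [0, 1] (an assumed background
    distribution); cells inside the tricluster are set to the constant `value`.
    """
    rng = random.Random(seed)
    n_obs, n_feat, n_ctx = shape
    data = [
        [[rng.uniform(0.0, 1.0) for _ in range(n_ctx)] for _ in range(n_feat)]
        for _ in range(n_obs)
    ]
    for i in rows:
        for j in cols:
            for k in ctxs:
                data[i][j][k] = value
    # The planted subspace is returned so that a triclustering result
    # can be scored against a known solution.
    ground_truth = {
        "observations": sorted(rows),
        "features": sorted(cols),
        "contexts": sorted(ctxs),
    }
    return data, ground_truth


data, truth = generate_with_planted_tricluster(
    shape=(4, 3, 5), rows=[1, 2], cols=[0], ctxs=[2, 3], value=5.0
)
```

Because the planted indices are known, extrinsic metrics that compare a discovered tricluster against `truth` become available, which is not possible with unlabeled real data.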

Revisiting Dead Leaves Model: Training with Synthetic Data

IEEE Signal Processing Letters ◽

10.1109/lsp.2021.3132289 ◽

2021 ◽

pp. 1-1

Author(s):

Pavan C Madhusudana ◽

Seok-Jun Lee ◽

Hamid Rahim Sheikh

Keyword(s):

Synthetic Data ◽

Dead Leaves Model ◽

Model Training

Inferring viral occurrence patterns through a synthetic data simulation

10.1101/2021.07.13.452220 ◽

2021 ◽

Author(s):

Ville N Pimenoff ◽

Ramon Cleries

Keyword(s):

Linear Models ◽

Population Sample ◽

Synthetic Data ◽

Interaction Patterns ◽

Viral Strain ◽

Data Simulation ◽

Synthetic Datasets ◽

Pathogen Occurrence ◽

Log Linear ◽

Occurrence Patterns

Viruses infecting humans are manifold, and several of them provoke significant morbidity and mortality. Simulations creating large synthetic datasets from observed multiple viral strain infections in a limited population sample can be a powerful tool to infer significant pathogen occurrence and interaction patterns, particularly if only a limited number of observed data units is available. Here, to demonstrate diverse human papillomavirus (HPV) strain occurrence patterns, we used log-linear models combined with a Bayesian framework for graphical independence network (GIN) analysis. That is, we simulated datasets by modeling the probabilistic associations between observed viral data points, i.e., different viral strain infections in a set of population samples. Our GIN analysis outperformed in precision all oversampling methods tested for simulating a large synthetic viral strain-level prevalence dataset from an observed set of HPV data. Altogether, we demonstrate that network modeling is a potent tool for creating synthetic viral datasets for comprehensive pathogen occurrence and interaction pattern estimations.

Holdout-Based Empirical Assessment of Mixed-Type Synthetic Data

Frontiers in Big Data ◽

10.3389/fdata.2021.679939 ◽

2021 ◽

Vol 4 ◽

Author(s):

Michael Platzer ◽

Thomas Reutterer

Keyword(s):

Mixed Type ◽

Synthetic Data ◽

Training Data ◽

Privacy Risk ◽

Individual Level ◽

Empirical Assessment ◽

Model Free ◽

Private Data ◽

Synthetic Datasets ◽

The Individual

AI-based data synthesis has seen rapid progress over the last several years and is increasingly recognized for its promise to enable privacy-respecting high-fidelity data sharing. This is reflected by the growing availability of both commercial and open-source software solutions for synthesizing private data. However, despite these recent advances, adequately evaluating the quality of generated synthetic datasets is still an open challenge. We aim to close this gap and introduce a novel holdout-based empirical assessment framework for quantifying the fidelity as well as the privacy risk of synthetic data solutions for mixed-type tabular data. Measuring fidelity is based on statistical distances of lower-dimensional marginal distributions, which provide a model-free and easy-to-communicate empirical metric for the representativeness of a synthetic dataset. Privacy risk is assessed by calculating the individual-level distances to the closest record with respect to the training data. By showing that the synthetic samples are just as close to the training data as to the holdout data, we obtain strong evidence that the synthesizer indeed learned to generalize patterns and is independent of individual training records. We empirically demonstrate the presented framework for seven distinct synthetic data solutions across four mixed-type datasets and then compare these to traditional data perturbation techniques. Both a Python-based implementation of the proposed metrics and the demonstration study setup are made available as open source. The results highlight the need to systematically assess the fidelity as well as the privacy of this emerging class of synthetic data generators.
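The distance-to-closest-record comparison described in this abstract can be sketched as follows. This is a simplified illustration, not the authors' published implementation: it assumes purely numeric records and Euclidean distance (the paper targets mixed-type data), and the function names and toy Gaussian data are hypothetical.

```python
import math
import random


def distance_to_closest_record(sample, reference):
    """Euclidean distance from `sample` to its nearest neighbour in `reference`."""
    return min(math.dist(sample, ref) for ref in reference)


def share_closer_to_training(synthetic, training, holdout):
    """Fraction of synthetic records closer to training than to holdout data.

    Ties count as one half. A value near 0.5 suggests the synthesizer
    generalizes; a value near 1.0 suggests it reproduces training records.
    """
    score = 0.0
    for s in synthetic:
        d_train = distance_to_closest_record(s, training)
        d_hold = distance_to_closest_record(s, holdout)
        if d_train < d_hold:
            score += 1.0
        elif d_train == d_hold:
            score += 0.5
    return score / len(synthetic)


rng = random.Random(0)


def draw(n):
    """Draw n two-dimensional standard-normal toy records."""
    return [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]


training = draw(200)
holdout = draw(200)
generalizing = draw(200)                            # fresh draws from the same distribution
memorizing = [(x + 1e-6, y) for x, y in training]   # near copies of training records

share_general = share_closer_to_training(generalizing, training, holdout)
share_memorized = share_closer_to_training(memorizing, training, holdout)
```

In this toy setup the memorizing "synthesizer" is flagged because almost all of its records sit closer to the training set than to the holdout set, while independent draws are split roughly evenly between the two.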

Human In Command Machine Learning

10.24834/isbn.9789178771875 ◽

2021 ◽

Author(s):

Lars Holmberg

Keyword(s):

Machine Learning ◽

Human Life ◽

Design Guidelines ◽

Significant Other ◽

Target Domain ◽

Extraterrestrial Life ◽

Human Capabilities ◽

Domain Experts ◽

Model Training ◽

Selection Of

Machine Learning (ML) and Artificial Intelligence (AI) impact many aspects of human life, from recommending a significant other to assisting the search for extraterrestrial life. The area develops rapidly, and exciting unexplored design spaces are constantly laid bare. The focus in this work is one of these areas: ML systems where decisions concerning ML model training, usage, and selection of target domain lie in the hands of domain experts. This work thus concerns ML systems that function as a tool that augments and/or enhances human capabilities. The approach presented is denoted Human In Command ML (HIC-ML) systems. To enquire into this research domain, design experiments of varying fidelity were used. Two of these experiments focus on augmenting human capabilities and target the domains of commuting and sorting batteries. One experiment focuses on enhancing human capabilities by identifying similar hand-painted plates. The experiments are used as illustrative examples to explore settings where domain experts potentially can independently train an ML model and, in an iterative fashion, interact with it and interpret and understand its decisions. HIC-ML should be seen as a governance principle that focuses on adding value and meaning to users. In this work, concrete application areas are presented and discussed. To open up for designing ML-based products for the area, an abstract model for HIC-ML is constructed and design guidelines are proposed. In addition, terminology and abstractions useful when designing for explicability are presented by imposing structure and rigidity derived from scientific explanations. Together, this opens up for a contextual shift in ML and makes new application areas probable, areas that naturally couple the usage of AI technology to human virtues and potentially, as a consequence, can result in a democratisation of the usage and knowledge concerning this powerful technology.

Revisiting particle sizing using greyscale optical array probes: evaluation using laboratory experiments and synthetic data

Atmospheric Measurement Techniques ◽

10.5194/amt-12-3067-2019 ◽

2019 ◽

Vol 12 (6) ◽

pp. 3067-3079

Author(s):

Sebastian J. O'Shea ◽

Jonathan Crosier ◽

James Dorsey ◽

Waldemar Schledewitz ◽

Ian Crawford ◽

...

Keyword(s):

Climate Models ◽

Synthetic Data ◽

Mie Scattering ◽

Sample Volume ◽

Research Aircraft ◽

Order Of Magnitude ◽

Ambient Data ◽

In Situ Observations ◽

Synthetic Datasets

Abstract. In situ observations from research aircraft and instrumented ground sites are important contributions to developing our collective understanding of clouds and are used to inform and validate numerical weather and climate models. Unfortunately, biases in these datasets may be present, which can limit their value. In this paper, we discuss artefacts which may bias data from a widely used family of instrumentation in the field of cloud physics, optical array probes (OAPs). Using laboratory and synthetic datasets, we demonstrate how greyscale analysis can be used to filter data, constraining the sample volume of the OAP and improving data quality, particularly at small sizes where OAP data are considered unreliable. We apply the new methodology to ambient data from two contrasting case studies: one warm cloud and one cirrus cloud. In both cases the new methodology reduces the concentration of small particles (<60 µm) by approximately an order of magnitude. This significantly improves agreement with a Mie-scattering spectrometer for the liquid case and with a holographic imaging probe for the cirrus case. Based on these results, we make specific recommendations to instrument manufacturers, instrument operators and data processors about the optimal use of greyscale OAPs. The data from monoscale OAPs are unreliable and should not be used for particle diameters below approximately 100 µm.

Validation of XRD phase quantification using semi-synthetic data

Powder Diffraction ◽

10.1017/s0885715620000573 ◽

2020 ◽

Vol 35 (4) ◽

pp. 262-275

Author(s):

Nicola Döbelin

Keyword(s):

Reference Materials ◽

Certified Reference Materials ◽

Synthetic Data ◽

Statistical Validation ◽

Validation Parameters ◽

Phase Quantification ◽

Characteristic Features ◽

Diffraction Patterns ◽

Synthetic Datasets ◽

Multi Phase

Validating phase quantification procedures of powder X-ray diffraction (XRD) data for an implementation in an ISO/IEC 17025 accredited environment has been challenging due to a general lack of suitable certified reference materials. The preparation of highly pure and crystalline reference materials and mixtures thereof may exceed the costs for a profitable and justifiable implementation. This study presents a method for the validation of XRD phase quantifications based on semi-synthetic datasets that reduces the effort for a full method validation drastically. Datasets of nearly pure reference substances are stripped of impurity signals and rescaled to 100% crystallinity, thus eliminating the need for the preparation of ultra-pure and -crystalline materials. The processed datasets are then combined numerically while preserving all sample- and instrument-characteristic features of the peak profile, thereby creating multi-phase diffraction patterns of precisely known composition. The number of compositions and repetitions is only limited by computational power and storage capacity. These datasets can be used as input files for the phase quantification procedure, in which statistical validation parameters such as precision, accuracy, linearity, and limits of detection and quantification can be determined from a statistically sound number of datasets and compositions.
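The numerical combination step described above can be illustrated with a weighted sum of single-phase patterns. This is only a sketch under strong assumptions: each input pattern is taken to be already cleaned, rescaled to represent 100% of its phase, and expressed on a common intensity scale and angular grid, so that weight fractions can be applied directly. The function name and the toy intensities are hypothetical, and the scaling used in the paper may differ.

```python
def mix_patterns(patterns, weight_fractions):
    """Combine single-phase diffraction patterns into one multi-phase pattern.

    `patterns` is a list of equally long intensity lists sampled on the same
    2-theta grid; `weight_fractions` must sum to 1.
    """
    if len(patterns) != len(weight_fractions):
        raise ValueError("one weight fraction per pattern is required")
    if abs(sum(weight_fractions) - 1.0) > 1e-9:
        raise ValueError("weight fractions must sum to 1")
    n_points = len(patterns[0])
    if any(len(p) != n_points for p in patterns):
        raise ValueError("all patterns must share the same grid")
    return [
        sum(w * p[i] for w, p in zip(weight_fractions, patterns))
        for i in range(n_points)
    ]


# Toy intensities for two hypothetical reference phases.
phase_a = [0.0, 10.0, 0.0, 2.0]
phase_b = [4.0, 0.0, 8.0, 0.0]

# A mixture with a known composition of 25% phase A and 75% phase B.
mixture = mix_patterns([phase_a, phase_b], [0.25, 0.75])
```

Because the composition is set numerically, any number of mixtures can be generated and fed to the quantification procedure, and the reported phase fractions can be compared against the known inputs.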

State-of-the-Art Fusion-Finder Algorithms Sensitivity and Specificity

BioMed Research International ◽

10.1155/2013/340620 ◽

2013 ◽

Vol 2013 ◽

pp. 1-6 ◽

Cited By ~ 57

Author(s):

Matteo Carrara ◽

Marco Beccuti ◽

Fulvio Lazzarato ◽

Federica Cavallo ◽

Francesca Cordero ◽

...

Keyword(s):

Synthetic Data ◽

Rna Seq ◽

Comparison Analysis ◽

Generating Functional ◽

Specificity And Sensitivity ◽

Sensitive Tool ◽

Fusion Junction ◽

Synthetic Datasets ◽

Very High ◽

Fusion Detection

Background. Gene fusions arising from chromosomal translocations have been implicated in cancer. RNA-seq has the potential to discover such rearrangements generating functional proteins (chimera/fusion). Recently, many methods for chimera detection have been published. However, the specificity and sensitivity of those tools have not been extensively investigated in a comparative way. Results. We tested eight fusion-detection tools (FusionHunter, FusionMap, FusionFinder, MapSplice, deFuse, Bellerophontes, ChimeraScan, and TopHat-fusion) to detect fusion events using synthetic and real datasets encompassing chimeras. A comparison analysis run only on synthetic data could generate misleading results, since we found no counterpart in the real dataset. Furthermore, most tools report a very high number of false positive chimeras. In particular, the most sensitive tool, ChimeraScan, reports a large number of false positives that we were able to significantly reduce by devising and applying two filters to remove fusions not supported by fusion junction-spanning reads or encompassing large intronic regions. Conclusions. The discordant results obtained using synthetic and real datasets suggest that synthetic datasets encompassing fusion events may not fully capture the complexity of an RNA-seq experiment. Moreover, fusion detection tools are still limited in sensitivity or specificity; thus, there is room for further improvement in the fusion-finder algorithms.

Revisiting particle sizing using grayscale optical array probes: evaluation using laboratory experiments and synthetic data

10.5194/amt-2018-435 ◽

2019 ◽

Author(s):

Sebastian J. O'Shea ◽

Jonathan Crosier ◽

James Dorsey ◽

Waldemar Schledewitz ◽

Ian Crawford ◽

...

Keyword(s):

Climate Models ◽

Synthetic Data ◽

Mie Scattering ◽

Sample Volume ◽

Research Aircraft ◽

Order Of Magnitude ◽

Ambient Data ◽

In Situ Observations ◽

Synthetic Datasets

Abstract. In situ observations from research aircraft and instrumented ground sites are important contributions to developing our collective understanding of clouds, and are used to inform and validate numerical weather and climate models. Unfortunately, biases in these datasets may be present, which can limit their value. In this paper, we discuss artefacts which may bias data from a widely used family of instrumentation in the field of cloud physics, Optical Array Probes (OAPs). Using laboratory and synthetic datasets, we demonstrate how greyscale analysis can be used to filter data, constraining the sample volume of the OAP and improving data quality, particularly at small sizes where OAP data are considered unreliable. We apply the new methodology to ambient data from two contrasting case studies: one warm cloud and one cirrus cloud. In both cases the new methodology reduces the concentration of small particles (< 60 µm) by approximately an order of magnitude. This significantly improves agreement with a Mie scattering spectrometer for the liquid case and with a holographic imaging probe for the cirrus case. Based on these results, we make specific recommendations to instrument manufacturers, instrument operators, and data processors about the optimal use of greyscale OAPs. We also raise the issue of bias in OAPs which have no greyscale capability.

Single Image Reflection Removal via Deep Feature Contrast

International Journal of Circuits, Systems and Signal Processing ◽

10.46300/9106.2022.16.38 ◽

2022 ◽

Vol 16 ◽

pp. 311-320

Author(s):

Lumin Liu

Keyword(s):

Texture Feature ◽

Computational Photography ◽

Single Image ◽

Low Level ◽

Level Information ◽

Fast Development ◽

Model Training ◽

Parallel Feature ◽

Synthetic Datasets ◽

Background Image

Removing undesired reflection from a single image is in demand for computational photography. Reflection removal methods have become increasingly effective because of the fast development of deep neural networks. However, current reflection removal methods usually leave salient reflection residues due to the challenge of recognizing diverse reflection patterns. In this paper, we present a one-stage, end-to-end reflection removal framework that considers both low-level information correlation and efficient feature separation. Our approach employs the criss-cross attention mechanism to extract low-level features and to efficiently enhance contextual correlation. To thoroughly remove reflection residues in the background image, we penalize similar texture features by contrasting the parallel feature separation networks, and thus unrelated textures in the background image can be progressively separated during model training. Experiments on both real-world and synthetic datasets show that our approach can reach state-of-the-art performance quantitatively and qualitatively.
