Crash to Not Crash: Learn to Identify Dangerous Vehicles Using a Simulator

Author(s):  
Hoon Kim ◽  
Kangwook Lee ◽  
Gyeongjo Hwang ◽  
Changho Suh

Developing a computer vision-based algorithm for identifying dangerous vehicles requires a large amount of labeled accident data, which is difficult to collect in the real world. To tackle this challenge, we first develop a synthetic data generator built on top of a driving simulator. We then observe that the synthetic labels generated from simulation results are very noisy, resulting in poor classification performance. To improve the quality of the synthetic labels, we propose a new label adaptation technique that first extracts the internal states of vehicles from the underlying driving simulator, and then refines the labels by predicting future vehicle paths with a well-studied motion model. Via real-data experiments, we show that our dangerous vehicle classifier can reduce the missed detection rate by at least 18.5% compared with classifiers trained on real data when the time-to-collision is between 1.6 s and 1.8 s.
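
The label-refinement step described above can be illustrated with a small sketch: a constant-velocity motion model propagates two vehicles forward and a time-to-collision (TTC) check turns the predicted paths into a dangerous/safe label. The function names, the collision radius, and the 1.8 s threshold are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def time_to_collision(p_ego, v_ego, p_other, v_other, radius=2.0, horizon=3.0, dt=0.1):
    """Propagate two vehicles with a constant-velocity motion model and
    return the first time their centres come within `radius` metres,
    or None if no collision is predicted within `horizon` seconds."""
    p_ego, v_ego = np.asarray(p_ego, float), np.asarray(v_ego, float)
    p_other, v_other = np.asarray(p_other, float), np.asarray(v_other, float)
    for step in range(int(horizon / dt) + 1):
        t = step * dt
        gap = (p_other + v_other * t) - (p_ego + v_ego * t)
        if np.linalg.norm(gap) <= radius:
            return t
    return None

def refine_label(ttc, ttc_threshold=1.8):
    """Label a vehicle as dangerous if a collision is predicted soon enough."""
    return ttc is not None and ttc <= ttc_threshold

# Example: a vehicle 20 m ahead, closing at 12 m/s relative speed.
ttc = time_to_collision(p_ego=(0, 0), v_ego=(10, 0), p_other=(20, 0), v_other=(-2, 0))
print(round(ttc, 2), refine_label(ttc))
```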

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
João Lobo ◽  
Rui Henriques ◽  
Sara C. Madeira

Abstract Background Three-way data started to gain popularity due to their increasing capacity to describe inherently multivariate and temporal events, such as biological responses, social interactions along time, urban dynamics, or complex geophysical phenomena. Triclustering, subspace clustering of three-way data, enables the discovery of patterns corresponding to data subspaces (triclusters) with values correlated across the three dimensions (observations × features × contexts). With an increasing number of algorithms being proposed, effectively comparing them with state-of-the-art algorithms is paramount. These comparisons are usually performed using real data, without a known ground truth, thus limiting the assessments. In this context, we propose a synthetic data generator, G-Tric, allowing the creation of synthetic datasets with configurable properties and the possibility to plant triclusters. The generator is prepared to create datasets resembling real three-way data from biomedical and social data domains, with the additional advantage of further providing the ground truth (triclustering solution) as output. Results G-Tric can replicate real-world datasets and create new ones that match researchers' needs across several properties, including data type (numeric or symbolic), dimensions, and background distribution. Users can tune the patterns and structure that characterize the planted triclusters (subspaces) and how they interact (overlapping). Data quality can also be controlled by defining the amount of missing values, noise, or errors. Furthermore, a benchmark of datasets resembling real data is made available, together with the corresponding triclustering solutions (planted triclusters) and generating parameters. Conclusions Triclustering evaluation using G-Tric makes it possible to combine intrinsic and extrinsic metrics when comparing solutions, producing more reliable analyses. A set of predefined datasets, mimicking widely used three-way data and exploring crucial properties, was generated and made available, highlighting G-Tric's potential to advance the triclustering state of the art by easing the process of evaluating the quality of new triclustering approaches.
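
As a rough illustration of what planting a tricluster in three-way data involves (a hypothetical sketch, not G-Tric's API), the snippet below fills a background tensor and overwrites a randomly chosen observations × features × contexts block with a constant pattern, returning the planted indices as the ground truth.

```python
import numpy as np

rng = np.random.default_rng(0)

def plant_tricluster(shape=(50, 30, 10), tric_shape=(8, 5, 3),
                     background=(0.0, 1.0), pattern_value=5.0, noise_sd=0.1):
    """Create a numeric three-way dataset (observations x features x contexts)
    with one constant-pattern tricluster planted at random indices.
    Returns the tensor and the planted indices (the triclustering ground truth)."""
    n_obs, n_feat, n_ctx = shape
    data = rng.normal(background[0], background[1], size=shape)
    obs = rng.choice(n_obs, tric_shape[0], replace=False)
    feat = rng.choice(n_feat, tric_shape[1], replace=False)
    ctx = rng.choice(n_ctx, tric_shape[2], replace=False)
    # Plant a constant pattern perturbed by small noise on the chosen subspace.
    block = pattern_value + rng.normal(0.0, noise_sd, size=tric_shape)
    data[np.ix_(obs, feat, ctx)] = block
    return data, (sorted(obs), sorted(feat), sorted(ctx))

data, truth = plant_tricluster()
print(data.shape, truth)
```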


2021 ◽  
Vol 15 (4) ◽  
pp. 1-20
Author(s):  
Georg Steinbuss ◽  
Klemens Böhm

Benchmarking unsupervised outlier detection is difficult. Outliers are rare, and existing benchmark data contains outliers with varied and unknown characteristics. Fully synthetic data usually consists of outliers and regular instances with clear characteristics and thus, in principle, allows for a more meaningful evaluation of detection methods. Nonetheless, there have been only a few attempts to include synthetic data in benchmarks for outlier detection. This might be due to the imprecise notion of outliers or to the difficulty of achieving good coverage of different domains with synthetic data. In this work, we propose a generic process for generating datasets for such benchmarking. The core idea is to reconstruct regular instances from existing real-world benchmark data while generating outliers so that they exhibit insightful characteristics. We describe this generic benchmarking process for unsupervised outlier detection and then present three instantiations of it that generate outliers with specific characteristics, such as local outliers. To validate the process, we perform a benchmark with state-of-the-art detection methods and carry out experiments to study the quality of the data reconstructed in this way. Besides showcasing the workflow, this confirms the usefulness of our proposed process. In particular, our process yields regular instances close to the ones from real data. Summing up, we propose and validate a new and practical process for the benchmarking of unsupervised outlier detection.
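
A drastically simplified sketch of the idea is shown below: regular instances are "reconstructed" from real data with a single Gaussian fit, and outliers are planted by displacing randomly chosen regular points. The fitting model, parameter names, and displacement rule are assumptions for illustration and not one of the paper's actual instantiations.

```python
import numpy as np

rng = np.random.default_rng(1)

def generate_benchmark(X_real, n_regular=500, n_outliers=25, displacement=3.0):
    """Reconstruct regular instances from real data (here with one Gaussian fit)
    and add outliers by displacing randomly chosen regular points a few
    standard deviations along a random direction."""
    mean = X_real.mean(axis=0)
    cov = np.cov(X_real, rowvar=False)
    regular = rng.multivariate_normal(mean, cov, size=n_regular)
    anchors = regular[rng.choice(n_regular, n_outliers, replace=False)]
    direction = rng.normal(size=anchors.shape)
    direction /= np.linalg.norm(direction, axis=1, keepdims=True)
    scale = X_real.std(axis=0).mean()
    outliers = anchors + displacement * scale * direction
    X = np.vstack([regular, outliers])
    y = np.r_[np.zeros(n_regular, int), np.ones(n_outliers, int)]  # 1 = outlier
    return X, y

X_real = rng.normal(size=(1000, 5))
X, y = generate_benchmark(X_real)
print(X.shape, int(y.sum()), "planted outliers")
```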


Sensors ◽  
2020 ◽  
Vol 20 (16) ◽  
pp. 4555
Author(s):  
Lee Friedman ◽  
Hal S. Stern ◽  
Larry R. Price ◽  
Oleg V. Komogortsev

It is generally accepted that relatively more permanent (i.e., more temporally persistent) traits are more valuable for biometric performance than less permanent traits. Although this finding is intuitive, no prior work identifies exactly where in the biometric analysis temporal persistence makes a difference. In this paper, we answer this question. In a recent report, we introduced the intraclass correlation coefficient (ICC) as an index of temporal persistence for such features. Here, we present a novel approach using synthetic features to study which aspects of a biometric identification study are influenced by the temporal persistence of features. We show that using more temporally persistent features produces effects on the similarity-score distributions that explain why this quality is so key to biometric performance. The results identified with the synthetic data are largely reinforced by an analysis of two datasets, one based on eye movements and one based on gait. There was one difference between the synthetic and real data, related to the intercorrelation of features in real data. Removing these intercorrelations from the real datasets with a decorrelation step produced results very similar to those obtained with synthetic features.
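
Under the one-way random-effects view behind the ICC, a synthetic feature with a target level of temporal persistence can be generated as a stable subject effect plus session noise, with the variances split so that the ICC equals the desired value. The sketch below (illustrative parameter names, not the authors' code) shows this construction and a quick test-retest check.

```python
import numpy as np

rng = np.random.default_rng(2)

def synthetic_features(n_subjects=100, n_sessions=2, n_features=10, icc=0.8):
    """Generate uncorrelated synthetic features with a target intraclass
    correlation (ICC): each value is a stable subject effect plus session
    noise, with var_subject / (var_subject + var_noise) = icc."""
    var_subject, var_noise = icc, 1.0 - icc
    subject_effect = rng.normal(0, np.sqrt(var_subject),
                                size=(n_subjects, 1, n_features))
    noise = rng.normal(0, np.sqrt(var_noise),
                       size=(n_subjects, n_sessions, n_features))
    return subject_effect + noise  # shape: subjects x sessions x features

X = synthetic_features(icc=0.8)
# Quick check: with two sessions, the test-retest correlation approximates the ICC.
r = np.corrcoef(X[:, 0, 0], X[:, 1, 0])[0, 1]
print(X.shape, round(r, 2))
```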


Technologies ◽  
2021 ◽  
Vol 9 (4) ◽  
pp. 94
Author(s):  
Daniel Canedo ◽  
Pedro Fonseca ◽  
Petia Georgieva ◽  
António J. R. Neves

Floor-cleaning robots are becoming increasingly sophisticated, and with the addition of digital cameras supported by a robust vision system they become more autonomous, both in their navigation skills and in their capability to analyze the surrounding environment. This document proposes a vision system based on the YOLOv5 framework for detecting dirty spots on the floor. The purpose of such a vision system is to save energy and resources, since the cleaning system of the robot is activated only when a dirty spot is detected, and the quantity of resources used varies with the size of the dirty area. In this context, false positives are highly undesirable. On the other hand, false negatives will lead to poor cleaning performance. For this reason, a synthetic data generator found in the literature was improved and adapted for this work to tackle the lack of real data in this area. This synthetic data generator makes it possible to build large datasets with numerous samples of floors and dirty spots. A novel approach to selecting floor images for the training dataset is proposed. In this approach, the floor is segmented from other objects in the image so that dirty spots are generated only on the floor and do not overlap those objects. This helps the models distinguish between dirty spots and objects in the image, which reduces the number of false positives. Furthermore, a relevant dataset of the Automation and Control Institute (ACIN) was found to be only partially labelled. Consequently, this dataset was annotated from scratch, tripling the number of labelled images and correcting some poor annotations in the original labels. Finally, this document shows the process of generating the synthetic data used for training YOLOv5 models. These models were tested on a real dataset (ACIN), and the best model attained a mean average precision (mAP) of 0.874 for detecting solid dirt. These results further show that our proposal is able to use synthetic data for the training step and effectively detect dirt on real data. To the best of our knowledge, there are no previous works reporting the use of YOLOv5 models in this application.
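
The key idea of generating dirty spots only on the segmented floor can be sketched as follows; the function, mask format, and YOLO-style box output are assumptions for illustration rather than the generator actually used in the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

def paste_dirt(floor_image, floor_mask, dirt_patch, dirt_mask, max_tries=50):
    """Paste a dirt patch at a random location that lies entirely on the
    segmented floor, returning the composited image and a YOLO-style
    bounding box (class, x_center, y_center, width, height; all normalized)."""
    H, W = floor_mask.shape
    h, w = dirt_mask.shape
    for _ in range(max_tries):
        y, x = rng.integers(0, H - h), rng.integers(0, W - w)
        region = floor_mask[y:y + h, x:x + w]
        # Accept only if every dirt pixel lands on floor (no overlap with objects).
        if np.all(region[dirt_mask > 0] > 0):
            roi = floor_image[y:y + h, x:x + w]
            roi[dirt_mask > 0] = dirt_patch[dirt_mask > 0]
            box = (0, (x + w / 2) / W, (y + h / 2) / H, w / W, h / H)
            return floor_image, box
    return floor_image, None  # no valid placement found

# Toy example: a blank image with a rectangular "floor" region.
img = np.zeros((480, 640, 3), np.uint8)
floor = np.zeros((480, 640), np.uint8); floor[100:400, 100:600] = 1
patch = np.full((30, 30, 3), 60, np.uint8)
patch_mask = np.ones((30, 30), np.uint8)
out, box = paste_dirt(img, floor, patch, patch_mask)
print(box)
```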


Geophysics ◽  
1991 ◽  
Vol 56 (7) ◽  
pp. 1071-1080 ◽  
Author(s):  
Mark Sams

A long-spaced sonic survey may be thought of as a special case of ray-theoretical tomographic imaging. With such an approach, estimates of borehole properties at a resolution of 6 inches (0.15 m) have been obtained by inversion, compared with a resolution of 2 ft (0.6 m) from standard borehole-compensated (BHC) techniques. The inversion scheme employs the conjugate gradient technique, which is fast and efficient. Unlike BHC, the method compensates for variable refraction angles and provides estimates of errors in the measurements. Results from synthetic data show that these factors greatly improve the imaging of the properties of a finely layered medium, though amplitude decay and coupling are less well defined than velocity and mud traveltime. Results from real data confirm the superior quality of logs obtained by inversion. Furthermore, they indicate that measured amplitudes can be dominated by errors that degrade BHC estimates of amplitude decay and coupling.
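
A minimal sketch of the tomographic view, assuming a simple straight-ray discretization: traveltimes are modeled as path lengths times cell slownesses, t = G s, and the normal equations are solved with conjugate gradients. The toy geometry and values below are illustrative, not the survey configuration used in the paper.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg  # assumption: SciPy is available

def invert_slowness(G, traveltimes, n_cells):
    """Least-squares traveltime inversion t = G s, solved with the conjugate
    gradient method applied to the normal equations G^T G s = G^T t.
    Each row of G holds the path length of one ray in each depth cell;
    s is the slowness per cell."""
    normal = LinearOperator((n_cells, n_cells), dtype=float,
                            matvec=lambda s: G.T @ (G @ s))
    s, info = cg(normal, G.T @ traveltimes, maxiter=200)
    return s, info  # info == 0 means the iteration converged

# Toy example: four overlapping rays through four 0.15 m cells.
G = np.array([[0.15, 0.0,  0.0,  0.0 ],
              [0.15, 0.15, 0.0,  0.0 ],
              [0.0,  0.15, 0.15, 0.0 ],
              [0.0,  0.0,  0.15, 0.15]])
true_slowness = np.array([200.0, 210.0, 205.0, 220.0])  # e.g. microseconds per metre
t = G @ true_slowness
s_est, info = invert_slowness(G, t, n_cells=4)
print(info, np.round(s_est, 1))
```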


Author(s):  
Vitor Trentin ◽  
Valeria Bastos ◽  
Myrian Costa ◽  
Kenneth Camargo ◽  
Rejane Sobrino ◽  
...  

Introduction: Record linkage has been increasingly used in Brazil. However, only a few studies report the quality of the linkage process. Synthetic test data can be used to evaluate the quality of data linkage. Objectives and Approach: To develop a synthetic data generator that creates test datasets with attributes and error characteristics similar to those found in Brazilian databases. We analyzed the 2013 mortality database from Rio de Janeiro State to learn the characteristics and frequency distributions of the database attributes (name, mother's name, sex, date of birth and address). We used Python and C++ to customize and add routines to GeCo (http://dlrep.org/dataset/GeCo), a personal data generation tool developed by Tran et al. (DOI:10.1145/2505515.2508207). Results: Brazilian names have specific characteristics that distinguish them from other countries' patterns: multiple family names are usual, as are composite first names, and, despite that, homonyms are frequent. Family names may include all or only part of the father's and mother's respective family names, or both, so there is wide variation in progeny family names and not necessarily a common family name for all family members. Conclusion/Implications: Due to the specific national characteristics of name building in Brazil, modeling synthetic data is particularly challenging and needs more flexible rules in order to generate databases that will actually allow assessing the quality of data linkage processes.
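
A toy sketch of the kind of rule-based name construction the paper argues is needed (tiny illustrative frequency lists, not GeCo's actual routines or the distributions derived from the mortality database): composite first names and surnames assembled from parts of both parents' family names.

```python
import random

random.seed(0)

# Illustrative lists only; a real generator would draw these from the
# attribute frequency distributions observed in the source database.
FIRST = ["Maria", "José", "Ana", "João", "Antônio", "Francisca"]
FAMILY = ["Silva", "Santos", "Oliveira", "Souza", "Pereira", "Lima"]

def synthetic_person():
    """Build a synthetic record mimicking Brazilian naming patterns:
    composite first names and a surname combining parts of the mother's
    and father's family names (so relatives need not share all surnames)."""
    first = " ".join(random.sample(FIRST, k=random.choice([1, 2])))
    mother_family = random.sample(FAMILY, k=random.choice([1, 2]))
    father_family = random.sample(FAMILY, k=random.choice([1, 2]))
    # A child may inherit all or only part of each parent's family names.
    inherited = (random.sample(mother_family, k=random.randint(0, len(mother_family)))
                 + random.sample(father_family, k=random.randint(1, len(father_family))))
    mother_name = " ".join(random.sample(FIRST, 1) + mother_family)
    return {"name": f"{first} {' '.join(inherited)}",
            "mother_name": mother_name,
            "sex": random.choice("MF"),
            "birth": f"19{random.randint(40, 99)}-{random.randint(1, 12):02d}-{random.randint(1, 28):02d}"}

for _ in range(3):
    print(synthetic_person())
```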


Energies ◽  
2019 ◽  
Vol 12 (17) ◽  
pp. 3326 ◽  
Author(s):  
Xiufeng Liu ◽  
Yanyan Yang ◽  
Rongling Li ◽  
Per Sieverts Nielsen

User activities are an important input to energy modelling, simulation and performance studies of residential buildings. However, it is often difficult to obtain detailed data on user activities and related energy consumption. This paper presents a stochastic model based on a Markov chain to simulate the user activities of households with one or more family members, and formalizes the simulation processes under different conditions. A data generator is implemented to create fine-grained activity sequences that require only a small sample of time-use survey data as a seed. This paper evaluates the data generator by comparing the generated synthetic data with real data and with other related work. The results show the effectiveness of the proposed modelling approach and its efficiency in generating realistic residential user activities.
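
A minimal sketch of the Markov-chain idea, assuming a first-order chain over a handful of activity states in 10-minute steps; the states and transition probabilities below are illustrative placeholders for values that would be estimated from the time-use survey seed.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative states and transition matrix (assumed values, one row per state).
STATES = ["sleep", "away", "cooking", "tv", "other"]
P = np.array([[0.90, 0.02, 0.02, 0.02, 0.04],
              [0.02, 0.90, 0.03, 0.02, 0.03],
              [0.05, 0.05, 0.60, 0.20, 0.10],
              [0.10, 0.05, 0.05, 0.70, 0.10],
              [0.10, 0.10, 0.10, 0.20, 0.50]])

def simulate_day(start="sleep", steps=144):
    """Simulate one resident's activity sequence for a day in 10-minute
    steps using a first-order Markov chain."""
    seq, state = [], STATES.index(start)
    for _ in range(steps):
        seq.append(STATES[state])
        state = rng.choice(len(STATES), p=P[state])
    return seq

day = simulate_day()
print(day[:6], "...", len(day), "slots")
```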


2021 ◽  
Author(s):  
Fida Dankar ◽  
Mahmoud K. Ibrahim ◽  
Leila Ismail

BACKGROUND Synthetic datasets are gradually emerging as a solution for fast and inclusive health data sharing. Multiple synthetic data generators have been introduced in the last decade, fueled by advances in machine learning, yet their utility is not well understood. A few recent papers have tried to compare the utility of synthetic data generators, each focusing on different evaluation metrics and presenting conclusions targeted at specific analyses. OBJECTIVE This work aims to understand the overall utility (referred to as quality) of four recent synthetic data generators by identifying multiple criteria for high-utility synthetic data. METHODS We investigate commonly used utility metrics for masked data evaluation and classify them into criteria/categories depending on the function they attempt to preserve: attribute fidelity, bivariate fidelity, population fidelity, and application fidelity. We then choose a representative metric from each of the identified categories based on popularity and consistency. Together, this set of metrics, referred to as the quality criteria, is used to evaluate the overall utility of four recent synthetic data generators across 19 datasets of different sizes and feature counts. Moreover, correlations between the identified metrics are investigated in an attempt to streamline synthetic data utility evaluation. RESULTS Our results indicate that a non-parametric machine learning synthetic data generator (Synthpop) provides the best utility values across all quality criteria along with the highest stability. It displays the best overall accuracy in supervised machine learning and often agrees with the real dataset on which learning model attains the highest accuracy. On another front, our results suggest no strong correlation between the different metrics, which implies that all categories/dimensions are required when evaluating the overall utility of synthetic data. CONCLUSIONS The paper used four quality criteria to identify the synthesizer with the best overall utility. The results are promising, with only small decreases in accuracy observed for the winning synthesizer when its models are tested on real datasets (in comparison with models trained on real data). Further research into a single (overall) quality measure would greatly help data holders in optimizing the utility of the released dataset.
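
As an example of the kind of metric that falls under bivariate fidelity, the sketch below computes the pairwise correlation difference between a real and a synthetic dataset; this is one commonly used metric of this type, shown for illustration rather than as the paper's exact metric set.

```python
import numpy as np
import pandas as pd  # assumption: pandas is available

def pairwise_correlation_difference(real: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Bivariate-fidelity metric: Frobenius norm of the difference between the
    real and synthetic correlation matrices (smaller is better, 0 = identical)."""
    diff = real.corr() - synthetic.corr()
    return float(np.linalg.norm(diff.to_numpy(), ord="fro"))

# Toy check with correlated real data and an independent "synthetic" copy.
rng = np.random.default_rng(5)
real = pd.DataFrame(rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], 1000),
                    columns=["a", "b"])
synthetic = pd.DataFrame(rng.normal(size=(1000, 2)), columns=["a", "b"])
print(round(pairwise_correlation_difference(real, synthetic), 2))
```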


Author(s):  
Cheng-Han (Lance) Tsai ◽  
Jen-Yuan (James) Chang

Abstract Artificial Intelligence (AI) has been widely used in different domains such as self-driving, automated optical inspection, and detection of object locations for robotic pick-and-place operations. Although the current results of using AI in these fields are good, the biggest bottleneck for AI is the need for a vast amount of data, together with labeling of the corresponding answers, for sufficient training. Evidently, these efforts still require significant manpower, and if the quality of the labelling is unstable, the trained AI model becomes unstable and, as a consequence, so do the results. To resolve this issue, an auto-annotation system is proposed in this paper, with methods including (1) highly realistic model generation with real texture, (2) a domain randomization algorithm in the simulator to automatically generate abundant and diverse images, and (3) a visibility tracking algorithm to calculate the occlusion effect objects cause on each other for different picking-strategy labels. From our experiments, we show that 10,000 images can be generated per hour, each containing multiple objects and each object being labelled into different classes based on its visibility. Instance segmentation AI models can also be trained with these methods to assess the gap between training on synthetic data and testing on real data, indicating that the mean average precision can reach 70% even under the mAP 70 criterion.
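
The visibility-tracking idea, computing how much of each object remains unoccluded, can be sketched from per-object masks rendered in depth order; the function below is an illustrative assumption, not the paper's algorithm.

```python
import numpy as np

def visibility_ratios(full_masks):
    """Given per-object binary masks rendered without occlusion (one HxW
    array per object, ordered from farthest to nearest), return the fraction
    of each object that stays visible once nearer objects are drawn on top."""
    ratios = []
    occluder = np.zeros_like(full_masks[0], dtype=bool)
    # Walk from the nearest object to the farthest: anything already drawn occludes.
    for mask in reversed(full_masks):
        mask = mask.astype(bool)
        visible = mask & ~occluder
        ratios.append(visible.sum() / max(mask.sum(), 1))
        occluder |= mask
    return list(reversed(ratios))  # back to farthest-to-nearest order

# Toy scene: a far box partially covered by a nearer box.
far = np.zeros((10, 10), int); far[2:8, 2:8] = 1
near = np.zeros((10, 10), int); near[5:9, 5:9] = 1
print(visibility_ratios([far, near]))  # the far box is ~75% visible, the near box 100%
```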


2019 ◽  
Vol 37 (2) ◽  
Author(s):  
Misael Possidonio de Souza ◽  
Michelangelo Gomes da Silva ◽  
Milton J. Porsani

ABSTRACT. The Solimões Basin, Brazil, will remain the subject of much discussion due to the success of oil exploration in the 1970s, with the discovery of oil and gas fields. The geology of this basin is characterized by thick layers of igneous rock, the diabase sills, which appear in any stacked section as reflectors with strong amplitude but low frequency. The high seismic impedance contrast between the sedimentary rock layers and the diabase sills generates multiple reflections and reverberations that can lead to erroneous seismic interpretation of stacked sections. In this work, to improve the quality of the stacked sections, we propose a seismic processing flow for land data that adds multiple-attenuation filtering steps based on Multichannel Predictive Deconvolution and the Parabolic Radon Transform. The study was first performed on synthetic data to test the methodology, and then on real data provided by the Agência Nacional de Petróleo, Gás Natural e Biocombustíveis (ANP). The conventional processing flowchart was applied using commercial processing software, such as SeisSpace/ProMAX, and Fortran 90 codes available at the Centro de Pesquisa em Geofísica e Geologia, Universidade Federal da Bahia (CPGG/UFBA). The results obtained with this methodology were satisfactory, with visible improvements in the quality of the stacked seismic sections after attenuation of the unwanted noise. Keywords: multiple attenuation, seismic processing, seismic reflection.
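
For readers unfamiliar with predictive deconvolution, a single-channel sketch is given below: a Wiener prediction filter is designed from the trace autocorrelation and the predictable, periodic part (the multiples) is subtracted. The parameter values, the toy trace, and the single-channel simplification are illustrative assumptions; the paper uses a multichannel formulation together with the Parabolic Radon Transform.

```python
import numpy as np
from scipy.linalg import solve_toeplitz  # assumption: SciPy is available

def predictive_deconvolution(trace, filter_length=20, prediction_lag=10, prewhitening=0.01):
    """Single-channel predictive deconvolution: design a Wiener prediction
    filter from the trace autocorrelation and subtract the predictable
    (periodic, multiple-related) part, keeping the unpredictable primaries."""
    n = len(trace)
    full = np.correlate(trace, trace, mode="full")
    acf = full[n - 1:n - 1 + filter_length + prediction_lag]  # lags 0 .. L+lag-1
    r = acf[:filter_length].copy()
    r[0] *= 1.0 + prewhitening                 # pre-whitening stabilizes the solve
    g = acf[prediction_lag:prediction_lag + filter_length]
    f = solve_toeplitz(r, g)                   # Wiener prediction filter
    predicted = np.convolve(trace, f)[:n]
    output = trace.copy()
    output[prediction_lag:] -= predicted[:n - prediction_lag]
    return output

# Toy trace: a primary spike followed by decaying, periodic "multiples" every 10 samples.
trace = np.zeros(200)
for k, amp in enumerate([1.0, -0.6, 0.36, -0.22]):
    trace[40 + 10 * k] = amp
out = predictive_deconvolution(trace)
# Energy of the periodic tail after the primary, before vs. after deconvolution.
print(round(float(np.abs(trace[45:]).sum()), 2), "->", round(float(np.abs(out[45:]).sum()), 2))
```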

