Improving Input Data Quality in Register‐Based Statistics: The Norwegian Experience

Author(s):  
Coen Hendriks
2019 ◽  
Vol 10 (2) ◽  
pp. 117-125
Author(s):  
Dana Kubíčková ◽  
Vladimír Nulíček

The aim of the research project carried out at the University of Finance and Administration is to construct a new bankruptcy model. The intention is to use data of firms that had to cease their activities due to bankruptcy. The most common method for bankruptcy model construction is multivariate discriminant analysis (MDA). It allows deriving the indicators most sensitive to future company failure as parts of the bankruptcy model. One of the assumptions for using the MDA method and ensuring reliable results is the normal distribution and independence of the input data. The results of verifying this assumption, the third stage of the project, are presented in this article. We found that the assumption is met by only a few of the selected indicators. Better results were achieved for the indicators in the set of prosperous companies and one year prior to failure. The selected indicators intended for bankruptcy model construction therefore cannot be considered suitable for the MDA method.
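As an illustration (not the authors' code), a minimal Python sketch of the kind of normality check described above, run per indicator and per group with the Shapiro-Wilk test; the indicator names and data are hypothetical placeholders:

```python
# Minimal sketch: testing the MDA normality assumption per
# indicator and per group with the Shapiro-Wilk test.
# Indicator names and data are hypothetical placeholders.
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "status": ["prosperous"] * 100 + ["bankrupt"] * 100,
    "roa": np.r_[rng.normal(0.08, 0.03, 100), rng.lognormal(-3, 1, 100)],
    "current_ratio": np.r_[rng.normal(1.8, 0.4, 100), rng.exponential(1.0, 100)],
})

for group, subset in df.groupby("status"):
    for ind in ("roa", "current_ratio"):
        w, p = stats.shapiro(subset[ind])
        verdict = "normal" if p > 0.05 else "NOT normal"
        print(f"{group:10s} {ind:14s} W={w:.3f} p={p:.4f} -> {verdict}")
```

Independence would be checked separately, e.g. via correlation analysis between the candidate indicators.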


2021 ◽  
Author(s):  
S. H. Al Gharbi ◽  
A. A. Al-Majed ◽  
A. Abdulraheem ◽  
S. Patil ◽  
S. M. Elkatatny

Abstract Due to the high demand for energy, oil and gas companies have started to drill wells in remote areas and unconventional environments. This has raised the complexity of drilling operations, which were already challenging and complex. To adapt, drilling companies expanded their use of the real-time operation center (RTOC) concept, in which real-time drilling data are transmitted from remote sites to company headquarters. In an RTOC, groups of subject matter experts monitor the drilling live and provide real-time advice to improve operations. With the increase in drilling operations, processing the volume of generated data is beyond human capability, limiting the RTOC's impact on certain components of drilling operations. To overcome this limitation, artificial intelligence and machine learning (AI/ML) technologies were introduced to monitor and analyze the real-time drilling data, discover hidden patterns, and provide fast decision-support responses. AI/ML technologies are data-driven, and their quality relies on the quality of the input data: if the input data are good, the generated output will be good; if not, it will be bad. Unfortunately, owing to the harsh environments of drilling sites and the transmission setups, not all of the drilling data are good, which negatively affects the AI/ML results. The objective of this paper is to utilize AI/ML technologies to improve the quality of real-time drilling data. The paper fed a large real-time drilling dataset, consisting of over 150,000 raw data points, into Artificial Neural Network (ANN), Support Vector Machine (SVM) and Decision Tree (DT) models. The models were trained on data points labelled as valid or not valid. A confusion matrix was used to evaluate the different AI/ML models, including variants with different internal architectures. Although the ANN was the slowest, it achieved the best result, with an accuracy of 78%, compared to 73% and 41% for DT and SVM, respectively. The paper concludes by presenting a process for using AI technology to improve real-time drilling data quality. To the authors' knowledge, based on literature in the public domain, this paper is one of the first to compare multiple AI/ML techniques for quality improvement of real-time drilling data. The paper provides a guide for improving the quality of real-time drilling data.
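A minimal sketch of the model comparison described above, using scikit-learn stand-ins (an MLP for the ANN); the features and valid/not-valid labels here are synthetic placeholders, not the paper's dataset:

```python
# Minimal sketch: comparing ANN, SVM and DT classifiers on a
# valid / not-valid labelling task and evaluating them with a
# confusion matrix. Data are a synthetic placeholder.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

X, y = make_classification(n_samples=5000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "ANN": MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0),
    "SVM": SVC(kernel="rbf"),
    "DT":  DecisionTreeClassifier(max_depth=8, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    print(name, f"accuracy={accuracy_score(y_te, pred):.2f}")
    print(confusion_matrix(y_te, pred))
```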


2020 ◽  
Vol 10 (3) ◽  
pp. 820 ◽  
Author(s):  
Marcela Bindzárová Gergeľová ◽  
Žofia Kuzevičová ◽  
Slavomír Labant ◽  
Juraj Gašinec ◽  
Štefan Kuzevič ◽  
...  

Weather-related disasters represent a major threat to the sustainable development of society. This study focuses directly on assessing the quality of spatial information for the needs of hydrodynamic modeling. Based on selected procedures and methods for the collection and processing of spatial information, the aim of this study was to assess their qualitative suitability for 3D flood event modeling in accordance with the Infrastructure for Spatial Information in the European Community (INSPIRE) Directive. The evaluation process used geodetic measurements and the digital relief model 3.5 (DMR 3.5) available for the territory of the Slovak Republic. The result of this study is a qualitative assessment on three levels: (i) main channel and surrounding topography data from geodetic measurements; (ii) the digital relief model; and (iii) hydrodynamic/hydraulic modeling. The qualitative aspect of the input data shows the sensitivity of a given model to changes in input data quality. The average spatial error in point position, computed over all measured points along the watercourse and its slope foot and slope edge, was 0.017 m. Although the declared accuracy of DMR 3.5 is ±2.50 m, some sections of the selected area showed elevation differences of up to 4.79 m. For this reason, a combination of DMR 3.5 and geodetic measurements was needed to refine the input model for the hydrodynamic modeling process. The quality of the hydrological data for the monitored N-annual flow levels was of fourth-class reliability for the selected area.
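As a rough illustration (not the study's code), a Python sketch of two of the quality checks described: the mean spatial error over surveyed points and the elevation differences between a DEM and geodetic heights; all values are hypothetical:

```python
# Minimal sketch: mean 3D positional error of surveyed points and
# DEM-vs-geodetic elevation differences. All data are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-point coordinate errors from the survey adjustment (m).
sx, sy, sz = rng.normal(0, 0.01, (3, 200))
spatial_err = np.sqrt(sx**2 + sy**2 + sz**2)
print(f"mean spatial error: {spatial_err.mean():.3f} m")

# Hypothetical elevation check: DEM heights vs. geodetic heights.
z_dem = rng.normal(250.0, 5.0, 200)
z_geod = z_dem + rng.normal(0, 1.5, 200)
print(f"max elevation difference: {np.abs(z_dem - z_geod).max():.2f} m")
```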


2013 ◽  
Vol 2013 ◽  
pp. 1-9
Author(s):  
Ferdinando Di Martino ◽  
Salvatore Sessa

Today it is very difficult to evaluate the quality of spatial databases, mainly because of the heterogeneity of the input data. We define a fuzzy process for evaluating the reliability of a spatial database: the area of study is partitioned into isoreliable zones, defined as zones that are homogeneous in terms of data quality and environmental characteristics. We model a spatial database as thematic datasets; each thematic dataset concerns a specific spatial domain and includes a set of layers. We estimate the reliability of each thematic dataset and, from these, the overall reliability of the spatial database. We have tested this method on the spatial dataset of the town of Cava de' Tirreni (Italy).
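To make the aggregation idea concrete, a minimal Python sketch (not the authors' method): each layer gets a membership degree in [0, 1], the reliability of a thematic dataset is an aggregation over its layers, and the database reliability aggregates the datasets. The membership function, quality scores and aggregation operators are illustrative assumptions:

```python
# Minimal sketch: fuzzy reliability aggregation over layers and
# thematic datasets. Membership function, scores and operators
# are illustrative assumptions, not the paper's definitions.
import numpy as np

def reliability(quality, low=0.2, high=0.8):
    """Piecewise-linear membership: 0 below `low`, 1 above `high`."""
    return float(np.clip((quality - low) / (high - low), 0.0, 1.0))

thematic_datasets = {          # hypothetical quality scores per layer
    "hydrography": [0.9, 0.75, 0.6],
    "land_use":    [0.5, 0.85],
}
# Pessimistic (min) aggregation within a dataset, mean across datasets.
per_dataset = {name: min(reliability(q) for q in layers)
               for name, layers in thematic_datasets.items()}
overall = np.mean(list(per_dataset.values()))
print(per_dataset, f"overall={overall:.2f}")
```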


Energies ◽  
2020 ◽  
Vol 13 (19) ◽  
pp. 5099 ◽  
Author(s):  
Sascha Lindig ◽  
Atse Louwen ◽  
David Moser ◽  
Marko Topic

Photovoltaic monitoring data are the primary source for studying photovoltaic plant behavior. In particular, performance loss and remaining-useful-lifetime calculations rely on trustworthy input data. Furthermore, a regular stream of high-quality data is the basis for proactive operation and management activities that ensure the smooth operation of PV plants. The raw data under investigation are electrical measurements and usually meteorological data such as in-plane irradiance and temperature. Performance analyses usually follow a strict pattern: checking input data quality, applying appropriate filters, choosing a key performance indicator, and applying certain methodologies to obtain a final result. In this context, this paper focuses on four main objectives. We present common photovoltaic monitoring data quality issues, provide visual guidelines on how to detect and evaluate them, provide new data imputation approaches, and discuss common filtering approaches. Data imputation techniques for module temperature and irradiance data are discussed and compared to classical approaches. This work is intended as a soft introduction to PV monitoring data analysis, discussing best practices and issues an analyst might face. We found that if a sufficient amount of training data is available, multivariate adaptive regression splines yield good results for module temperature imputation, while histogram-based gradient boosting regression outperforms classical approaches for in-plane irradiance transposition. Based on the tested filtering procedures, we believe that standards should be developed that include relatively low irradiance thresholds together with strict power-irradiance pair filters.
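A minimal sketch of the gradient-boosting imputation idea for in-plane irradiance, using scikit-learn's histogram-based regressor; the feature set and data below are synthetic placeholders, not the paper's methodology:

```python
# Minimal sketch: learning plane-of-array irradiance from other
# monitoring channels with histogram-based gradient boosting.
# Features and data are synthetic placeholders.
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)
n = 5000
ghi = rng.uniform(0, 1000, n)        # horizontal irradiance (W/m^2)
sun_elev = rng.uniform(5, 70, n)     # solar elevation (deg)
t_amb = rng.uniform(-5, 35, n)       # ambient temperature (degC)
X = np.column_stack([ghi, sun_elev, t_amb])
# Synthetic "true" plane-of-array irradiance to learn.
poa = ghi * (1 + 0.004 * (sun_elev - 30)) + rng.normal(0, 20, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, poa, random_state=1)
model = HistGradientBoostingRegressor().fit(X_tr, y_tr)
print(f"MAE: {mean_absolute_error(y_te, model.predict(X_te)):.1f} W/m^2")
```

In practice the trained model would be used to fill gaps in the measured in-plane irradiance wherever the other channels are available.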


2012 ◽  
Vol 20 (1) ◽  
pp. 27-38
Author(s):  
Myeong-Ha Park ◽  
Seung-Man An ◽  
Yun-Soo Choi ◽  
In-Hun Jeong ◽  
Byeong-Kuk Jeon

2021 ◽  
Author(s):  
Andrew McDonald

Decades of subsurface exploration and characterisation have led to the collation and storage of large volumes of well-related data. The amount of data gathered daily continues to grow rapidly as technology and recording methods improve. With the increasing adoption of machine learning techniques in the subsurface domain, it is essential that the quality of the input data is carefully considered when working with these tools. If the input data are of poor quality, the impact on the precision and accuracy of the prediction can be significant. Consequently, this can affect key decisions about the future of a well or a field. This study focuses on well log data, which can be highly multi-dimensional, diverse and stored in a variety of file formats. Well log data exhibit the key characteristics of Big Data: Volume, Variety, Velocity, Veracity and Value. Well data can include numeric values, text values, waveform data, image arrays, maps, volumes, etc., all of which can be indexed by time or depth in a regular or irregular way. A significant portion of time can be spent gathering data and quality checking it prior to carrying out petrophysical interpretations and applying machine learning models. Well log data can be affected by numerous issues that degrade data quality. These include missing data, ranging from single data points to entire curves; noisy data from tool-related issues; borehole washout; processing issues; incorrect environmental corrections; and mislabelled data. Having vast quantities of data does not mean it can all be passed into a machine learning algorithm with the expectation that the resultant prediction will be fit for purpose. It is essential that the most important and relevant data are passed into the model through appropriate feature selection techniques. Not only does this improve the quality of the prediction, it also reduces computational time and can provide a better understanding of how the models reach their conclusions. This paper reviews data quality issues typically faced by petrophysicists when working with well log data and deploying machine learning models. First, an overview of machine learning and Big Data is given in relation to petrophysical applications. Secondly, data quality issues commonly faced with well log data are discussed. Thirdly, methods are suggested for dealing with data issues prior to modelling. Finally, multiple case studies are discussed covering the impacts of data quality on predictive capability.
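As an illustration of the missing-data issue discussed above, a minimal Python sketch of a first-pass quality check: normalising the common LAS null placeholder (-999.25) and counting missing samples per curve. The curve names and values are hypothetical:

```python
# Minimal sketch: first-pass missing-data check on well log curves.
# -999.25 is the common LAS null convention; curves are hypothetical.
import numpy as np
import pandas as pd

logs = pd.DataFrame({
    "DEPT": np.arange(1000.0, 1010.0, 0.5),
    "GR":   [55.0, 60.1, -999.25, 58.3, 61.0, -999.25, 57.2, 59.9,
             62.5, 60.0, 58.8, 57.5, -999.25, 59.1, 60.4, 61.2,
             58.0, 59.5, 60.7, 57.9],
    "RHOB": [2.45, 2.47, 2.46, np.nan, 2.50, 2.48, 2.44, 2.46,
             np.nan, 2.49, 2.47, 2.45, 2.46, 2.48, 2.50, 2.47,
             2.44, np.nan, 2.46, 2.45],
})

logs = logs.replace(-999.25, np.nan)       # normalise the null placeholder
missing = logs.drop(columns="DEPT").isna().sum()
pct = 100 * missing / len(logs)
print(pd.DataFrame({"missing": missing, "percent": pct.round(1)}))
```

A check like this is typically the first step before the noise, washout and mislabelling issues listed above are investigated curve by curve.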

