The Impact of Data Quantity and Source on the Quality of Data-Driven Hints for Programming

Author(s):  
Thomas W. Price ◽  
Rui Zhi ◽  
Yihuan Dong ◽  
Nicholas Lytle ◽  
Tiffany Barnes

2020 ◽  
Author(s):  
Murat Sorkun ◽  
J. M. Koelman ◽  
Süleyman Er

Abstract Accurate prediction of the solubility of chemical substances in solvents remains a challenge. The sparsity of high-quality solubility data is recognized as the biggest hurdle in the development of robust data-driven methods for practical use. Nonetheless, the effects of the quality and quantity of data on aqueous solubility predictions have not yet been scrutinized. In this study, the roles of the size and quality of datasets in the performance of solubility prediction models are unraveled, and the concepts of actual and observed performance are introduced. In an effort to narrow the gap between actual and observed performance, a quality-oriented data selection method is designed that evaluates the quality of the data and extracts its most accurate part through statistical validation. Applying this method to the largest publicly available solubility database and using a consensus machine learning approach, a top-performing solubility prediction model is achieved.
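The quality-oriented selection and consensus ideas might be sketched roughly as follows (illustrative Python; the cross-source agreement criterion, the 0.5 log-unit threshold, and the simple model averaging are assumptions for illustration, not the paper's actual validation procedure):

```python
import statistics

def select_high_quality(measurements, max_spread=0.5):
    """Keep compounds whose reported solubility values agree across sources.
    `measurements` maps a compound id to a list of logS values drawn from
    different datasets; a small spread across sources is used here as a
    proxy for reliability (an illustrative criterion only)."""
    selected = {}
    for cid, values in measurements.items():
        if len(values) >= 2 and statistics.stdev(values) <= max_spread:
            selected[cid] = statistics.mean(values)
    return selected

def consensus_predict(models, features):
    """Average the predictions of several models (a consensus approach)."""
    preds = [model(features) for model in models]
    return sum(preds) / len(preds)
```

For example, `select_high_quality({"aspirin": [-1.7, -1.6], "caffeine": [-0.9, -2.1]})` keeps only the compound whose independent measurements agree.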


2019 ◽  
Author(s):  
Pavankumar Mulgund ◽  
Raj Sharman ◽  
Priya Anand ◽  
Shashank Shekhar ◽  
Priya Karadi

BACKGROUND In recent years, online physician-rating websites have become prominent and exert considerable influence on patients’ decisions. However, the quality of these decisions depends on the quality of the data that these systems collect. Thus, there is a need to examine the various data quality issues affecting physician-rating websites. OBJECTIVE This study’s objective was to identify and categorize the data quality issues afflicting physician-rating websites by reviewing the literature on online patient-reported physician ratings and reviews. METHODS We performed a systematic literature search in the ACM Digital Library, EBSCO, Springer, PubMed, and Google Scholar. The search was limited to quantitative, qualitative, and mixed-methods papers published in English from 2001 to 2020. RESULTS A total of 423 articles were screened, from which 49 papers describing 18 unique data quality issues afflicting physician-rating websites were included. Using a data quality framework, we classified these issues into four categories: intrinsic, contextual, representational, and accessible. Among the papers, 53% (26/49) reported intrinsic data quality errors, 61% (30/49) highlighted contextual data quality issues, 8% (4/49) discussed representational data quality issues, and 27% (13/49) emphasized accessibility data quality issues. More than half of the papers discussed multiple categories of data quality issues. CONCLUSIONS The results of this review demonstrate the presence of a range of data quality issues. While intrinsic and contextual factors have been well researched, accessibility and representational issues warrant more attention from researchers and practitioners. In particular, representational factors, such as the impact of inline advertisements and the positioning of positive reviews on the first few pages, are usually deliberate and result from the business model of physician-rating websites. The impact of these factors on data quality has not been addressed adequately and requires further investigation.
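The reported shares follow directly from the per-category counts; a quick arithmetic check (variable names are ours, not the paper's):

```python
# Papers reporting each data quality category, out of 49 included papers.
counts = {"intrinsic": 26, "contextual": 30, "representational": 4, "accessible": 13}
total_papers = 49

# Percentage of papers touching each category, rounded as in the abstract.
shares = {cat: round(100 * n / total_papers) for cat, n in counts.items()}

# The counts sum to more than 49 because over half of the papers
# discussed issues in more than one category.
overlap = sum(counts.values()) - total_papers
```

Here `overlap` comes out positive (24), consistent with the statement that more than half of the papers discussed multiple categories.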


2021 ◽  
Vol 2021 (2) ◽  
pp. 229-241
Author(s):  
Vera L. LUKICHEVA ◽  
Andrey A. PRIVALOV ◽  
Daniil D. TITOV ◽  
...  

Objective: To analyze the impact of computer attacks on the performance quality of data transmission channels and channeling systems, taking into account an intruder’s ability to introduce malware into channeling systems when carrying out a computer attack. Methods: To determine the required design ratios, several options for setting the distribution functions that characterize the input parameters and the types of inbound streams were considered, taking into account the parameters of the intruder’s computer attack model, which are set by the values of the probability of a successful attack. Mathematical modeling is carried out using the method of topological transformation of stochastic networks. The exponential, momentum, and gamma distributions are considered as distribution functions of the random variables, and solutions are presented for inbound streams corresponding to the Poisson, Weibull, and Pareto models. Results: The proposed approach makes it possible to assess the performance quality of data transmission channels under computer attacks. These assessments allow analyzing the current state and developing guidelines for improving the performance quality of communication channels against the destructive information impact of an intruder. Different distribution functions of the random variables and different types of inbound stream were used in the modeling, making it possible to compare them and to assess their applicability to channels that provide users with different services. Practical importance: The modeling results can be used to build decision support systems for communication management, as well as to detect attempts at unauthorized access to the telecommunications resources of transportation management systems. The proposed approach can also be applied in the development of threat models to describe the capabilities of the intruder (the ‘Intruder Model’).
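A simple Monte Carlo stand-in can illustrate how the probability of a successful attack degrades channel performance (illustrative Python; the exponential rates and the retry-on-attack scheme are hypothetical and are not the paper's analytical method of topological transformation of stochastic networks):

```python
import random

def simulate_delivery(p_attack, n_trials=100_000, seed=1):
    """Estimate the mean message delivery time over a channel subject to
    computer attacks. Transmission times are exponential; with probability
    `p_attack` a successful attack forces a recovery delay plus a
    retransmission. All rates below are illustrative assumptions."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        t = rng.expovariate(2.0)           # nominal transmission time (mean 0.5)
        while rng.random() < p_attack:     # successful attack -> retry
            t += rng.expovariate(0.5)      # recovery delay (mean 2.0)
            t += rng.expovariate(2.0)      # retransmission
        total += t
    return total / n_trials
```

Comparing `simulate_delivery(0.0)` with `simulate_delivery(0.3)` shows the mean delivery time growing with the attack success probability, which is the kind of quality assessment the analytical approach provides in closed form.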


2021 ◽  
Author(s):  
Sven Hilbert ◽  
Stefan Coors ◽  
Elisabeth Barbara Kraus ◽  
Bernd Bischl ◽  
Mario Frei ◽  
...  

Classical statistical methods are limited in the analysis of high-dimensional datasets. Machine learning (ML) provides a powerful framework for prediction by exploiting complex relationships, often encountered in modern data with large numbers of variables and cases and potentially non-linear effects. ML has turned into one of the most influential analytical approaches of this millennium and has recently become popular in the behavioral and social sciences. The impact of ML methods on research and practical applications in the educational sciences is still limited, but it grows continuously as larger and more complex datasets become available through massive open online courses (MOOCs) and large-scale investigations. The educational sciences are at a crucial pivot point because of the anticipated impact ML methods hold for the field. Here, we review the opportunities and challenges of ML for the educational sciences, show how a look at related disciplines can help in learning from their experiences, and argue for a philosophical shift in model evaluation. We demonstrate how the overall quality of data analysis in educational research can benefit from these methods and show how ML can play a decisive role in the validation of empirical models. In this review, we (1) provide an overview of the types of data suitable for ML, (2) give practical advice for the application of ML methods, and (3) show how ML-based tools and applications can be used to enhance the quality of education. Additionally, we provide practical R code with exemplary analyses, available at https://osf.io/ntre9/?view_only=d29ae7cf59d34e8293f4c6bbde3e4ab2.
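The philosophical shift toward ML-style model evaluation rests on judging a model by its predictions on data it never saw during fitting. A minimal sketch of that split (in Python rather than the authors' R, with illustrative helper names):

```python
import random

def train_test_split(xs, ys, test_frac=0.3, seed=0):
    """Shuffle and split a dataset; the test portion is held out and
    never used for model fitting."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    cut = int(len(xs) * (1 - test_frac))
    train_idx, test_idx = idx[:cut], idx[cut:]
    return ([xs[i] for i in train_idx], [ys[i] for i in train_idx],
            [xs[i] for i in test_idx], [ys[i] for i in test_idx])

def mse(y_true, y_pred):
    """Mean squared error: the out-of-sample value of this quantity,
    not the in-sample fit, is what ML-style evaluation reports."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)
```

A model whose in-sample error is low but whose held-out `mse` is high is overfitting; classical goodness-of-fit statistics alone would not reveal this.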


2019 ◽  
Vol 18 ◽  
pp. 160940691987646 ◽  
Author(s):  
Saltanat Janenova

This article provides a local scholar’s reflective analysis of the methodological challenges of conducting research in Kazakhstan, a post-Soviet, authoritarian, Central Asian country. It specifically addresses the problems of gaining access to government officials and of data quality, describes the strategies the researcher applied to mitigate these obstacles, and discusses the impact of the political environment on decisions relating to research design, ethical integrity, the safety of participants and researchers, and publication dilemmas. This article will be of interest both to researchers who are conducting or planning research in Kazakhstan and Central Asia and to those researching in other nondemocratic contexts, as the methodological challenges of an authoritarian regime stretch beyond geographical boundaries.


Sensors ◽  
2019 ◽  
Vol 19 (19) ◽  
pp. 4172 ◽  
Author(s):  
Karel Dejmal ◽  
Petr Kolar ◽  
Josef Novotny ◽  
Alena Roubalova

An increasing number of individuals and institutions own or operate meteorological stations, but the resulting data are not yet commonly used in the Czech Republic. One of the main difficulties is the heterogeneity of the measuring systems, which calls the quality of the resulting data into question. Only after a thorough quality control of the recorded data is it possible to proceed with, for example, a survey of the variability of a chosen meteorological parameter in an urban or suburban region. The most commonly studied element in this environment is air temperature. In the first phase, this paper focuses on the quality of data provided by amateur and institutional stations; the subsequent analyses then work with the corrected time series. Given the nature of the analyzed data and their potential future use, it is worthwhile to assess the appropriateness of temporal, and possibly spatial, interpolation of missing values. The evaluation of the seasonal variability of air temperature at the scale of the city of Brno and its surroundings in 2015–2017 demonstrates that enriching the network of standard (professional) stations with new stations may significantly refine, or even revise, the current state of knowledge, for example with respect to the urban heat island phenomenon. A cluster analysis was applied to assess the impact of localization circumstances (station environment, exposure, etc.) as well as the typological classification of the set of meteorological stations.


Author(s):  
Michael Reiter ◽  
Uwe Breitenbucher ◽  
Oliver Kopp ◽  
Dimka Karastoyanova

2014 ◽  
pp. 3-29
