Meta-Learning PAC-Bayes Priors in Model Averaging

2020 ◽  
Vol 34 (04) ◽  
pp. 4198-4205
Author(s):  
Yimin Huang ◽  
Weiran Huang ◽  
Liang Li ◽  
Zhenguo Li

Nowadays, model uncertainty has become one of the most important problems in both academia and industry. In this paper, we consider the scenario in which a common model set is used for model averaging, rather than selecting a single final model via a model selection procedure, in order to account for model uncertainty and improve the reliability and accuracy of inferences. A main challenge here is learning the prior over the model set. To tackle this problem, we propose two data-based algorithms for obtaining proper priors for model averaging. The first is a meta-learning approach, in which the analyst uses historical, similar tasks to extract information about the prior. The second is a base-learning approach, in which a subsampling method processes the data step by step. Theoretically, we present an upper bound on the risk of our algorithm to guarantee worst-case performance. In practice, both methods perform well in simulations and real-data studies, especially with poor-quality data.
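As a concrete illustration of the weighting step, the sketch below combines a prior over a fixed model set with each model's held-out risk via a Gibbs-style reweighting; the function name, the squared-error risk, and the temperature parameter are illustrative assumptions, not the authors' algorithms.

```python
import numpy as np

def model_average_predict(models, prior, X_val, y_val, X_new, temperature=1.0):
    """Average predictions from a fixed model set under a prior over models.

    `models` are assumed to be fitted objects with a .predict() method and
    `prior` is a probability vector over the model set (e.g. learned from
    historical tasks).  Posterior weights are formed by reweighting the prior
    with each model's validation risk -- a Gibbs/PAC-Bayes style update used
    here only as an illustration, not the paper's exact procedure.
    """
    prior = np.asarray(prior, dtype=float)
    # Empirical (squared-error) risk of each model on held-out data.
    risks = np.array([np.mean((m.predict(X_val) - y_val) ** 2) for m in models])
    # Gibbs posterior: prior reweighted by the exponentiated negative risk.
    logits = np.log(prior + 1e-12) - risks / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Weighted average of the models' predictions on new inputs.
    preds = np.stack([m.predict(X_new) for m in models], axis=0)
    return weights, np.tensordot(weights, preds, axes=1)
```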

Methodology ◽  
2015 ◽  
Vol 11 (3) ◽  
pp. 89-99 ◽  
Author(s):  
Leslie Rutkowski ◽  
Yan Zhou

Abstract. Given the consistent interest in comparing achievement across sub-populations in international assessments such as TIMSS, PIRLS, and PISA, it is critical that sub-population achievement is estimated reliably and with sufficient precision. To that end, we systematically examine the limitations of the estimation methods currently used by these programs. Using a simulation study along with empirical results from the 2007 cycle of TIMSS, we show that a combination of missing and misclassified data in the conditioning model induces biases in sub-population achievement estimates, the magnitude and degree of which can be readily explained by data quality. Importantly, estimated biases in sub-population achievement are limited to the conditioning variable with poor-quality data, while other sub-population achievement estimates are unaffected. Findings are generally in line with theory on missing and error-prone covariates. The current research adds to a small body of literature that has noted some of the limitations of sub-population estimation.


2006 ◽  
Vol 21 (1) ◽  
pp. 67-70 ◽  
Author(s):  
Brian H. Toby

The most important Rietveld error indices are defined and discussed. It is shown that, while smaller error index values indicate a better fit of a model to the data, a wrong model fit to poor-quality data may exhibit smaller error index values than a superb model fit to very high-quality data.
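For reference, a minimal computation of two widely used Rietveld error indices, the weighted profile R-factor (Rwp) and the expected R-factor (Rexp), following their standard textbook definitions; the helper name and the default counting-statistics weights are assumptions, not taken from the article.

```python
import numpy as np

def rietveld_indices(y_obs, y_calc, n_params=0, weights=None):
    """Common Rietveld error indices for a powder-pattern fit.

    Standard definitions (y_obs must be positive if default weights are used):
        Rwp  = sqrt( sum w*(yo - yc)^2 / sum w*yo^2 )
        Rexp = sqrt( (N - P) / sum w*yo^2 )
        GoF  = Rwp / Rexp   (the reduced chi-squared is GoF^2)
    Weights default to 1/y_obs, the usual choice for counting statistics.
    """
    y_obs = np.asarray(y_obs, dtype=float)
    y_calc = np.asarray(y_calc, dtype=float)
    w = 1.0 / y_obs if weights is None else np.asarray(weights, dtype=float)
    denom = np.sum(w * y_obs ** 2)
    rwp = np.sqrt(np.sum(w * (y_obs - y_calc) ** 2) / denom)
    rexp = np.sqrt((y_obs.size - n_params) / denom)
    return {"Rwp": rwp, "Rexp": rexp, "GoF": rwp / rexp}
```

Because both indices share the denominator Σ w·y_obs², a noisy pattern with a high background inflates that denominator, so Rwp can look small even for a poor structural model, which is essentially the pitfall the article describes.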


Author(s):  
Minghui Wu ◽  
Canghong Jin ◽  
Wenkang Hu ◽  
Yabo Chen

Understanding mathematical topics is important for both educators and students in order to capture the latent concepts of questions, evaluate study performance, and recommend content in online learning systems. Compared to traditional text classification, mathematical topic classification poses several main challenges: (1) mathematical questions are relatively short; (2) the same mathematical concept has various representations (e.g., calculation and application); (3) question content is complex, spanning algebra, geometry, and calculus. To overcome these problems, we propose a framework that combines content tokens and mathematical knowledge concepts throughout the whole procedure. We embed entities from mathematical knowledge graphs, integrate the entities with tokens in a masked language model, set up semantic-similarity-based tasks for next-sentence prediction, and fuse knowledge vectors and token vectors during the fine-tuning procedure. We also build a Chinese mathematical topic prediction dataset consisting of more than 70,000 mathematical questions labeled with topics. Our experiments on real data demonstrate that our knowledge-graph-based mathematical topic prediction model outperforms other state-of-the-art methods.
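A minimal sketch of the fusion step described above, assuming a pretrained masked-language-model encoder and precomputed knowledge-graph entity embeddings; the pooling and concatenation choices are illustrative, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class FusedTopicClassifier(nn.Module):
    """Fuse contextual token vectors with knowledge-graph entity vectors.

    `encoder` is assumed to be a masked-language-model encoder (e.g. a BERT
    variant) whose first output is the token states of size hidden_dim, and
    `entity_vecs` are precomputed embeddings of the knowledge-graph entities
    linked to the question.  Pool-then-concatenate is an illustrative fusion
    choice, not the paper's exact design.
    """
    def __init__(self, encoder, hidden_dim, entity_dim, n_topics):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Linear(hidden_dim + entity_dim, n_topics)

    def forward(self, input_ids, attention_mask, entity_vecs):
        token_states = self.encoder(input_ids, attention_mask=attention_mask)[0]
        text_vec = token_states[:, 0]          # [CLS]-style pooled text vector
        entity_vec = entity_vecs.mean(dim=1)   # average the linked entity vectors
        fused = torch.cat([text_vec, entity_vec], dim=-1)
        return self.classifier(fused)          # topic logits
```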


2021 ◽  
Vol 13 (20) ◽  
pp. 4081
Author(s):  
Peter Weston ◽  
Patricia de Rosnay

Brightness temperature (Tb) observations from the European Space Agency (ESA) Soil Moisture and Ocean Salinity (SMOS) instrument are passively monitored in the European Centre for Medium-Range Weather Forecasts (ECMWF) Integrated Forecasting System (IFS). Several quality control procedures are performed to screen out poor-quality data and/or data that cannot be accurately simulated from the numerical weather prediction (NWP) model output. In this paper, these quality control procedures are reviewed, and enhancements are proposed, tested, and evaluated. The enhancements presented include improved sea-ice screening, coastal and ambiguous land-ocean screening, improved radio frequency interference (RFI) screening, and increased usage of observations at the edge of the satellite swath. Each of the screening changes results in improved agreement between the observations and the model-equivalent values. This is an important step in advance of future experiments to test the direct assimilation of SMOS Tbs into the ECMWF land data assimilation system.
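An illustrative screening routine of the kind described above; the field names and thresholds are placeholder assumptions standing in for the checks the paper discusses (sea-ice, coastal/ambiguous land-ocean, RFI, swath-edge, and departure screening), not the operational ECMWF/SMOS settings.

```python
import numpy as np

def screen_smos_tb(obs, rfi_prob_max=0.2, sea_ice_frac_max=0.01,
                   open_water_frac_max=0.05, incidence_range=(20.0, 55.0),
                   fg_departure_max=10.0):
    """Return a boolean mask of SMOS Tb observations that pass screening.

    `obs` is a dict of equal-length NumPy arrays.  All flag names and
    thresholds here are hypothetical placeholders for illustration only.
    """
    keep = np.ones(obs["tb"].shape, dtype=bool)
    keep &= obs["rfi_probability"] <= rfi_prob_max          # RFI screening
    keep &= obs["sea_ice_fraction"] <= sea_ice_frac_max     # sea-ice screening
    keep &= obs["open_water_fraction"] <= open_water_frac_max  # coastal / mixed footprints
    keep &= (obs["incidence_angle"] >= incidence_range[0]) & \
            (obs["incidence_angle"] <= incidence_range[1])  # swath-edge / angle screening
    keep &= np.abs(obs["tb"] - obs["tb_model"]) <= fg_departure_max  # first-guess departure check
    return keep
```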


2019 ◽  
pp. 23-34
Author(s):  
Harvey Goldstein ◽  
Ruth Gilbert

This chapter addresses data linkage, which is key to using big administrative datasets to improve the efficiency and equity of services and policies. These benefits need to be weighed against potential harms, concern about which has mainly focused on privacy. In this chapter we argue that the public and researchers should also be alert to other kinds of harms. These include misuses of big administrative data through poor-quality data, misleading analyses, misinterpretation or misuse of findings, and restrictions limiting what questions can be asked and by whom, resulting in research not being carried out and advances not being made for the public benefit. Ensuring that big administrative data are validly used for public benefit requires increased transparency about who has access and whose access is denied, how data are processed, linked, and analysed, and how analyses or algorithms are used in public and private services. Public benefit, and especially trust, requires replicable analyses by many researchers, not just a few data controllers. Wider use of big data will be helped by establishing a number of safe data repositories, fully accessible to researchers and their tools, and independent of the current monopolies on data processing, linkage, enhancement, and use of data.


Author(s):  
Patrick Royston

Since Royston and Altman's 1994 publication (Journal of the Royal Statistical Society, Series C 43: 429–467), fractional polynomials have steadily gained popularity as a tool for flexible parametric modeling of regression relationships. In this article, I present fp_select, a postestimation tool for fp that allows the user to select a parsimonious fractional polynomial model according to a closed test procedure called the fractional polynomial selection procedure or function selection procedure. I also give a brief introduction to fractional polynomial models and provide examples of using fp and fp_select to select such models with real data.
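Since fp and fp_select are Stata commands, the following Python sketch only illustrates the flavor of the closed-test idea, restricted to first-degree (FP1) models over the conventional power set; the Gaussian deviance approximation, the degrees of freedom, and the significance level are simplifying assumptions, not a reimplementation of fp_select.

```python
import numpy as np
from scipy import stats

POWERS = [-2, -1, -0.5, 0, 0.5, 1, 2, 3]   # conventional FP1 power set

def fp1_transform(x, p):
    """Fractional polynomial transform; p = 0 means log(x) by convention (x > 0)."""
    return np.log(x) if p == 0 else x ** p

def rss(y, X):
    """Residual sum of squares of an OLS fit with an intercept."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(y), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def select_fp1(x, y, alpha=0.05):
    """Toy closed-test selection among null, linear, and best-FP1 models.

    Tests best-FP1 against the null model (2 df) and then against the linear
    model (1 df) using approximate chi-squared deviance differences -- a
    simplified illustration of the function selection procedure.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(y)
    dev = lambda r: n * np.log(r / n)        # Gaussian deviance up to a constant
    d_null = dev(rss(y, np.empty((n, 0))))
    d_lin = dev(rss(y, x.reshape(-1, 1)))
    best_p = min(POWERS, key=lambda p: rss(y, fp1_transform(x, p).reshape(-1, 1)))
    d_fp1 = dev(rss(y, fp1_transform(x, best_p).reshape(-1, 1)))
    if stats.chi2.sf(d_null - d_fp1, df=2) > alpha:
        return "null (omit x)"
    if stats.chi2.sf(d_lin - d_fp1, df=1) > alpha:
        return "linear"
    return f"FP1 with power {best_p}"
```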


Algorithms ◽  
2020 ◽  
Vol 13 (5) ◽  
pp. 107 ◽  
Author(s):  
Otmane Azeroual ◽  
Włodzimierz Lewoniewski

The quality assurance of publication data in collaborative knowledge bases and in current research information systems (CRIS) is becoming more and more relevant as freely available spatial information is used in different application scenarios. When integrating these data into a CRIS, it is necessary to be able to recognize and assess their quality; only then is it possible to compile from the available data a result that fulfills its purpose for the user, namely to deliver reliable data and information. This paper discusses the quality problems of source metadata in Wikipedia and CRIS. Based on real data from over 40 million Wikipedia articles in various languages, we performed a preliminary quality analysis of the metadata of scientific publications using a data quality tool. So far, no data quality measurements have been programmed in Python to assess the quality of metadata of scientific publications in Wikipedia and CRIS. With this in mind, we implemented the methods and algorithms as code, presented here in the form of pseudocode, to measure quality along objective data quality dimensions such as completeness, correctness, consistency, and timeliness. The measurements were prepared as a macro service so that users can apply them, together with the program code, to their scientific publication metadata, and so that management can rely on high-quality data when making decisions.
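Since the article presents its measurements as pseudocode, the following Python sketch shows what two of the named dimensions, completeness and timeliness, might look like as code; the required-field list, the linear decay, and the example record are assumptions for illustration, not the authors' definitions.

```python
from datetime import date

REQUIRED_FIELDS = ["title", "authors", "year", "doi", "journal"]  # assumed schema

def completeness(record, required=REQUIRED_FIELDS):
    """Share of required metadata fields that are present and non-empty."""
    filled = sum(1 for f in required if record.get(f) not in (None, "", []))
    return filled / len(required)

def timeliness(record, max_age_years=5, today=None):
    """Simple timeliness score: 1 for current records, decaying linearly to 0."""
    today = today or date.today()
    age = today.year - int(record.get("year", today.year))
    return max(0.0, 1.0 - age / max_age_years)

# Example: a record with a missing DOI scores 4/5 on completeness.
record = {"title": "Example", "authors": ["A. Author"], "year": 2019,
          "doi": "", "journal": "Algorithms"}
print(completeness(record), timeliness(record))
```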


2018 ◽  
Vol 7 (12) ◽  
pp. 471 ◽  
Author(s):  
Adam Dąbrowski ◽  
Piotr Matczak ◽  
Andrzej Wójtowicz ◽  
Michael Leitner

Progress in surveillance technology has led to the development of Closed-Circuit Television (CCTV) systems in cities around the world. Cameras are considered instrumental in crime reduction, yet existing research does not unambiguously answer the question of whether installing them affects the number of crimes committed. The quasi-experimental method usually applied to evaluate the effectiveness of CCTV systems faces difficulties with data quantity and quality. Data quantity has a bearing on the number of crimes about which conclusions can be drawn using the experimental procedure. Data quality affects the level of crime data aggregation: the lack of an exact location for a crime incident, in the form of a street address or geographic coordinates, hinders the selection of experimental and control areas. In this paper we propose an innovative method of dealing with data limitations in a quasi-experimental study on the effectiveness of CCTV systems in Poland. As police data on crime incidents are geocoded only to the level of a neighborhood or a street, we designed a method to overcome this drawback by applying similarity measures to time series and landscape metrics. The method makes it possible to determine the experimental (test) and control areas that are necessary to conduct the study.
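A minimal sketch of the time-series part of the matching idea, assuming monthly crime counts per area; the z-scoring and Euclidean distance are illustrative similarity choices, and the landscape-metric component of the study is not shown.

```python
import numpy as np

def most_similar_controls(target_series, candidate_series, k=1):
    """Rank candidate control areas by similarity of their pre-installation
    monthly crime counts to the target (camera) area.

    Series are z-scored so that areas with similar crime *trends* match even
    if their absolute levels differ; Euclidean distance is used here purely
    as an illustrative similarity measure.
    """
    def zscore(x):
        x = np.asarray(x, dtype=float)
        return (x - x.mean()) / (x.std() + 1e-12)

    t = zscore(target_series)
    dists = {name: np.linalg.norm(t - zscore(s)) for name, s in candidate_series.items()}
    return sorted(dists, key=dists.get)[:k]   # names of the k closest areas
```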

