Meta-Learning PAC-Bayes Priors in Model Averaging

2020 ◽  
Vol 34 (04) ◽  
pp. 4198-4205
Author(s):  
Yimin Huang ◽  
Weiran Huang ◽  
Liang Li ◽  
Zhenguo Li

Nowadays, model uncertainty has become one of the most important problems in both academia and industry. In this paper, we consider the scenario in which a common model set is used for model averaging, rather than selecting a single final model via a model selection procedure, in order to account for model uncertainty and improve the reliability and accuracy of inferences. A main challenge here is learning the prior over the model set. To tackle this problem, we propose two data-based algorithms for obtaining proper priors for model averaging. The first is a meta-learning approach, in which the analyst uses historical, similar tasks to extract information about the prior. The second is a base-learning approach, in which a subsampling method processes the data step by step. Theoretically, we present an upper bound on the risk of our algorithm to guarantee worst-case performance. In practice, both methods perform well in simulations and real-data studies, especially with poor-quality data.
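As a concrete illustration of the weighting step, the sketch below combines a prior over a fixed model set with each model's held-out risk via a Gibbs-style reweighting; the function name, the squared-error risk, and the temperature parameter are illustrative assumptions, not the authors' algorithms.

```python
import numpy as np

def model_average_predict(models, prior, X_val, y_val, X_new, temperature=1.0):
    """Average predictions from a fixed model set under a prior over models.

    `models` are assumed to be fitted objects with a .predict() method and
    `prior` is a probability vector over the model set (e.g. learned from
    historical tasks).  Posterior weights are formed by reweighting the prior
    with each model's validation risk -- a Gibbs/PAC-Bayes style update used
    here only as an illustration, not the paper's exact procedure.
    """
    prior = np.asarray(prior, dtype=float)
    # Empirical (squared-error) risk of each model on held-out data.
    risks = np.array([np.mean((m.predict(X_val) - y_val) ** 2) for m in models])
    # Gibbs posterior: prior reweighted by the exponentiated negative risk.
    logits = np.log(prior + 1e-12) - risks / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Weighted average of the models' predictions on new inputs.
    preds = np.stack([m.predict(X_new) for m in models], axis=0)
    return weights, np.tensordot(weights, preds, axes=1)
```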

Methodology ◽  
2015 ◽  
Vol 11 (3) ◽  
pp. 89-99 ◽  
Author(s):  
Leslie Rutkowski ◽  
Yan Zhou

Abstract. Given the consistent interest in comparing achievement across sub-populations in international assessments such as TIMSS, PIRLS, and PISA, it is critical that sub-population achievement is estimated reliably and with sufficient precision. To that end, we systematically examine the limitations of the estimation methods currently used by these programs. Using a simulation study along with empirical results from the 2007 cycle of TIMSS, we show that a combination of missing and misclassified data in the conditioning model induces biases in sub-population achievement estimates, the magnitude and degree of which can be readily explained by data quality. Importantly, estimated biases in sub-population achievement are limited to the conditioning variable with poor-quality data, while other sub-population achievement estimates are unaffected. Findings are generally in line with theory on missing and error-prone covariates. The current research adds to a small body of literature that has noted some of the limitations of sub-population estimation.


2006 ◽  
Vol 21 (1) ◽  
pp. 67-70 ◽  
Author(s):  
Brian H. Toby

The most important Rietveld error indices are defined and discussed. It is shown that, while smaller error index values indicate a better fit of a model to the data, a wrong model fit to poor-quality data may exhibit smaller error index values than a superb model fit to very high-quality data.
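For reference, a minimal computation of two widely used Rietveld error indices, the weighted profile R-factor (Rwp) and the expected R-factor (Rexp), following their standard textbook definitions; the helper name and the default counting-statistics weights are assumptions, not taken from the article.

```python
import numpy as np

def rietveld_indices(y_obs, y_calc, n_params=0, weights=None):
    """Common Rietveld error indices for a powder-pattern fit.

    Standard definitions (y_obs must be positive if default weights are used):
        Rwp  = sqrt( sum w*(yo - yc)^2 / sum w*yo^2 )
        Rexp = sqrt( (N - P) / sum w*yo^2 )
        GoF  = Rwp / Rexp   (the reduced chi-squared is GoF^2)
    Weights default to 1/y_obs, the usual choice for counting statistics.
    """
    y_obs = np.asarray(y_obs, dtype=float)
    y_calc = np.asarray(y_calc, dtype=float)
    w = 1.0 / y_obs if weights is None else np.asarray(weights, dtype=float)
    denom = np.sum(w * y_obs ** 2)
    rwp = np.sqrt(np.sum(w * (y_obs - y_calc) ** 2) / denom)
    rexp = np.sqrt((y_obs.size - n_params) / denom)
    return {"Rwp": rwp, "Rexp": rexp, "GoF": rwp / rexp}
```

Because both indices share the denominator Σ w·y_obs², a noisy pattern with a high background inflates that denominator, so Rwp can look small even for a poor structural model, which is essentially the pitfall the article describes.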


Author(s):  
Minghui Wu ◽  
Canghong Jin ◽  
Wenkang Hu ◽  
Yabo Chen

Understanding mathematical topics is important for both educators and students in order to capture the latent concepts of questions, evaluate study performance, and recommend content in online learning systems. Compared to traditional text classification, mathematical topic classification poses several main challenges: (1) mathematical questions are relatively short; (2) the same mathematical concept has various representations (e.g., calculation and application); (3) question content is complex, spanning algebra, geometry, and calculus. To overcome these problems, we propose a framework that combines content tokens and mathematical knowledge concepts throughout the whole procedure. We embed entities from mathematical knowledge graphs, integrate the entities with tokens in a masked language model, set up semantic-similarity-based tasks for next-sentence prediction, and fuse knowledge vectors and token vectors during the fine-tuning procedure. We also build a Chinese mathematical topic prediction dataset consisting of more than 70,000 mathematical questions labeled with topics. Our experiments on real data demonstrate that our knowledge-graph-based mathematical topic prediction model outperforms other state-of-the-art methods.
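A minimal sketch of the fusion step described above, assuming a pretrained masked-language-model encoder and precomputed knowledge-graph entity embeddings; the pooling and concatenation choices are illustrative, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class FusedTopicClassifier(nn.Module):
    """Fuse contextual token vectors with knowledge-graph entity vectors.

    `encoder` is assumed to be a masked-language-model encoder (e.g. a BERT
    variant) whose first output is the token states of size hidden_dim, and
    `entity_vecs` are precomputed embeddings of the knowledge-graph entities
    linked to the question.  Pool-then-concatenate is an illustrative fusion
    choice, not the paper's exact design.
    """
    def __init__(self, encoder, hidden_dim, entity_dim, n_topics):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Linear(hidden_dim + entity_dim, n_topics)

    def forward(self, input_ids, attention_mask, entity_vecs):
        token_states = self.encoder(input_ids, attention_mask=attention_mask)[0]
        text_vec = token_states[:, 0]          # [CLS]-style pooled text vector
        entity_vec = entity_vecs.mean(dim=1)   # average the linked entity vectors
        fused = torch.cat([text_vec, entity_vec], dim=-1)
        return self.classifier(fused)          # topic logits
```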


2021 ◽  
Vol 13 (20) ◽  
pp. 4081
Author(s):  
Peter Weston ◽  
Patricia de Rosnay

Brightness temperature (Tb) observations from the European Space Agency (ESA) Soil Moisture and Ocean Salinity (SMOS) instrument are passively monitored in the European Centre for Medium-Range Weather Forecasts (ECMWF) Integrated Forecasting System (IFS). Several quality control procedures are performed to screen out poor-quality data and/or data that cannot be accurately simulated from the numerical weather prediction (NWP) model output. In this paper, these quality control procedures are reviewed, and enhancements are proposed, tested, and evaluated. The enhancements presented include improved sea-ice screening, coastal and ambiguous land-ocean screening, improved radio frequency interference (RFI) screening, and increased usage of observations at the edge of the satellite swath. Each of the screening changes results in improved agreement between the observations and the model-equivalent values. This is an important step in advance of future experiments to test the direct assimilation of SMOS Tbs into the ECMWF land data assimilation system.
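An illustrative screening routine of the kind described above; the field names and thresholds are placeholder assumptions standing in for the checks the paper discusses (sea-ice, coastal/ambiguous land-ocean, RFI, swath-edge, and departure screening), not the operational ECMWF/SMOS settings.

```python
import numpy as np

def screen_smos_tb(obs, rfi_prob_max=0.2, sea_ice_frac_max=0.01,
                   open_water_frac_max=0.05, incidence_range=(20.0, 55.0),
                   fg_departure_max=10.0):
    """Return a boolean mask of SMOS Tb observations that pass screening.

    `obs` is a dict of equal-length NumPy arrays.  All flag names and
    thresholds here are hypothetical placeholders for illustration only.
    """
    keep = np.ones(obs["tb"].shape, dtype=bool)
    keep &= obs["rfi_probability"] <= rfi_prob_max          # RFI screening
    keep &= obs["sea_ice_fraction"] <= sea_ice_frac_max     # sea-ice screening
    keep &= obs["open_water_fraction"] <= open_water_frac_max  # coastal / mixed footprints
    keep &= (obs["incidence_angle"] >= incidence_range[0]) & \
            (obs["incidence_angle"] <= incidence_range[1])  # swath-edge / angle screening
    keep &= np.abs(obs["tb"] - obs["tb_model"]) <= fg_departure_max  # first-guess departure check
    return keep
```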


2019 ◽  
pp. 23-34
Author(s):  
Harvey Goldstein ◽  
Ruth Gilbert

This chapter addresses data linkage, which is key to using big administrative datasets to improve the efficiency and equity of services and policies. These benefits need to be weighed against potential harms, concern about which has mainly focused on privacy. In this chapter we argue that the public and researchers should also be alert to other kinds of harms. These include misuses of big administrative data through poor-quality data, misleading analyses, misinterpretation or misuse of findings, and restrictions limiting what questions can be asked and by whom, resulting in research not being carried out and advances not being made for the public benefit. Ensuring that big administrative data are validly used for public benefit requires increased transparency about who has access and whose access is denied, how data are processed, linked, and analysed, and how analyses or algorithms are used in public and private services. Public benefit, and especially trust, requires replicable analyses by many researchers, not just a few data controllers. Wider use of big data will be helped by establishing a number of safe data repositories, fully accessible to researchers and their tools, and independent of the current monopolies on data processing, linkage, enhancement, and use of data.


Author(s):  
Patrick Royston

Since Royston and Altman's 1994 publication (Journal of the Royal Statistical Society, Series C 43: 429–467), fractional polynomials have steadily gained popularity as a tool for flexible parametric modeling of regression relationships. In this article, I present fp_select, a postestimation tool for fp that allows the user to select a parsimonious fractional polynomial model according to a closed test procedure called the fractional polynomial selection procedure or function selection procedure. I also give a brief introduction to fractional polynomial models and provide examples of using fp and fp_select to select such models with real data.
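Since fp and fp_select are Stata commands, the following Python sketch only illustrates the flavor of the closed-test idea, restricted to first-degree (FP1) models over the conventional power set; the Gaussian deviance approximation, the degrees of freedom, and the significance level are simplifying assumptions, not a reimplementation of fp_select.

```python
import numpy as np
from scipy import stats

POWERS = [-2, -1, -0.5, 0, 0.5, 1, 2, 3]   # conventional FP1 power set

def fp1_transform(x, p):
    """Fractional polynomial transform; p = 0 means log(x) by convention (x > 0)."""
    return np.log(x) if p == 0 else x ** p

def rss(y, X):
    """Residual sum of squares of an OLS fit with an intercept."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(y), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def select_fp1(x, y, alpha=0.05):
    """Toy closed-test selection among null, linear, and best-FP1 models.

    Tests best-FP1 against the null model (2 df) and then against the linear
    model (1 df) using approximate chi-squared deviance differences -- a
    simplified illustration of the function selection procedure.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(y)
    dev = lambda r: n * np.log(r / n)        # Gaussian deviance up to a constant
    d_null = dev(rss(y, np.empty((n, 0))))
    d_lin = dev(rss(y, x.reshape(-1, 1)))
    best_p = min(POWERS, key=lambda p: rss(y, fp1_transform(x, p).reshape(-1, 1)))
    d_fp1 = dev(rss(y, fp1_transform(x, best_p).reshape(-1, 1)))
    if stats.chi2.sf(d_null - d_fp1, df=2) > alpha:
        return "null (omit x)"
    if stats.chi2.sf(d_lin - d_fp1, df=1) > alpha:
        return "linear"
    return f"FP1 with power {best_p}"
```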


Algorithms ◽  
2020 ◽  
Vol 13 (5) ◽  
pp. 107 ◽  
Author(s):  
Otmane Azeroual ◽  
Włodzimierz Lewoniewski

The quality assurance of publication data in collaborative knowledge bases and in current research information systems (CRIS) is becoming more and more relevant as freely available spatial information is used in different application scenarios. When integrating these data into a CRIS, it is necessary to be able to recognize and assess their quality; only then is it possible to compile from the available data a result that fulfills its purpose for the user, namely to deliver reliable data and information. This paper discusses the quality problems of source metadata in Wikipedia and CRIS. Based on real data from over 40 million Wikipedia articles in various languages, we performed a preliminary quality analysis of the metadata of scientific publications using a data quality tool. So far, no data quality measurements have been programmed in Python to assess the quality of metadata of scientific publications in Wikipedia and CRIS. With this in mind, we implemented the methods and algorithms as code, presented here in the form of pseudocode, to measure quality along objective data quality dimensions such as completeness, correctness, consistency, and timeliness. The measurements were prepared as a macro service so that users can apply them, together with the program code, to their scientific publication metadata, and so that management can rely on high-quality data when making decisions.
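Since the article presents its measurements as pseudocode, the following Python sketch shows what two of the named dimensions, completeness and timeliness, might look like as code; the required-field list, the linear decay, and the example record are assumptions for illustration, not the authors' definitions.

```python
from datetime import date

REQUIRED_FIELDS = ["title", "authors", "year", "doi", "journal"]  # assumed schema

def completeness(record, required=REQUIRED_FIELDS):
    """Share of required metadata fields that are present and non-empty."""
    filled = sum(1 for f in required if record.get(f) not in (None, "", []))
    return filled / len(required)

def timeliness(record, max_age_years=5, today=None):
    """Simple timeliness score: 1 for current records, decaying linearly to 0."""
    today = today or date.today()
    age = today.year - int(record.get("year", today.year))
    return max(0.0, 1.0 - age / max_age_years)

# Example: a record with a missing DOI scores 4/5 on completeness.
record = {"title": "Example", "authors": ["A. Author"], "year": 2019,
          "doi": "", "journal": "Algorithms"}
print(completeness(record), timeliness(record))
```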


2018 ◽  
Vol 7 (12) ◽  
pp. 471 ◽  
Author(s):  
Adam Dąbrowski ◽  
Piotr Matczak ◽  
Andrzej Wójtowicz ◽  
Michael Leitner

Progress in surveillance technology has led to the development of Closed-Circuit Television (CCTV) systems in cities around the world. Cameras are considered instrumental in crime reduction, yet existing research does not unambiguously answer the question of whether installing them affects the number of crimes committed. The quasi-experimental method usually applied to evaluate the effectiveness of CCTV systems faces difficulties with data quantity and quality. Data quantity has a bearing on the number of crimes about which conclusions can be drawn using the experimental procedure. Data quality affects the level of crime data aggregation: the lack of an exact location for a crime incident, in the form of a street address or geographic coordinates, hinders the selection of experimental and control areas. In this paper we propose an innovative method of dealing with data limitations in a quasi-experimental study on the effectiveness of CCTV systems in Poland. As police data on crime incidents are geocoded only to the level of a neighborhood or a street, we designed a method to overcome this drawback by applying similarity measures to time series and landscape metrics. The method makes it possible to determine the experimental (test) and control areas that are necessary to conduct the study.
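A minimal sketch of the time-series part of the matching idea, assuming monthly crime counts per area; the z-scoring and Euclidean distance are illustrative similarity choices, and the landscape-metric component of the study is not shown.

```python
import numpy as np

def most_similar_controls(target_series, candidate_series, k=1):
    """Rank candidate control areas by similarity of their pre-installation
    monthly crime counts to the target (camera) area.

    Series are z-scored so that areas with similar crime *trends* match even
    if their absolute levels differ; Euclidean distance is used here purely
    as an illustrative similarity measure.
    """
    def zscore(x):
        x = np.asarray(x, dtype=float)
        return (x - x.mean()) / (x.std() + 1e-12)

    t = zscore(target_series)
    dists = {name: np.linalg.norm(t - zscore(s)) for name, s in candidate_series.items()}
    return sorted(dists, key=dists.get)[:k]   # names of the k closest areas
```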

