Better models by discarding data?

2013 ◽  
Vol 69 (7) ◽  
pp. 1215-1222 ◽  
Author(s):  
K. Diederichs ◽  
P. A. Karplus

In macromolecular X-ray crystallography, typical data sets have substantial multiplicity. This can be used to calculate the consistency of repeated measurements and thereby assess data quality. Recently, the properties of a correlation coefficient, CC1/2, that can be used for this purpose were characterized and it was shown that CC1/2 has superior properties compared with 'merging' R values. A derived quantity, CC*, links data and model quality. Using experimental data sets, the behaviour of CC1/2 and the more conventional indicators were compared in two situations of practical importance: merging data sets from different crystals and selectively rejecting weak observations or (merged) unique reflections from a data set. In these situations, controlled 'paired-refinement' tests show that even though discarding the weaker data leads to improvements in the merging R values, the refined models based on these data are of lower quality. These results show the folly of such data-filtering practices aimed at improving the merging R values. Interestingly, in all of these tests CC1/2 is the one data-quality indicator for which the behaviour accurately reflects which of the alternative data-handling strategies results in the best-quality refined model. Its properties in the presence of systematic error are documented and discussed.
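
As context for the quantities above, a minimal Python sketch of how CC1/2 and the derived CC* could be computed from two random half-data-set averages is given below. The CC* relation follows Karplus & Diederichs; the half-data-set splitting and the simulated intensities are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def cc_half_and_cc_star(half1, half2):
    """CC1/2: Pearson correlation between intensities merged from two
    random halves of the observations; CC*: the derived estimate of the
    correlation of the full merged data with the (unknown) true signal."""
    cc_half = np.corrcoef(half1, half2)[0, 1]
    # CC* = sqrt(2*CC1/2 / (1 + CC1/2))  (Karplus & Diederichs, 2012)
    cc_star = np.sqrt(2.0 * cc_half / (1.0 + cc_half))
    return cc_half, cc_star

# Illustrative use with simulated "true" intensities plus noise
rng = np.random.default_rng(0)
true_intensity = rng.gamma(shape=2.0, scale=100.0, size=5000)
half1 = true_intensity + rng.normal(0.0, 50.0, size=true_intensity.size)
half2 = true_intensity + rng.normal(0.0, 50.0, size=true_intensity.size)
print(cc_half_and_cc_star(half1, half2))
```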

2007 ◽  
Vol 15 (4) ◽  
pp. 365-386 ◽  
Author(s):  
Yoshiko M. Herrera ◽  
Devesh Kapur

This paper examines the construction and use of data sets in political science. We focus on three interrelated questions: How might we assess data quality? What factors shape data quality? And how can these factors be addressed to improve data quality? We first outline some problems with existing data set quality, including issues of validity, coverage, and accuracy, and we discuss some ways of identifying problems as well as some consequences of data quality problems. The core of the paper addresses the second question by analyzing the incentives and capabilities facing four key actors in a data supply chain: respondents, data collection agencies (including state bureaucracies and private organizations), international organizations, and finally, academic scholars. We conclude by making some suggestions for improving the use and construction of data sets.

"It is a capital mistake, Watson, to theorise before you have all the evidence. It biases the judgment." —Sherlock Holmes in "A Study in Scarlet"

"Statistics make officials, and officials make statistics." —Chinese proverb


Author(s):  
Tu Renwei ◽  
Zhu Zhongjie ◽  
Bai Yongqiang ◽  
Gao Ming ◽  
Ge Zhifeng

Unmanned Aerial Vehicle (UAV) inspection has become one of the main methods for transmission line inspection, but it still suffers from shortcomings such as slow detection speed, low efficiency, and poor performance in low-light environments. To address these issues, this paper proposes a deep learning detection model based on You Only Look Once (YOLO) v3. On the one hand, the neural network structure is simplified: the three feature maps of YOLO v3 are pruned to two to meet the specific detection requirements. Meanwhile, the K-means++ clustering method is used to calculate the anchor values for the data set to improve detection accuracy. On the other hand, a data set of 1000 power tower and insulator images is collected; these images are inverted and scaled to expand the data set, which is further augmented with different illumination conditions and viewing angles. The experimental results show that the improved YOLO v3 model increases detection accuracy by 6.0%, reduces FLOPs by 8.4%, and increases detection speed by about 6.0%.
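
The anchor computation described above can be sketched as follows. This is a hedged illustration using scikit-learn's k-means with k-means++ initialization on ground-truth box widths and heights; the paper's exact distance metric, anchor count, and implementation are not given here, so the function name, the six-anchor choice (three per remaining feature map), and the random box data are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def anchor_boxes(box_wh, n_anchors=6, seed=0):
    """Cluster ground-truth bounding-box (width, height) pairs into
    anchor sizes using k-means with k-means++ initialization."""
    km = KMeans(n_clusters=n_anchors, init="k-means++", n_init=10,
                random_state=seed).fit(box_wh)
    anchors = km.cluster_centers_
    # Sort anchors by area so they can be split across detection scales
    return anchors[np.argsort(anchors.prod(axis=1))]

# Illustrative use on random box dimensions (in pixels)
rng = np.random.default_rng(0)
box_wh = rng.uniform(10, 300, size=(1000, 2))
print(anchor_boxes(box_wh, n_anchors=6))
```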


Author(s):  
James B. Elsner ◽  
Thomas H. Jagger

Hurricane data originate from careful analysis of past storms by operational meteorologists. The data include estimates of the hurricane position and intensity at 6-hourly intervals. Information related to landfall time, local wind speeds, damages, and deaths, as well as cyclone size, is included. The data are archived by season. Some effort is needed to make the data useful for hurricane climate studies. In this chapter, we describe the data sets used throughout this book. We show you a work flow that includes importing, interpolating, smoothing, and adding attributes. We also show you how to create subsets of the data. The code in this chapter is more complicated and can take longer to run. You can skip this material on first reading and continue with model building in Chapter 7. You can return here when you have an updated version of the data that includes the most recent years.

Most statistical models in this book use the best-track data. Here we describe these data and provide original source material. We also explain how to smooth and interpolate them. Interpolations are needed for regional hurricane analyses. The best-track data set contains the 6-hourly center locations and intensities of all known tropical cyclones across the North Atlantic basin, including the Gulf of Mexico and Caribbean Sea. The data set is called HURDAT, for HURricane DATa. It is maintained by the U.S. National Oceanic and Atmospheric Administration (NOAA) at the National Hurricane Center (NHC). Center locations are given in geographic coordinates (in tenths of degrees); the intensities, representing the one-minute near-surface (∼10 m) wind speeds, are given in knots (1 kt = 0.5144 m/s); and the minimum central pressures are given in millibars (1 mb = 1 hPa). The data are provided at 6-hourly intervals starting at 00 UTC (Coordinated Universal Time). The version of the HURDAT file used here contains cyclones over the period 1851 through 2010 inclusive. Information on the history and origin of these data is found in Jarvinen et al. (1984). The file has a logical structure that makes it easy to read with a FORTRAN program. Each cyclone contains a header record, a series of data records, and a trailer record.
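
As a concrete illustration of the interpolation and unit handling described above, here is a minimal Python/pandas sketch that upsamples a 6-hourly track to hourly positions and converts wind speeds from knots to m/s. The column names and the synthetic three-point track are assumptions for illustration; the book's own workflow and the HURDAT record layout are not reproduced here.

```python
import pandas as pd

KT_TO_MS = 0.5144  # 1 kt = 0.5144 m/s, as noted above

def interpolate_track(track, freq="1h"):
    """Upsample a 6-hourly best-track record to a finer time step by
    linear-in-time interpolation and add wind speed in m/s.

    track: DataFrame indexed by datetime with (hypothetical) columns
           'lat', 'lon', 'wind_kt', 'pressure_mb'.
    """
    fine = track.resample(freq).asfreq()          # insert hourly time steps
    fine = fine.interpolate(method="time")        # fill them by interpolation
    fine["wind_ms"] = fine["wind_kt"] * KT_TO_MS  # knots -> m/s
    return fine

# Illustrative use with a synthetic three-point track
idx = pd.to_datetime(["2010-09-01 00:00", "2010-09-01 06:00",
                      "2010-09-01 12:00"])
track = pd.DataFrame({"lat": [25.0, 25.8, 26.9],
                      "lon": [-75.0, -76.2, -77.1],
                      "wind_kt": [80, 90, 100],
                      "pressure_mb": [970, 962, 955]}, index=idx)
print(interpolate_track(track).head())
```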


2019 ◽  
Vol 97 (10) ◽  
pp. 4076-4084
Author(s):  
Tiphaine Macé ◽  
Dominique Hazard ◽  
Fabien Carrière ◽  
Sebastien Douls ◽  
Didier Foulquié ◽  
...  

Abstract. The main objective of this work was to study the relationships between body reserve (BR) dynamics and rearing performance (PERF) traits in ewes from a Romane meat sheep flock managed extensively on "Causse" rangelands in the south of France. Flock records were used to generate data sets covering 14 lambing years (YR). The data set included 1,146 ewes with 2 ages at first lambing (AGE), 3 parities (PAR), and 4 litter sizes (LS). Repeated measurements of BW and BCS were used as indicators of BR. The ewe PERF traits recorded were indirect measurements of maternal abilities and included prolificacy, litter weight and lamb BW at lambing and weaning, ADG at 1, 2, and 3 mo after lambing, and litter survival from lambing to weaning. The effects on PERF traits of different BW and BCS trajectories (i.e., changes in BW and BCS across the production cycle), which had previously been characterized in the same animals, were investigated. Such trajectories reflected different profiles at the intraflock level in the dynamics of BR mobilization–accretion cycles. Genetic relationships between BR and PERF traits were also assessed. All the fixed variables considered (i.e., YR, AGE, PAR, LS, and SEX ratio of the litter) had significant effects on the PERF traits. Similarly, BW trajectories had an effect on the PERF traits across the 3 PARs studied, particularly during the first cycle (PAR 1). The BCS trajectories only affected prolificacy, lamb BW at birth, and litter survival. Most of the PERF traits considered here showed moderate heritabilities (0.17–0.23), except for prolificacy, lamb growth rate during the third month, and litter survival, which showed very low heritabilities. With the exception of litter survival and prolificacy, ewe PERF traits were strongly and positively genetically correlated with BW whatever the physiological stage. A few weak genetic correlations were found between BCS and PERF traits. As illustrated by BW and BCS changes over time, favorable genetic correlations, albeit few and moderate, were found between BR accretion or mobilization and PERF traits, particularly for prolificacy and litter weight at birth. In conclusion, our results show significant relationships between BR dynamics and PERF traits in ewes, which could be considered in future sheep selection programs aiming to improve robustness.


2015 ◽  
Vol 5 (3) ◽  
pp. 350-380 ◽  
Author(s):  
Abdifatah Ahmed Haji ◽  
Sanni Mubaraq

Purpose – The purpose of this paper is to examine the impact of corporate governance and ownership structure attributes on firm performance following the revised code on corporate governance in Malaysia. The study presents a longitudinal assessment of compliance with the revised code and of its implications for firm performance. Design/methodology/approach – Two data sets, from before (2006) and after (2008-2010) the revised code, are examined. Drawing from the largest companies listed on Bursa Malaysia (BM), the first data set contains 92 observations in the year 2006, while the second data set comprises 282 observations drawn from the largest companies listed on BM over a three-year period, from 2008 to 2010. Both accounting (return on assets and return on equity) and market performance (Tobin's Q) measures were used to measure firm performance. Multiple and panel data regression analyses were adopted to analyze the data. Findings – The study shows that there were still cases of non-compliance with the basic requirements of the code, such as the one-third independent non-executive director (INDs) requirement, even after the revised code. While the regression models indicate marginal significance of board size and independent directors before the revised code, the results indicate that all corporate governance variables have a significant negative relationship with at least one of the measures of corporate performance. Independent chairperson, however, showed a consistent positive impact on firm performance both before and after the revised code. In addition, ownership structure elements were found to have a negative relationship with either accounting or market performance measures, with institutional ownership showing a consistent negative impact on firm performance. Firm size and leverage, as control variables, were significant in determining corporate performance. Research limitations/implications – One limitation is the use of separate measures of corporate governance attributes, as opposed to a corporate governance index (CGI). As a result, the study constructs a CGI based on the recommendations of the revised code and proposes it for use in future research. Practical implications – Some of the largest companies did not even comply with basic requirements such as the "one-third INDs" mandatory requirement. Hence, the regulators may want to reinforce the requirements of the code and also detail examples of good governance practices. The results, which show a consistent positive relationship between the presence of an independent chairperson and firm performance in both data sets, suggest that listed companies consider appointing an independent chairperson in the corporate leadership. The regulatory authorities may also wish to note this phenomenon when drafting any future corporate governance codes. Originality/value – This study offers new insights into the implications of regulatory changes for the relationship between corporate governance attributes and firm performance from the perspective of a developing country. The development of a CGI for future research is a novel approach of this study.


2016 ◽  
Author(s):  
Brecht Martens ◽  
Diego G. Miralles ◽  
Hans Lievens ◽  
Robin van der Schalie ◽  
Richard A. M. de Jeu ◽  
...  

Abstract. The Global Land Evaporation Amsterdam Model (GLEAM) is a set of algorithms dedicated to the estimation of terrestrial evaporation and root-zone soil moisture from satellite data. Ever since its development in 2011, the model has been regularly revised, aiming at the optimal incorporation of new satellite-observed geophysical variables and at improving the representation of physical processes. In this study, the next version of this model (v3) is presented. Key changes relative to the previous version include: (1) a revised formulation of the evaporative stress, (2) an optimized drainage algorithm, and (3) a new soil moisture data assimilation system. GLEAM v3 is used to produce three new data sets of terrestrial evaporation and root-zone soil moisture, including a 35-year data set spanning the period 1980–2014 (v3.0a, based on satellite-observed soil moisture, vegetation optical depth and snow water equivalents, reanalysis air temperature and radiation, and a multi-source precipitation product) and two fully satellite-based data sets. The latter two share most of their forcing, except for the vegetation optical depth and soil moisture products, which are based on observations from different passive and active C- and L-band microwave sensors (European Space Agency Climate Change Initiative data sets) for the first data set (v3.0b, spanning the period 2003–2015) and on observations from the Soil Moisture and Ocean Salinity satellite for the second data set (v3.0c, spanning the period 2011–2015). These three data sets are described in detail, compared against analogous data sets generated using the previous version of GLEAM (v2), and validated against measurements from 64 eddy-covariance towers and 2338 soil moisture sensors across a broad range of ecosystems. Results indicate that the quality of the v3 soil moisture is consistently better than that of v2: average correlations against in situ surface soil moisture measurements increase from 0.61 to 0.64 in the case of the v3.0a data set, and the representation of soil moisture in the second layer improves as well, with correlations increasing from 0.47 to 0.53. Similar improvements are observed for the two fully satellite-based data sets. Despite regional differences, the quality of the evaporation fluxes remains overall similar to that obtained using the previous version of GLEAM, with average correlations against eddy-covariance measurements between 0.78 and 0.80 for the three data sets. These global data sets of terrestrial evaporation and root-zone soil moisture are now openly available at http://GLEAM.eu and may be used for large-scale hydrological applications, climate studies and research on land-atmosphere feedbacks.
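
The validation statistic quoted above (average correlation against in situ sensors) can be illustrated with a short Python sketch. This is an assumed, simplified computation, station-by-station Pearson correlation over the common valid time steps, then averaged; it is not the GLEAM validation code, and the data structures are hypothetical.

```python
import numpy as np

def mean_station_correlation(modelled, observed):
    """Average Pearson correlation between modelled and in situ series,
    computed station by station over their common valid time steps.

    modelled, observed: dicts mapping station id -> 1-D arrays on a
    shared time axis, with NaN where data are missing (assumed layout).
    """
    correlations = []
    for station, obs in observed.items():
        mod = modelled[station]
        valid = ~np.isnan(obs) & ~np.isnan(mod)
        if valid.sum() > 2:  # need a few points for a meaningful correlation
            correlations.append(np.corrcoef(mod[valid], obs[valid])[0, 1])
    return float(np.mean(correlations))

# Illustrative use with two synthetic stations
rng = np.random.default_rng(0)
obs = {s: rng.uniform(0.1, 0.4, size=365) for s in ("site_a", "site_b")}
mod = {s: v + rng.normal(0, 0.03, size=v.size) for s, v in obs.items()}
print(mean_station_correlation(mod, obs))
```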


2019 ◽  
Vol 52 (4) ◽  
pp. 854-863 ◽  
Author(s):  
Brendan Sullivan ◽  
Rick Archibald ◽  
Jahaun Azadmanesh ◽  
Venu Gopal Vandavasi ◽  
Patricia S. Langan ◽  
...  

Neutron crystallography offers enormous potential to complement structures from X-ray crystallography by clarifying the positions of low-Z elements, namely hydrogen. Macromolecular neutron crystallography, however, remains limited, in part owing to the challenge of integrating peak shapes from pulsed-source experiments. To advance existing software, this article demonstrates the use of machine learning to refine peak locations, predict peak shapes and yield more accurate integrated intensities when applied to whole data sets from a protein crystal. The artificial neural network, based on the U-Net architecture commonly used for image segmentation, is trained using about 100 000 simulated training peaks derived from strong peaks. After 100 training epochs (a round of training over the whole data set broken into smaller batches), training converges and achieves a Dice coefficient of around 65%, in contrast to just 15% for negative control data sets. Integrating whole peak sets using the neural network yields improved intensity statistics compared with other integration methods, including k-nearest neighbours. These results demonstrate, for the first time, that neural networks can learn peak shapes and be used to integrate Bragg peaks. It is expected that integration using neural networks can be further developed to increase the quality of neutron, electron and X-ray crystallography data.
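
For reference, the Dice coefficient quoted above (around 65% after training, versus about 15% for negative controls) is a standard overlap measure between a predicted and a reference segmentation mask. A minimal Python sketch follows; the mask shapes and the small epsilon guard are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def dice_coefficient(pred_mask, true_mask, eps=1e-8):
    """Dice coefficient between a predicted and a reference peak mask:
    2*|intersection| / (|pred| + |true|)."""
    pred = pred_mask.astype(bool)
    true = true_mask.astype(bool)
    intersection = np.logical_and(pred, true).sum()
    return (2.0 * intersection + eps) / (pred.sum() + true.sum() + eps)

# Illustrative use on small overlapping 2-D masks
a = np.zeros((8, 8), dtype=bool); a[2:5, 2:5] = True
b = np.zeros((8, 8), dtype=bool); b[3:6, 3:6] = True
print(dice_coefficient(a, b))  # partial overlap -> value between 0 and 1
```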


2020 ◽  
Author(s):  
Marika Kaden ◽  
Katrin Sophie Bohnsack ◽  
Mirko Weber ◽  
Mateusz Kudła ◽  
Kaja Gutowska ◽  
...  

Abstract. We present an approach to investigate SARS-CoV-2 virus sequences based on alignment-free methods for RNA sequence comparison. In particular, we verify a given clustering result for the GISAID data set, which was obtained by analyzing the molecular differences in coronavirus populations by means of phylogenetic trees. For this purpose, we use alignment-free dissimilarity measures for sequences and combine them with learning vector quantization classifiers for virus type discriminant analysis and classification. Those vector quantizers belong to the class of interpretable machine learning methods, which, on the one hand, provide additional knowledge about the classification decisions, such as discriminant feature correlations, and, on the other hand, can be equipped with a reject option. This option gives the model the property of self-controlled evidence when applied to new data, i.e. the model refuses to make a classification decision if the model evidence for the presented data is not given. After training such a classifier on the GISAID data set, we apply the obtained classifier model to another, unlabeled SARS-CoV-2 virus data set. On the one hand, this allows us to assign new sequences to already known virus types; on the other hand, the rejected sequences allow speculation about new virus types with respect to nucleotide base mutations in the viral sequences.

Author summary: The currently emerging global disease COVID-19, caused by the novel SARS-CoV-2 virus, requires all scientific effort to investigate the development of the viral epidemic, the properties of the virus, and its types. Investigations of the virus sequence are of special interest. Frequently, those are based on mathematical/statistical analysis. However, machine learning methods represent a promising alternative if one focuses on interpretable models, i.e. those that do not act as black boxes. Accordingly, we apply variants of Learning Vector Quantizers to analyze the SARS-CoV-2 sequences. We encoded the sequences and compared them in their numerical representations to avoid the computationally costly comparison based on sequence alignments. Our resulting model is interpretable, robust, efficient, and has a self-controlling mechanism regarding its applicability to data. This framework was applied to two data sets concerning SARS-CoV-2. We were able to verify previously published virus type findings for one of the data sets by training our model to accurately identify the virus type of sequences. For sequences without virus type information (the second data set), our trained model can predict the type. Thereby, we observe a new scattered spreading of the sequences in the data space, which is probably caused by mutations in the viral sequences.
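
To make the alignment-free pipeline concrete, here is a hedged Python sketch: sequences are encoded as normalized k-mer frequency vectors and then compared with learned prototypes, with a reject option when even the nearest prototype is too dissimilar. The k-mer encoding, the Euclidean dissimilarity, the threshold, and the nearest-prototype rule are simplifying assumptions; the paper uses learning vector quantization variants, which additionally learn the prototypes themselves.

```python
import numpy as np
from itertools import product

def kmer_profile(sequence, k=3):
    """Alignment-free encoding: normalized k-mer frequency vector."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    counts = np.zeros(len(kmers))
    for i in range(len(sequence) - k + 1):
        j = index.get(sequence[i:i + k])  # skips k-mers with ambiguous bases
        if j is not None:
            counts[j] += 1
    return counts / max(counts.sum(), 1.0)

def classify_with_reject(profile, prototypes, labels, threshold):
    """Assign the label of the nearest prototype, or reject (return None)
    when even the nearest prototype is too dissimilar."""
    distances = np.linalg.norm(prototypes - profile, axis=1)
    best = int(np.argmin(distances))
    return labels[best] if distances[best] <= threshold else None

# Illustrative use with two hand-made prototypes
protos = np.stack([kmer_profile("ACGTACGTACGT"), kmer_profile("TTGGTTGGTTGG")])
print(classify_with_reject(kmer_profile("ACGTACGAACGT"), protos,
                           ["type A", "type B"], threshold=0.2))
```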


2017 ◽  
Vol 5 (4) ◽  
pp. 1
Author(s):  
I. E. Okorie ◽  
A. C. Akpanta ◽  
J. Ohakwe ◽  
D. C. Chikezie ◽  
C. U. Onyemachi ◽  
...  

This paper introduces a new generator of probability distributions, the adjusted log-logistic generalized (ALLoG) distribution, and a new extension of the standard one-parameter exponential distribution called the adjusted log-logistic generalized exponential (ALLoGExp) distribution. The ALLoGExp distribution is a special case of the ALLoG distribution, and we provide some of its statistical and reliability properties. Notably, the failure rate can be monotonically decreasing, increasing or upside-down bathtub shaped depending on the values of the parameters $\delta$ and $\theta$. The method of maximum likelihood estimation was proposed to estimate the model parameters. The importance and flexibility of the ALLoGExp distribution was demonstrated with a real and uncensored lifetime data set, and its fit was compared with that of five other exponential-related distributions. The results obtained from the model fittings show that the ALLoGExp distribution provides a better fit than the other fitted distributions. The ALLoGExp distribution is therefore recommended for effective modelling of lifetime data sets.
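
The density of the ALLoGExp distribution is not reproduced here, so the sketch below does not implement it; instead it illustrates, with assumed scipy.stats stand-ins, the kind of maximum-likelihood fitting and goodness-of-fit comparison described above for exponential-related lifetime models. The candidate set, the fixed location parameter, the AIC criterion, and the simulated lifetimes are assumptions.

```python
import numpy as np
from scipy import stats

def compare_lifetime_fits(data):
    """Fit candidate lifetime distributions by maximum likelihood and
    rank them by AIC (lower is better)."""
    candidates = {"exponential": stats.expon,
                  "weibull": stats.weibull_min,
                  "gamma": stats.gamma,
                  "log-logistic": stats.fisk}  # scipy's fisk is log-logistic
    results = {}
    for name, dist in candidates.items():
        params = dist.fit(data, floc=0)              # location fixed at 0
        loglik = np.sum(dist.logpdf(data, *params))
        n_free = len(params) - 1                     # loc was not estimated
        results[name] = 2 * n_free - 2 * loglik
    return dict(sorted(results.items(), key=lambda kv: kv[1]))

# Illustrative use on simulated lifetimes
rng = np.random.default_rng(1)
print(compare_lifetime_fits(rng.weibull(1.5, size=200)))
```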


2021 ◽  
Author(s):  
Ruben van de Vijver ◽  
Emmanuel Uwambayinema ◽  
Yu-Ying Chuang

How do speakers comprehend and produce complex words? In the theory of the Discriminative Lexicon this is hypothesized to be the result of mapping the phonology of whole word forms onto their semantics and vice versa, without recourse to morphemes. This raises the question whether the hypothesis also holds true in highly agglutinative languages, which are often seen to exemplify the compositional nature of morphology. On the one hand, one could expect the hypothesis to hold for agglutinative languages as well, since it remains unclear whether speakers are able to isolate the morphemes they would need for a compositional route. On the other hand, agglutinative languages have so many different words that it is not obvious how speakers can use their knowledge of words to comprehend and produce them.

In this paper, we investigate comprehension and production of verbs in Kinyarwanda, an agglutinative Bantu language, by means of computational modeling within the Discriminative Lexicon, a theory of the mental lexicon which is grounded in word and paradigm morphology, distributional semantics and error-driven learning, draws on insights from psycholinguistic theories, and is implemented mathematically and computationally as a shallow, two-layered network.

To do this, we compiled a data set of 11,528 verb forms and annotated each verb form for its meaning and grammatical functions; additionally, we extracted from the full data set 573 verbs for which the meanings are based on word embeddings. To assess comprehension and production of Kinyarwanda verbs, we fed both data sets into the Linear Discriminative Learning algorithm, a two-layered, fully connected network. One layer represents the phonological form and the other represents meaning. Comprehension is modeled as a mapping from phonology to meaning, and production as a mapping from meaning to phonology. Both comprehension and production are learned with high accuracy on all data and on held-out data, both for the full data set, with manually annotated semantic features, and for the data set with meanings derived from word embeddings.

Our findings provide support for the various hypotheses of the Discriminative Lexicon: words are stored as wholes, meanings are a result of the distribution of words in utterances, comprehension and production can be successfully modeled as mappings from form to meaning and vice versa, these mappings can be implemented in a shallow two-layered network, and they are learned by minimizing errors.
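
The mappings of the shallow two-layered network described above can be sketched as linear transformations estimated by least squares, which is how Linear Discriminative Learning is commonly set up. The cue and semantic matrices below are tiny random stand-ins, and the function name is hypothetical; the authors' actual feature coding (phonological cues, annotated or embedding-based semantics) is not reproduced here.

```python
import numpy as np

def ldl_mappings(C, S):
    """Estimate the comprehension mapping F (form -> meaning) and the
    production mapping G (meaning -> form) of a shallow two-layered
    linear network by least squares.

    C: n_words x n_form_cues matrix (e.g., indicators of sound triplets)
    S: n_words x n_semantic_dims matrix (e.g., word embeddings)
    """
    F, *_ = np.linalg.lstsq(C, S, rcond=None)  # comprehension: C @ F ~ S
    G, *_ = np.linalg.lstsq(S, C, rcond=None)  # production:    S @ G ~ C
    return F, G

# Toy example: 4 word forms, 5 form cues, 3 semantic dimensions
rng = np.random.default_rng(0)
C = rng.integers(0, 2, size=(4, 5)).astype(float)
S = rng.normal(size=(4, 3))
F, G = ldl_mappings(C, S)
print(np.linalg.norm(C @ F - S))  # comprehension reconstruction error
print(np.linalg.norm(S @ G - C))  # production reconstruction error
```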

