Replication of Data Analyses

Author(s):  
Emery R. Boose ◽  
Barbara S. Lerner

The metadata that describe how scientific data are created and analyzed are typically limited to a general description of data sources, software used, and statistical tests applied, presented in narrative form in the methods section of a scientific paper or in a data set description. Recognizing that such narratives are usually inadequate to support reproduction of the original analysis, a growing number of journals now require that authors also publish their data. However, finer-scale metadata that describe exactly how individual items of data were created and transformed, and the processes by which this was done, are rarely provided, even though such metadata have great potential to improve data set reliability. This chapter focuses on the detailed process metadata, called “data provenance,” required to ensure reproducibility of analyses and reliable re-use of the data.
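The chapter's notion of fine-grained process metadata can be pictured with a minimal sketch: every transformation logs its inputs, the operation applied, and a fingerprint of its output. The ProvenanceLog class and its method names below are hypothetical illustrations, not the authors' tooling.

```python
# Minimal sketch of fine-grained data provenance capture (hypothetical API,
# not the authors' tool): each transformation records its inputs, the
# operation applied, and a fingerprint of the result.
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(obj):
    """Stable hash of a JSON-serializable data object."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()


class ProvenanceLog:
    def __init__(self):
        self.records = []

    def record(self, operation, inputs, output):
        self.records.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "operation": operation,
            "input_hashes": [fingerprint(x) for x in inputs],
            "output_hash": fingerprint(output),
        })
        return output


# Example: a two-step analysis whose every intermediate result is logged.
log = ProvenanceLog()
raw = [1.0, 2.0, None, 4.0]
cleaned = log.record("drop_missing", [raw], [x for x in raw if x is not None])
mean = log.record("mean", [cleaned], sum(cleaned) / len(cleaned))
print(json.dumps(log.records, indent=2))
```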

2021 ◽  
pp. 1-11
Author(s):  
Yanan Huang ◽  
Yuji Miao ◽  
Zhenjing Da

Methods for multi-modal English event detection from a single data source, and for transfer-learning-based isomorphic event detection across different English data sources, still need improvement. To improve the efficiency of English event detection across data sources, this paper proposes, based on a transfer-learning algorithm, multi-modal event detection for a single data source and isomorphic event detection for different data sources. By stacking multiple classification models, the approach fuses the individual features with one another and applies adversarial training driven by the disagreement between two classifiers to make the distributions of the different source data more similar. In addition, to validate the proposed algorithm, a multi-source English event detection data set was collected. Finally, this data set is used to evaluate the proposed method and to compare it with current mainstream transfer-learning methods. Experimental analysis, convergence analysis, visual analysis and parameter evaluation demonstrate the effectiveness of the proposed algorithm.
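The adversarial training on the disagreement between two classifiers resembles maximum-classifier-discrepancy-style domain adaptation; the sketch below illustrates that general scheme under that assumption, with placeholder dimensions and random data rather than the paper's model or features.

```python
# Generic sketch of adversarial training via the discrepancy between two
# classifiers (in the spirit of maximum-classifier-discrepancy domain
# adaptation). Dimensions and data are dummy placeholders, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F

feature = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))
clf1 = nn.Linear(32, 4)   # 4 hypothetical event classes
clf2 = nn.Linear(32, 4)

opt_f = torch.optim.Adam(feature.parameters(), lr=1e-3)
opt_c = torch.optim.Adam(list(clf1.parameters()) + list(clf2.parameters()), lr=1e-3)

def discrepancy(p1, p2):
    # Disagreement between the two classifiers' predicted distributions.
    return (F.softmax(p1, dim=1) - F.softmax(p2, dim=1)).abs().mean()

x_src = torch.randn(16, 128); y_src = torch.randint(0, 4, (16,))  # labelled source batch
x_tgt = torch.randn(16, 128)                                      # unlabelled target batch

for step in range(100):
    # 1) Train the feature extractor and both classifiers on labelled source data.
    f_src = feature(x_src)
    loss_cls = F.cross_entropy(clf1(f_src), y_src) + F.cross_entropy(clf2(f_src), y_src)
    opt_f.zero_grad(); opt_c.zero_grad(); loss_cls.backward()
    opt_f.step(); opt_c.step()

    # 2) Maximise classifier disagreement on target data (classifiers only).
    f_src = feature(x_src).detach()
    f_tgt = feature(x_tgt).detach()
    loss_c = (F.cross_entropy(clf1(f_src), y_src) + F.cross_entropy(clf2(f_src), y_src)
              - discrepancy(clf1(f_tgt), clf2(f_tgt)))
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()

    # 3) Minimise the disagreement w.r.t. the feature extractor, pulling the
    #    target feature distribution towards the source distribution.
    f_tgt = feature(x_tgt)
    loss_f = discrepancy(clf1(f_tgt), clf2(f_tgt))
    opt_f.zero_grad(); loss_f.backward(); opt_f.step()
```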


2014 ◽  
Vol 112 (11) ◽  
pp. 2729-2744 ◽  
Author(s):  
Carlo J. De Luca ◽  
Joshua C. Kline

Over the past four decades, various methods have been implemented to measure synchronization of motor-unit firings. In this work, we provide evidence that prior reports of the existence of universal common inputs to all motoneurons and of the presence of long-term synchronization are misleading, because they did not use sufficiently rigorous statistical tests to detect synchronization. We developed a statistically based method (SigMax) for computing synchronization and tested it with data from 17,736 motor-unit pairs containing 1,035,225 firing instances from the first dorsal interosseous and vastus lateralis muscles, a data set one order of magnitude greater than those reported in previous studies. Only firing data obtained from surface electromyographic signal decomposition with >95% accuracy were used in the study. The data were not subjectively selected in any manner. Because of the size of our data set and the statistical rigor inherent to SigMax, we are confident that the synchronization values we calculated provide an improved estimate of physiologically driven synchronization. Compared with three other commonly used techniques, ours revealed three types of discrepancies that result from failing to apply statistical tests rigorous enough to detect synchronization. 1) On average, the z-score method falsely detected synchronization at 16 separate latencies in each motor-unit pair. 2) The cumulative sum method missed one out of every four synchronization identifications found by SigMax. 3) The common input assumption method identified synchronization in 100% of the motor-unit pairs studied, whereas SigMax revealed that only 50% of motor-unit pairs actually manifested synchronization.
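The abstract does not spell out the SigMax procedure, so the sketch below only illustrates the general idea of testing a cross-correlation histogram of firing times against a chance-level threshold; the simulated firing trains, latency window, and Poisson null are illustrative assumptions, not the published method.

```python
# Illustrative sketch (not the SigMax algorithm itself): test whether the
# cross-correlation histogram of two motor-unit firing trains has bins that
# exceed what independent (chance) firing would produce.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
duration = 60.0                                   # seconds of simulated firing
t1 = np.sort(rng.uniform(0, duration, 600))       # firing times, unit 1
t2 = np.sort(rng.uniform(0, duration, 600))       # firing times, unit 2

# Latencies between every firing of unit 1 and nearby firings of unit 2.
lags = (t2[None, :] - t1[:, None]).ravel()
window, bin_width = 0.05, 0.001                   # +/-50 ms window, 1 ms bins
lags = lags[np.abs(lags) <= window]
counts, _ = np.histogram(lags, bins=np.arange(-window, window + bin_width, bin_width))

# Under the null hypothesis of independent firing, bin counts are roughly
# Poisson around the mean count; flag bins exceeding a Bonferroni-corrected
# significance threshold.
expected = counts.mean()
alpha = 0.05 / counts.size
threshold = stats.poisson.ppf(1 - alpha, expected)
sync_bins = np.nonzero(counts > threshold)[0]
print("bins exceeding chance level:", sync_bins)
```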


2005 ◽  
Vol 65 (1) ◽  
pp. 129-139 ◽  
Author(s):  
M. A. H Penna ◽  
M. A Villacorta-Corrêa ◽  
T. Walter ◽  
M. Petrere-JR

In order to decide which is the best growth model for the tambaqui Colossoma macropomum Cuvier, 1818, we used 249 and 256 length-at-age ring readings in otoliths and scales, respectively, for the same sample of individuals. The Schnute model was applied, and we conclude that the von Bertalanffy model is the most adequate for these data, because it proved highly stable for the data set and only slightly sensitive to the initial values of the estimated parameters. The phi' values estimated from five different data sources presented a CV of 4.78%. The numerical discrepancies between these values are of little concern because of the high negative correlation between k and L∞: when one of them increases the other decreases, so the final value of phi' remains nearly unchanged.
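A minimal sketch of the underlying calculation, assuming the standard von Bertalanffy growth function and the usual growth performance index phi' = log10(k) + 2 log10(L∞); the length-at-age values are synthetic, not the paper's otolith or scale readings.

```python
# Sketch: fit the von Bertalanffy growth function L(t) = Linf*(1 - exp(-k*(t - t0)))
# and compute the growth performance index phi' = log10(k) + 2*log10(Linf).
# The length-at-age values below are synthetic placeholders.
import numpy as np
from scipy.optimize import curve_fit

def von_bertalanffy(t, Linf, k, t0):
    return Linf * (1.0 - np.exp(-k * (t - t0)))

age = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)             # years
length = np.array([32, 48, 59, 68, 74, 79, 82, 85], dtype=float)  # cm

params, _ = curve_fit(von_bertalanffy, age, length, p0=[100.0, 0.3, 0.0])
Linf, k, t0 = params
phi_prime = np.log10(k) + 2.0 * np.log10(Linf)
print(f"Linf={Linf:.1f} cm, k={k:.3f}/yr, t0={t0:.2f} yr, phi'={phi_prime:.2f}")
```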


2016 ◽  
Vol 16 (24) ◽  
pp. 15545-15559 ◽  
Author(s):  
Ernesto Reyes-Villegas ◽  
David C. Green ◽  
Max Priestman ◽  
Francesco Canonaco ◽  
Hugh Coe ◽  
...  

Abstract. The multilinear engine (ME-2) factorization tool is being widely used following the recent development of the Source Finder (SoFi) interface at the Paul Scherrer Institute. However, the success of this tool, when using the a-value approach, depends largely on the inputs (i.e. target profiles) applied as well as the experience of the user. A strategy to explore the solution space is proposed, in which the solution that best describes the organic aerosol (OA) sources is determined according to the systematic application of predefined statistical tests. This includes trilinear regression, which proves to be a useful tool for comparing different ME-2 solutions. Aerosol Chemical Speciation Monitor (ACSM) measurements were carried out at the urban background site of North Kensington, London, from March to December 2013, where for the first time the behaviour of OA sources and their possible environmental implications were studied using an ACSM. Five OA sources were identified: biomass burning OA (BBOA), hydrocarbon-like OA (HOA), cooking OA (COA), semivolatile oxygenated OA (SVOOA) and low-volatility oxygenated OA (LVOOA). ME-2 analysis of the seasonal data sets (spring, summer and autumn) showed a higher variability in the OA sources that was not detected in the combined March–December data set; this variability was explored with the f44 : f43 and f44 : f60 triangle plots, in which a high variation of SVOOA relative to LVOOA was observed in the f44 : f43 analysis. Hence, it was possible to conclude that, when source apportionment is performed on long-term measurements, important information may be lost, and the analysis should instead be applied to shorter periods, such as individual seasons. Further analysis of the atmospheric implications of these OA sources identified evidence of a possible contribution of heavy-duty diesel vehicles to air pollution on weekdays relative to petrol-fuelled vehicles.
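For readers unfamiliar with the triangle plots, the sketch below shows how f44 and f43 can be computed as the fractions of a factor's organic signal at m/z 44 and m/z 43 and plotted against each other; the factor profiles are random placeholders, not the ME-2 solutions from the study.

```python
# Sketch of an f44 : f43 comparison: each OA factor is placed according to the
# fraction of its organic signal at m/z 44 and m/z 43. The factor profiles
# below are made-up placeholders, not the measured ME-2 solutions.
import numpy as np
import matplotlib.pyplot as plt

mz = np.arange(12, 121)                       # m/z axis of the organic spectrum
factors = {name: np.abs(np.random.default_rng(i).normal(size=mz.size))
           for i, name in enumerate(["BBOA", "HOA", "COA", "SVOOA", "LVOOA"])}

def f_mz(profile, target):
    # Fraction of the total organic signal found at a given m/z.
    return profile[mz == target][0] / profile.sum()

for name, profile in factors.items():
    f44, f43 = f_mz(profile, 44), f_mz(profile, 43)
    plt.scatter(f43, f44, label=name)

plt.xlabel("f43 (fraction of organic signal at m/z 43)")
plt.ylabel("f44 (fraction of organic signal at m/z 44)")
plt.legend()
plt.savefig("f44_vs_f43.png")
```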


2018 ◽  
Vol 1 (1) ◽  
pp. 3-6 ◽  
Author(s):  
Reinhard Heun

Abstract. Many books and other published recommendations provide a large, sometimes excessive amount of information to be included in, and of mistakes to be avoided in, research papers for academic journals. However, there is a lack of simple and clear recommendations on how to write such scientific articles. To make life easier for new authors, we propose a simple hypothesis-based approach, which consistently follows the study hypothesis, section by section, throughout the manuscript: The introduction section should develop the study hypothesis by introducing and explaining the relevant concepts, connecting these concepts and stating the study hypotheses to be tested at the end. The material and methods section must describe the sample or material, the tools, instruments, procedures and analyses used to test the study hypothesis. The results section must describe the study sample, the data collected and the data analyses that lead to the confirmation or rejection of the hypothesis. The discussion must state whether the study hypothesis has been confirmed or rejected and whether the study result is comparable to, and compatible with, other research. It should evaluate the reliability and validity of the study outcome, clarify the limitations of the study and explore the relevance of the supported or rejected hypothesis for clinical practice and future research. If needed, an abstract at the beginning of the manuscript, usually structured into objectives, material and methods, results and conclusions, should provide summaries of two to three sentences for each section. Acknowledgements, declarations of ethical approval, of informed consent by study subjects, and of interests by authors, as well as a reference list, will be needed in most scientific journals.


2020 ◽  
Author(s):  
Alexander E. Zarebski ◽  
Louis du Plessis ◽  
Kris V. Parag ◽  
Oliver G. Pybus

Inferring the dynamics of pathogen transmission during an outbreak is an important problem in both infectious disease epidemiology and phylodynamics. In mathematical epidemiology, estimates are often informed by time series of infected cases, while in phylodynamics genetic sequences sampled through time are the primary data source. Each data type provides different, and potentially complementary, insights into transmission. However, inference methods are typically highly specialised and field-specific. Recent studies have recognised the benefits of combining data sources, which include improved estimates of the transmission rate and of the number of infected individuals. However, the methods they employ are either computationally prohibitive or require intensive simulation, limiting their real-time utility. We present a novel birth-death phylogenetic model, called TimTam, which can be informed by both phylogenetic and epidemiological data. Moreover, we derive a tractable analytic approximation of the TimTam likelihood, whose computational complexity is linear in the size of the data set. Using TimTam, we show how key parameters of transmission dynamics and the number of unreported infections can be estimated accurately from these heterogeneous data sources. The approximate likelihood facilitates inference on large data sets, an important consideration as such data become increasingly common due to improving sequencing capability.


2014 ◽  
Vol 14 (13) ◽  
pp. 19747-19789
Author(s):  
F. Tan ◽  
H. S. Lim ◽  
K. Abdullah ◽  
T. L. Yoon ◽  
B. Holben

Abstract. In this study, the optical properties of aerosols in Penang, Malaysia were analyzed for four monsoonal seasons (northeast monsoon, pre-monsoon, southwest monsoon, and post-monsoon) based on data from the AErosol RObotic NETwork (AERONET) from February 2012 to November 2013. The aerosol distribution patterns in Penang for each monsoonal period were quantitatively identified from scatter plots of the aerosol optical depth (AOD) against the Angstrom exponent. A modified algorithm based on the prototype model of Tan et al. (2014a) was proposed to predict the AOD data. Ground-based measurements (i.e., visibility and air pollutant index) were used in the model as predictors to retrieve AOD data missing from AERONET because of frequent cloud formation in the equatorial region. The model coefficients were determined through multiple regression analysis using a selected subset of the in situ data. Predicted AOD values were then generated from these coefficients and compared against the measured data using standard statistical tests. The proposed model yielded a coefficient of determination (R2) of 0.68, and the corresponding mean relative error was less than 0.33% compared with the real data. The results revealed that the proposed model efficiently predicts the AOD data. Validation tests performed against selected LIDAR data showed good correspondence. The predicted AOD can be used to monitor short- and long-term AOD and to provide supplementary information for atmospheric corrections.
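A minimal sketch of the regression step described above, assuming a simple least-squares fit of AOD on transformed visibility and the air pollutant index; the predictor transformation, coefficients and data are synthetic illustrations, not the calibrated model of the study.

```python
# Sketch of the multiple-regression step: estimate coefficients that predict
# AOD from ground-based visibility and air pollutant index (API), then use
# them to fill AOD gaps. All numbers are synthetic, not the calibrated model.
import numpy as np

rng = np.random.default_rng(1)
visibility = rng.uniform(5, 20, 200)                              # km
api = rng.uniform(20, 120, 200)                                   # air pollutant index
aod = 1.5 / visibility + 0.004 * api + rng.normal(0, 0.05, 200)   # synthetic "truth"

# Design matrix: intercept, 1/visibility, API (an assumed predictor form).
X = np.column_stack([np.ones_like(aod), 1.0 / visibility, api])
coeffs, *_ = np.linalg.lstsq(X, aod, rcond=None)

predicted = X @ coeffs
r2 = 1 - np.sum((aod - predicted) ** 2) / np.sum((aod - aod.mean()) ** 2)
print("coefficients:", coeffs, "R^2:", round(r2, 2))
```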


2020 ◽  
Vol 14 (4) ◽  
pp. 485-497
Author(s):  
Nan Zheng ◽  
Zachary G. Ives

Data provenance tools aim to facilitate reproducible data science and auditable data analyses by tracking the processes and inputs responsible for each result of an analysis. Fine-grained provenance further enables sophisticated reasoning about why individual output results appear or fail to appear. However, for reproducibility and auditing, we need a provenance archival system that is tamper-resistant and that efficiently stores provenance for computations performed over time (i.e., it compresses repeated results). We study this problem, developing solutions for storing fine-grained provenance in relational storage systems while both compressing it and protecting it via cryptographic hashes. We experimentally validate our proposed solutions using both scientific and OLAP workloads.
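A minimal sketch of how the two requirements can be combined in principle: content-addressed storage deduplicates repeated provenance records, and chaining their hashes makes later tampering detectable. The ProvenanceStore class is a generic illustration, not the paper's relational-storage design.

```python
# Minimal sketch: store each provenance record under the hash of its content
# (so repeated results across runs deduplicate) and fold the hashes into a
# running digest so later tampering is detectable. Generic illustration only.
import hashlib
import json


class ProvenanceStore:
    def __init__(self):
        self.records = {}      # content hash -> record (deduplicated)
        self.chain = []        # ordered content hashes
        self.head = hashlib.sha256(b"genesis").hexdigest()

    def add(self, record):
        payload = json.dumps(record, sort_keys=True).encode()
        h = hashlib.sha256(payload).hexdigest()
        if h not in self.records:          # compression: identical records stored once
            self.records[h] = record
        self.head = hashlib.sha256((self.head + h).encode()).hexdigest()
        self.chain.append(h)
        return h

    def verify(self, published_head):
        """Recompute the chain and compare against a previously published head digest."""
        digest = hashlib.sha256(b"genesis").hexdigest()
        for h in self.chain:
            digest = hashlib.sha256((digest + h).encode()).hexdigest()
        return digest == published_head


store = ProvenanceStore()
store.add({"op": "filter", "input": "t1", "output_rows": 42})
store.add({"op": "filter", "input": "t1", "output_rows": 42})   # deduplicated
print(len(store.records), store.verify(store.head))              # -> 1 True
```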


foresight ◽  
2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Christian Hugo Hoffmann

Purpose: The purpose of this paper is to offer a panoramic view of the credibility issues that exist within social sciences research.
Design/methodology/approach: The central argument of this paper is that a joint effort between blockchain and other technologies, such as artificial intelligence (AI) and deep learning, can prevent scientific data manipulation or data forgery and thereby make science more decentralized and anti-fragile, without losing data integrity or reputation as a trade-off. The authors address this by proposing an online research platform for use in social and behavioral science that guarantees data integrity through a combination of modern institutional economics and blockchain technology.
Findings: The benefits are mainly twofold. On the one hand, social science scholars get paired with the right target audience for their studies. On the other hand, a snapshot of the gathered data is taken at the time of creation, so that researchers can later prove to peers that they used the original data set while maintaining full control of their data.
Originality/value: The proposed combination of behavioral economics with new technologies such as blockchain and AI is novel and is translated into a cutting-edge tool to be implemented.
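A minimal illustration of the snapshot idea described in the findings: committing to a cryptographic digest of the data set at creation time lets researchers later prove the original was unchanged. Anchoring the digest on a blockchain or timestamping service is assumed to happen outside this sketch, and the data are placeholders.

```python
# Minimal illustration of the "snapshot at creation time" idea: compute a
# cryptographic digest of the gathered data set so researchers can later prove
# the data were not altered. Publishing the digest (e.g. on a blockchain) is
# assumed to happen outside this sketch.
import hashlib
import json

dataset = [{"participant": 1, "response": 4}, {"participant": 2, "response": 5}]
snapshot = hashlib.sha256(json.dumps(dataset, sort_keys=True).encode()).hexdigest()
print("commit this digest at creation time:", snapshot)

# Later, anyone holding the claimed original data can recompute and compare:
assert hashlib.sha256(json.dumps(dataset, sort_keys=True).encode()).hexdigest() == snapshot
```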

