Investigating bias in the application of curve fitting programs to atmospheric time series

2014 ◽  
Vol 7 (7) ◽  
pp. 7085-7136 ◽  
Author(s):  
P. A. Pickers ◽  
A. C. Manning

Abstract. The decomposition of an atmospheric time series into its constituent parts is an essential tool for identifying and isolating variations of interest from a data set, and is widely used to obtain information about sources, sinks and trends in climatically important gases. Such procedures involve fitting appropriate mathematical functions to the data. However, it has been demonstrated that the application of such curve fitting procedures can introduce bias, and thus influence the scientific interpretation of the data sets. We investigate the potential for bias associated with the application of three curve fitting programs, known as HPspline, CCGCRV and STL, using CO2, CH4 and O3 data from three atmospheric monitoring field stations. These three curve fitting programs are widely used within the greenhouse gas measurement community to analyse atmospheric time series, but have not previously been compared extensively. The programs were rigorously tested for their ability to accurately represent the salient features of atmospheric time series, their ability to cope with outliers and gaps in the data, and for sensitivity to the values used for the input parameters needed for each program. We find that the programs can produce significantly different curve fits, and these curve fits can be dependent on the input parameters selected. There are notable differences between the results produced by the three programs for many of the decomposed components of the time series, such as the representation of seasonal cycle characteristics and the long-term growth rate. The programs also vary significantly in their response to gaps and outliers in the time series. Overall, we found that none of the three programs were superior, and that each program had its strengths and weaknesses. Thus, we provide a list of recommendations on the appropriate use of these three curve fitting programs for certain types of data sets, and for certain types of analyses and applications. 
In addition, we recommend that sensitivity tests are performed in any study using curve fitting programs, to ensure that results are not unduly influenced by the input smoothing parameters chosen. Our findings also have implications for previous studies that have relied on a single curve fitting program to interpret atmospheric time series measurements. This is demonstrated by using two other curve fitting programs to replicate work in Piao et al. (2008) on zero-crossing analyses of atmospheric CO2 seasonal cycles to investigate terrestrial biosphere changes. We highlight the importance of using more than one program, to ensure results are consistent, reproducible, and free from bias.

2015 ◽  
Vol 8 (3) ◽  
pp. 1469-1489 ◽  
Author(s):  
P. A. Pickers ◽  
A. C. Manning

Abstract. The decomposition of an atmospheric time series into its constituent parts is an essential tool for identifying and isolating variations of interest from a data set, and is widely used to obtain information about sources, sinks and trends in climatically important gases. Such procedures involve fitting appropriate mathematical functions to the data. However, it has been demonstrated that the application of such curve fitting procedures can introduce bias, and thus influence the scientific interpretation of the data sets. We investigate the potential for bias associated with the application of three curve fitting programs, known as HPspline, CCGCRV and STL, using multi-year records of CO2, CH4 and O3 data from three atmospheric monitoring field stations. These three curve fitting programs are widely used within the greenhouse gas measurement community to analyse atmospheric time series, but have not previously been compared extensively. The programs were rigorously tested for their ability to accurately represent the salient features of atmospheric time series, their ability to cope with outliers and gaps in the data, and for sensitivity to the values used for the input parameters needed for each program. We find that the programs can produce significantly different curve fits, and these curve fits can be dependent on the input parameters selected. There are notable differences between the results produced by the three programs for many of the decomposed components of the time series, such as the representation of seasonal cycle characteristics and the long-term (multi-year) growth rate. The programs also vary significantly in their response to gaps and outliers in the time series. Overall, we found that none of the three programs were superior, and that each program had its strengths and weaknesses. 
Thus, we provide a list of recommendations on the appropriate use of these three curve fitting programs for certain types of data sets, and for certain types of analyses and applications. In addition, we recommend that sensitivity tests are performed in any study using curve fitting programs, to ensure that results are not unduly influenced by the input smoothing parameters chosen. Our findings also have implications for previous studies that have relied on a single curve fitting program to interpret atmospheric time series measurements. This is demonstrated by using two other curve fitting programs to replicate work in Piao et al. (2008) on zero-crossing analyses of atmospheric CO2 seasonal cycles to investigate terrestrial biosphere changes. We highlight the importance of using more than one program, to ensure results are consistent, reproducible, and free from bias.
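None of the three programs' algorithms are given in the abstract; as a generic illustration of what "decomposing an atmospheric time series into its constituent parts" means, the sketch below splits a synthetic monthly CO2-like record into a linear trend, a mean seasonal cycle and a residual. All values are invented, and HPspline, CCGCRV and STL each do this in more sophisticated ways.

```python
import math

def decompose(values, period=12):
    """Split a series into a linear trend, a mean seasonal cycle and a residual."""
    n = len(values)
    t = list(range(n))
    mt, mv = sum(t) / n, sum(values) / n
    # Least-squares linear trend.
    slope = (sum((ti - mt) * (vi - mv) for ti, vi in zip(t, values))
             / sum((ti - mt) ** 2 for ti in t))
    trend = [mv + slope * (ti - mt) for ti in t]
    detrended = [vi - tr for vi, tr in zip(values, trend)]
    # Seasonal component: mean of the detrended values for each calendar month.
    means = [sum(detrended[i::period]) / len(detrended[i::period])
             for i in range(period)]
    seasonal = [means[i % period] for i in range(n)]
    residual = [d - s for d, s in zip(detrended, seasonal)]
    return trend, seasonal, residual

# Synthetic CO2-like record (ppm): rising trend plus a 12-month cycle.
series = [400 + 0.2 * i + 3 * math.sin(2 * math.pi * i / 12) for i in range(48)]
trend, seasonal, residual = decompose(series)
growth_per_month = (trend[-1] - trend[0]) / (len(series) - 1)
```

The input choices the abstract warns about correspond here to, e.g., the choice of trend model: swapping the linear trend for a spline would change the diagnosed growth rate, which is exactly the sensitivity the authors recommend testing.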


1998 ◽  
Vol 185 ◽  
pp. 167-168
Author(s):  
T. Appourchaux ◽  
M.C. Rabello-Soares ◽  
L. Gizon

Two different data sets have been used to derive low-degree rotational splittings. One data set comes from the Luminosity Oscillations Imager of VIRGO on board SOHO; the observations start on 27 March 1996 and end on 26 March 1997, and consist of intensity time series of 12 pixels (Appourchaux et al., 1997, Sol. Phys., 170, 27). The other data set was kindly made available by the GONG project; the observations start on 26 August 1995 and end on 21 August 1996, and consist of complex Fourier spectra of velocity time series for l = 0 − 9. For the GONG data, the contamination of l = 1 from the spatial aliases of l = 6 and l = 9 required some cleaning. To achieve this, we applied the inverse of the leakage matrix of l = 1, 6 and 9 to the original Fourier spectra of the same degrees; cleaning of all 3 degrees was achieved simultaneously (Appourchaux and Gizon, 1997, these proceedings).
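The cleaning step can be sketched in a few lines: if a leakage matrix C mixes the true Fourier spectra of l = 1, 6 and 9 into the observed ones, applying the inverse of C (here via a small Gauss-Jordan solve, which works on complex spectra) recovers all three degrees simultaneously. The matrix entries and spectra below are invented for illustration; real leakage matrices come from the instrument's spatial response.

```python
def gauss_solve(C, b):
    """Solve C x = b by Gauss-Jordan elimination with partial pivoting.
    Works for complex-valued right-hand sides (Fourier spectra)."""
    n = len(b)
    A = [list(C[i]) + [b[i]] for i in range(n)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(n):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * p for a, p in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]

# Invented leakage matrix: l = 1 picks up 20% of l = 6 and 10% of l = 9.
C = [[1.00, 0.20, 0.10],
     [0.05, 1.00, 0.00],
     [0.02, 0.00, 1.00]]

# True spectra at one frequency bin for degrees 1, 6, 9 (invented).
true_spectra = [3.0 + 1.0j, 0.5 - 2.0j, 1.5 + 0.5j]
observed = [sum(c * s for c, s in zip(row, true_spectra)) for row in C]
cleaned = gauss_solve(C, observed)   # undoes the mixing
```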


2008 ◽  
Vol 15 (6) ◽  
pp. 1013-1022 ◽  
Author(s):  
J. Son ◽  
D. Hou ◽  
Z. Toth

Abstract. Various statistical methods are used to process operational Numerical Weather Prediction (NWP) products with the aim of reducing forecast errors, and they often require sufficiently large training data sets. Generating such a hindcast data set for this purpose can be costly, and a well-designed algorithm should be able to reduce the required size of these data sets. This issue is investigated with the relatively simple case of bias correction, by comparing a Bayesian algorithm of bias estimation with the conventionally used empirical method. As available forecast data sets are not large enough for a comprehensive test, synthetically generated time series representing the analysis (truth) and forecast are used to increase the sample size. Since these synthetic time series retained the statistical characteristics of the observations and operational NWP model output, the results of this study can be extended to real observations and forecasts, and this is confirmed by a preliminary test with real data. By using the climatological mean and standard deviation of the meteorological variable in consideration and the statistical relationship between the forecast and the analysis, the Bayesian bias estimator outperforms the empirical approach in terms of the accuracy of the estimated bias, and it can reduce the required size of the training sample by a factor of 3. This advantage of the Bayesian approach is due to the fact that it is less susceptible to sampling error in consecutive sampling. These results suggest that a carefully designed statistical procedure may reduce the need for the costly generation of large hindcast data sets.
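The contrast between the two estimators can be sketched as follows, with invented numbers: the empirical method is the plain sample mean of forecast-minus-analysis differences, while a simple Bayesian alternative combines that mean with a climatological prior, weighted by precision. This is a generic shrinkage estimator in the spirit of the paper, not the authors' exact formulation, and the prior/noise variances below are illustrative assumptions.

```python
import random

random.seed(1)
true_bias = 1.5                     # systematic forecast error to estimate
noise_sd = 3.0                      # day-to-day random forecast error
prior_mean, prior_var = 0.0, 1.0    # climatological prior on the bias

# Small training sample of forecast-minus-analysis differences.
diffs = [true_bias + random.gauss(0, noise_sd) for _ in range(10)]

# Empirical estimate: plain sample mean of the training differences.
empirical = sum(diffs) / len(diffs)

# Bayesian estimate: precision-weighted combination of prior and data,
# which shrinks a noisy small-sample mean toward the climatological prior.
n = len(diffs)
data_precision = n / noise_sd**2
prior_precision = 1 / prior_var
bayesian = (data_precision * empirical + prior_precision * prior_mean) / (
    data_precision + prior_precision)
```

The shrinkage is what makes the Bayesian estimate less susceptible to sampling error on small training sets, which is the mechanism behind the factor-of-3 reduction reported in the abstract.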


2018 ◽  
Vol 18 (3) ◽  
pp. 1573-1592 ◽  
Author(s):  
Gerrit de Leeuw ◽  
Larisa Sogacheva ◽  
Edith Rodriguez ◽  
Konstantinos Kourtidis ◽  
Aristeidis K. Georgoulias ◽  
...  

Abstract. The retrieval of aerosol properties from satellite observations provides their spatial distribution over a wide area in cloud-free conditions. As such, they complement ground-based measurements by providing information over sparsely instrumented areas, albeit that significant differences may exist in both the type of information obtained and the temporal information from satellite and ground-based observations. In this paper, information from different types of satellite-based instruments is used to provide a 3-D climatology of aerosol properties over mainland China, i.e., vertical profiles of extinction coefficients from the Cloud-Aerosol Lidar with Orthogonal Polarization (CALIOP), a lidar flying aboard the Cloud-Aerosol Lidar and Infrared Pathfinder Satellite Observation (CALIPSO) satellite and the column-integrated extinction (aerosol optical depth – AOD) available from three radiometers: the European Space Agency (ESA)'s Along-Track Scanning Radiometer version 2 (ATSR-2), Advanced Along-Track Scanning Radiometer (AATSR) (together referred to as ATSR) and NASA's Moderate Resolution Imaging Spectroradiometer (MODIS) aboard the Terra satellite, together spanning the period 1995–2015. AOD data are retrieved from ATSR using the ATSR dual view (ADV) v2.31 algorithm, while for MODIS Collection 6 (C6) the AOD data set is used that was obtained from merging the AODs obtained from the dark target (DT) and deep blue (DB) algorithms, further referred to as the DTDB merged AOD product. These data sets are validated and differences are compared using Aerosol Robotic Network (AERONET) version 2 L2.0 AOD data as reference. The results show that, over China, ATSR slightly underestimates the AOD and MODIS slightly overestimates the AOD. Consequently, ATSR AOD is overall lower than that from MODIS, and the difference increases with increasing AOD. The comparison also shows that neither of the ATSR and MODIS AOD data sets is better than the other one everywhere. 
However, ATSR ADV has limitations over bright surfaces, for which the MODIS DB algorithm was designed. To allow for comparison of MODIS C6 results with previous analyses where MODIS Collection 5.1 (C5.1) data were used, the difference between the C6 and C5.1 merged DTDB data sets from MODIS/Terra over China is also briefly discussed. The AOD data sets show strong seasonal differences, and the seasonal features vary with latitude and longitude across China. Two-decadal AOD time series, averaged over all of mainland China, are presented and briefly discussed. Using the 17 years of ATSR data as the basis, with MODIS/Terra following the temporal evolution in recent years after the environmental satellite Envisat was lost, requires a comparison of the data sets for the overlapping period to show their complementarity. ATSR precedes the MODIS time series between 1995 and 2000 and shows a distinct increase in the AOD over this period. The two data series show similar variations during the overlapping period between 2000 and 2011, with minima and maxima in the same years. MODIS extends this time series beyond the end of the Envisat period in 2012, showing decreasing AOD.
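Validation against AERONET of the kind described above reduces to matchup statistics; with invented AOD values, the sketch below computes the mean bias and RMSE of two hypothetical satellite products against a ground-based reference, one biased low (as reported for ATSR) and one biased high (as reported for MODIS).

```python
def bias_rmse(sat, ref):
    """Mean bias and root-mean-square error of satellite vs. reference AOD."""
    diffs = [s - r for s, r in zip(sat, ref)]
    bias = sum(diffs) / len(diffs)
    rmse = (sum(d * d for d in diffs) / len(diffs)) ** 0.5
    return bias, rmse

# Invented matchup values (unitless AOD at co-located sites/times).
aeronet = [0.20, 0.50, 0.80, 1.10, 0.30]
atsr    = [0.18, 0.45, 0.70, 0.95, 0.28]   # underestimate grows with AOD
modis   = [0.23, 0.55, 0.90, 1.20, 0.33]   # slight overestimate

atsr_bias, atsr_rmse = bias_rmse(atsr, aeronet)
modis_bias, modis_rmse = bias_rmse(modis, aeronet)
```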


2020 ◽  
Author(s):  
Oleg Skrynyk ◽  
Enric Aguilar ◽  
José A. Guijarro ◽  
Sergiy Bubin

<p>Before using climatological time series in research studies, it is necessary to perform their quality control and homogenization in order to remove possible artefacts (inhomogeneities) usually present in the raw data sets. In the vast majority of cases, the homogenization procedure makes it possible to improve the consistency of the data, which can then be verified by means of the statistical comparison of the raw and homogenized time series. However, a new question then arises: how far are the homogenized data from the true climate signal or, in other words, what errors could still be present in homogenized data?</p><p>The main objective of our work is to estimate the uncertainty produced by the adjustment algorithm of the widely used Climatol homogenization software when homogenizing daily time series of additive climate variables. We focused our efforts on the minimum and maximum air temperature. In order to achieve our goal, we used a benchmark data set created by the INDECIS<sup>*</sup> project. The benchmark contains clean data, extracted from an output of the Royal Netherlands Meteorological Institute Regional Atmospheric Climate Model (version 2) driven by Hadley Global Environment Model 2 - Earth System, and inhomogeneous data, created by introducing realistic breaks and errors.</p><p>The statistical evaluation of discrepancies between the homogenized (by means of Climatol with predefined break points) and clean data sets was performed using both a set of standard parameters and metrics introduced in our work. All metrics used clearly identify the main features of errors (systematic and random) present in the homogenized time series. 
We calculated the metrics for every time series (only over adjusted segments) as well as their averaged values as measures of uncertainties in the whole data set.</p><p>In order to determine how the two key parameters of the raw data collection, namely the length of time series and station density, influence the calculated measures of the adjustment error, we gradually decreased the length of the period and the number of stations in the area under study. The total number of cases considered was 56, including 7 time periods (1950-2005, 1954-2005, …, 1974-2005) and 8 different quantities of stations (100, 90, …, 30). Additionally, in order to find out how stable the calculated metrics are for each of the 56 cases, and to determine their confidence intervals, we performed 100 random permutations of the introduced inhomogeneities and repeated our calculations. With that, the total number of homogenization exercises performed was 5600 for each of the two climate variables.</p><p>Lastly, the calculated metrics were compared with the corresponding values obtained for the raw time series. The comparison showed a substantial improvement of the metric values after homogenization in each of the 56 cases considered (for both variables).</p><p>-------------------</p><p><sup>*</sup>INDECIS is a part of ERA4CS, an ERA-NET initiated by JPI Climate, and funded by FORMAS (SE), DLR (DE), BMWFW (AT), IFD (DK), MINECO (ES), ANR (FR) with co-funding by the European Union (Grant 690462). The work has been partially supported by the Ministry of Education and Science of Kazakhstan (Grant BR05236454) and Nazarbayev University (Grant 090118FD5345).</p>
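With synthetic numbers, comparing a homogenized series against the clean (truth) series reduces to error metrics like these: the mean difference captures the systematic error left after adjustment, and the standard deviation of the difference captures the random error. The specific metrics introduced in the work are not reproduced here; this is only the generic idea.

```python
def error_metrics(homogenized, clean):
    """Systematic (mean) and random (std. dev.) error of the adjustment."""
    d = [h - c for h, c in zip(homogenized, clean)]
    n = len(d)
    systematic = sum(d) / n
    random_part = (sum((x - systematic) ** 2 for x in d) / n) ** 0.5
    return systematic, random_part

# Invented daily temperatures (deg C): truth vs. imperfectly adjusted series.
clean = [10.0, 11.0, 12.5, 11.5, 10.5, 12.0]
homog = [10.3, 11.2, 12.9, 11.7, 10.9, 12.4]

sys_err, rand_err = error_metrics(homog, clean)
```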


2020 ◽  
Vol 8 (6) ◽  
pp. 3704-3708

Big data analytics is the field of analysing and processing information from large or complex data sets that are difficult to manage with conventional data-processing methods. It is used in analysing data and helps in predicting the best outcome from the data sets. Big data analytics can be very useful in predicting crime and can also suggest the best possible approach to solving it. In this system, a past crime data set is used to find patterns, and through those patterns the range of an incident is predicted. The range of the incident is determined by a decision model, and the prediction is made according to that range. Because the data sets are non-linear and in the form of time series, the system uses the Prophet model, which is designed to analyse non-linear time series data. The Prophet model decomposes a series into three main components: trend, seasonality, and holidays. This system will help crime units predict possible incidents according to the patterns identified by the algorithm, and will also help deploy the right number of resources to areas marked as having a high chance of incidents. The system will enhance crime prediction and help crime departments use their resources more efficiently.
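Fitting the actual Prophet model requires the third-party prophet package (fit on a dataframe with ds/y columns); as a dependency-free stand-in, the sketch below shows the simplest seasonality-aware baseline, a seasonal-naive forecast that predicts each future period from the same phase of the previous cycle. The weekly incident counts are invented.

```python
def seasonal_naive_forecast(history, period, steps):
    """Forecast `steps` future values by repeating the last full cycle."""
    last_cycle = history[-period:]
    return [last_cycle[i % period] for i in range(steps)]

# Invented weekly incident counts with a 4-week seasonal pattern.
weekly_counts = [30, 42, 38, 55, 31, 44, 40, 57]

forecast = seasonal_naive_forecast(weekly_counts, period=4, steps=4)
```

Prophet improves on this baseline by fitting the trend, seasonality and holiday components jointly rather than simply repeating the last cycle.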


2018 ◽  
Author(s):  
Farahnaz Khosrawi ◽  
Stefan Lossow ◽  
Gabriele P. Stiller ◽  
Karen H. Rosenlof ◽  
Joachim Urban ◽  
...  

Abstract. Time series of stratospheric and lower mesospheric water vapour using 33 data sets from 15 different satellite instruments were compared in the framework of the second SPARC (Stratosphere-troposphere Processes And their Role in Climate) water vapour assessment (WAVAS-II). This comparison aimed to provide a comprehensive overview of the typical uncertainties in the observational database that can be considered in future observational and modelling studies addressing, e.g., stratospheric water vapour trends. The time series comparisons are presented for three latitude bands: the Antarctic (80°–70° S), the tropics (15° S–15° N) and the northern hemisphere mid-latitudes (50° N–60° N), at four different altitudes (0.1, 3, 10 and 80 hPa) covering the stratosphere and lower mesosphere. The combined temporal coverage of observations from the 15 satellite instruments allowed consideration of the time period 1986–2014. In addition to the qualitative comparison of the time series, the agreement of the data sets is assessed quantitatively in the form of the spread (i.e. the difference between the maximum and minimum volume mixing ratio among the data sets), the (Pearson) correlation coefficient and the drift (i.e. linear changes of the difference between time series over time). Generally, good agreement between the time series was found in the middle stratosphere, while larger differences were found in the lower mesosphere and near the tropopause. Concerning the latitude bands, the largest differences were found in the Antarctic, while the best agreement was found for the tropics. From our assessment we find that all data sets can be considered in future observational and modelling studies addressing, e.g., stratospheric and lower mesospheric water vapour variability and trends, when data set specific characteristics (e.g. a drift) and restrictions (e.g. temporal and spatial coverage) are taken into account.
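The three agreement measures defined in the abstract are straightforward to state in code. With two invented water-vapour series, the sketch below computes the spread (difference between maximum and minimum mixing ratio per time step, here for just two data sets), the Pearson correlation, and the drift (least-squares slope of the difference between the two series over time).

```python
def pearson(a, b):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a) ** 0.5
    vb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (va * vb)

def drift(a, b):
    """Slope (per time step) of a least-squares line through a - b."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    t = list(range(n))
    mt, md = sum(t) / n, sum(d) / n
    return (sum((ti - mt) * (di - md) for ti, di in zip(t, d))
            / sum((ti - mt) ** 2 for ti in t))

# Invented water vapour mixing ratios (ppmv) from two instruments.
set_a = [4.00, 4.10, 4.20, 4.30, 4.40, 4.50]
set_b = [4.10, 4.15, 4.20, 4.25, 4.30, 4.35]   # drifts low relative to set_a

spread = [max(x, y) - min(x, y) for x, y in zip(set_a, set_b)]
r = pearson(set_a, set_b)
d = drift(set_a, set_b)   # positive: set_a grows relative to set_b
```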


2019 ◽  
Author(s):  
Srishti Mishra ◽  
Zohair Shafi ◽  
Santanu Pathak

Data-driven decision making is becoming an increasingly important aspect of successful business execution. More and more organizations are moving towards making informed decisions based on the data they generate. Most of these data are in temporal format, i.e. time series data. Effective analysis across time series data sets, in an efficient and quick manner, is a challenge. The most interesting and valuable part of such analysis is generating insights on correlation and causation across multiple time series data sets. This paper looks at methods that can be used to analyse such data sets and gain useful insights from them, primarily in the form of correlation and causation analysis. It focuses on two methods for doing so, a two-sample test with dynamic time warping, and hierarchical clustering, and looks at how the results returned by both can be used to gain a better understanding of the data. Moreover, the methods used are meant to work with any data set, regardless of the subject domain and idiosyncrasies of the data set, i.e. a data-agnostic approach.
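Dynamic time warping, a core ingredient of the two-sample test mentioned above, can be implemented with a small dynamic program. It aligns series that share a shape but are shifted or stretched in time, so the distance stays small where a plain pointwise metric would not. This is a textbook DTW, not the authors' code.

```python
def dtw(a, b):
    """Dynamic-time-warping distance between two numeric series."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # insertion
                                    cost[i][j - 1],      # deletion
                                    cost[i - 1][j - 1])  # match
    return cost[n][m]

x = [0, 1, 2, 3, 2, 1, 0]
y = [0, 0, 1, 2, 3, 2, 1, 0]   # same shape as x, delayed by one step
distance = dtw(x, y)           # small despite the time shift
```

In the clustering setting, a matrix of pairwise DTW distances between series is what gets fed into hierarchical clustering.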


Author(s):  
Richard Heuver ◽  
Ronald Heijmans

In this chapter the authors provide a method to aggregate large-value payment system transaction data for executing simulations with the Bank of Finland payment simulator. When transaction data sets get large, simulation may become too time consuming in terms of computing power, so only an amount of data that is insufficient from a statistical point of view can be processed. The method described in this chapter works around this problem by aggregating the transaction data in such a way that the outcome of the simulation is not compromised significantly. Depending on the type of simulation, only a few business days or up to a year of data is required. For stress scenario analysis, in which, e.g., the liquidity position of banks deteriorates, long time series are preferred, as business days can differ substantially. As an example, this chapter shows that aggregating all low-value transactions in the Dutch part of TARGET2 does not lead to a significantly different simulation outcome.
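A minimal sketch of the aggregation idea, with invented field names rather than the TARGET2 schema: transactions below a value threshold between the same sender and receiver on the same day are merged into one aggregate payment, while large transactions are kept individually, so total liquidity flows are preserved and the simulator has far fewer records to process.

```python
from collections import defaultdict

def aggregate(transactions, threshold):
    """Merge same-day, same-counterparty low-value payments; keep large ones."""
    large = [t for t in transactions if t["value"] >= threshold]
    buckets = defaultdict(float)
    for t in transactions:
        if t["value"] < threshold:
            buckets[(t["day"], t["sender"], t["receiver"])] += t["value"]
    merged = [{"day": d, "sender": s, "receiver": r, "value": v}
              for (d, s, r), v in buckets.items()]
    return large + merged

# Invented transactions: two small payments collapse into one record.
txs = [
    {"day": 1, "sender": "A", "receiver": "B", "value": 5.0},
    {"day": 1, "sender": "A", "receiver": "B", "value": 7.0},
    {"day": 1, "sender": "A", "receiver": "B", "value": 500.0},
]
out = aggregate(txs, threshold=100.0)
```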


2020 ◽  
Author(s):  
Julius Polz ◽  
Christian Chwala ◽  
Maximilian Graf ◽  
Harald Kunstmann

<p>Commercial microwave links (CMLs) can be used for quantitative precipitation estimation. The measurement technique is based on the exploitation of the close-to-linear relationship between the attenuation of the signal level by rainfall and the path-averaged rain rate. At a temporal resolution of one minute, the signal level of almost 4000 CMLs distributed all over Germany has been recorded since August 2017, resulting in one of the biggest CML data sets available for scientific purposes. A crucial step for retrieving rainfall information from this large CML data set is to accurately detect rainy periods in the time series, a process which is hampered by strong signal fluctuations, occasionally occurring even when there is no rain. In our study, we evaluate the performance of convolutional neural networks (CNNs) to distinguish between rainy and non-rainy signal fluctuations by recognizing their specific patterns. CNNs make use of many layers and local connections of neurons to recognize patterns independent of their location in the time series. We designed a custom CNN architecture consisting of a feature extraction and classification part with 20 layers of neurons and 1.4 x 10<sup>5</sup> trainable parameters. To train the model and validate the results we refer to the gauge-adjusted radar product RADOLAN-RW, provided by the German meteorological service. Despite not being an absolute truth, it provides robust information about rain events at the CML locations at an hourly time resolution. With only 400 CMLs used for training and 3504 for validation, we find that CNNs can learn to recognize different signal fluctuation patterns and generalize well to sensors and time periods not used for training. Overall we find a good agreement between the CML and weather radar derived rainfall information by detecting on average 87 % of all rainy and 91 % of all non-rainy periods.</p>
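As a toy illustration of the CNN ingredient described above: a 1-D convolution slides a small kernel along the signal-level series and responds to a local pattern wherever it occurs, which is why CNNs recognize patterns independent of their location in the time series. The kernel below is hand-set to respond to sharp level drops (as rain attenuation produces); the real classifier stacks 20 layers of learned kernels, which is not reproduced here.

```python
def conv1d(signal, kernel):
    """Valid-mode 1-D convolution: slide the kernel along the signal."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# Hand-set "sharp drop" detector (a learned CNN kernel plays this role).
kernel = [1.0, 0.0, -1.0]

# Invented signal levels (dB); the dip mimics attenuation by a rain event.
signal = [0, 0, 0, 0, -8, -9, -8, 0, 0, 0]

response = conv1d(signal, kernel)
detected = [i for i, r in enumerate(response) if r > 5]   # onset positions
```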

