Effect of data gaps: comparison of different spectral analysis methods

2016 ◽  
Vol 34 (4) ◽  
pp. 437-449 ◽  
Author(s):  
Costel Munteanu ◽  
Catalin Negrea ◽  
Marius Echim ◽  
Kalevi Mursula

Abstract. In this paper we investigate quantitatively the effect of data gaps for four methods of estimating the amplitude spectrum of a time series: fast Fourier transform (FFT), discrete Fourier transform (DFT), Z transform (ZTR) and the Lomb–Scargle algorithm (LST). We devise two tests: the single-large-gap test, which can probe the effect of a single data gap of varying size and the multiple-small-gaps test, used to study the effect of numerous small gaps of variable size distributed within the time series. The tests are applied on two data sets: a synthetic data set composed of a superposition of four sinusoidal modes, and one component of the magnetic field measured by the Venus Express (VEX) spacecraft in orbit around the planet Venus. For single data gaps, FFT and DFT give an amplitude monotonically decreasing with gap size. However, the shape of their amplitude spectrum remains unmodified even for a large data gap. On the other hand, ZTR and LST preserve the absolute level of amplitude but lead to greatly increased spectral noise for increasing gap size. For multiple small data gaps, DFT, ZTR and LST can, unlike FFT, find the correct amplitude of sinusoidal modes even for large data gap percentage. However, for in-situ data collected in a turbulent plasma environment, these three methods overestimate the high frequency part of the amplitude spectrum above a threshold depending on the maximum gap size, while FFT slightly underestimates it.
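The amplitude loss reported for FFT under a single gap can be illustrated with a minimal sketch. It assumes the gap is zero-filled before the transform (the abstract does not say how gaps were handled for FFT) and uses one unit-amplitude sinusoid instead of the four-mode synthetic set. The names `zero_filled_fft_amplitude` and `single_gap_mask` are hypothetical.

```python
import numpy as np

def zero_filled_fft_amplitude(signal, valid):
    """Single-sided amplitude spectrum with missing samples set to zero."""
    filled = np.where(valid, signal, 0.0)
    n = filled.size
    return 2.0 * np.abs(np.fft.rfft(filled)) / n

def single_gap_mask(n, start, length):
    """Boolean mask that is False inside one contiguous gap."""
    valid = np.ones(n, dtype=bool)
    valid[start:start + length] = False
    return valid

n = 1000
t = np.arange(n)
mode_bin = 10                                   # 10 cycles over the record
signal = np.sin(2 * np.pi * mode_bin * t / n)   # unit-amplitude mode

# Peak amplitude at the mode frequency for gaps of increasing size.
# Gap lengths are whole multiples of the 100-sample period (an assumption
# made so that the result is exact rather than approximate).
amps = {}
for gap in (0, 100, 300, 500):
    valid = single_gap_mask(n, start=200, length=gap)
    amps[gap] = float(zero_filled_fft_amplitude(signal, valid)[mode_bin])

for gap, amp in amps.items():
    print(f"gap = {gap:4d} samples  peak amplitude = {amp:.3f}")
```

In this simplified case the peak amplitude falls in proportion to the fraction of samples retained, which is consistent with the monotonic decrease described for FFT in the single-large-gap test.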

2012 ◽  
Vol 197 ◽  
pp. 271-277
Author(s):  
Zhu Ping Gong

The small data set approach is used to estimate the largest Lyapunov exponent (LLE). First, the mean-period drawback of the small data set method was corrected. On this basis, the LLEs of the daily qualified rate time series of HZ, an electronic manufacturing enterprise, were estimated. All of the LLEs obtained were positive, which indicates that the time series is chaotic and that the corresponding production process is a chaotic process. The variance of the LLEs revealed the struggle between the divergent nature of the quality system and the quality control effort. The LLEs increased sharply as the quality level worsened, coinciding with the company shutdown. HZ’s daily qualified rate, a chaotic time series, shows the predictable nature of the quality system in the short run.


Author(s):  
Andrey Sergeevich Kopyrin ◽  
Irina Leonidovna Makarova

The subject of the research is the process of collecting and preliminary preparation of data from heterogeneous sources. Economic information is heterogeneous and semi-structured or unstructured in nature. Due to the heterogeneity of the primary documents, as well as the human factor, the initial statistical data may contain a large amount of noise, as well as records whose automatic processing may be very difficult. This makes preprocessing dynamic input data an important precondition for discovering meaningful patterns and domain knowledge, and makes the research topic relevant. Data preprocessing is a series of unique tasks that have led to the emergence of various algorithms and heuristic methods for solving preprocessing tasks such as merge and cleanup, identification of variables. In this work, a preprocessing algorithm is formulated that brings together time series information from different sources into a single database and structures it. The key modification of the preprocessing method proposed by the authors is the technology of automated data integration. The technology proposed by the authors involves the combined use of methods for constructing a fuzzy time series and machine lexical comparison on the thesaurus network, as well as the use of a universal database built using the MIVAR concept. The preprocessing algorithm forms a single data model with the ability to transform the periodicity and semantics of the data set and integrate data that can come from various sources into a single information bank.


2013 ◽  
Vol 5 (1) ◽  
pp. 66-83 ◽  
Author(s):  
Iman Rahimi ◽  
Reza Behmanesh ◽  
Rosnah Mohd. Yusuff

The objective of this article is to evaluate and assess the efficiency of poultry meat farms, as a case study, with a new method. The poultry farm industry is one of the most important sub-sectors in comparison to others. The purpose of this study is the prediction and assessment of the efficiency of poultry farms as decision making units (DMUs). Although several methods have been proposed for solving this problem, the authors need a methodology that discriminates performance powerfully. Their methodology comprises data envelopment analysis and data mining techniques such as artificial neural network (ANN), decision tree (DT), and cluster analysis (CA). As a case study, data for the analysis were collected from 22 poultry companies in Iran. Moreover, because the data set is small and data mining techniques call for a large data set, they employed the k-fold cross validation method to validate the model. Assessing the efficiency of each DMU, clustering the DMUs, applying the model and presenting decision rules results in a precise and accurate optimizing technique.
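The k-fold cross validation step mentioned above can be sketched as an index split over the 22 DMUs. This is a generic illustration, not the authors' code: the abstract does not state the value of k, so k = 5 is an assumption, and `k_fold_indices` is a hypothetical helper.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Return k (train_idx, test_idx) pairs covering every sample once as test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)      # shuffle before splitting
    folds = np.array_split(order, k)        # fold sizes differ by at most one
    splits = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        splits.append((train_idx, test_idx))
    return splits

# 22 poultry companies (from the abstract); k = 5 is assumed for illustration.
splits = k_fold_indices(n_samples=22, k=5)
fold_sizes = [len(test_idx) for _, test_idx in splits]
print(fold_sizes)
```

Each company is held out exactly once, so every observation contributes to validation even though the data set is small.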


2020 ◽  
pp. 1-11
Author(s):  
Erjia Yan ◽  
Zheng Chen ◽  
Kai Li

Citation sentiment plays an important role in citation analysis and scholarly communication research, but prior citation sentiment studies have used small data sets and relied largely on manual annotation. This paper uses a large data set of PubMed Central (PMC) full-text publications and analyzes citation sentiment in more than 32 million citances within PMC, revealing citation sentiment patterns at the journal and discipline levels. This paper finds a weak relationship between a journal’s citation impact (as measured by CiteScore) and the average sentiment score of citances to its publications. When journals are aggregated into quartiles based on citation impact, we find that journals in higher quartiles are cited more favorably than those in the lower quartiles. Further, social science journals are found to be cited with higher sentiment, followed by engineering and natural science and biomedical journals, respectively. This result may be attributed to disciplinary discourse patterns in which social science researchers tend to use more subjective terms to describe others’ work than do natural science or biomedical researchers.


2014 ◽  
Vol 7 (6) ◽  
pp. 1547-1570 ◽  
Author(s):  
C. Viatte ◽  
K. Strong ◽  
K. A. Walker ◽  
J. R. Drummond

Abstract. We present a five-year time series of seven tropospheric species measured using a ground-based Fourier transform infrared (FTIR) spectrometer at the Polar Environment Atmospheric Research Laboratory (PEARL; Eureka, Nunavut, Canada; 80°05' N, 86°42' W) from 2007 to 2011. Total columns and temporal variabilities of carbon monoxide (CO), hydrogen cyanide (HCN) and ethane (C2H6) as well as the first derived total columns at Eureka of acetylene (C2H2), methanol (CH3OH), formic acid (HCOOH) and formaldehyde (H2CO) are investigated, providing a new data set in the sparsely sampled high latitudes. Total columns are obtained using the SFIT2 retrieval algorithm based on the optimal estimation method. The microwindows as well as the a priori profiles and variabilities are selected to optimize the information content of the retrievals, and error analyses are performed for all seven species. Our retrievals show good sensitivities in the troposphere. The seasonal amplitudes of the time series, ranging from 34 to 104%, are captured while using a single a priori profile for each species. The time series of the CO, C2H6 and C2H2 total columns at PEARL exhibit strong seasonal cycles with maxima in winter and minima in summer, in opposite phase to the HCN, CH3OH, HCOOH and H2CO time series. These cycles result from the relative contributions of the photochemistry, oxidation and transport as well as biogenic and biomass burning emissions. Comparisons of the FTIR partial columns with coincident satellite measurements by the Atmospheric Chemistry Experiment Fourier Transform Spectrometer (ACE-FTS) show good agreement. The correlation coefficients and the slopes range from 0.56 to 0.97 and 0.50 to 3.35, respectively, for the seven target species. Our new data set is compared to previous measurements found in the literature to assess atmospheric budgets of these tropospheric species in the high Arctic. 
The CO and C2H6 concentrations are consistent with negative trends observed over the Northern Hemisphere, attributed to fossil fuel emission decrease. The importance of poleward transport for the atmospheric budgets of HCN and C2H2 is highlighted. Columns and variabilities of CH3OH and HCOOH at PEARL are comparable to previous measurements performed at other remote sites. However, the small columns of H2CO in early May might reflect its large atmospheric variability and/or the effect of the updated spectroscopic parameters used in our retrievals. Overall, emissions from biomass burning contribute to the day-to-day variabilities of the seven tropospheric species observed at Eureka.


2006 ◽  
Vol 18 (2) ◽  
pp. 470-495 ◽  
Author(s):  
D. Huang ◽  
Tommy W. S. Chow

Data reduction algorithms determine a small data subset from a given large data set. In this article, new types of data reduction criteria, based on the concept of entropy, are first presented. These criteria can evaluate the data reduction performance in a sophisticated and comprehensive way. As a result, new data reduction procedures are developed. Using the newly introduced criteria, the proposed data reduction scheme is shown to be efficient and effective. In addition, an outlier-filtering strategy, which is computationally insignificant, is developed. In some instances, this strategy can substantially improve the performance of supervised data analysis. The proposed procedures are compared with related techniques in two types of application: density estimation and classification. Extensive comparative results are included to corroborate the contributions of the proposed algorithms.


2008 ◽  
Vol 130 (2) ◽  
Author(s):  
Stuart Holdsworth

The European Creep Collaborative Committee (ECCC) approach to creep data assessment has now been established for almost ten years. The methodology covers the analysis of rupture strength and ductility, creep strain, and stress relaxation data, for a range of material conditions. This paper reviews the concepts and procedures involved. The original approach was devised to determine data sheets for use by committees responsible for the preparation of National and International Design and Product Standards, and the methods developed for data quality evaluation and data analysis were therefore intentionally rigorous. The focus was clearly on the determination of long-time property values from the largest possible data sets involving a significant number of observations in the mechanism regime for which predictions were required. More recently, the emphasis has changed. There is now an increasing requirement for full property descriptions from very short times to very long and hence the need for much more flexible model representations than were previously required. There continues to be a requirement for reliable long-time predictions from relatively small data sets comprising relatively short duration tests, in particular, to exploit new alloy developments at the earliest practical opportunity. In such circumstances, it is not feasible to apply the same degree of rigor adopted for large data set assessment. Current developments are reviewed.


Author(s):  
Jason Chen

Clustering analysis is a tool used widely in the Data Mining community and beyond (Everitt et al. 2001). In essence, the method allows us to “summarise” the information in a large data set X by creating a very much smaller set C of representative points (called centroids) and a membership map relating each point in X to its representative in C. An obvious but special type of data set that one might want to cluster is a time series data set. Such data has a temporal ordering on its elements, in contrast to non-time series data sets. In this article we explore the area of time series clustering, focusing mainly on a surprising recent result showing that the traditional method for time series clustering is meaningless. We then survey the literature of recent papers and go on to argue how time series clustering can be made meaningful.
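The summary described above, a small set C of centroids plus a membership map from each point in X to its representative, can be sketched with a few Lloyd-style iterations. This is a generic illustration on ordinary (non-time-series) points, not the method the article critiques or proposes; the function names and the toy data are hypothetical.

```python
import numpy as np

def assign(X, C):
    """Membership map: index of the nearest centroid for every point in X."""
    dists = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
    return dists.argmin(axis=1)

def lloyd(X, C0, iters=10):
    """Alternate assignment and centroid update, starting from centroids C0."""
    C = C0.astype(float).copy()
    for _ in range(iters):
        membership = assign(X, C)
        for j in range(len(C)):
            if np.any(membership == j):     # leave empty clusters unchanged
                C[j] = X[membership == j].mean(axis=0)
    return C, assign(X, C)

# Toy data set X: two well-separated groups of three points each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
C, membership = lloyd(X, X[[0, 3]])         # initialise from two data points
print(C)
print(membership)
```

Six points are summarised by two centroids, and the membership array records which centroid represents each point.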


2009 ◽  
Vol 9 (1) ◽  
pp. 3167-3205
Author(s):  
P. Duchatelet ◽  
E. Mahieu ◽  
R. Ruhnke ◽  
W. Feng ◽  
M. Chipperfield ◽  
...  

Abstract. We present an original multi-spectrum fitting procedure to retrieve volume mixing ratio (VMR) profiles of carbonyl fluoride (COF2) from ground-based high resolution Fourier transform infrared (FTIR) solar spectra. The multi-spectrum approach consists of simultaneously combining, during the retrievals, all spectra recorded consecutively during the same day and with the same resolution. Solar observations analyzed in this study with the SFIT-2 v3.91 fitting algorithm correspond to more than 2900 spectra recorded between January 2000 and December 2007 at high zenith angles, with a Fourier Transform Spectrometer operated at the high-altitude International Scientific Station of the Jungfraujoch (ISSJ, 46.5° N latitude, 8.0° E longitude, 3580 m altitude), Switzerland. The goal of the retrieval strategy described here is to provide information about the vertical distribution of carbonyl fluoride. The microwindows used are located in the ν1 or in the ν4 COF2 infrared (IR) absorption bands. Averaging kernel and eigenvector analysis indicates that our FTIR retrieval is sensitive to COF2 inversion between 17 and 30 km, with the major contribution to the retrieved information always coming from the measurement. Moreover, there was no significant bias between COF2 partial columns, total columns or VMR profiles retrieved from the two bands. For each wavenumber region, a complete error budget including all identified sources has been carefully established. In addition, comparisons of FTIR COF2 17–30 km partial columns with KASIMA and SLIMCAT 3-D CTMs are also presented. While we do not notice any significant bias between FTIR and SLIMCAT time series, KASIMA COF2 17–30 km partial columns are lower by around 25%, probably due to incorrect lower boundary conditions. For each time series, linear trend estimation for the 2000–2007 time period as well as a seasonal variation study are also performed and critically discussed.
We further demonstrate that all time series are able to reproduce the COF2 seasonal cycle, whose main seasonal characteristics deduced from each data set agree quite well.


2009 ◽  
Vol 9 (22) ◽  
pp. 9027-9042 ◽  
Author(s):  
P. Duchatelet ◽  
E. Mahieu ◽  
R. Ruhnke ◽  
W. Feng ◽  
M. Chipperfield ◽  
...  

Abstract. We present an original multi-spectrum fitting procedure to retrieve volume mixing ratio (VMR) profiles of carbonyl fluoride (COF2) from ground-based high resolution Fourier transform infrared (FTIR) solar spectra. The multi-spectrum approach consists of simultaneously combining, during the retrievals, all spectra recorded consecutively during the same day and with the same resolution. Solar observations analyzed in this study with the SFIT-2 v3.91 fitting algorithm correspond to more than 2900 spectra recorded between January 2000 and December 2007 at high zenith angles, with a Fourier Transform Spectrometer operated at the high-altitude International Scientific Station of the Jungfraujoch (ISSJ, 46.5° N latitude, 8.0° E longitude, 3580 m altitude), Switzerland. The goal of the retrieval strategy described here is to provide information about the vertical distribution of carbonyl fluoride. The microwindows used are located in the ν1 or in the ν4 COF2 infrared (IR) absorption bands. Averaging kernel and eigenvector analysis indicates that our FTIR retrieval is sensitive to COF2 inversion between 17 and 30 km, with the major contribution to the retrieved information always coming from the measurement. Moreover, there was no significant bias between COF2 partial columns, total columns or VMR profiles retrieved from the two bands. For each wavenumber region, a complete error budget including all identified sources has been carefully established. In addition, comparisons of FTIR COF2 17–30 km partial columns with KASIMA and SLIMCAT 3-D CTMs are also presented. While we do not notice any significant bias between FTIR and SLIMCAT time series, KASIMA COF2 17–30 km partial columns are lower by around 25%, probably due to incorrect lower boundary conditions. For each time series, linear trend estimation for the 2000–2007 time period as well as a seasonal variation study are also performed and critically discussed.
For FTIR and KASIMA time series, very low COF2 growth rates (0.4±0.2%/year and 0.3±0.2%/year, respectively) have been derived. However, the SLIMCAT data set gives a slight negative trend (−0.5±0.2%/year), probably ascribable to discontinuities in the meteorological data used by this model. We further demonstrate that all time series are able to reproduce the COF2 seasonal cycle, whose main seasonal characteristics deduced from each data set agree quite well.
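One common way to obtain a linear trend alongside a seasonal cycle is a joint least-squares fit. The abstract does not specify the estimator used, so the following is an illustrative sketch on synthetic, noise-free data and not the authors' procedure. The function name, the single annual harmonic, and expressing the trend relative to the fitted value at t = 0 are all assumptions.

```python
import numpy as np

def fit_trend_and_season(t_years, y):
    """Least-squares fit of offset + slope * t + one annual harmonic.

    Returns the trend in percent per year (relative to the fitted offset
    at t = 0, an assumed convention) and the full coefficient vector.
    """
    design = np.column_stack([
        np.ones_like(t_years),              # constant offset
        t_years,                            # linear trend
        np.sin(2 * np.pi * t_years),        # annual cycle, sine part
        np.cos(2 * np.pi * t_years),        # annual cycle, cosine part
    ])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    offset, slope = coef[0], coef[1]
    return 100.0 * slope / offset, coef

# Synthetic monthly series over eight years with a built-in 0.4 %/year
# trend and a seasonal cycle (values chosen only for illustration).
t = np.arange(0.0, 8.0, 1.0 / 12.0)
y = 100.0 * (1.0 + 0.004 * t) + 5.0 * np.sin(2 * np.pi * t)

trend_percent_per_year, coef = fit_trend_and_season(t, y)
print(f"trend = {trend_percent_per_year:.2f} %/year")
```

Fitting the seasonal terms together with the slope keeps the annual cycle from leaking into the trend estimate when the record does not span a whole number of years.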

