Multiple Data Set ILI for Mechanical Damage Assessment

Author(s):  
Chris Goller ◽  
James Simek ◽  
Jed Ludlow

The purpose of this paper is to present a non-traditional pipeline mechanical damage ranking system using multiple-data-set in-line inspection (ILI) tools. Mechanical damage continues to be a major factor in reportable incidents for hazardous liquid and gas pipelines. While several ongoing programs seek to limit damage incidents through public awareness, encroachment monitoring, and one-call systems, others have focused on quantifying mechanical damage severity through modeling, the use of ILI tools, and subsequent feature assessment at locations selected for excavation. Current-generation ILI tools capable of acquiring multiple data sets in a single survey may provide an improved assessment of the severity of damaged zones, using methods developed in earlier research programs as well as currently reported information. For magnetic flux leakage (MFL) tools, the use of multiple field levels, varied field directions, and high-accuracy deformation sensors enables detection and provides the data necessary for enhanced severity assessments. This paper reviews multiple-data-set ILI results from several pipe joints with simulated mechanical damage created to mimic right-of-way encroachment events, together with field results from ILI surveys using multiple-data-set tools.

2018 ◽  
Vol 11 (7) ◽  
pp. 4239-4260 ◽  
Author(s):  
Richard Anthes ◽  
Therese Rieckh

Abstract. In this paper we show how multiple data sets, including observations and models, can be combined using the “three-cornered hat” (3CH) method to estimate vertical profiles of the errors of each system. Using data from 2007, we estimate the error variances of radio occultation (RO), radiosondes, ERA-Interim, and Global Forecast System (GFS) model data sets at four radiosonde locations in the tropics and subtropics. A key assumption is the neglect of error covariances among the different data sets, and we examine the consequences of this assumption on the resulting error estimates. Our results show that different combinations of the four data sets yield similar relative and specific humidity, temperature, and refractivity error variance profiles at the four stations, and these estimates are consistent with previous estimates where available. These results thus indicate that the correlations of the errors among all data sets are small and the 3CH method yields realistic error variance profiles. The estimated error variances of the ERA-Interim data set are smallest, a reasonable result considering the excellent model and data assimilation system and assimilation of high-quality observations. For the four locations studied, RO has smaller error variances than radiosondes, in agreement with previous studies. Part of the larger error variance of the radiosondes is associated with representativeness differences because radiosondes are point measurements, while the other data sets represent horizontal averages over scales of ∼ 100 km.
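To make the 3CH estimate concrete, here is a minimal sketch of the textbook three-cornered-hat relation applied to three hypothetical collocated series; the signal, noise levels, and system labels are illustrative stand-ins, not the paper's data.

```python
# Minimal 3CH sketch, assuming three collocated data sets whose errors are
# mutually uncorrelated (the key assumption examined in the paper).
import numpy as np

rng = np.random.default_rng(0)
truth = np.sin(np.linspace(0, 4 * np.pi, 5000))   # unknown true signal
a = truth + rng.normal(0, 0.10, truth.size)       # hypothetical system A (e.g. RO)
b = truth + rng.normal(0, 0.20, truth.size)       # hypothetical system B (e.g. radiosonde)
c = truth + rng.normal(0, 0.05, truth.size)       # hypothetical system C (e.g. reanalysis)

def three_cornered_hat(a, b, c):
    """Error variances of a, b, c from pairwise difference variances.

    Uses Var(err_a) = 0.5 * (Var(a-b) + Var(a-c) - Var(b-c)), which holds
    when the three error series are uncorrelated with each other; the true
    signal cancels in each difference.
    """
    vab, vac, vbc = np.var(a - b), np.var(a - c), np.var(b - c)
    return (0.5 * (vab + vac - vbc),
            0.5 * (vab + vbc - vac),
            0.5 * (vac + vbc - vab))

print(three_cornered_hat(a, b, c))  # approx (0.01, 0.04, 0.0025)
```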


2021 ◽  
Author(s):  
Huan Chen ◽  
Brian Caffo ◽  
Genevieve Stein-O’Brien ◽  
Jinrui Liu ◽  
Ben Langmead ◽  
...  

Summary: Integrative analysis of multiple data sets has the potential of fully leveraging the vast amount of high-throughput biological data being generated. In particular, such analysis will be powerful in making inference from publicly available collections of genetic, transcriptomic and epigenetic data sets which are designed to study shared biological processes, but which vary in their target measurements, biological variation, unwanted noise, and batch variation. Thus, methods that enable the joint analysis of multiple data sets are needed to gain insights into shared biological processes that would otherwise be hidden by unwanted intra-data-set variation. Here, we propose a method called two-stage linked component analysis (2s-LCA) to jointly decompose multiple biologically related experimental data sets, allowing biological and technological relationships among them to be structured into the decomposition. The consistency of the proposed method is established and its empirical performance is evaluated via simulation studies. We apply 2s-LCA to jointly analyze four data sets focused on human brain development and identify meaningful patterns of gene expression in human neurogenesis that have shared structure across these data sets. The code to conduct 2s-LCA has been compiled into an R package, “PJD”, which is available at https://github.com/CHuanSite/PJD.
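The published 2s-LCA algorithm is not reproduced here. As a rough sketch of the joint-decomposition idea it builds on, the toy example below estimates components shared across several column-standardized data sets over a common gene space via an SVD of their concatenation; all dimensions and names are hypothetical.

```python
# Toy joint decomposition across data sets sharing a gene space
# (an illustration of the shared-component idea, NOT the 2s-LCA algorithm).
import numpy as np

rng = np.random.default_rng(1)
genes = 1000
loadings = rng.normal(size=(genes, 3))            # latent shared gene patterns
data_sets = [loadings @ rng.normal(size=(3, n))
             + rng.normal(scale=0.5, size=(genes, n))
             for n in (40, 60, 50)]               # three hypothetical studies

std = [(x - x.mean(axis=0)) / x.std(axis=0) for x in data_sets]  # standardize samples
joint = np.hstack(std)                            # genes x (total samples)
u, s, vt = np.linalg.svd(joint, full_matrices=False)
shared_est = u[:, :3]                             # estimated shared gene loadings

scores = [shared_est.T @ x for x in std]          # per-data-set sample scores
print([sc.shape for sc in scores])                # [(3, 40), (3, 60), (3, 50)]
```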


Author(s):  
Ping Li ◽  
Hua-Liang Wei ◽  
Stephen A. Billings ◽  
Michael A. Balikhin ◽  
Richard Boynton

A basic assumption on the data used for nonlinear dynamic model identification is that the data points are collected continuously, in chronological order. However, there are situations in practice where this assumption does not hold and we end up with an identification problem involving multiple data sets. This paper addresses that problem and proposes a new cross-validation-based orthogonal search algorithm for NARMAX model identification from multiple data sets. The algorithm aims to identify a single model from multiple data sets, extending the applicability of the standard method to cases where the data sets are obtained from multiple tests or a series of experiments, or where the data record is discontinuous because of missing points. The proposed method can also be viewed as a way to improve the performance of the standard orthogonal search method by making full use of all available data segments. Simulated and real data are used to illustrate the operation and demonstrate the effectiveness of the proposed method.
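As a minimal sketch of the multiple-data-set idea (a plain greedy forward selection with a pooled least-squares criterion, not the authors' cross-validation-based orthogonal search), the following fits a single polynomial NARX model jointly over two simulated, discontinuous data segments; the candidate term set and the simulated system are hypothetical.

```python
# Joint term selection for one NARX model over multiple data segments.
import numpy as np

def build_terms(u, y):
    """Candidate regressors: lagged input/output terms plus one bilinear term."""
    y1, y2, u1, u2 = y[1:-1], y[:-2], u[1:-1], u[:-2]
    X = np.column_stack([y1, y2, u1, u2, y1 * u1, np.ones_like(y1)])
    return X, y[2:]

def joint_forward_select(segments, n_terms=3):
    """Greedily add the term that most reduces the pooled residual sum of squares."""
    built = [build_terms(u, y) for u, y in segments]
    n_cand = built[0][0].shape[1]
    chosen = []
    for _ in range(n_terms):
        best, best_rss = None, np.inf
        for j in (c for c in range(n_cand) if c not in chosen):
            trial = chosen + [j]
            X_all = np.vstack([X[:, trial] for X, _ in built])  # pool all segments
            t_all = np.concatenate([t for _, t in built])
            coef, *_ = np.linalg.lstsq(X_all, t_all, rcond=None)
            rss = np.sum((t_all - X_all @ coef) ** 2)
            if rss < best_rss:
                best, best_rss = j, rss
        chosen.append(best)
    return chosen

rng = np.random.default_rng(2)
segments = []
for n in (200, 150):                      # two discontinuous data segments
    u = rng.normal(size=n)
    y = np.zeros(n)
    for k in range(2, n):
        y[k] = 0.5 * y[k - 1] - 0.3 * y[k - 2] + 0.8 * u[k - 1] + 0.02 * rng.normal()
    segments.append((u, y))
print(joint_forward_select(segments))     # indices of y[k-1], y[k-2], u[k-1] terms
```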


2020 ◽  
Vol 493 (1) ◽  
pp. 48-54
Author(s):  
Chris Koen

ABSTRACT: Large monitoring campaigns, particularly those using multiple filters, have produced replicated time series of observations for literally millions of stars. The search for periodicities in such replicated data can be facilitated by comparing the periodograms of the various time series. In particular, frequency spectra can be searched for common peaks. The sensitivity of this procedure to various parameters (e.g. the time base of the data, the length of the frequency interval searched, and the number of replicate series) is explored. Two additional statistics that could sharpen results are also discussed: the closeness (in frequency) of peaks identified as common to all data sets, and the sum of the ranks of the peaks. Analytical expressions for the distributions of these two statistics are presented. The method is illustrated by showing that a 'dubious' periodicity in an 'Asteroid Terrestrial-impact Last Alert System' data set is highly significant.
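A minimal sketch of the common-peak comparison: compute a Lomb-Scargle periodogram for each replicated series and check whether the top peaks agree in frequency. The frequency grid, tolerance, and simulated series below are illustrative choices, not the paper's prescriptions.

```python
# Compare periodogram peaks across replicated time series of one star.
import numpy as np
from scipy.signal import lombscargle

rng = np.random.default_rng(3)
t = np.sort(rng.uniform(0, 100, 400))             # irregular sampling times
true_w = 0.7                                      # angular frequency of the signal
series = [np.sin(true_w * t) + rng.normal(0, s, t.size)
          for s in (0.5, 0.8, 1.0)]               # three replicate series (e.g. filters)

freqs = np.linspace(0.01, 3.0, 5000)              # angular frequency grid
top_peaks = []
for y in series:
    power = lombscargle(t, y - y.mean(), freqs)
    top_peaks.append(freqs[np.argmax(power)])

tol = 0.01                                        # closeness criterion (illustrative)
is_common = max(top_peaks) - min(top_peaks) < tol
print(top_peaks, "common peak" if is_common else "no common peak")
```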


2018 ◽  
Author(s):  
James M. Holton

Abstract: A synthetic data set demonstrating a particularly challenging case of indexing ambiguity in the context of radiation damage was generated in order to serve as a standard benchmark and reference point for the ongoing development of new methods and new approaches to solving this problem. Of the 100 short wedges of data, only the first 71 are currently necessary to solve the structure by “cheating”, that is, using the correct reference structure as a guide. The total wall-clock time and number of wedges required to solve the structure without cheating is proposed as a metric for the efficacy and efficiency of a given multi-crystal automation pipeline. Synopsis: A synthetic data set demonstrating the challenges of combining multiple data sets with indexing ambiguity in the context of heavy radiation damage in multi-crystal macromolecular crystallography was generated and described, and the problems encountered using contemporary data processing programs were summarized.


2020 ◽  
Vol 45 (4) ◽  
pp. 667-704
Author(s):  
Alina Maria Ciobanu ◽  
Liviu P. Dinu

Language change across space and time is one of the main concerns in historical linguistics. In this article, we develop tools to assist researchers and domain experts in the study of language evolution. First, we introduce a method to automatically determine whether two words are cognates. We propose an algorithm for extracting cognates from electronic dictionaries that contain etymological information. Having built a data set of related words, we further develop machine learning methods based on orthographic alignment for identifying cognates. We use aligned subsequences as features for classification algorithms in order to infer rules for the linguistic changes undergone by words when entering new languages and to discriminate between cognates and non-cognates. Second, we extend the method to a finer-grained level, to identify the type of relationship between words. Discriminating between cognates and borrowings provides a deeper insight into the history of a language and allows a better characterization of language relatedness. We show that orthographic features have discriminative power, and we analyze the underlying linguistic factors that prove relevant in the classification task. To our knowledge, this is the first attempt of this kind. Third, we develop a machine learning method for automatically producing related words. We focus on reconstructing proto-words, but we also address two related sub-problems: producing modern word forms and producing cognates. The task of reconstructing proto-words consists of recreating the words of an ancient language from its modern daughter languages. Having modern word forms in multiple Romance languages, we infer the form of their common Latin ancestors. Our approach relies on the regularities that occurred when words entered the modern languages. We leverage information from several modern languages, building an ensemble system for reconstructing proto-words. We apply our method to multiple data sets, showing that it improves on previous results while requiring less input data, which is essential in historical linguistics, where resources are generally scarce.
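A toy sketch of the alignment-feature idea for cognate identification; the word pairs are hypothetical examples, and a greedy difflib alignment stands in for the orthographic alignment procedure used in the article.

```python
# Align a word pair, turn the aligned character correspondences into
# features, and train a classifier to separate cognates from non-cognates.
from difflib import SequenceMatcher
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def pair_features(w1, w2):
    """Character correspondences from a greedy alignment of the two words."""
    feats = {}
    sm = SequenceMatcher(a=w1, b=w2)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        feats[f"{op}:{w1[i1:i2]}>{w2[j1:j2]}"] = 1
    feats["len_diff"] = abs(len(w1) - len(w2))
    return feats

pairs = [("noche", "notte", 1), ("lluvia", "pioggia", 1),   # toy cognate pairs
         ("perro", "cane", 0), ("libro", "mesa", 0)]        # toy non-cognates
X = [pair_features(a, b) for a, b, _ in pairs]
y = [label for *_, label in pairs]

clf = make_pipeline(DictVectorizer(), LogisticRegression())
clf.fit(X, y)
print(clf.predict([pair_features("nuit", "notte")]))        # classify an unseen pair
```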


Author(s):  
Kristin Vanderbilt ◽  
David Blankman

Science has become a data-intensive enterprise. Data sets are commonly stored in public data repositories and are thus available for others to use in new, often unexpected ways. Such re-use can take the form of reproducing the original analysis, analyzing the data in new ways, or combining multiple data sets into new data sets that are then analyzed further. A scientist who re-uses a data set collected by another must be able to assess its trustworthiness. This chapter reviews the types of errors found in metadata referring to data collected manually, data collected by instruments (sensors), and data recovered from specimens in museum collections. It also summarizes methods used to screen these types of data for errors. It stresses the importance of ensuring that the metadata associated with a data set thoroughly document the error prevention, detection, and correction methods applied to the data set prior to publication.
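As a small illustration of the kind of automated screening the chapter surveys, the sketch below flags range violations, spikes, and flatlined runs in a sensor series; the thresholds and data are hypothetical.

```python
# Basic quality-control checks for a sensor time series (illustrative thresholds).
import numpy as np

def screen(values, lo=-40.0, hi=50.0, max_step=10.0, flat_run=5):
    """Flag range violations, spikes, and runs of >= flat_run identical values."""
    v = np.asarray(values, dtype=float)
    flags = {"range": (v < lo) | (v > hi)}
    step = np.abs(np.diff(v, prepend=v[0]))
    flags["spike"] = step > max_step
    same = np.concatenate([[False], np.diff(v) == 0])
    run = np.zeros(v.size, dtype=int)
    for i in range(1, v.size):
        run[i] = run[i - 1] + 1 if same[i] else 0
    flags["flatline"] = run >= flat_run - 1
    return flags

data = [21.1, 21.3, 21.2, 99.0, 21.4, 21.4, 21.4, 21.4, 21.4, 21.4]
for name, mask in screen(data).items():
    print(name, np.where(mask)[0])   # indices flagged by each check
```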


2020 ◽  
Vol 34 (06) ◽  
pp. 10153-10161
Author(s):  
Biwei Huang ◽  
Kun Zhang ◽  
Mingming Gong ◽  
Clark Glymour

A number of approaches to causal discovery assume that there are no hidden confounders and are designed to learn a fixed causal model from a single data set. Over the last decade, with closer cooperation across laboratories, it has become possible to accumulate more variables and data for analysis, although each lab may measure only a subset of them, due to technical constraints or to save time and cost. This raises the question of how to handle causal discovery from multiple data sets with non-identical variable sets, and at the same time it would be interesting to see how additional recorded variables can help to mitigate the confounding problem. In this paper, we propose a principled method to uniquely identify causal relationships over the integrated set of variables from multiple data sets in linear, non-Gaussian cases. The proposed method also allows distribution shifts across data sets. Theoretically, we show that the causal structure over the integrated set of variables is identifiable under testable conditions. Furthermore, we present two types of approaches to parameter estimation: one is based on maximum likelihood, and the other is likelihood-free and leverages generative adversarial nets to improve the scalability of the estimation procedure. Experimental results on various synthetic and real-world data sets are presented to demonstrate the efficacy of our methods.
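As a toy sketch of the linear, non-Gaussian intuition such methods build on (not the paper's multi-data-set algorithm), the example below decides the causal direction between two variables: in the correct direction the regression residual is independent of the cause, which a crude nonlinear-correlation score can detect.

```python
# Pairwise causal direction under a linear, non-Gaussian model.
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 5000)                      # non-Gaussian cause
y = 0.8 * x + rng.uniform(-0.5, 0.5, 5000)        # x -> y with uniform noise

def residual(target, regressor):
    """OLS residual of target regressed on regressor."""
    beta = np.cov(target, regressor, ddof=0)[0, 1] / np.var(regressor)
    return target - beta * regressor

def dependence(a, b):
    """Crude nonlinear dependence score via tanh-transformed correlation."""
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    return abs(np.mean(np.tanh(a) * np.tanh(b))
               - np.mean(np.tanh(a)) * np.mean(np.tanh(b)))

score_xy = dependence(x, residual(y, x))          # near zero if x -> y
score_yx = dependence(y, residual(x, y))          # larger in the wrong direction
print("x -> y" if score_xy < score_yx else "y -> x", score_xy, score_yx)
```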


2017 ◽  
Vol 16 ◽  
pp. 117693511769077
Author(s):  
Sangin Lee ◽  
Faming Liang ◽  
Ling Cai ◽  
Guanghua Xiao

The construction of gene regulatory networks (GRNs) is an essential component of biomedical research for determining disease mechanisms and identifying treatment targets. Gaussian graphical models (GGMs) have been widely used for constructing GRNs by inferring conditional dependence among a set of gene expressions. In practice, GRNs obtained from the analysis of a single data set may not be reliable due to limited sample sizes. Therefore, it is important to integrate multiple data sets from comparable studies to improve GRN construction. In this article, we introduce an equivalent measure of partial correlation coefficients in GGMs and then extend the method to construct a GRN by combining the equivalent measures from different sources. Furthermore, we develop a method for multiple data sets with a natural missing mechanism to accommodate the differences among platforms in multiple sources of data. Simulation results show that this integrative analysis outperforms the standard methods and can detect hub genes in the true network. The proposed integrative method was applied to 12 lung adenocarcinoma data sets collected from different studies. The constructed network is consistent with current biological knowledge and reveals new insights into lung adenocarcinoma.
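The paper's equivalent measure is not reproduced here; as a rough sketch of the integration idea, the example below estimates partial correlations from each data set's precision matrix, Fisher-z transforms them, and pools across data sets weighted by sample size. The dimensions and weighting scheme are illustrative.

```python
# Pool partial-correlation estimates across multiple data sets.
import numpy as np

def partial_corr(X):
    """Partial correlation matrix from the inverse sample covariance."""
    prec = np.linalg.inv(np.cov(X, rowvar=False))
    d = np.sqrt(np.diag(prec))
    pc = -prec / np.outer(d, d)
    np.fill_diagonal(pc, 1.0)
    return pc

def pooled_partial_corr(data_sets):
    """Fisher-z average of per-data-set partial correlations, weighted by n."""
    zs, ws = [], []
    for X in data_sets:
        pc = np.clip(partial_corr(X), -0.999, 0.999)
        zs.append(np.arctanh(pc))
        ws.append(X.shape[0])
    z = sum(w * z_ for w, z_ in zip(ws, zs)) / sum(ws)
    return np.tanh(z)

rng = np.random.default_rng(5)
true_prec = np.array([[1.0, 0.4, 0.0],
                      [0.4, 1.0, 0.3],
                      [0.0, 0.3, 1.0]])
cov = np.linalg.inv(true_prec)
data = [rng.multivariate_normal(np.zeros(3), cov, size=n)
        for n in (80, 120, 100)]                 # three hypothetical studies
print(pooled_partial_corr(data).round(2))        # off-diagonals near -0.4, -0.3, 0
```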


2014 ◽  
Vol 70 (10) ◽  
pp. 2719-2729 ◽  
Author(s):  
David L. Akey ◽  
W. Clay Brown ◽  
Jamie R. Konwerski ◽  
Craig M. Ogata ◽  
Janet L. Smith

An emergent challenge in macromolecular crystallography is the identification of the substructure from native anomalous scatterers in crystals that diffract to low to moderate resolution. Increasing the multiplicity of data sets has been shown to make previously intractable phasing problems solvable and to increase the useful resolution in model refinement. For the West Nile virus nonstructural protein 1 (NS1), a protein of novel fold, the utility of exceptionally high-multiplicity data is demonstrated both in solving the crystal structure from the anomalous scattering of the native S atoms and in extending the useful limits of resolution during refinement. A high-multiplicity data set from 18 crystals had sufficient anomalous signal to identify sulfur sites using data to 5.2 Å resolution. Phases calculated to 4.5 Å resolution and extended to 3.0 Å resolution were of sufficient quality for automated building of three-quarters of the final structure. Crystallographic refinement to 2.9 Å resolution proceeded smoothly, justifying the increase in resolution made possible by combining multiple data sets. The identification and exclusion of data from outlier crystals is shown to result in more robust substructure determination.

