In Search of Union Wage Concessions in Standard Data Sets

1986 ◽  
Vol 25 (2) ◽  
pp. 131-145 ◽  
Author(s):  
RICHARD B. FREEMAN
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Yance Feng ◽  
Lei M. Li

Background: Normalization of RNA-seq data aims at identifying biological expression differentiation between samples by removing the effects of unwanted confounding factors. Explicitly or implicitly, the justification of normalization requires a set of housekeeping genes. However, the existence of housekeeping genes common to a very large collection of samples, especially under a wide range of conditions, is questionable.

Results: We propose to carry out pairwise normalization with respect to multiple references selected from representative samples. The pairwise intermediates are then integrated based on a linear model that adjusts for the reference effects. Motivated by the notion of housekeeping genes and their statistical counterparts, we adopt robust least trimmed squares regression for the pairwise normalization. The proposed method (MUREN) is compared with other existing tools on several standard data sets. The assessment of normalization quality emphasizes preserving possible asymmetric differentiation, whose biological significance is exemplified by a single-cell data set of the cell cycle. MUREN is implemented as an R package. The code, under the GPL-3 license, is available on GitHub (github.com/hippo-yf/MUREN) and on conda (anaconda.org/hippo-yf/r-muren).

Conclusions: MUREN performs RNA-seq normalization using a two-step statistical regression induced from a general principle. We propose that the densities of pairwise differentiations be used to evaluate the goodness of normalization. MUREN adjusts the mode of differentiation toward zero while preserving the skewness due to biological asymmetric differentiation. Moreover, by robustly integrating pre-normalized counts with respect to multiple references, MUREN is immune to individual outlier samples.
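To illustrate the pairwise step, the sketch below fits a robust, least-trimmed-squares-style regression between the log-counts of a sample and a reference and uses the fitted shift as the normalization factor, then combines shifts from several references. It is a minimal Python sketch, not the authors' R implementation; the iterative trimming scheme, the trimming fraction, and all function names are assumptions made for illustration.

```python
import numpy as np

def lts_shift(ref_log, sample_log, trim=0.25, n_iter=20):
    """Estimate a log-scale shift between a sample and a reference with an
    iterative, least-trimmed-squares-style fit. Genes with the largest
    residuals (putative differentially expressed genes) are discarded, so
    the surviving 'housekeeping-like' genes drive the estimate."""
    d = sample_log - ref_log             # per-gene log-ratios
    keep = np.ones_like(d, dtype=bool)   # start from all genes
    shift = 0.0
    for _ in range(n_iter):
        shift = d[keep].mean()                   # fit on the kept genes
        resid = np.abs(d - shift)
        cutoff = np.quantile(resid, 1.0 - trim)  # trim the worst genes
        keep = resid <= cutoff
    return shift

def pairwise_normalize(log_counts, ref_indices):
    """Normalize every sample against several references and take a robust
    summary of the shifts (a stand-in for MUREN's linear-model integration
    of reference effects)."""
    n_samples = log_counts.shape[1]
    shifts = np.zeros(n_samples)
    for j in range(n_samples):
        per_ref = [lts_shift(log_counts[:, r], log_counts[:, j])
                   for r in ref_indices]
        shifts[j] = np.median(per_ref)   # robust to an outlier reference
    return log_counts - shifts           # broadcast over genes

# Toy usage: 1000 genes, 4 samples of log2(count + 1) with artificial shifts.
rng = np.random.default_rng(0)
logc = rng.normal(5, 2, size=(1000, 4)) + np.array([0.0, 0.5, 1.0, -0.3])
normed = pairwise_normalize(logc, ref_indices=[0, 1])
print(normed.mean(axis=0))
```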


Geophysics ◽  
2014 ◽  
Vol 79 (4) ◽  
pp. B135-B149 ◽  
Author(s):  
Elahe P. Ardakani ◽  
Douglas R. Schmitt ◽  
Todd D. Bown

The Devonian Grosmont Formation in northeastern Alberta, Canada, is the world's largest accumulation of heavy oil in carbonate rock, with estimated bitumen in place of [Formula: see text]. Much of the reservoir unconformably subcrops beneath Cretaceous sediments along an eroded surface, modified by karstification, known as the Sub-Mannville Unconformity (SMU). We reanalyzed and integrated legacy seismic data sets acquired in the mid-1980s to investigate the structure of this surface. Standard data processing was carried out, supplemented by more modern approaches to noise reduction. The interpretation of these reprocessed data resulted in key structural maps above and below the SMU. These seismic maps revealed substantially more detail than those constructed solely on the basis of well-log data; in fact, the use of only well-log information would likely result in erroneous interpretations. Although features smaller than about 40 m in radius cannot be easily discerned at the SMU because of wavefield and data-sampling limits, the data did reveal the existence of a roughly east–west-trending ridge-valley system. A minor northeast–southwest-trending linear valley was also apparent. These observations are all consistent with the model of a karsted/eroded carbonate surface. Comparison of the maps for the differing horizons further suggested that deeper horizons may influence the structure of the SMU and even the overlying Mesozoic formations. This suggested that some displacements due to karst cavity collapse or minor faulting within the Grosmont occurred during or after deposition of the younger Mesozoic sediments on top of the Grosmont surface.


2012 ◽  
Vol 3 (1) ◽  
pp. 41 ◽  
Author(s):  
A. Bergamasco ◽  
A. Benetazzo ◽  
S. Carniel ◽  
F.M. Falcieri ◽  
T. Minuzzo ◽  
...  

In order to monitor, describe and understand the marine environment, many research institutions are involved in the acquisition and distribution of ocean data, both from observations and models. Scientists from these institutions spend too much time looking for, accessing, and reformatting data: they need better tools and procedures to make the science they do more efficient. The U.S. Integrated Ocean Observing System (US-IOOS) is working on making large amounts of distributed data usable in an easy and efficient way. It is essentially a network of scientists, technicians and technologies designed to acquire, collect and disseminate observational and modelled data from investigations of coastal and oceanic marine regions to researchers, stakeholders and policy makers. In order to be successful, this effort requires standard data protocols, web services and standards-based tools. Starting from the US-IOOS approach, which is being adopted throughout much of the oceanographic and meteorological sectors, we describe here the CNR-ISMAR Venice experience in setting up a national Italian IOOS framework using the THREDDS (THematic Real-time Environmental Distributed Data Services) Data Server (TDS), a middleware designed to fill the gap between data providers and data users. The TDS provides services that allow data users to find the data sets pertaining to their scientific needs and to access, visualize and use them in an easy way, without downloading files to the local workspace. To achieve this, the data providers must make their data available in a standard form that the TDS understands, and with sufficient metadata to allow the data to be read and searched in a standard way. The core idea is then to utilize a Common Data Model (CDM), a unified conceptual model that describes different datatypes within each dataset. More specifically, Unidata (www.unidata.ucar.edu) has developed CDM specifications for many of the different kinds of data used by the scientific community, such as grids, profiles, time series and swath data. These datatypes are aligned with the NetCDF Climate and Forecast (CF) Metadata Conventions and with the Climate Science Modelling Language (CSML); CF-compliant NetCDF files and GRIB files can be read directly with no modification, while non-compliant files can be modified to meet the appropriate metadata requirements. Once standardized in the CDM, the TDS makes datasets available through a series of web services such as OPeNDAP or the Open Geospatial Consortium Web Coverage Service (WCS), allowing data users to easily obtain small subsets from large datasets and to quickly visualize their content using tools such as GODIVA2 or the Integrated Data Viewer (IDV). In addition, an ISO metadata service is available through the TDS that can be harvested by catalogue broker services (e.g. GI-cat) to enable distributed search across federated data servers. Examples of TDS datasets can be accessed at the CNR-ISMAR Venice site http://tds.ve.ismar.cnr.it:8080/thredds/catalog.html.
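The data-user side of this workflow can be illustrated with a short Python sketch that opens a TDS dataset through its OPeNDAP endpoint and pulls only a small subset. The dataset path and variable name below are hypothetical placeholders, not actual holdings of the Venice server; the real catalogue should be browsed at the URL given in the abstract.

```python
# Minimal sketch: lazily open an OPeNDAP-served dataset from a THREDDS
# Data Server with xarray and request only a small spatio-temporal slice.
import xarray as xr

OPENDAP_URL = (
    "http://tds.ve.ismar.cnr.it:8080/thredds/dodsC/"
    "example/adriatic_model_output.nc"   # hypothetical dataset path
)

ds = xr.open_dataset(OPENDAP_URL)        # lazy open: no bulk download
print(ds)                                # CF metadata: variables, coords

# Subsetting is resolved server-side; only the requested slice is sent.
sst = ds["sea_surface_temperature"]      # hypothetical CF-named variable
subset = sst.sel(time="2012-01-15", method="nearest")
print(subset.values.shape)
```

The same URL can be used from any CF-aware client (e.g. IDV or GODIVA2, as mentioned above), which is the practical payoff of standardizing on the CDM.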


Author(s):  
Tom Dalton ◽  
Graham Kirby ◽  
Alan Dearle ◽  
Özgür Akgün ◽  
Monique Mackenzie

Background: 'Gold-standard' data to evaluate linkage algorithms are rare. Synthetic data have the advantage that all the true links are known. In the domain of population reconstruction, the ability to synthesize populations on demand, with varying characteristics, allows a linkage approach to be evaluated across a wide range of data. We have implemented ValiPop, a microsimulation model, for this purpose.

Approach: ValiPop can create many varied populations based upon sets of desired population statistics, thus allowing linkage algorithms to be evaluated across many populations rather than across a limited number of real-world 'gold-standard' data sets. Given the potential interactions between different desired population statistics, the creation of a population does not necessarily imply that all desired population statistics have been met. To address this, we have developed a statistical approach, based on a generalized linear model, to validate the adherence of created populations to the desired statistics. This talk will discuss the benefits of synthetic data for data linkage evaluation and the approach to validating created populations, and present the results of some initial linkage experiments using our synthetic data.
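The abstract does not specify the GLM, so the sketch below shows only one plausible reading: an intercept-only Poisson GLM with the desired (target) counts entering as an offset, so that a rate ratio near 1 indicates the created population adheres to that desired statistic. The variable names and the example numbers are invented for illustration and are not ValiPop output.

```python
import numpy as np
import statsmodels.api as sm

# Observed event counts per age band in a synthesized population, and the
# counts implied by the desired population statistics (invented numbers).
observed = np.array([120, 340, 310, 180, 60])
expected = np.array([110, 350, 300, 190, 65])

# Intercept-only Poisson GLM with log(expected) as an offset:
# exp(intercept) is the overall observed/expected rate ratio.
X = np.ones((len(observed), 1))
model = sm.GLM(observed, X, family=sm.families.Poisson(),
               offset=np.log(expected))
result = model.fit()

rate_ratio = np.exp(result.params[0])
print(f"rate ratio = {rate_ratio:.3f}, p = {result.pvalues[0]:.3f}")
# A rate ratio close to 1 with a large p-value is consistent with the
# created population adhering to this desired statistic.
```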


Author(s):  
Kenneth Perrine ◽  
Alireza Khani ◽  
Natalia Ruiz-Juri

General Transit Feed Specification (GTFS) files have gained wide acceptance among transit agencies, which now provide them for most major metropolitan areas. The public availability of GTFS data, combined with the convenience of a standard data representation, has promoted the development of numerous applications for their use. Whereas most of these tools focus on the analysis and utilization of public transportation systems, GTFS data sets are also extremely relevant for the development of multimodal planning models. The use of GTFS data for integrated modeling requires creating a graph of the public transportation network that is consistent with the roadway network. This is not trivial, given the limitations of the networks often used for regional planning models and the complexity of the roadway system. A proposed open-source algorithm matches GTFS geographic information to existing planning networks and is also relevant for real-time in-field applications. The methodology is based on maintaining a set of candidate paths connecting successive geographic points. Examples of implementations using traditional planning networks and a network built from crowdsourced OpenStreetMap data are presented. The versatility of the methodology is also demonstrated by using it to match GPS points from a navigation system. Experimental results suggest that this approach is highly successful even when the underlying roadway network is not complete. The proposed methodology is a promising step toward using novel and inexpensive data sources to facilitate, and eventually transform, the way that transportation models are built and validated.
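A candidate-path matcher of this general kind can be sketched as a beam search: successive GTFS shape points are snapped to a few nearby network nodes, and a small set of candidate paths is carried forward, scored by snapping distance plus network distance. The sketch below is an illustrative reconstruction in Python, not the authors' open-source implementation; the toy graph, coordinates, beam width, and scoring weights are all assumptions.

```python
import math
import networkx as nx

def euclid(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def nearest_nodes(G, point, k=3):
    """The k network nodes closest to a GTFS shape point."""
    return sorted(G.nodes, key=lambda n: euclid(G.nodes[n]["xy"], point))[:k]

def match_points(G, points, beam=5, snap_weight=1.0):
    """Beam search over candidate paths connecting successive points."""
    hyps = [(snap_weight * euclid(G.nodes[n]["xy"], points[0]), [n])
            for n in nearest_nodes(G, points[0])]
    for p in points[1:]:
        new_hyps = []
        for n in nearest_nodes(G, p):
            snap = snap_weight * euclid(G.nodes[n]["xy"], p)
            for cost, path in hyps:
                try:
                    hop = nx.shortest_path_length(G, path[-1], n,
                                                  weight="length")
                except nx.NetworkXNoPath:
                    continue
                new_hyps.append((cost + hop + snap, path + [n]))
        hyps = sorted(new_hyps, key=lambda h: h[0])[:beam]  # prune the beam
    return min(hyps, key=lambda h: h[0])  # (total cost, matched node path)

# Toy roadway network: a 3x3 grid with coordinates and edge lengths.
G = nx.grid_2d_graph(3, 3)
for n in G.nodes:
    G.nodes[n]["xy"] = (float(n[0]), float(n[1]))
for u, v in G.edges:
    G[u][v]["length"] = euclid(G.nodes[u]["xy"], G.nodes[v]["xy"])

cost, path = match_points(G, [(0.1, 0.0), (1.1, 0.1), (2.0, 1.1)])
print(cost, path)
```

Keeping several hypotheses per point, rather than greedily snapping to the single nearest node, is what lets this style of matcher recover when the network is incomplete or the points are noisy.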


Author(s):  
Vasile Rus ◽  
Mihai Lintean ◽  
Arthur C. Graesser ◽  
Danielle S. McNamara

Assessing the semantic similarity between two texts is a central task in many applications, including summarization, intelligent tutoring systems, and software testing. The similarity of texts is typically explored at the level of words, sentences, paragraphs, and documents. Similarity can be defined quantitatively (e.g., as a normalized value between 0 and 1) or qualitatively, in the form of semantic relations such as elaboration, entailment, or paraphrase. In this chapter, we focus first on quantitatively measuring, and then on qualitatively detecting, sentence-level text-to-text semantic relations. A generic approach that relies on word-to-word similarity measures is presented, along with experiments and results obtained with various instantiations of the approach. In addition, we provide results of a study on the role of weighting in Latent Semantic Analysis, a statistical technique for assessing the similarity of texts. The results were obtained on two data sets: a standard data set for sentence-level paraphrase detection and a data set from an intelligent tutoring system.
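The word-to-word aggregation idea can be shown in a few lines: each word is greedily paired with its most similar counterpart in the other sentence, the best scores are averaged, and the two directions are combined into a symmetric score in [0, 1]. The tiny embedding table below is invented for illustration; actual instantiations would plug in WordNet-based measures, LSA vectors, or other word-level similarity functions.

```python
import math

TOY_VECTORS = {                      # hypothetical 3-d word vectors
    "students": [0.9, 0.1, 0.2],  "pupils":    [0.85, 0.15, 0.25],
    "solve":    [0.2, 0.9, 0.1],  "answer":    [0.25, 0.80, 0.20],
    "problems": [0.1, 0.3, 0.9],  "questions": [0.15, 0.35, 0.85],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def word_sim(w1, w2):
    """Word-to-word similarity: exact match, else vector cosine, else 0."""
    if w1 == w2:
        return 1.0
    if w1 in TOY_VECTORS and w2 in TOY_VECTORS:
        return cosine(TOY_VECTORS[w1], TOY_VECTORS[w2])
    return 0.0

def directional(sent_a, sent_b):
    # average of each word's best match in the other sentence
    return sum(max(word_sim(a, b) for b in sent_b) for a in sent_a) / len(sent_a)

def sentence_similarity(text_a, text_b):
    a, b = text_a.lower().split(), text_b.lower().split()
    return 0.5 * (directional(a, b) + directional(b, a))  # symmetric score

print(sentence_similarity("students solve problems", "pupils answer questions"))
```

A threshold on such a quantitative score is one simple way to turn it into a qualitative decision such as paraphrase versus non-paraphrase.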


2021 ◽  
Vol 12 (1) ◽  
pp. 20-36
Author(s):  
Vidhyalakshmi M. K. ◽  
Poovammal E. ◽  
Masilamani V. ◽  
Vidhyacharan Bhaskar

Video surveillance plays a key role in finding an individual in the event of a criminal offense, and many studies have sought to make the surveillance process autonomous. In this context, person re-identification techniques help to identify people across cameras. Surveillance cameras are normally mounted well above head height, a position from which it is difficult to identify a person, and video surveillance is a real-time application. Images of the same individual may vary appreciably across the fields of view of different cameras. Color content in an image remains an important cue for identifying a person. Under the assumption that clothing color remains unchanged over the period of surveillance, a method based on significant colors and their spatial correspondence in the image is proposed. The method is applied to standard data sets such as GRID, PRID450s, and VIPER. The results are plotted as cumulative matching characteristic (CMC) curves and compared with other methods. The approach is computationally efficient and delivers better performance.
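The general shape of a color-based re-identification pipeline and its CMC evaluation can be sketched as follows: each person image is reduced to a spatially aware color descriptor, probes are ranked against a gallery by histogram intersection, and the cumulative matching characteristic is computed from the rank of the true match. The descriptor used here (per-stripe hue histograms) is a generic stand-in, not the paper's significant-colors method, and the data are random toy images.

```python
import numpy as np

def stripe_hue_histograms(hue_image, n_stripes=6, n_bins=16):
    """Stack per-stripe hue histograms; stripes add coarse spatial layout."""
    stripes = np.array_split(hue_image, n_stripes, axis=0)
    hists = [np.histogram(s, bins=n_bins, range=(0.0, 1.0), density=True)[0]
             for s in stripes]
    return np.concatenate(hists)

def intersection(h1, h2):
    return np.minimum(h1, h2).sum()

def cmc(probe_feats, gallery_feats, probe_ids, gallery_ids, max_rank=10):
    """Fraction of probes whose true match appears within each rank."""
    hits = np.zeros(max_rank)
    for f, pid in zip(probe_feats, probe_ids):
        scores = np.array([intersection(f, g) for g in gallery_feats])
        ranked = np.array(gallery_ids)[np.argsort(-scores)]
        first = np.where(ranked == pid)[0][0]
        if first < max_rank:
            hits[first:] += 1          # counts accumulate toward rank-k
    return hits / len(probe_ids)

# Toy example: random "hue images" in [0, 1] for 5 identities seen by two
# cameras; real use would take the hue channel of detected person crops.
rng = np.random.default_rng(1)
ids = list(range(5))
cam_a = [rng.random((128, 48)) for _ in ids]
cam_b = [np.clip(img + rng.normal(0, 0.05, img.shape), 0, 1) for img in cam_a]

gallery = [stripe_hue_histograms(img) for img in cam_a]
probes = [stripe_hue_histograms(img) for img in cam_b]
print(cmc(probes, gallery, ids, ids, max_rank=5))
```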


2002 ◽  
Vol 30 (4) ◽  
pp. 3-13
Author(s):  
Christie Wood ◽  
Duane Pennebaker

In order to provide a framework for standardised data reporting in the Australian non-government community mental health sector, a Data Dictionary and standard data set were developed. This process was informed by Advisory Committee and key stakeholder consultation, a review of local and national minimum data sets, and stakeholder validation. The result is a Data Dictionary containing 37 items and a standard data set containing 15 items. These items conform to the Australian Institute of Health & Welfare's (AIHW) standards and address Leginski et al.'s (1989) decision standards.

