MhcVizPipe: A Quality Control Software for Rapid Assessment of Small- to Large-Scale Immunopeptidome Data Sets

AbstractAlthough ultrahigh-throughput RNA-Sequencing has become the dominant technology for genome-wide transcriptional profiling, the vast majority of RNA-Seq studies typically profile only tens of samples, and most analytical pipelines are optimized for these smaller studies. However, projects are generating ever-larger data sets comprising RNA-Seq data from hundreds or thousands of samples, often collected at multiple centers and from diverse tissues. These complex data sets present significant analytical challenges due to batch and tissue effects, but provide the opportunity to revisit the assumptions and methods that we use to preprocess, normalize, and filter RNA-Seq data – critical first steps for any subsequent analysis. We find analysis of large RNA-Seq data sets requires both careful quality control and that one account for sparsity due to the heterogeneity intrinsic in multi-group studies. An R package instantiating our method for large-scale RNA-Seq normalization and preprocessing, YARN, is available at bioconductor.org/packages/yarn.HighlightsOverview of assumptions used in preprocessing and normalizationPipeline for preprocessing, quality control, and normalization of large heterogeneous dataA Bioconductor package for the YARN pipeline and easy manipulation of count dataPreprocessed GTEx data set using the YARN pipeline available as a resource

Download Full-text

Quality Control of Data in a Large-Scale Cancer Register Program

Methods of Information in Medicine ◽

10.1055/s-0038-1636339 ◽

1966 ◽

Vol 05 (02) ◽

pp. 67-74 ◽

Cited By ~ 5

Author(s):

W. I. Lourie ◽

W. Haenszeland

Keyword(s):

United States ◽

Quality Control ◽

Cancer Patients ◽

Large Scale ◽

The United States ◽

Newly Diagnosed ◽

Related Data ◽

Cancer Register ◽

Independent Review ◽

End Results

Quality control of data collected in the United States by the Cancer End Results Program utilizing punchcards prepared by participating registries in accordance with a Uniform Punchcard Code is discussed. Existing arrangements decentralize responsibility for editing and related data processing to the local registries with centralization of tabulating and statistical services in the End Results Section, National Cancer Institute. The most recent deck of punchcards represented over 600,000 cancer patients; approximately 50,000 newly diagnosed cases are added annually.Mechanical editing and inspection of punchcards and field audits are the principal tools for quality control. Mechanical editing of the punchcards includes testing for blank entries and detection of in-admissable or inconsistent codes. Highly improbable codes are subjected to special scrutiny. Field audits include the drawing of a 1-10 percent random sample of punchcards submitted by a registry; the charts are .then reabstracted and recoded by a NCI staff member and differences between the punchcard and the results of independent review are noted.

Download Full-text

Faculty Opinions recommendation of Comparative assessment of large-scale data sets of protein-protein interactions.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.1006598.82257 ◽

2002 ◽

Author(s):

Rob Russell

Keyword(s):

Protein Interactions ◽

Large Scale ◽

Comparative Assessment ◽

Data Sets ◽

Protein Protein Interactions ◽

Large Scale Data ◽

Scale Data ◽

Large Scale Data Sets

Download Full-text

Numerical Simulation of Hydrogen Transport at a Crack Tip in a Pipeline Steel

Volume 3: Materials and Joining; Pipeline Automation and Measurement; Risk and Reliability, Parts A and B ◽

10.1115/ipc2006-10207 ◽

2006 ◽

Cited By ~ 3

Author(s):

Mohsen Dadfarnia ◽

Petros Sofronis ◽

Ian Robertson ◽

Brian P. Somerday ◽

Govindarajan Muralidharan ◽

...

Keyword(s):

Distribution System ◽

Large Scale ◽

Pipeline Steel ◽

Rapid Assessment ◽

Crack Tip ◽

Hydrogen Transport ◽

Natural Gas Pipeline ◽

Central Production ◽

Materials Used ◽

Pressure Hydrogen

The technology of large scale hydrogen transmission from central production facilities to refueling stations and stationary power sites is at present undeveloped. Among the problems which confront the implementation of this technology is the deleterious effect of hydrogen on structural material properties, in particular at gas pressure of 1000 psi which is the desirable transmission pressure suggested by economic studies for efficient transport. In this paper, a hydrogen transport methodology for the calculation of hydrogen accumulation ahead of a crack tip in a pipeline steel is outlined. The approach accounts for stress-driven transient diffusion of hydrogen and trapping at microstructural defects whose density may evolve dynamically with deformation. The results are used to discuss a lifetime prediction methodology for failure of materials used for pipelines and welds exposed to high-pressure hydrogen. Development of such predictive capability and strategies is of paramount importance to the rapid assessment of using the natural-gas pipeline distribution system for hydrogen transport and of the susceptibility of new alloys tailored for use in the new hydrogen economy.

Download Full-text

Galaxy spin direction distribution in HST and SDSS show similar large-scale asymmetry

Publications of the Astronomical Society of Australia ◽

10.1017/pasa.2020.46 ◽

2020 ◽

Vol 37 ◽

Author(s):

Lior Shamir

Keyword(s):

Large Scale ◽

Spiral Galaxies ◽

Hubble Space Telescope ◽

Gravitational Interaction ◽

Large Data ◽

Sloan Digital Sky Survey ◽

Data Sets ◽

Dipole Axis ◽

Data Set ◽

The Asymmetry

Abstract Several recent observations using large data sets of galaxies showed non-random distribution of the spin directions of spiral galaxies, even when the galaxies are too far from each other to have gravitational interaction. Here, a data set of $\sim8.7\cdot10^3$ spiral galaxies imaged by Hubble Space Telescope (HST) is used to test and profile a possible asymmetry between galaxy spin directions. The asymmetry between galaxies with opposite spin directions is compared to the asymmetry of galaxies from the Sloan Digital Sky Survey. The two data sets contain different galaxies at different redshift ranges, and each data set was annotated using a different annotation method. The results show that both data sets show a similar asymmetry in the COSMOS field, which is covered by both telescopes. Fitting the asymmetry of the galaxies to cosine dependence shows a dipole axis with probabilities of $\sim2.8\sigma$ and $\sim7.38\sigma$ in HST and SDSS, respectively. The most likely dipole axis identified in the HST galaxies is at $(\alpha=78^{\rm o},\delta=47^{\rm o})$ and is well within the $1\sigma$ error range compared to the location of the most likely dipole axis in the SDSS galaxies with $z>0.15$ , identified at $(\alpha=71^{\rm o},\delta=61^{\rm o})$ .

Download Full-text

Accelerating In-Transit Co-Processing for Scientific Simulations Using Region-Based Data-Driven Analysis

Algorithms ◽

10.3390/a14050154 ◽

2021 ◽

Vol 14 (5) ◽

pp. 154

Author(s):

Marcus Walldén ◽

Masao Okita ◽

Fumihiko Ino ◽

Dimitris Drikakis ◽

Ioannis Kokkinakis

Keyword(s):

Large Scale ◽

Data Driven ◽

Data Sets ◽

Output Constraints ◽

Data Driven Approach ◽

Scientific Simulations ◽

Multiple Metrics ◽

In Transit ◽

Multiple Compression ◽

Large Scale Simulations

Increasing processing capabilities and input/output constraints of supercomputers have increased the use of co-processing approaches, i.e., visualizing and analyzing data sets of simulations on the fly. We present a method that evaluates the importance of different regions of simulation data and a data-driven approach that uses the proposed method to accelerate in-transit co-processing of large-scale simulations. We use the importance metrics to simultaneously employ multiple compression methods on different data regions to accelerate the in-transit co-processing. Our approach strives to adaptively compress data on the fly and uses load balancing to counteract memory imbalances. We demonstrate the method’s efficiency through a fluid mechanics application, a Richtmyer–Meshkov instability simulation, showing how to accelerate the in-transit co-processing of simulations. The results show that the proposed method expeditiously can identify regions of interest, even when using multiple metrics. Our approach achieved a speedup of 1.29× in a lossless scenario. The data decompression time was sped up by 2× compared to using a single compression method uniformly.

Download Full-text

Compartment and hub definitions tune metabolic networks for metabolomic interpretations

GigaScience ◽

10.1093/gigascience/giz137 ◽

2020 ◽

Vol 9 (1) ◽

Cited By ~ 3

Author(s):

T Cameron Waller ◽

Jordan A Berg ◽

Alexander Lex ◽

Brian E Chapman ◽

Jared Rutter

Keyword(s):

Large Scale ◽

Metabolic Networks ◽

Shortest Paths ◽

Large Data ◽

Differential Regulation ◽

Large Data Sets ◽

Data Sets ◽

Human Metabolism ◽

Experimental Conditions ◽

Systemic Model

Abstract Background Metabolic networks represent all chemical reactions that occur between molecular metabolites in an organism’s cells. They offer biological context in which to integrate, analyze, and interpret omic measurements, but their large scale and extensive connectivity present unique challenges. While it is practical to simplify these networks by placing constraints on compartments and hubs, it is unclear how these simplifications alter the structure of metabolic networks and the interpretation of metabolomic experiments. Results We curated and adapted the latest systemic model of human metabolism and developed customizable tools to define metabolic networks with and without compartmentalization in subcellular organelles and with or without inclusion of prolific metabolite hubs. Compartmentalization made networks larger, less dense, and more modular, whereas hubs made networks larger, more dense, and less modular. When present, these hubs also dominated shortest paths in the network, yet their exclusion exposed the subtler prominence of other metabolites that are typically more relevant to metabolomic experiments. We applied the non-compartmental network without metabolite hubs in a retrospective, exploratory analysis of metabolomic measurements from 5 studies on human tissues. Network clusters identified individual reactions that might experience differential regulation between experimental conditions, several of which were not apparent in the original publications. Conclusions Exclusion of specific metabolite hubs exposes modularity in both compartmental and non-compartmental metabolic networks, improving detection of relevant clusters in omic measurements. Better computational detection of metabolic network clusters in large data sets has potential to identify differential regulation of individual genes, transcripts, and proteins.

Download Full-text

High Performance Computational Analysis of Large-scale Proteome Data Sets to Assess Incremental Contribution to Coverage of the Human Genome

Journal of Proteome Research ◽

10.1021/pr400181q ◽

2013 ◽

Vol 12 (6) ◽

pp. 2858-2868 ◽

Cited By ~ 29

Author(s):

Nadin Neuhauser ◽

Nagarjuna Nagaraj ◽

Peter McHardy ◽

Sara Zanivan ◽

Richard Scheltema ◽

...

Keyword(s):

Human Genome ◽

High Performance ◽

Large Scale ◽

Computational Analysis ◽

Data Sets

Download Full-text

The Research of Production Process Statistical Steady-State Based on the Hypothesis Test

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.314-316.2433 ◽

2011 ◽

Vol 314-316 ◽

pp. 2433-2438

Author(s):

Wei Zhi Wang

Keyword(s):

Quality Control ◽

Steady State ◽

Production Process ◽

Quantitative Method ◽

Large Scale ◽

Hypothesis Test ◽

Influence Factors ◽

Normal Operation ◽

Scale Production ◽

Large Scale Production

By only applying a after the event exam in the quality control of the batch production is not enough to meet the needs of modern large-scale production. To a certain extent, modern quality control is a dynamic process of the steady-state judge and adjustment. A simple and reliable steady-state judge rule and method is the premise to guarantee the normal operation. This paper provides a quantitative method to evaluate production process steady-state by analyzing influence factors based on mathematical statistics. The method is both suitable for simple production process and complex production process with sub-processes.

Download Full-text

ON GENERATING DIGITAL ELEVATION MODELS FROM LIDAR DATA – RESOLUTION VERSUS ACCURACY AND TOPOGRAPHIC WETNESS INDEX INDICES IN NORTHERN PEATLANDS

Geodesy and Cartography ◽

10.3846/20296991.2012.702983 ◽

2012 ◽

Vol 38 (2) ◽

pp. 57-69 ◽

Cited By ~ 12

Author(s):

Abdulghani Hasan ◽

Petter Pilesjö ◽

Andreas Persson

Keyword(s):

Large Scale ◽

Drainage Area ◽

Data Sets ◽

Topographic Wetness Index ◽

Absolute Deviation ◽

Digital Elevation ◽

Elevation Data ◽

Scale Modelling ◽

Data Points ◽

Emission Modelling

Global change and GHG emission modelling are dependent on accurate wetness estimations for predictions of e.g. methane emissions. This study aims to quantify how the slope, drainage area and the TWI vary with the resolution of DEMs for a flat peatland area. Six DEMs with spatial resolutions from 0.5 to 90 m were interpolated with four different search radiuses. The relationship between accuracy of the DEM and the slope was tested. The LiDAR elevation data was divided into two data sets. The number of data points facilitated an evaluation dataset with data points not more than 10 mm away from the cell centre points in the interpolation dataset. The DEM was evaluated using a quantile-quantile test and the normalized median absolute deviation. It showed independence of the resolution when using the same search radius. The accuracy of the estimated elevation for different slopes was tested using the 0.5 meter DEM and it showed a higher deviation from evaluation data for steep areas. The slope estimations between resolutions showed differences with values that exceeded 50%. Drainage areas were tested for three resolutions, with coinciding evaluation points. The model ability to generate drainage area at each resolution was tested by pair wise comparison of three data subsets and showed differences of more than 50% in 25% of the evaluated points. The results show that consideration of DEM resolution is a necessity for the use of slope, drainage area and TWI data in large scale modelling.

Download Full-text