Sample Data Sets and Matlab Code

2017 ◽  
Vol 4 (1) ◽  
pp. 41-52
Author(s):  
Dedy Loebis

This paper presents the results of work undertaken to develop and test contrasting data analysis approaches for the detection of bursts/leaks and other anomalies within water supply systems at district meter area (DMA) level. This was conducted for Yorkshire Water (YW) sample data sets from the Harrogate and Dales (H&D), Yorkshire, United Kingdom water supply network as part of Project NEPTUNE (EP/E003192/1). A data analysis system based on Kalman filtering and a statistical approach has been developed and applied to the analysis of flow and pressure data. The system was tested on one data set and has shown the ability to detect anomalies in flow and pressure patterns by correlating with other information. It will be shown that the Kalman/statistical approach is promising for detecting subtle changes and higher-frequency features; because it has the potential to identify precursor features and smaller leaks, it could be useful for monitoring the development of leaks prior to a large-volume burst event.
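
The Kalman/statistical pairing can be conveyed with a short sketch. The MATLAB fragment below is a minimal, hypothetical illustration rather than the NEPTUNE system itself: a scalar random-walk Kalman filter tracks a flow series and a statistical gate flags samples whose innovation exceeds three standard deviations. All data and noise settings are invented.

% Minimal sketch: scalar Kalman filter tracking a DMA flow signal,
% flagging samples whose innovation exceeds a 3-sigma gate.
% 'flow' is a hypothetical vector of 15-min flow readings (L/s).
flow = 30 + randn(1, 500);            % placeholder data
flow(250) = 45;                       % injected anomaly (simulated burst)

q = 0.01;  r = 1.0;                   % assumed process / measurement noise
xhat = flow(1);  P = 1;               % state estimate and its variance
anomaly = false(size(flow));

for k = 2:numel(flow)
    % Predict (random-walk model: x_k = x_{k-1} + w, w ~ N(0,q))
    P = P + q;
    % Innovation and its variance
    nu = flow(k) - xhat;
    S  = P + r;
    if abs(nu) > 3*sqrt(S)            % statistical gate on the residual
        anomaly(k) = true;            % candidate burst/leak feature
    end
    % Update
    K    = P / S;
    xhat = xhat + K*nu;
    P    = (1 - K)*P;
end
fprintf('%d anomalous samples flagged\n', nnz(anomaly));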


1980 ◽  
Vol 102 (4) ◽  
pp. 1006-1012 ◽  
Author(s):  
M. E. Crawford ◽  
W. M. Kays ◽  
R. J. Moffat

Experimental research into heat transfer from full-coverage film-cooled surfaces with three injection geometries was described in Part I. This part has two objectives. The first is to present a simple numerical procedure for simulation of heat transfer with full-coverage film cooling. The second is to present some of the Stanton number data used in Part I of the paper. The data chosen for presentation are the low-Reynolds-number, heated-starting-length data for the three injection geometries with five-diameter hole spacing. Sample data sets with high blowing ratio and with ten-diameter hole spacing are also presented. The numerical procedure has been successfully applied to the Stanton number data sets.
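
For orientation on the quantities involved, the sketch below evaluates the classic flat-plate correlation St = 0.0287 Re_x^(-0.2) Pr^(-0.4) together with the standard unheated-starting-length correction. This is textbook background on uncooled-plate Stanton numbers, not the paper's full-coverage film-cooling procedure, and the flow conditions are assumed purely for illustration.

% Minimal sketch: baseline turbulent Stanton number on a flat plate
% with a starting-length correction. Illustrative conditions only.
U   = 10;        % free-stream velocity, m/s (assumed)
nu  = 1.6e-5;    % kinematic viscosity of air, m^2/s
Pr  = 0.7;       % Prandtl number
xi  = 0.1;       % unheated starting length, m (assumed)
x   = linspace(0.11, 1.0, 50);            % streamwise positions, m

Rex = U .* x ./ nu;
St0 = 0.0287 .* Rex.^(-0.2) .* Pr^(-0.4); % smooth-plate correlation
St  = St0 .* (1 - (xi ./ x).^(9/10)).^(-1/9);  % starting-length correction
plot(Rex, St); xlabel('Re_x'); ylabel('St');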


2015 ◽  
Vol 639 ◽  
pp. 21-30 ◽  
Author(s):  
Stephan Purr ◽  
Josef Meinhardt ◽  
Arnulf Lipp ◽  
Axel Werner ◽  
Martin Ostermair ◽  
...  

Data-driven quality evaluation in the stamping process of car body parts is promising because dependencies in the process have not yet been sufficiently researched. However, applying data mining methods to the process in stamping plants would require a large number of sample data sets. Acquiring these data remains a major challenge today, because the necessary data are inadequately measured, recorded or stored. Thus, the preconditions for sample data acquisition must first be created before any correlations can be investigated. In addition, the process conditions change over time due to wear mechanisms, so results do not remain valid and constant data acquisition is required. In this publication, the current situation in stamping plants regarding process robustness will first be discussed and the need for data-driven methods will be shown. Subsequently, the state of technology regarding the collection of sample data sets for quality analysis in the production of car body parts will be reviewed. At the end of this work, an overview will be provided of how this data collection was implemented at BMW and what potential can be expected.
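
As a schematic illustration of why constant data acquisition matters under wear-induced drift, the MATLAB sketch below applies an EWMA control chart to a simulated quality feature that drifts slowly over successive parts. This is not BMW's implementation; the feature, drift rate and chart parameters are all invented.

% Minimal sketch: EWMA monitoring of a stamped-part quality feature
% (e.g., a measured draw-in value) to detect slow drift from tool wear.
x      = 5 + 0.002*(1:1000) + 0.05*randn(1,1000);  % simulated slow drift
lambda = 0.1;                                      % EWMA smoothing weight
mu0    = 5;  sigma = 0.05;                         % assumed in-control stats
L      = 3;                                        % control-limit width

z = zeros(size(x));  z(1) = mu0;
for k = 2:numel(x)
    z(k) = lambda*x(k) + (1-lambda)*z(k-1);        % exponentially weighted mean
end
% Steady-state EWMA control limits
UCL = mu0 + L*sigma*sqrt(lambda/(2-lambda));
LCL = mu0 - L*sigma*sqrt(lambda/(2-lambda));
firstAlarm = find(z > UCL | z < LCL, 1);
fprintf('Drift signalled at part %d\n', firstAlarm);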


2018 ◽  
Author(s):  
Arghavan Bahadorinejad ◽  
Ivan Ivanov ◽  
Johanna W Lampe ◽  
Meredith AJ Hullar ◽  
Robert S Chapkin ◽  
...  

Abstract. We propose a Bayesian method for the classification of 16S rRNA metagenomic profiles of bacterial abundance by introducing a Poisson-Dirichlet-Multinomial hierarchical model for the sequencing data, constructing a prior distribution from sample data, calculating the posterior distribution in closed form, and deriving an Optimal Bayesian Classifier (OBC). The proposed algorithm is compared to state-of-the-art classification methods for 16S rRNA metagenomic data, including Random Forests and the phylogeny-based Metaphyl algorithm, for varying sample size, classification difficulty, and dimensionality (number of OTUs), using both synthetic and real metagenomic data sets. The results demonstrate that the proposed OBC method, with either noninformative or constructed priors, is competitive with or superior to the other methods. In particular, when the ratio of sample size to dimensionality is small, the proposed method can vastly outperform the others.

Author summary. Recent studies have highlighted the interplay between host genetics, gut microbes, and colorectal tumor initiation/progression. The characterization of microbial communities using metagenomic profiling has therefore received renewed interest. In this paper, we propose a method for classification, i.e., prediction of different outcomes, based on 16S rRNA metagenomic data. The proposed method employs a Bayesian approach, which is suitable for data sets with a small ratio of the number of available instances to the dimensionality. Results using both synthetic and real metagenomic data show that the proposed method can outperform other state-of-the-art metagenomic classification algorithms.
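
A minimal sketch of the core computation, under a deliberate simplification of the paper's model: a plain Dirichlet-multinomial marginal likelihood with a flat Dirichlet prior updated by training counts (standing in for the constructed priors and the Poisson layer). The class with the larger log-marginal likelihood wins. All counts below are hypothetical.

% Dirichlet-multinomial log-marginal likelihood as a two-class
% classifier for OTU count vectors (multinomial coefficient omitted,
% since it is identical for both classes given the same test vector).
logDM = @(x, a) gammaln(sum(a)) - gammaln(sum(x) + sum(a)) ...
              + sum(gammaln(x + a) - gammaln(a));

% Hypothetical training counts (rows = samples, cols = OTUs)
train0 = [10 2 1; 12 3 0];     % class 0
train1 = [1 9 11; 0 8 13];     % class 1
a0 = 1 + sum(train0, 1);       % posterior Dirichlet parameters, class 0
a1 = 1 + sum(train1, 1);       % posterior Dirichlet parameters, class 1

xtest = [9 2 2];               % new profile to classify
label = logDM(xtest, a1) > logDM(xtest, a0);   % 1 if class 1 more likely
fprintf('Predicted class: %d\n', label);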


2020 ◽  
Vol 7 (1) ◽  
Author(s):  
Ting Hon ◽  
Kristin Mars ◽  
Greg Young ◽  
Yu-Chih Tsai ◽  
Joseph W. Karalius ◽  
...  

Abstract. The PacBio® HiFi sequencing method yields highly accurate long-read sequencing datasets with read lengths averaging 10–25 kb and accuracies greater than 99.5%. These accurate long reads can be used to improve results for complex applications such as single-nucleotide and structural variant detection, genome assembly, assembly of difficult polyploid or highly repetitive genomes, and assembly of metagenomes. Currently, sample data sets are needed both to evaluate the benefits of these long accurate reads and to develop bioinformatic tools, including genome assemblers, variant callers, and haplotyping algorithms. We present deep-coverage HiFi datasets for five complex samples including the two inbred model genomes Mus musculus and Zea mays, as well as two complex genomes, octoploid Fragaria × ananassa and the diploid anuran Rana muscosa. Additionally, we release sequence data from a mock metagenome community. The datasets reported here can be used without restriction to develop new algorithms and explore complex genome structure and evolution. Data were generated on the PacBio Sequel II System.
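
For orientation, read accuracies map to Phred quality scores via Q = -10*log10(1 - accuracy), so the quoted 99.5% corresponds to roughly Q23. The trivial MATLAB snippet below performs that conversion.

% Converting read accuracy to Phred quality: Q = -10*log10(1 - accuracy)
acc = [0.990 0.995 0.999];
Q   = -10 .* log10(1 - acc);
fprintf('accuracy %.3f -> Q%.0f\n', [acc; Q]);   % e.g. 0.995 -> Q23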


2019 ◽  
Vol 15 ◽  
pp. 117693431984907 ◽  
Author(s):  
Tomáš Farkaš ◽  
Jozef Sitarčík ◽  
Broňa Brejová ◽  
Mária Lucká

Computing similarity between 2 nucleotide sequences is one of the fundamental problems in bioinformatics. Current methods are based mainly on 2 major approaches: (1) sequence alignment, which is computationally expensive, and (2) faster, but less accurate, alignment-free methods based on various statistical summaries, for example, short word counts. We propose a new distance measure based on mathematical transforms from the domain of signal processing. To tolerate large-scale rearrangements in the sequences, the transform is computed across sliding windows. We compare our method on several data sets with current state-of-the-art alignment-free methods. Our method compares favorably in terms of accuracy and outperforms other methods in running time and memory requirements. In addition, it is massively scalable up to dozens of processing units without loss of performance due to communication overhead. Source files and sample data are available at https://bitbucket.org/fiitstubioinfo/swspm/src
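
The paper's exact transform is not reproduced here, but the MATLAB sketch below (saved as specDist.m) conveys the general idea under assumed choices: encode each sequence numerically, take FFT magnitude spectra over sliding windows, and let each window of one sequence match its closest window in the other, which is what tolerates large-scale rearrangements. Window size, step and encoding are all placeholders.

% Schematic transform-based, alignment-free distance between two
% nucleotide sequences s1, s2 (character vectors), window length w,
% window step 'step'. Not the exact method of the paper.
function d = specDist(s1, s2, w, step)
    F1 = windowSpectra(s1, w, step);
    F2 = windowSpectra(s2, w, step);
    % Pairwise squared Euclidean distances between window spectra,
    % via the expansion |u-v|^2 = |u|^2 + |v|^2 - 2u.v
    D = sqrt(max(sum(F1.^2,2) + sum(F2.^2,2).' - 2*(F1*F2.'), 0));
    % Each window of s1 scores against its best match in s2
    d = mean(min(D, [], 2));
end

function F = windowSpectra(s, w, step)
    num = double(lower(s));                      % crude numeric encoding
    starts = 1:step:(numel(num) - w + 1);
    F = zeros(numel(starts), w);
    for i = 1:numel(starts)
        seg = num(starts(i):starts(i)+w-1);
        F(i,:) = abs(fft(seg - mean(seg)));      % magnitude spectrum
    end
end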


1980 ◽  
Vol 47 (2) ◽  
pp. 351-357 ◽  
Author(s):  
Charles D. Dziuban ◽  
Edwin C. Shirkey

Version Two of the Kaiser Measures of Sampling Adequacy was derived for a typical six-concept Semantic Differential. The overall indices indicated that both concept and total correlation matrices would lead to comparable decisions regarding the psychometric quality of the sample data sets. The individual measures, however, showed considerable variability for some scales, placing several in a range which would make them suspect psychometrically. It was recommended that the concept of psychometric adequacy be used in determining the efficacy of one's Semantic Differential data for factor analytic procedures.
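
As background, the sketch below computes a Kaiser-style Measure of Sampling Adequacy, overall and per variable, from a correlation matrix R using the anti-image correlations derived from inv(R). This is the standard textbook formulation, not necessarily the "Version Two" computation examined in the paper.

% Kaiser MSA from a correlation matrix R (p x p, positive definite).
function [msa, msaPerVar] = kaiserMSA(R)
    S = inv(R);
    Q = -S ./ sqrt(diag(S) * diag(S)');   % anti-image (partial) correlations
    mask = ~eye(size(R));                 % exclude the diagonal
    r2 = (R.^2) .* mask;
    q2 = (Q.^2) .* mask;
    msa       = sum(r2(:)) / (sum(r2(:)) + sum(q2(:)));   % overall index
    msaPerVar = sum(r2, 1) ./ (sum(r2, 1) + sum(q2, 1));  % per-variable MSA
end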


2008 ◽  
Vol 8 (2) ◽  
pp. 6409-6436 ◽  
Author(s):  
C. A. Cantrell

Abstract. The representation of data, whether geophysical observations, numerical model output or laboratory results, by a best-fit straight line is a routine practice in the geosciences and other fields. While the literature is full of detailed analyses of procedures for fitting straight lines to values with uncertainties, a surprising number of scientists blindly use the standard least-squares method, such as found on calculators and in spreadsheet programs, which assumes no uncertainties in the x values. Here, the available procedures for estimating the best-fit straight line to data are reviewed, including those applicable when uncertainties are present in both the x and y variables. Representative methods presented in the literature for bivariate weighted fits are compared using several sample data sets, and guidance is given as to when the somewhat more involved iterative methods are required, and when the standard least-squares procedure can be expected to be satisfactory. A spreadsheet-based template that employs one method for bivariate fitting is made available.
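
One representative iterative bivariate method is the York (1966) fit. The MATLAB sketch below, with invented data and uncorrelated x and y errors assumed, iterates the York slope equations from an ordinary least-squares starting guess and prints both results for comparison.

% York-style straight-line fit with uncertainties in both x and y,
% compared against ordinary least squares (which assumes error-free x).
x  = [1 2 3 4 5];              sx = 0.2*ones(1,5);   % x values, sigmas
y  = [2.1 3.9 6.2 7.8 10.1];   sy = 0.3*ones(1,5);   % y values, sigmas

wx = 1 ./ sx.^2;   wy = 1 ./ sy.^2;         % weights
p  = polyfit(x, y, 1);  b = p(1);           % OLS slope as starting guess
for it = 1:50
    W    = wx .* wy ./ (wx + b^2 .* wy);    % combined weights (r = 0)
    xbar = sum(W.*x)/sum(W);  ybar = sum(W.*y)/sum(W);
    U = x - xbar;  V = y - ybar;
    beta = W .* (U./wy + b.*V./wx);
    bNew = sum(W.*beta.*V) / sum(W.*beta.*U);
    if abs(bNew - b) < 1e-12, b = bNew; break; end
    b = bNew;
end
a = ybar - b*xbar;                          % intercept
fprintf('OLS : y = %.3f x + %.3f\n', p(1), p(2));
fprintf('York: y = %.3f x + %.3f\n', b, a);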

