CNValidator: validating somatic copy-number inference

2018, Vol 35 (15), pp. 2660-2662
Author(s):  
Lucian P Smith ◽  
Jon A Yamato ◽  
Mary K Kuhner

Abstract
Motivation: CNValidator assesses the quality of somatic copy-number calls based on the coherency of haplotypes across multiple samples from the same individual. It is applicable to any copy-number calling algorithm that makes calls independently for each sample. This test is useful for assessing the accuracy of copy-number calls, as well as for choosing among alternative copy-number algorithms or tuning parameter values.
Results: On a dataset of somatic samples from individuals with Barrett’s Esophagus, CNValidator provided feedback on the correctness of sample ploidy calls and also detected data quality issues.
Availability and implementation: CNValidator is available on GitHub at https://github.com/kuhnerlab/CNValidator.
Supplementary information: Supplementary data are available at Bioinformatics online.
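
The coherency test lends itself to a compact illustration. Below is a minimal, hypothetical sketch of the idea (not CNValidator's actual code): within a haplotype block, every sample's allelic-imbalance calls should imply the same phasing, so each sample should agree with a reference sample at essentially all shared SNPs or at essentially none. All names and the 95% cutoff are illustrative assumptions.

```python
# Hypothetical sketch of haplotype-coherency checking across samples.
def coherent(block_calls):
    """block_calls: list of per-sample dicts mapping SNP id -> which
    allele ('A' or 'B') carries the higher copy number in that sample."""
    snps = set.intersection(*(set(c) for c in block_calls))
    votes = []
    for sample in block_calls:
        # Orientation of this sample relative to the first one:
        # fraction of shared SNPs where both pick the same allele.
        agree = sum(sample[s] == block_calls[0][s] for s in snps)
        votes.append(agree / len(snps))
    # Coherent if each sample matches the reference orientation almost
    # perfectly or almost never (i.e., the implied phasings agree).
    return all(v > 0.95 or v < 0.05 for v in votes)

calls = [
    {"rs1": "A", "rs2": "A", "rs3": "A"},   # sample 1
    {"rs1": "A", "rs2": "A", "rs3": "A"},   # sample 2: same phase
    {"rs1": "B", "rs2": "B", "rs3": "B"},   # sample 3: mirrored phase
]
print(coherent(calls))  # True: all samples imply one consistent phasing
```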

Author(s):  
Christopher D O’Connor ◽  
John Ng ◽  
Dallas Hill ◽  
Tyler Frederick

Policing is increasingly being shaped by data collection and analysis. However, we still know little about the quality of the data police services acquire and utilize. Drawing on a survey of analysts from across Canada, this article examines several data collection, analysis, and quality issues. We argue that as we move towards an era of big data policing it is imperative that police services pay more attention to the quality of the data they collect. We conclude by discussing the implications of ignoring data quality issues and the need to develop a more robust research culture in policing.


Author(s):  
Syed Mustafa Ali ◽  
Farah Naureen ◽  
Arif Noor ◽  
Maged Kamel N. Boulos ◽  
Javariya Aamir ◽  
...  

Background: Increasingly, healthcare organizations are using technology for the efficient management of data. The aim of this study was to compare the data quality of digital records with that of the corresponding paper-based records, using a data quality assessment framework.
Methodology: We conducted a desk review of paper-based and digital records at six enrolled TB clinics over the study period from April 2016 to July 2016. We entered all data fields of the patient treatment (TB01) card into a spreadsheet-based template to undertake a field-to-field comparison of the fields shared between the TB01 card and the digital record.
Findings: A total of 117 TB01 cards were prepared at the six enrolled sites, but only 50% of the records (n=59 of 117 TB01 cards) were digitized. There were 1,239 comparable data fields, of which 65% (n=803) matched correctly between paper-based and digital records; the remaining 35% (n=436) had anomalies in either the paper-based or the digital record. On average, there were 1.9 data quality issues per digital patient record, compared with 2.1 issues per paper-based record. Based on the analysis of valid data quality issues, there were more data quality issues in paper-based records (n=123) than in digital records (n=110).
Conclusion: There were fewer data quality issues in digital records than in the corresponding paper-based records. Greater use of mobile data capture and continued use of the data quality assessment framework can deliver more meaningful information for decision making.
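
The field-to-field comparison described above is straightforward to reproduce. Here is a minimal sketch in Python/pandas, assuming each record set is loaded as a DataFrame indexed by a patient identifier with identically named shared fields; the tiny inline data and all field names are synthetic stand-ins, not the study's records.

```python
# Sketch of a field-to-field comparison between paper and digital records.
import pandas as pd

paper = pd.DataFrame(
    {"sex": ["M", "F", "F"], "age": ["34", "51", "29"], "site": ["P", "EP", "P"]},
    index=pd.Index(["pt01", "pt02", "pt03"], name="patient_id"),
)
digital = pd.DataFrame(
    {"sex": ["M", "F", "M"], "age": ["34", "51", "29"], "site": ["P", "P", "P"]},
    index=pd.Index(["pt01", "pt02", "pt03"], name="patient_id"),
)

# Restrict to shared fields and shared patients, then compare cell by cell.
shared = paper.columns.intersection(digital.columns)
ids = paper.index.intersection(digital.index)
mismatch = paper.loc[ids, shared].fillna("").ne(digital.loc[ids, shared].fillna(""))

print(f"matched fields:   {int((~mismatch).values.sum())} of {mismatch.size}")
print(f"anomalous fields: {int(mismatch.values.sum())} ({mismatch.values.mean():.0%})")
print("mean issues per record:", float(mismatch.sum(axis=1).mean()))
```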


2020, Vol 36 (9), pp. 2934-2935
Author(s):  
Yi Zheng ◽  
Fangqing Zhao

Abstract
Summary: Circular RNAs (circRNAs) have been shown to have unique compositions and splicing events distinct from those of canonical mRNAs. However, no visualization tool has been designed for exploring the complex splicing patterns of circRNA transcriptomes. Here, we present CIRI-vis, a Java command-line tool for quantifying and visualizing circRNAs by integrating the alignments and junctions of circular transcripts. CIRI-vis can be used to visualize the internal structure and isoform abundance of circRNAs and to compare circRNA transcriptomes across multiple samples.
Availability and implementation: https://sourceforge.net/projects/ciri/files/CIRI-vis.
Supplementary information: Supplementary data are available at Bioinformatics online.


2019, Vol 35 (20), pp. 4063-4071
Author(s):  
Tamim Abdelaal ◽  
Thomas Höllt ◽  
Vincent van Unen ◽  
Boudewijn P F Lelieveldt ◽  
Frits Koning ◽  
...  

Abstract
Motivation: High-dimensional mass cytometry (CyTOF) allows the simultaneous measurement of multiple cellular markers at the single-cell level, providing a comprehensive view of cell compositions. However, the power of CyTOF to explore the full heterogeneity of a biological sample at the single-cell level is currently limited by the number of markers measured simultaneously on a single panel.
Results: To extend the number of markers per cell, we propose an in silico method to integrate CyTOF datasets measured using multiple panels that share a set of markers. Additionally, we present an approach to select the most informative markers from an existing CyTOF dataset to be used as a shared marker set between panels. We demonstrate the feasibility of our methods by evaluating the quality of clustering and neighborhood preservation of the integrated dataset on two public CyTOF datasets. We illustrate that by computationally extending the number of markers we can further untangle the heterogeneity of mass cytometry data, including rare cell-population detection.
Availability and implementation: Implementation is available on GitHub (https://github.com/tabdelaal/CyTOFmerge).
Supplementary information: Supplementary data are available at Bioinformatics online.
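
One common way to realize this kind of panel merging is nearest-neighbour imputation in the shared-marker space. The sketch below illustrates that general approach with synthetic arrays; it is an assumption-laden illustration of the idea, not the CyTOFmerge implementation, and the marker counts and neighbour count are arbitrary.

```python
# Sketch: impute panel-A-only markers for panel-B cells via k-NN
# in the space of markers shared by both panels.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
shared_a = rng.normal(size=(5000, 10))  # panel A cells, 10 shared markers
extra_a = rng.normal(size=(5000, 8))    # panel A cells, 8 A-only markers
shared_b = rng.normal(size=(4000, 10))  # panel B cells, same 10 shared markers

# For each panel-B cell, average the A-only markers of its 50 nearest
# panel-A neighbours in the shared-marker space.
nn = NearestNeighbors(n_neighbors=50).fit(shared_a)
_, idx = nn.kneighbors(shared_b)
imputed_extra_b = extra_a[idx].mean(axis=1)   # shape (4000, 8)

# The "extended" panel-B matrix now carries 18 markers per cell.
extended_b = np.hstack([shared_b, imputed_extra_b])
print(extended_b.shape)  # (4000, 18)
```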


Author(s):  
Liam F Spurr ◽  
Mehdi Touat ◽  
Alison M Taylor ◽  
Adrian M Dubuc ◽  
Juliann Shih ◽  
...  

Abstract
Summary: The expansion of targeted panel sequencing efforts has created opportunities for large-scale genomic analysis, but tools for copy-number quantification on panel data are lacking. We introduce ASCETS, a method for the efficient quantitation of arm- and chromosome-level copy-number changes from targeted sequencing data.
Availability and implementation: ASCETS is implemented in R and is freely available to non-commercial users on GitHub (https://github.com/beroukhim-lab/ascets), along with detailed documentation.
Supplementary information: Supplementary data are available at Bioinformatics online.
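
Arm-level calling of this kind typically reduces to a coverage-weighted average of segment-level log-ratios followed by thresholding. The sketch below illustrates that reduction in Python on synthetic segments (ASCETS itself is in R and more involved); the ±0.2 thresholds and coordinates are illustrative assumptions.

```python
# Sketch: weighted mean segment log2 ratio per arm, thresholded into calls.
import pandas as pd

segs = pd.DataFrame({
    "arm":   ["1p", "1p", "1q", "1q"],
    "start": [1_000_000, 30_000_000, 150_000_000, 200_000_000],
    "end":   [25_000_000, 120_000_000, 190_000_000, 240_000_000],
    "log2":  [0.45, 0.30, -0.50, -0.42],
})
segs["length"] = segs["end"] - segs["start"]

def arm_call(group, gain=0.2, loss=-0.2):
    # Length-weighted mean log2 ratio over the arm's segments.
    score = (group["log2"] * group["length"]).sum() / group["length"].sum()
    return "gain" if score >= gain else "loss" if score <= loss else "neutral"

print(segs.groupby("arm").apply(arm_call))
# 1p    gain
# 1q    loss
```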


2021, pp. 227797522110118
Author(s):  
Amit K. Srivastava ◽  
Rajhans Mishra

Social media platforms have become very popular among individuals and organizations. On the one hand, organizations use social media as a tool to create awareness of their products among consumers; on the other, social media data is useful for predicting national crises, election results, stock movements, and more. However, there is an ongoing debate about the quality of the data generated on social media platforms and whether it is suitable for prediction and generalization. This article discusses the relevance and quality of data obtained from social media in the context of research and development. Social media data quality issues may impact the generalizability and reproducibility of study results. The paper explores possible reasons for quality issues in the data generated on social media platforms, along with suggested measures to minimize them using the proposed social media data quality framework.


2019
Author(s):  
Pavankumar Mulgund ◽  
Raj Sharman ◽  
Priya Anand ◽  
Shashank Shekhar ◽  
Priya Karadi

BACKGROUND: In recent years, online physician-rating websites have become prominent and exert considerable influence on patients’ decisions. However, the quality of these decisions depends on the quality of the data these systems collect. Thus, there is a need to examine the various data quality issues affecting physician-rating websites.
OBJECTIVE: This study’s objective was to identify and categorize the data quality issues afflicting physician-rating websites by reviewing the literature on online patient-reported physician ratings and reviews.
METHODS: We performed a systematic literature search in the ACM Digital Library, EBSCO, Springer, PubMed, and Google Scholar. The search was limited to quantitative, qualitative, and mixed-method papers published in English from 2001 to 2020.
RESULTS: A total of 423 articles were screened. From these, 49 papers describing 18 unique data quality issues afflicting physician-rating websites were included. Using a data quality framework, we classified these issues into four categories: intrinsic, contextual, representational, and accessible. Among the papers, 53% (26/49) reported intrinsic data quality errors, 61% (30/49) highlighted contextual data quality issues, 8% (4/49) discussed representational data quality issues, and 27% (13/49) emphasized accessibility-related data quality issues. More than half of the papers discussed multiple categories of data quality issues.
CONCLUSIONS: The results of this review demonstrate the presence of a range of data quality issues. While intrinsic and contextual factors have been well researched, accessibility and representational issues warrant more attention from researchers as well as practitioners. In particular, representational factors, such as the impact of inline advertisements and the positioning of positive reviews on the first few pages, are usually deliberate and result from the business model of physician-rating websites. The impact of these factors on data quality has not been addressed adequately and requires further investigation.


2019, Vol 35 (21), pp. 4411-4412
Author(s):  
Vinhthuy Phan ◽  
Diem-Trang Pham ◽  
Caroline Melton ◽  
Adam J Ramsey ◽  
Bernie J Daigle ◽  
...  

Abstract
Summary: Although heteroplasmy has been studied extensively in animal systems, there is a lack of tools for analyzing, exploring and visualizing heteroplasmy at the genome-wide level in other taxonomic systems. We introduce icHET, a computational workflow that produces an interactive visualization facilitating the exploration, analysis and discovery of heteroplasmy across multiple genomic samples. icHET works on short reads from multiple samples from any organism with an organellar reference genome (mitochondrial or plastid) and a nuclear reference genome.
Availability and implementation: The software is available at https://github.com/vtphan/HeteroplasmyWorkflow.
Supplementary information: Supplementary data are available at Bioinformatics online.
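
At its core, genome-wide heteroplasmy detection amounts to measuring the minor-allele fraction at each site of the organellar alignment. Here is a minimal sketch of that measurement, with synthetic pileup counts standing in for real aligned reads and a 5% cutoff chosen purely for illustration (this is not icHET's code).

```python
# Sketch: per-site minor-allele fractions from pileup base counts.
pileup = {                      # site -> base counts from aligned reads
    3107:  {"A": 480, "G": 20},
    8701:  {"A": 300, "G": 290},   # strongly heteroplasmic site
    12308: {"T": 510, "C": 2},
}

for site, counts in sorted(pileup.items()):
    depth = sum(counts.values())
    minor = depth - max(counts.values())   # reads not matching the major base
    frac = minor / depth
    flag = "heteroplasmic" if frac >= 0.05 else "homoplasmic"
    print(f"pos {site}: depth={depth} minor-allele fraction={frac:.3f} {flag}")
```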


2019
Author(s):  
Gaëtan Benoit ◽  
Mahendra Mariadassou ◽  
Stéphane Robin ◽  
Sophie Schbath ◽  
Pierre Peterlongo ◽  
...  

Abstract
Motivation: De novo comparative metagenomics is one of the most straightforward ways to analyze large sets of metagenomic data. The latest methods use the fraction of shared k-mers to estimate genomic similarity between read sets. However, these methods, while extremely efficient, are still limited by the computational resources required for practical use outside of large computing facilities.
Results: We present SimkaMin, a quick comparative metagenomics tool with low disk and memory footprints, thanks to an efficient data subsampling scheme used to estimate Bray-Curtis and Jaccard dissimilarities. One billion metagenomic reads can be analyzed in under 3 minutes, with tiny memory (1.09 GB) and disk (≈0.3 GB) requirements and without altering the quality of the downstream comparative analyses, making SimkaMin a tool well suited to very large-scale metagenomic projects.
Availability and implementation: https://github.com/GATB/simka.
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
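
The key idea, subsampling k-mers so that similarity can be estimated without comparing full read sets, can be sketched briefly. The following is a hedged illustration of hash-based k-mer subsampling, not SimkaMin's implementation; the k-mer size, retention rate and toy sequences are arbitrary assumptions.

```python
# Sketch: estimate Jaccard similarity from hash-subsampled k-mer sets.
import hashlib
import random

def kmers(seq, k=21):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def sketch(seq, k=21, keep=0.01):
    """Keep roughly a `keep` fraction of k-mers, chosen by hash value,
    so the same k-mer is kept or dropped consistently across datasets."""
    limit = int(keep * 2**32)
    def h(s):
        return int.from_bytes(
            hashlib.blake2b(s.encode(), digest_size=4).digest(), "big")
    return {km for km in kmers(seq, k) if h(km) < limit}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

# Toy "metagenomes"; real inputs would be full read sets.
random.seed(1)
g1 = "".join(random.choice("ACGT") for _ in range(200_000))
g2 = g1[:150_000] + "".join(random.choice("ACGT") for _ in range(50_000))

est = jaccard(sketch(g1), sketch(g2))
true = jaccard(kmers(g1), kmers(g2))
print(f"estimated Jaccard {est:.3f} vs exact {true:.3f}")
```

Because membership in the sketch depends only on the k-mer's hash, the subsample is consistent across datasets and the estimate converges to the exact Jaccard value as the retention rate grows.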


2021
Author(s):  
Andrew McDonald

Decades of subsurface exploration and characterisation have led to the collation and storage of large volumes of well-related data, and the amount of data gathered daily continues to grow rapidly as technology and recording methods improve. With the increasing adoption of machine learning techniques in the subsurface domain, it is essential that the quality of the input data is carefully considered when working with these tools. If the input data is of poor quality, the impact on the precision and accuracy of the prediction can be significant. Consequently, this can affect key decisions about the future of a well or a field.

This study focuses on well log data, which can be highly multi-dimensional, diverse and stored in a variety of file formats. Well log data exhibits the key characteristics of Big Data: volume, variety, velocity, veracity and value. Well data can include numeric values, text values, waveform data, image arrays, maps and volumes, all of which can be indexed by time or depth in a regular or irregular way. A significant portion of time can be spent gathering data and quality checking it prior to carrying out petrophysical interpretations and applying machine learning models.

Well log data can be affected by numerous issues causing a degradation in data quality. These include missing data, ranging from single data points to entire curves; noisy data from tool-related issues; borehole washout; processing issues; incorrect environmental corrections; and mislabelled data.

Having vast quantities of data does not mean it can all be passed into a machine learning algorithm with the expectation that the resultant prediction is fit for purpose. It is essential that the most important and relevant data is passed into the model through appropriate feature selection techniques. Not only does this improve the quality of the prediction, it also reduces computational time and can provide a better understanding of how the models reach their conclusions.

This paper reviews data quality issues typically faced by petrophysicists when working with well log data and deploying machine learning models. First, an overview of machine learning and Big Data is given in relation to petrophysical applications. Secondly, data quality issues commonly faced with well log data are discussed. Thirdly, methods are suggested for dealing with data issues prior to modelling. Finally, multiple case studies are discussed covering the impacts of data quality on predictive capability.
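
To make the discussion concrete, here is a minimal sketch of two of the QC checks mentioned above (missing data and borehole washout) on a synthetic log set; the curve mnemonics (GR = gamma ray, CALI = caliper), the washout threshold and the data itself are illustrative assumptions, not a standard recipe.

```python
# Sketch: basic well-log QC on a synthetic depth-indexed DataFrame.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
depth = np.arange(1000.0, 1100.0, 0.5)
logs = pd.DataFrame({
    "GR":   rng.normal(75, 15, depth.size),
    "CALI": rng.normal(8.6, 0.2, depth.size),
    "BITSIZE": 8.5,
}, index=pd.Index(depth, name="DEPT"))
logs.loc[1020:1025, "GR"] = np.nan    # simulate a gap in the gamma-ray curve
logs.loc[1060:1070, "CALI"] = 10.2    # simulate an enlarged (washed-out) zone

# 1. Missing data: report per-curve completeness.
print((1 - logs.isna().mean()).rename("completeness"))

# 2. Borehole washout: caliper reading well beyond bit size suggests the
#    hole is enlarged and pad-contact measurements may be unreliable there.
washout = logs["CALI"] > logs["BITSIZE"] + 0.75
print(f"washout flagged over {washout.mean():.1%} of the interval")

# 3. Exclude flagged or missing samples before modelling.
clean = logs[~washout].dropna(subset=["GR"])
print(f"{len(clean)} of {len(logs)} samples retained for modelling")
```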

