FANCY: Fast Estimation of Privacy Risk in Functional Genomics Data

Mapping Intimacies ◽

10.1101/775338 ◽

2019 ◽

Cited By ~ 1

Author(s):

Gamze Gürsoy ◽

Charlotte M. Brannon ◽

Fabio C.P. Navarro ◽

Mark Gerstein

Keyword(s):

Functional Genomics ◽

Cumulative Number ◽

Rna Seq ◽

Privacy Risk ◽

Privacy Concerns ◽

Link Type ◽

Privacy Leakage ◽

Independent Test ◽

Matlab Implementation ◽

Test Sets

Abstract: Functional genomics data is becoming clinically actionable, raising privacy concerns. However, quantifying the privacy leakage by genotyping is difficult due to the heterogeneous nature of sequencing techniques. Thus, we present FANCY, a tool that rapidly estimates the number of leaking variants from raw RNA-Seq, ATAC-Seq and ChIP-Seq reads, without explicit genotyping. FANCY employs supervised regression using overall sequencing statistics as features and provides an estimate of the overall privacy risk before data release. FANCY can predict the cumulative number of leaking SNVs with an average R² of 0.95 across all independent test sets. Because accurate prediction is important even when the number of leaked variants is low, we developed a special version of the model that makes more accurate predictions when only a few variants leak. Python and MATLAB implementations of FANCY, as well as custom scripts to generate the features, can be found at https://github.com/gersteinlab/FANCY. We also provide Jupyter notebooks so that users can optimize the parameters in the regression model based on their own data. An easy-to-use webserver that takes inputs and displays results can be found at fancy.gersteinlab.org.

FANCY: fast estimation of privacy risk in functional genomics data

Bioinformatics ◽

10.1093/bioinformatics/btaa661 ◽

2020 ◽

Author(s):

Gamze Gürsoy ◽

Charlotte M Brannon ◽

Fabio C P Navarro ◽

Mark Gerstein

Keyword(s):

Functional Genomics ◽

Supplementary Information ◽

Cumulative Number ◽

Rna Seq ◽

Privacy Risk ◽

Privacy Concerns ◽

Privacy Leakage ◽

Independent Test ◽

Matlab Implementation ◽

Test Sets

Abstract Motivation: Functional genomics data are becoming clinically actionable, raising privacy concerns. However, quantifying privacy leakage via genotyping is difficult due to the heterogeneous nature of sequencing techniques. Thus, we present FANCY, a tool that rapidly estimates the number of leaking variants from raw RNA-Seq, ATAC-Seq and ChIP-Seq reads, without explicit genotyping. FANCY employs supervised regression using overall sequencing statistics as features and provides an estimate of the overall privacy risk before data release. Results: FANCY can predict the cumulative number of leaking SNVs with an average R² of 0.95 across all independent test sets. We realize the importance of accurate prediction when the number of leaked variants is low. Thus, we develop a special version of the model, which can make predictions with higher accuracy when the number of leaking variants is low. Availability and implementation: Python and MATLAB implementations of FANCY, as well as custom scripts to generate the features, can be found at https://github.com/gersteinlab/FANCY. We also provide Jupyter notebooks so that users can optimize the parameters in the regression model based on their own data. An easy-to-use webserver that takes inputs and displays results can be found at fancy.gersteinlab.org. Supplementary information: Supplementary data are available at Bioinformatics online.
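
The R² figure quoted for the independent test sets can be made concrete with a small sketch. The code below is not FANCY's model: it fits a one-feature least-squares line on made-up training data and scores it on a made-up held-out set. The feature name (sequencing depth), the numbers, and the helper names `fit_line` and `r_squared` are all assumptions chosen for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy data: one sequencing statistic per sample (x) and the
# number of leaking variants found by explicit genotyping (y).
train_depth = [1.0, 2.0, 3.0, 4.0]
train_variants = [10.0, 20.0, 30.0, 40.0]
test_depth = [5.0, 6.0]
test_variants = [50.0, 62.0]

# Fit on the training set only, then score on the held-out test set.
slope, intercept = fit_line(train_depth, train_variants)
test_pred = [slope * x + intercept for x in test_depth]
r2 = r_squared(test_variants, test_pred)
print(f"held-out R^2 = {r2:.3f}")
```

Scoring on samples that were not used for fitting is what makes the test set "independent"; an R² computed on the training data alone would tend to overstate how well the regression generalizes.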

Private information leakage from functional genomics data: Quantification with calibration experiments and reduction via data sanitization protocols

10.1101/345074 ◽

2018 ◽

Cited By ~ 4

Author(s):

Gamze Gürsoy ◽

Prashant Emani ◽

Charlotte M. Brannon ◽

Otto A. Jolanki ◽

Arif Harmanci ◽

...

Keyword(s):

Functional Genomics ◽

Private Information ◽

Information Leakage ◽

Privacy Concerns ◽

Privacy Leakage ◽

Data Sanitization ◽

Trade Offs ◽

Using Data ◽

Study Participants ◽

Reference Genomes

Abstract: The generation of functional genomics datasets is surging, as they provide insight into gene regulation and organismal phenotypes (e.g., genes upregulated in cancer). The intention of functional genomics experiments is not necessarily to study genetic variants, yet they pose privacy concerns due to their use of next-generation sequencing. Moreover, there is a great incentive to share raw reads for better analyses and general research reproducibility. Thus, we need new modes of sharing beyond traditional controlled-access models. Here, we develop a data-sanitization procedure allowing raw functional genomics reads to be shared while minimizing privacy leakage, thus enabling principled privacy-utility trade-offs. It works with traditional Illumina-based assays and newer technologies such as 10x single-cell RNA-sequencing. The procedure depends on quantifying the privacy leakage in reads by statistically linking study participants to known individuals. We carried out these linkages using data from highly accurate reference genomes and more realistic environmental samples.

Are radiomics features universally applicable to different organs?

Cancer Imaging ◽

10.1186/s40644-021-00400-y ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Seung-Hak Lee ◽

Hwan-ho Cho ◽

Junmo Kwon ◽

Ho Yun Lee ◽

Hyunjin Park

Keyword(s):

Risk Groups ◽

Careful Consideration ◽

Distinct Pattern ◽

Common Category ◽

Independent Test ◽

Organ Specific ◽

Tumor Region ◽

Organ Level ◽

Test Sets ◽

Feature Selection Approach

Abstract Background: Many studies have successfully identified radiomics features reflecting macroscale tumor features and tumor microenvironment for various organs. There is an increased interest in applying these radiomics features found in a given organ to other organs. Here, we explored whether common radiomics features could be identified over target organs in vastly different environments. Methods: Four datasets of three organs were analyzed. One radiomics model was constructed from the training set (lungs, n = 401), and was further evaluated in three independent test sets spanning three organs (lungs, n = 59; kidneys, n = 48; and brains, n = 43). Intensity histograms derived from the whole organ were compared to establish organ-level differences. We constructed a radiomics score based on selected features using training lung data over the tumor region. A total of 143 features were computed for each tumor. We adopted a feature selection approach that favored stable features that also capture survival. The radiomics score was applied to three independent test sets of lung, kidney, and brain tumors, and we evaluated whether the score could separate high- and low-risk groups. Results: Each organ showed a distinct pattern in the histogram and the derived parameters (mean and median) at the organ level. The radiomics score trained from the lung data of the tumor region included seven features, and the score was only effective in stratifying survival for other lung data, not in other organs such as the kidney and brain. Eliminating the lung-specific feature (2.5 percentile) from the radiomics score led to similar results. There were no common features between training and test sets, but a common category of features (texture category) was identified. Conclusion: Although the possibility of a generally applicable model cannot be excluded, we suggest that radiomics score models for survival were mostly specific for a given organ; applying them to other organs would require careful consideration of organ-specific properties.

Deep learning to predict subtypes of poorly differentiated lung cancer from biopsy whole slide images.

Journal of Clinical Oncology ◽

10.1200/jco.2021.39.15_suppl.8536 ◽

2021 ◽

Vol 39 (15_suppl) ◽

pp. 8536-8536

Author(s):

Gouji Toyokawa ◽

Fahdi Kanavati ◽

Seiya Momosaki ◽

Kengo Tateishi ◽

Hiroaki Takeoka ◽

...

Keyword(s):

Lung Cancer ◽

Deep Learning ◽

Learning Model ◽

Test Set ◽

Cancer Subtypes ◽

Independent Test ◽

Poorly Differentiated ◽

Test Sets ◽

Deep Learning Model ◽

Whole Slide Images

Background: Lung cancer is the leading cause of cancer-related death in many countries, and its prognosis remains unsatisfactory. Since treatment approaches differ substantially based on the subtype, such as adenocarcinoma (ADC), squamous cell carcinoma (SCC) and small cell lung cancer (SCLC), an accurate histopathological diagnosis is of great importance. However, if the specimen is solely composed of poorly differentiated cancer cells, distinguishing between histological subtypes can be difficult. The present study developed a deep learning model to classify lung cancer subtypes from whole slide images (WSIs) of transbronchial lung biopsy (TBLB) specimens, in particular with the aim of using this model to evaluate a challenging test set of indeterminate cases. Methods: Our deep learning model consisted of two separately trained components: a convolutional neural network tile classifier and a recurrent neural network tile aggregator for the WSI diagnosis. We used a training set consisting of 638 WSIs of TBLB specimens to train a deep learning model to classify lung cancer subtypes (ADC, SCC and SCLC) and non-neoplastic lesions. The training set consisted of 593 WSIs for which the diagnosis had been determined by pathologists based on the visual inspection of Hematoxylin-Eosin (HE) slides and of 45 WSIs of indeterminate cases (64 ADCs and 19 SCCs). We then evaluated the models using five independent test sets. For each test set, we computed the area under the receiver operating characteristic (ROC) curve (AUC). Results: We applied the model to an indeterminate test set of WSIs obtained from TBLB specimens that pathologists had not been able to conclusively diagnose by examining the HE-stained specimens alone. Overall, the model achieved ROC AUCs of 0.993 (confidence interval [CI] 0.971-1.0) and 0.996 (0.981-1.0) for ADC and SCC, respectively. We further evaluated the model using five independent test sets consisting of both TBLB and surgically resected lung specimens (combined total of 2490 WSIs) and obtained highly promising results with ROC AUCs ranging from 0.94 to 0.99. Conclusions: In this study, we demonstrated that a deep learning model could be trained to predict lung cancer subtypes in indeterminate TBLB specimens. The results suggest that, if deployed in clinical practice, a deep learning model capable of aiding pathologists in diagnosing indeterminate cases would be highly beneficial, as it would allow a diagnosis to be obtained sooner and reduce the costs of further investigations.
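
For readers unfamiliar with the ROC AUC metric reported above, the following is a minimal sketch of how it can be computed for one class on one test set. It uses the pairwise-ranking definition (the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, counting ties as one half). The function name `roc_auc` and the labels and scores are made up for illustration and are not taken from the study.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of positive and negative scores.

    labels: iterable of 0/1 ground-truth labels
    scores: iterable of model scores, higher meaning "more likely positive"
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical slide-level labels (1 = subtype present) and model scores.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
auc = roc_auc(labels, scores)
print(f"ROC AUC = {auc:.3f}")
```

The quadratic loop is fine for a sketch; for thousands of slides a rank-based or library implementation would be the more practical choice.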

Value of Radiomics Features From Adrenal Gland and Periadrenal Fat in CT Images for Predicting COVID-19 Prognosis

10.21203/rs.3.rs-989736/v1 ◽

2021 ◽

Author(s):

Mudan Zhang ◽

Xuntao Yin ◽

Wuchao Li ◽

Yan Zha ◽

Xianchun Zeng ◽

...

Keyword(s):

Adrenal Gland ◽

Adrenal Glands ◽

Endocrine System ◽

Ct Images ◽

Threshold Probability ◽

Test Set ◽

Clinical Model ◽

Disease Prognosis ◽

Independent Test ◽

Test Sets

Abstract Background: The endocrine system plays an important role in infectious disease prognosis. Our goal is to assess the value of radiomics features extracted from adrenal gland and periadrenal fat CT images in predicting disease prognosis in patients with COVID-19. Methods: A total of 1,325 patients (765 moderate and 560 severe patients) from three centers were enrolled in the retrospective study. We proposed a 3D cascade V-Net to automatically segment adrenal glands in onset CT images. Periadrenal fat areas were obtained using inflation operations. Then, the radiomics features were automatically extracted. Five models were established to predict the disease prognosis in patients with COVID-19: a clinical model (CM), three radiomics models (adrenal gland model [AM], periadrenal fat model [PM], fusion of adrenal gland and periadrenal fat model [FM]), and a radiomics nomogram model (RN). Data from one center (1,183 patients) were used as training and validation sets. Data from the remaining two centers (36 and 106 patients) were used as two independent test sets to evaluate the models' performance. Results: The auto-segmentation framework achieved an average Dice score of 0.79 in the test set. CM, AM, PM, FM, and RN obtained AUCs of 0.716, 0.755, 0.796, 0.828, and 0.825, respectively, in the training set, and mean AUCs of 0.754, 0.709, 0.672, 0.706 and 0.778 for the two independent test sets. Decision curve analysis showed that RN yielded more net benefit than FM and CM when the threshold probability exceeded 0.3, 0.5, and 0.1 in the validation set, independent test set 1, and independent test set 2, respectively. Conclusion: Radiomics features extracted from CT images of adrenal glands and periadrenal fat are related to disease prognosis in patients with COVID-19 and have great potential for predicting its severity.
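
The Dice score used above to evaluate the segmentation measures the overlap between a predicted mask and a reference mask: twice the number of voxels labelled by both, divided by the total number of voxels labelled in each. Below is a minimal sketch on flattened binary masks. The function name `dice_coefficient`, the convention of returning 1.0 for two empty masks, and the toy masks are assumptions for illustration; a real 3D volume would be flattened to the same form first.

```python
def dice_coefficient(pred, truth):
    """Dice overlap between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    pred_count = sum(1 for p in pred if p)
    truth_count = sum(1 for t in truth if t)
    overlap = sum(1 for p, t in zip(pred, truth) if p and t)
    if pred_count + truth_count == 0:
        # Both masks empty: treated here as perfect agreement (a convention).
        return 1.0
    return 2.0 * overlap / (pred_count + truth_count)


# Hypothetical flattened masks: 1 = voxel labelled as adrenal gland.
predicted = [1, 1, 0, 0, 1, 0]
reference = [1, 0, 0, 1, 1, 0]
dice = dice_coefficient(predicted, reference)
print(f"Dice = {dice:.3f}")
```

An "average Dice" such as the one reported is then the mean of this value over the test cases.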

Snaptron: querying and visualizing splicing across tens of thousands of RNA-seq samples

10.1101/097881 ◽

2017 ◽

Cited By ~ 2

Author(s):

Christopher Wilks ◽

Phani Gaddipati ◽

Abhinav Nellore ◽

Ben Langmead

Keyword(s):

Tissue Specificity ◽

Rna Seq ◽

Sequencing Data ◽

Transcription Start ◽

Link Type ◽

Alternative Transcription ◽

Web App ◽

Inverted Indexing ◽

Splice Junctions ◽

Splicing Patterns

Abstract: As more and larger genomics studies appear, there is a growing need for comprehensive and queryable cross-study summaries. Snaptron is a search engine for summarized RNA sequencing data with a query planner that leverages R-tree, B-tree and inverted indexing strategies to rapidly execute queries over 146 million exon-exon splice junctions from over 70,000 human RNA-seq samples. Queries can be tailored by constraining which junctions and samples to consider. Snaptron can also rank and score junctions according to tissue specificity or other criteria. Further, Snaptron can rank and score samples according to the relative frequency of different splicing patterns. We outline biological questions that can be explored with Snaptron queries, including a study of novel exons in annotated genes, of exonization of repetitive element loci, and of a recently discovered alternative transcription start site for the ALK gene. Web app and documentation are at http://snaptron.cs.jhu.edu. Source code is at https://github.com/ChristopherWilks/snaptron under the MIT license.

Flexible comparison of batch correction methods for single-cell RNA-seq using BatchBench

10.1101/2020.05.22.111211 ◽

2020 ◽

Author(s):

Ruben Chazarra-Gil ◽

Stijn van Dongen ◽

Vladimir Yu Kiselev ◽

Martin Hemberg

Keyword(s):

Single Cell ◽

Computational Methods ◽

Rna Seq ◽

Batch Effects ◽

Systematic Comparison ◽

Batch Correction ◽

Link Type ◽

Biological Signals ◽

The Cost

Abstract: As the cost of single-cell RNA-seq experiments has decreased, an increasing number of datasets are now available. Combining newly generated and publicly accessible datasets is challenging due to non-biological signals, commonly known as batch effects. Although there are several computational methods available that can remove batch effects, evaluating which method performs best is not straightforward. Here we present BatchBench (https://github.com/cellgeni/batchbench), a modular and flexible pipeline for comparing batch correction methods for single-cell RNA-seq data. We apply BatchBench to eight methods, highlight their methodological differences, and assess their performance and computational requirements through a compendium of well-studied datasets. This systematic comparison guides users in the choice of batch correction tool, and the pipeline makes it easy to evaluate other datasets.

An open RNA-Seq data analysis pipeline tutorial with an example of reprocessing data from a recent Zika virus study

F1000Research ◽

10.12688/f1000research.9110.1 ◽

2016 ◽

Vol 5 ◽

pp. 1574 ◽

Cited By ~ 19

Author(s):

Zichen Wang ◽

Avi Ma'ayan

Keyword(s):

Small Molecules ◽

Zika Virus ◽

Principal Component ◽

Global Gene Expression ◽

Brain Morphology ◽

Rna Seq ◽

Link Type ◽

Neuronal Progenitors ◽

Global Gene Expression Profiling ◽

Data Files

RNA-seq analysis is becoming a standard method for global gene expression profiling. However, running open and standard RNA-seq analysis pipelines remains challenging for non-experts due to the large size of the raw data files and the hardware requirements for running the alignment step. Here we introduce a reproducible open source RNA-seq pipeline delivered as an IPython notebook and a Docker image. The pipeline uses state-of-the-art tools and can run on various platforms with minimal configuration overhead. The pipeline enables the extraction of knowledge from typical RNA-seq studies by generating interactive principal component analysis (PCA) and hierarchical clustering (HC) plots, performing enrichment analyses against over 90 gene set libraries, and obtaining lists of small molecules that are predicted to either mimic or reverse the observed changes in mRNA expression. We apply the pipeline to a recently published RNA-seq dataset collected from human neuronal progenitors infected with the Zika virus (ZIKV). In addition to confirming the presence of cell cycle genes among the genes that are downregulated by ZIKV, our analysis uncovers significant overlap with upregulated genes that when knocked out in mice induce defects in brain morphology. This result potentially points to the molecular processes associated with the microcephaly phenotype observed in newborns from pregnant mothers infected with the virus. In addition, our analysis predicts small molecules that can either mimic or reverse the expression changes induced by ZIKV. The IPython notebook and Docker image are freely available at: http://nbviewer.jupyter.org/github/maayanlab/Zika-RNAseq-Pipeline/blob/master/Zika.ipynb and https://hub.docker.com/r/maayanlab/zika/.

Agricultural Greenhouses Detection in High-Resolution Satellite Images Based on Convolutional Neural Networks: Comparison of Faster R-CNN, YOLO v3 and SSD

Sensors ◽

10.3390/s20174938 ◽

2020 ◽

Vol 20 (17) ◽

pp. 4938

Author(s):

Min Li ◽

Zhijie Zhang ◽

Liping Lei ◽

Xiaofan Wang ◽

Xudong Guo

Keyword(s):

High Resolution ◽

Visual Inspection ◽

Satellite Images ◽

High Spatial Resolution ◽

Fine Tuning ◽

Single Shot ◽

Modern Agriculture ◽

High Resolution Satellite Images ◽

Independent Test ◽

Test Sets

Agricultural greenhouses (AGs) are important facilities for the development of modern agriculture. Accurately and effectively detecting AGs is a necessity for the strategic planning of modern agriculture. With the advent of deep learning algorithms, various convolutional neural network (CNN)-based models have been proposed for object detection with high spatial resolution images. In this paper, we conducted a comparative assessment of three well-established CNN-based models for detecting AGs: Faster R-CNN, You Only Look Once v3 (YOLO v3), and Single Shot Multi-Box Detector (SSD). Transfer learning and fine-tuning were used to train the models. Accuracy and efficiency evaluation results show that YOLO v3 achieved the best performance according to the mean average precision (mAP) and frames per second (FPS) metrics and visual inspection. The SSD demonstrated an advantage in detection speed, with an FPS twice that of Faster R-CNN, although their mAP values are close on the test set. The trained models were also applied to two independent test sets, which showed that the models have a certain degree of transferability and that higher-resolution images are important for improving accuracy. Our study suggests that YOLO v3, with its advantages in both accuracy and computational efficiency, can be applied operationally to detect AGs using high-resolution satellite images.

A direct comparison of genome alignment and transcriptome pseudoalignment

10.1101/444620 ◽

2018 ◽

Cited By ~ 5

Author(s):

Lynn Yi ◽

Lauren Liu ◽

Páll Melsted ◽

Lior Pachter

Keyword(s):

Genome Analysis ◽

Genome Alignment ◽

Rna Seq ◽

Coordinate Systems ◽

Link Type ◽

Supplementary Material

Abstract Motivation: Genome alignment of reads is the first step of most genome analysis workflows. In the case of RNA-Seq, transcriptome pseudoalignment of reads is a fast alternative to genome alignment, but the different “coordinate systems” of the genome and transcriptome have made it difficult to perform direct comparisons between the approaches. Results: We have developed tools for converting genome alignments to transcriptome pseudoalignments, and conversely, for projecting transcriptome pseudoalignments to genome alignments. Using these tools, we performed a direct comparison of genome alignment with transcriptome pseudoalignment. We find that both approaches produce similar quantifications. This means that for many applications genome alignment and transcriptome pseudoalignment are interchangeable. Availability and Implementation: bam2tcc is C++14 software for converting alignments in SAM/BAM format to transcript compatibility counts (TCCs) and is available at https://github.com/pachterlab/bam2tcc. kallisto genomebam is a user option of kallisto that outputs a sorted BAM file in genome coordinates as part of transcriptome pseudoalignment. The feature has been released with kallisto v0.44.0, and is available at https://pachterlab.github.io/kallisto/. Supplementary Material: N/A. Contact: Lior Pachter