Blind exploration of the unreferenced transcriptome reveals novel RNAs for prostate cancer diagnosis
Abstract
The broad use of RNA-sequencing technologies held the promise of improved diagnostic tools based on comprehensive transcript sets. However, mining human transcriptome data for disease biomarkers in clinical specimens is restricted by the limited power of conventional reference-based protocols, which rely on uniquely mapped reads and transcript annotations. Here, we implemented a blind, reference-free computational protocol, DE-kupl, to directly infer RNA variations of any origin, including as yet unreferenced RNAs, from high-coverage total stranded RNA-sequencing datasets derived from tissue. As a bench test, this protocol was set up to detect RNA subsequences embedded in unannotated putative long noncoding (lnc)RNAs expressed in prostate cancer tissues. Through filtering and visual inspection of 1,179 candidates, we defined 21 lncRNA probes that were further validated for robust tumor-specific expression by NanoString single-molecule RNA measurements in 144 tissue specimens. Predictive modeling yielded a restricted probe panel that detected cancer with a true positive rate above 90% in an independent dataset from The Cancer Genome Atlas. Remarkably, this clinical signature of only 9 unannotated lncRNAs largely outperformed PCA3, the only RNA biomarker approved by the Food and Drug Administration, particularly in the detection of high-risk prostate tumors. The proposed reference-free computational workflow is modular, highly sensitive and robust, and can be applied to any pathology and any clinical application.