unmapped reads
Recently Published Documents



2022 ◽  
Author(s):  
David Pellow ◽  
Abhinav Dutta ◽  
Ron Shamir

As sequencing datasets keep growing larger, time and memory efficiency of read mapping are becoming more critical. Many clever algorithms and data structures were used to develop mapping tools for next-generation sequencing, and in the last few years also for third-generation long reads. A key idea in mapping algorithms is to sketch sequences with their minimizers. Recently, syncmers were introduced as an alternative sketching method that is more robust to mutations and sequencing errors. Here we introduce parameterized syncmer schemes, and provide a theoretical analysis for multi-parameter schemes. By combining these schemes with downsampling or minimizers we can achieve any desired compression and window guarantee. We introduced syncmer schemes into the popular minimap2 and Winnowmap2 mappers. In tests on simulated and real long read data from a variety of genomes, the syncmer-based algorithms reduced unmapped reads by 20-60% at high compression while using less memory. The advantage of syncmer-based mapping was even more pronounced at lower sequence identity. At sequence identity of 65-75% and medium compression, syncmer mappers had 50-60% fewer unmapped reads, and ~10% fewer of the reads that did map were incorrectly mapped. We conclude that syncmer schemes improve mapping under higher error and mutation rates. This situation happens, for example, when the high error rate of long reads is compounded by a high mutation rate in a cancer tumor, or due to differences between strains of viruses or bacteria.
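To make the difference between the two sketching methods concrete, here is a minimal Python sketch. It assumes plain lexicographic order on s-mers and k-mers in place of the hash functions that real mappers use, and the function names (`is_syncmer`, `select_syncmers`, `select_minimizers`) are illustrative; they are not taken from minimap2 or Winnowmap2. A k-mer is treated as a syncmer when its smallest s-mer starts at one of a chosen set of offsets, so passing several offsets gives a simple multi-parameter scheme.

```python
def is_syncmer(kmer, s, positions):
    """True if the smallest s-mer of `kmer` starts at an offset in `positions`.

    Ordering is lexicographic (an assumption; tools normally hash the s-mers).
    Ties are broken by taking the leftmost smallest s-mer.
    """
    smers = [kmer[i:i + s] for i in range(len(kmer) - s + 1)]
    return smers.index(min(smers)) in positions


def select_syncmers(seq, k, s, positions):
    """Return (offset, k-mer) for every k-mer of `seq` that is a syncmer."""
    return [
        (i, seq[i:i + k])
        for i in range(len(seq) - k + 1)
        if is_syncmer(seq[i:i + k], s, positions)
    ]


def select_minimizers(seq, k, w):
    """Return (offset, k-mer) for the smallest k-mer in each window of w k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        # Leftmost smallest k-mer in the window starting at `start`.
        best = min(range(start, start + w), key=lambda j: kmers[j])
        picked.add(best)
    return [(i, kmers[i]) for i in sorted(picked)]


if __name__ == "__main__":
    seq = "ACGTTACG"
    k, s = 5, 2
    # Offsets {0, k - s} correspond to the closed-syncmer rule.
    print(select_syncmers(seq, k, s, {0, k - s}))
    print(select_minimizers(seq, 3, 2))
```

The syncmer test looks only at the k-mer itself, whereas a minimizer is chosen relative to its neighbours in the window, so a mutation nearby can change which k-mer is the minimizer without touching that k-mer.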


2021 ◽  
Author(s):  
Arun H. Patil ◽  
Marc K. Halushka ◽  
Bastian K. Fromm

The telomere-to-telomere (T2T) genome project discovered and mapped ~240 million additional base pairs of primarily telomeric and centromeric reads. Much of this sequence consisted of satellite sequences and large segmental duplications. We evaluated whether human bona fide microRNAs (miRNAs) are found in additional paralogous genomic loci and whether previously undescribed microRNAs are present in these newly sequenced regions of the human genome. New genomic regions of the T2T project spanning ~240 million bp of sequence were obtained and evaluated by blastn for the human miRNAs contained in MirGeneDB2.0 (N = 556) and miRBase (N = 1917) along with all species of MirGeneDB2.0 miRNAs (N = 10,899). Additionally, bowtie was used to compare unmapped reads from >4,000 primary cell samples to the new T2T sequence. Based on sequence and structure, no bona fide miRNAs were identified. Ninety-seven miRNAs of questionable authenticity (frequently known repeat elements) were identified from the miRBase dataset across the newly described regions of the human genome. These 97 represent only 51 miRNA families due to paralogy of highly similar miRNAs such as 24 members of the hsa-mir-548 family. Altogether, these data strongly support our having identified widely expressed bona fide miRNAs in the human genome and move us further toward the completion of human miRNA discovery.


BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Dongwei Li ◽  
Qitong Huang ◽  
Lei Huang ◽  
Jikai Wen ◽  
Jing Luo ◽  
...  

Background: RNA-Seq is a powerful tool that has been widely used in various studies. Unmapped RNA-Seq reads are usually considered useless and are discarded or ignored. Results: We develop a strategy for mining a full-length sequence from unmapped reads by combining specific reverse transcription primer design with high-throughput sequencing. In this study, we salvage 36 unmapped reads from standard RNA-Seq data and randomly select one 149 bp read as a model. Specific reverse transcription primers are designed to amplify both of its ends, followed by next-generation sequencing. We then design a statistical model based on a power-law distribution to estimate its completeness and significance, and validate the result by Sanger sequencing. The full length is 1556 bp, with insertion mutations in a microsatellite structure. Conclusion: We believe this method is a useful strategy for extracting sequence information from unmapped RNA-Seq data. It is also an alternative way to obtain the full-length sequence of an unknown cDNA.


PLoS ONE ◽  
2021 ◽  
Vol 16 (4) ◽  
pp. e0250758
Author(s):  
Matthew A. Scott ◽  
Amelia R. Woolums ◽  
Cyprianna E. Swiderski ◽  
Andy D. Perkins ◽  
Bindu Nanduri ◽  
...  

Background: Despite decades of extensive research, bovine respiratory disease (BRD) remains the most devastating disease in beef cattle production. Establishing a clinical diagnosis often relies upon visual detection of non-specific signs, leading to low diagnostic accuracy. Thus, post-weaned beef cattle are often metaphylactically administered antimicrobials at facility arrival, which poses concerns regarding antimicrobial stewardship and resistance. Additionally, there is a lack of high-quality research that addresses the gene-by-environment interactions that underlie why some cattle that develop BRD die while others survive. Therefore, it is necessary to decipher the underlying host genomic factors associated with BRD mortality versus survival to help determine BRD risk and severity. Using transcriptomic analysis of at-arrival whole blood samples from cattle that died of BRD, as compared to those that developed signs of BRD but lived (n = 3 DEAD, n = 3 ALIVE), we identified differentially expressed genes (DEGs) and associated pathways in cattle that died of BRD. Additionally, we evaluated unmapped reads, which are often overlooked within transcriptomic experiments. Results: 69 DEGs (FDR < 0.10) were identified between ALIVE and DEAD cohorts. Several DEGs possess immunological and proinflammatory function and associations with TLR4 and IL6. Biological processes, pathways, and disease phenotype associations related to type-I interferon production and antiviral defense were enriched in DEAD cattle at arrival. Unmapped reads aligned primarily to various ungulate assemblies, but failed to align to viral assemblies. Conclusion: This study further revealed increased proinflammatory immunological mechanisms in cattle that develop BRD. DEGs upregulated in DEAD cattle were predominantly involved in innate immune pathways typically associated with antiviral defense, although no viral genes were identified within unmapped reads. Our findings provide genomic targets for further analysis in cattle at highest risk of BRD, suggesting that mechanisms related to type I interferons and antiviral defense may be indicative of viral respiratory disease at arrival and contribute to eventual BRD mortality.


Pathogens ◽  
2021 ◽  
Vol 10 (4) ◽  
pp. 405
Author(s):  
Anna Matysiak ◽  
Michal Kabza ◽  
Justyna A. Karolak ◽  
Marcelina M. Jaworska ◽  
Malgorzata Rydzanicz ◽  
...  

The ocular microbiome composition has only been partially characterized. Here, we used RNA-sequencing (RNA-Seq) data to assess microbial diversity in human corneal tissue. Additionally, conjunctival swab samples were examined to characterize ocular surface microbiota. Short RNA-Seq reads, obtained from a previous transcriptome study of 50 corneal tissues, were mapped to the human reference genome GRCh38 to remove sequences of human origin. The unmapped reads were then used for taxonomic classification by comparing them with known bacterial, archaeal, and viral sequences from public databases. The components of microbial communities were identified and characterized using both conventional microbiology and polymerase chain reaction (PCR) techniques in 36 conjunctival swabs. The majority of ocular samples examined by conventional and molecular techniques showed very similar microbial taxonomic profiles, with most of the microorganisms being classified into Proteobacteria, Firmicutes, and Actinobacteria phyla. Only 50% of conjunctival samples exhibited bacterial growth. The PCR detection provided a broader overview of positive results for conjunctival materials. The RNA-Seq assessment revealed significant variability of the corneal microbial communities, including fastidious bacteria and viruses. The use of the combined techniques allowed for a comprehensive characterization of the eye microbiome’s elements, especially in aspects of microbiota diversity.
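The host-removal step described above, mapping to the human reference and keeping what does not map, can be sketched as follows. The abstract does not say which aligner or file format was used; this example assumes the alignments are available as SAM text and relies only on the SAM convention that bit 0x4 of the FLAG column marks an unmapped segment. The function names are hypothetical.

```python
UNMAPPED = 0x4  # SAM FLAG bit: segment unmapped


def unmapped_records(sam_lines):
    """Yield (name, sequence, quality) for each unmapped record in SAM text."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        if int(fields[1]) & UNMAPPED:
            # Columns: 1 = QNAME, 2 = FLAG, 10 = SEQ, 11 = QUAL
            yield fields[0], fields[9], fields[10]


def to_fastq(records):
    """Format (name, sequence, quality) tuples as FASTQ text."""
    return "".join(f"@{name}\n{seq}\n+\n{qual}\n" for name, seq, qual in records)


if __name__ == "__main__":
    example = [
        "@HD\tVN:1.6\n",
        "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n",
        "r2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGA\tFFFF\n",
    ]
    print(to_fastq(unmapped_records(example)), end="")
```

The resulting FASTQ text is what would then be passed to a taxonomic classifier; the classification step itself is outside the scope of this sketch.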


2021 ◽  
Author(s):  
Edgar A. López-Landavery ◽  
Guillermo A. Corona-Herrera ◽  
Luis E. Santos-Rojas ◽  
Nadhia M. Herrera-Castillo ◽  
Tomás H. Delgadin ◽  
...  

Arapaima gigas, one of the largest freshwater fish in the world, is suffering from high fishing pressure and habitat loss, which threaten the conservation status of its natural populations. Of great cultural importance to Amazonian people, the paiche or pirarucu A. gigas is in high demand for its meat, ornamental uses and other byproducts such as scales. Aquaculture is a feasible solution to this dilemma. However, the fact that A. gigas presents no sexual dimorphism until it is 5 years old and its long period to sexual maturity are major obstacles for brood-stock management and fingerling production. Thus, the aim of this study was to develop a molecular tool for non-invasive genotypic sexing of paiche throughout its life cycle. We collected samples from gonads, fins and gill mucus of juvenile specimens from local facilities for histological and molecular analysis. Based on the recently available genome sequence of the paiche and making use of current NGS methods, we implemented a novel approach, called Genome Differences by Unmapped Reads, to identify DNA sex markers. We found a Male-Specific Region (MSR), identified as MSR_3728, to be present only in males. Next, we designed two specific sets of primers on this region to identify genotypic sex by qPCR assays. Both primer sets, MSR_107 and MSR_129, detected males with 100% accuracy. Then we developed a duplex qPCR reaction for each primer set along with a reference gene, analyzed the melting curves and detected males by observing two distinct peaks, one for MSR_107 or MSR_129 and one for the reference, while females only presented the reference peak. The same results were obtained for gonads, fins and, interestingly, a non-invasive source from gill mucus samples. Finally, the gonads were evaluated histologically in a double-blind test, showing 100% agreement with the qPCR assay for identifying males and females. Data clearly demonstrated a novel pipeline approach for identifying DNA sex markers, followed by a quick, non-invasive, cost-effective duplex qPCR method for sexing A. gigas. These results may be valuable to efficient paiche aquaculture and conservation studies, helping to reduce the fishing pressure on its natural populations.

Highlights of the manuscript:
- Implementation of a novel approach, called Genome Differences by Unmapped Reads, to identify DNA sex markers.
- Finding of a Male-Specific Region (MSR), present only in males.
- Development of a duplex qPCR to identify genotypic sex through non-invasive sampling.


Nature ◽  
2021 ◽  
Vol 590 (7845) ◽  
pp. 290-299 ◽  
Author(s):  
Daniel Taliun ◽  
◽  
Daniel N. Harris ◽  
Michael D. Kessler ◽  
Jedidiah Carlson ◽  
...  

The Trans-Omics for Precision Medicine (TOPMed) programme seeks to elucidate the genetic architecture and biology of heart, lung, blood and sleep disorders, with the ultimate goal of improving diagnosis, treatment and prevention of these diseases. The initial phases of the programme focused on whole-genome sequencing of individuals with rich phenotypic data and diverse backgrounds. Here we describe the TOPMed goals and design as well as the available resources and early insights obtained from the sequence data. The resources include a variant browser, a genotype imputation server, and genomic and phenotypic data that are available through dbGaP (Database of Genotypes and Phenotypes). In the first 53,831 TOPMed samples, we detected more than 400 million single-nucleotide and insertion or deletion variants after alignment with the reference genome. Additional previously undescribed variants were detected through assembly of unmapped reads and customized analysis in highly variable loci. Among the more than 400 million detected variants, 97% have frequencies of less than 1% and 46% are singletons that are present in only one individual (53% among unrelated individuals). These rare variants provide insights into mutational processes and recent human evolutionary history. The extensive catalogue of genetic variation in TOPMed studies provides unique opportunities for exploring the contributions of rare and noncoding sequence variants to phenotypic variation. Furthermore, combining TOPMed haplotypes with modern imputation methods improves the power and reach of genome-wide association studies to include variants down to a frequency of approximately 0.01%.
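The frequency categories mentioned above (singletons, variants below 1%) follow from simple allele-count arithmetic, sketched here in Python. This is an illustration only: it assumes diploid autosomal sites, so 53,831 samples carry 2 × 53,831 = 107,662 alleles, and the class labels and function names are not from the paper.

```python
def allele_frequency(allele_count, n_samples, ploidy=2):
    """Allele frequency = copies of the variant allele / total alleles sampled."""
    return allele_count / (ploidy * n_samples)


def frequency_class(allele_count, n_samples, rare_threshold=0.01):
    """Label a variant as 'singleton', 'rare' (< 1% by default) or 'common'.

    A singleton is counted here as a variant allele seen exactly once;
    it is also rare, but is reported under its own label.
    """
    if allele_count == 1:
        return "singleton"
    if allele_frequency(allele_count, n_samples) < rare_threshold:
        return "rare"
    return "common"


if __name__ == "__main__":
    n = 53_831  # sample count quoted in the abstract
    for count in (1, 11, 1000, 1077):
        freq = allele_frequency(count, n)
        print(count, f"{freq:.6f}", frequency_class(count, n))
```

Under these assumptions a singleton has a frequency of about 0.001%, and roughly 11 copies among the 107,662 alleles correspond to the ~0.01% frequency quoted for imputation.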


Viruses ◽  
2020 ◽  
Vol 12 (12) ◽  
pp. 1451
Author(s):  
Laura Buggiotti ◽  
Zhangrui Cheng ◽  
D. Claire Wathes ◽  

Microbial RNA is detectable in host samples by aligning unmapped reads from RNA sequencing against taxon reference sequences, generating a score proportional to the microbial load. An RNA-Seq data analysis showed that 83.5% of leukocyte samples from six dairy herds in different EU countries contained bovine herpes virus-6 (BoHV-6). Phenotypic data on milk production, metabolic function, and disease collected during their first 50 days in milk (DIM) were compared between cows with low (1–200; n = 114) or high (201–1175; n = 24) BoHV-6 scores. There were no differences in milk production parameters, but high score cows had numerically fewer incidences of clinical mastitis (4.2% vs. 12.2%) and uterine disease (54.5% vs. 62.7%). Their metabolic status was worse, based on measurements of IGF-1 and various metabolites in blood and milk. A comparison of the global leukocyte transcriptome between high and low BoHV-6 score cows at around 14 DIM yielded 485 differentially expressed genes (DEGs). The top pathway from Gene Ontology (GO) enrichment analysis was the immune system process. Down-regulated genes in the high BoHV-6 cows included those encoding proteins involved in viral detection (DDX6 and DDX58), interferon response, and E3 ubiquitin ligase activity. This suggested that BoHV-6 may largely evade viral detection and that it does not cause clinical disease in dairy cows.
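The grouping of cows by BoHV-6 score can be written as a small Python sketch. The low/high cut-off of 200 comes from the abstract; how the score itself is computed is not described there, and treating a score of 0 or less as "not detected" is an assumption made here for illustration.

```python
LOW_MAX = 200  # scores 1-200 are "low", 201 and above are "high"


def score_group(score):
    """Assign a BoHV-6 score to a group; a score of 0 or less is assumed to mean no detection."""
    if score <= 0:
        return "not detected"
    return "low" if score <= LOW_MAX else "high"


def group_counts(scores):
    """Count how many scores fall into each group."""
    counts = {"not detected": 0, "low": 0, "high": 0}
    for score in scores:
        counts[score_group(score)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical scores, not data from the study.
    print(group_counts([0, 1, 200, 201, 1175]))
```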


Genes ◽  
2020 ◽  
Vol 11 (11) ◽  
pp. 1350
Author(s):  
Jina Kim ◽  
Joohon Sung ◽  
Kyudong Han ◽  
Wooseok Lee ◽  
Seyoung Mun ◽  
...  

The current human reference genome (GRCh38), with its superior quality, has contributed significantly to genome analysis. However, GRCh38 may still underrepresent the ethnic genome, specifically for Asians, though exactly what we are missing is still elusive. Here, we juxtaposed GRCh38 with a high-contiguity genome assembly of one Korean (AK1) to show that part of the AK1 genome is missing in GRCh38 and that the missing regions harbored ~1390 putative coding elements. Furthermore, we found that multiple populations shared certain parts of the missing genome when we analyzed the “unmapped” (to GRCh38) reads of fourteen individuals (five East-Asians, four Europeans, and five Africans), amounting to ~5.3 Mb (~0.2% of AK1) of the total genomic regions. The AK1 regions recovered from the “unmapped reads”, which were the estimated missing regions that did not exist in GRCh38, harbored candidate coding elements. We verified that most of the common (shared by ≥7 individuals) missing regions exist in human and chimpanzee DNA. Moreover, we further identified the occurrence mechanism and ethnic heterogeneity as well as the presence of the common missing regions. This study illuminates a potential advantage of using a pangenome reference and brings up the need for further investigations on the various features of regions globally missed in GRCh38.


2020 ◽  
Vol 11 (1) ◽  
Author(s):  
Karen H. Y. Wong ◽  
Walfred Ma ◽  
Chun-Yu Wei ◽  
Erh-Chan Yeh ◽  
Wan-Jia Lin ◽  
...  

The current human reference genome is predominantly derived from a single individual and it does not adequately reflect human genetic diversity. Here, we analyze 338 high-quality human assemblies of genetically divergent human populations to identify missing sequences in the human reference genome with breakpoint resolution. We identify 127,727 recurrent non-reference unique insertions spanning 18,048,877 bp, some of which disrupt exons and known regulatory elements. To improve genome annotations, we linearly integrate these sequences into the chromosomal assemblies and construct a Human Diversity Reference. Leveraging this reference, an average of 402,573 previously unmapped reads can be recovered for a given genome sequenced to ~40X coverage. Transcriptomic diversity among these non-reference sequences can also be directly assessed. We successfully map tens of thousands of previously discarded RNA-Seq reads to this reference and identify transcription evidence in 4781 gene loci, underlining the importance of these non-reference sequences in functional genomics. Our extensive datasets are important advances toward a comprehensive reference representation of global human genetic diversity.

