Sequence-specific minimizers via polar sets

2021 ◽  
Author(s):  
Hongyu Zheng ◽  
Carl Kingsford ◽  
Guillaume Marçais

Abstract Minimizers are efficient methods to sample k-mers from genomic sequences that unconditionally preserve sufficiently long matches between sequences. Well-established methods to construct efficient minimizers focus on sampling fewer k-mers on a random sequence and use universal hitting sets (sets of k-mers that appear frequently enough) to upper bound the sketch size. In contrast, the problem of sequence-specific minimizers, which is to construct efficient minimizers to sample fewer k-mers on a specific sequence such as the reference genome, is less studied. Currently, the theoretical understanding of this problem is lacking, and existing methods do not specialize well to sketch specific sequences. We propose the concept of polar sets, complementary to the existing idea of universal hitting sets. Polar sets are k-mer sets that are spread out enough on the reference, and provably specialize well to specific sequences. Link energy measures how well spread out a polar set is, and with it, the sketch size can be bounded from above and below in a theoretically sound way. This allows for direct optimization of sketch size. We propose efficient heuristics to construct polar sets, and via experiments on the human reference genome, show their practical superiority in designing efficient sequence-specific minimizers. A reference implementation and code for analyses under an open-source license are at https://github.com/kingsford-group/polarset.
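To make the notion of a minimizer sketch concrete, the following is a minimal Python sketch of a generic (w, k)-minimizer: in every window of w consecutive k-mers, the smallest k-mer under some ordering is selected. It is not the polar-set construction from the paper or from the linked repository. The function names `minimizer_positions` and `density`, the default lexicographic ordering, and the leftmost tie-breaking rule are all assumptions made for illustration.

```python
def minimizer_positions(seq, k, w, order=None):
    """Return sorted start positions of the k-mers picked by a (w, k)-minimizer.

    `order` maps a k-mer to a sortable key. The lexicographic default is an
    illustrative assumption; it is not the ordering studied in the paper.
    """
    if order is None:
        order = lambda kmer: kmer
    n_kmers = len(seq) - k + 1
    selected = set()
    # Slide a window over every run of w consecutive k-mers.
    for start in range(n_kmers - w + 1):
        window = range(start, start + w)
        # Pick the smallest k-mer in the window; ties go to the leftmost one.
        best = min(window, key=lambda i: (order(seq[i:i + k]), i))
        selected.add(best)
    return sorted(selected)


def density(seq, k, w, order=None):
    """Fraction of the sequence's k-mers that end up in the sketch."""
    n_kmers = len(seq) - k + 1
    return len(minimizer_positions(seq, k, w, order)) / n_kmers


if __name__ == "__main__":
    toy = "ACGTACGTAC"  # toy sequence, not real genomic data
    print(minimizer_positions(toy, k=3, w=3))
    print(density(toy, k=3, w=3))
```

Because every window contributes at least one selected k-mer, two consecutive selected positions are never more than w apart. Choosing an ordering that lowers `density` on one particular reference sequence is the sequence-specific problem the abstract describes.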

Interest in nucleic acid hybridization stems mainly from its great power as a tool in biological research. It is used in several quite distinct ways. Because of the high degree of specificity that they show, hybridization techniques can be used to measure the amount of one specific sequence within a very heterogeneous mixture of sequences. Measurements of 1/10^6 to 1/10^7 have been recorded. In extension of this, various properties of a specific sequence can often be studied. Secondly, because the kinetics of nucleic acid hybridization are quite well understood, it can be used to characterize both a pure sequence and a very complex mixture of sequences, like the genome of a vertebrate. Thirdly, again because of its specificity, it can be used to measure homologies between different populations of nucleic acids. Lastly, in conjunction with other techniques, it can be used as a basis for the fractionation of nucleic acid populations and the purification of specific sequences. Specific examples of these applications are given, with special reference to the organization of the genome in higher eukaryotes.


2019 ◽  
Vol 20 (S2) ◽  
Author(s):  
Bohu Pan ◽  
Rebecca Kusko ◽  
Wenming Xiao ◽  
Yuanting Zheng ◽  
Zhichao Liu ◽  
...  

BMC Genomics ◽  
2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Ran Li ◽  
Xiaomeng Tian ◽  
Peng Yang ◽  
Yingzhi Fan ◽  
Ming Li ◽  
...  

Abstract Background: The non-reference sequences (NRS) represent structural variations in the human genome with potential functional significance. However, besides the known insertions, it is currently unknown whether other types of structural variations with NRS exist. Results: Here, we compared 31 human de novo assemblies with the current reference genome to identify the NRS and their locations. We resolved the precise location of 6113 NRS adding up to 12.8 Mb. Besides 1571 insertions, we detected 3041 alternate alleles, which were defined as having less than 90% (or no) identity with the reference alleles. These alternate alleles overlapped with 1143 protein-coding genes, including a putative novel MHC haplotype. Further, we demonstrated that the alternate alleles and their flanking regions had a high content of tandem repeats, indicating that their origin was associated with tandem repeats. Conclusions: Our study detected a large number of NRS, including many alternate alleles that were previously uncharacterized. We suggest that the origin of alternate alleles was associated with tandem repeats. Our results enrich the spectrum of genetic variations in the human genome.


2020 ◽  
Vol 10 (8) ◽  
pp. 2801-2809 ◽  
Author(s):  
Tingting Zhao ◽  
Zhongqu Duan ◽  
Georgi Z. Genchev ◽  
Hui Lu

Despite continuous updates of the human reference genome, there are still hundreds of unresolved gaps which account for about 5% of the total sequence length. Given the availability of whole genome de novo assemblies, especially those derived from long-read sequencing data, gap-closing sequences can be determined. By comparing 17 de novo long-read sequencing assemblies with the human reference genome, we identified a total of 1,125 gap-closing sequences for 132 (16.9% of 783) gaps and added up to 2.2 Mb novel sequences to the human reference genome. More than 90% of the non-redundant sequences could be verified by unmapped reads from the Simons Genome Diversity Project dataset. In addition, 15.6% of the non-reference sequences were found in at least one of four non-human primate genomes. We further demonstrated that the non-redundant sequences had high content of simple repeats and satellite sequences. Moreover, 43 (32.6%) of the 132 closed gaps were shown to be polymorphic; such sequences may play an important biological role and can be useful in the investigation of human genetic diversity.


BMC Genomics ◽  
2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Harsh G. Shukla ◽  
Pushpinder Singh Bawa ◽  
Subhashini Srinivasan

2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Bohu Pan ◽  
Rebecca Kusko ◽  
Wenming Xiao ◽  
Yuanting Zheng ◽  
Zhichao Liu ◽  
...  

2020 ◽  
Vol 11 (1) ◽  
Author(s):  
Karen H. Y. Wong ◽  
Walfred Ma ◽  
Chun-Yu Wei ◽  
Erh-Chan Yeh ◽  
Wan-Jia Lin ◽  
...  

Abstract The current human reference genome is predominantly derived from a single individual and it does not adequately reflect human genetic diversity. Here, we analyze 338 high-quality human assemblies of genetically divergent human populations to identify missing sequences in the human reference genome with breakpoint resolution. We identify 127,727 recurrent non-reference unique insertions spanning 18,048,877 bp, some of which disrupt exons and known regulatory elements. To improve genome annotations, we linearly integrate these sequences into the chromosomal assemblies and construct a Human Diversity Reference. Leveraging this reference, an average of 402,573 previously unmapped reads can be recovered for a given genome sequenced to ~40X coverage. Transcriptomic diversity among these non-reference sequences can also be directly assessed. We successfully map tens of thousands of previously discarded RNA-Seq reads to this reference and identify transcription evidence in 4781 gene loci, underlining the importance of these non-reference sequences in functional genomics. Our extensive datasets are important advances toward a comprehensive reference representation of global human genetic diversity.


Author(s):  
Alaina Shumate ◽  
Steven L Salzberg

Abstract Motivation: Improvements in DNA sequencing technology and computational methods have led to a substantial increase in the creation of high-quality genome assemblies of many species. To understand the biology of these genomes, annotation of gene features and other functional elements is essential; however, for most species, only the reference genome is well-annotated. Results: One strategy to annotate new or improved genome assemblies is to map or ‘lift over’ the genes from a previously-annotated reference genome. Here we describe Liftoff, a new genome annotation lift-over tool capable of mapping genes between two assemblies of the same or closely-related species. Liftoff aligns genes from a reference genome to a target genome and finds the mapping that maximizes sequence identity while preserving the structure of each exon, transcript, and gene. We show that Liftoff can accurately map 99.9% of genes between two versions of the human reference genome with an average sequence identity >99.9%. We also show that Liftoff can map genes across species by successfully lifting over 98.3% of human protein-coding genes to a chimpanzee genome assembly with 98.2% sequence identity. Availability and Implementation: Liftoff can be installed via bioconda and PyPI. Additionally, the source code for Liftoff is available at https://github.com/agshumate/Liftoff. Supplementary information: Supplementary data are available at Bioinformatics online.


2013 ◽  
Vol 132 (8) ◽  
pp. 899-911 ◽  
Author(s):  
Geng Chen ◽  
Charles Wang ◽  
Leming Shi ◽  
Weida Tong ◽  
Xiongfei Qu ◽  
...  

2016 ◽  
Vol 34 (2_suppl) ◽  
pp. 484-484 ◽  
Author(s):  
Gurudatta Naik ◽  
Dongquan Chen ◽  
Michael Crowley ◽  
David Crossman ◽  
Katherine C. Sexton ◽  
...  

484 Background: Molecular alterations and drivers of PSCC, an orphan malignancy, remain unclear. The Cancer Genome Atlas is not studying PSCC and the Catalogue of Somatic Mutations in Cancer has performed targeted analyses only. We report WES of PSCC tumors from a group of patients (pts). Methods: Fresh-frozen macrodissected PSCC tumor tissue and adjacent normal tissue samples were procured from the Cooperative Human Tissue Network. DNA was isolated from tissue sections by phenol-chloroform extraction. Exome capture was performed with the Agilent SureSelect clinical research exome kit, and whole-exome sequencing was done on the Illumina HiSeq2500 with paired-end 100 bp chemistry. Raw sequence data in FASTQ format were aligned to the human reference genome, quantified, and compared using a local instance of Galaxy (galaxy.uabgrid.uab.edu). These data were analyzed for mutations (SNPs) with Partek Genomic Suite/Flow (PGS, Partek, St. Louis, MO) for variant calling against the human reference genome (hg19) as referenced to dbSNP, and for copy number variants (CNV) with the FishingCNV tool together with Picard tools/samtools/GATK. We focused on missense mutations and amplifications present in ≥ 2 tumor samples but not in normal samples, as they may cause upregulation of gene/protein function, which may be therapeutically actionable. Results: PSCC tumors were available from 11 patients and adjacent normal tissue from 3 patients. The 10 most common genes with > 4 missense mutations among ≥ 2 tumor samples overall were the following, in decreasing order of frequency: MUC4, HLA-DPA1, MUC16, XIRP2, SSPO, TTN, FCGBP, PABPC3, ALPK2 and MKI67. The top upstream transcriptional regulators were PIH1D3, PRDM5, PTK2, Coup-Tf and NBEAL2. When examining candidate actionable genes, recurrent missense alterations were seen in PIK3C2A and PIK3C2G. Additional analysis will study alterations in functional domains and CNV.
Conclusions: WES identified a relatively high mutation burden in PSCC with recurrent missense mutations in multiple genes, notably including the PI3K gene among potentially actionable genes. Validation of these findings and further study of downstream effects are required.

