Sequence-specific minimizers via polar sets

Abstract Minimizers are efficient methods to sample k-mers from genomic sequences that unconditionally preserve sufficiently long matches between sequences. Well-established methods to construct efficient minimizers focus on sampling fewer k-mers on a random sequence and use universal hitting sets (sets of k-mers that appear frequently enough) to upper bound the sketch size. In contrast, the problem of sequence-specific minimizers, which is to construct efficient minimizers to sample fewer k-mers on a specific sequence such as the reference genome, is less studied. Currently, the theoretical understanding of this problem is lacking, and existing methods do not specialize well to sketch specific sequences. We propose the concept of polar sets, complementary to the existing idea of universal hitting sets. Polar sets are k-mer sets that are spread out enough on the reference, and provably specialize well to specific sequences. Link energy measures how well spread out a polar set is, and with it, the sketch size can be bounded from above and below in a theoretically sound way. This allows for direct optimization of sketch size. We propose efficient heuristics to construct polar sets, and via experiments on the human reference genome, show their practical superiority in designing efficient sequence-specific minimizers. A reference implementation and code for analyses under an open-source license are at https://github.com/kingsford-group/polarset.
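For readers unfamiliar with the term, the following is a minimal sketch of a plain window minimizer, the sampling scheme the abstract builds on. It is not the polar set construction from the paper or its repository. The function names, the `order` parameter, and the default lexicographic ordering are assumptions made for illustration; a sequence-specific scheme would supply a different `order`.

```python
def minimizer_positions(seq, k, w, order=None):
    """Return sorted start positions of the k-mers picked by a (w, k) window minimizer.

    Every window of w consecutive k-mers contributes its smallest k-mer under
    `order`; ties are broken by taking the leftmost position.
    """
    if order is None:
        # Illustrative assumption: plain lexicographic order on the k-mer string.
        order = lambda kmer: kmer
    n_kmers = len(seq) - k + 1
    if n_kmers < w:
        return []  # sequence too short to contain a full window
    keys = [order(seq[i:i + k]) for i in range(n_kmers)]
    picked = set()
    for start in range(n_kmers - w + 1):
        window = range(start, start + w)
        picked.add(min(window, key=lambda i: (keys[i], i)))
    return sorted(picked)


def density(seq, k, w, order=None):
    """Fraction of k-mers sampled; a lower value means a smaller sketch."""
    n_kmers = len(seq) - k + 1
    return len(minimizer_positions(seq, k, w, order)) / n_kmers


if __name__ == "__main__":
    example = "ACGTACGTAC"
    print(minimizer_positions(example, k=3, w=4))
    print(density(example, k=3, w=4))
```

Because each window of w consecutive k-mers always contributes one sampled k-mer, two sequences that share a stretch of at least w + k - 1 bases sample the same k-mer from it. This is the match-preservation property the abstract refers to, and the sampled fraction is the quantity a sequence-specific ordering tries to reduce.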


DNA—RNA hybridization

Philosophical Transactions of the Royal Society of London Series B Biological Sciences ◽

10.1098/rstb.1975.0077 ◽

1975 ◽

Vol 272 (915) ◽

pp. 147-157 ◽

Cited By ~ 22

Keyword(s):

Nucleic Acid ◽

Complex Mixture ◽

Nucleic Acid Hybridization ◽

Biological Research ◽

Specific Sequence ◽

Different Populations ◽

Higher Eukaryotes ◽

High Degree ◽

Kinetics Of ◽

Specific Sequences

Interest in nucleic acid hybridization stems mainly from its great power as a tool in biological research. It is used in several quite distinct ways. Because of the high degree of specificity that they show, hybridization techniques can be used to measure the amount of one specific sequence within a very heterogeneous mixture of sequences. Measurements of 1/10^6 to 1/10^7 have been recorded. In extension of this, various properties of a specific sequence can often be studied. Secondly, because the kinetics of nucleic acid hybridization are quite well understood, it can be used to characterize both a pure sequence and a very complex mixture of sequences, like the genome of a vertebrate. Thirdly, again because of its specificity, it can be used to measure homologies between different populations of nucleic acids. Lastly, in conjunction with other techniques, it can be used as a basis for the fractionation of nucleic acid populations and the purification of specific sequences. Specific examples of these applications are given, with special reference to the organization of the genome in higher eukaryotes.


Similarities and differences between variants called with human reference genome HG19 or HG38

BMC Bioinformatics ◽

10.1186/s12859-019-2620-0 ◽

2019 ◽

Vol 20 (S2) ◽

Cited By ~ 4

Author(s):

Bohu Pan ◽

Rebecca Kusko ◽

Wenming Xiao ◽

Yuanting Zheng ◽

Zhichao Liu ◽

...

Keyword(s):

Reference Genome ◽

Human Reference Genome ◽

Similarities And Differences


Recovery of non-reference sequences missing from the human reference genome

BMC Genomics ◽

10.1186/s12864-019-6107-1 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 3

Author(s):

Ran Li ◽

Xiaomeng Tian ◽

Peng Yang ◽

Yingzhi Fan ◽

Ming Li ◽

...

Keyword(s):

Human Genome ◽

Tandem Repeats ◽

Reference Genome ◽

De Novo ◽

Precise Location ◽

Protein Coding ◽

Human Reference Genome ◽

Mhc Haplotype ◽

Reference Sequences ◽

Flanking Regions

Abstract Background: The non-reference sequences (NRS) represent structural variations in the human genome with potential functional significance. However, besides the known insertions, it is currently unknown whether other types of structural variations with NRS exist. Results: Here, we compared 31 human de novo assemblies with the current reference genome to identify the NRS and their location. We resolved the precise location of 6113 NRS adding up to 12.8 Mb. Besides 1571 insertions, we detected 3041 alternate alleles, which were defined as having less than 90% (or none) identity with the reference alleles. These alternate alleles overlapped with 1143 protein-coding genes including a putative novel MHC haplotype. Further, we demonstrated that the alternate alleles and their flanking regions had a high content of tandem repeats, indicating that their origin was associated with tandem repeats. Conclusions: Our study detected a large number of NRS, including many alternate alleles that were previously uncharacterized. We suggest that the origin of alternate alleles was associated with tandem repeats. Our results enrich the spectrum of genetic variations in the human genome.


Closing Human Reference Genome Gaps: Identifying and Characterizing Gap-Closing Sequences

G3 Genes|Genome|Genetics ◽

10.1534/g3.120.401280 ◽

2020 ◽

Vol 10 (8) ◽

pp. 2801-2809 ◽

Cited By ~ 1

Author(s):

Tingting Zhao ◽

Zhongqu Duan ◽

Georgi Z. Genchev ◽

Hui Lu

Keyword(s):

Reference Genome ◽

De Novo ◽

Sequence Length ◽

Sequencing Data ◽

Human Reference Genome ◽

Satellite Sequences ◽

Long Read ◽

Data Gap ◽

Simple Repeats ◽

Gap Closing

Despite continuous updates of the human reference genome, there are still hundreds of unresolved gaps which account for about 5% of the total sequence length. Given the availability of whole genome de novo assemblies, especially those derived from long-read sequencing data, gap-closing sequences can be determined. By comparing 17 de novo long-read sequencing assemblies with the human reference genome, we identified a total of 1,125 gap-closing sequences for 132 (16.9% of 783) gaps and added up to 2.2 Mb novel sequences to the human reference genome. More than 90% of the non-redundant sequences could be verified by unmapped reads from the Simons Genome Diversity Project dataset. In addition, 15.6% of the non-reference sequences were found in at least one of four non-human primate genomes. We further demonstrated that the non-redundant sequences had high content of simple repeats and satellite sequences. Moreover, 43 (32.6%) of the 132 closed gaps were shown to be polymorphic; such sequences may play an important biological role and can be useful in the investigation of human genetic diversity.


hg19KIndel: ethnicity normalized human reference genome

BMC Genomics ◽

10.1186/s12864-019-5854-3 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 3

Author(s):

Harsh G. Shukla ◽

Pushpinder Singh Bawa ◽

Subhashini Srinivasan

Keyword(s):

Reference Genome ◽

Human Reference Genome


Correction to: Similarities and differences between variants called with human reference genome HG19 or HG38

BMC Bioinformatics ◽

10.1186/s12859-019-2776-7 ◽

2019 ◽

Vol 20 (1) ◽

Author(s):

Bohu Pan ◽

Rebecca Kusko ◽

Wenming Xiao ◽

Yuanting Zheng ◽

Zhichao Liu ◽

...

Keyword(s):

Reference Genome ◽

Human Reference Genome ◽

Similarities And Differences


Towards a reference genome that captures global genetic diversity

Nature Communications ◽

10.1038/s41467-020-19311-w ◽

2020 ◽

Vol 11 (1) ◽

Author(s):

Karen H. Y. Wong ◽

Walfred Ma ◽

Chun-Yu Wei ◽

Erh-Chan Yeh ◽

Wan-Jia Lin ◽

...

Keyword(s):

Genetic Diversity ◽

Reference Genome ◽

Regulatory Elements ◽

Human Populations ◽

Single Individual ◽

Rna Seq ◽

Human Reference Genome ◽

Reference Sequences ◽

Genome Annotations ◽

Unmapped Reads

Abstract The current human reference genome is predominantly derived from a single individual and it does not adequately reflect human genetic diversity. Here, we analyze 338 high-quality human assemblies of genetically divergent human populations to identify missing sequences in the human reference genome with breakpoint resolution. We identify 127,727 recurrent non-reference unique insertions spanning 18,048,877 bp, some of which disrupt exons and known regulatory elements. To improve genome annotations, we linearly integrate these sequences into the chromosomal assemblies and construct a Human Diversity Reference. Leveraging this reference, an average of 402,573 previously unmapped reads can be recovered for a given genome sequenced to ~40X coverage. Transcriptomic diversity among these non-reference sequences can also be directly assessed. We successfully map tens of thousands of previously discarded RNA-Seq reads to this reference and identify transcription evidence in 4781 gene loci, underlining the importance of these non-reference sequences in functional genomics. Our extensive datasets are important advances toward a comprehensive reference representation of global human genetic diversity.


Liftoff: accurate mapping of gene annotations

Bioinformatics ◽

10.1093/bioinformatics/btaa1016 ◽

2020 ◽

Author(s):

Alaina Shumate ◽

Steven L Salzberg

Keyword(s):

Reference Genome ◽

Supplementary Information ◽

Closely Related Species ◽

Protein Coding ◽

Human Reference Genome ◽

Sequence Identity ◽

Gene Annotations ◽

Genome Assemblies ◽

Average Sequence Identity ◽

High Quality Genome

Abstract Motivation: Improvements in DNA sequencing technology and computational methods have led to a substantial increase in the creation of high-quality genome assemblies of many species. To understand the biology of these genomes, annotation of gene features and other functional elements is essential; however, for most species, only the reference genome is well-annotated. Results: One strategy to annotate new or improved genome assemblies is to map or ‘lift over’ the genes from a previously-annotated reference genome. Here we describe Liftoff, a new genome annotation lift-over tool capable of mapping genes between two assemblies of the same or closely-related species. Liftoff aligns genes from a reference genome to a target genome and finds the mapping that maximizes sequence identity while preserving the structure of each exon, transcript, and gene. We show that Liftoff can accurately map 99.9% of genes between two versions of the human reference genome with an average sequence identity >99.9%. We also show that Liftoff can map genes across species by successfully lifting over 98.3% of human protein-coding genes to a chimpanzee genome assembly with 98.2% sequence identity. Availability and Implementation: Liftoff can be installed via bioconda and PyPI. Additionally, the source code for Liftoff is available at https://github.com/agshumate/Liftoff. Supplementary information: Supplementary data are available at Bioinformatics online.


Comprehensively identifying and characterizing the missing gene sequences in human reference genome with integrated analytic approaches

Human Genetics ◽

10.1007/s00439-013-1300-9 ◽

2013 ◽

Vol 132 (8) ◽

pp. 899-911 ◽

Cited By ~ 11

Author(s):

Geng Chen ◽

Charles Wang ◽

Leming Shi ◽

Weida Tong ◽

Xiongfei Qu ◽

...

Keyword(s):

Reference Genome ◽

Gene Sequences ◽

Human Reference Genome ◽

Missing Gene


Whole-exome sequencing (WES) of penile squamous cell carcinoma (PSCC) to identify multiple recurrent mutations.

Journal of Clinical Oncology ◽

10.1200/jco.2016.34.2_suppl.484 ◽

2016 ◽

Vol 34 (2_suppl) ◽

pp. 484-484 ◽

Cited By ~ 1

Author(s):

Gurudatta Naik ◽

Dongquan Chen ◽

Michael Crowley ◽

David Crossman ◽

Katherine C. Sexton ◽

...

Keyword(s):

Normal Tissue ◽

Reference Genome ◽

Sequence Data ◽

The Cancer Genome Atlas ◽

Exome Capture ◽

Missense Mutations ◽

Adjacent Normal Tissue ◽

Human Reference Genome ◽

Whole Exome ◽

Recurrent Mutations

484 Background: Molecular alterations and drivers of PSCC, an orphan malignancy, remain unclear. The Cancer Genome Atlas is not studying PSCC and the Catalogue of Somatic Mutations in Cancer has performed targeted analyses only. We report WES of PSCC tumors from a group of patients (pts). Methods: Fresh-frozen macrodissected PSCC tumor tissue and adjacent normal tissue samples were procured from the Cooperative Human Tissue Network. DNA was isolated from tissue sections by phenol chloroform extraction. Exome capture was performed with the Agilent SureSelect clinical research exome kit and whole exome-seq was done on the Illumina HiSeq2500 with paired-end 100 bp chemistry. Raw sequence data in Fastq format were aligned to the human reference genome, quantified, and compared using a local instance of Galaxy (galaxy.uabgrid.uab.edu). These data were analyzed for mutations (SNPs) with Partek Genomic Suite/Flow (PGS, Partek, St. Louis, MO) for variant calling against the human reference genome (hg19) as referenced to dbSNP, and for copy number variants (CNV) with the FishingCNV tool together with Picard tools/samtools/GATK. We focused on missense mutations and amplifications among ≥ 2 tumor samples but not in normal samples as they may cause upregulation of gene/protein function, which may be therapeutically actionable. Results: PSCC tumors were available from 11 patients and adjacent normal tissue from 3 patients. The 10 most common genes with > 4 missense mutations among ≥ 2 tumor samples overall were the following in decreasing order of frequency: MUC4, HLA-DPA1, MUC16, XIRP2, SSPO, TTN, FCGBP, PABPC3, ALPK2 and MKI67. The top upstream transcriptional regulators were PIH1D3, PRDM5, PTK2, Coup-Tf and NBEAL2. When examining candidate actionable genes, recurrent missense alterations were seen in PIK3C2A and PIK3C2G. Additional analysis will study alterations in functional domains and CNV.
Conclusions: WES identified a relatively high mutation burden in PSCC with recurrent missense mutations in multiple genes, notably including the PI3K gene among potentially actionable genes. Validation of these findings and further study of downstream effects are required.
