PerFSeeB: Designing Long High-weight Single Spaced Seeds for Full Sensitivity Alignment with a Given Number of Mismatches

Author(s):  
Valeriy Titarenko ◽  
Sofya Titarenko

Abstract Background: Technical progress in computational hardware allows researchers to use new approaches for sequence alignment problems. A standard procedure is usually based on pre-alignment of short subsequences followed by proper comparison of neighbouring parts. For this purpose index files are created that store all subsequences (or numbers associated with them) and their positions within a reference sequence. Index files designed on subsequences of 32–64 symbols for a human reference genome can now be easily stored without any compression even on a budget computer. The main goal now is to choose a combination of symbols (a spaced seed) that will tolerate various mismatches between reference and given sequences. An ideal spaced seed should allow us to find all such positions (full sensitivity). By increasing the seed’s weight by one we usually reduce the number of candidate positions fourfold. At the same time longer seeds also reduce the number of signatures to be checked. Results: Several algorithms to assist seed generation are presented. The first one allows us to find all permitted spaced seeds iteratively. The results obtained with the algorithm show specific patterns of the seeds of the highest weight. Among the best seeds, there are periodic seeds with a simple relation between the period of a seed, its length and the length of a read. The second algorithm generates blocks for periodic seeds. A list of blocks is found for blocks of up to 50 symbols and up to 9 mismatches. The third algorithm uses those lists to find spaced seeds for reads of an arbitrary length. Conclusions: Lists of long high-weight spaced seeds are found and available in Supplementary Materials. The seeds are best in terms of weights compared to seeds from other papers and can usually be applied to shorter reads. Code for all algorithms is available at https://github.com/vtman/PerFSeeB.
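
The notion of full sensitivity described in this abstract can be illustrated with a small brute-force check. The sketch below is not the PerFSeeB algorithm and is not taken from its repository. It assumes a seed is written as a string of '1' (position must match) and '0' (position is ignored), that only substitutions are considered, and that the seed may be placed at any offset within the read. The function names are hypothetical, and the exhaustive search is practical only for short reads and few mismatches.

```python
from itertools import combinations


def weight(seed: str) -> int:
    """Number of '1' (must-match) positions in the seed."""
    return seed.count("1")


def seed_hits(seed: str, read_len: int, mismatches: tuple) -> bool:
    """True if at least one placement of the seed on the read
    has no mismatch under any of its '1' positions."""
    care = [i for i, c in enumerate(seed) if c == "1"]
    bad = set(mismatches)
    for shift in range(read_len - len(seed) + 1):
        if all(shift + i not in bad for i in care):
            return True
    return False


def is_fully_sensitive(seed: str, read_len: int, k: int) -> bool:
    """Brute-force test (an assumption of this sketch, not the paper's method):
    the seed must hit for every possible set of k mismatch positions.
    Checking exactly k positions also covers fewer mismatches, because
    removing a mismatch can never destroy an existing hit."""
    if len(seed) > read_len:
        return False
    return all(
        seed_hits(seed, read_len, m)
        for m in combinations(range(read_len), k)
    )


if __name__ == "__main__":
    # A contiguous seed of weight 3 on a read of 7 symbols.
    print(is_fully_sensitive("111", 7, 1))
    print(is_fully_sensitive("111", 7, 2))
```

Under these assumptions, a contiguous seed of weight 3 tolerates any single mismatch in a read of 7 symbols but not every pair of mismatches (positions 2 and 5 break all five placements), which is the kind of trade-off between weight, read length and mismatch count that the abstract refers to.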

2020 ◽  
Vol 10 (1) ◽  
Author(s):  
C. A. Samson ◽  
W. Whitford ◽  
R. G. Snell ◽  
J. C. Jacobsen ◽  
K. Lehnert

Abstract Cells obtained from human saliva are commonly used as an alternative DNA source when blood is difficult or less convenient to collect. Although DNA extracted from saliva is considered to be of comparable quality to that derived from blood, recent studies have shown that non-human contaminating DNA derived from saliva can confound whole genome sequencing results. The most concerning complication is that non-human reads align to the human reference genome using standard methodology, which can critically affect the resulting variant genotypes identified in a genome. We identified clusters of anomalous variants in saliva DNA derived reads which aligned in an atypical manner. These reads had only short regions of identity to the human reference sequence, flanked by soft clipped sequence. Sequence comparisons of atypically aligning reads from eight human saliva-derived samples to RefSeq genomes revealed the majority to be of bacterial origin (63.46%). To partition the non-human reads during the alignment step, a decoy of the most prevalent bacterial genome sequences was designed and utilised. This reduced the number of atypically aligning reads when trialled on the eight saliva-derived samples by 44% and most importantly prevented the associated anomalous genotype calls. Saliva derived DNA is often contaminated by DNA from other species. This can lead to non-human reads aligning to the human reference genome using current alignment best-practices, impacting variant identification. This problem can be diminished by using a bacterial decoy in the alignment process.


2015 ◽  
Vol 2015 ◽  
pp. 1-10 ◽  
Author(s):  
Krisztian Buza ◽  
Bartek Wilczynski ◽  
Norbert Dojer

Background. Next-generation sequencing technologies are now producing multiple times the genome size in total reads from a single experiment. This is enough information to reconstruct at least some of the differences between the individual genome studied in the experiment and the reference genome of the species. However, in most typical protocols, this information is disregarded and the reference genome is used. Results. We provide a new approach that allows researchers to reconstruct genomes very closely related to the reference genome (e.g., mutants of the same species) directly from the reads used in the experiment. Our approach applies de novo assembly software to experimental reads and so-called pseudoreads and uses the resulting contigs to generate a modified reference sequence. In this way, it can very quickly, and at no additional sequencing cost, generate a new, modified reference sequence that is closer to the actual sequenced genome and has full coverage. In this paper, we describe our approach and test its implementation called RECORD. We evaluate RECORD on both simulated and real data. We made our software publicly available on SourceForge. Conclusion. Our tests show that on closely related sequences RECORD outperforms more general assisted-assembly software.


2019 ◽  
Vol 20 (S2) ◽  
Author(s):  
Bohu Pan ◽  
Rebecca Kusko ◽  
Wenming Xiao ◽  
Yuanting Zheng ◽  
Zhichao Liu ◽  
...  

BMC Genomics ◽  
2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Ran Li ◽  
Xiaomeng Tian ◽  
Peng Yang ◽  
Yingzhi Fan ◽  
Ming Li ◽  
...  

Abstract Background The non-reference sequences (NRS) represent structural variations in the human genome with potential functional significance. However, besides the known insertions, it is currently unknown whether other types of structural variations with NRS exist. Results Here, we compared 31 human de novo assemblies with the current reference genome to identify the NRS and their location. We resolved the precise location of 6113 NRS adding up to 12.8 Mb. Besides 1571 insertions, we detected 3041 alternate alleles, which were defined as having less than 90% (or none) identity with the reference alleles. These alternate alleles overlapped with 1143 protein-coding genes including a putative novel MHC haplotype. Further, we demonstrated that the alternate alleles and their flanking regions had high content of tandem repeats, indicating that their origin was associated with tandem repeats. Conclusions Our study detected a large number of NRS including many alternate alleles which were previously uncharacterized. We suggest that the origin of alternate alleles was associated with tandem repeats. Our results enrich the spectrum of genetic variations in the human genome.


2020 ◽  
Vol 10 (8) ◽  
pp. 2801-2809 ◽  
Author(s):  
Tingting Zhao ◽  
Zhongqu Duan ◽  
Georgi Z. Genchev ◽  
Hui Lu

Despite continuous updates of the human reference genome, there are still hundreds of unresolved gaps which account for about 5% of the total sequence length. Given the availability of whole genome de novo assemblies, especially those derived from long-read sequencing data, gap-closing sequences can be determined. By comparing 17 de novo long-read sequencing assemblies with the human reference genome, we identified a total of 1,125 gap-closing sequences for 132 (16.9% of 783) gaps and added up to 2.2 Mb novel sequences to the human reference genome. More than 90% of the non-redundant sequences could be verified by unmapped reads from the Simons Genome Diversity Project dataset. In addition, 15.6% of the non-reference sequences were found in at least one of four non-human primate genomes. We further demonstrated that the non-redundant sequences had high content of simple repeats and satellite sequences. Moreover, 43 (32.6%) of the 132 closed gaps were shown to be polymorphic; such sequences may play an important biological role and can be useful in the investigation of human genetic diversity.


BMC Genomics ◽  
2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Harsh G. Shukla ◽  
Pushpinder Singh Bawa ◽  
Subhashini Srinivasan

2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Bohu Pan ◽  
Rebecca Kusko ◽  
Wenming Xiao ◽  
Yuanting Zheng ◽  
Zhichao Liu ◽  
...  

2020 ◽  
Vol 11 (1) ◽  
Author(s):  
Karen H. Y. Wong ◽  
Walfred Ma ◽  
Chun-Yu Wei ◽  
Erh-Chan Yeh ◽  
Wan-Jia Lin ◽  
...  

Abstract The current human reference genome is predominantly derived from a single individual and it does not adequately reflect human genetic diversity. Here, we analyze 338 high-quality human assemblies of genetically divergent human populations to identify missing sequences in the human reference genome with breakpoint resolution. We identify 127,727 recurrent non-reference unique insertions spanning 18,048,877 bp, some of which disrupt exons and known regulatory elements. To improve genome annotations, we linearly integrate these sequences into the chromosomal assemblies and construct a Human Diversity Reference. Leveraging this reference, an average of 402,573 previously unmapped reads can be recovered for a given genome sequenced to ~40X coverage. Transcriptomic diversity among these non-reference sequences can also be directly assessed. We successfully map tens of thousands of previously discarded RNA-Seq reads to this reference and identify transcription evidence in 4781 gene loci, underlining the importance of these non-reference sequences in functional genomics. Our extensive datasets are important advances toward a comprehensive reference representation of global human genetic diversity.


Author(s):  
Alaina Shumate ◽  
Steven L Salzberg

Abstract Motivation Improvements in DNA sequencing technology and computational methods have led to a substantial increase in the creation of high-quality genome assemblies of many species. To understand the biology of these genomes, annotation of gene features and other functional elements is essential; however for most species, only the reference genome is well-annotated. Results One strategy to annotate new or improved genome assemblies is to map or ‘lift over’ the genes from a previously-annotated reference genome. Here we describe Liftoff, a new genome annotation lift-over tool capable of mapping genes between two assemblies of the same or closely-related species. Liftoff aligns genes from a reference genome to a target genome and finds the mapping that maximizes sequence identity while preserving the structure of each exon, transcript, and gene. We show that Liftoff can accurately map 99.9% of genes between two versions of the human reference genome with an average sequence identity >99.9%. We also show that Liftoff can map genes across species by successfully lifting over 98.3% of human protein-coding genes to a chimpanzee genome assembly with 98.2% sequence identity. Availability and Implementation Liftoff can be installed via bioconda and PyPI. Additionally, the source code for Liftoff is available at https://github.com/agshumate/Liftoff Supplementary information Supplementary data are available at Bioinformatics online.


2013 ◽  
Vol 132 (8) ◽  
pp. 899-911 ◽  
Author(s):  
Geng Chen ◽  
Charles Wang ◽  
Leming Shi ◽  
Weida Tong ◽  
Xiongfei Qu ◽  
...  
