Comparative assessments of indel annotations in healthy and cancer genomes with next-generation sequencing data

2020 ◽  
Vol 13 (1) ◽  
Author(s):  
Jing Chen ◽  
Jun-tao Guo

Abstract. Background: Insertions and deletions (indels) are one of the major variation types in human genomes. Accurate annotation of indels is of paramount importance in genetic variation analysis and in the investigation of their roles in human diseases. Previous studies revealed a high number of false positives from existing indel calling methods, which limits downstream analyses of the effects of indels in both healthy and disease genomes. In this study, we evaluated seven commonly used general indel calling programs for germline indels and four somatic indel calling programs through comparative analysis, to investigate their common features and differences and to explore ways to improve indel annotation accuracy. Methods: In our comparative analysis, we adopted a more stringent evaluation approach by considering both the indel positions and the indel types (insertion or deletion sequences) between the samples and the reference set. In addition, we applied an efficient way to use a benchmark for improved performance comparisons of the general indel calling programs. Results: We found that germline indels in healthy genomes derived by combining several indel calling tools could help remove a large number of false positive indels from individual programs without compromising the number of true positives. The performance comparisons of somatic indel calling programs are more complicated due to the lack of a reliable and comprehensive benchmark. Nevertheless, our results revealed large variations among the programs and among cancer types. Conclusions: While more accurate indel calling programs are needed, we found that the performance of germline indel annotation can be improved by combining the results from several programs. In addition, well-designed benchmarks for both germline and somatic indels are key in program development and evaluation.
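The two ideas in this abstract, stringent matching on both position and allele sequence and combining the output of several callers, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the record layout, the function names (`indel_key`, `consensus_calls`, `evaluate`), the `min_support` threshold, and all example calls are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical indel record: (chromosome, position, ref allele, alt allele).
def indel_key(chrom, pos, ref, alt):
    """Identify an indel by position AND allele sequences (stringent match)."""
    return (chrom, int(pos), ref.upper(), alt.upper())

def consensus_calls(call_sets, min_support=2):
    """Keep indels reported by at least `min_support` of the call sets."""
    counts = Counter(key for calls in call_sets for key in set(calls))
    return {key for key, n in counts.items() if n >= min_support}

def evaluate(calls, benchmark):
    """Count true/false positives and false negatives against a benchmark."""
    calls, benchmark = set(calls), set(benchmark)
    tp = len(calls & benchmark)
    fp = len(calls - benchmark)
    fn = len(benchmark - calls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall}

# Made-up benchmark and caller outputs (assumptions, not real data).
benchmark = {
    indel_key("chr1", 100, "A", "ATG"),
    indel_key("chr1", 250, "GCC", "G"),
}
caller_a = benchmark | {indel_key("chr2", 40, "T", "TA")}
# Same position as a benchmark insertion but a different inserted sequence:
# a position-only comparison would accept it, the stringent key rejects it.
caller_b = benchmark | {indel_key("chr1", 100, "A", "ACC")}
caller_c = {indel_key("chr1", 100, "A", "ATG"), indel_key("chr3", 7, "CT", "C")}

merged = consensus_calls([caller_a, caller_b, caller_c], min_support=2)
single = evaluate(caller_b, benchmark)
combined = evaluate(merged, benchmark)
print(single)
print(combined)
```

In this toy data each caller contributes a private false positive, and requiring support from two callers removes all of them while keeping both benchmark indels. Real call sets would also need allele normalization (for example left-alignment) before keys are compared, which this sketch omits.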

2020 ◽  
Author(s):  
Kentaro Tomii ◽  
Shravan Kumar ◽  
Degui Zhi ◽  
Steven E. Brenner

Abstract. Background: Insertion and deletion sequencing errors are relatively common in next-generation sequencing data and produce long stretches of mistranslated sequence. These frameshifting errors can seriously damage downstream analysis of reads. However, it is possible to obtain a more precise alignment of DNA sequences by taking into account both the coding frame and the sequencing errors estimated from quality scores. Results: Here we propose a novel hidden Markov model (HMM)-based pairwise alignment algorithm, Meta-Align, that aligns DNA sequences in the protein space, incorporating quality scores from the DNA sequences and allowing frameshifts caused by insertions and deletions. Our model is based on both an HMM transducer of a pair HMM and profile HMMs for all possible amino acid pairs. A Viterbi algorithm over our model produces the optimal alignment of a pair of metagenomic reads, taking into account all possible translating frames and gap penalties in both the protein space and the DNA space. To reduce the sheer number of states in this model, we also derived and implemented a computationally feasible model that leverages the degeneracy of the genetic code. In a benchmark test on a diverse set of simulated reads based on BAliBASE, we show that Meta-Align outperforms TBLASTX, which compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database using the BLAST algorithm. We also demonstrate the effects of incorporating quality scores on Meta-Align. Conclusions: Meta-Align will be particularly effective when applied to error-prone DNA sequences. The software package can be downloaded at https://github.com/shravan-repos/Metaalign.
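To make the frameshift problem concrete, the following Python sketch computes the six-frame translations that a TBLASTX-style comparison works from, and shows how a single inserted base changes the frame-0 protein. It does not implement Meta-Align or its HMM; the function names and the example reads are assumptions, and only the standard genetic code is used.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[i]
    for i, (a, b, c) in enumerate(product(BASES, repeat=3))
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def translate(seq, frame=0):
    """Translate from offset `frame`; unknown codons become 'X'."""
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def six_frame_translations(seq):
    """Three forward frames followed by three reverse-complement frames."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    return [translate(s, f) for s in (seq, rc) for f in range(3)]

# Hypothetical reads: `shifted` has one extra G inserted after the fourth base.
original = "ATGGCCAAA"
shifted = "ATGGGCCAAA"

print(six_frame_translations(original))
print(translate(original, 0), translate(shifted, 0), translate(shifted, 1))
```

After the insertion, the downstream residues of the original protein reappear only in a different reading frame, which is why an aligner that can switch frames within a read can recover more of the true alignment than one that treats each frame as an independent sequence.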


2013 ◽  
Vol 63 (3) ◽  
Author(s):  
Vera Afreixo ◽  
Sara P. Garcia ◽  
João M. O. S. Rodrigues

Single-strand symmetry has been observed in several genomes, and some authors have associated this phenomenon with genome evolution. However, it is still not clear how strong and exceptional this phenomenon is. We use next-generation sequencing data from a sample of 1,092 human individuals made available by the 1000 Genomes Project. To evaluate the symmetry of single-strand human genomic DNA, we explore and analyze these 1,092 human genomes and 1,092 randomly generated sequences, each forced to mimic the nucleotide frequency distribution of its real counterpart. Our methodology is based on measurements and on traditional and equivalence statistical tests using different parameters. By statistical testing, we find that the global symmetry phenomenon is significant for word lengths smaller than 8. When we evaluate the global symmetry scores, we obtain strong values for all word lengths and both types of sequences under study. However, the symmetry scores in human genomes reach higher values and have lower dispersion than those in random sequences. We also find that human and random symmetry scores are significantly different. We conclude that in the human genome, the differences between symmetric words are higher than in random sequences, but the correlation between symmetric words in human genomes is higher.
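The comparison described here, between the counts of each word and its reverse complement on one strand and against a random sequence with matched nucleotide frequencies, can be sketched in Python as follows. The score below is a simple illustrative measure chosen for this sketch, not the symmetry measure or the statistical tests used by the authors; the function names and the toy sequence are likewise assumptions, and input is assumed to contain only A, C, G and T.

```python
import random
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(word):
    """Return the reverse complement of an upper-case DNA word."""
    return word.translate(COMPLEMENT)[::-1]

def word_counts(seq, k):
    """Count all overlapping words of length k on a single strand."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def symmetry_score(seq, k):
    """Illustrative score in [0, 1]; 1 means every word is exactly as
    frequent as its reverse complement (an assumed measure)."""
    counts = word_counts(seq, k)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence is shorter than the word length")
    words = set(counts) | {reverse_complement(w) for w in counts}
    diff = sum(abs(counts[w] - counts[reverse_complement(w)]) for w in words)
    # Each word/reverse-complement pair is visited twice, hence 2 * total.
    return 1.0 - diff / (2 * total)

def random_like(seq, rng):
    """Random sequence with the same length and nucleotide frequencies in
    expectation (a simple stand-in for the paper's random controls)."""
    weights = [seq.count(base) for base in "ACGT"]
    return "".join(rng.choices("ACGT", weights=weights, k=len(seq)))

rng = random.Random(0)
toy = "ACGTTGCAACGTAAGCTTACGT" * 20  # made-up sequence, not genomic data
control = random_like(toy, rng)
for k in (1, 2, 3):
    print(k, symmetry_score(toy, k), symmetry_score(control, k))
```

Because a frequency-matched random sequence inherits any symmetry already present at the single-nucleotide level, a control of this kind helps separate symmetry explained by base composition from symmetry that appears only at longer word lengths.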

