Identifying and correcting repeat-calling errors in nanopore sequencing of telomeres
Nanopore long-read genome sequencing is emerging as a potential approach for studying genomes, including long repetitive elements such as telomeres. Here, we report extensive basecalling-induced errors at telomere repeats across nanopore datasets, sequencing platforms, basecallers, and basecalling models. We found that telomeres, which are represented by (TTAGGG)n and (CCCTAA)n repeats in many organisms, were frequently miscalled (~40-50% of reads) during nanopore sequencing in a strand-specific manner: (TTAGGG)n as (TTAAAA)n, and (CCCTAA)n as (CTTCTT)n and (CCCTGG)n. We showed that this miscalling is likely caused by the high similarity of current profiles between telomeric repeats and these repeat artefacts, leading to mis-assignment of electrical current profiles during basecalling. We further demonstrated that tuning nanopore basecalling models and selectively applying the tuned models to telomeric reads improved recovery and analysis of telomeric regions, with little detected negative impact on basecalling of other genomic regions. Our study thus highlights the importance of verifying nanopore basecalls in long, repetitive, and poorly defined regions of the genome, and showcases how such artefacts in regions like telomeres can potentially be resolved by improvements in nanopore basecalling models.
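To illustrate how basecalled reads might be screened for the repeat artefacts described above, the following is a minimal Python sketch. It is not the method used in this study. The function names (`tandem_fraction`, `classify_read`), the minimum run length, and the 0.5 coverage threshold are hypothetical choices made for illustration only; the motifs are the ones named in the abstract.

```python
import re

# Canonical telomeric repeats and the artefact repeats named in the abstract.
CANONICAL = ("TTAGGG", "CCCTAA")
ARTEFACT = ("TTAAAA", "CTTCTT", "CCCTGG")


def tandem_fraction(seq, motif, min_copies=2):
    """Fraction of `seq` covered by runs of >= `min_copies` tandem copies of `motif`.

    Runs are detected in the phase of `motif` only, so up to five bases at
    each end of a run may go uncounted.
    """
    if not seq:
        return 0.0
    pattern = re.compile("(?:%s){%d,}" % (motif, min_copies))
    covered = sum(m.end() - m.start() for m in pattern.finditer(seq.upper()))
    return covered / len(seq)


def classify_read(seq, threshold=0.5):
    """Label a read 'canonical', 'artefact', or 'other'.

    `threshold` is an assumed, illustrative cutoff on the covered fraction.
    """
    canonical = sum(tandem_fraction(seq, m) for m in CANONICAL)
    artefact = sum(tandem_fraction(seq, m) for m in ARTEFACT)
    if artefact >= threshold and artefact > canonical:
        return "artefact"
    if canonical >= threshold:
        return "canonical"
    return "other"
```

In practice such a screen would more plausibly be applied to the terminal portion of each read rather than to the whole read, and reads flagged as artefacts would be candidates for re-basecalling with a tuned model.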