STRling: a k-mer counting approach that detects short tandem repeat expansions at known and novel loci

2021 ◽  
Author(s):  
Harriet Dashnow ◽  
Brent S. Pedersen ◽  
Laurel Hiatt ◽  
Joe Brown ◽  
Sarah J. Beecroft ◽  
...  

Expansions of short tandem repeats (STRs) cause dozens of rare Mendelian diseases. However, STR expansions, especially those arising from repeats not present in the reference genome, are challenging to detect from short-read sequencing data. Such "novel" STRs include new repeat units occurring at known STR loci, or entirely new STR loci where the sequence is absent from the reference genome. A primary cause of difficulty detecting STR expansions is that reads arising from STR expansions are frequently mismapped or unmapped. To address this challenge, we have developed STRling, a new STR detection algorithm that counts k-mers (short DNA sequences of length k) in DNA sequencing reads, to efficiently recover reads that inform the presence and size of STR expansions. As a result, STRling can call expansions at both known and novel STR loci. STRling has a sensitivity of 83% for 14 known STR disease loci, including the novel STRs that cause CANVAS and DBQD2. It is the first method to resolve the position of novel STR expansions to base pair accuracy. Such accuracy is essential to interpreting the consequence of each expansion. STRling has an estimated 0.078 false discovery rate for known pathogenic loci in unaffected individuals and a 0.20 false discovery rate for genome-wide loci in unaffected individuals when using variants called from long-read data as truth. STRling is fast, scalable on cloud computing, open-source, and freely available at https://github.com/quinlan-lab/STRling.
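The k-mer counting idea in the abstract above can be illustrated with a toy example. This is a minimal sketch, not STRling's actual implementation: the function names, the use of the lexicographically smallest rotation as a canonical repeat unit, and the "fraction of windows" score are all assumptions made for illustration, and reverse complements are ignored.

```python
from collections import Counter


def canonical_rotation(unit: str) -> str:
    """Return the lexicographically smallest rotation of a repeat unit,
    so that CAG, AGC and GCA are all counted as the same unit."""
    return min(unit[i:] + unit[:i] for i in range(len(unit)))


def dominant_repeat_unit(read: str, k: int):
    """Count every k-mer window in a read (hypothetical helper).

    Returns the most common canonical k-mer and the fraction of windows
    that are a rotation of it. A fraction near 1.0 suggests the read is
    made up almost entirely of one tandem repeat.
    """
    windows = len(read) - k + 1
    if windows <= 0:
        return None, 0.0
    counts = Counter(canonical_rotation(read[i:i + k]) for i in range(windows))
    unit, n = counts.most_common(1)[0]
    return unit, n / windows


# A read consisting only of CAG repeats is fully explained by one unit.
print(dominant_repeat_unit("CAG" * 10, k=3))  # ('AGC', 1.0)
```

A read flagged this way could be kept as evidence for an expansion even when an aligner leaves it unmapped, which is the difficulty the abstract describes.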

2021 ◽  
Author(s):  
Cody J Steely ◽  
Scott Watkins ◽  
Lisa Baird ◽  
Lynn Jorde

Short tandem repeats (STRs) are tandemly repeated sequences of 1-6 bp motifs. STRs compose approximately 3% of the genome, and mutations at STR loci have been linked to dozens of human diseases including amyotrophic lateral sclerosis, Friedreich ataxia, Huntington disease, and fragile X syndrome. Improving our understanding of these mutations would increase our knowledge of the mutational dynamics of the genome and may uncover additional loci that contribute to disease. Here, to estimate the genome-wide pattern of mutations at STR loci, we analyzed blood-derived whole-genome sequencing data for 544 individuals from 29 three-generation CEPH pedigrees. These pedigrees contain both sets of grandparents, the parents, and an average of 9 grandchildren per family. Using HipSTR, we identified de novo STR mutations in the second generation of these pedigrees. Analyzing ~1.6 million STR loci, we estimate the empirical de novo STR mutation rate to be 5.24 × 10⁻⁵ mutations per locus per generation. We find that perfect repeats mutate ~2x more often than imperfect repeats. De novo STRs are significantly enriched in Alu elements (p < 2.2e-16). Approximately 30% of STR mutations occur within Alu elements, which compose only ~11% of the genome, and ~10% are found in LINE-1 insertions, which compose ~17% of the genome. Phasing these de novo mutations to the parent of origin shows that parental transmission biases vary among families. We estimate the average number of de novo genome-wide STR mutations per individual to be ~85, which is similar to the average number of observed de novo single nucleotide variants.
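The per-locus rate and the per-individual count quoted above can be checked against each other with simple arithmetic. This is a rough sketch that assumes the expected count is just the number of loci multiplied by the per-locus rate; the authors may have derived their ~85 figure differently, and the function name is hypothetical.

```python
def expected_de_novo_per_individual(n_loci: float, rate_per_locus: float) -> float:
    """Expected number of de novo mutations per individual per generation,
    assuming every locus mutates independently at the same average rate."""
    return n_loci * rate_per_locus


# Values quoted in the abstract: ~1.6 million loci, 5.24e-5 per locus per generation.
expected = expected_de_novo_per_individual(1.6e6, 5.24e-5)
print(f"{expected:.1f}")  # 83.8
```

The result of about 84 is in line with the reported average of ~85 de novo STR mutations per individual.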


2014 ◽  
Author(s):  
Thomas F. Willems ◽  
Melissa Gymrek ◽  
Gareth Highnam ◽  
The Genomes Project ◽  
David Mittelman ◽  
...  

Short Tandem Repeats are among the most polymorphic loci in the human genome. These loci play a role in the etiology of a range of genetic diseases and have been frequently utilized in forensics, population genetics, and genetic genealogy. Despite this plethora of applications, little is known about the variation of most STRs in the human population. Here, we report the largest-scale analysis of human STR variation to date. We collected information for nearly 700,000 STR loci across over 1,000 individuals in phase 1 of the 1000 Genomes Project. This process nearly saturated common STR variations. After employing a series of quality controls, we utilize this call set to analyze determinants of STR variation, assess the human reference genome's representation of STR alleles, find STR loci with common loss-of-function alleles, and obtain initial estimates of the linkage disequilibrium between STRs and common SNPs. Overall, these analyses further elucidate the scale of genetic variation beyond classical point mutations. The resource is publicly available at http://strcat.teamerlich.org/ both in raw format and via a graphical interface.


Genes ◽  
2020 ◽  
Vol 11 (4) ◽  
pp. 381 ◽  
Author(s):  
Olivier Tytgat ◽  
Yannick Gansemans ◽  
Jana Weymaere ◽  
Kaat Rubben ◽  
Dieter Deforce ◽  
...  

Nanopore sequencing for forensic short tandem repeats (STR) genotyping comes with the advantages associated with massively parallel sequencing (MPS) without the need for a high up-front device cost, but genotyping is inaccurate, partially due to the occurrence of homopolymers in STR loci. The goal of this study was to apply the latest progress in nanopore sequencing by Oxford Nanopore Technologies in the field of STR genotyping. The experiments were performed using the state-of-the-art R9.4 flow cell and the most recent R10 flow cell, which was specifically designed to improve consensus accuracy of homopolymers. Two single-contributor samples and one mixture sample were genotyped using Illumina sequencing, Nanopore R9.4 sequencing, and Nanopore R10 sequencing. The accuracy of genotyping was comparable for both types of flow cells, although the R10 flow cell provided improved data quality for loci characterized by the presence of homopolymers. We identify locus-dependent characteristics hindering accurate STR genotyping, providing insights for the design of a panel of STR loci suited for nanopore sequencing. Repeat number, the number of different reference alleles for the locus, repeat pattern complexity, flanking region complexity, and the presence of homopolymers are identified as unfavorable locus characteristics. For single-contributor samples and for a limited set of the commonly used STR loci, nanopore sequencing could be applied. However, the technology is not yet mature enough for implementation in routine forensic workflows.


2012 ◽  
Vol 40 (9) ◽  
pp. e69-e69 ◽  
Author(s):  
Günter Klambauer ◽  
Karin Schwarzbauer ◽  
Andreas Mayr ◽  
Djork-Arné Clevert ◽  
Andreas Mitterecker ◽  
...  

2013 ◽  
Vol 1 (1) ◽  
Author(s):  
Johannis Mallo

Abstract: The body of an unidentified woman (Ms X), found in the Malalayang area, was brought by police officers to Prof. Dr. R. D. Kandou General Hospital. The body had begun to decompose, and the police had difficulty establishing her identity and locating her relatives. To investigate the circumstances of Ms X's death, the police first had to establish her identity. Medical data collected during the forensic autopsy were compared with the missing-persons lists compiled by the Malalayang Sector police and the Manado city police, and they matched a woman who had been reported missing by her family. The police then requested a DNA-based identification to compare Ms X's DNA with that of the individuals claiming to be her relatives. During the forensic autopsy, 10 cm samples of compact bone were taken from the right and left ribs. For comparison, buccal swabs and 2 cc of peripheral blood were taken from each of the individuals presumed to be Ms X's father and sister. Extraction, quantification, PCR, and analysis were performed at Pusat Laboratorium Forensik Kepolisian Republik Indonesia, the main police forensic laboratory of Indonesia. PCR targets 13 to 15 nuclear STR loci, and the analysis compares these loci across the three individuals. If a match is found with 99% accuracy, the identification is verified. The Paternity Index expresses how much more likely it is that the suspected father is the biological father of Ms X than another male from the Asian / Indonesian population. Keywords: corpse identity, DNA identification, PCR, STR.


2017 ◽  
Author(s):  
James Sun ◽  
Linda Zhou ◽  
Daniel J. Emerson ◽  
Thomas G. Gilgenast ◽  
Katelyn Titus ◽  
...  

Abstract: More than 25 inherited neurological disorders are caused by the unstable expansion of repetitive DNA sequences termed short tandem repeats (STRs). A fundamental unresolved question is why specific STRs are susceptible to unstable expansion leading to severe pathology, whereas tens of thousands of normal-length repeat tracts across the human genome are relatively stable. Here, we unexpectedly discover that nearly all STRs associated with repeat expansion diseases are located at boundaries demarcating 3-D chromatin domains. We find that boundaries exhibit markedly higher CpG island density compared to loci internal to domains. Importantly, disease-associated STRs are specifically localized to ultra-dense CpG island-rich boundaries, suggesting that these loci might be hotspots for epigenetic instability and topological disruption upon unstable expansion. In Fragile X Syndrome, mutation-length expansion at the Fmr1 gene results in severe disruption of the boundary between TADs. Our data uncover higher-order chromatin architecture as a new dimension in understanding the mechanistic basis of repeat expansion disorders.


2019 ◽  
Author(s):  
David Jakubosky ◽  
Erin N. Smith ◽  
Matteo D’Antonio ◽  
Marc Jan Bonder ◽  
William W. Young Greenwald ◽  
...  

Abstract: Structural variants (SVs) and short tandem repeats (STRs) are important sources of genetic diversity but are not routinely analyzed in genetic studies because they are difficult to accurately identify and genotype. Because SVs and STRs range in size and type, it is necessary to apply multiple algorithms that incorporate different types of evidence from sequencing data and employ complex filtering strategies to discover a comprehensive set of high-quality and reproducible variants. Here we assembled a set of 719 deep whole genome sequencing (WGS) samples (mean 42x) from 477 distinct individuals, which we used to discover and genotype a wide spectrum of SV and STR variants using five algorithms. We used 177 unique pairs of genetic replicates to identify factors that affect variant call reproducibility and developed a systematic filtering strategy to create one of the most complete and well characterized maps of SVs and STRs to date.


F1000Research ◽  
2020 ◽  
Vol 9 ◽  
pp. 200 ◽  
Author(s):  
Andreas Halman ◽  
Alicia Oshlack

Background: Short tandem repeats are an important source of genetic variation. They are highly mutable, and repeat expansions are associated with dozens of human disorders, such as Huntington's disease and spinocerebellar ataxias. Technical advances in sequencing technology have made it possible to analyse these repeats at large scale; however, accurate genotyping is still a challenging task. We compared four different short tandem repeat genotyping tools on whole exome sequencing data to determine their genotyping performance and limits, which will aid other researchers in choosing a suitable tool and parameters for analysis. Methods: The analysis was performed on the Simons Simplex Collection dataset, where we used a novel method of evaluation with accuracy determined by the rate of homozygous calls on the X chromosome of male samples. In total we analysed 433 samples and around a million genotypes for evaluating tools on whole exome sequencing data. Results: We determined a relatively good performance of all tools when genotyping repeats of 3-6 bp in length, which could be improved with coverage and quality score filtering. However, genotyping homopolymers was challenging for all tools and a high error rate was present across different thresholds of coverage and quality scores. Interestingly, dinucleotide repeats displayed a high error rate as well, which was found to be mainly caused by the AC/TG repeats. Overall, LobSTR was able to make the most calls and was also the fastest tool, while RepeatSeq and HipSTR exhibited the lowest heterozygous error rate at low coverage. Conclusions: All tools have different strengths and weaknesses and the choice may depend on the application. In this analysis we demonstrated the effect of using different filtering parameters and offered recommendations based on the trade-off between the best accuracy of genotyping and the highest number of calls.
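The evaluation method in the abstract above rests on the fact that males carry a single X chromosome, so a heterozygous STR call on male X is necessarily a genotyping error. A minimal sketch of that metric follows; the function name and the input format (pairs of repeat counts) are assumptions for illustration, and the sketch assumes that calls in the pseudoautosomal regions have already been excluded.

```python
def heterozygous_error_rate(calls):
    """Fraction of diploid genotype calls that are heterozygous.

    `calls` is an iterable of (allele_1, allele_2) repeat counts made on the
    X chromosome of male samples. Any call with two different alleles is
    counted as an error, because only one allele can truly be present.
    """
    calls = list(calls)
    if not calls:
        raise ValueError("at least one genotype call is required")
    heterozygous = sum(1 for a, b in calls if a != b)
    return heterozygous / len(calls)


# Hypothetical calls: three homozygous, one heterozygous (an error).
male_x_calls = [(10, 10), (12, 12), (9, 11), (15, 15)]
print(heterozygous_error_rate(male_x_calls))  # 0.25
```

The appeal of this design is that it needs no external truth set: the expected genotype class is known from biology alone, although it cannot detect a homozygous call with the wrong repeat length.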


2019 ◽  
Vol 35 (22) ◽  
pp. 4716-4723 ◽  
Author(s):  
Daniel Tello ◽  
Juanita Gil ◽  
Cristian D Loaiza ◽  
John J Riascos ◽  
Nicolás Cardozo ◽  
...  

Abstract Motivation: Accurate detection, genotyping and downstream analysis of genomic variants from high-throughput sequencing data are fundamental features in modern production pipelines for genetic-based diagnosis in medicine or genomic selection in plant and animal breeding. Our research group maintains the Next-Generation Sequencing Experience Platform (NGSEP) as a precise, efficient and easy-to-use software solution for these features. Results: Understanding that incorrect alignments around short tandem repeats are an important source of genotyping errors, we implemented in NGSEP new algorithms for realignment and haplotype clustering of reads spanning indels and short tandem repeats. We performed extensive benchmark experiments comparing NGSEP to state-of-the-art software using real data from three sequencing protocols and four species with different distributions of repetitive elements. NGSEP consistently shows comparable accuracy and better efficiency than the existing solutions. We expect that this work will contribute to the continuous improvement of quality in variant calling needed for modern applications in medicine and agriculture. Availability and implementation: NGSEP is available as open source software at http://ngsep.sf.net. Supplementary information: Supplementary data are available at Bioinformatics online.


2018 ◽  
Vol 43 (2) ◽  
pp. 142-150
Author(s):  
Elif Mertoglu ◽  
Gonul Filoglu ◽  
Tolga Zorlu ◽  
Ozlem Bulbul

Abstract Background: The non-recombining region of the Y chromosome (NRY) is transferred from father to son unchanged, without recombination in meiosis. Since short tandem repeats on the Y chromosome (Y-STRs) in this region do not undergo recombination, they are identical, except for mutations, in all male individuals related through the paternal line. These regions are therefore important for identification in the forensic sciences and for determination of paternity. In determination of paternity, if mismatches are observed between father and child, population-specific mutation rates should be used to determine whether the mismatch is a mutation or a true exclusion. In this study, we therefore aim to determine the mutation rates of 17 Y-STR loci in Turkey. Material and methods: 17 Y-STR loci were typed using the AmpFlSTR® Yfiler™ Kit in 90 volunteer father-son pairs. Mutation rates were calculated and compared with other populations. Results: Mutations were found in three father-son pairs at the DYS439 and DYS458 loci. In addition, a duplication in DYS389 II loci* 30, 31 was observed. The average mutation rate was determined to be 1.96 × 10⁻³ for the Turkish population. Conclusion: This investigation will help minimize the possibility of false exclusion in father-son and kinship relations.
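The reported average mutation rate is consistent with a simple count-based estimate: mutations observed divided by the number of allele transmissions examined. The sketch below assumes one mutation in each of the three father-son pairs and that all 17 loci were typed in all 90 pairs; the abstract does not state these details explicitly, and the function name is hypothetical.

```python
def average_mutation_rate(mutations: int, loci: int, pairs: int) -> float:
    """Mutations per locus per generation, pooled over all loci and pairs."""
    transmissions = loci * pairs  # each pair contributes one transmission per locus
    if transmissions <= 0:
        raise ValueError("loci and pairs must both be positive")
    return mutations / transmissions


# 3 mutations across 17 loci x 90 father-son pairs = 1530 transmissions.
rate = average_mutation_rate(mutations=3, loci=17, pairs=90)
print(f"{rate:.2e}")  # 1.96e-03
```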

