Evolution of the cystatin B gene: implications for the origin of its variable dodecamer tandem repeat in humans

Sequence data from this article have been deposited with the DDBJ/EMBL/GenBank Data Libraries under Accession Nos. AB083085 to AB083089, AB083416, and AB083417.

Short tandem repeats (STRs) and variable number tandem repeats (VNTRs) are important sources of natural and disease-causing variation, yet they have been problematic to resolve in reference genomes and genotype with short-read technology. We created a framework to model the evolution and instability of STRs and VNTRs in apes. We phased and assembled 3 ape genomes (chimpanzee, gorilla, and orangutan) using long-read and 10x Genomics linked-read sequence data for 21,442 human tandem repeats discovered in 6 haplotype-resolved assemblies of Yoruban, Chinese, and Puerto Rican origin. We define a set of 1,584 STRs/VNTRs expanded specifically in humans, including large tandem repeats affecting coding and noncoding portions of genes (e.g., MUC3A, CACNA1C). We show that short interspersed nuclear element–VNTR–Alu (SVA) retrotransposition is the main mechanism for distributing GC-rich human-specific tandem repeat expansions throughout the genome but with a bias against genes. In contrast, we observe that VNTRs not originating from retrotransposons have a propensity to cluster near genes, especially in the subtelomere. Using tissue-specific expression from human and chimpanzee brains, we identify genes where transcript isoform usage differs significantly, likely caused by cryptic splicing variation within VNTRs. Using single-cell expression from cerebral organoids, we observe a strong effect for genes associated with transcription profiles analogous to intermediate progenitor cells. Finally, we compare the sequence composition of some of the largest human-specific repeat expansions and identify 52 STRs/VNTRs with at least 40 uninterrupted pure tracts as candidates for genetically unstable regions associated with disease.

Gene discovery and comparative analysis of X-degenerate genes from the domestic cat Y chromosome

Sequence data from this article have been deposited with the EMBL/GenBank Data Libraries under Accession No. EU879967-EU879988.

Genomics ◽

10.1016/j.ygeno.2008.06.012 ◽

2008 ◽

Vol 92 (5) ◽

pp. 329-338 ◽

Cited By ~ 36

Author(s):

Alison J. Pearks Wilkerson ◽

Terje Raudsepp ◽

Tina Graves ◽

Derek Albracht ◽

Wesley Warren ◽

...

Keyword(s):

Comparative Analysis ◽

Y Chromosome ◽

Sequence Data ◽

Gene Discovery ◽

Domestic Cat ◽

Chromosome Sequence ◽

Data Libraries

Human-specific subfamilies of HERV-K (HML-2) long terminal repeats: three master genes were active simultaneously during branching of hominoid lineages

Sequence data from this article have been deposited with the DDBJ/EMBL/GenBank Data Libraries under Accession Nos. AY134884–AY134891 and AF532734–AF532738.

Genomics ◽

10.1016/s0888-7543(02)00027-7 ◽

2003 ◽

Vol 81 (2) ◽

pp. 149-156 ◽

Cited By ~ 65

Author(s):

Anton Buzdin ◽

Svetlana Ustyugova ◽

Konstantin Khodosevich ◽

Ilgar Mamedov ◽

Yuri Lebedev ◽

...

Keyword(s):

Sequence Data ◽

Long Terminal Repeats ◽

Master Genes ◽

Terminal Repeats ◽

Data Libraries ◽

Human Specific

Characterization of a chloroplast isoform of serine acetyltransferase from the thermo-acidiphilic red alga Cyanidioschyzon merolae

The nucleotide sequence data reported in this paper have been submitted to the EMBL/GenBank/DDBJ Data Libraries under the accession number AB008428.1

Biochimica et Biophysica Acta (BBA) - Molecular Cell Research ◽

10.1016/s0167-4889(98)00031-7 ◽

1998 ◽

Vol 1403 (1) ◽

pp. 72-84 ◽

Cited By ~ 32

Author(s):

Kyoko Toda ◽

Hiroyoshi Takano ◽

Shin-ya Miyagishima ◽

Haruko Kuroiwa ◽

Tsuneyoshi Kuroiwa

Keyword(s):

Nucleotide Sequence ◽

Sequence Data ◽

Red Alga ◽

Accession Number ◽

Serine Acetyltransferase ◽

Nucleotide Sequence Data ◽

Data Libraries

Comparative and functional analyses of LYL1 loci establish marsupial sequences as a model for phylogenetic footprinting

Sequence data from this article have been deposited with the DDBJ/EMBL/GenBank Data Libraries under Accession No. AL731834.

Genomics ◽

10.1016/s0888-7543(03)00005-3 ◽

2003 ◽

Vol 81 (3) ◽

pp. 249-259 ◽

Cited By ~ 31

Author(s):

Michael A Chapman ◽

Fadi J Charchar ◽

Sarah Kinston ◽

Christine P Bird ◽

Darren Grafham ◽

...

Keyword(s):

Sequence Data ◽

Phylogenetic Footprinting ◽

Data Libraries ◽

Functional Analyses

Sequence, gene structure, and expression pattern of CTNNBL1, a minor-class intron-containing gene—evidence for a role in apoptosis

Sequence data from this article have been deposited with the GenBank Data Libraries under accession numbers as follows: Homo sapiens CTNNBL1: AF239607, AL109964, AL023804, AL118499. Mus musculus CTNNBL1: AY009405. Caenorhabditis elegans CTNNBL1: AAB37831, U80450. Drosophila melanogaster CTNNBL1: AE003681, AAF54309. Schizosaccharomyces pombe CTNNBL1: CAB52570. Arabidopsis thaliana CTNNBL1: AAF32478. Danio rerio CTNNBL1 (ESTs): BI883368, AI584702, BM036242, AI794469, BI881598, AI794082, AI883274, AI584203, BI887925, BI881994, BI883314. Bos taurus CTNNBL1: AF037349.

Genomics ◽

10.1016/s0888-7543(02)00038-1 ◽

2003 ◽

Vol 81 (3) ◽

pp. 292-303 ◽

Cited By ~ 23

Author(s):

Leila Jabbour ◽

Jean F Welter ◽

John Kollar ◽

Thomas M Hering

Keyword(s):

Arabidopsis Thaliana ◽

Drosophila Melanogaster ◽

Gene Structure ◽

Danio Rerio ◽

Bos Taurus ◽

Sequence Data ◽

Homo Sapiens ◽

Gene Structure And Expression ◽

A Minor ◽

Data Libraries

Towards the Development of Tandem Repeat Analyzer for Genome Sequence Data

2010 Second International Conference on Computer Engineering and Applications ◽

10.1109/iccea.2010.285 ◽

2010 ◽

Author(s):

Eesha Ingle ◽

Abhiram Bhise

Keyword(s):

Genome Sequence ◽

Tandem Repeat ◽

Sequence Data ◽

Genome Sequence Data

GtTR: Bayesian estimation of absolute tandem repeat copy number using sequence capture and high throughput sequencing

10.1101/246108 ◽

2018 ◽

Cited By ~ 1

Author(s):

Devika Ganesamoorthy ◽

Minh Duc Cao ◽

Tania Duarte ◽

Wenhan Chen ◽

Lachlan Coin

Keyword(s):

High Throughput ◽

Tandem Repeat ◽

Copy Number ◽

Tandem Repeats ◽

High Throughput Sequencing ◽

Sequence Data ◽

Complex Diseases ◽

Sequencing Analysis ◽

Reference Dataset ◽

Long Read

Abstract

Background: Tandem repeats comprise a significant proportion of the human genome, including coding and regulatory regions. They are highly prone to repeat number variation and nucleotide mutation due to their repetitive and unstable nature, making them a major source of genomic variation between individuals. Despite recent advances in high throughput sequencing, analysis of tandem repeats in the context of complex diseases is still hindered by technical limitations.

Methods: We report a novel targeted sequencing approach, which allows simultaneous analysis of hundreds of repeats. We developed a Bayesian algorithm, GtTR, which combines information from a reference long-read dataset with a short-read counting approach to genotype tandem repeats at population scale. PCR sizing analysis was used for validation.

Results: We used a PacBio long-read sequenced sample to generate a reference tandem repeat genotype dataset with on average 13% absolute deviation from PCR sizing results. Using this reference dataset, GtTR generated estimates of VNTR copy number with accuracy within 95% high posterior density (HPD) intervals of 68% and 83% for capture sequence data and 200X WGS data, respectively, improving to 87% and 94% with use of a PCR reference. We show that the genotype resolution increases as a function of depth, such that the median 95% HPD interval lies within 25%, 14%, 12% and 8% of its midpoint copy number value for 30X, 200X WGS, 395X and 800X capture sequence data, respectively. We validated nine targets by PCR sizing analysis, and genotype estimates from sequencing results correlated well with PCR results.

Conclusions: The novel genotyping approach described here presents a new cost-effective method to explore a previously unrecognized class of repeat variation in GWAS studies of complex diseases at the population level. Further improvements in accuracy can be obtained by improving the accuracy of the reference dataset.
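To make the idea of a read-count-based Bayesian copy number estimate with an HPD interval more concrete, here is a minimal sketch. It is not the GtTR algorithm: the Poisson read-count model, the uniform prior, and the names `copy_number_posterior`, `hpd_interval`, and `reads_per_copy` are all assumptions introduced for illustration. In this toy setup, `reads_per_copy` stands in for whatever per-copy read yield a reference sample might supply.

```python
import math


def copy_number_posterior(read_count, reads_per_copy, max_copies=200):
    """Posterior over integer copy numbers 1..max_copies.

    Illustrative assumption (not the GtTR model):
    read_count ~ Poisson(reads_per_copy * c), with a uniform prior on c.
    """
    copies = list(range(1, max_copies + 1))
    # Poisson log-likelihood up to a constant (the log k! term is dropped).
    log_post = [
        read_count * math.log(reads_per_copy * c) - reads_per_copy * c
        for c in copies
    ]
    peak = max(log_post)
    # Subtract the peak before exponentiating to keep the values stable.
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)
    return {c: w / total for c, w in zip(copies, weights)}


def hpd_interval(posterior, mass=0.95):
    """Smallest set of copy numbers holding at least `mass` probability.

    Returned as (min, max); this is a contiguous interval when the
    posterior is unimodal, as it is under the model above.
    """
    ranked = sorted(posterior.items(), key=lambda kv: kv[1], reverse=True)
    kept, covered = [], 0.0
    for c, p in ranked:
        kept.append(c)
        covered += p
        if covered >= mass:
            break
    return min(kept), max(kept)


# Hypothetical numbers: 300 reads observed, 10 reads expected per repeat copy.
posterior = copy_number_posterior(read_count=300, reads_per_copy=10.0)
map_copies = max(posterior, key=posterior.get)
low, high = hpd_interval(posterior)
print(map_copies, (low, high))
```

In this sketch, raising the read count and `reads_per_copy` together (that is, sequencing deeper) narrows the interval relative to its midpoint, which mirrors the depth dependence the abstract reports.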
