Accurate and Efficient KIR Gene and Haplotype Inference From Genome Sequencing Reads With Novel K-mer Signatures

The killer-cell immunoglobulin-like receptor (KIR) proteins evolve to fight viruses and mediate the body’s reaction to pregnancy. These roles provide selection pressure for variation at both the structural/haplotype and base/allele levels. At the same time, the genes have evolved relatively recently by tandem duplication and therefore exhibit very high sequence similarity over thousands of bases. These variation-homology patterns make it impossible to interpret KIR haplotypes from abundant short-read genome sequencing data at population scale using existing methods. Here, we developed an efficient computational approach for in silico KIR probe interpretation (KPI) to accurately interpret an individual’s KIR genes and haplotype pairs from KIR sequencing reads. We designed synthetic 25-base sequence probes by analyzing previously reported haplotype sequences, and we developed a bioinformatics pipeline to interpret the probes in the context of 16 KIR genes and 16 haplotype structures. We demonstrated its accuracy on a synthetic data set as well as on real whole-genome sequences from 748 individuals from The Genome of the Netherlands (GoNL). The GoNL predictions were compared with SNP-based predictions. Our results show a 100% accuracy rate for the synthetic tests and a 99.6% family-consistency rate in the GoNL tests. Agreement with the SNP-based calls on KIR genes ranges from 72% to 100% with a mean of 92%; most differences occur in genes KIR2DS2, KIR2DL2, KIR2DS3, and KIR2DL5, where KPI predicts presence and the SNP-based interpretation predicts absence. Overall, the evidence suggests that KPI’s accuracy is 97% or greater for both KIR gene and haplotype-pair predictions, although presence/absence genotyping leads to ambiguous haplotype-pair predictions with 16 reference KIR haplotype structures. KPI is free, open, and easily executable as a Nextflow workflow supported by a Docker environment at https://github.com/droeatumn/kpi.
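The idea of calling gene presence from 25-base probes can be illustrated with a minimal Python sketch. It is not the KPI implementation: the probe sequences, the gene labels `geneA` and `geneB`, the function `call_genes`, and the `min_fraction` threshold are all hypothetical, chosen only to show how exact k-mer matches in reads might be turned into presence/absence calls.

```python
from collections import defaultdict

K = 25  # probe length reported in the abstract

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k=K):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def call_genes(reads, probes_by_gene, min_fraction=0.5):
    """Call a gene present when enough of its probes occur in the reads.

    min_fraction is a hypothetical threshold, not a value from the paper.
    Each gene is assumed to have at least one probe.
    """
    probe_to_gene = {p: g for g, probes in probes_by_gene.items() for p in probes}
    hits = defaultdict(set)
    for read in reads:
        # Reads may come from either strand, so scan both orientations.
        for strand in (read, reverse_complement(read)):
            for kmer in kmers(strand):
                gene = probe_to_gene.get(kmer)
                if gene is not None:
                    hits[gene].add(kmer)
    return {
        gene: len(hits[gene]) / len(probes) >= min_fraction
        for gene, probes in probes_by_gene.items()
    }


# Toy data: two made-up 25-base probes and one read containing the first.
probe_a = "ACGTACGTTAGCTAGGCTAACCGTA"
probe_b = "TTGACCATGGCATCGATTGCAAGCT"
probes = {"geneA": {probe_a}, "geneB": {probe_b}}
calls = call_genes(["GG" + probe_a + "TT"], probes)
print(calls)  # {'geneA': True, 'geneB': False}
```

Exact matching keeps the sketch simple; it does not model sequencing errors or the step from gene calls to haplotype pairs described in the abstract.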



10.1101/541938 ◽

2019 ◽

Cited By ~ 5

Author(s):

David Roe ◽

Rui Kuang

Keyword(s):

Genome Sequencing ◽

Sequence Similarity ◽

Killer Cell ◽

Haplotype Pair ◽

Sequencing Data ◽

Bioinformatics Pipeline ◽

High Sequence Similarity ◽

Data Set ◽

KIR Genes ◽

KIR Gene



The Complete Chloroplast Genome of the Vulnerable Oreocharis esquirolii (Gesneriaceae): Structural Features, Comparative and Phylogenetic Analysis

Plants ◽

10.3390/plants9121692 ◽

2020 ◽

Vol 9 (12) ◽

pp. 1692

Author(s):

Li Gu ◽

Ting Su ◽

Ming-Tai An ◽

Guo-Xiong Hu

Keyword(s):

Phylogenetic Analysis ◽

Sequence Similarity ◽

Single Copy ◽

Structural Features ◽

rRNA Genes ◽

tRNA Genes ◽

Sequencing Data ◽

High Sequence Similarity ◽

Plastid Genomes ◽

Cp Genome

Oreocharis esquirolii, a member of Gesneriaceae, is also known as Thamnocharis esquirolii, which has been regarded as a synonym of the former. The species is endemic to Guizhou, southwestern China, and is evaluated as vulnerable (VU) under the International Union for Conservation of Nature (IUCN) criteria. Until now, the sequence and genome information of O. esquirolii has remained unknown. In this study, we assembled and characterized the complete chloroplast (cp) genome of O. esquirolii using Illumina sequencing data for the first time. The total length of the cp genome was 154,069 bp, with a typical quadripartite structure consisting of a pair of inverted repeats (IRs) of 25,392 bp separated by a large single copy region (LSC) of 85,156 bp and a small single copy region (SSC) of 18,129 bp. The genome comprised 114 unique genes: 80 protein-coding genes, 30 tRNA genes, and four rRNA genes. Thirty-one repeat sequences and 74 simple sequence repeats (SSRs) were identified. Genome alignment across five plastid genomes of Gesneriaceae indicated high sequence similarity. Four highly variable sites (rps16-trnQ, trnS-trnG, ndhF-rpl32, and ycf1) were identified. Phylogenetic analysis indicated that O. esquirolii grouped together with O. mileensis, supporting resurrection of the name Oreocharis esquirolii from Thamnocharis esquirolii. The complete cp genome sequence will contribute to further studies in molecular identification, genetic diversity, and phylogeny.
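The reported region lengths can be checked against the total with a few lines of Python. The function name `quadripartite_length` is an illustrative choice; the numbers are the ones given in the abstract. In a quadripartite plastome the inverted repeat occurs twice, so the total is LSC + SSC + 2 × IR.

```python
def quadripartite_length(lsc, ssc, ir):
    """Total plastome length: LSC + SSC + two copies of the inverted repeat."""
    return lsc + ssc + 2 * ir


# Region lengths (bp) and gene counts as reported in the abstract.
REPORTED = {"lsc": 85_156, "ssc": 18_129, "ir": 25_392, "total": 154_069}

computed = quadripartite_length(REPORTED["lsc"], REPORTED["ssc"], REPORTED["ir"])
gene_total = 80 + 30 + 4  # protein-coding + tRNA + rRNA

print(computed == REPORTED["total"])  # True
print(gene_total)                     # 114
```

Both the genome length and the unique gene count are internally consistent with the component figures.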


YopT domain of the PfhB2 toxin from Pasteurella multocida: protein expression, characterization, crystallization and crystallographic analysis

Acta Crystallographica Section F Structural Biology Communications ◽

10.1107/s2053230x18000857 ◽

2018 ◽

Vol 74 (3) ◽

pp. 128-134

Author(s):

Sanjeev Kumar ◽

Victoria Hedrick ◽

Seema Mattoo

Keyword(s):

Pasteurella Multocida ◽

Opportunistic Infections ◽

Structural Information ◽

Sequence Similarity ◽

Catalytic Triad ◽

Crystallographic Analysis ◽

High Sequence Similarity ◽

Data Set ◽

Cell Parameters ◽

Tract Infections

Pasteurella multocida causes respiratory-tract infections in a broad range of animals, as well as opportunistic infections in humans. P. multocida secretes a multidomain toxin called PfhB2, which contains a YopT-like cysteine protease domain at its C-terminus. The YopT domain of PfhB2 contains a well conserved Cys–His–Asp catalytic triad that defines YopT family members, and shares high sequence similarity with the prototype YopT from Yersinia sp. To date, only one crystal structure of a YopT family member has been reported; however, additional structural information is needed to help characterize the varied substrate specificity and enzymatic action of this large protease family. Here, a catalytically inactive C3733S mutant of PfhB2 YopT that provides enhanced protein stability was used with the aim of gaining structural insight into the diversity within the YopT protein family. To this end, the C3733S mutant of PfhB2 YopT has been successfully cloned, overexpressed, purified and crystallized. Diffraction data sets were collected from native crystals to 3.5 Å resolution and a single-wavelength anomalous data set was collected from an iodide-derivative crystal to 3.2 Å resolution. Data pertaining to crystals belonging to space group P31, with unit-cell parameters a = 136.9, b = 136.9, c = 74.7 Å for the native crystals and a = 139.2, b = 139.2, c = 74.7 Å for the iodide-derivative crystals, are discussed.


Characterisation of B killer cell immunoglobulin-like receptor genes and telomeric and centromeric motifs in hematopoietic stem cell transplantation donors in Vojvodina, Serbia

Genetika ◽

10.2298/gensr1701345a ◽

2017 ◽

Vol 49 (1) ◽

pp. 345-354

Author(s):

Dusica Ademovic-Sazdanic ◽

Svetlana Vojvodic ◽

S. Popovic ◽

N. Konstantinidis

Keyword(s):

Relapse Prevention ◽

Hematological Malignancies ◽

Killer Cell ◽

Simple Algorithm ◽

Hematopoietic Stem ◽

Graft Versus Host ◽

KIR Genes ◽

Graft Versus Leukemia ◽

KIR Gene ◽

KIR Genotypes

The outcome of hematopoietic stem cell transplantation (HSCT) is strongly influenced by the genetic similarity or identity in the HLA genes, which affects the incidence of graft-versus-host disease (GvHD). Successful allogeneic HSCT, however, also depends on the T-cell mediated graft-versus-leukemia (GvL) effect, in which donor-derived T cells and natural killer (NK) cells kill malignant cells in the patient and therefore play a crucial role in relapse prevention. The aim of this study was to perform a predictive analysis of the structure and distribution of B KIR alleles and centromeric and telomeric KIR genotypes in HSCT donors in Vojvodina with regard to their contribution to protection from relapse. A total of 124 first-degree relatives of patients with hematological malignancies were examined for the presence or absence of 15 KIR genes using the PCR-SSO technique with Luminex xMap technology. The percentage of individuals carrying each KIR gene, centromeric and telomeric KIR haplotypes, and genotypes was determined by direct counting. Sixty-two percent of the HSCT donors in Vojvodina carry the A KIR haplotype, while nearly 38% carry the B KIR haplotype. The distribution of B KIR genes showed that among the 124 studied HSCT donors, 31 (25%) do not carry any of the KIR genes belonging to the B group, 71.77% of donors have two or more B KIR genes, and 61.29% of them carry KIR 2DL2 and 2DS2 or more B KIR genes. The analysis of centromeric and telomeric KIR genotypes showed that the Cen-A1/Tel-A1 genotype had the highest frequency, 51.47%, and Cen-B2/Tel-B1 the lowest, 1.30%. Donor KIR B gene content and centromeric and telomeric KIR gene structure could be used to develop a simple algorithm to identify donors who will provide the most protection against relapse in related HSC transplants.
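Frequency estimation by direct counting can be sketched in Python as follows. The donor records are invented toy data, and `B_GENES` lists only the two B-group genes named in the abstract (a real analysis would use the full B gene set); neither comes from the study itself.

```python
def carrier_frequency(donors, gene):
    """Fraction of donors carrying `gene`, by direct counting."""
    carriers = sum(1 for genes in donors if gene in genes)
    return carriers / len(donors)


def b_gene_count(genes, b_genes):
    """Number of B-group genes present in one donor's gene set."""
    return len(set(genes) & set(b_genes))


# Hypothetical subset: only the B genes explicitly named in the abstract.
B_GENES = {"KIR2DL2", "KIR2DS2"}

# Toy presence/absence calls for four imaginary donors.
donors = [
    {"KIR2DL1", "KIR3DL1"},
    {"KIR2DL2", "KIR2DS2"},
    {"KIR2DL2"},
    {"KIR2DL1"},
]

freq_2dl2 = carrier_frequency(donors, "KIR2DL2")
no_b = sum(1 for d in donors if b_gene_count(d, B_GENES) == 0) / len(donors)
print(freq_2dl2, no_b)  # 0.5 0.5

# The abstract's own figure: 31 of 124 donors carry no B gene.
print(31 / 124)  # 0.25
```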


Urothelial Carcinoma Detection Based on Copy Number Profiles of Urinary Cell-Free DNA by Shallow Whole-Genome Sequencing

Clinical Chemistry ◽

10.1373/clinchem.2019.309633 ◽

2019 ◽

Vol 66 (1) ◽

pp. 188-198 ◽

Cited By ~ 5

Author(s):

Guangzhe Ge ◽

Ding Peng ◽

Bao Guan ◽

Yuanyuan Zhou ◽

Yanqing Gong ◽

...

Keyword(s):

Urothelial Carcinoma ◽

Genome Sequencing ◽

Copy Number ◽

Support Vector ◽

Sequencing Data ◽

Validation Data ◽

Data Set ◽

Cell Free DNA ◽

Clinical Sensitivity ◽

Free DNA

Background: Current noninvasive assays for urothelial carcinoma (UC) lack clinical sensitivity and specificity. Given the utility of plasma cell-free DNA (cfDNA) biomarkers, the development of urinary cfDNA biomarkers may improve the diagnostic sensitivity. Methods: We assessed copy number alterations (CNAs) by shallow genome-wide sequencing of urinary cfDNA in 95 cancer-free individuals and 65 patients with UC, 58 with kidney cancer, and 45 with prostate cancer. We used a support vector machine to develop a diagnostic classifier based on CNA profiles to detect UC (UCdetector). The model was further validated in an independent cohort (52 patients). Genome sequencing data of tumor specimens from 90 upper tract urothelial cancers (UTUCs) and CNA data for 410 urothelial carcinomas of bladder (UCBs) from The Cancer Genome Atlas were used to validate the classifier. Genome sequencing data for urine sediment from 32 patients with UC were compared with cfDNA. To monitor the treatment efficacy, we collected cfDNA from 7 posttreatment patients. Results: Urinary cfDNA was a more sensitive alternative to urinary sediment. The UCdetector could detect UC at a median clinical sensitivity of 86.5% and specificity of 94.7%. UCdetector performed well in an independent validation data set. Notably, the CNA features selected by UCdetector were specific markers for both UTUC and UCB. Moreover, CNA changes in cfDNA were consistent with the treatment effects. Meanwhile, the same strategy could localize genitourinary cancers to tissue of origin in 70.1% of patients. Conclusions: Our findings underscore the potential utility of urinary cfDNA CNA profiles as a basis for noninvasive UC detection and surveillance.
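For readers less familiar with the two metrics quoted above, the following Python sketch shows how clinical sensitivity and specificity are computed from a confusion matrix. The counts are hypothetical and are not taken from the study; they are chosen only to make the arithmetic easy to follow.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of diseased cases the classifier detects: TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of disease-free cases correctly cleared: TN / (TN + FP)."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts, NOT from the paper.
sens = sensitivity(true_pos=45, false_neg=5)   # 45 of 50 cases detected
spec = specificity(true_neg=95, false_pos=5)   # 95 of 100 controls cleared
print(f"sensitivity={sens:.1%}, specificity={spec:.1%}")
```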


A Bioinformatics Pipeline for Estimating Mitochondria DNA Copy Number and Heteroplasmy Levels from Whole Genome Sequencing Data

10.1101/2021.12.28.21268452 ◽

2021 ◽

Author(s):

Stephanie L Battle ◽

Daniela Puiu ◽

Eric Boerwinkle ◽

Kent Taylor ◽

Jerome Rotter ◽

...

Keyword(s):

Mitochondrial Genome ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Copy Number ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

DNA Molecules ◽

Sequencing Data ◽

Bioinformatics Pipeline ◽

Accurate Identification

Mitochondrial diseases are a heterogeneous group of disorders that can be caused by mutations in the nuclear or mitochondrial genome. Mitochondrial DNA variants may exist in a state of heteroplasmy, where a percentage of DNA molecules harbor a variant, or homoplasmy, where all DNA molecules have a variant. The relative quantity of mtDNA in a cell, or copy number (mtDNA-CN), is associated with mitochondrial function, human disease, and mortality. To facilitate accurate identification of heteroplasmy and quantify mtDNA-CN, we built a bioinformatics pipeline that takes whole genome sequencing data and outputs mitochondrial variants and mtDNA-CN. We incorporate variant annotations to facilitate determination of variant significance. Our pipeline yields uniform coverage by remapping to a circularized chrM and recovering reads falsely mapped to nuclear-encoded mitochondrial sequences. Notably, we construct a consensus chrM sequence for each sample and recall heteroplasmy against the sample's unique mitochondrial genome. We observe an approximately 3-fold increased association with age for heteroplasmic variants in non-homopolymer regions and are better able to capture genetic variation in the D-loop of chrM compared to existing software. Our bioinformatics pipeline captures features of mitochondrial genetics that are important in understanding how mitochondrial dysfunction contributes to disease more accurately than existing pipelines.
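The two quantities named in the abstract can be illustrated with a small Python sketch. It uses a commonly cited coverage-ratio estimator (twice the mitochondrial coverage divided by the autosomal coverage, since autosomes are diploid) and a simple read-fraction definition of heteroplasmy; whether the authors' pipeline computes them this way is not stated in the abstract. The function names, the `low`/`high` cutoffs, and the input numbers are all hypothetical.

```python
def mtdna_copy_number(mt_coverage, autosomal_coverage):
    """Estimate mtDNA copies per cell as 2 * (mt coverage / autosomal coverage).

    The factor 2 reflects the two autosomal copies per diploid cell.
    """
    if autosomal_coverage <= 0:
        raise ValueError("autosomal coverage must be positive")
    return 2.0 * mt_coverage / autosomal_coverage


def heteroplasmy_fraction(alt_reads, total_reads):
    """Fraction of reads at a site that carry the variant allele."""
    if total_reads <= 0:
        raise ValueError("total read count must be positive")
    return alt_reads / total_reads


def classify_variant(fraction, low=0.05, high=0.95):
    """Label a site; the low/high cutoffs are illustrative assumptions."""
    if fraction < low:
        return "reference"
    if fraction > high:
        return "homoplasmic"
    return "heteroplasmic"


# Toy inputs: 3000x mean chrM coverage, 30x mean autosomal coverage,
# and a site where 120 of 1000 reads carry the variant.
copy_number = mtdna_copy_number(3000, 30)
fraction = heteroplasmy_fraction(120, 1000)
label = classify_variant(fraction)
print(copy_number, fraction, label)  # 200.0 0.12 heteroplasmic
```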


Genome Wide Variant Analysis of Simplex Autism Families with an Integrative Clinical-Bioinformatics Pipeline

10.1101/019208 ◽

2015 ◽

Author(s):

Laura T Jiménez-Barrón ◽

Jason A O'Rawe ◽

Yiyang Wu ◽

Margaret Yoon ◽

Han Fang ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

De Novo ◽

Autism Spectrum ◽

Repetitive Behaviors ◽

Whole Genome ◽

Sequencing Data ◽

Bioinformatics Pipeline ◽

Bioinformatics Tools ◽

Genome Wide

Autism spectrum disorders (ASD) are a group of developmental disabilities that affect social interaction and communication and are characterized by repetitive behaviors. There is now a large body of evidence that suggests a complex role of genetics in ASD, in which many different loci are involved. Although many current population-scale genomic studies have been demonstrably fruitful, these studies generally focus on analyzing a limited part of the genome or use a limited set of bioinformatics tools. These limitations preclude the analysis of genome-wide perturbations that may contribute to the development and severity of ASD-related phenotypes. To overcome these limitations, we have developed and utilized an integrative clinical and bioinformatics pipeline for generating a more complete and reliable set of genomic variants for downstream analyses. Our study focuses on the analysis of three simplex autism families consisting of one affected child, unaffected parents, and one unaffected sibling. All members were clinically evaluated and widely phenotyped. Genotyping arrays and whole genome sequencing were performed on each member, and the resulting sequencing data were analyzed using a variety of available bioinformatics tools. We searched for rare variants of putative functional impact that were found to be segregating according to de novo, autosomal recessive, X-linked, mitochondrial, and compound heterozygote transmission models. The resulting candidate variants included three small heterozygous CNVs, a rare heterozygous de novo nonsense mutation in MYBBP1A located within exon 1, and a novel de novo missense variant in LAMB3. Our work demonstrates how more comprehensive analyses that include rich clinical data and whole genome sequencing data can generate reliable results for use in downstream investigations. We are moving to implement our framework for the analysis and study of larger cohorts of families, where statistical rigor can accompany genetic findings.


Molecular identification of Austrobilharzia species parasitizing Cerithidea cingulata (Gastropoda: Potamididae) from Kuwait Bay

Journal of Helminthology ◽

10.1017/s0022149x11000733 ◽

2011 ◽

Vol 86 (4) ◽

pp. 470-478 ◽

Cited By ~ 5

Author(s):

W.Y. Al-Kandari ◽

S.A. Al-Bustan ◽

A.M. Isaac ◽

B.A. George ◽

B.S. Chandy

Keyword(s):

DNA Sequences ◽

Sequence Comparison ◽

Sequence Similarity ◽

Morphological Identification ◽

High Sequence Similarity ◽

Data Set ◽

Kuwait Bay ◽

Causative Agents ◽

Avian Schistosomes ◽

Combined Data

Avian schistosomes belonging to the genus Austrobilharzia (Digenea: Schistosomatidae) are among the causative agents of cercarial dermatitis in humans. In this paper, ribosomal and mitochondrial DNA sequences were used to study schistosome cercariae from Kuwait Bay that have been identified morphologically as Austrobilharzia sp. Sequence comparison of the ribosomal DNA (rDNA) 28S and 18S regions of the collected schistosome cercariae with corresponding sequences of other schistosomes in GenBank revealed high sequence similarity. This confirmed the morphological identification of schistosome cercariae from Kuwait Bay as belonging to the genus Austrobilharzia. The finding was further supported by the phylogenetic tree that was constructed based on the combined data set 18S-28S-mitochondrial cytochrome oxidase I (mtCO1) sequences in which Austrobilharzia sp. clustered with A. terrigalensis and A. variglandis. Sequence comparison of the Austrobilharzia sp. from Kuwait Bay with A. variglandis and A. terrigalensis based on mtCO1 showed a variation of 10% and 11%, respectively. Since the sequence variation in the mtCO1 was within the interspecific range among trematodes, it seems that the Austrobilharzia species from Kuwait Bay is different from the two species reported in GenBank, A. terrigalensis and A. variglandis.


Cyrius: accurate CYP2D6 genotyping using whole genome sequencing data

10.1101/2020.05.05.077966 ◽

2020 ◽

Author(s):

Xiao Chen ◽

Fei Shen ◽

Nina Gonzaludo ◽

Alka Malhotra ◽

Cande Rogert ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Sequence Similarity ◽

Ethnically Diverse ◽

Haplotype Frequency ◽

Superior Performance ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Structural Variants ◽

Sequencing Data

Responsible for the metabolism of 25% of clinically used drugs, CYP2D6 is a critical component of personalized medicine initiatives. Genotyping CYP2D6 is challenging due to sequence similarity with its pseudogene paralog CYP2D7 and a high number and variety of common structural variants (SVs). Here we describe a novel bioinformatics method, Cyrius, that accurately genotypes CYP2D6 using whole-genome sequencing (WGS) data. We show that Cyrius has superior performance (96.5% concordance with truth genotypes) compared to existing methods (84-86.8%). After implementing the improvements identified from the comparison against the truth data, Cyrius’s accuracy has since been improved to 99.3%. Using Cyrius, we built a haplotype frequency database from 2504 ethnically diverse samples and estimate that SV-containing star alleles are more frequent than previously reported. Cyrius will be an important tool to incorporate pharmacogenomics in WGS-based precision medicine initiatives.


A binning tool to reconstruct viral haplotypes from assembled contigs

10.1101/704288 ◽

2019 ◽

Author(s):

Jiao Chen ◽

Jiayu Shang ◽

Jianrong Wang ◽

Yanni Sun

Keyword(s):

Genetic Diversity ◽

RNA Viruses ◽

Sequence Similarity ◽

Biological Properties ◽

Sequencing Data ◽

High Sequence Similarity ◽

Effective Prevention ◽

Next Generation Sequencing Technology ◽

Sequence Composition ◽

Genome Scale

Motivation: Infections by RNA viruses such as influenza and HIV still pose a serious threat to human health despite extensive research on viral diseases. One challenge for producing effective prevention and treatment strategies is high intra-species genetic diversity. As different strains may have different biological properties, characterizing the genetic diversity is thus important to vaccine and drug design. Next-generation sequencing technology enables comprehensive characterization of both known and novel strains and has been widely adopted for sequencing viral populations. However, genome-scale reconstruction of haplotypes is still a challenging problem. In particular, haplotype assembly programs often produce contigs rather than full genomes. As a mutation in one gene can mask the phenotypic effects of a mutation at another locus, clustering these contigs into genome-scale haplotypes is still needed. Results: We developed a contig binning tool, VirBin, which clusters contigs into different groups so that each group represents a haplotype. Commonly used features based on sequence composition and contig coverage cannot effectively distinguish viral haplotypes because of their high sequence similarity and heterogeneous sequencing coverage for RNA viruses. VirBin applied prototype-based clustering to cluster regions that are more likely to contain mutations specific to a haplotype. The tool was tested on multiple simulated sequencing data sets with different haplotype abundance distributions and contig sizes, and also on mock quasispecies sequencing data. The benchmark results with other contig binning tools demonstrated the superior sensitivity and precision of VirBin in contig binning for viral haplotype reconstruction. Availability: https://github.com/chjiao/[email protected]
