Performance-optimized partitioning of clonotypes from high-throughput immunoglobulin repertoire sequencing data

2017 ◽  
Author(s):  
Nima Nouri ◽  
Steven H. Kleinstein

Abstract. Motivation: During adaptive immune responses, activated B cells expand and undergo somatic hypermutation of their immunoglobulin (Ig) receptor, forming a clone of diversified cells that can be related back to a common ancestor. Identification of B cell clonotypes from high-throughput Adaptive Immune Receptor Repertoire sequencing (AIRR-seq) data relies on computational analysis. Recently, we proposed an automated method to partition sequences into clonal groups based on single-linkage clustering of the Ig receptor junction region with a length-normalized Hamming distance metric. This method could identify clonally related sequences with high confidence on several benchmark experimental and simulated data sets. However, this approach was computationally expensive and unable to provide estimates of accuracy for new data. Here, a new method is presented that addresses this computational bottleneck and also provides a study-specific estimation of performance, including sensitivity and specificity. The method uses a finite mixture model fitting procedure to learn the parameters of two univariate curves that fit the bimodal distribution of the distance vector between pairs of sequences. These distributions are used to estimate the performance of different threshold choices for partitioning sequences into clonotypes. These performance estimates are validated using simulated and experimental datasets. With this method, clonotypes can be identified from AIRR-seq data with sensitivity and specificity profiles that are user-defined based on the overall goals of the study. Availability: Source code is freely available at the Immcantation Portal: www.immcantation.com under the CC BY-SA 4.0 license.
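The threshold-performance idea in this abstract can be illustrated with a minimal sketch. It assumes the two mixture components have already been fitted and are Gaussian, and that the lower-distance mode corresponds to clonally related sequences; the abstract only says "two univariate curves", so the distribution family, the parameter values, and the function name `threshold_performance` are all hypothetical.

```python
from statistics import NormalDist


def threshold_performance(threshold, related, unrelated):
    """Estimate sensitivity and specificity of a distance cutoff.

    `related` and `unrelated` are the two fitted mixture components
    (assumed Gaussian in this sketch). Pairs whose distance is at or
    below `threshold` are called clonally related.
    """
    # Fraction of truly related pairs that fall below the cutoff.
    sensitivity = related.cdf(threshold)
    # Fraction of unrelated pairs that stay above the cutoff.
    specificity = 1.0 - unrelated.cdf(threshold)
    return sensitivity, specificity


# Hypothetical fitted components; the values are illustrative only.
related = NormalDist(mu=0.05, sigma=0.03)
unrelated = NormalDist(mu=0.40, sigma=0.08)

for t in (0.10, 0.15, 0.20):
    sens, spec = threshold_performance(t, related, unrelated)
    print(f"threshold={t:.2f} sensitivity={sens:.3f} specificity={spec:.3f}")
```

Raising the threshold trades specificity for sensitivity, which is how a user could pick a cutoff to match the goals of a study.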

2015 ◽  
Author(s):  
Rahul Reddy

As RNA-Seq and other high-throughput sequencing methods grow in use and remain critical for gene expression studies, technical variability in count data impedes differential expression studies, comparison of data across samples and experiments, and reproduction of results. Studies like Dillies et al. (2013) compare several between-lane normalization methods involving scaling factors, while Hansen et al. (2012) and Risso et al. (2014) propose methods that correct for sample-specific bias or use sets of control genes to isolate and remove technical variability. This paper evaluates four normalization methods in terms of reducing intra-group, technical variability and facilitating differential expression analysis or other research where the biological, inter-group variability is of interest. To this end, the four methods were evaluated in differential expression analysis between data from Pickrell et al. (2010) and Montgomery et al. (2010) and between simulated data modeled on these two datasets. Though the between-lane scaling factor methods perform worse on real data sets, they are much stronger for simulated data. We cannot reject the recommendation of Dillies et al. to use TMM and DESeq normalization, but further study of power to detect effects of different sizes under each normalization method is merited.
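As an illustration of what a between-sample scaling factor looks like, here is a minimal sketch of median-of-ratios size factors, the approach commonly associated with DESeq normalization. It is not taken from this paper; the function name `size_factors` and the toy counts are hypothetical.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios scaling factors, one per sample.

    `counts` is a list of genes, each a list of per-sample read counts.
    Genes with a zero count in any sample are skipped because their
    geometric mean is zero.
    """
    n_samples = len(counts[0])
    kept = [row for row in counts if all(c > 0 for c in row)]
    if not kept:
        raise ValueError("no gene has non-zero counts in every sample")
    # Log of each gene's geometric mean across samples.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in kept]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lg for row, lg in zip(kept, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Toy data: sample 2 was sequenced twice as deeply as sample 1.
toy_counts = [[10, 20], [100, 200], [5, 10], [0, 7]]
toy_factors = size_factors(toy_counts)
normalized = [[c / f for c, f in zip(row, toy_factors)] for row in toy_counts]
```

Dividing each count by its sample's factor puts the two toy samples on a common scale.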


2019 ◽  
Author(s):  
Nima Nouri ◽  
Steven H. Kleinstein

Abstract. Motivation: Adaptive immune receptor repertoire sequencing (AIRR-Seq) offers the possibility of identifying and tracking B cell clonal expansions during adaptive immune responses. Members of a B cell clone are descended from a common ancestor and share the same initial V(D)J rearrangement, but their B cell receptor (BCR) sequences may differ due to the accumulation of somatic hypermutations (SHMs). Clonal relationships are learned from AIRR-seq data by analyzing the BCR sequence, with the most common methods focused on the highly diverse junction region. However, clonally related cells often share SHMs that have accumulated during affinity maturation. Here, we investigate whether shared SHMs in the V and J segments of the BCR can be leveraged along with the junction sequence to improve the ability to identify clonally related sequences. We develop independent distance functions that capture junction similarity and shared mutations, and combine these in a spectral clustering framework to infer the BCR clonal relationships. Using both simulated and experimental data, we show that this model improves both the sensitivity and specificity for identifying B cell clones. Availability: Source code for this method is freely available in the SCOPer (Spectral Clustering for clOne Partitioning) R package (version 0.2 or later) in the Immcantation framework: www.immcantation.org under the CC BY-SA 4.0 license.


2020 ◽  
Vol 38 (15_suppl) ◽  
pp. e20001-e20001
Author(s):  
Timothy Looney ◽  
Graeme Quest ◽  
Harriet Feilotter ◽  
Zadie Davis

e20001 Background: B cell somatic hypermutation (SHM) and class switch recombination (CSR) are mechanistically related but distinct processes requiring precisely targeted generation and repair of single and double strand DNA breaks. Within the context of chronic lymphocytic leukemia (CLL), the presence of ongoing SHM or CSR may reveal the functionality of DNA damage repair pathways, with potential relevance to therapeutic strategies involving the generation of DNA breaks or inhibition of DNA repair machinery. Here we apply clonal lineage analysis of IGH chain sequencing data to evaluate CSR and SHM in a cohort of CLL and splenic marginal zone lymphoma (SMZL) research samples. We present evidence of ongoing CSR and SHM in a significant subset of samples. Methods: Multiplex primers targeting the IGH framework 1 region and isotype region (Oncomine IGH-LR assay; detection of all nine isotypes) were used for IGH repertoire sequencing via the Ion Gene Studio S5 from 25ng peripheral blood total RNA derived from 63 individuals with CLL and 4 individuals with SMZL. Clonotyping and clonal lineage analysis were performed by Ion Reporter, whereby clonal lineages are defined as sets of unique rearrangements having a shared variable and joining gene, the same CDR3 length, and a minimum CDR3 nucleotide similarity of 85%. Ongoing CSR was defined as the presence of IgM/IgD and at least one switched isotype (IgG, IgA, or IgE), or a combination of switched isotypes, within the same lineage. Ongoing SHM was defined as the presence of subclones that differ within the VDJ region sequence compared to other clonal lineage members. Results: 11/68 cases showed evidence of ongoing CSR or SHM. Of the 57 cases showing no evidence of ongoing CSR or SHM, variable gene mutation analysis revealed the presence of three distinct subgroups having either no SHM, intermediate SHM (average 98% sequence identity) or high SHM (< 94% identity). 3 of 4 SMZL cases showed evidence of ongoing CSR or SHM.
Conclusions: These results reveal previously underappreciated heterogeneity within CLL and suggest the subdivision of CLL based on a combination of IGHV mutation level and presence of ongoing SHM or CSR. The described heterogeneity may serve as a valuable criterion for stratifying CLL patients in the future.
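The lineage and ongoing-CSR definitions in the Methods above can be sketched as follows. This is not the Ion Reporter implementation: the record fields, the single-linkage grouping within each V/J/CDR3-length bucket, and the reading of "a combination of switched isotypes" as two or more distinct switched labels are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

UNSWITCHED = {"IgM", "IgD"}


def cdr3_similarity(a, b):
    """Fraction of matching positions between two equal-length, non-empty CDR3s."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def _find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def clonal_lineages(rearrangements, min_similarity=0.85):
    """Group records sharing V gene, J gene and CDR3 length, then link
    CDR3s whose nucleotide similarity reaches `min_similarity`.
    Linking is single-linkage here, which is an assumption."""
    buckets = defaultdict(list)
    for r in rearrangements:
        buckets[(r["v_gene"], r["j_gene"], len(r["cdr3"]))].append(r)
    lineages = []
    for members in buckets.values():
        parent = list(range(len(members)))
        for i, j in combinations(range(len(members)), 2):
            if cdr3_similarity(members[i]["cdr3"], members[j]["cdr3"]) >= min_similarity:
                parent[_find(parent, i)] = _find(parent, j)
        groups = defaultdict(list)
        for i, m in enumerate(members):
            groups[_find(parent, i)].append(m)
        lineages.extend(groups.values())
    return lineages


def has_ongoing_csr(lineage):
    """IgM/IgD plus a switched isotype, or several switched isotypes."""
    isotypes = {r["isotype"] for r in lineage}
    switched = isotypes - UNSWITCHED
    return bool(switched) and (bool(isotypes & UNSWITCHED) or len(switched) >= 2)


# Hypothetical records, for illustration only.
records = [
    {"v_gene": "IGHV1-69", "j_gene": "IGHJ4", "cdr3": "A" * 20, "isotype": "IgM"},
    {"v_gene": "IGHV1-69", "j_gene": "IGHJ4", "cdr3": "A" * 18 + "TT", "isotype": "IgG1"},
    {"v_gene": "IGHV1-69", "j_gene": "IGHJ4", "cdr3": "C" * 20, "isotype": "IgM"},
    {"v_gene": "IGHV3-23", "j_gene": "IGHJ4", "cdr3": "A" * 20, "isotype": "IgA1"},
]
lineages = clonal_lineages(records)
```

In this toy input the first two records differ at 2 of 20 CDR3 positions (90% similarity), so they form one lineage that carries both IgM and IgG1.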


2013 ◽  
Vol 4 ◽  
Author(s):  
Gur Yaari ◽  
Jason A. Vander Heiden ◽  
Mohamed Uduman ◽  
Daniel Gadala-Maria ◽  
Namita Gupta ◽  
...  

Author(s):  
Alexandre Yahi ◽  
Paul Hoffman ◽  
Margot Brandt ◽  
Pejman Mohammadi ◽  
Nicholas P. Tatonetti ◽  
...  

Abstract. Genome editing experiments are generating an increasing amount of targeted sequencing data with specific mutational patterns indicating the success of the experiments and genotypes of clonal cell lines. We present EdiTyper, a high-throughput command line tool specifically designed for analysis of sequencing data from polyclonal and monoclonal cell populations from CRISPR gene editing. It requires simple inputs of sequencing data and reference sequences, and provides comprehensive outputs including summary statistics, plots, and SAM/BAM alignments. Analysis of simulated data showed that EdiTyper is highly accurate for detection of both single nucleotide mutations and indels, robust to sequencing errors, as well as fast and scalable to large experimental batches. EdiTyper is available on GitHub (https://github.com/LappalainenLab/edityper) under the MIT license.


2015 ◽  
Author(s):  
Paul D Blischak ◽  
Laura S Kubatko ◽  
Andrea D Wolfe

Despite the increasing opportunity to collect large-scale data sets for population genomic analyses, the use of high throughput sequencing to study populations of polyploids has seen little application. This is due in large part to problems associated with determining allele copy number in the genotypes of polyploid individuals (allelic dosage uncertainty, ADU), which complicates the calculation of important quantities such as allele frequencies. Here we describe a statistical model to estimate biallelic SNP frequencies in a population of autopolyploids using high throughput sequencing data in the form of read counts. We bridge the gap from data collection (using restriction enzyme based techniques [e.g., GBS, RADseq]) to allele frequency estimation in a unified inferential framework using a hierarchical Bayesian model to sum over genotype uncertainty. Simulated data sets were generated under various conditions for tetraploid, hexaploid and octoploid populations to evaluate the model's performance and to help guide the collection of empirical data. We also provide an implementation of our model in the R package POLYFREQS and demonstrate its use with two example analyses that investigate (i) levels of expected and observed heterozygosity and (ii) model adequacy. Our simulations show that the number of individuals sampled from a population has a greater impact on estimation error than sequencing coverage. The example analyses also show that our model and software can be used to make inferences beyond the estimation of allele frequencies for autopolyploids by providing assessments of model adequacy and estimates of heterozygosity.
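To make "summing over genotype uncertainty" concrete, the sketch below computes the probability of observed read counts for one individual at one biallelic SNP, marginalized over every possible allele dosage. It is one plausible reading of the abstract, not the POLYFREQS implementation: the binomial genotype prior, the symmetric per-read error term, and all names and default values are assumptions.

```python
from math import comb


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def read_likelihood(ref_reads, total_reads, allele_freq, ploidy, error=0.01):
    """P(ref_reads | total_reads, allele_freq), summed over allele dosage.

    Assumed model: dosage g ~ Binomial(ploidy, allele_freq), and each read
    shows the reference allele with probability g/ploidy, adjusted for a
    symmetric sequencing error rate.
    """
    total = 0.0
    for g in range(ploidy + 1):
        prior = binom_pmf(g, ploidy, allele_freq)
        frac = g / ploidy
        p_ref_read = frac * (1 - error) + (1 - frac) * error
        total += prior * binom_pmf(ref_reads, total_reads, p_ref_read)
    return total


# Tetraploid individual, 12 of 20 reads carrying the reference allele.
lik_mid = read_likelihood(12, 20, allele_freq=0.5, ploidy=4)
lik_low = read_likelihood(12, 20, allele_freq=0.05, ploidy=4)
```

Because dosage is never fixed to a single value, individuals with low coverage still contribute information without forcing a hard genotype call.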


2021 ◽  
Vol 12 ◽  
Author(s):  
Eva-Stina Edholm ◽  
Christopher Graham Fenton ◽  
Stanislas Mondot ◽  
Ruth H. Paulssen ◽  
Marie-Paule Lefranc ◽  
...  

In jawed vertebrates, two major T cell populations have been characterized. They are defined as α/β or γ/δ T cells, based on the expressed T cell receptor. Salmonids (family Salmonidae) include two key teleost species for aquaculture, rainbow trout (Oncorhynchus mykiss) and Atlantic salmon (Salmo salar), which constitute important models for fish immunology and important targets for vaccine development. The growing interest in deciphering the dynamics of adaptive immune responses against pathogens or vaccines has resulted in recent efforts to sequence the immunoglobulin (IG) or antibody and T cell receptor (TR) repertoires in these species. In this context, establishing a comprehensive and coherent locus annotation is the fundamental basis for the analysis of high-throughput repertoire sequencing data. We therefore decided to revisit the description and annotation of the TRA/TRD locus in Atlantic salmon and two strains of rainbow trout (Swanson and Arlee) using the now available high-quality genome assemblies. Phylogenetic analysis of functional TRA/TRD V genes from these three genomes led to the definition of 25 subgroups shared by both species, some with particular features. A total of 128 TRAJ genes were identified in Salmo, the majority with a close counterpart in Oncorhynchus. Analysis of the expressed TRA repertoire indicates that most TRAV gene subgroups are expressed at the mucosal and systemic levels. The present work on TRA/TRD locus annotation, along with the analysis of TRA repertoire sequencing data, shows the feasibility and advantages of a common salmonid TRA/TRD nomenclature that allows an accurate annotation and analysis of high-throughput sequencing results across salmonid T cell subsets.


2016 ◽  
Vol 2016 ◽  
pp. 1-10 ◽  
Author(s):  
Qiang Yu ◽  
Hongwei Huo ◽  
Dazheng Feng

Identifying conserved patterns in DNA sequences, namely, motif discovery, is an important and challenging computational task. A high-throughput sequencing data set, which contains hundreds of sequences or more, helps improve the identification accuracy of motif discovery but requires even higher computing performance. To efficiently identify motifs in large DNA data sets, a new algorithm called PairMotifChIP is proposed, which extracts and combines pairs of l-mers in the input that have a relatively small Hamming distance. In particular, a method for rapidly extracting pairs of l-mers is designed, which can be used not only for PairMotifChIP, but also for other DNA data mining tasks with the same demand. Experimental results on simulated data show that the proposed algorithm can find motifs successfully and runs faster than state-of-the-art motif discovery algorithms. Furthermore, the validity of the proposed algorithm has been verified on real data.
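The basic operation named in this abstract, finding pairs of l-mers with a small Hamming distance, can be shown with a brute-force sketch. This is deliberately not the rapid extraction method the paper designs, whose details the abstract does not give; restricting pairs to l-mers from different sequences, and all names and inputs, are assumptions.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def lmers(seq, l):
    """All (start, substring) windows of length l in seq."""
    return [(i, seq[i:i + l]) for i in range(len(seq) - l + 1)]


def close_lmer_pairs(sequences, l, max_distance):
    """Brute-force scan for l-mer pairs, taken from two different input
    sequences, whose Hamming distance is at most max_distance."""
    pairs = []
    for (s1, seq1), (s2, seq2) in combinations(enumerate(sequences), 2):
        for i, a in lmers(seq1, l):
            for j, b in lmers(seq2, l):
                if hamming(a, b) <= max_distance:
                    pairs.append(((s1, i, a), (s2, j, b)))
    return pairs


# Hypothetical toy input.
toy_sequences = ["TTACGTACGG", "CCACGAACGT"]
toy_pairs = close_lmer_pairs(toy_sequences, l=6, max_distance=1)
```

Each returned pair records the sequence index, start position, and l-mer of both members. The scan compares every window against every other, so its cost grows quadratically with input size, which is the expense a faster extraction method would aim to avoid.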

