Scalable Sequence Clustering for Large-Scale Immune Repertoire Analysis

Author(s):  
Prem Bhusal ◽  
A K M Mubashwir Alam ◽  
Keke Chen ◽  
Ning Jiang ◽  
Jun Xiao
2017 ◽  
Vol 18 (1) ◽  
Author(s):  
Anke Fähnrich ◽  
Moritz Krebbel ◽  
Normann Decker ◽  
Martin Leucker ◽  
Felix D. Lange ◽  
...  

2018 ◽  
Vol 9 (1) ◽  
Author(s):  
Quentin Marcou ◽  
Thierry Mora ◽  
Aleksandra M. Walczak

2018 ◽  
Author(s):  
Laura López-Santibáñez-Jácome ◽  
Selma Eréndira Avendaño-Vázquez ◽  
Carlos Fabián Flores-Jasso

With the advent of high-throughput sequencing of immunoglobulins (Ig-Seq), the understanding of antibody repertoires and their dynamics among individuals and populations has become an exciting area of research. There is an increasing number of computational tools that aid in every step of immune repertoire characterization. However, since not all tools function identically, every pipeline has its own rationale and capabilities, creating a rich blend of useful features that may appear intimidating to newcomer laboratories wishing to plunge into immune repertoire analysis to expand and improve their research; as a result, the strengths and differences of each pipeline may not be evident. In this review we provide an organized list of the current set of computational tools, focusing on their most attractive features and differences for characterizing antibody repertoires, so that the reader can better decide on a strategic approach to the experimental design and computational analysis of immune repertoires.



2017 ◽  
Author(s):  
Quentin Marcou ◽  
Thierry Mora ◽  
Aleksandra M. Walczak

High-throughput immune repertoire sequencing promises to lead to new statistical diagnostic tools for medicine and biology. Successful implementation of these methods requires correct characterization, analysis and interpretation of these datasets. We present IGoR, a new comprehensive tool that takes B- or T-cell receptor sequence reads and quantitatively characterizes the statistics of receptor generation from both cDNA and gDNA. It probabilistically annotates sequences, and its modular structure can be used to investigate models of increasing biological complexity for different organisms. For B cells, IGoR returns the hypermutation statistics, which we use to reveal co-localization of hypermutations along the sequence. We demonstrate that IGoR outperforms existing tools in accuracy and estimate the sample sizes needed for reliable repertoire characterization.


Author(s):  
Ming Cao ◽  
Qinke Peng ◽  
Ze-Gang Wei ◽  
Fei Liu ◽  
Yi-Fan Hou

The development of high-throughput technologies has produced increasing amounts of sequence data and an increasing need for efficient clustering algorithms that can process massive volumes of sequencing data for downstream analysis. Heuristic clustering methods are widely applied for sequence clustering because of their low computational complexity. Although numerous heuristic clustering methods have been developed, they suffer from two limitations: overestimation of inferred clusters and low clustering sensitivity. To address these issues, we present a new sequence clustering method (edClust) that groups similar sequences using Edlib, a C/C++ library for fast, exact semi-global sequence alignment. edClust was tested on three large-scale sequence databases and compared with several classic heuristic clustering methods, such as UCLUST, CD-HIT, and VSEARCH. Evaluations based on the metrics of cluster number and seed sensitivity (SS) demonstrate that edClust produces fewer clusters than the other methods and that its SS is higher than theirs. The source code of edClust is available from https://github.com/zhang134/EdClust.git under the GNU GPL license.
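
The general idea of seed-based heuristic clustering with an edit-distance criterion can be illustrated with a small sketch. This is not edClust itself: it assumes a plain global Levenshtein distance as a stand-in for Edlib's semi-global alignment, a longest-first greedy pass in the style of classic heuristic tools, and an identity defined as 1 - distance / (length of the longer sequence). The names `edit_distance`, `greedy_cluster`, and the `identity` parameter are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Global Levenshtein distance using a two-row dynamic program.

    Assumption: this replaces the Edlib semi-global alignment that the
    abstract describes; it is only meant to keep the sketch dependency-free.
    """
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def greedy_cluster(seqs, identity=0.9):
    """Assign each sequence to the first seed it matches, else make it a seed.

    Returns a dict mapping seed index -> list of member indices (seed included).
    """
    # Longest sequences are visited first so that they become seeds
    # (sorted() is stable, so ties keep their input order).
    order = sorted(range(len(seqs)), key=lambda i: -len(seqs[i]))
    clusters = {}
    for i in order:
        for seed in clusters:
            longer = max(len(seqs[i]), len(seqs[seed]))
            dist = edit_distance(seqs[i], seqs[seed])
            if longer == 0 or 1 - dist / longer >= identity:
                clusters[seed].append(i)
                break
        else:
            clusters[i] = [i]  # no seed was close enough: start a new cluster
    return clusters


if __name__ == "__main__":
    reads = ["ACGTACGTAC", "ACGTACGTAA", "TTTTTTTTTT", "ACGTACGTAC"]
    print(greedy_cluster(reads, identity=0.8))
```

Because every sequence is compared against seeds only, the number of alignments depends on how many clusters are formed, which is why the abstract's concern about overestimating clusters also bears on run time.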


2017 ◽  
Vol 15 (06) ◽  
pp. 1740006 ◽  
Author(s):  
Mohammad Arifur Rahman ◽  
Nathan LaPierre ◽  
Huzefa Rangwala ◽  
Daniel Barbara

Metagenomics is the collective sequencing of co-existing microbial communities, which are ubiquitous across various clinical and ecological environments. Due to the large volume of random short sequences (reads) obtained from community sequences, analyzing the diversity, abundance and functions of different organisms within these communities is a challenging task. We present a fast and scalable clustering algorithm for analyzing large-scale metagenome sequence data. Our approach achieves efficiency by partitioning the large number of sequence reads into groups (called canopies) using hashing. These canopies are then refined by using state-of-the-art sequence clustering algorithms. This canopy-clustering (CC) algorithm can be used as a pre-processing phase for computationally expensive clustering algorithms. We use and compare three hashing schemes for canopy construction with five popular and state-of-the-art sequence clustering methods. We evaluate our clustering algorithm on synthetic and real-world 16S and whole metagenome benchmarks. We demonstrate the ability of our proposed approach to determine meaningful Operational Taxonomic Units (OTU) and observe significant speedup in run time when compared to different clustering algorithms. We also make our source code publicly available on GitHub.
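
The two-phase structure described above (hash reads into canopies, then refine each canopy with a more expensive method) can be sketched as follows. The abstract does not say which three hashing schemes the authors use, so this sketch assumes a simple min-hash over k-mers: for each hash seed, the k-mer with the smallest CRC32 value is chosen, and the tuple of chosen k-mers is the canopy key. The function names, the `k` and `seeds` parameters, and the toy `group_identical` refiner are all hypothetical.

```python
import zlib
from collections import defaultdict


def kmers(read, k=4):
    """Set of all overlapping k-mers in a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def canopy_key(read, k=4, seeds=(0, 1)):
    """Min-hash style key: for each seed, the k-mer with the smallest hash.

    zlib.crc32 is used instead of the built-in hash() so that keys are
    reproducible across Python processes.
    """
    ks = kmers(read, k)
    if not ks:
        return (read,)  # read shorter than k: fall back to the read itself
    return tuple(
        min(ks, key=lambda km: (zlib.crc32(km.encode("ascii"), seed), km))
        for seed in seeds
    )


def build_canopies(reads, k=4, seeds=(0, 1)):
    """Phase 1: cheap partition of read indices into canopies."""
    canopies = defaultdict(list)
    for idx, read in enumerate(reads):
        canopies[canopy_key(read, k, seeds)].append(idx)
    return list(canopies.values())


def group_identical(seqs, ids):
    """Toy refiner (placeholder for a real clustering tool): exact duplicates."""
    groups = defaultdict(list)
    for s, i in zip(seqs, ids):
        groups[s].append(i)
    return list(groups.values())


def canopy_cluster(reads, refine=group_identical, k=4, seeds=(0, 1)):
    """Phase 2: run the expensive refiner inside each canopy only."""
    clusters = []
    for members in build_canopies(reads, k, seeds):
        clusters.extend(refine([reads[i] for i in members], members))
    return clusters


if __name__ == "__main__":
    reads = ["ACGTACGT", "ACGTACGT", "TTTTGGGG", "ACGTACGT"]
    print(build_canopies(reads))
    print(canopy_cluster(reads))
```

The refiner is passed in as a callable because the abstract positions canopy construction as a pre-processing phase for other clustering algorithms; any method that accepts a list of reads and their indices could be substituted for the placeholder.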


GigaScience ◽  
2021 ◽  
Vol 10 (9) ◽  
Author(s):  
Haocheng Ye ◽  
Lin Cheng ◽  
Bin Ju ◽  
Gang Xu ◽  
Yang Liu ◽  
...  

Background: B-cell immunoglobulin repertoires with paired heavy and light chains can be determined by means of 10X single-cell V(D)J sequencing. Precise and quick analysis of 10X single-cell immunoglobulin repertoires remains a challenge owing to the high diversity of immunoglobulin repertoires and a lack of specialized software that can analyze such diverse data. Findings: In this study, specialized software for 10X single-cell immunoglobulin repertoire analysis was developed. SCIGA (Single-Cell Immunoglobulin Repertoire Analysis) is an easy-to-use pipeline that performs read trimming, immunoglobulin sequence assembly and annotation, heavy and light chain pairing, statistical analysis, visualization, and multiple sample integration analysis, all with a 1-line command. SCIGA was then used to profile the single-cell immunoglobulin repertoires of 9 patients with coronavirus disease 2019 (COVID-19). Four neutralizing antibodies against severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) were identified from these repertoires. Conclusions: SCIGA provides a complete and quick analysis for 10X single-cell V(D)J sequencing datasets. It can help researchers to interpret B-cell immunoglobulin repertoires with paired heavy and light chains.

