Analysis of B-cell receptor repertoires in COVID-19 patients using deep embedded representations of protein sequences

2021 ◽  
Author(s):  
Inyoung Kim ◽  
Sang Yoon Byun ◽  
Sangyeup Kim ◽  
Sangyoon Choi ◽  
Jinsung Noh ◽  
...  

Analyzing B-cell receptor (BCR) repertoires is immensely useful in evaluating one's immunological status. Conventionally, repertoire analysis methods have focused on comprehensive assessment of clonal compositions, including V(D)J segment usage, nucleotide insertion/deletion, and amino acid distribution. Here, we introduce a novel computational approach that applies deep-learning-based protein embedding techniques to analyze BCR repertoires. By selecting the most frequently occurring BCR sequences in a given repertoire and computing the sum of the vector representations of these sequences, we represent an entire repertoire as a 100-dimensional vector and eventually as a single data point in vector space. We demonstrate that our new approach enables us not only to accurately cluster repertoires of COVID-19 patients and healthy subjects, but also to efficiently track minute changes in immunity conditions as patients undergo a course of treatment over time. Furthermore, using the distributed representations, we successfully trained an XGBoost classification model that achieved a mean accuracy rate of over 87% given a repertoire of CDR3 sequences.
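The repertoire-to-vector step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the embedding model, so `kmer_vector` is a hypothetical stand-in that maps each 3-mer to a deterministic pseudo-random vector, and the function names, the choice of 3-mers, and the `top_n` parameter are all assumptions made for illustration. Only the overall shape of the procedure comes from the text: pick the most frequent sequences, sum their vector representations, and obtain one 100-dimensional vector per repertoire.

```python
from collections import Counter
import hashlib

import numpy as np

DIM = 100  # the abstract states that a repertoire becomes a 100-dimensional vector


def kmer_vector(kmer: str, dim: int = DIM) -> np.ndarray:
    """Hypothetical stand-in for a learned protein embedding.

    Returns a deterministic pseudo-random vector per k-mer so the sketch
    runs without a trained model; a real pipeline would look the k-mer up
    in a pretrained embedding table instead.
    """
    seed = int.from_bytes(hashlib.sha256(kmer.encode()).digest()[:8], "big")
    return np.random.default_rng(seed).standard_normal(dim)


def embed_sequence(seq: str, k: int = 3, dim: int = DIM) -> np.ndarray:
    """Embed one amino acid sequence as the sum of its overlapping k-mer vectors."""
    vec = np.zeros(dim)
    for i in range(len(seq) - k + 1):
        vec += kmer_vector(seq[i:i + k], dim)
    return vec


def embed_repertoire(sequences, top_n: int = 2, dim: int = DIM) -> np.ndarray:
    """Sum the embeddings of the top_n most frequent sequences in a repertoire."""
    most_frequent = [seq for seq, _ in Counter(sequences).most_common(top_n)]
    vec = np.zeros(dim)
    for seq in most_frequent:
        vec += embed_sequence(seq, dim=dim)
    return vec


# Toy repertoire of made-up CDR3-like strings (not real data).
repertoire = ["CARDYW"] * 3 + ["CARGGW"] * 2 + ["CASSLW"]
repertoire_vector = embed_repertoire(repertoire, top_n=2)
```

Each repertoire then becomes a single row of a feature matrix, which is the form a clustering method or a gradient-boosted classifier such as XGBoost would consume. The sum here is unweighted, following the abstract's wording; whether the original work weights sequences by frequency is not stated in the text.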

2021 ◽  
Author(s):  
Inyoung Kim ◽  
Sang Yoon Byun ◽  
Sangyeup Kim ◽  
Sangyoon Choi ◽  
Jinsung Noh ◽  
...  

Abstract: Analyzing B cell receptor (BCR) repertoires is immensely useful in evaluating one’s immunological status. Conventionally, repertoire analysis methods have focused on comprehensive assessments of clonal compositions, including V(D)J segment usage, nucleotide insertions/deletions, and amino acid distributions. Here, we introduce a novel computational approach that applies deep-learning-based protein embedding techniques to analyze BCR repertoires. By selecting the most frequently occurring BCR sequences in a given repertoire and computing the sum of the vector representations of these sequences, we represent an entire repertoire as a 100-dimensional vector and eventually as a single data point in vector space. We demonstrate that this new approach enables us to not only accurately cluster BCR repertoires of coronavirus disease 2019 (COVID-19) patients and healthy subjects but also efficiently track minute changes in immune status over time as patients undergo treatment. Furthermore, using the distributed representations, we successfully trained an XGBoost classification model that achieved a mean accuracy rate of over 87% given a repertoire of CDR3 sequences.


10.2741/2217 ◽  
2007 ◽  
Vol 12 (1) ◽  
pp. 2136 ◽  
Author(s):  
Hilla Azulay-Debby
