Application of BERT to Enable Gene Classification Based on Clinical Evidence

2020 ◽  
Vol 2020 ◽  
pp. 1-13
Author(s):  
Yuhan Su ◽  
Hongxin Xiang ◽  
Haotian Xie ◽  
Yong Yu ◽  
Shiyan Dong ◽  
...  

The identification of profiled cancer-related genes plays an essential role in cancer diagnosis and treatment. Based on literature research, the classification of genetic mutations is still performed manually today. Manual classification of genetic mutations is pathologist-dependent, subjective, and time-consuming. To improve the accuracy of clinical interpretation, scientists have proposed computational approaches for the automatic analysis of mutations alongside the advent of next-generation sequencing technologies. Nevertheless, challenges such as multiple classes, textual complexity, redundant descriptions, and inconsistent interpretation have limited the development of such algorithms. To overcome these difficulties, we adapted a deep learning method named Bidirectional Encoder Representations from Transformers (BERT) to classify genetic mutations based on text evidence from an annotated database. During training, three challenging features were addressed: the extreme length of the texts, biased data presentation, and high repeatability. Finally, the BERT+abstract model demonstrates satisfactory results, with 0.80 logarithmic loss, 0.6837 recall, and 0.705 F-measure. It is feasible for BERT to classify genomic mutation text within literature-based datasets. Consequently, BERT is a practical tool for facilitating and significantly speeding up cancer research toward tumor progression, diagnosis, and the design of more precise and effective treatments.
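The metrics reported above (logarithmic loss, recall, F-measure) are standard multiclass classification measures. As a minimal sketch of how they are computed (the paper does not publish its evaluation script, so the function names here are illustrative):

```python
import math

def log_loss(y_true, probs, eps=1e-15):
    """Multiclass logarithmic loss: mean negative log of the
    probability the model assigned to the true class."""
    total = 0.0
    for label, p in zip(y_true, probs):
        total += -math.log(max(min(p[label], 1 - eps), eps))
    return total / len(y_true)

def macro_recall_f1(y_true, y_pred, n_classes):
    """Macro-averaged recall and F-measure over all classes."""
    recalls, f1s = [], []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        rec = tp / (tp + fn) if tp + fn else 0.0
        prec = tp / (tp + fp) if tp + fp else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        recalls.append(rec)
        f1s.append(f1)
    return sum(recalls) / n_classes, sum(f1s) / n_classes
```

Lower log loss rewards confident correct predictions; macro averaging weights each mutation class equally, which matters when the class distribution is biased, as the abstract notes.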

2019 ◽  
Author(s):  
Eric Prince ◽  
Todd C. Hankinson

Abstract High-throughput data is commonplace in biomedical research, as seen with technologies such as single-cell RNA sequencing (scRNA-seq) and other next-generation sequencing technologies. As these techniques continue to be increasingly utilized, it is critical to have analysis tools that can identify meaningful, complex relationships between variables (i.e., in the case of scRNA-seq, genes) in a way that is free of human bias. Moreover, it is equally paramount that both linear and non-linear (i.e., one-to-many) variable relationships be considered when contrasting datasets. HD Spot is a deep learning-based framework that generates an optimal interpretable classifier for a given high-throughput dataset using a simple genetic algorithm as well as an autoencoder-to-classifier transfer learning approach. Using four unique publicly available scRNA-seq datasets with published ground truth, we demonstrate the robustness of HD Spot and its ability to identify ontologically accurate gene lists for a given data subset. HD Spot serves as a bioinformatic tool that allows novice and advanced analysts to gain complex insight into their respective datasets, enabling the development of novel hypotheses.
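The "simple genetic algorithm" component can be illustrated in miniature. The sketch below is not HD Spot's code; it is a generic genetic algorithm over bit-strings with tournament selection, one-point crossover, and bit-flip mutation, which is the usual shape such a search takes:

```python
import random

def genetic_search(fitness, n_bits=20, pop_size=30, n_gen=60,
                   mut_rate=0.05, seed=0):
    """Minimal genetic algorithm: tournament selection, one-point
    crossover, bit-flip mutation. Returns the best bit-string found."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)]
           for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(n_gen):
        nxt = []
        while len(nxt) < pop_size:
            # Tournament selection: each parent is the fittest of 3.
            a, b = (max(rng.sample(pop, 3), key=fitness) for _ in range(2))
            cut = rng.randrange(1, n_bits)          # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ (rng.random() < mut_rate) for bit in child]
            nxt.append(child)
        pop = nxt
        best = max(pop + [best], key=fitness)       # keep all-time best
    return best
```

In HD Spot's setting the bit-string would encode a candidate model configuration and the fitness function would score classifier performance; here any fitness callable over a 0/1 list works.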


2021 ◽  
Author(s):  
Christine G. Preston ◽  
Matt W. Wright ◽  
Rao Madhavrao ◽  
Steven M. Harrison ◽  
Jennifer L. Goldstein ◽  
...  

Abstract Background Identification of clinically significant genetic alterations involved in human disease has been dramatically accelerated by developments in next-generation sequencing technologies. However, the infrastructure and accessible comprehensive curation tools necessary for analyzing an individual patient genome and interpreting genetic variants to inform healthcare management have been lacking. Results Here we present the ClinGen Variant Curation Interface (VCI), a global open-source variant classification platform supporting the application of evidence criteria and the classification of variants based on the ACMG/AMP variant classification guidelines. The VCI is among a suite of tools developed by the NIH-funded Clinical Genome Resource (ClinGen) Consortium and supports an FDA-recognized human variant curation process. Essential to this is the ability to enable collaboration and peer review across ClinGen Expert Panels, supporting users in comprehensively identifying, annotating, and sharing relevant evidence while making variant pathogenicity assertions. To facilitate evidence-based improvements in human variant classification, the VCI is publicly available to the genomics community at https://curation.clinicalgenome.org. Navigation workflows guide users in comprehensively applying the ACMG/AMP evidence criteria and documenting provenance for asserted variant classifications. Conclusion The VCI offers a central platform for clinical variant classification that fills a gap in the learning healthcare system and facilitates widespread adoption of standards for clinical curation.


2019 ◽  
Vol 14 (2) ◽  
pp. 157-163
Author(s):  
Majid Hajibaba ◽  
Mohsen Sharifi ◽  
Saeid Gorgin

Background: One of the pivotal challenges in today's genomics research is the fast processing of voluminous data such as that generated by high-throughput next-generation sequencing technologies. On the other hand, BLAST (Basic Local Alignment Search Tool), a long-established and renowned tool in bioinformatics, has shown to be incredibly slow in this regard. Objective: To improve the performance of BLAST in processing voluminous data, we applied a novel memory-aware technique to BLAST for faster parallel processing. Method: We used a master-worker model alongside a memory-aware technique in which the master partitions the whole dataset into equal chunks, one chunk for each worker, and each worker then further splits and formats its allocated chunk according to the size of its memory. Each worker searches every split one by one through a list of queries. Results: We chose a list of queries of different lengths to run intensive searches in a huge database called UniProtKB/TrEMBL. Our experiments show a 20 percent improvement in performance when workers used our proposed memory-aware technique compared to when they were not memory-aware. Comparatively, experiments show an even higher performance improvement, approximately 50 percent, when we applied our memory-aware technique to mpiBLAST. Conclusion: We have shown that memory-awareness in formatting a bulky database when running BLAST can improve performance significantly while preventing unexpected crashes in low-memory environments. Even though distributed computing attempts to mitigate search time by partitioning and distributing database portions, our memory-aware technique alleviates the negative effects of page faults on performance.
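The two-level partitioning described in the Method section can be sketched as follows. This is an illustrative outline, not the authors' implementation; `mem_bytes` and `rec_size` stand in for whatever free-memory and record-size measurements a real worker would take:

```python
def partition(records, n_workers):
    """Master: split the database into one equal-size chunk per worker."""
    chunk = (len(records) + n_workers - 1) // n_workers  # ceiling division
    return [records[i:i + chunk] for i in range(0, len(records), chunk)]

def split_for_memory(chunk, mem_bytes, rec_size):
    """Worker: further split its chunk so each piece fits in memory,
    avoiding page faults while formatting and searching."""
    per_piece = max(1, mem_bytes // rec_size)
    return [chunk[i:i + per_piece] for i in range(0, len(chunk), per_piece)]
```

Each worker would then format one memory-sized piece at a time and run all queries against it before loading the next, so no piece is ever evicted to swap mid-search.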


Pathogens ◽  
2021 ◽  
Vol 10 (2) ◽  
pp. 144
Author(s):  
William Little ◽  
Caroline Black ◽  
Allie Clinton Smith

With the development of next-generation sequencing technologies in recent years, it has been demonstrated that many human infectious processes, including chronic wounds, cystic fibrosis, and otitis media, are associated with a polymicrobial burden. Research has also demonstrated that polymicrobial infections tend to be associated with treatment failure and worse patient prognoses. Despite the importance of the polymicrobial nature of many infection states, the current clinical standard for determining antimicrobial susceptibility in the clinical laboratory is performed exclusively on unimicrobial suspensions. There is a growing body of research demonstrating that microorganisms in a polymicrobial environment can synergize their activities, with a variety of outcomes including changes to their antimicrobial susceptibility through both resistance and tolerance mechanisms. This review highlights the current body of work describing polymicrobial synergism, both inter- and intra-kingdom, that impacts antimicrobial susceptibility. Given the importance of polymicrobial synergism in the clinical environment, a new system for determining antimicrobial susceptibility from polymicrobial infections may significantly impact patient treatment and outcomes.


2020 ◽  
Vol 6 (3) ◽  
pp. 70-73
Author(s):  
Nazila Esmaeili ◽  
Alfredo Illanes ◽  
Axel Boese ◽  
Nikolaos Davaris ◽  
Christoph Arens ◽  
...  

Abstract Longitudinal and perpendicular changes in the blood vessels of the vocal fold have been related to the progression from benign to malignant laryngeal cancer stages. The combination of Contact Endoscopy (CE) and Narrow Band Imaging (NBI) provides intraoperative real-time visualization of vascular patterns in the larynx. The evaluation of these vascular patterns in CE+NBI images is a subjective process, making differentiation between benign and malignant lesions difficult and observer-dependent. The main objective of this work is to compare multi-observer classification versus automatic classification of laryngeal lesions. Six clinicians visually classified CE+NBI images into benign and malignant lesions. For the automatic classification of CE+NBI images, we used an algorithm based on characterizing the level of the vessels' disorder. The results of the manual classification showed that there is no objective interpretation, leading to difficulty in visually distinguishing between benign and malignant lesions. The results of the automatic classification of CE+NBI images, on the other hand, showed the capability of the algorithm to resolve these issues. Based on the observed results, we believe that the automatic approach could be a valuable tool to assist clinicians in classifying laryngeal lesions.


2021 ◽  
Vol 5 (1) ◽  
Author(s):  
Ianthe A. E. M. van Belzen ◽  
Alexander Schönhuth ◽  
Patrick Kemmeren ◽  
Jayne Y. Hehir-Kwa

Abstract Cancer is generally characterized by acquired genomic aberrations in a broad spectrum of types and sizes, ranging from single nucleotide variants to structural variants (SVs). At least 30% of cancers have a known pathogenic SV used in diagnosis or treatment stratification. However, research into the role of SVs in cancer has been limited by difficulties in detection. Biological and computational challenges confound SV detection in cancer samples, including intratumor heterogeneity, polyploidy, and distinguishing tumor-specific SVs from germline and somatic variants present in healthy cells. Classification of tumor-specific SVs is challenging due to inconsistencies in detected breakpoints, derived variant types, and the biological complexity of some rearrangements. Full-spectrum SV detection with high recall and precision requires integration of multiple algorithms and sequencing technologies to rescue variants that are difficult to resolve through any individual method. Here, we explore current strategies for integrating SV callsets to enable the use of tumor-specific SVs in precision oncology.
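One common callset-integration strategy implied above, collapsing calls whose breakpoints agree within a tolerance window despite caller-to-caller inconsistencies, can be sketched as below. This is an illustrative merge rule, not the logic of any specific published tool:

```python
def merge_callsets(calls_a, calls_b, slop=100):
    """Union two SV callsets, collapsing calls of the same type on the
    same chromosome whose start/end breakpoints agree within `slop` bp."""
    merged = list(calls_a)
    for sv in calls_b:
        duplicate = any(
            sv["type"] == m["type"]
            and sv["chrom"] == m["chrom"]
            and abs(sv["start"] - m["start"]) <= slop
            and abs(sv["end"] - m["end"]) <= slop
            for m in merged
        )
        if not duplicate:
            merged.append(sv)
    return merged
```

Real integration pipelines also reconcile conflicting variant types (e.g. a deletion reported by one caller as a breakend pair) and weigh per-caller support, which a simple positional merge like this cannot capture.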


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Takumi Miura ◽  
Satoshi Yasuda ◽  
Yoji Sato

Abstract Background Next-generation sequencing (NGS) has profoundly changed the approach to genetic/genomic research. In particular, the clinical utility of NGS in detecting mutations associated with disease risk has contributed to the development of effective therapeutic strategies. Recently, comprehensive analysis of somatic genetic mutations by NGS has also been used as a new approach for controlling the quality of cell substrates for manufacturing biopharmaceuticals. However, the quality evaluation of cell substrates by NGS largely depends on the limit of detection (LOD) for rare somatic mutations. The purpose of this study was to develop a simple method for evaluating the ability of whole-exome sequencing (WES) by NGS to detect mutations with low allele frequency. To estimate the LOD of WES for low-frequency somatic mutations, we repeatedly and independently performed WES of a reference genomic DNA sample using the same NGS platform and assay design. LOD was defined as the allele frequency with a relative standard deviation (RSD) of 30% and was estimated from a moving average curve of the relation between RSD and allele frequency. Results Allele frequencies of 20 mutations in the reference material that had been pre-validated by droplet digital PCR (ddPCR) were obtained from 5, 15, 30, or 40 gigabase pairs (Gbp) of sequencing data per run. There was a significant association between the allele frequencies measured by WES and those pre-validated by ddPCR, whose p-value decreased as the sequencing data size increased. By this method, the LOD of allele frequency in WES with sequencing data of 15 Gbp or more was estimated to be between 5 and 10%. Conclusions To properly interpret the WES data of somatic genetic mutations, it is necessary to have a cutoff threshold for low allele frequencies. The in-house LOD estimated by the simple method shown in this study provides a rationale for setting that cutoff.
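The RSD-based LOD rule can be sketched in a few lines. This simplified version omits the moving-average smoothing the authors apply and uses made-up replicate data; it only illustrates the rule "LOD = lowest allele frequency whose RSD stays at or below 30%":

```python
import statistics

def rsd(values):
    """Relative standard deviation (coefficient of variation), percent."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)

def estimate_lod(replicates_by_af, threshold=30.0):
    """Given replicate allele-frequency measurements keyed by true AF,
    return the lowest AF whose RSD is at or below the threshold."""
    passing = [af for af, reps in sorted(replicates_by_af.items())
               if rsd(reps) <= threshold]
    return min(passing) if passing else None
```

Low-frequency variants fluctuate heavily between replicate runs (high RSD), so the threshold crossing marks the point below which WES calls become unreliable.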


2020 ◽  
Vol 36 (12) ◽  
pp. 3669-3679 ◽  
Author(s):  
Can Firtina ◽  
Jeremie S Kim ◽  
Mohammed Alser ◽  
Damla Senol Cali ◽  
A Ercument Cicek ◽  
...  

Abstract Motivation Third-generation sequencing technologies can sequence long reads that contain as many as 2 million base pairs. These long reads are used to construct an assembly (i.e. the subject's genome), which is further used in downstream genome analysis. Unfortunately, third-generation sequencing technologies have high sequencing error rates, and a large proportion of base pairs in these long reads is incorrectly identified. These errors propagate to the assembly and affect the accuracy of genome analysis. Assembly polishing algorithms minimize such error propagation by polishing or fixing errors in the assembly using information from alignments between reads and the assembly (i.e. read-to-assembly alignment information). However, current assembly polishing algorithms can only polish an assembly using reads from either a certain sequencing technology or a small assembly. Such technology dependency and assembly-size dependency require researchers to (i) run multiple polishing algorithms and (ii) split a large genome into small chunks, respectively, in order to use all available read sets and polish large genomes. Results We introduce Apollo, a universal assembly polishing algorithm that scales well to polish an assembly of any size (i.e. both large and small genomes) using reads from all sequencing technologies (i.e. second- and third-generation). Our goal is to provide a single algorithm that uses read sets from all available sequencing technologies to improve the accuracy of assembly polishing and that can polish large genomes. Apollo (i) models an assembly as a profile hidden Markov model (pHMM), (ii) uses read-to-assembly alignments to train the pHMM with the Forward-Backward algorithm and (iii) decodes the trained model with the Viterbi algorithm to produce a polished assembly. Our experiments with real read sets demonstrate that Apollo is the only algorithm that (i) uses reads from any sequencing technology within a single run and (ii) scales well to polish large assemblies without splitting the assembly into multiple parts. Availability and implementation Source code is available at https://github.com/CMU-SAFARI/Apollo. Supplementary information Supplementary data are available at Bioinformatics online.
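The Viterbi decoding step in Apollo's pipeline can be illustrated on a toy two-state HMM. This is a textbook Viterbi implementation on a tiny hand-built model, not Apollo's pHMM code; in Apollo the states encode assembly positions and edits, and the parameters come from Forward-Backward training:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Viterbi decoding: most likely hidden-state path for an
    observation sequence under an HMM."""
    # Each column maps state -> (best path probability, predecessor).
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for o in obs[1:]:
        col = {}
        for s in states:
            prob, prev = max((V[-1][p][0] * trans_p[p][s] * emit_p[s][o], p)
                             for p in states)
            col[s] = (prob, prev)
        V.append(col)
    # Trace back the best path from the final column.
    state = max(states, key=lambda s: V[-1][s][0])
    path = [state]
    for col in reversed(V[1:]):
        state = col[state][1]
        path.append(state)
    return list(reversed(path))
```

A production implementation would work in log space to avoid underflow on long sequences, which matters for genome-scale models.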

