SneakySnake: a fast and accurate universal genome pre-alignment filter for CPUs, GPUs and FPGAs

Bioinformatics ◽

10.1093/bioinformatics/btaa1015 ◽

2020 ◽

Author(s):

Mohammed Alser ◽

Taha Shahroodi ◽

Juan Gómez-Luna ◽

Can Alkan ◽

Onur Mutlu

Keyword(s):

Sequence Alignment ◽

State Of The Art ◽

Optimal Path ◽

Scoring Function ◽

Supplementary Information ◽

Gpu Acceleration ◽

Scoring Functions ◽

Vlsi Chip ◽

Bit Vector ◽

Grid Layout

Abstract Motivation We introduce SneakySnake, a highly parallel and highly accurate pre-alignment filter that remarkably reduces the need for computationally costly sequence alignment. The key idea of SneakySnake is to reduce the approximate string matching (ASM) problem to the single net routing (SNR) problem in VLSI chip layout. In the SNR problem, we are interested in finding the optimal path that connects two terminals with the least routing cost on a special grid layout that contains obstacles. The SneakySnake algorithm quickly solves the SNR problem and uses the found optimal path to decide whether or not performing sequence alignment is necessary. Reducing the ASM problem to SNR also makes SneakySnake efficient to implement on CPUs, GPUs and FPGAs. Results SneakySnake significantly improves the accuracy of pre-alignment filtering by up to four orders of magnitude compared to the state-of-the-art pre-alignment filters, Shouji, GateKeeper and SHD. For short sequences, SneakySnake accelerates Edlib (state-of-the-art implementation of Myers’s bit-vector algorithm) and Parasail (state-of-the-art sequence aligner with a configurable scoring function), by up to 37.7× and 43.9× (>12× on average), respectively, with its CPU implementation, and by up to 413× and 689× (>400× on average), respectively, with FPGA and GPU acceleration. For long sequences, the CPU implementation of SneakySnake accelerates Parasail and KSW2 (sequence aligner of minimap2) by up to 979× (276.9× on average) and 91.7× (31.7× on average), respectively. As SneakySnake does not replace sequence alignment, users can still obtain all capabilities (e.g. configurable scoring functions) of the aligner of their choice, unlike existing acceleration efforts that sacrifice some aligner capabilities. Availability and implementation https://github.com/CMU-SAFARI/SneakySnake. Supplementary information Supplementary data are available at Bioinformatics online.
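The routing idea in this abstract can be illustrated with a small greedy sketch. It is not the authors' implementation: the function name `passes_filter`, the choice of `2 * max_edits + 1` shifted diagonals as grid rows, and the rule "take the longest obstacle-free run, then pay one edit to step over the obstacle" are assumptions made here for illustration. A mismatch on a diagonal plays the role of an obstacle, and the number of obstacles crossed is compared with the edit threshold.

```python
def passes_filter(ref: str, query: str, max_edits: int) -> bool:
    """Return True if the pair should still be sent to a real aligner.

    Illustrative greedy sketch only (hypothetical helper, not the
    published SneakySnake code).
    """
    m = len(query)

    def run_length(col: int, shift: int) -> int:
        # Length of the match run starting at query[col] along the
        # diagonal that compares query[i] with ref[i + shift].
        n = 0
        while col + n < m:
            j = col + n + shift
            if j < 0 or j >= len(ref) or ref[j] != query[col + n]:
                break
            n += 1
        return n

    col = 0
    obstacles = 0
    while col < m:
        # Longest obstacle-free segment over all allowed diagonals.
        best = max(run_length(col, s)
                   for s in range(-max_edits, max_edits + 1))
        col += best
        if col >= m:
            break
        # Every diagonal is blocked here: count one edit and skip the column.
        obstacles += 1
        col += 1
        if obstacles > max_edits:
            return False  # too many obstacles, alignment can be skipped
    return True


if __name__ == "__main__":
    print(passes_filter("ACGTACGT", "ACGTTCGT", max_edits=1))
    print(passes_filter("AAAAAAAA", "CCCCCCCC", max_edits=2))
```

Pairs that pass would still go through a full aligner such as Edlib or Parasail. The sketch only decides whether that step can be skipped.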

METADOCK 2: a high-throughput parallel metaheuristic scheme for molecular docking

Bioinformatics ◽

10.1093/bioinformatics/btz958 ◽

2020 ◽

Author(s):

Baldomero Imbernón ◽

Antonio Serrano ◽

Andrés Bueno-Crespo ◽

José L Abellán ◽

Horacio Pérez-Sánchez ◽

...

Keyword(s):

Molecular Docking ◽

Graphics Processing Units ◽

Degrees Of Freedom ◽

Scoring Function ◽

Optimization Procedure ◽

Supplementary Information ◽

Scoring Functions ◽

Autodock Vina ◽

Molecular Conformations ◽

Computational Performance

Abstract Motivation Molecular docking methods are extensively used to predict the interaction between protein–ligand systems in terms of structure and binding affinity, through the optimization of a physics-based scoring function. However, the computational requirements of these simulations grow exponentially with: (i) the global optimization procedure, (ii) the number and degrees of freedom of molecular conformations generated and (iii) the mathematical complexity of the scoring function. Results In this work, we introduce a novel molecular docking method named METADOCK 2, which incorporates several novel features, such as (i) a ligand-dependent blind docking approach that exhaustively scans the whole protein surface to detect novel allosteric sites, (ii) an optimization method to enable the use of a wide branch of metaheuristics and (iii) a heterogeneous implementation based on multicore CPUs and multiple graphics processing units. Two representative scoring functions implemented in METADOCK 2 are extensively evaluated in terms of computational performance and accuracy using several benchmarks (such as the well-known DUD) against AutoDock 4.2 and AutoDock Vina. Results place METADOCK 2 as an efficient and accurate docking methodology able to deal with complex systems where computational demands are staggering and which outperforms both AutoDock Vina and AutoDock 4. Availability and implementation https://[email protected]/Baldoimbernon/metadock_2.git. Supplementary information Supplementary data are available at Bioinformatics online.

Accurate peptide fragmentation predictions allow data driven approaches to replace and improve upon proteomics search engine scoring functions

Bioinformatics ◽

10.1093/bioinformatics/btz383 ◽

2019 ◽

Vol 35 (24) ◽

pp. 5243-5248 ◽

Cited By ~ 9

Author(s):

Ana S C. Silva ◽

Robbin Bouwmeester ◽

Lennart Martens ◽

Sven Degroeve

Keyword(s):

Machine Learning ◽

Search Engine ◽

Scoring Function ◽

Peptide Fragmentation ◽

Supervised Machine Learning ◽

Supplementary Information ◽

Scoring Functions ◽

False Discovery ◽

Machine Learning Model ◽

Intensity Information

Abstract Motivation The use of post-processing tools to maximize the information gained from a proteomics search engine is widely accepted and used by the community, with the most notable example being Percolator—a semi-supervised machine learning model which learns a new scoring function for a given dataset. The usage of such tools is however bound to the search engine’s scoring scheme, which doesn’t always make full use of the intensity information present in a spectrum. We aim to show how this tool can be applied in such a way that maximizes the use of spectrum intensity information by leveraging another machine learning-based tool, MS2PIP. MS2PIP predicts fragment ion peak intensities. Results We show how comparing predicted intensities to annotated experimental spectra by calculating direct similarity metrics provides enough information for a tool such as Percolator to accurately separate two classes of peptide-to-spectrum matches. This approach allows using more information out of the data (compared with simpler intensity based metrics, like peak counting or explained intensities summing) while maintaining control of statistics such as the false discovery rate. Availability and implementation All of the code is available online at https://github.com/compomics/ms2rescore. Supplementary information Supplementary data are available at Bioinformatics online.

Learning from the ligand: using ligand-based features to improve binding affinity prediction

Bioinformatics ◽

10.1093/bioinformatics/btz665 ◽

2019 ◽

Cited By ~ 7

Author(s):

Fergus Boyles ◽

Charlotte M Deane ◽

Garrett M Morris

Keyword(s):

Machine Learning ◽

Binding Affinity ◽

Pearson Correlation ◽

Scoring Function ◽

Supplementary Information ◽

Scoring Functions ◽

Limited Information ◽

Ligand Complex ◽

Binding Affinity Prediction ◽

Affinity Prediction

Abstract Motivation Machine learning scoring functions for protein–ligand binding affinity prediction have been found to consistently outperform classical scoring functions. Structure-based scoring functions for universal affinity prediction typically use features describing interactions derived from the protein–ligand complex, with limited information about the chemical or topological properties of the ligand itself. Results We demonstrate that the performance of machine learning scoring functions is consistently improved by the inclusion of diverse ligand-based features. For example, a Random Forest (RF) combining the features of RF-Score v3 with RDKit molecular descriptors achieved Pearson correlation coefficients of up to 0.836, 0.780 and 0.821 on the PDBbind 2007, 2013 and 2016 core sets, respectively, compared to 0.790, 0.746 and 0.814 when using the features of RF-Score v3 alone. Excluding proteins and/or ligands that are similar to those in the test sets from the training set has a significant effect on scoring function performance, but does not remove the predictive power of ligand-based features. Furthermore, an RF using only ligand-based features is predictive at a level similar to classical scoring functions, and it appears to be predicting the mean binding affinity of a ligand for its protein targets. Availability and implementation Data and code to reproduce all the results are freely available at http://opig.stats.ox.ac.uk/resources. Supplementary information Supplementary data are available at Bioinformatics online.
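The Pearson correlation coefficients quoted in this abstract measure how linearly the predicted affinities track the measured ones. The sketch below shows the standard formula in plain Python. The function name `pearson_r` and the sample values are hypothetical and are not taken from the paper or from PDBbind.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centred products; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical measured vs. predicted affinities (arbitrary units).
    measured = [4.2, 6.1, 7.8, 5.0, 9.3]
    predicted = [4.9, 5.8, 7.1, 5.5, 8.4]
    print(round(pearson_r(measured, predicted), 3))
```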

A knowledge-based scoring function to assess quaternary associations of proteins

Bioinformatics ◽

10.1093/bioinformatics/btaa207 ◽

2020 ◽

Vol 36 (12) ◽

pp. 3739-3748

Author(s):

Abhilesh S Dhawanjewar ◽

Ankit A Roy ◽

Mallur S Madhusudhan

Keyword(s):

Protein Interactions ◽

Statistical Physics ◽

Binary Classification ◽

Scoring Function ◽

Protein Docking ◽

Supplementary Information ◽

Scoring Functions ◽

Biological Interactions ◽

Protein Protein Interactions ◽

Knowledge Based

Abstract Motivation The elucidation of all inter-protein interactions would significantly enhance our knowledge of cellular processes at a molecular level. Given the enormity of the problem, the expenses and limitations of experimental methods, it is imperative that this problem is tackled computationally. In silico predictions of protein interactions entail sampling different conformations of the purported complex and then scoring these to assess for interaction viability. In this study, we have devised a new scheme for scoring protein–protein interactions. Results Our method, PIZSA (Protein Interaction Z-Score Assessment), is a binary classification scheme for identification of native protein quaternary assemblies (binders/nonbinders) based on statistical potentials. The scoring scheme incorporates residue–residue contact preference on the interface with per residue-pair atomic contributions and accounts for clashes. PIZSA can accurately discriminate between native and non-native structural conformations from protein docking experiments and outperform other contact-based potential scoring functions. The method has been extensively benchmarked and is among the top 6 methods, outperforming 31 other statistical, physics based and machine learning scoring schemes. The PIZSA potentials can also distinguish crystallization artifacts from biological interactions. Availability and implementation PIZSA is implemented as a web server at http://cospi.iiserpune.ac.in/pizsa and can be downloaded as a standalone package from http://cospi.iiserpune.ac.in/pizsa/Download/Download.html. Supplementary information Supplementary data are available at Bioinformatics online.

Shouji: a fast and efficient pre-alignment filter for sequence alignment

Bioinformatics ◽

10.1093/bioinformatics/btz234 ◽

2019 ◽

Vol 35 (21) ◽

pp. 4255-4263 ◽

Cited By ~ 7

Author(s):

Mohammed Alser ◽

Hasan Hassan ◽

Akash Kumar ◽

Onur Mutlu ◽

Can Alkan

Keyword(s):

Sequence Alignment ◽

Execution Time ◽

State Of The Art ◽

Hardware Acceleration ◽

Supplementary Information ◽

Sequencing Data ◽

Bioinformatics Pipeline ◽

Field Programmable ◽

Computing Platforms ◽

Alignment Step

Abstract Motivation The ability to generate massive amounts of sequencing data continues to overwhelm the processing capability of existing algorithms and compute infrastructures. In this work, we explore the use of hardware/software co-design and hardware acceleration to significantly reduce the execution time of short sequence alignment, a crucial step in analyzing sequenced genomes. We introduce Shouji, a highly parallel and accurate pre-alignment filter that remarkably reduces the need for computationally costly dynamic programming algorithms. The first key idea of our proposed pre-alignment filter is to provide high filtering accuracy by correctly detecting all common subsequences shared between two given sequences. The second key idea is to design a hardware accelerator that adopts modern field-programmable gate array (FPGA) architectures to further boost the performance of our algorithm. Results Shouji significantly improves the accuracy of pre-alignment filtering by up to two orders of magnitude compared to the state-of-the-art pre-alignment filters, GateKeeper and SHD. Our FPGA-based accelerator is up to three orders of magnitude faster than the equivalent CPU implementation of Shouji. Using a single FPGA chip, we benchmark the benefits of integrating Shouji with five state-of-the-art sequence aligners, designed for different computing platforms. The addition of Shouji as a pre-alignment step reduces the execution time of the five state-of-the-art sequence aligners by up to 18.8×. Shouji can be adapted for any bioinformatics pipeline that performs sequence alignment for verification. Unlike most existing methods that aim to accelerate sequence alignment, Shouji does not sacrifice any of the aligner capabilities, as it does not modify or replace the alignment step. Availability and implementation https://github.com/CMU-SAFARI/Shouji. Supplementary information Supplementary data are available at Bioinformatics online.

Scoring Functions Based on Second Level Score for k-SAT with Long Clauses

Journal of Artificial Intelligence Research ◽

10.1613/jair.4480 ◽

2014 ◽

Vol 51 ◽

pp. 413-441 ◽

Cited By ~ 4

Author(s):

S. Cai ◽

C. Luo ◽

K. Su

Keyword(s):

Phase Transition ◽

Computational Complexity ◽

Local Search ◽

State Of The Art ◽

Scoring Function ◽

Experimental Results ◽

Great Success ◽

Stochastic Local Search ◽

Scoring Functions ◽

New Scoring

It is widely acknowledged that stochastic local search (SLS) algorithms can efficiently find models for satisfiable instances of the satisfiability (SAT) problem, especially for random k-SAT instances. However, compared to random 3-SAT instances where SLS algorithms have shown great success, random k-SAT instances with long clauses remain very difficult. Recently, the notion of second level score, denoted as "score_2", was proposed for improving SLS algorithms on long-clause SAT instances, and was first used in the powerful CCASat solver as a tie breaker. In this paper, we propose three new scoring functions based on score_2. Despite their simplicity, these functions are very effective for solving random k-SAT with long clauses. The first function combines score and score_2, and the second one additionally integrates the diversification property "age". These two functions are used in developing a new SLS algorithm called CScoreSAT. Experimental results on large random 5-SAT and 7-SAT instances near phase transition show that CScoreSAT significantly outperforms previous SLS solvers. However, CScoreSAT cannot rival its competitors on random k-SAT instances at phase transition. We improve CScoreSAT for such instances by another scoring function which combines score_2 with age. The resulting algorithm HScoreSAT exhibits state-of-the-art performance on random k-SAT (k>3) instances at phase transition. We also study the computation of score_2, including its implementation and computational complexity.
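The abstract does not define score or score_2, so the sketch below relies on an assumed reading of the usual SLS definitions. Here score counts the clauses that become satisfied minus the clauses that become falsified when a variable is flipped, and score_2 counts the clauses that go from exactly one true literal to two, minus those that go from two to one. The names `true_literal_count` and `flip_scores` and the tiny formula are hypothetical, and this is not the CScoreSAT or CCASat code.

```python
from typing import Dict, List, Tuple

Clause = List[int]  # literal +v means variable v is True, -v means False


def true_literal_count(clause: Clause, assignment: Dict[int, bool]) -> int:
    """Number of literals in the clause that are true under the assignment."""
    return sum(1 for lit in clause if assignment[abs(lit)] == (lit > 0))


def flip_scores(var: int, clauses: List[Clause],
                assignment: Dict[int, bool]) -> Tuple[int, int]:
    """Return (score, score_2) for flipping `var`, under the assumed definitions."""
    flipped = dict(assignment)
    flipped[var] = not assignment[var]
    score = 0
    score_2 = 0
    for clause in clauses:
        if all(abs(lit) != var for lit in clause):
            continue  # the flip cannot change this clause
        before = true_literal_count(clause, assignment)
        after = true_literal_count(clause, flipped)
        if before == 0 and after >= 1:
            score += 1      # falsified clause becomes satisfied
        elif before >= 1 and after == 0:
            score -= 1      # satisfied clause becomes falsified
        if before == 1 and after == 2:
            score_2 += 1    # 1-satisfied clause becomes 2-satisfied
        elif before == 2 and after == 1:
            score_2 -= 1    # 2-satisfied clause becomes 1-satisfied
    return score, score_2


if __name__ == "__main__":
    clauses = [[1, 2, 3], [-1, 2, 3], [1, -2, -3]]
    assignment = {1: False, 2: True, 3: False}
    print(flip_scores(1, clauses, assignment))
```

In this toy formula, flipping variable 1 leaves score at zero but gives a positive score_2, which is the kind of tie that a second-level score can break.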

Random Forest Refinement of Pairwise Potentials for Protein-ligand Decoy Detection

10.26434/chemrxiv.8047820.v1 ◽

2019 ◽

Cited By ~ 1

Author(s):

Jun Pei ◽

Zheng Zheng ◽

Hyunji Kim ◽

Lin Song ◽

Sarah Walworth ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Probability Function ◽

Pair Potential ◽

Scoring Function ◽

Stable Structure ◽

Scoring Functions ◽

Atom Pair ◽

Data Set ◽

Atom Pairs

An accurate scoring function is expected to correctly select the most stable structure from a set of pose candidates. One can hypothesize that a scoring function’s ability to identify the most stable structure might be improved by emphasizing the most relevant atom pairwise interactions. However, it is hard to evaluate the relative importance of each atom pair using traditional means. With the introduction of machine learning methods, it has become possible to determine the relative importance of each atom pair present in a scoring function. In this work, we use the Random Forest (RF) method to refine a pair potential developed by our laboratory (GARF6) by identifying relevant atom pairs that optimize the performance of the potential on our given task. Our goal is to construct a machine learning (ML) model that can accurately differentiate the native ligand binding pose from candidate poses using a potential refined by RF optimization. We successfully constructed RF models on an unbalanced data set with the ‘comparison’ concept, and the resultant RF models were tested on CASF-2013 (ref. 5). In a comparison of the performance of our RF models against 29 scoring functions, we found our models outperformed the other scoring functions in predicting the native pose. In addition, we used two artificially designed potential models to address the importance of the GARF potential in the RF models: (1) a scrambled probability function set, which was obtained by mixing up atom pairs and probability functions in GARF, and (2) a uniform probability function set, which shares the same peak positions as GARF but has fixed peak heights. The results of the accuracy comparison of RF models based on the scrambled, uniform, and original GARF potentials clearly showed that the peak positions in the GARF potential are important while the well depths are not.

Bringing Things Closer: Enhancing Low-Vision Interaction Experience with Office Productivity Applications

Proceedings of the ACM on Human-Computer Interaction ◽

10.1145/3457144 ◽

2021 ◽

Vol 5 (EICS) ◽

pp. 1-18

Author(s):

Hae-Na Lee ◽

Vikas Ashok ◽

IV Ramakrishnan

Keyword(s):

Assistive Technology ◽

User Study ◽

State Of The Art ◽

Low Vision ◽

Spatial Separation ◽

Usability Study ◽

Presentation Software ◽

Screen Magnifier ◽

Word Processors ◽

Grid Layout

Many people with low vision rely on screen-magnifier assistive technology to interact with productivity applications such as word processors, spreadsheets, and presentation software. Despite the importance of these applications, little is known about their usability with respect to low-vision screen-magnifier users. To fill this knowledge gap, we conducted a usability study with 10 low-vision participants having different eye conditions. In this study, we observed that most usability issues were predominantly due to high spatial separation between main edit area and command ribbons on the screen, as well as the wide span grid-layout of command ribbons; these two GUI aspects did not gel with the screen-magnifier interface due to lack of instantaneous WYSIWYG (What You See Is What You Get) feedback after applying commands, given that the participants could only view a portion of the screen at any time. Informed by the study findings, we developed MagPro, an augmentation to productivity applications, which significantly improves usability by not only bringing application commands as close as possible to the user's current viewport focus, but also enabling easy and straightforward exploration of these commands using simple mouse actions. A user study with nine participants revealed that MagPro significantly reduced the time and workload to do routine command-access tasks, compared to using the state-of-the-art screen magnifier.

ASFP (Artificial Intelligence based Scoring Function Platform): a web server for the development of customized scoring functions

Journal of Cheminformatics ◽

10.1186/s13321-021-00486-3 ◽

2021 ◽

Vol 13 (1) ◽

Author(s):

Xujun Zhang ◽

Chao Shen ◽

Xueying Guo ◽

Zhe Wang ◽

Gaoqi Weng ◽

...

Keyword(s):

High Efficiency ◽

Low Cost ◽

Pearson Correlation ◽

Scoring Function ◽

Web Server ◽

Scoring Functions ◽

Protein Ligand Interactions ◽

Prediction Module ◽

Ligand Interactions ◽

Benchmark Datasets

Abstract Virtual screening (VS) based on molecular docking has emerged as one of the mainstream technologies of drug discovery due to its low cost and high efficiency. However, the scoring functions (SFs) implemented in most docking programs are not always accurate enough and how to improve their prediction accuracy is still a big challenge. Here, we propose an integrated platform called ASFP, a web server for the development of customized SFs for structure-based VS. There are three main modules in ASFP: (1) the descriptor generation module that can generate up to 3437 descriptors for the modelling of protein–ligand interactions; (2) the AI-based SF construction module that can establish target-specific SFs based on the pre-generated descriptors through three machine learning (ML) techniques; (3) the online prediction module that provides some well-constructed target-specific SFs for VS and an additional generic SF for binding affinity prediction. Our methodology has been validated on several benchmark datasets. The target-specific SFs can achieve an average ROC AUC of 0.973 towards 32 targets and the generic SF can achieve the Pearson correlation coefficient of 0.81 on the PDBbind version 2016 core set. To sum up, the ASFP server is a powerful tool for structure-based VS.

CorGAT: a tool for the functional annotation of SARS-CoV-2 genomes

Bioinformatics ◽

10.1093/bioinformatics/btaa1047 ◽

2020 ◽

Author(s):

Matteo Chiara ◽

Federico Zambelli ◽

Marco Antonio Tangaro ◽

Pietro Mandreoli ◽

David S Horner ◽

...

Keyword(s):

Functional Annotation ◽

Ad Hoc ◽

State Of The Art ◽

Supplementary Information ◽

Genomic Sequences ◽

Supplementary Data ◽

Evolutionary Patterns ◽

Genomic Variants ◽

Art Methods ◽

Available Resources

Abstract Summary While over 200 000 genomic sequences are currently available through dedicated repositories, ad hoc methods for the functional annotation of SARS-CoV-2 genomes do not harness all currently available resources for the annotation of functionally relevant genomic sites. Here, we present CorGAT, a novel tool for the functional annotation of SARS-CoV-2 genomic variants. By comparison with other state-of-the-art methods, we demonstrate that, by providing a more comprehensive and rich annotation, our method can facilitate the identification of evolutionary patterns in the genome of SARS-CoV-2. Availability and implementation Galaxy: http://corgat.cloud.ba.infn.it/galaxy; software: https://github.com/matteo14c/CorGAT/tree/Revision_V1; docker: https://hub.docker.com/r/laniakeacloud/galaxy_corgat. Supplementary information Supplementary data are available at Bioinformatics online.
