Fast Model-based Protein Homology Discovery without Alignment

2014 ◽  
Vol 1 (2) ◽  
pp. 169-184 ◽  
Author(s):  
Mani Manavalan

The need for fast gene categorization tools is growing as more genomes are sequenced. To evaluate a newly sequenced genome, the genes must first be identified and translated into amino acid sequences, which are then categorized into structural or functional classes. Protein homology detection using sequence alignment algorithms is the most effective approach to protein categorization. Discriminative approaches such as support vector machines (SVMs) and position-specific scoring matrices (PSSMs) derived from PSI-BLAST have recently been used to improve alignment algorithms. However, alignment algorithms are time-consuming if a new sequence must be compared to a large number of previously published sequences; the same is true for SVMs. Building a PSSM for the new sequence is even more time-consuming. On today's computers, the best-performing approaches would take roughly 25 hours to classify the sequences of a novel genome (20,000 genes) as belonging to one single class, yet there are hundreds of classes to choose from. Another flaw of alignment algorithms is that they do not construct a model of the positive class but instead measure the mutual distance between sequences or profiles. Multiple alignments and hidden Markov models are the only common classification approaches that create a positive class model, but they have poor classification performance. The advantage of a model is that it can be analyzed for chemical features shared by all members of the class, yielding fresh insights into protein function and structure. We applied LSTM to a well-known remote protein homology detection benchmark, in which a protein must be categorized as a member of a SCOP superfamily. LSTM achieves state-of-the-art classification performance while being significantly faster than other algorithms with similar classification performance.
LSTM is five orders of magnitude quicker than the quickest SVM-based approaches and two orders of magnitude faster than methods that perform somewhat better in classification (which, however, have lower classification performance than LSTM). To test the modeling capabilities of the algorithm, we applied LSTM to PROSITE classes and analyzed the derived patterns. Because it does not require established similarity metrics like BLOSUM or PAM matrices, LSTM is complementary to alignment-based techniques. LSTM retrieved the PROSITE motif in 8 out of 15 classes. In the remaining seven classes, it derived alternative motifs that, on average, outperform the PROSITE motifs in categorization.

2018 ◽  
Author(s):  
Mohamed Baddar

Remote homology detection is the problem of detecting homology in cases of low sequence similarity. It is a hard computational problem with no approach that works well in all cases. Methods based on profile hidden Markov models (HMMs) often exhibit higher sensitivity for detecting remote homologies than commonly used approaches. However, calculating similarity scores in profile HMM methods is computationally intensive because they use dynamic programming algorithms. In this paper we introduce SHsearch, a new method for remote protein homology detection. Our method is implemented as a modification of HHsearch, a remote protein homology detection method based on comparing two profile HMMs. The motivation for the modification was to reduce the run time of HHsearch significantly with minimal sensitivity loss. SHsearch focuses on comparing the important submodels of the query and database HMMs instead of comparing the complete models. Hence, SHsearch achieves a significant speedup over HHsearch with minimal loss in sensitivity. On SCOP 1.63, SHsearch achieved an 88X speedup with an 8.2% loss in sensitivity with respect to HHsearch at an error rate of 10%, which was deemed to be an acceptable tradeoff.


Author(s):  
NAZAR M. ZAKI ◽  
SAFAAI DERIS ◽  
ROSLI M. ILLIAS

A few years ago, Jaakkola and Haussler published a method that combines generative and discriminative approaches for detecting protein homologies. The method is a variant of support vector machines that uses a new kernel function called the Fisher kernel. They begin by training a generative hidden Markov model for a protein family. Using the model, they then derive a vector of features called Fisher scores for each sequence and use a support vector machine in conjunction with the Fisher scores for protein homology detection. In this paper, we revisit the idea of using a discriminative approach, in particular support vector machines, for protein homology detection. However, in place of the Fisher scoring method, we present a new Hidden Markov Model Combining Scores approach. Six scoring algorithms are combined as a way of extracting features from a protein sequence. Experiments show that our method improves on previous methods for homology detection of protein domains.
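For readers unfamiliar with the Fisher scoring step mentioned in this abstract, the textbook definition can be written as follows. This is a sketch of the standard formulation, not something taken from the abstract itself: the symbols $\theta$ (parameters of the trained hidden Markov model), $U_x$ (Fisher score of sequence $x$) and $I$ (Fisher information matrix) are notation assumed here for illustration, and practical implementations often replace $I$ with the identity or apply a further kernel to the scores.

```latex
% Fisher score: gradient of the log-likelihood of sequence x
% under the generative model with parameters \theta
U_x = \nabla_{\theta} \log P(x \mid \theta)

% Fisher kernel between two sequences x and x'
K(x, x') = U_x^{\top} \, I^{-1} \, U_{x'},
\qquad
I = \mathbb{E}_{x \sim P(\cdot \mid \theta)}\!\left[ U_x U_x^{\top} \right]
```

The score vector $U_x$ has one entry per model parameter, so every sequence maps to a fixed-length feature vector regardless of its length, which is what lets a support vector machine operate on it.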


2020 ◽  
Author(s):  
Nalika Ulapane ◽  
Karthick Thiyagarajan ◽  
Sarath Kodagoda

Classification has become a vital task in modern machine learning and Artificial Intelligence applications, including smart sensing. Numerous machine learning techniques are available to perform classification. Similarly, numerous practices, such as feature selection (i.e., selection of a subset of descriptor variables that optimally describe the output), are available to improve classifier performance. In this paper, we consider the case of a given supervised learning classification task that has to be performed making use of continuous-valued features. It is assumed that an optimal subset of features has already been selected. Therefore, no further feature reduction, or feature addition, is to be carried out. Then, we attempt to improve the classification performance by passing the given feature set through a transformation that produces a new feature set which we have named the “Binary Spectrum”. Via a case study example done on some Pulsed Eddy Current sensor data captured from an infrastructure monitoring task, we demonstrate how the classification accuracy of a Support Vector Machine (SVM) classifier increases through the use of this Binary Spectrum feature, indicating the feature transformation’s potential for broader usage.


Diagnostics ◽  
2021 ◽  
Vol 11 (3) ◽  
pp. 574
Author(s):  
Gennaro Tartarisco ◽  
Giovanni Cicceri ◽  
Davide Di Pietro ◽  
Elisa Leonardi ◽  
Stefania Aiello ◽  
...  

In the past two decades, several screening instruments were developed to detect toddlers who may be autistic both in clinical and unselected samples. Among others, the Quantitative CHecklist for Autism in Toddlers (Q-CHAT) is a quantitative and normally distributed measure of autistic traits that demonstrates good psychometric properties in different settings and cultures. Recently, machine learning (ML) has been applied to behavioral science to improve the classification performance of autism screening and diagnostic tools, but mainly in children, adolescents, and adults. In this study, we used ML to investigate the accuracy and reliability of the Q-CHAT in discriminating young autistic children from those without. Five different ML algorithms (random forest (RF), naïve Bayes (NB), support vector machine (SVM), logistic regression (LR), and K-nearest neighbors (KNN)) were applied to investigate the complete set of Q-CHAT items. Our results showed that ML achieved an overall accuracy of 90%, and the SVM was the most effective, being able to classify autism with 95% accuracy. Furthermore, using the SVM–recursive feature elimination (RFE) approach, we selected a subset of 14 items ensuring 91% accuracy, while 83% accuracy was obtained from the 3 best discriminating items in common to ours and the previously reported Q-CHAT-10. This evidence confirms the high performance and cross-cultural validity of the Q-CHAT, and supports the application of ML to create shorter and faster versions of the instrument, maintaining high classification accuracy, to be used as a quick, easy, and high-performance tool in primary-care settings.


2021 ◽  
Vol 11 (2) ◽  
pp. 796
Author(s):  
Alhanoof Althnian ◽  
Duaa AlSaeed ◽  
Heyam Al-Baity ◽  
Amani Samha ◽  
Alanoud Bin Dris ◽  
...  

Dataset size is considered a major concern in the medical domain, where lack of data is a common occurrence. This study aims to investigate the impact of dataset size on the overall performance of supervised classification models. We examined the performance of six models widely used in the medical field, namely support vector machine (SVM), neural networks (NN), C4.5 decision tree (DT), random forest (RF), AdaBoost (AB), and naïve Bayes (NB), on eighteen small medical UCI datasets. We further implemented three dataset size reduction scenarios on two large datasets and analyzed the performance of the models when trained on each resulting dataset with respect to accuracy, precision, recall, f-score, specificity, and area under the ROC curve (AUC). Our results indicated that the overall performance of classifiers depends on how well a dataset represents the original distribution rather than on its size. Moreover, we found that the most robust models for limited medical data are AB and NB, followed by SVM, and then RF and NN, while the least robust model is DT. Furthermore, an interesting observation is that a machine learning model that is robust to a limited dataset does not necessarily provide the best performance compared to other models.
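The evaluation measures named in this abstract can all be derived from the four counts of a binary confusion matrix. The following is a minimal sketch under that assumption; the function name `classification_metrics` and the example counts are hypothetical and do not come from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common binary-classification metrics from confusion-matrix counts.

    tp/fp/tn/fn are the numbers of true positives, false positives,
    true negatives and false negatives. Ratios with an empty
    denominator are reported as 0.0.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # also called sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # F-score (F1): harmonic mean of precision and recall.
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_score": f_score,
    }


# Hypothetical counts, chosen only to illustrate the call.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reporting several of these measures together matters on small or imbalanced medical datasets, where accuracy alone can look high even when recall or specificity is poor.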


Cancers ◽  
2021 ◽  
Vol 13 (6) ◽  
pp. 1407
Author(s):  
Matyas Bukva ◽  
Gabriella Dobra ◽  
Juan Gomez-Perez ◽  
Krisztian Koos ◽  
Maria Harmati ◽  
...  

Investigating the molecular composition of small extracellular vesicles (sEVs) for tumor diagnostic purposes is becoming increasingly popular, especially for diseases for which diagnosis is challenging, such as central nervous system (CNS) malignancies. Thorough examination of the molecular content of sEVs by Raman spectroscopy is a promising but hitherto barely explored approach for these tumor types. We attempt to reveal the potential role of serum-derived sEVs in diagnosing CNS tumors through Raman spectroscopic analyses using a relevant number of clinical samples. A total of 138 serum samples were obtained from four patient groups (glioblastoma multiforme, non-small-cell lung cancer brain metastasis, meningioma and lumbar disc herniation as control). After isolation, characterization and Raman spectroscopic assessment of sEVs, the Principal Component Analysis–Support Vector Machine (PCA–SVM) algorithm was performed on the Raman spectra for pairwise classifications. Classification accuracy (CA), sensitivity, specificity and the Area Under the Curve (AUC) value derived from Receiver Operating Characteristic (ROC) analyses were used to evaluate the performance of classification. The groups compared were distinguishable with 82.9–92.5% CA, 80–95% sensitivity and 80–90% specificity. AUC scores in the range of 0.82–0.9 suggest excellent and outstanding classification performance. Our results support that Raman spectroscopic analysis of sEV-enriched isolates from serum is a promising method that could be further developed in order to be applicable in the diagnosis of CNS tumors.
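The AUC value used in this abstract has a simple interpretation: it is the probability that a randomly chosen positive sample receives a higher classifier score than a randomly chosen negative one. The sketch below computes it directly from that definition. The function name `roc_auc`, the labels and the scores are hypothetical illustrations, and the snippet does not reproduce the PCA–SVM pipeline of the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0/1 class labels (1 = positive class).
    scores: classifier scores, higher meaning "more likely positive".
    Each positive/negative pair counts 1 if the positive scores higher,
    0.5 on a tie, 0 otherwise; the AUC is the mean over all pairs.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical scores for three positive and three negative samples.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

The pairwise form is quadratic in the number of samples, which is acceptable for cohorts of this size; rank-based formulas give the same value more efficiently on large datasets.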

