SSEalign: accurate function prediction of bacterial unannotated protein, based on effective training dataset

2017 ◽  
Author(s):  
Zhiyuan Yang ◽  
Stephen Kwok-Wing Tsui

Abstract: The functions of numerous bacterial proteins remain unknown because of the variety of their sequences. Existing prediction methods perform poorly on these proteins, so many are deposited in the NCBI database annotated only as “hypothetical protein”. Elucidating the functions of these unannotated proteins is an urgent task in computational biology. We report a secondary structure element alignment method called SSEalign, based on an effective training dataset extracted from 20 well-studied bacterial genomes. The same experimentally validated genes in different species were selected as training positives, while different genes in different species were selected as training negatives. Moreover, SSEalign used a set of well-defined basic alignment elements with the backtracking line search algorithm to derive the best parameters for accurate prediction. Experimental results showed that SSEalign achieved 91.2% test accuracy, better than existing prediction methods. SSEalign was subsequently applied to identify the functions of the unannotated proteins in the most recently published minimal bacterial genome, JCVI-syn3.0. Results indicated that at least 99 of the 149 unannotated proteins in the JCVI-syn3.0 genome could be annotated by SSEalign. In conclusion, our method is effective for the identification of protein homology and the annotation of uncharacterized proteins in the genome.
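The abstract names a backtracking line search for parameter fitting but does not give the objective or the parameters involved. As a rough illustration only, the sketch below applies a generic Armijo backtracking line search inside plain gradient descent. The functions `toy_loss`, `toy_grad` and `minimize` and all constants are hypothetical stand-ins, not the SSEalign scoring function or its settings.

```python
from typing import Callable, List

Vector = List[float]


def backtracking_line_search(
    f: Callable[[Vector], float],
    grad: Vector,
    x: Vector,
    direction: Vector,
    alpha0: float = 1.0,
    shrink: float = 0.5,
    c: float = 1e-4,
    max_iter: int = 50,
) -> float:
    """Shrink the step until the Armijo sufficient-decrease condition holds."""
    fx = f(x)
    # Directional derivative of f at x along the search direction.
    slope = sum(g * d for g, d in zip(grad, direction))
    alpha = alpha0
    for _ in range(max_iter):
        trial = [xi + alpha * di for xi, di in zip(x, direction)]
        if f(trial) <= fx + c * alpha * slope:
            return alpha
        alpha *= shrink
    return alpha  # fall back to the smallest step tried


def toy_loss(w: Vector) -> float:
    # Hypothetical objective (assumption): a simple quadratic bowl.
    return (w[0] - 3.0) ** 2 + 10.0 * (w[1] + 1.0) ** 2


def toy_grad(w: Vector) -> Vector:
    # Analytic gradient of toy_loss.
    return [2.0 * (w[0] - 3.0), 20.0 * (w[1] + 1.0)]


def minimize(w: Vector, steps: int = 200) -> Vector:
    """Gradient descent where each step size comes from the line search."""
    for _ in range(steps):
        g = toy_grad(w)
        d = [-gi for gi in g]  # steepest-descent direction
        alpha = backtracking_line_search(toy_loss, g, w, d)
        w = [wi + alpha * di for wi, di in zip(w, d)]
    return w
```

Calling `minimize([0.0, 0.0])` moves the parameters toward the minimizer of the toy objective at `(3, -1)`. In a real setting the objective would be a prediction error over the training positives and negatives.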

2015 ◽  
Vol 32 (6) ◽  
pp. 821-827 ◽  
Author(s):  
Enrique Audain ◽  
Yassel Ramos ◽  
Henning Hermjakob ◽  
Darren R. Flower ◽  
Yasset Perez-Riverol

Abstract Motivation: In any macromolecular polyprotic system (for example protein, DNA or RNA), the isoelectric point, commonly referred to as the pI, can be defined as the point of singularity in a titration curve, corresponding to the solution pH value at which the net overall surface charge (and thus the electrophoretic mobility) of the ampholyte sums to zero. Different modern analytical biochemistry and proteomics methods depend on the isoelectric point as a principal feature for protein and peptide characterization. Protein separation by isoelectric point is a critical part of 2-D gel electrophoresis, a key precursor of proteomics, where discrete spots can be digested in-gel, and proteins subsequently identified by analytical mass spectrometry. Peptide fractionation according to pI is also widely used in current proteomics sample preparation procedures prior to LC-MS/MS analysis. Therefore, accurate theoretical prediction of pI would expedite such analysis. While such pI calculation is widely used, it remains largely untested, motivating our efforts to benchmark pI prediction methods. Results: Using data from the database PIP-DB and one publicly available dataset as our reference gold standard, we have undertaken the benchmarking of pI calculation methods. We find that methods vary in their accuracy and are highly sensitive to the choice of basis set. The machine-learning algorithms, especially the SVM-based algorithm, showed superior performance when studying peptide mixtures. In general, learning-based pI prediction methods (such as Cofactor, SVM and Branca) require a large training dataset, and their resulting performance will strongly depend on the quality of that data. In contrast with iterative methods, machine-learning algorithms have the advantage of being able to add new features to improve the accuracy of prediction.
Contact: [email protected] Availability and Implementation: The software and data are freely available at https://github.com/ypriverol/pIR. Supplementary information: Supplementary data are available at Bioinformatics online.
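To make the idea of an iterative pI calculation concrete, here is a minimal sketch that finds the pH at which the Henderson-Hasselbalch net charge of a peptide crosses zero, using bisection. The pKa values are illustrative assumptions rather than the sets benchmarked in the paper, and the function names are hypothetical and not part of the pIR package.

```python
# Illustrative pKa values (assumption): published sets differ, and the
# abstract notes that results are sensitive to this choice.
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM_PKA = 8.6
C_TERM_PKA = 3.6


def net_charge(sequence: str, ph: float) -> float:
    """Net charge of a peptide at a given pH (Henderson-Hasselbalch)."""
    # Terminal groups: the N-terminus is basic, the C-terminus is acidic.
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))
    for residue in sequence.upper():
        if residue in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[residue]))
        elif residue in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[residue] - ph))
    return charge


def isoelectric_point(sequence: str, low: float = 0.0, high: float = 14.0,
                      tol: float = 1e-4) -> float:
    """Bisect on pH; net charge decreases monotonically as pH rises."""
    while high - low > tol:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid   # still positively charged, so pI lies higher
        else:
            high = mid  # neutral or negative, so pI lies lower
    return (low + high) / 2.0
```

Swapping in a different pKa table changes the result, which is one way to see why the choice of basis set matters for this family of methods.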


Foods ◽  
2021 ◽  
Vol 10 (7) ◽  
pp. 1633
Author(s):  
Chreston Miller ◽  
Leah Hamilton ◽  
Jacob Lahne

This paper is concerned with extracting relevant terms from a text corpus on whisk(e)y. “Relevant” terms are usually contextually defined in their domain of use. Arguably, every domain has a specialized vocabulary used for describing things. For example, the field of Sensory Science, a sub-field of Food Science, investigates human responses to food products and differentiates “descriptive” terms for flavors from “ordinary”, non-descriptive language. Within the field, descriptors are generated through Descriptive Analysis, a method wherein a human panel of experts tastes multiple food products and defines descriptors. This process is both time-consuming and expensive. However, one could leverage existing data to identify and build a flavor language automatically. For example, there are thousands of professional and semi-professional reviews of whisk(e)y published on the internet, providing abundant descriptors interspersed with non-descriptive language. The aim, then, is to be able to automatically identify descriptive terms in unstructured reviews for later use in product flavor characterization. We created two systems to perform this task. The first is an interactive visual tool that can be used to tag examples of descriptive terms from thousands of whisky reviews. This creates a training dataset that we use to perform transfer learning using GloVe word embeddings and a Long Short-Term Memory deep learning model architecture. The result is a model that can accurately identify descriptors within a corpus of whisky review texts with a train/test accuracy of 99% and precision, recall, and F1-scores of 0.99. We tested for overfitting by comparing the training and validation loss for divergence. Our results show that the language structure for descriptive terms can be programmatically learned.
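For readers unfamiliar with the reported metrics, the following sketch shows how precision, recall and F1 are computed for a binary descriptor/non-descriptor tagging task. The labels in the usage note are made-up placeholders, not data from the paper, and the function name is hypothetical.

```python
from typing import Sequence, Tuple


def precision_recall_f1(y_true: Sequence[int],
                        y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Binary metrics where 1 = descriptive term and 0 = ordinary word."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

With hypothetical labels `y_true = [1, 0, 1, 1, 0, 0]` and predictions `y_pred = [1, 0, 0, 1, 1, 0]`, there are two true positives, one false positive and one false negative, so all three metrics come out to 2/3.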


Author(s):  
Mayank Pareek ◽  
Rupal Vikas Srivastava ◽  
Sara Behdad

Building insulation is considered a solution for reducing energy costs in both residential and commercial buildings. However, determining the combination of insulation materials that results in the lowest total ownership cost is becoming an increasingly difficult challenge. Various factors influence the efficiency of heat transfer within a room, including the geometry and size of the room, ambient temperature, heat sources and sinks present inside the building, type of insulation materials, etc. The aim of this paper is to develop an optimization-based decision-making tool to help house owners select the best combination of given insulation materials considering all these factors. The purpose of the design approach adopted in this paper is to minimize total ownership cost while providing the required heating in the building. An SQP, quasi-Newton, line-search algorithm was used to obtain the optimized thermal conductivity values for the combination of insulation materials to be used in the walls, floor, ceiling, window and door of a room, along with the width of the air gap to be kept. The results help the house owner decide which combination of insulation materials will achieve the required heating while keeping the total cost incurred to a minimum.


2012 ◽  
Vol 238 ◽  
pp. 709-713
Author(s):  
Bing Jian Wang ◽  
Jian Yong Song ◽  
Jian Ming Lu

Based on a co-rotational (CR) framework, a 2-noded element formulation of a 3D truss was presented and used for accurate modeling of suspension bridges with large displacements and rotations. The CR framework could account for the out-of-plane stiffness through the geometric stiffness, which made it applicable to the analysis of 3D cable bridges. Using the co-rotational truss together with the energy convergence criteria and the Newton method with a line search algorithm, the nonlinear behavior of the 3D cable structural system was simulated conveniently and accurately. The traditional truss elements based on the modified elastic modulus method and the complex catenary elements were therefore avoided. In order to simulate the hanging of the girder and the changes in the structural system during construction, element killing and activation were realized by the modified modulus method.

