Protein domain architectures provide a fast, efficient and scalable alternative to sequence-based methods for comparative functional genomics

F1000Research ◽  
2017 ◽  
Vol 5 ◽  
pp. 1987 ◽  
Author(s):  
Jasper J. Koehorst ◽  
Edoardo Saccenti ◽  
Peter J. Schaap ◽  
Vitor A. P. Martins dos Santos ◽  
Maria Suarez-Diez

A functional comparative genome analysis is essential to understand the mechanisms underlying bacterial evolution and adaptation. Detection of functional orthologs using standard global sequence similarity methods faces several problems: the need to define arbitrary acceptance thresholds for similarity and alignment length, lateral gene acquisition, and the high computational cost of finding bi-directional best matches at a large scale. We investigated the use of protein domain architectures for large-scale functional comparative analysis as an alternative method. The performance of both approaches was assessed through functional comparison of 446 bacterial genomes sampled at different taxonomic levels. We show that protein domain architectures provide a fast and efficient alternative to sequence-similarity-based methods for identifying groups of functionally equivalent proteins within and across taxonomic boundaries. As the computational cost scales linearly, rather than quadratically, with the number of genomes, the approach is suitable for large-scale comparative analysis. Running both methods in parallel pinpoints potential functional adaptations that may add to bacterial fitness.
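Grouping proteins by identical domain architecture is a single hash-based pass over the data, which is what makes it cheap compared with all-vs-all bidirectional-best-hit searches. A minimal sketch in Python, with made-up protein names and Pfam-style domain identifiers (the paper's actual tooling is not reproduced here):

```python
from collections import defaultdict

# Hypothetical example: each protein is annotated with an ordered tuple of
# Pfam-style domain identifiers (its "domain architecture").
proteins = {
    "genomeA_p1": ("PF00005", "PF00664"),   # ABC transporter-like
    "genomeB_p7": ("PF00005", "PF00664"),   # same architecture -> putative functional equivalent
    "genomeA_p2": ("PF00069",),             # kinase domain only
    "genomeB_p9": ("PF00069", "PF00433"),   # kinase + extra domain
}

def group_by_architecture(proteins):
    """Group proteins sharing an identical domain architecture.

    One hash-based pass over the proteins, so the cost grows linearly
    with the number of proteins (and genomes), unlike all-vs-all
    best-hit searches, whose cost grows quadratically.
    """
    groups = defaultdict(list)
    for name, architecture in proteins.items():
        groups[architecture].append(name)
    return dict(groups)

groups = group_by_architecture(proteins)
print(groups[("PF00005", "PF00664")])  # → ['genomeA_p1', 'genomeB_p7']
```

Proteins with the same architecture end up in the same group regardless of which genome they come from, which is the cross-taxon comparison the abstract describes.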


2006 ◽  
Vol 04 (03) ◽  
pp. 639-647 ◽  
Author(s):  
ELEAZAR ESKIN ◽  
RODED SHARAN ◽  
ERAN HALPERIN

The common approaches for haplotype inference from genotype data are targeted toward phasing short genomic regions. Longer regions are often tackled in a heuristic manner due to the high computational cost. Here, we describe a novel approach for phasing genotypes over long regions, based on combining information from local predictions on short, overlapping regions. The phasing is done in a way that maximizes a natural maximum likelihood criterion, which takes into account, among other things, the physical distance between neighboring single nucleotide polymorphisms. The approach is very efficient; it has been applied to several large-scale datasets and shown to be successful in two recent benchmarking studies (Zaitlen et al., in press; Marchini et al., in preparation). Our method is publicly available via a webserver at .
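The core idea of combining local predictions can be sketched as stitching phased windows together, choosing for each new window the orientation that agrees best with the haplotype built so far. This is a toy greedy stand-in for the paper's likelihood-based combination; the window strings are invented:

```python
def stitch_windows(windows, overlap):
    """Greedily stitch locally phased windows into one long haplotype.

    windows: list of 0/1 strings; consecutive windows share `overlap` sites.
    Each window's phase is defined only up to a global flip, so a window is
    flipped when its complement agrees better with the haplotype built so
    far. (Toy stand-in for the maximum-likelihood combination in the paper.)
    """
    def flip(h):
        return "".join("1" if c == "0" else "0" for c in h)

    result = windows[0]
    for w in windows[1:]:
        tail = result[-overlap:]
        # Count agreements in the overlapping region; flip if the
        # complementary orientation matches better.
        agree = sum(a == b for a, b in zip(tail, w[:overlap]))
        if agree < overlap - agree:
            w = flip(w)
        result += w[overlap:]
    return result

# Second window was reported in the opposite orientation; stitching fixes it.
print(stitch_windows(["0110", "0101"], overlap=2))  # → 011010
```

A real implementation would weight the agreement counts by likelihood and by inter-SNP physical distance rather than using a plain majority vote.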


2019 ◽  
Vol 34 (1) ◽  
pp. 101-123 ◽  
Author(s):  
Taito Lee ◽  
Shin Matsushima ◽  
Kenji Yamanishi

Abstract We consider the class of linear predictors over all logical conjunctions of binary attributes, which we refer to as the class of combinatorial binary models (CBMs) in this paper. CBMs offer high knowledge interpretability, but naïve learning of them from labeled data incurs a computational cost that grows exponentially with the length of the conjunctions. On the other hand, for large-scale datasets, long conjunctions are effective for learning predictors. To overcome this computational difficulty, we propose an algorithm, GRAfting for Binary datasets (GRAB), which efficiently learns CBMs within the L1-regularized loss minimization framework. The key idea of GRAB is to adopt weighted frequent itemset mining for the most time-consuming step of the grafting algorithm, which is designed to solve large-scale L1-RERM problems by an iterative approach. Furthermore, we show experimentally that linear predictors of CBMs are effective in terms of prediction accuracy and knowledge discovery.
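To make the exponential blow-up concrete, here is a toy CBM that enumerates every conjunction up to a given length and scores an example with a linear predictor. The attribute names and weights are made up, and GRAB's point is precisely to avoid this full enumeration:

```python
from itertools import combinations

def conjunction_features(x, max_len):
    """All logical conjunctions (ANDs) of up to max_len binary attributes.

    x: dict attribute-name -> 0/1. Returns {frozenset(names): 0/1}.
    Enumeration is exponential in max_len — the cost GRAB sidesteps by
    mining only promising conjunctions with weighted frequent itemsets.
    """
    feats = {}
    names = sorted(x)
    for k in range(1, max_len + 1):
        for combo in combinations(names, k):
            feats[frozenset(combo)] = int(all(x[n] for n in combo))
    return feats

def cbm_predict(x, weights, bias=0.0, max_len=2):
    """Linear predictor over conjunction features (a toy CBM)."""
    feats = conjunction_features(x, max_len)
    return bias + sum(w * feats.get(frozenset(c), 0) for c, w in weights.items())

# Hypothetical weights: the conjunction (a AND b) carries most of the signal.
w = {("a", "b"): 2.0, ("c",): -1.0}
print(cbm_predict({"a": 1, "b": 1, "c": 0}, w))  # → 2.0
```

Each conjunction feature is human-readable ("a AND b"), which is the interpretability the abstract refers to.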


2020 ◽  
Author(s):  
Qing Wei Cheang ◽  
Shuo Sheng ◽  
Linghui Xu ◽  
Zhao-Xun Liang

Abstract PilZ domain-containing proteins constitute a superfamily of widely distributed bacterial signalling proteins. Although studies have established the canonical PilZ domain as an adaptor protein domain evolved to specifically bind the second messenger c-di-GMP, mounting evidence suggests that the PilZ domain has undergone enormous divergent evolution to generate a superfamily of proteins characterized by a wide range of c-di-GMP-binding affinities, binding partners and cellular functions. This divergent evolution has even generated families of non-canonical PilZ domains that completely lack c-di-GMP-binding ability. In this study, we performed a large-scale sequence analysis of more than 28,000 single- and di-domain PilZ proteins using the sequence similarity networking tool originally created to analyse functionally diverse enzyme superfamilies. The sequence similarity networks (SSN) generated by the analysis feature a large number of putative isofunctional protein clusters and thus provide an unprecedented panoramic view of the sequence-function relationship and function diversification in PilZ proteins. Some of the protein clusters in the networks are considered unexplored clusters that contain proteins with completely unknown biological function, whereas others contain one, two or a few functionally known proteins, enabling us to infer the cellular function of uncharacterized homologs or orthologs.
With the ultimate goal of elucidating the diverse roles played by PilZ proteins in bacterial signal transduction, the work described here will facilitate the annotation of the vast number of PilZ proteins encoded by bacterial genomes and help to prioritize functionally unknown PilZ proteins for future studies. Importance: Although the PilZ domain is best known as the protein domain evolved specifically for the binding of the second messenger c-di-GMP, divergent evolution has generated a superfamily of PilZ proteins with a diversity of ligand- or protein-binding properties and cellular functions. We analysed the sequences of more than 28,000 PilZ proteins using the sequence similarity networking (SSN) tool to yield a global view of the sequence-function relationship and function diversification in PilZ proteins. The results will facilitate the annotation of the vast number of PilZ proteins encoded by bacterial genomes and help us prioritize PilZ proteins for future studies.
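An SSN is essentially a graph whose nodes are proteins and whose edges are pairwise similarity scores above a cutoff; putative isofunctional clusters fall out as connected components. A minimal union-find sketch with invented protein names, scores and threshold (not the authors' pipeline or parameters):

```python
def ssn_clusters(similarities, threshold):
    """Cluster proteins from pairwise similarity scores, SSN-style.

    similarities: {(protA, protB): score}. An edge is kept when its score
    is >= threshold; connected components of the resulting graph are the
    candidate isofunctional clusters. Union-find keeps this near-linear
    in the number of retained edges.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), score in similarities.items():
        find(a)                        # register both nodes
        find(b)
        if score >= threshold:
            parent[find(a)] = find(b)  # union the two components

    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)
    return [sorted(c) for c in sorted(clusters.values(), key=len, reverse=True)]

sims = {("pilz1", "pilz2"): 0.92, ("pilz2", "pilz3"): 0.88, ("pilz4", "pilz1"): 0.10}
print(ssn_clusters(sims, 0.5))  # → [['pilz1', 'pilz2', 'pilz3'], ['pilz4']]
```

Raising the threshold splits the network into smaller, tighter clusters, which is how SSN analyses sweep for the cutoff that best separates functional families.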


2020 ◽  
Vol 9 (11) ◽  
pp. 656
Author(s):  
Muhammad Hamid Chaudhry ◽  
Anuar Ahmad ◽  
Qudsia Gulzar

Unmanned Aerial Vehicle (UAV) surveying is mainly characterized by a large amount of data and high computational cost. This research investigates the use of a smaller amount of data, at lower computational cost, to produce more accurate three-dimensional (3D) photogrammetric products by manipulating UAV surveying parameters such as the flight-line pattern and image overlap percentages. Sixteen photogrammetric projects with perpendicular flight plans and side and forward overlaps varying from 55% to 85% were processed in Pix4DMapper. For UAV data georeferencing and accuracy assessment, 10 Ground Control Points (GCPs) and 18 Check Points (CPs) were used. A comparative analysis was made using the median number of tie points, the size of the 3D point cloud, horizontal/vertical Root Mean Square Error (RMSE), and large-scale topographic variations. The results show that increased forward overlap also increases the median number of tie points, and that an increase in both side and forward overlap increases the size of the point cloud. The horizontal accuracy of the 16 projects varies from ±0.13 m to ±0.17 m, whereas the vertical accuracy varies from ±0.09 m to ±0.32 m. However, the lowest vertical RMSE value was not obtained at the highest overlap percentage. Trading off UAV surveying parameters can yield high-accuracy products at lower computational cost.
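The RMSE figures quoted above are computed from residuals measured at the check points. A minimal sketch with invented residuals (the paper's actual per-CP values are not given here):

```python
import math

def rmse(errors):
    """Root Mean Square Error over per-checkpoint residuals (in metres)."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Hypothetical horizontal residuals at 6 of the 18 check points (m):
dx = [0.12, -0.15, 0.10, 0.18, -0.11, 0.14]
print(round(rmse(dx), 3))  # → 0.136
```

Squaring before averaging means a single large residual dominates the RMSE, which is why vertical RMSE (sensitive to a few poorly constrained heights) varies over a wider range than horizontal RMSE.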


2012 ◽  
Vol 09 ◽  
pp. 480-487 ◽  
Author(s):  
UMMUL KHAIR SALMA DIN ◽  
FUDZIAH ISMAIL ◽  
ZANARIAH ABDUL MAJID ◽  
ROKIAH ROZITA AHMAD

The Medical Akzo Nobel problem (MEDAKZO) is known for its tendency to incur high computational cost. Originating from the penetration of radio-labeled antibodies into tissue infected by a tumor, the problem has been reduced from a one-dimensional partial differential equation to a system of ordinary differential equations, generating a large-scale problem to be solved. This paper presents the performance of a new 4(3) diagonally implicit Runge-Kutta (DIRK) method that is well suited to solving the MEDAKZO problem, which is stiff in nature. The sparsity pattern designed into the method enables the function evaluations to be computed simultaneously on two processors. This functional load balancing can be profitable, especially in solving large problems.
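For readers unfamiliar with diagonally implicit Runge-Kutta methods: their defining property is that each stage requires solving an implicit equation involving only that stage. The simplest member of the family is backward Euler, sketched below on a stiff linear test problem; this is an illustrative stand-in, not the paper's 4(3) DIRK pair:

```python
def implicit_euler(a, y0, h, steps):
    """Backward Euler — the one-stage diagonally implicit RK method —
    applied to the stiff linear test problem y' = a*y with a << 0.

    For linear f the single implicit stage solves in closed form:
        y_new = y + h * a * y_new   =>   y_new = y / (1 - h*a)
    Stiff solvers like DIRK stay stable at step sizes where explicit
    methods would blow up.
    """
    y = y0
    for _ in range(steps):
        y = y / (1.0 - h * a)  # solve the implicit stage equation
    return y

# Stiff decay y' = -1000*y, y(0) = 1: stable even with the large step h = 0.1,
# where explicit Euler (y_new = y * (1 + h*a)) would oscillate and diverge.
print(implicit_euler(-1000.0, 1.0, 0.1, 5))
```

For nonlinear systems each stage becomes a (typically Newton-solved) algebraic system, which is where the paper's two-processor function-evaluation splitting pays off.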


2013 ◽  
Vol 22 (04) ◽  
pp. 1341001 ◽  
Author(s):  
QI YU

Clustering techniques offer a systematic approach to organizing the diverse and fast-increasing Web services by assigning related services to homogeneous service communities. However, the ever-increasing number of Web services poses key challenges for building large-scale service communities. In this paper, we tackle the scalability issue in service clustering, aiming to accurately and efficiently discover service communities over very large-scale collections of services. A key observation is that service descriptions are usually represented by long but very sparse term vectors, as each service is described by only a limited number of terms. This inspires us to seek a new service representation that is economical to store, efficient to process, and intuitive to interpret, and that enables service clustering to scale to a massive number of services. More specifically, a set of anchor services is identified, allowing each service to be represented as a linear combination of a small number of anchor services. In this way, the large set of services is encoded in a much more compact anchor service space. Although service clustering can be performed much more efficiently in the compact anchor service space, the discovery of anchor services from large-scale service descriptions may itself incur high computational cost. We develop principled optimization strategies for efficient anchor service discovery. Extensive experiments are conducted on real-world service data to assess both the effectiveness and efficiency of the proposed approach. Results on a dataset with over 3,700 Web services clearly demonstrate the good scalability of the sparse functional representation and the efficiency of the optimization algorithms for anchor service discovery.
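The anchor-service idea can be illustrated with a toy example: once anchors are fixed, each long sparse term vector is summarized by a handful of coefficients. The sketch below assumes orthonormal anchor vectors, so the least-squares coefficients reduce to dot products; the vocabulary and vectors are invented:

```python
def anchor_encode(service, anchors):
    """Encode a service's term vector as coefficients over anchor services.

    With (assumed) orthonormal anchor vectors, the least-squares
    coefficient for each anchor is just a dot product, so a long sparse
    term vector compresses to len(anchors) numbers.
    """
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    return [dot(service, a) for a in anchors]

# Toy 4-term vocabulary; two orthonormal "anchor services".
anchors = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
svc = [0.8, 0.6, 0.0, 0.0]   # a service described mostly by terms 1 and 2
print(anchor_encode(svc, anchors))  # → [0.8, 0.6]
```

Clustering then runs on the short coefficient vectors instead of the full term space, which is the source of the efficiency gain the abstract claims.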


Author(s):  
Junwei Han ◽  
Kai Xiong ◽  
Feiping Nie

Spectral clustering has been widely used in recent years due to its simplicity in solving the graph clustering problem. However, it suffers from high computational cost as data grow in scale, and it is limited by the performance of its post-processing step. To address these two problems simultaneously, in this paper we propose a novel approach, orthogonal and nonnegative graph reconstruction (ONGR), that scales linearly with the data size. For the relaxation of Normalized Cut, we add a nonnegativity constraint to the objective. Due to the nonnegativity, ONGR offers interpretability in that the final cluster labels can be obtained directly, without post-processing. Extensive experiments on clustering tasks demonstrate the effectiveness of the proposed method.
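The "no post-processing" property comes from the nonnegative indicator matrix: each point's label is simply the column of its largest entry, rather than the output of a k-means step on real-valued eigenvectors. A minimal sketch with made-up indicator values:

```python
def labels_from_indicator(F):
    """Read cluster labels directly off a nonnegative indicator matrix.

    F: n x k matrix (list of rows) with F[i][j] >= 0. Under a
    nonnegativity constraint, the label of point i is simply the column
    holding the largest entry of row i — no k-means post-processing.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in F]

F = [[0.9, 0.0], [0.7, 0.1], [0.0, 0.8]]  # made-up indicator entries
print(labels_from_indicator(F))  # → [0, 0, 1]
```

By contrast, classical spectral clustering produces eigenvectors with mixed signs, whose rows must first be clustered (usually by k-means) before labels can be assigned.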


Sensors ◽  
2021 ◽  
Vol 21 (21) ◽  
pp. 7006
Author(s):  
Mohamed Wassim Baba ◽  
Gregoire Thoumyre ◽  
Erwin W. J. Bergsma ◽  
Christopher J. Daly ◽  
Rafael Almar

Coastal areas are vital because they host numerous activities worldwide. Despite their major importance, knowledge of the main characteristics of most coastal areas (e.g., coastal bathymetry) is still very limited. This is mainly due to the scarcity and inaccuracy of measurements and observations, and the sparse coverage of coastal waters. Moreover, the high cost of performing observations with conventional methods does not allow the monitoring chain to be extended across different coastal areas. In this study, we suggest that the advent of remote sensing data (e.g., Sentinel-2A/B) and high-performance computing could open a new perspective to overcome the lack of coastal observations. Indeed, previous research has shown that it is possible to derive large-scale coastal bathymetry from Sentinel-2 (S-2) images. The large S-2 coverage, however, leads to a high computational cost when post-processing the images. Thus, we developed a methodology, implemented on a High-Performance Computing (HPC) cluster, to derive bathymetry from S-2 imagery over the globe. In this paper, we describe the conceptualization and implementation of this methodology. Moreover, we give a general overview of the generated bathymetry map for NA compared with the reference GEBCO global bathymetric product. Finally, we highlight some hotspots by looking closely at their outputs.
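One plausible way to tame the post-processing cost of wide S-2 coverage on an HPC cluster is to split each scene into independent tiles and dispatch them to separate workers. The scheme below is our illustration, not the authors' exact implementation; the 10,980-pixel width matches the 10 m bands of a Sentinel-2 granule:

```python
def tile_extents(width, height, tile):
    """Split a large image into tiles for embarrassingly parallel processing.

    Yields (x0, y0, x1, y1) pixel windows covering the full image; each
    window can be processed by a separate HPC worker. Edge tiles are
    clipped to the image bounds.
    """
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            yield (x, y, min(x + tile, width), min(y + tile, height))

# A Sentinel-2 granule's 10 m bands are 10980 x 10980 pixels.
tiles = list(tile_extents(10980, 10980, 5490))
print(len(tiles))  # → 4
```

Because bathymetry inversion at one pixel does not depend on distant pixels (beyond a small neighbourhood), tiling with a modest overlap parallelizes cleanly.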

