Progress in quickly finding orthologs as reciprocal best hits: comparing blast, last, diamond and MMseqs2

BMC Genomics ◽  
2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Julie E. Hernández-Salmerón ◽  
Gabriel Moreno-Hagelsieb

Abstract. Background: Finding orthologs remains an important bottleneck in comparative genomics analyses. While the authors of software for the quick comparison of protein sequences evaluate the speed of their software and compare their results against the most commonly used software for the task, they rarely evaluate it for more particular uses, such as finding orthologs as reciprocal best hits (RBH). Here we compared RBH results obtained using software that runs faster than blastp, namely lastal, diamond, and MMseqs2. Results: We found that lastal required the least time to produce results. However, it yielded fewer results than any other program when comparing the proteins encoded by evolutionarily distant genomes. The program producing the number of RBH most similar to blastp was diamond run with the “ultra-sensitive” option. However, this option was diamond’s slowest, with the “very-sensitive” option offering the best balance between speed and RBH results. The speed-up was much more evident when dealing with eukaryotic genomes, which code for more proteins. For example, lastal took a median of approximately 1.5% of the blastp time to run with bacterial proteomes and 0.6% with eukaryotic ones, while diamond with the very-sensitive option took 7.4% and 5.2%, respectively. Though estimated error rates were very similar among the RBH obtained with all programs, RBH obtained with MMseqs2 had the lowest error rates among the programs tested. Conclusions: The fast algorithms for pairwise protein comparison produced results very similar to blast in a fraction of the time, with diamond offering the best compromise in speed, sensitivity and quality, as long as a sensitivity option other than the default was chosen.
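The reciprocal-best-hit idea named in the abstract can be illustrated with a minimal Python sketch. It assumes the hits from each search direction have already been parsed into (query, subject, bit score) tuples; the function names and toy identifiers are hypothetical, and the abstract does not state the authors' actual filtering criteria (coverage, E-value thresholds, tie handling), so none are applied here.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (query_id, subject_id, bit_score).
Hit = Tuple[str, str, float]


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Map each query to its highest-scoring subject.

    On equal scores the first hit seen is kept (an assumption of this
    sketch, not a rule taken from the paper).
    """
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(
    a_vs_b: Iterable[Hit], b_vs_a: Iterable[Hit]
) -> List[Tuple[str, str]]:
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


if __name__ == "__main__":
    # Toy data: a3's best hit is b1, but b1's best hit is a1, so a3 has no RBH.
    a_vs_b = [("a1", "b1", 200.0), ("a1", "b2", 50.0),
              ("a2", "b2", 180.0), ("a3", "b1", 90.0)]
    b_vs_a = [("b1", "a1", 195.0), ("b2", "a2", 175.0), ("b2", "a1", 40.0)]
    print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

The pairing step is independent of which search program produced the hits, which is why faster programs can be swapped in for blastp and compared on the RBH they yield.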

2020 ◽  
Author(s):  
Julie E Hernández-Salmerón ◽  
Gabriel Moreno-Hagelsieb

Abstract. Introduction: Finding orthologs remains an important bottleneck in comparative genomics analyses. While the authors of software for the quick comparison of protein sequences evaluate the speed of their software and compare their results against the most commonly used software for the task, they rarely evaluate it for more particular uses, such as finding orthologs as reciprocal best hits (RBH). Here we compared RBH results between prokaryotic genomes, obtained using software that runs faster than blastp, namely lastal, diamond, and MMseqs2. Results: We found that lastal required the least time to produce results. However, it yielded fewer results than any other program when comparing evolutionarily distant genomes. The program producing the number of RBH most similar to blastp was MMseqs2. This program also resulted in the lowest error estimates among the programs tested. The results with diamond were very close to those obtained with MMseqs2, with diamond running faster. Our results suggest that the best of the programs tested was diamond run with the “sensitive” option, which took 7% of the time blastp took to run and produced results with lower error rates than blastp. Availability: A program to obtain reciprocal best hits using the software we tested is maintained at https://github.com/Computational-conSequences/SequenceTools


2021 ◽  
Vol 12 ◽  
Author(s):  
Anastasis Oulas ◽  
Margarita Zachariou ◽  
Christos T. Chasapis ◽  
Marios Tomazou ◽  
Umer Z. Ijaz ◽  
...  

The predominance of bacterial taxa in the gut was examined in view of the putative antimicrobial peptide sequences (AMPs) within their proteomes. The working assumption was that compatible bacteria would share homology, and thus immunity, to their putative AMPs, while competing taxa would have dissimilarities in their proteome-hidden AMPs. A network-based method (“Bacterial Wars”) was developed to handle sequence similarities of predicted AMPs among UniProt-derived protein sequences from different bacterial taxa, while a resulting parameter (“Die” score) suggested which taxa would prevail in a defined microbiome. The working hypothesis was examined by correlating the calculated Die scores with the abundance of bacterial taxa from gut microbiomes in different states of health and disease. Eleven publicly available 16S rRNA datasets and a dataset from full shotgun metagenomics served for the analysis. The overall conclusion was that AMPs encrypted within bacterial proteomes affected the predominance of bacterial taxa in chemospheres.


2021 ◽  
Author(s):  
Irene Unterman ◽  
Idit Bloch ◽  
Simona Cazacu ◽  
Gila Kazimirsky ◽  
Benjamin P. Berman ◽  
...  

Abstract. Inactivating mutations in the Methyl-CpG Binding Protein 2 (MECP2) gene are the main cause of Rett syndrome (RTT). Despite extensive research into MECP2 function, no treatments for RTT are currently available. Here we use an evolutionary genomics approach to construct an unbiased MECP2 gene network, using 1,028 eukaryotic genomes to prioritize proteins with strong co-evolutionary signatures with MECP2. Focusing on proteins targeted by FDA-approved drugs led to three promising candidates, two of which were previously linked to MECP2 function (IRAK, KEAP1) and one that was not (EPOR). We show that each of these compounds has the ability to rescue different phenotypes of MECP2 inactivation in cultured human neural cell types, and appears to act on Nuclear Factor Kappa B (NF-κB) signaling in inflammation. This study highlights the potential of comparative genomics to accelerate drug discovery, and yields potential new avenues for the treatment of RTT.


1977 ◽  
Vol 131 (2) ◽  
pp. 160-167 ◽  
Author(s):  
R. W. Lucas ◽  
P. J. Mullin ◽  
C. B. X. Luna ◽  
D. C. McInroy

Summary. A computer-administered ‘interview’ was developed for eliciting evidence relating to alcohol problems. Thirty-six volunteer male patients on their first visits to a specialist alcohol clinic were interviewed three times, by two psychiatrists and by the computer; information was sought about 72 predefined indicants concerning alcohol consumption, drinking behaviour, and symptoms. Each patient was asked to complete an attitude questionnaire anonymously. The extent of agreement between the evidence elicited by the computer and by the psychiatrists was quite high, and their estimated error rates were very similar, all between 10 per cent and 12 per cent in total. With respect to amounts of alcohol consumed, patients reported significantly greater amounts to the computer than they reported to the psychiatrists. The median amounts of pure ethanol consumed ranged from 1.19 kg per week, calculated from reports made to one of the psychiatrists, up to 1.58 kg per week, calculated from reports made to the computer. The results from the attitude questionnaire indicated a high level of acceptability to patients of computer interrogation.


DNA Research ◽  
2017 ◽  
Vol 24 (3) ◽  
pp. 251-260 ◽  
Author(s):  
Abdel Belkorchia ◽  
Jean-François Pombert ◽  
Valérie Polonais ◽  
Nicolas Parisot ◽  
Frédéric Delbac ◽  
...  

eLife ◽  
2021 ◽  
Vol 10 ◽  
Author(s):  
Irene Unterman ◽  
Idit Bloch ◽  
Simona Cazacu ◽  
Gila Kazimirsky ◽  
Bruria Ben-Zeev ◽  
...  

Inactivating mutations in the Methyl-CpG Binding Protein 2 (MECP2) gene are the main cause of Rett syndrome (RTT). Despite extensive research into MECP2 function, no treatments for RTT are currently available. Here, we used an evolutionary genomics approach to construct an unbiased MECP2 gene network, using 1028 eukaryotic genomes to prioritize proteins with strong co-evolutionary signatures with MECP2. Focusing on proteins targeted by FDA-approved drugs led to three promising targets, two of which were previously linked to MECP2 function (IRAK, KEAP1) and one that was not (EPOR). The drugs targeting these three proteins (Pacritinib, DMF, and EPO) were able to rescue different phenotypes of MECP2 inactivation in cultured human neural cell types, and appeared to converge on Nuclear Factor Kappa B (NF-κB) signaling in inflammation. This study highlights the potential of comparative genomics to accelerate drug discovery, and yields potential new avenues for the treatment of RTT.


2015 ◽  
Author(s):  
Davide Verzotto ◽  
Axel M Hillmer ◽  
Audrey S M Teo ◽  
Niranjan Nagarajan

Resolution of complex repeat structures and rearrangements in the assembly and analysis of large eukaryotic genomes is often aided by a combination of high-throughput sequencing and mapping technologies (e.g. optical restriction mapping). In particular, mapping technologies can generate sparse maps of large DNA fragments (150 kbp–2 Mbp) and thus provide a unique source of information for disambiguating complex rearrangements in cancer genomes. Despite their utility, combining high-throughput sequencing and mapping technologies has been challenging due to the lack of efficient and freely available software for robustly aligning maps to sequences. Here we introduce two new map-to-sequence alignment algorithms that efficiently and accurately align high-throughput mapping datasets to large, eukaryotic genomes while accounting for high error rates. In order to do so, these methods (OPTIMA for glocal and OPTIMA-Overlap for overlap alignment) exploit the ability to create efficient data structures that index continuous-valued mapping data while accounting for errors. We also introduce an approach for evaluating the significance of alignments that avoids expensive permutation-based tests while being agnostic to technology-dependent error rates. Our benchmarking results suggest that OPTIMA and OPTIMA-Overlap outperform state-of-the-art approaches in sensitivity (1.6–2X improvement) while simultaneously being more efficient (170–200%) and precise in their alignments (99% precision). These advantages are independent of the quality of the data, suggesting that our indexing approach and statistical evaluation are robust and provide improved sensitivity while guaranteeing high precision.


2017 ◽  
Vol 46 (5) ◽  
pp. 557-564 ◽  
Author(s):  
Noora Kanerva ◽  
Jukka Kontto ◽  
Maijaliisa Erkkola ◽  
Jaakko Nevalainen ◽  
Satu Männistö

Aims: Factors that contribute to the development of overweight are numerous and form a complex structure with many unknown interactions and associations. We aimed to explore this structure (i.e. the mutual importance or hierarchy of sociodemographic and lifestyle-related risk factors of being overweight) using a machine-learning technique called random forest (RF). The results were compared with traditional logistic regression (LR) analysis. Methods: The cross-sectional FINRISK 2007 Study included 4757 Finns (aged 25–74 years). Information on participants’ lifestyle and sociodemographic characteristics was collected with questionnaires. Diet was assessed using a validated food-frequency questionnaire. Height and weight were measured. Participants with a body mass index (BMI) ≥25 kg/m² were classified as overweight. R statistical software was used to run RF analysis (‘randomForest’) to derive estimates for variable importance and out-of-bag error, which were compared to an LR model. Results: In total, 704 (32%) men and 1119 (44%) women had normal BMI, whereas 1502 (69%) men and 1432 (57%) women had BMI ≥25. Estimated error rates for the models were similar (RF vs. LR: 42% vs. 40% for men, 38% vs. 35% for women). Both models ranked age, education and physical activity as the most important risk factors for being overweight, but RF ranked macronutrients (carbohydrates and protein) as more important compared to LR. Conclusions: RF did not demonstrate higher power in variable selection compared to LR in our study. The features of RF are more likely to appear beneficial in settings with a larger number of predictors.


2020 ◽  
Vol 12 (4) ◽  
pp. 282-292 ◽  
Author(s):  
Julia Brueckner ◽  
William F Martin

Abstract Eukaryotes are typically depicted as descendants of archaea, but their genomes are evolutionary chimeras with genes stemming from archaea and bacteria. Which prokaryotic heritage predominates? Here, we have clustered 19,050,992 protein sequences from 5,443 bacteria and 212 archaea with 3,420,731 protein sequences from 150 eukaryotes spanning six eukaryotic supergroups. By downsampling, we obtain estimates for the bacterial and archaeal proportions. Eukaryotic genomes possess a bacterial majority of genes. On average, the majority of bacterial genes is 56% overall, 53% in eukaryotes that never possessed plastids, and 61% in photosynthetic eukaryotic lineages, where the cyanobacterial ancestor of plastids contributed additional genes to the eukaryotic lineage. Intracellular parasites, which undergo reductive evolution in adaptation to the nutrient rich environment of the cells that they infect, relinquish bacterial genes for metabolic processes. Such adaptive gene loss is most pronounced in the human parasite Encephalitozoon intestinalis with 86% archaeal and 14% bacterial derived genes. The most bacterial eukaryote genome sampled is rice, with 67% bacterial and 33% archaeal genes. The functional dichotomy, initially described for yeast, of archaeal genes being involved in genetic information processing and bacterial genes being involved in metabolic processes is conserved across all eukaryotic supergroups.


2018 ◽  
Vol 1 (3) ◽  
pp. 13-24 ◽  
Author(s):  
Nida Tabassum Khan

Bioinformatic tools are widely used to manage enormous amounts of genomic and proteomic data, in tasks including DNA/protein sequence management, drug design, homology modelling, motif/domain prediction, docking, annotation and dynamic simulation. Bioinformatics offers a wide range of applications in numerous disciplines such as genomics, proteomics, comparative genomics, nutrigenomics, microbial genomics, biodefense and forensics. It thus offers a promising future for accelerating scientific research in biotechnology.

