Precise Prediction of Promoter Strength Based on a De Novo Synthetic Promoter Library Coupled with Machine Learning

Author(s):  
Mei Zhao ◽  
Zhenqi Yuan ◽  
Longtao Wu ◽  
Shenghu Zhou ◽  
Yu Deng
2020 ◽  
Author(s):  
Mei Zhao ◽  
Shenghu Zhou ◽  
Longtao Wu ◽  
Yu Deng

Abstract Promoters are among the most critical regulatory elements controlling metabolic pathways. However, in recent years, researchers have focused on optimizing promoter strength while ignoring the relationship between the internal sequences and promoter strength. In this context, we constructed and characterized a mutant promoter library of Ptrc through dozens of mutation-construction-screening-characterization engineering cycles. After excluding invalid mutation sites, we established a synthetic promoter library, which consisted of 3665 different variants, displaying an intensity range of more than two orders of magnitude. The strongest variant was 1.52-fold stronger than a 1 mM isopropyl-β-D-thiogalactoside driven PT7 promoter. Our synthetic promoter library exhibited superior applicability when expressing different reporters, in both plasmids and the genome. Different machine learning models were built and optimized to explore relationships between the promoter sequences and transcriptional strength. Finally, our XGBoost model exhibited optimal performance, and we utilized this approach to precisely predict the strength of artificially designed promoter sequences. Our work provides a powerful platform that enables the predictable tuning of promoters to achieve the optimal transcriptional strength.
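As an illustration of how promoter sequences might be turned into numeric inputs for a tree-based regression model, the following minimal Python sketch one-hot encodes a DNA string. The abstract does not state which encoding the authors used; the encoding itself, the function name `one_hot_encode`, and the example hexamer are assumptions made for illustration only.

```python
from typing import List

BASES = "ACGT"  # fixed column order for the indicator vector


def one_hot_encode(sequence: str) -> List[int]:
    """Flatten a DNA sequence into a vector with four indicator values per base."""
    vector: List[int] = []
    for base in sequence.upper():
        if base not in BASES:
            raise ValueError(f"unexpected base: {base!r}")
        # Exactly one of the four positions is set to 1 for each base.
        vector.extend(1 if base == b else 0 for b in BASES)
    return vector


# Illustrative 6-bp input (hypothetical, not taken from the library above).
features = one_hot_encode("TTGACA")
print(len(features))  # 6 bases x 4 indicators = 24
```

A vector like this, paired with a measured strength per variant, is the kind of table a gradient-boosted tree regressor could be trained on; the actual feature design in the paper may differ.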


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Ryan Feehan ◽  
Meghan W. Franklin ◽  
Joanna S. G. Slusky

Abstract Metalloenzymes account for 40% of all enzymes and can perform all seven classes of enzyme reactions. Because of the physicochemical similarities between the active sites of metalloenzymes and inactive metal binding sites, it is challenging to differentiate between them. Yet distinguishing these two classes is critical for the identification of both native and designed enzymes. Because of similarities between catalytic and non-catalytic metal binding sites, finding physicochemical features that distinguish these two types of metal sites can indicate aspects that are critical to enzyme function. In this work, we develop the largest structural dataset of enzymatic and non-enzymatic metalloprotein sites to date. We then use a decision-tree ensemble machine learning model to classify metals bound to proteins as enzymatic or non-enzymatic with 92.2% precision and 90.1% recall. Our model scores electrostatic and pocket lining features as more important than pocket volume, despite the fact that volume is the most quantitatively different feature between enzymatic and non-enzymatic sites. Finally, we find our model has overall better performance in a side-by-side comparison against other methods that differentiate enzymatic from non-enzymatic sequences. We anticipate that our model’s ability to correctly identify which metal sites are responsible for enzymatic activity could enable identification of new enzymatic mechanisms and de novo enzyme design.
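For readers less familiar with the two metrics reported above, the short Python sketch below shows how precision and recall are computed from confusion-matrix counts. The counts used are hypothetical and are not the paper's data; the function name `precision_recall` is likewise an illustrative choice.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Return (precision, recall) from true-positive, false-positive and false-negative counts."""
    # Precision: fraction of sites predicted enzymatic that really are enzymatic.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: fraction of truly enzymatic sites that the classifier recovered.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts chosen only to make the arithmetic easy to follow.
p, r = precision_recall(tp=90, fp=10, fn=30)
print(p, r)  # 90/100 and 90/120
```

High precision with lower recall would mean few false alarms but missed enzymatic sites; reporting both, as the abstract does, exposes that trade-off.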


2018 ◽  
Vol 35 (15) ◽  
pp. 2654-2656 ◽  
Author(s):  
Guoli Ji ◽  
Wenbin Ye ◽  
Yaru Su ◽  
Moliang Chen ◽  
Guangzao Huang ◽  
...  

Abstract Summary: Alternative splicing (AS) is a well-established mechanism for increasing transcriptome and proteome diversity; however, detecting AS events and distinguishing among AS types in organisms without available reference genomes remains challenging. We developed a de novo approach called AStrap for AS analysis without using a reference genome. AStrap identifies AS events by extensive pair-wise alignments of transcript sequences and predicts AS types by a machine-learning model integrating more than 500 assembled features. We evaluated AStrap using collected AS events from reference genomes of rice and human as well as single-molecule real-time sequencing data from Amborella trichopoda. Results show that AStrap can identify many more AS events with comparable or higher accuracy than the competing method. AStrap also possesses a unique feature of predicting AS types, which achieves an overall accuracy of ∼0.87 for different species. Extensive evaluation of AStrap using different parameters, sample sizes and machine-learning models on different species also demonstrates the robustness and flexibility of AStrap. AStrap could be a valuable addition to the community for the study of AS in non-model organisms with limited genetic resources. Availability and implementation: AStrap is available for download at https://github.com/BMILAB/AStrap. Supplementary information: Supplementary data are available at Bioinformatics online.


BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Dennie te Molder ◽  
Wasin Poncheewin ◽  
Peter J. Schaap ◽  
Jasper J. Koehorst

Abstract Background: The genus Xanthomonas has long been considered to consist predominantly of plant pathogens, but over the last decade there has been an increasing number of reports on non-pathogenic and endophytic members. As Xanthomonas species are prevalent pathogens on a wide variety of important crops around the world, there is a need to distinguish between these plant-associated phenotypes. To date a large number of Xanthomonas genomes have been sequenced, which enables the application of machine learning (ML) approaches on the genome content to predict this phenotype. Until now such approaches to the pathogenomics of Xanthomonas strains have been hampered by the fragmentation of information regarding pathogenicity of individual strains over many studies. Unification of this information into a single resource was therefore considered to be an essential step. Results: Mining of 39 papers considering both plant-associated phenotypes allowed for a phenotypic classification of 578 Xanthomonas strains. For 65 plant-pathogenic and 53 non-pathogenic strains the corresponding genomes were available and de novo annotated for the presence of Pfam protein domains used as features to train and compare three ML classification algorithms: CART, Lasso and Random Forest. Conclusion: The literature resource in combination with recursive feature extraction used in the ML classification algorithms provided further insights into the virulence enabling factors, but also highlighted domains linked to traits not present in pathogenic strains.
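The idea of using protein-domain presence as classifier features can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the strain names and domain labels are placeholders rather than real Pfam accessions or data from the study, and `presence_matrix` is a hypothetical helper, not the authors' pipeline.

```python
from typing import Dict, List, Set, Tuple


def presence_matrix(
    genome_domains: Dict[str, Set[str]],
) -> Tuple[List[str], Dict[str, List[int]]]:
    """Build a binary presence/absence row per genome over the union of all domains."""
    # Sorted union of every domain seen in any genome gives a stable column order.
    vocabulary = sorted(set().union(*genome_domains.values()))
    rows = {
        genome: [1 if domain in domains else 0 for domain in vocabulary]
        for genome, domains in genome_domains.items()
    }
    return vocabulary, rows


# Placeholder annotations (hypothetical strains and domain labels).
annotations = {
    "strain_1": {"domA", "domB"},
    "strain_2": {"domB", "domC"},
}
columns, rows = presence_matrix(annotations)
print(columns)           # ['domA', 'domB', 'domC']
print(rows["strain_1"])  # [1, 1, 0]
```

Each row, together with a pathogenic or non-pathogenic label, would form one training example for classifiers such as those compared in the abstract.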


2021 ◽  
Vol 22 (16) ◽  
pp. 8958
Author(s):  
Phasit Charoenkwan ◽  
Chanin Nantasenamat ◽  
Md. Mehedi Hasan ◽  
Mohammad Ali Moni ◽  
Pietro Lio’ ◽  
...  

Accurate identification of bitter peptides is of great importance for better understanding their biochemical and biophysical properties. To date, machine learning-based methods have become effective approaches for identifying potential bitter peptides from large-scale protein datasets. Although a few machine learning-based predictors have been developed for identifying the bitterness of peptides, their prediction performances could be improved. In this study, we developed a new predictor (named iBitter-Fuse) for achieving more accurate identification of bitter peptides. In the proposed iBitter-Fuse, we have integrated a variety of feature encoding schemes for providing sufficient information from different aspects, namely compositional information and physicochemical properties. To enhance the predictive performance, the customized genetic algorithm utilizing self-assessment-report (GA-SAR) was employed for identifying informative features, followed by inputting the optimal ones into a support vector machine (SVM)-based classifier for developing the final model (iBitter-Fuse). Benchmarking experiments based on both 10-fold cross-validation and independent tests indicated that iBitter-Fuse was able to achieve more accurate performance as compared to state-of-the-art methods. To facilitate the high-throughput identification of bitter peptides, the iBitter-Fuse web server was established and made freely available online. It is anticipated that iBitter-Fuse will be a useful tool for aiding the discovery and de novo design of bitter peptides.
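One common way to capture the compositional information of a peptide is amino acid composition, the fraction of each of the 20 standard residues. The Python sketch below shows that encoding as an example; the abstract does not list the exact schemes used in iBitter-Fuse, so treating amino acid composition as one of them is an assumption, and the function name and example peptide are hypothetical.

```python
from collections import Counter
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, fixed column order


def amino_acid_composition(peptide: str) -> List[float]:
    """Return the fraction of each standard amino acid in the peptide."""
    if not peptide:
        raise ValueError("peptide must not be empty")
    counts = Counter(peptide.upper())
    length = len(peptide)
    # Counter returns 0 for residues that do not occur in the peptide.
    return [counts[aa] / length for aa in AMINO_ACIDS]


# Hypothetical four-residue peptide used only to show the output shape.
vector = amino_acid_composition("GGPP")
print(len(vector))                      # 20
print(vector[AMINO_ACIDS.index("G")])   # 0.5
```

A fixed-length vector like this can be concatenated with physicochemical descriptors before feature selection and classification.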


2021 ◽  
Vol 28 ◽  
Author(s):  
Yuyang Xue ◽  
Xiucai Ye ◽  
Lesong Wei ◽  
Xin Zhang ◽  
Tetsuya Sakurai ◽  
...  

With its superior performance, the Transformer model, which is based on the 'Encoder-Decoder' paradigm, has become the mainstream in natural language processing. Meanwhile, bioinformatics has embraced machine learning and made great progress in drug design and protein property prediction. Cell-penetrating peptides (CPPs) are one kind of permeable protein that conveniently serves as a 'postman' in drug penetration tasks. However, only a small number of CPPs have been discovered so far, and fewer still have been applied in practice to drug permeability. Therefore, correctly identifying CPPs opens up a new way to take macromolecules into cells without other potentially harmful materials in the drug. Most of the previous work only uses trivial machine learning techniques and hand-crafted features to construct a simple classifier. In CPPFormer, we borrow the attention structure of the Transformer, rebuild the network around the short length characteristic of CPPs, and use an automatic feature extractor together with a few manually engineered features to co-direct the predicted results. Compared to all previous methods and other classic text classification models, the empirical results show that our proposed deep model-based method achieves the best performance of 92.16% accuracy on the CPP924 dataset and has passed various index tests.


Toxins ◽  
2018 ◽  
Vol 10 (12) ◽  
pp. 503 ◽  
Author(s):  
Qing Li ◽  
Maren Watkins ◽  
Samuel Robinson ◽  
Helena Safavi-Hemami ◽  
Mark Yandell

Cone snails (genus Conus) are venomous marine snails that inject prey with a lethal cocktail of conotoxins, small, secreted, and cysteine-rich peptides. Given the diversity and often high affinity for their molecular targets, consisting of ion channels, receptors or transporters, many conotoxins have become invaluable pharmacological probes, drug leads, and therapeutics. Transcriptome sequencing of Conus venom glands followed by de novo assembly and homology-based toxin identification and annotation is currently the state-of-the-art for discovery of new conotoxins. However, homology-based search techniques, by definition, can only detect novel toxins that are homologous to previously reported conotoxins. To overcome these obstacles for discovery, we have created ConusPipe, a machine learning tool that utilizes prominent chemical characteristics of conotoxins to predict whether a certain transcript in a Conus transcriptome, which has no otherwise detectable homologs in current reference databases, is a putative conotoxin. By using ConusPipe on RNASeq data of 10 species, we report 5148 new putative conotoxin transcripts that have no homologues in current reference databases. Of these, 896 were identified by at least three out of four models used. These data significantly expand current publicly available conotoxin datasets, and our approach provides a new computational avenue for the discovery of novel toxin families.


2020 ◽  
Vol 77 (4) ◽  
pp. 1267-1273
Author(s):  
Cigdem Beyan ◽  
Howard I Browman

Abstract Machine learning, a subfield of artificial intelligence, offers various methods that can be applied in marine science. It supports data-driven learning, which can result in automated decision making of de novo data. It has significant advantages compared with manual analyses that are labour intensive and require considerable time. Machine learning approaches have great potential to improve the quality and extent of marine research by identifying latent patterns and hidden trends, particularly in large datasets that are intractable using other approaches. New sensor technology supports collection of large amounts of data from the marine environment. The rapidly developing machine learning subfield known as deep learning—which applies algorithms (artificial neural networks) inspired by the structure and function of the brain—is able to solve very complex problems by processing big datasets in a short time, sometimes achieving better performance than human experts. Given the opportunities that machine learning can provide, its integration into marine science and marine resource management is inevitable. The purpose of this themed set of articles is to provide as wide a selection as possible of case studies that demonstrate the applications, utility, and promise of machine learning in marine science. We also provide a forward-look by envisioning a marine science of the future into which machine learning has been fully incorporated.


2020 ◽  
Vol 9 (23) ◽  
Author(s):  
Yadi Zhou ◽  
Yuan Hou ◽  
Muzna Hussain ◽  
Sherry‐Ann Brown ◽  
Thomas Budd ◽  
...  

Background: The growing awareness of cardiovascular toxicity from cancer therapies has led to the emerging field of cardio‐oncology, which centers on preventing, detecting, and treating patients with cardiac dysfunction before, during, or after cancer treatment. Early detection and prevention of cancer therapy–related cardiac dysfunction (CTRCD) play important roles in precision cardio‐oncology. Methods and Results: This retrospective study included 4309 cancer patients between 1997 and 2018 whose laboratory tests and cardiovascular echocardiographic variables were collected from the Cleveland Clinic institutional electronic medical record database (Epic Systems). Among these patients, 1560 (36%) were diagnosed with at least 1 type of CTRCD, and 838 (19%) developed CTRCD after cancer therapy (de novo). We posited that machine learning algorithms can be implemented to predict CTRCDs in cancer patients according to clinically relevant variables. Classification models were trained and evaluated for 6 types of cardiovascular outcomes, including coronary artery disease (area under the receiver operating characteristic curve [AUROC], 0.821; 95% CI, 0.815–0.826), atrial fibrillation (AUROC, 0.787; 95% CI, 0.782–0.792), heart failure (AUROC, 0.882; 95% CI, 0.878–0.887), stroke (AUROC, 0.660; 95% CI, 0.650–0.670), myocardial infarction (AUROC, 0.807; 95% CI, 0.799–0.816), and de novo CTRCD (AUROC, 0.802; 95% CI, 0.797–0.807). Model generalizability was further confirmed using time‐split data. Model inspection revealed several clinically relevant variables significantly associated with CTRCDs, including age, hypertension, glucose levels, left ventricular ejection fraction, creatinine, and aspartate aminotransferase levels. Conclusions: This study suggests that machine learning approaches offer powerful tools for cardiac risk stratification in oncology patients by utilizing large‐scale, longitudinal patient data from healthcare systems.
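The AUROC values quoted above can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The Python sketch below computes AUROC directly from that definition; the labels and scores are hypothetical toy values, not study data, and the pairwise loop is written for clarity rather than speed.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which helps place figures such as 0.660 for stroke and 0.882 for heart failure in context.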

