Evotuning protocols for Transformer-based variant effect prediction on multi-domain proteins

Mapping Intimacies ◽

10.1101/2021.03.05.434175 ◽

2021 ◽

Author(s):

Hideki Yamaguchi ◽

Yutaka Saito

Keyword(s):

Language Processing ◽

Structural Information ◽

Level Structure ◽

Domain Architecture ◽

Fine Tuning ◽

Homology Search ◽

Learning Approaches ◽

Variant Effect ◽

The University ◽

Variant Effect Prediction

AbstractAccurate variant effect prediction has broad impacts on protein engineering. Recent machine learning approaches toward this end are based on representation learning, by which feature vectors are learned and generated from unlabeled sequences. However, it is unclear how to effectively learn evolutionary properties of an engineering target protein from homologous sequences, taking into account the protein’s sequence-level structure called domain architecture (DA). Additionally, no optimal protocols are established for incorporating such properties into Transformer, the neural network well-known to perform the best in natural language processing research. This article proposes DA-aware evolutionary fine-tuning, or “evotuning”, protocols for Transformer-based variant effect prediction, considering various combinations of homology search, fine-tuning, and sequence vectorization strategies. We exhaustively evaluated our protocols on diverse proteins with different functions and DAs. The results indicated that our protocols achieved significantly better performances than previous DA-unaware ones. The visualizations of attention maps suggested that the structural information was incorporated by evotuning without direct supervision, possibly leading to better prediction accuracy.Short descriptions of the authorsHideki Yamaguchi is a PhD candidate at the Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo. Yutaka Saito, PhD, is a senior researcher at Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), and a visiting associate professor at The University of Tokyo.Availabilityhttps://github.com/dlnp2/evotuning_protocols_for_transformers

Download Full-text

Unsupervised law article mining based on deep pre-trained language representation models with application to the Italian civil code

Artificial Intelligence and Law ◽

10.1007/s10506-021-09301-8 ◽

2021 ◽

Author(s):

Andrea Tagarelli ◽

Andrea Simeri

Keyword(s):

Deep Learning ◽

Language Processing ◽

Civil Code ◽

Fine Tuning ◽

Learning Approaches ◽

Retrieval Task ◽

Learning Framework ◽

Text Classifiers ◽

The Law ◽

Prediction Problems

AbstractModeling law search and retrieval as prediction problems has recently emerged as a predominant approach in law intelligence. Focusing on the law article retrieval task, we present a deep learning framework named LamBERTa, which is designed for civil-law codes, and specifically trained on the Italian civil code. To our knowledge, this is the first study proposing an advanced approach to law article prediction for the Italian legal system based on a BERT (Bidirectional Encoder Representations from Transformers) learning framework, which has recently attracted increased attention among deep learning approaches, showing outstanding effectiveness in several natural language processing and learning tasks. We define LamBERTa models by fine-tuning an Italian pre-trained BERT on the Italian civil code or its portions, for law article retrieval as a classification task. One key aspect of our LamBERTa framework is that we conceived it to address an extreme classification scenario, which is characterized by a high number of classes, the few-shot learning problem, and the lack of test query benchmarks for Italian legal prediction tasks. To solve such issues, we define different methods for the unsupervised labeling of the law articles, which can in principle be applied to any law article code system. We provide insights into the explainability and interpretability of our LamBERTa models, and we present an extensive experimental analysis over query sets of different type, for single-label as well as multi-label evaluation tasks. Empirical evidence has shown the effectiveness of LamBERTa, and also its superiority against widely used deep-learning text classifiers and a few-shot learner conceived for an attribute-aware prediction task.

Download Full-text

A Tracer Study of ICT Graduate Students at Mzuzu University, Malawi

Mousaion ◽

10.25159/2663-659x/5227 ◽

2019 ◽

Vol 36 (3) ◽

Author(s):

Chimango Nyasulu ◽

Winner Chawinga ◽

George Chipeta

Keyword(s):

Language Processing ◽

Mobile Application ◽

First Century ◽

Application Development ◽

Job Requirements ◽

Level Of Satisfaction ◽

Information And Communication ◽

Quantitative Design ◽

The Right ◽

The University

Governments the world over are increasingly challenging universities to produce human resources with the right skills sets and knowledge required to drive their economies in this twenty-first century. It therefore becomes important for universities to produce graduates that bring tangible and meaningful contributions to the economies. Graduate tracer studies are hailed to be one of the ways in which universities can respond and reposition themselves to the actual needs of the industry. It is against this background that this study was conducted to establish the relevance of the Department of Information and Communication Technology at Mzuzu University to the Malawian economy by systematically investigating occupations of its former students after graduating from the University. The study adopted a quantitative design by distributing an online-based questionnaire with predominantly closed-ended questions. The study focused on three key objectives: to identify key employing sectors of ICT graduates, to gauge the relevance of the ICT programme to its former students’ jobs and businesses, and to establish the level of satisfaction of the ICT curriculum from the perspectives of former ICT graduates. The key findings from the study are that the ICT programme is relevant to the industry. However, some respondents were of the view that the curriculum should be strengthened by revising it through an addition of courses such as Mobile Application Development, Machine Learning, Natural Language Processing, Data Mining, and LINUX Administration to keep abreast with the ever-changing ICT trends and job requirements. The study strongly recommends the need for regular reviews of the curriculum so that it is continually responding to and matches the needs of the industry.

Download Full-text

Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition

10.26434/chemrxiv.5513581.v1 ◽

2017 ◽

Author(s):

Sabrina Jaeger ◽

Simone Fulle ◽

Samo Turk

Keyword(s):

Machine Learning ◽

Language Processing ◽

Supervised Machine Learning ◽

Learning Approach ◽

Learning Approaches ◽

Unsupervised Machine Learning ◽

Feature Representations ◽

Machine Learning Approach ◽

The Individual ◽

Vector Representations

Inspired by natural language processing techniques we here introduce Mol2vec which is an unsupervised machine learning approach to learn vector representations of molecular substructures. Similarly, to the Word2vec models where vectors of closely related words are in close proximity in the vector space, Mol2vec learns vector representations of molecular substructures that are pointing in similar directions for chemically related substructures. Compounds can finally be encoded as vectors by summing up vectors of the individual substructures and, for instance, feed into supervised machine learning approaches to predict compound properties. The underlying substructure vector embeddings are obtained by training an unsupervised machine learning approach on a so-called corpus of compounds that consists of all available chemical matter. The resulting Mol2vec model is pre-trained once, yields dense vector representations and overcomes drawbacks of common compound feature representations such as sparseness and bit collisions. The prediction capabilities are demonstrated on several compound property and bioactivity data sets and compared with results obtained for Morgan fingerprints as reference compound representation. Mol2vec can be easily combined with ProtVec, which employs the same Word2vec concept on protein sequences, resulting in a proteochemometric approach that is alignment independent and can be thus also easily used for proteins with low sequence similarities.

Download Full-text

Automated Confirmation of Protein Annotation Using NLP and the UniProtKB Database

Applied Sciences ◽

10.3390/app11010024 ◽

2020 ◽

Vol 11 (1) ◽

pp. 24

Author(s):

Jin Tao ◽

Kelly Brayton ◽

Shira Broschat

Keyword(s):

Language Processing ◽

Fine Tuning ◽

Support Vector ◽

Protein Annotation ◽

Computing Power ◽

Journal Publication ◽

Novel Approach ◽

Uniprotkb Database ◽

Public Repositories ◽

Annotation Errors

Advances in genome sequencing technology and computing power have brought about the explosive growth of sequenced genomes in public repositories with a concomitant increase in annotation errors. Many protein sequences are annotated using computational analysis rather than experimental verification, leading to inaccuracies in annotation. Confirmation of existing protein annotations is urgently needed before misannotation becomes even more prevalent due to error propagation. In this work we present a novel approach for automatically confirming the existence of manually curated information with experimental evidence of protein annotation. Our ensemble learning method uses a combination of recurrent convolutional neural network, logistic regression, and support vector machine models. Natural language processing in the form of word embeddings is used with journal publication titles retrieved from the UniProtKB database. Importantly, we use recall as our most significant metric to ensure the maximum number of verifications possible; results are reported to a human curator for confirmation. Our ensemble model achieves 91.25% recall, 71.26% accuracy, 65.19% precision, and an F1 score of 76.05% and outperforms the Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) model with fine-tuning using the same data.

Download Full-text

Transformers-sklearn: a toolkit for medical language understanding with transformer-based models

BMC Medical Informatics and Decision Making ◽

10.1186/s12911-021-01459-0 ◽

2021 ◽

Vol 21 (S2) ◽

Author(s):

Feihong Yang ◽

Xuwen Wang ◽

Hetong Ma ◽

Jiao Li

Keyword(s):

Language Processing ◽

Pearson Correlation ◽

Fine Tuning ◽

Entity Recognition ◽

Training Dataset ◽

Training Methods ◽

Code Size ◽

Model Framework ◽

Language Understanding ◽

Medical Language

Abstract Background Transformer is an attention-based architecture proven the state-of-the-art model in natural language processing (NLP). To reduce the difficulty of beginning to use transformer-based models in medical language understanding and expand the capability of the scikit-learn toolkit in deep learning, we proposed an easy to learn Python toolkit named transformers-sklearn. By wrapping the interfaces of transformers in only three functions (i.e., fit, score, and predict), transformers-sklearn combines the advantages of the transformers and scikit-learn toolkits. Methods In transformers-sklearn, three Python classes were implemented, namely, BERTologyClassifier for the classification task, BERTologyNERClassifier for the named entity recognition (NER) task, and BERTologyRegressor for the regression task. Each class contains three methods, i.e., fit for fine-tuning transformer-based models with the training dataset, score for evaluating the performance of the fine-tuned model, and predict for predicting the labels of the test dataset. transformers-sklearn is a user-friendly toolkit that (1) Is customizable via a few parameters (e.g., model_name_or_path and model_type), (2) Supports multilingual NLP tasks, and (3) Requires less coding. The input data format is automatically generated by transformers-sklearn with the annotated corpus. Newcomers only need to prepare the dataset. The model framework and training methods are predefined in transformers-sklearn. Results We collected four open-source medical language datasets, including TrialClassification for Chinese medical trial text multi label classification, BC5CDR for English biomedical text name entity recognition, DiabetesNER for Chinese diabetes entity recognition and BIOSSES for English biomedical sentence similarity estimation. In the four medical NLP tasks, the average code size of our script is 45 lines/task, which is one-sixth the size of transformers’ script. The experimental results show that transformers-sklearn based on pretrained BERT models achieved macro F1 scores of 0.8225, 0.8703 and 0.6908, respectively, on the TrialClassification, BC5CDR and DiabetesNER tasks and a Pearson correlation of 0.8260 on the BIOSSES task, which is consistent with the results of transformers. Conclusions The proposed toolkit could help newcomers address medical language understanding tasks using the scikit-learn coding style easily. The code and tutorials of transformers-sklearn are available at https://doi.org/10.5281/zenodo.4453803. In future, more medical language understanding tasks will be supported to improve the applications of transformers_sklearn.

Download Full-text

Genetic variant effect prediction by supervised nonnegative matrix tri-factorization

Molecular Omics ◽

10.1039/d1mo00038a ◽

2021 ◽

Author(s):

Asieh Amousoltani Arani ◽

Mohammadreza Sehhati ◽

Mohammad Amin Tabatabaiefar

Keyword(s):

Input Data ◽

Genetic Variant ◽

Nonnegative Matrix ◽

Feature Space ◽

Variant Effect ◽

New Feature ◽

Variant Effect Prediction

A new feature space, which can discriminate deleterious variants, was constructed by the integration of various input data using the proposed supervised nonnegative matrix tri-factorization (sNMTF) algorithm.

Download Full-text

Generation of Cross-Lingual Word Vectors for Low-Resourced Languages Using Deep Learning and Topological Metrics in a Data-Efficient Way

Electronics ◽

10.3390/electronics10121372 ◽

2021 ◽

Vol 10 (12) ◽

pp. 1372

Author(s):

Sanjanasri JP ◽

Vijay Krishna Menon ◽

Soman KP ◽

Rajendran S ◽

Agnieszka Wolk

Keyword(s):

Deep Learning ◽

Language Processing ◽

Semantic Space ◽

Semantic Interpretation ◽

Learning Approaches ◽

Qualitative Comparison ◽

Bilingual Dictionary ◽

Pos Tagging ◽

Part Of Speech ◽

Cross Lingual

Linguists have been focused on a qualitative comparison of the semantics from different languages. Evaluation of the semantic interpretation among disparate language pairs like English and Tamil is an even more formidable task than for Slavic languages. The concept of word embedding in Natural Language Processing (NLP) has enabled a felicitous opportunity to quantify linguistic semantics. Multi-lingual tasks can be performed by projecting the word embeddings of one language onto the semantic space of the other. This research presents a suite of data-efficient deep learning approaches to deduce the transfer function from the embedding space of English to that of Tamil, deploying three popular embedding algorithms: Word2Vec, GloVe and FastText. A novel evaluation paradigm was devised for the generation of embeddings to assess their effectiveness, using the original embeddings as ground truths. Transferability across other target languages of the proposed model was assessed via pre-trained Word2Vec embeddings from Hindi and Chinese languages. We empirically prove that with a bilingual dictionary of a thousand words and a corresponding small monolingual target (Tamil) corpus, useful embeddings can be generated by transfer learning from a well-trained source (English) embedding. Furthermore, we demonstrate the usability of generated target embeddings in a few NLP use-case tasks, such as text summarization, part-of-speech (POS) tagging, and bilingual dictionary induction (BDI), bearing in mind that those are not the only possible applications.

Download Full-text

A Comprehensive Study of Artificial Intelligence and Machine Learning Approaches in Confronting the Coronavirus (COVID-19) Pandemic

International Journal of Health Services ◽

10.1177/00207314211017469 ◽

2021 ◽

pp. 002073142110174

Author(s):

Md Mijanur Rahman ◽

Fatema Khatun ◽

Ashik Uzzaman ◽

Sadia Islam Sami ◽

Md Al-Amin Bhuiyan ◽

...

Keyword(s):

Artificial Intelligence ◽

Machine Learning ◽

Health Care ◽

Language Processing ◽

Probabilistic Models ◽

Health Care Systems ◽

Learning Approaches ◽

Care Systems ◽

Novel Coronavirus ◽

Comprehensive Study

The novel coronavirus disease (COVID-19) has spread over 219 countries of the globe as a pandemic, creating alarming impacts on health care, socioeconomic environments, and international relationships. The principal objective of the study is to provide the current technological aspects of artificial intelligence (AI) and other relevant technologies and their implications for confronting COVID-19 and preventing the pandemic’s dreadful effects. This article presents AI approaches that have significant contributions in the fields of health care, then highlights and categorizes their applications in confronting COVID-19, such as detection and diagnosis, data analysis and treatment procedures, research and drug development, social control and services, and the prediction of outbreaks. The study addresses the link between the technologies and the epidemics as well as the potential impacts of technology in health care with the introduction of machine learning and natural language processing tools. It is expected that this comprehensive study will support researchers in modeling health care systems and drive further studies in advanced technologies. Finally, we propose future directions in research and conclude that persuasive AI strategies, probabilistic models, and supervised learning are required to tackle future pandemic challenges.

Download Full-text

CADD-Splice—improving genome-wide variant effect prediction using deep learning-derived splice scores

Genome Medicine ◽

10.1186/s13073-021-00835-9 ◽

2021 ◽

Vol 13 (1) ◽

Author(s):

Philipp Rentzsch ◽

Max Schubach ◽

Jay Shendure ◽

Martin Kircher

Keyword(s):

Prediction Models ◽

Splice Variants ◽

Superior Performance ◽

Data Set ◽

Pathogenic Variants ◽

Genome Wide ◽

Donor And Acceptor ◽

Human Proteins ◽

Variant Effect ◽

Variant Effect Prediction

Abstract Background Splicing of genomic exons into mRNAs is a critical prerequisite for the accurate synthesis of human proteins. Genetic variants impacting splicing underlie a substantial proportion of genetic disease, but are challenging to identify beyond those occurring at donor and acceptor dinucleotides. To address this, various methods aim to predict variant effects on splicing. Recently, deep neural networks (DNNs) have been shown to achieve better results in predicting splice variants than other strategies. Methods It has been unclear how best to integrate such process-specific scores into genome-wide variant effect predictors. Here, we use a recently published experimental data set to compare several machine learning methods that score variant effects on splicing. We integrate the best of those approaches into general variant effect prediction models and observe the effect on classification of known pathogenic variants. Results We integrate two specialized splicing scores into CADD (Combined Annotation Dependent Depletion; cadd.gs.washington.edu), a widely used tool for genome-wide variant effect prediction that we previously developed to weight and integrate diverse collections of genomic annotations. With this new model, CADD-Splice, we show that inclusion of splicing DNN effect scores substantially improves predictions across multiple variant categories, without compromising overall performance. Conclusions While splice effect scores show superior performance on splice variants, specialized predictors cannot compete with other variant scores in general variant interpretation, as the latter account for nonsense and missense effects that do not alter splicing. Although only shown here for splice scores, we believe that the applied approach will generalize to other specific molecular processes, providing a path for the further improvement of genome-wide variant effect prediction.

Download Full-text

Machine learning with persistent homology and chemical word embeddings improves prediction accuracy and interpretability in metal-organic frameworks

Scientific Reports ◽

10.1038/s41598-021-88027-8 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Aditi S. Krishnapriyan ◽

Joseph Montoya ◽

Maciej Haranczyk ◽

Jens Hummelshøj ◽

Dmitriy Morozov

Keyword(s):

Machine Learning ◽

Language Processing ◽

Metal Organic Framework ◽

Persistent Homology ◽

Level Structure ◽

Chemical Information ◽

Material System ◽

Word Embeddings ◽

Structure Property ◽

Metal Organic

AbstractMachine learning has emerged as a powerful approach in materials discovery. Its major challenge is selecting features that create interpretable representations of materials, useful across multiple prediction tasks. We introduce an end-to-end machine learning model that automatically generates descriptors that capture a complex representation of a material’s structure and chemistry. This approach builds on computational topology techniques (namely, persistent homology) and word embeddings from natural language processing. It automatically encapsulates geometric and chemical information directly from the material system. We demonstrate our approach on multiple nanoporous metal–organic framework datasets by predicting methane and carbon dioxide adsorption across different conditions. Our results show considerable improvement in both accuracy and transferability across targets compared to models constructed from the commonly-used, manually-curated features, consistently achieving an average 25–30% decrease in root-mean-squared-deviation and an average increase of 40–50% in R2 scores. A key advantage of our approach is interpretability: Our model identifies the pores that correlate best to adsorption at different pressures, which contributes to understanding atomic-level structure–property relationships for materials design.

Download Full-text