Evaluating Protein Transfer Learning with TAPE

2019 ◽  
Author(s):  
Roshan Rao ◽  
Nicholas Bhattacharya ◽  
Neil Thomas ◽  
Yan Duan ◽  
Xi Chen ◽  
...  

Protein modeling is an increasingly popular area of machine learning research. Semi-supervised learning has emerged as an important paradigm in protein modeling due to the high cost of acquiring supervised protein labels, but the current literature is fragmented when it comes to datasets and standardized evaluation techniques. To facilitate progress in this field, we introduce the Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. We curate tasks into specific training, validation, and test splits to ensure that each task tests biologically relevant generalization that transfers to real-life scenarios. We benchmark a range of approaches to semi-supervised protein representation learning, which span recent work as well as canonical sequence learning techniques. We find that self-supervised pretraining is helpful for almost all models on all tasks, more than doubling performance in some cases. Despite this increase, in several cases features learned by self-supervised pretraining still lag behind features extracted by state-of-the-art non-neural techniques. This gap in performance suggests a huge opportunity for innovative architecture design and improved modeling paradigms that better capture the signal in biological sequences. TAPE will help the machine learning community focus effort on scientifically relevant problems. Toward this end, all data and code used to run these experiments are available at https://github.com/songlab-cal/tape.
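As a concrete illustration of the pretraining paradigm the benchmark evaluates, the sketch below implements masked-token self-supervised pretraining on amino-acid sequences in PyTorch. It is a minimal toy, not code from the songlab-cal/tape repository; the vocabulary, special-token ids, and LSTM encoder are illustrative assumptions.

```python
# Minimal sketch of masked-token self-supervised pretraining on protein
# sequences, in the spirit of the TAPE benchmark. Illustrative assumptions
# throughout; this is not code from the songlab-cal/tape repository.
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD, MASK = 20, 21                      # hypothetical special-token ids
VOCAB = len(AMINO_ACIDS) + 2

class ProteinLM(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim, padding_idx=PAD)
        self.encoder = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * dim, VOCAB)   # predict the original residue

    def forward(self, tokens):
        h, _ = self.encoder(self.embed(tokens))
        return self.head(h)

def mask_tokens(tokens, p=0.15):
    """Replace a random 15% of residues with MASK; other targets are ignored."""
    mask = (torch.rand_like(tokens, dtype=torch.float) < p) & (tokens != PAD)
    targets = torch.where(mask, tokens, torch.full_like(tokens, -100))
    corrupted = torch.where(mask, torch.full_like(tokens, MASK), tokens)
    return corrupted, targets

model = ProteinLM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

batch = torch.randint(0, 20, (8, 100))           # 8 random toy sequences
corrupted, targets = mask_tokens(batch)
logits = model(corrupted)
loss = loss_fn(logits.reshape(-1, VOCAB), targets.reshape(-1))
loss.backward()
optimizer.step()
```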

2021 ◽  
Vol 14 (3) ◽  
pp. 1-21
Author(s):  
Roy Abitbol ◽  
Ilan Shimshoni ◽  
Jonathan Ben-Dov

The task of assembling fragments in a puzzle-like manner into a composite picture plays a significant role in archaeology, as it supports researchers in their attempts to reconstruct historic artifacts. In this article, we propose a method for matching and assembling pairs of ancient papyrus fragments containing mostly unknown scriptures. Papyrus paper is manufactured from papyrus plants and therefore displays typical thread patterns produced by the plant's stems. The proposed algorithm is founded on the hypothesis that these thread patterns have unique local attributes, such that nearby fragments show similar patterns reflecting the continuations of the threads. We posit that these patterns can be exploited using image processing and machine learning techniques to identify matching fragments. The algorithm and system we present support the quick, automated classification of matching pairs of papyrus fragments as well as the geometric alignment of the pairs against each other. The algorithm consists of a series of steps and is based on deep-learning and machine learning methods. The first step is to decompose the problem of matching fragments into the smaller problem of finding thread-continuation matches in local edge areas (squares) between pairs of fragments. This phase is solved using a convolutional neural network that ingests raw images of the edge areas and produces local matching scores. This stage yields very high recall but low precision. We therefore use these scores to decide whether entire fragment pairs match by means of an elaborate voting mechanism, enhanced with geometric alignment techniques from which we extract additional spatial information. Finally, we feed all the data collected in these steps into a Random Forest classifier to produce a higher-order classifier capable of predicting whether a pair of fragments is a match. Our algorithm was trained on a batch of fragments excavated from the Dead Sea caves and dated to around the 1st century BCE. The algorithm shows excellent results on a validation set of similar origin and condition. We then ran the algorithm on a real-life set of fragments for which we have no prior knowledge or labeling of matches. This test batch is considered extremely challenging due to its poor condition and the small size of its fragments; numerous researchers have sought matches within it with very little success. Our algorithm's performance on this batch was suboptimal, returning a relatively large ratio of false positives. However, the algorithm proved quite useful by eliminating 98% of the possible matches, greatly reducing the amount of manual inspection required. Indeed, experts who reviewed the results identified some matches as potentially true and referred them for further investigation.
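To make the two-stage decision concrete, here is a hedged sketch of the aggregation step: per-square CNN match scores for a candidate fragment pair are summarized into voting features, which a Random Forest then turns into a pair-level match prediction. The feature set, threshold, and toy data are illustrative assumptions, not the authors' exact design.

```python
# Hedged sketch of the voting + Random Forest stage described above. The CNN
# scores are assumed already computed; features here are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def vote_features(square_scores, threshold=0.9):
    """Summarize local CNN scores for one fragment pair into voting features."""
    s = np.asarray(square_scores)
    return [
        s.max(),                      # best local thread-continuation match
        s.mean(),                     # overall agreement
        (s > threshold).sum(),        # number of "votes" above threshold
        (s > threshold).mean(),       # fraction of voting squares
    ]

# Toy training data: lists of local scores per candidate pair, plus labels.
rng = np.random.default_rng(0)
pairs = [rng.uniform(0, 1, size=50) for _ in range(200)]
labels = rng.integers(0, 2, size=200)          # 1 = matching pair (toy)

X = np.array([vote_features(p) for p in pairs])
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, labels)
print(clf.predict_proba(X[:3]))                # higher-order match probability
```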


2016 ◽  
Author(s):  
Philippe Desjardins-Proulx ◽  
Idaline Laigle ◽  
Timothée Poisot ◽  
Dominique Gravel

Species interactions are a key component of ecosystems, but we generally have an incomplete picture of who eats whom in a given community. Different techniques have been devised to predict species interactions using theoretical models or abundances. Here, we explore the K nearest neighbour (KNN) approach, with a special emphasis on recommendation, along with other machine learning techniques. Recommenders are algorithms developed for companies like Netflix to predict whether a customer would like a product given the preferences of similar customers. These machine learning techniques are well suited to studying binary ecological interactions since they focus on positive-only data. We also explore how the K nearest neighbour approach can be used with both positive and negative information, in which case the goal of the algorithm is to fill missing entries in a matrix (imputation). By removing a prey item from a predator, we find that recommenders can guess the missing prey around 50% of the time on the first try, out of up to 881 possibilities. Traits do not significantly improve the results for the K nearest neighbour approach, although a simple test with a supervised learning approach (random forests) shows we can predict interactions with high accuracy using only three traits per species. This result shows that binary interactions can be predicted without regard to the ecological community, given only three variables: body mass and two variables for the species' phylogeny. These techniques are complementary: recommenders can predict interactions in the absence of traits, using only information about other species' interactions, while supervised learning algorithms such as random forests base their predictions on traits alone but do not exploit other species' interactions. Further work should focus on developing custom similarity measures specialized to ecology to improve the KNN algorithms, and on using richer data to capture indirect relationships between species.
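A minimal sketch of the KNN recommendation idea on a binary interaction matrix: candidate prey for a predator are scored by the diets of its K most similar predators under Jaccard similarity. The similarity measure and toy food web are illustrative assumptions, not the authors' implementation.

```python
# KNN recommender sketch for a binary predator-prey interaction matrix.
# Rows are predators, columns are prey; entries are 1 if the predator eats
# the prey. All data here is synthetic and illustrative.
import numpy as np

def jaccard(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def knn_scores(interactions, predator, k=5):
    """Rank prey for `predator` using its k most similar predator rows."""
    sims = np.array([jaccard(interactions[predator], row)
                     for row in interactions])
    sims[predator] = -1.0                        # exclude the predator itself
    neighbours = np.argsort(sims)[-k:]
    # Weighted vote: prey eaten by similar predators get higher scores.
    return sims[neighbours] @ interactions[neighbours]

rng = np.random.default_rng(1)
M = (rng.uniform(size=(30, 40)) < 0.1).astype(int)   # toy food web
scores = knn_scores(M, predator=0, k=5)
print(np.argsort(scores)[::-1][:5])                  # top-5 predicted prey
```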


Author(s):  
Hesham M. Al-Ammal

Detection of anomalies in a given data set is a vital step in several cybersecurity applications, including intrusion detection, fraud detection, and social network analysis. Many of these techniques detect anomalies by examining graph-based data. Analyzing graphs makes it possible to capture relationships and communities, as well as anomalies. The advantage of using graphs is that many real-life situations can be easily modeled by a graph that captures their structure and inter-dependencies. Although anomaly detection in graphs dates back to the 1990s, recent research has applied machine learning methods to anomaly detection over graphs. This chapter concentrates on static graphs (both labeled and unlabeled) and summarizes some of these recent machine learning studies, covering methods such as support vector machines, neural networks, generative neural networks, and deep learning methods. The chapter reflects on the successes and challenges of using these methods in the context of graph-based anomaly detection.
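As one concrete recipe from this literature, the sketch below embeds each node of a static, unlabeled graph using simple structural features and flags outliers with an unsupervised detector. The feature set and the choice of Isolation Forest are illustrative assumptions, not a method from any specific study the chapter covers.

```python
# Structural-feature anomaly detection on a static graph: embed each node
# by degree, clustering, and neighbour degree, then flag outliers.
import networkx as nx
import numpy as np
from sklearn.ensemble import IsolationForest

G = nx.barabasi_albert_graph(200, 3, seed=0)     # toy static unlabeled graph

clustering = nx.clustering(G)
features = np.array([
    [G.degree(n),                                # local connectivity
     clustering[n],                              # triangle density around n
     np.mean([G.degree(m) for m in G[n]])]       # mean neighbour degree
    for n in G.nodes
])

detector = IsolationForest(contamination=0.05, random_state=0).fit(features)
anomalous = [n for n, flag in zip(G.nodes, detector.predict(features))
             if flag == -1]                      # -1 marks an outlier
print(anomalous)
```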


Author(s):  
Niddal Imam ◽  
Biju Issac ◽  
Seibu Mary Jacob

Twitter has changed the way people get information by allowing them to express their opinions and comments in daily tweets. Unfortunately, due to Twitter's high popularity, it has become very attractive to spammers, and Twitter spam has become a serious issue in the last few years. The large number of users and the high volume of information shared on Twitter play an important role in accelerating the spread of spam. In order to protect users, Twitter and the research community have been developing spam detection systems based on various machine learning techniques. However, a recent study showed that current machine learning-based detection systems cannot detect spam accurately because spam tweet characteristics vary over time, an issue called "Twitter Spam Drift". In this paper, a semi-supervised learning approach (SSLA) is proposed to tackle this problem. The new approach uses unlabeled data to learn the structure of the domain. Experiments were performed on English and Arabic datasets to test and evaluate the proposed approach, and the results show that the proposed SSLA can reduce the effect of Twitter spam drift and outperform existing techniques.
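The authors' SSLA implementation is not reproduced here; the sketch below shows the same general idea with scikit-learn's generic self-training wrapper, where unlabeled tweets (marked -1) are gradually pseudo-labeled by a confident base classifier. Features and data are toy assumptions.

```python
# Generic self-training sketch of semi-supervised spam detection: unlabeled
# examples help the classifier adapt to the domain. Not the authors' SSLA.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))            # stand-in for tweet features
y = (X[:, 0] + 0.1 * rng.normal(size=1000) > 0).astype(int)   # 1 = spam

# Hide most labels: -1 marks an unlabeled tweet, as sklearn expects.
y_semi = y.copy()
y_semi[rng.uniform(size=1000) < 0.9] = -1

base = LogisticRegression(max_iter=1000)
model = SelfTrainingClassifier(base, threshold=0.95).fit(X, y_semi)
print("accuracy on all data:", model.score(X, y))
```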


Complexity ◽  
2019 ◽  
Vol 2019 ◽  
pp. 1-10 ◽  
Author(s):  
Rafael Vega Vega ◽  
Héctor Quintián ◽  
Carlos Cambra ◽  
Nuño Basurto ◽  
Álvaro Herrero ◽  
...  

The present research proposes the application of unsupervised and supervised machine-learning techniques to characterize Android malware families. More precisely, a novel unsupervised neural-projection method for dimensionality reduction, namely Beta Hebbian Learning (BHL), is applied to visually analyze such malware. Additionally, well-known supervised Decision Trees (DTs) are applied for the first time in order to improve the characterization of these families and to compare the original features identified as the most important ones. The proposed techniques are validated on real-life Android malware data by means of the well-known and publicly available Malgenome dataset. The obtained results support the proposed approach, confirming the validity of BHL and DTs for gaining deep knowledge of Android malware.
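Beta Hebbian Learning is not available in mainstream libraries, so the sketch below substitutes standard PCA to illustrate the overall workflow: an unsupervised 2-D projection for visual analysis, followed by a Decision Tree that ranks the original features by importance. The data is synthetic and the PCA substitution is an explicit assumption.

```python
# Workflow sketch: unsupervised projection for visualization (PCA standing in
# for BHL) plus a supervised Decision Tree to rank original malware features.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                 # stand-in for malware features
families = rng.integers(0, 4, size=500)        # toy family labels

# Unsupervised 2-D projection for visual analysis (BHL substitute).
projected = PCA(n_components=2).fit_transform(X)

# Supervised tree to characterize families and rank original features.
tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, families)
top = np.argsort(tree.feature_importances_)[::-1][:5]
print("most informative feature indices:", top)
```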


2019 ◽  
Vol 63 (11) ◽  
pp. 1658-1667
Author(s):  
M J Castro-Bleda ◽  
S España-Boquera ◽  
J Pastor-Pellicer ◽  
F Zamora-Martínez

This paper presents the 'NoisyOffice' database. It consists of images of printed text documents with noise mainly caused by office uncleanliness, such as coffee stains and footprints on documents, or folded and wrinkled sheets with degraded printed text. The corpus is intended for training and evaluating supervised learning methods for cleaning, binarization, and enhancement of noisy grayscale text-document images. As an example, several image-enhancement and binarization experiments using deep learning techniques are presented. Double-resolution images are also provided for testing super-resolution methods. The corpus is freely available at the UCI Machine Learning Repository. Finally, a challenge organized by Kaggle Inc. to denoise images using the database is described in order to show its suitability for benchmarking image processing systems.
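As an illustration of the supervised cleaning task the corpus supports, here is a minimal image-to-image denoising CNN in PyTorch trained to map a noisy grayscale document to its clean version. The architecture, sizes, and toy data are assumptions, not the paper's models.

```python
# Minimal denoising-CNN sketch for grayscale document images: learn a
# noisy -> clean pixel mapping. Sizes and data are illustrative.
import torch
import torch.nn as nn

denoiser = nn.Sequential(                       # image-to-image CNN
    nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, kernel_size=3, padding=1), nn.Sigmoid(),  # pixel in [0,1]
)

optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Toy batch: 4 noisy/clean image pairs of size 1x64x64 in [0, 1].
clean = torch.rand(4, 1, 64, 64)
noisy = (clean + 0.3 * torch.randn_like(clean)).clamp(0, 1)

loss = loss_fn(denoiser(noisy), clean)          # learn noisy -> clean mapping
loss.backward()
optimizer.step()
```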


2012 ◽  
Vol 10 (10) ◽  
pp. 547
Author(s):  
Mei Zhang ◽  
Gregory Johnson ◽  
Jia Wang

A takeover success prediction model aims at predicting the probability that a takeover attempt will succeed, using publicly available information at the time of the announcement. We perform a thorough study using machine learning techniques to predict takeover success. Specifically, we model takeover success prediction as a binary classification problem, which has been widely studied in the machine learning community. Motivated by recent advances in machine learning, we empirically evaluate and analyze many state-of-the-art classifiers, including logistic regression, artificial neural networks, support vector machines with different kernels, decision trees, random forests, and AdaBoost. The experiments validate the effectiveness of applying machine learning to takeover success prediction, and we find that the support vector machine with linear kernel and AdaBoost with stump weak classifiers perform best for the task. This result is consistent with general observations about these two approaches.
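A hedged sketch of the comparison the study describes, using scikit-learn: a linear-kernel SVM against AdaBoost with decision-stump weak learners on a toy binary takeover-success task. The synthetic features stand in for announcement-time deal data.

```python
# Compare a linear-kernel SVM with AdaBoost-on-stumps via cross-validation.
# Features and labels are synthetic stand-ins, not real takeover data.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 12))                  # toy deal/firm features
y = (X[:, :3].sum(axis=1) > 0).astype(int)      # 1 = takeover succeeded (toy)

svm = SVC(kernel="linear")
ada = AdaBoostClassifier(                       # stumps = depth-1 trees
    estimator=DecisionTreeClassifier(max_depth=1), n_estimators=200)

for name, clf in [("linear SVM", svm), ("AdaBoost stumps", ada)]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(name, scores.mean().round(3))
```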


Symmetry ◽  
2020 ◽  
Vol 12 (4) ◽  
pp. 499 ◽  
Author(s):  
Iqbal H. Sarker ◽  
Yoosef B. Abushark ◽  
Asif Irshad Khan

This paper formulates the problem of predicting context-aware smartphone app usage based on machine learning techniques. In the real world, people use various kinds of smartphone apps differently in different contexts, including both user-centric and device-centric contexts. In the area of artificial intelligence and machine learning, the decision tree model is one of the most popular approaches for predicting context-aware smartphone usage. However, real-life smartphone app usage data may contain many context dimensions, which can increase model complexity, cause over-fitting, and consequently decrease the prediction accuracy of the context-aware model. To address these issues, this paper presents an effective principal component analysis (PCA) based context-aware smartphone app prediction model, "ContextPCA", using the decision tree machine learning classification technique. PCA is an unsupervised machine learning technique that can be used to separate symmetric and asymmetric components; it is adopted in the "ContextPCA" model to reduce the context dimensions of the original data set. Experimental results on smartphone app usage datasets show that the "ContextPCA" model effectively predicts context-aware smartphone apps in terms of precision, recall, F-score, and ROC values in various test cases.
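A minimal sketch of the "ContextPCA" idea as described: PCA reduces the context dimensions, and a decision tree then classifies which app is used. The context encoding, component count, and tree depth are illustrative assumptions, not the authors' configuration.

```python
# PCA-then-decision-tree pipeline for context-aware app prediction.
# Contexts and labels are synthetic stand-ins for encoded usage data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 30))          # stand-in for encoded contexts
apps = rng.integers(0, 6, size=1000)     # toy labels: which app was used

context_pca = Pipeline([
    ("pca", PCA(n_components=8)),        # shrink the context dimensions
    ("tree", DecisionTreeClassifier(max_depth=8, random_state=0)),
]).fit(X, apps)

print(context_pca.predict(X[:5]))        # predicted apps for new contexts
```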


Author(s):  
Sumit Kumar ◽  
Sanlap Acharya

The prediction of stock prices has always been a very challenging problem for investors, and using machine learning techniques to predict stock prices is one of the favourite topics for academics working in this domain. This chapter discusses five supervised learning techniques and two unsupervised learning techniques for the problem of stock price prediction and compares the performance of all the algorithms. Among the supervised learning techniques, the Long Short-Term Memory (LSTM) algorithm performed better than the others, whereas among the unsupervised learning techniques, the Restricted Boltzmann Machine (RBM) performed better. The RBM is found to perform even better than the LSTM.
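A hedged sketch of the strongest supervised technique the chapter reports: an LSTM regressor predicting the next price from a sliding window, in PyTorch. Window length, network sizes, and the synthetic price series are illustrative assumptions.

```python
# LSTM regressor for next-step price prediction on a toy random-walk series.
import torch
import torch.nn as nn

class PriceLSTM(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(1, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                       # x: (batch, window, 1)
        h, _ = self.lstm(x)
        return self.head(h[:, -1])              # predict from last time step

# Toy series: sliding windows of 20 prices predict the next price.
series = torch.cumsum(torch.randn(500), dim=0)
X = torch.stack([series[i:i + 20] for i in range(480)]).unsqueeze(-1)
y = series[20:500].unsqueeze(-1)

model, loss_fn = PriceLSTM(), nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(5):                              # a few illustrative steps
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
```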


Author(s):  
Anshul et al.

The COVID-19 virus, which belongs to the severe acute respiratory syndrome (SARS) family, has raised a health emergency in almost all countries of the world. Numerous machine learning and deep learning based techniques have been used to diagnose COVID-positive patients from different image modalities such as CT scan, X-ray, or CBX. This paper reviews the work done on COVID-19 diagnosis, examines the role of ML- and DL-based methods in solving this problem, and presents their limitations with respect to COVID-19 diagnosis.
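As a minimal illustration of the kind of model these works apply, the sketch below is a small CNN that classifies a chest image as COVID-positive or negative. The architecture and input size are assumptions for illustration, not any surveyed paper's network.

```python
# Tiny CNN sketch for binary classification of grayscale chest images.
# Architecture, input size, and data are illustrative assumptions.
import torch
import torch.nn as nn

classifier = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 32 * 32, 2),                 # 2 classes: negative/positive
)

images = torch.rand(8, 1, 128, 128)             # toy grayscale scans
labels = torch.randint(0, 2, (8,))
loss = nn.CrossEntropyLoss()(classifier(images), labels)
loss.backward()                                 # one illustrative update step
```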

