AliSim: A Fast and Versatile Phylogenetic Sequence Simulator For the Genomic Era

Sequence simulators play an important role in phylogenetics. Simulated data have many applications, such as evaluating the performance of different methods, hypothesis testing with parametric bootstraps, and, more recently, generating data for training machine-learning applications. Many sequence simulation programs exist, but the most feature-rich programs tend to be rather slow, and the fastest programs tend to be feature-poor. Here, we introduce AliSim, a new tool that can efficiently simulate biologically realistic alignments under a large range of complex evolutionary models. To achieve high performance across a wide range of simulation conditions, AliSim implements an adaptive approach that combines the commonly used rate matrix and probability matrix approaches. AliSim takes 1.3 hours and 1.3 GB RAM to simulate alignments with one million sequences or sites, while the popular programs Seq-Gen, Dawg, and INDELible require two to five hours and 50 to 500 GB of RAM. We provide AliSim as an extension of the IQ-TREE software version 2.2, freely available at www.iqtree.org, and a comprehensive user tutorial at http://www.iqtree.org/doc/AliSim.
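The two simulation strategies named in the abstract can be illustrated with a small sketch. This is not AliSim's code: it assumes the simple JC69 nucleotide model, and the function names are hypothetical. The rule AliSim uses to switch between the two approaches is not stated in the abstract, so it is not shown.

```python
import math
import random

NUCLEOTIDES = "ACGT"


def jc69_transition_prob(same: bool, t: float) -> float:
    """Closed-form JC69 probability for a branch of length t
    (t is measured in expected substitutions per site)."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if same else 0.25 - 0.25 * decay


def evolve_probability_matrix(seq: str, t: float, rng: random.Random) -> str:
    """Probability-matrix approach: draw each site's end state from P(t)."""
    out = []
    for base in seq:
        weights = [jc69_transition_prob(b == base, t) for b in NUCLEOTIDES]
        out.append(rng.choices(NUCLEOTIDES, weights=weights)[0])
    return "".join(out)


def evolve_rate_matrix(seq: str, t: float, rng: random.Random) -> str:
    """Rate-matrix approach: simulate individual substitution events
    along the branch (Gillespie-style waiting times)."""
    sites = list(seq)
    if not sites:
        return ""
    # Under normalised JC69 every site leaves its state at rate 1,
    # so the whole sequence changes at rate len(sites).
    total_rate = float(len(sites))
    elapsed = rng.expovariate(total_rate)
    while elapsed < t:
        i = rng.randrange(len(sites))
        sites[i] = rng.choice([b for b in NUCLEOTIDES if b != sites[i]])
        elapsed += rng.expovariate(total_rate)
    return "".join(sites)


if __name__ == "__main__":
    rng = random.Random(1)
    ancestor = "ACGT" * 10
    print(evolve_probability_matrix(ancestor, 0.1, rng))
    print(evolve_rate_matrix(ancestor, 0.1, rng))
```

In this sketch the probability-matrix version does work proportional to the number of sites regardless of branch length, while the rate-matrix version does work proportional to the number of substitution events, which is one plausible reason to choose between them by branch length and sequence length.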

Building High Performance Explainable Machine Learning Models for Social Media-based Substance Use Prediction

International Journal of Artificial Intelligence Tools, DOI 10.1142/s021821302060009x, 2020, Vol 29 (03n04), pp. 2060009

Author(s): Tao Ding, Fatema Hasan, Warren K. Bickel, Shimei Pan

Keyword(s): Machine Learning, Social Media, Substance Use, Human Behavior, High Performance, Supervised Machine Learning, Learning Models, Wide Range, And Behavior, Machine Learning Models

Social media contain rich information that can be used to help understand the human mind and behavior. Social media data, however, are mostly unstructured (e.g., text and images), and a large number of features may be needed to represent them (e.g., millions of unigrams may be needed to represent social media texts). Moreover, accurately assessing human behavior is often difficult (e.g., assessing addiction may require a medical diagnosis). As a result, the ground truth data needed to train a supervised human behavior model are often difficult to obtain at a large scale. To avoid overfitting, many state-of-the-art behavior models employ sophisticated unsupervised or self-supervised machine learning methods to leverage a large amount of unlabeled data for both feature learning and dimension reduction. Unfortunately, despite their high performance, these advanced machine learning models often rely on latent features that are hard to explain. Since understanding the knowledge captured in these models is important to behavior scientists and public health providers, we explore new methods to build machine learning models that are not only accurate but also interpretable. We evaluate the effectiveness of the proposed methods in predicting Substance Use Disorders (SUD). We believe the proposed methods are general and applicable to a wide range of data-driven human trait and behavior analysis applications.

Securing Machine Learning in the Cloud: A Systematic Review of Cloud Machine Learning Security

Frontiers in Big Data, DOI 10.3389/fdata.2020.587139, 2020, Vol 3

Author(s): Adnan Qayyum, Aneeqa Ijaz, Muhammad Usama, Waleed Iqbal, Junaid Qadir, ...

Keyword(s): Machine Learning, Systematic Review, Graphics Processing Units, High Performance, Cloud Services, Third Party, Systematic Evaluation, Research Issues, Wide Range, Attack And Defense

With the advances in machine learning (ML) and deep learning (DL) techniques, and the ability of cloud computing to offer services efficiently and cost-effectively, Machine Learning as a Service (MLaaS) cloud platforms have become popular. In addition, there is increasing adoption of third-party cloud services for outsourcing the training of DL models, which requires substantial and costly computational resources (e.g., high-performance graphics processing units (GPUs)). Such widespread usage of cloud-hosted ML/DL services opens a wide range of attack surfaces for adversaries to exploit the ML/DL system to achieve malicious goals. In this article, we conduct a systematic evaluation of the literature on cloud-hosted ML/DL models along two important dimensions related to their security: attacks and defenses. Our systematic review identified a total of 31 related articles, of which 19 focused on attack, six focused on defense, and six focused on both attack and defense. Our evaluation reveals increasing interest from the research community in attacking and defending Machine Learning as a Service platforms. In addition, we identify the limitations and pitfalls of the analyzed articles and highlight open research issues that require further investigation.

A Grey-Box Ensemble Model Exploiting Black-Box Accuracy and White-Box Intrinsic Interpretability

Algorithms, DOI 10.3390/a13010017, 2020, Vol 13 (1), pp. 17

Cited By ~ 4

Author(s): Emmanuel Pintelas, Ioannis E. Livieris, Panagiotis Pintelas

Keyword(s): Machine Learning, Real World, High Performance, Black Box, Box Model, Proposed Model, Wide Range, Critical Issues, Key Factor, Real World Datasets

Machine learning has emerged as a key factor in many technological and scientific advances and applications. Much research has been devoted to developing high performance machine learning models, which are able to make very accurate predictions and decisions on a wide range of applications. Nevertheless, we still seek to understand and explain how these models work and make decisions. Explainability and interpretability in machine learning are a significant issue, since in most real-world problems it is considered essential to understand and explain the model’s prediction mechanism in order to trust it and make decisions on critical issues. In this study, we developed a Grey-Box model based on a semi-supervised methodology utilizing a self-training framework. The main objective of this work is the development of a machine learning model that is both interpretable and accurate, although this is a complex and challenging task. The proposed model was evaluated on a variety of real-world datasets from the crucial application domains of education, finance and medicine. Our results demonstrate the efficiency of the proposed model, which performs comparably to a Black-Box model and considerably outperforms single White-Box models while remaining as interpretable as a White-Box model.

[ANT]: A Machine Learning Approach for Building Performance Simulation: Methods and Development

The Academic Research Community Publication, DOI 10.21625/archive.v3i1.442, 2019, Vol 3 (1), pp. 205

Author(s): Mahmoud M. Abdelrahman, Ahmed Mohamed Yousef Toutou

Keyword(s): Machine Learning, High Performance, General Purpose, Ease Of Use, Building Performance, Performance Simulation, Building Performance Simulation, Wide Range, Performance Development, High Level

In this paper, we present an approach for combining machine learning (ML) techniques with building performance simulation by introducing four methods through which ML can be effectively applied in this field: classification, regression, clustering, and model selection. The Rhino-3d-Grasshopper SDK was used to develop a new plugin that brings machine learning into the design process using the Python programming language and the scikit-learn module. scikit-learn is a Python module that gives non-specialist users a general-purpose, high-level interface to a wide range of supervised and unsupervised learning algorithms, with high performance, ease of use, and good documentation. The ANT plugin makes these modules available inside Rhino/Grasshopper so that they are readily accessible to designers. The tool is open source and released under the simplified BSD license. The approach shows promising results for using data to automate building performance development and could be widely applied. Future work includes providing a parallel computation facility using the PyOpenCL module as well as computer vision integration using scikit-image.

Machine learning misclassification of academic publications reveals non-trivial interdependencies of scientific disciplines

Scientometrics, DOI 10.1007/s11192-020-03789-8, 2020

Author(s): Alexey Lyutov, Yilmaz Uygun, Marc-Thorsten Hütt

Keyword(s): Machine Learning, Quantitative Methods, Classification Problem, General Level, Scientific Disciplines, Wide Range, Machine Learning Applications, Academic Publications, Standard Classification, Classification Tasks

Exploring the production of knowledge with quantitative methods is the foundation of scientometrics. In an application of machine learning to scientometrics, we here consider the classification problem of mapping academic publications to the subcategories of a multidisciplinary journal, and hence to scientific disciplines, based on the information contained in the abstract. In contrast to standard classification tasks, we are not interested in maximizing the accuracy; rather, we ask whether the failures of an automatic classification are systematic and contain information about the system under investigation. These failures can be represented as a ‘misclassification network’ inter-relating scientific disciplines. Here we show that this misclassification network (1) gives a markedly different pattern of interdependencies among scientific disciplines than common ‘maps of science’, (2) reveals a statistical association between misclassification and citation frequencies, and (3) allows disciplines to be classified as ‘method lenders’ and ‘content explorers’, based on their in-degree out-degree asymmetry. On a more general level, in a wide range of machine learning applications misclassification networks have the potential to extract systemic information from the failed classifications, making it possible to visualize and quantitatively assess those aspects of a complex system that are not machine learnable.
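The idea of a misclassification network and its in-degree/out-degree asymmetry can be sketched as follows. This is an illustration only, not the authors' implementation: the function names are hypothetical, and the edge direction (from the true discipline to the predicted one) is an assumption, since the abstract does not specify it.

```python
from collections import Counter


def misclassification_network(true_labels, predicted_labels):
    """Count directed edges (true -> predicted) for misclassified items only.

    The edge direction is an assumption made for this sketch.
    """
    edges = Counter()
    for truth, guess in zip(true_labels, predicted_labels):
        if truth != guess:
            edges[(truth, guess)] += 1
    return edges


def degree_asymmetry(edges):
    """Return weighted in-degree minus weighted out-degree for every node."""
    balance = Counter()
    for (source, target), weight in edges.items():
        balance[source] -= weight
        balance[target] += weight
    return dict(balance)


if __name__ == "__main__":
    truth = ["physics", "physics", "biology", "chemistry"]
    guess = ["math", "physics", "math", "biology"]
    network = misclassification_network(truth, guess)
    print(degree_asymmetry(network))
```

Under this convention, a discipline with a strongly positive balance attracts many abstracts that belong elsewhere, and one with a strongly negative balance loses many of its own abstracts to other disciplines. Which sign corresponds to which of the paper's two categories is not stated in the abstract and is left open here.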

Colony Fingerprint-Based Discrimination of Staphylococcus species with Machine Learning Approaches

Sensors, DOI 10.3390/s18092789, 2018, Vol 18 (9), pp. 2789

Cited By ~ 4

Author(s): Yoshiaki Maeda, Yui Sugiyama, Atsushi Kogiso, Tae-Kyu Lim, Manabu Harada, ...

Keyword(s): Machine Learning, High Performance, Imaging System, Bacterial Species, Support Vector, Learning Approaches, Wide Range, Central Regions, Species Specific, Staphylococcus Spp

Detection and discrimination of bacteria are crucial in a wide range of industries, including clinical testing and food and beverage production. Staphylococcus species cause various diseases and are frequently detected in clinical specimens and food products. In particular, S. aureus is well known to be the most pathogenic species. Conventional phenotypic and genotypic methods for discrimination of Staphylococcus spp. are time-consuming and labor-intensive. To address this issue, in the present study, we applied a novel discrimination methodology called colony fingerprinting. Colony fingerprinting discriminates bacterial species based on the multivariate analysis of the images of microcolonies (referred to as colony fingerprints) with a size of up to 250 μm in diameter. The colony fingerprints were obtained via a lens-less imaging system. Profiling of the colony fingerprints of five Staphylococcus spp. (S. aureus, S. epidermidis, S. haemolyticus, S. saprophyticus, and S. simulans) revealed that the central regions of the colony fingerprints showed species-specific patterns. We developed 14 discriminative parameters, some of which highlight the features of the central regions, and analyzed them by several machine learning approaches. As a result, artificial neural network (ANN), support vector machine (SVM), and random forest (RF) showed high performance for discrimination of these bacteria. Bacterial discrimination by colony fingerprinting can be performed within 11 h, on average, and therefore can cut discrimination time in half compared to conventional methods. Moreover, we also successfully demonstrated discrimination of S. aureus in a mixed culture with Pseudomonas aeruginosa. These results suggest that colony fingerprinting is useful for discrimination of Staphylococcus spp.

Recent Advances on Machine Learning Applications in Machining Processes

Applied Sciences, DOI 10.3390/app11188764, 2021, Vol 11 (18), pp. 8764

Author(s): Francesco Aggogeri, Nicola Pellegrini, Franco Luis Tagliani

Keyword(s): Machine Learning, High Performance, Production Quality, Machining Processes, Industrial Systems, Equipment Availability, Machine Learning Applications, Conventional Machine, Operational Activities, Short Time

This study presents an overall review of the recent research status regarding Machine Learning (ML) applications in machining processes. In current industrial systems, processes require the capacity to adapt to manufacturing conditions continuously, guaranteeing high performance in terms of production quality and equipment availability. Artificial Intelligence (AI) offers new opportunities to develop and integrate innovative solutions in conventional machine tools to reduce undesirable effects during operational activities. In particular, the significant increase in computational capacity may permit the application of complex algorithms to big data volumes in a short time, expanding the potential of ML techniques. ML applications are present in several contexts of machining processes, from roughness quality prediction to tool condition monitoring. This review focuses on recent applications and implications, classifying the main problems that may be solved using ML related to machining quality, energy consumption and condition monitoring. Finally, the advantages and limits of ML algorithms are discussed to guide future investigations.

Machine learning applications for shock train diagnostics

AIAA Scitech 2021 Forum, DOI 10.2514/6.2021-1878, 2021

Author(s): Jared Chin, Mirko Gamba

Keyword(s): Machine Learning, Shock Train, Machine Learning Applications

Efficient Prediction of Structural and Electronic Properties of Hybrid 2D Materials Using DFT and Machine Learning

DOI 10.26434/chemrxiv.6254756.v1, 2018

Author(s): Sherif Tawfik, Olexandr Isayev, Catherine Stampfl, Joseph Shapter, David Winkler, ...

Keyword(s): Machine Learning, Band Gap, Density Functional, 2D Materials, Van Der Waals, Building Blocks, Machine Learning Techniques, Interlayer Distance, Computational Screening, Wide Range

Materials constructed from different van der Waals two-dimensional (2D) heterostructures offer a wide range of benefits, but these systems have been little studied because of their experimental and computational complexity, and because of the very large number of possible combinations of 2D building blocks. The simulation of the interface between two different 2D materials is computationally challenging due to the lattice mismatch problem, which sometimes necessitates the creation of very large simulation cells for performing density-functional theory (DFT) calculations. Here we use a combination of DFT, linear regression and machine learning techniques to rapidly determine the interlayer distance between two different 2D heterostructures that are stacked in a bilayer heterostructure, as well as the band gap of the bilayer. Our work provides an excellent proof of concept by quickly and accurately predicting a structural property (the interlayer distance) and an electronic property (the band gap) for a large number of hybrid 2D materials. This work paves the way for rapid computational screening of the vast parameter space of van der Waals heterostructures to identify new hybrid materials with useful and interesting properties.

COVID-19 Outbreak Prediction with Machine Learning

DOI 10.34055/osf.io/xr4js, 2020

Author(s): Sina Faizollahzadeh Ardabili, Amir Mosavi, Pedram Ghamisi, Filip Ferdinand, Annamaria R. Varkonyi-Koczy, ...

Keyword(s): Machine Learning, Prediction Models, Fuzzy Inference, Control Measures, Future Research, Complex Nature, Inference System, Wide Range, Standard Models, High Level

Several outbreak prediction models for COVID-19 are being used by officials around the world to make informed decisions and enforce relevant control measures. Among the standard models for COVID-19 global pandemic prediction, simple epidemiological and statistical models have received more attention from authorities, and they are popular in the media. Due to a high level of uncertainty and a lack of essential data, standard models have shown low accuracy for long-term prediction. Although the literature includes several attempts to address this issue, the essential generalization and robustness abilities of existing models need to be improved. This paper presents a comparative analysis of machine learning and soft computing models to predict the COVID-19 outbreak as an alternative to SIR and SEIR models. Among a wide range of machine learning models investigated, two models showed promising results (i.e., multi-layered perceptron, MLP, and adaptive network-based fuzzy inference system, ANFIS). Based on the results reported here, and due to the highly complex nature of the COVID-19 outbreak and variation in its behavior from nation to nation, this study suggests machine learning as an effective tool to model the outbreak. This paper provides an initial benchmarking to demonstrate the potential of machine learning for future research. The paper further suggests that real novelty in outbreak prediction can be realized through integrating machine learning and SEIR models.
