An Exotic IWD - SVR Based Approach for Failure Prognostication in Cloud-Based Scientific Workflows

Mapping Intimacies ◽

10.21203/rs.3.rs-716843/v1 ◽

2021 ◽

Author(s):

Sridevi S ◽

Jeevaa Katiravan

Keyword(s):

Large Scale ◽

Performance Metrics ◽

Prediction Models ◽

Fault Tolerant ◽

Scientific Workflow ◽

Scientific Workflows ◽

Support Vector ◽

Learning Approaches ◽

Task Failure ◽

Proactive Measures

Abstract Scientific workflows are attracting growing attention in sophisticated large-scale scientific problem-solving environments. Because the tasks in a workflow-based application depend on one another, even a single task failure can drastically affect the reliability of the overall system. Hence proactive measures, rather than reactive fault-tolerant approaches, are vital in scientific workflows. This work explores an Exotic Intelligent Water Drops - Support Vector Regression (IWD-SVR) based approach for task failure prognostication, which facilitates proactive fault tolerance in scientific workflow applications. The failure prediction models in this study were implemented with SVR-based machine learning approaches, their accuracy was optimized with the Intelligent Water Drops algorithm (IWDA), and various performance metrics were evaluated. The experimental results show that the proposed approach performs better than the existing techniques it was compared with.

Machine Learning Methods Applied to the Prediction of Pseudo-nitzschia spp. Blooms in the Galician Rias Baixas (NW Spain)

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi10040199 ◽

2021 ◽

Vol 10 (4) ◽

pp. 199

Author(s):

Francisco M. Bellas Aláez ◽

Jesus M. Torres Palenzuela ◽

Evangelos Spyrakos ◽

Luis González Vilas

Keyword(s):

Machine Learning ◽

Performance Metrics ◽

Prediction Models ◽

Support Vector ◽

False Alarms ◽

Learning Approaches ◽

Learning Methods ◽

Machine Learning Methods ◽

Rías Baixas ◽

New Algorithms

This work presents new prediction models based on recent developments in machine learning methods, such as Random Forest (RF) and AdaBoost, and compares them with more classical approaches, i.e., support vector machines (SVMs) and neural networks (NNs). The models predict Pseudo-nitzschia spp. blooms in the Galician Rias Baixas. This work builds on a previous study by the authors (doi.org/10.1016/j.pocean.2014.03.003) but uses an extended database (from 2002 to 2012) and new algorithms. Our results show that RF and AdaBoost provide better prediction results compared to SVMs and NNs, as they show improved performance metrics and a better balance between sensitivity and specificity. Classical machine learning approaches show higher sensitivities, but at a cost of lower specificity and higher percentages of false alarms (lower precision). These results seem to indicate a greater adaptation of new algorithms (RF and AdaBoost) to unbalanced datasets. Our models could be operationally implemented to establish a short-term prediction system.

Forecasting the risk at infractions: an ensemble comparison of machine learning approach

Industrial Management & Data Systems ◽

10.1108/imds-10-2020-0603 ◽

2021 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Lei Li ◽

Desheng Wu

Keyword(s):

Machine Learning ◽

Prediction Models ◽

Short Term Memory ◽

Model Performance ◽

Large Data ◽

Support Vector ◽

Learning Approaches ◽

Content Type ◽

Day To Day Operations ◽

Prediction Approach

Purpose The infraction of securities regulations (ISRs) by listed firms in their day-to-day operations and management has become a common problem. This paper proposes several machine learning approaches to forecast the infraction risk of listed companies, addressing financial supervision that is neither effective nor precise. Design/methodology/approach The overall research framework for forecasting ISRs includes data collection and cleaning, feature engineering, data split, prediction approach application and model performance evaluation. We select Logistic Regression, Naïve Bayes, Random Forest, Support Vector Machines, Artificial Neural Network and Long Short-Term Memory Networks (LSTMs) as ISR prediction models. Findings The results show that models that include prior infractions predict ISRs significantly better than those without them, especially for large sample sets. The results also indicate that, when judging whether a company has infractions, we should pay attention to novel artificial intelligence methods, the company's previous infractions, and large data sets. Originality/value The findings could be used, to a certain degree, to address the problem of identifying listed companies' ISRs. Overall, the results elucidate the value of prior infractions of securities regulations. This shows the importance of including more data sources when constructing distress models, rather than only building increasingly complex models on the same data. This is also beneficial to the regulatory authorities.

Machine Learning Frameworks in Cancer Detection

E3S Web of Conferences ◽

10.1051/e3sconf/202129701073 ◽

2021 ◽

Vol 297 ◽

pp. 01073

Author(s):

Sabyasachi Pramanik ◽

K. Martin Sagayam ◽

Om Prakash Jena

Keyword(s):

Machine Learning ◽

Prediction Models ◽

Supervised Machine Learning ◽

Machine Learning Techniques ◽

Cancer Development ◽

Support Vector ◽

Learning Approaches ◽

Learning Techniques ◽

Fact Finding ◽

Risk Of Cancer

Cancer has been described as a diverse illness with several distinct subtypes that may occur simultaneously. As a result, early detection and forecasting of cancer types have become essential in cancer fact-finding methods, since they may help to improve the clinical treatment of cancer survivors. The significance of categorizing cancer sufferers into higher- or lower-threat categories has prompted numerous fact-finding associates from the bioscience and genomics fields to investigate the use of machine learning (ML) algorithms in cancer diagnosis and treatment. Because of this, these methods have been used with the goal of simulating the development and treatment of malignant diseases in humans. Furthermore, the capacity of machine learning techniques to identify important characteristics from complicated datasets demonstrates the significance of these technologies. These technologies include Bayesian networks and artificial neural networks, along with a number of other approaches. Decision Trees and Support Vector Machines, which have already been extensively used in cancer research for the creation of predictive models, also lead to accurate decision making. The application of machine learning techniques may undoubtedly enhance our knowledge of cancer development; nevertheless, a sufficient degree of validation is required before these approaches can be considered for use in daily clinical practice. An overview of current machine learning approaches utilized in the simulation of cancer development is presented in this paper. All of the supervised machine learning approaches described here, along with a variety of input characteristics and data samples, are used to build the prediction models. In light of the increasing trend towards the use of machine learning methods in biomedical research, we offer the most current papers that have used these approaches to predict risk of cancer or patient outcomes in order to better understand cancer.

Fault-Tolerant and Data-Intensive Resource Scheduling and Management for Scientific Applications in Cloud Computing

Sensors ◽

10.3390/s21217238 ◽

2021 ◽

Vol 21 (21) ◽

pp. 7238

Author(s):

Zulfiqar Ahmad ◽

Ali Imran Jehangiri ◽

Mohammed Alaa Ala’anzy ◽

Mohamed Othman ◽

Arif Iqbal Umar

Keyword(s):

Cloud Computing ◽

Fault Tolerant ◽

Research Work ◽

Resource Scheduling ◽

Scientific Workflow ◽

Scientific Workflows ◽

Scientific Applications ◽

Data Intensive ◽

Computing Paradigm ◽

Cost Constraints

Cloud computing is a fully fledged, mature and flexible computing paradigm that provides services to scientific and business applications in a subscription-based environment. Scientific applications such as Montage and CyberShake are organized as scientific workflows with data- and compute-intensive tasks and also have some special characteristics. In particular, the tasks of scientific workflows are executed in terms of integration, disintegration, pipeline, and parallelism, and thus require special attention to task management and data-oriented resource scheduling and management. The tasks executed during pipeline are considered bottleneck executions: the failure of any of them renders the whole execution futile, so they require fault-tolerance-aware execution. The tasks executed during parallelism require similar instances of cloud resources, and thus cluster-based execution may improve system performance in terms of make-span and execution cost. Therefore, this research work presents a cluster-based, fault-tolerant and data-intensive (CFD) scheduling strategy for scientific applications in cloud environments. The CFD strategy addresses the data intensiveness of tasks of scientific workflows with cluster-based, fault-tolerant mechanisms. The Montage scientific workflow is used for the simulation, and the results of the CFD strategy were compared with three well-known heuristic scheduling policies: (a) MCT, (b) Max-min, and (c) Min-min. The simulation results showed that the CFD strategy reduced the make-span by 14.28%, 20.37%, and 11.77%, respectively, as compared with the three existing policies. Similarly, the CFD strategy reduced the execution cost by 1.27%, 5.3%, and 2.21%, respectively. With the CFD strategy, the SLA is not violated with regard to time and cost constraints, whereas it is violated numerous times by the existing policies.
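
For readers unfamiliar with the baseline policies named in this abstract, the following is a minimal sketch of the textbook Min-min and Max-min heuristics, not the CFD strategy and not the authors' simulation setup. It assumes independent tasks (no workflow dependencies) and a hypothetical table `exec_time` of per-machine run times; the function name `schedule` is also an illustrative choice.

```python
from typing import Dict, List, Tuple


def schedule(exec_time: Dict[str, List[float]], n_machines: int,
             policy: str = "min-min") -> Tuple[Dict[str, int], float]:
    """Return (task -> machine assignment, makespan).

    exec_time[task][m] is the assumed run time of `task` on machine `m`.
    Tasks are treated as independent, which is a simplification.
    """
    ready = [0.0] * n_machines      # time at which each machine becomes free
    assignment: Dict[str, int] = {}
    unscheduled = set(exec_time)
    while unscheduled:
        # For every remaining task, find its minimum completion time and machine.
        best = {}
        for t in unscheduled:
            m = min(range(n_machines), key=lambda j: ready[j] + exec_time[t][j])
            best[t] = (ready[m] + exec_time[t][m], m)
        # Min-min schedules the task with the smallest such time first,
        # Max-min the task with the largest. Sorting makes ties deterministic.
        pick = min if policy == "min-min" else max
        task = pick(sorted(best), key=lambda t: best[t][0])
        finish, machine = best[task]
        assignment[task] = machine
        ready[machine] = finish
        unscheduled.remove(task)
    return assignment, max(ready)


# Made-up run times for three tasks on two machines.
times = {"a": [2, 4], "b": [3, 1], "c": [10, 10]}
print(schedule(times, 2, "min-min"))
print(schedule(times, 2, "max-min"))
```

On this toy input Max-min places the long task first and ends with a shorter makespan than Min-min, which is the usual motivation for comparing the two.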

Hybrid Balanced Task Clustering Algorithm for Scientific Workflows in Cloud Computing

Scalable Computing Practice and Experience ◽

10.12694/scpe.v20i2.1515 ◽

2019 ◽

Vol 20 (2) ◽

pp. 237-258

Author(s):

Avinash Kaur ◽

Pooja Gupta ◽

Manpreet Singh

Keyword(s):

Impact Factor ◽

Large Scale ◽

Clustering Algorithm ◽

Data Transfer ◽

Scientific Workflow ◽

Scientific Workflows ◽

Coarse Grained ◽

Clustering Methods ◽

The Impact ◽

Task Clustering

A scientific workflow is a composition of both coarse-grained and fine-grained computational tasks with varying execution requirements. Scientific workflows involve large-scale data transfer, so efficient techniques are required to reduce the makespan of the workflow. Task clustering is an efficient technique for this scenario: it combines multiple tasks with short execution times into a single cluster to be executed on one resource. This reduces scheduling overheads in scientific workflows and thus improves performance. However, available task clustering methods cluster the tasks horizontally without considering the structure of tasks in a workflow. We propose a hybrid balanced task clustering algorithm that uses the impact factor of workflows along with the structure of the workflow. With this technique, tasks can be considered for clustering either vertically or horizontally based on the value of the impact factor. This minimizes the system overheads and the makespan for execution of a workflow. A simulation-based evaluation performed on real workflows shows that the proposed algorithm is efficient in recommending clusters. It shows an improvement of 5-10% in the makespan of a workflow, depending on the type of workflow used.
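
As an illustration of the horizontal clustering that this abstract takes as its starting point, here is a minimal sketch that merges tasks at the same workflow level into at most `k` clusters while balancing total runtime. It is not the proposed hybrid algorithm: the impact factor and vertical clustering are not modelled, and the task table, the greedy longest-first balancing rule, and the name `horizontal_clusters` are all assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def horizontal_clusters(tasks: Dict[str, Tuple[int, float]],
                        k: int) -> Dict[int, List[List[str]]]:
    """Group tasks of the same level into at most k runtime-balanced clusters.

    tasks[name] = (level in the workflow, estimated runtime); both are
    hypothetical inputs for this sketch.
    """
    by_level = defaultdict(list)
    for name, (level, runtime) in tasks.items():
        by_level[level].append((runtime, name))

    result: Dict[int, List[List[str]]] = {}
    for level, items in by_level.items():
        n = min(k, len(items))
        clusters: List[List[str]] = [[] for _ in range(n)]
        load = [0.0] * n
        # Greedy balancing: longest task first, into the least-loaded cluster.
        for runtime, name in sorted(items, reverse=True):
            i = load.index(min(load))
            clusters[i].append(name)
            load[i] += runtime
        result[level] = clusters
    return result


# Made-up workflow: three short tasks on level 1, one task on level 2.
demo = {"t1": (1, 5.0), "t2": (1, 3.0), "t3": (1, 2.0), "t4": (2, 1.0)}
print(horizontal_clusters(demo, k=2))
```

Each resulting cluster would be submitted as one job, so the scheduler handles two jobs on level 1 instead of three.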

Optimized sgRNA design by deep learning to balance the off-target effects and on-target activity of CRISPR/Cas9

10.1101/2020.03.04.976340 ◽

2020 ◽

Author(s):

Jie Lan ◽

Yang Cui ◽

Xiaowen Wang ◽

Guangtao Song ◽

Jizhong Lou

Keyword(s):

Dna Sequences ◽

Large Scale ◽

Prediction Models ◽

Design Tool ◽

Biological Study ◽

Support Vector ◽

Neuron Network ◽

Sgrna Design ◽

Target Activity ◽

Target Effects

ABSTRACT The CRISPR/Cas9 system derived from bacteria, especially Streptococcus pyogenes (SpyCas9), is currently considered the most advanced tool for numerous areas of biological study in which it is useful to target or modify specific DNA sequences. However, low on-target cleavage efficiency and off-target effects impede its wide application. Several sgRNA design tools for SpyCas9 have been developed using various algorithms, including a linear regression model, a support vector machine (SVM) model and a convolutional neuron network model. However, the sgRNA features that contribute to both on-target activity and off-target effects remain to be determined. Here, with public large-scale CRISPR screen data, we evaluated the contribution of different features that influence sgRNA activity and off-target effects, and developed models for sgRNA off-target evaluation and on-target activity prediction. In addition, we combined the activity and off-target prediction models and packaged them as an online sgRNA design tool, OPT-sgRNA.

A machine learning approach to predict ethnicity using personal name and census location in Canada

PLoS ONE ◽

10.1371/journal.pone.0241239 ◽

2020 ◽

Vol 15 (11) ◽

pp. e0241239

Author(s):

Kai On Wong ◽

Osmar R. Zaïane ◽

Faith G. Davis ◽

Yutaka Yasui

Keyword(s):

Machine Learning ◽

First Nations ◽

Predictive Value ◽

Large Scale ◽

Performance Metrics ◽

Characteristic Curve ◽

Machine Learning Algorithms ◽

Support Vector ◽

Learning Approach ◽

Machine Learning Approach

Background Canada is an ethnically diverse country, yet the lack of ethnicity information in many large databases impedes effective population research and interventions. Automated ethnicity classification using machine learning has shown potential to address this data gap, but its performance in Canada is largely unknown. This study conducted a large-scale machine learning framework to predict ethnicity using a novel set of name and census location features. Methods Using the 1901 census, multiclass and binary classification machine learning pipelines were developed. The 13 ethnic categories examined were Aboriginal (First Nations, Métis, Inuit, and all-combined), Chinese, English, French, Irish, Italian, Japanese, Russian, Scottish, and others. Machine learning algorithms included regularized logistic regression, C-support vector, and naïve Bayes classifiers. Name features consisted of the entire name string, substrings, double-metaphones, and various name-entity patterns, while location features consisted of the entire location string and substrings of province, district, and subdistrict. Predictive performance metrics included sensitivity, specificity, positive predictive value, negative predictive value, F1, Area Under the Curve for Receiver Operating Characteristic curve, and accuracy. Results The census had 4,812,958 unique individuals. For multiclass classification, the highest performance achieved was 76% F1 and 91% accuracy. For binary classifications for Chinese, French, Italian, Japanese, Russian, and others, the F1 ranged 68–95% (median 87%). The lower performance for English, Irish, and Scottish (F1 ranged 63–67%) was likely due to their shared cultural and linguistic heritage. Adding census location features to the name-based models strongly improved the prediction in Aboriginal classification (F1 increased from 50% to 84%).
Conclusions The automated machine learning approach using only name and census location features can predict the ethnicity of Canadians with varying performance by specific ethnic categories.
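
Most of the performance metrics listed in this abstract follow directly from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the counts are made up for illustration and are not taken from the study, and the function name `binary_metrics` is hypothetical. The area under the ROC curve is omitted because it needs ranked scores rather than counts.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Standard metrics from true/false positive and negative counts."""
    sensitivity = tp / (tp + fn)    # recall, true positive rate
    specificity = tn / (tn + fp)    # true negative rate
    ppv = tp / (tp + fp)            # positive predictive value (precision)
    npv = tn / (tn + fn)            # negative predictive value
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
        "accuracy": accuracy,
    }


# Made-up counts for a rare positive class (100 positives in 1,000 records).
for name, value in binary_metrics(tp=80, fp=20, fn=20, tn=880).items():
    print(f"{name}: {value:.3f}")
```

With a rare positive class, accuracy can sit well above F1, as it does for these illustrative counts; this is one reason abstracts such as this one report both.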

Predicting Design Performance Utilizing Automated Topic Discovery

Journal of Mechanical Design ◽

10.1115/1.4048455 ◽

2020 ◽

Vol 142 (12) ◽

Author(s):

Zachary Ball ◽

Kemper Lewis

Keyword(s):

Open Innovation ◽

Performance Metrics ◽

Prediction Models ◽

Support Vector ◽

Sociotechnical System ◽

Topic Identification ◽

Design Performance ◽

Mass Collaboration ◽

Technical Competency ◽

And Performance

Abstract Increasingly complex engineering design challenges require the diversification of knowledge on design teams. In the context of open innovation, positioning key members within these teams or groups based on their estimated abilities leads to more impactful results since mass collaboration is fundamentally a sociotechnical system. Determining how each individual influences the overall design process requires an understanding of the predicted mapping between their technical competency and performance. This work explores this relationship through the use of predictive models composed of various algorithms. With support of a dataset composed of documents related to the design performance of students working on their capstone design project in combination with textual descriptors representing individual technical aptitudes, correlations are explored as a method to predict overall project development performance. Each technical competency and project is represented as a distribution of topic knowledge to produce the performance metrics, which are referred to as topic competencies, since topic representations increase the ability to decompose and identify human-centric performance measures. Three methods of topic identification and five prediction models are compared based on their prediction accuracy. From this analysis, it is found that representing input variables as topic distributions and the resulting performance as a single indicator while using support vector regression provided the most accurate mapping between ability and performance. With these findings, complex open innovation projects will benefit from increased knowledge of individual ability and how that correlates to predicted performance.

A Systematic Methodology to Evaluate Prediction Models for Driving Style Classification

Sensors ◽

10.3390/s20061692 ◽

2020 ◽

Vol 20 (6) ◽

pp. 1692 ◽

Cited By ~ 6

Author(s):

Iván Silva ◽

José Eugenio Naranjo

Keyword(s):

Machine Learning ◽

Nearest Neighbor ◽

Performance Metrics ◽

Prediction Models ◽

Statistical Tests ◽

Area Under The Curve ◽

The Other ◽

Support Vector ◽

Classification Models ◽

K Nearest Neighbor

Identifying driving styles using classification models with in-vehicle data can provide automated feedback to drivers on their driving behavior, particularly if they are driving safely. Although several classification models have been developed for this purpose, there is no consensus on which classifier performs better at identifying driving styles. Therefore, more research is needed to evaluate classification models by comparing performance metrics. In this paper, a data-driven machine-learning methodology for classifying driving styles is introduced. This methodology is grounded in well-established machine-learning (ML) methods and literature related to driving-styles research. The methodology is illustrated through a study involving data collected from 50 drivers from two different cities in a naturalistic setting. Five features were extracted from the raw data. Fifteen experts were involved in the data labeling to derive the ground truth of the dataset. The dataset fed five different models (Support Vector Machines (SVM), Artificial Neural Networks (ANN), fuzzy logic, k-Nearest Neighbor (kNN), and Random Forests (RF)). These models were evaluated in terms of a set of performance metrics and statistical tests. The experimental results from performance metrics showed that SVM outperformed the other four models, achieving an average accuracy of 0.96, F1-Score of 0.9595, Area Under the Curve (AUC) of 0.9730, and Kappa of 0.9375. In addition, Wilcoxon tests indicated that ANN predicts differently to the other four models. These promising results demonstrate that the proposed methodology may support researchers in making informed decisions about which ML model performs better for driving-styles classification.
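
Of the metrics reported in this abstract, Kappa is the least self-explanatory: Cohen's kappa rescales observed agreement by the agreement expected from the row and column totals alone. The sketch below computes it from a confusion matrix using the standard formula; the matrix values are invented and do not come from the study, and the function name `cohen_kappa` is hypothetical.

```python
from typing import List


def cohen_kappa(matrix: List[List[int]]) -> float:
    """Cohen's kappa for a square confusion matrix (rows: truth, columns: prediction)."""
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    # Observed agreement: share of samples on the diagonal.
    observed = sum(matrix[i][i] for i in range(k)) / n
    # Chance agreement: product of row and column marginals, summed over classes.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(k)
    ) / (n * n)
    return (observed - expected) / (1 - expected)


# Made-up two-class example: 90% accuracy with balanced classes.
print(cohen_kappa([[45, 5], [5, 45]]))
```

For this balanced toy matrix chance agreement is 0.5, so an accuracy of 0.9 corresponds to a kappa of about 0.8.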

Using machine learning techniques to develop prediction models for detecting unpaid credit card customers

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-189080 ◽

2020 ◽

Vol 39 (5) ◽

pp. 6073-6087

Author(s):

Meltem Yontar ◽

Özge Hüsniye Namli ◽

Seda Yanik

Keyword(s):

Decision Tree ◽

Credit Card ◽

Banking Sector ◽

Performance Metrics ◽

Prediction Models ◽

Regression Tree ◽

Classification And Regression Tree ◽

Machine Learning Techniques ◽

Support Vector ◽

Cart Algorithm

Customer behavior prediction has recently been gaining importance in the banking sector, as in other sectors. This study aims to propose a model to predict whether credit card users will pay their debts or not. Using the proposed model, potential unpaid risks can be predicted and necessary actions can be taken in time. To predict customers’ payment status for the next month, we use Artificial Neural Network (ANN), Support Vector Machine (SVM), Classification and Regression Tree (CART) and C4.5, which are widely used artificial intelligence and decision tree algorithms. Our dataset includes 10,713 customer records obtained from a well-known bank in Taiwan. These records consist of customer information such as the amount of credit, gender, education level, marital status, age, past payment records, invoice amount and amount of credit card payments. We apply cross-validation and hold-out methods to divide our dataset into training and test sets. Then we evaluate the algorithms with the proposed performance metrics. We also optimize the parameters of the algorithms to improve prediction performance. The results show that the model built with the CART algorithm, one of the decision tree algorithms, provides high accuracy (about 86%) in predicting customers’ payment status for the next month. When the algorithm parameters are optimized, classification accuracy and performance increase.
