Tensors

2021 ◽  
Vol 14 (10) ◽  
pp. 1797-1804
Author(s):  
Dimitrios Koutsoukos ◽  
Supun Nakandala ◽  
Konstantinos Karanasos ◽  
Karla Saur ◽  
Gustavo Alonso ◽  
...  

Deep Learning (DL) has created a growing demand for simpler ways to develop complex models and efficient ways to execute them. Thus, a significant effort has gone into frameworks like PyTorch or TensorFlow to support a variety of DL models and to run them efficiently and seamlessly over heterogeneous and distributed hardware. Since these frameworks will continue improving given the predominance of DL workloads, it is natural to ask what else can be done with them. This is not a trivial question, since these frameworks are based on the efficient implementation of tensors, which are well adapted to DL but, in principle, to nothing else. In this paper we explore to what extent Tensor Computation Runtimes (TCRs) can support non-ML data processing applications, so that other use cases can take advantage of the investments made in TCRs. In particular, we are interested in graph processing and relational operators, two use cases that are very different from ML, are in high demand, and complement well what TCRs can do today. Building on HUMMINGBIRD, a recent platform that converts traditional machine learning algorithms to tensor computations, we explore how to map selected graph processing and relational operator algorithms into tensor computations. Our vision is supported by the results: our code often outperforms custom-built C++ and CUDA kernels, while massively reducing the development effort and taking advantage of the cross-platform compilation capabilities of TCRs.
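
To make the mapping concrete, the sketch below is an illustrative PyTorch example (not the authors' HUMMINGBIRD-based implementation): one breadth-first-search frontier expansion expressed as a sparse matrix-vector product, and a relational selection expressed as a boolean mask over a column tensor.

```python
import torch

# Toy directed graph with edges 0->1, 0->2, 1->3, 2->3, stored as the
# transposed adjacency (rows = destinations) so one BFS step is A_T @ frontier.
indices = torch.tensor([[1, 2, 3, 3],   # destination nodes
                        [0, 0, 1, 2]])  # source nodes
adj_t = torch.sparse_coo_tensor(indices, torch.ones(4), (4, 4))

# Frontier starts at node 0; one expansion reaches nodes 1 and 2.
frontier = torch.tensor([[1.0], [0.0], [0.0], [0.0]])
next_frontier = (torch.sparse.mm(adj_t, frontier) > 0).float().squeeze(1)
print(next_frontier)  # tensor([0., 1., 1., 0.])

# A relational selection (WHERE price > 30) as a boolean mask over a column tensor.
price = torch.tensor([10.0, 42.0, 35.0, 7.0])
selected_rows = torch.nonzero(price > 30).squeeze(1)
print(selected_rows)  # tensor([1, 2])
```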

2015 ◽  
Author(s):  
Jeffrey A Thompson ◽  
Jie Tan ◽  
Casey S Greene

Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization and a simple log2 transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. We also provide a TDM package for the R programming language.


PeerJ ◽  
2016 ◽  
Vol 4 ◽  
pp. e1621 ◽  
Author(s):  
Jeffrey A. Thompson ◽  
Jie Tan ◽  
Casey S. Greene

Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization, nonparanormal transformation, and a simple log2 transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. We also provide a TDM package for the R programming language.
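
As a rough illustration of the kind of distribution-matching transform being compared, the sketch below implements a toy quantile-normalization step in Python/NumPy, mapping log2-transformed RNA-seq values onto a reference microarray distribution. It is not the TDM algorithm itself, which the authors distribute as an R package; the data values are placeholders.

```python
import numpy as np

def quantile_map_to_reference(target, reference):
    """Replace each value in `target` with the corresponding quantile of `reference`.

    A toy, per-sample version of the quantile-normalization baseline compared
    in the paper; the TDM transform itself is distributed as an R package.
    """
    ranks = np.argsort(np.argsort(target))              # 0..n-1 rank of each value
    quantiles = ranks / (len(target) - 1)                # relative rank in [0, 1]
    ref_sorted = np.sort(reference)
    ref_grid = np.linspace(0, 1, len(ref_sorted))
    return np.interp(quantiles, ref_grid, ref_sorted)    # reference value at that quantile

# Log2-transformed RNA-seq values mapped onto a microarray reference distribution.
rnaseq_log2 = np.log2(np.array([3.0, 250.0, 40.0, 1200.0, 18.0]) + 1)
microarray_reference = np.array([4.1, 5.0, 6.3, 7.8, 9.2])
print(quantile_map_to_reference(rnaseq_log2, microarray_reference))
```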


2021 ◽  
Author(s):  
Rachel Soon-Yong Kim ◽  
Steve Simon ◽  
Brett Powers ◽  
Amneet Sandhu ◽  
Jose Sanchez ◽  
...  

BACKGROUND Identification of the appropriate rhythm management strategy for patients diagnosed with atrial fibrillation (AF) remains a major challenge for providers. While clinical trials have identified subgroups of patients in whom a rate- or rhythm-control strategy might be indicated to improve outcomes, the wide range of presentations and risk factors among patients presenting with AF makes such approaches challenging. A strength of electronic health records (EHR) is the ability to build in logic to guide management decisions, such that the system can automatically identify patients in whom a rhythm-control strategy is more likely and promote efficient referrals to specialists. However, as with any clinical decision-support tool, there is a trade-off between interpretability and predictive accuracy. OBJECTIVE In this investigation, we sought to create an EHR-based prediction tool to guide patient referral to specialists for rhythm-control management by comparing different machine learning algorithms. METHODS We compared machine learning models of increasing complexity, using up to 50,845 variables, to predict the rhythm-control strategy in 42,022 patients within the UC Health system at the time of AF diagnosis. Models were evaluated on their classification accuracy, defined by the F1 score and other metrics, and on their interpretability, captured by inspection of the relative importance of each predictor. RESULTS We found that age was by far the strongest single predictor of a rhythm-control strategy, but that greater accuracy could be achieved with more complex models incorporating neural networks and more predictors per subject. The impact of better prediction models was most notable in the rate of inappropriate referrals for rhythm control, where more complex models produced an average of 20% fewer inappropriate referrals than simpler, more interpretable models. CONCLUSIONS We conclude that any healthcare system seeking to incorporate algorithms to guide rhythm management for patients with AF will need to address this trade-off between prediction accuracy and model interpretability.
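
The accuracy-versus-interpretability comparison described above can be sketched roughly with scikit-learn on synthetic data: an interpretable logistic model whose coefficients rank the predictors, against a more complex gradient-boosted model, both scored with F1. The features and models below are placeholders, not the study's EHR variables.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for EHR features; the real study used up to 50,845 variables.
X, y = make_classification(n_samples=5000, n_features=30, n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Interpretable baseline: coefficients double as a ranking of predictor importance.
simple = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# More complex, typically more accurate but harder to inspect.
complex_model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

print("logistic F1:", f1_score(y_te, simple.predict(X_te)))
print("boosting F1:", f1_score(y_te, complex_model.predict(X_te)))
print("top predictors (|coef|):", np.argsort(-np.abs(simple.coef_[0]))[:5])
```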


Effort estimation is a crucial step that leads to duration and cost estimation in software development. Estimates made in the initial stage of a project are based on requirements and can determine the project's success or failure: accurate estimates lead to success, while inaccurate estimates lead to failure. There is no single method that produces accurate estimates in all cases. In this work, we apply the machine learning techniques linear regression and K-nearest neighbors to predict software effort using the COCOMO81, COCOMONasa, and COCOMONasa2 datasets, and compare the results obtained from the two methods. In each dataset, 80% of the data is used for training and the remaining 20% as the test set. The correlation coefficient, mean squared error (MSE), and mean magnitude of relative error (MMRE) are used as performance metrics. The experimental results show that these models forecast software effort accurately.
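
A minimal sketch of the described setup, with placeholder data standing in for the COCOMO81/NASA datasets: linear regression and K-nearest neighbors trained on an 80/20 split and scored with the correlation coefficient, MSE, and MMRE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

def mmre(actual, predicted):
    """Mean magnitude of relative error: mean(|actual - predicted| / actual)."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return np.mean(np.abs(actual - predicted) / actual)

# Placeholder cost drivers and effort column standing in for the COCOMO datasets.
rng = np.random.default_rng(0)
X = rng.uniform(0.5, 5.0, size=(60, 15))                    # cost drivers
y = 10 + 2.5 * X[:, 0] * X[:, 1] + rng.normal(0, 0.5, 60)   # effort (person-months)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for name, model in [("linear regression", LinearRegression()),
                    ("k-NN", KNeighborsRegressor(n_neighbors=3))]:
    pred = model.fit(X_tr, y_tr).predict(X_te)
    corr = np.corrcoef(y_te, pred)[0, 1]
    print(f"{name}: r={corr:.2f}  MSE={mean_squared_error(y_te, pred):.2f}  "
          f"MMRE={mmre(y_te, pred):.2f}")
```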


2011 ◽  
Vol 7 (3) ◽  
pp. 41-53 ◽  
Author(s):  
Jeremiah D. Deng ◽  
Martin Purvis ◽  
Maryam Purvis

Software development effort estimation is important for quality management in the software development industry, yet its automation still remains a challenging issue. Applying machine learning algorithms alone often cannot achieve satisfactory results. This paper presents an integrated data mining framework that incorporates domain knowledge into a series of data analysis and modeling processes, including visualization, feature selection, and model validation. An empirical study on the software effort estimation problem using a benchmark dataset shows the necessity and effectiveness of the proposed approach.
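
One way the described combination of domain knowledge, feature selection, and model validation might look in code is sketched below; the restriction to a domain-approved feature subset and the placeholder data are assumptions, not the paper's actual framework or benchmark dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Placeholder effort-estimation data; columns 0-3 stand for features a domain
# expert has marked as meaningful (e.g. size and complexity drivers).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 12))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.5, 100)

# Domain-knowledge step: restrict the candidate features before automatic selection.
domain_approved = [0, 1, 2, 3]
X_domain = X[:, domain_approved]

pipeline = Pipeline([
    ("select", SelectKBest(f_regression, k=2)),   # data-driven selection on top
    ("model", RandomForestRegressor(random_state=0)),
])
scores = cross_val_score(pipeline, X_domain, y, cv=5, scoring="r2")
print("cross-validated R^2:", scores.mean())
```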


2020 ◽  
Vol 10 (12) ◽  
pp. 4332
Author(s):  
Alessandro Farasin ◽  
Luca Colomba ◽  
Paolo Garza

Wildfire damage severity census is a crucial activity for estimating monetary losses and for planning a prompt restoration of the affected areas. It consists of assigning, after a wildfire, a numerical damage/severity level between 0 and 4 to each sub-area of the affected region. While burned area identification has been automated by means of machine learning algorithms, the wildfire damage severity census is usually still performed manually and requires a significant effort from domain experts through the analysis of imagery and, sometimes, on-site missions. In this paper, we propose a novel supervised learning approach for the automatic estimation of the damage/severity level of the hit areas after the wildfire extinction. Specifically, the proposed approach, leveraging the combination of a classification algorithm and a regression one, predicts the damage/severity level of the sub-areas of the area under analysis by processing a single post-fire satellite acquisition. Our approach has been validated on 21 wildfires across five European countries and proved robust in several geographical contexts with similar geological characteristics.
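
The two-stage combination of a classifier and a regressor could be sketched roughly as below; the random-forest models and synthetic per-sub-area features are stand-ins for the authors' actual method and satellite-band inputs.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Synthetic per-sub-area features standing in for post-fire satellite band statistics.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 6))
burned = (X[:, 0] + X[:, 1] > 0).astype(int)                  # 1 = burned sub-area
severity = np.clip(2 + X[:, 0] + rng.normal(0, 0.5, 500), 0, 4) * burned

X_train, X_new = X[:400], X[400:]

# Stage 1: classify burned vs. unburned sub-areas.
clf = RandomForestClassifier(random_state=0).fit(X_train, burned[:400])
# Stage 2: regress a continuous severity level on burned sub-areas only.
reg = RandomForestRegressor(random_state=0).fit(X_train[burned[:400] == 1],
                                                severity[:400][burned[:400] == 1])

is_burned = clf.predict(X_new)
pred_severity = np.where(is_burned == 1,
                         np.clip(np.rint(reg.predict(X_new)), 0, 4), 0)
print(pred_severity[:10])  # damage/severity levels 0-4 per sub-area
```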


Author(s):  
R. Lalitha ◽  
B. Latha ◽  
G. Sumathi

The success of an information system project depends on the accuracy of software estimation. Estimation is performed at the initial phase of software development and requires the collection of all relevant information needed to estimate the software effort. In this paper, a methodology is proposed to maintain a knowledgeable use case repository that stores the use cases of various projects across several software project domains. This repository acts as a reference model for comparing similar use cases of similar types of projects. The use case points are calculated and, from these, the schedule and effort estimates of a project are derived using standard software engineering formulas. These values are compared with the estimated effort and schedule of a new project under development. In addition, a neural network is used to measure how accurately the information is processed by the use case repository framework. The proposed machine learning-based use case repository system helps to estimate and analyze effort using machine learning algorithms.
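
For reference, a minimal sketch of the standard (Karner) use case point formulas that underlie this kind of estimation; the weights and the productivity factor of 20 person-hours per UCP are the conventional defaults and may differ from the paper's exact values.

```python
def use_case_points(uucw, uaw, tfactor, efactor):
    """Karner's Use Case Points.

    uucw:    unadjusted use case weight (summed over use cases)
    uaw:     unadjusted actor weight (summed over actors)
    tfactor: weighted sum of the 13 technical complexity factors
    efactor: weighted sum of the 8 environmental factors
    """
    uucp = uucw + uaw                 # unadjusted use case points
    tcf = 0.6 + 0.01 * tfactor        # technical complexity factor
    ecf = 1.4 - 0.03 * efactor        # environmental complexity factor
    return uucp * tcf * ecf

# Effort estimate: UCP times a productivity factor (commonly 20 person-hours/UCP).
ucp = use_case_points(uucw=180, uaw=12, tfactor=30, efactor=20)
effort_hours = ucp * 20
print(f"UCP = {ucp:.1f}, estimated effort = {effort_hours:.0f} person-hours")
```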


Hydrology ◽  
2021 ◽  
Vol 9 (1) ◽  
pp. 5
Author(s):  
Evangelos Rozos ◽  
Panayiotis Dimitriadis ◽  
Vasilis Bellos

Machine learning has been employed successfully as a tool in virtually every scientific and technological field. In hydrology, machine learning models first appeared as simple feed-forward networks used for short-term forecasting, and have evolved into complex models that can take into account even the static features of catchments, imitating hydrological experience. Recent studies have found machine learning models to be robust and efficient, frequently outperforming standard hydrological models (both conceptual and physically based). However, despite some recent efforts, the results of machine learning models require significant effort to interpret and to derive inferences from. Furthermore, all successful applications of machine learning in hydrology are based on networks of fairly complex topology that require significant computational power and CPU time to train. For these reasons, the value of standard hydrological models remains indisputable. In this study, we suggest employing machine learning models not as a substitute for hydrological models, but as an independent tool to assess their performance. We argue that this approach can help to unveil anomalies in catchment data that do not fit the employed hydrological model structure or configuration, and to deal with them without compromising the understanding of the underlying physical processes.
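
A rough sketch of the proposed use of machine learning as an independent benchmark: train a simple data-driven model on the same inputs as the hydrological model and flag periods where the hydrological model's error is much larger than the ML benchmark's. The synthetic series and the hydro_model_prediction array are placeholders, not the study's models or data.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic daily rainfall/temperature inputs and observed runoff (placeholders).
rng = np.random.default_rng(3)
rain = rng.gamma(2.0, 2.0, 1000)
temp = 10 + 8 * np.sin(np.linspace(0, 20, 1000))
runoff = 0.4 * rain + 0.05 * np.maximum(0, 15 - temp) + rng.normal(0, 0.3, 1000)

X = np.column_stack([rain, temp])
ml = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0).fit(X, runoff)

# Assumed output of a calibrated conceptual hydrological model for the same period.
hydro_model_prediction = runoff + rng.normal(0, 0.6, 1000)

ml_error = np.abs(runoff - ml.predict(X))
hydro_error = np.abs(runoff - hydro_model_prediction)

# Flag periods where the hydrological model does much worse than the ML benchmark:
# candidate anomalies in the data or in the model structure/configuration.
suspect = np.where(hydro_error > ml_error + 2 * ml_error.std())[0]
print("suspect time steps:", suspect[:10])
```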


2021 ◽  
Vol 4 ◽  
Author(s):  
Hugo Loureiro ◽  
Tim Becker ◽  
Anna Bauer-Mehren ◽  
Narges Ahmidi ◽  
Janick Weberpals

Introduction: Prognostic scores are important tools in oncology to facilitate clinical decision-making based on patient characteristics. To date, classic survival analysis using Cox proportional hazards regression has been employed in the development of these prognostic scores. With the advance of analytical models, this study aimed to determine whether more complex machine-learning algorithms could outperform classical survival analysis methods. Methods: In this benchmarking study, two datasets were used to develop and compare different prognostic models for overall survival in pan-cancer populations: a nationwide EHR-derived de-identified database for training and in-sample testing, and the OAK (phase III clinical trial) dataset for out-of-sample testing. The real-world database comprised 136K first-line-treated cancer patients across multiple cancer types and was split into 90% training and 10% testing sets. The OAK dataset comprised 1,187 patients diagnosed with non-small cell lung cancer. To assess the effect of the number of covariates on prognostic performance, we formed three feature sets with 27, 44, and 88 covariates. In terms of methods, we benchmarked ROPRO, a prognostic score based on the Cox model, against eight more complex machine-learning models, among them regularized Cox, Random Survival Forests (RSF), Gradient Boosting (GB), DeepSurv (DS), Autoencoder (AE), and Super Learner (SL). The C-index was used as the performance metric to compare the different models. Results: For in-sample testing on the real-world database, the resulting C-index [95% CI] values for RSF 0.720 [0.716, 0.725], GB 0.722 [0.718, 0.727], DS 0.721 [0.717, 0.726] and SL 0.723 [0.718, 0.728] showed significantly better performance compared to ROPRO 0.701 [0.696, 0.706]. Similar results were obtained across all feature sets. However, for the out-of-sample validation on OAK, the stronger performance of the more complex models was no longer apparent. Likewise, increasing the number of prognostic covariates did not increase model performance. Discussion: The stronger performance of the more complex models did not generalize when applied to an out-of-sample dataset. We hypothesize that future research may benefit from adding multimodal data to exploit the advantages of more complex models.
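
The common core of such a benchmark, fitting a Cox proportional hazards baseline and scoring it with the concordance index, can be sketched with lifelines as below; the toy data are placeholders (not the real-world or OAK datasets), and the more complex learners (RSF, gradient boosting, DeepSurv, etc.) would be evaluated on the same metric.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.utils import concordance_index

# Toy survival data standing in for the EHR-derived covariates.
rng = np.random.default_rng(4)
n = 500
df = pd.DataFrame({
    "age": rng.normal(65, 10, n),
    "ecog": rng.integers(0, 3, n),
    "albumin": rng.normal(4.0, 0.5, n),
})
risk = 0.03 * df["age"] + 0.5 * df["ecog"] - 0.8 * df["albumin"]
df["duration"] = rng.exponential((np.exp(-risk) * 12).to_numpy())
df["event"] = rng.integers(0, 2, n)

# Cox proportional hazards baseline (the model family behind ROPRO-style scores).
cph = CoxPHFitter().fit(df, duration_col="duration", event_col="event")

# C-index on the same data; a real benchmark would score held-out / out-of-sample data.
cindex = concordance_index(df["duration"], -cph.predict_partial_hazard(df), df["event"])
print(f"Cox C-index: {cindex:.3f}")
```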

