How Competitive Is Genetic Programming in Business Data Science Applications?

Author(s):  
Arthur Kordon ◽  
Theresa Kotanchek ◽  
Mark Kotanchek
2017 ◽  
Vol 112 ◽  
pp. 1881-1890 ◽  
Author(s):  
Shastri L Nimmagadda ◽  
Torsten Reiners ◽  
Amit Rudra
Keyword(s):  
Big Data

2016 ◽  
Vol 36 (4) ◽  
pp. 607-617 ◽  
Author(s):  
Russell Newman ◽  
Victor Chang ◽  
Robert John Walters ◽  
Gary Brian Wills

Author(s):  
Herath Mudiyanselage Viraj Vidura Herath ◽  
Jayashree Chadalawada ◽  
Vladan Babovic

Abstract
Genetic programming (GP) is a widely used machine learning (ML) algorithm that has been applied in water resources science and engineering since its inception in the early 1990s. However, as in many other ML applications, the GP algorithm is often used as a data-fitting tool rather than as a model-building instrument, which we consider a gross underutilization of its capabilities. The feature that most clearly distinguishes GP from other ML techniques is its ability to produce explicit mathematical relationships between input and output variables. This ability aligns well with theory-guided data science (TGDS), which recently emerged as a new paradigm in ML with the main goal of blending the existing body of knowledge with ML techniques to induce physically sound models. TGDS has since become a popular data science paradigm, especially in scientific disciplines such as water resources. Following these ideas, in our prior work we developed two hydrologically informed rainfall-runoff model induction toolkits, for lumped and distributed modelling, based on GP. In the current work, the two toolkits are applied using a different hydrological model building library: the model building blocks are derived from the Sugawara TANK model template, which represents the elements of hydrological knowledge. Results are compared against the traditional GP approach and suggest that GP as a rainfall-runoff model induction toolkit preserves the predictive power of the traditional GP short-term forecasting approach, while the readily interpretable induced models help to better understand catchment runoff dynamics.
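The capability emphasized above, evolving explicit symbolic expressions rather than opaque fits, can be illustrated with a minimal symbolic-regression sketch in plain Python. This is a generic GP loop with mutation and truncation selection; all names (`random_expr`, `evolve`, the function set in `OPS`) are illustrative and have no connection to the authors' hydrological toolkits.

```python
import operator
import random

random.seed(0)

# Function set: binary operators the evolved expression trees may use.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def random_expr(depth=2):
    """Grow a random expression tree over one input variable x."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(["x", 1, 2, 3])  # terminal: variable or constant
    op = random.choice(list(OPS))
    return (op, random_expr(depth - 1), random_expr(depth - 1))

def evaluate(expr, x):
    """Recursively evaluate an expression tree at a given x."""
    if expr == "x":
        return x
    if isinstance(expr, int):
        return expr
    op, left, right = expr
    return OPS[op](evaluate(left, x), evaluate(right, x))

def fitness(expr, data):
    """Sum of squared errors over (x, y) pairs; lower is better."""
    return sum((evaluate(expr, x) - y) ** 2 for x, y in data)

def mutate(expr):
    """Replace a randomly chosen subtree with a fresh random one."""
    if not isinstance(expr, tuple) or random.random() < 0.3:
        return random_expr()
    op, left, right = expr
    if random.random() < 0.5:
        return (op, mutate(left), right)
    return (op, left, mutate(right))

def evolve(data, pop_size=60, generations=30):
    """Truncation-selection GP: keep the best half, mutate to refill."""
    population = [random_expr() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda e: fitness(e, data))
        survivors = population[: pop_size // 2]
        population = survivors + [mutate(random.choice(survivors))
                                  for _ in survivors]
    return min(population, key=lambda e: fitness(e, data))
```

Because the result is a tree of named operators and terminals, the best individual can be read off directly as a formula, which is the interpretability advantage the abstract refers to.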


2021 ◽  
Vol 62 (2) ◽  
pp. 87-98
Author(s):  
Tobias Knuth

Data have become ubiquitous in the 21st century, and digitalisation produces ever larger volumes of it. Companies can use business data to make better-informed decisions faster, and can achieve a competitive advantage if they manage to utilise those data successfully; a data strategy can help a company develop from a digitised into a data-driven organisation after the digital transformation. Skilled handling of data, so-called data literacy, is considered one of the fundamental competencies of the modern knowledge society. The data strategy presented in this article rests on three pillars: data literacy as a decisive competence of all employees, data science as a specialisation for complex problems, and the chief data officer as a C-level executive responsible for coordinating and establishing data-driven processes. The successful implementation of a data strategy can create a measurable competitive advantage.


2018 ◽  
Author(s):  
Trang T Le ◽  
Weixuan Fu ◽  
Jason H Moore

Automated machine learning (AutoML) systems are helpful data science assistants designed to scan data for novel features, select appropriate supervised learning models and optimize their parameters. For this purpose, the Tree-based Pipeline Optimization Tool (TPOT) was developed using strongly typed genetic programming to recommend an optimized analysis pipeline for the data scientist's prediction problem. However, TPOT may reach computational resource limits when working on big data such as whole-genome expression data. We introduce two new features implemented in TPOT that help increase the system's scalability: Dataset selector and Template. Dataset selector (DS) provides the option to specify subsets of the features as separate datasets, assuming the signals come from one or more of these specific data subsets. Built in at the beginning of each pipeline structure, DS reduces TPOT's computational expense by evaluating only a smaller subset of the data rather than the entire dataset. Consequently, DS increases TPOT's efficiency on big data by slicing the dataset into smaller sets of features and allowing genetic programming to select the best subset for the final pipeline. Template enforces type constraints with strongly typed genetic programming and enables the incorporation of DS at the beginning of each pipeline. We show that DS and Template help reduce TPOT computation time and may provide more interpretable results. Our simulations show TPOT-DS significantly outperforms a tuned XGBoost model and the standard TPOT implementation. We apply TPOT-DS to real RNA-Seq data from a study of major depressive disorder. Independently of the previous study, which identified a significant association between the enrichment scores of two modules and depression severity, TPOT-DS corroborates in an automated fashion that one of the modules is largely predictive of each individual's clinical diagnosis.
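The core Dataset selector idea, partitioning features into named subsets and keeping whichever subset scores best, can be sketched in plain Python. This is not TPOT's actual API; the names below (`dataset_selector`, `neg_sse_of_row_means`) and the toy scoring function are hypothetical illustrations of the concept only.

```python
def dataset_selector(X, y, subsets, score):
    """Evaluate each named feature subset with `score` and return the best.

    X       : list of feature rows (lists of floats)
    y       : list of targets
    subsets : dict mapping subset name -> list of column indices
    score   : callable(sub_X, y) -> float, higher is better
    """
    best_name, best_score = None, float("-inf")
    for name, cols in subsets.items():
        # Slice out only the columns belonging to this subset.
        sub_X = [[row[c] for c in cols] for row in X]
        s = score(sub_X, y)
        if s > best_score:
            best_name, best_score = name, s
    return best_name, best_score

def neg_sse_of_row_means(sub_X, y):
    """Toy score: how well the mean of the selected features tracks y."""
    preds = [sum(row) / len(row) for row in sub_X]
    return -sum((p - t) ** 2 for p, t in zip(preds, y))
```

In TPOT-DS this selection step sits at the head of each evolved pipeline, so downstream transformers and classifiers only ever see one feature subset, which is what cuts the evaluation cost on wide data such as whole-genome expression matrices.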


Author(s):  
Charles Bouveyron ◽  
Gilles Celeux ◽  
T. Brendan Murphy ◽  
Adrian E. Raftery
