How Competitive Is Genetic Programming in Business Data Science Applications?

Author(s):  
Arthur Kordon ◽  
Theresa Kotanchek ◽  
Mark Kotanchek
2017 ◽  
Vol 112 ◽  
pp. 1881-1890 ◽  
Author(s):  
Shastri L Nimmagadda ◽  
Torsten Reiners ◽  
Amit Rudra
Keyword(s):  
Big Data

2016 ◽  
Vol 36 (4) ◽  
pp. 607-617 ◽  
Author(s):  
Russell Newman ◽  
Victor Chang ◽  
Robert John Walters ◽  
Gary Brian Wills

Author(s):  
Herath Mudiyanselage Viraj Vidura Herath ◽  
Jayashree Chadalawada ◽  
Vladan Babovic

Abstract
Genetic programming (GP) is a widely used machine learning (ML) algorithm that has been applied in water resources science and engineering since its inception in the early 1990s. However, as in many other ML applications, the GP algorithm is often used as a data-fitting tool rather than as a model-building instrument, which we consider a gross underutilization of its capabilities. The feature that most clearly distinguishes GP from other ML techniques is its ability to produce explicit mathematical relationships between input and output variables. This ability aligns well with theory-guided data science (TGDS), which recently emerged as a new paradigm in ML with the main goal of blending the existing body of knowledge with ML techniques to induce physically sound models. TGDS has since become a popular data science paradigm, especially in scientific disciplines such as water resources. Following these ideas, in our prior work we developed two hydrologically informed rainfall-runoff model induction toolkits, for lumped and distributed modelling, based on GP. In the current work, the two toolkits are applied using a different hydrological model building library: the model building blocks are derived from the Sugawara TANK model template, which represents the elements of hydrological knowledge. Results are compared against the traditional GP approach and suggest that GP as a rainfall-runoff model induction toolkit preserves the predictive power of the traditional GP short-term forecasting approach, while the readily interpretable induced models help to better understand catchment runoff dynamics.
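The capability emphasized above, evolving explicit symbolic expressions rather than opaque fits, can be illustrated with a minimal symbolic-regression sketch in plain Python. This is a generic GP loop with mutation and truncation selection; all names (`random_expr`, `evolve`, the function set in `OPS`) are illustrative and have no connection to the authors' hydrological toolkits.

```python
import operator
import random

random.seed(0)

# Function set: binary operators the evolved expression trees may use.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def random_expr(depth=2):
    """Grow a random expression tree over one input variable x."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(["x", 1, 2, 3])  # terminal: variable or constant
    op = random.choice(list(OPS))
    return (op, random_expr(depth - 1), random_expr(depth - 1))

def evaluate(expr, x):
    """Recursively evaluate an expression tree at a given x."""
    if expr == "x":
        return x
    if isinstance(expr, int):
        return expr
    op, left, right = expr
    return OPS[op](evaluate(left, x), evaluate(right, x))

def fitness(expr, data):
    """Sum of squared errors over (x, y) pairs; lower is better."""
    return sum((evaluate(expr, x) - y) ** 2 for x, y in data)

def mutate(expr):
    """Replace a randomly chosen subtree with a fresh random one."""
    if not isinstance(expr, tuple) or random.random() < 0.3:
        return random_expr()
    op, left, right = expr
    if random.random() < 0.5:
        return (op, mutate(left), right)
    return (op, left, mutate(right))

def evolve(data, pop_size=60, generations=30):
    """Truncation-selection GP: keep the best half, mutate to refill."""
    population = [random_expr() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda e: fitness(e, data))
        survivors = population[: pop_size // 2]
        population = survivors + [mutate(random.choice(survivors))
                                  for _ in survivors]
    return min(population, key=lambda e: fitness(e, data))
```

Because the result is a tree of named operators and terminals, the best individual can be read off directly as a formula, which is the interpretability advantage the abstract refers to.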


2021 ◽  
Vol 62 (2) ◽  
pp. 87-98
Author(s):  
Tobias Knuth

Data have become ubiquitous in the 21st century, and digitalisation produces ever larger volumes of it. Companies can use business data to make better-informed decisions faster, and can achieve a competitive advantage if they manage to utilise those data successfully; a data strategy can help a company develop from a digitised into a data-driven organisation after the digital transformation. Skilled handling of data, so-called data literacy, is considered one of the fundamental competencies of the modern knowledge society. The data strategy presented in this article rests on three pillars: data literacy as a decisive competence of all employees, data science as a specialisation for complex problems, and the chief data officer as a C-level executive responsible for coordinating and establishing data-driven processes. The successful implementation of a data strategy can create a measurable competitive advantage.


2018 ◽  
Author(s):  
Trang T Le ◽  
Weixuan Fu ◽  
Jason H Moore

Automated machine learning (AutoML) systems are helpful data science assistants designed to scan data for novel features, select appropriate supervised learning models and optimize their parameters. For this purpose, the Tree-based Pipeline Optimization Tool (TPOT) was developed using strongly typed genetic programming to recommend an optimized analysis pipeline for the data scientist's prediction problem. However, TPOT may reach computational resource limits when working on big data such as whole-genome expression data. We introduce two new features implemented in TPOT that help increase the system's scalability: Dataset selector and Template. Dataset selector (DS) provides the option to specify subsets of the features as separate datasets, assuming the signals come from one or more of these specific data subsets. Built in at the beginning of each pipeline structure, DS reduces TPOT's computational expense by evaluating only a smaller subset of the data rather than the entire dataset. Consequently, DS increases TPOT's efficiency on big data by slicing the dataset into smaller sets of features and allowing genetic programming to select the best subset for the final pipeline. Template enforces type constraints with strongly typed genetic programming and enables the incorporation of DS at the beginning of each pipeline. We show that DS and Template help reduce TPOT computation time and may provide more interpretable results. Our simulations show TPOT-DS significantly outperforms a tuned XGBoost model and the standard TPOT implementation. We apply TPOT-DS to real RNA-Seq data from a study of major depressive disorder. Independently of the previous study, which identified a significant association between the enrichment scores of two modules and depression severity, TPOT-DS corroborates in an automated fashion that one of the modules is largely predictive of each individual's clinical diagnosis.
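The core Dataset selector idea, partitioning features into named subsets and keeping whichever subset scores best, can be sketched in plain Python. This is not TPOT's actual API; the names below (`dataset_selector`, `neg_sse_of_row_means`) and the toy scoring function are hypothetical illustrations of the concept only.

```python
def dataset_selector(X, y, subsets, score):
    """Evaluate each named feature subset with `score` and return the best.

    X       : list of feature rows (lists of floats)
    y       : list of targets
    subsets : dict mapping subset name -> list of column indices
    score   : callable(sub_X, y) -> float, higher is better
    """
    best_name, best_score = None, float("-inf")
    for name, cols in subsets.items():
        # Slice out only the columns belonging to this subset.
        sub_X = [[row[c] for c in cols] for row in X]
        s = score(sub_X, y)
        if s > best_score:
            best_name, best_score = name, s
    return best_name, best_score

def neg_sse_of_row_means(sub_X, y):
    """Toy score: how well the mean of the selected features tracks y."""
    preds = [sum(row) / len(row) for row in sub_X]
    return -sum((p - t) ** 2 for p, t in zip(preds, y))
```

In TPOT-DS this selection step sits at the head of each evolved pipeline, so downstream transformers and classifiers only ever see one feature subset, which is what cuts the evaluation cost on wide data such as whole-genome expression matrices.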


Author(s):  
Charles Bouveyron ◽  
Gilles Celeux ◽  
T. Brendan Murphy ◽  
Adrian E. Raftery
