DATA SCIENCE AND KNOWLEDGE DISCOVERY THROUGH DATA MINING PARADIGMS

The Cross-Industry Standard Process for Data Mining (CRISP-DM) is a widely accepted framework in production and manufacturing. This data-driven knowledge discovery framework partitions the often complex data mining process into orderly phases to ensure a practical implementation of data analytics and machine learning models. However, the practical application of robust, industry-specific, data-driven knowledge discovery models faces multiple issues related to data and model development. These issues need to be carefully addressed by allowing a flexible, customized and industry-specific knowledge discovery framework. For this reason, extensions of CRISP-DM are needed. In this paper, we provide a detailed review of CRISP-DM and summarize extensions of this model into a novel framework we call Generalized Cross-Industry Standard Process for Data Science (GCRISP-DS). This framework is designed to allow dynamic interactions between different phases to adequately address data- and model-related issues for achieving robustness. Furthermore, it also emphasizes the need for a detailed business understanding and its interdependencies with the developed models and data quality for fulfilling higher business objectives. Overall, such a customizable GCRISP-DS framework supports model improvement and reusability by minimizing robustness issues.
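The idea of phases with constrained versus more dynamic interactions can be sketched in a few lines of Python. This is only an illustration: the six phase names are those of CRISP-DM, the `CRISP_DM` transition map follows the arrows commonly drawn in the CRISP-DM reference diagram, and `with_feedback` is a hypothetical relaxation (allowing a return to any earlier phase) that is assumed here for demonstration and is not the GCRISP-DS specification from the paper.

```python
from enum import Enum


class Phase(Enum):
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


# Transitions as commonly drawn in the CRISP-DM reference diagram.
CRISP_DM = {
    Phase.BUSINESS_UNDERSTANDING: {Phase.DATA_UNDERSTANDING},
    Phase.DATA_UNDERSTANDING: {Phase.BUSINESS_UNDERSTANDING, Phase.DATA_PREPARATION},
    Phase.DATA_PREPARATION: {Phase.MODELING},
    Phase.MODELING: {Phase.DATA_PREPARATION, Phase.EVALUATION},
    Phase.EVALUATION: {Phase.BUSINESS_UNDERSTANDING, Phase.DEPLOYMENT},
    Phase.DEPLOYMENT: set(),
}


def with_feedback(transitions):
    """Hypothetical relaxation: also allow a return to any earlier phase."""
    relaxed = {}
    for phase, targets in transitions.items():
        earlier = {p for p in Phase if p.value < phase.value}
        relaxed[phase] = set(targets) | earlier
    return relaxed


def is_valid_path(path, transitions):
    """Check that every consecutive pair of phases is an allowed transition."""
    return all(b in transitions[a] for a, b in zip(path, path[1:]))


# Jumping from modeling straight back to data understanding is not an arrow
# in the basic map, but the relaxed (assumed) map permits it.
jump_back = [Phase.MODELING, Phase.DATA_UNDERSTANDING]
print(is_valid_path(jump_back, CRISP_DM))
print(is_valid_path(jump_back, with_feedback(CRISP_DM)))
```

Representing the process as an explicit transition map makes it easy to compare a stricter and a more flexible process definition on the same project history.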


CUSTOMER CHURN PREDICTION IN A NATURAL GAS DISTRIBUTION COMPANY USING DATA MINING

Universidad Ciencia y Tecnología ◽

10.47460/uct.v24i106.399 ◽

2020 ◽

Vol 24 (106) ◽

pp. 79-87

Author(s):

Fredy Humberto Troncoso Espinosa ◽

Javiera Valentina Ruiz Tapia

Keyword(s):

Machine Learning ◽

Data Mining ◽

Feature Selection ◽

Knowledge Discovery ◽

Data Science ◽

Machine Learning Techniques ◽

Support Vector ◽

Churn Prediction ◽

Customer Churn ◽

Using Data

Customer churn is a significant problem for service companies and can cause them substantial economic losses. Identifying the factors that lead a customer to stop using a service is a complex task; however, a customer's behavior makes it possible to estimate a churn probability for each one. This study applies data mining to predict customer churn in a natural gas distribution company, using two machine learning techniques: neural networks and support vector machines. The results show that these techniques make it possible to identify the customers most likely to churn, so that timely and targeted retention actions can be taken, minimizing the costs associated with misidentifying these customers. Keywords: customer churn, data mining, machine learning, natural gas distribution.
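To make the idea of minimizing misidentification costs concrete, the following minimal sketch picks the churn-probability cutoff that gives the lowest total cost on a labeled sample. It is not the authors' procedure: the probabilities, labels, and the two cost values (`cost_fp` for a needless retention action, `cost_fn` for a missed churner) are made-up illustrative numbers, and in practice the probabilities would come from a fitted model such as a neural network or a support vector machine.

```python
def total_cost(probs, labels, threshold, cost_fp, cost_fn):
    """Sum misclassification costs when flagging customers with p >= threshold."""
    cost = 0.0
    for p, y in zip(probs, labels):
        flagged = p >= threshold
        if flagged and y == 0:
            cost += cost_fp      # retention action spent on a loyal customer
        elif not flagged and y == 1:
            cost += cost_fn      # churner that was not contacted
    return cost


def best_threshold(probs, labels, cost_fp, cost_fn):
    """Try each observed probability (plus 'flag nobody') as the cutoff."""
    candidates = sorted(set(probs)) + [float("inf")]
    return min(
        candidates,
        key=lambda t: total_cost(probs, labels, t, cost_fp, cost_fn),
    )


# Hypothetical churn probabilities and true outcomes (1 = churned).
probs = [0.9, 0.8, 0.6, 0.4, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0]

# Assumed costs: missing a churner is five times worse than a needless offer.
cutoff = best_threshold(probs, labels, cost_fp=1, cost_fn=5)
print(cutoff, total_cost(probs, labels, cutoff, cost_fp=1, cost_fn=5))  # → 0.4 1.0
```

With asymmetric costs like these, the chosen cutoff falls below 0.5, which reflects that contacting a few loyal customers is cheaper than losing one churner.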


An Interview with Dr. Michael Zeller, Winner of ACM SIGKDD 2020 Service Award

ACM SIGKDD Explorations Newsletter ◽

10.1145/3447556.3447561 ◽

2021 ◽

Vol 22 (2) ◽

pp. 6-7

Author(s):

Michael Zeller

Keyword(s):

Artificial Intelligence ◽

Machine Learning ◽

Data Mining ◽

Knowledge Discovery ◽

Data Science ◽

Professional Services ◽

Investment Company ◽

Service Award ◽

Virtual Conference ◽

Learning Data

Michael Zeller, Ph.D. is the recipient of the 2020 ACM SIGKDD Service Award, the highest service award in the field of knowledge discovery and data mining. The award is conferred annually on one individual or group in recognition of outstanding professional services and contributions to the field. Dr. Zeller was honored for his years of service and many accomplishments as the secretary and treasurer of ACM SIGKDD, the organizing body of the annual KDD conference. Zeller is also head of AI strategy and solutions at Temasek, a global investment company seeking to make a difference always with tomorrow in mind. He sat down with SIGKDD Explorations to discuss how he first got involved in the KDD conference in 1999, what he learned from the first-ever virtual conference, his work at Temasek, and what excites him about the future of machine learning, data science and artificial intelligence.


An Interview with Dr. Shipeng Yu, Winner of ACM SIGKDD 2021 Service Award

ACM SIGKDD Explorations Newsletter ◽

10.1145/3510374.3510376 ◽

2021 ◽

Vol 23 (2) ◽

pp. 1-2

Author(s):

Shipeng Yu

Keyword(s):

Artificial Intelligence ◽

Machine Learning ◽

Data Mining ◽

Knowledge Discovery ◽

Data Science ◽

Professional Services ◽

Professional Network ◽

Service Award ◽

Years Of Service ◽

Learning Data

Shipeng Yu, Ph.D. is the recipient of the 2021 ACM SIGKDD Service Award, the highest service award in the field of knowledge discovery and data mining. The award is conferred annually on one individual or group in recognition of outstanding professional services and contributions to the field. Dr. Yu was honored for his years of service and many accomplishments as general chair of KDD 2017 and currently as sponsorship director for SIGKDD. Dr. Yu is Director of AI Engineering and Head of the Growth AI team at LinkedIn, the world's largest professional network. He sat down with SIGKDD Explorations to discuss how he first got involved in the KDD conference in 2006, the benefits and drawbacks of virtual conferences, his work at LinkedIn, and KDD's place in the field of machine learning, data science and artificial intelligence.


A Comprehensive Survey of Dynamic Data Mining Process in Knowledge Discovery from Database

International Journal of Computer Sciences and Engineering ◽

10.26438/ijcse/v6i12.504509 ◽

2018 ◽

Vol 6 (12) ◽

pp. 504-509

Author(s):

D. Ramana Kumar ◽

S. Krishna Mohan Rao

Keyword(s):

Data Mining ◽

Knowledge Discovery ◽

Dynamic Data ◽

Comprehensive Survey


Data Mining and Statistics in Data Science

Social Sciences Studies Journal ◽

10.26449/sssj.1295 ◽

2019 ◽

Vol 5 (30) ◽

pp. 960-968

Author(s):

Güner Gözde KILIÇ

Keyword(s):

Data Mining ◽

Data Science


Analysis of Bank Customer Credit Payment Data Using Data Mining Methods

Jurnal ULTIMA InfoSys ◽

10.31937/si.v4i1.238 ◽

2013 ◽

Vol 4 (1) ◽

pp. 18-27

Author(s):

Ira Melissa ◽

Raymond S. Oetama

Keyword(s):

Data Mining ◽

Knowledge Discovery ◽

Knowledge Discovery In Database

Data mining is the analysis or observation of large data sets with the aim of finding unexpected relationships and summarizing the data in ways that are easier to understand and useful to the data owner. Data mining is the core process in Knowledge Discovery in Database (KDD). Here, data mining methods are used to analyze the credit payment data of borrowers. The resulting borrower payment patterns show which credit parameters are interrelated and which most strongly influence credit installment payments. Keywords: data mining, outlier, multicollinearity, ANOVA


The AI Delusion

10.1093/oso/9780198824305.001.0001 ◽

2018 ◽

Cited By ~ 5

Author(s):

Gary Smith

Keyword(s):

Data Mining ◽

Knowledge Discovery ◽

Industrial Revolution ◽

The Real ◽

Intelligent Machines ◽

Black Boxes ◽

Real Danger ◽

The Way

We live in an incredible period in history. The Computer Revolution may be even more life-changing than the Industrial Revolution. We can do things with computers that could never be done before, and computers can do things for us that could never be done before. But our love of computers should not cloud our thinking about their limitations. We are told that computers are smarter than humans and that data mining can identify previously unknown truths, or make discoveries that will revolutionize our lives. Our lives may well be changed, but not necessarily for the better. Computers are very good at discovering patterns, but are useless in judging whether the unearthed patterns are sensible because computers do not think the way humans think. We fear that super-intelligent machines will decide to protect themselves by enslaving or eliminating humans. But the real danger is not that computers are smarter than us, but that we think computers are smarter than us and, so, trust computers to make important decisions for us. The AI Delusion explains why we should not be intimidated into thinking that computers are infallible, that data-mining is knowledge discovery, and that black boxes should be trusted.


Particularities of data mining in medicine: lessons learned from patient medical time series data analysis

EURASIP Journal on Wireless Communications and Networking ◽

10.1186/s13638-019-1582-2 ◽

2019 ◽

Vol 2019 (1) ◽

Cited By ~ 2

Author(s):

Shadi Aljawarneh ◽

Aurea Anguera ◽

John William Atwood ◽

Juan A. Lara ◽

David Lizcano

Keyword(s):

Data Mining ◽

Time Series ◽

Knowledge Discovery ◽

Time Series Data ◽

Medical Patient ◽

Lessons Learned ◽

Physiological Signals ◽

Knowledge Discovery In Databases ◽

Series Data ◽

Data Mining Techniques

Nowadays, large amounts of data are generated in the medical domain. Various physiological signals generated from different organs can be recorded to extract interesting information about patients’ health. The analysis of physiological signals is a hard task that requires the use of specific approaches such as the Knowledge Discovery in Databases process. The application of this process in the domain of medicine has a series of implications and difficulties, especially regarding the application of data mining techniques to data, mainly time series, gathered from medical examinations of patients. The goal of this paper is to describe the lessons learned and the experience gathered by the authors in applying data mining techniques to real medical patient data including time series. In this research, we carried out an exhaustive case study working on data from two medical fields: stabilometry (15 professional basketball players, 18 elite ice skaters) and electroencephalography (100 healthy patients, 100 epileptic patients). We applied a previously proposed knowledge discovery framework for classification purposes, obtaining good results in terms of classification accuracy (greater than 99% in both fields). The good results obtained in our research are the groundwork for the lessons learned and recommendations made in this position paper, which is intended as a guide for experts who face similar medical data mining projects.
