Data Mining and Knowledge Discovery

Author(s):  
Zude Zhou ◽  
Huaiqing Wang ◽  
Ping Lou

In Chapters 2 and 3, the knowledge-based system and the multi-agent system were illustrated. These are significant methods and theories of Manufacturing Intelligence (MI). Data Mining (DM) and Knowledge Discovery (KD) are at the foundation of MI. Humans are immersed in data, but are thirsty for knowledge. With the wider application of database technology, a dilemma has arisen whereby people are ‘rich in data, poor in knowledge’. The explosion of knowledge and information has brought great benefit to mankind, but it has also carried certain drawbacks, since it has resulted in knowledge and information ‘pollution’. Facing a vast but polluted ocean of data, people sought a technical means to discard the bad and retain the good. Data Mining and Knowledge Discovery (DMKD) was therefore proposed against the background of rapidly expanding data and databases. It is also the result of the development and fusion of database technology, Artificial Intelligence (AI), statistical techniques and visualization technology (Fayyad U., 1998). DMKD has become a research focus and cutting-edge technology in the field of computer information processing (Jef Woksem, 2001). This chapter first introduces the development background, concepts, working process, classification and general application of DM and KD. Second, it discusses basic functions and tasks such as prediction, description, data clustering, data classification, concept description and visualization processing. It then presents methods and tools for DM, such as association rules, decision trees, genetic algorithms, rough sets and support vector machines. Finally, it summarizes the application of DMKD in intelligent manufacturing.

2016 ◽  
Vol 23 (1) ◽  
pp. 177-191
Author(s):  
Anderson Roges Teixeira Góes ◽  
Maria Teresinha Arns Steiner

Abstract Quality in education has been the subject of much discussion, whether in schools and among their managers, in the media, or in the literature. However, a deeper analysis of the literature does not seem to reveal techniques that explore databases in order to obtain classifications of school performance, nor is there a consensus on what “educational quality” is. In this context, this article proposes a methodology that fits within the KDD (Knowledge Discovery in Databases) process for classifying the performance of educational institutions comparatively, based on the scores obtained in the Prova Brasil, one of the components of the Índice de Desenvolvimento da Educação Básica (IDEB, Basic Education Development Index) in Brazil. To illustrate the methodology, it was applied to the municipal public schools of Araucária, PR, in the metropolitan region of Curitiba, PR, a total of 17 schools that offered elementary education (Ensino Fundamental) at the time of the research, considering the scores obtained by all students in the early years (1st to 5th year of elementary education) and in the final years (6th to 9th year of elementary education). In the Data Mining stage, the main stage of the KDD process, three techniques were used comparatively for pattern recognition: Artificial Neural Networks, Support Vector Machines, and Genetic Algorithms. These techniques produced satisfactory results in classifying the schools, represented by means of a “Performance Classification Label”. With this label, educational managers will have a better basis for defining the measures to be adopted at each school and can define more clearly the goals to be met.


2021 ◽  
Vol 36 ◽  
Author(s):  
Emmanuelle Grislin-Le Strugeon ◽  
Kathia Marcal de Oliveira ◽  
Dorsaf Zekri ◽  
Marie Thilliez

Abstract Introduced as an interdisciplinary area that combines multi-agent systems, data mining and knowledge discovery, agent mining is currently in practice. Developing agent mining applications involves a combination of different approaches (model, architecture, technique and so on) from the software agent and data mining (DM) areas. This paper presents an investigation of the approaches used in agent mining systems, based on an in-depth analysis of 121 papers resulting from a systematic literature review. An ontology was defined to capitalize on the knowledge collected from this study. The ontology is organized according to seven main facets: the problem addressed, the application domain, the agent-related and the mining-related elements, the models, processes and algorithms. This ontology is aimed at providing support to decisions about agent mining application design.


Author(s):  
Edgard Benítez-Guerrero ◽  
Omar Nieva-García

The vast amounts of digital information stored in databases and other repositories represent a challenge for finding useful knowledge. Traditional methods for turning data into knowledge, based on manual analysis, reach their limits in this context, and for this reason computer-based methods are needed. Knowledge Discovery in Databases (KDD) is the semi-automatic, nontrivial process of identifying valid, novel, potentially useful, and understandable knowledge (in the form of patterns) in data (Fayyad, Piatetsky-Shapiro, Smyth & Uthurusamy, 1996). KDD is an iterative and interactive process with several steps: understanding the problem domain, data preprocessing, pattern discovery, and pattern evaluation and usage. For discovering patterns, Data Mining (DM) techniques are applied.
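
The KDD steps listed above (preprocessing, pattern discovery, pattern evaluation) can be illustrated with a minimal Python sketch. It is not taken from the cited work: the function names, the choice of co-occurring item pairs as the "pattern", and the `min_support` threshold are all assumptions made for illustration.

```python
from collections import Counter


def preprocess(records):
    """Data preprocessing: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r)]


def discover_patterns(records):
    """Pattern discovery: count how often each pair of items co-occurs."""
    counts = Counter()
    for record in records:
        items = sorted(set(record))
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                counts[(items[i], items[j])] += 1
    return counts


def evaluate_patterns(counts, n_records, min_support):
    """Pattern evaluation: keep pairs whose support reaches min_support."""
    return {
        pair: count / n_records
        for pair, count in counts.items()
        if count / n_records >= min_support
    }


def kdd(records, min_support=0.5):
    """Chain the three steps; min_support is a hypothetical threshold."""
    clean = preprocess(records)
    counts = discover_patterns(clean)
    return evaluate_patterns(counts, len(clean), min_support)


records = [
    ("bread", "milk"),
    ("bread", "milk", "eggs"),
    ("milk", None),  # incomplete record, removed during preprocessing
    ("bread", "eggs"),
]
patterns = kdd(records)
```

In practice the process is iterative: an analyst would inspect `patterns`, adjust the preprocessing or the threshold, and run the steps again.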


Author(s):  
Xuelong Zhang

With the advent of the era of big data, people are eager to extract valuable knowledge from rapidly expanding data so that they can use these massive stores of data more effectively. Traditional data processing technology can only achieve basic functions such as data query and statistics, and cannot extract the knowledge contained in the data to predict future trends. Therefore, along with the rapid development of database technology and the rapid improvement of computing power, data mining (DM) came into existence. Research on DM algorithms draws on knowledge from various fields such as databases, statistics, pattern recognition and artificial intelligence. Pattern recognition mainly extracts features of known data samples. A DM algorithm using pattern recognition technology is a better method for obtaining effective information from massive data, thus providing decision support, and has good application prospects. The support vector machine (SVM) is a pattern recognition algorithm proposed in recent years, which avoids the curse of dimensionality by mapping to a higher-dimensional space and linearization. On this basis, this paper studies DM algorithms based on pattern recognition and proposes a DM algorithm based on SVM. The algorithm divides the vectors of the SV set into two different types and iterates repeatedly to obtain a classifier that converges to the final result. Finally, a cross-validation simulation experiment shows that the DM algorithm based on pattern recognition can effectively reduce the training time and solve the mining problem for massive data. The results show that the algorithm has a certain rationality and feasibility.
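
The abstract does not say which cross-validation scheme was used. As an illustration only, the following Python sketch shows how plain k-fold cross-validation partitions sample indices into training and test sets; the function name `k_fold_indices` and the choice of contiguous, unshuffled folds are assumptions.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds.

    The first n_samples % k folds receive one extra sample so that
    every sample is used for testing exactly once.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, remainder = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < remainder else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
```

A classifier would be trained on each `train` set and scored on the matching `test` set, and the scores averaged. Shuffling the data before splitting is usually advisable when the samples are ordered.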


Author(s):  
Reza Safdari ◽  
Peyman Rezaei-Hachesu ◽  
Marjan GhaziSaeedi ◽  
Taha Samad-Soltani ◽  
Maryam Zolnoori

Medical data mining intends to solve real-world problems in the diagnosis and treatment of diseases. This process applies various techniques and algorithms which have different levels of accuracy and precision. The purpose of this article is to apply data mining techniques to the diagnosis of asthma. The sensitivity, specificity and accuracy of K-nearest neighbor, Support Vector Machine, naive Bayes, Artificial Neural Network, classification tree and CN2 algorithms, and of related similar studies, were evaluated. ROC curves were plotted to show the performance of the authors' approach. Support vector machine (SVM) algorithms achieved the highest accuracy at 98.59%, with a sensitivity of 98.59% and a specificity of 98.61% for class 1. The other algorithms all achieved an accuracy greater than 87%. The results show that the authors can accurately diagnose asthma approximately 98% of the time based on demographic and clinical data. The study also has a higher sensitivity when compared to expert and knowledge-based systems.
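
Sensitivity, specificity and accuracy are all derived from the confusion matrix of a binary classifier. The Python sketch below shows the standard definitions; the function names and the toy labels are illustrative assumptions and do not reproduce the study's data.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary classification task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def ratio(numerator, denominator):
    """Divide safely; an undefined metric is reported as NaN."""
    return numerator / denominator if denominator else float("nan")


def diagnostic_metrics(y_true, y_pred, positive=1):
    """Compute sensitivity, specificity and accuracy."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    return {
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


# Toy labels: 1 = condition present, 0 = condition absent.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = diagnostic_metrics(y_true, y_pred)
```

Reporting sensitivity and specificity alongside accuracy matters in diagnosis because accuracy alone can look high when one class dominates the data.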


Author(s):  
Oyinloye Oghenerukevwe Elohor ◽  
Adesoji Susan ◽  
Akinbohun Folake

The study is aimed at developing a text summarizer using clustering and anomaly detection with SVM classification. A text summarization approach is proposed which uses the SVM clustering algorithm. The proposed project can be used to summarize articles from fields as diverse as politics, sports, current affairs and finance, as well as any other explanatory document. However, it does cause a trade-off between domain independence and a knowledge-based summary, which would provide data in a form more easily understandable to the user. A bundle of libraries and software was used to summarize alphanumeric text input. KEYWORDS: Anomaly detection, SVM (support vector machine), clustering, text summarization, data mining


2020 ◽  
Vol 24 (106) ◽  
pp. 79-87
Author(s):  
Fredy Humberto Troncoso Espinosa ◽  
Javiera Valentina Ruiz Tapia

Customer churn is a significant problem for service companies and can cause them substantial economic losses. Identifying the factors that lead a customer to stop using a service is a complex task; however, customer behavior makes it possible to estimate a churn probability for each customer. This research applies data mining to predict customer churn at a natural gas distribution company, using two machine learning techniques: neural networks and support vector machines. The results show that applying these techniques makes it possible to identify the customers most likely to churn, so that timely and targeted retention actions can be taken, minimizing the costs associated with misidentifying these customers. Keywords: customer churn, data mining, machine learning, natural gas distribution.


10.28945/2697 ◽  
2003 ◽  
Author(s):  
Krzysztof Hauke ◽  
Mieczyslaw L. Owoc ◽  
Maciej Pondel

Data Mining (DM) is a crucial issue in knowledge discovery processes. The basic facilities for creating data mining models were implemented successfully on Oracle 9i as an extension of the database server. DM tools enable developers to create Business Intelligence (BI) applications. As a result, Data Mining models can be used to support knowledge-based management. The main goal of the paper is to present new features of the Oracle platform for building and testing DM models. The authors characterize methods of building and testing Data Mining models available on the Oracle 9i platform, stressing the critical steps of the whole process and presenting examples of practical usage of DM models. Verification techniques for the generated knowledge bases in this environment are also discussed.


Entropy ◽  
2021 ◽  
Vol 23 (4) ◽  
pp. 485 ◽  
Author(s):  
Carlos A. Palacios ◽  
José A. Reyes-Suárez ◽  
Lorena A. Bearzotti ◽  
Víctor Leiva ◽  
Carolina Marchant

Data mining is employed to extract useful information and to detect patterns from often large data sets, and is closely related to knowledge discovery in databases and data science. In this investigation, we formulate models based on machine learning algorithms to extract relevant information predicting student retention at various levels, using higher education data and specifying the relevant variables involved in the modeling. Then, we utilize this information to help the process of knowledge discovery. We predict student retention at each of three levels during their first, second, and third years of study, obtaining models with an accuracy that exceeds 80% in all scenarios. These models allow us to adequately predict the level when dropout occurs. Among the machine learning algorithms used in this work are: decision trees, k-nearest neighbors, logistic regression, naive Bayes, random forest, and support vector machines, of which the random forest technique performs the best. We detect that secondary educational score and the community poverty index are important predictive variables, which have not been previously reported in educational studies of this type. The dropout assessment at various levels reported here is valid for higher education institutions around the world with similar conditions to the Chilean case, where dropout rates affect the efficiency of such institutions. Having the ability to predict dropout based on student data enables these institutions to take preventative measures, avoiding the dropouts. In the case study, balancing the majority and minority classes improves the performance of the algorithms.
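
The abstract does not state which balancing method was applied. One simple option, shown here purely as an assumption, is random oversampling: minority-class samples are duplicated at random until every class is as large as the majority class. The function name, the seed parameter and the toy labels are hypothetical.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    if not by_class:
        return [], []
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        # Draw extra copies with replacement from the existing group.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Toy data: 8 retained students and 2 dropouts (sample ids 8 and 9).
samples = list(range(10))
labels = ["stay"] * 8 + ["dropout"] * 2
balanced_samples, balanced_labels = random_oversample(samples, labels)
class_sizes = Counter(balanced_labels)
```

Oversampling is normally applied to the training split only, so that duplicated samples do not leak into the data used for evaluation.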


2021 ◽  
Vol 39 (11) ◽  
pp. 1331-1340
Author(s):  
Janaína Lopes Dias ◽  
Michele Kremer Sott ◽  
Caroline Cipolatto Ferrão ◽  
João Carlos Furtado ◽  
Jorge André Ribas Moraes

The processes related to solid waste management (SWM) are being revised as new technologies emerge and are applied in the area to achieve greater environmental, social and economic sustainability for society. To achieve our goal, two robust review protocols (Population, Intervention, Comparison, Outcome, and Context (PICOC) and Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA)) were used to systematically analyze 62 documents extracted from the Web of Science database to identify the main techniques and tools for Knowledge Discovery in Databases (KDD) and Data Mining (DM) as applied to SWM and explore the technological potential to optimize the stages of collecting and transporting waste. Moreover, it was possible to analyze the main challenges and opportunities of KDD and DM for SWM. The results show that the most used tools for SWM are MATLAB (29.7%) and GIS (13.5%), whereas the most used techniques are Artificial Neural Networks (35.8%), Linear Regression (16.0%) and Support Vector Machine (12.3%). In addition, 15.3% of the studies were conducted with data from China, 11.1% from India and 9.7% of the studies analyzed and compared data from several other countries. Furthermore, the research showed that the main challenges in the field of study are related to the collection and treatment of data, whereas the opportunities appear to be linked mainly to the impact on the pillars of sustainable development. Thus, this study portrays important issues associated with the use of KDD and DM for optimal SWM and has the potential to assist and direct researchers and field professionals in future studies.

