Applications of Multi-Label Classification

The absence of labels and poor data quality are prevailing challenges in numerous data mining and machine learning problems. The performance of a model is limited when only few labeled samples are available for training. These problems are especially critical in multi-label classification, which usually requires clean data. Multi-label classification is a challenging research problem that emerges in several applications such as multi-object recognition, text categorization, music categorization, and image classification. This paper presents a literature review on multi-label classification, the various evaluation metrics used for analyzing performance, and open research challenges.
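As a concrete companion to the evaluation metrics surveyed in this review, the following minimal Python sketch computes two standard multi-label metrics, Hamming loss and subset (exact-match) accuracy; the label vectors are invented for illustration:

```python
# Minimal sketch of two common multi-label evaluation metrics.
def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots predicted incorrectly."""
    total = sum(len(t) for t in y_true)
    wrong = sum(ti != pi
                for t, p in zip(y_true, y_pred)
                for ti, pi in zip(t, p))
    return wrong / total

def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label vector matches exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Three samples, three binary labels each (hypothetical data).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0]]
print(hamming_loss(y_true, y_pred))     # 2 wrong slots of 9 -> 0.222...
print(subset_accuracy(y_true, y_pred))  # 1 exact match of 3 -> 0.333...
```

Note that the two metrics can disagree sharply: a model that gets almost every label slot right can still have low subset accuracy if it rarely predicts a complete label vector exactly.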

2021 ◽  
pp. 097215092098485
Author(s):  
Sonika Gupta ◽  
Sushil Kumar Mehta

Data mining techniques have proven quite effective not only in detecting financial statement fraud but also in discovering other financial crimes, such as credit card fraud, loan and security fraud, corporate fraud, and bank and insurance fraud. In recent years, classification techniques from data mining have been accepted as one of the most credible methodologies for detecting symptoms of financial statement fraud by scanning the published financial statements of companies. The retrieved literature that has used data mining classification techniques can be broadly categorized, on the basis of the type of technique applied, into statistical techniques and machine learning techniques. The biggest challenge in executing the classification process using data mining techniques lies in collecting the data sample of fraudulent companies and mapping the sample of fraudulent companies against non-fraudulent companies. In this article, a systematic literature review (SLR) of studies from the area of financial statement fraud detection has been conducted. The review considered research articles published between 1995 and 2020. Further, a meta-analysis was performed to establish the effect of data sample mapping of fraudulent companies against non-fraudulent companies on the classification methods by comparing the overall classification accuracy reported in the literature. The retrieved literature indicates that a fraudulent sample can either be equally paired with a non-fraudulent sample (1:1 data mapping) or be unequally mapped using a 1:many ratio to increase the sample size proportionally. Based on the meta-analysis of the research articles, it can be concluded that machine learning approaches, in comparison to statistical approaches, can achieve better classification accuracy, particularly when the availability of sample data is low.
High classification accuracy can be obtained with even a 1:1 mapping data set using machine learning classification approaches.
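The 1:1 data mapping the review describes amounts to matched undersampling of the majority (non-fraudulent) class. A minimal Python sketch, with invented record counts, might look like:

```python
import random
random.seed(0)

# Hypothetical records: (class tag, id); 5 fraudulent vs 50 non-fraudulent firms.
fraud = [("f", i) for i in range(5)]
non_fraud = [("n", i) for i in range(50)]

# 1:1 mapping: pair each fraudulent firm with one randomly drawn
# non-fraudulent firm, yielding a balanced training sample.
matched = random.sample(non_fraud, k=len(fraud))
balanced = fraud + matched
print(len(balanced))  # 10 records, 5 of each class
```

In practice the matching is often done on covariates such as industry, size, and fiscal year rather than purely at random, so that the non-fraudulent control firms resemble the fraudulent firms on everything except the fraud itself.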


2016 ◽  
pp. 1064-1089
Author(s):  
Jacqueline Y. Ford

Guided by the lens of psychodynamic theory, Ford (2015) investigated the challenges faced by adoptive families of traumatized children. Fifteen families were randomly selected to participate in this study from a group of 30 parents who adopted traumatized children in Arizona. Thematic categories were drawn and summarized. Textual descriptions evolved from the thematic groups acknowledging their experiences and how these lived experiences guided their decision to adopt a traumatized child. Verification techniques, data mining, journaling, clustering, brainstorming, and peer reviews were used to ensure the quality of data. Emergent themes emphasized the need for adoption-focused training specific to traumatized children. Ford's (2015) study revealed that these adoptive families desired to be equipped with specialized therapeutic training before and after their adoptions.


Author(s):  
Konstantinos E. Parsopoulos ◽  
Michael N. Vrahatis

This chapter presents the fundamental concepts regarding the application of PSO to machine learning problems. The main objective in such problems is the training of computational models for performing classification and simulation tasks. It is not our intention to provide a literature review of the numerous related applications. Instead, we aim at providing guidelines for the application and adaptation of PSO to this problem type. To achieve this, we focus on two representative cases, namely the training of artificial neural networks and learning in fuzzy cognitive maps. In each case, the problem is first defined in a general framework, and then an illustrative example is provided to familiarize readers with the main procedures and possible obstacles that may arise during the optimization process.


2020 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Roberto Salazar-Reyna ◽  
Fernando Gonzalez-Aleu ◽  
Edgar M.A. Granda-Gutierrez ◽  
Jenny Diaz-Ramirez ◽  
Jose Arturo Garza-Reyes ◽  
...  

Purpose: The objective of this paper is to assess and synthesize the published literature related to the application of data analytics, big data, data mining and machine learning to healthcare engineering systems.
Design/methodology/approach: A systematic literature review (SLR) was conducted to obtain the most relevant papers related to the research study from three different platforms: EBSCOhost, ProQuest and Scopus. The literature was assessed and synthesized, conducting analysis associated with the publications, authors and content.
Findings: From the SLR, 576 publications were identified and analyzed. The research area seems to show the characteristics of a growing field, with new research areas evolving and applications being explored. In addition, the main authors and collaboration groups publishing in this research area were identified through a social network analysis. This could help new and current authors identify researchers with common interests in the field.
Research limitations/implications: The use of the SLR methodology does not guarantee that all relevant publications related to the research are covered and analyzed. However, the authors' previous knowledge and the nature of the publications were used to select different platforms.
Originality/value: To the best of the authors' knowledge, this paper represents the most comprehensive literature-based study on the fields of data analytics, big data, data mining and machine learning applied to healthcare engineering systems.


10.28945/2584 ◽  
2002 ◽  
Author(s):  
Herna L. Viktor ◽  
Wayne Motha

Increasingly, large organizations are engaging in data warehousing projects in order to achieve a competitive advantage through the exploration of the information as contained therein. It is therefore paramount to ensure that the data warehouse includes high quality data. However, practitioners agree that the improvement of the quality of data in an organization is a daunting task. This is especially evident in data warehousing projects, which are often initiated “after the fact”. The slightest suspicion of poor quality data often hinders managers from reaching decisions, when they waste hours in discussions to determine what portion of the data should be trusted. Augmenting data warehousing with data mining methods offers a mechanism to explore these vast repositories, enabling decision makers to assess the quality of their data and to unlock a wealth of new knowledge. These methods can be effectively used with inconsistent, noisy and incomplete data that are commonplace in data warehouses.


Electronics ◽  
2021 ◽  
Vol 10 (23) ◽  
pp. 2997
Author(s):  
Luminita Hurbean ◽  
Doina Danaiata ◽  
Florin Militaru ◽  
Andrei-Mihail Dodea ◽  
Ana-Maria Negovan

Machine learning (ML) has already gained the attention of researchers involved in smart city (SC) initiatives, along with other advanced technologies such as IoT, big data, cloud computing, and analytics. In this context, researchers have also realized that data can help make the SC happen, and the open data movement has encouraged more research using machine learning. Based on this line of reasoning, the aim of this paper is to conduct a systematic literature review investigating open data-based machine learning applications in the six different areas of smart cities. The results of this research reveal that: (a) machine learning applications using open data appear in all the SC areas, and specific ML techniques were identified for each area, with deep learning and supervised learning being the first choices; (b) open data platforms represent the most frequently used source of data; and (c) the challenges associated with open data utilization range from data quality to frequency of collection, consistency, and format. Overall, the data synopsis as well as the in-depth analysis may be a valuable support and inspiration for future smart city projects.


2021 ◽  
Author(s):  
Sven Hilbert ◽  
Stefan Coors ◽  
Elisabeth Barbara Kraus ◽  
Bernd Bischl ◽  
Mario Frei ◽  
...  

Classical statistical methods are limited in the analysis of high-dimensional datasets. Machine learning (ML) provides a powerful framework for prediction by using complex relationships, often encountered in modern data with a large number of variables, cases and potentially non-linear effects. ML has turned into one of the most influential analytical approaches of this millennium and has recently become popular in the behavioral and social sciences. The impact of ML methods on research and practical applications in the educational sciences is still limited, but continuously grows as larger and more complex datasets become available through massive open online courses (MOOCs) and large-scale investigations. The educational sciences are at a crucial pivot point, because of the anticipated impact ML methods hold for the field. Here, we review the opportunities and challenges of ML for the educational sciences, show how a look at related disciplines can help learning from their experiences, and argue for a philosophical shift in model evaluation. We demonstrate how the overall quality of data analysis in educational research can benefit from these methods and show how ML can play a decisive role in the validation of empirical models. In this review, we (1) provide an overview of the types of data suitable for ML, (2) give practical advice for the application of ML methods, and (3) show how ML-based tools and applications can be used to enhance the quality of education. Additionally, we provide practical R code with exemplary analyses, available at https://osf.io/ntre9/?view_only=d29ae7cf59d34e8293f4c6bbde3e4ab2.
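The philosophical shift in model evaluation the authors argue for, judging a model by out-of-sample prediction rather than in-sample fit, can be illustrated with a minimal k-fold cross-validation sketch. The review's own examples are in R; this standalone Python version uses simulated data:

```python
import random
random.seed(0)

# Simulated data: y = 2x + Gaussian noise with unit variance.
xs = [random.uniform(0, 10) for _ in range(100)]
ys = [2 * x + random.gauss(0, 1) for x in xs]

def fit_line(x, y):
    """Ordinary least squares for a single predictor; returns (intercept, slope)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b

def cv_mse(xs, ys, k=5):
    """k-fold cross-validation: each point is predicted by a model
    that never saw it during fitting."""
    idx = list(range(len(xs)))
    random.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    errs = []
    for test in folds:
        train = [i for i in idx if i not in test]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errs += [(ys[i] - (a + b * xs[i])) ** 2 for i in test]
    return sum(errs) / len(errs)

err = cv_mse(xs, ys)
print(err)  # close to the noise variance (about 1)
```

Because the cross-validated error estimates performance on unseen data, it penalizes overfitting in a way that in-sample goodness-of-fit statistics cannot, which is the core of the argument made above.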


2006 ◽  
Vol 3 (1) ◽  
Author(s):  
Miha Vuk ◽  
Tomaž Curk

This paper presents the ROC curve, lift chart, and calibration plot, three well-known graphical techniques that are useful for evaluating the quality of classification models used in data mining and machine learning. Each technique, normally used and studied separately, defines its own measure of classification quality and its visualization. Here, we give a brief survey of the methods and establish a common mathematical framework which adds some new aspects, explanations and interrelations between these techniques. We conclude with an empirical evaluation and a few examples of how to use the presented techniques to boost classification accuracy.
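As a small illustration of the first of these techniques, the area under the ROC curve can be computed directly from its rank-statistic interpretation; the labels and scores below are invented:

```python
def roc_auc(y_true, scores):
    """AUC via the Mann-Whitney formulation: the probability that a randomly
    chosen positive example scores higher than a randomly chosen negative one
    (ties count half)."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and classifier scores.
y = [1, 1, 0, 1, 0, 0]
s = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(roc_auc(y, s))  # 8 of 9 positive-negative pairs ranked correctly -> 0.888...
```

This pairwise-ranking view is exactly why AUC is threshold-free: it depends only on the ordering of the scores, not on any particular cutoff, in contrast to the calibration plot, which asks whether the score values themselves are trustworthy as probabilities.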


Mathematics ◽  
2020 ◽  
Vol 8 (5) ◽  
pp. 662 ◽  
Author(s):  
Husein Perez ◽  
Joseph H. M. Tah

In the field of supervised machine learning, the quality of a classifier model is directly correlated with the quality of the data used to train it. The presence of unwanted outliers in the data can significantly reduce the accuracy of a model or, even worse, result in a biased model leading to inaccurate classification. Identifying and eliminating outliers is, therefore, crucial for building good-quality training datasets. Pre-processing procedures for dealing with missing and outlier data, commonly known as feature engineering, are standard practice in machine learning problems. They help to make better assumptions about the data and prepare datasets in a way that best exposes the underlying problem to the machine learning algorithms. In this work, we propose a multistage method for detecting and removing outliers in high-dimensional data. Our proposed method uses t-distributed stochastic neighbour embedding (t-SNE) to reduce a high-dimensional feature map to a two-dimensional probability density distribution, and then applies a simple descriptive statistic, the interquartile range (IQR), to identify outlier values in the density distribution of the features. t-SNE is a machine learning algorithm and a nonlinear dimensionality reduction technique well suited to embedding high-dimensional data for visualisation in a low-dimensional space of two or three dimensions. We applied this method to a dataset containing images for training a convolutional neural network model (ConvNet) for an image classification problem. The dataset contains four different classes of images: three classes contain defects in construction (mould, stain, and paint deterioration) and one is a no-defect class (normal). We used the transfer learning technique to modify a pre-trained VGG-16 model, employing it both as a feature extractor and as a benchmark to evaluate our method.
We have shown that this method can identify and remove outlier images from the dataset. After removing the outlier images and re-training the VGG-16 model, the results show that classification accuracy improved significantly and the number of misclassified cases dropped. While many feature engineering techniques for handling missing and outlier data are common in predictive machine learning problems involving numerical or categorical data, there is little work on techniques for handling outliers in high-dimensional image data, which could improve the quality of machine learning tasks involving images, such as ConvNet models for image classification and object detection.
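The IQR-based outlier stage of such a pipeline can be sketched as follows. The 2-D coordinates below stand in for a t-SNE embedding (computing the embedding itself would require an implementation such as scikit-learn's `TSNE`), and the values are invented:

```python
def iqr_bounds(values, k=1.5):
    """Tukey fences: [Q1 - k*IQR, Q3 + k*IQR], with linearly
    interpolated quartiles."""
    xs = sorted(values)
    def quantile(q):
        i = q * (len(xs) - 1)
        lo, hi = int(i), min(int(i) + 1, len(xs) - 1)
        return xs[lo] + (xs[hi] - xs[lo]) * (i - lo)
    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outliers(points):
    """Flag a point if either embedded coordinate falls outside the fences."""
    flags = []
    for dim in range(2):
        lo, hi = iqr_bounds([p[dim] for p in points])
        flags.append([not (lo <= p[dim] <= hi) for p in points])
    return [a or b for a, b in zip(*flags)]

# Hypothetical 2-D coordinates, as t-SNE might return for image features;
# the last point sits far from the main cluster.
emb = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0), (0.1, 0.1), (9.0, 9.5)]
print(iqr_outliers(emb))  # only the last point is flagged
```

Applying the IQR rule per embedded dimension, rather than on the raw high-dimensional features, is what the dimensionality-reduction step buys: in the original feature space, outliers need not be extreme along any single coordinate.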


Author(s):  
Amelia Zafra

The multiple-instance problem is a difficult machine learning problem that appears in cases where knowledge about training examples is incomplete. In this problem, the teacher labels examples that are sets (also called bags) of instances. The teacher does not label whether an individual instance in a bag is positive or negative. The learning algorithm needs to generate a classifier that will correctly classify unseen examples (i.e., bags of instances). This learning framework is receiving growing attention in the machine learning community, and since it was introduced by Dietterich, Lathrop, and Lozano-Perez (1997), a wide range of tasks have been formulated as multi-instance problems. Among these tasks, we can cite content-based image retrieval (Chen, Bi, & Wang, 2006) and annotation (Qi & Han, 2007), text categorization (Andrews, Tsochantaridis, & Hofmann, 2002), web index page recommendation (Zhou, Jiang, & Li, 2005; Xue, Han, Jiang, & Zhou, 2007) and drug activity prediction (Dietterich et al., 1997; Zhou & Zhang, 2007). In this chapter we introduce MOG3P-MI, a multiobjective grammar-guided genetic programming algorithm to handle multi-instance problems. In this algorithm, based on SPEA2, individuals represent classification rules which make it possible to determine whether a bag is positive or negative. The quality of each individual is evaluated according to two quality indexes, sensitivity and specificity, both adapted to MIL circumstances. Computational experiments show that MOG3P-MI is a robust classification algorithm across different domains, achieving competitive results and producing classifiers whose simple rules add comprehensibility and simplicity to the knowledge discovery process, making it a suitable method for solving MIL problems (Zafra & Ventura, 2007).
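The standard multi-instance assumption and the two quality indexes (sensitivity and specificity) can be sketched in a few lines of Python; the rule and the bags below are invented for illustration:

```python
def classify_bag(bag, rule):
    # Standard multi-instance assumption (Dietterich et al., 1997):
    # a bag is positive iff at least one of its instances satisfies the rule.
    return any(rule(inst) for inst in bag)

def sensitivity_specificity(bags, labels, rule):
    # The two objectives a multiobjective MIL learner such as MOG3P-MI
    # would evaluate for each candidate rule.
    preds = [classify_bag(b, rule) for b in bags]
    tp = sum(p and y for p, y in zip(preds, labels))
    tn = sum((not p) and (not y) for p, y in zip(preds, labels))
    pos, neg = sum(labels), len(labels) - sum(labels)
    return tp / pos, tn / neg

# Hypothetical single-condition rule over 2-feature instances.
rule = lambda inst: inst[0] > 0.5

bags = [[(0.1, 0.3), (0.7, 0.2)],   # positive: contains an "active" instance
        [(0.2, 0.9), (0.4, 0.4)],   # negative: no active instance
        [(0.6, 0.1)],               # positive
        [(0.3, 0.3)]]               # negative
labels = [True, False, True, False]

sens, spec = sensitivity_specificity(bags, labels, rule)
print(sens, spec)  # this toy rule classifies all four bags correctly
```

In the actual algorithm the rule is not fixed but evolved by grammar-guided genetic programming, with SPEA2 maintaining a Pareto front over the (sensitivity, specificity) trade-off rather than collapsing the two objectives into one score.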

