BUILDING CUSTOMER MODELS FROM BUSINESS DATA: AN AUTOMATIC APPROACH BASED ON FUZZY CLUSTERING AND MACHINE LEARNING

Author(s):  
LOTFI BEN ROMDHANE
NADIA FADHEL
BECHIR AYEB

Data mining (DM) is an emerging discipline that aims to extract knowledge from data using a variety of techniques. DM has turned out to be useful in business, where the data describing customers and their transactions is in the order of terabytes. In this paper, we propose an approach for building customer models (also called profiles in the literature) from business data. Our approach proceeds in three steps. In the first step, we use fuzzy clustering to categorize customers, i.e., to determine groups of customers. A key feature is that the number of groups (or clusters) is computed automatically from the data using the partition entropy as a validity criterion. In the second step, we perform a dimensionality reduction that keeps, for each group of customers, only the most informative attributes. For this, we define an information-loss measure to quantify how informative an attribute is. As a result of this second step, we obtain groups of customers, each described by a distinct set of attributes. In the third and final step, we use backpropagation neural networks to extract useful knowledge from these groups. Experimental results on real-world data sets reveal a good performance of our approach and should stimulate future research.
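
As a rough illustration of the first step, the sketch below implements Euclidean fuzzy c-means and selects the number of clusters by minimizing the partition entropy, the validity criterion named in the abstract. It is not the authors' implementation; the data, the fuzzifier m = 2, and the candidate cluster range are all placeholder assumptions.

```python
# Sketch: fuzzy c-means + partition-entropy model selection.
# NOT the authors' code; data, fuzzifier, and cluster range are assumptions.
import numpy as np

def fuzzy_cmeans(X, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Euclidean fuzzy c-means; returns centers and membership matrix U (n x c)."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)                 # memberships sum to 1
    for _ in range(max_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        U_new = 1.0 / np.fmax(d, 1e-12) ** (2.0 / (m - 1.0))
        U_new /= U_new.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:
            return centers, U_new
        U = U_new
    return centers, U

def partition_entropy(U):
    """Bezdek's PE(U) = -(1/n) sum_ik u_ik log u_ik; lower = crisper partition."""
    return -np.sum(U * np.log(np.fmax(U, 1e-12))) / U.shape[0]

# Choose the number of customer groups automatically: the c that minimizes PE.
X = np.random.default_rng(1).random((200, 5))         # placeholder customer data
best_c = min(range(2, 8), key=lambda c: partition_entropy(fuzzy_cmeans(X, c)[1]))
print("automatically chosen number of clusters:", best_c)
```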

Author(s):  
Lincy Mathews
Seetha Hari

A very challenging issue with real-world data is that in many domains, such as medicine, finance, marketing, the web, telecommunications, and management, the distribution of data among classes is inherently imbalanced. A widely researched issue is that traditional classifier algorithms assume a balanced distribution among the classes. Data imbalance is evident when the number of instances representing the class of concern is much smaller than in the other classes. Hence, classifiers tend to be biased towards the well-represented class, which leads to a higher misclassification rate for the less-represented class. There is therefore a need for efficient learners to classify imbalanced data. This chapter addresses the need, challenges, existing methods, and evaluation metrics identified when learning from imbalanced data sets. Future research challenges and directions are highlighted.
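
As a minimal illustration of the bias described above (not taken from the chapter), the sketch below trains a classifier on a synthetic 95:5 imbalanced set with and without class weighting, one of the standard countermeasures; the data and parameters are assumptions for demonstration only.

```python
# Toy demonstration of majority-class bias and class weighting.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
# 950 majority vs. 50 minority samples in two dimensions (a 95:5 imbalance).
X = np.vstack([rng.normal(0.0, 1.0, (950, 2)), rng.normal(1.5, 1.0, (50, 2))])
y = np.array([0] * 950 + [1] * 50)

plain = LogisticRegression().fit(X, y)
balanced = LogisticRegression(class_weight="balanced").fit(X, y)

# Recall on the minority class: the unweighted model misses most positives.
print("minority recall, unweighted:", recall_score(y, plain.predict(X)))
print("minority recall, weighted:  ", recall_score(y, balanced.predict(X)))
```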


2023
Vol 55 (1)
pp. 1-33
Author(s):  
Fan Xu
Victor S. Sheng
Mingwen Wang

With the proliferation of social sensing, large numbers of observations are contributed by people or devices. However, these observations can contain disinformation. Disinformation can propagate across online social networks at relatively low cost, yet cause a series of major problems in our society. In this survey, we provide a comprehensive overview of disinformation and truth discovery in social sensing from a unified perspective, including basic concepts and a taxonomy of existing methodologies. Furthermore, we summarize the mechanisms of disinformation from four different perspectives (i.e., text only, text with image/multi-modal, text with propagation, and fusion models). In addition, we review existing solutions against these requirements, compare their pros and cons, and give a guide to their usage based on the lessons learned. To facilitate future studies in this field, we summarize the related publicly accessible real-world data sets and open-source code. Last but most important, we highlight potential future research topics and challenges in this domain through a deep analysis of the most recent methods.


2019
Vol 8 (3)
pp. 7071-7081

Current-generation real-world data sets processed through machine learning are imbalanced by nature. This imbalance presents researchers with a challenging scenario for prediction with both machine learning and data mining algorithms. Past research studies show that most imbalanced data sets consist of majority and minority classes, with the majority class dominating the minority class. Several standard and hybrid prediction algorithms have been proposed in various application domains, but most of the real-world data sets analyzed in these studies are imbalanced by nature, which affects prediction accuracy. This paper presents a systematic survey of past research studies to analyze the intrinsic data characteristics and the techniques used for handling class-imbalanced data. In addition, this study reveals the research gaps, trends, and patterns in existing studies and briefly discusses future research directions.


Geophysics
2017
Vol 82 (4)
pp. E169-E186
Author(s):  
Yunuhen Muñíz
Enrique Gómez-Treviño
Francisco J. Esparza
Mayra Cuellar

A combination of the magnetotelluric phase tensor and the quadratic algorithm provides a fast and simple solution to the problem of a 2D impedance tensor distorted by 3D electrogalvanic effects. The strike direction is provided by the phase tensor, which is known to yield unstable estimates for noisy data. We obtain stable directions in three steps. First, we use bootstrapping to find the most stable estimate among the different periods. Second, this value is used as the seed for selecting neighboring strikes, assuming continuity over periods. This second step is repeated several times to compute variances. The third step, which we call prerotation, consists of rotating the original impedance tensor to the most favorable angle for optimal stability and then rotating it back for compensation. The procedure is developed progressively through its application to the increasingly difficult data sets COPROD2S1, COPROD2, far-hi, and BC87, all available for testing new ideas. Alternatively, using the Groom-Bailey terminology, the quadratic algorithm provides the amplitudes and phases independently of the strike direction and twist. The amplitudes and phases still need to be tuned up with the correct shear, which is obtained by contrasting the phases from the phase tensor and from the quadratic equation until they match for all available periods. The results are the undistorted impedances. Uncertainties are computed using formulas derived for the quadratic equation. We use the same data sets as for the strike to illustrate the recovery of impedances and their uncertainties.
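
The following is a schematic sketch, not the authors' code, of how the first step could look: bootstrap resampling estimates the spread of the strike estimate at each period, and the period with the smallest spread seeds the neighbor-selection step. All data here are synthetic placeholders, and the 90-degree ambiguity of the strike angle is ignored for brevity.

```python
# Schematic sketch: pick the period whose strike estimate is most stable
# under bootstrap resampling; that period seeds the neighbor-selection step.
import numpy as np

def bootstrap_strike_std(samples, n_boot=1000, seed=0):
    """Bootstrap standard deviation of the mean strike (degrees) at one period."""
    rng = np.random.default_rng(seed)
    n = len(samples)
    means = [np.mean(rng.choice(samples, size=n, replace=True))
             for _ in range(n_boot)]
    return float(np.std(means))

# strikes_by_period[i]: repeated strike estimates at period i from noisy data.
rng = np.random.default_rng(1)
strikes_by_period = [rng.normal(30.0, s, 50) for s in (2.0, 0.5, 8.0)]
stds = [bootstrap_strike_std(s) for s in strikes_by_period]
seed_period = int(np.argmin(stds))      # the most stable period becomes the seed
print("seed period index:", seed_period, "bootstrap stds:", np.round(stds, 2))
```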


2008
Vol 18 (03)
pp. 221-250
Author(s):  
MARTIN HEIMLICH
MARTIN HELD

We present an algorithm for approximating multiple closed polygons in a tangent-continuous manner with circular biarcs. The approximation curves are guaranteed to lie within a user-specified tolerance of the original input. If requested, our algorithm can also ensure that the input is within a user-specified tolerance of the approximation curves. These tolerances can be either symmetric, asymmetric, one-sided, or even one-sided and completely disconnected from the inputs. Our algorithm makes use of Voronoi diagrams to build disjoint and continuous tolerance bands for every polygon of the input. In a second step the approximation curves are fitted into the tolerance bands. Our algorithm has a worst-case complexity of O(n log n) for an n-vertex input. Extensive experiments with synthetic and real-world data sets show that our algorithm generates approximation curves with significantly fewer approximation primitives than previously proposed algorithms. This difference becomes more prominent the larger the tolerance threshold is or the more severe the noise in the input is. In particular, no heuristic is needed for smoothing noisy input prior to the actual approximation. Rather, our approximation algorithm can be used to smooth out noise in a reliable manner.
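
As background for the primitive being fitted, here is a compact geometric sketch (independent of the paper's algorithm) of a single tangent-continuous biarc: two circular arcs joining two points with prescribed unit tangents, using the common equal-tangent-length construction. All coordinates are illustrative.

```python
# Geometric sketch: one tangent-continuous biarc joining p0 to p1 with
# prescribed unit tangents t0, t1, via the equal-tangent-length construction.
import numpy as np

def perp(v):
    """Rotate a 2D vector by +90 degrees."""
    return np.array([-v[1], v[0]])

def biarc_junction(p0, t0, p1, t1):
    """Junction J of the equal-parameter biarc: solve |v - d*(t0+t1)| = 2d for d."""
    v, t = p1 - p0, t0 + t1
    a, b, c = t @ t - 4.0, -2.0 * (v @ t), v @ v    # a <= 0 always
    if abs(a) < 1e-12:                              # parallel tangents: linear case
        d = c / (2.0 * (v @ t))
    else:
        d = (-b - np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)   # the positive root
    return (p0 + d * t0 + p1 - d * t1) / 2.0

def arc_center(p, tangent, q):
    """Center/radius of the circle through p and q tangent to `tangent` at p
    (degenerates to a line if p, q, and the tangent are collinear)."""
    n, w = perp(tangent), q - p
    r = (w @ w) / (2.0 * (n @ w))
    return p + r * n, abs(r)

p0, t0 = np.array([0.0, 0.0]), np.array([1.0, 0.0])
p1, t1 = np.array([2.0, 1.0]), np.array([0.0, 1.0])
J = biarc_junction(p0, t0, p1, t1)
c1, r1 = arc_center(p0, t0, J)      # first arc runs p0 -> J
c2, r2 = arc_center(p1, t1, J)      # second arc runs J -> p1
print("junction:", J, "radii:", r1, r2)
```

The fitting described in the abstract then reduces to placing such primitives so that they stay inside the Voronoi-derived tolerance bands.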


Author(s):  
K Sobha Rani

Collaborative filtering suffers from the problems of data sparsity and cold start, which dramatically degrade recommendation performance. To help resolve these issues, we propose TrustSVD, a trust-based matrix factorization technique. By analyzing social trust data from four real-world data sets, we conclude that not only the explicit but also the implicit influence of both ratings and trust should be taken into consideration in a recommendation model. Hence, we build on top of a state-of-the-art recommendation algorithm, SVD++, which inherently involves the explicit and implicit influence of rated items, by further incorporating both the explicit and implicit influence of trusted users on the prediction of items for an active user. To our knowledge, the work reported here is the first to extend SVD++ with social trust information. Experimental results on the four data sets demonstrate that our approach, TrustSVD, achieves better accuracy than ten other counterparts and better handles the aforementioned issues.
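
For context, here is a hedged sketch of the TrustSVD-style prediction rule as it is usually presented in the literature: SVD++'s implicit feedback term over rated items plus an analogous implicit term over trusted users. Shapes, names, and the toy data are illustrative, not the paper's code.

```python
# Hedged sketch of a TrustSVD-style prediction rule.
import numpy as np

def predict(u, i, mu, b_u, b_i, P, Q, Y, W, rated, trusted):
    """Predicted rating of item i for user u.

    rated[u]   -- indices of items rated by u (implicit rating feedback)
    trusted[u] -- indices of users trusted by u (implicit trust feedback)"""
    p = P[u].copy()
    I_u, T_u = rated.get(u, []), trusted.get(u, [])
    if len(I_u) > 0:
        p += Y[I_u].sum(axis=0) / np.sqrt(len(I_u))   # influence of rated items
    if len(T_u) > 0:
        p += W[T_u].sum(axis=0) / np.sqrt(len(T_u))   # influence of trusted users
    return mu + b_u[u] + b_i[i] + Q[i] @ p

rng = np.random.default_rng(0)
n_users, n_items, k = 5, 10, 8
P, W = rng.normal(0, 0.1, (n_users, k)), rng.normal(0, 0.1, (n_users, k))
Q, Y = rng.normal(0, 0.1, (n_items, k)), rng.normal(0, 0.1, (n_items, k))
b_u, b_i = np.zeros(n_users), np.zeros(n_items)
print(predict(0, 4, 3.5, b_u, b_i, P, Q, Y, W,
              rated={0: [1, 3]}, trusted={0: [2]}))
```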


2014
Vol 13 (7)
pp. 4675-4682
Author(s):  
Atefeh Danesh Moghadam
Alireza Alagha

With the advent of the information era, the digital world is not only going to expand its territory, it is also going to penetrate traditional notions about the meanings of words and valorize new concepts. According to the Oxford Dictionary, the word heritage is defined as: the history, traditions, and qualities that a country or society has had for many years and that are considered an important part of its character. To show how emerging patterns, as consequences of technological development, are coming to be considered a new form of heritage, we follow four steps. In the first step, we present the convergence of Information and Communication Technology (ICT) and a concise history of this convergence. In the second step, we argue how convergence has culminated in emerging patterns and has also changed the digital world. In the third step, the importance of user behaviors and their mining is surveyed. Finally, in the fourth step, we present User-Generated Content (UGC) as the most prominent user behavior in the digital world.


2019
Vol 33 (4)
pp. 1-14
Author(s):  
William R. Kinney

SYNOPSIS This Commentary is intended to help beginning Ph.D. students identify, evaluate, and communicate the essential components of proposed empirical accounting research using a three-step process. The first step is a structured, top-down approach of writing answers to three related questions (What, Why, How) that emphasizes the central role of conceptual thinking in research design as well as practical relevance. The second step is a predictive validity assessment that anticipates concerns likely to arise in the scholarly review process, and the third is consideration of the likely outcome and the potential problems to be encountered if the proposal is implemented as planned. First-hand accounts of Ph.D. students' experiences using the three-paragraph, three-step approach are presented, along with an exercise that beginners can use to identify, analyze, and anticipate problems and so improve their chances of research success ex ante.


Entropy
2021
Vol 23 (5)
pp. 507
Author(s):  
Piotr Białczak
Wojciech Mazurczyk

Malicious software uses the HTTP protocol for communication, creating network traffic that is hard to identify because it blends into the traffic generated by benign applications. To this end, fingerprinting tools have been developed to help track and identify such traffic by providing a short representation of malicious HTTP requests. However, existing tools either do not analyze all of the information included in the HTTP message or analyze it insufficiently. To address these issues, we propose Hfinger, a novel tool for fingerprinting malware HTTP requests. It extracts information from parts of the request such as the URI, protocol information, headers, and payload, providing a concise request representation that preserves the extracted information in a form interpretable by a human analyst. We have performed an extensive experimental evaluation of the developed solution using real-world data sets and compared Hfinger with the most closely related and popular existing tools, such as FATT, Mercury, and p0f. The effectiveness analysis reveals that, on average, only 1.85% of requests fingerprinted by Hfinger collide between malware families, which is 8-34 times lower than for the existing tools. Moreover, unlike those tools, Hfinger in its default mode introduces no collisions between malware and benign applications, and achieves this while increasing the number of fingerprints by at most a factor of three. As a result, Hfinger can effectively track and hunt malware by providing more unique fingerprints than the other standard tools.
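
To make the idea of a concise, human-interpretable request representation concrete, here is a toy fingerprint function in the same spirit; it is deliberately much simpler than Hfinger's actual format, and every feature choice below is an assumption for illustration.

```python
# Toy request fingerprint in the spirit described above (not Hfinger's format).
import hashlib

def fingerprint(method, uri, headers, payload=b""):
    """Join a few structural features into a short, human-readable string:
    method | URI shape (length:slashes:has-query) | header-order hash | payload hash."""
    uri_shape = f"{len(uri)}:{uri.count('/')}:{int('?' in uri)}"
    header_order = ",".join(name.lower() for name, _ in headers)
    header_hash = hashlib.sha1(header_order.encode()).hexdigest()[:8]
    payload_hash = hashlib.sha1(payload).hexdigest()[:8] if payload else "-"
    return "|".join([method, uri_shape, header_hash, payload_hash])

req_headers = [("Host", "example.com"), ("User-Agent", "Mozilla/5.0"),
               ("Accept", "*/*")]
print(fingerprint("GET", "/update?id=7", req_headers))   # e.g. GET|12:1:1|...|-
```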

