International Journal of Data Warehousing and Mining
Latest Publications


TOTAL DOCUMENTS

292
(FIVE YEARS 64)

H-INDEX

21
(FIVE YEARS 3)

Published by IGI Global

1548-3932, 1548-3924

2022 ◽  
Vol 18 (1) ◽  
pp. 1-17
Author(s):  
Sarah Nait Bahloul ◽  
Oussama Abderrahim ◽  
Aya Ichrak Benhadj Amar ◽  
Mohammed Yacine Bouhedadja

The classification of data streams has become a significant and active research area. The principal characteristics of data streams are the large amount of arriving data, the high speed and rate of its arrival, and the change in its nature and distribution over time. The Hoeffding Tree is a method for incrementally building decision trees. Since its introduction, it has become one of the most popular tools for data stream classification, and several improvements have since emerged. The Hoeffding Anytime Tree was recently introduced and is considered one of the most promising of these algorithms: it offers higher accuracy than the Hoeffding Tree in most scenarios, at a small additional computational cost. In this work, the authors contribute three improvements to the Hoeffding Anytime Tree, tested on known benchmark datasets. The experimental results show that two of the proposed variants make better use of the Hoeffding Anytime Tree's properties: they learn faster while providing the same desired accuracy.
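Split decisions in both the Hoeffding Tree and the Hoeffding Anytime Tree rest on the Hoeffding bound. A minimal sketch of that bound and of a typical split test follows; the function names and the tie-threshold default are illustrative, not the paper's implementation:

```python
import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Epsilon such that, with probability 1 - delta, the true mean of a
    variable with range `value_range` lies within epsilon of the mean
    observed over n samples."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_gain: float, second_gain: float,
                 value_range: float, delta: float, n: int,
                 tie_threshold: float = 0.05) -> bool:
    """Split when the best attribute's gain beats the runner-up by more
    than the bound, or when the bound is small enough to call a tie."""
    eps = hoeffding_bound(value_range, delta, n)
    return (best_gain - second_gain) > eps or eps < tie_threshold
```

As more examples arrive, `n` grows, the bound shrinks, and the tree becomes confident enough to commit to a split; the Anytime variant additionally revisits such decisions later.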


2022 ◽  
Vol 18 (1) ◽  
pp. 0-0

Social media data have become an integral part of business data and should be integrated into the decision-making process, so that decisions rest on information that better reflects the true situation of the business in any field. However, social media data are unstructured and generated at a very high frequency, which exceeds the capacity of the data warehouse. In this work, we propose to extend the data warehousing process with a staging area whose core is a large-scale system, built on the Storm and Hadoop frameworks, that implements an information extraction process to better manage the volume and frequency of these data. For structured information extraction, mainly of events, we combine a set of techniques from NLP, linguistic rules, and machine learning. Finally, we propose an adequate data warehouse conceptual model for modeling events and integrating them with the enterprise data warehouse through an intermediate table called a bridge table. For application and experiments, we focus on extracting drug abuse events from Twitter data and modeling them in the event data warehouse.
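The bridge-table idea can be pictured with a small, hypothetical sketch (the paper's actual schema is not reproduced here): extracted events live in an event fact table, and a bridge table maps a group key, referenced by the enterprise fact rows, to the set of events behind it.

```python
from dataclasses import dataclass

# Hypothetical, simplified schema; the paper's bridge-table design may differ.
@dataclass(frozen=True)
class EventFact:
    event_id: int
    event_type: str      # e.g. "drug_abuse"
    tweet_id: str

@dataclass(frozen=True)
class BridgeRow:
    group_id: int        # shared key referenced by an enterprise fact row
    event_id: int

def events_for_group(group_id, bridge, events):
    """Resolve the events attached to one enterprise fact via the bridge."""
    ids = {b.event_id for b in bridge if b.group_id == group_id}
    return [e for e in events if e.event_id in ids]
```

The indirection lets one fact row reference many events without duplicating the fact, which is the usual motivation for bridge tables in dimensional modeling.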


2022 ◽  
Vol 18 (1) ◽  
pp. 0-0

Recent years have witnessed the rise of group recommender systems (GRSs) in e-commerce and tourism applications such as Booking.com, Traveloka.com, and Amazon. One of the main concerns in GRSs is guaranteeing fairness between the users in a group, the so-called consensus-driven group recommender system. This paper proposes a new, flexible alternative that embeds a fuzzy measure into the aggregation operators of the consensus process to improve the fairness of group recommendations and to deal with group member interactions. The Choquet integral is used to build a fuzzy measure based on group member interactions and to seek a fairer recommendation. Empirical results on benchmark datasets show the incremental advances of the proposal in dealing with group member interactions and with the issue of fairness in consensus-driven GRSs.
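The discrete Choquet integral aggregates member scores with respect to a fuzzy measure, so interacting members need not contribute additively. A minimal sketch follows; the measure values are illustrative, not the paper's learned measure:

```python
def choquet(scores: dict, mu: dict) -> float:
    """Discrete Choquet integral of `scores` (member -> rating in [0, 1])
    w.r.t. fuzzy measure `mu` (frozenset of members -> weight), where
    mu[frozenset()] == 0 and mu over the full member set == 1."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1])  # ascending
    remaining = set(scores)
    total, prev = 0.0, 0.0
    for member, x in ordered:
        total += (x - prev) * mu[frozenset(remaining)]
        prev = x
        remaining.remove(member)
    return total

# Illustrative measure for two interacting members a and b:
mu = {frozenset(): 0.0, frozenset({"a"}): 0.5,
      frozenset({"b"}): 0.6, frozenset({"a", "b"}): 1.0}
```

With an additive measure the integral collapses to a weighted mean; the non-additive weights are what let the aggregation reward (or penalize) agreement between interacting members.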


2021 ◽  
Vol 17 (4) ◽  
pp. 1-28
Author(s):  
Waqas Ahmed ◽  
Esteban Zimányi ◽  
Alejandro A. Vaisman ◽  
Robert Wrembel

Data warehouses (DWs) evolve in both their content and their schema due to changes in user requirements, business processes, or external sources, to name a few. Although multiple approaches using temporal and/or multiversion DWs have been proposed to handle these changes, an efficient solution to the problem is still lacking. The authors' approach is to separate concerns: temporal DWs deal with content changes, and multiversion DWs deal with schema changes. To address the former, they previously proposed a temporal multidimensional (MD) model. In this paper, they propose a multiversion MD model for schema evolution to tackle the latter problem. The two models complement each other, allowing both content and schema evolution to be managed. The paper gives the semantics of the schema modification operators (SMOs) used to derive the various schema versions and shows how online analytical processing (OLAP) operations like roll-up work on the model. Finally, the mapping from the multiversion MD model to a relational schema is given, along with the OLAP operations in standard SQL.
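As an intuition for the roll-up operation discussed above, here is a generic sketch (not the paper's SQL mapping): facts stored at the (store, month) level are rolled up to the month level by keeping only the coarser dimension and re-aggregating the measure.

```python
from collections import defaultdict

def roll_up(facts, keep_dims, measure):
    """Re-aggregate `measure` over the coarser dimension levels `keep_dims`."""
    out = defaultdict(float)
    for row in facts:
        out[tuple(row[d] for d in keep_dims)] += row[measure]
    return dict(out)

# Toy fact table at the (store, month) granularity.
facts = [
    {"store": "S1", "month": "2021-01", "amount": 100.0},
    {"store": "S2", "month": "2021-01", "amount": 50.0},
    {"store": "S1", "month": "2021-02", "amount": 70.0},
]
```

In a multiversion setting the same operation must additionally pick the schema version under which each fact was recorded, which is what the paper's SMO semantics make precise.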


2021 ◽  
Vol 17 (4) ◽  
pp. 101-118
Author(s):  
Nandhini Abirami ◽  
Durai Raj Vincent ◽  
Seifedine Kadry

Early and automatic segmentation of lung infections from computed tomography images of COVID-19 patients is crucial for timely quarantine and effective treatment. However, automating the segmentation of lung infections from CT slices is challenging due to a lack of contrast between normal and infected tissues. A CNN- and GAN-based framework is presented to classify and then segment lung infections automatically from COVID-19 lung CT slices. In this work, the authors propose a novel method named P2P-COVID-SEG to automatically classify COVID-19 and normal CT images and then segment COVID-19 lung infections from CT images using a GAN. The proposed model outperformed existing classification models with an accuracy of 98.10%. The segmentation results outperformed existing methods, delineating infections with accurate boundaries; the Dice coefficient achieved using GAN segmentation is 81.11%, demonstrating state-of-the-art performance.
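The reported Dice coefficient measures the overlap between the predicted and reference infection masks. A minimal sketch on flat binary masks:

```python
def dice(pred, truth):
    """Dice similarity of two binary masks given as flat 0/1 sequences:
    2|P ∩ T| / (|P| + |T|), with 1.0 for two empty masks by convention."""
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(pred) + sum(truth)
    return 2.0 * inter / total if total else 1.0
```

A score of 81.11% therefore means the predicted infection region shares a little over four-fifths of its (size-normalized) area with the ground-truth annotation.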


2021 ◽  
Vol 17 (4) ◽  
pp. 29-47
Author(s):  
Bruno Oliveira ◽  
Óscar Oliveira ◽  
Orlando Belo

Considering extract-transform-load (ETL) as a complex and evolutionary process, development teams must conscientiously and rigorously design logging strategies to extract the most value from the information that can be gathered from the events occurring throughout the ETL workflow. Efficient logging strategies must be structured so that metrics, logs, and alerts can, beyond their troubleshooting capabilities, provide insights about the system. This paper presents a configurable and flexible ETL component for creating logging mechanisms in ETL workflows. A pattern-oriented approach is followed to abstract ETL activities and enable their mapping to physical primitives that can be interpreted by commercial ETL tools.
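One common way to realize such uniform, pattern-oriented logging, sketched here in Python (the paper targets commercial ETL tools, so the names and structure are illustrative only), is to wrap each ETL activity so that start/end events, durations, and row counts are emitted the same way for every step:

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl")

def logged_activity(name):
    """Wrap an ETL activity (rows in -> rows out) with uniform log events."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(rows):
            start = time.perf_counter()
            log.info("%s: started with %d rows", name, len(rows))
            out = fn(rows)
            log.info("%s: finished in %.3fs with %d rows",
                     name, time.perf_counter() - start, len(out))
            return out
        return wrapper
    return decorate

@logged_activity("filter_nulls")
def filter_nulls(rows):
    """Example activity: drop null records."""
    return [r for r in rows if r is not None]
```

Because every activity emits the same event shape, downstream metrics and alerts can be derived from the logs without per-step parsing, which is the insight-beyond-troubleshooting goal the abstract describes.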


2021 ◽  
Vol 17 (4) ◽  
pp. 67-100
Author(s):  
Thang Truong Nguyen ◽  
Nguyen Long Giang ◽  
Dai Thanh Tran ◽  
Trung Tuan Nguyen ◽  
Huy Quang Nguyen ◽  
...  

Attribute reduction from decision tables is one of the crucial topics in data mining. The problem is NP-hard, and many approximation algorithms based on filter or filter-wrapper approaches have been designed to find the reducts. The intuitionistic fuzzy set (IFS) has been regarded as an effective tool for such problems, since it attaches two degrees, membership and non-membership, to each data element. Separating attributes into these two counterparts, as in the IFS, can increase classification quality and shrink the reducts. Motivated by this, this paper proposes a new filter-wrapper algorithm based on the IFS for attribute reduction from decision tables. The contributions include a new intuitionistic fuzzy distance between partitions, accompanied by theoretical analysis. The filter-wrapper algorithm is designed based on that distance, with a new stopping condition based on the concept of delta-equality. Experiments are conducted on benchmark datasets from the UCI machine learning repository.
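The paper's distance between partitions is not reproduced here, but the element-wise intuitionistic fuzzy distance such constructions typically build on can be sketched in the usual normalized Hamming form, where each element carries a membership degree, a non-membership degree, and the implied hesitancy:

```python
def ifs_distance(a, b):
    """Normalized Hamming-style distance between two intuitionistic fuzzy
    sets given as equal-length lists of (membership, non_membership) pairs."""
    total = 0.0
    for (m1, v1), (m2, v2) in zip(a, b):
        h1, h2 = 1.0 - m1 - v1, 1.0 - m2 - v2   # hesitancy degrees
        total += abs(m1 - m2) + abs(v1 - v2) + abs(h1 - h2)
    return total / (2.0 * len(a))
```

The distance is 0 for identical sets and 1 for fully opposed ones, which is the [0, 1] scale a delta-equality stopping condition would be stated on.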


2021 ◽  
Vol 17 (4) ◽  
pp. 48-66
Author(s):  
Han Li ◽  
Zhao Liu ◽  
Ping Zhu

Missing values in industrial data restrict its applications. Although such incomplete data contain enough information for engineers to support subsequent development, there are still too many missing values for algorithms to establish precise models. This is because engineering domain knowledge is not considered and valuable information is not fully captured. Therefore, this article proposes an engineering-domain-knowledge-based framework for modeling incomplete industrial data. The raw datasets are partitioned and processed at different scales. First, hierarchical features are combined to decrease the missing ratio. To fill the missing values in the special data used to classify the samples, samples with only part of the features present are fully utilized, instead of being removed, to establish local imputation models. Samples are then divided into different groups to transfer the information. A series of industrial datasets is analyzed to verify the feasibility of the proposed method.
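The group-wise information transfer can be illustrated with a minimal, hypothetical sketch: a missing value is imputed from samples in the same engineering group when possible, falling back to a global statistic. The paper's local models are richer than this mean-based stand-in.

```python
def impute_by_group(rows, group_key, field):
    """Fill missing `field` values (None) with the group mean, falling back
    to the global mean when a group has no observed value."""
    by_group, all_vals = {}, []
    for r in rows:
        v = r[field]
        if v is not None:
            by_group.setdefault(r[group_key], []).append(v)
            all_vals.append(v)
    global_mean = sum(all_vals) / len(all_vals)
    for r in rows:
        if r[field] is None:
            vals = by_group.get(r[group_key])
            r[field] = sum(vals) / len(vals) if vals else global_mean
    return rows
```

Grouping by an engineering-meaningful key (material, production line, etc.) is what injects domain knowledge that a purely global imputer would ignore.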


2021 ◽  
Vol 17 (3) ◽  
pp. 22-43
Author(s):  
Sonali Ashish Chakraborty

Data from multiple sources are loaded into the organization's data warehouse for analysis. Since some OLAP queries are fired quite frequently on the warehouse data, their execution time is reduced by storing the queries and their results in a relational database, referred to as the materialized query database (MQDB). If the tables, fields, functions, and criteria of an input query and a stored query are the same but the criteria values specified in the WHERE or HAVING clause do not match, the queries are considered non-synonymous. In the present research, the results of non-synonymous queries are generated by reusing existing stored results after applying UNION or MINUS operations on them, which reduces the execution time of non-synonymous queries. For input-query criteria values that are a superset of the stored ones, a UNION operation is applied; for subset values, a MINUS operation is applied. Incremental processing of existing stored results, if required, is performed using data marts.
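The UNION/MINUS reuse rule can be sketched abstractly (hypothetical names; the actual system operates on stored SQL results): treat the stored result as a set of rows keyed by the criterion value, union in a fetched delta for superset criteria, and filter rows out for subset criteria.

```python
def reuse_result(stored_rows, stored_vals, input_vals, fetch_delta):
    """stored_rows: set of (criterion_value, payload) tuples already cached.
    Returns the input query's rows via UNION/MINUS reuse, or None when the
    criteria sets overlap only partially and the query must be re-executed."""
    stored_vals, input_vals = set(stored_vals), set(input_vals)
    if input_vals >= stored_vals:                       # superset -> UNION
        return stored_rows | fetch_delta(input_vals - stored_vals)
    if input_vals <= stored_vals:                       # subset -> MINUS
        return {r for r in stored_rows if r[0] in input_vals}
    return None
```

Only the delta (the criteria values not already covered) touches the warehouse, which is where the execution-time saving comes from.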


2021 ◽  
Vol 17 (3) ◽  
pp. 44-67
Author(s):  
Nguyen Truong Thang ◽  
Giang Long Nguyen ◽  
Hoang Viet Long ◽  
Nguyen Anh Tuan ◽  
Tuan Manh Tran ◽  
...  

Attribute reduction is a crucial problem in the process of data mining and knowledge discovery in big data. In incomplete decision systems, models based on the tolerance rough set are fundamental to solving the problem by computing the reduct to reduce execution time. However, earlier proposals used the traditional filter approach, so the reduct was not optimal in the number of attributes or in classification accuracy. The problem is critical in dynamic incomplete decision systems, which are more appropriate for real-world applications. Therefore, this paper proposes two novel incremental algorithms combining the filter and wrapper approaches, namely IFWA_ADO and IFWA_DEO, for dynamic incomplete decision systems. IFWA_ADO computes the reduct incrementally when multiple objects are added, while IFWA_DEO updates the reduct when multiple objects are removed. The algorithms are verified on six datasets. Experimental results show that the filter-wrapper algorithms achieve higher performance than other incremental filter algorithms.
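In incomplete decision systems, the tolerance relation underlying such models treats a missing value as compatible with anything. A minimal sketch of tolerance classes over the conditional attributes (illustrative names, using "*" for a missing value):

```python
MISSING = "*"

def tolerant(x, y, attrs):
    """Two objects tolerate each other when every attribute value matches
    or at least one of the two values is missing."""
    return all(x[a] == y[a] or MISSING in (x[a], y[a]) for a in attrs)

def tolerance_class(i, table, attrs):
    """Indices of all objects tolerant with object i."""
    return {j for j, row in enumerate(table) if tolerant(table[i], row, attrs)}

# Toy incomplete decision table over attributes a and b (decision omitted).
table = [
    {"a": 1, "b": 0},
    {"a": 1, "b": MISSING},
    {"a": 0, "b": 0},
]
```

Incremental algorithms like those proposed here avoid recomputing these classes from scratch when objects are added or removed, updating only the classes the changed objects participate in.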

