distributed data mining
Recently Published Documents


TOTAL DOCUMENTS

264
(FIVE YEARS 21)

H-INDEX

20
(FIVE YEARS 1)

Author(s):  
Musavir Hassan ◽  
Muheet Ahmed Butt ◽  
Majid Zaman

The widespread proliferation of electronic data maintained in Electronic Health Records (EHRs) generates voluminous amounts of data. Medical health facilities have great potential to discern patterns in this data and use them to diagnose a specific disease or predict the outbreak of an epidemic. However, discerning these patterns might reveal sensitive information about individuals, and this information is vulnerable to misuse. Sharing such sensitive data is therefore a challenging task, as it compromises the privacy of patients. In this paper, a random forest-based distributed data mining approach is proposed. Performance of the proposed model is evaluated using accuracy, F-measure, and kappa statistics. Experimental results reveal that the proposed model is efficient and scalable in both performance and accuracy on imbalanced data, and that it maintains privacy by sharing only useful healthcare knowledge in the form of local models, without revealing or sharing sensitive data.
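The three evaluation measures named in this abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function names and the toy labels are assumptions made for illustration, and the kappa here is Cohen's kappa for two label sequences.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f_measure(y_true, y_pred, positive=1):
    """F1 score for the positive class: harmonic mean of precision and recall."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_expected == 1.0:
        return 0.0  # kappa is undefined here; this sketch returns 0.0 by convention
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical imbalanced example: one positive case, classifier always predicts 0.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0] * 10
print(accuracy(y_true, y_pred), f_measure(y_true, y_pred), cohen_kappa(y_true, y_pred))
```

On the toy imbalanced labels, accuracy looks high while F-measure and kappa are both zero, which is one reason to report all three on imbalanced data.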


Data Science ◽  
2021 ◽  
Vol 4 (2) ◽  
pp. 121-150
Author(s):  
Chang Sun ◽  
Lianne Ippel ◽  
Andre Dekker ◽  
Michel Dumontier ◽  
Johan van Soest

Combining and analysing sensitive data from multiple sources offers considerable potential for knowledge discovery. However, there are a number of issues that pose problems for such analyses, including technical barriers, privacy restrictions, security concerns, and trust issues. Privacy-preserving distributed data mining techniques (PPDDM) aim to overcome these challenges by extracting knowledge from partitioned data while minimizing the release of sensitive information. This paper reports the results and findings of a systematic review of PPDDM techniques from 231 scientific articles published in the past 20 years. We summarize the state of the art, compare the problems they address, and identify the outstanding challenges in the field. This review identifies the consequence of the lack of standard criteria to evaluate new PPDDM methods and proposes comprehensive evaluation criteria with 10 key factors. We discuss the ambiguous definitions of privacy and confusion between privacy and security in the field, and provide suggestions of how to make a clear and applicable privacy description for new PPDDM techniques. The findings from our review enhance the understanding of the challenges of applying theoretical PPDDM methods to real-life use cases, and the importance of involving legal-ethical and social experts in implementing PPDDM methods. This comprehensive review will serve as a helpful guide to past research and future opportunities in the area of PPDDM.


Author(s):  
Waleed A. Mohammad ◽  
Hajar Maseeh Yasin ◽  
Azar Abid Salih ◽  
Adel AL-Zebari ◽  
Naaman Omar ◽  
...  

Distributed systems, which may be used to perform computations, are being developed as a result of the fast growth of resource sharing. Data mining, which has a huge range of real applications, provides significant techniques for extracting meaningful and usable information from massive amounts of data. Traditional data mining methods, on the other hand, assume that the data is gathered centrally, stored in memory, and static. Managing massive amounts of data and processing them with limited resources is difficult. Large volumes of data, for instance, are swiftly generated and stored in many locations, and it becomes increasingly costly to centralize them at a single location. Furthermore, traditional data mining methods typically have several issues and limitations, such as memory restrictions, limited processing ability, and insufficient hard drive space, among others. To overcome these issues, distributed data mining has emerged as a beneficial option in several applications. Drawing on several authors, this research provides a study of state-of-the-art distributed data mining methods, such as distributed common item-set mining, distributed frequent sequence mining, technical difficulties with distributed systems, distributed clustering, as well as privacy-preserving distributed data mining. Furthermore, each work is evaluated and compared to the others.
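As a rough illustration of why item-set mining does not require centralizing the data, the following Python sketch counts item-sets at each site and merges only the counts. It is a naive count-exchange scheme written for this listing, not an algorithm taken from the surveyed works; `local_counts`, `global_frequent`, and the toy transactions are hypothetical.

```python
from collections import Counter
from itertools import combinations

def local_counts(transactions, size):
    """Count every item-set of the given size within one site's transactions."""
    counts = Counter()
    for transaction in transactions:
        for itemset in combinations(sorted(set(transaction)), size):
            counts[itemset] += 1
    return counts

def global_frequent(sites, size, min_support):
    """Merge per-site counts and keep item-sets that meet the global support threshold."""
    total = Counter()
    n_transactions = 0
    for transactions in sites:
        # Only the counts leave a site, never the raw transactions.
        total.update(local_counts(transactions, size))
        n_transactions += len(transactions)
    return {s: c for s, c in total.items() if c / n_transactions >= min_support}

# Two hypothetical sites, each holding its own transactions.
sites = [
    [["a", "b"], ["a", "c"]],
    [["a", "b", "c"], ["b"]],
]
print(global_frequent(sites, size=2, min_support=0.5))
```

Exchanging every candidate count is simple but communication-heavy, and raw counts can still leak information, so this should be read as a starting point only.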


Author(s):  
Xianwen Sun ◽  
Ruzhi Xu ◽  
Longfei Wu ◽  
Zhitao Guan

A wide range of data mining applications benefit from the low latency offered by edge computing. However, edge computing suffers from limited computing resources, which inhibits the application of computationally expensive data mining methods. In the edge-cloud environment, participants usually turn to collaboratively training machine-learning models that yield more accurate prediction results. However, data owners may not be willing to share their own data because of privacy concerns. To reconcile these disparate goals, we focus on a tree-based distributed data mining scheme with differential privacy, which is computationally friendly. The basic idea of our approach is a distributed ensemble strategy. Each participant builds a decision model on their own data that offers a good tradeoff between computation and accuracy with respect to the data distribution, injects carefully calibrated noise into it, and shares it with the other participants. The other participants then acquire the useful knowledge transferred from these decision models through an adaptive ensemble strategy. Both the theoretical analysis and the experiments show that our scheme provides an efficient way of mining data that achieves good prediction accuracy while providing a rigorous privacy guarantee over the distributed data.
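The pattern described here, a local model perturbed with noise and then combined in an ensemble, can be sketched in Python under strong simplifying assumptions. This is not the authors' scheme: each "model" is reduced to the class counts of a single leaf, the noise is the standard Laplace mechanism with scale 1/epsilon (a count changes by at most 1 when one record is added or removed), and the ensemble is a plain sum rather than an adaptive strategy. All function names are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_leaf_counts(labels, classes, epsilon, rng):
    """Per-class counts for one leaf, each perturbed with Laplace noise of scale 1/epsilon."""
    return {c: labels.count(c) + laplace_noise(1 / epsilon, rng) for c in classes}

def ensemble_predict(shared_counts):
    """Sum the noisy counts from all participants and return the class with the largest total."""
    totals = {}
    for counts in shared_counts:
        for c, value in counts.items():
            totals[c] = totals.get(c, 0.0) + value
    return max(totals, key=totals.get)

# Three hypothetical participants, each holding the labels that fall in the same leaf.
rng = random.Random(0)
local_labels = [[1, 1, 0], [1, 0, 0], [1, 1, 1]]
shared = [noisy_leaf_counts(labels, [0, 1], epsilon=1.0, rng=rng) for labels in local_labels]
print(ensemble_predict(shared))
```

Smaller values of `epsilon` add more noise and give stronger privacy at the cost of accuracy; only the noisy counts are shared, never the labels themselves.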


IEEE Access ◽  
2021 ◽  
pp. 1-1
Author(s):  
Mais Haj Qasem ◽  
Nadim Obeid ◽  
Amjad Hudaib ◽  
Mohammed Amin Almaiah ◽  
Ali Al-Zahrani ◽  
...  

2020 ◽  
Vol 25 (3) ◽  
pp. 39
Author(s):  
David Martínez-Galicia ◽  
Alejandro Guerra-Hernández ◽  
Nicandro Cruz-Ramírez ◽  
Xavier Limón ◽  
Francisco Grimaldo

Windowing is a sub-sampling method, originally proposed to cope with large datasets when inducing decision trees with the ID3 and C4.5 algorithms. The method exhibits a strong negative correlation between the accuracy of the learned models and the number of examples used to induce them, i.e., the higher the accuracy of the obtained model, the fewer examples used to induce it. This paper contributes to a better understanding of this behavior in order to promote windowing as a sub-sampling method for Distributed Data Mining. For this, the generalization of the behavior of windowing beyond decision trees is established, by corroborating the observed negative correlation when adopting inductive algorithms of different nature. Then, focusing on decision trees, the windows (samples) and the obtained models are analyzed in terms of Minimum Description Length (MDL), Area Under the ROC Curve (AUC), Kullback–Leibler divergence, and the similitude metric Sim1; and compared to those obtained when using traditional methods: random, balanced, and stratified samplings. It is shown that the aggressive sampling performed by windowing, up to 3% of the original dataset, induces models that are significantly more accurate than those obtained from the traditional sampling methods, among which only the balanced sampling is comparable in terms of AUC. Although the considered informational properties did not correlate with the obtained accuracy, they provide clues about the behavior of windowing and suggest further experiments to enhance such understanding and the performance of the method, i.e., studying the evolution of the windows over time.
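For readers unfamiliar with windowing, the basic loop can be sketched in Python: induce a model from a small window, test it on the remaining examples, add the misclassified ones to the window, and repeat until none remain. This is a simplified illustration, not the implementation studied in the paper. The 1-D threshold learner `induce_stump` is a hypothetical stand-in for ID3 or C4.5, and the data is synthetic.

```python
import random

def induce_stump(window):
    """Toy learner: choose the threshold on a 1-D feature with the fewest errors on the window."""
    candidates = sorted({x for x, _ in window}) + [float("inf")]
    best_t, best_err = None, None
    for t in candidates:
        err = sum((x >= t) != y for x, y in window)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return lambda x: x >= best_t

def windowing(data, initial_size, rng, max_rounds=50):
    """Grow the window with examples the current model misclassifies until none remain."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    window, rest = shuffled[:initial_size], shuffled[initial_size:]
    model = induce_stump(window)
    for _ in range(max_rounds):
        wrong = [ex for ex in rest if model(ex[0]) != ex[1]]
        if not wrong:
            break  # the model is consistent with every example outside the window
        window += wrong
        rest = [ex for ex in rest if model(ex[0]) == ex[1]]
        model = induce_stump(window)
    return model, window

# Synthetic data: the label is True exactly when the feature is at least 50.
data = [(x, x >= 50) for x in range(100)]
model, window = windowing(data, initial_size=10, rng=random.Random(0))
print(len(window), all(model(x) == y for x, y in data))
```

The final window is a subset of the data that was sufficient to induce a model consistent with all of it, which is the sub-sampling effect the abstract refers to.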


2020 ◽  
Vol 40 ◽  
pp. 101097 ◽  
Author(s):  
Sachi Nandan Mohanty ◽  
E. Laxmi Lydia ◽  
Mohamed Elhoseny ◽  
Majid M. Gethami Al Otaibi ◽  
K. Shankar
