An Ensemble Random Forest Algorithm for Privacy Preserving Distributed Medical Data Mining

A voluminous amount of data is generated by the widespread proliferation of electronic data maintained in Electronic Health Records (EHRs). Medical facilities have great potential to discern patterns in this data and use them to diagnose a specific disease or predict the outbreak of an epidemic. Discerning these patterns might, however, reveal sensitive information about individuals, and this information is vulnerable to misuse. Sharing such sensitive data is therefore a challenging task, as it compromises the privacy of patients. In this paper, a random forest-based distributed data mining approach is proposed. Performance of the proposed model is evaluated using accuracy, F-measure, and kappa statistics. Experimental results reveal that the proposed model is efficient and scalable in both performance and accuracy on imbalanced data, and that it maintains privacy by sharing only useful healthcare knowledge in the form of local models, without revealing or sharing sensitive data.
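
The abstract does not give implementation details, so the following is only a minimal sketch of the general idea: sites share local models rather than records, a coordinator combines their predictions by majority vote, and the result is scored with accuracy, F-measure, and Cohen's kappa. The names `majority_vote` and `evaluate`, the three threshold rules standing in for locally trained trees, and the sample data are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence

# Hypothetical type: a local model maps a feature vector to a 0/1 label.
LocalModel = Callable[[Sequence[float]], int]


def majority_vote(models: List[LocalModel], x: Sequence[float]) -> int:
    """Combine the local models' predictions for one record by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


def evaluate(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, F-measure (F1) and Cohen's kappa for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = len(pairs)

    accuracy = (tp + tn) / n
    f_denominator = 2 * tp + fp + fn
    f_measure = 2 * tp / f_denominator if f_denominator else 0.0

    # Agreement expected by chance, from the marginal totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - expected) / (1 - expected) if expected != 1 else 0.0
    return {"accuracy": accuracy, "f_measure": f_measure, "kappa": kappa}


# Assumption: simple threshold rules stand in for models trained at three sites.
site_models: List[LocalModel] = [
    lambda x: int(x[0] > 5.0),
    lambda x: int(x[1] > 2.0),
    lambda x: int(x[0] + x[1] > 8.0),
]

# Made-up evaluation records held by the coordinator.
samples = [(6.0, 3.0), (4.0, 1.0), (7.0, 1.0), (3.0, 3.0), (6.0, 1.5), (2.0, 2.5)]
labels = [1, 0, 1, 0, 0, 1]

predictions = [majority_vote(site_models, x) for x in samples]
metrics = evaluate(labels, predictions)
print(predictions, metrics)
```

Only the model functions cross site boundaries in this sketch; the training records behind them never do. An odd number of models is used so that a binary vote cannot tie.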


Multi-agent distributed data mining approach for classifying meteorology data: case study on Iran’s synoptic weather stations

International Journal of Environmental Science and Technology ◽ 10.1007/s13762-017-1351-x ◽ 2017 ◽ Vol 15 (1) ◽ pp. 149-158 ◽ Cited By ~ 8

Author(s): A. Niazalizadeh Moghadam ◽ R. Ravanmehr

Keyword(s): Data Mining ◽ Distributed Data Mining ◽ Distributed Data ◽ Data Mining Approach ◽ Synoptic Weather ◽ Multi Agent ◽ Weather Stations


A WSRF-enabled distributed data mining approach to association rules WEKA4WS-based

2010 IEEE 2nd Symposium on Web Society ◽ 10.1109/sws.2010.5607452 ◽ 2010

Author(s): Zheng Shi-ming ◽ Yang Jun-Qiang ◽ Song Zi-ling ◽ Miao Zhuang

Keyword(s): Data Mining ◽ Association Rules ◽ Distributed Data Mining ◽ Distributed Data ◽ Data Mining Approach


Distributed Data Mining

Encyclopedia of Data Warehousing and Mining, Second Edition ◽ 10.4018/978-1-60566-010-3.ch110 ◽ 2011 ◽ pp. 709-715 ◽ Cited By ~ 4

Author(s): Grigorios Tsoumakas

Keyword(s): Data Mining ◽ Data Warehouse ◽ Computational Cost ◽ Distributed Data Mining ◽ The Internet ◽ Distributed Data ◽ Distributed Environments ◽ Sensitive Data ◽ Huge Data ◽ Central Storage

The continuous developments in information and communication technology have recently led to the appearance of distributed computing environments, which comprise several different sources of large volumes of data and several computing units. The most prominent example of a distributed environment is the Internet, where more and more databases and data streams appear, covering areas such as meteorology, oceanography, and the economy. In addition, the Internet constitutes the communication medium for geographically distributed information systems, for example NASA's Earth Observing System (eos.gsfc.nasa.gov). Other examples of distributed environments developed in the last few years are sensor networks for process monitoring and grids, where a large number of computing and storage units are interconnected over a high-speed network. Applying the classical knowledge discovery process in distributed environments requires collecting the distributed data in a data warehouse for central processing. However, this is usually either ineffective or infeasible for the following reasons:

(1) Storage cost. The requirements of a central storage system are enormous. A classical example concerns data from astronomy, especially images from earth and space telescopes. The size of such databases is reaching the scale of exabytes (10^18 bytes) and is increasing at a high pace. Central storage of the data of all telescopes on the planet would require a huge data warehouse of enormous cost.

(2) Communication cost. Transferring huge data volumes over a network might take an extremely long time and incur an unbearable financial cost. Even a small volume of data might create problems in wireless network environments with limited bandwidth. Note also that communication may be a continuous overhead, as distributed databases are not always constant and unchangeable. On the contrary, it is common to have databases that are frequently updated with new data, or data streams that constantly record information (e.g., remote sensing, sports statistics).

(3) Computational cost. The computational cost of mining a central data warehouse is much larger than the sum of the costs of analyzing smaller parts of the data, which could also be done in parallel. In a grid, for example, it is easier to gather the data at a central location. However, a distributed mining approach would make better use of the available resources.

(4) Private and sensitive data. Many popular data mining applications deal with sensitive data, such as people's medical and financial records. Central collection of such data is not desirable, as it puts their privacy at risk. In certain cases (e.g., banking, telecommunication) the data might belong to different, perhaps competing, organizations that want to exchange knowledge without exchanging raw private data.

This article is concerned with Distributed Data Mining algorithms, methods, and systems that deal with the above issues in order to discover knowledge from distributed data in an effective and efficient way.
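
The reasons above can be illustrated with a small sketch in which each site reduces its data to a compact summary and only the summaries travel to a coordinator. This is not an algorithm from the article; the names `LocalSummary`, `summarize`, and `combine`, and the sample values, are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class LocalSummary:
    """Sufficient statistics computed at one site; raw values never leave it."""
    count: int
    total: float
    total_sq: float


def summarize(values: Iterable[float]) -> LocalSummary:
    """Run locally at each site, in a single pass over that site's data."""
    count, total, total_sq = 0, 0.0, 0.0
    for v in values:
        count += 1
        total += v
        total_sq += v * v
    return LocalSummary(count, total, total_sq)


def combine(summaries: List[LocalSummary]) -> Tuple[float, float]:
    """Run at the coordinator: global mean and population variance."""
    n = sum(s.count for s in summaries)
    total = sum(s.total for s in summaries)
    total_sq = sum(s.total_sq for s in summaries)
    mean = total / n
    variance = total_sq / n - mean * mean
    return mean, variance


# Made-up measurements held at three separate sites.
site_data = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
summaries = [summarize(values) for values in site_data]  # could run in parallel
global_mean, global_variance = combine(summaries)
print(global_mean, global_variance)
```

Each site sends three numbers regardless of how many records it holds, which addresses the storage and communication costs, and the per-site passes are independent of one another. Summaries like these reduce what is disclosed but are not a privacy guarantee on their own.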


Collective dendrogram clustering with collaborative filtering for Distributed Data Mining on electronic health records

2017 Second International Conference on Electrical, Computer and Communication Technologies (ICECCT) ◽ 10.1109/icecct.2017.8117876 ◽ 2017

Author(s): S. Urmela ◽ M. Nandhini

Keyword(s): Data Mining ◽ Electronic Health Records ◽ Collaborative Filtering ◽ Distributed Data Mining ◽ Distributed Data ◽ Health Records ◽ Electronic Health


Notice of Retraction: A WSRF-enabled distributed data mining approach to clustering WEKA4WS-based

2010 IEEE 2nd Symposium on Web Society ◽ 10.1109/sws.2010.5607449 ◽ 2010 ◽ Cited By ~ 1

Author(s): Ren Zai-an ◽ Wang Bin ◽ Zheng Shi-ming ◽ Miao Zhuang ◽ Shao Rong-ming

Keyword(s): Data Mining ◽ Distributed Data Mining ◽ Distributed Data ◽ Data Mining Approach


Misusability Measure Based Sanitization of Big Data for Privacy Preserving MapReduce Programming

International Journal of Electrical and Computer Engineering (IJECE) ◽ 10.11591/ijece.v8i6.pp4524-4532 ◽ 2018 ◽ Vol 8 (6) ◽ pp. 4524

Author(s): D. Radhika ◽ D. Aruna Kumari

Keyword(s): Data Mining ◽ Big Data ◽ Data Privacy ◽ Hybrid Approach ◽ Privacy Preserving ◽ Data Publishing ◽ Distributed Data Mining ◽ Distributed Data ◽ Public Cloud ◽ Sensitive Data

Leakage and misuse of sensitive data is a challenging problem for enterprises. It has become a more serious problem with the advent of cloud and big data, because more data is outsourced to public clouds and published for wider visibility. Therefore, Privacy Preserving Data Publishing (PPDP), Privacy Preserving Data Mining (PPDM), and Privacy Preserving Distributed Data Mining (PPDDM) are crucial in the contemporary era. PPDP and PPDM can protect privacy at the data and process levels, respectively. With big data, data privacy has become indispensable because data is stored and processed in semi-trusted environments. In this paper, we propose a comprehensive methodology for effective sanitization of data, based on a misusability measure, that preserves privacy and prevents data leakage and misuse. We follow a hybrid approach that caters to the needs of privacy preserving MapReduce programming. We propose an algorithm known as the Misusability Measure-Based Privacy Preserving Algorithm (MMPP), which considers the level of misusability before choosing and applying appropriate sanitization to big data. Our empirical study with Amazon EC2 and EMR revealed that the proposed methodology is useful in realizing privacy preserving MapReduce programming.


HDSM: A distributed data mining approach to classifying vertically distributed data streams

Knowledge-Based Systems ◽ 10.1016/j.knosys.2019.105114 ◽ 2020 ◽ Vol 189 ◽ pp. 105114

Author(s): Benjamin Denham ◽ Russel Pears ◽ M. Asif Naeem

Keyword(s): Data Mining ◽ Data Streams ◽ Distributed Data Mining ◽ Distributed Data ◽ Data Mining Approach ◽ Distributed Data Streams


A systematic review on privacy-preserving distributed data mining

Data Science ◽ 10.3233/ds-210036 ◽ 2021 ◽ Vol 4 (2) ◽ pp. 121-150

Author(s): Chang Sun ◽ Lianne Ippel ◽ Andre Dekker ◽ Michel Dumontier ◽ Johan van Soest

Keyword(s): Systematic Review ◽ Data Mining ◽ Real Life ◽ Past Research ◽ Privacy Preserving ◽ Distributed Data Mining ◽ Sensitive Information ◽ Distributed Data ◽ Multiple Sources ◽ Privacy And Security

Combining and analysing sensitive data from multiple sources offers considerable potential for knowledge discovery. However, there are a number of issues that pose problems for such analyses, including technical barriers, privacy restrictions, security concerns, and trust issues. Privacy-preserving distributed data mining techniques (PPDDM) aim to overcome these challenges by extracting knowledge from partitioned data while minimizing the release of sensitive information. This paper reports the results and findings of a systematic review of PPDDM techniques from 231 scientific articles published in the past 20 years. We summarize the state of the art, compare the problems they address, and identify the outstanding challenges in the field. This review identifies the consequence of the lack of standard criteria to evaluate new PPDDM methods and proposes comprehensive evaluation criteria with 10 key factors. We discuss the ambiguous definitions of privacy and confusion between privacy and security in the field, and provide suggestions of how to make a clear and applicable privacy description for new PPDDM techniques. The findings from our review enhance the understanding of the challenges of applying theoretical PPDDM methods to real-life use cases, and the importance of involving legal-ethical and social experts in implementing PPDDM methods. This comprehensive review will serve as a helpful guide to past research and future opportunities in the area of PPDDM.
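
To make "extracting knowledge from partitioned data while minimizing the release of sensitive information" concrete, the sketch below shows one widely described building block, a secure sum based on additive secret sharing. It is an illustration only and is not taken from the review; the names `make_shares` and `secure_sum` and the choice of `MODULUS` are assumptions. All parties are simulated in one process, with no networking, and are assumed to follow the protocol honestly.

```python
import secrets
from typing import List

# Assumption: the modulus is far larger than any true total being computed.
MODULUS = 2**61 - 1


def make_shares(value: int, n_parties: int, modulus: int = MODULUS) -> List[int]:
    """Split a private value into random shares that sum to it modulo `modulus`."""
    shares = [secrets.randbelow(modulus) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % modulus)
    return shares


def secure_sum(private_values: List[int], modulus: int = MODULUS) -> int:
    """Simulate the protocol: only shares and partial sums are ever exchanged."""
    n = len(private_values)
    # Row i holds the shares that party i sends out, one per recipient.
    share_matrix = [make_shares(v, n, modulus) for v in private_values]
    # Party j adds up the shares it received and publishes only that partial sum.
    partial_sums = [
        sum(share_matrix[i][j] for i in range(n)) % modulus for j in range(n)
    ]
    # The partial sums add up to the true total modulo `modulus`.
    return sum(partial_sums) % modulus


# Made-up private counts held by three organizations.
site_counts = [12, 30, 7]
total = secure_sum(site_counts)
print(total)
```

Each individual share is a uniformly random number, so a single share reveals nothing about the value it came from; only the final total is learned. Parties that pool the shares they received could still narrow down another party's value, which is one reason a clear statement of the privacy assumptions matters.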
