Privacy Protection in Enterprise Social Networks Using a Hybrid De-Identification System

2021 · Vol 15 (1) · pp. 138-152
Author(s):
Mohamed Abdou Souidi
Noria Taghezout

Enterprise social networks (ESNs) are widely used within organizations as a communication infrastructure that allows employees to collaborate with each other and share files and documents. The shared documents may contain a large amount of sensitive information that affects the privacy of individuals, such as phone numbers, and must be protected against any kind of disclosure or unauthorized access. In this study, the authors propose a hybrid de-identification system that extracts sensitive information from textual documents shared in ESNs. The system is based on both machine learning and rule-based classifiers. The gradient boosted trees (GBT) algorithm is used as the machine learning classifier. Experiments run on a modified CoNLL 2003 dataset show that the GBT algorithm achieves a very high F1-score (95%). Additionally, the rule-based classifier consists of regular expressions and gazetteers that complement the machine learning classifier. Thereafter, the sensitive information extracted by the two classifiers is merged and encrypted using the Format-Preserving Encryption method.
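The rule-based half of such a system can be sketched with a regular expression for phone numbers plus a gazetteer lookup for known names; the pattern, gazetteer entries, and labels below are illustrative, not the paper's actual rules.

```python
import re

# Hypothetical rule-based extractor: a regular expression for phone
# numbers combined with a gazetteer (lookup list) of person names.
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")
NAME_GAZETTEER = {"alice", "bob"}

def extract_sensitive(text):
    """Return (value, label) pairs found by the rules."""
    found = [(m.group(), "PHONE") for m in PHONE_RE.finditer(text)]
    for token in text.split():
        word = token.strip(".,")
        if word.lower() in NAME_GAZETTEER:
            found.append((word, "PERSON"))
    return found

print(extract_sensitive("Call Alice at 555-123-4567."))
```

In the full system described above, matches like these would be merged with the machine learning classifier's output and then encrypted with a format-preserving scheme.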

2012 · Vol 5s1 · pp. BII.S8963
Author(s):
Wenbo Wang
Lu Chen
Ming Tan
Shaojun Wang
Amit P. Sheth

This paper presents our solution for the i2b2 sentiment classification challenge. Our hybrid system consists of machine learning and rule-based classifiers. For the machine learning classifier, we investigate a variety of lexical, syntactic and knowledge-based features, and show how much these features contribute to the performance of the classifier through experiments. For the rule-based classifier, we propose an algorithm to automatically extract effective syntactic and lexical patterns from training examples. The experimental results show that the rule-based classifier outperforms the baseline machine learning classifier using unigram features. By combining the machine learning classifier and the rule-based classifier, the hybrid system gains a better trade-off between precision and recall, and yields the highest micro-averaged F-measure (0.5038), which is better than the mean (0.4875) and median (0.5027) micro-average F-measures among all participating teams.
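The micro-averaged F-measure used to rank systems pools true positives, false positives, and false negatives across all classes before computing precision and recall; a minimal sketch (with invented per-class counts):

```python
# Micro-averaged F-measure: aggregate counts over all classes first,
# then compute precision, recall, and their harmonic mean.
def micro_f1(per_class_counts):
    tp = sum(c["tp"] for c in per_class_counts)
    fp = sum(c["fp"] for c in per_class_counts)
    fn = sum(c["fn"] for c in per_class_counts)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

counts = [{"tp": 8, "fp": 2, "fn": 4},   # class A (hypothetical)
          {"tp": 5, "fp": 3, "fn": 1}]   # class B (hypothetical)
print(round(micro_f1(counts), 4))
```

Unlike macro-averaging, this weighting favors performance on frequent classes, which is why it is a common choice for imbalanced sentiment categories.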


Literator · 2008 · Vol 29 (1) · pp. 21-42
Author(s):
S. Pilon
M.J. Puttkammer
G.B. Van Huyssteen

The development of a hyphenator and compound analyser for Afrikaans

The development of two core technologies for Afrikaans, viz. a hyphenator and a compound analyser, is described in this article. As no annotated Afrikaans data existed prior to this project to serve as training data for a machine learning classifier, the core technologies in question were first developed using a rule-based approach. The rule-based hyphenator and compound analyser were evaluated; the hyphenator obtained an f-score of 90.84%, while the compound analyser only reached an f-score of 78.20%. Since these results are somewhat disappointing and/or insufficient for practical implementation, it was decided that a machine learning technique (memory-based learning) would be used instead. Training data for each of the two core technologies was then developed using "TurboAnnotate", an interface designed to improve the accuracy and speed of manual annotation. The hyphenator developed using machine learning was trained with 39 943 words and reaches an f-score of 98.11%, while the f-score of the compound analyser is 90.57% after being trained with 77 589 annotated words. It is concluded that machine learning (specifically memory-based learning) seems an appropriate approach for developing core technologies for Afrikaans.
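Memory-based learning, the technique adopted here, stores training instances verbatim and classifies a new instance by its most similar stored example. A toy sketch, assuming character-window features around a candidate hyphen position (the windows and labels below are invented, not the article's data):

```python
# Memory-based (instance-based) classification: keep every training
# example and predict with the label of the best-matching one,
# using simple positional feature overlap as the similarity measure.
def overlap(a, b):
    return sum(x == y for x, y in zip(a, b))

def classify(instance, memory):
    return max(memory, key=lambda ex: overlap(instance, ex[0]))[1]

# Hypothetical character windows around a position: label True means
# a hyphen may be inserted there (e.g. wa-ter).
memory = [
    (("a", "t", "e", "r"), True),
    (("i", "e", "t", "s"), False),
]
print(classify(("a", "t", "e", "n"), memory))
```

Real memory-based learners (e.g. TiMBL-style k-NN) add feature weighting and k > 1 voting, but the store-and-compare core is the same.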


2020 · Vol 14 · pp. e171481
Author(s):
Alexandre Moreira Nascimento
Vinicius Veloso De Melo
Anna Carolina Muller Queiroz
Thomas Brashear-Alejandro
Fernando de Souza Meirelles

The purpose of this study is to develop a predictive model that increases the accuracy of business operational planning using data from a small business. By using Machine Learning (ML) feature expansion, resampling, and combination techniques, it was possible to address several limitations in the existing research. The use of a novel feature-engineering technique then allowed us to increase the accuracy of the model by finding 10 new features, derived from the original ones and constructed automatically from the nonlinear relationships found between them. Finally, we built a rule-based classifier that predicts the store's revenue with high accuracy. The results show that the proposed approach opens new possibilities for ML research applied to small and medium businesses.
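Automatic feature construction from nonlinear relationships can be illustrated by expanding a feature row with interaction and quadratic terms; the feature names and transformations below are illustrative, not the study's actual engineered features.

```python
from itertools import combinations

# Sketch of feature expansion: derive new features from nonlinear
# combinations (products, squares) of the original ones.
def expand_features(row):
    names = sorted(row)
    out = dict(row)
    for a, b in combinations(names, 2):
        out[f"{a}*{b}"] = row[a] * row[b]   # pairwise interaction term
    for a in names:
        out[f"{a}^2"] = row[a] ** 2         # quadratic term
    return out

print(expand_features({"visits": 3.0, "ticket": 20.0}))
```

A downstream learner (or rule-based classifier) can then select whichever derived features actually improve revenue prediction.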


Author(s):
Kazuma Matsumoto
Takato Tatsumi
Hiroyuki Sato
Tim Kovacs
Keiki Takadama
...

Deep learning, which is machine learning with neural networks, has improved the classification accuracy of neural networks, and in some fields their accuracy exceeds that of the human brain. This paper proposes a hybrid system of a neural network and a Learning Classifier System (LCS). An LCS is an evolutionary rule-based machine learning method that uses reinforcement learning. To increase classification accuracy, we combine the neural network and the LCS. This paper reports benchmark experiments conducted to verify the proposed system. The experiments revealed that: 1) the classification accuracy of the proposed system is higher than that of the conventional LCS (XCSR) and a plain neural network; and 2) the covering mechanism of XCSR raises the classification accuracy of the proposed system.
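The covering mechanism highlighted in the findings can be sketched as follows: when no rule in the population matches the current real-valued input, XCSR generates a new interval-based rule that covers it. The interval spread and action set below are illustrative.

```python
import random

# Toy XCSR-style covering: rules are lists of (low, high) intervals
# plus an action; covering creates a rule around an unmatched input.
def matches(rule, x):
    return all(lo <= xi <= hi for (lo, hi), xi in zip(rule["cond"], x))

def cover(x, n_actions=2, spread=0.2):
    cond = [(xi - random.uniform(0, spread), xi + random.uniform(0, spread))
            for xi in x]
    return {"cond": cond, "action": random.randrange(n_actions)}

population = []
x = [0.4, 0.9]
if not any(matches(r, x) for r in population):
    population.append(cover(x))
print(matches(population[0], x))  # the new rule covers x by construction
```

Covering guarantees every input activates at least one rule, which is one plausible reason it raises accuracy when combined with a neural network front end.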


2009 · Vol 2009 · pp. 1-25
Author(s):
Ryan J. Urbanowicz
Jason H. Moore

If complexity is your problem, learning classifier systems (LCSs) may offer a solution. These rule-based, multifaceted, machine learning algorithms originated and have evolved in the cradle of evolutionary biology and artificial intelligence. The LCS concept has inspired a multitude of implementations adapted to manage the different problem domains to which it has been applied (e.g., autonomous robotics, classification, knowledge discovery, and modeling). One field that is taking increasing notice of LCS is epidemiology, where there is a growing demand for powerful tools to facilitate etiological discovery. Unfortunately, implementation optimization is nontrivial, and a cohesive encapsulation of implementation alternatives seems to be lacking. This paper aims to provide an accessible foundation for researchers of different backgrounds interested in selecting or developing their own LCS. Included is a simple yet thorough introduction, a historical review, and a roadmap of algorithmic components, emphasizing differences in alternative LCS implementations.


2019 · Vol 26 (11) · pp. 1247-1254
Author(s):
Michel Oleynik
Amila Kugic
Zdenko Kasáč
Markus Kreuzthaler

Objective: Automated clinical phenotyping is challenging because word-based features quickly turn it into a high-dimensional problem, in which small, privacy-restricted training datasets might lead to overfitting. Pretrained embeddings might solve this issue by reusing input representation schemes trained on a larger dataset. We sought to evaluate shallow and deep learning text classifiers and the impact of pretrained embeddings in a small clinical dataset.

Materials and Methods: We participated in the 2018 National NLP Clinical Challenges (n2c2) Shared Task on cohort selection and received an annotated dataset with medical narratives of 202 patients for multilabel binary text classification. We set our baseline to a majority classifier, to which we compared a rule-based classifier and orthogonal machine learning strategies: support vector machines, logistic regression, and long short-term memory neural networks. We evaluated logistic regression and long short-term memory using both self-trained and pretrained BioWordVec word embeddings as input representation schemes.

Results: The rule-based classifier showed the highest overall micro F1 score (0.9100), with which we finished first in the challenge. Shallow machine learning strategies showed lower overall micro F1 scores, but still higher than deep learning strategies and the baseline. We could not show a difference in classification efficiency between self-trained and pretrained embeddings.

Discussion: Clinical context, negation, and value-based criteria hindered shallow machine learning approaches, while deep learning strategies could not capture term diversity due to the small training dataset.

Conclusion: Shallow methods for clinical phenotyping can still outperform deep learning methods on small imbalanced data, even when supported by pretrained embeddings.
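The majority-classifier baseline for multilabel binary classification simply predicts, for each label, the class most frequent in training. A minimal sketch, with invented label names and data (not the n2c2 criteria):

```python
from collections import Counter

# Majority baseline: per label, memorize the most common training class.
def fit_majority(y_train):
    """y_train: list of dicts mapping label -> 0/1."""
    return {label: Counter(row[label] for row in y_train).most_common(1)[0][0]
            for label in y_train[0]}

train = [{"DIABETES": 1, "ASP-FOR-MI": 0},   # hypothetical annotations
         {"DIABETES": 1, "ASP-FOR-MI": 1},
         {"DIABETES": 0, "ASP-FOR-MI": 0}]
print(fit_majority(train))  # {'DIABETES': 1, 'ASP-FOR-MI': 0}
```

On imbalanced labels such a baseline can score deceptively well, which is why beating it is the minimum bar for the learned and rule-based systems compared above.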


Author(s):
Padmavathi S.
M. Chidambaram

Text classification has become more significant in managing and organizing text data due to the tremendous growth of online information. It classifies documents into a fixed number of predefined categories. The rule-based approach and the machine learning approach are the two ways of performing text classification. In the rule-based approach, documents are classified based on manually defined rules. In the machine learning approach, classification rules or a classifier are learned automatically from example documents; this approach offers higher recall and faster processing. This paper presents an investigation of text classification using different machine learning techniques.
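The rule-based side of the contrast above can be sketched as a keyword classifier with manually written rules; the categories and keyword lists are invented for illustration.

```python
# Manually defined keyword rules: a document is assigned the category
# whose keyword set overlaps its tokens the most (None if no overlap).
RULES = {"sports": {"match", "score", "team"},
         "finance": {"stock", "market", "profit"}}

def rule_classify(doc):
    tokens = set(doc.lower().split())
    best = max(RULES, key=lambda cat: len(RULES[cat] & tokens))
    return best if RULES[best] & tokens else None

print(rule_classify("The team won the match"))  # 'sports'
```

A machine learning classifier would instead induce such keyword weights automatically from labeled example documents, which is what gives it the recall advantage noted above.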

