Private attribute inference from Facebook’s public text metadata: a case study of Korean users

2017 ◽  
Vol 117 (8) ◽  
pp. 1687-1706
Author(s):  
Daeseon Choi ◽  
Younho Lee ◽  
Seokhyun Kim ◽  
Pilsung Kang

Purpose As the number of users on social network services (SNSs) continues to increase at a remarkable rate, privacy and security issues are consistently arising. Although users may not want to disclose their private attributes, these can be inferred from their public behavior on social media. In order to investigate the severity of the leakage of private information in this manner, the purpose of this paper is to present a method to infer undisclosed personal attributes of users based only on the data available on their public profiles on Facebook. Design/methodology/approach Facebook profile data consisting of 32 attributes were collected for 111,123 Korean users. Inferences were made for four private attributes (gender, age, marital status, and relationship status) based on five machine learning-based classification algorithms and three regression algorithms. Findings Experimental results showed that users’ gender can be inferred very accurately, whereas marital status and relationship status can be predicted more accurately with the authors’ algorithms than with a random model. Moreover, the average difference between the actual and predicted ages of users was only 0.5 years. The results show that some private attributes can be easily inferred from only a few pieces of user profile information, which can jeopardize personal information and may increase the risk to dignity. Research limitations/implications In this paper, the authors only utilized each user’s own profile data, especially text information. Since users in SNSs are directly or indirectly connected, inference performance can be improved if the profile data of the friends of a given user are additionally considered. Moreover, utilizing non-text profile information, such as profile images, can help increase inference accuracy. The authors can also provide a more generalized inference performance if a larger data set of Facebook users is available.
Practical implications A private attribute leakage alarm system based on the inference model would be helpful for users who do not want their private attributes disclosed on SNSs. SNS service providers can measure and monitor the risk of privacy leakage in their system to protect their users and, if users agree, use the inferred information to optimize targeted marketing. Originality/value This paper investigates whether private attributes of SNS users can be inferred from a few pieces of publicly available information even though users are unwilling to disclose them. The experimental results showed that gender, age, marital status, and relationship status can be inferred by machine-learning algorithms. Based on these results, an early warning system was designed to help both service providers and users to protect the users’ privacy.

2022 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Krishnadas Nanath ◽  
Supriya Kaitheri ◽  
Sonia Malik ◽  
Shahid Mustafa

Purpose The purpose of this paper is to examine the factors that significantly affect the prediction of fake news from the virality theory perspective. The paper looks at a mix of emotion-driven content, sentimental resonance, topic modeling and linguistic features of news articles to predict the probability of fake news. Design/methodology/approach A data set of over 12,000 articles was chosen to develop a model for fake news detection. Machine learning algorithms and natural language processing techniques were used to handle big data with efficiency. Lexicon-based emotion analysis provided eight kinds of emotions used in the article text. The cluster of topics was extracted using topic modeling (five topics), while sentiment analysis provided the resonance between the title and the text. Linguistic features were added to the coding outcomes to develop a logistic regression predictive model for testing the significant variables. Other machine learning algorithms were also executed and compared. Findings The results revealed that positive emotions in a text lower the probability of news being fake. It was also found that sensational content, such as illegal activities and crime-related content, was associated with fake news. News whose title and text exhibited similar sentiments was found to have a lower chance of being fake. News titles with more words and content with fewer words were found to impact fake news detection significantly. Practical implications Several systems and social media platforms today are trying to implement fake news detection methods to filter the content. This research provides exciting parameters from a viral theory perspective that could help develop automated fake news detectors. Originality/value While several studies have explored fake news detection, this study uses a new perspective on viral theory. It also introduces new parameters like sentimental resonance that could help predict fake news.
This study deals with an extensive data set and uses advanced natural language processing to automate the coding techniques in developing the prediction model.


2015 ◽  
Vol 82 (4) ◽  
pp. 992-1003 ◽  
Author(s):  
Eric D. Becraft ◽  
Jeremy A. Dodsworth ◽  
Senthil K. Murugapiran ◽  
J. Ingemar Ohlsson ◽  
Brandon R. Briggs ◽  
...  

ABSTRACT The vast majority of microbial life remains uncatalogued due to the inability to cultivate these organisms in the laboratory. This “microbial dark matter” represents a substantial portion of the tree of life and of the populations that contribute to chemical cycling in many ecosystems. In this work, we leveraged an existing single-cell genomic data set representing the candidate bacterial phylum “Calescamantes” (EM19) to calibrate machine learning algorithms and define metagenomic bins directly from pyrosequencing reads derived from Great Boiling Spring in the U.S. Great Basin. Compared to other assembly-based methods, taxonomic binning with a read-based machine learning approach yielded final assemblies with the highest predicted genome completeness of any method tested. Read-first binning subsequently was used to extract Calescamantes bins from all metagenomes with abundant Calescamantes populations, including metagenomes from Octopus Spring and Bison Pool in Yellowstone National Park and Gongxiaoshe Spring in Yunnan Province, China. Metabolic reconstruction suggests that Calescamantes are heterotrophic, facultative anaerobes, which can utilize oxidized nitrogen sources as terminal electron acceptors for respiration in the absence of oxygen and use proteins as their primary carbon source. Despite their phylogenetic divergence, the geographically separate Calescamantes populations were highly similar in their predicted metabolic capabilities and core gene content, respiring O2, or oxidized nitrogen species for energy conservation in distant but chemically similar hot springs.


2020 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Marcus Renatus Johannes Wolkenfelt ◽  
Frederik Bungaran Ishak Situmeang

Purpose The purpose of this paper is to contribute to the marketing literature and practice by examining the effect of product pricing on consumer behaviours with regard to the assertiveness and the sentiments expressed in their product reviews. In addition, the paper uses new data collection and machine learning tools that can also be extended for other research of online consumer reviewing behaviours. Design/methodology/approach Using web crawling techniques, a large data set was extracted from the Google Play Store. Following this, the authors created machine learning algorithms to identify topics from product reviews and to quantify assertiveness and sentiments from the review texts. Findings The results indicate that product pricing models affect consumer review sentiment, assertiveness and topics. Removing upfront payment obligations positively impacts the overall and pricing specific consumer sentiment and reduces assertiveness. Research limitations/implications The results reveal new effects of pricing models on the nature of consumer reviews of products and form a basis for future research. The study was conducted in the gaming category of the Google Play Store and the generalisability of the findings for other app segments or marketplaces should be further tested. Originality/value The findings can help companies that create digital products in choosing a pricing strategy for their apps. The paper is the first to investigate how pricing modes affect the nature of online reviews written by consumers.


Author(s):  
Mervin Joe Thomas ◽  
Mithun M. Sanjeev ◽  
A.P. Sudheer ◽  
Joy M.L.

Purpose This paper aims to use different machine learning (ML) algorithms for the prediction of inverse kinematic solutions in parallel manipulators (PMs) to overcome the computational difficulties and approximations involved with the analytical methods. The results obtained from the ML algorithms and the Denavit–Hartenberg (DH) approach are compared with the experimental results to evaluate their performances. The study is performed on a novel 6-degree of freedom (DoF) PM that offers precise motions with a large workspace for the end effector. Design/methodology/approach The kinematic model for the proposed 3-PPSS PM is obtained using the modified DH approach and its inverse kinematic solutions are determined using the Levenberg–Marquardt algorithm. Various prediction algorithms such as the multiple linear regression, multi-variate polynomial regression, support vector, decision tree, random forest regression and multi-layer perceptron networks are applied to predict the inverse kinematic solutions for the manipulator. The data set required to train the network is generated experimentally by recording the poses of the end effector for different instantaneous positions of the slider using the concept of ArUco markers. Findings This paper demonstrates the possibility of using artificial intelligence for the prediction of inverse kinematic solutions, especially for complex geometries. Originality/value As the analytical models derived from the geometrical method, Screw theory or numerical techniques involve approximations and need more computational power, they are not advisable for real-time control of the manipulator. In addition, a data set obtained from the derived inverse kinematic equations to train the network may lead to inaccuracies in the predicted results. This error may generate significant deviations in the end-effector position from the desired position.
The present work attempts to resolve this issue by proposing a camera-based approach that uses the ArUco library and ML algorithms to create the data set experimentally and predict the inverse kinematic solutions accurately.


2019 ◽  
Vol 15 (5) ◽  
pp. 489-509 ◽  
Author(s):  
Youssef Mourdi ◽  
Mohamed Sadgal ◽  
Hamada El Kabtane ◽  
Wafaa Berrada Fathi

Purpose Although MOOCs (massive open online courses) are becoming a trend in distance learning, they suffer from a very high rate of learner dropout, and as a result, on average, only 10 per cent of enrolled learners manage to obtain their certificates of achievement. This paper aims to give tutors a clearer vision for an effective and personalized intervention as a solution to “retain” each type of learner at risk of dropping out. Design/methodology/approach This paper presents a methodology to provide predictions on learners’ behaviors. This work, which uses a Stanford data set, was divided into several phases, namely, a data extraction, an exploratory study and then a multivariate analysis to reduce dimensionality and to extract the most relevant features. The second step was the comparison between five machine learning algorithms. Finally, the authors used the principle of association rules to extract similarities between the behaviors of learners who dropped out from the MOOC. Findings The results of this work show that deep learning gives the best predictions in terms of accuracy, with an average of 95.8 per cent, which is comparable to its results on other measures such as precision, AUC, recall and F1 score. Originality/value Many research studies have tried to tackle the MOOC dropout problem by proposing different dropout predictive models. In the same context, the present proposal tries to predict not only learners at risk of dropping out of the MOOCs but also those who will succeed or fail.
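The association-rule step mentioned in the abstract above can be illustrated with a minimal sketch. It computes the two standard rule measures, support and confidence, for a single candidate rule. The behaviour names (`no_forum_posts`, `skipped_quiz`, `dropout`) and the five learner records are hypothetical and are not taken from the Stanford data set; the abstract does not say which mining algorithm or thresholds the authors used.

```python
from typing import List, Set


def support(transactions: List[Set[str]], itemset: Set[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(
    transactions: List[Set[str]], antecedent: Set[str], consequent: Set[str]
) -> float:
    """Estimate P(consequent | antecedent) as support(A and C) / support(A)."""
    base = support(transactions, antecedent)
    if base == 0.0:
        return 0.0
    return support(transactions, antecedent | consequent) / base


# Hypothetical records: each set lists behaviours observed for one learner.
learners = [
    {"no_forum_posts", "skipped_quiz", "dropout"},
    {"no_forum_posts", "skipped_quiz", "dropout"},
    {"no_forum_posts", "dropout"},
    {"skipped_quiz"},
    {"no_forum_posts", "skipped_quiz"},
]

# Candidate rule: {no_forum_posts, skipped_quiz} -> {dropout}
rule_support = support(learners, {"no_forum_posts", "skipped_quiz", "dropout"})
rule_confidence = confidence(
    learners, {"no_forum_posts", "skipped_quiz"}, {"dropout"}
)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

A rule with high confidence among learners who dropped out would point tutors at a shared behaviour pattern; a full miner such as Apriori would enumerate candidate itemsets rather than test one rule by hand.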


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Burak Cankaya ◽  
Berna Eren Tokgoz ◽  
Ali Dag ◽  
K.C. Santosh

Purpose This paper aims to propose a machine learning-based automatic labeling methodology for chemical tanker activities that can be applied to any port with any number of active tankers and the identification of important predictors. The methodology can be applied to any type of activity tracking that is based on automatically generated geospatial data. Design/methodology/approach The proposed methodology uses three machine learning algorithms (artificial neural networks, support vector machines (SVMs) and random forest) along with information fusion (IF)-based sensitivity analysis to classify chemical tanker activities. The data set is split into training and test data based on vessels, with two vessels in the training data and one in the test data set. Important predictors were identified using a receiver operating characteristic comparative approach, and overall variable importance was calculated using IF from the top models. Findings Results show that an SVM model has the best balance between sensitivity and specificity, at 93.5% and 91.4%, respectively. Speed, acceleration and change in the course on the ground for the vessels are identified as the most important predictors for classifying vessel activity. Research limitations/implications The study evaluates the vessel movements waiting between different terminals in the same port, but not their movements between different ports for their tank-cleaning activities. Practical implications The findings in this study can be used by port authorities, shipping companies, vessel operators and other stakeholders for decision support, performance tracking, as well as for automated alerts. 
Originality/value This analysis makes original contributions to the existing literature by defining and demonstrating a methodology that can automatically label vehicle activity based on location data and identify certain characteristics of the activity by finding important location-based predictors that effectively classify the activity status.
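The vessel-based split and the sensitivity/specificity figures described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the record fields (`vessel_id`, `speed`, `label`), the vessel names and the toy predictions are hypothetical, and the code does not reproduce the authors' models or data.

```python
from typing import Dict, List, Sequence, Tuple

Record = Dict[str, object]


def split_by_vessel(
    records: List[Record], test_vessels: Sequence[str]
) -> Tuple[List[Record], List[Record]]:
    """Send every record of a held-out vessel to the test set, the rest to training."""
    held_out = set(test_vessels)
    train = [r for r in records if r["vessel_id"] not in held_out]
    test = [r for r in records if r["vessel_id"] in held_out]
    return train, test


def sensitivity_specificity(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Return (true positive rate, true negative rate) for binary labels 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    return sens, spec


# Hypothetical position reports from three vessels (two train, one test).
records = [
    {"vessel_id": "A", "speed": 0.2, "label": 1},
    {"vessel_id": "A", "speed": 9.5, "label": 0},
    {"vessel_id": "B", "speed": 0.1, "label": 1},
    {"vessel_id": "C", "speed": 8.7, "label": 0},
    {"vessel_id": "C", "speed": 0.3, "label": 1},
]
train, test = split_by_vessel(records, test_vessels=["C"])

# Toy labels and predictions, only to exercise the metric function.
sens, spec = sensitivity_specificity([1, 1, 0, 0], [1, 0, 0, 0])
print(len(train), len(test), sens, spec)
```

Splitting by vessel rather than by row keeps all reports from one ship on the same side of the split, so the test score reflects performance on a vessel the model has not seen.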


2020 ◽  
Vol 38 (3) ◽  
pp. 213-225 ◽  
Author(s):  
Agostino Valier

Purpose In the literature there are numerous tests that compare the accuracy of automated valuation models (AVMs). These models first train themselves with price data and property characteristics, then they are tested by measuring their ability to predict prices. Most of them compare the effectiveness of traditional econometric models against the use of machine learning algorithms. Although the latter seem to offer better performance, there is not yet a complete survey of the literature to confirm the hypothesis. Design/methodology/approach All tests comparing regression analysis and machine learning AVMs on the same data set have been identified. The scores obtained in terms of accuracy were then compared with each other. Findings Machine learning models are more accurate than traditional regression analysis in their ability to predict value. Nevertheless, many authors point to their black-box nature and their poor inferential abilities as limitations. Practical implications Machine learning AVMs offer a huge advantage for all real estate operators who know and can use them. Their use in public policy or litigation can be critical. Originality/value According to the author, this is the first systematic review that collects all the articles produced on the subject and compares the results obtained.


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Quang-Vinh Dang

Purpose This study aims to explain the state-of-the-art machine learning models that are used in the intrusion detection problem so that they are understandable to humans, and to study the relationship between the explainability and the performance of the models. Design/methodology/approach The authors study a recent intrusion data set collected from real-world scenarios and use state-of-the-art machine learning algorithms to detect the intrusion. The authors apply several novel techniques to explain the models, then manually evaluate the explanations. The authors then compare the performance of the models before and after explainability-based feature selection. Findings The authors confirm their hypothesis and claim that by forcing the explainability, the model becomes more robust and requires less computational power while achieving a better predictive performance. Originality/value The authors draw their conclusions based on their own research and experimental works.


2020 ◽  
Vol 28 (4) ◽  
pp. 575-589
Author(s):  
Antonia Michael ◽  
Jan Eloff

Purpose Malicious activities conducted by disgruntled employees via an email platform can cause profound damage to an organization such as financial and reputational losses. This threat is known as an “Insider IT Sabotage” threat. This involves employees misusing their access rights to harm the organization. Events leading up to the attack are not technical but rather behavioural. The problem is that owing to the high volume and complexity of emails, the risk of insider IT sabotage cannot be diminished with rule-based approaches. Design/methodology/approach Malicious human behaviours that insiders within the insider IT sabotage category would possess are studied and mapped to phrases that would appear in email communications. A large email data set is classified according to behavioural characteristics of these employees. Machine learning algorithms are used to identify occurrences of this insider threat type. The accuracy of these approaches is measured. Findings It is shown in this paper that suspicious behaviour of disgruntled employees can be discovered by means of machine intelligence techniques. The output of the machine learning classifier depends mainly on the depth and quality of the phrases and behaviour analysis, cleansing and number of email attributes examined. This process of labelling content in isolation could be improved if other attributes of the email data are included, such that a confidence score can be computed for each user. Originality/value This research presents a novel approach, showing that a prototype can be created to automate the detection of insider IT sabotage within email systems and so mitigate the risk within organizations.


2020 ◽  
Vol 38 (1) ◽  
pp. 65-80 ◽  
Author(s):  
Ammara Zamir ◽  
Hikmat Ullah Khan ◽  
Tassawar Iqbal ◽  
Nazish Yousaf ◽  
Farah Aslam ◽  
...  

Purpose This paper aims to present a framework to detect phishing websites using a stacking model. Phishing is a type of fraud to access users’ credentials. The attackers access users’ personal and sensitive information for monetary purposes. Phishing affects diverse fields, such as e-commerce, online business, banking and digital marketing, and is ordinarily carried out by sending spam emails and developing identical websites resembling the original websites. As people surf the targeted website, the phishers hijack their personal information. Design/methodology/approach Features of a phishing data set are analysed using feature selection techniques including information gain, gain ratio, Relief-F and recursive feature elimination (RFE). Two features are proposed combining the strongest and weakest attributes. Principal component analysis with diverse machine learning algorithms, including random forest (RF), neural network (NN), bagging, support vector machine, Naïve Bayes and k-nearest neighbour (kNN), is applied to the proposed and remaining features. Afterwards, two stacking models, Stacking1 (RF + NN + Bagging) and Stacking2 (kNN + RF + Bagging), are applied, combining the highest-scoring classifiers to improve the classification accuracy. Findings The proposed features played an important role in improving the accuracy of all the classifiers. The results show that RFE plays an important role in removing the least important feature from the data set. Furthermore, Stacking1 (RF + NN + Bagging) outperformed all other classifiers in terms of classification accuracy, detecting phishing websites with 97.4% accuracy. Originality/value This research is novel in that no previous research has focussed on using a feed-forward NN and ensemble learners for detecting phishing websites.
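Of the feature selection techniques named in this abstract, information gain is the simplest to show concretely. The sketch below computes it for discrete features as the drop in label entropy after grouping samples by feature value. The feature names (`has_ip_in_url`, `uses_https`) and the four toy samples are hypothetical and are not drawn from the paper's data set.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(
    feature_values: Sequence[Hashable], labels: Sequence[Hashable]
) -> float:
    """Entropy of the labels minus the weighted entropy within each feature value."""
    total = len(labels)
    if total == 0:
        return 0.0
    groups: Dict[Hashable, List[Hashable]] = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: label 1 = phishing, 0 = legitimate.
labels = [1, 1, 0, 0]
has_ip_in_url = [1, 1, 0, 0]  # separates the two classes perfectly
uses_https = [1, 0, 1, 0]     # carries no information about the label

gain_ip = information_gain(has_ip_in_url, labels)
gain_https = information_gain(uses_https, labels)
print(f"has_ip_in_url: {gain_ip:.2f} bits, uses_https: {gain_https:.2f} bits")
```

Ranking features by this score and keeping the top ones is one common way to use information gain; how the paper combined it with gain ratio, Relief-F and RFE is not specified in the abstract.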

