Journal Of Big Data | ScienceGate

An unsupervised method for social network spammer detection based on user information interests

Journal Of Big Data ◽

10.1186/s40537-021-00552-5 ◽

2022 ◽

Vol 9 (1) ◽

Author(s):

Darshika Koggalahewa ◽

Yue Xu ◽

Ernest Foo

Keyword(s):

Social Networks ◽

Social Network ◽

Online Social Networks ◽

Supervised Classification ◽

Peer Acceptance ◽

Spam Detection ◽

Highly Active ◽

Detection Approach ◽

Data Fabrication ◽

Unsupervised Approach

Abstract:

Online Social Networks (OSNs) are a popular platform for communication and collaboration. Spammers are highly active in OSNs, and uncovering them has become one of the most challenging problems in OSNs. Classification-based supervised approaches are the most commonly used method for detecting spammers. Classification-based systems suffer from the limitations of “data labelling”, “spam drift”, “imbalanced datasets” and “data fabrication”. These limitations affect the accuracy of a classifier’s detection. An unsupervised approach does not require labelled datasets. We aim to address the limitations of data labelling and spam drift through an unsupervised approach. We present a pure unsupervised approach for spammer detection based on the peer acceptance of a user in a social network to distinguish spammers from genuine users. The peer acceptance of a user by another user is calculated based on common shared interests over multiple shared topics between the two users. The main contribution of this paper is the introduction of a pure unsupervised spammer detection approach based on users’ peer acceptance. Our approach does not require labelled training datasets. While it does not better the accuracy of supervised classification-based approaches, our approach is a successful alternative to traditional classifiers for spam detection, achieving an accuracy of 96.9%.
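
The abstract does not give the formula for peer acceptance, so the following is only a minimal sketch of the general idea under stated assumptions: each user is represented as a mapping from topic to a set of interest terms, the acceptance between two users is taken to be the mean Jaccard overlap over the topics they share, and a user whose average acceptance by peers falls below a threshold is flagged. The function names, the Jaccard measure, the threshold and the toy profiles are all hypothetical and are not taken from the paper.

```python
from statistics import mean


def pairwise_acceptance(a, b):
    """Mean Jaccard overlap of interest terms over the topics both users share.

    Assumption: Jaccard overlap stands in for the paper's (unspecified) measure.
    """
    overlaps = [
        len(a[t] & b[t]) / len(a[t] | b[t])
        for t in set(a) & set(b)
        if a[t] | b[t]  # skip topics where both term sets are empty
    ]
    return mean(overlaps) if overlaps else 0.0


def peer_acceptance(user, profiles):
    """Average acceptance of `user` by every other user in `profiles`."""
    others = [p for name, p in profiles.items() if name != user]
    if not others:
        return 0.0
    return mean(pairwise_acceptance(profiles[user], p) for p in others)


def flag_low_acceptance(profiles, threshold):
    """Return users whose peer acceptance is below the (hypothetical) threshold."""
    return sorted(u for u in profiles if peer_acceptance(u, profiles) < threshold)


# Toy, made-up profiles: topic -> set of interest terms.
profiles = {
    "alice": {"sports": {"football", "tennis"}, "music": {"jazz"}},
    "bob": {"sports": {"football", "rugby"}, "music": {"jazz", "rock"}},
    "spam_bot": {"sports": {"casino", "loans"}, "music": {"pills"}},
}

flagged = flag_low_acceptance(profiles, threshold=0.1)
```

No labelled data is used anywhere in this sketch: the decision depends only on how much a user's interests overlap with those of other users, which is the property that makes such an approach unsupervised.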

Addressing big data variety using an automated approach for data characterization

Journal Of Big Data ◽

10.1186/s40537-021-00554-3 ◽

2022 ◽

Vol 9 (1) ◽

Author(s):

Georgios Vranopoulos ◽

Nathan Clarke ◽

Shirley Atkinson

Keyword(s):

Big Data ◽

Heterogeneous Data ◽

Use Of Self ◽

Regulatory Requirements ◽

Data Set ◽

Ugly Duckling ◽

Data Scientist ◽

Algorithmic Techniques ◽

Manual Tasks ◽

Self Learning

Abstract:

The creation of new knowledge from manipulating and analysing existing knowledge is one of the primary objectives of any cognitive system. Most of the effort in Big Data research has been focussed upon Volume and Velocity, while Variety, “the ugly duckling” of Big Data, is often neglected and difficult to solve. A principal challenge with Variety is being able to understand and comprehend the data. This paper proposes and evaluates an automated approach for metadata identification and enrichment in describing Big Data. The paper focuses on the use of self-learning systems that will enable automatic compliance of data against regulatory requirements, along with the capability of generating valuable and readily usable metadata for data classification. Two experiments, on data confidentiality and data identification, were conducted to evaluate the feasibility of the approach. The focus of the experiments was to confirm that repetitive manual tasks can be automated, thus reducing the time a Data Scientist spends on data identification and allowing more focus on the extraction and analysis of the data itself. The datasets used originated from Private/Business and Public/Governmental sources and exhibited diverse characteristics in relation to the number of files and the size of the files. The experimental work confirmed that: (a) the use of algorithmic techniques contributed to a substantial decrease in false positives in the identification of confidential information; (b) the use of a fraction of a data set, along with statistical analysis and supervised learning, is sufficient to identify the structure of the information within it. With this approach, the issues of understanding the nature of data can be mitigated, enabling a greater focus on meaningful interpretation of the heterogeneous data.

Programming big data analysis: principles and solutions

Journal Of Big Data ◽

10.1186/s40537-021-00555-2 ◽

2022 ◽

Vol 9 (1) ◽

Author(s):

Loris Belcastro ◽

Riccardo Cantini ◽

Fabrizio Marozzo ◽

Alessio Orsino ◽

Domenico Talia ◽

...

Keyword(s):

Big Data ◽

Data Analysis ◽

Message Passing ◽

Global Analysis ◽

Big Data Analysis ◽

Digital Data ◽

Advantages And Disadvantages ◽

Depth Analysis ◽

Social Media Platforms ◽

Many Sources

Abstract:

In the age of the Internet of Things and social media platforms, huge amounts of digital data are generated by and collected from many sources, including sensors, mobile devices, wearable trackers and security cameras. This data, commonly referred to as Big Data, is challenging current storage, processing, and analysis capabilities. New models, languages, systems and algorithms continue to be developed to effectively collect, store, analyze and learn from Big Data. Most of the recent surveys provide a global analysis of the tools that are used in the main phases of Big Data management (generation, acquisition, storage, querying and visualization of data). In contrast, this work analyzes and reviews the parallel and distributed paradigms, languages and systems used today to analyze and learn from Big Data on scalable computers. In particular, we provide an in-depth analysis of the properties of the main parallel programming paradigms (MapReduce, workflow, BSP, message passing, and SQL-like) and, through programming examples, we describe the most widely used systems for Big Data analysis (e.g., Hadoop, Spark, and Storm). Furthermore, we discuss and compare the different systems by highlighting the main features of each of them, their diffusion (community of developers and users) and the main advantages and disadvantages of using them to implement Big Data analysis applications. The final goal of this work is to help designers and developers identify and select the most appropriate programming solution based on their skills, hardware availability, application domains and purposes, also considering the support provided by the developer community.

IoT information theft prediction using ensemble feature selection

Journal Of Big Data ◽

10.1186/s40537-021-00558-z ◽

2022 ◽

Vol 9 (1) ◽

Author(s):

Joffrey L. Leevy ◽

John Hancock ◽

Taghi M. Khoshgoftaar ◽

Jared M. Peterson

Keyword(s):

Feature Selection ◽

Operating Characteristic ◽

Characteristic Curve ◽

Classification Performance ◽

Feature Reduction ◽

Security Risk ◽

Precision Recall Curve ◽

Iot Devices ◽

Feature Selection Techniques ◽

Better Than

Abstract:

Recent years have seen a proliferation of Internet of Things (IoT) devices and an associated security risk from an increasing volume of malicious traffic worldwide. For this reason, datasets such as Bot-IoT were created to train machine learning classifiers to identify attack traffic in IoT networks. In this study, we build predictive models with Bot-IoT to detect attacks represented by dataset instances from the Information Theft category, as well as dataset instances from the data exfiltration and keylogging subcategories. Our contribution is centered on the evaluation of the effect of ensemble feature selection techniques (FSTs) on classification performance for these specific attack instances. A group or ensemble of FSTs will often perform better than the best individual technique. The classifiers that we use are a diverse set of four ensemble learners (LightGBM, CatBoost, XGBoost, and random forest (RF)) and four non-ensemble learners (logistic regression (LR), decision tree (DT), Naive Bayes (NB), and a multi-layer perceptron (MLP)). The metrics used for evaluating classification performance are area under the receiver operating characteristic curve (AUC) and area under the precision-recall curve (AUPRC). For the most part, we determined that our ensemble FSTs do not affect classification performance but are beneficial because feature reduction eases computational burden and provides insight through improved data visualization.
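
The abstract does not describe how the individual feature selection techniques are combined, so the following is a minimal sketch of one common ensemble scheme, offered as an assumption rather than the authors' procedure: each technique contributes a ranked list of features, and a feature is kept when it appears in the top `top_n` of at least `min_votes` lists. The function name, the parameters and the feature names are hypothetical.

```python
from collections import Counter


def ensemble_select(rankings, top_n, min_votes):
    """Keep features that appear in the top `top_n` of at least `min_votes` rankings.

    Assumption: each ranking lists every feature at most once, best first.
    """
    votes = Counter()
    for ranking in rankings:
        votes.update(ranking[:top_n])
    return sorted(name for name, count in votes.items() if count >= min_votes)


# Made-up rankings from three hypothetical feature selection techniques.
rankings = [
    ["rate", "bytes", "dur", "pkts"],
    ["bytes", "rate", "pkts", "dur"],
    ["dur", "rate", "mean", "bytes"],
]

selected = ensemble_select(rankings, top_n=2, min_votes=2)
```

Raising `min_votes` makes the ensemble more conservative and yields a smaller feature set, which is the kind of feature reduction the abstract credits with easing the computational burden.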

Modeling the public attitude towards organic foods: a big data and text mining approach

Journal Of Big Data ◽

10.1186/s40537-021-00551-6 ◽

2022 ◽

Vol 9 (1) ◽

Author(s):

Anupam Singh ◽

Aldona Glińska-Neweś

Keyword(s):

Big Data ◽

Text Mining ◽

Research Methods ◽

Food Consumption ◽

Topic Modeling ◽

Latent Dirichlet Allocation ◽

Food Delivery ◽

Public Attitude ◽

Organic Foods ◽

The Public

Abstract:

This study aims to identify the topics that users post on Twitter about organic foods and to analyze the emotion-based sentiment of those tweets. The study addresses a call for the application of big data and text mining in different fields of research, and proposes more objective research methods for studies on food consumption. There is a growing interest in understanding consumer food choices, driven by the food industry's predominant contribution to climate change. So far, customer attitudes towards organic food have been studied mostly with self-reported methods, such as questionnaires and interviews, which have many limitations. Therefore, in the present study, we used big data and text mining techniques as more objective methods to analyze the public attitude towards organic foods. A total of 43,724 Twitter posts were extracted with the streaming Application Programming Interface (API). The Latent Dirichlet Allocation (LDA) algorithm was applied for topic modeling. A test of topic significance was performed to evaluate the quality of the topics. Public sentiment was analyzed based on the NRC emotion lexicon using the Syuzhet package. Topic modeling results showed that people discuss a variety of themes related to organic foods, such as plant-based diet, saving the planet, organic farming and standardization, authenticity, and food delivery. Sentiment analysis results suggest that people view organic foods positively, though there are also people who are skeptical about the claims that organic foods are natural and free from chemicals and pesticides. The study contributes to the field of consumer behavior by implementing research methods grounded in text mining and big data. The study also contributes to the advancement of research in the field of sustainable food consumption by providing a fresh perspective on public attitude toward organic foods, filling gaps in the existing literature and research.

Machine learning approaches in Covid-19 severity risk prediction in Morocco

Journal Of Big Data ◽

10.1186/s40537-021-00557-0 ◽

2022 ◽

Vol 9 (1) ◽

Author(s):

Mariam Laatifi ◽

Samira Douzi ◽

Abdelaziz Bouklouz ◽

Hind Ezzine ◽

Jaafar Jaafari ◽

...

Keyword(s):

Machine Learning ◽

Data Analysis ◽

Biological Data ◽

Topological Data Analysis ◽

Test Machine ◽

Learning Approaches ◽

Reactive Protein ◽

Severity Prediction ◽

New Feature ◽

Prognostic Prediction

Abstract:

The purpose of this study is to develop and test machine learning-based models for COVID-19 severity prediction. COVID-19 test samples from 337 COVID-19 positive patients at Cheikh Zaid Hospital were grouped according to the severity of their illness. Ours is the first study to estimate illness severity by combining biological and non-biological data from patients with COVID-19. Moreover, the use of ML for therapeutic purposes in Morocco is currently restricted, and ours is the first study to investigate the severity of COVID-19. When data analysis approaches were used to uncover patterns and essential characteristics in the data, C-reactive protein, platelets, and D-dimers were determined to be the factors most strongly associated with COVID-19 severity prediction. In this research, many data reduction algorithms were used, and Machine Learning models were trained to predict the severity of sickness using patient data. A new feature engineering method based on topological data analysis, called Uniform Manifold Approximation and Projection (UMAP), was shown to achieve better results. It reaches 100% accuracy, specificity, sensitivity, and ROC curve in conducting a prognostic prediction using different machine learning classifiers such as XGBoost, AdaBoost, Random Forest, and ExtraTrees. The proposed approach aims to assist hospitals and medical facilities in determining who should be seen first and who has a higher priority for admission to the hospital.

The use of Big Data Analytics in healthcare

Journal Of Big Data ◽

10.1186/s40537-021-00553-4 ◽

2022 ◽

Vol 9 (1) ◽

Author(s):

Kornelia Batko ◽

Andrzej Ślęzak

Keyword(s):

Big Data ◽

Data Analytics ◽

New Technologies ◽

Health Management ◽

Big Data Analytics ◽

Structural Data ◽

Unstructured Data ◽

Clinical Area ◽

Use Of Data ◽

Medical Facilities

Abstract:

The introduction of Big Data Analytics (BDA) in healthcare will allow the use of new technologies both in the treatment of patients and in health management. The paper aims to analyze the possibilities of using Big Data Analytics in healthcare. The research is based on a critical analysis of the literature, as well as the presentation of selected results of direct research on the use of Big Data Analytics in medical facilities. The direct research was carried out using a research questionnaire and conducted on a sample of 217 medical facilities in Poland. Literature studies have shown that the use of Big Data Analytics can bring many benefits to medical facilities, while direct research has shown that medical facilities in Poland are moving towards data-based healthcare because they use structured and unstructured data and reach for analytics in the administrative, business and clinical areas. The research positively confirmed that medical facilities are working on both structural data and unstructured data. The following kinds and sources of data can be distinguished: data from databases, transaction data, unstructured content of emails and documents, and data from devices and sensors. However, the use of data from social media is lower. In their activity, medical facilities reach for analytics not only in the administrative and business areas but also in the clinical area. This clearly shows that the decisions made in medical facilities are highly data-driven. The results of the study confirm what has been analyzed in the literature: medical facilities are moving towards data-based healthcare, together with its benefits.

Towards data sharing economy on Internet of Things: a semantic for telemetry data

Journal Of Big Data ◽

10.1186/s40537-021-00549-0 ◽

2022 ◽

Vol 9 (1) ◽

Author(s):

Dareen K. Halim ◽

Samuel Hutagalung

Keyword(s):

Internet Of Things ◽

Data Sharing ◽

Physical World ◽

Sharing Economy ◽

Sensor Data ◽

Machine Learning Techniques ◽

Telemetry Data ◽

Time Data ◽

Domain Specific Knowledge ◽

Iot Devices

Abstract:

The Internet of Things (IoT) provides data processing and machine learning techniques with access to physical world data through sensors, namely telemetry data. Acquiring sensor data through IoT faces challenges such as connectivity and proper measurement requiring domain-specific knowledge, which result in data quality problems. Data sharing is one solution to this. In this work, we propose IoT Telemetry Data Hub (IoT TeleHub), a general framework and semantic for telemetry data collection and sharing. The framework is based on the principles of abstraction, layering of elements, and extensibility and openness. We showed that while the framework is defined specifically for telemetry data, it is general enough to be mapped to existing IoT platforms with various use cases. Our framework also considers the notions of machine-readable and machine-understandable data with regard to resource-constrained IoT devices. We also present IoThingsHub, an IoT platform for real-time data sharing based on the proposed framework. The platform demonstrated that the framework could be implemented with existing technologies such as HTTP, MQTT, SQL, and NoSQL.

Assessing survival time of heart failure patients: using Bayesian approach

Journal Of Big Data ◽

10.1186/s40537-021-00537-4 ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Tafese Ashine ◽

Geremew Muleta ◽

Kenenisa Tadesse

Keyword(s):

Heart Failure ◽

Survival Time ◽

Failure Time ◽

Accelerated Failure Time Model ◽

Accelerated Failure Time ◽

Time Model ◽

Heart Failure Patients ◽

Accelerated Failure ◽

Log Normal ◽

Failure Time Model

Abstract:

Heart failure is a failure of the heart to pump blood with normal efficiency and a growing public health issue with a high death rate all over the world, including Ethiopia. The goal of this study was to identify factors affecting the survival time of heart failure patients. To achieve this aim, 409 heart failure patients were included in the study based on data taken from the medical records of patients enrolled from January 2016 to January 2019 at Jimma University Medical Center, Jimma, Ethiopia. Kaplan-Meier plots and the log-rank test were used to compare survival functions; the Cox-PH model and Bayesian parametric survival models were used to analyze the survival time of heart failure patients using R software. Integrated nested Laplace approximation methods were applied. Of the heart failure patients in the study, 40.1% died and 59.9% were censored. The estimated median survival time of patients was 31 months. Using model selection criteria, the Bayesian log-normal accelerated failure time model was found to be appropriate. The results of this model show that age, chronic kidney disease, diabetes mellitus, etiology of heart failure, hypertension, anemia, smoking cigarettes, and stages of heart failure all have a significant impact on the survival time of heart failure patients. The Bayesian log-normal accelerated failure time model described the survival time data of heart failure patients well. The findings of this study suggest that age group (49 to 65 years, and greater than or equal to 65 years); etiology of heart failure (rheumatic valvular heart disease, hypertensive heart disease, and other diseases); the presence of hypertension; the presence of anemia; the presence of chronic kidney disease; smoking; diabetes mellitus (type I and type II); and stage of heart failure (II, III, and IV) shortened the survival time of heart failure patients.
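
The study itself was carried out in R. As an illustration of the Kaplan-Meier estimator and the median survival time mentioned in the abstract, here is a minimal Python sketch. The follow-up times and event indicators are made-up toy values, not data from the paper, and the function names are hypothetical.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate: list of (time, S(time)) at each distinct event time.

    `events[i]` is 1 if the event (death) was observed at `times[i]`,
    and 0 if the observation was censored at that time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all subjects sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored subjects both leave the risk set
    return curve


def median_survival(curve):
    """First time at which the estimated survival drops to 0.5 or below."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # survival never reaches 0.5 within follow-up


# Toy follow-up times in months; 1 = died, 0 = censored.
times = [5, 8, 12, 12, 20, 31, 40]
events = [1, 0, 1, 1, 0, 1, 0]

curve = kaplan_meier(times, events)
median = median_survival(curve)
```

Censored patients do not lower the survival estimate directly; they only shrink the risk set used at later event times, which is why a cohort with many censored observations can still yield a well-defined median.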

Machine learning techniques to predict daily rainfall amount

Journal Of Big Data ◽

10.1186/s40537-021-00545-4 ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Chalachew Muluken Liyew ◽

Haileyesus Amsaya Melese

Keyword(s):

Machine Learning ◽

Pearson Correlation ◽

Daily Rainfall ◽

Learning Model ◽

Machine Learning Techniques ◽

Gradient Boosting ◽

Correlation Technique ◽

Learning Techniques ◽

Machine Learning Model ◽

Extreme Gradient Boosting

Abstract:

Predicting the amount of daily rainfall improves agricultural productivity and secures food and water supply to keep citizens healthy. To predict rainfall, several studies have been conducted using data mining and machine learning techniques on different countries’ environmental datasets. An erratic rainfall distribution in the country affects the agriculture on which the economy of the country depends. Wise use of rainwater should be planned and practiced in the country to minimize the problems of drought and flood occurring in the country. The main objective of this study is to identify the relevant atmospheric features that cause rainfall and to predict the intensity of daily rainfall using machine learning techniques. The Pearson correlation technique was used to select relevant environmental variables, which were used as input for the machine learning model. The dataset was collected from the local meteorological office at Bahir Dar City, Ethiopia, to measure the performance of three machine learning techniques (Multivariate Linear Regression, Random Forest, and Extreme Gradient Boost). Root mean squared error and mean absolute error were used to measure the performance of the machine learning model. The results of the study revealed that the Extreme Gradient Boosting machine learning algorithm performed better than the others.
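
The two building blocks named in the abstract, Pearson-correlation feature selection and the RMSE and MAE error measures, can be sketched as follows. This is a minimal illustration under stated assumptions: the correlation threshold, the variable names and the toy values are hypothetical and do not come from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


def select_features(features, target, threshold):
    """Keep features whose absolute correlation with the target meets the threshold."""
    return [
        name for name, column in features.items()
        if abs(pearson(column, target)) >= threshold
    ]


def rmse(actual, predicted):
    """Root mean squared error."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up daily observations; the threshold of 0.2 is an arbitrary assumption.
rainfall = [1.0, 2.0, 3.0, 4.0]
features = {
    "humidity": [60.0, 70.0, 80.0, 90.0],
    "wind_speed": [3.0, 1.0, 4.0, 2.0],
    "station_id": [1.0, 1.0, 1.0, 1.0],
}

relevant = select_features(features, rainfall, threshold=0.2)
error_rmse = rmse(rainfall, [1.0, 2.0, 3.0, 6.0])
error_mae = mae(rainfall, [1.0, 2.0, 3.0, 6.0])
```

RMSE penalizes large individual errors more heavily than MAE, so reporting both gives a fuller picture of how a rainfall model behaves on unusual days.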

Journal Of Big Data
Latest Publications

Published By Springer (Biomed Central Ltd.)
