Machine-Learning-Based User Position Prediction and Behavior Analysis for Location Services

Information ◽  
2021 ◽  
Vol 12 (5) ◽  
pp. 180
Author(s):  
Haiyang Jiang ◽  
Mingshu He ◽  
Yuanyuan Xi ◽  
Jianqiu Zeng

Machine learning (ML)-based methods are increasingly used across business domains to improve the quality and efficiency of services. The growing amount of data and the development of artificial intelligence algorithms have improved the services provided to customers in shopping malls. Most new services depend on precisely positioning customers within malls, especially within individual shops. We propose a novel method to accurately predict the specific shop in which a customer is located. We use global positioning system (GPS) information provided by customers’ mobile terminals together with WiFi information that completely covers the shopping mall. From the prediction results, we learn some of the behavior preferences of users, and we use the predicted customer locations to provide customers with more accurate services. Our training dataset is built by feature extraction and screening from real customers’ transaction records in shopping malls. To validate the model, we also cross-check our algorithm against a variety of machine learning algorithms. Our method achieves the best speed–accuracy trade-off and can accurately locate, in real time, the shops in which customers are located. Compared to the other algorithms, the proposed model is more accurate. The inferred user preference behaviors can be used in applications to efficiently provide more tailored services.
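As an illustrative sketch only (the abstract does not publish its model), shop-level positioning from WiFi data can be framed as nearest-neighbour classification over received-signal-strength (RSSI) fingerprints. All access points, RSSI values, and shop labels below are hypothetical:

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Predict a shop label for a WiFi RSSI fingerprint via k-nearest neighbours.

    train: list of (rssi_vector, shop_label); query: rssi_vector.
    """
    dists = sorted((math.dist(vec, query), label) for vec, label in train)
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy fingerprints: RSSI (dBm) from three hypothetical access points.
train = [
    ([-40, -70, -80], "shop_A"),
    ([-42, -68, -79], "shop_A"),
    ([-75, -45, -60], "shop_B"),
    ([-74, -44, -62], "shop_B"),
]
print(knn_predict(train, [-41, -69, -81]))  # → shop_A
```

A production system would fuse GPS with many more access points, but the classification framing is the same.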

Author(s):  
Dhilsath Fathima.M ◽  
S. Justin Samuel ◽  
R. Hari Haran

Aim: This work develops an improved and robust machine learning model for predicting Myocardial Infarction (MI), which could have substantial clinical impact. Objectives: This paper explains how to build a machine learning based computer-aided analysis system for early and accurate prediction of Myocardial Infarction (MI), using the Framingham Heart Study dataset for validation and evaluation. The proposed computer-aided analysis model will support medical professionals in predicting myocardial infarction proficiently. Methods: The proposed model uses mean imputation to remove missing values from the dataset, then applies principal component analysis (PCA) to extract the optimal features and enhance classifier performance. After PCA, the reduced features are partitioned into training and testing sets: 70% of the data is given as input to four well-known classifiers (support vector machine, k-nearest neighbor, logistic regression, and decision tree) to train them, and the remaining 30% is used to evaluate the model using performance metrics: confusion matrix, accuracy, precision, sensitivity, F1-score, and the AUC-ROC curve. Results: The classifier outputs were evaluated using these performance measures, and we observed that logistic regression provides higher accuracy than the k-NN, SVM, and decision tree classifiers, while PCA performs well as a feature extraction method for enhancing the performance of the proposed model. From these analyses, we conclude that logistic regression has a better mean and standard deviation of accuracy than the other three algorithms. The AUC-ROC curves in Figures 4 and 5 show that logistic regression exhibits a good AUC-ROC score, around 70%, compared to the k-NN and decision tree algorithms.
Conclusion: From the result analysis, we infer that the proposed machine learning model can act as an optimal decision-making system for predicting acute myocardial infarction at an earlier stage than existing machine learning based prediction models. It can predict the presence of acute myocardial infarction from heart disease risk factors, helping to decide when to start lifestyle modification and medical treatment to prevent heart disease.
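The evaluation metrics named above all follow directly from the confusion matrix. A minimal stdlib sketch with made-up labels shows how accuracy, precision, sensitivity, and F1-score are derived:

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, sensitivity (recall) and F1 from the
    four confusion-matrix cells (positive class = 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    return accuracy, precision, sensitivity, f1

# Hypothetical MI labels (1 = infarction) and classifier predictions.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
acc, prec, sens, f1 = binary_metrics(y_true, y_pred)
```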


2020 ◽  
Author(s):  
Joseph Prinable ◽  
Peter Jones ◽  
David Boland ◽  
Alistair McEwan ◽  
Cindy Thamrin

BACKGROUND The ability to continuously monitor breathing metrics may have implications for general health as well as respiratory conditions such as asthma. However, few studies have focused on breathing, due to a lack of available wearable technologies. OBJECTIVE To examine the performance of two machine learning algorithms in extracting breathing metrics from a finger-based pulse oximeter, which is amenable to long-term monitoring. METHODS Pulse oximetry data were collected from 11 healthy and 11 asthma subjects who breathed at a range of controlled respiratory rates. UNET and Long Short-Term Memory (LSTM) algorithms were applied to the data, and the results were compared against breathing metrics derived from respiratory inductance plethysmography measured simultaneously as a reference. RESULTS The UNET and LSTM models provided breathing metrics that were strongly correlated with those from the reference signal (all p<0.001, except for inspiratory:expiratory ratio). The following relative mean biases (95% confidence intervals) were observed (UNET vs LSTM): inspiration time 1.89 (-52.95, 56.74)% vs 1.30 (-52.15, 54.74)%, expiration time -3.70 (-55.21, 47.80)% vs -4.97 (-56.84, 46.89)%, inspiratory:expiratory ratio -4.65 (-87.18, 77.88)% vs -5.30 (-87.07, 76.47)%, inter-breath interval -2.39 (-32.76, 27.97)% vs -3.16 (-33.69, 27.36)%, and respiratory rate 2.99 (-27.04, 33.02)% vs 3.69 (-27.17, 34.56)%. CONCLUSIONS Both machine learning models show strong correlation and good comparability with the reference, with low bias though wide variability, for deriving breathing metrics in asthma and healthy cohorts. Future efforts should focus on improving the performance of these models, e.g., by increasing the size of the training dataset at the lower breathing rates. CLINICALTRIAL Sydney Local Health District Human Research Ethics Committee (#LNR\16\HAWKE99 ethics approval).
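The relative mean bias and 95% interval reported above are Bland-Altman-style quantities. A small sketch with invented reference/estimate values shows the computation; the 1.96·SD limits assume approximately normal differences:

```python
import math

def relative_bias_limits(reference, estimate):
    """Relative mean bias and 95% limits of agreement (Bland-Altman style),
    expressed as a percentage of the reference value."""
    rel = [100 * (e - r) / r for r, e in zip(reference, estimate)]
    n = len(rel)
    mean = sum(rel) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in rel) / (n - 1))  # sample SD
    return mean, mean - 1.96 * sd, mean + 1.96 * sd

# Hypothetical inspiration times (s): plethysmography reference vs model output.
reference = [2.0, 2.0, 2.0, 2.0]
estimate = [2.1, 1.9, 2.2, 1.8]
bias, low, high = relative_bias_limits(reference, estimate)
```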


2020 ◽  
Vol 13 (1) ◽  
pp. 10
Author(s):  
Andrea Sulova ◽  
Jamal Jokar Arsanjani

Recent studies have suggested that, due to climate change, the number of wildfires across the globe has been increasing and continues to grow. The massive wildfires that hit Australia during the 2019–2020 summer season raised questions as to what extent the risk of wildfires can be linked to various climate, environmental, topographical, and social factors, and how fire occurrences can be predicted so that preventive measures can be taken. Hence, the main objective of this study was to develop an automated, cloud-based workflow for generating a training dataset of fire events at a continental level from freely available remote sensing data, at a reasonable computational expense, for feeding into machine learning models. As a result, a data-driven model was set up on the Google Earth Engine platform, which is publicly accessible and open for further adjustments. The training dataset was applied to different machine learning algorithms, i.e., Random Forest, Naïve Bayes, and Classification and Regression Tree. The findings show that Random Forest outperformed the other algorithms, and hence it was used further to explore the driving factors using variable importance analysis. The study estimates the probability of fire occurrences across Australia and identifies the potential driving factors of the Australian wildfires of the 2019–2020 summer season. The methodological approach, results, and conclusions can be of great importance to policymakers, environmentalists, and climate change researchers, among others.
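Variable importance can be approximated model-agnostically by permutation importance: shuffle one feature column and measure the drop in accuracy. This is a stand-in for whichever importance measure the study used; the single-feature "dryness" stump and data below are entirely hypothetical:

```python
import random

def permutation_importance(predict, X, y, feature_idx, seed=0):
    """Drop in accuracy after shuffling one feature column: a model-agnostic
    proxy for the variable importance reported for Random Forests."""
    base = sum(predict(row) == label for row, label in zip(X, y)) / len(y)
    rng = random.Random(seed)
    col = [row[feature_idx] for row in X]
    rng.shuffle(col)
    X_perm = [row[:feature_idx] + [v] + row[feature_idx + 1:]
              for row, v in zip(X, col)]
    perm = sum(predict(row) == label for row, label in zip(X_perm, y)) / len(y)
    return base - perm

# Hypothetical stump: fire predicted when dryness (feature 0) > 0.5;
# feature 1 is an unused dummy, so its importance must be zero.
predict = lambda row: int(row[0] > 0.5)
X = [[0.9, 0.1], [0.8, 0.7], [0.2, 0.9], [0.1, 0.3]]
y = [1, 1, 0, 0]
imp0 = permutation_importance(predict, X, y, 0)
imp1 = permutation_importance(predict, X, y, 1)
```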


Sensors ◽  
2021 ◽  
Vol 21 (14) ◽  
pp. 4846
Author(s):  
Dušan Marković ◽  
Dejan Vujičić ◽  
Snežana Tanasković ◽  
Borislav Đorđević ◽  
Siniša Ranđić ◽  
...  

The appearance of pest insects can lead to yield loss if farmers do not respond in a timely manner to suppress their spread. Insect occurrences and numbers can be monitored through insect traps, which requires regularly touring the traps and checking their condition. A more efficient way is to install sensor devices with cameras at the traps, which photograph the traps and forward the images to the Internet, where the appearance of pest insects is predicted by image analysis. Weather conditions, temperature, and relative humidity are parameters that affect the appearance of some pests, such as Helicoverpa armigera. This paper presents a machine learning model that predicts the appearance of insects during a season on a daily basis, taking air temperature and relative humidity into account. Several machine learning classification algorithms were applied, and their accuracy for predicting insect occurrence is reported (up to 76.5%). Since the data used for testing were given in chronological order according to the days when the measurements were performed, the model was extended to take periods of three and five days into account. The extended method showed better prediction accuracy and a lower percentage of false detections. For the five-day period, detection accuracy was 86.3%, while the percentage of false detections was 11%. The proposed machine learning model can help farmers detect the occurrence of pests and save the time and resources needed to check the fields.
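The three- and five-day extension amounts to aggregating recent daily weather into window features before classification. A sketch with invented readings; the specific aggregates (means, extremes) are illustrative, as the abstract does not list them:

```python
def window_features(temps, hums, days=3):
    """Aggregate the last `days` days of temperature and humidity into one
    feature vector per day: (mean temp, mean humidity, min temp, max humidity)."""
    feats = []
    for i in range(days - 1, len(temps)):
        t = temps[i - days + 1:i + 1]
        h = hums[i - days + 1:i + 1]
        feats.append((sum(t) / days, sum(h) / days, min(t), max(h)))
    return feats

# Hypothetical daily air temperature (°C) and relative humidity (%).
temps = [21.0, 24.0, 27.0, 30.0, 28.0]
hums = [60, 55, 50, 45, 48]
feats3 = window_features(temps, hums, days=3)
```

Each window vector would then be fed to the classifier in place of a single day's readings.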


2015 ◽  
Vol 32 (6) ◽  
pp. 821-827 ◽  
Author(s):  
Enrique Audain ◽  
Yassel Ramos ◽  
Henning Hermjakob ◽  
Darren R. Flower ◽  
Yasset Perez-Riverol

Abstract Motivation: In any macromolecular polyprotic system—for example protein, DNA or RNA—the isoelectric point—commonly referred to as the pI—can be defined as the point of singularity in a titration curve, corresponding to the solution pH value at which the net overall surface charge—and thus the electrophoretic mobility—of the ampholyte sums to zero. Different modern analytical biochemistry and proteomics methods depend on the isoelectric point as a principal feature for protein and peptide characterization. Protein separation by isoelectric point is a critical part of 2-D gel electrophoresis, a key precursor of proteomics, where discrete spots can be digested in-gel, and proteins subsequently identified by analytical mass spectrometry. Peptide fractionation according to pI is also widely used in current proteomics sample preparation procedures prior to LC-MS/MS analysis. Therefore, accurate theoretical prediction of pI would expedite such analysis. While such pI calculation is widely used, it remains largely untested, motivating our efforts to benchmark pI prediction methods. Results: Using data from the database PIP-DB and one publicly available dataset as our reference gold standard, we have undertaken the benchmarking of pI calculation methods. We find that methods vary in their accuracy and are highly sensitive to the choice of basis set. The machine-learning algorithms, especially the SVM-based algorithm, showed superior performance when studying peptide mixtures. In general, learning-based pI prediction methods (such as Cofactor, SVM and Branca) require a large training dataset, and their resulting performance depends strongly on the quality of that data. In contrast to iterative methods, machine-learning algorithms have the advantage of being able to add new features to improve the accuracy of prediction.
Contact: [email protected] Availability and Implementation: The software and data are freely available at https://github.com/ypriverol/pIR. Supplementary information: Supplementary data are available at Bioinformatics online.
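For context, the iterative methods benchmarked above typically solve for the pH at which the Henderson-Hasselbalch net charge is zero. A minimal bisection sketch using one illustrative pKa basis set; published sets differ, which is exactly the basis-set sensitivity the benchmark measures:

```python
# Illustrative pKa values for side chains and termini (basis sets vary).
PKA = {"C": 8.5, "D": 3.9, "E": 4.1, "H": 6.5, "K": 10.8, "R": 12.5,
       "Y": 10.1, "N_TERM": 8.6, "C_TERM": 3.6}

def net_charge(seq, ph):
    """Net charge of a peptide at a given pH via Henderson-Hasselbalch."""
    pos = [PKA["N_TERM"]] + [PKA[a] for a in seq if a in "HKR"]
    neg = [PKA["C_TERM"]] + [PKA[a] for a in seq if a in "CDEY"]
    charge = sum(1 / (1 + 10 ** (ph - p)) for p in pos)      # protonated bases
    charge -= sum(1 / (1 + 10 ** (p - ph)) for p in neg)     # deprotonated acids
    return charge

def isoelectric_point(seq, lo=0.0, hi=14.0, tol=1e-4):
    """Bisect on pH until the (monotonically decreasing) net charge crosses zero."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if net_charge(seq, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

pi_basic = isoelectric_point("KKKK")    # lysine-rich: high pI expected
pi_acidic = isoelectric_point("DDDD")   # aspartate-rich: low pI expected
```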


2020 ◽  
Vol 2020 ◽  
pp. 1-10
Author(s):  
Faizan Ullah ◽  
Qaisar Javaid ◽  
Abdu Salam ◽  
Masood Ahmad ◽  
Nadeem Sarwar ◽  
...  

Ransomware (RW) is a distinctive variety of malware that encrypts files or locks the user’s system, taking the user’s files hostage and leading to huge financial losses. In this article, we propose a new model that extracts novel features from an RW dataset and classifies RW and benign files. The proposed model can detect a large number of RW samples from various families at runtime, scanning network, registry, and file system activities throughout execution. API-call sequences were used to represent the behavior-based features of RW. The technique extracts a fourteen-feature vector at runtime and analyzes it by applying online machine learning algorithms to predict RW. To validate effectiveness and scalability, we tested 78,550 recent malicious and benign samples and compared the results with random forest and AdaBoost, with testing accuracy reaching 99.56%.
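Online (streaming) classification of per-sample feature vectors can be sketched with a simple online perceptron, a stand-in for whatever online learner the paper used. The four behaviour features and labels below are hypothetical placeholders for the fourteen-feature vectors described above:

```python
class OnlinePerceptron:
    """Minimal online linear classifier: updates on each sample as it
    arrives, mirroring runtime (streaming) malware classification."""
    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        s = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1 if s > 0 else 0

    def partial_fit(self, x, y):
        err = y - self.predict(x)          # -1, 0, or +1
        if err:
            self.w = [wi + self.lr * err * xi for wi, xi in zip(self.w, x)]
            self.b += self.lr * err

# Hypothetical 4-dim behaviour features, e.g. counts of crypto API calls,
# registry writes, file renames, network beacons (label 1 = ransomware).
stream = [
    ([9, 7, 8, 3], 1), ([1, 0, 1, 0], 0),
    ([8, 6, 9, 4], 1), ([0, 1, 0, 1], 0),
]
clf = OnlinePerceptron(4)
for _ in range(10):          # several passes to converge on the toy stream
    for x, y in stream:
        clf.partial_fit(x, y)
```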


Life ◽  
2021 ◽  
Vol 11 (10) ◽  
pp. 1092
Author(s):  
Sikandar Ali ◽  
Ali Hussain ◽  
Satyabrata Aich ◽  
Moo Suk Park ◽  
Man Pyo Chung ◽  
...  

Idiopathic pulmonary fibrosis (IPF) is a rare but fatal lung disease. The disease is progressive, and assessing its severity is tedious and takes a long time. With the advent of effective machine learning techniques, it has become possible to detect many lung diseases. In this paper, we propose a model to detect the severity of IPF at an early stage so that fatal situations can be controlled. For the development of this model, we used the IPF dataset of the Korean interstitial lung disease cohort. First, we preprocessed the data using several preprocessing techniques and selected 26 highly relevant features from a total of 502 features for 2424 subjects. Second, we split the data into 80% training and 20% testing sets and applied oversampling to the training set. Third, we trained three state-of-the-art machine learning models and combined their results into a new soft-voting ensemble model for predicting the severity of IPF in patients with this chronic lung disease; hyperparameter tuning was also performed to obtain optimal performance. Fourth, the performance of the proposed model was evaluated using accuracy, AUC, the confusion matrix, precision, recall, and F1-score. The proposed soft-voting ensemble model achieved an accuracy of 0.7100, precision of 0.6400, recall of 0.7100, and F1-score of 0.6600. This model will help doctors and physicians diagnose the severity of IPF in its early stages and assist them in taking proactive measures, enabling the necessary decisions about the treatment of IPF.
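Soft voting averages the class-probability outputs of the base models and picks the argmax class. A sketch with hypothetical severity probabilities from three unnamed base models (the paper's actual models and values are not reproduced here):

```python
def soft_vote(prob_lists, weights=None):
    """Average class-probability vectors from several models and return the
    argmax class index plus the averaged distribution (soft-voting rule)."""
    weights = weights or [1.0] * len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [
        sum(w * p[c] for w, p in zip(weights, prob_lists)) / sum(weights)
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=avg.__getitem__), avg

# Hypothetical severity probabilities (mild, moderate, severe) from 3 models.
p_model1 = [0.2, 0.5, 0.3]
p_model2 = [0.1, 0.4, 0.5]
p_model3 = [0.3, 0.5, 0.2]
label, avg = soft_vote([p_model1, p_model2, p_model3])
```

Here class 1 (moderate) wins on the averaged probabilities even though the individual argmaxes disagree, which is the point of soft over hard voting.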


Entropy ◽  
2021 ◽  
Vol 23 (10) ◽  
pp. 1258
Author(s):  
Taher Al-Shehari ◽  
Rakan A. Alsowail

Insider threats are malicious acts that can be carried out by an authorized employee within an organization. They represent a major cybersecurity challenge for private and public organizations, as an insider attack can cause far more extensive damage to organization assets than external attacks. Most existing approaches in the field focus on detecting general insider attack scenarios. However, insider attacks can be carried out in different ways, and the most dangerous is a data leakage attack executed by a malicious insider before leaving an organization. This paper proposes a machine learning-based model for detecting such serious insider threat incidents. The proposed model addresses the possible bias in detection results that can arise from an inappropriate encoding process by employing feature scaling and one-hot encoding. Furthermore, the class imbalance of the utilized dataset is addressed using the synthetic minority oversampling technique (SMOTE). Well-known machine learning algorithms are employed to identify the most accurate classifier for detecting data leakage events executed by malicious insiders during the sensitive period before they leave an organization. We provide a proof of concept for our model by applying it to the CMU-CERT Insider Threat Dataset and comparing its performance with the ground truth. The experimental results show that our model detects insider data leakage events with an AUC-ROC value of 0.99, outperforming existing approaches validated on the same dataset. The proposed model provides effective methods for addressing possible bias and class imbalance issues, with the aim of devising an effective insider data leakage detection system.
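SMOTE's core idea is to synthesize minority-class samples by interpolating between a minority sample and one of its k nearest minority neighbours. A simplified stdlib sketch of that idea (a study would normally use a library implementation; the 2-D points below are invented):

```python
import random

def smote_like(minority, n_synthetic, k=2, seed=42):
    """Generate synthetic minority samples by interpolating each sample with
    one of its k nearest minority neighbours (the core idea behind SMOTE)."""
    rng = random.Random(seed)

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        neighbours = sorted(
            (s for s in minority if s is not base),
            key=lambda s: dist2(s, base),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()                 # random point on the segment
        synthetic.append([b + gap * (n - b) for b, n in zip(base, nb)])
    return synthetic

# Hypothetical 2-D feature vectors for the rare "data leakage" class.
minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.2]]
new_samples = smote_like(minority, n_synthetic=4)
```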


2021 ◽  
Author(s):  
Myeong Gyu Kim ◽  
Jae Hyun Kim ◽  
Kyungim Kim

BACKGROUND Garlic-related misinformation is prevalent whenever a virus outbreak occurs. With the outbreak of coronavirus disease 2019 (COVID-19), garlic-related misinformation is again spreading through social media sites, including Twitter. Machine learning-based approaches can be used to detect misinformation in vast numbers of tweets. OBJECTIVE This study aimed to develop machine learning algorithms for detecting misinformation about garlic and COVID-19 on Twitter. METHODS This study used 5,929 original tweets mentioning garlic and COVID-19. Tweets were manually labeled as misinformation, accurate information, or other. We tested the following algorithms: k-nearest neighbors; random forest; support vector machine (SVM) with linear, radial, and polynomial kernels; and a neural network. Features for machine learning included user-based features (verified account, user type, number of followers, and follower rate) and text-based features (uniform resource locator, negation, sentiment score, Latent Dirichlet Allocation topic probability, number of retweets, and number of favorites). The model with the highest accuracy on the training dataset (70% of the overall dataset) was tested on a test dataset (30% of the overall dataset). Predictive performance was measured using overall accuracy, sensitivity, specificity, and balanced accuracy. RESULTS The SVM with polynomial kernel showed the highest accuracy, 0.670. The model also showed a balanced accuracy of 0.757, sensitivity of 0.819, and specificity of 0.696 for misinformation. Important features in the misinformation and accurate information classes included topic 4 (common myths), topic 13 (garlic-specific myths), number of followers, topic 11 (misinformation on social media), and follower rate. Topic 3 (cooking recipes) was the most important feature in the other class. CONCLUSIONS Our SVM model showed good performance in detecting misinformation.
The results of our study will help detect misinformation related to garlic and COVID-19. The approach could also be applied to prevent misinformation related to dietary supplements in the event of a future outbreak of a disease other than COVID-19.
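Balanced accuracy is simply the mean of sensitivity and specificity, which is why it is preferred over raw accuracy when classes are imbalanced. Plugging in the sensitivity and specificity reported above reproduces the stated balanced accuracy of 0.757:

```python
def balanced_accuracy(sensitivity, specificity):
    """Balanced accuracy: the mean of sensitivity and specificity, robust
    to class imbalance in the labelled tweets."""
    return (sensitivity + specificity) / 2

# Values reported above for the polynomial-kernel SVM on misinformation.
bal_acc = balanced_accuracy(0.819, 0.696)
print(round(bal_acc, 3))  # → 0.757 (actually 0.7575, reported rounded)
```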


2021 ◽  
Author(s):  
Marian Popescu ◽  
Rebecca Head ◽  
Tim Ferriday ◽  
Kate Evans ◽  
Jose Montero ◽  
...  

Abstract This paper presents advancements in machine learning and cloud deployment that enable rapid and accurate automated lithology interpretation. A supervised machine learning technique is described that enables rapid, consistent, and accurate lithology prediction, alongside quantitative uncertainty, from large wireline or logging-while-drilling (LWD) datasets. To leverage supervised machine learning, a team of geoscientists and petrophysicists made detailed lithology interpretations of wells to generate a comprehensive training dataset. Lithology interpretations were based on deterministic cross-plotting, utilizing and combining various raw logs. This training dataset was used to develop a model and test a machine learning pipeline. The pipeline was applied to a dataset previously unseen by the algorithm to predict lithology. A quality-checking process was performed by a petrophysicist to validate the new predictions delivered by the pipeline against human interpretations. Confidence in the interpretations was assessed in two ways: the prior probability, a measure of confidence that the input data are recognized by the model, and the posterior probability, which quantifies the likelihood that a specified depth interval comprises a given lithology. The supervised machine learning algorithm ensured that the wells were interpreted consistently by removing interpreter biases and inconsistencies. The scalability of cloud computing enabled a large log dataset to be interpreted rapidly; >100 wells were interpreted consistently in five minutes, yielding a >70% lithological match to the human petrophysical interpretation. Supervised machine learning methods have strong potential for classifying lithology from log data because: 1) they can automatically define complex, non-parametric, multi-variate relationships across several input logs; and 2) they allow classifications to be quantified confidently.
Furthermore, this approach captured the knowledge and nuances of an interpreter's decisions by training the algorithm on human-interpreted labels. In the hydrocarbon industry, the quantity of generated data is predicted to increase by >300% between 2018 and 2023 (IDC, Worldwide Global DataSphere Forecast, 2019–2023). Additionally, the industry holds vast legacy data. This supervised machine learning approach can unlock the potential of some of these datasets by providing consistent lithology interpretations rapidly, allowing resources to be used more effectively.
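The posterior-probability quality check described above can be sketched as assigning each depth interval the lithology with the highest posterior probability and flagging low-confidence intervals for petrophysicist review. Class names, probabilities, and the threshold below are illustrative, not the paper's:

```python
def flag_low_confidence(posteriors, lithologies, threshold=0.7):
    """For each depth interval, pick the lithology with the highest posterior
    probability; flag intervals whose best posterior falls below `threshold`
    so a petrophysicist can review them (threshold value is illustrative)."""
    results = []
    for probs in posteriors:
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((lithologies[best], probs[best], probs[best] < threshold))
    return results

lithologies = ["sandstone", "shale", "limestone"]
posteriors = [
    [0.85, 0.10, 0.05],   # confident sandstone pick
    [0.40, 0.35, 0.25],   # ambiguous interval: flagged for review
]
picks = flag_low_confidence(posteriors, lithologies)
```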

