Detecting Phishing Websites Using an Efficient Feature-based Machine Learning Framework

2021 ◽  
Vol 11 (2) ◽  
pp. 2106-2112
Author(s):  
K. Mohana Sundaram ◽  
R. Sasikumar ◽  
Atthipalli Sai Meghana ◽  
Arava Anuja ◽  
Chandolu Praneetha

Phishing is a form of cybercrime in which spam messages and spoofed websites lure users into disclosing sensitive information to phishers. The information obtained is used for identity theft or to gain access to money. To combat identity theft, Microsoft's cloud-based service attempts to use heuristic tests to determine whether a site can be trusted. The purpose of this paper is to build a prediction model using a variety of machine learning methods. Classification is a machine learning technique that can be used effectively to identify phishing: models are built and tested under different parameter settings, several machine learning algorithms are compared, and the accuracy of the resulting model is measured and reported with multiple evaluation metrics. The current study compares the predictive accuracy, F1 score, precision, and recall of several machine learning methods, including Naïve Bayes (NB) and Random Forest (RF), for detecting messages that attempt to steal sensitive information, and improves the process through feature selection strategies that raise the accuracy of phishing classification.
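
A minimal sketch of the comparison described above is given below, assuming a scikit-learn workflow and a synthetic feature matrix in place of the paper's phishing dataset (which is not reproduced here); the feature-selection step and the Naïve Bayes versus Random Forest comparison with accuracy, F1, precision, and recall follow the abstract's description.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Synthetic stand-in for a table of extracted website features;
# label 1 = phishing, 0 = legitimate.
X, y = make_classification(n_samples=2000, n_features=30, n_informative=8, random_state=0)

# Feature-selection step, mirroring the strategy described in the abstract.
X_sel = SelectKBest(mutual_info_classif, k=10).fit_transform(X, y)

X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, test_size=0.3, stratify=y, random_state=0)

for name, model in [("Naive Bayes", GaussianNB()),
                    ("Random Forest", RandomForestClassifier(n_estimators=200, random_state=0))]:
    pred = model.fit(X_tr, y_tr).predict(X_te)
    print(f"{name}: acc={accuracy_score(y_te, pred):.3f} "
          f"f1={f1_score(y_te, pred):.3f} "
          f"precision={precision_score(y_te, pred):.3f} "
          f"recall={recall_score(y_te, pred):.3f}")
```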

2019 ◽  
pp. 089443931988844
Author(s):  
Ranjith Vijayakumar ◽  
Mike W.-L. Cheung

Machine learning methods have become very popular in diverse fields due to their focus on predictive accuracy, but little work has been conducted on how to assess the replicability of their findings. We introduce and adapt replication methods advocated in psychology to the aims and procedural needs of machine learning research. In Study 1, we illustrate these methods with the use of an empirical data set, assessing the replication success of a predictive accuracy measure, namely R², on the cross-validated and test sets of the samples. We introduce three replication aims. First, tests of inconsistency examine whether single replications have successfully rejected the original study. Rejection is supported if the 95% confidence interval (CI) of the R² difference estimate between replication and original does not contain zero. Second, tests of consistency help support claims of successful replication. We can decide a priori on a region of equivalence, where population values of the difference estimates are considered equivalent for substantive reasons. The 90% CI of a difference estimate lying fully within this region supports replication. Third, we show how to combine replications to construct meta-analytic intervals for better precision of predictive accuracy measures. In Study 2, R² is reduced from the original in a subset of replication studies to examine the ability of the replication procedures to distinguish true replications from nonreplications. We find that when combining studies sampled from the same population to form meta-analytic intervals, random-effects methods perform best for cross-validated measures while fixed-effects methods work best for test measures. Among machine learning methods, regression was comparable to many complex methods, while support vector machines performed most reliably across a variety of scenarios. Social scientists who use machine learning to model empirical data can use these methods to enhance the reliability of their findings.
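
A minimal numerical sketch of the first two replication tests is given below. It is not the authors' code; the bootstrap construction of the intervals, the toy outcome/prediction values, and the (-0.05, 0.05) equivalence region are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def r2(y, yhat):
    """Proportion of variance explained by the predictions."""
    return 1.0 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

def bootstrap_r2_diff_ci(y_orig, p_orig, y_rep, p_rep, level=0.95, n_boot=2000):
    """Bootstrap CI for R^2(replication) - R^2(original)."""
    diffs = []
    for _ in range(n_boot):
        i = rng.integers(0, len(y_orig), len(y_orig))
        j = rng.integers(0, len(y_rep), len(y_rep))
        diffs.append(r2(y_rep[j], p_rep[j]) - r2(y_orig[i], p_orig[i]))
    alpha = (1.0 - level) / 2.0
    return np.percentile(diffs, [100 * alpha, 100 * (1 - alpha)])

# Toy outcomes and out-of-sample predictions (illustrative values only).
y_o = rng.normal(size=200); p_o = 0.60 * y_o + rng.normal(scale=0.8, size=200)
y_r = rng.normal(size=200); p_r = 0.55 * y_r + rng.normal(scale=0.9, size=200)

lo95, hi95 = bootstrap_r2_diff_ci(y_o, p_o, y_r, p_r, level=0.95)
lo90, hi90 = bootstrap_r2_diff_ci(y_o, p_o, y_r, p_r, level=0.90)
print("inconsistency test, 95% CI excludes zero?", not (lo95 <= 0.0 <= hi95))
print("consistency test, 90% CI inside (-0.05, 0.05)?", lo90 > -0.05 and hi90 < 0.05)
```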


Phishing is a cyber-attack which is socially engineered to trick naive online users into revealing sensitive information such as user data, login credentials, social security numbers, and banking information. Attackers fool Internet users by posing as a legitimate webpage to retrieve personal information. This can also be done by sending emails posing as reputable companies or businesses. Phishing exploits several vulnerabilities effectively, and there is no single solution that protects users from all of them. A classification/prediction model is designed based on heuristic features extracted from the website domain, URL, web protocol, and source code to eliminate the drawbacks of existing anti-phishing techniques. In the model, we combine existing solutions such as blacklisting and whitelisting, heuristics, and visual-similarity checks, which provides a higher level of security. We use the model with different machine learning algorithms, namely Logistic Regression, Decision Trees, K-Nearest Neighbours, and Random Forests, and compare the results to find the most efficient machine learning framework.
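
As a rough illustration of this comparison, the sketch below evaluates the four algorithms named above on a handful of simple URL-derived features. The feature extractor and the toy URLs are assumptions made for demonstration; they do not reproduce the paper's full domain/protocol/source-code feature set or its blacklist and visual-similarity components.

```python
from urllib.parse import urlparse
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

def url_features(url):
    """A few simple heuristic URL features (a simplified stand-in)."""
    p = urlparse(url)
    return [len(url),                                  # overall URL length
            url.count("."),                            # dots (subdomain abuse)
            url.count("-"),                            # hyphens in host/path
            int(p.scheme == "https"),                  # uses HTTPS?
            int(any(c.isdigit() for c in p.netloc))]   # digits in the host

# Tiny toy corpus for illustration only (0 = legitimate, 1 = phishing).
urls = ["https://www.wikipedia.org/", "https://github.com/login",
        "https://mail.google.com/mail/", "https://docs.python.org/3/",
        "https://www.amazon.com/gp/cart", "https://www.bankofamerica.com/",
        "http://paypa1-secure-login.tk/verify", "http://secure-bank0famerica.xyz/signin",
        "http://appleid.apple.com.verify-user.ru/login", "http://login-micros0ft-account.cn/check",
        "http://free-gift-card.win/claim?id=123", "http://update-account-info.top/form"]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

X = np.array([url_features(u) for u in urls])
y = np.array(labels)

models = {"Logistic Regression": LogisticRegression(max_iter=1000),
          "Decision Tree": DecisionTreeClassifier(random_state=0),
          "K-Nearest Neighbours": KNeighborsClassifier(n_neighbors=3),
          "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0)}

for name, m in models.items():
    print(name, "mean CV accuracy:", cross_val_score(m, X, y, cv=3).mean().round(3))
```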


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Cindy Feng ◽  
George Kephart ◽  
Elizabeth Juarez-Colunga

Abstract Background Coronavirus disease (COVID-19) presents an unprecedented threat to global health. Accurately predicting the mortality risk among infected individuals is crucial for prioritizing medical care and mitigating the healthcare system's burden. The present study aimed to assess the predictive accuracy of machine learning methods for COVID-19 mortality risk. Methods We compared the performance of classification tree, random forest (RF), extreme gradient boosting (XGBoost), logistic regression, generalized additive model (GAM) and linear discriminant analysis (LDA) for predicting mortality risk among 49,216 COVID-19-positive cases in Toronto, Canada, reported from March 1 to December 10, 2020. We used repeated split-sample validation and k-steps-ahead forecasting validation. Predictive models were estimated using training samples, and the predictive accuracy of the methods on the testing samples was assessed using the area under the receiver operating characteristic curve (AUC), Brier score, calibration intercept and calibration slope. Results We found that XGBoost was highly discriminative, with an AUC of 0.9669, and had superior performance over conventional tree-based methods, i.e., classification tree or RF, for predicting COVID-19 mortality risk. Regression-based methods (logistic, GAM and LASSO) had comparable performance to XGBoost, with slightly lower AUCs and higher Brier scores. Conclusions XGBoost offers superior performance over conventional tree-based methods and a minor improvement over regression-based methods for predicting COVID-19 mortality risk in the study population.
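
A simplified sketch of the repeated split-sample validation described above follows, assuming the xgboost package is installed and using a synthetic rare-outcome dataset in place of the Toronto case data, which are not public. Only two of the compared methods (XGBoost and logistic regression) are shown, evaluated on AUC and Brier score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, brier_score_loss
from xgboost import XGBClassifier  # assumes the xgboost package is installed

# Synthetic stand-in: a rare binary outcome loosely mimicking mortality.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.95, 0.05],
                           random_state=1)

results = {"XGBoost": [], "Logistic": []}
for seed in range(10):  # repeated split-sample validation
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              stratify=y, random_state=seed)
    models = {"XGBoost": XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1),
              "Logistic": LogisticRegression(max_iter=2000)}
    for name, m in models.items():
        p = m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
        results[name].append((roc_auc_score(y_te, p), brier_score_loss(y_te, p)))

for name, scores in results.items():
    auc, brier = np.mean(scores, axis=0)
    print(f"{name}: mean AUC={auc:.3f}, mean Brier={brier:.4f}")
```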


2018 ◽  
Vol 226 (4) ◽  
pp. 259-273 ◽  
Author(s):  
Ranjith Vijayakumar ◽  
Mike W.-L. Cheung

Abstract. Machine learning tools are increasingly used in the social sciences and policy fields due to their gains in predictive accuracy. However, little research has been done on how well the models of machine learning methods replicate across samples. We compare machine learning methods with regression on the replicability of variable selection, along with predictive accuracy, using an empirical dataset as well as simulated data with additive, interaction, and non-linear squared terms added as predictors. Methods analyzed include support vector machines (SVM), random forests (RF), multivariate adaptive regression splines (MARS), and the regularized regression variants, least absolute shrinkage and selection operator (LASSO) and elastic net. In simulations with additive and linear interaction terms, machine learning methods performed similarly to regression in replicating predictors; they also performed mostly equal to or below regression on measures of predictive accuracy. In simulations with squared terms, the machine learning methods SVM, RF, and MARS improved predictive accuracy and replicated predictors better than regression. Thus, in simulated datasets, the gap between machine learning methods and regression on predictive measures foreshadowed the gap in variable selection. In replications on the empirical dataset, however, improved prediction by machine learning methods was not accompanied by a visible improvement in the replicability of variable selection. This disparity is explained by the overall explanatory power of the models. When predictors have small effects and noise predominates, improved global measures of prediction in a sample by machine learning methods may not lead to the robust selection of predictors; thus, in the presence of weak predictors and noise, regression remains a useful tool for model building and replication.
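
The toy simulation below, an assumption rather than the authors' design, illustrates the kind of question studied: when a squared term drives the outcome, how consistently does a selection method such as LASSO pick the true predictors across repeated samples from the same population.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(7)
n, p = 300, 10
selected_counts = np.zeros(p + 1)  # p linear terms plus one squared term

for rep in range(50):  # repeated samples standing in for replications
    X = rng.normal(size=(n, p))
    X_sq = X[:, 0] ** 2                       # non-linear squared term
    y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + 0.4 * X_sq + rng.normal(scale=1.0, size=n)
    design = np.column_stack([X, X_sq])       # the model is given the squared feature
    coefs = LassoCV(cv=5).fit(design, y).coef_
    selected_counts += (np.abs(coefs) > 1e-6)

print("selection frequency per predictor:", selected_counts / 50)
```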


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Jessalyn K. Holodinsky ◽  
Amy Y. X. Yu ◽  
Moira K. Kapral ◽  
Peter C. Austin

Abstract Background Hometime, the total number of days a person is living in the community (not in a healthcare institution) in a defined period of time after a hospitalization, is a patient-centred outcome metric increasingly used in healthcare research. Hometime exhibits several properties which make its statistical analysis difficult: it has a highly non-normal distribution, excess zeros, and is bounded by both a lower and upper limit. The optimal methodology for the analysis of hometime is currently unknown. Methods Using administrative data we identified adult patients diagnosed with stroke between April 1, 2010 and December 31, 2017 in Ontario, Canada. 90-day hometime and clinically relevant covariates were determined through administrative data linkage. Fifteen different statistical and machine learning models were fit to the data using a derivation sample. The models’ predictive accuracy and bias were assessed using an independent validation sample. Results Seventy-five thousand four hundred seventy-five patients were identified (divided into a derivation set of 49,402 and a test set of 26,073). In general, the machine learning models had lower root mean square error and mean absolute error than the statistical models. However, some statistical models resulted in lower (or equal) bias than the machine learning models. Most of the machine learning models constrained predicted values between the minimum and maximum observable hometime values but this was not the case for the statistical models. The machine learning models also allowed for the display of complex non-linear interactions between covariates and hometime. No model captured the non-normal bucket shaped hometime distribution. Conclusions Overall, no model clearly outperformed the others. However, it was evident that machine learning methods performed better than traditional statistical methods. Among the machine learning methods, generalized boosting machines using the Poisson distribution as well as random forests regression were the best performing. No model was able to capture the bucket shaped hometime distribution and future research on factors which are associated with extreme values of hometime that are not available in administrative data is warranted.
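
A simplified sketch of this kind of model comparison is given below: a Poisson gradient boosting machine and a random forest fit to a synthetic zero-inflated, bounded outcome loosely mimicking 90-day hometime (the administrative data are not public), with RMSE, MAE, and bias computed on a held-out sample. scikit-learn's HistGradientBoostingRegressor with a Poisson loss is used here as a stand-in for the generalized boosting machine named in the abstract.

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(3)
n = 5000
X = rng.normal(size=(n, 6))                              # stand-in covariates
latent = 60 + 10 * X[:, 0] - 8 * X[:, 1] + rng.normal(scale=15, size=n)
hometime = np.clip(latent, 0, 90)                        # bounded at 0 and 90 days
hometime[rng.random(n) < 0.15] = 0                       # excess zeros

X_tr, X_te, y_tr, y_te = train_test_split(X, hometime, test_size=0.3, random_state=3)

models = {"GBM (Poisson)": HistGradientBoostingRegressor(loss="poisson"),
          "Random forest": RandomForestRegressor(n_estimators=300, random_state=3)}

for name, m in models.items():
    pred = m.fit(X_tr, y_tr).predict(X_te)
    rmse = np.sqrt(mean_squared_error(y_te, pred))
    mae = mean_absolute_error(y_te, pred)
    bias = np.mean(pred - y_te)
    print(f"{name}: RMSE={rmse:.2f}, MAE={mae:.2f}, bias={bias:+.2f}")
```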


Processes ◽  
2021 ◽  
Vol 9 (11) ◽  
pp. 2074
Author(s):  
Yuting Liu ◽  
Mengzhou Bi ◽  
Xuewen Zhang ◽  
Na Zhang ◽  
Guohui Sun ◽  
...  

Casein kinase 2 (CK2) is considered an important target for anti-cancer drugs. Given the structural diversity and broad spectrum of pharmaceutical activities of natural products, numerous studies have been performed to establish them as valuable sources of drugs. However, little work has been done on identifying the structural factors responsible for their inhibitory activity against CK2 with machine learning methods. In this study, classification studies were conducted on 115 natural products as CK2 inhibitors. Seven machine learning methods along with six molecular fingerprints were employed to develop qualitative classification models. The performance of all models was evaluated by cross-validation and a test set. Taking predictive accuracy (CA), the area under the receiver operating characteristic curve (AUC), and the Matthews correlation coefficient (MCC) as three performance indicators, the optimal models with high reliability and predictive ability were obtained, including the Extended fingerprint with logistic regression (CA = 0.859, AUC = 0.826, MCC = 0.520) on the training set and the PubChem fingerprint with an artificial neural network (CA = 0.826, AUC = 0.933, MCC = 0.628) on the test set. Meanwhile, the privileged substructures responsible for inhibitory activity against CK2 were also identified through a combination of frequency analysis and information gain. The results are expected to provide useful information for the further utilization of natural products and the discovery of novel CK2 inhibitors.
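
A skeletal version of such a fingerprint-based classification workflow is sketched below, assuming RDKit and scikit-learn are installed. Morgan fingerprints and toy molecules with arbitrary labels stand in for the paper's Extended/PubChem fingerprints and its 115 curated CK2 inhibitors; only the CA, AUC, and MCC reporting step mirrors the abstract.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score, matthews_corrcoef

# Toy molecules with arbitrary 0/1 labels; the paper used 115 curated natural products.
smiles = ["CC(=O)OC1=CC=CC=C1C(=O)O", "CN1C=NC2=C1C(=O)N(C)C(=O)N2C",
          "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O", "CC(=O)Nc1ccc(O)cc1",
          "CN1CCC[C@H]1c1cccnc1", "c1ccc2ccccc2c1",
          "Oc1ccccc1", "Cc1ccccc1", "CCO", "c1ccccc1"]
labels = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

def morgan_bits(smi, n_bits=1024):
    """Compute a binary Morgan (ECFP-like) fingerprint as a numpy vector."""
    mol = Chem.MolFromSmiles(smi)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
    arr = np.zeros((n_bits,), dtype=int)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

X = np.array([morgan_bits(s) for s in smiles])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.4,
                                          stratify=labels, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
prob = clf.predict_proba(X_te)[:, 1]
pred = clf.predict(X_te)
print("CA =", accuracy_score(y_te, pred),
      "AUC =", roc_auc_score(y_te, prob),
      "MCC =", matthews_corrcoef(y_te, pred))
```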


2019 ◽  
Vol 9 (1) ◽  
Author(s):  
Micah Cearns ◽  
Tim Hahn ◽  
Bernhard T. Baune

Abstract Machine learning methods hold promise for personalized care in psychiatry, demonstrating the potential to tailor treatment decisions and stratify patients into clinically meaningful taxonomies. Accordingly, the number of publications applying machine learning methods has risen, with different data modalities, mathematically distinct models, and samples of varying size being used to train and test models with the promise of clinical translation. Consequently, and in part due to the preliminary nature of such works, many studies have reported widely varying degrees of accuracy, raising concerns over systematic overestimation and methodological inconsistencies. Furthermore, a lack of procedural evaluation guidelines for non-expert medical professionals and funding bodies leaves many in the field with no means to systematically evaluate the claims, maturity, and clinical readiness of a project. Given the potential of machine learning methods to transform patient care, albeit contingent on the rigor of the employed methods and their dissemination, we deem it necessary to provide a review of current methods, recommendations, and future directions for applied machine learning in psychiatry. In this review we cover issues of best practice for model training and evaluation, sources of systematic error and overestimation, model explainability versus trust, the clinical implementation of AI systems, and, finally, future directions for our field.
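
One concrete best-practice point raised by such reviews, avoiding optimistic accuracy estimates caused by tuning and evaluating on the same data, can be illustrated with nested cross-validation. The dataset, classifier, and hyperparameter grid below are illustrative assumptions, not taken from the review.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in data.
X, y = make_classification(n_samples=400, n_features=30, n_informative=5, random_state=0)

inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)   # tuning folds
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)   # evaluation folds

# Hyperparameter search stays inside the training folds of the outer loop.
tuned_svm = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}, cv=inner)
scores = cross_val_score(tuned_svm, X, y, cv=outer)                 # outer, untouched estimate

print("nested CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```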


Author(s):  
Hongbin Liu ◽  
Jinyuan Jia ◽  
Neil Zhenqiang Gong

Differentially private machine learning trains models while protecting the privacy of the sensitive training data. The key to obtaining differentially private models is to introduce noise/randomness into the training process. In particular, existing differentially private machine learning methods add noise to the training data, the gradients, the loss function, and/or the model itself. Bagging, a popular ensemble learning framework, randomly creates subsamples of the training data, trains a base model on each subsample using a base learner, and takes a majority vote among the base models when making predictions. Bagging has intrinsic randomness in the training process because it randomly creates subsamples. Our major theoretical results show that this intrinsic randomness already makes Bagging differentially private without the need for additional noise. Moreover, we prove that if no assumptions about the base learner are made, our derived privacy guarantees are tight. We empirically evaluate Bagging on MNIST and CIFAR10. Our experimental results demonstrate that Bagging achieves significantly higher accuracies than state-of-the-art differentially private machine learning methods with the same privacy budgets.
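
To make the Bagging procedure concrete, a brief sketch follows: random subsamples of the training data, one base model per subsample, and a majority vote at prediction time. The subsample size, number of base models, base learner, and the digits dataset (a small stand-in for MNIST/CIFAR10) are illustrative choices, and the paper's privacy accounting is not reproduced here.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)          # small stand-in for MNIST/CIFAR10
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
n_models, subsample_size = 25, 200
base_models = []
for _ in range(n_models):
    idx = rng.choice(len(X_tr), size=subsample_size, replace=False)  # random subsample
    base_models.append(DecisionTreeClassifier().fit(X_tr[idx], y_tr[idx]))

# Majority vote across the base models' predictions.
votes = np.stack([m.predict(X_te) for m in base_models])
pred = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

print("bagged accuracy:", (pred == y_te).mean())
```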

