scholarly journals ANALISA DAN PREDIKSI IKLAN LOWONGAN KERJA PALSU DENGAN METODE NATURAL LANGUAGE PROGRAMING DAN MACHINE LEARNING

2021 ◽  
Vol 21 (1) ◽  
pp. 14-22
Author(s):  
Hary Sabita ◽  
Fitria Fitria ◽  
Riko Herwanto

This research was conducted using the data provided by Kaggle. This data contains features that describe job vacancies. This study used location-based data in the US, which covered 60% of all data. Job vacancies that are posted are categorized as real or fake. This research was conducted by following five stages, namely: defining the problem, collecting data, cleaning data (exploration and pre-processing) and modeling. The evaluation and validation models use Naïve Bayes as a baseline model and Small Group Discussion as end model. For the Naïve Bayes model, an accuracy value of 0.971 and an F1-score of 0.743 is obtained. While the Stochastic Gradient Descent obtained an accuracy value of 0.977 and an F1-score of 0.81. These final results indicate that SGD performs slightly better than Naïve Bayes.Keywords—NLP, Machine Learning, Naïve Bayes, SGD, Fake Jobs

2018 ◽  
Vol 5 (5) ◽  
pp. 567 ◽  
Author(s):  
Irvi Oktanisa ◽  
Ahmad Afif Supianto

<p class="Abstrak">Klasifikasi merupakan teknik dalam <em>data mining</em> untuk mengelompokkan data berdasarkan keterikatan data terhadap  data sampel. Pada penelitian ini, kami melakukan perbandingan 9 teknik klasifikasi untuk mengklasifikasi respon pelanggan pada <em>dataset Bank Direct Marketing</em>. Perbandingan teknik klasifikasi ini dilakukan untuk mengetahui model dalam teknik klasfikasi yang paling efektif untuk mengklasifikasi target pada <em>dataset Bank Direct Marketing</em>. Teknik klasifikasi yang digunakan yaitu <em>Support Vector Machine</em>, <em>AdaBoost</em>, <em>Naïve Bayes</em>, <em>Constant, KNN, Tree, Random Forest, Stochastic Gradient Descent</em>, dan <em>CN2 Rule</em>. Proses klasifikasi diawali dengan <em>preprocessing</em> data untuk melakukan penghilangan <em>missing value</em> dan pemilihan fitur pada <em>dataset</em>. Pada tahap evaluasi digunakan teknik <em>10 fold cross validation</em>. Setelah dilakukan pengujian, didapatkan bahwa hasil klasifikasi menunjukkan akurasi terbaik diperoleh oleh model <em>Tree, Constant</em>, <em>Naive Bayes</em>, dan <em>Stochastic Gardient Descent</em>. Kemudian diikuti oleh model <em>Random Forest</em>, <em>K-Nearest Neighbor</em>, <em>CN-2 Rule</em>, <em>AdaBoost</em> dan <em>Support Vector Machine</em>. Dari keempat model yang menunjukkan hasil akurasi terbaik, untuk kasus ini <em>Stochastic Gradient Descent</em> terpilih sebagai model yang memiliki akurasi terbaik dengan nilai akurasi sebesar 0,972 dan hasil visualisasi yang dihasilkan lebih jelas untuk mengklasifikasi target pada <em>dataset Bank Direct Marketing</em>.</p><p class="Abstrak"><em><strong><br /></strong></em></p><p class="Abstrak"><em><strong>Abstract</strong></em></p>Classification is a technique in data mining to classify data based on the attachment of data to the sample data.. In this paper, we present the comparison of  9 classification techniques performed to classify customer response on the dataset of Bank Direct Marketing. The techniques performed to find out the effectiveness model in the classification technique used to classify targets on the dataset of Bank Direct Marketing. The techniques used are Support Vector Machine, AdaBoost, Naïve Bayes, Constant, KNN, Tree, Random Forest, Stochastic Gradient Descent, and CN2 Rule. The classification process begins with preprocessing data to perform missing value omissions and feature selection on the dataset. Cross validation technique, with k value is 10, used in the evaluation stage. After testing, it was found that the classification results showed the best accuracy obtained when using the Tree model, Constant, Naive Bayes and Stochastic Gradient Descent. Afterwards the Random Forest model, K-Nearest Neighbor, CN-2 Rule, AdaBoost, and Support Vector Machine are followed. Of the four models with the high accuracy results, in this case Stochastic Gradient Descent was selected as the best accuracy model with an accuracy value of 0.972 and resulting visualization more clearly to classify targets on the dataset of Bank Direct Marketing.


2021 ◽  
Vol 7 (2) ◽  
pp. 112
Author(s):  
Shanto Moyrano Tambunan ◽  
Yessica Nataliani ◽  
Elizabeth Sri Lestari

Perkembangan teknologi tidak luput dari dampak negatif, salah satunya hoaks. Twitter menjadi salah satu media sosial yang paling aktif digunakan sebagai pertukaran informasi, komunikasi, dan hiburan. Oleh karena itu pengguna Twitter dapat menyebarkan berita atau hoaks dengan mudah. Penelitian ini bertujuan mengidentifikasi tweet yang berisi informasi hoaks maupun valid menggunakan pembelajaran mesin. Algoritma yang digunakan adalah Stochastic Gradient Descent, Naïve Bayes, Random Forest, dan Rocchio. Keempat algoritma tersebut dibandingkan untuk kemudian dicari hasil terbaik dalam mengidentifikasi dan memverifikasi tweet di Twitter yang berisi hoaks atau informasi valid secara otomatis. Kata kunci yang digunakan adalah Corona, Mutasi Corona, PSBB, Dana Bansos, Dana Otsus, Utang Pemerintah, dan Sekolah Tatap Muka sebanyak 898 tweet. Data dikelompokkan berdasarkan kelas hoaks dan valid lalu diolah menjadi dataset dengan melewati tahap pra-proses hingga pembobotan kata dengan TF-IDF. Hasil pengujian menunjukkan algoritma Stochastic Gradient Descent merupakan algoritma terbaik dengan hasil akurasi rata-rata sebesar 84.92%. Pengujian lanjutan dilakukan dengan menghitung nilai presisi, recall, dan F-1. Hasil presisi terbaik sebesar 82.95% pada algoritma Naïve Bayes, sedangkan hasil recall dan F-1 terbaik didapat dari algoritma Stochastic Gradient Descent sebesar 85.05% dan 82.42%.


2020 ◽  
Vol 4 (2) ◽  
pp. 1-9
Author(s):  
Veronica Sari ◽  
◽  
Feranandah Firdausi ◽  
Yufis Azhar ◽  
◽  
...  

Classification is one of the techniques that exist in data mining and is useful for grouping a data based on the attachment of the data with the sample data. The dataset that is used in this study is the coffee dataset taken from Dataset Coffee Quality Institute on the GitHub platform. The attributes that contained in the dataset are Aroma, Aftertaste, Flavor, Acidity, Balance, Body, Uniformity, Sweetness, Clean Cup, and Copper points. There are 3 classification methods that are used in this study, Stochastic Gradient Descent, Random Forest and Naive Bayes. The aim of this study is to find out which algorithm is the most effective to predict the coffee quality in the dataset. After that, the prediction results will be tested using K-Fold Cross Validation and Area Under the Curve (AUC) method. The results show that Stochastic Gradient Descent obtained the best accuracy results compared to the other two methods with an accuracy of 98% and increased to 99% after tested using K-fold Cross Validation and AUC method.


Author(s):  
Parastoo Golpour ◽  
Majid Ghayour-Mobarhan ◽  
Azadeh Saki ◽  
Habibollah Esmaily ◽  
Ali Taghipour ◽  
...  

(1) Background: Coronary angiography is considered to be the most reliable method for the diagnosis of cardiovascular disease. However, angiography is an invasive procedure that carries a risk of complications; hence, it would be preferable for an appropriate method to be applied to determine the necessity for angiography. The objective of this study was to compare support vector machine, naïve Bayes and logistic regressions to determine the diagnostic factors that can predict the need for coronary angiography. These models are machine learning algorithms. Machine learning is considered to be a branch of artificial intelligence. Its aims are to design and develop algorithms that allow computers to improve their performance on data analysis and decision making. The process involves the analysis of past experiences to find practical and helpful regularities and patterns, which may also be overlooked by a human. (2) Materials and Methods: This cross-sectional study was performed on 1187 candidates for angiography referred to Ghaem Hospital, Mashhad, Iran from 2011 to 2012. A logistic regression, naive Bayes and support vector machine were applied to determine whether they could predict the results of angiography. Afterwards, the sensitivity, specificity, positive and negative predictive values, AUC (area under the curve) and accuracy of all three models were computed in order to compare them. All analyses were performed using R 3.4.3 software (R Core Team; Auckland, New Zealand) with the help of other software packages including receiver operating characteristic (ROC), caret, e1071 and rminer. (3) Results: The area under the curve for logistic regression, naïve Bayes and support vector machine were similar—0.76, 0.74 and 0.75, respectively. Thus, in terms of the model parsimony and simplicity of application, the naïve Bayes model with three variables had the best performance in comparison with the logistic regression model with seven variables and support vector machine with six variables. (4) Conclusions: Gender, age and fasting blood glucose (FBG) were found to be the most important factors to predict the result of coronary angiography. The naïve Bayes model performed well using these three variables alone, and they are considered important variables for the other two models as well. According to an acceptable prediction of the models, they can be used as pragmatic, cost-effective and valuable methods that support physicians in decision making.


2021 ◽  
Author(s):  
Graeme Hart ◽  
Michael Woodburn ◽  
Nada Marhoon ◽  
Alan Pritchard ◽  
Jeff Feldman ◽  
...  

BACKGROUND Background: Quality Assurance activities are frequently dependent on manual assessment of text-based records. Increasingly, these records have digital structures that may be amenable to computer analysis. We used the Australian Commission for Safety and Quality in Healthcare (ACSQHC) National Clinical Care Colonoscopy standard reporting requirement as a proof of concept for an analytics process to streamline and reduce manual reporting overheads. The endoscopy unit performs approximately 4,500 colonoscopies (mainly outpatient) per year. Quarterly reporting of colonoscopy outcomes requires approximately 30 hours of manual data abstraction, collation and combination from a variety of electronic databases. The most time consuming is manual retrieval and abstraction of histopathology records from the EMR. OBJECTIVE 1. To reduce the manual overheads of quarterly National Standards KPI reporting for colonoscopy compliance using an automated data pipeline and Artificial Intelligence tools. 2. The service also wished to minimise the risk of failure to follow up in new cancer diagnoses for outpatient colonoscopies. 3. To develop a data and analytic pipeline that would be easily re-purposed for additional standards, audit and research projects. METHODS A data pipeline and analysis environment were established in the hospitals’ secure Microsoft Azure databricks resource. A Training data set of 1000 colonoscopies was extracted using from the procedural Provation database using the the ProvationMD ® reporting tool and linked to relevant histopathology reports provided from the Clinical Research Data Warehouse (CRDW). The Machine Learning (ML) training data set was created when histopathological reports were manually coded by Gastroenterology Registrars & nurses into the following categories: Adenoma Clinically Significant Sessile Serrated Adenoma Cancer Adequate Bowel Preparation Complete examination A variety of Natural Language Processing (NLP) & ML models were assessed and refined to minimize error rate. Sensitivity was prioritised for the diagnosis of Cancer to minimize missed cases. Reporting to clinicians and quality co-ordinators was established using Microsoft Power BI. RESULTS The Naïve Bayes model for multinomial data resulted in high accuracy, but impacted recall. Sensitivity improved using a virtual ensemble approach, layering models within the processing pipeline and maximised using Microsoft’s ® Text Analytics – Healthcare NLP model with our custom Naïve Bayes model. F1 scores between 0.89 and 0.93 were achieved. The algorithm checks daily for new data and performs the analysis. Quarterly analysis and reporting time decreased from 30 hours to less than 5 minutes and reports can now be continuously updated in the Microsoft Power BI reporting portal. CONCLUSIONS Advanced analytic techniques can be deployed for mandatory quality reporting in a secure, cloud based, hospital data domain. The cost was far less than the manual processes it replaces. Reporting is more timely as it is automated. The potential for training such algorithms for other QA reporting is high. Text based research and audit within the free text domain of the EMR clinical documentation also becomes possible. CLINICALTRIAL Not applicable


Information ◽  
2021 ◽  
Vol 12 (5) ◽  
pp. 204
Author(s):  
Charlyn Villavicencio ◽  
Julio Jerison Macrohon ◽  
X. Alphonse Inbaraj ◽  
Jyh-Horng Jeng ◽  
Jer-Guang Hsieh

A year into the COVID-19 pandemic and one of the longest recorded lockdowns in the world, the Philippines received its first delivery of COVID-19 vaccines on 1 March 2021 through WHO’s COVAX initiative. A month into inoculation of all frontline health professionals and other priority groups, the authors of this study gathered data on the sentiment of Filipinos regarding the Philippine government’s efforts using the social networking site Twitter. Natural language processing techniques were applied to understand the general sentiment, which can help the government in analyzing their response. The sentiments were annotated and trained using the Naïve Bayes model to classify English and Filipino language tweets into positive, neutral, and negative polarities through the RapidMiner data science software. The results yielded an 81.77% accuracy, which outweighs the accuracy of recent sentiment analysis studies using Twitter data from the Philippines.


2014 ◽  
Vol 2014 ◽  
pp. 1-16 ◽  
Author(s):  
Qingchao Liu ◽  
Jian Lu ◽  
Shuyan Chen ◽  
Kangjia Zhao

This study presents the applicability of the Naïve Bayes classifier ensemble for traffic incident detection. The standard Naive Bayes (NB) has been applied to traffic incident detection and has achieved good results. However, the detection result of the practically implemented NB depends on the choice of the optimal threshold, which is determined mathematically by using Bayesian concepts in the incident-detection process. To avoid the burden of choosing the optimal threshold and tuning the parameters and, furthermore, to improve the limited classification performance of the NB and to enhance the detection performance, we propose an NB classifier ensemble for incident detection. In addition, we also propose to combine the Naïve Bayes and decision tree (NBTree) to detect incidents. In this paper, we discuss extensive experiments that were performed to evaluate the performances of three algorithms: standard NB, NB ensemble, and NBTree. The experimental results indicate that the performances of five rules of the NB classifier ensemble are significantly better than those of standard NB and slightly better than those of NBTree in terms of some indicators. More importantly, the performances of the NB classifier ensemble are very stable.


2020 ◽  
Vol 1 (2) ◽  
pp. 61-66
Author(s):  
Febri Astiko ◽  
Achmad Khodar

This study aims to design a machine learning model of sentiment analysis on Indosat Ooredoo service reviews on social media twitter using the Naive Bayes algorithm as a classifier of positive and negative labels. This sentiment analysis uses machine learning to get patterns an model that can be used again to predict new data.


Sign in / Sign up

Export Citation Format

Share Document