Spam classification: a comparative analysis of different boosted decision tree approaches

2018 ◽  
Vol 20 (3) ◽  
pp. 298-105 ◽  
Author(s):  
Shrawan Kumar Trivedi ◽  
Prabin Kumar Panigrahi

Purpose: Email spam classification is now becoming a challenging area in the domain of text classification. Precise and robust classifiers are judged not only by classification accuracy but also by sensitivity (correctly classified legitimate emails) and specificity (correctly classified unsolicited emails), captured by both false positive and false negative rates. This paper aims to present a comparative study between various decision tree classifiers (such as AD tree, decision stump and REP tree) with/without different boosting algorithms (bagging, boosting with re-sample and AdaBoost). Design/methodology/approach: Artificial intelligence and text mining approaches have been incorporated in this study. Each decision tree classifier in this study is tested on informative words/features selected from the two publicly available data sets (SpamAssassin and LingSpam) using a greedy step-wise feature search method. Findings: Outcomes of this study show that without boosting, the REP tree provides high performance accuracy with the AD tree ranking as the second-best performer. Decision stump is found to be the under-performing classifier of this study. However, with boosting, the combination of REP tree and AdaBoost compares favourably with other classification models. If the metrics false positive rate and performance accuracy are taken together, AD tree and REP tree with AdaBoost were both found to carry out an effective classification task. Greedy stepwise has proven its worth in this study by selecting a subset of valuable features to identify the correct class of emails. Research limitations/implications: This research is focussed on the classification of spam emails that are written in the English language only. The proposed models work with content (words/features) of email data that is mostly found in the body of the mail. Image spam has not been included in this study. Other messages such as short message service or multi-media messaging service were not included in this study. Practical implications: In this research, a boosted decision tree approach has been proposed and used to classify email spam and ham files; this is found to be a highly effective approach in comparison with other state-of-the-art models used in other studies. This classifier may be tested for different applications and may provide new insights for developers and researchers. Originality/value: A comparison of decision tree classifiers with/without ensemble has been presented for spam classification.
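To illustrate why boosting can lift a weak learner such as a decision stump, the following is a minimal sketch of AdaBoost over single-feature stumps. It is not the authors' experimental setup: the three word-presence features, the eight toy messages and all function names (`train_stump`, `adaboost`, `predict`, `accuracy`) are hypothetical, and the labels simply mark a message as spam when at least two of the three words are present.

```python
import math

def stump_output(row, feature, polarity):
    """A decision stump: vote `polarity` if the word is present, else the opposite."""
    return polarity if row[feature] else -polarity

def train_stump(X, y, weights):
    """Return (weighted_error, feature, polarity) of the best single-feature stump."""
    best = None
    for feature in range(len(X[0])):
        for polarity in (1, -1):
            error = sum(w for w, row, label in zip(weights, X, y)
                        if stump_output(row, feature, polarity) != label)
            if best is None or error < best[0]:
                best = (error, feature, polarity)
    return best

def adaboost(X, y, rounds):
    """Fit `rounds` stumps, re-weighting misclassified examples after each round."""
    weights = [1.0 / len(X)] * len(X)
    ensemble = []
    for _ in range(rounds):
        error, feature, polarity = train_stump(X, y, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, feature, polarity))
        # Increase the weight of mistakes, decrease the weight of correct cases.
        weights = [w * math.exp(-alpha * label * stump_output(row, feature, polarity))
                   for w, row, label in zip(weights, X, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, row):
    """Weighted vote of all stumps: +1 means spam, -1 means ham."""
    score = sum(alpha * stump_output(row, feature, polarity)
                for alpha, feature, polarity in ensemble)
    return 1 if score >= 0 else -1

def accuracy(ensemble, X, y):
    return sum(predict(ensemble, row) == label for row, label in zip(X, y)) / len(X)

# Toy data (hypothetical): presence of three words, e.g. "free", "winner", "offer".
X = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 0, 0),
     (0, 1, 1), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
y = [1 if sum(row) >= 2 else -1 for row in X]

single = accuracy(adaboost(X, y, rounds=1), X, y)
boosted = accuracy(adaboost(X, y, rounds=3), X, y)
print(single, boosted)
```

On this toy set no single stump can separate the classes, while three boosted stumps form a weighted majority vote that does; real spam corpora would of course need held-out evaluation rather than training accuracy.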

2020 ◽  
Vol 4 (Supplement_1) ◽  
pp. 259-260
Author(s):  
Laura Curtis ◽  
Lauren Opsasnick ◽  
Julia Yoshino Benavente ◽  
Cindy Nowinski ◽  
Rachel O’Conor ◽  
...  

Abstract Early detection of cognitive impairment (CI) is imperative to identify potentially treatable underlying conditions or provide supportive services when due to progressive conditions such as Alzheimer’s Disease. While primary care settings are ideal for identifying CI, it frequently goes undetected. We developed ‘MyCog’, a brief technology-enabled, 2-step assessment to detect CI and dementia in primary care settings. We piloted MyCog in 80 participants 65 and older recruited from an ongoing cognitive aging study. Cases were identified either by a documented diagnosis of dementia or mild cognitive impairment (MCI) or based on a comprehensive cognitive battery. Administered via an iPad, Step 1 consists of a single self-report item indicating concern about memory or other thinking problems and Step 2 includes two cognitive assessments from the NIH Toolbox: Picture Sequence Memory (PSM) and Dimensional Change Card Sorting (DCCS). Overall, 39% (31/80) of participants were considered cognitively impaired. Classifying by expressed concern in Step 1 alone (n=52, 66%) resulted in a 37% false positive and 3% false negative rate. With the addition of the PSM and DCCS assessments in Step 2, the paradigm demonstrated 91% sensitivity, 75% specificity and an area under the ROC curve (AUC)=0.82. Steps 1 and 2 had an average administration time of <7 minutes. We continue to optimize MyCog by 1) examining additional items for Step 1 to reduce the false positive rate and 2) creating a self-administered version to optimize use in clinical settings. With further validation, MyCog offers a practical, scalable paradigm for the routine detection of cognitive impairment and dementia.


2021 ◽  
Vol 0 (0) ◽  
Author(s):  
Çiğdem Karakükcü ◽  
Mehmet Zahid Çıracı ◽  
Derya Kocer ◽  
Mine Yüce Faydalı ◽  
Muhittin Abdulkadir Serdar

Abstract Objectives To obtain optimal immunoassay screening and LC-MS/MS confirmation cut-offs for opiate group tests to reduce false positive (FP) and false negative (FN) rates. Methods A total of 126 urine samples (50 opiate screening negative and 76 positive according to the threshold of 300 ng/mL by the CEDIA method) were confirmed by a fully validated in-house LC-MS/MS method. Sensitivity, specificity, FP, and FN rates were determined at cut-off concentrations of both 300 and 2,000 ng/mL for morphine and codeine, and 10 ng/mL for heroin metabolite 6-mono-acetyl-morphine (6-MAM). Results All CEDIA opiate negative urine samples were negative for morphine, codeine and 6-MAM. Although sensitivity was 100% for each cut-off, specificity was 54.9% at CEDIA cut-off 300 ng/mL vs. LC-MS/MS cut-off 300 ng/mL and 75% at CEDIA cut-off 2,000 ng/mL vs. LC-MS/MS cut-off 2,000 ng/mL. False positive rate was highest (45.1%) at CEDIA cut-off 300 ng/mL. At CEDIA cut-off 2,000 ng/mL vs. LC-MS/MS cut-off 300 ng/mL, specificity increased to 82.4% and FP rate decreased to 17.6%. All 6-MAM positive samples had CEDIA concentration ≥2,000 ng/mL. Conclusions 2,000 ng/mL for screening and 300 ng/mL for confirmation cut-offs are the most efficient thresholds for the lowest rate of FP opiate results.
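The rates reported at the 300 ng/mL cut-offs follow from a 2x2 table of screening result against confirmation result. As a minimal sketch, the counts below are a reconstruction and are not stated in the abstract: the 50 screen-negative samples are taken as 50 true negatives with no false negatives (all were confirmed negative), and the 76 screen-positive samples are assumed to split into 35 true positives and 41 false positives, a split chosen only because it reproduces the reported 54.9% specificity. The function name `screening_rates` is hypothetical.

```python
def screening_rates(tp, fp, tn, fn):
    """Return sensitivity, specificity, FP rate and FN rate as percentages."""
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "fp_rate": 100.0 - specificity,  # FP rate is the complement of specificity
        "fn_rate": 100.0 - sensitivity,  # FN rate is the complement of sensitivity
    }

# Assumed counts (reconstructed from the reported percentages, not reported data).
rates = screening_rates(tp=35, fp=41, tn=50, fn=0)
print({name: round(value, 1) for name, value in rates.items()})
```

Under these assumed counts the rounded values match the abstract's 100% sensitivity, 54.9% specificity and 45.1% FP rate, which shows how raising the screening cut-off trades false positives for specificity without changing the arithmetic.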


1997 ◽  
Vol 22 (5) ◽  
pp. 653-655
Author(s):  
J. M. SOLER-MINOVES ◽  
J. GONZALEZ-USTES ◽  
R. PÉREZ ◽  
M. GIFREU ◽  
A. M. GALLART

We carried out X-rays and computed tomography in 59 wrists in patients who had previous surgical intercarpal fusions. 1.2 mm thick axial images were obtained perpendicular to the axis of the joint. CT showed whether or not the carpal fusions were united. Compared with CT, plain radiography yielded a 25% false negative and 6% false positive rate. We conclude that CT is more useful than plain X-rays for evaluating partial carpal arthrodesis.


2018 ◽  
Vol 29 (4) ◽  
pp. 435-441 ◽  
Author(s):  
Kazuyoshi Kobayashi ◽  
Kei Ando ◽  
Ryuichi Shinjo ◽  
Kenyu Ito ◽  
Mikito Tsushima ◽  
...  

OBJECTIVE: Monitoring of brain evoked muscle-action potentials (Br(E)-MsEPs) is a sensitive method that provides accurate periodic assessment of neurological status. However, occasionally this method gives a relatively high rate of false-positives, and thus hinders surgery. The alarm point is often defined based on a particular decrease in amplitude of a Br(E)-MsEP waveform, but waveform latency has not been widely examined. The purpose of this study was to evaluate onset latency in Br(E)-MsEP monitoring in spinal surgery and to examine the efficacy of an alarm point using a combination of amplitude and latency. METHODS: A single-center, retrospective study was performed in 83 patients who underwent spine surgery using intraoperative Br(E)-MsEP monitoring. A total of 1726 muscles in extremities were chosen for monitoring, and acceptable baseline Br(E)-MsEP responses were obtained from 1640 (95%). Onset latency was defined as the period from stimulation until the waveform was detected. Relationships of postoperative motor deficit with onset latency alone and in combination with a decrease in amplitude of ≥ 70% from baseline were examined. RESULTS: Nine of the 83 patients had postoperative motor deficits. The delay of onset latency compared to the control waveform differed significantly between patients with and without these deficits (1.09% ± 0.06% vs 1.31% ± 0.14%, p < 0.01). In ROC analysis, an intraoperative 15% delay in latency from baseline had a sensitivity of 78% and a specificity of 96% for prediction of postoperative motor deficit. In further ROC analysis, a combination of a decrease in amplitude of ≥ 70% and delay of onset latency of ≥ 10% from baseline had sensitivity of 100%, specificity of 93%, a false positive rate of 7%, a false negative rate of 0%, a positive predictive value of 64%, and a negative predictive value of 100% for this prediction. CONCLUSIONS: In spinal cord monitoring with intraoperative Br(E)-MsEP, an alarm point using a decrease in amplitude of ≥ 70% and delay in onset latency of ≥ 10% from baseline has high specificity that reduces false positive results.
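The combined alarm point described above can be sketched as a rule that fires only when both criteria are met, which is what makes it more specific than an amplitude criterion alone. The function names are hypothetical, and the patient counts are a reconstruction rather than reported data: all 9 patients with deficits are assumed flagged (100% sensitivity), with 5 false alarms among the 74 patients without deficits, a split chosen because it reproduces the reported 93% specificity and 64% positive predictive value.

```python
def combined_alarm(amplitude_drop_pct, latency_delay_pct,
                   amplitude_threshold=70.0, latency_threshold=10.0):
    """Alarm only when both the amplitude decrease and the onset-latency delay
    from baseline (both in percent) reach their thresholds."""
    return (amplitude_drop_pct >= amplitude_threshold
            and latency_delay_pct >= latency_threshold)

def predictive_values(tp, fp, tn, fn):
    """Return positive and negative predictive values as percentages."""
    ppv = 100.0 * tp / (tp + fp)
    npv = 100.0 * tn / (tn + fn)
    return ppv, npv

print(combined_alarm(75.0, 12.0))  # both criteria met
print(combined_alarm(75.0, 4.0))   # amplitude criterion alone does not trigger

# Assumed counts (reconstructed from the reported percentages, not reported data).
ppv, npv = predictive_values(tp=9, fp=5, tn=69, fn=0)
print(round(ppv), round(npv))
```

With only 9 affected patients, a handful of false alarms is enough to pull the positive predictive value down to about 64% even at 93% specificity, which is why the abstract reports both figures.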


1989 ◽  
Vol 75 (2) ◽  
pp. 156-162 ◽  
Author(s):  
Sandro Sulfaro ◽  
Francesco Querin ◽  
Luigi Barzan ◽  
Mario Lutman ◽  
Roberto Comoretto ◽  
...  

Sixty-six whole-organ sectioned laryngopharyngectomy specimens removed for cancer during a seven-year period were uniformly examined to determine the accuracy of preoperative high resolution computerized tomography (CT) for detection of cartilaginous involvement. Our results indicate that CT has a high overall specificity (88.2%) but a low sensitivity (47.1%); we observed a high false-negative rate (26.5%) and a fairly low false-positive rate (5.9%). Massive cartilage destruction was easily assessed by CT, whereas both small macroscopic and microscopic neoplastic foci of cartilaginous invasion were missed on CT scans. Moreover, false-positive cases were mainly due to proximity of the tumor to the cartilage. Clinical implications of these results are discussed.


2020 ◽  
Vol 6 (1) ◽  
pp. 10 ◽  
Author(s):  
Dawn S. Peck ◽  
Jean M. Lacey ◽  
Amy L. White ◽  
Gisele Pino ◽  
April L. Studinski ◽  
...  

Enzyme-based newborn screening for Mucopolysaccharidosis type I (MPS I) has a high false-positive rate due to the prevalence of pseudodeficiency alleles, often resulting in unnecessary and costly follow up. The glycosaminoglycans (GAGs), dermatan sulfate (DS) and heparan sulfate (HS) are both substrates for α-l-iduronidase (IDUA). These GAGs are elevated in patients with MPS I and have been shown to be promising biomarkers for both primary and second-tier testing. Since February 2016, we have measured DS and HS in 1213 specimens submitted on infants at risk for MPS I based on newborn screening. Molecular correlation was available for 157 of the tested cases. Samples from infants with MPS I confirmed by IDUA molecular analysis all had significantly elevated levels of DS and HS compared to those with confirmed pseudodeficiency and/or heterozygosity. Analysis of our testing population and correlation with molecular results identified few discrepant outcomes and uncovered no evidence of false-negative cases. We have demonstrated that blood spot GAGs analysis accurately discriminates between patients with confirmed MPS I and false-positive cases due to pseudodeficiency or heterozygosity and increases the specificity of newborn screening for MPS I.


Sensor Review ◽  
2019 ◽  
Vol 39 (1) ◽  
pp. 107-120 ◽  
Author(s):  
Deepika Kishor Nagthane ◽  
Archana M. Rajurkar

Purpose: One of the main reasons for the increase in mortality rate in women is breast cancer. Accurate early detection of breast cancer seems to be the only solution for diagnosis. In the field of breast cancer research, many new computer-aided diagnosis systems have been developed to reduce the diagnostic test false positives because of the subtle appearance of breast cancer tissues. The purpose of this study is to develop the diagnosis technique for breast cancer using LCFS and TreeHiCARe classifier model. Design/methodology/approach: The proposed diagnosis methodology initiates with the pre-processing procedure. Subsequently, feature extraction is performed. In feature extraction, the image features which preserve the characteristics of the breast tissues are extracted. Feature selection is then performed by the proposed least-mean-square (LMS)-Cuckoo search feature selection (LCFS) algorithm. The feature selection from the vast range of the features extracted from the images is performed with the help of the optimal cut point provided by the LCS algorithm. Then, the image transaction database table is developed using the keywords of the training images and feature vectors. The transaction resembles the itemset and the association rules are generated from the transaction representation based on the a priori algorithm with high conviction ratio and lift. After association rule generation, the proposed TreeHiCARe classifier model is applied in the diagnosis methodology. In the TreeHiCARe classifier, a new feature index is developed for the selection of a central feature for the decision tree, centered on which the classification of images into normal or abnormal is performed. Findings: The performance of the proposed method is validated over existing works using accuracy, sensitivity and specificity measures. The experimentation of the proposed method on the Mammographic Image Analysis Society database resulted in classification of normal and abnormal cancerous mammogram images with an accuracy of 0.8289, sensitivity of 0.9333 and specificity of 0.7273. Originality/value: This paper proposes a new approach for the breast cancer diagnosis system by using mammogram images. The proposed method uses two new algorithms: LCFS and TreeHiCARe. LCFS is used to select optimal feature split points, and TreeHiCARe is the decision tree classifier model based on association rule agreements.


Author(s):  
S. Neelakandan ◽  
D. Paulraj

People communicate their views, arguments and emotions about their everyday life on social media (SM) platforms (e.g. Twitter and Facebook). Twitter is an international micro-blogging service built around brief messages called tweets. Freestyle writing, incorrect grammar, typographical errors and abbreviations are some of the noise that occurs in the text. Sentiment analysis (SA) of tweets posted by users, together with opinion mining (OM) of customer reviews, is a well-known research topic. The texts are gathered from users’ tweets by means of OM and automatic SA centered on ternary classification, namely positive, neutral and negative. It is very challenging for researchers to ascertain sentiment in Twitter data because of its limited size, misspellings, unstructured nature, abbreviations and slang. This paper, with the aid of the Gradient Boosted Decision Tree (GBDT) classifier, proposes an efficient SA and Sentiment Classification (SC) of Twitter data. Initially, the Twitter data undergoes pre-processing. Next, the pre-processed data is processed using HDFS MapReduce. Features are then extracted from the processed data, and efficient features are selected using the Improved Elephant Herd Optimization (I-EHO) technique. Score values are calculated for each of the chosen features and given to the classifier. Finally, the GBDT classifier classifies the data as negative, positive or neutral. Experimental results are analyzed and contrasted with other conventional techniques to show the superior performance of the proposed method.


2020 ◽  
Vol 15 (1) ◽  
Author(s):  
Shintaro Sukegawa ◽  
Sawako Ono ◽  
Keisuke Nakano ◽  
Kiyofumi Takabatake ◽  
Hotaka Kawai ◽  
...  

Abstract Background This study was conducted to compare the histological diagnostic accuracy of conventional oral-based cytology and liquid-based cytology (LBC) methods. Methods Histological diagnoses of 251 cases were classified as negative (no malignancy lesion, inflammation, or mild/moderate dysplasia) and positive [severe dysplasia/carcinoma in situ (CIS) and squamous cell carcinoma (SCC)]. Cytological diagnoses were classified as negative for intraepithelial lesion or malignancy (NILM), oral low-grade squamous intraepithelial lesion (OLSIL), oral high-grade squamous intraepithelial lesion (OHSIL), or SCC. Cytological diagnostic results were compared with histology results. Results Of NILM cytology cases, the most frequent case was negative [LBC n = 50 (90.9%), conventional n = 22 (95.7%)]. Among OLSIL cytodiagnoses, the most common was negative (LBC n = 34; 75.6%, conventional n = 14; 70.0%). Among OHSIL cytodiagnoses (LBC n = 51, conventional n = 23), SCC was the most frequent (LBC n = 31; 60.8%, conventional n = 7; 30.4%). Negative cases were common (LBC n = 13; 25.5%, conventional n = 14; 60.9%). Among SCC cytodiagnoses SCC was the most common (LBC n = 16; 88.9%, conventional n = 14; 87.5%). Regarding the diagnostic results of cytology, assuming OHSIL and SCC as cytologically positive, the LBC method/conventional method showed a sensitivity of 79.4%/76.7%, specificity of 85.1%/69.2%, false-positive rate of 14.9%/30.7%, and false-negative rate of 20.6%/23.3%. Conclusions LBC method was superior to conventional cytodiagnosis methods. It was especially superior for OLSIL and OHSIL. Because of the false-positive and false-negative cytodiagnoses, it is necessary to make a comprehensive diagnosis considering the clinical findings.

