LEARNING GEOGRAPHICAL DISTRIBUTION OF VACANT HOUSES USING CLOSED MUNICIPAL DATA: A CASE STUDY OF WAKAYAMA CITY, JAPAN

Author(s):  
H. Baba ◽  
Y. Akiyama ◽  
T. Tokudomi ◽  
Y. Takahashi

Abstract. Vacant housing detection is an urgent problem that needs to be addressed, and it is also a suitable case for promoting the utilisation of smart data stored by municipalities. This study proposes a vacant housing detection model that uses closed municipal data, with a view to accelerating the use of public data in smart cities. Employing a machine learning technique, the study achieves high predictive power for vacant housing detection. The model can handle complex municipal data with non-linear feature characteristics and substantial missing values. Handling missing data is particularly important for the practical use of closed municipal data, because not all records can be matched to a building unit. The model achieved an accuracy of 95.4 percent and a false positive rate of 3.7 percent, which are high enough to detect vacant houses. The true positive rate, however, was 77.0 percent; while not unacceptably low, it could likely be improved through feature selection and the collection of additional samples. Mapping the geographic distribution of vacant houses further allowed us to compare the actual and estimated numbers of vacant houses: more than 80 percent of 500-meter grid cells had an error below 10 houses, which we believe provides city planners with informative data for roughly grasping geographical tendencies.
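
As a rough illustration of the modelling approach, the sketch below trains a gradient-boosted classifier that tolerates missing feature values natively and reports accuracy, TPR, and FPR. The features and data are synthetic stand-ins; the paper's actual municipal features and model configuration are not public.

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier  # handles NaN natively (sklearn >= 1.0)
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
# Hypothetical building-level features (e.g., water usage, building age, permit history)
X = rng.normal(size=(n, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 1.5).astype(int)
X[rng.random(X.shape) < 0.2] = np.nan  # municipal records are incomplete

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = HistGradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()
print(f"accuracy={(tp + tn) / len(y_te):.3f}  TPR={tp / (tp + fn):.3f}  FPR={fp / (fp + tn):.3f}")
```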

2021 ◽  
Vol 4 ◽  
Author(s):  
Jia He ◽  
Maggie X. Cheng

In machine learning, we often face situations where the events we are interested in have very few data points buried in a massive amount of data. This is typical in network monitoring, where data are streamed continuously from sensing or measuring units but most of the data do not correspond to events. With imbalanced datasets, classifiers tend to be biased in favor of the main class. Rare event detection has received much attention in machine learning, yet it remains a challenging problem. In this paper, we propose a remedy for this long-standing problem. Weighting and sampling are the two fundamental approaches to addressing it; we focus on the weighting method in this paper. We first propose a boosting-style algorithm to compute class weights, which is proven to have excellent theoretical properties. We then propose an adaptive algorithm suitable for real-time applications. The adaptive nature of the two algorithms allows a controlled tradeoff between the true positive rate and the false positive rate and avoids placing excessive weight on the rare class, which would lead to poor performance on the main class. Experiments on power grid data and several public datasets show that the proposed algorithms outperform existing weighting and boosting methods, and that their superiority is more noticeable with noisy data.
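
The abstract does not give the authors' exact weight-update rules, but the generic idea of trading TPR against FPR through a controlled rare-class weight can be sketched as follows; the stopping rule and weight schedule here are illustrative assumptions, not the proposed algorithms.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Rare-event data: roughly 1% positives
X, y = make_classification(n_samples=20000, weights=[0.99], flip_y=0.01, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=1)

w, target_tpr, max_fpr = 1.0, 0.85, 0.05
for _ in range(20):
    clf = LogisticRegression(max_iter=1000, class_weight={0: 1.0, 1: w}).fit(X_tr, y_tr)
    tn, fp, fn, tp = confusion_matrix(y_val, clf.predict(X_val)).ravel()
    tpr, fpr = tp / (tp + fn), fp / (fp + tn)
    if tpr >= target_tpr or fpr > max_fpr:  # stop when the target is met or the FPR budget is spent
        break
    w *= 1.5  # gently increase the rare-class weight
print(f"final weight={w:.2f}  TPR={tpr:.3f}  FPR={fpr:.3f}")
```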


Author(s):  
Abikoye Oluwakemi Christianah ◽  
Benjamin Aruwa Gyunka ◽  
Akande Noah Oluwatobi

The Android operating system has become very popular, with the highest market share amongst all mobile operating systems, due to its open-source nature and user-friendliness. This has brought about an uncontrolled rise in malicious applications targeting the Android platform. Emerging strains of Android malware employ highly sophisticated detection- and analysis-avoidance techniques, such that traditional signature-based detection methods have become less potent in their ability to detect new and unknown malware. Alternative approaches, such as machine learning techniques, have taken the lead for timely zero-day anomaly detection. This study aimed to develop an optimized Android malware detection model using an ensemble learning technique. Random Forest, Support Vector Machine, and k-Nearest Neighbours were used to develop three distinct base models, and their predictive results were combined using a Majority Vote combination function to produce an ensemble model. A reverse engineering procedure was employed to extract static features from a large repository of malware samples and benign applications. The WEKA 3.8.2 data mining suite was used to perform all the learning experiments. The results showed that Random Forest had a true positive rate of 97.9% and a false positive rate of 1.9%, and correctly classified 98% of instances, making it a strong base model. The ensemble model had a true positive rate of 98.1% and a false positive rate of 1.8%, and correctly classified 98.16% of instances. The findings show that, although the base learners had good detection results, the ensemble learner produced a better-optimized detection model than any of the base learners.
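
A minimal sketch of the described majority-vote ensemble, using scikit-learn in place of WEKA; the synthetic data stands in for the statically extracted malware features (e.g., permissions and API calls).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=3000, n_features=50, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=7)

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=7)),
        ("svm", SVC(random_state=7)),  # hard voting needs only class labels
        ("knn", KNeighborsClassifier()),
    ],
    voting="hard",  # majority vote over the three base models
)
ensemble.fit(X_tr, y_tr)
print(classification_report(y_te, ensemble.predict(X_te)))
```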


Author(s):  
Fujunku Chen ◽  
Zhigang Hu ◽  
Keqin Li ◽  
Wei Liu

As a preliminary step in many applications, skin detection plays an irreplaceable role in image processing tasks such as face recognition, gesture recognition, web image filtering, and image retrieval systems. Combining information from multiple color spaces improves the recognition rate and reduces the error rate, because the same color is represented differently in different color spaces. Accordingly, a hybrid skin detection model that operates over multiple color spaces, based on a dual-threshold Bayesian algorithm (DTBA), has been proposed. In each color space, the DTBA divides the pixels of an image into three categories: skin, non-skin, and undetermined. Nearly all skin pixels are then recovered by a rule that combines the recognition results from the multiple color spaces. Furthermore, skin texture filtering and morphological filtering are applied to the results, effectively reducing falsely identified pixels. The proposed skin model can also overcome interference from complex backgrounds. The method has been validated in a series of experiments using the Compaq and high-resolution image datasets (HRIDs). The findings demonstrate that the proposed approach yields an improvement over the Bayesian classifier: the true positive rate (TPR) improves by more than 6% and the false positive rate (FPR) falls by more than 11%. We confirm that the method is competitive, and that the model is robust against variations in skin distribution, scaling, partial occlusion, and illumination.
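
The dual-threshold step in a single color space can be sketched as follows: pixels whose skin/non-skin likelihood ratio exceeds an upper threshold are labelled skin, those below a lower threshold non-skin, and the remainder are left undetermined for the cross-color-space combination step. Bin counts and thresholds are illustrative choices, not the paper's tuned values.

```python
import numpy as np

def fit_color_histogram(pixels, bins=32):
    """3-D color histogram as a density estimate; pixels are uint8-valued, shape (N, 3)."""
    hist, _ = np.histogramdd(pixels, bins=bins, range=[(0, 256)] * 3, density=True)
    return hist + 1e-9  # smooth so the likelihood ratio is always defined

def classify_dual_threshold(pixels, skin_hist, nonskin_hist, t_low=0.5, t_high=2.0, bins=32):
    idx = (pixels // (256 // bins)).astype(int)  # map each pixel to its histogram bin
    ratio = skin_hist[idx[:, 0], idx[:, 1], idx[:, 2]] / \
            nonskin_hist[idx[:, 0], idx[:, 1], idx[:, 2]]
    labels = np.full(len(pixels), -1)  # -1 = undetermined
    labels[ratio >= t_high] = 1        # confident skin
    labels[ratio <= t_low] = 0         # confident non-skin
    return labels

# Toy usage with random pixels standing in for labelled training data
rng = np.random.default_rng(0)
skin_h = fit_color_histogram(rng.integers(0, 256, (50000, 3)))
nonskin_h = fit_color_histogram(rng.integers(0, 256, (50000, 3)))
labels = classify_dual_threshold(rng.integers(0, 256, (1000, 3)), skin_h, nonskin_h)
```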


Electronics ◽  
2020 ◽  
Vol 9 (11) ◽  
pp. 1894
Author(s):  
Chun Guo ◽  
Zihua Song ◽  
Yuan Ping ◽  
Guowei Shen ◽  
Yuhei Cui ◽  
...  

Remote Access Trojans (RATs) are among the most serious security threats organizations face today. At present, the two major RAT detection approaches are host-based and network-based methods. To combine their complementary strengths, this article proposes a phased RAT detection method that combines double-side features (PRATD). In PRATD, both host-side and network-side features are used to build the detection models, which helps distinguish RATs from benign programs because RATs not only generate traffic on the network but also leave traces on the host at run time. In addition, PRATD trains two different detection models for the two runtime states of RATs to improve the True Positive Rate (TPR). Experiments on network and host records collected from five kinds of benign programs and 20 well-known RATs show that PRATD can effectively detect RATs: it achieves a TPR as high as 93.609% with a False Positive Rate (FPR) as low as 0.407% for known RATs, and a TPR of 81.928% with an FPR of 0.185% for unknown RATs, suggesting it is a competitive candidate for RAT detection.
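
A heavily simplified sketch of the double-side, two-model idea: each model is trained on concatenated host- and network-side features for one assumed runtime state, and a program is flagged if either model fires. The actual feature sets and phase logic in PRATD are more elaborate; everything below is a stand-in.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_phase_models(X_host, X_net, y_phase1, y_phase2):
    X = np.hstack([X_host, X_net])  # double-side feature vector per program
    m1 = RandomForestClassifier(random_state=0).fit(X, y_phase1)
    m2 = RandomForestClassifier(random_state=0).fit(X, y_phase2)
    return m1, m2

def detect(m1, m2, x_host, x_net):
    x = np.hstack([x_host, x_net]).reshape(1, -1)
    # Flag as RAT if either runtime-state model says so
    return bool(m1.predict(x)[0] or m2.predict(x)[0])

# Toy usage with random stand-in features and labels
rng = np.random.default_rng(0)
Xh, Xn = rng.normal(size=(400, 10)), rng.normal(size=(400, 20))
y1, y2 = rng.integers(0, 2, 400), rng.integers(0, 2, 400)
m1, m2 = train_phase_models(Xh, Xn, y1, y2)
print(detect(m1, m2, Xh[0], Xn[0]))
```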


2021 ◽  
pp. 103985622110286
Author(s):  
Tracey Wade ◽  
Jamie-Lee Pennesi ◽  
Yuan Zhou

Objective: Currently, eligibility for expanded Medicare items for eating disorders (excluding anorexia nervosa) requires a score ⩾ 3 on the 22-item Eating Disorder Examination-Questionnaire (EDE-Q). We compared these EDE-Q “cases” with continuous scores on a validated 7-item version of the EDE-Q (EDE-Q7) to identify an EDE-Q7 cut-off commensurate with a score of 3 on the EDE-Q. Methods: We utilised the EDE-Q scores of female university students (N = 337) at risk of developing an eating disorder. We used a receiver operating characteristic (ROC) curve to assess the relationship between the true-positive rate (sensitivity) and the false-positive rate (1 − specificity) for cases scoring ⩾ 3. Results: The area under the curve showed outstanding discrimination of 0.94 (95% CI: .92–.97). We examined two specific cut-off points on the EDE-Q7, which captured 100% and 87% of true cases, respectively. Conclusion: Given that the EDE-Q cut-off for Medicare is used in conjunction with other criteria, we suggest using the more permissive EDE-Q7 cut-off (⩾2.5) to replace the EDE-Q cut-off (⩾3) in eligibility assessments.
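
The cut-off analysis follows a standard ROC workflow, sketched below with simulated scores in place of the study's data: compute the ROC curve of EDE-Q7 scores against EDE-Q caseness (global score ⩾ 3) and read off a candidate threshold at the desired sensitivity.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(3)
ede_q_global = rng.normal(2.5, 1.2, 337).clip(0, 6)       # simulated 22-item global scores
is_case = (ede_q_global >= 3).astype(int)                 # Medicare caseness criterion
ede_q7 = ede_q_global + rng.normal(0, 0.4, 337)           # correlated 7-item short-form scores

fpr, tpr, thresholds = roc_curve(is_case, ede_q7)
print("AUC:", roc_auc_score(is_case, ede_q7))
for f, t, th in zip(fpr, tpr, thresholds):
    if t >= 0.87:  # first cut-off capturing at least 87% of true cases
        print(f"cut-off {th:.2f}: sensitivity {t:.2f}, 1-specificity {f:.2f}")
        break
```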


2016 ◽  
Vol 24 (2) ◽  
pp. 263-272 ◽  
Author(s):  
Kosuke Imai ◽  
Kabir Khanna

In both political behavior research and voting rights litigation, turnout and vote choice for different racial groups are often inferred from aggregate election results and racial composition. Over the past several decades, many statistical methods have been proposed to address this ecological inference problem. We propose an alternative method that reduces aggregation bias by predicting individual-level ethnicity from voter registration records. Building on the existing methodological literature, we use Bayes's rule to combine the Census Bureau's Surname List with various information from geocoded voter registration records. We evaluate the performance of the proposed methodology using approximately nine million voter registration records from Florida, where self-reported ethnicity is available. We find that it is possible to reduce the false positive rate among Black and Latino voters to 6% and 3%, respectively, while maintaining a true positive rate above 80%. Moreover, we use our predictions to estimate turnout by race and find that our estimates yield substantially less bias and lower root mean squared error than standard ecological inference estimates. We provide open-source software to implement the proposed methodology.
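
The Bayes's-rule combination at the heart of the method can be illustrated with a small worked example: the posterior over race given surname and location is proportional to the surname-based prior times the geographic likelihood. All numbers below are invented for illustration.

```python
import numpy as np

races = ["White", "Black", "Latino", "Asian", "Other"]
# P(race | surname), e.g., from the Census Bureau's Surname List (illustrative values)
p_race_given_surname = np.array([0.05, 0.03, 0.88, 0.02, 0.02])
# P(geography | race): share of each group living in the voter's block, here made up
p_geo_given_race = np.array([0.002, 0.001, 0.010, 0.003, 0.002])

# Bayes's rule: posterior ∝ prior × likelihood, then normalize
posterior = p_race_given_surname * p_geo_given_race
posterior /= posterior.sum()
for r, p in zip(races, posterior):
    print(f"P({r} | surname, block) = {p:.3f}")
```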


Author(s):  
Yosef S. Razin ◽  
Jack Gale ◽  
Jiaojiao Fan ◽  
Jaznae’ Smith ◽  
Karen M. Feigh

This paper evaluates Banks et al.’s Human-AI Shared Mental Model theory by examining how a self-driving vehicle’s hazard assessment facilitates shared mental models. Participants were asked to affirm the vehicle’s real-time assessment of road objects as either hazards or mistakes while behavioral and subjective measures were collected. The baseline performance of the AI was purposefully low (<50%) to examine how the human’s shared mental model might lead to inappropriate compliance. Results indicated that while the participants’ true positive rate was high, overall performance was reduced by a large false positive rate, indicating that participants were indeed influenced by the AI’s faulty assessments, despite full transparency as to the ground truth. Both performance and compliance were directly affected by frustration and by mental and even physical demands. Dispositional factors, such as faith in other people’s cooperativeness and in technology companies, were also significant. Thus, our findings strongly support the theory that shared mental models play a measurable role in performance and compliance, in a complex interplay with trust.


2014 ◽  
Author(s):  
Andreas Tuerk ◽  
Gregor Wiktorin ◽  
Serhat Güler

Quantification of RNA transcripts with RNA-Seq is inaccurate due to positional fragment bias, which is not represented appropriately by current statistical models of RNA-Seq data. This article introduces the Mix2 (read “mix-square”) model, which uses a mixture of probability distributions to model the transcript-specific positional fragment bias. The parameters of the Mix2 model can be trained efficiently with the Expectation Maximization (EM) algorithm, yielding simultaneous estimates of the transcript abundances and the transcript-specific positional biases. Experiments are conducted on synthetic data and on the Universal Human Reference (UHR) and Brain (HBR) samples from the MicroArray Quality Control (MAQC) data set. Comparing the correlation between qPCR and FPKM values against the state-of-the-art methods Cufflinks and PennSeq, we obtain an increase in R2 value from 0.44 to 0.6 and from 0.34 to 0.54, respectively. In the detection of differential expression between UHR and HBR, the true positive rate increases from 0.44 to 0.71 at a false positive rate of 0.1. Finally, the Mix2 model is used to investigate biases present in the MAQC data. This reveals five dominant biases that deviate from the common assumption of a uniform fragment distribution. The Mix2 software is available at http://www.lexogen.com/fileadmin/uploads/bioinfo/mix2model.tgz.
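
A greatly reduced sketch of the mixture-plus-EM machinery: here a two-component Gaussian mixture over relative fragment positions in [0, 1] is fitted with EM. The actual Mix2 model ties such mixtures to transcripts and estimates abundances jointly; this only illustrates the E- and M-steps.

```python
import numpy as np

rng = np.random.default_rng(5)
# Simulated relative fragment positions with a non-uniform (biased) distribution
pos = np.concatenate([rng.normal(0.2, 0.05, 700), rng.normal(0.7, 0.1, 300)]).clip(0, 1)

K = 2
pi = np.full(K, 1 / K)                           # mixture weights
mu, sigma = np.array([0.3, 0.6]), np.array([0.1, 0.1])
for _ in range(100):
    # E-step: responsibilities of each component for each fragment position
    dens = np.exp(-0.5 * ((pos[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    resp = pi * dens
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means, and spreads from the responsibilities
    nk = resp.sum(axis=0)
    pi = nk / len(pos)
    mu = (resp * pos[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (pos[:, None] - mu) ** 2).sum(axis=0) / nk)
print("weights:", pi.round(2), "means:", mu.round(2))
```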


2021 ◽  
Author(s):  
Shloak Rathod

The proliferation of online media allows for the rapid dissemination of unmoderated news, unfortunately including fake news. The extensive spread of fake news poses a potent threat to both individuals and society. This paper focuses on designing author profiles to detect authors who are primarily engaged in publishing fake news articles. We build on the hypothesis that authors who write fake news repeatedly write only fake news articles, at least over short periods. Fake news authors have a distinct writing style compared to real news authors, who naturally want to maintain trustworthiness. We explore the potential to detect fake news authors by designing author profiles based on writing style, sentiment, and co-authorship patterns. We evaluate our approach using a publicly available dataset with over 5000 authors and 20000 articles. For our evaluation, we build and compare different classes of supervised machine learning models. We find that the K-NN model performed best, detecting authors prone to writing fake news with an 83% true positive rate at only a 5% false positive rate.
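
A minimal sketch of the profiling pipeline: condense each author's articles into a stylometric feature vector and classify with k-NN. The features and data below are hypothetical stand-ins; the paper's profiles also include sentiment and co-authorship signals.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def author_profile(articles):
    """Hypothetical stylometric features computed over one author's articles."""
    text = " ".join(articles)
    words = text.split()
    sents = [s for s in text.split(".") if s.strip()]
    return [
        len(words) / max(len(sents), 1),                       # mean sentence length
        text.count("!") / max(len(words), 1),                  # exclamation density
        sum(w.isupper() for w in words) / max(len(words), 1),  # all-caps ratio
    ]

# Stand-in profiles and labels; real ones would come from author_profile() over the corpus
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=5000) > 1.8).astype(int)  # ~5% fake-news authors

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
tn, fp, fn, tp = confusion_matrix(y_te, knn.predict(X_te)).ravel()
print(f"TPR={tp / (tp + fn):.2f}  FPR={fp / (fp + tn):.2f}")
```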


2021 ◽  
Vol 2021 ◽  
pp. 1-17
Author(s):  
Hongcheng Zou ◽  
Ziling Wei ◽  
Jinshu Su ◽  
Baokang Zhao ◽  
Yusheng Xia ◽  
...  

A website fingerprinting (WFP) attack enables identifying the websites a user is browsing even under the protection of privacy-enhancing technologies (PETs). Previous studies demonstrate that most machine-learning attacks need multiple types of features as input, inducing tremendous feature engineering work. Here, we show an alternative. We present Probabilistic Fingerprinting (PF), a new website fingerprinting attack that leverages only a single type of feature. The features are produced by a mathematical model, PWFP, which combines a probabilistic topic model with WFP for the first time, motivated by the observation that a plain text and the sequence file generated from a traffic instance are essentially alike. Experimental results show that the proposed features are more distinguishing than existing features. In a closed-world setting, PF attains better accuracy (up to 99.79%) than prior attacks on various datasets gathered in the Shadowsocks, SSH, and TLS scenarios, respectively. Moreover, even when the number of training instances drops to as few as 4, PF still reaches an accuracy above 90%. In the more realistic open-world setting, PF attains a high true positive rate (TPR) and Bayes detection rate (BDR) and a low false positive rate (FPR) in all evaluations, outperforming the other attacks. These results highlight that it is both meaningful and possible to explore new features to improve the accuracy of WFP attacks.
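
The core idea can be sketched by treating each traffic instance's packet-size sequence as a "document", learning topic proportions, and classifying on those proportions alone. LDA stands in here for PWFP's actual probabilistic model, and the traces are toy data.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy traces: direction-signed packet sizes as the "words" of each document
traces = ["+1500 -576 +1500 +1500 -52", "-1500 +576 -1500 -52 -52"] * 200
labels = [0, 1] * 200  # website identities (closed world)

pipe = make_pipeline(
    CountVectorizer(token_pattern=r"[+-]\d+"),          # packet sizes form the vocabulary
    LatentDirichletAllocation(n_components=5, random_state=0),
    RandomForestClassifier(random_state=0),             # classify on topic mixtures only
)
X_tr, X_te, y_tr, y_te = train_test_split(traces, labels, stratify=labels, random_state=0)
pipe.fit(X_tr, y_tr)
print("closed-world accuracy:", pipe.score(X_te, y_te))
```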

