Adaptive Indoor Area Localization for Perpetual Crowdsourced Data Collection

The accuracy of fingerprinting-based indoor localization correlates with the quality and up-to-dateness of collected training data. Perpetual crowdsourced data collection reduces manual labeling effort and provides a fresh data base. However, the decentralized collection comes with the cost of heterogeneous data that causes performance degradation. In settings with imperfect data, area localization can provide higher positioning guarantees than exact position estimation. Existing area localization solutions employ a static segmentation into areas that is independent of the available training data. This approach is not applicable for crowdsoucred data collection, which features an unbalanced spatial training data distribution that evolves over time. A segmentation is required that utilizes the existing training data distribution and adapts once new data is accumulated. We propose an algorithm for data-aware floor plan segmentation and a selection metric that balances expressiveness (information gain) and performance (correctly classified examples) of area classifiers. We utilize supervised machine learning, in particular, deep learning, to train the area classifiers. We demonstrate how to regularly provide an area localization model that adapts its prediction space to the accumulating training data. The resulting models are shown to provide higher reliability compared to models that pinpoint the exact position.

Download Full-text

Effects of Training Set Size on Supervised Machine-Learning Land-Cover Classification of Large-Area High-Resolution Remotely Sensed Data

Remote Sensing ◽

10.3390/rs13030368 ◽

2021 ◽

Vol 13 (3) ◽

pp. 368

Author(s):

Christopher A. Ramezan ◽

Timothy A. Warner ◽

Aaron E. Maxwell ◽

Bradley S. Price

Keyword(s):

Machine Learning ◽

Sample Size ◽

Remotely Sensed ◽

Training Data ◽

Supervised Machine Learning ◽

Sample Sizes ◽

Remotely Sensed Data ◽

Large Area ◽

Training Set ◽

Set Size

The size of the training data set is a major determinant of classification accuracy. Nevertheless, the collection of a large training data set for supervised classifiers can be a challenge, especially for studies covering a large area, which may be typical of many real-world applied projects. This work investigates how variations in training set size, ranging from a large sample size (n = 10,000) to a very small sample size (n = 40), affect the performance of six supervised machine-learning algorithms applied to classify large-area high-spatial-resolution (HR) (1–5 m) remotely sensed data within the context of a geographic object-based image analysis (GEOBIA) approach. GEOBIA, in which adjacent similar pixels are grouped into image-objects that form the unit of the classification, offers the potential benefit of allowing multiple additional variables, such as measures of object geometry and texture, thus increasing the dimensionality of the classification input data. The six supervised machine-learning algorithms are support vector machines (SVM), random forests (RF), k-nearest neighbors (k-NN), single-layer perceptron neural networks (NEU), learning vector quantization (LVQ), and gradient-boosted trees (GBM). RF, the algorithm with the highest overall accuracy, was notable for its negligible decrease in overall accuracy, 1.0%, when training sample size decreased from 10,000 to 315 samples. GBM provided similar overall accuracy to RF; however, the algorithm was very expensive in terms of training time and computational resources, especially with large training sets. In contrast to RF and GBM, NEU, and SVM were particularly sensitive to decreasing sample size, with NEU classifications generally producing overall accuracies that were on average slightly higher than SVM classifications for larger sample sizes, but lower than SVM for the smallest sample sizes. NEU however required a longer processing time. The k-NN classifier saw less of a drop in overall accuracy than NEU and SVM as training set size decreased; however, the overall accuracies of k-NN were typically less than RF, NEU, and SVM classifiers. LVQ generally had the lowest overall accuracy of all six methods, but was relatively insensitive to sample size, down to the smallest sample sizes. Overall, due to its relatively high accuracy with small training sample sets, and minimal variations in overall accuracy between very large and small sample sets, as well as relatively short processing time, RF was a good classifier for large-area land-cover classifications of HR remotely sensed data, especially when training data are scarce. However, as performance of different supervised classifiers varies in response to training set size, investigating multiple classification algorithms is recommended to achieve optimal accuracy for a project.

Download Full-text

Evaluating Grayware Characteristics and Risks

Journal of Computer Networks and Communications ◽

10.1155/2011/569829 ◽

2011 ◽

Vol 2011 ◽

pp. 1-28 ◽

Cited By ~ 1

Author(s):

Zhongqiang Chen ◽

Zhanyan Liang ◽

Yuan Zhang ◽

Zhongrong Chen

Keyword(s):

Information Gain ◽

Feature Space ◽

Training Data ◽

Support Vector ◽

Learning Models ◽

Generalization Capability ◽

Self Organizing Maps ◽

Defense Strategies ◽

Security Applications ◽

Vector Machines

Grayware encyclopedias collect known species to provide information for incident analysis, however, the lack of categorization and generalization capability renders them ineffective in the development of defense strategies against clustered strains. A grayware categorization framework is therefore proposed here to not only classify grayware according to diverse taxonomic features but also facilitate evaluations on grayware risk to cyberspace. Armed with Support Vector Machines, the framework builds learning models based on training data extracted automatically from grayware encyclopedias and visualizes categorization results with Self-Organizing Maps. The features used in learning models are selected with information gain and the high dimensionality of feature space is reduced by word stemming and stopword removal process. The grayware categorizations on diversified features reveal that grayware typically attempts to improve its penetration rate by resorting to multiple installation mechanisms and reduced code footprints. The framework also shows that grayware evades detection by attacking victims' security applications and resists being removed by enhancing its clotting capability with infected hosts. Our analysis further points out that species in categoriesSpywareandAdwarecontinue to dominate the grayware landscape and impose extremely critical threats to the Internet ecosystem.

Download Full-text

Entity Type Recognition for Heterogeneous Semantic Graphs

AI Magazine ◽

10.1609/aimag.v36i1.2569 ◽

2015 ◽

Vol 36 (1) ◽

pp. 75-86 ◽

Cited By ~ 4

Author(s):

Jennifer Sleeman ◽

Tim Finin ◽

Anupam Joshi

Keyword(s):

Machine Learning ◽

Background Knowledge ◽

Knowledge Bases ◽

Heterogeneous Data ◽

Unstructured Data ◽

Supervised Machine Learning ◽

Coreference Resolution ◽

Multiple Sources ◽

Fine Grained ◽

High Level

We describe an approach for identifying fine-grained entity types in heterogeneous data graphs that is effective for unstructured data or when the underlying ontologies or semantic schemas are unknown. Identifying fine-grained entity types, rather than a few high-level types, supports coreference resolution in heterogeneous graphs by reducing the number of possible coreference relations that must be considered. Big data problems that involve integrating data from multiple sources can benefit from our approach when the datas ontologies are unknown, inaccessible or semantically trivial. For such cases, we use supervised machine learning to map entity attributes and relations to a known set of attributes and relations from appropriate background knowledge bases to predict instance entity types. We evaluated this approach in experiments on data from DBpedia, Freebase, and Arnetminer using DBpedia as the background knowledge base.

Download Full-text

Artificially Generated Training Data-sets for Supervised Machine Learning Techniques in Magnetic Resonance Imaging: An Example in Myocardial Segmentation

2019 Computing in Cardiology Conference (CinC) ◽

10.22489/cinc.2019.220 ◽

2019 ◽

Author(s):

Christos Xanthis ◽

Kostas Haris ◽

Dimitrios Filos ◽

Anthony Aletras

Keyword(s):

Magnetic Resonance Imaging ◽

Machine Learning ◽

Magnetic Resonance ◽

Training Data ◽

Supervised Machine Learning ◽

Machine Learning Techniques ◽

Data Sets ◽

Resonance Imaging ◽

Learning Techniques ◽

Myocardial Segmentation

Download Full-text

Implementation of Backpropagation for Ulap-ulap Pattern Recognition

JELIKU (Jurnal Elektronik Ilmu Komputer Udayana) ◽

10.24843/jlk.2021.v09.i03.p12 ◽

2021 ◽

Vol 9 (3) ◽

pp. 405

Author(s):

Ni Luh Yulia Alami Dewi ◽

I Wayan Santiyasa

Keyword(s):

Image Processing ◽

Pattern Recognition ◽

Data Collection ◽

Test Data ◽

Training Data ◽

Image Pattern

Ulap-ulap is one of the symbols used to indicate that a building has been carried out Mlaspas ceremony. Mlaspas is one of the ceremonies performed to purify and clean a building. Ulap-ulap itself consists of various types depending on the building where it is placed, for example the ulap-ulap placed on the Pelinggih building will be different from the ulap-ulap placed on the Bale building. So that the pattern contained in each type of Ulap-ulap is different. The purpose of this research is to be able to do pattern recognition on Ulap-ulap images. The method used in this study is Backpropagation, and for its implementation, the MATLAB 7.5.0 (R2007b) application is used. This study used 18 images of Ulap-ulap, including 15 training data and 6 test data. The stages of the process carried out are for Ulap-ulap pattern recognition, the first is data collection, then image processing, and finally the pattern recognition. Recognition of the Ulap-ulap image pattern with Backpropagation, resulted in an accuracy of 83.333%.

Download Full-text

Indoor Localization Based on Wi-Fi Received Signal Strength Indicators: Feature Extraction, Mobile Fingerprinting, and Trajectory Learning

Applied Sciences ◽

10.3390/app9183930 ◽

2019 ◽

Vol 9 (18) ◽

pp. 3930 ◽

Cited By ~ 1

Author(s):

Jaehyun Yoo ◽

Jongho Park

Keyword(s):

Feature Extraction ◽

Indoor Localization ◽

Signal Strength ◽

Received Signal Strength ◽

Position Estimation ◽

Training Data ◽

Data Sets ◽

Feature Extraction Method ◽

Fingerprinting Method ◽

Trajectory Learning

This paper studies the indoor localization based on Wi-Fi received signal strength indicator (RSSI). In addition to position estimation, this study examines the expansion of applications using Wi-Fi RSSI data sets in three areas: (i) feature extraction, (ii) mobile fingerprinting, and (iii) mapless localization. First, the features of Wi-Fi RSSI observations are extracted with respect to different floor levels and designated landmarks. Second, the mobile fingerprinting method is proposed to allow a trainer to collect training data efficiently, which is faster and more efficient than the conventional static fingerprinting method. Third, in the case of the unknown-map situation, the trajectory learning method is suggested to learn map information using crowdsourced data. All of these parts are interconnected from the feature extraction and mobile fingerprinting to the map learning and the estimation. Based on the experimental results, we observed (i) clearly classified data points by the feature extraction method as regards the floors and landmarks, (ii) efficient mobile fingerprinting compared to conventional static fingerprinting, and (iii) improvement of the positioning accuracy owing to the trajectory learning.

Download Full-text

Internal and External Threat Analysis of Anonymized Dataset

Handbook of Research on Intrusion Detection Systems - Advances in Information Security, Privacy, and Ethics ◽

10.4018/978-1-7998-2242-4.ch009 ◽

2020 ◽

pp. 172-185

Author(s):

Saurav Jindal ◽

Poonam Saini

Keyword(s):

Data Mining ◽

Data Collection ◽

Information Gain ◽

Threat Analysis ◽

External Threat ◽

Utility Of Information ◽

Cost Penalty ◽

Computational Processes ◽

Different Sources ◽

Unusual Situation

In recent years, data collection and data mining have emerged as fast-paced computational processes as the amount of data from different sources has increased manifold. With the advent of such technologies, major concern is exposure of an individual's self-contained information. To confront the unusual situation, anonymization of dataset is performed before being released into public for further usage. The chapter discusses various existing techniques of anonymization. Thereafter, a novel redaction technique is proposed for generalization to minimize the overall cost (penalty) of the process being inversely proportional to utility of generated dataset. To validate the proposed work, authors assume a pre-processed dataset and further compare our algorithm with existing techniques. Lastly, the proposed technique is made scalable thus ensuring further minimization of generalization cost and improving overall utility of information gain.

Download Full-text

Correcting Bias in Crowdsourced Data to Map Bicycle Ridership of All Bicyclists

Urban Science ◽

10.3390/urbansci3020062 ◽

2019 ◽

Vol 3 (2) ◽

pp. 62 ◽

Cited By ~ 6

Author(s):

Avipsa Roy ◽

Trisalyn A. Nelson ◽

A. Stewart Fotheringham ◽

Meghan Winters

Keyword(s):

Temporal Resolution ◽

Test Data ◽

Training Data ◽

Residential Areas ◽

Maricopa County ◽

Covariate Data ◽

Street Level ◽

Crowdsourced Data ◽

The Relationship ◽

The City

Traditional methods of counting bicyclists are resource-intensive and generate data with sparse spatial and temporal detail. Previous research suggests big data from crowdsourced fitness apps offer a new source of bicycling data with high spatial and temporal resolution. However, crowdsourced bicycling data are biased as they oversample recreational riders. Our goals are to quantify geographical variables, which can help in correcting bias in crowdsourced, data and to develop a generalized method to correct bias in big crowdsourced data on bicycle ridership in different settings in order to generate maps for cities representative of all bicyclists at a street-level spatial resolution. We used street-level ridership data for 2016 from a crowdsourced fitness app (Strava), geographical covariate data, and official counts from 44 locations across Maricopa County, Arizona, USA (training data); and 60 locations from the city of Tempe, within Maricopa (test data). First, we quantified the relationship between Strava and official ridership data volumes. Second, we used a multi-step approach with variable selection using LASSO followed by Poisson regression to integrate geographical covariates, Strava, and training data to correct bias. Finally, we predicted bias-corrected average annual daily bicyclist counts for Tempe and evaluated the model’s accuracy using the test data. We found a correlation between the annual ridership data from Strava and official counts (R2 = 0.76) in Maricopa County for 2016. The significant variables for correcting bias were: The proportion of white population, median household income, traffic speed, distance to residential areas, and distance to green spaces. The model could correct bias in crowdsourced data from Strava in Tempe with 86% of road segments being predicted within a margin of ±100 average annual bicyclists. Our results indicate that it is possible to map ridership for cities at the street-level by correcting bias in crowdsourced bicycle ridership data, with access to adequate data from official count programs and geographical covariates at a comparable spatial and temporal resolution.

Download Full-text

A Novel Enhanced Positioning Trilateration Algorithm Implemented for Medical Implant In-Body Localization

International Journal of Antennas and Propagation ◽

10.1155/2013/819695 ◽

2013 ◽

Vol 2013 ◽

pp. 1-10 ◽

Cited By ~ 14

Author(s):

Peter Brida ◽

Juraj Machaj

Keyword(s):

Signal Strength ◽

Received Signal Strength ◽

Position Estimation ◽

Time Of Arrival ◽

Medical Implant ◽

Exact Position ◽

Two Phases ◽

Positioning Algorithm ◽

Selection Of

Medical implants based on wireless communication will play crucial role in healthcare systems. Some applications need to know the exact position of each implant. RF positioning seems to be an effective approach for implant localization. The two most common positioning data typically used for RF positioning are received signal strength and time of flight of a radio signal between transmitter and receivers (medical implant and network of reference devices with known position). This leads to positioning methods: received signal strength (RSS) and time of arrival (ToA). Both methods are based on trilateration. Used positioning data are very important, but the positioning algorithm which estimates the implant position is important as well. In this paper, the proposal of novel algorithm for trilateration is presented. The proposed algorithm improves the quality of basic trilateration algorithms with the same quality of measured positioning data. It is called Enhanced Positioning Trilateration Algorithm (EPTA). The proposed algorithm can be divided into two phases. The first phase is focused on the selection of the most suitable sensors for position estimation. The goal of the second one lies in the positioning accuracy improving by adaptive algorithm. Finally, we provide performance analysis of the proposed algorithm by computer simulations.

Download Full-text

Lost in Space: Geolocation in Event Data

Political Science Research and Methods ◽

10.1017/psrm.2018.23 ◽

2018 ◽

Vol 7 (04) ◽

pp. 871-888 ◽

Cited By ~ 6

Author(s):

Sophie J. Lee ◽

Howard Liu ◽

Michael D. Ward

Keyword(s):

Learning Algorithm ◽

Text Processing ◽

Contextual Information ◽

Training Data ◽

Supervised Machine Learning ◽

Model Parameters ◽

Event Data ◽

Data Set ◽

N Gram ◽

Automated Text Processing

Improving geolocation accuracy in text data has long been a goal of automated text processing. We depart from the conventional method and introduce a two-stage supervised machine-learning algorithm that evaluates each location mention to be either correct or incorrect. We extract contextual information from texts, i.e., N-gram patterns for location words, mention frequency, and the context of sentences containing location words. We then estimate model parameters using a training data set and use this model to predict whether a location word in the test data set accurately represents the location of an event. We demonstrate these steps by constructing customized geolocation event data at the subnational level using news articles collected from around the world. The results show that the proposed algorithm outperforms existing geocoders even in a case added post hoc to test the generality of the developed algorithm.

Download Full-text