A data set of aerial imagery from robotics simulator for map-based localization systems benchmark

Purpose: The purpose of this paper is to present a new data set of aerial imagery from a robotics simulator (AIR). The AIR data set aims to provide a starting point for localization system development and to become a standard benchmark for comparing the accuracy of map-based localization algorithms, visual odometry and SLAM for high-altitude flights. Design/methodology/approach: The presented data set contains over 100,000 aerial images captured from the Gazebo robotics simulator using orthophoto maps as a ground plane. Flights with three different trajectories are performed on maps of urban and forest environments at different altitudes, totaling over 33 kilometers of flight distance. Findings: The review of previous research studies shows that the presented data set is the largest currently available public data set with downward-facing camera imagery. Originality/value: This paper addresses the lack of publicly available data sets for high-altitude (100–3,000 meters) UAV flights; current state-of-the-art research on map-based localization systems for UAVs depends on real-life test flights and custom-simulated data sets to evaluate the accuracy of the algorithms. The presented data set fills this gap and aims to help researchers improve and benchmark new algorithms for high-altitude flights.

Fast and accurate detection of surface defect based on improved YOLOv4

Assembly Automation, DOI 10.1108/aa-04-2021-0044, 2021, Vol ahead-of-print (ahead-of-print)

Author(s): Jiawei Lian, Junhong He, Yun Niu, Tianze Wang

Keyword(s): Feature Extraction, Real Time, Surface Defect, Steel Ingot, Industrial Applications, Data Sets, Data Set, Processing Technologies, Content Type, Public Data

Purpose: Popular image processing technologies based on convolutional neural networks are computationally expensive, have high storage costs and show low accuracy in detecting tiny defects, which conflicts with the real-time performance, accuracy and limited computing and storage resources that industrial applications require. Therefore, an improved YOLOv4, named YOLOv4-Defect, is proposed to solve these problems. Design/methodology/approach: On the one hand, this study performs multi-dimensional compression on the feature extraction network of YOLOv4 to simplify the model and uses knowledge distillation to improve the model's feature extraction ability. On the other hand, a prediction scale with a more detailed receptive field is added to optimize the model structure, which improves detection performance for tiny defects. Findings: The effectiveness of the method is verified on the public data sets NEU-CLS and DAGM 2007 and on a steel ingot data set collected in an actual industrial setting. The experimental results demonstrate that the proposed YOLOv4-Defect method greatly improves recognition efficiency and accuracy and reduces the size and computational cost of the model. Originality/value: This paper proposes an improved YOLOv4, named YOLOv4-Defect, for surface defect detection, which is suited to industrial scenarios with limited storage and computing resources and meets the requirements of real-time performance and high precision.

A systematic review of machine learning-based missing value imputation techniques

Data Technologies and Applications, DOI 10.1108/dta-12-2020-0298, 2021, Vol ahead-of-print (ahead-of-print)

Author(s): Tressy Thomas, Enayat Rajabi

Keyword(s): Machine Learning, Selection Process, Evaluation Metrics, Correct Prediction, Data Sets, Data Set, Missing Value, Content Type, Missing Value Imputation, Literature Reviews

Purpose: The primary aim of this study is to review, from different dimensions including type of method, experimentation setup and evaluation metrics, the novel approaches proposed for data imputation, particularly in the machine learning (ML) area. This ultimately provides an understanding of how well the proposed frameworks are evaluated and what types and ratios of missingness are addressed in the proposals. The review questions in this study are: (1) What ML-based imputation methods were studied and proposed during 2010–2020? (2) How are the experimentation setup, characteristics of data sets and missingness employed in these studies? (3) What metrics were used to evaluate the imputation methods? Design/methodology/approach: The review process went through the standard identification, screening and selection process. The initial search on electronic databases for missing value imputation (MVI) based on ML algorithms returned 2,883 papers. Most of the papers at this stage did not describe an MVI technique relevant to this study. The papers were first screened by title for relevance, and 306 were identified as appropriate. Upon reviewing the abstracts, 151 papers that were not eligible for this study were dropped. This resulted in 155 research papers suitable for full-text review, of which 117 were used to assess the review questions. Findings: This study shows that clustering- and instance-based algorithms are the most proposed MVI methods. Percentage of correct prediction (PCP) and root mean square error (RMSE) are the most used evaluation metrics in these studies. For experimentation, the majority of the studies sourced their data sets from publicly available data set repositories. A common approach is to take the complete data set as the baseline for evaluating the effectiveness of imputation on test data sets with artificially induced missingness. The data set size and missingness ratio varied across the experiments, while the missing data type and mechanism pertained to the capability of the imputation method. Computational expense is a concern, and experimentation with large data sets appears to be a challenge. Originality/value: The review shows that there is no single universal solution to the missing data problem. Variants of ML approaches work well with missingness depending on the characteristics of the data set. Most of the methods reviewed lack generalization with regard to applicability. Another concern related to applicability is the complexity of formulating and implementing the algorithm. Imputation methods based on k-nearest neighbors (kNN) and clustering algorithms, which are simple and easy to implement, are popular across various domains.
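
The evaluation protocol described in the findings (mask cells of a complete data set, impute them, then score against the known values with RMSE) can be sketched as follows. This is a minimal illustration and not code from any reviewed paper: the function names `knn_impute` and `rmse`, the toy data and the choice of k are all assumptions made for the example.

```python
import math


def knn_impute(rows, k=1):
    """Fill None cells with the mean of that column over the k nearest rows.

    Distance is the root mean squared difference over columns observed
    in both rows; rows with no shared observed column are infinitely far.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are the other rows in which column j is observed.
            donors = [(distance(row, other), other[j])
                      for n, other in enumerate(rows)
                      if n != i and other[j] is not None]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:  # leave the cell empty if no donor exists
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


def rmse(complete, imputed, masked_cells):
    """Root mean square error over the cells that were artificially masked."""
    errors = [(complete[i][j] - imputed[i][j]) ** 2 for i, j in masked_cells]
    return math.sqrt(sum(errors) / len(errors))


# Hypothetical complete data set used as the baseline.
complete = [[1.0, 2.0], [1.1, 2.1], [5.0, 6.0], [5.2, 6.1]]

# Artificially induce missingness at known positions.
masked_cells = [(1, 1), (2, 0)]
incomplete = [list(r) for r in complete]
for i, j in masked_cells:
    incomplete[i][j] = None

filled = knn_impute(incomplete, k=1)
print(filled)
print(rmse(complete, filled, masked_cells))
```

Because the masked positions are known, the error is computed only over imputed cells, which keeps the score from being diluted by values that were never missing.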

Website traffic measurement and rankings: competitive intelligence tools examination

International Journal of Web Information Systems, DOI 10.1108/ijwis-01-2018-0001, 2018, Vol 14 (4), pp. 423-437, Cited By ~ 1

Author(s): David Prantl, Martin Prantl

Keyword(s): Design Methodology, State Of The Art, Competitive Intelligence, Data Sets, Traffic Data, Content Type, Traffic Measurement, Data Estimation, Research Studies

Purpose: The purpose of this paper is to examine and verify the competitive intelligence tools Alexa and SimilarWeb, which are broadly used for website traffic data estimation. The tested tools represent the state of the art in this area. Design/methodology/approach: The authors use a quantitative approach. Research was conducted on a sample of Czech websites for which accurate traffic data are available, against which the less accurate data sets provided by Alexa and SimilarWeb are compared. Findings: The results show that neither tool can accurately determine the ranking of websites on the internet. However, it is possible to approximately determine the significance of a particular website. These results are useful for other research studies that use data from Alexa or SimilarWeb. Moreover, the results show that it is still not possible to accurately estimate the traffic of any website in the world. Research limitations/implications: The limitation of the research lies in the fact that it was conducted solely in the Czech market. Originality/value: A significant number of research studies use data sets provided by Alexa and SimilarWeb. However, none of these studies focus on the quality of the website traffic data acquired by Alexa or SimilarWeb, nor do any of them refer to other studies that deal with this issue. Furthermore, the authors describe approaches to measuring website traffic and, based on the analysis, discuss the possible usability of these methods.
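
One common way to quantify how well an estimated ranking agrees with a ranking based on accurate traffic data is Spearman's rank correlation. The abstract does not state which statistic the authors used, so the sketch below is only an illustration of the general idea; the function names and the traffic figures are invented for the example.

```python
def ranks(values):
    """Rank values from largest (rank 1) to smallest; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            result[idx] = average_rank
        pos = end + 1
    return result


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    return cov / (sd_x * sd_y)


# Hypothetical monthly visits for five websites: measured vs. tool estimate.
measured = [120000, 45000, 30000, 8000, 500]
estimated = [90000, 60000, 10000, 12000, 700]

print(spearman(measured, estimated))
```

A value close to 1 would indicate that the tool orders the sites much as the accurate data do, even when the absolute traffic estimates are far off.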

A SELF-ORGANIZING MAP FOR MIXED CONTINUOUS AND CATEGORICAL DATA

International Journal of Computing, DOI 10.47839/ijc.10.1.733, 2011, pp. 24-32, Cited By ~ 1

Author(s): Nicoleta Rogovschi, Mustapha Lebbah, Younès Bennani

Keyword(s): Clustering Algorithm, Clustering Algorithms, Mixed Data, Categorical Variables, Data Sets, Self Organizing Map, Data Set, Public Data, Self Organizing

Most traditional clustering algorithms are limited to handling data sets that contain either continuous or categorical variables. However, data sets with mixed types of variables are common in the data mining field. In this paper we introduce a weighted self-organizing map for the clustering, analysis and visualization of mixed data (continuous/binary). The weights and prototypes are learned simultaneously, ensuring an optimized clustering of the data. The higher a variable's weight, the more the clustering algorithm takes into account the information carried by that variable. The learning of these topological maps is combined with a weighting process for the different variables, which computes weights that influence the quality of the clustering. We illustrate the power of this method with data sets taken from a public data set repository: a handwritten digit data set, the Zoo data set and three other mixed data sets. The results show good quality of the topological ordering and homogeneous clustering.

An application of data envelopment analysis for Korean banks with negative data

Benchmarking: An International Journal, DOI 10.1108/bij-02-2016-0023, 2017, Vol 24 (4), pp. 1052-1064, Cited By ~ 9

Author(s): Yong Joo Lee, Seong-Jong Joo, Hong Gyun Park

Keyword(s): Data Envelopment Analysis, Data Sets, Data Envelopment, Negative Data, Ownership Type, Data Set, Translation Invariant, Content Type, Dea Models, Regional Banks

Purpose: The purpose of this paper is to measure the comparative efficiency of 18 Korean commercial banks in the presence of negative observations and to examine performance differences among them by grouping them according to their market conditions. Design/methodology/approach: The authors employ two data envelopment analysis (DEA) models that can handle negative data: a Banker, Charnes, and Cooper (BCC) model and a modified slacks-based measure of efficiency (MSBM) model. The BCC model is proven to be translation invariant for inputs or outputs depending on output or input orientation. Meanwhile, the MSBM model is unit invariant in addition to being translation invariant. The authors compare results from both models and choose one for interpreting results. Findings: Most Korean banks recovered from their worst performance in 2011 and showed similar performance in recent years. Among the three groups (national banks, regional banks and special banks), most special banks demonstrated superb performance across models and years. In particular, the performance difference between the special banks and the regional banks was statistically significant. The authors concluded that the high performance of the special banks was due to their nationwide market access and ownership type. Practical implications: This study demonstrates how to analyze and measure the efficiency of entities when variables contain negative observations, using a data set for Korean banks. The authors tried two major DEA models that are able to handle negative data and proposed a practical direction for future studies. Originality/value: Although there are research papers measuring the performance of banks in Korea, all of the papers on the topic have studied efficiency or productivity using positive data sets. However, variables such as net incomes and growth rates frequently include negative observations in bank data sets. This is the first paper to investigate the efficiency of bank operations in the presence of negative data in Korea.

A scalable eigenspace-based fuzzy c-means for topic detection

Data Technologies and Applications, DOI 10.1108/dta-11-2020-0262, 2021, Vol ahead-of-print (ahead-of-print)

Author(s): Hendri Murfi

Keyword(s): Representation Learning, Detection Methods, Data Sets, Topic Detection, Data Set, Content Type, Running Time, Fuzzy C Means, Coherence Score, Value Decomposition

Purpose: The aim of this research is to develop an eigenspace-based fuzzy c-means method for scalable topic detection. Design/methodology/approach: The eigenspace-based fuzzy c-means (EFCM) combines representation learning and clustering. The textual data are transformed into a lower-dimensional eigenspace using truncated singular value decomposition. Fuzzy c-means is performed on the eigenspace to identify the centroids of each cluster. The topics are obtained by transforming the centroids back into the nonnegative subspace of the original space. In this paper, we extend the EFCM method for scalability using two approaches, i.e. single-pass and online. We call the resulting topic detection methods oEFCM and spEFCM. Findings: Our simulation shows that both the oEFCM and spEFCM methods provide faster running times than EFCM for data sets that do not fit in memory. However, there is a decrease in the average coherence score. For data sets that fit in memory as well as those that do not, the oEFCM method provides a better tradeoff between running time and coherence score than spEFCM. Originality/value: This research produces a scalable topic detection method. Besides scalability, the developed method also provides a faster running time for a data set that fits in memory.
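
The clustering step that the abstract builds on can be illustrated with the standard fuzzy c-means updates. The sketch below covers only that step: the truncated singular value decomposition and the single-pass and online extensions are omitted, the 2-D points stand in for documents already projected into a low-dimensional space, and the function name, fuzzifier value and data are assumptions for the example rather than the author's implementation.

```python
import math


def fuzzy_c_means(points, centers, m=2.0, iterations=50):
    """Alternate the membership and centroid updates of standard fuzzy c-means.

    points:  list of equal-length coordinate tuples
    centers: initial centroids, one tuple per cluster
    m:       fuzzifier (> 1); larger values give softer memberships
    Returns the final centroids and the memberships from the last iteration.
    """
    centers = [tuple(c) for c in centers]
    power = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(iterations):
        memberships = []
        for p in points:
            dists = [math.dist(p, c) for c in centers]
            if any(d == 0.0 for d in dists):
                # The point coincides with a centroid: assign it fully there.
                hits = [1.0 if d == 0.0 else 0.0 for d in dists]
                row = [h / sum(hits) for h in hits]
            else:
                # u_i = 1 / sum_j (d_i / d_j)^(2 / (m - 1))
                row = [1.0 / sum((di / dj) ** power for dj in dists) for di in dists]
            memberships.append(row)
        new_centers = []
        for i in range(len(centers)):
            # Each centroid is the mean of all points weighted by u^m.
            weights = [row[i] ** m for row in memberships]
            total = sum(weights)
            new_centers.append(tuple(
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(len(points[0]))))
        centers = new_centers
    return centers, memberships


# Hypothetical 2-D coordinates standing in for documents in the reduced space.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.0), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centers, memberships = fuzzy_c_means(points, centers=[(0.0, 0.0), (5.0, 5.0)])
print(centers)
print(memberships[0])
```

Unlike hard k-means, every point keeps a degree of membership in every cluster, and each row of memberships sums to 1.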

A systematical approach to classification problems with feature space heterogeneity

Kybernetes, DOI 10.1108/k-06-2018-0313, 2019, Vol 48 (9), pp. 2006-2029

Author(s): Hongshan Xiao, Yu Wang

Keyword(s): Factor Analysis, Meta Analysis, Feature Space, Classification Performance, Classification Algorithm, Significant Feature, Data Sets, Data Set, Classification Techniques, Content Type

Purpose: Feature space heterogeneity exists widely in various application fields of classification techniques, such as customs inspection decisions, credit scoring and medical diagnosis. This paper aims to study the relationship between feature space heterogeneity and classification performance. Design/methodology/approach: A measurement is first developed for measuring and identifying any significant heterogeneity that exists in the feature space of a data set. The main idea of this measurement is derived from meta-analysis. For a data set with significant feature space heterogeneity, a classification algorithm based on factor analysis and clustering is proposed to learn the data patterns, which, in turn, are used for data classification. Findings: The proposed approach has two main advantages over previous methods. The first advantage lies in the feature transform using orthogonal factor analysis, which results in new features without redundancy and irrelevance. The second advantage rests on sample partitioning to capture the feature space heterogeneity reflected by differences in factor scores. The validity and effectiveness of the proposed approach are verified on a number of benchmark data sets. Research limitations/implications: The measurement should be used to guide the heterogeneity elimination process, which is an interesting topic for future research. In addition, developing a classification algorithm that enables scalable and incremental learning for large data sets with significant feature space heterogeneity is also an important issue. Practical implications: Measuring and eliminating the feature space heterogeneity that may exist in the data is important for accurate classification. This study provides a systematic approach to feature space heterogeneity measurement and elimination for better classification performance, which is favorable for applications of classification techniques to real-world problems. Originality/value: A measurement based on meta-analysis for measuring and identifying any significant feature space heterogeneity in a classification problem is developed, and an ensemble classification framework is proposed to deal with the feature space heterogeneity and improve the classification accuracy.

US academic libraries' staffing and expenditure trends (1996–2016)

Library Management, DOI 10.1108/lm-12-2019-0093, 2020, Vol 41 (4/5), pp. 247-268, Cited By ~ 1

Author(s): Starr Hoffman, Samantha Godbey

Keyword(s): Academic Libraries, The United States, Data Sets, Size Category, Data Set, Content Type, Education Statistics, Trends Over Time, Doctoral Institutions, Over Time

Purpose: This paper explores trends over time in library staffing and staffing expenditures among two- and four-year colleges and universities in the United States. Design/methodology/approach: Researchers merged and analyzed data from 1996 to 2016 from the National Center for Education Statistics for over 3,500 libraries at postsecondary institutions. This study is primarily descriptive in nature and addresses the research questions: How do staffing trends in academic libraries over this period of time relate to Carnegie classification and institution size? How do trends in library staffing expenditures over this period of time correspond to these same variables? Findings: Across all institutions, on average, total library staff decreased from 1998 to 2012. Numbers of librarians declined at master's and doctoral institutions between 1998 and 2016. Numbers of students per librarian increased over time in each Carnegie and size category. Average inflation-adjusted staffing expenditures have remained steady for master's, baccalaureate and associate's institutions. Salaries as a percent of library budget decreased only among doctoral institutions and institutions with 20,000 or more students. Originality/value: This is a valuable study of trends over time, a kind of analysis that has been difficult without downloading and merging separate data sets from multiple government sources. As a result, few studies have taken such an approach to these data. Consequently, institutions and libraries are making decisions about resource allocation based on only a fraction of the available data. Academic libraries can use this study and the resulting data set to benchmark key staffing characteristics.

A classification approach based on variable precision rough sets and cluster validity index function

Engineering Computations, DOI 10.1108/ec-11-2012-0297, 2014, Vol 31 (8), pp. 1778-1789

Author(s): Hongkang Lin

Keyword(s): Optimal Number, Data Sets, Cluster Validity, Cluster Validity Index, Index Method, Data Set, Content Type, The Individual, Variable Precision Rough Sets, Optimal Number Of Clusters

Purpose – The clustering/classification method proposed in this study, designated as the PFV-index method, provides the means to solve the following problems for a data set characterized by imprecision and uncertainty: first, discretizing the continuous values of all the individual attributes within a data set; second, evaluating the optimality of the discretization results; third, determining the optimal number of clusters per attribute; and fourth, improving the classification accuracy (CA) of data sets characterized by uncertainty. The paper aims to discuss these issues. Design/methodology/approach – The proposed method for solving the clustering/classification problem, designated as the PFV-index method, combines a particle swarm optimization algorithm, the fuzzy C-means method, variable precision rough sets theory and a new cluster validity index function. Findings – This method can cluster the values of the individual attributes within the data set and achieve both the optimal number of clusters and the optimal CA. Originality/value – The validity of the proposed approach is investigated by comparing the classification results obtained for UCI data sets with those obtained by the supervised classification methods BPNN and decision trees.

Biodiversity of Active and Inactive Bacteria in the Gut Flora of Wood-Feeding Huhu Beetle Larvae (Prionoplus reticularis)

Applied and Environmental Microbiology, DOI 10.1128/aem.05609-11, 2011, Vol 77 (19), pp. 7000-7006, Cited By ~ 63

Author(s): Nicola M. Reid, Sarah L. Addison, Lucy J. Macdonald, Gareth Lloyd-Jones

Keyword(s): Genomic Dna, Data Sets, Data Set, Content Type, Dna And Rna, New Information, Related Organism, The Family, Insect Symbionts, Derived Data

Abstract: Huhu grubs (Prionoplus reticularis) are wood-feeding beetle larvae endemic to New Zealand and belonging to the family Cerambycidae. Compared to the wood-feeding lower termites, very little is known about the diversity and activity of microorganisms associated with xylophagous cerambycid larvae. To address this, we used pyrosequencing to evaluate the diversity of metabolically active and inactive bacteria in the huhu larval gut. Our estimate, that the gut harbors at least 1,800 phylotypes, is based on 33,420 sequences amplified from genomic DNA and reverse-transcribed RNA. Analysis of genomic DNA- and RNA-derived data sets revealed that 71% of all phylotypes (representing 95% of all sequences) were metabolically active. Rare phylotypes contributed considerably to the richness of the community and were also largely metabolically active, indicating their participation in digestive processes in the gut. The dominant families in the active community (RNA data set) included Acidobacteriaceae (24.3%), Xanthomonadaceae (16.7%), Acetobacteraceae (15.8%), Burkholderiaceae (8.7%), and Enterobacteriaceae (4.1%). The most abundant phylotype comprised 14% of the active community and affiliated with Dyella ginsengisoli (Gammaproteobacteria), suggesting that a Dyella-related organism is a likely symbiont. This study provides new information on the diversity and activity of gut-associated microorganisms that are essential for the digestion of the nutritionally poor diet consumed by wood-feeding larvae. Many huhu gut phylotypes affiliated with insect symbionts or with bacteria present in acidic environments or associated with fungi.
