Land Map Image Dataset: Ground-Truth And Classification Using Visual And Textural Features

2014 ◽  
Vol 19 (4) ◽  
pp. 37-55 ◽  
Author(s):  
Sayan Mandal ◽  
Samit Biswas ◽  
Amit Kumar Das ◽  
Bhabatosh Chanda

Abstract: Research on document image analysis has been actively pursued over the last few decades, and services such as OCR, vectorization of drawings/graphics and various types of form processing are now very common. Handwritten documents, old historical documents and documents captured by camera are also subjects of active research. However, research on another very important type of paper document, the map document image, suffers from the inherent complexity of map documents and from the non-availability of benchmark public data sets. This paper presents a new data set, the Land Map Image Database (LMIDb), that consists of a variety of land map images (446 images at present and growing; scanned at 200/300 dpi in TIF format) and the corresponding ground truth. Using semiautomatic tools, the non-text parts of the images are removed and the text-only ground truth is also kept in the database. The paper also presents a classification strategy by which the maps in the database are automatically classified into Political (Po), Physical (Ph), Resource (R) and Topographic (T) maps. The automatic classification of maps helps index the images in LMIDb for archival and easy retrieval of the right maps to obtain the appropriate geographical information. Classification accuracy is also tested on the proposed data set and the result is encouraging.
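
The abstract does not detail the exact features or classifier, so the following is a minimal Python sketch of one plausible pipeline, assuming GLCM texture descriptors, coarse RGB color histograms, and an SVM as stand-ins for the paper's visual and textural features. The file paths and the Po/Ph/R/T labels are hypothetical placeholders.

```python
# Illustrative sketch (not the authors' exact pipeline): classify map images into
# Political / Physical / Resource / Topographic classes from simple color and
# texture descriptors. Feature choices and the SVM classifier are assumptions.
import numpy as np
from skimage.io import imread
from skimage.color import rgb2gray
from skimage.feature import graycomatrix, graycoprops
from sklearn.svm import SVC

def map_features(path):
    img = imread(path)                      # assumed to be an RGB scan
    gray = (rgb2gray(img) * 255).astype(np.uint8)
    # Textural features from a gray-level co-occurrence matrix
    glcm = graycomatrix(gray, distances=[1], angles=[0], levels=256, normed=True)
    texture = [graycoprops(glcm, p)[0, 0] for p in ("contrast", "homogeneity", "energy")]
    # Visual (color) features: a coarse per-channel histogram
    color = [np.histogram(img[..., c], bins=8, range=(0, 255))[0] for c in range(3)]
    return np.concatenate([texture, np.concatenate(color)])

# train_paths / train_labels are hypothetical lists built from the LMIDb ground truth
# X = np.array([map_features(p) for p in train_paths])
# clf = SVC(kernel="rbf").fit(X, train_labels)   # labels: "Po", "Ph", "R", "T"
```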

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Yao Shen ◽  
Zhipeng Yan

Abstract: To study the drug resistance problem caused by transporters, we leveraged multiple large-scale public data sets of drug sensitivity, cell line genetic and transcriptional profiles, and gene silencing experiments. Through systematic integration of these data sets, we built various machine learning models to predict the difference between cell viability upon drug treatment and upon silencing of the drug's target across the same cell lines. More than 50% of the models built with the same data set or with independent data sets successfully predicted the testing set with significant correlation to the ground truth data. Features selected by our models were also significantly enriched in known drug transporters annotated in DrugBank for more than 60% of the models. Novel drug-transporter interactions were discovered, such as lapatinib and gefitinib with ABCA1, olaparib and NVP-ADW742 with ABCC3, and gefitinib and AZ628 with SLC4A4. Furthermore, we identified ABCC3, SLC12A7, SLCO4A1, SERPINA1, and SLC22A3 as potential transporters for erlotinib, three of which are also significantly more highly expressed in patients who were resistant to therapy in a clinical trial.
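
A minimal sketch of the modeling idea, assuming a hypothetical expression matrix and a response defined as the viability difference between drug treatment and target silencing; the elastic-net regressor and all variable names are assumptions for illustration, not the authors' exact pipeline.

```python
# Sketch: predict the drug-vs-silencing viability difference from expression,
# then check the held-out correlation with ground truth, as in the abstract.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
expr = rng.normal(size=(300, 500))            # placeholder cell-line x gene expression
delta = expr[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=300)   # placeholder response

X_tr, X_te, y_tr, y_te = train_test_split(expr, delta, test_size=0.3, random_state=0)
model = ElasticNetCV(cv=5).fit(X_tr, y_tr)
r, p = pearsonr(model.predict(X_te), y_te)    # correlation with held-out ground truth
print(f"test correlation r={r:.2f} (p={p:.1e})")

# Features with large coefficients would then be tested for enrichment in known
# transporters (e.g., DrugBank annotations), as described in the abstract.
top_genes = np.argsort(np.abs(model.coef_))[::-1][:50]
```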


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Jiawei Lian ◽  
Junhong He ◽  
Yun Niu ◽  
Tianze Wang

Purpose The current popular image processing technologies based on convolutional neural networks involve large computation, high storage cost and low accuracy for tiny defect detection, which conflicts with the high real-time performance, high accuracy, and limited computing and storage resources required by industrial applications. Therefore, an improved YOLOv4, named YOLOv4-Defect, is proposed to solve the above problems. Design/methodology/approach On the one hand, this study performs multi-dimensional compression on the feature extraction network of YOLOv4 to simplify the model and improves the feature extraction ability of the model through knowledge distillation. On the other hand, a prediction scale with a finer receptive field is added to optimize the model structure, which improves the detection performance for tiny defects. Findings The effectiveness of the method is verified on the public data sets NEU-CLS and DAGM 2007, and on a steel ingot data set collected in an actual industrial setting. The experimental results demonstrate that the proposed YOLOv4-Defect method can greatly improve recognition efficiency and accuracy and reduce the size and computational cost of the model. Originality/value This paper proposes an improved YOLOv4, named YOLOv4-Defect, for surface defect detection, which is conducive to application in industrial scenarios with limited storage and computing resources and meets the requirements of high real-time performance and precision.
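
As a rough illustration of the knowledge-distillation step mentioned above, the sketch below shows a generic soft-target distillation loss in PyTorch; the temperature, the loss weighting, and the random tensors are assumptions, and the snippet is not the paper's exact formulation.

```python
# Generic knowledge-distillation loss: the compressed (student) network is trained
# to match both the ground-truth labels and the softened outputs of the full
# (teacher) network. T and alpha are illustrative hyperparameter assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.5):
    # Soft-target term: match the teacher's softened class distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target term: standard supervised loss on ground-truth labels.
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard

# Example with random tensors standing in for detector class predictions.
s = torch.randn(8, 10)          # student logits
t = torch.randn(8, 10)          # teacher logits
y = torch.randint(0, 10, (8,))  # labels
loss = distillation_loss(s, t, y)
```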


2018 ◽  
Vol 15 (6) ◽  
pp. 172988141881470
Author(s):  
Nezih Ergin Özkucur ◽  
H Levent Akın

Self-localization is one of the fundamental issues in the development of intelligent autonomous robots, and processing raw sensory information into useful features is an integral part of this problem. In a typical scenario, there are several choices for the feature extraction algorithm, each with its weaknesses and strengths depending on the characteristics of the environment. In this work, we introduce a localization algorithm that captures the quality of a feature type based on the local environment and makes a soft selection of feature types across different regions. A batch expectation-maximization algorithm is developed for both discrete and Monte Carlo localization models, exploiting the probabilistic pose estimations of the robot without requiring ground truth poses and treating the different observation types as black-box algorithms. We tested our method in simulations, on data collected from an indoor environment with a custom robot platform, and on a public data set. The results are compared with the individual feature types as well as a naive fusion strategy.
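
A hedged sketch of the core idea: treat each feature extractor as a black box that returns an observation likelihood, and learn per-region mixture weights over feature types with a batch EM loop. The region discretization, variable names, and synthetic likelihoods below are illustrative assumptions, not the paper's formulation.

```python
# Soft selection of feature types: EM over per-region mixture weights, where each
# column of `likelihoods` comes from one black-box feature extraction algorithm.
import numpy as np

def em_feature_weights(likelihoods, regions, n_regions, n_iter=20):
    """likelihoods: (n_obs, n_feature_types) observation likelihoods
       regions:     (n_obs,) region index of each observation."""
    n_types = likelihoods.shape[1]
    w = np.full((n_regions, n_types), 1.0 / n_types)   # per-region soft selection
    for _ in range(n_iter):
        # E-step: responsibility of each feature type for each observation
        resp = w[regions] * likelihoods
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixture weights region by region
        for r in range(n_regions):
            mask = regions == r
            if mask.any():
                w[r] = resp[mask].mean(axis=0)
    return w

# Synthetic example: feature type 0 is more reliable in region 0, type 1 in region 1.
rng = np.random.default_rng(1)
regions = rng.integers(0, 2, size=200)
lik = np.where(regions[:, None] == 0, [0.8, 0.2], [0.3, 0.7]) + rng.uniform(0, 0.05, (200, 2))
print(em_feature_weights(lik, regions, n_regions=2))
```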


2011 ◽  
pp. 24-32 ◽  
Author(s):  
Nicoleta Rogovschi ◽  
Mustapha Lebbah ◽  
Younès Bennani

Most traditional clustering algorithms are limited to handling data sets that contain either continuous or categorical variables. However, data sets with mixed types of variables are common in the data mining field. In this paper we introduce a weighted self-organizing map for clustering, analysis and visualization of mixed (continuous/binary) data. The weights and prototypes are learned simultaneously, ensuring an optimized clustering of the data. The higher the weight of a variable, the more the clustering algorithm takes into account the information conveyed by that variable. The learning of these topological maps is combined with a weighting process over the different variables, computing weights that influence the quality of the clustering. We illustrate the power of this method with data sets taken from a public data set repository: a handwritten digit data set, the Zoo data set and three other mixed data sets. The results show a good quality of the topological ordering and homogeneous clustering.
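
As a rough sketch of the weighted-distance idea, the snippet below matches mixed observations to map prototypes under per-variable weights; the update rule is a plain SOM-style step without a neighborhood function, and the fixed weighting shown is an illustration, not the authors' joint learning rule.

```python
# Weighted matching of mixed (continuous/binary) observations to map prototypes:
# each variable's weight scales its contribution to the distance.
import numpy as np

def weighted_bmu(x, prototypes, var_weights):
    # Best-matching unit under a per-variable weighted squared distance.
    d = ((prototypes - x) ** 2 * var_weights).sum(axis=1)
    return np.argmin(d)

rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(100, 3)),                 # continuous variables
               rng.integers(0, 2, size=(100, 2))])        # binary variables
prototypes = X[rng.choice(len(X), size=16, replace=False)].copy()
var_weights = np.ones(X.shape[1]) / X.shape[1]            # learned jointly in the paper

for epoch in range(10):
    lr = 0.5 * (1 - epoch / 10)
    for x in X:
        k = weighted_bmu(x, prototypes, var_weights)
        prototypes[k] += lr * (x - prototypes[k])          # move winning prototype toward x
```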


Author(s):  
Jung Hwan Oh ◽  
Jeong Kyu Lee ◽  
Sae Hwang

Data mining, which is defined as the process of extracting previously unknown knowledge and detecting interesting patterns from a massive set of data, has been an active research area. As a result, several commercial products and research prototypes are available nowadays. However, most of these studies have focused on corporate data, typically in alphanumeric databases, and relatively less work has been pursued on the mining of multimedia data (Zaïane, Han, & Zhu, 2000). Digital multimedia differs from previous forms of combined media in that the bits representing texts, images, audio, and video can be treated as data by computer programs (Simoff, Djeraba, & Zaïane, 2002). One facet of these diverse data, in terms of underlying models and formats, is that they are synchronized and integrated and hence can be treated as integrated data records. The collection of such integral data records constitutes a multimedia data set. The challenge of extracting meaningful patterns from such data sets has led to research and development in the area of multimedia data mining. This is a challenging field due to the non-structured nature of multimedia data. Such ubiquitous data are required in many applications such as financial, medical, advertising and Command, Control, Communications and Intelligence (C3I) (Thuraisingham, Clifton, Maurer, & Ceruti, 2001). Multimedia databases are widespread and multimedia data sets are extremely large. There are tools for managing and searching within such collections, but the need for tools to extract hidden and useful knowledge embedded within multimedia data is becoming critical for many decision-making applications.


Sensors ◽  
2020 ◽  
Vol 20 (3) ◽  
pp. 879 ◽  
Author(s):  
Uwe Köckemann ◽  
Marjan Alirezaie ◽  
Jennifer Renoux ◽  
Nicolas Tsiftes ◽  
Mobyen Uddin Ahmed ◽  
...  

As research in smart homes and activity recognition increases, it is increasingly important to have benchmark systems and data upon which researchers can compare methods. While synthetic data can be useful for certain method development, real data sets that are open and shared are equally important. This paper presents the E-care@home system, its installation in a real home setting, and a series of data sets that were collected using the E-care@home system. Our first contribution, the E-care@home system, is a collection of software modules for data collection, labeling, and various reasoning tasks such as activity recognition, person counting, and configuration planning. It supports a heterogeneous set of sensors that can be extended easily and connects the collected sensor data to higher-level Artificial Intelligence (AI) reasoning modules. Our second contribution is a series of open data sets which can be used to recognize activities of daily living. In addition to these data sets, we describe the technical infrastructure that we have developed to collect the data and the physical environment. Each data set is annotated with ground-truth information, making it relevant for researchers interested in benchmarking different algorithms for activity recognition.


Separations ◽  
2018 ◽  
Vol 5 (3) ◽  
pp. 44 ◽  
Author(s):  
Alyssa Allen ◽  
Mary Williams ◽  
Nicholas Thurn ◽  
Michael Sigman

Computational models for determining the strength of fire debris evidence based on likelihood ratios (LR) were developed and validated against data sets derived from different distributions of ASTM E1618-14 designated ignitable liquid classes and substrate pyrolysis contributions, using in-silico generated data. The models all perform well in cross-validation against the distributions used to generate them. However, a model generated from data that does not contain representatives of all the ASTM E1618-14 classes does not perform well in validation against data sets that contain representatives of the missing classes. A quadratic discriminant model based on a balanced data set (ignitable liquid versus substrate pyrolysis), with a uniform distribution of the ASTM E1618-14 classes, performed well (receiver operating characteristic area under the curve of 0.836) when tested against laboratory-developed, casework-relevant samples of known ground truth.
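
A hedged sketch of the validation setup described above: fit a quadratic discriminant model on a balanced two-class set (ignitable liquid versus substrate pyrolysis) and score it with the ROC area under the curve. The synthetic features below stand in for the in-silico generated data and are not the study's samples.

```python
# Balanced two-class quadratic discriminant model scored by ROC AUC.
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (300, 20)), rng.normal(0.6, 1.2, (300, 20))])
y = np.array([0] * 300 + [1] * 300)        # 0 = substrate pyrolysis, 1 = ignitable liquid

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
qda = QuadraticDiscriminantAnalysis().fit(X_tr, y_tr)
auc = roc_auc_score(y_te, qda.predict_proba(X_te)[:, 1])
print(f"ROC AUC = {auc:.3f}")
```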


2019 ◽  
Vol 18 ◽  
pp. 117693511989029
Author(s):  
James LT Dalgleish ◽  
Yonghong Wang ◽  
Jack Zhu ◽  
Paul S Meltzer

Motivation: DNA copy number (CN) data are a fast-growing source of information used in basic and translational cancer research. Most CN segmentation data are presented without regard to the relationship between chromosomal regions. We offer both a toolkit to help scientists without programming experience visually explore the CN interactome and a package that constructs CN interactomes from publicly available data sets. Results: The CNVScope visualization, based on a publicly available neuroblastoma CN data set, clearly displays a distinct CN interaction in the region of MYCN, a canonical frequent amplicon target in this cancer. Exploration of the data rapidly identified cis and trans events, including a strong anticorrelation between 11q loss and 17q gain, with the region of 11q loss bounded by the cell cycle regulator CCND1. Availability: The Shiny application is readily available for use at http://cnvscope.nci.nih.gov/, and the package can be downloaded from CRAN (https://cran.r-project.org/package=CNVScope), where help pages and vignettes are located. A newer version is available on the GitHub site (https://github.com/jamesdalg/CNVScope/), which features an animated tutorial. The CNVScope package can be installed locally using the instructions on the GitHub site for Windows and Macintosh systems. This CN analysis package also runs on a Linux high-performance computing cluster, with options for multinode and multiprocessor analysis of CN variant data. The Shiny application can be started using a single command (which will automatically install the public data package).
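
CNVScope itself is an R package; as a language-agnostic illustration of the underlying idea (not the package's API), the Python sketch below builds a copy-number "interactome" as the matrix of pairwise correlations between genomic bins across samples, in which strongly negative entries flag anticorrelated events such as 11q loss co-occurring with 17q gain. The simulated data and bin indices are assumptions.

```python
# Toy copy-number interactome: correlate every genomic bin with every other bin
# across samples; co-occurring loss/gain events show up as negative correlations.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_bins = 120, 400
cn = rng.normal(0, 0.2, size=(n_samples, n_bins))    # placeholder segment log2 ratios
affected = rng.random(n_samples) < 0.4
cn[affected, 100:140] -= 1.0                         # simulated "11q loss" bins
cn[affected, 300:340] += 1.0                         # co-occurring "17q gain" bins

interactome = np.corrcoef(cn, rowvar=False)          # bins x bins correlation matrix
print(interactome[120, 320])                         # strongly negative: anticorrelated loss/gain
```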


2019 ◽  
Vol 7 (3) ◽  
pp. SE113-SE122 ◽  
Author(s):  
Yunzhi Shi ◽  
Xinming Wu ◽  
Sergey Fomel

Salt boundary interpretation is important for understanding salt tectonics and for velocity model building in seismic migration. Conventional methods consist of computing salt attributes and extracting salt boundaries. We have formulated the problem as 3D image segmentation and evaluated an efficient approach based on deep convolutional neural networks (CNNs) with an encoder-decoder architecture. To train the model, we design a data generator that extracts randomly positioned subvolumes from a large-scale 3D training data set, applies data augmentation, and then feeds a large number of subvolumes into the network, using salt/nonsalt binary labels generated by thresholding the velocity model as ground truth. We test the model on validation data sets and compare the blind-test predictions with the ground truth. Our results indicate that our method is capable of automatically capturing subtle salt features from the 3D seismic image with little or no need for manual input. We further test the model on a field example to demonstrate the generalization of this deep CNN method across different data sets.
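
A hedged sketch of the training-data generator described above, assuming a large seismic volume with a co-registered velocity model; the salt velocity threshold, the patch size, and the flip augmentation are illustrative assumptions rather than the authors' exact settings.

```python
# Extract randomly positioned subvolumes and build binary salt labels by
# thresholding the velocity model (assumed threshold of 4.4 km/s).
import numpy as np

def subvolume_generator(seismic, velocity, patch=64, salt_velocity=4.4, batch=8):
    nz, ny, nx = seismic.shape
    rng = np.random.default_rng()
    while True:
        imgs, labels = [], []
        for _ in range(batch):
            z, y, x = (rng.integers(0, d - patch) for d in (nz, ny, nx))
            sub = seismic[z:z+patch, y:y+patch, x:x+patch]
            vel = velocity[z:z+patch, y:y+patch, x:x+patch]
            if rng.random() < 0.5:                        # simple augmentation: lateral flip
                sub, vel = sub[:, :, ::-1], vel[:, :, ::-1]
            imgs.append(sub[..., None])
            labels.append((vel > salt_velocity).astype(np.float32)[..., None])
        yield np.stack(imgs), np.stack(labels)

# The yielded batches would feed an encoder-decoder CNN (e.g., a 3D U-Net-style model).
```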


Author(s):  
MUSTAPHA LEBBAH ◽  
YOUNÈS BENNANI ◽  
NICOLETA ROGOVSCHI

This paper introduces a probabilistic self-organizing map for topographic clustering, analysis and visualization of multivariate binary data, or of categorical data using binary coding. We propose a probabilistic formalism dedicated to binary data in which cells are represented by a Bernoulli distribution. Each cell is characterized by a prototype with the same binary coding as used in the data space and by the probability of being different from this prototype. The proposed learning algorithm, a Bernoulli self-organizing map, is an application of the standard EM algorithm. We illustrate the power of this method with six data sets taken from a public data set repository. The results show a good quality of the topological ordering and homogeneous clustering.
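
A minimal sketch of the Bernoulli cell model described above: each cell holds a binary prototype w and a probability eps of a bit differing from it, so p(x | cell) is a product of eps or (1 - eps) factors over the dimensions. For brevity the EM updates below ignore the topological neighborhood, so this reduces to a plain Bernoulli mixture rather than the full self-organizing map.

```python
# EM for a mixture of Bernoulli "cells" with binary prototypes and a per-cell
# mismatch probability (neighborhood smoothing of the full SOM is omitted).
import numpy as np

def bernoulli_em(X, n_cells=9, n_iter=30, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.integers(0, 2, size=(n_cells, d))      # binary prototypes
    eps = np.full(n_cells, 0.25)                   # per-cell mismatch probability
    for _ in range(n_iter):
        mism = (X[:, None, :] != W[None, :, :]).sum(axis=2)          # Hamming distances
        logp = mism * np.log(eps) + (d - mism) * np.log(1 - eps)     # log p(x | cell)
        resp = np.exp(logp - logp.max(axis=1, keepdims=True))
        resp /= resp.sum(axis=1, keepdims=True)                      # E-step responsibilities
        # M-step: prototypes by weighted majority vote, eps from expected mismatches
        W = (resp.T @ X / resp.sum(axis=0)[:, None] > 0.5).astype(int)
        mism = (X[:, None, :] != W[None, :, :]).sum(axis=2)
        eps = np.clip((resp * mism).sum(axis=0) / (resp.sum(axis=0) * d), 1e-3, 0.499)
    return W, eps, resp
```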

