Model Distribution Effects on Likelihood Ratios in Fire Debris Analysis

Computational models for determining the strength of fire debris evidence based on likelihood ratios (LR) were developed and validated against data sets derived from different distributions of ASTM E1618-14 designated ignitable liquid class and substrate pyrolysis contributions using in-silico generated data. The models all perform well in cross validation against the distributions used to generate the model. However, a model generated based on data that does not contain representatives from all of the ASTM E1618-14 classes does not perform well in validation with data sets that contain representatives from the missing classes. A quadratic discriminant model based on a balanced data set (ignitable liquid versus substrate pyrolysis), with a uniform distribution of the ASTM E1618-14 classes, performed well (receiver operating characteristic area under the curve of 0.836) when tested against laboratory-developed casework-relevant samples of known ground truth.

Download Full-text

Open-Source Data Collection and Data Sets for Activity Recognition in Smart Homes

Sensors ◽

10.3390/s20030879 ◽

2020 ◽

Vol 20 (3) ◽

pp. 879 ◽

Cited By ~ 2

Author(s):

Uwe Köckemann ◽

Marjan Alirezaie ◽

Jennifer Renoux ◽

Nicolas Tsiftes ◽

Mobyen Uddin Ahmed ◽

...

Keyword(s):

Data Collection ◽

Activity Recognition ◽

Care Home ◽

Open Data ◽

Ground Truth ◽

Smart Homes ◽

Sensor Data ◽

Data Sets ◽

Data Set ◽

Home Setting

As research in smart homes and activity recognition is increasing, it is of ever increasing importance to have benchmarks systems and data upon which researchers can compare methods. While synthetic data can be useful for certain method developments, real data sets that are open and shared are equally as important. This paper presents the E-care@home system, its installation in a real home setting, and a series of data sets that were collected using the E-care@home system. Our first contribution, the E-care@home system, is a collection of software modules for data collection, labeling, and various reasoning tasks such as activity recognition, person counting, and configuration planning. It supports a heterogeneous set of sensors that can be extended easily and connects collected sensor data to higher-level Artificial Intelligence (AI) reasoning modules. Our second contribution is a series of open data sets which can be used to recognize activities of daily living. In addition to these data sets, we describe the technical infrastructure that we have developed to collect the data and the physical environment. Each data set is annotated with ground-truth information, making it relevant for researchers interested in benchmarking different algorithms for activity recognition.

Download Full-text

Novel Approaches to Smoothing and Comparing SELDI TOF Spectra

Cancer Informatics ◽

10.1177/117693510500100109 ◽

2005 ◽

Vol 1 ◽

pp. 117693510500100 ◽

Cited By ~ 4

Author(s):

Sreelatha Meleth ◽

Isam-Eldin Eltoum ◽

Liu Zhu ◽

Denise Oelschlager ◽

Chandrika Piyathilake ◽

...

Keyword(s):

Spectral Analysis ◽

Fourier Transforms ◽

Area Under The Curve ◽

Maximum Intensity ◽

Data Sets ◽

Intensity Level ◽

Prominent Feature ◽

Data Set ◽

The Third ◽

Novel Approaches

Background Most published literature using SELDI-TOF has used traditional techniques in Spectral Analysis such as Fourier transforms and wavelets for denoising. Most of these publications also compare spectra using their most prominent feature, ie, peaks or local maximums. Methods The maximum intensity value within each window of differentiable m/z values was used to represent the intensity level in that window. We also calculated the ‘Area under the Curve’ (AUC) spanned by each window. Results Keeping everything else constant, such as pre-processing of the data and the classifier used, the AUC performed much better as a metric of comparison than the peaks in two out of three data sets. In the third data set both metrics performed equivalently. Conclusions This study shows that the feature used to compare spectra can have an impact on the results of a study attempting to identify biomarkers using SELDI TOF data.

Download Full-text

SaltSeg: Automatic 3D salt segmentation using a deep convolutional neural network

Interpretation ◽

10.1190/int-2018-0235.1 ◽

2019 ◽

Vol 7 (3) ◽

pp. SE113-SE122 ◽

Cited By ~ 26

Author(s):

Yunzhi Shi ◽

Xinming Wu ◽

Sergey Fomel

Keyword(s):

Large Scale ◽

Model Building ◽

Ground Truth ◽

Velocity Model ◽

Training Data ◽

Data Sets ◽

Validation Data ◽

Data Set ◽

Seismic Image ◽

Data Generator

Salt boundary interpretation is important for the understanding of salt tectonics and velocity model building for seismic migration. Conventional methods consist of computing salt attributes and extracting salt boundaries. We have formulated the problem as 3D image segmentation and evaluated an efficient approach based on deep convolutional neural networks (CNNs) with an encoder-decoder architecture. To train the model, we design a data generator that extracts randomly positioned subvolumes from large-scale 3D training data set followed by data augmentation, then feed a large number of subvolumes into the network while using salt/nonsalt binary labels generated by thresholding the velocity model as ground truth labels. We test the model on validation data sets and compare the blind test predictions with the ground truth. Our results indicate that our method is capable of automatically capturing subtle salt features from the 3D seismic image with less or no need for manual input. We further test the model on a field example to indicate the generalization of this deep CNN method across different data sets.

Download Full-text

An Interpretable Computer-Aided Diagnosis Method for Periodontitis From Panoramic Radiographs

Frontiers in Physiology ◽

10.3389/fphys.2021.655556 ◽

2021 ◽

Vol 12 ◽

Author(s):

Haoyang Li ◽

Juexiao Zhou ◽

Yi Zhou ◽

Qiang Chen ◽

Yangyang She ◽

...

Keyword(s):

Computational Models ◽

Alveolar Bone ◽

Calibration Method ◽

Disease Diagnosis ◽

Chronic Inflammatory Disease ◽

Data Sets ◽

Data Set ◽

Panoramic Radiographs ◽

The Individual ◽

Severity Degree

Periodontitis is a prevalent and irreversible chronic inflammatory disease both in developed and developing countries, and affects about 20–50% of the global population. The tool for automatically diagnosing periodontitis is highly demanded to screen at-risk people for periodontitis and its early detection could prevent the onset of tooth loss, especially in local communities and health care settings with limited dental professionals. In the medical field, doctors need to understand and trust the decisions made by computational models and developing interpretable models is crucial for disease diagnosis. Based on these considerations, we propose an interpretable method called Deetal-Perio to predict the severity degree of periodontitis in dental panoramic radiographs. In our method, alveolar bone loss (ABL), the clinical hallmark for periodontitis diagnosis, could be interpreted as the key feature. To calculate ABL, we also propose a method for teeth numbering and segmentation. First, Deetal-Perio segments and indexes the individual tooth via Mask R-CNN combined with a novel calibration method. Next, Deetal-Perio segments the contour of the alveolar bone and calculates a ratio for individual tooth to represent ABL. Finally, Deetal-Perio predicts the severity degree of periodontitis given the ratios of all the teeth. The Macro F1-score and accuracy of the periodontitis prediction task in our method reach 0.894 and 0.896, respectively, on Suzhou data set, and 0.820 and 0.824, respectively on Zhongshan data set. The entire architecture could not only outperform state-of-the-art methods and show robustness on two data sets in both periodontitis prediction, and teeth numbering and segmentation tasks, but also be interpretable for doctors to understand the reason why Deetal-Perio works so well.

Download Full-text

The importance of the accuracy of the experimental data for the prediction of solubility

Journal of the Serbian Chemical Society ◽

10.2298/jsc090809022e ◽

2010 ◽

Vol 75 (4) ◽

pp. 483-495 ◽

Cited By ~ 2

Author(s):

Slavica Eric ◽

Marko Kalinic ◽

Aleksandar Popovic ◽

Halid Makic ◽

Elvisa Civic ◽

...

Keyword(s):

Experimental Data ◽

Computational Models ◽

Linear Regression Analysis ◽

Heuristic Method ◽

Parameter Analysis ◽

Solubility Data ◽

Data Sets ◽

Data Set ◽

Experimental Solubility Data ◽

Experimental Solubility

Aqueous solubility is an important factor influencing several aspects of the pharmacokinetic profile of a drug. Numerous publications present different methodologies for the development of reliable computational models for the prediction of solubility from structure. The quality of such models can be significantly affected by the accuracy of the employed experimental solubility data. In this work, the importance of the accuracy of the experimental solubility data used for model training was investigated. Three data sets were used as training sets - Data Set 1 containing solubility data collected from various literature sources using a few criteria (n = 319), Data Set 2 created by substituting 28 values from Data set 1 with uniformly determined experimental data from one laboratory (n = 319) and Data Set 3 created by including 56 additional components, for which the solubility was also determined under uniform conditions in the same laboratory, in the Data Set 2 (n = 375). The selection of the most significant descriptors was performed by the heuristic method, using one-parameter and multi-parameter analysis. The correlations between the most significant descriptors and solubility were established using multi-linear regression analysis (MLR) for all three investigated data sets. Notable differences were observed between the equations corresponding to different data sets, suggesting that models updated with new experimental data need to be additionally optimized. It was successfully shown that the inclusion of uniform experimental data consistently leads to an improvement in the correlation coefficients. These findings contribute to an emerging consensus that improving the reliability of solubility prediction requires the inclusion of many diverse compounds for which solubility was measured under standardized conditions in the data set.

Download Full-text

An Integration of Cardiovascular Event Data and Machine Learning Models for Cardiac Arrest Predictions

International Journal of Health Sciences and Pharmacy ◽

10.47992/ijhsp.2581.6411.0061 ◽

2021 ◽

pp. 55-71

Author(s):

Krishna Prasad K ◽

Aithal P. S. ◽

Navin N. Bappalige ◽

Soumya S

Keyword(s):

Machine Learning ◽

Cardiac Arrest ◽

Area Under The Curve ◽

Computer Applications ◽

Data Sets ◽

Cardiovascular Risks ◽

Data Set ◽

Average Area ◽

Learning Classifier ◽

Tree Classifier

Purpose: Predicting and then preventing cardiac arrest of a patient in ICU is the most challenging phase even for a most highly skilled professional. The data been collected in ICU for a patient are huge, and the selection of a portion of data for preventing cardiac arrest in a quantum of time is highly decisive, analysing and predicting that large data require an effective system. An effective integration of computer applications and cardiovascular data is necessary to predict the cardiovascular risks. A machine learning technique is the right choice in the advent of technology to manage patients with cardiac arrest. Methodology: In this work we have collected and merged three data sets, Cleveland Dataset of US patients with total 303 records, Statlog Dataset of UK patients with 270 records, and Hungarian dataset of Hungary, Switzerland with 617 records. These data are the most comprehensive data set with a combination of all three data sets consisting of 11 common features with 1190 records. Findings/Results: Feature extraction phase extracts 7 features, which contribute to the event. In addition, extracted features are used to train the selected machine learning classifier models, and results are obtained and obtained results are then evaluated using test data and final results are drawn. Extra Tree Classifier has the highest value of 0.957 for average area under the curve (AUC). Originality: The originality of this combined Dataset analysis using machine learning classifier model results Extra Tree Classifier with highest value of 0.957 for average area under the curve (AUC). Paper Type: Experimental Research Keywords: Cardiac, Machine Learning, Random Forest, XBOOST, ROC AUC, ST Slope.

Download Full-text

Assessing Generalizability of Deep Learning Models Trained on Standardized and Nonstandardized Images and Their Performance Against Teledermatologists (Preprint)

10.2196/preprints.35391 ◽

2021 ◽

Author(s):

Ibukun Oloruntoba ◽

Toan D Nguyen ◽

Zongyuan Ge ◽

Tine Vestergaard ◽

Victoria Mar

Keyword(s):

Skin Cancer ◽

Area Under The Curve ◽

Image Data ◽

Skin Lesions ◽

Training Image ◽

Data Sets ◽

Image Capture ◽

Data Set ◽

Unseen Data ◽

Test Variability

BACKGROUND Convolutional neural networks (CNNs) are a type of artificial intelligence that show promise as a diagnostic aid for skin cancer. However, the majority are trained using retrospective image data sets of varying quality and image capture standardization. OBJECTIVE The aim of our study is to use CNN models with the same architecture, but different training image sets, and test variability in performance when classifying skin cancer images in different populations, acquired with different devices. Additionally, we wanted to assess the performance of the models against Danish teledermatologists when tested on images acquired from Denmark. METHODS Three CNNs with the same architecture were trained. CNN-NS was trained on 25,331 nonstandardized images taken from the International Skin Imaging Collaboration using different image capture devices. CNN-S was trained on 235,268 standardized images, and CNN-S2 was trained on 25,331 standardized images (matched for number and classes of training images to CNN-NS). Both standardized data sets (CNN-S and CNN-S2) were provided by Molemap using the same image capture device. A total of 495 Danish patients with 569 images of skin lesions predominantly involving Fitzpatrick skin types II and III were used to test the performance of the models. Four teledermatologists independently diagnosed and assessed the images taken of the lesions. Primary outcome measures were sensitivity, specificity, and area under the curve of the receiver operating characteristic (AUROC). RESULTS A total of 569 images were taken from 495 patients (n=280, 57% women, n=215, 43% men; mean age 55, SD 17 years) for this study. On these images, CNN-S achieved an AUROC of 0.861 (95% CI 0.830-0.889; P<.001), and CNN-S2 achieved an AUROC of 0.831 (95% CI 0.798-0.861; P=.009), with both outperforming CNN-NS, which achieved an AUROC of 0.759 (95% CI 0.722-0.794; P<.001; P=.009). When the CNNs were matched to the mean sensitivity and specificity of the teledermatologists, the model’s resultant sensitivities and specificities were surpassed by the teledermatologists. However, when compared to CNN-S, the differences were not statistically significant (P=.10; P=.05). Performance across all CNN models and teledermatologists was influenced by the image quality. CONCLUSIONS CNNs trained on standardized images had improved performance and therefore greater generalizability in skin cancer classification when applied to an unseen data set. This is an important consideration for future algorithm development, regulation, and approval. Further, when tested on these unseen test images, the teledermatologists clinically outperformed all the CNN models; however, the difference was deemed to be statistically insignificant when compared to CNN-S.

Download Full-text

Land Map Image Dataset: Ground-Truth And Classification Using Visual And Textural Features

Image Processing & Communications ◽

10.1515/ipc-2015-0024 ◽

2014 ◽

Vol 19 (4) ◽

pp. 37-55 ◽

Cited By ~ 1

Author(s):

Sayan Mandal ◽

Samit Biswas ◽

Amit Kumar Das ◽

Bhabatosh Chanda

Keyword(s):

Ground Truth ◽

Document Image ◽

Geographical Information ◽

Data Sets ◽

Data Set ◽

Handwritten Documents ◽

Form Processing ◽

Public Data ◽

Active Research ◽

The Right

Abstract Research on document image analysis is actively pursued in the last few decades and services like OCR, vectorization of drawings/graphics and various types of form processing are very common. Handwritten documents, old historical documents and documents captured through camera are now being the subjects of active research. However, another very important type of paper document, namely the map document image processing research suffers due to the inherent complexities of the map document and also for nonavailability of benchmark public data-sets. This paper presents a new data-set, namely, the Land Map Image Database (LMIDb) that consists of a variety of land maps images (446 images at present and growing; scanned at 200/300 dpi in TIF format) and the corresponding ground-truth. Using semiautomatic tools non-text part of the images are deleted and the text-only ground-truth is also kept in the database. This paper also presents a classification strategy for map images using which the maps in the database are automatically classified into Political (Po), Physical (Ph), Resource (R) and Topographic (T) maps. The automatic classification of maps help indexing of the images in LMIDb for archival and easy retrieval of the right maps to get the appropriate geographical information. Classification accuracy is also tested on the proposed data-set and the result is encouraging.

Download Full-text

Assessing Generalizability of Deep Learning Models Trained on Standardized and Nonstandardized Images and Their Performance Against Teledermatologists

Iproceedings ◽

10.2196/35391 ◽

2021 ◽

Vol 6 (1) ◽

pp. e35391

Author(s):

Ibukun Oloruntoba ◽

Toan D Nguyen ◽

Zongyuan Ge ◽

Tine Vestergaard ◽

Victoria Mar

Keyword(s):

Skin Cancer ◽

Conflicts Of Interest ◽

Area Under The Curve ◽

Image Data ◽

Skin Lesions ◽

Training Image ◽

Data Sets ◽

Image Capture ◽

Data Set ◽

Unseen Data

Background Convolutional neural networks (CNNs) are a type of artificial intelligence that show promise as a diagnostic aid for skin cancer. However, the majority are trained using retrospective image data sets of varying quality and image capture standardization. Objective The aim of our study is to use CNN models with the same architecture, but different training image sets, and test variability in performance when classifying skin cancer images in different populations, acquired with different devices. Additionally, we wanted to assess the performance of the models against Danish teledermatologists when tested on images acquired from Denmark. Methods Three CNNs with the same architecture were trained. CNN-NS was trained on 25,331 nonstandardized images taken from the International Skin Imaging Collaboration using different image capture devices. CNN-S was trained on 235,268 standardized images, and CNN-S2 was trained on 25,331 standardized images (matched for number and classes of training images to CNN-NS). Both standardized data sets (CNN-S and CNN-S2) were provided by Molemap using the same image capture device. A total of 495 Danish patients with 569 images of skin lesions predominantly involving Fitzpatrick skin types II and III were used to test the performance of the models. Four teledermatologists independently diagnosed and assessed the images taken of the lesions. Primary outcome measures were sensitivity, specificity, and area under the curve of the receiver operating characteristic (AUROC). Results A total of 569 images were taken from 495 patients (n=280, 57% women, n=215, 43% men; mean age 55, SD 17 years) for this study. On these images, CNN-S achieved an AUROC of 0.861 (95% CI 0.830-0.889; P<.001), and CNN-S2 achieved an AUROC of 0.831 (95% CI 0.798-0.861; P=.009), with both outperforming CNN-NS, which achieved an AUROC of 0.759 (95% CI 0.722-0.794; P<.001; P=.009). When the CNNs were matched to the mean sensitivity and specificity of the teledermatologists, the model’s resultant sensitivities and specificities were surpassed by the teledermatologists. However, when compared to CNN-S, the differences were not statistically significant (P=.10; P=.05). Performance across all CNN models and teledermatologists was influenced by the image quality. Conclusions CNNs trained on standardized images had improved performance and therefore greater generalizability in skin cancer classification when applied to an unseen data set. This is an important consideration for future algorithm development, regulation, and approval. Further, when tested on these unseen test images, the teledermatologists clinically outperformed all the CNN models; however, the difference was deemed to be statistically insignificant when compared to CNN-S. Conflicts of Interest VM received speakers fees from Merck, Eli Lily, Novartis and Bristol Myers Squibb. VM is the principal investigator for a clinical trial funded by the Victorian Department of Health and Human Services with 1:1 contribution from MoleMap.

Download Full-text

DISC: Disambiguating homonyms using graph structural clustering

Journal of Information Science ◽

10.1177/0165551518761011 ◽

2018 ◽

Vol 44 (6) ◽

pp. 830-847 ◽

Cited By ~ 4

Author(s):

Ijaz Hussain ◽

Sohail Asghar

Keyword(s):

Ground Truth ◽

Detection Algorithm ◽

Bibliographic Databases ◽

Data Sets ◽

Database Integration ◽

Data Set ◽

Entity Disambiguation ◽

Structural Clustering ◽

Community Detection Algorithm ◽

Almost All

Author name ambiguity degrades information retrieval, database integration, search results and, more importantly, correct attributions in bibliographic databases. Some unresolved issues include how to ascertain the actual number of authors, how to improve the performance and how to make the method more effective in terms of representative clustering metrics (average cluster purity, average author purity, K-metric, pairwise precision, pairwise recall, pairwise-F1, cluster precision, cluster recall and cluster-F1). It is a non-trivial task to disambiguate authors using only the implicit bibliographic information. An effective method ‘DISC’ is proposed that uses graph community detection algorithm, feature vectors and graph operations to disambiguate homonyms. The citation data set is pre-processed and ambiguous author blocks are formed. A co-authors graph is constructed using authors and their co-author’s relationships. A graph structural clustering ‘gSkeletonClu’ is applied to identify hubs, outliers and clusters of nodes in a co-author’s graph. Homonyms are resolved by splitting these clusters of nodes across the hub if their feature vector similarity is less than a predefined threshold. DISC utilises only co-authors and titles that are available in almost all bibliographic databases. With little modifications, DISC can also be used for entity disambiguation. To validate the DISC performance, experiments are performed on two Arnetminer data sets and compared with five previous unsupervised methods. Despite using limited bibliographic metadata, DISC achieves on average K-metric, pairwise-F1, and cluster-F1 of 92%, 84% and 74%, respectively, using Arnetminer-S and 86%, 80% and 57%, respectively, using Arnetminer-L. About 77.5% and 73.2% clusters are within the range (ground truth clusters ± 3) in Arnetminer-S and Arnetminer-L, respectively.

Download Full-text