Hierarchy-based semantic embeddings for single-valued & multi-valued categorical variables

Author(s):  
Summaya Mumtaz ◽  
Martin Giese

Abstract In low-resource domains, it is challenging to achieve good performance using existing machine learning methods due to a lack of training data and mixed data types (numeric and categorical). In particular, categorical variables with high cardinality pose a challenge to machine learning tasks such as classification and regression because training requires sufficiently many data points for the possible values of each variable. Since interpolation is not possible, nothing can be learned for values not seen in the training set. This paper presents a method that uses prior knowledge of the application domain to support machine learning in cases with insufficient data. We propose to address this challenge by using embeddings for categorical variables that are based on an explicit representation of domain knowledge (KR), namely a hierarchy of concepts. Our approach is to 1. define a semantic similarity measure between categories, based on the hierarchy (we propose a purely hierarchy-based measure, but other similarity measures from the literature can be used), and 2. use that similarity measure to define a modified one-hot encoding. We propose two embedding schemes for single-valued and multi-valued categorical data. We perform experiments on three different use cases. We first compare existing similarity approaches with our approach on a word pair similarity use case. This is followed by creating word embeddings using different similarity approaches. A comparison with existing methods such as Google Word2Vec and GloVe embeddings on several benchmarks shows better performance on concept categorisation tasks when using knowledge-based embeddings. The third use case uses a medical dataset to compare the performance of semantic-based embeddings and standard binary encodings. Significant improvement in performance of the downstream classification tasks is achieved by using semantic information.
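
As an illustration of the idea (not the authors' exact formulation), the sketch below builds a toy concept hierarchy, computes a purely hierarchy-based similarity between categories (a Wu-Palmer-style measure over path depths, chosen here as a stand-in), and replaces a one-hot vector with a vector of similarities to every possible category; the hierarchy, the category names, and the max-pooling used for multi-valued data are all assumptions made for this example.

```python
# Illustrative sketch of a hierarchy-based similarity and a "softened" one-hot encoding.
# The toy hierarchy, the Wu-Palmer-style measure, and the max-pooling for multi-valued
# data are assumptions for this example, not the paper's exact scheme.

parent = {            # child -> parent links of a toy concept hierarchy
    "animal": None,
    "mammal": "animal", "bird": "animal",
    "dog": "mammal", "cat": "mammal", "sparrow": "bird",
}

def ancestors(c):
    """Path from concept c up to the root, starting with c itself."""
    path = []
    while c is not None:
        path.append(c)
        c = parent[c]
    return path

def depth(c):
    return len(ancestors(c))

def hierarchy_similarity(a, b):
    """Wu-Palmer-style measure: 2 * depth(lcs) / (depth(a) + depth(b))."""
    anc_b = set(ancestors(b))
    lcs = next(c for c in ancestors(a) if c in anc_b)   # lowest common subsumer
    return 2.0 * depth(lcs) / (depth(a) + depth(b))

CATEGORIES = ["dog", "cat", "sparrow"]

def semantic_one_hot(value):
    """Modified one-hot: one slot per category, filled with the similarity to `value`."""
    return [hierarchy_similarity(value, c) for c in CATEGORIES]

def multi_valued_encoding(values):
    """One simple choice for multi-valued variables: element-wise max over the single-value vectors."""
    vectors = [semantic_one_hot(v) for v in values]
    return [max(column) for column in zip(*vectors)]

print(semantic_one_hot("dog"))                      # ~[1.0, 0.67, 0.33]: "cat" gets partial weight
print(multi_valued_encoding(["dog", "sparrow"]))    # ~[1.0, 0.67, 1.0]
```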

2021 ◽  
Author(s):  
Valerie Cross ◽  
Michael Zmuda

Current machine learning research is addressing the problem that occurs when a data set includes numerous features but only a small number of training examples. Microarray data, for example, typically has a very large number of features, the genes, compared to the number of training examples, the patients. An important research problem is to develop techniques that effectively reduce the number of features by selecting the best set of features for use in a machine learning process, referred to as the feature selection problem. Another means of addressing high-dimensional data is the use of an ensemble of base classifiers. Ensembles have been shown to improve the predictive performance of a single model by training multiple models and combining their predictions. This paper examines combining an enhancement of the random subspace model of feature selection, using fuzzy set similarity measures, with different measures for evaluating feature subsets in the construction of an ensemble classifier. Experimental results show that in most cases a fuzzy set similarity measure paired with a feature subset evaluator outperforms the corresponding fuzzy similarity measure by itself, and the learning process typically needs to run on only about half as many base classifiers, since the feature subset evaluator eliminates low-quality feature subsets from use in the ensemble. In general, the fuzzy consistency index is the better-performing feature subset evaluator, and inclusion maximum is the better-performing fuzzy similarity measure.
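
The general recipe described here (draw random feature subsets, score each subset with an evaluator, discard the low-quality subsets, and combine the surviving base classifiers by vote) can be sketched roughly as below; the cross-validation-score evaluator and the 50% keep rate are stand-ins for the paper's fuzzy consistency index and fuzzy similarity measures, which are not reproduced.

```python
# Rough sketch of a random-subspace ensemble with a feature-subset evaluator.
# A cross-validation score stands in for the paper's fuzzy subset evaluators.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=50, n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

n_subsets, subset_size = 20, 10
subsets = [rng.choice(X.shape[1], size=subset_size, replace=False) for _ in range(n_subsets)]

# Evaluate each feature subset and keep only the better half (the "evaluator" step).
scores = [cross_val_score(DecisionTreeClassifier(random_state=0), X_tr[:, s], y_tr, cv=3).mean()
          for s in subsets]
keep = np.argsort(scores)[n_subsets // 2:]

# Train base classifiers only on the surviving subsets and combine by majority vote.
members = [(s, DecisionTreeClassifier(random_state=0).fit(X_tr[:, s], y_tr))
           for s in (subsets[i] for i in keep)]
votes = np.array([clf.predict(X_te[:, s]) for s, clf in members])
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("ensemble accuracy:", (y_pred == y_te).mean())
```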


Author(s):  
Julien Siebert ◽  
Lisa Joeckel ◽  
Jens Heidrich ◽  
Adam Trendowicz ◽  
Koji Nakamichi ◽  
...  

Abstract Nowadays, systems containing components based on machine learning (ML) methods are becoming more widespread. In order to ensure the intended behavior of a software system, there are standards that define necessary qualities of the system and its components (such as ISO/IEC 25010). Due to the different nature of ML, we have to re-interpret existing qualities for ML systems or add new ones (such as trustworthiness). We have to be very precise about which quality property is relevant for which entity of interest (such as completeness of training data or correctness of trained model), and how to objectively evaluate adherence to quality requirements. In this article, we present how to systematically construct quality models for ML systems based on an industrial use case. This quality model enables practitioners to specify and assess qualities for ML systems objectively. In addition to the overall construction process described, the main outcomes include a meta-model for specifying quality models for ML systems, reference elements regarding relevant views, entities, quality properties, and measures for ML systems based on existing research, an example instantiation of a quality model for a concrete industrial use case, and lessons learned from applying the construction process. We found that it is crucial to follow a systematic process in order to come up with measurable quality properties that can be evaluated in practice. In the future, we want to learn how the term quality differs between different types of ML systems and come up with reference quality models for evaluating qualities of ML systems.
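
To make the meta-model idea concrete, one might represent entities of interest, quality properties, and measures as simple records like the hypothetical sketch below; the names, views, and thresholds are illustrative and are not the article's reference elements.

```python
# Hypothetical sketch of a quality-model meta-model for ML systems: entities of interest,
# quality properties attached to them, and measures with evaluation rules. Names are illustrative.
from dataclasses import dataclass, field

@dataclass
class Measure:
    name: str
    unit: str
    threshold: float               # acceptance threshold for the measured value
    higher_is_better: bool = True

    def satisfied(self, value: float) -> bool:
        return value >= self.threshold if self.higher_is_better else value <= self.threshold

@dataclass
class QualityProperty:
    name: str                      # e.g. "completeness" or "correctness"
    entity: str                    # entity of interest, e.g. "training data" or "trained model"
    measures: list = field(default_factory=list)

@dataclass
class QualityModel:
    view: str                      # e.g. "model developer view"
    properties: list = field(default_factory=list)

# Example instantiation with hypothetical measurement results.
qm = QualityModel(view="model developer view", properties=[
    QualityProperty("completeness", "training data", [Measure("label coverage", "%", 95.0)]),
    QualityProperty("correctness", "trained model", [Measure("test accuracy", "%", 90.0)]),
])
observed = {"label coverage": 97.2, "test accuracy": 88.5}
for prop in qm.properties:
    for m in prop.measures:
        status = "ok" if m.satisfied(observed[m.name]) else "violated"
        print(f"{prop.entity} / {prop.name} / {m.name}: {status}")
```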


2020 ◽  
Vol 12 (7) ◽  
pp. 1218
Author(s):  
Laura Tuşa ◽  
Mahdi Khodadadzadeh ◽  
Cecilia Contreras ◽  
Kasra Rafiezadeh Shahi ◽  
Margret Fuchs ◽  
...  

Due to the extensive drilling performed every year in exploration campaigns for the discovery and evaluation of ore deposits, drill-core mapping is becoming an essential step. While valuable mineralogical information is extracted during core logging by on-site geologists, the process is time consuming and depends on the observer and their individual background. Hyperspectral short-wave infrared (SWIR) data is used in the mining industry as a tool to complement traditional logging techniques and to provide a rapid and non-invasive analytical method for mineralogical characterization. Additionally, Scanning Electron Microscopy-based image analyses using a Mineral Liberation Analyser (SEM-MLA) provide exhaustive high-resolution mineralogical maps, but can only be performed on small areas of the drill-cores. We propose to use machine learning algorithms to combine the two data types and upscale the quantitative SEM-MLA mineralogical data to drill-core scale. This way, quasi-quantitative maps over entire drill-core samples are obtained. Our upscaling approach increases result transparency and reproducibility by coupling physics-based data acquisition (hyperspectral imaging) with mathematical models (machine learning). The procedure is tested on five drill-core samples with varying training data, using random forest, support vector machine and neural network regression models. The obtained mineral abundance maps are further used for the extraction of mineralogical parameters such as mineral association.
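
The upscaling step, in essence, fits a regression model from hyperspectral SWIR pixel spectra to co-registered SEM-MLA mineral abundances and then predicts abundances for every pixel of the core; the sketch below shows that idea with synthetic arrays and a random forest, and all array shapes and names are assumptions for illustration.

```python
# Sketch of upscaling SEM-MLA mineral abundances to drill-core scale with a regression model.
# Synthetic arrays stand in for real hyperspectral SWIR cubes and co-registered SEM-MLA maps.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n_bands, n_minerals = 200, 4

# Training pixels: spectra from the small SEM-MLA-mapped area, with known abundances summing to 1.
X_train = rng.random((500, n_bands))                       # SWIR reflectance spectra
raw = rng.random((500, n_minerals))
y_train = raw / raw.sum(axis=1, keepdims=True)             # per-pixel mineral abundances

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Apply to the whole drill-core hyperspectral image (rows x cols x bands) to obtain
# quasi-quantitative mineral abundance maps at drill-core scale.
core = rng.random((50, 30, n_bands))
abundance_map = model.predict(core.reshape(-1, n_bands)).reshape(50, 30, n_minerals)
print("abundance map shape:", abundance_map.shape)
```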


2020 ◽  
Vol 10 (21) ◽  
pp. 7831
Author(s):  
Han Kyul Kim ◽  
Sae Won Choi ◽  
Ye Seul Bae ◽  
Jiin Choi ◽  
Hyein Kwon ◽  
...  

With growing interest in machine learning, text standardization is becoming an increasingly important aspect of data pre-processing within biomedical communities. As the performance of machine learning algorithms is affected by both the amount and the quality of training data, effective data standardization is needed to guarantee consistent data integrity. Furthermore, biomedical organizations, depending on their geographical locations or affiliations, rely on different text standardization conventions in practice. To facilitate easier machine learning-related collaborations between these organizations, an effective yet practical text data standardization method is needed. In this paper, we introduce MARIE (a context-aware term mapping method with string matching and embedding vectors), an unsupervised learning-based tool, to find standardized clinical terminologies for queries such as a hospital's own codes. By incorporating both string matching methods and term embedding vectors generated by BioBERT (bidirectional encoder representations from transformers for biomedical text mining), it utilizes both structural and contextual information to calculate similarity measures between source and target terms. Compared to previous term mapping methods, MARIE shows improved mapping accuracy. Furthermore, it can easily be expanded to incorporate any string matching or term embedding method. Without requiring any additional model training, it is not only an effective but also a practical term mapping method for text data standardization and pre-processing.
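
A minimal sketch of the general idea (blend a string-matching score with cosine similarity between term embeddings, then rank candidate standard terms) is shown below; the 0.5/0.5 weighting, the toy embeddings, and the candidate list are assumptions, and real BioBERT vectors would replace the random placeholders.

```python
# Sketch of context-aware term mapping that blends string matching with embedding similarity.
# Random vectors stand in for BioBERT term embeddings; the equal weighting is an assumption.
from difflib import SequenceMatcher
import numpy as np

rng = np.random.default_rng(0)
targets = ["myocardial infarction", "migraine", "myocarditis"]          # standard terminology
emb = {t: rng.random(768) for t in targets}                             # placeholder embeddings
emb["MI (heart attack)"] = emb["myocardial infarction"] + 0.01 * rng.random(768)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def string_sim(a, b):
    """Structural similarity between the surface forms of two terms."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def map_term(query, w_string=0.5, w_embed=0.5):
    """Rank standard terms by a weighted blend of structural and contextual similarity."""
    scored = [(w_string * string_sim(query, t) + w_embed * cosine(emb[query], emb[t]), t)
              for t in targets]
    return max(scored)

print(map_term("MI (heart attack)"))   # the embedding signal pulls the mapping toward "myocardial infarction"
```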


2019 ◽  
Vol 9 (2) ◽  
pp. 129-143 ◽  
Author(s):  
Bjørn Magnus Mathisen ◽  
Agnar Aamodt ◽  
Kerstin Bach ◽  
Helge Langseth

Abstract Defining similarity measures is a requirement for some machine learning methods. One such method is case-based reasoning (CBR), where the similarity measure is used to retrieve the stored case or set of cases most similar to the query case. Describing a similarity measure analytically is challenging, even for domain experts working with CBR experts. However, datasets are typically gathered as part of constructing a CBR or machine learning system. These datasets are assumed to contain the features that correctly identify the solution from the problem features; thus, they may also contain the knowledge needed to construct or learn such a similarity measure. The main motivation for this work is to automate the construction of similarity measures using machine learning, while keeping training time as low as possible. Working toward this, our objective is to investigate how to apply machine learning to effectively learn a similarity measure. Such a learned similarity measure could be used for CBR systems, but also for clustering data in semi-supervised learning or in one-shot learning tasks. Recent work has advanced toward this goal, but it relies on either very long training times or on manually modeling parts of the similarity measure. We created a framework to help us analyze current methods for learning similarity measures. This analysis resulted in two novel similarity measure designs: the first uses a pre-trained classifier as the basis for a similarity measure, and the second uses as little modeling as possible while learning the similarity measure from data and keeping training time low. Both similarity measures were evaluated on 14 different datasets. The evaluation shows that using a classifier as the basis for a similarity measure gives state-of-the-art performance. Finally, the evaluation shows that our fully data-driven similarity measure design outperforms state-of-the-art methods while keeping training time low.
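
The first design, using a pre-trained classifier as the basis for a similarity measure, can be illustrated roughly as below: compare the class-probability vectors the classifier assigns to two cases; the dataset, classifier, and comparison function are stand-ins, not the authors' exact architecture.

```python
# Rough sketch: derive a similarity measure from a pre-trained classifier by comparing
# the class-probability vectors it assigns to two query cases. Details are illustrative.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(random_state=0).fit(X, y)     # the "pre-trained" classifier

def similarity(case_a, case_b):
    """Similarity of two cases as the overlap of their predicted class distributions."""
    p = clf.predict_proba([case_a, case_b])
    return float(np.minimum(p[0], p[1]).sum())             # histogram intersection in [0, 1]

print(similarity(X[0], X[1]))    # two setosa samples -> close to 1
print(similarity(X[0], X[100]))  # setosa vs. virginica -> close to 0
```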


2021 ◽  
Author(s):  
Petros Barmpas ◽  
Sotiris Tasoulis ◽  
Aristidis G. Vrahatis ◽  
Panagiotis Anagnostou ◽  
Spiros Georgakopoulos ◽  
...  

Abstract Recent technological advancements in various domains, such as biomedicine and health, offer a plethora of big data for analysis. Part of this data pool consists of experimental studies that record numerous features for each instance, creating datasets of very high dimensionality with mixed data types, containing both numerical and categorical variables. On the other hand, unsupervised learning has been shown to assist with high-dimensional data, allowing the discovery of unknown patterns through clustering, visualization, dimensionality reduction, and, in some cases, their combination. This work highlights unsupervised learning methodologies for large-scale, high-dimensional data, pointing toward a unified framework that combines the knowledge retrieved from clustering and visualization. The main purpose is to uncover hidden patterns in a high-dimensional mixed dataset, which we achieve through our application to a complex, real-world dataset. The experimental analysis uncovers notable structure in the data, demonstrating the usefulness of the methodological framework for similar high-dimensional, mixed, real-world applications.
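
One concrete way to combine clustering, dimensionality reduction, and visualization on mixed data is sketched below: one-hot encode the categorical columns, scale everything, project with PCA, and cluster the projection; the toy data frame and the choice of PCA plus k-means are assumptions rather than the chapter's exact pipeline.

```python
# Sketch of an unsupervised pipeline for mixed numeric/categorical data:
# encode, reduce dimensionality, cluster. Toy data; the specific methods are illustrative.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(50, 12, 300),
    "biomarker": rng.normal(1.0, 0.3, 300),
    "sex": rng.choice(["F", "M"], 300),
    "diagnosis": rng.choice(["A", "B", "C"], 300),
})

# One-hot encode the categorical columns and scale all features to comparable ranges.
X = pd.get_dummies(df, columns=["sex", "diagnosis"]).to_numpy(dtype=float)
X = StandardScaler().fit_transform(X)

coords = PCA(n_components=2).fit_transform(X)          # low-dimensional view for visualization
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(coords)
print(pd.Series(labels).value_counts())                # cluster sizes found in the projection
```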


2019 ◽  
Vol 8 (3) ◽  
pp. 6756-6762

A recommendation algorithm comprises two important steps: 1) rating prediction, and 2) recommendation. Rating prediction is a cumulative function of the similarity score between two movies and the rating history of those movies by other users. There are various methods for rating prediction, such as the weighted sum method, regression, and deviation-based methods. All of these methods rely on finding items similar to the items previously viewed or rated by the target user, under the assumption that a user tends to give similar ratings to similar items. The similarities can be computed using various similarity measures, such as Euclidean distance, cosine similarity, adjusted cosine similarity, Pearson correlation, and Jaccard similarity. All of these well-known approaches calculate the similarity score between two movies using rating data alone; hence, such similarity measures cannot accurately model the rating behavior of users. In this paper, we show that the accuracy of rating prediction can be enhanced by incorporating ontological domain knowledge into the similarity computation. This paper introduces a new ontological semantic similarity measure between two movies. For experimental evaluation, the performance of the proposed approach is compared with two existing approaches: 1) adjusted cosine similarity (ACS), and 2) the weighted slope one (WSO) algorithm, in terms of two performance measures: 1) execution time and 2) mean absolute error (MAE). The open-source MovieLens (ml-1m) dataset is used for experimental evaluation. As our results show, the ontological semantic similarity measure enhances the performance of rating prediction compared to the existing well-known approaches.
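
The weighted-sum prediction referred to here, and the place where an ontological similarity could be swapped in for a rating-based one, can be sketched as follows; the tiny rating matrix, the adjusted cosine implementation, and the positive-similarity filter are illustrative choices, not the paper's exact algorithm.

```python
# Sketch of item-based weighted-sum rating prediction: the predicted rating for a target item is
# the similarity-weighted average of the user's ratings on similar items. Any similarity function
# (rating-based or ontological/semantic) can be plugged in; the toy data here is illustrative.
import numpy as np

ratings = np.array([          # rows = users, cols = movies, 0 = unrated
    [5, 4, 0, 1],
    [4, 5, 1, 2],
    [1, 2, 5, 4],
])

def adjusted_cosine(i, j, R):
    """Rating-based similarity: cosine over user-mean-centered co-ratings of items i and j."""
    mask = (R[:, i] > 0) & (R[:, j] > 0)
    users = R[mask].astype(float)
    means = np.array([row[row > 0].mean() for row in users])
    a, b = users[:, i] - means, users[:, j] - means
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def predict(user, item, R, sim=adjusted_cosine):
    """Weighted sum: r_hat(u, i) = sum_j sim(i, j) * r(u, j) / sum_j sim(i, j) over rated items j."""
    rated = np.array([j for j in range(R.shape[1]) if j != item and R[user, j] > 0])
    sims = np.array([sim(item, j, R) for j in rated])
    keep = sims > 0                                # common practice: use only positively similar items
    return float(sims[keep] @ R[user, rated[keep]] / (sims[keep].sum() + 1e-9))

print(predict(0, 2, ratings))   # low prediction: movie 2 resembles a movie the user disliked
# An ontological semantic similarity would replace `adjusted_cosine` with a measure computed
# from movie metadata (genres, directors, ...) organized in a domain ontology.
```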


Author(s):  
Maxat Kulmanov ◽  
Fatima Zohra Smaili ◽  
Xin Gao ◽  
Robert Hoehndorf

Abstract Ontologies have long been employed in the life sciences to formally represent and reason over domain knowledge, and they are employed in almost every major biological database. Recently, ontologies have increasingly been used to provide background knowledge in similarity-based analysis and machine learning models. The methods employed to combine ontologies and machine learning are still novel and actively being developed. We provide an overview of the methods that use ontologies to compute similarity and incorporate them in machine learning methods; in particular, we outline how semantic similarity measures and ontology embeddings can exploit the background knowledge in ontologies and how ontologies can provide constraints that improve machine learning models. The methods and experiments we describe are available as a set of executable notebooks, and we also provide a set of slides and additional resources at https://github.com/bio-ontology-research-group/machine-learning-with-ontologies.
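
As a small, self-contained example of the first family of methods (semantic similarity computed over an ontology), the sketch below scores two annotated entities by the Jaccard overlap of the subsumer sets of their ontology classes; the miniature ontology and annotation sets are hypothetical, and measures such as Resnik or simGIC would follow the same pattern.

```python
# Toy example of ontology-based semantic similarity: entities annotated with ontology classes
# are compared via the Jaccard overlap of all subsumers (ancestors) of their annotations.
# The miniature ontology is hypothetical; real analyses would load e.g. the Gene Ontology.

is_a = {   # child -> parents in a tiny ontology DAG
    "binding": ["molecular_function"],
    "protein binding": ["binding"],
    "DNA binding": ["binding"],
    "kinase activity": ["catalytic activity"],
    "catalytic activity": ["molecular_function"],
    "molecular_function": [],
}

def subsumers(cls):
    """All ancestors of a class, including the class itself."""
    seen, stack = set(), [cls]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(is_a[c])
    return seen

def entity_similarity(annotations_a, annotations_b):
    """Jaccard similarity of the union of subsumers of each entity's annotation classes."""
    sa = set().union(*(subsumers(c) for c in annotations_a))
    sb = set().union(*(subsumers(c) for c in annotations_b))
    return len(sa & sb) / len(sa | sb)

print(entity_similarity({"protein binding"}, {"DNA binding"}))      # share the 'binding' lineage
print(entity_similarity({"protein binding"}, {"kinase activity"}))  # share only the root
```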


2019 ◽  
Vol 4 (1) ◽  
Author(s):  
Sinan G. Aksoy ◽  
Kathleen E. Nowak ◽  
Emilie Purvine ◽  
Stephen J. Young

Abstract Similarity measures are used extensively in machine learning and data science algorithms. The newly proposed graph Relative Hausdorff (RH) distance is a lightweight yet nuanced similarity measure for quantifying the closeness of two graphs. In this work we study the effectiveness of RH distance as a tool for detecting anomalies in time-evolving graph sequences. We apply RH to cyber data with given red team events, as well as to synthetically generated sequences of graphs with planted attacks. In our experiments, the performance of RH distance is at times comparable, and sometimes superior, to graph edit distance in detecting anomalous phenomena. Our results suggest that in appropriate contexts, RH distance has advantages over more computationally intensive similarity measures.
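
A simplified, discrete reading of the RH idea (how far, in relative terms, the complementary cumulative degree histogram of one graph must move to find a matching point in the other) is sketched below; it omits the smoothing used in the published definition, so it should be read as an approximation for illustration only.

```python
# Simplified sketch of a discrete Relative Hausdorff (RH) distance between two graphs, based on
# their complementary cumulative degree histograms (CCDHs). The smoothing of the published
# definition is omitted, so this is only an approximation for illustration.
from collections import Counter

def ccdh(degrees):
    """Points (k, number of vertices with degree >= k) for k = 1..max degree."""
    counts = Counter(degrees)
    max_d = max(degrees)
    return [(k, sum(c for d, c in counts.items() if d >= k)) for k in range(1, max_d + 1)]

def directed_rh(A, B):
    """For each point of A, the relative distance to its nearest point of B; take the worst case."""
    return max(min(max(abs(k - k2) / k, abs(n - n2) / n) for (k2, n2) in B) for (k, n) in A)

def relative_hausdorff(deg_a, deg_b):
    A, B = ccdh(deg_a), ccdh(deg_b)
    return max(directed_rh(A, B), directed_rh(B, A))

# Toy degree sequences: a "normal" snapshot vs. one with an injected high-degree (attack-like) vertex.
normal = [1, 2, 2, 3, 3, 3, 4]
anomalous = [1, 2, 2, 3, 3, 3, 12]
print(relative_hausdorff(normal, normal))      # 0.0
print(relative_hausdorff(normal, anomalous))   # noticeably larger
```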

