Evaluation of standard and semantically-augmented distance metrics for neurology patients

2020 ◽  
Author(s):  
Daniel B Hier ◽  
Jonathan Kopel ◽  
Steven U Brint ◽  
Donald C Wunsch II ◽  
Gayla R Olbricht ◽  
...  

Abstract Background: When patient distances are calculated based on phenotype, signs and symptoms are often converted to concepts from an ontological hierarchy. There is controversy as to whether patient distance metrics that consider the semantic similarity between concepts can outperform standard patient distance metrics that are agnostic to concept similarity. The choice of distance metric often dominates the performance of classification or clustering algorithms. Our objective was to determine whether semantically augmented distance metrics would outperform standard metrics on machine learning tasks. Methods: We converted the neurological signs and symptoms from 382 published neurology cases into sets of concepts with corresponding machine-readable codes. We calculated inter-patient distances by four different metrics (cosine distance, a semantically augmented cosine distance, Jaccard distance, and a semantically augmented bipartite distance). Semantic augmentation for two of the metrics depended on concept similarities from a hierarchical neuro-ontology. For the machine learning algorithms, we used the patient diagnosis as the ground-truth label and patient signs and symptoms as the features. We assessed classification accuracy for four classifiers and cluster quality for two clustering algorithms for each of the distance metrics. Results: Inter-patient distances were smaller when the distance metric was semantically augmented. Classification accuracy and cluster quality were not significantly different by distance metric. Conclusion: Using patient diagnoses as labels and patient signs and symptoms as features, we did not find improved classification accuracy or improved cluster quality with semantically augmented distance metrics. Semantic augmentation reduced inter-patient distances but did not improve machine learning performance. Further work is needed to assess the utility of semantically augmented patient distances.
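A minimal, self-contained sketch of the contrast the abstract describes: standard set-based distances versus a semantically augmented distance that credits partial matches between related concepts. The concept names and the similarity table below are invented for illustration and are not taken from the paper's neuro-ontology.

```python
# Two standard patient distances over sets of concept codes, plus a
# hedged stand-in for a semantically augmented (bipartite) distance.

def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over two concept sets."""
    return 1.0 - len(a & b) / len(a | b)

def cosine_distance(a, b):
    """Cosine distance over binary (one-hot) concept vectors:
    for sets this reduces to 1 - |A ∩ B| / sqrt(|A| * |B|)."""
    return 1.0 - len(a & b) / (len(a) * len(b)) ** 0.5

# Hypothetical pairwise concept similarity (in the paper this would
# come from the hierarchical neuro-ontology).
SIM = {frozenset(("myoclonus", "tremor")): 0.5}

def concept_sim(x, y):
    return 1.0 if x == y else SIM.get(frozenset((x, y)), 0.0)

def augmented_distance(a, b):
    # 1 minus the average best-match similarity in both directions,
    # so related (not just identical) concepts reduce the distance.
    best_a = sum(max(concept_sim(x, y) for y in b) for x in a) / len(a)
    best_b = sum(max(concept_sim(y, x) for x in a) for y in b) / len(b)
    return 1.0 - (best_a + best_b) / 2

p1 = {"headache", "myoclonus"}
p2 = {"headache", "tremor"}
print(jaccard_distance(p1, p2))    # 0.666...
print(augmented_distance(p1, p2))  # 0.25: augmentation shrinks the distance
```

The toy output mirrors the paper's first result: the augmented distance between the two patients is smaller than the concept-agnostic one.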


Clustering algorithms based on partitions are widely used in unsupervised data analysis. The K-means algorithm is one of the most efficient partition-based algorithms owing to its simplicity and low computational cost. The distance metric plays a noteworthy role in the efficiency of any clustering algorithm. In this work, K-means with three distance metrics, Hausdorff, Chebyshev, and cosine, is applied to three well-known real-world datasets from the UC Irvine Machine Learning Repository: thyroid, wine, and liver diagnosis. Classification performance is evaluated by validating the clustering output against the repository labels using the popular Adjusted Rand and Fowlkes-Mallows indices. The experimental results show that the algorithm with the Hausdorff distance metric outperforms the algorithms with the Chebyshev and cosine distance metrics.
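The setup above can be sketched as K-means with a pluggable distance metric. This is a minimal illustration with Chebyshev distance on invented 2-D points and fixed initial centers, not the UCI thyroid, wine, or liver data.

```python
# Minimal K-means with a pluggable distance metric (here Chebyshev,
# the L-infinity norm). Toy data; assumes fixed initial centers for
# determinism rather than random initialization.

def chebyshev(p, q):
    return max(abs(a - b) for a, b in zip(p, q))

def kmeans(points, centers, dist, iters=20):
    clusters = []
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center
        # under the supplied metric.
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[i].append(p)
        # Update step: recompute each center as the coordinate-wise mean.
        centers = [tuple(s / len(cl) for s in map(sum, zip(*cl))) if cl
                   else centers[i] for i, cl in enumerate(clusters)]
    return centers, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(pts, [pts[0], pts[3]], chebyshev)
print(sorted(len(c) for c in clusters))  # [3, 3]: the two blobs separate
```

Swapping `chebyshev` for another metric (cosine, or a set-to-set Hausdorff distance) changes only the assignment step, which is what makes the metric comparison in the abstract possible.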


Author(s):  
Amudha P. ◽  
Sivakumari S.

In recent years, the field of machine learning has grown very fast, both in the development of techniques and in its application to intrusion detection. The computational complexity of machine learning algorithms increases rapidly as the number of features in a dataset increases. By choosing the significant features, the number of features in the dataset can be reduced, which is critical to improving the classification accuracy and speed of the algorithms. Achieving a high accuracy and detection rate while lowering the false-alarm rate is also a major challenge in designing an intrusion detection system. The major motivation of this work is to address these issues by hybridizing machine learning and swarm intelligence algorithms to enhance the performance of intrusion detection systems. It also emphasizes applying principal component analysis as a feature selection technique on intrusion detection datasets to identify the most suitable feature subsets, which may provide high-quality results in a fast and efficient manner.
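The dimensionality-reduction step can be illustrated with a bare-bones PCA: find the first principal component of the feature matrix by power iteration on its covariance matrix, then project samples onto it. The toy rows below are invented, not an intrusion-detection dataset.

```python
# PCA-style reduction sketch: first principal component via power
# iteration, pure Python, no external libraries. Toy data only.

def first_principal_component(rows, iters=200):
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1)
            for j in range(d)] for i in range(d)]
    # Power iteration converges to the dominant eigenvector.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return means, v

# Feature 1 is twice feature 0; feature 2 is near-constant noise.
rows = [[1, 2, 0.1], [2, 4, 0.0], [3, 6, 0.1], [4, 8, 0.0]]
means, pc1 = first_principal_component(rows)
scores = [sum((r[j] - means[j]) * pc1[j] for j in range(3)) for r in rows]
print(pc1)  # direction close to (1, 2, 0) normalized
```

The projected `scores` are the reduced one-dimensional representation; keeping the top few components in this way is the idea behind using PCA before a classifier.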


Author(s):  
Maria Mohammad Yousef

Generally, medical dataset classification has become one of the biggest problems in data mining research. Every database has a given number of features, but some of these features can be redundant or even harmful and can disrupt the classification process; this is known as the high-dimensionality problem. Dimensionality reduction in data preprocessing is critical for increasing the performance of machine learning algorithms, and feature subset selection contributes to dimensionality reduction while giving a significant improvement in classification accuracy. In this paper, we propose a new hybrid feature selection approach (a GA assisted by kNN) to deal with issues of high dimensionality in biomedical data classification. The proposed method first combines a genetic algorithm (GA) with the k-Nearest Neighbor (kNN) method for feature selection to find the optimal subset of features, where the classification accuracy of kNN is used as the fitness function for the GA. After selecting the best subset of features, a Support Vector Machine (SVM) is used as the classifier. The proposed method was evaluated on five medical datasets from the UCI Machine Learning Repository. The suggested technique performs admirably on these datasets, achieving higher classification accuracy while using fewer features.
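The core of the GA+kNN wrapper is its fitness function: score a candidate feature subset by the kNN accuracy it yields. The sketch below shows that wrapper idea with leave-one-out 1-NN accuracy as the fitness; for brevity, exhaustive search over a tiny invented dataset stands in for the genetic algorithm's evolutionary search.

```python
# Wrapper-style feature selection sketch: fitness = leave-one-out
# 1-NN accuracy of a feature subset. Exhaustive search replaces the
# GA here; the data and the "informative feature 0" are made up.
from itertools import combinations

def loo_1nn_accuracy(X, y, feats):
    correct = 0
    for i in range(len(X)):
        # Nearest neighbor of sample i, excluding i itself,
        # using only the selected features.
        best = min((j for j in range(len(X)) if j != i),
                   key=lambda j: sum((X[i][f] - X[j][f]) ** 2 for f in feats))
        correct += y[best] == y[i]
    return correct / len(X)

# Feature 0 separates the classes; features 1 and 2 are noise.
X = [[0.0, 5, 1], [0.1, 9, 0], [0.2, 1, 7],
     [5.0, 5, 1], [5.1, 9, 0], [5.2, 1, 7]]
y = [0, 0, 0, 1, 1, 1]

best = max((c for r in range(1, 4) for c in combinations(range(3), r)),
           key=lambda c: loo_1nn_accuracy(X, y, c))
print(best)  # (0,): the informative feature alone suffices
```

In the paper's method, the GA would evolve bit-strings encoding such subsets, with this accuracy as each individual's fitness, and the winning subset would then be handed to an SVM.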


2021 ◽  
Vol 12 (1) ◽  
pp. 297
Author(s):  
Tamás Orosz ◽  
Renátó Vági ◽  
Gergely Márk Csányi ◽  
Dániel Nagy ◽  
István Üveges ◽  
...  

Many machine learning-based document processing applications have been published in recent years. Applying these methodologies can reduce the cost of labor-intensive tasks and induce changes in a company's structure. An artificial intelligence-based application can take over the work of trainees and free up the time of experts, which can increase innovation inside the company by letting them be involved in tasks with greater added value. However, the development cost of these methodologies can be high, and development is usually not a straightforward task. This paper presents the results of a survey in which a machine learning-based legal text labeler competed with multiple people with different levels of legal domain knowledge. The application used binary SVM-based classifiers to resolve the multi-label classification problem. The methods were encapsulated and deployed as a digital twin in a production environment. The results show that machine learning algorithms can be effectively utilized for monotonous but domain-knowledge- and attention-demanding tasks. They also suggest that embracing a machine learning-based solution can increase discoverability and enrich the value of data. The test confirmed that the accuracy of the machine learning-based system matches the long-term accuracy of legal experts, making it applicable for automating the working process.


2017 ◽  
Vol 7 (5) ◽  
pp. 2073-2082 ◽  
Author(s):  
A. G. Armaki ◽  
M. F. Fallah ◽  
M. Alborzi ◽  
A. Mohammadzadeh

Financial institutions are exposed to credit risk due to the issuance of consumer loans; thus, developing reliable credit scoring systems is crucial for them. Since machine learning techniques have demonstrated their applicability and merit, they have been used extensively in the credit scoring literature. Recent studies concentrating on hybrid models that merge various machine learning algorithms have revealed compelling results. There are two types of hybridization methods, namely traditional and ensemble methods. This study combines both and arrives at a hybrid meta-learner model. The structure of the model is based on the traditional hybrid model of 'classification + clustering', in which the stacking ensemble method is employed in the classification part. Moreover, this paper compares several versions of the proposed hybrid model using various combinations of classification and clustering algorithms, which helps identify which hybrid model achieves the best performance for credit scoring purposes. Using four real-life credit datasets, the experimental results show that the (KNN-NN-SVMPSO)-(DL)-(DBSCAN) model delivers the highest prediction accuracy and the lowest error rates.
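The stacking idea in the classification part can be sketched in a few lines: base classifiers emit predictions, and those predictions become the input features of a meta-learner. Here trivial threshold rules stand in for the KNN/NN/SVM-PSO base learners, and a fixed majority vote stands in for the trained deep-learning meta-learner; real stacking would fit the meta-learner on out-of-fold base predictions.

```python
# Stacking skeleton on invented data: base learners -> meta features
# -> meta-learner. All learners here are hand-written stand-ins.

def threshold_clf(feature, cut):
    """A trivial base 'classifier': 1 if x[feature] > cut else 0."""
    return lambda x: 1 if x[feature] > cut else 0

base = [threshold_clf(0, 0.5), threshold_clf(1, 0.5)]

def stack_features(X):
    # Each sample is re-represented by its base-learner predictions.
    return [[clf(x) for clf in base] for x in X]

def meta(preds):
    # Majority vote as a stand-in for a fitted meta-learner.
    return 1 if sum(preds) > len(preds) / 2 else 0

X = [[0.9, 0.8], [0.1, 0.9], [0.2, 0.1], [0.8, 0.2]]
print([meta(p) for p in stack_features(X)])  # [1, 0, 0, 0]
```

The paper's clustering stage (DBSCAN) would sit in front of this, partitioning the applicants so that a separate stacked classifier can be trained per cluster.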


2021 ◽  
Author(s):  
Floe Foxon

Ammonoid identification is crucial to biostratigraphy, systematic palaeontology, and evolutionary biology, but may prove difficult when shell features and sutures are poorly preserved. This necessitates novel approaches to ammonoid taxonomy. This study aimed to taxonomize ammonoids by their conch geometry using supervised and unsupervised machine learning algorithms. Ammonoid measurement data (conch diameter, whorl height, whorl width, and umbilical width) were taken from the Paleobiology Database (PBDB). 11 species with ≥50 specimens each were identified, providing N=781 unique specimens in total. Naive Bayes, Decision Tree, Random Forest, Gradient Boosting, K-Nearest Neighbours, and Support Vector Machine classifiers were applied to the PBDB data with a 5×5 nested cross-validation approach to obtain unbiased generalization performance estimates across a grid search of algorithm parameters. All supervised classifiers achieved ≥70% accuracy in identifying ammonoid species, with Naive Bayes demonstrating the least over-fitting. The unsupervised clustering algorithms K-Means, DBSCAN, OPTICS, Mean Shift, and Affinity Propagation achieved Normalized Mutual Information scores of ≥0.6, with the centroid-based methods having the most success. This presents a reasonably accurate proof-of-concept approach to ammonoid classification which may assist identification in cases where more traditional methods are not feasible.
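The nested cross-validation pattern the study relies on can be shown in miniature: the inner loop selects a hyperparameter using only the training folds, and the outer loop scores that choice on held-out data. This sketch uses 3×3 folds instead of 5×5 and a one-feature threshold rule instead of the actual classifiers; the data are invented, not PBDB conch measurements.

```python
# Nested cross-validation skeleton, pure Python, toy data.

def folds(indices, k):
    """Split an index list into k interleaved folds."""
    return [indices[i::k] for i in range(k)]

def accuracy(cut, X, y, idx):
    return sum((X[i] > cut) == y[i] for i in idx) / len(idx)

X = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 0.15, 0.85, 0.25, 0.75]
y = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1]
cuts = [0.5, 0.3, 0.7]  # candidate hyperparameter values

outer_scores = []
for test in folds(list(range(len(X))), 3):
    train = [i for i in range(len(X)) if i not in test]
    # Inner CV: pick the cut that scores best across inner folds
    # of the training indices only. The test fold never leaks in.
    best_cut = max(cuts, key=lambda c: sum(accuracy(c, X, y, f)
                                           for f in folds(train, 3)))
    outer_scores.append(accuracy(best_cut, X, y, test))
print(sum(outer_scores) / len(outer_scores))
```

Because hyperparameter selection happens entirely inside each outer training set, the averaged outer score is an unbiased estimate of generalization performance, which is the point of the 5×5 scheme in the abstract.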


2021 ◽  
Vol 20 ◽  
pp. 197-204
Author(s):  
Karina Litwynenko ◽  
Małgorzata Plechawska-Wójcik

Reinforcement learning algorithms are gaining popularity, and their advancement is made possible by the presence of tools to evaluate them. This paper concerns the applicability of machine learning algorithms on the Unity platform using the Unity ML-Agents Toolkit library. The purpose of the study was to compare two algorithms: Proximal Policy Optimization and Soft Actor-Critic. The possibility of improving the learning results by combining these algorithms with Generative Adversarial Imitation Learning was also verified. The results of the study showed that the PPO algorithm can perform better in uncomplicated environments with non-immediate rewards, while the additional use of GAIL can improve learning performance.


Author(s):  
Mohsin Iqbal ◽  
Saif Ur Rehman ◽  
Saira Gillani ◽  
Sohail Asghar

The key objective of this chapter is to study classification accuracy when using feature selection with machine learning algorithms. Feature selection reduces the dimensionality of the data and improves the accuracy of the learning algorithm. We test how integrated feature selection affects the accuracy of three classifiers by applying several feature selection methods. The filter results show that Information Gain (IG), Gain Ratio (GR), and Relief-F, and the wrapper results show that Bagging and Naive Bayes (NB), enabled the classifiers to achieve the highest average increase in classification accuracy while reducing the number of unnecessary attributes. These conclusions can advise machine learning users on which classifier and feature selection methods to use to optimize classification accuracy, which can be important especially in risk-sensitive applications of machine learning, where one aim is to reduce the costs of collecting, processing, and storing unnecessary data.
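Of the filter methods named above, Information Gain is the simplest to state: IG(class; feature) = H(class) - H(class | feature). A small sketch with invented categorical data:

```python
# Information Gain of a categorical feature with respect to the
# class label. Labels and feature values below are made up.
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    n = len(labels)
    cond = 0.0
    # H(class | feature): entropy of the labels within each
    # feature-value group, weighted by group size.
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

labels = ["spam", "spam", "ham", "ham"]
feat_a = ["x", "x", "y", "y"]   # perfectly predictive of the label
feat_b = ["x", "y", "x", "y"]   # carries no label information
print(info_gain(feat_a, labels))  # 1.0
print(info_gain(feat_b, labels))  # 0.0
```

Ranking features by this score and keeping the top ones is the filter approach; the wrapper methods in the chapter instead score feature subsets by the accuracy of a classifier trained on them.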

