Algorithm for Automated Generation of a Training Sample for Solving the Problem of Determining Semantic Similarity between a Pair of Keywords using Machine Learning Methods

2021 ◽  
Vol 12 (6) ◽  
pp. 283-294
Author(s):  
K. V. Lunev ◽  

Machine learning is currently an effective approach to solving many problems in information-analytical systems. Such approaches require a training set of examples, and collecting one is usually time-consuming: it requires several experts in the subject area for which the training set is being assembled. Moreover, for some tasks, including determining the semantic similarity of keyword pairs, it is difficult even to draw up instructions that would let experts evaluate the examples consistently, because semantic similarity is subjective and depends strongly on the domain, context, person, and task. The article presents the results of research into models, algorithms, and software tools for the automated construction of training-sample objects for the problem of determining the semantic similarity of a pair of keywords. Moreover, models built on such an automatically generated training sample can solve not only the semantic-similarity problem but also an arbitrary edge-classification problem on a graph. The methods used in this paper are based on graph-theoretic algorithms.
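
The abstract does not specify which graph algorithms are used; as a hedged illustration of how graph structure can stand in for expert labels, the sketch below scores a keyword pair by the Jaccard similarity of their neighbourhoods in a co-occurrence graph and thresholds the score to produce automatic positive/negative training examples. The threshold and the toy keywords are assumptions, not the paper's method.

```python
# Hedged sketch: automatic labelling of keyword pairs from a co-occurrence graph.
# The paper's exact algorithm is not disclosed; Jaccard similarity of node
# neighbourhoods and the 0.3 threshold are illustrative assumptions.
import itertools
import networkx as nx

def jaccard_score(graph: nx.Graph, u: str, v: str) -> float:
    """Similarity of two keywords as the overlap of their neighbour sets."""
    nu, nv = set(graph[u]), set(graph[v])
    union = nu | nv
    return len(nu & nv) / len(union) if union else 0.0

def build_training_pairs(graph: nx.Graph, threshold: float = 0.3):
    """Label every keyword pair as similar (1) or dissimilar (0) automatically."""
    samples = []
    for u, v in itertools.combinations(graph.nodes, 2):
        score = jaccard_score(graph, u, v)
        samples.append((u, v, 1 if score >= threshold else 0))
    return samples

if __name__ == "__main__":
    g = nx.Graph()
    g.add_edges_from([("neural network", "deep learning"),
                      ("neural network", "classification"),
                      ("deep learning", "classification"),
                      ("graph theory", "shortest path")])
    for u, v, label in build_training_pairs(g):
        print(f"{u!r} ~ {v!r}: {label}")
```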

2018 ◽  
Vol 12 (2) ◽  
pp. 66-71
Author(s):  
A. V. Zolotaryuk ◽  
I. A. Chechneva

The authors consider problems associated with the activities of microfinance institutions (MFIs) and ways to address them. The subject of the study is the need to introduce machine learning to solve pressing problems in this sector. Machine learning methods are increasingly being applied to the analysis of financial and economic information, which reduces or eliminates some of these difficulties. Although such methods are not yet widely used by MFIs, there are clear opportunities for their application. The aim of the work is to determine the prospects for using these methods in MFIs. The article describes the subject area of the research, identifies the main groups of problems faced by MFIs, considers the possibility of introducing machine learning for data analysis in this area, and outlines the main directions in which machine learning could be applied. The authors conclude that such methods are applicable for assessing the performance of MFIs.


Author(s):  
Sook-Ling Chua ◽  
Stephen Marsland ◽  
Hans W. Guesgen

The problem of behaviour recognition based on data from sensors is essentially an inverse problem: given a set of sensor observations, identify the sequence of behaviours that gave rise to them. In a smart home, the behaviours are likely to be the standard human behaviours of daily living, and the observations will depend upon the sensors the house is equipped with. There are two main approaches to identifying behaviours from the sensor stream. One is a symbolic approach, which explicitly models the recognition process. The other is a sub-symbolic approach based on data mining and machine learning methods, which is the focus of this chapter. While there have been many machine learning methods for identifying behaviours from the sensor stream, they have generally relied upon a labelled dataset, where a person has manually identified their behaviour at each point in time. This is tedious, resulting in relatively small datasets, and is also prone to significant errors, as people do not pinpoint the end of one behaviour and the start of the next accurately. In this chapter, the authors consider methods for dealing with unlabelled sensor data for behaviour recognition and investigate their use. They then consider whether such methods are best used in isolation or as preprocessing to provide a training set for a supervised method.
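
The chapter's specific algorithms are not named in this summary; as a hedged sketch of the "unsupervised preprocessing for a supervised method" idea, the following clusters unlabelled sensor feature vectors, uses the cluster assignments as pseudo-labels, and trains a classifier on them. KMeans and a random forest are assumptions, not the authors' choices.

```python
# Hedged sketch: unsupervised clustering as preprocessing that yields a training
# set for a supervised behaviour classifier. The chapter does not prescribe these
# particular models; KMeans and RandomForestClassifier are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Stand-in for windowed sensor features (e.g. counts of sensor firings per window).
unlabelled_windows = rng.random((500, 12))

# Step 1: cluster the unlabelled windows; each cluster is a candidate behaviour.
clusterer = KMeans(n_clusters=5, n_init=10, random_state=0)
pseudo_labels = clusterer.fit_predict(unlabelled_windows)

# Step 2: treat the cluster assignments as labels and train a supervised model.
classifier = RandomForestClassifier(n_estimators=100, random_state=0)
classifier.fit(unlabelled_windows, pseudo_labels)

# The supervised model can now label new sensor windows online.
new_window = rng.random((1, 12))
print("Predicted behaviour cluster:", classifier.predict(new_window)[0])
```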


Geosciences ◽  
2019 ◽  
Vol 9 (7) ◽  
pp. 308 ◽  
Author(s):  
Valeri G. Gitis ◽  
Alexander B. Derendyaev

In this paper, we suggest two machine learning methods for seismic hazard forecasting. The first method is used for spatial forecasting of the maximum possible earthquake magnitude (Mmax), whereas the second is used for spatio-temporal forecasting of strong earthquakes. The first method, the method of approximation of interval expert estimates, is based on a regression approach in which the values of Mmax at the points of the training sample are estimated by experts. The method allows one to formalize the knowledge of experts, to find the dependence of Mmax on the properties of the geological environment, and to construct a map of the spatial forecast. The second method, the method of the minimum area of alarm, uses retrospective data to identify the alarm area in which the epicenters of strong (target) earthquakes are expected within a certain time interval. This method is the basis of an automatic web-based platform that systematically forecasts target earthquakes. The results of testing the approach to earthquake prediction in the Mediterranean and Californian regions are presented. For the tests, well-known parameters of earthquake catalogs were used. The method showed a satisfactory forecast quality.
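
The paper's actual formulation of the interval-expert-estimate regression is not given in this abstract; a hedged, minimal sketch of the general idea (fitting a regressor to the midpoints of expert-provided Mmax intervals over geological features, then predicting over a spatial grid) might look like the following, where the feature set and regressor are assumptions.

```python
# Hedged sketch: regression on expert interval estimates of Mmax.
# The paper's real model is not reproduced here; collapsing intervals to their
# midpoints and using a random forest are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
# Stand-in geological features at training points (e.g. fault density, heat flow).
X = rng.random((200, 4))
# Expert estimates of Mmax given as intervals [low, high] at each training point.
low = 5.0 + 2.0 * X[:, 0] + 0.2 * rng.standard_normal(200)
high = low + 0.5
y_mid = (low + high) / 2.0  # collapse intervals to midpoints for a point regressor

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y_mid)

# Spatial forecast: predict Mmax over a grid of new points to build the map.
grid = rng.random((10, 4))
print(np.round(model.predict(grid), 2))
```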


Author(s):  
Arman Ahvaev ◽  
Valeriy Fedorovich Shurshev

The article addresses the forecasting problem in systems for which a traditional algorithmic description is hard to construct, so the solution is reduced to machine learning technology. In the context of predicting emergencies in heat supply systems, this technology is the most effective. The forecast is reduced to the problem of function reconstruction in the general setting of supervised learning. Of the available machine learning tools, gradient boosting should be used. It works according to the following principle: the first iterations use weak learners, and the ensemble is then grown by gradual improvement on those regions of the data where the previous models performed poorly. When the next simple model is constructed, it is fitted not simply to reweighted observations but so as to best approximate the overall gradient of the objective function. Gradient boosting is one of the most effective forecasting algorithms, and the accuracy of the forecast depends on the quality of the input data (the training sample). The subject area under study, emergency situations in heating networks, has sufficient accumulated data to use boosting as the main forecasting tool.
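
As a hedged sketch of gradient boosting applied to such a task (the article's actual dataset and features are not available here), the following fits a gradient-boosted classifier to synthetic heating-network records; the feature names and the use of scikit-learn are assumptions.

```python
# Hedged sketch: gradient boosting for predicting heating-network emergencies.
# The article's data and features are not given; the synthetic features and
# scikit-learn's GradientBoostingClassifier are illustrative assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 2000
# Assumed features: pipe age (years), outside temperature (deg C), pressure drop (bar).
X = np.column_stack([rng.uniform(0, 50, n),
                     rng.uniform(-30, 10, n),
                     rng.uniform(0, 2, n)])
# Synthetic emergency label: older pipes in colder weather fail more often.
risk = 0.03 * X[:, 0] - 0.05 * X[:, 1] + 0.5 * X[:, 2]
y = (risk + rng.standard_normal(n) > 2.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Each boosting stage fits a shallow tree to the negative gradient of the loss,
# which is the "approximate the overall gradient" idea described above.
model = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05,
                                   max_depth=3, random_state=0)
model.fit(X_tr, y_tr)
print("ROC AUC:", round(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]), 3))
```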


Author(s):  
Georgia Papacharalampous ◽  
Hristos Tyralis ◽  
Demetris Koutsoyiannis

Research within the field of hydrology often focuses on comparing stochastic to machine learning (ML) forecasting methods. The comparisons performed are all based on case studies, while an extensive study aiming to provide generalized results on the subject is missing. Herein, we compare 11 stochastic and 9 ML methods regarding their multi-step ahead forecasting properties by conducting 12 large-scale computational experiments based on simulations. Each of these experiments uses 2 000 time series generated by linear stationary stochastic processes. We conduct each simulation experiment twice; the first time using time series of 100 values and the second time using time series of 300 values. Additionally, we conduct a real-world experiment using 405 mean annual river discharge time series of 100 values. We quantify the performance of the methods using 18 metrics. The results indicate that stochastic and ML methods perform equally well.
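
The specific processes and methods compared are not listed in this abstract; a hedged, minimal sketch of one such simulation experiment (generating a series from a linear stationary stochastic process, then comparing a stochastic AR(1) forecast with a simple ML regressor on a multi-step horizon) could look like this. The AR(1) parameters, lag count, and choice of random forest are assumptions.

```python
# Hedged sketch of a single simulation experiment in the spirit of the study:
# a series from a linear stationary stochastic process, multi-step forecasts from
# a stochastic baseline vs. a simple ML model. All parameters are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)

def ar1_series(n, phi=0.7):
    """Generate a stationary AR(1) time series."""
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.standard_normal()
    return x

series = ar1_series(100)
horizon = 6
train = series[:-horizon]  # last 6 values held out as the forecast target

# Stochastic baseline: analytic AR(1) multi-step forecast from the last value.
phi_hat = np.corrcoef(train[:-1], train[1:])[0, 1]
stochastic_fc = train[-1] * phi_hat ** np.arange(1, horizon + 1)

# ML method: regress x[t] on the previous 3 values, then forecast recursively.
lags = 3
X = np.column_stack([train[i:len(train) - lags + i] for i in range(lags)])
y = train[lags:]
ml = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
window, ml_fc = list(train[-lags:]), []
for _ in range(horizon):
    nxt = ml.predict(np.array(window[-lags:]).reshape(1, -1))[0]
    ml_fc.append(nxt)
    window.append(nxt)

truth = series[-horizon:]
print("RMSE stochastic:", np.sqrt(np.mean((stochastic_fc - truth) ** 2)))
print("RMSE ML        :", np.sqrt(np.mean((np.array(ml_fc) - truth) ** 2)))
```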


Author(s):  
Maxat Kulmanov ◽  
Fatima Zohra Smaili ◽  
Xin Gao ◽  
Robert Hoehndorf

Ontologies have long been employed in the life sciences to formally represent and reason over domain knowledge, and they are used in almost every major biological database. Recently, ontologies have increasingly been used to provide background knowledge in similarity-based analysis and machine learning models. The methods for combining ontologies and machine learning are still novel and under active development. We provide an overview of the methods that use ontologies to compute similarity and incorporate them in machine learning methods; in particular, we outline how semantic similarity measures and ontology embeddings can exploit the background knowledge in ontologies, and how ontologies can provide constraints that improve machine learning models. The methods and experiments we describe are available as a set of executable notebooks, and we also provide a set of slides and additional resources at https://github.com/bio-ontology-research-group/machine-learning-with-ontologies.
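
As a hedged illustration (not code from the cited repository), one of the simplest ontology-based semantic similarity measures such overviews cover is ancestor-set overlap; the sketch below computes a Jaccard similarity over ontology ancestors for two terms in a toy is-a hierarchy. The toy terms and the measure chosen are assumptions.

```python
# Hedged sketch: ontology-based semantic similarity as Jaccard overlap of
# ancestor sets over a toy is-a hierarchy. Illustrative only; not taken from
# the authors' notebooks.
from typing import Dict, Set

# Toy ontology: child -> set of direct parents (assumed example terms).
parents: Dict[str, Set[str]] = {
    "apoptosis": {"programmed cell death"},
    "necroptosis": {"programmed cell death"},
    "programmed cell death": {"cell death"},
    "cell death": {"biological process"},
    "biological process": set(),
}

def ancestors(term: str) -> Set[str]:
    """All ancestors of a term, including the term itself."""
    result = {term}
    stack = [term]
    while stack:
        for p in parents.get(stack.pop(), set()):
            if p not in result:
                result.add(p)
                stack.append(p)
    return result

def semantic_similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two terms' ancestor sets."""
    sa, sb = ancestors(a), ancestors(b)
    return len(sa & sb) / len(sa | sb)

print(semantic_similarity("apoptosis", "necroptosis"))  # shares most ancestors
print(semantic_similarity("apoptosis", "cell death"))
```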


Author(s):  
Vladimir Viktorovich Pekunov

The author considers the problem of automatically synthesising (inducing) rules that transform a natural-language formulation of a problem into a semantic model of that problem, from which a program solving the problem can then be generated. The problem is considered in relation to PGEN++, a system for the generation, recognition, and transformation of programs. Based on an analysis of the literature, a combined approach was chosen: the rules that transform the natural-language formulation into a semantic model are generated automatically, while the specifications of the generating classes and the rules for generating a program from the model are written manually by a specialist in the specific subject area. Within the framework of object-event models, a mechanism is proposed, for the first time, for automatically generating recognising scripts and related entities (CSV tables, XPath functions). Generation is based on an analysis of a training sample consisting of sentences describing objects in the subject area together with instances of those objects. The analysis searches for unique keywords and characteristic grammatical relations and then applies simple schemes of eliminative induction. A mechanism is also proposed for automatically generating rules that complete primary recognised models into fully meaningful ones; this generation analyses relations between the objects of the training sample, taking into account information from the class specifications of the subject area. The proposed schemes were tested on the subject area "Simple vector data processing"; natural-language statements (both included in the training set and modified) were successfully transformed into semantic models, with subsequent generation of programs solving the assigned tasks.
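
PGEN++ itself is not documented here; as a hedged illustration of the "unique keyword" induction idea only (finding words that occur in sentences of a single object class and using them as recognition rules), here is a minimal sketch with assumed example sentences. It does not reproduce the grammatical-relation analysis or the object-event model machinery.

```python
# Hedged sketch: inducing simple keyword-based recognition rules from a training
# sample of sentences labelled with the object class they describe. Not PGEN++
# code; the sentences and class names are assumptions.
from collections import defaultdict

# Assumed training sample: (sentence, object class it describes).
training = [
    ("compute the sum of the vector elements", "vector_sum"),
    ("find the sum of all elements of the array", "vector_sum"),
    ("sort the vector in ascending order", "vector_sort"),
    ("sort the array elements ascending", "vector_sort"),
]

# Words that occur in only one class become recognition keywords for that class.
word_classes = defaultdict(set)
for sentence, cls in training:
    for word in sentence.lower().split():
        word_classes[word].add(cls)

rules = defaultdict(set)
for word, classes in word_classes.items():
    if len(classes) == 1:
        rules[next(iter(classes))].add(word)

def recognise(sentence: str) -> str:
    """Pick the class whose induced keywords best match the sentence."""
    words = set(sentence.lower().split())
    return max(rules, key=lambda cls: len(rules[cls] & words))

print(dict(rules))
print(recognise("please sort this vector ascending"))  # expected: vector_sort
```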


2021 ◽  
Vol 27 (3) ◽  
pp. 189-199
Author(s):  
Ilias Tougui ◽  
Abdelilah Jilbab ◽  
Jamal El Mhamdi

Objectives: With advances in data availability and computing capabilities, artificial intelligence and machine learning technologies have evolved rapidly in recent years. Researchers have taken advantage of these developments in healthcare informatics and created reliable tools to predict or classify diseases using machine learning-based algorithms. To correctly quantify the performance of those algorithms, the standard approach is to use cross-validation, where the algorithm is trained on a training set, and its performance is measured on a validation set. Both datasets should be subject-independent to simulate the expected behavior of a clinical study. This study compares two cross-validation strategies, the subject-wise and the record-wise techniques; the subject-wise strategy correctly mimics the process of a clinical study, while the record-wise strategy does not. Methods: We started by creating a dataset of smartphone audio recordings of subjects diagnosed with and without Parkinson's disease. This dataset was then divided into training and holdout sets using subject-wise and record-wise divisions. The training set was used to measure the performance of two classifiers (support vector machine and random forest) to compare six cross-validation techniques that simulated either the subject-wise process or the record-wise process. The holdout set was used to calculate the true error of the classifiers. Results: The record-wise division and the record-wise cross-validation techniques overestimated the performance of the classifiers and underestimated the classification error. Conclusions: In a diagnostic scenario, the subject-wise technique is the proper way of estimating a model's performance, and record-wise techniques should be avoided.
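
As a hedged sketch (not the study's code or data), the contrast between record-wise and subject-wise splitting can be expressed with scikit-learn's KFold versus GroupKFold, where the groups are subject IDs; the synthetic features below stand in for the audio recordings and deliberately carry a per-subject signature, so the record-wise estimate comes out optimistic.

```python
# Hedged sketch: record-wise vs. subject-wise cross-validation.
# Record-wise KFold lets recordings from one subject appear in both training and
# validation folds; subject-wise GroupKFold keeps each subject in one fold only.
# The data below is synthetic, not the study's Parkinson's dataset.
import numpy as np
from sklearn.model_selection import KFold, GroupKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
n_subjects, recordings_per_subject, n_features = 30, 10, 20

subjects = np.repeat(np.arange(n_subjects), recordings_per_subject)
y = np.repeat(rng.integers(0, 2, n_subjects), recordings_per_subject)
# Subject-specific signature plus noise: makes record-wise splits look too good.
subject_signature = rng.standard_normal((n_subjects, n_features))
X = subject_signature[subjects] + 0.5 * rng.standard_normal((len(subjects), n_features))

clf = RandomForestClassifier(n_estimators=100, random_state=0)

record_wise = cross_val_score(clf, X, y, cv=KFold(5, shuffle=True, random_state=0))
subject_wise = cross_val_score(clf, X, y, cv=GroupKFold(5), groups=subjects)

print("Record-wise accuracy :", record_wise.mean().round(3))   # optimistic
print("Subject-wise accuracy:", subject_wise.mean().round(3))  # realistic
```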


2021 ◽  
Vol 70 (11) ◽  
Author(s):  
Wenjia Liu ◽  
Nanjiao Ying ◽  
Qiusi Mo ◽  
Shanshan Li ◽  
Mengjie Shao ◽  
...  

Introduction. Klebsiella pneumoniae, a gram-negative bacterium, is a common pathogen causing nosocomial infection. The drug-resistance rate of K. pneumoniae is increasing year by year, posing a severe threat to public health worldwide, and the species has been listed as one of the pathogens driving the global crisis of antimicrobial resistance in nosocomial infections. The drug resistance of K. pneumoniae therefore needs to be explored for clinical diagnosis. Single nucleotide polymorphisms (SNPs) occur at high density in whole-genome sequencing (WGS) data, carry rich genetic information, and can affect the structure or expression of proteins; they can be used to explore mutation sites associated with bacterial resistance. Hypothesis/Gap Statement. Machine learning methods can detect genetic features associated with the drug resistance of K. pneumoniae from whole-genome SNP data. Aims. This work used the Fast Feature Selection (FFS) and Codon Mutation Detection (CMD) machine learning methods to detect genetic features related to the drug resistance of K. pneumoniae from whole-genome SNP data. Methods. WGS data on the resistance of K. pneumoniae strains to four antibiotics (tetracycline, gentamicin, imipenem, amikacin) were downloaded from the European Nucleotide Archive (ENA). Sequence alignments were performed with MUMmer 3 to complete SNP calling, using the K. pneumoniae HS11286 chromosome as the reference genome. The FFS algorithm was applied to feature selection on the SNP dataset, and the training set was constructed from mutation sites with a mutation frequency >0.995. Based on the original SNP training set, 70% of SNPs were randomly selected from each dataset as the test set to verify the accuracy of the training results. Finally, the resistance genes were obtained with the CMD algorithm and Venny. Results. The numbers of strains resistant to tetracycline, gentamicin, imipenem and amikacin were 931, 1048, 789 and 203, respectively. The machine learning algorithms applied to the SNP training and test sets predicted 28 and 23 resistance genes, respectively; the 28 genes from the training set included 22 of the genes from the test set, which supports the accuracy of the gene prediction. Some of these genes (KPHS_35310, KPHS_18220, KPHS_35880, etc.) correspond to known resistance genes (Eef2, lpxK, MdtC, etc.). Logistic regression classifiers were built from the SNPs identified in the training set; the areas under the curve (AUCs) for the four antibiotics were 0.939, 0.950, 0.912 and 0.935, showing a strong ability to predict bacterial resistance. Conclusion. Machine learning methods can effectively predict resistance genes and their associated SNPs. The FFS and CMD algorithms are widely applicable and can be used for drug-resistance analysis of any microorganism with genomic variation and phenotypic data. This work lays a foundation for resistance research in clinical applications.
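
The FFS and CMD algorithms are not described in detail in this abstract; as a hedged sketch of the final evaluation step only (a logistic-regression classifier over selected SNP features scored by AUC), the following uses a synthetic binary SNP matrix. The matrix dimensions and the number of informative sites are assumptions.

```python
# Hedged sketch: logistic regression on selected SNP features to predict
# antibiotic resistance, evaluated by AUC. Synthetic data stands in for the
# K. pneumoniae SNP matrix; this does not reproduce the FFS/CMD algorithms.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(11)
n_strains, n_snps, n_informative = 1000, 300, 25

# Binary SNP presence/absence matrix per strain (assumed encoding).
X = rng.integers(0, 2, size=(n_strains, n_snps))
# Resistance phenotype driven by a small subset of "resistance-associated" SNPs.
weights = np.zeros(n_snps)
weights[:n_informative] = rng.normal(1.5, 0.3, n_informative)
logits = X @ weights - weights.sum() / 2
y = (rng.random(n_strains) < 1 / (1 + np.exp(-logits))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print("AUC:", round(auc, 3))
```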

