Distribution-Based Similarity Measures Applied to Laboratory Results Matching

2021 ◽  
Author(s):  
Martin Courtois ◽  
Alexandre Filiot ◽  
Gregoire Ficheur

The use of international laboratory terminologies inside hospital information systems is required to conduct data reuse analyses through inter-hospital databases. While most terminology matching techniques that achieve semantic interoperability are language-based, another strategy is distribution matching, which matches terms based on the statistical similarity of their distributions. In this work, our objective is to design and assess a structured framework to perform distribution matching on concepts described by continuous variables. We propose a framework that combines distribution matching and machine learning techniques. Using a training sample consisting of correct and incorrect correspondences between different terminologies, a match probability score is built. For each term, the best candidates are returned and sorted in decreasing order of the probability given by the model. Searching 101 terms from Lille University Hospital among the same list of concepts in MIMIC-III, the model returned the correct match in the top 5 candidates for 96 of them (95%). Using this open-source framework with a top-k suggestions system could make the expert validation of terminology alignments easier.
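To make the idea of distribution matching concrete, here is a minimal sketch, not the authors' framework. It assumes a single similarity measure (the two-sample Kolmogorov-Smirnov statistic) in place of the learned match probability described in the abstract, and it ranks candidates by that distance alone. The function names and the toy laboratory values are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the two samples (0 = identical, 1 = disjoint)."""
    a, b = sorted(sample_a), sorted(sample_b)
    # The gap between two step functions can only change at observed values.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def rank_candidates(source_values, candidates, top_k=5):
    """Return the top_k candidate terms whose value distribution is closest
    to the source term (smallest KS distance first).
    Assumption: raw distance stands in for a trained match probability."""
    scored = [
        (name, ks_statistic(source_values, values))
        for name, values in candidates.items()
    ]
    scored.sort(key=lambda pair: pair[1])
    return scored[:top_k]


# Invented toy values, for illustration only (not data from the study).
local_term = [138, 140, 142, 139, 141]
target_terms = {
    "sodium": [137, 139, 141, 140, 143],
    "potassium": [3.9, 4.1, 4.5, 4.0, 4.2],
}
ranking = rank_candidates(local_term, target_terms)
```

In a framework like the one described, several such similarity measures would presumably become features of a classifier trained on known correct and incorrect correspondences, and the classifier's probability would replace the raw distance used for sorting here.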

2016 ◽  
Author(s):  
Philippe Desjardins-Proulx ◽  
Idaline Laigle ◽  
Timothée Poisot ◽  
Dominique Gravel

Species interactions are a key component of ecosystems, but we generally have an incomplete picture of who-eats-who in a given community. Different techniques have been devised to predict species interactions using theoretical models or abundances. Here, we explore the K nearest neighbour approach, with a special emphasis on recommendation, along with other machine learning techniques. Recommenders are algorithms developed for companies like Netflix to predict if a customer would like a product given the preferences of similar customers. These machine learning techniques are well-suited to study binary ecological interactions since they focus on positive-only data. We also explore how the K nearest neighbour approach can be used with both positive and negative information, in which case the goal of the algorithm is to fill missing entries from a matrix (imputation). By removing a prey from a predator, we find that recommenders can guess the missing prey around 50% of the time on the first try, with up to 881 possibilities. Traits do not significantly improve the results for the K nearest neighbour, although a simple test with a supervised learning approach (random forests) shows we can predict interactions with high accuracy using only three traits per species. This result shows that binary interactions can be predicted without regard to the ecological community given only three variables: body mass and two variables for the species’ phylogeny. These techniques are complementary, as recommenders can predict interactions in the absence of traits, using only information about other species’ interactions, while supervised learning algorithms such as random forests base their predictions on traits only but do not exploit other species’ interactions. Further work should focus on developing custom similarity measures specialized to ecology to improve the KNN algorithms and using richer data to capture indirect relationships between species.
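The recommendation idea can be sketched on positive-only data as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: Jaccard similarity between diets is an assumed similarity measure (the abstract does not name one), and the species and diets are invented.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity between two sets of prey (assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_prey(target, diets, k=2):
    """Suggest prey for `target` from the diets of its k most similar
    predators, ordered by how many of those neighbours eat each prey."""
    target_diet = diets[target]
    neighbours = sorted(
        (species for species in diets if species != target),
        key=lambda species: jaccard(target_diet, diets[species]),
        reverse=True,
    )[:k]
    # Only prey the target is not already known to eat can be recommended.
    votes = Counter(
        prey for species in neighbours for prey in diets[species] - target_diet
    )
    return [prey for prey, _ in votes.most_common()]


# Invented toy food web; "vole" has been removed from the fox's diet.
diets = {
    "fox": {"rabbit", "mouse"},
    "owl": {"mouse", "vole", "rabbit"},
    "hawk": {"rabbit", "vole", "mouse", "sparrow"},
    "heron": {"frog", "fish"},
}
suggestions = recommend_prey("fox", diets, k=2)
```

Here the two nearest neighbours of the fox are the owl and the hawk, both of which eat voles, so the removed prey comes back as the first suggestion. This mirrors the remove-one-prey evaluation described in the abstract.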


2020 ◽  
Vol 29 (4) ◽  
pp. e70-e80
Author(s):  
Mireia Ladios-Martin ◽  
José Fernández-de-Maya ◽  
Francisco-Javier Ballesta-López ◽  
Adrián Belso-Garzas ◽  
Manuel Mas-Asencio ◽  
...  

Background Pressure injuries are an important problem in hospital care. Detecting the population at risk for pressure injuries is the first step in any preventive strategy. Available tools such as the Norton and Braden scales do not take into account all of the relevant risk factors. Data mining and machine learning techniques have the potential to overcome this limitation. Objectives To build a model to detect pressure injury risk in intensive care unit patients and to put the model into production in a real environment. Methods The sample comprised adult patients admitted to an intensive care unit (N = 6694) at University Hospital of Torrevieja and University Hospital of Vinalopó. A retrospective design was used to train (n = 2508) and test (n = 1769) the model and then a prospective design was used to test the model in a real environment (n = 2417). Data mining was used to extract variables from electronic medical records and a predictive model was built with machine learning techniques. The sensitivity, specificity, area under the curve, and accuracy of the model were evaluated. Results The final model used logistic regression and incorporated 23 variables. The model had sensitivity of 0.90, specificity of 0.74, and area under the curve of 0.89 during the initial test, and thus it outperformed the Norton scale. The model performed well 1 year later in a real environment. Conclusions The model effectively predicts risk of pressure injury. This allows nurses to focus on patients at high risk for pressure injury without increasing workload.
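For readers less familiar with the evaluation metrics named above, the sketch below shows how sensitivity, specificity, and accuracy follow from confusion-matrix counts. The counts are invented for illustration and are not taken from the study. Area under the curve is omitted because it requires the model's scores, not only thresholded predictions.

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute standard binary-classifier metrics from confusion counts.

    tp: at-risk patients flagged by the model (true positives)
    fn: at-risk patients the model missed (false negatives)
    tn: not-at-risk patients correctly left unflagged (true negatives)
    fp: not-at-risk patients flagged anyway (false positives)
    """
    return {
        "sensitivity": tp / (tp + fn),  # share of at-risk patients detected
        "specificity": tn / (tn + fp),  # share of not-at-risk patients cleared
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Invented counts, for illustration only.
metrics = classification_metrics(tp=45, fn=5, tn=80, fp=20)
```

With these hypothetical counts, sensitivity is 45/50 and specificity is 80/100. A screening tool of this kind typically favours sensitivity, since a missed at-risk patient is costlier than an unnecessary preventive check.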


Education is an important resource that should be available to all children. One of the biggest assets of the future generation is the education given to children. Many children are unable to continue their education for a variety of reasons. Predicting student dropout plays a very important role in identifying the students who are on the verge of dropping out of their education. With such predictions, we can try to solve their problems and help them continue their education. In this paper, we propose a model for predicting whether students will drop out, using several machine learning techniques. We make use of decision trees, which make a decision using several factors. The prediction draws on several data attributes, such as correlations, similarity measures, frequent patterns, and association rule mining. The proposed work is evaluated using various parameters and is shown to work efficiently in predicting dropout students compared with alternative approaches.


Antibiotics ◽  
2020 ◽  
Vol 9 (10) ◽  
pp. 644
Author(s):  
Valeria Bellelli ◽  
Guido Siccardi ◽  
Livia Conte ◽  
Luigi Celani ◽  
Elena Congeduti ◽  
...  

Invasive pulmonary aspergillosis (IPA) is typically considered a disease of immunocompromised patients, but, recently, many cases have been reported in patients without typical risk factors. The aim of our study is to develop a risk predictive model for IPA through machine learning techniques (decision trees) in patients with influenza. We conducted a retrospective observational study analyzing data regarding patients diagnosed with influenza hospitalized at the University Hospital “Umberto I” of Rome during the 2018-2019 season. We collected five IPA cases out of 77 influenza patients. Although the small sample size is a limitation, the most vulnerable patients among the influenza-infected population seem to be those with evidence of lymphocytopenia and those who received corticosteroid therapy.


2017 ◽  
Author(s):  
Yulia Kolesnikova ◽  
Adam Lathrop ◽  
Bree Norlander ◽  
An Yan

Few research studies have quantitatively analyzed metadata elements associated with scientific data reuse. By using metadata and dataset download rates from the National Snow and Ice Data Center, we address whether there are key indicators in data repository metadata that show a statistically significant correlation with the download count of a dataset and whether we can predict data reuse using machine learning techniques. We used the download rate by unique IP addresses for individual datasets as our dependent variable and as a proxy for data reuse. Our analysis shows that the following metadata elements in NSIDC datasets are positively correlated with download rates: year of citation, number of data formats, number of contributors, number of platforms, number of spatial coverage areas, number of locations, and number of keywords. Our results are applicable to researchers and professionals working with data and add to the small body of work addressing metadata best practices for increasing discovery of data.


2021 ◽  
Vol 54 (6) ◽  
pp. 1-25
Author(s):  
Pádraig Cunningham ◽  
Sarah Jane Delany

Perhaps the most straightforward classifier in the arsenal of Machine Learning techniques is the Nearest Neighbour Classifier: classification is achieved by identifying the nearest neighbours to a query example and using those neighbours to determine the class of the query. This approach to classification is of particular importance, because issues of poor runtime performance are not such a problem these days with the computational power that is available. This article presents an overview of techniques for Nearest Neighbour classification focusing on: mechanisms for assessing similarity (distance), computational issues in identifying nearest neighbours, and mechanisms for reducing the dimension of the data. This article is the second edition of a paper previously published as a technical report [16]. Sections on similarity measures for time-series, retrieval speedup, and intrinsic dimensionality have been added. An Appendix is included, providing access to Python code for the key methods.
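The classification rule described above can be sketched in a few lines. This is a minimal illustration, not the code from the article's appendix. Euclidean distance and unweighted majority voting are assumptions, and the training examples are invented.

```python
import math
from collections import Counter


def knn_classify(query, examples, k=3):
    """Classify `query` by majority vote among its k nearest examples.

    examples: list of (feature_vector, label) pairs.
    Assumptions: Euclidean distance, equal-weight votes.
    """
    # Brute-force search: sort all examples by distance to the query.
    nearest = sorted(examples, key=lambda ex: math.dist(query, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Invented two-class toy data.
training = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]
label_near_a = knn_classify((1.1, 1.0), training, k=3)
label_near_b = knn_classify((5.1, 5.0), training, k=3)
```

The brute-force sort makes the runtime cost explicit: every query touches every stored example. That cost is what retrieval speedup techniques aim to reduce.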


2020 ◽  
Vol 19 (5-6) ◽  
pp. 350-363
Author(s):  
Duc-Hau Le

Disease gene prediction is an essential issue in biomedical research. In the early days, annotation-based approaches were proposed for this problem. With the development of high-throughput technologies, interaction data between genes/proteins have grown quickly and now cover almost the entire genome and proteome; thus, network-based methods for the problem have become prominent. In parallel, machine learning techniques, which formulate the problem as a classification, have also been proposed. Here, we first present a roadmap of the machine learning-based methods for disease gene prediction. In the beginning, the problem was usually approached using binary classification, where the positive and negative training sample sets comprise disease genes and non-disease genes, respectively. The disease genes are ones known to be associated with diseases; meanwhile, non-disease genes were randomly selected from those not yet known to be associated with diseases. However, the latter may contain unknown disease genes. To overcome this uncertainty in defining the non-disease genes, more realistic approaches have been proposed for the problem, such as unary and semi-supervised classification. Recently, more advanced methods, including ensemble learning, matrix factorization and deep learning, have been proposed for the problem. Secondly, 12 representative machine learning-based methods for disease gene prediction were examined and compared in terms of prediction performance and running time. Finally, their advantages, disadvantages, interpretability and trust were also analyzed and discussed.


2006 ◽  
Author(s):  
Christopher Schreiner ◽  
Kari Torkkola ◽  
Mike Gardner ◽  
Keshu Zhang

2020 ◽  
Vol 12 (2) ◽  
pp. 84-99
Author(s):  
Li-Pang Chen

In this paper, we investigate the analysis and prediction of time-dependent data. We focus our attention on four different stocks selected from the Yahoo Finance historical database. To build models and predict future stock prices, we consider three different machine learning techniques: Long Short-Term Memory (LSTM), Convolutional Neural Networks (CNN) and Support Vector Regression (SVR). By treating close price, open price, daily low, daily high, adjusted close price, and volume of trades as predictors in the machine learning methods, it can be shown that the prediction accuracy is improved.

