An encoding of genome content for machine learning

2019 ◽  
Author(s):  
A. Viehweger ◽  
S. Krautwurst ◽  
D. H. Parks ◽  
B. König ◽  
M. Marz

Abstract An ever-growing number of metagenomes can be used for biomining and the study of microbial functions. The use of learning algorithms in this context has been hindered because they often need input in the form of low-dimensional, dense vectors of numbers. We propose such a representation for genomes, called nanotext, that scales to very large data sets. The underlying model is learned from a corpus of nearly 150 thousand genomes spanning 750 million protein domains. We treat the protein domains in a genome like words in a document, assuming that protein domains in a similar context have similar “meaning”. This meaning can be distributed by a neural net over a vector of numbers. The resulting vectors efficiently encode function, preserve known phylogeny, capture subtle functional relationships and are robust against genome incompleteness. The “functional” distance between two vectors complements nucleotide-based distance, so that genomes can be identified as similar even though their nucleotide identity is low. nanotext can thus encode (meta)genomes for direct use in downstream machine learning tasks. We show this by predicting plausible culture media for metagenome-assembled genomes (MAGs) from the Tara Oceans Expedition using their genome content only. nanotext is freely released under a BSD licence (https://github.com/phiweger/nanotext).
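The "domains as words" idea maps directly onto document-embedding models. Below is a minimal sketch of that idea using gensim's Doc2Vec as a stand-in for the model behind nanotext; the Pfam accessions and genome names are invented for illustration, and the real tool should be used via the repository above.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each genome is a 'document' whose 'words' are its protein domains
# (hypothetical Pfam accessions, listed in genomic order).
genomes = {
    "genome_A": ["PF00005", "PF00072", "PF00486", "PF00072"],
    "genome_B": ["PF00072", "PF00486", "PF00005", "PF07715"],
}
corpus = [TaggedDocument(words=doms, tags=[name])
          for name, doms in genomes.items()]

model = Doc2Vec(vector_size=100, window=5, min_count=1, epochs=20)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

vec = model.dv["genome_A"]                          # dense genome vector
print(model.dv.similarity("genome_A", "genome_B"))  # 'functional' similarity
```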

2020 ◽  
Vol 6 ◽  
Author(s):  
Jaime de Miguel Rodríguez ◽  
Maria Eugenia Villafañe ◽  
Luka Piškorec ◽  
Fernando Sancho Caparrini

Abstract This work presents a methodology for the generation of novel 3D objects resembling wireframes of building types. These result from the reconstruction of interpolated locations within the learnt distribution of variational autoencoders (VAEs), deep generative machine learning models based on neural networks. The data set used features a scheme for geometry representation based on a ‘connectivity map’ that is especially suited to express the wireframe objects that compose it. Additionally, the input samples are generated through ‘parametric augmentation’, a strategy proposed in this study that creates coherent variations among data by enabling a set of parameters to alter representative features on a given building type. In the experiments that are described in this paper, more than 150 k input samples belonging to two building types have been processed during the training of a VAE model. The main contribution of this paper has been to explore parametric augmentation for the generation of large data sets of 3D geometries, showcasing its problems and limitations in the context of neural networks and VAEs. Results show that the generation of interpolated hybrid geometries is a challenging task. Despite the difficulty of the endeavour, promising advances are presented.
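As a rough illustration of the interpolation step described above, here is a minimal VAE sketch in PyTorch. The input dimension stands in for a flattened 'connectivity map'; the architecture, sizes, and sampling details are assumptions for illustration, not the authors' model.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, in_dim=1024, latent_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, in_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.dec(z), mu, logvar

vae = VAE()  # untrained here; training would minimize reconstruction + KL loss
x_a, x_b = torch.rand(1, 1024), torch.rand(1, 1024)  # two flattened wireframes
with torch.no_grad():
    _, mu_a, _ = vae(x_a)
    _, mu_b, _ = vae(x_b)
    z_mid = 0.5 * (mu_a + mu_b)   # interpolated location in the latent space
    hybrid = vae.dec(z_mid)       # decode it into a 'hybrid' geometry
```

Decoding the midpoint of two latent codes is exactly the kind of interpolated hybrid geometry the paper investigates; in practice a trained model and a geometry-aware loss are needed for meaningful wireframes.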


2011 ◽  
Vol 16 (9) ◽  
pp. 1059-1067 ◽  
Author(s):  
Peter Horvath ◽  
Thomas Wild ◽  
Ulrike Kutay ◽  
Gabor Csucs

Imaging-based high-content screens often rely on single-cell-based evaluation of phenotypes in large data sets of microscopic images. Traditionally, these screens are analyzed by extracting a few image-related parameters and using their ratios (linear single- or multiparametric separation) to classify the cells into various phenotypic classes. In this study, the authors show how machine learning–based classification of individual cells outperforms those classical ratio-based techniques. Using fluorescence intensity, morphological, and texture features, they evaluated how the performance of data analysis increases with increasing feature numbers. Their findings are based on a case study involving an siRNA screen monitoring nucleoplasmic and nucleolar accumulation of a fluorescently tagged reporter protein. For the analysis, they developed a complete analysis workflow incorporating image segmentation, feature extraction, cell classification, hit detection, and visualization of the results. For the classification task, the authors have established a new graphical framework, the Advanced Cell Classifier, which provides very accurate high-content screen analysis with minimal user interaction, offering access to a variety of advanced machine learning methods.
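The contrast between ratio-based gating and multi-feature classification can be sketched in a few lines of scikit-learn. Everything below is synthetic and only illustrates the comparison; it is not the Advanced Cell Classifier pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000
# Synthetic per-cell features: two intensity channels plus morphology/texture.
X = rng.normal(size=(n, 5))
y = (X[:, 0] - X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=n)) > 0

# Classical approach: gate on a single two-channel contrast (a proxy for a ratio).
ratio_pred = (X[:, 0] - X[:, 1]) > 0
print("ratio gate accuracy:", (ratio_pred == y).mean())

# ML approach: a classifier uses all features jointly.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
print("classifier accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```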


2022 ◽  
pp. 27-50
Author(s):  
Rajalaxmi Prabhu B. ◽  
Seema S.

A great deal of user-generated data is available these days from platforms such as blogs, websites, and review sites. These data are usually unstructured, and automatically analyzing the sentiments they express is an important challenge. Several machine learning algorithms have been implemented to extract opinions from large data sets, and much research has been undertaken on machine learning approaches to sentiment analysis. Machine learning depends chiefly on the data available for model building, so suitable feature extraction techniques must also be applied. This chapter addresses several deep learning approaches, their challenges, and future issues. Deep learning techniques are considered important for predicting the sentiments of users. The chapter aims to analyze deep learning techniques for predicting sentiments and to assess the importance of several approaches for mining opinions and determining sentiment polarity.
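As a concrete (and deliberately tiny) illustration of the deep learning approach discussed here, the sketch below trains an embedding-plus-linear-head sentiment model in PyTorch. The vocabulary, texts, and labels are toy assumptions; real systems use large corpora, pretrained embeddings, and recurrent or transformer architectures.

```python
import torch
import torch.nn as nn

# Toy vocabulary and labelled texts (1 = positive, 0 = negative).
vocab = {"<pad>": 0, "good": 1, "bad": 2, "movie": 3, "great": 4, "awful": 5}
texts = [["good", "movie"], ["great", "movie"],
         ["bad", "movie"], ["awful", "movie"]]
labels = torch.tensor([1.0, 1.0, 0.0, 0.0])

def encode(tokens, length=4):
    ids = [vocab[t] for t in tokens][:length]
    return ids + [vocab["<pad>"]] * (length - len(ids))  # pad to fixed length

X = torch.tensor([encode(t) for t in texts])

class SentimentNet(nn.Module):
    def __init__(self, vocab_size, dim=16):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab_size, dim)  # mean-pools token vectors
        self.head = nn.Linear(dim, 1)                # sentiment logit

    def forward(self, x):
        return self.head(self.emb(x)).squeeze(-1)

model = SentimentNet(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.BCEWithLogitsLoss()
for _ in range(200):
    opt.zero_grad()
    loss_fn(model(X), labels).backward()
    opt.step()

print(torch.sigmoid(model(X)))  # polarity scores: near 1 = positive
```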


Author(s):  
Fenxiao Chen ◽  
Yun-Cheng Wang ◽  
Bin Wang ◽  
C.-C. Jay Kuo

Abstract Research on graph representation learning has received great attention in recent years, since most data in real-world applications come in the form of graphs. High-dimensional graph data are often irregular in form, which makes them more difficult to analyze than image/video/audio data defined on regular lattices. Various graph embedding techniques have been developed to convert raw graph data into a low-dimensional vector representation while preserving the intrinsic graph properties. In this review, we first explain the graph embedding task and its challenges. Next, we review a wide range of graph embedding techniques with insights. Then, we evaluate several state-of-the-art methods against small and large data sets and compare their performance. Finally, potential applications and future directions are presented.
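One widely known family of techniques of the kind surveyed here treats truncated random walks over the graph as sentences for a word-embedding model (DeepWalk-style). A minimal sketch, with an off-the-shelf toy graph and illustrative parameters:

```python
import random
import networkx as nx
from gensim.models import Word2Vec

G = nx.karate_club_graph()  # small built-in social graph

def random_walk(g, start, length=10):
    """Truncated random walk, returned as a 'sentence' of node names."""
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(list(g.neighbors(walk[-1]))))
    return [str(n) for n in walk]

# 20 walks per node form the training corpus.
walks = [random_walk(G, n) for n in G.nodes() for _ in range(20)]
model = Word2Vec(walks, vector_size=32, window=5, min_count=0, sg=1, epochs=5)

# Nodes occurring in similar walk contexts end up with nearby vectors.
print(model.wv.most_similar("0", topn=3))
```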


Author(s):  
Brendan Juba ◽  
Hai S. Le

Practitioners of data mining and machine learning have long observed that the imbalance of classes in a data set negatively impacts the quality of classifiers trained on that data. Numerous techniques for coping with such imbalances have been proposed, but nearly all lack any theoretical grounding. By contrast, the standard theoretical analysis of machine learning admits no dependence on the imbalance of classes at all. The basic theorems of statistical learning establish the number of examples needed to estimate the accuracy of a classifier as a function of its complexity (VC-dimension) and the confidence desired; the class imbalance does not enter these formulas anywhere. In this work, we consider classifier performance in terms of precision and recall, measures that are widely suggested as more appropriate for the classification of imbalanced data. We observe that whenever the precision is moderately large, the worse of the precision and recall is within a small constant factor of the accuracy weighted by the class imbalance. A corollary of this observation is that a larger number of examples is necessary and sufficient to address class imbalance, a finding we also illustrate empirically.
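The relationship can be made concrete with a small arithmetic sketch. The confusion counts below are invented; the point is only to show accuracy, the class imbalance, and precision/recall computed side by side, consistent with the observation above.

```python
# Invented confusion counts on an imbalanced test set (1% positives).
tp, fn, fp, tn = 80, 20, 40, 9860
n = tp + fn + fp + tn

beta = (tp + fn) / n          # class imbalance: fraction of positives = 0.01
precision = tp / (tp + fp)    # 0.667
recall = tp / (tp + fn)       # 0.800
error = (fp + fn) / n         # 1 - accuracy = 0.006

# Rescaling the error rate by the positive-class rate puts it on the
# same footing as precision and recall:
print(f"imbalance-weighted error: {error / beta:.3f}")                    # 0.600
print(f"shortfall of the worse of P/R: {1 - min(precision, recall):.3f}") # 0.333
```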


Machine learning is a technology that uses accumulated data to support better decisions in future applications. It is the scientific study of algorithms that perform a specific task efficiently without explicit instructions, and it can be viewed as a subset of artificial intelligence concerned with the ability to learn and improve from experience automatically, without being explicitly programmed. Its primary intention is to allow computers to learn automatically and produce more accurate results in order to identify profitable opportunities. Combining machine learning with AI and cognitive technologies can make it even more effective at processing large volumes of information without human intervention or assistance, adjusting its actions accordingly. Viewed as an algorithm-driven discipline aimed at improving task performance, its techniques can be applied to assess and make predictions over large data sets. This paper concerns the mechanism of supervised learning in database systems that are both self-driven and secure, and presents a case study of an organization dealing with student loans. The paper ends with a discussion, future directions, and conclusions.


2021 ◽  
Author(s):  
Maureen Rebecca Smith ◽  
Maria Trofimova ◽  
Ariane Weber ◽  
Yannick Duport ◽  
Denise Kuhnert ◽  
...  

As of May 2021, over 160 million SARS-CoV-2 infections had been reported worldwide. Yet, the true number of infections is unknown and believed to exceed the reported numbers several-fold, depending on national testing policies that can strongly affect the proportion of undetected cases. To overcome this testing bias and better assess SARS-CoV-2 transmission dynamics, we propose a genome-based computational pipeline, GInPipe, to reconstruct SARS-CoV-2 incidence dynamics through time. After validating GInPipe against in silico generated outbreak data, as well as more complex phylodynamic analyses, we use the pipeline to reconstruct incidence histories in Denmark, Scotland, Switzerland, and Victoria (Australia) solely from viral sequence data. The proposed method robustly reconstructs the different pandemic waves in the investigated countries and regions, does not require phylodynamic reconstruction, and can be directly applied to publicly deposited SARS-CoV-2 sequencing data sets. We observe differences in the relative magnitude of reconstructed versus reported incidence during times of sparse availability of diagnostic tests. Using the reconstructed incidence dynamics, we assess how testing policies may have affected the probability of diagnosing and reporting infected individuals. We find that under-reporting was highest in mid 2020 in all analysed countries, coinciding with liberal testing policies at times of low test capacity. Given the increasing use of real-time sequencing, GInPipe can complement established surveillance tools to monitor the SARS-CoV-2 pandemic and evaluate testing policies. The method executes within minutes on very large data sets and is freely available as a fully automated pipeline from https://github.com/KleistLab/GInPipe.


2018 ◽  
Vol 3 ◽  
Author(s):  
Andreas Baumann

Machine learning is a powerful method when working with large data sets such as diachronic corpora. However, as opposed to standard techniques from inferential statistics like regression modeling, machine learning is less commonly used among phonological corpus linguists. This paper discusses three different machine learning techniques (K nearest neighbors classifiers; Naïve Bayes classifiers; artificial neural networks) and how they can be applied to diachronic corpus data to address specific phonological questions. To illustrate the methodology, I investigate Middle English schwa deletion and when and how it potentially triggered reduction of final /mb/ clusters in English.
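All three techniques are available off the shelf, e.g., in scikit-learn. The sketch below runs them side by side on synthetic features; the feature and label semantics are invented stand-ins, not the paper's Middle English corpus data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))                  # e.g., frequency, date, phonological context
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # e.g., schwa deleted vs. retained

for clf in (KNeighborsClassifier(n_neighbors=5),
            GaussianNB(),
            MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)):
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(type(clf).__name__, round(score, 3))
```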


2018 ◽  
Vol 64 (11) ◽  
pp. 1586-1595 ◽  
Author(s):  
Edmund H Wilkes ◽  
Gill Rumsby ◽  
Gary M Woodward

Abstract BACKGROUND Urine steroid profiles are used in clinical practice for the diagnosis and monitoring of disorders of steroidogenesis and adrenal pathologies. Machine learning (ML) algorithms are powerful computational tools used extensively for the recognition of patterns in large data sets. Here, we investigated the utility of various ML algorithms for the automated biochemical interpretation of urine steroid profiles to support current clinical practices. METHODS Data from 4619 urine steroid profiles processed between June 2012 and October 2016 were retrospectively collected. Of these, 1314 profiles were used to train and test various ML classifiers' abilities to differentiate between “No significant abnormality” and “?Abnormal” profiles. Further classifiers were trained and tested for their ability to predict the specific biochemical interpretation of the profiles. RESULTS The best performing binary classifier could predict the interpretation of No significant abnormality and ?Abnormal profiles with a mean area under the ROC curve of 0.955 (95% CI, 0.949–0.961). In addition, the best performing multiclass classifier could predict the individual abnormal profile interpretation with a mean balanced accuracy of 0.873 (0.865–0.880). CONCLUSIONS Here we have described the application of ML algorithms to the automated interpretation of urine steroid profiles. This provides a proof-of-concept application of ML algorithms to complex clinical laboratory data that has the potential to improve laboratory efficiency in a setting of limited staff resources.
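Both evaluation settings reported above are straightforward to reproduce in outline with scikit-learn. The sketch below uses synthetic data as a stand-in for the urine steroid profiles and logistic regression as a placeholder classifier; it only illustrates how ROC AUC and balanced accuracy are computed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1314, 10))              # stand-in for steroid measurements
y_bin = (X[:, 0] + X[:, 1] > 1).astype(int)  # '?Abnormal' vs. 'No significant abnormality'

# Binary task, scored by area under the ROC curve.
X_tr, X_te, y_tr, y_te = train_test_split(X, y_bin, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("ROC AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))

# Multiclass task (specific interpretations), scored by balanced accuracy.
y_multi = np.digitize(X[:, 0], [-1.0, 0.0, 1.0])
X_tr, X_te, y_tr, y_te = train_test_split(X, y_multi, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("balanced accuracy:", balanced_accuracy_score(y_te, clf.predict(X_te)))
```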


Author(s):  
Sergey Pronin ◽  
Mykhailo Miroshnichenko

A system for analyzing large data sets using machine learning algorithms

