Predicting cross-tissue hormone-gene relations using balanced word embeddings

AbstractMotivationLarge volumes of biomedical literature present an opportunity to build whole-body human models comprising both within-tissue and across-tissue interactions among genes. Current studies have mostly focused on identifying within-tissue or tissue-agnostic associations, with a heavy emphasis on associations among disease, genes and drugs. Literature mining studies that extract relations pertaining to inter-tissue communication, such as between genes and hormones, are solely missing.ResultsWe present here a first study to identify from literature the genes involved in inter-tissue signaling via a hormone in the human body. Our models BioEmbedS and BioEmbedS-TS respectively predict if a hormone-gene pair is associated or not, and whether an associated gene is involved in the hormone’s production or response. Our models are classifiers trained on word embeddings that we had carefully balanced across different strata of the training data such as across production vs. response genes of a hormone (or) well-studied vs. poorly-represented hormones in the literature. Model training and evaluation are enabled by a unified dataset called HGv1 of ground-truth associations between genes and known endocrine hormones that we had compiled. Our models not only recapitulate known gene mediators of tissue-tissue signaling (e.g., at average 70.4% accuracy for BioEmbedS), but also predicts novel genes involved in inter-tissue communication in humans. Furthermore, the species-agnostic nature of our ground-truth HGv1 data and our predictive modeling approach, demonstrated concretely using human data and generalized to mouse, hold much promise for future work on elucidating inter-tissue signaling in other multi-cellular organisms.AvailabilityProposed HGv1 dataset along with our models’ predictions, and the associated code to reproduce this work are available respectively at https://cross-tissue-signaling.herokuapp.com/, and https://github.com/BIRDSgroup/[email protected]

Download Full-text

On the objectivity, reliability, and validity of deep learning enabled bioimage analyses

eLife ◽

10.7554/elife.59780 ◽

2020 ◽

Vol 9 ◽

Cited By ~ 1

Author(s):

Dennis Segebarth ◽

Matthias Griebel ◽

Nikolai Stein ◽

Cora R von Collenberg ◽

Corinna Martin ◽

...

Keyword(s):

Deep Learning ◽

Signal To Noise Ratio ◽

Biological Effects ◽

Reliability And Validity ◽

Ground Truth ◽

Training Data ◽

Model Organisms ◽

Data Annotation ◽

Bioimage Analysis ◽

Model Training

Bioimage analysis of fluorescent labels is widely used in the life sciences. Recent advances in deep learning (DL) allow automating time-consuming manual image analysis processes based on annotated training data. However, manual annotation of fluorescent features with a low signal-to-noise ratio is somewhat subjective. Training DL models on subjective annotations may be instable or yield biased models. In turn, these models may be unable to reliably detect biological effects. An analysis pipeline integrating data annotation, ground truth estimation, and model training can mitigate this risk. To evaluate this integrated process, we compared different DL-based analysis approaches. With data from two model organisms (mice, zebrafish) and five laboratories, we show that ground truth estimation from multiple human annotators helps to establish objectivity in fluorescent feature annotations. Furthermore, ensembles of multiple models trained on the estimated ground truth establish reliability and validity. Our research provides guidelines for reproducible DL-based bioimage analyses.

Download Full-text

Addressing Age-Related Bias in Sentiment Analysis

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/852 ◽

2019 ◽

Cited By ~ 2

Author(s):

Mark Díaz ◽

Isaac Johnson ◽

Amanda Lazar ◽

Anne Marie Piper ◽

Darren Gergle

Keyword(s):

Sentiment Analysis ◽

Sociodemographic Factors ◽

Training Data ◽

Word Embeddings ◽

Age Bias ◽

Age Related ◽

Model Training ◽

Analysis Models ◽

Social Biases

Recent studies have identified various forms of bias in language-based models, raising concerns about the risk of propagating social biases against certain groups based on sociodemographic factors (e.g., gender, race, geography). In this study, we analyze the treatment of age-related terms across 15 sentiment analysis models and 10 widely-used GloVe word embeddings and attempt to alleviate bias through a method of processing model training data. Our results show significant age bias is encoded in the outputs of many sentiment analysis algorithms and word embeddings, and we can alleviate this bias by manipulating training data.

Download Full-text

Word embedding mining for SARS-CoV-2 and COVID-19 drug repurposing

F1000Research ◽

10.12688/f1000research.24271.1 ◽

2020 ◽

Vol 9 ◽

pp. 585

Author(s):

Finn Kuusisto ◽

David Page ◽

Ron Stewart

Keyword(s):

De Novo ◽

Drug Repurposing ◽

Ground Truth ◽

Biomedical Literature ◽

Word Embedding ◽

Electronic Health Record Data ◽

Global Pandemic ◽

Rapid Spread ◽

Approved Drugs ◽

Future Work

Background: The rapid spread of illness and death caused by the severe respiratory syndrome coronavirus 2 (SARS-CoV-2) and its associated coronavirus disease 2019 (COVID-19) demands a rapid response in treatment development. Limitations of de novo drug development, however, suggest that drug repurposing is best suited to meet this demand. Methods: Due to the difficulty of accessing electronic health record data in general and in the midst of a global pandemic, and due to the similarity between SARS-CoV-2 and SARS-CoV, we propose mining the extensive biomedical literature for treatments to SARS that may also then be appropriate for COVID-19. In particular, we propose a method of mining a large biomedical word embedding for FDA approved drugs based on drug-disease treatment analogies. Results: We first validate that our method correctly identifies ground truth treatments for well-known diseases. We then use our method to find several approved drugs that have been suggested or are currently in clinical trials for COVID-19 in our top hits and present the rest as promising leads for further experimental investigation. Conclusions: We find our approach promising and present it, along with suggestions for future work, to the computational drug repurposing community at large as another tool to help fight the pandemic. Code and data for our methods can be found at https://github.com/finnkuusisto/covid19_word_embedding.

Download Full-text

Identifying Physico-Chemical Laws from the Robotically Collected Data

10.26434/chemrxiv.8490149 ◽

2019 ◽

Author(s):

Liwei Cao ◽

Danilo Russo ◽

Vassilios S. Vassiliadis ◽

Alexei Lapkin

Keyword(s):

Experimental Data ◽

Numerical Models ◽

Predictor Variable ◽

Physical Models ◽

Training Data ◽

Mixed Integer ◽

Physico Chemical ◽

Data Points ◽

Future Work ◽

The Relationship

A mixed-integer nonlinear programming (MINLP) formulation for symbolic regression was proposed to identify physical models from noisy experimental data. The formulation was tested using numerical models and was found to be more efficient than the previous literature example with respect to the number of predictor variables and training data points. The globally optimal search was extended to identify physical models and to cope with noise in the experimental data predictor variable. The methodology was coupled with the collection of experimental data in an automated fashion, and was proven to be successful in identifying the correct physical models describing the relationship between the shear stress and shear rate for both Newtonian and non-Newtonian fluids, and simple kinetic laws of reactions. Future work will focus on addressing the limitations of the formulation presented in this work, by extending it to be able to address larger complex physical models.

Download Full-text

Quantifying the structure of strong gravitational lens potentials with uncertainty-aware deep neural networks

Monthly Notices of the Royal Astronomical Society ◽

10.1093/mnras/staa3201 ◽

2020 ◽

Vol 499 (4) ◽

pp. 5641-5652

Author(s):

Georgios Vernardos ◽

Grigorios Tsagkatakis ◽

Yannis Pantazis

Keyword(s):

Confidence Intervals ◽

Galaxy Evolution ◽

Gravitational Lensing ◽

Probability Distributions ◽

Mass Density ◽

Ground Truth ◽

Gaussian Random Fields ◽

Training Data ◽

Gravitational Lens ◽

Data Set

ABSTRACT Gravitational lensing is a powerful tool for constraining substructure in the mass distribution of galaxies, be it from the presence of dark matter sub-haloes or due to physical mechanisms affecting the baryons throughout galaxy evolution. Such substructure is hard to model and is either ignored by traditional, smooth modelling, approaches, or treated as well-localized massive perturbers. In this work, we propose a deep learning approach to quantify the statistical properties of such perturbations directly from images, where only the extended lensed source features within a mask are considered, without the need of any lens modelling. Our training data consist of mock lensed images assuming perturbing Gaussian Random Fields permeating the smooth overall lens potential, and, for the first time, using images of real galaxies as the lensed source. We employ a novel deep neural network that can handle arbitrary uncertainty intervals associated with the training data set labels as input, provides probability distributions as output, and adopts a composite loss function. The method succeeds not only in accurately estimating the actual parameter values, but also reduces the predicted confidence intervals by 10 per cent in an unsupervised manner, i.e. without having access to the actual ground truth values. Our results are invariant to the inherent degeneracy between mass perturbations in the lens and complex brightness profiles for the source. Hence, we can quantitatively and robustly quantify the smoothness of the mass density of thousands of lenses, including confidence intervals, and provide a consistent ranking for follow-up science.

Download Full-text

Fully automated contrast and non-contrast cardiac view detection in echocardiography a multi-centre, multi-vendor study

European Heart Journal ◽

10.1093/ehjci/ehaa946.0078 ◽

2020 ◽

Vol 41 (Supplement_2) ◽

Author(s):

S Gao ◽

D Stojanovski ◽

A Parker ◽

P Marques ◽

S Heitner ◽

...

Keyword(s):

Neural Network ◽

Training Data ◽

Classification Model ◽

Validation Dataset ◽

Funding Source ◽

Private Company ◽

Validation Data ◽

Independent Test ◽

Model Training ◽

Confusion Matrices

Abstract Background Correctly identifying views acquired in a 2D echocardiographic examination is paramount to post-processing and quantification steps often performed as part of most clinical workflows. In many exams, particularly in stress echocardiography, microbubble contrast is used which greatly affects the appearance of the cardiac views. Here we present a bespoke, fully automated convolutional neural network (CNN) which identifies apical 2, 3, and 4 chamber, and short axis (SAX) views acquired with and without contrast. The CNN was tested in a completely independent, external dataset with the data acquired in a different country than that used to train the neural network. Methods Training data comprised of 2D echocardiograms was taken from 1014 subjects from a prospective multisite, multi-vendor, UK trial with the number of frames in each view greater than 17,500. Prior to view classification model training, images were processed using standard techniques to ensure homogenous and normalised image inputs to the training pipeline. A bespoke CNN was built using the minimum number of convolutional layers required with batch normalisation, and including dropout for reducing overfitting. Before processing, the data was split into 90% for model training (211,958 frames), and 10% used as a validation dataset (23,946 frames). Image frames from different subjects were separated out entirely amongst the training and validation datasets. Further, a separate trial dataset of 240 studies acquired in the USA was used as an independent test dataset (39,401 frames). Results Figure 1 shows the confusion matrices for both validation data (left) and independent test data (right), with an overall accuracy of 96% and 95% for the validation and test datasets respectively. The accuracy for the non-contrast cardiac views of >99% exceeds that seen in other works. The combined datasets included images acquired across ultrasound manufacturers and models from 12 clinical sites. Conclusion We have developed a CNN capable of automatically accurately identifying all relevant cardiac views used in “real world” echo exams, including views acquired with contrast. Use of the CNN in a routine clinical workflow could improve efficiency of quantification steps performed after image acquisition. This was tested on an independent dataset acquired in a different country to that used to train the model and was found to perform similarly thus indicating the generalisability of the model. Figure 1. Confusion matrices Funding Acknowledgement Type of funding source: Private company. Main funding source(s): Ultromics Ltd.

Download Full-text

Improved PET/MRI attenuation correction in the pelvic region using a statistical decomposition method on T2-weighted images

EJNMMI Physics ◽

10.1186/s40658-020-00336-5 ◽

2020 ◽

Vol 7 (1) ◽

Author(s):

Elin Wallstén ◽

Jan Axelsson ◽

Joakim Jonsson ◽

Camilla Thellenberg Karlsson ◽

Tufve Nyholm ◽

...

Keyword(s):

Attenuation Correction ◽

Research Study ◽

Standardized Uptake Value ◽

Ground Truth ◽

Whole Body ◽

Mr Images ◽

Pelvic Region ◽

Mri Scans ◽

The Mean ◽

Prostate Cancer Research

Abstract Background Attenuation correction of PET/MRI is a remaining problem for whole-body PET/MRI. The statistical decomposition algorithm (SDA) is a probabilistic atlas-based method that calculates synthetic CTs from T2-weighted MRI scans. In this study, we evaluated the application of SDA for attenuation correction of PET images in the pelvic region. Materials and method Twelve patients were retrospectively selected from an ongoing prostate cancer research study. The patients had same-day scans of [11C]acetate PET/MRI and CT. The CT images were non-rigidly registered to the PET/MRI geometry, and PET images were reconstructed with attenuation correction employing CT, SDA-generated CT, and the built-in Dixon sequence-based method of the scanner. The PET images reconstructed using CT-based attenuation correction were used as ground truth. Results The mean whole-image PET uptake error was reduced from − 5.4% for Dixon-PET to − 0.9% for SDA-PET. The prostate standardized uptake value (SUV) quantification error was significantly reduced from − 5.6% for Dixon-PET to − 2.3% for SDA-PET. Conclusion Attenuation correction with SDA improves quantification of PET/MR images in the pelvic region compared to the Dixon-based method.

Download Full-text

Automated analysis of 3D-echocardiography using spatially registered patient-specific CMR meshes

European Heart Journal - Cardiovascular Imaging ◽

10.1093/ehjci/jeaa356.425 ◽

2021 ◽

Vol 22 (Supplement_1) ◽

Author(s):

D Zhao ◽

E Ferdian ◽

GD Maso Talou ◽

GM Quill ◽

K Gilbert ◽

...

Keyword(s):

New Zealand ◽

Interobserver Variability ◽

Ground Truth ◽

Automated Analysis ◽

3D Echocardiography ◽

Training Data ◽

Patient Specific ◽

Manual Analysis ◽

Lv Mass ◽

3D Echo

Abstract Funding Acknowledgements Type of funding sources: Public grant(s) – National budget only. Main funding source(s): National Heart Foundation (NHF) of New Zealand Health Research Council (HRC) of New Zealand Artificial intelligence shows considerable promise for automated analysis and interpretation of medical images, particularly in the domain of cardiovascular imaging. While application to cardiac magnetic resonance (CMR) has demonstrated excellent results, automated analysis of 3D echocardiography (3D-echo) remains challenging, due to the lower signal-to-noise ratio (SNR), signal dropout, and greater interobserver variability in manual annotations. As 3D-echo is becoming increasingly widespread, robust analysis methods will substantially benefit patient evaluation. We sought to leverage the high SNR of CMR to provide training data for a convolutional neural network (CNN) capable of analysing 3D-echo. We imaged 73 participants (53 healthy volunteers, 20 patients with non-ischaemic cardiac disease) under both CMR and 3D-echo (<1 hour between scans). 3D models of the left ventricle (LV) were independently constructed from CMR and 3D-echo, and used to spatially align the image volumes using least squares fitting to a cardiac template. The resultant transformation was used to map the CMR mesh to the 3D-echo image. Alignment of mesh and image was verified through volume slicing and visual inspection (Fig. 1) for 120 paired datasets (including 47 rescans) each at end-diastole and end-systole. 100 datasets (80 for training, 20 for validation) were used to train a shallow CNN for mesh extraction from 3D-echo, optimised with a composite loss function consisting of normalised Euclidian distance (for 290 mesh points) and volume. Data augmentation was applied in the form of rotations and tilts (<15 degrees) about the long axis. The network was tested on the remaining 20 datasets (different participants) of varying image quality (Tab. I). For comparison, corresponding LV measurements from conventional manual analysis of 3D-echo and associated interobserver variability (for two observers) were also estimated. Initial results indicate that the use of embedded CMR meshes as training data for 3D-echo analysis is a promising alternative to manual analysis, with improved accuracy and precision compared with conventional methods. Further optimisations and a larger dataset are expected to improve network performance. (n = 20) LV EDV (ml) LV ESV (ml) LV EF (%) LV mass (g) Ground truth CMR 150.5 ± 29.5 57.9 ± 12.7 61.5 ± 3.4 128.1 ± 29.8 Algorithm error -13.3 ± 15.7 -1.4 ± 7.6 -2.8 ± 5.5 0.1 ± 20.9 Manual error -30.1 ± 21.0 -15.1 ± 12.4 3.0 ± 5.0 Not available Interobserver error 19.1 ± 14.3 14.4 ± 7.6 -6.4 ± 4.8 Not available Tab. 1. LV mass and volume differences (means ± standard deviations) for 20 test cases. Algorithm: CNN – CMR (as ground truth). Abstract Figure. Fig 1. CMR mesh registered to 3D-echo.

Download Full-text

ChemProps: A RESTful API enabled database for composite polymer name standardization

Journal of Cheminformatics ◽

10.1186/s13321-021-00502-6 ◽

2021 ◽

Vol 13 (1) ◽

Author(s):

Bingyin Hu ◽

Anqi Lin ◽

L. Catherine Brinson

Keyword(s):

Data Exchange ◽

Ground Truth ◽

Training Data ◽

Test Accuracy ◽

Polymer Science ◽

Data Systems ◽

Materials Informatics ◽

Related Data ◽

Weight Factors ◽

Data Points

AbstractThe inconsistency of polymer indexing caused by the lack of uniformity in expression of polymer names is a major challenge for widespread use of polymer related data resources and limits broad application of materials informatics for innovation in broad classes of polymer science and polymeric based materials. The current solution of using a variety of different chemical identifiers has proven insufficient to address the challenge and is not intuitive for researchers. This work proposes a multi-algorithm-based mapping methodology entitled ChemProps that is optimized to solve the polymer indexing issue with easy-to-update design both in depth and in width. RESTful API is enabled for lightweight data exchange and easy integration across data systems. A weight factor is assigned to each algorithm to generate scores for candidate chemical names and optimized to maximize the minimum value of the score difference between the ground truth chemical name and the other candidate chemical names. Ten-fold validation is utilized on the 160 training data points to prevent overfitting issues. The obtained set of weight factors achieves a 100% test accuracy on the 54 test data points. The weight factors will evolve as ChemProps grows. With ChemProps, other polymer databases can remove duplicate entries and enable a more accurate “search by SMILES” function by using ChemProps as a common name-to-SMILES translator through API calls. ChemProps is also an excellent tool for auto-populating polymer properties thanks to its easy-to-update design.

Download Full-text

DeepMAsED: evaluating the quality of metagenomic assemblies

Bioinformatics ◽

10.1093/bioinformatics/btaa124 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3011-3017 ◽

Cited By ~ 5

Author(s):

Olga Mineeva ◽

Mateo Rojas-Carulla ◽

Ruth E Ley ◽

Bernhard Schölkopf ◽

Nicholas D Youngblut

Keyword(s):

Large Scale ◽

State Of The Art ◽

Ground Truth ◽

Supplementary Information ◽

Learning Approach ◽

Wide Range ◽

Metagenome Assembly ◽

Model Training ◽

Reference Genomes

Abstract Motivation Methodological advances in metagenome assembly are rapidly increasing in the number of published metagenome assemblies. However, identifying misassemblies is challenging due to a lack of closely related reference genomes that can act as pseudo ground truth. Existing reference-free methods are no longer maintained, can make strong assumptions that may not hold across a diversity of research projects, and have not been validated on large-scale metagenome assemblies. Results We present DeepMAsED, a deep learning approach for identifying misassembled contigs without the need for reference genomes. Moreover, we provide an in silico pipeline for generating large-scale, realistic metagenome assemblies for comprehensive model training and testing. DeepMAsED accuracy substantially exceeds the state-of-the-art when applied to large and complex metagenome assemblies. Our model estimates a 1% contig misassembly rate in two recent large-scale metagenome assembly publications. Conclusions DeepMAsED accurately identifies misassemblies in metagenome-assembled contigs from a broad diversity of bacteria and archaea without the need for reference genomes or strong modeling assumptions. Running DeepMAsED is straight-forward, as well as is model re-training with our dataset generation pipeline. Therefore, DeepMAsED is a flexible misassembly classifier that can be applied to a wide range of metagenome assembly projects. Availability and implementation DeepMAsED is available from GitHub at https://github.com/leylabmpi/DeepMAsED. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text