Learning supervised embeddings for large scale sequence comparisons

2019 ◽  
Author(s):  
Dhananjay Kimothi ◽  
Pravesh Biyani ◽  
James M Hogan ◽  
Akshay Soni ◽  
Wayne Kelly

Similarity-based search of sequence collections is a core task in bioinformatics, one dominated for most of the genomic era by exact and heuristic alignment-based algorithms. However, even efficient heuristics such as BLAST may not scale to the data sets now emerging, motivating a range of alignment-free alternatives exploiting the underlying lexical structure of each sequence. In this paper, we introduce SuperVec, a novel supervised approach to learning sequence embeddings. Our method extends earlier Representation Learning (RL) based methods to include jointly contextual and class-related information for each sequence during training. This ensures that related sequence fragments have proximal representations in the target space, better reflecting the structure of the domain. Such representations may be used for downstream machine learning tasks or employed directly. Here, we apply SuperVec embeddings to a sequence retrieval task, where the goal is to retrieve sequences with the same family label as a given query. The SuperVec approach is extended further through H-SuperVec, a tree-based hierarchical method which learns embeddings across a range of feature spaces based on the class labels and their exclusive and exhaustive subsets. Experiments show that supervised learning of embeddings based on sequence labels using SuperVec and H-SuperVec provides a substantial improvement in retrieval performance over existing (unsupervised) RL-based approaches. Further, the new methods are an order of magnitude faster than BLAST for the database retrieval task, supporting hybrid approaches in which SuperVec rapidly filters the collection so that only potentially relevant records remain, allowing slower, more accurate methods to be executed quickly over a far smaller dataset. Thus, we may achieve faster query processing and higher precision than before. Finally, for some problems, direct use of embeddings is already sufficient to yield high levels of precision and recall.
Extending this work to encompass weaker homology is the subject of ongoing research.
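
The retrieval step described in this abstract can be illustrated with a minimal sketch. It is not the SuperVec implementation: the embeddings below are hypothetical hand-written vectors, and the names `cosine`, `retrieve`, and `database` are assumptions introduced only for illustration. The sketch ranks database sequences by cosine similarity to a query embedding and keeps the top k, which is the kind of fast filter a slower alignment method could then refine.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec, database, k):
    """Return the ids of the k database entries most similar to the query."""
    ranked = sorted(
        database,
        key=lambda seq_id: cosine(query_vec, database[seq_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical 3-dimensional embeddings; real embeddings would be learned.
database = {
    "seqA": [0.9, 0.1, 0.0],
    "seqB": [0.8, 0.2, 0.1],
    "seqC": [0.0, 0.1, 0.9],
    "seqD": [0.1, 0.0, 0.8],
}
query = [1.0, 0.1, 0.0]

# Filter step: keep only the closest candidates for any slower follow-up method.
candidates = retrieve(query, database, k=2)
print(candidates)
```

In a hybrid pipeline of the kind the abstract mentions, only the sequences in `candidates` would be passed on to an alignment tool.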


Author(s):  
Xin Huang ◽  
Yuxin Peng ◽  
Mingkuan Yuan

DNN-based cross-modal retrieval, which retrieves across different modalities such as image and text, is a research hotspot, but existing methods often face the challenge of insufficient cross-modal training data. In the single-modal scenario, a similar problem is usually relieved by transferring knowledge from large-scale auxiliary datasets (such as ImageNet). Knowledge from such single-modal datasets is also very useful for cross-modal retrieval, since it can provide rich general semantic information that can be shared across different modalities. However, it is challenging to transfer useful knowledge from a single-modal (e.g., image) source domain to a cross-modal (e.g., image/text) target domain. Knowledge in the source domain cannot be directly transferred to both modalities in the target domain, and the inherent cross-modal correlation contained in the target domain provides key hints for cross-modal retrieval, which should be preserved during the transfer process. This paper proposes the Cross-modal Hybrid Transfer Network (CHTN) with two subnetworks: the modal-sharing transfer subnetwork uses the modality shared by both source and target domains as a bridge for transferring knowledge to both modalities simultaneously; the layer-sharing correlation subnetwork preserves the inherent cross-modal semantic correlation to further adapt to the cross-modal retrieval task. Cross-modal data can be converted to a common representation by CHTN for retrieval, and comprehensive experiments on 3 datasets show its effectiveness.


10.2196/28212 ◽  
2021 ◽  
Vol 9 (8) ◽  
pp. e28212
Author(s):  
Peng Wang ◽  
Yunyan Hu ◽  
Shaochen Bai ◽  
Shiyi Zou

Background: Ontology matching seeks to find semantic correspondences between ontologies. With an increasing number of biomedical ontologies being developed independently, matching these ontologies to solve the interoperability problem has become a critical task in biomedical applications. However, some challenges remain. First, extracting and constructing matching clues from biomedical ontologies is a nontrivial problem. Second, it is unknown whether there are dominant matchers for matching biomedical ontologies. Finally, ontology matching also suffers from computational complexity owing to the large size of biomedical ontologies. Objective: To investigate the effectiveness of matching clues and composite match approaches, this paper presents a spectrum of matchers with different combination strategies and empirically studies their influence on matching biomedical ontologies. In addition, extended reduction anchors are introduced to effectively decrease the time complexity of matching large biomedical ontologies. Methods: In this paper, atomic and composite matching clues are first constructed in 4 dimensions: terminology, structure, external knowledge, and representation learning. Then, a spectrum of matchers based on a flexible combination of atomic clues is designed and used to comprehensively study their effectiveness. In addition, we carry out a systematic comparative evaluation of different combinations of matchers. Finally, an extended reduction anchor is proposed to significantly alleviate the time complexity of matching large-scale biomedical ontologies. Results: Experimental results show that considering distinguishable matching clues in biomedical ontologies leads to a substantial improvement in all available information. In addition, incorporating different types of matchers with reliability results in a marked improvement, which is comparable to the state-of-the-art methods.
The dominant matchers achieve F1 measures of 0.9271, 0.8218, and 0.5 on the Anatomy, FMA-NCI (Foundation Model of Anatomy-National Cancer Institute), and FMA-SNOMED data sets, respectively. The extended reduction anchor is able to solve the scalability problem of matching large biomedical ontologies. It achieves a significant reduction in time complexity with little loss of F1 measure, with a 0.21% decrease on the Anatomy data set and a 0.84% decrease on the FMA-NCI data set, but with a 2.65% increase on the FMA-SNOMED data set. Conclusions: This paper systematically analyzes and compares the effectiveness of different matching clues, matchers, and combination strategies. Multiple empirical studies demonstrate that distinguishing clues have significant implications for matching biomedical ontologies. In contrast to matchers with a single clue, those combining multiple clues exhibit more stable and accurate performance. In addition, our results provide evidence that the approach based on extended reduction anchors performs well for large ontology matching tasks, demonstrating an effective solution for the problem.
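
For readers unfamiliar with how an F1 measure is obtained for an ontology alignment, the following is a minimal sketch. It assumes an alignment is represented as a set of (source concept, target concept) pairs scored against a reference alignment; the concept identifiers and the function name `precision_recall_f1` are hypothetical and are not taken from the paper or its data sets.

```python
def precision_recall_f1(predicted, reference):
    """Score a predicted alignment against a reference alignment.

    Both arguments are iterables of (source_concept, target_concept) pairs.
    """
    predicted, reference = set(predicted), set(reference)
    correct = len(predicted & reference)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical correspondences, for illustration only.
reference = {
    ("a:Heart", "b:Heart"),
    ("a:Lung", "b:Lung"),
    ("a:Liver", "b:Hepar"),
    ("a:Skin", "b:Cutis"),
}
predicted = {
    ("a:Heart", "b:Heart"),
    ("a:Lung", "b:Lung"),
    ("a:Liver", "b:Kidney"),  # an incorrect correspondence
}

precision, recall, f1 = precision_recall_f1(predicted, reference)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

Here two of three predicted pairs are correct (precision 2/3) and two of four reference pairs are found (recall 1/2), so F1 is their harmonic mean, 4/7.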


Author(s):  
Zein Al Abidin Ibrahim ◽  
Siba Haidar ◽  
Ihab Sbeity

The production of video has increased and expanded dramatically, creating a need for accurate video classification. In our work, we use deep learning as a means to accelerate the video retrieval task by classifying videos into categories. We classify a video based on the text extracted from it. We trained our model using fastText, a library for efficient text classification and representation learning, and tested our model on 15000 videos. Experimental results show that our approach is efficient and has good performance. Our technique can be used on huge datasets. It produces a model that can be used to classify any video into a specific category very quickly.


2020 ◽  
Vol 15 (7) ◽  
pp. 750-757
Author(s):  
Jihong Wang ◽  
Yue Shi ◽  
Xiaodan Wang ◽  
Huiyou Chang

Background: At present, using computational methods to predict drug-target interactions (DTIs) is a very important step in the discovery of new drugs and in drug repositioning. The potential DTIs identified by machine learning methods can provide guidance for biochemical or clinical experiments. Objective: The goal of this article is to apply the latest network representation learning methods to drug-target prediction, improve model prediction capabilities, and promote new drug development. Methods: We use the large-scale information network embedding (LINE) method to extract network topology features of drugs, targets, diseases, etc., integrate features obtained from heterogeneous networks, construct binary classification samples, and use the random forest (RF) method to predict DTIs. Results: The experiments in this paper compare the common classifiers RF, LR, and SVM, as well as the typical network representation learning methods LINE, Node2Vec, and DeepWalk. The combined method LINE-RF achieves the best results, reaching an AUC of 0.9349 and an AUPR of 0.9016. Conclusion: The learning method based on LINE can effectively learn hidden features of drugs, targets, diseases, and other entities from the network topology. Combining features learned from multiple networks can enhance expressive ability. RF is an effective supervised learning method. Therefore, the LINE-RF combination is a widely applicable method.
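
The step "construct binary classification samples" can be pictured with the sketch below. It is an illustration under stated assumptions, not the authors' procedure: the embedding vectors are hypothetical stand-ins for learned network features, each sample is formed by concatenating a drug vector with a target vector, and unobserved pairs are randomly sampled as negatives (a common but imperfect choice, since an unobserved pair may still be a true interaction). The function name `build_samples` is hypothetical.

```python
import random


def build_samples(drug_vecs, target_vecs, known_pairs, seed=0):
    """Build (features, labels) for binary DTI classification.

    Positive samples are the known drug-target pairs; an equal number of
    negatives is drawn at random from pairs with no recorded interaction.
    """
    rng = random.Random(seed)
    known = set(known_pairs)
    positives = sorted(known)
    unknown = [
        (d, t)
        for d in sorted(drug_vecs)
        for t in sorted(target_vecs)
        if (d, t) not in known
    ]
    negatives = rng.sample(unknown, min(len(positives), len(unknown)))

    features, labels = [], []
    for label, pairs in ((1, positives), (0, negatives)):
        for d, t in pairs:
            # Concatenate the drug and target feature vectors into one sample.
            features.append(drug_vecs[d] + target_vecs[t])
            labels.append(label)
    return features, labels


# Hypothetical 2-dimensional embeddings, for illustration only.
drug_vecs = {"drug1": [0.1, 0.9], "drug2": [0.7, 0.3]}
target_vecs = {"t1": [0.2, 0.8], "t2": [0.6, 0.4], "t3": [0.5, 0.5]}
known_pairs = [("drug1", "t1"), ("drug2", "t2")]

X, y = build_samples(drug_vecs, target_vecs, known_pairs)
print(len(X), y)
```

The resulting `X` and `y` could then be passed to any binary classifier, such as a random forest.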


2021 ◽  
Vol 52 (1) ◽  
Author(s):  
Jobin Thomas ◽  
Ana Balseiro ◽  
Christian Gortázar ◽  
María A. Risalde

Animal tuberculosis (TB) is a multi-host disease caused by members of the Mycobacterium tuberculosis complex (MTC). Due to its impact on economy, sanitary standards of milk and meat industry, public health and conservation, TB control is an actively ongoing research subject. Several wildlife species are involved in the maintenance and transmission of TB, so that new approaches to wildlife TB diagnosis have gained relevance in recent years. Diagnosis is a paramount step for screening, epidemiological investigation, as well as for ensuring the success of control strategies such as vaccination trials. This is the first review that systematically addresses data available for the diagnosis of TB in wildlife following the Preferred Reporting Items of Systematic Reviews and Meta-Analyses (PRISMA) guidelines. The article also gives an overview of the factors related to host, environment, sampling, and diagnostic techniques which can affect test performance. After three screenings, 124 articles were considered for systematic review. Literature indicates that post-mortem examination and culture are useful methods for disease surveillance, but immunological diagnostic tests based on cellular and humoral immune response detection are gaining importance in wildlife TB diagnosis. Among them, serological tests are especially useful in wildlife because they are relatively inexpensive and easy to perform, facilitate large-scale surveillance and can be used both ante- and post-mortem. Currently available studies assessed test performance mostly in cervids, European badgers, wild suids and wild bovids. Research to improve diagnostic tests for wildlife TB diagnosis is still needed in order to reach accurate, rapid and cost-effective diagnostic techniques adequate to a broad range of target species and consistent over space and time to allow proper disease monitoring.


Cancers ◽  
2021 ◽  
Vol 13 (9) ◽  
pp. 2111
Author(s):  
Bo-Wei Zhao ◽  
Zhu-Hong You ◽  
Lun Hu ◽  
Zhen-Hao Guo ◽  
Lei Wang ◽  
...  

Identification of drug-target interactions (DTIs) is a significant step in the drug discovery or repositioning process. Compared with time-consuming and labor-intensive in vivo experimental methods, computational models can provide high-quality DTI candidates in an instant. In this study, we propose a novel method called LGDTI to predict DTIs based on large-scale graph representation learning. LGDTI can capture both the local and global structural information of the graph. Specifically, the first-order neighbor information of nodes is aggregated by a graph convolutional network (GCN), while the high-order neighbor information of nodes is learned by the graph embedding method DeepWalk. Finally, the two kinds of features are fed into a random forest classifier to train and predict potential DTIs. The results show that our method obtained an area under the receiver operating characteristic curve (AUROC) of 0.9455 and an area under the precision-recall curve (AUPR) of 0.9491 under 5-fold cross-validation. We also compare the presented method with some existing state-of-the-art methods. These results imply that LGDTI can efficiently and robustly capture undiscovered DTIs. Moreover, the proposed model is expected to bring new inspiration and provide novel perspectives to relevant researchers.
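
DeepWalk learns high-order neighbor information by sampling truncated random walks over the graph and treating them as sentences for a skip-gram model. The sketch below shows only the walk-sampling stage, on a hypothetical toy drug-target graph; it is not the LGDTI code, and the names `random_walk`, `generate_walks`, and `graph` are assumptions introduced for illustration.

```python
import random


def random_walk(graph, start, length, rng):
    """Sample one truncated random walk of at most `length` nodes."""
    walk = [start]
    while len(walk) < length:
        neighbors = graph[walk[-1]]
        if not neighbors:
            break  # dead end: stop the walk early
        walk.append(rng.choice(neighbors))
    return walk


def generate_walks(graph, walks_per_node, length, seed=0):
    """Start `walks_per_node` walks from every node, in shuffled order."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        nodes = list(graph)
        rng.shuffle(nodes)
        for node in nodes:
            walks.append(random_walk(graph, node, length, rng))
    return walks


# Hypothetical undirected drug-target graph stored as adjacency lists.
graph = {
    "drug1": ["target1", "target2"],
    "drug2": ["target2"],
    "target1": ["drug1"],
    "target2": ["drug1", "drug2"],
}

walks = generate_walks(graph, walks_per_node=2, length=5)
print(len(walks), walks[0])
```

Each walk exposes nodes several hops apart in the same context window, which is how the subsequent skip-gram training picks up structure beyond first-order neighbors.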


2010 ◽  
Vol 20-23 ◽  
pp. 700-705
Author(s):  
Tian Yuan ◽  
Shang Guan Wei ◽  
Zhi Zhong Lu

Multi-channel virtual reality simulation is a simulation technology that supports large scenes and a high degree of immersion, and offers better visualization. This paper studies a collaborative simulation technology for moving-target monitoring based on multi-channel technology. We first systematically study the mathematical modeling foundations of multi-channel technology. Then, based on a spatial model of the moving target and co-simulation technology, we select appropriate applications of multi-channel technology, build a laboratory simulation platform, and achieve a space-based, six-degree-of-freedom multi-channel simulation of moving-target monitoring. Experiments show that the multi-channel target monitoring co-simulation technology used in this paper is highly practical: combined with the moving-target spatial model and co-simulation technology, it supports objective observation and meets requirements such as large scale, realism, and immersion.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Smritikana Dutta ◽  
Anwesha Deb ◽  
Prasun Biswas ◽  
Sukanya Chakraborty ◽  
Suman Guha ◽  
...  

Bamboos, members of the family Poaceae, present many interesting features with respect to their fast and extended vegetative growth, unusual yet divergent flowering time across species, and the impact of sudden, large-scale flowering on forest ecology. However, not many studies have been conducted at the molecular level to characterize important genes that regulate vegetative and flowering habit in bamboo. In this study, two bamboo FD genes, BtFD1 and BtFD2, which are members of the florigen activation complex (FAC), have been identified by sequence and phylogenetic analyses. Sequence comparisons identified one important amino acid, which was located in the DNA-binding basic region and was altered between BtFD1 and BtFD2 (Ala146 of BtFD1 vs. Leu100 of BtFD2). Electrophoretic mobility shift assay revealed that this alteration resulted in ten times higher binding efficiency of BtFD1 than BtFD2 to its target ACGT motif present at the promoter of the APETALA1 gene. Expression analyses in different tissues and seasons indicated the involvement of BtFD1 in flower and vegetative development, while BtFD2 was expressed at very low levels in all the tissues and conditions studied. Finally, a tenfold increase of the AtAP1 transcript level in p35S::BtFD1 Arabidopsis plants compared to wild type confirms a positive regulatory role of BtFD1 in flowering. However, constitutive expression of BtFD1 led to dwarfism and an apparent reduction in the length of the flowering stalk and the number of flowers per plant, whereas no visible phenotype was observed for BtFD2 overexpression. This signifies that timely expression of BtFD1 may be critical to perform its programmed developmental role in planta.


2016 ◽  
Vol 42 (3) ◽  
pp. 391-419 ◽  
Author(s):  
Weiwei Sun ◽  
Xiaojun Wan

From the perspective of structural linguistics, we explore paradigmatic and syntagmatic lexical relations for Chinese POS tagging, an important and challenging task for Chinese language processing. Paradigmatic lexical relations are explicitly captured by word clustering on large-scale unlabeled data and are used to design new features to enhance a discriminative tagger. Syntagmatic lexical relations are implicitly captured by syntactic parsing in the constituency formalism, and are utilized via system combination. Experiments on the Penn Chinese Treebank demonstrate the importance of both paradigmatic and syntagmatic relations. Our linguistically motivated, hybrid approaches yield a relative error reduction of 18% in total over state-of-the-art baselines. Despite their effectiveness in boosting accuracy, computationally expensive parsers make hybrid systems inappropriate for many realistic NLP applications. In this article, we are also concerned with improving tagging efficiency at test time. In particular, we explore unlabeled data to transfer the predictive power of hybrid models to simple sequence models. Specifically, hybrid systems are utilized to create large-scale pseudo training data for cheap models. Experimental results illustrate that the re-compiled models not only achieve high accuracy with respect to per-token classification, but also serve well as a front end to a parser.

