A Similarity Search Using Molecular Topological Graphs

2009 ◽  
Vol 2009 ◽  
pp. 1-8 ◽  
Author(s):  
Yoshifumi Fukunishi ◽  
Haruki Nakamura

A molecular similarity measure has been developed using molecular topological graphs and atomic partial charges. Two kinds of topological graphs were used. One is the ordinary adjacency matrix and the other is a matrix that represents the minimum path length between two atoms of the molecule. The ordinary adjacency matrix is suited to comparing the local structures of molecules, such as functional groups, and the other matrix is suited to comparing the global structures of molecules. The combination of these two matrices gave a similarity measure. This method was applied to in silico drug screening, and the results showed that it was effective as a similarity measure.
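The two matrices named in this abstract can be illustrated with a small sketch. This is not the authors' similarity measure (which also uses atomic partial charges and a way of combining the matrices that the abstract does not specify); it only builds an adjacency matrix and a minimum-path-length matrix for a toy molecular graph. The function names and the butane skeleton input are assumptions made for illustration.

```python
from collections import deque

def adjacency_matrix(n_atoms, bonds):
    """Return the 0/1 adjacency matrix for atoms numbered 0..n_atoms-1."""
    adj = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        adj[i][j] = adj[j][i] = 1
    return adj

def path_length_matrix(adj):
    """Return the minimum number of bonds between every pair of atoms.

    Runs one breadth-first search per atom; -1 marks unreachable pairs.
    """
    n = len(adj)
    dist = [[-1] * n for _ in range(n)]
    for start in range(n):
        dist[start][start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if adj[u][v] and dist[start][v] == -1:
                    dist[start][v] = dist[start][u] + 1
                    queue.append(v)
    return dist

# Hypothetical toy input: heavy-atom skeleton of n-butane, C0-C1-C2-C3.
bonds = [(0, 1), (1, 2), (2, 3)]
adj = adjacency_matrix(4, bonds)
dist = path_length_matrix(adj)
```

The adjacency matrix only records direct bonds (local structure), while the path-length matrix also relates atoms at opposite ends of the chain (global structure).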

Author(s):  
Jung-Hoon Cho ◽  
Seung Woo Ham ◽  
Dong-Kyu Kim

With the growth of bike-sharing systems, demand forecasting has become an important problem for their operation. This study aims to develop a novel prediction model that enhances the accuracy of peak hourly demand prediction. A spatiotemporal graph convolutional network (STGCN) is constructed to consider both spatial and temporal features. One of the model’s essential steps is determining the main components of the adjacency matrix and the node feature matrix. To achieve this, 131 days of data from the bike-sharing system in Seoul are used, and experiments are conducted on models with various adjacency matrices and node feature matrices, including public transit usage. The results indicate that the STGCN models that reflect the previous demand pattern in the adjacency matrix show outstanding performance in predicting demand compared with the other models. The results also show that the model that includes bus boarding and alighting records is more accurate than the model that contains subway records, suggesting that buses have a greater connection to bike-sharing than the subway. The proposed STGCN with public transit data contributes to the alleviation of unmet demand by enhancing the accuracy in predicting peak demand.


2021 ◽  
Author(s):  
Yuxiang Chen ◽  
Chuanlei Liu ◽  
Yang An ◽  
Yue Lou ◽  
Yang Zhao ◽  
...  

Machine learning and computer-aided approaches significantly accelerate molecular design and discovery in scientific and industrial fields that increasingly rely on data science for efficiency. The typical method used is supervised learning, which needs huge datasets. Semi-supervised machine learning approaches are effective for training on unlabeled data with improved modeling performance, but they are limited by the accumulation of prediction errors. Here, to screen solvents for the removal of methyl mercaptan, a type of organosulfur impurity in natural gas, we constructed a computational framework by integrating molecular similarity search and active learning methods, namely, molecular active selection machine learning (MASML). This new model framework identifies the optimal molecule set by molecular similarity search and iterative addition to the training dataset. Among all 126,068 compounds in the initial dataset, 3 molecules were identified as promising for methyl mercaptan (MeSH) capture: benzylamine (BZA), p-methoxybenzylamine (PZM), and N,N-diethyltrimethylenediamine (DEAPA). Further experiments confirmed the effectiveness of our modeling framework in efficient molecular design and identification for capturing methyl mercaptan, in which DEAPA presents a Henry's law constant 89.4% lower than that of methyl diethanolamine (MDEA).


2020 ◽  
pp. 016555152093949
Author(s):  
Wenyu Zhang ◽  
Shunshun Shi ◽  
Xiaoling Huang ◽  
Shuai Zhang ◽  
Peijia Yao ◽  
...  

In the research on interdisciplinarity (RID), measures for evaluating the interdisciplinarity of scientific entities (e.g., papers, authors, journals or research areas) have long been proposed. Author interdisciplinarity is very different from the other types of interdisciplinarity because of the complex interpersonal relationships between connected authors. However, previous work has failed to uncover the distinctiveness of author interdisciplinarity and has regarded it as equivalent to other types of interdisciplinarity. In this work, an extended Rao–Stirling diversity measure is proposed, which incorporates the co-author network and a network similarity measure to specifically evaluate author interdisciplinarity. Moreover, betweenness centrality is used to improve the network similarity measure, because of its intrinsic advantage of expressing how an entity loads on different factors in a network, which is highly in line with the characteristic of interdisciplinarity. An experiment on papers about Public Administration in the Web of Science is conducted; based on the final results, a deeper investigation into typical authors is performed. The work proposes a novel idea for measuring author interdisciplinarity, which can promote the study of interdisciplinarity measurement in RID.
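For orientation, the standard (unextended) Rao–Stirling diversity can be sketched as follows. This is not the extended measure proposed in the abstract, whose co-author network and betweenness terms are not specified here. The sketch assumes the common convention of summing p_i * p_j * d_ij over ordered pairs i ≠ j, where p_i is the share of an entity's output in discipline i and d_ij is a distance between disciplines; some authors sum over unordered pairs instead, which halves the value. The function name and the inputs are illustrative.

```python
def rao_stirling(proportions, distance):
    """Standard Rao-Stirling diversity: sum over i != j of p_i * p_j * d_ij.

    proportions: list of shares p_i (assumed to sum to 1).
    distance: square matrix of pairwise discipline distances d_ij.
    """
    n = len(proportions)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += proportions[i] * proportions[j] * distance[i][j]
    return total

# Hypothetical example: two maximally distant disciplines.
d = [[0.0, 1.0], [1.0, 0.0]]
balanced = rao_stirling([0.5, 0.5], d)     # output split evenly
specialized = rao_stirling([1.0, 0.0], d)  # output in one discipline only
```

An evenly split profile scores higher than a fully specialized one, which is the behavior a diversity measure of this kind is meant to capture.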


2014 ◽  
Vol 28 (17) ◽  
pp. 1450111 ◽  
Author(s):  
Zikai Wu ◽  
Baoyu Hou ◽  
Hongjuan Zhang ◽  
Feng Jin

Deterministic network models have been attractive media for discussing the dependence of dynamical processes on network structural features. On the other hand, the heterogeneity of weights affects dynamical processes taking place on networks. In this paper, we present a family of weighted expanded Koch networks based on Koch networks. They originate from an r-polygon, and each node of the current generation produces m r-polygons that include the node and whose weighted edges are scaled by a factor w in the subsequent evolutionary step. We derive closed-form expressions for the average weighted shortest path length (AWSP). In large networks, AWSP stays bounded as the network order grows (0 < w < 1). Then, we focus on a special random walk and trapping issue on the networks. In more detail, we calculate exactly the average receiving time (ART). ART exhibits a sub-linear dependence on network order (0 < w < 1), which implies that nontrivial weighted expanded Koch networks are more efficient than un-weighted expanded Koch networks in receiving information. Besides, the efficiency of receiving information at hub nodes also depends on the parameters m and r. These findings may pave the way for controlling information transportation on general weighted networks.


2020 ◽  
Author(s):  
Janani Durairaj ◽  
Mehmet Akdel ◽  
Dick de Ridder ◽  
Aalt DJ van Dijk

Motivation: As the number of experimentally solved protein structures rises, it becomes increasingly appealing to use structural information for predictive tasks involving proteins. Due to the large variation in protein sizes, folds, and topologies, an attractive approach is to embed protein structures into fixed-length vectors, which can be used in machine learning algorithms aimed at predicting and understanding functional and physical properties. Many existing embedding approaches are alignment-based, which is both time-consuming and ineffective for distantly related proteins. On the other hand, library- or model-based approaches depend on a small library of fragments or require the use of a trained model, both of which may not generalize well. Results: We present Geometricus, a novel and universally applicable approach to embedding proteins in a fixed-dimensional space. The approach is fast, accurate, and interpretable. Geometricus uses a set of 3D moment invariants to discretize fragments of protein structures into shape-mers, which are then counted to describe the full structure as a vector of counts. We demonstrate the applicability of this approach in various tasks, ranging from fast structure similarity search, unsupervised clustering, and structure classification across proteins from different superfamilies as well as within the same family. Availability: Python code available at https://git.wur.nl/durai001/[email protected], [email protected]


2016 ◽  
Author(s):  
Lee Naish

Identifying patterns and associations in data is fundamental to discovery in science. This work investigates a very simple instance of the problem, where each data point consists of a vector of binary attributes, and attributes are treated equally. For example, each data point may correspond to a person and the attributes may be their sex, whether they smoke cigarettes, whether they have been diagnosed with lung cancer, etc. Measuring similarity of attributes in the data is equivalent to measuring similarity of sets - an attribute can be mapped to the set of data points which have the attribute. Furthermore, there is one identified base set (or attribute) and only similarity to that set is considered - the other sets are just ranked according to how similar they are to the base set. For example, if the base set is lung cancer sufferers, the set of smokers may well be high in the ranking. Identifying set similarity or correlation has many uses and is often the first step in determining causality. Set similarity is also the basis for comparing binary classifiers such as diagnostic tests for any data set. More than a hundred set similarity measures have been proposed in the literature, but there is very little understanding of how best to choose a similarity measure for a given domain. This work discusses numerous properties that similarity measures can have, weakening some previously proposed definitions so they are no longer incompatible, and identifying important forms of symmetry which have not previously been considered. It defines ordering relations over similarity measures and shows how some properties of a domain can be used to help choose a similarity measure which will perform well for that domain.
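The ranking setup described in this abstract (one base set, other attribute sets ranked by similarity to it) can be sketched with one well-known measure, the Jaccard index. The abstract does not single out Jaccard; it is used here only as a familiar example among the many measures mentioned, and the data, attribute names, and function names are invented for illustration.

```python
def jaccard(a, b):
    """Jaccard index |A & B| / |A | B|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)

def rank_by_similarity(base, candidates, measure=jaccard):
    """Return attribute names ordered from most to least similar to the base set."""
    return sorted(candidates, key=lambda name: measure(base, candidates[name]), reverse=True)

# Hypothetical data: each set holds the IDs of the people who have the attribute.
lung_cancer = {1, 2, 3, 4}
attributes = {
    "male": {1, 5, 6, 7, 8},
    "smoker": {1, 2, 3, 5},
    "left_handed": {9},
}
ranking = rank_by_similarity(lung_cancer, attributes)
```

Passing a different function as `measure` changes the ranking rule without touching the rest of the code, which mirrors the abstract's point that the choice of measure is the open question.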


Order ◽  
2020 ◽  
Author(s):  
Gábor Czédli ◽  
Robert C. Powers ◽  
Jeremy M. White

Let $L$ be a lattice of finite length and let $d$ denote the minimum path length metric on the covering graph of $L$. For any $\xi =(x_{1},\dots ,x_{k})\in L^{k}$, an element $y$ belonging to $L$ is called a median of $\xi$ if the sum $d(y,x_{1}) + \dots + d(y,x_{k})$ is minimal. The lattice $L$ satisfies the $c_{1}$-median property if, for any $\xi =(x_{1},\dots ,x_{k})\in L^{k}$ and for any median $y$ of $\xi$, $y\leq x_{1}\vee \dots \vee x_{k}$. Our main theorem asserts that if $L$ is an upper semimodular lattice of finite length and the breadth of $L$ is less than or equal to 2, then $L$ satisfies the $c_{1}$-median property. Also, we give a construction that yields semimodular lattices, and we use a particular case of this construction to prove that our theorem is sharp in the sense that 2 cannot be replaced by 3.
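The definition of a median in this abstract can be made concrete with a brute-force sketch: compute path-length distances in the covering graph by breadth-first search and keep the elements with the smallest distance sum. The example lattice (the four-element Boolean lattice with elements named "0", "a", "b", "1") and the function names are assumptions for illustration; the code only finds medians and does not check the lattice-order condition from the theorem.

```python
from collections import deque

def bfs_distances(graph, source):
    """Return the minimum path length from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def medians(graph, profile):
    """Return the set of vertices y minimizing the sum of d(y, x) over x in profile."""
    totals = {}
    for y in graph:
        dist = bfs_distances(graph, y)
        totals[y] = sum(dist[x] for x in profile)
    best = min(totals.values())
    return {y for y, total in totals.items() if total == best}

# Covering graph of the Boolean lattice {0, a, b, 1}: 0 < a < 1 and 0 < b < 1.
covering_graph = {
    "0": ["a", "b"],
    "a": ["0", "1"],
    "b": ["0", "1"],
    "1": ["a", "b"],
}
result = medians(covering_graph, ("0", "a", "b"))
```

For the profile ("0", "a", "b") the only median is "0", which lies below the join of the profile, as the property in the abstract requires.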


2013 ◽  
Vol 05 (03) ◽  
pp. 1350010
Author(s):  
LAURENT LYAUDET ◽  
PAULIN MELATAGIA YONTA ◽  
MAURICE TCHUENTE ◽  
RENÉ NDOUNDAM

Given an undirected graph G = (V, E) with n vertices and a positive length w(e) on each edge e ∈ E, we consider Minimum Average Distance (MAD) spanning trees, i.e., trees that minimize the path length summed over all pairs of vertices. One of the first results on this problem is due to Wong, who showed in 1980 that a Distance Preserving (DP) spanning tree rooted at the median of G is a 2-approximate solution. On the other hand, Dankelmann exhibited in 2000 a class of graphs where no MAD spanning tree is distance preserving from a vertex. We establish here a new relation between MAD and DP trees in the particular case where the lengths are integers. We show that in a MAD spanning tree of G, each subtree H′ = (V′, E′) consisting of a vertex [Formula: see text] and the union of branches of [Formula: see text] that are each of size less than or equal to [Formula: see text], where w+ is the maximum edge-length in G, is a distance preserving spanning tree of the subgraph of G induced by V′.
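The objective that a MAD spanning tree minimizes, the path length summed over all pairs of vertices, can be evaluated for a given tree with the sketch below. It does not search for an optimal tree and is unrelated to the paper's proof; the function name and the two example trees (both spanning trees of the complete graph on four vertices with unit lengths) are assumptions for illustration. Integer edge lengths are assumed, matching the case studied in the abstract.

```python
def tree_pairwise_distance_sum(n, tree_edges):
    """Sum of path lengths over all unordered vertex pairs of a spanning tree.

    n: number of vertices, labeled 0..n-1.
    tree_edges: list of (u, v, length) with integer lengths forming a spanning tree.
    """
    adj = {v: [] for v in range(n)}
    for u, v, w in tree_edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    total = 0
    for source in range(n):
        # In a tree the path between two vertices is unique, so a plain
        # traversal yields exact distances without a priority queue.
        dist = {source: 0}
        stack = [source]
        while stack:
            u = stack.pop()
            for v, w in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + w
                    stack.append(v)
        total += sum(dist.values())
    return total // 2  # every unordered pair was counted twice

# Two hypothetical spanning trees of the unit-length complete graph K4.
path_tree = [(0, 1, 1), (1, 2, 1), (2, 3, 1)]
star_tree = [(0, 1, 1), (0, 2, 1), (0, 3, 1)]
path_cost = tree_pairwise_distance_sum(4, path_tree)
star_cost = tree_pairwise_distance_sum(4, star_tree)
```

Here the star has the smaller total, so of these two candidates it is the better one under the MAD objective.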

