On the Use of Real-World Datasets for Reaction Yield Prediction

Author(s):  
Mandana Saebi ◽  
Bozhao Nan ◽  
John Herr ◽  
Jessica Wahlers ◽  
Zhichun Guo ◽  
...  

The lack of publicly available, large, and unbiased datasets is a key bottleneck for the application of machine learning (ML) methods in synthetic chemistry. Data from electronic laboratory notebooks (ELNs) could provide less biased, large datasets, but no such datasets have been made publicly available. The first real-world dataset from the ELNs of a large pharmaceutical company is disclosed and its relationship to high-throughput experimentation (HTE) datasets is described. For chemical yield prediction, a key task in chemical synthesis, an attributed graph neural network (AGNN) performs as well as or better than the best previous models on two HTE datasets for the Suzuki and Buchwald-Hartwig reactions. However, training the AGNN on the ELN dataset does not lead to a predictive model. The implications of using ELN data for training ML-based models are discussed in the context of yield prediction.
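
For readers unfamiliar with graph-based yield prediction, the sketch below shows the general shape of such a model: a small message-passing network over an attributed molecular graph that outputs a fractional yield. It is not the paper's AGNN; the architecture, names, and toy inputs are illustrative assumptions only (PyTorch).

```python
# Minimal sketch of a message-passing graph regressor for yield prediction.
# This is NOT the paper's AGNN; model and variable names are illustrative only.
import torch
import torch.nn as nn

class SimpleGraphRegressor(nn.Module):
    def __init__(self, node_feat_dim, hidden_dim=64, n_layers=3):
        super().__init__()
        self.embed = nn.Linear(node_feat_dim, hidden_dim)
        self.layers = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(n_layers)]
        )
        self.readout = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1)
        )

    def forward(self, node_feats, adj):
        # node_feats: (n_atoms, node_feat_dim); adj: (n_atoms, n_atoms) adjacency
        h = torch.relu(self.embed(node_feats))
        for layer in self.layers:
            # mean aggregation over neighbours, then a learned update
            neigh = adj @ h / adj.sum(dim=1, keepdim=True).clamp(min=1)
            h = torch.relu(layer(h + neigh))
        graph_emb = h.mean(dim=0)                       # simple mean-pool readout
        return torch.sigmoid(self.readout(graph_emb))   # fractional yield in [0, 1]

# Toy usage: a 5-atom "molecule" with 8-dimensional atom features.
model = SimpleGraphRegressor(node_feat_dim=8)
x = torch.randn(5, 8)
adj = (torch.rand(5, 5) > 0.5).float()
print(model(x, adj))
```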

Author(s):  
Nan Yan ◽  
Subin Huang ◽  
Chao Kong

Discovering entity synonymous relations is an important task for many entity-based applications. Existing entity synonymous relation extraction approaches are mainly based on lexical patterns or distributional corpus-level statistics, ignoring the context semantics between entities. For example, the contexts around "apple" determine whether "apple" refers to a kind of fruit or to Apple Inc. In this paper, an entity synonymous relation extraction approach is proposed using context-aware permutation invariance. Specifically, a triplet network is used to obtain the permutation invariance between entities and to learn whether two given entities possess a synonymous relation. To capture more synonym-related features, relational context semantics and entity representations are integrated into the triplet network, which improves the performance of extracting entity synonymous relations. The proposed approach is evaluated on three real-world datasets. Experimental results demonstrate that it performs better than the compared approaches on the entity synonymous relation extraction task.
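
A minimal sketch of the triplet-network idea, assuming PyTorch: an encoder maps entity representations into a space where synonymous pairs are pulled together and non-synonymous pairs pushed apart via a triplet margin loss. The paper's model additionally fuses relational context semantics, which is not shown here.

```python
# Toy triplet network for entity synonym detection; the encoder and data are
# placeholders, not the paper's context-aware architecture.
import torch
import torch.nn as nn

class EntityEncoder(nn.Module):
    def __init__(self, emb_dim=128, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, hidden_dim)
        )

    def forward(self, x):
        return self.net(x)

encoder = EntityEncoder()
loss_fn = nn.TripletMarginLoss(margin=1.0)

# anchor/positive are synonymous entities, negative is not; random toys here.
anchor, positive, negative = (torch.randn(32, 128) for _ in range(3))
loss = loss_fn(encoder(anchor), encoder(positive), encoder(negative))
loss.backward()  # gradients pull synonyms together, push non-synonyms apart
print(loss.item())
```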


Author(s):  
Amanuel Fekade Tadesse ◽  
Nishani Vincent

This advisory case is designed to develop data analytics skills using multiple large real-world datasets based on eXtensible Business Reporting Language (XBRL). This case can also be used to introduce students to XBRL concepts such as extension taxonomies. Students are asked to recommend an XBRL preparation software for a hypothetical company (ViewDrive) that is adopting XBRL to satisfy the financial report filing requirements imposed by the Securities and Exchange Commission (SEC). Students perform data cleansing (extract, transform, load) procedures to prepare large datasets for data analytics. Students are encouraged to think critically, specify assumptions before performing data analytics (using analytic software such as Tableau), and generate visualizations that support their written recommendations. The case is easy to implement, promotes active learning, and has received favorable student and instructor feedback. This case can be used to introduce technology and data analytics topics into the accounting curriculum to help satisfy AACSB’s objectives.
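
A hypothetical extract-transform-load step of the kind the case asks students to perform, sketched with pandas; the file and column names are invented for illustration and do not come from the case materials.

```python
# Hypothetical ETL sketch for an XBRL fact table, of the kind students might
# run before visualising the data in Tableau. Names are illustrative only.
import pandas as pd

# Extract: load raw facts exported from an XBRL filing.
facts = pd.read_csv("xbrl_facts.csv")  # assumed columns: tag, value, unit, period, cik

# Transform: drop malformed rows, normalise types, keep USD monetary facts.
facts = facts.dropna(subset=["tag", "value"])
facts["value"] = pd.to_numeric(facts["value"], errors="coerce")
facts = facts[facts["unit"] == "USD"].dropna(subset=["value"])

# Load: aggregate by reporting tag, ready for a visualisation tool.
summary = facts.groupby("tag")["value"].agg(["count", "mean", "sum"])
summary.to_csv("xbrl_summary.csv")
```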


2015 ◽  
Vol 12 (1) ◽  
pp. 62-74 ◽  
Author(s):  
Zhi Li ◽  
Jian Cao ◽  
Qi Gu

The number of services on the Internet is growing rapidly, so selecting appropriate services is becoming harder for most users and service recommendation is widely needed. Besides functionality, quality of service (QoS) is also an important factor to consider when making recommendations, but QoS changes over time. To address these challenges, this paper proposes a temporal-aware QoS-based service recommendation framework together with a prediction algorithm based on Tucker decomposition. The authors verify the method on real-world datasets, with results better than those of traditional methods.
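
A minimal sketch of the core idea, assuming NumPy and the tensorly library: decompose a user x service x time QoS tensor with Tucker decomposition and read predictions off the low-rank reconstruction. The paper's algorithm is more elaborate; the ranks and tensor sizes here are arbitrary.

```python
# Tucker decomposition of a (user x service x time) QoS tensor. In a real
# dataset, missing QoS entries would be predicted from the reconstruction.
import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker

rng = np.random.default_rng(0)
qos = rng.random((20, 30, 10))             # toy user x service x time tensor

core, factors = tucker(tl.tensor(qos), rank=[5, 5, 3])
U, S, T = factors                           # user, service, and time factors

# Reconstruct the tensor from the Tucker factors.
approx = np.einsum("abc,ia,jb,kc->ijk", tl.to_numpy(core),
                   tl.to_numpy(U), tl.to_numpy(S), tl.to_numpy(T))
print(np.abs(qos - approx).mean())          # mean absolute reconstruction error
```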


2020 ◽  
Vol 2020 (10) ◽  
pp. 182-1-182-8
Author(s):  
Zhao Gao ◽  
Eran Edirisinghe ◽  
Slava Chesnokov

Over-exposure happens often in daily-life photography because the range of light in a scene far exceeds the limited dynamic range of current imaging sensors. Correcting over-exposure aims to recover the fine details from the input. Most existing methods are based on manual image pixel manipulation and are therefore tedious and time-consuming. In this paper, we present the first convolutional neural network (CNN) capable of inferring a photo-realistic natural image from a single over-exposed photograph. To achieve this, we propose a simple and lightweight Over-Exposure Correction CNN, namely OEC-cnn, and construct a synthesized dataset that covers various scenes and exposure rates to facilitate training. By doing so, we effectively replace manual fixing operations with an end-to-end automatic correction process. Experiments on both synthesized and real-world datasets demonstrate that the proposed approach performs significantly better than existing methods, and its simplicity and robustness make it a very useful tool for practical over-exposure correction. Our code and synthesized dataset will be made publicly available.
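
The sketch below is a generic image-to-image CNN for exposure correction in PyTorch, not the OEC-cnn architecture itself; it only illustrates how an end-to-end correction network replaces manual pixel manipulation.

```python
# Minimal image-to-image CNN sketch for exposure correction (not OEC-cnn).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyExposureCorrector(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 3, 3, padding=1), nn.Sigmoid(),  # RGB in [0, 1]
        )

    def forward(self, x):
        return self.net(x)

model = TinyExposureCorrector()
overexposed = torch.rand(1, 3, 64, 64)   # toy over-exposed input
corrected = model(overexposed)
# In real training the target would be the well-exposed ground-truth image;
# the input is reused here only to keep the toy self-contained.
loss = F.l1_loss(corrected, overexposed)
loss.backward()
print(corrected.shape)
```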


2016 ◽  
Vol 2016 (4) ◽  
pp. 470-487 ◽  
Author(s):  
Gábor György Gulyás ◽  
Gergely Acs ◽  
Claude Castelluccia

Abstract Several recent studies have demonstrated that people exhibit a high degree of behavioural uniqueness. This has serious privacy implications, as most individuals become increasingly re-identifiable in large datasets, or can be tracked while browsing the web, using only a few of their attributes, called their fingerprints. Often, the success of these attacks depends on explicit constraints on the number of attributes learnable about individuals, i.e., the size of their fingerprints. These constraints can be budgetary as well as technical constraints imposed by the data holder. For instance, Apple restricts the number of applications that can be queried by another application on iOS in order to mitigate the potential privacy threat of leaking the list of installed applications on a device. In this work, we address the problem of identifying the attributes (e.g., smartphone applications) that can serve as a fingerprint of users given constraints on the size of the fingerprint. We give the best fingerprinting algorithms in general and evaluate their effectiveness on several real-world datasets. Our results show that current privacy guards limiting the number of attributes that can be queried about individuals are insufficient to mitigate privacy risks in many practical cases.
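
A toy greedy sketch of the underlying selection problem: given a budget on fingerprint size, repeatedly pick the attribute that induces the most distinct fingerprints across users. The paper's algorithms and analysis are more refined; the data and budget below are invented.

```python
# Greedy fingerprint attribute selection under a size budget (toy version).
import random

def greedy_fingerprint(users, budget):
    """users: list of attribute sets (e.g. installed apps); budget: max attributes."""
    all_attrs = set().union(*users)
    chosen = []
    for _ in range(budget):
        def distinct(attrs):
            # number of distinct fingerprints induced by this attribute set
            return len({frozenset(u & set(attrs)) for u in users})
        best = max(all_attrs - set(chosen), key=lambda a: distinct(chosen + [a]))
        chosen.append(best)
    return chosen

random.seed(0)
apps = [f"app{i}" for i in range(50)]
users = [set(random.sample(apps, 8)) for _ in range(200)]
fp = greedy_fingerprint(users, budget=5)
uniq = len({frozenset(u & set(fp)) for u in users})
print(fp, f"{uniq}/200 distinct fingerprints")
```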


2019 ◽  
Vol 6 (1) ◽  
Author(s):  
Vincenza Carchiolo ◽  
Marco Grassia ◽  
Alessandro Longheu ◽  
Michele Malgeri ◽  
Giuseppe Mangioni

Abstract Many systems are today modelled as complex networks, since this representation has proven to be an effective approach for understanding and controlling many real-world phenomena. A significant area of interest and research is network robustness, which explores to what extent a network keeps working when failures occur in its structure and how disruptions can be avoided. In this paper, we introduce the idea of exploiting long-range links to improve the robustness of Scale-Free (SF) networks. Several experiments are carried out by attacking the networks before and after the addition of links between the farthest nodes, and the results show that this approach improves the correct functioning of SF networks more effectively than other commonly used strategies.
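
A small networkx experiment, under assumed parameters, illustrating the idea: add edges between the farthest node pairs of a Barabási-Albert graph and compare the size of the largest component that survives a targeted highest-degree attack.

```python
# Toy robustness comparison before/after adding long-range links.
# Graph size, attack fraction, and number of added links are arbitrary.
import networkx as nx

def attack_robustness(g, fraction=0.1):
    """Remove the top-degree nodes and return the largest remaining component size."""
    g = g.copy()
    by_degree = sorted(g.degree, key=lambda kv: kv[1], reverse=True)
    g.remove_nodes_from([n for n, _ in by_degree[: int(fraction * len(by_degree))]])
    return max((len(c) for c in nx.connected_components(g)), default=0)

def add_long_range_links(g, k=20):
    """Connect the k node pairs with the largest shortest-path distance."""
    g = g.copy()
    dist = dict(nx.all_pairs_shortest_path_length(g))
    pairs = sorted(((dist[u][v], u, v) for u in g for v in g if u < v), reverse=True)
    g.add_edges_from((u, v) for _, u, v in pairs[:k])
    return g

sf = nx.barabasi_albert_graph(300, 2, seed=1)
print("before:", attack_robustness(sf))
print("after :", attack_robustness(add_long_range_links(sf)))
```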


2021 ◽  
Vol 21 (3) ◽  
pp. 1-17
Author(s):  
Wu Chen ◽  
Yong Yu ◽  
Keke Gai ◽  
Jiamou Liu ◽  
Kim-Kwang Raymond Choo

In existing ensemble learning algorithms (e.g., random forest), each base learner's model needs the entire dataset for sampling and training. However, this may not be practical in many real-world applications, and it incurs additional computational costs. To achieve better efficiency, we propose a decentralized framework: Multi-Agent Ensemble. The framework leverages edge computing to facilitate ensemble learning, balancing access restrictions (small sub-datasets) against accuracy enhancement. Specifically, network edge nodes (learners) are used for classification and prediction in our framework. Data is distributed to multiple base learners, which exchange data via an interaction mechanism to achieve improved prediction. The proposed approach relies on this distributed training model rather than conventional centralized learning. Findings from experimental evaluations using 20 real-world datasets suggest that Multi-Agent Ensemble outperforms other ensemble approaches in terms of accuracy even though the base learners require fewer samples (i.e., a significant reduction in computation costs).
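
A scikit-learn sketch of the basic setting, assuming disjoint sub-datasets and simple majority voting; the framework's edge-node interaction mechanism is not modelled here.

```python
# Each base learner sees only a small disjoint shard; predictions are combined
# by majority vote. This is an illustration, not the Multi-Agent Ensemble itself.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

n_learners = 10
learners = []
for shard_X, shard_y in zip(np.array_split(X_train, n_learners),
                            np.array_split(y_train, n_learners)):
    learners.append(DecisionTreeClassifier(random_state=0).fit(shard_X, shard_y))

# Majority vote over the base learners' predictions (binary labels 0/1).
votes = np.stack([m.predict(X_test) for m in learners])
majority = (votes.mean(axis=0) >= 0.5).astype(int)
print("ensemble accuracy:", (majority == y_test).mean())
```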


Data ◽  
2020 ◽  
Vol 6 (1) ◽  
pp. 1
Author(s):  
Ahmed Elmogy ◽  
Hamada Rizk ◽  
Amany M. Sarhan

In data mining, outlier detection is a major challenge, and it plays an important role in many applications such as medical data analysis, image processing, fraud detection, and intrusion detection. An extensive variety of clustering-based approaches have been developed to detect outliers. However, they are by nature time-consuming, which restricts their use in real-time applications. Furthermore, outlier detection requests are handled one at a time, meaning that each request is initiated individually with a particular set of parameters. In this paper, the first on-the-fly clustering-based outlier detection framework, On the Fly Clustering Based Outlier Detection (OFCOD), is presented. OFCOD enables analysts to detect outliers upon request in a timely manner, even within huge datasets. The proposed framework has been tested and evaluated using two real-world datasets with different features and applications: one with 699 records and another with five million records. The experimental results show that the proposed framework outperforms other existing approaches across several evaluation metrics.
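
A minimal clustering-based outlier detection sketch (not OFCOD itself), assuming scikit-learn: cluster with k-means, then flag points unusually far from their own centroid.

```python
# Cluster-then-flag outlier detection on toy 2-D data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
inliers = rng.normal(0, 1, size=(500, 2))
outliers = rng.uniform(-8, 8, size=(10, 2))
X = np.vstack([inliers, outliers])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
dist_to_centroid = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
threshold = dist_to_centroid.mean() + 3 * dist_to_centroid.std()
flagged = np.where(dist_to_centroid > threshold)[0]
print("flagged points:", flagged)
```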


2021 ◽  
Vol 15 (6) ◽  
pp. 1-20
Author(s):  
Dongsheng Li ◽  
Haodong Liu ◽  
Chao Chen ◽  
Yingying Zhao ◽  
Stephen M. Chu ◽  
...  

In collaborative filtering (CF) algorithms, the optimal models are usually learned by globally minimizing the empirical risk averaged over all the observed data. However, the global models are often obtained via a performance tradeoff among users/items, i.e., not all users/items are perfectly fitted by the global models due to the hard non-convex optimization problems in CF algorithms. Ensemble learning can address this issue by learning multiple diverse models, but it usually suffers from efficiency issues on large datasets or with complex algorithms. In this article, we keep the intermediate models obtained during global model learning as snapshot models, and then adaptively combine the snapshot models for individual user-item pairs using a memory network-based method. Empirical studies on three real-world datasets show that the proposed method can consistently and significantly improve accuracy (by up to 15.9% relative) when applied to a variety of existing collaborative filtering methods.
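
A sketch of the snapshot idea for a simple matrix-factorisation CF model in NumPy: keep the factor matrices from intermediate epochs and combine their predictions. Plain averaging stands in for the paper's adaptive, memory-network-based combination, and all sizes and hyperparameters are toy choices.

```python
# Snapshot ensembling for matrix-factorisation CF (toy, full-gradient updates).
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 50, 40, 8
R = rng.integers(1, 6, size=(n_users, n_items)).astype(float)   # toy ratings

U = rng.normal(0, 0.1, (n_users, k))
V = rng.normal(0, 0.1, (n_items, k))
lr, reg, snapshots = 0.003, 0.1, []

for epoch in range(90):
    err = R - U @ V.T
    U += lr * (err @ V - reg * U)
    V += lr * (err.T @ U - reg * V)
    if epoch % 30 == 29:                  # keep a snapshot every 30 epochs
        snapshots.append((U.copy(), V.copy()))

# Combine snapshots; the paper weights them adaptively per user-item pair.
pred = np.mean([u @ v.T for u, v in snapshots], axis=0)
print("RMSE of snapshot average:", np.sqrt(((R - pred) ** 2).mean()))
```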


Entropy ◽  
2021 ◽  
Vol 23 (6) ◽  
pp. 680
Author(s):  
Hanyang Lin ◽  
Yongzhao Zhan ◽  
Zizheng Zhao ◽  
Yuzhong Chen ◽  
Chen Dong

There is a wealth of information in real-world social networks. In addition to the topology information, the vertices or edges of a social network often have attributes, and many vertices belong to several communities simultaneously, i.e., they are overlapping vertices. It is challenging to fully utilize the additional attribute information to detect overlapping communities. In this paper, we first propose an overlapping community detection algorithm based on an augmented attribute graph. An improved weight adjustment strategy for attributes is embedded in the algorithm to help detect overlapping communities more accurately. Second, we enhance the algorithm to automatically determine the number of communities by a node-density-based fuzzy k-medoids process. Extensive experiments on both synthetic and real-world datasets demonstrate that the proposed algorithms can effectively detect overlapping communities with fewer parameters compared to the baseline methods.
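
A sketch of a fuzzy k-medoids step over a distance matrix that mixes structural and attribute distances, with a membership threshold marking overlapping vertices; the mixing weights and threshold are illustrative assumptions, not the paper's weight-adjustment strategy (NumPy and networkx).

```python
# Fuzzy k-medoids over a combined structural/attribute distance matrix; nodes
# with high membership in more than one community are treated as overlapping.
import numpy as np
import networkx as nx

def fuzzy_kmedoids(D, k, m=2.0, iters=30, seed=0):
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=k, replace=False)
    for _ in range(iters):
        d = D[:, medoids] + 1e-9                     # node-to-medoid distances
        u = 1.0 / (d ** (2 / (m - 1)))
        u /= u.sum(axis=1, keepdims=True)            # fuzzy memberships
        # each medoid moves to the node minimising the weighted distance sum
        medoids = np.array([np.argmin(D @ (u[:, j] ** m)) for j in range(k)])
    return u

g = nx.karate_club_graph()
attrs = np.array([[g.degree(v)] for v in g])         # toy 1-D node attribute
struct = np.array(nx.floyd_warshall_numpy(g), dtype=float)
attr_d = np.abs(attrs - attrs.T)
D = 0.7 * struct / struct.max() + 0.3 * attr_d / attr_d.max()

u = fuzzy_kmedoids(D, k=2)
overlapping = np.where((u > 0.35).sum(axis=1) > 1)[0]
print("overlapping vertices:", overlapping)
```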

