ChemTables: a dataset for semantic classification on tables in chemical patents

2021 ◽  
Vol 13 (1) ◽  
Author(s):  
Zenan Zhai ◽  
Christian Druckenbrodt ◽  
Camilo Thorne ◽  
Saber A. Akhondi ◽  
Dat Quoc Nguyen ◽  
...  

Abstract Chemical patents are a commonly used channel for disclosing novel compounds and reactions, and hence represent important resources for chemical and pharmaceutical research. Key chemical data in patents is often presented in tables. Both the number and the size of tables can be very large in patent documents. In addition, various types of information can be presented in tables in patents, including spectroscopic and physical data, or pharmacological use and effects of chemicals. Since images of Markush structures and merged cells are commonly used in these tables, their structure also shows substantial variation. This heterogeneity in content and structure of tables in chemical patents makes relevant information difficult to find. We therefore propose a new text mining task of automatically categorising tables in chemical patents based on their contents. Categorisation of tables based on the nature of their content can help to identify tables containing key information, improving the accessibility of information in patents that is highly relevant for new inventions. For developing and evaluating methods for the table classification task, we developed a new dataset, called ChemTables, which consists of 788 chemical patent tables with labels of their content type. We introduce this data set in detail. We further establish strong baselines for the table classification task in chemical patents by applying state-of-the-art neural network models developed for natural language processing, including TabNet, ResNet and Table-BERT on ChemTables. The best performing model, Table-BERT, achieves a performance of 88.66 micro-averaged F1 score on the table classification task. The ChemTables dataset is publicly available at https://doi.org/10.17632/g7tjh7tbrj.3, subject to the CC BY NC 3.0 license. Code/models evaluated in this work are in a Github repository https://github.com/zenanz/ChemTables.

2021 ◽  
Author(s):  
Zenan Zhai ◽  
Christian Druckenbrodt ◽  
Camilo Thorne ◽  
Saber Akhondi ◽  
Dat Quoc Nguyen ◽  
...  

Abstract Chemical patents are a commonly used channel for disclosing novel compounds and reactions, and hence represent important resources for chemical and pharmaceutical research. Key chemical data in patents is often presented in tables. Both the number and the size of tables can be very large in patent documents. In addition, various types of information can be presented in tables in patents, including spectroscopic and physical data, or pharmacological use and effects of chemicals. Since images of Markush structures and merged cells are commonly used in these tables, their structure also shows substantial variation. This heterogeneity in content and structure of tables in chemical patents makes relevant information difficult to find. We therefore propose a new text mining task of automatically categorising tables in chemical patents based on their contents. Categorisation of tables based on the nature of their content can help to identify tables containing key information, improving the accessibility of information in patents that is highly relevant for new inventions. For developing and evaluating methods for the table classification task, we developed a new dataset, called ChemTables, which consists of 788 chemical patent tables with labels of their content type. We introduce this data set in detail. We further establish strong baselines for the table classification task in chemical patents by applying state-of-the-art neural network models developed for natural language processing, including TabNet, ResNet and Table-BERT on ChemTables. The best performing model, Table-BERT, achieves a performance of 88.66 micro F1 score on the table classification task. Availability: The ChemTables dataset is publicly available at http://dx.doi.org/10.17632/g7tjh7tbrj.1, subject to the CC BY NC 3.0 license. Code/models evaluated in this work are in a Github repository https://github.com/zenanz/ChemTables.


2020 ◽  
Author(s):  
Zenan Zhai ◽  
Christian Druckenbrodt ◽  
Camilo Thorne ◽  
Saber A Akhondi ◽  
Dat Quoc Nguyen ◽  
...  

Abstract Chemical patents are a commonly used channel for disclosing novel compounds and reactions, and hence represent important resources for chemical and pharmaceutical research. Key chemical data in patents is often presented in tables. Both the number and the size of tables can be very large in patent documents. In addition, various types of information can be presented in tables in patents, including spectroscopic and physical data, or pharmacological use and effects of chemicals. Since images of Markush structures and merged cells are commonly used in these tables, their structure also shows substantial variation. This heterogeneity in content and structure of tables in chemical patents makes relevant information difficult to find. We therefore propose a new text mining task of automatically categorising tables in chemical patents based on their contents. Categorisation of tables based on the nature of their content can help to identify tables containing key information, improving the accessibility of information in patents that is highly relevant for new inventions. For developing and evaluating methods for the table classification task, we developed a new dataset, called ChemTables, which consists of 7,886 chemical patent tables with labels of their content type. We introduce this data set in detail. We further establish strong baselines for the table classification task in chemical patents by applying state-of-the-art neural network models developed for natural language processing, including TabNet, ResNet and Table-BERT on ChemTables. The best performing model, Table-BERT, achieves a performance of 88.66 micro F1 score on the table classification task. Availability: A 10% sample of the ChemTables dataset has been made publicly available, subject to a data usage agreement.


2020 ◽  
Vol 7 (1) ◽  
Author(s):  
Molham Al-Maleh ◽  
Said Desouki

Abstract Natural language processing has witnessed remarkable progress with the advent of deep learning techniques. Text summarization, along with other tasks like text translation and sentiment analysis, used deep neural network models to enhance results. The new methods of text summarization are subject to a sequence-to-sequence framework of encoder–decoder model, which is composed of neural networks trained jointly on both input and output. Deep neural networks take advantage of big datasets to improve their results. These networks are supported by the attention mechanism, which can deal with long texts more efficiently by identifying focus points in the text. They are also supported by the copy mechanism that allows the model to copy words from the source to the summary directly. In this research, we are re-implementing the basic summarization model that applies the sequence-to-sequence framework on the Arabic language, which has not witnessed the employment of this model in text summarization before. Initially, we build an Arabic data set of summarized article headlines. This data set consists of approximately 300 thousand entries, each consisting of an article introduction and the headline corresponding to this introduction. We then apply baseline summarization models to the previous data set and compare the results using the ROUGE scale.


Author(s):  
Yonatan Belinkov ◽  
James Glass

The field of natural language processing has seen impressive progress in recent years, with neural network models replacing many of the traditional systems. A plethora of new models have been proposed, many of which are thought to be opaque compared to their feature-rich counterparts. This has led researchers to analyze, interpret, and evaluate neural networks in novel and more fine-grained ways. In this survey paper, we review analysis methods in neural language processing, categorize them according to prominent research trends, highlight existing limitations, and point to potential directions for future work.


2021 ◽  
Vol 70 (10) ◽  
Author(s):  
Kazuyoshi Gotoh ◽  
Makoto Miyoshi ◽  
I Putu Bayu Mayura ◽  
Koji Iio ◽  
Osamu Matsushita ◽  
...  

The options available for treating infections with carbapenemase-producing Enterobacteriaceae (CPE) are limited; with the increasing threat of these infections, new treatments are urgently needed. Biapenem (BIPM) is a carbapenem, and limited data confirming its in vitro killing effect against CPE are available. In this study, we examined the minimum inhibitory concentrations (MICs) and minimum bactericidal concentrations (MBCs) of BIPM for 14 IMP-1-producing Enterobacteriaceae strains isolated from the Okayama region in Japan. The MICs against almost all the isolates were lower than 0.5 µg ml−1, indicating susceptibility to BIPM, while approximately half of the isolates were confirmed to be bacteriostatic to BIPM. However, initial killing to a 99.9 % reduction was observed in seven out of eight strains in a time–kill assay. Despite the small data set, we concluded that the in vitro efficacy of BIPM suggests that the drug could be a new therapeutic option against infection with IMP-producing CPE.


2019 ◽  
Vol 53 (1) ◽  
pp. 2-19 ◽  
Author(s):  
Erion Çano ◽  
Maurizio Morisio

Purpose The fabulous results of convolution neural networks in image-related tasks attracted attention of text mining, sentiment analysis and other text analysis researchers. It is, however, difficult to find enough data for feeding such networks, optimize their parameters, and make the right design choices when constructing network architectures. The purpose of this paper is to present the creation steps of two big data sets of song emotions. The authors also explore usage of convolution and max-pooling neural layers on song lyrics, product and movie review text data sets. Three variants of a simple and flexible neural network architecture are also compared. Design/methodology/approach The intention was to spot any important patterns that can serve as guidelines for parameter optimization of similar models. The authors also wanted to identify architecture design choices which lead to high performing sentiment analysis models. To this end, the authors conducted a series of experiments with neural architectures of various configurations. Findings The results indicate that parallel convolutions of filter lengths up to 3 are usually enough for capturing relevant text features. Also, max-pooling region size should be adapted to the length of text documents for producing the best feature maps. Originality/value Top results the authors got are obtained with feature maps of lengths 6–18. An improvement on future neural network models for sentiment analysis could be generating sentiment polarity prediction of documents using aggregation of predictions on smaller excerpt of the entire text.


Author(s):  
Aditya Rajbongshi ◽  
Thaharim Khan ◽  
Md. Mahbubur Rahman ◽  
Anik Pramanik ◽  
Shah Md Tanvir Siddiquee ◽  
...  

The acknowledgment of plant diseases assumes an indispensable part in taking infectious prevention measures to improve the quality and amount of harvest yield. Mechanization of plant diseases is a lot advantageous as it decreases the checking work in an enormous cultivated area where mango is planted to a huge extend. Leaves being the food hotspot for plants, the early and precise recognition of leaf diseases is significant. This work focused on grouping and distinguishing the diseases of mango leaves through the process of CNN. DenseNet201, InceptionResNetV2, InceptionV3, ResNet50, ResNet152V2, and Xception all these models of CNN with transfer learning techniques are used here for getting better accuracy from the targeted data set. Image acquisition, image segmentation, and features extraction are the steps involved in disease detection. Different kinds of leaf diseases which are considered as the class for this work such as anthracnose, gall machi, powdery mildew, red rust are used in the dataset consisting of 1500 images of diseased and also healthy mango leaves image data another class is also added in the dataset. We have also evaluated the overall performance matrices and found that the DenseNet201 outperforms by obtaining the highest accuracy as 98.00% than other models.


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Md Vaseem Chavhan ◽  
M. Ramesh Naidu ◽  
Hayavadana Jamakhandi

Purpose This paper aims to propose the artificial neural network (ANN) and regression models for the estimation of the thread consumption at multilayered seam assembly stitched with lock stitch 301. Design/methodology/approach In the present study, the generalized regression and neural network models are developed by considering the fabric types: woven, nonwoven and multilayer combination thereof, with basic sewing parameters: sewing thread linear density, stitch density, needle count and fabric assembly thickness. The network with feed-forward backpropagation is considered to build the ANN, and the training function trainlm of MATLAB software is used to adjust weight and basic values according to the optimization of Levenberg Marquardt. The performance of networks measured in terms of the mean squared error and the layer output is set according to the sigmoid transfer function. Findings The proposed ANN and regression model are able to predict the thread consumption with more accuracy for multilayered seam assembly. The predictability of thread consumption from available geometrical models, regression models and industrial empirical techniques are compared with proposed linear regression, quadratic regression and neural network models. The proposed quadratic regression model showed a good correlation with practical thread consumption value and more accuracy in prediction with an overall 4.3% error, as compared to other techniques for given multilayer substrates. Further, the developed ANN network showed good accuracy in the prediction of thread consumption. Originality/value The estimation of thread consumed while stitching is the prerequisite of the garment industry for inventory management especially with the introduction of the costly high-performance sewing thread. In practice, different types of fabrics are stitched at multilayer combinations at different locations of the stitched product. The ANN and regression models are developed for multilayered seam assembly of woven and nonwoven fabric blend composition for better prediction of thread consumption.


Healthcare ◽  
2020 ◽  
Vol 8 (2) ◽  
pp. 181 ◽  
Author(s):  
Patricia Melin ◽  
Julio Cesar Monica ◽  
Daniela Sanchez ◽  
Oscar Castillo

In this paper, a multiple ensemble neural network model with fuzzy response aggregation for the COVID-19 time series is presented. Ensemble neural networks are composed of a set of modules, which are used to produce several predictions under different conditions. The modules are simple neural networks. Fuzzy logic is then used to aggregate the responses of several predictor modules, in this way, improving the final prediction by combining the outputs of the modules in an intelligent way. Fuzzy logic handles the uncertainty in the process of making a final decision about the prediction. The complete model was tested for the case of predicting the COVID-19 time series in Mexico, at the level of the states and the whole country. The simulation results of the multiple ensemble neural network models with fuzzy response integration show very good predicted values in the validation data set. In fact, the prediction errors of the multiple ensemble neural networks are significantly lower than using traditional monolithic neural networks, in this way showing the advantages of the proposed approach.

