natural language process
Recently Published Documents


TOTAL DOCUMENTS

30
(FIVE YEARS 19)

H-INDEX

5
(FIVE YEARS 3)

2021 ◽  
Author(s):  
Xinqiao Wang ◽  
Chuansheng Yao ◽  
Yun Zhang ◽  
Jiahui Yu ◽  
Haoran Qiao ◽  
...  

Abstract: Deep learning methods have proven their potential in chemistry, for example in reaction prediction and retrosynthesis analysis. However, the de novo generation of unreported reactions using artificial intelligence remains largely unexplored. Inspired by molecular generation, we propose the task of novel reaction generation. In this work, we used Heck reactions to train a transformer model, a state-of-the-art natural language processing model, and obtained 4717 generated reactions after sampling and processing. We then confirmed 2253 novel Heck reactions by having chemists judge the generated reactions, and used organic synthesis experiments to verify the feasibility of unreported reactions. The whole process, from Heck reaction generation to experimental verification, took only 15 days, which suggests that our model learns reaction rules in depth and can contribute substantially to novel reaction discovery.


2021 ◽  
pp. 1-12
Author(s):  
Prakash Mohan ◽  
Manikandan Sundaram ◽  
Sambit Satpathy ◽  
Sanchali Das

Data compression techniques include data de-duplication, which plays an important role in eliminating duplicate copies of information and has been widely employed in cloud storage to reduce storage capacity and save bandwidth. We present a secure, AES-encrypted de-duplication system that detects duplicates by meaning and stores the result in the cloud. To protect the privacy of sensitive information while supporting de-duplication, the AES encryption technique and the SHA-256 hashing algorithm are used to encrypt the information before outsourcing. After pre-processing, documents are compared and verified with the help of WordNet. Cosine similarity is used to measure the similarity between two documents, and a much more efficient vector space model (VSM) data structure is used to compute it. The hierarchical WordNet corpus is used to examine syntax and semantics so that duplicates can be identified. NLTK, which provides a wide range of libraries and programs for symbolic and statistical natural language processing (NLP) in Python, is used here to handle words that cosine similarity leaves unidentified. In previous approaches, cloud storage was consumed heavily because similar files were allowed to be stored. With our system, storage space is reduced by up to 85%. Because AES and SHA-256 are employed, the system provides high security and efficiency.
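The duplicate check described in this abstract can be illustrated with a minimal sketch. It assumes a plain term-count vector space model, whitespace tokenization, and a hypothetical similarity threshold of 0.8; none of these details are given in the abstract. AES encryption and WordNet lookups are left out, and the function names are illustrative only.

```python
import hashlib
import math
from collections import Counter


def sha256_digest(text: str) -> str:
    """Return a SHA-256 fingerprint, used here to catch exact duplicates."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def term_vector(text: str) -> Counter:
    """Toy vector space model: lowercase whitespace tokens mapped to counts."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_duplicate(doc_a: str, doc_b: str, threshold: float = 0.8) -> bool:
    """Exact match by hash first, then near-duplicate by cosine similarity.

    The threshold is an assumed value, not one reported in the abstract.
    """
    if sha256_digest(doc_a) == sha256_digest(doc_b):
        return True
    return cosine_similarity(term_vector(doc_a), term_vector(doc_b)) >= threshold
```

Checking the hash first keeps the exact-duplicate case cheap; the cosine comparison only matters for documents whose bytes differ but whose vocabulary largely overlaps.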


Author(s):  
Shekkari Akhil

One of the most important areas where natural language processing and machine learning can help is determining whether two questions are similar. The model we create can instantly detect whether a question is similar to one that has already been posed. To find the underlying patterns in our data, we carry out a complete exploratory data analysis. Based on our observations, we then perform feature engineering. We try out several modelling strategies to determine which one works best and gives the best results.


Electronics ◽  
2021 ◽  
Vol 10 (7) ◽  
pp. 845
Author(s):  
Danbi Cho ◽  
Hyunyoung Lee ◽  
Seungshik Kang

How the token unit is defined in a sentence matters in natural language processing tasks such as text classification, machine translation, and generation. Many recent studies have used subword tokenization in language models such as BERT, KoBERT, and ALBERT. Although these language models have achieved state-of-the-art results in various NLP tasks, it is not clear whether the subword is the best token unit for Korean sentence embedding. We therefore carried out sentence embedding based on word, morpheme, subword, and submorpheme units on Korean sentiment analysis. We explored two sentence representation methods for sentence embedding: one that considers the order of tokens in a sentence and one that does not. Feeding a sentence decomposed into token units to both representation methods, we constructed sentence embeddings with various tokenizations to find the most effective token unit for Korean sentence embedding. In our work, we confirmed the robustness of the subword unit to out-of-vocabulary (OOV) problems compared to other token units, the disadvantage of replacing whitespace with a particular symbol in the sentiment analysis task, and that the optimal vocabulary size is 16K for subword and submorpheme tokenization. Empirically, we found that the subword unit, tokenized with a vocabulary size of 16K and without whitespace replacement, was the most effective for sentence embedding on the Korean sentiment analysis task.
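The OOV robustness of subword units mentioned in this abstract can be illustrated with a small sketch. It assumes a greedy longest-match tokenizer over a hand-written toy vocabulary of English strings; this is not the tokenizer, vocabulary, or language used in the paper, and whitespace is simply used as a word boundary rather than replaced with a symbol.

```python
def word_tokenize(sentence, vocab, unk="[UNK]"):
    """Word-level units: any word missing from the vocabulary becomes unk."""
    return [w if w in vocab else unk for w in sentence.split()]


def subword_tokenize(sentence, vocab, unk="[UNK]"):
    """Greedy longest-match subword units within each whitespace-separated word."""
    tokens = []
    for word in sentence.split():
        start = 0
        while start < len(word):
            # Shrink the candidate piece until it is found in the vocabulary.
            end = len(word)
            while end > start and word[start:end] not in vocab:
                end -= 1
            if end == start:
                # No piece starting here is known: emit unk for one character.
                tokens.append(unk)
                start += 1
            else:
                tokens.append(word[start:end])
                start = end
    return tokens


# Hypothetical toy vocabularies, chosen only for this illustration.
WORD_VOCAB = {"good", "movie"}
SUBWORD_VOCAB = {"good", "movie", "un", "watch", "able"}

print(word_tokenize("unwatchable movie", WORD_VOCAB))
print(subword_tokenize("unwatchable movie", SUBWORD_VOCAB))
```

With these toy vocabularies the word-level tokenizer loses the unseen word entirely, while the subword tokenizer still recovers known pieces of it, which is the effect the abstract describes for OOV inputs.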


2021 ◽  
Vol 717 (1) ◽  
pp. 012001
Author(s):  
Dyah Estu Kurniawati ◽  
Salahudin ◽  
Gonda Yumitro ◽  
Demeiati Nur Kusumaningrum

2020 ◽  
Vol 34 (6) ◽  
pp. 721-729
Author(s):  
Kheira Z. Bousmaha ◽  
Nour H. Chergui ◽  
Mahfoud Sid Ali Mbarek ◽  
Lamia Belguith Hadrich

The Arabic natural language processing (ANLP) community does not have an automatic question generator for texts in the Arabic language. Our objective is to provide one. This paper presents a novel automatic question generation approach that generates questions as a form of support for children learning through the QUIZZITO platform. Our approach combines PropBank semantic role labelling (SRL) with the flexibility of question models. It is essentially an approach that instantiates a representation model based on semantics-focused analysis. This allowed us to capture as much of a sentence's meaning as possible given the flexibility of Arabic grammar. The model is written as a set of patterns and templates based on regular expressions (REGEX). Our goal is to enrich Quizzito's online quiz platform, which contains more than 254.5k quizzes, and to provide it with a generator of Arabic language questions for children's texts. Our Arabic Question Generator (AQG) system is functional and reaches an F-measure of up to 86%.
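The pattern-and-template idea in this abstract can be sketched roughly as follows. The rules below are hypothetical, written for simple English sentences, and use regular-expression named groups as a stand-in for semantic roles; the actual AQG patterns, its SRL step, and its Arabic handling are not described in enough detail here to reproduce.

```python
import re

# Hypothetical rules: (sentence pattern, question template, group holding the answer).
RULES = [
    (
        re.compile(r"^(?P<subject>[A-Z]\w*) lives in (?P<place>[^.]+)\.?$"),
        "Where does {subject} live?",
        "place",
    ),
    (
        re.compile(r"^(?P<agent>[A-Z]\w*) wrote (?P<work>[^.]+)\.?$"),
        "Who wrote {work}?",
        "agent",
    ),
]


def generate_questions(sentence):
    """Return (question, answer) pairs for every rule whose pattern matches."""
    results = []
    for pattern, template, answer_group in RULES:
        match = pattern.match(sentence.strip())
        if match:
            roles = match.groupdict()
            # Fill the template with the captured roles; extra roles are ignored.
            results.append((template.format(**roles), roles[answer_group]))
    return results


print(generate_questions("Sara lives in Tunis."))
print(generate_questions("Omar wrote a short story."))
```

Keeping patterns and templates as data rather than code means new question types can be added by appending a rule, which is one plausible reading of the flexibility the abstract attributes to question models.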


Author(s):  
Mukund Upadhyay and Prof. Shallu Bashambu

Image captioning means automatically generating a caption for an image. With the development of deep learning, the combination of computer vision and natural language processing has attracted great attention in the last few years. Image captioning is representative of this field: the computer learns to describe the visual content of an image in one or more sentences. Generating a meaningful description of high-level image semantics requires not only recognizing the objects and the scene, but also the ability to analyze the state, the attributes, and the relationships among these objects. Neural-network-based methods are further divided into subcategories based on the specific framework they use. Each subcategory is discussed in detail. After that, state-of-the-art methods are compared on benchmark datasets. Finally, future research directions are discussed.


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Yoojoong Kim ◽  
Jeong Hyeon Lee ◽  
Sunho Choi ◽  
Jeong Moon Lee ◽  
Jong-Ho Kim ◽  
...  

Abstract: Pathology reports contain essential data for both clinical and research purposes. However, extracting meaningful, qualitative data from the original document is difficult because of the narrative and complex nature of such reports. Keyword extraction for pathology reports is necessary to summarize the informative text and reduce the time required. In this study, we employed a deep learning model for natural language processing to extract keywords from pathology reports and presented a supervised keyword extraction algorithm. We considered three types of pathological keywords, namely specimen, procedure, and pathology types. We compared the performance of the present algorithm with conventional keyword extraction methods on 3115 pathology reports that were manually labeled by professional pathologists. Additionally, we applied the present algorithm to 36,014 unlabeled pathology reports and analysed the extracted keywords with biomedical vocabulary sets. The results demonstrated the suitability of our model for practical application in extracting important data from pathology reports.

