text extraction
Recently Published Documents


Total documents: 406 (five years: 84)
H-index: 24 (five years: 2)

2022 ◽  
Vol 2022 ◽  
pp. 1-15
Author(s):  
Yinghai Zhou ◽  
Yi Tang ◽  
Ming Yi ◽  
Chuanyu Xi ◽  
Hai Lu

With the rise of advanced persistent threats (APTs) and an increasingly severe network security situation, a strategic defense approach built on “active defense, traceability, and countermeasures” has emerged, and cyberspace threat intelligence (CTI) has become increasingly valuable for resisting cyber threats. Motivated by the practical need to defend against APTs, we apply natural language processing to CTI and design a new automated system, CTI View, for text extraction and analysis of the massive volume of unstructured CTI released by security vendors. The main work of CTI View is as follows: (1) to handle heterogeneous CTI, a text extraction framework for threat intelligence is designed based on an automated test framework, text recognition technology, and text denoising technology, which effectively solves the poor adaptability of crawlers on heterogeneous CTI; (2) regular expressions are combined with a blacklist and whitelist mechanism to extract the IOC and TTP information described in CTI; (3) according to practical requirements, a model based on bidirectional encoder representations from transformers (BERT) is designed to perform entity extraction for heterogeneous threat intelligence. In this paper, a GRU layer is added to the existing BERT-BiLSTM-CRF model, and we evaluate the proposed model on the labeled dataset, obtaining better performance than current mainstream entity extraction models.
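
Step (2) of the abstract, regular expressions combined with a whitelist and blacklist, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the CTI View implementation: the three patterns, the list contents, the function name `extract_iocs`, and the interpretation that whitelisted values are dropped while blacklisted values are flagged.

```python
import re

# Hypothetical, deliberately simple patterns; real IOC patterns are stricter.
IOC_PATTERNS = {
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "domain": re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b", re.IGNORECASE),
}

# Assumed semantics: whitelisted values are treated as benign and dropped,
# blacklisted values are reported as known-bad.
WHITELIST = {"example.com"}
BLACKLIST = {"evil.example.net"}


def extract_iocs(text, whitelist=WHITELIST, blacklist=BLACKLIST):
    """Return (candidates per IOC type, candidates that are blacklisted)."""
    candidates = {}
    for kind, pattern in IOC_PATTERNS.items():
        hits = {match.lower() for match in pattern.findall(text)}
        candidates[kind] = sorted(hits - whitelist)
    flagged = sorted(
        value
        for values in candidates.values()
        for value in values
        if value in blacklist
    )
    return candidates, flagged


report = (
    "The sample 44d88612fea8a8f36de82e1278abb02f contacted 10.0.0.5 "
    "and evil.example.net; see example.com for details."
)
candidates, flagged = extract_iocs(report)
```

The whitelist matters because loose patterns such as the domain pattern also match benign strings (vendor sites, file names), which would otherwise be reported as indicators.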


2022 ◽  
Vol 12 (1) ◽  
pp. 0-0

Listening can make learning easier and more interesting than reading. An audiobook is software that converts text to speech. Although this sounds appealing, the audiobooks available on the market are neither free nor accessible to everyone. In addition, these audiobooks mostly cover fictional stories, novels, or comics. A comprehensive review of the available literature shows that very little intensive work has been done on image-to-speech conversion. In this paper, we employ several strategies across the entire process. As an initial step, deep learning models are built to denoise the images that are fed to the system. This is followed by text extraction with the help of OCR engines. To further improve the quality of the extracted text, a post-processing spell-check mechanism is incorporated. Our result analysis demonstrates that with denoising and spell checking, our model achieves an accuracy of 98.11%, compared to 84.02% without any denoising or spell-check mechanism.
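
The post-processing spell-check step can be sketched as follows. This is a minimal illustration using Python's standard `difflib`, not the method used in the paper; the vocabulary, the similarity cutoff, and the function names `correct_token` and `correct_text` are assumptions.

```python
import difflib

# Hypothetical vocabulary; a real system would load a full dictionary.
VOCABULARY = ["text", "extraction", "image", "speech", "quality"]


def correct_token(token, vocabulary=VOCABULARY, cutoff=0.75):
    """Replace an OCR token with its closest vocabulary word, if one is close enough."""
    lowered = token.lower()
    if lowered in vocabulary:
        return token
    matches = difflib.get_close_matches(lowered, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_text(text):
    """Apply token-level correction to whitespace-separated OCR output."""
    return " ".join(correct_token(token) for token in text.split())


cleaned = correct_text("imaqe extracti0n xyz")
```

Tokens with no sufficiently similar vocabulary entry are left unchanged, so the correction step never invents words for text it cannot recognize.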


2021 ◽  
Vol 7 ◽  
pp. e717
Author(s):  
Hazrat Ali ◽  
Khalid Iqbal ◽  
Ghulam Mujtaba ◽  
Ahmad Fayyaz ◽  
Mohammad Farhad Bulbul ◽  
...  

Text detection in natural scene images for content analysis is an interesting task. The research community has seen major developments in English and Mandarin text detection. However, Urdu text extraction in natural scene images has not been well addressed. In this work, first, a new dataset is introduced for Urdu text in natural scene images. The dataset comprises 500 standalone images acquired from real scenes. Second, the channel-enhanced Maximally Stable Extremal Region (MSER) method is applied to extract Urdu text regions as candidates in an image. A two-stage filtering mechanism is applied to eliminate non-candidate regions. In the first stage, text and noise are classified based on their geometric properties. In the second stage, a support vector machine classifier is trained to discard non-text candidate regions. After this, text candidate regions are linked using centroid-based vertical and horizontal distances. Text lines are further analyzed by a different classifier based on HOG features to remove non-text regions. Extensive experimentation is performed on the locally developed dataset to evaluate the performance. The experimental results show good performance on test set images. The dataset will be made available for research use. To the best of our knowledge, this work is the first of its kind for the Urdu language; it provides a dataset for free research use and a baseline for the task of Urdu text extraction.
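
The first filtering stage and the centroid-based linking can be sketched as below. This is only an illustration of the general idea: the `Region` type, the area and aspect-ratio thresholds, the distance limits, and the greedy left-to-right grouping are all assumptions, not the parameters or procedure of the paper.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Axis-aligned bounding box of a candidate region (hypothetical type)."""
    x: int
    y: int
    w: int
    h: int

    @property
    def centroid(self):
        return (self.x + self.w / 2, self.y + self.h / 2)


def passes_geometry(region, min_area=30, max_aspect=8.0):
    """Stage-one filter: drop tiny blobs and extremely elongated shapes."""
    area = region.w * region.h
    aspect = max(region.w, region.h) / max(1, min(region.w, region.h))
    return area >= min_area and aspect <= max_aspect


def link_regions(regions, max_dx=40, max_dy=10):
    """Greedily group regions into lines using centroid distances."""
    lines = []
    for region in sorted(regions, key=lambda r: r.centroid[0]):
        cx, cy = region.centroid
        for line in lines:
            lx, ly = line[-1].centroid
            if abs(cx - lx) <= max_dx and abs(cy - ly) <= max_dy:
                line.append(region)
                break
        else:
            lines.append([region])
    return lines


candidates = [
    Region(0, 0, 20, 20),
    Region(30, 2, 20, 20),
    Region(200, 100, 20, 20),
    Region(0, 50, 100, 2),   # long thin streak, rejected by aspect ratio
    Region(5, 5, 3, 3),      # tiny speck, rejected by area
]
kept = [r for r in candidates if passes_geometry(r)]
lines = link_regions(kept)
```

In this toy input the first two boxes end up on one line and the distant third box forms its own line; a trained classifier would then decide which lines are actually text.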


Author(s):  
Ashwini Dalvi ◽  
Irfan Siddavatam ◽  
Apoorva Jain ◽  
Smit Moradiya ◽  
Faruk Kazi ◽  
...  

In India, the majority of people speak Hindi, but a major portion of signboards are in English. On a business or pleasure trip, travelers get confused by the various signboards written in English. As smartphones have become popular in recent years, travelers can rely on them for help. This paper describes work to build a mobile application that recognizes the English text and signs on a signboard image, translates the text and symbols from English to Hindi, and displays the translated Hindi text on the phone screen. The system uses a pre-trained faster regional convolutional neural network (using a pre-trained CNN) for object detection, Tesseract OCR for text extraction, and an English-to-Hindi dictionary for translation.


Author(s):  
Dr. K. Suresh

The current way of checking answer scripts is laborious for the college: examiners must manually check the answers and allocate marks to the students. Our proposed system uses Machine Learning and Natural Language Processing (NLP) techniques to overcome this. Machine learning algorithms use computational methods to learn directly from data without relying on predetermined rules. NLP algorithms identify specific entities within the text, search for key elements in a document, run a contextual search for synonyms, detect misspelled words or similar entries, and more. Our algorithm performs similarity checking and also counts the question-related words that match exactly between two documents. It also checks whether grammar is used correctly in the student's answer. Our proposed system performs text extraction and evaluation of marks by applying Machine Learning and Natural Language Processing techniques.
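
The similarity check and the exact keyword match between a reference answer and a student answer can be sketched as follows. The abstract does not say which similarity measure is used, so bag-of-words cosine similarity is an assumption here, as are the function names and the sample texts.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words count vectors."""
    counts_a, counts_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def matched_keywords(answer, keywords):
    """Return the question keywords that appear verbatim in the answer."""
    tokens = set(tokenize(answer))
    return sorted(k for k in keywords if k.lower() in tokens)


reference = "Photosynthesis converts sunlight into chemical energy"
student = "Photosynthesis uses sunlight to make energy"
score = cosine_similarity(reference, student)
hits = matched_keywords(student, ["sunlight", "chlorophyll", "energy"])
```

A marking scheme could then combine the similarity score with the fraction of matched keywords; how the two are weighted is a design choice the abstract leaves open.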


Author(s):  
Akanksha Mate ◽  
Megha Gurav ◽  
Kajal Babar ◽  
Gauri Raskar ◽  
Prof. Prakash Kshirsagar

Image text is textual content embedded or written in images of various forms. It can be found in captured images, scanned documents, magazines, newspapers, posters, and so on. Such image text is widely available today and is important for representing, describing, and transferring information, which helps people with communication, problem solving, accessibility, creation of new kinds of jobs, cost effectiveness, productivity, globalization, bridging social gaps, and so forth. The information in these image documents would offer higher efficiency and easier access if it were converted to text form. The process by which image text is converted into plain text is text extraction. Text extraction is useful for retrieving, searching, editing, documenting, archiving, or reporting image text. However, variation in these texts due to differences in size, orientation, style, and alignment, together with text embedded in complex colored document images, degraded document images, low-quality images, low image contrast, and complex backgrounds, makes text extraction extremely difficult and challenging. Various methods such as the Connected Component Method, Mathematical Morphology Method, Edge Based Method, and Texture Based Method have been used previously, but each has its own limitations when measured by parameters such as precision, recall, and F-score. In this paper, text extraction from image documents using a combination of two powerful methods, Connected Component and Edge Based, is discussed as a way to improve the performance and accuracy of text extraction. The implementation is done with integrated MATLAB code and the MATLAB/Simulink tool, and the proposed system is tested on the Document Image Binarization Competition (DIBCO) 2017 dataset. Finally, the extracted and recognized text is converted to speech for use by visually impaired people.
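
The connected-component half of such a pipeline can be illustrated with a small labeling routine. The paper's implementation is in MATLAB; the Python sketch below is a stand-in, and the 4-connectivity choice, the function name `connected_components`, and the toy binary image are assumptions.

```python
from collections import deque


def connected_components(grid):
    """Return bounding boxes (top, left, bottom, right) of 4-connected
    foreground components in a binary image given as a list of rows."""
    rows = len(grid)
    cols = len(grid[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from the first unseen foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


binary_image = [
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
]
boxes = connected_components(binary_image)
```

Each bounding box is a candidate character or stroke; an edge-based step could then be used to keep only the boxes that coincide with strong edges in the original image.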

