Pengenalan Pola Berbasis OCR untuk Pengambilan Data Bursa Saham (OCR-Based Pattern Recognition for Stock Exchange Data Retrieval)

2021 ◽  
Vol 17 (2) ◽  
Author(s):  
M. Dyovan Uidy Okta ◽  
Suci Aulia ◽  
Burhanuddin Burhanuddin

Investors must rely on instinct to judge when to sell and buy stocks. This is, of course, a weakness for inexperienced investors, compounded by inaccurate decisions and the time it takes to evaluate a slew of ineffective results. A support system is therefore needed to help investors make decisions about buying and selling shares. This support system creates an online analysis curve display from text data in the BEI (Indonesia Stock Exchange) stock price application. Data processing based on pattern recognition is then carried out so that buying and selling decisions can be made and investors can calculate profit and loss. As the first step of the whole system, this research built an image-to-text conversion system based on OCR (Optical Character Recognition) that converts non-editable text (.jpg) into editable text (.text) online. The resulting .text data will be used in further research to analyze stock buying and selling decisions. In tests on eight companies, the OCR-based image-to-text conversion achieved an accuracy rate of 96.8%. Using the Droid Serif, Takao PGothic, and Waree fonts at a 12 pt font size in LibreOffice, it achieved 100% accuracy.
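The abstract does not state how the 96.8% figure was computed; a common choice for OCR evaluation is character-level similarity between the OCR output and the ground-truth text. A minimal sketch using Python's standard-library difflib (the metric and the example strings are assumptions, not the authors' exact method):

```python
from difflib import SequenceMatcher

def char_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Character-level accuracy as difflib's similarity ratio:
    2*M / T, where M is the number of matching characters and
    T is the total length of both strings."""
    return SequenceMatcher(None, ground_truth, ocr_output).ratio()

# Hypothetical example: one mis-read character in a stock price line
truth = "TLKM 3650 +1.2%"
ocr   = "TLKM 3G50 +1.2%"
print(round(char_accuracy(truth, ocr), 3))  # → 0.933
```

Averaging this ratio over all test images would yield an aggregate accuracy of the kind reported above.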

Author(s):  
Shourya Roy ◽  
L. Venkata Subramaniam

Accdrnig to rscheearch at Cmabrigde Uinervtisy, it deosn’t mttaer in what oredr the ltteers in a wrod are, the olny iprmoetnt tihng is that the frist and lsat ltteer be at the rghit pclae. Tihs is bcuseae the human mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.1 Unfortunately, computing systems are not yet as smart as the human mind. Over the last couple of years a significant number of researchers have been focussing on noisy text analytics. Noisy text data is found in informal settings (online chat, SMS, e-mails, message boards, among others) and in text produced by automatic speech recognition or optical character recognition systems. Noise can degrade the performance of information processing algorithms such as classification, clustering, summarization and information extraction. We will identify some of the key research areas for noisy text and give a brief overview of the state of the art: (i) classification of noisy text, (ii) correcting noisy text, and (iii) information extraction from noisy text. We cover the first in this chapter and the latter two in the next chapter.

We define noise in text as any kind of difference between the surface form of an electronic text and the intended, correct or original text. We see such noisy text every day in various forms. Each has unique characteristics and hence requires special handling. We introduce some such forms of noisy textual data in this section.

Online Noisy Documents: E-mails, chat logs, scrapbook entries, newsgroup postings, threads in discussion fora, blogs, etc., fall under this category. People are typically less careful about the sanity of written content in such informal modes of communication. These texts are characterized by frequent misspellings, commonly and not so commonly used abbreviations, incomplete sentences, missing punctuation and so on. Almost always, noisy documents are human interpretable, if not by everyone, at least by the intended readers.

SMS: Short Message Services are becoming more and more common. Language usage in SMS text differs significantly from the standard form of the language. An urge towards shorter message length, facilitating faster typing, and the need for semantic clarity shape the structure of this non-standard form known as the texting language (Choudhury et al., 2007).

Text Generated by ASR Devices: ASR is the process of converting a speech signal into a sequence of words. An ASR system takes speech signals such as monologues, discussions between people, telephone conversations, etc. as input and produces a string of words, typically not demarcated by punctuation, as transcripts. An ASR system consists of an acoustic model, a language model and a decoding algorithm. The acoustic model is trained on speech data and the corresponding manual transcripts; the language model is trained on a large monolingual corpus. The system converts audio into text by searching the acoustic and language model space using the decoding algorithm. Most conversations between agents and customers at contact centers today are recorded; to process this data for customer intelligence, it is necessary to convert the audio into text.

Text Generated by OCR Devices: Optical character recognition, or ‘OCR’, is a technology that transfers digital images of typed or handwritten text into an editable text document. It takes a picture of text and translates the text into Unicode or ASCII. For handwritten optical character recognition, the recognition rate is 80% to 90% with clean handwriting.

Call Logs in Contact Centers: Today’s contact centers (also known as call centers, BPOs, KPOs) produce huge amounts of unstructured data in the form of call logs, apart from e-mails, call transcriptions, SMS, chat transcripts, etc. Agents are expected to summarize an interaction as soon as they are done with it and before picking up the next one. Because agents work under immense time pressure, the summary logs are very poorly written and sometimes difficult even for humans to interpret. Analysis of such call logs is important for identifying problem areas, agent performance, evolving problems, etc.

In this chapter we focus on automatic classification of noisy text. Automatic text classification refers to segregating documents into different topics depending on content; for example, categorizing customer e-mails according to topics such as billing problem, address change or product enquiry. It has important applications in e-mail categorization, building and maintaining web directories (e.g. DMoz), spam filtering, automatic call and e-mail routing in contact centers, filtering of pornographic material and so on.
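As a toy illustration of the e-mail categorization task described above (the chapter's actual classifiers are not shown here; the training examples, labels, and smoothing choice are all invented), a minimal multinomial Naive Bayes in pure Python:

```python
import math
from collections import Counter, defaultdict

# Invented training data standing in for labeled customer e-mails
train = [
    ("billing", "my bill is wrong please refund the charge"),
    ("billing", "overcharged on last invoice need refund"),
    ("enquiry", "does this product support bluetooth"),
    ("enquiry", "what colors does the product come in"),
]

def fit(docs):
    """Count words per class, documents per class, and the vocabulary."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for label, text in docs:
        class_counts[label] += 1
        for w in text.split():
            word_counts[label][w] += 1
            vocab.add(w)
    return word_counts, class_counts, vocab

def predict(text, word_counts, class_counts, vocab):
    """Pick the class with the highest log posterior."""
    total = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for label in class_counts:
        lp = math.log(class_counts[label] / total)
        n = sum(word_counts[label].values())
        for w in text.split():
            # Laplace smoothing keeps unseen (noisy, misspelled) words
            # from zeroing out an otherwise good class
            lp += math.log((word_counts[label][w] + 1) / (n + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = fit(train)
print(predict("please refund my bill", *model))  # → billing
```

The smoothing step is the part that matters for noisy text: misspelled words simply fall back to the uniform prior instead of breaking the model.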


2018 ◽  
Vol 9 (1) ◽  
pp. 28-44
Author(s):  
Urmila Shrawankar ◽  
Shruti Gedam

Finger spelling in the air lets a user operate a computer, making human interaction easier and faster than with a keyboard or touch screen. This article presents a real-time video-based system that recognizes English alphabets and words written in the air using finger movements only. Optical Character Recognition (OCR), trained on more than 500 shapes and styles of all the alphabets, is used for recognition. The system works under different lighting situations, adapts automatically to changing conditions, and provides a natural way of communicating in which no hardware is used beyond the system camera and a bright color tape. The system also does not restrict writing speed or tape color. Overall, it achieves an average character-recognition accuracy of 94.074% across all alphabets. It is concluded that this system is very useful for communication with deaf and mute people.
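The tracking step implied above (following the bright color tape across video frames to recover the written stroke) can be sketched as a per-frame centroid of pixels above a brightness threshold. The tiny grayscale frames and the threshold value below are invented for illustration; the paper's actual tracking pipeline is not specified here:

```python
def tape_centroid(frame, threshold=200):
    """Centroid (row, col) of pixels at or above the brightness
    threshold, or None if the tape is not visible in this frame."""
    pts = [(r, c) for r, row in enumerate(frame)
                  for c, v in enumerate(row) if v >= threshold]
    if not pts:
        return None
    return (sum(p[0] for p in pts) / len(pts),
            sum(p[1] for p in pts) / len(pts))

def trajectory(frames, threshold=200):
    """Sequence of tape positions: the stroke later fed to the recognizer."""
    return [p for p in (tape_centroid(f, threshold) for f in frames) if p]

# Two tiny frames: the bright tape moves one column to the right
frames = [
    [[0, 255, 0], [0, 0, 0]],
    [[0, 0, 255], [0, 0, 0]],
]
print(trajectory(frames))  # → [(0.0, 1.0), (0.0, 2.0)]
```

The recovered trajectory is what an OCR model trained on stroke shapes would then classify into letters.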


2006 ◽  
Vol 2 (2) ◽  
pp. 137-144 ◽  
Author(s):  
S. Brönnimann ◽  
J. Annis ◽  
W. Dann ◽  
T. Ewen ◽  
A. N. Grant ◽  
...  

Abstract. Hand-written or printed manuscript data are an important source for paleo-climatological studies, but bringing them into a suitable format can be a time-consuming endeavour with uncertain success. Before digitising such data (e.g., in the context of a specific research project), it is worthwhile giving some thought to the characteristics of the data, the scientific requirements with respect to quality and coverage, the metadata, and technical aspects such as reproduction techniques, digitising techniques, and quality control strategies. Here we briefly discuss the most important considerations based on our own experience and describe different methods for digitising numeric or text data (optical character recognition, speech recognition, and key entry). We present a tentative guide that is intended to help others compile the necessary information and make the right decisions.


2019 ◽  
Vol 34 (Supplement_1) ◽  
pp. i135-i141
Author(s):  
So Miyagawa ◽  
Kirill Bulert ◽  
Marco Büchler ◽  
Heike Behlmer

Abstract. Digital Humanities (DH) within Coptic Studies, an emerging field of development, will be much aided by the digitization of large quantities of typeset Coptic texts. Until recently, the only Optical Character Recognition (OCR) analysis of printed Coptic texts had been executed by Moheb S. Mekhaiel, who used the Tesseract program to create a text model for liturgical books in the Bohairic dialect of Coptic. However, this model is not suitable for the many scholarly editions of texts in the Sahidic dialect of Coptic, which use noticeably different fonts. In the current study, DH and Coptological projects based in Göttingen, Germany, collaborated to develop a new Coptic OCR pipeline suitable for use with all Coptic dialects. The objective of the study was to generate a model which can facilitate digital Coptic Studies and produce Coptic corpora from existing printed texts. First, we compared the two available OCR programs that can recognize Coptic: Tesseract and Ocropy. The results indicated that the neural network model, i.e. Ocropy, performed better at recognizing the letters with supralinear strokes that characterize the published Sahidic texts. After training Ocropy for Coptic using artificial neural networks, the team achieved an accuracy rate of over 91% for the OCR analysis of typeset Coptic. We subsequently compared the efficiency of Ocropy to that of manual transcription and concluded that using Ocropy to extract Coptic from digital images of printed texts is highly beneficial to Coptic DH.


2020 ◽  
Vol 8 (5) ◽  
pp. 5665-5674

Optical Character Recognition has emerged as an attractive research field. A lot of work has been done on the Urdu script, and diverse methodologies have been put forward based on the Nastaliq font style. Urdu is written diagonally from top to bottom, in a style known as Nastaliq. This feature makes Urdu highly cursive and more sensitive, leading to a difficult recognition problem. Because of the peculiarities of the Nastaliq style of writing, we have chosen the ligature as the basic unit of recognition in order to reduce the complexity of the system. The accuracy of recognizing ligatures in Urdu text depends on how well the ligatures are segmented. In addition to extracting connected components, the ligature segmentation takes into consideration factors such as baseline information, height, width, and centroid. In this paper, ligature recognition is performed using a multi-SVM (Support Vector Machine) approach, which gives an accuracy of 97% on 903 text images.
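The segmentation features named above (height, width, and centroid of each connected component) can be sketched as follows. The tiny binary image and helper names are illustrative only, not the paper's implementation:

```python
from collections import deque

def component_features(img):
    """Height, width, and centroid for each 4-connected component
    of a binary image given as a list of 0/1 rows."""
    rows, cols = len(img), len(img[0])
    seen = [[False] * cols for _ in range(rows)]
    feats = []
    for r in range(rows):
        for c in range(cols):
            if img[r][c] and not seen[r][c]:
                # BFS flood fill to collect one component's pixels
                q, pix = deque([(r, c)]), []
                seen[r][c] = True
                while q:
                    y, x = q.popleft()
                    pix.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < rows and 0 <= nx < cols \
                                and img[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            q.append((ny, nx))
                ys = [p[0] for p in pix]
                xs = [p[1] for p in pix]
                feats.append({
                    "height": max(ys) - min(ys) + 1,
                    "width": max(xs) - min(xs) + 1,
                    "centroid": (sum(ys) / len(pix), sum(xs) / len(pix)),
                })
    return feats

# Two separate blobs standing in for two ligatures
img = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
print(component_features(img))
```

Feature vectors of this kind (possibly alongside baseline position) are what a multi-class SVM would then be trained on.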


2018 ◽  
Vol 1 (1) ◽  
pp. 39-46
Author(s):  
Önder ÖZBEK

The Ottoman alphabet wrote Turkish using Arabic letters; the modern Turkish alphabet writes it using Latin letters. The archives contain numerous documents written in the Ottoman alphabet. In this study, images of words written in the Ottoman alphabet were converted into editable text by optical character recognition. The characters in these words were then made understandable by substituting their equivalents in the Turkish alphabet. The accuracy rate was improved by comparing the words transliterated into the Turkish alphabet against a table of Turkish words, using an algorithm that yields a similarity value for the comparison.
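The abstract does not name the similarity algorithm; an edit-distance-based ratio such as the one below is one common choice for matching noisy transliterations against a word table (a sketch under that assumption, with an invented example pair, not the paper's method):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Similarity value in [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

# Hypothetical OCR transliteration compared against a dictionary entry
print(similarity("kitab", "kitap"))  # → 0.8
```

Ranking dictionary entries by this value and accepting the best match above a threshold is the usual way such a comparison step is applied.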


2018 ◽  
Vol 2 (1) ◽  
pp. 18-27
Author(s):  
Rasty Yaseen ◽  
Hossein Hassani

Currently, no offline tool is available for Optical Character Recognition (OCR) in Kurdish. Kurdish is spoken in different dialects and written in several scripts; the Persian/Arabic script is the most widely used among these dialects. The Persian/Arabic script is written from right to left (RTL), it is cursive, and it uses unique diacritics. These features, particularly the last two, affect the segmentation stage in developing a Kurdish OCR. In this article, we introduce an enhanced character-segmentation-based method that addresses these characteristics. We applied the method to text-only images and tested the Kurdish OCR on documents of different fonts, font sizes, and image resolutions. The experiments showed that the proposed method's character-recognition accuracy was 90.82% on average.
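A standard baseline for the segmentation stage discussed here is a vertical projection profile, which splits a text-line image at empty columns; cursive joins and diacritics are exactly what break this simple rule, which is why an enhanced method is needed. A minimal sketch of the baseline (illustrative, not the authors' algorithm):

```python
def column_profile(img):
    """Number of ink pixels in each column of a binary image."""
    return [sum(row[c] for row in img) for c in range(len(img[0]))]

def segment_columns(img):
    """Split a text line into (start, end) column ranges
    separated by completely blank columns."""
    profile = column_profile(img)
    segments, start = [], None
    for c, ink in enumerate(profile):
        if ink and start is None:
            start = c
        elif not ink and start is not None:
            segments.append((start, c - 1))
            start = None
    if start is not None:
        segments.append((start, len(profile) - 1))
    return segments

# Two "characters" separated by one blank column
img = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
]
print(segment_columns(img))  # → [(0, 1), (3, 3)]
```

In a cursive RTL script, connected letters share ink columns, so this baseline under-segments; that failure mode motivates the enhanced approach the article proposes.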


2006 ◽  
Vol 2 (3) ◽  
pp. 191-207 ◽  
Author(s):  
S. Brönnimann ◽  
J. Annis ◽  
W. Dann ◽  
T. Ewen ◽  
A. N. Grant ◽  
...  

Abstract. Hand-written or printed manuscript data are an important source for paleo-climatological studies, but bringing them into a suitable format can be a time-consuming endeavour with uncertain success. Before starting the digitising work, it is worthwhile giving some thought to the characteristics of the data, the scientific requirements with respect to quality and coverage, and the different digitising techniques. Here we briefly discuss the most important considerations and report our own experience. We describe different methods for digitising numeric or text data, i.e., optical character recognition (OCR), speech recognition, and key entry. Each technique has advantages and disadvantages that may become important for certain applications. It is therefore crucial to thoroughly investigate the characteristics of the manuscript data beforehand, define the quality targets and develop validation strategies.


1997 ◽  
Vol 9 (1-3) ◽  
pp. 58-77
Author(s):  
Vitaly Kliatskine ◽  
Eugene Shchepin ◽  
Gunnar Thorvaldsen ◽  
Konstantin Zingerman ◽  
Valery Lazarev

In principle, printed source material should be made machine-readable with systems for Optical Character Recognition rather than being typed once more. Off-the-shelf commercial OCR programs tend, however, to be inadequate for lists with a complex layout. The tax assessment lists covering most nineteenth-century farms in Norway constitute one example among a series of valuable sources that can only be interpreted successfully with specially designed OCR software. This paper considers the problems involved in recognizing material with a complex table structure, outlining a new algorithmic model based on ‘linked hierarchies’. Within the scope of this model, a variety of tables and layouts can be described and recognized. The ‘linked hierarchies’ model has been implemented in the ‘CRIPT’ OCR software system, which successfully reads tables with a complex structure from several different historical sources.
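One way to picture a hierarchical table description of the kind the abstract alludes to is a tree of layout nodes: a table contains rows, rows contain cells, and recognition walks the tree. The sketch below is only a guess at the flavor of such a model; the node kinds, labels, and miniature table are invented and are not CRIPT's actual data structures:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One layout element: a table, a row, or a cell."""
    kind: str
    label: str = ""
    children: list = field(default_factory=list)

    def add(self, child):
        self.children.append(child)
        return child

    def cells(self):
        """Depth-first iterator over leaf cells."""
        if not self.children:
            yield self
        for ch in self.children:
            yield from ch.cells()

# A miniature tax-list-like table: one farm row with two value cells
table = Node("table", "tax list")
row = table.add(Node("row", "farm 1"))
row.add(Node("cell", "Nordgard"))
row.add(Node("cell", "12 daler"))
print([c.label for c in table.cells()])
```

Describing the layout once as such a hierarchy, and then matching recognized components against it, is the general idea behind template-driven recognition of complex tables.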

