Ekstraksi Informasi Halaman Web Menggunakan Pendekatan Bootstrapping pada Ontology-Based Information Extraction (Web Page Information Extraction Using a Bootstrapping Approach to Ontology-Based Information Extraction)

Author(s):  
Erma Susanti ◽  
Khabib Mustofa

Abstract: Information extraction is a subfield of natural language processing that converts unstructured text into structured information. Much of the information on the Internet is transmitted in unstructured form via websites, creating a need for technology that can analyze text and discover relevant knowledge as structured information; the main content of a web page is one example of such unstructured information. Researchers have developed various approaches to information extraction, both manual and automatic, but their performance still needs improvement in terms of extraction accuracy and speed. This research proposes an information extraction approach that combines bootstrapping with Ontology-Based Information Extraction (OBIE). The bootstrapping approach, which starts from a small seed of labelled data, is used to minimize human involvement in the extraction process, while an ontology guides the extraction of classes, properties, and instances to provide semantic content for the Semantic Web. Combining the two approaches is expected to improve both the speed of the extraction process and the accuracy of its results. The information extraction system is applied to a case study using the "LonelyPlanet" dataset. Keywords: information extraction, ontology, bootstrapping, Ontology-Based Information Extraction, OBIE, performance
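A minimal sketch of the bootstrapping loop described above: a handful of labelled seed instances induce context patterns, which in turn label new instances with little human involvement. The corpus, seed instance, and two-token pattern representation are illustrative assumptions, not details from the paper.

```python
corpus = [
    "Kuta is a beach town in Bali.",
    "Ubud is a town in central Bali.",
    "Sanur is a beach town on the coast.",
]
seeds = {"Kuta"}  # seed instances of a hypothetical ontology class, e.g. Destination

def induce_patterns(instances, sentences):
    """Turn each instance occurrence into a two-token right-context pattern."""
    patterns = set()
    for s in sentences:
        toks = s.rstrip(".").split()
        for inst in instances:
            if inst in toks:
                i = toks.index(inst)
                patterns.add(tuple(toks[i + 1:i + 3]))
    return patterns

def apply_patterns(patterns, sentences):
    """Propose every token whose right context matches an induced pattern."""
    found = set()
    for s in sentences:
        toks = s.rstrip(".").split()
        for pat in patterns:
            for i in range(len(toks) - len(pat)):
                if tuple(toks[i + 1:i + 1 + len(pat)]) == pat:
                    found.add(toks[i])
    return found

instances = set(seeds)
while True:  # bootstrap until no new instances are found
    new = apply_patterns(induce_patterns(instances, corpus), corpus) - instances
    if not new:
        break
    instances |= new
```

Starting from the single seed "Kuta", the induced pattern ("is", "a") recovers the other two destinations; in a real OBIE system the accepted instances would be attached to the guiding ontology's classes and properties.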

2020 ◽  
Vol 10 (3) ◽  
pp. 35
Author(s):  
Ahmed Adeeb Jalal

The technology world has evolved greatly over the past decades, inflating data volumes. This progress in digital technology has generated scattered text across millions of web pages, and such unstructured text contains a vast amount of textual data. Discovering useful and interesting relations in unstructured text requires further processing by computers, so text mining and information extraction have become an exciting research field for obtaining structured, valuable information. This paper focuses on text pre-processing in the automotive advertisement domain to build a structured database. The database is created by extracting information from unstructured automotive advertisements, an application of natural language processing. Information extraction here means finding factual information in text by learning regular expressions. We manually craft rule-based, domain-specific approaches to extract structured information from unstructured web pages. The structured information is then exposed through a user-friendly search engine designed for topic-specific knowledge, so that the information extracted from the advertisements can be used to perform structured search over attributes of interest. The extracted tuples are assigned probabilities and indexed to support efficient extraction and exploration via user queries.
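An illustrative rule-based extractor in the spirit described above: hand-crafted regular expressions pull structured fields out of free-text car advertisements. The field names, patterns, and sample advertisement are assumptions for the sketch, not taken from the paper.

```python
import re

RULES = {  # one hand-crafted rule per structured field
    "year":    re.compile(r"\b(19|20)\d{2}\b"),
    "price":   re.compile(r"\$\s?[\d,]+"),
    "mileage": re.compile(r"[\d,]+\s*(?:km|miles)\b", re.IGNORECASE),
}

def extract(ad_text):
    """Apply each rule to the advertisement; unmatched fields stay None."""
    record = {}
    for field, pattern in RULES.items():
        m = pattern.search(ad_text)
        record[field] = m.group(0) if m else None
    return record

ad = "2014 Toyota Corolla, 85,000 km, excellent condition, $7,500"
record = extract(ad)
```

Each record produced this way is a candidate tuple for the structured database; assigning it a match probability and indexing it then supports the structured search the abstract describes.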


2013 ◽  
Vol 7 (2) ◽  
pp. 574-579 ◽  
Author(s):  
Dr Sunitha Abburu ◽  
G. Suresh Babu

Day by day, the volume of information available on the web is growing significantly. Information on the web comes in several forms: structured, semi-structured, and unstructured. The majority of it is presented in web pages, where it is semi-structured, and the information required for a given context is often scattered across different web documents. It is difficult to analyze these large volumes of semi-structured information and to make decisions based on that analysis. The current research work proposes a framework for a system that extracts information from various sources and prepares reports based on the knowledge built from the analysis. This simplifies data extraction, data consolidation, data analysis, and decision making based on the information presented in web pages. The proposed framework integrates web crawling, information extraction, and data mining technologies for better information analysis that helps in effective decision making. It enables people and organizations to extract information from various web sources and to perform an effective analysis on the extracted data. The framework is applicable to any application domain; manufacturing, sales, tourism, and e-learning are a few examples. The framework has been implemented and tested for effectiveness, and the results are promising.
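The crawl, extract, and mine stages the framework integrates can be made concrete with a toy pipeline. Every stage implementation here is a stand-in (canned HTML, a naive regex, a simple aggregate), shown only to illustrate the data flow, not the paper's system.

```python
import re

def crawl(seed_urls):
    """Stage 1: fetch pages. Stubbed with canned HTML for illustration."""
    return {url: f"<html><body><span class='price'>{i * 10}</span></body></html>"
            for i, url in enumerate(seed_urls, start=1)}

def extract(pages):
    """Stage 2: pull semi-structured fields out of each page (naive regex here)."""
    records = []
    for url, html in pages.items():
        m = re.search(r"class='price'>(\d+)<", html)
        if m:
            records.append({"url": url, "price": int(m.group(1))})
    return records

def mine(records):
    """Stage 3: analyse the consolidated records to support decision making."""
    prices = [r["price"] for r in records]
    return {"count": len(prices), "avg_price": sum(prices) / len(prices)}

report = mine(extract(crawl(["http://a.example", "http://b.example"])))
```

The point of the composition `mine(extract(crawl(...)))` is that each stage consumes the previous stage's output unchanged, which is what lets the three technologies be integrated into one reporting system.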


2004 ◽  
pp. 227-267
Author(s):  
Wee Keong Ng ◽  
Zehua Liu ◽  
Zhao Li ◽  
Ee Peng Lim

With the explosion of information on the Web, traditional ways of browsing and keyword searching of information over web pages no longer satisfy the demanding needs of web surfers. Web information extraction has emerged as an important research area that aims to automatically extract information from target web pages and convert it into a structured format for further processing. The main issues involved in the extraction process include: (1) the definition of a suitable extraction language; (2) the definition of a data model representing the web information source; (3) the generation of the data model, given a target source; and (4) the extraction and presentation of information according to a given data model. In this chapter, we discuss the challenges of these issues and the approaches that current research activities have taken to resolve these issues. We propose several classification schemes to classify existing approaches of information extraction from different perspectives. Among the existing works, we focus on the Wiccap system — a software system that enables ordinary end-users to obtain information of interest in a simple and efficient manner by constructing personalized web views of information sources.
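The interplay of issues (2) and (4) above — a data model and extraction driven by it — can be sketched with a toy declarative model. The rule syntax (field name mapped to a regular expression) and the sample page are invented for illustration; this is not the Wiccap data model itself.

```python
import re

DATA_MODEL = {  # logical fields of the web view -> extraction rules
    "headline": r"<h1>(.*?)</h1>",
    "date":     r"<time>(.*?)</time>",
}

def extract_view(html, model):
    """Materialise a structured 'web view' of one page from the data model."""
    view = {}
    for field, rule in model.items():
        m = re.search(rule, html)
        view[field] = m.group(1) if m else None
    return view

page = "<h1>Storm hits coast</h1><time>2004-06-01</time><p>...</p>"
view = extract_view(page, DATA_MODEL)
```

Because the extractor is generic and all page-specific knowledge lives in the model, issue (3) — generating the model for a new target source — becomes the hard part, which is exactly where systems like Wiccap concentrate their effort.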


2008 ◽  
pp. 211-238
Author(s):  
Wee Keong Ng ◽  
Zehua Liu ◽  
Zhao Li ◽  
Ee Peng Lim



Author(s):  
Partha Sarathy Banerjee ◽  
Jaya Banerjee

This work focuses on natural language processing for clinical data analysis. In a world where information is generated at an exponential rate, the handling and management of this information demands wide attention. The majority of the data being generated is unstructured, and structured information is relatively easy to process compared with semi-structured or unstructured data. In the clinical domain, the larger share of the data is unstructured, such as a patient's case study and history. This chapter provides a deeper insight into this class of data and presents various solutions for how it can be interpreted and represented for better healthcare of the general public. The authors discuss a generic system developed for unstructured data handling: the Natural Language Information Interpretation and Representation System (NLIIRS).


2013 ◽  
Vol 427-429 ◽  
pp. 2489-2492 ◽  
Author(s):  
Tian Yu Zhao ◽  
Jian Yi Liu ◽  
Ru Zhang

Rich information is contributed to microblogs by millions of users around the world; however, little work has been done so far on extracting information from microblog web pages. We propose a unified structured information extraction method based on hierarchical clustering that is suitable for the microblog web pages of any microblog website. Experimental results on microblog web pages from several popular microblog service providers indicate the high performance of our method.
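A sketch of the intuition behind the method above: posts on a microblog page tend to share a repeated HTML structure, so clustering the tag-paths of page nodes groups the post blocks together regardless of which site produced the page. The distance measure, threshold, and sample paths are illustrative assumptions, not the paper's algorithm.

```python
def path_distance(a, b):
    """Fraction of positions at which two tag-paths differ (length-normalised)."""
    n = max(len(a), len(b))
    same = sum(1 for x, y in zip(a, b) if x == y)
    return 1 - same / n

def single_link_cluster(paths, threshold=0.3):
    """Greedy agglomerative (single-link) clustering of tag-paths."""
    clusters = [[p] for p in paths]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(path_distance(a, b)
                        for a in clusters[i] for b in clusters[j])
                if d <= threshold:  # merge the two closest-enough clusters
                    clusters[i] += clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters

paths = [
    ["html", "body", "div", "ul", "li", "p"],  # post 1
    ["html", "body", "div", "ul", "li", "p"],  # post 2
    ["html", "body", "div", "header", "h1"],   # page title
]
clusters = single_link_cluster(paths)
```

The two post paths collapse into one cluster while the title stays separate; the large, structurally uniform cluster is the natural candidate for the record region to extract.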


The advance of internet technology has enhanced the speed and accuracy of data retrieval over the internet, and retrieving data over the internet requires automatic processes for information extraction and query retrieval. Information extraction maps text onto a predefined structure of concepts for a particular knowledge domain. The extraction process proceeds in two steps: pre-processing of data and post-processing of data. The pre-processing step uses the glowworm swarm optimization algorithm, which provides better selection of information under similarity constraints; this similarity-based selection follows the algorithm's luciferin mechanism. The glowworm optimization removes unwanted noise from the data and filters it. For the extraction itself, an ensemble-based information extraction method is used, driven by a constraint function called the mapper constraint, which maps the process ontology onto a guiding domain ontology. The ensemble-based extraction process uses machine learning to bind the components together. The goal of this work is to develop an OBIE system for different data-retrieval domains such as news agencies, the hotel industry, and sports. The proposed model combines ontologies, POS tagging and language processing tools, and a constraint-based mapper with the domain ontology.
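The glowworm pre-processing step above follows glowworm swarm optimization (GSO), in which each agent carries a luciferin level that decays over time and grows with the fitness of its position; low-luciferin candidates can then be filtered out as noise. The fitness values, decay constants, and candidate phrases below are illustrative assumptions, not the paper's configuration.

```python
RHO, GAMMA = 0.4, 0.6  # luciferin decay and enhancement constants (assumed)

def update_luciferin(luciferin, fitness):
    """Standard GSO luciferin update: decay the old value, add scaled fitness."""
    return [(1 - RHO) * l + GAMMA * f for l, f in zip(luciferin, fitness)]

def filter_noise(candidates, luciferin, threshold):
    """Keep only candidates whose luciferin level exceeds the threshold."""
    return [c for c, l in zip(candidates, luciferin) if l > threshold]

candidates = ["relevant phrase A", "noise token", "relevant phrase B"]
fitness = [0.9, 0.1, 0.8]  # e.g. similarity to concepts in the domain ontology
luciferin = update_luciferin([0.5, 0.5, 0.5], fitness)
kept = filter_noise(candidates, luciferin, threshold=0.5)
```

After one update the noisy candidate's luciferin falls below the threshold and is dropped, leaving a cleaner candidate set for the ensemble-based extraction stage.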


2020 ◽  
Vol 29 (4) ◽  
pp. 2049-2067
Author(s):  
Karmen L. Porter ◽  
Janna B. Oetting ◽  
Loretta Pecchioni

Purpose: This study examined caregiver perceptions of their child's language and literacy disorder as influenced by communications with their speech-language pathologist.
Method: The participants were 12 caregivers of 10 school-aged children with language and literacy disorders. Employing qualitative methods, a collective case study approach was utilized in which the caregiver(s) of each child represented one case. The data came from semistructured interviews, codes emerged directly from the caregivers' responses during the interviews, and multiple coding passes using ATLAS.ti software were made until themes were evident. These themes were then further validated by conducting clinical file reviews and follow-up interviews with the caregivers.
Results: Caregivers' comments focused on the types of information received or not received, as well as the clarity of the information. This included information regarding their child's diagnosis, the long-term consequences of their child's disorder, and the connection between language and reading. Although caregivers were adept at describing their child's difficulties and therapy goals/objectives, their comments indicated that they struggled to understand their child's disorder in a way that was meaningful to them and their child.
Conclusions: The findings showed the value caregivers place on receiving clear and timely diagnostic information, as well as the complexity associated with caregivers' understanding of language and literacy disorders. The findings are discussed in terms of changes that could be made in clinical practice to better support children with language and literacy disorders and their families.


TAPPI Journal ◽  
2012 ◽  
Vol 11 (8) ◽  
pp. 17-24 ◽  
Author(s):  
HAKIM GHEZZAZ ◽  
LUC PELLETIER ◽  
PAUL R. STUART

The evaluation and process risk assessment of (a) lignin precipitation from black liquor, and (b) the near-neutral hemicellulose pre-extraction for recovery boiler debottlenecking in an existing pulp mill is presented in Part I of this paper, which was published in the July 2012 issue of TAPPI Journal. In Part II, the economic assessment of the two biorefinery process options is presented and interpreted. A mill process model was developed using WinGEMS software and used for calculating the mass and energy balances. Investment costs, operating costs, and profitability of the two biorefinery options have been calculated using standard cost estimation methods. The results show that the two biorefinery options are profitable for the case study mill and effective at process debottlenecking. The after-tax internal rate of return (IRR) of the lignin precipitation process option was estimated to be 95%, while that of the hemicellulose pre-extraction process option was 28%. Sensitivity analysis showed that the after-tax IRR of the lignin precipitation process remains higher than that of the hemicellulose pre-extraction process option, for all changes in the selected sensitivity parameters. If we consider the after-tax IRR, as well as capital cost, as selection criteria, the results show that for the case study mill, the lignin precipitation process is more promising than the near-neutral hemicellulose pre-extraction process. However, the comparison between the two biorefinery options should include long-term evaluation criteria. The potential of high value-added products that could be produced from lignin in the case of the lignin precipitation process, or from ethanol and acetic acid in the case of the hemicellulose pre-extraction process, should also be considered in the selection of the most promising process option.
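The after-tax IRR figures quoted above come from standard discounted-cash-flow analysis: the IRR is the rate r at which the net present value (NPV) of the project cash flows is zero. The cash flows below are invented purely to show the computation; they are not the case study mill's numbers.

```python
def npv(rate, cashflows):
    """Net present value of annual cashflows (list index = year)."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cashflows))

def irr(cashflows, lo=0.0, hi=10.0):
    """Find the rate where NPV crosses zero, by bisection.

    Assumes NPV is positive at `lo` and negative at `hi`, which holds for the
    usual profile of an up-front investment followed by positive returns.
    """
    for _ in range(200):
        mid = (lo + hi) / 2
        if npv(mid, cashflows) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# 100 invested now, 60 returned after tax in each of the next three years:
rate = irr([-100, 60, 60, 60])
```

With these hypothetical flows the IRR lands a little above 36%; the same calculation over each option's estimated investment and after-tax cash flows yields the 95% and 28% figures the abstract compares.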

