Text Mining: Design of Interactive Search Engine Based Regular Expressions of Online Automobile Advertisements

Technology world has greatly evolved over the past decades, which led to inflated data volume. This progress of technology in the digital form generated scattered texts across millions of web pages. Unstructured texts contain a vast amount of textual data. Discover of useful and interesting relations from unstructured texts requires more processing by computers. Therefore, text mining and information extraction have become an exciting research field to get structured and valuable information. This paper focuses on text pre-processing of automotive advertisements domains to configure a structured database. The structured database was created by extract the information over unstructured automotive advertisements, which is an area of natural language processing. Information extraction deals with finding factual information in text using learning regular expressions. We manually craft rule-based specific approaches to extract structured information from unstructured web pages. Structured information will be provided by user-friendly search engine designed for topic-specific knowledge. Consequently, this information that extracted from these advertisements uses to perform a structured search over certain interesting attributes. Thus, the tuples are assigned a probability and indexed to support the efficiency of extraction and exploration via user queries.

Download Full-text

Ekstraksi Informasi Halaman Web Menggunakan Pendekatan Bootstrapping pada Ontology-Based Information Extraction

IJCCS (Indonesian Journal of Computing and Cybernetics Systems) ◽

10.22146/ijccs.7540 ◽

2015 ◽

Vol 9 (2) ◽

pp. 111 ◽

Cited By ~ 1

Author(s):

Erma Susanti ◽

Khabib Mustofa

Keyword(s):

Information Extraction ◽

Language Processing ◽

Semantic Content ◽

Extraction Process ◽

Web Pages ◽

Structured Information ◽

Improved Performance ◽

Types Of Information ◽

Unstructured Information

AbstrakEkstraksi informasi merupakan suatu bidang ilmu untuk pengolahan bahasa alami, dengan cara mengubah teks tidak terstruktur menjadi informasi dalam bentuk terstruktur. Berbagai jenis informasi di Internet ditransmisikan secara tidak terstruktur melalui website, menyebabkan munculnya kebutuhan akan suatu teknologi untuk menganalisa teks dan menemukan pengetahuan yang relevan dalam bentuk informasi terstruktur. Contoh informasi tidak terstruktur adalah informasi utama yang ada pada konten halaman web. Bermacam pendekatan untuk ekstraksi informasi telah dikembangkan oleh berbagai peneliti, baik menggunakan metode manual atau otomatis, namun masih perlu ditingkatkan kinerjanya terkait akurasi dan kecepatan ekstraksi. Pada penelitian ini diusulkan suatu penerapan pendekatan ekstraksi informasi dengan mengkombinasikan pendekatan bootstrapping dengan Ontology-based Information Extraction (OBIE). Pendekatan bootstrapping dengan menggunakan sedikit contoh data berlabel, digunakan untuk memimalkan keterlibatan manusia dalam proses ekstraksi informasi, sedangkan penggunakan panduan ontologi untuk mengekstraksi classes (kelas), properties dan instance digunakan untuk menyediakan konten semantik untuk web semantik. Pengkombinasian kedua pendekatan tersebut diharapkan dapat meningkatan kecepatan proses ekstraksi dan akurasi hasil ekstraksi. Studi kasus untuk penerapan sistem ekstraksi informasi menggunakan dataset “LonelyPlanet”. Kata kunci—Ekstraksi informasi, ontologi, bootstrapping, Ontology-Based Information Extraction, OBIE, kinerja Abstract Information extraction is a field study of natural language processing by converting unstructured text into structured information. Several types of information on the Internet is transmitted through unstructured information via websites, led to emergence of the need a technology to analyze text and found relevant knowledge into structured information. For example of unstructured information is existing main information on the content of web pages. Various approaches for information extraction have been developed by many researchers, either using manual or automatic method, but still need to be improved performance related accuracy and speed of extraction. This research proposed an approach of information extraction that combines bootstrapping approach with Ontology-Based Information Extraction (OBIE). Bootstrapping approach using small seed of labelled data, is used to minimize human intervention on information extraction process, while the use of guide ontology for extracting classes, properties and instances, using for provide semantic content for semantic web. Combining both approaches expected to increase speed of extraction process and accuracy of extraction results. Case study to apply information extraction system using “LonelyPlanet” datasets. Keywords— Information extraction, ontology, bootstrapping, Ontology-Based Information Extraction, OBIE, performance

Download Full-text

A FRAME WORK FOR WEB INFORMATION EXTRACTION AND ANALYSIS

INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY ◽

10.24297/ijct.v7i2.3459 ◽

2013 ◽

Vol 7 (2) ◽

pp. 574-579 ◽

Cited By ~ 3

Author(s):

Dr Sunitha Abburu ◽

G. Suresh Babu

Keyword(s):

Information Extraction ◽

Data Extraction ◽

Research Work ◽

Web Pages ◽

Web Documents ◽

E Learning ◽

Structured Information ◽

Frame Work ◽

Effective Decision ◽

The Web

Day by day the volume of information availability in the web is growing significantly. There are several data structures for information available in the web such as structured, semi-structured and unstructured. Majority of information in the web is presented in web pages. The information presented in web pages is semi-structured.Â But the information required for a context are scattered in different web documents. It is difficult to analyze the large volumes of semi-structured information presented in the web pages and to make decisions based on the analysis. The current research work proposed a frame work for a system that extracts information from various sources and prepares reports based on the knowledge built from the analysis. This simplifies Â data extraction, data consolidation, data analysis and decision making based on the information presented in the web pages.The proposed frame work integrates web crawling, information extraction and data mining technologies for better information analysis that helps in effective decision making.Â Â It enables people and organizations to extract information from various sourses of web and to make an effective analysis on the extracted data for effective decision making.Â The proposed frame work is applicable for any application domain. Manufacturing,sales,tourisum,e-learning are various application to menction few.The frame work is implemetnted and tested for the effectiveness of the proposed system and the results are promising.

Download Full-text

Search Engine-Based Web Information Extraction

Web Technologies ◽

10.4018/978-1-60566-982-3.ch109 ◽

2011 ◽

pp. 2048-2081

Author(s):

Gijs Geleijnse ◽

Jan Korst

Keyword(s):

Semantic Web ◽

Information Extraction ◽

Search Engine ◽

Community Based ◽

Web Information Extraction ◽

Structure Information ◽

Web Information ◽

Structured Information ◽

The Web ◽

Standard Semantic

In this chapter we discuss approaches to find, extract, and structure information from natural language texts on the Web. Such structured information can be expressed and shared using the standard Semantic Web languages and hence be machine interpreted. In this chapter we focus on two tasks in Web information extraction. The first part focuses on mining facts from the Web, while in the second part, we present an approach to collect community-based meta-data. A search engine is used to retrieve potentially relevant texts. From these texts, instances and relations are extracted. The proposed approaches are illustrated using various case-studies, showing that we can reliably extract information from the Web using simple techniques.

Download Full-text

Search Engine-Based Web Information Extraction

Semantic Web Engineering in the Knowledge Society ◽

10.4018/978-1-60566-112-4.ch009 ◽

2009 ◽

pp. 208-241

Author(s):

Gijs Geleijnse

Keyword(s):

Semantic Web ◽

Information Extraction ◽

Search Engine ◽

Community Based ◽

Web Information Extraction ◽

Structure Information ◽

Web Information ◽

Structured Information ◽

The Web ◽

Standard Semantic

Download Full-text

An Approach of Web Page Information Extraction

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.347-350.2479 ◽

2013 ◽

Vol 347-350 ◽

pp. 2479-2482

Author(s):

Yao Hui Li ◽

Li Xia Wang ◽

Jian Xiong Wang ◽

Jie Yue ◽

Ming Zhan Zhao

Keyword(s):

Information Extraction ◽

Search Engine ◽

Information Source ◽

Web Pages ◽

Web Page ◽

Extraction Technology ◽

Page Segmentation ◽

The Web

The Web has become the largest information source, but the noise content is an inevitable part in any web pages. The noise content reduces the nicety of search engine and increases the load of server. Information extraction technology has been developed. Information extraction technology is mostly based on page segmentation. Through analyzed the existing method of page segmentation, an approach of web page information extraction is provided. The block node is identified by analyzing attributes of HTML tags. This algorithm is easy to implementation. Experiments prove its good performance.

Download Full-text

Ontology Based Object-Attribute-Value Information Extraction from Web Pages in Search Engine Result Retrieval

Smart Innovation, Systems and Technologies - Advanced Computing, Networking and Informatics- Volume 1 ◽

10.1007/978-3-319-07353-8_70 ◽

2014 ◽

pp. 611-620 ◽

Cited By ~ 2

Author(s):

V. Vijayarajan ◽

M. Dinakaran ◽

Mayank Lohani

Keyword(s):

Information Extraction ◽

Search Engine ◽

Web Pages ◽

Search Engine Result ◽

Object Attribute ◽

Attribute Value

Download Full-text

A Unified Microblog Web Page Structured Information Extraction Method Based on Hierarchical Clustering

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.427-429.2489 ◽

2013 ◽

Vol 427-429 ◽

pp. 2489-2492 ◽

Cited By ~ 1

Author(s):

Tian Yu Zhao ◽

Jian Yi Liu ◽

Ru Zhang

Keyword(s):

Information Extraction ◽

Hierarchical Clustering ◽

Extraction Method ◽

High Performance ◽

Service Providers ◽

Web Pages ◽

Web Page ◽

The World ◽

Structured Information ◽

Rich Information

Rich information is contributed to microblogs by millions of users all around the world. However, few work has been done on the study of microblog web page extraction so far. We proposed a unified structured information extraction method based on hierarchical clustering which is suitable for microblog web pages of any microblog websites. The experiment result on microblog web pages of some popular microblog service providers indicates the high performance of our method.

Download Full-text

WEIR-P: An Information Extraction Pipeline for the Wastewater Domain

10.5194/egusphere-egu21-2708 ◽

2021 ◽

Author(s):

Nanée Chahinian ◽

Thierry Bonnabaud La Bruyère ◽

Serge Conrad ◽

Carole Delenne ◽

Francesca Frontini ◽

...

Keyword(s):

Text Mining ◽

Information Extraction ◽

Language Processing ◽

Past Century ◽

Wastewater Management ◽

Learning Approaches ◽

Public Authorities ◽

Information Measures ◽

Spatio Temporal ◽

Secondary Information

Urbanization has been an increasing trend over the past century (UN, 2018) and city managers have had to constantly extend water access and sanitation services to new peripheral areas. Originally these networks were installed, operated, and repaired by their owners (Rogers et al. 2012). However, as concessions were increasingly granted to private companies and new tenders requested regularly by public authorities, archives were sometimes misplaced and event logs were lost. Thus, part of the networks&#8217; operational history was thought to be permanently erased. The advent of Web big data and text-mining techniques may offer the possibility of recovering some of this knowledge by crawling secondary information sources, i.e. documents available on the Web. Thus, insight might be gained on the wastewater collection scheme, the treatment processes, the network&#8217;s geometry and events (accidents, shortages) which may have affected these facilities and amenities. The primary aim of the "Megadata, Linked Data and Data Mining for Wastewater Networks" (MeDo) project (http://webmedo.msem.univ-montp2.fr/?page_id=223&lang=en), is to develop resources for text mining and information extraction in the wastewater domain. We developed a specific Natural Language Processing (NLP) pipeline named WEIR-P (WastewatEr InfoRmation extraction Platform) which allows users to retrieve relevant documents for a given network, process them to extract potentially new information, assess this information also by using an interactive visualization and add it to a pre-existing knowledge base. The system identifies the entities and relations to be extracted from texts, pertaining network information, wastewater treatment, accidents and works, organizations, spatio-temporal information, measures and water quality. We present and evaluate the first version of the NLP system. The preliminary results obtained on the Montpellier corpus (1,557 HTML and PDF documents in French) are encouraging and show how a mix of Machine Learning approaches and rule-based techniques can be used to extract useful information and reconstruct the various phases of the extension of a given wastewater network. While the NLP and Information Extraction (IE) methods used are state of the art, the novelty of our work lies in their adaptation to the domain, and in particular in the wastewater management conceptual model, which defines the relations between entities.

Download Full-text

Proceedings of the Workshop on Balto-Slavonic Natural Language Processing Information Extraction and Enabling Technologies - ACL '07

10.3115/1567545 ◽

2007 ◽

Cited By ~ 1

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Information Extraction ◽

Language Processing ◽

Enabling Technologies ◽

Processing Information

Download Full-text

Does higher education properly prepare graduates for the growing artificial intelligence market? Gaps identification using text mining

Human Systems Management ◽

10.3233/hsm-211179 ◽

2021 ◽

pp. 1-13

Author(s):

Lamiae Benhayoun ◽

Daniel Lang

Keyword(s):

Artificial Intelligence ◽

Natural Language Processing ◽

Text Mining ◽

Natural Language ◽

Language Processing ◽

Academic Training ◽

Market Requirements ◽

Job Advertisements ◽

The Individual

BACKGROUND: The renewed advent of Artificial Intelligence (AI) is inducing profound changes in the classic categories of technology professions and is creating the need for new specific skills. OBJECTIVE: Identify the gaps in terms of skills between academic training on AI in French engineering and Business Schools, and the requirements of the labour market. METHOD: Extraction of AI training contents from the schools’ websites and scraping of a job advertisements’ website. Then, analysis based on a text mining approach with a Python code for Natural Language Processing. RESULTS: Categorization of occupations related to AI. Characterization of three classes of skills for the AI market: Technical, Soft and Interdisciplinary. Skills’ gaps concern some professional certifications and the mastery of specific tools, research abilities, and awareness of ethical and regulatory dimensions of AI. CONCLUSIONS: A deep analysis using algorithms for Natural Language Processing. Results that provide a better understanding of the AI capability components at the individual and the organizational levels. A study that can help shape educational programs to respond to the AI market requirements.

Download Full-text