A Framework for Automated Scraping of Structured Data Records From the Deep Web Using Semantic Labeling

2022 ◽  
Vol 12 (1) ◽  
pp. 1-18
Author(s):  
Umamageswari Kumaresan ◽  
Kalpana Ramanujam

The intent of this research is to develop an automated web scraping system capable of extracting structured data records embedded in semi-structured web pages. Most automated extraction techniques in the literature capture repeated patterns among a set of similarly structured web pages, deduce the template used to generate those pages, and then extract the data records. All of these techniques rely on computationally intensive operations such as string pattern matching or DOM tree matching, followed by manual labeling of the extracted data records. The technique discussed in this paper departs from state-of-the-art approaches by determining informative sections of a web page through repetition of informative content rather than syntactic structure. The experiments show that the system identified data-rich regions with 100% precision for web sites belonging to different domains. Experiments conducted on real-world web sites demonstrate the effectiveness and versatility of the proposed approach.
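The abstract's core idea — locating data-rich regions by repetition of informative *content* rather than tag structure — can be sketched roughly as follows. This is a minimal illustration, not the authors' algorithm: the signature scheme, the section names, and the scoring rule are all assumptions made up for the example.

```python
from collections import Counter
import re

def content_signature(record: str) -> tuple:
    # Abstract each text fragment into a coarse content signature:
    # numeric/price tokens become <NUM>, everything else <WORD>, so
    # repetition is detected in the content, not the markup.
    tokens = []
    for tok in record.split():
        if re.fullmatch(r"\$?\d+(\.\d+)?", tok):
            tokens.append("<NUM>")
        else:
            tokens.append("<WORD>")
    return tuple(tokens)

def most_data_rich(sections: dict) -> str:
    # Pick the page section whose fragments repeat the same content
    # signature most often -- a toy stand-in for locating informative
    # regions through repetition of informative content.
    def score(fragments):
        counts = Counter(content_signature(f) for f in fragments)
        return max(counts.values())
    return max(sections, key=lambda name: score(sections[name]))

# Hypothetical page, pre-segmented into sections of text fragments.
page = {
    "nav":     ["Home", "About", "Contact us today"],
    "results": ["Nikon D3500 $499.99", "Canon EOS R10 $979.00",
                "Sony A6400 $898.00", "Fujifilm X-T30 $899.95"],
    "footer":  ["Copyright 2022"],
}
print(most_data_rich(page))  # prints "results"
```

The product-listing section wins because its records repeat the same word/number signature, while navigation and footer text do not.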

2021 ◽  
Author(s):  
Wen Chen ◽  
Andrea Boggero ◽  
Giovanni del Puente ◽  
Martina Olcese ◽  
Andrea Prestia ◽  
...  

BACKGROUND Suicide represents a public health concern, imposing a dramatic burden. Pro-suicide websites are “virtual pathways” that facilitate the onset of suicidal behaviors, especially among socially isolated, susceptible individuals. OBJECTIVE To characterize suicide-related web pages in the Italian language. METHODS The five most commonly used search engines in Italy (namely Bing©, Virgilio©, Yahoo©, Google©, and Libero©) were mined by searching for “suicidio” (Italian for suicide). For each search, the first 100 web pages were considered. Websites returned by each search were collected and duplicates deleted, so that only unique web pages were analyzed and rated using the HONcode© instrument. RESULTS Sixty-five web pages were included: 12.5% were anti-suicide and 6.3% explicitly pro-suicide. The majority of the included websites had a mixed/neutral attitude towards suicide (81.2%) and an informative content and purpose (60.9%). Most web pages targeted adolescents as their age group (59.4%), contained a reference to other psychiatric disorders/co-morbidities (65.6%), were produced under medical/professional supervision/guidance (70.3%), lacked figures/pictures related to suicide (64.1%), and did not impose any access restraint (96.9%). CONCLUSIONS The major shortcomings are the small sample size of web pages analyzed and a search limited to the keyword “suicide”. Specialized mental health professionals should try to improve their online presence and provide high-quality material.


Because of the rapid development of the web, websites have become the intruder’s principal target. As the number of web pages grows, malicious pages are likewise increasing, and attacks are becoming progressively more sophisticated, developing different ways to trick a user into visiting malicious websites and extracting credential information. This paper presents a detailed account of an ensemble-based machine learning approach for URL classification. Existing models either use outdated techniques or a limited set of features in their attack detection, which leads to lower detection rates. Ensemble classifiers, combined with a robust feature list for single and multi-attack-type detection, outperform all previously deployed techniques. The focus of the study is to arrive at a system model that yields better results with a higher accuracy rate.
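The ensemble idea can be illustrated with a deliberately tiny voting ensemble over lexical URL features. The three feature checks below (URL length, IP-address host, subdomain count) are common illustrative choices, not the paper's actual feature list, and the thresholds are assumptions.

```python
from urllib.parse import urlparse

# Three weak lexical checks over a URL; each votes 1 ("suspicious")
# or 0 ("benign"). The ensemble takes the majority vote.
def long_url(url: str) -> int:
    return int(len(url) > 75)

def has_ip_host(url: str) -> int:
    host = urlparse(url).hostname or ""
    # A host that is all digits once the dots are removed is an IP literal.
    return int(host.replace(".", "").isdigit())

def many_subdomains(url: str) -> int:
    host = urlparse(url).hostname or ""
    return int(host.count(".") >= 3)

VOTERS = [long_url, has_ip_host, many_subdomains]

def classify(url: str) -> str:
    # Majority vote across the weak classifiers.
    votes = sum(vote(url) for vote in VOTERS)
    return "malicious" if votes >= 2 else "benign"

print(classify("http://192.168.4.12/secure.login.update.example/verify?acct=1"))
print(classify("https://example.com/about"))
```

A real system of the kind the abstract describes would replace these hand-written voters with trained classifiers (e.g. decision trees over dozens of lexical and host-based features), but the combination-by-voting structure is the same.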


2002 ◽  
Vol 7 (1) ◽  
pp. 9-25 ◽  
Author(s):  
Moses Boudourides ◽  
Gerasimos Antypas

In this paper we present a simple simulation of the Internet World-Wide Web, in which one observes the appearance of web pages belonging to different web sites, covering a number of different thematic topics, and possessing links to other web pages. The goal of our simulation is to reproduce the form of the observed World-Wide Web and of its growth using a small number of simple assumptions. In our simulation, existing web pages may generate new ones as follows: first, each web page is equipped with a topic concerning its contents; second, links between web pages are established according to common topics; next, new web pages may be randomly generated and subsequently equipped with a topic and assigned to web sites. By repeated iteration of these rules, our simulation appears to exhibit the observed structure of the World-Wide Web and, in particular, a power-law type of growth. In order to visualise the network of web pages, we have followed N. Gilbert's (1997) methodology of scientometric simulation, assuming that web pages can be represented by points in the plane. Furthermore, the simulated graph is found to possess the small-world property, as is the case with a large number of other complex networks.
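A growth process of the kind described — new pages acquire a topic and link to existing pages on the same topic — can be sketched as a topic-constrained preferential-attachment simulation. This is a generic sketch under assumed parameters (5 topics, 500 new pages, linear preference), not a reproduction of the authors' exact rules.

```python
import random

random.seed(42)
TOPICS = 5

# Seed the web with one page per topic; links lists hold in-links.
pages = [{"topic": t, "links": []} for t in range(TOPICS)]

for _ in range(500):
    topic = random.randrange(TOPICS)          # new page gets a random topic
    new_id = len(pages)
    # Candidate link targets share the new page's topic; pages that are
    # already well linked are more likely to attract the new link
    # (preferential attachment restricted to a common topic).
    same_topic = [i for i, p in enumerate(pages) if p["topic"] == topic]
    weights = [1 + len(pages[i]["links"]) for i in same_topic]
    target = random.choices(same_topic, weights=weights, k=1)[0]
    pages[target]["links"].append(new_id)
    pages.append({"topic": topic, "links": []})

indegrees = sorted((len(p["links"]) for p in pages), reverse=True)
print(indegrees[:10])  # a few heavily linked hubs; most pages have no in-links
```

Rich-get-richer attachment of this form is the standard mechanism behind the heavy-tailed (power-law-like) degree distributions the abstract reports.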


2015 ◽  
Vol 1 (3) ◽  
pp. 351
Author(s):  
Hoger Mahmud Hussen ◽  
Mazen Ismaeel Ghareb ◽  
Zana Azeez Kaka Rash

Recently the Kurdistan Region of Iraq (KRI) has experienced an explosion in exposure to new technologies in different sectors, especially in media and telecommunication. The Internet is one of the technologies that has opened a way for information proliferation in a previously censored region. Developing websites to deliver news and other information is a relatively new phenomenon in Kurdistan, which means that the design and development of web pages may lack the required quality standards. In this paper the quality of web page interface design and usability in the field of news journalism in the KRI is examined against a set of web interface design and usability criteria. For data collection, 9 popular news websites were chosen and 900 questionnaires were sent to 100 random users. The results are analyzed, and we find that the majority of users are satisfied with the interface design and usability of the news web pages; however, the results point out some weaknesses that can be improved. The outcome of this research can be used to enhance website design and usability in the field of journalism in the KRI.


2021 ◽  
Vol 15 (4) ◽  
pp. 467-474
Author(s):  
Petra Ptiček ◽  
Ivana Žganjar ◽  
Miroslav Mikota ◽  
Mile Matijević

Information and communication technology is an important factor for national, regional, and local sustainable tourism development, according to the long-term Croatian national strategic plan. New forms of information, such as websites and new media and materials, as well as political and social change, all influence tourists’ decisions when choosing specific destinations. The aim of this research is to determine, based on an analysis of a tourism media campaign, the relationship between new communication trends and the use of photography as a medium that influences the experience of choosing a destination, and the importance of crucial information factors on web pages based on their technical and visual characteristics.


Author(s):  
Ang Jin Sheng et al.

XML has numerous uses in a wide variety of web pages and applications. Common uses of XML include web publishing, web searching and automation, and general applications such as storing, transferring, and displaying business process log data. The amount of information expressed in XML has grown rapidly, and much work has been done on sensible approaches to handling and reviewing XML documents. Mining XML documents offers a way to understand both the structure and the content of XML documents. A common approach to analysing XML documents is frequent subtree mining, a data mining technique that finds relationships between transactions in a tree-structured database. Because of the structure and content of the XML format, traditional data mining and statistical analysis can hardly be applied to obtain accurate results. This paper proposes a framework that can flatten tree-structured data into flat, structured data while preserving its structure and content. Converting these XML documents into relational structured data allows a range of data mining techniques and statistical tests to be applied to extract more information from the business process log.
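The flattening step — turning a tree-structured log into relational rows while keeping the structural context — can be sketched with the standard library's `xml.etree.ElementTree`. The `<trace>`/`<event>` schema below is a hypothetical process-log shape invented for the example, not the paper's actual input format.

```python
import xml.etree.ElementTree as ET

# A hypothetical business-process log in XML form.
XML = """
<log>
  <trace id="t1">
    <event><activity>register</activity><time>09:00</time></event>
    <event><activity>approve</activity><time>09:30</time></event>
  </trace>
  <trace id="t2">
    <event><activity>register</activity><time>10:00</time></event>
  </trace>
</log>
"""

def flatten(xml_text: str) -> list:
    # Walk each <trace>/<event> subtree and emit one flat row per event.
    # The structural context (trace id, event position) survives the
    # flattening as ordinary columns, so tree shape is not lost.
    rows = []
    root = ET.fromstring(xml_text)
    for trace in root.findall("trace"):
        for pos, event in enumerate(trace.findall("event"), start=1):
            row = {"trace": trace.get("id"), "pos": pos}
            for field in event:
                row[field.tag] = field.text
            rows.append(row)
    return rows

for row in flatten(XML):
    print(row)
```

Each dictionary is one relational tuple; at this point conventional tabular data mining or statistical tests can be applied directly.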


Author(s):  
Holly Yu

Through a series of federal and state laws and standards, a legal foundation has been established concerning Web accessibility for people with disabilities and their ability to fully overcome digital barriers and participate in the Web environment. Currently, the concept of accessible design, or universal design, is increasingly becoming an important component of Web design. However, unanswered questions in the laws, the absence of an obligation to fulfill legal requirements, and a general unawareness of the need to make Web pages accessible have created barriers to implementing the Americans with Disabilities Act (ADA), Section 504 of the Rehabilitation Act of 1973, Section 508 of the Rehabilitation Act as amended in 1998, and others. In many cases, the absence of obligation is due to unfamiliarity with the legal responsibility of creating accessible websites. As a result, responses to Web accessibility concerns frequently come about only on an ad hoc basis. Identifying these barriers is the first step toward solutions. There are legal and practical approaches for addressing Web accessibility issues in policies, education, research and development, and technology and tools.


Author(s):  
F. Dianne Lux Wigand

This author argues for a stronger end-user and citizen-centric approach to the development and evaluation of e-government services provided via the Internet. Over the past decade, government agencies at all levels have created websites that provide primarily information and offer only a few two-way transactions. The predicted and hoped-for transformation of government at all levels due to the advent of Internet services has yet to occur, and the overall development of e-government services has been slow and uneven. To add value to existing and future government websites, public administrators need to come to grips with the framework presented here: to understand the nature of, and relationships among, three variables — end-user, task, and channel characteristics — and then consider their respective roles and impact on channel selection. This framework, along with an end-user perspective, enables public administrators to assess not only the value of current information and service channels but also newer information and communication technologies such as those found in Web 2.0 or social media developments. Recommendations are offered.

