Web Archiving: Techniques, Challenges, and Solutions

2013 ◽  
Vol 5 (3) ◽  
pp. 598-603
Author(s):  
Adoghe Anthony ◽  
Kayode Onasoga ◽  
Dike Ike ◽  
Olujimi Ajayi

Web archiving is the process of collecting valuable content from the World Wide Web in an archival format, so that the information can be managed independently and preserved for the general public, historians, researchers, and future generations. If the Web is not preserved, valuable content will eventually be lost forever. The Web is a very valuable source of information, and several governmental and private institutions are involved in archiving parts of it for various purposes. This paper gives an overview of web archiving, describes the techniques used in web archiving, discusses some challenges encountered during web archiving, and gives possible solutions to these challenges.
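
To make the basic capture step concrete, the following is a minimal sketch of the simplest technique the paper surveys, client-side crawling and snapshotting: fetch a page, record a timestamp, and store the raw bytes. The function and file-naming scheme are illustrative assumptions, not the paper's implementation; production archives typically store captures in the WARC container format using dedicated crawlers such as Heritrix.

```python
# Minimal sketch of a single-page web capture (illustrative only).
# Real archives use the WARC format and crawlers such as Heritrix;
# this shows only the basic fetch-and-store idea.
import hashlib
import urllib.request
from datetime import datetime, timezone
from pathlib import Path

def snapshot(url: str, out_dir: str = "archive") -> Path:
    """Fetch `url` and store its raw bytes under a timestamped name."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        body = resp.read()
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    digest = hashlib.sha1(url.encode()).hexdigest()[:10]  # stable per-URL id
    path = Path(out_dir) / f"{digest}-{stamp}.html"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(body)
    return path

if __name__ == "__main__":
    print(snapshot("https://example.org/"))
```

Repeating the capture on a schedule yields the series of dated snapshots that lets an archive show how a page changed over time.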

2008 ◽  
Vol 90 (3) ◽  
pp. 92-95
Author(s):  
K Kok ◽  
AR Parikh ◽  
A Clarke ◽  
AV Kaisary ◽  
PEM Butler

The World Wide Web is the fastest-growing health information medium. In 2001, 52 million adults in America accessed the web to obtain such information.1 Cancer has been shown to be among the top three health topics searched for on the internet. A survey performed by American oncologists estimated that approximately 30% of their patients use the internet to obtain information. Other surveys have shown that up to 50% of cancer patients use the internet for this purpose. The internet is also seen as an important source of information for family members and caregivers of cancer patients.


2016 ◽  
Vol 35 (3) ◽  
pp. 64-72 ◽  
Author(s):  
Liladhar R. Pendse

Purpose
The purpose of this paper is to highlight web archiving as a tool for collection development in a research-level academic library, through a web-archiving project on the contemporary conflict in Ukraine. As the conflict in Ukraine drags on, the need to collect and preserve information from web-based resources with different ideological orientations acquires special importance. The demise of the Soviet Union in 1991 and the emergence of independent republics were heralded by some as a peaceful transition to "free-market" economies. This transition was nevertheless nuanced and far from seamless: besides incomplete market liberalization and rent-seeking behaviors of various sorts, it was accompanied by almost ubiquitous use of and access to the internet and internet communication technologies. Now, 24 years later, the ongoing conflict in Ukraine also appears to be unfolding on the World Wide Web. With the Russian annexation of Crimea and its unification with the Russian Federation, the governmental and non-governmental websites of Ukrainian Crimea suddenly came to represent a sort of "endangered archive".
Design/methodology/approach
The main purpose of this project was to make the information contained in Ukrainian and Russian websites available to a wider body of scholars and students, over a longer period of time, in a web archive. The author does not take any ideological stance on the legal status of Crimea or on the ongoing conflict in Ukraine. Several projects are currently devoted to the preservation of these websites; this article surveys the landscape of those projects and highlights the ongoing web-archiving project entitled "The Ukraine Crisis: 2014-2015" at the UC Berkeley Library.
Findings
UC Berkeley's Ukraine Conflict Archive was made available to the public in March 2015, once enough materials had been archived. The initial purpose of the archive was to selectively harvest and archive those websites that were bound either to disappear or to change significantly during Crimea's accession to Russia. In the aftermath of the Crimean conflict, however, the ensuing military conflict in Ukraine forced a reevaluation of the web-archiving strategy. The project was never envisioned as a competitor to the Ukraine Conflict project; instead, it was meant to capture complementary data that similar projects might have missed. The web archive has been made public to provide a glimpse of what was happening, and what is happening, in Ukraine.
Research limitations/implications
The impetus for archiving the selected Ukrainian websites came from the changing geopolitical realities of Crimea. The daily changes to these websites, and the loss of the information contained within them, are among the many problems faced by their users; in some cases, the likelihood of these websites disappearing is relatively high. This in turn was followed by the author's desire to preserve information about daily life in Ukraine's east in light of the unfolding violent armed conflict.
Originality/value
A close survey of currently published Library and Information Science articles on the Ukraine conflict found none dedicated to archiving the Crimean and Ukrainian situations.


2011 ◽  
Vol 7 (1) ◽  
Author(s):  
Marckson Roberto Ferreira de Sousa ◽  
Edilson Leite da Silva ◽  
Guilherme Ataíde Dias ◽  
Maria Amélia Teixeira da Silva ◽  
Frederico Luiz Gonçalves de Freitas ◽  
...  

This paper presents an ontology to model the field of Information Architecture for the Web (Web IA), according to the precepts defined by Morville and Rosenfeld in the 2006 edition of the book Information Architecture for the World Wide Web. It aims to structure the knowledge related to the field of IA for the Web, formalizing this area and helping to teach the concepts and relationships of the domain. The research is theoretical and qualitative, and is classified as descriptive and exploratory. The modeling was performed using the Web Ontology Language (OWL) and the Protégé 3.4.1 framework, following the steps of the Ontology Development 101 methodology. The results present InfoArch, an ontology that represents the concepts and relationships of the domain and makes it possible to answer questions about the area. InfoArch contributes especially to teaching, research, and extension, as it will serve as a source of information for researchers, teachers, and website development teams that work with Information Architecture for the Web.

Keywords: information architecture for the World Wide Web; ontology; site development; semantic web
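
As an illustration of what such a domain model looks like in practice, here is a minimal sketch of a few ontology classes and one relationship expressed with Python's rdflib library. The class and property names are illustrative assumptions drawn from Morville and Rosenfeld's classic IA components (organization, labeling, navigation, and search systems); the actual InfoArch ontology was built in OWL with Protégé and may be structured differently.

```python
# Minimal sketch of an IA-for-the-Web ontology fragment in rdflib.
# Class and property names are illustrative; the real InfoArch
# ontology was modeled in Protégé and may differ.
from rdflib import Graph, Namespace, RDF, RDFS
from rdflib.namespace import OWL

IA = Namespace("http://example.org/infoarch#")  # hypothetical namespace
g = Graph()
g.bind("ia", IA)

# Top-level class and the four classic IA component systems.
g.add((IA.InformationArchitecture, RDF.type, OWL.Class))
for system in ("OrganizationSystem", "LabelingSystem",
               "NavigationSystem", "SearchSystem"):
    cls = IA[system]
    g.add((cls, RDF.type, OWL.Class))
    g.add((cls, RDFS.subClassOf, IA.InformationArchitecture))

# An object property relating a website to its IA components.
g.add((IA.Website, RDF.type, OWL.Class))
g.add((IA.hasComponent, RDF.type, OWL.ObjectProperty))
g.add((IA.hasComponent, RDFS.domain, IA.Website))
g.add((IA.hasComponent, RDFS.range, IA.InformationArchitecture))

print(g.serialize(format="turtle"))
```

Serializing the graph to Turtle makes the class hierarchy inspectable, which is how such an ontology can support the teaching use the paper describes.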


2008 ◽  
Vol 1 (3) ◽  
pp. 273-285 ◽  
Author(s):  
Yair Galily

From its explosive development in the last decade of the 20th century, the World Wide Web has become an ideal medium for dedicated sports fanatics and a useful resource for casual fans, as well. Its accessibility, interactivity, speed, and multimedia content have triggered a fundamental change in the delivery of mediated sports, a change for which no one can yet predict the outcome (Real, 2006). This commentary sheds light on a process in which the talk-back mechanism, which enables readers to comment on Web-published articles, is (re)shaping the sport realm in Israeli media. The study on which this commentary is based involved the comparative analysis of over 3,000 talk-backs from the sports sections of 3 daily Web news sites (Ynet, nrg, and Walla!). The argument is made that talk-backs serve not only as an extension of the journalistic sphere but also as a new source of information and debate.


Author(s):  
Anthony D. Andre

This paper provides an overview of the various human factors and ergonomics (HF/E) resources on the World Wide Web (WWW). A list of the most popular and useful HF/E sites is provided, along with several critical guidelines for using the WWW. The reader will gain a clear understanding of how to find HF/E information on the Web and how to use the Web successfully in various HF/E professional consulting activities. Finally, we consider the ergonomic implications of surfing the Web.


2017 ◽  
Vol 4 (1) ◽  
pp. 95-110 ◽  
Author(s):  
Deepika Punj ◽  
Ashutosh Dixit

In order to manage the vast information available on the web, the crawler plays a significant role, and its working should be optimized to obtain the maximum amount of unique information from the World Wide Web. In this paper, an architecture for a migrating crawler is proposed, based on URL ordering, URL scheduling, and a document redundancy elimination mechanism. The proposed ordering technique is based on URL structure, which plays a crucial role in utilizing the web efficiently. Scheduling ensures that each URL goes to the optimum agent for downloading; to achieve this, the characteristics of both agents and URLs are taken into consideration. Duplicate documents are removed to keep the database unique, and to reduce matching time, documents are matched on the basis of their meta information only. The agents of the proposed migrating crawler work more efficiently than a traditional single crawler by providing ordering and scheduling of URLs.
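
The following is a minimal sketch of how URL ordering, agent scheduling, and metadata-based duplicate elimination could fit together. The scoring heuristic (shallower paths first), the host-based agent assignment, and all names are illustrative assumptions; the paper's actual architecture is more elaborate.

```python
# Illustrative sketch of a migrating-crawler front end: order URLs by
# structure, schedule them to agents, and drop duplicate documents by
# hashing meta information. Heuristics and names are assumptions, not
# the paper's exact design.
import hashlib
from urllib.parse import urlparse

def structure_score(url: str) -> int:
    """Order by URL structure: fewer path segments and no query first."""
    p = urlparse(url)
    depth = len([s for s in p.path.split("/") if s])
    return depth + (2 if p.query else 0)

def schedule(urls: list[str], agents: list[str]) -> dict[str, list[str]]:
    """Assign ordered URLs to agents, keeping each host with one agent
    so a migrating agent stays close to the sites it visits."""
    assignment: dict[str, list[str]] = {a: [] for a in agents}
    ordered = [u for _, u in sorted((structure_score(u), u) for u in urls)]
    for url in ordered:
        agent = agents[hash(urlparse(url).netloc) % len(agents)]
        assignment[agent].append(url)
    return assignment

seen_meta: set[str] = set()

def is_duplicate(meta: str) -> bool:
    """Match documents on meta information only (e.g., title and
    headers) to avoid comparing full page bodies."""
    key = hashlib.md5(meta.encode()).hexdigest()
    if key in seen_meta:
        return True
    seen_meta.add(key)
    return False
```

Hashing only the meta information trades a small risk of false matches for a large reduction in comparison time, which mirrors the redundancy-elimination trade-off the abstract describes.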


2021 ◽  
Author(s):  
Michael Dick

Since it was first formally proposed in 1990 (and since the first website was launched in 1991), the World Wide Web has evolved from a collection of linked hypertext documents residing on the Internet, to a "meta-medium" featuring platforms that older media have leveraged to reach their publics through alternative means. However, this pathway towards the modernization of the Web has not been entirely linear, nor will it proceed as such. Accordingly, this paper problematizes the notion of "progress" as it relates to the online realm by illuminating two distinct perspectives on the realized and proposed evolution of the Web, both of which can be grounded in the broader debate concerning technological determinism versus the social construction of technology: on the one hand, the centralized and ontology-driven shift from a human-centred "Web of Documents" to a machine-understandable "Web of Data" or "Semantic Web", which is supported by the Web's inventor, Tim Berners-Lee, and the organization he heads, the World Wide Web Consortium (W3C); on the other, the decentralized and folksonomy-driven mechanisms through which individuals and collectives exert control over the online environment (e.g., through the social networking applications that have come to characterize the contemporary period of "Web 2.0"). Methodologically, the above is accomplished through a sustained exploration of theory derived from communication and cultural studies, which discursively weaves these two viewpoints together with a technical history of recent W3C projects. As a case study, it is asserted that the forward slashes contained in a Uniform Resource Identifier (URI) were a social construct that was eventually rendered extraneous by the end-user community. By focusing on the context of the technology itself, it is anticipated that this paper will contribute to the broader debate concerning the future of the Web and its need to move beyond a determinant "modernization paradigm" or over-arching ontology, as well as advance the potential connections that can be cultivated with cognate disciplines.


Author(s):  
Punam Bedi ◽  
Neha Gupta ◽  
Vinita Jindal

The World Wide Web is a part of the Internet that provides data dissemination facility to people. The contents of the Web are crawled and indexed by search engines so that they can be retrieved, ranked, and displayed as a result of users' search queries. These contents that can be easily retrieved using Web browsers and search engines comprise the Surface Web. All information that cannot be crawled by search engines' crawlers falls under Deep Web. Deep Web content never appears in the results displayed by search engines. Though this part of the Web remains hidden, it can be reached using targeted search over normal Web browsers. Unlike Deep Web, there exists a portion of the World Wide Web that cannot be accessed without special software. This is known as the Dark Web. This chapter describes how the Dark Web differs from the Deep Web and elaborates on the commonly used software to enter the Dark Web. It highlights the illegitimate and legitimate sides of the Dark Web and specifies the role played by cryptocurrencies in the expansion of Dark Web's user base.


Author(s):  
August-Wilhelm Scheer

The emergence of what we call today the World Wide Web, the WWW, or simply the Web, dates back to 1989, when Tim Berners-Lee proposed a hypertext system to manage information overload at CERN, Switzerland (Berners-Lee, 1989). This article outlines how his approaches evolved into the Web that drives today's information society and explores the full potential still ahead. What began as a wide-area hypertext information retrieval initiative quickly gained momentum due to the fast adoption of graphical browser programs and the standardization activities of the World Wide Web Consortium (W3C). In the beginning, based only on the standards of HTML, HTTP, and URL, the sites provided by the Web were static, meaning the information stayed unchanged until the original publisher decided on an update. For a long time, the WWW, today referred to as Web 1.0, was understood as a technical means to publish information to a vast audience across time and space. Data was kept locally, and Web sites were only occasionally updated by uploading files from the client to the Web server. Application software was limited to local desktops and operated only on local data. With the advent of dynamic concepts on the server side (scripting languages like the hypertext preprocessor (PHP) or Perl, and Web applications with JSP or ASP) and on the client side (e.g., JavaScript), the WWW became more dynamic. Server-side content management systems (CMS) allowed editing Web sites via the browser at run-time. These systems interact with multiple users through PHP interfaces that push information into server-side databases (e.g., MySQL), which in turn feed Web sites with content. Thus, the Web became accessible and editable not only for programmers and "techies" but also for the common user. Yet technological limitations such as slow Internet connections, consumer-unfriendly Internet rates, and poor multimedia support still inhibited mass usage of the Web. It took broadband Internet access, flat rates, and digitized media processing for the Web to catch on.
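
To illustrate the shift described above from static files to server-generated pages, here is a minimal sketch of a dynamic page in the spirit of those early CMS setups: an HTTP handler pulls content from a database at request time, so editing the database row changes the site without uploading new files. It uses Python's standard library as a stand-in for the PHP/MySQL stack named in the text, and all names are illustrative.

```python
# Minimal sketch of a dynamic Web page: content lives in a database
# and is rendered per request, in the spirit of early CMS setups.
# Python stand-in for the PHP/MySQL stack described in the text.
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer

DB = "site.db"

def init_db() -> None:
    with sqlite3.connect(DB) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS pages "
                     "(path TEXT PRIMARY KEY, body TEXT)")
        conn.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)",
                     ("/", "<h1>Hello from the database</h1>"))

class Handler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        # Look up the requested path at request time, not at publish time.
        with sqlite3.connect(DB) as conn:
            row = conn.execute("SELECT body FROM pages WHERE path = ?",
                               (self.path,)).fetchone()
        body = row[0] if row else "<h1>404 Not Found</h1>"
        self.send_response(200 if row else 404)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.end_headers()
        self.wfile.write(body.encode("utf-8"))

if __name__ == "__main__":
    init_db()
    HTTPServer(("localhost", 8000), Handler).serve_forever()
```

Updating the row for "/" changes what every subsequent visitor sees, which is exactly the editable-via-the-browser property that made CMS-driven sites accessible to non-programmers.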


Author(s):  
Bill Karakostas ◽  
Yannis Zorgios

Chapter II presented the main concepts underlying business services. Ultimately, as this book proposes, business services need to be decomposed into networks of executable Web services. Web services are the primary software technology available today that closely matches the characteristics of business services. To understand the mapping from business services to Web services, we need to understand the fundamental characteristics of the latter. This chapter therefore introduces the main Web services concepts and standards. It does not intend to be a comprehensive description of all standards applicable to Web services, as many of them are still in a state of flux. It focuses instead on the more important and stable standards. All such standards are fully and precisely defined and maintained by the organizations that have defined and endorsed them, such as the World Wide Web Consortium (http://w3c.org), the OASIS organization (http://www.oasis-open.org), and others. We advise readers to periodically visit the Web sites describing the various standards to obtain the up-to-date versions.

