Lost in the Infinite Archive: The Promise and Pitfalls of Web Archives

2016 ◽  
Vol 10 (1) ◽  
pp. 78-94 ◽  
Author(s):  
Ian Milligan

Contemporary and future historians need to grapple with and confront the challenges posed by web archives. These large collections of material, accessed either through the Internet Archive's Wayback Machine or through other computational methods, represent both a challenge and an opportunity for historians. Through these collections, we have the potential to access the voices of millions of non-elite individuals (recognizing of course the cleavages in both Web access as well as method of access). To put this in perspective, the Old Bailey Online currently describes its monumental holdings of 197,745 trials between 1674 and 1913 as the “largest body of texts detailing the lives of non-elite people ever published.” GeoCities.com, a platform for everyday web publishing in the mid-to-late 1990s and early 2000s, amounted to over thirty-eight million individual webpages. Historians will have access, in some form, to millions of pages: written by everyday people of various classes, genders, ethnicities, and ages. While the Web was not a perfect democracy by any means – it was and is unevenly accessed across each of those categories – this still represents a massive collection of non-elite speech. Yet a figure like thirty-eight million webpages is both a blessing and a curse. We cannot read every website, and must instead rely upon discovery tools to find the information that we need. Yet these tools largely do not exist for web archives, or are in a very early state of development: what will they look like? What information do historians want to access? We cannot simply map over web tools optimized for discovering current information through online searches or metadata analysis. We need to find information that mattered at the time, to diverse and very large communities. Furthermore, web pages cannot be viewed in isolation, outside of the networks that they inhabited. In theory, amongst corpora of millions of pages, researchers can find evidence to confirm almost any claim. The trick is situating it in a larger social and cultural context: is it representative? Unique? In this paper, “Lost in the Infinite Archive,” I explore what the future of digital methods for historians will be when they need to explore web archives. Historical research of periods beginning in the mid-1990s will need to use web archives, and right now we are not ready. This article draws on first-hand research with the Internet Archive and Archive-It web archiving teams. It draws upon three exhaustive datasets: the large Web ARChive (WARC) files that make up Wide Web Scrapes of the Web; the metadata-intensive WAT files that provide networked contextual information; and the lifted-straight-from-the-web guerilla archives generated by groups like Archive Team. Through these case studies, we can see – hands-on – what richness and potentials lie in these new cultural records, and what approaches we may need to adopt. It helps underscore the need to have humanists involved at this early, crucial stage.
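At this scale, historians will read WARC files programmatically rather than page by page. As a rough illustration, the sketch below uses the open-source warcio library in Python to iterate over a crawl file and collect the URIs of archived HTML pages; the filename is a placeholder, not a reference to any specific dataset named in the article.

```python
# A minimal sketch of reading a web archive with warcio; "example.warc.gz" is a
# hypothetical local file, not one of the datasets discussed in the article.
from warcio.archiveiterator import ArchiveIterator

def html_page_uris(warc_path):
    """Collect the target URIs of HTML response records in a WARC file."""
    uris = []
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            # Some response records carry no HTTP headers (e.g. DNS captures).
            content_type = (record.http_headers.get_header("Content-Type")
                            if record.http_headers else None)
            if content_type and "text/html" in content_type:
                uris.append(record.rec_headers.get_header("WARC-Target-URI"))
    return uris

if __name__ == "__main__":
    pages = html_page_uris("example.warc.gz")
    print(f"Found {len(pages)} archived HTML pages")
```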

2011 ◽  
pp. 1462-1477 ◽  
Author(s):  
K. Selvakuberan ◽  
M. Indra Devi ◽  
R. Rajaram

The World Wide Web serves as a huge, widely distributed, global information service center for news, advertisements, customer information, financial management, education, government, e-commerce, and many other purposes. The Web contains a rich and dynamic collection of hyperlink information, and Web page access and usage information provide rich sources for data mining. Web pages are classified based on the content and/or contextual information embedded in them. Because Web pages contain many irrelevant, infrequent, and stop words that reduce the performance of the classifier, selecting relevant, representative features from the Web page is an essential preprocessing step. This also supports secure access to the required information. Web access and usage information can be mined to authenticate the user accessing the Web page. This information may be used to personalize the information presented to users and to preserve their privacy by hiding personal details. The challenge lies in selecting the features that represent the Web pages and in processing the details the user needs. In this article we focus on feature selection, issues in feature selection, and the most important feature selection techniques described and used by researchers.
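As a concrete illustration of the preprocessing step described above, the following sketch removes English stop words, weights terms with TF-IDF, and keeps only the terms most associated with the class labels using a chi-square test. It uses scikit-learn; the documents and labels are toy data, not from the article.

```python
# A minimal sketch of term-based feature selection for Web page classification.
# The pages and labels below are illustrative toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

pages = [
    "stock prices and financial markets news",
    "government education policy announcement",
    "online shopping deals and customer reviews",
    "central bank interest rate decision",
]
labels = [0, 1, 2, 0]  # e.g. finance, government, e-commerce

# Stop-word removal and term weighting.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(pages)

# Keep only the k terms most associated with the class labels.
selector = SelectKBest(chi2, k=5)
X_reduced = selector.fit_transform(X, labels)

selected_terms = [vectorizer.get_feature_names_out()[i]
                  for i in selector.get_support(indices=True)]
print(selected_terms)
```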


Author(s):  
Sandra Folie

In the perception of literary scholars, the investigation of genre histories is still closely linked to ‘offline’ archival work. However, the Internet has been publicly accessible since 1991, and over the last thirty years, numerous new literary genres have emerged. They have often been proclaimed, defined, spread, marketed, criticized, and even pronounced dead online. By now, a great deal of this digital material is said to have disappeared. What many scholars do not consider, however, is that parts of the web are archived, for example by the Internet Archive and Wikipedia, which make their archives publicly available via the Wayback Machine and the history page respectively. This makes it possible to track early online definitions of contemporary genres and their development. In this paper, I will use the chick lit genre, which emerged in the second half of the 1990s, as a case study to show the benefits of including web archives in the reconstruction of contemporary genre histories. An analysis of both the first extensive and long-running fan websites, which are now offline but well-documented in the Internet Archive, and the history page of the Wikipedia article on chick lit will challenge some of the narratives that have long dominated chick lit research.
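Snapshot lookups of the kind described here can also be automated. The sketch below queries the Internet Archive's public Wayback Machine availability API for the capture closest to a given date; the domain is a placeholder, not one of the fan sites analysed in the article.

```python
# A minimal sketch of locating dated snapshots of a now-offline site through the
# Wayback Machine availability API. "example.com" stands in for an actual site.
import requests

def closest_snapshot(url, timestamp):
    """Return the archived snapshot closest to a YYYYMMDD timestamp, or None."""
    resp = requests.get(
        "http://archive.org/wayback/available",
        params={"url": url, "timestamp": timestamp},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("archived_snapshots", {}).get("closest")

for year in ("1999", "2002", "2005"):
    snap = closest_snapshot("example.com", year + "0101")
    if snap:
        print(year, snap["url"])
```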


Author(s):  
K. Selvakuberan ◽  
M. Indra Devi ◽  
R. Rajaram

The World Wide Web serves as a huge, widely distributed, global information service center for news, advertisements, customer information, financial management, education, government, e-commerce, and many other purposes. The Web contains a rich and dynamic collection of hyperlink information, and Web page access and usage information provide rich sources for data mining. Web pages are classified based on the content and/or contextual information embedded in them. Because Web pages contain many irrelevant, infrequent, and stop words that reduce the performance of the classifier, selecting relevant, representative features from the Web page is an essential preprocessing step. This also supports secure access to the required information. Web access and usage information can be mined to authenticate the user accessing the Web page. This information may be used to personalize the information presented to users and to preserve their privacy by hiding personal details. The challenge lies in selecting the features that represent the Web pages and in processing the details the user needs. In this chapter we focus on feature selection, issues in feature selection, and the most important feature selection techniques described and used by researchers.
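To complement the term-selection example given earlier, the following sketch pulls the content and contextual signals mentioned in the chapter (title, meta description, headings, anchor text) out of raw HTML so they can be fed into a feature selector. It uses BeautifulSoup on a toy page; the markup is illustrative only.

```python
# A minimal sketch of extracting content and contextual features from HTML.
# The page below is a toy example, not taken from the chapter.
from bs4 import BeautifulSoup

html = """
<html><head><title>Savings accounts and interest rates</title>
<meta name="description" content="Compare bank savings accounts"></head>
<body><h1>Personal finance</h1>
<a href="/loans">low-interest loans</a></body></html>
"""

soup = BeautifulSoup(html, "html.parser")
features = {
    "title": soup.title.get_text(strip=True) if soup.title else "",
    "meta_description": (soup.find("meta", attrs={"name": "description"}) or {}).get("content", ""),
    "headings": [h.get_text(strip=True) for h in soup.find_all(["h1", "h2"])],
    "anchor_text": [a.get_text(strip=True) for a in soup.find_all("a")],
}
print(features)
```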


Think India ◽  
2019 ◽  
Vol 22 (2) ◽  
pp. 174-187
Author(s):  
Harmandeep Singh ◽  
Arwinder Singh

Nowadays, the internet provides people with a wide range of services across different fields. Both profit and non-profit organizations use the internet for various business purposes, one of the major ones being the communication of financial as well as non-financial information on their websites. This study examines the top 30 BSE-listed public sector companies to measure the extent of governance disclosure (non-financial information) on their web pages. A disclosure index approach was used to examine the extent of governance disclosure on the internet. The governance index was constructed and broadly categorized into three dimensions: organization and structure; strategy and planning; and accountability, compliance, philosophy, and risk management. The empirical evidence reveals that all the Indian public sector companies studied have a website and that, on average, 67% of companies disclosed some kind of governance information directly on their websites. Further, we found wide variation in web disclosure between the three categories, i.e., Maharatnas, Navratnas, and Miniratnas. However, the Kruskal-Wallis test indicates that the difference between the three categories is not statistically significant. The study provides valuable insights into the Indian economy. It shows that Indian public sector companies use the internet for governance disclosure to some extent, but that disclosure lacks consistency, largely because there is no regulation governing web disclosure. The study therefore recommends a regulatory framework for web disclosure so that stakeholders can be assured of the transparency and reliability of the information.
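A minimal sketch of the two quantitative steps in the study, under the assumption of a binary disclosure checklist: each company is scored as the fraction of checklist items it discloses, and the three categories are compared with the Kruskal-Wallis H test via scipy. The values below are illustrative, not the study's data.

```python
# A minimal sketch of a disclosure index plus a Kruskal-Wallis comparison.
# All index values are hypothetical.
from scipy.stats import kruskal

def disclosure_index(checklist):
    """Fraction of checklist items (1 = disclosed, 0 = not) found on the website."""
    return sum(checklist) / len(checklist)

# Hypothetical per-company index values grouped by category.
maharatnas = [0.72, 0.65, 0.80, 0.58]
navratnas  = [0.70, 0.61, 0.74, 0.66]
miniratnas = [0.68, 0.55, 0.77, 0.63]

h_stat, p_value = kruskal(maharatnas, navratnas, miniratnas)
print(f"H = {h_stat:.2f}, p = {p_value:.3f}")
```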


Leonardo ◽  
1999 ◽  
Vol 32 (5) ◽  
pp. 353-358 ◽  
Author(s):  
Noah Wardrip-Fruin

We look to media as memory, and a place to memorialize, when we have lost. Hypermedia pioneers such as Ted Nelson and Vannevar Bush envisioned the ultimate media within the ultimate archive—with each element in continual flux, and with constant new addition. Dynamism without loss. Instead we have the Web, where “Not Found” is a daily message. Projects such as the Internet Archive and Afterlife dream of fixing this uncomfortable impermanence. Marketeers promise that agents (indentured information servants that may be the humans of About.com or the software of “Ask Jeeves”) will make the Web comfortable through filtering—hiding the impermanence and overwhelming profluence that the Web's dynamism produces. The Impermanence Agent—a programmatic, esthetic, and critical project created by the author, Brion Moss, a.c. chapman, and Duane Whitehurst— operates differently. It begins as a storytelling agent, telling stories of impermanence, stories of preservation, memorial stories. It monitors each user's Web browsing, and starts customizing its storytelling by weaving in images and texts that the user has pulled from the Web. In time, the original stories are lost. New stories, collaboratively created, have taken their place.


Author(s):  
John DiMarco

Web authoring is the process of developing Web pages. The Web development process requires you to use software to create functional pages that will work on the Internet. Adding Web functionality means creating specific components within a Web page that do something. Adding links, rollover graphics, and interactive multimedia items to a Web page are examples of enhanced functionality. This chapter demonstrates Web-based authoring techniques using Macromedia Dreamweaver. The focus is on adding Web functions to pages generated from Macromedia Fireworks and on giving an overview of creating Web pages from scratch using Dreamweaver. Dreamweaver and Fireworks are professional Web applications. Using professional Web software will benefit you tremendously. There are other ways to create Web pages using applications not specifically made to create Web pages. These applications include Microsoft Word and Microsoft PowerPoint. The use of Microsoft applications for Web page development is not covered in this chapter. However, I do provide steps on how to use these applications for Web page authoring within the appendix of this text. If you feel that you are more comfortable using the Microsoft applications, or the Macromedia applications simply aren’t available to you yet, follow the same process for Web page conceptualization and content creation and use the programs available to you. You should try to get Web page development skills using Macromedia Dreamweaver because it helps you expand your software skills outside of basic office applications. The ability to create a Web page using professional Web development software is important to building a high-end computer skill set. The main objectives of this chapter are to get you involved in some technical processes that you’ll need to create the Web portfolio. Focus will be on guiding you through opening your sliced pages, adding links, using tables, creating pop-up windows for content, and using layers and timelines for dynamic HTML. The coverage will not try to provide a complete tutorial set for Macromedia Dreamweaver, but will highlight essential techniques. Along the way you will get pieces of hand-coded ActionScript and JavaScript. You can decide which pieces you want to use in your own Web portfolio pages. The techniques provided are a concentrated workflow for creating Web pages. Let us begin to explore Web page authoring.


2016 ◽  
Vol 34 (3) ◽  
pp. 241-247 ◽  
Author(s):  
Elissa Kozlov ◽  
Brian D. Carpenter

Background and Aim: Americans rely on the Internet for health information, and people are likely to turn to online resources to learn about palliative care as well. The purpose of this study was to analyze online palliative care information pages to evaluate the breadth of their content. We also compared how frequently basic facts about palliative care appeared on the Web pages to expert rankings of the importance of those facts to understanding palliative care. Design: Twenty-six pages were identified. Two researchers independently coded each page for content. Palliative care professionals (n = 20) rated the importance of content domains for comparison with content frequency in the Web pages. Results: We identified 22 recurring broad concepts about palliative care. Each information page included, on average, 9.2 of these broad concepts (standard deviation [SD] = 3.36, range = 5-15). Similarly, each broad concept was present in an average of 45% of the Web pages (SD = 30.4%, range = 8%-96%). Significant discrepancies emerged between expert ratings of the importance of the broad concepts and the frequency of their appearance in the Web pages (rτ = .25, P > .05). Conclusion and Implications: This study demonstrates that palliative care information pages available online vary considerably in their content coverage. Furthermore, information that palliative care professionals rate as important for consumers to know is not always included in Web pages. We developed guidelines for information pages for the purpose of educating consumers in a consistent way about palliative care.
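As an illustration of the comparison reported in the results, the sketch below correlates hypothetical expert importance ratings with the share of pages mentioning each concept using Kendall's tau from scipy; the numbers are invented for the example and are not the study's data.

```python
# A minimal sketch of correlating expert ratings with content frequency.
# One value per broad concept; all values are hypothetical.
from scipy.stats import kendalltau

expert_importance = [4.8, 4.5, 4.2, 3.9, 3.6, 3.1, 2.8, 2.5]          # mean expert rating
page_frequency    = [0.40, 0.96, 0.25, 0.60, 0.08, 0.55, 0.30, 0.45]  # share of pages

tau, p_value = kendalltau(expert_importance, page_frequency)
print(f"tau = {tau:.2f}, p = {p_value:.3f}")
```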


2013 ◽  
Vol 5 (2) ◽  
pp. 365-373 ◽  
Author(s):  
H. Keller-Rudek ◽  
G. K. Moortgat ◽  
R. Sander ◽  
R. Sörensen

Abstract. We present the MPI-Mainz UV/VIS Spectral Atlas of Gaseous Molecules, which is a large collection of absorption cross sections and quantum yields in the ultraviolet and visible (UV/VIS) wavelength region for gaseous molecules and radicals primarily of atmospheric interest. The data files contain results of individual measurements, covering research of almost a whole century. To compare and visualize the data sets, multicoloured graphical representations have been created. The MPI-Mainz UV/VIS Spectral Atlas is available on the Internet at http://www.uv-vis-spectral-atlas-mainz.org. It now appears with improved browse and search options, based on new database software. In addition to the Web pages, which are continuously updated, a frozen version of the data is available under the doi:10.5281/zenodo.6951.
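Below is a minimal sketch of reading one of the atlas' plain-text data files in Python, assuming the common two-column layout of wavelength (nm) and absorption cross section (cm² per molecule); individual files can differ, so the header and units should be checked against the page for each dataset.

```python
# A minimal sketch of parsing a two-column cross-section file; the filename in
# the usage comment is hypothetical, standing in for a file downloaded from the atlas.
import numpy as np

def read_cross_sections(path):
    """Return wavelengths (nm) and absorption cross sections (cm^2) from a two-column file."""
    data = np.loadtxt(path, comments="#")
    wavelengths_nm = data[:, 0]
    cross_sections_cm2 = data[:, 1]
    return wavelengths_nm, cross_sections_cm2

# Example usage (hypothetical local filename):
# wl, xs = read_cross_sections("O3_absorption.txt")
```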


First Monday ◽  
1997 ◽  
Author(s):  
Steven M. Friedman

The power of the World Wide Web, it is commonly believed, lies in the vast information it makes available; "Content is king," the mantra runs. This image creates the conception of the Internet as most of us envision it: a vast, horizontal labyrinth of pages which connect almost arbitrarily to each other, creating a system believed to be "democratic" in which anyone can publish Web pages. I am proposing a new, vertical and hierarchical conception of the Web, observing the fact that almost everyone searching for information on the Web has to go through filter Web sites of some sort, such as search engines, to find it. The Albert Einstein Online Web site provides a paradigm for this re-conceptualization of the Web, based on a distinction between the wealth of information and that which organizes it and frames the viewers' conceptions of the information. This emphasis on organization implies that we need a new metaphor for the Internet; the hierarchical "Tree" would be more appropriate organizationally than a chaotic "Web." This metaphor needs to be changed because the current one implies an anarchic and random nature to the Web, and this implication may turn off potential Netizens, who can be scared off by such overwhelming anarchy and the difficulty of finding information.


2007 ◽  
Vol 16 (05) ◽  
pp. 793-828 ◽  
Author(s):  
JUAN D. VELÁSQUEZ ◽  
VASILE PALADE

Understanding web users' browsing behaviour in order to adapt a web site to the needs of a particular user represents a key issue for many commercial companies that do their business over the Internet. This paper presents the implementation of a Knowledge Base (KB) for building web-based computerized recommender systems. The Knowledge Base consists of a Pattern Repository that contains patterns extracted from web logs and web pages, by applying various web mining tools, and a Rule Repository containing rules that describe the use of discovered patterns for building navigation or web site modification recommendations. The paper also focuses on testing the effectiveness of the proposed online and offline recommendations. An extensive real-world experiment is carried out on the web site of a bank.
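A minimal sketch of the two repositories described in the paper, reduced to plain Python data structures: mined patterns on one side, and rules that turn a matched pattern into a navigation recommendation on the other. All names, pages, and support values are illustrative, not the paper's implementation.

```python
# A minimal sketch of a Pattern Repository and Rule Repository for navigation
# recommendations. All data is illustrative.
from dataclasses import dataclass

@dataclass
class Pattern:
    pattern_id: str
    pages: tuple          # frequently co-visited pages mined from web logs
    support: float        # how often the pattern occurs in the logs

@dataclass
class Rule:
    pattern_id: str
    recommend: str        # page to suggest when the pattern matches a session

pattern_repository = [Pattern("p1", ("/loans", "/rates"), 0.18)]
rule_repository = [Rule("p1", "/loan-calculator")]

def recommend(session_pages):
    """Return online navigation suggestions for the current browsing session."""
    suggestions = []
    for pattern in pattern_repository:
        if set(pattern.pages).issubset(session_pages):
            suggestions += [r.recommend for r in rule_repository
                            if r.pattern_id == pattern.pattern_id]
    return suggestions

print(recommend({"/loans", "/rates"}))
```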

