Semi-Structured Data Extraction from Heterogeneous Sources

Author(s):  
Xiaoying Gao ◽  
Leon Sterling

The World Wide Web is known as the “universe of network-accessible information, the embodiment of human knowledge” (W3C, 1999). Internet-based knowledge management aims to use the Internet as a worldwide environment for publishing, searching, sharing, reusing, and integrating knowledge, and for supporting collaboration and decision making. However, knowledge on the Internet is buried in documents, most of which are written in natural language for human readers, so the knowledge they contain cannot easily be accessed by computer programs such as knowledge management systems. To make the Internet “machine readable,” information extraction from Web pages has become a crucial research problem.
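As an illustration only, and not the authors' system, the following minimal Python sketch shows one common extraction task: pulling label/value pairs out of a simple two-column HTML table, a typical semi-structured layout. The page layout and field names are assumptions made for the example.

from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collects the text of <td> cells and pairs them up as (label, value)."""
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []                      # text of each <td> in document order

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data.strip()

    def pairs(self):
        # assumed layout: a label cell followed by its value cell
        return dict(zip(self.cells[0::2], self.cells[1::2]))

page = "<table><tr><td>Author</td><td>X. Gao</td></tr><tr><td>Year</td><td>2004</td></tr></table>"
extractor = TableExtractor()
extractor.feed(page)
print(extractor.pairs())                     # {'Author': 'X. Gao', 'Year': '2004'}

Real heterogeneous sources would require per-site wrappers or learned extraction rules rather than this fixed cell-pairing assumption.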

2002 ◽  
Vol 7 (1) ◽  
pp. 9-25 ◽  
Author(s):  
Moses Boudourides ◽  
Gerasimos Antypas

In this paper we present a simple simulation of the World Wide Web, in which one observes the appearance of web pages belonging to different web sites, covering a number of different thematic topics, and possessing links to other web pages. The goal of our simulation is to reproduce the form of the observed World Wide Web, and of its growth, using a small number of simple assumptions. In our simulation, existing web pages may generate new ones as follows. First, each web page is equipped with a topic describing its contents. Second, links between web pages are established according to common topics. Next, new web pages may be randomly generated and are subsequently equipped with a topic and assigned to web sites. By repeated iteration of these rules, our simulation appears to exhibit the observed structure of the World Wide Web and, in particular, a power-law type of growth. In order to visualise the network of web pages, we have followed N. Gilbert's (1997) methodology of scientometric simulation, assuming that web pages can be represented by points in the plane. Furthermore, the simulated graph is found to possess the small-world property, as is the case with a large number of other complex networks.
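The growth rules described above lend themselves to a short simulation. The Python sketch below is a toy version under assumed parameters (ten topics, three links per new page); it captures only the topical linking rule and does not necessarily reproduce the power-law or small-world behaviour reported by the authors.

import random
from collections import Counter

N_TOPICS = 10                 # assumed number of thematic topics
STEPS = 2000                  # assumed number of new pages to generate
LINKS_PER_NEW_PAGE = 3        # assumed number of links a new page attempts

topics = []                   # topics[i] = topic of page i
links = []                    # (source_page, target_page) pairs

def add_page():
    topic = random.randrange(N_TOPICS)
    page = len(topics)
    topics.append(topic)
    # link the new page to a few existing pages that share its topic
    same_topic = [p for p, t in enumerate(topics[:-1]) if t == topic]
    for target in random.sample(same_topic, min(LINKS_PER_NEW_PAGE, len(same_topic))):
        links.append((page, target))

for _ in range(STEPS):
    add_page()

# crude look at the in-degree distribution; a heavy tail is what the
# power-law claim in the abstract refers to
in_degree = Counter(target for _, target in links)
print(Counter(in_degree.values()))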


2002 ◽  
pp. 145-152 ◽  
Author(s):  
Fiona Fui-Hoon Nah

The explosive expansion of the World Wide Web (WWW) is the biggest event in the Internet's history. Since its public introduction in 1991, the WWW has become an important channel for electronic commerce, information access, and publication. However, the long waiting time for accessing web pages has become a critical issue, especially with the popularity of multimedia technology and the exponential increase in the number of Web users. Although various technologies and techniques have been implemented to alleviate the situation and to placate impatient users, fundamental research is still needed to establish what constitutes an acceptable waiting time for a typical WWW user. This research not only evaluates Nielsen's hypothesis of 15 seconds as the maximum waiting time of WWW users, but also provides approximate distributions of WWW users' waiting-time tolerance.
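As a purely illustrative sketch (the numbers below are invented, not the study's data), this is how one might summarize measured waiting-time tolerances in Python and see where Nielsen's 15-second figure falls.

import statistics

# hypothetical seconds each user waited before abandoning a page
tolerances = [4, 7, 9, 11, 12, 14, 15, 18, 22, 30, 41, 60]

median_tolerance = statistics.median(tolerances)
gave_up_within_15 = sum(t <= 15 for t in tolerances) / len(tolerances)

print(f"median tolerance: {median_tolerance:.1f} s")
print(f"fraction giving up within 15 s: {gave_up_within_15:.0%}")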


2020 ◽  
Vol 18 (06) ◽  
pp. 1119-1125 ◽  
Author(s):  
Kessia Nepomuceno ◽  
Thyago Nepomuceno ◽  
Djamel Sadok

2011 ◽  
pp. 3340-3345
Author(s):  
Bruce Rollier ◽  
Fred Niederman

Although the Internet has been in existence since 1969, it was not widely used for educational purposes in its first two decades. Few students had access to e-mail, and few educators could visualize its value as a teaching tool. Programs to serve students from remote locations, often called “distance education,” became popular; these were generally delivered synchronously through television broadcasts and did not involve the Internet. When the World Wide Web was created in the early 1990s (Berners-Lee, 1999) and the first browsers became available (Waldrop, 2001), the enormous potential for education began to be recognized. New global users came online at a fantastic pace, and the value of all this connectivity was increasing even more rapidly in accordance with Metcalfe’s Law (Gilder, 1996). Nearly all students used e-mail regularly, and college professors were putting syllabi and course assignments online and creating Web pages with increasing sophistication. Soon entire programs were offered completely via the Internet, with students from all over the globe taking courses together.


2011 ◽  
pp. 1069-1075
Author(s):  
Barbara A. Frey ◽  
Ashli Molinero ◽  
Ellen Cohn

Just as wheelchair ramps and elevators provide access for wheelchair users, good Web design provides “electronic curb ramps” to the Internet for individuals with visual or other disabilities (Waddell, 1997). Research shows it is easier and less expensive to construct accessible Web pages from the outset than to retrofit them with corrections. Most of the technical requirements for accessible Web design can be met if Web designers adhere to the straightforward principles suggested by the World Wide Web Consortium’s Web Accessibility Initiative.
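By way of illustration only (not taken from the chapter), here is a small Python sketch of one such accessibility principle: checking that every image carries alternative text for screen-reader users. The sample page is made up.

from html.parser import HTMLParser

class AltTextCheck(HTMLParser):
    """Counts <img> tags that lack a non-empty alt attribute."""
    def __init__(self):
        super().__init__()
        self.missing = 0

    def handle_starttag(self, tag, attrs):
        if tag == "img" and not dict(attrs).get("alt"):
            self.missing += 1

page = '<p>Campus map <img src="map.png"> and <img src="logo.png" alt="University logo"></p>'
checker = AltTextCheck()
checker.feed(page)
print(f"images missing alt text: {checker.missing}")   # 1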


Author(s):  
Vijay Kasi ◽  
Radhika Jain

In the context of the Internet, a search engine can be defined as a software program designed to help one access information, documents, and other content on the World Wide Web. The adoption and growth of the Internet in the last decade has been unprecedented. The World Wide Web has always been applauded for its simplicity and ease of use, evident in how little knowledge one needs to build a Web page. This flexibility has enabled the Internet's rapid growth and adoption, but it has also made it hard to search for relevant information on the Web. The number of Web pages has been increasing at an astronomical pace, from around 2 million registered domains in 1995 to 233 million registered domains in 2004 (Consortium, 2004). The Internet can be considered a distributed database of information to which the CRUD (create, retrieve, update, and delete) operations apply. While the Internet has been effective at creating, updating, and deleting content, it has lagged considerably in enabling the retrieval of relevant information. After all, there is no point in having a Web page that has little or no visibility on the Web. Since the 1990s, when the first search program was released, we have come a long way in terms of searching for information. Although we are currently witnessing tremendous growth in search engine technology, the growth of the Internet has overtaken it, leaving existing search engine technology falling short. When we apply the metrics of relevance, rigor, efficiency, and effectiveness to the search domain, it becomes clear that we have progressed on the rigor and efficiency metrics by using abundant computing power to produce faster searches over large amounts of information; this is evident in the large number of pages indexed by the leading search engines (Barroso, Dean, & Holzle, 2003). However, more research is needed to address the relevance and effectiveness metrics. Users typically type in two to three keywords when searching, only to end up with a search result containing thousands of Web pages, which makes it increasingly hard to find useful, relevant information. Search engines today face a number of challenges that require them to perform rigorous searches efficiently while returning relevant results. These challenges include the following (“Search Engines,” 2004):

1. The Web is growing at a much faster rate than any present search engine technology can index.
2. Web pages are updated frequently, forcing search engines to revisit them periodically.
3. Dynamically generated Web sites may be slow or difficult to index, or may yield excessive results from a single Web site.
4. Many dynamically generated Web sites cannot be indexed by search engines at all.
5. The commercial interests of a search engine can interfere with the order of relevant results it shows.
6. Content that is behind a firewall or is password protected (such as that found in several digital libraries) is not accessible to search engines.
7. Some Web sites have started using tricks such as spamdexing and cloaking to manipulate search engines into displaying them as the top results for a set of keywords. This pollutes the search results, pushing more relevant links down the result list, and is a consequence of the popularity of Web searches and the business potential search engines can generate today.
8. Search engines index all the content of the Web without any bounds on the sensitivity of information, which has raised security and privacy concerns.

With the above background and challenges in mind, we lay out the article as follows. In the next section, we begin with a discussion of search engine evolution; to facilitate this discussion, we break it down into three generations of search engines. Figure 1 depicts this evolution pictorially and highlights the need for better search engine technologies. Next, we present a brief discussion of the contemporary state of search engine technology and the various types of content searches available today. With this background, the following section documents various concerns about existing search engines, setting the stage for better search engine technology; these concerns include information overload, relevance, representation, and categorization. Finally, we briefly address the research efforts under way to alleviate these concerns and then present our conclusion.
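To make the relevance metric discussed above concrete, here is a toy inverted index with TF-IDF scoring in Python. It is a teaching sketch only: the documents are invented, and nothing here represents the ranking algorithm of any real search engine, which combines many more signals such as link structure and freshness.

import math
from collections import Counter, defaultdict

documents = {
    "d1": "web search engines index web pages",
    "d2": "dynamic pages are hard to index",
    "d3": "relevance of search results matters to users",
}

# inverted index: term -> {document id: term frequency}
index = defaultdict(dict)
for doc_id, text in documents.items():
    for term, tf in Counter(text.split()).items():
        index[term][doc_id] = tf

def search(query):
    scores = Counter()
    for term in query.split():
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(len(documents) / len(postings))   # rarer terms weigh more
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return scores.most_common()                          # best-scoring documents first

print(search("search relevance"))                        # d3 ranks above d1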


2016 ◽  
Vol 14 (2) ◽  
pp. 28-34 ◽  
Author(s):  
V. Sathiyamoorthi

It is generally observed throughout the world that over the last two decades, while the average speed of computers has roughly doubled every eighteen months, the average speed of the network has doubled in a span of just eight months. To improve performance, more and more researchers are focusing their work on computers and related technologies. The World Wide Web (WWW) acts as a medium for sharing information; as a result, millions of applications run on the Internet, causing increased network traffic and placing a great demand on the available network infrastructure. Slow retrieval of Web pages may reduce users' interest in accessing them. To deal with this problem, Web caching and Web pre-fetching are used. This paper focuses on a methodology for improving a proxy-based Web caching system using Web mining; it integrates Web caching and pre-fetching through an efficient clustering-based pre-fetching technique.
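The following Python sketch shows, under stated assumptions, where caching and pre-fetching sit in a proxy: an LRU cache plus a naive "most frequent next page" predictor. It is not the paper's clustering-based technique, only a placeholder showing where such a predictor would plug in.

from collections import OrderedDict, defaultdict, Counter

def fetch(url):
    # stand-in for a real HTTP request to the origin server
    return f"<contents of {url}>"

class PrefetchingCache:
    def __init__(self, capacity=100):
        self.capacity = capacity
        self.cache = OrderedDict()                 # url -> content, kept in LRU order
        self.successors = defaultdict(Counter)     # url -> how often each url followed it
        self.last_url = None

    def get(self, url):
        if url in self.cache:
            self.cache.move_to_end(url)            # cache hit: mark as recently used
        else:
            self._store(url, fetch(url))           # cache miss: fetch and cache
        if self.last_url is not None:
            self.successors[self.last_url][url] += 1   # learn the access pattern
        self.last_url = url
        self._prefetch(url)
        return self.cache[url]

    def _prefetch(self, url):
        # pre-fetch the page that most often follows the current one
        if self.successors[url]:
            likely_next, _ = self.successors[url].most_common(1)[0]
            if likely_next not in self.cache:
                self._store(likely_next, fetch(likely_next))

    def _store(self, url, content):
        self.cache[url] = content
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)          # evict the least recently used page

proxy = PrefetchingCache(capacity=3)
for url in ["/home", "/news", "/home", "/news", "/home"]:
    proxy.get(url)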


2011 ◽  
pp. 3020-3027
Author(s):  
Barbara A. Frey ◽  
Ashli Molinero ◽  
Ellen Cohn

Just as wheelchair ramps and elevators provide access for wheelchair users, good Web design provides “electronic curb ramps” to the Internet for individuals with visual or other disabilities (Waddell, 1997). Research shows it is easier and less expensive to construct accessible Web pages from the outset than to retrofit them with corrections. Most of the technical requirements for accessible Web design can be met if Web designers adhere to the straightforward principles suggested by the World Wide Web Consortium’s Web Accessibility Initiative.


1999 ◽  
Vol 40 (1) ◽  
pp. 97-104
Author(s):  
Susan Brady

Over the past decade academic and research libraries throughout the world have taken advantage of the enormous developments in communication technology to improve services to their users. Through the Internet and the World Wide Web researchers now have convenient electronic access to library catalogs, indexes, subject bibliographies, descriptions of manuscript and archival collections, and other resources. This brief overview illustrates how libraries are facilitating performing arts research in new ways.

