WDPMA

Author(s):  
Santosh Kumar ◽  
Ravi Kumar

The internet is enormous and growing exponentially, which makes finding relevant information in such a vast source increasingly difficult. Millions of web pages are returned in response to an ordinary user query, and displaying these pages without ranking makes it very challenging for the user to locate the relevant results. This paper proposes a novel approach that utilizes web content, usage, and structure data to prioritize web documents. The proposed approach has applications in several major areas such as web personalization, adaptive website development, recommendation systems, search engine optimization, and business intelligence solutions. Further, the proposed approach has been compared experimentally with other approaches, WDPGA, WDPSA, and WDPII, and it has been observed that, with a small trade-off in time, it has an edge over these approaches.
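
A minimal illustration of the kind of prioritization described above might combine the three evidence sources into a single weighted score; the weights, field names, and scoring scheme below are assumptions for illustration only, not the paper's actual WDPMA formulation.

```python
# Minimal sketch (not the paper's exact WDPMA formulation): prioritize web
# documents by combining hypothetical content, usage, and structure scores.
from dataclasses import dataclass

@dataclass
class WebDocument:
    url: str
    content_score: float    # e.g. query/term relevance in [0, 1]
    usage_score: float      # e.g. normalized visit or click frequency in [0, 1]
    structure_score: float  # e.g. normalized link-based score in [0, 1]

def priority(doc: WebDocument, w_content=0.5, w_usage=0.3, w_structure=0.2) -> float:
    """Weighted combination of the three evidence sources (weights are illustrative)."""
    return (w_content * doc.content_score
            + w_usage * doc.usage_score
            + w_structure * doc.structure_score)

docs = [
    WebDocument("http://example.org/a", 0.9, 0.2, 0.4),
    WebDocument("http://example.org/b", 0.6, 0.8, 0.7),
]
for d in sorted(docs, key=priority, reverse=True):
    print(f"{priority(d):.2f}  {d.url}")
```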

2018 ◽  
Vol 7 (4.19) ◽  
pp. 1041
Author(s):  
Santosh V. Chobe ◽  
Dr. Shirish S. Sane

The explosive growth of information on the Internet makes extracting relevant data from various sources a difficult task for its users. Information Extraction (IE) systems are therefore needed to transform Web pages into databases: relevant information in Web documents can be extracted and presented in a structured format. By applying information extraction techniques, information can be extracted from structured, semi-structured, and unstructured data. This paper presents some of the major information extraction tools and discusses their advantages and limitations from a user's perspective.


2012 ◽  
Vol 8 (4) ◽  
pp. 1-21 ◽  
Author(s):  
C. I. Ezeife ◽  
Titas Mutsuddy

The process of extracting comparative, heterogeneous web content data, both derived and historical, from related web pages is still in its infancy and not well developed. Discovering potentially useful and previously unknown information or knowledge from web contents, such as "list all articles on 'Sequential Pattern Mining' written between 2007 and 2011, including title, authors, volume, abstract, paper, citation, and year of publication," would require finding the schema of web documents from different web pages, performing web content data integration, and building a virtual or physical data warehouse before web content extraction and mining from the database. This paper proposes a technique for automatic web content data extraction, the WebOMiner system, which models web sites of a specific domain, such as Business to Customer (B2C) web sites, as object-oriented database schemas. Then, non-deterministic finite state automata (NFA) based wrappers for recognizing content types from this domain are built and used to extract related contents from data blocks into an integrated database for future second-level mining and deep knowledge discovery.
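
As a rough illustration of how an NFA-based wrapper can recognize a content type from a tokenized data block, the sketch below accepts token-type sequences that look like a B2C "product" record; the token categories, states, and transitions are hypothetical and are not taken from the WebOMiner implementation.

```python
# Illustrative NFA-style wrapper: classify a tokenized data block as a "product"
# content type if its token-type sequence is accepted by the automaton.
# (Token categories and transitions are hypothetical, not WebOMiner's actual grammar.)

# States: 0 = start, 1 = saw title text, 2 = saw price (accepting), 3 = saw image (accepting)
TRANSITIONS = {
    (0, "TEXT"):  {1},   # a leading text token is treated as a title
    (1, "TEXT"):  {1},   # further descriptive text
    (1, "PRICE"): {2},
    (2, "TEXT"):  {2},   # e.g. shipping info after the price
    (2, "IMAGE"): {3},
}
ACCEPTING = {2, 3}

def accepts(token_types):
    """Run the NFA over a sequence of token types; True if any path ends in an accepting state."""
    states = {0}
    for t in token_types:
        states = set().union(*(TRANSITIONS.get((s, t), set()) for s in states))
        if not states:
            return False
    return bool(states & ACCEPTING)

print(accepts(["TEXT", "TEXT", "PRICE", "IMAGE"]))  # True  -> looks like a product block
print(accepts(["IMAGE", "PRICE"]))                  # False -> does not match the product pattern
```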


2010 ◽  
Vol 171-172 ◽  
pp. 543-546 ◽  
Author(s):  
G. Poonkuzhali ◽  
R. Kishore Kumar ◽  
R. Kripa Keshav ◽  
P. Sudhakar ◽  
K. Sarukesi

The growth of the internet has flooded the WWW with abundant information, much of it replicated. Because duplicated web pages increase indexing space and time complexity, finding and removing these pages is significant for search engines and similar systems, improving both the accuracy of search results and search speed. Web content mining plays a vital role in resolving these issues. Existing algorithms for web content mining focus on applying weights to structured documents, whereas in this research work a mathematical approach based on linear correlation is developed to detect and remove duplicates present in both structured and unstructured web documents. In the proposed work, the linear correlation between two web documents is computed: if the correlation value is 1, the documents are considered exactly redundant and one of them is eliminated; otherwise they are considered not redundant.
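
The duplicate-detection idea can be sketched as follows, under the assumption that each document is represented by a term-frequency vector over a shared vocabulary (the paper's exact document representation is not specified here): compute the Pearson (linear) correlation of the two vectors and flag the pair as redundant when it equals 1.

```python
# Minimal sketch of the linear-correlation idea: build term-frequency vectors over a
# shared vocabulary and compute the Pearson correlation; a value of 1 flags exact duplicates.
from collections import Counter
import math

def term_vector(text, vocab):
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

doc1 = "web content mining removes duplicate web pages"
doc2 = "web content mining removes duplicate web pages"
vocab = sorted(set(doc1.lower().split()) | set(doc2.lower().split()))
r = pearson(term_vector(doc1, vocab), term_vector(doc2, vocab))
print(f"correlation = {r:.2f}", "-> duplicate" if abs(r - 1.0) < 1e-9 else "-> not duplicate")
```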


2010 ◽  
Vol 10 (1) ◽  
pp. 28-33 ◽  
Author(s):  
Glenda Browne

The internet provides access to a huge amount of information, and most people experience problems with information overload rather than scarcity. Glenda Browne explains how indexing provides a way of increasing retrieval of relevant information from the content available. Manual, book-style indexes can be created for websites and for individual web documents such as online books. Keyword metadata is a crucial behind-the-scenes aid to improved search engine functioning, and categorisation, social bookmarking and automated indexing also play a part.


2020 ◽  
Vol 2 (1) ◽  
pp. 17-21
Author(s):  
Fares Hasan ◽  
Koo Kwong Ze ◽  
Rozilawati Razali ◽  
Abudhahir Buhari ◽  
Elisha Tadiwa

PageRank is an algorithm that brings order to the Internet by returning the best results to users for a given search query. The algorithm ranks results by calculating the outgoing links a webpage has, thus reflecting whether the webpage is relevant or not. However, problems remain concerning the time needed to calculate the page rank of all webpages: the turnaround time is long because the number of webpages on the Internet is large and keeps increasing. Secondly, the results returned by the algorithm are biased towards older webpages, so newly created webpages receive lower page ranks than old ones even though the new pages might contain more relevant information. To overcome these setbacks, this research proposes an alternative hybrid algorithm based on an optimized normalization technique and a content-based approach. The proposed algorithm reduces the number of iterations required to calculate the page rank, and hence improves efficiency, by calculating the mean of all page rank values and normalising each page rank value using this mean. This is complemented by counting the valid links of web pages based on the validity of the links rather than their conventional popularity.
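
One plausible reading of the proposed optimization is a standard PageRank power iteration with an extra step that divides every rank by the mean rank after each pass; the sketch below follows that reading, and the damping factor, tolerance, and toy graph are illustrative assumptions rather than the authors' exact formulation.

```python
# Sketch of PageRank with a mean-based normalization step, as one plausible reading
# of the hybrid algorithm described above (the authors' exact formulation may differ).
def pagerank(links, d=0.85, tol=1e-6, max_iter=100):
    """links: dict mapping each page to the list of pages it links to (no dangling pages)."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        new = {}
        for p in pages:
            incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = (1 - d) / n + d * incoming
        mean = sum(new.values()) / n
        new = {p: r / mean for p, r in new.items()}   # normalize every rank by the mean rank
        if all(abs(new[p] - rank[p]) < tol for p in pages):
            rank = new
            break
        rank = new
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
for page, score in sorted(pagerank(graph).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```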


Author(s):  
Jos van Iwaarden ◽  
Ton van der Wiele ◽  
Roger Williams ◽  
Steve Eldridge

The Internet has come of age as a global source of information about every topic imaginable. A company like Google has become a household name in Western countries, and using its search engine is so common that "Googling" has even become a verb in many Western languages. Whether for business or private purposes, people worldwide rely on Google to present them with relevant information. Even the scientific community increasingly employs Google's search engine to find academic articles and other sources of information about the topics they are studying. Yet the vast amount of information available on the internet is gradually changing in nature. Initially, information would be uploaded by the administrators of a web site and would then be visible to all visitors of the site. This approach meant that web sites tended to be limited in the amount of content they provided, and that such content was strictly controlled by the administrators. Over time, web sites have granted their users the authority to add information to web pages, and sometimes even to alter existing information. Current examples of such web sites are eBay (auctions), Wikipedia (encyclopedia), YouTube (video sharing), LinkedIn (social networking), Blogger (weblogs) and Delicious (social bookmarking).


2014 ◽  
Vol 989-994 ◽  
pp. 4452-4455
Author(s):  
Rui Zhang ◽  
Chun Dong Hu ◽  
Peng Sheng ◽  
Xiao Dan Zhang ◽  
Yuan Zhe Zhao ◽  
...  

In Neutral Beam Injection (NBI) experiments, Experimental Data Publishing Software (EDPS) has been developed so that experimental data can be monitored remotely. Adopting the Browser/Server (B/S) model, EDPS dynamically publishes information, such as experimental configurations and results, on the Internet. This paper introduces the design and implementation of EDPS, describing how it uses the existing information sources in the NBI Control System (NBICS), how it handles requests from clients, and how it displays the information on web pages.


Author(s):  
Ralla Suresh ◽  
Saritha Vemuri ◽  
Swetha V

The information extracted from Web pages can be used for effective query expansion. One aspect needed to improve the accuracy of web search engines is the inclusion of metadata, not only to analyze Web content but also to interpret it. With the Web of today being unstructured and semantically heterogeneous, keyword-based queries are likely to miss important results. Using data mining methods, our system derives dependency rules and applies them to concept-based queries. This paper presents a novel approach for query expansion that applies dependency rules mined from the Web at large, combining several existing techniques for data extraction and mining, and integrates the system into COMPACT, our prototype implementation of a concept-based search engine.
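
To illustrate how mined dependency rules might drive concept-based query expansion, the sketch below applies a small hand-written rule table to a keyword query; the rule format, concepts, and terms are hypothetical and do not reproduce COMPACT's actual mining pipeline.

```python
# Illustrative rule-based query expansion (rule table and terms are hypothetical;
# in the described system such rules would be mined from Web data rather than hand-written).
DEPENDENCY_RULES = {
    "jaguar": {"car": ["automobile", "vehicle"], "animal": ["wildlife", "big cat"]},
    "python": {"programming": ["language", "scripting"], "animal": ["snake", "reptile"]},
}

def expand_query(keywords, concept):
    """Expand each keyword with the terms its dependency rules associate with the given concept."""
    expanded = list(keywords)
    for word in keywords:
        expanded.extend(DEPENDENCY_RULES.get(word, {}).get(concept, []))
    return expanded

print(expand_query(["jaguar", "speed"], concept="car"))
# ['jaguar', 'speed', 'automobile', 'vehicle']
```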


Author(s):  
Vijay Kasi ◽  
Radhika Jain

In the context of the Internet, a search engine can be defined as a software program designed to help one access information, documents, and other content on the World Wide Web. The adoption and growth of the Internet in the last decade has been unprecedented. The World Wide Web has always been applauded for its simplicity and ease of use, which is evident from the limited knowledge one requires to build a Web page. The flexible nature of the Internet has enabled its rapid growth and adoption, but has also made it hard to search for relevant information on the Web. The number of Web pages has been increasing at an astronomical pace, from around 2 million registered domains in 1995 to 233 million registered domains in 2004 (Consortium, 2004). The Internet, considered a distributed database of information, has the CRUD (create, retrieve, update, and delete) rule applied to it. While the Internet has been effective at creating, updating, and deleting content, it has considerably lacked in enabling the retrieval of relevant information. After all, there is no point in having a Web page that has little or no visibility on the Web. Since the 1990s, when the first search program was released, we have come a long way in terms of searching for information. Although we are currently witnessing tremendous growth in search engine technology, the growth of the Internet has overtaken it, leading to a state in which the existing search engine technology falls short. When we apply the metrics of relevance, rigor, efficiency, and effectiveness to the search domain, it becomes clear that we have progressed on the rigor and efficiency metrics by utilizing abundant computing power to produce faster searches over large amounts of information. Rigor and efficiency are evident in the large number of pages indexed by the leading search engines (Barroso, Dean, & Holzle, 2003). However, more research needs to be done to address the relevance and effectiveness metrics. Users typically type in two to three keywords when searching, only to end up with a search result containing thousands of Web pages. This has made it increasingly hard to effectively find useful, relevant information. Search engines face a number of challenges today, requiring them to perform rigorous searches with relevant results efficiently so that they are effective. These challenges include the following ("Search Engines," 2004).

1. The Web is growing at a much faster rate than any present search engine technology can index.
2. Web pages are updated frequently, forcing search engines to revisit them periodically.
3. Dynamically generated Web sites may be slow or difficult to index, or may result in excessive results from a single Web site.
4. Many dynamically generated Web sites are not able to be indexed by search engines.
5. The commercial interests of a search engine can interfere with the order of relevant results the search engine shows.
6. Content that is behind a firewall or that is password protected is not accessible to search engines (such as those found in several digital libraries).
7. Some Web sites have started using tricks such as spamdexing and cloaking to manipulate search engines into displaying them as the top results for a set of keywords. This can pollute the search results, with more relevant links being pushed down in the result list, and is a consequence of the popularity of Web searches and the business potential search engines can generate today.
8. Search engines index all the content of the Web without any bounds on the sensitivity of information, which has raised security and privacy concerns.

With the above background and challenges in mind, we lay out the article as follows. In the next section, we begin with a discussion of search engine evolution. To facilitate the examination and discussion of the progress of search engine development, we break this discussion into the three generations of search engines. Figure 1 depicts this evolution pictorially and highlights the need for better search engine technologies. Next, we present a brief discussion of the contemporary state of search engine technology and the various types of content searches available today. With this background, the following section documents various concerns about existing search engines, setting the stage for better search engine technology. These concerns include information overload, relevance, representation, and categorization. Finally, we briefly address the research efforts under way to alleviate these concerns and then present our conclusion.


2021 ◽  
Vol 1 (3) ◽  
pp. 29-34
Author(s):  
Ayad Abdulrahman

Due to the daily expansion of the web, the amount of information available has increased significantly, and so has the need to retrieve relevant information. To explore the internet, users depend on various search engines, which face a significant challenge in returning the most relevant results for a user's query. A search engine's performance is determined by the algorithm used to rank web pages, which pushes the pages with the highest relevance to the top of the result page. In this paper, various web page ranking algorithms such as Page Rank, Time Rank, EigenRumor, Distance Rank, SimRank, etc. are analyzed and compared based on several parameters, including the mining technique to which the algorithm belongs (for instance, Web Content Mining, Web Structure Mining, or Web Usage Mining), the methodology used for ranking web pages, time complexity (the amount of time needed to run the algorithm), input parameters (parameters utilized in the ranking process such as InLink, OutLink, Tag name, Keyword, etc.), and the relevancy of the results to the user query.

