Machine Learning as a New Search Engine Interface: An Overview

Taposh Kumar Neogy; Harish Paruchuri

doi:10.18034/ei.v2i2.539

Using Machine Learning for Web Page Classification in Search Engine Optimization

Future Internet ◽

10.3390/fi13010009 ◽

2021 ◽

Vol 13 (1) ◽

pp. 9

Author(s):

Goran Matošević ◽

Jasminka Dobša ◽

Dunja Mladenić

Keyword(s):

Machine Learning ◽

Search Engine ◽

Search Engines ◽

Machine Learning Algorithms ◽

Practical Significance ◽

Web Pages ◽

Web Page ◽

Data Set ◽

Search Engine Optimization ◽

Novel Approach

This paper presents a novel approach of using machine learning algorithms based on experts’ knowledge to classify web pages into three predefined classes according to the degree of content adjustment to the search engine optimization (SEO) recommendations. In this study, classifiers were built and trained to classify an unknown sample (web page) into one of the three predefined classes and to identify important factors that affect the degree of page adjustment. The data in the training set are manually labeled by domain experts. The experimental results show that machine learning can be used for predicting the degree of adjustment of web pages to the SEO recommendations—classifier accuracy ranges from 54.59% to 69.67%, which is higher than the baseline accuracy of classification of samples in the majority class (48.83%). Practical significance of the proposed approach is in providing the core for building software agents and expert systems to automatically detect web pages, or parts of web pages, that need improvement to comply with the SEO guidelines and, therefore, potentially gain higher rankings by search engines. Also, the results of this study contribute to the field of detecting optimal values of ranking factors that search engines use to rank web pages. Experiments in this paper suggest that important factors to be taken into consideration when preparing a web page are page title, meta description, H1 tag (heading), and body text—which is aligned with the findings of previous research. Another result of this research is a new data set of manually labeled web pages that can be used in further research.

Download Full-text

Internet Search Engines

Encyclopedia of E-Commerce, E-Government, and Mobile Commerce ◽

10.4018/978-1-59140-799-7.ch108 ◽

2011 ◽

pp. 672-677

Author(s):

Vijay Kasi ◽

Radhika Jain

Keyword(s):

Search Engine ◽

Web Sites ◽

Search Engines ◽

World Wide ◽

Relevant Information ◽

The Internet ◽

Web Pages ◽

Web Page ◽

The World ◽

The Web

In the context of the Internet, a search engine can be defined as a software program designed to help one access information, documents, and other content on the World Wide Web. The adoption and growth of the Internet in the last decade has been unprecedented. The World Wide Web has always been applauded for its simplicity and ease of use. This is evident looking at the extent of the knowledge one requires to build a Web page. The flexible nature of the Internet has enabled the rapid growth and adoption of it, making it hard to search for relevant information on the Web. The number of Web pages has been increasing at an astronomical pace, from around 2 million registered domains in 1995 to 233 million registered domains in 2004 (Consortium, 2004). The Internet, considered a distributed database of information, has the CRUD (create, retrieve, update, and delete) rule applied to it. While the Internet has been effective at creating, updating, and deleting content, it has considerably lacked in enabling the retrieval of relevant information. After all, there is no point in having a Web page that has little or no visibility on the Web. Since the 1990s when the first search program was released, we have come a long way in terms of searching for information. Although we are currently witnessing a tremendous growth in search engine technology, the growth of the Internet has overtaken it, leading to a state in which the existing search engine technology is falling short. When we apply the metrics of relevance, rigor, efficiency, and effectiveness to the search domain, it becomes very clear that we have progressed on the rigor and efficiency metrics by utilizing abundant computing power to produce faster searches with a lot of information. Rigor and efficiency are evident in the large number of indexed pages by the leading search engines (Barroso, Dean, & Holzle, 2003). However, more research needs to be done to address the relevance and effectiveness metrics. Users typically type in two to three keywords when searching, only to end up with a search result having thousands of Web pages! This has made it increasingly hard to effectively find any useful, relevant information. Search engines face a number of challenges today requiring them to perform rigorous searches with relevant results efficiently so that they are effective. These challenges include the following (“Search Engines,” 2004). 1. The Web is growing at a much faster rate than any present search engine technology can index. 2. Web pages are updated frequently, forcing search engines to revisit them periodically. 3. Dynamically generated Web sites may be slow or difficult to index, or may result in excessive results from a single Web site. 4. Many dynamically generated Web sites are not able to be indexed by search engines. 5. The commercial interests of a search engine can interfere with the order of relevant results the search engine shows. 6. Content that is behind a firewall or that is password protected is not accessible to search engines (such as those found in several digital libraries).1 7. Some Web sites have started using tricks such as spamdexing and cloaking to manipulate search engines to display them as the top results for a set of keywords. This can make the search results polluted, with more relevant links being pushed down in the result list. This is a result of the popularity of Web searches and the business potential search engines can generate today. 8. Search engines index all the content of the Web without any bounds on the sensitivity of information. This has raised a few security and privacy flags. With the above background and challenges in mind, we lay out the article as follows. In the next section, we begin with a discussion of search engine evolution. To facilitate the examination and discussion of the search engine development’s progress, we break down this discussion into the three generations of search engines. Figure 1 depicts this evolution pictorially and highlights the need for better search engine technologies. Next, we present a brief discussion on the contemporary state of search engine technology and various types of content searches available today. With this background, the next section documents various concerns about existing search engines setting the stage for better search engine technology. These concerns include information overload, relevance, representation, and categorization. Finally, we briefly address the research efforts under way to alleviate these concerns and then present our conclusion.

Download Full-text

Similarity Web Pages Retrieval Technologies on the Internet

Encyclopedia of Information Science and Technology, First Edition ◽

10.4018/978-1-59140-553-5.ch440 ◽

2005 ◽

pp. 2486-2491

Author(s):

Rung Ching Chen ◽

Ming Yung Tsai ◽

Chung Hsun Hsieh

Keyword(s):

Information Retrieval ◽

Search Engine ◽

Search Engines ◽

Fast Growth ◽

The Other ◽

The Internet ◽

Web Pages ◽

Query Term ◽

Web Page ◽

Critical Problems

In recent years, due to the fast growth of the Internet, the services and information it provides are constantly expanding. Madria and Bhowmick (1999) and Baeza-Yates (2003) indicated that most large search engines need to comply to, on average, at least millions of hits daily in order to satisfy the users’ needs for information. Each search engine has its own sorting policy and the keyword format for the query term, but there are some critical problems. The searches may get more or less information. In the former, the user always gets buried in the information. Requiring only a little information, they always select some former items from the large amount of returned information. In the latter, the user always re-queries using another searching keyword to do searching work. The re-query operation also leads to retrieving information in a great amount, which leads to having a large amount of useless information. That is a bad cycle of information retrieval. The similarity Web page retrieval can help avoid browsing the useless information. The similarity Web page retrieval indicates a Web page, and then compares the page with the other Web pages from the searching results of search engines. The similarity Web page retrieval will allow users to save time by not browsing unrelated Web pages and reject non-similar Web pages, rank the similarity order of Web pages and cluster the similarity Web pages into the same classification.

Download Full-text

An Ontology-based Ranking Model in Search Engines

Journal of Computer Science Research ◽

10.30564/jcsr.v1i2.972 ◽

2019 ◽

Vol 1 (2) ◽

Author(s):

Yu Hou ◽

Lixin Tao

Keyword(s):

Search Engine ◽

Search Engines ◽

Semantic Search ◽

The Internet ◽

Web Pages ◽

Ranking Algorithm ◽

Intelligent Network ◽

Web Page ◽

Pagerank Algorithm ◽

Hits Algorithm

As the tsunami of data has emerged, search engines have become the most powerful tool for obtaining scattered information on the internet. The traditional search engines return the organized results by using ranking algorithm such as term frequency, link analysis (PageRank algorithm and HITS algorithm) etc. However, these algorithms must combine the keyword frequency to determine the relevance between user’s query and the data in the computer system or internet. Moreover, we expect the search engines could understand users’ searching by content meanings rather than literal strings. Semantic Web is an intelligent network and it could understand human’s language more semantically and make the communication easier between human and computers. But, the current technology for the semantic search is hard to apply. Because some meta data should be annotated to each web pages, then the search engine will have the ability to understand the users intend. However, annotate every web page is very time-consuming and leads to inefficiency. So, this study designed an ontology-based approach to improve the current traditional keyword-based search and emulate the effects of semantic search. And let the search engine can understand users more semantically when it gets the knowledge.

Download Full-text

Critical Analysis of Major Search Engines

Journal of Computational and Theoretical Nanoscience ◽

10.1166/jctn.2019.8239 ◽

2019 ◽

Vol 16 (9) ◽

pp. 3712-3716

Author(s):

Kailash Kumar ◽

Abdulaziz Al-Besher

Keyword(s):

Search Engine ◽

Search Engines ◽

Critical Analysis ◽

Research Paper ◽

Web Pages ◽

Ranking Algorithm ◽

Discovering How Students Search a Library Web Site: A Usability Case Study

College & Research Libraries ◽

10.5860/crl.63.4.354 ◽

2002 ◽

Vol 63 (4) ◽

pp. 354-365 ◽

Cited By ~ 43

Author(s):

Susan Augustine ◽

Courtney Greene

Keyword(s):

Search Engine ◽

Long Range ◽

Web Sites ◽

Search Engines ◽

Web Pages ◽

Usability Study ◽

Web Page ◽

The Past ◽

Library Resources

Have Internet search engines influenced the way students search library Web pages? The results of this usability study reveal that students consistently and frequently use the library Web site’s internal search engine to find information rather than navigating through pages. If students are searching rather than navigating, library Web page designers must make metadata and powerful search engines priorities. The study also shows that students have difficulty interpreting library terminology, experience confusion discerning difference amongst library resources, and prefer to seek human assistance when encountering problems online. These findings imply that library Web sites have not alleviated some of the basic and long-range problems that have challenged librarians in the past.

Download Full-text

Web Crawler and Web Crawler Algorithms: A Perspective

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.e9362.069520 ◽

2020 ◽

Vol 9 (5) ◽

pp. 203-205

Keyword(s):

Search Engine ◽

Search Engines ◽

The Internet ◽

Web Pages ◽

Web Crawler ◽

Day By Day ◽

The Web

A web crawler is also called spider. For the intention of web indexing it automatically searches on the WWW. As the W3 is increasing day by day, globally the number of web pages grown massively. To make the search sociable for users, searching engine are mandatory. So to discover the particular data from the WWW search engines are operated. It would be almost challenging for mankind devoid of search engines to find anything from the web unless and until he identifies a particular URL address. A central depository of HTML documents in indexed form is sustained by every search Engine. Every time an operator gives the inquiry, searching is done at the database of indexed web pages. The size of a database of every search engine depends on the existing page on the internet. So to increase the proficiency of search engines, it is permitted to store only the most relevant and significant pages in the database.

Download Full-text

Classification of Web Pages Using Machine Learning Techniques

Social Implications of Data Mining and Information Privacy ◽

10.4018/978-1-60566-196-4.ch008 ◽

2010 ◽

pp. 134-150

Author(s):

K. Selvakuberan ◽

M. Indra Devi ◽

R. Rajaram

Keyword(s):

Machine Learning ◽

Search Engines ◽

Research Problem ◽

Machine Learning Techniques ◽

The Internet ◽

Web Pages ◽

Current Research Problem ◽

Learning Techniques ◽

The Web

The explosive growth of the Web makes it a very useful information resource to all types of users. Today, everyone accesses the Internet for various purposes and retrieving the required information within the stipulated time is the major demand from users. Also, the Internet provides millions of Web pages for each and every search term. Getting interesting and required results from the Web becomes very difficult and turning the classification of Web pages into relevant categories is the current research topic. Web page classification is the current research problem that focuses on classifying the documents into different categories, which are used by search engines for producing the result. In this chapter we focus on different machine learning techniques and how Web pages can be classified using these machine learning techniques. The automatic classification of Web pages using machine learning techniques is the most efficient way used by search engines to provide accurate results to the users. Machine learning classifiers may also be trained to preserve the personal details from unauthenticated users and for privacy preserving data mining.

Download Full-text

Classification of Web Pages Using Machine Learning Techniques

Machine Learning ◽

10.4018/978-1-60960-818-7.ch105 ◽

2012 ◽

pp. 50-65 ◽

Cited By ~ 1

Author(s):

K. Selvakuberan ◽

M. Indra Devi ◽

R. Rajaram

Keyword(s):

Machine Learning ◽

Search Engines ◽

Research Problem ◽

Machine Learning Techniques ◽

The Internet ◽

Web Pages ◽

Current Research Problem ◽

Learning Techniques ◽

The Web

The explosive growth of the Web makes it a very useful information resource to all types of users. Today, everyone accesses the Internet for various purposes and retrieving the required information within the stipulated time is the major demand from users. Also, the Internet provides millions of Web pages for each and every search term. Getting interesting and required results from the Web becomes very difficult and turning the classification of Web pages into relevant categories is the current research topic. Web page classification is the current research problem that focuses on classifying the documents into different categories, which are used by search engines for producing the result. In this chapter we focus on different machine learning techniques and how Web pages can be classified using these machine learning techniques. The automatic classification of Web pages using machine learning techniques is the most efficient way used by search engines to provide accurate results to the users. Machine learning classifiers may also be trained to preserve the personal details from unauthenticated users and for privacy preserving data mining.

Download Full-text

The Core Aspects of Search Engine Optimisation Necessary to Move up the Ranking

International Journal of Ambient Computing and Intelligence ◽

10.4018/jaci.2011100105 ◽

2011 ◽

Vol 3 (4) ◽

pp. 62-70 ◽

Cited By ~ 6

Author(s):

Stephen O’Neill ◽

Kevin Curran

Keyword(s):

Local Search ◽

Search Engine ◽

Search Engines ◽

Internet Marketing ◽

Web Pages ◽

Image Search ◽

Web Page ◽

Search Results ◽

The Core

Search engine optimization (SEO) is the process of improving the visibility, volume and quality of traffic to website or a web page in search engines via the natural search results. SEO can also target other areas of a search, including image search and local search. SEO is one of many different strategies used for marketing a website but SEO has been proven the most effective. An Internet marketing campaign may drive organic search results to websites or web pages but can be involved with paid advertising on search engines. All search engines have a unique way of ranking the importance of a website. Some search engines focus on the content while others review Meta tags to identify who and what a web site’s business is. Most engines use a combination of Meta tags, content, link popularity, click popularity and longevity to determine a sites ranking. To make it even more complicated, they change their ranking policies frequently. This paper provides an overview of search engine optimisation strategies and pitfalls.

Download Full-text