Development of Focused Crawlers for Building Large Punjabi News Corpus

2021 ◽  
Vol 15 (3) ◽  
pp. 205-215
Author(s):  
Gurjot Singh Mahi ◽  
Amandeep Verma

  Web crawlers are as old as the Internet and are most commonly used by search engines to visit websites and index them into repositories. They are not limited to search engines but are also widely utilized to build corpora in different domains and languages. This study developed a focused set of web crawlers for three Punjabi news websites. The web crawlers were developed to extract quality text articles and add them to a local repository to be used in further research. The crawlers were implemented using the Python programming language and were utilized to construct a corpus of more than 134,000 news articles in nine different news genres. The crawler code and extracted corpora were made publicly available to the scientific community for research purposes.
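
A minimal sketch of such a crawler is shown below, assuming a hypothetical site layout; the start URL, the article URL pattern, and the HTML structure are illustrative stand-ins, not those of the actual Punjabi news sites or the published crawler code.

```python
# Hedged sketch of a focused news crawler (hypothetical site and selectors).
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

START_URL = "https://example-news-site.example/punjabi/"  # hypothetical
ARTICLE_MARKER = "/news/"                                 # hypothetical URL pattern

def crawl(start_url, max_articles=100):
    seen, frontier, articles = set(), [start_url], []
    while frontier and len(articles) < max_articles:
        url = frontier.pop()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")
        # Treat pages whose URL matches the article pattern as news articles
        if ARTICLE_MARKER in url:
            title = soup.find("h1")
            body = " ".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))
            if title and body:
                articles.append({"url": url,
                                 "title": title.get_text(strip=True),
                                 "text": body})
        # Stay focused: only follow links within the start site
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link.startswith(start_url):
                frontier.append(link)
        time.sleep(1)  # politeness delay between requests
    return articles
```

In practice each of the three site-specific crawlers would hard-code its own URL patterns and extraction rules, which is what makes the crawlers "focused".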

Author(s):  
Daniele Besomi

This paper surveys the economic dictionaries available on the internet, both free and on subscription, addressed to various kinds of audiences, from schoolchildren to research students and academics. The focus is not so much on content as on whether and how the possibilities opened by electronic editing, and by the modes of distribution and interaction offered by the internet, are exploited in the organization and presentation of the materials. The upshot is that although a number of web dictionaries have taken advantage of some of the innovations offered by the internet (in particular regular updating, turning cross-references into hyperlinks, adding links to external materials, and adding more or less complex search engines), the observation that internet lexicography has mostly produced more efficient dictionaries without fundamentally altering the traditional paper structure is confirmed for this particular subset of reference works. In particular, what remains scarcely explored is the possibility of visualizing the relationships between entries, thus abandoning the project of the early encyclopedists right when technology provides the means of accomplishing it.


Author(s):  
Suely Fragoso

This chapter proposes that search engines apply a verticalizing pressure on the many-to-many information distribution model of the WWW, forcing it to revert to a distributive model similar to that of the mass media. The argument starts with a critical descriptive examination of the history of search mechanisms for the Internet, in parallel with a discussion of the increasing ties between search engines and the advertising market. The chapter then raises questions concerning the concentration of Web traffic around a small number of search engines, which are in the hands of an equally limited number of enterprises. This concentration is accentuated by the confidence that users place in search engines and by the ongoing acquisition of collaborative systems and smaller players by the large search engines. This scenario demonstrates the verticalizing pressure that search engines apply to the majority of WWW users, pulling the Web back toward a mass-distribution model.


Author(s):  
Denis Shestakov

Finding information on the Web using a web search engine is one of the primary activities of today's web users. Most users assume that the results returned by conventional search engines are an essentially complete set of links to all pages on the Web relevant to their queries. However, present-day search engines do not crawl and index a significant portion of the Web, and hence web users relying on search engines alone are unable to discover and access a large amount of information in the non-indexable part of the Web. Specifically, dynamic pages generated from parameters provided by a user via web search forms are not indexed by search engines and cannot be found in search results. Such search interfaces give web users online access to myriads of databases on the Web. To obtain information from a web database of interest, a user issues a query by specifying query terms in a search form and receives the query results: a set of dynamic pages that embed the required information from the database. At the same time, issuing a query via an arbitrary search interface is an extremely complex task for any kind of automatic agent, including web crawlers, which, at least up to the present day, do not even attempt to pass through web forms on a large scale.
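
A minimal sketch of the form-submission step that such crawlers avoid, under the assumption of a simple HTML form with a single text field (the form URL and the field name "q" are hypothetical):

```python
# Hedged sketch: issue a query through a web search form programmatically.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

FORM_PAGE = "https://example-db.example/search"  # hypothetical search interface

def query_web_database(terms):
    # Fetch the page containing the search form and locate the form element
    page = requests.get(FORM_PAGE, timeout=10)
    soup = BeautifulSoup(page.text, "html.parser")
    form = soup.find("form")
    action = form.get("action", FORM_PAGE)
    # Collect the form's input fields; we assume the text field is named "q"
    fields = {i.get("name"): i.get("value", "")
              for i in form.find_all("input") if i.get("name")}
    fields["q"] = terms
    # Submit the query and return the dynamically generated result page
    result = requests.post(urljoin(FORM_PAGE, action), data=fields, timeout=10)
    return result.text
```

Real deep-web access is far harder than this sketch suggests: forms differ in structure, require sensible values for many fields, and expose no schema for the databases behind them, which is why large-scale crawlers avoid them.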


Author(s):  
Michael Thelwall

This chapter argues that the structure of the Web reflects the offline world, making it a valuable lens for exploring society. It introduces the relevant theories and issues, makes general observations about the Web, and then provides examples of investigations into particular topics, such as academic web use. The Web offers unique access to free information, from Wikipedia to news websites and from government information portals to search engines. Two broad approaches to investigating society on the Web are reported, based around link analysis and Web 2.0 investigations. Web 2.0 has spawned broad research probing its effect on several aspects of society. The publishing of personal information on the Web, particularly on the social web, appears likely to continue and expand.


2012 ◽  
Vol 2 (2) ◽  
pp. 1-12
Author(s):  
K. G. Srinivasa ◽  
N. Pramod ◽  
K. R. Venugopal ◽  
L. M. Patnaik

In the Internet era, information processing for personalization and relevance has been one of the key topics of research and development, ranging from the design of applications such as search engines, web crawlers, and learning engines to reverse image search, audio-based search, autocomplete, and so on. Information retrieval plays a vital role in most of these applications. The part of information retrieval that deals with personalization and rendering is often referred to as information filtering. The emphasis of this paper is to empirically analyze commonly seen information filters and to assess their correctness and effects. The measure of correctness is not a percentage of correct results; instead, a rational analysis using a non-mathematical argument is presented. Filters employed by Google's search engine are used to analyze the effects of filtering on the web. A plausible solution to the errors of the filtering phenomenon is also discussed.
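
As a toy illustration of the kind of filter discussed (not Google's actual filtering logic), the sketch below re-ranks results by their overlap with a user's interest profile:

```python
# Hedged sketch of a simple personalization filter: results whose text
# overlaps more with the user's interest profile are ranked higher.
def personalize(results, profile_terms):
    """results: list of (title, snippet) pairs; profile_terms: set of words."""
    def score(result):
        words = set((result[0] + " " + result[1]).lower().split())
        return len(words & profile_terms)  # overlap with the user profile
    return sorted(results, key=score, reverse=True)

results = [
    ("Jaguar facts", "the jaguar is a large cat native to the americas"),
    ("Python tutorial", "learn python programming basics"),
]
# A user interested in programming sees the tutorial ranked first
print(personalize(results, {"python", "programming"}))
```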


2018 ◽  
pp. 742-748
Author(s):  
Viveka Vardhan Jumpala

The Internet, an information superhighway, has practically compressed the world into a cyber colony through various interconnected networks. The development of the Internet and the emergence of the World Wide Web (WWW) created a common vehicle for communication and for instantaneous access to search engines and databases. A search engine is designed to facilitate the search for information on the WWW. Search engines are essentially tools that help find required information on the web quickly and in an organized manner. Different search engines do the same job in different ways, thus giving different results for the same query. Search strategies are the new trend on the Web.


A web crawler, also called a spider, automatically traverses the WWW for the purpose of web indexing. As the WWW grows day by day, the number of web pages worldwide has grown massively. Search engines are essential to make this content searchable for users, and they are relied upon to discover particular data on the WWW. It would be almost impossible for a person to find anything on the web without search engines unless he or she already knows the exact URL address. Every search engine maintains a central repository of HTML documents in indexed form. Each time a user issues a query, the search is performed against this database of indexed web pages. The size of each search engine's database depends on the pages available on the internet, so to increase the efficiency of search engines, only the most relevant and significant pages are stored in the database.
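
The indexed repository described here can be illustrated with a minimal inverted index, which maps each term to the set of pages containing it so that queries are answered from the index rather than from the raw pages:

```python
# Minimal sketch of an inverted index over crawled pages (toy data).
from collections import defaultdict

def build_index(pages):
    """pages: dict mapping URL -> extracted page text."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index

def search(index, query):
    # Return pages containing every query term (simple AND semantics)
    term_sets = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*term_sets) if term_sets else set()

pages = {
    "http://a.example": "punjabi news corpus",
    "http://b.example": "news about web crawlers",
}
index = build_index(pages)
print(search(index, "news"))  # both pages match
```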


Author(s):  
Rohit S ◽  
M N Nachappa

Metadata is defined as information providing data about one or more facets of the data. It is used to summarize basic information about data, which can make searching for and working with specific data easier. The idea of metadata is often extended to include words or phrases that stand for objects or "things" in the world, leading to the notion of entity extraction. In this paper, I propose extracting the metadata of files that the user inputs to the system. This can be achieved using Flask as the web platform and the Python programming language. Our goal is to build a free and lightweight metadata extractor that is more efficient and user-friendly.
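
A minimal sketch of such an extractor is given below; the route name and the response fields are illustrative assumptions, not the paper's actual interface:

```python
# Hedged sketch of a Flask metadata extractor for uploaded files.
import mimetypes

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/extract", methods=["POST"])  # assumed endpoint name
def extract():
    uploaded = request.files.get("file")
    if uploaded is None:
        return jsonify({"error": "no file provided"}), 400
    data = uploaded.read()
    # Basic, format-independent metadata about the uploaded file
    return jsonify({
        "filename": uploaded.filename,
        "size_bytes": len(data),
        "mime_type": mimetypes.guess_type(uploaded.filename)[0]
                     or uploaded.content_type,
    })

if __name__ == "__main__":
    app.run(debug=True)
```

A real extractor would add format-specific fields (EXIF for images, ID3 for audio, and so on) on top of this generic skeleton.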


Author(s):  
K. Selvakuberan ◽  
M. Indra Devi ◽  
R. Rajaram

The explosive growth of the Web makes it a very useful information resource for all types of users. Today, everyone accesses the Internet for various purposes, and retrieving the required information within a reasonable time is users' major demand. The Web also returns millions of pages for each and every search term, so getting interesting and relevant results becomes very difficult, and classifying Web pages into relevant categories has become an active research topic. Web page classification focuses on assigning documents to different categories, which search engines use to produce their results. In this chapter we focus on different machine learning techniques and how Web pages can be classified using them. Automatic classification of Web pages using machine learning is the most efficient way for search engines to provide accurate results to users. Machine learning classifiers may also be trained to protect personal details from unauthenticated users and to support privacy-preserving data mining.
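
As an illustrative sketch of this approach (toy training data; real systems are trained on large labelled page collections), a TF-IDF representation of page text can feed a Naive Bayes classifier:

```python
# Hedged sketch of machine-learning web page classification (toy data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

pages = [
    "latest cricket scores and match highlights",
    "stock markets rally as earnings beat forecasts",
    "new smartphone review battery camera performance",
    "team wins championship final in overtime",
]
labels = ["sports", "business", "technology", "sports"]

# TF-IDF features of page text feed a Naive Bayes classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(pages, labels)
print(model.predict(["stock markets and earnings news"]))  # -> ['business']
```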


Author(s):  
H. Arafat Ali ◽  
Ali I. El Desouky ◽  
Ahmed I. Saleh

Search engines are the most important tools for finding useful and recent information on the Web today. They rely on crawlers that continually crawl the Web for new pages. Focused crawlers, meanwhile, have become an attractive area of research in recent years. They offer a better solution to the limitations of general-purpose search engines and lead to a new generation of search engines called vertical search engines. Searching the Web vertically means dividing the Web into smaller regions, each related to a specific domain, with one crawler allowed to search each domain. The innovation of this article is adding intelligence and adaptability to focused crawlers. These added features guide the crawler to retrieve more relevant pages while crawling the Web. The proposed crawler can estimate the rank of a page before visiting it and adapts itself to any changes in its domain.
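
The idea of ranking a page before visiting it can be sketched as a priority-ordered crawl frontier; the scoring below, based on anchor-text and URL keywords, is an illustrative stand-in for the article's actual estimation and adaptation methods:

```python
# Hedged sketch of a pre-visit ranking frontier for a focused crawler.
import heapq

DOMAIN_KEYWORDS = {"punjabi", "news", "corpus", "crawler"}  # example topic

def estimate_rank(url, anchor_text):
    """Cheap pre-visit relevance estimate from link context alone."""
    tokens = set(anchor_text.lower().split()) | set(url.lower().split("/"))
    return len(tokens & DOMAIN_KEYWORDS)

frontier = []  # max-heap via negated scores

def enqueue(url, anchor_text):
    heapq.heappush(frontier, (-estimate_rank(url, anchor_text), url))

def next_url():
    # Visit the most promising page first
    return heapq.heappop(frontier)[1] if frontier else None

enqueue("http://site.example/punjabi/news/1", "punjabi news today")
enqueue("http://site.example/about", "about us")
print(next_url())  # -> the news link, which scored higher
```

An adaptive crawler would additionally update its keyword weights from the pages it has already judged relevant, so the estimates track changes in the domain over time.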

