Improving Data Collection on Article Clustering by Using Distributed Focused Crawler

2017 ◽  
Vol 1 (1) ◽  
pp. 1-12 ◽  
Author(s):  
Dani Gunawan ◽  
Amalia Amalia ◽  
Atras Najwan

Collecting or harvesting data from the Internet is often done with a web crawler. A general web crawler can be specialized to focus on a certain topic; this type of crawler is called a focused crawler. To improve data-collection performance, building a focused crawler alone is not enough, even though a focused crawler makes efficient use of network bandwidth and storage capacity. This research proposes a distributed focused crawler to improve crawling performance while remaining efficient in network bandwidth and storage capacity. The distributed focused crawler implements crawl scheduling, site ordering to determine the URL queue, and focused crawling using Naïve Bayes. The research also tests crawling performance under multithreading, observing CPU and memory utilization. The conclusion is that crawling performance decreases when too many threads are used: CPU and memory utilization become very high, while the performance of the distributed focused crawler drops.
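The core of the focused-crawling idea above can be sketched in a few lines: a Naïve Bayes text classifier scores each fetched page, and only links from relevant pages are enqueued. The training examples, tokenizer, and class labels below are illustrative assumptions, not the paper's actual model.

```python
from collections import Counter
import math

def tokenize(text):
    return text.lower().split()

class NaiveBayes:
    """Tiny two-class Naive Bayes with Laplace smoothing."""
    def __init__(self):
        self.word_counts = {"relevant": Counter(), "irrelevant": Counter()}
        self.doc_counts = {"relevant": 0, "irrelevant": 0}

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def score(self, text, label):
        total_docs = sum(self.doc_counts.values())
        log_p = math.log(self.doc_counts[label] / total_docs)
        vocab = set(self.word_counts["relevant"]) | set(self.word_counts["irrelevant"])
        total = sum(self.word_counts[label].values())
        for word in tokenize(text):
            # Laplace smoothing so unseen words do not zero out the score.
            count = self.word_counts[label][word] + 1
            log_p += math.log(count / (total + len(vocab)))
        return log_p

    def is_relevant(self, text):
        return self.score(text, "relevant") > self.score(text, "irrelevant")

# Illustrative training data for an article-clustering topic.
nb = NaiveBayes()
nb.train("article clustering text mining corpus", "relevant")
nb.train("machine learning document classification", "relevant")
nb.train("football match score goals", "irrelevant")
nb.train("recipe cooking dinner kitchen", "irrelevant")

# A focused crawler would call is_relevant() on each fetched page and only
# follow its outgoing links when the page matches the target topic.
print(nb.is_relevant("document clustering with text corpus"))  # → True
```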

Author(s):  
Rodrigo Dos Santos Costa

Despite the contemporary discussion about knowledge management and the deep use of technologies focused on the architecture, organization, and discovery of knowledge, based both on analysis of an organization's internal data and on public data available on the Internet, a critical look at organizational knowledge-creation processes is still necessary, as is attention to the amount of tacit knowledge present in an organization. The evolution of technologies such as mobile computing and the web, along with advances in computer architecture and in the capacity to process and store data, has brought about the information economy, or the age of knowledge. This shifts the focus to people, the central axis of organizational knowledge, and their ability to reason, infer, and make decisions, and above all to the processes of knowledge creation focused on collaborative problem solving and the generation of innovation based on the socialization of knowledge.


2011 ◽  
Vol 204-210 ◽  
pp. 1454-1458
Author(s):  
Xing Chen ◽  
Wei Jiang Li ◽  
Tie Jun Zhao ◽  
Xing Hai Piao

On the current scale of the Internet, a single web crawler cannot visit the entire web in an acceptable time frame, so we developed a distributed web crawler system. Our design considers two facets of parallelism: multithreading within each node, and distributed parallelism among nodes; we focus on the distribution and parallelism between nodes. We address two issues of the distributed web crawler: the crawl strategy and dynamic configuration. Experimental results show that a hash function based on the web site achieves the goals of the distributed web crawler. While pursuing load balance across the system, we also reduce communication and management overhead as much as possible.
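The site-based hash partitioning described above can be sketched as follows: every URL from the same host hashes to the same crawler node, so per-site politeness and duplicate detection stay local to one node. The node count and hash choice are illustrative assumptions, not details from the paper.

```python
import hashlib
from urllib.parse import urlparse

NUM_NODES = 4  # illustrative cluster size

def assign_node(url, num_nodes=NUM_NODES):
    """Map a URL to a crawler node by hashing its host name."""
    host = urlparse(url).netloc
    # Hash the host, not the full URL, so an entire site maps to one node.
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

urls = [
    "http://example.com/a.html",
    "http://example.com/deep/b.html",
    "http://other.org/index.html",
]
for u in urls:
    print(u, "-> node", assign_node(u))
```

Because the partition key is the host rather than the full URL, two pages from `example.com` always land on the same node, which is what lets each node enforce crawl-delay limits for its sites without coordinating with the others.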


Author(s):  
Deepak Mayal

The World Wide Web (WWW), also referred to as the web, acts as a vital source of information, and searching the web has become very easy nowadays thanks to search engines such as Google and Yahoo. A search engine is basically a complex program that allows users to search for information available on the web, and for that purpose it uses web crawlers. A web crawler systematically browses the World Wide Web. Effective search helps avoid downloading and visiting irrelevant web pages; to achieve that, web crawlers use different search algorithms. This paper reviews different web crawling algorithms that determine the fate of the search system.
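Breadth-first search is the baseline that most crawling algorithms in such surveys are measured against, so it is worth seeing concretely. The link graph below is a stand-in assumption for pages fetched over HTTP; a real crawler would parse links out of downloaded HTML.

```python
from collections import deque

# Toy link graph standing in for real fetched pages.
LINK_GRAPH = {
    "seed.html": ["a.html", "b.html"],
    "a.html": ["c.html"],
    "b.html": ["a.html", "d.html"],
    "c.html": [],
    "d.html": [],
}

def bfs_crawl(seed, max_pages=10):
    frontier = deque([seed])  # FIFO queue gives breadth-first order
    visited = []
    seen = {seed}
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in LINK_GRAPH.get(url, []):
            if link not in seen:  # skip pages already queued or visited
                seen.add(link)
                frontier.append(link)
    return visited

print(bfs_crawl("seed.html"))  # → pages in breadth-first order from the seed
```

Swapping the FIFO `deque` for a priority queue keyed on a relevance or importance score turns this same skeleton into a best-first (focused) crawler.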


A web crawler is also called a spider. It automatically searches the WWW for the purpose of web indexing. As the web grows day by day, the number of web pages worldwide has grown massively. To make search accessible to users, search engines are mandatory; they are used to discover particular data on the WWW. Without search engines it would be almost impossible for a person to find anything on the web, unless he or she already knows a particular URL address. Every search engine maintains a central repository of HTML documents in indexed form. Each time a user issues a query, the search is performed over the database of indexed web pages. The size of each search engine's database depends on the pages existing on the Internet, so to increase the efficiency of search engines, only the most relevant and significant pages should be stored in the database.
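The "central repository of HTML documents in indexed form" is typically an inverted index: each term maps to the set of documents containing it, so a query is answered by set intersection instead of scanning every stored page. The documents below are illustrative assumptions.

```python
from collections import defaultdict

DOCUMENTS = {
    "page1.html": "web crawler indexes pages",
    "page2.html": "search engine query processing",
    "page3.html": "crawler feeds the search engine index",
}

def build_index(docs):
    """Build an inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    # Intersect the posting sets of all query terms (AND semantics).
    postings = [index.get(term, set()) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

index = build_index(DOCUMENTS)
print(search(index, "search engine"))  # → documents containing both terms
```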


2018 ◽  
Vol 2 (2) ◽  
pp. 29-40
Author(s):  
Untung Rahardja ◽  
Qurotul Aini ◽  
Yustin Novita Dewi

The drastic growth of Internet use throughout the world today makes everything change very quickly, and one product of this growing Internet use is the website. Determining the value of a website requires popularity to support it, which can be seen from web rankings, especially for crowdfunding websites: fundraising websites that present information and interaction. To determine a website's rank, Alexa Rank can be used as the ranking measurement tool. This study collected data through a survey of 7 (seven) pieces of literature on website ranking, particularly those related to the use of Alexa Rank. Popularity can be demonstrated using Alexa Rank through 3 (three) Alexa Rank facilities that can be used easily. The final result of this research is obtained by using the facilities provided by Alexa Rank to discover and monitor the popularity of crowdfunding websites. Keywords: Website, Crowdfunding Website, Alexa Rank


2011 ◽  
pp. 209-219 ◽  
Author(s):  
Ray J. Paul

Problem formulation, data collection, modeling, testing, running, analyzing, and results: these are the pre-Internet staged approaches to decision aiding, from a time when the modeling time allowed to the analyst was to some extent determined by the fact that there were few alternative approaches that were either better or faster. It is possible that the Internet now facilitates "cut-and-paste" modeling: the development of an acceptable approximate model, suitable for the immediate decision, constructed from bits of programs from anywhere on the Web. It is this possibility that is examined in this chapter. First we look at classical decision modeling, then at a hypothesized Internet alternative approach, and lastly we mention a danger of the Internet approach: what might happen to the benefits of mental activity?


2011 ◽  
Vol 48-49 ◽  
pp. 496-501
Author(s):  
Yu Su ◽  
Shu Hong Wen ◽  
Jian Ping Chai

Television data collection and return technologies are key technologies in TV secure-broadcasting systems, TV video content surveillance, TV program copyright protection, and client advertisement broadcasting. In China, the dominant methods of TV video content surveillance are manual tape recording and whole-program automatic return. The manual method costs too much, and whole-program return needs a lot of network bandwidth and storage space. This paper proposes a new method of television data collection and return: a video field is extracted from the continuous video and coded at a frequency of about one field per second; in other words, one field is extracted from every fifty fields of the original video in the PAL TV system. The extracted frame can be coded by any means, for example JPEG2000, or the intra mode of H.264 or MPEG2. The TV programs whose content and topic change most frequently are news and advertisements, which may change topic every five to ten seconds, so the extracted sequences keep the same topic and content as the original video and carry enough information for TV program content-surveillance applications. The data quantity of the extracted sequence is about 3 percent of the original video program, which saves a large amount of network bandwidth and storage space. A hardware implementation of this technology based on an embedded system is proposed: the TV Field Extractor, which cyclically extracts images from the target TV program, compresses them with a high-performance image-compression algorithm, and either stores the resulting sequences of still images on the hard disk or transmits them to the monitoring center via the network. This method evidently reduces device cost, network bandwidth, and storage space, and can be widely adopted in TV program content surveillance and TV secure-broadcasting systems.
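The sampling arithmetic above is easy to check: PAL video carries 50 fields per second, and the method keeps roughly one field per second, i.e. one of every 50 fields. The sketch below is a back-of-the-envelope check under those stated figures, not an implementation of the TV Field Extractor hardware.

```python
PAL_FIELDS_PER_SECOND = 50
KEEP_EVERY = 50  # keep 1 field out of every 50 (about one per second)

def kept_field_indices(total_fields, keep_every=KEEP_EVERY):
    """Indices of the fields a field extractor would retain."""
    return list(range(0, total_fields, keep_every))

# Ten seconds of PAL video: 500 fields, of which 10 survive sampling.
fields = 10 * PAL_FIELDS_PER_SECOND
kept = kept_field_indices(fields)
print(len(kept), "fields kept out of", fields)  # → 10 fields kept out of 500

# Raw sampling alone keeps 1/50 = 2% of the fields; the roughly 3% of
# original data volume reported in the abstract also includes coding overhead.
print("sampling ratio:", len(kept) / fields)  # → 0.02
```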


Author(s):  
Yongling Song

Croton Media (克顿传媒) is a Chinese TV production company noted for its content strategy developed from research studies on digitalization. My Sunshine (何以笙萧默), one of Croton Media's notable TV series, won a very high rating on terrestrial TV in 2015. It got hundreds of millions of views on the first day of its launch on the Internet, breaking the record. Soon after its launch, web search queries for Wallace Chung (钟汉良), one of the leading actors, rose to the top, beating the most popular Korean stars at that time. In May, the Korean TV station MBC acquired the programming rights, added Korean subtitles, and aired the series, making it the first Chinese TV series acquired and aired by a Korean terrestrial TV station.


Author(s):  
Harshala Bhoir ◽  
K. Jayamalini

Nowadays the Internet is widely used by users to find required information, and searching the web for useful information has become more difficult. A web crawler helps extract relevant and irrelevant links from the web, downloading web pages programmatically. This paper implements a web crawler with the Scrapy and Beautiful Soup Python frameworks to crawl news on news web sites. Scrapy is a web crawling framework that allows a programmer to create spiders that define how a certain site or group of sites will be scraped; it has built-in support for extracting data from HTML sources using XPath and CSS expressions. Beautiful Soup is a framework that extracts data from web pages; it provides a few simple methods for navigating, searching, and modifying a parse tree, and it automatically converts incoming documents to Unicode and outgoing documents to UTF-8. The proposed system uses Beautiful Soup and Scrapy to crawl news web sites. This paper also compares the Scrapy and Beautiful Soup 4 web crawler frameworks.
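The central step both frameworks perform is pulling links out of fetched HTML. The paper uses Scrapy and Beautiful Soup; the standalone sketch below shows the same link-extraction step using only Python's standard-library `HTMLParser`, so it runs without third-party packages. The sample HTML is illustrative.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags, as a crawler's parser would."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

HTML = """
<html><body>
  <a href="/news/politics.html">Politics</a>
  <a href="/news/sports.html">Sports</a>
  <p>No link here.</p>
</body></html>
"""

parser = LinkExtractor()
parser.feed(HTML)
print(parser.links)  # → ['/news/politics.html', '/news/sports.html']
```

In Beautiful Soup the same step is roughly `[a["href"] for a in soup.find_all("a", href=True)]`; in a Scrapy spider it would be a CSS selector such as `response.css("a::attr(href)")`.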

