Improving Data Collection on Article Clustering by Using Distributed Focused Crawler

2017 ◽  
Vol 1 (1) ◽  
pp. 1-12 ◽  
Author(s):  
Dani Gunawan ◽  
Amalia Amalia ◽  
Atras Najwan

Collecting or harvesting data from the Internet is often done with a web crawler. A general web crawler can be specialized to focus on a certain topic; this type of crawler is called a focused crawler. To improve data-collection performance, building a focused crawler alone is not enough, even though a focused crawler makes efficient use of network bandwidth and storage capacity. This research proposes a distributed focused crawler to improve crawling performance while remaining efficient in network bandwidth and storage capacity. The distributed focused crawler implements crawl scheduling, site ordering to determine the URL queue, and focused crawling using Naïve Bayes. The research also tests crawling performance under multithreading, observing CPU and memory utilization. The conclusion is that crawling performance decreases when too many threads are used: CPU and memory utilization become very high, while the performance of the distributed focused crawler drops.
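The core of the focused-crawling idea above can be sketched in a few lines: a Naïve Bayes text classifier scores each fetched page, and only links from relevant pages are enqueued. The training examples, tokenizer, and class labels below are illustrative assumptions, not the paper's actual model.

```python
from collections import Counter
import math

def tokenize(text):
    return text.lower().split()

class NaiveBayes:
    """Tiny two-class Naive Bayes with Laplace smoothing."""
    def __init__(self):
        self.word_counts = {"relevant": Counter(), "irrelevant": Counter()}
        self.doc_counts = {"relevant": 0, "irrelevant": 0}

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def score(self, text, label):
        total_docs = sum(self.doc_counts.values())
        log_p = math.log(self.doc_counts[label] / total_docs)
        vocab = set(self.word_counts["relevant"]) | set(self.word_counts["irrelevant"])
        total = sum(self.word_counts[label].values())
        for word in tokenize(text):
            # Laplace smoothing so unseen words do not zero out the score.
            count = self.word_counts[label][word] + 1
            log_p += math.log(count / (total + len(vocab)))
        return log_p

    def is_relevant(self, text):
        return self.score(text, "relevant") > self.score(text, "irrelevant")

# Illustrative training data for an article-clustering topic.
nb = NaiveBayes()
nb.train("article clustering text mining corpus", "relevant")
nb.train("machine learning document classification", "relevant")
nb.train("football match score goals", "irrelevant")
nb.train("recipe cooking dinner kitchen", "irrelevant")

# A focused crawler would call is_relevant() on each fetched page and only
# follow its outgoing links when the page matches the target topic.
print(nb.is_relevant("document clustering with text corpus"))  # → True
```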

Author(s):  
Rodrigo Dos Santos Costa

Despite the contemporary discussion about knowledge management and the deep use of technologies focused on the architecture, organization, and discovery of knowledge, based both on analysis of an organization's internal data and on public data available on the Internet, a critical look at organizational knowledge-creation processes is still necessary, as is attention to the amount of tacit knowledge present in an organization. The evolution of technologies such as mobile computing and the web, along with advances in computer architecture and in the capacity to process and store data, has brought about the information economy, or the age of knowledge. This shifts the focus to people, the central axis of organizational knowledge, and their ability to reason, infer, and make decisions, and above all to the processes of knowledge creation focused on collaborative problem solving and the generation of innovation based on the socialization of knowledge.


2011 ◽  
Vol 204-210 ◽  
pp. 1454-1458
Author(s):  
Xing Chen ◽  
Wei Jiang Li ◽  
Tie Jun Zhao ◽  
Xing Hai Piao

On the current scale of the Internet, a single web crawler cannot visit the entire web in an acceptable time frame, so we developed a distributed web crawler system. Our design considers two facets of parallelism: multithreading within each node, and distributed parallelism among nodes; we focus on the distribution and parallelism between nodes. We address two issues of the distributed web crawler: the crawl strategy and dynamic configuration. Experimental results show that a hash function based on the web site achieves the goals of the distributed web crawler. While pursuing load balance across the system, we also reduce communication and management overhead as much as possible.
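The site-based hash partitioning described above can be sketched as follows: every URL from the same host hashes to the same crawler node, so per-site politeness and duplicate detection stay local to one node. The node count and hash choice are illustrative assumptions, not details from the paper.

```python
import hashlib
from urllib.parse import urlparse

NUM_NODES = 4  # illustrative cluster size

def assign_node(url, num_nodes=NUM_NODES):
    """Map a URL to a crawler node by hashing its host name."""
    host = urlparse(url).netloc
    # Hash the host, not the full URL, so an entire site maps to one node.
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

urls = [
    "http://example.com/a.html",
    "http://example.com/deep/b.html",
    "http://other.org/index.html",
]
for u in urls:
    print(u, "-> node", assign_node(u))
```

Because the partition key is the host rather than the full URL, two pages from `example.com` always land on the same node, which is what lets each node enforce crawl-delay limits for its sites without coordinating with the others.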


Author(s):  
Deepak Mayal

The World Wide Web (WWW), also referred to as the web, acts as a vital source of information, and searching the web has become very easy nowadays thanks to search engines such as Google and Yahoo. A search engine is basically a complex program that allows users to search for information available on the web, and for that purpose it uses web crawlers. A web crawler systematically browses the World Wide Web. Effective search helps avoid downloading and visiting irrelevant web pages; to achieve that, web crawlers use different search algorithms. This paper reviews different web crawling algorithms that determine the fate of the search system.
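Breadth-first search is the baseline that most crawling algorithms in such surveys are measured against, so it is worth seeing concretely. The link graph below is a stand-in assumption for pages fetched over HTTP; a real crawler would parse links out of downloaded HTML.

```python
from collections import deque

# Toy link graph standing in for real fetched pages.
LINK_GRAPH = {
    "seed.html": ["a.html", "b.html"],
    "a.html": ["c.html"],
    "b.html": ["a.html", "d.html"],
    "c.html": [],
    "d.html": [],
}

def bfs_crawl(seed, max_pages=10):
    frontier = deque([seed])  # FIFO queue gives breadth-first order
    visited = []
    seen = {seed}
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in LINK_GRAPH.get(url, []):
            if link not in seen:  # skip pages already queued or visited
                seen.add(link)
                frontier.append(link)
    return visited

print(bfs_crawl("seed.html"))  # → pages in breadth-first order from the seed
```

Swapping the FIFO `deque` for a priority queue keyed on a relevance or importance score turns this same skeleton into a best-first (focused) crawler.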


A web crawler is also called a spider. It automatically searches the WWW for the purpose of web indexing. As the web grows day by day, the number of web pages worldwide has grown massively. To make search accessible to users, search engines are mandatory; they are used to discover particular data on the WWW. Without search engines it would be almost impossible for a person to find anything on the web, unless he or she already knows a particular URL address. Every search engine maintains a central repository of HTML documents in indexed form. Each time a user issues a query, the search is performed over the database of indexed web pages. The size of each search engine's database depends on the pages existing on the Internet, so to increase the efficiency of search engines, only the most relevant and significant pages should be stored in the database.
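The "central repository of HTML documents in indexed form" is typically an inverted index: each term maps to the set of documents containing it, so a query is answered by set intersection instead of scanning every stored page. The documents below are illustrative assumptions.

```python
from collections import defaultdict

DOCUMENTS = {
    "page1.html": "web crawler indexes pages",
    "page2.html": "search engine query processing",
    "page3.html": "crawler feeds the search engine index",
}

def build_index(docs):
    """Build an inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    # Intersect the posting sets of all query terms (AND semantics).
    postings = [index.get(term, set()) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

index = build_index(DOCUMENTS)
print(search(index, "search engine"))  # → documents containing both terms
```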


2018 ◽  
Vol 2 (2) ◽  
pp. 29-40
Author(s):  
Untung Rahardja ◽  
Qurotul Aini ◽  
Yustin Novita Dewi

The drastic growth of Internet use throughout the world today makes everything change very quickly, and one product of this growing Internet use is the website. Determining the value of a website requires popularity to support it, which can be seen from web rankings, especially for crowdfunding websites: fundraising websites that present information and interaction. To determine a website's rank, Alexa Rank can be used as the ranking measurement tool. This study collected data through a survey of 7 (seven) pieces of literature on website ranking, particularly those related to the use of Alexa Rank. Popularity can be demonstrated using Alexa Rank through 3 (three) Alexa Rank facilities that can be used easily. The final result of this research is obtained by using the facilities provided by Alexa Rank to discover and monitor the popularity of crowdfunding websites. Keywords: Website, Crowdfunding Website, Alexa Rank


2011 ◽  
pp. 209-219 ◽  
Author(s):  
Ray J. Paul

Problem formulation, data collection, modeling, testing, running, analyzing, and results: these are the pre-Internet staged approaches to decision aiding, from a time when the modeling time allowed to the analyst was to some extent determined by the fact that there were few alternative approaches that were either better or faster. It is possible that the Internet now facilitates "cut-and-paste" modeling: the development of an acceptable approximate model, suitable for the immediate decision, constructed from bits of programs from anywhere on the Web. It is this possibility that is examined in this chapter. First we look at classical decision modeling, then at a hypothesized Internet alternative approach, and lastly we mention a danger of the Internet approach: what might happen to the benefits of mental activity?


2011 ◽  
Vol 48-49 ◽  
pp. 496-501
Author(s):  
Yu Su ◽  
Shu Hong Wen ◽  
Jian Ping Chai

Television data collection and return technologies are key technologies in TV secure-broadcasting systems, TV video content surveillance, TV program copyright protection, and client advertisement broadcasting. In China, the dominant methods of TV video content surveillance are manual tape recording and whole-program automatic return. The manual method costs too much, and whole-program return needs a lot of network bandwidth and storage space. This paper proposes a new method of television data collection and return: a video field is extracted from the continuous video and coded at a frequency of about one field per second; in other words, one field is extracted from every fifty fields of the original video in the PAL TV system. The extracted frame can be coded by any means, for example JPEG2000, or the intra mode of H.264 or MPEG2. The TV programs whose content and topic change most frequently are news and advertisements, which may change topic every five to ten seconds, so the extracted sequences keep the same topic and content as the original video and carry enough information for TV program content-surveillance applications. The data quantity of the extracted sequence is about 3 percent of the original video program, which saves a large amount of network bandwidth and storage space. A hardware implementation of this technology based on an embedded system is proposed: the TV Field Extractor, which cyclically extracts images from the target TV program, compresses them with a high-performance image-compression algorithm, and either stores the resulting sequences of still images on the hard disk or transmits them to the monitoring center via the network. This method evidently reduces device cost, network bandwidth, and storage space, and can be widely adopted in TV program content surveillance and TV secure-broadcasting systems.
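The sampling arithmetic above is easy to check: PAL video carries 50 fields per second, and the method keeps roughly one field per second, i.e. one of every 50 fields. The sketch below is a back-of-the-envelope check under those stated figures, not an implementation of the TV Field Extractor hardware.

```python
PAL_FIELDS_PER_SECOND = 50
KEEP_EVERY = 50  # keep 1 field out of every 50 (about one per second)

def kept_field_indices(total_fields, keep_every=KEEP_EVERY):
    """Indices of the fields a field extractor would retain."""
    return list(range(0, total_fields, keep_every))

# Ten seconds of PAL video: 500 fields, of which 10 survive sampling.
fields = 10 * PAL_FIELDS_PER_SECOND
kept = kept_field_indices(fields)
print(len(kept), "fields kept out of", fields)  # → 10 fields kept out of 500

# Raw sampling alone keeps 1/50 = 2% of the fields; the roughly 3% of
# original data volume reported in the abstract also includes coding overhead.
print("sampling ratio:", len(kept) / fields)  # → 0.02
```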


Author(s):  
Yongling Song

Croton Media (克顿传媒) is a Chinese TV production company noted for its content strategy developed from research studies on digitalization. My Sunshine (何以笙萧默), one of Croton Media's notable TV series, won a very high rating on terrestrial TV in 2015. It got hundreds of millions of views on the first day of its launch on the Internet, breaking the record. Soon after its launch, web search queries for Wallace Chung (钟汉良), one of the leading actors, rose to the top, beating the most popular Korean stars at that time. In May, the Korean TV station MBC acquired the programming rights, added Korean subtitles, and aired the series, making it the first Chinese TV series acquired and aired by a Korean terrestrial TV station.


Author(s):  
Harshala Bhoir ◽  
K. Jayamalini

Nowadays the Internet is widely used by users to find required information, and searching the web for useful information has become more difficult. A web crawler helps extract relevant and irrelevant links from the web, downloading web pages programmatically. This paper implements a web crawler with the Scrapy and Beautiful Soup Python frameworks to crawl news on news web sites. Scrapy is a web crawling framework that allows a programmer to create spiders that define how a certain site or group of sites will be scraped; it has built-in support for extracting data from HTML sources using XPath and CSS expressions. Beautiful Soup is a framework that extracts data from web pages; it provides a few simple methods for navigating, searching, and modifying a parse tree, and it automatically converts incoming documents to Unicode and outgoing documents to UTF-8. The proposed system uses Beautiful Soup and Scrapy to crawl news web sites. This paper also compares the Scrapy and Beautiful Soup 4 web crawler frameworks.
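The central step both frameworks perform is pulling links out of fetched HTML. The paper uses Scrapy and Beautiful Soup; the standalone sketch below shows the same link-extraction step using only Python's standard-library `HTMLParser`, so it runs without third-party packages. The sample HTML is illustrative.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags, as a crawler's parser would."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

HTML = """
<html><body>
  <a href="/news/politics.html">Politics</a>
  <a href="/news/sports.html">Sports</a>
  <p>No link here.</p>
</body></html>
"""

parser = LinkExtractor()
parser.feed(HTML)
print(parser.links)  # → ['/news/politics.html', '/news/sports.html']
```

In Beautiful Soup the same step is roughly `[a["href"] for a in soup.find_all("a", href=True)]`; in a Scrapy spider it would be a CSS selector such as `response.css("a::attr(href)")`.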

