web crawlers
Recently Published Documents


TOTAL DOCUMENTS

127
(FIVE YEARS 33)

H-INDEX

12
(FIVE YEARS 1)

2021 ◽  
Vol 15 (3) ◽  
pp. 205-215
Author(s):  
Gurjot Singh Mahi ◽  
Amandeep Verma

  Web crawlers are as old as the Internet and are most commonly used by search engines to visit websites and index them into repositories. They are not limited to search engines but are also widely utilized to build corpora in different domains and languages. This study developed a focused set of web crawlers for three Punjabi news websites. The web crawlers were developed to extract quality text articles and add them to a local repository to be used in further research. The crawlers were implemented using the Python programming language and were utilized to construct a corpus of more than 134,000 news articles in nine different news genres. The crawler code and extracted corpora were made publicly available to the scientific community for research purposes.
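The abstract describes focused crawlers that follow only article links on three target sites and keep only quality text. A minimal sketch of the two filters such a focused crawler needs is below; the domain names and the article-path pattern are placeholders, since the abstract does not name the three Punjabi news websites or their URL structure.

```python
import re
from urllib.parse import urlparse

# Placeholder allow-list: the three Punjabi news sites are not named in the
# abstract, so these domains are illustrative only.
ALLOWED_DOMAINS = {"example-news-1.com", "example-news-2.com", "example-news-3.com"}
ARTICLE_PATH = re.compile(r"/(news|article)/")

def should_crawl(url):
    """Keep the crawl focused: follow only article links on allowed sites."""
    parts = urlparse(url)
    return parts.netloc in ALLOWED_DOMAINS and bool(ARTICLE_PATH.search(parts.path))

def is_quality_article(text, min_words=50):
    """Crude quality gate: discard pages whose extracted body is too short."""
    return len(text.split()) >= min_words
```

A full crawler would wrap these filters around a fetch loop (e.g., with `urllib.request`), extract the article body from each page, and append accepted articles to the local repository.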


2021 ◽  
Author(s):  
Graeme Edwards ◽  
Larissa Christensen

Cyber strategies play a role in combating child sexual abuse material (CSAM). These strategies aim to detect offenders and prevent them from accessing and producing CSAM, or to identify victims. This paper explores five cyber strategies: peer-to-peer network monitoring, automated multi-modal CSAM detection tools, using web crawlers to identify CSAM sites, pop-up warning messages, and facial recognition. This research synthesis captures the background of each strategy, how it works and the evaluative research, along with the benefits, limitations and implementation considerations, offering a practical overview for a broad audience.


Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Meng Mei ◽  
Hui Tan

With the advance of educational informatisation, the contradiction between the open sharing of educational resources and the protection of intellectual property is becoming more acute. Balancing the two is essential both for the effective information expression of educational resources and for creating a sound environment for intellectual property protection. Protecting intellectual property rights safeguards the rights and interests of knowledge owners, preserves the incentive for knowledge producers to create, and protects the sharing of educational resources. The machine-learning-based approach to information expression and intellectual property protection of educational resources is a protection tool that exploits machine learning's capacity for automation, real-time monitoring, and scaling. It can prevent web crawlers from harming e-commerce websites, keep them from stealing those sites' intellectual property, and analyse the crawlers that visit a website so that important site data is not stolen. From this standpoint, and based on the relationship between the information expression of educational resources and the protection of intellectual property rights, this paper advocates promoting both from multiple perspectives.
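The abstract mentions analysing crawlers that visit a website to protect its data, but gives no method. A minimal rule-based sketch of such crawler-traffic detection is shown below; the heuristics (bot-like User-Agent strings, a `robots.txt` fetch, high request volume) are illustrative assumptions, not taken from the paper, which instead describes a machine-learning approach.

```python
# Heuristic crawler detection over a client's access-log entries.
# Each entry is a dict with "user_agent" and "path" keys (assumed log schema).
BOT_AGENT_HINTS = ("bot", "crawler", "spider", "scrapy")

def looks_like_crawler(entries, rate_threshold=100):
    """Flag a client as a likely crawler from its access-log entries."""
    agents = " ".join(e.get("user_agent", "").lower() for e in entries)
    if any(hint in agents for hint in BOT_AGENT_HINTS):
        return True  # self-identified bot in the User-Agent string
    if any(e.get("path") == "/robots.txt" for e in entries):
        return True  # humans rarely request robots.txt directly
    return len(entries) > rate_threshold  # unusually high request volume
```

A learned model would replace these hand-written rules with features (inter-request timing, path entropy, header consistency) fed to a classifier.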


2021 ◽  
Author(s):  
Tianyi Yue ◽  
Yadong Zhou ◽  
Bowen Hu ◽  
Zhanbo Xu ◽  
Xiaohong Guan ◽  
...  

2021 ◽  
Author(s):  
Hammook Zahra

General and focused crawlers are the main types of web crawlers, used for different goals with different crawling techniques and architectures. Our crawler was written in Java using various software packages and libraries. To test the crawler, it was run on the academic social network Researchgate.net from 3rd April to 28th June 2014 and retrieved real data. The crawler consists of three main algorithms that crawl information such as researcher details, publication details, and question/answer activity details. The retrieved data were analysed to highlight the performance of Canadian researchers in the field of Computer Science on Researchgate.net. The analysis was conducted from the collaboration and (alt)metrics perspectives. Among other features, Researchgate.net provides the “Impact Points” and “RG Score” (alt)metrics; the former builds on the ISI Journal Impact Factor, which disregards the author’s contribution in its calculations. A new Contribution Determines Sequence (CDS) method was developed and tested, with all required scripts, and showed better performance than the other methods.
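The contrast drawn here is between Impact-Factor-style counting, which credits every co-author fully, and a sequence-aware split. The abstract does not give the CDS formula, so the sketch below is a hypothetical sequence-based weighting (harmonic shares by author position, normalized to sum to 1), shown only to illustrate the idea.

```python
def cds_weights(n_authors):
    """Hypothetical sequence-based split: raw weight 1/k for the k-th author,
    normalized so the shares sum to 1 (the actual CDS formula is not given
    in the abstract)."""
    raw = [1.0 / k for k in range(1, n_authors + 1)]
    total = sum(raw)
    return [w / total for w in raw]

def positional_score(impact_points, position, n_authors):
    """Scale a paper's impact points by the author's positional share
    (position is 1-based), unlike Impact-Factor-style counting, which
    credits every author with the full amount."""
    return impact_points * cds_weights(n_authors)[position - 1]
```

Under this scheme, earlier authors of a multi-author paper receive strictly larger shares, and a sole author keeps the full impact points.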




2021 ◽  
Vol 1125 (1) ◽  
pp. 012045
Author(s):  
Ika Oktavia Suzanti ◽  
Fakhrur Razi ◽  
Husni ◽  
Eka Mala Sari Rochman ◽  
Nurhayati Fitriani

Author(s):  
Yanxi Huang ◽  
Fangzhou Zhu ◽  
Liang Liu ◽  
Wezhi Meng ◽  
Simin Hu ◽  
...  

The security of wireless routers has received much attention given the increasing security threats. In the era of the Internet of Things, many devices carry security vulnerabilities, and there is a significant need to analyze the current security status of devices. In this paper, we develop WNV-Detector, a universal and scalable framework for detecting wireless network vulnerabilities. Based on semantic analysis and named-entity recognition, we design rules for the automatic identification of wireless access points and routers. The rules are generated automatically from information extracted from the devices' admin webpages and can be updated with a semi-automated method. To detect the security status of a device, WNV-Detector extracts its critical identity information and retrieves known vulnerabilities. In the evaluation, we collect information through web crawlers and build a comprehensive vulnerability database. We also build a prototype system based on WNV-Detector and evaluate it with routers from various vendors on the market. Our results indicate the effectiveness of WNV-Detector: the success rate of vulnerability detection reaches 95.5%.
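The pipeline described (rules over admin-webpage text identify the device, then known vulnerabilities are looked up for that identity) can be sketched as follows. The rules, the vendor/model patterns, and the vulnerability database entries here are all made-up examples, not the ones generated by WNV-Detector.

```python
import re

# Illustrative identification rules: each rule maps a pattern found in a
# router's admin webpage to a vendor; the captured group is the model.
RULES = [
    (re.compile(r"TP-LINK.*?(WR\d+N?)", re.I), "TP-Link"),
    (re.compile(r"NETGEAR.*?(R\d+)", re.I), "Netgear"),
]

# Placeholder vulnerability database keyed by (vendor, model) identity.
VULN_DB = {
    ("TP-Link", "WR841N"): ["CVE-XXXX-YYYY (placeholder entry)"],
}

def identify_device(admin_html):
    """Match admin-page text against the rules; return (vendor, model) or None."""
    for pattern, vendor in RULES:
        match = pattern.search(admin_html)
        if match:
            return vendor, match.group(1)
    return None

def known_vulnerabilities(identity):
    """Look up the identified device in the local vulnerability database."""
    return VULN_DB.get(identity, [])
```

In the real system the rules are generated and updated semi-automatically rather than hand-written, and the database is populated by web crawlers.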

