Applications of Web Scraping in Economics and Finance

Author(s):  
Piotr Śpiewanowski ◽  
Oleksandr Talavera ◽  
Linh Vi

The 21st-century economy is increasingly built around data. Firms and individuals upload and store enormous amounts of data. Most of the data produced is stored on private servers, but a considerable part is made publicly available across the 1.83 billion websites online. These data can be accessed by researchers using web-scraping techniques. Web scraping refers to the process of collecting data from web pages, either manually or with automation tools and specialized software. Web scraping is feasible and relatively simple because pages designed to be displayed in web browsers share a regular code structure. Websites built with HTML can be scraped using standard text-mining tools: either scripts in popular (statistical) programming languages such as Python, Stata, or R, or stand-alone dedicated web-scraping tools. Some of those tools do not even require any prior programming skills. Since about 2010, with the omnipresence of social and economic activity on the Internet, web scraping has become increasingly popular among academic researchers. In contrast to proprietary data, which may be out of reach because of substantial costs, web scraping can make interesting data sources accessible to everyone. Thanks to web scraping, data are now available in real time and in significantly more detail than what statistical offices or commercial data vendors have traditionally offered. In fact, many statistical offices have started using web-scraped data, for example for calculating price indices. Data collected through web scraping have been used in numerous economics and finance projects and can easily complement traditional data sources.
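
As a concrete illustration of the technique, the short Python sketch below fetches a static HTML page and extracts product names and prices with the widely used requests and BeautifulSoup libraries. The URL and CSS selectors are hypothetical placeholders rather than a real data source, and any actual project should also check the target site's robots.txt and terms of use before collecting data.

```python
# Minimal web-scraping sketch: fetch a static HTML page and extract
# product names and prices. The URL and selectors are hypothetical;
# a real site will require its own selectors and permission checks.
import csv

import requests
from bs4 import BeautifulSoup

URL = "https://example.com/products"  # placeholder URL

response = requests.get(URL, headers={"User-Agent": "research-scraper/0.1"}, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

rows = []
for item in soup.select("div.product"):      # hypothetical container class
    name = item.select_one("h2.name")        # hypothetical tag/class names
    price = item.select_one("span.price")
    if name and price:
        rows.append({"name": name.get_text(strip=True),
                     "price": price.get_text(strip=True)})

# Store the scraped observations for later analysis.
with open("prices.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)
```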

2014 ◽  
Vol 23 (01) ◽  
pp. 177-181 ◽  
Author(s):  
W. Hersh ◽  
A. U. Jai Ganesh ◽  
P. Otero

Summary Objective: The growing volume and diversity of health and biomedical data indicate that the era of Big Data has arrived for healthcare. This has many implications for informatics, not only in terms of implementing and evaluating information systems, but also for the work and training of informatics researchers and professionals. This article addresses the question: What do biomedical and health informaticians working in analytics and Big Data need to know? Methods: We hypothesize a set of skills that we hope will be discussed among academic and other informaticians. Results: The set of skills includes: Programming - especially with data-oriented tools, such as SQL and statistical programming languages; Statistics - a working knowledge sufficient to apply tools and techniques; Domain knowledge - depending on one’s area of work, bioscience or health care; and Communication - being able to understand the needs of people and organizations and to articulate results back to them. Conclusion: Biomedical and health informatics educational programs must introduce into their curricula the concepts of analytics and Big Data and the underlying skills to use and apply them. The development of new coursework should focus both on those who will become experts, with training aimed at building “deep analytical talent,” and on those who need the knowledge to support such individuals.


2020 ◽  
pp. 193896552097358
Author(s):  
Saram Han ◽  
Christopher K. Anderson

As consumers increasingly research and purchase hospitality and travel services online, new research opportunities have become available to hospitality academics. There is growing interest among hospitality researchers in understanding the online travel marketplace. Although many researchers have attempted to better understand the online travel market through the use of analytical models, experiments, or survey collection, these studies often fail to capture the full complexity of the market. Academics often rely upon survey data or experiments owing to their ease of collection, or potentially to the difficulty of assembling online data. In this study, we hope to equip hospitality researchers with the tools and methods to augment their traditional data sources with the readily available data that consumers use to make their travel choices. In this article, we provide a guideline (and Python code) on how best to collect/scrape publicly available online hotel data. We focus on the collection of online data across numerous platforms, including online travel agents, review sites, and hotel brand sites. We outline some exciting possibilities for how these data sources might be utilized, as well as some of the caveats that have to be considered when analyzing online data.
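
The article supplies its own Python code, which is not reproduced here; the sketch below is only an independent illustration of the kind of collection it describes, gathering review data across several paginated result pages. The URL pattern and selectors are invented and do not correspond to any particular travel platform.

```python
# Illustrative sketch of collecting hotel review data across several
# result pages. The URL pattern and selectors are hypothetical.
import time

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example-travel-site.com/hotel/123/reviews?page={page}"  # placeholder

reviews = []
for page in range(1, 4):                               # first three pages only
    resp = requests.get(BASE_URL.format(page=page), timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    for card in soup.select("div.review"):             # hypothetical review container
        rating = card.select_one("span.rating")
        text = card.select_one("p.review-text")
        reviews.append({
            "page": page,
            "rating": rating.get_text(strip=True) if rating else None,
            "text": text.get_text(strip=True) if text else None,
        })
    time.sleep(1)                                       # throttle requests politely

print(f"Collected {len(reviews)} reviews")
```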


2019 ◽  
Vol 11 (1) ◽  
pp. 1-1
Author(s):  
Sabrina Kletz ◽  
Marco Bertini ◽  
Mathias Lux

Having already discussed MatConvNet and Keras, let us continue with an open source framework for deep learning which takes a new and interesting approach. TensorFlow.js not only brings deep learning to JavaScript developers, it also makes deep learning applications available in WebGL-enabled web browsers, more specifically Chrome, Chromium-based browsers, Safari, and Firefox. Recently, Node.js support has been added, so TensorFlow.js can be used to control TensorFlow directly, without the browser. TensorFlow.js is easy to install: as soon as a browser is installed, one is ready to go. Browser-based, cross-platform applications, e.g. those running on Electron, can also make use of TensorFlow.js without an additional install. Performance, however, depends on the browser the client is running and on the memory and GPU of the client device. More specifically, one cannot expect to analyze 4K videos on a mobile phone in real time. While TensorFlow.js is easy to install and easy to develop against, there are drawbacks: (i) developers have less control over where the machine learning actually takes place (e.g. on CPU or GPU), and the code runs in the same sandbox as all web pages in the browser; and (ii) the current release still has rough edges and is not considered stable enough for production use.


Author(s):  
Mike Thelwall

Scientific Web Intelligence (SWI) is a research field that combines techniques from data mining, Web intelligence, and scientometrics to extract useful information from the links and text of academic-related Web pages using various clustering, visualization, and counting techniques. Its origins lie in previous scientometric research into mining off-line academic data sources such as journal citation databases. Typical scientometric objectives are either evaluative (assessing the impact of research) or relational (identifying patterns of communication within and among research fields). From scientometrics, SWI also inherits a need to validate its methods and results so that the methods can be justified to end users, and the causes of the results can be found and explained.


Author(s):  
José-Fernando. Diez-Higuera ◽  
Francisco-Javier Diaz-Pernas

In the last few years, because of the rapid growth of the Internet, general-purpose clients have achieved a high level of popularity for static consultation of text and pictures. This is the case of the World Wide Web (i.e., the Web browsers). Using a hypertext system, Web users can select and read on their computers information from all around the world, with no other requirement than an Internet connection and a navigation program. For a long time, the information available on the Internet has consisted of written text and 2D pictures (i.e., static information). This sort of information suited many publications, but it was highly unsatisfactory for others, like those related to objects of art, where real volume and interactivity with the user are of great importance. Here, the possibility of including 3D information in Web pages makes real sense.


2018 ◽  
Vol 14 (2) ◽  
pp. 212-232 ◽  
Author(s):  
Weidan Du ◽  
Zhenyu Cheryl Qian ◽  
Paul Parsons ◽  
Yingjie Victor Chen

Purpose: Modern Web browsers all provide a history function that allows users to see a list of URLs they have visited in chronological order. The history log contains rich information but is seldom used because of the tedious nature of scrolling through long lists. This paper aims to propose a new way to improve users’ Web browsing experience by analyzing, clustering and visualizing their browsing history. Design/methodology/approach: The authors developed a system called Personal Web Library to help users develop awareness of and understand their Web browsing patterns, identify their topics of interest and retrieve previously visited Web pages more easily. Findings: User testing showed that the system is usable and attractive, and that users can easily see patterns and trends at different time granularities, recall pages from the past and understand the local context of a browsing session. Its flexibility provides users with much more information than the traditional history function in modern Web browsers. Participants in the study gained an improved awareness of their Web browsing patterns and mentioned that they were willing to improve their time management after viewing those patterns. Practical implications: As more and more daily activities rely on the Internet and Web browsers, browsing data capture a large part of users’ lives. Providing users with interactive visualizations of their browsing history can facilitate personal information management, time management and other meta-level activities. Originality/value: This paper aims to help users gain insights into and improve their Web browsing experience; the authors hope that this work can spur more research contributions in this underdeveloped yet important area.
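
The basic idea of summarizing a raw history log by time and topic can be illustrated in a few lines of Python; the sample entries below are invented, and the authors' Personal Web Library system is of course far richer than this simple aggregation.

```python
# Toy illustration of summarizing a browser history log by day and domain.
# The sample entries are invented; a real history export would be read
# from the browser's own database instead.
from collections import Counter, defaultdict
from datetime import datetime
from urllib.parse import urlparse

history = [
    ("2018-03-01T09:12:00", "https://news.example.com/article/42"),
    ("2018-03-01T09:40:00", "https://docs.example.org/tutorial"),
    ("2018-03-02T14:05:00", "https://news.example.com/article/57"),
]

visits_per_day = defaultdict(Counter)
for timestamp, url in history:
    day = datetime.fromisoformat(timestamp).date()
    domain = urlparse(url).netloc
    visits_per_day[day][domain] += 1

# Print the most visited domains for each day.
for day, counts in sorted(visits_per_day.items()):
    top = ", ".join(f"{d} ({n})" for d, n in counts.most_common())
    print(f"{day}: {top}")
```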


2000 ◽  
Vol 09 (01n02) ◽  
pp. 147-169
Author(s):  
PATRICK MARTIN ◽  
WENDY POWLEY ◽  
ANDREW WESTON ◽  
PETER ZION

In the not too distant past, the amount of online data available to general users was relatively small. Most of the online data was maintained in organizations’ database management systems and accessible only through the interfaces provided by those systems. The popularity of the Internet, in particular, has meant that there is now an abundance of online data available to users in the form of Web pages and files. This data, however, is maintained in passive data sources, that is, sources that do not provide facilities to search or query their data. Instead, the data must be examined using applications such as browsers and search engines. In this paper, we explore an approach to querying passive data sources based on the extraction, and subsequent exploitation, of metadata from the data sources. We describe two situations in which this approach has been used, evaluate the approach and draw some general conclusions.
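
A minimal version of the extract-then-query idea might look like the following Python sketch: it harvests a few simple metadata fields (title, keyword tags, last-modified header) from a list of pages and answers keyword queries against that small index. The field choices and URLs are assumptions made for illustration, not the metadata schema used in the paper.

```python
# Sketch of the extract-then-query idea: harvest simple metadata from
# passive Web sources and answer keyword queries against the result.
# The metadata fields and URLs are illustrative assumptions only.
import requests
from bs4 import BeautifulSoup

SOURCES = ["https://example.com/report.html", "https://example.org/index.html"]  # placeholders

index = []
for url in SOURCES:
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    keywords_tag = soup.find("meta", attrs={"name": "keywords"})
    index.append({
        "url": url,
        "title": soup.title.get_text(strip=True) if soup.title else "",
        "keywords": keywords_tag.get("content", "") if keywords_tag else "",
        "last_modified": resp.headers.get("Last-Modified", ""),
    })

def query(term: str):
    """Return indexed pages whose title or keywords mention the term."""
    term = term.lower()
    return [entry["url"] for entry in index
            if term in entry["title"].lower() or term in entry["keywords"].lower()]

print(query("annual report"))
```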


2016 ◽  
Vol 2016 ◽  
pp. 1-14
Author(s):  
Shukai Liu ◽  
Xuexiong Yan ◽  
Qingxian Wang ◽  
Xu Zhao ◽  
Chuansen Chai ◽  
...  

High-profile attacks using malicious HTML and JavaScript code have seen a dramatic increase in both awareness and exploitation in recent years. Unfortunately, existing security mechanisms do not provide enough protection. We propose a new protection mechanism named PMHJ, based on the support of both web applications and web browsers, against malicious HTML and JavaScript code in vulnerable web applications. PMHJ prevents the injection of HTML elements by means of a random attribute value, and prevents node-split attacks by means of an attribute holding the hash value of the HTML element. PMHJ ensures content security in web pages by verifying HTML elements, confining the insecure HTML usages that can be exploited by attackers, and disabling the JavaScript APIs that may incur injection vulnerabilities. PMHJ provides a flexible way to rein in powerful, high-risk JavaScript APIs according to the principle of least authority. The PMHJ policy is easy to deploy in real-world web applications. Test results show that PMHJ has little influence on the run time and code size of web pages.
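
The paper's actual design is not reproduced here, but the flavour of the two defences it names can be sketched in a few lines of Python: a per-response random token that injected elements cannot know, and a content hash that detects tampering with an element's contents. All names and the hashing scheme below are illustrative assumptions, not the PMHJ implementation.

```python
# Toy illustration only: a per-response random attribute value plus a
# content hash attribute. This is NOT the PMHJ implementation.
import hashlib
import secrets

def stamp(inner_html: str, token: str) -> str:
    """Wrap trusted content with the random token and a hash of its content."""
    digest = hashlib.sha256(inner_html.encode("utf-8")).hexdigest()
    return f'<div data-token="{token}" data-hash="{digest}">{inner_html}</div>'

def verify(token_attr: str, hash_attr: str, inner_html: str, expected_token: str) -> bool:
    """Browser-side check: reject elements with a wrong token or mismatched hash."""
    digest = hashlib.sha256(inner_html.encode("utf-8")).hexdigest()
    return token_attr == expected_token and hash_attr == digest

token = secrets.token_hex(16)                        # fresh random value per response
print(stamp("<p>user comment</p>", token))           # stamped, trusted element
digest = hashlib.sha256(b"<p>user comment</p>").hexdigest()
print(verify(token, digest, "<p>user comment</p>", token))   # True: intact element
print(verify(token, digest, "<p>injected</p>", token))       # False: tampered or injected
```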


2017 ◽  
Vol 37 (7) ◽  
pp. 735-746 ◽  
Author(s):  
Hawre Jalal ◽  
Petros Pechlivanoglou ◽  
Eline Krijkamp ◽  
Fernando Alarid-Escudero ◽  
Eva Enns ◽  
...  

As the complexity of health decision science applications increases, high-level programming languages are increasingly adopted for statistical analyses and numerical computations. These programming languages facilitate sophisticated modeling, model documentation, and analysis reproducibility. Among the high-level programming languages, the statistical programming framework R is gaining increased recognition. R is freely available, cross-platform compatible, and open source. It is supported by a large community of users who have generated an extensive collection of well-documented packages and functions. These functions facilitate applications of health decision science methodology as well as the visualization and communication of results. Although R’s popularity is increasing among health decision scientists, methodological extensions of R in the field of decision analysis remain isolated. The purpose of this article is to provide an overview of existing R functionality that is applicable to the various stages of decision analysis, including model design, input parameter estimation, and analysis of model outputs.

