Web data mining

2021 ◽  
pp. 46-70
Author(s):  
Stefan Bosse ◽  
Lena Dahlhaus ◽  
Uwe Engel
2009 ◽  
Vol 20 (11) ◽  
pp. 2950-2964 ◽  
Author(s):  
Xiao-Yong DU ◽  
Yan WANG ◽  
Bin LÜ

2020 ◽  
Author(s):  
Elise Braekman ◽  
Stefaan Demarest ◽  
Rana Charafeddine ◽  
Sabine Drieskens ◽  
Finaba Berete ◽  
...  

BACKGROUND Web data collection holds potential for population health surveys because of its cost-effectiveness, ease of implementation and increasing internet penetration. Nonetheless, web modes may lead to lower and more selective unit response rates than traditional modes and hence may increase bias in the measured indicators.
OBJECTIVE This research assesses the unit response and costs of a web versus a face-to-face (F2F) study.
METHODS Alongside the F2F Belgian Health Interview Survey of 2018 (BHIS2018; n gross sample used: 7,698), a web survey (BHISWEB; n gross sample=6,183) is organized. Socio-demographic data on invited individuals are obtained from the national register and census linkages. Unit response rates considering the different sampling probabilities of both surveys are calculated. Logistic regression analyses examine the association between mode system (web vs. F2F) and socio-demographic characteristics on unit non-response. The costs per completed web questionnaire are compared with those for a completed F2F questionnaire.
RESULTS The unit response rate is lower in BHISWEB (18.0%) than in BHIS2018 (43.1%). A lower web response is found among all socio-demographic groups; however, the difference is larger among people older than 65, low-educated people, people with a non-Belgian nationality, people living alone and those living in the Brussels Capital region. The socio-demographic characteristics associated with non-response differ between the two studies. Having another European (OR (95% CI): 1.60 (1.20-2.13)) or a non-European nationality (OR (95% CI): 2.57 (1.79-3.70)) (compared to having the Belgian nationality) and living in the Brussels Capital (OR (95% CI): 1.72 (1.41-2.10)) or Walloon (OR (95% CI): 1.47 (1.15-1.87)) region (compared to living in the Flemish region) are associated with a higher non-response only in BHISWEB. In BHIS2018, younger people (OR (95% CI): 1.31 (1.11-1.54)) are more likely to be non-respondents than older people; this was not found in BHISWEB. In both studies, lower-educated people have a higher chance of being non-respondents, but this effect is more pronounced in BHISWEB (OR low vs. high education level (95% CI): Web 2.71 (2.21-3.39); F2F 1.70 (1.48-1.95)). The BHISWEB study has a considerable cost advantage; the total cost per completed questionnaire is almost three times lower (€41) compared to the F2F data collection (€111).
CONCLUSIONS The F2F unit response rate is generally higher, yet for certain groups the difference between web and F2F is more limited. A considerable cost advantage of web collection is found. It is therefore worthwhile to experiment with adaptive mixed-mode designs to optimize financial resources without increasing selection bias, e.g. inviting only socio-demographic groups more eager to participate online to web surveys while continuing to focus on increasing the F2F response rates for other groups.
CLINICALTRIAL Studies approved by the Ethics Committee of the University Hospital of Ghent.
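As a rough illustration of the quantities reported in this abstract, the sketch below computes an unadjusted odds ratio with a Wald 95% confidence interval from a 2x2 table, and the ratio of the two per-questionnaire costs. It is not the study's method: the abstract reports odds ratios from logistic regression, whereas this is the simpler unadjusted calculation. The function names and the 2x2 counts in the usage note are hypothetical; only the cost figures (€111 and €41) come from the abstract.

```python
import math


def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio and Wald confidence interval for a 2x2 table.

    a: non-respondents in the group of interest
    b: respondents in the group of interest
    c: non-respondents in the reference group
    d: respondents in the reference group
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


def cost_ratio(f2f_cost, web_cost):
    """How many times more a completed F2F questionnaire costs than a web one."""
    return f2f_cost / web_cost
```

With the abstract's figures, `cost_ratio(111, 41)` is about 2.7, consistent with "almost three times lower". A call such as `odds_ratio_wald_ci(20, 10, 10, 20)` uses made-up counts purely to show the interface.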


Author(s):  
Dilip Kumar Sharma ◽  
A. K. Sharma

A traditional crawler picks up a URL, retrieves the corresponding page and extracts its links, adding them to the queue. A deep Web crawler, after adding links to the queue, also checks the page for forms. If forms are present, it processes them and retrieves the required information. Various techniques have been proposed for crawling deep Web information, but much remains undiscovered. In this paper, the authors analyze and compare important deep Web crawling techniques to identify their relative limitations and advantages. To minimize the limitations of existing deep Web crawlers, a novel architecture is proposed based on the QIIIEP specifications (Sharma & Sharma, 2009). The proposed architecture is cost-effective and supports both privatized search and general search for deep Web data hidden behind HTML forms.
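The difference between the two crawler types described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the architecture proposed in the paper: it does not implement QIIIEP, and "processing" a form is reduced to recording its action URL and input field names. The `crawl` function, the `LinkAndFormParser` class and the in-memory `PAGES` dictionary (standing in for real HTTP fetches) are all hypothetical names introduced here.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndFormParser(HTMLParser):
    """Collects hyperlinks and forms (action plus input names) from one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.forms = []
        self._current_form = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "form":
            self._current_form = {"action": attrs.get("action", ""), "fields": []}
            self.forms.append(self._current_form)
        elif tag == "input" and self._current_form is not None and attrs.get("name"):
            self._current_form["fields"].append(attrs["name"])

    def handle_endtag(self, tag):
        if tag == "form":
            self._current_form = None


def crawl(seed, fetch, deep=False, max_pages=100):
    """Breadth-first crawl; with deep=True, forms on each page are also recorded.

    fetch is any callable mapping a URL to HTML text, or None if unavailable.
    """
    queue, seen = deque([seed]), {seed}
    visited, forms_found = [], []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkAndFormParser()
        parser.feed(html)
        # Traditional step: extract links and add unseen ones to the queue.
        for href in parser.links:
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                queue.append(link)
        # Deep Web step: after queueing links, check the page for forms.
        if deep:
            for form in parser.forms:
                forms_found.append((urljoin(url, form["action"]), form["fields"]))
    return visited, forms_found


# Hypothetical three-page site used instead of real network access.
PAGES = {
    "http://example.test/": '<a href="/search">Search</a> <a href="/about">About</a>',
    "http://example.test/search": '<form action="/results"><input name="q"></form>',
    "http://example.test/about": '<a href="/">Home</a>',
}
```

Calling `crawl("http://example.test/", PAGES.get, deep=True)` visits the three pages and reports the search form; with `deep=False` the same pages are visited but forms are ignored. A real deep Web crawler would go on to fill in and submit the form, which is where the techniques compared in the paper differ.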

