Getting Started Creating Data Dictionaries: How to Create a Shareable Dataset

2019 ◽  
Author(s):  
Erin Michelle Buchanan ◽  
Sarah E Crain ◽  
Ari L. Cunningham ◽  
Hannah Rose Johnson ◽  
Hannah Elyse Stash ◽  
...  

As researchers embrace open and transparent data sharing, they will need to provide information about their data that effectively helps others understand its contents. Without proper documentation, data stored in online repositories such as OSF will often be rendered unfindable and unreadable by other researchers and indexing search engines. Data dictionaries and codebooks provide a wealth of information about variables, data collection, and other important facets of a dataset. This information, called metadata, provides key insights into how the data might be further used in research and facilitates search engine indexing to reach a broader audience of interested parties. This tutorial first explains the terminology and standards surrounding data dictionaries and codebooks. We then present a guided workflow of the entire process from source data (e.g., survey answers on Qualtrics) to an openly shared dataset accompanied by a data dictionary or codebook that follows an agreed-upon standard. Finally, we explain how to use freely available web applications to assist this process of ensuring that psychology data are findable, accessible, interoperable, and reusable (FAIR; Wilkinson et al., 2016).
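As a concrete illustration of the kind of variable-level metadata such a file records, here is a minimal Python sketch that derives a bare-bones codebook from a CSV export. The file names and the ad hoc JSON layout are ours for illustration only; a real project would follow one of the agreed-upon metadata standards the tutorial describes.

```python
import json
import pandas as pd

# Illustrative only: a minimal, hand-rolled codebook generator.
# Real projects should follow an agreed-upon metadata standard
# rather than this ad hoc layout. The file name is a placeholder.
df = pd.read_csv("survey_export.csv")  # e.g., a Qualtrics export

codebook = []
for column in df.columns:
    entry = {
        "variable": column,
        "dtype": str(df[column].dtype),
        "n_missing": int(df[column].isna().sum()),
        "description": "TODO: human-readable description",
    }
    # For low-cardinality variables, record the observed value labels.
    if df[column].nunique() <= 10:
        entry["values"] = sorted(df[column].dropna().unique().tolist())
    codebook.append(entry)

with open("codebook.json", "w") as f:
    json.dump(codebook, f, indent=2)
```

Even this toy output captures the essentials a reuser needs: what each variable is called, its type, how much is missing, and what values it can take.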


2010 ◽  
Vol 2 (3) ◽  
pp. 1-18
Author(s):  
Andreas Prokoph

Modern web applications and servers such as Portal require adequate support for integrating search services, both because of user-focused information delivery and interaction and because of the new technologies used to render such information. This is exemplified by two fundamental problems that have long plagued web crawlers: dynamic content and JavaScript-generated content. Today, the common solution is simple: ignore such web pages. To enable search in Portals, a different crawling paradigm is required for search engines to gather and consume information. WebSphere Portal provides a framework that propagates content and information through “Seedlists”, which are comparable to HTML-based sitemaps but richer in features. This mandates that applications delivering information and content be “search engine aware”: they must enable services and seedlists for fast, efficient, and complete delivery of content and information. This is the main integration point for search engines into the portal’s site search services, providing a rich, user-focused search experience. This article discusses how such technologies allow more efficient crawling of public Portal sites by prominent Internet search engines, as well as myths surrounding search engine optimization.
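Seedlists themselves are specific to WebSphere Portal, but the sitemap mechanism they enrich is a well-known open format. Purely as a point of reference, the following Python sketch emits a minimal standard sitemap.xml; the URLs are placeholders, and the Seedlist format adds features (incremental updates, richer content metadata) beyond what is shown here.

```python
from xml.etree import ElementTree as ET

# A plain sitemap.xml generator: the crawl-aid baseline that Seedlists enrich.
# The URLs below are placeholders, not a real Portal site.
NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
urlset = ET.Element("urlset", xmlns=NS)

pages = [
    ("https://example.com/portal/home", "2010-01-15"),
    ("https://example.com/portal/news", "2010-02-01"),
]
for loc, lastmod in pages:
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = loc        # page address
    ET.SubElement(url, "lastmod").text = lastmod  # last modification date

ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8",
                             xml_declaration=True)
```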


2017 ◽  
Vol 2017 (4) ◽  
pp. 251-270 ◽  
Author(s):  
Se Eun Oh ◽  
Shuai Li ◽  
Nicholas Hopper

Search engine queries contain a great deal of private and potentially compromising information about users. One technique to prevent search engines from identifying the source of a query, and Internet service providers (ISPs) from identifying the contents of queries, is to query the search engine over an anonymous network such as Tor.

In this paper, we study the extent to which Website Fingerprinting can be extended to fingerprint individual queries or keywords to web applications, a task we call Keyword Fingerprinting (KF). We show that by augmenting traffic analysis using a two-stage approach with new task-specific feature sets, a passive network adversary can in many cases defeat the use of Tor to protect search engine queries.

We explore three popular search engines, Google, Bing, and DuckDuckGo, and several machine learning techniques across various experimental scenarios. Our experimental results show that KF can identify Google queries containing one of 300 targeted keywords with 80% recall and 91% precision, while identifying the specific monitored keyword among 300 search keywords with 48% accuracy. We further investigate the factors that contribute to keyword fingerprintability to understand how search engines and users might protect against KF.
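The paper’s task-specific feature sets and model choices are its own; purely to illustrate the two-stage shape of such an attack, here is a hedged scikit-learn sketch in which random numbers stand in for the real traffic statistics (packet counts, bursts, timings).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Stage 1: a binary detector. Does this Tor traffic trace carry one of the
# monitored search keywords at all? Synthetic features stand in for the
# task-specific traffic statistics used in the paper.
X_train = rng.normal(size=(1000, 40))
y_monitored = rng.integers(0, 2, size=1000)       # 1 = monitored keyword
detector = RandomForestClassifier(n_estimators=100).fit(X_train, y_monitored)

# Stage 2: among traces flagged as monitored, identify *which* keyword.
X_kw = rng.normal(size=(600, 40))
y_keyword = rng.integers(0, 300, size=600)        # 300 targeted keywords
identifier = RandomForestClassifier(n_estimators=100).fit(X_kw, y_keyword)

# At attack time, chain the two stages on an observed trace.
trace = rng.normal(size=(1, 40))
if detector.predict(trace)[0] == 1:
    print("predicted keyword id:", identifier.predict(trace)[0])
```

The chaining is the point: the second, harder multi-class problem is only attempted on traces the first stage has already flagged.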


2021 ◽  
pp. 089443932110068
Author(s):  
Aleksandra Urman ◽  
Mykola Makhortykh ◽  
Roberto Ulloa

We examine how six search engines filter and rank information in relation to queries on the 2020 U.S. presidential primary elections under default, that is, nonpersonalized, conditions. To do so, we use an algorithmic auditing methodology in which virtual agents conduct large-scale analysis of algorithmic information curation in a controlled environment. Specifically, we look at the text search results for the queries “us elections,” “donald trump,” “joe biden,” and “bernie sanders” on Google, Baidu, Bing, DuckDuckGo, Yahoo, and Yandex during the 2020 primaries. Our findings indicate substantial differences in the search results between search engines and multiple discrepancies within the results generated for different agents using the same engine. This highlights that whether users see certain information is decided by chance due to the inherent randomization of search results. We also find that some search engines prioritize different categories of information sources with respect to specific candidates. These observations demonstrate that algorithmic curation of political information can create information inequalities between search engine users even under nonpersonalized conditions. Such inequalities are particularly troubling given that, as previous research has demonstrated, search results are highly trusted by the public and can shift the opinions of undecided voters.
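The auditing infrastructure itself is beyond a short example, but the core comparison between agents can be illustrated simply. The following Python sketch computes two overlap measures between ranked result lists; the domains and lists are invented, not data from the study.

```python
# Hypothetical top-5 result lists returned to two virtual agents issuing
# the same query to the same engine at the same time.
agent_a = ["cnn.com", "nytimes.com", "foxnews.com", "wikipedia.org", "wsj.com"]
agent_b = ["nytimes.com", "cnn.com", "breitbart.com", "wikipedia.org", "apnews.com"]

def jaccard(a, b):
    """Set overlap of two result lists, ignoring rank."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def top_k_overlap(a, b, k):
    """Fraction of the top-k results the two agents have in common."""
    return len(set(a[:k]) & set(b[:k])) / k

print(f"Jaccard overlap: {jaccard(agent_a, agent_b):.2f}")
print(f"Top-3 overlap:   {top_k_overlap(agent_a, agent_b, 3):.2f}")
```

Low overlap between agents that differ in nothing but identity is exactly the chance-driven discrepancy the study describes.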


2020 ◽  
Vol 10 (1) ◽  
pp. 1-16
Author(s):  
Isaac Nyabisa Oteyo ◽  
Mary Esther Muyoka Toili

Researchers in the biosciences are increasingly harnessing technology to improve processes that traditionally relied on pen and paper and were highly manual. The pen-and-paper approach is used mainly to record and capture data from experiment sites. This method is slow, prone to errors, and ill suited to relaying data in a timely manner, which matters because bioscience research activities are often undertaken in remote and distributed locations where the timeliness and quality of collected data are essential. The collected data also has to be associated with the respective specimens (objects or plants). In this paper, we seek to improve specimen labelling and data collection, guided by the following questions: (1) How can data collection in bioscience research be improved? (2) How can specimen labelling be improved in bioscience research activities? We present WebLog, an application we prototyped to help researchers generate specimen labels and collect data from experiment sites. The application converts object (specimen) identifiers into quick response (QR) codes and uses them to label the specimens. Once a specimen label is successfully scanned, the application automatically invokes the data entry form, and the collected data is immediately sent to the server in electronic form for analysis.
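As a sketch of the labelling step, the following Python snippet uses the open-source qrcode library to turn specimen identifiers into printable QR images. The identifier scheme is hypothetical, not necessarily the one WebLog uses.

```python
import qrcode  # pip install qrcode[pil]

# Hypothetical specimen identifiers; a real application would derive these
# from the experiment design rather than a hard-coded list.
specimen_ids = ["PLOT01-ROW03-PLANT07", "PLOT01-ROW03-PLANT08"]

for sid in specimen_ids:
    img = qrcode.make(sid)           # encode the identifier as a QR code
    img.save(f"label_{sid}.png")     # print the image and attach it

# Scanning a label later recovers the identifier, which the data-entry
# form can use to associate measurements with the right specimen.
```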


2020 ◽  
Vol 19 (10) ◽  
pp. 1602-1618 ◽  
Author(s):  
Thibault Robin ◽  
Julien Mariethoz ◽  
Frédérique Lisacek

A key point in achieving accurate intact glycopeptide identification is the definition of the glycan composition file used to match experimental with theoretical masses in a glycoproteomics search engine. At present, these files are mainly built by searching the literature and/or querying data sources focused on post-translational modifications. Most glycoproteomics search engines include a default composition file that is readily used when processing MS data. We introduce here GlyConnect Compozitor, a glycan composition visualization and comparison tool associated with the GlyConnect database. It offers a web interface through which the database can be queried to bring out contextual information relevant to a set of glycan compositions. The tool takes advantage of the fact that compositions are related to one another through shared monosaccharide counts, and it outputs interactive graphs summarizing the information retrieved from the database. These results provide a guide for selecting or deselecting compositions in a file so as to reflect the context of a study as closely as possible. They also confirm the consistency of a set of compositions based on the content of the GlyConnect database. As part of the tool collection of the Glycomics@ExPASy initiative, Compozitor is hosted at https://glyconnect.expasy.org/compozitor/, where it can be run as a web application. It is also directly accessible from the GlyConnect database.
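To make the idea of compositions related through shared monosaccharide counts concrete, here is a hedged networkx sketch that links toy compositions differing by a single monosaccharide. The compositions and the linking rule are ours for illustration and are not Compozitor's actual logic or data.

```python
import networkx as nx

# Toy glycan compositions as monosaccharide counts; the real tool queries
# the GlyConnect database for these.
compositions = {
    "Hex5HexNAc2":       {"Hex": 5, "HexNAc": 2, "NeuAc": 0},
    "Hex5HexNAc3":       {"Hex": 5, "HexNAc": 3, "NeuAc": 0},
    "Hex5HexNAc4":       {"Hex": 5, "HexNAc": 4, "NeuAc": 0},
    "Hex5HexNAc4NeuAc1": {"Hex": 5, "HexNAc": 4, "NeuAc": 1},
}

def distance(a, b):
    """Total monosaccharide-count difference between two compositions."""
    keys = set(a) | set(b)
    return sum(abs(a.get(k, 0) - b.get(k, 0)) for k in keys)

G = nx.Graph()
G.add_nodes_from(compositions)
names = list(compositions)
for i, m in enumerate(names):
    for n in names[i + 1:]:
        # Connect compositions that differ by exactly one monosaccharide.
        if distance(compositions[m], compositions[n]) == 1:
            G.add_edge(m, n)

print(sorted(G.edges()))
```

Walking such a graph shows which compositions in a candidate file are one step apart and which sit isolated, which is the kind of consistency check the abstract describes.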


2001 ◽  
Vol 1 (3) ◽  
pp. 28-31 ◽  
Author(s):  
Valerie Stevenson

Looking back to 1999, there were a number of search engines that performed equally well. I recommended defining the search strategy very carefully, using Boolean logic and field-search techniques, and always running the search in more than one search engine. Numerous articles and Web columns comparing the performance of different search engines came to different conclusions about the “best” search engine. Over the last year, however, all the speakers at conferences and seminars I have attended have recommended Google as their preferred tool for locating all kinds of information on the Web. I confess that I have now abandoned most of my carefully worked-out search strategies and comparison tests, and I use Google for most of my own Web searches.

