Maximizing the sensitivity and reliability of peptide identification in large-scale proteomic experiments by harnessing multiple search engines

We examine how six search engines filter and rank information in relation to queries on the U.S. 2020 presidential primary elections under default (that is, nonpersonalized) conditions. For that, we utilize an algorithmic auditing methodology that uses virtual agents to conduct large-scale analysis of algorithmic information curation in a controlled environment. Specifically, we look at the text search results for the “us elections,” “donald trump,” “joe biden,” and “bernie sanders” queries on Google, Baidu, Bing, DuckDuckGo, Yahoo, and Yandex during the 2020 primaries. Our findings indicate substantial differences in the search results between search engines and multiple discrepancies within the results generated for different agents using the same search engine. This highlights that whether users see certain information is decided by chance due to the inherent randomization of search results. We also find that some search engines prioritize different categories of information sources with respect to specific candidates. These observations demonstrate that algorithmic curation of political information can create information inequalities between search engine users even under nonpersonalized conditions. Such inequalities are particularly troubling considering that search results are highly trusted by the public and, as previous research has demonstrated, can shift the opinions of undecided voters.

Quantifying the Impact of Chimera MS/MS Spectra on Peptide Identification in Large-Scale Proteomics Studies

Journal of Proteome Research ◽ 10.1021/pr1003856 ◽ 2010 ◽ Vol 9 (8) ◽ pp. 4152-4160 ◽ Cited By ~ 107

Author(s): Stephane Houel ◽ Robert Abernathy ◽ Kutralanathan Renganathan ◽ Karen Meyer-Arendt ◽ Natalie G. Ahn ◽ ...

Keyword(s): Large Scale ◽ Peptide Identification ◽ The Impact

A Query-Dependent Duplicate Detection Approach for Large Scale Search Engines

Advanced Web Technologies and Applications - Lecture Notes in Computer Science ◽ 10.1007/978-3-540-24655-8_6 ◽ 2004 ◽ pp. 48-58 ◽ Cited By ~ 2

Author(s): Shaozhi Ye ◽ Ruihua Song ◽ Ji-Rong Wen ◽ Wei-Ying Ma

Keyword(s): Search Engines ◽ Large Scale ◽ Duplicate Detection ◽ Detection Approach

I/O-Conscious Data Preparation for Large-Scale Web Search Engines

VLDB '02: Proceedings of the 28th International Conference on Very Large Databases ◽ 10.1016/b978-155860869-6/50041-x ◽ 2002 ◽ pp. 382-393 ◽ Cited By ~ 1

Author(s): Maxim Lifantsev ◽ Tzi-cker Chiueh

Keyword(s): Search Engines ◽ Large Scale ◽ Web Search ◽ Data Preparation ◽ Web Search Engines

Deep Web

Handbook of Research on Innovations in Database Technologies and Applications ◽ 10.4018/978-1-60566-242-8.ch062 ◽ 2009 ◽ pp. 581-588 ◽ Cited By ~ 5

Author(s): Denis Shestakov

Keyword(s): Search Engines ◽ Large Scale ◽ Web Search ◽ Web Database ◽ Web Search Engine ◽ Search Form ◽ Complete Set ◽ Web Crawlers ◽ Pass Through ◽ The Web

Finding information on the Web using a web search engine is one of the primary activities of today’s web users. For a majority of users, the results returned by conventional search engines are an essentially complete set of links to all pages on the Web relevant to their queries. However, current-day search engines do not crawl and index a significant portion of the Web and, hence, web users relying only on search engines are unable to discover and access a large amount of information from the nonindexable part of the Web. Specifically, dynamic pages generated based on parameters provided by a user via web search forms are not indexed by search engines and cannot be found in their results. Such search interfaces provide web users with online access to myriad databases on the Web. In order to obtain some information from a web database of interest, a user issues his/her query by specifying query terms in a search form and receives the query results, a set of dynamic pages that embed the required information from a database. At the same time, issuing a query via an arbitrary search interface is an extremely complex task for any kind of automatic agent, including web crawlers, which, at least up to the present day, do not even attempt to pass through web forms on a large scale.

Scalability and efficiency challenges in large-scale web search engines

Proceedings of the 23rd International Conference on World Wide Web - WWW '14 Companion ◽ 10.1145/2567948.2577271 ◽ 2014 ◽ Cited By ~ 4

Author(s): Ricardo Baeza-Yates ◽ B. Barla Cambazoglu

Keyword(s): Search Engines ◽ Large Scale ◽ Web Search ◽ Web Search Engines

Improving large-scale search engines with semantic annotations

Expert Systems with Applications ◽ 10.1016/j.eswa.2012.10.042 ◽ 2013 ◽ Vol 40 (6) ◽ pp. 2287-2296 ◽ Cited By ~ 5

Author(s): Damaris Fuentes-Lorenzo ◽ Norberto Fernández ◽ Jesús A. Fisteus ◽ Luis Sánchez

Keyword(s): Search Engines ◽ Large Scale ◽ Semantic Annotations

Open-pFind enables precise, comprehensive and rapid peptide identification in shotgun proteomics

10.1101/285395 ◽ 2018 ◽ Cited By ~ 6

Author(s): Hao Chi ◽ Chao Liu ◽ Hao Yang ◽ Wen-Feng Zeng ◽ Long Wu ◽ ...

Keyword(s): Large Scale ◽ Search Algorithm ◽ Large Fraction ◽ Olfactory Receptors ◽ Peptide Identification ◽ Shotgun Proteomics ◽ Isotopic Labeling ◽ Search Space ◽ Global Scale ◽ Search Algorithms

Abstract: Shotgun proteomics has grown rapidly in recent decades, but a large fraction of tandem mass spectrometry (MS/MS) data in shotgun proteomics are not successfully identified. We have developed a novel database search algorithm, Open-pFind, to efficiently identify peptides even in an ultra-large search space that takes into account unexpected modifications, amino acid mutations, semi- or non-specific digestion, and co-eluting peptides. Tested on two metabolically labeled MS/MS datasets, Open-pFind reported 50.5‒117.0% more peptide-spectrum matches (PSMs) than the seven other advanced algorithms. More importantly, the Open-pFind results were more credible, as judged by verification experiments using stable isotopic labeling. Tested on four additional large-scale datasets, 70‒85% of the spectra were confidently identified, and high-quality spectra were nearly completely interpreted by Open-pFind. Further, Open-pFind was over 40 times faster than the other three open search algorithms and 2‒3 times faster than three restricted search algorithms. Re-analysis of an entire human proteome dataset consisting of ∼25 million spectra using Open-pFind identified a total of 14,064 proteins encoded by 12,723 genes when requiring at least two uniquely identified peptides. In these search results, Open-pFind also excelled in an independent test for false positives based on the presence or absence of olfactory receptors. Thus, Open-pFind makes the open search strategy practical for the truly global-scale proteomics experiments of today and of the future.

Resource-Efficient Index Shard Replication in Large Scale Search Engines

IEEE Transactions on Parallel and Distributed Systems ◽ 10.1109/tpds.2019.2924423 ◽ 2019 ◽ Vol 30 (12) ◽ pp. 2820-2835

Author(s): Yusen Li ◽ Xueyan Tang ◽ Wentong Cai ◽ Jiancong Tong ◽ Xiaoguang Liu ◽ ...

Keyword(s): Search Engines ◽ Large Scale

SW-Tandem: a highly efficient tool for large-scale peptide identification with parallel spectrum dot product on Sunway TaihuLight

Bioinformatics ◽ 10.1093/bioinformatics/btz147 ◽ 2019 ◽ Vol 35 (19) ◽ pp. 3861-3863 ◽ Cited By ~ 2

Author(s): Chuang Li ◽ Kenli Li ◽ Tao Chen ◽ Yunping Zhu ◽ Qiang He

Keyword(s): Large Scale ◽ Peptide Identification ◽ Software Tool ◽ Critical Issue ◽ Peptide Sequencing ◽ Peptide Sequence ◽ Supplementary Information ◽ Database Searching ◽ Sunway Taihulight ◽ Dot Product

Abstract. Summary: Tandem mass spectrometry based database searching is a widely acknowledged and adopted method for identifying peptide sequences in shotgun proteomics. However, database searching is extremely computationally expensive and can take days or even weeks to process a large spectra dataset. To address this critical issue, this paper presents SW-Tandem, a new tool for large-scale peptide sequencing. SW-Tandem parallelizes the spectrum dot product scoring algorithm and leverages the advantages of Sunway TaihuLight, the No. 1 supercomputer in the world in 2017. Sunway TaihuLight is powered by the brand new many-core SW26010 processors and provides a peak computation performance greater than 100 PFlops. To fully utilize Sunway TaihuLight's capacity, SW-Tandem employs three mechanisms to accelerate large-scale peptide identification: memory-access optimizations, double buffering, and vectorization. The results of experiments conducted on multiple datasets demonstrate the performance of SW-Tandem against three state-of-the-art tools for peptide identification, including X!! Tandem, MR-Tandem and MSFragger. In addition, SW-Tandem shows high scalability in experiments on extremely large datasets of up to 12 GB. Availability and implementation: SW-Tandem is an open source software tool implemented in C++. The source code and the parameter settings are available at https://github.com/Logic09/SW-Tandem. Supplementary information: Supplementary data are available at Bioinformatics online.
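
As a rough illustration of the kind of spectrum dot product scoring this abstract refers to, the sketch below bins two peak lists by m/z and computes a normalized dot product between them. It is not SW-Tandem's implementation, which is written in C++ and tuned for the SW26010 processors; the function names, the 1.0 m/z bin width, and the cosine normalization are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins.

    `peaks` is a list of (mz, intensity) pairs. The bin width is an
    illustrative assumption, not a value taken from SW-Tandem.
    """
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return dict(binned)


def normalized_dot_product(observed, theoretical, bin_width=1.0):
    """Return the cosine-normalized dot product of two binned spectra.

    The result is 0.0 when the spectra share no bins (or one is empty)
    and approaches 1.0 when their binned intensity patterns match.
    """
    a = bin_spectrum(observed, bin_width)
    b = bin_spectrum(theoretical, bin_width)
    # Only bins present in both spectra contribute to the dot product.
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical peak lists, invented purely to exercise the function.
    observed = [(100.2, 1.0), (200.7, 2.0), (350.1, 0.5)]
    candidate = [(100.4, 1.0), (200.9, 2.0)]
    print(normalized_dot_product(observed, candidate))
```

Each observed spectrum is scored independently against each candidate peptide, which is what makes this style of scoring a natural target for the parallelization and vectorization the abstract describes.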
