Cross-lingual citations in English papers: a large-scale analysis of prevalence, usage, and impact

Author(s):  
Tarek Saier ◽  
Michael Färber ◽  
Tornike Tsereteli

Abstract: Citation information in scholarly data is an important source of insight into the reception of publications and the scholarly discourse. Outcomes of citation analyses and the applicability of citation-based machine learning approaches heavily depend on the completeness of such data. One particular shortcoming of scholarly data nowadays is that non-English publications are often not included in data sets, or that language metadata is not available. Because of this, citations between publications of differing languages (cross-lingual citations) have only been studied to a very limited degree. In this paper, we present an analysis of cross-lingual citations based on over one million English papers, spanning three scientific disciplines and a time span of three decades. Our investigation covers differences between cited languages and disciplines, trends over time, and the usage characteristics as well as impact of cross-lingual citations. Among our findings are an increasing rate of citations to publications written in Chinese, citations being primarily to local non-English languages, and consistency in citation intent between cross- and monolingual citations. To facilitate further research, we make our collected data and source code publicly available.

2012 ◽  
Vol 3 ◽  
pp. 747-758 ◽  
Author(s):  
Blake W Erickson ◽  
Séverine Coquoz ◽  
Jonathan D Adams ◽  
Daniel J Burns ◽  
Georg E Fantner

Modern high-speed atomic force microscopes generate significant quantities of data in a short amount of time. Each image in the sequence has to be processed quickly and accurately in order to obtain a true representation of the sample and its changes over time. This paper presents an automated, adaptive algorithm for the required processing of AFM images. The algorithm adaptively corrects for common one-dimensional distortions as well as the most common two-dimensional distortions. The method uses an iterative thresholded processing algorithm for rapid and accurate separation of background and surface topography. This separation prevents artificial bias from topographic features and ensures the best possible coherence between the different images in a sequence. The method is equally applicable to all channels of AFM data and can process images in seconds.
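The idea of iterative thresholded background separation can be illustrated with a minimal Python sketch. This is not the authors' algorithm: it only removes a constant offset per scan line, and the function names, the `threshold` parameter, and the fixed iteration count are assumptions made for illustration.

```python
from statistics import mean, median


def level_line(line, threshold, iterations=3):
    """Remove a per-line offset estimated from background pixels only.

    Hypothetical helper: pixels farther than `threshold` from the current
    offset estimate are treated as topography and excluded from the fit.
    """
    offset = median(line)  # robust initial guess when features are sparse
    for _ in range(iterations):
        background = [z for z in line if abs(z - offset) <= threshold]
        if not background:  # threshold too tight: keep the current estimate
            break
        offset = mean(background)
    return [z - offset for z in line]


def level_image(image, threshold, iterations=3):
    """Apply the line-wise correction to every scan line of an image."""
    return [level_line(row, threshold, iterations) for row in image]


# Two scan lines with different offsets and a 10-unit-high feature.
image = [
    [5.0, 5.0, 5.0, 15.0, 5.0],
    [2.0, 2.0, 12.0, 12.0, 2.0],
]
leveled = level_image(image, threshold=1.0)
print(leveled)
```

Because the tall pixels are excluded from the offset estimate, the feature height is preserved while both lines are brought to a common zero background, which is the kind of bias the abstract says the separation is meant to prevent.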


2020 ◽  
Vol 11 (1) ◽  
Author(s):  
Jonathan P. Ling ◽  
Christopher Wilks ◽  
Rone Charles ◽  
Patrick J. Leavey ◽  
Devlina Ghosh ◽  
...  

Abstract: Public archives of next-generation sequencing data are growing exponentially, but the difficulty of marshaling this data has led to its underutilization by scientists. Here, we present ASCOT, a resource that uses annotation-free methods to rapidly analyze and visualize splice variants across tens of thousands of bulk and single-cell data sets in the public archive. To demonstrate the utility of ASCOT, we identify novel cell type-specific alternative exons across the nervous system and leverage ENCODE and GTEx data sets to study the unique splicing of photoreceptors. We find that PTBP1 knockdown and MSI1 and PCBP2 overexpression are sufficient to activate many photoreceptor-specific exons in HepG2 liver cancer cells. This work demonstrates how large-scale analysis of public RNA-Seq data sets can yield key insights into cell type-specific control of RNA splicing and underscores the importance of considering both annotated and unannotated splicing events.


2020 ◽  
pp. 1-51
Author(s):  
Ivan Vulić ◽  
Simon Baker ◽  
Edoardo Maria Ponti ◽  
Ulla Petti ◽  
Ira Leviant ◽  
...  

We introduce Multi-SimLex, a large-scale lexical resource and evaluation benchmark covering data sets for 12 typologically diverse languages, including major languages (e.g., Mandarin Chinese, Spanish, Russian) as well as less-resourced ones (e.g., Welsh, Kiswahili). Each language data set is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs, providing a representative coverage of word classes (nouns, verbs, adjectives, adverbs), frequency ranks, similarity intervals, lexical fields, and concreteness levels. Additionally, owing to the alignment of concepts across languages, we provide a suite of 66 crosslingual semantic similarity data sets. Because of its extensive size and language coverage, Multi-SimLex provides entirely novel opportunities for experimental evaluation and analysis. On its monolingual and crosslingual benchmarks, we evaluate and analyze a wide array of recent state-of-the-art monolingual and crosslingual representation models, including static and contextualized word embeddings (such as fastText, monolingual and multilingual BERT, XLM), externally informed lexical representations, as well as fully unsupervised and (weakly) supervised crosslingual word embeddings. We also present a step-by-step data set creation protocol for creating consistent, Multi-SimLex-style resources for additional languages. We make these contributions (the public release of Multi-SimLex data sets, their creation protocol, strong baseline results, and in-depth analyses which can be helpful in guiding future developments in multilingual lexical semantics and representation learning) available via a Web site that will encourage community effort in further expansion of Multi-SimLex to many more languages. Such a large-scale semantic resource could inspire significant further advances in NLP across languages.
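The figure of 66 crosslingual data sets is consistent with one data set per unordered pair of the 12 languages, since 12 × 11 / 2 = 66. The short Python check below assumes that pairing scheme, which the abstract does not spell out, and uses placeholder labels rather than the actual language list.

```python
from itertools import combinations
from math import comb

# Placeholder labels: the abstract names only some of the 12 languages,
# so these identifiers are hypothetical stand-ins.
languages = [f"lang_{i:02d}" for i in range(1, 13)]

# Assumption: one crosslingual data set per unordered language pair.
language_pairs = list(combinations(languages, 2))
expected = comb(len(languages), 2)

print(len(language_pairs), expected)
```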


2019 ◽  
Author(s):  
Zachary B. Abrams ◽  
Caitlin E. Coombes ◽  
Suli Li ◽  
Kevin R. Coombes

Summary: Unsupervised data analysis in many scientific disciplines is based on calculating distances between observations and finding ways to visualize those distances. These kinds of unsupervised analyses help researchers uncover patterns in large-scale data sets. However, researchers can select from a vast number of different distance metrics, each designed to highlight different aspects of different data types. There are also numerous visualization methods with their own strengths and weaknesses. To help researchers perform unsupervised analyses, we developed the Mercator R package. Mercator enables users to see important patterns in their data by generating multiple visualizations using different standard algorithms, making it particularly easy to compare and contrast the results arising from different metrics. By allowing users to select the distance metric that best fits their needs, Mercator helps researchers perform unsupervised analyses that use pattern identification through computation and visual inspection. Availability and Implementation: Mercator is freely available at the Comprehensive R Archive Network (https://cran.r-project.org/web/packages/Mercator/index.html). Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
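To make the point about competing distance metrics concrete, here is a minimal Python sketch that computes pairwise distance matrices for the same observations under three common metrics. It does not use or reproduce the Mercator package's API; the function names and the toy binary data are assumptions made purely for illustration.

```python
from math import sqrt


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def jaccard(a, b):
    """Jaccard distance for binary vectors: 1 - |intersection| / |union|."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if either == 0 else 1.0 - both / either


def distance_matrix(rows, metric):
    """Pairwise distances between all observations under one metric."""
    return [[metric(a, b) for b in rows] for a in rows]


# Toy binary observations (hypothetical data).
rows = [[1, 0, 1], [1, 1, 0], [0, 0, 1]]
for metric in (euclidean, manhattan, jaccard):
    print(metric.__name__, distance_matrix(rows, metric))
```

Even on this tiny example the metrics rank neighbours differently in magnitude, which is why comparing visualizations across several metrics, as the abstract describes, can be informative.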


2007 ◽  
Vol 28 (3) ◽  
pp. 273-283 ◽  
Author(s):  
M. D. Harrell ◽  
S. Harbi ◽  
J. F. Hoffman ◽  
J. Zavadil ◽  
W. A. Coetzee

The immature and mature heart differ from each other in terms of excitability, action potential properties, contractility, and relaxation. These differences include upregulation of repolarizing K+ currents, an enhanced inward rectifier K+ (Kir) current, and changes in Ca2+, Na+, and Cl− currents. At the molecular level, the developmental regulation of ion channels is scantily described. Using a large-scale real-time quantitative reverse transcriptase polymerase chain reaction (qRT-PCR) assay, we performed a comprehensive analysis of ion channel transcript expression during perinatal development in the embryonic (embryonic day 17.5), neonatal (postnatal days 1–2), and adult Swiss-Webster mouse hearts. These data are compared with publicly available microarray data sets (Cardiogenomics project). Developmental mRNA expression for several transcripts was consistent with the published literature. For example, transcripts such as Kir2.1, Kir3.1, Nav1.5, and Cav1.2 were upregulated after birth, whereas others [e.g., Ca2+-activated K+ (KCa)2.3 and minK] were downregulated. Cl− channel transcripts were expressed at higher levels in immature heart, particularly those that are activated by intracellular Ca2+. Defining alterations in the ion channel transcriptome during perinatal development will lead to a much improved understanding of the electrophysiological alterations occurring in the heart after birth. Our study may have important repercussions in understanding the mechanisms and consequences of electrophysiological alterations in infants and may pave the way for better understanding of clinically relevant events such as congenital abnormalities, cardiomyopathies, heart failure, arrhythmias, cardiac drug therapy, and sudden infant death syndrome.


2021 ◽  
Vol 18 (2) ◽  
pp. 172988142199654
Author(s):  
Joohyung Kim ◽  
Janghun Hyeon ◽  
Nakju Doh

As interest in image-based rendering increases, the need for multiview inpainting is emerging. Despite rapid progress in single-image inpainting based on deep learning, such approaches place no constraint on color consistency across multiple inpainted images. We target object removal in large-scale indoor spaces and propose a novel multiview inpainting pipeline that achieves color consistency and boundary consistency across multiple images. The first step of the pipeline is to create color prior information on masks by coloring point clouds from multiple images and projecting the colored point clouds onto the image planes. Next, a generative inpainting network accepts a masked image, a color prior image, an imperfect guideline, and two different masks as inputs and yields the refined guideline and inpainted image as outputs. The color prior and guideline inputs ensure color and boundary consistency across multiple images. We validate our pipeline on real indoor data sets both quantitatively, using consistency distance and similarity distance (metrics we defined for comparing multiview inpainting results), and qualitatively.


2014 ◽  
Vol 2014 ◽  
pp. 1-7 ◽  
Author(s):  
Jonatan Taminau ◽  
Cosmin Lazar ◽  
Stijn Meganck ◽  
Ann Nowé

An increasing number of microarray gene expression data sets are available through public repositories. Their huge potential for new findings is yet to be unlocked by making them available for large-scale analysis. To do so, it is essential that independent studies designed for similar biological problems can be integrated, so that new insights can be obtained. These insights would remain undiscovered when analyzing the individual data sets, because it is well known that the small number of biological samples used per experiment is a bottleneck in genomic analysis. Increasing the number of samples increases the statistical power, so that more general and reliable conclusions can be drawn. In this work, two different approaches for conducting large-scale analysis of microarray gene expression data, meta-analysis and data merging, are compared in the context of the identification of cancer-related biomarkers by analyzing six independent lung cancer studies. Within this study, we investigate the hypothesis that analyzing large cohorts of samples obtained by merging independent data sets designed to study the same biological problem results in lower false discovery rates than analyzing the same data sets within a more conservative meta-analysis approach.


2018 ◽  
Vol 145 ◽  
pp. 243-254
Author(s):  
Alassane Samba ◽  
Yann Busnel ◽  
Alberto Blanc ◽  
Philippe Dooze ◽  
Gwendal Simon

2017 ◽  
Author(s):  
Yingwei Hu ◽  
Punit Shah ◽  
David J. Clark ◽  
Minghui Ao ◽  
Hui Zhang

Abstract: Protein glycosylation plays fundamental roles in many cellular processes, and previous reports have shown dysregulation to be associated with several human diseases, including diabetes, cancer, and neurodegenerative disorders. Despite the vital role of glycosylation for proper protein function, the analysis of glycoproteins has lagged behind that of other protein modifications. In this study, we describe the re-analysis of global proteomic data from breast cancer xenograft tissues using the recently developed software package GPQuest 2.0, revealing a large number of previously unidentified N-linked glycopeptides. More importantly, we found that using immobilized metal affinity chromatography (IMAC) technology for the enrichment of phosphopeptides had co-enriched a substantial number of sialoglycopeptides, allowing for a large-scale analysis of sialoglycopeptides in conjunction with the analysis of phosphopeptides. Collectively, combined MS/MS analyses of global proteomic and phosphoproteomic data sets resulted in the identification of 6,724 N-linked glycopeptides from 617 glycoproteins derived from two breast cancer xenograft tissues. Next, we utilized GPQuest for the re-analysis of global and phosphoproteomic data generated from 108 human breast cancer tissues that were previously analyzed by the Clinical Proteomic Tumor Analysis Consortium (CPTAC). Re-analysis of the CPTAC data resulted in the identification of 2,683 glycopeptides from the global proteomic data set and 4,554 glycopeptides from the phosphoproteomic data set. Together, 11,292 N-linked glycopeptides corresponding to 1,731 N-linked glycosites from 883 human glycoproteins were identified from the two data sets. This analysis revealed an extensive number of glycopeptides hidden in the global proteomic data and enriched in the IMAC-based phosphopeptide-enriched proteomic data, information that would otherwise have remained unknown from the original study. The re-analysis described herein can be readily applied to identify glycopeptides from already existing data sets, providing insight into many important facets of protein glycosylation in different biological, physiological, and pathological processes.

