data annotation
Recently Published Documents


TOTAL DOCUMENTS

237
(FIVE YEARS 137)

H-INDEX

12
(FIVE YEARS 4)

2022 ◽  
pp. 000276422110660
Author(s):  
Paola Tubaro ◽  
Antonio A. Casilli

In this paper, we analyze the recessionary effects of the COVID-19 pandemic on digital platform workers. The crisis has been described as a great work-from-home experiment, with platform ecosystems posing as its most advanced form. Our analysis differentiates the direct (health) and indirect (economic) risks incurred by workers, to critically assess the portrayal of platforms as buffers against crisis-induced layoffs. We submit that platform-mediated labor may eventually increase precarity, without necessarily reducing health risks for workers. Our argument is based on a comparison of the three main categories of platform work: “on-demand labor” (gigs such as delivery and transportation), “online labor” (tasks performed remotely, such as data annotation), and “social networking labor” (content generation and moderation). We discuss the strategies that platforms deploy to transfer risk from clients onto workers, thus deepening existing power imbalances between them. These results question the problematic equivalence between work-from-home and platform labor. Instead of attaining the advantages of the former in terms of direct and indirect risk mitigation, an increasing number of platformized jobs drift toward high economic and insuppressible health risks.


2022 ◽  
Author(s):  
Chenfei Wang ◽  
Pengfei Ren ◽  
Xiaoying Shi ◽  
Xin Dong ◽  
Zhiguang Yu ◽  
...  

The rapid accumulation of single-cell RNA-seq data has provided rich resources to characterize various human cell types. Cell type annotation is the critical step in analyzing single-cell RNA-seq data. However, accurate cell type annotation based on public references is challenging due to inconsistent annotations, batch effects, and poor characterization of rare cell types. Here, we introduce SELINA (single cELl identity NAvigator), an integrative annotation transferring framework for automatic cell type annotation. SELINA optimizes the annotation for minority cell types by synthetic minority over-sampling, removes batch effects among reference datasets using a multiple-adversarial domain adaptation network (MADA), and fits the query data with reference data using an autoencoder. Finally, SELINA affords a comprehensive and uniform reference atlas with 1.7 million cells covering 230 major human cell types. We demonstrated the robustness and superiority of SELINA in most human tissues compared to existing methods. SELINA provides a one-stop solution for human single-cell RNA-seq data annotation, with the potential to extend to other species.
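
The synthetic minority over-sampling step named in the abstract above can be illustrated with a minimal sketch. This is not SELINA's implementation: the function name `smote`, the parameter `k`, and the toy two-dimensional "cells" are assumptions made for illustration, and the code only shows the general idea of interpolating between a rare sample and one of its nearest rare neighbours.

```python
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours.
    Hypothetical helper; needs at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + t * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy expression profiles for a rare cell type (made-up values).
rare_cells = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4]]
new_cells = smote(rare_cells, n_new=5)
```

Each synthetic point lies on the segment between two existing minority samples, so it stays inside the region already occupied by the rare class.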


2022 ◽  
Vol 12 (1) ◽  
pp. 25
Author(s):  
Varvara Koshman ◽  
Anastasia Funkner ◽  
Sergey Kovalchuk

Electronic medical records (EMRs) include much valuable data about patients, which is, however, unstructured. Moreover, there is a lack of both labeled medical text data in Russian and tools for automatic annotation. As a result, it is currently hardly feasible for researchers to use EMR text data to train machine learning models in the biomedical domain. We present an unsupervised approach to medical data annotation. Syntactic trees are produced from initial sentences using morphological and syntactic analyses. In the retrieved trees, similar subtrees are grouped using Node2Vec and Word2Vec and labeled using domain vocabularies and Wikidata categories. The use of Wikidata categories increased the fraction of labeled sentences 5.5 times compared to labeling with domain vocabularies only. We show on a validation dataset that the proposed labeling method generates meaningful labels correctly for 92.7% of groups. Annotation with domain vocabularies and Wikidata categories covered more than 82% of the sentences in the corpus; when extended with timestamp and event labels, coverage reached 97% of sentences. The resulting method can be used to label EMRs in Russian automatically. Additionally, the proposed methodology can be applied to other languages that lack resources for automatic labeling and domain vocabularies.


2022 ◽  
Vol 133 ◽  
pp. 103994
Author(s):  
Zhiyong Zhang ◽  
Xiaolei Yin ◽  
Zhiyuan Yan

2021 ◽  
Vol 2 ◽  
Author(s):  
Jan Erik Doornweerd ◽  
Gert Kootstra ◽  
Roel F. Veerkamp ◽  
Esther D. Ellen ◽  
Jerine A. J. van der Eijk ◽  
...  

Animal pose-estimation networks enable automated estimation of key body points in images or videos. This enables animal breeders to collect pose information repeatedly on a large number of animals. However, the success of pose-estimation networks depends in part on the availability of data to learn the representation of key body points. Especially with animals, data collection is not always easy, and data annotation is laborious and time-consuming. The available data is therefore often limited, but data from other species might be useful, either by itself or in combination with the target species. In this study, the across-species performance of animal pose-estimation networks and the performance of an animal pose-estimation network trained on multi-species data (turkeys and broilers) were investigated. Broilers and turkeys were video recorded during a walkway test representative of the situation in practice. Two single-species models and one multi-species model were trained using DeepLabCut and tested on two single-species test sets. Overall, the within-species models outperformed both the multi-species model and the models applied across species, as shown by a lower raw pixel error, a lower normalized pixel error, and a higher percentage of keypoints remaining (PKR). The multi-species model had slightly higher errors and a lower PKR than the within-species models, but had less than half the number of annotated frames available from each species. Compared to the single-species broiler model, the multi-species model achieved lower errors for the head, left foot, and right knee keypoints, although with a lower PKR. Across species, keypoint predictions resulted in high errors and low to moderate PKRs and are unlikely to be of direct use for pose and gait assessments. A multi-species model may reduce annotation needs without a large impact on performance for pose assessment, although it is recommended only if the species are comparable. If a single-species model exists, it could serve as a pre-trained model for training a new model, possibly requiring only a limited amount of new data. Future studies should investigate the accuracy needed for pose and gait assessments and estimate genetic parameters for the new phenotypes before pose-estimation networks can be applied in practice.
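
The three evaluation measures mentioned in the abstract above (raw pixel error, normalized pixel error, and PKR) could be computed roughly as follows. This is a sketch under stated assumptions, not the study's evaluation code: the likelihood cutoff of 0.6, the normalization by a reference length `ref_length`, and the choice to average errors only over retained keypoints are all assumptions, as are the function names and toy coordinates.

```python
import math

def pixel_error(pred, truth):
    """Euclidean distance in pixels between a predicted and an annotated keypoint."""
    return math.hypot(pred[0] - truth[0], pred[1] - truth[1])

def evaluate(preds, truths, likelihoods, ref_length, cutoff=0.6):
    """Summarise keypoint predictions.
    cutoff and ref_length are assumed conventions, not values from the study."""
    # Keep only keypoints whose predicted likelihood passes the cutoff.
    kept = [(p, t) for p, t, l in zip(preds, truths, likelihoods) if l >= cutoff]
    errors = [pixel_error(p, t) for p, t in kept]
    raw = sum(errors) / len(errors) if errors else None
    return {
        "raw": raw,                                   # mean error in pixels
        "normalized": raw / ref_length if errors else None,  # error relative to a body length
        "pkr": 100.0 * len(kept) / len(preds),        # percentage of keypoints remaining
    }

# Toy data: three keypoints, the last predicted with low likelihood.
preds = [(10, 10), (20, 24), (50, 50)]
truths = [(13, 14), (20, 20), (0, 0)]
likelihoods = [0.9, 0.8, 0.1]
result = evaluate(preds, truths, likelihoods, ref_length=50)
```

Reporting PKR next to the errors matters because a strict cutoff can lower the mean error simply by discarding difficult keypoints.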


Sensors ◽  
2021 ◽  
Vol 21 (24) ◽  
pp. 8313
Author(s):  
Łukasz Lepak ◽  
Kacper Radzikowski ◽  
Robert Nowak ◽  
Karol J. Piczak

Models for keyword spotting in continuous recordings can significantly improve the experience of navigating vast libraries of audio recordings. In this paper, we describe the development of such a keyword spotting system for detecting regions of interest in Polish call centre conversations. Unfortunately, in spite of recent advancements in automatic speech recognition systems, the human-level transcription accuracy reported on English benchmarks does not reflect the performance achievable in low-resource languages, such as Polish. Therefore, in this work, we shift our focus from complete speech-to-text conversion to acoustic similarity matching in the hope of reducing the demand for data annotation. As our primary approach, we evaluate Siamese and prototypical neural networks trained on several datasets of English and Polish recordings. While we obtain usable results in English, our models’ performance remains unsatisfactory when applied to Polish speech, after both mono- and cross-lingual training. This performance gap shows that generalisation with limited training resources is a significant obstacle for actual deployments in low-resource languages. As a potential countermeasure, we implement a detector using audio embeddings generated with a generic pre-trained model provided by Google. It has a much more favourable profile when applied in a cross-lingual setup to detect Polish audio patterns. Nevertheless, despite these promising results, its performance on out-of-distribution data is still far from stellar. This would indicate that, in spite of the richness of internal representations created by more generic models, such speech embeddings are not entirely amenable to cross-language transfer.
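
The classification rule behind prototypical networks, as referenced in the abstract above, can be sketched in a few lines. This is an illustration only and not the authors' system: the embeddings here are made-up two-dimensional vectors standing in for the output of a trained audio encoder, and the names `prototypes` and `classify` are hypothetical.

```python
import math

def prototypes(support):
    """Compute one prototype per class as the mean of its support embeddings."""
    protos = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        protos[label] = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return protos

def classify(query, protos):
    """Assign the query embedding to the class with the nearest prototype."""
    return min(protos, key=lambda label: math.dist(query, protos[label]))

# Made-up embeddings; a real system would obtain these from a trained encoder.
support = {
    "keyword": [[1.0, 0.0], [0.8, 0.2]],
    "background": [[0.0, 1.0], [0.2, 0.8]],
}
protos = prototypes(support)
label = classify([0.7, 0.3], protos)
```

Because a new class only needs a handful of support examples to form a prototype, this approach is attractive when annotated recordings are scarce.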


2021 ◽  
Vol 16 (1) ◽  
pp. 65-84
Author(s):  
Martin Clayton ◽  
Simone Tarsitani ◽  
Richard Jankowsky ◽  
Luis Jure ◽  
Laura Leante ◽  
...  

The Interpersonal Entrainment in Music Performance Data Collection (IEMPDC) comprises six related corpora of music research materials: Cuban Son & Salsa (CSS), European String Quartet (ESQ), Malian Jembe (MJ), North Indian Raga (NIR), Tunisian Stambeli (TS), and Uruguayan Candombe (UC). The core data for each corpus comprises media files and computationally extracted event onset timing data. Annotation of metrical structure and the code used in the preparation of the collection are also shared. The collection is unprecedented in size and level of detail and represents a significant new resource for empirical and computational research in music. In this article we introduce the main features of the data collection and the methods used in its preparation. Details of technical validation procedures and notes on data visualization are available as Appendices. We also contextualize the collection in relation to developments in Open Science and Open Data, discussing important distinctions between the two related concepts.


2021 ◽  
Author(s):  
Patrizia Vizza ◽  
Giuseppe Tradigo ◽  
Ivan Brunelli ◽  
Pierangelo Veltri

GigaScience ◽  
2021 ◽  
Vol 10 (12) ◽  
Author(s):  
Nathan C Sheffield ◽  
Michał Stolarczyk ◽  
Vincent P Reuter ◽  
André F Rendeiro

Background: Organizing and annotating biological sample data is critical in data-intensive bioinformatics. Unfortunately, metadata formats from a data provider are often incompatible with requirements of a processing tool. There is no broadly accepted standard to organize metadata across biological projects and bioinformatics tools, restricting the portability and reusability of both annotated datasets and analysis software. Results: To address this, we present the Portable Encapsulated Project (PEP) specification, a formal specification for biological sample metadata structure. The PEP specification accommodates typical features of data-intensive bioinformatics projects with many biological samples. In addition to standardization, the PEP specification provides descriptors and modifiers for project-level and sample-level metadata, which improve portability across both computing environments and data processing tools. PEPs include a schema validator framework, allowing formal definition of required metadata attributes for data analysis broadly. We have implemented packages for reading PEPs in both Python and R to provide a language-agnostic interface for organizing project metadata. Conclusions: The PEP specification is an important step toward unifying data annotation and processing tools in data-intensive biological research projects. Links to tools and documentation are available at http://pep.databio.org/.
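
The idea of formally declaring which metadata attributes a tool requires, as described in the abstract above, can be illustrated generically. The sketch below does not use the PEP packages or the PEP file format; `validate_samples`, the attribute names, and the sample records are all hypothetical and only show what checking sample metadata against a list of required attributes might look like.

```python
def validate_samples(samples, required):
    """Return (sample_index, attribute) pairs for every required attribute
    that is absent or empty. Hypothetical helper, not part of any PEP package."""
    problems = []
    for i, sample in enumerate(samples):
        for attr in required:
            if attr not in sample or sample[attr] in ("", None):
                problems.append((i, attr))
    return problems

# Made-up sample metadata; the second record is incomplete.
samples = [
    {"sample_name": "s1", "protocol": "RNA-seq", "file": "s1.fastq"},
    {"sample_name": "s2", "protocol": ""},
]
problems = validate_samples(samples, required=["sample_name", "protocol", "file"])
```

Running such a check before analysis lets a pipeline reject incomplete annotations early instead of failing partway through processing.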

