Anbar: Collection and analysis of a large scale Urdu language Twitter corpus

2021 ◽  
pp. 1-12
Author(s):  
Bilal Tahir ◽  
Muhammad Amir Mehmood

The confluence of high-performance computing algorithms and large-scale, high-quality data has led to the availability of cutting-edge tools in computational linguistics. However, these state-of-the-art tools are available only for the major languages of the world. The preparation of large-scale, high-quality corpora for low-resource languages such as Urdu is a challenging task, as it requires substantial computational and human resources. In this paper, we build and analyze Anbar, a large-scale Urdu-language Twitter corpus. For this purpose, we collect 106.9 million Urdu tweets posted by 1.69 million users during one year (September 2018 to August 2019). Our corpus consists of tweets with a rich vocabulary of 3.8 million unique tokens, along with 58K hashtags and 62K URLs. Moreover, it contains 75.9 million (71.0%) retweets and 847K geotagged tweets. Furthermore, we examine Anbar using a variety of metrics such as temporal frequency of tweets, vocabulary size, geo-location, user characteristics, and entity distribution. To the best of our knowledge, this is the largest repository of Urdu-language tweets for the NLP research community, which can be used for Natural Language Understanding (NLU), social analytics, and fake news detection.
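As an illustrative sketch (not the authors' pipeline), the kinds of corpus statistics the abstract reports, such as unique tokens, hashtag and URL counts, and the retweet share, can be computed from raw tweet texts along these lines:

```python
import re
from collections import Counter

# Hypothetical helper, for illustration only: tally the corpus-level
# statistics mentioned in the abstract from a list of tweet texts.
HASHTAG = re.compile(r"#\w+")
URL = re.compile(r"https?://\S+")

def corpus_stats(tweets):
    vocab = Counter()
    hashtags, urls, retweets = set(), set(), 0
    for text in tweets:
        if text.startswith("RT @"):   # simple retweet heuristic
            retweets += 1
        hashtags.update(HASHTAG.findall(text))
        urls.update(URL.findall(text))
        vocab.update(text.split())    # whitespace tokenization
    return {
        "unique_tokens": len(vocab),
        "hashtags": len(hashtags),
        "urls": len(urls),
        "retweet_share": retweets / len(tweets),
    }

sample = [
    "RT @user: #Urdu corpus https://example.com/a",
    "new tweet about #NLP",
]
print(corpus_stats(sample))
```

A real pipeline would of course use a language-aware tokenizer for Urdu rather than whitespace splitting, and the retweet flag from the Twitter API rather than a text prefix.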

Processes ◽  
2020 ◽  
Vol 8 (6) ◽  
pp. 649
Author(s):  
Yifeng Liu ◽  
Wei Zhang ◽  
Wenhao Du

Deep learning based on large amounts of high-quality data plays an important role in many industries. However, deep learning is hard to embed directly in real-time systems: data accumulation in such systems depends on real-time acquisition, yet their analysis tasks must be carried out in real time, which makes it impossible to complete those tasks by accumulating data over a long period. To address the problems of high-quality data accumulation, the high timeliness required of data analysis, and the difficulty of embedding deep-learning algorithms directly in real-time systems, this paper proposes a new progressive deep-learning framework and conducts experiments on image recognition. The experimental results show that the proposed framework is effective, performs well, and reaches conclusions similar to those of a deep-learning framework based on large-scale data.
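The core idea of analyzing while data accumulates, rather than waiting for a large batch, can be sketched with a deliberately tiny example; this is a generic incremental estimator for illustration, not the paper's framework:

```python
# Minimal sketch of progressive (incremental) estimation: the model's
# state is refined with each arriving sample, so an answer is available
# at every step instead of only after full data accumulation.
class ProgressiveMean:
    def __init__(self):
        self.n = 0
        self.value = 0.0

    def update(self, x):
        self.n += 1
        # Incremental mean update: no stored history required.
        self.value += (x - self.value) / self.n
        return self.value

model = ProgressiveMean()
for x in [2.0, 4.0, 6.0]:
    model.update(x)
print(model.value)  # 4.0
```

A progressive deep-learning framework applies the same principle at the level of model training: parameters are updated as acquisitions arrive, so the real-time system never has to pause for batch accumulation.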


2019 ◽  
Vol 3 (1) ◽  
pp. e201900546
Author(s):  
Matthias Blum ◽  
Pierre-Etienne Cholley ◽  
Valeriya Malysheva ◽  
Samuel Nicaise ◽  
Julien Moehlin ◽  
...  

The enormous amount of freely accessible functional genomics data is an invaluable resource for interrogating the biological function of multiple DNA-interacting players and chromatin modifications by large-scale comparative analyses. However, in practice, interrogating large collections of public data requires major efforts for (i) reprocessing available raw reads, (ii) incorporating quality assessments to exclude artefactual and low-quality data, and (iii) processing data by using high-performance computation. Here, we present qcGenomics, a user-friendly online resource for ultrafast retrieval, visualization, and comparative analysis of tens of thousands of genomics datasets to gain new functional insight from global or focused multidimensional data integration.


2020 ◽  
Vol 68 (3) ◽  
pp. 878-895
Author(s):  
Ragheb Rahmaniani ◽  
Shabbir Ahmed ◽  
Teodor Gabriel Crainic ◽  
Michel Gendreau ◽  
Walter Rei

Many methods that have been proposed to solve large-scale mixed-integer linear programming (MILP) problems rely on decomposition strategies. These methods exploit either the primal or the dual structure of the problem by applying the Benders decomposition or the Lagrangian dual decomposition strategy, respectively. In “The Benders Dual Decomposition Method,” Rahmaniani, Ahmed, Crainic, Gendreau, and Rei propose a new, high-performance approach that combines the complementary advantages of both strategies. The authors show that this method (i) generates stronger feasibility and optimality cuts than the classical Benders method, (ii) can converge to the optimal integer solution at the root node of the Benders master problem, and (iii) is capable of generating high-quality incumbent solutions in the early iterations of the algorithm. The developed algorithm obtains encouraging computational results when used to solve various benchmark MILP problems.
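For readers less familiar with the setting, a schematic of classical Benders decomposition, which the paper's cuts strengthen, may help; the notation below is illustrative and not taken from the paper:

```latex
% A MILP with complicating integer variables y and continuous recourse x
\min_{y \in Y,\; x \ge 0} \; c^\top y + q^\top x
\quad \text{s.t.} \quad W x \ge h - T y

% Benders master problem: optimality cuts built from dual solutions \pi_k
\min_{y \in Y,\; \theta} \; c^\top y + \theta
\quad \text{s.t.} \quad \theta \ge \pi_k^\top (h - T y), \quad k = 1, \dots, K

% Subproblem for a fixed \bar{y}: the LP dual that generates the cuts
\max_{\pi \ge 0} \; \pi^\top (h - T \bar{y})
\quad \text{s.t.} \quad W^\top \pi \le q
```

Per the abstract, the Benders dual decomposition idea retains information that the LP relaxation of the subproblem discards (in the spirit of Lagrangian dual decomposition), which is why its cuts dominate the classical ones above.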


2006 ◽  
Vol 302-303 ◽  
pp. 398-404
Author(s):  
Ming Tang ◽  
Xiao Li ◽  
Tao Wang

Artificial abalone reefs designed around the abalone's growth characteristics are presented in this paper. Trace elements are added to the concrete, with proportions fixed by testing, so that ocean algae adhere well to the reefs. The fabrication process, mix optimization, curing conditions in the island environment, and long-term stability of the concrete in ocean currents are studied to address the durability of reefs in the marine environment. The results show that fishing reefs built with high-performance, high-function, ecological concrete technology are reliably durable: strength continued to increase over one year and no damage was found. The approach combines a complex admixture, high-quality fly ash, ultrafine silica powder, a hydrophobic surface-impregnation material of our own making, self-built vibration-compaction molding equipment, and solar-energy curing technology. Tens of thousands of large-scale artificial abalone reefs have been produced, and a large amount of marine organisms covered the reefs only 40 days after deployment.


Author(s):  
Inam U. Haq ◽  
Chittineni V. Kumar ◽  
Rayed M. Al-Zaid

This paper reports a synchronous vibration instability problem (a rare phenomenon) experienced in a high-pressure steam turbine rotor (19 MW) driving a synthesis gas compressor train in a large-scale petrochemical complex. The turbine had an approximately one-year history of infrequent high vibration. Rotor vibrations appeared in an intermittent and irregular fashion, and the perturbation frequency corresponded to the rotor operating speed of 10,135 rpm. The sealing steam system was found responsible for the onset of the vibration. At a definite combination of seal steam pressure (0.90 to 1.10 bar-gauge), operating speed, and load, the rotor radial vibration response reached 4.5 mils, compared with the typically smooth running level of less than 1.0 mil. Subsequent major overhauling of the turbine revealed severely worn and virtually non-functional high-pressure-end labyrinth seals. The paper also details the rotordynamic behavior of the steam turbine recorded during episodes of excessive vibration.


2015 ◽  
Vol 6 (1) ◽  
Author(s):  
Jaqueline Kaleian Eserian ◽  
Márcia Lombardo

The validation of analytical methods is required to obtain high-quality data. For the pharmaceutical industry, method validation is crucial to ensuring product quality with regard to both therapeutic efficacy and patient safety. The most critical step in validating a method is establishing a protocol containing well-defined procedures and criteria. A well-planned and organized protocol, such as the one proposed in this paper, results in a rapid and concise method-validation procedure for quantitative high-performance liquid chromatography (HPLC) analysis.

Type: Commentary


2011 ◽  
Vol 1336 ◽  
Author(s):  
M. Takenaka ◽  
S. Takagi

The heterogeneous integration of III-V semiconductors with the Si platform is expected to provide high-performance CMOS logic for future technology nodes because of the high electron mobility and low electron effective mass of III-V semiconductors. However, many technology issues must be addressed to integrate III-V MOSFETs on the Si platform, as follows: high-quality MOS interface formation, low-resistivity source/drain formation, and high-quality III-V film formation on Si substrates. In this paper, we present several possible solutions to these critical issues for III-V MOSFETs on the Si platform. In addition, we present the III-V CMOS photonics platform, on which III-V MOSFETs and III-V photonics can be monolithically integrated for ultra-large-scale electric-optic integrated circuits.


2016 ◽  
Vol 14 ◽  
Author(s):  
Ana Marasović ◽  
Mengfei Zhou ◽  
Alexis Palmer ◽  
Anette Frank

Modal verbs have different interpretations depending on their context. Their sense categories – epistemic, deontic and dynamic – provide important dimensions of meaning for the interpretation of discourse. Previous work on modal sense classification achieved relatively high performance using shallow lexical and syntactic features drawn from small annotated corpora. Because of this restricted empirical basis, it is difficult to assess the particular difficulties of modal sense classification and the generalization capacity of the proposed models. In this work we create large-scale, high-quality annotated corpora for modal sense classification using an automatic paraphrase-driven projection approach. Using the acquired corpora, we investigate the modal sense classification task from different perspectives.
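The paraphrase-driven projection idea can be illustrated with a toy sketch: if a modal verb in context can be paraphrased by a marker whose sense is unambiguous, the marker's sense label is projected onto the modal. The mapping and function below are invented for illustration and are not the authors' implementation:

```python
# Toy illustration of paraphrase-driven sense projection.
# Each paraphrase is an unambiguous marker of one modal sense category.
SENSE_OF_PARAPHRASE = {
    "it is possible that": "epistemic",  # possibility / speaker belief
    "is allowed to": "deontic",          # permission / obligation
    "is able to": "dynamic",             # ability / disposition
}

def project_sense(sentence, modal, paraphrase):
    """Label `modal` in `sentence` with the sense of an attested paraphrase."""
    if modal not in sentence.split():
        return None  # modal not present; nothing to label
    return SENSE_OF_PARAPHRASE.get(paraphrase)

print(project_sense("She can swim", "can", "is able to"))  # dynamic
```

At scale, the attested paraphrases come from automatically aligned paraphrase data rather than a hand-written table, which is what makes large annotated corpora cheap to acquire.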


2020 ◽  
Author(s):  
Brian R. Lee ◽  
Agata Budzillo ◽  
Kristen Hadley ◽  
Jeremy A. Miller ◽  
Tim Jarsky ◽  
...  

The Patch-seq approach is a powerful variation of the standard patch clamp technique that allows for the combined electrophysiological, morphological, and transcriptomic characterization of individual neurons. To generate Patch-seq datasets at a scale and quality that can be integrated with high-throughput dissociated-cell transcriptomic data, we have optimized the technique by identifying and refining key factors that contribute to the efficient collection of high-quality data. To rapidly generate high-quality electrophysiology data, we developed patch clamp electrophysiology software with analysis functions specifically designed to automate acquisition with online quality control. We observed a substantial improvement in transcriptomic data quality when the nucleus was extracted following the recording. For morphological recovery, maximizing the neuron’s membrane integrity during extraction of the nucleus was far more critical to success than varying the duration of the electrophysiology recording. We compiled the lab protocol with the analysis and acquisition software at https://github.com/AllenInstitute/patchseqtools. This resource can be used by individual labs to generate Patch-seq data across diverse mammalian species that is compatible with recent large-scale publicly available Allen Institute Patch-seq datasets.


2014 ◽  
Vol 13s7 ◽  
pp. CIN.S16346 ◽  
Author(s):  
Scott White ◽  
Karoline Laske ◽  
Marij J.P. Welters ◽  
Nicole Bidmon ◽  
Sjoerd H. Van Der Burg ◽  
...  

With the recent results of promising cancer vaccines and immunotherapy [1-5], immune monitoring has become increasingly relevant for measuring treatment-induced effects on T cells, and an essential tool for shedding light on the mechanisms responsible for a successful treatment. Flow cytometry is the canonical multi-parameter assay for the fine characterization of single cells in solution, and is ubiquitously used in pre-clinical tumor immunology and in cancer immunotherapy trials. Current state-of-the-art polychromatic flow cytometry involves multi-step, multi-reagent assays followed by sample acquisition on sophisticated instruments capable of capturing up to 20 parameters per cell at a rate of tens of thousands of cells per second. Given the complexity of flow cytometry assays, reproducibility is a major concern, especially for multi-center studies. A promising approach to improving reproducibility is the use of automated analysis borrowing from statistics, machine learning and information visualization [21-23], as these methods directly address the subjectivity, operator dependence, labor intensity, and low fidelity of manual analysis. However, it is quite time-consuming to investigate and test new automated analysis techniques on large data sets without a centralized information management system. For large-scale automated analysis to be practical, the presence of consistent, high-quality data linked to the raw FCS files is indispensable. In particular, the use of machine-readable standard vocabularies to characterize channel metadata is essential when constructing analytic pipelines to avoid errors in the processing, analysis and interpretation of results. For automation, this high-quality metadata needs to be programmatically accessible, implying the need for a consistent Application Programming Interface (API).
In this manuscript, we propose that upfront time spent normalizing flow cytometry data to conform to carefully designed data models enables automated analysis, potentially saving time in the long run. The ReFlow informatics framework was developed to address these data management challenges.
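The kind of metadata check this motivates can be sketched in a few lines; the controlled vocabulary and function below are hypothetical, illustrating the principle rather than the ReFlow API:

```python
# Hypothetical sketch: validate that every channel label attached to an
# FCS-style record comes from a controlled vocabulary before the file
# enters an automated analysis pipeline.
CONTROLLED_VOCAB = {"CD3", "CD4", "CD8", "FSC-A", "SSC-A", "Live/Dead"}

def validate_channels(channel_labels):
    """Return (ok, unknown_labels) for a list of channel labels."""
    unknown = [c for c in channel_labels if c not in CONTROLLED_VOCAB]
    return (not unknown, unknown)

ok, bad = validate_channels(["CD3", "CD4", "cd8"])
print(ok, bad)  # False ['cd8'] -- case mismatch flagged for curation
```

Running such a check at ingestion, rather than at analysis time, is what makes downstream pipelines safe to automate: a mislabeled channel is caught once, centrally, instead of silently corrupting every analysis that touches the file.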

