Towards High Performance Text Mining

Shanshan Yu; Jindian Su; Pengfei Li; Hao Wang

doi:10.4018/ijghpc.2016040104

International Journal of Grid and High Performance Computing ◽

10.4018/ijghpc.2016040104 ◽

2016 ◽

Vol 8 (2) ◽

pp. 58-75 ◽

Cited By ~ 4

Author(s):

Shanshan Yu ◽

Jindian Su ◽

Pengfei Li ◽

Hao Wang

Keyword(s):

Text Mining ◽

Time Complexity ◽

High Performance ◽

Large Scale ◽

Recall Rate ◽

Text Structure ◽

Keyword Extraction ◽

Learning Method ◽

Automatic Summarization ◽

Linguistic Features

As a typical unsupervised learning method, the TextRank algorithm performs well for large-scale text mining, especially for automatic summarization and keyword extraction. However, in automatic summarization TextRank considers only the similarities between sentences and neglects information about text structure and context. To overcome these shortcomings, the authors propose an improved, highly scalable method called iTextRank. When building the TextRank graph, the new method computes sentence similarities and adjusts node weights using statistical and linguistic features, such as title similarity, paragraph structure, special sentences, and sentence position and length. The authors' analysis shows that the time complexity of iTextRank is comparable to that of TextRank. More importantly, two experiments show that iTextRank has a higher accuracy and lower recall rate than TextRank, and that it is as effective as several popular online automatic summarization systems.
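The graph-based ranking described in this abstract can be illustrated with a minimal sketch. It assumes the word-overlap sentence similarity from the original TextRank formulation and a weighted PageRank iteration. The `adjust` step is purely hypothetical: the abstract does not give iTextRank's actual weighting, so the title and lead-sentence bonuses (`title_bonus`, `lead_bonus`) are illustrative assumptions, not the authors' method.

```python
import math
import re


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def similarity(a, b):
    """Word-overlap similarity, normalized by the log of the sentence lengths."""
    if len(a) < 2 or len(b) < 2:
        return 0.0  # log(1) = 0 would make the denominator degenerate
    overlap = len(set(a) & set(b))
    return overlap / (math.log(len(a)) + math.log(len(b)))


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences by weighted PageRank over the similarity graph."""
    tokens = [tokenize(s) for s in sentences]
    n = len(tokens)
    # w[i][j] is the edge weight between sentences i and j (no self-loops).
    w = [[similarity(tokens[i], tokens[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in w]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                w[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    return scores


def adjust(scores, sentences, title, title_bonus=0.5, lead_bonus=0.2):
    """Hypothetical adjustment: boost title overlap and the first sentence."""
    title_words = set(tokenize(title))
    adjusted = []
    for idx, (score, sent) in enumerate(zip(scores, sentences)):
        words = set(tokenize(sent))
        overlap = len(words & title_words) / len(title_words) if title_words else 0.0
        factor = 1.0 + title_bonus * overlap + (lead_bonus if idx == 0 else 0.0)
        adjusted.append(score * factor)
    return adjusted
```

Calling `adjust(textrank(sentences), sentences, title)` and taking the top-scoring sentences would give an extractive summary under these assumptions. Since the adjustment is a single pass over the sentences, it adds no more than linear work on top of the graph ranking.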

Data augmentation by CycleGAN-based extra-supervised for non-destructive testing

Measurement Science and Technology ◽

10.1088/1361-6501/ac3ec3 ◽

2021 ◽

Author(s):

Jiangshan Ai ◽

Lulu Tian ◽

Libing Bai ◽

Jie Zhang

Keyword(s):

High Performance ◽

Field Data ◽

Large Scale ◽

Data Augmentation ◽

Learning Method ◽

Non Destructive Testing ◽

Destructive Testing ◽

Deep Convolutional Neural Networks ◽

X Ray ◽

Non Destructive

Deep learning methods are widely used in computer vision tasks for which large-scale annotated datasets exist. However, obtaining such datasets is a major challenge in most areas of vision-based non-destructive testing (NDT). Data augmentation has proved to be an efficient way of dealing with the lack of large-scale annotated datasets. In this paper, we propose CycleGAN-based extra-supervised (CycleGAN-ES) to generate synthetic NDT images, where the ES is used to ensure that the bidirectional mappings are learned for corresponding labels and defects. Furthermore, we show the effectiveness of using the synthesized images to train deep convolutional neural networks (DCNN) for defect recognition. In the experiments, we extract a number of X-ray welding images, both with and without defects, from the published GDXray dataset. CycleGAN-ES is used to generate synthetic defect images from a small number of extracted defect images and manually drawn labels, which serve as a content guide. To verify the quality of the synthesized defect images, we use a high-performance classifier pre-trained on a large dataset to recognize the synthetic defects, and we show that classifiers trained on synthetic defects and on real defects perform comparably. To demonstrate the effectiveness of the synthesized defects as an augmentation method, we train DCNNs for defect recognition with and without the synthesized defects and evaluate their performance.

The Structure and Properties of MoSi2 Thin Film in MOS Process

Proceedings, annual meeting, Electron Microscopy Society of America ◽

10.1017/s1431927600001379 ◽

1980 ◽

Vol 38 ◽

pp. 326-327

Author(s):

C.K. Wu ◽

P. Chang ◽

N. Godinho

Keyword(s):

Thin Film ◽

Integrated Circuits ◽

High Performance ◽

Large Scale ◽

Process Development ◽

Structure And Properties ◽

Metal Silicides ◽

High Oxidation ◽

Important Approach ◽

High Oxidation Resistance

Recently, the use of refractory metal silicides as low resistivity, high temperature and high oxidation resistance gate materials in large scale integrated circuits (LSI) has become an important approach in advanced MOS process development (1). This research is a systematic study on the structure and properties of molybdenum silicide thin film and its applicability to high performance LSI fabrication.

Large-Scale Data Learning Method for Anomaly Detection using Machine Learning for Monitoring Vibration in Vehicle Equipment

IEEJ Transactions on Industry Applications ◽

10.1541/ieejias.140.480 ◽

2020 ◽

Vol 140 (6) ◽

pp. 480-487

Author(s):

Minoru Kondo

Keyword(s):

Machine Learning ◽

Anomaly Detection ◽

Large Scale ◽

Learning Method ◽

Large Scale Data ◽

Scale Data

RECOMMENDATIONS FOR THE CHOICE OF MILKING INSTALLATIONS IN LOOSE HOUSING SYSTEMS OF COWS

Molochnoe i miasnoe skotovodstvo ◽

10.33943/mms.2020.12.24.001 ◽

2020 ◽

Author(s):

В.В. ГОРДЕЕВ ◽

В.Е. ХАЗАНОВ

Keyword(s):

Dairy Cows ◽

High Performance ◽

Large Scale ◽

Dairy Farms ◽

Economic Indicators ◽

Technical Level ◽

Housing Systems ◽

Working Shift ◽

Technical And Economic Indicators

The choice of the type and size of a milking installation must take into account the maximum planned number of dairy cows, the size of a technological group, the number of milkings per day, the duration of one milking, and the length of the operators' working shift. An analysis of the technical and economic indicators of the most common current types of milking installations of the same technical level shows that the Carousel installation (1) has the best specific indicators, while the Herringbone installation (2) requires higher labour and cost inputs. The Parallel installation (3) lies in between. Based on throughput and the required number of operators, Herringbone is recommended for farms with up to 600 dairy cows, Parallel for no more than 1200 dairy cows, and Carousel for more than 1200 dairy cows. Carousel is the most practical, high-performance, easily automated and, therefore, promising method of parlour milking, especially for large dairy farms.
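The herd-size recommendation in this abstract amounts to a simple threshold rule, sketched below. The function name `recommend_installation` is hypothetical, and the sketch covers only the herd-size thresholds stated in the abstract. The other factors it lists (group size, milkings per day, shift length) are not modelled.

```python
def recommend_installation(dairy_cows):
    """Map a planned dairy herd size to the installation type recommended in the abstract."""
    if dairy_cows <= 0:
        raise ValueError("herd size must be positive")
    if dairy_cows <= 600:
        return "Herringbone"  # up to 600 dairy cows
    if dairy_cows <= 1200:
        return "Parallel"     # no more than 1200 dairy cows
    return "Carousel"         # more than 1200 dairy cows
```

For example, `recommend_installation(900)` returns `"Parallel"`.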

Investigating Diseases and Chemicals in COVID-19 Literature with Text Mining (Preprint)

10.2196/preprints.21503 ◽

2020 ◽

Author(s):

Amir Karami ◽

Brandon Bookstaver ◽

Melissa Nolan

Keyword(s):

Text Mining ◽

Literature Review ◽

Topic Modeling ◽

Large Scale ◽

Clinical Manifestations ◽

International Health ◽

Research Papers ◽

Strategic Plans ◽

Funding Agencies ◽

The Relationship

BACKGROUND: The COVID-19 pandemic has impacted nearly all aspects of life and has posed significant threats to international health and the economy. Given the rapidly unfolding nature of the current pandemic, there is an urgent need to streamline literature synthesis of the growing scientific research to elucidate targeted solutions. While traditional systematic literature review studies provide valuable insights, these studies have restrictions, including analyzing a limited number of papers, having various biases, being time-consuming and labor-intensive, focusing on a few topics, being incapable of trend analysis, and lacking data-driven tools.

OBJECTIVE: This study addresses these restrictions in the literature and practice by using text mining methods to analyze two biomedical concepts, clinical manifestations of disease and therapeutic chemical compounds, in a corpus of COVID-19 research papers, and by finding associations between the two concepts.

METHODS: This research collected papers representing COVID-19 pre-prints and peer-reviewed research published in 2020. We used frequency analysis to find highly frequent manifestations and therapeutic chemicals, representing the importance of the two biomedical concepts. This study also applied topic modeling to find the relationship between the two biomedical concepts.

RESULTS: We analyzed 9,298 research papers published through May 5, 2020 and found 3,645 disease-related and 2,434 chemical-related articles. The most frequent clinical manifestations of disease terminology included COVID-19, SARS, cancer, pneumonia, fever, and cough. The most frequent chemical-related terminology included Lopinavir, Ritonavir, Oxygen, Chloroquine, Remdesivir, and water. Topic modeling provided 25 categories showing relationships between our two overarching categories. These categories represent statistically significant associations between multiple aspects of each category, some connections of which were novel and not previously identified by the scientific community.

CONCLUSIONS: Appreciation of this context is vital due to the lack of a systematic large-scale literature review survey and the importance of fast literature review during the current COVID-19 pandemic for developing treatments. This study is beneficial to researchers for obtaining a macro-level picture of the literature, to educators for knowing the scope of the literature, to journals for exploring the most discussed disease symptoms and pharmaceutical targets, and to policymakers and funding agencies for creating scientific strategic plans regarding COVID-19.

Statistical and machine learning models for optimizing energy in parallel applications

The International Journal of High Performance Computing Applications ◽

10.1177/1094342019842915 ◽

2019 ◽

Vol 33 (6) ◽

pp. 1079-1097 ◽

Cited By ~ 2

Author(s):

Mark Endrei ◽

Chao Jin ◽

Minh Ngoc Dinh ◽

David Abramson ◽

Heidi Poxon ◽

...

Keyword(s):

Machine Learning ◽

Energy Efficiency ◽

High Performance ◽

Large Scale ◽

Energy Use ◽

Parallel Applications ◽

Learning Models ◽

Trade Off ◽

Time Required ◽

Machine Learning Models

Rising power costs and constraints are driving a growing focus on the energy efficiency of high performance computing systems. The unique characteristics of a particular system and workload and their effect on performance and energy efficiency are typically difficult for application users to assess and to control. Settings for optimum performance and energy efficiency can also diverge, so we need to identify trade-off options that guide a suitable balance between energy use and performance. We present statistical and machine learning models that only require a small number of runs to make accurate Pareto-optimal trade-off predictions using parameters that users can control. We study model training and validation using several parallel kernels and more complex workloads, including Algebraic Multigrid (AMG), Large-scale Atomic Molecular Massively Parallel Simulator, and Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics. We demonstrate that we can train the models using as few as 12 runs, with prediction error of less than 10%. Our AMG results identify trade-off options that provide up to 45% improvement in energy efficiency for around 10% performance loss. We reduce the sample measurement time required for AMG by 90%, from 13 h to 74 min.
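The notion of Pareto-optimal trade-offs between performance and energy use can be made concrete with a short sketch. It assumes each configuration is summarized as a `(runtime, energy, label)` tuple where lower is better on both axes. The tuple layout and the sample values in the usage note are hypothetical, and the sketch shows only how a Pareto front is extracted from measured or predicted points, not the statistical or machine learning models the authors train.

```python
def pareto_front(points):
    """Return the (runtime, energy, label) points not dominated by any other point."""
    front = []
    best_energy = float("inf")
    # After sorting by runtime, a point is non-dominated exactly when it
    # uses strictly less energy than every faster-or-equal point seen so far.
    for runtime, energy, label in sorted(points):
        if energy < best_energy:
            front.append((runtime, energy, label))
            best_energy = energy
    return front
```

For instance, with the hypothetical points `(100.0, 50.0, "fast")`, `(110.0, 30.0, "balanced")`, `(120.0, 35.0, "dominated")` and `(150.0, 25.0, "frugal")`, the third point is dropped because "balanced" is both faster and cheaper, and the remaining three form the trade-off options a user would choose from.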

Automated Cytogenetic Biodosimetry at Population-Scale

Radiation ◽

10.3390/radiation1020008 ◽

2021 ◽

Vol 1 (2) ◽

pp. 79-94

Author(s):

Peter K. Rogan ◽

Eliseos J. Mucaki ◽

Ben C. Shirley ◽

Yanxin Li ◽

Ruth C. Wilkins ◽

...

Keyword(s):

High Performance ◽

Large Scale ◽

Dicentric Chromosome ◽

Automated Assignment ◽

Simulated Population ◽

Dose Estimation ◽

Processing Times ◽

Multiple Processors ◽

Exposure Levels ◽

Population Scale

The dicentric chromosome (DC) assay accurately quantifies exposure to radiation; however, manual and semi-automated assignment of DCs has limited its use for a potential large-scale radiation incident. The Automated Dicentric Chromosome Identifier and Dose Estimator (ADCI) software automates unattended DC detection and determines radiation exposures, fulfilling IAEA criteria for triage biodosimetry. This study evaluates the throughput of high-performance ADCI (ADCI-HT) to stratify exposures of populations in 15 simulated population-scale radiation exposures. ADCI-HT streamlines dose estimation using a supercomputer by optimal hierarchical scheduling of DC detection for varying numbers of samples and metaphase cell images in parallel on multiple processors. We evaluated processing times and accuracy of estimated exposures across census-defined populations. Image processing of 1744 samples on 16,384 CPUs required 1 h 11 min 23 s, and radiation dose estimation based on DC frequencies required 32 s. Processing of 40,000 samples at 10 exposures from five laboratories required 25 h and met IAEA criteria (dose estimates were within 0.5 Gy; median = 0.07). Geostatistically interpolated radiation exposure contours of simulated nuclear incidents were defined by samples exposed to clinically relevant exposure levels (1 and 2 Gy). Analysis of all exposed individuals with ADCI-HT required 0.6–7.4 days, depending on the population density of the simulation.

Native Chilean Berries Preservation and In Vitro Studies of a Polyphenol Highly Antioxidant Extract from Maqui as a Potential Agent against Inflammatory Diseases

Antioxidants ◽

10.3390/antiox10060843 ◽

2021 ◽

Vol 10 (6) ◽

pp. 843

Author(s):

Tamara Ortiz ◽

Federico Argüelles-Arias ◽

Belén Begines ◽

Josefa-María García-Montes ◽

Alejandra Pereira ◽

...

Keyword(s):

High Performance ◽

Large Scale ◽

Inflammatory Diseases ◽

In Vitro Studies ◽

Aminosalicylic Acid ◽

Colon Cells ◽

Antioxidant Power ◽

Maqui Berry ◽

Physiological Benefits

The best conservation method for native Chilean berries was investigated, together with a large-scale extraction of maqui berry that yields an extract rich in total polyphenols and anthocyanins, for testing in intestinal epithelial and immune cells. The methanolic extract was obtained from lyophilized maqui berries and analyzed using Folin–Ciocalteu to quantify the total polyphenol content, as well as 2,2-diphenyl-1-picrylhydrazyl (DPPH), ferric reducing antioxidant power (FRAP), and oxygen radical absorbance capacity (ORAC) to measure the antioxidant capacity. The anthocyanin profile of maqui was determined by ultra-high-performance liquid chromatography (UHPLC-MS/MS). Viability, cytotoxicity, and percent oxidation in epithelial colon cells (HT-29) and macrophage cells (RAW 264.7) were evaluated. Preservation studies confirmed that the properties and composition of maqui are preserved in fresh or frozen conditions, and a more efficient and convenient extraction methodology was achieved. In vitro studies of epithelial cells showed that this extract has powerful antioxidant strength with dose-dependent behavior. When macrophages were activated with lipopolysaccharide (LPS), noncytotoxic effects were observed, and a relationship between oxidative stress and the inflammatory response was demonstrated. The maqui extract and 5-aminosalicylic acid (5-ASA) have a synergistic effect. All of the compiled data point to the use of this extract as a potential nutraceutical agent with physiological benefits for the treatment of inflammatory bowel disease (IBD).

IGD: high-performance search for large-scale genomic interval datasets

Bioinformatics ◽

10.1093/bioinformatics/btaa1062 ◽

2020 ◽

Author(s):

Jianglin Feng ◽

Nathan C Sheffield

Keyword(s):

High Performance ◽

Large Scale ◽

Interval Data ◽

Scale Analysis ◽

Genome Database ◽

Genomic Interval ◽

Critical Resource ◽

Genomic Regions ◽

Genome Projects

Summary: Databases of large-scale genome projects now contain thousands of genomic interval datasets. These data are a critical resource for understanding the function of DNA. However, our ability to examine and integrate interval data of this scale is limited. Here, we introduce the integrated genome database (IGD), a method and tool for searching genome interval datasets more than three orders of magnitude faster than existing approaches, while using only one hundredth of the memory. IGD uses a novel linear binning method that allows us to scale analysis to billions of genomic regions.

Availability: https://github.com/databio/IGD
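The general idea of binning intervals along a linear coordinate can be sketched as follows. This is a simplified in-memory illustration under stated assumptions, not IGD's actual data structure or file format: intervals are half-open `[start, end)` pairs on a single chromosome, the bin width `BIN_SIZE` is an illustrative value, and the names `build_index` and `query` are hypothetical.

```python
from collections import defaultdict

BIN_SIZE = 16_384  # illustrative bin width in base pairs (an assumption)


def build_index(intervals, bin_size=BIN_SIZE):
    """Place each half-open interval [start, end) into every bin it overlaps."""
    index = defaultdict(list)
    for start, end in intervals:
        for b in range(start // bin_size, (end - 1) // bin_size + 1):
            index[b].append((start, end))
    return index


def query(index, qstart, qend, bin_size=BIN_SIZE):
    """Return the indexed intervals overlapping [qstart, qend), scanning only the relevant bins."""
    hits = set()  # a set removes duplicates from intervals stored in several bins
    for b in range(qstart // bin_size, (qend - 1) // bin_size + 1):
        for start, end in index.get(b, ()):
            if start < qend and qstart < end:
                hits.add((start, end))
    return sorted(hits)
```

Because a query touches only the bins it spans, its cost depends on the number of intervals in those bins rather than on the size of the whole dataset, which is the property that makes bin-based search attractive at this scale.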

Large-scale production of highly stable silicon monoxide nanowires by radio-frequency thermal plasma as anodes for high-performance Li-ion batteries

Journal of Power Sources ◽

10.1016/j.jpowsour.2021.229906 ◽

2021 ◽

Vol 497 ◽

pp. 229906

Author(s):

Zongxian Yang ◽

Yu Du ◽

Yijun Yang ◽

Huacheng Jin ◽

Hebang Shi ◽

...

Keyword(s):

Radio Frequency ◽

Thermal Plasma ◽

High Performance ◽

Large Scale ◽

Silicon Monoxide ◽

Scale Production ◽

Li Ion Batteries ◽

Large Scale Production ◽

Li Ion
