Mercator: a pipeline for multi-method, unsupervised visualization and distance generation

Author(s): Zachary B. Abrams, Caitlin E. Coombes, Suli Li, Kevin R. Coombes

Abstract
Summary: Unsupervised machine learning provides tools for researchers to uncover latent patterns in large-scale data, based on calculated distances between observations. Methods to visualize high-dimensional data based on these distances can elucidate subtypes and interactions within multi-dimensional and high-throughput data. However, researchers can select from a vast number of distance metrics and visualizations, each with their own strengths and weaknesses. The Mercator R package facilitates selection of a biologically meaningful distance from 10 metrics, together appropriate for binary, categorical and continuous data, and visualization with 5 standard and high-dimensional graphics tools. Mercator provides a user-friendly pipeline for informaticians or biologists to perform unsupervised analyses, from exploratory pattern recognition to production of publication-quality graphics.
Availability and implementation: Mercator is freely available at the Comprehensive R Archive Network (https://cran.r-project.org/web/packages/Mercator/index.html).

2019
Author(s): Zachary B. Abrams, Caitlin E. Coombes, Suli Li, Kevin R. Coombes

Abstract
Summary: Unsupervised data analysis in many scientific disciplines is based on calculating distances between observations and finding ways to visualize those distances. These kinds of unsupervised analyses help researchers uncover patterns in large-scale data sets. However, researchers can select from a vast number of different distance metrics, each designed to highlight different aspects of different data types. There are also numerous visualization methods with their own strengths and weaknesses. To help researchers perform unsupervised analyses, we developed the Mercator R package. Mercator enables users to see important patterns in their data by generating multiple visualizations using different standard algorithms, making it particularly easy to compare and contrast the results arising from different metrics. By allowing users to select the distance metric that best fits their needs, Mercator helps researchers perform unsupervised analyses that use pattern identification through computation and visual inspection.
Availability and Implementation: Mercator is freely available at the Comprehensive R Archive Network (https://cran.r-project.org/web/packages/Mercator/index.html).
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
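The package itself is written in R, but the workflow both abstracts describe, computing pairwise distances under several candidate metrics and then comparing standard visualizations of each, is easy to sketch in a few lines. The Python sketch below (using scipy, scikit-learn and matplotlib on a toy binary matrix) is a conceptual illustration only; it does not use or reproduce the Mercator package's own API.

```python
# Conceptual sketch (not the Mercator R API): visualize the same data set
# under several distance metrics to compare the patterns each one reveals.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.manifold import MDS
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(60, 40))          # toy binary matrix: 60 samples x 40 features

metrics = ["jaccard", "hamming", "euclidean"]  # three of the many metrics one might compare
fig, axes = plt.subplots(len(metrics), 2, figsize=(8, 10))

for row, metric in enumerate(metrics):
    D = pdist(X, metric=metric)                # condensed vector of pairwise distances
    # View 1: hierarchical clustering of the distances
    dendrogram(linkage(D, method="average"), ax=axes[row, 0], no_labels=True)
    axes[row, 0].set_title(f"{metric}: hierarchical clustering")
    # View 2: 2-D multidimensional scaling of the same distances
    coords = MDS(n_components=2, dissimilarity="precomputed",
                 random_state=0).fit_transform(squareform(D))
    axes[row, 1].scatter(coords[:, 0], coords[:, 1], s=10)
    axes[row, 1].set_title(f"{metric}: MDS embedding")

plt.tight_layout()
plt.show()
```

Placing a dendrogram and an MDS embedding for each metric side by side is the kind of metric-by-metric comparison the pipeline is meant to make routine.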


2019, Vol 15 (3), pp. 64-78
Author(s): Chandrakala D, Sumathi S, Saran Kumar A, Sathish J

Detection and realization of new trends from a corpus are achieved through Emergent Trend Detection (ETD) methods, a principal application of text mining. This article discusses the influence of Particle Swarm Optimization (PSO) on Dynamic Adaptive Self-Organizing Maps (DASOM) in the design of an efficient ETD scheme that optimizes the neural parameters of the network. This hybrid machine learning scheme is designed to achieve maximum accuracy with minimum computational time. The efficiency and scalability of the proposed scheme are analyzed and compared with standard algorithms such as SOM, DASOM and linear regression analysis. The system is trained and tested on the DBLP database (University of Trier, Germany). The article establishes the superiority of the hybrid DASOM algorithm over these well-known algorithms in handling high-dimensional, large-scale data to detect emergent trends from the corpus.
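The abstract gives no implementation details, but the core of a particle swarm loop that tunes a pair of network parameters looks roughly like the sketch below. The quality function, the parameter names (learning_rate, radius) and their bounds are hypothetical stand-ins for the SOM training error and neural parameters that the hybrid scheme actually optimizes.

```python
# Minimal PSO sketch: the objective is a toy stand-in for "train a SOM with
# these parameters and return its quantization error" (lower is better).
import numpy as np

def quality(params):
    learning_rate, radius = params             # hypothetical neural parameters
    return (learning_rate - 0.3) ** 2 + (radius - 2.0) ** 2

rng = np.random.default_rng(1)
n_particles, n_iters = 20, 50
lo, hi = np.array([0.01, 0.5]), np.array([1.0, 5.0])   # bounds for (learning_rate, radius)

pos = rng.uniform(lo, hi, size=(n_particles, 2))        # particle positions
vel = np.zeros_like(pos)                                # particle velocities
pbest = pos.copy()                                      # personal bests
pbest_val = np.array([quality(p) for p in pos])
gbest = pbest[pbest_val.argmin()].copy()                # global best

w, c1, c2 = 0.7, 1.5, 1.5                               # inertia and acceleration weights
for _ in range(n_iters):
    r1, r2 = rng.random((n_particles, 2)), rng.random((n_particles, 2))
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, lo, hi)
    vals = np.array([quality(p) for p in pos])
    improved = vals < pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[pbest_val.argmin()].copy()

print("best (learning_rate, radius):", gbest)
```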


1999, Vol 3 (1), pp. 53-60
Author(s): Kristi Yuthas, Dennis F. Togo

In this era of massive data accumulation, dynamic development of large-scale databases and interfaces intended to be user-friendly, demand on analysts continues to increase because direct user access to databases is still not common practice. A data dictionary approach, which includes providing users with a list of relevant data items within the database, can expedite the analysis of information requirements and the development of user-requested information systems. Furthermore, this approach enhances user involvement and reduces the demands on analysts for systems development projects.


1994, Vol 83 (03), pp. 135-141
Author(s): P. Fisher, R. Van Haselen

Abstract
Large-scale data collection combined with modern information technology is a powerful tool to evaluate the efficacy and safety of homoeopathy. It also has great potential to improve homoeopathic practice. Data collection has not been widely used in homoeopathy. This appears to be due to the clumsiness of the methodology and the perception that it is of little value to daily practice. Three protocols addressing different aspects of this issue are presented:
- A proposal to establish a common basic data collection methodology for homoeopaths throughout Europe.
- A systematic survey of the results of homoeopathic treatment of patients with rheumatoid arthritis, using quality-of-life and objective assessments.
- Verification of a set of homoeopathic prescribing features for Rhus toxicodendron.
These proposals are designed to be ‘user-friendly’ and to provide practical information relevant to daily homoeopathic practice.


2019, Vol 48 (4), pp. 673-681
Author(s): Shufen Zhang, Zhiyu Liu, Xuebin Chen, Changyin Luo

To solve the problems of the traditional K-Means clustering algorithm in dealing with large-scale data sets, a Hadoop K-Means (HKM) clustering algorithm is proposed. First, the algorithm eliminates the effects of noise points in the data set according to sample density. Second, it optimizes the selection of the initial center points using the max-min distance principle. Finally, it uses the MapReduce programming model to realize parallelization. Experimental results show that the proposed algorithm not only has high accuracy and stability in clustering results, but can also solve the scalability problems encountered by traditional clustering algorithms when dealing with large-scale data.
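The abstract names max-min distance seeding without spelling it out. Under the usual definition, each new center is the point whose distance to its nearest already-chosen center is largest; a minimal single-machine sketch of that seeding step (not the authors' MapReduce implementation) is:

```python
# Sketch of max-min distance seeding for K-Means: each new center is the
# point farthest from all centers chosen so far.
import numpy as np

def maxmin_init(X, k, rng=None):
    rng = np.random.default_rng(rng)
    centers = [X[rng.integers(len(X))]]          # first center: a random point
    for _ in range(k - 1):
        # distance from every point to its nearest already-chosen center
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])          # farthest point becomes the next center
    return np.array(centers)

X = np.random.default_rng(0).normal(size=(500, 2))
print(maxmin_init(X, k=4))
```

In the HKM setting this seeding would run after the density-based noise filtering and before the MapReduce-parallelized assignment and update steps.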


2016, Vol 29 (6), pp. 1061-1075
Author(s): Eun-Kyung Lee, Nayoung Hwang, Yoondong Lee

2016, Vol 32 (21), pp. 3351-3353
Author(s): Gang Wu, Ron C. Anafi, Michael E. Hughes, Karl Kornacker, John B. Hogenesch

Lingua Sinica, 2020, Vol 6 (1), pp. 1-24
Author(s): Yipu Wei, Dirk Speelman, Jacqueline Evers-Vermeul

Abstract
Collocation analysis can be used to extract meaningful linguistic information from large-scale corpus data. This paper reviews the methodological issues one may encounter when performing collocation analysis for discourse studies on Chinese. We propose four crucial aspects to consider in such analyses: (i) the definition of collocates according to various parameters; (ii) the choice of analysis and association measures; (iii) the definition of the search span; and (iv) the selection of corpora for analysis. To illustrate how these aspects can be addressed in a Chinese collocation analysis, we conducted a case study of two Chinese causal connectives: yushi ‘that is why’ and yin’er ‘as a result’. The distinctive collocation analysis shows how these two connectives differ in volitionality, an important dimension of discourse relations. The study also demonstrates that collocation analysis, as an explorative approach based on large-scale data, can provide valuable converging evidence for corpus-based studies that have previously relied on laborious manual analysis of limited datasets.
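As a rough illustration of points (i) through (iii), the sketch below extracts collocates of a node word within a symmetric search span and scores them with pointwise mutual information. Both the toy token list and the choice of PMI are assumptions made for illustration; the study itself applies distinctive collocation analysis to real Chinese corpora.

```python
# Span-based collocate extraction scored with pointwise mutual information (PMI).
import math
from collections import Counter

def collocates(tokens, node, span=4, min_count=2):
    n = len(tokens)
    word_freq = Counter(tokens)
    pair_freq = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        # symmetric search span of `span` tokens on each side of the node word
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
        pair_freq.update(set(window))            # count each collocate once per window
    scores = {}
    for w, f_pair in pair_freq.items():
        if f_pair < min_count:
            continue
        # PMI: observed co-occurrence relative to what independence would predict
        scores[w] = math.log2((f_pair * n) / (word_freq[node] * word_freq[w]))
    return sorted(scores.items(), key=lambda kv: -kv[1])

# toy token list for illustration only
corpus = "yushi ta jueding likai yushi ta jueding huilai yin'er shiqing jieshu".split()
print(collocates(corpus, "yushi", span=3, min_count=1))
```

Swapping PMI for another association measure, or changing the span, changes which collocates rank highest, which is exactly why the paper treats these as decisions to be made explicitly.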

