Application of Efficient Data Cleaning Using Text Clustering for Semistructured Medical Reports to Large-Scale Stool Examination Reports: Methodology Study (Preprint)

Application of Efficient Data Cleaning Using Text Clustering for Semistructured Medical Reports to Large-Scale Stool Examination Reports: Methodology Study

Journal of Medical Internet Research ◽

10.2196/10013 ◽

2019 ◽

Vol 21 (1) ◽

pp. e10013 ◽

Cited By ~ 5

Author(s):

Hyunki Woo ◽

Kyunga Kim ◽

KyeongMin Cha ◽

Jin-Young Lee ◽

Hansong Mun ◽

...

Keyword(s):

Large Scale ◽

Data Cleaning ◽

Text Clustering ◽

Stool Examination ◽

Medical Reports ◽

Efficient Data

Improving Clustering Methods By Exploiting Richness Of Text Data

10.26686/wgtn.17019287.v1 ◽

2021 ◽

Author(s):

Abdul Wahid

Keyword(s):

Evolutionary Algorithm ◽

State Of The Art ◽

Ensemble Methods ◽

Text Clustering ◽

Clustering Methods ◽

Clustering Method ◽

Clustering Ensemble ◽

Text Data ◽

Multi Objective ◽

User Queries

Clustering is an unsupervised machine learning technique that discovers clusters (groups) of similar objects in unlabeled data and is generally considered to be an NP-hard problem. Clustering methods are widely used in a variety of disciplines for analyzing different types of data, and a small improvement in a clustering method can cause a ripple effect in advancing research in multiple fields. Clustering any type of data is challenging, and there are many open research questions. The clustering problem is exacerbated in the case of text data because of additional challenges such as capturing the semantics of a document, handling the rich features of text data, and dealing with the well-known curse of dimensionality. In this thesis, we investigate the limitations of existing text clustering methods and address them by providing five new text clustering methods: Query Sense Clustering (QSC), Dirichlet Weighted K-means (DWKM), Multi-View Multi-Objective Evolutionary Algorithm (MMOEA), Multi-objective Document Clustering (MDC), and Multi-Objective Multi-View Ensemble Clustering (MOMVEC). These five new clustering methods showed that the use of rich features in text clustering methods could outperform the existing state-of-the-art text clustering methods. The first new method, QSC, exploits user queries (one of the rich features in text data) to generate better-quality clusters and cluster labels. The second method, DWKM, uses a probability-based weighting scheme to formulate a semantically weighted distance measure that improves the clustering results. The third method, MMOEA, is based on a multi-objective evolutionary algorithm. MMOEA exploits rich features to generate a diverse set of candidate clustering solutions and forms a better clustering solution using a cluster-oriented approach.
The fourth and fifth methods, MDC and MOMVEC, address the limitations of MMOEA. MDC and MOMVEC differ in the implementation of their multi-objective evolutionary approaches. All five methods are compared with existing state-of-the-art methods. The results of the comparisons show that the newly developed text clustering methods outperform existing methods, achieving up to 16% improvement for some comparisons. In general, almost all newly developed clustering algorithms showed statistically significant improvements over other existing methods. The key ideas of the thesis highlight that exploiting user queries improves Search Result Clustering (SRC); utilizing rich features in weighting schemes and distance measures improves soft subspace clustering; utilizing multiple views and a multi-objective cluster-oriented method improves clustering ensemble methods; and better evolutionary operators and objective functions improve multi-objective evolutionary clustering ensemble methods. The new text clustering methods introduced in this thesis can be widely applied in various domains that involve analysis of text data. The contributions of this thesis, which include five new text clustering methods, will help not only researchers in the data mining field but also a wide range of researchers in other fields.

Efficient data transfer scheme using word-pair-encoding-based compression for large-scale text-data processing

2014 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS) ◽

10.1109/apccas.2014.7032862 ◽

2014 ◽

Cited By ~ 3

Author(s):

Hasitha Muthumala Waidyasooriya ◽

Daisuke Ono ◽

Masanori Hariyama ◽

Michitaka Kameyama

Keyword(s):

Data Processing ◽

Word Pair ◽

Large Scale ◽

Data Transfer ◽

Transfer Scheme ◽

Text Data ◽

Efficient Data

dropClust: Efficient clustering of ultra-large scRNA-seq data

10.1101/170308 ◽

2017 ◽

Cited By ~ 2

Author(s):

Debajyoti Sinha ◽

Akhilesh Kumar ◽

Himanshu Kumar ◽

Sanghamitra Bandyopadhyay ◽

Debarka Sengupta

Keyword(s):

Single Cell ◽

Large Scale ◽

Best Practice ◽

Clustering Algorithm ◽

Nearest Neighbor ◽

De Novo ◽

Single Cells ◽

Nearest Neighbor Search ◽

Locality Sensitive Hashing ◽

Clustering Methods

Droplet-based single-cell transcriptomics has recently enabled parallel screening of tens of thousands of single cells. Clustering methods that scale to such high-dimensional data without compromising accuracy are scarce. We exploit Locality Sensitive Hashing, an approximate nearest neighbor search technique, to develop a de novo clustering algorithm for large-scale single-cell data. On a number of real datasets, dropClust outperformed the existing best-practice methods in terms of execution time, clustering accuracy, and detectability of minor cell subtypes.
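
To make the idea of Locality Sensitive Hashing as an approximate nearest neighbor search more concrete, here is a minimal sketch of one common LSH family, random-hyperplane hashing for cosine similarity. It is an illustration only and is not taken from dropClust: the abstract does not say which LSH variant is used, and the function names (`make_hyperplanes`, `signature`, `build_index`, `candidates`) and parameters are hypothetical.

```python
import random
from collections import defaultdict


def make_hyperplanes(n_planes, dim, seed=0):
    """Draw random hyperplane normals (assumed scheme, not dropClust's)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]


def signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )


def build_index(vectors, planes):
    """Group vector indices into buckets keyed by their bit signature."""
    buckets = defaultdict(list)
    for idx, vec in enumerate(vectors):
        buckets[signature(vec, planes)].append(idx)
    return buckets


def candidates(query, planes, buckets):
    """Return indices sharing the query's bucket (approximate neighbors)."""
    return buckets.get(signature(query, planes), [])


if __name__ == "__main__":
    data = [[1.0, 0.0, 0.5], [2.0, 0.0, 1.0], [-1.0, 0.0, -0.5]]
    planes = make_hyperplanes(n_planes=8, dim=3)
    index = build_index(data, planes)
    print(candidates([4.0, 0.0, 2.0], planes, index))
```

Vectors pointing in similar directions tend to share a signature, so a query is compared only against its own bucket instead of the whole dataset. Practical systems usually combine several hash tables to trade recall against speed.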

Steps of Text Data Cleaning: for Network Text Analysis Using Large-scale Data

The Korean Association of Governance Studies ◽

10.26847/mspa.2017.27.4.35 ◽

2017 ◽

Vol 27 (4) ◽

pp. 35-68

Author(s):

Chisung Park ◽

Jun-suk Lee ◽

Keyword(s):

Text Analysis ◽

Large Scale ◽

Data Cleaning ◽

Text Data ◽

Large Scale Data ◽

Network Text Analysis ◽

Scale Data

Improving Clustering Methods By Exploiting Richness Of Text Data

10.26686/wgtn.17019287 ◽

2021 ◽

Author(s):

Abdul Wahid

Keyword(s):

Evolutionary Algorithm ◽

State Of The Art ◽

Ensemble Methods ◽

Text Clustering ◽

Clustering Methods ◽

Clustering Method ◽

Clustering Ensemble ◽

Text Data ◽

Multi Objective ◽

User Queries

Clustering is an unsupervised machine learning technique that discovers clusters (groups) of similar objects in unlabeled data and is generally considered to be an NP-hard problem. Clustering methods are widely used in a variety of disciplines for analyzing different types of data, and a small improvement in a clustering method can cause a ripple effect in advancing research in multiple fields. Clustering any type of data is challenging, and there are many open research questions. The clustering problem is exacerbated in the case of text data because of additional challenges such as capturing the semantics of a document, handling the rich features of text data, and dealing with the well-known curse of dimensionality. In this thesis, we investigate the limitations of existing text clustering methods and address them by providing five new text clustering methods: Query Sense Clustering (QSC), Dirichlet Weighted K-means (DWKM), Multi-View Multi-Objective Evolutionary Algorithm (MMOEA), Multi-objective Document Clustering (MDC), and Multi-Objective Multi-View Ensemble Clustering (MOMVEC). These five new clustering methods showed that the use of rich features in text clustering methods could outperform the existing state-of-the-art text clustering methods. The first new method, QSC, exploits user queries (one of the rich features in text data) to generate better-quality clusters and cluster labels. The second method, DWKM, uses a probability-based weighting scheme to formulate a semantically weighted distance measure that improves the clustering results. The third method, MMOEA, is based on a multi-objective evolutionary algorithm. MMOEA exploits rich features to generate a diverse set of candidate clustering solutions and forms a better clustering solution using a cluster-oriented approach.
The fourth and fifth methods, MDC and MOMVEC, address the limitations of MMOEA. MDC and MOMVEC differ in the implementation of their multi-objective evolutionary approaches. All five methods are compared with existing state-of-the-art methods. The results of the comparisons show that the newly developed text clustering methods outperform existing methods, achieving up to 16% improvement for some comparisons. In general, almost all newly developed clustering algorithms showed statistically significant improvements over other existing methods. The key ideas of the thesis highlight that exploiting user queries improves Search Result Clustering (SRC); utilizing rich features in weighting schemes and distance measures improves soft subspace clustering; utilizing multiple views and a multi-objective cluster-oriented method improves clustering ensemble methods; and better evolutionary operators and objective functions improve multi-objective evolutionary clustering ensemble methods. The new text clustering methods introduced in this thesis can be widely applied in various domains that involve analysis of text data. The contributions of this thesis, which include five new text clustering methods, will help not only researchers in the data mining field but also a wide range of researchers in other fields.

HGC: fast hierarchical clustering for large-scale single-cell data

10.1101/2021.02.07.430106 ◽

2021 ◽

Author(s):

Ziheng Zou ◽

Kui Hua ◽

Xuegong Zhang

Keyword(s):

Single Cell ◽

Hierarchical Clustering ◽

Large Scale ◽

Nearest Neighbor ◽

Linear Time ◽

Fixed Number ◽

Large Datasets ◽

Clustering Methods ◽

Shared Nearest Neighbor ◽

Cell Data

Clustering is a key step in revealing heterogeneities in single-cell data. Cell heterogeneity can be explored at different resolutions, and the resulting cell states are inherently nested. However, most existing single-cell clustering methods output a fixed number of clusters without hierarchical information. Classical hierarchical clustering provides a dendrogram of cells but cannot scale to large datasets because of its high computational complexity. We present HGC, a fast Hierarchical Graph-based Clustering method, to address both problems. It combines the advantages of graph-based clustering and hierarchical clustering. On the shared nearest neighbor graph of cells, HGC constructs the hierarchical tree with linear time complexity. Experiments showed that HGC enables multiresolution exploration of the biological hierarchy underlying the data, achieves state-of-the-art accuracy on benchmark data, and can scale to large datasets. HGC is freely available for academic use at https://www.github.com/XuegongLab/[email protected], [email protected]
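
As a rough illustration of what a shared nearest neighbor (SNN) graph is, the sketch below links two points whenever their k-nearest-neighbor sets overlap and weights the edge by the size of the overlap. This is one common definition, assumed here for illustration; it is not HGC's code, it uses a brute-force neighbor search, and it does not show HGC's hierarchical tree construction. The names `knn` and `snn_graph` are hypothetical.

```python
import math


def knn(points, k):
    """Brute-force k nearest neighbors (by Euclidean distance) for each point."""
    neighbors = []
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbors.append({j for _, j in dists[:k]})
    return neighbors


def snn_graph(points, k):
    """Return {(i, j): shared_count} for pairs sharing at least one neighbor.

    Assumed definition: edge weight = number of common k-nearest neighbors.
    """
    nbrs = knn(points, k)
    edges = {}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            shared = len(nbrs[i] & nbrs[j])
            if shared > 0:
                edges[(i, j)] = shared
    return edges


if __name__ == "__main__":
    # Two well-separated groups of three points each (toy data).
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    for edge, weight in sorted(snn_graph(pts, k=2).items()):
        print(edge, weight)
```

On this toy data, edges appear only within each group, which is the property graph-based clustering relies on. The pairwise loop here is quadratic and serves only to show the definition.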