Content-Based Multimedia Retrieval Using Feature Correlation Clustering and Fusion

Author(s):  
Hsin-Yu Ha ◽  
Fausto C. Fleites ◽  
Shu-Ching Chen

Nowadays, only processing visual features is not enough for multimedia semantic retrieval due to the complexity of multimedia data, which usually involve a variety of modalities, e.g. graphics, text, speech, video, etc. It becomes crucial to fully utilize the correlation between each feature and the target concept, the feature correlation within modalities, and the feature correlation across modalities. In this paper, the authors propose a Feature Correlation Clustering-based Multi-Modality Fusion Framework (FCC-MMF) for multimedia semantic retrieval. Features from different modalities are combined into one feature set with the same representation via a normalization and discretization process. Within and across modalities, multiple correspondence analysis is utilized to obtain the correlation between feature-value pairs, which are then projected onto the two principal components. K-medoids algorithm, which is a widely used partitioned clustering algorithm, is selected to minimize the Euclidean distance within the resulted clusters and produce high intra-correlated feature-value pair clusters. Majority vote is applied to subsequently decide which cluster each feature belongs to. Once the feature clusters are formed, one classifier is built and trained for each cluster. The correlation and confidence of each classifier are considered while fusing the classification scores, and mean average precision is used to evaluate the final ranked classification scores. Finally, the proposed framework is applied on NUS-wide Lite data set to demonstrate the effectiveness in multimedia semantic retrieval.

2021 ◽  
pp. 016555152110184
Author(s):  
Gunjan Chandwani ◽  
Anil Ahlawat ◽  
Gaurav Dubey

Document retrieval plays an important role in knowledge management as it facilitates us to discover the relevant information from the existing data. This article proposes a cluster-based inverted indexing algorithm for document retrieval. First, the pre-processing is done to remove the unnecessary and redundant words from the documents. Then, the indexing of documents is done by the cluster-based inverted indexing algorithm, which is developed by integrating the piecewise fuzzy C-means (piFCM) clustering algorithm and inverted indexing. After providing the index to the documents, the query matching is performed for the user queries using the Bhattacharyya distance. Finally, the query optimisation is done by the Pearson correlation coefficient, and the relevant documents are retrieved. The performance of the proposed algorithm is analysed by the WebKB data set and Twenty Newsgroups data set. The analysis exposes that the proposed algorithm offers high performance with a precision of 1, recall of 0.70 and F-measure of 0.8235. The proposed document retrieval system retrieves the most relevant documents and speeds up the storing and retrieval of information.


2021 ◽  
pp. postgradmedj-2020-139361
Author(s):  
María Matesanz-Fernández ◽  
Teresa Seoane-Pillado ◽  
Iria Iñiguez-Vázquez ◽  
Roi Suárez-Gil ◽  
Sonia Pértega-Díaz ◽  
...  

ObjectiveWe aim to identify patterns of disease clusters among inpatients of a general hospital and to describe the characteristics and evolution of each group.MethodsWe used two data sets from the CMBD (Conjunto mínimo básico de datos - Minimum Basic Hospital Data Set (MBDS)) of the Lucus Augusti Hospital (Spain), hospitalisations and patients, realising a retrospective cohort study among the 74 220 patients discharged from the Medic Area between 01 January 2000 and 31 December 2015. We created multimorbidity clusters using multiple correspondence analysis.ResultsWe identified five clusters for both gender and age. Cluster 1: alcoholic liver disease, alcoholic dependency syndrome, lung and digestive tract malignant neoplasms (age under 50 years). Cluster 2: large intestine, prostate, breast and other malignant neoplasms, lymphoma and myeloma (age over 70, mostly males). Cluster 3: malnutrition, Parkinson disease and other mobility disorders, dementia and other mental health conditions (age over 80 years and mostly women). Cluster 4: atrial fibrillation/flutter, cardiac failure, chronic kidney failure and heart valve disease (age between 70–80 and mostly women). Cluster 5: hypertension/hypertensive heart disease, type 2 diabetes mellitus, ischaemic cardiomyopathy, dyslipidaemia, obesity and sleep apnea, including mostly men (age range 60–80). We assessed significant differences among the clusters when gender, age, number of chronic pathologies, number of rehospitalisations and mortality during the hospitalisation were assessed (p<0001 in all cases).ConclusionsWe identify for the first time in a hospital environment five clusters of disease combinations among the inpatients. These clusters contain several high-incidence diseases related to both age and gender that express their own evolution and clinical characteristics over time.


Genetics ◽  
2001 ◽  
Vol 159 (2) ◽  
pp. 699-713
Author(s):  
Noah A Rosenberg ◽  
Terry Burke ◽  
Kari Elo ◽  
Marcus W Feldman ◽  
Paul J Freidlin ◽  
...  

Abstract We tested the utility of genetic cluster analysis in ascertaining population structure of a large data set for which population structure was previously known. Each of 600 individuals representing 20 distinct chicken breeds was genotyped for 27 microsatellite loci, and individual multilocus genotypes were used to infer genetic clusters. Individuals from each breed were inferred to belong mostly to the same cluster. The clustering success rate, measuring the fraction of individuals that were properly inferred to belong to their correct breeds, was consistently ~98%. When markers of highest expected heterozygosity were used, genotypes that included at least 8–10 highly variable markers from among the 27 markers genotyped also achieved &gt;95% clustering success. When 12–15 highly variable markers and only 15–20 of the 30 individuals per breed were used, clustering success was at least 90%. We suggest that in species for which population structure is of interest, databases of multilocus genotypes at highly variable markers should be compiled. These genotypes could then be used as training samples for genetic cluster analysis and to facilitate assignments of individuals of unknown origin to populations. The clustering algorithm has potential applications in defining the within-species genetic units that are useful in problems of conservation.


2021 ◽  
Vol 11 (22) ◽  
pp. 10596
Author(s):  
Chung-Hong Lee ◽  
Hsin-Chang Yang ◽  
Yenming J. Chen ◽  
Yung-Lin Chuang

Recently, an emerging application field through Twitter messages and algorithmic computation to detect real-time world events has become a new paradigm in the field of data science applications. During a high-impact event, people may want to know the latest information about the development of the event because they want to better understand the situation and possible trends of the event for making decisions. However, often in emergencies, the government or enterprises are usually unable to notify people in time for early warning and avoiding risks. A sensible solution is to integrate real-time event monitoring and intelligence gathering functions into their decision support system. Such a system can provide real-time event summaries, which are updated whenever important new events are detected. Therefore, in this work, we combine a developed Twitter-based real-time event detection algorithm with pre-trained language models for summarizing emergent events. We used an online text-stream clustering algorithm and self-adaptive method developed to gather the Twitter data for detection of emerging events. Subsequently we used the Xsum data set with a pre-trained language model, namely T5 model, to train the summarization model. The Rouge metrics were used to compare the summary performance of various models. Subsequently, we started to use the trained model to summarize the incoming Twitter data set for experimentation. In particular, in this work, we provide a real-world case study, namely the COVID-19 pandemic event, to verify the applicability of the proposed method. Finally, we conducted a survey on the example resulting summaries with human judges for quality assessment of generated summaries. From the case study and experimental results, we have demonstrated that our summarization method provides users with a feasible method to quickly understand the updates in the specific event intelligence based on the real-time summary of the event story.


2021 ◽  
Author(s):  
ElMehdi SAOUDI ◽  
Said Jai Andaloussi

Abstract With the rapid growth of the volume of video data and the development of multimedia technologies, it has become necessary to have the ability to accurately and quickly browse and search through information stored in large multimedia databases. For this purpose, content-based video retrieval ( CBVR ) has become an active area of research over the last decade. In this paper, We propose a content-based video retrieval system providing similar videos from a large multimedia data-set based on a query video. The approach uses vector motion-based signatures to describe the visual content and uses machine learning techniques to extract key-frames for rapid browsing and efficient video indexing. We have implemented the proposed approach on both, single machine and real-time distributed cluster to evaluate the real-time performance aspect, especially when the number and size of videos are large. Experiments are performed using various benchmark action and activity recognition data-sets and the results reveal the effectiveness of the proposed method in both accuracy and processing time compared to state-of-the-art methods.


2011 ◽  
pp. 24-32 ◽  
Author(s):  
Nicoleta Rogovschi ◽  
Mustapha Lebbah ◽  
Younès Bennani

Most traditional clustering algorithms are limited to handle data sets that contain either continuous or categorical variables. However data sets with mixed types of variables are commonly used in data mining field. In this paper we introduce a weighted self-organizing map for clustering, analysis and visualization mixed data (continuous/binary). The learning of weights and prototypes is done in a simultaneous manner assuring an optimized data clustering. More variables has a high weight, more the clustering algorithm will take into account the informations transmitted by these variables. The learning of these topological maps is combined with a weighting process of different variables by computing weights which influence the quality of clustering. We illustrate the power of this method with data sets taken from a public data set repository: a handwritten digit data set, Zoo data set and other three mixed data sets. The results show a good quality of the topological ordering and homogenous clustering.


2021 ◽  
Author(s):  
Xin Sui ◽  
Wanjing Wang ◽  
Jinfeng Zhang

In this work, we trained an ensemble model for predicting drug-protein interactions within a sentence based on only its semantics. Our ensembled model was built using three separate models: 1) a classification model using a fine-tuned BERT model; 2) a fine-tuned sentence BERT model that embeds every sentence into a vector; and 3) another classification model using a fine-tuned T5 model. In all models, we further improved performance using data augmentation. For model 2, we predicted the label of a sentence using k-nearest neighbors with its embedded vector. We also explored ways to ensemble these 3 models: a) we used the majority vote method to ensemble these 3 models; and b) based on the HDBSCAN clustering algorithm, we trained another ensemble model using features from all the models to make decisions. Our best model achieved an F-1 score of 0.753 on the BioCreative VII Track 1 test dataset.


2019 ◽  
Author(s):  
Lin Fei ◽  
Yang Yang ◽  
Wang Shihua ◽  
Xu Yudi ◽  
Ma Hong

Unreasonable public bicycle dispatching area division seriously affects the operational efficiency of the public bicycle system. To solve this problem, this paper innovatively proposes an improved community discovery algorithm based on multi-objective optimization (CDoMO). The data set is preprocessed into a lease/return relationship, thereby it calculated a similarity matrix, and the community discovery algorithm Fast Unfolding is executed on the matrix to obtain a scheduling scheme. For the results obtained by the algorithm, the workload indicators (scheduled distance, number of sites, and number of scheduling bicycles) should be adjusted to maximize the overall benefits, and the entire process is continuously optimized by a multi-objective optimization algorithm NSGA2. The experimental results show that compared with the clustering algorithm and the community discovery algorithm, the method can shorten the estimated scheduling distance by 20%-50%, and can effectively balance the scheduling workload of each area. The method can provide theoretical support for the public bicycle dispatching department, and improve the efficiency of public bicycle dispatching system.


2021 ◽  
Author(s):  
TIONG GOH ◽  
MengJun Liu

The ability to predict COVID-19 patients' level of severity (death or survival) enables clinicians to prioritise treatment. Recently, using three blood biomarkers, an interpretable machine learning model was developed to predict the mortality of COVID-19 patients. The method was reported to be suffering from performance stability because the identified biomarkers are not consistent predictors over an extended duration. To sustain performance, the proposed method partitioned data into three different time windows. For each window, an end-classifier, a mid-classifier and a front-classifier were designed respectively using the XGboost single tree approach. These time window classifiers were integrated into a majority vote classifier and tested with an isolated test data set. The voting classifier strengthens the overall performance of 90% cumulative accuracy from a 14 days window to a 21 days prediction window. An additional 7 days of prediction window can have a considerable impact on a patient's chance of survival. This study validated the feasibility of the time window voting classifier and further support the selection of biomarkers features set for the early prognosis of patients with a higher risk of mortality.


Sign in / Sign up

Export Citation Format

Share Document