Social Media-based User Embedding: A Literature Review

Author(s):  
Shimei Pan ◽  
Tao Ding

Automated representation learning is behind many recent success stories in machine learning. It is often used to transfer knowledge learned from a large dataset (e.g., raw text) to tasks for which only a small number of training examples are available. In this paper, we review recent advances in learning to represent social media users as low-dimensional embeddings. This technology is critical for building high-performance social media-based models of human traits and behavior, since ground truth for assessing latent traits and behavior is often expensive to acquire at a large scale. We review typical methods for learning a unified user embedding from heterogeneous user data (e.g., combining social media text and images into a single user representation). Finally, we point out current issues and future directions.
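
As a rough sketch of the unified-embedding idea (not any specific method from the survey), the following Python snippet fuses stand-in text and image feature matrices by per-modality scaling, concatenation, and SVD compression; all names and dimensions are illustrative.

```python
# Minimal sketch: fusing per-modality user features into one unified
# embedding via concatenation + SVD. The feature matrices are stand-ins
# for real text/image encoders, which the survey covers in depth.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_users = 1000
text_feats = rng.normal(size=(n_users, 300))   # e.g., averaged word vectors
image_feats = rng.normal(size=(n_users, 512))  # e.g., pooled CNN features

# Scale each modality so neither dominates, then concatenate.
fused = np.hstack([
    StandardScaler().fit_transform(text_feats),
    StandardScaler().fit_transform(image_feats),
])

# Compress to a low-dimensional unified user embedding.
user_embedding = TruncatedSVD(n_components=64, random_state=0).fit_transform(fused)
print(user_embedding.shape)  # (1000, 64)
```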

2021 ◽  
Vol 21 (3) ◽  
pp. 1-17
Author(s):  
Feiran Huang ◽  
Chaozhuo Li ◽  
Boyu Gao ◽  
Yun Liu ◽  
Sattam Alotaibi ◽  
...  

The analysis of social networks, such as the socially connected Internet of Things, has shown the deep influence of intelligent information processing technology on industrial systems in Smart Cities. The goal of social media representation learning is to learn dense, low-dimensional, continuous representations of multimodal data in social networks, which facilitates many real-world applications. Since social media images are usually accompanied by rich metadata (e.g., textual descriptions, tags, groups, and uploading users), modeling the image alone cannot capture the comprehensive information a social media image carries. In this work, we treat an image and its textual description as multimodal content and transform the remaining meta-information into links between contents (e.g., two images marked with the same tag or submitted by the same user). Based on the multimodal content and social links, we propose a Deep Attentive Multimodal Graph Embedding model, named DAMGE, for more effective social image representation learning. We conduct extensive experiments on both small- and large-scale datasets, and the results confirm the superiority of the proposed model on social image classification and link prediction.
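
A minimal sketch of the metadata-to-link transformation described above, using toy post data: each image-caption post becomes a node, and two posts are linked when they share a tag or an uploading user. This illustrates the graph construction only, not DAMGE's attentive embedding.

```python
# Minimal sketch of turning social image metadata into content links:
# posts sharing a tag or an uploading user get an edge. Data is invented.
import itertools
import networkx as nx

posts = {
    "p1": {"tags": {"sunset", "beach"}, "user": "alice"},
    "p2": {"tags": {"beach"}, "user": "bob"},
    "p3": {"tags": {"city"}, "user": "alice"},
}

G = nx.Graph()
G.add_nodes_from(posts)
for a, b in itertools.combinations(posts, 2):
    shared_tag = posts[a]["tags"] & posts[b]["tags"]
    same_user = posts[a]["user"] == posts[b]["user"]
    if shared_tag or same_user:
        G.add_edge(a, b)

print(sorted(G.edges()))  # [('p1', 'p2'), ('p1', 'p3')]
```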


2019 ◽  
pp. 1049-1070
Author(s):  
Fabian Neuhaus

User data created in digital contexts is of growing interest for analysis, and for spatial analysis in particular. Large-scale user-facing systems such as digital ticketing and social networking platforms generate vast amounts of data, potentially containing information from millions of individuals. Such data has been termed big data. Its spatial, temporal, and social dimensions make it highly relevant to the analysis of cities and urban areas. This chapter discusses this potential through a selection of sample work and an in-depth case study, focusing mainly on insights gained from social media data, especially the Twitter platform, with regard to cities and urban environments. The first part of the chapter discusses a range of examples that use big data and the mapping of digital social network data. The second part discusses how the data is collected and processed. A dedicated section covers ethical considerations, and the chapter closes with a summary and an outlook.
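
As a hedged illustration of the basic processing step behind such urban mapping work (not code from the chapter), the sketch below bins invented geotagged posts into a coarse latitude-longitude grid with pandas.

```python
# Minimal sketch, tied to no specific platform API: aggregating geotagged
# social media posts into a coarse spatial grid, the elementary step
# behind density maps of urban activity. Coordinates are invented.
import pandas as pd

posts = pd.DataFrame({
    "lat": [51.507, 51.509, 51.501, 51.515],
    "lon": [-0.128, -0.130, -0.120, -0.141],
})

cell = 0.01  # grid cell size in degrees (~1 km at this latitude)
posts["cell_lat"] = (posts["lat"] // cell) * cell
posts["cell_lon"] = (posts["lon"] // cell) * cell

density = posts.groupby(["cell_lat", "cell_lon"]).size().rename("n_posts")
print(density)
```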


Author(s):  
Ronghui You ◽  
Yuxuan Liu ◽  
Hiroshi Mamitsuka ◽  
Shanfeng Zhu

Motivation: With the rapid increase in biomedical articles, large-scale automatic Medical Subject Headings (MeSH) indexing has become increasingly important. FullMeSH, the only method for large-scale MeSH indexing with full text, suffers from three major drawbacks: it (i) uses learning to rank, which is time-consuming, (ii) can capture only certain pre-defined sections in full text, and (iii) ignores the whole MEDLINE database. Results: We propose a computationally lighter, full-text, deep-learning-based MeSH indexing method, BERTMeSH, which is flexible with respect to section organization in full text. BERTMeSH combines two technologies: (i) a state-of-the-art pre-trained deep contextual representation, Bidirectional Encoder Representations from Transformers (BERT), which lets BERTMeSH capture the deep semantics of full text, and (ii) a transfer learning strategy that uses both full text in PubMed Central (PMC) and titles and abstracts (without full text) in MEDLINE, to take advantage of both. In our experiments, BERTMeSH was pre-trained with 3 million MEDLINE citations and trained on ∼1.5 million full texts in PMC, and it outperformed various cutting-edge baselines. For example, on 20 K test articles from PMC, BERTMeSH achieved a micro F-measure of 69.2%, 6.3% higher than FullMeSH, with the difference being statistically significant. Predicting the 20 K test articles took 5 min with BERTMeSH versus more than 10 h with FullMeSH, demonstrating its computational efficiency. Supplementary information: Supplementary data are available at Bioinformatics online.
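
The following sketch, which is not the BERTMeSH code, shows the two named ingredients in miniature: encoding an article with a pre-trained BERT and attaching a multi-label classification head. The model name, label count, and text are placeholders.

```python
# Minimal sketch (not BERTMeSH itself): a pre-trained BERT encoder plus a
# multi-label head, one sigmoid score per candidate MeSH term.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
n_mesh_terms = 29000  # rough order of magnitude of the MeSH vocabulary
head = torch.nn.Linear(encoder.config.hidden_size, n_mesh_terms)

text = "Title and abstract (or full-text section) of a biomedical article."
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    cls = encoder(**inputs).last_hidden_state[:, 0]  # [CLS] representation
    scores = torch.sigmoid(head(cls))  # independent per-label probabilities

print(scores.shape)  # torch.Size([1, 29000])
```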


Author(s):  
Suppawong Tuarob ◽  
Conrad S. Tucker

The acquisition and mining of product feature data from online sources such as customer review websites and large-scale social media networks is an emerging area of research. Many existing design methodologies that acquire product feature preferences from online sources assume that the product features expressed by customers are explicitly stated and readily observable, ready to be mined with product feature extraction tools. In many scenarios, however, the product feature preferences customers express are implicit and do not map directly to engineering design targets. For example, a customer may implicitly state "wow I have to squint to read this on the screen", when the explicit product feature is a larger screen. The authors propose an inference model that automatically assigns the most probable explicit product feature desired by a customer, given an implicit preference expressed. The algorithm iteratively refines its inference model by presenting a hypothesis and using ground truth data to determine its statistical validity. A case study involving smartphone product features expressed through Twitter networks demonstrates the effectiveness of the proposed methodology.
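
A minimal sketch of the inference step under invented training pairs: a text classifier maps implicit customer statements to the most probable explicit product feature. The authors' iterative hypothesis-refinement procedure is not reproduced here.

```python
# Minimal sketch: assign the most probable explicit feature to an
# implicit preference via a bag-of-words Naive Bayes classifier.
# The labeled pairs are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

implicit = [
    "I have to squint to read this on the screen",
    "the text is tiny and hard to read",
    "the battery dies before lunch",
    "I charge it twice a day",
]
explicit = ["larger screen", "larger screen",
            "longer battery life", "longer battery life"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(implicit, explicit)
print(model.predict(["can barely see the icons"]))  # ['larger screen']
```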


2022 ◽  
Vol 23 (1) ◽  
Author(s):  
Hanjing Jiang ◽  
Yabing Huang

Background: Drug-disease associations (DDAs) provide important information for exploring the potential efficacy of drugs. However, few DDAs have so far been verified experimentally. Previous evidence indicates that combining multiple sources of information is conducive to discovering new DDAs. How to integrate different biological data sources and identify the most effective drugs for a given disease based on drug-disease coupled mechanisms remains a challenging problem. Results: In this paper, we propose a novel computational model for DDA prediction based on graph representation learning over a multi-biomolecular network (GRLMN). Specifically, we first construct a large-scale molecular association network (MAN) by integrating the associations among drugs, diseases, proteins, miRNAs, and lncRNAs. A graph embedding model is then used to learn vector representations for all drugs and diseases in the MAN. Finally, the combined features are fed to a random forest (RF) model to predict new DDAs. The proposed model was evaluated on the SCMFDD-S dataset using five-fold cross-validation. Experimental results show that GRLMN is highly accurate, with an area under the ROC curve (AUC) of 87.9%, outperforming all previous work on this benchmark in both accuracy and AUC. To further verify GRLMN's performance, we carried out case studies on two common diseases. In the resulting rankings of drugs predicted to be related to a given disease (kidney disease and fever), 15 of the top 20 drugs have been experimentally confirmed. Conclusions: The experimental results show that our model performs well at DDA prediction. GRLMN is an effective prioritization tool for screening reliable DDAs for follow-up studies on drug repositioning.
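
The sketch below mirrors the pipeline's shape, though not its exact components: embed the nodes of a toy heterogeneous association network, concatenate drug and disease vectors, and score candidate pairs with a random forest. Spectral embedding stands in for the paper's graph embedding model, and all data are invented.

```python
# Minimal sketch of a graph-embedding-plus-RF link predictor on a toy
# molecular association network. Nodes, edges, and labels are invented.
import numpy as np
import networkx as nx
from sklearn.ensemble import RandomForestClassifier
from sklearn.manifold import SpectralEmbedding

G = nx.Graph([
    ("drug_A", "protein_1"), ("protein_1", "disease_X"),
    ("drug_B", "mirna_1"), ("mirna_1", "disease_Y"),
    ("drug_A", "disease_Y"), ("drug_B", "protein_1"),
])
nodes = list(G.nodes())
adj = nx.to_numpy_array(G, nodelist=nodes)
emb = SpectralEmbedding(n_components=2, affinity="precomputed").fit_transform(adj)
vec = dict(zip(nodes, emb))

# Known associations (1) and a sampled non-association (0).
pairs = [("drug_A", "disease_Y", 1), ("drug_B", "disease_Y", 1),
         ("drug_A", "disease_X", 0)]
X = np.array([np.hstack([vec[d], vec[s]]) for d, s, _ in pairs])
y = [label for _, _, label in pairs]

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(rf.predict_proba([np.hstack([vec["drug_B"], vec["disease_X"]])]))
```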


2020 ◽  
Vol 29 (03n04) ◽  
pp. 2060009
Author(s):  
Tao Ding ◽  
Fatema Hasan ◽  
Warren K. Bickel ◽  
Shimei Pan

Social media contain rich information that can help us understand the human mind and behavior. Social media data, however, are mostly unstructured (e.g., text and images), and a large number of features may be needed to represent them (e.g., millions of unigrams to represent social media text). Moreover, accurately assessing human behavior is often difficult (e.g., assessing addiction may require a medical diagnosis), so the ground truth data needed to train a supervised human behavior model are often hard to obtain at a large scale. To avoid overfitting, many state-of-the-art behavior models employ sophisticated unsupervised or self-supervised machine learning methods to leverage large amounts of unlabeled data for both feature learning and dimensionality reduction. Unfortunately, despite their high performance, these advanced models often rely on latent features that are hard to explain. Since understanding the knowledge captured in these models is important to behavior scientists and public health providers, we explore new methods for building machine learning models that are not only accurate but also interpretable. We evaluate the effectiveness of the proposed methods in predicting Substance Use Disorders (SUD). We believe the proposed methods are general and applicable to a wide range of data-driven human trait and behavior analysis applications.
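
As one hedged illustration of the interpretability idea (not the authors' exact method), the snippet below represents users by LDA topic proportions, which can be inspected as word lists, and fits a linear classifier whose weights are directly readable. The corpus and labels are stand-ins for real user data.

```python
# Minimal sketch: interpretable user features (LDA topics) plus a linear
# model whose per-topic weights can be read off. Data is illustrative.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = [
    "party drinks bar night fun",
    "gym run morning training health",
    "bar shots drunk weekend party",
    "yoga diet sleep wellness run",
]
labels = [1, 0, 1, 0]  # e.g., 1 = positive screen (illustrative only)

counts = CountVectorizer().fit(docs)
X_counts = counts.transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X_counts)
topics = lda.transform(X_counts)          # users as topic mixtures
clf = LogisticRegression().fit(topics, labels)

# Each topic is interpretable as its top words; each weight is readable.
words = counts.get_feature_names_out()
for k, comp in enumerate(lda.components_):
    top = [words[i] for i in comp.argsort()[-3:][::-1]]
    print(f"topic {k}: {top}, weight {clf.coef_[0][k]:+.2f}")
```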


Author(s):  
Pankaj Lathar ◽  
K. G. Srinivasa ◽  
Abhishek Kumar ◽  
Nabeel Siddiqui

Advancements in web-based technology and the proliferation of sensors and mobile devices interacting with the internet have created immense data management requirements, spanning storage, processing, and the demand for high-performance read-write operations on big data. Large-scale, high-concurrency applications such as social networking services and search engines face challenges in using relational databases to store and query dynamic user data. NoSQL and cloud computing have emerged as paradigms that can meet these requirements. The diversity of existing NoSQL and cloud computing solutions makes it difficult to comprehend the domain and choose an appropriate solution for a specific business task. This chapter therefore reviews NoSQL and cloud-based solutions, with the goals of providing a perspective on the field of data storage technologies and algorithms, guiding researchers and practitioners in selecting the best-fit data store, and identifying challenges and opportunities of the paradigm.
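
A minimal sketch of the document-store pattern that such surveys contrast with fixed relational schemas, assuming a local MongoDB server and illustrative names: records with differing fields coexist in one collection without schema migrations.

```python
# Minimal sketch of schema-flexible storage for dynamic user data.
# Assumes a MongoDB server on the default localhost port; all names
# are illustrative.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["sns_demo"]
users = db["users"]

# Heterogeneous documents: no ALTER TABLE needed when fields differ.
users.insert_one({"name": "alice", "followers": 1200, "tags": ["travel"]})
users.insert_one({"name": "bob", "bio": "occasional poster"})

print(users.find_one({"name": "alice"})["followers"])  # 1200
```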


GigaScience ◽  
2020 ◽  
Vol 9 (11) ◽  
Author(s):  
Shaokun An ◽  
Jizu Huang ◽  
Lin Wan

Background: Dimensionality reduction and visualization play vital roles in single-cell RNA sequencing (scRNA-seq) data analysis. Although extensively studied, state-of-the-art dimensionality reduction algorithms are often unable to preserve the global structures underlying data. Elastic embedding (EE), a nonlinear dimensionality reduction method, has shown promise in revealing low-dimensional intrinsic local and global data structure, but its current implementation does not scale to large scRNA-seq datasets. Results: We present a distributed optimization implementation of the EE algorithm, termed distributed elastic embedding (D-EE). D-EE reveals the low-dimensional intrinsic structures of data as accurately as elastic embedding while scaling to large scRNA-seq datasets. It leverages distributed storage and distributed computation, achieving memory efficiency and high-performance computing simultaneously. In addition, an extended version, distributed time-series elastic embedding (D-TSEE), enables the user to visualize large-scale time-series scRNA-seq data by incorporating experimental temporal information; results on large-scale scRNA-seq data indicate that D-TSEE can uncover oscillatory gene expression patterns. Conclusions: D-EE is a distributed dimensionality reduction and visualization tool. Its distributed storage and computation allow efficient analysis of large-scale single-cell data with a consistent speedup. The source code, based on C and MPI and tailored to high-performance computing clusters, is available at https://github.com/ShaokunAn/D-EE.
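
For intuition, here is a single-process numpy sketch of the elastic embedding objective that D-EE distributes, E(X) = Σ w⁺_nm ‖x_n − x_m‖² + λ Σ w⁻_nm exp(−‖x_n − x_m‖²), minimized by plain gradient descent on toy data; the paper's MPI implementation is far more scalable and is not reproduced here.

```python
# Minimal single-process sketch of the elastic embedding objective:
# attractive weights pull neighbors together, repulsive weights push
# points apart through the exponential term. Toy data only.
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(30, 50))                    # toy "expression" matrix
D2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
Wp = np.exp(-D2 / D2.mean())                     # attractive (local) weights
np.fill_diagonal(Wp, 0.0)
Wn = D2 / D2.max()                               # repulsive weights
lam = 1.0

X = rng.normal(scale=1e-2, size=(30, 2))         # 2-D embedding
for _ in range(500):
    diff = X[:, None, :] - X[None, :, :]         # pairwise differences
    d2 = (diff ** 2).sum(-1)
    coef = Wp - lam * Wn * np.exp(-d2)           # per-pair gradient weight
    grad = 4.0 * (coef[:, :, None] * diff).sum(1)
    X -= 1e-3 * grad

obj = (Wp * d2).sum() + lam * (Wn * np.exp(-d2)).sum()
print(X.shape, round(float(obj), 3))
```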


Author(s):  
Grigoris Antoniou ◽  
Sotiris Batsakis ◽  
Raghava Mutharaju ◽  
Jeff Z. Pan ◽  
Guilin Qi ◽  
...  

As more and more data is generated by sensor networks, social media, and organizations, the Web interlinking this wealth of information becomes more complex. This is particularly true for the so-called Web of Data, in which data is semantically enriched and interlinked using ontologies. In this large and uncoordinated environment, reasoning can be used to check the consistency of the data and of associated ontologies, or to infer logical consequences that, in turn, can yield new insights from the data. However, reasoning approaches need to be scalable in order to enable reasoning over the entire Web of Data. To address this problem, several high-performance reasoning systems, mainly implementing distributed or parallel algorithms, have been proposed in the last few years. These systems differ significantly, for instance in reasoning expressivity, computational properties such as completeness, and reasoning objectives. To provide a first complete overview of the field, this paper reports a systematic review of such scalable reasoning approaches over various ontological languages, detailing the methods and the conducted experiments. We highlight the shortcomings of these approaches and discuss some of the open problems related to performing scalable reasoning.
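
As a hedged, toy-scale illustration of the materialization task these systems parallelize (not any surveyed system's code), the snippet below forward-chains two RDFS-style rules to a fixpoint over a handful of triples.

```python
# Minimal sketch: naive forward-chaining of rdfs:subClassOf transitivity
# and type inheritance over a toy triple set, run to a fixpoint.
SUB, TYPE = "rdfs:subClassOf", "rdf:type"

triples = {
    ("Dog", SUB, "Mammal"),
    ("Mammal", SUB, "Animal"),
    ("rex", TYPE, "Dog"),
}

changed = True
while changed:  # real reasoners distribute/parallelize this loop
    new = set()
    for s, p, o in triples:
        for s2, p2, o2 in triples:
            if p == SUB and p2 == SUB and o == s2:    # subClassOf transitivity
                new.add((s, SUB, o2))
            if p == TYPE and p2 == SUB and o == s2:   # type inheritance
                new.add((s, TYPE, o2))
    changed = not new <= triples
    triples |= new

print(sorted(t for t in triples if t[0] == "rex"))
# [('rex', 'rdf:type', 'Animal'), ('rex', 'rdf:type', 'Dog'),
#  ('rex', 'rdf:type', 'Mammal')]
```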

