From Motion Activity to Geo-Embeddings: Generating and Exploring Vector Representations of Locations, Traces and Visitors through Large-Scale Mobility Data

2019 ◽  
Vol 8 (3) ◽  
pp. 134 ◽  
Author(s):  
Alessandro Crivellari ◽  
Euro Beinat

The rapid growth of positioning technology allows tracking motion between places, making trajectory recordings an important source of information about place connectivity, as they map the routes that people commonly take. In this paper, we utilize users’ motion traces to construct a behavioral representation of places based on how people move between them, ignoring geographical coordinates and spatial proximity. Inspired by natural language processing techniques, we generate and explore vector representations of locations, traces and visitors, obtained through an unsupervised machine learning approach, which we generically named motion-to-vector (Mot2vec), trained on large-scale mobility data. The algorithm consists of two steps: trajectory pre-processing and Word2vec-based model building. First, mobility traces are converted into sequences of locations that unfold in fixed time steps; then, a Skip-gram Word2vec model is used to construct the location embeddings. Trace and visitor embeddings are finally created by combining the location vectors belonging to each trace or visitor. Mot2vec provides a meaningful representation of locations, based on the motion behavior of users, defining a direct way of comparing locations’ connectivity and providing analogous similarity distributions for places of the same type. In addition, it defines a metric of similarity for traces and visitors beyond their spatial proximity and identifies common motion behaviors between different categories of people.
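The two-step algorithm above begins by converting raw timestamped traces into fixed-time-step location sequences before Word2vec training. A minimal sketch of that pre-processing step (the 300-second step size and the toy trace are illustrative assumptions, not values from the paper):

```python
# Sketch of the Mot2vec pre-processing step: turn a timestamped trace into a
# sequence of locations sampled at fixed time steps, carrying the last known
# location forward between observations.

def trace_to_sequence(trace, step=300):
    """trace: list of (timestamp_seconds, location_id), sorted by time.
    Returns one location per fixed time step from first to last timestamp."""
    if not trace:
        return []
    seq = []
    i = 0
    t = trace[0][0]
    end = trace[-1][0]
    while t <= end:
        # advance to the most recent observation at or before time t
        while i + 1 < len(trace) and trace[i + 1][0] <= t:
            i += 1
        seq.append(trace[i][1])
        t += step
    return seq

trace = [(0, "A"), (400, "B"), (900, "C")]
print(trace_to_sequence(trace, step=300))  # ['A', 'A', 'B', 'C']
```

The resulting sequences play the role of sentences in the Skip-gram model, with each location acting as a word.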

2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Esteban Moro ◽  
Dan Calacci ◽  
Xiaowen Dong ◽  
Alex Pentland

Traditional understanding of urban income segregation is largely based on static coarse-grained residential patterns. However, these do not capture the income segregation experience implied by the rich social interactions that happen in places that may relate to individual choices, opportunities, and mobility behavior. Using a large-scale high-resolution mobility data set of 4.5 million mobile phone users and 1.1 million places in 11 large American cities, we show that income segregation experienced in places and by individuals can differ greatly even within close spatial proximity. To further understand these fine-grained income segregation patterns, we introduce a Schelling extension of a well-known mobility model, and show that experienced income segregation is associated with an individual’s tendency to explore new places (place exploration) as well as places with visitors from different income groups (social exploration). Interestingly, while the latter is more strongly associated with demographic characteristics, the former is more strongly associated with mobility behavioral variables. Our results suggest that mobility behavior plays an important role in experienced income segregation of individuals. To measure this form of income segregation, urban researchers should take into account mobility behavior and not only residential patterns.
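A common way to quantify the "income segregation experienced in places" described above is the deviation of a place's visitor mix from an even mix across income groups. A minimal sketch with four income quartiles (the normalization constant is an assumption for illustration; the paper's exact definition may differ):

```python
# Segregation of a place from the income-group shares of its visitors:
# 0 when every group is equally represented, approaching 1 when a single
# group dominates.

def place_segregation(visit_shares):
    """visit_shares: fraction of visits from each income group (sums to 1)."""
    q = len(visit_shares)
    # scale so that complete dominance by one group yields 1.0
    return (q / (2 * (q - 1))) * sum(abs(s - 1 / q) for s in visit_shares)

print(place_segregation([0.25, 0.25, 0.25, 0.25]))  # evenly mixed -> 0.0
print(place_segregation([1.0, 0.0, 0.0, 0.0]))      # one group dominates
```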


Author(s):  
Lewis Mitchell ◽  
Joshua Dent ◽  
Joshua Ross

It is widely accepted that different online social media platforms produce different modes of communication; however, the ways in which these modalities are shaped by the constraints of a particular platform remain difficult to quantify. On 7 November 2017, Twitter doubled the character limit for users to 280 characters, presenting a unique opportunity to study the response of this population to an exogenous change in the communication medium. Here we analyse a large dataset comprising 387 million English-language tweets (10% of all public tweets) collected over the September 2017--January 2018 period to quantify and explain large-scale changes in individual behaviour and communication patterns precipitated by the character-length change. Using statistical and natural language processing techniques, we find that linguistic complexity increased after the change, with individuals writing at a significantly higher reading level. However, we find that some textual properties, such as the statistical language distribution, remain invariant across the change and are no different from writings in other online media. By fitting a generative mathematical model to the data we find a surprisingly slow response of the Twitter population to this exogenous change, with a substantial fraction of users taking several weeks to adjust to the new medium. In the talk we describe the model and the Bayesian parameter estimation techniques used to make these inferences. Furthermore, we argue for mathematical models as an alternative exploratory methodology for "Big" social media datasets, empowering the researcher to make inferences about the human behavioural processes which underlie large-scale patterns and trends.
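The "higher reading level" finding above is the kind of measurement a readability formula produces. As an illustration, this sketch computes the Flesch-Kincaid grade level, one common choice (the study's exact metric is not specified in this abstract, and the syllable counter is a crude heuristic):

```python
# Flesch-Kincaid grade level: 0.39 * (words/sentences)
# + 11.8 * (syllables/word) - 15.59

import re

def count_syllables(word):
    # vowel-group heuristic, adequate for illustration only
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def fk_grade(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    n = len(words)
    return 0.39 * (n / sentences) + 11.8 * (syllables / n) - 15.59

print(round(fk_grade("The cat sat on the mat."), 2))
```

Longer sentences and polysyllabic vocabulary both push the grade upward, which is how a shift toward 280-character tweets can register as a higher reading level.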


Author(s):  
Mario Fernando Jojoa Acosta ◽  
Begonya Garcia-Zapirain ◽  
Marino J. Gonzalez ◽  
Bernardo Perez-Villa ◽  
Elena Urizar ◽  
...  

A review of previous works shows this study is the first attempt to analyse the lockdown effect using natural language processing techniques, particularly sentiment analysis methods applied at large scale. It is also the first of its kind to analyse the impact of COVID-19 on the university community, covering staff and students jointly and from a multi-country perspective. The main overall finding of this work is that the most frequently related words were family, anxiety, house and life. It has also been shown that staff have a slightly less negative perception of the consequences of COVID-19 in their daily lives. We used artificial intelligence models, namely Swivel embeddings and a multilayer perceptron, as classification algorithms. The accuracies reached are 88.8% and 88.5% for students and staff, respectively. The main conclusion of our study is that higher education institutions and policymakers around the world may benefit from these findings when formulating policy recommendations and strategies to support students during this and any future pandemics.
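The classification stage described above pairs text embeddings with a multilayer perceptron. A minimal sketch of an MLP forward pass over an embedding vector (the weights, dimensions, and two-class output are toy assumptions, not the trained model from the study):

```python
# Forward pass of a one-hidden-layer MLP: dense -> ReLU -> dense -> softmax.

import math

def relu(v):
    return [max(0.0, x) for x in v]

def dense(v, W, b):
    return [sum(wi * xi for wi, xi in zip(row, v)) + bi
            for row, bi in zip(W, b)]

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def mlp_predict(embedding, W1, b1, W2, b2):
    hidden = relu(dense(embedding, W1, b1))
    return softmax(dense(hidden, W2, b2))  # class probabilities

# toy 3-d text embedding, 2 hidden units, 2 classes (negative / non-negative)
W1 = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
b1 = [0.0, 0.1]
W2 = [[1.0, -1.0], [-1.0, 1.0]]
b2 = [0.0, 0.0]
probs = mlp_predict([0.2, -0.4, 0.9], W1, b1, W2, b2)
print([round(p, 3) for p in probs])
```

In the study's setting the input vector would be a pretrained Swivel embedding of the survey response rather than the toy values used here.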


2019 ◽  
Vol 3 (2) ◽  
Author(s):  
Marije Michel ◽  
Akira Murakami ◽  
Theodora Alexopoulou ◽  
Detmar Meurers

This study investigates the effect of instructional design on (morpho)syntactic complexity in second language (L2) writing development. We operationalised instructional design in terms of task type and empirically based the investigation on a large subcorpus (669,876 writings by 119,960 learners from 128 tasks at all Common European Framework of Reference for Languages levels) of the EF-Cambridge Open Language Database (EFCAMDAT; Geertzen, Alexopoulou and Korhonen 2014). First, the 128 task prompts were manually categorised for task type (e.g. argumentation, description). Next, developmental trajectories of syntactic complexity from A1 to C2 were established using a variety of global (e.g. mean length of clause) and specific (e.g. non-third person singular present tense verbs) measures extracted using natural language processing techniques. The effects of task type were analysed using the categorisation from the first step. Finally, tasks that showed atypical behaviour for a measure given their task type were explored qualitatively. Our results partially confirm earlier experimental and corpus-based studies (e.g. subordination associated with argumentative tasks). Going beyond, our large-scale data-driven analysis made it possible to identify specific measures that were naturally prompted by instructional design (e.g. narrations eliciting wh-phrases). We discuss which measures typically align with certain task types and highlight how instructional design relates to L2 developmental trajectories over time.
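Global measures such as mean length of clause, named above, are straightforward to compute once clause boundaries are known. This sketch approximates clause boundaries with punctuation and a few subordinators purely for illustration (the study extracts its measures with proper natural language processing tools, not this heuristic):

```python
# Mean length of clause = total words / number of clauses,
# with clause boundaries approximated naively.

import re

def mean_length_of_clause(text):
    # treat commas, semicolons, sentence-final punctuation, and a few
    # subordinators as clause boundaries (a real measure uses parsing)
    boundaries = r"[,;.!?]| because | although | which | when "
    clauses = [c for c in re.split(boundaries, text) if c.strip()]
    lengths = [len(c.split()) for c in clauses]
    return sum(lengths) / len(lengths)

print(mean_length_of_clause("I left early because it rained, and she stayed."))
```

Higher values indicate longer, more elaborated clauses, one of the global complexity signals tracked across the A1 to C2 trajectories.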


Information ◽  
2019 ◽  
Vol 10 (6) ◽  
pp. 212 ◽  
Author(s):  
Joseba Fernandez de Landa ◽  
Rodrigo Agerri ◽  
Iñaki Alegria

Social networks like Twitter are increasingly important in the creation of new ways of communication. They have also become useful tools for social and linguistic research due to the massive amounts of public textual data available. This is particularly important for less-resourced languages, as it allows current natural language processing techniques to be applied to large amounts of unstructured data. In this work, we study the linguistic and social aspects of young and adult people’s behaviour based on the contents of their tweets and the social relations that arise from them. With this objective in mind, we gathered over 10 million tweets from more than 8000 users. First, we classified each user by life stage (young/adult) according to the writing style of their tweets. Second, we applied topic modelling techniques to the personal tweets to find the most popular topics for each life stage. Third, we established the relations and communities that emerge based on the retweets. We conclude that using the large amounts of unstructured data provided by Twitter facilitates social research with computational techniques such as natural language processing, giving the opportunity both to segment communities based on demographic characteristics and to discover how they interact and relate to one another.
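As a minimal stand-in for the topic modelling step described above, this sketch surfaces the most frequent content words per life-stage group (the toy tweets and stopword list are invented; the study applies full topic models to millions of real tweets):

```python
# Per-group frequent-term extraction: a crude proxy for topic modelling.

from collections import Counter

STOPWORDS = {"the", "a", "to", "and", "i", "is"}

def top_terms(tweets, k=2):
    counts = Counter(w for t in tweets for w in t.lower().split()
                     if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(k)]

young = ["exam stress and the exam timetable", "music festival tickets"]
adult = ["mortgage rates and the mortgage news", "school run again"]
print(top_terms(young), top_terms(adult))
```

A real pipeline would replace the frequency count with a probabilistic topic model (e.g. LDA), but the contrast between group vocabularies is the same signal being exploited.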


2019 ◽  
Author(s):  
Ian R. Braun ◽  
Carolyn J. Lawrence-Dill

Natural language descriptions of plant phenotypes are a rich source of information for genetics and genomics research. We computationally translated descriptions of plant phenotypes into structured representations that can be analyzed to identify biologically meaningful associations. These representations include the EQ (Entity-Quality) formalism, which uses terms from biological ontologies to represent phenotypes in a standardized, semantically rich format, as well as numerical vector representations generated using natural language processing (NLP) methods (such as the bag-of-words approach and document embedding). We compared the resulting phenotype similarity measures to those derived from manually curated data to determine the performance of each method. Computationally derived EQ and vector representations recapitulated biological truth comparably well to representations created through manual EQ statement curation. Moreover, NLP methods for generating vector representations of phenotypes are scalable to large quantities of text because they require no human input. These results indicate that it is now possible to computationally and automatically produce and populate large-scale information resources that enable researchers to query phenotypic descriptions directly.
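Phenotype similarity under the bag-of-words representation mentioned above is typically scored with cosine similarity between word-count vectors. A minimal sketch (the two phenotype descriptions are invented examples):

```python
# Cosine similarity between bag-of-words vectors of two phenotype texts.

import math
from collections import Counter

def cosine_bow(a, b):
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb)

s = cosine_bow("dwarf plant with short internodes",
               "short plant with dwarf stature")
print(round(s, 3))
```

Document-embedding methods replace the sparse count vectors with dense learned vectors, but the downstream similarity computation is the same.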


2017 ◽  
Author(s):  
Sabrina Jaeger ◽  
Simone Fulle ◽  
Samo Turk

Inspired by natural language processing techniques, we here introduce Mol2vec, an unsupervised machine learning approach to learn vector representations of molecular substructures. Similarly to Word2vec models, where vectors of closely related words lie in close proximity in the vector space, Mol2vec learns vector representations of molecular substructures that point in similar directions for chemically related substructures. Compounds can finally be encoded as vectors by summing up the vectors of the individual substructures and, for instance, fed into supervised machine learning approaches to predict compound properties. The underlying substructure vector embeddings are obtained by training an unsupervised machine learning approach on a so-called corpus of compounds that consists of all available chemical matter. The resulting Mol2vec model is pre-trained once, yields dense vector representations and overcomes drawbacks of common compound feature representations such as sparseness and bit collisions. The prediction capabilities are demonstrated on several compound property and bioactivity data sets and compared with results obtained for Morgan fingerprints as the reference compound representation. Mol2vec can easily be combined with ProtVec, which applies the same Word2vec concept to protein sequences, resulting in a proteochemometric approach that is alignment-independent and can thus also easily be used for proteins with low sequence similarity.
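The compound-encoding step described above, summing the vectors of a compound's substructures, can be sketched as follows (the substructure identifiers and 3-dimensional vectors are toy assumptions; real Mol2vec embeddings come from training on a large compound corpus):

```python
# Encode a compound as the sum of its substructure embedding vectors.

def encode_compound(substructures, embeddings):
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    for s in substructures:
        for i, x in enumerate(embeddings[s]):
            vec[i] += x
    return vec

embeddings = {
    "C=O": [0.4, -0.1, 0.2],        # carbonyl-like fragment (toy vector)
    "c1ccccc1": [-0.3, 0.5, 0.1],   # aromatic ring (toy vector)
}
print(encode_compound(["C=O", "c1ccccc1"], embeddings))
```

The resulting compound vector can then be fed into any supervised model for property prediction, exactly as a document vector built by summing word vectors would be.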


2021 ◽  
Vol 55 (1) ◽  
pp. 1-2
Author(s):  
Bhaskar Mitra

Neural networks with deep architectures have demonstrated significant performance improvements in computer vision, speech recognition, and natural language processing. The challenges in information retrieval (IR), however, are different from those in these other application areas. A common form of IR involves ranking documents---or short passages---in response to keyword-based queries. Effective IR systems must deal with the query-document vocabulary mismatch problem by modeling relationships between different query and document terms and how they indicate relevance. Models should also consider lexical matches when the query contains rare terms---such as a person's name or a product model number---not seen during training, and avoid retrieving semantically related but irrelevant results. In many real-life IR tasks, retrieval involves extremely large collections---such as the document index of a commercial Web search engine---containing billions of documents. Efficient IR methods should take advantage of specialized IR data structures, such as the inverted index, to retrieve efficiently from large collections. Given an information need, the IR system also mediates how much exposure an information artifact receives by deciding whether it should be displayed, and where it should be positioned, among other results. Exposure-aware IR systems may optimize for additional objectives besides relevance, such as parity of exposure for retrieved items and content publishers. In this thesis, we present novel neural architectures and methods motivated by the specific needs and challenges of IR tasks. We ground our contributions with a detailed survey of the growing body of neural IR literature [Mitra and Craswell, 2018].
Our key contribution towards improving the effectiveness of deep ranking models is developing the Duet principle [Mitra et al., 2017] which emphasizes the importance of incorporating evidence based on both patterns of exact term matches and similarities between learned latent representations of query and document. To efficiently retrieve from large collections, we develop a framework to incorporate query term independence [Mitra et al., 2019] into any arbitrary deep model that enables large-scale precomputation and the use of inverted index for fast retrieval. In the context of stochastic ranking, we further develop optimization strategies for exposure-based objectives [Diaz et al., 2020]. Finally, this dissertation also summarizes our contributions towards benchmarking neural IR models in the presence of large training datasets [Craswell et al., 2019] and explores the application of neural methods to other IR tasks, such as query auto-completion.
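The query term independence assumption mentioned above scores each query term against a document independently and sums the contributions, which is what allows per-term scores to be precomputed and stored in an inverted index. A minimal sketch, with raw term frequency standing in for the learned per-term model:

```python
# Query-term-independent scoring: each term's contribution depends only on
# (term, document), so it can be computed offline and indexed.

def qti_score(query_terms, doc_terms):
    return sum(doc_terms.count(t) for t in query_terms)

doc = "neural ranking models for neural information retrieval".split()
print(qti_score(["neural", "retrieval"], doc))  # 2 + 1 -> 3
```

In the thesis framework the per-term function is an arbitrary deep model rather than term frequency, but the additive decomposition is what makes large-scale precomputation possible.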


Information ◽  
2021 ◽  
Vol 12 (5) ◽  
pp. 204
Author(s):  
Charlyn Villavicencio ◽  
Julio Jerison Macrohon ◽  
X. Alphonse Inbaraj ◽  
Jyh-Horng Jeng ◽  
Jer-Guang Hsieh

A year into the COVID-19 pandemic and one of the longest recorded lockdowns in the world, the Philippines received its first delivery of COVID-19 vaccines on 1 March 2021 through WHO’s COVAX initiative. A month into the inoculation of frontline health professionals and other priority groups, the authors of this study gathered data on the sentiment of Filipinos regarding the Philippine government’s efforts using the social networking site Twitter. Natural language processing techniques were applied to understand the general sentiment, which can help the government in analyzing its response. The sentiments were annotated and trained using the Naïve Bayes model to classify English- and Filipino-language tweets into positive, neutral, and negative polarities through the RapidMiner data science software. The results yielded an accuracy of 81.77%, which exceeds the accuracies reported in recent sentiment analysis studies using Twitter data from the Philippines.
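The classification step described above can be sketched as a multinomial Naïve Bayes classifier with Laplace smoothing (the toy tweets and labels below are invented examples, not data from the study, which trains through RapidMiner):

```python
# Multinomial Naïve Bayes with Laplace (add-one) smoothing.

import math
from collections import Counter, defaultdict

def train_nb(examples):
    class_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        for w in text.lower().split():
            word_counts[label][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def predict_nb(text, model):
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for label, c in class_counts.items():
        lp = math.log(c / total)  # class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            lp += math.log((word_counts[label][w] + 1) / denom)  # smoothed
        if lp > best_lp:
            best, best_lp = label, lp
    return best

examples = [
    ("salamat vaccine rollout good news", "positive"),
    ("grateful for the free vaccine", "positive"),
    ("slow rollout bad planning", "negative"),
    ("angry at the delays", "negative"),
]
model = train_nb(examples)
print(predict_nb("good vaccine news", model))
```

Smoothing keeps unseen words from zeroing out a class probability, which matters for short, informally spelled tweets.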

