vector representation
Recently Published Documents

Total documents: 422 (five years: 126)
H-index: 26 (five years: 3)
2022 ◽ Vol 13 (2) ◽ pp. 1-20 ◽ Author(s): Zhe Jiang, Wenchong He, Marcus Stephen Kirby, Arpan Man Sainju, Shaowen Wang, ...

In recent years, deep learning has achieved tremendous success in image segmentation for computer vision applications. The performance of these models heavily relies on the availability of large-scale high-quality training labels (e.g., PASCAL VOC 2012). Unfortunately, such large-scale high-quality training data are often unavailable in many real-world spatial or spatiotemporal problems in earth science and remote sensing (e.g., mapping the nationwide river streams for water resource management). Although extensive efforts have been made to reduce the reliance on labeled data (e.g., semi-supervised or unsupervised learning, few-shot learning), the complex nature of geographic data such as spatial heterogeneity still requires sufficient training labels when transferring a pre-trained model from one region to another. On the other hand, it is often much easier to collect lower-quality training labels with imperfect alignment with earth imagery pixels (e.g., through interpreting coarse imagery by non-expert volunteers). However, directly training a deep neural network on imperfect labels with geometric annotation errors could significantly impact model performance. Existing research that overcomes imperfect training labels either focuses on errors in label class semantics or characterizes label location errors at the pixel level. These methods do not fully incorporate the geometric properties of label location errors in the vector representation. To fill the gap, this article proposes a weakly supervised learning framework to simultaneously update deep learning model parameters and infer hidden true vector label locations. Specifically, we model label location errors in the vector representation to partially preserve geometric properties (e.g., spatial contiguity within line segments). Evaluations on real-world datasets in the National Hydrography Dataset (NHD) refinement application illustrate that the proposed framework outperforms baseline methods in classification accuracy.
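The abstract does not give the update rules, but the idea of alternating between model training and inferring the hidden true vector label locations can be illustrated with a toy refinement step. The sketch below is a simplified illustration, not the authors' algorithm: the function name, the search window size, and the contiguity smoothing are all assumptions. It snaps each vertex of a vector polyline label toward the highest-probability pixel of the current segmentation output and then blends neighboring vertices to keep spatial contiguity within line segments.

```python
import numpy as np

def refine_polyline(polyline, prob_map, radius=3, smooth=0.5):
    """Move each vertex of an (N, 2) polyline of (row, col) coordinates toward
    the highest-probability pixel in a small window of the current prediction,
    then blend interior vertices with their neighbors to preserve contiguity."""
    refined = polyline.astype(float).copy()
    h, w = prob_map.shape
    for i, (r, c) in enumerate(np.rint(polyline).astype(int)):
        r0, r1 = max(r - radius, 0), min(r + radius + 1, h)
        c0, c1 = max(c - radius, 0), min(c + radius + 1, w)
        window = prob_map[r0:r1, c0:c1]
        dr, dc = np.unravel_index(np.argmax(window), window.shape)
        refined[i] = (r0 + dr, c0 + dc)
    # simple contiguity prior: average each interior vertex with its neighbors
    interior = 0.5 * (refined[:-2] + refined[2:])
    refined[1:-1] = (1 - smooth) * refined[1:-1] + smooth * interior
    return refined

# toy example: a misaligned stream label refined against a fake probability map
prob_map = np.zeros((50, 50)); prob_map[25, :] = 1.0           # "true" stream row
noisy_label = np.stack([np.full(50, 27.0), np.arange(50.0)], axis=1)
print(refine_polyline(noisy_label, prob_map)[:3])
```

In the full framework this step would alternate with retraining the segmentation network on labels rasterized from the refined polylines.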


2022 ◽ Vol 16 (4) ◽ pp. 1-16 ◽ Author(s): Fereshteh Jafariakinabad, Kien A. Hua

The syntactic structure of sentences in a document substantially informs its authorial writing style. Sentence representation learning has been widely explored in recent years and has been shown to improve generalization across many downstream tasks and domains. Even though probing studies suggest that these learned contextual representations implicitly encode some amount of syntax, explicit syntactic information further improves the performance of deep neural models in the domain of authorship attribution. These observations have motivated us to investigate the explicit representation learning of the syntactic structure of sentences. In this article, we propose a self-supervised framework for learning structural representations of sentences. The self-supervised network contains two components: a lexical sub-network and a syntactic sub-network, which take the sequence of words and their corresponding structural labels as input, respectively. Due to the n-to-1 mapping of words to their structural labels, each word is embedded into a vector representation which mainly carries structural information. We evaluate the learned structural representations of sentences using different probing tasks, and subsequently utilize them in the authorship attribution task. Our experimental results indicate that the structural embeddings significantly improve the classification tasks when concatenated with the existing pre-trained word embeddings.
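The abstract does not specify the training objective, so the following is only a minimal PyTorch sketch of the two-component design: a lexical sub-network over word ids and a syntactic sub-network over structural-label ids (assumed here to be POS-like tags), pulled together by a simple cosine alignment loss so that the word-level embeddings absorb structural information. All dimensions, vocabulary sizes, and the loss itself are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchEncoder(nn.Module):
    """Lexical sub-network (words) and syntactic sub-network (structural labels),
    each producing a sentence vector by mean-pooling its token embeddings."""
    def __init__(self, n_words, n_labels, dim=64):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, dim)    # lexical branch
        self.label_emb = nn.Embedding(n_labels, dim)  # syntactic branch

    def forward(self, word_ids, label_ids):
        lex = self.word_emb(word_ids).mean(dim=1)
        syn = self.label_emb(label_ids).mean(dim=1)
        return lex, syn

model = TwoBranchEncoder(n_words=10_000, n_labels=50)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# toy batch: 4 sentences of length 12; labels stand in for the n-to-1 structural tags
words = torch.randint(0, 10_000, (4, 12))
labels = torch.randint(0, 50, (4, 12))

for _ in range(10):
    lex, syn = model(words, labels)
    # alignment loss: pull the lexical and syntactic views of each sentence together
    loss = 1.0 - F.cosine_similarity(lex, syn, dim=-1).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```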


2022 ◽ Author(s): Priyadarshini Rai, Atishay Jain, Neha Jha, Divya Sharma, Shivani Kumar, ...

Dysregulation of a gene's function, either due to mutations or impairments in regulatory networks, often triggers pathological states in the affected tissue. Comprehensive mapping of these apparent gene–pathology relationships is a daunting task, primarily due to genetic pleiotropy and the lack of suitable computational approaches. With the advent of high-throughput genomics platforms and community-scale initiatives such as the Human Cell Landscape (HCL) project [1], researchers have been able to create gene expression portraits of healthy tissues resolved at the level of single cells. However, a similar wealth of knowledge is currently not at our fingertips when it comes to diseases. This is because the genetic manifestation of a disease is often quite heterogeneous and is confounded by several clinical and demographic covariates. To circumvent this, we mined ≈18 million PubMed abstracts published up to May 2019 and selected ≈6.1 million of them that describe the pathological role of genes in different diseases. Further, we employed a word embedding technique from the domain of Natural Language Processing (NLP) to learn vector representations of entities such as genes, diseases, and tissues such that their relationships are preserved in the vector space. Notably, the resulting resource, Pathomap, by virtue of its underpinning theory, also learns transitive relationships. Pathomap provided a vector representation of words indicating a possible association between DNMT3A/BCOR and CYLD cutaneous syndrome (CCS); the first manuscript reporting this finding was not part of our training data.
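As a rough illustration of the underlying technique rather than the Pathomap pipeline itself, the sketch below trains a gensim Word2Vec model on a tiny, hypothetical set of tokenized abstracts and then queries vector-space neighbours of a gene symbol; the corpus, tokenization, and hyperparameters are placeholders.

```python
from gensim.models import Word2Vec

# placeholder corpus: each item is a tokenized, lower-cased abstract
abstracts = [
    ["dnmt3a", "mutations", "are", "reported", "in", "cylindromatosis", "skin", "tumors"],
    ["bcor", "alterations", "co-occur", "with", "cyld", "loss", "in", "cutaneous", "syndrome"],
    ["cyld", "cutaneous", "syndrome", "features", "multiple", "skin", "tumors"],
] * 100  # repeat so the toy corpus has enough co-occurrence statistics

model = Word2Vec(sentences=abstracts, vector_size=100, window=5, min_count=1, workers=2)

# nearest neighbours of a gene symbol in the learned vector space
print(model.wv.most_similar("dnmt3a", topn=5))
# direct similarity between a gene and a disease term
print(model.wv.similarity("dnmt3a", "cyld"))
```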


Algorithms ◽ 2021 ◽ Vol 14 (12) ◽ pp. 348 ◽ Author(s): Zahra Tayebi, Sarwan Ali, Murray Patterson

The widespread availability of large amounts of genomic data on the SARS-CoV-2 virus, as a result of the COVID-19 pandemic, has created an opportunity for researchers to analyze the disease at a level of detail unmatched for any virus before it. On the one hand, this will help biologists, policymakers, and other authorities to make timely and appropriate decisions to control the spread of the coronavirus. On the other hand, such studies will help to deal more effectively with any possible future pandemic. Since the SARS-CoV-2 virus contains different variants, each of them having different mutations, performing any analysis on such data becomes a difficult task, given the size of the data. It is well known that much of the variation in the SARS-CoV-2 genome happens disproportionately in the spike region of the genome sequence, the relatively short region which codes for the spike protein(s). In this paper, we propose a robust feature-vector representation of biological sequences that, when combined with an appropriate feature selection method, allows different downstream clustering approaches to perform well on a variety of measures. We apply the proposed approach with an array of clustering techniques to cluster spike protein sequences in order to study the behavior of the different known variants that are spreading at a very high rate throughout the world. We first use a k-mer-based approach to generate a fixed-length feature-vector representation of the spike sequences. We then show that, with appropriate feature selection, we can efficiently and effectively cluster the spike sequences according to the different variants. Using a publicly available set of SARS-CoV-2 spike sequences, we cluster these sequences using both hard and soft clustering methods and show that, with our feature selection methods, we can achieve higher F1 scores for the clusters as well as better clustering quality metrics compared to baselines.
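A minimal sketch of the k-mer featurization and clustering steps described above, with assumed parameters (k = 3, a variance-based feature filter, and k-means with a guessed number of clusters); it is not the paper's exact pipeline.

```python
from itertools import product
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.cluster import KMeans

AMINO = "ACDEFGHIKLMNPQRSTVWY"
K = 3
KMER_INDEX = {"".join(p): i for i, p in enumerate(product(AMINO, repeat=K))}

def kmer_vector(seq):
    """Fixed-length feature vector: counts of every length-K k-mer in a spike sequence."""
    vec = np.zeros(len(KMER_INDEX))
    for i in range(len(seq) - K + 1):
        idx = KMER_INDEX.get(seq[i:i + K])
        if idx is not None:
            vec[idx] += 1
    return vec

# placeholder spike sequences; real data would come from a public repository
seqs = ["MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSF", "MFVFLVLLPLVSSQCVNLITRTQSYTNSFTRG",
        "MKVLLFAIVSLVKSDQCVNLTTRTQLPPAYTN", "MKVLLFAIVSLVKSEQCVNLITRTQSYTNSFT"]
X = np.array([kmer_vector(s) for s in seqs])

X_sel = VarianceThreshold(threshold=0.0).fit_transform(X)   # drop constant k-mer features
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_sel)
print(labels)
```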


2021 ◽ Vol 11 (22) ◽ pp. 10774 ◽ Author(s): Hongchan Li, Yu Ma, Zishuai Ma, Haodong Zhu

With the rapid increase of public opinion data, Weibo text sentiment analysis plays an increasingly significant role in monitoring network public opinion. Due to the sparseness and high dimensionality of text data and the complex semantics of natural language, sentiment analysis tasks face tremendous challenges. To address these problems, this paper proposes a new model based on BERT and deep learning for Weibo text sentiment analysis. Specifically, BERT first represents the text with dynamic word vectors, and a processed sentiment dictionary is used to enhance the sentiment features of these vectors; a BiLSTM then extracts the contextual features of the text, and the resulting vector representation is weighted by an attention mechanism. After weighting, a CNN extracts the important local sentiment features in the text, and finally the processed sentiment feature representation is classified. A comparative experiment was conducted on a Weibo text dataset collected during the COVID-19 epidemic; the results showed that the performance of the proposed model was significantly improved compared with other similar models.
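A hedged PyTorch sketch of the described pipeline (BERT token vectors, BiLSTM, attention weighting, CNN, classifier). The pretrained model name, hidden sizes, and the simple additive attention are assumptions, and the sentiment-dictionary feature enhancement is omitted for brevity.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class BertBiLSTMAttCNN(nn.Module):
    """Sketch: BERT token vectors -> BiLSTM -> attention weighting -> CNN -> classifier."""
    def __init__(self, bert_name="bert-base-chinese", hidden=128, n_classes=2):
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_name)
        self.bilstm = nn.LSTM(self.bert.config.hidden_size, hidden,
                              batch_first=True, bidirectional=True)
        self.att = nn.Linear(2 * hidden, 1)                  # simple additive attention scores
        self.conv = nn.Conv1d(2 * hidden, 64, kernel_size=3, padding=1)
        self.fc = nn.Linear(64, n_classes)

    def forward(self, input_ids, attention_mask):
        h = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        h, _ = self.bilstm(h)                                # contextual features
        weights = torch.softmax(self.att(h).squeeze(-1), dim=-1)
        h = h * weights.unsqueeze(-1)                        # attention-weighted representation
        h = torch.relu(self.conv(h.transpose(1, 2)))         # local sentiment features
        h = torch.max(h, dim=-1).values                      # global max pooling
        return self.fc(h)

tok = AutoTokenizer.from_pretrained("bert-base-chinese")
batch = tok(["今天心情很好", "疫情让人担忧"], padding=True, return_tensors="pt")
model = BertBiLSTMAttCNN()
print(model(batch["input_ids"], batch["attention_mask"]).shape)  # (2, n_classes)
```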


2021 ◽ Vol 26 (5) ◽ pp. 453-460 ◽ Author(s): Krishna Chythanya Nagaraju, Cherku Ramesh Kumar Reddy

A reusable code component is one that can be used with little or no adaptation to fit into the application being developed. The major concern in this process is maintaining these reusable components in one place, called a repository, so that the code components can be effectively identified as well as reused. Word embeddings allow us to represent textual information numerically; they have become so pervasive that almost all Natural Language Processing projects make use of them. In this work, we use Word2Vec to find vector representations of the features of a reusable component. The features of a reusable component, in the form of a sequence of words, are input to the Word2Vec network. Our method, using Word2Vec with Continuous Bag of Words, outperforms existing methods. The proposed methodology has shown an accuracy of 94.8% in identifying existing reusable components.
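A minimal sketch of the idea, assuming hypothetical feature descriptions and cosine-similarity matching (the authors' actual feature extraction and matching criteria are not described in the abstract): train a CBOW Word2Vec model on the feature words of repository components, average the word vectors per component, and match a query description against the repository.

```python
import numpy as np
from gensim.models import Word2Vec

# placeholder feature descriptions of components already stored in the repository
repo = {
    "csv_reader":   "read parse csv file rows columns delimiter".split(),
    "json_client":  "send http request parse json response retry".split(),
    "email_sender": "compose send email smtp attachment".split(),
}

model = Word2Vec(sentences=list(repo.values()), vector_size=50,
                 window=3, min_count=1, sg=0)   # sg=0 selects Continuous Bag of Words

def component_vector(tokens):
    """Average of the Word2Vec vectors of a component's feature words."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0)

def best_match(query_tokens):
    """Return the repository component whose feature vector is most similar to the query."""
    q = component_vector(query_tokens)
    scores = {}
    for name, toks in repo.items():
        v = component_vector(toks)
        scores[name] = float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    return max(scores.items(), key=lambda kv: kv[1])

print(best_match("parse csv rows".split()))   # expected to point at csv_reader
```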


Author(s): Andrei Borovsky, Elena Rakovskaya

Essential issues in toponymy presuppose studying individual words in order to reconstruct the denotative meanings of geographical names that have been lost in the modern language and to find out how the peculiarities of the local topography, the inhabitants' activities, etc. are reflected in them. Problems of this kind can be solved using intelligent data analysis methods based on information technologies; however, in the scientific literature on toponymy, such methods are practically ignored. The article is devoted to the study of the origin and semantic meanings of geographical names based on finding semantic associates and calculating the semantic similarity of words using an embedding model. Using the proposed method, the origin of some toponyms of the Irkutsk region was determined and their semantic relations were revealed. The dichotomy method was used for toponyms that have two roots in their structure; this made it possible to improve the operation of the model by clarifying the morphemic composition of the original word. The method of word transformation was used to determine the etymology of the toponym «Moscow», and new versions of the origin of this toponym were obtained. It is shown that applying methods based on distributive semantics and vector representations of words, obtained from large arrays of text data, significantly expands the possibilities of research in determining the origin of toponyms and clarifying their meaning.
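A hedged sketch of the embedding-based step (finding semantic associates and similarities); the pretrained model file, the example two-root split, and the candidate source word are placeholders, not the authors' data or procedure.

```python
from gensim.models import KeyedVectors

# assumption: a pretrained embedding model for the relevant language is available locally
wv = KeyedVectors.load_word2vec_format("embeddings.vec", binary=False)

def associates(word, topn=10):
    """Semantic associates of a word: its nearest neighbours in the embedding space."""
    return wv.most_similar(word, topn=topn) if word in wv else []

# dichotomy step for a two-root toponym: analyse each root separately
toponym_roots = ["ust", "kut"]   # hypothetical split of a two-root place name
for root in toponym_roots:
    print(root, associates(root))

# semantic similarity between a toponym root and a candidate source word
if "kut" in wv and "corner" in wv:
    print(wv.similarity("kut", "corner"))
```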


Author(s): Yongxiang Hu

Network representation learning (NRL) aims to convert the nodes of a network into vectors in Euclidean space. As much of the network's information as possible should be preserved when NRL converts nodes into vector representations. The hybrid approach proposed in this paper is a framework for improving other NRL methods by taking into account the structure of densely connected nodes (community-like structure). HARP [1] contracts a network into a series of contracted networks and embeds them from the highest-level contracted network down to the lowest-level one; the vector representation (or embedding) of a higher-level contracted network is used to initialize the learning process of the next lower-level contracted graph hierarchically. In this method (the Hybrid Approach), HARP is revised by using a well-designed initialization process on the highest-level contracted network to preserve more community-like structure information.
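The sketch below is not HARP or the author's revision of it; it is a toy illustration, with assumed parameters, of the general idea of initializing node vectors from a community-contracted version of the graph and then refining them on the original graph.

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.karate_club_graph()
dim = 8

# 1) contract the graph: each detected community becomes one coarse node
communities = list(greedy_modularity_communities(G))
node_to_comm = {n: ci for ci, comm in enumerate(communities) for n in comm}

# 2) embed the coarse (community-level) graph; random vectors stand in for any coarse embedding
rng = np.random.default_rng(0)
coarse_emb = {ci: rng.normal(size=dim) for ci in range(len(communities))}

# 3) prolong: initialize every original node from its community's coarse vector
emb = {n: coarse_emb[node_to_comm[n]] + 0.01 * rng.normal(size=dim) for n in G.nodes()}

# 4) a few refinement sweeps that pull each node toward its neighbours
for _ in range(10):
    emb = {n: 0.5 * emb[n] + 0.5 * np.mean([emb[m] for m in G.neighbors(n)], axis=0)
           for n in G.nodes()}
print(emb[0][:4])
```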


2021 ◽ Vol 28 (3) ◽ pp. 292-311 ◽ Author(s): Vitaly I. Yuferev, Nikolai A. Razin

It is known that, in natural language processing tasks, representing texts by fixed-length vectors using word-embedding models makes sense when the vectorized texts are short; the longer the texts being compared, the worse the approach works. This is because, when using word-embedding models, information is lost in converting the vector representations of the words that make up a text into a vector representation of the entire text, which usually has the same dimension as the vector of a single word. This paper proposes an alternative way of using pre-trained word-embedding models for text vectorization. The essence of the proposed method is to merge semantically similar elements of the dictionary of the existing text corpus by clustering their embeddings, so that a new dictionary is formed, smaller than the original one, each element of which corresponds to one cluster. The original corpus of texts is reformulated in terms of this new dictionary, after which vectorization is performed on the reformulated texts using one of the dictionary-based approaches (TF-IDF was used in this work). The resulting vector representation of a text can be additionally enriched using the vectors of the words of the original dictionary, obtained by reducing the dimension of their embeddings for each cluster. The paper describes a series of experiments to determine the optimal parameters of the method and compares the proposed approach with other text vectorization methods for the text ranking problem: averaging word embeddings with and without TF-IDF weighting, as well as vectorization based on TF-IDF coefficients.
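A compressed sketch of the described pipeline with a toy corpus and assumed parameters (embedding size, number of clusters): cluster the embeddings of the original dictionary, reformulate each text in terms of cluster ids, and vectorize the reformulated texts with TF-IDF.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# toy corpus; real experiments would use a large text collection
texts = ["the car drives on the road", "the automobile moves along the highway",
         "the cat sleeps on the sofa", "a kitten naps on the couch"]
tokenized = [t.split() for t in texts]

# 1) embeddings for the original dictionary
w2v = Word2Vec(tokenized, vector_size=50, min_count=1, window=3, epochs=200)
vocab = list(w2v.wv.index_to_key)
vectors = np.array([w2v.wv[w] for w in vocab])

# 2) cluster semantically similar dictionary elements -> new, smaller dictionary
n_clusters = 8
clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(vectors)
word_to_cluster = {w: f"c{c}" for w, c in zip(vocab, clusters)}

# 3) reformulate the corpus in terms of cluster ids, then apply TF-IDF
reformulated = [" ".join(word_to_cluster[w] for w in toks) for toks in tokenized]
tfidf = TfidfVectorizer().fit_transform(reformulated)
print(tfidf.shape)   # (number of texts, size of the reduced dictionary)
```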


2021 ◽ Vol 7 (8) ◽ pp. 153 ◽ Author(s): Jieying Wang, Jiří Kosinka, Alexandru Telea

Medial descriptors are of significant interest for image simplification, representation, manipulation, and compression. On the other hand, B-splines are well-known tools for specifying smooth curves in computer graphics and geometric design. In this paper, we integrate the two by modeling medial descriptors with stable and accurate B-splines for image compression. Representing medial descriptors with B-splines not only greatly improves compression but also provides an effective vector representation of raster images. A comprehensive evaluation shows that our Spline-based Dense Medial Descriptors (SDMD) method achieves much higher compression ratios at similar or even better quality than the well-known JPEG technique. We illustrate our approach with applications in generating super-resolution images and salient-feature-preserving image compression.
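The sketch below is not the SDMD implementation; it only illustrates, under assumed parameters, the basic ingredient of fitting a smoothing B-spline to the points (and radii) of a medial axis with SciPy, so that only the spline knots and coefficients need to be stored.

```python
import numpy as np
from scipy.interpolate import splprep, splev
from skimage.morphology import medial_axis

# toy binary shape: a filled ellipse
yy, xx = np.mgrid[0:100, 0:200]
shape = ((xx - 100) / 80.0) ** 2 + ((yy - 50) / 30.0) ** 2 <= 1.0

# medial descriptor: skeleton pixels plus the distance (radius) at each pixel
skeleton, dist = medial_axis(shape, return_distance=True)
r, c = np.nonzero(skeleton)
order = np.argsort(c)                       # crude ordering along the main branch
x = c[order].astype(float)
y = r[order].astype(float)
rad = dist[skeleton][order].astype(float)

# fit a smoothing B-spline to (x, y, radius); only tck needs to be stored/compressed
tck, u = splprep([x, y, rad], s=len(x) * 0.5)
x_s, y_s, rad_s = splev(np.linspace(0, 1, 50), tck)
print(len(tck[0]), "knots,", len(x), "original skeleton points")
```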

