Mining Twitter Data for Landslide Events Reported Worldwide

The explosion of user-generated content published on social media from mobile devices has led to the concept known as “citizen sensing.” Although English has been adopted by many as a de facto standard international language, reports about events, such as disasters, are frequently provided by citizens in their local language in addition to English. Integrating citizen reports from many languages is a significant challenge. This article describes tools that address this challenge and support citizen sensing of landslide events reported worldwide. Multilingual support is based on the first unified cross-lingual dataset of word vectors for representing texts in multiple languages. The classification model based on the proposed cross-lingual word vectors outperforms the “native” and “translated” approaches based on monolingual word vectors. Furthermore, it does not require the creation of a separate training set in a local language or its translation to English.


Feature-Weighted Sampling for Proper Evaluation of Classification Models

Applied Sciences ◽

10.3390/app11052039 ◽

2021 ◽

Vol 11 (5) ◽

pp. 2039

Author(s):

Hyunseok Shin ◽

Sejong Oh

Keyword(s):

Random Sampling ◽

Sampling Method ◽

Classification Model ◽

Training Set ◽

Test Set ◽

Feature Importance ◽

Proper Training ◽

Machine Learning Applications ◽

Test Sets ◽

The Given

In machine learning applications, classification schemes have been widely used for prediction tasks. Typically, to develop a prediction model, the given dataset is divided into training and test sets; the training set is used to build the model and the test set is used to evaluate the model. Furthermore, random sampling is traditionally used to divide datasets. The problem, however, is that the performance of the model is evaluated differently depending on how we divide the training and test sets. Therefore, in this study, we proposed an improved sampling method for the accurate evaluation of a classification model. We first generated numerous candidate cases of train/test sets using the R-value-based sampling method. We evaluated the similarity of distributions of the candidate cases with the whole dataset, and the case with the smallest distribution–difference was selected as the final train/test set. Histograms and feature importance were used to evaluate the similarity of distributions. The proposed method produces more proper training and test sets than previous sampling methods, including random and non-random sampling.
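
The selection step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: candidate splits are drawn by plain random shuffling rather than the R-value-based sampling the abstract mentions, and the function names, the bin count, and the `weights` argument (a stand-in for feature importance) are all assumptions made for the example.

```python
import random


def histogram(values, lo, hi, bins=5):
    """Normalized histogram of `values` over the range [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins or 1.0  # guard against a constant feature
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def distribution_difference(subset, full, weights, bins=5):
    """Weighted sum of per-feature histogram differences (hypothetical measure)."""
    diff = 0.0
    for j, w in enumerate(weights):
        col_full = [row[j] for row in full]
        col_sub = [row[j] for row in subset]
        lo, hi = min(col_full), max(col_full)
        h_full = histogram(col_full, lo, hi, bins)
        h_sub = histogram(col_sub, lo, hi, bins)
        diff += w * sum(abs(a - b) for a, b in zip(h_full, h_sub))
    return diff


def best_split(rows, weights, test_fraction=0.3, candidates=50, seed=0):
    """Return (score, train, test) for the candidate closest to the whole dataset."""
    rng = random.Random(seed)
    n_test = max(1, int(len(rows) * test_fraction))
    best = None
    for _ in range(candidates):
        idx = list(range(len(rows)))
        rng.shuffle(idx)  # assumption: plain random candidates, not R-value-based
        test = [rows[i] for i in idx[:n_test]]
        train = [rows[i] for i in idx[n_test:]]
        score = (distribution_difference(train, rows, weights)
                 + distribution_difference(test, rows, weights))
        if best is None or score < best[0]:
            best = (score, train, test)
    return best
```

Calling `best_split(rows, weights)` on a list of numeric feature tuples returns the lowest-scoring candidate; raising `candidates` trades run time for a closer match to the full-data distribution.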


Fully Automated Detection of Paramagnetic Rims in Multiple Sclerosis Lesions on 3T Susceptibility-Based MR Imaging

10.1101/2020.08.31.276238 ◽

2020 ◽

Author(s):

Carolyn Lou ◽

Pascal Sati ◽

Martina Absinta ◽

Kelly Clark ◽

Jordan D. Dworkin ◽

...

Keyword(s):

Multiple Sclerosis ◽

Severe Disease ◽

Area Under The Curve ◽

Machine Learning Algorithms ◽

Classification Model ◽

List Type ◽

Training Set ◽

Random Forest Classification ◽

Automated Method ◽

Forest Classification

Background and Purpose: The presence of a paramagnetic rim around a white matter lesion has recently been shown to be a hallmark of a particular pathological type of multiple sclerosis (MS) lesion. Increased prevalence of these paramagnetic rim lesions (PRLs) is associated with a more severe disease course in MS. The identification of these lesions is time-consuming to perform manually. We present a method to automatically detect PRLs on 3T T2*-phase images. Methods: T1-weighted, T2-FLAIR, and T2*-phase MRI of the brain were collected at 3T for 19 subjects with MS. The images were then processed with lesion segmentation, lesion center detection, lesion labelling, and lesion-level radiomic feature extraction. A total of 877 lesions were identified, 118 (13%) of which contained a paramagnetic rim. We divided our data into a training set (15 patients, 673 lesions) and a testing set (4 patients, 204 lesions). We fit a random forest classification model on the training set and assessed our ability to classify lesions as PRL on the test set. Results: The number of PRLs per subject identified via our automated lesion labelling method was highly correlated with the gold standard count of PRLs per subject, r = 0.91 (95% CI [0.79, 0.97]). The classification algorithm using radiomic features can classify a lesion as PRL or not with an area under the curve of 0.80 (95% CI [0.67, 0.86]). Conclusion: This study develops a fully automated technique for the detection of paramagnetic rim lesions using standard T1 and FLAIR sequences and a T2*-phase sequence obtained on 3T MR images. Highlights: A fully automated method for both the identification and classification of paramagnetic rim lesions is proposed. Radiomic features in conjunction with machine learning algorithms can accurately classify paramagnetic rim lesions. Challenges for classification are largely driven by heterogeneity between lesions, including equivocal rim signatures and lesion location.
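
The abstract reports its split in patients as well as lesions (15 and 4 patients), which suggests that all lesions from one patient fall on the same side of the split. A minimal sketch of such a patient-level split is shown below; the record layout (`"patient"` key), the function name, and the random assignment of patients are assumptions for illustration, since the abstract does not say how patients were allocated.

```python
import random
from collections import defaultdict


def split_by_patient(lesions, n_test_patients, seed=0):
    """Split lesion records so that no patient appears in both sets."""
    by_patient = defaultdict(list)
    for lesion in lesions:
        by_patient[lesion["patient"]].append(lesion)

    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)  # assumption: random patient assignment
    test_ids = set(patients[:n_test_patients])

    train = [l for p in patients if p not in test_ids for l in by_patient[p]]
    test = [l for p in patients if p in test_ids for l in by_patient[p]]
    return train, test
```

Splitting at the patient level keeps lesions from the same scan out of both sets at once, which would otherwise tend to inflate test performance.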


Cross-Lingual Information Retrieval

Design and Usability of Digital Libraries ◽

10.4018/978-1-59140-441-5.ch009 ◽

2005 ◽

pp. 153-170 ◽

Cited By ~ 4

Author(s):

Christopher Yang ◽

Kar W. Li

Keyword(s):

Information Retrieval ◽

Digital Library ◽

Digital Libraries ◽

World Wide ◽

Relevant Information ◽

Semantic Interoperability ◽

Library Research ◽

The World ◽

Cross Lingual ◽

Multiple Languages

Structural and semantic interoperability were the focus of digital library research in the early 1990s. Much research has been done on searching and retrieving objects across variations in protocols, formats, and disciplines. As the World Wide Web has become more popular in the last ten years, information has become available in multiple languages in global digital libraries. Users are searching across the language boundary to identify relevant information that may not be available in their own language. Cross-lingual semantic interoperability became one of the focuses of digital library research in the late 1990s. In particular, research in cross-lingual information retrieval (CLIR) has been very active in recent conferences on information retrieval, digital libraries, knowledge management, and information systems. The major problem in CLIR is how to bridge the representations of user queries and documents when they are in different languages.


Multi-Pig Part Detection and Association with a Fully-Convolutional Network

Sensors ◽

10.3390/s19040852 ◽

2019 ◽

Vol 19 (4) ◽

pp. 852 ◽

Cited By ~ 12

Author(s):

Eric Psota ◽

Mateusz Mittek ◽

Lance Pérez ◽

Ty Schmidt ◽

Benny Mote

Keyword(s):

Body Part ◽

Training Set ◽

Convolutional Network ◽

Fully Convolutional Network ◽

Non Invasive ◽

Significant Challenge ◽

Lighting Conditions ◽

Computer Vision Systems ◽

Public Datasets ◽

Private Datasets

Computer vision systems have the potential to provide automated, non-invasive monitoring of livestock animals; however, the lack of public datasets with well-defined targets and evaluation metrics presents a significant challenge for researchers. Consequently, existing solutions often focus on achieving task-specific objectives using relatively small, private datasets. This work introduces a new dataset and method for instance-level detection of multiple pigs in group-housed environments. The method uses a single fully-convolutional neural network to detect the location and orientation of each animal, where both body part locations and pairwise associations are represented in the image space. Accompanying this method is a new dataset containing 2000 annotated images with 24,842 individually annotated pigs from 17 different locations. The proposed method achieves over 99% precision and over 96% recall when detecting pigs in environments previously seen by the network during training. To evaluate the robustness of the trained network, it is also tested on environments and lighting conditions unseen in the training set, where it achieves 91% precision and 67% recall. The dataset is publicly available for download.


Adversarial Training for Community Question Answer Selection Based on Multi-Scale Matching

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.3301395 ◽

2019 ◽

Vol 33 ◽

pp. 395-402

Author(s):

Xiao Yang ◽

Madian Khabsa ◽

Miaosen Wang ◽

Wei Wang ◽

Ahmed Hassan Awadallah ◽

...

Keyword(s):

Question Answering ◽

State Of The Art ◽

Classification Problem ◽

Classification Model ◽

Training Set ◽

Community Based ◽

Multi Scale ◽

Adversarial Training ◽

Source Of Information ◽

Different Levels

Community-based question answering (CQA) websites represent an important source of information. As a result, the problem of matching the most valuable answers to their corresponding questions has become an increasingly popular research topic. We frame this task as a binary (relevant/irrelevant) classification problem, and present an adversarial training framework to alleviate the label imbalance issue. We employ a generative model to iteratively sample a subset of challenging negative samples to fool our classification model. Both models are alternately optimized using the REINFORCE algorithm. The proposed method is completely different from previous ones, where negative samples in the training set are directly used or uniformly down-sampled. Further, we propose using Multi-scale Matching, which explicitly inspects the correlation between words and n-grams of different levels of granularity. We evaluate the proposed method on the SemEval 2016 and SemEval 2017 datasets and achieve state-of-the-art or similar performance.


Real Time Opinion Mining and Analysis of Twitter Data

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i3.12.16104 ◽

2018 ◽

Vol 7 (3.12) ◽

pp. 351

Author(s):

K Senthil Kumar ◽

Mohammad Musab Trumboo ◽

Vaibhav . ◽

Satyajai Ahlawat

Keyword(s):

Sentiment Analysis ◽

Opinion Mining ◽

Far East ◽

Lexical Database ◽

Training Set ◽

Twitter Data ◽

The Far East ◽

The People ◽

Single Person ◽

Mass Information

The era in which we currently stand is one of public opinion and mass information. People from all around the globe are joined together through various information junctions to create a global community, where news from the far east reaches the people of the far west within seconds. Nothing is hidden; anything can be scrutinized to its core, and through these global criticisms and mass discussions of gigantic magnitude, we have reached the pinnacle of correct decisions and better choices. These pseudo-social groups and data junctions have bombarded our society so much that they now hold the forelock of our opinions and sentiments; ergo, we reach out to these groups to achieve a better outcome. But all this enormous data and all these opinions cannot be researched by a single person; hence the need for sentiment analysis. In this paper we will try to accomplish this by creating a system that fetches tweets from Twitter and checks those tweets against a lexical database to create a training set, which is then compared with the pre-fetched tweets. Through this we will be able to assign a polarity to every tweet and label it as negative, positive, or neutral; this is the very foundation of sentiment analysis, so subtle yet so magnificent.
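
A lexicon-based polarity assignment of the kind outlined above might look like the following minimal sketch. The tiny word lists, the tokenizer, and the function name are invented for illustration; the paper's actual lexical database and scoring rule are not given in the abstract.

```python
import re

# Hypothetical miniature lexicon; a real system would load a lexical database.
POSITIVE = {"good", "great", "love", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "sad"}


def polarity(tweet):
    """Label a tweet by counting positive and negative lexicon hits."""
    tokens = re.findall(r"[a-z']+", tweet.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

A tweet with no lexicon hits, or with equal numbers of positive and negative hits, is labelled neutral under this rule.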


Multiclass Classifier for P-Glycoprotein Substrates, Inhibitors, and Non-Active Compounds

Molecules ◽

10.3390/molecules24102006 ◽

2019 ◽

Vol 24 (10) ◽

pp. 2006 ◽

Cited By ~ 1

Author(s):

Liadys Mora Lagares ◽

Nikola Minovski ◽

Marjana Novič

Keyword(s):

In Silico ◽

Transmembrane Protein ◽

External Validation ◽

Assessment Process ◽

Classification Model ◽

Training Set ◽

Test Set ◽

Active Compounds ◽

P Glycoprotein ◽

Validation Set

P-glycoprotein (P-gp) is a transmembrane protein that actively transports a wide variety of chemically diverse compounds out of the cell. It is highly associated with the ADMET (absorption, distribution, metabolism, excretion and toxicity) properties of drugs/drug candidates and contributes to decreasing toxicity by eliminating compounds from cells, thereby preventing intracellular accumulation. Therefore, in the drug discovery and toxicological assessment process it is advisable to pay attention to whether a compound under development could be transported by P-gp or not. In this study, an in silico multiclass classification model capable of predicting the probability that a compound interacts with P-gp was developed using a counter-propagation artificial neural network (CP ANN) based on a set of 2D molecular descriptors, as well as an extensive dataset of 2512 compounds (1178 P-gp inhibitors, 477 P-gp substrates and 857 P-gp non-active compounds). The model provided a good classification performance, producing non-error rate (NER) values of 0.93 for the training set and 0.85 for the test set, while the average precision (AvPr) was 0.93 for the training set and 0.87 for the test set. An external validation set of 385 compounds was used to challenge the model’s performance. On the external validation set the NER and AvPr values were 0.70 for both indices. We believe that this in silico classifier could be effectively used as a reliable virtual screening tool for identifying potential P-gp ligands.
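
For readers unfamiliar with the two indices, the sketch below computes them from a multiclass confusion matrix. It assumes the definitions commonly used in chemometrics, namely NER as the mean of the per-class sensitivities and AvPr as the mean of the per-class precisions; the abstract itself does not define them, and the function names are hypothetical.

```python
def non_error_rate(cm):
    """Mean per-class sensitivity; cm[i][j] counts true class i predicted as j."""
    recalls = [cm[i][i] / sum(cm[i]) for i in range(len(cm))]
    return sum(recalls) / len(recalls)


def average_precision(cm):
    """Mean per-class precision, taken over the columns of the confusion matrix."""
    n = len(cm)
    precisions = []
    for j in range(n):
        predicted_j = sum(cm[i][j] for i in range(n))
        precisions.append(cm[j][j] / predicted_j)
    return sum(precisions) / n
```

Both functions assume every class has at least one true member and at least one prediction; otherwise a division by zero would need to be handled.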


A personalized, web-based prognostic tool for resectable gastric cancer.

Journal of Clinical Oncology ◽

10.1200/jco.2017.35.15_suppl.e15575 ◽

2017 ◽

Vol 35 (15_suppl) ◽

pp. e15575-e15575

Author(s):

Brice Jabo ◽

John W. Morgan ◽

Mayada A. Aljehani ◽

Matthew J Selleck ◽

Albert Y. Lin

Keyword(s):

Gastric Cancer ◽

Health Care Providers ◽

Classification Model ◽

Adjuvant Chemoradiotherapy ◽

Care Providers ◽

Training Set ◽

Test Set ◽

Web Based ◽

Prognostic Tool ◽

Sensitivity Specificity

e15575 Background: Gastric cancer (GC) mortality remains high, with a 5-year survival of 30 percent. For patients with resectable GC, mortality varies depending on both patient and tumor characteristics. The current study sought to develop a web-based prognostic model to assist patients and health care providers in decision making regarding either surgery-only or adjuvant chemoradiotherapy (CRT). Methods: California SEER data were used, and records, including demographic, pathologic, and treatment information, were retrieved for 2,583 patients diagnosed with stage IB to III GC and treated with either surgery only or adjuvant CRT from 2006 to 2013. Purposeful selection using a Cox regression model was used to identify important mortality predictors. Additionally, with simple random sampling, 70% of the data were assigned to the training set and the remaining 30% were assigned to the test set. Furthermore, a generalized boosted classification model was trained using the training set and validated using the test set. Area under the curve (AUC) of the receiver operating characteristic (ROC), sensitivity, specificity and accuracy were determined for 5- and 10-year mortality. Results: The median survival was 33 months for patients in the training set, and 32 for the test set. Predictors included in the model were age, ethnicity (Asian/other, Hispanic, non-Hispanic black and non-Hispanic white), T-stage, histology (intestinal, diffuse and other), presence of signet ring (yes/no), proximal location (yes/no), lymph node ratio, and CRT following surgery (yes/no). Validation of the model on the test set showed AUC, sensitivity, specificity and accuracy of 0.78 (95% CI = 0.75, 0.82), 0.75, 0.65 and 0.70 for 5-year survival and 0.77 (95% CI = 0.74, 0.80), 0.79, 0.55 and 0.70 for 10-year survival.
Conclusions: The proposed web-based prognostic tool, using readily available patient and tumor characteristics, provides validated and personalized prognostic information to aid clinicians and patients in the GC adjuvant treatment decision process. [Table: see text]
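
The test-set metrics reported above, other than AUC, follow directly from the binary confusion counts. The sketch below shows the standard definitions; the function name and the 0/1 label encoding are assumptions for illustration and are not taken from the study.

```python
def binary_metrics(y_true, y_pred):
    """Sensitivity, specificity and accuracy for 0/1 labels (1 = event)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return {
        "sensitivity": tp / (tp + fn),  # share of true events detected
        "specificity": tn / (tn + fp),  # share of non-events correctly cleared
        "accuracy": (tp + tn) / len(pairs),
    }
```

Because accuracy mixes the two class-specific rates according to prevalence, a model can report the same accuracy at quite different sensitivity and specificity, as the 5- and 10-year figures above illustrate.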


Predicting the Brexit Vote by Tracking and Classifying Public Opinion Using Twitter Data

Statistics Politics and Policy ◽

10.1515/spp-2017-0006 ◽

2017 ◽

Vol 8 (1) ◽

Cited By ~ 4

Author(s):

Julio Cesar Amador Diaz Lopez ◽

Sofia Collignon-Delmar ◽

Kenneth Benoit ◽

Akitaka Matsuo

Keyword(s):

Public Opinion ◽

Training Set ◽

Twitter Data ◽

Eu Referendum ◽

The Uk ◽

High Level ◽

The Eu ◽

Training Sets

We use 23M Tweets related to the EU referendum in the UK to predict the Brexit vote. In particular, we use user-generated labels known as hashtags to build training sets related to the Leave/Remain campaign. Next, we train SVMs in order to classify Tweets. Finally, we compare our results to Internet and telephone polls. This approach not only allows us to reduce the time spent hand-coding data to create a training set, but also achieves a high level of correlation with Internet polls. Our results suggest that Twitter data may be a suitable substitute for Internet polls and may be a useful complement for telephone polls. We also discuss the reach and limitations of this method.
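
The idea of using hashtags as labels can be sketched as below. The hashtag lists are illustrative guesses rather than the sets used in the paper, the function names are hypothetical, and the rule of discarding tweets that carry tags from both camps is an assumption; the SVM training step is omitted.

```python
import re

# Assumed example hashtags; the abstract does not list the ones actually used.
LEAVE_TAGS = {"#voteleave", "#leaveeu"}
REMAIN_TAGS = {"#strongerin", "#remain"}


def label_by_hashtag(tweet):
    """Return 'leave', 'remain', or None when the hashtags are absent or mixed."""
    tags = set(re.findall(r"#\w+", tweet.lower()))
    leave = bool(tags & LEAVE_TAGS)
    remain = bool(tags & REMAIN_TAGS)
    if leave and not remain:
        return "leave"
    if remain and not leave:
        return "remain"
    return None


def build_training_set(tweets):
    """Keep only tweets whose hashtags give an unambiguous label."""
    training = []
    for tweet in tweets:
        label = label_by_hashtag(tweet)
        if label is not None:
            training.append((tweet, label))
    return training
```

The resulting `(text, label)` pairs could then be vectorized and passed to a classifier, with unlabelled tweets left for prediction.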


PERFECTION OF CLASSIFICATION ACCURACY IN TEXT CATEGORIZATION

International Journal of Advanced Research ◽

10.21474/ijar01/13437 ◽

2021 ◽

Vol 9 (09) ◽

pp. 484-488

Author(s):

Rajeev Tripathi

Keyword(s):

Sentiment Analysis ◽

Text Classification ◽

Classification Accuracy ◽

Text Categorization ◽

Classification Model ◽

Text Data ◽

Twitter Data ◽

Long Time ◽

Google Alerts ◽

Email Spam

Problems and strategies for text classification have been known for a long time. They're widely utilised by companies like Google and Yahoo for email spam screening, sentiment analysis of Twitter data, and automatic news categorization in Google Alerts. We're still working on getting the findings to be as accurate as possible. When dealing with large amounts of text data, however, the model's performance and accuracy become a difficulty. The type of words utilised in the corpus and the type of features produced for classification have a big impact on the performance of a text classification model.
