Autism_genepheno: Text mining of gene-phenotype associations reveals new phenotypic profiles of autism-associated genes

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Sijie Li ◽  
Ziqi Guo ◽  
Jacob B. Ioffe ◽  
Yunfei Hu ◽  
Yi Zhen ◽  
...  

Abstract Autism is a spectrum disorder with wide variation in type and severity of symptoms. Understanding gene–phenotype associations is vital to unravel the disease mechanisms and advance its diagnosis and treatment. To date, several databases have stored a large portion of gene–phenotype associations, which are mainly obtained from genetic experiments. However, a large proportion of gene–phenotype associations are still buried in the autism-related literature, and there are limited resources to investigate autism-associated gene–phenotype associations. Given the abundance of the autism-related literature, we were thus motivated to develop Autism_genepheno, a text mining pipeline that identifies sentence-level mentions of autism-associated genes and phenotypes in the literature through natural language processing methods. We have generated a comprehensive database of gene–phenotype associations in the last five years’ autism-related literature that can be easily updated as new literature becomes available. We have evaluated our pipeline through several different approaches, and we are able to rank and select top autism-associated genes through their unique and wide spectrum of phenotypic profiles, which could provide a unique resource for the diagnosis and treatment of autism. The data resources and the Autism_genepheno pipeline are available at: https://github.com/maiziezhoulab/Autism_genepheno.
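The central step of a pipeline like this is detecting sentences that co-mention a gene and a phenotype. The following is a minimal sketch of that step, assuming tiny illustrative gene and phenotype lexicons; it is not the authors' pipeline, which relies on curated resources such as gene lists and phenotype vocabularies.

```python
# Minimal sketch of sentence-level gene-phenotype co-mention extraction.
# The tiny lexicons below are illustrative placeholders, not the curated
# gene and phenotype resources used by Autism_genepheno.
import re

GENES = {"SHANK3", "MECP2", "CHD8"}                      # hypothetical subset
PHENOTYPES = {"intellectual disability", "seizures", "language delay"}

def co_mentions(text):
    """Yield (gene, phenotype, sentence) triples for sentences mentioning both."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    for sent in sentences:
        genes = {g for g in GENES if re.search(rf"\b{g}\b", sent)}
        phenos = {p for p in PHENOTYPES if p.lower() in sent.lower()}
        for g in genes:
            for p in phenos:
                yield g, p, sent

example = ("Mutations in SHANK3 are associated with intellectual disability. "
           "CHD8 variants were not linked to seizures in this cohort.")
for gene, pheno, sent in co_mentions(example):
    print(gene, "|", pheno, "|", sent)
```

Note that naive co-mention matching also fires on the negated second sentence, which is why a real pipeline adds further filtering and context analysis on top of this step.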


Data volumes grow daily, and classical data mining is increasingly giving way to big data approaches. Within data mining, text mining is the process of deriving structured, high-quality information from text documents, helping businesses discover valuable knowledge. Sentiment analysis, a major task in natural language processing and a key application of text mining, determines the emotional tone underlying a text. The objective of this paper is to categorize documents at the sentence level and the review level and to apply an ensemble of classification techniques to a dataset of electronic product reviews. The techniques are then compared on various parameters to determine which performs best, and the results are used to offer the company suggestions for improving the product.
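A comparison of this kind typically trains several classifiers on the same features and compares cross-validated scores. Below is a minimal sketch using scikit-learn on placeholder review texts; the electronic-product dataset and the paper's specific set of techniques and parameters are not reproduced here.

```python
# Sketch: compare several text classifiers on labelled product reviews.
# The reviews below are placeholders, not the dataset used in the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

reviews = ["battery life is great", "screen cracked after a week",
           "fast shipping and works well", "stopped charging, very poor"] * 25
labels  = [1, 0, 1, 0] * 25   # 1 = positive, 0 = negative

for name, clf in [("NaiveBayes", MultinomialNB()),
                  ("LogisticRegression", LogisticRegression(max_iter=1000)),
                  ("LinearSVM", LinearSVC())]:
    pipe = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
    scores = cross_val_score(pipe, reviews, labels, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```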


2020 ◽  
Vol 33 (5) ◽  
pp. 1357-1380
Author(s):  
Yilu Zhou ◽  
Yuan Xue

Purpose: Strategic alliances among organizations are among the central drivers of innovation and economic growth. However, the discovery of alliances has relied on purely manual search and has limited scope. This paper proposes a text-mining framework, ACRank, that automatically extracts alliances from news articles. ACRank aims to provide human analysts with higher coverage of strategic alliances than existing databases while maintaining reasonable extraction precision. It has the potential to discover alliances involving less well-known companies, a situation often neglected by commercial databases. Design/methodology/approach: The proposed framework is a systematic process of alliance extraction and validation using natural language processing techniques and alliance domain knowledge. The process integrates news article search, entity extraction, and syntactic and semantic linguistic parsing techniques. In particular, the Alliance Discovery Template (ADT) component applies a number of linguistic templates expanded from expert domain knowledge to extract potential alliances at the sentence level. Alliance Confidence Ranking (ACRank) further validates each unique alliance based on multiple features at the document level. The framework is designed to deal with extremely skewed, noisy data from news articles. Findings: Evaluation of ACRank on a gold standard data set of IBM alliances (2006–2008) showed that sentence-level ADT-based extraction achieved 78.1% recall and 44.7% precision and eliminated over 99% of the noise in news articles. ACRank further improved precision to 97% for the top 20% of extracted alliance instances. A further comparison with the Thomson Reuters SDC database showed that SDC covered less than 20% of total alliances, while ACRank covered 67%. When applied to Dow 30 company news articles, ACRank is estimated to achieve a recall between 0.48 and 0.95, and only 15% of the alliances appeared in SDC. Originality/value: The research framework proposed in this paper indicates a promising direction for building a comprehensive alliance database using automatic approaches. It adds value to academic studies and business analyses that require in-depth knowledge of strategic alliances. It also encourages other innovative studies that use text mining and data analytics to study business relations.
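The sentence-level stage described here is template driven. The sketch below illustrates the general idea with two hand-written regular-expression templates and made-up company names; the actual ADT templates derived from expert knowledge are more numerous and more elaborate.

```python
# Sketch of template-based alliance extraction at sentence level.
# The templates and company names are illustrative stand-ins for the
# expert-derived ADT templates described in the paper.
import re

TEMPLATES = [
    r"(?P<a>[A-Z][\w&.]+) (?:and|with) (?P<b>[A-Z][\w&.]+) "
    r"(?:announced|formed|signed) (?:a|an) (?:alliance|partnership|joint venture)",
    r"(?P<a>[A-Z][\w&.]+) partners? with (?P<b>[A-Z][\w&.]+)",
]

def extract_alliances(sentence):
    """Return candidate (company_a, company_b) pairs matched by any template."""
    pairs = []
    for pattern in TEMPLATES:
        for m in re.finditer(pattern, sentence):
            pairs.append((m.group("a"), m.group("b")))
    return pairs

print(extract_alliances("IBM and Cognos announced a partnership to integrate analytics."))
```

In the full framework, candidates produced by such templates would then be scored at the document level by ACRank before being shown to analysts.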


2019 ◽  
Vol 36 (1) ◽  
pp. 264-271 ◽  
Author(s):  
Alexander Junge ◽  
Lars Juhl Jensen

Abstract Motivation: Information extraction by mining the scientific literature is key to uncovering relations between biomedical entities. Most existing approaches based on natural language processing extract relations from single sentence-level co-mentions, ignoring co-occurrence statistics over the whole corpus. Existing approaches counting entity co-occurrences ignore the textual context of each co-occurrence. Results: We propose a novel corpus-wide co-occurrence scoring approach to relation extraction that takes the textual context of each co-mention into account. Our method, called CoCoScore, scores the certainty of stating an association for each sentence that co-mentions two entities. CoCoScore is trained using distant supervision based on a gold-standard set of associations between entities of interest. Instead of requiring a manually annotated training corpus, co-mentions are labeled as positives/negatives according to their presence/absence in the gold standard. We show that CoCoScore outperforms previous approaches in identifying human disease–gene and tissue–gene associations as well as in identifying physical and functional protein–protein associations in different species. CoCoScore is a versatile text mining tool to uncover pairwise associations via co-occurrence mining, within and beyond biomedical applications. Availability and implementation: CoCoScore is available at: https://github.com/JungeAlexander/cocoscore. Supplementary information: Supplementary data are available at Bioinformatics online.
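Two ideas from this abstract can be made concrete: distant-supervision labelling of co-mentions against a gold standard, and aggregation of sentence-level scores into a corpus-wide pair score. The sketch below uses toy entity pairs and a plain sum for aggregation; the published CoCoScore scoring function uses a more elaborate weighting, so this is only an illustration of the principle.

```python
# Sketch of distant supervision and corpus-wide score aggregation in the
# spirit of CoCoScore. The gold standard, co-mentions, and the aggregation
# rule are simplified placeholders, not the published scoring function.
from collections import defaultdict

gold_pairs = {("BRCA1", "breast cancer")}          # hypothetical gold standard

# (entity1, entity2, sentence-level score from a trained sentence classifier)
scored_co_mentions = [
    ("BRCA1", "breast cancer", 0.92),
    ("BRCA1", "breast cancer", 0.35),
    ("TP53",  "asthma",        0.08),
]

# Distant supervision: a co-mention is a positive training example iff the
# entity pair is in the gold standard (used when training the classifier).
labels = [(a, b) in gold_pairs for a, b, _ in scored_co_mentions]

# Corpus-wide aggregation: combine all sentence scores for each pair
# (here a simple sum; the real method weights sentences more carefully).
pair_scores = defaultdict(float)
for a, b, score in scored_co_mentions:
    pair_scores[(a, b)] += score

print(labels)
print(dict(pair_scores))
```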


2018 ◽  
Author(s):  
Alexander Junge ◽  
Lars Juhl Jensen

Information extraction by mining the scientific literature is key to uncovering relations between biomedical entities. Most existing approaches based on natural language processing extract relations from single sentence-level co-mentions, ignoring co-occurrence statistics over the whole corpus. Existing approaches counting entity co-occurrences ignore the textual context of each co-occurrence. We propose a novel corpus-wide co-occurrence scoring approach to relation extraction that takes the textual context of each co-mention into account. Our method, called CoCoScore, scores the certainty of stating an association for each sentence that co-mentions two entities. CoCoScore is trained using distant supervision based on a gold-standard set of associations between entities of interest. Instead of requiring a manually annotated training corpus, co-mentions are labeled as positives/negatives according to their presence/absence in the gold standard. We show that CoCoScore outperforms previous approaches in identifying human disease-gene and tissue-gene associations as well as in identifying physical and functional protein-protein associations in different species. CoCoScore is a versatile text-mining tool to uncover pairwise associations via co-occurrence mining, within and beyond biomedical applications. CoCoScore is available at: https://github.com/JungeAlexander/cocoscore


2021 ◽  
pp. 1-13
Author(s):  
Lamiae Benhayoun ◽  
Daniel Lang

BACKGROUND: The renewed advent of Artificial Intelligence (AI) is inducing profound changes in the classic categories of technology professions and is creating the need for new specific skills. OBJECTIVE: To identify the gaps in skills between academic training on AI in French engineering and business schools and the requirements of the labour market. METHOD: Extraction of AI training content from the schools’ websites and scraping of a job advertisement website, followed by a text mining analysis using Python code for natural language processing. RESULTS: A categorization of AI-related occupations and a characterization of three classes of skills for the AI market: technical, soft, and interdisciplinary. The skills gaps concern certain professional certifications, the mastery of specific tools, research abilities, and awareness of the ethical and regulatory dimensions of AI. CONCLUSIONS: This analysis based on natural language processing algorithms provides a better understanding of the components of AI capability at the individual and organizational levels and can help shape educational programs to meet AI market requirements.
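At its core, this kind of skills-gap analysis compares the vocabulary of curricula with the vocabulary of job advertisements. The sketch below shows one simple way to do that with keyword extraction and a set difference; the documents and resulting terms are placeholders, not the scraped data analysed in the study.

```python
# Sketch of a skills-gap comparison: mine frequent terms from curricula and
# job ads, then compare the two sets. The texts are placeholders only.
from sklearn.feature_extraction.text import CountVectorizer

curricula = ["deep learning with python, statistics, machine learning project"]
job_ads   = ["machine learning engineer: python, docker, mlops, model ethics"]

def top_terms(docs, n=10):
    """Return the n most frequent non-stopword terms across the documents."""
    vec = CountVectorizer(stop_words="english")
    counts = vec.fit_transform(docs).sum(axis=0).A1
    terms = vec.get_feature_names_out()
    return {t for t, _ in sorted(zip(terms, counts), key=lambda x: -x[1])[:n]}

taught   = top_terms(curricula)
required = top_terms(job_ads)
print("Skills gap (required but not taught):", required - taught)
```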


Author(s):  
Dang Van Thin ◽  
Ngan Luu-Thuy Nguyen ◽  
Tri Minh Truong ◽  
Lac Si Le ◽  
Duy Tin Vo

Aspect-based sentiment analysis has been studied in both research and industrial communities over recent years. For low-resource languages, standard benchmark corpora play an important role in the development of methods. In this article, we introduce the two largest sentence-level benchmark corpora for two tasks in Vietnamese: Aspect Category Detection and Aspect Polarity Classification. Our corpora are annotated with high inter-annotator agreement for the restaurant and hotel domains. The release of these corpora should push forward the low-resource language processing community. In addition, we deploy and compare the effectiveness of supervised learning methods using single-task and multi-task approaches based on deep learning architectures. Experimental results on our corpora show that the multi-task approach based on the BERT architecture outperforms the other neural network architectures and the single-task approach. Our corpora and source code are published on the footnoted site.
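A multi-task setup of this kind typically shares one BERT encoder between two task-specific heads. The sketch below shows that structure with PyTorch and Hugging Face transformers; the checkpoint name, label counts, and pooling choice are assumptions for illustration, not the authors' exact configuration.

```python
# Sketch of a multi-task architecture: one shared BERT encoder with a head
# for aspect category detection (multi-label) and a head for aspect
# polarity classification. Checkpoint name and label counts are assumed.
import torch.nn as nn
from transformers import AutoModel

class MultiTaskABSA(nn.Module):
    def __init__(self, model_name="bert-base-multilingual-cased",
                 n_categories=12, n_polarities=3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.category_head = nn.Linear(hidden, n_categories)   # multi-label
        self.polarity_head = nn.Linear(hidden, n_polarities)   # single-label

    def forward(self, input_ids, attention_mask):
        # Use the [CLS] token representation as a simple sentence embedding.
        pooled = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state[:, 0]
        return self.category_head(pooled), self.polarity_head(pooled)

# Training would sum a BCEWithLogitsLoss on the category logits and a
# CrossEntropyLoss on the polarity logits into one multi-task objective.
```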


2010 ◽  
Vol 22 (12) ◽  
pp. 2728-2744 ◽  
Author(s):  
Eric Pakulak ◽  
Helen J. Neville

Although anecdotally there appear to be differences in the way native speakers use and comprehend their native language, most empirical investigations of language processing study university students and none have studied differences in language proficiency, which may be independent of resource limitations such as working memory span. We examined differences in language proficiency in adult monolingual native speakers of English using an ERP paradigm. ERPs were recorded to insertion phrase structure violations in naturally spoken English sentences. Participants recruited from a wide spectrum of society were given standardized measures of English language proficiency, and two complementary ERP analyses were performed. In between-groups analyses, participants were divided on the basis of standardized proficiency scores into lower proficiency and higher proficiency groups. Compared with lower proficiency participants, higher proficiency participants showed an early anterior negativity that was more focal, both spatially and temporally, and a larger and more widely distributed positivity (P600) to violations. In correlational analyses, we used a wide spectrum of proficiency scores to examine the degree to which individual proficiency scores correlated with individual neural responses to syntactic violations in regions and time windows identified in the between-groups analyses. This approach also used partial correlation analyses to control for possible confounding variables. These analyses provided evidence for the effects of proficiency that converged with the between-groups analyses. These results suggest that adult monolingual native speakers of English who vary in language proficiency differ in the recruitment of syntactic processes that are hypothesized to be at least in part automatic as well as of those thought to be more controlled. These results also suggest that to fully characterize neural organization for language in native speakers it is necessary to include participants of varying proficiency.
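The correlational analyses described here rely on partial correlation to control for possible confounds such as working memory span. The sketch below shows one standard way to compute a partial correlation from regression residuals; all data are synthetic and the variable names are only illustrative of the study's design.

```python
# Sketch of a partial correlation between proficiency and an ERP measure
# while controlling for a confound (e.g., working memory span), computed
# from regression residuals. All data below are synthetic.
import numpy as np

rng = np.random.default_rng(0)
confound    = rng.normal(size=200)                      # e.g., working memory span
proficiency = 0.5 * confound + rng.normal(size=200)
erp_effect  = 0.4 * proficiency + rng.normal(size=200)  # e.g., P600 amplitude

def residualize(y, x):
    """Residuals of y after regressing out x (with an intercept)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

r_partial = np.corrcoef(residualize(proficiency, confound),
                        residualize(erp_effect, confound))[0, 1]
print(f"partial correlation: {r_partial:.3f}")
```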


Author(s):  
Necva Bölücü ◽  
Burcu Can

Part-of-speech (PoS) tagging is one of the fundamental syntactic tasks in natural language processing, as it assigns a syntactic category (such as noun, verb, or adjective) to each word within a given sentence or context. Those syntactic categories can be used to further analyze sentence-level syntax (e.g., dependency parsing) and thereby extract the meaning of the sentence (e.g., semantic parsing). Various methods have been proposed for learning PoS tags in an unsupervised setting without using any annotated corpora. One of the widely used methods for the tagging problem is log-linear models. Initialization of the parameters in a log-linear model is crucial for inference, and different initialization techniques have been used so far. In this work, we present a log-linear model for PoS tagging that uses another fully unsupervised Bayesian model to initialize its parameters in a cascaded framework. We thereby transfer knowledge between two different unsupervised models to improve the PoS tagging results, where the log-linear model benefits from the Bayesian model’s expertise. We present results for Turkish, a morphologically rich language, and for English, a comparatively morphologically poor language, in a fully unsupervised framework. The results show that our framework outperforms other unsupervised models proposed for PoS tagging.
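The cascaded idea is that quantities estimated by the first unsupervised model seed the parameters of the log-linear model instead of a random start. The sketch below illustrates this with toy expected tag-word counts turned into initial emission weights; the counts, sizes, and feature set are placeholders, not the models used in the paper.

```python
# Sketch of cascaded initialization: expected tag-word counts from a first
# unsupervised (Bayesian-style) model initialise the emission weights of a
# log-linear tagger instead of random initialisation. Toy values only.
import numpy as np

n_tags, vocab = 3, 5
# Hypothetical expected counts gathered from the first model's posterior:
expected_counts = np.array([[4.0, 1.0, 0.5, 0.2, 0.1],
                            [0.3, 5.0, 2.0, 0.1, 0.4],
                            [0.2, 0.3, 0.1, 3.0, 4.0]])

# Smoothed log-probabilities become the initial log-linear weights.
probs = (expected_counts + 0.1) / (expected_counts + 0.1).sum(axis=1, keepdims=True)
init_weights = np.log(probs)          # shape: (n_tags, vocab)

def emission_scores(word_id, weights):
    """Unnormalised log-linear scores of each tag for one word feature."""
    return weights[:, word_id]

print(emission_scores(1, init_weights))   # tag scores for word index 1
```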

