Conclusion

Author(s):  
Soumya Raychaudhuri

The genomics era has introduced many new high-throughput experimental modalities capable of producing large amounts of data on comprehensive sets of genes, and in time there will certainly be many more techniques that explore new avenues in biology. In any case, textual analysis will remain an important aspect of the analysis. The body of peer-reviewed scientific text represents the accumulated accomplishments of biology, and it plays a critical role in hypothesizing about and interpreting any data set. To ignore it altogether is tantamount to reinventing the wheel with each analysis. The volume of relevant literature has reached proportions at which it is all but impossible to search through it manually; instead we must often rely on automated text mining methods to access the literature efficiently and effectively. The methods we present in this book provide an introduction to the avenues one can employ to include text meaningfully in the analysis of functional genomics data sets. They complement the statistical methods, such as classification and clustering, that are commonly employed to analyze such data. We hope this book encourages readers to utilize and further develop text mining in their own analyses.

Paleobiology ◽  
2010 ◽  
Vol 36 (2) ◽  
pp. 253-282 ◽  
Author(s):  
Philip D. Mannion ◽  
Paul Upchurch

Both the body fossils and trackways of sauropod dinosaurs indicate that they inhabited a range of inland and coastal environments during their 160-Myr evolutionary history. Quantitative paleoecological analyses of a large data set of sauropod occurrences reveal a statistically significant positive association between non-titanosaurs and coastal environments, and between titanosaurs and inland environments. Similarly, “narrow-gauge” trackways are positively associated with coastal environments and “wide-gauge” trackways are associated with inland environments. The statistical support for these associations suggests that this is a genuine ecological signal: non-titanosaur sauropods preferred coastal environments such as carbonate platforms, whereas titanosaurs preferred inland environments such as fluvio-lacustrine systems. These results remain robust when the data set is time sliced and jackknifed in various ways. When the analyses are repeated using the more inclusive groupings of titanosauriforms and Macronaria, the signal is weakened or lost. These results reinforce the hypothesis that “wide-gauge” trackways were produced by titanosaurs. It is commonly assumed that the trackway and body fossil records will give different results, with the former providing a more reliable guide to the habitats occupied by extinct organisms because footprints are produced during life, whereas carcasses can be transported to different environments prior to burial. However, this view is challenged by our observation that separate body fossil and trackway data sets independently support the same conclusions regarding environmental preferences in sauropod dinosaurs. Similarly, analyzing localities and individuals independently results in the same environmental associations. We demonstrate that conclusions about environmental patterns among fossil taxa can be highly sensitive to an investigator's choices regarding analytical protocols. In particular, decisions regarding the taxonomic groupings used for comparison, the time range represented by the data set, and the criteria used to identify the number of localities can all have a marked effect on conclusions regarding the existence and nature of putative environmental associations. We recommend that large data sets be explored for such associations at a variety of different taxonomic and temporal scales.
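The clade-versus-environment tests described above can be illustrated with a simple contingency-table analysis. The sketch below uses hypothetical occurrence counts, not values from the paper, and Fisher's exact test as one reasonable choice for a 2x2 occurrence table; the crude leave-one-out loop only gestures at the time-slicing and jackknifing robustness checks the authors describe.

```python
# Hypothetical clade-vs-environment association test; counts are placeholders.
import numpy as np
from scipy.stats import fisher_exact

# Rows: non-titanosaurs, titanosaurs; columns: coastal, inland localities.
table = np.array([[60, 25],    # hypothetical non-titanosaur occurrences
                  [20, 55]])   # hypothetical titanosaur occurrences

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}")

# Crude jackknife: drop one occurrence at a time and check whether the
# association survives, mirroring the robustness checks in the abstract.
def jackknife_cell(table, row, col):
    t = table.copy()
    t[row, col] -= 1          # remove one occurrence from this cell
    return fisher_exact(t)[1]

p_vals = [jackknife_cell(table, r, c)
          for r in range(2) for c in range(2) if table[r, c] > 0]
print("jackknifed p-value range:", min(p_vals), "-", max(p_vals))
```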


2015 ◽  
Vol 2015 ◽  
pp. 1-10 ◽  
Author(s):  
Yipu Zhang ◽  
Ping Wang

The new high-throughput technique ChIP-seq, which couples chromatin immunoprecipitation experiments with high-throughput sequencing, has extended the identification of transcription factor binding locations to genome-wide regions. However, most existing motif discovery algorithms are time-consuming and poorly suited to identifying binding motifs in ChIP-seq data, which is typically large in scale. To improve efficiency, we propose a fast cluster motif finding algorithm, named FCmotif, to identify (l, d) motifs in large-scale ChIP-seq data sets. It is inspired by the emerging-substring mining strategy: it first finds enriched substrings and then searches their neighborhood instances to construct position weight matrices (PWMs) and cluster motifs of different lengths. FCmotif is not constrained by the OOPS (one occurrence per sequence) model and can find long motifs. The effectiveness of the proposed algorithm was demonstrated in experiments on ChIP-seq data sets from mouse ES cells. Detection of the real binding motifs and processing of the full-size data of several megabytes finished in a few minutes. The experimental results show that FCmotif is advantageous for (l, d) motif finding in ChIP-seq data; it also demonstrates better performance than other widely used algorithms such as MEME, Weeder, ChIPMunk, and DREME.
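The core idea, enriched-substring seeding followed by a neighborhood search to build a PWM, can be sketched in a few lines. This is an illustrative simplification, not the FCmotif implementation; the sequences, lengths, and distance bound are arbitrary examples.

```python
# Minimal (l, d) motif-finding sketch: count l-mers, take the most
# enriched one as a seed, collect instances within Hamming distance d,
# and build a position weight matrix (PWM) from them.
from collections import Counter
import numpy as np

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def find_motif(seqs, l=8, d=2):
    counts = Counter(s[i:i+l] for s in seqs for i in range(len(s) - l + 1))
    seed = counts.most_common(1)[0][0]            # enriched substring
    # Neighborhood instances: all l-mers within Hamming distance d of seed.
    instances = [s[i:i+l] for s in seqs for i in range(len(s) - l + 1)
                 if hamming(s[i:i+l], seed) <= d]
    # PWM: per-position nucleotide frequencies over the instances.
    bases = "ACGT"
    pwm = np.zeros((4, l))
    for inst in instances:
        for pos, ch in enumerate(inst):
            pwm[bases.index(ch), pos] += 1
    return seed, pwm / pwm.sum(axis=0)

seqs = ["ACGTACGTGGACGTAC", "TTACGTGGACGTACGT", "GGACGTACACGTGGAC"]
seed, pwm = find_motif(seqs, l=6, d=1)
print("seed:", seed)
print("PWM:\n", np.round(pwm, 2))
```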


2021 ◽  
Vol 143 (11) ◽  
Author(s):  
Mohsen Faramarzi-Palangar ◽  
Behnam Sedaee ◽  
Mohammad Emami Niri

Abstract The correct definition of rock types plays a critical role in reservoir characterization, simulation, and field development planning. In this study, we use the critical pore size (linf) as an approach for reservoir rock typing. Two linf relations were derived separately from two permeability prediction models and then merged to derive a generalized linf relation. The proposed rock typing methodology comprises two main parts: in the first, we determine an appropriate constant coefficient, and in the second, we perform reservoir rock typing under two different scenarios. The first scenario forms groups of rocks using statistical analysis, and the second forms groups of rocks with similar capillary pressure curves. This approach was applied to three data sets: two were used to determine the constant coefficient, and one was used to demonstrate the applicability of the linf method in comparison with FZI for rock typing.
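The workflow can be pictured with a small numerical sketch. The paper's actual linf relations are not reproduced here; as a stand-in we assume a generic critical-pore-size form linf = c * sqrt(k / phi) with a hypothetical constant c, and the equal-width binning below is only a placeholder for the statistical grouping of the first scenario.

```python
# Illustrative rock-typing sketch under an assumed linf relation.
import numpy as np

def l_inf(k_md, phi, c=8.0):
    """Hypothetical critical pore size from permeability k (mD) and
    porosity phi (fraction); c stands in for the tuned constant."""
    return c * np.sqrt(k_md / phi)

# Hypothetical core-plug measurements.
k   = np.array([0.5, 2.0, 15.0, 80.0, 350.0, 1200.0])   # permeability, mD
phi = np.array([0.08, 0.10, 0.14, 0.18, 0.22, 0.25])    # porosity, frac

li = l_inf(k, phi)
# Scenario 1: form rock-type groups statistically; here we simply bin
# log10(linf) into equal-width classes as a placeholder.
edges = np.linspace(np.log10(li).min(), np.log10(li).max(), 4)
rock_type = np.digitize(np.log10(li), edges[1:-1])
for ki, pi, l, rt in zip(k, phi, li, rock_type):
    print(f"k={ki:7.1f} mD  phi={pi:.2f}  linf={l:7.1f}  rock type {rt}")
```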


2009 ◽  
Vol 2009 ◽  
pp. 1-10 ◽  
Author(s):  
Errol Strain ◽  
Florian Hahne ◽  
Ryan R. Brinkman ◽  
Perry Haaland

Flow cytometry (FCM) software packages from R/Bioconductor, such as flowCore and flowViz, serve as an open platform for the development of new analysis tools and methods. We created plateCore, a new package that extends the functionality of these core packages to enable automated negative-control-based gating and to simplify the processing and analysis of plate-based data sets from high-throughput FCM screening experiments. plateCore was used to analyze data from a BD FACS CAP screening experiment in which five peripheral blood mononuclear cell (PBMC) samples were assayed for 189 different human cell surface markers. The same data set was also analyzed manually by a cytometry expert using the FlowJo data analysis software package (TreeStar, USA). We show that the expression values for markers characterized using the automated approach in plateCore are in good agreement with those from FlowJo, and that using plateCore allows for more reproducible analyses of FCM screening data.
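plateCore itself is an R/Bioconductor package; the sketch below merely illustrates the negative-control-based gating idea in Python on simulated fluorescence intensities. The 99.5th-percentile threshold is an assumed convention for this example, not plateCore's documented default.

```python
# Negative-control-based gating on simulated intensities.
import numpy as np

rng = np.random.default_rng(0)
negative_control = rng.lognormal(mean=4.0, sigma=0.5, size=10_000)
test_well = np.concatenate([
    rng.lognormal(4.0, 0.5, 7_000),    # unstained-like background
    rng.lognormal(5.5, 0.4, 3_000),    # marker-positive population
])

# Gate: events above the 99.5th percentile of the matched negative control.
threshold = np.percentile(negative_control, 99.5)
percent_positive = 100.0 * np.mean(test_well > threshold)
print(f"gate at {threshold:.0f}; {percent_positive:.1f}% positive events")
```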


2016 ◽  
Vol 62 (8) ◽  
pp. 692-703 ◽  
Author(s):  
Gregory B. Gloor ◽  
Gregor Reid

A workshop held at the 2015 annual meeting of the Canadian Society of Microbiologists highlighted compositional data analysis methods and the importance of exploratory data analysis for the analysis of microbiome data sets generated by high-throughput DNA sequencing. A summary of the content of that workshop, a review of new methods of analysis, and information on the importance of careful analyses are presented herein. The workshop focussed on explaining the rationale behind the use of compositional data analysis, and a demonstration of these methods for the examination of 2 microbiome data sets. A clear understanding of bioinformatics methodologies and the type of data being analyzed is essential, given the growing number of studies uncovering the critical role of the microbiome in health and disease and the need to understand alterations to its composition and function following intervention with fecal transplant, probiotics, diet, and pharmaceutical agents.
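One of the standard compositional data analysis tools covered in such workshops is the centered log-ratio (clr) transform, sketched below. The taxon counts and the 0.5 pseudocount are illustrative choices, not data from the workshop.

```python
# Centered log-ratio (clr) transform of per-sample taxon counts.
import numpy as np

counts = np.array([[120,  30,  50,   0],    # sample 1 taxon counts
                   [ 80,  60,  10,  25]])   # sample 2 taxon counts

def clr(x, pseudocount=0.5):
    x = x + pseudocount                      # avoid log(0) for zero counts
    logx = np.log(x)
    return logx - logx.mean(axis=1, keepdims=True)

print(np.round(clr(counts), 2))
```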


Author(s):  
Sucitra Sahara ◽  
Rizqi Agung Permana ◽  
Hariyanto Hariyanto

Abstract: Computer viruses endanger individual computer users as well as companies that have implemented computerized systems. Virus programs designed for malicious purposes can damage specific parts of a computer; most damagingly, they can destroy a company's important data. Anti-virus software has been created in response, but its development always lags behind the viruses themselves, so the researchers select among anti-virus products on the basis of opinions: public comments from people who have used a particular anti-virus product, posted to online media such as the comment sections of product sales sites. Thousands of such comments are processed and grouped into positive and negative texts, and the data are classified using the k-Nearest Neighbor (k-NN) algorithm, which is well suited to this study. The researchers found that the k-NN algorithm can process data sets grouped into positive and negative texts, particularly for text selection, and that applying the Particle Swarm Optimization (PSO) method in combination with k-NN is expected to increase accuracy, making the results stronger and more valid. Using the k-NN method alone, before PSO optimization, the accuracy obtained was 70.50%, whereas combining the k-NN method with PSO optimization yielded an accuracy of 83.50%. It can be concluded that PSO optimization combined with the k-NN method is very well suited to text mining and the selection of text data sets.
Keywords: Review Analysis, k-Nearest Neighbor Method, Particle Swarm Optimization
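The pipeline described, text features, k-NN classification, and PSO-driven tuning, can be sketched as below. The review corpus, swarm size, and PSO constants are all hypothetical; here PSO optimizes per-feature weights, one plausible reading of how PSO can be combined with k-NN, rather than the authors' exact formulation.

```python
# TF-IDF + k-NN with a small particle swarm optimizing feature weights.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

reviews = ["great antivirus, caught everything", "slows my pc badly",
           "excellent protection and fast", "useless, missed a trojan",
           "light and reliable", "terrible support, constant popups"]
labels = np.array([1, 0, 1, 0, 1, 0])       # 1 = positive, 0 = negative

X = TfidfVectorizer().fit_transform(reviews).toarray()
rng = np.random.default_rng(0)

def fitness(w):
    knn = KNeighborsClassifier(n_neighbors=3)
    return cross_val_score(knn, X * w, labels, cv=3).mean()

# Tiny PSO over feature weights in [0, 1].
n_particles, dim = 10, X.shape[1]
pos = rng.random((n_particles, dim)); vel = np.zeros_like(pos)
pbest = pos.copy(); pbest_fit = np.array([fitness(p) for p in pos])
gbest = pbest[pbest_fit.argmax()].copy()
for _ in range(20):
    r1, r2 = rng.random((2, n_particles, dim))
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, 0.0, 1.0)
    fit = np.array([fitness(p) for p in pos])
    improved = fit > pbest_fit
    pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
    gbest = pbest[pbest_fit.argmax()].copy()

print(f"best cross-validated accuracy: {pbest_fit.max():.2f}")
```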


Author(s):  
Lauren Jones

This chapter reviews the history of prostitution law in Canada. It begins with a review of relevant literature on the history and policy of the sex trade in Canada, along with current laws and their enforcement. It then discusses two sources of data available for use in prostitution research in Canada: the Uniform Crime Reporting Survey, a data set that tracks crime and arrest information, and the Erotic Review (TER), a data set drawn from an online review website for sex professionals. These data sets are employed in descriptive analysis of the state of prostitution markets in Canada. The chapter also considers the challenges brought against Canadian prostitution law and concludes by suggesting potential research directions.


2002 ◽  
Vol 7 (4) ◽  
pp. 341-351 ◽  
Author(s):  
Michael F.M. Engels ◽  
Luc Wouters ◽  
Rudi Verbeeck ◽  
Greet Vanhoof

A data mining procedure for the rapid scoring of high-throughput screening (HTS) compounds is presented. The method is particularly useful for monitoring the quality of HTS data and tracking outliers in automated pharmaceutical or agrochemical screening, thus providing more complete and thorough structure-activity relationship (SAR) information. The method exploits the assumed relationship between the structure of the screened compounds and their biological activity on a given screen, expressed on a binary scale. Using a data mining method, a SAR description of the data is developed that assigns each compound in the screen a probability of being a hit. Then, an inconsistency score is computed that expresses the degree of deviation between the SAR description and the actual biological activity. The inconsistency score enables the identification of potential outliers that can be primed for validation experiments. The approach is particularly useful for detecting false-negative outliers and for identifying SAR-compliant hit/nonhit borderline compounds, both classes of compounds that can contribute substantially to the development and understanding of robust SARs. In a first implementation of the method, one- and two-dimensional descriptors are used to encode molecular structure information, and logistic regression is used to calculate hit/nonhit probability scores. The approach was validated on three data sets: the first publicly available, and the second and third from in-house HTS screening campaigns. Because of its simplicity, robustness, and accuracy, the procedure is suitable for automation.
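The inconsistency-score idea can be sketched directly: fit a logistic regression from descriptors to the binary screen outcome, then score each compound by how strongly its observed label deviates from the SAR-predicted hit probability. The descriptors and labels below are simulated, and the negative log-likelihood is one natural choice of deviation measure, not necessarily the authors' exact score.

```python
# Logistic-regression SAR model with a per-compound inconsistency score.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))               # descriptor matrix (simulated)
true_w = rng.normal(size=20)
p_true = 1 / (1 + np.exp(-X @ true_w))
y = (rng.random(500) < p_true).astype(int)   # binary hit / non-hit calls

model = LogisticRegression(max_iter=1000).fit(X, y)
p_hat = model.predict_proba(X)[:, 1]

# Inconsistency: negative log-likelihood of the observed label under the
# SAR model. Large values flag potential outliers, e.g. false negatives
# (y = 0 but p_hat near 1), which can be primed for validation.
inconsistency = -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
suspects = np.argsort(inconsistency)[-5:]
print("compounds to re-test:", suspects, inconsistency[suspects].round(2))
```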


2009 ◽  
Vol 14 (10) ◽  
pp. 1236-1244 ◽  
Author(s):  
Swapan Chakrabarti ◽  
Stan R. Svojanovsky ◽  
Romana Slavik ◽  
Gunda I. Georg ◽  
George S. Wilson ◽  
...  

Artificial neural networks (ANNs) are trained using high-throughput screening (HTS) data to recover active compounds from a large data set. Improved classification performance was obtained by combining the predictions made by multiple ANNs. The HTS data, acquired from a methionine aminopeptidase inhibition study, consisted of a library of 43,347 compounds, and the ratio of active to nonactive compounds, RA/N, was 0.0321. Back-propagation ANNs were trained and validated using principal components derived from the physicochemical features of the compounds. With carefully selected training parameters, an ANN recovers one-third of all active compounds from the validation set with a 3-fold gain in RA/N value. Further gains in RA/N were obtained by combining the predictions made by a number of ANNs. The generalization property of back-propagation ANNs was exploited by training ANNs on the same training samples after initializing them with different sets of random weights. As a result, only 10% of all available compounds were needed for training and validation, and the rest of the data set was screened with more than a 10-fold gain over the original RA/N value. Thus, ANNs trained with limited HTS data might become useful for recovering active compounds from large data sets.
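The ensemble strategy, several networks trained on the same PCA features but with different random initializations, predictions averaged, is sketched below. The data are simulated stand-ins for the compound library, and the network sizes are arbitrary examples.

```python
# PCA features + an ensemble of differently initialized MLPs.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(4000, 50))                     # physicochemical features
y = (X[:, :3].sum(axis=1) + rng.normal(0, 1, 4000) > 2.5).astype(int)

Z = PCA(n_components=10).fit_transform(X)           # principal components
# Mirror the abstract: only 10% of compounds used for training.
Z_tr, Z_va, y_tr, y_va = train_test_split(Z, y, test_size=0.9, random_state=0)

# Same training samples, different random weight initializations.
probs = np.mean([MLPClassifier(hidden_layer_sizes=(20,), max_iter=500,
                               random_state=seed).fit(Z_tr, y_tr)
                              .predict_proba(Z_va)[:, 1]
                 for seed in range(5)], axis=0)

top = np.argsort(probs)[::-1][: len(probs) // 10]   # screen top 10%
base_rate = y_va.mean()
gain = y_va[top].mean() / base_rate
print(f"active rate {base_rate:.3f}; {gain:.1f}-fold gain in top 10%")
```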


2020 ◽  
Vol 10 (13) ◽  
pp. 4509
Author(s):  
Matteo Bova ◽  
Matteo Massaro ◽  
Nicola Petrone

Bicycles and motorcycles are characterized by large rider-to-vehicle mass ratios, making estimation of the rider's inertia especially relevant. The total inertia can be derived from the body segment inertial properties (BSIP), which, in turn, can be obtained from the prediction/regression formulas available in the literature. Therefore, a parametric multibody three-dimensional rider model is devised in which the four most-used BSIP formulas (herein named Dempster, Reynolds-NASA, Zatsiorsky–DeLeva, and McConville–Young–Dumas, after their authors) are implemented. After an experimental comparison, the effects of the main posture parameters (i.e., torso inclination, knee distance, elbow distance, and rider height) are analyzed in three riding conditions (sport, touring, and scooter). It is found that the elbow distance has a minor effect on the location of the center of mass and the moments of inertia, while the effect of the knee distance is on the same order of magnitude as changing the BSIP data set. Torso inclination and rider height are the most relevant parameters. Tables with the coefficients necessary to populate the three-dimensional rider model with the four data sets considered are given. Typical inertial parameters of the whole rider are also given as a reference for those not wishing to implement the full multibody model.
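The way segment properties combine into a whole-rider inertia follows from the parallel-axis theorem, sketched below for a planar (single-axis) case for brevity; the paper's model is fully three-dimensional, and the segment masses, positions, and moments here are hypothetical placeholders, not values from any BSIP data set.

```python
# Whole-body inertia from segment data via the parallel-axis theorem.
import numpy as np

# (mass [kg], COM position [m], moment of inertia about own COM [kg m^2])
segments = [
    (40.0, np.array([0.00, 0.60]), 1.8),   # torso
    ( 8.0, np.array([0.15, 0.30]), 0.15),  # thigh
    ( 5.0, np.array([0.30, 0.95]), 0.10),  # arm
]

m_tot = sum(m for m, _, _ in segments)
com = sum(m * r for m, r, _ in segments) / m_tot

# Parallel-axis theorem: I = sum_i ( I_i + m_i * |r_i - com|^2 )
I_tot = sum(I + m * np.sum((r - com) ** 2) for m, r, I in segments)
print(f"total mass {m_tot:.1f} kg, COM {com.round(3)}, I {I_tot:.2f} kg m^2")
```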

