Living Multiples: How Large-scale Scientific Data-mining Pursues Identity and Differences

2013 ◽  
Vol 30 (4) ◽  
pp. 72-91 ◽  
Author(s):  
Adrian Mackenzie ◽  
Ruth McNally

2002 ◽  
Vol 14 (4) ◽  
pp. 731-749 ◽  
Author(s):  
Xiong Wang ◽  
J.T.L. Wang ◽  
D. Shasha ◽  
B.A. Shapiro ◽  
I. Rigoutsos ◽  
...  

Author(s):  
Rahul Ramachandran ◽  
Sara Graves ◽  
John Rushing ◽  
Ken Keizer ◽  
Manil Maskey ◽  
...  

2017 ◽  
Author(s):  
Joon-Yong Lee ◽  
Grant M. Fujimoto ◽  
Ryan Wilson ◽  
H. Steven Wiley ◽  
Samuel H. Payne

Abstract: Identifying similarities between datasets is a fundamental task in data mining and has become an integral part of modern scientific investigation. Whether the task is to identify co-expressed genes in large-scale expression surveys or to predict combinations of gene knockouts which would elicit a similar phenotype, the underlying computational task is often a multi-dimensional similarity test. As datasets continue to grow, improvements to the efficiency, sensitivity or specificity of such computation will have broad impacts, as they allow scientists to more completely explore the wealth of scientific data. A significant practical drawback of large-scale data mining is that the vast majority of pairwise comparisons are unlikely to be relevant, meaning that they do not share a signature of interest. It is therefore essential to identify these unproductive comparisons as rapidly as possible and exclude them from more time-intensive similarity calculations. The Blazing Signature Filter (BSF) is a highly efficient pairwise similarity algorithm which enables extensive data mining within a reasonable amount of time. The algorithm transforms datasets into binary metrics, allowing it to use computationally efficient bit operators and provide a coarse measure of similarity. As a result, the BSF can scale to high dimensionality and rapidly filter unproductive pairwise comparisons. Two bioinformatics applications of the tool are presented to demonstrate the ability to scale to billions of pairwise comparisons and the usefulness of this approach.
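
The general idea in this abstract (binarize each profile, then compare profiles with bit operators to discard unproductive pairs early) can be illustrated with a minimal Python sketch. This is not the BSF implementation. The function names, the fixed threshold, and the shared-bit cutoff are all assumptions made for illustration, and the sample numbers are invented.

```python
from itertools import combinations


def to_signature(values, threshold):
    """Pack a numeric profile into an int bitmask.

    Bit i is set when values[i] >= threshold. The single fixed threshold
    is an assumption of this sketch, not something the abstract specifies.
    """
    bits = 0
    for i, value in enumerate(values):
        if value >= threshold:
            bits |= 1 << i
    return bits


def shared_bits(a, b):
    """Coarse similarity: number of positions set in both signatures."""
    return bin(a & b).count("1")


def filter_pairs(signatures, min_shared):
    """Return index pairs whose signatures share at least min_shared bits.

    Pairs below the cutoff are dropped here, so a slower and more precise
    similarity measure would only need to run on the survivors.
    """
    kept = []
    for (i, a), (j, b) in combinations(enumerate(signatures), 2):
        if shared_bits(a, b) >= min_shared:
            kept.append((i, j))
    return kept


# Hypothetical expression-like profiles (made-up values).
profiles = [
    [2.1, 0.1, 3.0, 0.0],
    [1.9, 0.2, 2.5, 0.1],
    [0.0, 2.2, 0.1, 1.8],
]
signatures = [to_signature(p, threshold=1.0) for p in profiles]
candidates = filter_pairs(signatures, min_shared=2)
```

Because each comparison reduces to one AND and one bit count on packed integers, the cost per pair stays small even when profiles have many dimensions, which is the property the abstract attributes to bit-level filtering.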


2021 ◽  
Vol 8 ◽  
Author(s):  
Majid Jaberi-Douraki ◽  
Soudabeh Taghian Dinani ◽  
Nuwan Indika Millagaha Gedara ◽  
Xuan Xu ◽  
Emily Richards ◽  
...  

Extra-label drug use in food animal medicine is authorized by the US Animal Medicinal Drug Use Clarification Act (AMDUCA), and estimated withdrawal intervals are based on published scientific pharmacokinetic data. Occasionally there is a paucity of scientific data on which to base a withdrawal interval or a large number of animals being treated, driving the need to test for drug residues. Rapid assay commercial farm-side tests are essential for monitoring drug residues in animal products to protect human health. Active ingredients, sensitivity, matrices, and species that have been evaluated for commercial rapid assay tests are typically reported on manufacturers' websites or in PDF documents that are available to consumers but may require a special access request. Additionally, this information is not always correlated with FDA-approved tolerances. Furthermore, parameter changes for these tests can be very challenging to identify regularly, especially those listed on websites or in documents that are not publicly available. Therefore, artificial intelligence plays a critical role in efficiently extracting the data and ensuring that the information is current. Extracting tables from PDF and HTML documents has been investigated by both academia and commercial tool builders. Research in text mining of such documents has become a widespread yet challenging arena in implementing natural language programming. However, techniques for extracting tables are still in their infancy and are being investigated and improved by researchers. In this study, we developed and evaluated a data-mining method for automatically extracting rapid assay data from electronic documents. Our automatic electronic data extraction method includes a software package module, a developed pattern recognition tool, and a data mining engine. Assay details were provided by several commercial entities that produce these rapid drug residue assay tests.
During this study, we developed a real-time conversion system and method for reflowing contents in these files for accessibility practice and research data mining. Embedded information was extracted using an AI technology for text extraction and text mining to convert to structured formats. These data were then made available to veterinarians and producers via an online interface, allowing interactive searching and also presenting the commercial test assay parameters in reference to FDA-approved tolerances.
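
As a rough illustration of the simplest case mentioned in this abstract, converting a table embedded in an HTML document into structured records, the following Python sketch uses only the standard library's `html.parser`. It is not the authors' method, which the abstract describes only at a high level. The class name, function name, column headings, and sample values are all hypothetical, and the sketch assumes a single, well-formed table whose first row is a header.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html):
    """Return one dict per body row, keyed by the header row's cell text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical input: invented column names and values, for illustration only.
SAMPLE = """
<table>
  <tr><th>Analyte</th><th>Matrix</th><th>Sensitivity (ppb)</th></tr>
  <tr><td>Example drug A</td><td>milk</td><td>5</td></tr>
  <tr><td>Example drug B</td><td>urine</td><td>10</td></tr>
</table>
"""
records = extract_rows(SAMPLE)
```

Real manufacturer pages and PDF files are far less regular than this input (merged cells, nested tables, tables rendered as positioned text), which is consistent with the abstract's remark that table extraction techniques are still maturing.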

