Folksonomy

2011 ◽ 
pp. 877-891
Author(s):  
Katrin Weller ◽  
Isabella Peters ◽  
Wolfgang G. Stock

This chapter discusses folksonomies as a novel way of indexing documents and locating information based on user-generated keywords. Folksonomies are considered from the point of view of knowledge organization and representation in the context of user collaboration within Web 2.0 environments. Folksonomies provide multiple benefits which make them a useful indexing method in various contexts; however, they also have a number of shortcomings that may hamper precise or exhaustive document retrieval. The position maintained is that folksonomies are a valuable addition to the traditional spectrum of knowledge organization methods, since they facilitate user input, stimulate active language use and timeliness, create opportunities for processing large data sets, and allow new ways of social navigation within document collections. Applications of folksonomies as well as recommendations for effective information indexing and retrieval are discussed.
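
As a minimal illustration of the indexing idea, not taken from the chapter itself, the sketch below builds an inverted index from user-assigned tags and retrieves documents by tag; the tag names and document identifiers are invented for the example.

```python
from collections import defaultdict

# Hypothetical user-assigned tags (a toy folksonomy); all names are invented.
taggings = {
    "doc1": ["web2.0", "tagging", "folksonomy"],
    "doc2": ["indexing", "folksonomy"],
    "doc3": ["ontology", "indexing"],
}

# Build an inverted index: tag -> set of documents carrying that tag.
index = defaultdict(set)
for doc, tags in taggings.items():
    for tag in tags:
        index[tag.lower()].add(doc)

def search(tag):
    """Return all documents that users have tagged with `tag`."""
    return sorted(index.get(tag.lower(), set()))

print(search("folksonomy"))  # ['doc1', 'doc2']
```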


Author(s):  
M. T. Wilson ◽  
J. Torres

There was a time, fortunately some years ago now, when to undertake rapid kinetic measurements using a stopped-flow spectrophotometer verged on the heroic. One needed to be armed with knowledge of amplifiers, light sources, oscilloscopes etc., and ideally one's credibility was greatly enhanced were one to build one's own instrument. Analysis of the data was similarly difficult. To obtain a single rate constant might involve a wide range of skills in addition to those required for the chemical/biochemical manipulation of the system and could easily include photography, developing prints and considerable mathematical agility. Now all this has changed and, from the point of view of the scientist attempting to solve problems through transient kinetic studies, a good thing too! Very high quality data can readily be obtained by anyone with a few hours' training and the ability to use a mouse and 'point and click' programs. Excellent stopped-flow spectrophotometers can be bought which are reliable, stable and sensitive, and which are controlled by computers able to signal-average and to analyse, in seconds, kinetic progress curves in a number of ways, yielding rate constants, amplitudes, residuals and statistics. Because it is now so easy, from the technical point of view, to make measurements, and to do so without an apprenticeship in kinetic methods, it becomes important to make sure that one collects data that are meaningful and open to sensible interpretation. There are a number of pitfalls to avoid. The emphasis of this article is, therefore, somewhat different to that of the article written by Eccleston (1) in an earlier volume of this series. Less time will be spent on consideration of the hardware, although the general principles are given; the focus will be on making sure that the data collected mean what one thinks they mean, and then on how to be sure one is extracting kinetic parameters from them in a sensible way. With the advent of powerful, fast computers it has now become possible to process very large data sets quickly, and this has paved the way for the application of 'rapid scan' devices (usually, but not exclusively, diode arrays), which allow complete spectra to be collected at very short time intervals during a reaction.
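
As a generic, hedged sketch of what "extracting kinetic parameters in a sensible way" can look like (not the procedure of this chapter), a single-exponential model A(t) = amplitude·exp(-k·t) + offset can be fitted to a progress curve and the residuals inspected; the synthetic trace below stands in for a real stopped-flow record.

```python
import numpy as np
from scipy.optimize import curve_fit

# Synthetic stopped-flow trace: single-exponential decay plus noise.
# In practice this would be the absorbance record from the instrument.
rng = np.random.default_rng(0)
t = np.linspace(0, 1.0, 500)            # time / s
signal = 0.4 * np.exp(-12.0 * t) + 0.1  # "true" k = 12 s^-1 (invented)
data = signal + rng.normal(0, 0.005, t.size)

def single_exp(t, amplitude, k, offset):
    """Single-exponential progress curve: amplitude*exp(-k*t) + offset."""
    return amplitude * np.exp(-k * t) + offset

params, cov = curve_fit(single_exp, t, data, p0=(0.5, 5.0, 0.0))
amplitude, k, offset = params
residuals = data - single_exp(t, *params)

print(f"fitted k = {k:.2f} s^-1, amplitude = {amplitude:.3f}")
print(f"rms residual = {np.sqrt(np.mean(residuals**2)):.4f}")
```

Inspecting the residuals, as the text stresses, is what tells one whether the single-exponential description is actually a sensible interpretation of the data.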


2013 ◽  
Vol 427-429 ◽  
pp. 2618-2621 ◽  
Author(s):  
Ling Shen ◽  
Qing Xi Peng

As emerging data-intensive applications receive more and more attention from researchers, near-duplicate text detection for large-scale data poses a severe challenge. This paper presents an algorithm based on MapReduce and ontology for near-duplicate text detection, which computes pairwise document similarity in large-scale document collections. We map the words in each document to their synonyms and then calculate the similarity between documents. MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. In large-scale tests, experimental results demonstrate that this approach outperforms other state-of-the-art solutions. Advantages such as linear running time and accuracy make the algorithm valuable in actual practice.
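
A minimal, single-process sketch of the map/reduce pattern described above (not the authors' implementation): the map step canonicalizes words via a synonym table and emits key/value pairs, the reduce step merges values by key, and a Jaccard score over the merged postings stands in for the paper's pairwise similarity. The synonym table and documents are invented.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical synonym table: each word is mapped to a canonical form.
SYNONYM = {"car": "automobile", "auto": "automobile", "fast": "quick"}

docs = {
    "d1": "the fast car",
    "d2": "the quick auto",
}

def map_phase(doc_id, text):
    """Emit (canonical_word, doc_id) pairs, mapping words to synonyms first."""
    for word in text.lower().split():
        yield SYNONYM.get(word, word), doc_id

def reduce_phase(pairs):
    """Merge all doc ids associated with the same intermediate key (word)."""
    postings = defaultdict(set)
    for word, doc_id in pairs:
        postings[word].add(doc_id)
    return postings

pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
postings = reduce_phase(pairs)

# Pairwise Jaccard similarity over each document's canonical word set.
words_of = defaultdict(set)
for word, ids in postings.items():
    for d in ids:
        words_of[d].add(word)

for a, b in combinations(sorted(words_of), 2):
    jaccard = len(words_of[a] & words_of[b]) / len(words_of[a] | words_of[b])
    print(a, b, jaccard)  # d1 d2 1.0 -> near-duplicates after synonym mapping
```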


2010 ◽  
Vol 49 (02) ◽  
pp. 141-147 ◽  
Author(s):  
S. Schulz ◽  
M. L. Müller ◽  
W. Dzeyk ◽  
L. Prinzen ◽  
E. J. Pacheco ◽  
...  

Summary. Objectives: The increasing amount of electronically available documents in bibliographic databases and in clinical documentation requires user-friendly techniques for content retrieval. Methods: A domain-specific approach to semantic text indexing for document retrieval is presented. It is based on a subword thesaurus and maps the content of texts in different European languages to a common interlingual representation, which supports searching across multilingual document collections. Results: Three use cases are presented in which the semantic retrieval method has been implemented: a bibliographic database, a departmental EHR system, and a consumer-oriented Web portal. Conclusions: It could be shown that a semantic indexing and retrieval approach, the performance of which had already been empirically assessed in prior studies, proved useful in different prototypical and routine scenarios and was well accepted by several user groups.
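
The following toy sketch only illustrates the interlingual idea; the paper's actual subword thesaurus is far richer. Terms from different languages are mapped to shared subword identifiers, so an English and a German document end up with the same index keys. The subword table and example terms are invented.

```python
# Toy subword thesaurus: surface strings from different languages mapped to
# language-independent subword identifiers. All entries here are invented.
SUBWORDS = {
    "card": "#heart", "herz": "#heart",
    "itis": "#inflammation", "entzünd": "#inflammation",
}

def interlingual_index(text):
    """Map a text to the set of interlingual subword identifiers it contains."""
    text = text.lower()
    return {sid for surface, sid in SUBWORDS.items() if surface in text}

doc_en = "carditis"            # English term
doc_de = "Herzentzündung"      # German term for the same concept

print(sorted(interlingual_index(doc_en)))                         # ['#heart', '#inflammation']
print(interlingual_index(doc_en) == interlingual_index(doc_de))   # True: cross-language match
```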


Author(s):  
K. G. Srinivasa ◽  
K. R. Venugopal ◽  
L. M. Patnaik

Efficient tools and algorithms for knowledge discovery in large data sets have been devised in recent years. These methods exploit the capability of computers to search huge amounts of data in a fast and effective manner. However, the data to be analyzed are often imprecise and afflicted with uncertainty. In the case of heterogeneous data sources such as text, audio and video, the data may moreover be ambiguous and partly conflicting. Besides, patterns and relationships of interest are usually vague and approximate. Thus, making the information mining process more robust, or, one might say, more human-like in its searching and learning, requires tolerance towards imprecision, uncertainty and exceptions; such methods must offer approximate reasoning capabilities and be capable of handling partial truth. Properties of this kind are typical of soft computing. Soft computing techniques like Genetic Algorithms (GA), Artificial Neural Networks, Fuzzy Logic, Rough Sets and Support Vector Machines (SVM) have been found to be effective when used in combination. Therefore, soft computing algorithms are used to accomplish data mining across different applications (Mitra S, Pal S K & Mitra P, 2002; Alex A Freitas, 2002). Extensible Markup Language (XML) is emerging as a de facto standard for information exchange among various applications of the World Wide Web due to XML's inherent self-describing capacity and flexibility in organizing data. In an XML representation, the semantics are associated with the contents of the document by making use of self-describing tags which can be defined by the users. Hence XML can be used as a medium for interoperability over the Internet. With these advantages, the amount of data that is being published on the Web in the form of XML is growing enormously, and many naïve users find the need to search over large XML document collections (Gang Gou & Rada Chirkova, 2007; Luk R et al., 2000).
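
As a small, self-contained illustration of searching over self-describing XML (not tied to the soft-computing methods surveyed above), the standard-library sketch below parses a toy document and selects elements by tag and attribute; the tags and content are invented.

```python
import xml.etree.ElementTree as ET

# Toy XML document with user-defined, self-describing tags (invented content).
xml_doc = """
<library>
  <book genre="data-mining">
    <title>Soft Computing for Knowledge Discovery</title>
    <year>2005</year>
  </book>
  <book genre="databases">
    <title>Querying XML Collections</title>
    <year>2007</year>
  </book>
</library>
"""

root = ET.fromstring(xml_doc)

# Select every <book> whose genre attribute is "data-mining" and print its title.
for book in root.findall("book[@genre='data-mining']"):
    print(book.findtext("title"), book.findtext("year"))
```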


2016 ◽  
pp. 748-771
Author(s):  
Xiaobing Huang ◽  
Tian Zhao ◽  
Yu Cao

Multimedia Information Retrieval (MIR) is a problem domain that includes programming tasks such as salient feature extraction, machine learning, indexing, and retrieval. There are a variety of implementations and algorithms for these tasks in different languages and frameworks, which are difficult to compose and reuse due to interface and language incompatibility. Because of this low reusability, researchers often have to implement their experiments from scratch, and the resulting programs cannot be easily adapted to parallel and distributed execution, which is important for handling large data sets. In this paper, we present Pipeline Information Retrieval (PIR), a Domain Specific Language (DSL) for multi-modal feature manipulation. The goal of PIR is to unify the MIR programming tasks by hiding the programming details under a flexible layer of domain-specific interfaces. PIR optimizes the MIR tasks by compiling the DSL programs into pipeline graphs, which can be executed using a variety of strategies (e.g. sequential, parallel, or distributed execution). The authors evaluated the performance of PIR applications on a single machine with multiple cores, on a local cluster, and on the Amazon Elastic Compute Cloud (EC2) platform. The results show that PIR programs can greatly help MIR researchers and developers perform fast prototyping in a single-machine environment and achieve good scalability on distributed platforms.
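
PIR's concrete syntax is not shown in the abstract, so the sketch below only illustrates the general idea of composing processing stages into a pipeline whose execution strategy can be swapped (sequential versus parallel); every function name and item is hypothetical and is not PIR's actual API.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pipeline stages; in a real MIR system these would wrap
# feature extraction, indexing, and retrieval components.
def extract_feature(item):
    return f"feature({item})"

def index_feature(feature):
    return f"indexed({feature})"

def pipeline(item):
    """A two-stage linear pipeline applied to one input item."""
    return index_feature(extract_feature(item))

items = ["img1.jpg", "img2.jpg", "img3.jpg"]

# Sequential execution strategy.
sequential = [pipeline(x) for x in items]

# Parallel execution strategy over the same pipeline, using worker threads.
with ThreadPoolExecutor(max_workers=3) as pool:
    parallel = list(pool.map(pipeline, items))

print(sequential == parallel)  # True: same results, different execution strategy
```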


Author(s):  
John A. Hunt

Spectrum-imaging is a useful technique for comparing different processing methods on very large data sets which are identical for each method. This paper is concerned with comparing methods of electron energy-loss spectroscopy (EELS) quantitative analysis on the Al-Li system. The spectrum-image analyzed here was obtained from an Al-10at%Li foil aged to produce δ' precipitates that can span the foil thickness. Two 1024-channel EELS spectra offset in energy by 1 eV were recorded and stored at each pixel in the 80x80 spectrum-image (25 Mbytes). An energy range of 39-89 eV (20 channels/eV) is represented. During processing the spectra are either subtracted to create an artifact-corrected difference spectrum, or the energy offset is numerically removed and the spectra are added to create a normal spectrum. The spectrum-images are processed into 2D floating-point images using methods and software described in [1].
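
A hedged numpy sketch of the two processing routes mentioned above (subtraction versus shift-and-add), using synthetic spectra in place of the real 80x80 spectrum-image; the 20-channel shift follows from the stated 1 eV offset at 20 channels/eV.

```python
import numpy as np

CHANNELS = 1024
CHANNELS_PER_EV = 20
SHIFT = 1 * CHANNELS_PER_EV          # 1 eV offset = 20 channels

# Synthetic stand-ins for the two spectra recorded at one pixel,
# offset in energy by 1 eV relative to each other.
energy = np.arange(CHANNELS) / CHANNELS_PER_EV + 39.0   # roughly the 39-89 eV range
spectrum_a = np.exp(-((energy - 60.0) ** 2) / 10.0)
spectrum_b = np.roll(spectrum_a, SHIFT)                  # offset copy (toy data)

# Route 1: subtract the offset spectra -> artifact-corrected difference spectrum.
difference_spectrum = spectrum_a - spectrum_b

# Route 2: numerically remove the energy offset, then add -> normal spectrum.
# np.roll wraps around; real processing would discard or mask the edge channels.
realigned_b = np.roll(spectrum_b, -SHIFT)
normal_spectrum = spectrum_a + realigned_b

print(difference_spectrum.shape, normal_spectrum.shape)
```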


Author(s):  
Thomas W. Shattuck ◽  
James R. Anderson ◽  
Neil W. Tindale ◽  
Peter R. Buseck

Individual particle analysis involves the study of tens of thousands of particles using automated scanning electron microscopy and elemental analysis by energy-dispersive X-ray emission spectroscopy (EDS). EDS produces large data sets that must be analyzed using multivariate statistical techniques. A complete study uses cluster analysis, discriminant analysis, and factor or principal components analysis (PCA). The three techniques are used in the study of particles sampled during the FeLine cruise to the mid-Pacific Ocean in the summer of 1990. The mid-Pacific aerosol provides information on long-range particle transport, iron deposition, sea salt ageing, and halogen chemistry.

Aerosol particle data sets suffer from a number of difficulties for pattern recognition using cluster analysis. There is a great disparity in the number of observations per cluster and in the range of the variables in each cluster. The variables are not normally distributed, they are subject to considerable experimental error, and many values are zero because of finite detection limits. Many of the clusters show considerable overlap because of natural variability, agglomeration, and chemical reactivity.
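
As a generic, hedged sketch of the multivariate workflow named above (not the authors' actual analysis), PCA followed by k-means clustering can be applied to a toy particle-by-element intensity matrix; the element list and data are invented, and standardization addresses the disparity in variable ranges noted in the text.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy matrix: rows are particles, columns are EDS element intensities (invented).
rng = np.random.default_rng(1)
elements = ["Na", "Mg", "Si", "S", "Cl", "Fe"]
sea_salt = rng.normal([5, 1, 0, 1, 6, 0], 0.5, size=(50, 6))
mineral_dust = rng.normal([0, 1, 6, 0, 0, 3], 0.5, size=(20, 6))
particles = np.vstack([sea_salt, mineral_dust])

# Standardize, reduce with PCA, then cluster the principal-component scores.
scaled = StandardScaler().fit_transform(particles)
scores = PCA(n_components=2).fit_transform(scaled)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores)

print(np.bincount(labels))  # cluster sizes, e.g. [50 20] (order may vary)
```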


1976 ◽  
Vol 15 (01) ◽  
pp. 36-42 ◽  
Author(s):  
J. Schlörer

From a statistical data bank containing only anonymous records, the records sometimes may be identified and then retrieved, as personal records, by on-line dialogue. The risk mainly applies to statistical data sets representing populations, or samples with a high ratio n/N. On the other hand, access controls are unsatisfactory as a general means of protection for statistical data banks, which should be open to large user communities. A threat monitoring scheme is proposed which will largely block the techniques for retrieval of complete records. If combined with additional measures (e.g., slight modifications of output), it may be expected to render, from a cost-benefit point of view, intrusion attempts by dialogue valueless, if not absolutely impossible. The bona fide user has to pay with some loss of information, but considerable flexibility in evaluation is retained. The proposal of controlled classification included in the scheme may also be useful for off-line dialogue systems.
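
The paper's actual threat-monitoring scheme is not spelled out in this abstract; purely as an illustration of the kind of output control involved, the sketch below refuses statistical answers computed over very small query sets, one classic way to hinder retrieval of individual records. The records, predicate names, and threshold are all assumptions.

```python
MIN_QUERY_SET_SIZE = 5   # assumed threshold; real schemes tune and combine controls

# Toy anonymous records: (age_group, region, income). Values invented.
records = [
    ("30-39", "north", 42000),
    ("30-39", "south", 38000),
    ("40-49", "north", 51000),
    ("40-49", "north", 47000),
    ("50-59", "south", 60000),
    ("30-39", "north", 45000),
]

def mean_income(predicate):
    """Answer a statistical query only if enough records match it."""
    matches = [income for r in records if predicate(r) for income in (r[2],)]
    if len(matches) < MIN_QUERY_SET_SIZE:
        raise PermissionError("query set too small; answer withheld")
    return sum(matches) / len(matches)

# A broad query is answered; a narrow one that would isolate a person is blocked.
print(mean_income(lambda r: True))
try:
    print(mean_income(lambda r: r[0] == "50-59" and r[1] == "south"))
except PermissionError as e:
    print("blocked:", e)
```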

