An Efficient Similarity Search in Large Data Collections with MapReduce

Author(s):  
Trong Nhan Phan ◽  
Josef Küng ◽  
Tran Khanh Dang


Author(s):  
Hans-Peter Kriegel ◽  
Peer Kröger ◽  
Martin Pfeifle ◽  
Stefan Brecheisen ◽  
Marco Pötke ◽  
...  

Similarity search in database systems is becoming an increasingly important task in modern application domains such as multimedia, molecular biology, medical imaging, and many others. Especially for CAD (Computer-Aided Design), suitable similarity models and a clear representation of the results can help to reduce the cost of developing and producing new parts by maximizing the reuse of existing parts. In this chapter, we present different similarity models for voxelized CAD data based on space partitioning and data partitioning. Based on these similarity models, we introduce an industrial prototype, called BOSS, which helps the user to get an overview of a set of CAD objects. BOSS allows the user to easily browse large data collections by graphically displaying the results of a hierarchical clustering algorithm. This representation is well suited for the evaluation of similarity models and to aid an industrial user searching for similar parts.
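As a rough illustration of the idea behind such voxel-based similarity models (a minimal sketch, not the chapter's models or the BOSS prototype; grid sizes and part shapes are invented), the following Python snippet compares two voxelized parts with a volume-overlap measure and a simple space-partitioning feature vector:

```python
import numpy as np

def volume_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Jaccard-style overlap of two boolean voxel grids of equal shape."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 1.0

def partition_features(voxels: np.ndarray, cells: int = 4) -> np.ndarray:
    """Space-partitioning feature vector: filled-voxel fraction per grid cell."""
    x, y, z = voxels.shape
    feats = []
    for i in range(cells):
        for j in range(cells):
            for k in range(cells):
                block = voxels[i*x//cells:(i+1)*x//cells,
                               j*y//cells:(j+1)*y//cells,
                               k*z//cells:(k+1)*z//cells]
                feats.append(block.mean())
    return np.array(feats)

# Two toy 16x16x16 "parts": a solid block and a slightly shifted copy.
a = np.zeros((16, 16, 16), dtype=bool); a[2:10, 2:10, 2:10] = True
b = np.zeros((16, 16, 16), dtype=bool); b[3:11, 3:11, 3:11] = True

print(volume_similarity(a, b))                # direct volume overlap
print(np.linalg.norm(partition_features(a) - partition_features(b)))  # feature distance
```

Feature vectors of this kind can then be fed to a hierarchical clustering algorithm to produce the sort of browsable overview BOSS displays.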


Author(s):  
Saranya N. ◽  
Saravana Selvam

After an era of managing data collection difficulties, the issue has now turned into the problem of how to process these vast amounts of information. Scientists and researchers consider Big Data to be one of the most essential topics in computing science today. The term Big Data describes huge volumes of data that can exist in any structure, which makes it difficult for standard processing approaches to mine useful information from such large data sets. Classification in Big Data is a procedure of summarizing data sets based on shared patterns, and there are distinct classification frameworks that help us classify data collections. The methods discussed in the chapter include Multi-Layer Perceptron, Linear Regression, C4.5, CART, J48, SVM, ID3, Random Forest, and KNN. The target of this chapter is to provide a comprehensive evaluation of commonly used classification methods.
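A minimal sketch of such a comparison, using scikit-learn on the classic Iris data (an assumption of tooling on our part; the chapter itself is tool-agnostic, and scikit-learn's DecisionTreeClassifier stands in for C4.5/J48/CART/ID3, which it does not implement verbatim):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Decision tree": DecisionTreeClassifier(),           # stand-in for C4.5/J48/CART/ID3
    "Random forest": RandomForestClassifier(n_estimators=100),
    "SVM": SVC(kernel="rbf"),
    "MLP": MLPClassifier(max_iter=2000),                 # multi-layer perceptron
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)          # 5-fold cross-validated accuracy
    print(f"{name:14s} {scores.mean():.3f} +/- {scores.std():.3f}")
```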


2019 ◽  
Vol 56 (4) ◽  
pp. 604-614 ◽  
Author(s):  
Yuri M Zhukov ◽  
Christian Davenport ◽  
Nadiya Kostyuk

Researchers today have access to an unprecedented amount of geo-referenced, disaggregated data on political conflict. Because these new data sources use disparate event typologies and units of analysis, findings are rarely comparable across studies. As a result, we are unable to answer basic questions like ‘what does conflict A tell us about conflict B?’ This article introduces xSub, a ‘database of databases’ for disaggregated research on political conflict (www.x-sub.org). xSub reduces barriers to comparative subnational research by empowering researchers to quickly construct custom, analysis-ready datasets. xSub currently features subnational data on conflict in 156 countries, from 21 sources, including large data collections and data from individual scholars. To facilitate comparisons across countries and sources, xSub organizes these data into consistent event categories, actors, spatial units (country, province, district, grid cell, electoral constituency), and time units (year, month, week, and day). This article introduces xSub and illustrates its potential, by investigating the impact of repression on dissent across thousands of subnational datasets.
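A sketch of the kind of harmonization xSub performs, in pandas; the column names, source typologies, and category scheme below are invented for illustration and do not reflect xSub's actual schema:

```python
import pandas as pd

# Hypothetical raw event records from two sources with different typologies.
source_a = pd.DataFrame({
    "date": ["2014-03-02", "2014-03-15"],
    "province": ["Donetsk", "Luhansk"],
    "event_type": ["riot", "armed_clash"],
})
source_b = pd.DataFrame({
    "date": ["2014-03-07"],
    "province": ["Donetsk"],
    "event_type": ["BATTLE"],
})

# Map each source's typology onto one consistent category scheme.
harmonize = {"riot": "protest", "armed_clash": "violence", "BATTLE": "violence"}

events = pd.concat([source_a, source_b], ignore_index=True)
events["category"] = events["event_type"].map(harmonize)
events["month"] = pd.to_datetime(events["date"]).dt.to_period("M")

# Analysis-ready panel: event counts per province-month-category.
panel = (events.groupby(["province", "month", "category"])
               .size().rename("events").reset_index())
print(panel)
```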


1998 ◽  
Vol 54 (6) ◽  
pp. 1178-1182 ◽  
Author(s):  
Manfred Hendlich

Recent advances in experimental techniques have led to an enormous explosion of available data about protein–ligand complexes. To exploit the information hidden in these large data collections, tools for managing and accessing them are needed. This paper discusses databases for protein–ligand data which are accessible via the World Wide Web. A strong focus is placed on the ReLiBase database system, a new three-dimensional database for storing and analysing structures of protein–ligand complexes currently deposited in the Brookhaven Protein Data Bank (PDB). ReLiBase contains efficient query tools for identifying and analysing ligands and protein–ligand complexes. Its application for structure-based drug design is illustrated.
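As a toy stand-in for such query tools (a minimal sketch, not ReLiBase's actual interface), the following snippet lists the non-water ligands in a local PDB file, relying only on the fixed-column PDB record layout:

```python
from collections import Counter

def list_ligands(pdb_path: str) -> Counter:
    """Count distinct ligand residues (HETATM records) in a PDB file, skipping water."""
    ligands = set()
    with open(pdb_path) as f:
        for line in f:
            if line.startswith("HETATM"):
                res_name = line[17:20].strip()   # residue name, PDB columns 18-20
                chain = line[21]                 # chain identifier, column 22
                res_seq = line[22:26].strip()    # residue sequence number, columns 23-26
                if res_name != "HOH":            # ignore water
                    ligands.add((res_name, chain, res_seq))
    return Counter(name for name, _, _ in ligands)

# Example (hypothetical local file): print(list_ligands("1abc.pdb"))
```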


2012 ◽  
Vol 9 (7) ◽  
pp. 8039-8073
Author(s):  
T. Tanhua ◽  
R. F. Keeling

Abstract. Increasing concentrations of dissolved inorganic carbon (DIC) in the interior ocean are expected as a direct consequence of increasing concentrations of CO2 in the atmosphere. This extra DIC is often referred to as anthropogenic carbon (Cant), and its inventory, or increase rate, in the interior ocean has previously been estimated by a multitude of observational approaches. Each of these methods is associated with hard-to-test assumptions, since Cant cannot be directly observed. Results from a simpler concept with few assumptions applied to the Atlantic Ocean are reported here, using two large data collections of carbon-relevant bottle data. The change in column inventory on decadal time scales, i.e. the storage rate, of DIC, respiration-compensated DIC and oxygen is calculated for the Atlantic Ocean. The average storage rates for DIC and oxygen are calculated to be 0.72 ± 1.22 (95% confidence interval of the mean trend: 0.65–0.78) mol m−2 yr−1 and −0.54 ± 1.64 (95% confidence interval of the mean trend: −0.64–(−0.45)) mol m−2 yr−1, respectively, for the Atlantic Ocean, where the uncertainties reflect station-to-station variability and where the mean trends are non-zero at the 95% confidence level. The standard deviation mainly reflects uncertainty due to regional variations, whereas the confidence interval reflects the mean trend. The storage rates are similar to changes found by other studies, although with large uncertainty. For the subpolar North Atlantic, the storage rates show significant temporal variation in all variables. This seems to be due to variations in the prevalence of subsurface water masses with different DIC concentrations, leading at times to different signs of the storage rates for DIC and Cant. This study suggests that accurate assessment of the uptake of CO2 by the oceans will require accounting not only for processes that influence Cant but also for additional processes that modify CO2 storage.
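A simplified, single-station version of the storage-rate calculation (inventory numbers below are invented; the study works basin-wide across many stations): the storage rate is the slope of a linear trend in column inventory, with a 95% confidence interval derived from the regression standard error.

```python
import numpy as np
from scipy import stats

# Hypothetical DIC column inventories (mol m-2) at one station, by survey year.
years = np.array([1990.0, 1997.0, 2003.0, 2010.0])
inventory = np.array([2950.0, 2955.1, 2959.0, 2964.8])

# Storage rate = slope of the linear trend (mol m-2 yr-1).
fit = stats.linregress(years, inventory)
t = stats.t.ppf(0.975, df=len(years) - 2)          # two-sided 95% quantile
lo, hi = fit.slope - t * fit.stderr, fit.slope + t * fit.stderr
print(f"storage rate: {fit.slope:.3f} mol m-2 yr-1 (95% CI {lo:.3f} to {hi:.3f})")
```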


English Today ◽  
2015 ◽  
Vol 32 (2) ◽  
pp. 24-30 ◽  
Author(s):  
Reinhard Heuberger

English lexicography is undergoing a transformation so profound that both dictionary makers and users need new strategies to cope with the challenges of today's technologies and to take full advantage of their potential. Rundell has rightly stated that dictionaries have finally found their ideal platform in the electronic medium (2012: 15), which provides quicker and more sophisticated access to large data collections that are no longer subject to space restrictions. But the innovations go far beyond storage space and ease of access: customization, hybridization and user-input are amongst the most promising trends in electronic lexicography. Customization means that dictionaries can be adaptable, i.e. manually customized by the user, or even adaptive, i.e. automatically adapted to users’ needs on the basis of their behaviour (Granger, 2012: 4). Paquot lists genre, domain as well as L1 as examples of fruitful areas for customization (2012: 185). In the electronic medium, the barriers between different language resources such as dictionaries, encyclopaedias, databases, writing aids and translation tools are disappearing, a development referred to as hybridization (Granger, 2012: 4). And the concept of user-input is exemplified by the well-known platforms Wiktionary and Urban Dictionary, both of which are online reference works based on contributions by users.


Big data marks a major turning point in the use of data and is a powerful vehicle for growth and profitability. A comprehensive understanding of a company's data and its potential can be a new vector for performance. It must be recognized that without adequate analysis, our data are just an unusable raw material. In this context, traditional data processing tools cannot support such an explosion of volume; they cannot respond to new needs in a timely manner and at a reasonable cost. Big data is a broad term generally referring to very large data collections that complicate the analytics tools needed to harness and manage them. This chapter details what big data analysis is, presents the development of its applications, and examines the important changes that have reshaped the analytics context.


2003 ◽  
pp. 200-221 ◽  
Author(s):  
Mirek Riedewald ◽  
Divyakant Agrawal ◽  
Amr El Abbadi

Data cubes are ubiquitous tools in data warehousing, online analytical processing, and decision support applications. Based on a selection of pre-computed and materialized aggregate values, they can dramatically speed up aggregation and summarization over large data collections. Traditionally, the emphasis has been on lowering query costs with little regard to maintenance, i.e., update costs. We argue that current trends require data cubes to be not only query-efficient but also dynamic, and we show how this can be achieved. Several array-based techniques with different tradeoffs between query and update cost are discussed in detail. We also survey selected approaches for sparse data and the popular data cube operator, CUBE. Moreover, this work includes an overview of future trends and their impact on data cubes.
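One classic array-based technique with exactly this tradeoff is the prefix-sum cube: any range-sum query costs O(1), but a single cell update forces the pre-computed aggregates to be rebuilt. A minimal 2-D sketch with toy data (not any particular system's implementation):

```python
import numpy as np

class PrefixSumCube:
    """2-D prefix-sum cube: O(1) range-sum queries, expensive point updates."""

    def __init__(self, data: np.ndarray):
        self.data = data.astype(float)
        self._rebuild()

    def _rebuild(self):
        self.prefix = self.data.cumsum(axis=0).cumsum(axis=1)

    def range_sum(self, r1, c1, r2, c2):
        """Sum of data[r1:r2+1, c1:c2+1] via inclusion-exclusion."""
        total = self.prefix[r2, c2]
        if r1 > 0: total -= self.prefix[r1 - 1, c2]
        if c1 > 0: total -= self.prefix[r2, c1 - 1]
        if r1 > 0 and c1 > 0: total += self.prefix[r1 - 1, c1 - 1]
        return total

    def update(self, r, c, value):
        self.data[r, c] = value
        self._rebuild()          # worst case: the whole prefix array is redone

sales = np.arange(16).reshape(4, 4)     # toy measure: 4 products x 4 months
cube = PrefixSumCube(sales)
print(cube.range_sum(1, 1, 2, 3))       # aggregate over a sub-range in O(1)
```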


Author(s):  
Pushpa Mannava

Data mining is a vital procedure for discovering new, valid, useful, and understandable patterns in data. Integrating data mining methods into cloud computing provides a flexible and scalable architecture that can be used for efficient mining of very large quantities of data from virtually integrated data sources, with the goal of producing information that is useful in decision making. The process of extracting hidden, useful patterns and information from big data is called big data analytics; it is performed by applying advanced analytics techniques to large data collections. This paper provides information about big data analytics in intra-data-center networks, the components of data mining, and data mining techniques.
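As a toy illustration of the map/reduce pattern behind much of this kind of analytics (partitions simulated in one process here; real deployments distribute them across cloud nodes):

```python
from collections import Counter
from functools import reduce

# Toy "partitions", standing in for data spread across nodes.
partitions = [
    ["login", "purchase", "login"],
    ["purchase", "refund"],
    ["login", "purchase", "purchase"],
]

def map_phase(partition):
    """Count events locally on each (simulated) node."""
    return Counter(partition)

def reduce_phase(a, b):
    """Merge partial counts into one global result."""
    return a + b

totals = reduce(reduce_phase, map(map_phase, partitions), Counter())
print(totals.most_common())   # frequent patterns across all sources
```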

