Aras

Author(s):  
Mohammadhossein Barkhordari ◽  
Mahdi Niamanesh

Because of the high rate of data growth and the need for data analysis, data warehouse management for big data is an important issue. Single-node solutions cannot manage such large amounts of information, so data must be distributed over multiple hardware nodes. However, distributing data across nodes means that each node may need data from other nodes to execute a query. This data exchange creates problems such as joins between data segments residing on different nodes, network congestion, and nodes idling while waiting to receive data. In this paper, the Aras method is proposed. Aras is a MapReduce-based method that introduces a data set on each mapper. With this method, each mapper node can execute its query independently, without exchanging data with other nodes. Node independence solves the aforementioned data distribution problems. The proposed method has been compared with prominent data warehouses for big data, and Aras's query execution time was much lower than that of the other methods.
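The core idea described above, giving each mapper everything it needs so a query runs with no inter-node data exchange, resembles a map-side (broadcast) join. A minimal sketch under assumptions: the record layout, field names, and the dimension-table replication shown here are hypothetical illustrations, not the paper's actual data-set construction.

```python
# Sketch of mapper-local query execution: each mapper holds a full copy
# of a small dimension table, so joining never requires a shuffle.
def make_mapper(dim_table):
    """dim_table: lookup dict replicated to every mapper node."""
    def mapper(fact_chunk):
        # Join and filter locally; no data is requested from other nodes.
        return [
            (row["product_id"], row["amount"], dim_table[row["product_id"]])
            for row in fact_chunk
            if row["product_id"] in dim_table
        ]
    return mapper

dims = {1: "laptop", 2: "phone"}            # replicated dimension data
chunk_a = [{"product_id": 1, "amount": 10}]     # fact partition on node A
chunk_b = [{"product_id": 2, "amount": 5},
           {"product_id": 3, "amount": 7}]      # fact partition on node B

mapper = make_mapper(dims)
results = mapper(chunk_a) + mapper(chunk_b)     # each call is independent
```

Because every mapper call depends only on its own chunk plus the replicated table, the calls can run on separate nodes with no waiting for remote data.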

2018 ◽  
Vol 9 (3) ◽  
pp. 15-30 ◽  
Author(s):  
S. Vengadeswaran ◽  
S. R. Balasundaram

This article describes how the time taken to execute a query and return its results increases exponentially as data size increases, leading to longer waiting times for the user. Hadoop, with its distributed processing capability, is considered an efficient solution for processing such large data. However, Hadoop's Default Data Placement Strategy (HDDPS) allocates data blocks randomly across the cluster nodes without considering any execution parameters. As a result, blocks required for execution are often unavailable on the local machine, so data has to be transferred across the network, causing a data locality problem. It is also commonly observed that most data-intensive applications show grouping semantics, so during query execution only part of the big data set is utilized. Since such execution parameters and grouping behavior are not considered, the default placement performs poorly, leading to fewer local map task executions, increased query execution time, higher query latency, etc. To overcome these issues, an Optimal Data Placement Strategy (ODPS) based on grouping semantics is proposed. First, the user history log is dynamically analyzed to identify access patterns, which are depicted as a graph. Markov clustering, a graph clustering algorithm, is applied to identify groupings in the dataset. Then, an Optimal Data Placement Algorithm (ODPA) is proposed based on statistical measures estimated from the clustered graph. This, in turn, reorganizes the default data layouts in HDFS to achieve improved performance for big data sets in a heterogeneous distributed environment. The proposed strategy was tested in a 15-node cluster in a single-rack topology. The results proved more efficient for massive datasets, reducing query execution time by 26% and improving data locality by 38% compared to HDDPS.
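The pipeline the abstract outlines, mining co-access patterns from the query log, clustering the access graph, then placing grouped blocks together, can be illustrated with a simplified sketch. Note the substitution: connected components stand in for the paper's Markov clustering, and the block names, log format, and round-robin placement are invented for illustration.

```python
from collections import defaultdict

def group_blocks(query_log):
    """Build a co-access graph from the log and return its connected
    components: blocks queried together should be stored together."""
    adj = defaultdict(set)
    for blocks in query_log:                 # each entry: blocks one query touched
        for b in blocks:
            adj[b].update(x for x in blocks if x != b)
    groups, seen = [], set()
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:                         # depth-first traversal
            b = stack.pop()
            if b not in comp:
                comp.add(b)
                stack.extend(adj[b] - comp)
        seen |= comp
        groups.append(sorted(comp))
    return groups

def place(groups, nodes):
    """Assign whole groups to nodes so each group stays node-local."""
    layout = {n: [] for n in nodes}
    for i, g in enumerate(groups):
        layout[nodes[i % len(nodes)]].extend(g)
    return layout

log = [["b1", "b2"], ["b2", "b3"], ["b4", "b5"]]
groups = group_blocks(log)
layout = place(groups, ["node0", "node1"])
```

A real implementation would weight edges by co-access frequency and cluster the weighted graph, but the placement step, keeping an entire group on one node to preserve locality, is the same in spirit.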


Author(s):  
Subodh Kesharwani

Everything we create leaves a digital footprint. Big data has ascended as a catchword in recent years. Principally, it means a prodigious aggregate of information that is generated as trails or by-products of online and offline doings: what we buy using credit cards, where we travel via GPS, what we ‘like’ on Facebook or retweet on Twitter, or what we purchase through “apnidukaan” via Amazon, and so on. In this era, the Data as a Service (DaaS) battle is gaining force, spurring one of the fastest-growing industries in the world. “Big data” is jargon for data sets that are so gigantic or multi-layered that traditional data processing application software cannot cope with them. Challenges include capture, storage, analysis, data curation, search, sharing, transmission, visualization, querying, updating, and information privacy. The term “big data” usually refers simply to the use of predictive analytics, user behaviour analytics, or certain other advanced data analytics methods that extract value from data, and only occasionally to a particular size of data set.


2021 ◽  
Vol 25 (4) ◽  
pp. 907-927
Author(s):  
Adam Krechowicz

Proper data item distribution can significantly improve the performance of data processing in a distributed environment. However, typical datastore systems, as well as distributed computational frameworks, do not pay special attention to this aspect. In this paper the author introduces two custom data item addressing methods for a distributed datastore, using the Scalable Distributed Two-Layer Datastore as an example. The basic idea of these methods is to ensure that data items stored on the same cluster node are similar to each other, following concepts from data clustering. However, most data clustering mechanisms have serious problems with data scalability, which is a severe limitation in big data applications. The proposed methods efficiently distribute a data set over a set of buckets. As the experimental results show, all proposed methods produce good results efficiently in comparison to traditional clustering techniques such as k-means, agglomerative, and BIRCH clustering. Experiments in a distributed environment showed that proper data distribution can significantly improve the effectiveness of big data processing.
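The idea of keeping similar items in the same bucket can be sketched with a simple threshold-based assignment. This is a stand-in for illustration only, not the paper's actual addressing methods (which the abstract does not detail); the 1-D items, distance function, and threshold are all assumptions.

```python
def assign(items, threshold=2.0):
    """Place each 1-D item into the bucket whose centroid is nearest;
    open a new bucket when everything is farther than `threshold`."""
    buckets = []                     # list of (centroid, members)
    for x in items:
        best, best_d = None, threshold
        for i, (c, _) in enumerate(buckets):
            d = abs(x - c)
            if d < best_d:
                best, best_d = i, d
        if best is None:
            buckets.append((x, [x]))         # open a new bucket
        else:
            c, members = buckets[best]
            members.append(x)
            # update the centroid incrementally
            buckets[best] = (sum(members) / len(members), members)
    return [members for _, members in buckets]

data = [1.0, 1.2, 9.0, 0.8, 9.3]
buckets = assign(data)
```

Unlike k-means, this single-pass scheme never revisits earlier items, which is the kind of scalability trade-off the abstract alludes to.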


2020 ◽  
Author(s):  
Lungwani Muungo

Quantitative phosphoproteome and transcriptome analysis of ligand-stimulated MCF-7 human breast cancer cells was performed to understand the mechanisms of tamoxifen resistance at a system level. Phosphoproteome data revealed that WT cells were more enriched with phospho-proteins than tamoxifen-resistant cells after stimulation with ligands. Surprisingly, decreased phosphorylation after ligand perturbation was more common than increased phosphorylation. In particular, 17β-estradiol induced down-regulation in WT cells at a very high rate. 17β-Estradiol and the ErbB ligand heregulin induced almost equal numbers of up-regulated phospho-proteins in WT cells. Pathway and motif activity analyses using transcriptome data additionally suggested that deregulated activation of GSK3β (glycogen synthase kinase 3β) and MAPK1/3 signaling might be associated with altered activation of cAMP-responsive element-binding protein and AP-1 transcription factors in tamoxifen-resistant cells, and this hypothesis was validated by reporter assays. An examination of clinical samples revealed that inhibitory phosphorylation of GSK3β at serine 9 was significantly lower in tamoxifen-treated breast cancer patients that eventually had relapses, implying that activation of GSK3β may be associated with the tamoxifen-resistant phenotype. Thus, the combined phosphoproteome and transcriptome data set analyses revealed distinct signal


2019 ◽  
Vol 48 (D1) ◽  
pp. D17-D23 ◽  
Author(s):  
Charles E Cook ◽  
Oana Stroe ◽  
Guy Cochrane ◽  
Ewan Birney ◽  
Rolf Apweiler

Abstract Data resources at the European Bioinformatics Institute (EMBL-EBI, https://www.ebi.ac.uk/) archive, organize and provide added-value analysis of research data produced around the world. This year's update for EMBL-EBI focuses on data exchanges among resources, both within the institute and with a wider global infrastructure. Within EMBL-EBI, data resources exchange data through a rich network of data flows mediated by automated systems. This network ensures that users are served with as much information as possible from any search and any starting point within EMBL-EBI’s websites. EMBL-EBI data resources also exchange data with hundreds of other data resources worldwide and collectively are a key component of a global infrastructure of interconnected life sciences data resources. We also describe the BioImage Archive, a deposition database for raw images derived from primary research that will supply data for future knowledgebases that will add value through curation of primary image data. We also report a new release of the PRIDE database with an improved technical infrastructure, a new API, a new webpage, and improved data exchange with UniProt and Expression Atlas. Training is a core mission of EMBL-EBI and in 2018 our training team served more users, both in-person and through web-based programmes, than ever before.


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Sara Migliorini ◽  
Alberto Belussi ◽  
Elisa Quintarelli ◽  
Damiano Carra

Abstract The MapReduce programming paradigm is frequently used to process and analyse huge amounts of data. This paradigm relies on the ability to apply the same operation in parallel to independent chunks of data. The consequence is that overall performance greatly depends on the way data are partitioned among the various computation nodes. The default partitioning technique, provided by systems like Hadoop or Spark, performs a random subdivision of the input records, without considering their nature or the correlations between them. While such an approach can be appropriate in the simplest case, where all input records must always be analyzed, it becomes a limitation for sophisticated analyses, in which correlations between records can be exploited to prune unnecessary computations in advance. In this paper we design a context-based multi-dimensional partitioning technique, called CoPart, which accounts for data correlation in order to determine how records are subdivided between splits (i.e., units of work assigned to a computation node). More specifically, it considers not only the correlation of data w.r.t. contextual attributes, but also the distribution of each contextual dimension in the dataset. We experimentally compare our approach with existing ones, considering both quality criteria and query execution times.
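A toy version of context-aware partitioning: split records into units of work along a contextual attribute using quantile boundaries, so that each split holds correlated records and a selective query can prune whole splits. The attribute name, two-split setup, and quantile rule are illustrative assumptions, not CoPart's actual multi-dimensional algorithm.

```python
def quantile_bounds(values, n_splits):
    """Boundaries that give each split roughly equal record counts."""
    s = sorted(values)
    return [s[len(s) * i // n_splits] for i in range(1, n_splits)]

def partition(records, key, n_splits):
    """Assign each record to a split by its contextual attribute `key`."""
    bounds = quantile_bounds([r[key] for r in records], n_splits)
    splits = [[] for _ in range(n_splits)]
    for r in records:
        i = sum(r[key] >= b for b in bounds)   # split index for this record
        splits[i].append(r)
    return splits, bounds

records = [{"temp": t} for t in [3, 7, 1, 9, 5, 8, 2, 6]]
splits, bounds = partition(records, "temp", 2)
# A query such as "temp >= bounds[0]" now only needs to scan splits[1];
# with random partitioning, every split would have to be read.
```

Using quantiles rather than fixed-width ranges reflects the abstract's point that the distribution of each contextual dimension matters, since skewed data would otherwise produce badly unbalanced splits.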


Author(s):  
Yihao Tian

Big data is an unstructured data set of considerable volume, coming from various sources such as the internet, business organizations, etc., in various formats. Predicting consumer behavior is a core responsibility for most dealers. Market research can reveal consumer intentions, but it can be a tall order for even a well-designed research project to penetrate the veil protecting real customer motivations from closer scrutiny. Customer behavior analysis usually focuses on customer data mining, with each model structured at one stage to answer one query. Customer behavior prediction is a complex and unpredictable challenge. In this paper, advanced mathematical and big data analytics (BDA) methods are applied to predict customer behavior. Predictive behavior analytics can provide modern marketers with multiple insights to optimize efforts in their strategies. The model goes beyond analyzing historical evidence to making informed assumptions about what will happen in the future using mathematical models. As a result, most consumer behavior models combine many variables and produce predictions that are usually quite accurate using big data. This paper attempts to develop an association rule mining model to predict customers' behavior, improve accuracy, and derive major consumer data patterns. The findings show that the recommended BDA method improves big data analytics usability in the organization (98.2%), risk management ratio (96.2%), operational cost (97.1%), customer feedback ratio (98.5%), and demand prediction ratio (95.2%).
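Association rule mining of the kind the paper proposes can be illustrated with a minimal support/confidence computation over item pairs. The transactions and thresholds below are invented for illustration; a production system would run Apriori or FP-Growth over itemsets of arbitrary size and far larger data.

```python
from itertools import combinations
from collections import Counter

def pair_rules(transactions, min_support=0.5, min_confidence=0.7):
    """Return rules (A -> B, support, confidence) over item pairs
    that meet both thresholds."""
    n = len(transactions)
    item_count = Counter(i for t in transactions for i in set(t))
    pair_count = Counter(
        p for t in transactions for p in combinations(sorted(set(t)), 2))
    rules = []
    for (a, b), c in pair_count.items():
        support = c / n                      # fraction of baskets with both
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):        # consider both rule directions
            confidence = c / item_count[x]   # P(y in basket | x in basket)
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return rules

baskets = [["bread", "milk"], ["bread", "milk", "eggs"],
           ["milk", "eggs"], ["bread", "milk"]]
rules = pair_rules(baskets)
```

A rule such as ("bread", "milk", 0.75, 1.0) reads: bread and milk co-occur in 75% of baskets, and every basket containing bread also contains milk, which is exactly the kind of pattern a behavior-prediction model would act on.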


2021 ◽  
Vol 87 (4) ◽  
pp. 283-293
Author(s):  
Wei Wang ◽  
Yuan Xu ◽  
Yingchao Ren ◽  
Gang Wang

Recently, performance improvement in facade parsing from 3D point clouds has been brought about by designing more complex network structures, which cost huge computing resources and do not take full advantage of prior knowledge of facade structure. Instead, from the perspective of data distribution, we construct a new hierarchical mesh multi-view data domain based on the characteristics of facade objects to achieve fusion of deep-learning models and prior knowledge, thereby significantly improving segmentation accuracy. We comprehensively evaluate the current mainstream method on the RueMonge 2014 data set and demonstrate the superiority of our method. The mean intersection-over-union index on the facade-parsing task reached 76.41%, which is 2.75% higher than the current best result. In addition, through comparative experiments, the reasons for the performance improvement of the proposed method are further analyzed.


2021 ◽  
pp. 1-29
Author(s):  
Eric Sonny Mathew ◽  
Moussa Tembely ◽  
Waleed AlAmeri ◽  
Emad W. Al-Shalabi ◽  
Abdul Ravoof Shaik

Two of the most critical properties for multiphase flow in a reservoir are relative permeability (Kr) and capillary pressure (Pc). To determine these parameters, careful interpretation of coreflooding and centrifuge experiments is necessary. In this work, a machine learning (ML) technique was incorporated to assist in the determination of these parameters quickly and synchronously for steady-state drainage coreflooding experiments. A state-of-the-art framework was developed in which a large database of Kr and Pc curves was generated based on existing mathematical models. This database was used to perform thousands of coreflood simulation runs representing oil-water drainage steady-state experiments. The results obtained from the corefloods, including pressure drop and water saturation profile, along with other conventional core analysis data, were fed as features into the ML model. The entire data set was split into 70% for training, 15% for validation, and the remaining 15% for blind testing of the model. The 70% training portion teaches the model to capture fluid flow behavior inside the core; 15% of the data set was then used to validate the trained model and to optimize the hyperparameters of the ML algorithm. The remaining 15% of the data set was used for testing the model and assessing its performance scores. In addition, a K-fold split technique was used on the 15% testing data set to provide an unbiased estimate of the final model performance. The trained/tested model was thereby used to estimate Kr and Pc curves based on available experimental results. The values of the coefficient of determination (R2) were used to assess the accuracy and efficiency of the developed model. The respective crossplots indicate that the model is capable of making accurate predictions, with an error percentage of less than 2% on history-matching experimental data.
This implies that the artificial-intelligence- (AI-) based model is capable of determining Kr and Pc curves. The present work could be an alternative approach to existing methods for interpreting Kr and Pc curves. In addition, the ML model can be adapted to produce results that include multiple options for Kr and Pc curves from which the best solution can be determined using engineering judgment. This is unlike solutions from some of the existing commercial codes, which usually provide only a single solution. The model currently focuses on the prediction of Kr and Pc curves for drainage steady-state experiments; however, the work can be extended to capture the imbibition cycle as well.
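The 70/15/15 split described above can be sketched generically. The split fractions come from the paper; the shuffle-and-slice mechanics, seed, and integer samples are generic stand-ins for the actual coreflood feature vectors.

```python
import random

def split_dataset(samples, seed=42, train=0.70, val=0.15):
    """Shuffle and slice into train / validation / blind-test subsets."""
    data = list(samples)
    random.Random(seed).shuffle(data)      # reproducible shuffle
    n = len(data)
    n_train = int(n * train)
    n_val = int(n * val)
    return (data[:n_train],
            data[n_train:n_train + n_val],
            data[n_train + n_val:])        # remainder: blind test set

train, val, test = split_dataset(range(100))
```

Holding the 15% test slice out of all training and hyperparameter tuning, and further K-folding it as the paper does, is what keeps the reported performance scores unbiased.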

