large scale data
Recently Published Documents





In cloud-based Big Data applications, Hadoop has been widely adopted for distributed processing of large-scale data sets. However, energy wasted in data centers remains an important research topic because of resource overuse and extra overhead costs. Dynamic scaling of resources in a Hadoop YARN cluster is a practical way to address this challenge. This paper proposes a dynamic scaling approach in Hadoop YARN (DSHYARN) that adds or removes nodes automatically based on workload. It is based on two algorithms (scaling up and scaling down) that automate the scaling process in the cluster. The aim is to ensure the energy efficiency and performance of Hadoop YARN clusters. To validate the effectiveness of DSHYARN, a case study applying sentiment analysis to tweets about the COVID-19 vaccine is provided; the goal is to analyze tweets that people posted on Twitter. The results showed improvements in CPU utilization, RAM utilization, and job completion time. In addition, energy consumption was reduced by 16% under average workload.
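The abstract does not give the two DSHYARN algorithms, so the following is only a minimal sketch of what a workload-based scale-up/scale-down decision could look like. The `ClusterState` type, the CPU thresholds, and the node limits are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    active_nodes: int        # nodes currently running in the cluster
    cpu_utilization: float   # average CPU utilization across nodes, 0.0 to 1.0


def scaling_decision(state, upper=0.80, lower=0.30, min_nodes=1, max_nodes=10):
    """Return +1 to add a node, -1 to remove one, or 0 to leave the cluster as is.

    The thresholds and node limits are hypothetical defaults, not values
    reported by the paper.
    """
    if state.cpu_utilization > upper and state.active_nodes < max_nodes:
        return 1   # scale up: the cluster is overloaded
    if state.cpu_utilization < lower and state.active_nodes > min_nodes:
        return -1  # scale down: capacity is sitting idle
    return 0       # utilization is inside the target band


if __name__ == "__main__":
    for load in (0.95, 0.50, 0.10):
        state = ClusterState(active_nodes=4, cpu_utilization=load)
        print(load, scaling_decision(state))
```

A real controller would also need to drain running containers before removing a node; that step is omitted here.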

2022 ◽  
pp. 17-25
Nancy Jan Sliper

Experimenters today frequently quantify millions or even billions of characteristics (measurements) per sample to address critical biological questions, in the hope that machine learning tools will be able to make correct data-driven judgments. An efficient analysis requires a low-dimensional representation that preserves the differentiating features (e.g., whether a certain ailment is present in the person's body) in data whose size and complexity are orders of magnitude apart. While several systems can handle millions of variables and still offer strong empirical and conceptual guarantees, few can be clearly understood. This research presents an evaluation of supervised dimensionality reduction for large-scale data. We provide a methodology for extending Principal Component Analysis (PCA) by including class moment estimates in low-dimensional projections. Linear Optimum Low-Rank (LOLR) projection, the cheapest variant, includes the class-conditional means. Using both experimental and simulated benchmark data, we show that LOLR projections and their extensions improve data representations for subsequent classification while retaining computational flexibility and reliability. In terms of accuracy, LOLR outperforms other modular linear dimensionality reduction methods that require much longer computation times on conventional computers. LOLR handles more than 150 million attributes in brain image processing datasets, and many genome sequencing datasets have more than half a million attributes.
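To make the idea of adding class-conditional means to a PCA-style projection concrete, here is a small two-class sketch in plain Python. It is an illustration under stated assumptions, not the authors' LOLR implementation: it keeps only the class-mean difference plus one principal direction of the class-centered data, and every function name is hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    # Assumes v is not the zero vector.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def column_means(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def top_direction(centered_rows, iterations=100):
    """Leading principal direction of centered rows, found by power iteration."""
    dim = len(centered_rows[0])
    v = normalize([1.0] * dim)
    for _ in range(iterations):
        # Apply X^T X to v without forming the dim-by-dim covariance matrix.
        scores = [dot(row, v) for row in centered_rows]
        w = [sum(s * row[j] for s, row in zip(scores, centered_rows))
             for j in range(dim)]
        v = normalize(w)
    return v


def mean_augmented_projection(class_a, class_b):
    """Return two directions: the class-mean difference and one principal direction."""
    mean_a, mean_b = column_means(class_a), column_means(class_b)
    delta = normalize([b - a for a, b in zip(mean_a, mean_b)])
    # Center each sample on its own class mean before extracting variance directions.
    centered = [[x - m for x, m in zip(row, mean_a)] for row in class_a]
    centered += [[x - m for x, m in zip(row, mean_b)] for row in class_b]
    pc = top_direction(centered)
    # Gram-Schmidt step so the second direction is orthogonal to the first.
    overlap = dot(pc, delta)
    pc = normalize([p - overlap * d for p, d in zip(pc, delta)])
    return [delta, pc]


def project(row, directions):
    return [dot(row, d) for d in directions]
```

The first coordinate of `project(...)` separates the two classes along their mean difference, and the second captures the largest remaining within-class variation. The sketch assumes the class means differ and the within-class variance is not entirely along the mean difference.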

2022 ◽  
pp. 41-67
Vo Ngoc Phu ◽  
Vo Thi Ngoc Tran

Machine learning (ML), neural networks (NNs), evolutionary algorithms (EAs), fuzzy systems (FSs), and computer science more broadly have been prominent and significant for many years. They have been applied in many different areas and have contributed much to the development of large-scale corporations and organizations. These corporations and organizations have in turn generated large amounts of information and massive data sets (MDSs). Such big data sets (BDSs) pose challenges for many commercial applications and research efforts. Consequently, many ML, NN, EA, FS, and other computer science algorithms have been developed to handle these massive data sets. To support this process, the authors present the NN algorithms for large-scale data sets (LSDSs) in this chapter. Finally, they present a novel NN model for BDSs in a sequential environment (SE) and a distributed network environment (DNE).

2022 ◽  
pp. 52-80
Shouheng Sun ◽  
Dafei Yang ◽  
Xue Yan

This study aims to develop a typological configuration that characterizes the full spectrum of collaborative platform economy business practice in the real world. The analysis is based on a large-scale data set containing information on 1,335 representative platforms in more than 60 countries on five continents, covering almost all collaborative platform economy business practices mentioned in academic journals and public media. Using the k-means clustering method, an empirical typology comprising seven categories of collaborative platform economy business practice is proposed: collaborative support platform, resource supply platform, authentic C2C platform, C2C mutualized mobility platform, hybrid service platform, B2C service platform, and collaborative finance platform. In addition, using operating status data of the collaborative platform economy, a cross-comparative analysis of the category differences and geographic differences is carried out.
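For readers unfamiliar with k-means, the clustering step can be sketched as below. This is a generic Lloyd's-algorithm illustration, not the study's pipeline: the feature vectors are hypothetical, and the centroids are seeded with the first k points only to keep the example deterministic.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=100):
    """Cluster points into k groups; returns (centroids, assignment)."""
    # Deterministic seeding for illustration; real uses usually seed randomly.
    centroids = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment


if __name__ == "__main__":
    # Hypothetical two-feature platform descriptors.
    data = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
    print(k_means(data, k=2))
```

In the study's setting, each point would be a platform described by its business-practice attributes and k would be seven.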

2022 ◽  
pp. 112-145
Vo Ngoc Phu ◽  
Vo Thi Ngoc Tran

Artificial intelligence (ARTINT) and information have been prominent fields for many years. One reason is that many different areas have advanced quickly on the basis of ARTINT and information, creating significant value over many years. This value has increasingly been used by national economies, other sciences, companies, and organizations around the world. As these economies have grown strongly, many massive corporations and big organizations have been established rapidly. Unsurprisingly, these corporations and organizations have generated large amounts of information and large-scale data sets. Processing and storing them successfully has been a major challenge for many commercial applications and studies. To handle this problem, many algorithms have been proposed for processing these big data sets.

2021 ◽  
Vol 14 (1) ◽  
pp. 19
Zineddine Kouahla ◽  
Ala-Eddine Benrazek ◽  
Mohamed Amine Ferrag ◽  
Brahim Farou ◽  
Hamid Seridi ◽  

The past decade has been characterized by growing volumes of data due to the widespread use of Internet of Things (IoT) applications, which has introduced many challenges for efficient data storage and management. Thus, the efficient indexing and searching of large data collections is a very topical and urgent issue. Such solutions can provide users with valuable information about IoT data. However, efficient retrieval and management of such information in terms of index size and search time require optimization of indexing schemes, which is rather difficult to implement. The purpose of this paper is to examine and review existing indexing techniques for large-scale data. A taxonomy of indexing techniques is proposed to enable researchers to understand and select the techniques that will serve as a basis for designing a new indexing scheme. The real-world applications of the existing indexing techniques in different areas, such as health, business, scientific experiments, and social networks, are presented. Open problems and research challenges, e.g., privacy and large-scale data mining, are also discussed.

2021 ◽  
Vol 12 (1) ◽  
pp. 292
Yunyong Ko ◽  
Sang-Wook Kim

The recent unprecedented success of deep learning (DL) in various fields is underpinned by its use of large-scale data and models. Training a large-scale deep neural network (DNN) model with large-scale data, however, is time-consuming. To speed up the training of massive DNN models, data-parallel distributed training based on the parameter server (PS) has been widely applied. In general, synchronous PS-based training suffers from synchronization overhead, especially in heterogeneous environments. To reduce this overhead, asynchronous PS-based training uses asynchronous communication between the PS and workers, so that the PS processes each worker's request independently without waiting. Despite the performance improvement, asynchronous training inevitably introduces differences among the workers' local models, and these differences may slow model convergence. To address this problem, we propose a novel asynchronous PS-based training algorithm, SHAT, that considers (1) the scale of distributed training and (2) the heterogeneity among workers in order to reduce the differences among the workers' local models. Extensive empirical evaluation demonstrates that (1) the model trained by SHAT converges to an accuracy up to 5.22% higher than that of state-of-the-art algorithms, and (2) the model convergence of SHAT is robust under various heterogeneous environments.
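The asynchronous parameter-server pattern described above can be simulated in a few lines. The sketch below is plain asynchronous gradient descent with stale gradients on a one-parameter toy objective; it is not SHAT, whose details the abstract does not give, and the class name, delays, and learning rate are all hypothetical.

```python
class ParameterServer:
    """Holds the global weight and applies each gradient as soon as it arrives."""

    def __init__(self, weight, learning_rate):
        self.weight = weight
        self.learning_rate = learning_rate

    def pull(self):
        return self.weight

    def push(self, gradient):
        # No barrier: the update is applied without waiting for other workers.
        self.weight -= self.learning_rate * gradient


def gradient(weight, target=3.0):
    # Gradient of the toy objective (weight - target) ** 2.
    return 2.0 * (weight - target)


def simulate(steps=200, delays=(1, 3)):
    """Simulate heterogeneous workers; worker i pushes every delays[i] ticks.

    Each pushed gradient is computed on the copy of the weight the worker
    pulled after its previous push, so slower workers send staler gradients.
    """
    server = ParameterServer(weight=0.0, learning_rate=0.05)
    local = [server.pull() for _ in delays]
    for tick in range(1, steps + 1):
        for i, delay in enumerate(delays):
            if tick % delay == 0:
                server.push(gradient(local[i]))
                local[i] = server.pull()
    return server.weight


if __name__ == "__main__":
    print(simulate())
```

The gap between each worker's `local[i]` and the server weight is the model difference the abstract refers to; with a small learning rate the toy problem still converges toward the target of 3.0.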

2021 ◽  
pp. 095679762110246
Molly Lewis ◽  
Matt Cooper Borkenhagen ◽  
Ellen Converse ◽  
Gary Lupyan ◽  
Mark S. Seidenberg

We investigated how gender is represented in children’s books using a novel 200,000-word corpus comprising 247 popular, contemporary books for young children. Using adult human judgments and word co-occurrence data, we quantified gender biases of words in individual books and in the whole corpus. We found that children’s books contain many words that adults judge as gendered. Semantic analyses based on co-occurrence data yielded word clusters related to gender stereotypes (e.g., feminine: emotions; masculine: tools). Co-occurrence data also indicated that many books instantiate gender stereotypes identified in other research (e.g., girls are better at reading, and boys are better at math). Finally, we used large-scale data to estimate the gender distribution of the audience for individual books, and we found that children are more often exposed to stereotypes for their own gender. Together, the data suggest that children’s books may be an early source of gender associations and stereotypes.

2021 ◽  
Vol 2021 ◽  
pp. 1-8
Xing Zhang

With the development of network and multimedia technology, multimedia communication has attracted the attention of researchers. Image encryption has become an urgent need for secure multimedia communication. Compared with traditional encryption systems, encryption algorithms based on chaos are easier to implement, which makes them more suitable for large-scale data encryption. The image encryption method proposed in this paper combines high-dimensional chaotic systems. The algorithm is mainly used for graph mapping and uses the Lorenz system to expand and replace them one by one. Studies have shown that this method produces well-mixed pixel values, good diffusion performance, and strong key performance with strong resistance. The pixels of the encrypted picture are distributed relatively randomly, and the characteristics of similar loudness are not relevant. Experiments show that the method has strong security performance.
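As a rough illustration of how a chaotic system can drive pixel substitution, the sketch below derives a byte stream from a Lorenz trajectory and XORs it with pixel values. It is a toy under stated assumptions, not the paper's algorithm and not a secure cipher: the Euler step size, the burn-in length, the byte-extraction rule, and the function names are all hypothetical.

```python
def lorenz_keystream(key, length, dt=0.01, burn_in=1000):
    """Derive `length` bytes from a Lorenz trajectory started at `key` = (x0, y0, z0)."""
    x, y, z = key
    sigma, rho, beta = 10.0, 28.0, 8.0 / 3.0  # classic Lorenz parameters
    stream = []
    for step in range(burn_in + length):
        # One explicit Euler step of the Lorenz equations.
        dx = sigma * (y - x)
        dy = x * (rho - z) - y
        dz = x * y - beta * z
        x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
        if step >= burn_in:
            # Hypothetical extraction rule: keep low-order digits of |x| as a byte.
            stream.append(int(abs(x) * 1e6) % 256)
    return stream


def xor_pixels(pixels, key):
    """XOR 8-bit pixel values with the keystream; the same call decrypts."""
    stream = lorenz_keystream(key, len(pixels))
    return [p ^ k for p, k in zip(pixels, stream)]


if __name__ == "__main__":
    image = [0, 17, 128, 255, 42, 42, 42, 42]  # flattened grayscale pixels
    secret = (0.1, 0.2, 0.3)                   # initial state acts as the key
    cipher = xor_pixels(image, secret)
    print(cipher, xor_pixels(cipher, secret) == image)
```

Because XOR is its own inverse, running `xor_pixels` again with the same initial state restores the image. Sensitivity to the initial state is what makes the key hard to guess in schemes of this kind.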
