Efficient data transfer scheme using word-pair-encoding-based compression for large-scale text-data processing

Author(s):  
Hasitha Muthumala Waidyasooriya ◽  
Daisuke Ono ◽  
Masanori Hariyama ◽  
Michitaka Kameyama
1989 ◽  
Vol 167 ◽  
Author(s):  
John D. Crow

The interconnection of many processors in order to increase the computing power of the ensemble is a growing theme in the data processing industry [1]. These processors may be within modules on a common board, on separate boards within a common frame, or in separate frames distributed in a room or building. A key element of this “computer complex” is the network which allows efficient data transfers. These data processing networks are differentiated from data communications networks by a demand for fast data transfer, so that the processors and memory at the nodes can interact in times measured in roughly 1–1000 machine cycles. Machine cycle times might be from tens to hundreds of nanoseconds, and the amount of data transferred might be measured in kilobytes [2]. This implies both a fast and a high-bandwidth technology for implementing these interconnections, as well as a limited link distance. The VLSI IC technology and associated dense electrical chip packaging that have made such powerful computing nodes possible also imply a requirement for dense packaging of the optical link adapter, for compatibility [3]. The multiprocessor complex places very high demands on reliability, leading to low tolerance for component failure or erroneous data transmission. An eight-year life, failure rates below 0.01%/khr, and link bit error rates below 1 in 10^15 are not uncommon requirements [4]. The performance and cost of these networks are often determined by the wiring technology chosen, specifically the electronics and opto-electronics of the interfaces at the network's nodes. Optical interconnections are potentially attractive for this application, but the requirements on optical, electrical and associated packaging technologies are significantly different from the technology that has been developed for data communications applications [4, 5].
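As a back-of-the-envelope illustration of why such links require gigabit-class bandwidth, the sketch below plugs representative values from the ranges quoted above into the transfer-time budget; the specific payload size, cycle count, and cycle time are assumptions for illustration, not figures from the article.

```python
# Back-of-the-envelope link bandwidth estimate using the ranges quoted above.
# The specific numbers (4 KB payload, 500 cycles, 50 ns cycle time) are
# illustrative assumptions, not values from the article.

payload_bytes = 4 * 1024     # a few kilobytes per transfer
cycles_allowed = 500         # transfer must finish within ~1-1000 machine cycles
cycle_time_s = 50e-9         # machine cycle time: tens to hundreds of nanoseconds

transfer_window_s = cycles_allowed * cycle_time_s            # 25 microseconds
required_bandwidth_bps = payload_bytes * 8 / transfer_window_s

print(f"Transfer window   : {transfer_window_s * 1e6:.1f} us")
print(f"Required link rate: {required_bandwidth_bps / 1e9:.2f} Gbit/s")
# -> roughly 1.3 Gbit/s, which is why a fast, high-bandwidth (e.g. optical)
#    interconnect technology with limited link distance is implied.
```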


2018 ◽  
Author(s):  
Hyunki Woo ◽  
Kyunga Kim ◽  
KyeongMin Cha ◽  
Jin-Young Lee ◽  
Hansong Mun ◽  
...  

BACKGROUND Since medical research based on big data has become more common, the community’s interest in and effort to analyze large amounts of semistructured or unstructured text data, such as examination reports, have rapidly increased. However, such large-scale text data are often not readily usable for analysis owing to typographical errors, inconsistencies, or data entry problems. Therefore, an efficient data cleaning process is required to ensure the veracity of such data. OBJECTIVE In this paper, we propose an efficient data cleaning process for large-scale medical text data, which employs text clustering methods and a value-converting technique, and evaluate its performance on medical examination text data. METHODS The proposed data cleaning process consists of text clustering and value-merging. In the text clustering step, we suggest the complementary use of key collision and nearest neighbor methods. Words (called values) in the same cluster are expected to comprise a correct value and its erroneous representations. In the value-converting step, the wrong values in each identified cluster are converted into their correct value. We applied this data cleaning process to 574,266 stool examination reports produced for parasite analysis at Samsung Medical Center from 1995 to 2015. The performance of the proposed process was examined and compared with data cleaning processes based on a single clustering method. We used OpenRefine 2.7, an open-source application that provides various text clustering methods and an efficient user interface for value-converting with common-value suggestion. RESULTS A total of 1,167,104 words in the stool examination reports were surveyed. In the data cleaning process, we discovered 30 correct words and 45 patterns of typographical errors and duplicates. We observed high correction rates for words with typographical errors (98.61%) and typographical error patterns (97.78%). The resulting data accuracy was nearly 100% based on the total number of words. CONCLUSIONS Our data cleaning process, based on the combined use of key collision and nearest neighbor methods, provides efficient cleaning of large-scale text data and hence improves data accuracy.
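As an illustration of the key collision half of the clustering step, the sketch below re-implements an OpenRefine-style fingerprint method together with a simplified value-converting step. The sample values are invented, and the nearest neighbor method is not shown; this is not the authors' code.

```python
import re
from collections import Counter, defaultdict

def fingerprint(value: str) -> str:
    """Simplified OpenRefine-style fingerprint: lowercase, strip punctuation,
    split into tokens, deduplicate, sort, and rejoin."""
    tokens = re.sub(r"[^\w\s]", "", value.strip().lower()).split()
    return " ".join(sorted(set(tokens)))

def key_collision_clusters(values):
    """Group raw values whose fingerprints collide (key collision clustering)."""
    clusters = defaultdict(list)
    for v in values:
        clusters[fingerprint(v)].append(v)
    return [group for group in clusters.values() if len(set(group)) > 1]

def value_converting(values):
    """Replace every member of a cluster with its most frequent spelling
    (first-seen spelling on ties)."""
    mapping = {}
    for group in key_collision_clusters(values):
        correct, _ = Counter(group).most_common(1)[0]
        for variant in set(group):
            mapping[variant] = correct
    return [mapping.get(v, v) for v in values]

# Invented example values; real input would be words from examination reports.
raw = ["No parasite seen", "no parasite  seen", "No Parasite Seen.", "Ascaris"]
print(value_converting(raw))
# -> ['No parasite seen', 'No parasite seen', 'No parasite seen', 'Ascaris']
```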


2015 ◽  
Vol 2015 ◽  
pp. 1-12 ◽  
Author(s):  
Mostefa Bendjima ◽  
Mohammed Feham

Wireless sensor networks (WSNs) are designed to collect information across a large number of sensor nodes with limited batteries. It is therefore important to minimize the energy consumption of each node so as to extend the lifetime of the network. This paper proposes an intelligent WSN communication architecture based on a multiagent system (MAS) to ensure optimal data collection. An MAS refers to a group of agents that interact and cooperate to achieve a specific goal. To this end, we propose the integration of a migrating agent into each node to process data and enhance cooperation between neighboring nodes, while mobile agents (MAs) are used to reduce data transfer between the nodes and carry the data to the base station (sink). The collaboration of these agents generates a single message that summarizes the important information to be transmitted by an MA. To keep the MAs small, the nodes of each network sector are grouped in such a way that an optimal itinerary is established for each MA, using a minimum amount of energy and performing efficient data aggregation within a minimum time. Successive simulations in large-scale sensor networks show the good performance of our proposal in terms of energy consumption and packet delivery rate.
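The itinerary construction for each MA can be illustrated with a greedy nearest-neighbor tour over the nodes of one sector, using inter-node distance as a rough proxy for transmission energy. This is only a sketch under those assumptions (node coordinates and the sink position are invented); it is not the authors' itinerary algorithm.

```python
import math

def greedy_itinerary(sink, nodes):
    """Order the nodes of one sector so the mobile agent always hops to the
    nearest unvisited node, starting from and returning to the sink.
    Euclidean distance is used as a crude proxy for transmission energy."""
    itinerary = []
    current = sink
    remaining = list(nodes)
    while remaining:
        nxt = min(remaining, key=lambda n: math.dist(current, n))
        itinerary.append(nxt)
        remaining.remove(nxt)
        current = nxt
    return [sink] + itinerary + [sink]

# Invented sector: sink at the origin, five sensor nodes with (x, y) positions.
sink = (0.0, 0.0)
sector_nodes = [(2.0, 1.0), (5.0, 4.0), (1.0, 3.0), (6.0, 1.0), (3.0, 5.0)]
print(greedy_itinerary(sink, sector_nodes))
```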


2020 ◽  
Vol 12 (4) ◽  
pp. 607 ◽  
Author(s):  
Chen Xu ◽  
Xiaoping Du ◽  
Zhenzhen Yan ◽  
Xiangtao Fan

Management and processing of massive remote sensing data is currently one of the most important topics. In this study, we introduce ScienceEarth, a cluster-based data processing framework. The aim of ScienceEarth is to store, manage, and process large-scale remote sensing data in a cloud-based cluster-computing environment. The platform consists of three main parts: ScienceGeoData, ScienceGeoIndex, and ScienceGeoSpark. ScienceGeoData stores and manages remote sensing data. ScienceGeoIndex is an index and query system that combines a quad-tree spatial index with a Hilbert curve for heterogeneous tiled remote sensing data, enabling efficient data retrieval from ScienceGeoData. ScienceGeoSpark is an easy-to-use computing framework in which Apache Spark serves as the analytics engine for big remote sensing data processing. Test results show that ScienceEarth can efficiently store, retrieve, and process remote sensing data, and reveal its potential for efficient big remote sensing data processing.
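As an illustration of how a Hilbert curve turns two-dimensional tile positions into one-dimensional keys so that spatially adjacent tiles tend to receive nearby keys, the sketch below implements the standard Hilbert index computation. The grid size, example tile, and key layout are assumptions for illustration, not ScienceGeoIndex's actual code.

```python
def hilbert_index(order: int, x: int, y: int) -> int:
    """Map tile coordinates (x, y) on a 2**order x 2**order grid to their
    position along the Hilbert curve. Nearby tiles get nearby indices, which
    keeps spatially close tiles close together in a key-value store."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # Reflect/rotate the quadrant so the sub-curve keeps its orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d

# Example: an 8x8 tile grid (quad-tree depth / zoom level 3).
# A storage key could combine the zoom level with the Hilbert index,
# e.g. "L3/55" (this key layout is an assumption, not ScienceEarth's format).
print(hilbert_index(3, 5, 2))   # -> 55
```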


2020 ◽  
Vol 14 ◽  
Author(s):  
Khoirom Motilal Singh ◽  
Laiphrakpam Dolendro Singh ◽  
Themrichon Tuithung

Background: Data in the form of text, audio, image, and video are used everywhere in our modern scientific world. These data are stored in physical storage, cloud storage, and other storage devices. Some of them are very sensitive and require efficient security both in storage and in transmission from the sender to the receiver. Objective: With the increase in data transfer operations, enough space is also required to store these data. Many researchers have been working to develop different encryption schemes, yet many limitations remain in their work. There is always a need for encryption schemes with smaller cipher data, faster execution time, and low computation cost. Methods: A text encryption scheme based on Huffman coding and the ElGamal cryptosystem is proposed. Initially, the text data are converted to the corresponding binary bits using Huffman coding. Next, the binary bits are grouped and converted into large integer values, which are used as the input to the ElGamal cryptosystem. Results: Encryption and decryption are performed successfully, where the data size is reduced by Huffman coding and enhanced security with a smaller key size is provided by the ElGamal cryptosystem. Conclusion: Simulation results and performance analysis indicate that our encryption algorithm outperforms the existing algorithms under consideration.
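The pipeline described in the Methods section can be sketched as follows: build a Huffman code for the plaintext, pack the compressed bit string into one large integer, and encrypt that integer with textbook ElGamal. All parameters below (demo prime, keys, sample text) are toy values for illustration and may differ from the paper's actual variant and parameter sizes.

```python
import heapq
from collections import Counter

def huffman_table(text: str) -> dict:
    """Build a Huffman code table {char: bit-string} for the given text."""
    counts = Counter(text)
    if len(counts) == 1:                          # degenerate single-symbol case
        return {next(iter(counts)): "0"}
    # Heap entries: (frequency, unique tie-breaker, subtree); a subtree is
    # either a character or a (left, right) pair.
    heap = [(freq, i, ch) for i, (ch, freq) in enumerate(counts.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (left, right)))
        counter += 1
    table = {}
    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            table[node] = prefix
    walk(heap[0][2], "")
    return table

def elgamal_encrypt(m, p, g, y, k):
    """Textbook ElGamal: ciphertext pair (c1, c2) for a message integer m < p."""
    return pow(g, k, p), (m * pow(y, k, p)) % p

def elgamal_decrypt(c1, c2, p, x):
    return (c2 * pow(c1, p - 1 - x, p)) % p

# --- Toy demonstration (all parameters are illustrative, not from the paper) ---
text = "hello huffman elgamal"
table = huffman_table(text)
bits = "".join(table[ch] for ch in text)      # Huffman-compressed bit string
m = int("1" + bits, 2)                        # leading 1 preserves leading zeros

p = 2**127 - 1          # toy Mersenne prime; real systems use much larger primes
g, x = 3, 911           # generator and private key (toy values)
y = pow(g, x, p)        # public key
c1, c2 = elgamal_encrypt(m, p, g, y, k=12345)

recovered = elgamal_decrypt(c1, c2, p, x)
rev = {code: ch for ch, code in table.items()}
# Decode the bit string (after dropping the sentinel leading "1") back to text.
out, buf = [], ""
for b in bin(recovered)[3:]:
    buf += b
    if buf in rev:
        out.append(rev[buf])
        buf = ""
assert "".join(out) == text
```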


2021 ◽  
Vol 77 (2) ◽  
pp. 98-108
Author(s):  
R. M. Churchill ◽  
C. S. Chang ◽  
J. Choi ◽  
J. Wong ◽  
S. Klasky ◽  
...  

Biomimetics ◽  
2021 ◽  
Vol 6 (2) ◽  
pp. 32
Author(s):  
Tomasz Blachowicz ◽  
Jacek Grzybowski ◽  
Pawel Steblinski ◽  
Andrea Ehrmann

Computers nowadays have separate components for data storage and data processing, making data transfer between these units a bottleneck for computing speed. Therefore, so-called cognitive (or neuromorphic) computing approaches try to combine both tasks, as is done in the human brain, to make computing faster and less energy-consuming. One possible route to new hardware for neuromorphic computing is offered by nanofiber networks, which can be prepared by diverse methods, from lithography to electrospinning. Here, we show results of micromagnetic simulations of three coupled semicircular fibers in which domain walls are excited by rotating magnetic fields (inputs), leading to different output signals that can be used for stochastic data processing, mimicking biological synaptic activity and thus being suitable as artificial synapses in artificial neural networks.

