Efficient Algorithms for Cleaning and Indexing of Graph data

2020 ◽  
Vol 11 (3) ◽  
pp. 1-19
Author(s):  
Santhosh Kumar D. K. ◽  
Demain Antony DMello

Information extraction and analysis from enormous graph data is expanding rapidly. Surveys indicate that 80% of researchers spend more than 40% of their project time on data cleaning, which signifies a huge need for it. Due to the characteristics of big data, storage and retrieval are another major concern and are addressed by data indexing. Existing data cleaning techniques try to clean graph data based on a single kind of information, such as structural attributes or event log sequences. Cleaning graph data on one piece of information alone will not increase computational performance: along with nodes, labels can also be inconsistent, so it is highly desirable to clean both. This paper addresses the aforesaid issue by proposing a graph data cleaning algorithm that detects unstructured information along with inconsistent labeling, cleans the data by applying rules, and verifies the result against data inconsistencies. The authors also propose an indexing algorithm based on the CSS-tree to build efficient and scalable graph indexing on top of Hadoop.
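The paper's actual cleaning rules and Hadoop/CSS-tree pipeline are not given in the abstract; the following is only a minimal sketch of the general idea of rule-based label cleaning, with hypothetical node data and a hypothetical rule table mapping dirty labels to canonical ones.

```python
# Hypothetical sketch of rule-based graph label cleaning (not the paper's
# algorithm): normalize each node label, then apply a rule table that maps
# known-dirty labels to canonical ones, flagging the inconsistent nodes.

def clean_graph(nodes, rules):
    """nodes: {node_id: raw_label}; rules: {dirty_label: canonical_label}."""
    cleaned = {}
    inconsistent = []
    for node_id, label in nodes.items():
        canon = label.strip().lower()      # normalize the raw label
        if canon in rules:                 # a rule flags this label as dirty
            inconsistent.append(node_id)
            canon = rules[canon]
        cleaned[node_id] = canon
    return cleaned, inconsistent

nodes = {"n1": " Person ", "n2": "persn", "n3": "City"}
rules = {"persn": "person"}
cleaned, flagged = clean_graph(nodes, rules)
# cleaned == {"n1": "person", "n2": "person", "n3": "city"}; flagged == ["n2"]
```

In the paper's setting such a pass would run per-partition on Hadoop, with the cleaned nodes then fed into the CSS-tree index build.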

Sensors ◽  
2021 ◽  
Vol 21 (18) ◽  
pp. 6063
Author(s):  
Francisco Javier Nieto ◽  
Unai Aguilera ◽  
Diego López-de-Ipiña

Data scientists spend much of their time on data cleaning tasks, which is especially important when dealing with data gathered from sensors, as failures are not unusual (there is an abundance of research on anomaly detection in sensor data). This work analyzes several aspects of the data generated by different sensor types to understand their particularities, linking them with existing data mining methodologies. Using data from different sources, it analyzes how the type of sensor used and its measurement units have an important impact on basic statistics such as the variance and mean, because of the statistical distributions of the datasets. The work also analyzes the behavior of outliers, how to detect them, and how they affect the equivalence of sensors, as equivalence is used in many solutions for identifying anomalies. Based on these results, the article presents guidance on how to deal with data coming from sensors in order to understand the characteristics of sensor datasets, and proposes a parallelized implementation. Finally, the article shows that the proposed decision-making processes work well with a new type of sensor and that parallelizing across several cores enables calculations to be executed up to four times faster.
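The abstract does not specify the article's detection method, so the following is only an illustrative sketch of one common distribution-aware outlier check, the interquartile-range (IQR) rule, applied to a made-up series of sensor readings.

```python
# Illustrative IQR-based outlier check for a series of sensor readings
# (a common technique; not necessarily the method used in the article).
import statistics

def iqr_outliers(readings, k=1.5):
    q = statistics.quantiles(readings, n=4)   # quartiles Q1, Q2, Q3
    q1, q3 = q[0], q[2]
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr       # Tukey fences
    return [x for x in readings if x < lo or x > hi]

readings = [20.1, 20.3, 19.8, 20.0, 20.2, 35.0, 20.1]
print(iqr_outliers(readings))  # the 35.0 spike is flagged
```

Because the fences are computed from the data's own quartiles, the rule adapts to the sensor's measurement units and spread, which is exactly why the abstract's point about units and distributions matters for detection.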


Author(s):  
Arif Hanafi ◽  
Sulaiman Harun ◽  
Sofika Enggari ◽  
Larissa Navia Rani

Email has unquestionable significance in modern business communication. Every day, a large volume of email is sent from organizations to clients and suppliers, from employees to their managers, and from one colleague to another, so data warehouses hold vast amounts of email data. Data cleaning is an activity performed on the data sets of a data warehouse to upgrade and maintain the quality and consistency of the data. This paper highlights the issues related to dirty data and the detection of duplicates in the email column. It examines data cleaning strategies from different points of view and provides an algorithm for discovering errors and duplicate entries in the data sets of an existing data warehouse. The paper characterizes rules based on the concept of mathematical association rules to determine duplicate entries in the email column of the data sets.
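The paper's association-rule formulation is not reproduced in the abstract; the following is only a minimal sketch of the underlying task, finding rows whose email values collide after normalization, using hypothetical sample data.

```python
# Hypothetical sketch of duplicate detection in an email column (not the
# paper's algorithm): map each address to a normalized key, then report
# keys shared by more than one row.
from collections import defaultdict

def find_duplicate_emails(rows):
    groups = defaultdict(list)
    for i, email in enumerate(rows):
        key = email.strip().lower()        # case/whitespace-insensitive key
        groups[key].append(i)
    return {k: idx for k, idx in groups.items() if len(idx) > 1}

rows = ["Ann@Example.com", "bob@example.com", " ann@example.com "]
print(find_duplicate_emails(rows))  # {'ann@example.com': [0, 2]}
```

A rule-based approach like the paper's would generalize this key function, e.g. treating addresses as duplicates when several attribute-level rules fire together rather than on exact key equality.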


Neurosurgery ◽  
2003 ◽  
Vol 52 (6) ◽  
pp. 1499-1503
Author(s):  
Cole A. Giller ◽  
Scott J. Clamp

Abstract OBJECTIVE Although radiosurgical practice mandates meticulous radiological follow-up, even the most efficient radiology department can be overwhelmed by the large number of radiosurgical patients who have undergone diagnostic studies for many years at many different institutions to follow many separate lesions. Although the task of assembling these studies is theoretically possible, because they are spread out in time and space, it is often impractical. We therefore sought to construct a computer-based system that could store images from multiple sources and present them instantly for review. METHODS We attached a flatbed film scanner to a standard desktop computer in our clinic and scanned selected sheets of film into an image database at each visit of a radiosurgical patient. “Low-tech” solutions were deliberately chosen—that is, to enhance ease and software compatibility, we used the operating system's directory structure for organization of data instead of proprietary software. Standard commercially available software was used to review studies that had been previously scanned. RESULTS During a 2- to 3-year period, images were scanned from 1129 studies performed on 435 patients. Images could be reviewed instantly and compared with current studies, and scanning a single piece of film required approximately 30 seconds. We estimate that the current capacity of our computer memory will satisfy our needs for approximately 12 years. CONCLUSION Assembly of an efficient and inexpensive system for image storage and retrieval suitable for radiosurgical practice is feasible and straightforward. Although our system is not a substitute for a radiology department, it obviates the constant frustration of “finding the films” and has become an essential part of our radiosurgical practice.


2019 ◽  
Vol 1 (1) ◽  
pp. 17-29
Author(s):  
Putri Anggraini ◽  
Dio Prima Mulya

Existing data processing at the Police Station of the Republic of Indonesia Resort (POLRES) Pasaman is still largely done manually, whether recording, storing, or retrieving the administrative data needed to issue a driver's license. To check or find the required data, staff must first search through the archives in the data storage cabinet, which takes a long time. The authors analyze and design the information system infrastructure to be built, the navigation structure, the database used, the programming language used, and the integration of these parts. The application for issuing driver's licenses was built using the Java programming language. From the results of this study, it can be concluded that the process of issuing a driver's license can be well computerized.


2018 ◽  
Vol 41 ◽  
Author(s):  
Benjamin C. Ruisch ◽  
Rajen A. Anderson ◽  
David A. Pizarro

Abstract We argue that existing data on folk-economic beliefs (FEBs) present challenges to Boyer & Petersen's model. Specifically, the widespread individual variation in endorsement of FEBs casts doubt on the claim that humans are evolutionarily predisposed towards particular economic beliefs. Additionally, the authors' model cannot account for the systematic covariance between certain FEBs, such as those observed in distinct political ideologies.


1975 ◽  
Vol 26 ◽  
pp. 341-380 ◽  
Author(s):  
R. J. Anderle ◽  
M. C. Tanenbaum

Abstract Observations of artificial earth satellites provide a means of establishing an origin, orientation, scale, and control points for a coordinate system. Neither existing data nor future data are likely to provide significant information on the 0.001″ angle between the axis of angular momentum and the axis of rotation. Existing data have provided data to about 0.01″ accuracy on the pole position, and to possibly a meter on the origin of the system and for control points. The longitude origin is essentially arbitrary. While these accuracies permit acquisition of useful data on tides and polar motion through dynamic analyses, they are inadequate for determination of crustal motion or significant improvement in polar motion. The limitations arise from gravity, drag, and radiation forces on the satellites as well as from instrument errors. Improvements in laser equipment and the launch of the dense LAGEOS satellite in an orbit high enough to suppress significant gravity and drag errors will permit determination of crustal motion and more accurate, higher-frequency polar motion. However, the reference frame for the results is likely to be an average reference frame defined by the observing stations, so significant corrections will have to be determined for the effects of changes in station configuration and data losses.


1988 ◽  
Vol 102 ◽  
pp. 107-110
Author(s):  
A. Burgess ◽  
H.E. Mason ◽  
J.A. Tully

Abstract A new way of critically assessing and compacting data for electron impact excitation of positive ions is proposed. This method allows one (i) to detect possible printing and computational errors in published tables, (ii) to interpolate and extrapolate the existing data as a function of energy or temperature, and (iii) to simplify considerably the storage and transfer of data without significant loss of information. Theoretical or experimental collision strengths Ω(E) are scaled and then plotted as functions of the colliding electron energy, the entire range of which is conveniently mapped onto the interval (0,1). For a given transition, the scaled Ω can be accurately represented, usually to within a fraction of a percent, by a 5-point least-squares spline. Further details are given in (2). Similar techniques enable thermally averaged collision strengths Υ(T) to be obtained at arbitrary temperatures in the interval 0 < T < ∞. Application of the method is possible by means of an interactive program with graphical display (2). To illustrate this practical procedure, we use the program to treat Ω for the optically allowed transition 2s → 2p in Ar XVI.
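The abstract does not spell out the scaling; one common form for an optically allowed transition, as later formalized in the Burgess–Tully approach, maps the whole energy range onto (0,1) with an adjustable parameter C > 1 and transition energy E_ij (assumed notation, not taken from the abstract):

```latex
% Reduced variables for an optically allowed transition:
% x runs from 0 (threshold, E = 0) to 1 (E -> infinity),
% and y removes the logarithmic divergence of Omega at high energy.
x = 1 - \frac{\ln C}{\ln\!\left(E/E_{ij} + C\right)}, \qquad
y(x) = \frac{\Omega(E)}{\ln\!\left(E/E_{ij} + e\right)}
```

The slowly varying function y(x) on the finite interval is what the 5-point least-squares spline then represents compactly.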


Author(s):  
Sterling P. Newberry

At the 1958 meeting of our society, then known as EMSA, the author introduced the concept of microspace and suggested its use to provide adequate information storage space, with electron microscope techniques providing storage and retrieval access. At this current meeting of MSA, he wishes to suggest an additional use of the power of the electron microscope. The author has been contemplating this new use for some time and would have suggested it in the EMSA fiftieth-year commemorative volume, but for page limitations. There is compelling reason to put forth this suggestion today because problems have arisen in the “Standard Model” of particle physics and funds are being greatly reduced just as we need higher-energy machines to resolve these problems. Therefore, any techniques which complement or augment what we can accomplish during this austerity period with the machines at hand are worth exploring.

