Data Streams

Author(s):  
João Gama ◽  
Pedro Pereira Rodrigues

Nowadays, databases are required to store massive amounts of data that are continuously inserted and queried. Organizations use decision support systems to identify potentially useful patterns in their data. Data analysis is complex, interactive, and exploratory over very large volumes of historic data, eventually stored in distributed environments. What distinguishes current data sources from earlier ones is the continuous flow of data and the automatic data feeds. We no longer just have people entering information into a computer; instead, we have computers entering data into each other (Muthukrishnan, 2005). Some illustrative examples of the magnitude of today's data include: 3 billion telephone calls per day, 4 gigabytes of data from radio telescopes every night, 30 billion emails per day, 1 billion SMS messages, 5 gigabytes of satellite data per day, and 70 billion IP network-traffic records per day. In these applications, data is best modelled not as persistent tables but rather as transient data streams. In some applications it is not feasible to load the arriving data into a traditional Database Management System (DBMS), and traditional DBMSs are not designed to directly support the continuous queries required by these applications (Alon et al., 1996; Babcock et al., 2002; Cormode & Muthukrishnan, 2003). These sources of data are called data streams. Computers play a much more active role in current trends in decision support and data analysis: data mining algorithms search for hypotheses, evaluate them, and suggest patterns. The pattern discovery process requires online ad hoc queries, not previously defined, that are successively refined. Due to the exploratory nature of these queries, an exact answer may not be required: a user may prefer a fast but approximate answer to an exact but slow one. Processing queries over streams requires radically different types of algorithms. Range queries and selectivity estimation (the proportion of tuples that satisfy a query) are two illustrative examples where fast but approximate answers are more useful than slow and exact ones. Approximate answers are obtained from small data structures (synopses) attached to the database that summarize information and can be updated incrementally.
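The closing point about incrementally updated synopses can be illustrated with a minimal sketch of a Count-Min sketch, in the spirit of the cited Cormode & Muthukrishnan work: a fixed-size structure that is updated once per arriving stream element and answers approximate frequency (point) queries. The width, depth, hashing scheme, and toy stream below are illustrative assumptions, not details taken from the abstract.

```python
# Minimal Count-Min sketch: a small synopsis updated incrementally per stream
# element, answering approximate frequency queries in bounded memory.
import hashlib


class CountMinSketch:
    def __init__(self, width: int = 1024, depth: int = 5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item: str):
        # One bucket per row, derived from independent-looking hashes.
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def update(self, item: str, count: int = 1):
        # Incremental update: O(depth) work per arriving element.
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item: str) -> int:
        # Approximate answer: never underestimates the true frequency.
        return min(self.table[row][col] for row, col in self._buckets(item))


if __name__ == "__main__":
    cms = CountMinSketch()
    for caller in ["alice", "bob", "alice", "carol", "alice"]:
        cms.update(caller)
    print(cms.estimate("alice"))  # close to 3, possibly slightly higher
```

The same fast-but-approximate trade-off described in the abstract applies here: the estimate can overshoot the true count, but each query and update touches only a few counters regardless of stream length.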

2020 ◽  
Vol 1 (2) ◽  
pp. 60-66
Author(s):  
Oleksandr Shelest ◽  
Bella Holub

Today, most organizations use databases, or at worst text documents and spreadsheet files, as sources for data analysis, which hinders correct, error-free analysis: at best, the data must be constantly adjusted because of ambiguities and inaccuracies. The subject of this study is the intelligent analysis (data mining) of outdoor advertising data. The foundation of successful data analysis is correct data storage, which is the basis for clear analysis. Modern computer systems and computer networks allow the accumulation of large arrays of data for processing and analysis. Unfortunately, the machine form of data representation holds the information a person needs only in a hidden form, and special methods of data analysis are required to extract it. To obtain the desired result, one needs to create not just a database but a data warehouse with a special storage structure. A data warehouse allows data to be collected from various sources (databases, table files, and so on), stored over its entire history, and, unlike conventional databases, used to build systems for fast and accurate data analysis. The data warehouse is the basis for building decision support systems. Operational data is checked, cleaned, and aggregated before entering the data warehouse, and such integrated data is much easier to analyze. Different sources of operational data may describe the same subject area from different points of view (for example, from the point of view of accounting, inventory control, or the planning department). A decision made on the basis of only one point of view can be ineffective or even erroneous. The goal is to use a data warehouse to integrate information that reflects different perspectives on the same subject area. Subject orientation also allows the data warehouse to store only the data needed for analysis, and it significantly increases the speed of data access, both through the possible redundancy of the stored information and through the exclusion of modification operations. Conclusion: the decision support system will ensure reliable storage of large amounts of data, and it will also be tasked with preventing unauthorized access, data backup, archiving, and so on.
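The "check, clean, aggregate, then load" flow described above can be sketched minimally as follows. The source names (accounting, inventory), the outdoor-advertising fields, and the fact-table layout are hypothetical assumptions chosen for illustration, not a schema prescribed by the article.

```python
# A minimal ETL sketch: integrate two operational viewpoints on the same
# subject area into one append-only, read-optimized warehouse fact table.
import sqlite3
from collections import defaultdict

# Operational records from two sources describing the same subject area.
accounting = [
    {"product": "billboard-A", "month": "2020-01", "revenue": 1200.0},
    {"product": "billboard-A", "month": "2020-01", "revenue": None},  # dirty
    {"product": "billboard-B", "month": "2020-01", "revenue": 800.0},
]
inventory = [
    {"product": "billboard-A", "month": "2020-01", "units_rented": 3},
    {"product": "billboard-B", "month": "2020-01", "units_rented": 2},
]

# Check and clean: drop records that fail basic validation.
accounting = [r for r in accounting if r["revenue"] is not None]

# Aggregate and integrate both viewpoints into one fact per (product, month).
facts = defaultdict(lambda: {"revenue": 0.0, "units_rented": 0})
for r in accounting:
    facts[(r["product"], r["month"])]["revenue"] += r["revenue"]
for r in inventory:
    facts[(r["product"], r["month"])]["units_rented"] += r["units_rented"]

# Load into the warehouse table: historical, append-only, no modifications.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE fact_rentals "
    "(product TEXT, month TEXT, revenue REAL, units_rented INTEGER)"
)
con.executemany(
    "INSERT INTO fact_rentals VALUES (?, ?, ?, ?)",
    [(p, m, v["revenue"], v["units_rented"]) for (p, m), v in facts.items()],
)
print(con.execute("SELECT * FROM fact_rentals ORDER BY product").fetchall())
```

Keeping the warehouse table append-only mirrors the point in the abstract: excluding modification operations (and tolerating some redundancy) is what makes fast, repeatable analytical queries possible.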


Author(s):  
Ambar Widianingrum ◽  
Joko Sulianto ◽  
Rahmat Rais

The purpose of this study was to assess the feasibility of teaching materials based on an open-ended approach for improving the reasoning abilities of fourth-grade elementary school students. The study is of the research and development (R&D) type, and its subjects were three classroom teachers. The data analysis techniques used were descriptive qualitative analysis (data reduction, data presentation, and conclusion drawing) and descriptive quantitative analysis. Stage 1 media validation yielded a score of 84.8%, and stage 2 media validation yielded 94.8%. Stage 1 material validation yielded 84.6%, and stage 2 material validation yielded 93.3%. The initial field trials scored the media at 93.7% and the material at 92.3%. This shows that the teaching material is valid and suitable for use. Based on these results, it can be suggested that teaching materials based on an open-ended approach can serve as a teaching tool and a learning resource for students.


Electronics ◽  
2021 ◽  
Vol 10 (15) ◽  
pp. 1771
Author(s):  
Ferdinando Di Martino ◽  
Irina Perfilieva ◽  
Salvatore Sessa

The fuzzy transform is a technique for approximating a function of one or more variables that researchers have applied to a variety of image and data analysis tasks. In this work we present a survey of fuzzy transform methods proposed in recent years across different data mining disciplines, such as the detection of relationships between features, the extraction of association rules, time series analysis, and data classification. After giving the definition of the fuzzy transform in one or more dimensions, in which the constraint of sufficient data density with respect to the fuzzy partition is also explored, we analyze the data analysis approaches recently proposed in the literature that are based on the fuzzy transform. In particular, we examine the strategies these approaches adopt to manage the constraint of sufficient data density, and we compare the performance results obtained with those measured by other methods in the literature. The last section is dedicated to final considerations and future scenarios for using the fuzzy transform in the analysis of massive and high-dimensional data.
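For readers unfamiliar with the method, here is a minimal sketch of the one-dimensional F-transform with a uniform triangular fuzzy partition. The function names, the number of basic functions, and the noisy toy signal are illustrative assumptions; the error raised in `direct_ft` corresponds to the "sufficient data density" constraint discussed above (every basic function must cover at least one data point).

```python
# One-dimensional fuzzy (F-) transform: direct components and inverse
# reconstruction over a uniform triangular fuzzy partition of [a, b].
import numpy as np


def triangular_partition(a: float, b: float, n: int):
    """Nodes and triangular basic functions A_1..A_n forming a uniform
    fuzzy partition of [a, b] (they sum to 1 inside the interval)."""
    nodes = np.linspace(a, b, n)
    h = nodes[1] - nodes[0]

    def basic(k):
        def A(x):
            return np.clip(1.0 - np.abs(x - nodes[k]) / h, 0.0, None)
        return A

    return nodes, [basic(k) for k in range(n)]


def direct_ft(x, y, basics):
    """Direct F-transform: F_k = sum(y * A_k(x)) / sum(A_k(x))."""
    comps = []
    for A in basics:
        w = A(x)
        if w.sum() == 0:
            raise ValueError("data density insufficient for this partition")
        comps.append(float((y * w).sum() / w.sum()))
    return np.array(comps)


def inverse_ft(x_new, comps, basics):
    """Inverse F-transform: weighted sum of the components."""
    return sum(F * A(x_new) for F, A in zip(comps, basics))


if __name__ == "__main__":
    x = np.linspace(0, 2 * np.pi, 200)
    y = np.sin(x) + 0.1 * np.random.randn(x.size)     # noisy toy signal
    _, basics = triangular_partition(0, 2 * np.pi, 9)
    F = direct_ft(x, y, basics)                       # compact representation
    y_hat = inverse_ft(x, F, basics)                  # smooth approximation
```

The nine components `F` act as a compressed, smoothed representation of the 200 samples, which is the property the surveyed approaches exploit for feature extraction, time series analysis, and classification.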


2018 ◽  
Vol 46 (1) ◽  
pp. 32-39 ◽  
Author(s):  
Ilana G. Raskind ◽  
Rachel C. Shelton ◽  
Dawn L. Comeau ◽  
Hannah L. F. Cooper ◽  
Derek M. Griffith ◽  
...  

Data analysis is one of the most important, yet least understood, stages of the qualitative research process. Through rigorous analysis, data can illuminate the complexity of human behavior, inform interventions, and give voice to people’s lived experiences. While significant progress has been made in advancing the rigor of qualitative analysis, the process often remains nebulous. To better understand how our field conducts and reports qualitative analysis, we reviewed qualitative articles published in Health Education & Behavior between 2000 and 2015. Two independent reviewers abstracted information in the following categories: data management software, coding approach, analytic approach, indicators of trustworthiness, and reflexivity. Of the 48 (n = 48) articles identified, the majority (n = 31) reported using qualitative software to manage data. Double-coding transcripts was the most common coding method (n = 23); however, nearly one third of articles did not clearly describe the coding approach. Although the terminology used to describe the analytic process varied widely, we identified four overarching trajectories common to most articles (n = 37). Trajectories differed in their use of inductive and deductive coding approaches, formal coding templates, and rounds or levels of coding. Trajectories culminated in the iterative review of coded data to identify emergent themes. Few articles explicitly discussed trustworthiness or reflexivity. Member checks (n = 9), triangulation of methods (n = 8), and peer debriefing (n = 7) were the most common procedures. Variation in the type and depth of information provided poses challenges to assessing quality and enabling replication. Greater transparency and more intentional application of diverse analytic methods can advance the rigor and impact of qualitative research in our field.


2005 ◽  
Vol 21 (2) ◽  
pp. 284-291 ◽  
Author(s):  
Zhongming Chen ◽  
Lin Liu

Spot images from DNA microarrays strongly affect the discovery of biological knowledge from gene expression data. However, results from quality analysis, normalization, differential expression, and cluster analysis are rarely validated against spot images in current data analysis methods or software packages. We designed RealSpot, a software package, to validate these results by directly associating spot quality and data with spot images in a spreadsheet table. RealSpot splits hybridization images into individual spots stored in a spreadsheet table. It subsequently associates the microarray data with the spot images and performs data validation through standard table operations such as sorting, searching, and editing. RealSpot has several built-in functions to facilitate data validation, including spot quality analysis, data organization, one-way ANOVA, gene ontology association, verification, import, and export. We used RealSpot to evaluate 77 slides (30,000 features each) from real hybridization experiments and to validate the results from each step of data analysis. It took about 10 minutes to validate the spot quality results after the initial evaluation and to correct the ∼0.3% of falsely assigned qualities among 10,000 spots. We validated 1,641 of 2,110 differentially expressed genes identified by SAM analysis in about half an hour by comparing each gene with its respective spot image. Furthermore, we found that 6 of 48 genes in one cluster obtained with the k-means clustering method showed trends inconsistent with their spot images. RealSpot is efficient for validating microarray results and is thus helpful for improving the reliability of the whole microarray experiment for experimentalists.
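The kind of cross-check described above, independent of RealSpot's own interface, can be illustrated with a short generic sketch: flag genes that were called differentially expressed but whose underlying spots have poor quality scores, so that their spot images can be reviewed. The column names and the 0.5 quality threshold are hypothetical choices for illustration, not values taken from the article.

```python
# Generic validation sketch (not RealSpot's API): cross-reference differential
# expression calls with spot quality scores, then sort the suspect spots so
# their images can be inspected first, much like sorting a spreadsheet table.
spots = [
    # gene, spot quality score in [0, 1], called differentially expressed?
    {"gene": "G1", "quality": 0.95, "de_by_sam": True},
    {"gene": "G2", "quality": 0.30, "de_by_sam": True},   # suspicious call
    {"gene": "G3", "quality": 0.88, "de_by_sam": False},
]

QUALITY_THRESHOLD = 0.5  # assumed cutoff for "needs visual review"

to_review = sorted(
    (s for s in spots if s["de_by_sam"] and s["quality"] < QUALITY_THRESHOLD),
    key=lambda s: s["quality"],
)
for s in to_review:
    print(f"review spot image for {s['gene']} (quality={s['quality']:.2f})")
```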

