Size matters

2014 ◽  
Vol 24 (3) ◽  
pp. 224-237 ◽  
Author(s):  
Valerie Johnson ◽  
Sonia Ranade ◽  
David Thomas

Purpose – This paper aims to focus on a highly significant yet under-recognised concern: the huge growth in the volume of digital archival information and the implications of this shift for information professionals. Design/methodology/approach – Though data loss and format obsolescence are often considered to be the major threats to digital records, the problem of scale remains under-acknowledged. This paper discusses this issue and the challenges it brings, using a case study of a set of Second World War service records. Findings – TNA’s research has shown that it is possible to digitise large volumes of records to replace paper originals using rigorous procedures. Consequent benefits included being able to link across large data sets so that further records could be released. Practical implications – The authors discuss whether the technical capability, plus space and cost savings, will result in increased pressure to retain, and what this means in creating a feedback loop of volume. Social implications – The work also has implications for new definitions of the “original” archival record. There has been much debate on challenges to the definition of the archival record in the shift from paper to born-digital. The authors discuss where this leaves the digitised “original” record. Originality/value – Large volumes of digitised and born-digital records are starting to arrive in records and archive stores, and the implications for retention are far wider than simply digital preservation. By sharing novel research into the practical implications of large-scale data retention, this paper showcases potential issues and some approaches to their management.

2021 ◽  
Vol 15 ◽  
Author(s):  
Jianwei Zhang ◽  
Xubin Zhang ◽  
Lei Lv ◽  
Yining Di ◽  
Wei Chen

Background: Learning discriminative representations from large-scale data sets has achieved breakthroughs over recent decades. However, generating representative embeddings from limited examples, for example a class containing only one image, remains a thorny problem. Recently, deep learning-based Few-Shot Learning (FSL) has been proposed. It tackles this problem by leveraging prior knowledge in various ways. Objective: In this work, we review recent advances in FSL from the perspective of high-dimensional representation learning. The results of the analysis can provide insights and directions for future work. Methods: We first present the definition of general FSL. Then we propose a general framework for the FSL problem and give the taxonomy under the framework. We survey two FSL directions: learning policy and meta-learning. Results: We review the advanced applications of FSL, including image classification, object detection, image segmentation, and other tasks, as well as the corresponding benchmarks, to provide an overview of recent progress. Conclusion: FSL needs to be further studied in medical imaging, language models, and reinforcement learning in future work. In addition, cross-domain FSL, successive FSL, and associated FSL are more challenging and valuable research directions.
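
As a concrete illustration of the few-shot setting the review covers, the sketch below samples N-way K-shot episodes (a small labelled support set plus a query set) from a pool of examples; the array shapes and toy data are illustrative assumptions rather than anything taken from the surveyed papers.

```python
import numpy as np

def sample_episode(features, labels, n_way=5, k_shot=1, n_query=15, rng=None):
    """Sample one N-way K-shot episode (support + query indices) from a labelled pool.

    features : (num_examples, dim) array of precomputed embeddings
    labels   : (num_examples,) integer class labels
    """
    if rng is None:
        rng = np.random.default_rng()
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    support, query = [], []
    for cls in classes:
        idx = rng.permutation(np.where(labels == cls)[0])
        support.append(idx[:k_shot])                  # K labelled examples per class
        query.append(idx[k_shot:k_shot + n_query])    # held-out examples to classify
    return classes, np.concatenate(support), np.concatenate(query)

# Toy pool: 100 examples, 10 classes, 10 examples per class, 1-shot episodes.
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 64))
labs = np.repeat(np.arange(10), 10)
classes, sup_idx, qry_idx = sample_episode(feats, labs, n_way=5, k_shot=1, n_query=5, rng=rng)
print(classes, sup_idx.shape, qry_idx.shape)
```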


2020 ◽  
Vol 20 (6) ◽  
pp. 5-17
Author(s):  
Hrachya Astsatryan ◽  
Aram Kocharyan ◽  
Daniel Hagimont ◽  
Arthur Lalayan

Abstract The optimization of large-scale data sets depends on the technologies and methods used. The MapReduce model, implemented on Apache Hadoop or Spark, allows splitting large data sets into a set of blocks distributed across several machines. Data compression reduces data size and transfer time between disks and memory but requires additional processing. Therefore, finding an optimal tradeoff is a challenge, as a high compression factor may underload Input/Output but overload the processor. The paper aims to present a system enabling the selection of the compression tools and tuning of the compression factor to reach the best performance in Apache Hadoop and Spark infrastructures, based on simulation analyses.
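
A minimal sketch of the compression tradeoff described above, assuming a simple benchmark-and-score approach rather than the paper's actual simulation-based system: each candidate codec and level is scored by weighting compressed size against CPU time, and the best-scoring setting is selected. The codecs, weighting, and toy data are illustrative assumptions.

```python
import time
import zlib
import bz2
import lzma

def benchmark(data, codecs, io_weight=0.5):
    """Score each (codec, level): lower is better. io_weight is a toy knob trading
    I/O savings (compression ratio) against CPU cost (seconds to compress)."""
    results = []
    for name, compress, level in codecs:
        start = time.perf_counter()
        compressed = compress(data, level)
        cpu_seconds = time.perf_counter() - start
        ratio = len(compressed) / len(data)          # smaller = less disk/network traffic
        score = io_weight * ratio + (1 - io_weight) * cpu_seconds
        results.append((score, name, level, ratio, cpu_seconds))
    return sorted(results)

# Toy input standing in for one HDFS/RDD block.
block = b"large-scale data sets, mapreduce, spark " * 50_000
candidates = [("zlib", zlib.compress, 1),
              ("zlib", zlib.compress, 9),
              ("bz2", bz2.compress, 9),
              ("lzma", lambda d, lvl: lzma.compress(d, preset=lvl), 6)]
best = benchmark(block, candidates)[0]
print("best tradeoff:", best[1], "level", best[2])
```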


2020 ◽  
Vol 4 (1) ◽  
pp. 31-44
Author(s):  
Kai Zheng ◽  
Xianjun Yang ◽  
Yilei Wang ◽  
Yingjie Wu ◽  
Xianghan Zheng

Purpose The purpose of this paper is to alleviate the problem of poor robustness and over-fitting caused by large-scale data in collaborative filtering recommendation algorithms. Design/methodology/approach Interpreting user behavior from the probabilistic perspective of hidden variables is helpful to improve robustness and over-fitting problems. Constructing a recommendation network by variational inference can effectively solve the complex distribution calculation in the probabilistic recommendation model. Based on the aforementioned analysis, this paper uses a variational auto-encoder to construct a generative network, which can reconstruct user-rating data to address the problem of poor robustness and over-fitting caused by large-scale data. Meanwhile, for the existing KL-vanishing problem in the variational inference deep learning model, this paper optimizes the model by the KL annealing and Free Bits methods. Findings The effect of the basic model is considerably improved after using the KL annealing or Free Bits method to solve KL vanishing. The proposed models perform noticeably worse than competitors on small data sets, such as MovieLens 1M. By contrast, they have better effects on large data sets such as MovieLens 10M and MovieLens 20M. Originality/value This paper presents the usage of the variational inference model for collaborative filtering recommendation and introduces the KL annealing and Free Bits methods to improve the basic model effect. Because variational inference learns the probability distribution of the hidden vector, the problem of poor robustness and over-fitting is alleviated. When the amount of data is relatively large in the actual application scenario, the probability distribution fitted to the actual data can better represent the user and the item. Therefore, using variational inference for collaborative filtering recommendation is of practical value.
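
The KL annealing and Free Bits techniques mentioned above both act on the KL term of the variational objective. The PyTorch-style sketch below shows one common form of each; the annealing schedule and free-bits threshold are assumed hyper-parameters, not the paper's reported settings.

```python
import torch

def kl_per_dim(mu, logvar):
    """KL( q(z|x) || N(0, I) ) per latent dimension, averaged over the batch."""
    return 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).mean(dim=0)

def vae_kl_loss(mu, logvar, step, anneal_steps=10_000, free_bits=0.25):
    """KL part of the ELBO with linear KL annealing and a Free Bits floor."""
    beta = min(1.0, step / anneal_steps)      # KL annealing: ramp the KL weight from 0 to 1
    kl = kl_per_dim(mu, logvar)
    kl = torch.clamp(kl, min=free_bits)       # Free Bits: stop penalizing dims below the floor
    return beta * kl.sum()

# Toy usage with a batch of 64 and a 32-dimensional latent space.
mu = torch.randn(64, 32)
logvar = torch.randn(64, 32)
print(vae_kl_loss(mu, logvar, step=2_500).item())
```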


2018 ◽  
Vol 34 (1) ◽  
pp. 70-76 ◽  
Author(s):  
Jim Hahn ◽  
Courtney McDonald

Purpose This paper aims to introduce a machine learning-based “My Account” recommender for implementation in open discovery environments such as VuFind among others. Design/methodology/approach The approach to implementing machine learning-based personalized recommenders is undertaken as applied research leveraging data streams of transactional checkout data from discovery systems. Findings The authors discuss the need for large data sets from which to build an algorithm and introduce a prototype recommender service, describing the prototype’s data flow pipeline and machine learning processes. Practical implications The browse paradigm of discovery has neglected to leverage discovery system data to inform the development of personalized recommendations; with this paper, the authors show novel approaches to providing enhanced browse functionality by way of a user account. Originality/value In the age of big data and machine learning, advances in deep learning technology and data stream processing make it possible to leverage discovery system data to inform the development of personalized recommendations.
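
As a rough illustration of how transactional checkout data can drive a “My Account” recommender, the sketch below builds a simple item co-occurrence model and suggests items frequently borrowed alongside a patron's history; the data layout and scoring are illustrative assumptions, not the prototype's actual data flow pipeline.

```python
from collections import Counter, defaultdict

def build_cooccurrence(checkouts):
    """checkouts: iterable of (patron_id, item_id). Returns item -> Counter of co-borrowed items."""
    by_patron = defaultdict(set)
    for patron, item in checkouts:
        by_patron[patron].add(item)
    co = defaultdict(Counter)
    for items in by_patron.values():
        for a in items:
            for b in items:
                if a != b:
                    co[a][b] += 1
    return co

def recommend(co, borrowed, k=3):
    """Score candidates by how often they co-occur with the user's borrowing history."""
    scores = Counter()
    for item in borrowed:
        scores.update(co.get(item, Counter()))
    for item in borrowed:                      # never recommend what the user already has
        scores.pop(item, None)
    return [item for item, _ in scores.most_common(k)]

# Toy transaction log: (patron, item) pairs standing in for a discovery-system data stream.
log = [(1, "intro-python"), (1, "data-mining"), (2, "data-mining"),
       (2, "machine-learning"), (3, "intro-python"), (3, "machine-learning")]
co = build_cooccurrence(log)
print(recommend(co, {"intro-python"}))
```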


2020 ◽  
Author(s):  
Isha Sood ◽  
Varsha Sharma

Essentially, data mining concerns the analysis of data and the identification of patterns and trends in the information so that decisions or judgements can be made. Data mining concepts have been in use for years, but with the emergence of big data, they are even more common. In particular, the scalable mining of such large data sets is a difficult problem that has attracted several recent research efforts. A few of these recent works use the MapReduce methodology to construct data mining models across the data set. In this article, we examine current approaches to large-scale data mining and compare their output to the MapReduce model. Based on our research, a system for data mining that combines MapReduce and sampling is implemented and discussed.
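
A minimal sketch of the MapReduce-plus-sampling idea, assuming a single-machine stand-in for the distributed setting: a random sample of transactions is drawn, a map phase emits (item, 1) pairs, and a reduce phase sums them to find frequent items. The sampling rate, record format, and support threshold are illustrative assumptions.

```python
import random
from collections import Counter
from itertools import chain

def map_phase(transactions):
    """Map: emit (item, 1) pairs for every item in every transaction."""
    return (((item, 1) for item in t) for t in transactions)

def reduce_phase(mapped):
    """Reduce: sum the counts emitted for each item key."""
    counts = Counter()
    for item, one in chain.from_iterable(mapped):
        counts[item] += one
    return counts

def frequent_items(transactions, sample_rate=0.1, min_support=2, seed=42):
    """Mine frequent items on a random sample instead of the full data set."""
    rng = random.Random(seed)
    sample = [t for t in transactions if rng.random() < sample_rate]
    counts = reduce_phase(map_phase(sample))
    return {item: c for item, c in counts.items() if c >= min_support}

# Toy transaction log standing in for a large distributed data set.
data = [["bread", "milk"], ["bread", "butter"], ["milk", "butter"],
        ["bread", "milk", "butter"]] * 100
print(frequent_items(data, sample_rate=0.1, min_support=5))
```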


Author(s):  
Lior Shamir

Abstract Several recent observations using large data sets of galaxies showed non-random distribution of the spin directions of spiral galaxies, even when the galaxies are too far from each other to have gravitational interaction. Here, a data set of $\sim8.7\cdot10^3$ spiral galaxies imaged by Hubble Space Telescope (HST) is used to test and profile a possible asymmetry between galaxy spin directions. The asymmetry between galaxies with opposite spin directions is compared to the asymmetry of galaxies from the Sloan Digital Sky Survey. The two data sets contain different galaxies at different redshift ranges, and each data set was annotated using a different annotation method. The results show that both data sets exhibit a similar asymmetry in the COSMOS field, which is covered by both telescopes. Fitting the asymmetry of the galaxies to cosine dependence shows a dipole axis with probabilities of $\sim2.8\sigma$ and $\sim7.38\sigma$ in HST and SDSS, respectively. The most likely dipole axis identified in the HST galaxies is at $(\alpha=78^{\rm o},\delta=47^{\rm o})$ and is well within the $1\sigma$ error range of the location of the most likely dipole axis in the SDSS galaxies with $z>0.15$, identified at $(\alpha=71^{\rm o},\delta=61^{\rm o})$.
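
The cosine fit described above can be sketched as a scan over candidate dipole axes: for each axis, the spin labels are fit to $a + b\cos\theta$, where $\theta$ is the angle between a galaxy and the axis, and the axis with the strongest amplitude is reported. The coordinates, grid, and injected toy signal below are illustrative assumptions, not the paper's catalogue or pipeline.

```python
import numpy as np

def unit_vector(ra_deg, dec_deg):
    """Convert equatorial coordinates (degrees) to a 3D unit vector."""
    ra, dec = np.radians(ra_deg), np.radians(dec_deg)
    return np.stack([np.cos(dec) * np.cos(ra), np.cos(dec) * np.sin(ra), np.sin(dec)], axis=-1)

def dipole_fit(ra, dec, spin, axis_ra, axis_dec):
    """Least-squares fit of spin labels (+1/-1) to a + b*cos(theta) for one candidate axis.
    Returns the fitted amplitude b; larger |b| means a stronger dipole along that axis."""
    cos_theta = unit_vector(ra, dec) @ unit_vector(axis_ra, axis_dec)
    design = np.stack([np.ones_like(cos_theta), cos_theta], axis=1)
    (a, b), *_ = np.linalg.lstsq(design, spin, rcond=None)
    return b

# Toy catalogue: random sky positions with a weak injected dipole toward (ra=80, dec=50).
rng = np.random.default_rng(1)
ra = rng.uniform(0, 360, 5000)
dec = np.degrees(np.arcsin(rng.uniform(-1, 1, 5000)))
p_cw = 0.5 + 0.05 * (unit_vector(ra, dec) @ unit_vector(80.0, 50.0))
spin = np.where(rng.random(5000) < p_cw, 1.0, -1.0)

# Scan a coarse grid of candidate axes and report the strongest dipole.
grid = [(a, d) for a in range(0, 360, 20) for d in range(-80, 81, 20)]
best = max(grid, key=lambda ax: abs(dipole_fit(ra, dec, spin, *ax)))
print("most likely dipole axis (toy data):", best)
```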


GigaScience ◽  
2020 ◽  
Vol 9 (1) ◽  
Author(s):  
T Cameron Waller ◽  
Jordan A Berg ◽  
Alexander Lex ◽  
Brian E Chapman ◽  
Jared Rutter

Abstract Background Metabolic networks represent all chemical reactions that occur between molecular metabolites in an organism’s cells. They offer biological context in which to integrate, analyze, and interpret omic measurements, but their large scale and extensive connectivity present unique challenges. While it is practical to simplify these networks by placing constraints on compartments and hubs, it is unclear how these simplifications alter the structure of metabolic networks and the interpretation of metabolomic experiments. Results We curated and adapted the latest systemic model of human metabolism and developed customizable tools to define metabolic networks with and without compartmentalization in subcellular organelles and with or without inclusion of prolific metabolite hubs. Compartmentalization made networks larger, less dense, and more modular, whereas hubs made networks larger, more dense, and less modular. When present, these hubs also dominated shortest paths in the network, yet their exclusion exposed the subtler prominence of other metabolites that are typically more relevant to metabolomic experiments. We applied the non-compartmental network without metabolite hubs in a retrospective, exploratory analysis of metabolomic measurements from 5 studies on human tissues. Network clusters identified individual reactions that might experience differential regulation between experimental conditions, several of which were not apparent in the original publications. Conclusions Exclusion of specific metabolite hubs exposes modularity in both compartmental and non-compartmental metabolic networks, improving detection of relevant clusters in omic measurements. Better computational detection of metabolic network clusters in large data sets has potential to identify differential regulation of individual genes, transcripts, and proteins.
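
The effect of excluding prolific metabolite hubs can be illustrated with networkx: remove the highest-degree nodes and compare density and modularity before and after. The toy reactions and the degree cutoff used to call a node a hub are illustrative assumptions, not the curated human metabolic model or the authors' tools.

```python
import networkx as nx
from networkx.algorithms import community

def summarize(graph):
    """Report size, density, and modularity of a greedy community partition."""
    parts = community.greedy_modularity_communities(graph)
    return {"nodes": graph.number_of_nodes(),
            "density": round(nx.density(graph), 3),
            "modularity": round(community.modularity(graph, parts), 3)}

def drop_hubs(graph, max_degree=4):
    """Exclude prolific hub metabolites (e.g. water, ATP) using a simple degree cutoff."""
    hubs = [n for n, d in graph.degree() if d > max_degree]
    pruned = graph.copy()
    pruned.remove_nodes_from(hubs)
    return pruned

# Toy metabolite network: two pathway-like chains joined only through one hub ("h2o").
G = nx.Graph()
G.add_edges_from([("glc", "g6p"), ("g6p", "f6p"), ("f6p", "fbp"),
                  ("pyr", "accoa"), ("accoa", "cit"), ("cit", "akg")])
G.add_edges_from([("h2o", n) for n in list(G.nodes())])  # the hub touches everything

print("with hub:   ", summarize(G))
print("without hub:", summarize(drop_hubs(G)))
```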


2016 ◽  
Vol 26 (5) ◽  
pp. 1134-1157 ◽  
Author(s):  
Donghee Shin ◽  
Myunggoon Choi ◽  
Jang Hyun Kim ◽  
Jae-gil Lee

Purpose The purpose of this paper is to examine the effects of interaction techniques (e.g. swiping and tapping) and the range of thumb movement on interactivity, engagement, attitude, and behavioral intention in single-handed interaction with smartphones. Design/methodology/approach A 2×2 between-participant experiment (technological features: swiping and tapping × range of thumb movement: wide and narrow) was conducted to study the effects of interaction techniques and thumb movement ranges. Findings The results showed that the range of thumb movement had significant effects on perceived interactivity, engagement, attitude, and behavioral intention, whereas no effects were observed for interaction techniques. A narrow range of thumb movement had more influence on the interactivity outcomes than a wide range of thumb movement. Practical implications While the subject of actual and perceived interactivity has been discussed, the issue has not been applied to smartphones. Based on the research results, the mobile industry may come up with a design strategy that balances feature- and perception-based interactivity. Originality/value This study adopted the perspective of the hybrid definition of interactivity, which includes both actual and perceived interactivity. Interactivity effect outcomes were mediated by perceived interactivity.


2006 ◽  
Vol 78 (1) ◽  
pp. 32-38 ◽  
Author(s):  
Donald McLean

Purpose To provide, for the use of airlines and other civil aviation organizations, a practical definition of operational efficiency and to show how it can be determined. Design/methodology/approach A brief account of air transport economics is used to demonstrate how both load factors and aircraft utilization need to be considered in assessing operational efficiency. Then other efficiencies are treated briefly before an example is given of how the better of two fictitious aircraft can be chosen for a particular route. A second example involving the calculation of the operational efficiency achieved by an imaginary airline is also given to show that the typical value is lower than might be expected, particularly in view of the relatively high load factors involved. Findings Provides performance values and economic figures which are typical of current airline operations. Practical implications Use of the proposed definition will allow the consistent assessment of the economic performance of airlines. Originality/value At present there is no definition of operational efficiency in general use, although one is greatly needed by airlines. The definition proposed in this paper is practical and easy to use.
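
For illustration only, and assuming a composite measure in which operational efficiency is approximated as the product of the passenger load factor and the aircraft utilization ratio (not necessarily the paper's exact definition), a toy calculation shows why the typical value comes out lower than the high load factor alone might suggest.

```python
def operational_efficiency(load_factor, utilization_ratio):
    """Toy composite measure: share of available seat-hours that are both sold and flown."""
    return load_factor * utilization_ratio

# Imaginary airline: 80% average load factor, aircraft flying 10 of 24 possible hours per day.
print(operational_efficiency(0.80, 10 / 24))   # ~0.33, far below the load factor alone
```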

