massive databases
Recently Published Documents


TOTAL DOCUMENTS: 26 (FIVE YEARS: 8)
H-INDEX: 4 (FIVE YEARS: 0)

2021 ◽  
pp. 1-12
Author(s):  
Swati Srivastava

Big technology companies like Facebook, Google, and Amazon amass global power through classification algorithms. These algorithms use unsupervised and semi-supervised machine learning on massive databases to detect objects, such as faces, and to process texts, such as speech, to model predictions for commercial and political purposes. Such governance by algorithms—or “algorithmic governance”—has received critical scrutiny from a vast interdisciplinary scholarship that points to algorithmic harms related to mass surveillance, information pollution, behavioral herding, bias, and discrimination. Big Tech’s algorithmic governance implicates core IR research in two ways: (1) it creates new private authorities as corporations control critical bottlenecks of knowledge, connection, and desire; and (2) it mediates the scope of state–corporate relations as states become dependent on Big Tech, Big Tech circumvents state overreach, and states curtail Big Tech. As such, IR scholars should become more involved in the global research on algorithmic governance.


2021 ◽  
pp. 1-11
Author(s):  
Bin Qin

In practice, large and complex databases are everywhere. The notion of homomorphism offers a mathematical tool for studying data compression in knowledge bases. This paper investigates knowledge bases in dynamic environments and their data compression via homomorphism, where "dynamic" means that the underlying information systems must be updated over time as new information arrives. First, the relationships among knowledge bases, information systems, and relation information systems are illustrated. Next, a non-incremental algorithm for data compression with homomorphism and the concept of a dynamic knowledge base are introduced, followed by two incremental algorithms for data compression with homomorphism in dynamic knowledge bases. Finally, an experimental analysis demonstrates how the non-incremental and incremental algorithms apply when computing the knowledge reduction of dynamic knowledge bases.
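The incremental idea can be illustrated with a small, generic sketch (not the paper's algorithms): attributes of an information system that induce the same indiscernibility partition carry the same knowledge and can be compressed to one representative, and when a new object arrives, equal partitions can only stay equal or split apart, so only the existing groups need re-checking. The function names and the toy table below are illustrative assumptions.

```python
from collections import defaultdict

def partition(table, attr):
    """Indiscernibility partition of the objects induced by one attribute."""
    blocks = defaultdict(set)
    for obj, row in table.items():
        blocks[row[attr]].add(obj)
    return frozenset(frozenset(b) for b in blocks.values())

def compress(table, attrs):
    """Non-incremental compression: attributes inducing the same partition
    carry the same knowledge, so each group keeps one representative."""
    groups = defaultdict(list)
    for a in attrs:
        groups[partition(table, a)].append(a)
    return [sorted(g) for g in groups.values()]

def add_object(table, groups, obj, row):
    """Incremental update: adding an object can only split existing groups,
    never merge them, so only attributes within a group are re-compared."""
    table[obj] = row
    new_groups = []
    for group in groups:
        regrouped = defaultdict(list)
        for a in group:
            regrouped[partition(table, a)].append(a)
        new_groups.extend(sorted(g) for g in regrouped.values())
    return new_groups

# toy information system: objects -> attribute values (illustrative only)
table = {
    "x1": {"a": 0, "b": 0, "c": 1},
    "x2": {"a": 1, "b": 1, "c": 1},
    "x3": {"a": 0, "b": 0, "c": 0},
}
groups = compress(table, ["a", "b", "c"])
print(groups)   # [['a', 'b'], ['c']] -- a and b are interchangeable
groups = add_object(table, groups, "x4", {"a": 0, "b": 1, "c": 0})
print(groups)   # the new object splits a from b: [['a'], ['b'], ['c']]
```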


2021 ◽  
Vol 2 (2) ◽  
Author(s):  
Rafael Viana Ribeiro

Legal reasoning is increasingly quantified. Developers in the market and public institutions in the legal system are using massive databases of court opinions and other legal communications to craft algorithms that assess the effectiveness of legal arguments or predict court judgments, tasks that were once seen as the exclusive province of seasoned lawyers' obscure knowledge. New legal technologies promise to search heaps of documents for useful evidence and to analyze dozens of factors to quantify a lawsuit's odds of success. Legal quantification initiatives depend on the availability of reliable data about the past behavior of courts that institutional actors have attempted to control. The development of legal quantification is visible as public bodies craft their own tools for internal use and public access, while private companies create new ways to valorize the "raw data" provided by courts and lawyers, generating information useful to the strategies of legal professionals as well as to the investors that re-valorize legal activity by securitizing legal risk through litigation funding.


Stats ◽  
2020 ◽  
Vol 3 (4) ◽  
pp. 444-464
Author(s):  
Hellen Paz ◽  
Mateus Maia ◽  
Fernando Moraes ◽  
Ricardo Lustosa ◽  
Lilia Costa ◽  
...  

The analysis of massive databases is a key issue for most applications today, and parallel computing is one suitable approach. Apache Spark is a widely employed tool in this context, aimed at processing large amounts of data in a distributed way. For the Statistics community, R is one of the preferred tools; despite its growth in recent years, it still has limitations for processing large volumes of data on a single local machine. In general, the data analysis community has difficulty handling massive amounts of data on local machines, often requiring high-performance computing servers. One way to perform statistical analyses over massive databases is to combine both tools (Spark and R) via the sparklyr package, which allows an R application to use Spark. This paper presents an analysis of Brazilian public data from the Bolsa Família Programme (BFP, a conditional cash transfer), comprising a large data set with 1.26 billion observations. Our goal was to understand how this social program acts in different cities and to identify potentially important variables reflecting its utilization rate. Statistical modeling was performed with a random forest to predict the utilization rate of the BFP. Variable selection was performed with a recent method based on the importance and interpretation of variables in the random forest model. Among the 89 variables initially considered, the final model achieved high predictive performance with 17 selected variables and indicated that variables related to income, education, job informality, and inactive youth, namely family income, education, occupation, and household density, were highly important for the observed utilization rate. In this work, using a local machine, we highlight the potential of combining Spark and R for the analysis of a large database of 111.6 GB. This can serve as a proof of concept or reference for similar work within the Statistics community, and our case study can provide important evidence for further analysis of this important social support programme.
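The paper's workflow uses Spark from R via sparklyr; the sketch below shows an analogous pipeline in PySpark (Python) on a local machine, since the general pattern is the same: a local Spark session, distributed reading of a large CSV, and a random forest with feature importances. The file path, column names, and memory setting are placeholders, not the study's actual data layout.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator

spark = (SparkSession.builder
         .appName("bfp-utilization")
         .master("local[*]")                   # use all local cores
         .config("spark.driver.memory", "8g")  # tune to the machine's RAM
         .getOrCreate())

# distributed read of a large CSV; path and columns are hypothetical
df = spark.read.csv("bfp_municipalities.csv", header=True, inferSchema=True)

features = ["family_income", "education", "occupation", "household_density"]
assembler = VectorAssembler(inputCols=features, outputCol="features")
data = assembler.transform(df).select("features", "utilization_rate")

train, test = data.randomSplit([0.8, 0.2], seed=42)
model = RandomForestRegressor(labelCol="utilization_rate", numTrees=100).fit(train)

rmse = RegressionEvaluator(labelCol="utilization_rate",
                           metricName="rmse").evaluate(model.transform(test))
print("test RMSE:", rmse)
print("feature importances:", model.featureImportances)
```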


2020 ◽  
Vol 9 (3) ◽  
pp. 100-117
Author(s):  
Sangeetha T. ◽  
Geetha Mary A.

The process of recognizing patterns and collecting knowledge from massive databases is called data mining. Objects that deviate from other objects in their characteristics or behavior are known as outliers. Research on outlier detection has so far focused only on numerical data, categorical data, and single universal sets. The main goal of this article is to detect significant outliers across two universal sets by applying the intuitionistic fuzzy cut relationship based on membership and non-membership values. The proposed method, weighted density outlier detection, is based on rough entropy and is employed to detect outliers. Since it is unsupervised, it does not consider the class labels of decision attributes; instead, weighted density values are calculated for all conditional attributes and objects to detect outliers. For the experimental analysis, the Iris dataset from the UCI repository is used to detect outliers, and comparisons with existing algorithms demonstrate the method's efficiency.
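For intuition only, the sketch below shows a generic unsupervised weighted-density outlier score on categorical data: objects whose attribute values are rare receive low density scores and are flagged. It is not the authors' rough-entropy, intuitionistic-fuzzy method; the toy records, weights, and threshold are assumptions.

```python
from collections import Counter

def weighted_density_scores(records, weights=None):
    """Score each record by the weighted average frequency of its values;
    rare values give low scores (low 'density')."""
    attrs = list(records[0].keys())
    weights = weights or {a: 1.0 for a in attrs}   # equal weights by default
    freq = {a: Counter(r[a] for r in records) for a in attrs}
    n = len(records)
    return [
        sum(weights[a] * freq[a][r[a]] / n for a in attrs) / sum(weights.values())
        for r in records
    ]

data = [
    {"petal": "short", "sepal": "narrow"},
    {"petal": "short", "sepal": "narrow"},
    {"petal": "short", "sepal": "wide"},
    {"petal": "long",  "sepal": "huge"},   # rare values -> likely outlier
]
scores = weighted_density_scores(data)
outliers = [i for i, s in enumerate(scores) if s < 0.5]
print(scores, "->", outliers)              # the last record gets the lowest score
```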


2020 ◽  
Vol 7 (1) ◽  
Author(s):  
Megan Doerr ◽  
Jennifer K Wagner

The number and size of existing research studies with massive databases and biosample repositories that could be leveraged for public health response against SARS-CoV-2 (or other infectious disease pathogens) are unparalleled in history. What risks are posed by coopting research infrastructure (not just data and samples but also participant recruitment and contact networks, communications, and coordination functions) for public health activities? The case of the Seattle Flu Study highlights the general challenges of using research infrastructure for public health response, including the legal and ethical considerations for research data use, the return of results from public health activities that rely on research resources to unwitting research participants, and the possible impacts of public health reporting mandates on future research participation. While research, including public health research, is essential during a pandemic, careful consideration should be given to distinguishing and balancing the ethical mandates of public health activities against the existing ethical responsibilities of biomedical researchers.


Author(s):  
Jennifer Stromer-Galley

The quest for data-driven campaigning in 2012 (creating massive databases of voter information for more effective micro-targeting) found greater efficacy and new controversy in 2016. The Trump campaign capitalized on the power of digital advertising to reach the public, engaging in unprecedented mass-targeted campaigning. His campaign spent substantially more on Facebook and other paid digital media ads than Clinton's did. Yet the company Trump worked with, Cambridge Analytica, closed up shop in 2018 under a cloud of controversy about corrupt officials and voter manipulation in several countries, as well as the ill-gotten Facebook user data that drove its micro-targeting practices. The Clinton campaign modeled itself on the data-driven successes of the Obama campaign, yet the algorithms that drove its decision making were flawed, leading her campaign to underperform in essential swing states. Similar to the Romney campaign's Narwhal challenges on Election Day, when that campaign was effectively flying blind on get-out-the-vote numbers, the Clinton plane was flying on bad coordinates, ultimately causing her campaign to crash in critical swing states.


Author(s):  
Qiang Fu ◽  
Xu Han ◽  
Xianglong Liu ◽  
Jingkuan Song ◽  
Cheng Deng

Building multiple hash tables has proven a successful technique for indexing massive databases, as it can guarantee a desired level of overall performance. However, existing hash-based multi-indexing methods suffer from heavy redundancy, lacking strong table complementarity and effective hash code learning. To address these problems, this paper proposes a complementary binary quantization (CBQ) method to jointly learn multiple hash tables. It exploits the power of incomplete binary coding based on prototypes to align the original space and the Hamming space, and further utilizes the nature of multi-indexing search to jointly reduce the quantization loss based on prototype-based hash functions. Our alternating optimization adaptively and efficiently discovers the complementary prototype sets and the corresponding code sets of varying size, which together robustly approximate the data relations. Our method can be naturally generalized to the product space for long hash codes. Extensive experiments on two popular large-scale tasks, Euclidean and semantic nearest neighbor search, demonstrate that the proposed CBQ method enjoys strong table complementarity and significantly outperforms the state of the art, with relative performance gains of up to 57.76%.
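To make the multi-indexing setting concrete, the sketch below builds several independent binary hash tables (here with simple random-hyperplane codes, not the learned CBQ prototypes) and answers a query from the union of matching buckets followed by exact re-ranking. Table count, bit length, and data are illustrative.

```python
import numpy as np

class MultiTableHashIndex:
    def __init__(self, dim, n_tables=4, n_bits=8, seed=0):
        rng = np.random.default_rng(seed)
        # one independent set of hyperplanes per table
        self.planes = [rng.standard_normal((n_bits, dim)) for _ in range(n_tables)]
        self.tables = [dict() for _ in range(n_tables)]
        self.data = []

    def _code(self, planes, x):
        # sign pattern of the projections, packed into a hashable tuple
        return tuple((planes @ x > 0).astype(int))

    def add(self, x):
        idx = len(self.data)
        self.data.append(x)
        for planes, table in zip(self.planes, self.tables):
            table.setdefault(self._code(planes, x), []).append(idx)

    def query(self, q, k=3):
        # union of candidates over all tables, then exact re-ranking
        cand = set()
        for planes, table in zip(self.planes, self.tables):
            cand.update(table.get(self._code(planes, q), []))
        return sorted(cand, key=lambda i: np.linalg.norm(self.data[i] - q))[:k]

rng = np.random.default_rng(1)
index = MultiTableHashIndex(dim=16)
for _ in range(1000):
    index.add(rng.standard_normal(16))
print(index.query(rng.standard_normal(16)))   # indices of nearest candidates
```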


2013 ◽  
Vol 748 ◽  
pp. 1028-1032
Author(s):  
Jia Chen ◽  
Hong Wei Chen ◽  
Xin Rong Hu

On-Line Analytical Processing (OLAP) tools are frequently used in business, science, and health to extract useful knowledge from massive databases. An important and hard optimization problem in OLAP data warehouses is the view selection problem, which consists of selecting a set of aggregate views of the data to speed up future query processing. We apply an Estimation of Distribution Algorithm (EDA) to view selection under a size constraint. Our emphasis is on determining the suitability of combining EDAs with constraint handling for the view selection problem, compared to a widely used genetic algorithm. The EDAs are competitive with the genetic algorithm on a variety of problem instances, often finding approximately optimal solutions in a reasonable amount of time.
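A compact illustration of the approach, under assumed synthetic benefits and sizes rather than the paper's OLAP cost model: a UMDA-style EDA samples candidate view selections from a probability vector, penalizes selections that exceed the storage budget, and re-estimates the per-view selection probabilities from the elite solutions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_views = 20
benefit = rng.uniform(1, 10, n_views)      # query-speedup benefit per view (synthetic)
size = rng.uniform(1, 5, n_views)          # storage cost per view (synthetic)
budget = 0.3 * size.sum()                  # total storage allowed

def fitness(x):
    # infeasible selections are penalized rather than repaired
    penalty = max(0.0, (x * size).sum() - budget) * benefit.max()
    return (x * benefit).sum() - penalty

p = np.full(n_views, 0.5)                  # initial per-view selection probabilities
pop_size, elite, generations = 100, 20, 50
for _ in range(generations):
    pop = (rng.random((pop_size, n_views)) < p).astype(float)   # sample population
    scores = np.array([fitness(x) for x in pop])
    best = pop[np.argsort(scores)[-elite:]]                     # keep elite solutions
    p = 0.9 * best.mean(axis=0) + 0.05                          # smoothed re-estimation

selected = p > 0.5
print("views selected:", np.flatnonzero(selected))
print("size used:", (selected * size).sum(), "of budget", budget)
```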

