Acquire and Integrate Data

Author(s):  
Kai R. Larsen ◽  
Daniel S. Becker

Access to additional, relevant data will lead to better predictions from algorithms until we reach the point where more observations (cases) are no longer helpful for detecting the signal: the feature(s) or conditions that inform the target. In addition to obtaining more observations, we can also look for additional features of interest that we do not currently have, at which point it will invariably be necessary to integrate data from different sources. This section introduces the process of data integration, starting with two methods: "joins" (to access more features) and "unions" (to access more observations), and continues on to cover regular expressions, data summarization, crosstabs, data reduction and splitting, and data wrangling in all its flavors.
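To make the two methods concrete, here is a minimal sketch in Python with pandas; the tables, column names, and values are hypothetical, not drawn from the chapter itself.

```python
import pandas as pd

# Two sources describing the SAME customers: a join adds features (columns).
demographics = pd.DataFrame({"customer_id": [1, 2, 3],
                             "age": [34, 51, 27]})
purchases = pd.DataFrame({"customer_id": [1, 2, 3],
                          "total_spend": [120.0, 430.5, 88.9]})
joined = demographics.merge(purchases, on="customer_id", how="left")

# Two sources describing DIFFERENT customers: a union adds observations (rows).
region_a = pd.DataFrame({"customer_id": [1, 2], "age": [34, 51]})
region_b = pd.DataFrame({"customer_id": [4, 5], "age": [45, 39]})
unioned = pd.concat([region_a, region_b], ignore_index=True)

print(joined)   # wider table: one new feature per customer
print(unioned)  # longer table: more observations, same features
```

The join widens the customer table with a new feature (total_spend), while the union stacks two regional extracts into one longer table of observations.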

Author(s):  
Yan Qi ◽  
Huiping Cao ◽  
K. Selçuk Candan ◽  
Maria Luisa Sapino

In XML data integration, data/metadata merging and query processing are indispensable. Specifically, merging integrates multiple disparate (heterogeneous and autonomous) input data sources for further usage, while query processing is one of the main reasons the data need to be integrated in the first place. Moreover, when supported with appropriate user feedback techniques, queries can also provide contexts in which conflicts among the input sources can be interpreted and resolved. The flexibility of XML structure provides opportunities for alleviating some of the difficulties that less flexible data types face in the presence of uncertainty; yet this flexibility also introduces new challenges in merging multiple sources and in query processing over integrated data. In this chapter, the authors discuss two alternative ways in which XML data/schemas can be integrated: conflict-eliminating (where the result is cleaned of any conflicts that the different sources might have with each other) and conflict-preserving (where the resulting XML data or XML schema captures the alternative interpretations of the data). They also present techniques for query processing over integrated, possibly imprecise, XML data, and cover strategies that can be used for resolving the underlying conflicts.
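As a rough illustration (not the authors' algorithm), the sketch below contrasts the two merge styles on a single conflicting element using Python's standard library; the XML fragments and the <alt> wrapper element are hypothetical.

```python
import xml.etree.ElementTree as ET

src1 = ET.fromstring('<book><title>Data Integration</title><year>2008</year></book>')
src2 = ET.fromstring('<book><title>Data Integration</title><year>2009</year></book>')

def merge(a, b, preserve_conflicts):
    # Simplified: walks only the first source's children.
    merged = ET.Element(a.tag)
    for child_a in a:
        child_b = b.find(child_a.tag)
        node = ET.SubElement(merged, child_a.tag)
        if child_b is None or child_a.text == child_b.text:
            node.text = child_a.text
        elif preserve_conflicts:
            # Conflict-preserving: keep both readings as <alt> children.
            for text in (child_a.text, child_b.text):
                alt = ET.SubElement(node, "alt")
                alt.text = text
        else:
            # Conflict-eliminating: resolve by a fixed policy (here: trust source 1).
            node.text = child_a.text
    return merged

print(ET.tostring(merge(src1, src2, preserve_conflicts=False)))
# <year>2008</year>
print(ET.tostring(merge(src1, src2, preserve_conflicts=True)))
# <year><alt>2008</alt><alt>2009</alt></year>
```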


Author(s):  
O. S. Olokeogun ◽  
O. O. Akintola ◽  
E. K. Abodunrin

This study demonstrates the potential of a Geographic Information System (GIS) as a management tool for avenue tree (street tree) populations in small communities, using the Idi-Ishin community, Ibadan, Nigeria as a case study. GIS is a decision support system that integrates data or sets of data from different sources, bringing them under the same referencing system in a computer system. Ikonos imagery (1 m spatial resolution) of the study area was digitized to produce a digital map using ArcGIS version 10.1. Avenue trees with a diameter at breast height (DBH) ≥ 5 cm were selected for enumeration; these trees were then measured and tagged. The height, girth, and geographic location (X and Y coordinates) of the trees were measured with a Haga altimeter, a girthing tape, and a handheld Global Positioning System (GPS) receiver, respectively. The species and families of the enumerated trees were also identified. Data were analysed for basal area (BA) and volume (V). A total of 43 avenue trees were assessed in the Idi-Ishin community. <i>Roystonea regia</i> accounted for the majority of the avenue trees (25.58%), followed by <i>Polyalthia longifolia</i> (23.26%), <i>Gliricidia sepium</i> (20.93%), <i>Eucalyptus torelliana</i> (13.95%), and <i>Delonix regia</i> (6.98%), while <i>Terminalia catappa</i>, <i>Terminalia radii</i>, <i>Azadirachta indica</i>, and <i>Newbouldia laevis</i> each had an abundance of 2.33%. The benefits derived from these avenue trees include carbon sequestration, beautification, windbreak, and shade. A spatial relational database was created for the assessed avenue trees using the ArcCatalog component of ArcGIS 10.1. Based on the findings of the study, which serve as baseline information for the management of the avenue trees in the area, it is recommended that subsequent assessments be carried out at 3-5 year intervals in order to ensure proper, continuous monitoring and updating of the data.
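The per-tree calculations reported above follow standard forestry formulas; a sketch is shown below. The form factor is an assumed generic value (the paper does not state the one it used), and DBH is derived from the measured girth by dividing by π.

```python
import math

def dbh_cm_from_girth(girth_cm):
    """DBH from girth (circumference), both in cm."""
    return girth_cm / math.pi

def basal_area_m2(dbh_cm):
    """Basal area in m^2 from DBH in cm: BA = pi * d^2 / 4."""
    return math.pi * (dbh_cm / 100.0) ** 2 / 4.0

def stem_volume_m3(dbh_cm, height_m, form_factor=0.5):
    """Approximate stem volume as basal area x height x form factor."""
    return basal_area_m2(dbh_cm) * height_m * form_factor

print(round(basal_area_m2(30.0), 4))          # ~0.0707 m^2 for DBH = 30 cm
print(round(stem_volume_m3(30.0, 12.0), 3))   # ~0.424 m^3 at height 12 m
```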


Author(s):  
Laila Niedrite ◽  
Darja Solodovnikova

The measurement of research results can be used in different ways, e.g. for the assignment of research grants and the subsequent evaluation of project results, or for recruiting and promoting research institutions' staff. Because such measurements are so widely used, the selection of appropriate measures is important. At the same time, there is no common view on which metrics should be used in this field; moreover, many widely used metrics are often misleading, e.g. because they are computed from incomplete or faulty data, because the metric's computation formula is invalid, or because the computation results are interpreted wrongly. To produce a good framework for research evaluation, these problems must be solved in the best possible way by integrating data from different sources to get a comprehensive view of an academic institution's research activities and to solve data quality problems. We present a data integration system that integrates a university information system with a library information system and with data gathered through APIs from the Scopus and Web of Science databases. The data integration and data quality problems that we have faced are described and possible solutions are presented. The metrics that are defined and computed over these integrated data, and the possibilities for analysing them, are also discussed.
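One recurring integration step in such a system is matching publication records from two sources on a shared key and flagging records that fail to match; the sketch below illustrates this with DOI matching. The field names and the conflict policy are invented for illustration, not taken from the paper.

```python
def integrate_by_doi(university_records, scopus_records):
    """Merge two lists of publication dicts on DOI; return (merged, unmatched)."""
    scopus_by_doi = {r["doi"].lower(): r for r in scopus_records if r.get("doi")}
    merged, unmatched = [], []
    for rec in university_records:
        doi = (rec.get("doi") or "").lower()
        if doi and doi in scopus_by_doi:
            # Illustrative conflict policy: external-source fields win on clash.
            merged.append({**rec, **scopus_by_doi[doi]})
        else:
            # Data quality issue to report: missing DOI or no external match.
            unmatched.append(rec)
    return merged, unmatched

local = [{"doi": "10.1000/xyz123", "title": "A Study"}, {"doi": None, "title": "No DOI"}]
external = [{"doi": "10.1000/XYZ123", "citations": 17}]
print(integrate_by_doi(local, external))
```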


2021 ◽  
Vol 3 (2) ◽  
pp. 217-224
Author(s):  
Ratu Upisika Maha Misi ◽  
Johny Prihanto ◽  
Florentina Kurniasari ◽  
Noemi da Silva

Robologee is a sub-unit of PT Bangun Satya Wacana, part of Kompas Gramedia, focused on education for ages 7 to 12. Robologee is a diversification of the existing sub-units in PT Bangun Satya Wacana. Robologee has branches located at Gramedia World, so it is expected to have an impact on Gramedia traffic. Currently, Robologee is undergoing a transformation to integrate its data, which will be stored in the cloud on Amazon Web Services. The goal of this project is that data can be accessed by various users and stored on one platform. For the analysis of the digital transformation project, 15 respondents were selected, all parents acting as external customers. Based on the indicators used in the DMM, Robologee's current condition was found to be at the Advancing level. According to the roadmap, the project is implemented over one year and consists of four stages. In the budgeting analysis, Robologee has a payback period of 1.7 years and an IRR of 7.512%, greater than the 5% return expected by the company. The NPV is also positive, so the project is feasible to implement.
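The three budgeting metrics quoted above can be computed as in the sketch below; the cash flows in the example are illustrative, not Robologee's actual figures.

```python
def npv(rate, cashflows):
    """Net present value of yearly cash flows, starting at year 0."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cashflows))

def irr(cashflows, lo=-0.99, hi=10.0, tol=1e-9):
    """Internal rate of return via bisection (assumes one sign change)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if npv(mid, cashflows) > 0:
            lo = mid   # rate too low: NPV still positive
        else:
            hi = mid
    return mid

def payback_period(cashflows):
    """Years until cumulative cash flow turns non-negative (linear within a year)."""
    cumulative = cashflows[0]
    for year, cf in enumerate(cashflows[1:], start=1):
        if cumulative + cf >= 0:
            return year - 1 + (-cumulative / cf)
        cumulative += cf
    return None  # never pays back

flows = [-100.0, 40.0, 35.0, 30.0, 25.0]  # hypothetical project cash flows
print(round(payback_period(flows), 2))    # ~2.83 years
print(round(irr(flows), 4))               # internal rate of return
print(round(npv(0.05, flows), 2))         # NPV at the 5% expected return
```

A project is accepted under these rules when the IRR exceeds the expected return and the NPV at that expected return is positive, which is exactly the argument the abstract makes.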


2012 ◽  
Vol 3 (1) ◽  
pp. 72-82 ◽  
Author(s):  
Yinle Zhou ◽  
Ali Kooshesh ◽  
John Talburt

Entity-based data integration (EBDI) is a form of data integration in which information related to the same real-world entity is collected and merged from different sources. It often happens that not all of the sources will agree on one value for a common attribute. These cases are typically resolved by invoking a rule that will select one of the non-null values presented by the sources. One of the most commonly used selection rules is called the naïve selection operator that chooses the non-null value provided by the source with the highest overall accuracy for the attribute in question. However, the naïve selection operator will not always produce the most accurate result. This paper describes a method for automatically generating a selection operator using methods from genetic programming. It also presents the results from a series of experiments using synthetic data that indicate that this method will yield a more accurate selection operator than either the naïve or naïve-voting selection operators.
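As a point of reference, the naïve selection operator the paper compares against can be sketched in a few lines; the sources, values, and accuracy figures below are invented for illustration.

```python
def naive_select(values_by_source, source_accuracy):
    """Pick the non-null value from the source with the highest known
    overall accuracy for this attribute (the naive selection operator)."""
    candidates = [(src, val) for src, val in values_by_source.items()
                  if val is not None]
    if not candidates:
        return None
    best_source, best_value = max(candidates,
                                  key=lambda sv: source_accuracy[sv[0]])
    return best_value

values = {"A": "1969-07-20", "B": None, "C": "1969-07-21"}
accuracy = {"A": 0.92, "B": 0.88, "C": 0.75}  # hypothetical per-attribute accuracies
print(naive_select(values, accuracy))          # "1969-07-20", from source A
```

As the paper notes, this operator can still be wrong: when the generally most accurate source happens to be wrong for a particular record, a learned selection operator can outperform it.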


Author(s):  
Américo Sampaio

Web portals present an effective way to integrate applications, people, and business by offering a single point of access to these resources within an organization and with external business partners. Moreover, the integration of business processes, the automation of daily tasks, and data integration help cut costs and accelerate business operations. However, Web portal development and maintenance pose many challenges for developers, such as how to provide personalization features to users (organizations and individuals), how to control access by different users, how to integrate and present data from different sources, and how to maintain the content of the Web portal.


2008 ◽  
Vol 22 (1) ◽  
pp. 38-55 ◽  
Author(s):  
Solfrid Vatne ◽  
May Solveig Fagermoen

The purpose of this article is to present a qualitative mixed-method strategy that uncovers the compound reality in professional practice and the inner aspects of actions, feelings, values, and thoughts embedded therein. The authors developed a systematic strategy for collecting and handling data from different sources. This strategy, called event-oriented data integration, represents within-method triangulation as well as triangulation during data analysis. The analysis involves two distinctive new steps for structuring data: the braiding of data threads and the braiding of data ropes. We found that the process of weaving together different data created an inclusive text that allowed the researcher to undertake a holistic, coherent, and consistent analysis and to attain a more complete picture of professional practice.


2019 ◽  
Vol 6 (1) ◽  
pp. 1
Author(s):  
Komang Endy Suastika ◽  
I Ketut Paramarta ◽  
Ida Ayu Putu Purnami

This study aims to describe (1) the types of numbers in Old Balinese and (2) the types of numbers in New Balinese. The subject of this study is the ancient Balinese inscriptions, while the object is the numbers. The documentation method was used to collect the data. The data analysis then used several techniques: data identification, data reduction, data classification, and data summarization. The study found (1) the following types of numbers in Old Balinese: distributive numbers, collective numbers, indefinite distributive numbers, clitic numbers, measurement numbers, and fractional numbers; and (2) types of numbers in New Balinese which, based on shared meanings, comprise equivalent forms, similar forms, and different forms.


2020 ◽  
Author(s):  
A Patrícia Bento ◽  
Anne Hersey ◽  
Eloy Felix ◽  
Greg Landrum ◽  
Anna Gaulton ◽  
...  

Abstract Background: The ChEMBL database is one of a number of public databases that contain bioactivity data on small molecule compounds curated from diverse sources. Incoming compounds are typically not standardised according to consistent rules. In order to maintain the quality of the final database and to easily compare and integrate data on the same compound from different sources, it is necessary for the chemical structures in the database to be appropriately standardised. Results: A chemical curation pipeline has been developed using the open source toolkit RDKit. It comprises three components: a Checker to test the validity of chemical structures and flag any serious errors; a Standardizer which formats compounds according to defined rules and conventions; and a GetParent component that removes any salts and solvents from the compound to create its parent. This pipeline has been applied to the latest version of the ChEMBL database, as well as uncurated datasets from other sources, to test the robustness of the process and to identify common issues in database molecular structures. Conclusion: All the components of the structure pipeline have been made freely available for other researchers to use and adapt for their own use. The code is available in a GitHub repository and it can also be accessed via the ChEMBL Beaker webservices. It has been used successfully to standardise the nearly 2 million compounds in the ChEMBL database, and the compound validity checker has been used to identify compounds with the most serious issues so that they can be prioritised for manual curation.
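The released pipeline lives in the project's GitHub repository; as a rough approximation of its three components, the sketch below uses plain RDKit calls rather than the pipeline's own API, so the behaviour will differ in detail from the published tool.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

def check(smiles):
    """Checker-like step: flag structures RDKit cannot parse/sanitize."""
    mol = Chem.MolFromSmiles(smiles)
    return [] if mol is not None else ["invalid structure"]

def standardize(smiles):
    """Standardizer-like step: normalise with RDKit's default cleanup rules."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(rdMolStandardize.Cleanup(mol))

def get_parent(smiles):
    """GetParent-like step: strip salts/solvents, keep the parent fragment."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(rdMolStandardize.FragmentParent(mol))

salt = "CC(=O)Oc1ccccc1C(=O)[O-].[Na+]"  # aspirin sodium salt
print(check(salt))        # [] -> parses cleanly
print(get_parent(salt))   # parent fragment with the counter-ion removed
```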



