The Research of XML Keyword Retrieval Algorithms Based on MapReduce

2014 ◽  
Vol 556-562 ◽  
pp. 3347-3349
Author(s):  
Yao Wen Xia ◽  
Ji Li Xie

This paper approaches keyword retrieval from the perspective of XML data management. Large volumes of XML data are first stored in HDFS, and the traditional MapReduce processing framework is rewritten to suit XML data queries. On this basis, a keyword retrieval algorithm for large XML data sets is designed, consisting of four parts: XML data classification, coding, indexing, and search. Together these address keyword retrieval over large XML documents. Finally, a MapReduce-based keyword query system for large-scale XML data is designed and implemented.
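
The abstract gives no implementation details, so the following is only a minimal single-process sketch of how a map step, a reduce step, and a search step could cooperate on keyword indexing. The function names, the Dewey-style node codes, and the choice to ignore tail text are all assumptions made for illustration; the sketch does not reproduce the authors' classification or coding steps, and it does not use Hadoop.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET


def map_phase(doc_id, xml_text):
    """Emit (keyword, (doc_id, node_code)) for each word in a node's leading text."""
    root = ET.fromstring(xml_text)
    stack = [(root, "1")]  # hypothetical Dewey-style code: root is "1"
    while stack:
        node, code = stack.pop()
        for word in (node.text or "").lower().split():
            yield word, (doc_id, code)
        for i, child in enumerate(node, start=1):
            stack.append((child, f"{code}.{i}"))


def reduce_phase(pairs):
    """Group postings by keyword, standing in for the shuffle and reduce steps."""
    index = defaultdict(list)
    for word, posting in pairs:
        index[word].append(posting)
    return {word: sorted(postings) for word, postings in index.items()}


def build_index(docs):
    """Run the map step over every document and reduce the emitted pairs."""
    pairs = (pair for doc_id, text in docs.items()
             for pair in map_phase(doc_id, text))
    return reduce_phase(pairs)


def search(index, query):
    """Return the ids of documents that contain every keyword in the query."""
    doc_sets = [{doc for doc, _ in index.get(word.lower(), [])}
                for word in query.split()]
    return set.intersection(*doc_sets) if doc_sets else set()
```

In a real deployment the map and reduce functions would run as distributed tasks over HDFS splits; here they are ordinary Python functions so the data flow can be followed end to end.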

2018 ◽  
Vol 232 ◽  
pp. 01004
Author(s):  
Wenshuai Ge ◽  
Gang He ◽  
Xinwen Liu

This paper proposes a big data query system for customized queries driven by specific business needs, and introduces the components and structure of the system. ANTLR is used as the language recognizer to design and implement a customized SQL dialect. The system builds a simpler, easier-to-use query interface on top of Spark SQL that satisfies the query requirements of an Internet user behavior analysis platform.


2014 ◽  
Vol 496-500 ◽  
pp. 1877-1880
Author(s):  
Dong Juan Gu ◽  
Li Yong Wan

To address inefficient XML data queries and to support dynamic updates, this paper proposes an improved method for encoding XML document nodes. Building on region encoding and prefix encoding, it introduces an XML document coding scheme based on binary (CSBB). CSBB uses a binary encoding strategy and inserts bit strings in order. The bit-string insertion algorithm generates ordered bit strings that reserve space for newly inserted nodes without affecting existing ones. Experiments show that CSBB effectively avoids re-encoding nodes and supports dynamic node updates.
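
The abstract does not spell out the CSBB insertion rule, so the sketch below is not the authors' algorithm. It illustrates the general idea of order-preserving bit-string labels with a well-known generic scheme: codes are compared lexicographically, every code ends in `1`, and a new code is generated between two neighbours without touching either. The function name `code_between` is hypothetical.

```python
def code_between(left, right):
    """Return a bit string that sorts strictly between left and right.

    Codes are compared lexicographically (a prefix sorts before its
    extensions) and every code ends in '1'. Pass None for a missing
    neighbour at either end of the sibling list.
    """
    left = left or ""
    if right is None or len(left) >= len(right):
        # Appending '1' yields a code just after left and still before right.
        return left + "1"
    # Replacing the final '1' of right with '01' yields a code just before right.
    return right[:-1] + "01"


# Usage: insert a new sibling between two existing codes.
siblings = ["01", "1"]
new_code = code_between(siblings[0], siblings[1])
siblings.insert(1, new_code)
```

Because the new code is derived only from its two neighbours, existing nodes keep their codes after an insertion, which is the property the abstract attributes to CSBB.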


2014 ◽  
Vol 7 (3) ◽  
pp. 3021-3073 ◽  
Author(s):  
M. Grossi ◽  
P. Valks ◽  
D. Loyola ◽  
B. Aberle ◽  
S. Slijkhuis ◽  
...  

Abstract. The knowledge of the total column water vapour (TCWV) global distribution is fundamental for climate analysis and weather monitoring. In this work, we present the retrieval algorithm used to derive the operational TCWV from the GOME-2 sensors and perform an extensive inter-comparison and validation in order to estimate their absolute accuracy and long-term stability. We use the recently reprocessed data sets retrieved by the GOME-2 instruments aboard EUMETSAT's MetOp-A and MetOp-B satellites and generated by DLR in the framework of the O3M-SAF using the GOME Data Processor (GDP) version 4.7. The retrieval algorithm is based on a classical Differential Optical Absorption Spectroscopy (DOAS) method and combines H2O/O2 retrieval for the computation of the trace gas vertical column density. We introduce a further enhancement in the quality of the H2O column by optimizing the cloud screening and developing an empirical correction in order to eliminate the instrument scan angle dependencies. We evaluate the overall consistency of about 8 months of measurements from the newer GOME-2 instrument on the MetOp-B platform with the GOME-2/MetOp-A data in the overlap period. Furthermore, we compare GOME-2 results with independent TCWV data from ECMWF and with SSMIS satellite measurements during the full period January 2007–August 2013 and we perform a validation against the combined SSM/I + MERIS satellite data set developed in the framework of the ESA DUE GlobVapour project. We find global mean biases as small as ± 0.03 g cm−2 between GOME-2A and all other data sets. The combined SSM/I-MERIS sample is typically drier than the GOME-2 retrievals (−0.005 g cm−2), while on average GOME-2 data overestimate the SSMIS measurements by only 0.028 g cm−2. However, the size of some of these biases is seasonally dependent.
Monthly average differences can be as large as 0.1 g cm−2, based on the analysis against SSMIS measurements, but are not as evident in the validation with the ECMWF and the SSM/I + MERIS data. Studying two exemplary months, we estimate regional differences and identify a very good agreement between GOME-2 total columns and all three independent data sets, especially for land areas, although some discrepancies over ocean and over land areas with high humidity and a relatively large surface albedo are also present.


2015 ◽  
Vol 8 (2) ◽  
pp. 1787-1832 ◽  
Author(s):  
J. Heymann ◽  
M. Reuter ◽  
M. Hilker ◽  
M. Buchwitz ◽  
O. Schneising ◽  
...  

Abstract. Consistent and accurate long-term data sets of global atmospheric concentrations of carbon dioxide (CO2) are required for carbon cycle and climate related research. However, global data sets based on satellite observations may suffer from inconsistencies originating from the use of products derived from different satellites as needed to cover a long enough time period. One reason for inconsistencies can be the use of different retrieval algorithms. We address this potential issue by applying the same algorithm, the Bremen Optimal Estimation DOAS (BESD) algorithm, to different satellite instruments, SCIAMACHY onboard ENVISAT (March 2002–April 2012) and TANSO-FTS onboard GOSAT (launched in January 2009), to retrieve XCO2, the column-averaged dry-air mole fraction of CO2. BESD has been initially developed for SCIAMACHY XCO2 retrievals. Here, we present the first detailed assessment of the new GOSAT BESD XCO2 product. GOSAT BESD XCO2 is a product generated and delivered to the MACC project for assimilation into ECMWF's Integrated Forecasting System (IFS). We describe the modifications of the BESD algorithm needed in order to retrieve XCO2 from GOSAT and present detailed comparisons with ground-based observations of XCO2 from the Total Carbon Column Observing Network (TCCON). We discuss detailed comparison results between all three XCO2 data sets (SCIAMACHY, GOSAT and TCCON). The comparison results demonstrate the good consistency between the SCIAMACHY and the GOSAT XCO2. For example, we found a mean difference for daily averages of −0.60 ± 1.56 ppm (mean difference ± standard deviation) for GOSAT-SCIAMACHY (linear correlation coefficient r = 0.82), −0.34 ± 1.37 ppm (r = 0.86) for GOSAT-TCCON and 0.10 ± 1.79 ppm (r = 0.75) for SCIAMACHY-TCCON. 
The remaining differences between GOSAT and SCIAMACHY are likely due to non-perfect collocation (±2 h, 10° × 10° around TCCON sites), i.e., the observed air masses are not exactly identical, but likely also due to a still non-perfect BESD retrieval algorithm, which will be continuously improved in the future. Our overarching goal is to generate a satellite-derived XCO2 data set appropriate for climate and carbon cycle research covering the longest possible time period. We therefore also plan to extend the existing SCIAMACHY and GOSAT data set discussed here by using also data from other missions (e.g., OCO-2, GOSAT-2, CarbonSat) in the future.
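
The comparison statistics quoted above (mean difference ± standard deviation and the linear correlation coefficient r) can be computed from collocated daily averages along the following lines. This is a minimal sketch: the function name is hypothetical, the use of the sample standard deviation is an assumption (the abstract does not say which estimator was used), and no values from the paper are reproduced.

```python
from math import sqrt
from statistics import mean, stdev


def compare(a, b):
    """Return (mean difference, std of differences, Pearson r) for paired values."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long series with at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    # Sample standard deviation (n - 1 in the denominator) is an assumption.
    return mean(diffs), stdev(diffs), cov / sqrt(var_a * var_b)
```

Called as `compare(gosat_daily, tccon_daily)` with two equally long lists of collocated XCO2 values in ppm (both list names are placeholders), it returns the three quantities in the order quoted in the abstract.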


2015 ◽  
Vol 2015 ◽  
pp. 1-11 ◽  
Author(s):  
Yue Zhao ◽  
Ye Yuan ◽  
Guoren Wang

This paper describes a keyword search measure on probabilistic XML data based on ELM (extreme learning machine). We use this method to carry out keyword search on probabilistic XML data. A probabilistic XML document differs from a traditional XML document to realize keyword search in the consideration of possible world semantics. A probabilistic XML document can be seen as a set of nodes consisting of ordinary nodes and distributional nodes. ELM has good performance in text classification applications. As the typical semistructured data; the label of XML data possesses the function of definition itself. Label and context of the node can be seen as the text data of this node. ELM offers significant advantages such as fast learning speed, ease of implementation, and effective node classification. Set intersection can compute SLCA quickly in the node sets which is classified by using ELM. In this paper, we adopt ELM to classify nodes and compute probability. We propose two algorithms that are based on ELM and probability threshold to improve the overall performance. The experimental results verify the benefits of our methods according to various evaluation metrics.
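
To make the set-intersection idea concrete, here is a minimal sketch of computing SLCA (smallest lowest common ancestor) nodes on an ordinary, non-probabilistic XML tree. Nodes are identified by Dewey codes written as tuples, which is an assumption; the ELM classification step and the probability computation described in the abstract are not modelled, and the paper's actual intersection procedure may differ.

```python
def ancestors_or_self(code):
    """All prefixes of a Dewey code, i.e. the node itself and its ancestors."""
    return {code[:i] for i in range(1, len(code) + 1)}


def slca(keyword_lists):
    """Return SLCA nodes for lists of Dewey codes, one list per keyword.

    A node covers a keyword if the keyword occurs in its subtree. Intersecting
    the covering sets gives every node whose subtree contains all keywords;
    the SLCAs are those with no descendant in the same intersection.
    """
    if not keyword_lists:
        return []
    covering = [set().union(*(ancestors_or_self(c) for c in codes))
                for codes in keyword_lists]
    common = set.intersection(*covering)
    return sorted(c for c in common
                  if not any(o != c and o[:len(c)] == c for o in common))
```

For example, if one keyword occurs at nodes `(1, 1, 1)` and `(1, 2, 1)` and another at `(1, 1, 2)` and `(1, 3)`, both the root `(1,)` and the node `(1, 1)` contain both keywords, and only the deeper node `(1, 1)` is kept.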


2015 ◽  
Vol 8 (12) ◽  
pp. 12663-12707 ◽  
Author(s):  
T. E. Taylor ◽  
C. W. O'Dell ◽  
C. Frankenberg ◽  
P. Partain ◽  
H. Q. Cronk ◽  
...  

Abstract. The objective of the National Aeronautics and Space Administration's (NASA) Orbiting Carbon Observatory-2 (OCO-2) mission is to retrieve the column-averaged carbon dioxide (CO2) dry air mole fraction (XCO2) from satellite measurements of reflected sunlight in the near-infrared. These estimates can be biased by clouds and aerosols within the instrument's field of view (FOV). Screening of the most contaminated soundings minimizes unnecessary calls to the computationally expensive Level 2 (L2) XCO2 retrieval algorithm. Hence, robust cloud screening methods have been an important focus of the OCO-2 algorithm development team. Two distinct, computationally inexpensive cloud screening algorithms have been developed for this application. The A-Band Preprocessor (ABP) retrieves the surface pressure using measurements in the 0.76 μm O2 A-band, neglecting scattering by clouds and aerosols, which introduce photon path-length (PPL) differences that can cause large deviations between the expected and retrieved surface pressure. The Iterative Maximum A-Posteriori (IMAP) Differential Optical Absorption Spectroscopy (DOAS) Preprocessor (IDP) retrieves independent estimates of the CO2 and H2O column abundances using observations taken at 1.61 μm (weak CO2 band) and 2.06 μm (strong CO2 band), while neglecting atmospheric scattering. The CO2 and H2O column abundances retrieved in these two spectral regions differ significantly in the presence of cloud and scattering aerosols. The combination of these two algorithms, which key off of different features in the spectra, provides the basis for cloud screening of the OCO-2 data set. To validate the OCO-2 cloud screening approach, collocated measurements from NASA's Moderate Resolution Imaging Spectroradiometer (MODIS), aboard the Aqua platform, were compared to results from the two OCO-2 cloud screening algorithms.
With tuning to allow throughputs of ≃ 30 %, agreement between the OCO-2 and MODIS cloud screening methods is found to be ≃ 85 % over four 16-day orbit repeat cycles in both the winter (December) and spring (April–May) for OCO-2 nadir-land, glint-land and glint-water observations. No major, systematic, spatial or temporal dependencies were found, although slight differences in the seasonal data sets do exist and validation is more problematic with increasing solar zenith angle and when surfaces are covered in snow and ice and have complex topography. To further analyze the performance of the cloud screening algorithms, an initial comparison of OCO-2 observations was made to collocated measurements from the Cloud-Aerosol Lidar with Orthogonal Polarization (CALIOP) aboard the Cloud-Aerosol Lidar and Infrared Pathfinder Satellite Observations (CALIPSO). These comparisons highlight the strength of the OCO-2 cloud screening algorithms in identifying high, thin clouds but suggest some difficulty in identifying some clouds near the surface, even when the optical thicknesses are greater than 1.
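
As a rough illustration of how two such screens might be combined and then scored against a reference cloud mask, consider the sketch below. The function names and both thresholds (`dp_max`, `ratio_tol`) are hypothetical placeholders rather than OCO-2 operational values, and defining agreement as the fraction of soundings on which the two masks give the same answer is an assumption about how the quoted percentage is computed.

```python
def is_clear(dp_hpa, co2_ratio, h2o_ratio, dp_max=25.0, ratio_tol=0.02):
    """Flag a sounding as clear only when both preprocessor-style tests pass.

    dp_hpa: retrieved minus expected surface pressure (A-band style test).
    co2_ratio, h2o_ratio: ratio of the columns retrieved in the weak and
    strong CO2 bands (IDP style test).
    dp_max and ratio_tol are hypothetical thresholds, not OCO-2 values.
    """
    return (abs(dp_hpa) <= dp_max
            and abs(co2_ratio - 1.0) <= ratio_tol
            and abs(h2o_ratio - 1.0) <= ratio_tol)


def throughput_and_agreement(ours, reference):
    """Return (fraction flagged clear by ours, fraction where both masks agree)."""
    if not ours or len(ours) != len(reference):
        raise ValueError("need two non-empty masks of equal length")
    n = len(ours)
    throughput = sum(ours) / n
    agreement = sum(a == b for a, b in zip(ours, reference)) / n
    return throughput, agreement
```

Tightening or loosening the thresholds trades throughput against contamination, which is the tuning the abstract refers to.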


2019 ◽  
Author(s):  
Juan Huo ◽  
Daren Lu ◽  
Shu Duan ◽  
Yongheng Bi ◽  
Bo Liu

Abstract. To better understand the accuracy of cloud top heights (CTHs) derived from passive satellite data, ground-based Ka-band radar measurements from 2016 and 2017 in Beijing were compared with CTH data inferred from the Moderate Resolution Imaging Spectroradiometer (MODIS) and the Advanced Himawari Imager (AHI). Relative to the radar CTHs, the MODIS CTHs were found to be underestimated by −1.10 ± 2.53 km and 49 % of CTH differences were within 1.0 km. Like the MODIS results, the AHI CTHs were underestimated by −1.10 ± 2.27 km and 42 % were within 1.0 km. Both the MODIS and AHI retrieval accuracy depended strongly on the cloud depth (CD). Large differences occurred mainly in the retrieval of thin clouds with CD < 1 km; for clouds with CD > 1 km, the CTH difference decreased to −0.48 ± 1.70 km for MODIS and to −0.76 ± 1.63 km for AHI. MODIS CTHs greater than 6 km showed better agreement with the radar data than those less than 4 km. Statistical analysis showed that the average AHI CTHs were lower than the average MODIS CTHs by −0.64 ± 2.36 km. The monthly accuracy of both retrieval algorithms was studied and it was found that the AHI retrieval algorithm had the largest bias in winter while the MODIS retrieval algorithm had the lowest accuracy in spring.


2014 ◽  
Vol 8 (4) ◽  
pp. 1161-1176 ◽  
Author(s):  
B. Hudson ◽  
I. Overeem ◽  
D. McGrath ◽  
J. P. M. Syvitski ◽  
A. Mikkelsen ◽  
...  

Abstract. The freshwater flux from the Greenland Ice Sheet (GrIS) to the North Atlantic Ocean carries extensive but poorly documented volumes of sediment. We develop a suspended sediment concentration (SSC) retrieval algorithm using a large Greenland-specific in situ data set. This algorithm is applied to all cloud-free NASA Moderate Resolution Imaging Spectroradiometer (MODIS) Terra images from 2000 to 2012 to monitor SSC dynamics at six river plumes in three fjords in southwest Greenland. Melt-season mean plume SSC increased at all but one site, although these trends were primarily not statistically significant. Zones of sediment concentration > 50 mg L−1 expanded in three river plumes, with potential consequences for biological productivity. The high SSC cores of sediment plumes (> 250 mg L−1) expanded in one-third of study locations. At a regional scale, higher volumes of runoff were associated with higher melt-season mean plume SSC values, but this relationship did not hold for individual rivers. High spatial variability between proximal plumes highlights the complex processes operating in Greenland's glacio–fluvial–fjord systems.

