Designing a Document Retrieval Method for University Digital Libraries Based on Hadoop Technology

2021 ◽  
Vol 5 (12) ◽  
pp. 82-87
Author(s):  
Haixia He

With the development of big data, organizations across society have begun applying it in service of their own enterprises and departments, and university digital libraries are no exception. The most cumbersome task in managing a university library is document retrieval. This article uses a Hadoop-based algorithm to extract semantic keywords and then calculates semantic similarity following the keyword-based literature retrieval process. A fast-matching method determines the weight of each keyword, ensuring efficient and accurate document retrieval in digital libraries and thus completing the design of a document retrieval method for university digital libraries based on Hadoop technology.
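The abstract describes scoring documents by semantic similarity over weighted keywords. A minimal sketch of that idea, assuming (the paper does not specify) that each document and query is reduced to a keyword-to-weight map and compared with cosine similarity:

```python
# Hypothetical sketch only: weighted-keyword cosine similarity.
# The keyword weights and the similarity measure are assumptions,
# not the paper's actual Hadoop-based formulation.
import math

def cosine_similarity(query_weights, doc_weights):
    """Cosine similarity between two keyword -> weight maps."""
    shared = set(query_weights) & set(doc_weights)
    dot = sum(query_weights[k] * doc_weights[k] for k in shared)
    nq = math.sqrt(sum(w * w for w in query_weights.values()))
    nd = math.sqrt(sum(w * w for w in doc_weights.values()))
    return dot / (nq * nd) if nq and nd else 0.0

# Example: a query and one document, each as keyword weights.
query = {"hadoop": 0.8, "retrieval": 0.6}
doc = {"hadoop": 0.5, "library": 0.4, "retrieval": 0.7}
score = cosine_similarity(query, doc)
```

Ranking the library's documents by this score would yield the retrieval order; how the paper actually assigns keyword weights via fast matching is not specified here.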

2012 ◽  
Vol 605-607 ◽  
pp. 2561-2568
Author(s):  
Qin Wang ◽  
Shou Ning Qu ◽  
Tao Du ◽  
Ming Jing Zhang

Nowadays, document retrieval is an important means of academic exchange and of acquiring new knowledge. The traditional retrieval method is to choose the corresponding database category and match the input keywords; it returns a mass of documents, making it hard for users to find the most relevant one. This paper puts forward a text quantification method: for each element in a document, five features are mined, namely word concept, positional weight, improved characteristic weight, text distribution weight, and element length. A word's contribution to the document is obtained by combining these five feature characteristics, and every document in the database is stored digitally through the contributions of its elements. The paper also designs a subject mapping scheme. First, a similarity calculation method based on contribution and association rules is defined; the documents in the database are then clustered by this method, and feature extraction identifies the subject of each class. When a user searches, the input description is quantified and automatically mapped to a class by subject mapping, and the document ranking is produced by computing the similarity between the description and the features of the other documents in that class. Experiments show that the scheme is intelligent and accurate, and improves retrieval speed.
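The five-feature contribution described above can be illustrated with a toy combination. The coefficients and the linear form below are assumptions for illustration; the paper's actual combination rule is not given in the abstract:

```python
# Illustrative sketch: combine five per-word feature values into one
# "contribution" score. The linear weighting and its coefficients are
# assumed, not taken from the paper.
def word_contribution(concept, position_weight, char_weight,
                      dist_weight, length):
    feats = [concept, position_weight, char_weight, dist_weight, length]
    coeffs = [0.3, 0.2, 0.2, 0.2, 0.1]  # assumed coefficients
    return sum(c * f for c, f in zip(coeffs, feats))

# Example: a word with all five features normalized into [0, 1].
contribution = word_contribution(0.9, 0.8, 0.7, 0.6, 0.5)
```

A document would then be represented as a vector of such contributions, over which the clustering and similarity steps described above operate.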


2014 ◽  
Vol 556-562 ◽  
pp. 4959-4962
Author(s):  
Sai Qiao

The traditional database information retrieval method works by matching simple associations between attributes, which requires that an image have only a single characteristic. As images grow more complex, further feature extraction becomes difficult, greatly increasing the time consumed by large-scale image database retrieval. A fast retrieval method for large-scale image databases is proposed. Texture features are extracted from the images in the database to support retrieval, and a constraint-matching method is introduced that refers to these texture features to complete the target retrieval. The experimental results show that the proposed algorithm, applied to large-scale image database retrieval, increases retrieval speed and thereby improves the performance of large-scale image databases.
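The constraint-matching idea above can be sketched as a cheap pre-filter on a texture descriptor followed by a full distance ranking. The toy descriptor (intensity mean and variance) and the tolerance constraint are assumptions; the paper's actual texture features are not specified in the abstract:

```python
# Hedged sketch: a toy texture descriptor with a constraint pre-filter
# before full matching. Descriptor and threshold are illustrative only.
def texture_descriptor(image):
    """Mean and variance of pixel intensities for a 2D list of pixels."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    var = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    return (mean, var)

def constrained_search(query_desc, database, mean_tol=10.0):
    """Constraint step: discard images whose mean intensity differs by
    more than mean_tol, then rank survivors by squared descriptor distance."""
    candidates = [(name, d) for name, d in database.items()
                  if abs(d[0] - query_desc[0]) <= mean_tol]
    return sorted(candidates,
                  key=lambda nd: (nd[1][0] - query_desc[0]) ** 2
                               + (nd[1][1] - query_desc[1]) ** 2)

# Example: precomputed descriptors for two database images.
db = {"img_a": (100.0, 5.0), "img_b": (200.0, 5.0)}
results = constrained_search((102.0, 4.0), db)
```

The pre-filter is what buys the speed-up on a large database: only images passing the cheap constraint pay the cost of the full comparison.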


2019 ◽  
Vol 2019 ◽  
pp. 1-20 ◽  
Author(s):  
Ameera M. Almasoud ◽  
Hend S. Al-Khalifa ◽  
Abdulmalik S. Al-Salman

In the field of biology, researchers need to compare genes or gene products using semantic similarity measures (SSM). Continuous data growth and diversity in data characteristics comprise what is called big data; current biological SSMs cannot handle big data. Therefore, these measures need the ability to control the size of big data. We used parallel and distributed processing by splitting data into multiple partitions and applied SSM measures to each partition; this approach helped manage big data scalability and computational problems. Our solution involves three steps: splitting gene ontology (GO), data clustering, and semantic similarity calculation. To test this method, split GO and data clustering algorithms were defined and assessed for performance in the first two steps. Three of the best SSMs in biology (Resnik, Shortest Semantic Differentiation Distance (SSDD), and SORA) are enhanced by introducing threaded parallel processing, which is used in the third step. Our results demonstrate that introducing threads in SSMs reduced the time of calculating semantic similarity between gene pairs and improved the performance of the three SSMs. Average time was reduced by 24.51% for Resnik, 22.93% for SSDD, and 33.68% for SORA. Total time was reduced by 8.88% for Resnik, 23.14% for SSDD, and 39.27% for SORA. Using these threaded measures in the distributed system, combined with using split GO and data clustering algorithms to split input data based on their similarity, reduced the average time more than did the approach of equally dividing input data. Time reduction increased with the number of splits. The time reduction percentage was 24.1%, 39.2%, and 66.6% for Threaded SSDD and 33.0%, 78.2%, and 93.1% for Threaded SORA in the case of 2, 3, and 4 slaves, respectively, and 92.04% for Threaded Resnik in the case of four slaves.
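The partition-and-thread step described above can be sketched as follows. The similarity function here is a placeholder, not Resnik, SSDD, or SORA, and the partitioning scheme is an assumption for illustration:

```python
# Sketch of the threading idea: split gene pairs into partitions and
# score each partition in a worker thread. placeholder_similarity stands
# in for the real SSMs (Resnik / SSDD / SORA), which are not shown here.
from concurrent.futures import ThreadPoolExecutor

def placeholder_similarity(pair):
    a, b = pair
    return 1.0 if a == b else 0.5

def score_partition(pairs):
    return [placeholder_similarity(p) for p in pairs]

def parallel_scores(pairs, n_workers=4):
    """Split the pair list into roughly equal partitions and score
    them concurrently, preserving input order in the output."""
    chunk = max(1, len(pairs) // n_workers)
    partitions = [pairs[i:i + chunk] for i in range(0, len(pairs), chunk)]
    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        results = ex.map(score_partition, partitions)
    return [s for part in results for s in part]

scores = parallel_scores([("g1", "g1"), ("g1", "g2"), ("g2", "g2")], 2)
```

The paper's reported gains come from similarity-based splitting (split GO plus clustering) rather than the equal-size split used in this sketch; the threading mechanics are the same either way.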


2005 ◽  
Vol 23 (3) ◽  
pp. 267-298 ◽  
Author(s):  
Laurence A. F. Park ◽  
Kotagiri Ramamohanarao ◽  
Marimuthu Palaniswami
