Automating Duplicate Detection for Lexical Heterogeneous Web Databases

Author(s):  
Anil Ahlawat ◽  
Kalpna Sagar

Introduction: The need for efficient search engines has grown with ever-increasing technological advancement and the rapidly growing volume of data on the web. Method: Automated duplicate detection over query results identifies records from multiple web databases that refer to the same real-world entity and returns the non-matching records to end-users. The algorithm proposed in this paper is based on an unsupervised approach with classifiers over heterogeneous web databases and returns more accurate results with high precision, recall, and F-measure. Several assessments are also carried out to analyze the efficacy of the proposed algorithm in identifying duplicates. Result: Results show that the proposed algorithm achieves higher precision and F-measure, with the same recall, as standard UDD. Conclusion: This paper concludes that the proposed algorithm outperforms standard UDD. Discussion: This paper introduces an algorithm that automates duplicate detection for lexically heterogeneous web databases.
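A minimal sketch of unsupervised duplicate detection over query results, using a simple field-similarity threshold in place of the paper's classifier-based approach (fields, records, and the threshold are illustrative):

```python
from itertools import combinations

def field_similarity(a: str, b: str) -> float:
    """Jaccard similarity over lower-cased word tokens of one field."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def record_similarity(r1: dict, r2: dict) -> float:
    """Average field similarity over the fields shared by two records."""
    fields = r1.keys() & r2.keys()
    return sum(field_similarity(r1[f], r2[f]) for f in fields) / len(fields)

def deduplicate(records: list[dict], threshold: float = 0.75) -> list[dict]:
    """Drop the later record of any pair whose similarity exceeds the threshold."""
    dropped = set()
    for i, j in combinations(range(len(records)), 2):
        if j not in dropped and record_similarity(records[i], records[j]) >= threshold:
            dropped.add(j)
    return [r for k, r in enumerate(records) if k not in dropped]

db = [
    {"title": "Intro to Databases", "author": "A. Smith"},
    {"title": "intro to databases", "author": "Smith A."},
    {"title": "Web Mining", "author": "B. Jones"},
]
print(deduplicate(db))  # the near-identical first two records collapse to one
```

In practice lexical heterogeneity (abbreviations, reordered name parts, synonyms) is what makes the classification step hard; the token-set similarity above only handles reordering and case.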

Author(s):  
Suyash Awasthi

Before chatbots and websites existed, students had to be physically present at the relevant college office to obtain information such as syllabus details, fee information, or assessment and exam schedules, or to complete tasks such as fee payment. This was a time-consuming process. With technological advancement, the necessary information is now available on the college's website. However, acquiring information from the website becomes tedious when end users cannot find what they need and therefore leave the site. To address such problems, a chatbot can be developed that gives visitors a more human-like conversational experience, easing access to the required information. The SDA bot can be deployed on the web and integrated with the college's website, where any end user can use it to resolve queries about the college or student-related information. This project is specifically designed to answer end users' queries and give them instant responses.
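As a toy illustration of the kind of query resolution such a bot performs, a keyword-based intent matcher (the intents and replies are hypothetical placeholders, not the SDA bot's actual data or method):

```python
# Minimal keyword-based intent matcher in the spirit of a college FAQ chatbot.
INTENTS = {
    "fees": "Semester fees can be paid online via the student portal.",
    "syllabus": "Course syllabi are listed under Academics > Curriculum.",
    "exam": "The exam schedule is published two weeks before each term ends.",
}

def reply(message: str) -> str:
    """Return the canned answer for the first intent keyword found."""
    words = message.lower().split()
    for keyword, answer in INTENTS.items():
        if keyword in words:
            return answer
    return "Sorry, I could not find that. Please contact the college office."

print(reply("When is the exam schedule released?"))
```

Production bots replace the keyword lookup with an NLU model, but the request-to-intent-to-response flow is the same.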


Author(s):  
Fotis Lazarinis

As the Web population continues to grow, more non-English speakers are coming online. The purpose of this chapter is to describe the methods and criteria used for evaluating search engines and to propose a model for evaluating the effectiveness of Web retrieval systems on non-English queries. The strengths and weaknesses of several engines in handling Greek and Italian queries are evaluated with this method. The fundamental purpose of the methodology is to establish quality measurements of search engine performance from the perspective of end users. Applying the proposed evaluation methodology helps users select the most effective search engine and helps developers identify modules of their software that need improvement.
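A minimal sketch of one such end-user-oriented quality measurement, precision at k, assuming per-engine result lists and human relevance judgments (the metric choice is illustrative, not the chapter's exact model):

```python
def precision_at_k(results: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of the top-k returned results judged relevant by assessors."""
    top = results[:k]
    return sum(1 for r in top if r in relevant) / len(top) if top else 0.0

# Hypothetical result lists for the same Greek query on two engines,
# with assessor-judged relevant documents.
relevant = {"d1", "d3", "d7"}
engine_a = ["d1", "d2", "d3", "d4"]
engine_b = ["d5", "d1", "d6", "d8"]
print(precision_at_k(engine_a, relevant, 4))  # 0.5
print(precision_at_k(engine_b, relevant, 4))  # 0.25
```

Averaging such scores over a query set yields a per-engine effectiveness figure that users can compare directly.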


Author(s):  
Pat Case

The Web changed the paradigm for full-text search. Searching Google for search engines returns 57,300,000 results at this writing, an impressive result set. Web search engines favor simple searches, speed, and relevance ranking. The end user most often finds a wanted result or two within the first page of search results. This new paradigm is less useful for searching collections of homogeneous data and documents than it is for searching the web. When searching collections, end users may need to review everything in the collection on a topic, may want a clean result set of only those six high-quality results, or may need to confirm that there are no wanted results, because finding no results within a collection sometimes answers a question about a topic or collection. To accomplish these tasks, end users may need more end-user functionality to return small, manageable result sets. The W3C XQuery and XPath Full Text Recommendation (XQFT) offers extensive end-user functionality, restoring the end-user control that librarians and expert searchers enjoyed before the Web. XQFT offers more end-user functionality and control than any other full-text search standard ever: more match options, more logical operators, more proximity operators, more ways to return a manageable result set. XQFT searches are also completely composable with XQuery string, number, date, and node queries, bringing the power of full-text search and database querying together for the first time. XQFT searches run directly against XML, enabling searches on any elements or attributes. XQFT implementations are standards-driven, based on shared semantics and syntax. A search in any implementation is portable and may be used in other implementations.


2017 ◽  
pp. 030-050
Author(s):  
J.V. Rogushina ◽  

Problems associated with the improvement of information retrieval in open environments are considered, and the need for its semantization is established. The current state and development prospects of semantic search engines focused on processing Web information resources are analysed, and criteria for classifying such systems are reviewed. In this analysis, significant attention is paid to semantic search's use of ontologies that contain knowledge about the subject area and the search users. The sources of ontological knowledge and the methods of processing it to improve search procedures are considered. Examples are given of semantic search systems that use structured query languages (e.g., SPARQL), lists of keywords, and natural-language queries. Criteria for classifying semantic search engines, such as architecture, coupling, transparency, user context, query modification, ontology structure, etc., are considered. Different ways of supporting semantic, ontology-based modification of user queries, which improves the completeness and accuracy of search, are analyzed. Based on an analysis of the properties of existing semantic search engines in terms of these criteria, areas for further improvement of these systems are identified: the development of metasearch systems, semantic modification of user requests, determination of a user-acceptable level of transparency in the search procedures, flexible domain-knowledge management tools, and increased productivity and scalability. In addition, semantic Web search tools need to use an external knowledge base that contains knowledge about the domain of the user's information needs, and to give users the ability to independently select the knowledge used in the search process.
It is also necessary to take into account the history of the user's interaction with the retrieval system and the search context, in order to personalize query results and order them in accordance with the user's information needs. All these aspects were taken into account in the design and implementation of the semantic search engine "MAIPS", which is based on an ontological model of the cooperation of users and resources on the Web.
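As a toy illustration of ontology-based query modification, one of the techniques discussed above, the sketch below expands a user's keywords with related concepts. A hand-written concept map stands in for a real OWL/RDF ontology queried via SPARQL; all terms are illustrative:

```python
# Toy ontology: concept -> narrower/related terms. Real systems would load
# an OWL/RDF ontology and traverse it with SPARQL; the expansion idea is the same.
ONTOLOGY = {
    "vehicle": ["car", "truck", "bicycle"],
    "car": ["sedan", "hatchback"],
}

def expand_query(terms: list[str], depth: int = 1) -> set[str]:
    """Add ontologically related terms to the user's keywords, to the given depth."""
    expanded = set(terms)
    frontier = list(terms)
    for _ in range(depth):
        frontier = [n for t in frontier for n in ONTOLOGY.get(t, [])]
        expanded.update(frontier)
    return expanded

print(sorted(expand_query(["car"])))  # ['car', 'hatchback', 'sedan']
```

Expanding queries this way improves completeness (recall); the transparency criterion above concerns how much of this rewriting the engine exposes to the user.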


2020 ◽  
Vol 79 (Suppl 1) ◽  
pp. 1961.1-1961
Author(s):  
J. Knitza ◽  
J. Mohn ◽  
C. Bergmann ◽  
E. Kampylafka ◽  
M. Hagen ◽  
...  

Background: Symptom checkers (SC) promise to reduce diagnostic delay and misdiagnosis and to guide patients effectively through healthcare systems. They are increasingly used; however, little evidence exists about their real-life effectiveness. Objectives: The aim of this study was to evaluate the diagnostic accuracy, usage time, usability and perceived usefulness of two promising SC, ADA (www.ada.com) and Rheport (www.rheport.de). Furthermore, symptom duration and previous symptom checking were recorded. Methods: Cross-sectional interim clinical data from the first of three recruiting centers of the prospective, real-world, multicenter bETTeR study (DKRS DRKS00017642) were used. Patients newly presenting to a secondary rheumatology outpatient clinic between September and December 2019 completed the ADA and Rheport SC. The time taken and the answers were recorded and compared with each patient's actual diagnosis. ADA provides up to five disease suggestions; Rheport calculates a risk score for rheumatic and musculoskeletal diseases (RMDs) (≥1 = RMD). For both SC, sensitivity and specificity were calculated with respect to RMDs. Furthermore, patients completed a survey evaluating SC usability using the system usability scale (SUS), perceived usefulness, previous symptom checking and symptom duration. Results: Of the 129 consecutive patients approached, 97 agreed to participate. 38% (37/97) of the presenting patients had an RMD (Figure 1). Mean symptom duration was 146 weeks, and a mean of 10 physician contacts had previously occurred to evaluate the current symptoms. 56% (54/96) had previously checked their symptoms on the internet using search engines, spending a mean of 6 hours. Rheport showed a sensitivity of 49% (18/37) and a specificity of 58% (35/60) concerning RMDs. ADA's top 1 and top 5 disease suggestions concerning RMDs showed a sensitivity of 43% (16/37) and 54% (20/37) and a specificity of 58% (35/60) and 52% (31/60), respectively.
ADA listed the correct diagnosis of the patients with RMDs first or within the first 5 disease suggestions in 19% (7/37) and 30% (11/37), respectively. The average perceived usefulness for checking symptoms using ADA, internet search engines and Rheport was 3.0, 3.5 and 3.1 on a visual analog scale from 1-5 (5 = very useful). 61% (59/96) and 64% (61/96) would recommend using ADA and Rheport, respectively. The mean SUS score of ADA and Rheport was 72/100 and 73/100. The mean usage time for ADA and Rheport was 8 and 9 minutes, respectively. Conclusion: This is the first prospective, real-world, multicenter study evaluating the diagnostic accuracy and other features of two currently used SC in rheumatology. These interim results suggest that diagnostic accuracy is limited; however, SC are well accepted among patients, and in some cases a correct diagnosis can be provided, straight from one's pocket, within a few minutes, saving valuable time. Acknowledgments: This study was supported by an unrestricted research grant from Novartis. Disclosure of Interests: Johannes Knitza Grant/research support from: Research Grant: Novartis, Jacob Mohn: None declared, Christina Bergmann: None declared, Eleni Kampylafka Speakers bureau: Novartis, BMS, Janssen, Melanie Hagen: None declared, Daniela Bohr: None declared, Elizabeth Araujo Speakers bureau: Novartis, Lilly, Abbott, Matthias Englbrecht Grant/research support from: Roche Pharma, Chugai Pharma Europe, Consultant of: AbbVie, Roche Pharma, RheumaDatenRhePort GbR, Speakers bureau: AbbVie, Celgene, Chugai Pharma Europe, Lilly, Mundipharma, Novartis, Pfizer, Roche Pharma, UCB, David Simon Grant/research support from: Else Kröner-Memorial Scholarship, Novartis, Consultant of: Novartis, Lilly, Arnd Kleyer Consultant of: Lilly, Gilead, Novartis, Abbvie, Speakers bureau: Novartis, Lilly, Timo Meinderink: None declared, Wolfgang Vorbrüggen: None declared, Cay-Benedict von der Decken: None declared, Stefan Kleinert Shareholder of: Morphosys,
Grant/research support from: Novartis, Consultant of: Novartis, Speakers bureau: Abbvie, Novartis, Celgene, Roche, Chugai, Janssen, Andreas Ramming Grant/research support from: Pfizer, Novartis, Consultant of: Boehringer Ingelheim, Novartis, Gilead, Pfizer, Speakers bureau: Boehringer Ingelheim, Roche, Janssen, Jörg Distler Grant/research support from: Boehringer Ingelheim, Consultant of: Boehringer Ingelheim, Paid instructor for: Boehringer Ingelheim, Speakers bureau: Boehringer Ingelheim, Peter Bartz-Bazzanella: None declared, Georg Schett Speakers bureau: AbbVie, BMS, Celgene, Janssen, Eli Lilly, Novartis, Roche and UCB, Axel Hueber Grant/research support from: Novartis, Lilly, Pfizer, Consultant of: Abbvie, BMS, Celgene, Gilead, GSK, Lilly, Novartis, Speakers bureau: GSK, Lilly, Novartis, Martin Welcker Grant/research support from: Abbvie, Novartis, UCB, Hexal, BMS, Lilly, Roche, Celgene, Sanofi, Consultant of: Abbvie, Actelion, Aescu, Amgen, Celgene, Hexal, Janssen, Medac, Novartis, Pfizer, Sanofi, UCB, Speakers bureau: Abbvie, Aescu, Amgen, Biogen, Berlin Chemie, Celgene, GSK, Hexal, Mylan, Novartis, Pfizer, UCB
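The sensitivity and specificity figures reported above follow directly from the stated counts; a quick check in Python:

```python
def sensitivity(tp: int, fn: int) -> float:
    """True-positive rate: flagged RMD patients / all RMD patients."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """True-negative rate: unflagged non-RMD patients / all non-RMD patients."""
    return tn / (tn + fp)

# Counts reported for Rheport (risk score >= 1 counted as positive):
tp, fn = 18, 37 - 18   # 18 of 37 RMD patients flagged
tn, fp = 35, 60 - 35   # 35 of 60 non-RMD patients correctly not flagged
print(round(sensitivity(tp, fn), 2), round(specificity(tn, fp), 2))  # 0.49 0.58
```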


2012 ◽  
Vol 204-208 ◽  
pp. 2721-2725
Author(s):  
Hua Ji Zhu ◽  
Hua Rui Wu

Village land continually changes in the real world. To keep the data up to date, data producers need to update it frequently. When the village land data are updated, the update information must be delivered to end-users to keep their client databases current. In the real world, village land changes in many forms. Identifying the change type of village land (i.e., capturing the semantics of the change) and representing it in the data world helps end-users understand the change consistently and makes it convenient for them to integrate the change information into their databases. This work focuses on modeling spatio-temporal change. A three-tuple model, CAR, for representing spatio-temporal change is proposed, based on the village land feature sets before and after the change, the change type, and rules. In this model, C denotes the change type, A denotes the attribute set, and R denotes the rules for judging the change type. The rules are described by IF-THEN expressions. Through operations between R and A, the change type C is determined. This model overcomes the limitations of current methods; moreover, its rules can easily be realized in a computer program.
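A toy rendering of the CAR idea in Python: the attribute set A is derived by comparing the feature sets before and after the change, and ordered IF-THEN rules R map it to a change type C. The attributes, rule conditions, and type names are hypothetical, not the paper's actual rule base:

```python
def attributes(before: dict, after: dict) -> dict:
    """A: attribute set comparing a land parcel before and after the change."""
    return {
        "geometry_changed": before["geometry"] != after["geometry"],
        "owner_changed": before["owner"] != after["owner"],
        "use_changed": before["land_use"] != after["land_use"],
    }

# R: ordered IF-THEN rules mapping attribute patterns to a change type C.
RULES = [
    (lambda a: a["geometry_changed"] and a["use_changed"], "reallocation"),
    (lambda a: a["geometry_changed"], "boundary_adjustment"),
    (lambda a: a["owner_changed"], "ownership_transfer"),
    (lambda a: a["use_changed"], "use_conversion"),
]

def change_type(before: dict, after: dict) -> str:
    """C: the first rule whose condition holds decides the change type."""
    a = attributes(before, after)
    for condition, c in RULES:
        if condition(a):
            return c
    return "no_change"

before = {"geometry": "poly1", "owner": "Li", "land_use": "farmland"}
after  = {"geometry": "poly1", "owner": "Li", "land_use": "orchard"}
print(change_type(before, after))  # use_conversion
```

Because the rules are plain data, an update service can ship them alongside the feature sets, letting client databases classify incoming changes the same way the producer does.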


2001 ◽  
Vol 1 (3) ◽  
pp. 28-31 ◽  
Author(s):  
Valerie Stevenson

Looking back to 1999, there were a number of search engines which performed equally well. I recommended defining the search strategy very carefully, using Boolean logic and field search techniques, and always running the search in more than one search engine. Numerous articles and Web columns comparing the performance of different search engines came to different conclusions on the ‘best’ search engines. Over the last year, however, all the speakers at conferences and seminars I have attended have recommended Google as their preferred tool for locating all kinds of information on the Web. I confess that I have now abandoned most of my carefully worked out search strategies and comparison tests, and use Google for most of my own Web searches.


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Yiqing Zhao ◽  
Saravut J. Weroha ◽  
Ellen L. Goode ◽  
Hongfang Liu ◽  
Chen Wang

Abstract Background Next-generation sequencing provides comprehensive information about individuals' genetic makeup and is commonplace in oncology clinical practice. However, the utility of genetic information in the clinical decision-making process has not been examined extensively from a real-world, data-driven perspective. By mining real-world data (RWD) from clinical notes, we could extract patients' genetic information and further associate treatment decisions with it. Methods We proposed a real-world evidence (RWE) study framework that incorporates context-based natural language processing (NLP) methods and data quality examination before the final association analysis. The framework was demonstrated in a Foundation-tested women's cancer cohort (N = 196). Upon retrieval of patients' genetic information using the NLP system, we assessed the completeness of genetic data captured in unstructured clinical notes against a genetic data model. We examined the distribution of different topics regarding BRCA1/2 throughout patients' treatment process, and then analyzed the association between BRCA1/2 mutation status and the discussion/prescription of targeted therapy. Results We identified seven topics in the clinical context of genetic mentions: Information, Evaluation, Insurance, Order, Negative, Positive, and Variants of unknown significance. Our rule-based system achieved a precision of 0.87, recall of 0.93 and F-measure of 0.91. Our machine learning system achieved a precision of 0.901, recall of 0.899 and F-measure of 0.9 for four-topic classification, and a precision of 0.833, recall of 0.823 and F-measure of 0.82 for seven-topic classification. We found that in result-containing sentences, capture of BRCA1/2 mutation information was 75%, but detailed variant information (e.g., variant types) was largely missing. Using the cleaned RWD, significant associations were found between positive BRCA1/2 mutation status and targeted therapies.
Conclusions In conclusion, we demonstrated a framework to generate RWE using RWD from different clinical sources. The rule-based NLP system achieved the best performance in resolving contextual variability when extracting RWD from unstructured clinical notes. Data quality issues such as incompleteness and discrepancies exist; thus, manual data cleaning is needed before further analysis can be performed. Finally, we were able to use the cleaned RWD to evaluate the real-world utility of genetic information in initiating a prescription of targeted therapy.
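The F-measures quoted above are the harmonic mean of precision and recall; for example, the machine-learning system's four-topic figures:

```python
def f_measure(precision: float, recall: float) -> float:
    """F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Figures reported for the machine-learning system, four-topic classification:
print(round(f_measure(0.901, 0.899), 1))  # 0.9
```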


Author(s):  
Hsun-Ping Hsieh ◽  
JiaWei Jiang ◽  
Tzu-Hsin Yang ◽  
Renfen Hu

The success of mediation is affected by many factors, such as the context of the quarrel, the personalities of both parties, and the negotiation skill of the mediator, all of which introduce uncertainty into prediction. This paper takes a different approach from previous legal prediction research: it analyzes and predicts whether two parties in a dispute can reach an agreement peacefully through mediation. From the inference result, we can tell whether mediation is a practical and time-saving way to resolve the dispute. Existing work on legal case prediction mostly focuses on prosecution or criminal cases. In this work, we propose an LSTM-based framework, called LSTMEnsembler, to predict mediation results by assembling multiple classifiers. Among these classifiers, some are powerful for modeling the numerical and categorical features of case information, e.g., XGBoost and LightGBM, while others are effective for textual data, e.g., TextCNN and BERT. The proposed LSTMEnsembler aims not only to combine the strengths of different classifiers intelligently, but also to capture temporal dependencies from previous cases to boost the performance of mediation prediction. Our experimental results show that LSTMEnsembler achieves an F-measure of 85.6% on real-world mediation data.
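A simplified stand-in for the ensembling step: combining per-classifier agreement probabilities with static weights. The actual LSTMEnsembler learns this combination with an LSTM over sequences of previous cases, which is omitted here; the classifier names, probabilities, and weights are all hypothetical:

```python
def ensemble(probs: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-classifier probabilities that mediation succeeds."""
    total = sum(weights.values())
    return sum(probs[name] * w for name, w in weights.items()) / total

# Hypothetical outputs for one dispute: tabular models (XGBoost, LightGBM)
# and text models (TextCNN, BERT), with the text models weighted higher.
probs = {"xgboost": 0.72, "lightgbm": 0.68, "textcnn": 0.81, "bert": 0.77}
weights = {"xgboost": 1.0, "lightgbm": 1.0, "textcnn": 1.5, "bert": 1.5}
score = ensemble(probs, weights)
print("agreement likely" if score >= 0.5 else "agreement unlikely")
```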


2019 ◽  
Vol 1 ◽  
pp. 1-2
Author(s):  
Shinpei Ito ◽  
Akinori Takahashi ◽  
Ruochen Si ◽  
Masatoshi Arikawa

Abstract. AR (Augmented Reality) can now be realized as a basic, high-level function on the latest smartphones at a reasonable price. AR enables users to experience consistent three-dimensional (3D) spaces in which real and virtual 3D objects co-exist, by sensing real 3D environments and reconstructing them in the virtual world through a camera. The accuracy of sensing real 3D environments with a smartphone's AR function, that is, its visual-inertial odometer, is far higher than that of its GPS receiver, and can be better than one centimeter. However, current common AR applications generally target small real 3D spaces rather than large ones. In other words, most current AR applications are not designed for use with a geographic coordinate system.

We propose a global extension of the visual-inertial odometer with an image-recognition function for geo-referenced image markers installed in real 3D spaces. Geo-referenced image markers can be generated, for example, from analog guide boards existing in the real world. We tested this framework of a globally extended visual-inertial odometer embedded in a smartphone on the first floor of the central library of Akita University. Geo-referenced image markers such as floor-map boards and book-category signboards were registered in a database of 3D geo-referenced real-world scene images. Our prototype system, developed on a smartphone (iPhone XS, Apple Inc.), could first recognize a floor-map board (Fig. 1) and determine the precise 3D distance and direction of the smartphone from the central position of the board, in a local 3D coordinate space whose origin is that central position. The system could then convert the camera's precise relative position and direction in the local coordinate space into a precise global location and orientation.
A subject walked along the first floor of the library building with the smartphone's world-tracking function active. The experimental results show that the error of tracking a real 3D space in a global coordinate system accumulated, but remained small: only about 30 centimeters after the subject had walked about 30 meters (Fig. 2). We now plan to improve the prototype's indoor-navigation accuracy by calibrating the smartphone's location and orientation through sequential recognition of multiple geo-referenced scene-image markers that already exist in the library for general user services. In conclusion, the results of testing our prototype were encouraging, and we are now preparing a more practical high-precision LBS that navigates a user to the exact location of a book of interest on a bookshelf, using AR and floor-map interfaces.
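The local-to-global conversion step can be sketched in 2D: given a marker's geo-referenced position and heading, rotate the camera's marker-relative offset by the heading and translate by the marker's position. The function names, the projected-coordinate assumption, and the planar simplification are illustrative, not the authors' implementation:

```python
import math

def local_to_global(marker_geo: tuple, marker_heading_deg: float,
                    local_offset: tuple) -> tuple:
    """Convert a camera position relative to a geo-referenced marker into
    global coordinates. marker_geo is the marker's (east, north) position in
    metres in a projected CRS; marker_heading_deg is the marker's azimuth;
    local_offset is the camera's (right, forward) offset from the marker."""
    theta = math.radians(marker_heading_deg)
    right, forward = local_offset
    east = marker_geo[0] + right * math.cos(theta) + forward * math.sin(theta)
    north = marker_geo[1] - right * math.sin(theta) + forward * math.cos(theta)
    return east, north

# Marker facing due north (heading 0°): camera 2 m to its right, 3 m in front.
print(local_to_global((100.0, 200.0), 0.0, (2.0, 3.0)))  # (102.0, 203.0)
```

The real system performs the full 3D version of this transform, continuously correcting the odometer's drift each time a registered marker is re-recognized.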

