Overview of Entity Resolution

Advances in Data Mining and Database Management - Innovative Techniques and Applications of Entity Resolution ◽

10.4018/978-1-4666-5198-2.ch001 ◽

2014 ◽

pp. 1-14

Keyword(s):

Machine Learning ◽

Information Retrieval ◽

Information Integration ◽

High Performance ◽

Web Search ◽

Data Cleaning ◽

Entity Resolution ◽

Data Quality Management ◽

Research Issues ◽

Search Data

Entity resolution is one of many important operations for data quality management, information retrieval, and data management. It has wide applications in Web search, ecommerce search, data cleaning, and information integration. Due to its importance, entity resolution has been studied by researchers in multiple fields, including databases, machine learning, information retrieval, and high performance computation. This book contains a number of chapters, carefully chosen to discuss the broad research issues in entity resolution. In addition, a number of important applications of entity resolution are also covered in the book. The purpose of this chapter is to provide an overview of the concepts, applications, and research topics of entity resolution, as well as the coverage of these topics in this book.

Securing Machine Learning in the Cloud: A Systematic Review of Cloud Machine Learning Security

Frontiers in Big Data ◽

10.3389/fdata.2020.587139 ◽

2020 ◽

Vol 3 ◽

Author(s):

Adnan Qayyum ◽

Aneeqa Ijaz ◽

Muhammad Usama ◽

Waleed Iqbal ◽

Junaid Qadir ◽

...

Keyword(s):

Machine Learning ◽

Systematic Review ◽

Graphics Processing Units ◽

High Performance ◽

Cloud Services ◽

Third Party ◽

Systematic Evaluation ◽

Research Issues ◽

Wide Range ◽

Attack And Defense

With the advances in machine learning (ML) and deep learning (DL) techniques, and the potency of cloud computing in offering services efficiently and cost-effectively, Machine Learning as a Service (MLaaS) cloud platforms have become popular. In addition, there is increasing adoption of third-party cloud services for outsourcing training of DL models, which requires substantial costly computational resources (e.g., high-performance graphics processing units (GPUs)). Such widespread usage of cloud-hosted ML/DL services opens a wide range of attack surfaces for adversaries to exploit the ML/DL system to achieve malicious goals. In this article, we conduct a systematic evaluation of the literature on cloud-hosted ML/DL models along two important dimensions related to their security: attacks and defenses. Our systematic review identified a total of 31 related articles, of which 19 focused on attack, six focused on defense, and six focused on both attack and defense. Our evaluation reveals increasing interest from the research community in attacking Machine Learning as a Service platforms and in defending them against different attacks. In addition, we identify the limitations and pitfalls of the analyzed articles and highlight open research issues that require further investigation.

Intelligent Web Search through Adaptive Learning from Relevance Feedback

Architectural Issues of Web-Enabled Electronic Business ◽

10.4018/978-1-59140-049-3.ch009 ◽

2011 ◽

pp. 140-154

Author(s):

Zhixiang Chen ◽

Binhai Zhu ◽

Xiannong Meng

Keyword(s):

Machine Learning ◽

Real Time ◽

Adaptive Learning ◽

Relevance Feedback ◽

Search Engines ◽

Web Search ◽

Learning Algorithm ◽

Future Research ◽

Learning Approaches ◽

Research Issues

In this chapter, machine-learning approaches to real-time intelligent Web search are discussed. The goal is to build an intelligent Web search system that can find the user’s desired information with as little relevance feedback from the user as possible. The system can achieve a significant search precision increase with a small number of iterations of user relevance feedback. A new machine-learning algorithm is designed as the core of the intelligent search component. This algorithm is applied to three different search engines with different emphases. This chapter presents the algorithm, the architectures, and the performances of these search engines. Future research issues regarding real-time intelligent Web search are also discussed.

Information Retrieval in Conjunction With Deep Learning

Handbook of Research on Emerging Trends and Applications of Machine Learning - Advances in Computational Intelligence and Robotics ◽

10.4018/978-1-5225-9643-1.ch014 ◽

2020 ◽

pp. 300-311 ◽

Cited By ~ 1

Author(s):

Anu Bajaj ◽

Tamanna Sharma ◽

Om Prakash Sangwan

Keyword(s):

Machine Learning ◽

Information Retrieval ◽

Deep Learning ◽

Statistical Models ◽

Web Search ◽

Machine Learning Techniques ◽

Trend Detection ◽

Social Media Analytics ◽

Learning Techniques ◽

Long Time

Information is the second level of abstraction, after data and before knowledge. Information retrieval helps fill the gap between information and knowledge by storing, organizing, representing, maintaining, and disseminating information. Manual information retrieval leads to underutilization of resources and takes a long time to process, while machine learning techniques are applications of statistical models, which are flexible, adaptable, and fast to learn. Deep learning is the extension of machine learning with hierarchical levels of learning that make it suitable for complex tasks. Deep learning can be the best choice for information retrieval as it has numerous resources of information and large datasets for computation. In this chapter, the authors discuss applications of information retrieval with deep learning (e.g., web search by reducing the noise and collecting precise results, trend detection in social media analytics, anomaly detection in music datasets, and image retrieval).

Diagnosis of Rare Diseases: A scoping review of clinical decision support systems

10.21203/rs.2.16205/v1 ◽

2019 ◽

Author(s):

Jannik Schaaf ◽

Martin Sedlmayr ◽

Johanna Schaefer ◽

Holger Storf

Keyword(s):

Machine Learning ◽

Information Retrieval ◽

Decision Support ◽

Decision Support Systems ◽

Clinical Decision Support ◽

Rare Diseases ◽

Web Search ◽

Clinical Decision Support Systems ◽

Clinical Decision ◽

Genetic Matching

Background: Rare Diseases (RD), which are defined as diseases affecting not more than 5 out of 10,000 people, are often severe, chronic, degenerative and life-threatening. A main problem is the delay in diagnosis of RD. Clinical Decision Support Systems (CDSS) for RD are software systems to support physicians in the diagnosis of patients with RD. It would therefore be useful to get a comprehensive overview of which CDSS are available and can be used under what conditions. In this work we provide a review of current CDSS in RD and of which functionality and data are used by the CDSS.
Methods: We searched PubMed and Cochrane for CDSS in RD published between December 1, 2008 and December 16, 2018. Only English-language original articles in peer-reviewed journals and conference papers describing a clinical prototype or routine use of a CDSS were included. A total of 2076 articles were found, and following a screening step 16 articles (describing 13 different CDSS) were considered relevant for the final analysis. We then described and compared the CDSS using the defined categories “functionality”, “development status”, “type of clinical data” and “system availability”.
Results: Three types of CDSS for RD were identified: “Machine Learning and Information retrieval”, “Web Search”, and “Phenotypic and genetic matching”. Eight of the 13 reviewed CDSS are publicly available and for use by physicians. The remaining CDSS are clinical prototypes which have been applied in clinical studies but are not accessible to others. Only one clinical prototype is online. The approaches of the CDSS differ depending on what type of clinical data is used. “Machine Learning and information retrieval” CDSS can show recommendations for a diagnosis, while “Web search” CDSS will retrieve articles from literature databases (e.g. case reports), which may provide hints for a possible diagnosis. CDSS in “Phenotypic and genetic matching” can identify similar patients based on genetic or phenotypic data.
Conclusions: Different CDSS for different purposes have been established, and physicians have to decide which CDSS is more accurate for a particular patient case. It remains to be seen which of the CDSS will be used and maintained in the future.

Statistical and machine learning models for optimizing energy in parallel applications

The International Journal of High Performance Computing Applications ◽

10.1177/1094342019842915 ◽

2019 ◽

Vol 33 (6) ◽

pp. 1079-1097 ◽

Cited By ~ 2

Author(s):

Mark Endrei ◽

Chao Jin ◽

Minh Ngoc Dinh ◽

David Abramson ◽

Heidi Poxon ◽

...

Keyword(s):

Machine Learning ◽

Energy Efficiency ◽

High Performance ◽

Large Scale ◽

Energy Use ◽

Parallel Applications ◽

Learning Models ◽

Trade Off ◽

Time Required ◽

Machine Learning Models

Rising power costs and constraints are driving a growing focus on the energy efficiency of high performance computing systems. The unique characteristics of a particular system and workload and their effect on performance and energy efficiency are typically difficult for application users to assess and to control. Settings for optimum performance and energy efficiency can also diverge, so we need to identify trade-off options that guide a suitable balance between energy use and performance. We present statistical and machine learning models that only require a small number of runs to make accurate Pareto-optimal trade-off predictions using parameters that users can control. We study model training and validation using several parallel kernels and more complex workloads, including Algebraic Multigrid (AMG), Large-scale Atomic Molecular Massively Parallel Simulator, and Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics. We demonstrate that we can train the models using as few as 12 runs, with prediction error of less than 10%. Our AMG results identify trade-off options that provide up to 45% improvement in energy efficiency for around 10% performance loss. We reduce the sample measurement time required for AMG by 90%, from 13 h to 74 min.

Use of Machine Learning to Investigate the Quantitative Checklist for Autism in Toddlers (Q-CHAT) towards Early Autism Screening

Diagnostics ◽

10.3390/diagnostics11030574 ◽

2021 ◽

Vol 11 (3) ◽

pp. 574

Author(s):

Gennaro Tartarisco ◽

Giovanni Cicceri ◽

Davide Di Pietro ◽

Elisa Leonardi ◽

Stefania Aiello ◽

...

Keyword(s):

Machine Learning ◽

High Performance ◽

Behavioral Science ◽

Autistic Traits ◽

Classification Performance ◽

Recursive Feature Elimination ◽

Diagnostic Tools ◽

Support Vector ◽

K Nearest Neighbors ◽

Autism Screening

In the past two decades, several screening instruments were developed to detect toddlers who may be autistic both in clinical and unselected samples. Among others, the Quantitative CHecklist for Autism in Toddlers (Q-CHAT) is a quantitative and normally distributed measure of autistic traits that demonstrates good psychometric properties in different settings and cultures. Recently, machine learning (ML) has been applied to behavioral science to improve the classification performance of autism screening and diagnostic tools, but mainly in children, adolescents, and adults. In this study, we used ML to investigate the accuracy and reliability of the Q-CHAT in discriminating young autistic children from those without. Five different ML algorithms (random forest (RF), naïve Bayes (NB), support vector machine (SVM), logistic regression (LR), and K-nearest neighbors (KNN)) were applied to investigate the complete set of Q-CHAT items. Our results showed that ML achieved an overall accuracy of 90%, and the SVM was the most effective, being able to classify autism with 95% accuracy. Furthermore, using the SVM–recursive feature elimination (RFE) approach, we selected a subset of 14 items ensuring 91% accuracy, while 83% accuracy was obtained from the 3 best discriminating items in common to ours and the previously reported Q-CHAT-10. This evidence confirms the high performance and cross-cultural validity of the Q-CHAT, and supports the application of ML to create shorter and faster versions of the instrument, maintaining high classification accuracy, to be used as a quick, easy, and high-performance tool in primary-care settings.

Thai Fake News Detection Based on Information Retrieval, Natural Language Processing and Machine Learning

SN Computer Science ◽

10.1007/s42979-021-00775-6 ◽

2021 ◽

Vol 2 (6) ◽

Author(s):

Phayung Meesad

Keyword(s):

Machine Learning ◽

Information Retrieval ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Fake News

Neural methods for effective, efficient, and exposure-aware information retrieval

ACM SIGIR Forum ◽

10.1145/3476415.3476434 ◽

2021 ◽

Vol 55 (1) ◽

pp. 1-2

Author(s):

Bhaskar Mitra

Keyword(s):

Information Retrieval ◽

Language Processing ◽

Large Scale ◽

Web Search ◽

Real Life ◽

Inverted Index ◽

Information Need ◽

Product Model ◽

Performance Improvements ◽

Deep Model

Neural networks with deep architectures have demonstrated significant performance improvements in computer vision, speech recognition, and natural language processing. The challenges in information retrieval (IR), however, are different from these other application areas. A common form of IR involves ranking of documents (or short passages) in response to keyword-based queries. Effective IR systems must deal with the query-document vocabulary mismatch problem by modeling relationships between different query and document terms and how they indicate relevance. Models should also consider lexical matches when the query contains rare terms (such as a person's name or a product model number) not seen during training, and should avoid retrieving semantically related but irrelevant results. In many real-life IR tasks, the retrieval involves extremely large collections (such as the document index of a commercial Web search engine) containing billions of documents. Efficient IR methods should take advantage of specialized IR data structures, such as the inverted index, to efficiently retrieve from large collections. Given an information need, the IR system also mediates how much exposure an information artifact receives by deciding whether it should be displayed, and where it should be positioned, among other results. Exposure-aware IR systems may optimize for additional objectives, besides relevance, such as parity of exposure for retrieved items and content publishers. In this thesis, we present novel neural architectures and methods motivated by the specific needs and challenges of IR tasks. We ground our contributions with a detailed survey of the growing body of neural IR literature [Mitra and Craswell, 2018].
Our key contribution towards improving the effectiveness of deep ranking models is developing the Duet principle [Mitra et al., 2017] which emphasizes the importance of incorporating evidence based on both patterns of exact term matches and similarities between learned latent representations of query and document. To efficiently retrieve from large collections, we develop a framework to incorporate query term independence [Mitra et al., 2019] into any arbitrary deep model that enables large-scale precomputation and the use of inverted index for fast retrieval. In the context of stochastic ranking, we further develop optimization strategies for exposure-based objectives [Diaz et al., 2020]. Finally, this dissertation also summarizes our contributions towards benchmarking neural IR models in the presence of large training datasets [Craswell et al., 2019] and explores the application of neural methods to other IR tasks, such as query auto-completion.

Automatic generation of high-performance quantized machine learning kernels

Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization ◽

10.1145/3368826.3377912 ◽

2020 ◽

Cited By ~ 1

Author(s):

Meghan Cowan ◽

Thierry Moreau ◽

Tianqi Chen ◽

James Bornholt ◽

Luis Ceze

Keyword(s):

Machine Learning ◽

High Performance ◽

Automatic Generation

An IoT-Focused Intrusion Detection System Approach Based on Preprocessing Characterization for Cybersecurity Datasets

Sensors ◽

10.3390/s21020656 ◽

2021 ◽

Vol 21 (2) ◽

pp. 656

Author(s):

Xavier Larriva-Novo ◽

Víctor A. Villagrá ◽

Mario Vega-Barbas ◽

Diego Rivera ◽

Mario Sanz Rodrigo

Keyword(s):

Machine Learning ◽

Intrusion Detection ◽

High Performance ◽

Learning Algorithm ◽

Detection System ◽

Machine Learning Algorithms ◽

Statistical Characteristics ◽

Detection Techniques ◽

Traffic Characteristics ◽

Benchmark Datasets

Security in IoT networks is currently mandatory, due to the large amount of data that has to be handled. These systems are vulnerable to several cybersecurity attacks, which are increasing in number and sophistication. For this reason, new intrusion detection techniques have to be developed that are as accurate as possible for these scenarios. Intrusion detection systems based on machine learning algorithms have already shown high performance in terms of accuracy. This research proposes the study and evaluation of several preprocessing techniques based on traffic categorization for a machine learning neural network algorithm. This research uses two benchmark datasets for its evaluation, namely UGR16 and UNSW-NB15, and one of the most used datasets, KDD99. The preprocessing techniques were evaluated in accordance with scalar and normalization functions. All of these preprocessing models were applied through different sets of characteristics based on a categorization composed of four groups of features: basic connection features, content characteristics, statistical characteristics and, finally, a group composed of traffic-based features and connection direction-based traffic characteristics. The objective of this research is to evaluate this categorization by using various data preprocessing techniques to obtain the most accurate model. Our proposal shows that, by applying the categorization of network traffic and several preprocessing techniques, the accuracy can be enhanced by up to 45%. The preprocessing of a specific group of characteristics allows for greater accuracy, allowing the machine learning algorithm to correctly classify these parameters related to possible attacks.
