Similarity-Based Clustering for Enhancing Image Classification Architectures

10.36227/techrxiv.13186478 ◽

2020 ◽

Author(s):

Dishant Parikh ◽

Shambhavi Aggarwal

Keyword(s):

Computational Cost ◽

Image Similarity ◽

Learning Models ◽

Additional Information ◽

Convolutional Networks ◽

Model Size ◽

Computer Vision Applications ◽

The Cost ◽

Flow Of Information ◽

Wide Assortment

Convolutional networks are at the center of best in class computer vision applications for a wide assortment of undertakings. Since 2014, profound amount of work began to make better convolutional architectures, yielding generous additions in different benchmarks. Albeit expanded model size and computational cost will, in general, mean prompt quality increases for most undertakings but, the architectures now need to have some additional information to increase the performance. We show empirical evidence that with the amalgamation of content-based image similarity and deep learning models, we can provide the flow of information which can be used in making clustered learning possible. We show how parallel training of sub-dataset clusters not only reduces the cost of computation but also increases the benchmark accuracies by 5-11 percent.

Download Full-text

ALPINE: Active Link Prediction Using Network Embedding

Applied Sciences ◽

10.3390/app11115043 ◽

2021 ◽

Vol 11 (11) ◽

pp. 5043

Author(s):

Xi Chen ◽

Bo Kang ◽

Jefrey Lijffijt ◽

Tijl De Bie

Keyword(s):

Active Learning ◽

Protein Interactions ◽

Link Prediction ◽

Prediction Accuracy ◽

Real Data ◽

Network Embedding ◽

Protein Protein Interactions ◽

Additional Information ◽

The Cost ◽

Active Link

Many real-world problems can be formalized as predicting links in a partially observed network. Examples include Facebook friendship suggestions, the prediction of protein–protein interactions, and the identification of hidden relationships in a crime network. Several link prediction algorithms, notably those recently introduced using network embedding, are capable of doing this by just relying on the observed part of the network. Often, whether two nodes are linked can be queried, albeit at a substantial cost (e.g., by questionnaires, wet lab experiments, or undercover work). Such additional information can improve the link prediction accuracy, but owing to the cost, the queries must be made with due consideration. Thus, we argue that an active learning approach is of great potential interest and developed ALPINE (Active Link Prediction usIng Network Embedding), a framework that identifies the most useful link status by estimating the improvement in link prediction accuracy to be gained by querying it. We proposed several query strategies for use in combination with ALPINE, inspired by the optimal experimental design and active learning literature. Experimental results on real data not only showed that ALPINE was scalable and boosted link prediction accuracy with far fewer queries, but also shed light on the relative merits of the strategies, providing actionable guidance for practitioners.

Download Full-text

Domain Adaptation for Semantic Segmentation of Historical Panchromatic Orthomosaics in Central Africa

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi10080523 ◽

2021 ◽

Vol 10 (8) ◽

pp. 523

Author(s):

Nicholus Mboga ◽

Stefano D’Aronco ◽

Tais Grippa ◽

Charlotte Pelletier ◽

Stefanos Georganos ◽

...

Keyword(s):

Land Cover ◽

Domain Adaptation ◽

Central Africa ◽

Semantic Segmentation ◽

Target Domain ◽

Convolutional Networks ◽

Fully Convolutional Networks ◽

The Cost ◽

Performance Gains

Multitemporal environmental and urban studies are essential to guide policy making to ultimately improve human wellbeing in the Global South. Land-cover products derived from historical aerial orthomosaics acquired decades ago can provide important evidence to inform long-term studies. To reduce the manual labelling effort by human experts and to scale to large, meaningful regions, we investigate in this study how domain adaptation techniques and deep learning can help to efficiently map land cover in Central Africa. We propose and evaluate a methodology that is based on unsupervised adaptation to reduce the cost of generating reference data for several cities and across different dates. We present the first application of domain adaptation based on fully convolutional networks for semantic segmentation of a dataset of historical panchromatic orthomosaics for land-cover generation for two focus cities Goma-Gisenyi and Bukavu. Our experimental evaluation shows that the domain adaptation methods can reach an overall accuracy between 60% and 70% for different regions. If we add a small amount of labelled data from the target domain, too, further performance gains can be achieved.

Download Full-text

Plant Leaf Disease Recognition Using Depth-Wise Separable Convolution-Based Models

Symmetry ◽

10.3390/sym13030511 ◽

2021 ◽

Vol 13 (3) ◽

pp. 511

Author(s):

Syed Mohammad Minhaz Hossain ◽

Kaushik Deb ◽

Pranab Kumar Dhar ◽

Takeshi Koshiba

Keyword(s):

State Of The Art ◽

Computational Cost ◽

Region Of Interest ◽

Number Of Clusters ◽

Plant Leaf ◽

Leaf Disease ◽

Automatic Initialization ◽

Adequate Accuracy ◽

Model Size ◽

High Computational Cost

Proper plant leaf disease (PLD) detection is challenging in complex backgrounds and under different capture conditions. For this reason, initially, modified adaptive centroid-based segmentation (ACS) is used to trace the proper region of interest (ROI). Automatic initialization of the number of clusters (K) using modified ACS before recognition increases tracing ROI’s scalability even for symmetrical features in various plants. Besides, convolutional neural network (CNN)-based PLD recognition models achieve adequate accuracy to some extent. However, memory requirements (large-scaled parameters) and the high computational cost of CNN-based PLD models are burning issues for the memory restricted mobile and IoT-based devices. Therefore, after tracing ROIs, three proposed depth-wise separable convolutional PLD (DSCPLD) models, such as segmented modified DSCPLD (S-modified MobileNet), segmented reduced DSCPLD (S-reduced MobileNet), and segmented extended DSCPLD (S-extended MobileNet), are utilized to represent the constructive trade-off among accuracy, model size, and computational latency. Moreover, we have compared our proposed DSCPLD recognition models with state-of-the-art models, such as MobileNet, VGG16, VGG19, and AlexNet. Among segmented-based DSCPLD models, S-modified MobileNet achieves the best accuracy of 99.55% and F1-sore of 97.07%. Besides, we have simulated our DSCPLD models using both full plant leaf images and segmented plant leaf images and conclude that, after using modified ACS, all models increase their accuracy and F1-score. Furthermore, a new plant leaf dataset containing 6580 images of eight plants was used to experiment with several depth-wise separable convolution models.

Download Full-text

Triage and diagnosis of COVID-19 from medical social media (Preprint)

10.2196/preprints.30397 ◽

2021 ◽

Author(s):

Abul Hasan ◽

Mark Levene ◽

David Weston ◽

Renate Fromson ◽

Nicolas Koslover ◽

...

Keyword(s):

Machine Learning ◽

Social Media ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Learning Models ◽

Rule Based ◽

Additional Information ◽

Processing Pipeline ◽

Machine Learning Models

BACKGROUND The COVID-19 pandemic has created a pressing need for integrating information from disparate sources, in order to assist decision makers. Social media is important in this respect, however, to make sense of the textual information it provides and be able to automate the processing of large amounts of data, natural language processing methods are needed. Social media posts are often noisy, yet they may provide valuable insights regarding the severity and prevalence of the disease in the population. In particular, machine learning techniques for triage and diagnosis could allow for a better understanding of what social media may offer in this respect. OBJECTIVE This study aims to develop an end-to-end natural language processing pipeline for triage and diagnosis of COVID-19 from patient-authored social media posts, in order to provide researchers and other interested parties with additional information on the symptoms, severity and prevalence of the disease. METHODS The text processing pipeline first extracts COVID-19 symptoms and related concepts such as severity, duration, negations, and body parts from patients’ posts using conditional random fields. An unsupervised rule-based algorithm is then applied to establish relations between concepts in the next step of the pipeline. The extracted concepts and relations are subsequently used to construct two different vector representations of each post. These vectors are applied separately to build support vector machine learning models to triage patients into three categories and diagnose them for COVID-19. RESULTS We report that Macro- and Micro-averaged F_{1\ }scores in the range of 71-96% and 61-87%, respectively, for the triage and diagnosis of COVID-19, when the models are trained on human labelled data. Our experimental results indicate that similar performance can be achieved when the models are trained using predicted labels from concept extraction and rule-based classifiers, thus yielding end-to-end machine learning. Also, we highlight important features uncovered by our diagnostic machine learning models and compare them with the most frequent symptoms revealed in another COVID-19 dataset. In particular, we found that the most important features are not always the most frequent ones. CONCLUSIONS Our preliminary results show that it is possible to automatically triage and diagnose patients for COVID-19 from natural language narratives using a machine learning pipeline, in order to provide additional information on the severity and prevalence of the disease through the eyes of social media.

Download Full-text

Optimization of machining parameters during Drilling by Taguchi based Design of Experiments and Validation by Neural Network

Brazilian Journal of Operations & Production Management ◽

10.14488/bjopm.2018.v15.n2.a11 ◽

2018 ◽

Vol 15 (2) ◽

pp. 294-301

Author(s):

Reddy Sreenivasulu ◽

Chalamalasetti SrinivasaRao

Keyword(s):

Neural Network ◽

Design Of Experiments ◽

Computational Cost ◽

Machining Parameters ◽

Burr Size ◽

Taguchi Design Of Experiments ◽

The Neural Network ◽

Input Parameters ◽

Hole Making ◽

The Cost

Drilling is a hole making process on machine components at the time of assembly work, which are identify everywhere. In precise applications, quality and accuracy play a wide role. Nowadays’ industries suffer due to the cost incurred during deburring, especially in precise assemblies such as aerospace/aircraft body structures, marine works and automobile industries. Burrs produced during drilling causes dimensional errors, jamming of parts and misalignment. Therefore, deburring operation after drilling is often required. Now, reducing burr size is a serious topic. In this study experiments are conducted by choosing various input parameters selected from previous researchers. The effect of alteration of drill geometry on thrust force and burr size of drilled hole was investigated by the Taguchi design of experiments and found an optimum combination of the most significant input parameters from ANOVA to get optimum reduction in terms of burr size by design expert software. Drill thrust influences more on burr size. The clearance angle of the drill bit causes variation in thrust. The burr height is observed in this study. These output results are compared with the neural network software @easy NN plus. Finally, it is concluded that by increasing the number of nodes the computational cost increases and the error in nueral network decreases. Good agreement was shown between the predictive model results and the experimental responses.

Download Full-text

Efficient Deep Learning Models for DGA Domain Detection

Security and Communication Networks ◽

10.1155/2021/8887881 ◽

2021 ◽

Vol 2021 ◽

pp. 1-15

Author(s):

Juhong Namgung ◽

Siwoon Son ◽

Yang-Sae Moon

Keyword(s):

Deep Learning ◽

Short Term Memory ◽

Ensemble Model ◽

Learning Models ◽

Short Term ◽

Domain Names ◽

Additional Information ◽

Domain Sequence ◽

Long Short Term Memory ◽

And Control

In recent years, cyberattacks using command and control (C&C) servers have significantly increased. To hide their C&C servers, attackers often use a domain generation algorithm (DGA), which automatically generates domain names for the C&C servers. Accordingly, extensive research on DGA domain detection has been conducted. However, existing methods cannot accurately detect continuously generated DGA domains and can easily be evaded by an attacker. Recently, long short-term memory- (LSTM-) based deep learning models have been introduced to detect DGA domains in real time using only domain names without feature extraction or additional information. In this paper, we propose an efficient DGA domain detection method based on bidirectional LSTM (BiLSTM), which learns bidirectional information as opposed to unidirectional information learned by LSTM. We further maximize the detection performance with a convolutional neural network (CNN) + BiLSTM ensemble model using Attention mechanism, which allows the model to learn both local and global information in a domain sequence. Experimental results show that existing CNN and LSTM models achieved F1-scores of 0.9384 and 0.9597, respectively, while the proposed BiLSTM and ensemble models achieved higher F1-scores of 0.9618 and 0.9666, respectively. In addition, the ensemble model achieved the best performance for most DGA domain classes, enabling more accurate DGA domain detection than existing models.

Download Full-text

Employing multiple synchronous outcome samples per subject to improve study efficiency

BMC Medical Research Methodology ◽

10.1186/s12874-021-01414-7 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Roger P. A’Hern

Keyword(s):

Primary Outcome ◽

Sampling Error ◽

Optimum Number ◽

Intra Class Correlation ◽

Additional Information ◽

True Value ◽

The Difference ◽

Highly Correlated ◽

The Cost ◽

Multiple Samples

Abstract Background Accuracy can be improved by taking multiple synchronous samples from each subject in a study to estimate the endpoint of interest if sample values are not highly correlated. If feasible, it is useful to assess the value of this cluster approach when planning studies. Multiple assessments may be the only method to increase power to an acceptable level if the number of subjects is limited. Methods The main aim is to estimate the difference in outcome between groups of subjects by taking one or more synchronous primary outcome samples or measurements. A summary statistic from multiple samples per subject will typically have a lower sampling error. The number of subjects can be balanced against the number of synchronous samples to minimize the sampling error, subject to design constraints. This approach can include estimating the optimum number of samples given the cost per subject and the cost per sample. Results The accuracy improvement achieved by taking multiple samples depends on the intra-class correlation (ICC). The lower the ICC, the greater the benefit that can accrue. If the ICC is high, then a second sample will provide little additional information about the subject’s true value. If the ICC is very low, adding a sample can be equivalent to adding an extra subject. Benefits of multiple samples include the ability to reduce the number of subjects in a study and increase both the power and the available alpha. If, for example, the ICC is 35%, adding a second measurement can be equivalent to adding 48% more subjects to a single measurement study. Conclusion A study’s design can sometimes be improved by taking multiple synchronous samples. It is useful to evaluate this strategy as an extension of a single sample design. An Excel workbook is provided to allow researchers to explore the most appropriate number of samples to take in a given setting.

Download Full-text

Development of Instant Sinigang Powder from Katmon Fruit (Dillenia philippinensis)

Abstract Proceedings International Scholars Conference ◽

10.35974/isc.v7i1.1034 ◽

2019 ◽

Vol 7 (1) ◽

pp. 367-383

Author(s):

Marchelita Beverly Tappy ◽

Sophiya Celine Dellosa ◽

Teejay Malabonga

Keyword(s):

High School Students ◽

Nutritional Value ◽

Nutrient Content ◽

The Philippines ◽

Iodized Salt ◽

School Students ◽

Additional Information ◽

The Cost ◽

Online Software ◽

Senior High School Students

INTRODUCTION:Katmon Fruit (Dillenia Philippinensis) is a fruit tree commonly use in the rural area in the Philippines. Katmon is eaten as fruit but is not very popular because of the unacceptable taste that resembles a green sour apple. The purpose of this study is to develop an instant sinigang powder as a base ingredients of sinigang and using the natural sour taste for sinigang dish. METHOD:This study use katmon fruit, shiitake mushroom, garlic, iodized salt, and sugar to develop an instant sinigang mix powder. It were dehydrated using the Multi-Commodity Heat Pump Dryer for 13 hours. These were powdered using a grinder mixed with iodized salt and sugar. The nutrient content was computed using iFNRI online software. Thirty participants comprising: 10 faculty, 10 dormitory students,10 senior high school students did the taste test. RESULTS:The results revealed that the product was liked very much in terms of color, texture, taste, aroma, and appearance. The instant sinigang powder is stored in a polyethylene metalized zip lock packaging 8.5 x 14cm. The cost per serving is PhP 37.5 .It is cheaper and has more nutritional value compared to other products. DISCUSSION AND RECOMENDATION: The study recommend for more enhancement in terms of flavour of instant sinigang powder from katmon additional ingredient from natural sources to have more tasty and more nutritional content. This study also can help future researchers to have additional information about the katmon fruit.

Download Full-text

Breaking the problem into pieces: pre-clustering on-the-fly with the NGLMS

International Journal for Population Data Science ◽

10.23889/ijpds.v1i1.271 ◽

2017 ◽

Vol 1 (1) ◽

Author(s):

James Farrow

Keyword(s):

Cost Reduction ◽

Linked Data ◽

Selection Criteria ◽

Transitive Closure ◽

Computational Cost ◽

Quality Analysis ◽

Clustering Techniques ◽

Computationally Intensive ◽

The Cost ◽

Manual Review

ABSTRACT ObjectivesThe SA.NT DataLink Next Generation Linkage Management System (NGLMS) stores linked data in the form of a graph (in the computer science sense) comprised of nodes (records) and edges (record relationships or similarities). This permits efficient pre-clustering techniques based on transitive closure to form groups of records which relate to the same individual (or other selection criteria). ApproachOnly information known (or at least highly likely) to be relevant is extracted from the graph as superclusters. This operation is computationally inexpensive when the underlying information is stored as a graph and may be able to be done on-the-fly for typical clusters. More computationally intensive analysis and/or further clustering may then be performed on this smaller subgraph. Canopy clustering and using blocking used to reduce pairwise comparisons are expressions of the same type of approach. ResultsSubclusters for manual review based on transitive closure are typically computationally inexpensive enough to extract from the NGLMS that they are extracted on-demand during manual clerical review activities. There is no necessity to pre-calculate these clusters. Once extracted further analysis is undertaken on these smaller data groupings for visualisation and presentation for review and quality analysis. More computationally expensive techniques can be used at this point to prepare data for visualisation or provide hints to manual reviewers.  Extracting high-recall groups of data records for review but providing them to reviews grouped further into high precision groups as the result of a second pass has resulted in a reduction of the time taken for clerical reviewers at SANT DataLink to manual review a group by 30–40%. The reviewers are able to manipulate whole groups of related records at once rather than individual records. ConclusionPre-clustering reduces the computational cost associated with higher order clustering and analysis algorithms. Algorithms which scale by n^2 (or more) are typical in comparison scenarios. By breaking the problem into pieces the computational cost can be reduced. Typically breaking a problem into many pieces reduces the cost in proportion to the number of pieces the problem can be broken into. This cost reduction can make techniques possible which would otherwise be computationally prohibitive.

Download Full-text