single dataset
Recently Published Documents


TOTAL DOCUMENTS

86
(FIVE YEARS 48)

H-INDEX

9
(FIVE YEARS 3)

2021 ◽  
Vol 4 ◽  
pp. 123-128
Author(s):  
Amit Kumar ◽  
Soni Rajput ◽  
Manjunath P. Puranik ◽  
Ankit Mahesh Patel

Judging research efficiency and academic growth by the number of publications pushes researchers to publish more articles from a single dataset. Some cross into unethical practices such as self-plagiarism, duplicate publication, and other research misconduct, which warrant disciplinary action. The thrust of this review is to draw the attention of authors, reviewers, editors, and readers to the different dimensions of overlapping publications in research. Guidelines from ethical bodies such as the Committee on Publication Ethics and the International Committee of Medical Journal Editors were considered for the review. The present review provides an expansive outline of the publication overlap described in the literature. The reasons for producing overlapping publications, and the problems associated with their different types, are identified. Preventive and remedial measures, as well as recommendations for authors, editors, and reviewers, are highlighted. Because of the pressure on researchers to “publish or perish”, journals are being flooded with overlapping publications.


2021 ◽  
Vol 12 (1) ◽  
pp. 71
Author(s):  
Peng-Yeng Yin ◽  
Ray-I Chang ◽  
Rong-Fuh Day ◽  
Yen-Cheng Lin ◽  
Ching-Yuan Hu

The rapid development of industrialization and urbanization has substantially increased air pollution in many populated cities around the globe. Intensive research has shown that ambient aerosols, especially the fine particulate matter PM2.5, are highly correlated with human respiratory diseases. It is critical to analyze, forecast, and mitigate PM2.5 concentrations. One of the typical meteorological phenomena that cause PM2.5 to accumulate is temperature inversion, which forms a warm-air cap that blocks surface pollutants from dissipating. This paper analyzes the meteorological patterns that coincide with temperature inversion and proposes two machine learning classifiers for temperature inversion classification. A separate multivariate regression model is trained for each class (with or without temperature inversion) in order to improve PM2.5 forecasting performance. We chose Puli township as the study site, a basin city that easily traps PM2.5. The experimental results, on a dataset spanning 1 January 2016 to 31 December 2019, show that the proposed temperature inversion classifiers exhibit satisfactory F1-scores, and that the regression models trained on the classified datasets significantly improve the PM2.5 concentration forecast compared with a model trained on a single dataset that does not consider the temperature inversion factor.
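
The "one regression model per class" idea in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a single predictor instead of a multivariate model, takes the inversion flag as given rather than predicting it with a classifier, and uses made-up toy rows. The names `fit_line`, `fit_per_class`, and `predict` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (single predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_per_class(rows):
    """Fit one regression per inversion class; rows are (is_inversion, x, y)."""
    grouped = {}
    for is_inversion, x, y in rows:
        grouped.setdefault(is_inversion, []).append((x, y))
    return {
        label: fit_line([x for x, _ in pts], [y for _, y in pts])
        for label, pts in grouped.items()
    }


def predict(models, is_inversion, x):
    """Route the sample to the model that matches its inversion class."""
    intercept, slope = models[is_inversion]
    return intercept + slope * x


# Toy rows (assumed, not from the paper): (inversion flag, predictor, PM2.5).
rows = [
    (True, 1.0, 12.0), (True, 2.0, 14.0), (True, 3.0, 16.0),
    (False, 1.0, 2.0), (False, 2.0, 3.0), (False, 3.0, 4.0),
]
models = fit_per_class(rows)
print(predict(models, True, 4.0))
print(predict(models, False, 4.0))
```

A single pooled line fitted to all six rows would sit between the two groups and fit neither well, which is the intuition behind splitting by class before regressing.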


2021 ◽  
Vol 11 (24) ◽  
pp. 12122
Author(s):  
Dilovan Asaad Zebari ◽  
Dheyaa Ahmed Ibrahim ◽  
Diyar Qader Zeebaree ◽  
Mazin Abed Mohammed ◽  
Habibollah Haron ◽  
...  

Breast cancer detection using mammogram images at an early stage is an important step in disease diagnostics. We propose a new method for classifying breast cancer as benign or malignant from mammogram images. Hybrid thresholding and a machine learning method are used to derive the region of interest (ROI). The derived ROI is then separated into five different blocks. The wavelet transform is applied to suppress noise in each block, based on BayesShrink soft thresholding, by capturing high and low frequencies within different sub-bands. An improved fractal dimension (FD) approach, called multi-FD (M-FD), is proposed to extract multiple features from each denoised block. The number of extracted features is then reduced by a genetic algorithm. Five classifiers are trained and used with the artificial neural network (ANN) to classify the extracted features from each block. Lastly, a fusion process is performed on the results of the five blocks to obtain the final decision. The proposed approach is tested and evaluated on four benchmark mammogram image datasets (MIAS, DDSM, INbreast, and BCDR). We present the results of single- and double-dataset evaluations. Only one dataset is used for training and testing in the single-dataset evaluation, whereas two datasets (one for training and one for testing) are used in the double-dataset evaluation. The experimental results show that the proposed method yields better results on the INbreast dataset in the single-dataset evaluation, whilst better results are obtained on the remaining datasets in the double-dataset evaluation. The proposed approach outperforms other state-of-the-art models on the Mini-MIAS dataset.
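
The abstract above does not say which rule fuses the five block-level results. As a hedged illustration only, the sketch below assumes a simple majority vote over per-block labels; the function name `fuse_block_decisions` and the example labels are hypothetical.

```python
from collections import Counter


def fuse_block_decisions(block_labels):
    """Return the label predicted by most blocks (majority vote, assumed rule)."""
    if not block_labels:
        raise ValueError("at least one block decision is required")
    label, _count = Counter(block_labels).most_common(1)[0]
    return label


# Hypothetical per-block outputs for one mammogram split into five blocks.
decisions = ["malignant", "benign", "malignant", "malignant", "benign"]
print(fuse_block_decisions(decisions))  # malignant
```

With five blocks and two classes a tie cannot occur, which is one practical reason an odd number of voters is convenient for this kind of fusion.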


Author(s):  
Temitayo O. Oyegoke ◽  
Kehinde K. Akomolede ◽  
Adesola G. Aderounmu ◽  
Emmanuel R. Adagunodo

This study developed an e-mail classification model to preempt fraudulent activities. The ubiquity of e-mail makes it well suited to adoption by cyber-fraudsters. This research combined two datasets, the CLAIR fraudulent e-mail dataset and the Spambase dataset, to create the training and testing data. The CLAIR dataset consists of raw e-mails from users’ inboxes, which were pre-processed into structured form using Natural Language Processing (NLP) techniques. This dataset was then consolidated with the Spambase dataset into a single dataset. The study deployed the Multi-Layer Perceptron (MLP) architecture, trained with the back-propagation algorithm, for the fraud detection model. The model was simulated with 70% and 80% of the data used for training and the remaining 30% and 20%, respectively, used for testing. The performance of the models was compared using a number of evaluation metrics. The study concluded that the MLP yields an effective model for fraud detection in e-mail datasets.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Liqian Zhou ◽  
Qi Duan ◽  
Xiongfei Tian ◽  
He Xu ◽  
Jianxin Tang ◽  
...  

Background: Long noncoding RNAs (lncRNAs) have dense linkages with a plethora of important cellular activities. lncRNAs exert their functions by linking with corresponding RNA-binding proteins. Since experimental techniques for detecting lncRNA-protein interactions (LPIs) are laborious and time-consuming, a few computational methods have been reported for LPI prediction. However, computation-based LPI identification methods have the following limitations: (1) Most methods were evaluated on a single dataset, so researchers may fail to measure their generalization ability. (2) The majority of methods were validated under cross validation on lncRNA-protein pairs and did not investigate performance under other cross validations, especially cross validation on independent lncRNAs and independent proteins. (3) lncRNAs and proteins carry abundant biological information, and how to select informative features needs further investigation. Results: Under a hybrid framework (LPI-HyADBS) integrating feature selection based on AdaBoost with classification models including a deep neural network (DNN), extreme gradient boosting (XGBoost), and an SVM with a penalty coefficient of misclassification (C-SVM), this work focuses on finding new LPIs. First, five datasets are arranged. Each dataset contains lncRNA sequences, protein sequences, and an LPI network. Second, biological features of lncRNAs and proteins are acquired with Pyfeat. Third, the obtained features of lncRNAs and proteins are selected based on AdaBoost and concatenated to depict each LPI sample. Fourth, DNN, XGBoost, and C-SVM are used to classify lncRNA-protein pairs based on the concatenated features. Finally, a hybrid framework is developed to integrate the classification results from the above three classifiers. 
LPI-HyADBS is compared to six classical LPI prediction approaches (LPI-SKF, LPI-NRLMF, Capsule-LPI, LPI-CNNCP, LPLNP, and LPBNI) on five datasets under 5-fold cross validations on lncRNAs, proteins, lncRNA-protein pairs, and independent lncRNAs and independent proteins. The results show that LPI-HyADBS has the best LPI prediction performance under four different cross validations. In particular, LPI-HyADBS obtains better classification ability than the other six approaches on the constructed independent dataset. Case analyses suggest that there is relevance between ZNF667-AS1 and Q15717. Conclusions: Integrating a feature selection approach based on AdaBoost with three classification techniques (DNN, XGBoost, and C-SVM), this work develops a hybrid framework to identify new linkages between lncRNAs and proteins.
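
The difference between cross validation on lncRNA-protein pairs and cross validation on independent lncRNAs comes down to how the folds are drawn. The sketch below is an illustration under stated assumptions, not the LPI-HyADBS code: it splits by lncRNA identifier so that no lncRNA appears in both the training and the test side of a fold. The function name `grouped_k_fold` and the identifiers are hypothetical.

```python
import random


def grouped_k_fold(pairs, k=5, seed=0):
    """Yield (train, test) splits of (lncRNA, protein) pairs.

    Folds are formed over lncRNA identifiers, so every pair involving a
    held-out lncRNA lands in the test set and none leaks into training.
    """
    lncrnas = sorted({lnc for lnc, _protein in pairs})
    random.Random(seed).shuffle(lncrnas)
    folds = [set(lncrnas[i::k]) for i in range(k)]
    for held_out in folds:
        train = [p for p in pairs if p[0] not in held_out]
        test = [p for p in pairs if p[0] in held_out]
        yield train, test


# Hypothetical identifiers: 10 lncRNAs, each paired with 3 proteins.
pairs = [(f"lnc{i}", f"prot{j}") for i in range(10) for j in range(3)]
for train, test in grouped_k_fold(pairs, k=5):
    print(len(train), len(test))
```

Splitting on `p[1]` instead of `p[0]` would give the analogous split over independent proteins; a plain shuffle of `pairs` corresponds to cross validation on lncRNA-protein pairs.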


Mathematics ◽  
2021 ◽  
Vol 9 (20) ◽  
pp. 2599
Author(s):  
Diego Opazo ◽  
Sebastián Moreno ◽  
Eduardo Álvarez-Miranda ◽  
Jordi Pereira

Student dropout, defined as the abandonment of a higher education program before obtaining the degree and without re-enrollment, is a problem that affects every higher education institution in the world. This study applies machine learning models to data from two Chilean universities to predict dropout among enrolled first-year engineering students and to analyze the variables that affect the probability of dropout. The results show that, instead of combining the datasets into a single dataset, it is better to apply one model per university. Moreover, among the eight machine learning models tested on the datasets, gradient-boosting decision trees perform best. Further analyses of the interpretable models show that a higher score in almost any university entrance test decreases the probability of dropout, the most important variable being the mathematics test. One exception is the language test, where a higher score increases the probability of dropout.


2021 ◽  
pp. 1-7
Author(s):  
Aaron R. Kaufman ◽  
Aja Klevs

A single dataset is rarely sufficient to address a question of substantive interest. Instead, most applied data analysis combines data from multiple sources. Very rarely do two datasets contain the same identifiers with which to merge datasets; fields like name, address, and phone number may be entered incorrectly, missing, or in dissimilar formats. Combining multiple datasets absent a unique identifier that unambiguously connects entries is called the record linkage problem. While recent work has made great progress in the case where there are many possible fields on which to match, the much more uncertain case of only one identifying field remains unsolved: this fuzzy string matching problem, both its own problem and a component of standard record linkage problems, is our focus. We design and validate an algorithmic solution called Adaptive Fuzzy String Matching rooted in adaptive learning, and show that our tool identifies more matches, with higher precision, than existing solutions. Finally, we illustrate its validity and practical value through applications to matching organizations, places, and individuals.
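
To make the fuzzy string matching problem concrete, the sketch below links a name to its closest candidate using Python's standard `difflib.SequenceMatcher`. It is a plain similarity baseline, not the Adaptive Fuzzy String Matching method described above; the helper names, the normalization steps, the 0.8 threshold, and the example names are all assumptions.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop periods and commas, and collapse whitespace."""
    cleaned = name.lower().replace(".", "").replace(",", "")
    return " ".join(cleaned.split())


def best_match(query, candidates, threshold=0.8):
    """Return (candidate, score) for the most similar candidate.

    The candidate is None when no score reaches the threshold
    (an assumed cut-off, to be tuned on labeled pairs).
    """
    if not candidates:
        return None, 0.0
    target = normalize(query)
    scored = [
        (SequenceMatcher(None, target, normalize(c)).ratio(), c)
        for c in candidates
    ]
    score, candidate = max(scored, key=lambda item: item[0])
    return (candidate, score) if score >= threshold else (None, score)


# Hypothetical organization names from a second dataset.
names = ["ACME Corp", "Apex Corporation", "Acme Holdings"]
print(best_match("Acme Corp.", names))
print(best_match("Zenith Labs", names))
```

A fixed threshold like this is exactly the kind of hand-tuned choice that a learned matcher is meant to replace, so the sketch is best read as the starting point the abstract argues against.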


2021 ◽  
Author(s):  
Toni Viskari ◽  
Janne Pusa ◽  
Istem Fer ◽  
Anna Repo ◽  
Julius Vira ◽  
...  

Soil Organic Carbon (SOC) models are important tools for determining global SOC distributions and how carbon stocks are affected by climate change. Their performance is, however, affected by the data and methods used to calibrate them. Here we study how the Yasso SOC model performs when calibrated with individual datasets or with multiple datasets, and how the chosen calibration method affects the parameter estimation. We found that, when calibrated with multiple datasets, the model showed better global performance than with a single-dataset calibration. Furthermore, our results show that more advanced calibration algorithms should be used for SOC models because of the multiple local maxima in the likelihood space.


Information ◽  
2021 ◽  
Vol 12 (10) ◽  
pp. 392
Author(s):  
Sinead A. Williamson ◽  
Jette Henderson

Understanding how two datasets differ can help us determine whether one dataset under-represents certain sub-populations, and provides insights into how well models will generalize across datasets. Representative points selected by a maximum mean discrepancy (MMD) coreset can provide interpretable summaries of a single dataset, but are not easily compared across datasets. In this paper, we introduce dependent MMD coresets, a data summarization method for collections of datasets that facilitates comparison of distributions. We show that dependent MMD coresets are useful for understanding multiple related datasets and understanding model generalization between such datasets.
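
For readers unfamiliar with the quantity underlying this abstract, the sketch below estimates the squared maximum mean discrepancy (MMD) between two small samples with a Gaussian kernel. It illustrates only the discrepancy itself, not coreset selection or the dependent MMD coresets the paper introduces; the kernel bandwidth `gamma`, the function names, and the toy points are assumptions.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two equal-length numeric vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd_squared(xs, ys, gamma=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    def mean_kernel(a, b):
        total = sum(rbf_kernel(p, q, gamma) for p in a for q in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Toy two-dimensional samples (assumed): one cluster near the origin, one near (2, 2).
dataset_a = [(0.0, 0.0), (0.1, -0.1), (-0.2, 0.1)]
dataset_b = [(2.0, 2.0), (2.1, 1.9), (1.8, 2.2)]
print(mmd_squared(dataset_a, dataset_a))  # identical samples give 0.0
print(mmd_squared(dataset_a, dataset_b))  # well-separated samples give a larger value
```

An MMD coreset picks a small set of points whose discrepancy to the full dataset is low, which is why the selected points can serve as a summary of that dataset.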


AI ◽  
2021 ◽  
Vol 2 (3) ◽  
pp. 366-382
Author(s):  
Zhihan Xue ◽  
Tad Gonsalves

Research on autonomous obstacle avoidance for drones has recently received widespread attention. An increasing number of researchers are using machine learning to train drones, typically adopting supervised learning or reinforcement learning to train the networks. Supervised learning has the disadvantage that building the datasets takes a significant amount of time, because it is difficult to cover the complex and changeable drone flight environment in a single dataset. Reinforcement learning can overcome this problem by letting drones learn from data gathered in the environment. However, current research based on reinforcement learning focuses mainly on discrete action spaces. As a result, the movement of the drones lacks precision and their flying behavior is somewhat unnatural. This study aims to use the soft actor-critic algorithm to train a drone to perform autonomous obstacle avoidance in a continuous action space using only image data. The algorithm is trained and tested in a simulation environment built with AirSim. The results show that our algorithm enables the UAV to avoid obstacles in the training environment using only the depth map as input. Moreover, it also achieves a higher obstacle avoidance rate in the reconfigured environment without retraining.

