Automated Detection of Records in Biological Sequence Databases that are Inconsistent with the Literature

2017 ◽  
Author(s):  
Mohamed Reda Bouadjenek ◽  
Karin Verspoor ◽  
Justin Zobel

Abstract: We investigate and analyse the data quality of nucleotide sequence databases with the objective of automatically detecting data anomalies and suspicious records. Specifically, we demonstrate that the published literature associated with each data record can be used to automatically evaluate its quality, by cross-checking the consistency of the key content of the database record with the referenced publications. Focusing on GenBank, we describe a set of quality indicators based on the relevance paradigm of information retrieval (IR). Then, we use these quality indicators to train an anomaly detection algorithm to classify records as “confident” or “suspicious”. Our experiments on the PubMed Central collection show that assessing the coherence between the literature and database records, through our algorithms, is an effective mechanism for assisting curators to perform data cleansing. Although fewer than 0.25% of the records in our data set are known to be faulty, we would expect that there are many more in GenBank that have not yet been identified. By automated comparison with the literature, they can be identified with a precision of up to 10% and a recall of up to 30%, while strongly outperforming several baselines. While these results leave substantial room for improvement, they reflect both the very imbalanced nature of the data and the limited explicitly labelled data that is available. Overall, the obtained results show promise for the development of a new kind of approach to detecting low-quality and suspicious sequence records based on literature analysis and consistency. From a practical point of view, this will greatly help curators in identifying inconsistent records in large-scale sequence databases by highlighting records that are likely to be inconsistent with the literature.
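The reported figures can be put in context with a short worked example. The sketch below assumes a hypothetical collection of 100,000 records of which exactly 0.25% are known to be faulty, and assumes, purely for illustration, that the upper bounds of 10% precision and 30% recall are reached at the same time (the abstract does not state this). All counts and names are illustrative and are not taken from the paper.

```python
def precision(tp, fp):
    """Fraction of flagged records that are truly faulty."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of truly faulty records that were flagged."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical collection: 100,000 records, 0.25% of them faulty.
total_records = 100_000
faulty_records = 250

# Hypothetical detector output chosen to match 10% precision and 30% recall.
true_positives = 75                                # faulty records that were flagged
false_positives = 675                              # sound records that were flagged
false_negatives = faulty_records - true_positives  # faulty records that were missed

p = precision(true_positives, false_positives)
r = recall(true_positives, false_negatives)

# Base rate: the chance that a randomly drawn record is faulty.
base_rate = faulty_records / total_records

# Lift: how much more likely a flagged record is to be faulty than a random one.
lift = p / base_rate

print(f"precision={p:.2f} recall={r:.2f} base_rate={base_rate:.4f} lift={lift:.1f}")
```

Under these assumptions, a flagged record is about 40 times more likely to be faulty than a record drawn at random, which is why a precision of 10% can still save a curator considerable effort on such imbalanced data.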

2017 ◽  
Author(s):  
Mohamed Reda Bouadjenek ◽  
Karin Verspoor ◽  
Justin Zobel

Abstract: Bioinformatics sequence databases such as GenBank or UniProt contain hundreds of millions of records of genomic data. These records are derived from direct submissions from individual laboratories, as well as from bulk submissions from large-scale sequencing centres; their diversity and scale mean that they suffer from a range of data quality issues including errors, discrepancies, redundancies, ambiguities, incompleteness, and inconsistencies with the published literature. In this work, we seek to investigate and analyze the data quality of sequence databases from the perspective of a curator, who must detect anomalous and suspicious records. Specifically, we emphasize the detection of records that are inconsistent with the literature. Focusing on GenBank, we propose a set of 24 quality indicators, which are based on treating a record as a query into the published literature and then applying query quality predictors. We then carry out an analysis that shows that the proposed quality indicators and the quality of the records have a mutual relationship, in which one depends on the other. We propose to represent record-literature consistency as a vector of these quality indicators. By reducing the dimensionality of this representation for visualization purposes using Principal Component Analysis, we show that records which have been reported as inconsistent with the literature fall roughly in the same area, and therefore share similar characteristics. By manually analyzing records not previously known to be erroneous that fall in the same area as records known to be inconsistent, we show that 1 record out of 4 is inconsistent with respect to the literature. This high density of inconsistent records opens the way towards the development of automatic methods for the detection of faulty records. We conclude that literature inconsistency is a meaningful strategy for identifying suspicious records.
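The idea of treating a record as a query into the literature can be illustrated with a very small sketch. It is not one of the 24 indicators proposed by the authors, which the abstract does not list; it is a plain term-frequency cosine similarity, used here as a stand-in for a single consistency score. The record definition and article texts are invented for illustration.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lower-case the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def consistency_score(record_text, article_text):
    """Use the record as a query and score it against one article."""
    return cosine_similarity(term_counts(record_text), term_counts(article_text))


# Hypothetical record definition line and two hypothetical article texts.
record = "Homo sapiens BRCA1 gene, complete cds"
matching_article = "We sequenced the complete BRCA1 gene in Homo sapiens samples."
unrelated_article = "Soil fertilizer trials improved sugar beet yield."

score_matching = consistency_score(record, matching_article)
score_unrelated = consistency_score(record, unrelated_article)

print(f"matching={score_matching:.2f} unrelated={score_unrelated:.2f}")
```

A record whose score against its own cited article is as low as the score against an unrelated article would be a candidate for closer inspection. A vector of several such scores per record is the kind of representation that could then be projected with Principal Component Analysis.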


Author(s):  
Petra Nováková

The aim of the work was to evaluate the water quality of the Vranov nad Dyjí water reservoir. Fresh water was sampled at five locations in the reservoir (three important tributaries, the dam and the water captation locality). The ten most essential water quality indicators were selected. In terms of the completeness of the water quality indicators, the most complete set of samples was taken at the water captation locality (period 1984–2002). At the other locations, data from the 1980s were missing, but the volume of data was sufficient for statistical processing. Correlation analyses were carried out for the individual locations and dimensions, and determination coefficients were computed for all localities for the period 1994–2002. The results demonstrate that the water captation is very well sited with respect to the water flow. Multiple and factor analysis was carried out for the period 1984–2002 at the Jelení zátoka locality, where the water captation facility is situated. The analysis yielded nine factors that influence the water quality of the reservoir, of which the three most important were interpreted. The analyses and results are part of my Ph.D. thesis. The results will be used for further evaluations of the water quality in the reservoir and its tributaries, for activities in the catchment area and for preparing proposals for additional second-level protection zones.


2021 ◽  
Vol 10 (9) ◽  
pp. 144-147
Author(s):  
Huiling LI ◽  
Xuan SU ◽  
Shuaipeng ZHANG

Massive amounts of business process event logs are collected and stored by modern information systems. Model discovery aims to discover a process model from such event logs; however, most of the existing approaches still suffer from low efficiency when facing large-scale event logs. Event log sampling techniques provide an effective way to improve the efficiency of process discovery, but the existing techniques still cannot guarantee the quality of the mined model. Therefore, a sampling approach based on a set coverage algorithm, named the set coverage sampling approach, is proposed. The proposed sampling approach has been implemented in the open-source process mining toolkit ProM. Furthermore, experiments on a real event log data set, covering conformance checking and time performance analysis, show that the proposed event log sampling approach can greatly improve the efficiency of log sampling while preserving the quality of the mined model.
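The abstract does not spell out the set coverage algorithm. As a rough illustration of the general idea, the sketch below applies the classic greedy set-cover heuristic to an event log. It assumes, as a simplification not stated in the source, that the behaviour to be covered is the set of directly-follows pairs of activities, and it keeps picking the trace variant that covers the most still-uncovered pairs. Function names and the example log are hypothetical.

```python
def directly_follows(trace):
    """Set of (activity, next_activity) pairs that occur in one trace."""
    return set(zip(trace, trace[1:]))


def sample_by_set_cover(log):
    """Greedy set cover: pick trace variants until every directly-follows
    pair observed in the log is covered by at least one sampled trace."""
    # Collapse duplicate traces into variants, keeping first-seen order.
    variants = list(dict.fromkeys(tuple(trace) for trace in log))
    uncovered = set().union(*(directly_follows(v) for v in variants))
    sample = []
    while uncovered:
        # Choose the variant that covers the most still-uncovered pairs.
        best = max(variants, key=lambda v: len(directly_follows(v) & uncovered))
        sample.append(best)
        uncovered -= directly_follows(best)
    return sample


# Hypothetical event log: each trace is the activity sequence of one case.
event_log = [
    ["a", "b", "c", "d"],
    ["a", "c", "b", "d"],
    ["a", "b", "c", "d"],
    ["a", "b", "d"],
]

sampled_traces = sample_by_set_cover(event_log)
print(sampled_traces)
```

In this example, two of the four traces already cover every directly-follows pair in the log, so a discovery algorithm that relies only on those pairs would see the same relations in the sample as in the full log.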


Author(s):  
Sajad Badalkhani ◽  
Ramazan Havangi ◽  
Mohsen Farshad

There is an extensive literature on multi-robot simultaneous localization and mapping (MRSLAM). In most of this research, the environment is assumed to be static, although the dynamic parts of the environment degrade the estimation quality of SLAM algorithms and lead to inherently fragile systems. To enhance the performance and robustness of SLAM in dynamic environments (SLAMIDE), a novel cooperative approach named parallel-map (p-map) SLAM is introduced in this paper. The objective of the proposed method is to deal with the dynamics of the environment by detecting dynamic parts and preventing their inclusion in SLAM estimations. In this approach, each robot builds a limited map of its own vicinity, while the global map is built through a hybrid centralized MRSLAM. The restricted size of the local maps bounds the computational complexity and resources needed to handle a large-scale dynamic environment. Using a probabilistic index, the proposed method differentiates between stationary and moving landmarks, based on their positions relative to other parts of the environment. Stationary landmarks are then used to refine a consistent map. The proposed method is evaluated at different levels of dynamism, and for each level the performance is measured in terms of accuracy, robustness, and the hardware resources needed for implementation. The method is also evaluated on a publicly available real-world data set. Experimental validation along with simulations indicates that the proposed method is able to perform consistent SLAM in a dynamic environment, suggesting its feasibility for MRSLAM applications.
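The probabilistic index itself is not defined in the abstract. The sketch below only illustrates the underlying intuition, under assumptions of my own: a landmark is scored by the fraction of other landmarks to which its distance stays the same (within a tolerance) between two observations, so a landmark that moves loses consistency with all of its neighbours. The function name, the tolerance and the coordinates are hypothetical.

```python
import math


def stationarity_scores(before, after, tol=0.1):
    """For each landmark, return the fraction of other landmarks whose
    distance to it changed by at most `tol` between the two observations."""
    names = list(before)
    scores = {}
    for a in names:
        others = [b for b in names if b != a]
        consistent = sum(
            1
            for b in others
            if abs(math.dist(before[a], before[b]) - math.dist(after[a], after[b])) <= tol
        )
        # A lone landmark has nothing to be compared with; treat it as stationary.
        scores[a] = consistent / len(others) if others else 1.0
    return scores


# Hypothetical landmark positions at two time steps; only "D" moves.
positions_t0 = {"A": (0.0, 0.0), "B": (4.0, 0.0), "C": (0.0, 3.0), "D": (4.0, 3.0)}
positions_t1 = {"A": (0.0, 0.0), "B": (4.0, 0.0), "C": (0.0, 3.0), "D": (6.0, 5.0)}

scores = stationarity_scores(positions_t0, positions_t1)

# Keep landmarks that agree with more than half of the others.
stationary = sorted(name for name, score in scores.items() if score > 0.5)
print(scores, stationary)
```

Because only relative distances are compared, the score does not depend on the robot's own pose estimate, which is one reason relative positions are attractive for this kind of test.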


2021 ◽  
Vol 03 (01) ◽  
pp. 40-46
Author(s):  
Hasil Cəmil oğlu Bağırov ◽  
Vüqar İmanəli oğlu Cəfərov ◽  
Arzu Vidadi qızı Həşimova ◽  
Rəşidə Elşən qızı Şükürova ◽  
...  

Without knowing the main quality indicators of agricultural products, it is impossible to draw conclusions about the effectiveness of any given agro-technical measure. One of the factors influencing the quality of sugar beet and watermelon is the effective application of fertilizers. Fertilizers raise both the yield and the quality indicators of the product. From this point of view, the effect of organic and mineral fertilizers on the quality indicators of sugar beet and watermelon grown in the meadow-gray soils of the Mugan-Salyan region was studied. The combined application of organic and mineral fertilizers had a positive effect on the quality indicators of sugar beet and watermelon.
Key words: organic and mineral fertilizers, sugar beet, watermelon, phosphorus, potassium, productivity, soil, quality


2020 ◽  
Vol 17 (35) ◽  
pp. 791-799
Author(s):  
Rafael ISMAGILOV ◽  
Ilgiz ASYLBAEV ◽  
Nuriya URAZBAKHTINA ◽  
Denis ANDRIYANOV ◽  
Firdavis AVSAKHOV

Potatoes are a very important food crop throughout the world. Viral infections are one of the main causes of poor-quality planting material, low yields and poor-quality potatoes. The use of virus-free seed material is one of the most promising ways to increase the yield and efficiency of potato production. Aeroponics is a promising direction for obtaining a virus-protected crop. This study aimed to assess the potential of, and improve the technology for, growing healthy potato mini-tubers using the aeroponic method, which is safe and economical. Compared with the usual method of growing crops, aeroponics requires less water and energy per unit of production, excludes soil-borne plant diseases and prevents pest damage to the tubers. Artificial conditions, such as additional lighting in greenhouses, can easily be provided for growing different crop varieties in different regions. In this study, economic calculations have shown that, from a practical point of view, aeroponics technology may be appropriate for large-scale production of seed potatoes.


Author(s):  
Gediminas Merkys ◽  
Daiva Bubeliene ◽  
Nijole Čiučiulkienė

The research paper presents the results of a large-scale longitudinal study which aims to highlight social problems in pre-school education with the help of social indicators. For over a decade, the authors have been developing a survey inventory for measuring the population's satisfaction with public services as an index. The tool includes 190 original survey indicators that represent all public services. Twenty indicators are devoted to education; two of them represent pre-school education: 1) an assessment of the quality of pre-school services; and 2) the availability of a kindergarten place for a child in the residential area. The existing statistical norming base (not older than 2 years) includes 12 municipalities in Lithuania and 88 subdistricts. The total number of respondents is 16,202 (n = 16,202). The results show that residents consider the quality of the service to be "high" but its "availability" to be poor. This statistical regularity is common to all surveyed municipalities. There is significant dispersion of the measured indicators across individual municipalities and subdistricts. Although the "availability" of the service tends to be evaluated negatively, some municipalities handle the problem better than others. For this reason, their experience is worth analysing and disseminating more widely. It is also worth mentioning that the results of this study have much in common with EUROSTAT data. In Lithuania, the inclusion of 2-3-year-old children in the education system is extremely poor, whereas the inclusion of preschoolers is largely universal. The poor inclusion of 2-3-year-old children in the Lithuanian education system can be related to problems of Lithuanian social policy. In Lithuania, the mother (or father) receives financial benefits for two years after the birth of a child. It is also possible to keep one's job, without pay, for one more year.
From the point of view of women's employment and equal opportunities policies, the regularity we found points to social policy dysfunctions at the macro (national) level which, in turn, indicate a deep-seated demographic crisis in an EU country.


2001 ◽  
Vol 2 (4) ◽  
pp. 196-206 ◽  
Author(s):  
Christian Blaschke ◽  
Alfonso Valencia

The Dictionary of Interacting Proteins (DIP) (Xenarios et al., 2000) is a large repository of protein interactions: its March 2000 release included 2379 protein pairs whose interactions have been detected by experimental methods. Even if many of these correspond to poorly characterized proteins, the result of massive yeast two-hybrid screenings, as many as 851 correspond to interactions detected using direct biochemical methods. We used information retrieval technology to search automatically for sentences in Medline abstracts that support these 851 DIP interactions. Surprisingly, we found correspondence between DIP protein pairs and Medline sentences describing their interactions in only 30% of the cases. This low coverage has interesting consequences regarding the quality of annotations (references) introduced in the database and the limitations of the application of information extraction (IE) technology to Molecular Biology. It is clear that the limitation of analyzing abstracts rather than full papers and the lack of standard protein names are difficulties of considerably more importance than the limitations of the IE methodology employed. A positive finding is the capacity of the IE system to identify new relations between proteins, even in a set of proteins previously characterized by human experts. These identifications are made with a considerable degree of precision. This is, to our knowledge, the first large-scale assessment of IE capacity to detect previously known interactions: we thus propose the use of the DIP data set as a biological reference to benchmark IE systems.
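The coverage measurement described above can be sketched in a few lines. This is a deliberately naive co-occurrence search, not the IE system used by the authors: a sentence counts as supporting a pair simply if it mentions both names, and coverage is the fraction of pairs with at least one such sentence. The protein names and abstract texts are placeholders invented for illustration.

```python
import re


def split_sentences(text):
    """Very rough sentence splitter: break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def supporting_sentences(text, protein_a, protein_b):
    """Sentences of `text` that mention both protein names as whole words."""
    pattern_a = re.compile(r"\b" + re.escape(protein_a) + r"\b", re.IGNORECASE)
    pattern_b = re.compile(r"\b" + re.escape(protein_b) + r"\b", re.IGNORECASE)
    return [
        sentence
        for sentence in split_sentences(text)
        if pattern_a.search(sentence) and pattern_b.search(sentence)
    ]


def coverage(pairs, abstracts):
    """Fraction of protein pairs supported by at least one sentence."""
    if not pairs:
        return 0.0
    supported = sum(
        1
        for a, b in pairs
        if any(supporting_sentences(text, a, b) for text in abstracts)
    )
    return supported / len(pairs)


# Placeholder abstracts and protein pairs (not real data).
abstracts = [
    "ProtA binds ProtB in vitro. ProtC was not examined.",
    "ProtD is a kinase.",
]
protein_pairs = [("ProtA", "ProtB"), ("ProtC", "ProtD")]

pair_coverage = coverage(protein_pairs, abstracts)
print(pair_coverage)
```

The second pair is never mentioned within a single sentence, so it goes unsupported even though both names appear in the collection. Exact name matching of this kind also misses synonyms, which is consistent with the abstract's remark that the lack of standard protein names is a major obstacle.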


2017 ◽  
Vol 9 (3) ◽  
pp. 349-360 ◽  
Author(s):  
Leonardo (Don) A.N. Dioko ◽  
Amy S.I. So

Purpose: The purpose of this study is to propose a destination-level framework incorporating subjective and overall assessments of residents’ quality of life (QOL) and visitors’ quality of experience (QOE) as a means for managing optimum levels of visitor volume at destinations. Design: The proposed framework is empirically tested and applied using a large-scale survey of residents and visitors across a four-year time span in Macao, a Special Administrative Region of China that counts among the smallest and densest city-states in the world and which has borne the full force of extraordinarily rapid tourism growth in recent years. Findings: The study’s findings suggest that subjective assessments of residents’ QOL and visitors’ QOE interact and must be considered together when assessing sustainable levels of tourism at the level of a destination. Originality: The study’s value lies in its use of a large-scale survey across a four-year time span to empirically validate theorized maximal values of QOL assessments from the point of view of residents as well as quality of visiting experience from the point of view of visitors. This finding lays future groundwork for more robust management of tourism growth in destinations.

