Selecting Clustering Algorithms for IBD Mapping

2021 ◽  
Author(s):  
Ruhollah Shemirani ◽  
Gillian M Belbin ◽  
Keith Burghardt ◽  
Kristina Lerman ◽  
Christy L Avery ◽  
...  

Background: Groups of distantly related individuals who share a short segment of their genome identical-by-descent (IBD) can provide insights about rare traits and diseases in massive biobanks via a process called IBD mapping. Clustering algorithms play an important role in finding these groups. We set out to analyze the suitability of commonly used, fast, and scalable clustering algorithms for IBD mapping applications. We designed a realistic benchmark for local IBD graphs and used it to compare clustering algorithms in terms of statistical power. We also investigated the effectiveness of common clustering metrics as replacements for statistical power. Results: We simulated 3.4 million clusters across 850 experiments with varying cluster counts, false-positive rates, and false-negative rates. The Infomap and Markov Clustering (MCL) community detection methods have higher statistical power in most of the graphs than greedy methods such as Louvain and Leiden. We demonstrate that standard clustering metrics, such as modularity, cannot predict the statistical power of algorithms in IBD mapping applications, though they can help with simulating realistic benchmarks. We extend our findings to real datasets by analyzing three populations in the Population Architecture using Genomics and Epidemiology (PAGE) Study, with ~51,000 members and 2 million shared segments on Chromosome 1, yielding ~39 million local IBD clusters across the three populations. We used cluster properties derived in PAGE to increase the accuracy of our simulations and comparison. Conclusions: Markov Clustering produces a 30% increase in statistical power compared to the current state-of-the-art approach while reducing runtime by 3 orders of magnitude, making it computationally tractable in modern large-scale genetic datasets. We provide an efficient implementation to enable clustering at scale for IBD mapping and population-based linkage for various populations and scenarios.
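To make the Markov Clustering step named above more concrete, the following is a minimal pure-Python sketch of the generic MCL loop (expansion, inflation, column normalization) on a small dense graph. It is not the authors' efficient implementation: the function names, the inflation value of 2.0, the convergence tolerance, and the toy graph are all assumptions chosen for illustration.

```python
def matmul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def normalize_columns(m):
    """Rescale every column so that it sums to 1 (column-stochastic)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def mcl(adjacency, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster an undirected graph given as a dense 0/1 adjacency matrix.

    inflation, max_iter and tol are illustrative defaults (assumptions).
    """
    n = len(adjacency)
    # Self-loops keep every column non-empty before normalization.
    m = normalize_columns(
        [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    )
    for _ in range(max_iter):
        expanded = matmul(m, m)  # expansion: two-step random walks
        inflated = normalize_columns(
            [[v ** inflation for v in row] for row in expanded]
        )  # inflation: strengthen strong flows, weaken weak ones
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Each row that still carries flow defines a cluster: the columns it covers.
    clusters = {frozenset(j for j, v in enumerate(row) if v > 1e-6) for row in m}
    return sorted(sorted(c) for c in clusters if c)


# Toy "local IBD graph": two groups of three individuals, no edges between them.
triangles = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(mcl(triangles))  # → [[0, 1, 2], [3, 4, 5]]
```

In a realistic setting the graph would be sparse and would contain false-positive and false-negative edges, which is where the choice of algorithm and inflation parameter starts to matter; a dense list-of-lists matrix like this one is only workable for toy inputs.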

2021 ◽  
Author(s):  
Pankaj Kumar ◽  
Abhay Kumar ◽  
Kamal Sarma ◽  
Paresh Sharma ◽  
Rashmi Rekha Kumari ◽  
...  

A novel, rapid, and specific multiplex polymerase chain reaction (PCR) has been developed for diagnosing infection of bovine blood by three of the most common hemoparasites. The reported method relied on the detection of the three different bovine hemoparasites isolated from red blood cells (RBCs) of cattle by conventional Giemsa-stained blood smear (GSBS) and confirmed by multiplex PCR. The designed multiplex primer sets can amplify 205, 313, and 422 bp fragments of the apocytochrome b, sporozoite and macroschizont 2 (spm2), and 16S rRNA genes for Babesia bigemina, Theileria annulata, and Anaplasma marginale, respectively. This multiplex PCR was sensitive, with the ability to detect the presence of 150 ng of genomic DNA. The primers used in this multiplex PCR also showed highly specific amplification of the target gene fragments of each respective parasite DNA, without non-specific or non-target PCR products. This multiplex PCR system was used to diagnose GSBS-confirmed blood samples (N=12) found infected or co-infected with hemoparasites. A comparison of the two detection methods revealed that 58.33% of specimens showed concordant diagnoses with both techniques. The specificity, positive predictive value, and kappa coefficient of agreement were highest for diagnosis of B. bigemina and lowest for A. marginale. The overall kappa coefficient for diagnosis of multiple pathogens based on GSBS compared to multiplex PCR was 0.56, slightly below the agreement threshold of 0.6. Therefore, confirmation should always be made by PCR to rule out false positives due to differences in subjective observations and stain particles, and false negatives due to low levels of parasitaemia. The simplicity and rapidity of this specific multiplex PCR method make it suitable for large-scale epidemiological studies and for follow-up of drug treatments.
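The agreement statistics quoted above can be illustrated with a short sketch. The concordance figure is consistent with 7 of 12 samples agreeing (7/12 = 58.33%); that count is inferred from the reported percentage, not stated in the abstract. The 2×2 table passed to `cohen_kappa` is entirely hypothetical and only shows how a two-category kappa is computed; the abstract's overall kappa of 0.56 covers multiple pathogens and is not reproduced here.

```python
def cohen_kappa(both_pos, only_a, only_b, both_neg):
    """Cohen's kappa for two binary tests (A and B) on the same samples."""
    n = both_pos + only_a + only_b + both_neg
    observed = (both_pos + both_neg) / n          # observed agreement
    a_pos = both_pos + only_a                     # positives by test A
    b_pos = both_pos + only_b                     # positives by test B
    # Agreement expected by chance from the marginal totals.
    expected = (a_pos * b_pos + (n - a_pos) * (n - b_pos)) / n ** 2
    return (observed - expected) / (1 - expected)


# Concordance: 7 of 12 is the count consistent with the reported 58.33%.
concordance = 100 * 7 / 12
print(f"{concordance:.2f}%")  # → 58.33%

# Hypothetical table for one pathogen (smear vs. PCR), NOT data from the study.
kappa = cohen_kappa(both_pos=5, only_a=2, only_b=1, both_neg=4)
print(kappa)  # → 0.5
```

Note that the function divides by `1 - expected`, so it is undefined when chance agreement is exactly 1 (for example, when both tests call every sample positive).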


2017 ◽  
Vol 15 (06) ◽  
pp. 1740006 ◽  
Author(s):  
Mohammad Arifur Rahman ◽  
Nathan LaPierre ◽  
Huzefa Rangwala ◽  
Daniel Barbara

Metagenomics is the collective sequencing of co-existing microbial communities, which are ubiquitous across various clinical and ecological environments. Due to the large volume of random short sequences (reads) obtained from community sequencing, analyzing the diversity, abundance, and functions of different organisms within these communities is challenging. We present a fast and scalable clustering algorithm for analyzing large-scale metagenome sequence data. Our approach achieves efficiency by partitioning the large number of sequence reads into groups (called canopies) using hashing. These canopies are then refined using state-of-the-art sequence clustering algorithms. This canopy-clustering (CC) algorithm can be used as a pre-processing phase for computationally expensive clustering algorithms. We use and compare three hashing schemes for canopy construction with five popular and state-of-the-art sequence clustering methods. We evaluate our clustering algorithm on synthetic and real-world 16S and whole-metagenome benchmarks. We demonstrate the ability of our proposed approach to determine meaningful Operational Taxonomic Units (OTUs) and observe significant speedup with regard to run time when compared to different clustering algorithms. We also make our source code publicly available on GitHub.
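As a rough illustration of the canopy idea described above, the sketch below partitions reads into buckets using a cheap signature, after which an expensive clustering method would only need to run inside each bucket. The signature used here (the lexicographically smallest k-mer of a read) is an assumption for illustration and is not claimed to be one of the three hashing schemes the abstract refers to; the function names and toy reads are likewise hypothetical.

```python
from collections import defaultdict


def smallest_kmer(read, k):
    """Return the lexicographically smallest k-mer; short reads key on themselves."""
    if len(read) < k:
        return read
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def build_canopies(reads, k=4):
    """Group reads that share the same signature into one canopy."""
    canopies = defaultdict(list)
    for read in reads:
        canopies[smallest_kmer(read, k)].append(read)
    return dict(canopies)


# Toy reads (hypothetical); real inputs would be millions of sequences.
reads = ["ACGTACGT", "TTACGTTT", "GGGGCCCC", "TCCCCGGA"]
canopies = build_canopies(reads, k=4)
for key, members in sorted(canopies.items()):
    print(key, members)
# → ACGT ['ACGTACGT', 'TTACGTTT']
# → CCCC ['GGGGCCCC', 'TCCCCGGA']
```

The benefit of this design is that building canopies takes a single pass over the reads, while the costly pairwise comparisons are confined to reads that landed in the same canopy.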


Blood ◽  
2004 ◽  
Vol 104 (11) ◽  
pp. 1578-1578
Author(s):  
Luanne L. Peters ◽  
Orah S. Platt ◽  
Karen L. Svenson ◽  
Beverly J. Paigen ◽  
Gary A. Churchill ◽  
...  

Abstract Identifying the genes and gene products relevant to physiological systems and creating opportunities to elucidate their function are essential first steps in understanding the pathophysiology of disease. To dissect the genetic variation underlying hematopoietic, cardiovascular, lung, and sleep dysfunction, we established a Center for Mouse Models of Heart, Lung, Blood and Sleep (HLBS) Disorders at The Jackson Laboratory as part of the NHLBI Program for Genomic Applications (PGA). The major goal of the JAX PGA is to enable researchers to link both single-gene mutations and quantitative trait loci (QTL) to gene function and disease. To achieve this goal, we are generating new mutations in mice by chemical (ENU) mutagenesis, and characterizing the common inbred mouse strains to detect existing genetic variation. Here, we report an extensive body of hematologically relevant strain characterization data and the establishment of new animal models. All strain characterization data is deposited into the Mouse Phenome Database (MPD, http://www.jax.org/phenome), also accessible via the JAX PGA website (http://pga.jax.org). Data for up to 48 inbred strains are currently available and include complete blood counts and coagulation profiles (PT, aPTT, fibrinogen). These data allow investigators to identify the most appropriate strains for (a) physiological testing; (b) drug development; (c) progenitors in QTL crosses; (d) sensitized mutagenesis screens; and (e) direct hypothesis testing. For example, to maximize the potential for successful QTL identification, parental strains that differ substantially in the phenotype of interest, at least 2 standard deviations (SD), should be selected. We used our strain survey data to select parental strains for identification of QTL for baseline WBC count, an important risk factor for sickle cell disease severity. 
The strains C57BLKS/J and SM/J have WBC counts of 12.6 ± 1.6 and 3.3 ± 0.8 × 10³/μL, respectively, a difference much greater than 2 SD, indicating high statistical power. We identified a highly significant QTL (LOD = 7) on chromosome 1 in an initial genome-wide scan of 279 F2 animals. Moreover, the availability of extensive phenotypic data across the inbred strains, in conjunction with saturated SSLP and SNP maps, has allowed us to identify QTL in silico. As an example of the utility of the MPD in hypothesis testing, a modifier gene associated with decreased VWF levels is present in 5 of the 6 MPD strains showing the highest aPTT levels (see abstract by Johnsen et al). In total, 44 different phenotypic projects, each consisting of large datasets, can be freely accessed through the MPD. The JAX PGA mutagenesis effort in C57BL/6J mice has likewise yielded valuable resources. Nearly 100 new mutant strains are in various stages of development, including strains with phenotypes of interest to the hematology community (e.g., anemia, thrombocytopenia, leukopenia, leukocytosis). These animal models and all other JAX PGA resources (protocols, software, QTL locations) are freely available to the scientific community.
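The parental-strain selection rule above (a difference of at least 2 SD in the phenotype) can be checked with a few lines of arithmetic. This sketch assumes the "±" values are standard deviations, which the abstract implies but does not state, and it divides by the larger of the two SDs as a conservative choice; the helper name is hypothetical.

```python
def separation_in_sd(mean_a, sd_a, mean_b, sd_b):
    """Difference between two strain means, in units of the larger SD."""
    return abs(mean_a - mean_b) / max(sd_a, sd_b)


# WBC counts (x 10^3 per microliter) quoted for C57BLKS/J and SM/J.
separation = separation_in_sd(12.6, 1.6, 3.3, 0.8)
print(f"{separation:.2f} SD")  # → 5.81 SD
print(separation >= 2)         # → True
```

Even with the larger SD in the denominator, the two strains are separated by almost 6 SD, well above the 2 SD rule of thumb.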


2021 ◽  
Author(s):  
Zhiyu Hao ◽  
Jin Gao ◽  
Yuxin Song ◽  
Runqing Yang ◽  
Di Liu

Abstract Among linear mixed model-based association methods, GRAMMAR has the lowest computing complexity for association tests, but it produces a high false-negative rate due to the deflation of test statistics under complex population structure. Here, we present Optim-GRAMMAR, a GRAMMAR method optimized by efficient genomic control, which estimates the phenotype residuals by regulating genomic heritability downward in the genomic best linear unbiased prediction. Even when using fewer sampled markers to evaluate genomic relationship matrices and genomic controls, Optim-GRAMMAR retains statistical power similar to that of exact mixed model association analysis, making it an extremely efficient approach for handling large-scale data. Moreover, joint association analysis significantly improved statistical power over existing methods.


2020 ◽  
Vol 21 ◽  
Author(s):  
Yin-xue Wang ◽  
Yi-xiang Wang ◽  
Yi-ke Li ◽  
Shi-yan Tu ◽  
Yi-qing Wang

Ovarian cancer (OC) is one of the deadliest gynecological malignancies. Epithelial ovarian cancer (EOC) is its most common form. OC has both a poor prognosis and a high mortality rate due to the difficulty of early diagnosis, the limitations of current treatment, and resistance to chemotherapy. Extracellular vesicles (EVs) are a heterogeneous group of cell-derived submicron vesicles that can be detected in body fluids and are classified into three main types: exosomes, micro-vesicles, and apoptotic bodies. Cancer cells can produce more EVs than healthy cells. Moreover, the contents of these EVs have been found to be distinct from each other. EVs shed from tumor cells are considered to have potential clinical applications, such as serving as a tool for tumor diagnosis, prognosis, and potential treatment of certain cancers. In this review, we provide a brief description of EVs in the diagnosis, prognosis, treatment, and drug resistance of OC. Cancer-related EVs exert powerful influences on tumors through various biological mechanisms. However, the applications mentioned above remain at the laboratory stage: large-scale clinical trials are lacking, and the immaturity of purification and detection methods is a constraint. In addition, amplification of oncogenes on ecDNA is remarkably prevalent in cancer, and it may be possible that ecDNA is encapsulated in EVs and can thus be detected. In summary, much more research on EVs needs to be performed to achieve breakthroughs in OC and to accelerate their application in the clinic.


2021 ◽  
Author(s):  
Marion Germain ◽  
Daniel Kneeshaw ◽  
Louis De Grandpré ◽  
Mélanie Desrochers ◽  
Patrick M. A. James ◽  
...  

Abstract Context Although the spatiotemporal dynamics of spruce budworm outbreaks have been intensively studied, forecasting outbreaks remains challenging. During outbreaks, budworm-linked warblers (Tennessee, Cape May, and bay-breasted warbler) show a strong positive response to increases in spruce budworm, but little is known about the relative timing of these responses. Objectives We hypothesized that these warblers could be used as sentinels of future defoliation of budworm host trees. We examined the timing and magnitude of the relationships between defoliation by spruce budworm and changes in the probability of presence of warblers to determine whether they responded to budworm infestation before local defoliation was observed by standard detection methods. Methods We modelled this relationship using large-scale point count surveys of songbirds and maps of cumulative time-lagged defoliation over multiple spatial scales (2–30 km radius around sampling points) in Quebec, Canada. Results All three warbler species responded positively to defoliation at each spatial scale considered, but the timing of their response differed. Maximum probability of presence of Tennessee and Cape May warbler coincided with observations of local defoliation, or provided a one-year warning, making them of little use to guide early interventions. In contrast, the probability of presence of bay-breasted warbler consistently increased 3–4 years before defoliation was detectable. Conclusions Early detection is a critical step in the management of spruce budworm outbreaks, and rapid increases in the probability of presence of bay-breasted warbler could be used to identify future epicenters and target ground-based local sampling of spruce budworm.


2021 ◽  
Vol 13 (8) ◽  
pp. 1509
Author(s):  
Xikun Hu ◽  
Yifang Ban ◽  
Andrea Nascetti

Accurate burned area information is needed to assess the impacts of wildfires on people, communities, and natural ecosystems. Various burned area detection methods have been developed using satellite remote sensing measurements with wide coverage and frequent revisits. Our study aims to expound on the capability of deep learning (DL) models for automatically mapping burned areas from uni-temporal multispectral imagery. Specifically, several semantic segmentation network architectures, i.e., U-Net, HRNet, Fast-SCNN, and DeepLabv3+, and machine learning (ML) algorithms were applied to Sentinel-2 imagery and Landsat-8 imagery in three wildfire sites in two different local climate zones. The validation results show that the DL algorithms outperform the ML methods in two of the three cases with compact burn scars, while ML methods seem to be more suitable for mapping dispersed burns in boreal forests. Using Sentinel-2 images, U-Net and HRNet exhibit comparable performance with higher kappa (around 0.9) in one heterogeneous Mediterranean fire site in Greece; Fast-SCNN performs better than the others, with kappa over 0.79, in one compact boreal forest fire with various burn severity in Sweden. Furthermore, when the trained models are transferred directly to the corresponding Landsat-8 data, HRNet dominates among the DL models in the three test sites and preserves high accuracy. The results demonstrate that DL models can make full use of contextual information and capture spatial details at multiple scales from fire-sensitive spectral bands to map burned areas. Using only a post-fire image, the DL methods not only provide an automatic, accurate, and bias-free large-scale mapping option with cross-sensor applicability, but also have the potential to be used for onboard processing in the next Earth observation satellites.


2021 ◽  
Vol 13 (6) ◽  
pp. 1211
Author(s):  
Pan Fan ◽  
Guodong Lang ◽  
Bin Yan ◽  
Xiaoyan Lei ◽  
Pengju Guo ◽  
...  

In recent years, many agriculture-related problems have been evaluated with the integration of artificial intelligence techniques and remote sensing systems. The rapid and accurate identification of apple targets in an illuminated and unstructured natural orchard is still a key challenge for the picking robot’s vision system. In this paper, by combining local image features and color information, we propose a pixel patch segmentation method based on gray-centered red–green–blue (RGB) color space to address this issue. Different from the existing methods, this method presents a novel color feature selection method that accounts for the influence of illumination and shadow in apple images. By exploring both color features and local variation in apple images, the proposed method could effectively distinguish the apple fruit pixels from other pixels. Compared with the classical segmentation methods and conventional clustering algorithms as well as the popular deep-learning segmentation algorithms, the proposed method can segment apple images more accurately and effectively. The proposed method was tested on 180 apple images. It offered an average accuracy rate of 99.26%, recall rate of 98.69%, false positive rate of 0.06%, and false negative rate of 1.44%. Experimental results demonstrate the outstanding performance of the proposed method.


Genetics ◽  
2003 ◽  
Vol 165 (4) ◽  
pp. 2269-2282
Author(s):  
D Mester ◽  
Y Ronin ◽  
D Minkov ◽  
E Nevo ◽  
A Korol

Abstract This article is devoted to the problem of ordering in linkage groups with many dozens or even hundreds of markers. The ordering problem belongs to the field of discrete optimization on a set of all possible orders, amounting to n!/2 for n loci; hence it is considered an NP-hard problem. Several authors attempted to employ the methods developed in the well-known traveling salesman problem (TSP) for multilocus ordering, using the assumption that for a set of linked loci the true order will be the one that minimizes the total length of the linkage group. A novel, fast, and reliable algorithm developed for the TSP and based on evolution-strategy discrete optimization was applied in this study for multilocus ordering on the basis of pairwise recombination frequencies. The quality of derived maps under various complications (dominant vs. codominant markers, marker misclassification, negative and positive interference, and missing data) was analyzed using simulated data with ∼50-400 markers. High performance of the employed algorithm allows systematic treatment of the problem of verification of the obtained multilocus orders on the basis of computing-intensive bootstrap and/or jackknife approaches for detecting and removing questionable marker scores, thereby stabilizing the resulting maps. Parallel calculation technology can easily be adopted for further acceleration of the proposed algorithm. Real data analysis (on maize chromosome 1 with 230 markers) is provided to illustrate the proposed methodology.
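The size of the search space mentioned above, n!/2 distinct orders for n loci, and the criterion of minimizing total map length can both be shown on a toy case. The sketch below uses exhaustive search, which is only feasible for a handful of markers; it is deliberately not the evolution-strategy algorithm the abstract describes. The distance matrix is hypothetical and stands in for distances derived from pairwise recombination frequencies.

```python
from itertools import permutations
from math import factorial


def count_orders(n):
    """Number of distinct marker orders for n >= 2 loci (mirror images are equal)."""
    return factorial(n) // 2


def best_order(dist):
    """Exhaustively find the order with the smallest total map length."""
    n = len(dist)
    best, best_len = None, float("inf")
    for perm in permutations(range(n)):
        if perm[0] > perm[-1]:
            continue  # skip the mirror image of an order already considered
        length = sum(dist[perm[i]][perm[i + 1]] for i in range(n - 1))
        if length < best_len:
            best, best_len = perm, length
    return list(best), best_len


print(count_orders(10))  # → 1814400

# Hypothetical markers at map positions 25, 0, 30, 10 (arbitrary units).
positions = [25, 0, 30, 10]
dist = [[abs(a - b) for b in positions] for a in positions]
print(best_order(dist))  # → ([1, 3, 0, 2], 30)
```

With 10 markers there are already about 1.8 million orders, which illustrates why maps with hundreds of markers call for heuristic optimizers rather than enumeration.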


AMB Express ◽  
2019 ◽  
Vol 9 (1) ◽  
Author(s):  
Marcelo dos Santos Barbosa ◽  
Iara Beatriz Andrade de Sousa ◽  
Simone Simionatto ◽  
Sibele Borsuk ◽  
Silvana Beutinger Marchioro

Abstract Current prevention methods for the transmission of Mycobacterium leprae, the causative agent of leprosy, are inadequate as suggested by the rate of new leprosy cases reported. Simple large-scale detection methods for M. leprae infection are crucial for early detection of leprosy and disease control. The present study investigates the production and seroreactivity of a recombinant polypeptide composed of various M. leprae protein epitopes. The structural and physicochemical parameters of this construction were assessed using in silico tools. Parameters like subcellular localization, presence of signal peptide, primary, secondary, and tertiary structures, and 3D model were ascertained using several bioinformatics tools. The resultant purified recombinant polypeptide, designated rMLP15, is composed of 15 peptides from six selected M. leprae proteins (ML1358, ML2055, ML0885, ML1811, ML1812, and ML1214) that induce T cell reactivity in leprosy patients from different hyperendemic regions. Using rMLP15 as the antigen, sera from 24 positive patients and 14 healthy controls were evaluated for reactivity via ELISA. ELISA-rMLP15 was able to diagnose 79.17% of leprosy patients with a specificity of 92.86%. rMLP15 was also able to detect the multibacillary and paucibacillary patients in the same proportions, a desirable addition in the leprosy diagnosis. These results summarily indicate the utility of the recombinant protein rMLP15 in the diagnosis of leprosy and the future development of a viable screening test.
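The reported sensitivity and specificity can be traced back to whole-number counts with a short calculation. The abstract gives only the group sizes (24 patients, 14 controls) and the percentages; the counts of 19 detected patients and 13 correctly negative controls below are inferred as the only integers consistent with those percentages, and the helper name is hypothetical.

```python
def matching_counts(percent, total):
    """Integer counts k for which k/total rounds to the reported percentage."""
    return [k for k in range(total + 1)
            if abs(100 * k / total - percent) < 0.005]


true_positives = matching_counts(79.17, 24)   # patients detected by ELISA-rMLP15
true_negatives = matching_counts(92.86, 14)   # healthy controls testing negative
print(true_positives, true_negatives)  # → [19] [13]

sensitivity = 100 * 19 / 24
specificity = 100 * 13 / 14
print(f"{sensitivity:.2f}% {specificity:.2f}%")  # → 79.17% 92.86%
```

Under this reading, 5 of the 24 patients were missed and 1 of the 14 controls tested positive.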

