human reference genome
Recently Published Documents


TOTAL DOCUMENTS: 47 (FIVE YEARS 33)

H-INDEX: 8 (FIVE YEARS 4)

2021 ◽  
Author(s):  
Valeriy Titarenko ◽  
Sofya Titarenko

Abstract Background: Technical progress in computational hardware allows researchers to use new approaches for sequence alignment problems. A standard procedure is usually based on pre-aligning of short subsequences followed by proper comparison of neighbouring parts. For this purpose index files are created that store all subsequences (or numbers associated with them) and their positions within a reference sequence. Index files designed on subsequences of 32–64 symbols for a human reference genome can now be easily stored without any compression even on a budget computer. The main goal now is to choose a combination of symbols (a spaced seed) that will tolerate various mismatches between reference and given sequences. An ideal spaced seed should allow us to find all such positions (full sensitivity). By increasing the seed’s weight by one we usually reduce the number of candidate positions fourfold. At the same time longer seeds also reduce the number of signatures to be checked. Results: Several algorithms to assist seed generation are presented. The first one allows us to find all permitted spaced seeds iteratively. The results obtained with the algorithm show specific patterns of the seeds of the highest weight. Among the best seeds, there are periodic seeds with a simple relation between the period of a seed, its length and the length of a read. The second algorithm generates blocks for periodic seeds. A list of blocks is found for blocks of up to 50 symbols and up to 9 mismatches. The third algorithm uses those lists to find spaced seeds for reads of an arbitrary length. Conclusions: Lists of long high-weight spaced seeds are found and available in Supplementary Materials. The seeds are best in terms of weights compared to seeds from other papers and can usually be applied to shorter reads. Codes for all algorithms are available at https://github.com/vtman/PerFSeeB.
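The indexing idea in this abstract can be illustrated with a small sketch. It is not the PerFSeeB code: the function names, the toy reference, and the seed `1101` are assumptions chosen for illustration. A spaced seed is written as a string of `1` (compare this symbol) and `0` (ignore this symbol); its weight is the number of `1`s. The last function expresses the "fourfold" remark under the simplifying assumption of a uniformly random sequence over four symbols.

```python
from collections import defaultdict


def seed_offsets(seed: str) -> list[int]:
    """Offsets of the '1' (sampled) symbols in a spaced seed such as '1101'."""
    return [i for i, c in enumerate(seed) if c == "1"]


def signature(seq: str, start: int, offsets: list[int]) -> str:
    """Symbols of seq sampled at the seed's '1' offsets, beginning at start."""
    return "".join(seq[start + o] for o in offsets)


def build_index(reference: str, seed: str) -> dict[str, list[int]]:
    """Map every signature to the reference positions where it occurs."""
    offsets = seed_offsets(seed)
    index = defaultdict(list)
    for start in range(len(reference) - len(seed) + 1):
        index[signature(reference, start, offsets)].append(start)
    return dict(index)


def candidate_positions(read: str, index: dict[str, list[int]], seed: str) -> set[int]:
    """Reference positions where the read could start, implied by any seed hit."""
    offsets = seed_offsets(seed)
    hits = set()
    for shift in range(len(read) - len(seed) + 1):
        for ref_pos in index.get(signature(read, shift, offsets), []):
            hits.add(ref_pos - shift)
    return hits


def expected_random_hits(reference_length: int, weight: int) -> float:
    """Expected chance matches of one signature, assuming a uniformly
    random 4-letter sequence (a simplification, not a property of real genomes)."""
    return reference_length / 4 ** weight
```

For example, with the toy reference `ACGTACGGTCAGT` and the read `TATGGTCA` (the reference from position 3 with one substituted symbol), the seed `1101` still reports position 3 as a candidate, because the mismatch falls on the ignored `0` position of the first seed placement.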


2021 ◽  
Author(s):  
Lavanya C ◽  
Vidya Niranjan ◽  
Aajnaa Upadhyaya ◽  
Arpita Guha Neogi

The SARS-CoV-2 virus is a previously uncharacterized coronavirus and the causative agent of the COVID-19 pandemic. Gene expression analysis followed by pathway analysis helps researchers to find possible key targets present in biological pathways of host cells that are targeted by the SARS-CoV-2 virus. This review considers the peripheral blood mononuclear cell line (PBMC) and the normal human bronchial epithelial (NHBE) cell line, both of which support SARS-CoV-2 viral replication. Pathway analysis between the healthy and patient samples of the respective cell lines should provide useful insights into the COVID-19 disease. Initially, the datasets from the respective cell lines were collected from the NCBI databank. These datasets underwent further analysis and were mapped and aligned to the human reference genome, producing output files in the BAM format. The BAM files, along with the human reference genome in the GFF format, were uploaded to an open-source software called OmicsBox 2.0 for differential gene expression analysis. This resulted in a table of the differentially expressed genes that were upregulated and downregulated. These gene lists were uploaded to various pathway analyzers that map the significant genes to the most significant pathways. In this project, KOBAS 3.0 and Enrichr were used for pathway analysis. The pathways obtained from these pathway analyzers were further narrowed down by manual comparison. It was observed that many pathways were similar between the NHBE and PBMC cell lines. However, they were also different in terms of their overall nature. In this project, many patterns were seen through the pathways obtained; however, further optimization and functionality studies must be performed in order to establish conclusive results on the scope of the COVID-19 disease. Expanding research on the scope of the disease by going back to the basics will generate new and valuable information about the virus. This knowledge will help us combat the disease in a better and more appropriate manner.


2021 ◽  
Author(s):  
Meng Yang ◽  
Haiping Huang ◽  
Lichao Huang ◽  
Nan Zhang ◽  
Jihong Wu ◽  
...  

Interpretation of the non-coding genome remains an unsolved challenge in human genetics due to the impracticality of exhaustively annotating biochemically active elements in all conditions. Deep learning based computational approaches have recently emerged to help interpret non-coding regions. Here we present LOGO (Language of Genome), a self-attention based contextualized pre-trained language model containing only 2 self-attention layers with 1 million parameters as a substantially light architecture that applies self-supervision techniques to learn bidirectional representations of the unlabeled human reference genome. LOGO is then fine-tuned for a sequence labelling task, and further extended to a variant prioritization task via a special input encoding scheme of alternative alleles followed by adding a convolutional module. Experiments show that LOGO achieves 15% absolute improvement for promoter identification and up to 4.5% absolute improvement for enhancer-promoter interaction prediction. LOGO exhibits state-of-the-art multi-task predictive power on thousands of chromatin features with only 3% parameterization when benchmarked against a fully supervised model, DeepSEA, and 1% parameterization against a recent BERT-based language model for the human genome. For allelic-effect prediction, locality introduced by one-dimensional convolution shows improved sensitivity and specificity for prioritizing non-coding variants associated with human diseases. In addition, we apply LOGO to interpret type 2 diabetes (T2D) GWAS signals and infer underlying regulatory mechanisms. We make a conceptual analogy between natural language and the human genome and demonstrate that LOGO is an accurate, fast, scalable, and robust framework to interpret non-coding regions for global sequence labeling as well as for variant prioritization at base resolution.


2021 ◽  
Author(s):  
Meng Yang ◽  
Haiping Huang ◽  
Lichao Huang ◽  
Nan Zhang ◽  
Jihong Wu ◽  
...  

Abstract Interpretation of the non-coding genome remains an unsolved challenge in human genetics due to the impracticality of exhaustively annotating biochemically active elements in all conditions. Deep learning based computational approaches have recently emerged to help interpret non-coding regions. Here we present LOGO (Language of Genome), a self-attention based contextualized pre-trained language model that applies self-supervision techniques to learn bidirectional representations of the unlabeled human reference genome and extends to a series of downstream tasks via fine-tuning. We also explore a novel knowledge-embedded version of LOGO to incorporate prior human annotations. Experiments show that LOGO achieves 15% absolute improvement for promoter identification and up to 4.5% absolute improvement for enhancer-promoter interaction prediction. LOGO exhibits state-of-the-art predictive power on chromatin features with only 3% parameterization against a fully supervised convolutional neural network, DeepSEA. Fine-tuned LOGO also shows outstanding performance in prioritizing non-coding variants associated with human diseases. In addition, we apply LOGO to interpret type 2 diabetes (T2D) GWAS signals and infer underlying regulatory mechanisms. We make a conceptual analogy between natural language and the human genome and demonstrate that LOGO is an accurate, fast, scalable, and robust framework with powerful adaptability to various tasks without substantial task-specific architecture modifications.


2021 ◽  
Author(s):  
Savannah J Hoyt ◽  
Jessica M Storer ◽  
Gabrielle A Hartley ◽  
Patrick G.S. Grady ◽  
Ariel Gershman ◽  
...  

Mobile elements and highly repetitive genomic regions are potent sources of lineage-specific genomic innovation and fingerprint individual genomes. Comprehensive analyses of large, composite or arrayed repeat elements and those found in more complex regions of the genome require a complete, linear genome assembly. Here we present the first de novo repeat discovery and annotation of a complete human reference genome, T2T-CHM13v1.0. We identified novel satellite arrays, expanded the catalog of variants and families for known repeats and mobile elements, characterized new classes of complex, composite repeats, and provided comprehensive annotations of retroelement transduction events. Utilizing PRO-seq to detect nascent transcription and nanopore sequencing to delineate CpG methylation profiles, we defined the structure of transcriptionally active retroelements in humans, including for the first time those found in centromeres. Together, these data provide expanded insight into the diversity, distribution and evolution of repetitive regions that have shaped the human genome.


2021 ◽  
Author(s):  
Aleksey V Zimin ◽  
Alaina Shumate ◽  
Ida Shinder ◽  
Jakob Heinz ◽  
Daniela Puiu ◽  
...  

Until 2019, the human genome was available in only one fully-annotated version, which was the result of 18 years of continuous improvement and revision. Despite dramatic improvements in sequencing technology, no other individual human genome was available as an annotated reference until 2019, when the genome of an Ashkenazi individual was released. In this study, we describe the assembly and annotation of a second individual genome, from a Puerto Rican individual whose DNA was collected as part of the Human Pangenome project. The new genome, called PR1, is the first true reference genome created from an individual of African descent. Due to recent improvements in both sequencing and assembly technology, PR1 is more complete and more contiguous than either the human reference genome (GRCh38) or the Ashkenazi genome. Annotation revealed 42,217 genes (of which 20,168 are protein-coding), including 107 additional gene copies that are present in PR1 and missing from GRCh38. 180 genes have fewer copies in PR1 than in GRCh38, 13 map only partially, and 3 genes (1 protein-coding) from GRCh38 are entirely missing from PR1.


2021 ◽  
Author(s):  
Hui-Su Kim ◽  
Asta Blazyte ◽  
Sungwon Jeon ◽  
Changhan Yoon ◽  
Yeonkyung Kim ◽  
...  

We present LT1, the first high-quality human reference genome from the Baltic States. LT1 is a female de novo human reference genome assembly constructed using 57× of ultra-long nanopore reads and 47× of short paired-end reads. We also utilized 72 Gb of Hi-C chromosomal mapping data to maximize the assembly’s contiguity and accuracy. LT1’s contig assembly was 2.73 Gbp in length, comprising 4,490 contigs with an N50 value of 13.4 Mbp. After scaffolding with Hi-C data and extensive manual curation, we produced a chromosome-scale assembly with an N50 value of 138 Mbp and 4,699 scaffolds. Our gene prediction quality assessment using BUSCO identified 89.3% of the single-copy orthologous genes included in the benchmarking set. Detailed characterization of LT1 suggested it has 73,744 predicted transcripts, 4.2 million autosomal SNPs, 974,000 short indels, and 12,330 large structural variants. These data are shared as a public resource without any restrictions and can be used as a benchmark for further in-depth genomic analyses of the Baltic populations.
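The N50 values quoted in this abstract follow the usual assembly definition: the length L such that contigs (or scaffolds) of length at least L together cover at least half of the total assembly length. A minimal sketch of that calculation is given here; the function name and the toy lengths are illustrative assumptions, not data from LT1.

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a list of contig or scaffold lengths.

    Sequences are taken from longest to shortest; the N50 is the length of
    the sequence at which the running total first reaches half of the sum.
    """
    if not lengths:
        raise ValueError("n50() needs at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # integer form of running >= total / 2
            return length
    raise AssertionError("unreachable: the running total always reaches half")
```

With the toy lengths `[2, 3, 4, 5, 6, 7, 8, 9, 10]` (total 54), the three longest sequences sum to 27, so the N50 is 8. A higher N50 after scaffolding, as reported above, means fewer and longer sequences are needed to cover half of the assembly.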


2021 ◽  
Vol 14 (1) ◽  
Author(s):  
Thomas Liehr

Abstract Background The Genome Reference Consortium (GRC) has, according to its own statement, the “mission to improve the human reference genome assembly, correcting errors and adding sequence to ensure it provides the best representation of the human genome to meet basic and clinical research needs”. Data from GRC are included in genome browsers like UCSC (University of California, Santa Cruz), Ensembl or NCBI (National Center for Biotechnology Information) and thereby form a basis for the scientific and diagnostic human genetics community. Method Here, long-standing knowledge derived from classical molecular genetic, cytogenetic and molecular cytogenetic data not yet considered by GRC was revisited. Results Three major points were identified: (1) GRC failed to include three chromosomal subbands each for 1q32.1, 2p21, 5q13.2, 6p22.3 and 6q21, which were defined by the International System for Human Cytogenetic Nomenclature (ISCN) back in the 1980s; instead, GRC included 6 additional subbands never recognized by ISCN. (2) GRC defined 34 chromosomal subbands of 0.1 to 0.9 Mb in size, while cytogeneticists generally agree that chromosomal aberrations below 1–2 Mb in size are unlikely to be detected by GTG-banding. (3) None of the sequences used in routine molecular cytogenetic diagnostics to detect heterochromatic and/or pericentromeric satellite DNA within the human genome are yet included in the human reference genome. For those sequences, localization and approximate sizes were determined between the 1970s and 1990, and, if included, at least ~100 Mb of human genome sequence could be added to the genome browsers. Conclusion Overall, taking the points mentioned here into account, and correcting and including the data, will definitely contribute to the still incomplete mapping of the human genome.


Nature ◽  
2021 ◽  
Vol 590 (7845) ◽  
pp. 217-218
Author(s):  
Karen H. Miga

2021 ◽  
Author(s):  
Hongyu Zheng ◽  
Carl Kingsford ◽  
Guillaume Marçais

Abstract Minimizers are efficient methods to sample k-mers from genomic sequences that unconditionally preserve sufficiently long matches between sequences. Well-established methods to construct efficient minimizers focus on sampling fewer k-mers on a random sequence and use universal hitting sets (sets of k-mers that appear frequently enough) to upper bound the sketch size. In contrast, the problem of sequence-specific minimizers, which is to construct efficient minimizers to sample fewer k-mers on a specific sequence such as the reference genome, is less studied. Currently, the theoretical understanding of this problem is lacking, and existing methods do not specialize well to sketch specific sequences. We propose the concept of polar sets, complementary to the existing idea of universal hitting sets. Polar sets are k-mer sets that are spread out enough on the reference, and provably specialize well to specific sequences. Link energy measures how well spread out a polar set is, and with it, the sketch size can be bounded from above and below in a theoretically sound way. This allows for direct optimization of sketch size. We propose efficient heuristics to construct polar sets, and via experiments on the human reference genome, show their practical superiority in designing efficient sequence-specific minimizers. A reference implementation and code for analyses under an open-source license are at https://github.com/kingsford-group/polarset.
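For readers unfamiliar with the basic scheme that this abstract builds on, a plain window minimizer can be sketched as follows. This is not the polar-set method and not the code from the linked repository: the function name, the default lexicographic ordering, and the leftmost tie-break are assumptions made for illustration. In each window of `w` consecutive k-mers, the smallest k-mer under some ordering is sampled; sequence-specific approaches change the ordering so that fewer positions are picked on a given reference.

```python
from typing import Callable, Optional


def minimizers(
    seq: str,
    k: int,
    w: int,
    order: Optional[Callable[[str], object]] = None,
) -> list[int]:
    """Start positions of the k-mers sampled by a window minimizer scheme.

    order maps a k-mer to a sortable key; by default k-mers are compared
    lexicographically. Ties inside a window go to the leftmost k-mer.
    """
    key = order if order is not None else (lambda kmer: kmer)
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        best = min(window, key=lambda i: (key(kmers[i]), i))
        picked.add(best)
    return sorted(picked)
```

Because any exact match of length at least `w + k - 1` contains a full window, two sequences sharing such a substring sample the same k-mer from it, which is the "preserve sufficiently long matches" property mentioned above. The sketch size discussed in the abstract corresponds to the number of positions returned.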

