BINF 511 Lecture Notes - Lecture 11: De Bruijn Graph, Sequence Assembly, Contig


greywildebeest28

25 Jun 2018


Lecture 11: Genome assembly; Extra lab

April 4, 2018

Lecture portion: Introduction to genome assembly (lecturer: Maria Kyriakidou)

Genome assembly is a hierarchical data structure that maps the sequence data to a putative reconstruction of the target

Two types

o De novo: the process of reconstructing the DNA sequence of an organism from its sequence reads alone

No reference genome

Necessary for novel genome

Issues: length of the sequenced reads, errors, repeats

o Reference-based

Difficulties in assembling genomes

o Biological: high ploidy, heterozygosity, repetitiveness

o Sequencing: large genomes, no perfect sequence yet

o Computational: large genomes, complex structures

o Accuracy: hard to assess correctness

Hierarchical structure of assembly: reads -> contigs -> scaffolds

o Reads: fragments of original DNA with sequenced ends

o Contigs: align reads to build contigs -> then align the contigs to get a consensus contig

Major problem: repeats

o Scaffolds: use of additional information to orient and connect contigs (paired end, mate pair, restriction maps)

Paired-end reads: 100-500 bp insert

Mate pairs: 2-20 kb insert

Algorithms

o All graph-based -> simplify assembly

Read layout

Overlap graph (overlap-layout consensus)

All versus all pairwise comparison of reads

Computationally very expensive

Does not scale well
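To make the all-versus-all cost concrete, here is a minimal Python sketch of an overlap graph on toy reads. The reads, the helper names `overlap` and `overlap_graph`, and the `min_len` threshold are illustrative assumptions, not taken from the lecture. An edge is drawn from read a to read b when a suffix of a exactly matches a prefix of b; real assemblers use error-tolerant alignment instead of exact matching.

```python
from itertools import permutations

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when no exact overlap of at least min_len exists.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def overlap_graph(reads, min_len=3):
    """Compare every ordered pair of reads: n * (n - 1) checks for n reads."""
    edges = {}
    for a, b in permutations(reads, 2):
        n = overlap(a, b, min_len)
        if n:
            edges[(a, b)] = n
    return edges

# Hypothetical toy reads.
reads = ["ATGGCG", "GCGTGC", "TGCAAT"]
graph = overlap_graph(reads)
num_comparisons = len(reads) * (len(reads) - 1)
```

With 3 reads this is only 6 comparisons, but the count grows quadratically with the number of reads, which is why the approach does not scale well to large read sets.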

Most fragment assembly algorithms consist of the following steps

Overlap: finding potentially overlapping reads by alignment (computationally intensive)

Layout: finding the order of the reads along the DNA (graph simplification)

Consensus: deriving the DNA sequence from the ordered reads (sequence)
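The three steps can be sketched in Python on toy data. This is a simplified greedy merge under stated assumptions, not the algorithm of any particular assembler: reads are error-free, overlaps are exact suffix-prefix matches, and the function names `overlap` and `greedy_assemble` are hypothetical. With error-free reads the consensus step reduces to concatenating the non-overlapping parts.

```python
def overlap(a, b, min_len=3):
    """Overlap step: longest exact suffix of a that is a prefix of b."""
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        # Overlap: all-versus-all comparison of the current sequences.
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no remaining overlaps: the rest stay separate contigs
        # Layout + consensus: place b after a and keep the overlap once.
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads

# Hypothetical toy reads, given out of order.
contigs = greedy_assemble(["TGCAAT", "ATGGCG", "GCGTGC"])
```

Greedy merging is easy to follow but can collapse repeats incorrectly, which ties back to repeats being the major problem at the contig stage.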

de Bruijn graph

Concept in combinatorial mathematics

Representation of sequence based on short words (k-mers)

Overlaps between words

Procedure

Split the reads into k-mer-sized chunks

A k-mer is a short substring of a read (for this example, k=3)

Dk = (V,E)
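A minimal Python sketch of the procedure, using k=3 as in the notes. The notes do not spell out V and E, so this assumes one common convention: vertices are the (k-1)-mers and each k-mer contributes a directed edge from its prefix to its suffix. The function names `kmers` and `de_bruijn` and the example read are illustrative.

```python
from collections import defaultdict

def kmers(read, k):
    """All overlapping substrings of length k, in order of position."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def de_bruijn(reads, k):
    """Adjacency lists: (k-1)-mer prefix -> list of (k-1)-mer suffixes.

    Assumed convention: each k-mer is one edge; repeated k-mers give
    repeated edges (a multigraph).
    """
    graph = defaultdict(list)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)

# Hypothetical toy read.
read = "ATGGCG"
chunks = kmers(read, 3)       # ['ATG', 'TGG', 'GGC', 'GCG']
graph = de_bruijn([read], 3)  # AT -> TG -> GG -> GC -> CG
```

Unlike the overlap graph, building this graph needs no pairwise read comparison: each read is scanned once, and overlaps between words are implicit in shared (k-1)-mers.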
