BINF 511 Lecture Notes - Lecture 8: Chromatin, Database Tuning, Cellular Component

greywildebeest28

25 Jun 2018

Lecture 8: Gene and genome annotation; Databases

March 14, 2018

The only items that are worth noting and didn’t come from slides:

For the final exam, you don't need to know the names of the software, just what each program does

Definite exam question: the entity-relationship (ER) data model

Below are notes taken from slides

What happens after obtaining a sequence?

After obtaining a sequence, we need to annotate it to understand what it actually is. To do so, we find:

o ORFs (open reading frames)

o Intron-exon boundaries

o TSS = transcription start site

o Gene function (protein domains)

o Gene regulatory sequences

o Etc.

How do we know where in the genome sequence a gene is?

o We need a high-throughput (quick) annotation method to predict where the genes are, then confirm the predictions experimentally

ORF finding algorithms

o The easiest thing to do is to find the start codon (ATG) that opens the reading frame and the in-frame stop codon (TAA, TAG, TGA) that closes it

Need to make sure that the ORF has a decent length, since very short ATG-to-stop stretches turn up by chance

o Limitations of the ORF finder

The exon ends where the splice site is, not at a stop codon

The intron is gibberish in terms of coding information

Problematic if the program does not find the splice site

Have to be careful where the program looks

To get an idea of where the splice site is, you can line up the mRNA or cDNA against the genomic sequence

o Some simple programs

getorf (EMBOSS)

MacVector (Oxford Molecular Group)

Sequencher (Gene Codes)

ORFfinder (NCBI)

NCBI ORFfinder example:

Found ORFs in different reading frames

Can run BLASTP on the different ORFs (to check whether someone has seen the protein before)

If an ORF has few or no BLAST matches, it is probably not real

Results: close to the real exons, but not the same. Why? An ORF finder has no model of intron-exon boundaries: an exon that is not the last one does not end in a stop codon, so the program finds neither the splice site at the end of the first exon nor the start of the second exon.

To find a splice site and prove an intron: align the mRNA sequence to the genome sequence and look for gaps

o Prediction versus reality, for example:

ORFfinder predicts one ORF at 2775-3032 and one ORF at 3153-3341

In reality, the exons are 2775-2964 and 3070-3341
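A minimal sketch of the scan described in the ORF finding notes above (not how ORFfinder or getorf are actually implemented): walk each of the three forward reading frames, remember the first ATG, and report an ORF once an in-frame stop codon is reached and the stretch is long enough. The name find_orfs, the min_len threshold and the demo sequence are assumptions made for illustration, and the reverse strand is ignored to keep it short.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_len=90):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    start and end are 1-based inclusive; end includes the stop codon.
    min_len is the minimum ORF length in nucleotides (an arbitrary default).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the first ATG seen since the last stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_len:
                    orfs.append((start + 1, i + 3, frame + 1))
                start = None  # look for the next ATG in this frame
    return orfs


# Hypothetical toy sequence: 2 leading bases, ATG, four GCT codons, TAA, 1 trailing base
demo = "CC" + "ATG" + "GCT" * 4 + "TAA" + "G"
print(find_orfs(demo, min_len=18))  # [(3, 20, 3)]
```

Like the simple programs listed above, this scan knows nothing about splice sites, so on genomic DNA it can read past the end of an exon into the intron until it happens to hit a stop codon.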

Introns

o Can be predicted from putative splice sites

o To prove an intron, we must compare the mRNA (cDNA) sequence with the genomic sequence

The position of a splice site may have nothing to do with the reading frame; the only requirement is that the spliced mRNA ends up with a full frame
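To make the "look for gaps" idea concrete, here is a small sketch that assumes the cDNA-to-genome alignment has already been reduced to exon blocks in genomic coordinates (1-based, inclusive). The helper names introns_from_exons and spliced_length are made up for this illustration; the coordinates are the "reality" exons quoted in the ORFfinder example in these notes.

```python
def introns_from_exons(exons):
    """Return the gaps between consecutive exon blocks as inferred introns.

    exons is a list of (start, end) genomic coordinates, 1-based inclusive.
    """
    exons = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        if next_start > prev_end + 1:  # a gap in the genomic coordinates
            introns.append((prev_end + 1, next_start - 1))
    return introns


def spliced_length(exons):
    """Total length of the exon blocks, i.e. the spliced sequence they cover."""
    return sum(end - start + 1 for start, end in exons)


exons = [(2775, 2964), (3070, 3341)]
print(introns_from_exons(exons))  # [(2965, 3069)]
total = spliced_length(exons)
print(total, total % 3)  # 462 0
```

In this example the first exon is 190 nt long, which is not a multiple of 3, yet the two exons together give 462 nt, which is. That fits the point above: the splice site ignores codon boundaries, and only the spliced product needs a full frame.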

Genetic code

o Codon usage: different organisms may prefer certain codons over others

o You may need to train your gene/ORF/pattern finder to suit your organism

o Popular gene-finding tools: GRAILEXP, FGENESH, GenScan
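As a rough illustration of what "codon usage" means in practice, the sketch below tallies the relative frequency of each codon in an in-frame coding sequence. It is not the training procedure of any of the tools named above; the function name codon_usage and the toy sequence are assumptions for illustration only.

```python
from collections import Counter


def codon_usage(cds):
    """Return the relative frequency of each codon in an in-frame coding sequence.

    Trailing bases that do not fill a complete codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


# Hypothetical toy CDS: alanine appears as GCT twice and as GCC once
usage = codon_usage("ATGGCTGCTGCCTAA")
for codon, freq in sorted(usage.items()):
    print(codon, round(freq, 2))
```

Comparing tables like this between organisms (computed over many known genes rather than one toy sequence) is what shows the preference for certain synonymous codons.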

BLAST annotation

o Analyze your sequences against a BLAST target, most often the non-redundant protein database (using BLASTX)

o Quick and simple; works best for coding regions (what about UTRs?)

o Also error-prone:

The top BLAST hit is really the top high-scoring segment pair (a local alignment), not necessarily the best global match

The annotation is simply copied from the annotation of the hit sequence

Chimeras: which of the part-sequences will give its annotation to the whole sequence?

No model for exon-intron boundaries

o An example: identifying soybean virus sequences in EST data

Most plant viruses have RNA genomes

Can be incorporated in cDNA libraries

Common soybean viruses

BPMV: bean pod mottle virus

SMV: soybean mosaic virus

Procedure

NCBI query

Retrieve 300,000 sequences in FASTA format

Format the BLAST target database

Execute a command-line BLAST search using the viral genome sequences

Identifying soybean viral sequences in EST data

Series of BLAST analyses

12 complete viral genomes as reference sequences

Assemble into viral contigs using phrap

Results

Great sequence diversity

Sequences from 3 viruses found: bean pod mottle virus, soybean

mosaic virus, cowpea chlorotic mottle virus

12 libraries (out of 80) contained viral sequences

Contig assembly

Theoretically 4 different molecules = 4 contigs

In reality, there were 37 different contigs!
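The "top hit" caveat from the BLAST annotation notes above can be sketched as follows, assuming the search results were saved in BLAST's default tabular layout (-outfmt 6: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function names, the sequence IDs and the numbers in the sample rows are hypothetical and do not come from the soybean study.

```python
def best_hits(tabular_lines):
    """Keep the highest-bitscore HSP per query from BLAST tabular (-outfmt 6) lines."""
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        qstart, qend = int(fields[6]), int(fields[7])
        hit = {
            "subject": subject,
            "qstart": min(qstart, qend),
            "qend": max(qstart, qend),
            "evalue": float(fields[10]),
            "bitscore": float(fields[11]),
        }
        if query not in best or hit["bitscore"] > best[query]["bitscore"]:
            best[query] = hit
    return best


def query_coverage(hit, query_length):
    """Fraction of the query covered by this single HSP."""
    return (hit["qend"] - hit["qstart"] + 1) / query_length


# Hypothetical rows: one EST hitting two different subjects with different parts
rows = [
    "est1\tvirusA\t98.0\t100\t2\t0\t1\t100\t500\t599\t1e-50\t180.0",
    "est1\tvirusB\t90.0\t300\t30\t0\t101\t400\t1\t300\t1e-80\t350.0",
]
top = best_hits(rows)["est1"]
print(top["subject"], query_coverage(top, query_length=400))  # virusB 0.75
```

Here the top hit covers only three quarters of the query while the rest matches something else, which is the chimera problem: copying the top hit's annotation to the whole sequence would hide the second part.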

Genome annotation and analysis

Challenges

o The data set is extremely large

o Need to see the whole picture while still working at the detail level

o Most of the genome sequence is non-coding and therefore difficult to interpret

o Various kinds of annotation must be added, e.g. expression data, function, orthology, etc.

Ensembl: gene annotation and maintenance software

o For the final exam: make sure you know what needs to be done for annotation; you don't need to memorize any software names

o Steps



o Gene prediction procedure

Goal: to have a basic set of predicted gene structures to which metadata can be added (e.g. gene ontology, gene expression data, etc.)

The "raw compute": running a number of stand-alone analyses, such as RepeatMasker, GenScan, tRNAscan, Eponine, and BLAST homology searches

RepeatMasker

Scans for interspersed repeats and low-complexity regions

Outputs the sequence with repeats masked as 'N' (the next set of software will ignore all the 'N's)

GenScan: identifies gene structures in gDNA sequences, including the exon-intron boundaries

tRNAscan: identifies the tRNA genes in gDNA sequence
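To show what the masked output described above looks like, here is a toy sketch of hard masking. It only illustrates the output format, not how RepeatMasker detects repeats; the function name hard_mask and the repeat interval are hypothetical, and intervals are assumed to be 1-based and inclusive.

```python
def hard_mask(seq, repeats):
    """Replace every base inside the given repeat intervals with 'N'.

    repeats is a list of (start, end) pairs, 1-based inclusive;
    coordinates outside the sequence are clipped.
    """
    masked = list(seq)
    for start, end in repeats:
        for i in range(max(start, 1) - 1, min(end, len(seq))):
            masked[i] = "N"
    return "".join(masked)


print(hard_mask("ACGTACGTAC", [(3, 5)]))  # ACNNNCGTAC
```

Because the masked stretch keeps its length, coordinates reported by the downstream tools still refer to the original sequence.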
