BINF 511 Lecture Notes - Lecture 8: Chromatin, Database Tuning, Cellular Component

Lecture 8: Gene and genome annotation; Databases
March 14, 2018
The only items that are worth noting and didn’t come from slides:
For the final exam, you don't need to know the names of the software, only what each tool does
Definite exam question: entity-relationship (ER) data model
Below are notes taken from slides
What happens after obtaining a sequence?
Once we have a sequence, we need to annotate it to understand what it actually is. To do so, we find:
o ORFs (open reading frames)
o Intron-exon boundaries
o TSS = transcription start site
o Gene function (protein domains)
o Gene regulatory sequences
o Etc.
How do we know where in the genome sequence a gene is?
o We need a high-throughput, quick annotation method; the predictions are then proven experimentally
ORF finding algorithms
o The easiest thing to do is to find the start codon (ATG) of the open reading frame and the in-frame stop codon (TAA, TAG, TGA) that ends it
    Need to require a decent minimum length, since short ORFs occur by chance
o Limitations of the ORF finder
    An exon ends at the splice site
    The intron is gibberish in terms of coding information
    Problematic if the program does not find the splice site
    Have to be careful about where the program looks
    To get an idea of where the splice site is, you can line up the mRNA or cDNA against the genomic sequence
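The simple strategy described above (find an ATG, read in frame to the first stop codon, keep only ORFs of a decent length) can be sketched in Python. This is a minimal illustration and not the method of any particular program; the function name `find_orfs` and the `min_len` parameter are assumptions made for the example, and only the three forward reading frames are scanned.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_len=30):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    Coordinates are 1-based and inclusive, and include the stop codon.
    min_len is the minimum ORF length in nucleotides (hypothetical default).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the first ATG seen since the last stop in this frame
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if end - start >= min_len:
                    orfs.append((start + 1, end, frame + 1))
                start = None
    return orfs

# Toy sequence (made up): one 12-nt ORF (ATG AAA CCC TAA) preceded by two bases.
print(find_orfs("CCATGAAACCCTAAGG", min_len=9))  # [(3, 14, 3)]
```

Because the scan treats the input as one continuous coding sequence, it reads straight through splice sites, which is exactly the limitation listed above.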
o Some simple programs
    getorf (EMBOSS)
    MacVector (Oxford Molecular Group)
    Sequencher (Gene Codes)
    ORFfinder (NCBI)
    NCBI ORFfinder example:
        Found ORFs in different reading frames
        Can run BLASTP on the different ORFs (to check whether someone has seen them before)
        If an ORF has a low number of BLAST matches, it's probably not real
        Results: close to, but not, the real exons. Why? The ORF finder does not model intron-exon boundaries: an exon that is not the last one does not end in a stop codon, so the program finds neither the splice site at the end of the first exon nor the start of the second exon
        To find a splice site and prove an intron: align the mRNA sequence to the genome sequence and look for gaps
o Prediction versus reality, for example:
    ORFfinder predicts one ORF at 2775-3032 and one ORF at 3153-3341
    In reality, the exons are 2775-2964 and 3070-3341
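The coordinates in this example can be compared directly. The sketch below only does arithmetic on the numbers given above; it assumes 1-based, inclusive coordinates, and the variable names are illustrative.

```python
# Coordinates from the example above (assumed 1-based, inclusive).
predicted_orfs = [(2775, 3032), (3153, 3341)]
real_exons = [(2775, 2964), (3070, 3341)]

# The intron is the stretch between the end of exon 1 and the start of exon 2.
intron = (real_exons[0][1] + 1, real_exons[1][0] - 1)
intron_len = intron[1] - intron[0] + 1

# Predicted ORF 1 runs past the real splice site and into the intron.
overrun = predicted_orfs[0][1] - real_exons[0][1]
# Predicted ORF 2 starts late, so the first part of exon 2 is missed.
missed = predicted_orfs[1][0] - real_exons[1][0]

print(intron, intron_len, overrun, missed)  # (2965, 3069) 105 68 83
```

So the first predicted ORF extends 68 bases into a 105-base intron, and the second misses the first 83 bases of the real second exon.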
Introns
o Can be predicted from putative splice sites
o To prove an intron, we must compare the mRNA (cDNA) sequence with the genomic sequence
    The splice site may have nothing to do with the reading frame, other than that the spliced mRNA will have a complete reading frame
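The idea of proving an intron by lining up the cDNA against the genomic sequence and looking for gaps can be illustrated with a toy example. The sketch below uses exact matching blocks from Python's `difflib` as a stand-in for a real spliced alignment; the function name `infer_introns`, the `min_block` cutoff, and the sequences are all made up for illustration.

```python
from difflib import SequenceMatcher

def infer_introns(genome, cdna, min_block=8):
    """Return candidate introns as (start, end), 1-based inclusive genome coordinates.

    Exact matching blocks stand in for a real spliced alignment; min_block is a
    hypothetical cutoff that drops short chance matches.
    """
    matcher = SequenceMatcher(None, genome, cdna, autojunk=False)
    blocks = [b for b in matcher.get_matching_blocks() if b.size >= min_block]
    introns = []
    for left, right in zip(blocks, blocks[1:]):
        gap_in_genome = right.a - (left.a + left.size)
        gap_in_cdna = right.b - (left.b + left.size)
        # An intron is a stretch present in the genome but absent from the cDNA.
        if gap_in_genome > 0 and gap_in_cdna == 0:
            introns.append((left.a + left.size + 1, right.a))
    return introns

# Toy data (made up): two 12-nt exons separated by a 14-nt intron.
exon1, intron, exon2 = "ATGGCCAAGCTC", "GTAAGTTTTTTCAG", "CCTGAAGACTAA"
genome = exon1 + intron + exon2
cdna = exon1 + exon2
print(infer_introns(genome, cdna))  # [(13, 26)]
```

When the bases on either side of a junction happen to be identical, the exact boundary can shift by a base or so under this kind of matching, so treat the reported coordinates as candidates rather than proof of the precise splice site.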
Genetic code
o Codon usage: different organisms may prefer certain codons over others
o You may need to train your gene/ORF/pattern finder to suit your organism
o Popular gene-finding tools: GRAILEXP, FGENESH, GenScan
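Codon usage, as mentioned above, is simply how often each codon appears in an organism's coding sequences. A minimal sketch of counting it for a single coding sequence follows; the function name `codon_usage` and the toy sequence are assumptions for illustration, and a real gene finder would be trained on many genes rather than one.

```python
from collections import Counter

def codon_usage(cds):
    """Count codons in a coding sequence read in frame from its first base."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))

# Toy coding sequence (made up): leucine is encoded by CTG twice and TTA once.
counts = codon_usage("ATGCTGCTGTTATAA")
print(counts["CTG"], counts["TTA"])  # 2 1
```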
BLAST annotation
o Analyze your sequences against a BLAST target, most often the non-redundant protein database (using BLASTX)
o Quick and simple; works best for coding regions -> what about UTRs?
o Also error-prone:
    The top BLAST hit is really the top high-scoring pair (a local alignment), not necessarily the best global match
    The annotation is given based on the annotation of the hit sequence, so errors in that annotation are carried over
    Chimeras -> which of the part-sequences will give its annotation to the whole sequence?
    No model for exon-intron boundaries
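One way to see why the top high-scoring pair is not necessarily a global match is to check how much of the query it actually covers. The sketch below is only an illustration of that check: the function name, the example coordinates, and the 0.5 cutoff are all hypothetical and are not taken from the lecture or from BLAST itself.

```python
def query_coverage(hsp_start, hsp_end, query_len):
    """Fraction of the query covered by one HSP (1-based, inclusive coordinates)."""
    lo, hi = sorted((hsp_start, hsp_end))  # start may exceed end for reverse-strand hits
    return (hi - lo + 1) / query_len

# Hypothetical numbers: a 900-nt query whose top HSP spans positions 601-900.
coverage = query_coverage(601, 900, 900)
MIN_COVERAGE = 0.5  # illustrative cutoff, not a recommended value
print(round(coverage, 3), coverage >= MIN_COVERAGE)  # 0.333 False
```

Here the best local alignment covers only a third of the query, so copying the hit's annotation onto the whole sequence would be risky, for example if the query is a chimera.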
o An example: identifying soybean virus sequences in EST data
    Most plant viruses have RNA genomes
    Viral RNA can therefore be incorporated into cDNA libraries
    Common soybean viruses
        BPMV: bean pod mottle virus
        SMV: soybean mosaic virus
    Procedure
        NCBI query
        Retrieve 300,000 sequences in FASTA format
        Format BLAST target database
        Execute command-line BLAST search using viral genome sequences
    Identifying soybean viral sequences in EST data
        Series of BLAST analyses
        12 complete viral genomes as reference sequences
        Assemble into viral contigs using phrap
    Results
        Great sequence diversity
        Sequences from 3 viruses found: bean pod mottle virus, soybean mosaic virus, cowpea chlorotic mottle virus
        12 libraries (out of 80) contained viral sequences
    Contig assembly
        Theoretically 4 different molecules = 4 contigs
        In reality, there were 37 different contigs!
Genome annotation and analysis
Challenges
o The data is extremely large
o Need to see the whole picture while still working at the detail level
o Most of the genome sequence is non-coding, i.e. difficult to understand
o Adding various annotations, e.g. expression data, function, orthology, etc.
Ensembl: gene annotation and maintenance software
o For the final exam: make sure you know what needs to be done for annotation; you don't need to memorize any software names
o Steps
o Gene prediction procedure
    Goal: to have a basic set of predicted gene structures to which metadata can be added (e.g. gene ontology, gene expression data, etc.)
    The "raw compute": running a number of stand-alone analyses to produce that basic set
        Such as RepeatMasker, GenScan, tRNAscan, eponine, BLAST homology searches
        RepeatMasker
            Scans for interspersed repeats and low-complexity regions
            Outputs the sequence with repeats masked as 'N' (the next set of software will ignore all the 'N's)
        GenScan: identifies gene structures in gDNA sequences, including the exon-intron boundaries
        tRNAscan: identifies the tRNA genes in gDNA sequence
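The masking step described for RepeatMasker above can be pictured with a small sketch. It shows only the output side (replacing known repeat intervals with 'N'); detecting the repeats is the hard part and is not attempted here. The function name `mask_repeats`, the coordinate convention, and the toy sequence are assumptions for illustration.

```python
def mask_repeats(seq, repeats):
    """Replace each repeat interval with 'N'.

    repeats holds (start, end) pairs, 1-based and inclusive (assumed convention).
    """
    masked = list(seq)
    for start, end in repeats:
        for i in range(start - 1, min(end, len(masked))):
            masked[i] = "N"
    return "".join(masked)

# Toy sequence (made up) with an (AC)n low-complexity run at positions 5-12.
print(mask_repeats("ATGGACACACACTTGA", [(5, 12)]))  # ATGGNNNNNNNNTTGA
```

The masked sequence keeps its original length, so coordinates reported by the downstream tools still refer to the unmasked genome.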