BMS2062 Lecture Notes - Lecture 5: Transfer Rna, Gene Duplication, Integrase
Week 3. Genome annotation and the Human Genome Project
GENOME ANNOTATION: IDENTIFICATION OF A NOVEL BACTERIAL VIRULENCE FACTOR
• Bacterial genes rarely have introns
• Getting information out of sequence: what next?
o Locate protein coding regions on the genome sequence
o Predict the function of proteins
-> genome annotation
• What is genome annotation?
o Overlay of biological information onto the genome sequence
o Predicting and marking the position of genes and other elements on a genome sequence
i.e. protein coding sequences
RNA features – usually predicted directly
Protein coding genes
1. Predict location of genes on genome
2. Translate encoded protein and predict function
o Predicting protein function
Similarity to characterised proteins
“Hypothetical proteins” = not similar to any characterised proteins
• Annotation: gene finding
o Prokaryotes: simple gene structure; introns are rare, but transcripts can be multicistronic (one mRNA can
encode multiple proteins)
o Features of protein coding genes:
Contained in ORF
Initiation codon, ribosome binding site
Initiation codon
ATG, GTG, TTG (GTG and TTG are mostly used in bacteria)
Stop codon
TAG, TAA, TGA
• Reading frames:
o Gene finding looks for regions with no stop codons (initial list, then refine)
-> gene finders produce a model of where the genes are encoded -> prediction of what
proteins are encoded
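The idea of scanning a reading frame for regions with no stop codons can be sketched in a few lines of Python. This is only an illustration of the first-pass list described above, not how any particular gene finder works; the function name `stop_free_stretches` and the 0-based, end-exclusive coordinates are assumptions made for this example.

```python
# Stop codons as listed in the notes.
STOP_CODONS = {"TAG", "TAA", "TGA"}

def stop_free_stretches(seq, frame):
    """Return (start, end) pairs of stop-free codon runs in one forward frame.

    frame is 0, 1 or 2; coordinates are 0-based and end-exclusive.
    """
    seq = seq.upper()
    stretches = []
    run_start = frame
    pos = frame
    while pos + 3 <= len(seq):
        if seq[pos:pos + 3] in STOP_CODONS:
            # A stop codon closes the current run (if it is non-empty).
            if pos > run_start:
                stretches.append((run_start, pos))
            run_start = pos + 3
        pos += 3
    # Keep the trailing run that reaches the end of the sequence.
    if pos > run_start:
        stretches.append((run_start, pos))
    return stretches

# Example: scan the three forward frames of a toy sequence.
for frame in range(3):
    print(frame, stop_free_stretches("ATGAAATAGCCCGGGTAA", frame))
```

The reverse strand would be handled the same way after taking the reverse complement, which gives the other three reading frames.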
• Annotation: an overlay - Not all ORFs are marked as genes
• Tools for annotation:
o Gene finders:
e.g. GeneMarkS, GLIMMER, Prodigal
What do they do?
o Search all six reading frames (RF)
o Locate all ORFs (minimum size = 50 codons)
o Identify a potential start codon (ATG most common)
o No overlap – rare to have protein coding regions overlap in prokaryotes
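The steps listed above (six reading frames, ORFs of at least 50 codons, a potential start codon) can be sketched as a naive ORF scan. This is a minimal sketch under stated assumptions, not the method used by the gene finders named above: the function names, the tuple layout and the choice to take the first start codon after the previous stop are all assumptions for illustration, and the "no overlap" step is not implemented.

```python
# Start and stop codons as listed in the notes.
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAG", "TAA", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]

def find_orfs(seq, min_codons=50):
    """Return candidate ORFs as (strand, frame, start, end) tuples.

    start/end are 0-based, end-exclusive offsets on the scanned strand.
    Each ORF runs from the first start codon after the previous stop
    up to and including the next in-frame stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            orf_start = None
            for pos in range(frame, len(s) - 2, 3):
                codon = s[pos:pos + 3]
                if orf_start is None and codon in START_CODONS:
                    orf_start = pos
                elif orf_start is not None and codon in STOP_CODONS:
                    n_codons = (pos - orf_start) // 3  # stop codon not counted
                    if n_codons >= min_codons:
                        orfs.append((strand, frame, orf_start, pos + 3))
                    orf_start = None
    return orfs

# Toy example with a lowered threshold so a short ORF is reported.
toy = "CC" + "ATG" + "AAA" * 3 + "TAA" + "G"
print(find_orfs(toy, min_codons=4))
```

With the default `min_codons=50` the toy sequence yields no candidates, which is the point of the size cut-off: short stop-free stretches occur by chance and are filtered out of the initial list.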
ORF finder software
o NCBI
o Very crude
o Need to be careful because the reader is unbiased and may select regions that aren’t gene
encoding
o Issue: overlapping ORFs (more than 2 RF – likely not encoding proteins)
o Not a very good model to find genes
• Prediction of protein function – Databases:
o Key content: protein sequence + function
e.g. GenBank – used as a dumping site for every genome produced; may get predictions
built on predictions built on predictions, which can amplify mistakes in our prediction
o Using sequence similarity to predict function
- proteins with similar sequences are likely to have the same/similar function
Most common tool for similarity searching = BLAST
Query sequence = unknown protein
Subject database = database of proteins with known function
Blastp = protein query vs protein database
Blastx = nucleotide query vs protein database (query sequence is translated into
6 peptides, one for each RF)
Description output: lists all the hits; can get reports and go to GenBank to
investigate (also find information about biological experiments done on the
sequence)
BLAST alignment – 1st line = query sequence, 3rd line = sequence from database;
the line in between shows aa that are identical (letter) or similar (plus sign = related aa)
Expect (E value) = number of matches this good expected by chance (towards 0 = good, >0.1 =
bad)
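The six-frame translation that blastx applies to a nucleotide query (one peptide per reading frame) can be illustrated with a short sketch. This is not BLAST's own code; the function names and the frame labels `+1`..`-3` are assumptions for the example, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(seq):
    """Translate one frame; '*' marks a stop, 'X' an unrecognised codon.

    A trailing partial codon (1 or 2 bases) is ignored.
    """
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )

def six_frame_translation(seq):
    """Return a dict mapping frame label ('+1'..'+3', '-1'..'-3') to peptide."""
    seq = seq.upper()
    rc = seq.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for i in range(3):
        frames[f"+{i + 1}"] = translate(seq[i:])
        frames[f"-{i + 1}"] = translate(rc[i:])
    return frames

for label, peptide in six_frame_translation("ATGGCCTAA").items():
    print(label, peptide)
```

Each of the six peptides is then compared against the protein database, which is why a nucleotide query can still find protein-level similarity without knowing the correct frame in advance.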
• Predicting protein function:
o < 10% identical = similarity occurs by chance (not related)
o 10-35% identical = might have a related function
o > 35% = probably have a related function
o Groups of proteins that play a particular role are usually located in the same part of the
genome (locus)
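The identity cut-offs above can be applied to a pairwise alignment with a small helper. This is a sketch only: the function names are hypothetical, and the choice to count identity over columns with no gap in either sequence is an assumption (tools differ in the denominator they report).

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of ungapped alignment columns that hold identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [
        (a, b)
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if a != gap and b != gap
    ]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)

def interpret_identity(pct):
    """Map a percent identity onto the rule-of-thumb categories in the notes."""
    if pct < 10:
        return "similarity occurs by chance (not related)"
    if pct <= 35:
        return "might have a related function"
    return "probably have a related function"

pct = percent_identity("ACDE", "ACDF")
print(pct, interpret_identity(pct))
```

These thresholds are rules of thumb from the lecture; the Expect value and the length of the alignment should be considered alongside percent identity before a function is assigned.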