BPS 4104 Chapter Notes - Chapter 1: String Searching Algorithm, Binomial Distribution, Poisson Distribution


17 May 2018

School

University of Ottawa

Department

Biopharmaceutical sciences

Course

BPS 4104

Professor

Dr. Xia


Chapter 1: BLAST & FASTA

-sequence search and annotation tools

-the v-sis gene of a virus that causes cancer-like transformation of infected cells was found to be similar to platelet-derived growth factor

-sequence similarity search is most effective method for exploiting sequence data

-FASTA & BLAST may miss homologous sequences but are very fast

-BLAST became more popular than FASTA because it evaluated the statistical significance of resulting sequence matches

Basic Concepts of String Matching

-given PA, PC, PG, PT in a database (target) sequence, the probability of a query sequence (Q) having a perfect match of the target sequence (D) is:

-$p = \prod_{i \in \{A,C,G,T\}} P_i^{N_i}$, where $N_i$ is the number of times nucleotide $i$ occurs in Q

-LD is the length of the target sequence and LQ is the length of the query sequence

-the number of possible "matching operations" (the number of times one can shift Q against D in search of a perfect match of LQ letters) is:

-$n = L_D - L_Q + 1$
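The two quantities above can be checked numerically. The following is a minimal sketch, assuming the letters are independent so that the per-position match probability is the product of the target frequencies of the query letters. The function names, the example query and the equal base frequencies are illustrative assumptions, not part of the notes.

```python
from collections import Counter


def perfect_match_probability(query, base_freqs):
    """Probability that Q matches D perfectly at one fixed position.

    Assumes independent letters: multiply the target frequency of each
    nucleotide, raised to the number of times it occurs in the query.
    """
    p = 1.0
    for base, count in Counter(query).items():
        p *= base_freqs[base] ** count
    return p


def matching_operations(len_d, len_q):
    """Number of positions at which Q can be laid against D: L_D - L_Q + 1."""
    return max(len_d - len_q + 1, 0)


# Hypothetical inputs: equal base frequencies, a 6-letter query, a 1000-letter target.
freqs = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
p = perfect_match_probability("ATTGCC", freqs)
n = matching_operations(1000, 6)
print(p, n)
```

With equal frequencies the result reduces to 0.25 raised to the query length, which is the special case used later in the notes.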

-probability distribution of the number of matches follows approximately a binomial distribution:

-$P(x) = \frac{n!}{x!\,(n-x)!}\, p^{x} (1-p)^{n-x}$

-binomial distribution is troublesome to compute when n is large

-when np<1 and n is very large, the binomial distribution can be approximated by the Poisson distribution with mean and variance equal to np:

-$P(x) = \frac{e^{-np}\,(np)^{x}}{x!}$
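To see how close the approximation is, the sketch below compares the two probability mass functions for a large n and small p. The values of n and p are hypothetical, chosen only so that np<1, and the function names are illustrative.

```python
import math


def binomial_pmf(x, n, p):
    """Exact binomial probability of x matches in n matching operations."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)


def poisson_pmf(x, mean):
    """Poisson probability of x matches when the mean (and variance) is `mean`."""
    return math.exp(-mean) * mean ** x / math.factorial(x)


# Hypothetical case: 100,000 matching operations, 10-letter query,
# equal base frequencies, so np is about 0.095.
n = 100_000
p = 0.25 ** 10

for x in range(4):
    print(x, binomial_pmf(x, n, p), poisson_pmf(x, n * p))
```

The Poisson form needs only the product np, which avoids the large factorials of the binomial formula.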



-SAGE: serial analysis of gene expression

-is strongly biased against short mRNA

-a query length of 14 nucleotides is not sufficient to identify a gene product in typical eukaryotic genomes

-information on SAGE incorrectly states that 14 nucleotides is enough to identify a gene
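The point about 14-nucleotide tags can be illustrated with a rough calculation of np. This is only a sketch: it assumes equal base frequencies, a single strand, and an illustrative genome length of 3 billion nucleotides, none of which is stated in the notes.

```python
import math


def expected_chance_matches(target_len, query_len, p_base=0.25):
    """Expected number of perfect matches arising by chance: n * p."""
    n = target_len - query_len + 1   # number of matching operations
    p = p_base ** query_len          # probability of a perfect match per operation
    return n * p


genome_len = 3_000_000_000  # assumed, illustrative eukaryotic genome size
e14 = expected_chance_matches(genome_len, 14)

# Poisson probability of at least one chance match: 1 - P(0)
p_any = 1 - math.exp(-e14)
print(e14, p_any)
```

Under these assumptions a 14-nucleotide tag is expected to match at roughly eleven places by chance alone, so a match does not pinpoint a single gene.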

-matching substring of Q against substring of D:


-assuming nucleotide probabilities are equal (0.25)

-probability of finding an exact match of at least L consecutive letters is (L≤LQ and L≤LD):

-$p = 0.25^{L} = 2^{-2L}$

-match of L consecutive letters between Q and D can happen at m=(LQ-L+1) positions on Q and at n=(LD-L+1) positions on D

-m and n are termed the effective lengths of the query and the database

-there are mn possible matching operations, each with a probability of $0.25^{L}$ of getting a match with L consecutive letters; therefore the expected number of matches with at least L is:

-$E = mn \cdot 0.25^{L} = mn\,2^{-2L} = mn\,2^{-S}$

-where S=2L

-or $E = mn\,e^{-\lambda R}$ where λ=-ln(0.25) and R=L
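The two forms of the expected number of matches are the same quantity written in base 2 and base e. A minimal sketch, with illustrative function names and hypothetical sequence lengths, confirms that they agree:

```python
import math


def expected_matches_base2(len_q, len_d, L):
    """E = m * n * 2^(-S) with S = 2L (equal base frequencies assumed)."""
    m = len_q - L + 1   # effective query length
    n = len_d - L + 1   # effective database length
    S = 2 * L
    return m * n * 2.0 ** (-S)


def expected_matches_exp(len_q, len_d, L):
    """E = m * n * exp(-lambda * R) with lambda = -ln(0.25) and R = L."""
    m = len_q - L + 1
    n = len_d - L + 1
    lam = -math.log(0.25)
    return m * n * math.exp(-lam * L)


# Hypothetical example: 100-letter query, 1,000,000-letter database, L = 11.
print(expected_matches_base2(100, 1_000_000, 11))
print(expected_matches_exp(100, 1_000_000, 11))
```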

-when S is computed with a particular scoring scheme, the equation can be applied to situations with consecutive matching letters AND mismatches/gaps

-can use different scoring schemes

-when raw score (R) is computed according to scoring scheme and the bit score S is computed with R & two scaling factors λ and K, the following equation can be used to obtain an E-value:

-$S = \frac{\lambda R - \ln K}{\ln 2}$, then $E = mn\,2^{-S}$
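The conversion from raw score to bit score to E-value can be sketched as follows. The values of λ, K, R, m and n below are illustrative placeholders, not values given in the notes; real values depend on the scoring scheme and on the sequences searched.

```python
import math


def bit_score(raw_score, lam, K):
    """Bit score S = (lambda * R - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value(m, n, bits):
    """E-value E = m * n * 2^(-S) for effective lengths m and n."""
    return m * n * 2.0 ** (-bits)


lam, K = 1.37, 0.71     # assumed scaling factors, for illustration only
R = 30                  # hypothetical raw score
m, n = 90, 999_990      # hypothetical effective lengths

S = bit_score(R, lam, K)
E = e_value(m, n, S)
print(S, E)
```

Substituting S into the second equation gives E = K·m·n·e^(-λR), so K acts as a multiplicative scaling of the E-value. With λ = ln 4 and K = 1 the bit score reduces to 2R, which matches the equal-frequency case above.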

-R will increase with length of query & target sequences

-with large R values, S≈2R

-E-value can be used as the mean (the Poisson λ parameter, not the scaling factor λ above) of the Poisson distribution to get the probability of having 0, 1, ..., x matches that are as good as or better than the reported match
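As a sketch of this use of the E-value, the snippet below treats E as the Poisson mean and computes the probability of exactly x matches and of at least one match. The function names and the example E-values are illustrative assumptions.

```python
import math


def prob_exactly(x, e_val):
    """Poisson probability of exactly x matches as good as or better than the hit."""
    return math.exp(-e_val) * e_val ** x / math.factorial(x)


def prob_at_least_one(e_val):
    """P(one or more such matches) = 1 - e^(-E), computed stably for small E."""
    return -math.expm1(-e_val)


for e_val in (0.01, 1.0, 10.0):   # hypothetical E-values
    print(e_val, prob_exactly(0, e_val), prob_at_least_one(e_val))
```

For small E the probability of at least one match is close to E itself, which is why small E-values are often read almost like probabilities.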

-BLAST scales the E-value with K, which makes the reported E-value too conservative

-$E = K\,mn\,e^{-\lambda R}$

-termed EVD (extreme value distribution) & gives statistical significance to a match score between two sequences
