BPS 4104 Chapter Notes - Chapter 1: String Searching Algorithm, Binomial Distribution, Poisson Distribution

Chapter 1: BLAST & FASTA
-sequence search and annotation tools
-the gene v-sis, found in a virus that caused cancer-like transformation of infected cells,
was found to be similar to platelet-derived growth factor
-sequence similarity search is most effective method for exploiting sequence data
-FASTA & BLAST may miss homologous sequences but are very fast
-BLAST became more popular than FASTA because it evaluated statistical
significance of resulting sequence matches
Basic Concepts of String Matching
-given nucleotide frequencies PA, PC, PG, PT in a database (target) sequence, the probability of a query
sequence (Q) having a perfect match of the target sequence (D) is:
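-$p = P_A^{N_A} P_C^{N_C} P_G^{N_G} P_T^{N_T}$, where $N_i$ is the number of times nucleotide i occurs in Q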
-LD is the length of the target sequence and LQ is the length of the query sequence
-the number of possible "matching operations" (number of times one can shift Q
against D in search for a perfect match of LQ letters) is:
-$n = L_D - L_Q + 1$
-probability distribution of the number of matches follows approximately a binomial
distribution:
-$P(x) = \binom{n}{x} p^{x} (1-p)^{n-x}$, where x is the number of matches

-binomial distribution is troublesome to compute when n is large
-when np<1 and n is very large, the binomial distribution can be approximated by
the Poisson distribution with mean and variance equal to np:
-$P(x) = \frac{(np)^{x} e^{-np}}{x!}$
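A minimal Python sketch of the approximation, assuming p is the product of the target's nucleotide frequencies over the letters of Q and n = LD - LQ + 1 is the number of matching operations. The query string, the frequencies and the target length are made-up illustration values, not data from the notes.

```python
import math

def match_probability(query, freqs):
    """Probability that one matching operation matches the whole query exactly."""
    p = 1.0
    for base in query:
        p *= freqs[base]
    return p

def binomial_pmf(x, n, p):
    """Exact binomial probability of x matches in n matching operations."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

def poisson_pmf(x, mean):
    """Poisson approximation with mean (and variance) equal to n*p."""
    return mean**x * math.exp(-mean) / math.factorial(x)

# Hypothetical inputs: equal base frequencies, a 10-letter query, a 100 kb target.
freqs = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
query = "ACGTACGTAC"
L_D = 100_000

p = match_probability(query, freqs)
n = L_D - len(query) + 1          # number of matching operations

for x in range(3):
    print(x, binomial_pmf(x, n, p), poisson_pmf(x, n * p))
```

With these values np is well below 1 and n is large, so the two columns printed for each x should agree closely.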

-SAGE: serial analysis of gene expression
-is strongly biased against short mRNA
-a query length of 14 nucleotides not sufficient to identify a gene product in
typical eukaryotic genomes
-information on SAGE incorrectly states that 14 nucleotides is enough to
identify gene
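The 14-nucleotide claim can be checked with a rough calculation. The sketch below assumes equal nucleotide frequencies (0.25 each), searches one strand only, and uses a hypothetical genome length of 3 billion bases as a stand-in for a typical large eukaryotic genome; none of these numbers come from the notes.

```python
def expected_random_hits(tag_length, genome_length, p_base=0.25):
    """Expected number of exact chance matches of a tag in a random genome."""
    positions = genome_length - tag_length + 1
    return positions * p_base**tag_length

def shortest_specific_tag(genome_length, p_base=0.25):
    """Smallest tag length whose expected number of chance matches is below 1."""
    length = 1
    while expected_random_hits(length, genome_length, p_base) >= 1:
        length += 1
    return length

GENOME_LENGTH = 3_000_000_000   # hypothetical genome size (assumption)

print(expected_random_hits(14, GENOME_LENGTH))
print(shortest_specific_tag(GENOME_LENGTH))
```

Under these assumptions a 14-nucleotide tag is expected to match at roughly eleven positions by chance alone, so it cannot single out one gene.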
-matching substring of Q against substring of D:
-assuming nucleotide probabilities are equal (0.25)
-probability of finding an exact match of at least L consecutive letters is (L≤LQ
and L≤LD):
-$p = 0.25^{L} = 2^{-2L}$
-match of L consecutive letters between Q and D can happen at m=(LQ-L+1)
positions on Q and at n=(LD-L+1) positions on D
-m and n are termed effective length of query/database
-there are mn possible matching operations, each with a probability of $0.25^{L}$
of getting a match with L consecutive letters, therefore the expected number
of matches with at least L consecutive letters is:
-$E = mn \cdot 0.25^{L} = mn\,2^{-S}$
-where S=2L
-or $E = mn\,e^{-\lambda R}$ where λ=-ln(0.25) and R=L
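A short sketch of the expected number of matches, assuming equal nucleotide frequencies, m = LQ - L + 1 and n = LD - L + 1. It computes the value both as mn times 0.25 to the power L and in the exponential form with λ = -ln(0.25) and R = L, to show the two agree. The lengths used are illustration values only.

```python
import math

def effective_lengths(L, L_Q, L_D):
    """Effective lengths m (query) and n (database) for a match of L letters."""
    return L_Q - L + 1, L_D - L + 1

def expected_matches(L, L_Q, L_D):
    """E = m * n * 0.25**L, which equals m * n * 2**(-S) with S = 2L."""
    m, n = effective_lengths(L, L_Q, L_D)
    return m * n * 0.25**L

def expected_matches_exp(L, L_Q, L_D):
    """Same quantity written as m * n * exp(-lambda * R), with R = L."""
    m, n = effective_lengths(L, L_Q, L_D)
    lam = -math.log(0.25)
    return m * n * math.exp(-lam * L)

# Hypothetical lengths: a 100-letter query against a 1,000,000-letter database.
for L in (10, 15, 20):
    print(L, expected_matches(L, 100, 1_000_000), expected_matches_exp(L, 100, 1_000_000))
```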
-when S is computed with a particular scoring scheme, the equation can be
applied to situations with consecutive matching letters AND
mismatches/gaps
-can use different scoring schemes
-when raw score (R) is computed according to scoring scheme and the bit
score S is computed with R & two scaling factors λ and K, the following
equation can be used to obtain an E-value:
-$S = \frac{\lambda R - \ln K}{\ln 2}$, then $E = mn\,2^{-S}$
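A minimal sketch of the raw-score-to-bit-score conversion, assuming the usual form S = (λR - ln K) / ln 2 followed by E = mn 2^(-S). The values of K, m and n below are hypothetical illustration values; real λ and K depend on the scoring scheme. With λ = ln 4 and K = 1 the sketch reduces to the exact-match case S = 2R.

```python
import math

def bit_score(R, lam, K):
    """Convert a raw score R into a bit score using scaling factors lam and K."""
    return (lam * R - math.log(K)) / math.log(2)

def e_value(S, m, n):
    """Expected number of matches with bit score at least S."""
    return m * n * 2.0 ** (-S)

lam = math.log(4)        # exact-match case from the notes: lambda = -ln(0.25)
m, n = 90, 999_990       # hypothetical effective lengths

# K = 1 recovers S = 2R.
print(bit_score(20, lam, 1.0))

# A hypothetical K = 0.5: E is the same as K * m * n * exp(-lam * R).
S = bit_score(20, lam, 0.5)
print(e_value(S, m, n), 0.5 * m * n * math.exp(-lam * 20))
```

Writing E in the second form makes it explicit that K enters the E-value as a simple multiplier.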
-R will increase with length of query & target sequences
-with large R values, S≈2R
-E-value can be used as the λ parameter (the mean) of the Poisson distribution to get the
probability of having 0, 1, ..., x matches that are as good as or better than the reported
match
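A small sketch of this use of the E-value, assuming the number of equally good or better matches is Poisson distributed with mean E. The E-values in the loop are illustration values only.

```python
import math

def prob_x_matches(x, E):
    """Poisson probability of exactly x matches as good as or better than the hit."""
    return E**x * math.exp(-E) / math.factorial(x)

def prob_at_least_one(E):
    """P(x >= 1) = 1 - exp(-E); expm1 keeps precision when E is very small."""
    return -math.expm1(-E)

for E in (10.0, 1.0, 0.01, 1e-5):
    print(E, prob_x_matches(0, E), prob_at_least_one(E))
```

For small E the probability of at least one chance match is approximately E itself, which is why a small E-value can be read almost directly as a probability.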
-BLAST scales the E value with K, which makes output of E value too conservative
- 
-termed EVD (extreme value distribution) & gives statistical significance to a match
score between two sequences