BPS 4104 Chapter Notes - Chapter 7: Gibbs Sampling, Position Weight Matrix, Local Optimum

Chapter 7: Gibbs Sampler
-most frequently used for the identification of regulatory sequences of genes
-efficiency of transcription & translation depends on associated sequence motifs
-transcription is affected by promoter sequences
-translation affected by translation initiation signals
-example: a biologist has identified a set of co-expressed genes in yeast
-he wants to know if the genes are co-regulated (sharing of promoter
sequences & transcription factors)
-he extracted the region upstream of the translation initiation codon of each gene
-main output of Gibbs sampler consists of two parts
-first output: sequences with aligned motifs
-second output: position weight matrix derived from aligned motifs so we
can use it to scan new sequences for the presence & location of such motifs
-input consists of a set of sequences that contain one or more motifs of interest
-two slightly different applications of Gibbs sampler in motif prediction:
-first assumes that each sequence contains exactly one motif & the algorithm
is called the site sampler
-second allows each sequence to have none or multiple motifs & the
algorithm is called the motif sampler
Numerical Illustration of the Computational Details of Gibbs Sampler
-N is the number of input sequences designated as S1, S2, S3…SN
-m is the length of the motif
-Li is the total sequence length of Si
-Ai is the starting position of the motif in Si
-objective of Gibbs sampler is to:
-1. Obtain a set of correct Ai values to align the motifs
-2. Generate a PWM to be used to scan for presence of identified motif
-PWM is of dimension m × 4 (nucleotides) or m × 20 (amino acids)
-first, all nucleotides must be counted
-for example: FA=325, FC=316, FG=267, FT=301 with a total of 1209
-these numbers will be needed for calculating pseudocounts
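The counting step above can be sketched in Python as follows. This is a minimal illustration, not the chapter's own code: `count_nucleotides` and `toy_sequences` are hypothetical names, and the toy input is made up because the 29 yeast sequences are not reproduced in these notes. The counts FA, FC, FG and FT are the ones quoted above.

```python
from collections import Counter

def count_nucleotides(sequences):
    """Pool all input sequences and count A, C, G and T."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq.upper())
    return {nuc: counts.get(nuc, 0) for nuc in "ACGT"}

# Hypothetical toy input (NOT the sequences used in the notes).
toy_sequences = ["ACGTAC", "GGTCA", "TTAGC"]
F = count_nucleotides(toy_sequences)
total = sum(F.values())

# Counts quoted in the notes; they add up to the stated total of 1209.
notes_F = {"A": 325, "C": 316, "G": 267, "T": 301}
notes_total = sum(notes_F.values())
notes_freqs = {nuc: n / notes_total for nuc, n in notes_F.items()}

print(F, total)
print(notes_total, notes_freqs)
```

The relative frequencies in `notes_freqs` are the kind of background values a pseudocount calculation would typically draw on.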
-main algorithm of Gibbs sampler is of two steps:
-first is random initialization in which a random set of Ai values is chosen
and site specific nucleotide frequencies are calculated
-second step is predictive updating until a local solution of Ai values is
obtained and retained
-this is repeated multiple times and previously stored optimal solutions are
continuously replaced with better ones
-convergence is typically declared when two or more local solutions are
identical
Initialization
-we randomly assign a value to Ai with the constraint that 1 ≤ Ai ≤ (Li-m+1), so that the motif fits inside Si
-our first set of N motifs is just a random set of sequences of length m and is not
expected to have any pattern
-C0 vector: lists the distribution of nucleotides outside the 29 random motifs
-C-matrix: lists the site specific nucleotides from 29 random motifs:
Nuc    C0   Site 1  Site 2  Site 3  Site 4  Site 5  Site 6
A     278        8       7       9       6      10       7
C     279        3       8       5      10       6       5
G     230        7       5       6       5       3      11
T     248       11       9       9       8      10       6
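The initialization step can be sketched as below. This is a minimal illustration under stated assumptions: `initialize`, `toy_sequences` and the motif length `M` are hypothetical names and values, the toy input is made up, and start positions are treated as 1-based to match the Ai notation in the notes.

```python
import random

def initialize(sequences, m, rng):
    """Pick a random start A_i for every sequence, then tally the
    site-specific counts (C matrix) and the counts outside the
    chosen motifs (C0 vector)."""
    nucs = "ACGT"
    starts = []
    C = {n: [0] * m for n in nucs}
    C0 = {n: 0 for n in nucs}
    for seq in sequences:
        a = rng.randint(1, len(seq) - m + 1)  # 1 <= A_i <= L_i - m + 1
        starts.append(a)
        motif = seq[a - 1 : a - 1 + m]
        for j, nuc in enumerate(motif):
            C[nuc][j] += 1
        # Everything outside the motif contributes to the background counts.
        for nuc in seq[: a - 1] + seq[a - 1 + m :]:
            C0[nuc] += 1
    return starts, C0, C

# Hypothetical toy input (NOT the 29 sequences used in the notes).
toy_sequences = ["ACGTACGTTAGC", "GGTCAATCGATT", "TTAGCCGATACG"]
M = 6
starts, C0, C = initialize(toy_sequences, M, random.Random(0))
print(starts, C0, C)
```

Each site column of `C` sums to the number of sequences, and `C0` plus the motif positions accounts for every nucleotide, which is the same bookkeeping as in the table above (1035 background counts plus 29 × 6 motif positions gives 1209).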
Predictive Update
-consists of obtaining N random numbers ranging from 1 to N, each used once (a random ordering of the sequences)
-use these numbers as an index to choose the sequences one at a time to update the
site specific distribution of nucleotides (C matrix) & associated frequencies (C0
vector)
-example: if numbers were: 11, 18, 26, 22, 2, 28, 12, 9, 7, 3, 17, 16, 1, 4, 21, 15, 14, 24,
19, 27, 29, 6, 10, 20, 13, 8, 23, 25, and 5, then:
-S11 will be used first and S5 last for the first cycle of predictive update
-it is important to use a random series of numbers instead of choosing sequences
according to input order
-choosing according to input increases likelihood of trapping Gibbs sampler
in a local optimum
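A random visiting order of this kind can be sketched with a shuffle. This is a minimal illustration: the seed and the variable names are arbitrary choices, and the order produced will differ from the one listed in the notes.

```python
import random

N = 29  # number of input sequences in the notes' example

# Shuffle the indices 1..N so that every sequence is visited exactly once
# per cycle, in random order.
rng = random.Random(42)  # arbitrary seed, only for reproducibility
order = list(range(1, N + 1))
rng.shuffle(order)

# The order given in the notes is itself a permutation of 1..29.
notes_order = [11, 18, 26, 22, 2, 28, 12, 9, 7, 3, 17, 16, 1, 4, 21,
               15, 14, 24, 19, 27, 29, 6, 10, 20, 13, 8, 23, 25, 5]

print(order)
print(sorted(notes_order) == list(range(1, N + 1)))
```

Drawing a fresh shuffle for each cycle keeps the updates from following the input order.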
-back to example: first randomly chosen sequence is S11
-randomly chosen motif starts at site 11 (A11=11)
-the motif is AGTGTG
-initial motif will now be taken out of the C matrix and put into the C0 vector
-the motif has one A, three Gs and two Ts
-adding these counts to the C0 vector in the table above gives the updated C0
vector (A=279, C=279, G=233, T=250)
-we must also take this motif out of the C matrix by subtracting 1 from the count of
A at site 1, G at site 2, T at site 3, G at site 4, T at site 5 and G at site 6
-the updated C0 vector and C matrix:
Nuc    C0   Site 1  Site 2  Site 3  Site 4  Site 5  Site 6
A     279        7       7       9       6      10       7
C     279        3       8       5      10       6       5
G     233        7       4       6       4       3      10
T     250       11       9       8       8       9       6
-now the C matrix is made up of 28 randomly chosen motifs, one from each
sequence except S11
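The bookkeeping for S11 can be checked with a short script. This is a minimal sketch: `remove_motif` is a hypothetical helper name, while the counts are the ones from the two tables above and the motif is AGTGTG.

```python
# C0 vector and C matrix after initialization (first table in the notes).
C0 = {"A": 278, "C": 279, "G": 230, "T": 248}
C = {
    "A": [8, 7, 9, 6, 10, 7],
    "C": [3, 8, 5, 10, 6, 5],
    "G": [7, 5, 6, 5, 3, 11],
    "T": [11, 9, 9, 8, 10, 6],
}

def remove_motif(C0, C, motif):
    """Return copies of C0 and C with one motif moved from C into C0."""
    new_C0 = dict(C0)
    new_C = {nuc: list(row) for nuc, row in C.items()}
    for site, nuc in enumerate(motif):
        new_C[nuc][site] -= 1  # the motif no longer counts at this site
        new_C0[nuc] += 1       # its nucleotides now count as background
    return new_C0, new_C

new_C0, new_C = remove_motif(C0, C, "AGTGTG")
print(new_C0)
for nuc in "ACGT":
    print(nuc, new_C[nuc])
```

The result matches the second table: C0 becomes 279, 279, 233, 250, and every site column of the C matrix now sums to 28.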
-we take the S11 motif out of the C matrix and add it to the C0 vector so that the
search for a more likely motif in S11 is not biased by its current motif
-we can then make a position weight matrix out of the C0 vector and C matrix and use
the PWM to scan S11 for the new motif with the highest PWM score (PWMS)
-with the new C0 vector & C matrix we can now make a Q0 vector and Q matrix
-[formulas for the Q0 vector and Q matrix are missing from the source]
-ex: [worked example is missing from the source]
-NCode is the number of different symbols in the sequences (4 for nucleotide
sequences)
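Scanning a sequence with a PWM can be sketched as below. The formulas for Q0 and Q did not survive in these notes, so this sketch ASSUMES a generic scheme that may differ from the chapter's: background frequencies q0 come from C0, site frequencies use a pseudocount `ALPHA` spread in proportion to q0, and the PWM entry is log2(q / q0). `build_pwm`, `best_site`, `ALPHA` and `toy_seq` are all hypothetical; the counts are those of the updated table.

```python
import math

NUCS = "ACGT"
# Updated C0 vector and C matrix (28 motifs, S11 removed) from the notes.
C0 = {"A": 279, "C": 279, "G": 233, "T": 250}
C = {
    "A": [7, 7, 9, 6, 10, 7],
    "C": [3, 8, 5, 10, 6, 5],
    "G": [7, 4, 6, 4, 3, 10],
    "T": [11, 9, 8, 8, 9, 6],
}
ALPHA = 1.0  # ASSUMED pseudocount weight, not taken from the notes

def build_pwm(C0, C, alpha=ALPHA):
    """Log-odds PWM under the assumed pseudocount scheme."""
    total0 = sum(C0.values())
    q0 = {n: C0[n] / total0 for n in NUCS}
    m = len(C["A"])
    n_motifs = sum(C[n][0] for n in NUCS)  # 28 in this example
    pwm = {n: [] for n in NUCS}
    for j in range(m):
        for n in NUCS:
            q = (C[n][j] + alpha * q0[n]) / (n_motifs + alpha)
            pwm[n].append(math.log2(q / q0[n]))
    return pwm

def best_site(seq, pwm):
    """Return the 1-based start and score of the highest-scoring window."""
    m = len(pwm["A"])
    scores = [sum(pwm[seq[s + j]][j] for j in range(m))
              for s in range(len(seq) - m + 1)]
    best = max(range(len(scores)), key=lambda s: scores[s])
    return best + 1, scores[best]

pwm = build_pwm(C0, C)
toy_seq = "TTGACGTAGCATGCCATGA"  # hypothetical stand-in for S11
start, score = best_site(toy_seq, pwm)
print(start, round(score, 3))
```

The returned start would become the new A11 before the next sequence in the random order is processed. Because the initial motifs are random, the scores at this stage are expected to be weak.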