CSE 150 Lecture Notes - Lecture 10: Posterior Probability, Bayes Estimator, Backtracking

Learning from Incomplete Data
Examples t = 1 to T
Visible Nodes V(t)
Hidden Nodes H(t)
How to learn CPTs? Maximize the log-likelihood of the “partially” observed data w/ the EM alg
EM Algorithm
1) Initialize CPTs w/ some values
2) Iterate until convergence
E-step: Inference
For EACH example t = 1 to T,
For EACH node Xi (i = 1 to n)
For EACH possible value x of Xi
For EACH possible configuration π of pa(Xi)
Compute posterior P(Xi = x, pai = π|V(t))
M-step: re-estimate/update CPTs
For nodes w/ parents:
P(Xi = x|pai = π) ← [∑t P(Xi = x, pai = π|V(t))] / [∑t P(pai = π|V(t))] ← numerator & denominator computed from current CPTs
The left-hand side is the NEW CPT value.
Why is it “←”? It’s just an assignment: a formula for getting NEW values (see the sketch below)
For root nodes:
P(Xi = x) ← (1/T) ∑t P(Xi = x|V(t))
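A minimal sketch of how the M-step might look in code, assuming the E-step posteriors have already been computed with the current CPTs and stored as arrays (the array layout and function names here are hypothetical, not from the lecture):

```python
import numpy as np

def m_step_update(joint_posteriors):
    """Re-estimate one CPT from E-step posteriors.

    joint_posteriors: array of shape (T, num_values_x, num_configs_pi), where
    entry [t, x, pi] = P(Xi = x, pai = pi | V(t)) computed with the CURRENT CPTs
    (hypothetical layout chosen for this sketch).
    """
    # Expected counts: sum the posteriors over the T examples
    expected_joint = joint_posteriors.sum(axis=0)    # [x, pi]: sum_t P(Xi = x, pai = pi | V(t))
    expected_parent = expected_joint.sum(axis=0)     # [pi]:    sum_t P(pai = pi | V(t))
    # New CPT = ratio of expected counts, one column per parent configuration
    return expected_joint / expected_parent          # P_new(Xi = x | pai = pi)

def m_step_update_root(node_posteriors):
    """Root-node case: average the posterior marginals over the examples.

    node_posteriors: array of shape (T, num_values_x) with P(Xi = x | V(t)).
    """
    return node_posteriors.mean(axis=0)              # P_new(Xi = x) = (1/T) sum_t P(Xi = x | V(t))
```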
Intuition
- Use the posterior distribution P(H|V(t)) to “fill in” the missing values of the hidden nodes!
- Expected counts under P(H|V) “substitute” for the observed counts of the COMPLETE data case
Key Properties of EM Algorithm
1) Monotonic convergence
- Each iteration of EM is always getting better! (i.e. improves log-likelihood)
L = ∑t=1 to T log P(V(t))
- If Lk is the log-likelihood at the kth iteration of EM, then Lk >= Lk-1 (useful for debugging! See the sketch after this list.)
2) NO tuning parameters, NO learning rates, NO backtracking!
3) Converges “in general” to local but NOT global maximum of L = ∑t log P(V(t))
- Final result of EM can depend on initialization
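Property 1 gives a cheap debugging check: assert that L never decreases between iterations. A sketch, with the model-specific em_step and log_likelihood routines left as hypothetical callables passed in by the caller:

```python
def run_em(cpts, examples, em_step, log_likelihood, max_iters=100, tol=1e-6):
    """Iterate EM until L = sum_t log P(V(t)) stops improving.

    em_step and log_likelihood are model-specific callables (hypothetical here).
    The assert exploits monotonic convergence: each EM iteration can only
    increase L, so a decrease signals a bug.
    """
    prev_L = log_likelihood(cpts, examples)
    for _ in range(max_iters):
        cpts = em_step(cpts, examples)           # one E-step + one M-step
        L = log_likelihood(cpts, examples)
        assert L >= prev_L - 1e-9, "bug: log-likelihood decreased"
        if L - prev_L < tol:                     # converged (to a LOCAL maximum)
            return cpts
        prev_L = L
    return cpts
```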
Example: A → B → C
Where A and C are visible, B is hidden
First consider inference:
Posterior P(B|A, C) = P(C|B, A) P(B|A) / P(C|A) [Bayes Rule]
= P(C|B) P(B|A) / P(C|A) [CI: C ⊥ A | B, so P(C|B, A) = P(C|B)]
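For concreteness, a small sketch of this posterior computation for binary A, B, C with made-up CPT numbers (the values are illustrative, not from the lecture):

```python
import numpy as np

# Hypothetical CPTs for the chain A -> B -> C (all nodes binary)
P_A = np.array([0.6, 0.4])                      # P(A = a)
P_B_given_A = np.array([[0.7, 0.3],             # P(B = b | A = a), rows indexed by a
                        [0.2, 0.8]])
P_C_given_B = np.array([[0.9, 0.1],             # P(C = c | B = b), rows indexed by b
                        [0.4, 0.6]])

def posterior_B(a, c):
    """P(B | A = a, C = c) = P(C = c | B) P(B | A = a) / P(C = c | A = a)."""
    joint = P_C_given_B[:, c] * P_B_given_A[a, :]   # P(B = b, C = c | A = a), one entry per b
    return joint / joint.sum()                      # normalize by P(C = c | A = a)

print(posterior_B(a=0, c=1))   # posterior over hidden B given the visible nodes
```

This posterior over B is exactly what the E-step computes for each training example before the M-step re-estimates the CPTs.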