CSE 150 Lecture Notes - Lecture 9: Bigram, N-Gram, Posterior Probability

Learning Bigram Models
- collect "large" corpus of text: ~10^6-10^8 sentences
- vocabulary size V ~ 10^5 dictionary entries
- count C_ij = # times that word j follows word i
- count C_i = # times that word i appears (followed by "anything")

Estimate: P_ML(w_{l+1} = j | w_l = i) = C_ij / C_i
Note: NO generalization to “unseen” word combinations!
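A minimal sketch of these counts on a toy corpus (the sentences here are hypothetical stand-ins for a real corpus of ~10^6-10^8 sentences):

```python
from collections import defaultdict

# Toy corpus (hypothetical); each sentence is a list of words.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat ate the fish".split(),
]

# C_ij = # times word j follows word i; C_i = # times word i is followed by anything.
C_ij = defaultdict(int)
C_i = defaultdict(int)
for sentence in corpus:
    for i, j in zip(sentence, sentence[1:]):
        C_ij[(i, j)] += 1
        C_i[i] += 1

def p_ml(j, i):
    """Maximum-likelihood bigram estimate P_ML(w_{l+1}=j | w_l=i) = C_ij / C_i."""
    return C_ij[(i, j)] / C_i[i] if C_i[i] else 0.0

print(p_ml("cat", "the"))   # 2 of the 6 continuations of "the" are "cat" -> 1/3
print(p_ml("fish", "dog"))  # unseen pair -> 0.0: no generalization to unseen combinations
```

The zero estimate for the unseen pair is exactly the "no generalization" problem noted above.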
N-gram model conditions on the previous n - 1 words,
i.e. P(w_l | w_1, ..., w_{l-1}) = P(w_l | w_{l-1}, ..., w_{l-(n-1)})
where n = 1: unigram
      n = 2: bigram
      n = 3: trigram
N-gram counts get increasingly sparse for large n
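The sparsity is easy to see empirically. This sketch (toy word stream; the data are made up) counts how many distinct n-grams occur, and how many occur only once, as n grows:

```python
from collections import Counter

# Toy word stream (hypothetical); a real corpus would have ~10^6-10^8 sentences.
words = ("the cat sat on the mat the dog sat on the log "
         "the cat ate the fish the dog chased the cat").split()

stats = {}
for n in (1, 2, 3):  # unigram, bigram, trigram
    # Slide a window of length n over the stream and count each n-gram.
    ngrams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    singletons = sum(1 for c in ngrams.values() if c == 1)
    stats[n] = (len(ngrams), singletons)
    print(f"n={n}: {stats[n][0]} distinct n-grams, {singletons} seen only once")
```

Even on this tiny stream, larger n means more distinct n-grams and a larger fraction seen only once, so the counts C get increasingly unreliable.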
Learning (ML estimation) from incomplete data
Given:
- fixed DAG over discrete nodes
- data set of T examples, BUT each example is only a PARTIAL instantiation of the nodes in the BN
Ex: movie recommender system
Data set (t = student #, R_i ∈ {0,1} means "recommends the i'th movie", ? = unobserved):

  t | R1 R2 ... R60 | Y
  1 | 0  1  ... ?   | ?
  2 | ?  ?  ... 1   | ?
  3 | 1  1  ... ?   | ?
  T | 0  ?  ... 0   | ?

Naive-Bayes structure (Y is the parent of every R_i):

  R1  R2  ...  Rn      CPTs: P(R_i = 1 | Y = y)
    \  |  ... /
         Y             prior: P(Y = y)

Y ∈ {1, 2, ..., k} = types of "movie-goer"
Can we “learn” such a model from data?
More Generally
Let {X1,...,Xn} denote ALL nodes in BN
Let H denote “subset” of hidden (unobserved) nodes
Let V denote subset of visible (observed) nodes
{X1, ..., Xn} = H ∪ V
Goal: estimate CPTs in BN P(Xi = x|pai = π) to maximize probability of partially observed data
What is log-likelihood for partially observed data?
Assume the T examples are drawn IID from the joint distribution.
L = log P(DATA) = log Π_{t=1}^{T} P(V = v^(t))   ← v^(t) = visible nodes at the t'th example
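For concreteness, here is a sketch of this log-likelihood on a tiny naive-Bayes BN like the movie-recommender example (two movie-goer types, three movies; all parameter values and data are made up). Y is always hidden, so it is summed out of every term; missing R_i are marginalized out too:

```python
import math

# Toy naive-Bayes BN in the spirit of the movie-recommender example.
# All numbers are hypothetical. Hidden Y in {0, 1}; visible R_1..R_3 in {0, 1}.
p_y = [0.6, 0.4]          # prior P(Y = y)
p_r = [[0.9, 0.2],        # p_r[i][y] = P(R_i = 1 | Y = y)
       [0.1, 0.8],
       [0.5, 0.5]]

def log_lik(data):
    """L = sum_t log P(V = v^(t)): sum hidden Y out, skip missing R_i (None)."""
    L = 0.0
    for example in data:
        p_v = 0.0
        for y in (0, 1):
            p = p_y[y]
            for i, r in enumerate(example):
                if r is None:        # unobserved node: marginalized out
                    continue
                p *= p_r[i][y] if r == 1 else 1.0 - p_r[i][y]
            p_v += p                 # accumulate P(V = v^(t), Y = y) over y
        L += math.log(p_v)           # log P(V = v^(t))
    return L

# Each example is only a PARTIAL instantiation: None marks a missing rating.
data = [(1, 0, None), (None, 1, 1), (0, None, 0)]
print(log_lik(data))  # ≈ -4.01 on this toy data
```

Because the hidden Y must be summed out inside the log, this likelihood no longer decomposes into independent per-CPT counts, which is what makes ML estimation from incomplete data harder than the fully observed case.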