CSE 150 Lecture Notes - Lecture 9: Bigram, N-Gram, Posterior Probability

Learning Bigram Models
- collect "large" corpus of text: ~10^6-10^8 sentences
- vocabulary size V ~ 10^5 dictionary entries
- count C_ij = # times that word j follows word i
- count C_i = # times that word i appears (followed by "anything")

Estimate: P_ML(w_{l+1} = j | w_l = i) = C_ij / C_i
Note: NO generalization to “unseen” word combinations!
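A minimal sketch of these counts on a toy corpus (the sentences here are hypothetical stand-ins for a real corpus of ~10^6-10^8 sentences):

```python
from collections import defaultdict

# Toy corpus (hypothetical); each sentence is a list of words.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat ate the fish".split(),
]

# C_ij = # times word j follows word i; C_i = # times word i is followed by anything.
C_ij = defaultdict(int)
C_i = defaultdict(int)
for sentence in corpus:
    for i, j in zip(sentence, sentence[1:]):
        C_ij[(i, j)] += 1
        C_i[i] += 1

def p_ml(j, i):
    """Maximum-likelihood bigram estimate P_ML(w_{l+1}=j | w_l=i) = C_ij / C_i."""
    return C_ij[(i, j)] / C_i[i] if C_i[i] else 0.0

print(p_ml("cat", "the"))   # 2 of the 6 continuations of "the" are "cat" -> 1/3
print(p_ml("fish", "dog"))  # unseen pair -> 0.0: no generalization to unseen combinations
```

The zero estimate for the unseen pair is exactly the "no generalization" problem noted above.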
N-gram model conditions on the previous n - 1 words,
i.e. P(w_l | w_1, ..., w_{l-1}) = P(w_l | w_{l-1}, ..., w_{l-(n-1)})
where n = 1: unigram
      n = 2: bigram
      n = 3: trigram
N-gram counts get increasingly sparse for large n
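The sparsity is easy to see empirically. This sketch (toy word stream; the data are made up) counts how many distinct n-grams occur, and how many occur only once, as n grows:

```python
from collections import Counter

# Toy word stream (hypothetical); a real corpus would have ~10^6-10^8 sentences.
words = ("the cat sat on the mat the dog sat on the log "
         "the cat ate the fish the dog chased the cat").split()

stats = {}
for n in (1, 2, 3):  # unigram, bigram, trigram
    # Slide a window of length n over the stream and count each n-gram.
    ngrams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    singletons = sum(1 for c in ngrams.values() if c == 1)
    stats[n] = (len(ngrams), singletons)
    print(f"n={n}: {stats[n][0]} distinct n-grams, {singletons} seen only once")
```

Even on this tiny stream, larger n means more distinct n-grams and a larger fraction seen only once, so the counts C get increasingly unreliable.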
Learning (ML estimation) from incomplete data
Given:
- fixed DAG over discrete nodes
- data set of T examples, BUT each example is only a PARTIAL instantiation of the nodes in the BN
Ex: movie recommender system
Data set (t = student #, R_i ∈ {0,1} means "recommends the i'th movie", ? = unobserved):

  t | R1 R2 ... R60 | Y
  1 | 0  1  ... ?   | ?
  2 | ?  ?  ... 1   | ?
  3 | 1  1  ... ?   | ?
  T | 0  ?  ... 0   | ?

Naive-Bayes structure (Y is the parent of every R_i):

  R1  R2  ...  Rn      CPTs: P(R_i = 1 | Y = y)
    \  |  ... /
         Y             prior: P(Y = y)

Y ∈ {1, 2, ..., k} = types of "movie-goer"
Can we “learn” such a model from data?
More Generally
Let {X1,...,Xn} denote ALL nodes in BN
Let H denote “subset” of hidden (unobserved) nodes
Let V denote subset of visible (observed) nodes
{X1, ..., Xn} = H ∪ V
Goal: estimate CPTs in BN P(Xi = x|pai = π) to maximize probability of partially observed data
What is log-likelihood for partially observed data?
Assume the T examples are drawn IID from the joint distribution.
L = log P(DATA) = log Π_{t=1}^{T} P(V = v^(t))   ← v^(t) = visible nodes at the t'th example
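For concreteness, here is a sketch of this log-likelihood on a tiny naive-Bayes BN like the movie-recommender example (two movie-goer types, three movies; all parameter values and data are made up). Y is always hidden, so it is summed out of every term; missing R_i are marginalized out too:

```python
import math

# Toy naive-Bayes BN in the spirit of the movie-recommender example.
# All numbers are hypothetical. Hidden Y in {0, 1}; visible R_1..R_3 in {0, 1}.
p_y = [0.6, 0.4]          # prior P(Y = y)
p_r = [[0.9, 0.2],        # p_r[i][y] = P(R_i = 1 | Y = y)
       [0.1, 0.8],
       [0.5, 0.5]]

def log_lik(data):
    """L = sum_t log P(V = v^(t)): sum hidden Y out, skip missing R_i (None)."""
    L = 0.0
    for example in data:
        p_v = 0.0
        for y in (0, 1):
            p = p_y[y]
            for i, r in enumerate(example):
                if r is None:        # unobserved node: marginalized out
                    continue
                p *= p_r[i][y] if r == 1 else 1.0 - p_r[i][y]
            p_v += p                 # accumulate P(V = v^(t), Y = y) over y
        L += math.log(p_v)           # log P(V = v^(t))
    return L

# Each example is only a PARTIAL instantiation: None marks a missing rating.
data = [(1, 0, None), (None, 1, 1), (0, None, 0)]
print(log_lik(data))  # ≈ -4.01 on this toy data
```

Because the hidden Y must be summed out inside the log, this likelihood no longer decomposes into independent per-CPT counts, which is what makes ML estimation from incomplete data harder than the fully observed case.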