EECS 1520 Lecture Notes - Lecture 11: Huffman Coding, Extended ASCII, Compression Ratio
REPRESENTING TEXT
• Text is a combination of characters
• To represent text, a list of all the characters is made, and each character is
assigned a binary string
• Character set is a list of characters and the codes used to represent each one
• Experts in the computer industry agree on the specifics of a character set,
thereby creating a standard for sharing data
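A minimal sketch of the idea of a character set, assuming a made-up four-character set. The names `charset`, `encode`, and `decode` and the 2-bit codes are hypothetical, chosen only for illustration; they are not part of any real standard.

```python
# Toy character set: each character is assigned a fixed-width binary string.
# The characters and their 2-bit codes are hypothetical.
charset = {"a": "00", "b": "01", "c": "10", " ": "11"}


def encode(text):
    """Replace each character with its binary code from the toy character set."""
    return "".join(charset[ch] for ch in text)


def decode(bits):
    """Invert the mapping, reading the bit string two bits at a time."""
    reverse = {code: ch for ch, code in charset.items()}
    return "".join(reverse[bits[i:i + 2]] for i in range(0, len(bits), 2))


encoded = encode("a cab")
decoded = decode(encoded)
```

Both sides must use the same table for `decode` to recover the text, which is why an agreed standard matters when data is shared.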
American Standard Code for Information Interchange (ASCII) character set
• The first 32 codes (0 to 31) are control characters (or hidden characters); they
control how the text appears but do not appear as text themselves
• Decimal Arabic digits (i.e., 0, 1, 2, etc.) start at code 48
o Ex. the code for 6 is 48 + 6 = 54
• Uppercase Roman letters (i.e., A, B, C, etc.) start at code 65
o Ex. the code for J (10th letter) is 65 + (10 - 1) = 74
• Lowercase Roman letters (i.e., a, b, c, etc.) start at code 97
o Ex. the code for j (10th letter) is 97 + (10 - 1) = 106
• Notice that corresponding uppercase and lowercase letters are separated by 32
(e.g., 106 - 74 = 32)
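The arithmetic above can be checked with a short sketch. The helper names `digit_code`, `upper_code`, and `lower_code` are hypothetical; `ord` and `chr` are Python built-ins that convert between a character and its code.

```python
def digit_code(d):
    """ASCII code of a decimal digit 0-9: digits start at code 48."""
    return 48 + d


def upper_code(n):
    """ASCII code of the n-th uppercase letter (1-based): A is code 65."""
    return 65 + (n - 1)


def lower_code(n):
    """ASCII code of the n-th lowercase letter (1-based): a is code 97."""
    return 97 + (n - 1)


# The formulas agree with Python's own character codes.
six = digit_code(6)        # same as ord("6")
upper_j = upper_code(10)   # same as ord("J")
lower_j = lower_code(10)   # same as ord("j")

# The gap between the cases is the same for every letter,
# so adding it to an uppercase code gives the lowercase letter.
case_gap = lower_j - upper_j
converted = chr(ord("J") + case_gap)
```

Because the gap is constant, converting a letter between cases is a single addition or subtraction on its code.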
Unicode character set
• Even the extended ASCII is not enough for international use
• Unicode, as originally designed, uses 16 bits per character and can represent
2^16 = 65,536 characters
• Unicode is a superset of ASCII
o The first 256 characters in the Unicode character set correspond exactly
to the extended ASCII character set
• For organization, Unicode is divided into blocks of characters, with each block
having a common theme
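A short sketch of these points in Python, whose `ord` built-in returns a character's Unicode code. It assumes that "extended ASCII" here means the Latin-1 (ISO 8859-1) encoding, which the notes do not state explicitly; the variable names are illustrative.

```python
# 16 bits give 2**16 distinct codes.
codes_16_bit = 2 ** 16

# ASCII characters keep the same code in Unicode (superset property).
code_upper_a = ord("A")

# Characters outside 7-bit ASCII have larger codes.
code_e_acute = ord("\u00e9")   # e with acute accent, within the first 256 codes
code_omega = ord("\u03a9")     # Greek capital omega, from a Greek-themed block

# Assumption: "extended ASCII" is Latin-1. Under that assumption, each of the
# 256 byte values decodes to the Unicode character with the same code.
first_256_match = all(
    ord(bytes([value]).decode("latin-1")) == value for value in range(256)
)
```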
Data Compression
• A reduction in the amount of space needed to store a piece of data
• Compression ratio: the size of the compressed data divided by the size of the
original data
• A data compression technique can be
o Lossless: the data can be retrieved without losing any of the original information
o Lossy: some information may be lost in the process of compression
• Ex.
o Keyword encoding
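A minimal sketch of keyword encoding and of the compression ratio defined above, assuming the usual form of the technique: frequently used words are replaced by single characters. The substitution table `keywords`, the function names, and the sample sentence are all hypothetical, and the sketch assumes the substitute symbols never occur in the original text.

```python
# Hypothetical substitution table: frequent words -> single symbols.
keywords = {"the": "~", "and": "+", "that": "$"}


def keyword_encode(text):
    """Replace each word found in the table with its one-character symbol."""
    return " ".join(keywords.get(word, word) for word in text.split(" "))


def keyword_decode(text):
    """Restore the original words; lossless if the symbols are unused in the text."""
    reverse = {symbol: word for word, symbol in keywords.items()}
    return " ".join(reverse.get(word, word) for word in text.split(" "))


def compression_ratio(compressed, original):
    """Size of the compressed data divided by the size of the original data."""
    return len(compressed) / len(original)


original = "the cat and the dog saw that the bird and the fish"
compressed = keyword_encode(original)
ratio = compression_ratio(compressed, original)
restored = keyword_decode(compressed)
```

For this sentence the encoded text is 35 characters against 50 in the original, a ratio of 0.7, and decoding returns the original exactly, so the technique is lossless under the stated assumption.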