Data Compression
Data Compression Overview
Data Compression shrinks down a file so that it takes up less
space. This is desirable for data storage and data communication.
Storage space on disks is expensive so a file which occupies
less disk space is "cheaper" than an uncompressed
file. Smaller files are also desirable for data communication,
because the smaller a file the faster it can be transferred.
A compressed file appears to increase the speed of data transfer
over an uncompressed file.
Data Compression Defined
All data is encoded. This means that the data is originally
a combination of elements, e, from some alphabet, A. This
combination of elements is a message, M. This message from
the alphabet, A, is encoded into the binary alphabet, B. The
string of bits, binary digits (0's and 1's), is the encoded
data. So essentially encoding is just tranferring a message,
M, from the alphabet A into the alphabet B. Here is an example:
The message is: a b c d
The encoded message is: 00 01 10 11
The above example just translates the elements a,b,c,d from
the english alphabet, A, into the binary alphabet, B. These
elements can also be decoded from the binary alphabet back
into the original message which was in the english alphabet.
Types of Data Compression
There are two main types of data compression: lossy and lossless.
Lossy data compression is named for what
it does. After one applies lossy data compression to a message,
the message can never be recovered exactly as it was before
it was compressed. When the compressed message is decoded
it does not give back the original message. Data has been
lost.
Because lossy compression can not be decoded
to yield the exact original message, it is not a good method
of compression for critical data, such as textual data. It
is most useful for Digitally Sampled Analog Data (DSAD). DSAD
consists mostly of sound, video, graphics, or picture files.
Algorithms for lossy compression of DSAD vary, but many use
a threshold level truncation. This means that a level is chosen
past which all data is truncated. In a sound file, for example,
the very high and low frequencies, which the human ear can
not hear, may be truncated from the file. Some examples of
lossy data compression algorithms are JPEG, MPEG, and Indeo.
Lossless data compression is also named for
what it does. In a lossless data compression file the original
message can be exactly decoded. Lossless data compression
works by finding repeated patterns in a message and encoding
those patterns in an efficient manner. For this reason, lossless
data compression is also referred to as redundancy reduction.
Becuase redundancy reduction is dependent on patterns in the
message, it does not work well on random messages. Lossless
data compression is ideal for text. Most of the algorithms
for lossless compression are based on the LZ compression method
developed by Lempel and Ziv.
One type of text encoding which is very effective
for files with long strings of repeating bits is RLE. RLE
stands for Run Length Encoding. RLE uses a sliding dictionary
method of the LZ algorithm. The sliding dictionary method
utilizes pointers within the compressed file that point to
previously represented strings of bits within the file. Here
is an example of a message which could be effectively encoded
with RLE:
The rain in Spain falls mainly on the plain.
The string "ain" could be represented only once
and could be pointed to by all later calls to that string.
Huffman coding works by analyzing the frequency, F, of elements,
e, in a message, M. The elements with the highest frequency,
F:e, get assigned the shortest encoding (with the fewest bits).
Elements with lower frequencies get assigned longer encodings
(with more bits).
Entropy Discussed
The average information content of a message is called its
enropy. The information content is related to uncertainty.
The less likely a message is to occur the larger its information
content. This makes sense if we think of an example: if a
person knows what message is about to be sent to him, how
much new information has he learned by receiving that message?
None. This is all that the above statement is saying.
Entropy, E, is information content. The entropy
of a source is inversely proportional to its probability of
occurance:
E = -log P
We use the the log function because we are converting all
sources into the binary alphabet, B.
The same rule applies to an element, e, in a message, M. Its
entropy can be defined as:
E:e = -log (P:e)
P:e is the probability of an element in a message. It is equal
to that element's frequency, F:e, divided by the frequency
of the entire message, F:M:
P:e = F:e / F:M
The average information content or average entropy for a message,
E:M, can now be defined. We know the entropy for each element
in the message is E:e. We will index the elements, e, in a
message, M, by assigning them integer positions, i, according
to their place in the message. For example: the first element,
e, in message M will be assigned 1, the second will be assigned
2...the i-th will be assigned i. Now we can define the entropy
for an entire message, E:M, where there are i elements:
E:M = (P:1*E:1)+(P:2*E:2)+(P:3*E:3)+...+(P:i*E:i)
Entropy is an important concept to data compression. The entropy
of an element (E:e) is the minimum number of bits needed to
encode that element. The entropy of an entire message (E:M)
is the minimum number of bits needed to encode the entire
message with a lossless compression. The entropy of a message
can be used to determine if data compression is worth attempting.
It can also be used to evaluate the effectiveness of a compression.
The number of bits in a compressed code can be compared to
the entropy for that message (E:M) revealing how close to
optimal compression one's compressed code is.
|