Data Compression

Data Compression Overview
Data Compression shrinks down a file so that it takes up less space. This is desirable for data storage and data communication. Storage space on disks is expensive so a file which occupies less disk space is "cheaper" than an uncompressed file. Smaller files are also desirable for data communication, because the smaller a file the faster it can be transferred. A compressed file appears to increase the speed of data transfer over an uncompressed file.


Data Compression Defined
All data is encoded. This means that the data is originally a combination of elements, e, from some alphabet, A. This combination of elements is a message, M. This message from the alphabet, A, is encoded into the binary alphabet, B. The string of bits, binary digits (0's and 1's), is the encoded data. So essentially encoding is just tranferring a message, M, from the alphabet A into the alphabet B. Here is an example:


The message is: a b c d
The encoded message is: 00 01 10 11
The above example just translates the elements a,b,c,d from the english alphabet, A, into the binary alphabet, B. These elements can also be decoded from the binary alphabet back into the original message which was in the english alphabet.


Types of Data Compression
There are two main types of data compression: lossy and lossless.

Lossy data compression is named for what it does. After one applies lossy data compression to a message, the message can never be recovered exactly as it was before it was compressed. When the compressed message is decoded it does not give back the original message. Data has been lost.

Because lossy compression can not be decoded to yield the exact original message, it is not a good method of compression for critical data, such as textual data. It is most useful for Digitally Sampled Analog Data (DSAD). DSAD consists mostly of sound, video, graphics, or picture files. Algorithms for lossy compression of DSAD vary, but many use a threshold level truncation. This means that a level is chosen past which all data is truncated. In a sound file, for example, the very high and low frequencies, which the human ear can not hear, may be truncated from the file. Some examples of lossy data compression algorithms are JPEG, MPEG, and Indeo.

Lossless data compression is also named for what it does. In a lossless data compression file the original message can be exactly decoded. Lossless data compression works by finding repeated patterns in a message and encoding those patterns in an efficient manner. For this reason, lossless data compression is also referred to as redundancy reduction. Becuase redundancy reduction is dependent on patterns in the message, it does not work well on random messages. Lossless data compression is ideal for text. Most of the algorithms for lossless compression are based on the LZ compression method developed by Lempel and Ziv.

One type of text encoding which is very effective for files with long strings of repeating bits is RLE. RLE stands for Run Length Encoding. RLE uses a sliding dictionary method of the LZ algorithm. The sliding dictionary method utilizes pointers within the compressed file that point to previously represented strings of bits within the file. Here is an example of a message which could be effectively encoded with RLE:

The rain in Spain falls mainly on the plain.
The string "ain" could be represented only once and could be pointed to by all later calls to that string.
Huffman coding works by analyzing the frequency, F, of elements, e, in a message, M. The elements with the highest frequency, F:e, get assigned the shortest encoding (with the fewest bits). Elements with lower frequencies get assigned longer encodings (with more bits).

Entropy Discussed
The average information content of a message is called its enropy. The information content is related to uncertainty. The less likely a message is to occur the larger its information content. This makes sense if we think of an example: if a person knows what message is about to be sent to him, how much new information has he learned by receiving that message? None. This is all that the above statement is saying.

Entropy, E, is information content. The entropy of a source is inversely proportional to its probability of occurance:

E = -log P
We use the the log function because we are converting all sources into the binary alphabet, B.
The same rule applies to an element, e, in a message, M. Its entropy can be defined as:

E:e = -log (P:e)
P:e is the probability of an element in a message. It is equal to that element's frequency, F:e, divided by the frequency of the entire message, F:M:
P:e = F:e / F:M
The average information content or average entropy for a message, E:M, can now be defined. We know the entropy for each element in the message is E:e. We will index the elements, e, in a message, M, by assigning them integer positions, i, according to their place in the message. For example: the first element, e, in message M will be assigned 1, the second will be assigned 2...the i-th will be assigned i. Now we can define the entropy for an entire message, E:M, where there are i elements:

E:M = (P:1*E:1)+(P:2*E:2)+(P:3*E:3)+...+(P:i*E:i)
Entropy is an important concept to data compression. The entropy of an element (E:e) is the minimum number of bits needed to encode that element. The entropy of an entire message (E:M) is the minimum number of bits needed to encode the entire message with a lossless compression. The entropy of a message can be used to determine if data compression is worth attempting. It can also be used to evaluate the effectiveness of a compression. The number of bits in a compressed code can be compared to the entropy for that message (E:M) revealing how close to optimal compression one's compressed code is.

minnesota life insurance; life insurance; News of Israel israel news websites; Download butterfly screensavers; cialis; Guenstig bei Discontia.de: Sweatshirtjacken