
Computer Science Extended Essay

Research Question: To what extent are data compression techniques such as the Lempel Ziv Welch
algorithm, DEFLATE compression and DWT compression more effective in compressing TIFF
images with regard to data compression ratio, space-time complexity and efficiency?

Krises Maskey
1. Introduction

The following essay will focus on various algorithms used to compress data and compare them
according to speed, compression ratio, efficiency and practicality in cloud computing. Compression
is useful because large data can be encoded in fewer bits, which saves both time and money. This
essay will specifically look into Lempel Ziv Welch compression, DEFLATE compression and lossy
compression as used in cloud computing. Data compression is the basis of multimedia applications
around the world: without data compression algorithms it would not be practical to upload images,
audio, text and video to websites, and mobile phones would not be able to provide clear
telecommunication. Of the many data compression techniques currently in use, Lempel Ziv Welch,
DEFLATE and lossy compression are among the most common. These techniques allow large data
files to be compressed into fewer bits while keeping the space-time complexity, data compression
ratio and practicality constraints in mind.

Hence, the question: To what extent are data compression techniques such as the Lempel Ziv Welch
algorithm, DEFLATE compression and lossy compression more effective in cloud computing with
regard to data compression ratio, space-time complexity and efficiency?


2. Theory

2.1 Lossless compression


Data compression techniques are divided into two main categories, lossy and lossless. Even though
each compression algorithm uses different techniques to compress files, both categories have the
same goal: they search for duplicated data in the file and use algorithms to store it in a more
compact representation. Lossless data compression reduces size by identifying and eliminating
statistical redundancy; its major advantage is that no data is lost in the process. Lossy data
compression reduces size by discarding information judged unnecessary or imperceptible; its major
drawback is that this data is permanently lost.

Lossy data compression methods are mainly built on techniques such as the DCT (Discrete Cosine
Transform) and Vector Quantization. Lossless data compression methods include RLE (Run Length
Encoding), Huffman coding, string-table compression and LZW (Lempel Ziv Welch). Several other
compression techniques exist.

Lossless compression algorithms measure the compression result by comparing the size of the
compressed version with the size of the source file. To understand lossless compression better, the
following terms should be known; a short code sketch illustrating several of them follows the
definitions.

i. Compression ratio

Compression ratio is defined as the ratio of the output (compressed) file size to the input (original)
file size of a compression technique. For example, when a file is compressed and the result is three
times smaller than the original, the compression ratio is 1:3.


ii. Saving percentage 

This is the saving when a file is compressed shown as a percentage.

Saving percentage = (size before compression - size after compression) / size before compression × 100

iii. Computational complexity 

Computational complexity is defined as the amount of time and other resources an algorithm
requires to run. It is commonly expressed in O-notation, which describes how running time and
storage requirements grow with the size of the input. However, the behavior of compression
algorithms can vary considerably with the data being compressed, so computational complexity
alone does not give a fixed figure: the actual cost changes with the amount and nature of the input.

iv. Compression time 

Compression time is the time required to compress or decompress a file. Encoding and decoding
times are considered separately: for some applications decoding time is more important than
encoding time, while for other applications both are equally important.

v. Entropy 

Entropy is a measure of the information content of a source that is independent of the specific
characteristics of any particular compression system. When a compression algorithm is based on a
statistical model, entropy can be used as a theoretical bound to help make a useful quantifiable
judgement, as it indicates how much compression can be achieved at best.
vi. Redundancy 

In compression, redundancy is regarded as the repetition of data in a file; in other areas it is defined
as the difference between the actual probability distribution of the data and a uniform distribution.
When compressing data, higher redundancy generally allows higher compression to be achieved,
whereas lower redundancy results in a lower compression ratio.

vii. Overhead 

Overhead is the extra data added to the compressed version of the input, typically information that
is needed for decompression later on. Overhead can at times be large, but it should be much
smaller than the space saved by compression.
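
As a brief illustration of these terms, the following Python sketch computes the compression ratio,
saving percentage, entropy and compression time for an arbitrary byte string. The sample input and
the use of Python's zlib module as the compressor are illustrative assumptions, not the algorithms
compared later in this essay.

import math
import time
import zlib
from collections import Counter

def compression_ratio(original: bytes, compressed: bytes) -> float:
    """Ratio of compressed size to original size (output : input)."""
    return len(compressed) / len(original)

def saving_percentage(original: bytes, compressed: bytes) -> float:
    """(size before - size after) / size before * 100."""
    return (len(original) - len(compressed)) / len(original) * 100

def entropy_bits_per_byte(data: bytes) -> float:
    """Shannon entropy of the byte distribution: a theoretical lower
    bound on the average number of bits needed per byte."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

if __name__ == "__main__":
    data = b"BABAABAAA" * 1000           # illustrative input only
    start = time.perf_counter()
    packed = zlib.compress(data, 9)      # zlib (DEFLATE) used here as a stand-in compressor
    elapsed = time.perf_counter() - start
    print(f"ratio   = {compression_ratio(data, packed):.3f}")
    print(f"saving  = {saving_percentage(data, packed):.1f} %")
    print(f"entropy = {entropy_bits_per_byte(data):.3f} bits/byte")
    print(f"time    = {elapsed * 1e3:.2f} ms")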

2.2 Lempel Ziv Welch Data Compression

Lempel Ziv Welch (LZW) is named after Abraham Lempel, Jacob Ziv, and Terry Welch. It was
published by Welch in 1984 as an improved version of the LZ78 algorithm published by Lempel and
Ziv in 1978. Since then, LZW has become a very common compression algorithm, used mainly in
GIF images, PDF files and text. LZW is a type of lossless compression, meaning no data is lost
during compression. The algorithm is fairly simple to implement and has one of the highest
throughput potentials in hardware implementations. LZW is widely used to compress Unix files,
text and the GIF image format.

LZW works by searching for recurring patterns, that is, redundancy in the data, in order to save
space. It is often regarded as a fast technique for general-purpose data compression because of its
simplicity and versatility. The algorithm reads a sequence of symbols, groups the symbols into
strings, and converts the strings into codes; the codes take up less space than the strings they
replace.

a. LZW compression uses a code table, commonly with 4096 entries. Codes 0-255 in the code
table are always assigned to represent the single bytes of the input file.

b. At the start the code table contains only these first 256 entries; as encoding proceeds, the
remaining entries of the table are filled in. Compression is achieved when codes 256 through
4095 are used to represent sequences of bytes.

c. Encoding continues as LZW identifies repeated sequences in the data and adds them to the
code table.

d. The file can be decompressed by taking each code from the compressed file and substituting
it through the code table to determine which character or characters it represents.

The main idea of this compression technique is that, as the input data is processed, a dictionary
keeps a correspondence between the longest strings encountered so far and a list of code values.
These strings are then replaced by their corresponding codes, and in this way the input file is
compressed. Consequently, the efficiency of the algorithm increases as the number of repeated
strings in the input data increases.

Sample pseudocode (LZW decoding):

Initialize table with single-character strings
OLD = first input code
output translation of OLD
WHILE not end of input stream
    NEW = next input code
    IF NEW is not in the string table
        S = translation of OLD
        S = S + C
    ELSE
        S = translation of NEW
    output S
    C = first character of S
    add OLD + C to the string table
    OLD = NEW
END WHILE

Example: using the LZW algorithm to compress the string BABAABAAA.

The steps are shown in the diagram below.
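
To complement the worked example, the following is a minimal Python sketch of the LZW encoding
loop described above, assuming a table initialized with the 256 single-byte values; it omits practical
details such as capping the table at 4096 entries and packing the codes into variable-width bit fields.

def lzw_compress(data: bytes) -> list[int]:
    """Minimal LZW encoder: returns a list of integer codes."""
    # Codes 0-255 represent single bytes, as described above.
    table = {bytes([i]): i for i in range(256)}
    next_code = 256
    w = b""
    output = []
    for byte in data:
        wc = w + bytes([byte])
        if wc in table:
            w = wc                      # extend the current string
        else:
            output.append(table[w])     # emit the code of the longest match
            table[wc] = next_code       # add the new string to the table
            next_code += 1
            w = bytes([byte])
    if w:
        output.append(table[w])         # flush the final string
    return output

# Compressing the example string yields 6 codes for 9 input symbols:
# [66, 65, 256, 257, 65, 260]
print(lzw_compress(b"BABAABAAA"))
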
2.3 DEFLATE Data Compression

Deflate is a lossless data compression technique which combines the LZ77 algorithm with Huffman
coding. A DEFLATE stream consists of a series of blocks, each corresponding to a block of input
data. Each input block is compressed using a combination of the LZ77 compression algorithm and
Huffman coding. When the data is compressed, the LZ77 stage finds repeated substrings and
replaces them with backward references. A reference may point to a duplicated string occurring in
the same block or in preceding blocks, up to 32K input bytes back.

a. LZ77 compression

LZ77 was developed by Abraham Lempel and Jacob Ziv. It uses a sliding window divided into a
search buffer and a look-ahead buffer; the search buffer typically holds several thousand symbols
(for example 8192), while the look-ahead buffer holds only a few tens of symbols. The algorithm
can be described as follows: first, the longest prefix of the look-ahead buffer that starts in the
search buffer is found. This prefix is then encoded as a triple (a, b, c), where 'a' is the distance of
the beginning of the found prefix from the end of the search buffer, 'b' is the length of the prefix,
and 'c' is the first character after the prefix in the look-ahead buffer.

b. Huffman compression

This compression technique is based on the frequency of data items: the most frequent items are
encoded with a smaller number of bits. The algorithm builds a binary tree, known as the Huffman
tree, based on how often each item occurs. A Huffman algorithm starts by assigning each element
a 'weight', a number that represents its relative frequency within the dataset to be compressed.
These weights can be estimated, measured exactly from passes through the data, or some
combination of the two. The elements are then repeatedly selected two at a time, always choosing
the two with the lowest weights, and made the leaf nodes (or subtrees) of a new node with two
branches whose weight is their sum; a short code sketch of this construction follows below.
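
To make the weight-and-merge process in (b) concrete, here is a short Python sketch that builds a
Huffman code from byte frequencies using a priority queue. The function name and the tie-breaking
counter are illustrative choices; real DEFLATE implementations additionally use canonical,
length-limited Huffman codes.

import heapq
from collections import Counter

def huffman_codes(data: bytes) -> dict[int, str]:
    """Build a Huffman code (byte value -> bit string) from byte frequencies."""
    freq = Counter(data)
    # Each heap entry: (weight, tie_breaker, tree); a tree is either a
    # byte value (leaf) or a pair of subtrees (internal node).
    heap = [(w, i, sym) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    if len(heap) == 1:                       # degenerate one-symbol input
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # the two lowest-weight elements
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, counter, (left, right)))
        counter += 1
    codes: dict[int, str] = {}
    def walk(node, prefix=""):
        if isinstance(node, tuple):          # internal node: two branches
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: a byte value
            codes[node] = prefix
    walk(heap[0][2])
    return codes

print(huffman_codes(b"BABAABAAA"))  # e.g. {66: '0', 65: '1'} or the reverse
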

When LZ77 compression and Huffman coding are combined, the DEFLATE compression algorithm
is formed. Deflate is considered one of the most efficient and effective compression techniques.

The data is compressed in Deflate in the following way: first the compression is done with LZ77,
which is followed by Huffman coding. One pair of Huffman trees is already defined by the Deflate
specification itself, so when those fixed trees are used no extra space needs to be taken to store
them. In this process the data is initially broken up into 'blocks', and each block uses a single mode
of compression. If the compressor wants to switch from non-compressed storage to compression
with the trees defined by the specification, or to compression with specified Huffman trees, or to
compression with a different pair of Huffman trees, the current block must be ended and a new one
begun.
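
In practice DEFLATE is available through widely used libraries. The short sketch below uses
Python's built-in zlib module, which implements DEFLATE, to compress and decompress a byte
string and report the resulting sizes; the sample input is an arbitrary placeholder.

import zlib

data = b"BABAABAAA" * 500                # arbitrary, repetitive sample input

# zlib wraps DEFLATE; level 9 favors ratio over speed.
compressed = zlib.compress(data, 9)
restored = zlib.decompress(compressed)

assert restored == data                  # lossless: the round trip is exact
print(len(data), "->", len(compressed), "bytes")
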

2.4 DWT Data Compression

Discrete Wavelet Transform (DWT) is used in image compression; it decomposes an image into
sets of wavelet coefficients (sub-bands) rather than treating it pixel by pixel. The technique is used
in signal and image processing, including lossless image compression.

Four commonly used types of wavelet transform are:

• Haar wavelet transform

• Daubechies wavelet transform

• Symlet wavelet transform

• Biorthogonal wavelet transform

In this paper we will specifically look at the Haar wavelet transform. The Haar transform is one of
the simplest transforms for image compression: the process only requires calculating averages and
differences of adjacent pixels. The Haar DWT is also faster and more computationally efficient than
sinusoidal-based discrete transforms such as the DCT. Its major drawback is the trade-off between
image quality and energy compaction: its energy compaction is poorer than that of the DCT. As a
general rule of thumb, higher computational complexity tends to allow a higher compression ratio;
since the computational complexity of the Haar DWT is minimal, its achievable compression ratio
is correspondingly lower.

A single level of the Haar DWT operates row-wise on the image matrix: sums (averages) and
differences of consecutive elements in each row are computed and stored so that, if the matrix is
cut in half from top to bottom, the sums occupy one half and the differences the other. The same
operation is then applied column-wise, cutting the image in half from left to right and again storing
the sums in one half and the differences in the other. This process is repeated on the smaller
square sub-matrix containing the sums of sums (for power-of-two sizes), and the number of times
it is applied is referred to as the depth of the transform.
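
The following NumPy sketch shows one level of this row-then-column procedure on a small
power-of-two matrix. The scaling by one half (averages and half-differences) is one common
convention among several, so it should be read as an illustrative choice rather than the only
possible definition.

import numpy as np

def haar_level(block: np.ndarray) -> np.ndarray:
    """One level of the 2-D Haar DWT: rows first, then columns."""
    out = block.astype(float)

    # Row-wise: averages of adjacent pairs go to the left half,
    # half-differences go to the right half.
    a = (out[:, 0::2] + out[:, 1::2]) / 2
    d = (out[:, 0::2] - out[:, 1::2]) / 2
    out = np.hstack([a, d])

    # Column-wise: the same operation on adjacent row pairs,
    # sums to the top half, differences to the bottom half.
    a = (out[0::2, :] + out[1::2, :]) / 2
    d = (out[0::2, :] - out[1::2, :]) / 2
    return np.vstack([a, d])

image = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 "image"
print(haar_level(image))
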

Some properties of the Haar DWT are:

a. The Haar transform is real and orthogonal: Hr = Hr* and Hr^-1 = Hr^T. It is also a very fast
transform.

b. The basis vectors of the Haar matrix are sequency ordered.

c. A drawback of the Haar DWT is its poor energy compaction for images.

d. Orthogonality: the original signal is split into a low-frequency and a high-frequency part, and
filters that enable this splitting without duplicating information are said to be orthogonal.

e. Compact support: the magnitude response of the filter should be exactly zero outside the
frequency range covered by the transform. If this property is satisfied, the transform is energy
invariant.

f. Perfect reconstruction: if the input signal is transformed and inversely transformed using a set of
weighted basis functions, and the reproduced sample values are identical to those of the input
signal, the transform is said to have the perfect reconstruction property; a brief numerical check of
this property follows below. If, in addition, no information redundancy is present in the sampled
signal, the wavelet transform is, as stated above, orthonormal.
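
As an illustration of property (f), the sketch below applies one Haar level and its inverse and checks
that the original matrix is recovered exactly (to floating-point tolerance). The forward function
repeats the convention of the earlier sketch so that the example is self-contained.

import numpy as np

def haar_level(block: np.ndarray) -> np.ndarray:
    """One level of the 2-D Haar DWT (same convention as the earlier sketch)."""
    x = block.astype(float)
    x = np.hstack([(x[:, 0::2] + x[:, 1::2]) / 2, (x[:, 0::2] - x[:, 1::2]) / 2])
    return np.vstack([(x[0::2, :] + x[1::2, :]) / 2, (x[0::2, :] - x[1::2, :]) / 2])

def inverse_haar_level(coeffs: np.ndarray) -> np.ndarray:
    """Invert one Haar level: undo the column step, then the row step."""
    r, c = coeffs.shape
    # Column step used a = (x0 + x1)/2 and d = (x0 - x1)/2, so x0 = a + d, x1 = a - d.
    a, d = coeffs[: r // 2, :], coeffs[r // 2 :, :]
    tmp = np.empty_like(coeffs)
    tmp[0::2, :], tmp[1::2, :] = a + d, a - d
    # The row step is undone in the same way.
    a, d = tmp[:, : c // 2], tmp[:, c // 2 :]
    out = np.empty_like(coeffs)
    out[:, 0::2], out[:, 1::2] = a + d, a - d
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
# Transforming and inverse-transforming reproduces the input exactly.
assert np.allclose(inverse_haar_level(haar_level(image)), image)
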

2.5 Image Compression

An image can be defined as a two-dimensional function f(x, y), where x and y are the spatial (plane)
coordinates, and the amplitude of f at any pair of coordinates (x, y) is called the gray level or intensity
of the image at that point. An image is called a 'digital image' when x, y and the amplitude values of f
are all finite, discrete quantities. Stored images occupy space that grows as more images are kept, and
a large number of bits is required to represent even a single image; when images need to be stored or
transmitted it is often impractical to do so without reducing that number of bits. This is a familiar
problem today, with the many images stored on our smartphones taking up considerable storage.
Image compression was developed to overcome this issue: it is the process of reducing the amount of
data required to represent an image, achieved by removing redundant information from it. The three
types of redundancy to be removed are:

i. Coding redundancy

ii. Inter pixel redundancy

iii. Psycho visual redundancy


1. Coding redundancy: An image is a collection of pixels, and each pixel is represented by a number of
binary bits; the number of gray levels that can be represented depends on the number of bits used per
pixel.

a) To reduce coding redundancy we can use a variable-length code, i.e. vary the number of bits used to
represent each pixel value. The general idea is to use fewer bits for the more frequent gray levels and
more bits for the less frequent gray levels, so that the entire image is represented using the least
possible number of bits; a short sketch of this idea follows the list below.

2. Inter-pixel redundancy: In an image, each pixel is related to its neighboring pixels, which
duplicates information in the representation of the correlated pixels. Correlation of pixels means
that two or more pixel values are strongly dependent on one another. For example, when watching
a video with a high frame rate, successive frames contain almost the same information; similarly, in
still images, the higher the spatial resolution, the greater the inter-pixel redundancy.

3. Psycho-visual redundancy: Psycho-visual redundancy is information that is ignored by the human
eye or is unimportant to the perception of an image. Reducing it can compress the image further and
increase the compression ratio, although removing it makes the compression lossy rather than
lossless.
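
To illustrate the idea in 1(a), the sketch below compares the average bits per pixel of a fixed-length
code with that of a simple variable-length code for a toy four-level gray-scale histogram; the
probabilities and code words are invented purely for illustration.

# Toy histogram: probability of each of four gray levels in an image.
probs = {0: 0.60, 1: 0.25, 2: 0.10, 3: 0.05}

# Fixed-length coding needs 2 bits for four levels.
fixed_bits = 2

# A variable-length (prefix-free) code: shorter words for frequent levels.
var_code = {0: "0", 1: "10", 2: "110", 3: "111"}

avg_var_bits = sum(p * len(var_code[g]) for g, p in probs.items())
print(f"fixed: {fixed_bits:.2f} bits/pixel, variable: {avg_var_bits:.2f} bits/pixel")
# Output: fixed: 2.00 bits/pixel, variable: 1.55 bits/pixel
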

3. Hypothesis and Applied Theory


The theory behind all three compression algorithms has now been described and explained in a
reasonable amount of detail, so it is important to consider which of the three algorithms is most
efficient. Compression ratio, time complexity and computational complexity were brought up at the
beginning of this essay but were not applied when the three algorithms were explored. To find out,
an experiment will be carried out in which these quantities are measured for each algorithm and
then compared.

In this experiment all three algorithms will compress a TIFF-format image as input. As mentioned
earlier, the comparison covers only lossless compression techniques. The LZW algorithm, DEFLATE
and the Haar DWT will be compared on the basis of compression ratio, saving percentage and
relative compressibility. The complication in comparing these algorithms is that they rely on
different properties of the data, owing to their different roles in the field of data compression: LZW,
as mentioned earlier, is a dictionary-based compression technique which compresses an image by
removing its spatial redundancy; the DEFLATE algorithm compresses an image without reducing its
quality; whereas Haar DWT compression compresses an image according to its color redundancy.

The experiment will measure the compression time, the compression ratio and the complexity of
each of the three algorithms. By examining how the compression ratio varies with complexity, a
clear relationship between the image and the algorithm can be determined, and how this
relationship differs with the size of the input file can also be concluded from the experiment.

Therefore, I hypothesize that there will be a positive relationship between compression ratio and the
file size of the image, as described above. I also believe that LZW will be the most efficient of the
three algorithms when compressing TIFF images. Since efficiency will be the conclusive factor, the
time it takes each algorithm to compress a TIFF image will also be measured. The Canterbury
Corpus provides a good testbed for testing compression programs; see
http://corpus.canterbury.ac.nz for details.
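
As a sketch of how these measurements might be taken, the following Python outline times a
DEFLATE-style compression of one file and reports the metrics defined in section 2.1. The file path
is a placeholder, zlib stands in for the DEFLATE implementation, and the LZW and Haar DWT
measurements would be added alongside it in the same way.

import time
import zlib
from pathlib import Path

def measure_deflate(path: Path) -> dict[str, float]:
    """Time zlib (DEFLATE) on one file and report ratio, saving % and time."""
    data = path.read_bytes()
    start = time.perf_counter()
    compressed = zlib.compress(data, 9)
    elapsed = time.perf_counter() - start
    return {
        "ratio": len(compressed) / len(data),
        "saving_percent": (len(data) - len(compressed)) / len(data) * 100,
        "seconds": elapsed,
    }

if __name__ == "__main__":
    # Placeholder path: substitute a real TIFF image or a Canterbury Corpus file.
    print(measure_deflate(Path("sample.tiff")))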
