
Computer Science Extended Essay

Research Question: To what extent are data compression techniques such as the Lempel Ziv Welch
algorithm, DEFLATE compression and DWT compression more effective in compressing TIFF
images with regard to data compression ratio, space-time complexity and efficiency?

Krises Maskey
1. Introduction

The following essay will focus on various algorithms used to compress data and compare them
according to speed, compression ratio, efficiency and practicality in cloud computing. Compression
is useful because large data can be encoded in fewer bits, which saves both time and money. This
essay will specifically look into Lempel Ziv Welch compression, DEFLATE compression and lossy
compression as used in cloud computing. Data compression is the basis of multimedia applications
around the world: without data compression algorithms it would not be practical to upload images,
audio, text and video to websites, and mobile phones would not be able to provide clear
telecommunication. Of the many data compression techniques currently in use, Lempel Ziv Welch,
DEFLATE and lossy compression are among the most common. These techniques allow large data
files to be compressed into fewer bits while keeping the space-time complexity, data compression
ratio and practicality constraints in mind.

Hence, the question: To what extent are data compression techniques such as the Lempel Ziv Welch
algorithm, DEFLATE compression and lossy compression more effective in cloud computing with
regard to data compression ratio, space-time complexity and efficiency?


2. Theory

2.1 Lossless compression


Data compression techniques are divided into two main categories, lossy and lossless. Even though
each compression algorithm uses different techniques to compress files, both categories have the
same goal: they search for duplicated data in the file and use algorithms to store it in a more
compact representation. Lossless data compression reduces size by identifying and eliminating
statistical redundancy; its major advantage is that no data is lost in the process. Lossy data
compression reduces size by discarding information judged unnecessary or imperceptible; its major
drawback is that this data is permanently lost.

Lossy data compression methods are mainly built on techniques such as the DCT (Discrete Cosine
Transform) and Vector Quantization. Lossless data compression methods include RLE (Run Length
Encoding), Huffman coding, string-table compression and LZW (Lempel Ziv Welch). Several other
compression techniques exist.

Lossless compression algorithms measure the compression result by comparing the size of the
compressed version with the size of the source file. To understand lossless compression better, the
following terms should be known; a short code sketch illustrating several of them follows the
definitions.

i. Compression ratio

Compression ratio is defined as the ratio of the output (compressed) file size to the input (original)
file size of a compression technique. For example, when a file is compressed and the result is three
times smaller than the original, the compression ratio is 1:3.


ii. Saving percentage 

This is the saving when a file is compressed shown as a percentage.

Saving percentage = (size before compression - size after compression) / size before compression × 100

iii. Computational complexity 

Computational complexity is defined as the amount of time and other resources an algorithm
requires to run. It is commonly expressed in O-notation, which describes how running time and
storage requirements grow with the size of the input. However, the behavior of compression
algorithms can vary considerably with the data being compressed, so computational complexity
alone does not give a fixed figure: the actual cost changes with the amount and nature of the input.

iv. Compression time 

Compression time is the time required to compress or decompress a file. Encoding and decoding
times are considered separately: for some applications decoding time is more important than
encoding time, while for other applications both are equally important.

v. Entropy 

Entropy is a measure of the information content of a source that is independent of the specific
characteristics of any particular compression system. When a compression algorithm is based on a
statistical model, entropy can be used as a theoretical bound to help make a useful quantifiable
judgement, as it indicates how much compression can be achieved at best.
vi. Redundancy 

In compression, redundancy is regarded as the repetition of data in a file; in other areas it is defined
as the difference between the actual probability distribution of the data and a uniform distribution.
When compressing data, higher redundancy generally allows higher compression to be achieved,
whereas lower redundancy results in a lower compression ratio.

vii. Overhead 

Overhead is the extra data added to the compressed version of the input, typically information that
is needed for decompression later on. Overhead can at times be large, but it should be much
smaller than the space saved by compression.
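
As a brief illustration of these terms, the following Python sketch computes the compression ratio,
saving percentage, entropy and compression time for an arbitrary byte string. The sample input and
the use of Python's zlib module as the compressor are illustrative assumptions, not the algorithms
compared later in this essay.

import math
import time
import zlib
from collections import Counter

def compression_ratio(original: bytes, compressed: bytes) -> float:
    """Ratio of compressed size to original size (output : input)."""
    return len(compressed) / len(original)

def saving_percentage(original: bytes, compressed: bytes) -> float:
    """(size before - size after) / size before * 100."""
    return (len(original) - len(compressed)) / len(original) * 100

def entropy_bits_per_byte(data: bytes) -> float:
    """Shannon entropy of the byte distribution: a theoretical lower
    bound on the average number of bits needed per byte."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

if __name__ == "__main__":
    data = b"BABAABAAA" * 1000           # illustrative input only
    start = time.perf_counter()
    packed = zlib.compress(data, 9)      # zlib (DEFLATE) used here as a stand-in compressor
    elapsed = time.perf_counter() - start
    print(f"ratio   = {compression_ratio(data, packed):.3f}")
    print(f"saving  = {saving_percentage(data, packed):.1f} %")
    print(f"entropy = {entropy_bits_per_byte(data):.3f} bits/byte")
    print(f"time    = {elapsed * 1e3:.2f} ms")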

2.2 Lempel Ziv Welch Data Compression

Lempel Ziv Welch (LZW) is named after Abraham Lempel, Jacob Ziv, and Terry Welch. It was
published by Welch in 1984 as an improved version of the LZ78 algorithm published by Lempel and
Ziv in 1978. Since then, LZW has become a very common compression algorithm, used mainly in
GIF images, PDF files and text. LZW is a type of lossless compression, meaning no data is lost
during compression. The algorithm is fairly simple to implement and has one of the highest
throughput potentials in hardware implementations. LZW is widely used to compress Unix files,
text and the GIF image format.

LZW works by searching for recurring patterns, that is, redundancy in the data, in order to save
space. It is often regarded as a fast technique for general-purpose data compression because of its
simplicity and versatility. The algorithm reads a sequence of symbols, groups the symbols into
strings, and converts the strings into codes; the codes take up less space than the strings they
replace.

a. LZW compression uses a code table, commonly with 4096 entries. Codes 0-255 in the code
table are always assigned to represent the single bytes of the input file.

b. At the start the code table contains only these first 256 entries; as encoding proceeds, the
remaining entries of the table are filled in. Compression is achieved when codes 256 through
4095 are used to represent sequences of bytes.

c. Encoding continues as LZW identifies repeated sequences in the data and adds them to the
code table.

d. The file can be decompressed by taking each code from the compressed file and substituting
it through the code table to determine which character or characters it represents.

The main idea of this compression technique is that, as the input data is processed, a dictionary
keeps a correspondence between the longest strings encountered so far and a list of code values.
These strings are then replaced by their corresponding codes, and in this way the input file is
compressed. Consequently, the efficiency of the algorithm increases as the number of repeated
strings in the input data increases.

Sample pseudocode (LZW decoding):

Initialize table with single-character strings
OLD = first input code
output translation of OLD
WHILE not end of input stream
    NEW = next input code
    IF NEW is not in the string table
        S = translation of OLD
        S = S + C
    ELSE
        S = translation of NEW
    output S
    C = first character of S
    add OLD + C to the string table
    OLD = NEW
END WHILE

Example: using the LZW algorithm to compress the string BABAABAAA.

The steps are shown in the diagram below.
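
To complement the worked example, the following is a minimal Python sketch of the LZW encoding
loop described above, assuming a table initialized with the 256 single-byte values; it omits practical
details such as capping the table at 4096 entries and packing the codes into variable-width bit fields.

def lzw_compress(data: bytes) -> list[int]:
    """Minimal LZW encoder: returns a list of integer codes."""
    # Codes 0-255 represent single bytes, as described above.
    table = {bytes([i]): i for i in range(256)}
    next_code = 256
    w = b""
    output = []
    for byte in data:
        wc = w + bytes([byte])
        if wc in table:
            w = wc                      # extend the current string
        else:
            output.append(table[w])     # emit the code of the longest match
            table[wc] = next_code       # add the new string to the table
            next_code += 1
            w = bytes([byte])
    if w:
        output.append(table[w])         # flush the final string
    return output

# Compressing the example string yields 6 codes for 9 input symbols:
# [66, 65, 256, 257, 65, 260]
print(lzw_compress(b"BABAABAAA"))
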
2.3 DEFLATE Data Compression

Deflate is a lossless data compression technique which combines the LZ77 algorithm with Huffman
coding. A DEFLATE stream consists of a series of blocks, each corresponding to a block of input
data. Each input block is compressed using a combination of the LZ77 compression algorithm and
Huffman coding. When the data is compressed, the LZ77 stage finds repeated substrings and
replaces them with backward references. A reference may point to a duplicated string occurring in
the same block or in preceding blocks, up to 32K input bytes back.

a. LZ77 compression

LZ77 was developed by Abraham Lempel and Jacob Ziv. It uses a sliding window divided into a
search buffer and a look-ahead buffer; the search buffer typically holds several thousand symbols
(for example 8192), while the look-ahead buffer holds only a few tens of symbols. The algorithm
can be described as follows: first, the longest prefix of the look-ahead buffer that starts in the
search buffer is found. This prefix is then encoded as a triple (a, b, c), where 'a' is the distance of
the beginning of the found prefix from the end of the search buffer, 'b' is the length of the prefix,
and 'c' is the first character after the prefix in the look-ahead buffer.

b. Huffman compression

This compression technique is based on the frequency of data items: the most frequent items are
encoded with a smaller number of bits. The algorithm builds a binary tree, known as the Huffman
tree, based on how often each item occurs. A Huffman algorithm starts by assigning each element
a 'weight', a number that represents its relative frequency within the dataset to be compressed.
These weights can be estimated, measured exactly from passes through the data, or some
combination of the two. The elements are then repeatedly selected two at a time, always choosing
the two with the lowest weights, and made the leaf nodes (or subtrees) of a new node with two
branches whose weight is their sum; a short code sketch of this construction follows below.
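
To make the weight-and-merge process in (b) concrete, here is a short Python sketch that builds a
Huffman code from byte frequencies using a priority queue. The function name and the tie-breaking
counter are illustrative choices; real DEFLATE implementations additionally use canonical,
length-limited Huffman codes.

import heapq
from collections import Counter

def huffman_codes(data: bytes) -> dict[int, str]:
    """Build a Huffman code (byte value -> bit string) from byte frequencies."""
    freq = Counter(data)
    # Each heap entry: (weight, tie_breaker, tree); a tree is either a
    # byte value (leaf) or a pair of subtrees (internal node).
    heap = [(w, i, sym) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    if len(heap) == 1:                       # degenerate one-symbol input
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # the two lowest-weight elements
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, counter, (left, right)))
        counter += 1
    codes: dict[int, str] = {}
    def walk(node, prefix=""):
        if isinstance(node, tuple):          # internal node: two branches
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: a byte value
            codes[node] = prefix
    walk(heap[0][2])
    return codes

print(huffman_codes(b"BABAABAAA"))  # e.g. {66: '0', 65: '1'} or the reverse
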

When LZ77 compression and Huffman coding are combined, the DEFLATE compression algorithm
is formed. Deflate is considered one of the most efficient and effective compression techniques.

The data is compressed in Deflate in the following way: first the compression is done with LZ77,
which is followed by Huffman coding. One pair of Huffman trees is already defined by the Deflate
specification itself, so when those fixed trees are used no extra space needs to be taken to store
them. In this process the data is initially broken up into 'blocks', and each block uses a single mode
of compression. If the compressor wants to switch from non-compressed storage to compression
with the trees defined by the specification, or to compression with specified Huffman trees, or to
compression with a different pair of Huffman trees, the current block must be ended and a new one
begun.
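
In practice DEFLATE is available through widely used libraries. The short sketch below uses
Python's built-in zlib module, which implements DEFLATE, to compress and decompress a byte
string and report the resulting sizes; the sample input is an arbitrary placeholder.

import zlib

data = b"BABAABAAA" * 500                # arbitrary, repetitive sample input

# zlib wraps DEFLATE; level 9 favors ratio over speed.
compressed = zlib.compress(data, 9)
restored = zlib.decompress(compressed)

assert restored == data                  # lossless: the round trip is exact
print(len(data), "->", len(compressed), "bytes")
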

2.4 DWT Data Compression

Discrete Wavelet Transform (DWT) is used in image compression; it decomposes an image into
sets of wavelet coefficients (sub-bands) rather than treating it pixel by pixel. The technique is used
in signal and image processing, including lossless image compression.

Four commonly used types of wavelet transform are:

• Haar wavelet transform

• Daubechies wavelet transform

• Symlet wavelet transform

• Biorthogonal wavelet transform

In this paper we will specifically look at the Haar wavelet transform. The Haar transform is one of
the simplest transforms for image compression: the process only requires calculating averages and
differences of adjacent pixels. The Haar DWT is also faster and more computationally efficient than
sinusoidal-based discrete transforms such as the DCT. Its major drawback is the trade-off between
image quality and energy compaction: its energy compaction is poorer than that of the DCT. As a
general rule of thumb, higher computational complexity tends to allow a higher compression ratio;
since the computational complexity of the Haar DWT is minimal, its achievable compression ratio
is correspondingly lower.

A single level of the Haar DWT operates row-wise on the image matrix: sums (averages) and
differences of consecutive elements in each row are computed and stored so that, if the matrix is
cut in half from top to bottom, the sums occupy one half and the differences the other. The same
operation is then applied column-wise, cutting the image in half from left to right and again storing
the sums in one half and the differences in the other. This process is repeated on the smaller
square sub-matrix containing the sums of sums (for power-of-two sizes), and the number of times
it is applied is referred to as the depth of the transform.
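
The following NumPy sketch shows one level of this row-then-column procedure on a small
power-of-two matrix. The scaling by one half (averages and half-differences) is one common
convention among several, so it should be read as an illustrative choice rather than the only
possible definition.

import numpy as np

def haar_level(block: np.ndarray) -> np.ndarray:
    """One level of the 2-D Haar DWT: rows first, then columns."""
    out = block.astype(float)

    # Row-wise: averages of adjacent pairs go to the left half,
    # half-differences go to the right half.
    a = (out[:, 0::2] + out[:, 1::2]) / 2
    d = (out[:, 0::2] - out[:, 1::2]) / 2
    out = np.hstack([a, d])

    # Column-wise: the same operation on adjacent row pairs,
    # sums to the top half, differences to the bottom half.
    a = (out[0::2, :] + out[1::2, :]) / 2
    d = (out[0::2, :] - out[1::2, :]) / 2
    return np.vstack([a, d])

image = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 "image"
print(haar_level(image))
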

Some properties of the Haar DWT are:

a. The Haar transform is real and orthogonal: Hr = Hr* and Hr^-1 = Hr^T. It is also a very fast
transform.

b. The basis vectors of the Haar matrix are sequency ordered.

c. A drawback of the Haar DWT is its poor energy compaction for images.

d. Orthogonality: the original signal is split into a low-frequency and a high-frequency part, and
filters that enable this splitting without duplicating information are said to be orthogonal.

e. Compact support: the magnitude response of the filter should be exactly zero outside the
frequency range covered by the transform. If this property is satisfied, the transform is energy
invariant.

f. Perfect reconstruction: if the input signal is transformed and inversely transformed using a set of
weighted basis functions, and the reproduced sample values are identical to those of the input
signal, the transform is said to have the perfect reconstruction property; a brief numerical check of
this property follows below. If, in addition, no information redundancy is present in the sampled
signal, the wavelet transform is, as stated above, orthonormal.
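
As an illustration of property (f), the sketch below applies one Haar level and its inverse and checks
that the original matrix is recovered exactly (to floating-point tolerance). The forward function
repeats the convention of the earlier sketch so that the example is self-contained.

import numpy as np

def haar_level(block: np.ndarray) -> np.ndarray:
    """One level of the 2-D Haar DWT (same convention as the earlier sketch)."""
    x = block.astype(float)
    x = np.hstack([(x[:, 0::2] + x[:, 1::2]) / 2, (x[:, 0::2] - x[:, 1::2]) / 2])
    return np.vstack([(x[0::2, :] + x[1::2, :]) / 2, (x[0::2, :] - x[1::2, :]) / 2])

def inverse_haar_level(coeffs: np.ndarray) -> np.ndarray:
    """Invert one Haar level: undo the column step, then the row step."""
    r, c = coeffs.shape
    # Column step used a = (x0 + x1)/2 and d = (x0 - x1)/2, so x0 = a + d, x1 = a - d.
    a, d = coeffs[: r // 2, :], coeffs[r // 2 :, :]
    tmp = np.empty_like(coeffs)
    tmp[0::2, :], tmp[1::2, :] = a + d, a - d
    # The row step is undone in the same way.
    a, d = tmp[:, : c // 2], tmp[:, c // 2 :]
    out = np.empty_like(coeffs)
    out[:, 0::2], out[:, 1::2] = a + d, a - d
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
# Transforming and inverse-transforming reproduces the input exactly.
assert np.allclose(inverse_haar_level(haar_level(image)), image)
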

2.5 Image Compression

An image can be defined as a two-dimensional function f(x, y), where x and y are the spatial (plane)
coordinates, and the amplitude of f at any pair of coordinates (x, y) is called the gray level or intensity
of the image at that point. An image is called a 'digital image' when x, y and the amplitude values of f
are all finite, discrete quantities. Stored images occupy space that grows as more images are kept, and
a large number of bits is required to represent even a single image; when images need to be stored or
transmitted it is often impractical to do so without reducing that number of bits. This is a familiar
problem today, with the many images stored on our smartphones taking up considerable storage.
Image compression was developed to overcome this issue: it is the process of reducing the amount of
data required to represent an image, achieved by removing redundant information from it. The three
types of redundancy to be removed are:

i. Coding redundancy

ii. Inter pixel redundancy

iii. Psycho visual redundancy


1. Coding redundancy: An image is a collection of pixels, and each pixel is represented by a number of
binary bits; the number of gray levels that can be represented depends on the number of bits used per
pixel.

a) To reduce coding redundancy we can use a variable-length code, i.e. vary the number of bits used to
represent each pixel value. The general idea is to use fewer bits for the more frequent gray levels and
more bits for the less frequent gray levels, so that the entire image is represented using the least
possible number of bits; a short sketch of this idea follows the list below.

2. Inter-pixel redundancy: In an image, each pixel is related to its neighboring pixels, which
duplicates information in the representation of the correlated pixels. Correlation of pixels means
that two or more pixel values are strongly dependent on one another. For example, when watching
a video with a high frame rate, successive frames contain almost the same information; similarly, in
still images, the higher the spatial resolution, the greater the inter-pixel redundancy.

3. Psycho-visual redundancy: Psycho-visual redundancy is information that is ignored by the human
eye or is unimportant to the perception of an image. Reducing it can compress the image further and
increase the compression ratio, although removing it makes the compression lossy rather than
lossless.
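
To illustrate the idea in 1(a), the sketch below compares the average bits per pixel of a fixed-length
code with that of a simple variable-length code for a toy four-level gray-scale histogram; the
probabilities and code words are invented purely for illustration.

# Toy histogram: probability of each of four gray levels in an image.
probs = {0: 0.60, 1: 0.25, 2: 0.10, 3: 0.05}

# Fixed-length coding needs 2 bits for four levels.
fixed_bits = 2

# A variable-length (prefix-free) code: shorter words for frequent levels.
var_code = {0: "0", 1: "10", 2: "110", 3: "111"}

avg_var_bits = sum(p * len(var_code[g]) for g, p in probs.items())
print(f"fixed: {fixed_bits:.2f} bits/pixel, variable: {avg_var_bits:.2f} bits/pixel")
# Output: fixed: 2.00 bits/pixel, variable: 1.55 bits/pixel
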

3. Hypothesis and Applied Theory


The theory behind all three compression algorithms has now been described and explained in a
reasonable amount of detail, so it is important to consider which of the three algorithms is most
efficient. Compression ratio, time complexity and computational complexity were brought up at the
beginning of this essay but were not applied when the three algorithms were explored. To find out,
an experiment will be carried out in which these quantities are measured for each algorithm and
then compared.

In this experiment all three algorithms will compress a TIFF-format image as input. As mentioned
earlier, the comparison covers only lossless compression techniques. The LZW algorithm, DEFLATE
and the Haar DWT will be compared on the basis of compression ratio, saving percentage and
relative compressibility. The complication in comparing these algorithms is that they rely on
different properties of the data, owing to their different roles in the field of data compression: LZW,
as mentioned earlier, is a dictionary-based compression technique which compresses an image by
removing its spatial redundancy; the DEFLATE algorithm compresses an image without reducing its
quality; whereas Haar DWT compression compresses an image according to its color redundancy.

The experiment will measure the compression time, the compression ratio and the complexity of
each of the three algorithms. By examining how the compression ratio varies with complexity, a
clear relationship between the image and the algorithm can be determined, and how this
relationship differs with the size of the input file can also be concluded from the experiment.

Therefore, I hypothesize that there will be a positive relationship between compression ratio and the
file size of the image, as described above. I also believe that LZW will be the most efficient of the
three algorithms when compressing TIFF images. Since efficiency will be the conclusive factor, the
time it takes each algorithm to compress a TIFF image will also be measured. The Canterbury
Corpus provides a good testbed for testing compression programs; see
http://corpus.canterbury.ac.nz for details.
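
As a sketch of how these measurements might be taken, the following Python outline times a
DEFLATE-style compression of one file and reports the metrics defined in section 2.1. The file path
is a placeholder, zlib stands in for the DEFLATE implementation, and the LZW and Haar DWT
measurements would be added alongside it in the same way.

import time
import zlib
from pathlib import Path

def measure_deflate(path: Path) -> dict[str, float]:
    """Time zlib (DEFLATE) on one file and report ratio, saving % and time."""
    data = path.read_bytes()
    start = time.perf_counter()
    compressed = zlib.compress(data, 9)
    elapsed = time.perf_counter() - start
    return {
        "ratio": len(compressed) / len(data),
        "saving_percent": (len(data) - len(compressed)) / len(data) * 100,
        "seconds": elapsed,
    }

if __name__ == "__main__":
    # Placeholder path: substitute a real TIFF image or a Canterbury Corpus file.
    print(measure_deflate(Path("sample.tiff")))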
