Sei sulla pagina 1di 2

Disk storage systems can often be the most expensive components of a database solution.

For large warehouses or databases with huge volumes of data, the cost of the storage subsystem can easily exceed the combined cost of the hardware server and the data server software. Therefore, even a small reduction in the storage subsystem can result in substantial cost savings for the entire database solution. Data compression has an undeserved reputation for being difficult to master, hard to implement, and tough to maintain. In fact, the techniques used in the previously mentioned programs are relatively simple, and can be implemented with standard utilities taking only a few lines of code. This article discusses a good all-purpose data compression technique: LempelZiv-Welch, or LZW compression. The routines shown here belong in any programmer's toolbox. For example, a program that has a few dozen help screens could easily chop 50K bytes off by compressing the screens. Or 500K bytes of software could be distributed to end users on a single 360K byte floppy disk. Highly redundant database files can be compressed down to 10% of their original size. Once the tools are available, the applications for compression will show up on a regular basis.

LZW compression replaces strings of characters with single codes. It does not do any analysis of the incoming text. Instead, it just adds every new string of characters it sees to a table of strings. Compression occurs when a single code is output instead of a string of characters. So, any code in the dictionary must have a string in the compressed buffer for the first time only, that means the dictionary just keeps a reference to uncompressed strings in the compressed buffer, no need to let the dictionary take a copy of these strings. So, it is not necessary to preserve the dictionary to decode the LZW data stream. This can save quite a bit of space when storing the LZWencoded data. The code that the LZW algorithm outputs can be of any arbitrary length, but it must have more bits in it than a single character. The first 256 codes (when using eight bit characters) are, by default, assigned to the standard character set. The remaining codes are assigned to strings as the algorithm proceeds. The sample program runs as shown with 12 bit to 31 bit codes. This means, codes 0-255 refer to individual bytes, while codes 256-2^n refer to substrings, where n equal to number of bits per code, as in the following figure.

Potrebbero piacerti anche