
Extreme DXT Compression

Peter Uličiansky, Cauldron, Ltd.

Overview
Simple, highly optimized algorithm
Uses SSE2 and SSSE3 for maximum performance
Quality comparable to the Real-Time DXT Compression algorithm
Performance roughly 300% of the Real-Time DXT Compression algorithm

What's identical to Real-Time DXT Compression
  o Only non-transparent compression scheme for DXT1
  o Only six intermediate alpha values compression scheme for DXT5
  o Uses bounding box method for representative color and alpha values
Computes color and alpha indices by division (fixed-point multiplication)
  o Uses lookup tables for color/alpha dividers

ColorIndex = 4 * ((R - Rmin) + (G - Gmin) + (B - Bmin)) / ((Rmax - Rmin) + (Gmax - Gmin) + (Bmax - Bmin))

AlphaIndex = 8 * (A - Amin) / (Amax - Amin)

(A scalar sketch of this index computation follows the overview list below.)

Converts natural index ordering to DXT index ordering by lookup tables
  o Tightly packs natural indices first
  o Then converts four color indices at once / two alpha indices at once
Just two functions (CompressImageDXT1, CompressImageDXT5)
  o Saves function call overhead
No comparisons, jumps, or loops (except the height/width loops)
Processes two 4x4 blocks at once
  o Better utilization of registers
  o Hides instruction latency in some places
  o No need to extract a block first
Constant/temporary data: just 24 * 16 = 384 bytes
Lookup tables: just 3072 + 1024 + 256 + 1280 = 5632 bytes
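As a rough scalar illustration of the ColorIndex/AlphaIndex formulas and of the divider lookup tables (this is only a sketch of the arithmetic, not the SIMD path the compressor actually takes; the function and table names below are made up for the example):

#include <cstdint>

// 1.15 fixed-point reciprocals, mirroring the idea behind COLOR_DIVIDER_TABLE:
// reciprocal[range] is roughly 32768 / (range + 1).
static uint16_t reciprocal[768];

void PrepareReciprocalSketch()
{
    for (int i = 0; i < 768; i++)
        reciprocal[i] = (uint16_t)((1 << 15) / (i + 1));
}

// ColorIndex = 4 * ((R-Rmin)+(G-Gmin)+(B-Bmin)) / ((Rmax-Rmin)+(Gmax-Gmin)+(Bmax-Bmin)),
// evaluated without a divide: (8*delta * (32768/(range+1))) >> 16 is about 4*delta/range.
// AlphaIndex = 8 * (A-Amin) / (Amax-Amin) is handled the same way with its own table.
int ColorIndexSketch(int deltaSum, int rangeSum)
{
    return (8 * deltaSum * reciprocal[rangeSum]) >> 16;
}

Because the table divides by range + 1, the result can never overflow the 2-bit color (or, in the alpha case, the 3-bit alpha) index range.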

Although some parts of the DXT1/DXT5 compression algorithms are identical, different instruction ordering is crucial for maximum performance. The code is optimized for the Core 2 Duo, so Pentium 4 performance is not optimal (there is not much point in optimizing for the Pentium 4 these days).

Color Compression Comparison

Original image

Extreme DXT Comp.

Real-Time DXT Comp.

Alpha Compression Comparison

Original image

Extreme DXT Comp.

Real-Time DXT Comp.

Performance
The 256x256 texture graphs show the maximum possible performance of the algorithms (all of the data used fits in the cache and is already prepared there). The 4096x4096 texture graphs show more realistic performance (the source data does not fit in the cache, or is not already there).

The 256x256 Lena image was used for the 256x256 texture performance tests. The same image was tiled 16x16 to create the 4096x4096 texture for the 4096x4096 texture performance tests.

The blue channel was replicated into the alpha channel for the DXT5 tests. The DXT1 compression produces correct results regardless of the alpha information in the source texture and never outputs transparent pixels.
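For anyone reproducing this test setup, the texture preparation could look roughly like the following (a sketch only; it assumes 32-bit BGRA pixels and is not taken from the original test harness):

#include <cstdint>
#include <cstring>
#include <vector>

// Copy the blue channel into the alpha channel, as was done for the DXT5 tests
// (assumes BGRA byte order, so blue is byte 0 and alpha is byte 3).
void ReplicateBlueToAlpha(uint8_t* pixels, int width, int height)
{
    for (int i = 0; i < width * height; i++)
        pixels[i * 4 + 3] = pixels[i * 4 + 0];
}

// Tile a 256x256 source image 16x16 times into a 4096x4096 texture.
std::vector<uint8_t> Tile16x16(const uint8_t* src256)
{
    std::vector<uint8_t> dst(4096 * 4096 * 4);
    for (int y = 0; y < 4096; y++) {
        const uint8_t* srcRow = src256 + (y % 256) * 256 * 4;
        for (int tile = 0; tile < 16; tile++)
            std::memcpy(&dst[(y * 4096 + tile * 256) * 4], srcRow, 256 * 4);
    }
    return dst;
}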

The Algorithm
Read 4x4 pixel block (movdqa)
Pixel03 Pixel13 Pixel23 Pixel33
Pixel02 Pixel12 Pixel22 Pixel32
Pixel01 Pixel11 Pixel21 Pixel31
Pixel00 Pixel10 Pixel20 Pixel30

Compute bounding box and store minimum (movdqa, pmaxub, pminub, pshufd)
Max Max Max Max
Min Min Min Min
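An intrinsics sketch of the block load and bounding-box reduction for a single 4x4 block (the actual code is inline assembly and processes two blocks at once; the row pitch parameter and unaligned loads here are assumptions, the original uses aligned movdqa):

#include <cstdint>
#include <emmintrin.h> // SSE2

// Load one 4x4 block of 32-bit pixels and reduce it to a per-channel max and min
// pixel using pmaxub/pminub plus the same pshufd shuffles (0x4E, 0xB1) as the slide.
void BlockBoundingBox(const uint8_t* src, int pitchBytes, __m128i* outMax, __m128i* outMin)
{
    __m128i row0 = _mm_loadu_si128((const __m128i*)(src + 0 * pitchBytes));
    __m128i row1 = _mm_loadu_si128((const __m128i*)(src + 1 * pitchBytes));
    __m128i row2 = _mm_loadu_si128((const __m128i*)(src + 2 * pitchBytes));
    __m128i row3 = _mm_loadu_si128((const __m128i*)(src + 3 * pitchBytes));

    __m128i maxv = _mm_max_epu8(_mm_max_epu8(row0, row1), _mm_max_epu8(row2, row3));
    __m128i minv = _mm_min_epu8(_mm_min_epu8(row0, row1), _mm_min_epu8(row2, row3));

    // Horizontal reduction: fold the upper 8 bytes onto the lower 8, then the odd
    // dword onto the even dword, so every 32-bit lane ends up holding the block max/min.
    maxv = _mm_max_epu8(maxv, _mm_shuffle_epi32(maxv, 0x4E));
    minv = _mm_min_epu8(minv, _mm_shuffle_epi32(minv, 0x4E));
    maxv = _mm_max_epu8(maxv, _mm_shuffle_epi32(maxv, 0xB1));
    minv = _mm_min_epu8(minv, _mm_shuffle_epi32(minv, 0xB1));

    *outMax = maxv;
    *outMin = minv;
}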

Compute and store range (movdqa, punpcklbw, psubw, movq)


Range Range Range Range Range Range Range Range

Inset bounding box and interleave max/min values (psrlw, psubw, paddw, punpcklwd)
Min Max Min Max Min Max Min Max
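A sketch of the range and inset steps with intrinsics, for a single block (the parameter and output names are made up for the example):

#include <emmintrin.h> // SSE2

// Widen the max/min pixels to 16 bits, compute the per-channel range, and inset the
// bounding box by range/16 toward its center (the psrlw-by-4 in the step above).
void RangeAndInset(__m128i maxPixel, __m128i minPixel,
                   __m128i* outRange, __m128i* outMax16, __m128i* outMin16)
{
    __m128i zero  = _mm_setzero_si128();
    __m128i max16 = _mm_unpacklo_epi8(maxPixel, zero);
    __m128i min16 = _mm_unpacklo_epi8(minPixel, zero);

    __m128i range = _mm_sub_epi16(max16, min16);
    __m128i inset = _mm_srli_epi16(range, 4);

    *outRange = range;                       // kept around for the divider lookup later
    *outMax16 = _mm_sub_epi16(max16, inset); // pull the max endpoint inward
    *outMin16 = _mm_add_epi16(min16, inset); // push the min endpoint inward
}

Insetting the endpoints is the same bounding-box refinement used by Real-Time DXT Compression, as noted in the overview.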

Shift and mask max/min values as needed in the DXT block (pmullw, pand, movdqa)
Min Max Min Max Min Max Min Max

Pack and store max/min values to the DXT block (mov, shr, or)
Min Max Min Max
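A scalar equivalent of the shift/mask/pack steps above, which quantize the two inset endpoints to RGB565 and write them at the start of the DXT1 block (the inline assembly does this for both endpoints at once with a pmullw/pand pair; the helper names here are illustrative only):

#include <cstdint>

// Quantize an 8-bit RGB endpoint to the RGB565 format stored in a DXT1 block.
static uint16_t PackRGB565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

// Write color0 (the max endpoint) first and color1 (the min endpoint) second; keeping
// color0 greater than color1 selects the opaque four-color DXT1 mode.
void StoreDXT1Endpoints(uint8_t* block, uint16_t maxColor, uint16_t minColor)
{
    block[0] = (uint8_t)(maxColor & 0xFF);
    block[1] = (uint8_t)(maxColor >> 8);
    block[2] = (uint8_t)(minColor & 0xFF);
    block[3] = (uint8_t)(minColor >> 8);
}

The source's optional FIX_DXT1_BUG path appears to guard the degenerate case where the two packed endpoints come out equal.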

Load 4x4 pixel block again, subtract minimum, prepare for the division (SSSE3: movdqa, psubb, pmaddubsw, phaddw) (SSE2: movdqa, psubb, pand, pmaddwd, psrlw, psllw, paddw, packssdw)
DXT1
8(R+G+B)13 8(R+G+B)12 8(R+G+B)11 8(R+G+B)10 8(R+G+B)03 8(R+G+B)02 8(R+G+B)01 8(R+G+B)00
8(R+G+B)33 8(R+G+B)32 8(R+G+B)31 8(R+G+B)30 8(R+G+B)23 8(R+G+B)22 8(R+G+B)21 8(R+G+B)20

DXT5
8A03 8A13 8A23 8A33 8(R+G+B)03 8(R+G+B)13 8(R+G+B)23 8(R+G+B)33
8A02 8A12 8A22 8A32 8(R+G+B)02 8(R+G+B)12 8(R+G+B)22 8(R+G+B)32
8A01 8A11 8A21 8A31 8(R+G+B)01 8(R+G+B)11 8(R+G+B)21 8(R+G+B)31
8A00 8A10 8A20 8A30 8(R+G+B)00 8(R+G+B)10 8(R+G+B)20 8(R+G+B)30
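The SSSE3 flavor of this step can be sketched with intrinsics for a single row of four min-subtracted pixels (BGRA byte order is an assumption; the 8x scaling matches the 8(R+G+B) values shown above):

#include <tmmintrin.h> // SSSE3

// pmaddubsw with byte weights {8, 8, 8, 0} produces 8*(B+G) and 8*R in adjacent words,
// and phaddw folds those pairs so each word holds 8*(R+G+B) for one pixel.
__m128i ScaledColorSums(__m128i rowMinusMin)
{
    const __m128i weights = _mm_set1_epi32(0x00080808); // bytes {8, 8, 8, 0} per pixel
    __m128i pairs = _mm_maddubs_epi16(rowMinusMin, weights);
    return _mm_hadd_epi16(pairs, pairs); // low four words: 8*(R+G+B) for pixels 0..3
}

For DXT5 the alpha channel is scaled by 8 in the same pass, giving the 8A / 8(R+G+B) layout shown above.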

Prepare dividers according to the range (mov, add, or, movd, pshufd)
DXT1
ColorDivider ColorDivider ColorDivider ColorDivider ColorDivider ColorDivider ColorDivider ColorDivider

DXT5
AlphaDivider ColorDivider AlphaDivider ColorDivider AlphaDivider ColorDivider AlphaDivider ColorDivider
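A sketch of the divider preparation for the DXT1 case (each COLOR_DIVIDER_TABLE entry duplicates the reciprocal in both halves of the DWORD, so a single movd + pshufd broadcast fills every word lane; the wrapper below is illustrative and declares the table as uint32_t in place of the source's DWORD):

#include <cstdint>
#include <emmintrin.h>

extern uint32_t COLOR_DIVIDER_TABLE[768]; // both 16-bit halves hold 32768 / (rangeSum + 1)

// rangeSum is the block's (Rmax-Rmin)+(Gmax-Gmin)+(Bmax-Bmin), read back from the stored
// range values; the broadcast puts ColorDivider into all eight word lanes.
__m128i BroadcastColorDivider(int rangeSum)
{
    __m128i d = _mm_cvtsi32_si128((int)COLOR_DIVIDER_TABLE[rangeSum]);
    return _mm_shuffle_epi32(d, 0x00);
}

For DXT5 the alpha reciprocal is OR-ed into the high word before the broadcast, which is what produces the alternating AlphaDivider/ColorDivider lanes shown above.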

Perform the division (fixed point multiplication) to get indices (pmulhw)


DXT1
ColorIndex13 ColorIndex12 ColorIndex11 ColorIndex10 ColorIndex03 ColorIndex02 ColorIndex01 ColorIndex00
ColorIndex33 ColorIndex32 ColorIndex31 ColorIndex30 ColorIndex23 ColorIndex22 ColorIndex21 ColorIndex20

DXT5
AlphaIndex03 ColorIndex03 AlphaIndex02 ColorIndex02 AlphaIndex01 ColorIndex01 AlphaIndex00 ColorIndex00
AlphaIndex13 ColorIndex13 AlphaIndex12 ColorIndex12 AlphaIndex11 ColorIndex11 AlphaIndex10 ColorIndex10
AlphaIndex23 ColorIndex23 AlphaIndex22 ColorIndex22 AlphaIndex21 ColorIndex21 AlphaIndex20 ColorIndex20
AlphaIndex33 ColorIndex33 AlphaIndex32 ColorIndex32 AlphaIndex31 ColorIndex31 AlphaIndex30 ColorIndex30
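The division step itself is a single signed high multiply per lane; a sketch, with the values and divider laid out as in the diagrams above:

#include <emmintrin.h>

// values holds 8*(R+G+B) or 8*A per 16-bit lane and divider holds the matching
// 32768/(range+1) (or 65536/(range+1) for alpha) reciprocal, so the high 16 bits of the
// product are roughly 4*delta/range for color or 8*delta/range for alpha, i.e. the
// natural index of each pixel.
__m128i ComputeNaturalIndices(__m128i values, __m128i divider)
{
    return _mm_mulhi_epi16(values, divider); // pmulhw: (values * divider) >> 16
}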

Pack indices together and store them to the temporary buffer (SSSE3: packuswb, pshufb, pmaddubsw, pmaddwd, movdqa) (SSE2: pshuflw, pshufhw, pmaddwd, packssdw, movdqa)
DXT1
ColorIndex3330 ColorIndex2320 ColorIndex1310 ColorIndex0300

DXT5
AlphaIndex1310 AlphaIndex3330 ColorIndex1310 ColorIndex3330 AlphaIndex0300 AlphaIndex2320 ColorIndex0300 ColorIndex2320

Convert packed indices to final DXT indices and store them to the DXT block (mov, or)
Set3 Set2 Set1 Set0 Min Max
Set3 Set2 Set1 Set0 Min Max
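A scalar sketch of the final remapping for one row of four color indices (the table itself is built by PrepareColorIndicesTable in the source below; ALPHA_INDICES_TABLE plays the same role for groups of alpha indices, and BYTE is written as uint8_t here):

#include <cstdint>

extern uint8_t COLOR_INDICES_TABLE[256]; // built from COLOR_INDEX[] = {1, 3, 2, 0}

// Pack four natural 2-bit indices (0 = closest to min ... 3 = closest to max) into one
// byte, then remap them to DXT1 ordering (0 = max color, 1 = min color, 2 and 3 = the
// interpolated colors) with a single table lookup.
uint8_t PackAndRemapColorRow(int i0, int i1, int i2, int i3)
{
    uint8_t packed = (uint8_t)(i0 | (i1 << 2) | (i2 << 4) | (i3 << 6));
    return COLOR_INDICES_TABLE[packed];
}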

/*************************************************************************************************************
Extreme DXT Compression
Copyright (C) 2008 Cauldron, Ltd.
Written by Peter Uličiansky

Microsoft Public License (Ms-PL)

This license governs use of the accompanying software. If you use the software, you accept this license. If you do not accept the license, do not use the software.

1. Definitions
The terms "reproduce," "reproduction," "derivative works," and "distribution" have the same meaning here as under U.S. copyright law.
A "contribution" is the original software, or any additions or changes to the software.
A "contributor" is any person that distributes its contribution under this license.
"Licensed patents" are a contributor's patent claims that read directly on its contribution.

2. Grant of Rights
(A) Copyright Grant- Subject to the terms of this license, including the license conditions and limitations in section 3, each contributor grants you a non-exclusive, worldwide, royalty-free copyright license to reproduce its contribution, prepare derivative works of its contribution, and distribute its contribution or any derivative works that you create.
(B) Patent Grant- Subject to the terms of this license, including the license conditions and limitations in section 3, each contributor grants you a non-exclusive, worldwide, royalty-free license under its licensed patents to make, have made, use, sell, offer for sale, import, and/or otherwise dispose of its contribution in the software or derivative works of the contribution in the software.

3. Conditions and Limitations
(A) No Trademark License- This license does not grant you rights to use any contributors' name, logo, or trademarks.
(B) If you bring a patent claim against any contributor over patents that you claim are infringed by the software, your patent license from such contributor to the software ends automatically.
(C) If you distribute any portion of the software, you must retain all copyright, patent, trademark, and attribution notices that are present in the software.
(D) If you distribute any portion of the software in source code form, you may do so only under this license by including a complete copy of this license with your distribution. If you distribute any portion of the software in compiled or object code form, you may only do so under a license that complies with this license.
(E) The software is licensed "as-is." You bear the risk of using it. The contributors give no express warranties, guarantees, or conditions. You may have additional consumer rights under your local laws which this license cannot change. To the extent permitted under your local laws, the contributors exclude the implied warranties of merchantability, fitness for a particular purpose and non-infringement.
*************************************************************************************************************/

DWORD COLOR_DIVIDER_TABLE[768];
DWORD ALPHA_DIVIDER_TABLE[256];
BYTE  COLOR_INDICES_TABLE[256];
WORD  ALPHA_INDICES_TABLE[640];

__declspec(align(16)) const BYTE SSE2_BYTE_0         [1 * 16] = {0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00};
__declspec(align(16)) const BYTE SSE2_WORD_1         [1 * 16] = {0x01,0x00,0x01,0x00,0x01,0x00,0x01,0x00,0x01,0x00,0x01,0x00,0x01,0x00,0x01,0x00};
__declspec(align(16)) const BYTE SSE2_WORD_8         [1 * 16] = {0x08,0x00,0x08,0x00,0x08,0x00,0x08,0x00,0x08,0x00,0x08,0x00,0x08,0x00,0x08,0x00};
__declspec(align(16)) const BYTE SSE2_BOUNDS_MASK    [1 * 16] = {0x00,0x1F,0x00,0x1F,0xE0,0x07,0xE0,0x07,0x00,0xF8,0x00,0xF8,0x00,0xFF,0xFF,0x00};
__declspec(align(16)) const BYTE SSE2_BOUNDS_SCALE   [1 * 16] = {0x20,0x00,0x20,0x00,0x08,0x00,0x08,0x00,0x00,0x01,0x00,0x01,0x00,0x01,0x01,0x00};
__declspec(align(16)) const BYTE SSE2_INDICES_MASK_0 [1 * 16] = {0xFF,0x00,0xFF,0x00,0xFF,0x00,0xFF,0x00,0xFF,0x00,0xFF,0x00,0xFF,0x00,0xFF,0x00};
__declspec(align(16)) const BYTE SSE2_INDICES_MASK_1 [1 * 16] = {0x00,0xFF,0x00,0x00,0x00,0xFF,0x00,0x00,0x00,0xFF,0x00,0x00,0x00,0xFF,0x00,0x00};
__declspec(align(16)) const BYTE SSE2_INDICES_MASK_2 [1 * 16] = {0x08,0x08,0x08,0x00,0x08,0x08,0x08,0x00,0x08,0x08,0x08,0x00,0x08,0x08,0x08,0x00};
__declspec(align(16)) const BYTE SSE2_INDICES_SCALE_0[1 * 16] = {0x01,0x00,0x04,0x00,0x10,0x00,0x40,0x00,0x01,0x00,0x04,0x00,0x10,0x00,0x40,0x00};
__declspec(align(16)) const BYTE SSE2_INDICES_SCALE_1[1 * 16] = {0x01,0x00,0x04,0x00,0x01,0x00,0x08,0x00,0x10,0x00,0x40,0x00,0x00,0x01,0x00,0x08};
__declspec(align(16)) const BYTE SSE2_INDICES_SCALE_2[1 * 16] = {0x01,0x04,0x10,0x40,0x01,0x04,0x10,0x40,0x01,0x04,0x10,0x40,0x01,0x04,0x10,0x40};
__declspec(align(16)) const BYTE SSE2_INDICES_SCALE_3[1 * 16] = {0x01,0x04,0x01,0x04,0x01,0x08,0x01,0x08,0x01,0x04,0x01,0x04,0x01,0x08,0x01,0x08};
__declspec(align(16)) const BYTE SSE2_INDICES_SCALE_4[1 * 16] = {0x01,0x00,0x10,0x00,0x01,0x00,0x00,0x01,0x01,0x00,0x10,0x00,0x01,0x00,0x00,0x01};
__declspec(align(16)) const BYTE SSE2_INDICES_SHUFFLE[1 * 16] = {0x00,0x02,0x04,0x06,0x01,0x03,0x05,0x07,0x08,0x0A,0x0C,0x0E,0x09,0x0B,0x0D,0x0F};

__declspec(align(16)) BYTE sse2_minimum[2 * 16];
__declspec(align(16)) BYTE sse2_range  [2 * 16];
__declspec(align(16)) BYTE sse2_bounds [2 * 16];
__declspec(align(16)) BYTE sse2_indices[4 * 16];

void CompressImageDXT1(const BYTE* argb, BYTE* dxt1, int width, int height) { int x_count; int y_count; __asm { mov esi, DWORD PTR argb // src mov edi, DWORD PTR dxt1 // dst mov mov y_loop: mov mov x_loop: mov lea movdqa movdqa movdqa pmaxub pmaxub pmaxub pminub pminub pminub pshufd pshufd pmaxub pminub pshufd pshufd pmaxub pminub movdqa movdqa movdqa pmaxub pmaxub pmaxub pminub pminub pminub pshufd pshufd pmaxub pminub pshufd pshufd pmaxub pminub movdqa movdqa movdqa punpcklbw punpcklbw punpcklbw punpcklbw movdqa movdqa psubw psubw movq movq psrlw psrlw psubw psubw paddw paddw punpcklwd pmullw pand movdqa eax, DWORD PTR height DWORD PTR y_count, eax eax, DWORD PTR width DWORD PTR x_count, eax eax, ebx, DWORD PTR width DWORD PTR [eax + eax*2] // width * 1 // width * 3 // src + width * // src + width * 0 + 4 + 0 0 0 0 0 0

xmm0, XMMWORD PTR [esi + 0] xmm3, XMMWORD PTR [esi + eax*4 + 0] xmm1, xmm0 xmm0, xmm3 xmm0, XMMWORD PTR [esi + eax*8 + 0] xmm0, XMMWORD PTR [esi + ebx*4 + 0] xmm1, xmm3 xmm1, XMMWORD PTR [esi + eax*8 + 0] xmm1, XMMWORD PTR [esi + ebx*4 + 0] xmm2, xmm0, 0x4E xmm3, xmm1, 0x4E xmm0, xmm2 xmm1, xmm3 xmm2, xmm0, 0xB1 xmm3, xmm1, 0xB1 xmm0, xmm2 xmm1, xmm3 xmm4, XMMWORD PTR [esi + 16] xmm7, XMMWORD PTR [esi + eax*4 + 16] xmm5, xmm4 xmm4, xmm7 xmm4, XMMWORD PTR [esi + eax*8 + 16] xmm4, XMMWORD PTR [esi + ebx*4 + 16] xmm5, xmm7 xmm5, XMMWORD PTR [esi + eax*8 + 16] xmm5, XMMWORD PTR [esi + ebx*4 + 16] xmm6, xmm4, 0x4E xmm7, xmm5, 0x4E xmm4, xmm6 xmm5, xmm7 xmm6, xmm4, 0xB1 xmm7, xmm5, 0xB1 xmm4, xmm6 xmm5, xmm7 XMMWORD PTR sse2_minimum[ 0], xmm1 XMMWORD PTR sse2_minimum[16], xmm5 xmm7, XMMWORD PTR SSE2_BYTE_0 xmm0, xmm7 xmm4, xmm7 xmm1, xmm7 xmm5, xmm7 xmm2, xmm0 xmm6, xmm4 xmm2, xmm1 xmm6, xmm5 MMWORD PTR sse2_range[ 0], xmm2 MMWORD PTR sse2_range[16], xmm6 xmm2, 4 xmm6, 4 xmm0, xmm2 xmm4, xmm6 xmm1, xmm2 xmm5, xmm6 xmm0, xmm1 xmm0, XMMWORD PTR SSE2_BOUNDS_SCALE xmm0, XMMWORD PTR SSE2_BOUNDS_MASK XMMWORD PTR sse2_bounds[ 0], xmm0

// src + width * 8 + // src + width * 12 + // src + width * 8 + // src + width * 12 +

// src + width * // src + width *

0 + 16 4 + 16

// src + width * 8 + 16 // src + width * 12 + 16 // src + width * 8 + 16 // src + width * 12 + 16

punpcklwd pmullw pand movdqa movzx movzx mov mov shr shr or or or or mov mov add add add add mov mov #ifdef FIX_DXT1_BUG movzx xor cmovz movzx xor cmovz #endif // FIX_DXT1_BUG mov lea movdqa movdqa movdqa psubb psubb movdqa movdqa psubb psubb #ifdef USE_SSSE3 movd pshufd movdqa pmaddubsw pmaddubsw phaddw pmaddubsw pmaddubsw phaddw pmulhw pmulhw packuswb pmaddubsw pmaddwd movdqa #else // USE_SSSE3 movdqa movdqa movdqa movdqa pand pand pmaddwd pmaddwd pand pand

xmm4, xmm5 xmm4, XMMWORD PTR SSE2_BOUNDS_SCALE xmm4, XMMWORD PTR SSE2_BOUNDS_MASK XMMWORD PTR sse2_bounds[16], xmm4 ecx, edx, eax, ebx, eax, ebx, eax, ebx, eax, ebx, DWORD DWORD cx, dx, cx, dx, ecx, edx, eax, ax, ecx, ebx, bx, edx, eax, ebx, xmm0, xmm1, xmm7, xmm0, xmm1, xmm2, xmm3, xmm2, xmm3, WORD PTR sse2_range [ 0] WORD PTR sse2_range [16] DWORD PTR sse2_bounds[ 0] DWORD PTR sse2_bounds[16] 8 8 DWORD PTR sse2_bounds[ 4] DWORD PTR sse2_bounds[20] DWORD PTR sse2_bounds[ 8] DWORD PTR sse2_bounds[24] PTR [edi + 0], eax PTR [edi + 8], ebx WORD WORD WORD WORD DWORD DWORD WORD WORD eax WORD WORD ebx PTR PTR PTR PTR PTR PTR sse2_range [ 2] sse2_range [18] sse2_range [ 4] sse2_range [20] COLOR_DIVIDER_TABLE[ecx*4] COLOR_DIVIDER_TABLE[edx*4] 0] 2]

PTR [edi + PTR [edi +

PTR [edi + 8] PTR [edi + 10]

DWORD PTR width DWORD PTR [eax + eax*2] XMMWORD XMMWORD XMMWORD xmm7 xmm7 XMMWORD XMMWORD xmm7 xmm7 PTR [esi + 0] PTR [esi + eax*4 + 0] PTR sse2_minimum[ 0] PTR [esi + eax*8 + PTR [esi + ebx*4 + 0] 0]

// width * 1 // width * 3 // src + width * // src + width * 0 + 4 + 0 0

// src + width * 8 + // src + width * 12 +

0 0

xmm7, ecx xmm7, xmm7, 0x00 xmm6, XMMWORD PTR SSE2_INDICES_MASK_2 xmm0, xmm1, xmm0, xmm2, xmm3, xmm2, xmm6 xmm6 xmm1 xmm6 xmm6 xmm3

xmm0, xmm7 xmm2, xmm7 xmm0, xmm2 xmm0, XMMWORD PTR SSE2_INDICES_SCALE_2 xmm0, XMMWORD PTR SSE2_WORD_1 XMMWORD PTR sse2_indices[ 0], xmm0 xmm4, xmm5, xmm6, xmm7, xmm0, xmm1, xmm0, xmm1, xmm4, xmm5, xmm0 xmm1 XMMWORD XMMWORD xmm6 xmm6 XMMWORD XMMWORD xmm7 xmm7

PTR SSE2_INDICES_MASK_0 PTR SSE2_INDICES_MASK_1 PTR SSE2_WORD_8 PTR SSE2_WORD_8

psrlw psrlw paddw paddw movdqa movdqa pand pand pmaddwd pmaddwd pand pand psrlw psrlw paddw paddw movd pshufd packssdw pmulhw pmaddwd packssdw pmulhw pmaddwd packssdw pmaddwd movdqa #endif // USE_SSSE3 movdqa movdqa movdqa psubb psubb movdqa movdqa psubb psubb #ifdef USE_SSSE3 movd pshufd movdqa pmaddubsw pmaddubsw pmaddubsw pmaddubsw phaddw phaddw pmulhw pmulhw packuswb pmaddubsw pmaddwd movdqa #else // USE_SSSE3 movdqa movdqa movdqa movdqa pand pand psrlw psrlw pand pand pmaddwd pmaddwd paddw paddw movdqa movdqa pand pand

xmm4, xmm5, xmm0, xmm1, xmm4, xmm5, xmm2, xmm3, xmm2, xmm3, xmm4, xmm5, xmm4, xmm5, xmm2, xmm3,

5 5 xmm4 xmm5 xmm2 xmm3 xmm6 xmm6 XMMWORD PTR SSE2_WORD_8 XMMWORD PTR SSE2_WORD_8 xmm7 xmm7 5 5 xmm4 xmm5

xmm7, ecx xmm7, xmm7, 0x00 xmm0, xmm1 xmm0, xmm7 xmm0, XMMWORD PTR SSE2_INDICES_SCALE_0 xmm2, xmm3 xmm2, xmm7 xmm2, XMMWORD PTR SSE2_INDICES_SCALE_0 xmm0, xmm2 xmm0, XMMWORD PTR SSE2_WORD_1 XMMWORD PTR sse2_indices[ 0], xmm0 xmm0, xmm1, xmm7, xmm0, xmm1, xmm2, xmm3, xmm2, xmm3, XMMWORD XMMWORD XMMWORD xmm7 xmm7 XMMWORD XMMWORD xmm7 xmm7 PTR [esi + 16] PTR [esi + eax*4 + 16] PTR sse2_minimum[16] PTR [esi + eax*8 + 16] PTR [esi + ebx*4 + 16] // src + width * // src + width * 0 + 16 4 + 16

// src + width * 8 + 16 // src + width * 12 + 16

xmm7, edx xmm7, xmm7, 0x00 xmm6, XMMWORD PTR SSE2_INDICES_MASK_2 xmm0, xmm2, xmm1, xmm3, xmm0, xmm2, xmm6 xmm6 xmm6 xmm6 xmm1 xmm3

xmm0, xmm7 xmm2, xmm7 xmm0, xmm2 xmm0, XMMWORD PTR SSE2_INDICES_SCALE_2 xmm0, XMMWORD PTR SSE2_WORD_1 XMMWORD PTR sse2_indices[32], xmm0 xmm4, xmm5, xmm6, xmm7, xmm4, xmm5, xmm4, xmm5, xmm0, xmm1, xmm0, xmm1, xmm0, xmm1, xmm4, xmm5, xmm4, xmm5, xmm0 xmm1 XMMWORD XMMWORD xmm7 xmm7 5 5 xmm6 xmm6 XMMWORD XMMWORD xmm4 xmm5 xmm2 xmm3 xmm7 xmm7

PTR SSE2_INDICES_MASK_0 PTR SSE2_INDICES_MASK_1

PTR SSE2_WORD_8 PTR SSE2_WORD_8

psrlw psrlw pand pand pmaddwd pmaddwd paddw paddw movd pshufd packssdw pmulhw pmaddwd packssdw pmulhw pmaddwd packssdw pmaddwd movdqa #endif // USE_SSSE3 movzx movzx mov mov mov mov movzx movzx mov mov mov mov movzx movzx mov mov mov mov movzx movzx mov mov mov mov add add sub jnz mov lea lea sub jnz } }

xmm4, xmm5, xmm2, xmm3, xmm2, xmm3, xmm2, xmm3,

5 5 xmm6 xmm6 XMMWORD PTR SSE2_WORD_8 XMMWORD PTR SSE2_WORD_8 xmm4 xmm5

xmm7, edx xmm7, xmm7, 0x00 xmm0, xmm1 xmm0, xmm7 xmm0, XMMWORD PTR SSE2_INDICES_SCALE_0 xmm2, xmm3 xmm2, xmm7 xmm2, XMMWORD PTR SSE2_INDICES_SCALE_0 xmm0, xmm2 xmm0, XMMWORD PTR SSE2_WORD_1 XMMWORD PTR sse2_indices[32], xmm0 eax, ebx, cl, ch, BYTE BYTE eax, ebx, dl, dh, BYTE BYTE eax, ebx, cl, ch, BYTE BYTE eax, ebx, dl, dh, BYTE BYTE esi, edi, BYTE BYTE BYTE BYTE PTR PTR BYTE BYTE BYTE BYTE PTR PTR BYTE BYTE BYTE BYTE PTR PTR BYTE BYTE BYTE BYTE PTR PTR 32 16 PTR sse2_indices[ 0] PTR sse2_indices[ 4] PTR COLOR_INDICES_TABLE[eax*1 PTR COLOR_INDICES_TABLE[ebx*1 [edi + 4], cl [edi + 5], ch PTR sse2_indices[ 8] PTR sse2_indices[12] PTR COLOR_INDICES_TABLE[eax*1 PTR COLOR_INDICES_TABLE[ebx*1 [edi + 6], dl [edi + 7], dh PTR sse2_indices[32] PTR sse2_indices[36] PTR COLOR_INDICES_TABLE[eax*1 PTR COLOR_INDICES_TABLE[ebx*1 [edi + 12], cl [edi + 13], ch PTR sse2_indices[40] PTR sse2_indices[44] PTR COLOR_INDICES_TABLE[eax*1 PTR COLOR_INDICES_TABLE[ebx*1 [edi + 14], dl [edi + 15], dh

+ +

0] 0]

+ +

0] 0]

+ +

0] 0]

+ +

0] 0]

// src += 32 // dst += 16

DWORD PTR x_count, 8 x_loop eax, ebx, esi, DWORD PTR width DWORD PTR [eax + eax*2] DWORD PTR [esi + ebx*4] // width * 1 // width * 3 // src += width * 12

DWORD PTR y_count, 4 y_loop

void CompressImageDXT5(const BYTE* argb, BYTE* dxt5, int width, int height) { int x_count; int y_count; __asm { mov esi, DWORD PTR argb // src mov edi, DWORD PTR dxt5 // dst mov mov y_loop: mov mov eax, DWORD PTR height DWORD PTR y_count, eax eax, DWORD PTR width DWORD PTR x_count, eax

x_loop: mov lea movdqa movdqa movdqa pmaxub pminub pmaxub pminub pmaxub pminub pshufd pmaxub pshufd pminub pshufd pmaxub pshufd pminub movdqa movdqa movdqa pmaxub pminub pmaxub pminub pmaxub pminub pshufd pmaxub pshufd pminub pshufd pmaxub pshufd pminub movdqa movdqa movdqa punpcklbw punpcklbw punpcklbw punpcklbw movdqa movdqa psubw psubw movq movq psrlw psrlw psubw psubw paddw paddw punpcklwd pmullw pand movdqa punpcklwd pmullw pand movdqa mov mov shr shr movzx movzx mov mov

eax, ebx,

DWORD PTR width DWORD PTR [eax + eax*2]

// width * 1 // width * 3 // src + width * // src + width * 0 + 4 + 0 0

xmm0, XMMWORD PTR [esi + 0] xmm3, XMMWORD PTR [esi + eax*4 + 0] xmm1, xmm0 xmm0, xmm3 xmm1, xmm3 xmm0, XMMWORD PTR [esi + eax*8 + 0] xmm1, XMMWORD PTR [esi + eax*8 + 0] xmm0, XMMWORD PTR [esi + ebx*4 + 0] xmm1, XMMWORD PTR [esi + ebx*4 + 0] xmm2, xmm0, 0x4E xmm0, xmm2 xmm3, xmm1, 0x4E xmm1, xmm3 xmm2, xmm0, 0xB1 xmm0, xmm2 xmm3, xmm1, 0xB1 xmm1, xmm3 xmm4, XMMWORD PTR [esi + 16] xmm7, XMMWORD PTR [esi + eax*4 + 16] xmm5, xmm4 xmm4, xmm7 xmm5, xmm7 xmm4, XMMWORD PTR [esi + eax*8 + 16] xmm5, XMMWORD PTR [esi + eax*8 + 16] xmm4, XMMWORD PTR [esi + ebx*4 + 16] xmm5, XMMWORD PTR [esi + ebx*4 + 16] xmm6, xmm4, 0x4E xmm4, xmm6 xmm7, xmm5, 0x4E xmm5, xmm7 xmm6, xmm4, 0xB1 xmm4, xmm6 xmm7, xmm5, 0xB1 xmm5, xmm7 XMMWORD PTR sse2_minimum[ 0], xmm1 XMMWORD PTR sse2_minimum[16], xmm5 xmm7, XMMWORD PTR SSE2_BYTE_0 xmm0, xmm7 xmm4, xmm7 xmm1, xmm7 xmm5, xmm7 xmm2, xmm0 xmm6, xmm4 xmm2, xmm1 xmm6, xmm5 MMWORD PTR sse2_range[ 0], xmm2 MMWORD PTR sse2_range[16], xmm6 xmm2, 4 xmm6, 4 xmm0, xmm2 xmm4, xmm6 xmm1, xmm2 xmm5, xmm6 xmm0, xmm1 xmm0, XMMWORD PTR SSE2_BOUNDS_SCALE xmm0, XMMWORD PTR SSE2_BOUNDS_MASK XMMWORD PTR sse2_bounds[ 0], xmm0 xmm4, xmm5 xmm4, XMMWORD PTR SSE2_BOUNDS_SCALE xmm4, XMMWORD PTR SSE2_BOUNDS_MASK XMMWORD PTR sse2_bounds[16], xmm4 eax, ebx, eax, ebx, ecx, edx, DWORD DWORD DWORD PTR sse2_bounds[ 0] DWORD PTR sse2_bounds[16] 8 8 WORD PTR sse2_bounds[13] WORD PTR sse2_bounds[29] PTR [edi + 0], ecx PTR [edi + 16], edx

// // // //

src src src src

+ + + +

width width width width

* 8 + * 8 + * 12 + * 12 +

0 0 0 0

// src + width * // src + width *

0 + 16 4 + 16

// // // //

src src src src

+ + + +

width width width width

* 8 + * 8 + * 12 + * 12 +

16 16 16 16

or or or or mov mov movzx movzx add add add add movzx movzx #ifdef FIX_DXT5_BUG movzx xor cmovz movzx xor cmovz #endif // FIX_DXT5_BUG movzx movzx mov mov or or mov lea movdqa movdqa movdqa psubb psubb movdqa psubb movdqa psubb movdqa movdqa movdqa movdqa pand pand psrlw psrlw pmaddwd pmaddwd psllw psllw paddw paddw movdqa movdqa pand pand psrlw psrlw pmaddwd pmaddwd psllw psllw paddw paddw #ifdef USE_SSSE3 movd pshufd movdqa movdqa

eax, ebx, eax, ebx, DWORD DWORD ecx, edx, cx, dx, cx, dx, ecx, edx, eax, ax, ecx, ebx, bx, edx, eax, ebx, eax, ebx, ecx, edx, eax, ebx, xmm0, xmm1, xmm7, xmm0, xmm1, xmm2, xmm2, xmm3, xmm3, xmm6, xmm7, xmm4, xmm5, xmm0, xmm1, xmm4, xmm5, xmm0, xmm1, xmm4, xmm5, xmm0, xmm1, xmm4, xmm5, xmm2, xmm3, xmm4, xmm5, xmm2, xmm3, xmm4, xmm5, xmm2, xmm3, xmm7, xmm7, xmm5, xmm6,

DWORD PTR sse2_bounds[ 4] DWORD PTR sse2_bounds[20] DWORD PTR sse2_bounds[ 8] DWORD PTR sse2_bounds[24] PTR [edi + 8], eax PTR [edi + 24], ebx WORD WORD WORD WORD WORD WORD WORD WORD WORD WORD eax WORD WORD ebx WORD WORD DWORD DWORD eax ebx PTR PTR PTR PTR PTR PTR PTR PTR sse2_range [ 0] sse2_range [16] sse2_range [ 2] sse2_range [18] sse2_range [ 4] sse2_range [20] COLOR_DIVIDER_TABLE[ecx*4] COLOR_DIVIDER_TABLE[edx*4]

PTR [edi + 8] PTR [edi + 10] PTR [edi + 24] PTR [edi + 26]

PTR PTR PTR PTR

sse2_range [ 6] sse2_range [22] ALPHA_DIVIDER_TABLE[eax*4] ALPHA_DIVIDER_TABLE[ebx*4]

DWORD PTR width DWORD PTR [eax + eax*2] XMMWORD XMMWORD XMMWORD xmm7 xmm7 XMMWORD xmm7 XMMWORD xmm7 PTR [esi + 0] PTR [esi + eax*4 + 0] PTR sse2_minimum[ 0] PTR [esi + eax*8 + PTR [esi + ebx*4 + 0] 0]

// width * 1 // width * 3 // src + width * // src + width * 0 + 4 + 0 0

// src + width *

8 +

0 0

// src + width * 12 +

XMMWORD PTR SSE2_INDICES_MASK_0 XMMWORD PTR SSE2_WORD_8 xmm0 xmm1 xmm6 xmm6 8 8 xmm7 xmm7 3 3 xmm4 xmm5 xmm2 xmm3 xmm6 xmm6 8 8 xmm7 xmm7 3 3 xmm4 xmm5 ecx xmm7, 0x00 XMMWORD PTR SSE2_INDICES_SCALE_3 XMMWORD PTR SSE2_INDICES_SCALE_4

pmulhw pmulhw pmulhw pmulhw packuswb pshufb pmaddubsw pmaddwd movdqa packuswb pshufb pmaddubsw pmaddwd movdqa #else // USE_SSSE3 movd pshufd pmulhw pmulhw pshuflw pshufhw pshuflw pshufhw movdqa pmaddwd pmaddwd packssdw pshuflw pshufhw pmaddwd movdqa pmulhw pmulhw pshuflw pshufhw pshuflw pshufhw pmaddwd pmaddwd packssdw pshuflw pshufhw pmaddwd movdqa #endif // USE_SSSE3 movdqa movdqa movdqa psubb psubb movdqa psubb movdqa psubb movdqa movdqa movdqa movdqa pand pand pmaddwd pmaddwd psrlw psrlw psllw psllw paddw paddw movdqa movdqa pand pand pmaddwd pmaddwd

xmm0, xmm7 xmm1, xmm7 xmm2, xmm7 xmm3, xmm7 xmm0, xmm1 xmm0, XMMWORD PTR SSE2_INDICES_SHUFFLE xmm0, xmm5 xmm0, xmm6 XMMWORD PTR sse2_indices[ 0], xmm0 xmm2, xmm3 xmm2, XMMWORD PTR SSE2_INDICES_SHUFFLE xmm2, xmm5 xmm2, xmm6 XMMWORD PTR sse2_indices[16], xmm2 xmm7, ecx xmm7, xmm7, 0x00 xmm0, xmm7 xmm1, xmm7 xmm0, xmm0, 0xD8 xmm0, xmm0, 0xD8 xmm1, xmm1, 0xD8 xmm1, xmm1, 0xD8 xmm6, XMMWORD PTR SSE2_INDICES_SCALE_1 xmm0, xmm6 xmm1, xmm6 xmm0, xmm1 xmm0, xmm0, 0xD8 xmm0, xmm0, 0xD8 xmm0, XMMWORD PTR SSE2_WORD_1 XMMWORD PTR sse2_indices[ 0], xmm0 xmm2, xmm7 xmm3, xmm7 xmm2, xmm2, 0xD8 xmm2, xmm2, 0xD8 xmm3, xmm3, 0xD8 xmm3, xmm3, 0xD8 xmm2, xmm6 xmm3, xmm6 xmm2, xmm3 xmm2, xmm2, 0xD8 xmm2, xmm2, 0xD8 xmm2, XMMWORD PTR SSE2_WORD_1 XMMWORD PTR sse2_indices[16], xmm2 xmm0, xmm1, xmm7, xmm0, xmm1, xmm2, xmm2, xmm3, xmm3, xmm6, xmm7, xmm4, xmm5, xmm0, xmm1, xmm0, xmm1, xmm4, xmm5, xmm4, xmm5, xmm0, xmm1, xmm4, xmm5, xmm2, xmm3, xmm2, xmm3, XMMWORD XMMWORD XMMWORD xmm7 xmm7 XMMWORD xmm7 XMMWORD xmm7 PTR [esi + 16] PTR [esi + eax*4 + 16] PTR sse2_minimum[16] PTR [esi + eax*8 + 16] PTR [esi + ebx*4 + 16] // src + width * // src + width * 0 + 16 4 + 16

// src + width *

8 + 16

// src + width * 12 + 16

XMMWORD PTR SSE2_INDICES_MASK_0 XMMWORD PTR SSE2_WORD_8 xmm0 xmm1 xmm6 xmm6 xmm7 xmm7 8 8 3 3 xmm4 xmm5 xmm2 xmm3 xmm6 xmm6 xmm7 xmm7

psrlw psrlw psllw psllw paddw paddw #ifdef USE_SSSE3 movd pshufd movdqa movdqa pmulhw pmulhw pmulhw pmulhw packuswb pshufb pmaddubsw pmaddwd movdqa packuswb pshufb pmaddubsw pmaddwd movdqa #else // USE_SSSE3 movd pshufd pmulhw pmulhw pshuflw pshufhw pshuflw pshufhw movdqa pmaddwd pmaddwd packssdw pshuflw pshufhw pmaddwd movdqa pmulhw pmulhw pshuflw pshufhw pshuflw pshufhw pmaddwd pmaddwd packssdw pshuflw pshufhw pmaddwd movdqa #endif // USE_SSSE3 movzx movzx mov mov mov mov movzx movzx mov mov mov mov movzx movzx mov mov mov mov

xmm4, xmm5, xmm4, xmm5, xmm2, xmm3,

8 8 3 3 xmm4 xmm5

xmm7, edx xmm7, xmm7, 0x00 xmm5, XMMWORD PTR SSE2_INDICES_SCALE_3 xmm6, XMMWORD PTR SSE2_INDICES_SCALE_4 xmm0, xmm7 xmm1, xmm7 xmm2, xmm7 xmm3, xmm7 xmm0, xmm1 xmm0, XMMWORD PTR SSE2_INDICES_SHUFFLE xmm0, xmm5 xmm0, xmm6 XMMWORD PTR sse2_indices[32], xmm0 xmm2, xmm3 xmm2, XMMWORD PTR SSE2_INDICES_SHUFFLE xmm2, xmm5 xmm2, xmm6 XMMWORD PTR sse2_indices[48], xmm2 xmm7, edx xmm7, xmm7, 0x00 xmm0, xmm7 xmm1, xmm7 xmm0, xmm0, 0xD8 xmm0, xmm0, 0xD8 xmm1, xmm1, 0xD8 xmm1, xmm1, 0xD8 xmm6, XMMWORD PTR SSE2_INDICES_SCALE_1 xmm0, xmm6 xmm1, xmm6 xmm0, xmm1 xmm0, xmm0, 0xD8 xmm0, xmm0, 0xD8 xmm0, XMMWORD PTR SSE2_WORD_1 XMMWORD PTR sse2_indices[32], xmm0 xmm2, xmm7 xmm3, xmm7 xmm2, xmm2, 0xD8 xmm2, xmm2, 0xD8 xmm3, xmm3, 0xD8 xmm3, xmm3, 0xD8 xmm2, xmm6 xmm3, xmm6 xmm2, xmm3 xmm2, xmm2, 0xD8 xmm2, xmm2, 0xD8 xmm2, XMMWORD PTR SSE2_WORD_1 XMMWORD PTR sse2_indices[48], xmm2 eax, ebx, cl, ch, BYTE BYTE eax, ebx, dl, dh, BYTE BYTE eax, ebx, cl, ch, BYTE BYTE BYTE BYTE BYTE BYTE PTR PTR BYTE BYTE BYTE BYTE PTR PTR BYTE BYTE BYTE BYTE PTR PTR PTR sse2_indices[ 0] PTR sse2_indices[ 8] PTR COLOR_INDICES_TABLE[eax*1 PTR COLOR_INDICES_TABLE[ebx*1 [edi + 12], cl [edi + 13], ch PTR sse2_indices[16] PTR sse2_indices[24] PTR COLOR_INDICES_TABLE[eax*1 PTR COLOR_INDICES_TABLE[ebx*1 [edi + 14], dl [edi + 15], dh

+ +

0] 0]

+ +

0] 0]

PTR sse2_indices[32] PTR sse2_indices[40] PTR COLOR_INDICES_TABLE[eax*1 + PTR COLOR_INDICES_TABLE[ebx*1 + [edi + 28], cl [edi + 29], ch

0] 0]

movzx movzx mov mov mov mov movzx movzx mov mov movzx movzx or or movzx movzx or or mov mov mov mov movzx movzx or or movzx movzx or or movzx movzx or or mov mov mov mov movzx movzx or or movzx movzx or or mov mov add add sub jnz mov lea lea sub jnz } }

eax, ebx, dl, dh, BYTE BYTE eax, ebx, cx, dx, eax, ebx, cx, dx, eax, ebx, cx, dx, WORD WORD cx, dx, eax, ebx, cx, dx, eax, ebx, cx, dx, eax, ebx, cx, dx, WORD WORD cx, dx, eax, ebx, cx, dx, eax, ebx, cx, dx, WORD WORD esi, edi,

BYTE BYTE BYTE BYTE PTR PTR BYTE BYTE WORD WORD BYTE BYTE WORD WORD BYTE BYTE WORD WORD PTR PTR WORD WORD BYTE BYTE WORD WORD BYTE BYTE WORD WORD BYTE BYTE WORD WORD PTR PTR WORD WORD BYTE BYTE WORD WORD BYTE BYTE WORD WORD PTR PTR 32 32

PTR sse2_indices[48] PTR sse2_indices[56] PTR COLOR_INDICES_TABLE[eax*1 + PTR COLOR_INDICES_TABLE[ebx*1 + [edi + 30], dl [edi + 31], dh PTR sse2_indices[ 4] PTR sse2_indices[36] PTR ALPHA_INDICES_TABLE[eax*2 PTR ALPHA_INDICES_TABLE[ebx*2 PTR sse2_indices[ 5] PTR sse2_indices[37] PTR ALPHA_INDICES_TABLE[eax*2 PTR ALPHA_INDICES_TABLE[ebx*2 PTR sse2_indices[12] PTR sse2_indices[44] PTR ALPHA_INDICES_TABLE[eax*2 PTR ALPHA_INDICES_TABLE[ebx*2 [edi + 2], cx [edi + 18], dx PTR ALPHA_INDICES_TABLE[eax*2 PTR ALPHA_INDICES_TABLE[ebx*2 PTR sse2_indices[13] PTR sse2_indices[45] PTR ALPHA_INDICES_TABLE[eax*2 PTR ALPHA_INDICES_TABLE[ebx*2 PTR sse2_indices[20] PTR sse2_indices[52] PTR ALPHA_INDICES_TABLE[eax*2 PTR ALPHA_INDICES_TABLE[ebx*2 PTR sse2_indices[21] PTR sse2_indices[53] PTR ALPHA_INDICES_TABLE[eax*2 PTR ALPHA_INDICES_TABLE[ebx*2 [edi + 4], cx [edi + 20], dx PTR ALPHA_INDICES_TABLE[eax*2 PTR ALPHA_INDICES_TABLE[ebx*2 PTR sse2_indices[28] PTR sse2_indices[60] PTR ALPHA_INDICES_TABLE[eax*2 PTR ALPHA_INDICES_TABLE[ebx*2 PTR sse2_indices[29] PTR sse2_indices[61] PTR ALPHA_INDICES_TABLE[eax*2 PTR ALPHA_INDICES_TABLE[ebx*2 [edi + 6], cx [edi + 22], dx

0] 0]

+ + + + + +

0] 0] 128] 128] 256] 256]

+ + + + + + + +

384] 384] 512] 512] 640] 640] 768] 768]

+ +

896] 896]

+ 1024] + 1024] + 1152] + 1152]

// src += 32 // dst += 32

DWORD PTR x_count, 8 x_loop eax, ebx, esi, DWORD PTR width DWORD PTR [eax + eax*2] DWORD PTR [esi + ebx*4] // width * 1 // width * 3 // src += width * 12

DWORD PTR y_count, 4 y_loop

void PrepareColorDividerTable()
{
    for (int i = 0; i < 768; i++) {
        COLOR_DIVIDER_TABLE[i] = (((1 << 15) / (i + 1)) << 16) | ((1 << 15) / (i + 1));
    }
}

void PrepareAlphaDividerTable()
{
    for (int i = 0; i < 256; i++) {
        ALPHA_DIVIDER_TABLE[i] = (((1 << 16) / (i + 1)) << 16);
    }
}

void PrepareColorIndicesTable()
{
    const BYTE COLOR_INDEX[] = {1, 3, 2, 0};

    for (int i = 0; i < 256; i++) {
        BYTE ci3 = COLOR_INDEX[(i & 0xC0) >> 6] << 6;
        BYTE ci2 = COLOR_INDEX[(i & 0x30) >> 4] << 4;
        BYTE ci1 = COLOR_INDEX[(i & 0x0C) >> 2] << 2;
        BYTE ci0 = COLOR_INDEX[(i & 0x03) >> 0] << 0;

        COLOR_INDICES_TABLE[i] = ci3 | ci2 | ci1 | ci0;
    }
}

void PrepareAlphaIndicesTable()
{
    const int  SHIFT_LEFT [] = {0, 1, 2, 0, 1, 2, 3, 0, 1, 2};
    const int  SHIFT_RIGHT[] = {0, 0, 0, 2, 2, 2, 2, 1, 1, 1};
    const WORD ALPHA_INDEX[] = {1, 7, 6, 5, 4, 3, 2, 0};

    for (int j = 0; j < 10; j++) {
        int sl = SHIFT_LEFT [j] * 6;
        int sr = SHIFT_RIGHT[j] * 2;

        for (int i = 0; i < 64; i++) {
            WORD ai1 = ALPHA_INDEX[(i & 0x38) >> 3] << 3;
            WORD ai0 = ALPHA_INDEX[(i & 0x07) >> 0] << 0;

            ALPHA_INDICES_TABLE[(j * 64) + i] = ((ai1 | ai0) << sl) >> sr;
        }
    }
}
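A minimal usage sketch for the two entry points (the calling code is mine, not part of the original: it assumes the Windows BYTE typedef, a 32-bit source whose width is a multiple of 8 and height a multiple of 4, since the loops consume two 4x4 blocks per iteration, and the standard 8/16 bytes per 4x4 block of DXT1/DXT5 output):

#include <windows.h>
#include <vector>

void CompressExample(const BYTE* argb, int width, int height)
{
    // The divider and index-reordering tables must be filled once before compressing.
    PrepareColorDividerTable();
    PrepareAlphaDividerTable();
    PrepareColorIndicesTable();
    PrepareAlphaIndicesTable();

    std::vector<BYTE> dxt1((width / 4) * (height / 4) * 8);  // 8 bytes per DXT1 block
    std::vector<BYTE> dxt5((width / 4) * (height / 4) * 16); // 16 bytes per DXT5 block

    CompressImageDXT1(argb, dxt1.data(), width, height);
    CompressImageDXT5(argb, dxt5.data(), width, height);
}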
