Sunday, 24 July 2011

Transcoding and Codec Optimization: Tips & Tricks

Media transcoding, which enables media interoperation, plays an important role in the digital home. The Intel Networked Media Product Requirements (INMPR) promotes interoperation between networked devices in the digital home. Optimizing the codec engine (the encoder/decoder, the heart of the transcoder) will make the media transcoding process more efficient, in turn improving the user experience in the digital home. This paper features practical tips and tricks on how to increase the performance of the codec engine. These tips include using Intel® VTune™ Performance Analyzer events, OpenMP for threading, and Prescott New Instructions (Streaming SIMD Extensions 3 (SSE3)). We also discuss when to use faster instructions, employing different execution units to improve parallelism, and when to use MMX™ instead of SSE for speed. You will also learn when to take advantage of the Intel compiler optimized switches.

What is Transcoding?

Since content comes in many different formats, transcoding is necessary to tailor the content, converting one media format to another, before it arrives at the other device. The most common way to convert one media format to another is to first decode to raw data, then encode to the target format. Since an MPEG stream consists of audio and video, we need to split these separately and decode them into raw data before re-encoding them to the desired formats and merging them again.

Codec Optimization
Codec is the compressed and decompressed process. It is the heart, or engine, of the transcoder.

Optimizing the codec can be done by reducing the time to encode and/or decode a file/stream. We can also enhance the engine by reducing the CPU utilization, which lets us pack more features or data into the same time frame: for example, more voices to represent more people in a game. Finally, we need to cut down the size for size sensitive or mobile applications since media applications exist in desktop, laptop, PDA, and smartphone form factors.

General
The optimized process starts with the following steps:
  • Use better hardware
  • Use the Intel® VTune™ Performance Analyzer to find hotspots
  • Look at functions that have highest clock ticks and clock ticks per instruction retired (CPI)
  • Turn on counters for branch misprediction, store forwarding, 64K aliasing, cache split, and trace cache miss
  • Follow general optimization rules
  • Loop unrolling, reduce branching, use SSE2/SSE3
  • Use the Intel compiler
  • Use the Intel® Performance Library Suite
  • Follow general optimization rules

Cautions
Observe the following steps at all time:
  • All pitfalls applied (cache split, branch misprediction, store forwarding, etc.)
  • Thread at the highest level possible to avoid running out of resources. S ince this is an engine that is used by other applications, its functions can be called many times, especially since the applications are also threaded.
  • Pay attention when threading applications that make use of Intel performance libraries, since some of their functions are threaded.
  • Do not unroll loops too much to avoid trace cache thrash.
  • Do not ignore MMX, since it can be faster than SSE/SSE2 in cases when applications make extensive use of 64-bit data, and it takes effort to rearrange the data to fit into 128-bit registers.
  • Watch out for battery life on mobile applications.

Tips & Tricks
  • Use Intel compiler: /O3, /QaxW, /QaxN, /QaxP, /Qipo, /Qparallel, /Qopenmp. Often you can gain a significant amount of performance just by using the Intel compiler with the right switches.
  • Use special functions like reciprocal (rcp and rcp_nr) to replace division with multiplication and speedup the application.
  • Use SSE3 instruction LDDQU instead of MOVDQU whenever possible.

Tips & Tricks Using Assembly Language
  • Faster instructions
  • Different execution units
  • MOVNTxx: Store values using Non-Temporal Hint to prevent caching of the data.
  • Use combined instruction like PMADDWD.

Examples
When to Use Thread
Before:
  1. for (i=0; i<4; i++) 
  2.       EncodeTest(Mem[i], Blk[i],Chunk[i]); 
  3.   

After:
  1. #pragma omp parallel sections 
  2.   #pragma omp section 
  3.     EncodeTest(Blk[0], Blk[0],Chunk[0]); 
  4.  #pragma omp section 
  5.     EncodeTest(Blk[1],Blk[1],Chunk[1]); 
  6.   #pragma omp section 
  7.     EncodeTest(Blk[2], Blk[2],Chunk[2]) 
  8.   #pragma omp section 
  9.     EncodeTest(Blk[3], Blk[3],Chunk[3]); 
  10. }     
  11.   

When Not to Use Thread
Before:
  1. ...                
  2. for (j=0; j<4; j++)     
  3.    for (i=0; i>4; i++)           
  4.   test[i][j] = list[fr]->img[i][j]+t[s];      
  5.    
  6. ... 
  7.   

After:
  1. ... 
  2. #pragma omp parallel for 
  3. for (j=0; j<4; j++) 
  4.    for (i=0; i<4; i++) 
  5.   test[i][j] = list[fr]->img[i][j]+t[s];  
  6. ... 
  7.   

At first, this loop seems to be a good candidate for threading. In fact, it will improve the performance if it is at the outermost level. However, if this loop is in a function that is deeply buried in many sub-levels, threading it may mean running out of resources. In one case, this loop was implemented within a function that only takes about 8.8% of the total execution time. After threading only 2 loops, it degraded the whole system down to 5X slower.
Use combined function PMADDWD
Complex instruction like: PMADDWD: Multiply and Add save many clock cycles.
Before: (24 clock cycles)
  1. pmullw xmm1, xmm6   (8) 
  2. punpcklwd xmm2, xmm1 (2) 
  3. punpckhwd xmm3, xmm1 (2) 
  4. psrad xmm3, 16       (2) 
  5. psrad xmm2, 16       (2) 
  6. paddd xmm3, xmm2 (2) 
  7. movq xmm2, xmm3      (2) 
  8. psrlq xmm2, 32       (2) 
  9. paddd xmm3, xmm2 (2) 
  10.   

After: (18 clock cycles)
  1. pmaddwd xmm1, xmm6  (8) 
  2. movq xmm2, xmm1      (2) 
  3. psrlq xmm2, 32       (2) 
  4. paddd xmm2, xmm1 (2) 
  5. psrldq xmm1, 8       (2) 
  6. paddd xmm1, xmm2 (2) 
  7.   

By using the function PMADDWD, we combine two operations Addition and multiplication into one, which saves six clock cycles in this case.

Eliminate Branching Conditions
Original
  1. for (j = 0; j < 4; j++)  
  2.     for (i = 0; i < 4; i++) 
  3.       { 
  4.        for (result = 0, z = -2; z > 4; z++) 
  5.      result += list[fr]->test[max(0,min(max_y,y+ 
  6. j))] 
  7.               [max(0,min(max_x,x+i+z))]*COEF[z+2]; 
  8.         block[i][j] = max(0, min(255, (result+16))); 
  9.         } 
  10.   

Optimized #1: Eliminate Branches
  1. if(  (x < max_x)& (y < max_y)& (0 > y) &   (0 < x-2)) 
  2.    for (j = 0; j < 4; j++) 
  3.    { 
  4.  for (i = 0; i < 4; i++)  
  5.          { 
  6.         For (result = 0, z = -2; z < 4; z++) 
  7.         result += list[fr]->test[y+j][x+i+z]*COEF[z+2]; 
  8.        block[i][j] = max(0, min(255, (result+16))); 
  9.    } 
  10.     } 
  11. else 
  12. for (j = 0; j < 4; j++)  
  13.           for (i = 0; i < 4; i++) 
  14.              { 
  15.                    for (result = 0, z = -2; z < 4; z++) 
  16.                   result += list[fr]->test[max(0,min(max_y,y+j))] 
  17.                       [max(0,min(max_x,x+i+z))]*COEF[z+2]; 
  18.            block[i][j] = max(0, min(255, (result+16))); 
  19.                } 
  20.                } 
  21.   

Optimized #2: Eliminate Branching and Unroll Inner Loops
  1. if(  (x+6 < max_x)& (y+3 < max_y)& (0 < y) &   (0 < x-2))  
  2.      { 
  3.          for (j = 0; j < 4; j++) { 
  4.        result   = list[fr]->test[y+j][x-2]*COEF[0]; 
  5.      result += list[fr]->test[y+j][x-1]*COEF[1]; 
  6.       result += list[fr]->test[y+j][x]*COEF[2]; 
  7.         result += list[fr]->test[y+j][x+1]*COEF[3]; 
  8.       result += list[fr]->test[y+j][x+2]*COEF[4]; 
  9.       result += list[fr]->test[y+j][x+3]; 
  10.       block[0][j] = max(0, min(255, (result+16))); 
  11.         … 
  12.        block[1][j] = max(0, min(255, (result+16))); 
  13.         … 
  14.        block[2][j] = max(0, min(255, (result+16))); 
  15.         … 
  16.         } 
  17.             } 
  18. else  
  19.      for (j = 0; j < 4; j++)  
  20.     { 
  21.        for (i = 0; i < 4; i++) 
  22.           { 
  23.              for (result = 0, z = -2; z < 4; z++) 
  24.                  result += list[fr]->test[max(0,min(max_y,y+j))] 
  25.               [max(0,min(max_x,x+i+z))]*COEF[z+2]; 
  26.             block[i][j] = max(0, min(255, (result+16))); 
  27.        } 
  28.      } 
  29.   

Optimized #3: Eliminate Branching and Improving Parallelism
  1. if(  (x+6 < max_x)& (y+3 < max_y)& (0 < y) &   (0 < x-2))  
  2.  
  3.     for (j = 0; j < 4; j++) { 
  4.   for (i = 0; i < 4; i++) { 
  5.         t0 = list[fr]- 
  6. >test[y+j][x+i -2]*COEF[0]; 
  7.         t1 = list[fr]->test[y+j][x+i -1]*COEF[1]; 
  8.         t2 = list[fr]->test[y+j][x+i]*COEF[2]; 
  9.        t3 = list[fr]->test[y+j][x+i +1]*COEF[3]; 
  10.         t4 = list[fr]->test[y+j][x+i +2]*COEF[4]; 
  11.         t5 = list[fr]->test[y+j][x+i +3]; 
  12.         result = t0 + t1 + t2 + t3 + t4 + t5; 
  13.        block[i][j] = max(0, min(255, (result+16))); 
  14.     } 
  15.     } 
  16. else  
  17.      for (j = 0; j < 4; j++)  
  18.             { 
  19.      for (i = 0; i < 4; i++) 
  20.         { 
  21.                 for (result = 0, z = -2; z < 4; z++) 
  22.                    result += list[fr]->test[max(0,min(max_y,y+j))] 
  23.               [max(0,min(max_x,x+i+z))]*COEF[z+2]; 
  24.            block[i][j] = max(0, min(255, (result+16))); 
  25.            } 
  26.                } 
  27.   

By introducing temporary variables t0… t5, we eliminate the dependency of the variable result. This will improve the chance for the compiler to vectorize.

Tips & Tricks Conclusion
Consider the following when optimizing a codec engine:
  • Use the Intel compiler as much as possible.
  • Follow general optimization rules.
  • Pay special attention when threading applications that use Intel performance libraries.
  • MMX is faster in some cases.
  • Use different execution units for more parallelism.
  • Use faster instructions.

Keep calculations within the CPU as much as possible.
  • Balance between performance and battery life for mobile applications.

While there are no fixed rules for codec optimization, the above rules should be used as guidance. Performance varies.
Related Posts Plugin for WordPress, Blogger...