
Commit f3a8684

SSE2 SIMD implementation of Huffman encoding
Full-color compression speedups relative to libjpeg-turbo 1.4.2:

2.8 GHz Intel Xeon W3530, Linux, 64-bit: 2.2-18% (avg. 9.5%)
2.8 GHz Intel Xeon W3530, Linux, 32-bit: 10-25% (avg. 17%)
2.3 GHz AMD A10-4600M APU, Linux, 64-bit: 4.9-17% (avg. 11%)
2.3 GHz AMD A10-4600M APU, Linux, 32-bit: 8.8-19% (avg. 15%)
3.0 GHz Intel Core i7, OS X, 64-bit: 3.5-16% (avg. 10%)
3.0 GHz Intel Core i7, OS X, 32-bit: 4.8-14% (avg. 11%)
2.6 GHz AMD Athlon 64 X2 5050e: Performance-neutral (give or take a few percent)

Full-color compression speedups relative to IPP:

2.8 GHz Intel Xeon W3530, Linux, 64-bit: 4.8-34% (avg. 19%)
2.8 GHz Intel Xeon W3530, Linux, 32-bit: -19%-7.0% (avg. -7.0%)

Refer to #42 for discussion. Numerous other approaches were attempted, but this one proved to be the most performant across all platforms.

This commit also fixes flutter#3 (works around, really-- the clang-compiled version of jchuff.c still performs 20% worse than its GCC-compiled counterpart, but that code is now bypassed by the new SSE2 Huffman algorithm.)

Based on:
mayeut/libjpeg-turbo@2cb4d41
mayeut/libjpeg-turbo@36c94e0
1 parent eb59b6e commit f3a8684

18 files changed: 5,157 additions and 84 deletions

BUILDING.md

Lines changed: 39 additions & 51 deletions
@@ -38,19 +38,7 @@ Build Requirements

 NOTE: the NASM build will fail if texinfo is not installed.

-- GCC v4.1 or later recommended for best performance
-* Beginning with Xcode 4, Apple stopped distributing GCC and switched to
-the LLVM compiler. Xcode v4.0 through v4.6 provides a GCC front end
-called LLVM-GCC. Unfortunately, as of this writing, neither LLVM-GCC nor
-the LLVM (clang) compiler produces optimal performance with libjpeg-turbo.
-Building libjpeg-turbo with LLVM-GCC v4.2 results in a 10% performance
-degradation when compressing using 64-bit code, relative to building
-libjpeg-turbo with GCC v4.2. Building libjpeg-turbo with LLVM (clang)
-results in a 20% performance degradation when compressing using 64-bit
-code, relative to building libjpeg-turbo with GCC v4.2. If you are
-running Snow Leopard or earlier, it is suggested that you continue to use
-Xcode v3.2.6, which provides GCC v4.2. If you are using Lion or later, it
-is suggested that you install Apple GCC v4.2 or GCC v5 through MacPorts.
+- GCC v4.1 (or later) or clang recommended for best performance

 - If building the TurboJPEG Java wrapper, JDK or OpenJDK 1.5 or later is
 required. Some systems, such as Solaris 10 and later and Red Hat Enterprise
@@ -89,38 +77,38 @@ for 64-bit build instructions.)

 This will generate the following files under .libs/:

-**libjpeg.a**
+**libjpeg.a**
 Static link library for the libjpeg API

-**libjpeg.so.{version}** (Linux, Unix)
-**libjpeg.{version}.dylib** (OS X)
-**cygjpeg-{version}.dll** (Cygwin)
+**libjpeg.so.{version}** (Linux, Unix)
+**libjpeg.{version}.dylib** (OS X)
+**cygjpeg-{version}.dll** (Cygwin)
 Shared library for the libjpeg API

 By default, *{version}* is 62.1.0, 7.1.0, or 8.0.2, depending on whether
 libjpeg v6b (default), v7, or v8 emulation is enabled. If using Cygwin,
 *{version}* is 62, 7, or 8.

-**libjpeg.so** (Linux, Unix)
-**libjpeg.dylib** (OS X)
+**libjpeg.so** (Linux, Unix)
+**libjpeg.dylib** (OS X)
 Development symlink for the libjpeg API

-**libjpeg.dll.a** (Cygwin)
+**libjpeg.dll.a** (Cygwin)
 Import library for the libjpeg API

-**libturbojpeg.a**
+**libturbojpeg.a**
 Static link library for the TurboJPEG API

-**libturbojpeg.so.0.1.0** (Linux, Unix)
-**libturbojpeg.0.1.0.dylib** (OS X)
-**cygturbojpeg-0.dll** (Cygwin)
+**libturbojpeg.so.0.1.0** (Linux, Unix)
+**libturbojpeg.0.1.0.dylib** (OS X)
+**cygturbojpeg-0.dll** (Cygwin)
 Shared library for the TurboJPEG API

-**libturbojpeg.so** (Linux, Unix)
-**libturbojpeg.dylib** (OS X)
+**libturbojpeg.so** (Linux, Unix)
+**libturbojpeg.dylib** (OS X)
 Development symlink for the TurboJPEG API

-**libturbojpeg.dll.a** (Cygwin)
+**libturbojpeg.dll.a** (Cygwin)
 Import library for the TurboJPEG API


@@ -333,16 +321,16 @@ Set the following shell variables for simplicity:
 IOS_SYSROOT=$IOS_PLATFORMDIR/Developer/SDKs/iPhoneOS*.sdk
 IOS_GCC=$IOS_PLATFORMDIR/Developer/usr/bin/arm-apple-darwin10-llvm-gcc-4.2

-*ARMv6 (code will run on all iOS devices, not SIMD-accelerated)*
+*ARMv6 (code will run on all iOS devices, not SIMD-accelerated)*
 [NOTE: Requires Xcode 4.4.x or earlier]

 IOS_CFLAGS="-march=armv6 -mcpu=arm1176jzf-s -mfpu=vfp"

-*ARMv7 (code will run on iPhone 3GS-4S/iPad 1st-3rd Generation and newer)*
+*ARMv7 (code will run on iPhone 3GS-4S/iPad 1st-3rd Generation and newer)*

 IOS_CFLAGS="-march=armv7 -mcpu=cortex-a8 -mtune=cortex-a8 -mfpu=neon"

-*ARMv7s (code will run on iPhone 5/iPad 4th Generation and newer)*
+*ARMv7s (code will run on iPhone 5/iPad 4th Generation and newer)*
 [NOTE: Requires Xcode 4.5 or later]

 IOS_CFLAGS="-march=armv7s -mcpu=swift -mtune=swift -mfpu=neon"
@@ -365,11 +353,11 @@ Set the following shell variables for simplicity:
 IOS_SYSROOT=$IOS_PLATFORMDIR/Developer/SDKs/iPhoneOS*.sdk
 IOS_GCC=/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/clang

-*ARMv7 (code will run on iPhone 3GS-4S/iPad 1st-3rd Generation and newer)*
+*ARMv7 (code will run on iPhone 3GS-4S/iPad 1st-3rd Generation and newer)*

 IOS_CFLAGS="-arch armv7"

-*ARMv7s (code will run on iPhone 5/iPad 4th Generation and newer)*
+*ARMv7s (code will run on iPhone 5/iPad 4th Generation and newer)*

 IOS_CFLAGS="-arch armv7s"

@@ -527,22 +515,22 @@ on which version of cl.exe is in the `PATH`.

 The following files will be generated under *{build_directory}*:

-**jpeg-static.lib**
+**jpeg-static.lib**
 Static link library for the libjpeg API

-**sharedlib/jpeg{version}.dll**
+**sharedlib/jpeg{version}.dll**
 DLL for the libjpeg API

-**sharedlib/jpeg.lib**
+**sharedlib/jpeg.lib**
 Import library for the libjpeg API
-
-**turbojpeg-static.lib**
+
+**turbojpeg-static.lib**
 Static link library for the TurboJPEG API

-**turbojpeg.dll**
+**turbojpeg.dll**
 DLL for the TurboJPEG API

-**turbojpeg.lib**
+**turbojpeg.lib**
 Import library for the TurboJPEG API

 *{version}* is 62, 7, or 8, depending on whether libjpeg v6b (default), v7, or
@@ -569,22 +557,22 @@ build of libjpeg-turbo.

 This will generate the following files under *{build_directory}*:

-**{configuration}/jpeg-static.lib**
+**{configuration}/jpeg-static.lib**
 Static link library for the libjpeg API

-**sharedlib/{configuration}/jpeg{version}.dll**
+**sharedlib/{configuration}/jpeg{version}.dll**
 DLL for the libjpeg API

-**sharedlib/{configuration}/jpeg.lib**
+**sharedlib/{configuration}/jpeg.lib**
 Import library for the libjpeg API

-**{configuration}/turbojpeg-static.lib**
+**{configuration}/turbojpeg-static.lib**
 Static link library for the TurboJPEG API

-**{configuration}/turbojpeg.dll**
+**{configuration}/turbojpeg.dll**
 DLL for the TurboJPEG API

-**{configuration}/turbojpeg.lib**
+**{configuration}/turbojpeg.lib**
 Import library for the TurboJPEG API

 *{configuration}* is Debug, Release, RelWithDebInfo, or MinSizeRel, depending
@@ -603,22 +591,22 @@ cross-compiling on a Linux/Unix machine, then see "Build Recipes" below.

 This will generate the following files under *{build_directory}*:

-**libjpeg.a**
+**libjpeg.a**
 Static link library for the libjpeg API

-**sharedlib/libjpeg-{version}.dll**
+**sharedlib/libjpeg-{version}.dll**
 DLL for the libjpeg API

-**sharedlib/libjpeg.dll.a**
+**sharedlib/libjpeg.dll.a**
 Import library for the libjpeg API

-**libturbojpeg.a**
+**libturbojpeg.a**
 Static link library for the TurboJPEG API

-**libturbojpeg.dll**
+**libturbojpeg.dll**
 DLL for the TurboJPEG API

-**libturbojpeg.dll.a**
+**libturbojpeg.dll.a**
 Import library for the TurboJPEG API

 *{version}* is 62, 7, or 8, depending on whether libjpeg v6b (default), v7, or

ChangeLog.txt

Lines changed: 10 additions & 0 deletions
@@ -57,6 +57,16 @@ benchmark from outputting any images. This removes any potential operating
 system overhead that might be caused by lazy writes to disk and thus improves
 the consistency of the performance measurements.

+[12] Added SIMD acceleration for Huffman encoding on SSE2-capable x86 and
+x86-64 platforms. This speeds up the compression of full-color JPEGs by about
+10-15% on average (relative to libjpeg-turbo 1.4.x) when using modern Intel and
+AMD CPUs. Additionally, this works around an issue in the clang optimizer that
+prevents it (as of this writing) from achieving the same performance as GCC
+when compiling the C version of the Huffman encoder
+(https://llvm.org/bugs/show_bug.cgi?id=16035). For the purposes of benchmarking
+or regression testing, SIMD-accelerated Huffman encoding can be disabled by
+setting the JSIMD_NOHUFFENC environment variable to 1.
+

 1.4.2
 =====
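
The ChangeLog entry above says that setting JSIMD_NOHUFFENC to 1 disables SIMD Huffman encoding, but the code that reads the variable is not part of the hunks shown here. As a rough sketch of how such an opt-out could be evaluated, assuming only the exact value "1" counts as a request to disable: the names `huffenc_allowed` and `huffenc_allowed_from_env` are hypothetical and are not libjpeg-turbo functions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not libjpeg-turbo code: given the raw value of
 * JSIMD_NOHUFFENC (NULL when unset) and whether the CPU supports SSE2,
 * report whether the SIMD Huffman encoder may be used. */
int huffenc_allowed(const char *env_value, int cpu_has_sse2)
{
  if (!cpu_has_sse2)
    return 0;                 /* no SSE2: only the C encoder applies */
  if (env_value != NULL && strcmp(env_value, "1") == 0)
    return 0;                 /* assumption: only "1" disables the SIMD path */
  return 1;
}

/* Hypothetical wrapper that consults the real process environment. */
int huffenc_allowed_from_env(int cpu_has_sse2)
{
  return huffenc_allowed(getenv("JSIMD_NOHUFFENC"), cpu_has_sse2);
}
```

Separating the decision from the `getenv()` call keeps the logic easy to exercise in a regression test without touching the environment.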

jchuff.c

Lines changed: 47 additions & 11 deletions
@@ -5,6 +5,7 @@
  * Copyright (C) 1991-1997, Thomas G. Lane.
  * libjpeg-turbo Modifications:
  * Copyright (C) 2009-2011, 2014-2016 D. R. Commander.
+ * Copyright (C) 2015 Matthieu Darbois.
  * For conditions of distribution and use, see the accompanying README.ijg
  * file.
  *
@@ -20,7 +21,7 @@
 #define JPEG_INTERNALS
 #include "jinclude.h"
 #include "jpeglib.h"
-#include "jchuff.h" /* Declarations shared with jcphuff.c */
+#include "jsimd.h"
 #include "jconfigint.h"
 #include <limits.h>

@@ -108,6 +109,8 @@ typedef struct {
   long * dc_count_ptrs[NUM_HUFF_TBLS];
   long * ac_count_ptrs[NUM_HUFF_TBLS];
 #endif
+
+  int simd;
 } huff_entropy_encoder;

 typedef huff_entropy_encoder * huff_entropy_ptr;
@@ -159,6 +162,8 @@ start_pass_huff (j_compress_ptr cinfo, boolean gather_statistics)
     entropy->pub.finish_pass = finish_pass_huff;
   }

+  entropy->simd = jsimd_can_huff_encode_one_block();
+
   for (ci = 0; ci < cinfo->comps_in_scan; ci++) {
     compptr = cinfo->cur_comp_info[ci];
     dctbl = compptr->dc_tbl_no;
@@ -480,6 +485,23 @@ flush_bits (working_state * state)

 /* Encode a single block's worth of coefficients */

+LOCAL(boolean)
+encode_one_block_simd (working_state * state, JCOEFPTR block, int last_dc_val,
+                       c_derived_tbl *dctbl, c_derived_tbl *actbl)
+{
+  JOCTET _buffer[BUFSIZE], *buffer;
+  size_t bytes, bytestocopy; int localbuf = 0;
+
+  LOAD_BUFFER()
+
+  buffer = jsimd_huff_encode_one_block(state, buffer, block, last_dc_val,
+                                       dctbl, actbl);
+
+  STORE_BUFFER()
+
+  return TRUE;
+}
+
 LOCAL(boolean)
 encode_one_block (working_state * state, JCOEFPTR block, int last_dc_val,
                   c_derived_tbl *dctbl, c_derived_tbl *actbl)
@@ -640,16 +662,30 @@ encode_mcu_huff (j_compress_ptr cinfo, JBLOCKROW *MCU_data)
   }

   /* Encode the MCU data blocks */
-  for (blkn = 0; blkn < cinfo->blocks_in_MCU; blkn++) {
-    ci = cinfo->MCU_membership[blkn];
-    compptr = cinfo->cur_comp_info[ci];
-    if (! encode_one_block(&state,
-                           MCU_data[blkn][0], state.cur.last_dc_val[ci],
-                           entropy->dc_derived_tbls[compptr->dc_tbl_no],
-                           entropy->ac_derived_tbls[compptr->ac_tbl_no]))
-      return FALSE;
-    /* Update last_dc_val */
-    state.cur.last_dc_val[ci] = MCU_data[blkn][0][0];
+  if (entropy->simd) {
+    for (blkn = 0; blkn < cinfo->blocks_in_MCU; blkn++) {
+      ci = cinfo->MCU_membership[blkn];
+      compptr = cinfo->cur_comp_info[ci];
+      if (! encode_one_block_simd(&state,
+                                  MCU_data[blkn][0], state.cur.last_dc_val[ci],
+                                  entropy->dc_derived_tbls[compptr->dc_tbl_no],
+                                  entropy->ac_derived_tbls[compptr->ac_tbl_no]))
+        return FALSE;
+      /* Update last_dc_val */
+      state.cur.last_dc_val[ci] = MCU_data[blkn][0][0];
+    }
+  } else {
+    for (blkn = 0; blkn < cinfo->blocks_in_MCU; blkn++) {
+      ci = cinfo->MCU_membership[blkn];
+      compptr = cinfo->cur_comp_info[ci];
+      if (! encode_one_block(&state,
+                             MCU_data[blkn][0], state.cur.last_dc_val[ci],
+                             entropy->dc_derived_tbls[compptr->dc_tbl_no],
+                             entropy->ac_derived_tbls[compptr->ac_tbl_no]))
+        return FALSE;
+      /* Update last_dc_val */
+      state.cur.last_dc_val[ci] = MCU_data[blkn][0][0];
+    }
   }

   /* Completed MCU, so update state */
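
Both branches of the loop above pass `state.cur.last_dc_val[ci]` to the block encoder and then overwrite it with the block's DC coefficient. This reflects the fact that baseline JPEG codes each DC coefficient as a difference from the previous block of the same component. The following toy sketch isolates that bookkeeping; `dc_differences` and its parameters are hypothetical names chosen for illustration, and no entropy coding is performed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration, not libjpeg-turbo code.
 * dc[blkn]         : DC coefficient of block blkn within the MCU
 * membership[blkn] : component index of that block
 * last_dc_val[ci]  : running DC predictor per component (updated in place)
 * diffs[blkn]      : receives the value a Huffman coder would then encode */
void dc_differences(const short *dc, const int *membership, size_t nblocks,
                    int *last_dc_val, int *diffs)
{
  size_t blkn;

  for (blkn = 0; blkn < nblocks; blkn++) {
    int ci = membership[blkn];

    diffs[blkn] = dc[blkn] - last_dc_val[ci];  /* predict from same component */
    last_dc_val[ci] = dc[blkn];                /* update the predictor */
  }
}
```

Because the predictor is tracked per component rather than per block position, interleaved MCUs (for example, several luma blocks followed by chroma blocks) keep independent DC histories, which is why the loop indexes `last_dc_val` by `ci`.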

jsimd.h

Lines changed: 9 additions & 0 deletions
@@ -3,13 +3,16 @@
  *
  * Copyright 2009 Pierre Ossman <[email protected]> for Cendio AB
  * Copyright 2011, 2014 D. R. Commander
+ * Copyright 2015 Matthieu Darbois
  *
  * Based on the x86 SIMD extension for IJG JPEG library,
  * Copyright (C) 1999-2006, MIYASAKA Masaru.
  * For conditions of distribution and use, see copyright notice in jsimdext.inc
  *
  */

+#include "jchuff.h" /* Declarations shared with jcphuff.c */
+
 EXTERN(int) jsimd_can_rgb_ycc (void);
 EXTERN(int) jsimd_can_rgb_gray (void);
 EXTERN(int) jsimd_can_ycc_rgb (void);
@@ -82,3 +85,9 @@ EXTERN(void) jsimd_h2v2_merged_upsample
 EXTERN(void) jsimd_h2v1_merged_upsample
         (j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
          JDIMENSION in_row_group_ctr, JSAMPARRAY output_buf);
+
+EXTERN(int) jsimd_can_huff_encode_one_block (void);
+
+EXTERN(JOCTET*) jsimd_huff_encode_one_block
+        (void * state, JOCTET *buffer, JCOEFPTR block, int last_dc_val,
+         c_derived_tbl *dctbl, c_derived_tbl *actbl);

jsimd_none.c

Lines changed: 14 additions & 0 deletions
@@ -3,6 +3,7 @@
  *
  * Copyright 2009 Pierre Ossman <[email protected]> for Cendio AB
  * Copyright 2009-2011, 2014 D. R. Commander
+ * Copyright 2015 Matthieu Darbois
  *
  * Based on the x86 SIMD extension for IJG JPEG library,
  * Copyright (C) 1999-2006, MIYASAKA Masaru.
@@ -387,3 +388,16 @@ jsimd_idct_float (j_decompress_ptr cinfo, jpeg_component_info * compptr,
 {
 }

+GLOBAL(int)
+jsimd_can_huff_encode_one_block (void)
+{
+  return 0;
+}
+
+GLOBAL(JOCTET*)
+jsimd_huff_encode_one_block (void * state, JOCTET *buffer, JCOEFPTR block,
+                             int last_dc_val, c_derived_tbl *dctbl,
+                             c_derived_tbl *actbl)
+{
+  return NULL;
+}
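
The stubs above return 0 and NULL, which is safe only because jchuff.c checks `jsimd_can_huff_encode_one_block()` once in `start_pass_huff` and never calls the encode stub when the answer is 0. Here is a self-contained toy model of that capability-gated dispatch. Every `toy_` name is hypothetical, and the "encoding" is a plain byte copy rather than Huffman coding.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char toy_byte;

/* Stand-ins for the jsimd_none.c stubs in a build without SIMD support. */
int toy_can_encode_simd(void) { return 0; }
toy_byte *toy_encode_simd(toy_byte *buffer, int value)
{
  (void)buffer; (void)value;
  return NULL;                        /* must never be reached */
}

/* Portable fallback: emit one byte and return the advanced pointer. */
toy_byte *toy_encode_c(toy_byte *buffer, int value)
{
  *buffer++ = (toy_byte)value;
  return buffer;
}

typedef struct { int simd; } toy_encoder;

/* Query the capability once per pass, as start_pass_huff does. */
void toy_start_pass(toy_encoder *enc) { enc->simd = toy_can_encode_simd(); }

/* Choose the loop up front so the per-item path has no extra branch. */
size_t toy_encode_all(const toy_encoder *enc, const int *values, size_t n,
                      toy_byte *out)
{
  toy_byte *p = out;
  size_t i;

  if (enc->simd) {
    for (i = 0; i < n; i++) p = toy_encode_simd(p, values[i]);
  } else {
    for (i = 0; i < n; i++) p = toy_encode_c(p, values[i]);
  }
  return (size_t)(p - out);           /* number of bytes written */
}
```

Caching the flag in the encoder struct mirrors the new `int simd` member of `huff_entropy_encoder`; the capability query runs once per pass instead of once per block.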

jversion.h

Lines changed: 2 additions & 1 deletion
@@ -32,6 +32,7 @@
 "Copyright (C) 2009-2016 D. R. Commander\n" \
 "Copyright (C) 2009-2011 Nokia Corporation and/or its subsidiary(-ies)\n" \
 "Copyright (C) 2013-2014 MIPS Technologies, Inc.\n" \
-"Copyright (C) 2013 Linaro Limited"
+"Copyright (C) 2013 Linaro Limited\n" \
+"Copyright (C) 2015 Matthieu Darbois"

 #define JCOPYRIGHT_SHORT "Copyright (C) 1991-2016 The libjpeg-turbo Project and many others"

simd/CMakeLists.txt

Lines changed: 8 additions & 6 deletions
@@ -22,17 +22,19 @@ endif()

 if(SIMD_X86_64)
   set(SIMD_BASENAMES jfdctflt-sse-64 jccolor-sse2-64 jcgray-sse2-64
-    jcsample-sse2-64 jdcolor-sse2-64 jdmerge-sse2-64 jdsample-sse2-64
-    jfdctfst-sse2-64 jfdctint-sse2-64 jidctflt-sse2-64 jidctfst-sse2-64
-    jidctint-sse2-64 jidctred-sse2-64 jquantf-sse2-64 jquanti-sse2-64)
+    jchuff-sse2-64 jcsample-sse2-64 jdcolor-sse2-64 jdmerge-sse2-64
+    jdsample-sse2-64 jfdctfst-sse2-64 jfdctint-sse2-64 jidctflt-sse2-64
+    jidctfst-sse2-64 jidctint-sse2-64 jidctred-sse2-64 jquantf-sse2-64
+    jquanti-sse2-64)
   message(STATUS "Building x86_64 SIMD extensions")
 else()
   set(SIMD_BASENAMES jsimdcpu jfdctflt-3dn jidctflt-3dn jquant-3dn jccolor-mmx
     jcgray-mmx jcsample-mmx jdcolor-mmx jdmerge-mmx jdsample-mmx jfdctfst-mmx
     jfdctint-mmx jidctfst-mmx jidctint-mmx jidctred-mmx jquant-mmx jfdctflt-sse
-    jidctflt-sse jquant-sse jccolor-sse2 jcgray-sse2 jcsample-sse2 jdcolor-sse2
-    jdmerge-sse2 jdsample-sse2 jfdctfst-sse2 jfdctint-sse2 jidctflt-sse2
-    jidctfst-sse2 jidctint-sse2 jidctred-sse2 jquantf-sse2 jquanti-sse2)
+    jidctflt-sse jquant-sse jccolor-sse2 jcgray-sse2 jchuff-sse2 jcsample-sse2
+    jdcolor-sse2 jdmerge-sse2 jdsample-sse2 jfdctfst-sse2 jfdctint-sse2
+    jidctflt-sse2 jidctfst-sse2 jidctint-sse2 jidctred-sse2 jquantf-sse2
+    jquanti-sse2)
   message(STATUS "Building i386 SIMD extensions")
 endif()
