To create executable code for a platform other than the one the compiler runs on, we can use a cross compiler. In this case we want to build executables for ARM platforms on an Intel machine running Ubuntu. The only package needed (at least on Ubuntu 15.04) is gcc-arm-linux-gnueabihf.
Once it is installed, an example can be compiled like this:
$ arm-linux-gnueabihf-gcc -static -mfpu=neon -flax-vector-conversions -DSHUFFLE_NEON_ENABLED -O3 simple.c ../blosc/blosc.c ../blosc/blosclz.c ../blosc/shuffle.c ../blosc/shuffle-generic.c ../blosc/bitshuffle-generic.c ../blosc/shuffle-neon.c ../blosc/bitshuffle-neon.c -I../blosc -o simple -lpthread
Another example, this time building the benchmark program with the cross compiler:
$ arm-linux-gnueabihf-gcc -static -mfpu=neon -flax-vector-conversions -DSHUFFLE_NEON_ENABLED -O3 bench.c ../blosc/blosc.c ../blosc/blosclz.c ../blosc/shuffle.c ../blosc/shuffle-generic.c ../blosc/bitshuffle-generic.c ../blosc/shuffle-neon.c ../blosc/bitshuffle-neon.c -I../blosc -o bench -lpthread
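In both commands the NEON-specific flags are -mfpu=neon, which lets GCC generate code for the NEON unit, and -flax-vector-conversions, which makes GCC accept implicit conversions between vector types that differ in element type or element count.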
This is explained in detail in: http://linux-sunxi.org/Toolchain
This way you can develop and debug applications for ARM on Intel machines as if you were working on an ARM platform.
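Since the binaries produced above target ARM rather than the build machine, a quick sanity check is to look at the ELF header of the output file. The sketch below is only an illustration and not part of Blosc: the helper names `elf_machine` and `is_arm_binary` are hypothetical, and it relies on the standard ELF layout, where the 16-bit `e_machine` field sits at byte offset 18 and has the value 40 for 32-bit ARM.

```python
import struct

EM_ARM = 40  # e_machine value for 32-bit ARM in the ELF specification


def elf_machine(header: bytes) -> int:
    """Return the e_machine field from the first 20 bytes of an ELF file."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # Byte 5 (EI_DATA) gives the byte order: 1 = little endian, 2 = big endian.
    byte_order = "<" if header[5] == 1 else ">"
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return machine


def is_arm_binary(path: str) -> bool:
    """Hypothetical helper: True if the file at `path` is a 32-bit ARM ELF."""
    with open(path, "rb") as f:
        return elf_machine(f.read(20)) == EM_ARM


# Demonstration on a hand-built header: 16 bytes of e_ident (32-bit,
# little endian), then e_type = 2 (executable) and e_machine = 40 (ARM).
fake_header = b"\x7fELF" + bytes([1, 1, 1, 0]) + bytes(8) + struct.pack("<HH", 2, EM_ARM)
print(elf_machine(fake_header) == EM_ARM)  # True
```

After a successful cross build, calling `is_arm_binary("simple")` or `is_arm_binary("bench")` in the build directory should return True, while the same call on a native Intel binary should return False.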
The following benchmark compares the speed of the NEON and the generic shuffle implementations.
CPU: ARMv7 Processor rev 3 (v7l)
Compiler: gcc, version gcc-4.8.real
Optimizations: -O3
OS: Ubuntu 14.04 trusty
- NEON implementation results:
Blosc version: 2.0.0a1 ($Date:: 2015-07-30 #$)
List of supported compressors in this build: blosclz
Supported compression libraries:
  BloscLZ: 1.0.5
  LZ4: unknown
  Snappy: unknown
  Zlib: unknown
Using compressor: blosclz
Using shuffle type: shuffle
Running suite: suite
--> 1, 2097152, 8, 19, blosclz, shuffle
********************** Run info ******************************
Blosc version: 2.0.0a1 ($Date:: 2015-07-30 #$)
Using synthetic data with 19 significant bits (out of 32)
Dataset size: 2097152 bytes   Type size: 8 bytes
Working set: 64.0 MB          Number of threads: 1
********************** Running benchmarks *********************
memcpy(write):   2265.1 us, 883.0 MB/s
memcpy(read):    1196.9 us, 1671.0 MB/s
Compression level: 0
comp(write):      984.8 us, 2030.9 MB/s   Final bytes: 2097168  Ratio: 1.00
decomp(read):    1202.8 us, 1662.8 MB/s   OK
Compression level: 1
comp(write):     5863.7 us, 341.1 MB/s    Final bytes: 584976  Ratio: 3.59
decomp(read):    1014.1 us, 1972.3 MB/s   OK
Compression level: 2
comp(write):     6229.8 us, 321.0 MB/s    Final bytes: 584976  Ratio: 3.59
decomp(read):    1013.4 us, 1973.5 MB/s   OK
Compression level: 3
comp(write):     6603.8 us, 302.9 MB/s    Final bytes: 584976  Ratio: 3.59
decomp(read):    1012.9 us, 1974.4 MB/s   OK
Compression level: 4
comp(write):     6792.6 us, 294.4 MB/s    Final bytes: 557840  Ratio: 3.76
decomp(read):     983.5 us, 2033.6 MB/s   OK
Compression level: 5
comp(write):     8598.3 us, 232.6 MB/s    Final bytes: 557840  Ratio: 3.76
decomp(read):     983.4 us, 2033.9 MB/s   OK
Compression level: 6
comp(write):     9866.2 us, 202.7 MB/s    Final bytes: 546320  Ratio: 3.84
decomp(read):    1079.1 us, 1853.3 MB/s   OK
Compression level: 7
comp(write):     9334.9 us, 214.2 MB/s    Final bytes: 216528  Ratio: 9.69
decomp(read):    1959.4 us, 1020.7 MB/s   OK
Compression level: 8
comp(write):     9221.1 us, 216.9 MB/s    Final bytes: 216528  Ratio: 9.69
decomp(read):    1972.1 us, 1014.2 MB/s   OK
Compression level: 9
comp(write):     8452.0 us, 236.6 MB/s    Final bytes: 153160  Ratio: 13.69
decomp(read):    2780.0 us, 719.4 MB/s    OK
--> 2, 2097152, 8, 19, blosclz, shuffle
********************** Run info ******************************
Blosc version: 2.0.0a1 ($Date:: 2015-07-30 #$)
Using synthetic data with 19 significant bits (out of 32)
Dataset size: 2097152 bytes   Type size: 8 bytes
Working set: 64.0 MB          Number of threads: 2
********************** Running benchmarks *********************
memcpy(write):   2258.0 us, 885.8 MB/s
memcpy(read):    1194.9 us, 1673.7 MB/s
Compression level: 0
comp(write):      831.2 us, 2406.0 MB/s   Final bytes: 2097168  Ratio: 1.00
decomp(read):    1162.6 us, 1720.3 MB/s   OK
Compression level: 1
comp(write):     2975.1 us, 672.2 MB/s    Final bytes: 584976  Ratio: 3.59
decomp(read):     738.4 us, 2708.4 MB/s   OK
Compression level: 2
comp(write):     3156.8 us, 633.6 MB/s    Final bytes: 584976  Ratio: 3.59
decomp(read):     738.7 us, 2707.5 MB/s   OK
Compression level: 3
comp(write):     3347.0 us, 597.6 MB/s    Final bytes: 584976  Ratio: 3.59
decomp(read):     749.6 us, 2668.1 MB/s   OK
Compression level: 4
comp(write):     3486.2 us, 573.7 MB/s    Final bytes: 557840  Ratio: 3.76
decomp(read):     745.1 us, 2684.4 MB/s   OK
Compression level: 5
comp(write):     4488.5 us, 445.6 MB/s    Final bytes: 557840  Ratio: 3.76
decomp(read):     725.7 us, 2755.9 MB/s   OK
Compression level: 6
comp(write):     4998.5 us, 400.1 MB/s    Final bytes: 546320  Ratio: 3.84
decomp(read):     796.8 us, 2510.2 MB/s   OK
Compression level: 7
comp(write):     4780.3 us, 418.4 MB/s    Final bytes: 216528  Ratio: 9.69
decomp(read):    1383.9 us, 1445.1 MB/s   OK
Compression level: 8
comp(write):     4778.6 us, 418.5 MB/s    Final bytes: 216528  Ratio: 9.69
decomp(read):    1398.0 us, 1430.6 MB/s   OK
Compression level: 9
comp(write):     5884.6 us, 339.9 MB/s    Final bytes: 153160  Ratio: 13.69
decomp(read):    2647.7 us, 755.4 MB/s    OK
Round-trip compr/decompr on 3.8 GB
Elapsed time: 13.9 s, 609.1 MB/s
- Generic implementation results:
Blosc version: 2.0.0a1 ($Date:: 2015-07-30 #$)
List of supported compressors in this build: blosclz
Supported compression libraries:
  BloscLZ: 1.0.5
  LZ4: unknown
  Snappy: unknown
  Zlib: unknown
Using compressor: blosclz
Using shuffle type: shuffle
Running suite: suite
--> 1, 2097152, 8, 19, blosclz, shuffle
********************** Run info ******************************
Blosc version: 2.0.0a1 ($Date:: 2015-07-30 #$)
Using synthetic data with 19 significant bits (out of 32)
Dataset size: 2097152 bytes   Type size: 8 bytes
Working set: 64.0 MB          Number of threads: 1
********************** Running benchmarks *********************
memcpy(write):   2194.1 us, 911.5 MB/s
memcpy(read):    1170.8 us, 1708.2 MB/s
Compression level: 0
comp(write):      896.2 us, 2231.7 MB/s   Final bytes: 2097168  Ratio: 1.00
decomp(read):    1179.3 us, 1695.9 MB/s   OK
Compression level: 1
comp(write):     7534.4 us, 265.4 MB/s    Final bytes: 584976  Ratio: 3.59
decomp(read):    4117.1 us, 485.8 MB/s    OK
Compression level: 2
comp(write):     7895.6 us, 253.3 MB/s    Final bytes: 584976  Ratio: 3.59
decomp(read):    4106.7 us, 487.0 MB/s    OK
Compression level: 3
comp(write):     8262.2 us, 242.1 MB/s    Final bytes: 584976  Ratio: 3.59
decomp(read):    4113.9 us, 486.2 MB/s    OK
Compression level: 4
comp(write):     8495.6 us, 235.4 MB/s    Final bytes: 557840  Ratio: 3.76
decomp(read):    4042.6 us, 494.7 MB/s    OK
Compression level: 5
comp(write):    10321.0 us, 193.8 MB/s    Final bytes: 557840  Ratio: 3.76
decomp(read):    4033.9 us, 495.8 MB/s    OK
Compression level: 6
comp(write):    11675.3 us, 171.3 MB/s    Final bytes: 546320  Ratio: 3.84
decomp(read):    4096.4 us, 488.2 MB/s    OK
Compression level: 7
comp(write):    10193.0 us, 196.2 MB/s    Final bytes: 216528  Ratio: 9.69
decomp(read):    7150.9 us, 279.7 MB/s    OK
Compression level: 8
comp(write):    10192.3 us, 196.2 MB/s    Final bytes: 216528  Ratio: 9.69
decomp(read):    7167.7 us, 279.0 MB/s    OK
Compression level: 9
comp(write):    10418.7 us, 192.0 MB/s    Final bytes: 153160  Ratio: 13.69
decomp(read):    7870.8 us, 254.1 MB/s    OK
--> 2, 2097152, 8, 19, blosclz, shuffle
********************** Run info ******************************
Blosc version: 2.0.0a1 ($Date:: 2015-07-30 #$)
Using synthetic data with 19 significant bits (out of 32)
Dataset size: 2097152 bytes   Type size: 8 bytes
Working set: 64.0 MB          Number of threads: 2
********************** Running benchmarks *********************
memcpy(write):   2243.1 us, 891.6 MB/s
memcpy(read):    1219.9 us, 1639.4 MB/s
Compression level: 0
comp(write):      846.8 us, 2361.9 MB/s   Final bytes: 2097168  Ratio: 1.00
decomp(read):    1182.1 us, 1691.9 MB/s   OK
Compression level: 1
comp(write):     3867.6 us, 517.1 MB/s    Final bytes: 584976  Ratio: 3.59
decomp(read):    2162.8 us, 924.7 MB/s    OK
Compression level: 2
comp(write):     4054.2 us, 493.3 MB/s    Final bytes: 584976  Ratio: 3.59
decomp(read):    2156.6 us, 927.4 MB/s    OK
Compression level: 3
comp(write):     4241.2 us, 471.6 MB/s    Final bytes: 584976  Ratio: 3.59
decomp(read):    2169.6 us, 921.8 MB/s    OK
Compression level: 4
comp(write):     4377.1 us, 456.9 MB/s    Final bytes: 557840  Ratio: 3.76
decomp(read):    7556.9 us, 264.7 MB/s    OK
Compression level: 5
comp(write):     5276.3 us, 379.1 MB/s    Final bytes: 557840  Ratio: 3.76
decomp(read):    7556.7 us, 264.7 MB/s    OK
Compression level: 6
comp(write):     6026.8 us, 331.9 MB/s    Final bytes: 546320  Ratio: 3.84
decomp(read):    3108.2 us, 643.5 MB/s    OK
Compression level: 7
comp(write):     5877.0 us, 340.3 MB/s    Final bytes: 216528  Ratio: 9.69
decomp(read):    3668.5 us, 545.2 MB/s    OK
Compression level: 8
comp(write):     5882.0 us, 340.0 MB/s    Final bytes: 216528  Ratio: 9.69
decomp(read):    3531.2 us, 566.4 MB/s    OK
Compression level: 9
comp(write):     7621.3 us, 262.4 MB/s    Final bytes: 153160  Ratio: 13.69
decomp(read):    4978.3 us, 401.7 MB/s    OK
Round-trip compr/decompr on 3.8 GB
Elapsed time: 21.9 s, 385.0 MB/s
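The two runs above can be compared with a little arithmetic. The short script below simply divides throughput figures copied from the benchmark output (the round-trip totals, plus the single-thread level 1 lines as a sample); the `speedup` helper is only an illustrative name.

```python
def speedup(fast_mb_s: float, slow_mb_s: float) -> float:
    """Ratio between two throughput figures given in MB/s."""
    return fast_mb_s / slow_mb_s


# Round-trip throughput reported at the end of each run (MB/s).
neon_roundtrip, generic_roundtrip = 609.1, 385.0

# Single thread, compression level 1 (MB/s).
neon_comp, generic_comp = 341.1, 265.4
neon_decomp, generic_decomp = 1972.3, 485.8

print(f"round trip:      {speedup(neon_roundtrip, generic_roundtrip):.2f}x")  # 1.58x
print(f"compression:     {speedup(neon_comp, generic_comp):.2f}x")            # 1.29x
print(f"decompression:   {speedup(neon_decomp, generic_decomp):.2f}x")        # 4.06x
```

The gain is largest in decompression, while the overall round-trip figure sits between the compression and decompression ratios.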
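We have implemented the shuffle routines with NEON instructions for ARM. In this benchmark the overall round trip runs about 1.6 times faster than with the generic implementation (609.1 MB/s versus 385.0 MB/s), and single-threaded decompression at levels 1 to 6 is roughly four times faster.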
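For readers unfamiliar with the operation being accelerated: a byte shuffle regroups the bytes of fixed-size elements so that all first bytes come together, then all second bytes, and so on, which tends to make typed data easier to compress. The Python sketch below only illustrates that rearrangement. It is not Blosc's C code, the function names are chosen for this example, and copying leftover bytes unchanged is an assumption of the sketch.

```python
def shuffle(data: bytes, typesize: int) -> bytes:
    """Group byte 0 of every element, then byte 1, and so on."""
    n = len(data) // typesize          # number of complete elements
    out = bytearray(len(data))
    for j in range(typesize):
        # Take byte j of each element and store the run contiguously.
        out[j * n:(j + 1) * n] = data[j:n * typesize:typesize]
    # Bytes that do not fill a whole element are copied unchanged.
    out[n * typesize:] = data[n * typesize:]
    return bytes(out)


def unshuffle(data: bytes, typesize: int) -> bytes:
    """Inverse of shuffle(): scatter each byte run back into the elements."""
    n = len(data) // typesize
    out = bytearray(len(data))
    for j in range(typesize):
        out[j:n * typesize:typesize] = data[j * n:(j + 1) * n]
    out[n * typesize:] = data[n * typesize:]
    return bytes(out)


# Two 8-byte elements, matching the type size used in the benchmark.
original = bytes(range(16))
shuffled = shuffle(original, 8)
print(list(shuffled))
# [0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15]
print(unshuffle(shuffled, 8) == original)  # True
```

The generic implementation moves data in this byte-by-byte fashion, whereas a NEON version can rearrange many bytes per instruction, which is where the speed difference comes from.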
Enjoy developing for ARM!
Lucian Marc