Data compression is essential to reduce high storage and communication costs for a wide range of systems and applications. Canonical Huffman coding plays a pivotal role in several compression standards. This paper presents bit-parallel static and dynamic canonical Huffman decoder implementations using an optimized lookup table approach on a fine-grain many-core processor array and an Intel FPGA. The decoder implementation results are compared with an Intel i7-4850HQ and a massively parallel Nvidia GT 750M GPU executing the Calgary, Canterbury, Artificial, and Large corpus benchmarks. The many-core implementations achieve a scaled throughput per chip area that is 891× and 7× greater on average than the i7 and the GT 750M, respectively. Moreover, the many-core implementations achieve a scaled energy efficiency (compressed bits decoded per unit energy) that is 149.5×, 3.9×, and 2.5× greater on average than the i7, the GT 750M, and the Intel FPGA, respectively. In addition, the optimized lookup-table-based static canonical Huffman decoder on the Intel FPGA yields performance and energy efficiency improvements of 2.1× and 3.68×, respectively, on average over a dynamic canonical Huffman decoder, at a 17% cost in compression ratio.
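To illustrate the general technique the abstract names, the following is a minimal Python sketch of lookup-table-based canonical Huffman decoding. It is not the paper's implementation; all function names and the example alphabet are illustrative, and real decoders operate on packed bitstreams rather than character strings.

```python
def canonical_codes(lengths):
    """Assign canonical Huffman codes from a symbol -> code-length map.

    Symbols are sorted by (length, symbol); each code is the previous
    code plus one, left-shifted when the code length increases.
    """
    syms = sorted(lengths, key=lambda s: (lengths[s], s))
    codes, code, prev_len = {}, 0, 0
    for s in syms:
        code <<= lengths[s] - prev_len  # extend to the new length
        codes[s] = code
        code += 1
        prev_len = lengths[s]
    return codes

def build_lut(lengths, codes):
    """Build a 2^Lmax-entry table mapping every Lmax-bit window to
    (symbol, code length), so one lookup decodes one symbol."""
    lmax = max(lengths.values())
    lut = [None] * (1 << lmax)
    for s, c in codes.items():
        l = lengths[s]
        base = c << (lmax - l)          # entries whose top l bits match c
        for i in range(1 << (lmax - l)):
            lut[base + i] = (s, l)
    return lut, lmax

def decode(bits, lut, lmax, n):
    """Decode n symbols from a '0'/'1' string using the lookup table."""
    out, pos = [], 0
    for _ in range(n):
        window = bits[pos:pos + lmax].ljust(lmax, '0')  # peek Lmax bits
        s, l = lut[int(window, 2)]
        out.append(s)
        pos += l                        # consume only the code's length
    return out
```

For example, with code lengths {'a': 1, 'b': 2, 'c': 3, 'd': 3}, the canonical codes are a=0, b=10, c=110, d=111, and decoding the bitstring "010110111" yields a, b, c, d. The table trades memory (2^Lmax entries) for a branch-free, constant-time lookup per symbol, which is what makes the approach attractive for many-core and FPGA targets.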