SHA2-256 Hashing Algorithm

In the following, a high throughput implementation of the secure hashing algorithm SHA2 implemented in the synchronous language Quartz is presented. The students started with a basic implementation of the SHA2 standard for 256 bits and then applied optimizations systematically. The final implementation can hash a SHA block (512 bits) in 66 clock cycles giving a throughput of 1.17 Gbps on Xilinx Virtex-5 FPGA technology running at 152 MHz. To the best of our knowledge, this throughput roughly matches the performance of the second best SHA-2 commercial implementation available which is the Cast core.

1. SHA-2 standard overview

In this section, we will highlight the basics of the standard; for more details, you may consider to the standard FIPS 180-4. The algorithm consists of the following two main stages:

1.1 Pre-processing

Pre-processing consists of the following steps:

Padding of the message to ensure that the total message size is a multiple of the block size which is 512 bits.
Parse the message into N blocks. Note that each block consists of 16 words Mi each of 32 bits width.
Calculate the schedule values of each block Wi where 16<= i <= 63. Note that the first 16 schedule values Wi are equal to corresponding words Mi.

1.2 Hash Computation

Hash computation is a serial process done block by block. The result of hashing a block is an intermediate hash Hi where 0 <= i <= 7 that is used by the next block. Hashing of the last block in the message results in eight hash words Hi that are concatenated to provide the final hash of 256 bit width.

2. Basic Implementation of the Standard

We have implemented the standard in a Quartz module called UserModule. A wrapper module TestModule has been used to provide the input stimuli and to check the resulting hash. The pseudo code of the algorithm is given in the following listing UserModule:

Inspecting the HashBlock listing below, we can see that (1) Ta calculation in line 17 represents our critical path. In addition, (2) by doing a simple calculation, it can be noted that we need (48 + 64 + 1) = 113 clock cycles to hash a block which is 512 bit length. Our goal in the optimization section below should be working on those issues to reduce both the critical path length and the number of clock cycles needed to hash a block.

3. Optimizations

3.1 Optimized Schedule Calculation

Inspecting module HashBlock, we observe the following:

To generate a schedule value Wi[i], we only need the last 16 schedule values. That is, a data structure of 16 words can be used instead of 64 words. Remember that each word is 32 bit width.
Calculating the schedule is independent from the inner-loop iteration, i.e., we can reschedule schedule calculation to be right before when the schedule Wi is used in Ta and Te calculation.

Thus, we have rescheduled the calculation of Wi to be done in parallel inside the inner-loop iteration which enabled us to save 48 clock cycles. Effectively reducing the clock cycles needed in hash a block to 66 clock cycles. Note that one clock cycle is still needed to initialize Ti values at the beginning.

3.2 Reducing critical path

The calculation of Ta represents the critical path of the inner-loop iteration of module HashBlock. Actually, Ta and Te are the only elements calculated in each iteration, all other temporary elements are directly assigned. Calculation of Ta and Te can be written as depicted in the corresponding figure.

Note that the highlighted operations represents the maximum common term that can be rescheduled. That is, this term, let's call it alpha, can be calculated in the previous clock cycle, which effectively removes two addition operations from the critical path.

4. Results

We have merged module HashBlock in UserModule and applied the aforementioned optimizations. The performance of the Quartz core has been compared to the best commercial cores available, namely, Helion and Cast. The performance comparison can be seen in the following table. It's worth to mention that our Quartz implementation represents the result of synthesizing the wrapper module TestModule together with UserModule which adds area overhead. Also note that Virtex-5 boards can support speed grades -1 (lowest) to -3 (highest). The speed grade metric denotes the time needed for data to propagate through a LUT. Our frequency figure is based on synthesis to speed grade -1. Synthesizing to speed grade -2 gave us a max frequency of up to 177 MHz.

5. Downloads

The students that participated in that project provided the following two solutions:

first solution
second solution

Fachbereich Informatik

Embedded Systems Group (ES)