Parallel Coprocessor
by Sebastian Fett, Frederik Walk, Mario Reder, and Marcel Heer
Goal
The goal of this student project was to design and implement a hardware/software system to accelerate numerical computations using an FPGA. The system consists of a parallel coprocessor (PCP) on the FPGA and host software to access the PCP.
The PCP is given a kernel and executes it in a given number of parallel threads. To this end, the threads are divided into four warps. The threads of a warp are executed simultaneously in lockstep, although they may diverge within the warp. The warps themselves have to be synchronized explicitly. The PCP also provides a scratchpad memory for fast inter-thread communication.
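As a rough illustration of this execution model, the following Python sketch assigns thread IDs to four warps. The consecutive assignment order, the function name, and the requirement that the thread count divides evenly are assumptions made for the example, not details taken from the PCP.

```python
NUM_WARPS = 4  # from the text: the threads are divided into four warps


def split_into_warps(num_threads, num_warps=NUM_WARPS):
    """Assign consecutive thread IDs to warps (the ordering is an assumption)."""
    if num_threads % num_warps != 0:
        raise ValueError("this sketch assumes the thread count divides evenly")
    warp_size = num_threads // num_warps
    # Warp w holds the thread IDs w*warp_size .. (w+1)*warp_size - 1.
    return [list(range(w * warp_size, (w + 1) * warp_size))
            for w in range(num_warps)]


warps = split_into_warps(16)
```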
Hardware
Control Unit
The control unit is responsible for initially filling the scheduler and for handling control signals (e.g. stall) to and from the pipeline. When the program terminates, the control unit writes this information to the AXI bus.
Scheduler
The scheduler consists of several FIFO queues that receive warps which are ready to be scheduled again in different scenarios. It also handles the synchronization of warps: on a synchronization instruction, all warps have to arrive at the same program counter. Warps that arrive early have to wait and are therefore inactive, so they are put in one FIFO queue until the remaining warps arrive.
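The queueing behaviour described above can be modelled in software. The following Python sketch is a toy model under simplifying assumptions: it uses a single ready queue and a single waiting queue, and the class and method names are hypothetical. It does not reflect the actual hardware implementation, which uses several queues.

```python
from collections import deque


class BarrierScheduler:
    """Toy model: one ready queue plus one queue for warps waiting at a sync."""

    def __init__(self, num_warps=4):
        self.num_warps = num_warps
        self.ready = deque(range(num_warps))  # warps that may be issued
        self.waiting = deque()                # warps parked at a sync instruction

    def next_warp(self):
        """Pop the next ready warp, or return None if no warp is ready."""
        return self.ready.popleft() if self.ready else None

    def requeue(self, warp):
        """A warp finished an ordinary instruction and may be scheduled again."""
        self.ready.append(warp)

    def sync(self, warp):
        """A warp reached a sync instruction; release all once the last arrives."""
        self.waiting.append(warp)
        if len(self.waiting) == self.num_warps:
            self.ready.extend(self.waiting)
            self.waiting.clear()


sched = BarrierScheduler()
for _ in range(3):
    sched.sync(sched.next_warp())  # three warps arrive early and are parked
parked = list(sched.waiting)
sched.sync(sched.next_warp())      # the last warp arrives and releases all four
```

In this model a warp that hits the synchronization point early simply stays out of the ready queue, so it cannot be issued until the last warp arrives.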
Pipeline
Because the coprocessor is synthesized on an FPGA with enough logic available, a pipeline is used to exploit concurrency. The pipeline is quite similar to a standard RISC pipeline and is composed of five stages. Instructions are executed in a SIMD fashion: the register file contains registers for every thread, and the execution stage contains one ALU for each thread of a warp.
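To make the SIMD execution concrete, the Python sketch below applies one ALU operation to every thread (lane) of a warp, with each lane reading and writing its own registers. The warp size of four, the register layout, and the `active` mask used to model diverged threads are assumptions for illustration. The text does not say how the PCP handles divergence internally.

```python
import operator

WARP_SIZE = 4  # hypothetical number of threads (lanes) per warp


def execute_warp(op, regs, dst, src_a, src_b, active):
    """Apply one ALU operation across all lanes of a warp.

    regs[lane] is the private register list of one thread. `active` marks the
    lanes that take part, an assumed way of modelling diverged threads.
    """
    for lane in range(len(regs)):
        if active[lane]:
            regs[lane][dst] = op(regs[lane][src_a], regs[lane][src_b])


# Each lane starts with r0 = lane ID, r1 = 10, r2 = 0.
regs = [[lane, 10, 0] for lane in range(WARP_SIZE)]
# r2 = r0 + r1 on every lane except lane 2, which is masked out.
execute_warp(operator.add, regs, dst=2, src_a=0, src_b=1,
             active=[True, True, False, True])
```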
In the MEM stage, there are two options for memory access: a scratchpad memory, which is synthesized on the FPGA and can be used as a cache and for communication between threads, and the main memory, which is connected asynchronously to the AXI bus.
Software
On the FPGA's ARM processor, the lwIP library is used to communicate with the host system over TCP/IP using a simple network protocol. Memory allocation and freeing are also done on the ARM processor using the standard C library functions. The ARM processor accesses the PCP via memory-mapped I/O.
On the host system, a small Python library abstracts network communication to allow easy access to the PCP. This library also provides an assembler to program the PCP. The assembler allows the use of parameters, e.g. to insert memory addresses, and expands 32-bit immediate loads into multiple 16-bit loads. An overview of the assembly language is given in the table below.
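The parameter substitution and the immediate expansion can be sketched as follows. This is a minimal illustration, not the project's assembler: the mnemonics `li32`, `ldl` and `ldh`, the register names, the `{name}` parameter syntax, and the function names are all hypothetical and were chosen only for the example.

```python
def expand_load_imm32(reg, value):
    """Split a 32-bit immediate into two 16-bit loads.

    The mnemonics `ldl`/`ldh` (load low/high half) are hypothetical.
    """
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("immediate does not fit in 32 bits")
    low = value & 0xFFFF
    high = (value >> 16) & 0xFFFF
    return [f"ldl {reg}, 0x{low:04x}", f"ldh {reg}, 0x{high:04x}"]


def assemble(lines, params):
    """Substitute {name} parameters, then expand hypothetical `li32` lines."""
    out = []
    for line in lines:
        line = line.format(**params)       # insert e.g. memory addresses
        op, _, rest = line.partition(" ")
        if op == "li32":
            reg, imm = [part.strip() for part in rest.split(",")]
            out.extend(expand_load_imm32(reg, int(imm, 0)))
        else:
            out.append(line)               # all other lines pass through unchanged
    return out


program = assemble(["li32 r1, {buf}", "add r2, r1, r1"], {"buf": 0x12345678})
```

Here the parameter `buf` stands in for a memory address that is only known at run time, for example one returned by an allocation on the ARM side.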