
System Hyper Pipelining (SHP) is based on C-Slow Retiming (CSR), so understanding how CSR works is crucial to understanding SHP. An introduction to CSR and timing-driven register insertion on RTL can be found here.

I'm working on an open source project called "Arduissimo" which demonstrates the benefits of SHP.

Introduction to System Hyper Pipelining


Fig 1: a) Simplified single clock design. b) Applying CSR technique.

Fig. 1a shows the basic structure of a sequential circuit with its combinatorial logic (CL) and original design registers (DR). Inputs and outputs are omitted for simplicity. The sequential circuit handles one thread T(1). Fig. 1b shows the CSR technique. The original logic is sliced into C (here C=3) sections, and each original path now has C-1 additional registers. This results in C functionally independent design copies T(1..3), which use the logic in a time-sliced fashion.

Each thread has its own thread index. For each design copy it now takes C micro-cycles to achieve the same result as in one cycle of the original design (macro-cycle). The implemented register sets are called “CSR Registers” (CRs). They are placed at different C-levels (CR0, CR1, ...).

\[
\{CR0;\ CR1;\ DR\} =
\begin{bmatrix}
T(1) & T(2) & T(3) \\
T(3) & T(1) & T(2) \\
T(2) & T(3) & T(1)
\end{bmatrix}
\tag{1}
\]

Sequence (1) shows how the complete design states (threads) move through the logic, one row per micro-cycle. There is no interaction between threads, and each thread uses the complete design in a time-sliced fashion.
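The rotation in sequence (1) can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `csr_schedule` and the list-based register model are assumptions made for this example and are not part of any SHP tooling.

```python
def csr_schedule(c, cycles):
    """Return, per micro-cycle, which thread sits in each register stage.

    Stages are ordered [CR0, CR1, ..., DR]; threads are numbered 1..c.
    (Hypothetical helper, written only to illustrate sequence (1).)
    """
    # Initial occupancy as in the first row of (1): CR0=T(1), ..., DR=T(c)
    stages = list(range(1, c + 1))
    rows = []
    for _ in range(cycles):
        rows.append(tuple(stages))
        # Every micro-cycle each thread advances by one stage;
        # the thread leaving DR wraps around into CR0.
        stages = [stages[-1]] + stages[:-1]
    return rows


for row in csr_schedule(3, 3):
    print(row)
```

For C=3 the three printed rows correspond to the three rows of (1); after C micro-cycles the pattern repeats, which is one macro-cycle for every thread.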


Fig 2: a) SHP-ed design with thread controller (TC) and memory (Mem) b) Improved SHP-ed design.

The DRs are now replaced by memories (Mem). The incoming threads are stored at the relevant address (write pointer) based on the thread index. D is the number of threads that the memory can hold (memory depth). The outgoing thread can now be freely selected from the D available threads (read pointer), excluding the threads that are already passing through the design logic.

\[
\{CR0;\ CR1;\ Mem\} =
\begin{bmatrix}
T(a) & T(b) & \{T(a), \dots, T(p)\} \\
T(i) & T(a) & \{T(a), \dots, T(p)\} \\
T(f) & T(i) & \{T(a), \dots, T(p)\}
\end{bmatrix}
\tag{2}
\]

Equation (2) shows that an SHP-ed design can run any thread (T <= D) in any possible order. In this initial version, the same thread must not be executed more than once at the same time.
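The behaviour described by equation (2) can be modelled as a small pipeline simulation. The sketch below is an assumption-laden illustration, not the actual thread controller: `run_shp` is a hypothetical helper, threads are identified by integer indices 0..D-1, and a thread counts as "in flight" while it occupies any CR stage.

```python
from collections import deque


def run_shp(requests, depth, c=3):
    """Push thread indices through a (c-1)-stage CR pipeline.

    requests: thread indices a (hypothetical) thread controller issues,
              one per micro-cycle.
    depth:    number of threads the memory can hold (D).
    Returns the contents of (CR0, CR1, ...) for every micro-cycle.
    """
    pipeline = deque([None] * (c - 1), maxlen=c - 1)  # CR0 .. CR(c-2)
    trace = []
    for tid in requests:
        if not 0 <= tid < depth:
            raise ValueError(f"thread {tid} exceeds memory depth {depth}")
        if tid in pipeline:
            # The same thread must not be executed twice at the same time.
            raise ValueError(f"thread {tid} is already in flight")
        # The new thread enters CR0; the oldest one leaves the last CR
        # stage (in hardware it would be written back to Mem).
        pipeline.appendleft(tid)
        trace.append(tuple(pipeline))
    return trace


# With a=0, b=1, f=5, i=8 and D=16, the last three rows mirror equation (2).
for row in run_shp([1, 0, 8, 5], depth=16):
    print(row)
```

In this model a thread can re-enter the pipeline at the earliest C micro-cycles after it was issued, which matches the CSR case where every thread is served once per macro-cycle.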

A CSR-ed design usually has many shift registers, because DRs are followed by a series of CSR registers. In the SHP-ed version, many memory data outputs are connected directly to CRs. In this case, the shift register chain at the outputs can be replaced by a shift register chain at the read address inputs of the memories. Fig. 2b shows this improved SHP version. The memory is sliced into individual sections (Mem0, Mem1, Mem2, ...) and each section performs a delayed read of the thread. The outputs can now be connected directly to the relevant combinatorial logic, and the shift registers can be removed. Additionally, triple read port memories can be used to further reduce the CR count.
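The equivalence behind this optimization (delaying the read address instead of delaying the read data) can be checked with a toy model. Everything in the sketch is hypothetical: the helpers `delayed` and `read`, the memory contents and the read address sequence are made up for illustration, and the model assumes that a thread's stored state is not overwritten while its delayed reads are still pending.

```python
def delayed(seq, k, fill=None):
    """Model a k-deep shift register chain applied to a sequence."""
    return ([fill] * k + list(seq))[: len(seq)]


def read(section, addrs):
    """Model a memory read for every micro-cycle (None = no valid address)."""
    return [None if a is None else section[a] for a in addrs]


D, SECTIONS = 4, 3
# Section k of the state of thread t is represented by the string "Tt.k".
mem = [[f"T{t}.{k}" for t in range(D)] for k in range(SECTIONS)]
read_addr = [2, 0, 3, 1, 2, 0]  # thread index selected per micro-cycle

equivalent = all(
    # Fig. 2a style: read at once, then delay the data by k registers ...
    delayed(read(mem[k], read_addr), k)
    # ... Fig. 2b style: delay the address by k registers, then read.
    == read(mem[k], delayed(read_addr, k))
    for k in range(SECTIONS)
)
print(equivalent)
```

The practical gain is that a read address is typically much narrower than the data word it selects, so the address-side chain needs fewer register bits than the data-side chain it replaces.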

The same method can be applied at the inputs of the memories to further reduce the register count. CRs that are directly connected to the data inputs of the Mem can be merged into the Mem. This can be achieved by splitting the individual sections (Mem0, Mem1, Mem2, ...) again into individual subsections (Mem0.0, Mem0.1, ...), which are now controlled by an early write address.

Adding CRs to each path in order to use the logic in a time-sliced fashion also implies that C registers are added to each feedback loop, which results in a high shift register count for a CSR-ed design. Feedback loops with multiplexers are sometimes replaced by a register write enable signal. This feature cannot be used in a CSR-ed design. Replacing the DRs with Mems in the SHP-ed version allows the write enable signal to be used, and the feedback loop becomes obsolete.

Load balancing


Fig 3: Histogram of different scenarios running CSR and/or SHP.

Fig. 3 shows the advantages of CSR and SHP compared to the original design. The x-axis of the histogram shows different scenarios/solutions, the y-axis the system performance.

Assume that a thread (T0) on a single CPU runs at, e.g., 60 MHz on an FPGA (Fig. 3a). Fig. 3b shows how CSR improves the system performance of the original system implementation. When using CSR, the system performance is not necessarily limited by the critical path of the original design but, for instance, by the switching limit of the FPGA (e.g. 250 MHz) or by the external memory access. All threads run at the same, fixed relative speed.
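The relation between the original clock, the CSR factor C and the FPGA switching limit can be made concrete with a small calculation. The sketch below uses the example values from the text (60 MHz and 250 MHz) and rests on an idealized assumption that is not claimed by the text: slicing the logic into C balanced sections lets the micro-cycle clock scale to C times the original clock, with no register or routing overhead. The function name `csr_performance` is hypothetical.

```python
def csr_performance(f_orig_mhz, c, f_limit_mhz):
    """Idealized CSR estimate (assumes perfectly balanced logic slices).

    Returns (micro-cycle clock in MHz, equivalent speed per thread in MHz).
    """
    # The micro-cycle clock scales with C until the FPGA switching limit
    # becomes the bottleneck instead of the original critical path.
    f_micro = min(c * f_orig_mhz, f_limit_mhz)
    # Each of the C threads is served once every C micro-cycles.
    per_thread = f_micro / c
    return f_micro, per_thread


for c in (1, 2, 3, 4, 5):
    f_micro, per_thread = csr_performance(60, c, 250)
    print(f"C={c}: system {f_micro} MHz, each thread {per_thread:.1f} MHz")
```

Under these assumptions the system performance grows with C up to the switching limit, while every thread receives the same fixed share of it.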

For executing multiple programs on multiple CPUs (symmetric multiprocessing), SHP allows a more efficient use of the system resources (Fig. 3b to 3e). It makes it possible to distribute the system performance over a minimum of C threads (Fig. 3b) and a maximum of D threads (Fig. 3c), and any solution in between can be realized. Fig. 3d shows a random example. This load balancing is handled by a thread controller (TC) and can be modified dynamically at runtime. Threads can be inserted, stalled and killed on a cycle-by-cycle basis, and a flexible priority scheme takes care of individual load balancing (Fig. 3d). Acceleration techniques enable the speed-up of at least one thread beyond the speed of the thread running on the original design (Fig. 3e).
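How a priority scheme might distribute micro-cycles among threads can be sketched with a simple credit-based scheduler. This is not the thread controller described in the papers; the credit scheme, the function name `schedule` and the weights are assumptions chosen only to illustrate load balancing under the rule that a thread cannot be issued again while it is still in the pipeline.

```python
from collections import Counter, deque


def schedule(weights, c, cycles):
    """Issue one eligible thread per micro-cycle, favouring high weights.

    weights: dict mapping thread name -> relative priority (hypothetical).
    Returns a Counter with the number of issue slots each thread received.
    """
    in_flight = deque([None] * (c - 1), maxlen=c - 1)
    credit = {t: 0.0 for t in weights}
    total_weight = sum(weights.values())
    issued = Counter()
    for _ in range(cycles):
        for t, w in weights.items():
            credit[t] += w  # every thread earns credit each micro-cycle
        # Threads still passing through the logic cannot be selected.
        eligible = [t for t in weights if t not in in_flight]
        pick = max(eligible, key=lambda t: credit[t]) if eligible else None
        if pick is not None:
            credit[pick] -= total_weight  # pay for the issue slot
            issued[pick] += 1
        in_flight.appendleft(pick)  # None models an empty pipeline slot
    return issued


print(schedule({"T0": 4, "T1": 2, "T2": 1, "T3": 1}, c=3, cycles=240))
```

In this model no thread can receive more than one issue slot in every C micro-cycles, so at least C runnable threads are needed to keep the pipeline full, which corresponds to the minimum set of threads in Fig. 3b.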

Video introduction to SHP

My presentation on SHP at the 4th RISC-V workshop, MIT, Boston, 2016:

Advantages

SHP has several advantages over standard approaches: a higher performance per area factor, a performance increase for a single thread, and system level performance improvements, especially in the multi-core domain.

Latest work on system level improvements: virtual peripherals

System level performance can be improved by dynamically varying the number of active threads. This enables a much more flexible multithreading approach, which can be used for running multiple virtual peripherals:

T. Strauch, "Connecting Things to the IoT by Using Virtual Peripherals on a Dynamically Multithreaded Cortex M3", IEEE Trans. on Circuits and Systems I: Regular Papers, vol. 64, issue 9, Sep. 2017, pp. 2462-2469. http://ieeexplore.ieee.org/document/7935353/

The paper is recommended by the Associate Editor, A. Sangiovanni Vincentelli.

Initial work: performance per area increase

This paper discusses the increase of the classical performance per area factor when SHP is used:

T. Strauch, "The Effects of System Hyper Pipelining on Three Computational Benchmarks Using FPGAs", 11th International Symposium on Applied Reconfigurable Computing, ARC 2015, 13-17 April 2015, Bochum, Germany, pp. 1-12.

Acceleration techniques

One benefit of SHP is that performance can be balanced more flexibly among individual threads. This paper shows how individual threads can be accelerated even further (Fig. 3e):

T. Strauch, "Acceleration Techniques for System-Hyper-Pipelined Soft-Processors on FPGAs", IEEE Euromicro DSD 2017, 30th Aug. - 1st Sep., Vienna, Austria, pp. 129-138. http://ieeexplore.ieee.org/document/8049775/

Performance per area and CGRA

SHP can be used on coarse grained reconfigurable arrays (CGRA) as well:

T. Strauch, "Using System Hyper Pipelining (SHP) to Improve the Performance of a Coarse-Grained Reconfigurable Architecture (CGRA) Mapped on an FPGA", 2nd International Workshop on FPGAs for Software Programmers, FSP 2015, 1st September 2015, London, UK, pp. 1-6.

System level improvements: more to come

I'm currently working on more system level improvements that can be achieved with SHP.