C-Slow Retiming (CSR) and System Hyper Pipelining (SHP)

It must be stated that C-Slow Retiming is already very well known. Nevertheless, we think that this technique has much more potential than is currently exploited. The CloudX initiative was set up to change this and to put real-world examples on the table of anybody who asks for them.

What is the benefit of using SHP on multicores?

1) The area reduction is immense when multiplying the functionality of cores with SHP instead of instantiating each individual core.

2) The system architecture simplifies, which has an impact on the "multicore memory wall".

3) The power consumption is reduced, which has a great impact on the "multicore power wall".

4) There are many additional benefits, for instance for inter-processor communication and multi-threading.

What is the difference from known solutions?

1) SHP is not only used on the CPU engine alone, but on complete (sub-)systems instead.

2) SHP is executed on RTL and not on the netlist. This enables better system integration and verification.

This might make the difference for a wider use of the technology.

Theory of CSR

Figure 1: Solving an equation in one cycle.    Figure 2: Solving the same equation in 2 cycles.

Figure 1 shows a simplified example of digital logic. The combinatorial logic is solved within one cycle. In Figure 2, registers are inserted into the logic. The result is the same, but it now takes 2 cycles to solve the same combinatorial logic. The point is that in the second cycle you can already start a completely independent calculation of a new result. The simplified example also shows that, in theory, you can run the clock at twice the speed, so that the overall time to solve one single equation does not change. In other words, by adding registers you get the chance to solve the same equation twice as often in the same time.
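The register insertion of Figure 2 can be sketched in a few lines of Python. This is only an illustrative model and not the circuit from the figures: the functions `stage_a` and `stage_b` and all input values are assumptions chosen for the example.

```python
def stage_a(x):
    """First half of the combinatorial logic (hypothetical example)."""
    return x * x + 1


def stage_b(y):
    """Second half of the combinatorial logic (hypothetical example)."""
    return 3 * y - 2


def single_cycle(inputs):
    """Figure 1 style: the complete logic is solved in one cycle per input."""
    return [stage_b(stage_a(x)) for x in inputs]


def two_cycle_pipeline(inputs):
    """Figure 2 style: one register between the two halves of the logic.

    Each loop iteration models one clock cycle. A new, independent input
    enters every cycle while the previous one is still in the second half.
    """
    pipeline_reg = None                    # the inserted register
    outputs = []
    for x in list(inputs) + [None]:        # one extra cycle drains the pipeline
        if pipeline_reg is not None:
            outputs.append(stage_b(pipeline_reg))
        pipeline_reg = stage_a(x) if x is not None else None
    return outputs
```

Both functions return the same results for the same inputs. The pipelined version needs one additional cycle in total, but each of its cycles only has to cover half of the logic.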

Now let's use this technique on a complete design.

Figure 3: Simplification of single clock design. Figure 4: Single clock design after register insertion.

Any (single-clock) design can be defined as a set of inputs, outputs, a graph of logic elements, and registers (Figure 3). CSR executes this register insertion automatically on a more complex design, as can be seen in Figure 4. It now takes 2 cycles to achieve the same behavior as the original design, but you gain a second, totally independent design which uses the combinatorial logic in a time-sliced fashion.
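The time-sliced behavior can be checked with a small simulation. The following is a minimal sketch under stated assumptions: the "design" is a hypothetical state machine whose next state is `logic_b(logic_a(state, x))`, both threads start in state 0, and `logic_b(0)` is 0 so that a zero reset value of the inserted register is consistent. None of these names come from the figures.

```python
def logic_a(state, x):
    """First half of the next-state logic (hypothetical)."""
    return state + x


def logic_b(value):
    """Second half of the next-state logic (hypothetical); logic_b(0) == 0."""
    return (3 * value) % 251


def original_design(inputs):
    """Original design: one state register, one result per cycle."""
    state = 0
    trace = []
    for x in inputs:
        state = logic_b(logic_a(state, x))
        trace.append(state)
    return trace


def two_slow_design(inputs_t0, inputs_t1):
    """2-slow version: one extra register between logic_a and logic_b.

    Even cycles carry thread 0, odd cycles carry thread 1. Both threads
    share the same logic, each sees its own state.
    """
    assert len(inputs_t0) == len(inputs_t1)
    interleaved = [x for pair in zip(inputs_t0, inputs_t1) for x in pair]
    interleaved.append(0)                  # one cycle to drain the last result
    state_reg, mid_reg = 0, 0              # reset values
    trace = []
    for x in interleaved:
        new_mid = logic_a(state_reg, x)    # first half works on one thread
        new_state = logic_b(mid_reg)       # second half works on the other
        state_reg, mid_reg = new_state, new_mid
        trace.append(state_reg)
    results = trace[1:]                    # trace[0] only shows the reset value
    return results[0::2], results[1::2]
```

De-interleaving the output of `two_slow_design` gives exactly the two traces that two separate runs of `original_design` produce, which is the "second, totally independent design" in this toy model.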

It is irrelevant whether the original design is already pipelined (as in a CPU, for instance). If you follow the rule of inserting the same number of registers into every original logic path, you multiply the functionality of the design/core. If the registers are placed in a timing-driven way, the performance of a single core remains almost the same. More register levels can be inserted, and the functionality multiplies accordingly. The automatic register insertion on RTL simplifies the SoC implementation, the necessary manual code optimizations (e.g. for memories), and the verification process.
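The rule about inserting the same number of registers into every logic path can be expressed as a simple graph check. The sketch below is an illustration only and not the tool used for the work described here: the graph, the node names, and the helper functions are hypothetical, and the logic graph is assumed to be acyclic between register boundaries.

```python
def path_register_counts(graph, regs_on_edge, sources, sinks):
    """Collect the number of inserted registers along every source-to-sink path.

    graph        : dict mapping a node to the list of its successor nodes
    regs_on_edge : dict mapping an edge (u, v) to the registers inserted on it
    sources      : nodes where paths start (inputs, outputs of original registers)
    sinks        : nodes where paths end (outputs, inputs of original registers)
    """
    counts = set()

    def walk(node, count):
        if node in sinks:
            counts.add(count)
            return
        for succ in graph.get(node, []):
            walk(succ, count + regs_on_edge.get((node, succ), 0))

    for source in sources:
        walk(source, 0)
    return counts


def is_balanced(graph, regs_on_edge, sources, sinks):
    """True if every path received the same number of inserted registers."""
    return len(path_register_counts(graph, regs_on_edge, sources, sinks)) == 1


# Hypothetical logic cone with a short path (in -> g2 -> out)
# and a long path (in -> g1 -> g3 -> out).
GRAPH = {"in": ["g1", "g2"], "g1": ["g3"], "g2": ["out"], "g3": ["out"]}

BALANCED = {("g1", "g3"): 1, ("in", "g2"): 1}    # one register on each path
UNBALANCED = {("g1", "g3"): 1}                   # the short path gets none
```

With `BALANCED`, every path crosses exactly one inserted register, so the design behaves like two interleaved copies. With `UNBALANCED`, data of different threads would meet at `out`, and the functionality would no longer be that of the original design.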

CSR and Area Reduction


Figure 5: A6 processor (source:          Figure 6: A6X processor (source:

The most obvious advantage of using CSR is area reduction. CSR can be applied to the combinatorial logic of identical core instantiations. In the case of the A6 processor (Figure 5), CSR could potentially be applied to the two ARM cores (10% area reduction of the red area) as well as to the three GPU cores (32% area reduction of the yellow area). This alone would result in a 6.7% area reduction of the complete die.
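The die-level number is presumably the sum of the per-block savings, each weighted by the share of the die that the block occupies. As a sketch, with $a_{CPU}$ and $a_{GPU}$ (symbols introduced here for illustration, their values are not stated in the text) denoting the fractions of the die covered by the red and the yellow area:

$$
r_{die} = 0.10 \cdot a_{CPU} + 0.32 \cdot a_{GPU}
$$

Under this reading, the quoted 6.7% is the value of this sum for the area shares of Figure 5.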

These numbers might not be overwhelming, but since more and more multicores are implemented - "the processor is the new transistor" - the area reduction might go up to 10%, 20% or even more. In the FPGA world, the area utilization is even greater, since the registers "already exist".

The A6X processor (Figure 6) is an example of the increasing importance of CSR. It uses a larger GPU core, and now four of them instead of three. Since CSR is more efficient on larger designs (GPUs) and the number of identical designs has increased, the potential area reduction for the A6X is now 14%.
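Why the saving grows with the number of identical cores can be seen from a first-order model. The model is an assumption of this sketch and not a formula from the measurements above: N instantiated cores cost N times (logic + registers), while the CSR version keeps one copy of the combinatorial logic and N copies of the registers. Memories, routing, and any overhead of the inserted registers are ignored.

```python
def csr_area_reduction(num_cores, logic_fraction):
    """Fraction of area saved when identical cores share one copy of the logic.

    num_cores      : number of identical core instantiations being merged
    logic_fraction : share of a single core's area that is combinatorial
                     logic (the remainder is treated as registers)

    Original area  : num_cores * (logic + registers)
    Area after CSR : logic + num_cores * registers
    """
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    if not 0.0 <= logic_fraction <= 1.0:
        raise ValueError("logic_fraction must be between 0 and 1")
    return (num_cores - 1) / num_cores * logic_fraction
```

For a fixed `logic_fraction`, the result rises from half of the logic share for two cores towards the full logic share for many cores. A larger `logic_fraction` raises the saving as well.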

CSR on RTL

Figure 7: χ² distribution of single net delay.                    Figure 8: Gaussian for consecutive nets.

CSR on RTL has a lot of advantages, and it is questionable whether it is doable on the netlist at all. We use simple empirical observations, such as that the delay of an FPGA net follows a χ² distribution (Figure 7), which for k > 70 ultimately approaches a Gaussian for consecutive delays (Figure 8). This observation allows registers to be inserted on RTL quite efficiently.
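How quickly a chi-squared distribution approaches a Gaussian can be checked numerically. The sketch below reads k as the number of degrees of freedom, which is an assumption, and uses the skewness as a measure of asymmetry. A Gaussian has skewness 0, a chi-squared distribution with k degrees of freedom has skewness sqrt(8/k). The function names and the sample sizes are chosen for illustration only.

```python
import math
import random


def chi2_skewness(k):
    """Theoretical skewness of a chi-squared distribution with k degrees of freedom."""
    return math.sqrt(8.0 / k)


def sample_chi2(k, n, seed=0):
    """Draw n chi-squared(k) samples; chi-squared(k) equals Gamma(shape=k/2, scale=2)."""
    rng = random.Random(seed)
    return [rng.gammavariate(k / 2.0, 2.0) for _ in range(n)]


def sample_skewness(samples):
    """Skewness of a sample: third central moment over the variance to the power 1.5."""
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((s - mean) ** 2 for s in samples) / n
    m3 = sum((s - mean) ** 3 for s in samples) / n
    return m3 / m2 ** 1.5


if __name__ == "__main__":
    for k in (2, 10, 70, 200):
        measured = sample_skewness(sample_chi2(k, 20000))
        print(k, round(chi2_skewness(k), 3), round(measured, 3))
```

Since the sum of independent chi-squared variables is again chi-squared with the degrees of freedom added up, the skewness keeps shrinking as more contributions are accumulated, which is the trend towards the bell shape of Figure 8.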

The difference between System Hyper Pipelining and C-Slow Retiming will be explained later.

"breaking the walls"

Well, this statement is certainly a little bit "strong". The technique does not completely solve today's multicore challenges. In fact, the obvious benefit of saving area (by just inserting registers) is interesting from a commercial point of view, but area is not really one of the big challenges today. We have been working on this technique for quite some time now, and we see many more very interesting aspects: power reduction, inter-processor communication (e.g. multi-threading), and a lot more. In fact, the work leads more and more towards the IRAM idea (promoted by Prof. Patterson), a data-centric multicore architecture ("cloud"), and a more cluster-oriented multicore concept.

More details on this research will be released soon.

Motivation for using the Arduino environment

I did my diploma thesis on "Wavepipelining of a FSM", which is a follow-up step of CSR, if you will. It was targeted at smaller DSP functions and at the good old xc4008 (at that time we were talking about delays, yeah!). As an FAE for LSI Logic's R4000 team, and while working on EDA algorithms during the early physical-synthesis times, I realized that you cannot do this on a netlist basis. A few years ago I had an RTL flow in place, tried CSR on RTL, and it worked great. Unfortunately, it is sometimes hard to explain this idea (even to CPU experts). So I decided to jump on the Arduino wagon and to generate some examples. Let's see how this thing evolves.

Older examples (2010) of using CSR can be found here:

        Hyper Pipelined OpenRISC OR1200 Core, or1200_hp (Verilog)
        Hyper Pipelined AVR Core, avr_hp (VHDL)



last modified: 2016/jul/9