C-Slow Retiming (CSR) and System Hyper Pipelining (SHP)
It must be stated that C-Slow Retiming is already very well known.
Nevertheless, we think that this technique has much more potential
than is currently exploited. This CloudX initiative is set up to change
this and to put real-world examples on the table of anybody who asks
for them.
What is the benefit of using SHP on multicores?
1) The area reduction is immense when the functionality of cores is
multiplied with SHP instead of instantiating each core individually.
2) The system architecture simplifies, which has an impact on the
3) The power consumption reduces, which has a great impact on the
4) There are many additional benefits, for instance for inter-processor
communication and multi-threading.
What's the difference to known solutions?
1) SHP is not only used on the CPU engine, but on complete
2) SHP is executed on RTL and not on the netlist. This enables better
system integration and verification.
make the difference for a wider use of the technology.
Figure 1: Solving an equation in one cycle.
Figure 2: Solving the same equation in 2 cycles.
Figure 1 shows a simplified example of digital logic. The combinatorial
logic is solved within one cycle. In Figure 2, registers are inserted
into the digital logic. The logic result is the same, but it now takes
2 cycles to solve the same combinatorial logic. The point is that in
the second cycle you can already start a completely independent
calculation of a new result. The simplified example also shows that, in
theory, you can run the clock at twice the speed, so that the overall
time to solve one single equation does not change. In other words, by
adding registers, you have the chance to solve the same equation twice
as often.
Now let's use this technique on a complete design.
Figure 3: Simplification of single clock design.
Figure 4: Single clock design after register insertion.
A (single clock) design can be defined as a set of inputs, outputs, a
graph of logic elements and registers (Figure 3). CSR now executes this
register insertion on a more complex design automatically, as can be
seen in Figure 4. Now it takes 2 cycles to achieve the same behavior as
the original design, but you have a second, totally independent design
which uses the combinatorial logic in a time-sliced fashion.
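The time-sliced behavior can be sketched for a design with a feedback
loop. The following Python model is an assumption-laden illustration:
the "design" is a hypothetical running sum modulo 256, and the register
chain is kept in one place instead of being retimed into the logic, as
a real CSR flow would do.

```python
from collections import deque


def next_state(state, x):
    """Hypothetical combinatorial logic of the design: running sum mod 256."""
    return (state + x) % 256


def run_original(xs, init=0):
    """Original single clock design: one state register, one input per cycle."""
    state = init
    trace = []
    for x in xs:
        state = next_state(state, x)
        trace.append(state)
    return trace


def run_c_slow(interleaved_xs, c=2, init=0):
    """C-slowed design: the state register becomes a chain of `c` registers.

    The inputs of the `c` independent threads arrive interleaved, thread 0
    first. The shared logic serves thread (cycle % c) in each cycle.
    """
    regs = deque([init] * c)  # register chain in the feedback loop
    traces = [[] for _ in range(c)]
    for cycle, x in enumerate(interleaved_xs):
        state = regs.popleft()  # the oldest value belongs to this thread
        new_state = next_state(state, x)
        regs.append(new_state)  # comes back to the logic c cycles later
        traces[cycle % c].append(new_state)
    return traces


thread_a = [1, 2, 3]
thread_b = [10, 20, 30]
interleaved = [x for pair in zip(thread_a, thread_b) for x in pair]
traces = run_c_slow(interleaved, c=2)
print(traces[0] == run_original(thread_a), traces[1] == run_original(thread_b))
```

Each thread sees exactly the behavior of the original design, only
every second cycle, while the combinatorial logic exists once.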
It is totally irrelevant whether the original design is already
pipelined (as in a CPU, for instance). If you follow the rule to insert
the same number of registers into every one of the original logic
paths, you multiply the functionality of the design/core. If the
registers are placed in a timing-driven way, the performance of a
single core remains almost the same. More register levels can be
inserted, and the functionality multiplies accordingly. The automatic register
insertion on RTL simplifies the SoC-implementation,
necessary code optimizations (e.g. memories) by hand and the
CSR and Area Reduction
Figure 5: A6 processor (source: techinsights.com).
Figure 6: A6X processor (source: techinsights.com).
The most obvious advantage of using CSR is area reduction. CSR can be
applied to the combinatorial logic of identical core instantiations. In
the case of the A6 processor (Figure 5), CSR could potentially be
applied to the two ARM cores (10% area reduction of the red area) as
well as to the three GPU cores (32% area reduction of the yellow area).
This would already result in a 6.7% area reduction of the complete die.
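A die-level figure like this is a weighted sum: each block contributes
its share of the die multiplied by the reduction achieved inside that
block. The sketch below shows the arithmetic only. The die shares are
hypothetical placeholders, not measured values for the A6 or any other
chip.

```python
def die_area_reduction(blocks):
    """Sum of (block share of the die) x (reduction inside that block)."""
    return sum(share * reduction for share, reduction in blocks)


# Hypothetical die shares, NOT measured values for any real chip.
blocks = [
    (0.25, 0.10),  # block A: 25% of the die, shrinks by 10%
    (0.10, 0.32),  # block B: 10% of the die, shrinks by 32%
]
print(f"{die_area_reduction(blocks):.1%}")
```

With these placeholder shares the die shrinks by about 5.7%. Plugging
in the real shares of the colored areas would give the corresponding
figure for an actual die.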
These numbers might not be overwhelming, but since more and more
multicores are implemented - "the processor is the new transistor" -
the area reduction might go up to 10%, 20% or even more. In the FPGA
world, the area utilization is even greater, since the registers
An example of the increasing importance of using CSR is the A6X
processor (Figure 6). It uses larger GPU cores, and now four of them
instead of three. Since CSR is more efficient on larger designs (GPUs)
and the number of identical designs has increased, the potential area
reduction for the A6X is now 14%.
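Why the saving grows with the number of identical cores can be seen
from a very simple model. It assumes that replicated cores cost
n * (combinatorial + register area), while the C-slowed version keeps
one copy of the combinatorial logic and n copies of the registers. The
model and the unitless areas are assumptions for illustration. It
ignores memories, routing and any overhead of the register insertion.

```python
def csr_area_saving(n_cores, comb_area, reg_area):
    """Relative saving when n_cores share one copy of the combinatorial logic.

    Assumed model: replication costs n * (comb + reg), while the C-slowed
    design costs comb + n * reg, because every thread keeps its own registers.
    """
    replicated = n_cores * (comb_area + reg_area)
    c_slowed = comb_area + n_cores * reg_area
    return (replicated - c_slowed) / replicated


# Hypothetical, unitless areas: logic and registers of equal size.
for n in (2, 3, 4):
    print(n, round(csr_area_saving(n, comb_area=1.0, reg_area=1.0), 3))
```

In this model the saving rises from 25% for two cores to 37.5% for four
cores, and it rises further when the combinatorial share of a core is
larger.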
CSR on RTL
Figure 7: χ² distribution of single net delay.
Figure 8: Gaussian for consecutive nets.
CSR on RTL has a lot of advantages, and it is questionable whether it is
doable on the netlist at all. We use simple empirical observations, such
as that the delay of an FPGA net follows a χ² distribution (Figure 7),
which ultimately leads, with k > 70, to a Gaussian for consecutive nets
(Figure 8). This observation allows the insertion of registers on RTL
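The step from a χ² distribution to a Gaussian can be made tangible with
the textbook moments of the χ² distribution. The sketch assumes that k
denotes the degrees of freedom, that the delays of consecutive nets are
independent, and that each net has 3 degrees of freedom. All three are
assumptions for illustration and not values from our measurements.

```python
import math


def chi2_moments(k):
    """Mean, variance and skewness of a chi-squared distribution with k
    degrees of freedom."""
    return k, 2 * k, math.sqrt(8 / k)


def sum_of_chi2(dofs):
    """Degrees of freedom of a sum of independent chi-squared variables."""
    return sum(dofs)


# Hypothetical path: 24 consecutive nets with 3 degrees of freedom each.
k_net = 3
k_path = sum_of_chi2([k_net] * 24)

_, _, skew_net = chi2_moments(k_net)
mean, var, skew_path = chi2_moments(k_path)
print(k_path, mean, var)
print(round(skew_net, 3), round(skew_path, 3))
```

A Gaussian has a skewness of 0. The single net is strongly skewed
(about 1.63), while the path with k = 72 is already close to symmetric
(about 0.33), so a Gaussian with mean k and variance 2k becomes a
usable approximation for the path delay.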
The difference between System Hyper Pipelining and C-Slow Retiming will be explained later.
Well, this statement is certainly a little bit "strong". It is not a
technique that completely solves today's multicore challenges. In fact,
the benefit of saving area (by just inserting registers) is interesting
from a commercial point of view, but area is not really one of the
big challenges today. We have been working on this technique for quite
some time now, and we see many more very interesting aspects. These
include power reduction, inter-processor communication (e.g.
multi-threading) and a lot more.
In fact, the complete work leads more and more to the IRAM
(promoted by Prof. Patterson), a data-centric multicore architecture
("cloud") and a more cluster-oriented multicore
More details on this research will be released soon.
for using the
I did my diploma thesis on "Wavepipelining of a FSM", which is a
follow-up of CSR, if you will. It was targeted at smaller DSP functions
and at the good old xc4008 (at that time we were talking about delays,
yeah!). As an FAE for LSI Logic's R4000 team, and while working on EDA
algorithms during the early physical synthesis times, I realized that
you cannot do this on a netlist basis. A few years ago I had an RTL flow
in place, tried CSR on RTL, and it worked great. Unfortunately it is hard to
this idea (even
So I decided to jump on the Arduino bandwagon and to generate some
examples. Let's see how this thing evolves.