

C-Slow Retiming (CSR) and System Hyper Pipelining (SHP)
It must be stated that C-Slow Retiming is already very well known.
Nevertheless, we think that this technique has much more potential than
is currently exploited. This CloudX initiative is set up to change this
and to put real-world examples on the table for anybody who asks for
them.
What is the benefit of using SHP on multicores?
1) The area reduction is immense when the functionality of cores is
multiplied with SHP instead of instantiating each core individually.
2) The system architecture becomes simpler, which has an impact on the
"multicore memory wall".
3) The power consumption is reduced, which has a great impact on the
"multicore power wall".
4) There are many additional benefits, for instance for interprocessor
communication and multithreading.
How does this differ from known solutions?
1) SHP is not applied to the CPU engine alone, but to complete
(sub)systems.
2) SHP is executed on RTL and not on the netlist. This enables better
system integration and verification. This might make the difference for
a wider use of the technology.



Figure 1: Solving an equation in one cycle.

Figure 2: Solving the same equation in 2 cycles. 
Figure 1 shows a simplified example of digital logic. The combinatorial
logic is solved within one cycle. In Figure 2, registers are inserted
into the logic. The result is the same, but it now takes 2 cycles to
solve the same combinatorial logic. The point is that in the second
cycle you can already start a completely independent calculation of a
new result. The simplified example also shows that, in theory, you can
run the clock at twice the speed, so that the overall time to solve one
single equation does not change. In other words, by adding registers,
you have the chance to solve the same equation twice as often.
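The step from Figure 1 to Figure 2 can be sketched in a few lines of Python. This is only a behavioral sketch: the equation (a + b) * c and the names stage1, stage2, single_cycle and two_cycle are assumptions chosen for illustration, not taken from the figures.

```python
def stage1(a, b):
    # First half of the combinatorial logic (hypothetical example equation).
    return a + b


def stage2(partial, c):
    # Second half of the combinatorial logic.
    return partial * c


def single_cycle(inputs):
    # Figure 1 style: the whole equation is solved within one cycle.
    return [stage2(stage1(a, b), c) for a, b, c in inputs]


def two_cycle(inputs):
    # Figure 2 style: a register between the two halves.
    reg = None          # the inserted pipeline register
    results = []
    # One extra (empty) cycle at the end flushes the last value.
    for item in list(inputs) + [None]:
        # The second half works on the value registered in the previous cycle ...
        if reg is not None:
            partial, c = reg
            results.append(stage2(partial, c))
        # ... while the first half already starts a new, independent calculation.
        if item is not None:
            a, b, c = item
            reg = (stage1(a, b), c)
        else:
            reg = None
    return results


if __name__ == "__main__":
    data = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
    print(single_cycle(data))
    print(two_cycle(data))
```

Both functions return the same results. The pipelined version needs one additional cycle of latency, but accepts a new input in every cycle, and each cycle only has to cover half of the logic.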
Now let's use this technique on a complete design.


Figure 3: Simplification of single clock design. 
Figure 4: Single clock design after register insertion.

Any (single clock) design can be defined as a set of inputs, outputs, a
graph of logic elements and registers (Figure 3). CSR executes this
register insertion automatically on a more complex design, as can be
seen in Figure 4. It now takes 2 cycles to achieve the same behavior as
the original design, but you get a second, totally independent design
which uses the combinatorial logic in a time-sliced fashion.
It is totally irrelevant whether the original design is already
pipelined (as in a CPU, for instance). If you follow the rule of
inserting the same number of registers into every one of the original
logic paths, you multiply the functionality of the design/core. If the
registers are placed in a timing-driven way, the performance of a single
core remains almost the same. More register levels can be inserted, and
the functionality multiplies accordingly. The automatic register
insertion on RTL simplifies the SoC implementation, the manual code
optimizations that remain necessary (e.g. for memories) and the
verification process.
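The effect on a design with feedback can be illustrated with a small behavioral model. It is a sketch under stated assumptions: next_state is a hypothetical piece of combinatorial logic, and the C registers are modeled as one chain in the feedback loop, whereas a real CSR run distributes them through the logic. For the cycle-by-cycle behavior seen from outside, this makes no difference.

```python
from collections import deque


def next_state(state, x):
    # Hypothetical combinatorial logic of the original design.
    return (state * 3 + x) % 256


def run_original(inputs, init=0):
    # Original design: one state register, one result per cycle.
    state = init
    trace = []
    for x in inputs:
        state = next_state(state, x)
        trace.append(state)
    return trace


def run_c_slowed(streams, init=0):
    # C-slowed design: C registers in the feedback loop, where C is the
    # number of independent input streams (all assumed to be equally long).
    c = len(streams)
    regs = deque([init] * c)
    traces = [[] for _ in streams]
    for step in range(len(streams[0])):
        for thread in range(c):          # one clock cycle per time slice
            state = regs.popleft()       # value written C cycles ago
            new = next_state(state, streams[thread][step])
            regs.append(new)
            traces[thread].append(new)
    return traces


if __name__ == "__main__":
    streams = [[1, 2, 3, 4], [10, 20, 30, 40]]
    print(run_c_slowed(streams))
    print([run_original(s) for s in streams])
```

Each stream sees exactly the behavior of the original design, although there is only one copy of next_state. The streams never interact, because a state value only returns to the logic after C cycles, which is the time slice of the same stream.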
CSR and Area Reduction
Figure 5: A6 processor (source: techinsights.com).
Figure 6: A6X processor (source: techinsights.com).
The most obvious advantage of using CSR is area reduction. CSR can be
applied to the combinatorial logic of identical core instantiations. In
the case of the A6 processor (Figure 5), CSR could potentially be
executed on the two ARM cores (10% area reduction of the red area) as
well as on the three GPU cores (32% area reduction of the yellow area).
This would already result in a 6.7% area reduction of the complete die.
These numbers might not be overwhelming, but since more and more
multicores are implemented ("the processor is the new transistor"), the
area reduction might go up to 10%, 20% or even more. In the FPGA world,
the benefit is even greater, since the registers "already exist".
An example of the increasing importance of using CSR is the A6X
processor (Figure 6). It uses a larger GPU core, and four of them
instead of three. Since CSR is more efficient on larger designs (GPUs)
and the number of identical designs has increased, the potential area
reduction for the A6X is now 14%.
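The die-level numbers are a weighted sum: the reduction inside each block, multiplied by the share of the die that the block occupies. The sketch below shows only this arithmetic. The function name die_area_reduction is an assumption, and the shares in the usage example are made-up placeholders, not measured values of the A6 or A6X.

```python
def die_area_reduction(blocks):
    """Return the overall die area reduction as a fraction.

    blocks: iterable of (share_of_die, reduction_within_block) pairs,
    both given as fractions between 0 and 1.
    """
    return sum(share * reduction for share, reduction in blocks)


if __name__ == "__main__":
    # Placeholder shares for illustration only (NOT taken from a die photo):
    # a block covering 50% of the die shrinks by 10%,
    # a block covering 25% of the die shrinks by 40%.
    example = [(0.50, 0.10), (0.25, 0.40)]
    print(f"{die_area_reduction(example):.1%}")
```

The same formula shows why larger blocks and a higher number of identical instances both push the overall figure up: either the share or the reduction within the block grows.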
CSR on RTL
Figure 7: χ² distribution of single net delay.
Figure 8: Gaussian for consecutive nets.
CSR on RTL has a lot of advantages, and it is questionable whether it is
doable on the netlist at all. We use simple empirical observations, such
as that the net delay of an FPGA net follows a χ² distribution
(Figure 7), which, with k > 70, ultimately leads to a Gaussian for
consecutive delays (Figure 8). This observation allows the insertion of
registers on RTL quite efficiently.
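The statistical argument can be checked numerically. The following sketch draws χ²-distributed net delays and sums them along a path; the skewness of the sum shrinks as more nets are added, which is what "approaching a Gaussian" means here. Several things are assumptions for illustration: the degrees of freedom per net (dof_per_net=2), the independence of the nets, the reading of k as the total degrees of freedom of the summed delay, and all function names.

```python
import random
import statistics


def chi2_sample(rng, dof):
    # A chi-squared variable with `dof` degrees of freedom is a
    # Gamma variable with shape dof/2 and scale 2.
    return rng.gammavariate(dof / 2.0, 2.0)


def skewness(values):
    # Third standardized moment; 0 for a symmetric (e.g. Gaussian) shape.
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def path_delay_skewness(nets_per_path, dof_per_net=2, samples=5000, seed=1):
    # Sum `nets_per_path` independent net delays and measure the skewness
    # of the resulting path delay distribution.
    rng = random.Random(seed)
    totals = [
        sum(chi2_sample(rng, dof_per_net) for _ in range(nets_per_path))
        for _ in range(samples)
    ]
    return skewness(totals)


if __name__ == "__main__":
    for nets in (1, 5, 40):
        k = nets * 2
        print(f"nets={nets:3d}  k={k:3d}  skewness={path_delay_skewness(nets):.2f}")
```

For a χ² distribution the skewness is sqrt(8/k), so a path with k around 70 or more is already close to symmetric, and its delay can be estimated from mean and variance alone.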
The difference between System Hyper Pipelining and C-Slow Retiming will be explained later.
"Breaking the walls"
Well, this statement is certainly a little bit "strong". It is not a
technique that completely solves today's multicore challenges. In fact,
the obvious benefit of saving area (by just inserting registers) is
interesting from a commercial point of view, but area is not really one
of the big challenges today. We have been working on this technique for
quite some time now, and we see many more very interesting aspects,
among them power reduction, interprocessor communication (e.g.
multithreading) and a lot more.
In fact, the complete work leads more and more towards the IRAM idea
(promoted by Prof. Patterson), a data-centric multicore architecture
("cloud") and a more cluster-oriented multicore concept.
More details on this research will be released soon.
Motivation for using the Arduino environment
I did my diploma thesis on "Wavepipelining of a FSM", which is a
follow-up step of CSR, if you will. It was targeted at smaller DSP
functions and at the good old xc4008 (at that time we were talking about
delays, yeah!). As an FAE for LSI Logic's R4000 team, and while working
on EDA algorithms during the early physical synthesis times, I realized
that you cannot do this at the netlist level. A few years ago I had an
RTL flow in place, tried CSR on RTL, and it worked great. Unfortunately,
it is sometimes hard to explain this idea (even to CPU experts).
So I decided to jump on the Arduino bandwagon and to generate some
examples. Let's see how this thing evolves.
Examples
References

Afram M., Khan A., and Sarfaraz M. 2011. C-slow Technique vs
Multiprocessor in designing Low Area Customized Instruction set
Processor for Embedded Applications. Intern. Journal of Computer
Applications, Vol. 36, No. 7, December 2011.

Bufistov D., Cortadella J., Kishinevsky M., and Sapatnekar S. 2007. A
general model for performance optimization of sequential systems.
Proceedings of the intl. Conference on Computer-aided designs,
November 4-8, 2007, San Jose, California, USA.
DOI = 10.1109/ICCAD.2007.4397291.

Weaver N. and Wawrzynek J. 2002. The Effects of Datapath Placement and
C-slow Retiming on Three Computational Benchmarks, Extended Abstract.
Proceedings of the 10th Annual Symposium on Field-Programmable Custom
Computing Machines, April 22-24, 2002, Napa, California, USA,
pages 303-304. DOI = 10.1109/FPGA.2002.1106694.

Weaver N., Markovskiy Y., Patel Y., and Wawrzynek J. 2003.
Post-Placement C-slow Retiming for the Xilinx Virtex FPGA. Proceedings
of the 11th intl. Symposium on FPGAs 2003, February 23-25, 2003,
Monterey, CA, USA, pages 185-194. DOI = 10.1145/611817.611845.

Baumgartner J., Tripp A., Aziz A., Singhal V., and Anderson F. 2000. An
Abstraction Algorithm for the Verification of Generalized C-Slow
Designs. Proceedings of 12th intl. Conference on Computer Aided
Verification, July 15-19, 2000, Chicago, IL, USA, pages 5-19.

T. Strauch, "Timing Driven C-Slow Retiming on RTL for MultiCores on
FPGAs", ParaFPGA2013, 10-13 September 2013, Munich, Germany, pages 1-6.
Articles

T. Strauch, 2010, Hyper pipelining of multicores and SoC interconnects,
EETimes, 11/2/2010.
last modified: 2016/jul/9





