I. INTRODUCTION
In the majority of existing commercial and academic
architectural synthesis tools, primitive resources such as
ALUs or multipliers (called conventional resources
hereafter) implement the data-path. The intermediate
results are usually stored in a centralized register bank [5].
Methodologies that allow for more complex resources
have also been proposed [1]-[4]. These complex units,
called templates or clusters, consist of primitive resources
in sequence without intermediate registers. This sequencing
of operations, called chaining, is exploited during synthesis
to reduce the number of latency cycles and improve the
system's performance [5]. The templates can be contained
in an existing library [1, 3] or they can be extracted from
the Control Data Flow Graph (CDFG) of the application
by a process called template generation [2, 4].
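
The effect of chaining on latency can be illustrated with a minimal sketch. It assumes unlimited resources and an idealized model in which a template of depth d executes up to d dependent operations in one cycle; the helper names (asap_levels, latency) and the example DFG are hypothetical and are not taken from the cited tools.

```python
from math import ceil


def asap_levels(dfg):
    """Return the ASAP level (1-based) of every node in a DAG.

    dfg maps each node name to the list of its predecessor nodes.
    """
    levels = {}

    def level(node):
        # A node sits one level below its deepest predecessor.
        if node not in levels:
            levels[node] = 1 + max((level(p) for p in dfg[node]), default=0)
        return levels[node]

    for node in dfg:
        level(node)
    return levels


def latency(dfg, chain_depth=1):
    """Cycles needed when up to chain_depth dependent operations
    are chained in one cycle (assumption: unlimited resources)."""
    critical_path = max(asap_levels(dfg).values())
    return ceil(critical_path / chain_depth)


# Hypothetical chain: t1 = a*b; t2 = t1+c; t3 = t2*d; t4 = t3+e
example_dfg = {"mul1": [], "add1": ["mul1"], "mul2": ["add1"], "add2": ["mul2"]}

print(latency(example_dfg, chain_depth=1))  # 4: one operation per cycle
print(latency(example_dfg, chain_depth=2))  # 2: depth-two templates
```

The sketch counts cycles only. A chained template lengthens the clock period, so the actual performance gain also depends on the template period.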
Corazao et al. [1] have shown that templates of depth
two and at most two primitive resources per level can be
used as the computational elements on the data-path to
significantly improve performance. However, their
approach requires a large number of different template
instances. They generate templates for a large subset of
these instances and then present a scheduling algorithm that
covers the Data Flow Graph (DFG) with template instances
to maximize the throughput,
1 / (latency × template period). To achieve this goal, a
large number of templates (some occurring more than
once in the data-path) may be required. This prevents the
design of an efficient inter-cluster network to support the
system's performance. Thus, the templates are forced to
communicate through the register bank and chaining may
[Figure: panels (a) and (b); labels: Time 1, Time 2, template 1, template 2]
II.
A. Intra-component architecture
The structure of the proposed coarse-grain component
is illustrated in Fig. 2. The component consists of four nodes;
four inputs (in1, in2, in3, in4) connected to the centralized
register bank; four additional inputs (A, B, C, D) connected
either to the register bank or to another component; two
outputs (out1, out2) connected to the register bank and/or
to another component; and two outputs (out3, out4) whose
values are stored in the register bank. Since each internal
node performs a two-operand computation, multiplexers are
used to select the inputs of the second-level nodes.
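
The role of the second-level multiplexers can be sketched with a small functional model. The wiring below is an assumption made for illustration: node 1 reads in1 and in2, node 2 reads in3 and in4, and each operand of nodes 3 and 4 is selected from the first-level results or from A, B, C, D. The function evaluate_cgc and its parameters are hypothetical, and results are reported per node rather than per output port because the text does not state which node drives which output.

```python
import operator


def evaluate_cgc(ops, second_level_sel, in1, in2, in3, in4, A=0, B=0, C=0, D=0):
    """Evaluate a toy 2x2 component and return each node's result.

    ops: four two-operand functions, one per node (node 1 to node 4).
    second_level_sel: multiplexer settings for nodes 3 and 4; each is a
        pair of source names from "node1", "node2", "A", "B", "C", "D".
    Assumed first-level wiring: node 1 <- (in1, in2), node 2 <- (in3, in4).
    """
    node1 = ops[0](in1, in2)
    node2 = ops[1](in3, in4)
    # Sources that the second-level multiplexers can choose from.
    sources = {"node1": node1, "node2": node2, "A": A, "B": B, "C": C, "D": D}
    (left3, right3), (left4, right4) = second_level_sel
    node3 = ops[2](sources[left3], sources[right3])
    node4 = ops[3](sources[left4], sources[right4])
    return {"node1": node1, "node2": node2, "node3": node3, "node4": node4}


# Node 3 chains both products; node 4 combines node 1 with external input C.
result = evaluate_cgc(
    ops=(operator.mul, operator.mul, operator.add, operator.sub),
    second_level_sel=(("node1", "node2"), ("node1", "C")),
    in1=2, in2=3, in3=4, in4=5, C=1,
)
print(result)  # {'node1': 6, 'node2': 20, 'node3': 26, 'node4': 5}
```

In this configuration node 3 computes in1*in2 + in3*in4 without writing the two products to the register bank first, which is the chaining that the component is meant to exploit.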
[Fig. 2: structure of the coarse-grain component. Labels: In 1 to In 4; Node 1 to Node 4; A, B, C, D come from the register bank or from another component; outputs go to the register bank or to another component; Out 3, Out 4. Node detail labels: Sel 1, Sel 2, In B, Buffer, ALU, Out; M outputs.]
III.
[Figure label: Scheduled and Allocated DFG]
DFG        nodes  edges  Primitive resource  Two 2x2 CGCs           2nd experiment
                         latency             latency  comp. usage   average  max
dct        44     84     11                  6        11/12         6/4      3
ellip      39     76     12                  6        11/12         5/6      2
fir7       21     47     7                   4        6/8           0/4      0
fir11      33     69     11                  6        9/12          1/6      1
iir        18     31     7                   4        5/8           3/4      2
lattice    24     45     9                   5        8/10          2/5      1
volterra   34     65     12                  6        9/12          1/6      1
wavelet    69     146    17                  9        18/18         2/7      2
wdf7       52     106    13                  7        13/14         5/7      2
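
The latency figures tabulated above can be turned into relative reductions with a few lines of arithmetic. The sketch below copies the latency and comp. usage columns from the table. Reading the comp. usage denominator as the number of available CGC slots (two CGCs times the latency) is an assumption, although it matches every row; the variable names are hypothetical.

```python
NUM_CGCS = 2  # the experiment uses two 2x2 CGCs

# (DFG, primitive-resource latency, CGC latency, comp. usage numerator, denominator)
rows = [
    ("dct", 11, 6, 11, 12),
    ("ellip", 12, 6, 11, 12),
    ("fir7", 7, 4, 6, 8),
    ("fir11", 11, 6, 9, 12),
    ("iir", 7, 4, 5, 8),
    ("lattice", 9, 5, 8, 10),
    ("volterra", 12, 6, 9, 12),
    ("wavelet", 17, 9, 18, 18),
    ("wdf7", 13, 7, 13, 14),
]

reductions = {}
for name, prim, cgc, used, total in rows:
    # Percentage of latency cycles saved relative to primitive resources.
    reductions[name] = 100 * (prim - cgc) / prim
    print(f"{name:9s} {prim:2d} -> {cgc:2d} cycles "
          f"({reductions[name]:.1f}% fewer), usage {used}/{total}")

# Check the assumed reading of the usage denominator against every row.
denominator_matches = all(total == NUM_CGCS * cgc for _, _, cgc, _, total in rows)
print("denominator equals 2 x latency in every row:", denominator_matches)

average_reduction = sum(reductions.values()) / len(reductions)
print(f"average latency reduction: {average_reduction:.1f}%")
```

Under these numbers the cycle count drops by between roughly 43% and 50% for every DFG. This compares cycle counts only and says nothing about the clock period of either data-path.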