This page describes the structure of the CGRA.
The compute module
The most fine-grained modules used in the CGRA are the functional units (FUs). The following FUs are available:
Load Store Unit (LSU)
Register File Unit (RF)
Arithmetic Logic Unit (ALU)
Immediate Unit (IU)
Accumulate and Branch Unit (ABU)
Multiplier Unit (MUL)
Besides these units there are Instruction Decoders (IDs), Instruction Fetch units (IFs) and switchboxes (swb).
These units are instantiated based on the architecture description, an XML file describing the FUs available, their
properties and interconnect. The interconnect can be either fixed (specified at design time in the XML file), which
we call static, or reconfigurable, which we call dynamic. Dynamic CGRAs use switchboxes to make connections
between functional units and between IDs and FUs. By doing so processors can be constructed, either at design
time with a static CGRA or at run time with a dynamic CGRA.
The FUs, IDs, IFs and switchboxes are contained in the compute module. An example of such a compute
module is shown in the figure below. Please note that this is not meant to show a particular instantiation and
merely aims to show the structure of the CGRA. The switchboxes are not shown for clarity.
The arrows in the figure above indicate interfaces to a higher hierarchical level. For the LSUs these are interfaces
to local or global memory. For the IF units the arrows indicate an interface to the instruction memory.
The compute module also contains scan chains for loading the CGRA configuration and for reading or writing the
processor state. The control wires are available to the higher hierarchical level, the compute wrapper module.
The compute wrapper module
The compute wrapper module contains the compute module of the CGRA and adds some administrative
functionality. The independent global memory connections for each of the LSUs are connected to an arbiter
which manages access to the global memory bus. The local memory connections pass through this module since
they do not need to be arbitrated, as do the instruction memory connections to the IFs. Configuration loading is
managed by the configuration loader. This module also contains some status and control registers, which are
described in CGRA external interfaces. The configuration loader also takes care of filling the instruction
memories with the operations defined in the binary. Another, optional, module is the state controller. This module
allows reading and writing of the entire state of the compute module, which can be useful for multithreading or
debugging.
The connections for the data memory, loader and state controller are all DTL (Device Transaction Layer) buses,
essentially a simplified AXI bus. These buses are compatible with the CompSoC platform but are also very
convenient to connect to other devices, for example via DTL-to-AXI translators. The DTL port for the state
controller is an optional port and will only be instantiated when the global define INCLUDE_STATE_CONTROL
is defined.
The connections between the local memories and the LSUs, as well as the instruction memories and the IFs, are
simple direct memory interfaces. These memories are not contained within the compute wrapper module since
the ASIC toolflow typically has no memory generator. Therefore the compute wrapper module is the top-level
module for ASIC synthesis.
The core module
The core module combines the compute wrapper module with the local data memories and instruction memories.
These memories are contained in a separate module, very imaginatively called the memory module. Besides
adding the memories to the compute wrapper module, the core module does not introduce any new functionality.
The DTL buses simply pass through this module.
The core module is a ready-to-use CGRA block that can be included into FPGA designs (or the CompSoC
platform). The DTL ports have to be connected to the host processor and the required external memories; more
about this in Using the CGRA.
The top module
The top module is an (almost) standalone version of the CGRA and is mostly used for simulation. All
required memories, such as the global and state storage memory, and peripherals are contained in this module.
The only memory not contained in this module (because it is assumed to be supplied by an external system) is the
memory where the application binary resides. The top module assumes a simple hardware initiator that sends a
pointer to the address in the binary memory where the application binary resides. Peripherals specified in the
architecture XML file will be instantiated in the top module; any required inputs and outputs will be added based
on the architecture description.
The global memory bus, and therefore also the peripherals, are arbitrated between the external world (a host
processor can therefore also control the peripherals) and the CGRA. The arbiter is round-robin but does not
grant slots to ports without requests. If there are no requests from the external world there is no penalty in CGRA
performance (in cycles) from having the arbiter present.
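The arbitration policy described above can be illustrated with a small behavioural model. This is only a sketch and not the actual RTL: the class name, the port numbering and the per-cycle `grant` interface are assumptions made for the example.

```python
from typing import List, Optional


class RoundRobinArbiter:
    """Round-robin arbiter that skips ports without a pending request."""

    def __init__(self, num_ports: int) -> None:
        self.num_ports = num_ports
        # Start so that port 0 is the first one considered.
        self.last_granted = num_ports - 1

    def grant(self, requests: List[bool]) -> Optional[int]:
        """Return the port granted this cycle, or None if nobody requests."""
        for offset in range(1, self.num_ports + 1):
            port = (self.last_granted + offset) % self.num_ports
            if requests[port]:
                self.last_granted = port
                return port
        return None


EXTERNAL, CGRA = 0, 1  # hypothetical port numbering
arbiter = RoundRobinArbiter(num_ports=2)

# Only the CGRA requests: it is granted every cycle, so no cycles are lost.
cgra_only = [arbiter.grant([False, True]) for _ in range(4)]

# Both sides request: grants alternate between the two ports.
contended = [arbiter.grant([True, True]) for _ in range(4)]
```

Because idle ports are skipped rather than given an empty slot, the CGRA only loses cycles when the external world actually issues requests.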
Testbench
The testbench contains the simulation logic and application binary. It asserts the proper DTL control signals to
configure and start the CGRA. It can also be used to poll the state of the CGRA or retrieve data from the global
memory. Control signals for peripherals are available within the testbench; it is up to the user, however, to use
these signals (they are not included in the default control logic).
Verilog code structure
This section describes the structure of the Verilog code generated by the CGRA toolflow. We use the following
conventions:
[design]: is the name of the design, e.g. the testbench name 'BinarizationStatic'
[unit]: is the name of a functional unit, e.g. lsu
The code is structured as follows:
TB_TOP: top-level testbench loading the binary and initiating configuration of the DUT.
    dut: The Device Under Test, in our case the CGRA instance used.
        GM_inst: Global memory.
        DTL_[peripheral name]_inst: Peripheral connected to the global memory bus.
        [design]_Core_inst: Wrapper around the memory and compute modules.
            [design]_Memory_inst: Module containing all the memories for the CGRA
            (instruction, local data and global data memories).
                LM_[unit]: Memory for functional units, usually LSUs, that are connected
                to a local scratchpad memory.
                IM_[unit]: Instruction memory for units such as the IDs and the immediate
                units.
            [design]_Compute_Wrapper_inst: Module containing a loader and the compute
            module.
                Loader_inst: Module that manages configuration of the network and FUs.
                It also manages loading the instruction memories.
                [design]_Compute_inst: Module containing all FUs, switchboxes (if
                present) and all wiring in between.
                    IF_[unit]_inst: Instruction fetcher for an ID.
                    SWB_DATA_[unit]_inst: Switchbox module for the data network.
                    SWB_CONTROL_[unit]_inst: Switchbox module for the control
                    (decoded instructions) network.
                    [unit]_inst: Instance of a functional unit, can be an ID, IU, ALU, ABU,
                    LSU, RF or MUL.
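When inspecting signals in a simulator it helps to know the full hierarchical path of an instance. The helper below is a hypothetical sketch that joins the instance names listed above. It assumes the nesting implied by the module descriptions (top module, core, compute wrapper, compute) and uses the example names 'BinarizationStatic' and lsu from the conventions; real unit names come from the architecture XML file.

```python
def instance_path(design: str, unit: str) -> str:
    """Build a dotted hierarchical path to a functional-unit instance.

    The nesting order is an assumption derived from the module descriptions.
    """
    levels = [
        "TB_TOP",
        "dut",
        f"{design}_Core_inst",
        f"{design}_Compute_Wrapper_inst",
        f"{design}_Compute_inst",
        f"{unit}_inst",
    ]
    return ".".join(levels)


path = instance_path("BinarizationStatic", "lsu")
print(path)
```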
Instruction Set
The CGRA consists of multiple functional units (FUs) which each have their own
instruction set. These FUs are the Load Store Unit (LSU), the Register File (RF)
and the Arithmetic Logic Unit (ALU), although other units might be added in the
future. For all FUs that are implemented and functionally tested the instruction
set is listed below.
These instructions have the following format (bit widths for destination and
inputs might change in the future as we decide to add or remove in- or outputs
from the FUs):
Opcode [11:7], DataType [6:5], Destination [4:4], InputB [3:2], InputA [1:0]
The 5-bit opcode describes the operation that is to be performed. The 2-bit DataType
field selects the access width used for loading and storing bytes, halfwords, words
and doublewords. It is encoded as follows:
00: BYTE
01: HWORD
10: WORD
11: DWORD
The input describes which of the inputs of the functional unit will be selected
for a specific operand. For all operations in this instruction type: input B is the
address input and input A is the data input.
Note: when a simultaneous load and store are performed on the same read-
and write address the NEW value will be returned by the load.
Operation                     Description                                   Opcode and parameters
NOP                           No OPeration                                  0000000_?_??_??
PASS outD, inA                PASS A to output                              0010011_D_??_AA
SLA TYPE, inB, inA            Store Local Address                           00001_TT_?_BB_AA
SLI TYPE, inA                 Store Local Implicit                          00010_TT_?_??_AA
SGA TYPE, inB, inA            Store Global Address                          00011_TT_?_BB_AA
SGI TYPE, inA                 Store Global Implicit                         10110_TT_?_??_AA
LLA TYPE, outD, inB           Load Local Address                            00101_TT_D_BB_??
LLI TYPE, outD                Load Local Implicit                           00110_TT_D_??_??
LGA TYPE, outD, inB           Load Global Address                           00111_TT_D_BB_??
LGI TYPE, outD                Load Global Implicit                          01000_TT_D_??_??
LLI_SLA TYPE, outD, inB, inA  Load Local Implicit | Store Local Address     01001_TT_D_BB_AA
LGI_SLA TYPE, outD, inB, inA  Load Global Implicit | Store Local Address    01010_TT_D_BB_AA
LLA_SLI TYPE, outD, inB, inA  Load Local Address | Store Local Implicit     01011_TT_D_BB_AA
LLI_SLI TYPE, outD, inA       Load Local Implicit | Store Local Implicit    01100_TT_D_??_AA
LGA_SLI TYPE, outD, inB, inA  Load Global Address | Store Local Implicit    01101_TT_D_BB_AA
LGI_SLI TYPE, outD, inA       Load Global Implicit | Store Local Implicit   01110_TT_D_??_AA
LLI_SGA TYPE, outD, inB, inA  Load Local Implicit | Store Global Address    01111_TT_D_BB_AA
LGI_SGA TYPE, outD, inB, inA  Load Global Implicit | Store Global Address   10001_TT_D_BB_AA
LLA_SGI TYPE, outD, inB, inA  Load Local Address | Store Global Implicit    10010_TT_D_BB_AA
LLI_SGI TYPE, outD, inA       Load Local Implicit | Store Global Implicit   10011_TT_D_??_AA
LGA_SGI TYPE, outD, inB, inA  Load Global Address | Store Global Implicit   10100_TT_D_BB_AA
LGI_SGI TYPE, outD, inA       Load Global Implicit | Store Global Implicit  10101_TT_D_??_AA
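To make the bit layout concrete, the following sketch packs an LSU instruction word from the fields Opcode [11:7], DataType [6:5], Destination [4], InputB [3:2] and InputA [1:0]. The function name and the dictionaries are illustrative and not part of the toolflow; only three opcodes from the table are included, and don't-care fields are encoded as 0, which is an assumption.

```python
# DataType encodings as listed in the text.
DATA_TYPES = {"BYTE": 0b00, "HWORD": 0b01, "WORD": 0b10, "DWORD": 0b11}

# A few 5-bit opcodes copied from the LSU instruction table.
OPCODES = {"SLA": 0b00001, "LLA": 0b00101, "LGI_SLA": 0b01010}


def encode_lsu(op: str, data_type: str, dest: int = 0, in_b: int = 0, in_a: int = 0) -> int:
    """Pack a 12-bit LSU instruction word; unused fields default to 0."""
    if not (0 <= dest <= 1 and 0 <= in_b <= 3 and 0 <= in_a <= 3):
        raise ValueError("field value out of range")
    return (
        (OPCODES[op] << 7)
        | (DATA_TYPES[data_type] << 5)
        | (dest << 4)
        | (in_b << 2)
        | in_a
    )


# SLA WORD, address on input 1, data on input 2 -> 00001_10_0_01_10
word = encode_lsu("SLA", "WORD", in_b=1, in_a=2)
bits = format(word, "012b")
```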
Type 2 instructions:
This type of instruction takes a 4-bit register address and input number as
parameters. The opcode is 6 bit in size, hence the instruction format is as
follows:
The SRM operation is used to write the LSU's configuration registers, which can
be found in the LSU description.
The LRM operation can be used to read configuration registers. To ensure
compatibility with the LRM operation used for the RF, this operation always
writes to the highest output port number.
Register File
Type 1 instructions:
These instructions have the following format (bit widths for destination and
inputs might change in the future as we decide to add or remove in- or outputs
from the FUs):
The 7 bit opcode describes the operation that is to be performed. For the RF,
the only instruction of this type is NOP.
Type 3 instructions:
This type of instruction has an opcode of only 2 bits. The parameters
are: register X, register Y, and input A. The format of this instruction is:
Opcode [11:10], Register X [9:6], Register Y [5:2], InputA [1:0]
The RF only has one instruction of this type and it performs a parallel register
read and write on two different addresses. The input (A) specifies the input for
the data that is to be used in the register write part of the operation.
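As an illustration of the Type 3 layout, the sketch below splits a 12-bit word into the fields Opcode [11:10], Register X [9:6], Register Y [5:2] and InputA [1:0]. The function name is hypothetical, and the opcode value used in the example is a placeholder because the actual opcode is not listed here.

```python
def decode_rf_type3(word: int) -> dict:
    """Split a 12-bit Type 3 instruction word into its fields."""
    return {
        "opcode": (word >> 10) & 0x3,
        "reg_x": (word >> 6) & 0xF,
        "reg_y": (word >> 2) & 0xF,
        "input_a": word & 0x3,
    }


# Placeholder opcode 0b11 (assumption), register X = 5, register Y = 9, input A = 2.
example = (0b11 << 10) | (5 << 6) | (9 << 2) | 2
fields = decode_rf_type3(example)
```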
Type 2 instructions:
This type of instruction takes a 4-bit register address and input number as
parameters. The opcode is 6 bit in size, hence the instruction format is as
follows:
Input A is the data input and is only used in register write operations.
? = don't care
Type 4 instructions:
This type of instruction takes an input-specified address and data input number
as parameters. The opcode is 8 bit in size, hence the instruction format is as
follows:
Input A is the data input and is only used in register write operations. Input B is
the input on which the register address is present.
? = don't care
These instructions have the following format (bit widths for destination and
inputs might change in the future as we decide to add or remove in- or outputs
from the FUs):
110: BYTE
010: HWORD
101: WORD
The input describes which of the inputs of the functional unit will be selected
for a specific operand.
Immediate Unit:
Type 5 instructions:
The single-bit opcode specifies if the immediate value has to be written to the
output of the immediate unit.
Since the Immediate Unit (IU) is a special version of an instruction decoder (ID),
the width of these instructions can differ from the instruction width of
the other IDs.
The size of the Immediate (M) field scales with the instruction size (e.g. a 9-bit
instruction will have an 8-bit data field).
Operation  Description                      Opcode and parameters
NOPI       No OPeration                     0_{N-1{?}}
IMM value  write IMMediate to data network  1_{N-1{M}}
These instructions have the following format (bit widths for destination and
inputs might change in the future as we decide to add or remove in- or outputs
from the FUs):
The 7 bit opcode describes the operation that is to be performed. For the RF,
the only instruction of this type is NOP.
Type 2 instructions:
This type of instruction takes a 4-bit register address and input number as
parameters. The opcode is 6 bit in size, hence the instruction format is as
follows:
Input A is the data input and is only used in register write operations.
? = don't care
Type 6 instructions:
This type of instruction takes a 6-bit immediate value and input number as
parameters. The opcode is 4 bit in size, hence the instruction format is as
follows:
Input A is the input specifying the branch condition and is only used in
conditional branch operations.
? = don't care
Multiplier Unit
Type 1 instructions:
These instructions have the following format (bit widths for destination and
inputs might change in the future as we decide to add or remove in- or outputs
from the FUs):
The input describes which of the inputs of the functional unit will be selected
for a specific operand.
Operation                  Opcode and parameters  Description
NOP                        0000000_?_??_??        No OPeration
MULLU outD, inB, inA       100_1000_D_BB_AA       Unsigned multiplication of A and B, output is the lower part
MULLU_SH8 outD, inB, inA   100_1001_D_BB_AA       Unsigned multiplication of A and B, output is the lower part, result shifted right by 8 bit.
MULLU_SH16 outD, inB, inA  100_1010_D_BB_AA       Unsigned multiplication of A and B, output is the lower part, result shifted right by 16 bit.
MULLU_SH24 outD, inB, inA  100_1011_D_BB_AA       Unsigned multiplication of A and B, output is the lower part, result shifted right by 24 bit.
MULLS outD, inB, inA       101_1000_D_BB_AA       Signed multiplication of A and B, output is the lower part
MULLS_SH8 outD, inB, inA   101_1001_D_BB_AA       Signed multiplication of A and B, output is the lower part, result shifted right by 8 bit.
MULLS_SH16 outD, inB, inA  101_1010_D_BB_AA       Signed multiplication of A and B, output is the lower part, result shifted right by 16 bit.
MULLS_SH24 outD, inB, inA  101_1011_D_BB_AA       Signed multiplication of A and B, output is the lower part, result shifted right by 24 bit.
MULU outD, inB, inA        110_1000_D_BB_AA       Unsigned multiplication of A and B, complete multiplier output on port D and D+1
MULU_SH8 outD, inB, inA    110_1001_D_BB_AA       Unsigned multiplication of A and B, complete multiplier output on port D and D+1, result shifted right by 8 bit.
MULU_SH16 outD, inB, inA   110_1010_D_BB_AA       Unsigned multiplication of A and B, complete multiplier output on port D and D+1, result shifted right by 16 bit.
MULU_SH24 outD, inB, inA   110_1011_D_BB_AA       Unsigned multiplication of A and B, complete multiplier output on port D and D+1, result shifted right by 24 bit.
MULS outD, inB, inA        111_1000_D_BB_AA       Signed multiplication of A and B, complete multiplier output on port D and D+1
MULS_SH8 outD, inB, inA    111_1001_D_BB_AA       Signed multiplication of A and B, complete multiplier output on port D and D+1, result shifted right by 8 bit.
MULS_SH16 outD, inB, inA   111_1010_D_BB_AA       Signed multiplication of A and B, complete multiplier output on port D and D+1, result shifted right by 16 bit.
MULS_SH24 outD, inB, inA   111_1011_D_BB_AA       Signed multiplication of A and B, complete multiplier output on port D and D+1, result shifted right by 24 bit.
LH outD                    010_0000_D_??_??       Load contents of the high part of the multiplier output register to output D (does not do a multiplication)
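The multiplier opcodes listed above follow a regular pattern: one bit selects full versus lower-part output, one bit selects signed versus unsigned, and the two lowest bits select the shift in steps of 8. The sketch below decodes a 7-bit opcode on that basis. The regularity is an observation from the listed opcodes rather than a documented rule, and the function name is illustrative.

```python
from typing import Optional


def decode_mul_opcode(opcode: int) -> Optional[str]:
    """Map a 7-bit multiplier opcode to its mnemonic, or None if not listed."""
    if opcode == 0b0000000:
        return "NOP"
    if opcode == 0b0100000:
        return "LH"
    # All multiply opcodes have the form 1FS_10xx in the table.
    if (opcode >> 6) != 1 or ((opcode >> 2) & 0b11) != 0b10:
        return None
    full = (opcode >> 5) & 1      # 1: result on ports D and D+1
    signed = (opcode >> 4) & 1    # 1: signed multiplication
    shift = (opcode & 0b11) * 8   # right shift of 0, 8, 16 or 24 bit
    name = ("MUL" if full else "MULL") + ("S" if signed else "U")
    return name if shift == 0 else f"{name}_SH{shift}"


mnemonics = [decode_mul_opcode(op) for op in (0b1001000, 0b1011011, 0b1111010, 0b1101001)]
```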
Arithmetic Logic Unit
The Arithmetic Logic Unit (ALU) of the CGRA can take a number of inputs (by default configured to 4 inputs) and
perform arithmetic and logic operations on two of these inputs. The inputs are specified by the instruction and
selected by the two 4-input multiplexers in the top of the figure below.
The operations the ALU can perform are divided over three functional groups:
Shift operations
Logic operations
Arithmetic operations
Additionally, the output of the arithmetic operations can be used for comparing two operands. The result of the
comparison is stored in the flag register and can be used for CMOV operations. The comparison output can also
be routed to the datapath. This allows transmitting the flag to other ALUs in the pipeline or using it as a result
(e.g. binarization). When routed to the datapath the flag is extended from 1 bit to the width of the datapath. This
means that flag=0 will result in a value of 0 on the datapath while flag=1 will result in 2^D_WIDTH - 1. The flag
can also be inverted such that not only LT (Less Than) and EQ (Equals) are available but NEQ (Not Equal) and
GE (Greater than or Equal) as well.
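The flag extension and inversion can be modelled in a few lines. This is a behavioural sketch only: the function names are made up for the example, and the comparison uses plain Python integers, so it does not capture how the hardware treats signed versus unsigned operands.

```python
def extend_flag(flag: bool, d_width: int, invert: bool = False) -> int:
    """Extend a 1-bit flag to d_width bits: 0 -> 0, 1 -> 2**d_width - 1."""
    if invert:
        flag = not flag
    return (1 << d_width) - 1 if flag else 0


def compare(a: int, b: int, d_width: int) -> dict:
    """Derive all four conditions from the LT and EQ flags."""
    lt = a < b
    eq = a == b
    return {
        "LT": extend_flag(lt, d_width),
        "EQ": extend_flag(eq, d_width),
        "GE": extend_flag(lt, d_width, invert=True),   # inverted LT
        "NEQ": extend_flag(eq, d_width, invert=True),  # inverted EQ
    }


result = compare(3, 7, d_width=8)
```

An all-ones value is convenient for binarization, since it can be used directly as a pixel value or as a bit mask.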
The output of the logic module can be inverted, which results in the operations NAND, NOR, XNOR and Invert.
The destination operand specifies to which register the output has to be written. Output 0 is a buffered or
unbuffered output (depending on configuration Bit[0], 0 = unbuffered and 1 = buffered), whereas the other outputs
are buffered. The unbuffered output could be used to construct single-cycle complex operations.
Multiplier Unit
The Multiplier Unit of the CGRA can take a number of inputs (by default configured to 4 inputs) and perform
signed and unsigned multiplications on two of these inputs. The inputs are specified by the instruction and
selected by the two 4-input multiplexers in the top of the figure below.
Since the output of the multiplier is 2 times the width of the input data, the result does not fit on one output port.
The multiplier supports two methods for reading the full output:
A normal multiplication outputs the lower half of the result directly on the chosen output port. A second
instruction then reads the higher part of the result to a specified output port.
The complete result is written to two ports at once; the outputs will be dest (lower part) and dest+1 (higher
part).
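The split of the double-width product into a lower and a higher part can be sketched as follows. This models only the unshifted operations, and the function name and argument order are assumptions for the example.

```python
from typing import Tuple


def multiply(a: int, b: int, width: int, signed: bool) -> Tuple[int, int]:
    """Return (lower part, higher part) of the 2*width-bit product."""
    mask = (1 << width) - 1

    def to_signed(x: int) -> int:
        # Interpret the width-bit pattern as a two's complement number.
        x &= mask
        return x - (1 << width) if x >> (width - 1) else x

    if signed:
        product = to_signed(a) * to_signed(b)
    else:
        product = (a & mask) * (b & mask)
    product &= (1 << (2 * width)) - 1  # wrap to 2*width bits
    return product & mask, product >> width


unsigned_parts = multiply(0xFFFFFFFF, 2, width=32, signed=False)
signed_parts = multiply(0xFFFFFFFF, 2, width=32, signed=True)  # -1 * 2
```

The lower part is identical for the signed and unsigned case; only the higher part differs, which is why the sign matters when the full result is read.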
The destination operand specifies to which register the output has to be written.
Register File Unit
The Register File unit (RF) has 4 inputs of which one can be selected for addressing and one as data input. In
contrast to other functional units the register file does not have multiple addressable outputs. Instead, the highest
output number (by default output 1) can be addressed in two ways:
Through the instruction, as an immediate.
Through operand B (from the data network).
The other outputs (0 to N-2) are directly connected to the registers corresponding to their output number (e.g. r0 ->
out0). This allows for reading multiple registers from the register file at once without the cost of requiring extra
read ports.
In order to store a value into a register, the data has to be available on operand A. The register where the data
has to be written is specified by either an immediate in the instruction or via the data network.
It is possible to perform a read and write on different registers at once (using immediates for both operations).
Note that the outputs of the RF are not registered, therefore the result of a load is available to other units within
the same cycle.
Load Store Unit
The Load Store Unit (LSU) has (by default) 4 inputs of which one can be selected for addressing and one as
data input. All outputs are buffered and one of the (by default 2) outputs can be selected as the target register.
The LSU supports operations on both local (private to each LSU) and global (shared between LSUs) memory. Currently
all memories are considered to be single-cycle, but in the future an arbiter will be inserted between the global
memory and the LSUs connected to it. This will also mean that we will have to implement some kind of stall
signal.
On both the global and local memory the following operations can be performed:
Load
Store
Load implicit
Store implicit
Data Types
The LSU supports loading and storing multiple data types:
DWORD (64 bit)
WORD (32 bit)
HWORD (16 bit)
BYTE (8 bit)
However, the maximum supported width is equal to the datapath width (e.g. a 32-bit CGRA supports BYTE,
HWORD and WORD).
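The rule that the datapath width limits the usable data types can be expressed as a simple filter. The names below are illustrative only.

```python
# Data type widths in bits, as listed above.
DATA_TYPE_BITS = {"BYTE": 8, "HWORD": 16, "WORD": 32, "DWORD": 64}


def supported_types(datapath_width: int) -> list:
    """Return the data types that fit in the given datapath width."""
    return [name for name, bits in DATA_TYPE_BITS.items() if bits <= datapath_width]


types_32 = supported_types(32)
```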
Load and store
These operations take all their required information (address and data) from the inputs. Operand A is used for
data and Operand B is used for addressing. Currently we consider the maximum address space of the local
memory to be 16 bit and the address space of the global memory to be 32 bit. This means that e.g. a CGRA with
an 8-bit datapath cannot directly address all memory (both local and global) with addresses supplied from the
data network. To overcome this, the LSU contains several configuration registers, which also contain registers
that hold the higher bytes/words of the addresses. The contents of these registers, together with the address
supplied on the input, will form the final memory address. With a 16-bit CGRA all local memory can be directly
addressed and with a 32-bit CGRA all global memory can be directly addressed as well.
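The sketch below illustrates how a final address could be formed from the input value and the address extension held in the configuration registers. It assumes plain concatenation (extension in the upper bits, input in the lower bits), which the text suggests but does not state explicitly; the function and parameter names are hypothetical.

```python
def form_address(input_value: int, extension: int, datapath_width: int, address_width: int) -> int:
    """Combine the input address with the configured upper address bits."""
    low_mask = (1 << datapath_width) - 1
    addr_mask = (1 << address_width) - 1
    if datapath_width >= address_width:
        # The datapath is wide enough: the extension is not needed.
        return input_value & addr_mask
    return ((extension << datapath_width) | (input_value & low_mask)) & addr_mask


local_8bit = form_address(0x34, 0x12, datapath_width=8, address_width=16)
global_16bit = form_address(0xBEEF, 0xDEAD, datapath_width=16, address_width=32)
local_16bit = form_address(0x1234, 0xFF, datapath_width=16, address_width=16)
```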
Implicit load and store
These operations use the configuration registers to implement some automatic address generation. The
configuration registers allow specifying the start address and the stride. Each time an implicit operation is
performed the address is incremented with the stride. This is done separately for loads and stores and for global and
local accesses.
For the global memory the start address and the bytes for memory address extension (for 8-bit and 16-bit
CGRAs) are shared, meaning that this address will increment with each implicit operation. For local memory the
start address and memory address extension are separated.
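A minimal model of the implicit address generation is shown below. It assumes that the current address is used first and incremented afterwards, and that the address wraps at the address width; neither detail is confirmed by the text. The class name is made up for the example.

```python
class ImplicitAddressGenerator:
    """Start address plus stride, advanced on every implicit operation."""

    def __init__(self, start: int, stride: int, address_width: int = 16) -> None:
        self.address = start
        self.stride = stride
        self.mask = (1 << address_width) - 1

    def next_address(self) -> int:
        """Return the address for this operation, then add the stride."""
        current = self.address
        self.address = (self.address + self.stride) & self.mask
        return current


# Loads and stores keep independent counters.
load_gen = ImplicitAddressGenerator(start=0x100, stride=4)
store_gen = ImplicitAddressGenerator(start=0x200, stride=1)

load_addresses = [load_gen.next_address() for _ in range(3)]
store_addresses = [store_gen.next_address() for _ in range(3)]
```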
Dual issue
Some operations that do not conflict with respect to input selection can be executed in parallel (e.g. a store on
the local memory and an implicit load from the global memory). This allows for a higher memory bandwidth and for
very efficient memory copy or shuffling. The operations that can be performed in parallel have a special
instruction facilitating this (e.g. LGI_SLA), where the underscore denotes the simultaneous execution.
LSU Configuration Registers
In order to overcome limitations related to datapath width and the width required to address the local and global
memories, the Load Store Unit (LSU) has some configuration registers. Additionally these registers can be used for
implicit addressing (automatically incrementing memory addresses), which allows parallel load and store operations.
The mapping of these configuration registers is as follows:
Register  Description
0         Start address for implicit local load operations (high byte in 8-bit
          datapath, not used for width larger than 8)
1         Start address for implicit local load operations (low byte in 8-bit
          datapath)
2         Stride for implicit (local and global) load operations
3         Start address for implicit local store operations (high byte in 8-bit
          datapath, not used for width larger than 8)
4         Start address for implicit local store operations (low byte in 8-bit
          datapath)
5         Stride for implicit (local and global) store operations
6         Global load address (also implicit counter) (byte 3 for W=8,
          unused for W=16, unused for W=32 and higher)
7         Global load address (byte 2 for W=8, unused for W=16, unused
          for W=32 and higher)
8         Global load address (byte 1 for W=8, word 1 for W=16, unused for
          W=32 and higher)
9         Global load address (byte 0 for W=8, word 0 for W=16, full for
          W=32 and higher)
10        Global store address (also implicit counter) (byte 3 for W=8,
          unused for W=16, unused for W=32 and higher)
11        Global store address (byte 2 for W=8, unused for W=16, unused
          for W=32 and higher)
12        Global store address (byte 1 for W=8, word 1 for W=16, unused
          for W=32 and higher)
13        Global store address (byte 0 for W=8, word 0 for W=16, full for
          W=32 and higher)
14        Local load address high byte (only for W=8)
15        Local store address high byte (only for W=8)
Accumulate & Branch Unit
The Accumulate and Branch Unit (ABU) can be configured to perform two tasks, as the name implies.
It can be used as a multi-register accumulator.
It can be used to calculate program counters and hence function as a branch unit.
Selection between these two functionalities is made at configuration time by setting (branch functionality) or
clearing (accumulate functionality) the configuration bit.
Accumulate mode:
The width of the accumulation registers is 16 bit in 8-bit and 16-bit modes and 32 bit in 32-bit mode. The accumulate
output of the selected register is available at the highest port number(s).
In 8-bit mode two outputs are used to output one of the 16-bit accumulation registers; 16 accumulate
registers are available.
In 16-bit mode one output is used to output the selected accumulate register; 16 accumulate registers are
available. Any additional outputs are connected directly to the register number corresponding to the port
number.
In 32-bit mode one output is used to output the selected accumulate register. Pairs of 16-bit registers are
concatenated in this mode, therefore 8 32-bit accumulate registers are available. Any additional outputs are
connected directly to the register number corresponding to the port number.
Branch mode:
The program counter is available at the highest port number(s) and uses a 16-bit counter. In 16-bit and 32-bit
mode the port N-1 can be used to load values from any of the 16 internal 16-bit registers. In 8-bit mode this option
is not available due to port number limitations.
The branch unit supports absolute/relative conditional/unconditional jumps. When the branch only has to jump a
limited number of instructions it is possible to use an immediate branch instruction, otherwise the branch target
has to be present on one of the inputs. Conditions always have to be present on the inputs.
In 8-bit mode two outputs are used to output the 16-bit program counter. No register reading is supported.
Any additional outputs are connected directly to the register number corresponding to the port number.
In 16-bit and 32-bit mode the highest output port produces the program counter, port number N-1 reads
the selected register. Any additional outputs are connected directly to the register number corresponding to
the port number.
Not yet implemented: The additional registers will, in the future, be used for configuration of hardware loop
support.
Immediate Unit
The Immediate Unit (IU) is an Instruction Decoder that does not perform any other action than to put a value,
encoded in the instruction, on the data network.
If the width of the immediate part of the instruction is equal to or larger than that of the data network, the immediate
will be directly available on the network. If the width of the immediate part of the instruction is smaller than the
data network width, the immediate is built from several immediate instructions. Each time an immediate
instruction is executed the data is shifted by the number of bits available in each immediate instruction.
For example:
Our instruction width is 9 bit, therefore the size of the immediate is 8 bit (9 minus one write enable bit).
We assume the data network to be 32 bit, meaning we need 4 loads to fill the output of the immediate unit
with the 32-bit value.
If we wanted to load the immediate value 0xAABBCCDD, we would execute the following instructions:
imm 0xAA (IU output value: 0x??????AA)
imm 0xBB (IU output value: 0x????AABB)
imm 0xCC (IU output value: 0x??AABBCC)
imm 0xDD (IU output value: 0xAABBCCDD)
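The shifting behaviour in this example can be reproduced with a short model. The sketch assumes the IU output starts at 0, whereas in hardware the initial contents are unknown (shown as ? above). It also shows the instruction word for an IMM operation, following the 1_{N-1{M}} pattern from the instruction table. Function names are illustrative.

```python
def encode_imm(value: int, instruction_width: int = 9) -> int:
    """Build an IMM instruction word: write-enable bit followed by the immediate."""
    imm_bits = instruction_width - 1
    return (1 << imm_bits) | (value & ((1 << imm_bits) - 1))


def load_immediate(chunks, imm_bits: int = 8, data_width: int = 32) -> list:
    """Shift each immediate chunk into the output value; return the value after each step."""
    data_mask = (1 << data_width) - 1
    chunk_mask = (1 << imm_bits) - 1
    value = 0  # assumed initial state, unknown in hardware
    trace = []
    for chunk in chunks:
        value = ((value << imm_bits) | (chunk & chunk_mask)) & data_mask
        trace.append(value)
    return trace


trace = load_immediate([0xAA, 0xBB, 0xCC, 0xDD])
first_word = encode_imm(0xAA)
```

After the fourth instruction the full 32-bit value is present regardless of the initial contents, because all earlier bits have been shifted out.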
The IU can be configured (with a parameter in Verilog) to insert a bubble or not. This allows the total number of
pipeline stages from the IF to the data arriving on the data network to be equal for both the normal ID and FUs
and the IU. The number of pipeline stages without bubble insertion is 2 and with bubble insertion 3.