
I give my permission to post my exam score and my final grade on the CSC-506 web site.

Signed_____________________________

Name____________________________

CSC-506: Architecture of Parallel Computers
1st Summer, 1999 Final Exam

Question 1: (27 points, 3 per step)

A symmetrical multiprocessor computer system is implemented using the MESI cache coherency algorithm. Each processor has 64 kilobytes of addressable memory and a 2-way set-associative write-back cache with 256 sets and 16 bytes per line. The LRU replacement policy is used. Assume that each cache is empty when we start executing programs on the two processors, and they make the following sequence of references. Use the tables provided to give the tag, set, line number, MESI state, and data for non-empty cache lines, and the contents of memory after each step of the sequence. Be sure to fill in the complete contents of occupied cache lines and memory at each step. Use hexadecimal notation for all numbers.

Initial Memory
  Loc     0     4     8     C
  0B5x    0000  0000  0000  0000
  1B5x    0000  0000  0000  0000
  2B5x    0000  0000  0000  0000
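With 64 KB of addressable memory (16-bit addresses), 16-byte lines (4 offset bits), and 256 sets (8 index bits), an address splits into a 4-bit tag, an 8-bit set index, and a 4-bit offset. A quick sketch of that decomposition (the field widths follow directly from the cache parameters; the helper name is mine):

```python
def decompose(addr):
    """Split a 16-bit address into (tag, set, offset) for a cache with
    16-byte lines (4 offset bits) and 256 sets (8 index bits)."""
    offset = addr & 0xF          # low 4 bits select the byte within the line
    index = (addr >> 4) & 0xFF   # next 8 bits select the set
    tag = (addr >> 12) & 0xF     # remaining 4 bits are the tag
    return tag, index, offset

tag, index, offset = decompose(0x0B5C)
print(f"tag={tag:X} set={index:X} offset={offset:X}")  # tag=0 set=B5 offset=C
```

Every reference in this question (0B5x, 1B5x, 2B5x) therefore maps to set B5, which is why the two cache lines of that one set see all of the traffic.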

1. Processor 0 reads from location 0B5C

Processor 0 cache
  Tag  Set  Line  MESI    Data: 0     4     8     C
  0    B5   0     E             0000  0000  0000  0000

Memory
  Loc     0     4     8     C
  0B5x    0000  0000  0000  0000
  1B5x    0000  0000  0000  0000
  2B5x    0000  0000  0000  0000

Processor 1 cache
  Tag  Set  Line  MESI    Data: 0     4     8     C
  (empty)

CSC506/NCSU/GQK/1999

Page 1

2. Processor 1 writes 1111 to location 0B54

Processor 0 cache
  Tag  Set  Line  MESI    Data: 0     4     8     C
  0    B5   0     I             ----  ----  ----  ----

Memory
  Loc     0     4     8     C
  0B5x    0000  0000  0000  0000
  1B5x    0000  0000  0000  0000
  2B5x    0000  0000  0000  0000

Processor 1 cache
  Tag  Set  Line  MESI    Data: 0     4     8     C
  0    B5   0     M             0000  1111  0000  0000

3. Processor 0 reads from location 0B58

(Processor 1's modified line is written back to memory when the read is snooped, and both copies go to the Shared state.)

Processor 0 cache
  Tag  Set  Line  MESI    Data: 0     4     8     C
  0    B5   0     S             0000  1111  0000  0000

Memory
  Loc     0     4     8     C
  0B5x    0000  1111  0000  0000
  1B5x    0000  0000  0000  0000
  2B5x    0000  0000  0000  0000

Processor 1 cache
  Tag  Set  Line  MESI    Data: 0     4     8     C
  0    B5   0     S             0000  1111  0000  0000


4. Processor 1 writes 2222 to location 1B54

Processor 0 cache
  Tag  Set  Line  MESI    Data: 0     4     8     C
  0    B5   0     S             0000  1111  0000  0000

Memory
  Loc     0     4     8     C
  0B5x    0000  1111  0000  0000
  1B5x    0000  0000  0000  0000
  2B5x    0000  0000  0000  0000

Processor 1 cache
  Tag  Set  Line  MESI    Data: 0     4     8     C
  0    B5   0     S             0000  1111  0000  0000
  1    B5   1     M             0000  2222  0000  0000

5. Processor 0 writes 3333 to location 1B5C Processor 0 cache Tag RAM Tag Set Line MESI 0 1 B5 B5 0 1 S M

0 0000 0000

Data RAM 4 8 1111 2222 0000 0000

C 0000 3333 Memory Loc 0B5x 1B5x 2B5x C 0000 ---0 0000 0000 0000 Memory Data 4 8 1111 2222 0000 0000 0000 0000 C 0000 0000 0000

Processor 1 cache Tag RAM Tag Set Line MESI 0 1 B5 B5 0 1 S I

0 0000 ----

Data RAM 4 8 1111 ---0000 ----


6. Processor 0 reads from location 0B5C

(A read hit in Processor 0's cache; no state changes.)

Processor 0 cache
  Tag  Set  Line  MESI    Data: 0     4     8     C
  0    B5   0     S             0000  1111  0000  0000
  1    B5   1     M             0000  2222  0000  3333

Memory
  Loc     0     4     8     C
  0B5x    0000  1111  0000  0000
  1B5x    0000  2222  0000  0000
  2B5x    0000  0000  0000  0000

Processor 1 cache
  Tag  Set  Line  MESI    Data: 0     4     8     C
  0    B5   0     S             0000  1111  0000  0000
  1    B5   1     I             ----  ----  ----  ----

7. Processor 1 writes 4444 to location 2B50

Processor 0 cache
  Tag  Set  Line  MESI    Data: 0     4     8     C
  0    B5   0     S             0000  1111  0000  0000
  1    B5   1     M             0000  2222  0000  3333

Memory
  Loc     0     4     8     C
  0B5x    0000  1111  0000  0000
  1B5x    0000  2222  0000  0000
  2B5x    0000  0000  0000  0000

Processor 1 cache
  Tag  Set  Line  MESI    Data: 0     4     8     C
  0    B5   0     S             0000  1111  0000  0000
  2    B5   1     M             4444  0000  0000  0000

Note: Some students erroneously victimized line 0 (the line with tag 0) here by applying strict LRU, even though line 1 was available because it was marked Invalid.


8. Processor 0 reads from location 2B5C

(Processor 0 victimizes line 1, the LRU line, writing its modified 1B5x data back to memory. Processor 1's modified 2B5x line is also written back when the read is snooped, and both copies of 2B5x go to the Shared state.)

Processor 0 cache
  Tag  Set  Line  MESI    Data: 0     4     8     C
  0    B5   0     S             0000  1111  0000  0000
  2    B5   1     S             4444  0000  0000  0000

Memory
  Loc     0     4     8     C
  0B5x    0000  1111  0000  0000
  1B5x    0000  2222  0000  3333
  2B5x    4444  0000  0000  0000

Processor 1 cache
  Tag  Set  Line  MESI    Data: 0     4     8     C
  0    B5   0     S             0000  1111  0000  0000
  2    B5   1     S             4444  0000  0000  0000

9. Processor 0 writes 5555 to location 2B5C

Processor 0 cache
  Tag  Set  Line  MESI    Data: 0     4     8     C
  0    B5   0     S             0000  1111  0000  0000
  2    B5   1     M             4444  0000  0000  5555

Memory
  Loc     0     4     8     C
  0B5x    0000  1111  0000  0000
  1B5x    0000  2222  0000  3333
  2B5x    4444  0000  0000  0000

Processor 1 cache
  Tag  Set  Line  MESI    Data: 0     4     8     C
  0    B5   0     S             0000  1111  0000  0000
  2    B5   1     I             ----  ----  ----  ----
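The per-line state changes in steps 1 through 9 all follow the standard MESI transitions. A minimal sketch of the transition rules as used in this solution (the event names are my own labeling, not from the exam):

```python
# MESI next-state function for one cache line. Events:
#   read_hit / write_hit   - local processor access that hits
#   read_miss_shared       - local read miss, another cache holds the line
#   read_miss_exclusive    - local read miss, no other cache holds the line
#   write_miss             - local write miss (read-with-intent-to-modify)
#   snoop_read             - another processor reads this line
#   snoop_write            - another processor writes it (RWITM/invalidate)
TRANSITIONS = {
    ("I", "read_miss_exclusive"): "E",
    ("I", "read_miss_shared"): "S",
    ("I", "write_miss"): "M",
    ("E", "read_hit"): "E",
    ("E", "write_hit"): "M",
    ("E", "snoop_read"): "S",
    ("E", "snoop_write"): "I",
    ("S", "read_hit"): "S",
    ("S", "write_hit"): "M",   # broadcasts an invalidate to other caches
    ("S", "snoop_read"): "S",
    ("S", "snoop_write"): "I",
    ("M", "read_hit"): "M",
    ("M", "write_hit"): "M",
    ("M", "snoop_read"): "S",  # modified data is written back to memory
    ("M", "snoop_write"): "I", # modified data is supplied, then invalidated
}

def next_state(state, event):
    return TRANSITIONS[(state, event)]

# Step 1: P0's read miss with no other copy -> Exclusive
print(next_state("I", "read_miss_exclusive"))  # E
# Step 2: P1's write miss makes its copy Modified; P0's copy is invalidated
print(next_state("I", "write_miss"), next_state("E", "snoop_write"))  # M I
```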


Question 2. (15 points) The following 8 x 8 Omega (shuffle-exchange) network shows P3 accessing M1. In the box provided, identify ALL other connections that are blocked by the switch while P3 is accessing M1. DO NOT identify any connections that involve either P3 or M1, because they are blocked at the processor or at the memory rather than in the switch.

(Figure: 8 x 8 Omega network, processors P0-P7 on the left, memories M0-M7 on the right, with the path from P3 to M1 highlighted.)

Blocked connections (Processor to Memory):
  1. P1 to M0
  2. P5 to M0
  3. P7 to M0
  4. P7 to M2
  5. P7 to M3
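The blocked set can be checked mechanically. Routing in an 8 x 8 Omega network takes source s2s1s0 to destination d2d1d0 through stage-1 switch s1s0, stage-2 switch s0d2, and stage-3 switch d2d1, and two connections conflict at a 2x2 switch when they need the same input link or the same output link. A sketch of that enumeration (this switch-labeling convention is an assumption of the sketch, and the helper name `path` is mine):

```python
def path(s, d):
    """Switch usage for the connection s -> d in an 8x8 Omega network.
    Returns (stage, switch, input_port, output_port) tuples; each stage
    applies a perfect shuffle and then a 2x2 exchange."""
    s2, s1, s0 = (s >> 2) & 1, (s >> 1) & 1, s & 1
    d2, d1, d0 = (d >> 2) & 1, (d >> 1) & 1, d & 1
    return [
        (1, (s1 << 1) | s0, s2, d2),   # stage 1: switch s1s0
        (2, (s0 << 1) | d2, s1, d1),   # stage 2: switch s0d2
        (3, (d2 << 1) | d1, s0, d0),   # stage 3: switch d2d1
    ]

base = path(3, 1)                      # P3 accessing M1
blocked = set()
for s in range(8):
    for d in range(8):
        if s == 3 or d == 1:           # ignore connections involving P3 or M1
            continue
        for (stage, sw, i, o), (_, bsw, bi, bo) in zip(path(s, d), base):
            # conflict: same switch and same input link or same output link
            if sw == bsw and (i == bi or o == bo):
                blocked.add((s, d))
print(sorted(blocked))  # [(1, 0), (5, 0), (7, 0), (7, 2), (7, 3)]
```

The enumeration reproduces the five answers above: P7 collides with P3 at stage-1 switch 3 unless it heads for M4-M7, and P1 and P5 collide with the path's output link at stage-2 switch 2 when heading for M0.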


Question 3. A two-processor SMP system uses a shared bus to access main memory. The data bus is 32 bits wide and operates at a clock rate of 100 MHz. Each processor has a private write-back cache. The internal clock rate of each processor is 500 MHz, and the average cache access rate is 0.3 instruction fetches and 0.5 data fetches per instruction. This unified cache uses 8 words per line and has an access time of 1 clock (2 ns). The main memory uses 100 MHz SDRAM with a RAS access time of 5 clocks. Assume that the processing rate is limited by the data access rate (that is, 100% utilization of the processor-to-cache bus). Also assume that we access the requested (missed) word first from the SDRAM on a cache miss, and that the processor can proceed (as soon as the requested word reaches the cache) while the remaining words of the cache line are filled.

a) (5 points) What is the effective MIPS rate of each processor running as a uniprocessor with a cache hit rate of 100%?

0.3 + 0.5 = 0.8 accesses per instruction. 2 ns per access x 0.8 accesses per instruction = 1.6 ns per instruction, or 625 MIPS.

b) (5 points) What is the effective MIPS rate of each processor running as a uniprocessor with a cache hit rate of 99%?

t_eff = 2 + (1 - 0.99)(10 ns)(5 clocks) = 2.5 ns per access. 2.5 ns per access x 0.8 accesses per instruction = 2 ns per instruction, or 500 MIPS.

CSC506/NCSU/GQK/1999

Page 7

c) (5 points) Without considering second-order effects (bus contention slowing down each of the processors), what is the memory bus utilization with the two processors running in the multiprocessor configuration with a hit rate of 99%?

Each processor uses the bus for (10)(5) = 50 ns of access time plus 7 x 10 ns = 70 ns of cache fill time, or 120 ns, out of every 100 accesses (0.01 miss rate). Of the 100 accesses, 99 take the cache hit time of 2 ns and one takes the miss time of 2 + 50 ns of main memory access time. Therefore, each processor is using the bus (50 + 70)/((100)(2) + 50) = 120/250 = 48% of the time. With two processors, the bus utilization is (2)(48%) = 96%.

d) (5 points) What would the bus utilization be if we changed to a transaction bus that runs at 500 MHz? Assume that we divide the memory into a sufficient number of banks so that we do not have contention for the actual memory. Also assume that we use one bus cycle to make a memory request and that we complete the memory access of 8 words before we reconnect to the bus to transfer the data words to the processor and cache.

Each processor uses the bus for 1 cycle (2 ns) to make the request and 8 cycles (16 ns) to transfer the eight data words, a total of 9 cycles (18 ns), once out of every 100 accesses (0.01 miss rate). Of the 100 accesses, 99 take the cache hit time of 2 ns and one takes the miss time of 2 (cache) + 2 (memory request over the bus) + 50 (access the first word) + 70 (read the remainder of the cache line) + 2 (transfer the first word over the bus) ns. Therefore, each processor is using the bus (2 + 16)/(200 + 2 + 50 + 70 + 2) = 18/324 = 5.6% of the time. With two processors, the bus utilization is (2)(5.6%) = 11.1%.

e) (5 points) What is the MIPS rate for each of the processors using the transaction bus as described in part d)?

Our effective memory access time is longer because we have to wait the additional time to access main memory on a cache miss. t_eff = 2 + (1 - 0.99)(2 + 50 + 70 + 2) = 3.24 ns per access. 3.24 ns per access x 0.8 accesses per instruction = 2.592 ns per instruction, or 386 MIPS.
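The arithmetic in parts a) through e) can be double-checked with a short script (all constants are taken from the problem statement; the variable names are mine):

```python
ACCESSES = 0.8          # cache accesses per instruction (0.3 + 0.5)
CACHE_NS = 2.0          # cache access time: 1 clock at 500 MHz
MISS_NS = 10.0 * 5      # SDRAM first-word access: 5 clocks at 100 MHz
FILL_NS = 7 * 10.0      # remaining 7 words of the 8-word line

# a) 100% hit rate
mips_a = 1000.0 / (CACHE_NS * ACCESSES)
# b) 99% hit rate: t_eff = 2 + 0.01 * 50 = 2.5 ns per access
mips_b = 1000.0 / ((CACHE_NS + 0.01 * MISS_NS) * ACCESSES)
# c) shared-bus utilization: bus busy 50 + 70 = 120 ns per 100 accesses,
#    which take 99*2 + (2 + 50) = 250 ns
util_c = (MISS_NS + FILL_NS) / (99 * CACHE_NS + CACHE_NS + MISS_NS)
# d) 500 MHz transaction bus: 1 request cycle + 8 data cycles = 18 ns busy;
#    the 100 accesses take 100*2 + 2 + 50 + 70 + 2 = 324 ns
util_d = (2 + 8 * 2) / (100 * CACHE_NS + 2 + MISS_NS + FILL_NS + 2)
# e) t_eff = 2 + 0.01 * (2 + 50 + 70 + 2) = 3.24 ns per access
mips_e = 1000.0 / ((CACHE_NS + 0.01 * (2 + MISS_NS + FILL_NS + 2)) * ACCESSES)

print(mips_a, mips_b, 2 * util_c, 2 * util_d, mips_e)
```

Running it reproduces 625 and 500 MIPS, 96% and roughly 11.1% bus utilization, and about 386 MIPS for the transaction-bus case.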


Question 4. Our processor runs with a 100 MHz clock and uses an instruction pipeline with the following stages: instruction fetch = 2 clocks, instruction decode = 1 clock, fetch operands = 2 clocks, execute = 1 clock, and store result = 2 clocks. Assume a sequence of instructions where one of the inputs to every instruction depends on the result of the previous instruction.

a) (5 points) What is the MIPS rate of the processor where we do not implement internal forwarding?

b) (5 points) What is the MIPS rate of the processor where we do implement internal forwarding?

(Timing chart for part a: without internal forwarding, each instruction's operand fetch waits until the previous instruction's store result stage completes.)

a) We lose 3 cycles in the operand fetch unit waiting for the results of the previous instruction to be stored, so we complete one instruction every 5 clocks. With a clock rate of 100 MHz, we get 100/5 = 20 MIPS.

(Timing chart for part b: with internal forwarding, successive instructions proceed back to back, limited only by the 2-clock stages.)

b) The execute stage can forward its output directly to one of the inputs of the next instruction as well as to the operand store unit. We lose no cycles waiting for previous results and so our processing rate is limited only by the longest stage in the pipeline, which is two cycles. With a clock rate of 100 MHz, we get 100/2 = 50 MIPS.
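Both throughput figures can be checked with a quick calculation (the stall count follows the solution's analysis rather than a full pipeline simulation):

```python
CLOCK_MHZ = 100
STAGES = {"fetch": 2, "decode": 1, "op_fetch": 2, "execute": 1, "store": 2}

# Without forwarding: operand fetch stalls for the previous instruction's
# execute (1) and store (2) stages on top of its own 2-clock occupancy,
# so one instruction completes every 5 clocks.
clocks_no_fwd = STAGES["op_fetch"] + STAGES["execute"] + STAGES["store"]
# With forwarding: throughput is set by the longest stage in the pipeline.
clocks_fwd = max(STAGES.values())

print(CLOCK_MHZ // clocks_no_fwd, CLOCK_MHZ // clocks_fwd)  # 20 50
```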


Question 5. Use the following collision vector: 0 1 0 1 1 0 1

a) (5 points) Draw the reduced state-diagram. b) (5 points) Show the maximum-rate cycle and pipeline efficiency.

Reduced state diagram (states written as collision vectors, leftmost bit = latency 1):

  0101101 (initial state)
    latency 1      -> 1111111
    latencies 3, 6 -> 1101101
    latency 8+     -> 0101101 (back to the initial state)
  1111111
    latency 8+     -> 0101101
  1101101
    latencies 3, 6 -> 1101101 (self-loop)
    latency 8+     -> 0101101

The maximum-rate cycle is the self-loop on state 1101101: it initiates at latency 3 repeatedly (3, 3, 3, ...), for 1 initiation every 3 cycles, giving 1/3 = 33% pipeline efficiency. It is not the greedy cycle, because we must wait and initiate at the third clock to get into the cycle, bypassing the first permissible initiation (latency 1) from the initial state.
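The reduced state diagram can be generated mechanically from the collision vector: an initiation at latency L shifts the current state left by L (time passing) and ORs in the original collision vector (the new initiation's forbidden latencies). A sketch of that construction:

```python
CV = "0101101"  # collision vector; leftmost bit = latency 1
N = len(CV)

def successors(state):
    """Map each permissible latency (a 0 bit) to the resulting state.
    Latencies greater than N always return to the initial state CV and
    are omitted here."""
    out = {}
    for lat in range(1, N + 1):
        if state[lat - 1] == "0":              # latency is permitted
            shifted = state[lat:] + "0" * lat  # lat cycles elapse
            out[lat] = "".join("1" if a == "1" or b == "1" else "0"
                               for a, b in zip(shifted, CV))
    return out

# Collect all reachable states with a simple worklist search.
states, work = {CV}, [CV]
while work:
    st = work.pop()
    for nxt in successors(st).values():
        if nxt not in states:
            states.add(nxt)
            work.append(nxt)

for st in sorted(states):
    print(st, successors(st))
```

The search finds exactly the three states above, with latencies 3 and 6 looping on 1101101, which is the maximum-rate cycle.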


Question 6. A company builds a vector computer with four parallel vector processors that each run at a clock rate of 500 MHz. The floating point add pipeline is 6 stages and the floating point multiply unit is 11 stages. Assume that memory bandwidth is sufficient to feed the pipelines. a) (5 points) What is the effective speedup of vector additions for input vectors of length 4? b) (5 points) What is the effective speedup of vector multiplications for input vectors of length 1000?

Speedup = Best Serial Time / Parallel Execution Time

Considering the number of stages in a pipeline (k) and the length of the input vector (n), this is:

  S(k) = nk / (k + (n - 1))

The add pipeline is 6 stages long, and we have 4 pipelines operating in parallel (the four parallel processors). The best serial time is nk, the number of vector elements times the number of pipeline stages. The parallel time is the time it takes to get all of the vector elements through the four parallel pipelines.

a) For the additions on vectors of length 4, the vectors are split into 4 subvectors of length 1, so each processor gets a single element:

  S = 4(6) / (6 + (1 - 1)) = 24/6 = 4

b) For the multiplications on vectors of length 1000, each processor gets a subvector of length 1000/4 = 250. The speedup for the set of four 11-stage pipelines is then:

  S = 1000(11) / (11 + (250 - 1)) = 11000/260 = 42.3
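Both parts use the same formula, with each of the p processors working on a subvector of length n/p. A one-function sketch:

```python
def speedup(n, k, p):
    """Speedup of p parallel k-stage pipelines on an n-element vector:
    serial time n*k versus pipelined time k + (n/p - 1)."""
    return (n * k) / (k + (n / p - 1))

print(speedup(4, 6, 4))                # part a: additions, length 4
print(round(speedup(1000, 11, 4), 1))  # part b: multiplications, length 1000
```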

Question 7. (3 points) Why shouldn't we use a normal STORE instruction to clear a lock (set it to zero) in processor systems that do not implement strong consistency?

We need to use explicit synchronizing instructions in order to provide weak or release consistency. If the ordinary store instruction were declared a synchronizing instruction, the processor would need to ensure that all prior instructions had completed processing in the pipeline before starting execution of any store instruction, which would severely impact performance.

