
EECE7352

Computer Architecture
Assignment-I
FALL 2014

ROHIT RAO NAINENI


NU ID- 001737178

PROBLEM 1:
For the first part of this problem, the Dhrystone benchmark was compiled and
run on three different micro-architectural platforms. The two Instruction Set
Architectures used were Sun/Solaris/SPARC (the machines pi and seconds) and
Intel/Linux/X86 (the systems ergs, grams, etc.).
Two micro-architectural versions of X86 were used (X86 32-bit and X86_64), and
SPARC (SunOS 5.10, sun4v SUNW) served as the third platform.
Dhrystone benchmark:
Dhrystone benchmark is provided for this problem. Dhrystone is a synthetic
computing benchmark program developed in 1984 by Reinhold P. Weicker which
measures processor and compiler efficiency by executing a typical set of integer
calculations. These calculations include integer arithmetic, string/array
manipulation, and pointers. The Dhrystone benchmark contains no floating point
operations. The results provide a general measure of user level integer
performance.

1. On Linux/X86 (64 bit) (i7 Haswell processor)


a. Using gcc to compile
First, compile and run dhrystone.c using:
-bash-3.00$ gcc dhrystone.c -o dhrystone
Initially the number of loops is small, so the reported Dhrystone time is 0.
Adjust loop count:
Using the vi editor, the number of loops was changed to 220,000,000. The
Dhrystone benchmark was then run 20 times and the results tabulated. Twenty runs
were recorded because a benchmark run can suffer from cold-start effects (caches
and other state start out "cold"), so averaging over repeated runs smooths out this
and other run-to-run variation. The results tabulated for the various optimization
levels are shown below:
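
The 20 runs per optimization level can be automated with a small shell loop; a
minimal sketch is shown below, assuming the loop count has already been
hard-coded in dhrystone.c as described above (file and binary names are
placeholders):

    #!/bin/bash
    # Sketch: build dhrystone.c at each optimization level and run it 20 times,
    # appending the benchmark output to a per-level log file.
    for opt in -O0 -O1 -O2 -O3 -Os; do
        gcc "$opt" dhrystone.c -o dhry_bench        # placeholder binary name
        for run in $(seq 1 20); do
            ./dhry_bench >> "dhrystone_${opt}.log"
        done
    done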

TABLE 1 - Dhrystone on Linux/X86_64 compiled with gcc (220,000,000 loops per run; -O0 = without optimization)

S.No | -O0 PASSES | -O0 Dhrystones/s | -O1 PASSES | -O1 Dhrystones/s | -O2 PASSES | -O2 Dhrystones/s | -O3 PASSES | -O3 Dhrystones/s | -Os PASSES | -Os Dhrystones/s
1  | 13 | 16923076 | 6 | 36666666 | 4 | 55000000 | 4 | 55000000 | 6 | 36666666
2  | 13 | 16923076 | 6 | 36666666 | 4 | 55000000 | 4 | 55000000 | 5 | 55000000
3  | 13 | 16923076 | 6 | 36666666 | 4 | 55000000 | 3 | 73333333 | 5 | 55000000
4  | 13 | 16923076 | 6 | 36666666 | 4 | 55000000 | 3 | 73333333 | 5 | 55000000
5  | 13 | 16923076 | 6 | 36666666 | 4 | 55000000 | 3 | 73333333 | 6 | 36666666
6  | 13 | 16923076 | 5 | 44000000 | 4 | 55000000 | 4 | 55000000 | 5 | 55000000
7  | 13 | 16923076 | 6 | 36666666 | 4 | 55000000 | 3 | 73333333 | 5 | 55000000
8  | 13 | 16923076 | 6 | 36666666 | 4 | 55000000 | 3 | 73333333 | 6 | 36666666
9  | 13 | 16923076 | 6 | 36666666 | 4 | 55000000 | 3 | 73333333 | 6 | 36666666
10 | 13 | 16923076 | 6 | 36666666 | 4 | 55000000 | 3 | 73333333 | 5 | 55000000
11 | 13 | 16923076 | 6 | 36666666 | 4 | 55000000 | 4 | 55000000 | 5 | 55000000
12 | 14 | 15714285 | 5 | 44000000 | 4 | 55000000 | 3 | 73333333 | 6 | 36666666
13 | 13 | 16923076 | 5 | 44000000 | 4 | 55000000 | 4 | 55000000 | 5 | 55000000
14 | 14 | 15714285 | 6 | 36666666 | 4 | 55000000 | 3 | 73333333 | 5 | 55000000
15 | 14 | 15714285 | 6 | 36666666 | 4 | 55000000 | 4 | 55000000 | 5 | 55000000
16 | 13 | 16923076 | 6 | 36666666 | 4 | 55000000 | 3 | 73333333 | 6 | 36666666
17 | 13 | 16923076 | 6 | 36666666 | 4 | 55000000 | 3 | 73333333 | 5 | 55000000
18 | 13 | 16923076 | 6 | 36666666 | 4 | 55000000 | 3 | 73333333 | 5 | 55000000
19 | 13 | 16923076 | 6 | 36666666 | 4 | 55000000 | 3 | 73333333 | 5 | 55000000
20 | 13 | 16923076 | 6 | 36666666 | 4 | 55000000 | 3 | 73333333 | 5 | 55000000

Optimizations:
-O0: No optimization (the default); generates un-optimized code but has the fastest
compilation time.
Optimization -O1: This is the most basic optimization level. The compiler tries to
produce faster, smaller code without performing any optimizations that take a great
deal of compilation time.
Optimization -O2: A step up from -O1. Compared to -O1, this option increases
both compilation time and the performance of the generated code.
Optimization -O3: This is the highest level of optimization possible. Everything in
-O2 is done, plus additional passes such as -finline-functions for inlining. It enables
optimizations that are expensive in terms of compile time and memory usage.

Optimization -Os: This option optimizes the code for size. It activates all -O2
options that do not increase the size of the generated code. It can be useful for
machines that have extremely limited disk storage space and/or CPUs with small
cache sizes.
From Table 1 we can see that the default option -O0 took about 13 passes with an
error of 1 pass, yielding 16923076 Dhrystones per second. The -O1, -Os and -O2
levels took fewer passes, and -O3 took 3 passes with an error of 1 pass.
Comparing the -O3 and -Os optimizations, we can see that the system performs
better with -O3. This is because -O3 enables the more aggressive speed
optimizations, while -Os omits any -O2 optimization that would grow the code,
trading some speed for size.
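
As a sanity check on Table 1 (an assumption on my part: the PASSES column
appears to record the measured run time in whole seconds), the Dhrystones/s
figures are consistent with dividing the 220,000,000 loop iterations by that time:

    # 220,000,000 runs / 13 s = 16,923,076 Dhrystones/s  (matches the -O0 rows)
    # 220,000,000 runs / 3 s  = 73,333,333 Dhrystones/s  (matches the best -O3 rows)
    awk 'BEGIN { runs = 220000000; for (t = 3; t <= 13; t++) printf "%2d s -> %d Dhrystones/s\n", t, runs / t }'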
gprof is used to profile the execution of the program, both without optimization
and with the -O1, -O2, -O3 and -Os levels.
gcc without optimization:
-bash-3.00$ gcc -pg dhrystone.c -o linprof
-bash-3.00$ gprof linprof
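
Note that gprof reads the gmon.out file produced by an instrumented run, so the
profiled binary has to be executed once before gprof is invoked; a minimal sketch
of the full sequence (using the binary name above) is:

    gcc -pg dhrystone.c -o linprof        # compile with profiling instrumentation
    ./linprof                             # run once to generate gmon.out
    gprof linprof gmon.out > profile.txt  # produce the flat profile and call graph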

1. What is the most frequently executed function?


Table for most frequently executed function (without optimization)
%time   cumulative sec   self sec   calls       self s/call   total s/call   name
9.05    9.16             1.23       660000000   0.00          0.00           Func1
7.80    11.44            1.23       660000000   0.00          0.00           Proc7

After running gprof we get a table giving the running time, calls and names of the
functions; part of that information is tabulated above. The functions Func1 and
Proc7 are the most frequently executed, each being called 660 million times.
There is also only a single call to Proc0, and it accounts for more than a quarter of
the program execution time.
2. What percentage of the entire execution time does it consume?
Func1 and Proc7 take 9.05 % and 7.80 % respectively of the entire execution time,
as can be seen from the table.

3. How does optimization change this percentage and why?


Gprof is used to profile the execution of the program with optimization.
Optimization -O1:
-bash-3.00$ gcc -pg -O1 dhrystone.c -o lingprof1

gprof is invoked in the same way for the other optimization levels.


Let's take a look at the assembly listing of Func1:
a. without optimization
b. with optimization -O1, -O2, -O3 and -Os

[Assembly listings of Func1: without optimization and with -O1, -O2, -O3 and -Os]

-O1 Optimization:

%time   cumulative sec   self sec   calls       self s/call   total s/call   name
8.61    3.93             1.23       660000000   0.00          0.00           Func1
9.83    3.43             1.23       660000000   0.00          0.00           Proc7
After looking at the assembly code, we can say that with -O1 the compiler produces
smaller code than with no optimization. The execution time also decreases, without
performing any optimizations that take a great deal of compilation time. The
repeated register store instructions have been optimized away at the -O1 level.

Gprof Optimization -O2:

%time   cumulative sec   self sec   calls   self s/call   total s/call   name
0.62    3.24             0.02       -       0.00          0.00           Proc7

We can see that with -O2 the assembly code is very similar to that of -O1. Like
-O1, this option turns on optimizations that do not trade speed for size, but it does
increase compilation time. The percentage of time taken by Proc7 is significantly
reduced, to 0.62 %, and both its %time and cumulative time have dropped. Neither
Func1 nor Proc7 is credited with any calls any more (compared with -O1 and -O0),
which reduces execution time. Also, compared to -O0, this option increases both
compilation time and the performance of the generated code.
Optimization -O3:

%time   cumulative sec   self sec   calls   self s/call   total s/call   name
100     1.2              1.2        1       -             -              Proc0

Comparing -O3 to -O2, the PASSES results are about one lower, and the highest
Dhrystones/s are achieved at the -O3 level. The profile no longer records any calls
to the two functions. -O3 is the highest level of optimization available; in its gprof
output the Proc_x and Func_x routines account for much less time.
Finally, for the smallest code size, the -Os optimization is recommended. The -Os
switch is derived from -O2: it optimizes for code size by enabling all -O2
optimizations that do not increase code size, putting the emphasis on size over
speed.
b. Using cc to compile (Linux/X86_64)

TABLE 2 - Dhrystone on Linux/X86_64 compiled with cc (-O0 = without optimization)

S.No | -O0 PASSES | -O0 Dhrystones/s | -O1 PASSES | -O1 Dhrystones/s | -O2 PASSES | -O2 Dhrystones/s | -O3 PASSES | -O3 Dhrystones/s | -Os PASSES | -Os Dhrystones/s
1  | 13 | 16923076 | 6 | 36666666 | 4 | 55000000 | 3 | 73333333 | 5 | 55000000
2  | 14 | 15714285 | 6 | 36666666 | 4 | 55000000 | 3 | 73333333 | 5 | 55000000
3  | 14 | 15714285 | 5 | 44000000 | 4 | 55000000 | 4 | 55000000 | 5 | 55000000
4  | 13 | 16923076 | 6 | 36666666 | 4 | 55000000 | 3 | 73333333 | 5 | 55000000
5  | 13 | 16923076 | 6 | 36666666 | 4 | 55000000 | 4 | 55000000 | 6 | 36666666
6  | 13 | 16923076 | 6 | 36666666 | 4 | 55000000 | 4 | 55000000 | 5 | 55000000
7  | 13 | 16923076 | 6 | 36666666 | 4 | 55000000 | 4 | 55000000 | 5 | 55000000
8  | 13 | 16923076 | 6 | 36666666 | 4 | 55000000 | 3 | 73333333 | 6 | 36666666
9  | 14 | 15714285 | 6 | 36666666 | 4 | 55000000 | 3 | 73333333 | 6 | 36666666
10 | 13 | 16923076 | 6 | 36666666 | 4 | 55000000 | 3 | 73333333 | 5 | 55000000
11 | 13 | 16923076 | 5 | 44000000 | 4 | 55000000 | 4 | 55000000 | 6 | 36666666
12 | 14 | 15714285 | 6 | 36666666 | 4 | 55000000 | 3 | 73333333 | 5 | 55000000
13 | 13 | 16923076 | 6 | 36666666 | 4 | 55000000 | 4 | 55000000 | 5 | 55000000
14 | 14 | 15714285 | 6 | 36666666 | 4 | 55000000 | 3 | 73333333 | 5 | 55000000
15 | 14 | 15714285 | 6 | 36666666 | 4 | 55000000 | 4 | 55000000 | 5 | 55000000
16 | 13 | 16923076 | 6 | 36666666 | 4 | 55000000 | 3 | 73333333 | 6 | 36666666
17 | 13 | 16923076 | 5 | 44000000 | 4 | 55000000 | 3 | 73333333 | 5 | 55000000
18 | 13 | 16923076 | 6 | 36666666 | 4 | 55000000 | 3 | 73333333 | 5 | 55000000
19 | 13 | 16923076 | 6 | 36666666 | 4 | 55000000 | 3 | 73333333 | 5 | 55000000
20 | 13 | 16923076 | 6 | 36666666 | 4 | 55000000 | 3 | 73333333 | 5 | 55000000

Comparing the results obtained with the cc and gcc compilers on the Linux/X86_64
micro-architecture, the numbers are very similar. After using gprof, as with gcc,
Func1 and Proc7 were the most frequently executed functions, and their execution
times were also similar.
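
One likely reason for the near-identical numbers is that on many Linux systems cc
is simply a link to gcc; this can be checked directly on the test machine (paths may
differ from system to system):

    which cc gcc
    ls -l "$(which cc)"    # on many Linux distributions cc is a symlink to gcc
    cc --version
    gcc --version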

2. On Linux/X86_i686 (32-bit X86) (i5 Westmere micro-architecture processor)

a. Using gcc to compile
-bash-3.00$ ssh -p 27 alpha

Here the same number of loops was used (about 220 million). Dhrystone was again
run 20 times for each configuration and the results are tabulated below (Table 3).

TABLE 3 - Dhrystone on Linux/X86_i686 (32-bit) compiled with gcc (-O0 = without optimization)

S.No | -O0 PASSES | -O0 Dhrystones/s | -O1 PASSES | -O1 Dhrystones/s | -O2 PASSES | -O2 Dhrystones/s | -O3 PASSES | -O3 Dhrystones/s | -Os PASSES | -Os Dhrystones/s
1  | 29 | 7586206 | 15 | 14666666 | 8 | 27500000 | 4 | 55000000 | 15 | 14666666
2  | 27 | 8148148 | 16 | 13750000 | 9 | 24444444 | 5 | 44000000 | 14 | 15714285
3  | 29 | 7586206 | 15 | 14666666 | 9 | 24444444 | 4 | 55000000 | 15 | 14666666
4  | 29 | 7586206 | 15 | 14666666 | 8 | 27500000 | 3 | 73333333 | 15 | 14666666
5  | 27 | 8148148 | 15 | 14666666 | 8 | 27500000 | 4 | 55000000 | 15 | 14666666
6  | 29 | 7586206 | 15 | 14666666 | 9 | 24444444 | 4 | 55000000 | 15 | 14666666
7  | 29 | 7586206 | 15 | 14666666 | 9 | 24444444 | 4 | 55000000 | 15 | 14666666
8  | 29 | 7586206 | 16 | 13750000 | 9 | 24444444 | 4 | 55000000 | 14 | 15714285
9  | 29 | 7586206 | 15 | 14666666 | 9 | 24444444 | 4 | 55000000 | 15 | 14666666
10 | 29 | 7586206 | 15 | 14666666 | 9 | 24444444 | 4 | 55000000 | 15 | 14666666
11 | 29 | 7586206 | 15 | 14666666 | 9 | 24444444 | 4 | 55000000 | 15 | 14666666
12 | 27 | 8148148 | 16 | 13750000 | 9 | 24444444 | 4 | 55000000 | 15 | 14666666
13 | 29 | 7586206 | 15 | 14666666 | 9 | 24444444 | 4 | 55000000 | 14 | 15714285
14 | 29 | 7586206 | 15 | 14666666 | 9 | 24444444 | 4 | 55000000 | 15 | 14666666
15 | 28 | 7857142 | 16 | 13750000 | 9 | 24444444 | 4 | 55000000 | 15 | 14666666
16 | 29 | 7586206 | 15 | 14666666 | 9 | 24444444 | 5 | 44000000 | 15 | 14666666
17 | 28 | 7857142 | 15 | 14666666 | 8 | 27500000 | 4 | 55000000 | 15 | 14666666
18 | 29 | 7586206 | 16 | 13750000 | 9 | 24444444 | 4 | 55000000 | 15 | 14666666
19 | 29 | 7586206 | 15 | 14666666 | 9 | 24444444 | 4 | 55000000 | 14 | 15714285
20 | 29 | 7586206 | 15 | 14666666 | 9 | 24444444 | 4 | 55000000 | 15 | 14666666

From the above table we can see again that the number of passes is lowest with the
-O3 optimization, followed by -O2. Comparing the -O2 and -Os optimizations, we
can see that the system performs better with -O2. We can also deduce that the
64-bit version of the X86 architecture achieves better Dhrystone performance.

gprof is used to analyze the gcc build without optimization and with the -O1, -O2,
-O3 and -Os optimizations.
1. What is the most frequently executed function?
Table for most frequently executed function
%time   cumulative sec   self sec   calls       self s/call   total s/call   name
12.17   24.5             4.21       660000000   0.00          0.00           Func1
7.23    27.01            2.5        660000000   0.00          0.00           Proc7

Here, similarly, Func1 and Proc7 are the most frequently executed functions. They
also take more cumulative seconds than on the 64-bit version of this architecture
(the execution times are a lot higher on this micro-architecture).
2. What percentage of the entire execution time does it consume?
Func1 and Proc7 take 12.17 % and 7.23 % respectively of the entire execution
time, as can be seen from the table.
3. How does optimization change this percentage and why?
Similarly, Gprof is used to profile the execution of the program with
optimization.
O1 Optimization:

%time   cumulative sec   self sec   calls       self s/call   total s/call   name
8.64    14.68            1.78       660000000   0.00          0.00           Func1
10.87   12.89            2.25       660000000   0.00          0.00           Proc7

As we discussed earlier, the cumulative seconds have been reduced at the -O1
optimization level, so the execution time also decreases without performing any
optimizations that take a great deal of compilation time. After looking at the
assembly code, we can say that with -O1 the compiler produces smaller code than
with no optimization.

Optimization -O2:

%time   cumulative sec   self sec   calls   self s/call   total s/call   name
-       -                -          -       -             -              Proc7 & Func1

Similarly, we can see that with -O2 the assembly code is very similar to that of the
64-bit version. Neither Func1 nor Proc7 is credited with any calls (compared with
-O1 and -O0), which reduces execution time. Also, compared to -O0, this option
increases both compilation time and the performance of the generated code. Like
-O1, it turns on optimizations that do not trade speed for size, but it increases the
compilation time.
Optimization -O3:

%time   cumulative sec   self sec   calls   self s/call   total s/call   name
100     3.4              3.4        -       3.4           0.00           Proc0

Comparing -O3 to -O2, the PASSES results are lower by about 4, and the highest
Dhrystones/s are achieved at the -O3 level. Similarly, no calls to the two functions
are recorded. This is the highest level of optimization available. We can also see
from the -O3 gprof output that the cumulative time of Proc0 on this version of the
architecture is higher than on the 64-bit version.
Finally, for the smallest code size, the -Os optimization is recommended.

Assembly code for Func1 without optimization and with -O3

[Assembly listings of Func1: without optimization and with -O3]

From the assembly code listed for Func1, the most frequently called function in the
unoptimized build, we can say that after -O3 optimization the code size is
significantly reduced, since the repeated register store instructions (movl, etc.)
have been optimized away.

b. Using cc to compile (On Linux/X86_i686)

Comparing the results obtained with the cc and gcc compilers on the
Linux/X86_i686 micro-architecture, the numbers are very similar. After using
gprof, as with gcc, Func1 and Proc7 were the most frequently executed functions,
and their execution times were also similar.
TABLE 4 - Dhrystone on Linux/X86_i686 (32-bit) compiled with cc (-O0 = without optimization)

S.No | -O0 PASSES | -O0 Dhrystones/s | -O1 PASSES | -O1 Dhrystones/s | -O2 PASSES | -O2 Dhrystones/s | -O3 PASSES | -O3 Dhrystones/s | -Os PASSES | -Os Dhrystones/s
1  | 29 | 7586206 | 16 | 13750000 | 9 | 24444444 | 4 | 55000000 | 15 | 14666666
2  | 29 | 7586206 | 16 | 13750000 | 8 | 27500000 | 5 | 44000000 | 14 | 15714285
3  | 29 | 7586206 | 16 | 13750000 | 8 | 27500000 | 4 | 55000000 | 15 | 14666666
4  | 29 | 7586206 | 15 | 14666666 | 8 | 27500000 | 3 | 73333333 | 15 | 14666666
5  | 30 | 8148148 | 15 | 14666666 | 9 | 24444444 | 4 | 55000000 | 15 | 14666666
6  | 29 | 7586206 | 15 | 14666666 | 9 | 24444444 | 4 | 55000000 | 15 | 14666666
7  | 30 | 8148148 | 16 | 13750000 | 9 | 24444444 | 4 | 55000000 | 15 | 14666666
8  | 29 | 7586206 | 16 | 13750000 | 8 | 27500000 | 4 | 55000000 | 14 | 15714285
9  | 30 | 8148148 | 15 | 14666666 | 9 | 24444444 | 4 | 55000000 | 15 | 14666666
10 | 29 | 7586206 | 15 | 14666666 | 9 | 24444444 | 4 | 55000000 | 15 | 14666666
11 | 30 | 8148148 | 15 | 14666666 | 9 | 24444444 | 4 | 55000000 | 15 | 14666666
12 | 29 | 7586206 | 16 | 13750000 | 9 | 24444444 | 4 | 55000000 | 15 | 14666666
13 | 30 | 8148148 | 15 | 14666666 | 9 | 24444444 | 4 | 55000000 | 14 | 15714285
14 | 29 | 7586206 | 15 | 14666666 | 9 | 24444444 | 4 | 55000000 | 15 | 14666666
15 | 28 | 7857142 | 16 | 13750000 | 9 | 24444444 | 4 | 55000000 | 15 | 14666666
16 | 29 | 7586206 | 15 | 14666666 | 9 | 24444444 | 5 | 44000000 | 15 | 14666666
17 | 29 | 7586206 | 15 | 14666666 | 8 | 27500000 | 4 | 55000000 | 15 | 14666666
18 | 29 | 7586206 | 15 | 14666666 | 9 | 24444444 | 4 | 55000000 | 15 | 14666666
19 | 30 | 8148148 | 15 | 14666666 | 9 | 24444444 | 4 | 55000000 | 14 | 15714285
20 | 29 | 7586206 | 15 | 14666666 | 9 | 24444444 | 4 | 55000000 | 15 | 14666666

3. On Solaris/SPARC architecture:
a. Using gcc to compile:
Here the same number of loops was used (about 220 million). Dhrystone was again
run 20 times for each configuration and the results are tabulated below (Table 5).

TABLE 5 - Dhrystone on Solaris/SPARC compiled with gcc (-O0 = without optimization)

S.No | -O0 PASSES | -O0 Dhrystones/s | -O1 PASSES | -O1 Dhrystones/s | -O2 PASSES | -O2 Dhrystones/s | -O3 PASSES | -O3 Dhrystones/s | -Os PASSES | -Os Dhrystones/s
1  | 534 | 411985 | 210 | 1047619 | 204 | 1078431 | 179 | 1229050 | 225 | 977778
2  | 536 | 410447 | 210 | 1047619 | 204 | 1078431 | 179 | 1229050 | 225 | 977778
3  | 535 | 411214 | 210 | 1047619 | 204 | 1078431 | 179 | 1229050 | 226 | 973451
4  | 536 | 410447 | 210 | 1047619 | 205 | 1073170 | 179 | 1229050 | 224 | 982143
5  | 535 | 411214 | 210 | 1047619 | 204 | 1078431 | 177 | 1242937 | 224 | 982143
6  | 535 | 411214 | 209 | 1052632 | 204 | 1078431 | 178 | 1235955 | 225 | 977778
7  | 535 | 411214 | 210 | 1047619 | 205 | 1073170 | 179 | 1229050 | 225 | 977778
8  | 535 | 411214 | 209 | 1052632 | 205 | 1073170 | 179 | 1229050 | 225 | 977778
9  | 536 | 410447 | 210 | 1047619 | 204 | 1078431 | 179 | 1229050 | 225 | 977778
10 | 535 | 411214 | 210 | 1047619 | 204 | 1078431 | 179 | 1229050 | 224 | 982143
11 | 535 | 411214 | 210 | 1047619 | 204 | 1078431 | 178 | 1235955 | 224 | 982143
12 | 536 | 410447 | 210 | 1047619 | 204 | 1078431 | 179 | 1229050 | 225 | 977778
13 | 535 | 411214 | 210 | 1047619 | 204 | 1078431 | 179 | 1229050 | 226 | 973451
14 | 536 | 410447 | 209 | 1052632 | 205 | 1073170 | 179 | 1229050 | 224 | 982143
15 | 535 | 411214 | 210 | 1047619 | 204 | 1078431 | 177 | 1242937 | 224 | 982143
16 | 535 | 411214 | 209 | 1052632 | 204 | 1078431 | 179 | 1229050 | 225 | 977778
17 | 534 | 411214 | 210 | 1047619 | 204 | 1078431 | 179 | 1229050 | 225 | 977778
18 | 536 | 411214 | 210 | 1047619 | 204 | 1078431 | 179 | 1229050 | 225 | 977778
19 | 535 | 410447 | 210 | 1047619 | 204 | 1078431 | 179 | 1229050 | 225 | 977778
20 | 536 | 411214 | 210 | 1047619 | 205 | 1073170 | 179 | 1229050 | 224 | 982143

From Table 5 we can see that the default option -O0 took about 535 passes with an
error of 1 pass, yielding 411214 Dhrystones per second. The -O1, -Os and -O2
levels took fewer passes, and -O3 took 179 passes with an error of 1 pass.
Comparing the -O3 and -Os optimizations, we can see that the system performs
better with -O3.

Here gprof is used to analyze the gcc without optimization and with -O1,-O2,
-O3 and -Os optimizations.
Table for most frequently executed function (without optimization)
%time   cumulative sec   self sec   calls       self s/call   total s/call   name
4.33    1033.5           53.25      660000000   0.00          0.00           Func1
2.82    1146.84          34.73      660000000   0.00          0.00           Proc7

After running gprof we get a table giving the running time, calls and names of the
functions; part of that information is tabulated above.
Similar to the results above, the functions Func1 and Proc7 are the most frequently
executed, each called 660 million times.
What percentage of the entire execution time does it consume?
Func1 and Proc7 take 4.33 % and 2.82 % respectively of the entire execution time,
as can be seen from the table.
How does optimization change this percentage and why?
Gprof is used to profile the execution of the program with optimization.
O1 Optimization:

%time   cumulative sec   self sec   calls       self s/call   total s/call   name
1.59    849.19           1.78       660000000   0.00          0.00           Func1
1.61    834.9            2.25       660000000   0.00          0.00           Proc7

As we discussed earlier, the cumulative seconds have been reduced at the -O1
optimization level for these functions: from more than 1000 down to 849 and 834
for Func1 and Proc7 respectively. Thus the execution time also decreases without
performing any optimizations that take a great deal of compilation time. After
looking at the assembly code, we can say that with -O1 the compiler produces
smaller code than with no optimization.

Optimization -O2:

%time   cumulative sec   self sec   calls       self s/call   total s/call   name
1.19    955.36           11.89      660000000   0             0              Proc7
1.17    943.60           11.76      660000000   0             0              Func1

Like -O1, this option turns on optimizations that do not trade speed for size, but it
increases compilation time. The cumulative time for -O2 came out higher than for
-O1.
Optimization -O3:

%time   cumulative sec   self sec   calls   self s/call   total s/call   name
100     88.86            88.86      -       88.86         88.86          Proc0

The cumulative seconds are significantly lower with this optimization, reducing
both code size and execution time, i.e., providing the highest optimization.
Finally, for the smallest code size, the -Os optimization is recommended.

[Assembly listings of Func1: without optimization and with -O3]

From the above assembly listings we can say that the -O3 optimization
significantly reduced the code size: the repeated register storage instructions have
been optimized away at the -O3 level, hence the shorter code.

b. Using cc to compile:
Here the same number of loops was used (about 220 million). Dhrystone was again
run 20 times and the results are tabulated below (Table 6). gprof was likewise used
to analyze the cc build without optimization.

TABLE 6 - Dhrystone on Solaris/SPARC without optimization, gcc vs cc

S.No | gcc PASSES | gcc Dhrystones/s | cc PASSES | cc Dhrystones/s
1  | 534 | 411985 | 314 | 700637
2  | 536 | 410447 | 315 | 698413
3  | 535 | 411214 | 315 | 698413
4  | 536 | 410447 | 315 | 698413
5  | 535 | 411214 | 315 | 698413
6  | 535 | 411214 | 315 | 698413
7  | 535 | 411214 | 315 | 698413
8  | 535 | 411214 | 314 | 700637
9  | 536 | 410447 | 314 | 700637
10 | 535 | 411214 | 315 | 698413
11 | 535 | 411214 | 315 | 698413
12 | 536 | 410447 | 314 | 700637
13 | 535 | 411214 | 315 | 698413
14 | 536 | 410447 | 315 | 698413
15 | 535 | 411214 | 316 | 696203
16 | 535 | 411214 | 315 | 698413
17 | 534 | 411214 | 315 | 698413
18 | 536 | 411214 | 315 | 698413
19 | 535 | 410447 | 315 | 698413
20 | 536 | 411214 | 315 | 698413

On SPARC, Dhrystone shows better performance when compiled with cc. Unlike
on the Linux machines, where cc is typically just a link to gcc, the Solaris cc is
Sun's own compiler, and its default code generation is evidently better here: as can
be seen from Table 6, cc takes around 315 passes, noticeably fewer than the
roughly 535 passes for unoptimized gcc.

PROBLEM 2:
Linpack benchmark:
The Linpack benchmark is a method of measuring the floating point rate of
execution of a computer by running a program that solves a system of linear
equations. It measures how fast a computer solves a dense n by n system of
linear equations Ax = b. The aim is to approximate how fast a computer will
perform when solving real problems. The results of the Linpack benchmark can be
used to estimate the speed with which a computer is likely to run an individual
program or multiple programs.
This benchmark is made to run on linux/X86 architecture.
The performance measured by the LINPACK benchmark consists of the number of
64-bit floating-point operations, generally additions and multiplications, a
computer can perform per second, also known as FLOPS.
For this benchmark the matrix order by default is taken to be 1000.
TABLE 7 - Linpack MFLOPS on Linux/X86 compiled with gcc (no opt = without optimization)

Run | no opt | -O1 | -O2 | -O3 | -Os
1  | 1114.444444 | 3343.333333 | 3933.333333 | 3519.298246 | 3714.814815
2  | 1114.444444 | 3343.333333 | 3714.814815 | 3519.298246 | 3933.333333
3  | 1096.174863 | 3343.333333 | 3519.298246 | 4179.166667 | 3933.333333
4  | 1133.333333 | 3039.393939 | 3933.333333 | 3933.333333 | 3519.298246
5  | 1114.444444 | 3039.393939 | 3519.298246 | 3519.298246 | 3714.814815
6  | 1096.174863 | 3343.333333 | 3519.298246 | 3519.298246 | 3519.298246
7  | 1096.174863 | 2907.246377 | 3519.298246 | 3519.298246 | 3933.333333
8  | 1114.444444 | 3039.393939 | 3933.333333 | 3933.333333 | 3714.814815
9  | 1114.444444 | 3184.126984 | 3933.333333 | 3714.814815 | 3933.333333
10 | 1114.444444 | 2907.246377 | 3933.333333 | 4179.166667 | 3714.814815
11 | 1114.444444 | 2907.246377 | 3519.298246 | 4179.166667 | 3933.333333
12 | 1133.333333 | 3343.333333 | 3933.333333 | 3519.298246 | 3714.814815
13 | 1114.444444 | 3343.333333 | 3519.298246 | 3519.298246 | 3714.814815
14 | 1114.444444 | 3343.333333 | 3519.298246 | 3519.298246 | 3519.298246
15 | 1114.444444 | 3343.333333 | 3714.814815 | 3519.298246 | 3933.333333
16 | 1133.333333 | 2907.246377 | 3519.298246 | 3933.333333 | 3714.814815

In this benchmark we set up a random dense matrix A of order 1000 as specified
above, and a right-hand-side vector B, which is the product of A and a vector X of
all 1s. The first task is to compute an LU factorization of A; the second is to use
the LU factorization to solve A*X = B.
Table 7 reports MFLOPS, the rate of execution in millions of floating-point
operations per second. It refers to 64-bit floating-point operations, which may be
additions or multiplications. We can see that the MFLOPS improve with the
optimizations -O1, -O2 and -O3.
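
For reference, the Linpack MFLOPS figure is derived from the standard operation
count for an LU factorization plus solve, 2/3*n^3 + 2*n^2 floating-point
operations, divided by the measured time. A small sketch follows (the 0.6 s time is
only an illustrative value, chosen because it reproduces the ~1114 MFLOPS seen in
the unoptimized rows of Table 7):

    n=1000
    time_sec=0.6                      # illustrative solve time in seconds
    awk -v n="$n" -v t="$time_sec" 'BEGIN {
        ops = (2.0/3.0)*n*n*n + 2.0*n*n   # FLOPs for LU factorization + triangular solve
        printf "MFLOPS = %.2f\n", ops / (t * 1e6)
    }'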

Table for most frequently executed function (without optimization)


%time   cumulative sec   self sec   calls     self s/call   total s/call   name
10.01   0.64             0.07       2000000   0.00          0.00           r8_random

After running gprof we get a table giving the running time, calls and names of the
functions; part of that information is tabulated above. The function r8_random is
the most frequently executed, with 2,000,000 calls, and it takes 10.01 % of the
entire execution time, as can be seen from the table.

Gprof is used to profile the execution of the program with optimization.


Optimization -O1:
%time   cumulative sec   self sec   calls     self s/call   total s/call   name
8.35    0.21             0.02       2000000   0.00          0.00           r8_random

With this optimization the cumulative seconds are reduced to 0.21, so this

optimization reduces execution time and code size. From Table 7 we can also see
the performance improvement through the increase in the number of MFLOPS: it
achieved a peak of 3343.33 MFLOPS.

Optimization -O2:
%time   cumulative sec   self sec   calls     self s/call   total s/call   name
4.55    0.2              0.01       2000000   0.00          0.00           r8_random

This optimization improved the performance of the generated code even more.
Both the %time and the cumulative seconds are lower than with -O1. The peak
here is 3933.33 MFLOPS.

Optimization -O3:

%time   cumulative sec   self sec   calls     self s/call   total s/call   name
4.77    0.2              0.01       2000000   5.00          5.00           r8_random

With this optimization the peak was 4179.17 MFLOPS. The optimization is even
more aggressive than -O2; a few functions such as daxpy no longer appear as
called in the -O3 profile. The profile for -Os is similar to that for -O2, except that
it differs in code size; the peak achieved there is 3933.33 MFLOPS.

PROBLEM 3:
Whetstone Benchmark: (double precision)
The Whetstone benchmark primarily measures the floating-point arithmetic
performance. A similar benchmark for integer and string operations is the
Dhrystone.
This benchmark is a synthetic mix of integer and floating point calculations,
transcendental functions, conditional jumps, function calls and array indexing.
Whetstone on Linux/X86 and Solaris/SPARC:

The number of loops is set to 500,000 and the number of iterations to 1.


TABLE 8 - Whetstone, 500,000 loops, 1 iteration, compiled with gcc (duration in seconds; only 10 runs were recorded on Solaris/SPARC)

S.No | x86 -O0 Duration | x86 -O0 MIPS | x86 -O3 Duration | x86 -O3 MIPS | SPARC -O0 Duration | SPARC -O0 MIPS | SPARC -O3 Duration | SPARC -O3 MIPS
1  | 13 | 3846.2 | 6 | 8333.3 | 1812 | 27.6 | 1423 | 35.1
2  | 12 | 4166.7 | 7 | 7142.9 | 1812 | 27.6 | 1423 | 35.1
3  | 13 | 3846.2 | 6 | 8333.3 | 1813 | 27.6 | 1424 | 35.1
4  | 12 | 4166.7 | 6 | 8333.3 | 1815 | 27.5 | 1423 | 35.1
5  | 12 | 4166.7 | 6 | 8333.3 | 1812 | 27.6 | 1424 | 35.1
6  | 12 | 4166.7 | 6 | 8333.3 | 1814 | 27.5 | 1423 | 35.1
7  | 13 | 3846.2 | 7 | 7142.9 | 1812 | 27.6 | 1423 | 35.1
8  | 12 | 4166.7 | 6 | 8333.3 | 1812 | 27.6 | 1425 | 35.0
9  | 12 | 4166.7 | 6 | 8333.3 | 1812 | 27.6 | 1423 | 35.1
10 | 12 | 4166.7 | 6 | 8333.3 | 1812 | 27.6 | 1423 | 35.1
11 | 12 | 4166.7 | 6 | 8333.3 | -    | -    | -    | -
12 | 12 | 4166.7 | 6 | 8333.3 | -    | -    | -    | -
13 | 12 | 4166.7 | 6 | 8333.3 | -    | -    | -    | -
14 | 12 | 4166.7 | 6 | 8333.3 | -    | -    | -    | -
15 | 12 | 4166.7 | 6 | 8333.3 | -    | -    | -    | -

As we can see from Table 8, the Whetstone benchmark was run on both x86 and
SPARC. The performance of the X86_64 system is far superior to that of the
SPARC-based system. We can also see that the -O3 optimization significantly
improved the execution duration: it dropped from around 13 s to 6 s on the Linux
system and from about 1812 s to 1423 s on the Solaris system.
The performance is measured in MIPS; the Linux system reached a peak of around
8333 MIPS, which is significantly better than the Solaris system, which reaches
only about 35 MIPS even with -O3.
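
The MIPS figures in Table 8 are consistent with the formula used in the netlib
whetstone.c source, MIPS = (100 * loops * iterations) / (duration * 1000); a quick
check against the tabulated durations:

    # 100 * 500000 * 1 / (13 * 1000)   = 3846.2 MIPS  (Linux, no optimization)
    # 100 * 500000 * 1 / (6 * 1000)    = 8333.3 MIPS  (Linux, -O3)
    # 100 * 500000 * 1 / (1812 * 1000) = 27.6 MIPS    (SPARC, no optimization)
    awk 'BEGIN { loops = 500000; ii = 1; n = split("13 6 1812 1423", t);
                 for (i = 1; i <= n; i++)
                     printf "%4d s -> %.1f MIPS\n", t[i], 100*loops*ii/(t[i]*1000) }'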
To find the hot portions of the code, gprof is used.
gprof without optimization:
%time   cumulative sec   self sec   calls       self s/call   total s/call   name
29.55   7.29             3.16       449500000   7.02          7.02           P3

The most frequently executed function is tabulated above: P3, which takes about
30 % of the entire execution time.

After the -O3 optimization, no calls to P3 are recorded and it no longer affects the
execution time; both the execution time and the code size are reduced by -O3.

PROBLEM 4:
Basic Block:
A basic block is a portion of the code within a program with only one entry point
and only one exit point.
A sequence of instructions forms a basic block if:

1. The instruction in each position dominates, or always executes before, all


those in later positions, and
2. No other instruction executes between two instructions in the sequence.

Let us take the basic block Proc4:

Proc4()
{
    REG boolean BoolLoc;

    BoolLoc = Char1Glob == 'A';
    BoolLoc |= BoolGlob;
    Char2Glob = 'B';
}

The assembly code generated in SPARC ISA and X86 ISA is shown below:

[Assembly listings of Proc4: SPARC ISA and X86 ISA]
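
The assembly listings compared here can be regenerated on either machine with
gcc's -S option; a minimal sketch (output file names are arbitrary):

    gcc -S -O0 dhrystone.c -o dhrystone_O0.s   # unoptimized assembly for the whole file
    gcc -S -O3 dhrystone.c -o dhrystone_O3.s   # optimized assembly for comparison
    grep -n -A 40 "Proc4" dhrystone_O0.s       # locate the Proc4 routine in the listing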

Firstly, SPARC is a RISC instruction set architecture while the X86 ISA follows
the CISC approach. Thus SPARC produces more lines of assembly code for the
same function than X86. In SPARC, sethi is used to place the high-order bits of a
large constant into a register; combined with a following or (or a load), it is how
32-bit constants and addresses are built. SPARC operations also generally take
three operands, and many more general-purpose registers are available than on the
X86 architecture.

PROBLEM 5:
Benchmarks suites provide a method of comparing the performance of various
subsystems across different chip/system architectures.

1. PARSEC benchmark suite:


The Princeton Application Repository for Shared-Memory Computers
(PARSEC) is a benchmark suite composed of multithreaded programs. The
suite focuses on emerging workloads and was designed to be representative
of next-generation shared-memory programs for chip-multiprocessors.
Key features:
PARSEC differs from other benchmark suites in the following ways:
Multithreaded: While serial programs are abundant, they are of
limited use for evaluation of multiprocessor machines. PARSEC is
one of few benchmark suites that are parallel.
Emerging Workloads: The suite includes emerging workloads which
are likely to become important applications in the near future but
which are currently not commonly used. Our goal was to provide a
collection of applications as might be typical in a few years.
Diverse: PARSEC does not try to explore a single application domain
in detail, as was done by several previous benchmark suites. The
selection of included programs is wide and tries to be as
representative as possible.

Research papers:
a. Fidelity and Scaling of the PARSEC Benchmark Inputs
Christian Bienia and Kai Li. In Proceedings of the IEEE International Symposium
on Workload Characterization, December 2010.
Abstract:
A good benchmark suite should provide users with inputs that have multiple levels of fidelity for
different use cases such as running on real machines, register level simulations, or gate-level
simulations. Although input reduction has been explored in the past, there is a lack of
understanding how to systematically scale input sets for a benchmark suite. This paper presents a
framework that takes the novel view that benchmark inputs should be considered approximations
of their original, full-sized inputs. It formulates the input selection problem for a benchmark as
an optimization problem that maximizes the accuracy of the benchmark subject to a time
constraint. The paper demonstrates how to use the proposed methodology to create several
simulation input sets for the PARSEC benchmarks and how to quantify and measure their
approximation error. The paper also shows which parts of the inputs are more likely to distort
their original characteristics. Finally, the paper provides guidelines for users to create their own
customized input sets.

b. PARSEC 2.0: A New Benchmark Suite for Chip-Multiprocessors


Christian Bienia and Kai Li. In Proceedings of the 5th Annual Workshop on
Modeling, Benchmarking and Simulation, June 2009.
Abstract:
The second version of the Princeton Application Repository for Shared-Memory
Computers (PARSEC) has been released. PARSEC is a benchmark suite for
Chip-Multiprocessors (CMPs) that focuses on emerging applications. It includes a
diverse set of workloads from different domains such as interactive animation or
systems applications that mimic large-scale commercial workloads. The next
version of PARSEC features several improved and one new workload. It also
supports an additional parallelization model. Many patches and changes were
included which simplify the use of PARSEC in practice. The benchmarks of the
new suite have higher scalability and cover a larger number of emerging
applications. In this paper we discuss the major changes in detail and provide the
information necessary to interpret results obtained with PARSEC 2.0 correctly.

2. Isolation benchmark suite:


The Isolation Benchmark Suite is designed to quantify the degree to which a
virtualization system limits the impact of a misbehaving virtual machine on
other well-behaving virtual machines running on the same physical machine.
This benchmark suite includes six different stress tests:

a CPU intensive test
a memory intensive test
a fork bomb
a disk intensive test
two network intensive tests (send and receive)

The results, from the Isolation Benchmark Suite, can highlight the difference
between different classes of virtualization systems as well as the importance of
considering multiple categories of resource consumption when evaluating the
performance isolation properties of a virtualization system.
Research papers:
a. R. Creasy. The Origin of the VM/370 Time-Sharing System. IBM Journal of
Research and Development, Vol. 25, No. 5, p. 483, 1981.
Abstract:
VM/370 is an operating system which provides its multiple users with seemingly separate
and independent IBM System/370 computing systems. These virtual machines are
simulated using IBM System/370 hardware and have its same architecture. In addition,
VM/370 provides a single-user interactive system for personal computing and a computer
network system for information interchange among interconnected machines. VM/370
evolved from an experimental operating system designed and built over fifteen years ago.
This paper reviews the historical environment, design influences, and goals which shaped
the original system.

b. B. Clark, T. Deshane, E. Dow, S. Evanchik, M. Finlayson, J. Herne and J.
Matthews. Xen and the Art of Repeated Research. Proceedings of the
USENIX 2004 Annual Technical Conference, FREENIX Track, pp. 135-144,
June 2004.
Abstract:
Xen is an x86 virtual machine monitor produced by the University of Cambridge
Computer Laboratory and released under the GNU General Public License. Performance
results comparing XenoLinux (Linux running in a Xen virtual machine) to native Linux
as well as to other virtualization tools such as User Mode Linux (UML) were recently
published in the paper Xen and the Art of Virtualization at the Symposium on
Operating Systems Principles (October 2003). In this study, we repeat this performance
analysis of Xen. We also extend the analysis in several ways, including comparing
XenoLinux on x86 to an IBM zServer. We use this study as an example of repeated
research. We argue that this model of research, which is enabled by open source
software, is an important step in transferring the results of computer science research into
production environments.

3. Problem Based Benchmark Suite:


The problem based benchmark suite (PBBS) is designed to be an open
source repository to compare different parallel programming methodologies
in terms of performance and code quality. The benchmarks define problems
in terms of the function they implement and not the particular algorithm or
code they use. We encourage people to implement the benchmarks using any
algorithm, with any programming language, with any form of parallelism (or
sequentially), and for any machine. The problems are selected so they:
Are representative of a reasonably wide variety of real-world tasks
The problem can be defined concisely
Are simple enough that reasonably efficient solutions can be
implemented in 500 lines of code, but not trivial micro-benchmarks
Have outputs that can be easily tested for correctness and possibly
quality

Research papers:
a. Julian Shun, Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, Aapo
Kyrola, Harsha Vardhan Simhadri and Kanat Tangwongsan.
The Problem Based Benchmark Suite
Abstract:
This announcement describes the problem based benchmark suite (PBBS). PBBS is a set
of benchmarks designed for comparing parallel algorithmic approaches, parallel
programming language styles, and machine architectures across a broad set of problems.
Each benchmark is defined concretely in terms of a problem specification and a set of
input distributions. No requirements are made in terms of algorithmic approach,
programming language, or machine architecture. The goal of the benchmarks is not only
to compare runtimes, but also to be able to compare code and other aspects of an
implementation (e.g., portability, robustness, determinism, and generality). As such the
code for an implementation of a benchmark is as important as its runtime, and the public
PBBS repository will include both code and performance results.

b. Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, and Julian Shun.


Internally Deterministic Parallel Algorithms Can Be Fast.
Abstract:
The virtues of deterministic parallelism have been argued for decades and many forms of
deterministic parallelism have been described and analyzed. Here we are concerned with
one of the strongest forms, requiring that for any input there is a unique dependence
graph representing a trace of the computation annotated with every operation and value.
This has been referred to as internal determinism, and implies a sequential semantics i.e.,
considering any sequential traversal of the dependence graph is sufficient for analyzing
the correctness of the code. In addition to returning deterministic results, internal
determinism has many advantages including ease of reasoning about the code, ease of
verifying correctness, ease of debugging, ease of defining invariants, ease of defining
good coverage for testing, and ease of formally, informally and experimentally reasoning
about performance. On the other hand one needs to consider the possible downsides of
determinism, which might include making algorithms (i) more complicated, unnatural or
special purpose and/or (ii) slower or less scalable.

References:
1. http://en.wikipedia.org/wiki/Benchmark_(computing)
2. http://www.cs.cmu.edu/~pbbs/
3. http://parsec.cs.princeton.edu/
4. http://web2.clarkson.edu/class/cs644/isolation/index.html
5. http://www.netlib.org/benchmark/whetstone.c
