
COMPUTER ARCHITECTURE

Homework: #1

Name: Romil Shah


NUID: 001673595
Email: shah.romil@husky.neu.edu

Q1 Dhrystone benchmark
(a) Did Dhrystone run slower the first time you ran it? Why? Try to explain this phenomenon.
Yes, Dhrystone ran slower the first time. Initially the program's instructions and data are not yet in the caches, so the first run pays the cost of compulsory (cold) cache misses and of loading the executable from disk. Once the first run has brought the code and data into the cache, subsequent runs can access them much faster. Even companies like ARM suggest disregarding the first run, as it is usually slower for this reason. Temporal locality is the key concept here: because the program reuses the same code and data on every run, repeated runs find them in the cache and execution becomes faster.

(b) Dhrystone compilation with and without optimization on x86 Linux, Solaris and Alpha machines.
Dhrystone initially produced several errors and warnings, which were reduced to only 2 warnings after adding the required libraries and increasing the LOOPS count. I obtained the following results running Dhrystone with different LOOPS values and different optimization switches on all 3 machines. On each machine I ran 5 trials per configuration; each table cell below shows the figures recorded in the 5 trials for that LOOPS value.
(i) X86-64 bit
For LOOPS = 1,000,000 we get a core dump, as an exception occurs in the code.

Number of passes for 5 trials:

Loops        | Without Optimization | -O1            | -O2            | -O3        | -Ofast     | -Os
10,000,000   | 3,2,3,3,2            | 1,1,1,1,1      | 1,1,0,1,1      | 0,0,0,0,0  | 1,0,0,1,0  | 1,1,1,1,1
100,000,000  | 26,26,26,25,27       | 9,9,7,9,9      | 6,7,7,6,6      | 3,3,2,3,3  | 3,3,3,3,3  | 10,10,10,9,10
250,000,000  | 55,65,66,66,65       | 25,24,21,20,21 | 16,15,16,15,15 | 7,7,7,7,7  | 7,7,7,7,7  | 25,25,27,25,26

In this case the -O3 switch gives the best performance: for every LOOPS value it records the lowest figures in the table, and hence the highest Dhrystone benchmark number. The -Ofast switch performs almost identically to -O3.
We see the following results from the optimizations:

Without optimization

With O3 optimization

Here the compiler optimizes the functions and procedures as far as possible: it inlines calls, unrolls loops, and breaks the work into smaller, tighter instruction sequences than the unoptimized assembly code. This makes the program faster and more efficient.

(ii) Sun/Solaris

For LOOPS = 100,000 we get a core dump, as an exception occurs in the code.

USING gcc (number of passes for 5 trials):

Loops      | Without Optimization | -O1            | -O2            | -O3            | -Os
1,000,000  | 6,6,7,6,6            | 5,4,4,5,5      | 5,5,5,5,5      | 1,1,0,0,1      | 2,3,2,3,2
2,000,000  | 12,12,11,12,12       | 9,9,10,9,9     | 10,10,10,10,9  | 2,1,2,2,1      | 5,5,5,5,5
5,000,000  | 31,31,30,31,31       | 23,23,23,22,23 | 24,23,25,25,25 | 4,4,5,4,4      | 12,12,13,12,12

USING cc (number of passes for 5 trials):

Loops      | Without Optimization | -O1            | -O2            | -O3            | -Os
1,000,000  | 3,3,4,4,4            | 4,4,4,3,3      | 4,3,3,4,4      | 4,3,3,4,4      | 2,2,2,2,2
2,000,000  | 7,7,6,7,7            | 5,4,5,4,5      | 5,4,5,4,5      | 4,5,4,4,4      | 4,5,4,4,4
5,000,000  | 18,18,16,16,18       | 18,18,18,18,18 | 18,18,18,18,16 | 17,18,18,18,18 | 11,10,10,10,11

In this case we observe that the cc compiler produces much faster code than gcc. Under cc, -Os optimizes the benchmark better than the other switches, whereas under gcc, -O3 optimizes best. The same difference shows up in the assembly listings, where each compiler breaks the program into smaller segments in order to increase the performance of each.

(iii) X86-32 bit

For LOOPS = 1,000,000 we get a core dump, as an exception occurs in the code.

Number of passes for 5 trials:

Loops        | Without Optimization | -O1            | -O2        | -O3        | -Os
5,000,000    | 2,1,1,1,1            | 2,0,1,1,1      | 1,0,1,0,0  | 1,0,0,0,0  | 1,0,1,0,1
10,000,000   | 3,2,2,3,2            | 3,2,1,3,2      | 0,1,1,0,1  | 0,0,1,0,0  | 1,1,1,1,1
100,000,000  | 23,24,23,25,23       | 20,20,21,21,20 | 7,7,7,7,7  | 1,2,2,2,2  | 9,9,9,8,9

In this case we observe that the 32-bit machine runs somewhat faster than the 64-bit one, and again the -O3 switch gives better performance than the other optimization switches.

On all 3 machines, -O3 optimizes the code aggressively for performance and includes everything that -O2 and the lower levels enable. -Os, on the other hand, instructs the compiler to optimize for size rather than speed: it toggles flags that reduce executable size. On the Solaris machine cc happened to perform best with -Os, while with gcc, on both the Linux 64-bit and 32-bit machines, -O3 performs best.

(c) Profiling with gprof:


Using gprof, here are some screenshots:
For X86-64 bit with 100,000,000 LOOPS without optimization and with O3 optimization

For Solaris with 10,000,000 LOOPS without optimization and with O3 optimization

For X86-32 bit with 100,000,000 LOOPS without optimization and with -O3 optimization

i. Func1, Proc7 and Proc1 are the most frequently executed functions and procedures in Dhrystone; they are called the most times.
ii. On X86-64 bit, Proc7 takes 4.96% of the time, Func1 takes 10.8% and Proc1 takes 14.2%, corresponding to 7.31, 9.98 and 9.76 seconds respectively.
iii. Optimization raises the time percentage of Proc0 and Proc1 while lowering it for the other functions and procedures. The self-seconds drop drastically, which reduces the overall run time. Comparing the gprof output of the optimized and unoptimized code, some functions and procedures are missing entirely; these are the ones the compiler optimized away to save time.
iv. Observing Dhrystone on Solaris under cc and gcc, the cc results are much better than gcc's. cc optimizes best with -Os, while gcc optimizes best with -O3. It therefore seems Solaris has been tuned to perform better with cc than with gcc.

Q2 Linpack benchmark
I ran Linpack on the X86-64 bit machine and obtained the optimized and non-optimized results below.
X86-64 bit

Without optimization

MFLOPS obtained: 566.666667

With O1 optimization

MFLOPS obtained: 1807.207207

With O3 optimization

MFLOPS obtained: 2089.583333

With Ofast optimization

MFLOPS obtained: 2156.989247

We can see the difference between the optimizations from the following:


With O3

With Ofast

It can clearly be observed that -Ofast gives better performance than -O3. This is mainly because Linpack is a floating-point-heavy benchmark, and -Ofast enables aggressive math optimizations on top of -O3; this is why it shows faster and better performance. Moreover, from the assembly it can be seen that both switches optimize the r8_mat generation, which takes about 0.34 seconds under -O3 and about 0.33 seconds under -Ofast.
Hence the optimization switches in decreasing order of performance: -Ofast > -O3 > -O2 > -O1 > none.
The optimized code uses cheaper register-only idioms, such as XOR-ing a register with itself to zero it instead of loading a constant from memory, and it eliminates redundant loops and procedures to optimize the complete program. Functions and loops are also split or unrolled into smaller parts so that each can be carried out efficiently.

Q3 Benchmark of choice: Whetstone


For another benchmark I chose Whetstone. First written in Algol 60, it is used to evaluate computer performance, specifically the performance of floating point arithmetic operations. It functions much like Dhrystone, but for floating point operations.
I measured the performance of Whetstone on the X86-64 bit and Solaris machines.

(i) X86-64 bit (LOOPS = 100000)


Without optimization

With O3 optimization

(ii) Solaris (LOOPS = 1000)


Without optimization -

With O3 optimization

From the above observations, we conclude that Whetstone performs better on the X86 system than on the Solaris/Sun system. This is probably because the X86 system has better floating point and arithmetic performance than the Solaris/Sun system.
Whetstone performance based on Windows systems:

Whetstone performance for RaspberryPi systems:

Other performance:

Thus Whetstone performs best on the Core i7 processor. Since Whetstone is single threaded, this is due to the Core i7's faster individual cores rather than its core count. Hence it performs best on modern X86 systems and comparatively weakly on the Sun/Solaris type of system.

Q4 Difference between ISA for SPARC and x86


The comparison is based on the Dhrystone benchmark.
The benchmark was run for the same number of LOOPS on both machines and the assembly was generated for each. A basic block of Func1 is used to compare the X86 and SPARC ISAs. On X86 we get 10 seconds for 10,000,000 LOOPS, while on SPARC we get 23 seconds for the same 10,000,000 LOOPS.

Basic assembly for Func1 in X86:

Basic assembly for Func1 in SPARC:

Looking at the X86 and SPARC ISAs, we can see many differences.
For example:
1. Moving a value from one register to another:
a. X86: movq
b. SPARC: mov
2. Branching to a label when an equality test succeeds:
a. X86: je (jump if equal, to label L)
b. SPARC: be (branch if equal, to label L)
3. Unconditional jump to a label:
a. X86: jmp
b. SPARC: b
4. Register naming:
a. X86: %al (the low byte of register a)
b. SPARC: %i0 (the first input register)
There are more such differences related to comparing, jumping, moving, copying, storing and so on; the two ISAs differ throughout.
A basic difference between X86 and SPARC is that the former has fewer general purpose registers than the latter. With less register state to save on a context switch, X86 can handle interrupts faster than SPARC, whose register windows must be spilled to memory.
SPARC leans on register-to-register logic operations such as XOR and OR to reduce memory traffic compared to X86.
X86 is a register-memory ISA, in which arithmetic instructions may take a memory operand, whereas SPARC is a load/store (register-register) ISA, in which only load and store instructions touch memory.

Q5 Select 2 benchmarks: POV-Ray for open source and 3DMark for Windows

i. Application domain and system architecture:
a. POV-Ray: a ray tracing benchmark application that renders images from a text based scene description; it also includes methods for generating 3D models. It is a cross platform application and runs on x86 Microsoft Windows, x86 Linux and x86 Macintosh platforms.
b. 3DMark: this computer benchmark was created specifically for Windows (the latest version also covers iOS and Android) and is used to determine the 3D graphics performance of a computer. Each 3DMark version reports results against its associated DirectX version, so scores cannot be compared across versions.

ii. Performance measures and metrics:


a. POV-Ray: in this benchmark, performance is evaluated on the basis of ray tracing over images and textures. The output is how long it takes to render the image on a particular platform using the specified options; based on ray tracing and rendering, POV-Ray measures the CPU time each platform takes for the overall rendering.
b. 3DMark: it evaluates gaming performance, overclocking stability and graphics API performance, and provides a 3DMark score for the system: the higher the score, the better the platform performs. The evaluation is thus based on graphics rendering and CPU workload processing, with the output given as a score.

iii. 2 papers for each benchmark:

a. POV-Ray:
[1] Fava, Fava and Bertozzi, "MPIPOV: A Parallel Implementation of POV-Ray Based on MPI," Recent Advances in Parallel Virtual Machine, 2002.
[2] Harris and Jones, "Molray - a web-based interface between O and the POV-Ray ray tracer," Acta Cryst., 2001.
b. 3DMark:
[1] Bai and Cheng, "Dynamic adjustment of CPU clock speed to prevent notebook overheating and shutdown by AC adapter," GCCE, 2012.
[2] Sibai, "3D Graphics Performance Scaling and Workload Decomposition and Analysis," ICIS, 2007.

iv. My thoughts on improvement:
a. POV-Ray: this benchmark focuses on image rendering. It could be extended to render high quality images involving a variety of color schemes, as well as multidimensional data: for 4D X-ray images, POV-Ray rendering could help in understanding a system's performance.
b. 3DMark: this benchmark focuses mainly on the graphics end and is mostly used to gauge game performance. One improvement would be VR support: as VR games become more common, a benchmark like 3DMark could help gauge a computer's VR performance.
