
Software Mapping Techniques for Approximate Computing

ITI/RA Seminar on Design, Test and Application of Emerging Computer Architectures
Summer Term 2015

Kashif Wajid Qureshi
INFOTECH

Supervised by Dipl.-Inf. Alexander Schöll
Examiner: Prof. Dr. rer. nat. habil. Hans-Joachim Wunderlich
Seminar talk given on July 2, 2015


Contents
1 Introduction
  1.1 Motivation
  1.2 Approximate Computing at the Application Level

2 Loop Perforation
  2.1 Loop Transformation
  2.2 Accuracy Metric
  2.3 Perforation Exploration
    2.3.1 Criticality Testing
    2.3.2 Perforation Space Exploration Algorithms

3 Best Effort Computing Model (BE)
  3.1 Idea
  3.2 Model
  3.3 Best-effort Iterative Convergence Template
  3.4 Best-effort Strategies

4 Case Studies
  4.1 K-means Clustering Based on the Best-effort Computation Model
    4.1.1 K-means Algorithm
    4.1.2 Potential in K-means for Computation Reduction
    4.1.3 K-means Using the Best-effort Iterative Convergence Template
    4.1.4 Best-effort Strategies for K-means
    4.1.5 Results
  4.2 H.264 Video Encoding Using Loop Perforation
    4.2.1 H.264 Implementation
    4.2.2 Results

5 Conclusion

Abstract
Approximate computing is an emerging design paradigm that enables highly efficient hardware and software implementations by exploiting the inherent resilience of applications to inexactness in their results. Software applications such as multimedia applications as well as recognition and data mining applications exhibit a forgiving nature: their results are not required to be precise. Due to the tremendous growth in computational data in recent years, new methods must be developed that exploit this forgiving nature to sustain feasible computation times. This report discusses two such methods: Loop Perforation and the Best-effort computation model. Both methods provide an automatic way to improve the performance of applications with the help of certain strategies that can be adjusted to fit a specific application. The increase in performance is achieved by a meaningful reduction of accuracy that reduces the computation effort. Loop perforation causes the loops inside an application to run fewer iterations. The Best-effort computing model distinguishes between optional and guaranteed computations inside an application. For instance, when loop perforation is applied to the H.264 encoding standard, a speedup of 3.25x is observed with an accuracy loss of 10%. Likewise, when Best-effort computing is applied to the K-means clustering algorithm to segment images, a speedup of 3.5x can be achieved with an accuracy loss of 1%.


1 Introduction

1.1 Motivation

The increase in clock frequency and reduction in voltage with each technology generation has greatly slowed down with the end of Dennard scaling [11]. The end of this era gave rise to increasing performance through parallelism. However, this new scaling paradigm came with its own problems, such as programming bottlenecks due to serial computations, synchronization and global communication.
With these developments, the nature of computing workloads has also changed fundamentally across the computing spectrum. In data centers, the demand for computing is driven by the need to organize, search through, analyze, and draw inferences from rapidly growing amounts of data. In mobile devices and embedded systems, the use of media and the need to interact more intelligently with both users and the environment drive much of the computing demand. These applications are not required to calculate a precise result. Instead, acceptable results are defined as being good enough or of sufficient quality.
Given these challenges, alternative avenues to improve the performance of computing platforms must be found. One such avenue is Approximate Computing. Its basic idea is to trade off accuracy for an increase in performance and energy efficiency.

1.2 Approximate Computing at the Application Level

Different computing workloads exhibit application resilience, i.e. the ability to produce acceptable outputs despite some of their computations being performed in an approximate manner. These applications, termed as having a so-called forgiving nature, hold the following properties:
Noisy Input Data: Real-world data is mostly noisy. Since the algorithms are designed to deal with noisy data, they often tolerate erroneous computations as well.
Redundant Data Sets: They process large input data sets that contain significant redundancy. This makes them resilient to errors in computations without affecting the results.
Limited perceptual ability of users: Many of these applications produce output for human consumption. The limited perceptual ability of humans allows these applications to generate results that only need to be acceptable.
Statistical or probabilistic computations: Applications that employ statistical computations are often tolerant of imprecision in some of their computations.
Result Refinement: They often employ iterative computations in which a result is refined until a certain criterion is satisfied. Therefore, errors in earlier computations may be corrected in later iterations.
Recent works quantify the high degree of intrinsic resilience found in many applications. For example, an analysis of a benchmark suite of 12 recognition, mining and search applications shows that, on average, 83% of the runtime is spent in computations that can tolerate at least some degree of approximation [4]. Therefore, there is significant potential to exploit intrinsic resilience in a broad context.

To exploit this forgiving nature of such applications, approximate computing is used to reduce the run-time of programs at the software level. The resulting speedup comes at a cost of accuracy in the result. The key challenge is to identify computations that can be approximated to decrease the execution time of the application. This report discusses two recent methods:
1. Loop Perforation [3]: This method transforms loops in an application so that they run only a subset of their iterations. Critical and tunable loops are identified in the application, and loop perforation is then performed on the tunable loops.
2. Best Effort Computing Model (BE) [1]: This method exploits applications that are amenable to a best-effort model by differentiating between guaranteed and optional computations.
Two applications that are well suited to the methods described above are presented in this report:
H.264 video encoding algorithm: Video encoders take a stream of input frames and compress them for efficient storage or transmission. In this report the x264 implementation of H.264 is used.
K-means clustering algorithm [6]: K-means is a clustering algorithm based on unsupervised learning. It clusters a given set of points that are represented as vectors in a multi-dimensional space. The algorithm begins by picking a random set of cluster centroids and iteratively assigns the input vectors to their respective clusters until the clusters no longer change.

2 Loop Perforation

The goal of loop perforation is to reduce the amount of computational work and therefore the amount of time and other resources, such as power, required to produce the result. This is achieved by a meaningful reduction in the number of iterations per loop. Such perforations may alter the final result of the application; as long as it stays within an acceptable range, the result is considered valid. Loop perforation is a viable option for applications that contain many loops, such as image and video processing applications.
The technique is divided into two parts to find effective perforations:
1. Criticality Testing: Each loop is perforated and executed on representative inputs to filter out critical loops. Perforating such loops produces unacceptable results, crashes the application, causes memory errors, or increases the running time of the application. The remaining loops are called tunable loops and are used by the subsequent perforation space exploration algorithm.
2. Perforation Space Exploration: This step explores the space of variants generated by perforating combinations of loops together. The result is a set of Pareto-optimal variants, each of which maximizes performance for a specific accuracy when run on representative inputs. A perforation is Pareto-optimal if no other perforation provides both better performance and better accuracy.

2.1 Loop Transformation

Given a loop to perforate, loop transformation takes as input a percentage of iterations to skip during
the execution of the loop, and a perforation strategy. A transformation pass alters the calculation of
the loop iteration variable to manipulate the number of iterations that a loop will execute. Thus it
transforms a loop
for(int i = 0; i < b; i++) {...}
to
for(int i = 0; i < b; i += n) {...},
where the increment n is chosen according to the perforation rate and the perforation strategy. The percentage of non-executed iterations is called the perforation rate r. Depending on the selected perforation rate, a different performance/distortion trade-off can be made. For example, for a perforation rate r = 0.5, half of the iterations are skipped; for r = 0.25, one quarter of the iterations are skipped; while for r = 0.75, three quarters of the iterations are skipped, i.e. only one quarter of the initial work is carried out.
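A minimal sketch of the transformation in C, assuming a hypothetical work() kernel as the loop body: for the interleaving strategy the stride n relates to the perforation rate via r = 1 - 1/n, while fractional rates such as r = 0.25 can be realized by skipping every fourth iteration.

#include <stdio.h>

/* Hypothetical per-iteration kernel standing in for the loop body. */
static void work(int i) { printf("iteration %d\n", i); }

int main(void) {
    int b = 16;

    /* Original loop: executes all b iterations. */
    for (int i = 0; i < b; i++) work(i);

    /* Interleaving perforation, r = 0.5: increment by n = 2,
       so half of the iterations are skipped. */
    for (int i = 0; i < b; i += 2) work(i);

    /* Modulo-style perforation, r = 0.25: every fourth iteration
       is skipped, so three quarters of the work remains. */
    for (int i = 0; i < b; i++) {
        if (i % 4 == 3) continue;
        work(i);
    }
    return 0;
}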

2.2 Accuracy Metric

This metric measures the difference between an output from the original program and the corresponding result from the perforated program. The metric is decomposed into two parts: an output abstraction, which maps a program's specific output to one or more measurable numerical values, and an accuracy calculation, which measures the difference between the output abstractions of the perforated and original executions. The accuracy calculation computes the metric acc as a weighted mean of the scaled differences between the output abstraction components o_1, ..., o_m from the original program and the output abstraction components ô_1, ..., ô_m from the perforated program [5]:

    acc = (1/m) * sum_{i=1..m} w_i * | (o_i - ô_i) / o_i |

Here each weight w_i captures the relative importance of the i-th component of the output abstraction. The closer the accuracy metric acc is to zero, the more accurate the perforated program.
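As a minimal sketch of this calculation (the array layout, the example values, and the convention of dividing by m rather than requiring the weights to sum to one are assumptions, not taken from [5]):

#include <math.h>
#include <stdio.h>

/* Accuracy calculation on already-abstracted outputs: o[] holds the m
   components from the original run, o_perf[] those from the perforated
   run, and w[] the relative weights. Assumes o[i] != 0. */
static double accuracy_metric(const double *o, const double *o_perf,
                              const double *w, int m) {
    double acc = 0.0;
    for (int i = 0; i < m; i++)
        acc += w[i] * fabs((o[i] - o_perf[i]) / o[i]);   /* scaled difference */
    return acc / m;   /* weighted mean; 0 means the outputs agree exactly */
}

int main(void) {
    double o[]      = {100.0, 40.0};   /* e.g. a PSNR / bitrate abstraction */
    double o_perf[] = { 95.0, 44.0};
    double w[]      = {  1.0,  1.0};
    printf("acc = %f\n", accuracy_metric(o, o_perf, w, 2));
    return 0;
}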

2.3 Perforation Exploration

The loop perforation space exploration algorithm takes as input an application, an accuracy metric for that application, a set of training inputs, an accuracy bound b (the maximum acceptable accuracy metric for the application), and a set of perforation rates. The algorithm produces a set S of loops to perforate at specified perforation rates.
2.3.1 Criticality Testing

The criticality testing algorithm is shown in Figure 1. It starts with a set of candidate loops L and a set of perforation rates R and tries to find and remove critical loops from the candidate set. The algorithm perforates each loop in turn at each of the specified perforation rates, then runs the resulting program on the training inputs. It filters out a loop if its perforation:
1. Fails to improve performance (the speedup sp of the program).
2. Causes the application to exceed the accuracy bound b.
3. Introduces memory errors (such as out-of-bounds accesses or memory leaks).
The result of the criticality testing is the set of tunable loops P = {(l_0, r_0), ..., (l_m, r_m)}, where (l_i, r_i) specifies the perforation of loop l_i at rate r_i.

Figure 1: Criticality Testing: Find the set of tunable loops P in A given training inputs T and accuracy
bound b [3].
2.3.2 Perforation Space Exploration Algorithms

Two algorithms are presented below that explore the performance vs. accuracy trade-off space of perforated programs:
Exhaustive Exploration Algorithm: This algorithm starts with the set of tunable loops and exhaustively explores all combinations of tunable loops l at their specific perforation rates r. The algorithm executes all combinations on all training inputs and records the resulting speedup and accuracy. It also checks each combination for memory errors and discards combinations that exhibit them. The results are used to compute the set of Pareto-optimal perforations in the induced performance vs. accuracy trade-off space.
Greedy Exploration Algorithm: This algorithm is shown in Figure 2. It uses a greedy strategy to search the loop perforation space and produces, for a given accuracy bound b, a set of loops and corresponding perforation rates S = {(l_0, r_0), ..., (l_m, r_m)} that maximize performance subject to b. The algorithm uses a heuristic scoring metric to prioritize loop/perforation-rate pairs. The score of a loop l at perforation rate r is the harmonic mean of terms that estimate the performance increase and the accuracy loss of the perforated program:

    score(l, r) = 2 / ( 1/sp(l, r) + 1/(1 - acc(l, r)/b) )

where sp(l, r) and acc(l, r) are the mean speedup and the mean accuracy metric, respectively, of the (l, r) perforation over all training inputs, and b is the accuracy bound. The algorithm first computes a set of pairs P: for each loop l, P contains the pair (l, r) where r maximizes score(l, r). It then sorts the pairs in P by score(l, r). The algorithm maintains a set S of (l, r) pairs that can
be perforated together without violating the accuracy bound. Upon each iteration the set S is
extended with a new pair (l,r) if it keeps the overall accuracy within bound b.
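A minimal C sketch of this scoring metric as reconstructed above (the function name, its parameters, and the example values are illustrative assumptions, not taken from the original implementation):

#include <stdio.h>

/* Heuristic score of a perforation: sp is its mean speedup, acc its mean
   accuracy metric, b the accuracy bound. Larger scores indicate more
   attractive loop/perforation-rate pairs. */
static double score(double sp, double acc, double b) {
    double accuracy_headroom = 1.0 - acc / b;   /* 1 = no loss, 0 = at the bound */
    return 2.0 / (1.0 / sp + 1.0 / accuracy_headroom);   /* harmonic mean */
}

int main(void) {
    /* A perforation with 1.8x speedup at 2% accuracy loss, bound b = 10%. */
    printf("score = %f\n", score(1.8, 0.02, 0.10));
    return 0;
}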

Figure 2: Greedy exploration: Find a set S of loops to perforate in A given training inputs T and accuracy bound b [3].

3 Best Effort Computing Model (BE)

To bridge the gap between applications' demand for performance and the capabilities of future computing platforms, a programming model called the Best-effort (BE) model is presented.

3.1 Idea

The basic idea of this model is that the computing platform provides computation on a best-effort service basis. Therefore, this model is allowed to [2]:
Drop some of the computations requested by the application.
Execute some computations on unreliable hardware that occasionally introduces errors into the results.
For instance, by sacrificing delivery guarantees in networking and consistency in storage systems, it is possible to build simpler, faster networks and faster storage systems. Similarly, by sacrificing optional computations, a performance increase in algorithms can be achieved as well.

3.2 Model

Figure 3 depicts a conceptual model of the overall system architecture of a best-effort computing system. To achieve higher performance, the Best-effort programming model divides computations into two categories:

Figure 3: Best-effort Computing Model Overview [2].


1. Optional computations: These may be dropped by the computing platform or executed incorrectly.
2. Mandatory or guaranteed computations: These must be executed correctly to maintain the integrity of the application.
The best-effort programming model can leverage the optional computations in the following ways:
Drop computations to reduce the overall workload and improve performance.
Execute the computations on an efficient but unreliable computation layer.
Different strategies, presented later, choose the optional computations of an application at the best-effort computation layer. This layer also implements a mechanism that ensures that guaranteed computations are executed reliably and can be re-executed to confirm their results if necessary.

3.3 Best-effort Iterative Convergence Template

Iterative convergence algorithms perform computations in an iterative manner until a convergence or termination condition is satisfied. The pseudo code of the template is given in Figure 4. A computation is repeated until a specified convergence criterion is met; the computations within each iteration and the convergence criterion have to be specified. Two best-effort operators are introduced:


Figure 4: Pseudo code of the best-effort iterative-convergence template [1].


1. filter: This operator reduces computations by generating bit masks that implement different best-effort strategies. Every computation task is assigned a mask bit and is only executed if the bit is set. This way one can control the ratio of optional and guaranteed computations per iteration.
2. batch: This operator relaxes dependencies across iterations, thus enabling the parallel execution of different iterations. A minimal sketch of the control structure resulting from these operators is given below.
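The following C sketch illustrates the control structure implied by the template; filter(), compute(), and converged() are hypothetical stand-ins, since the actual operators are supplied by the application (see Figure 4).

#include <stdbool.h>

#define N 1024

/* Illustrative stand-ins for the template's building blocks. */
static void filter(char *mask, int n)   { for (int i = 0; i < n; i++) mask[i] = 1; }
static void compute(int i)              { (void)i; /* per-element computation */ }
static bool converged(int iteration)    { return iteration >= 10; }

int main(void) {
    char mask[N];
    int iteration = 0;
    do {
        /* filter: selects which optional computations are actually executed
           in this iteration according to the chosen best-effort strategy. */
        filter(mask, N);

        /* Only elements whose mask bit is set are processed; with the batch
           operator these iterations could additionally run in parallel. */
        for (int i = 0; i < N; i++)
            if (mask[i]) compute(i);

        iteration++;
    } while (!converged(iteration));
    return 0;
}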

3.4 Best-effort Strategies
Convergence-based pruning: Converging data structures are used to speculatively identify computations that have minimal impact on the results and to eliminate them.
Staged Computation: The amount of data considered is gradually increased in subsequent stages, refining the estimates of the previous stages. This strategy attempts to accelerate the computation by considering only a subset of the data in early stages. However, if the representative data in the early stages are not chosen well, the convergence rate may suffer.
Early termination: Statistics are used to estimate the accuracy and to terminate before actual convergence. The termination criterion is encoded in the converged operator of the programming template described above. Therefore, if all data points have come within a threshold, the computation can be stopped even though the points have not fully converged yet.
Sampling: A random subset of the input data is selected and used to compute the results. This strategy works well when the input is highly redundant; otherwise sampling may have adverse effects, as important information is lost.

The aforementioned best-effort strategies can also be used in combination. For instance, Sampling can select a subset of the input data before any other strategy is applied. Likewise, Early Termination can be combined with Staged Computation as a relaxed criterion that determines when to advance to the next stage.


4 Case Studies

4.1 K-means Clustering Based on the Best-effort Computation Model

4.1.1 K-means Algorithm

The K-means algorithm clusters a given set of points in a multi-dimensional space. It begins by randomly picking several input vectors as cluster centroids. These cluster centroids are then refined in subsequent iterations, until no further iteration changes them. In each iteration the following steps are performed [6] (a C sketch of one such iteration is given after the list):
1. Compute the distance between every point and every cluster centroid. This distance metric is usually the Euclidean distance.
2. Assign each point to the cluster centroid that it is closest to. Points assigned to the same centroid form one cluster.
3. Re-compute the new centroid of each cluster as the mean of all the points belonging to that cluster.
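A minimal C sketch of one such iteration; the array layout, the DIM constant, and the function name are illustrative assumptions, and the caller repeats the iteration until no membership changes.

#include <float.h>
#include <stdlib.h>

#define DIM 3   /* e.g. the RGB components of a pixel */

/* One K-means iteration over n points and k centroids, following the three
   steps above: squared Euclidean distances, assignment to the nearest
   centroid, and recomputation of the centroids as cluster means. */
void kmeans_iteration(const double (*points)[DIM], int n,
                      double (*centroids)[DIM], int k,
                      int *membership) {
    /* Steps 1 and 2: assign every point to its closest centroid. */
    for (int i = 0; i < n; i++) {
        double best = DBL_MAX;
        for (int c = 0; c < k; c++) {
            double d = 0.0;
            for (int j = 0; j < DIM; j++) {
                double diff = points[i][j] - centroids[c][j];
                d += diff * diff;
            }
            if (d < best) { best = d; membership[i] = c; }
        }
    }

    /* Step 3: recompute each centroid as the mean of its assigned points. */
    double (*sum)[DIM] = calloc((size_t)k, sizeof *sum);
    int *cnt = calloc((size_t)k, sizeof *cnt);
    for (int i = 0; i < n; i++) {
        cnt[membership[i]]++;
        for (int j = 0; j < DIM; j++) sum[membership[i]][j] += points[i][j];
    }
    for (int c = 0; c < k; c++)
        if (cnt[c] > 0)
            for (int j = 0; j < DIM; j++) centroids[c][j] = sum[c][j] / cnt[c];
    free(sum);
    free(cnt);
}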
4.1.2 Potential in K-means for Computation Reduction

A common use of the K-means clustering algorithm is to cluster images into regions with similar properties such as color and texture. Image segmentation is one of the preprocessing steps for image content analysis and image compression. A pixel of the image, represented in RGB space, corresponds to an input point of the algorithm. This case study considers image segmentation performed with the K-means algorithm.
Several characteristics enable the use of the best-effort programming model. Figure 5 shows the number of points that change their membership, i.e. change their cluster, in each iteration. As seen in Figure 5, less than 1% of the points change their memberships after around 20% of the iterations. Consider a point p that does not change its membership after iteration i. Future iterations are then hardly affected by point p, as it does not change its cluster anymore. This indicates that the iterations after i can skip the computations for this point, as it has already stabilized. In practice, however, it is hard to identify stabilized points, because the clusters change gradually as further inputs are added to the system.

Figure 5: Percentage of points changing their memberships over the iterations. Data is measured for K-means grouping 2,895,872 points (a 1792 x 1616 image) into 2, 4, 8, 16, and 32 clusters [1].

Figure 6 shows how the cluster centroids change during subsequent iterations. As the iterations progress, the distance by which the centroids move decreases drastically and then stays in the very low range of 10^-4. This implies that the initial iterations do not require very high accuracy. Therefore it is possible to not consider all points in the early iterations.

Figure 6: The distance that each centroid migrates over the iterations. Data is measured for K-means grouping 2,895,872 points (a 1792 x 1616 image) into 8 clusters [1].

To further evaluate the forgiving nature of the K-means algorithm, a software implementation of K-means was executed on several image datasets while errors were injected into the cluster centroids computed in each iteration: a certain percentage of the centroid values was changed per iteration. The errors were injected to model the impact of executing the algorithm on a best-effort computing platform. However, it is important to note that not all computations in the algorithm can be subjected to errors; for example, the operations that determine whether to continue iterating must be executed without any errors. This corresponds to the view that most applications will consist of computations that may be executed on a best-effort basis and others that may not. Figure 7 shows how the quality of the clustering computed by K-means varies with different rates of error injection. As can be seen, injecting errors at rates of up to 1% (yellow curve) has virtually no impact on cluster quality.

Figure 7: Illustration of the forgiving nature of K-means clustering [2].

4.1.3 K-means Using the Best-effort Iterative Convergence Template

Figure 8 presents the pseudo code of how the K-means clustering algorithm can be written using the best-effort iterative convergence template. Arrays are used to store the points, the cluster centroids, the distances between each point and the cluster centroids, and the cluster memberships. Since the algorithm segments n points into k clusters, initially k random points are chosen as centroids using the function random_select(). Then, according to the specified best-effort strategy, the filter() function selects the points that will be used in the calculations. Here mask[i] = 1 means that the ith point will be considered in the current iteration; computations involving this point are guaranteed. Points whose entry in mask[] is 0 are regarded as optional computations. The parallel_for only processes points whose mask bit is set to 1. The function compute_distances() computes the distance of the ith point from all k centroids. From these, the function argmin() determines the index of the centroid closest to the ith point, and the ith point is assigned to the cluster to which its distance is smallest. The compute_means() function then computes the new centroids based on all the points in each cluster. Finally, depending on the specified best-effort convergence criterion, the method converged() decides when to terminate the algorithm.

Figure 8: Pseudo code of K-means in the best-effort iterative-convergence template [1].

4.1.4 Best-effort Strategies for K-means

Several best-effort strategies have been deployed for the K-means algorithm. They include four different best-effort filtering criteria and one best-effort convergence criterion. The strategies described below are parameterized [1]:
terminate: The convergence criterion returns True when less than T% of the points have changed their membership since the last iteration.
sample: P% of the n points are randomly sampled. For all of these points the mask bit in the mask[] array is set to 1; these points then participate in the computations during the iteration (see the sketch after this list).
stage: The filtering starts with a subset of all input points and gradually adds points to the computations in stages. A total of S stages are deployed; the computations of the next stage begin when the previous stage has converged. The number of points grows geometrically in each stage.
conv.point: The filtering criterion filters out points whose membership has not changed in the previous N iterations. For these points the mask bit is set to 0, i.e. they are considered to be converged.
conv.center: The filtering criterion identifies the points whose centroids have moved by a distance greater than D since the previous iteration. For these points the mask bits are set, while for the rest the bits are unset. For the experiments performed, the input data was normalized via a z-score transformation [8], and all distance calculations are done in this normalized space.
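As an illustration, mask generation for the sample and conv.point strategies could look as follows; the parameter names P and N follow the description above, while the function names, the stable_for[] bookkeeping array, and the use of rand() are assumptions, not code from [1].

#include <stdlib.h>

/* sample: set the mask bit for roughly P percent of the n points. */
void filter_sample(char *mask, int n, double P) {
    for (int i = 0; i < n; i++)
        mask[i] = (rand() < P / 100.0 * RAND_MAX) ? 1 : 0;
}

/* conv.point: clear the mask bit of points whose membership has not changed
   in the last N iterations; such points are treated as converged and their
   computations become optional. stable_for[i] counts unchanged iterations. */
void filter_conv_point(char *mask, const int *stable_for, int n, int N) {
    for (int i = 0; i < n; i++)
        mask[i] = (stable_for[i] < N) ? 1 : 0;
}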
4.1.5 Results

To evaluate the best-effort K-means implementation, two images were chosen as the input data set. Figure 9(a) shows an image of histological micrographs of tissue samples. Figure 9(b) shows a scenic image with a small depth of field and a large blurred area.

Figure 9: Images used for evaluating best-effort K-means [1].

The effectiveness of each best-effort strategy is evaluated with respect to performance and the error introduced into the result by the use of best-effort computing. A performance baseline is established by executing the K-means algorithm without any best-effort strategy, i.e. no filtering operation is performed, until no points change their membership. This baseline is compared with the results of the different best-effort strategies. The error rate is the percentage of points that end up with a different cluster membership. A fixed number of clusters of 8 (i.e. K = 8) is used.
The best-effort strategies are applied in two ways [1]:
Individual best-effort strategies: Figure 10 shows the error vs. performance curves resulting from each of the best-effort strategies described in the previous section when applied to the cancer image dataset (Figure 9(a)). The baseline is the terminate strategy with T% set to 0%. All best-effort strategies reduce the execution time at the expense of a small loss of accuracy in the result. With an error rate of less than 1%, the execution time can be reduced from 4.5 to 1.3 seconds, resulting in a speedup of 3.5x. The sample strategy reduces the execution time to 0.1 seconds at an error rate of 3%, while the other strategies reduce it to 0.6 seconds at an error rate of 4%. The stage strategy produces its best result with 9 stages: an execution time of 1.75 seconds while the error rate is maintained at 0.01%.

Figure 10: Error vs. performance for individual best-effort strategies [1].

Combined best-effort strategy: Figure 11 shows performance vs. error rate for combined strategies. Combining the best-effort strategies yields, in most cases, a higher performance gain at a lower error rate than the individual best-effort strategies. Figure 11 includes a curve for the individual terminate strategy for comparison; the baseline is again shown with T% set to 0%. In most cases the sample strategy is very effective in trading off accuracy for performance, but it can be further improved by combining it with the stage and conv.center policies. Although the sample strategy is very effective in some cases, the optimal sampling rate depends on the redundancy of the input data: a sampling rate of 1% yields an error rate of 3% for the histological image, but the error rate increases to more than 30% for the scenic image. The stage strategy demonstrates more robust results when combined with the terminate strategy with a termination parameter (T%) of 0.1% to 1%.


Figure 11: Error vs. execution time for combined best-effort strategies when applied to K-means based image segmentation for (a) Cancer and (b) Goose [1].

In summary, the experiments demonstrate that significant performance improvements can be obtained by reducing the computational workload at the cost of introducing some error into the result. Experiments on the image dataset show that with an error rate of 1%, a speedup of 3.5x can be achieved.

4.2 H.264 Video Encoding Using Loop Perforation

Video encoders take a stream of input frames and compress them for efficient storage. The quality
of the video is measured in peak signal-to-noise ratio (PSNR) and bit rate. The bit rate reflects the
compression achieved by the encoder.
4.2.1 H.264 Implementation

The x264 implementation of H.264 is considered for this case study. Good video compression is achieved by finding and exploiting similarities between contiguous frames of a video, a process known as motion estimation. During this process a frame is broken into 16x16 pixel blocks called macroblocks. Sometimes these macroblocks are also split into smaller blocks called sub-blocks. x264 tries to find a similar macroblock or sub-block in previously encoded frames by computing the sum of Hadamard transformed differences (SATD) between the current macroblock or sub-block and the previous frame. Figure 12 shows a C function for computing the SATD between two regions. It is one of the functions involved in encoding the raw input video; others include x264_mb_analyse_inter_p16x16, x264_pixel_sad_16x16, and x264_me_search_ref. To find the best match for a macroblock, the encoder would have to search the entire reference frame, which is too expensive in practice. Motion estimation algorithms therefore use heuristics to move from one location in a frame to another without having to examine the entire reference frame. This trade-off between quality and performance makes the motion estimation algorithm a good candidate for loop perforation (a simplified sketch follows below).
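A simplified, hypothetical sketch of such a perforatable candidate search; this is not x264's actual code: it uses a plain sum of absolute differences instead of the Hadamard-transformed differences and scans candidate offsets along a single dimension only.

#include <limits.h>
#include <stdlib.h>

#define BLOCK 16

/* Sum of absolute differences between a macroblock and a candidate region
   of the reference frame (both assumed to be valid BLOCK x BLOCK regions). */
int block_sad(const unsigned char *cur, const unsigned char *ref, int stride) {
    int sad = 0;
    for (int y = 0; y < BLOCK; y++)
        for (int x = 0; x < BLOCK; x++)
            sad += abs(cur[y * stride + x] - ref[y * stride + x]);
    return sad;
}

/* Perforatable candidate search: with step = 2 (perforation rate 0.5) only
   every second candidate offset is evaluated, so fewer blocks are compared
   per frame at the cost of possibly missing the best match. */
int search_best_offset(const unsigned char *cur, const unsigned char *ref,
                       int stride, int num_candidates, int step) {
    int best_sad = INT_MAX, best = 0;
    for (int c = 0; c < num_candidates; c += step) {
        int sad = block_sad(cur, ref + c, stride);
        if (sad < best_sad) { best_sad = sad; best = c; }
    }
    return best;
}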

4.2.2 Results

The loop perforation technique is applied to the x264 implementation of the H.264 video encoding standard. The input video is taken from the PARSEC benchmark suite [7].


Figure 12: Code to compute the sum of Hadamard transformed differences. This function is important in video encoding and is a good candidate for code perforation [9].
Criticality Testing: The criticality testing algorithm described in Figure 1 was run on the x264 implementation of the H.264 encoding standard. The results are shown in Figure 13 for an accuracy bound of 10%; each column shows the results for a given perforation rate r = 0.25, 0.50, 0.75, 1.00. The first row (candidate) presents the initial number of candidate loops. The second row (crash) shows the number of loops filtered out because perforating them terminates the application with an error. The third row (accuracy) presents the number of loops filtered out because perforating them causes the application to violate the specified accuracy bound. The fourth row (speed) shows the loops filtered out because perforating them does not increase the overall performance. The last row (Valgrind) shows the loops filtered out because perforating them causes a memory error.
Figure 13: Criticality testing results for individual loops [3].

Performance Space Exploration: Figure 14 shows the results of the exhaustive loop perforation space exploration algorithm. The graph plots a single point for each explored perforation: the x-axis shows the percentage accuracy loss of the perforation and the y-axis its mean speedup. An accuracy bound of 10% was chosen for this experiment. Green points show perforations within the specified accuracy bound, while red points show perforations outside the bound. The blue line connects the Pareto-optimal perforations, and the pink triangle shows the result produced by the greedy algorithm. The graph shows that loop perforation is able to increase performance on the training inputs by a factor of more than 3 while reducing the accuracy by less than 10%. The greedy algorithm's result falls short of the best perforation found because the greedy algorithm explores only the highest-scoring perforation rate of each loop, as reflected by the score metric in Section 2.3.2.

Figure 14: Exhaustive Loop Perforation Space Algorithm Results [3].

The table in Figure 15 shows the accuracy and speedup results for the Pareto-optimal perforations in the loop perforation space. Each group of columns presents the results of the Pareto-optimal perforations for four accuracy bounds b (2.5%, 5%, 7.5%, and 10%). Each entry of the form X (Y%) gives the mean speedup and mean accuracy for the corresponding combination of bound and input set. As the accuracy bound b increases, the overall mean speedup increases as well.

Figure 15: Training and production results for Pareto-optimal perforations for varying accuracy bounds [3].

In summary, applying loop perforation to the x264 implementation of the H.264 encoding standard causes it to consider fewer blocks per frame and to terminate the search for a similar previously encoded block early. These perforations may cause x264 to choose a less desirable previously encoded block as a starting point for encoding the current block. Since there is significant redundancy in the set of previously encoded blocks, the changes to image quality are imperceptible to humans: the PSNR of the perforated x264 stays within 0.3 dB of that of the unperforated version. This trade in output quality increases the performance of the overall algorithm. On the other hand, the average file size of the output increases by 18% compared to that of the unperforated output [9]. Applications such as HD video viewing on mobile platforms and live streaming, where speed is of the essence, can benefit from this perforated version of x264.

5 Conclusion

This report presented two software methods for Approximate Computing that can be applied at the application level to increase the performance of applications, namely Loop Perforation and the Best-effort computing model. The performance improvement is achieved by trading off accuracy. The loop perforation technique reduces the number of iterations a loop has to run in an application, while the Best-effort computing model distinguishes between optional and guaranteed computations inside an application. Two applications were chosen that have an inherently forgiving nature, i.e. they can tolerate inaccuracy in their result: H.264 video encoding and K-means clustering. By perforating the loops inside the x264 implementation of the H.264 video encoding standard, a speedup of 3.25x was observed with an error rate below 10%. By applying the best-effort computation strategies to the K-means clustering algorithm, a speedup of 3.5x was observed with an error rate of 1%. Thus, it is shown that by applying such software techniques for approximate computing, a performance increase can be achieved at an acceptable loss of accuracy.


References
[1] J. Meng, S. Chakradhar, A. Raghunathan. Best-effort Parallel Execution Framework for Recognition and Mining Applications. IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2009, pp. 1-12.
[2] S. T. Chakradhar, A. Raghunathan. Best-effort Computing: Re-thinking Parallel Software and Hardware. 47th ACM/IEEE Design Automation Conference (DAC), 2010, pp. 865-870.
[3] S. Sidiroglou, S. Misailovic, H. Hoffmann, M. Rinard. Managing Performance vs. Accuracy Trade-offs with Loop Perforation. 19th ACM SIGSOFT Symposium and 13th European Conference on Foundations of Software Engineering (ESEC/FSE), 2011, pp. 124-134.
[4] V. K. Chippa et al. Analysis and Characterization of Inherent Application Resilience for Approximate Computing. 50th ACM/IEEE Design Automation Conference (DAC), 2013, pp. 1-9.
[5] M. Rinard. Probabilistic Accuracy Bounds for Fault-Tolerant Computations that Discard Tasks. 20th Annual International Conference on Supercomputing (ICS), 2006, pp. 324-334.
[6] J. MacQueen. Some Methods for Classification and Analysis of Multivariate Observations. 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281-297.
[7] C. Bienia, S. Kumar, S. Misailovic, J. P. Singh, K. Li. The PARSEC Benchmark Suite: Characterization and Architectural Implications. 17th International Conference on Parallel Architectures and Compilation Techniques (PACT), 2008, pp. 72-81.
[8] A. M. Mood, F. A. Graybill, D. C. Boes. Introduction to the Theory of Statistics. McGraw-Hill, 1974.
[9] H. Hoffmann, S. Sidiroglou, S. Misailovic, A. Agarwal. Using Code Perforation to Improve Performance, Reduce Energy Consumption, and Respond to Failures. Technical Report MIT-CSAIL-TR-2009-042, MIT, 2009.
[10] J. A. Hartigan, M. A. Wong. A K-Means Clustering Algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics), Vol. 28, 1979, pp. 100-108.
[11] M. Bohr. A 30 Year Retrospective on Dennard's MOSFET Scaling Paper. IEEE Solid-State Circuits Society Newsletter, Vol. 12, No. 1, 2007, pp. 11-13.
