
CISC 663 Summary: Improving the Reliability of

Commodity Operating Systems Ram


Student: Jose M. Monsalve

1. SUMMARY

The main purpose of the paper is to describe the implementation of Nooks and to justify why an
operating system needs it. Nooks is an operating system subsystem that uses lightweight kernel
protection domains to isolate operating system extensions, restricting their access to kernel space
and reducing the chance that they corrupt it. Nooks deals not only with preventing kernel corruption
but also with recovering the kernel when an error occurs in an extension. The authors want to
highlight the importance of adding reliability to commodity operating systems.
There are three main reasons that motivate Nooks. First, reliability is crucial and yet remains an
unsolved problem. Second, while the core kernel is written by a small number of experts who
understand the overall structure of the operating system, kernel extensions (optional components
that run in kernel space to provide specific functions) are written by vendors or other developers
who may not know the kernel's organization as deeply. Third, because of the previous factor and the
complexity of the operating system, most of the errors that corrupt the kernel are caused by
extensions, which are also harder to test completely as a unit than the core kernel of the OS.
Additionally, the authors claim that two aspects must be kept in mind when dealing with this
problem: backward compatibility, so that reliability improvements reach both old and future
operating systems, and efficiency, to avoid the classic trade-off between robustness and
performance (taken from the original article).
The key questions the authors are addressing are: (1) Is it possible to reduce the number of
failures that arise from kernel extensions? (2) If so, how should the operating system be modified
while maintaining compatibility with most existing extensions? (3) How can such an implementation
remain efficient and backward compatible with kernel extensions? (4) Given the implementation, how
well does it avoid and recover from extension errors, and how much overhead does the new layer
introduce?
The method used to answer these key questions is to implement Nooks in the Linux operating system:
a reliability layer that sits between the kernel core and the extensions and is intended to reduce
the kernel corruption, and the resulting kernel panics, caused by extensions. Nooks does not provide
complete fault tolerance; it adds fault resistance that handles the vast majority of errors but not
all possible ones, placing it between an unprotected operating system and a 100% safe operating
system. There are three major goals in the architecture behind Nooks: (1) Errors coming from
extensions must be isolated. (2) When an error happens, recovery must be automatic. (3) Nooks has to
be backward compatible, supporting both newer and older existing extensions without requiring large
modifications. To satisfy these goals, the creators of Nooks propose a reliability layer that sits
between the OS core kernel and the kernel extensions while remaining transparent to the extensions.
This layer, called the Nooks Isolation Manager (NIM), is made up of four functional components.
First, isolation prevents errors in an extension from affecting the behavior of the kernel by giving
each extension its own lightweight kernel protection domain with an isolated memory space. Second,
interposition integrates existing extensions into the Nooks environment by wrapping them, handling
the kernel-to-extension and extension-to-kernel control flow, and routing all data transferred
between the kernel and the extension through object tracking. Third, object tracking preserves the
data structures used by the extensions and controls their modification. It also provides the
information needed for cleanup when an extension fails. Fourth, the recovery mechanism detects
faulty behavior in extensions and recovers once a fault is detected. Detection happens when an
extension invokes a kernel service improperly or when resources are starved. Hardware errors are
also handled by the recovery mechanism, which notifies the kernel and performs the necessary
exception handling and cleanup.
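
The four components described above can be illustrated with a small user-level sketch. This is only
an analogy written for this summary, not the Nooks implementation (which is kernel code and relies
on hardware memory protection): every class and method name below is hypothetical, and copying
arguments merely stands in for isolation, since Python cannot express separate protection domains.

```python
import copy


class ExtensionFault(Exception):
    """Raised by the (hypothetical) wrapper when an extension misbehaves."""


class ObjectTracker:
    """Object tracking: remembers resources handed to an extension for later cleanup."""

    def __init__(self):
        self.live = set()

    def track(self, obj_id):
        self.live.add(obj_id)

    def release_all(self):
        released = sorted(self.live)
        self.live.clear()
        return released


class IsolatedExtension:
    """Wraps one extension; all calls into it pass through call()."""

    def __init__(self, name, factory):
        self.name = name
        self.factory = factory          # builds a fresh extension instance
        self.tracker = ObjectTracker()
        self.restarts = 0
        self.impl = factory()

    def call(self, method, *args):
        # Interposition: the wrapper sits on the kernel-to-extension path.
        # Isolation (stand-in): the extension only ever sees copies of kernel data.
        safe_args = copy.deepcopy(args)
        try:
            return getattr(self.impl, method)(*safe_args)
        except Exception as exc:
            # Recovery: clean up tracked objects, restart the extension, report the fault.
            self.recover()
            raise ExtensionFault(f"{self.name} failed in {method}") from exc

    def recover(self):
        released = self.tracker.release_all()
        self.impl = self.factory()
        self.restarts += 1
        return released


class DemoDriver:
    """Hypothetical extension with a 'bad parameter' bug."""

    def send(self, packet):
        if packet is None:
            raise ValueError("bad parameter")
        packet.append("sent")           # mutates only the wrapper's copy
        return len(packet)


def demo():
    ext = IsolatedExtension("demo-net", DemoDriver)
    ext.tracker.track("buffer-1")
    kernel_packet = ["hdr", "payload"]
    ok = ext.call("send", kernel_packet)
    try:
        ext.call("send", None)
        survived = False
    except ExtensionFault:
        survived = True                 # the caller keeps running after the fault
    return ok, kernel_packet, survived, ext.restarts, sorted(ext.tracker.live)
```

In this sketch the caller's data is left untouched by the extension, and a fault in the extension
turns into a catchable error plus a restart instead of taking down the caller, which is the
behavior the summary attributes to the reliability layer.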
The testing approach used to prove the validity and effectiveness of the method is divided into two
parts. First, the reliability of the modified Linux operating system is tested in order to validate
Nooks' functionality. Second, a set of performance metrics is collected to show the overhead
introduced by Nooks. For the first part, six device driver extensions (sound cards and network
devices), the VFAT kernel subsystem, and the kHTTPd application-specific kernel extension were
tested. A fault injection mechanism was adapted to emulate common programming errors such as
uninitialized variables, bad parameters, and inverted test conditions. Each extension was assigned a
normal workload (e.g., playing an MP3 file for the sound driver, or sending a TCP stream over the
network). The modified operating system ran in a VMware virtual machine, which allowed it to be
restored to a controlled clean state after each experiment. Two configurations were tested: a native
one without the Nooks modifications, and a Nooks one. For each configuration, 400 trials were run
and compared, each with 5 random errors injected (the same errors were injected in both the native
and the Nooks configurations). It is important to highlight that, since the errors were injected at
random, they might or might not cause faulty behavior. The results were impressive: Nooks eliminated
99% of the crashes that occurred in the native version. In addition, Nooks mitigated 60% of the
non-fatal errors (those that did not crash the native system but altered the behavior of the
extension). For the second set of experiments, 6 benchmarks were used to measure CPU utilization,
kernel-mode time, and total elapsed time. Total user time did not change, but kernel-mode time
changed substantially. The paper does not emphasize these results. CPU utilization was also compared
for each benchmark. For drivers, the performance losses were no greater than 9%; for kernel services
and applications, however, the drops were as large as 60% in the worst case. The performance losses
correlate with the rate of XPC (extension procedure call) operations, so the XPC rate affects the
throughput of extension operations.
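
The correlation between overhead and XPC rate can be illustrated with a back-of-the-envelope model:
if each XPC costs a roughly fixed amount of CPU time, the fraction of CPU spent crossing domains
grows linearly with the XPC rate. The function name, the fixed-cost assumption, and every number
below are illustrative assumptions, not measurements from the paper.

```python
def xpc_cpu_fraction(xpc_per_second, seconds_per_xpc):
    """CPU fraction spent on domain crossings, assuming a fixed cost per XPC."""
    return xpc_per_second * seconds_per_xpc


# Hypothetical workloads: same assumed per-call cost, different call rates.
COST = 5e-6                                   # assumed seconds per XPC
low_rate = xpc_cpu_fraction(1_000, COST)      # an extension with few crossings
high_rate = xpc_cpu_fraction(100_000, COST)   # an extension with many crossings
```

Under these assumed values, a hundredfold increase in the call rate produces a hundredfold increase
in the CPU share spent on crossings, which is one way to read the gap between the driver results and
the kernel service results.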
The main conclusions of the paper are as follows. First, Nooks is an excellent mechanism for
increasing operating system reliability. With a recovery rate of 60% and a 99% reduction in system
crashes, Nooks is promising, and its results should be kept in mind for future operating system
development. However, although the authors claim that the overhead introduced by Nooks is low
(compared to the benefits it provides), I believe that a lot of work remains and that users are not
willing to give up 50% of their performance when crashes are not that critical on commodity
operating systems and, moreover, are not that common for the end user. Second, it is clear that the
authors did a good job of isolating the extensions and their errors from the core kernel of the
operating system, and that such work can lead to improvements in operating systems. Experiments
showed that, with few or no modifications to the extensions, the reliability of the operating system
increased considerably. I do not expect this work to be used as it is, but I think it is a worthy
idea that operating system designers should keep in mind.
If we take this line of reasoning seriously, the implications are that it is possible to achieve
higher reliability for operating systems without big modifications to their extensions. However,
this comes at a price. Following the authors' ideas, when reliability is important, performance can
be sacrificed. It is possible to apply these ideas, but I think that the common user of a commodity
operating system is not as interested in reliability as the authors expect. The problem is that for
special-purpose operating systems, where reliability matters, this work does not provide 100% fault
tolerance and safety. In such cases we can take advantage of the benefits of Nooks, but further
effort is required when Nooks is not enough.

2. EVALUATION

I believe that the paper properly addresses the elements of reasoning. It presents a really
interesting idea and shows that it works really well. This work is another option for increasing
reliability in systems, since all the layers running on top of the operating system benefit from its
increased reliability. However, performance is still a problem. Efficiency is not as great as the
authors claim, and so the trade-off between robustness and performance is still there.

1. Purpose: The purpose of the reasoner is to provide a new reliability layer between the core
kernel of the operating system and its extensions. This new layer should provide error isolation,
preventing an extension from corrupting the whole operating system. This reduces the occurrence of
kernel panics and other side effects of extension errors. This purpose is clearly stated and is
maintained throughout the paper. It is supported by statistics and by a few references.
2. Question: Although the questions are not explicitly stated in the paper, the authors' thesis is
clear and remains clear as the paper develops. Its position and complexity are justified not only by
the statistics presented at the beginning of the paper and the related work, but also by the
results, which demonstrate the capabilities of Nooks as well as how well it performs in dealing with
extension faults.
3. Information: The authors not only cite evidence but also provide thorough background and related
work, which shows that a large community is interested in the matter. The only information that I
would call exaggerated concerns the efficiency of Nooks in terms of overhead. I think that the CPU
performance losses are considerably large for common use of a commodity operating system. However,
citations are provided, and comparison of this work with others is possible.
4. Concepts: As a person with little knowledge of this area, I felt that the paper was easy to read
and follow, and that the concepts were clearly defined and used consistently. Even someone outside
the resilience community can understand the risk posed by an error in an extension and how isolating
extensions provides the benefits that the authors claim.
5. Assumptions: The only assumption that I might disagree with is that the performance losses can be
acceptable to a general user of a commodity operating system. In the case of VFAT, 60% extra CPU
means a considerable slowdown of the whole system and reduced performance for all running
applications every time a disk access is performed (for such file systems). Apart from this
assumption, I think the authors' work is worthwhile, and it still provides an interesting
perspective on, and solution to, the problem that can be exploited in different ways.
6. Inferences: The line of reasoning is clear and can be followed throughout the paper. The reader
will reach most of the same conclusions; however, as mentioned before, the fact that the authors aim
the approach at general-purpose operating systems does not make complete sense to me.
7. Point of View: The authors claim that this is only one possible solution and that it is open to
other possible implementations. The paper mentions some related work with its advantages and
disadvantages, and it tries to address the disadvantages. One highlighted aspect is the recovery
process, which seems to be novel. The point of view is clear, but it is not the only one
acknowledged by the authors, and so the discussion remains open.
8. Implications: I believe the authors are aware of the implications of their work, even if they
underestimate the performance losses. The paper is clear about the scope of the approach and its
limitations. It also shows sensitivity to those limitations and clearly states that its position is
not unique and can be either expanded or modified.

3. CONCLUSIONS

I think this is a good paper that presents impressive work in operating system reliability. It is
well written and interesting. Nooks can be reused in future operating systems, and some of its
concepts can be borrowed to improve operating system reliability and reduce the number of crashes. I
do not expect Nooks to be used as it is in any operating system, but I think it really opens a
discussion about whether such mechanisms are necessary and whether this solution is adequate. The
paper provides evidence of reliability improvements, with a 99% reduction in crashes and a 60%
reduction in error propagation in extensions.
