
Chapter 17.

Locking and Synchronization


In this chapter, we continue our discussion of core kernel facilities with an examination of the synchronization objects implemented in the Solaris kernel.
17.1. Synchronization

Solaris runs on a variety of different hardware platforms, including multiprocessor systems based on both SPARC and Intel processors. Several multiprocessor architectures in existence today offer various trade-offs in performance and engineering complexity in both hardware and software. The current multiprocessor architecture that Solaris supports is the symmetric multiprocessor (SMP) and shared memory architecture, which implements a single kernel shared by all processors and a single memory address space.

To support such an architecture, the kernel must synchronize access to critical data to maintain data integrity, coherency, and state. The kernel synchronizes access by defining a lock for a particular kernel data structure or variable and requiring that code reading or writing the data must first acquire the appropriate lock. The holder of the lock is required to release the lock once the data operation has been completed.

The synchronization primitives and associated interfaces are used by virtually all kernel subsystems: device drivers, the dispatcher, process and thread support code, file systems, etc. Insight into what the synchronization objects are and how they are implemented is key to understanding one of the core strengths of the Solaris kernel: scalable performance on multiprocessor systems. An equally important component of the scalability equation is the avoidance of locks altogether whenever possible. The use of synchronization locks in the kernel is constantly being scrutinized as part of the development process in kernel engineering, with an eye to minimizing the number of locks required without compromising data integrity.

Several alternative methods of building parallel multiprocessor systems have emerged in the industry over the years. So, in the interest of conveying the issues surrounding the implementation, we need to put things in context. First, we take a brief look at the different parallel systems architectures that are commercially available today, and then we turn to the specifics of support for multiprocessor architectures by the Solaris kernel.

17.2. Parallel Systems Architectures

Multiprocessor (MP) systems from Sun (SPARC-processor based), as well as several x86/x64-based MP platforms, are implemented as symmetric multiprocessor (SMP) systems. Symmetric multiprocessor describes a system in which a peer-to-peer relationship exists among all the processors (CPUs) on the system. A master processor, defined as the only CPU on the system that can execute operating system code and field interrupts, does not exist. All processors are equal. The SMP acronym can also be extended to mean Shared Memory Multiprocessor, which defines an architecture in which all the processors in the system share a uniform view of the system's physical address space and the operating system's virtual address space. That is, all processors share a single image of the operating system kernel. Sun's multiprocessor systems meet the criteria for both definitions.

Alternative MP architectures alter the kernel's view of addressable memory in different ways. Massively parallel processor (MPP) systems are built on nodes that contain a relatively small number of processors, some local memory, and I/O. Each node contains its own copy of the operating system; thus, each node addresses its own physical and virtual address space. The address space of one node is not visible to the other nodes on the system. The nodes are connected by a high-speed, low-latency interconnect, and node-to-node communication is done through an optimized message passing interface.

MPP architectures require a new programming model to achieve parallelism across nodes. The shared memory model does not work, since the system's total address space is not visible across nodes, so memory pages cannot be shared by threads running on different nodes. Thus, an API that provides an interface into the message passing path in the kernel must be used by code that needs to scale across the various nodes in the system. Other issues arise from the nonuniform nature of the architecture with respect to I/O processing, since the I/O controllers on each node are not easily made visible to all the nodes on the system. Some MPP platforms attempt to provide the illusion of a uniform I/O space across all the nodes by using kernel software, but the nonuniformity of the access times to nonlocal I/O devices still exists.

NUMA and ccNUMA (nonuniform memory access and cache coherent NUMA) architectures attempt to address the programming model issue inherent in MPP systems.

From a hardware architecture point of view, NUMA systems resemble MPPs: small nodes with few processors, a node-to-node interconnect, local memory, and I/O on each node. Note that it is not required that NUMA/ccNUMA or MPP systems implement small nodes (nodes with four or fewer processors). Many implementations are built that way, but there is no architectural restriction on the node size.

On NUMA/ccNUMA systems, the operating system software provides a single system image, where each node has a view of the entire system's memory address space. In this way, the shared memory model is preserved. However, the nonuniform nature of the speed of memory access (latency) is a factor in the performance and potential scalability of the platform. When a thread executing on a processor node on a NUMA or ccNUMA system incurs a page fault (references an unmapped memory address), the latency involved in resolving the page fault varies according to whether the physical memory page is on the same node as the executing thread or on a node somewhere across the interconnect. The latency variance can be substantial. As the level of memory page sharing increases across threads executing on different nodes, a potentially higher volume of page faults needs to be resolved from a nonlocal memory segment. This problem adversely affects performance and scalability.

The three different parallel architectures can be summarized as follows:

SMP. Symmetric multiprocessor with a shared memory model; single kernel image.
MPP. Message-based model; multiple kernel images.
NUMA/ccNUMA. Shared memory model; single kernel image.

Figure 17.1 illustrates the different architectures.

Figure 17.1. Parallel Systems Architectures

The challenge in building an operating system that provides scalable performance when multiple processors are sharing a single image of the kernel, and when every processor can run kernel code, handle interrupts, etc., is to synchronize access to critical data and state information. Scalable performance, or scalability, generally refers to accomplishment of an increasing amount of work as more hardware resources are added to the system. If more processors are added to a multiprocessor system, an incremental increase in work is expected, assuming sufficient resources in other areas of the system (memory, I/O, network). To achieve scalable performance, the system must be able to concurrently support multiple processors executing operating system code. Whether that execution is in device drivers, interrupt handlers, the threads dispatcher, file system code, virtual memory code, etc., is, to a degree, load dependent. Concurrency is key to scalability.

system code, "irtual memory code, etc., is, to a degree, load dependent. Concurrency is key to scala ility. 'he preceding discussion on parallel architectures only scratched the surface of a "ery complex topic. 5ntire texts discuss parallel architectures exclusi"ely6 you should refer to them for additional information. See, for example, :1;<, :=><, and :=7<. 'he difficulty is maintaining data integrity of data structures, kernel "aria les, data links (pointers*, and state information in the kernel. 9e cannot, for example, allow threads running on multiple processors to manipulate pointers to the same data structure on the same linked list all at the same time. 9e should pre"ent one processor from reading a it of critical state information (for example, is a processor online?* while a thread executing on another processor is changing the same state data (for example, in the process of ringing online a processor that is still in a state transition*. 'o sol"e the pro lem of data integrity on such systems, the kernel implements locking mechanisms. It re+uires that all operating system code e aware of the num er and type of locks that exist in the kernel and comply with the locking hierarchy and rules for ac+uiring locks efore writing or reading kernel data. It is worth noting that the architectural issues of uilding a scala le kernel are not "ery different from those of de"eloping a multithreaded application to run on a shared memory system. )ultithreaded applications must also synchronize access to shared data, using the same asic locking primiti"es and techni+ues that are used in the kernel. 4ther synchronization pro lems, such as dealing with interrupts and trap e"ents, exist in kernel code and make the pro lem significantly more complex for operating systems de"elopment, ut the fundamental pro lems are the same.
17.3. Hardware Considerations for Locks and Synchronization

Hardware-specific considerations must enter into the implementation of lock primitives on a system. The first consideration has to do with the processor's instruction set and the availability of machine instructions suitable for locking code. The second deals with the visibility of a lock's state when it is examined by executing kernel threads. To understand how these considerations apply to lock primitives, keep in mind that a lock is a piece of data at a specific location in the system's memory. In its simplest form, a lock is a single byte location in RAM. A lock that is set, or held (has been acquired), is represented by all the bits in the lock byte being 1's (lock value 0xFF). A lock that is available (not being held) is the same byte with all 0's (lock value 0x00).

This explanation may seem quite rudimentary, but it is crucial to understanding the text that follows.

Most modern processors shipping today provide some form of byte-level test-and-set instruction that is guaranteed to be atomic in nature. The instruction sequence is often described as read-modify-write; that is, the referenced memory location (the memory address of the lock) is read, modified, and written back in one atomic operation. In RISC processors (such as the UltraSPARC T1 processor), reads are load operations and writes are store operations. An atomic operation is required for consistency. An instruction that has atomic properties means that no other store operation is allowed between the load and store of the executing instruction. Mutex and RW lock operations must be atomic, such that when the instruction execution to get the lock is complete, we either have the lock or have the information we need to determine that the lock is already being held.

Consider what could happen without an instruction that has atomic properties. A thread executing on one processor could issue a load (read) of the lock, and while it is doing a test operation to determine if the lock is held or not, another thread executing on another processor issues a lock call to get the same lock at the same time. If the lock is not held, both threads would assume the lock is available and would issue a store to hold the lock. Obviously, more than one thread cannot own the same lock at the same time, but that would be the result of such a sequence of events. Atomic instructions prevent such things from happening.

SPARC processors implement memory access instructions that provide atomic test-and-set semantics for mutual exclusion primitives, as well as instructions that can force a particular ordering of memory operations (more on the latter feature in a moment). UltraSPARC processors (the SPARC V9 instruction set) provide three memory access instructions that guarantee atomic behavior: ldstub (load and store unsigned byte), cas (compare and swap), and swap (swap byte locations). These instructions differ slightly in their behavior and the size of the datum they operate on. Figure 17.2 illustrates the ldstub and cas instructions. The swap instruction (not shown) simply swaps a 32-bit value between a hardware register and a memory location, similar to what cas does if the compare phase of the instruction sequence is equal.

Figure 17.2. Atomic Instructions for Locks on SPARC Systems

The implementation of locking code with the assembly language test-and-set style of instructions requires a subsequent test instruction on the lock value, which is retrieved with either a cas or ldstub instruction. For example, the ldstub instruction retrieves the byte value (the lock) from memory and stores it in the specified hardware register. Locking code must test the value of the register to determine if the lock was held or available when the ldstub executed. If the register value is all 1's, the lock was held, so the code must branch off and deal with that condition. If the register value is all 0's, the lock was not held and the code can progress as being the current lock holder. Note that in both cases, the lock value in memory is set to all 1's, by virtue of the behavior of the ldstub instruction (store 0xFF at the designated address). If the lock was already held, the value simply didn't change. If the lock was 0 (available), it will now reflect that the lock is held (all 1's). The code that releases a lock sets the lock value to all 0's, indicating the lock is no longer being held.
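The protocol just described can be illustrated with a small user-level sketch. Here atomic_exchange() stands in for ldstub: it stores 0xFF and returns the previous byte value in one atomic operation. This is an illustration of the technique only, assuming C11 atomics; it is not the Solaris kernel's lock code.

/* A minimal user-level sketch of the ldstub-style protocol using C11 atomics. */
#include <stdatomic.h>
#include <stdint.h>

typedef _Atomic uint8_t simple_lock_t;          /* one byte of lock state */

#define LOCK_HELD       0xFF
#define LOCK_FREE       0x00

void
simple_lock_acquire(simple_lock_t *lp)
{
        /* Spin until the previous value was 0x00, meaning we got the lock. */
        while (atomic_exchange(lp, LOCK_HELD) != LOCK_FREE)
                ;
}

void
simple_lock_release(simple_lock_t *lp)
{
        atomic_store(lp, LOCK_FREE);            /* all 0's: lock available again */
}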

The Solaris lock code uses assembly language instructions when the lock code is entered. The basic design is such that the entry point to acquire a lock enters an assembly language routine, which uses either ldstub or cas to grab the lock. The assembly code is designed to deal with the simple case, meaning that the desired lock is available. If the lock is being held, a C language code path is entered to deal with this situation. We describe what happens in detail in the next few sections that discuss specific lock types.

The second hardware consideration referred to earlier has to do with the visibility of the lock state to the running processors when the lock value is changed. It is critically important on multiprocessor systems that all processors have a consistent view of data in memory, especially in the implementation of synchronization primitives: mutex locks and reader/writer (RW) locks. In other words, if a thread acquires a lock, any processor that executes a load instruction (read) of that memory location must retrieve the data following the last store (write) that was issued. The most recent state of the lock must be globally visible to all processors on the system.

Modern processors implement hardware buffering to provide optimal performance. In addition to the hardware caches, processors also use load and store buffers to hold data being read from (load) or written to (store) memory in order to keep the instruction pipeline running and not have the processor stall waiting for data or a data write-to-memory cycle. The data hierarchy is illustrated in Figure 17.3.
Figure 17.3. Hardware Data Hierarchy

The illustration in Figure 17.3 does not depict a specific processor; it is a generic representation of the various levels of data flow in a typical modern high-end microprocessor. It shows the flow of data to and from physical memory from a processor's main execution units (integer units, floating point units, etc.).

The sizes of the load/store buffers vary across processor implementations, but they are typically several words in size. The load and store buffers on each processor are visible only to the processor they reside on, so a load issued by the processor that issued the store fetches the data from the store buffer if it is still there. However, it is theoretically possible for other processors that issue a load for that data to read their hardware cache or main memory before the store buffer in the store-issuing processor has been flushed. Note that the store buffer we are referring to here is not the same thing as a level 1 or level 2 hardware instruction and data cache. Caches are beyond the store buffer; the store buffer is closer to the execution units of the processor. Physical memory and hardware caches are kept consistent on SMP platforms by a hardware bus protocol. Also, many caches are implemented as write-through caches (as is the case with the level 1 cache in Sun UltraSPARC), so data written to cache causes memory to be updated.

The implementation of a store buffer is part of the memory model implemented by the hardware. The memory model defines the constraints that can be imposed on the order of memory operations (loads and stores) by the system. Many processors implement a sequential consistency model, where loads and stores to memory are executed in the same order in which they were issued by the processor. This model has advantages in terms of memory consistency, but there are performance trade-offs with such a model because the hardware cannot optimize cache and memory operations for speed. The SPARC architecture specification [17] provides for building SPARC-based processors that support multiple memory models, the choice being left up to the implementors as to which memory models they wish to support. All current SPARC processors implement a Total Store Ordering (TSO) model, which requires compliance with the following rules for loads and stores:

Loads (reads from memory) are blocking and are ordered with respect to other loads.
Stores (writes to memory) are ordered with respect to other stores.
Stores cannot bypass earlier loads.
Atomic load-stores (ldstub and cas instructions) are ordered with respect to loads.

The TSO model is not quite as strict as the sequential consistency model, but not as relaxed as two additional memory models defined by the SPARC architecture. SPARC-based processors also support Relaxed Memory Order (RMO) and Partial Store Order (PSO), but these are not currently supported by the kernel and not implemented by any Sun systems shipping today.

A final consideration in data visibility also applies to the memory model and concerns instruction ordering. The execution unit in modern processors can reorder the incoming instruction stream for processing through the execution units. The goals again are performance and creation of a sequence of instructions that will keep the processor pipeline full. The hardware considerations described in this section are summarized in Table 17.1, along with the solution or implementation detail that applies to the particular issue.
Table 17.1. Hardware Considerations and Solutions for Locks

Consideration: Need for an atomic test-and-set instruction for locking primitives.
Solution: Use of native machine instructions: ldstub and cas on SPARC, cmpxchgl (compare/exchange long) on x86.

Consideration: Data global visibility issue because of the use of hardware load and store buffers and instruction reordering, as defined by the memory model.
Solution: Use of memory barrier instructions.

The issues of consistent memory views in the face of a processor's load and store buffers, relaxed memory models, and atomic test-and-set capability for locks are addressed at the processor instruction-set level. The mutex lock and RW lock primitives implemented in the Solaris kernel use the ldstub and cas instructions for lock testing and acquisition on UltraSPARC-based systems and use the cmpxchgl (compare/exchange long) instruction on x86. The lock primitive routines are part of the architecture-dependent segment of the kernel code.

SPARC processors provide various forms of memory barrier (membar) instructions, which, depending on options that are set in the instruction, impose specific constraints on the ordering of memory access operations (loads and stores) relative to the sequence in which they were issued. To ensure a consistent memory view when a mutex or RW lock operation has been issued, the Solaris kernel issues the appropriate membar instruction after the lock bits have changed.

$s we mo"e from the strongest consistency model (se+uential consistency* to the weakest model (%)4*, we can uild a system with potentially etter performance. 9e can optimize memory operations y playing with the ordering of memory access instructions that ena le designers to minimize access latency and to maximize interconnect andwidth. 'he trade&off is consistency, since the more relaxed models pro"ide fewer and fewer constraints on the system to issue memory access operations in the same order in which the instruction stream issued them. So, processor architectures pro"ide memory arrier controls that kernel de"elopers can use to address the consistency issues as necessary, with some le"el of control on which consistency le"el is re+uired to meet the system re+uirements. 'he types of mem ar instructions a"aila le, the options they support, and how they fit into the different memory models descri ed would make for a highly technical and lengthy chapter on its own. %eaders interested in this topic should read :1< and :=7<.
17.'. "ntroduction to Synchronization (&)ects

The Solaris kernel implements several types of synchronization objects. Locks provide mutual exclusion semantics for synchronized access to shared data. Locks come in several forms and are the primary focus of this chapter. The most commonly used lock in the Solaris kernel is the mutual exclusion, or mutex, lock, which provides exclusive read and write access to data. Also implemented are reader/writer (RW) locks, for situations in which multiple readers are allowable but only one writer is allowed at a time. Kernel semaphores are also employed in some areas of the kernel, where access to a finite number of resources must be managed. A special type of mutex lock, called a dispatcher lock, is used by the kernel dispatcher when synchronization requires access protection through a locking mechanism, as well as protection from interrupts. Condition variables, which are not a type of lock, are used for thread synchronization and are an integral part of the kernel sleep/wakeup facility. Condition variables are introduced here and covered in detail in Chapter 3.

The actual number of locks that exist in a running system at any time is dynamic and scales with the size of the system. Several hundred locks are defined in the kernel source code, but a lock count based on static source code is not accurate, because locks are created dynamically during normal system activity: when kernel threads and processes are created, file systems are mounted, files are created and opened, network connections are made, etc.

Many of the locks are embedded in the kernel data structures that provide the abstractions (processes, files) provided by the kernel, and thus the number of kernel locks will scale up linearly as resources are created dynamically. This design speaks to one of the core strengths of the Solaris kernel: scalability, with synchronization primitives scaling dynamically with the size of the kernel. Dynamic lock creation has several advantages over static allocation. First, the kernel is not wasting time and space managing a large pool of unused locks when running on a smaller system, such as a desktop or workgroup server. On a large system, a sufficient number of locks is available to sustain concurrency for scalable performance. It is possible to have literally thousands of locks in existence on a large, busy system.
17.4.1. Synchronization Process

When an executing kernel thread attempts to acquire a lock, it will encounter one of two possible lock states: free (available) or not free (owned, held). A requesting thread gets ownership of an available lock when the lock-specific get lock function is invoked. If the lock is not available, the thread most likely needs to block and wait for the lock to become available, although, as we will see shortly, the code does not always block (sleep) waiting for a lock. For those situations in which a thread will sleep while waiting for a lock, the kernel implements a sleep queue facility, known as turnstiles, for managing threads blocking on locks.

When a kernel thread has completed the operation on the shared data protected by the lock, it must release the lock. When a thread releases a lock, the code must deal with one of two possible conditions: threads are waiting for the lock (such threads are termed waiters), or there are no waiters. With no waiters, the lock can simply be released. With waiters, the code has several options. It can release the lock and wake up the blocking threads; in that case, the first thread to execute acquires the lock. Alternatively, the code could select a thread from the turnstile (sleep queue), based on priority or sleep time, and wake up only that thread. Finally, the code could select which thread should get the lock next, and the lock owner could hand the lock off to the selected thread. As we will see in the following sections, no one solution is suitable for all situations, and the Solaris kernel uses all three methods, depending on the lock type. Figure 17.4 provides the big picture.

Figure 17.4. Solaris Locks: The Big Picture

-igure 17.1 pro"ides a generic representation of the execution flow. Later we will see the results of a considera le amount of engineering effort that has gone into the lock code, impro"ed efficiency and speed with short code paths, optimizations for the hot path (fre+uently hit code path* with well&tuned assem ly code, and the est algorithms for lock release as determined y extensi"e analysis.
17.4.2. Synchronization Object Operations Vector

Each of the synchronization objects discussed in this section (mutex locks, reader/writer locks, and semaphores) defines an operations vector that is linked to kernel threads that are blocking on the object. Specifically, the object's operations vector is a data structure that exports a subset of object functions required for kthreads sleeping on the lock. The generic structure is defined as follows:

/*
 * The following data structure is used to map
 * synchronization object type numbers to the
 * synchronization object's sleep queue number
 * or the synch. object's owner function.
 */
typedef struct _sobj_ops {
        int             sobj_type;
        kthread_t       *(*sobj_owner)();
        void            (*sobj_unsleep)(kthread_t *);
        void            (*sobj_change_pri)(kthread_t *, pri_t, pri_t *);
} sobj_ops_t;
                                                        See sys/sobject.h

The structure shown above provides for the object type declaration. For each synchronization object type, a type-specific structure is defined: mutex_sobj_ops for mutex locks, rw_sobj_ops for reader/writer locks, and sema_sobj_ops for semaphores. The structure also provides three functions that may be called on behalf of a kthread sleeping on a synchronization object:

An owner function, which returns the ID of the kernel thread that owns the object.
An unsleep function, which transitions a kernel thread from a sleep state.
A change_pri function, which changes the priority of a kernel thread, used for priority inheritance. (See Section 17.7.)
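As an illustration, the following sketch shows how a type-specific vector in the style of mutex_sobj_ops could be populated. It reuses the sobj_ops_t definition shown above; the function names and the assumption that a SOBJ_MUTEX type constant is used here are for illustration only, not a transcript of the kernel source.

/* Hypothetical sketch of a type-specific operations vector for mutexes. */
extern kthread_t *my_mutex_owner();                     /* returns the owning kthread */
extern void       my_mutex_unsleep(kthread_t *);        /* take a kthread off its sleep queue */
extern void       my_mutex_change_pri(kthread_t *, pri_t, pri_t *); /* priority inheritance hook */

static sobj_ops_t my_mutex_sobj_ops = {
        SOBJ_MUTEX,             /* sobj_type: identifies this object as a mutex */
        my_mutex_owner,         /* sobj_owner */
        my_mutex_unsleep,       /* sobj_unsleep */
        my_mutex_change_pri     /* sobj_change_pri */
};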

We will see how references to the lock's operations structure are implemented as we move through specifics on lock implementations in the following sections. It is useful to note at this point that our examination of Solaris kernel locks offers a good example of some of the design trade-offs involved in kernel software engineering. Building the various software components that make up the Solaris kernel is a series of design decisions, where performance needs are measured against complexity. In areas of the kernel where optimal performance is a top priority, simplicity might be sacrificed in favor of performance. The locking facilities in the Solaris kernel are an area where such trade-offs are made: much of the lock code is written in assembly language, for speed, rather than in the C language; the latter is easier to code with and maintain but is potentially slower. In some cases, when the code path is not performance critical, a simpler design will be favored over cryptic assembly code or complexity in the algorithms. The behavior of a particular design is examined through exhaustive testing, to ensure that the best possible design decisions were made.

17.5. Mutex Locks

Mutual exclusion, or mutex, locks are the most common type of synchronization primitive used in the kernel. Mutex locks serialize access to critical data: a kernel thread must acquire the mutex specific to the data region being protected before it can read or write the data. The thread is the lock owner while it is holding the lock, and the thread must release the lock when it has finished working in the protected region so other threads can acquire the lock for access to the protected data.
17.5.1. Overview

If a thread attempts to acquire a mutex lock that is being held, it can basically do one of two things: it can spin or it can block. Spinning means the thread enters a tight loop, attempting to acquire the lock in each pass through the loop. The term spin lock is often used to describe this type of mutex. Blocking means the thread is placed on a sleep queue while the lock is being held, and the kernel sends a wakeup to the thread when the lock is released.

There are pros and cons to both approaches. The spin approach has the benefit of not incurring the overhead of context switching, required when a thread is put to sleep, and also has the advantage of a relatively fast acquisition when the lock is released, since there is no context-switch operation. It has the downside of consuming CPU cycles while the thread is in the spin loop: the CPU is executing a kernel thread (the thread in the spin loop) but not really doing any useful work.

The blocking approach has the advantage of freeing the processor to execute other threads while the lock is being held; it has the disadvantage of requiring context switching to get the waiting thread off the processor and a new runnable thread onto the processor. There's also a little more lock acquisition latency, since a wakeup and context switch are required before the blocking thread can become the owner of the lock it was waiting for.

In addition to the issue of what to do if a requested lock is being held, the question of lock granularity needs to be resolved. Let's take a simple example. The kernel maintains a process table, which is a linked list of process structures, one for each of the processes running on the system. A simple table-level mutex could be implemented, such that if a thread needs to manipulate a process structure, it must first acquire the process table mutex. This level of locking is very coarse. It has the advantages of simplicity and minimal lock overhead. It has the obvious disadvantage of potentially poor scalability, since only one thread at a time can manipulate objects on the process table. Such a lock is likely to have a great deal of contention (become a hot lock).

The alternative is to implement a finer level of granularity: a lock per process table entry versus one table-level lock. With a lock on each process table entry, multiple threads can be manipulating different process structures at the same time, providing concurrency. The disadvantages are that such an implementation is more complex, increases the chances of deadlock situations, and necessitates more overhead because there are more locks to manage. In general, the Solaris kernel implements relatively fine-grained locking whenever possible, largely due to the dynamic nature of scaling locks with kernel structures as needed.

The kernel implements two types of mutex locks: spin locks and adaptive locks. Spin locks, as we discussed, spin in a tight loop if a desired lock is being held when a thread attempts to acquire the lock. Adaptive locks are the most common type of lock used and are designed to dynamically either spin or block when a lock is being held, depending on the state of the holder. We already discussed the trade-offs of spinning versus blocking. Implementing a locking scheme that only does one or the other can severely impact scalability and performance. It is much better to use an adaptive locking scheme, which is precisely what the Solaris kernel does.

The mechanics of adaptive locks are straightforward. When a thread attempts to acquire a lock and the lock is being held, the kernel examines the state of the thread that is holding the lock. If the lock holder (owner) is running on a processor, the thread attempting to get the lock will spin. If the thread holding the lock is not running, the thread attempting to get the lock will block. This policy works quite well because the code is such that mutex hold times are very short (by design, the goal is to minimize the amount of code to be executed while a lock is held). So, if a thread is holding a lock and running, the lock will likely be released very soon, probably in less time than it takes to context-switch off and on again, so it's worth spinning. On the other hand, if a lock holder is not running, then we know that minimally one context switch is involved before the holder will release the lock (getting the holder back on a processor to run), and it makes sense to simply block and free up the processor to do something else. The kernel will place the blocking thread on a turnstile (sleep queue) designed specifically for synchronization primitives and will wake the thread when the lock is released by the holder. (See Section 17.7.)
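The adaptive policy just described can be summarized in a short sketch. All of the helper functions and the lock layout below are hypothetical placeholders used to illustrate the decision, not Solaris kernel interfaces.

/* Simplified sketch of the adaptive-mutex policy: spin if the holder is on a
 * CPU, otherwise block on a turnstile. */
struct kthread;                                          /* opaque thread handle */

typedef struct adaptive_lock {
        struct kthread  *owner;                          /* NULL when the lock is free */
} adaptive_lock_t;

extern int  try_set_owner(adaptive_lock_t *, struct kthread *); /* atomic acquire attempt */
extern int  owner_is_running(struct kthread *);          /* is the holder on a CPU? */
extern void spin_until_free(adaptive_lock_t *);          /* busy-wait for release */
extern void block_on_turnstile(adaptive_lock_t *);       /* sleep on the lock's turnstile */

void
adaptive_acquire(adaptive_lock_t *lp, struct kthread *self)
{
        for (;;) {
                if (try_set_owner(lp, self))             /* simple case: lock was free */
                        return;
                if (owner_is_running(lp->owner))
                        spin_until_free(lp);             /* holder on CPU: release is imminent */
                else
                        block_on_turnstile(lp);          /* holder off CPU: give up the processor */
        }
}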

The other distinction between adaptive locks and spin locks has to do with interrupts, the dispatcher, and context switching. The kernel dispatcher is the code that selects threads for scheduling and does context switches. It runs at an elevated Priority Interrupt Level (PIL) to block interrupts (the dispatcher runs at priority level 11 on SPARC systems). High-level interrupts (interrupt levels 11-15 on SPARC systems) can interrupt the dispatcher. High-level interrupt handlers are not allowed to do anything that could require a context switch or to enter the dispatcher (we discuss this further in Section 3.1). Adaptive locks can block, and blocking means context switching, so only spin locks can be used in high-level interrupt handlers. Also, spin locks can raise the interrupt level of the processor when the lock is acquired.

struct kernel_data {
        kmutex_t        klock;
        char            *forw_ptr;
        char            *back_ptr;
        uint64_t        data1;
        uint64_t        data2;
} kdata;

void
function()
{
        ...
        mutex_init(&kdata.klock, NULL, MUTEX_DEFAULT, NULL);
        ...
        mutex_enter(&kdata.klock);
        kdata.data1 = 1;
        mutex_exit(&kdata.klock);
}

The preceding block of pseudo-code illustrates the general mechanics of mutex locks. A lock is declared in the code; in this case, it is embedded in the data structure that it is designed to protect. Once declared, the lock is initialized with the kernel mutex_init() function. Any subsequent reference to the kdata structure requires that the klock mutex be acquired with mutex_enter(). Once the work is done, the lock is released with mutex_exit(). The lock type, spin or adaptive, is determined in the mutex_init() code by the kernel. Assuming an adaptive mutex in this example, any kernel threads that make a mutex_enter() call on klock will either block or spin, depending on the state of the kernel thread that owns klock when the mutex_enter() is called.
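A small companion sketch: code that operates on kdata can document and check its locking assumption. mutex_owned(9F) returns nonzero when the calling thread holds the mutex and is conventionally used only inside ASSERT() on DEBUG kernels. The kdata_update() function is a hypothetical helper building on the example above.

void
kdata_update(uint64_t value)
{
        ASSERT(mutex_owned(&kdata.klock));      /* caller must already hold klock */
        kdata.data2 = value;                    /* safe: protected by klock */
}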

17.5.2. Solaris Mutex Lock Implementation

The kernel defines different data structures for the two types of mutex locks, adaptive and spin, as shown below.

/*
 * Public interface to mutual exclusion locks.  See mutex(9F) for details.
 *
 * The basic mutex type is MUTEX_ADAPTIVE, which is expected to be used
 * in almost all of the kernel.  MUTEX_SPIN provides interrupt blocking
 * and must be used in interrupt handlers above LOCK_LEVEL.  The iblock
 * cookie argument to mutex_init() encodes the interrupt level to block.
 * The iblock cookie must be NULL for adaptive locks.
 *
 * MUTEX_DEFAULT is the type usually specified (except in drivers) to
 * mutex_init().  It is identical to MUTEX_ADAPTIVE.
 *
 * MUTEX_DRIVER is always used by drivers.  mutex_init() converts this to
 * either MUTEX_ADAPTIVE or MUTEX_SPIN depending on the iblock cookie.
 *
 * Mutex statistics can be gathered on the fly, without rebooting or
 * recompiling the kernel, via the lockstat driver (lockstat(7D)).
 */
typedef enum {
        MUTEX_ADAPTIVE = 0,     /* spin if owner is running, otherwise block */
        MUTEX_SPIN = 1,         /* block interrupts and spin */
        MUTEX_DRIVER = 4,       /* driver (DDI) mutex */
        MUTEX_DEFAULT = 6       /* kernel default mutex */
} kmutex_type_t;

typedef struct mutex {
#ifdef _LP64
        void    *_opaque[1];
#else
        void    *_opaque[2];
#endif
} kmutex_t;
                                                        See sys/mutex.h

The 8-byte (64-bit) mutex object is used for each type of lock, as shown in Figure 17.5.
Figure 17.5. Solaris 10 Adaptive and Spin Mutex

In Figure 17.5, the m_owner field in the adaptive lock, which holds the address of the kernel thread that owns the lock (the kthread pointer), plays a double role, in that it also serves as the actual lock; successful lock acquisition for a thread means it has its kthread pointer set in the m_owner field of the target lock. If threads attempt to get the lock while it is held (waiters), the low-order bit (bit 0) of m_owner is set to reflect that case. Because kthread pointer values are always word aligned, they do not require bit 0, allowing this to work.

/*
 * mutex_enter() assumes that the mutex is adaptive and tries to grab the
 * lock by doing an atomic compare and exchange on the first word of the mutex.
 * If the compare and exchange fails, it means that either (1) the lock is a
 * spin lock, or (2) the lock is adaptive but already held.
 * mutex_vector_enter() distinguishes these cases by looking at the mutex
 * type, which is encoded in the low-order bits of the owner field.
 */
typedef union mutex_impl {
        /*
         * Adaptive mutex.
         */
        struct adaptive_mutex {
                uintptr_t _m_owner;     /* 0-3/0-7 owner and waiters bit */
#ifndef _LP64
                uintptr_t _m_filler;    /* 4-7 unused */
#endif
        } m_adaptive;

        /*
         * Spin Mutex.
         */
        struct spin_mutex {
                lock_t  m_dummylock;    /* 0 dummy lock (always set) */
                lock_t  m_spinlock;     /* 1 real lock */
                ushort_t m_filler;      /* 2-3 unused */
                ushort_t m_oldspl;      /* 4-5 old pil value */
                ushort_t m_minspl;      /* 6-7 min pil val if lock held */
        } m_spin;
} mutex_impl_t;
                                                        See sys/mutex_impl.h
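The m_owner encoding described above can be pictured with a few illustrative macros: the owner's kthread pointer occupies the word, and bit 0 doubles as the waiters flag. These macro names are hypothetical; the kernel's own definitions live in sys/mutex_impl.h.

#include <stdint.h>

#define MY_WAITERS_BIT          ((uintptr_t)0x1)

#define MY_OWNER(m_owner)       ((m_owner) & ~MY_WAITERS_BIT)  /* kthread pointer of the holder */
#define MY_HAS_WAITERS(m_owner) ((m_owner) & MY_WAITERS_BIT)   /* are threads blocked on the lock? */
#define MY_LOCK_FREE(m_owner)   ((m_owner) == 0)               /* no owner, no waiters */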

The spin mutex, as we pointed out earlier, is used at high interrupt levels, where context switching is not allowed. Spin locks block interrupts while in the spin loop, so the kernel needs to maintain the priority level the processor was running at before entering the spin loop, which raises the processor's priority level. (Elevating the priority level is how interrupts are blocked.) The m_minspl field stores the priority level of the interrupt handler when the lock is initialized, and m_oldspl is set to the priority level the processor was running at when the lock code is called. The m_spinlock field holds the actual mutex lock bits.

Each kernel module and subsystem implementing one or more mutex locks calls into a common set of mutex functions. All locks must first be initialized by the mutex_init() function, whereby the lock type is determined on the basis of an argument passed in the mutex_init() call. The most common type passed into mutex_init() is MUTEX_DEFAULT, which results in the init code determining what type of lock, adaptive or spin, should be used. It is possible for a caller of mutex_init() to be specific about a lock type (for example, MUTEX_SPIN). If the init code is called from a device driver or any kernel module that registers and generates interrupts, then an interrupt block cookie is added to the argument list. An interrupt block cookie is an abstraction used by device drivers when they set their interrupt vector and parameters. The mutex_init() code checks the argument list for an interrupt block cookie. If mutex_init() is being called from a device driver to initialize a mutex to be used in a high-level interrupt handler, the lock type is set to spin. Otherwise, an adaptive lock is initialized. The test is the interrupt level in the passed interrupt block cookie; levels above LOCK_LEVEL (10 on SPARC systems) are considered high-level interrupts and thus require spin locks. The init code clears most of the fields in the mutex lock structure as appropriate for the lock type. The m_dummylock field in spin locks is set to all 1's (0xFF). We'll see why in a minute.
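For reference, the driver initialization path described above follows the pattern documented in mutex_init(9F) and ddi_get_iblock_cookie(9F). In the sketch below the driver names (xx_attach, xx_high_mutex) are placeholders; with an iblock cookie from a high-level interrupt, mutex_init() sets up a spin mutex, otherwise MUTEX_DRIVER becomes an adaptive mutex.

#include <sys/types.h>
#include <sys/conf.h>
#include <sys/ddi.h>
#include <sys/sunddi.h>

static kmutex_t             xx_high_mutex;
static ddi_iblock_cookie_t  xx_iblock_cookie;

static int
xx_attach(dev_info_t *dip)
{
        /* Obtain the iblock cookie for interrupt number 0 of this device. */
        if (ddi_get_iblock_cookie(dip, 0, &xx_iblock_cookie) != DDI_SUCCESS)
                return (DDI_FAILURE);

        /* MUTEX_DRIVER plus an iblock cookie: the lock type is chosen by mutex_init(). */
        mutex_init(&xx_high_mutex, NULL, MUTEX_DRIVER,
            (void *)xx_iblock_cookie);

        return (DDI_SUCCESS);
}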

le"els a o"e L4CDFL5A5L (1@ on S#$%C systems* are considered high&le"el interrupts and thus re+uire spin locks. 'he init code clears most of the fields in the mutex lock structure as appropriate for the lock type. 'he mFdummylock field in spin locks is set to all 13s (@x--*. 9e3ll see why in a minute. 'he primary mutex functions called, aside from mutexFinit(* (which is only called once for each lock at initialization time*, are mutexFenter(* to get a lock and mutexFexit(* to release it. mutexFenter(* assumes an a"aila le, adapti"e lock. If the lock is held or is a spin lock, mutexF"ectorFenter(* is entered to reconcile what should happen. 'his is a performance optimization. mutexFenter(* is implemented in assem ly code, and ecause the entry point is designed for the simple case (adapti"e lock, not held*, the amount of code that gets executed to ac+uire a lock when those conditions are true is minimal. $lso, there are significantly more adapti"e mutex locks than spin locks in the kernel, making the +uick test case effecti"e most of the time. 'he test for a lock held or spin lock is "ery fast. 8ere is where the mFdummylock field comes into play, mutexFenter(* executes a compare&and&swap instruction on the first yte of the mutex, testing for a zero "alue. 4n a spin lock, the mFdummylock field is tested ecause of its positioning in the data structure and the endianness of S#$%C processors. Since mFdummylock is always set (it is set to all 13s in mutexFinit(**, the test will fail for spin locks. 'he test will also fail for a held adapti"e lock since such a lock will ha"e a nonzero "alue in the yte field eing tested. 'hat is, the mFowner field will ha"e a kthread pointer "alue for a held, adapti"e lock. If the lock is an adapti"e mutex and is not eing held, the caller of mutexFenter(* gets ownership of the lock. If the two conditions are not true, that is, either the lock is held or the lock is a spin lock, the code enters the mutexF"ectorFenter(* function to sort things out. 'he mutexF"ectorFenter(* code first tests the lock type. -or spin locks, the mFoldspl field is set, ased on the current #riority Interrupt Le"el (#IL* of the processor, and the lock is tested. If it3s not eing held, the lock is set (mFspinlock* and the code returns to the caller. $ held lock forces the caller into a spin loop, where a loop counter is incremented (for statistical purposes6 the lockstat(1)* data*, and the code checks whether the lock is still held in each pass through the loop. 4nce the lock is released, the code reaks out of the loop, gra s the lock, and returns to the caller. $dapti"e locks re+uire a little more work. 9hen the code enters the adapti"e code path (in mutexF"ectorFenter(**, it increments the

Adaptive locks require a little more work. When the code enters the adaptive code path (in mutex_vector_enter()), it increments the cpu_sysinfo.mutex_adenters (adaptive lock enters) field, as is reflected in the smtx column in mpstat(1M). mutex_vector_enter() then tests again to determine if the lock is owned (held), since the lock may have been released in the time interval between the call to mutex_enter() and the current point in the mutex_vector_enter() code. If the adaptive lock is not being held, mutex_vector_enter() attempts to acquire the lock. If successful, the code returns.

If the lock is held, mutex_vector_enter() determines whether or not the lock owner is running by looping through the CPU structures and testing the lock's m_owner against the cpu_thread field of the CPU structure. (cpu_thread contains the kernel thread address of the thread currently executing on the CPU.) A match indicates the holder is running, which means the adaptive lock will spin. No match means the owner is not running, in which case the caller must block. In the blocking case, the kernel turnstile code is entered to locate or acquire a turnstile, in preparation for placement of the kernel thread on a sleep queue associated with the turnstile.

The turnstile placement happens in two phases. After mutex_vector_enter() determines that the lock holder is not running, it makes a turnstile call to look up the turnstile, sets the waiters bit in the lock, and retests to see if the owner is running. If yes, the code releases the turnstile and enters the adaptive lock spin loop, which attempts to acquire the lock. Otherwise, the code places the kernel thread on a turnstile (sleep queue) and changes the thread's state to sleep. That effectively concludes the sequence of events in mutex_vector_enter().

Dropping out of mutex_vector_enter(), either the caller ended up with the lock it was attempting to acquire or the calling thread is on a turnstile sleep queue associated with the lock. In either case, the lockstat(1M) data is updated, reflecting the lock type, spin time, or sleep time as the last bit of work done in mutex_vector_enter(). lockstat(1M) is a kernel lock statistics command that was introduced in Solaris 2.6. It provides detailed information on kernel mutex and reader/writer locks.

The algorithm described in the previous paragraphs is summarized in the pseudocode below.

mutex_vector_enter()
        if (lock is a spin lock)
                lock_set_spl()          /* enter spin-lock-specific code path */
        increment cpu_sysinfo.mutex_adenters
spin_loop:
        if (lock is not owned)
                mutex_trylock()         /* try to acquire the lock */
                if (lock acquired)
                        goto bottom
                else
                        continue        /* lock being held */
        if (lock owner is running on a processor)
                goto spin_loop
        else
                lookup turnstile for the lock
                set waiters bit
                if (lock owner is running on a processor)
                        drop turnstile
                        goto spin_loop
                else
                        block           /* on the sleep queue associated with the turnstile */
bottom:
        update lockstat statistics
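The "is the owner running?" test in the pseudocode amounts to a scan of the per-processor structures. The sketch below is a simplified, hypothetical rendering of that walk; the cpu_t layout is reduced to the two fields mentioned in the text and is not the kernel's exact code.

struct kthread;

typedef struct cpu {
        struct kthread  *cpu_thread;    /* thread currently executing on this CPU */
        struct cpu      *cpu_next;      /* circular list of CPUs */
} cpu_t;

extern cpu_t *cpu_list;                 /* head of the CPU list */

static int
owner_is_running(struct kthread *owner)
{
        cpu_t *cp = cpu_list;

        do {
                if (cp->cpu_thread == owner)
                        return (1);     /* holder is on a processor: spin */
                cp = cp->cpu_next;
        } while (cp != cpu_list);

        return (0);                     /* holder is not running: block */
}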

When a thread has finished working in a lock-protected data area, it calls the mutex_exit() code to release the lock. The entry point is implemented in assembly language and handles the simple case of freeing an adaptive lock with no waiters. With no threads waiting for the lock, it's a simple matter of clearing the lock fields (m_owner) and returning. The C language function mutex_vector_exit() is entered from mutex_exit() for anything but the simple case.

In the case of a spin lock, the lock field is cleared and the processor is returned to the PIL level it was running at before entering the lock code. For adaptive locks, a waiter must be selected from the turnstile (if there is more than one waiter), have its state changed from sleeping to runnable, and be placed on a dispatch queue so it can execute and get the lock. If the thread releasing the lock was the beneficiary of priority inheritance, meaning that it had its priority improved when a calling thread with a better priority was not able to get the lock, then the thread releasing the lock will have its priority reset to what it was before the inheritance. Priority inheritance is discussed in Section 17.7.

9hen an adapti"e lock is released, the code clears the waiters it in mFowner and calls the turnstile function to wake up all the waiters. %eaders familiar with sleep0wakeup mechanisms of operating systems ha"e likely heard of a particular eha"ior known as the Nthundering herd pro lem,N a situation in which many threads that ha"e een locking for the same resource are all woken up at the same time and make a mad dash for the resource (a mutex in this case*like a herd of large, four&legged easts running toward the same o !ect. System eha"ior tends to go from a relati"ely small run +ueue to a large run +ueue (all the threads ha"e een woken up and made runna le* and high C#2 utilization until a thread gets the resource, at which point a unch of threads are sleeping again, the run +ueue normalizes, and C#2 utilization flattens out. 'his is a generic eha"ior that can occur on any operating system. 'he wakeup mechanism used when mutexF"ectorFexit(* is called may seem like an open in"itation to thundering herds, ut in practice it turns out not to e a pro lem. 'he main reason is that the locking case for threads waiting for a mutex is rare6 most of the time the threads will spin. If a locking situation does arise, it typically does not reach a point where "ery many threads are locked on the mutexone of the characteristics of the thundering herd pro lem is resource contention resulting in a lot of sleeping threads. 'he kernel code segments that implement mutex locks are, y design, short and fast, so locks are not held for long. Code that re+uires longer lock&hold times uses a reader0writer write lock, which pro"ides mutual exclusion semantics with a selecti"e wakeup algorithm. 'here are, of course, other reasons for choosing reader0writer locks o"er mutex locks, the most o "ious eing to allow multiple readers to see the protected data.
17.6. Reader/Writer Locks

Reader/writer (RW) locks provide mutual exclusion semantics on write locks. Only one thread at a time is allowed to own the write lock, but there is concurrent access for readers. These locks are designed for scenarios in which it is acceptable to have multiple threads reading the data at the same time, but only one writer. While a writer is holding the lock, no readers are allowed. Also, because of the wakeup mechanism, an RW lock is a better solution for kernel code segments that require relatively long hold times, as we will see shortly.

The basic mechanics of RW locks are similar to mutexes, in that RW locks have an initialization function (rw_init()), an entry function to acquire the lock (rw_enter()), and an exit function to release the lock (rw_exit()). The entry and exit points are optimized in assembly code to deal with the simple cases, and they call into C language functions if anything beyond the simplest case must be dealt with. As with mutex locks, the simple case is that the requested lock is available on an entry (acquire) call and no threads are waiting for the lock on the exit (release) call.
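For reference, the following sketch shows the usual calling pattern from a driver's point of view, per rwlock(9F). The table structure and function names here are placeholders for illustration.

#include <sys/types.h>
#include <sys/ksynch.h>

static krwlock_t  table_rwlock;
static int        table_entries;

void
table_setup(void)
{
        rw_init(&table_rwlock, NULL, RW_DRIVER, NULL);
}

int
table_lookup(void)
{
        int n;

        rw_enter(&table_rwlock, RW_READER);     /* shared: many readers may hold the lock */
        n = table_entries;
        rw_exit(&table_rwlock);
        return (n);
}

void
table_grow(void)
{
        rw_enter(&table_rwlock, RW_WRITER);     /* exclusive: a single writer */
        table_entries++;
        rw_exit(&table_rwlock);
}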
17.6.1. Solaris Reader/Writer Locks

Reader/writer locks are implemented as a single-word data structure in the kernel, either 32 bits or 64 bits wide, depending on the data model of the running kernel, as depicted in Figure 17.6.
Figure 17.6. Reader/Writer Lock

typedef struct rwlock_impl {
        uintptr_t       rw_wwwh;        /* waiters, write wanted, hold count */
} rwlock_impl_t;
#endif  /* _ASM */

#define RW_HAS_WAITERS          1
#define RW_WRITE_WANTED         2
#define RW_WRITE_LOCKED         4
#define RW_READ_LOCK            8
#define RW_WRITE_LOCK(thread)   ((uintptr_t)(thread) | RW_WRITE_LOCKED)
#define RW_HOLD_COUNT           (-RW_READ_LOCK)
#define RW_HOLD_COUNT_SHIFT     3       /* log2(RW_READ_LOCK) */
#define RW_READ_COUNT           RW_HOLD_COUNT
#define RW_OWNER                RW_HOLD_COUNT
#define RW_LOCKED               RW_HOLD_COUNT
#define RW_WRITE_CLAIMED        (RW_WRITE_LOCKED | RW_WRITE_WANTED)
#define RW_DOUBLE_LOCK          (RW_WRITE_LOCK(0) | RW_READ_LOCK)
                                                        See sys/rwlock.h

There are two states for the reader/writer lock, depending on whether the lock is held by a writer, as indicated by bit 2, wrlock. Bit 2, wrlock, is the actual write lock, and it determines the meaning of the high-order bits. If the write lock is held (bit 2 set), then the upper bits contain a pointer to the kernel thread holding the write lock. If bit 2 is clear, then the upper bits contain a count of the number of threads holding the lock as a read lock. The Solaris 10 RW lock defines bit 0, the wait bit, set to signify that threads are waiting for the lock. The wrwant bit (write wanted, bit 1) indicates that at least one thread is waiting for a write lock. The simple cases for lock acquisition through rw_enter() are the circumstances listed below:

The write lock is wanted and is available.
The read lock is wanted, the write lock is not held, and no threads are waiting for the write lock (wrwant is clear).
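Putting the bit definitions above together, the following sketch shows how the single rw_wwwh word could be interpreted. The helper functions are hypothetical; the literal masks and shift mirror the RW_WRITE_LOCKED, RW_OWNER, and RW_HOLD_COUNT_SHIFT definitions from sys/rwlock.h shown earlier.

#include <stdint.h>

int
rw_is_write_locked(uintptr_t wwwh)
{
        return ((wwwh & 0x4) != 0);             /* bit 2 (RW_WRITE_LOCKED): held by a writer */
}

void *
rw_write_owner(uintptr_t wwwh)
{
        /* Valid only when write-locked: upper bits are the owning kthread pointer. */
        return ((void *)(wwwh & ~(uintptr_t)0x7));      /* RW_OWNER mask */
}

uintptr_t
rw_reader_count(uintptr_t wwwh)
{
        /* When not write-locked: upper bits count the readers holding the lock. */
        return (wwwh >> 3);                     /* RW_HOLD_COUNT_SHIFT */
}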

The acquisition of the write lock results in bit 2 getting set and the kernel thread pointer getting loaded in the upper bits. For a reader, the hold count (upper bits) is incremented. Conditions where the write lock is being held, causing a lock request to fail, or where a thread is waiting for a write lock, causing a read lock request to fail, result in a call to the rw_enter_sleep() function.

Important to note is that the rw_enter() code sets a flag in the kernel thread used by the dispatcher code when establishing a kernel thread's priority before preemption or changing state to sleep. We cover this in more detail in the paragraph beginning "It is in the dispatcher queue insertion code" on page 262. Briefly, the kernel thread structure contains a t_kpri_req (kernel priority request) field that is checked in the dispatcher code when a thread is about to be preempted (forced off the processor on which it is executing because a higher-priority thread becomes runnable) or when the thread is about to have its state changed to sleep. If the t_kpri_req flag is set, the dispatcher assigns a kernel priority to the thread, such that when the thread resumes execution, it will run before threads in scheduling classes of lower priority (timeshare and interactive class threads). More succinctly, the priority of a thread holding a write lock is set to a better priority to minimize the hold time of the lock.

Getting back to the rw_enter() flow: if the code falls through the simple case, we need to set up the kernel thread requesting the RW lock to block.

1. rw_enter_sleep() establishes whether the calling thread is requesting a read or write lock and does another test to see if the lock is available. If it is, the caller gets the lock, the lockstat(1M) statistics are updated, and the code returns. If the lock is not available, then the turnstile code is called to look up a turnstile in preparation for putting the calling thread to sleep.

2. With a turnstile now available, another test is made on the lock availability. (On today's fast processors, and especially multiprocessor systems, it's quite possible that the thread holding the lock finished what it was doing and the lock became available.) Assuming the lock is still held, the thread is set to a sleep state and placed on a turnstile.

3. The RW lock structure will have the wait bit set for a reader waiting (forced to block because a writer has the lock) or the wrwant bit set to signify that a thread wanting the write lock is blocking.

4. The cpu_sysinfo structure for the processor maintains two counters for failures to get a read lock or write lock on the first pass: rw_rdfails and rw_wrfails. The appropriate counter is incremented just prior to the turnstile call; this action places the thread on a turnstile sleep queue. The mpstat(1M) command sums the counters and displays the fails-per-second in the srw column of its output.

The acquisition of an RW lock and the subsequent behavior if the lock is held are straightforward and similar in many ways to what happens in the mutex case. Things get interesting when a thread calls rw_exit() to release a lock it is holding: there are several potential solutions to the problem of determining which thread gets the lock next. For a mutex, a wakeup is issued on all threads that are sleeping, waiting for the lock, and we know from empirical data that this solution works well for reasons previously discussed. With RW locks, we're dealing with potentially longer hold times, which could result in more sleepers, a desire to give writers priority over readers (it's typically best to not have a reader read data that's about to be changed by a pending writer), and the potential for the priority inversion problem described in Section 17.7.

For rw_exit(), which is called by the lock holder when it is ready to release the lock, the simple case is that there are no waiters. In this case, the wrlock bit is cleared if the holder was a writer, or the hold count field is decremented to reflect one less reader. The more complex case of the system having waiters when the lock is released is dealt with in the following manner:

1. The kernel does a direct transfer of ownership of the lock to one or more of the threads waiting for the lock when the lock is released, either to the next writer or to a group of readers if more than one reader is blocking and no writers are blocking. This situation is very different from the case of the mutex implementation, for which the wakeup is issued and a thread must obtain lock ownership in the usual fashion. Here, a thread or threads wake up owning the lock they were blocking on. The algorithm used to figure out who gets the lock next addresses several requirements that provide for generally balanced system performance. The kernel needs to minimize the possibility of starvation (a thread never getting the resource it needs to continue executing) while allowing writers to take precedence whenever possible.

2. rw_exit_wakeup() retests for the simple case and drops the lock if there are no waiters (clear wrlock or decrement the hold count).

3. When waiters are present, the code grabs the turnstile (sleep queue) associated with the lock and saves the pointer to the kernel thread of the next write waiter that was on the turnstile's sleep queue (if one exists). The turnstile sleep queues are organized as a FIFO (first in, first out) queue, so the queue management (turnstile code) makes sure that the thread that was waiting the longest (the first in) is the thread that is selected as the next writer (first out). Thus, part of the fairness policy we want to enforce is covered.

The remaining bits of the algorithm go as follows:

4. If a writer is releasing the write lock and there are waiting readers and writers, readers of the same or higher priority than the highest-priority blocked writer are granted the read lock.

5. The readers are handed ownership and are then woken up by the turnstile_wakeup() kernel function.

These readers also inherit the priority of the writer that released the lock if the reader thread is of a lower priority (inheritance is done on a per-reader-thread basis when more than one thread is being woken up). Lock ownership handoff is a relatively simple operation. For read locks, there is no notion of a lock owner, so it's a matter of setting the hold count in the lock to reflect the number of readers coming off the turnstile, then issuing the wakeup of each reader.

6. An exiting reader always grants the lock to a waiting writer, even if there are higher-priority readers blocked.

7. It is possible for a reader freeing the lock to have waiting readers, although it may not be intuitive, given the multiple-reader design of the lock. If a reader is holding the lock and a writer comes along, the wrwant bit is set to signify that a writer is waiting for the lock. With wrwant set, subsequent readers cannot get the lock; we want the holding readers to finish so the writer can get the lock. Therefore, it is possible for a reader to execute rw_exit_wakeup() with waiting writers and readers.

The "let's favor writers but be fair to readers" policy described above was first implemented in Solaris 2.6.
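The handoff rules can be summarized in a few lines of code. The sketch below is a user-level illustration only: the waiter_t type, its fields, and choose_next_owner() are invented here and are not the kernel's rw_exit_wakeup() implementation.

#include <stdio.h>

typedef struct waiter {
        int     pri;            /* thread priority; higher is better */
        int     is_writer;      /* 1 = waiting for the write lock */
        struct waiter *next;    /* FIFO order: head has waited longest */
} waiter_t;

/* Decide whether the lock is handed to readers or to the next writer. */
static const char *
choose_next_owner(const waiter_t *q, int released_by_writer)
{
        const waiter_t *w;
        int best_writer_pri = -1;

        for (w = q; w != NULL; w = w->next)
                if (w->is_writer && w->pri > best_writer_pri)
                        best_writer_pri = w->pri;

        if (best_writer_pri == -1)
                return ("all waiting readers");  /* no writers are blocking */

        if (!released_by_writer)
                return ("next writer");          /* exiting reader favors writers */

        /*
         * An exiting writer grants the read lock to readers whose priority
         * is at least that of the best blocked writer; otherwise the
         * longest-waiting writer gets the lock.
         */
        for (w = q; w != NULL; w = w->next)
                if (!w->is_writer && w->pri >= best_writer_pri)
                        return ("readers at or above the best writer's priority");
        return ("next writer");
}

int
main(void)
{
        waiter_t reader = { 60, 0, NULL };
        waiter_t writer = { 59, 1, &reader };

        (void) printf("writer exits: %s\n", choose_next_owner(&writer, 1));
        (void) printf("reader exits: %s\n", choose_next_owner(&writer, 0));
        return (0);
}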
17.7. Turnstiles and Priority Inheritance

A turnstile is a data abstraction that encapsulates sleep queues and priority inheritance information associated with mutex locks and reader/writer locks. The mutex and RW lock code use a turnstile when a kernel thread needs to block on a requested lock. The sleep queues implemented for other resource waits do not provide an elegant method of dealing with the priority inversion problem through priority inheritance. Turnstiles were created to address that problem. Priority inversion describes a scenario in which a higher-priority thread is unable to run because a lower-priority thread is holding a resource it needs, such as a lock. The Solaris kernel addresses the priority inversion problem in its turnstile implementation, providing a priority inheritance mechanism, whereby the higher-priority thread can will its priority to the lower-priority thread holding the resource it requires. The beneficiary of the inheritance, the thread holding the resource, will now have a higher scheduling priority and thus gets scheduled to run sooner so it can finish its work and release the resource, at which point the original priority is returned to the thread.
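The same idea is exposed to user-level code through POSIX mutex protocol attributes, which can make the concept concrete. The fragment below is a minimal sketch (error handling trimmed, function name invented); it is not the kernel turnstile code, but a mutex created this way gives its holder the priority of the highest-priority blocked thread, just as described above.

#include <pthread.h>

/* Create a mutex whose waiters will their priority to the holder. */
static int
make_pi_mutex(pthread_mutex_t *mp)
{
        pthread_mutexattr_t attr;
        int error;

        (void) pthread_mutexattr_init(&attr);
        error = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
        if (error == 0)
                error = pthread_mutex_init(mp, &attr);
        (void) pthread_mutexattr_destroy(&attr);
        return (error);
}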

In this section, we assume you have some level of knowledge of kernel thread priorities, which are covered in Section 3.7. Because turnstiles and priority inheritance are an integral part of the implementation of mutex and RW locks, we thought it best to discuss them here rather than later. For this discussion, it is important to be aware of these points:

The Solaris kernel assigns a global priority to kernel threads, based on the scheduling class they belong to. Kernel threads in the timeshare and interactive scheduling classes will have their priorities adjusted over time, based on three things: the amount of time the threads spend running on a processor, sleep time (blocking), and the case when they are preempted. Threads in the real-time class are fixed priority; the priorities are never changed regardless of runtime or sleep time unless explicitly changed through programming interfaces or commands.

The Solaris kernel implements sleep queues for the placement of kernel threads blocking on (waiting for) a resource or event. For most resource waits, such as those for a disk or network I/O, sleep queues, in conjunction with condition variables, manage the systemwide queue of sleeping threads. These sleep queues are covered in Section 3.10. This set of sleep queues is separate and distinct from turnstile sleep queues.
17.7.1. Turnstiles Implementation

Figure 17.7 illustrates the Solaris 10 turnstiles. Turnstiles are maintained in a systemwide hash table, turnstile_table[], which is an array of turnstile_chain structures; each entry in the array (each turnstile_chain structure) is the beginning of a linked list of turnstiles. The array is indexed via a hash function on the address of the synchronization object (the mutex or reader/writer lock), so locks that hash to the same array location will have a turnstile on the same linked list. The turnstile_table[] array is statically initialized at boot time.

Figure 17.7. Turnstiles

typedef struct turnstile_chain {
        turnstile_t     *tc_first;      /* first turnstile on hash chain */
        disp_lock_t     tc_lock;        /* lock for this hash chain */
} turnstile_chain_t;

turnstile_chain_t       turnstile_table[2 * TURNSTILE_HASH_SIZE];
                                                        See common/os/turnstile.c

Each entry in the chain has its own lock, tc_lock, so chains can be traversed concurrently. The turnstile itself has a different lock; each chain has an active list (ts_next) and a free list (ts_free). There are also a count of threads waiting on the sync object (ts_waiters), a pointer to the synchronization object (ts_sobj), a thread pointer linking to a kernel thread that had a priority boost through priority inheritance (ts_inheritor), and the sleep queues. Each turnstile has two sleep queues, one for readers and one for writers (threads blocking on a read/write lock are maintained on separate sleep queues). The priority inheritance data is integrated into the turnstile.

#define TS_WRITER_Q     0       /* writer sleepq (exclusive access to sobj) */
#define TS_READER_Q     1       /* reader sleepq (shared access to sobj) */
#define TS_NUM_Q        2       /* number of sleep queues per turnstile */

typedef struct turnstile turnstile_t;
struct _sobj_ops;

struct turnstile {
        turnstile_t     *ts_next;       /* next on hash chain */
        turnstile_t     *ts_free;       /* next on freelist */
        void            *ts_sobj;       /* s-obj threads are blocking on */
        int             ts_waiters;     /* number of blocked threads */
        pri_t           ts_epri;        /* max priority of blocked threads */
        struct _kthread *ts_inheritor;  /* thread inheriting priority */
        turnstile_t     *ts_prioinv;    /* next in inheritor's t_prioinv list */
        sleepq_t        ts_sleepq[TS_NUM_Q];    /* read/write sleep queues */
};
                                                        See sys/turnstile.h

Every kernel thread is born with an attached turnstile. That is, when a kernel thread is created (by the kernel thread_create() routine), a turnstile is allocated for the kthread and linked to the kthread's t_ts pointer. A kthread can block on only one lock at a time, so one turnstile is sufficient. We know from the previous sections on mutex and RW locks that a turnstile is required if a thread needs to block on a synchronization object. It calls turnstile_lookup() to look up the turnstile for the synchronization object in turnstile_table[]. Since we index the array by hashing on the address of the lock, if a turnstile already exists (there are already waiters), then we get the correct turnstile. If no kthreads are currently waiting for the lock, turnstile_lookup() simply returns a null value. If the blocking code must be called (recall from the previous sections that subsequent tests are made on lock availability before it is determined that the kthread must block), then turnstile_block() is entered to place the kernel thread on a sleep queue associated with the turnstile for the lock.

Kernel threads lend their attached turnstile to the lock when a kthread becomes the first to block (the lock acquisition attempt fails, and there are no waiters). The thread's turnstile is added to the appropriate turnstile chain, based on the result of a hashing function on the address of the lock. The lock now has a turnstile, so subsequent threads that block on the same lock will donate their turnstiles to the free list on the chain (the ts_free link off the active turnstile). In turnstile_block(), the pointers are set up as determined by the return from turnstile_lookup(). If the turnstile pointer is null, we link up to the turnstile pointed to by the kernel thread's t_ts pointer. If the pointer returned from the lookup is not null, there's already at least one kthread waiting on the lock, so the code sets up the pointer links appropriately and places the kthread's turnstile on the free list.

The thread is then put into a sleep state through the scheduling-class-specific sleep routine (for example, ts_sleep()). The ts_waiters field in the turnstile is incremented, the thread's t_wchan is set to the address of the lock, and t_sobj_ops in the thread is set to the address of the lock's operations vectors: the owner, unsleep, and change_priority functions. The kernel sleepq_insert() function actually places the thread on the sleep queue associated with the turnstile. The code does the priority inversion check (now called out of the turnstile_block() code), builds the priority inversion links, and applies the necessary priority changes. The priority inheritance rules apply; that is, if the priority of the lock holder is less (worse) than the priority of the requesting thread, the requesting thread's priority is willed to the holder. The holder's t_epri field is set to the new priority, and the inheritor pointer in the turnstile is linked to the kernel thread. All the threads on the blocking chain are potential inheritors, based on their priority relative to the calling thread. At this point, the dispatcher is entered through a call to swtch(), and another kernel thread is removed from a dispatch queue and context-switched onto a processor.

The wakeup mechanics are initiated as previously described, where a call to the lock exit routine results in a turnstile_wakeup() call if threads are blocking on the lock. turnstile_wakeup() does essentially the reverse of turnstile_block(); threads that inherited a better priority have that priority waived, and the thread is removed from the sleep queue and given a turnstile from the chain's free list. Recall that a thread donated its turnstile to the free list if it was not the first thread placed on the blocking chain for the lock; coming off the turnstile, threads get a turnstile back. Once the thread is unlinked from the sleep queue, the scheduling class wakeup code is entered, and the thread is put back on a processor's dispatch queue.
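Both turnstile_lookup() and the initial donation of a thread's turnstile depend on hashing the lock's address into turnstile_table[]. The runnable fragment below illustrates only the idea; the table size, the shift amount, and the function name are assumptions made for this sketch and are not the constants used in common/os/turnstile.c.

#include <stdio.h>
#include <stdint.h>

#define TURNSTILE_HASH_SIZE     128     /* assumed; a power of two */

/* Map a synchronization object's address to a turnstile chain index. */
static unsigned int
turnstile_hash_index(const void *sobj)
{
        uintptr_t addr = (uintptr_t)sobj;

        /*
         * Discard the low bits, which are the same for all lock addresses
         * because of alignment, then mask the result into the table.
         */
        return ((unsigned int)((addr >> 4) & (TURNSTILE_HASH_SIZE - 1)));
}

int
main(void)
{
        int lock_a, lock_b;     /* stand-ins for kernel lock addresses */

        (void) printf("lock_a hashes to chain %u\n",
            turnstile_hash_index(&lock_a));
        (void) printf("lock_b hashes to chain %u\n",
            turnstile_hash_index(&lock_b));
        return (0);
}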
17.8. Kernel Semaphores

Semaphores provide a method of synchronizing access to a sharable resource by multiple processes or threads. A semaphore can be used as a binary lock for exclusive access or as a counter, allowing for concurrent access by multiple threads to a finite number of shared resources. In the counter implementation, the semaphore value is initialized to the number of shared resources (these semaphores are sometimes referred to as counting semaphores). Each time a process needs a resource, the semaphore value is decremented to indicate there is one less of the resource. When the process is finished with the resource, the semaphore value is incremented. A semaphore value of 0 tells the calling process that no resources are currently available, and the calling process blocks until another process finishes using the resource and frees it. These functions are historically referred to as semaphore P and V operations: the P operation attempts to acquire the semaphore, and the V operation releases it.

The Solaris kernel uses semaphores where appropriate, when the constraints for atomicity on lock acquisition are not as stringent as they are in the areas where mutex and RW locks are used. Also, the counting functionality that semaphores provide makes them a good fit for things like the allocation and deallocation of a fixed amount of a resource. The kernel semaphore structure maintains a sleep queue for the semaphore and a count field that reflects the value of the semaphore, as shown in Figure 17.8. The figure illustrates the layout of a kernel semaphore for all Solaris releases covered in this book.
Figure 17.8. Kernel Semaphore

Kernel functions for semaphores include an initialization routine (sema_init()), a destroy function (sema_destroy()), the traditional P and V operations (sema_p() and sema_v()), and a test function that checks whether the semaphore is held (sema_held()). There are a few other support functions, as well as some variations on the sema_p() function, which we discuss later.

The init function simply sets the count value in the semaphore, based on the value passed as an argument to the sema_init() routine. The s_slpq pointer is set to NULL, and the semaphore is initialized. The sema_destroy() function is used when the semaphore is an integral part of a resource that is dynamically created and destroyed as the resource gets used and subsequently released. For example, the bio (block I/O) subsystem in the kernel, which manages buf structures for page I/O support through the file system, uses semaphores on a per-buf-structure basis. Each buffer has two semaphores, which are initialized when a buffer is allocated by sema_init(). Once the I/O is completed and the buffer is released, sema_destroy() is called as part of the buffer release code. (sema_destroy() just nulls the s_slpq pointer.)

Kernel threads that must access a resource controlled by a semaphore call the sema_p() function, which requires that the semaphore count value be greater than 0 in order to return success. If the count is greater than 0, the count is decremented in the semaphore and the code returns to the caller. If the count is 0, the semaphore is not available and the calling thread must block: a sleep queue is located from the systemwide array of sleep queues, the thread state is changed to sleep, and the thread is placed on the sleep queue. Note that turnstiles are not used for semaphores; turnstiles are an implementation of sleep queues specifically for mutex and RW locks. Kernel threads blocked on anything other than mutexes and RW locks are placed on sleep queues. Sleep queues are discussed in more detail in Section 3.10. Briefly, though, sleep queues are organized as a linked list of kernel threads, and each linked list is rooted in an array referenced through a sleepq_head kernel pointer. Figure 17.9 illustrates how sleep queues are organized.
Figure 17.9. Sleep Queues

A hashing function indexes the sleepq_head array, hashing on the address of the object the threads are blocked on. Each array entry roots a singly linked list that establishes the beginning of doubly linked sublists of kthreads at the same priority, kept in ascending order based on priority. The sublist is implemented with a t_priforw (forward pointer) and t_priback (previous pointer) in the kernel thread. Also, a t_sleepq pointer points back to the array entry in sleepq_head, identifying which sleep queue the thread is on and providing a quick method to determine if a thread is on a sleep queue at all; if the thread's t_sleepq pointer is NULL, then the thread is not on a sleep queue.

Inside the sema_p() function, if we have a semaphore count value of 0, the semaphore is not available and the calling kernel thread needs to be placed on a sleep queue. A sleep queue is located through a hash function into the sleepq_head array, which hashes on the address of the object the thread is blocking on, in this case the address of the semaphore. The code also grabs the sleep queue lock, sq_lock (see Figure 17.9), to block any further inserts or removals from the sleep queue until the insertion of the current kernel thread has been completed (that's what locks are for!). The scheduling-class-specific sleep function is called to set the thread wakeup priority and to change the thread state from ONPROC (running on a processor) to SLEEP. The kernel thread's t_wchan (wait channel) pointer is set to the address of the semaphore it's blocking on, and the thread's t_sobj_ops pointer is set to reference the sema_sobj_ops structure. The thread is now in a sleep state on a sleep queue.

A semaphore is released by the sema_v() function, which has the exact opposite effect of sema_p() and behaves very much like the lock release functions we've examined up to this point. The semaphore value is incremented, and if any threads are sleeping on the semaphore, the one that has been sitting on the sleep queue longest will be woken up. Semaphore wakeups always involve waking one waiter at a time. Semaphores are used in relatively few areas of the operating system: the buffer I/O (bio) module, the dynamically loadable kernel module code, and a couple of device drivers.
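For reference, the kernel semaphore DDI routines mentioned above are used along the following lines in driver code. sema_init(9F), sema_p(9F), sema_v(9F), and sema_destroy(9F) are the documented interfaces; the command-slot pool, its size, and the function names are hypothetical, invented for this sketch.

#include <sys/types.h>
#include <sys/ksynch.h>

#define MYDEV_NCMDSLOTS 4       /* hypothetical: four command slots */

/* Counting semaphore guarding a fixed pool of command slots. */
static ksema_t  mydev_cmd_sema;

static void
mydev_pool_init(void)
{
        /* The initial count equals the number of available resources. */
        sema_init(&mydev_cmd_sema, MYDEV_NCMDSLOTS, NULL, SEMA_DRIVER, NULL);
}

static void
mydev_issue_cmd(void)
{
        sema_p(&mydev_cmd_sema);        /* P: claim a slot, block if none left */
        /* ... build and issue the command using the claimed slot ... */
        sema_v(&mydev_cmd_sema);        /* V: return the slot, wake one waiter */
}

static void
mydev_pool_fini(void)
{
        sema_destroy(&mydev_cmd_sema);
}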
17.9. DTrace Lockstat Provider

The lockstat provider makes available probes that can be used to discern lock contention statistics or to understand virtually any aspect of locking behavior.

The lockstat(1M) command is actually a DTrace consumer that uses the lockstat provider to gather its raw data.
17.9.1. Overview

The lockstat provider makes available two kinds of probes: contention-event probes and hold-event probes. Contention-event probes correspond to contention on a synchronization primitive; they fire when a thread is forced to wait for a resource to become available. Solaris is generally optimized for the noncontention case, so prolonged contention is not expected. These probes should be used to understand those cases where contention does arise. Because contention is relatively rare, enabling contention-event probes generally doesn't substantially affect performance.

Hold-event probes correspond to acquiring, releasing, or otherwise manipulating a synchronization primitive. These probes can be used to answer arbitrary questions about the way synchronization primitives are manipulated. Because Solaris acquires and releases synchronization primitives very often (on the order of millions of times per second per CPU on a busy system), enabling hold-event probes has a much higher probe effect than does enabling contention-event probes. While the probe effect induced by enabling them can be substantial, it is not pathological; they may still be enabled with confidence on production systems.

The lockstat provider makes available probes that correspond to the different synchronization primitives in Solaris; these primitives and the probes that correspond to them are discussed in the remainder of this chapter.
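As an example of a contention-event enabling, the short D script below (the aggregation name is arbitrary) totals the time threads spend blocked on adaptive mutexes, keyed by the kernel stack that blocked; the adaptive-block probe and its arg1 sleep time are described in Section 17.9.2.

/* Sum adaptive-mutex sleep time (arg1, in nanoseconds) by kernel stack. */
lockstat:::adaptive-block
{
        @sleeps[stack()] = sum(arg1);
}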
17.9.2. Adaptive Lock Probes

Adaptive locks enforce mutual exclusion to a critical section and can be acquired in most contexts in the kernel. Because adaptive locks have few context restrictions, they comprise the vast majority of synchronization primitives in the Solaris kernel. These locks are adaptive in their behavior with respect to contention. When a thread attempts to acquire a held adaptive lock, it will determine if the owning thread is currently running on a CPU. If the owner is running on another CPU, the acquiring thread will spin. If the owner is not running, the acquiring thread will block.

The four lockstat probes pertaining to adaptive locks are in Table 17.2. For each probe, arg0 contains a pointer to the kmutex_t structure that represents the adaptive lock.
Table 17.2. Adaptive Lock Probes

adaptive-acquire: Hold-event probe that fires immediately after an adaptive lock is acquired.

adaptive-block: Contention-event probe that fires after a thread that has blocked on a held adaptive mutex has reawakened and has acquired the mutex. If both probes are enabled, adaptive-block fires before adaptive-acquire. At most one of adaptive-block and adaptive-spin fires for a single lock acquisition. arg1 for adaptive-block contains the sleep time in nanoseconds.

adaptive-spin: Contention-event probe that fires after a thread that has spun on a held adaptive mutex has successfully acquired the mutex. If both are enabled, adaptive-spin fires before adaptive-acquire. At most one of adaptive-spin and adaptive-block fires for a single lock acquisition. arg1 for adaptive-spin contains the spin count: the number of iterations that were taken through the spin loop before the lock was acquired. The spin count has little meaning on its own but can be used to compare spin times.

adaptive-release: Hold-event probe that fires immediately after an adaptive lock is released.
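A usage sketch for these probes (the aggregation layout is a choice, not a requirement) captures the distribution of spin counts per application, using arg1 as described in Table 17.2:

/* Distribution of adaptive-mutex spin counts (arg1), keyed by program. */
lockstat:::adaptive-spin
{
        @spins[execname] = quantize(arg1);
}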

17.9.3. Spin Lock Probes

Threads cannot block in some contexts in the kernel, such as high-level interrupt context and any context manipulating dispatcher state. In these contexts, this restriction prevents the use of adaptive locks. Spin locks are instead used to effect mutual exclusion to critical sections in these contexts. As the name implies, the behavior of these locks in the presence of contention is to spin until the lock is released by the owning thread. The three probes pertaining to spin locks are in Table 17.3.
Table 17.3. Spin Lock Probes

spin-acquire: Hold-event probe that fires immediately after a spin lock is acquired.

spin-spin: Contention-event probe that fires after a thread that has spun on a held spin lock has successfully acquired the spin lock. If both are enabled, spin-spin fires before spin-acquire. arg1 for spin-spin contains the spin count: the number of iterations that were taken through the spin loop before the lock was acquired. The spin count has little meaning on its own but can be used to compare spin times.

spin-release: Hold-event probe that fires immediately after a spin lock is released.
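A sketch along the same lines as the adaptive example (keyed here by the kernel stack doing the spinning; the layout is arbitrary) shows where spin-lock contention is occurring and how long the spins are:

/* Distribution of spin-lock spin counts (arg1), keyed by kernel stack. */
lockstat:::spin-spin
{
        @spins[stack()] = quantize(arg1);
}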

Adaptive locks are much more common than spin locks. The following script displays totals for both lock types to provide data to support this observation.

lockstat:::adaptive-acquire
/execname == "date"/
{
        @locks["adaptive"] = count();
}

lockstat:::spin-acquire
/execname == "date"/
{
        @locks["spin"] = count();
}

Run this script in one window, and run a date(1) command in another. When you terminate the DTrace script, you will see output similar to the following example.

# dtrace -s ./whatlock.d
dtrace: script './whatlock.d' matched 5 probes
^C
  spin                                                             26
  adaptive                                                       2981

As this output indicates, over 99 percent of the locks acquired in running the date command are adaptive locks. It may be surprising that so many locks are acquired in doing something as simple as running date. The large number of locks is a natural artifact of the fine-grained locking required of an extremely scalable system like the Solaris kernel.
17.9.4. Thread Locks

A thread lock is a special kind of spin lock that locks a thread for purposes of changing thread state. Thread lock hold events are available as spin lock hold-event probes (that is, spin-acquire and spin-release), but contention events have their own probe specific to thread locks. The thread lock contention-event probe is described in Table 17.4.
Table 17.4. Thread Lock Probes

thread-spin: Contention-event probe that fires after a thread has spun on a thread lock. Like other contention-event probes, if both the contention-event probe and the hold-event probe are enabled, thread-spin fires before spin-acquire. Unlike other contention-event probes, however, thread-spin fires before the lock is actually acquired. As a result, multiple thread-spin probe firings may correspond to a single spin-acquire probe firing.
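Because thread-spin fires before the lock is acquired, a simple count of firings (a sketch; no probe arguments are assumed here) is usually the most useful view, for example keyed by the process whose threads are contending:

/* Count thread-lock contention events by process name. */
lockstat:::thread-spin
{
        @spins[execname] = count();
}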

17.9.5. Readers/Writer Lock Probes

Readers/writer locks enforce a policy of allowing multiple readers or a single writer, but not both, to be in a critical section. These locks are typically used for structures that are searched more frequently than they are modified and for which there is substantial time in the critical section. If critical section times are short, readers/writer locks will implicitly serialize over the shared memory used to implement the lock, giving them no advantage over adaptive locks. See rwlock(9F) for more details on readers/writer locks. The probes pertaining to readers/writer locks are in Table 17.5. For each probe, arg0 contains a pointer to the krwlock_t structure that represents the lock.
Table 17.5. Readers/Writer Lock Probes

rw-acquire: Hold-event probe that fires immediately after a readers/writer lock is acquired. arg1 contains the constant RW_READER if the lock was acquired as a reader, and RW_WRITER if the lock was acquired as a writer.

rw-block: Contention-event probe that fires after a thread that has blocked on a held readers/writer lock has reawakened and has acquired the lock. arg1 contains the length of time (in nanoseconds) that the current thread had to sleep to acquire the lock. arg2 contains the constant RW_READER if the lock was acquired as a reader, and RW_WRITER if the lock was acquired as a writer. arg3 and arg4 contain more information on the reason for blocking: arg3 is nonzero if and only if the lock was held as a writer when the current thread blocked, and arg4 contains the readers count when the current thread blocked. If both the rw-block and rw-acquire probes are enabled, rw-block fires before rw-acquire.

rw-upgrade: Hold-event probe that fires after a thread has successfully upgraded a readers/writer lock from a reader to a writer. Upgrades do not have an associated contention event because they are only possible through a nonblocking interface, rw_tryupgrade(9F).

rw-downgrade: Hold-event probe that fires after a thread has downgraded its ownership of a readers/writer lock from writer to reader. Downgrades do not have an associated contention event because they always succeed without contention.

rw-release: Hold-event probe that fires immediately after a readers/writer lock is released. arg1 contains the constant RW_READER if the released lock was held as a reader, and RW_WRITER if the released lock was held as a writer. Due to upgrades and downgrades, the lock may not have been released as it was acquired.
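To see where readers/writer lock contention occurs, a short script (illustrative only; the aggregation name is arbitrary) can plot the distribution of blocking times, using arg1 and arg3 as described in Table 17.5:

/*
 * Distribution of time (in nanoseconds) spent blocked on readers/writer
 * locks, split by whether the lock was write-held (arg3 != 0) when the
 * current thread blocked.
 */
lockstat:::rw-block
{
        @times[arg3 ? "write-held" : "read-held"] = quantize(arg1);
}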
