1. Intel Atom Architecture
   a. Dual Die Processors
      i. Hyperthreading architecture
         1. Shared instruction cache
         2. Shared data cache
         3. Parallel decode and issue of instructions
         4. Parallel register file
         5. Parallel integer and floating point execution units
   b. Pipeline
      i. 16 stages
   c. Two cache levels
      i. L1 compact cache
         1. 32KB I-Cache
         2. 24KB D-Cache
      ii. 512KB L2 cache per core
         1. Dual core architecture provides 1MB total
         2. Each L2 cache is shared by both threads
      iii. Includes prefetch units that detect stride lengths and optimize prefetch operations
2. Synchronization challenges
   a. Many examples in the kernel and applications of independent processes and threads
   b. However, many of the most important applications introduce constraints of data dependence
   c. The most important applications (for example, database systems) exhibit severe reduction in throughput as thread count rises above several threads per processor
3. Synchronization requirements
   a. Lock time resolution
      i. Lock acquisition and release may require high time resolution
      ii. Lock testing may require high time resolution
         1. Encouraging a polling method
      iii. Reduced time resolution requirements must be exploited
         1. An expected wait upon lock detection greater than one clock tick permits sleeping
         2. An expected wait less than one clock period requires a busy wait
   b. Lock footprint
      i. Lock fetch and decode cost
      ii. Lock cache footprint cost
   c. Lock optimization methods
      i. Locking for readers (consumers) and writers (suppliers)
         1. Read lock
         2. Write lock
         3. Read and write lock
Embedded Platform and Embedded Operating System Introduction EE202C Lecture 12

4. Architecture challenges
   a. Managing thread creation and removal
   d. Lock integration with computational load systems
      i. Management of load bursts due to lock release
   e. Energy efficient lock processor resources
struct kthread_create_info {
    int (*threadfn)(void *data);
    void *data;
    struct completion started;
    struct task_struct *result;
    struct completion done;
};

 o Stop information data structure (done state written by keventd)
struct kthread_stop_info {
    struct task_struct *k;
    int err;
    struct completion done;
};
struct task_struct *kthread_create(int (*threadfn)(void *data),
                                   void *data,
                                   const char namefmt[], ...)
/* Note: ... indicates a list of variable arguments referenced
 * by the name format (namefmt) argument. These may include, for
 * example, the current pid value. These arguments will be displayed
 * in process status information */
{
    struct kthread_create_info create;
    DECLARE_WORK(work, keventd_create_kthread, &create);

    create.threadfn = threadfn;        /* function argument */
    create.data = data;                /* function data */
    init_completion(&create.started);
    init_completion(&create.done);

    /*
     * Start the workqueue system below
     */
    if (!helper_wq)
        work.func(work.data);
    else {
        queue_work(helper_wq, &work);
        wait_for_completion(&create.done);
    }

    if (!IS_ERR(create.result)) {
        /* the following code prepares the process table string */
        va_list args;
        va_start(args, namefmt);
        vsnprintf(create.result->comm, sizeof(create.result->comm),
                  namefmt, args);
        va_end(args);
    }
    return create.result;
}
EXPORT_SYMBOL(kthread_create);
Kernel thread binding to CPU
 o Called after creation and before wakeup
void kthread_bind(struct task_struct *k, unsigned int cpu)
{
    wait_task_inactive(k);    /* wait for task to be unscheduled */
    set_task_cpu(k, cpu);
    k->cpus_allowed = cpumask_of_cpu(cpu);
}
EXPORT_SYMBOL(kthread_bind);
static inline void set_task_cpu(struct task_struct *p, unsigned int cpu)
{
    task_thread_info(p)->cpu = cpu;
}
int fastcall wake_up_process(task_t *p)
{
    /* places a stopped or sleeping task on the run queue */
    return try_to_wake_up(p, TASK_STOPPED | TASK_TRACED |
                             TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE, 0);
}
EXPORT_SYMBOL(wake_up_process);
A kernel thread checks for stop status that may be applied by another thread
 o The kernel thread calls kthread_should_stop()
A kernel thread may apply stop state for another thread
The implementation of kthread_stop_sem is important
 o The mutex_lock ensures that only one CPU may apply the stop condition
 o Also, the thread must receive a signal (as required) in order to initiate its completion
int kthread_stop_sem(struct task_struct *k, struct semaphore *s)
{
    int ret;

    mutex_lock(&kthread_stop_lock);
    get_task_struct(k);
    init_completion(&kthread_stop_info.done);
    smp_wmb();
    kthread_stop_info.k = k;     /* sets kthread pointer indicating stop */
    if (s)
        up(s);                   /* release the semaphore */
    else
        wake_up_process(k);      /* start thread to enable completion */
    put_task_struct(k);          /* atomic decrement of task usage */
    wait_for_completion(&kthread_stop_info.done);
    kthread_stop_info.k = NULL;
    ret = kthread_stop_info.err;
    mutex_unlock(&kthread_stop_lock);
    return ret;
}
EXPORT_SYMBOL(kthread_stop_sem);
kthread_mod_uncoord.c: demonstration of multiple kernel thread creation and binding on a multicore system
/* array of pointers to thread task structures */
#define MAX_CPU 16
#define LOOP_MAX 10
#define BASE_PERIOD 200
#define INCREMENTAL_PERIOD 330
#define WAKE_UP_DELAY 0
static struct task_struct *kthread_cycle_[MAX_CPU];
static int kthread_cycle_state = 0;
static int num_threads;
static int cycle_count = 0;

static int cycle(void *thread_data)
{
    int delay, residual_delay;
    int this_cpu;
    int loops;

    delay = BASE_PERIOD;
    for (loops = 0; loops < LOOP_MAX; loops++) {
        this_cpu = get_cpu();
        delay = delay + this_cpu*INCREMENTAL_PERIOD;
        printk("kthread_mod: no lock pid %i cpu %i delay %i count %i \n",
               current->pid, this_cpu, delay, cycle_count);
        cycle_count++;
        set_current_state(TASK_UNINTERRUPTIBLE);
    while (!kthread_should_stop()) {
        delay = 1 * HZ;
        set_current_state(TASK_UNINTERRUPTIBLE);
        residual_delay = schedule_timeout(delay);   /* prepare to yield */
        printk("kthread_mod: wait for stop pid %i cpu %i \n",
               current->pid, this_cpu);
    }
    printk("kthread_mod: cycle function: stop state detected for cpu %i\n",
           this_cpu);
    return 0;
}

int init_module(void)
{
    int cpu = 0;
    int count;
    int this_cpu;
    int num_cpu;
    int delay_val;
    int *kthread_arg = 0;
    int residual_delay;
    const char thread_name[] = "cycle_th";
    const char name_format[] = "%s/%d";   /* format name and cpu id */

    num_threads = 0;
    num_cpu = num_online_cpus();
    printk("kthread_mod: number of operating processors: %i\n", num_cpu);
    this_cpu = get_cpu();
    printk("thread_mod: kthread_mod init: current task is %i on cpu %i \n",
           current->pid, this_cpu);
    for (count = 0; count < num_cpu; count++) {
        cpu = count;
        num_threads++;
Start up
[60937.707450] kthread_mod: number of operating processors: 4
[60937.707486] thread_mod: kthread_mod init: current task is 16243 on cpu 3
[60937.709822] kthread_mod: execution after wake_up_process, current task 16243 on cpu 3
[60937.709841] kthread_mod: no lock pid 16244 cpu 0 delay 200 count 0
[60937.709919] kthread_mod: current task is 16243 on cpu 3 creating and waking next thread after delay of 1s
[60937.713666] kthread_mod: execution after wake_up_process, current task 16243 on cpu 3
[60937.713678] kthread_mod: no lock pid 16245 cpu 1 delay 530 count 1
[60937.713738] kthread_mod: current task is 16243 on cpu 3 creating and waking next thread after delay of 1s
[60937.717799] kthread_mod: execution after wake_up_process, current task 16243 on cpu 3
[60937.717815] kthread_mod: no lock pid 16246 cpu 2 delay 860 count 2
[60937.717893] kthread_mod: current task is 16243 on cpu 3 creating and waking next thread after delay of 1s
[60937.721661] kthread_mod: execution after wake_up_process, current task 16243 on cpu 3
[60937.721712] kthread_mod: current task is 16243 on cpu 3 creating and waking next thread after delay of 1s
[60937.721950] kthread_mod: no lock pid 16247 cpu 3 delay 1190 count 3
[60938.508041] kthread_mod: no lock pid 16244 cpu 0 delay 200 count 3
[60938.508084] kthread_mod: no lock pid 16244 cpu 0 delay 200 count 3
[60939.308039] kthread_mod: no lock pid 16244 cpu 0 delay 200 count 3
[60939.832050] kthread_mod: no lock pid 16245 cpu 1 delay 530 count 2
[60939.832086] kthread_mod: no lock pid 16245 cpu 1 delay 860 count 2
[60940.308037] kthread_mod: wait for stop pid 16244 cpu 0
[60941.156046] kthread_mod: no lock pid 16246 cpu 2 delay 860 count 2
[60941.156082] kthread_mod: no lock pid 16246 cpu 2 delay 1520 count 2
[60941.308064] kthread_mod: wait for stop pid 16244 cpu 0
[60942.308038] kthread_mod: wait for stop pid 16244 cpu 0
[60942.480041] kthread_mod: no lock pid 16247 cpu 3 delay 1190 count 2
[60942.480074] kthread_mod: no lock pid 16247 cpu 3 delay 2180 count 2
[60943.272042] kthread_mod: no lock pid 16245 cpu 1 delay 860 count 2
Computing load: top -d 0.1 (upon start of top execution, enter 1 for per-CPU display)

Cpu0 : 0.0%us, 9.1%sy, 0.0%ni, 90.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Note that one task presents a task load (on Cpu0) while all others are sleeping (note the presence of the 100.0% idle state on the remaining CPUs).
Synchronization between tasks is critical to multiprocess and multiprocessor systems
 o A requirement exists for ensuring that operations occur only according to design constraints and are not subject to race conditions
Synchronized access to a resource may be managed by controlling access to the code segment using a variable that may implement a lock
 o Enables a form of communication between processes
Hierarchy
 o Spinlocks
    Fast acquisition and release
    Resource intensive for extended lock times
 o RW Spinlocks
    High efficiency lock favoring readers
    Often read but rarely written
    Multiple readers, one writer
 o Kernel Semaphore
    Complex implementation
    Increased latency
    Efficient for long delays
 o Seqlock
    High efficiency lock that favors writers
Implementation
 o Relies on processor hardware
 o Important optimizations for performance
 o Recent advances in optimization for energy
Processor architectures may enable the implementation of atomic operations
 o Atomic operations complete without interruption of the sequence of control under all circumstances
 o Examples include the increment of a memory register
 o For this to be atomic, the fetch, decode, fetch of operand from memory, increment, and write back must occur contiguously
 o The arrival of an interrupt must not induce an interruption of this sequence of control
A class of Intel IA-32 instructions are always atomic. These include:
 o Byte length reads or writes from memory
 o 32 bit aligned reads or writes from memory of 32 or 64 bit words
 o 64 bit aligned reads or writes from memory of 128 bit words
 o Reading or writing to cache
    The cache is accessible as cache lines of 32 bytes
    An unaligned read or write falling within this limit will be atomic
Interrupts
 o The data bus is locked after an interrupt, only allowing a selected APIC to write
Other operations are not atomic
 o Thus other methods must apply
 o The IA-32 system allows for software-selectable bus locking
 o Here, one processor may acquire and lock the address/data bus, preventing any other processor (or device) from accessing memory
Assert LOCK prefix
 o Instructions are listed with the identifier lock prepending the instruction
    The assembler will ensure that the object code includes a bus lock operation during execution
    Adds the opcode prefix byte 0xF0 to the instruction
    Only one processor may access memory during the lock
#ifdef CONFIG_SMP
#define LOCK "lock ; "
#else
#define LOCK ""
#endif
Defined as static inline
 o Standalone object code for the functions may be created by the compiler, if required
static __inline__ void atomic_inc(atomic_t *v)
{
    __asm__ __volatile__(
        LOCK "incl %0"
        :"=m" (v->counter)
        :"m" (v->counter));
}
Atomic add of i to atomic type v
 o The "ir" constraint indicates that either an immediate value or a register may be assigned by the compiler to the integer i
static __inline__ void atomic_add(int i, atomic_t *v)
{
    __asm__ __volatile__(
        LOCK "addl %1,%0"
        :"=m" (v->counter)
        :"ir" (i), "m" (v->counter));
}
static __inline__ void atomic_sub(int i, atomic_t *v)
{
    __asm__ __volatile__(
        LOCK "subl %1,%0"
        :"=m" (v->counter)
        :"ir" (i), "m" (v->counter));   /* "ir": immediate or register */
}
Since the variable in question is actually a data structure member, it must be accessed via a read operation:

#define atomic_read(v)    ((v)->counter)

 o atomic_set must be used to write
 o This sets the value of v to that of the integer i

#define atomic_set(v,i)   (((v)->counter) = (i))
ARM atomic instructions
 o Read operations are inherently atomic
 o Write operations must be protected
Here is an example of atomic_set
 o This sets v equal to i
 o Its functionality as a kernel library function is the same as its i386 counterpart
 o However, due to processor differences between IA-32 and ARM, its underlying implementation is quite different
The instruction Load Exclusive, LDREX R1, [R2], is implemented on ARM
 o This loads R1 with the contents of the memory register addressed by the contents of R2
 o Then, this initializes a monitor
    The monitor observes any write action on the address-data bus that may occur on the 32b memory block pointed to by the contents of R2
    This write action may occur due to the operation of another CPU that shares the memory space and address-data bus
    The occurrence of a write action can then be detected subsequently by STREX
The instruction Store Exclusive, STREX R1, R2, [R3]
 o Stores R2 into the memory register addressed by R3
 o If the write is successful, by the definition that the previously initialized monitor shows no prior writes, then the data pointed to by [R3] has been written atomically
 o A successful write is returned as a zero value in R1
If a failure is detected, this function continues to attempt to initialize a monitor, store, and verify. An example is atomic_set() for ARM
 o First read the value of v->counter (sets the monitor)
    This is a guard instruction
    The value of counter is not needed
 o Then start the store operation with STREX
 o Check the monitor
 o Loop until success is detected
Note this sequence will be inserted inline in code, not called as a function
 o By declaring it static inline, this code can be included in any kernel function
static inline void atomic_set(atomic_t *v, int i)
{
    unsigned long tmp;

    __asm__ __volatile__("@ atomic_set\n"
"1:     ldrex   %0, [%1]\n"      /* tmp register receives the contents at
                                  * the memory address containing counter;
                                  * this initializes the monitor */
"       strex   %0, %2, [%1]\n"  /* i (third on arg list) is stored into
                                  * the mem register at address of v->counter */
"       teq     %0, #0\n"        /* test if reg 0 is cleared */
"       bne     1b"              /* if not success, branch back to label 1 */
    : "=&r" (tmp)                /* & requires separation of output and
                                  * input register choices by the compiler */
    : "r" (&v->counter), "r" (i)
    : "cc");                     /* sequence may have modified cpsr;
                                  * compiler must ensure cpsr is protected */
}
The volatile keyword suppresses code optimizations that could remove or reorder these operations when the functions are inlined
 o Example: consider the sequence
volatile unsigned long *output_port = memory_mapped_interface_address;

*output_port = CONTROL_WORD_1;   /* set high */
*output_port = CONTROL_WORD_2;   /* set low */
static inline void clear_bit(int nr, volatile unsigned long *addr)
{
    __asm__ __volatile__( LOCK_PREFIX
        "btrl %1,%0"     /* Bit Test and Reset Long:
                          * %1 is the bit offset, nr;
                          * the bit of %0 indicated by nr is cleared */
        :"=m" (ADDR)     /* output operand */
        :"Ir" (nr));     /* I identifies a constant in the range 0 to 31 */
}
 o Change a bit in memory
static inline void change_bit(int nr, volatile unsigned long *addr)
{
    __asm__ __volatile__( LOCK_PREFIX
        "btcl %1,%0"
        :"=m" (ADDR)
        :"Ir" (nr));
}
Example: bit = 4

bit        = 0000 0000 0000 0100
bit & 31   = 0000 0000 0000 0100 & 0000 0000 0001 1111
           = 0000 0000 0000 0100
mask       = 1UL << (4 & 31) = 1UL << 4
           = 0000 0000 0001 0000
~mask      = 1111 1111 1110 1111
bit >> 5   = 0, so p = p + 0
*p         = *p & 1111 1111 1110 1111   (clears bit 4)
static inline void ____atomic_clear_bit(unsigned int bit, volatile unsigned long *p)
{
    unsigned long flags;
    unsigned long mask = 1UL << (bit & 31);

    p += bit >> 5;
    local_irq_save(flags);
    *p &= ~mask;
    local_irq_restore(flags);
}
Example: bit = 34

bit        = 0000 0000 0010 0010
(34 & 31)  = 0000 0000 0010 0010 & 0000 0000 0001 1111
           = 0000 0000 0000 0010
mask       = 1UL << (34 & 31) = 1UL << 2
           = 0000 0000 0000 0100
bit >> 5   = 1, so p = p + 1   (this advances the pointer to the next word)
*p         = *p & 1111 1111 1111 1011   (clears bit 2 in the second word, i.e. bit location 34)
static inline void ____atomic_set_bit(unsigned int bit, volatile unsigned long *p)
{
    unsigned long flags;
    unsigned long mask = 1UL << (bit & 31);

    p += bit >> 5;
    local_irq_save(flags);
    *p |= mask;
    local_irq_restore(flags);
}
In the hierarchy of control sequence locking, the spinlock is the most efficient in its initialization and use, but also represents the most significant impact on the kernel.
 o Design requirements
    Intended for application to locking where resource hold time is short
    Fast initialization
    Fast access and return
    Small cache footprint
    Design will tolerate processing overhead
There are two primary alternatives for code sequence locking
 o A process operating on one CPU (in an SMP system) may seek to acquire a resource or enter a sequence of instructions. If the lock is not accessible, the process (a kernel thread) may be designed to be dequeued until the lock is available.
    If the lock is anticipated to not be available for an extended period, then this is acceptable
    The process of dequeue and then enqueue incurs latency: a context switch exiting and entering
 o An alternative, for design requirements where it is known in advance that the lock acquisition delay will be short, is the spin lock
    The kernel thread process is not dequeued during the lock delay
    Rather, it continues to test the lock
Operations
 o A process that may wish to protect a code sequence sets a spinlock
 o Only one spinlock is available per thread
 o A second process requesting the lock makes repeated attempts to gain the lock
    It remains in a busy loop, testing the lock during each period that it is scheduled
    If the previous task releases the lock, it will be discovered to be available when the new task seeking the lock is scheduled
Characteristics
 o The spinlock is central in kernel code
 o The acquisition of a spinlock does not disable interrupt operations
    The spin results from an interruptible NOP loop that has been inserted
 o An example of a potential deadlock failure results from a process having acquired a spinlock and then being interrupted and replaced by an ISR that seeks the same spinlock
Usage rules
 o Spinlocks are appropriate for fast execution (less than the time required for two context switches)
 o Sleep operations should not be started in a sequence of execution after a lock is acquired
 o The probability that an interrupt service routine may require the same lock resource is high
First, setting the lock (in kernel/spinlock.c)
 o __lockfunc defines fastcall and the directive that the first three function arguments are to be placed in registers as opposed to the stack (as would be the compiler default)
 o FASTCALL macro settings:

    #define fastcall __attribute__((regparm(3)))
The actual spinlock
 o slock = 1 if the lock is available
 o slock = 0 after the decrement on a lock request, if the request is successful
static inline void _raw_spin_lock(spinlock_t *lock)
{
    __asm__ __volatile__(
        spin_lock_string
        :"=m" (lock->slock) : : "memory");
}
static inline void _raw_spin_lock(spinlock_t *lock)
{
    __asm__ __volatile__(
        "1:\t"
        "lock ; decb %0\n\t"
        "jns 3f\n"
        "2:\t"
        "rep;nop\n\t"
        "cmpb $0,%0\n\t"
        "jle 2b\n\t"
        "jmp 1b\n"
        "3:\n\t"
        : "=m" (lock->slock) : : "memory");
}
 o decb decrements the spinlock value; note that argument %0 points to lock->slock
 o Tests whether the decrement of the lock reached zero (the lock state was therefore one at the time of access)
Note, it is not adequate to merely set slock
 o Consider multiple CPUs in a race to set the bit
 o The decrement removes the race condition
    Each CPU can decrement the lock
    No CPU can exit the spinlock until the lock becomes set to one and may be decremented
 o If the sign flag is not set, then the spinlock was 1 and is now zero; thus jump to 3 and continue
    The thread executing this sequence now owns the lock
 o If the spinlock value was zero, a decrement will yield a negative value
    The lock was therefore taken previously
 o Then, the system compares the memory register with zero
    If less than or equal, remain in the loop, since other CPU processes may be decrementing slock
    If greater than zero, the lock must be set to 1 and is free
 o However, this system does not exit immediately; a race condition with multiple CPUs is in progress
    Another CPU has set the lock to 1
    But yet another CPU may now have decremented the lock
The test is: can this thread successfully decrement the lock to zero from one
 o If so, this is the only thread that owns the lock
 o If not, this CPU has lost the race
Hence, this system performs one more test to ensure
 o The current thread acquires the lock
 o No other thread has or can acquire the lock
static inline void rep_nop(void)
{
    __asm__ __volatile__("rep;nop": : :"memory");
}

#define cpu_relax() rep_nop()
Detail point on optimization
 o Analysis has shown that some systems spend a significant fraction of time in the spinlock state
    The delay may be unavoidable
    However, it introduces undesired power dissipation
Now, the rep;nop sequence introduces a method for signaling the CPU that the current thread is executing and waiting for a spinlock
 o The rep prefix ordinarily causes a string instruction to be repeated a number of times equal to the contents of the cx register
    The assembler introduces the rep opcode prefix byte 0xF3 prepending the instruction
    It applies only to string instructions and defaults to a single NOP otherwise
 o However, the processor observes the presence of rep;nop
 o The processor then may adjust clock frequency and core voltage, and reduce energy usage
The cpu_relax() macro, which includes the rep;nop sequence, has appeared in recent kernel versions
Setting a lock while disabling interrupts
 o _spin_lock_irq(spinlock_t *lock)
 o Interrupts are enabled unconditionally at the time the lock is released
    This may create an error condition if interrupts were previously disabled
Setting a lock while disabling interrupts and storing interrupt state
 o _spin_lock_irqsave(spinlock_t *lock, unsigned long flags)
 o This enables the state of interrupts to be stored at the time the lock is set
 o Interrupts are enabled at the time the lock is released only if they were initially enabled
#define local_irq_save(x) __asm__ __volatile__( \
    "pushfl ; popl %0 ; cli"   /* disable interrupts */ \
    :"=g" (x): \
    :"memory")
In asm-i386/system.h
 o This is called with a flags argument
 o The flags are first saved on the stack
 o The flag value is then popped into a general purpose register; the stack pointer is returned to its initial value
 o Thus, the flags are now saved as a temporary variable (in the next stack entry)
The "memory" keyword informs the compiler that memory has been changed, blocking compiler attempts to reorder the sequence of control
Setting a lock while disabling interrupt bottom halves
 o Critical for many network drivers
 o This permits hardware interrupts to proceed
 o However, the computational demand of bottom halves, which would otherwise delay interrupt service routines, is not introduced
 o For example, timer interrupts and other critical events
 o _spin_lock_bh(spinlock_t *lock)
 o This enables the state of interrupts to be stored at the time the lock is set
 o Interrupts are enabled at the time the lock is released only if they were initially enabled
This takes us to interrupts.c
 o Now, softirqs (for networking, for example) will only be allowed if the preemption counter is less than SOFTIRQ_OFFSET
 o If many processes have incremented the preemption counter, the policy is to not add yet another task, but rather allow these to complete
 o Now, with the increase in preempt count, BH are disabled, since the number of allowed softirqs will be incremented above the limit by simply adding the max value to the current value
 o This creates a convenient method for returning and restoring the preemption count while gating BH operations
 o Note the while (0) construct and barrier
Design goals
 o Release the lock resource
 o Enable preemption
 o Evaluate whether rescheduling should occur
Release spin_lock
In preempt.h
 o Enable preemption
Call reschedule if current is flagged
 o This return from a spinlock represents an important opportunity to exploit the resched option
Unlock (the lock argument applies only if debugging is applied); this is removed below
Sets the spin lock bit; note the memory barrier
#define local_irq_restore(x) do { \
    if ((x & 0x000000f0) != 0x000000f0) \
        local_irq_enable(); \
} while (0)

#define local_irq_enable() __asm__ __volatile__( "sti" : : :"memory")
#define spin_unlock_bh(lock)
In softirq.c, local_bh_enable is found
 o Note the recovery using the (SOFTIRQ_OFFSET - 1) subtraction
 o This removes the SOFTIRQ_OFFSET and enables SOFTIRQ threads to be executed by softirqd
 o However, preemption remains disabled (due to the -1 above) if it was disabled previous to this action
 o Note that a check is made that we are not in interrupt context and that there is a pending softirq
 o Then the softirq is actually performed immediately, before any other process that the scheduler may have selected
Note that the preemption counter is decremented; preemption will be enabled when this reaches zero. Note that the resched check is called.
void local_bh_enable(void)
{
    sub_preempt_count(SOFTIRQ_OFFSET - 1);
    if (unlikely(!in_interrupt() && local_softirq_pending()))
        do_softirq();
    dec_preempt_count();
    preempt_check_resched();
}
Lock state testing
 o The lock state can be tested without locking, to enable flow control
    For example, spin_trylock(), spin_trylock_bh()
    Implemented with the atomic xchgl instruction on x86
static inline int __raw_spin_trylock(raw_spinlock_t *lock)
{
    int oldval;

    __asm__ __volatile__(
        "xchgl %0,%1"
        :"=q" (oldval), "=m" (lock->slock)
        :"0" (0) : "memory");
    return oldval > 0;
}
Implemented in the ARM architecture:
 o Loads the lock value
 o Stores exclusive if equal (if the lock value is 0)
 o Otherwise, exits with the lock value in tmp
static inline int __raw_spin_trylock(raw_spinlock_t *lock)
{
    unsigned long tmp;

    __asm__ __volatile__(
"       ldrex   %0, [%1]\n"
"       teq     %0, #0\n"
"       strexeq %0, %2, [%1]"
    : "=&r" (tmp)
    : "r" (&lock->lock), "r" (1)
    : "cc");

    if (tmp == 0) {
        smp_mb();
        return 1;
    } else {
        return 0;
    }
}
kthread_mod_coord.c: demonstration of multiple kernel thread creation and binding on a multicore system
 o This system includes spinlock synchronization
/* array of pointers to thread task structures */
#define MAX_CPU 16
#define LOOP_MAX 10
#define BASE_PERIOD 200
#define INCREMENTAL_PERIOD 30
#define WAKE_UP_DELAY 0
static struct task_struct *kthread_cycle_[MAX_CPU];
static int kthread_cycle_state = 0;
static int num_threads;
static int cycle_count = 0;
static spinlock_t kt_lock = SPIN_LOCK_UNLOCKED;

static int cycle(void *thread_data)
{
    int delay, residual_delay;
    int this_cpu;
    int ret;
    int loops;

    delay = BASE_PERIOD;
    for (loops = 0; loops < LOOP_MAX; loops++) {
        this_cpu = get_cpu();
        delay = delay + this_cpu*INCREMENTAL_PERIOD;
        ret = spin_is_locked(&kt_lock);
        if (ret != 0) {
            printk("kthread_mod: cpu %i start spin cycle\n", this_cpu);
exit loop
    while (!kthread_should_stop()) {
        delay = 1 * HZ;
        set_current_state(TASK_UNINTERRUPTIBLE);
        residual_delay = schedule_timeout(delay);
        printk("kthread_mod: wait for stop pid %i cpu %i \n",
               current->pid, this_cpu);
    }
    printk("kthread_mod: cycle function: stop state detected for cpu %i\n",
           this_cpu);
    return 0;
}

int init_module(void)
{
    int cpu = 0;
    int count;
    int this_cpu;
    int num_cpu;
    int delay_val;
    int *kthread_arg = 0;
    int residual_delay;
    const char thread_name[] = "cycle_th";
    const char name_format[] = "%s/%d";   /* format name and cpu id */

    num_threads = 0;
    num_cpu = num_online_cpus();
    this_cpu = get_cpu();
    printk
Start up
 o Note coordination
 o Note that locking occurs
    However, locking only occurs when the relationship between delays leads to resource contention
[61888.295386] kthread_mod: init task 17348 cpu 2 of total CPU 4
[61888.297709] kthread_mod: current task 17348 cpu 2 create/wake next thread
[61888.297805] kthread_mod: lock pid 17349 cpu 0 delay 200 count 0
[61888.301142] kthread_mod: current task 17348 cpu 2 create/wake next thread
[61888.301158] kthread_mod: cpu 1 start spin cycle
[61888.309106] kthread_mod: current task 17348 cpu 2 create/wake next thread
[61888.309146] kthread_mod: cpu 2 start spin cycle
[61889.004148] kthread_mod: current task 17348 cpu 0 create/wake next thread
[61889.004161] kthread_mod: cpu 3 start spin cycle
[61889.100033] kthread_mod: unlock pid 17349 cpu 0 delay 200 count 0
[61889.100073] kthread_mod: cpu 0 start spin cycle
[61889.100080] kthread_mod: lock pid 17350 cpu 1 delay 230 count 0
[61890.020530] kthread_mod: unlock pid 17350 cpu 1 delay 230 count 0
[61890.020581] kthread_mod: cpu 1 start spin cycle
[61890.020588] kthread_mod: lock pid 17351 cpu 2 delay 260 count 0
[61891.061032] kthread_mod: unlock pid 17351 cpu 2 delay 260 count 0
[61891.061074] kthread_mod: cpu 2 start spin cycle
[61891.061080] kthread_mod: lock pid 17352 cpu 3 delay 290 count 0
[61892.217531] kthread_mod: unlock pid 17352 cpu 3 delay 290 count 0
[61892.217572] kthread_mod: cpu 3 start spin cycle
[61892.217582] kthread_mod: lock pid 17349 cpu 0 delay 200 count 0
[61893.312070] kthread_mod: unlock pid 17349 cpu 0 delay 200 count 0
[61893.312131] kthread_mod: cpu 0 start spin cycle
[61893.312138] kthread_mod: lock pid 17350 cpu 1 delay 260 count 0
[61895.332564] kthread_mod: unlock pid 17350 cpu 1 delay 260 count 0
[61895.332609] kthread_mod: cpu 1 start spin cycle
[61895.332617] kthread_mod: lock pid 17351 cpu 2 delay 320 count 0
[61897.921049] kthread_mod: unlock pid 17351 cpu 2 delay 320 count 0
[61897.921094] kthread_mod: cpu 2 start spin cycle
[61897.921101] kthread_mod: lock pid 17352 cpu 3 delay 380 count 0
[61899.889564] kthread_mod: unlock pid 17352 cpu 3 delay 380 count 0
[61899.889609] kthread_mod: cpu 3 start spin cycle
[61899.889619] kthread_mod: lock pid 17349 cpu 0 delay 200 count 0
[61902.088074] kthread_mod: unlock pid 17349 cpu 0 delay 200 count 0
[61902.088118] kthread_mod: cpu 0 start spin cycle
[61902.088124] kthread_mod: lock pid 17350 cpu 1 delay 290 count 0
[61903.248533] kthread_mod: unlock pid 17350 cpu 1 delay 290 count 0
[61903.248577] kthread_mod: cpu 1 start spin cycle
[61903.248586] kthread_mod: lock pid 17351 cpu 2 delay 380 count 0
The top utility, operating in batch mode, may record per-CPU load
 o Note the processor usage at 100% system and 0% user
    This is a direct result of the resource cost associated with the spinlock
 o Note: three processors (threads) are operating at full load
 o One processor, CPU3, is executing (and is in a sleep state); note that this processor is spending the balance of its time in the idle thread
Entries indicate percentage of time processor was executing a task other than the idle task during the time since the last screen update
Note the behavior above:
o CPU 0 wins the race to acquire the spinlock and remains in a sleep state, requiring no CPU load
o CPUs 1, 2, and 3 operate at 100 percent load, waiting for the spinlock resource to become available
o Again at t = 1 second, a race condition occurs and CPU 1 wins
Note the behavior above over an extended period
o CPU 0, 1, and 3 acquire the spinlock
o CPU 2 does not acquire the lock
Conventional multiprocessor systems suffer from energy inefficiency: processors waiting for spinlocks expend energy polling them
o A significant fraction of processor time may be lost in synchronization
o Problems include priority inversion, deadlocks, and convoy behavior
o Convoy behavior: groups of processors executing control sequences in parallel and stalling in synchrony, waiting for the same lock
Energy saving is now ensured by placing the processor in a temporary stall state, with the ability to wake the processor within one cycle of the lock being freed
o Notification via the SCU (Snoop Control Unit) in the multiprocessor core
o Ensures that all caches are coherent
o Signal propagates to all CPUs (see unlock)
o wfene instruction - wait for event (receive notification)
o sev instruction - send event (notify waiting CPUs)
static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
    unsigned long tmp;

    __asm__ __volatile__(
"1: ldrex   %0, [%1]\n"     /* load lock member of &lock into reg */
"   teq     %0, #0\n"       /* test lock value */
"   wfene\n"                /* wait for notification if lock held */
"   strexeq %0, %2, [%1]\n" /* attempt to store 1 in lock */
"   teqeq   %0, #0\n"       /* test for successful store */
"   bne     1b"             /* loop if unsuccessful */
    : "=&r" (tmp)
    : "r" (&lock->lock), "r" (1)
    : "cc");

    smp_mb();
}
wmb() and rmb() are both defined as mb() for ARM
o #define mb() __asm__ __volatile__ ("" : : : "memory")
o This ensures that any writes or reads of variables that are being protected by the spinlock are scheduled prior to releasing the lock
static inline void __raw_spin_unlock(raw_spinlock_t *lock)
{
    smp_mb();

    __asm__ __volatile__(
"   str %1, [%0]\n"                 /* release: store 0 in lock member */
"   mcr p15, 0, %1, c7, c10, 4\n"   /* drain store buffer */
"   sev"                            /* send event to waiting CPUs */
    :
    : "r" (&lock->lock), "r" (0)
    : "cc");                        /* CPSR updated */
}
Drain store buffer operation
o Forces synchronization of this stored data into the D-cache of each processor
14. RW SPINLOCKS
A spinlock allows only one sequence of control to enter a sequence of instructions
An alternative exists for the spinlock: the reader/writer lock
o The reader/writer lock admits many readers
o The lock prevents access by readers if a writer has taken the lock
o It permits only one writer
o A writer may not acquire the lock while any reader or other writer holds it
RW spinlocks are based on a rwlock_t structure
o This contains a counter variable equal to the number of readers that have acquired the rwlock
#define RW_LOCK_UNLOCKED
Implementation and usage
o For a sequence of control that the designer intends to use to read a shared memory resource, read_lock(rwlock_t *lock) is used
o All of the other variants above for spin_lock are included:
read_lock, read_lock_irq, read_lock_irqsave, read_lock_bh
read_unlock, read_unlock_irq, read_unlock_irqrestore, read_unlock_bh
read_trylock (added in 2.6 kernel)
write_lock, write_lock_irq, write_lock_irqsave, write_lock_bh
write_unlock, write_unlock_irq, write_unlock_irqrestore, write_unlock_bh
write_trylock
Consider write lock acquisition (called by writers)
o Recall that strex r1, r2, [r3] stores the contents of r2 into memory at the address contained in r3, and places a zero result in r1 if no other writes have occurred to [r3] since the previous ldrex
o Note this can also execute conditionally (strexeq, strexpl)
static inline void _raw_write_lock(rwlock_t *rw)
{
    unsigned long tmp;

    __asm__ __volatile__(
"1: ldrex   %0, [%1]\n"     /* load exclusive and monitor lock */
"   teq     %0, #0\n"       /* test if lock is zero */
"   strexeq %0, %2, [%1]\n" /* attempt to write the bias if zero - */
"   teq     %0, #0\n"       /*   note the above is conditional execution */
"   bne     1b"             /* spin until lock acquired */
    : "=&r" (tmp)
    : "r" (&rw->lock), "r" (0x80000000)
    : "cc", "memory");
}
Note that this sets a value of 0x80000000 (2^31), interpreted as a negative value
Write unlock merely involves clearing the lock (called by writers)
static inline void _raw_write_unlock(rwlock_t *rw)
{
    __asm__ __volatile__(
"   str %1, [%0]"           /* store zero at address &rw->lock */
    :
    : "r" (&rw->lock), "r" (0)
    : "cc", "memory");
}
The read lock must admit many readers
It must track the number of readers and prevent any writer from entering a code section until all readers have exited
o Each reader increments the lock on entry and decrements it on release
o Writers are only permitted to enter if the lock is set to zero
Here is the operation for read_lock; this is called by a reader attempting to enter a critical section
o Note: if a writer has taken the lock, its value will be -2^31 (0x80000000)
o The increment result below will then remain negative for up to 2^31 - 1 readers
If no reader or writer is present, the initial value of the lock variable is zero
o The lock is incremented by one for each reader upon acquiring the lock
This implementation tests for the presence of a writer and spins in that event
Otherwise, a reader is admitted: the lock value is incremented atomically by loading it exclusively (setting the monitor) and incrementing the value in a register initially equal to the lock value
o The lock value is stored only if the result is zero or positive
o It is then decremented if the lock is negative (returning the value to its prior state)
o Then, if negative, remain in a busy wait loop until the writer exits and releases the lock
o Otherwise the store will have occurred; exit
Note strexpl is Store Exclusive executing on a positive-or-zero (PL) comparison
o The result of adding 1 to the register will be positive only if the lock was initially zero or positive (no writer present)
o Otherwise, the LS modifier executes on Lower or Same
o Note: rsbpls %0, %1, #0 returns a negative result if the result of strexpl is non-zero
" "
load exclusive (&rw->lock) into reg increment lock value (blindly) store reg exclusive setting reg (tmp2) if result is positive indicating no writers present ; But, a value of 1 will appear in ; (tmp2) if lock has been modified ; Thus, must now decrement to return lock ; to its initial value in next ; instruction note %1 contains value 1 ; as a result of this event so, ; not required to load 1 immediate rsbpls %0, %1, #0\n" ; decrement lock if lower or same bmi 1b" ; branch if negative since lock value is : "=&r" (tmp), "=&r" (tmp2) ; negative and writers are present : "r" (&rw->lock) : "cc", "memory");
; ; ; ; ;
Now, unlocking proceeds as follows:
o Readers decrement the lock value on exiting
o This is performed exclusively (atomically)
o The lock value is positive while readers are present and decrements to 0 when all readers exit
Here is the operation for read_unlock called by a reader exiting a critical section
static inline void _raw_read_unlock(rwlock_t *rw)
{
    unsigned long tmp, tmp2;

    __asm__ __volatile__(
"1: ldrex   %0, [%2]\n"     /* load exclusive lock into reg */
"   sub     %0, %0, #1\n"   /* decrement lock value */
"   strex   %1, %0, [%2]\n" /* store lock value */
"   teq     %1, #0\n"       /* test for successful exclusive operation */
"   bne     1b"             /* branch if not exclusive */
    : "=&r" (tmp), "=&r" (tmp2)
    : "r" (&rw->lock)
    : "cc", "memory");
}
A writer may attempt to set the lock to the write bias (0x80000000), or exit immediately if another thread holds the lock
static inline int _raw_write_trylock(rwlock_t *rw)
{
    unsigned long tmp;

    __asm__ __volatile__(
"1: ldrex   %0, [%1]\n"
"   teq     %0, #0\n"
"   strexeq %0, %2, [%1]"   /* store exclusive if equal (lock free) */
    : "=&r" (tmp)
    : "r" (&rw->lock), "r" (0x80000000)
    : "cc", "memory");

    return tmp == 0;        /* nonzero tmp: lock held, exit without spinning */
}
The semaphore is a unique variable with the following characteristics:
o The semaphore value can be used to determine whether a process will execute or wait
o The semaphore may be operated on by wait or post

wait
The wait function causes the semaphore value to be decremented by 1 if the semaphore is non-zero
o The process calling wait on the semaphore is allowed to continue
o This operation is atomic in that it completes without interruption by other processes. Thus, two processes that each attempt to decrement the semaphore each decrement it by exactly one unit. If the semaphore value is one, only one of two such processes will be allowed to continue; the other will block.
If the semaphore is zero
o The process calling wait on the semaphore is blocked
o The process remains blocked until the decrement of the semaphore can return zero (as opposed to a negative value)
post
The post function increments the semaphore. This is again atomic: if two processes both attempt to increment a semaphore of value 0, it is incremented by two. Without atomicity, both processes might conclude that the proper value for the semaphore is 1.
A process may use the semaphore to protect a critical section of code such that its access to shared resources is protected (as if it were the only process operating) during a code sequence. This holds true even if the process is interrupted or taken from running to ready by the operating system.
IMPLEMENTATION
The next step in the locking hierarchy is the semaphore. This prevents a process from passing a point in the sequence of control defined by the semaphore.
o However, unlike spinlocks, semaphores cause a process that reaches a taken semaphore to sleep
o Formally, this means that the process (kernel thread) is dequeued, and a user space process or new kernel thread operates as a result of a context switch
o This is clearly efficient for designs where the sleep time is long
o However, scheduler latency must be accounted for
The semaphore design is considerably more complex:
o Task wait queue management
o Management of many waiting tasks that may be admitted when the semaphore is available
Again, a data structure is the design foundation
o struct semaphore, with data members:
o count: an atomic variable with these states:
  - Positive: semaphore is free
  - 0: semaphore is acquired, one thread is executing, and no other threads are sleeping while waiting for the semaphore
  - Negative: a number of threads equal to the absolute value of count are waiting for the semaphore
o wait: pointer to a linked list of waiting tasks
o sleepers: a flag indicating the presence of queued processes; zero if there are no sleeping processes, 1 otherwise
Functions
o An atomic down operation reduces the count variable
o If the semaphore is taken or busy, the task is placed on the wait queue until the semaphore state changes
Initialization (see /include/asm-i386/semaphore.h) o void sema_init (struct semaphore *sem, int val)
struct semaphore {
    atomic_t count;
    int sleepers;
    wait_queue_head_t wait;
};

static inline void sema_init(struct semaphore *sem, int val)
{
    atomic_set(&sem->count, val);
    sem->sleepers = 0;
    init_waitqueue_head(&sem->wait);
}
Mutex
o Initializing a semaphore to 1 produces a mutex variable
o This implies that only one lock holder is enabled: one thread that can occupy a code sequence
Requesting a semaphore: down(struct semaphore *sem)
o This will place a task that fails to receive the semaphore on the wait queue in a TASK_UNINTERRUPTIBLE state
Implementation of down()
o First, note the code structure
o It begins with an atomic decrement. If the decrement yields a negative result, the lock has been taken: jump to the section at LOCK_SECTION_START. Otherwise, exit the down() function
Optimization: note that access to the semaphore lock is likely to succeed and a failure is unlikely
This code is inlined in compilation with other code
o Included in a volatile block without reordering
o Note, LOCK_SECTION_START is defined to create a subsection for this code, separate from this section
Thus, as this code sequence is included in the inline function, only the decl and js instructions appear in the inline sequence o This prevents code in the lock section from being imported into the instruction cache, evicting other instructions more likely to be used
static inline void down(struct semaphore * sem)
{
    __asm__ __volatile__(
        LOCK "decl %0\n\t"      /* decrement sem->count */
        "js 2f\n"               /* jump on sign; otherwise exit this
                                   function, since the next instructions
                                   are not included inline */
        "1:\n"
        LOCK_SECTION_START("")
        "2:\tlea %0,%%eax\n\t"  /* load addr of sem in eax */
        "call __down_failed\n\t"
        "jmp 1b\n"              /* loop back to the label after
                                   LOCK_SECTION_START */
        LOCK_SECTION_END
        :"=m" (sem->count)
        :
        :"memory","ax");
}
o __down_failed prepares the call to __down
o Places the current task on a waitqueue
o Note that this will be included in the text section (code section) that is occupied by sched.c
/* include in sched.c text section */
asm(
".section .sched.text\n"
".align 4\n"
".globl __down_failed\n"
"__down_failed:\n\t"
    "pushl %edx\n\t"
    "pushl %ecx\n\t"
    "call __down\n\t"
    "popl %ecx\n\t"
    "popl %edx\n\t"
    "ret"
);
Examine __down()
o First, obtain a pointer to the task_struct of the current task
o Create a waitqueue entry for the current task
o The current task was TASK_RUNNING; now set its state member to TASK_UNINTERRUPTIBLE
Acquire a spinlock with interrupts disabled and with the ability to restore interrupts
Add the current task to the waitqueue associated with this semaphore
o Mark this entry as WQ_FLAG_EXCLUSIVE; this will control the waking process
o Tasks are added to the tail of the waitqueue
Increment the sleepers member
Now, enter a loop
o First, get the number of sleepers
o Add (sleepers - 1) to the semaphore counter
o If the result is not negative, set sleepers to zero and break out of the loop (the semaphore is acquired)
Upon entry to __down, the semaphore is down (the lock is held): count = 0 and no other sleeping task is waiting on the queue (the only other task of interest is the task that currently holds the semaphore)
o First, the semaphore count will have been decremented to -1 (by the decl in down())
o Then, sleepers is incremented by 1
o Then, the count value is updated by adding (sleepers - 1), the original sleep count
o This yields -1 + 0 = -1 (in our case with no other sleepers)
o Note the definition of atomic_add_negative: the result is true if the result of the addition is negative, otherwise false
This negative result will cause the conditional branch not to be taken
o Then, sleepers is set to 1; this indicates the presence of the task requesting the semaphore
o Call schedule()
o schedule() will observe the TASK_UNINTERRUPTIBLE status and dequeue this task. This leaves the task on the waitqueue, waiting for an event
After return from schedule()
o Locks are taken
o The task is marked TASK_UNINTERRUPTIBLE, and another check is performed on the semaphore status with sleepers = 1
o If the semaphore is not available, control remains in the loop: sleepers is set to 1 and schedule() is again called
If the semaphore count has been incremented (released) by other action, then the break is taken and each of the sleepers is removed in turn.
As the semaphore becomes available
o A call is made to release the spinlock and restore interrupts
o Any processes sleeping on the waitqueue will be activated
o With a set of rules to be seen below
The task is set to TASK_RUNNING o The next time that the scheduler function runs, this task is eligible for selection
    for (;;) {                  /* loop will not exit until */
                                /* all sleepers exit */
        int sleepers = sem->sleepers;

        if (!atomic_add_negative(sleepers - 1, &sem->count)) {
            sem->sleepers = 0;
            break;
        }
        sem->sleepers = 1;      /* this task - see -1 above */
        spin_unlock_irqrestore(&sem->wait.lock, flags);
        schedule();             /* will lead to sleep */
        spin_lock_irqsave(&sem->wait.lock, flags);
        tsk->state = TASK_UNINTERRUPTIBLE;
    }
It is important to consider how a list of tasks sleeping on the waitqueue may be activated (set to TASK_RUNNING)
o Consider the modified example where N tasks occupy the waitqueue
o These all entered the waitqueue through this function, so, upon being woken, they will execute this loop, entering the control flow immediately after schedule()
o Each task will thus execute the waitqueue removal and then wake_up_locked; the wake_up_locked function will activate each task in turn
void fastcall add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t *wait)
{
    unsigned long flags;

    wait->flags |= WQ_FLAG_EXCLUSIVE;
    spin_lock_irqsave(&q->lock, flags);
    __add_wait_queue_tail(q, wait);
    spin_unlock_irqrestore(&q->lock, flags);
}
Consider wake_up_locked: this will call __wake_up_common()
o It will wake up one exclusive task: the process that initially called __down
o It will place the task on the runqueue marked as TASK_RUNNING

Semaphore functions
static inline int down_interruptible(struct semaphore * sem) static inline int down_trylock(struct semaphore * sem) static inline void up(struct semaphore * sem)
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/timer.h>
#include <linux/cpumask.h>
#include <asm/semaphore.h>

#define MAX_CPU 16
#define LOOP_MAX 20
#define BASE_PERIOD 200
#define INCREMENTAL_PERIOD 30
#define WAKE_UP_DELAY 0
/* array of pointers to thread task structures */
static struct task_struct *kthread_cycle_[MAX_CPU];
static int kthread_cycle_state = 0;
static int num_threads;
static int cycle_count = 0;
static struct semaphore kthread_mod_sem;

static int cycle(void *thread_data)
{
    int delay, residual_delay;
    int this_cpu;
    int ret_sem;
    int loops;
    delay = BASE_PERIOD;
    for (loops = 0; loops < LOOP_MAX; loops++) {
        this_cpu = get_cpu();
        delay = delay + this_cpu * INCREMENTAL_PERIOD;
        printk("kthread_mod: cpu %i executing down on kthread_mod_semaphore \n",
            this_cpu);
        down(&kthread_mod_sem);
exit loop
    while (!kthread_should_stop()) {
        delay = 1 * HZ;
        set_current_state(TASK_UNINTERRUPTIBLE);
        residual_delay = schedule_timeout(delay);
        printk("kthread_mod: wait for stop pid %i cpu %i \n",
            current->pid, this_cpu);
    }
    printk("kthread_mod: cycle function: stop state detected for cpu %i\n",
        this_cpu);
    return 0;
}

int init_module(void)
{
    int cpu = 0;
    int count;
    int this_cpu;
    int num_cpu;
    int delay_val;
    int *kthread_arg = 0;
    int residual_delay;
    const char thread_name[] = "cycle_th";
    const char name_format[] = "%s/%d";
    num_threads = 0;
    num_cpu = num_online_cpus();
    this_cpu = get_cpu();
    printk("kthread_mod: init task %i cpu %i of total CPU %i \n",
        current->pid, this_cpu, num_cpu);
Note the behavior where the same thread on cpu 0 reacquires the semaphore
o As the thread executes up(), it returns and immediately executes down() again
o Unlike the spinlock example, other competing threads do not observe the availability of the semaphore, since their test of the semaphore can only occur at the rate of clock ticks
This is because each other thread must wait for the next timer tick (to be removed from the waitqueue and test the semaphore)
[5.356000] kthread_mod: cpu 0 executing down on kthread_mod_semaphore
[5.356000] kthread_mod: Thread pid 9637 acquired semaphore executing on cpu 0 delay 200 count 0
[5.360000] kthread_mod: current task 9631 cpu 3 create/wake next thread
[5.360000] kthread_mod: cpu 1 executing down on kthread_mod_semaphore
[5.364000] kthread_mod: current task 9631 cpu 2 create/wake next thread
[5.364000] kthread_mod: cpu 2 executing down on kthread_mod_semaphore
[5.368000] kthread_mod: current task 9631 cpu 3 create/wake next thread
[5.368000] kthread_mod: cpu 3 executing down on kthread_mod_semaphore
[6.156000] kthread_mod: Thread pid 9637 releasing semaphore executing on cpu 0 delay 200 count 0
[6.156000] kthread_mod: cpu 0 executing down on kthread_mod_semaphore
[6.156000] kthread_mod: Thread pid 9637 acquired semaphore executing on cpu 0 delay 200 count 0
[6.956000] kthread_mod: Thread pid 9637 releasing semaphore executing on cpu 0 delay 200 count 0
[6.956000] kthread_mod: cpu 0 executing down on kthread_mod_semaphore
[6.956000] kthread_mod: Thread pid 9637 acquired semaphore executing on cpu 0 delay 200 count 0
[7.756000] kthread_mod: Thread pid 9637 releasing semaphore executing on cpu 0 delay 200 count 0
[7.756000] kthread_mod: cpu 0 executing down on kthread_mod_semaphore
[7.756000] kthread_mod: Thread pid 9637 acquired semaphore executing on cpu 0 delay 200 count 0
[8.556000] kthread_mod: Thread pid 9637 releasing semaphore executing on cpu 0 delay 200 count 0
[8.556000] kthread_mod: cpu 0 executing down on kthread_mod_semaphore
[8.556000] kthread_mod: Thread pid 9637 acquired semaphore executing on cpu 0 delay 200 count 0
[9.356000] kthread_mod: Thread pid 9637 releasing semaphore executing on cpu 0 delay 200 count 0
[9.356000] kthread_mod: cpu 0 executing down on kthread_mod_semaphore
[9.356000] kthread_mod: Thread pid 9637 acquired semaphore executing on cpu 0 delay 200 count 0
[0.156000] kthread_mod: Thread pid 9637 releasing semaphore executing on cpu 0 delay 200 count 0
Then, after further delay, cpu 0 releases and completes its task. Then, cpu 1 wins the race to acquire the semaphore.
While responsiveness has degraded, computational load is vastly reduced. Examine the CPU loads below
o Tasks waiting for the semaphore are in a sleep state, and the CPUs are executing the idle task
top per-CPU excerpt (nice and idle columns):
 0.0%ni,100.0%id
 0.0%ni, 96.2%id
 0.0%ni,100.0%id
 0.0%ni,100.0%id
 0.0%ni, 98.0%id
 0.0%ni,100.0%id
 0.0%ni,100.0%id
 0.0%ni,100.0%id
17. RW SEMAPHORES
The next layer of synchronization functions merges semaphores and the read/write design
This produces a semaphore resource that may be held by any number of readers, but only one writer
Again, a rw_semaphore struct is defined in include/asm-i386/rwsem.h
o count: the counter variable is divided into a most significant and a least significant field
struct rw_semaphore {
    signed long         count;
    spinlock_t          wait_lock;
    struct list_head    wait_list;
};
o count: the number of readers is stored in the lower 16 bits and read with a mask value; the upper 16 bits count the number of writers (either 0 or 1)
o wait_list: a list of waiting processes
o wait_lock: a spinlock used for protecting the wait list
#define RWSEM_UNLOCKED_VALUE
#define RWSEM_ACTIVE_BIAS
#define RWSEM_ACTIVE_MASK
#define RWSEM_WAITING_BIAS
#define RWSEM_ACTIVE_READ_BIAS
#define RWSEM_ACTIVE_WRITE_BIAS     (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
static inline void init_rwsem(struct rw_semaphore *sem)
{
    sem->count = RWSEM_UNLOCKED_VALUE;
    spin_lock_init(&sem->wait_lock);
    INIT_LIST_HEAD(&sem->wait_list);
}
Optimization here
o LOCK_SECTION_START creates a separate assembler subsection
o This code sequence is not loaded into the instruction cache, in general
o This avoids contamination of the cache with code unlikely to be executed
static inline void __down_read(struct rw_semaphore *sem)
{
    __asm__ __volatile__(
        LOCK_PREFIX " incl (%%eax)\n\t" /* increment semaphore */
        " js 2f\n\t"                    /* sign set on a negative result;
                                           otherwise exit here */
        "1:\n\t"
        LOCK_SECTION_START("")          /* creates a subsection */
        "2:\n\t"
        " pushl %%ecx\n\t"              /* save reg state */
        " pushl %%edx\n\t"
        " call rwsem_down_read_failed\n\t"
        " popl %%edx\n\t"
        " popl %%ecx\n\t"
        " jmp 1b\n"
        LOCK_SECTION_END
        "# ending down_read\n\t"
        : "=m"(sem->count)
        : "a"(sem), "m"(sem->count)
        : "memory", "cc");
}
void __down_read(struct rw_semaphore *sem) int __down_read_trylock(struct rw_semaphore *sem) void __down_write(struct rw_semaphore *sem) int __down_write_trylock(struct rw_semaphore *sem) void __up_read(struct rw_semaphore *sem) void __up_write(struct rw_semaphore *sem)
The completion structure includes a waitqueue of all tasks that are blocked by the completion gate. In /include/linux/completion.h
A completion variable is initialized and the done flag is set to zero (analogous to the down action on a semaphore); then
o A task that encounters a call to wait_for_completion in its sequence of control will block. This task will be added to the waitqueue
In sched.c we find wait_for_completion
o Note, might_sleep compiles to a NOP if debugging is not set; it is removed here
o Checks the done flag: if done is 0, then declare a waitqueue entry and add the current task to the waitqueue tail
o The waitqueue entry's flag is made exclusive (WQ_FLAG_EXCLUSIVE is set to 1)
This enables only flagged tasks waiting on the waitqueue to be selected during wakeup (it is inefficient to wake all tasks, since only one may be scheduled for execution)
The task state is set to uninterruptible (there is no need to receive a signal here), and schedule() is called
Control remains in this do/while loop until done is nonzero; then
o Remove the task from the waitqueue and decrement done (closing the completion variable) on exit
void fastcall __sched wait_for_completion(struct completion *x)
{
    spin_lock_irq(&x->wait.lock);
    if (!x->done) {
        DECLARE_WAITQUEUE(wait, current);

        wait.flags |= WQ_FLAG_EXCLUSIVE;
        __add_wait_queue_tail(&x->wait, &wait);
        do {
            __set_current_state(TASK_UNINTERRUPTIBLE);
            spin_unlock_irq(&x->wait.lock);
            schedule();
            spin_lock_irq(&x->wait.lock);
        } while (!x->done);
        __remove_wait_queue(&x->wait, &wait);
    }
    x->done--;
    spin_unlock_irq(&x->wait.lock);
}
EXPORT_SYMBOL(wait_for_completion);
To open the completion gate, a call to complete() increments the done flag and then calls the wake_up function to wake tasks from the waitqueue
Now, the __wake_up_common function wakes the tasks on the completion waitqueue; note the completion is declared as x, and its wait member is an argument to __wake_up_common
o This wakes tasks in either state (TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE), with the number of exclusive tasks (one in this case), the sync setting, and a NULL key
o The sync setting enables priority checking: if sync is set to 1 and the priority of a task on the queue is greater than that of other tasks, then schedule() is called. If sync is set to 0, no priority checking occurs
o The wait bit key provides another filter level: a wait bit key may be set to further filter processes at wake_up
Instruction issue may be out of order to optimize throughput
This may result in out-of-order execution
Often, such reordering does not lead to errors, since no interdependency may apply
The processor depends on its instruction decoder to determine if a dependency exists that does not permit out-of-order execution. However, many examples appear in kernel code where dependencies are not apparent. A synchronization method referred to as a barrier is applied (we have seen this appear in our scheduling kernel code earlier).

mb(): a barrier that completes all read and write instructions before it, such that no reordering can occur across it
wmb(): a write barrier that completes all write instructions before it, such that no reordering of writes can occur across it
rmb(): a read barrier that completes all read instructions before it, such that no reordering of reads can occur across it
This differs from barrier() that we observed before. The presence of barrier() prevents reordering at compile time. The above prevent reordering at runtime Here is an example from the 2.6.11 kernel
The mfence instruction indicates that memory has been updated by this function, preventing the compiler and processor from assuming reordering is possible
The lock prefix forces an atomic operation; here it merely adds zero to the address pointed to by the stack pointer. The processor treats this as a memory barrier: all read (load) operations are required to have completed before this instruction is reached, since it forces an atomic operation, even locking the bus.