1. Intel Atom Architecture
   a. Dual Die Processors
      i. Hyperthreading architecture
         1. Shared instruction cache
         2. Shared data cache
         3. Parallel decode and issue of instructions
         4. Parallel register file
         5. Parallel integer and floating point execution units
   b. Pipeline
      i. 16 stages
   c. Two cache levels
      i. L1 compact cache
         1. 32KB I-Cache
         2. 24KB D-Cache
      ii. 512KB L2 cache per core
         1. Dual core architecture provides 1MB total
         2. Each L2 cache is shared by both threads
      iii. Includes prefetch units that detect stride lengths and optimize prefetch operations
2. Synchronization challenges
   a. Many examples in the kernel and applications of independent processes and threads
   b. However, many of the most important applications introduce constraints of data dependence
   c. The most important applications (for example, database systems) exhibit severe reduction in throughput as thread count rises above several threads per processor
3. Synchronization requirements
   a. Lock time resolution
      i. Lock acquisition and release may require high time resolution
      ii. Lock testing may require high time resolution
         1. Encouraging a polling method
      iii. Reduced time resolution requirements must be exploited
         1. An expected wait upon lock detection greater than one clock tick permits sleeping
         2. An expected wait less than one clock period requires a busy wait
   b. Lock footprint
      i. Lock fetch and decode cost
      ii. Lock cache footprint cost
   c. Lock optimization methods
      i. Locking for readers (consumers) and writers (suppliers)
         1. Read lock
         2. Write lock
         3. Read and write lock
Embedded Platform and Embedded Operating System Introduction EE202C Lecture 12

4. Architecture challenges
   a. Managing thread creation and removal
   d. Lock integration with computational load systems
      i. Management of load bursts due to lock release
   e. Energy efficient lock processor resources
struct kthread_create_info {
    int (*threadfn)(void *data);
    void *data;
    struct completion started;
    struct task_struct *result;
    struct completion done;
};

 o Stop information data structure (done state written by keventd)
struct kthread_stop_info {
    struct task_struct *k;
    int err;
    struct completion done;
};
struct task_struct *kthread_create(int (*threadfn)(void *data),
                                   void *data,
                                   const char namefmt[], ...)
/* Note: ... indicates a list of variable arguments referenced
 * by the name format (namefmt) argument. These may include, for
 * example, the current pid value. These arguments will be displayed
 * in process status information */
{
    struct kthread_create_info create;
    DECLARE_WORK(work, keventd_create_kthread, &create);

    create.threadfn = threadfn;        /* function argument */
    create.data = data;                /* function data */
    init_completion(&create.started);
    init_completion(&create.done);

    /*
     * Start the workqueue system below
     */
    if (!helper_wq)
        work.func(work.data);
    else {
        queue_work(helper_wq, &work);
        wait_for_completion(&create.done);
    }

    if (!IS_ERR(create.result)) {
        /* the following code prepares the process table string */
        va_list args;
        va_start(args, namefmt);
        vsnprintf(create.result->comm, sizeof(create.result->comm),
                  namefmt, args);
        va_end(args);
    }
    return create.result;
}
EXPORT_SYMBOL(kthread_create);
Kernel thread binding to CPU
 o Called after creation and before wakeup
void kthread_bind(struct task_struct *k, unsigned int cpu)
{
    wait_task_inactive(k);    /* wait for task to be unscheduled */
    set_task_cpu(k, cpu);
    k->cpus_allowed = cpumask_of_cpu(cpu);
}
EXPORT_SYMBOL(kthread_bind);
static inline void set_task_cpu(struct task_struct *p, unsigned int cpu)
{
    task_thread_info(p)->cpu = cpu;
}
int fastcall wake_up_process(task_t *p)
{
    /* places a stopped or sleeping task on the run queue */
    return try_to_wake_up(p, TASK_STOPPED | TASK_TRACED |
                             TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE, 0);
}
EXPORT_SYMBOL(wake_up_process);
A kernel thread checks for stop status that may be applied by another thread
 o The kernel thread calls kthread_should_stop()
A kernel thread may apply stop state for another thread
The implementation of kthread_stop_sem is important
 o The mutex_lock ensures that only one CPU may apply the stop condition
 o Also, the thread must receive a signal (as required) in order to initiate its completion
int kthread_stop_sem(struct task_struct *k, struct semaphore *s)
{
    int ret;

    mutex_lock(&kthread_stop_lock);
    get_task_struct(k);
    init_completion(&kthread_stop_info.done);
    smp_wmb();
    kthread_stop_info.k = k;     /* sets kthread pointer indicating stop */
    if (s)
        up(s);                   /* release the semaphore */
    else
        wake_up_process(k);      /* start thread to enable completion */
    put_task_struct(k);          /* atomic decrement of task usage */
    wait_for_completion(&kthread_stop_info.done);
    kthread_stop_info.k = NULL;
    ret = kthread_stop_info.err;
    mutex_unlock(&kthread_stop_lock);
    return ret;
}
EXPORT_SYMBOL(kthread_stop_sem);
kthread_mod_uncoord.c: demonstration of multiple kernel thread creation and binding on a multicore system
/* array of pointers to thread task structures */
#define MAX_CPU 16
#define LOOP_MAX 10
#define BASE_PERIOD 200
#define INCREMENTAL_PERIOD 330
#define WAKE_UP_DELAY 0
static struct task_struct *kthread_cycle_[MAX_CPU];
static int kthread_cycle_state = 0;
static int num_threads;
static int cycle_count = 0;

static int cycle(void *thread_data)
{
    int delay, residual_delay;
    int this_cpu;
    int loops;

    delay = BASE_PERIOD;
    for (loops = 0; loops < LOOP_MAX; loops++) {
        this_cpu = get_cpu();
        delay = delay + this_cpu*INCREMENTAL_PERIOD;
        printk("kthread_mod: no lock pid %i cpu %i delay %i count %i \n",
               current->pid, this_cpu, delay, cycle_count);
        cycle_count++;
        set_current_state(TASK_UNINTERRUPTIBLE);
    while (!kthread_should_stop()) {
        delay = 1 * HZ;
        set_current_state(TASK_UNINTERRUPTIBLE);
        residual_delay = schedule_timeout(delay);   /* prepare to yield */
        printk("kthread_mod: wait for stop pid %i cpu %i \n",
               current->pid, this_cpu);
    }
    printk("kthread_mod: cycle function: stop state detected for cpu %i\n",
           this_cpu);
    return 0;
}

int init_module(void)
{
    int cpu = 0;
    int count;
    int this_cpu;
    int num_cpu;
    int delay_val;
    int *kthread_arg = 0;
    int residual_delay;
    const char thread_name[] = "cycle_th";
    const char name_format[] = "%s/%d";   /* format name and cpu id */

    num_threads = 0;
    num_cpu = num_online_cpus();
    printk("kthread_mod: number of operating processors: %i\n", num_cpu);
    this_cpu = get_cpu();
    printk("thread_mod: kthread_mod init: current task is %i on cpu %i \n",
           current->pid, this_cpu);
    for (count = 0; count < num_cpu; count++) {
        cpu = count;
        num_threads++;
Start up
[60937.707450] kthread_mod: number of operating processors: 4
[60937.707486] thread_mod: kthread_mod init: current task is 16243 on cpu 3
[60937.709822] kthread_mod: execution after wake_up_process, current task 16243 on cpu 3
[60937.709841] kthread_mod: no lock pid 16244 cpu 0 delay 200 count 0
[60937.709919] kthread_mod: current task is 16243 on cpu 3 creating and waking next thread after delay of 1s
[60937.713666] kthread_mod: execution after wake_up_process, current task 16243 on cpu 3
[60937.713678] kthread_mod: no lock pid 16245 cpu 1 delay 530 count 1
[60937.713738] kthread_mod: current task is 16243 on cpu 3 creating and waking next thread after delay of 1s
[60937.717799] kthread_mod: execution after wake_up_process, current task 16243 on cpu 3
[60937.717815] kthread_mod: no lock pid 16246 cpu 2 delay 860 count 2
[60937.717893] kthread_mod: current task is 16243 on cpu 3 creating and waking next thread after delay of 1s
[60937.721661] kthread_mod: execution after wake_up_process, current task 16243 on cpu 3
[60937.721712] kthread_mod: current task is 16243 on cpu 3 creating and waking next thread after delay of 1s
[60937.721950] kthread_mod: no lock pid 16247 cpu 3 delay 1190 count 3
[60938.508041] kthread_mod: no lock pid 16244 cpu 0 delay 200 count 3
[60938.508084] kthread_mod: no lock pid 16244 cpu 0 delay 200 count 3
[60939.308039] kthread_mod: no lock pid 16244 cpu 0 delay 200 count 3
[60939.832050] kthread_mod: no lock pid 16245 cpu 1 delay 530 count 2
[60939.832086] kthread_mod: no lock pid 16245 cpu 1 delay 860 count 2
[60940.308037] kthread_mod: wait for stop pid 16244 cpu 0
[60941.156046] kthread_mod: no lock pid 16246 cpu 2 delay 860 count 2
[60941.156082] kthread_mod: no lock pid 16246 cpu 2 delay 1520 count 2
[60941.308064] kthread_mod: wait for stop pid 16244 cpu 0
[60942.308038] kthread_mod: wait for stop pid 16244 cpu 0
[60942.480041] kthread_mod: no lock pid 16247 cpu 3 delay 1190 count 2
[60942.480074] kthread_mod: no lock pid 16247 cpu 3 delay 2180 count 2
[60943.272042] kthread_mod: no lock pid 16245 cpu 1 delay 860 count 2
Computing load: top -d 0.1 (upon start of top execution, enter 1 for per-CPU display)

Cpu0 : 0.0%us, 9.1%sy, 0.0%ni, 90.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Note that one task presents a task load (on Cpu0) while all others are sleeping (note the presence of the 100.0% idle state on the remaining CPUs).
Synchronization between tasks is critical to multiprocess and multiprocessor systems
 o A requirement exists for ensuring that operations occur only according to design constraints and are not subject to race conditions
Synchronized access to a resource may be managed by controlling access to the code segment using a variable that may implement a lock
 o Enables a form of communication between processes
Hierarchy
 o Spinlocks
    Fast acquisition and release
    Resource intensive for extended lock times
 o RW Spinlocks
    High efficiency lock favoring readers
    Often read but rarely written
    Multiple readers, one writer
 o Kernel Semaphore
    Complex implementation
    Increased latency
    Efficient for long delays
 o Seqlock
    High efficiency lock that favors writers
Implementation
 o Relies on processor hardware
 o Important optimizations for performance
 o Recent advances in optimization for energy
Processor architectures may enable the implementation of atomic operations
 o Atomic operations complete without interruption of the sequence of control under all circumstances
 o Examples include the increment of a memory register
 o For this to be atomic, the fetch, decode, fetch of operand from memory, increment, and write back must occur contiguously
 o The arrival of an interrupt must not induce an interruption of this sequence of control
A class of Intel IA-32 instructions are always atomic. These include:
 o Byte length reads or writes from memory
 o 32 bit aligned reads or writes from memory of 32 or 64 bit words
 o 64 bit aligned reads or writes from memory of 128 bit words
 o Reading or writing to cache
    The cache is accessible as cache lines of 32 bytes
    An unaligned read or write falling within this limit will be atomic
Interrupts
 o The data bus is locked after an interrupt, only allowing a selected APIC to write
Other operations are not atomic
 o Thus other methods must apply
 o The IA-32 system allows for software-selectable bus locking
 o Here, one processor may acquire and lock the address/data bus, preventing any other processor (or device) from accessing memory
Assert LOCK prefix
 o Instructions are listed with the identifier lock prepending the instruction
    The assembler will ensure that the object code includes a bus lock operation during execution
    Adds the opcode prefix byte 0xF0 to the instruction
    Only one processor may access memory during the lock
#ifdef CONFIG_SMP
#define LOCK "lock ; "
#else
#define LOCK ""
#endif
Defined as static inline
 o Standalone object code for the functions may be created by the compiler, if required
static __inline__ void atomic_inc(atomic_t *v)
{
    __asm__ __volatile__(
        LOCK "incl %0"
        :"=m" (v->counter)
        :"m" (v->counter));
}
Atomic add of i to atomic type v
 o The "ir" constraint indicates that either an immediate value or a register may be assigned by the compiler to the integer i
static __inline__ void atomic_add(int i, atomic_t *v)
{
    __asm__ __volatile__(
        LOCK "addl %1,%0"
        :"=m" (v->counter)
        :"ir" (i), "m" (v->counter));
}
static __inline__ void atomic_sub(int i, atomic_t *v)
{
    __asm__ __volatile__(
        LOCK "subl %1,%0"
        :"=m" (v->counter)
        :"ir" (i), "m" (v->counter));   /* "ir": immediate or register */
}
Since the variable in question is actually a data structure member, it must be accessed via a read operation:

#define atomic_read(v)    ((v)->counter)

 o atomic_set must be used to write
 o This sets the value of v to that of the integer i

#define atomic_set(v,i)   (((v)->counter) = (i))
ARM atomic instructions
 o Read operations are inherently atomic
 o Write operations must be protected
Here is an example of atomic_set
 o This sets v equal to i
 o Its functionality as a kernel library function is the same as its i386 counterpart
 o However, due to processor differences between IA-32 and ARM, its underlying implementation is quite different
The instruction Load Exclusive, LDREX R1, [R2], is implemented on ARM
 o This loads R1 with the contents of the memory register addressed by the contents of R2
 o Then, this initializes a monitor
    The monitor observes any write action on the address-data bus that may occur on the 32b memory block pointed to by the contents of R2
    This write action may occur due to the operation of another CPU that shares the memory space and address-data bus
    The occurrence of a write action can then be detected subsequently by STREX
The instruction Store Exclusive, STREX R1, R2, [R3]
 o Stores R2 into the memory register addressed by R3
 o If the write is successful, by the definition that the previously initialized monitor shows no prior writes, then the data pointed to by [R3] has been written atomically
 o A successful write is returned as a zero value in R1
If a failure is detected, this function continues to attempt to initialize a monitor, store, and verify. An example is atomic_set() for ARM
 o First read the value of v->counter (sets the monitor)
    This is a guard instruction
    The value of counter is not needed
 o Then start the store operation with STREX
 o Check the monitor
 o Loop until success is detected
Note this sequence will be inserted inline in code, not called as a function
 o By declaring it static inline, this code can be included in any kernel function
static inline void atomic_set(atomic_t *v, int i)
{
    unsigned long tmp;

    __asm__ __volatile__("@ atomic_set\n"
"1:     ldrex   %0, [%1]\n"      /* tmp register receives the contents at
                                  * the memory address containing counter;
                                  * this initializes the monitor */
"       strex   %0, %2, [%1]\n"  /* i (third on arg list) is stored into
                                  * the mem register at address of v->counter */
"       teq     %0, #0\n"        /* test if reg 0 is cleared */
"       bne     1b"              /* if not success, branch back to label 1 */
    : "=&r" (tmp)                /* & requires separation of output and
                                  * input register choices by the compiler */
    : "r" (&v->counter), "r" (i)
    : "cc");                     /* sequence may have modified cpsr;
                                  * compiler must ensure cpsr is protected */
}
The volatile keyword suppresses code optimizations that could remove or reorder these operations when the functions are inlined
 o Example: consider the sequence
volatile unsigned long *output_port = memory_mapped_interface_address;

*output_port = CONTROL_WORD_1;   /* set high */
*output_port = CONTROL_WORD_2;   /* set low */
static inline void clear_bit(int nr, volatile unsigned long *addr)
{
    __asm__ __volatile__( LOCK_PREFIX
        "btrl %1,%0"     /* Bit Test and Reset Long:
                          * %1 is the bit offset, nr;
                          * the bit of %0 indicated by nr is cleared */
        :"=m" (ADDR)     /* output operand */
        :"Ir" (nr));     /* I identifies a constant in the range 0 to 31 */
}
 o Change a bit in memory
static inline void change_bit(int nr, volatile unsigned long *addr)
{
    __asm__ __volatile__( LOCK_PREFIX
        "btcl %1,%0"
        :"=m" (ADDR)
        :"Ir" (nr));
}
Example: bit = 4

bit        = 0000 0000 0000 0100
bit & 31   = 0000 0000 0000 0100 & 0000 0000 0001 1111
           = 0000 0000 0000 0100
mask       = 1UL << (4 & 31) = 1UL << 4
           = 0000 0000 0001 0000
~mask      = 1111 1111 1110 1111
bit >> 5   = 0, so p = p + 0
*p         = *p & 1111 1111 1110 1111   (clears bit 4)
static inline void ____atomic_clear_bit(unsigned int bit, volatile unsigned long *p)
{
    unsigned long flags;
    unsigned long mask = 1UL << (bit & 31);

    p += bit >> 5;
    local_irq_save(flags);
    *p &= ~mask;
    local_irq_restore(flags);
}
Example: bit = 34

bit        = 0000 0000 0010 0010
(34 & 31)  = 0000 0000 0010 0010 & 0000 0000 0001 1111
           = 0000 0000 0000 0010
mask       = 1UL << (34 & 31) = 1UL << 2
           = 0000 0000 0000 0100
bit >> 5   = 1, so p = p + 1   (this advances the pointer to the next word)
*p         = *p & 1111 1111 1111 1011   (clears bit 2 in the second word, i.e. bit location 34)
static inline void ____atomic_set_bit(unsigned int bit, volatile unsigned long *p)
{
    unsigned long flags;
    unsigned long mask = 1UL << (bit & 31);

    p += bit >> 5;
    local_irq_save(flags);
    *p |= mask;
    local_irq_restore(flags);
}
In the hierarchy of control sequence locking, the spinlock is the most efficient in its initialization and use, but also represents the most significant impact on the kernel.
 o Design requirements
    Intended for application to locking where resource hold time is short
    Fast initialization
    Fast access and return
    Small cache footprint
    Design will tolerate processing overhead
There are two primary alternatives for code sequence locking
 o A process operating on one CPU (in an SMP system) may seek to acquire a resource or enter a sequence of instructions. If the lock is not accessible, the process (a kernel thread) may be designed to be dequeued until the lock is available.
    If the lock is anticipated to not be available for an extended period, then this is acceptable
    The process of dequeue and then enqueue incurs latency: a context switch exiting and entering
 o An alternative, for design requirements where it is known in advance that the lock acquisition delay will be short, is the spin lock
    The kernel thread process is not dequeued during the lock delay
    Rather, it continues to test the lock
Operations
 o A process that may wish to protect a code sequence sets a spinlock
 o Only one spinlock is available per thread
 o A second process requesting the lock makes repeated attempts to gain the lock
    It remains in a busy loop, testing the lock during each period that it is scheduled
    If the previous task releases the lock, it will be discovered to be available when the new task seeking the lock is scheduled
Characteristics
 o The spinlock is central in kernel code
 o The acquisition of a spinlock does not disable interrupt operations
    The spin results from an interruptible NOP loop that has been inserted
 o An example of a potential deadlock failure results from a process having acquired a spinlock and then being interrupted and replaced by an ISR that seeks the same spinlock
Usage rules
 o Spinlocks are appropriate for fast execution (less than the time required for two context switches)
 o Sleep operations should not be started in a sequence of execution after a lock is acquired
 o The probability that an interrupt service routine may require the same lock resource is high
First, setting the lock (in kernel/spinlock.c)
 o __lockfunc defines fastcall and the directive that the first three function arguments are to be placed in registers as opposed to the stack (as would be the compiler default)
 o FASTCALL macro settings:

    #define fastcall __attribute__((regparm(3)))
The actual spinlock
 o slock = 1 if the lock is available
 o slock = 0 after the decrement on a lock request, if the request is successful
static inline void _raw_spin_lock(spinlock_t *lock)
{
    __asm__ __volatile__(
        spin_lock_string
        :"=m" (lock->slock) : : "memory");
}
static inline void _raw_spin_lock(spinlock_t *lock)
{
    __asm__ __volatile__(
        "1:\t"
        "lock ; decb %0\n\t"
        "jns 3f\n"
        "2:\t"
        "rep;nop\n\t"
        "cmpb $0,%0\n\t"
        "jle 2b\n\t"
        "jmp 1b\n"
        "3:\n\t"
        : "=m" (lock->slock) : : "memory");
}
 o decb decrements the spinlock value; note that argument %0 points to lock->slock
 o Tests whether the decrement of the lock reached zero (the lock state was therefore one at the time of access)
Note, it is not adequate to merely set slock
 o Consider multiple CPUs in a race to set the bit
 o The decrement removes the race condition
    Each CPU can decrement the lock
    No CPU can exit the spinlock until the lock becomes set to one and may be decremented
 o If the sign flag is not set, then the spinlock was 1 and is now zero; thus jump to 3 and continue
    The thread executing this sequence now owns the lock
 o If the spinlock value was zero, a decrement will yield a negative value
    The lock was therefore taken previously
 o Then, the system compares the memory register with zero
    If less than or equal, remain in the loop, since other CPU processes may be decrementing slock
    If greater than zero, the lock must be set to 1 and is free
 o However, this system does not exit immediately; a race condition with multiple CPUs is in progress
    Another CPU has set the lock to 1
    But yet another CPU may now have decremented the lock
The test is: can this thread successfully decrement the lock to zero from one
 o If so, this is the only thread that owns the lock
 o If not, this CPU has lost the race
Hence, this system performs one more test to ensure
 o The current thread acquires the lock
 o No other thread has or can acquire the lock
static inline void rep_nop(void)
{
    __asm__ __volatile__("rep;nop": : :"memory");
}

#define cpu_relax() rep_nop()
Detail point on optimization
 o Analysis has shown that some systems spend a significant fraction of time in the spinlock state
    The delay may be unavoidable
    However, it introduces undesired power dissipation
Now, the rep;nop sequence introduces a method for signaling the CPU that the current thread is executing and waiting for a spinlock
 o The rep prefix ordinarily causes a string instruction to be repeated a number of times equal to the contents of the cx register
    The assembler introduces the rep opcode prefix byte 0xF3 prepending the instruction
    It applies only to string instructions and defaults to a single NOP otherwise
 o However, the processor observes the presence of rep;nop
 o The processor then may adjust clock frequency and core voltage, and reduce energy usage
The cpu_relax() macro, which includes the rep;nop sequence, has appeared in recent kernel versions
Setting a lock while disabling interrupts
 o _spin_lock_irq(spinlock_t *lock)
 o Interrupts are enabled unconditionally at the time the lock is released
    This may create an error condition if interrupts were previously disabled
Setting a lock while disabling interrupts and storing interrupt state
 o _spin_lock_irqsave(spinlock_t *lock, unsigned long flags)
 o This enables the state of interrupts to be stored at the time the lock is set
 o Interrupts are enabled at the time the lock is released only if they were initially enabled
#define local_irq_save(x) __asm__ __volatile__( \
    "pushfl ; popl %0 ; cli"   /* disable interrupts */ \
    :"=g" (x): \
    :"memory")
In asm-i386/system.h
 o This is called with a flags argument
 o The flags are first saved on the stack
 o The flag value is then popped into a general purpose register; the stack pointer is returned to its initial value
 o Thus, the flags are now saved as a temporary variable (in the next stack entry)
The "memory" keyword informs the compiler that memory has been changed, blocking compiler attempts to reorder the sequence of control
Setting a lock while disabling interrupt bottom halves
 o Critical for many network drivers
 o This permits hardware interrupts to proceed
 o However, the computational demand of bottom halves, which would otherwise delay interrupt service routines, is not introduced
 o For example, timer interrupts and other critical events
 o _spin_lock_bh(spinlock_t *lock)
 o This enables the state of interrupts to be stored at the time the lock is set
 o Interrupts are enabled at the time the lock is released only if they were initially enabled
This takes us to interrupts.c
 o Now, softirqs (for networking, for example) will only be allowed if the preemption counter is less than SOFTIRQ_OFFSET
 o If many processes have incremented the preemption counter, the policy is to not add yet another task, but rather allow these to complete
 o Now, with the increase in preempt count, BH are disabled, since the number of allowed softirqs will be incremented above the limit by simply adding the max value to the current value
 o This creates a convenient method for returning and restoring the preemption count while gating BH operations
 o Note the while (0) construct and barrier
Design goals
 o Release the lock resource
 o Enable preemption
 o Evaluate whether rescheduling should occur
Release spin_lock
In preempt.h
 o Enable preemption
Call reschedule if current is flagged
 o This return from a spinlock represents an important opportunity to exploit the resched option
Unlock (the lock argument applies only if debugging is applied); this is removed below
Sets the spin lock bit; note the memory barrier
#define local_irq_restore(x) do { \
    if ((x & 0x000000f0) != 0x000000f0) \
        local_irq_enable(); \
} while (0)

#define local_irq_enable() __asm__ __volatile__( "sti" : : :"memory")
#define spin_unlock_bh(lock)
In softirq.c, local_bh_enable is found
 o Note the recovery using the (SOFTIRQ_OFFSET - 1) subtraction
 o This removes the SOFTIRQ_OFFSET and enables SOFTIRQ threads to be executed by softirqd
 o However, preemption remains disabled (due to the -1 above) if it was disabled previous to this action
 o Note that a check is made that we are not in interrupt context and that there is a pending softirq
 o Then the softirq is actually performed immediately, before any other process that the scheduler may have selected
Note that the preemption counter is decremented; preemption will be enabled when this reaches zero. Note that the resched check is called.
void local_bh_enable(void)
{
    sub_preempt_count(SOFTIRQ_OFFSET - 1);
    if (unlikely(!in_interrupt() && local_softirq_pending()))
        do_softirq();
    dec_preempt_count();
    preempt_check_resched();
}
Lock state testing
 o The lock state can be tested without locking, to enable flow control
    For example, spin_trylock(), spin_trylock_bh()
    Implemented with the atomic xchgl instruction on x86
static inline int __raw_spin_trylock(raw_spinlock_t *lock)
{
    int oldval;

    __asm__ __volatile__(
        "xchgl %0,%1"
        :"=q" (oldval), "=m" (lock->slock)
        :"0" (0) : "memory");
    return oldval > 0;
}
Implemented in the ARM architecture:
 o Loads the lock value
 o Stores exclusive if equal (if the lock value is 0)
 o Otherwise, exits with the lock value in tmp
static inline int __raw_spin_trylock(raw_spinlock_t *lock)
{
    unsigned long tmp;

    __asm__ __volatile__(
"       ldrex   %0, [%1]\n"
"       teq     %0, #0\n"
"       strexeq %0, %2, [%1]"
    : "=&r" (tmp)
    : "r" (&lock->lock), "r" (1)
    : "cc");

    if (tmp == 0) {
        smp_mb();
        return 1;
    } else {
        return 0;
    }
}
kthread_mod_coord.c: demonstration of multiple kernel thread creation and binding on a multicore system
 o This system includes spinlock synchronization
/* array of pointers to thread task structures */
#define MAX_CPU 16
#define LOOP_MAX 10
#define BASE_PERIOD 200
#define INCREMENTAL_PERIOD 30
#define WAKE_UP_DELAY 0
static struct task_struct *kthread_cycle_[MAX_CPU];
static int kthread_cycle_state = 0;
static int num_threads;
static int cycle_count = 0;
static spinlock_t kt_lock = SPIN_LOCK_UNLOCKED;

static int cycle(void *thread_data)
{
    int delay, residual_delay;
    int this_cpu;
    int ret;
    int loops;

    delay = BASE_PERIOD;
    for (loops = 0; loops < LOOP_MAX; loops++) {
        this_cpu = get_cpu();
        delay = delay + this_cpu*INCREMENTAL_PERIOD;
        ret = spin_is_locked(&kt_lock);
        if (ret != 0) {
            printk("kthread_mod: cpu %i start spin cycle\n", this_cpu);
exit loop
    while (!kthread_should_stop()) {
        delay = 1 * HZ;
        set_current_state(TASK_UNINTERRUPTIBLE);
        residual_delay = schedule_timeout(delay);
        printk("kthread_mod: wait for stop pid %i cpu %i \n",
               current->pid, this_cpu);
    }
    printk("kthread_mod: cycle function: stop state detected for cpu %i\n",
           this_cpu);
    return 0;
}

int init_module(void)
{
    int cpu = 0;
    int count;
    int this_cpu;
    int num_cpu;
    int delay_val;
    int *kthread_arg = 0;
    int residual_delay;
    const char thread_name[] = "cycle_th";
    const char name_format[] = "%s/%d";   /* format name and cpu id */

    num_threads = 0;
    num_cpu = num_online_cpus();
    this_cpu = get_cpu();
    printk
Start up
 o Note coordination
 o Note that locking occurs
    However, locking only occurs when the relationship between delays leads to resource contention
[61888.295386] kthread_mod: init task 17348 cpu 2 of total CPU 4
[61888.297709] kthread_mod: current task 17348 cpu 2 create/wake next thread
[61888.297805] kthread_mod: lock pid 17349 cpu 0 delay 200 count 0
[61888.301142] kthread_mod: current task 17348 cpu 2 create/wake next thread
[61888.301158] kthread_mod: cpu 1 start spin cycle
[61888.309106] kthread_mod: current task 17348 cpu 2 create/wake next thread
[61888.309146] kthread_mod: cpu 2 start spin cycle
[61889.004148] kthread_mod: current task 17348 cpu 0 create/wake next thread
[61889.004161] kthread_mod: cpu 3 start spin cycle
[61889.100033] kthread_mod: unlock pid 17349 cpu 0 delay 200 count 0
[61889.100073] kthread_mod: cpu 0 start spin cycle
[61889.100080] kthread_mod: lock pid 17350 cpu 1 delay 230 count 0
[61890.020530] kthread_mod: unlock pid 17350 cpu 1 delay 230 count 0
[61890.020581] kthread_mod: cpu 1 start spin cycle
[61890.020588] kthread_mod: lock pid 17351 cpu 2 delay 260 count 0
[61891.061032] kthread_mod: unlock pid 17351 cpu 2 delay 260 count 0
[61891.061074] kthread_mod: cpu 2 start spin cycle
[61891.061080] kthread_mod: lock pid 17352 cpu 3 delay 290 count 0
[61892.217531] kthread_mod: unlock pid 17352 cpu 3 delay 290 count 0
[61892.217572] kthread_mod: cpu 3 start spin cycle
[61892.217582] kthread_mod: lock pid 17349 cpu 0 delay 200 count 0
[61893.312070] kthread_mod: unlock pid 17349 cpu 0 delay 200 count 0
[61893.312131] kthread_mod: cpu 0 start spin cycle
[61893.312138] kthread_mod: lock pid 17350 cpu 1 delay 260 count 0
[61895.332564] kthread_mod: unlock pid 17350 cpu 1 delay 260 count 0
[61895.332609] kthread_mod: cpu 1 start spin cycle
[61895.332617] kthread_mod: lock pid 17351 cpu 2 delay 320 count 0
[61897.921049] kthread_mod: unlock pid 17351 cpu 2 delay 320 count 0
[61897.921094] kthread_mod: cpu 2 start spin cycle
[61897.921101] kthread_mod: lock pid 17352 cpu 3 delay 380 count 0
[61899.889564] kthread_mod: unlock pid 17352 cpu 3 delay 380 count 0
[61899.889609] kthread_mod: cpu 3 start spin cycle
[61899.889619] kthread_mod: lock pid 17349 cpu 0 delay 200 count 0
[61902.088074] kthread_mod: unlock pid 17349 cpu 0 delay 200 count 0
[61902.088118] kthread_mod: cpu 0 start spin cycle
[61902.088124] kthread_mod: lock pid 17350 cpu 1 delay 290 count 0
[61903.248533] kthread_mod: unlock pid 17350 cpu 1 delay 290 count 0
[61903.248577] kthread_mod: cpu 1 start spin cycle
[61903.248586] kthread_mod: lock pid 17351 cpu 2 delay 380 count 0
The top utility, operating in batch mode, may record per-CPU load
 o Note the processor usage at 100% system and 0% user
    This is a direct result of the resource cost associated with the spinlock
 o Note: three processors (threads) are operating at full load
 o One processor, CPU3, is executing (and is in a sleep state); note that this processor is spending the balance of its time in the idle thread
Entries indicate percentage of time processor was executing a task other than the idle task during the time since the last screen update
Note the behavior above:
o CPU 0 wins the race to acquire the spinlock and remains in a sleep state, requiring no CPU load
o CPUs 1, 2, and 3 operate at 100 percent load, waiting for the spinlock resource to become available
o Again at t = 1 second, a race condition occurs and CPU 1 wins
Note the behavior above over an extended period
o CPU 0, 1, and 3 acquire the spinlock
o CPU 2 does not acquire the lock
Conventional multiprocessor systems suffer from energy inefficiency: processors waiting for spinlocks expend energy polling them
o A significant fraction of processor time may be lost in synchronization
o Problems include priority inversion, deadlocks, and convoy behavior
o Convoy behavior: groups of processors executing control sequences in parallel and stalling in synchrony, waiting for the same lock
Energy saving is now ensured by placing the processor in a temporary stall state, with the ability to wake the processor within one cycle of the lock being freed
o Notification via the SCU (Snoop Control Unit) in the multiprocessor core
o Ensures that all caches are coherent
o Signal propagates to all CPUs (see unlock)
o wfene instruction - wait for event (receive notification)
o sev instruction - send event (notify waiting CPUs)
static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
    unsigned long tmp;

    __asm__ __volatile__(
"1: ldrex   %0, [%1]\n"     /* load lock member of &lock into reg */
"   teq     %0, #0\n"       /* test lock value */
"   wfene\n"                /* wait for notification if lock held */
"   strexeq %0, %2, [%1]\n" /* attempt to store 1 in lock */
"   teqeq   %0, #0\n"       /* test for successful store */
"   bne     1b"             /* loop if unsuccessful */
    : "=&r" (tmp)
    : "r" (&lock->lock), "r" (1)
    : "cc");

    smp_mb();
}
wmb() and rmb() are both defined as mb() for ARM
o #define mb() __asm__ __volatile__ ("" : : : "memory")
o This ensures that any writes or reads of variables that are being protected by the spinlock are scheduled prior to releasing the lock
static inline void __raw_spin_unlock(raw_spinlock_t *lock)
{
    smp_mb();

    __asm__ __volatile__(
"   str %1, [%0]\n"                 /* release: store 0 in lock member */
"   mcr p15, 0, %1, c7, c10, 4\n"   /* drain store buffer */
"   sev"                            /* send event to waiting CPUs */
    :
    : "r" (&lock->lock), "r" (0)
    : "cc");                        /* CPSR updated */
}
Drain store buffer operation
o Forces synchronization of this stored data into the D-cache of each processor
14. RW SPINLOCKS
A spinlock allows only one sequence of control to enter a sequence of instructions
An alternative exists for the spinlock: the reader/writer lock
o The reader/writer lock admits many readers
o The lock prevents access by readers if a writer has taken the lock
o It permits only one writer
o A writer may not acquire the lock while any reader or other writer holds it
RW spinlocks are based on a rwlock_t structure
o This contains a counter variable equal to the number of readers that have acquired the rwlock
#define RW_LOCK_UNLOCKED
Implementation and usage
o For a sequence of control that the designer intends to use to read a shared memory resource, read_lock(rwlock_t *lock) is used
o All of the other variants above for spin_lock are included:
read_lock, read_lock_irq, read_lock_irqsave, read_lock_bh
read_unlock, read_unlock_irq, read_unlock_irqrestore, read_unlock_bh
read_trylock (added in 2.6 kernel)
write_lock, write_lock_irq, write_lock_irqsave, write_lock_bh
write_unlock, write_unlock_irq, write_unlock_irqrestore, write_unlock_bh
write_trylock
Consider write lock acquisition (called by writers)
o Recall that strex r1, r2, [r3] stores the contents of r2 into memory at the address contained in r3, and places a zero result in r1 if no other writes have occurred to [r3] since the previous ldrex
o Note this can also execute conditionally (strexeq, strexpl)
static inline void _raw_write_lock(rwlock_t *rw)
{
    unsigned long tmp;

    __asm__ __volatile__(
"1: ldrex   %0, [%1]\n"     /* load exclusive and monitor lock */
"   teq     %0, #0\n"       /* test if lock is zero */
"   strexeq %0, %2, [%1]\n" /* attempt to write the bias if zero - */
"   teq     %0, #0\n"       /*   note the above is conditional execution */
"   bne     1b"             /* spin until lock acquired */
    : "=&r" (tmp)
    : "r" (&rw->lock), "r" (0x80000000)
    : "cc", "memory");
}
Note that this sets a value of 0x80000000 (2^31), interpreted as a negative value
Write unlock merely involves clearing the lock (called by writers)
static inline void _raw_write_unlock(rwlock_t *rw)
{
    __asm__ __volatile__(
"   str %1, [%0]"           /* store zero at address &rw->lock */
    :
    : "r" (&rw->lock), "r" (0)
    : "cc", "memory");
}
The read lock must admit many readers
It must track the number of readers and prevent any writer from entering a code section until all readers have exited
o Each reader increments the lock on entry and decrements it on release
o Writers are only permitted to enter if the lock is set to zero
Here is the operation for read_lock; this is called by a reader attempting to enter a critical section
o Note: if a writer has taken the lock, its value will be -2^31 (0x80000000)
o The increment result below will then remain negative for up to 2^31 - 1 readers
If no reader or writer is present, the initial value of the lock variable is zero
o The lock is incremented by one for each reader upon acquiring the lock
This implementation tests for the presence of a writer and spins in that event
Otherwise, a reader is admitted: the lock value is incremented atomically by loading it exclusively (setting the monitor) and incrementing the value in a register initially equal to the lock value
o The lock value is stored only if the result is zero or positive
o It is then decremented if the lock is negative (returning the value to its prior state)
o Then, if negative, remain in a busy wait loop until the writer exits and releases the lock
o Otherwise the store will have occurred; exit
Note strexpl is Store Exclusive executing on a positive-or-zero (PL) comparison
o The result of adding 1 to the register will be positive only if the lock was initially zero or positive (no writer present)
o Otherwise, the LS modifier executes on Lower or Same
o Note: rsbpls %0, %1, #0 returns a negative result if the result of strexpl is non-zero
" "
load exclusive (&rw->lock) into reg increment lock value (blindly) store reg exclusive setting reg (tmp2) if result is positive indicating no writers present ; But, a value of 1 will appear in ; (tmp2) if lock has been modified ; Thus, must now decrement to return lock ; to its initial value in next ; instruction note %1 contains value 1 ; as a result of this event so, ; not required to load 1 immediate rsbpls %0, %1, #0\n" ; decrement lock if lower or same bmi 1b" ; branch if negative since lock value is : "=&r" (tmp), "=&r" (tmp2) ; negative and writers are present : "r" (&rw->lock) : "cc", "memory");
; ; ; ; ;
Now, unlocking proceeds as follows:
o Readers decrement the lock value on exiting
o This is performed exclusively (atomically)
o The lock value is positive while readers are present and decrements to 0 when all readers exit
Here is the operation for read_unlock called by a reader exiting a critical section
static inline void _raw_read_unlock(rwlock_t *rw)
{
    unsigned long tmp, tmp2;

    __asm__ __volatile__(
"1: ldrex   %0, [%2]\n"     /* load exclusive lock into reg */
"   sub     %0, %0, #1\n"   /* decrement lock value */
"   strex   %1, %0, [%2]\n" /* store lock value */
"   teq     %1, #0\n"       /* test for successful exclusive operation */
"   bne     1b"             /* branch if not exclusive */
    : "=&r" (tmp), "=&r" (tmp2)
    : "r" (&rw->lock)
    : "cc", "memory");
}
A writer may attempt to set the lock to the write bias (0x80000000), or exit immediately if another thread holds the lock
static inline int _raw_write_trylock(rwlock_t *rw)
{
    unsigned long tmp;

    __asm__ __volatile__(
"1: ldrex   %0, [%1]\n"
"   teq     %0, #0\n"
"   strexeq %0, %2, [%1]"   /* store exclusive if equal (lock free) */
    : "=&r" (tmp)
    : "r" (&rw->lock), "r" (0x80000000)
    : "cc", "memory");

    return tmp == 0;        /* nonzero tmp: lock held, exit without spinning */
}
The semaphore is a unique variable with the following characteristics:
o The semaphore value can be used to determine whether a process will execute or wait
o The semaphore may be operated on by wait or post

wait
The wait function causes the semaphore value to be decremented by 1 if the semaphore is non-zero
o The process calling wait on the semaphore is allowed to continue
o This operation is atomic in that it completes without interruption by other processes. Thus, two processes that each attempt to decrement the semaphore each decrement it by exactly one unit. If the semaphore value is one, only one of two such processes will be allowed to continue; the other will block.
If the semaphore is zero
o The process calling wait on the semaphore is blocked
o The process remains blocked until the decrement of the semaphore can return zero (as opposed to a negative value)
post
The post function increments the semaphore. This is again atomic: if two processes both attempt to increment a semaphore of value 0, it is incremented by two. Without atomicity, both processes might conclude that the proper value for the semaphore is 1.
A process may use the semaphore to protect a critical section of code such that its access to shared resources is protected (as if it were the only process operating) during a code sequence. This holds true even if the process is interrupted or taken from running to ready by the operating system.
IMPLEMENTATION
The next step in the locking hierarchy is the semaphore. This prevents a process from passing a point in the sequence of control defined by the semaphore.
o However, unlike spinlocks, semaphores cause a process that reaches a taken semaphore to sleep
o Formally, this means that the process (kernel thread) is dequeued, and a user space process or new kernel thread operates as a result of a context switch
o This is clearly efficient for designs where the sleep time is long
o However, scheduler latency must be accounted for
The semaphore design is considerably more complex:
o Task wait queue management
o Management of many waiting tasks that may be admitted when the semaphore is available
Again, a data structure is the design foundation
o struct semaphore, with data members:
o count: an atomic variable with these states:
  - Positive: semaphore is free
  - 0: semaphore is acquired, one thread is executing, and no other threads are sleeping while waiting for the semaphore
  - Negative: a number of threads equal to the absolute value of count are waiting for the semaphore
o wait: pointer to a linked list of waiting tasks
o sleepers: a flag indicating the presence of queued processes; zero if there are no sleeping processes, 1 otherwise
Functions
o An atomic down operation reduces the count variable
o If the semaphore is taken or busy, the task is placed on the wait queue until the semaphore state changes
Initialization (see /include/asm-i386/semaphore.h) o void sema_init (struct semaphore *sem, int val)
struct semaphore {
    atomic_t count;
    int sleepers;
    wait_queue_head_t wait;
};

static inline void sema_init(struct semaphore *sem, int val)
{
    atomic_set(&sem->count, val);
    sem->sleepers = 0;
    init_waitqueue_head(&sem->wait);
}
Mutex
o Initializing a semaphore to 1 produces a mutex variable
o This implies that only one lock holder is enabled: one thread that can occupy a code sequence
Requesting a semaphore: down(struct semaphore *sem)
o This will place a task that fails to receive the semaphore on the wait queue in a TASK_UNINTERRUPTIBLE state
Implementation of down()
o First, note the code structure
o It begins with an atomic decrement. If the decrement yields a negative result, the lock has been taken: jump to the section at LOCK_SECTION_START. Otherwise, exit the down() function
Optimization: note that access to the semaphore lock is likely to succeed and a failure is unlikely
This code is inlined in compilation with other code
o Included in a volatile block without reordering
o Note, LOCK_SECTION_START is defined to create a subsection for this code, separate from this section
Thus, as this code sequence is included in the inline function, only the decl and js instructions appear in the inline sequence o This prevents code in the lock section from being imported into the instruction cache, evicting other instructions more likely to be used
static inline void down(struct semaphore * sem)
{
    __asm__ __volatile__(
        LOCK "decl %0\n\t"      /* decrement sem->count */
        "js 2f\n"               /* jump on sign; otherwise exit this
                                   function, since the next instructions
                                   are not included inline */
        "1:\n"
        LOCK_SECTION_START("")
        "2:\tlea %0,%%eax\n\t"  /* load addr of sem in eax */
        "call __down_failed\n\t"
        "jmp 1b\n"              /* loop back to the label after
                                   LOCK_SECTION_START */
        LOCK_SECTION_END
        :"=m" (sem->count)
        :
        :"memory","ax");
}
o __down_failed prepares the call to __down
o Places the current task on a waitqueue
o Note that this will be included in the text section (code section) that is occupied by sched.c
/* include in sched.c text section */
asm(
".section .sched.text\n"
".align 4\n"
".globl __down_failed\n"
"__down_failed:\n\t"
    "pushl %edx\n\t"
    "pushl %ecx\n\t"
    "call __down\n\t"
    "popl %ecx\n\t"
    "popl %edx\n\t"
    "ret"
);
Examine __down()
o First, obtain a pointer to the task_struct of the current task
o Create a waitqueue entry for the current task
o The current task was TASK_RUNNING; now set its state member to TASK_UNINTERRUPTIBLE
Acquire a spinlock with interrupts disabled and with the ability to restore interrupts
Add the current task to the waitqueue associated with this semaphore
o Mark this entry as WQ_FLAG_EXCLUSIVE; this will control the waking process
o Tasks are added to the tail of the waitqueue
Increment the sleepers member
Now, enter a loop
o First, get the number of sleepers
o Add (sleepers - 1) to the semaphore counter
o If the result is not negative, set sleepers to zero and break out of the loop (the semaphore is acquired)
Upon entry to __down, the semaphore is down (the lock is held): count = 0 and no other sleeping task is waiting on the queue (the only other task of interest is the task that currently holds the semaphore)
o First, the semaphore count will have been decremented to -1 (by the decl in down())
o Then, sleepers is incremented by 1
o Then, the count value is updated by adding (sleepers - 1), the original sleep count
o This yields -1 + 0 = -1 (in our case with no other sleepers)
o Note the definition of atomic_add_negative: the result is true if the result of the addition is negative, otherwise false
This negative result will cause the conditional branch not to be taken
o Then, sleepers is set to 1; this indicates the presence of the task requesting the semaphore
o Call schedule()
o schedule() will observe the TASK_UNINTERRUPTIBLE status and dequeue this task. This leaves the task on the waitqueue, waiting for an event
After return from schedule()
o Locks are taken
o The task is marked TASK_UNINTERRUPTIBLE, and another check is performed on the semaphore status with sleepers = 1
o If the semaphore is not available, control remains in the loop: sleepers is set to 1 and schedule() is again called
If the semaphore count has been incremented (released) by other action, then the break is taken and each of the sleepers is removed in turn.
As the semaphore becomes available
o A call is made to release the spinlock and restore interrupts
o Any processes sleeping on the waitqueue will be activated
o With a set of rules to be seen below
The task is set to TASK_RUNNING o The next time that the scheduler function runs, this task is eligible for selection
    for (;;) {                  /* loop will not exit until */
                                /* all sleepers exit */
        int sleepers = sem->sleepers;

        if (!atomic_add_negative(sleepers - 1, &sem->count)) {
            sem->sleepers = 0;
            break;
        }
        sem->sleepers = 1;      /* this task - see -1 above */
        spin_unlock_irqrestore(&sem->wait.lock, flags);
        schedule();             /* will lead to sleep */
        spin_lock_irqsave(&sem->wait.lock, flags);
        tsk->state = TASK_UNINTERRUPTIBLE;
    }
It is important to consider how a list of tasks sleeping on the waitqueue may be activated (set to TASK_RUNNING)
o Consider the modified example where N tasks occupy the waitqueue
o These all entered the waitqueue through this function, so, upon being woken, they will execute this loop, entering the control flow immediately after schedule()
o Each task will thus execute the waitqueue removal and then wake_up_locked; the wake_up_locked function will activate each task in turn
void fastcall add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t *wait)
{
    unsigned long flags;

    wait->flags |= WQ_FLAG_EXCLUSIVE;
    spin_lock_irqsave(&q->lock, flags);
    __add_wait_queue_tail(q, wait);
    spin_unlock_irqrestore(&q->lock, flags);
}
Consider wake_up_locked: this will call __wake_up_common()
o It will wake up one exclusive task: the process that initially called __down
o It will place the task on the runqueue marked as TASK_RUNNING

Semaphore functions
static inline int down_interruptible(struct semaphore * sem) static inline int down_trylock(struct semaphore * sem) static inline void up(struct semaphore * sem)
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/timer.h>
#include <linux/cpumask.h>
#include <asm/semaphore.h>

#define MAX_CPU 16
#define LOOP_MAX 20
#define BASE_PERIOD 200
#define INCREMENTAL_PERIOD 30
#define WAKE_UP_DELAY 0
/* array of pointers to thread task structures */
static struct task_struct *kthread_cycle_[MAX_CPU];
static int kthread_cycle_state = 0;
static int num_threads;
static int cycle_count = 0;
static struct semaphore kthread_mod_sem;

static int cycle(void *thread_data)
{
    int delay, residual_delay;
    int this_cpu;
    int ret_sem;
    int loops;
    delay = BASE_PERIOD;
    for (loops = 0; loops < LOOP_MAX; loops++) {
        this_cpu = get_cpu();
        delay = delay + this_cpu * INCREMENTAL_PERIOD;
        printk("kthread_mod: cpu %i executing down on kthread_mod_semaphore \n",
            this_cpu);
        down(&kthread_mod_sem);
exit loop
    while (!kthread_should_stop()) {
        delay = 1 * HZ;
        set_current_state(TASK_UNINTERRUPTIBLE);
        residual_delay = schedule_timeout(delay);
        printk("kthread_mod: wait for stop pid %i cpu %i \n",
            current->pid, this_cpu);
    }
    printk("kthread_mod: cycle function: stop state detected for cpu %i\n",
        this_cpu);
    return 0;
}

int init_module(void)
{
    int cpu = 0;
    int count;
    int this_cpu;
    int num_cpu;
    int delay_val;
    int *kthread_arg = 0;
    int residual_delay;
    const char thread_name[] = "cycle_th";
    const char name_format[] = "%s/%d";
    num_threads = 0;
    num_cpu = num_online_cpus();
    this_cpu = get_cpu();
    printk("kthread_mod: init task %i cpu %i of total CPU %i \n",
        current->pid, this_cpu, num_cpu);
Note the behavior where the same thread on cpu 0 reacquires the semaphore
o As the thread executes up(), it returns and immediately executes down() again
o Unlike the spinlock example, other competing threads do not observe the availability of the semaphore, since their test of the semaphore can only occur at the rate of clock ticks
This is because each other thread must wait for the next timer tick (to be removed from the waitqueue and test the semaphore)
[5.356000] kthread_mod: cpu 0 executing down on kthread_mod_semaphore
[5.356000] kthread_mod: Thread pid 9637 acquired semaphore executing on cpu 0 delay 200 count 0
[5.360000] kthread_mod: current task 9631 cpu 3 create/wake next thread
[5.360000] kthread_mod: cpu 1 executing down on kthread_mod_semaphore
[5.364000] kthread_mod: current task 9631 cpu 2 create/wake next thread
[5.364000] kthread_mod: cpu 2 executing down on kthread_mod_semaphore
[5.368000] kthread_mod: current task 9631 cpu 3 create/wake next thread
[5.368000] kthread_mod: cpu 3 executing down on kthread_mod_semaphore
[6.156000] kthread_mod: Thread pid 9637 releasing semaphore executing on cpu 0 delay 200 count 0
[6.156000] kthread_mod: cpu 0 executing down on kthread_mod_semaphore
[6.156000] kthread_mod: Thread pid 9637 acquired semaphore executing on cpu 0 delay 200 count 0
[6.956000] kthread_mod: Thread pid 9637 releasing semaphore executing on cpu 0 delay 200 count 0
[6.956000] kthread_mod: cpu 0 executing down on kthread_mod_semaphore
[6.956000] kthread_mod: Thread pid 9637 acquired semaphore executing on cpu 0 delay 200 count 0
[7.756000] kthread_mod: Thread pid 9637 releasing semaphore executing on cpu 0 delay 200 count 0
[7.756000] kthread_mod: cpu 0 executing down on kthread_mod_semaphore
[7.756000] kthread_mod: Thread pid 9637 acquired semaphore executing on cpu 0 delay 200 count 0
[8.556000] kthread_mod: Thread pid 9637 releasing semaphore executing on cpu 0 delay 200 count 0
[8.556000] kthread_mod: cpu 0 executing down on kthread_mod_semaphore
[8.556000] kthread_mod: Thread pid 9637 acquired semaphore executing on cpu 0 delay 200 count 0
[9.356000] kthread_mod: Thread pid 9637 releasing semaphore executing on cpu 0 delay 200 count 0
[9.356000] kthread_mod: cpu 0 executing down on kthread_mod_semaphore
[9.356000] kthread_mod: Thread pid 9637 acquired semaphore executing on cpu 0 delay 200 count 0
[0.156000] kthread_mod: Thread pid 9637 releasing semaphore executing on cpu 0 delay 200 count 0
Then, after further delay, cpu 0 releases and completes its task. Then, cpu 1 wins the race to acquire the semaphore.
While responsiveness has degraded, computational load is vastly reduced. Examine the CPU loads below
o Tasks waiting for the semaphore are in a sleep state, and the CPUs are executing the idle task
top per-CPU excerpt (nice and idle columns):
 0.0%ni,100.0%id
 0.0%ni, 96.2%id
 0.0%ni,100.0%id
 0.0%ni,100.0%id
 0.0%ni, 98.0%id
 0.0%ni,100.0%id
 0.0%ni,100.0%id
 0.0%ni,100.0%id
17. RW SEMAPHORES
The next layer of synchronization functions merges semaphores and the read/write design
This produces a semaphore resource that may be held by any number of readers, but only one writer
Again, a rw_semaphore struct is defined in include/asm-i386/rwsem.h
o count: the counter variable is divided into a most significant and a least significant field
struct rw_semaphore {
    signed long         count;
    spinlock_t          wait_lock;
    struct list_head    wait_list;
};
o count: the number of readers is stored in the lower 16 bits and read with a mask value; the upper 16 bits count the number of writers (either 0 or 1)
o wait_list: a list of waiting processes
o wait_lock: a spinlock used for protecting the wait list
#define RWSEM_UNLOCKED_VALUE
#define RWSEM_ACTIVE_BIAS
#define RWSEM_ACTIVE_MASK
#define RWSEM_WAITING_BIAS
#define RWSEM_ACTIVE_READ_BIAS
#define RWSEM_ACTIVE_WRITE_BIAS     (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
static inline void init_rwsem(struct rw_semaphore *sem)
{
    sem->count = RWSEM_UNLOCKED_VALUE;
    spin_lock_init(&sem->wait_lock);
    INIT_LIST_HEAD(&sem->wait_list);
}
Optimization here
o LOCK_SECTION_START creates a separate assembler subsection
o This code sequence is not loaded into the instruction cache, in general
o This avoids contamination of the cache with code unlikely to be executed
static inline void __down_read(struct rw_semaphore *sem)
{
    __asm__ __volatile__(
        LOCK_PREFIX " incl (%%eax)\n\t" /* increment semaphore */
        " js 2f\n\t"                    /* sign set on a negative result;
                                           otherwise exit here */
        "1:\n\t"
        LOCK_SECTION_START("")          /* creates a subsection */
        "2:\n\t"
        " pushl %%ecx\n\t"              /* save reg state */
        " pushl %%edx\n\t"
        " call rwsem_down_read_failed\n\t"
        " popl %%edx\n\t"
        " popl %%ecx\n\t"
        " jmp 1b\n"
        LOCK_SECTION_END
        "# ending down_read\n\t"
        : "=m"(sem->count)
        : "a"(sem), "m"(sem->count)
        : "memory", "cc");
}
void __down_read(struct rw_semaphore *sem) int __down_read_trylock(struct rw_semaphore *sem) void __down_write(struct rw_semaphore *sem) int __down_write_trylock(struct rw_semaphore *sem) void __up_read(struct rw_semaphore *sem) void __up_write(struct rw_semaphore *sem)
The completion structure includes a waitqueue of all tasks that are blocked by the completion gate. In /include/linux/completion.h
A completion variable is initialized and the done flag is set to zero (analogous to the down action on a semaphore); then
o A task that encounters a call to wait_for_completion in its sequence of control will block. This task will be added to the waitqueue
In sched.c we find wait_for_completion
o Note, might_sleep compiles to a NOP if debugging is not set; it is removed here
o Checks the done flag: if done is 0, then declare a waitqueue entry and add the current task to the waitqueue tail
o The waitqueue entry's flag is made exclusive (WQ_FLAG_EXCLUSIVE is set to 1)
This enables only flagged tasks waiting on the waitqueue to be selected during wakeup (it is inefficient to wake all tasks, since only one may be scheduled for execution)
The task state is set to uninterruptible (there is no need to receive a signal here), and schedule() is called
Control remains in this do/while loop until done is nonzero; then
o Remove the task from the waitqueue and decrement done (closing the completion variable) on exit
void fastcall __sched wait_for_completion(struct completion *x)
{
    spin_lock_irq(&x->wait.lock);
    if (!x->done) {
        DECLARE_WAITQUEUE(wait, current);

        wait.flags |= WQ_FLAG_EXCLUSIVE;
        __add_wait_queue_tail(&x->wait, &wait);
        do {
            __set_current_state(TASK_UNINTERRUPTIBLE);
            spin_unlock_irq(&x->wait.lock);
            schedule();
            spin_lock_irq(&x->wait.lock);
        } while (!x->done);
        __remove_wait_queue(&x->wait, &wait);
    }
    x->done--;
    spin_unlock_irq(&x->wait.lock);
}
EXPORT_SYMBOL(wait_for_completion);
To open the completion gate, a call to complete() increments the done flag and then calls the wake_up function to wake tasks from the waitqueue
Now, the __wake_up_common function wakes the tasks on the completion waitqueue; note the completion is declared as x, and its wait member is an argument to __wake_up_common
o This wakes tasks in either state (TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE), with the number of exclusive tasks (one in this case), the sync setting, and a NULL key
o The sync setting enables priority checking: if sync is set to 1 and the priority of a task on the queue is greater than that of other tasks, then schedule() is called. If sync is set to 0, no priority checking occurs
o The wait bit key provides another filter level: a wait bit key may be set to further filter processes at wake_up
Instruction issue may be out of order to optimize throughput
This may result in out-of-order execution
Often, such reordering does not lead to errors, since no interdependency may apply
The processor depends on its instruction decoder to determine if a dependency exists that does not permit out-of-order execution. However, many examples appear in kernel code where dependencies are not apparent. A synchronization method referred to as a barrier is applied (we have seen this appear in our scheduling kernel code earlier).

mb(): a barrier that completes all read and write instructions before it, such that no reordering can occur across it
wmb(): a write barrier that completes all write instructions before it, such that no reordering of writes can occur across it
rmb(): a read barrier that completes all read instructions before it, such that no reordering of reads can occur across it
This differs from barrier() that we observed before. The presence of barrier() prevents reordering at compile time. The above prevent reordering at runtime Here is an example from the 2.6.11 kernel
The mfence instruction indicates that memory has been updated by this function, preventing the compiler and processor from assuming reordering is possible
The lock prefix forces an atomic operation; here it merely adds zero to the address pointed to by the stack pointer. The processor treats this as a memory barrier: all read (load) operations are required to have completed before this instruction is reached, since it forces an atomic operation, even locking the bus.