
LINUX Kernel

Chapter 3 Introduction to the Kernel

Processes and Tasks

Processes
    Seen from outside: individual processes exist independently.

Tasks
    Seen from inside: only one operating system is running.

[Figure: Process 1, Process 2, and Process 3 correspond to Task 1, Task 2, and Task 3 inside the system kernel with co-routines.]

Process States

[Figure: process state diagram. User mode: Running. System mode: Interrupt routine, System call, Return from system call, Waiting, Ready, Scheduler.]

Running
    The task is active and running in non-privileged user mode. If an interrupt or system call occurs, it is switched to privileged system mode.

Interrupt routine
    The hardware signals an exception condition.
    The clock generates a signal every 10 ms.

System call
    Entered through a software interrupt.

Waiting
    The task waits for an external event (e.g., I/O completion).

Return from system call
    Entered when a system call or interrupt is complete.
    The scheduler switches the process to the ready state.

Ready
    The task is competing for the processor.

Important Data Structures

Task structure
    task_struct is defined in include/linux/sched.h.
    It is also accessed by assembly code, so the order of the leading fields must not be changed and no declarations may be added in front of them.

States
    TASK_RUNNING (0): ready or running.
    TASK_INTERRUPTIBLE (1), TASK_UNINTERRUPTIBLE (2): waiting for certain events. A task in TASK_UNINTERRUPTIBLE cannot accept any signals while it waits.
    TASK_ZOMBIE (3): the process has terminated but still has its task structure.
    TASK_STOPPED (4): the process has been halted.
    TASK_SWAPPING (5): not used.

Task Structure

    struct task_struct {
        /* these are hardcoded - don't touch */
        volatile long state;
        long counter;
        long priority;
        unsigned long signal;
        unsigned long blocked;
        unsigned long flags;

    state: volatile indicates that this value can be altered by interrupt routines.
    counter: the time in ticks that the process can still run before a mandatory scheduling action is carried out. The scheduler uses counter as the dynamic priority.
    priority: the static priority of the process.
    signal: a bit mask of the signals received for the process. It is evaluated in the routine ret_from_sys_call(), which is called after every system call and after slow interrupts.
    blocked: a bit mask of the signals to be blocked.
    flags: the combination of the system status flags.

Task Structure

Process flags:

    #define PF_ALIGNWARN  0x00000001  /* Print alignment warning msgs */
                                      /* Not implemented yet, only for 486 */
    #define PF_PTRACED    0x00000010  /* set if ptrace (0) has been called. */
    #define PF_TRACESYS   0x00000020  /* tracing system calls */
    #define PF_FORKNOEXEC 0x00000040  /* forked but didn't exec */
    #define PF_SUPERPRIV  0x00000100  /* used super-user privileges */
    #define PF_DUMPCORE   0x00000200  /* dumped core */
    #define PF_SIGNALED   0x00000400  /* killed by a signal */
    #define PF_STARTING   0x00000002  /* being created */
    #define PF_EXITING    0x00000004  /* getting shut down */
    #define PF_USEDFPU    0x00100000  /* Process used the FPU this quantum (SMP only) */
    #define PF_DTRACE     0x00200000  /* delayed trace (used on m68k) */

Task Structure

    int errno;
    int debugreg[8];
        errno holds the error code of the last failed system call.
        debugreg contains the 80x86's debugging registers.

    struct exec_domain *exec_domain;
        Which UNIX is emulated for the process.

    struct task_struct *next_task, *prev_task;
        All processes are linked through these two pointers.
        init_task is both the start and the end of this list.

    struct task_struct *next_run, *prev_run;
        List of the processes that are competing for the processor.

    struct task_struct *p_opptr, *p_pptr, *p_cptr, *p_ysptr, *p_osptr;
        Pointers to the original parent, the parent, the youngest child, the next younger sibling, and the next older sibling, respectively.

[Figure: the parent's p_cptr points to its youngest child; each child's p_pptr points to the parent; the children are chained from the youngest child to the oldest child through p_osptr and in the opposite direction through p_ysptr.]
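The family pointers can be tried out in user space. The following is only a sketch: struct task is a hypothetical, cut-down stand-in for task_struct, and add_child() and count_children() are illustrative helpers, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for task_struct: only the family pointers. */
struct task {
    int pid;
    struct task *p_pptr;   /* parent */
    struct task *p_cptr;   /* youngest child */
    struct task *p_ysptr;  /* next younger sibling */
    struct task *p_osptr;  /* next older sibling */
};

/* Make child the youngest child of parent. */
static void add_child(struct task *parent, struct task *child)
{
    assert(parent != NULL && child != NULL);
    child->p_pptr = parent;
    child->p_ysptr = NULL;             /* nobody is younger yet */
    child->p_osptr = parent->p_cptr;   /* previous youngest is now older */
    if (parent->p_cptr)
        parent->p_cptr->p_ysptr = child;
    parent->p_cptr = child;
}

/* Walk from the youngest child towards the oldest and count them. */
static int count_children(const struct task *parent)
{
    const struct task *c;
    int n = 0;

    for (c = parent->p_cptr; c != NULL; c = c->p_osptr)
        n++;
    return n;
}
```

Starting at p_cptr and following p_osptr visits every child exactly once, which is why a single child pointer in the parent is enough.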

Task Structure

    struct mm_struct *mm;
        Memory management information (virtual memory).

    struct mm_struct {
        int count;
        pgd_t * pgd;
        unsigned long context;
        unsigned long start_code, end_code, start_data, end_data;
        unsigned long start_brk, brk, start_stack, start_mmap;
        unsigned long arg_start, arg_end, env_start, env_end;
        unsigned long rss, total_vm, locked_vm;
        unsigned long def_flags;
        struct vm_area_struct * mmap;
        struct vm_area_struct * mmap_avl;
        struct semaphore mmap_sem;
    };

Task Structure

    unsigned long kernel_stack_page;
        The stack used while the process is running in system mode.

    unsigned long saved_kernel_stack;
        Saves the old stack pointer when running the MS-DOS emulator (vm86).

    int pid, pgrp, session, leader;
        Process ID, process group ID, the session the process belongs to, and the session leader flag.

    unsigned short uid, euid, suid, fsuid;
    unsigned short gid, egid, sgid, fsgid;
        User ID, effective user ID, saved user ID, and file system user ID;
        group ID, effective group ID, saved group ID, and file system group ID.

Task Structure

uid, euid, suid, gid, egid, sgid
    Each process has a real user ID and group ID and an effective user ID and group ID.
    The real ID identifies the person using the system.
    The effective ID determines their access privileges.
    execve() changes the effective user or group ID to the owner or group of the executed file if the file has the set-user-ID (suid) or set-group-ID (sgid) mode set. The real UID and GID are not affected.
    The effective user ID and effective group ID of the new process image are saved as the saved set-user-ID and saved set-group-ID respectively, for use by setuid(3V).
    Turn on the set-user-ID and set-group-ID modes: chmod a+s filename

    uid and gid are inherited from the parent.
    euid, egid, fsuid, and fsgid can be set at run time (to the owner of the executable file).

    int groups[NGROUPS];
        A process may be assigned to many groups.

    struct fs_struct *fs;
        File system information.

    struct fs_struct {
        int count;                  /* for future expansions */
        unsigned short umask;       /* access mode */
        struct inode * root, * pwd; /* root dir and current dir */
    };
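From user space, the real and effective IDs can be inspected with the standard POSIX calls getuid(), geteuid(), getgid(), and getegid(). The sketch below is only an illustration; the helper names runs_with_changed_uid() and print_ids() are made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 1 if the effective user ID differs from the real user ID,
 * as happens after execve() of a set-user-ID file owned by another
 * user, and 0 otherwise.
 */
static int runs_with_changed_uid(void)
{
    return getuid() != geteuid();
}

/* Print the real and effective user and group IDs of this process. */
static void print_ids(void)
{
    printf("uid=%ld euid=%ld gid=%ld egid=%ld\n",
           (long)getuid(), (long)geteuid(),
           (long)getgid(), (long)getegid());
}
```

For an ordinary program started from a shell, the real and effective IDs are normally equal; they differ only when a set-user-ID or set-group-ID file has been executed or the process has changed them itself.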

Task Structure

    struct files_struct *files;
        Open file information (file descriptors).

    struct files_struct {       /* open file table structure */
        int count;
        fd_set close_on_exec;   /* files to be closed when exec is issued */
        fd_set open_fds;        /* open files (bitmask) */
        struct file * fd[NR_OPEN];
    };

Task Structure

    long utime, stime, cutime, cstime, start_time;
        Time spent in user mode, time spent in system mode, total time the child processes spent in user mode and in system mode, and the time when the process was created, respectively.

    unsigned long it_real_value, it_prof_value, it_virt_value;
    unsigned long it_real_incr, it_prof_incr, it_virt_incr;
    struct timer_list real_timer;
        Timer for the alarm system call (SIGALRM).
        The *_value fields hold the time in ticks until the timer is triggered, the *_incr fields hold the value used for reinitialization, and real_timer is the real-time interval timer.

Task Structure

    struct sem_undo *semundo;
        Semaphores that need to be released when the process terminates.

    struct sem_queue *semsleeping;
        Semaphore waiting queue.

    struct wait_queue *wait_chldexit;
        When a process calls wait4(), it waits in this queue until a child process terminates.

    struct rlimit rlim[RLIM_NLIMITS];
        Limits on the use of resources (setrlimit(), getrlimit()).

Task Structure

    struct signal_struct *sig;
    struct signal_struct {
        int count;
        struct sigaction action[32];
    };
        Signal handlers.

    int exit_code, exit_signal;
        The return code and the signal that caused the program to abort.

    char comm[16];
        Name of the program executed by the process.

Task Structure

    unsigned long personality;
        Description of the characteristics of this version of UNIX (see also exec_domain).

    int dumpable:1;
        Whether a memory dump is to be written.

    int did_exec:1;
        Whether the process is still running the old program (no execve, ...).

    struct desc_struct *ldt;
        Used by WINE, the Windows emulator.

Task Structure

    struct linux_binfmt *binfmt;
        Functions responsible for loading the program.

    struct thread_struct tss;
        Holds all the data on the processor status at the time of the last transition from user mode to system mode; all registers are saved here.
        struct thread_struct can be found in asm-i386/processor.h. Among other definitions, it includes 8086-related information:

        struct vm86_struct * vm86_info;
        unsigned long screen_bitmap;
        unsigned long v86flags, v86mask, v86mode;

Task Structure

    unsigned long policy, rt_priority;
        Scheduling policies: classic (SCHED_OTHER), real-time (SCHED_RR, SCHED_FIFO).
        rt_priority: real-time priority.

    #ifdef __SMP__
        int processor;
        int last_processor;
        int lock_depth;
    #endif
        On a multi-processor machine, the kernel needs to know on which processor the task is running, etc.

Process Table

    struct task_struct init_task;
        The start of the doubly linked task list.

    struct task_struct *task[NR_TASKS];
        The task table.

    #define current (0+current_set[smp_processor_id()])
    struct task_struct *current_set[NR_CPUS];
        The current process (one per processor on a multi-processor architecture).

    #define for_each_task(p) \
        for (p = &init_task ; (p = p->next_task) != &init_task ; )
        Macro for visiting all processes. The first task (init_task) is skipped.
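The effect of the for_each_task loop shape can be checked in user space. This is a sketch under stated assumptions: struct task is a hypothetical reduction of task_struct, the macro takes the list head as an explicit argument instead of the global init_task, and init_task_list(), link_task(), and count_tasks() are illustrative helpers only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for task_struct: only the list pointers. */
struct task {
    int pid;
    struct task *next_task, *prev_task;
};

/* Same shape as for_each_task, with the head passed explicitly. */
#define for_each_task_from(head, p) \
    for ((p) = (head); ((p) = (p)->next_task) != (head); )

/* An empty circular list: the head points to itself in both directions. */
static void init_task_list(struct task *head)
{
    head->next_task = head;
    head->prev_task = head;
}

/* Insert t at the tail of the circular list anchored at head. */
static void link_task(struct task *head, struct task *t)
{
    assert(head != NULL && t != NULL);
    t->next_task = head;
    t->prev_task = head->prev_task;
    head->prev_task->next_task = t;
    head->prev_task = t;
}

/* Count every task except the head, which the loop never visits. */
static int count_tasks(struct task *head)
{
    struct task *p;
    int n = 0;

    for_each_task_from(head, p)
        n++;
    return n;
}
```

Because the loop advances p before testing it, the head itself is never handed to the loop body, which matches the remark that init_task is skipped.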

Files and inodes

Two important structures: file and inode (linux/fs.h)

The file structure (the process's view)

    struct file {
        mode_t f_mode;
            Access mode when opened (RO, RW, WO).
        loff_t f_pos;
            Position of the read/write pointer (64-bit).
        unsigned short f_flags;
            Additional flags for controlling access rights (fcntl).
        unsigned short f_count;
            Reference count (dup, dup2, fork).
        struct file *f_next, *f_prev;
            Doubly linked list. Global variable: struct file *first_file;
        struct inode * f_inode;
            The actual description of the file.
        struct file_operations * f_op;
            Refers to a structure of function pointers for the file operations, i.e., the functions are not called directly. Since LINUX supports many file systems, a Virtual File System (VFS) is implemented.

The inode structure

    struct inode {
        kdev_t        i_dev;    /* which device the file is on */
        unsigned long i_ino;    /* position on the device */
        umode_t       i_mode;
        nlink_t       i_nlink;
        uid_t         i_uid;    /* owner user id */
        gid_t         i_gid;    /* owner group id */
        off_t         i_size;   /* size in bytes */
        time_t        i_atime;  /* time of last access */
        time_t        i_mtime;  /* time of last modification */
        time_t        i_ctime;  /* time of last modification to inode */

Memory Management

Macros

    #define __get_free_page(priority) __get_free_pages((priority),0,0)
    #define __get_dma_pages(priority, order) __get_free_pages((priority),(order),1)
    extern unsigned long __get_free_pages(int priority, unsigned long gfporder, int dma);

    Defined in linux/mm.h; the page size is 4 KB.
    priority: GFP_BUFFER, GFP_ATOMIC, GFP_KERNEL, GFP_NOBUFFER, GFP_NFS (what to do if not enough pages are free).
    order: number of pages to be reserved, given as a power of 2.
    dma: whether the pages must be addressable by the DMA component.

Functions

    extern inline unsigned long get_free_page(int priority)
    {
        unsigned long page;

        page = __get_free_page(priority);
        if (page)
            memset((void *) page, 0, PAGE_SIZE);
        return page;
    }
    get_free_page() clears the page before returning it.

    void *kmalloc(size_t size, int priority)
    void kfree(void *__ptr)
    The kernel's counterparts of malloc() and free().

Waiting Queues

Structure for waiting queues (include/linux/wait.h)

    struct wait_queue {
        struct task_struct * task;
        struct wait_queue * next;
    };
    A task waits in the queue until a condition is met.

Functions (sched.h)

    extern inline void add_wait_queue(struct wait_queue ** p, struct wait_queue * wait)
    extern inline void remove_wait_queue(struct wait_queue ** p, struct wait_queue * wait)

Functions (kernel/sched.c)

    void sleep_on(struct wait_queue ** p);
    void interruptible_sleep_on(struct wait_queue ** p);
    void wake_up(struct wait_queue ** p);
    void wake_up_interruptible(struct wait_queue ** p);

    sleep_on sets the process state to TASK_UNINTERRUPTIBLE; interruptible_sleep_on sets it to TASK_INTERRUPTIBLE.
    wake_up sets the process state to TASK_RUNNING.

Semaphores

Structure for semaphores (asm-i386/semaphore.h)

    struct semaphore {
        int count;
        int waiting;
        struct wait_queue * wait;
    };

Functions

    extern inline void down(struct semaphore * sem)
    extern inline void up(struct semaphore * sem)

System Time and Timers

    Time is measured in units of ticks (10 ms).
    The global variable jiffies denotes the time in ticks since the system was booted.

Structure for timers (old)

    struct timer_struct {
        unsigned long expires;
        void (*fn)(void);
    };
    extern struct timer_struct timer_table[32];
    extern unsigned long timer_active;   /* which entry is valid? */

Structure for timers (new)

    struct timer_list {
        struct timer_list *next;
        struct timer_list *prev;
        unsigned long expires;
        unsigned long data;               /* arguments */
        void (*function)(unsigned long);
    };
    extern void add_timer(struct timer_list * timer);
    extern int del_timer(struct timer_list * timer);
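How a list of struct timer_list entries can be kept and run is sketched below in user space. The add_timer(), del_timer(), and run_timer_list() bodies here are simplified re-implementations written for illustration, not the kernel's code; timer_head, record_data(), and last_data are names assumed for the example, and the current time is passed in as a parameter instead of being read from jiffies.

```c
#include <assert.h>
#include <stddef.h>

struct timer_list {
    struct timer_list *next;
    struct timer_list *prev;
    unsigned long expires;
    unsigned long data;                 /* argument for function */
    void (*function)(unsigned long);
};

/* Circular list head; timer_head.next is the timer that expires first. */
static struct timer_list timer_head = { &timer_head, &timer_head, ~0UL, 0, NULL };

/* Insert the timer so that the list stays sorted by expiry time. */
static void add_timer(struct timer_list *timer)
{
    struct timer_list *p = timer_head.next;

    assert(timer != NULL && timer->function != NULL);
    while (p != &timer_head && p->expires <= timer->expires)
        p = p->next;
    timer->next = p;                    /* link in front of p */
    timer->prev = p->prev;
    p->prev->next = timer;
    p->prev = timer;
}

/* Unlink the timer; returns 1 if it was queued, 0 otherwise. */
static int del_timer(struct timer_list *timer)
{
    if (timer->next == NULL)
        return 0;
    timer->next->prev = timer->prev;
    timer->prev->next = timer->next;
    timer->next = timer->prev = NULL;
    return 1;
}

/* Run every timer whose expiry time has been reached; returns the count. */
static int run_timer_list(unsigned long now)
{
    struct timer_list *t;
    int fired = 0;

    while ((t = timer_head.next) != &timer_head && t->expires <= now) {
        del_timer(t);                   /* one-shot: remove before calling */
        t->function(t->data);
        fired++;
    }
    return fired;
}

/* Example callback: remembers the data value of the last expired timer. */
static unsigned long last_data;
static void record_data(unsigned long data)
{
    last_data = data;
}
```

Keeping the list sorted means the expiry check only ever has to look at the first entry, in contrast to the old fixed table of 32 slots.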

Process Management

    Signal
    Interrupt
    Booting
    Timer
    Scheduler

Signal

    Signal      No.  Description
    SIGHUP       1   hangup
    SIGINT       2   interrupt
    SIGQUIT      3   quit
    SIGILL       4   illegal instruction
    SIGTRAP      5   trace trap
    SIGABRT      6   abort (generated by abort(3) routine)
    SIGIOT       6   Input/Output Trap (obsolete)
    SIGBUS       7   bus error
    SIGFPE       8   arithmetic exception
    SIGKILL      9   kill (cannot be caught, blocked, or ignored)
    SIGUSR1     10   user-defined signal 1
    SIGSEGV     11   segmentation violation
    SIGUSR2     12   user-defined signal 2
    SIGPIPE     13   write on a pipe or other socket with no one to read it
    SIGALRM     14   alarm clock
    SIGTERM     15   software termination signal
    SIGSTKFLT   16
    SIGCHLD     17   child status has changed
    SIGCONT     18   continue after stop
    SIGSTOP     19   stop (cannot be caught, blocked, or ignored)
    SIGTSTP     20   stop signal generated from keyboard
    SIGTTIN     21   background read attempted from control terminal
    SIGTTOU     22   background write attempted to control terminal
    SIGURG      23   urgent condition present on socket
    SIGXCPU     24   cpu time limit exceeded (see getrlimit(2))
    SIGXFSZ     25   file size limit exceeded (see getrlimit(2))
    SIGVTALRM   26   virtual time alarm (see getitimer(2))
    SIGPROF     27   profiling timer alarm (see getitimer(2))
    SIGWINCH    28   window changed (see termio(4) and win(4S))
    SIGIO       29   I/O is possible on a descriptor (see fcntl(2V))
    SIGPOLL     29   same as SIGIO
    SIGPWR      30   power failure (for UPS)
    SIGUNUSED   31

Signal System Calls

Important system calls

kill(int pid, int sig)
    Sends the signal sig to a process or a group of processes.
    If pid is greater than zero, the signal is sent to the process with the PID pid.
    If pid is zero, the signal is sent to the process group of the current process.
    If pid is -1, the signal is sent to all processes except the system processes and the current process.
    If pid is less than -1, the signal is sent to all processes of the process group -pid.
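The four pid cases can be written down as a small user-space helper. This is only a sketch: enum kill_target, classify_pid(), and process_exists() are names invented for the example. process_exists() relies on the standard behaviour that kill() with signal 0 performs the existence and permission checks without sending anything.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Who receives the signal for a given pid argument of kill(). */
enum kill_target {
    TARGET_PROCESS,     /* pid > 0:  the process with that PID */
    TARGET_OWN_GROUP,   /* pid == 0: process group of the caller */
    TARGET_ALL,         /* pid == -1: all processes except system ones and the caller */
    TARGET_GROUP        /* pid < -1: process group -pid */
};

static enum kill_target classify_pid(pid_t pid)
{
    if (pid > 0)
        return TARGET_PROCESS;
    if (pid == 0)
        return TARGET_OWN_GROUP;
    if (pid == -1)
        return TARGET_ALL;
    return TARGET_GROUP;
}

/*
 * Returns 1 if a process with this PID exists. EPERM also means the
 * process exists; the caller just may not signal it.
 */
static int process_exists(pid_t pid)
{
    assert(pid > 0);
    return kill(pid, 0) == 0 || errno == EPERM;
}
```

A typical use is process_exists(getpid()), which succeeds because a process is always allowed to signal itself.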

Signal System Calls

kill(int pid, int sig): permissions and errors
    The real or effective user ID of the sending process must match the real or saved set-user-ID of the receiving process, unless the effective user ID of the sending process is the super-user.
    The single exception is the signal SIGCONT, which only requires that the sending and receiving processes belong to the same session.

    Errors
        EINVAL: invalid sig
        ESRCH: process or process group does not exist
        EPERM: no privileges

Signal System Calls

kill(int pid, int sig): implementation
    linux/kernel/exit.c
    sys_kill() -> send_sig(), kill_pg(), kill_proc() -> generate()
    See also force_sig(), kill_sl().
    Also called from ret_from_sys_call() -> do_signal() -> send_sig()
        -> handle_signal() (signal.c, 223)
        -> setup_frame() (160)
        -> regs->eip = sa->sa_handler (213)

sys_kill

linux/kernel/exit.c, lines 318-339

    322-323: If pid is zero, the signal is sent to the process group of the current process.
    324-334: If pid is -1, the signal is sent to all processes except the system processes (PID = 0 or 1) and the current process. The for_each_task macro is defined in include/linux/sched.h, line 491. If count is zero, the error code ESRCH is returned.
    335-336: If pid is less than -1, the signal is sent to all processes of the process group -pid.
    338: If pid is greater than zero, the signal is sent to the process with the PID pid.

kill_pg

linux/kernel/exit.c, lines 258-275

    264-265: sig must be in [1..32], and pgrp (the process group ID) must be greater than zero.
    266-273: For each process whose process group ID is pgrp, send the signal sig to it (send_sig). On success, send_sig returns zero.
    274: If found = 0, no process has been found and the error ESRCH is returned; otherwise zero is returned.

kill_proc

linux/kernel/exit.c, lines 301-312

    305-306: sig must be in [1..32].
    307-310: If a process with the given pid is found, send the signal sig to it (send_sig).
    311: If no process has been found, return the error ESRCH.

send_sig

linux/kernel/exit.c, lines 73-101

    75-76: p cannot be null and sig must be less than or equal to 32.
    77: priv is the privilege (0 for a normal process, 1 for the super-user). SIGCONT can only be sent to a process that belongs to the same session.
    78-79: The real or effective user ID of the sending process must match the real or saved set-user-ID of the receiving process, unless the effective user ID of the sending process is the super-user.
    80: Super-user?
    81: If none of the above conditions is true, return an error.
    82-83: If sig = 0, do nothing.
    84-88: If sig in the task structure is null (the process is in the zombie state), do nothing.
    89-95: If sig is SIGKILL or SIGCONT and the process is in the state TASK_STOPPED, wake up the process and reset the SIGSTOP, SIGTSTP, SIGTTIN, and SIGTTOU signals.
    96-97: If sig is SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU, reset SIGCONT.
    99: Actually generate the signal.

generate

linux/kernel/exit.c, lines 29-51

    31: Set up the signal mask.
    32: Action of the signal, sa = p->sig->action[sig-1].
    39: If the signal is not blocked and the process is not traced,
    41: and if the handler of the signal is SIG_IGN (to be ignored) and the signal is not from a state change of a child process,
    42: then return immediately.
    44-46: If the handler is SIG_DFL (default action) and the signal is SIGCONT, SIGCHLD, SIGWINCH, or SIGURG, then return immediately. (The wake-up has already been done for SIGCONT.)
    48: Finally, set the signal.
    49-50: If the receiving process is interruptible and the signal is not blocked, wake up the process.

force_sig

linux/kernel/exit.c, lines 57-70

    Forces a signal to be sent to a process (it cannot be ignored).
    60: If the process is not in the zombie state,
    61-62: set up the signal mask and get the signal action structure.
    63: Really set the signal.
    64: The signal cannot be blocked, so clear the bit in p->blocked.
    65-66: If the handler is SIG_IGN, reset it to SIG_DFL.
    67-68: Wake up the process if it is interruptible.

kill_sl

linux/kernel/exit.c, lines 282-299

    Sends a signal to the session leader.
    288-289: sig must be in [1..32]. The session must be greater than zero.
    290-297: For each process, check whether its session ID is equal to sess and the process is the session leader; if so, send the signal to the session leader (send_sig).
    298: Return an error if no process is found.

Signal System Calls

Important system calls

sigaction(int sig, struct sigaction *act, *oact)
    Examine and change a signal action (see also signal()).
    act: the new action; oact: the old action (returned).

    struct sigaction {
        __sighandler_t sa_handler;   /* SIG_DFL, SIG_IGN, or a handler function */
        sigset_t sa_mask;            /* signals to be blocked during execution of handler */
        unsigned long sa_flags;      /* SA_ONSTACK: on sig stack
                                        SA_INTERRUPT: do not restart system call on signal return
                                        SA_RESETHAND: reset handler to SIG_DFL when signal taken
                                        SA_NOCLDSTOP: don't send SIGCHLD on child stop */
        void (*sa_restorer)(void);
    }
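A user-space program uses sigaction() roughly as follows. This is a minimal sketch against the POSIX interface rather than the kernel-internal structure; on_usr1(), usr1_count, and demo_sigaction() are names chosen for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t usr1_count;

/* Handler: only touches a sig_atomic_t, which is safe in a handler. */
static void on_usr1(int sig)
{
    (void)sig;
    usr1_count++;
}

/*
 * Install on_usr1 for SIGUSR1 (act), send the signal to ourselves
 * once, then restore the previous action (oact). Returns how often
 * the handler ran, or -1 on error.
 */
static int demo_sigaction(void)
{
    struct sigaction act, oact;

    memset(&act, 0, sizeof act);
    act.sa_handler = on_usr1;
    sigemptyset(&act.sa_mask);      /* block nothing extra in the handler */
    act.sa_flags = 0;

    usr1_count = 0;
    if (sigaction(SIGUSR1, &act, &oact) == -1)
        return -1;
    raise(SIGUSR1);                 /* handler runs before raise() returns */
    if (sigaction(SIGUSR1, &oact, NULL) == -1)
        return -1;
    return usr1_count;
}
```

Passing oact back to sigaction() at the end is the usual way to leave the process's signal setup as it was found.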

Signal System Calls

Important system calls

sigprocmask(int how, sigset_t *set, *oset)
    Examine and change the calling process's blocked signals.
    how
        SIG_BLOCK: add the signals in set to the blocked signals
        SIG_UNBLOCK: remove the signals in set from the blocked signals
        SIG_SETMASK: replace the blocked signals with set
    The previous set of blocked signals is returned in oset.
    SIGKILL and SIGSTOP cannot be blocked.
    The behaviour is undefined for SIGFPE, SIGILL, and SIGSEGV if they are blocked when they are generated.
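The interplay of blocking, pending, and delivery can be observed from user space. The sketch below uses the POSIX calls sigprocmask(), sigpending(), and sigaction(); on_usr2(), delivered, and demo_block_and_pending() are names made up for the example, and it assumes a single-threaded program.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t delivered;

static void on_usr2(int sig)
{
    (void)sig;
    delivered = 1;
}

/*
 * Block SIGUSR2, raise it, check that it is pending but not yet
 * delivered, then unblock it. Returns 1 if all three observations
 * hold, 0 if not, -1 on error.
 */
static int demo_block_and_pending(void)
{
    struct sigaction act, oact;
    sigset_t set, oset, pending;
    int held_back, was_pending;

    memset(&act, 0, sizeof act);
    act.sa_handler = on_usr2;
    sigemptyset(&act.sa_mask);
    if (sigaction(SIGUSR2, &act, &oact) == -1)
        return -1;

    sigemptyset(&set);
    sigaddset(&set, SIGUSR2);
    delivered = 0;

    sigprocmask(SIG_BLOCK, &set, &oset);     /* add SIGUSR2 to the blocked set */
    raise(SIGUSR2);
    held_back = !delivered;                  /* handler must not have run yet */

    sigpending(&pending);
    was_pending = (sigismember(&pending, SIGUSR2) == 1);

    sigprocmask(SIG_UNBLOCK, &set, NULL);    /* pending signal is delivered here */
    sigprocmask(SIG_SETMASK, &oset, NULL);   /* restore the caller's mask */
    sigaction(SIGUSR2, &oact, NULL);

    return held_back && was_pending && delivered;
}
```

The example touches all three values of how: SIG_BLOCK to hold the signal back, SIG_UNBLOCK to let it through, and SIG_SETMASK to restore the mask saved in oset.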

Signal System Calls

Important system calls

sigpending(sigset_t *set)
    Stores in set the signals that are blocked from delivery and pending for the calling process.

ssetmask(int mask), sgetmask()
    Set or get the blocked signals of the current process; made obsolete by sigprocmask().

sigsuspend(int restart, unsigned long oldmask, unsigned long newmask)
    Replaces the process's signal mask with newmask and then suspends the process until a signal is delivered.

sys_sigaction

linux/kernel/signal.c, lines 150-182

    155-156: Check the signal number [1..32].
    157: Get the old sigaction (p).
    158-170: If action (the new setting) is not null, check whether it can be read. If yes, copy the content of action to new_sa.
    171-176: If oldaction is not null, store the old sigaction (p) in oldaction.
    177-180: Replace the sigaction with new_sa.

sys_sigprocmask

linux/kernel/signal.c, lines 29-60

    34-52: If set (the new mask) is not null, process set depending on how. SIG_BLOCK: add the signals in set to the blocked signals. SIG_UNBLOCK: remove the signals in set from the blocked signals. SIG_SETMASK: replace the blocked signals with set.
    53-58: If oset is not null, copy old_set (the previous current->blocked) to oset.

sys_sigpending

linux/kernel/signal.c, lines 80-88

    Stores the signals that are pending but blocked into set.
    84: Check whether set can be written.
    85-86: If yes, copy the pending blocked signals of the current process to set.

Interrupt

    Interrupts allow the hardware to communicate with the operating system.

Source files
    arch/i386/kernel/irq.c
    include/asm-i386/irq.h

Interrupt handlers
    slow, fast, bad (irq.c, lines 142-172)
    The interrupt handlers are built first: lines 114-136 in irq.c, lines 200-243 in irq.h (macros).

Interrupt

Interrupt numbers
    First set: 0-7
    Second set: 8-15
    0 is used for the timer.

    On an SMP board (486 and above)
        irq13 for interprocessor interrupts
        irq16 for SMP reschedule

    On a 386
        irq13 for SIGFPE (unreliable)
        no irq16

Interrupt

Slow interrupts (include/asm/irq.h, lines 205-222)

    206: Build the symbol IRQ#_interrupt (see irq.c, 142).
    208: SAVE_ALL, save all registers.
    209: ENTER_KERNEL, synchronize the processors' access to the kernel on an SMP board.
    210: ACK_FIRST (or SECOND), acknowledge the interrupt to the interrupt controller.
    211: Increase intr_count (the number of nested interrupts).
    213-217: Call do_IRQ(int irq, struct pt_regs *regs) (see arch/i386/kernel/irq.c, lines 343-364).
    219: UNBLK_FIRST (or SECOND), inform the interrupt controller that interrupts of this type can be accepted again.

do_IRQ()

    struct irqaction * action = *(irq + irq_action);

    while (action) {
        do_random |= action->flags;
        action->handler(irq, action->dev_id, regs);
        action = action->next;
    }

Data Structure

    struct irqaction {
        void (*handler)(int, void *, struct pt_regs *);
        unsigned long flags;
        unsigned long mask;
        const char *name;
        void *dev_id;
        struct irqaction *next;
    };
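The loop in do_IRQ() walks a linked list of actions for one interrupt number, which is what makes shared interrupts possible. A user-space sketch of that dispatch follows; register_action(), dispatch_irq(), count_handler(), and the NR_IRQS constant are hypothetical names for this illustration, struct pt_regs is left opaque, and the do_random bookkeeping is omitted.

```c
#include <assert.h>
#include <stddef.h>

struct pt_regs;                          /* opaque in this sketch */

struct irqaction {
    void (*handler)(int, void *, struct pt_regs *);
    unsigned long flags;
    unsigned long mask;
    const char *name;
    void *dev_id;
    struct irqaction *next;
};

#define NR_IRQS 16                       /* assumed: irq 0..15 */
static struct irqaction *irq_action[NR_IRQS];

/* Append an action to the list of the given irq; -1 if irq is invalid. */
static int register_action(int irq, struct irqaction *act)
{
    struct irqaction **p;

    if (irq < 0 || irq >= NR_IRQS)
        return -1;
    for (p = &irq_action[irq]; *p != NULL; p = &(*p)->next)
        ;                                /* find the end of the list */
    act->next = NULL;
    *p = act;
    return 0;
}

/* Call every handler registered for irq; returns how many were called. */
static int dispatch_irq(int irq, struct pt_regs *regs)
{
    struct irqaction *action;
    int called = 0;

    assert(irq >= 0 && irq < NR_IRQS);
    for (action = irq_action[irq]; action != NULL; action = action->next) {
        action->handler(irq, action->dev_id, regs);
        called++;
    }
    return called;
}

/* Example handler: counts its calls in the int that dev_id points to. */
static void count_handler(int irq, void *dev_id, struct pt_regs *regs)
{
    (void)irq;
    (void)regs;
    ++*(int *)dev_id;
}
```

The dev_id pointer is what lets one handler function serve several devices on the same interrupt line: each action carries its own device context.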

Interrupt

Slow interrupts (irq.h, lines 205-222), continued

    220: Decrease intr_count.
    221: Increase syscall_count.
    222: Jump to the routine ret_from_sys_call (never returns).

Fast interrupts (irq.h, lines 224-236)

    Use SAVE_MOST and RESTORE_MOST instead of SAVE_ALL, and do not call ret_from_sys_call.
    229-230: Call do_fast_IRQ(int irq) (see irq.c, lines 371-393).

Bad interrupts (irq.h, lines 237-243)

    Simply acknowledge the interrupt (no handler installed).

Interrupt

    IDT[] -> interrupt[] (or fast_interrupt[], bad_interrupt[])
    IRQi_interrupt (or fast or bad) -> do_IRQ() -> irqaction[]
    irqaction[i]->handler -> jump to ret_from_sys_call
    jump to handle_bottom_half (if bh_mask & bh_active)
    do_bottom_half -> bh_base[] -> bh_base[i]

Interrupt

    request_irq() -> setup_x86_irq() (init function)

    setup_x86_irq:
        Points the IDT[] entry at interrupt[] and stores the action in irqaction[irq]; for a shared irq, the action is added to the irqaction[] list.

    interrupt[], fast_interrupt[], bad_interrupt[]
        Built as assembly code by the BUILD_IRQ macro.
        interrupt and fast_interrupt call do_IRQ; bad_interrupt does not.
        interrupt calls do_IRQ and then jumps to ret_from_sys_call (fast_interrupt does not).

    do_IRQ
        Calls the handler of each action in irqaction[].
        Then jump to ret_from_sys_call, and from there jump to handle_bottom_half (if bh_mask & bh_active).
        handle_bottom_half is assembly code that calls do_bottom_half; do_bottom_half calls the functions in bh_base[].

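Interrupt

    bh_base[]
        Filled in by init_bh(), in the same way that an irq handler is installed by request_irq().

    Bottom half data structures
        bh_mask: a bit set to 1 means the corresponding bottom half routine is installed.
        bh_active: a bit set to 1 means an interrupt has requested that bottom half.
        bh_mask_count: counts how often a bottom half has been disabled; 0 means enabled, so nested disable (enable) calls are possible.
        bh_base: pointers to the bottom half routines.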
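The bh_mask and bh_active bit masks can be illustrated with a small user-space model. The code below is a sketch, not the kernel's implementation: init_bh(), mark_bh(), and do_bottom_half() are simplified re-implementations, bh_mask_count and interrupt locking are left out, and demo_timer_bh() with its counter is a made-up routine. Slot 0 is used for the timer, matching bh_base[0] -> timer_bh in these notes.

```c
#include <assert.h>
#include <stddef.h>

#define NR_BH 32

static void (*bh_base[NR_BH])(void);   /* bottom half routines */
static unsigned long bh_active;        /* bit nr set: routine nr was requested */
static unsigned long bh_mask;          /* bit nr set: routine nr is installed */

/* Install a bottom half routine in slot nr and enable it. */
static void init_bh(int nr, void (*routine)(void))
{
    assert(nr >= 0 && nr < NR_BH && routine != NULL);
    bh_base[nr] = routine;
    bh_mask |= 1UL << nr;
}

/* Called from an interrupt handler: request that slot nr runs later. */
static void mark_bh(int nr)
{
    assert(nr >= 0 && nr < NR_BH);
    bh_active |= 1UL << nr;
}

/* Run every routine that is both requested and installed. */
static int do_bottom_half(void)
{
    unsigned long active = bh_active & bh_mask;
    int nr, ran = 0;

    bh_active &= ~active;              /* these requests are being served */
    for (nr = 0; active != 0; nr++, active >>= 1) {
        if (active & 1) {
            bh_base[nr]();
            ran++;
        }
    }
    return ran;
}

/* Example routine standing in for timer_bh. */
static int timer_bh_runs;
static void demo_timer_bh(void)
{
    timer_bh_runs++;
}
```

A request for a slot without an installed routine simply stays set in bh_active and is ignored, because the AND with bh_mask filters it out.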


Timer interrupt:

    start_kernel() -> time_init() -> setup_x86_irq(0, &irq0) -> set_intr_gate()
    irq0->action = timer_interrupt
    IDT[0] -> interrupt[0] -> do_IRQ -> timer_interrupt() -> do_timer()

Timer bottom half:

    start_kernel() -> sched_init() -> init_bh(TIMER_BH, timer_bh)
    After the call to do_timer: jump to ret_from_sys_call -> handle_bottom_half -> do_bottom_half -> bh_base[0] -> timer_bh

Device Driver Example

3Com 3C509
    Source: drivers/net/3c509.c

    open (el3_open(), line 347)
        request_irq(dev->irq, &el3_interrupt, ...) (line 356)

    interrupt
        el3_interrupt() (line 515)
        mark_bh(NET_BH) (line 548)

    Where is NET_BH initialized?
        net_dev_init() (net/core/dev.c)
        init_bh(NET_BH, net_bh); (line 1471)

init_IRQ()

arch/i386/kernel/irq.c

    536: void init_IRQ(void), the function that initializes the IRQs.
    545-547: outb_p and outb output a byte to a port.
    548-549: A for loop calls set_intr_gate for the bad_interrupt array (set_intr_gate: system.h, 235-247). Every entry starts out as bad_interrupt[]; when an interrupt handler is installed with request_irq(), the flag selects interrupt[] or fast_interrupt[].
    555-556: request_region() [apricot.c: function; resource.c: macro]
    557-558: setup_x86_irq(), Interrupt Descriptor Table (IDT).

setup_x86_irq()

    395: setup_x86_irq()
    401: p = irq_action + irq; irq_action (line 219) is an array of 16 struct pointers, initially NULL, one for each of irq 0-15.
    402-417: Checks whether the IRQ is shared; covers sharing among fast, bad, and slow interrupts.
    426-432: If the IRQ is not shared, a fast interrupt is installed from fast_interrupt[], any other from interrupt[].

Request and Free IRQ

    int request_irq(), lines 437-467
        A device driver calls request_irq() to register the IRQ and the IRQ handler for its device.
    void free_irq(), lines 469-495
        free_irq() is the counterpart of request_irq(); a device calls free_irq() to release its IRQ.

Boot

Boot process
    The BIOS reads the first sector of the boot disk (floppy, hard disk, ..., according to the BIOS parameter setting).
    It loads the boot sector (512 bytes), which contains program code for loading the operating system kernel (e.g., the Linux Loader, LILO), to 0x7C00 (arch/i386/boot/bootsect.s, 35) in real mode.
    The boot sector ends with 0xAA55.

Boot disk
    Floppy: the first sector.
    Hard disk: the first sector is the master boot record (MBR).

Boot Sector and MBR

Boot sector (floppy)

    Offset  Content
    0x000   JMP 0x03E
    0x003   Disk parameters
    0x03E   Program code loading the OS kernel
    0x1FE   0xAA55

MBR and extended partition table

    Offset  Length  Content
    0x000   0x1BE   Code for loading the boot sector of the active partition
    0x1BE   0x010   Partition 1
    0x1CE   0x010   Partition 2
    0x1DE   0x010   Partition 3
    0x1EE   0x010   Partition 4
    0x1FE   0x002   0xAA55

MBR

Four primary partitions
    The MBR has only 4 partition entries.
    Each entry is 16 bytes.

Extended partition
    Used if more than 4 partitions are needed.
    The first sector of an extended partition has the same layout as the MBR.
    The first partition entry is for the first logical drive.
    The second partition entry points to the next logical drive (MBR).

The first sector of each primary or extended partition contains a boot sector.

Extended Partition MBR

MBR for an extended partition

    Code for loading the boot sector of the active partition
    Entry 1: Logical partition
    Entry 2: Next extended partition
    Entry 3: Not used
    Entry 4: Not used
    0xAA55

Structure of a Partition Entry

    Bytes  Field    Meaning
    1      Boot     Boot flag: 0 = not active, 0x80 = active
    1      HD       Begin: head number
    2      SEC CYL  Begin: sector and cylinder number of the boot sector
    1      SYS      System code: 0x83 Linux, 0x82 swap, 0x05 extended
    1      HD       End: head number
    2      SEC CYL  End: sector and cylinder number of the last sector
    4               Relative sector number of the start sector (low byte to high byte)
    4               Number of sectors in the partition (low byte to high byte)
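Reading the 16-byte entry from a raw sector buffer can be sketched as follows. The byte offsets follow the field sizes listed above (1, 1, 2, 1, 1, 2, 4, 4); struct partition_info, le32(), and parse_partition_entry() are names invented for this example, and the begin and end head/sector/cylinder bytes are skipped.

```c
#include <assert.h>
#include <stdint.h>

/* The fields of a partition entry that this sketch extracts. */
struct partition_info {
    int      active;        /* boot flag is 0x80 */
    uint8_t  system_code;   /* e.g. 0x83 Linux, 0x82 swap, 0x05 extended */
    uint32_t start_sector;  /* relative sector number of the start sector */
    uint32_t num_sectors;   /* number of sectors in the partition */
};

/* Assemble a 32-bit value stored low byte first. */
static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0]
         | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16
         | (uint32_t)p[3] << 24;
}

/* raw must point to the 16 bytes of one partition entry. */
static struct partition_info parse_partition_entry(const uint8_t *raw)
{
    struct partition_info info;

    assert(raw != 0);
    info.active       = (raw[0] == 0x80);   /* byte 0: boot flag */
    info.system_code  = raw[4];             /* byte 4: system code */
    info.start_sector = le32(raw + 8);      /* bytes 8-11 */
    info.num_sectors  = le32(raw + 12);     /* bytes 12-15 */
    return info;
}
```

In an MBR the four entries start at offsets 0x1BE, 0x1CE, 0x1DE, and 0x1EE of the sector, so the function would be called with sector + 0x1BE + 16 * i for entry i.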

Active Partition

    Booting is carried out from the active partition, which is determined by the boot flag.

Operations of the MBR
    Determine the active partition.
    Load the boot sector of the active partition.
    Jump into the boot sector at offset 0.

Boot Process

Compressed kernel size
    include/linux/config.h: DEF_SYSSIZE = 0x7F00 clicks = 508 KB (1 click = 16 bytes).
    A zImage is smaller than this size.
    The zImage's boot sector source is arch/i386/boot/bootsect.s. It is loaded to 0x7C00 first, then moved to 0x90000, and execution continues there.
    setup.s is then loaded to 0x90200 and the kernel image is loaded to 0x10000 (64 KB).
    setup.s moves the kernel from 0x10000 to 0x1000 (4 KB) to save memory, then enters protected mode and jumps to 0x1000 (lines 520-536).
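The size limit is plain arithmetic: 0x7F00 clicks of 16 bytes each. A short sketch to check it; the helper names and the CLICK_BYTES constant are chosen for this example, only DEF_SYSSIZE and the 16-byte click come from the text above.

```c
#include <assert.h>

#define DEF_SYSSIZE 0x7F00UL   /* default system size in clicks */
#define CLICK_BYTES 16UL       /* 1 click = 16 bytes */

/* Convert a size in clicks to bytes. */
static unsigned long clicks_to_bytes(unsigned long clicks)
{
    return clicks * CLICK_BYTES;
}

/* Convert a size in clicks to whole kilobytes (1 KB = 1024 bytes). */
static unsigned long clicks_to_kb(unsigned long clicks)
{
    assert(clicks <= ~0UL / CLICK_BYTES);   /* no overflow */
    return clicks_to_bytes(clicks) / 1024UL;
}
```

0x7F00 is 32512 in decimal; 32512 * 16 = 520192 bytes, and 520192 / 1024 = 508 KB, which agrees with the figure quoted from config.h.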

bootsect.s

Lines 59-69
    Moves the code from 0x7C00 (BOOTSEG) to 0x90000 (INITSEG).
    64-65: Set si and di to zero.
    66: cld clears the DF flag in EFLAGS to 0, which makes the move go upwards (the addresses increase during the data movement).
    rep: repeat line 68.
    68: Move word by word until cx = 0 (cx is initialized to 256).

Boot Process

Uncompressed kernel
    The start point is in arch/i386/kernel/head.s.
    It initializes the system and then calls start_kernel.
    From then on the system runs from start_kernel().

Booting the System

    LILO loads the Linux kernel into memory.

    Execution starts from start: in arch/i386/boot/setup.s
        setup.s is responsible for initializing the hardware, asking the BIOS for memory, disk, and other parameters, and putting them in memory at 0x90000-0x901FF.
        520-521: Switch to protected mode.
        534-536: jmp 0x1000, KERNEL_CS
            (jmpi 0x100000, KERNEL_CS for big kernels)

    Execution continues from startup_32 in arch/i386/kernel/head.s

Booting the System

    More parts of the hardware are initialized (paging table, co-processor, interrupt descriptor table (IDT), stack, environment, ...).
    219: Calls start_kernel() in init/main.c.

    start_kernel(): all areas of the kernel are initialized and process 1 is created.
        794-852: More initializations.
        858: Creates process 1 (kernel_thread(init, NULL, 0)).
            Process 0 is the idle process; it does nothing and runs when no other process needs the CPU.
            Process 1 calls init() and starts some daemons.
        868: Process 0 enters an infinite idle loop.

Booting the System

init() in init/main.c, lines 919-1020
    927: bdflush is responsible for synchronizing the buffer cache contents with the file system.
    929: kswapd is the background pageout daemon (swapping).
    937: setup initializes the file systems and mounts the root file system.
    986-991: Connects to the console and opens file descriptors 0, 1, and 2 (console).
    993-997: Tries to execute one of the programs /etc/init, /bin/init, /sbin/init.
    999-1003: If none of the three programs exists, executes /etc/rc.
    1005-1018: Enters an infinite loop in which a shell is started for users to log in on the console.

Setitimer System Call

int setitimer(int which, struct itimerval *value, *ovalue)

    which:
        ITIMER_REAL: Decrements in real time. A SIGALRM signal is delivered when this timer expires.
        ITIMER_VIRTUAL: Decrements in process virtual time. It runs only when the process is executing (not including system time). A SIGVTALRM signal is delivered when this timer expires.
        ITIMER_PROF: Decrements both in process virtual time and when the system is running on behalf of the process. A SIGPROF signal is delivered when this timer expires. It is designed for profiling the execution of interpreted programs.

    The itimerval struct has two fields: it_interval and it_value.
        If it_value is non-zero, it indicates the time to the next timer expiration.
        If it_interval is non-zero, it specifies the value used to reload it_value when the timer expires.
        Setting it_value to zero disables a timer.
        Setting it_interval to zero causes a timer to be disabled after its next expiration.
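A one-shot ITIMER_REAL timer, with it_interval left at zero so that the timer is disabled after it expires, might be used from user space as follows. This is a sketch against the POSIX interface; on_alarm(), alarms, and demo_one_shot_timer() are names made up for the example, and it assumes a single-threaded program.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <string.h>
#include <sys/time.h>

static volatile sig_atomic_t alarms;

static void on_alarm(int sig)
{
    (void)sig;
    alarms++;
}

/*
 * Arm ITIMER_REAL to expire once after ms milliseconds and wait for
 * the SIGALRM. Returns the number of SIGALRMs seen, or -1 on error.
 */
static int demo_one_shot_timer(long ms)
{
    struct sigaction act, oact;
    struct itimerval value;
    sigset_t block, old, wait_mask;
    int result;

    if (ms <= 0)
        return -1;                      /* it_value of zero would disable the timer */

    memset(&act, 0, sizeof act);
    act.sa_handler = on_alarm;
    sigemptyset(&act.sa_mask);
    if (sigaction(SIGALRM, &act, &oact) == -1)
        return -1;

    /* Block SIGALRM so it cannot arrive before sigsuspend() waits. */
    sigemptyset(&block);
    sigaddset(&block, SIGALRM);
    sigprocmask(SIG_BLOCK, &block, &old);
    wait_mask = old;
    sigdelset(&wait_mask, SIGALRM);

    alarms = 0;
    memset(&value, 0, sizeof value);    /* it_interval = 0: no reload */
    value.it_value.tv_sec = ms / 1000;
    value.it_value.tv_usec = (ms % 1000) * 1000;

    if (setitimer(ITIMER_REAL, &value, NULL) == -1) {
        result = -1;
    } else {
        while (!alarms)
            sigsuspend(&wait_mask);     /* atomically unblock and wait */
        result = alarms;
    }

    sigprocmask(SIG_SETMASK, &old, NULL);
    sigaction(SIGALRM, &oact, NULL);
    return result;
}
```

Setting it_interval to a non-zero value instead would make the timer reload itself and deliver SIGALRM periodically until it_value is set to zero.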

Related Code

ITIMER_REAL
    Data structure: timer_head
    run_timer_list()
    it_real_fn() (itimer.c, 98; sched.h, 297)

ITIMER_VIRTUAL
    do_it_virt() (sched.c, 943)

ITIMER_PROF
    do_it_prof() (sched.c, 956)

sys_setitimer
    -> _setitimer() (itimer.c, 115) -> add_timer() (sched.c, 606)

Timer Interrupt

Important global variables

    jiffies
        kernel/sched.c (96): unsigned long volatile jiffies=0;
        Ticks (10 ms) since the system was started up.

    xtime
        kernel/sched.c (47): volatile struct timeval xtime;
        The actual time.

Timer interrupt
    Updates jiffies and marks the bottom half active.
    The bottom half is called later, after other interrupts have been handled.

Timer Interrupt

do_timer (kernel/sched.c, 1077-1095)

    1079: Increase jiffies.
    1080: Increase lost_ticks (ticks since the last call of the bottom half routine).
    1081: Mark the bottom half active (include/linux/interrupt.h).
    1082-1083: Increase lost_ticks_system if in kernel mode (ticks spent in kernel mode since the last call of the bottom half routine).
    1084-1092: Profiling.
    1093-1094: Mark the timer queue handler active.

Timer Interrupt
Bottom half
1072:

routines of the timer interrupt

timer_bh (kernel/sched.c, lines 1070-1075)


updating the times, kernel/sched.c, lines 10541068 1058: xchg gets the value of lost_ticks and reset it to zero in an atomic way. 1063: get lost_ticks_system and reset 1064: calculate system load (lines725-738) 1065: update the real time xtime (740-922, hw) 1066: update times of current process (977-1049) 1073, 1074: updating system wide timers (649-683)

Timer Interrupt
update_process_times (977-1049)

981: user time = ticks - system time
983: decrease the remaining time quota of the current process
984-987: if the time quota is used up, a reschedule is needed
988-992: kernel statistics
994: update the current process's times (924-975)
  929-930: update the process's user and system times
  932-940: check whether the process has exceeded its CPU limit (set with setrlimit). If it exceeds the soft limit, SIGXCPU is sent. If it exceeds the hard limit, SIGKILL is sent to kill the process.
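The per-tick accounting can be summarised in a small user-space model. This is a sketch, not the kernel routine: struct acct_task, ACCT_HZ and model_update_process_times() are hypothetical names, and the model simply returns the signal that would be sent instead of delivering it.

```c
#include <assert.h>
#include <signal.h>

#define ACCT_HZ 100UL   /* assumed tick rate */

struct acct_task {
    unsigned long utime, stime;        /* ticks spent in user and system mode */
    long counter;                      /* remaining time quota in ticks */
    int need_resched;                  /* set when the quota is used up */
    unsigned long cpu_soft, cpu_hard;  /* CPU limits in seconds */
};

/* Account `ticks` ticks, of which `system` were spent in kernel mode.
 * Returns the signal that would be sent (0 for none). */
int model_update_process_times(struct acct_task *p,
                               unsigned long ticks, unsigned long system)
{
    unsigned long user, secs;

    assert(system <= ticks);
    user = ticks - system;             /* user time = ticks - system time */
    p->utime += user;
    p->stime += system;

    p->counter -= (long)ticks;         /* use up the time quota */
    if (p->counter <= 0) {
        p->counter = 0;
        p->need_resched = 1;
    }

    secs = (p->utime + p->stime) / ACCT_HZ;
    if (secs > p->cpu_hard)
        return SIGKILL;                /* hard limit exceeded */
    if (secs > p->cpu_soft)
        return SIGXCPU;                /* soft limit exceeded */
    return 0;
}
```

The ordering of the two limit checks is a choice made for this model so that exceeding the hard limit always wins.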

Timer Interrupt
update_process_times (977-1049)

994: update the current process's times (924-975)
  947-953: update interval timers. When a timer has expired, SIGVTALRM is sent.
  960-966: update the profiling timer

run_timer_list (649-665)
654: check the timer list to see which timers have expired
655-662: prepare to call the timer handler

run_old_timers (667-683)
check the timer table (obsolete)
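Checking a timer list for expired entries can be sketched as follows. The model assumes a singly linked list kept sorted by expiry time; the real data structure behind timer_head may differ, and every model_* name here is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct model_timer {
    struct model_timer *next;
    unsigned long expires;              /* tick at which the timer fires */
    void (*function)(unsigned long);    /* handler to call on expiry */
    unsigned long data;                 /* argument passed to the handler */
};

static struct model_timer *model_timer_head;  /* sorted, earliest first */
static unsigned long model_fired_sum;         /* used by the sample handler */

/* Sample handler: records which timers fired. */
void model_note(unsigned long data)
{
    model_fired_sum += data;
}

/* Insert a timer, keeping the list sorted by expiry time. */
void model_add_timer(struct model_timer *t)
{
    struct model_timer **pp = &model_timer_head;

    assert(t != NULL && t->function != NULL);
    while (*pp && (*pp)->expires <= t->expires)
        pp = &(*pp)->next;
    t->next = *pp;
    *pp = t;
}

/* Run every timer that has expired by tick `now`; returns how many ran. */
int model_run_timer_list(unsigned long now)
{
    int fired = 0;

    while (model_timer_head && model_timer_head->expires <= now) {
        struct model_timer *t = model_timer_head;

        model_timer_head = t->next;   /* unlink first so the handler may re-add */
        t->next = NULL;
        t->function(t->data);
        fired++;
    }
    return fired;
}
```

Keeping the list sorted means the check on each tick only has to look at the head of the list.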

Scheduler
Classes

Real-time (soft)
Preemptive: rt_priority
SCHED_FIFO: a process runs until it relinquishes control or a process with a higher rt_priority wishes to run
SCHED_RR: a process can also be interrupted if its time slice has expired and other processes with the same priority wish to run (round robin within the same class)

Classic
SCHED_OTHER
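On a POSIX system the priority range of each class can be queried from user space. A minimal sketch; priority_levels() is a hypothetical helper, while sched_get_priority_min() and sched_get_priority_max() are the standard POSIX calls. It assumes valid priorities are non-negative, as they are on Linux.

```c
#include <assert.h>
#include <sched.h>

/* Hypothetical helper: number of distinct static priorities a
 * scheduling policy offers, or -1 if the policy is not supported. */
int priority_levels(int policy)
{
    int lo = sched_get_priority_min(policy);
    int hi = sched_get_priority_max(policy);

    if (lo < 0 || hi < lo)
        return -1;
    return hi - lo + 1;
}
```

The real-time classes SCHED_FIFO and SCHED_RR offer a range of priorities, whereas SCHED_OTHER typically reports a single static priority because its ordering is decided dynamically.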

Scheduler
schedule() (kernel/sched.c, lines 283-407)

Called when
system call (indirectly, sleep_on -> schedule)
after a slow interrupt, ret_from_sys_call is called to check the need_resched flag
the timer interrupt will also set the need_resched flag

Major tasks
run routines that need to be called regularly
determine the process with the highest priority
make that process the current process

Scheduler
schedule() (kernel/sched.c, lines 283-407)

303-304: cannot be called within a nested interrupt
306-310: run the bottom halves of the interrupt routines (time-uncritical), e.g., for the timer interrupt
312: run routines registered to be run in the scheduler (chap. 7)
318-321: if the current process belongs to the SCHED_RR class and its time slice has expired, move it to the end of the run queue
323-325: if the current process is in the TASK_INTERRUPTIBLE state and a signal it is waiting for has arrived, make it runnable again
326-333: if the current process is waiting for a timeout and the timeout has expired, make it runnable again

Scheduler
schedule() (kernel/sched.c, lines 283-407)

334-335: if the current process must wait for an event, remove it from the run queue
357-364: look for the process with the highest priority

goodness (lines 235-281)
return values
  -1000: don't select this task
  0: out of time (no results)
  +ve: the larger, the better
  +1000: real-time process
255-256: real-time process
265: simply use p->counter as its weight
277-278: a slight favor to the current process
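The weighting rules above can be written down as a short user-space model. This is a sketch based only on the rules listed on the slide; struct sched_task, the POL_* constants and both functions are hypothetical, and adding rt_priority to 1000 for real-time tasks is an assumption of the model.

```c
#include <assert.h>
#include <stddef.h>

enum { POL_OTHER, POL_FIFO, POL_RR };  /* stand-ins for the scheduling classes */

struct sched_task {
    int policy;
    int rt_priority;   /* only meaningful for the real-time classes */
    long counter;      /* remaining time quota, used as dynamic priority */
};

/* Weight of task p when `prev` is the task that is currently running. */
int model_goodness(const struct sched_task *p, const struct sched_task *prev)
{
    int weight;

    if (p->policy != POL_OTHER)
        return 1000 + p->rt_priority;  /* real-time tasks always win */

    weight = (int)p->counter;          /* 0 means: out of time */
    if (weight > 0 && p == prev)
        weight += 1;                   /* slight favor to the current process */
    return weight;
}

/* Return the runnable task with the largest weight (NULL if n == 0). */
const struct sched_task *model_pick_next(const struct sched_task *tasks,
                                         size_t n,
                                         const struct sched_task *prev)
{
    const struct sched_task *best = NULL;
    int best_weight = -1000;
    size_t i;

    for (i = 0; i < n; i++) {
        int w = model_goodness(&tasks[i], prev);

        if (w > best_weight) {
            best_weight = w;
            best = &tasks[i];
        }
    }
    return best;
}
```

The bonus for the current process means that, between two otherwise equal tasks, the scheduler avoids an unnecessary context switch.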

Scheduler
schedule() (kernel/sched.c, lines 283-407)

367-370: if the counter of every process is 0, recalculate the counters
386-401: make the new process the current process and do the context switch (switch_to())

switch_to() in include/asm-i386/system.h, lines 53-122
104-105: if next is the current task, do nothing
106-109: clear the TS flag if the task being switched to was the last one to use the math coprocessor
111-112: switch to the next task
114-120: reload the debug registers if necessary
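The counter recalculation can be illustrated with a small model. The slide does not give the formula; the one used here, counter = counter / 2 + priority, is an assumption based on how kernels of this generation are commonly described. struct quota_task and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct quota_task {
    long counter;    /* remaining time quota (dynamic priority) */
    long priority;   /* static priority */
};

/* True if every task in the array has used up its quota. */
int model_all_quotas_used(const struct quota_task *t, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (t[i].counter != 0)
            return 0;
    return 1;
}

/* Assumed recalculation rule: keep half of any unused quota and
 * add the static priority. */
void model_recalc_counters(struct quota_task *t, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        assert(t[i].counter >= 0);
        t[i].counter = t[i].counter / 2 + t[i].priority;
    }
}
```

Under this rule a task that still had quota left, for example because it was waiting, ends up with a higher counter than a task that ran until its quota was gone.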

System Call
IDT table

start_kernel() calls trap_init() (arch/i386/kernel/traps.c, 322)
trap_init() sets up the traps and calls set_system_gate(0x80, &system_call)
IDT[0x80] -> system_call
trap 0x80 -> call system_call
system call -> int 0x80

System Call
fork(): include/asm-i386/unistd.h, 272
static inline _syscall0(int,fork)
_syscall0 (174) expands to:

int fork(void)
{
    long __res;
    __asm__ volatile ("int $0x80"
                      : "=a" (__res)
                      : "0" (__NR_fork)); /* __NR_fork is 2 */
    if (__res >= 0)
        return (int) __res;
    errno = -__res;
    return -1;
}

fork() System Call

int $0x80 -> trap
input: __NR_fork
output: __res
trap -> system_call
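The same idea, passing a system call number and reading back a result, is available on a current Linux system through the generic syscall() function of the C library. A minimal sketch, assuming Linux with glibc or a compatible libc; raw_getpid() is a hypothetical helper, and the library rather than this code decides which trap instruction is actually used.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: invoke getpid by system call number instead of
 * through its dedicated wrapper. On error, syscall() returns -1 and
 * sets errno, matching the convention of the _syscall0 expansion. */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

The value returned by raw_getpid() should equal the one from the ordinary getpid() wrapper, since both end up in the same kernel function.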

system_call
arch/i386/kernel/entry.S, 281

290: use the system call number to index sys_call_table[] and find the function (sys_fork); check for a null entry and for the trace flag
304: call the function (sys_fork)
322: after the system call, ret_from_sys_call (as after a slow interrupt)
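Dispatching through a table indexed by the system call number can be modelled in plain C. This is a user-space sketch, not the assembly in entry.S: the table contents, the single-argument signature and all model_* names are hypothetical. A null slot or an out-of-range number yields -ENOSYS, and the wrapper applies the same errno convention as the _syscall0 expansion.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef long (*model_syscall_fn)(long arg);

static long model_sys_double(long arg)  { return 2 * arg; }
static long model_sys_add_one(long arg) { return arg + 1; }

/* Table indexed by system call number; NULL marks an unused slot. */
static const model_syscall_fn model_sys_call_table[] = {
    NULL,               /* 0: not implemented */
    model_sys_double,   /* 1 */
    model_sys_add_one,  /* 2 */
};

#define MODEL_NR_SYSCALLS \
    (sizeof model_sys_call_table / sizeof model_sys_call_table[0])

/* "Kernel side": look up the function and call it. */
long model_system_call(unsigned long nr, long arg)
{
    if (nr >= MODEL_NR_SYSCALLS || model_sys_call_table[nr] == NULL)
        return -ENOSYS;
    return model_sys_call_table[nr](arg);
}

/* "User side": turn a negative result into errno and -1. */
long model_invoke(unsigned long nr, long arg)
{
    long res = model_system_call(nr, arg);

    if (res >= 0)
        return res;
    errno = (int)-res;
    return -1;
}
```

Returning negative error codes from the kernel side and converting them in the wrapper keeps the dispatcher free of any per-process error state.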