Student Guide
Version 20001015
Trademarks
IBM® is a registered trademark of International Business Machines Corporation.
UNIX is a registered trademark in the United States, other countries, or both and is licensed
exclusively through X/Open Company Limited.
<<< list any other trademarks used in the course materials >>>
The information contained in this document has not been submitted to any formal IBM test and is distributed on
an “as is” basis without any warranty either express or implied. The use of this information or the
implementation of any of these techniques is a customer responsibility and depends on the customer’s ability
to evaluate and integrate them into the customer’s operational environment. While each item may have been
reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will
result elsewhere. Customers attempting to adapt these techniques to their own environments do so at their
own risk.
© Copyright International Business Machines Corporation 2000. All rights reserved. This document may not be
reproduced in whole or in part without the prior written permission from IBM. Information in this course is
subject to change without notice.
Contents
Kernel Overview
Kernel Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-2
Kernel states . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-8
Kernel exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-10
Kernel Limits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-12
Kernel Limits Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-16
64-bit Kernel base enablement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-17
64-bit Kernel stack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-24
CPU big- and little-endian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-26
Multi Processor dependent designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-28
Command and Utility compatibility for 32-bit and 64-bit kernels . . . . . . . . . . . . . . . . 1-29
Exceptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-30
Interrupts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-33
Interrupt handling in AIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-35
Handling CPU state information at interrupt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-36
Handling CPU state information at interrupt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-37
IA-64 Hardware Overview
IA-64 Hardware Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-2
IA-64 formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-3
IA-64 memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-5
IA-64 Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-8
IA-64 Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-12
IA-64 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-19
Power Hardware Overview
Power Hardware Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-2
Power CPU Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-8
64 bit CPU Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-15
SMP Hardware Overview
SMP Hardware Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-2
Configuring System Dumps on AIX 5L
About This Lesson . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-3
System Dump Facility in AIX5L . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-5
Configuring for System Dumps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-7
Obtaining a Crash Dump . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-16
Dump Status and completion codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-17
dumpcheck utility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-19
Verify the dump . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-21
Packaging the dump . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-26
Introduction to Dump Analysis Tools
About This Lesson . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-2
System Dump Analysis Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-6
dump components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-7
Dump creation process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-8
Component dump routines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-9
bosdebug command . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-10
Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-127
Process Management
Process Management Fundamentals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-2
Process operations fork() system call . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-3
Process operations exec() system call . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-8
Process operations exec() system call . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-10
Process operations exit system call . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-12
Process operations, wait() system call . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-13
Kernel Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-16
Thread Fundamentals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-17
AIX Thread . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-19
Thread Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-21
Threads Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-22
Thread states . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-25
Thread Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-27
Process swapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-28
Thread Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-29
The Dispatcher . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-33
AIX run queues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-36
Process and Threads data structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-39
Process and Threads data structures addresses . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-43
What is new in AIX 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-48
Signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-50
Signal handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-51
Signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-53
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-57
Memory Management
Overview of Virtual Memory Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-2
Memory Management Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-3
Demand Paging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-5
Memory Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-7
Memory Object types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-8
Page Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-9
Page Not In Hardware Frame Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-11
Page on Paging Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-13
Loading Pages From The Filesystem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-17
Filesystem I/O . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-18
Free Memory and Page Replacement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-19
vmtune . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-21
Fatal Memory Exceptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-22
Memory Objects (Segments) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-23
Shared Memory segments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-25
shmat Memory Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-26
Memory Mapped Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-28
IA-64 Virtual Memory Manager
IA-64 Addressing Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-2
Regions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-3
Region Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-4
Address Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-5
© Copyright IBM Corp. 2000 Version 20001015 Contents v
Course materials may not be reproduced in whole or in part
without the prior written permission of IBM.
Student Guide Draft Version for review, Sunday, 15. October 2000, intTOC.fm
LVM
Logical Volume Manager overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-3
Data Integrity and LVM Mirroring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-12
LVM Striping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-15
LVM Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-17
Physical disk layout Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-21
VGSA structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-30
Physical disk layout IA-64 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-31
LVM Passive Mirror Write Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-36
AIX 5 LVM Hot Spare Disk in a Volume group. . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-40
LVM Hot spot management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-42
LVM split mirror AIX 4.3.3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-45
LVM Variable logical track group (LTG) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-46
LVM command overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-47
LVM Problem Determination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-48
Trace LVM commands with the trace command . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-51
LVM Library calls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-56
logical volume device driver LVMDD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-57
Disk Device Calls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-58
Disk low level Device Calls such as SCSI calls . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-60
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-61
Enhanced Journaled File System
J2 - Enhanced Journaled File System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-2
Aggregate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-3
Allocation Groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-7
Filesets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-8
Extents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-10
Binary Trees of Extents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-12
inodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-15
File Data Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-19
fsdb Utility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-23
Exercise 1 - fsdb . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-24
Directory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-27
Directory Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-31
Exercise 2 - Directories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-35
Logical and Virtual File Systems
General File System Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-2
Logical File System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-4
User File Descriptor Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-6
System File Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-7
Virtual File System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-9
Vnode/vfs interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-10
vi AIX 5L Internals Version 20001015 © Copyright IBM Corp. 2000
Vnodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-11
vfs and vmount . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-12
File and Filesystem Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-14
gfs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-15
vnodeops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-16
vfsops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-17
The Gnode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-18
Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-20
Lab Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-21
Lab Exercise 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-26
AIX 5L boot
What is boot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-2
Various Types of boot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-3
Systems types and Kernel images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-5
RAMFS and prototype files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-6
Boot Image Creation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-8
AIX 5L Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-10
Checkpoint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-11
Instructor Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-12
The Power Boot Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-13
Power boot disk layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-14
AIX 5L Power boot record . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-16
Instructor Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-20
Power boot images structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-21
RSPC boot image hints header . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-22
CHRP Boot image ELF structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-24
CHRP boot image ELF structure - Continued . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-25
CHRP boot image ELF structure - Continued . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-26
Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-27
Instructor Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-28
Power ROS and Softros . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-30
IPLCB on Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-31
Checkpoint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-33
Instructor Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-34
The IA-64 Boot Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-35
IA-64 boot disk layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-37
EFI boot manager and boot maintenance manager overview . . . . . . . . . . . . . . . . 14-39
EFI Shell Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-40
IA-64 Boot Loader . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-43
IA-64 Initial Program Load Control Block . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-44
Checkpoint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-45
Instructor Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-46
Hard Disk Boot process (rc.boot Phase I) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-47
Hard Disk Boot process (rc.boot Phase II) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-48
Hard Disk Boot process (rc.boot Phase III) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-49
CDROM Boot process (rc.boot Phases I, II and III) . . . . . . . . . . . . . . . . . . . . . . . . 14-50
Tape Boot process (rc.boot Phases I, II and III) . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-51
Network Boot process (rc.boot Phases I, II and III) . . . . . . . . . . . . . . . . . . . . . . . . 14-52
Common Boot process (rc.boot Phase III) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-53
Kernel Overview
Introduction
Up until AIX 5L, the kernel was a 32-bit kernel for the Power architecture only. AIX Version 4.3 introduced 64-bit application enablement on Power: there was still a 32-bit kernel, but a 64-bit environment was available through a kernel extension that performed the appropriate remapping. Now AIX 5L features both a 32-bit and a 64-bit kernel on Power systems, and a 64-bit kernel on the IA-64 architecture.
This overview describes the concepts used in the kernel in general, and in the 64-bit kernel specifically.
Functions of the kernel
The kernel provides the system with the following functions:
• Create, manage and delete processes.
• Schedule and balance resources.
• Provide access to devices.
• Handle asynchronous events.
The kernel manages resources so they can be shared simultaneously among many processes and users. A resource can be physical, like the CPU, memory or an adapter, or virtual, like a lock or a slot in the process table.
Uniprocessor support
The 64-bit kernel is aimed at the high-end server environment and multiprocessor hardware. As a result, it is optimized strictly for the multiprocessor environment and no separate uniprocessor version is provided.
64-bit vs. 32-bit kernel
The primary purpose of the 64-bit AIX kernel is to address the fundamental need for workload scalability. This is achieved through a kernel address space which is large enough to support increases in software resources.
32-bit kernel life time
Customers have made and will continue to make significant investments in 32-bit RS/6000 hardware systems and need system software that protects this investment. Thus, AIX also offers a 32-bit kernel. The RS/6000 software plan is to eventually drop support for the 32-bit kernel. However, support will not be withdrawn before 2002, and not before the initial 64-bit kernel release. This process is driven by end-of-life plans for 32-bit hardware systems, as well as by the fact that customers require a bridge period during which both the 32-bit and 64-bit kernels are available for 64-bit hardware systems and offer the same basic functionality. This period is needed to ease migration to the 64-bit kernel.
Compatibility
Customers need system software that protects their investment in existing applications and provides binary and source compatibility. AIX 5L will therefore maintain support for existing 32-bit applications.
Kernels supported by hardware platform
The table below shows which kernels are supported on different systems. In general, a 64-bit kernel and application can only run on 64-bit hardware, but 64-bit hardware can execute 32- and 64-bit kernels and applications. Currently, there are three different CPU types in the RS/6000 systems (only the PowerPC 604e CPU is 32-bit).

CPU            Type
PowerPC 604e   32-bit
Power3-II      64-bit
RS64 II        64-bit
RS64 III       64-bit
Binary compatibility and limitations
The 64-bit kernel offers binary compatibility to existing applications, both 32-bit and 64-bit. However, this does not extend to the minority of applications that are built non-shared or have intimate knowledge of internal details, such as programs accessing /dev/kmem or /dev/mem. This is consistent with the general AIX policy for these two classes of applications.
Source compatibility
Source code compatibility is preserved for applications and 32-bit kernel extensions. Consistent with general AIX policy, this extends to makefiles (build mechanisms), but not to the small set of applications that rely upon shipped header file contents that are provided only for use by the kernel. Programs accessing /dev/mem or /dev/kmem serve as an example of such applications.
32-bit vs. 64-bit kernel performance on Power
The 64-bit kernel is intended to increase the scalability of the RS/6000 product family and is optimized for running 64-bit applications on the upcoming Gigaprocessor systems (Power4, which will be announced in 2001). The performance of 64-bit applications running on the 64-bit kernel on Gigaprocessor-based systems is better than if the same application were running on the same hardware with the 32-bit kernel. This is because the 64-bit kernel allows 64-bit applications to be supported without requiring system call parameters to be remapped or reshaped. The 64-bit kernel may also be compiler-optimized specifically for the Gigaprocessor system, whereas the 32-bit kernel may be optimized for a more general platform.
32-bit application performance on 32-bit and 64-bit kernels
The 64-bit kernel will also be optimized for 32-bit applications (to the extent possible), because 32-bit applications now dominate the application space and will continue to do so for some time. In fact, performance trade-offs involving 32-bit versus 64-bit applications should be made in favor of 32-bit applications. However, 32-bit applications on the 64-bit kernel will typically perform worse than on the 32-bit kernel, because call parameter reshaping is required for 32-bit applications on the 64-bit kernel.
64-bit application and 64-bit kernel performance on non-Gigaprocessor systems
The performance of 64-bit applications under the 64-bit kernel on non-Gigaprocessor systems may be less than that of the same applications on the same hardware under the 32-bit kernel. This is due to the fact that non-Gigaprocessor systems are intended as a bridge to Gigaprocessor systems and lack some of the support that is needed for optimal 64-bit kernel performance. In addition, efforts should be made to optimize 64-bit kernel performance for non-Gigaprocessor systems, but performance trade-offs are made in favor of the Gigaprocessor.
32-bit and 64-bit kernel extension performance on Gigaprocessor systems
The performance of 64-bit kernel extensions on Gigaprocessor systems should be the same as or better than that of their 32-bit counterparts on the same hardware. However, the performance of 64-bit kernel extensions on non-Gigaprocessor machines may be less than that of 32-bit kernel extensions on the same hardware. This follows from the fact that the 64-bit kernel is optimized for Gigaprocessor systems.
Kernel characteristics
Since the kernel is a program itself, it behaves almost like any other program. Its features are:
• Preemptable
• Pageable
• Segmented
• 64-bit
• Dynamically loadable
Preemptable means that the kernel can be in the middle of a system call
and be interrupted by a more important task. The preemption causes a
context switch to another thread inside the kernel.
Some parts of the kernel are pageable, which means they are not needed
in memory all the time, and can be paged to paging space.
Both the 32-bit kernel and the 64-bit kernel implement virtual address
translation by using segments. In previous versions of AIX, segment
registers were used to map segments to thread contexts. Now segment
tables are being used.
Kernel states

Kernel system diagram (Power)
[Diagram: user programs and libraries sit at the user level and enter the kernel through the system call interface (trap). Within the kernel level are the buffer cache, the process control subsystem with the scheduler and memory management, and the character and block device drivers. Hardware control forms the boundary to the hardware level and the hardware itself.]
This diagram shows how the kernel is the interface between the user level
and the hardware. Applications live at the user level, and they can only
access hardware, like a disk or printer, through the kernel.
Process execution modes
Processes can run in two different execution modes: kernel mode and user mode. These modes are also referred to as Supervisor State and Problem State.
User mode protection domain
A process running in user mode can only affect its own execution environment and runs in the processor's unprivileged state. In user mode, a process has read/write access to the user data in its process private segment and to the shared library data segment. It also has access to shared memory segments using the shared memory functions. A process in user mode has read access to the user text and shared library text segments.
User mode processes can still use kernel functions by means of a system call. Access to functions that directly or indirectly invoke system calls is typically provided by programming libraries, which give access to operating system functions.
Kernel mode protection domain
Code running in this mode has read/write access to the global kernel space, and access to kernel data in the process private segment when running within the process context. Code in interrupt handlers, the base kernel and kernel extensions runs in kernel mode. If a program running in kernel mode needs to access user data, a kernel service is used to do so. Programs running in kernel mode can use kernel services, can access global system data, are exempt from all security restraints, and run in the processor's privileged state.
In short: the kernel state is part of the thread state, so this information is typically kept in the thread's Machine State area (MST).
Kernel exercise

Exercise: figuring out thread state on Power
Look at the value of the Machine State Register (MSR) for the thread of interest:

# echo "mst <thread slot>" | kdb | grep msr
iar : 0000000000009444 msr : A0000000000010B2 cr : 31384935
From /usr/include/sys/machine.h :
This means that if bit 15 of the MSR is set, the thread is running in user mode; that is, when the fourth nibble from the right is 4, 5, 6, 7 or C, D, E, F.
Exercise: figuring out thread state on IA-64
Look at the value of the Interrupt Processor State Register (IPSR) for the thread of interest.

On an interrupt, if PSR.ic (Interrupt Collection) is 1, the IPSR receives
the value of the PSR. The IPSR, IIP and IFS are used to restore the
processor state on a Return From Interrupt (rfi). The IPSR has the same
format as PSR. IPSR.ri is set to 0, after any interruption from the IA-32
instruction set.
# iadb
(0)> ut -t <thread-ID>
*ut_save: 0x0003ff002ff3b400 *ut_rsesave: 0x0003ff002ff3bf50
System call state: ut_psr: 0x00001053080ee030
From /usr/include/sys/machine.h :
#define PSR_PK 15
00001010080AE030 (HEX) =
100000001000000001000000010101110000000110000 (Binary)
Bit 15 is set, which means that the thread has the Protection Key set, and
hence is in a problem state.
Kernel Limits
Kernel Limits
Most of the settings in the kernel are dynamic and do not need to be tuned. Their maximum values are chosen so that they should never be reached during normal system usage. Some limits chosen as a maximum could technically be even higher.
The following table lists kernel system limits as of AIX 5L Version 5.0.
Checking kernel values
The purpose of this exercise is to find actual limits or settings in a running kernel. From the file /usr/include/sys/msginfo.h, we obtain the structure msginfo, which holds four integers. To list the contents in the running kernel, we use kdb on Power and iadb on the IA-64 platform. On both systems, we display 16 bytes, equal to four integers.
/*
 * Message information structure.
 */
struct msginfo {
        int msgmax,     /* max message size */
            msgmnb,     /* max # bytes on queue */
            msgmni,     /* # of message queue identifiers */
            msgmnm;     /* max # messages per queue identifier */
};
Power # kdb
(0)> d msginfo
IA-64 # iadb
> d msginfo 4 4
e00000000415cfb0: 00400000 00400000 00020000 00080000
msgmax msgmnb msgmni msgmnm
64-bit Kernel base enablement
Several components of base enablement support are provided to make it possible for kernel subsystems and kernel extensions to run in 64-bit mode and use a large address space.
State management support
Support is provided for saving and restoring 64-bit kernel context, including full 64-bit GPR contents. This support also extends to the area of kernel exception handling, where setjmpx() and longjmpx() must deal with 64-bit kernel context. In addition, state management is extended to include the 64-bit kernel address space as part of the kernel context.
Temporary attachment
The 64-bit kernel provides kernel subsystems and kernel extensions with the capability to change the contents of the kernel address space. This includes the capability to change segments within the address space temporarily for a specific thread of execution, and is consistent with the segmented virtual memory architecture of the hardware and the legacy 32-bit kernel programming model.
A total of four concurrent temporary attachments will be supported under a
single thread of execution. This limitation is consistent with the limitation
imposed by the 32-bit kernel and is made to restrict the amount of kernel
state that must be saved and restored at context switch.
Global attachment
While the temporary attachment model is maintained, the 64-bit kernel also provides a model under which subsystem data is placed within the global kernel address space and made visible to all kernel code for the entire life of its usefulness, rather than temporarily attaching segments as needed and in the context of a single thread.
This global attachment model does more than allow the 64-bit kernel to
provide sufficient space for subsystems to place their data in the global
kernel heap. Rather, it includes the capability to place subsystem
segments within the global address space. This capability is needed for
two reasons:
• Different memory characteristics
• Data organized around segment
Some subsystems require virtual memory characteristics that are different from those of the kernel heap. For the most part, these characteristics are defined at the segment level and typically must be reflected by segment types that are different from those used for the kernel heap. Also, some subsystems organize their data around segments and require sizes and alignments that are inappropriate for the kernel heap.
The global attachment model is also of value in cases where only a small
number of subsystem segments are involved. Segments are attached to
the global kernel addresses space only once, typically at subsystem
initialization, and are accessible from then on without requiring individual
subsystem operations to incur the path length cost of segment attachment.
This is not to say that the global attachment model is without its own path
length costs; specifically, use of this model may result in more segment
lookaside buffer (SLB) reloads. This is because it provides no opportunity to
prime the SLB table with virtual segment IDs (VSIDs) for soon-to-be-
accessed segments. Rather, it relies upon the caching nature of the SLB
table and updates SLBs with new VSIDs only when satisfying reload faults.
This differs from the temporary attachment model where VSIDs are placed
in the SLB as part of segment attachment.
Finally, this model simplifies the general kernel programming model. Subsystems are not required to deal with the complexity of segments, segment offsets, or segment attachments in accessing their data. Rather,
data accesses are made simply and naturally using addresses within the
flat kernel address space.
The specific subsystem segments that will be placed in the kernel address
space under the global attachment model include:
• Kernel Heap
Although traditionally part of the global address space, the
kernel heap segments will be placed in this space through
global attachment.
• mbuf Segments
The mbuf pool has long been a part of global space and this will
continue under the 64-bit kernel.
• VMM Segments
These segments are privately attached in the 32-bit kernel
legacy and hold the software page frame table, segment control
blocks, paging device table, file system lockwords, external
page tables, and address space map entries.
All segments added to the global kernel address space through global
attachment will be strictly read/write for the kernel and no-access for users.
In addition, unaligned accesses to these segments will not be supported
and will result in a protection exception.
Data isolation
While placing subsystem data in the global kernel address space provides significant benefits, it eliminates the data isolation that is provided by the
temporary attachment model. Under this model, data is typically made
accessible only while running subsystem code and is not generally
exposed to other subsystems. Unrelated interrupt handlers may gain
accessibility to data by interrupting subsystem code. However, this
exposure is more limited than that which occurs by placing data in global
space where all kernel code has accessibility.
Isolation is critical for some classes of subsystem data. As a result, not all
subsystem data should be placed in the global kernel address space. In
particular, file systems should continue to use temporary attachments to
provide isolation for user data.
Kernel address space layout
The kernel address space layout preserves the existing 32-bit and 64-bit user address layouts that are found under the legacy 32-bit kernel. In addition, a common global kernel and per-process user address space is provided. This is required for a number of performance reasons.
Temporary attachments are not included as part of the common address space. This is for a number of reasons. First, data isolation would be impacted for temporary attachments if they were placed in the common
address space. This is because the attached data would be accessible in
the kernel by all threads of a process rather than only by the thread that
performed the temporary attachment. Second, it would be inefficient for the
common address space to include temporary attachments. This is due to
the fact that changes to the common address space would have to be
serialized among all threads of a process.
I/O space mapping
The 64-bit kernel supports I/O space at locations below and above 4 GB within the hardware system memory map. Under the 64-bit kernel, I/O
space is virtually mapped through the page translation hardware and made
accessible through segments on all supported hardware system
implementations. In the legacy 32-bit kernel on current hardware systems,
I/O space virtual access is achieved through block address translation
(BAT) registers, but this capability is not provided by the Gigaprocessor
hardware.
Performance when accessing I/O addresses
The capability to place portions of I/O space within the global kernel address space must be provided to allow temporary attachment overhead to be avoided. This capability is built upon the global attachment model. Along with services to support this, other services are provided that allow
portions of I/O space to be temporarily attached. However, these services
will form an I/O space temporary attachment model that is slightly different
from the one now found under the 32-bit kernel. Specifically, I/O space
mappings must be created prior to any temporary attachments and
destroyed once all temporary attachments are complete. These mapping
operations are performed by individual device drivers through new
services and typically occur at the time of device configuration and deconfiguration. Compare this to the existing model under the 32-bit kernel, where no separate mapping operations are present.
I/O mapping in 64-bit kernel mode
The mapping operations are provided under the 64-bit kernel model for a number of reasons. The first is performance. While the 32-bit kernel model
does not require I/O space to be mapped before it is attached, it does
require each temporary attachment to perform some level of mapping.
Under the 64-bit kernel model, each device driver maps its portion of I/O
once at initialization time and incurs no additional mapping overhead in
performing temporary attachments. Next, the presence of the mapping
operations provides efficient use of system resources. I/O space is mapped
in virtual memory through the page table and segments under the 64-bit
kernel and these system resources are only consumed for portions of I/O
space that are actually in use. In the absence of mapping operations, the
64-bit kernel itself would have to map all of I/O space into virtual memory
and possibly waste resources for unused portions. In addition to potentially
wasting resources, arming the kernel with the responsibility of mapping I/O
space would lead to arbitrary layouts of I/O space in virtual memory and
would not support data isolation. Finally, the interfaces for performing
temporary attachments are simplified, as no I/O mapping information must
be specified. This implies new interfaces for attaching and detaching from
I/O space.
The new I/O space temporary attachment model and supporting services are provided not only under the 64-bit kernel but under the 32-bit kernel as
well. This is required to ease the migration of 32-bit device drivers to the
64-bit kernel environment and to make it simpler to maintain 32-bit and 64-
bit versions of a single device driver.
Rather than placing their respective portions of I/O space in the global
kernel address space, most device drivers should continue to access I/O
space through temporary attachments. This is because a large proportion
of these accesses occur under interrupts and would more than likely miss
the SLB table if the accesses were performed using the global attachment
model. While the temporary attachment model adds overhead to I/O space
accesses, it typically avoids the SLB miss performance penalty by priming
the SLB table.
LP64 C language data model
The 64-bit kernel uses the LP64 (Long Pointer 64-bit) C language data model. This data model was chosen for a number of reasons. First, the LP64 data model is also used by 64-bit AIX applications, and this allows the 64-bit kernel to support these applications in a straightforward manner. Of the
prevailing 64-bit data models, including ILP64 and LLP64, the LP64 data
model is most consistent with the ILP32 data model used by 32-bit
applications. This consistency simplifies 32-bit application support under
the 64-bit kernel and allows 32-bit and 64-bit applications to be supported
in fairly common ways. Next, LP64 has been chosen as the data model for
the 64-bit kernel implementations provided by key UNIX vendors, including
SGI, SUN, and H-P. Use of a common data model simplifies matters for
ISVs, and enables AIX to use industry wide solutions to some problems.
Finally, the 64-bit kernel requires no new compiler functionality and can
use the existing 64-bit mode compiler.
Register conventions
The register conventions used in the 64-bit kernel environment are the same as those used in the 64-bit application environment. This means that general purpose register 13 is reserved for operating system use.
Kernel stack
64-bit code has greater stack requirements than 32-bit code. This is for two reasons. First, the amount of stack space required to hold subroutine
linkage information increases for 64-bit code, since this information is
made up of register and pointer values and these values are larger 64-bit
quantities. Second, long and pointer values are 64-bit quantities for 64-bit
code and consume more space when maintained as stack variables.
The larger stack requirements of 64-bit code also means that stack-related
sizes under the 64-bit kernel are increased over those of the 32-bit kernel.
In fact, most existing stack sizes will double.
Minimum stack size
Under the 64-bit kernel, the components of the common subroutine linkage, such as the link register and TOC pointer, are 64-bit quantities. As a result, the minimum stack frame size is 112 bytes.
Process context stack size
As in the 32-bit kernel, the kernel stacks for use in process context are 96 KB in size. This size should prove sufficient for the 64-bit kernel, since 96 KB has been found to be twice what is actually needed for the 32-bit kernel.
Interrupt stack size
The interrupt stack is 8 KB in size under the 64-bit kernel. This size is clearly warranted, since some interrupt handlers find the 4 KB interrupt stack size of the 32-bit kernel to be insufficient.
Dynamic resource pools
To allow scalability, resource pools are allocated dynamically from the kernel heap and through separately created segments intended for this purpose. This means that some existing resource pools, like the shared memory, message queue, and semaphore ID pools, are relocated from the kernel BSS.
Kernel heap
The kernel heap is the home of most kernel data structures, and is
sufficiently large to allow subsystems to scale fixed resource pools, while
at the same time, providing adequate space for dynamically allocated
resources. To provide this, the kernel heap is expanded to encompass a
larger number of segments and placed above 4 GB within the global kernel
address space to accommodate its larger size.
While the kernel heap is extended and moved above 4 GB, the interfaces
provided for the allocation and freeing from this heap are the same as
those provided under the 32-bit kernel. The use of these interfaces is
pervasive, so common interfaces eases the 64-bit kernel porting effort for
kernel subsystems and kernel extensions and makes it simpler to support
both kernels.
The kernel heap is now expanded to 16 segments, for a total of about 4 GB
of allocatable space. This is more than eight times larger than the space
available under the 32-bit kernel.
Allocation requests are only limited in size by the amount of available heap
space, rather than by some arbitrary limit. This means that the segments
that make up the kernel heap are laid out contiguously within the address
space, and requests for more than a segment's worth of data are granted
if sufficient free space is available. It also means that a request can be
satisfied with space that crosses segment boundaries.
A separate global heap reserved for the loader is provided in segment zero
(that is, the kernel segment). This heap is used to hold the system call
table and svc_instructions code for 32-bit applications and must be placed
in segment zero, because it is the only global segment that is mapped into
the 32-bit user address space. This heap is also used to hold the system
call table for 64-bit applications and loader sections for kernel extensions.
This data is located in the loader heap because it must be readable in user
mode. This type of access is not supported for the kernel heap.
Memory view for big- and little-endian systems
Although both Power and IA-64 architectures support big-endian and little-endian implementations, the endianness of AIX 5L running on IA-64 and of AIX 5L on PowerPC differ: AIX 5L for IA-64 is little-endian, and AIX 5L for PowerPC is big-endian.
When you look at system memory, you can view it in two ways. The example shows 100 bytes of memory seen both ways. Try writing the number 1234567890 at addresses 0-9 in both figures. What digit is in the byte at address two?
(Figure: the same 100 bytes of memory drawn twice, ten bytes per row. On the left, addresses run from 99 down to 0; on the right, they run from 0 up to 99.)
Register and memory byte order
Computers address memory in bytes while manipulating data in words (of multiple bytes). When a word is placed in memory, starting from the lowest address, there are only two options: either place the least significant byte first (known as little-endian) or place the most significant byte first (known as big-endian).
register (bit 63 ... bit 0):   a b c d e f g h

big-endian memory:     a b c d e f g h
address:               0 1 2 3 4 5 6 7

little-endian memory:  h g f e d c b a
address:               0 1 2 3 4 5 6 7
In the register layout shown in the figure above, “a” is the most significant
byte, and “h” is the least significant byte. The figure also shows the byte
order in memory. On big-endian systems, the most significant byte will be
placed at the lowest memory address. On little-endian systems, the least
significant byte will be placed at the lowest memory address.
Kernel lock
The kernel lock is not supported under the 64-bit kernel. This lock was originally provided to allow subsystems to deal with the preemptive nature of the AIX kernel on uniprocessor hardware, and was later used as a means of ensuring correctness for non-MP-safe subsystems on MP hardware. At a minimum, all 64-bit kernel subsystems and kernel extensions must be MP-safe, with most required to be MP-efficient to meet performance requirements. As a result, the kernel lock is no longer required.
Device funneling
Under the 64-bit kernel, no support is provided for device funneling. This means that all device drivers must be MP-safe and identify themselves as such when registering devices and interrupt handlers.
Device funneling was originally provided under the 32-bit kernel so that
non-MP-safe device drivers could run correctly on multi-processor
hardware with no change. However, all device drivers must change to
some extent under the 64-bit kernel and this provides the opportunity to
simplify the 64-bit kernel by not providing device funneling support and
requiring additional changes for the set of device drivers that are not MP-
safe.
Of the existing IBM Austin-owned device drivers, only the X.25 and
graphics device drivers are not MP-safe. However, this is of no concern,
since X.25 will not be provided under the 64-bit kernel and the (new)
graphics drivers that will be provided in the time frame of the 64-bit kernel
will be MP-safe.
Commands and utilities
A number of AIX-supplied commands and utilities deal directly with kernel details and require different implementations under the different kernels. Commands based upon /dev/kmem or /dev/mem serve as an example.
Exceptions
Exceptions and interrupts distinction
The distinction between the terms "exception" and "interrupt" is often blurred. The bulk of AIX documentation refers to both classes generically as "interrupts," while the hardware documentation (such as the PowerPC 60x User's Manuals) makes the distinction. We will try to keep the terms separate.
Definition of exceptions
Exceptions are synchronous events that are normally caused by the process doing something illegal.
An exception is a condition caused by a process attempting to perform an
action that is not allowed, such as writing to a memory location not owned
by the process, or trying to execute illegal operations. For illegal
operations, the kernel traps the offending action and delivers a signal to the process causing the exception (or crashes, if the process was in
kernel mode). Exceptions can also be caused by a page fault. A page fault
is a reference to a virtual memory location for which the associated real
data is not in physical memory.
Determine the action taken on an exception
The result of an exception is either to send a signal to the process or to crash the machine. The decision is based upon what kind of exception occurred and whether the process was executing in user mode or kernel mode:
• Exceptions are caused within the context of a process.
• A process may NOT decide how to react to the exception.
• Exception handlers are kernel code and run without regard to the process, except
to cleanly handle the exception generated by the process.
• Some exceptions result in the death of the process.
• Some exception types can be found in <sys/m_except.h>
A process can decide how to respond to the signal generated by the
exception in certain cases. For example, a process can decide to catch the
signal for SIGILL, which is delivered when a process in user mode
executes an illegal instruction.
An exception is also a mechanism to change to supervisor state as a result
of:
• Program errors
• Unusual conditions
• Program requests
Branching to exception handlers
After an exception, the system switches to supervisor state and branches to an exception handler routine. The branch address is found in the content of a specific memory location called a "vector."
System reset exception
The system reset exception is used when a system reset is initiated by the system administrator. This generally causes a "soft" reboot of the system.
Machine check exception
The machine check exception is generated when a hardware machine check occurs. This generally indicates either a hardware bus error or a bad real address access. If a machine check occurs with the ME bit off, a machine checkstop occurs. Generally, a machine check exception causes a kernel crash dump to be generated. A machine checkstop causes no kernel crash dump to be generated, though a checkstop record is generated.
Data storage exception
Data storage interrupt (DSI) and instruction storage interrupt (ISI) exceptions are caused by the hardware not being able to find a translation for an instruction fetch or load/store operation. These generally result in a page fault.
Floating point unavailable exception
The floating point unavailable exception is caused when a thread executes a floating point instruction when floating point operations are not allowed.
This generally indicates that a thread has not executed any floating point
instructions yet or that another thread’s floating point data is currently in
the processor’s floating point registers. AIX does not save a thread’s
floating point register values until it first uses the floating point registers.
On UP systems, AIX does not save off floating point registers for the
currently running thread when another thread is dispatched. Often, no
other thread will use the floating point registers before the thread is again
dispatched. This saves AIX having to save and restore the floating point
registers on every thread dispatch.
Decrementer exception
The decrementer exception is caused when the decrementer register has reached the value zero. This indicates that a timer operation has completed.
System call exception
The system call exception occurs whenever a thread executes a system call.
Interrupts
Description of interrupts
Interrupts are asynchronous events that may be generated by the system or a device, and which interrupt the execution of the current process.
Interrupts usually occur when a process is running and some
asynchronous event occurs such as disk I/O completion or a clock tick.
The event usually has nothing to do with the current running process. The
kernel immediately preempts the current running process to handle the
interrupt. The state of the machine is saved on the stack and the interrupt
is handled. The user process has no knowledge that the interrupt
occurred.
Interrupts are one of the major reasons that AIX cannot be a hard real-time
system. No guarantee can be made as to how long it may take for some
action to occur as it may get interrupted any number of times during the
action.
Interrupts are caused outside the context of a process. In general, a process may NOT decide how to react to the interrupt. Interrupt handlers are kernel code and run without regard to the process, unless the nature of the interrupt is to update some process-related structure, statistics, and so on.
Interrupt levels
Each interrupt has a level and an associated priority; the level is a value that is used to differentiate between interrupts, and the priority ranks the importance of each one.
Devices with interrupt facilities, such as adapter cards, have an associated interrupt level. When the system receives an interrupt with that level, AIX knows that it was caused by the device at that level.
In AIX, devices may share interrupt levels, such that more than one adapter may share the same level.
Controlling Interrupts
A kernel process can disable some or all types of interrupts for short periods. The interrupted process will safely return to continue execution. Some interrupt types can be found in <sys/m_intr.h>.
Most interrupts are not concerned with which process is getting
interrupted. The major counterexample is the clock interrupt. This is used
to update the run-time statistics for the currently running process.
Critical sections
A critical section is a code section that must be executed without any break, for example when data is examined and then changed based on its value. A process disables interrupts across a critical section to ensure that the section is executed without breaks.
Out-of-order instruction execution and interrupts
On modern processors, such as Power and IA-64, many instructions are being executed at one time. When a hardware interrupt occurs, instructions already in progress are executed to completion and any following instructions are terminated with no effect on the processor registers or memory; results from out-of-order instructions are discarded. This is what is meant by "interrupts are guaranteed to occur between the execution of instructions": the processor makes sure that the effect of its operations is equivalent to an interrupt occurring between the execution of instructions.
Interrupt handling
When an interrupt is received, AIX performs several steps to handle the interrupt properly.
Saving and restoring machine state
AIX maintains a set of machine state save (mstsave) areas. Each processor has a pointer to the mstsave area it should use when the next interrupt occurs. This pointer is called the current save area, or csa, pointer. When state needs to be saved, AIX uses this area. When an interrupt handler returns, AIX restores the machine state that was in effect when the interrupt occurred.
mstsave area description
Because the mstsave (machine state) areas are linked together, the mstsave areas provide an interrupt history stack.
(Figure: the csa pointer references the current mstsave area, and each area's prev pointer links to the one before it, forming a chain.)
Size limitation on mstsave area and interrupt stack
The stack used by an interrupt handler is kept in the same page as the mstsave area. This limits the stack to 4 KB on the 32-bit kernel and 8 KB on the 64-bit kernel, minus the size of the mstsave area. Using this area for the stack ensures that the stack is pinned, which is required for interrupt handlers.
Saving base level machine state
The thread's base level state save area is in the initial thread's uthread block. In the 32-bit kernel, there is also the user64 area, which is used to save the 64-bit user registers for 64-bit processes. (Figure: the uthread block within the process ublock, with the user64 area present in the 32-bit kernel only.)
The user64 area is only used when the process is a 64-bit process on a 32-bit kernel. If the user64 area is being used, it is initialized and pinned. The area is created when a process calls exec() for a 64-bit executable. It is destroyed when a 64-bit process exits or calls exec() for a 32-bit executable.
The portion of the base level state save area that contains the 32-bit registers is unused for 64-bit processes.
In the 32-bit kernel, only the base level state save (MST) area needs to have a 64-bit register state save area (user64) associated with it. Since all interrupt handlers run in 32-bit kernel mode, all state save areas other than the base level state save area only need to save 32-bit state (even on 64-bit hardware). In the 64-bit kernel, all MST areas are 64-bit.
IA-64 formats
(Figure: IA-64 data formats, with bit positions 63, 31, 15, 7, 0 marked on one format and 79, 63, 31, 0 on another.)
The basic IA-64 data type is 8 bytes. Apart from a few exceptions, all integer operations are on 64-bit data, and registers are always written as 64 bits. Therefore, 1-, 2-, and 4-byte operands loaded from memory are zero-extended to 64 bits.
Instruction format
A typical IA-64 instruction is a three-operand instruction, with the following syntax:

Simple instruction:
add r1 = r2, r3
Predicated instruction:
(p4) add r1 = r2, r3
Instruction with immediate:
add r1 = r2, r3, 1
Instruction with completer:
cmp.eq p3 = r2, r4
IA-64 memory
Memory organization
IA-64 defines a single, uniform, linear address space of 2^64 bytes, which is divided into 8 regions of size 2^61. A single space means that both data and instructions share the same memory range. Uniform means that there
and instructions share the same memory range. Uniform means that there
are no address regions with predefined functionality. Linear means that the
address space contains no segments; all 2^64 bytes are consecutive.
All code is stored in little-endian byte order in memory. Data is typically
stored in little-endian byte order. IA-64 also provides support for big-endian
code and operating systems.
Moving data between registers and memory is performed strictly through the load (ld) and store (st) instructions. IA-64 supports loads and stores of all data types. Because registers are written as 64 bits, loads are zero-extended. Stores always write the exact number of bytes for the required format.
The size of the memory location is specified in the opcode as a number:
• st1/ld1 = byte (8 bits)
• st2/ld2 = halfword (16 bits)
• st4/ld4 = word (32 bits)
• st8/ld8 = doubleword (64 bits)
Example:

// Load 32 bits from address 4 + r30 into r31; high 32 bits are cleared on a
// 64-bit processor
add r31 = 4, r30
ld4 r31 = [r31]
Region Usage
On IA-64, the 64-bit linear address space consists of 8 regions of size 2^61, with the upper 3 bits of the address selecting a virtual region, a physical region register, and an associated region identifier. The region identifier (RID), much like the POWER segment identifier (SID), participates in the hardware address translation, such that in order to share the same address translation, the same RID must be used. The sharing semantic (private, globally shared, shared-by-some) is determined by whether or not multiple processes utilize the same RID.
For example, a process’s private storage resides within a region whose
RID is mapped only by that process. Therefore, address space usage is in
a large part determined by assigning the desired sharing semantics to
each of the 8 virtual regions and mapping the appropriate objects into
those regions that require those semantics.
There are two imporant properties associated with this region usage. First,
the mapping of objects to regions is many-to-one. That is, multiple objects
map into a single region. Second, mapping the same object to different
regions results in aliases. This is a distinct difference from the POWER
architecture where an object (a.k.a. SID) is addressed the same
regardless of the virtual address used. Aliases simply additional address
translations on IA64 and thus a likelyhood for decreased performance and
so their use should be minimized.
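The region selection described above can be sketched as a simple bit extraction. This is an illustrative model (the helper names are not from any IA-64 or AIX documentation): the upper 3 bits of a 64-bit virtual address pick one of the 8 regions, and the remaining 61 bits are the offset within that region.

```python
def region_of(va):
    """Return the virtual region number (0-7) selected by the
    upper 3 bits of a 64-bit IA-64 virtual address."""
    return (va >> 61) & 0x7

def region_offset(va):
    """Return the offset within the 2^61-byte region."""
    return va & ((1 << 61) - 1)
```

For example, an address whose top three bits are all set falls in region 7.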
Another significant departure from AIX is that the majority of the 64-bit address space is managed using Single Address Space (SAS) semantics. This is necessary to achieve the desired degree of sharing of address translations for shared objects: to achieve a single translation for an object, all accesses must be made through a common global address. Such a semantic is possible by virtue of the IA-64 protection keys, which provide additional access control beyond address translation. A process that maps a region thus has access only to those objects within the region for which it holds the appropriate protection key. Note that AIX manages some parts of the process address space as SAS -- for example, the shared library text segment contains mappings whose addresses are common across all processes. The AIX use of the SAS style of management is minimal because the POWER architecture provides for sharing on a segment basis regardless of the virtual address used to map the segment. To achieve the same degree of sharing on IA-64, a shared object must be mapped at a global address.
Region usage (continued): In addition to the sharing semantics, additional properties influence the location of objects within regions. First, to preserve the flat address space with a logical boundary between user and kernel space, it is useful to place user and kernel objects at opposite ends of the address space whenever feasible. Next, the IA-64 architecture provides for multiple page sizes and a preferred page size per region, so objects with similar page size requirements are most naturally colocated within the same region. Finally, certain object types, such as executable text, have properties and uses that mandate that they be isolated to a separate region.

Given these general guidelines, the following table shows the selected region usage, and subsequent sections describe each region's use in greater detail. These selections dedicate 4 regions to user space and 3 to the kernel for the initial release.
IA-64 Instructions
IA-64 processor: Instruction groups reduce the need to optimize the code for each new microarchitecture. Processors with additional resources will take advantage of the existing ILP in the instruction group.

The template field maps each instruction to an execution unit. This allows the processor to dispatch all three instructions in parallel. A 128-bit bundle holds three 41-bit instruction slots and a 5-bit template field:

bits 127-87 = instruction slot 2, bits 86-46 = instruction slot 1, bits 45-5 = instruction slot 0, bits 4-0 = template
Template: The set of templates defines the combinations of functional units that can be invoked by executing a single bundle. This in turn lets the compiler schedule the functional units in an order that avoids contention. The template can also indicate a stop. The 24 available templates are:

MII   MIIs     MMI   MMIs     MFI   MFIs     MIB   MIBs
MIsI  MIsIs    MsMI  MsMIs    MMF   MMFs     MBB   MBBs
MLX*  MLXs*    BBB   BBBs     MMB   MMBs     MFB   MFBs

Where:
M - is a memory function
I - is an integer function
F - is a floating point function
B - is a branch function
L - is a function involving a long immediate
"s" indicates a stop.
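The bundle layout above can be sketched as a simple field extraction. This is a minimal sketch, assuming the bundle is available as a 128-bit integer; decoding the 41-bit slots into actual instructions is not attempted.

```python
def decode_bundle(bundle):
    """Split a 128-bit IA-64 bundle into its template and three
    41-bit instruction slots (bits 4-0 = template, bits 45-5 =
    slot 0, bits 86-46 = slot 1, bits 127-87 = slot 2)."""
    mask41 = (1 << 41) - 1
    template = bundle & 0x1F
    slot0 = (bundle >> 5) & mask41
    slot1 = (bundle >> 46) & mask41
    slot2 = (bundle >> 87) & mask41
    return template, slot0, slot1, slot2
```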
Branch instructions: All instructions beginning with "br." are branches. The IA-64 architecture provides three branch types:
IA-64 Registers
Registers: IA-64 provides several register files that are visible to the programmer:
• 128 64-bit general registers (gr0-gr127)
• 128 82-bit floating-point registers (fr0-fr127)
• 64 1-bit predicate registers (p0-p63)
• 8 64-bit branch registers (br0-br7)
• 128 64-bit application registers (ar0-ar127)
• the 64-bit instruction pointer (IP)
General registers: The 128 64-bit general registers (gr0-gr127) are used for integer computation. gr0 always reads as zero and cannot be written.
Floating-point registers: There are:
• 32 static floating-point registers (fr0-fr31)
• 96 rotating floating-point registers (fr32-fr127), for software pipelining
The first two registers (fr0 and fr1) are read-only: fr0 always reads +0.0 and fr1 always reads +1.0.

Predicate registers: The 64 1-bit predicate registers (p0-p63) are used for:
• validating/invalidating instructions
• eliminating branches in if/then/else logic blocks
p0 always reads 1.
With predication, IA-64 can execute both paths of a conditional in parallel, using the predicate registers to cancel the path that is not taken. In this way the processor avoids branch penalties by simply executing both alternatives and committing only the results whose predicate is true.
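Predicated execution of an if/then/else can be modeled in a few lines. This is a sketch of the idea, not real IA-64 semantics: cmp.eq sets a complementary predicate pair, both predicated moves "execute," and only the one whose predicate is true commits its result.

```python
def predicated_select(a, b, then_val, else_val):
    """Model of if/then/else compiled to predicated code."""
    p2 = (a == b)          # cmp.eq p2, p3 = a, b
    p3 = not p2
    r1 = None
    if p2:                 # (p2) mov r1 = then_val -- commits only if p2
        r1 = then_val
    if p3:                 # (p3) mov r1 = else_val -- commits only if p3
        r1 = else_val
    return r1
```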
Branch registers: Eight 64-bit branch registers (br0-br7) are used to specify the branch target addresses for indirect branches.
Application registers: The 128 64-bit application registers (ar0-ar127) include:
• ar0-ar7: KR0-KR7 (kernel registers)
• ar36: UNAT (user NaT collection)
• ar40: FPSR (floating-point status register)
• ar44: ITC (interval time counter)
• ar64: PFS (previous function state)
• ar65: LC (loop count)
• ar66: EC (epilog count)
Instruction pointer (IP): The 64-bit instruction pointer holds the address of the bundle containing the currently executing instruction. The IP cannot be directly read or written; it increments as instructions are executed, and branch instructions set it to a new value. The IP is always 16-byte aligned.
Register validity: Each general register has an associated NaT (Not a Thing) bit that tracks whether its contents are valid.

If data needs to travel from memory to the processor, there is always a delay before it arrives. This is called memory latency. In an attempt to hide this time, the processor tries to read the memory beforehand.

If data has been read in advance and other data has then been written back to that exact location, the data already read in becomes invalid.
IA-64 Operations
Rotating registers are registers that are rotated by one register position on each loop iteration. The logical names of the registers are rotated in a wrap-around fashion, so that the value in logical register X appears in logical register X+1 after one rotation. The predicate, floating-point, and general registers can be rotated.
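A one-position rotation can be sketched as follows. This is a simplified model: real hardware renames the registers rather than moving any values.

```python
def rotate_once(regs):
    """After one rotation, the value that was visible in logical
    register X becomes visible in logical register X+1, with
    wrap-around from the last register back to the first."""
    return [regs[-1]] + regs[:-1]
```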
IA-64 provides support for special branch instructions. One example is the
br.cloop instruction, used for simple counted loops.
The cloop branch instruction uses the LC application register, rather than a qualifying predicate, to determine the branch condition.
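The br.cloop behavior can be sketched as follows, under the usual semantics: while LC is nonzero, the branch decrements LC and loops back; when LC reaches zero, the loop falls through, so in this model the body runs LC+1 times.

```python
def cloop(lc, body):
    """Model of a counted loop driven by the LC application register
    and br.cloop (no qualifying predicate is involved)."""
    while True:
        body()
        if lc == 0:        # br.cloop falls through when LC == 0
            break
        lc -= 1            # otherwise decrement LC and branch back
```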
IA-64 allows you to eliminate many memory accesses through the use of
large register files to manage work in progress, and by allowing better
control of the memory hierarchy.
(Figure: an early load (ld) creates a dependency that a later check instruction validates.)
Memory access is supported through the load (ld) and store (st)
instructions. All other integer, floating-point and branch instructions use the
registers as operands.
IA-64 enables you to hide the memory latency of the remaining load instructions by placing speculative loads prior to code barriers, so the stall caused by memory latency is minimized. This also creates more opportunities for parallelism. When you use speculative loads, error/exception detection is deferred until the final result is actually required:
• If no error/exception is detected, the latency is hidden.
• If an error/exception is detected, the memory accesses and dependent instructions must be redone by an exception handler.
IA-64 provides an advanced load instruction (ld.a) that allows you to move potentially data-dependent loads earlier in the code. To verify the data speculation, a check load instruction (ld.c) must be placed at the location of the original load instruction.
If the contents of the memory address have not changed since the advanced load, the speculation succeeded and the memory latency is hidden. If the contents have been changed by a store instruction, the ld.c instruction repeats the load.

Data speculation does not defer exceptions. For example, page faults are taken immediately.
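The ld.a/ld.c pair can be modeled with a small table of advanced-load entries. This is an assumed simplification for illustration, not the actual hardware tracking structure: ld.a records the address it loaded from, an intervening store to that address invalidates the entry, and ld.c either keeps the speculative value (hit) or reloads from memory (miss).

```python
class AdvancedLoadTable:
    """Sketch of data speculation with ld.a / ld.c."""
    def __init__(self):
        self.entries = {}                 # target register -> address

    def ld_a(self, mem, reg, addr):
        """Advanced load: record the address, return the value."""
        self.entries[reg] = addr
        return mem[addr]

    def st(self, mem, addr, value):
        """Store: invalidate any advanced-load entry for this address."""
        mem[addr] = value
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def ld_c(self, mem, reg, addr, spec_value):
        """Check load: keep the speculative value on a hit, reload on a miss."""
        if self.entries.get(reg) == addr:
            return spec_value             # speculation succeeded; latency hidden
        return mem[addr]                  # conflicting store occurred; reload
```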
IA-64 also provides a control-speculative load instruction (ld.s), which executes the load while speculating on the outcome of the governing branch. Control-speculative loads are also referred to simply as speculative loads.

To verify the load, a check instruction (chk.s) is placed at the location of the original load. IA-64 uses a NaT bit/NaTVal to track the success of the load. If the NaT bit/NaTVal indicates a deferred exception, the chk.s instruction jumps to correction code that repeats all dependent instructions. The correction code is generated by the compiler or assembly writer.

If the load is successful, the speculation succeeded and the memory latency is hidden.
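The deferred-exception behavior of ld.s/chk.s can be sketched like this. It is an illustrative model of the NaT mechanism, not real instruction semantics: a failing speculative load sets a flag instead of faulting, and the later check routes execution to recovery code if the flag is set.

```python
def ld_s(mem, addr):
    """Speculative load: instead of faulting on a bad address,
    set the NaT flag so the exception is deferred."""
    if addr not in mem:
        return None, True                 # deferred exception: NaT set
    return mem[addr], False

def chk_s(value, nat, recovery):
    """Check: run recovery code if NaT is set; otherwise the
    speculative value is used and the latency was hidden."""
    return recovery() if nat else value
```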
Procedure calls: The traditional use of a procedure stack in memory for procedure call management incurs a large overhead. IA-64 instead uses the general register stack for procedure call management, eliminating the frequent memory accesses. The general register stack consists of 96 general registers, starting at r32, used to pass parameters to the called procedure and to store local variables for the currently executing procedure. This register stack structure allows:
• the caller procedure to pass parameters through registers to the called procedure
• dynamic allocation of local registers for the currently executing procedure
• allocating a maximum of 96 logical registers for each function
IA-32:
  Procedure A
    call B
  Procedure B
    save current register state
    ...
    restore previous register state
    return

IA-64:
  Procedure A
    call B
  Procedure B
    alloc -- no save!
    ...
    return -- no restore!
The called procedure can resize the frame to include its own input, local
and output area, using the alloc instruction. For each subsequent call, this
sequence is repeated, and a new procedure frame is created.
When the procedure returns, the processor unwinds the register stack, the
current frame is released, and the previous procedure’s frame is restored.
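The overlap of caller and callee frames can be sketched as follows. This is a simplified model under assumed conventions (frame bookkeeping as (base, size) pairs is illustrative; the hardware uses the PFS application register for this):

```python
class RegisterStack:
    """Sketch of IA-64 register-stack frames: stacked registers begin
    at r32; a call starts the callee's frame at the caller's output
    area, so parameters are passed without copying; alloc resizes the
    current frame; return restores the caller's frame."""
    def __init__(self):
        self.frames = [(32, 0)]           # (base register, frame size)

    def call(self, n_out):
        base, size = self.frames[-1]
        # The callee's frame begins at the caller's output registers.
        self.frames.append((base + size - n_out, n_out))

    def alloc(self, ins, locs, outs):
        base, _ = self.frames[-1]
        self.frames[-1] = (base, ins + locs + outs)

    def ret(self):
        self.frames.pop()                 # caller's frame is restored
```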
Register stack engine (RSE): IA-64 provides a Register Stack Engine (RSE), which operates transparently in the background to ensure that an overflow does not occur and that the contents of the registers are always available. The RSE is not visible to the software.

When the stack fills up, the RSE saves stacked registers to memory, thus freeing them. The stored registers are restored in the same way when necessary.
Floating point and multimedia: IA-64 provides high floating-point performance, with full IEEE floating-point support for single, double, and double-extended formats.

Multimedia instructions operate in parallel on the packed elements of a register (for example, computing a3+b3, a2+b2, a1+b1, a0+b0 in a single instruction):
• Addition and subtraction (including 3 forms of saturating arithmetic)
• Multiplication
• Left shift, signed and unsigned right shift
• Pack and unpack to convert between different element sizes

Each 82-bit floating-point register holds a sign bit (bit 81), an exponent (bits 80-64), and a significand (bits 63-0).
IA-64 provides four separate status fields (sf0-sf3), enabling four different computational environments. Each status field contains dynamic control and status information for floating-point operations.

The FPSR contains the four status fields and a traps field that enables masking of the IEEE exception events and denormal operand exceptions. This register also includes 6 reserved bits, which must be 0.
Interrupts: Interrupts are events that occur during IA-32 or IA-64 instruction processing, causing control to be passed to an interrupt handling routine. In the process, certain processor state is saved automatically by the processor. Upon completion of interrupt processing, a return from interrupt (rfi) is executed, which restores the saved processor state. Execution then proceeds with the interrupted IA-32 or IA-64 instruction.

Interrupt definitions: Depending on how an interrupt is serviced, interrupts are divided into IVA-based interrupts and PAL-based interrupts. By cause, interrupts are divided into four types: Aborts, Interrupts, Faults, and Traps.
Aborts
Processor Reset (RESET): A processor has been powered on or a reset request has been sent to it. The PALE_RESET entry point is entered to perform processor and system self-test and initialization.
Faults: The current IA-64 or IA-32 instruction requests an action that cannot or should not be carried out, or system intervention is required before the instruction is executed. Faults are synchronous with respect to the instruction stream. The processor completes state changes that have occurred in instructions prior to the faulting instruction. The faulting and subsequent instructions have no effect on machine state. Faults are IVA-based interrupts.
Traps: The IA-32 or IA-64 instruction just executed requires system intervention. Traps are synchronous with respect to the instruction stream. The trapping instruction and all previous instructions are completed. Subsequent instructions have no effect on machine state. Traps are IVA-based interrupts.
(Figure: RESET, MCA, INIT, and PMI are PAL-based interrupts; INT (NMI, ExtINT, ...) are IVA-based interrupts.)
Interrupt programming model: When an interrupt event occurs, hardware saves the minimum processor state required to enable software to resolve the event and continue. The state saved by hardware is held in a set of interrupt resources, and
together with the interrupt vector gives software enough information to
either resolve the cause of the interrupt, or surface the event to a higher
level of the operating system. Software has complete control over the
structure of the information communicated, and the conventions between
the low-level handlers and the high-level code. Such a scheme allows
software rather than hardware to dictate how to best optimize performance
for each of the interrupts in its environment. The same basic mechanisms
are used in all interrupts to support efficient IA-64 low-level fault handlers
for events such as a TLB fault, speculation fault, or a key miss fault.
PSR.ic (interrupt state collection bit): The PSR.ic bit supports an efficient nested interrupt model. Under normal circumstances the PSR.ic bit is enabled.
When an interrupt event occurs, the various interrupt resources are
overwritten with information pertaining to the current event. Prior to saving
the current set of interrupt resources, it is often advantageous in a miss
handler to perform a virtual reference to an area which may not have a
translation. To prevent the current set of resources from being overwritten
on a nested fault, the PSR.ic bit is cleared on any interrupt. This will
suppress the writing of critical interrupt resources if another interrupt
occurs while the PSR.ic bit is cleared. If a data TLB miss occurs while the
PSR.ic bit is zero, then hardware will vector to the Data Nested TLB fault
handler.
e-server p-series (RS/6000) introduction: This section introduces the RS/6000, giving a brief history of the products, an overview of the RS/6000 design, and a description of key RS/6000 technologies.
The RS/6000 family combines the benefits of UNIX computing with IBM's leading-edge RISC technology in a broad product line - from powerful desktop workstations ideal for mechanical design, to workgroup servers for departments and small businesses, to enterprise servers for medium to large companies running ERP and server consolidation applications, up to massively parallel RS/6000 SP systems that can handle demanding scientific and technical computing, business intelligence, and Web serving tasks. Along with AIX, IBM's award-winning UNIX operating system, and HACMP, the leading high-availability clustering solution, the RS/6000 platform provides the power to create change and the flexibility to manage it, with a wide variety of applications that provide real value.
RS/6000 The first RS/6000 was announced February 1990 and shipped June
History 1990. Since then, over 1,100,000 systems have shipped to over 132,000
customers.
The next figure summarizes the history of the RS/6000 product line,
classified by machine type. For each machine type, the I/O bus
architecture and range of processor clock speeds are indicated. The
figure shows the following:
• In the past, RS/6000 I/O buses were based on the Micro Channel Architecture (MCA). Today, RS/6000 I/O buses are based on the industry-standard Peripheral Component Interconnect (PCI) architecture.
• Processor speed, one key element of RS/6000 system performance,
has increased dramatically over time.
• There have been many machine types over the entire RS/6000
history. In recent years, there has been considerable effort to reduce
the complexity of the model offerings without creating gaps in the
market coverage.
RS/6000 history (1990 to today):
• 7011 (33 to 80 MHz) - Micro Channel Workstations
• 7248 (100 to 133 MHz) - PCI Workstations
• 7006 (80 to 120 MHz) - Micro Channel Entry Desktops
• 7009 (80 to 120 MHz) - Micro Channel Compact Servers
• 7013 (20 to 200 MHz) - Micro Channel Deskside Systems
• 7012 (20 to 200 MHz) - Micro Channel Desktop Systems
• 7015 (25 to 200 MHz) - Micro Channel Rack Systems
• 7024 (100 to 233 MHz) - PCI Deskside Systems
• 7025 (166 to 500 MHz) - PCI Workgroup Servers - Deskside Systems
• 7043 (166 to 375 MHz) - PCI Workstations & Workgroup Servers
• 7044 (333 to 400 MHz) - PCI Workstations & Workgroup Servers
• 7046 (375 MHz) - PCI Workgroup Servers - Rack Systems
• 7026 (166 to 500 MHz) - PCI Workgroup Servers - Rack Systems
• 7017 (125 to 450 MHz) - PCI Enterprise Servers
• SP1, SP2, SP - All Node Types
PowerPC and Power2 CPU family: PowerPC CPUs started as a joint effort between Motorola, Apple, and IBM. The family consists of the PPC601, PPC604, and PPC604e. These CPUs are very close to those produced by Motorola and used in Apple systems; currently the PPC604e CPU is used in the Models F50, B50, and 43P.
(Figure: CPU block diagram - two floating-point units (FPU1, FPU2), three fixed-point units (FXU1-FXU3), and two load/store units (LS1, LS2). CPU registers: 32 x 64-bit integer (fixed point) and 32 x 64-bit FP (floating point); register buffers for register renaming: 24 FP, 16 integer. Branch/dispatch unit with a 2048-entry branch history table and a 256-entry branch target cache. 32 KB, 128-way instruction cache and 64 KB, 128-way data cache, each with its own memory management unit. A bus interface unit (BIU) with L2 control and clock connects to the direct-mapped 1-16 MB L2 cache over a 32-byte path (at 200 MHz = 6.4 GB/s) and to the 6XX bus over a 16-byte path (at 100 MHz = 1.6 GB/s).)
RS64 and RS64 II CPUs: The RS64 microprocessor, based on the PowerPC Architecture, was designed for leading-edge performance in OLTP, e-business, BI, server consolidation, SAP, Notesbench, and Web serving for the commercial and server markets. It is the basis for at least four generations of RS/6000 and AS/400 enterprise server offerings.

The RS64 processor focuses on commercial performance, with emphasis on conditional branches with a zero- or one-cycle incorrect branch prediction penalty. It contains 64 KB L1 instruction and data caches, one-cycle load support, four superscalar fixed-point pipelines, and one floating-point pipeline. An on-board bus interface unit (BIU) controls both the 32 MB L2 bus interface and the memory bus interface.
RS64 and RS64 II are defined by the following specifications:
• 125 MHz RS64/262 MHz RS64 II on the RS/6000 Model S70
• 262 MHz RS64 II on the RS/6000 Model S70 Advanced
• 340 MHz RS64 II on the RS/6000 Model H70
• 64 KB on-chip, L1 instruction cache
• 64 KB on-chip four-way set associative data cache
• 32 MB L2 cache
• Superscalar design with integrated integer, floating-point, and branch
units
• Support for up to 64-way SMP configurations (currently 12-way)
• 128-bit data bus
• 64-bit real memory addressing
• Real memory support for up to one terabyte (2^40 bytes)
• CMOS 6S2 using a 162 mm2 die, 12.5 million transistors
(Figure: RS64 block diagram - simple fixed-point, complex fixed-point, floating-point, and load/store units; branch/dispatch unit; instruction and data caches, each with a memory management unit and a 32-byte path; bus interface unit (BIU) with L2 control and clock, connecting over a 32-byte interface to the 1-32 MB L2 cache and a 16-byte interface to the 6XX bus.)
RS64 III: The RS64 III processor is designed for applications that place heavy demands on system memory. The RS64 III architecture addresses both the need for very large working sets and the need for low latency. Latency is measured by the number of CPU cycles that elapse before requested data or instructions can be used by the processor.
The RS64 III processors combine IBM advanced copper chip technology
with a redesign of critical timing paths on the chip to achieve greater
throughput. The L1 instruction and data caches have been doubled to
128 KB each. New circuit design techniques were used to maintain the
one cycle load-to-use latency for the L1 data cache.
L2 cache performance on the RS64 III processor has been significantly
improved. Each processor has an on-chip L2 cache controller and an
on-chip directory of L2 cache contents. The cache is four-way set
associative. This means that directory information for all four sets is
accessed in parallel. Greater associativity results in more cache hits and
lower latency, which improves commercial performance.
Using a technique called Double Data Rate (DDR), the new 8 MB static RAM (SRAM) used for the L2 is capable of transferring data twice during each clock cycle. The L2 interface is 32 bytes wide and runs at 225 MHz (half processor speed), but, because of the use of DDR, it provides 14.4 GB/s of throughput.
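The quoted throughput follows directly from the figures in the text:

```python
# RS64 III L2 bandwidth: a 32-byte-wide interface at 225 MHz with
# two transfers per cycle (DDR).
width_bytes = 32
clock_hz = 225_000_000
transfers_per_cycle = 2                   # Double Data Rate
bandwidth_bytes_per_s = width_bytes * clock_hz * transfers_per_cycle
# 32 * 225e6 * 2 = 14.4e9 bytes/s, i.e. 14.4 GB/s
```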
System bus information: All current systems in the RS/6000 family are equipped with PCI buses. The PCI architecture provides an industry-standard specification and protocol that allows multiple adapters access to system resources through a set of adapter slots.

Each PCI bus has a limit on the number of slots (adapters) it can support, typically from two to six. To overcome this limit, the system design can implement multiple PCI buses. Two different methods can be used to add PCI buses to a system:
• Secondary PCI bus. The simplest method to add PCI slots when designing a system is to add a secondary PCI bus. This bus is bridged onto a primary bus using a PCI-to-PCI bridge chip.
• Multiple primary PCI buses. Another method of providing more PCI slots is to design the system with two or more primary PCI buses. This design requires a more sophisticated I/O interface with the system memory.
32-bit hardware characteristics: 32-bit POWER and PowerPC processors all have the following features in common:
User registers
• 32 general-purpose integer registers, each 32 bits wide (GPRs)
• 32 floating-point registers, each 64 bits wide (FPRs)
• A 32-bit Condition Register (CR)
• A 32-bit Link Register (LR)
• A 32-bit Count Register (CTR)
System Registers
• 16 Segment Registers (SRs)
• A Machine State Register (MSR)
• A Data Address Register (DAR)
• Two Save and Restore Registers (SRRs)
• 4 special purpose (SPRG) registers (PowerPC only)
All instructions are 32 bits long. The Data Address Register contains the memory address that caused the last memory-related exception.

The SRRs are used to save information when an interrupt occurs:
• SRR0 points to the instruction that was running when the interrupt occurred
• SRR1 contains the contents of the MSR when the interrupt occurred
General purpose registers: The General Purpose Registers (GPRs), often just called Rs, are used for loads, stores, and integer calculations.

No memory-to-memory operations are provided; data always moves through registers.
Condition register: The condition register (CR) contains bits set by the results of compare instructions. It is treated as eight 4-bit fields (CR0-CR7). The bits are used to test for less-than, greater-than, equal, and overflow conditions.
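Treating the CR as eight 4-bit fields can be sketched as a bit extraction (the helper name is illustrative; CR0 occupies the most significant bits of the 32-bit register):

```python
def cr_field(cr, n):
    """Extract the 4-bit field CRn from the 32-bit condition register.
    Within a field, the most significant bit is LT, then GT, then EQ,
    then the summary-overflow bit."""
    return (cr >> (28 - 4 * n)) & 0xF
```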
Link register: The link register (LR) is set by some branch instructions. Its contents point to the instruction to be executed immediately after the branch. It is typically used in subroutine calls to determine where to return to.
Machine state register: The MSR controls many of the current operating characteristics of the processor. Among these are:
• Privilege Level (Supervisor vs. Problem or Kernel vs. User)
• Addressing modes (virtual vs. real)
• Interrupt enabling
• Little-endian vs. Big-endian mode
Instruction set
A single instruction generally modifies only one register or one memory location. Exceptions to this are the "multiple" and "update" operations.

Register to register operations: These operations always list at least two registers, where the first is the target for the result of the instruction and the others provide the input to the operation.
Examples:
• or r3,r4,r5   # Logical ORs r4 and r5, result into r3
• addi r1,r1,0x48   # Adds 0x48 to r1, result into r1
Register to memory operations: Register-memory operations always have one register and one memory location. The register is always listed first. The size of the memory location is specified in the opcode:
• b = byte (8 bits)
• h = halfword (16 bits)
• w = word (32 bits)
• d = doubleword (64 bits)
All opcodes beginning with “l” are loads and all opcodes beginning with
“st” are stores.
Register to memory operation examples:
• lwz r31,4(r30)   # Loads 32 bits from address 4+r30 into r31. High 32 bits cleared on a 64-bit processor
• std r3,-8(r29)   # Stores 64 bits from r3 to address r29 - 8. Invalid operation on a 32-bit processor
• lbz r0,27(r1)   # Loads 8 bits from address 27+r1 into r0. Top 24/56 bits are cleared
• sth r3,0x56(r1)   # Stores low 16 bits from r3 to address 0x56+r1
Notice that the load instructions also have a "z" in their mnemonics. The "z" stands for "zero," and is intended to make clear that these instructions clear any bits in the target register that were not actually copied from memory.

In case you were wondering, there are load instructions without the "z". lwa and lha are "algebraic" loads, meaning that the value being loaded is sign-extended to fill out the rest of the register. This is used when loading a signed value: if a halfword held a negative value, lhz would make it positive, but lha would preserve the value's "negativeness."
Compare instructions: There are four variations of compare instructions, all beginning with "cmp". They compare two values. The result of the comparison is placed in the Condition Register (CR), where the various bits that can be set are:
• LT = less than
• GT = greater than
• EQ = equal
• OV = overflow (a.k.a. carry bit)
Branch instructions: All instructions beginning with a "b" are branches. They change the address of the next instruction to be run.

Branches can be conditional, depending on whether the option bit matches the specified bit in the CR. A branch instruction can specify which CR field to use; CR0 is assumed unless otherwise specified. Extended mnemonics are defined by the assembler to cover most combinations.
The conditional branch instruction is central to any computer
architecture. However, most architectures (including POWER and
PowerPC) avoid putting comparisons directly into their branch
instructions (to keep things simple). They provide compare instructions
that set “condition bits.” These bits are what are used on branch
instructions to make the actual decision.
The assembler (and crash’s disassembler) provides extended
mnemonics that combine a type of branch and the condition register bit
that determines whether the branch is taken. Another bit in the branch
opcode determines whether the CR bit must be on or off for the branch to
take place. This bit is also incorporated into the extended mnemonics
(the “not” versions of the branches). For maximum flexibility, the
assembler usually also allows you to specify the “not” cases as the
logically-opposite case. For example, bnl (branch not less than) can also
be written as bge (branch greater than or equal to). Either case is still
saying, “branch if the LT bit is turned off.”
Examples
• blt 0x38c00   # Branches to address 0x38c00 if LT bit is on in CR0
• bge cr3,0x...   # Branches if LT bit is off in CR3
• bnelr cr7   # Branches to address in LR if EQ bit is off in CR7
• blea cr2,0x3600   # Branches to absolute address 0x3600 if GT bit is off in CR2
Trap instructions: Most mnemonics beginning with a "t" are traps, and generate a program exception if the specified condition is met. There are two variations of the trap instruction.

The "w" mnemonics are the PowerPC indication that these trap instructions work on 32-bit values. As with branches, extended mnemonics are defined to provide various traps; in this context 'lt', 'gt', 'eq', etc. have the same meaning as in branch mnemonics.
Examples
• tweq r3,r4   # Traps if r3 equals r4
• twnei r31,0   # Traps if r31 is not equal to 0
Trap instructions are the only instructions in this architecture that perform
a comparison and take some action, all in one instruction. They do not
set or use condition register bits.
Special register operations: The Special Purpose Registers (SPRs) can only be copied to or from GPRs.
• mfspr r3,8   # Copies SPR 8 into r3
• mtspr 9,r3   # Copies r3 into SPR 9
Interrupt vectors: Interrupt vectors are addresses of short sections of code that save the state of the processor and then branch to a handler routine. Some examples are:
64-bit hardware characteristics: With full hardware 32-bit binary compatibility as the baseline, the features that characterize a PowerPC processor as 64-bit include:
• 64-bit general registers
• 64-bit instructions for loading and storing 64-bit data operands, and
for performing 64-bit arithmetic and logical operations.
• two execution modes: 32-bit and 64-bit. Whereas 32-bit processors
have implicitly only one mode of operation, 32-bit execution mode
on a 64-bit processor causes instructions and addressing to
behave the same as on a 32-bit processor. As a separate mode,
64-bit execution mode creates a true 64-bit environment, with 64-bit
addressing and instruction behavior.
• 64-bit physical memory addressing facilities
• additional supervisor instructions, as needed to set up and control the execution mode. A key feature the PowerPC 64-bit architecture provides is execution mode at a per-process level, helping AIX to create, at the system level, a mixed environment of concurrent 32-bit and 64-bit processes.
Segment table: The 64-bit virtual address space is represented with a segment table, which acts as an in-memory set-associative cache of the 256 most recently used segment-number-to-segment-ID mappings. The current segment table is pointed to by the 64-bit Address Space Register (ASR). The ASR has a valid bit to indicate whether a segment table is in use; in 32-bit mode on 64-bit processors, this bit indicates that the segment table is not being used.

IBM "bridge extensions" to the PowerPC 64-bit architecture allow segment register operations to work in 32-bit mode, letting the kernel continue to manipulate segment registers: the "bridge extensions" are used to load and store "segment registers" instead.
Symmetric multiprocessing: On uniprocessor systems, bottlenecks exist in the form of the address and data bus restricting transfers to one at a time, and the single program counter forcing instructions to be executed in strict sequence. Some performance improvement was achieved by constantly improving the speeds of these uniprocessor machines.
Types of Multiprocessors:
• Loosely-coupled MP
• Tightly-coupled MP
• Symmetric MP
Loosely coupled MP: Has different systems on a communication link, with the systems functioning independently and communicating when necessary. The separate systems can access each other's files and may even offload tasks to a lightly loaded CPU to achieve some load balancing.
Tightly Uses a single storage shared by the various processors and a single
coupled MP operating system that controls all the processors and system hardware.
Symmetric MP All of the processors are functionally equivalent and can perform I/O and
computation.
Multi- In order to have all CPUs work together, there must be some sort of
processor organization. There are three ways to do that:
organization
• Master/slave multiprocessing organization.
• Separate executives organization.
• Symmetric multi-processing organization.
Master slave One processor is designated as the master and the others are the slaves.
organization The master is a general purpose processor and performs input/output as
well as computation. The slave processors perform only computation.
The processors are considered asymmetric (not equivalent) since only the
master can do I/O as well as computation. Utilization of a slave may be
poor if the master does not service slave requests efficiently enough.
Another disadvantage concerns I/O-bound jobs, which may not run
efficiently since only the master does I/O.
Separate With this organization each processor has its own operating system and
executives responds to interrupts from users running on that processor. A process is
organization
assigned to run on a particular processor and runs to completion.
It is possible for some of the processors to remain idle while other
processors execute lengthy processes. Some tables are global to the
entire system and access to these tables must be carefully controlled.
Each processor controls its own dedicated resources, such as files and I/O
devices.
Symmetric All of the processors are functionally equivalent and can perform I/O and
multi- computation. The operating system manages a pool of identical
processing
organization processors, any one of which may be used to control any I/O devices or
reference any storage unit. Conflicts between processors attempting to
access the same storage at the same time are ordinarily resolved by
hardware. Multiple tables in the kernel can be accessed by different
processes simultaneously. Conflicts in access to systemwide tables are
ordinarily resolved by software. A process may be run at different times by
any of the processors and, at any given time, several processors may
execute operating system functions in kernel mode.
Multi- There are two ways of identifying separate processors. You can identify
processor them by:
definitions
The lowest number will start from ‘0’ on Power systems, but will start
from ‘1’ on IA-64.
One processor will be known as the default, or master, processor; this
concept is used for funneling. It is not a master processor in the sense of
master/slave processing - the term is used only to designate which
processor will be the default processor. It is defined by the value of
MP_MASTER in the <sys/processor.h> file.
MP safe MP safe code will run on any processor. It’s modified to prevent resource
clashes by adding locking code in order to serialize its execution.
MP efficient MP efficient code is MP safe code that also has data locking mechanisms
to serialize data access. This makes it easier to spread whatever the
code does across the available CPUs.
The MP efficient approach is intended for high-throughput device drivers.
Purpose This lesson describes how to configure and take system dumps on a node
running the AIX5L operating system.
Accountability You will be able to measure your progress with the following:
• Exercises using your lab system.
• Check-point activity
• Lesson review
Reference
Redbooks
Organization This lesson consists of information followed by exercises that allow you to
of this lesson practice what you’ve just learned. Sometimes, as the information is being
presented, you are required to do something - pull down a menu, enter a
response, etc. This symbol, in the left hand side-head, is an indication that
an action is required.
Introduction An AIX5L system can generate a system dump (or crash dump) when it
encounters a severe system error, such as an unexpected exception in
kernel mode or one that the kernel cannot handle. A dump can also be
initiated by the system administrator when the system has hung.
Analysis of the dump can be done on another machine, away from the
production machine, at a convenient time and location by a skilled kernel
person.
Process The process of taking a system dump is illustrated in the following chart.
The process involves two stages: in stage one, the contents of memory
are copied to a temporary disk location; in stage two, AIX5L is booted and
the memory image is moved to a permanent location in the /var/adm/ras
directory.
Process
continued
AIX5L in production
Introduction When the operating system is installed, parameters regarding the dump
device are configured with default settings. To ensure that a system dump
is taken successfully, the system dump parameters need to be configured
properly.
The system dump parameters are stored in system configuration objects
within the SWservAt ODM object class. Objects within the SWservAt
object class define where and how a system dump should be handled.
SWservAt The SWservAt ODM object class is stored in the /etc/objrepos directory.
object class Objects included within the object class are:
Dump Device When selecting the primary or secondary dump device the following rules
selection rules must be observed:
• A mirrored paging space may be used as a dump device.
• Do not use a diskette drive as your dump device.
• If you use a paging device, only use hd6, the primary paging device.
Preparing for a To ensure that a system dump will be successfully captured, complete the
system dump following steps:
Step Action
1. Estimate the size of the dump. This can be done through smit
by following the fast path:
# smit dump_estimate
Or, using the sysdumpdev command:
# sysdumpdev -e
(With Compression turned on)
0453-041 Estimated dump size in bytes:11744051
(With Compression turned off)
0453-041 Estimated dump size in bytes:58720256
Using the above example, the dump will require 12MB (with
compression on), or 59MB (with compression turned off) of
device storage. This value can change based on the activity of
the system. It is best to run this command when the machine is
under its heaviest workload. Size the dump device four times
the value reported by the sysdumpdev command in order to
handle a system dump during peak system activity.
IA-64 Systems - Compression must be turned off to gather
a valid system dump. (Errata)
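The sizing rule above can be checked with a quick shell calculation (a hedged sketch: the byte count is the compression-off estimate from the sample output, and the four-times multiplier is the rule of thumb stated above):

```shell
# Rule-of-thumb dump device sizing: four times the sysdumpdev -e
# estimate, rounded up to whole megabytes.
estimate=58720256                          # bytes, compression off
recommended=$(( estimate * 4 ))            # headroom for peak workload
recommended_mb=$(( (recommended + 1048575) / 1048576 ))
echo "size the dump device to at least ${recommended_mb} MB"
```

With the sample estimate, this works out to a 224MB dump device.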
Preparing for a
system dump
continued
Step Action
2 Create a primary dump device named dumplv. Calculate the
required number of PPs for the dump device. Get the PP size
of the volume group by using the lsvg command:
# lsvg rootvg
VOLUME GROUP:   rootvg               VG IDENTIFIER:  db1010a
VG STATE:       active               PP SIZE:        16 megabyte(s)
VG PERMISSION:  read/write           TOTAL PPs:      1626 (26016 megabytes)
MAX LVs:        256                  FREE PPs:       1464 (23424 megabytes)
LVs:            11                   USED PPs:       162 (2592 megabytes)
OPEN LVs:       8                    QUORUM:         2
TOTAL PVs:      3                    VG DESCRIPTORS: 3
STALE PVs:      0                    STALE PPs:      0
ACTIVE PVs:     3                    AUTO ON:        yes
MAX PPs per PV: 1016                 MAX PVs:        32
LTG size:       128 kilobyte(s)      AUTO SYNC:      no
HOT SPARE:      no
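The PP arithmetic for this step can be sketched in shell (a hedged example: the 16MB PP size comes from the lsvg output above, the 224MB target follows the four-times sizing rule, and mklv with a dump-type logical volume is the assumed creation command):

```shell
# Physical partitions needed for a dump logical volume of dump_mb MB.
dump_mb=224                                  # target dump device size
pp_mb=16                                     # PP SIZE from lsvg rootvg
pps=$(( (dump_mb + pp_mb - 1) / pp_mb ))     # round up to whole PPs
echo "mklv -y dumplv -t dump rootvg ${pps}"
```

The printed mklv command would then be run to create the dumplv logical volume in rootvg.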
Preparing for a
system dump
continued
Step Action
3. Verify the size of the device /dev/dumplv.
Enter the following command:
# lslv dumplv
LOGICAL VOLUME: dumplv               VOLUME GROUP:   rootvg
LV IDENTIFIER:  e59bd8               PERMISSION:     read/write
VG STATE:       active/complete      LV STATE:       opened/syncd
TYPE:           dump                 WRITE VERIFY:   off
MAX LPs:        512                  PP SIZE:        16 megabyte(s)
COPIES:         1                    SCHED POLICY:   parallel
LPs:            15                   PPs:            15
STALE PPs:      0                    BB POLICY:      relocatable
INTER-POLICY:   minimum              RELOCATABLE:    no
INTRA-POLICY:   middle               UPPER BOUND:    32
MOUNT POINT:    N/A                  LABEL:          None
MIRROR WRITE CONSISTENCY: off
EACH LP COPY ON A SEPARATE PV?: yes
# sysdumpdev -p /dev/dumplv -P
primary /dev/dumplv
secondary /dev/sysdumpnull
copy directory /var/adm/ras
forced copy flag FALSE
always allow dump FALSE
dump compression OFF
Preparing for a
system dump
continued
Step Action
5. Create a secondary dump device. The secondary dump device
is used to back up the primary dump device. If an error occurs
during a system dump to the primary dump device, the
system attempts to dump to the secondary device (if it is
defined).
# sysdumpdev -s /dev/hd7 -P
primary /dev/dumplv
secondary /dev/hd7
copy directory /var/adm/ras
forced copy flag FALSE
always allow dump FALSE
dump compression OFF
Preparing for a
system dump
continued
Step Action
7. Verify that the size of the filesystem containing the copy
directory is large enough to hold a crash dump. Check the size
of the copy directory filesystem with the following command:
# df -k /var
Filesystem 1024-blocks Free %Used Iused %Iused Mounted on
/dev/hd9var 32768 31268 5% 143 64% /var
In this example the /var filesystem is 32MB. The chfs size
attribute is specified in 512-byte blocks. To increase the
size of the /var filesystem to 240MB (491520 blocks), use
the following command:
# chfs -a size=491520 /var
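Since the chfs size attribute is given in 512-byte blocks, the conversion from megabytes can be sketched as follows (a hedged example; the 240MB figure is the target used above):

```shell
# Convert a target filesystem size in MB into 512-byte blocks for chfs.
target_mb=240
blocks=$(( target_mb * 2048 ))     # 2048 512-byte blocks per megabyte
echo "chfs -a size=${blocks} /var"
```

This prints the chfs command with the block count filled in (491520 blocks for 240MB).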
Preparing for a
system dump
continued
Step Action
8. Configure the forced copy flag. If paging space is being used
as a dump device, the forced copy flag must be set to TRUE.
This will force the system boot sequence into menus that
allow copying of the dump to external media if the copy to
the copy directory fails. This gives you the opportunity
to save the crash dump to removable media if the default copy
directory is full or unavailable. To set the flag to TRUE,
specify the copy directory with the -D flag:
# sysdumpdev -D /var/adm/ras
10. Configure the compression flag. To enable compression of
the system dump prior to being written to the dump device,
the compression flag must be set to ON. To set the flag to
ON, use the following command:
# sysdumpdev -CP
Preparing for a
system dump
continued
Step Action
11. Configure the system for autorestart. A useful system attribute
is autorestart. If autorestart is TRUE, the system will
automatically reboot after a crash. This is useful if the machine
is physically distant or often unattended. To list the system
attributes, use the following command:
# smit chgsys
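The attribute can also be changed directly from the command line (a hedged transcript: chdev/lsattr on the sys0 device is the assumed interface, and the output lines are illustrative):

```shell
# chdev -l sys0 -a autorestart=true
sys0 changed
# lsattr -El sys0 -a autorestart
autorestart true Automatically REBOOT system after a crash True
```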
Introduction AIX5L has been designed to automatically collect a system crash dump
following a system panic. This section discusses the operator controls and
procedures used to obtain a system dump.
User initiated Under unattended hang conditions, or for other debugging purposes, the
dumps system administrator may use different techniques to force a dump:
• Using the sysdumpstart -p command (primary dump device) or the
sysdumpstart -s command (secondary dump device).
• Start a system dump with the Reset button by doing the following (this
procedure works for all system configurations and will work in
circumstances where other methods for starting a dump will not):
Step Action
1. Turn the machine’s mode switch to the Service position, or
set Always Allow System Dump to TRUE.
2. Press the Reset button. The system writes the dump
information to the primary dump device.
Progression A system crash will cause a number of status codes to be displayed. When
status codes a system has crashed, the LEDs will display a flashing 888. The system
may display the code 0c9 for a short period of time, indicating a system
dump is in progress. When the dump is complete, the dump status code
will change to 0c0 if the system was able to dump successfully.
If the Low-Level Debugger (LLDB) is enabled, a c20 will appear in the
LEDs, and an ASCII terminal connected to the s1 or s2 serial port will
show an LLDB screen. Typing quit dump will initiate a dump.
During the dump process, the following progression status codes may be
seen on the LED or LCD displays:
Error log If the dump was lost or did not save during system boot, the error log can
help determine the nature of the problem that caused the dump. To check
the error log, use the errpt command.
Step Action
1. # sysdumpstart -p
IA-64 Systems - For a dump that is approximately 120MB
in size, wait approximately 15 minutes before shutting
down the machine.
2. Reboot the system.
dumpcheck utility
Description The /usr/lib/ras/dumpcheck utility is used to check the disk resources used
by the system dump facility. The command logs an error if either the
largest dump device is too small to receive the dump or there is insufficient
space in the copy directory when the dump device is a paging space.
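On AIX5L this check is typically scheduled from root's crontab (a hedged example; the exact entry and schedule shipped on a given system may differ):

```shell
# crontab -l | grep dumpcheck
0 15 * * * /usr/lib/ras/dumpcheck >/dev/null 2>&1
```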
Error log entry The following is an example of an errorlog entry created by the dumpcheck
sample utility because of lack of space in the primary and secondary dump
devices:
----------------------------------------------------
LABEL: DMPCHK_TOOSMALL
IDENTIFIER: E87EF1BE
Description
The largest dump device is too small.
Probable Causes
Neither dump device is large enough to accommodate a
system dump at this time.
Recommended Actions
Increase the size of one or both dump devices.
Detail Data
Largest dump device
testdump
Description Before submitting a dump to IBM for analysis, it is important to verify that
the dump is valid and readable.
Dump analysis To verify the dump is valid, the dump must be examined by a kernel
tools debugger. The kernel debugger used to validate the dump depends on the
system architecture. If the system is running on Power PC, the debugger is
kdb. The kernel debugger for IA-64 platforms is iadb.
Verifying the The following procedure should be used to verify the dump
dump
Step Action
1. Locate the crash dump:
# sysdumpdev -L
0453-039
Device name: /dev/dumplv
Major device number: 10
Minor device number: 2
Size: 8837632 bytes
Uncompressed Size: 32900935 bytes
Date/Time: Fri Sep 22 13:01:41 PDT 2000
Dump status: 0
dump completed successfully
Dump copy filename: /var/adm/ras/vmcore.0.Z
2. Change directory to the dump location. In the above
example:
# cd /var/adm/ras
3. Decompress the vmcore file if necessary:
# uncompress vmcore.0.Z
Verifying the
dump
continued
Step Action
4. Start the kernel debugger:
Power PC:
# kdb /var/adm/ras/vmcore.0
The specified kernel file is a UP kernel
vmcore.1 mapped from @ 70000000 to @ 71fdba81
Preserving 880793 bytes of symbol table
First symbol __mulh
KERNEXT FUNCTION NAME CACHE (90112 bytes) allocated
KERNEXT COMMANDS SPACE (4096 bytes) allocated
Component Names:
1) dmp_minimal [5 entries]
....
Dump analysis on CHRP_UP_PCI POWER_PC POWER_604
machine with 1 cpu(s) (32-bit r
egisters)
Processing symbol table...
.......................done
(0)>
IA-64:
# iadb /var/adm/ras/vmcore.0
symbol capture using file: /unix
iadb: Probing a live system, with memfd as :4
Current Context:
cpu:0x1, thread slot: 77, process Slot: 51,
ad space: 0x8e44
thrd ptr: 0xe00000972a13b000, proc ptr:
e00000972a12e000
mst at:3ff002ff3b400
(1)>
Verifying the
dump
continued
Step Action
5. Issue the stat subcommand to verify the details of the dump.
Ensure the values are consistent with the dump that was
taken.
Power PC:
(0)> stat
SYSTEM_CONFIGURATION:
CHRP_UP_PCI POWER_PC POWER_604 machine with 1
cpu(s) (32-bit registers)
SYSTEM STATUS:
sysname... AIX
nodename.. kca41
release... 0
version... 5
machine... 000930134C00
nid....... 0930134C
time of crash: Thu Oct 5 10:37:57 2000
age of system: 3 min., 11 sec.
xmalloc debug: disabled
IA-64:
(1)>stat
SYSTEM_CONFIGURATION:
IA64 machine with 2 cpu(s)(64-bit registers)
SYSTEM STATUS:
sysname... AIX
nodename.. kca40
hostname.. kca40.hil.sequent.com
release... 0
version... 5
machine... 000000004C00
nid....... 0000004c
current time: Fri Oct 6 12:20:56 2000
age of system: 1 day, 1 hr., 1 min., 43 sec.
xmalloc debug: disabled
Verifying the
dump
continued
Step Action
6. Exit the kernel debugger:
Power PC:
(0) > q
IA-64:
(1) > q
Overview Once a valid dump has been identified, the next step is to package the
dump to be sent in for analysis.
Packaging the The following procedure will automatically collect the required files
dump pertaining to the system dump.
Step Action
1. Compress the vmcore file:
# compress /var/adm/ras/vmcore.0
2. Gather all of the files and information regarding the dump
using the following command:
# snap -Dkg
Checking space requirement for general
information............................................
........... done.
Checking space requirement for kernel
information.......... done.
Checking space requirement for dump information.....
done.
Checking for enough free space in filesystem... done.
********Checking and initializing directory structure
Creating /tmp/ibmsupt directory tree... done.
Creating /tmp/ibmsupt/dump directory tree... done.
Creating /tmp/ibmsupt/kernel directory tree... done.
Creating /tmp/ibmsupt/general directory tree... done.
Creating /tmp/ibmsupt/general/diagnostics directory
tree... done.
Creating /tmp/ibmsupt/testcase directory tree... done.
Creating /tmp/ibmsupt/other directory tree... done.
********Finished setting up directory /tmp/ibmsupt
Gathering general system
information........................done.
Gathering kernel system information........... done.
Gathering dump system information...... done.
Packaging the
dump
continued
Step Action
3. Copy the dump to external media. To copy the gathered
files to the /dev/rmt0 tape device, issue the following
command:
# snap -o /dev/rmt0
Packaging a A dump saved to external media needs to be gathered with other files to
dump stored provide a dump which is readable. To gather and pack the files, follow
on external
media these steps:
Step Action
1. Create a skeleton directory to contain the dump information.
# snap -D
# cd /tmp/ibmsupt/dump
# tar -xvf /dev/rmt
# mv dump_file dump
3. Copy the dump to external media. To copy the gathered files
to the /dev/rmt0 tape device, issue the following command:
# snap -o /dev/rmt0
Purpose This lesson describes the different tools that are available to debug a
system dump taken from an AIX5L system.
Table of
contents
continued
Topic See Page
KDB watch break point sub commands 44
KDB machine status sub commands 46
KDB kernel extension loader sub commands 48
KDB address translation sub commands 50
KDB process/thread sub commands 51
KDB Kernel stack sub commands 59
KDB LVM sub commands 61
KDB SCSI sub commands 63
KDB memory allocator sub commands 66
KDB file system sub commands 70
KDB system table sub commands 73
KDB network sub commands 78
KDB VMM sub commands 81
KDB SMP sub commands 87
KDB data and instruction block address translation sub 88
commands
KDB bat/brat sub commands 90
IADB kernel debugger 91
iadb command 93
Table of
contents
continued
Topic See Page
IADB break point and step sub commands 94
IADB dump/display/decode sub commands 97
IADB modify memory sub commands 101
IADB name list/symbol sub commands 106
IADB watch break point sub commands 107
IADB machine status sub commands 109
IADB kernel extension loader sub commands 111
IADB address translation sub commands 112
IADB process/thread sub commands 113
IADB LVM sub commands 115
IADB SCSI sub commands 116
IADB memory allocator sub commands 117
IADB file system sub commands 118
IADB system table sub commands 119
IADB network sub commands 120
IADB VMM sub commands 121
IADB SMP sub commands 123
IADB block address translation sub commands 124
IADB bat/brat sub commands 125
IADB miscellaneous sub commands 126
Exercise 128
Accountability You will be able to measure your progress with the following:
• Exercises using your lab system.
• Check-point activity
• Lesson review
Reference
• AIX5L docs
Organization This lesson consists of information followed by exercises that allow you to
of this lesson practice what you’ve just learned. Sometimes, as the information is being
presented, you are required to do something - pull down a menu, enter a
response, etc. This symbol, in the left hand side-head, is an indication that
an action is required.
Introduction AIX5L introduces new debugging tools. The main change from previous
releases of AIX is that the crash command has been replaced by:
• the IADB and KDB kernel debuggers for live system debugging
• the iadb and kdb commands for system image analysis
In addition, the following tools/commands are available to assist you with
debugging:
• bosdebug
• Memory Overlay Detection System (MODS)
• System Hang Detection
• truss
Typographic In the following sections we will use uppercase IADB and KDB when
conventions speaking about the live kernel debuggers, and lowercase iadb and kdb
when speaking about the commands.
dump components
Introduction In AIX5L, a dump image is not actually a full image of system memory but a
set of memory areas dumped by the dump process.
The Master A master dump table entry is a pointer to a function, provided by a kernel
dump Table extension, that will be called by the kernel dump routine when a system dump
occurs. These functions must return a pointer to a component dump table
structure. Both the functions and the component dump table entries must reside
in pinned global memory. They are registered with the kernel using the dmp_add
kernel service and unregistered using the dmp_del kernel service. Kernel-specific
areas are pre-loaded by kernel initialization.
Component Dump component tables are structures of type struct cdt. Component dump
dump tables tables are returned by the registered dmp functions when the dump process
starts. Each one is a structure made of:
• a CDT header
• an array of CDT entries
CDT entries CDT entries in the component dump tables will be one of cdt_entry64,
cdt_entry_vr, or cdt_entry32, according to the DMP_MAGIC number as defined
in /usr/include/sys/dump.h.
Process The following steps will be used to write a dump to the dump device:
overview
Step Action
1 Interrupts are disabled
2 0c9 or 0c2 are written to the LED display, if present
3 Header information about the dump is written to the dump device
4 The kernel steps through each entry in the master dump table,
calling each component dump routine twice:
• once to indicate that the kernel is starting to dump this
component (1 is passed as a parameter)
• again to say that the dump process is complete (2 is passed)
After the first call to a component dump routine, the kernel
processes the CDT that was returned.
For each CDT entry, the kernel:
• checks every page in the identified data area to see if it is in
memory or paged out
• builds a bitmap indicating each page's status
• writes a header, the bitmap, and those pages which are in
memory to the dump device
5 Once all dump routines have been called, the kernel enters an
infinite loop, displaying 0c0 or flashing 888
Note A component dump routine may or may not do a lot of work when called with a 1.
Many simply return the address of some previously-initialized CDT, but some (for
example, the thread table and process table dump routines) actually build the CDT
from scratch.
The original rationale for the second call to each dump routine was to provide
notification that the dump process had finished with that component's dump data.
In practice, however, no one really cares. The routines that just return an address
don't even bother to look at the parameter they were passed. The routines that
build the data on the fly look for a 2 and return immediately. The most that any
routine today does with this second call is to issue some debug printf call. This is
generally used to debug the component dump routine itself, by verifying that the
system dump facility was able to successfully process its CDT.
bosdebug command
Introduction The bosdebug command can be used to enable or disable the MODS feature as
well as other kernel debugging parameters.
Any changes made with the bosdebug command will not take effect until the
system is rebooted.
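A typical enable/verify sequence looks like the following (a hedged transcript: -M, -L, and the bosboot rebuild are the assumed steps for turning on MODS; check bosdebug's documentation on your level of AIX):

```shell
# bosdebug -M                    <== enable the Memory Overlay Detection System
# bosdebug -L                    <== list the current debug settings
# bosboot -ad /dev/ipldevice     <== rebuild the boot image
# shutdown -Fr                   <== reboot so the change takes effect
```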
-10 of 128 AIX 5L Internals Version 20001015 © Copyright IBM Corp. 2000
Course materials may not be reproduced in whole or in part
without the prior written permission of IBM.
Draft Version for review, Sunday, 15. October 2000, introSystemdump.fm Guide
Introduction The Memory Overlay Detection System (MODS) helps detect memory overlay
problems in the kernel, kernel extensions, and device drivers. The MODS can be
enabled using the bosdebug command.
Problems Some of the most difficult types of problems to debug are what are generally
detected called "memory overlays." Memory overlays include the following:
Note: This feature does not detect problems in application code; it only
watches kernel and kernel extension code.
How MODS The primary goal of the MODS feature is to produce a dump file that
works accurately identifies the problem.
MODS works by turning on additional checking to help detect the
conditions listed above. When any of these conditions is detected, your
system crashes immediately and produces a dump file that points directly
at the offending code. (Previously, a system dump might point to unrelated
code that happened to be running later when the invalid situation was
finally detected.)
If your system crashes while the MODS is turned on, then MODS has most
likely done its job.
To make it easier to detect that this situation has occurred, the IADB/
iadb and KDB/kdb commands have been extensively modified. The
stat subcommand now displays both:
• Whether the MODS (also called "xmalloc debug") has been turned on
• Whether this crash was the result of the MODS detecting an incorrect
situation.
The xmalloc subcommand provides details on exactly what memory
address (if any) was involved in the situation, and displays mini-tracebacks
for the allocation and/or free of this memory.
Similarly, the netm command displays allocation and free records for
memory allocated using the net_malloc kernel service (for example,
mbufs, mclusters, etc.).
You can use these commands, as well as standard crash techniques, to
determine exactly what went wrong.
MODS There are limitations to the Memory Overlay Detection System. Although it
limitations significantly improves your chances, MODS cannot detect all memory
overlays. Also, turning MODS on has a small negative impact on overall
system performance and causes somewhat more memory to be used in
the kernel and the network memory heaps. If your system is running at full
CPU utilization, or if you are already near the maximums for kernel
memory usage, turning on the MODS may cause performance
degradation and/or system hangs.
Our practical experience with the MODS, however, is that the great
majority of customers will be able to use it with minimal impact to their
systems.
MODS and kdb If a system crash occurs due to a MODS-detected problem, the kdb xm
subcommand will display status and traces for memory overlay problems.
Introduction System hang management allows users to run mission critical applications
continually while improving application availability. System hang detection
alerts the system administrator of possible problems and then allows the
administrator to log in as root or to reboot the system to resolve the
problem.
System Hang All processes (also known as threads) run at a priority. This priority is
Detection numerically inverted in the range 40-126: 40 is the highest priority and 126
is the lowest. The default priority for all threads is 60. The priority of
a process can be lowered by any user with the nice command. Anyone
with root authority can also raise a process’s priority.
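For example (a generic illustration of nice, not specific to AIX5L):

```shell
# Run a command with its nice value increased by 10 (lower priority);
# any user may do this.
nice -n 10 sleep 1
# Raising priority requires root authority, e.g. (hypothetical daemon):
#   nice -n -10 some_daemon
```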
The kernel scheduler always picks the highest priority runnable thread to
put on a CPU. It is therefore possible for a sufficient number of high priority
threads to completely tie up the machine such that low priority threads can
never run. If the running threads are at a priority higher than the default of
60, this can lock out all normal shells and logins to the point where the
system appears hung.
The System Hang Detection (SHD) feature provides a mechanism to
detect this situation and allow the system administrator a means to
recover. This feature is implemented as a daemon (shdaemon) that runs at
the highest process priority. This daemon queries the kernel for the lowest
priority thread run over a specified interval. If the priority is above a
configured threshold, the daemon can take one of several actions. Each of
these actions can be independently enabled, and each can be configured
to trigger at any priority and over any time interval. The actions and their
defaults are:
System Hang
Detection
continued
shconf Script The shconf command is invoked when System Hang Detection is
enabled. shconf configures which events are surveyed and what actions
are to be taken if such events occur.
The user can specify the five actions described below, as well as the
priority level to check, the time-out during which no process or thread
executes at a lower or equal priority, and the terminal device for the
warning and getty actions:
• Log an error in the error log file
• Display a warning message on the system console (alphanumeric
console) or on a specified TTY
• Reboot the system
• Give a special getty to allow the user to log in as root and launch
commands
• Launch a command
For the Launch a command and Give a special getty options,
SHD will launch the special getty or the specified command at the highest
priority. The special getty will print a warning message specifying that it is a
recovering getty running at priority 0. The following table lists the default
values when the SHD is enabled. Only one action is enabled per type of
detection.
Note: When Launch a recovering getty on a console is
enabled, the shconf script adds the -u flag to the getty line in the inittab
that is associated with the console login.
The shdaemon entry is set to off or respawn in the inittab each time the
shconf command disables or enables the sh_pp option.
SMIT Interface You can manage the SHD configuration from the SMIT System
Environments menu. From the System Environments menu, select
Manage System Hang Detection. The options in this menu allow
system administrators to enable or disable the detection mechanism.
Configuration of The shconf command can be used to configure System Hang Detection.
the SHD The following parameters may be used with shconf:
• -d: display the System Hang Detection status
• -R -l prio: reset the effective values to the defaults
• -D[O] -l prio: display the default values (the optional O outputs values
separated by colons)
• -E[O] -l prio: display the effective values (the optional O outputs values
separated by colons)
• -l prio [-a Attribute=Value]: change the Attribute to the new Value
Options The following options can be used to customize System Hang Detection:
name default description
sh_pp enable Enable Process Priority Problem
pp_errlog disable Log Error in the Error Logging
pp_eto 2 Detection Time-out
pp_eprio 60 Process Priority
pp_warning disable Display a warning message on a console
pp_wto 2 Detection Time-out
pp_wprio 60 Process Priority
pp_wterm /dev/console Terminal Device
pp_login enable Launch a recovering login on a console
pp_lto 2 Detection Time-out
pp_lprio 56 Process Priority
pp_lterm /dev/tty0 Terminal Device
pp_cmd disable Launch a command
pp_cto 2 Detection Time-out
pp_cprio 60 Process Priority
pp_cpath / Script
pp_reboot disable Automatically REBOOT system
pp_rto 5 Detection Time-out
pp_rprio 39 Process Priority
example The following output represents various uses of the shconf command:
# shconf -R -l prio <== restore default values
shconf: Default Problem Conf is restored.
shconf: Priority Problem Conf has changed.
# shconf -D -l prio <== display default values
sh_pp disable Enable Process Priority Problem
pp_errlog disable Log Error in the Error Logging
pp_eto 2 Detection Time-out
pp_eprio 60 Process Priority
pp_warning disable Display a warning message on a console
pp_wto 2 Detection Time-out
pp_wprio 60 Process Priority
pp_wterm /dev/console Terminal Device
pp_login enable Launch a recovering login on a console
pp_lto 2 Detection Time-out
pp_lprio 56 Process Priority
pp_lterm /dev/tty0 Terminal Device
pp_cmd disable Launch a command
pp_cto 2 Detection Time-out
pp_cprio 60 Process Priority
pp_cpath / Script
pp_reboot disable Automatically REBOOT system
pp_rto 5 Detection Time-out
pp_rprio 39 Process Priority
# shconf -l prio -a pp_lterm=/dev/console <== change terminal device to /dev/console
shconf: Priority Problem Conf has changed.
# shconf -l prio -a sh_pp=enable <== enable priority problem detection
shconf: Priority Problem Conf has changed.
# ps -ef|grep shd <== verify the shdaemon has been started
root 4982 1 0 17:08:17 - 0:00 /usr/sbin/shdaemon
root 9558 9812 1 17:08:22 0 0:00 grep shd
truss command
Description The truss command executes a specified command, or attaches to listed process
IDs, and produces a trace of the system calls, received signals, and machine faults
a process incurs. Each line of the trace output reports either the Fault or Signal
name, or the Syscall name with parameters and return values. The subroutines
defined in system libraries are not necessarily the exact system calls made to the
kernel. The truss command does not report these subroutines, but rather, the
underlying system calls they make. When possible, system call parameters are
displayed symbolically using definitions from relevant system header files. For
path name pointer parameters, truss displays the string being pointed to. By
default, undefined system calls are displayed with their name, all eight possible
arguments and the return value in hexadecimal format.
Options The following options can be used on the truss command line:
Option Description
-a Displays the parameter strings passed in each system call.
-c Counts traced system calls, faults, and signals rather than displaying
trace results line by line. A summary report is produced.
-e Displays the environment strings which are passed in each executed
system call.
-f Follows all children created by the fork system call.
-i Keeps interruptible sleeping system calls from being displayed. Causes
system calls to be reported only once, upon completion.
-m [!]Fault  Machine faults to trace/exclude. Faults may be specified by name or
             number (see the sys/fault.h header file). The default is -mall.
-o Outfile   Designates the file to be used for the trace output.
-p           Interprets the parameters to truss as a list of process IDs of existing
             processes rather than as a command to be executed. truss takes control of
             each process and begins tracing it.
-r [!]FileDescriptor  Displays the full contents of the I/O buffer for each read on any of the
             specified file descriptors. The output is formatted 32 bytes per line and
             shows each byte either as an ASCII character (preceded by one blank) or
             as a two-character C language escape sequence for control characters. If
             ASCII interpretation is not possible, the byte is shown as a two-character
             hexadecimal value. The default is -r!all.
-s [!]Signal  Permits listing signals to trace/exclude. The trace output reports the
             receipt of each specified signal even if the signal is being ignored, but not
             blocked, by the process. Blocked signals are not received until the process
             releases them. Signals may be specified by name or number (see sys/
             signal.h). The default is -s all.
-t [!]Syscall  Includes/excludes system calls from the trace. The default is -tall.
-w [!]FileDescriptor  Displays the contents of the I/O buffer for each write on any of the listed
             file descriptors (see -r). The default is -w!all.
-x [!]Syscall  Displays data from the specified parameters of traced system calls in raw
             format, usually hexadecimal, rather than symbolically. The default is -x!all.
Options that require a list must be given values separated by commas. You can use
all/!all to include or exclude all possible values of the list.
truss output example The following output represents an example of the use of the truss command:
# truss -a -e -i ... -w all -o ls.out ls
ls.out
# more ls.out
execve("/usr/bin/ls", ..., ...)  argc: 1
 argv: ls
 envp: _=/usr/bin/truss LANG=C LOGIN=root
 NLSPATH=/usr/lib/nls/msg/%L/%N:/usr/lib/nls/msg/%L/%N.cat
 PATH=/usr/bin:/etc:/usr/sbin:/usr/ucb:/usr/bin/X11:/sbin
 LC__FASTMSG=true LOGNAME=root MAIL=/usr/spool/mail/root
 LOCPATH=/usr/lib/nls/loc USER=root AUTHSTATE=compat
 SHELL=/usr/bin/ksh ODMDIR=/etc/objrepos HOME=/ TERM=aixterm
 MAILMSG=[YOU HAVE NEW MAIL] PWD=/home/alex TZ=PST8PDT
__get_kernel_tod_ptr(...) = ...
getuidx(...) = ...
kioctl(...) = ...
kioctl(...) = ...
sbrk(...) = ...
brk(...) = ...
sbrk(...) = ...
brk(...) = ...
statx(...) = ...
statx(...) = ...
open(..., O_RDONLY) = ...
getdirent(...) = ...
lseek(...) = ...
kfcntl(..., F_GETFD, ...) = ...
kfcntl(..., F_SETFD, ...) = ...
getdirent(...) = ...
getdirent(...) = ...
close(...) = ...
kioctl(...) = ...
kwrite(...) = ...
   l s . o u t\n
kfcntl(..., F_GETFL, ...) = ...
close(...) = ...
kfcntl(..., F_GETFL, ...) = ...
_exit(...)
Introduction The KDB is the kernel debugger used on AIX 5L running on Power systems.
#kdb
(0)> dw kdb_avail
kdb_avail+000000: 00000001 00000000 00000000 00000000
Loading KDB In AIX 5L, the KDB is included in all unix kernels found in /usr/lib/boot. In order
to use it, the KDB must be loaded at boot time. Use one of the following commands
to control whether KDB is loaded:
• bosboot -a -D -d /dev/ipldevice, or bosdebug -D: will
load KDB at boot time.
• bosboot -a -I -d /dev/ipldevice, or bosdebug -I: will load
and invoke the KDB at boot time.
• bosboot -ad /dev/ipldevice, or bosdebug -o: will not load or
invoke the KDB at boot time.
You must reboot the system for these changes to take effect.
Starting KDB The KDB interface may be started, if loaded, under the following
circumstances:
• If bosboot or bosdebug was run with -I, the tty attached to a native serial port
shows the KDB prompt just after the kernel is loaded.
• You may invoke the KDB manually from a tty attached to a native serial port
using Ctrl-4 or Ctrl-\, or from a native keyboard using Ctrl-alt-
Numpad4.
• An application makes a call to the breakpoint() kernel service or to the
breakpoint system call.
• A breakpoint previously set using the KDB has been reached.
• A fatal system error occurs. A dump might be generated on exit from the KDB.
KDB Concept When the KDB Kernel Debugger is invoked, it is the only running program until
you exit the KDB or you use the start sub command to start another cpu. All
processes are stopped and interrupts are disabled. The KDB Kernel Debugger runs
with its own Machine State Save Area (mst) and a special stack. In addition, the
KDB Kernel Debugger does not run operating system routines. Though this
requires that kernel code be duplicated within KDB, it is possible to break
anywhere within the kernel code. When exiting the KDB Kernel Debugger, all
processes continue to run unless the debugger was entered via a system halt.
kdb command
Introduction The kdb command, unlike the KDB kernel debugger, allows examination of an
operating system image, such as a system dump, on Power systems.
The kdb command may be used on a running system but does not provide all
functions available with the KDB kernel debugger.
Parameters The kdb command may be used with the following parameters:
• no parameter: kdb uses /dev/mem as the system image file and /usr/lib/
boot/unix as the kernel file. In this case root permissions are required.
• -m system_image_file: kdb uses the system image file provided.
• -u kernel_file: kdb uses the kernel file provided. This is required to analyze a
system dump from a system running a different level of the unix kernel.
• -k kernel_modules: a comma-separated list of kernel extension symbols to add.
• -w: view an XCOFF object.
• -v: print CDT entries.
• -h: print help.
• -l: disable the inline pager (more), useful for running non-interactive sessions.
Loading errors If the system image file provided doesn’t contain a valid dump or the kernel file
doesn’t match the system image file, the following message may be issued by the
kdb command:
Introduction The following table represents the miscellaneous sub commands and their
matching crash/lldb sub commands when available
reboot sub The reboot subcommand can be used to reboot the machine. This subcommand
command issues a prompt for confirmation that a reboot is desired before executing the
reboot. If the reboot request is confirmed, the soft reboot interface is called
(sr_slih(1)).
! sub command The ! sub command allows the user to run an AIX command without leaving the
kdb command or the KDB kernel debugger.
? sub command The help or ? sub command can be used to display a long sub command listing
or to display help by subject.
Help for a particular sub command can be displayed by entering the sub command
followed by ?.
q sub command For the KDB Kernel Debugger, this subcommand exits the debugger with all
breakpoints installed in memory. To exit the KDB Kernel Debugger without
breakpoints, the ca subcommand should be invoked to clear all breakpoints prior
to leaving the debugger.
The optional dump argument can be specified to force an operating system dump.
The method used to force a dump depends on how the debugger was invoked.
set sub command The set sub command can be used to toggle the kdb parameters. set accepts the
following parameters:
time sub The time command can be used to determine the elapsed time from the last time
command the KDB Kernel Debugger was left to the time it was entered.
debug sub command The debug subcommand may be used to print additional information during
KDB execution; the primary use of this subcommand is to aid in ensuring that the
debugger is functioning properly. The debug sub command can be used with the
following arguments:
hcal/dcal sub The hcal subcommand evaluates hexadecimal expressions and displays the
commands result in both hex and decimal.
The dcal subcommand evaluates decimal expressions and displays the result in
both hex and decimal.
Introduction The following table represents the dump/display/decode sub commands and their
matching crash/lldb sub commands when available
d/dw/dd/dp/dpw/dpd sub commands The d/dw/dd/dp/dpw/dpd sub commands are used to display memory with the
following sizes:
• d, dp: display bytes
• dw, dpw: display words
• dd, dpd: display double words
Addresses are specified by:
• virtual addresses for d, dw and dd
• physical addresses for dp, dpw and dpd
These sub commands accept the following arguments:
• Address - starting address of the area to be dumped. Hexadecimal values or
hexadecimal expressions can be used in specification of the address.
• count - number of bytes (d, dp), words (dw, dpw), or double words (dd, dpd)
to be displayed. The count argument is a hexadecimal value.
dc/dpc/dis sub commands The display code subcommands dc, dis and dpc may be used to decode
instructions. The address argument for the dc subcommand is an effective
address. The address argument for the dpc subcommand is a physical
address. They accept the following arguments:
ddvb/ddvh/ddvw/ddvd and ddpb/ddph/ddpw/ddpd sub commands IO space memory (Direct Store Segment (T=1)) cannot be accessed when
translation is disabled. BAT-mapped areas must also be accessed with translation
enabled, else cache controls are ignored.
The subcommands ddvb, ddvh, ddvw and ddvd can be used to access these areas
in translated mode, using an effective address already mapped.
The subcommands ddpb, ddph, ddpw and ddpd can be used to access these areas
in translated mode, using a physical address that will be mapped.
On 64-bit machines, correctly aligned double words are accessed (ddpd and ddvd)
with a single load (ld) instruction.
• Address - address of the starting memory area to display. This can either be an
effective or real address, dependent on the subcommand used. Symbols,
hexadecimal values, or hexadecimal expressions can be used in specification of
the address.
• count - number of bytes (ddvb, ddpb), half words (ddvh, ddph), words (ddvw,
ddpw), or double words (ddvd, ddpd) to display. The count argument is a
hexadecimal value.
find findp sub The find and findp subcommands can be used to search for a specific pattern in
commands memory. The find subcommand requires an effective address for the address
argument, whereas the findp subcommand requires a real address. find and findp
accept the following parameters:
ext/extp sub The ext and extp subcommands can be used to display a specific area from a
commands structure. If an array exists, it can be traversed displaying the specified area for
each entry of the array. These subcommands can also be used to traverse a linked
list displaying the specified area for each entry.
For the ext subcommand the address argument specifies an effective address. For
the extp subcommand the address argument specifies a physical address.
• -p: flag to indicate that the delta argument is the offset to a pointer to the next
area.
• Address: address at which to begin display of values. This can either be a
virtual (effective) or physical address depending on the subcommand used.
Symbols, hexadecimal values, or hexadecimal expressions can be used in
specification of the address.
• delta: offset to the next area to be displayed or offset from the beginning of the
current area to a pointer to the next area. This argument is a hexadecimal value.
• size: hexadecimal value specifying the number of words to display.
• count: hexadecimal value specifying the number of entries to traverse
dr sub command The display registers sub command can be used to display:
examples The following show examples of the use of display sub commands:
Introduction The following table represents the modify memory sub commands and their
matching crash/lldb sub commands when available
m/mp/mw/mpw/md/mpd sub commands The m/mp/mw/mpw/md/mpd sub commands are used to modify memory with the
following sizes:
• m, mp: modify bytes
• mw, mpw: modify words
• md, mpd: modify double words
Addresses are specified by:
• virtual addresses for m, mw and md
• physical addresses for mp, mpw and mpd
These sub commands accept the following arguments:
• Address - starting address of the area to be modified. Hexadecimal values or
hexadecimal expressions can be used in specification of the address.
The sub commands will prompt for new values until a “.” value is entered.
mr sub The mr sub command can be used to modify general purpose, segment, special, or
commands floating point registers. Individual registers can also be selected for modification
by register name. The current thread context is used to locate the register values to
be modified. The switch sub command can be used to change context to other
threads. When the register being modified is in the mst context, KDB alters the
mst. When the register being modified is a special one, the register is altered
immediately. Symbolic expressions are allowed as input.
The following arguments can be used:
• gp - modify general purpose registers.
• sr - modify segment registers.
• sp - modify special purpose registers.
• fp - modify floating point registers.
• reg_name - modify a specific register, by name.
mr prompts once for input if a register name was specified, or prompts for each
register in turn until a “.” is entered.
mdvb/mdvh/mdvd and mdpb/mdph/mdpd sub commands These subcommands are available to write to IO space memory. To avoid side
effects, memory is not read first; only the specified write is performed, with
translation enabled.
Access can be in bytes, half words, words or double words.
Address can be an effective address or a real address.
The subcommands mdvb, mdvh, mdvw and mdvd can be used to access these
areas in translated mode, using an effective address already mapped. The
subcommands mdpb, mdph, mdpw and mdpd can be used to access these areas in
translated mode, using a physical address that will be mapped. On 64-bit machines,
correctly aligned double words are accessed (mdpd and mdvd) with a single store
instruction. The DBAT interface is used to translate the address in cache-inhibited
mode (PowerPC only).
These subcommands accept the following parameters:
• Address - address of the memory to modify. This can either be a virtual
(effective) or physical address, dependent on the subcommand used. Symbols,
hexadecimal values, or hexadecimal expressions can be used in specification of
the address.
These sub commands will prompt for input until a “.” is entered.
Introduction The following table represents the trace sub commands and their matching crash/
lldb sub commands when available
bt sub command The trace point subcommand bt can be used to trace each execution of a specified
address. Each time a trace point is encountered during execution, a message is
displayed indicating that the trace point has been encountered. The displayed
message indicates the first entry from the stack.
The bt sub command can also take a test parameter, to break at the specified
address only if the test condition is true.
The conditional test requires two operands and a single operator. Values that can
be used as operands in a test subcommand include symbols, hexadecimal values,
and hexadecimal expressions. Comparison operators that are supported include:
==, !=, >=, <=, >, and <.
Additionally, the bitwise operators ^ (exclusive OR), & (AND), and | (OR) are
supported.
When bitwise operators are used, any non-zero result is considered to be true.
ct/cat sub The cat and ct sub commands erase all and individual trace points, respectively.
command The trace point cleared by the ct subcommand may be specified either by a slot
number or an address. These sub commands accept the following arguments:
examples The following example shows the use of the trace sub commands:
Introduction The following table represents the breakpoint and step sub commands and their
matching crash/lldb sub commands when available
b/lb sub The b subcommand sets a permanent global breakpoint in the code. KDB checks
command that a valid instruction will be trapped. If an invalid instruction is detected a
warning message is displayed. If the warning message is displayed the breakpoint
should be removed; otherwise, memory can be corrupted (the breakpoint has been
installed).
The lb sub command acts the same way as the b sub command, except that the
break point is local to the thread or CPU, depending on set option 14.
The following arguments may be used with the b/lb sub commands:
c/lc/ca sub c/lc and ca can be used to clear break points. The differences are:
commands
• c will clear general break points
• lc will clear local break points
• ca will clear all break points
The c and lc sub commands use the following parameters:
• ctx - context to be cleared for a local break. The context may either be a CPU or
thread specification.
r/gt sub A non-permanent breakpoint can be set using the subcommands r and gt. These
command subcommands set local breakpoints which are cleared after they have been hit.
The r subcommand sets a breakpoint on the address found in the lr register. In an
SMP environment, it is possible to hit this breakpoint on another processor, so it is
important to have thread/process local break points.
The gt subcommand performs the same as the r subcommand except that the
breakpoint address must be specified.
n/s sub The two subcommands n and s provide step functions. The s subcommand allows
command the processor to single step to the next instruction. The n subcommand also single
steps, but it steps over subroutine calls as though they were a single instruction.
• count: specify how many steps are executed before returning to the KDB
prompt.
S/B sub The S subcommand single steps but stops only on bl and br instructions. With that,
commands you can see every call and return of routines. A count can also be used to specify
how many times KDB continues before stopping.
• count: specify how many steps are executed before returning to the KDB
prompt.
Example The following example shows the use of break points:
# Debugger entered via keyboard.
.waitproc_find_run_queue+00006C srwi r29,r31,3 <00000000> r29=0,r31=0
KDB(0)> br open <== we set a break point on open.
.open+000000 (sid:00000000) permanent & global
KDB(0)> q <== we exit the kdb
# ls <== do some command that will certainly call open
Breakpoint <== open was called so we enter the KDB
.open+000000 li r6,0 <0000000000000000> r6=0
KDB(0)> s <== do one step
.open+000004 stdu stkp,FFFFFF80(stkp)
stkp=F00000002FF3B390,FFFFFF80(stkp)=F00000002FF3B310
KDB(0)> n <== another one
.open+000008 mflr r0 <.sys_call_ret+000000>
KDB(0)> dis .open+000008 32 <== now let’s find the following branch
.open+000008 mflr r0
.open+00000C extsw r4,r4
.open+000010 addi r7,stkp,70
.open+000014 std r0,90(stkp)
.open+000018 clrlwi r5,r5,0
.open+00001C bl <.copen> <== here it is
.open+000020 ori r0,r3,0
.open+000024 clrlwi r4,r3,0
KDB(0)> B <== this will break at the next branch that should be open+1c
.open+00001C bl <.copen> r3=0000000020008B88
KDB(0)> s <== we step that branch
.copen+000000 std r31,FFFFFFF8(stkp) r31=0,FFFFFFF8(stkp)=F00000002FF3B
308
KDB(0)> dr lr <== let’s see what is in the link register
lr : 0000000000387D24
.open+000020 ori r0,r3,0 <0000000020008B88> r0=0000000000003
77C,r3=0000000020008B88
KDB(0)> r <== break on the lr (we will return to the calling function)
.open+000020 ori r0,r3,0 <0000000000000000> r0=0000000000000030,r3=0
KDB(0)> ca <== clear all break points before leaving
KDB(0)> q <== exit the KDB
Introduction The following table represents the name list/symbol sub commands and their
matching crash/lldb sub commands when available
ns sub command The ns subcommand toggles symbolic name translation on and off.
ts sub command The ts subcommand translates addresses to symbolic representations. ts uses the
following argument:
examples (0)> nm kdb_avail <== display addresses for the kdb_avail symbol
Symbol Address: 0046AE70
TOC Address: 0046AC80
(0)> set 1 <== turn symbolic name translation off
Symbolic name translation off
(0)> ts 046AE70 <== get the symbol for 046AE70
0046AE70 <== didn’t get it because symbolic name translation is turned off
(0)> ns <== turn symbolic name translation back on
Symbolic name translation on
(0)> ts 046AE70 <== now we should get the symbol
kdb_avail+000000
Introduction The following table represents the watch break point sub commands and their
matching crash/lldb sub commands when available
wr, ww, wrw, lwr, lww, lwrw, cw and lcw sub commands On the PowerPC architecture, a watch register (the DABR, Data Address
Breakpoint Register, or HID5 on the Power 601) can be used to enter KDB when a
specified effective address is accessed. The register holds a double-word effective
address and bits to specify load and/or store operations.
So the watch break points can be used with the following rules:
examples KDB(0)> wr utsname 3 <== set a break on read of utsname for 3 bytes
CPU 0: utsname+000000 eaddr=001CB9C8 size=3 hit=0 mode=R Xlate ON
CPU 1: utsname+000000 eaddr=001CB9C8 size=3 hit=0 mode=R Xlate ON
KDB(0)> q <== exit the debugger
# uname -a <== run some command that will read the utsname
Watch trap: 001CB9C8 <utsname+000000>
.umem_move+000030 lbzx r7,r6,r3 r7=000000000000B6B4, r6=0, r3=00000000001CB9C8
KDB(0)> wr <== verify the number of hits -------v
CPU 0: utsname+000000 eaddr=001CB9C8 size=3 hit=1 mode=R Xlate ON
CPU 1: utsname+000000 eaddr=001CB9C8 size=3 hit=1 mode=R Xlate ON
KDB(0)> cw <== clear watch break points
KDB(0)> lwr utsname <== now set a local watch break point (only cpu 0)
CPU 0: utsname+000000 eaddr=001CB9C8 size=8 hit=0 mode=R Xlate ON
KDB(0)> lcw <== clear local watch break points
KDB(0)> q <== exit kdb, will resume the current thread
AIX oc3b42 0 5 000714834C00
Introduction The following table represents the status sub commands and their
matching crash/lldb sub commands when available
stat sub command The stat subcommand displays system statistics, including the last kernel printf()
messages still in memory. The following information is displayed for a processor
that has crashed:
Introduction The following table represents the kernel extension loader sub commands and
their matching crash/lldb sub commands when available
lke and stbl sub commands The subcommands lke and stbl can be used to display the current state of loaded
kernel extensions, using the following parameters:
rmst sub A symbol table can be removed from KDB using the rmst subcommand. This
command subcommand requires that either a slot number or the effective address for the
loader entry of the symbol table be specified.
exp sub The exp subcommand can be used to look for an exported symbol or to display the
command entire export list. If no argument is specified the entire export list is printed. If a
symbol name is specified as an argument, then all symbols which begin with the
input string are displayed.
Introduction The following table represents the address translation sub commands and their
matching crash/lldb sub commands when available
tr and tv sub The tr and tv sub commands can be used to display address translation
commands information. The tr sub command provides a short format; the tv subcommand a
detailed format.
For the tv subcommand, all double-hashed entries are dumped; when an entry
matches the specified effective address, the corresponding physical address and
protections are displayed. Page protection (K and PP bits) is displayed according
to the current segment register and machine state register values.
examples (0)> tr @iar <== display the physical address of the current instruction
Physical Address = 000000000002CB58
(0)> tv @iar <== display the physical mapping of the current instruction
eaddr 000000000002CB58 sid 0000000000000000 vpage 000000000000002C hash1 0000002
C
p64pte_cur_addr 0000000001001600 sid 0000000000000000 avpi 00 hsel 0 valid 1
rpn 000000000000002C refbit 1 modbit 0 wimg 2 key 3
____ 000000000002CB58 ____ K = 0 PP = 11 ==> read only
Introduction The following table represents the process/thread sub commands and their
matching crash/lldb sub commands when available
ppda sub command The ppda sub command displays per-processor data areas with the following
conditions:
intr sub command The intr sub command prints entries in the interrupt handler table with the
following conditions:
mst sub command The mst sub command prints the Machine State Save Area for:
• the current context: if no argument is provided
• slot: thread slot number. This value must be a decimal value.
• Address: effective address of an mst to display. Symbols, hexadecimal values,
or hexadecimal expressions can be used in specification of the address.
proc sub command The proc subcommand displays process table entries using:
• * : display a summary for all processes.
• -s flag : display only processes with a process state matching that specified by
flag. The allowable values for flag are: SNONE, SIDLE, SZOMB, SSTOP,
SACTIVE, and SSWAP.
• slot : process slot number. This value must be a decimal value.
• Address : effective address of a process table entry. Symbols, hexadecimal
values, or hexadecimal expressions can be used in specification of the address.
th sub command The thread subcommand displays thread table entries using:
sw sub By default, KDB shows the virtual space for the current thread. The sw
command subcommand allows selection of the thread to be considered the current thread.
Threads can be specified by slot number or address. The current thread can be
reset to its initial context by entering the sw subcommand with no arguments.
For the KDB Kernel Debugger, the initial context is also restored whenever
exiting the debugger.
example continued
(0)> ttid 70e <== now display threads for gil (70e), which should have 5 threads
SLOT NAME STATE TID PRI RQ CPUID CL WCHAN
pvthread+000380 7 gil SLEEP 00070F 025 1 65
pvthread+000580 11 gil SLEEP 000B17 025 1 65 netisr_servers
pvthread+000500 10 gil SLEEP 000A15 025 1 65 netisr_servers
pvthread+000480 9 gil SLEEP 000913 025 1 65 netisr_servers
pvthread+000400 8 gil SLEEP 000811 025 1 65 netisr_servers
(0)> user -ad 5 <== display address space for thread 5
User-mode address space mapping:
segs32_raddr.0000000000000000
uadspace node allocation......(U_unode) @ F00000002FF3E028
usr adspace 32bit process.(U_adspace32) @ F00000002FF3E048
segment node allocation.......(U_snode) @ F00000002FF3E008
segnode for 32bit process...(U_segnode) @ F00000002FF3E2A8
U_adspace_lock @ F00000002FF3E4E8
lock_word.....0000000000000000 vmm_lock_wait.0000000000000000
V_USERACC strtaddr:0x0000000000000000 Size:0x0000000000000000
vmmflags......00000000
(0)> sw 5 <== switch to the thread 5
Switch to thread: <pvthread+000280>
(0)> tpid <== display the current tpid that should be slot 5
SLOT NAME STATE TID PRI RQ CPUID CL WCHAN
pvthread+000280 5*xmgc SLEEP 00050B 03C 1 65 KERN_heap+ECD5730
(0)> sw <== switch back to initial thread
Switch to initial thread: <pvthread+001200>
(0)> tpid <== display the current tpid that should be initial pvthread+001200
SLOT NAME STATE TID PRI RQ CPUID CL WCHAN
pvthread+001200 36*kdb_64 RUN 002467 03C 0 0
Introduction  The following table lists the kernel stack subcommands and their matching crash/lldb subcommands, when available.
f sub command  The f subcommand displays all the stack frames from the current instruction, as deep as possible. Interrupts and system calls are crossed, and the user stack is also displayed. In user space, the traceback allows display of symbolic names.
The amount of data displayed may be controlled through the mst_wanted and
display_stacked_frames options of the set sub command. You can also request to
see the stacked registers using the display_stacked_regs set option.
examples (0)> f +x <== display the stack frame for the current thread
pvthread+000380 STACK:
[0002CB58]et_wait+00036C (0000000000212A0C, A0000000000010B2,
0000000000122A0C [??])
[000EF170]netthread_start+0000B8 ()
[00060F6C]procentry+000010 (??, ??, ??, ??)
(0)> f -x <==display the stack frame without addresses
pvthread+000380 STACK:
et_wait+00036C (.backt+000000, A0000000000010B2,
.v_prepin+000000 [??])
netthread_start+0000B8 ()
procentry+000010 (??, ??, ??, ??)
(0)> set 10 <== want to see the stacked registers
display_stacked_regs is true
(0)> f <== show the stack frame with stacked registers
pvthread+000380 STACK:
et_wait+00036C (.backt+000000, A0000000000010B2,
.v_prepin+000000 [??])
r31 : 0000000000000000 r30 : 0FFFFFFFF0100000 r29 : 0000000000205E38
r28 : 00000000DEADBEEF r27 : 00000000DEADBEEF r26 : 00000000DEADBEEF
r25 : 00000000DEADBEEF r24 : 00000000DEADBEEF r23 : 00000000DEADBEEF
r22 : 00000000DEADBEEF r21 : 00000000DEADBEEF r20 : 00000000DEADBEEF
r19 : 00000000DEADBEEF r18 : 00000000DEADBEEF r17 : 00000000DEADBEEF
r16 : 00000000DEADBEEF r15 : 00000000DEADBEEF r14 : 00000000DEADBEEF
netthread_start+0000B8 ()
r31 : 00000000DEADBEEF r30 : 00000000DEADBEEF r29 : 00000000DEADBEEF
procentry+000010 (??, ??, ??, ??)
Introduction  The following table lists the LVM subcommands and their matching crash/lldb subcommands, when available.
volgrp, pvol, lvol and pbuf sub commands  The volgrp, pvol, lvol and pbuf subcommands respectively display:
• volume group information (including lvol structures). volgrp addresses are
registered in the devsw table, in the DSDPTR field.
• physical volume information. pvol addresses are registered within the volgrp structure.
• logical volume information. lvol addresses are registered within the volgrp and
lvol structures.
• physical buffer information. pbuf addresses are registered within the volgrp and pvol structures.
examples (0)> dev 0xa <== get the device switch table entry for a volume group
Slot address F1000097140C3500
MAJOR: 00A
.
.
dump: 010E3D00
mpx: .nodev (0009E378)
revoke: .nodev (0009E378)
dsdptr: F10000971660D000 <== the pointer to the volgrp structure
selptr: 00000000
opts: 0000002A DEV_DEFINED DEV_MPSAFE
(0)> volgrp F10000971660D000
VOLGRP............. F10000971660D000
vg_lock............... FFFFFFFFFFFFFFFF partshift............. 0000000E
open_count............ 0000000A flags................. 00000000
lvols............... @ F10000971660D010 <== pointer to the lvol struct
pvols............... @ F10000971660E010 <== pointer to the pvol struct
major_num............. 0000000A
vg_id................. 0007148300004C00000000E12335DF7D
nextvg................ 00000000 opn_pin............. @ F10000971660E428
von_pid............... 00000A32 nxtactvg.............. 00000000
ca_freepvw............ 00000000 ca_pvwmem............. 00000000
ca_hld.............. @ F10000971660E488 ca_pv_wrt........... @ F10000971660E4A0
.
.
(0)> lvol F10000971E624E00 <== display one of the lvol structures
LVOL............ F10000971E624E00
work_Q.......... 00000000 lv_status....... 00000000
lv_options...... 00001000 nparts.......... 00000001
i_sched......... 00000000 nblocks......... 00034000
parts[0]........ F10000971E621A00
pvol@ F1000097163DF200 <== pointer to pvol structure
.............dev 8000000E00000001 start 002C9100
parts[1]........ 00000000
parts[2]........ 00000000
maxsize......... 00000000 tot_rds......... 00000000
complcnt........ 00000000 waitlist........ FFFFFFFF
stripe_exp...... 00000000 striping_width.. 00000000
lvol_intlock. @ F10000971E624E60 lvol_intlock.... 00000000
(0)> pvol F1000097163DF200 <== now display the pvol
PVOL............... F1000097163DF200
dev................ 8000000E00000001 xfcnt.............. 00000000
armpos............. 00000000 pvstate............ 00000000
pvnum.............. 00000000 vg_num............. 0000000A
fp................. F1000096000022F0 flags.............. 00000000
num_bbdir_ent...... 00000000 fst_usr_blk........ 00001100
beg_relblk......... 00867C2D next_relblk........ 00867C2D
max_relblk......... 00867D2C defect_tbl......... F1000097165F4C00
ca_pv............ @ F1000097163DF250 sa_area[0]....... @ F1000097163DF260
sa_area[1]....... @ F1000097163DF270
pv_pbuf.......... @ F1000097163DF280 <== pointer to pbuf
oclvm............ @ F1000097163DF3C8
Introduction  The following table lists the scsi subcommands and their matching crash/lldb subcommands, when available.
asc, vsc and csd sub commands  The asc, vsc and csd subcommands respectively print:
• ascsi adapter information : the ascsiddpin kernext is used to locate the adp_ctrl structure.
• vscsi adapter information : the vscsiddpin kernext is used to locate the vscsi_ptrs structure.
• scdisk disk information : the scdiskpin kernext is used to locate the scdisk_list structure.
If no argument is specified, the asc subcommand loads the slot numbers with addresses from the adp_ctrl structure. The asc and vsc subcommands can use the following arguments:
Example continued
mtu...................00141000
num_tcw_words.........00000011 shift.................00000000 tcw_word..............00000000
resvd1................00000000 cfg_close.............00000000
vpd_close.............00000000 locate_state..........00000004
locate_event..........FFFFFFFF rir_event.............FFFFFFFF
vpd_event.............FFFFFFFF eid_event.............FFFFFFFF
ebp_event.............FFFFFFFF eid_lock..............FFFFFFFF
recv_fn...............0124024C tm_recv_fn............00000000
tm_buf_info...........00000000 tm_head...............00000000
tm_tail...............00000000 tm_recv_buf...........00000000
tm_bufs_tot...........00000000 tm_bufs_at_adp........00000000
tm_bufs_to_enable.....00000000 tm_buf................00000000
tm_raddr..............00000000 proto_tag_e...........00000000
proto_tag_i...........00000000 adapter_check.........00000000
eid................ @ 5002E42C limbo_start_time......00000000
dev_eid............ @ 5002E4B0 tm_dev_eid......... @ 5002E8B0
pipe_full_cnt.........00000000 dump_state............00000000
pad...................00000000 adp_cmd_pending.......00000000
reset_pending.........00000000 epow_state............00000000
mm_reset_in_prog......00000000 sleep_pending.........00000000
bus_reset_in_prog.....00000000 first_try.............00000001
devs_in_use_I.........00000000 devs_in_use_E.........00000000
num_buf_cmds..........00000000 next_id...............00000045 next_id_tm............00000000
resvd4................00000000
ebp_flag..............00000000 tm_bufs_blocked.......00000000
tm_enable_threshold...00000000 limbo.................00000000
critical_path.........00000000 epow_reset_needed.....00000000
Introduction  The following table lists the memory allocator subcommands and their matching crash/lldb subcommands, when available.
kmstats sub command  The kmstats subcommand prints kernel memory allocator statistics. If no address is specified, all kernel allocator memory statistics are displayed. If an address is entered, only the specified statistics entry is displayed.
kmbucket sub command  The kmbucket subcommand prints kernel memory allocator buckets. If no arguments are specified, information is displayed for all allocator buckets for all CPUs. kmbucket accepts the following parameters:
xm sub command  The xmalloc subcommand may be used to display memory allocation information. Other than the -u option, these subcommands require that the Memory Overlay Detection System (MODS) is active. The MODS can be activated using the bosdebug command.
heap sub command  The heap subcommand displays information about heaps. If no argument is specified, information is displayed for the kernel heap. Information can be displayed for other heaps by specifying the address of a heap_t structure.
Examples continued
(0)> kmbucket <== display all kernel memory allocator buckets
displaying kmembucket for cpu 0 offset 5 size 0x00000020
address...............F100009715FA4C48 b_next..(x)...........F1000082C007BB80
b_calls..(x)..........0000000000000026 b_total..(x)..........0000000000000080
b_totalfree..(x)......000000000000005D b_elmpercl..(x).......0000000000000080
b_highwat..(x)........00000000000003F5 b_couldfree (sic).(x).0000000000000000
b_failed..(x).........0000000000000000 lock............... @ F100009715FA4C90
lock..(x).............0000000000000000
Introduction  The following table lists the file system subcommands and their matching crash/lldb subcommands, when available.
gnode, vnode, specnode, devnode, fifonode, rnode and hnode sub commands  The gnode, vnode, specnode, devnode, fifonode, rnode and hnode subcommands respectively display:
• generic node structure at the specified address.
• virtual node (vnode) table entries.
• special device node structure at the specified address.
• device node (devnode) table entries.
• fifo node table entries.
• remote node structure at the specified address.
• hash node table entries.
These subcommands accept:
• slot : slot number of a table entry. This argument must be a decimal value.
• Address : effective address of a table entry. Symbols, hexadecimal values, or
hexadecimal expressions can be used in specification of the address.
vfs sub command  The vfs subcommand displays entries of the virtual file system table. If no argument is entered, a summary is displayed with one line for each entry. Detailed information can be obtained for an entry by identifying the entry of interest. Individual entries can be displayed using:
• slot : slot number of a virtual file system table entry. This argument must be a
decimal value.
• Address : address of a virtual file system table entry. Symbols, hexadecimal
values, or hexadecimal expressions can be used in specification of the address.
gfs sub command  The gfs subcommand displays the generic file system structure at the specified address.
file sub command  The file subcommand displays file table entries. If no argument is entered, all file table entries are displayed in a summary. Used files are displayed first (count > 0), then the others. Detailed information can be displayed using:
• slot : slot number of a file table entry. This argument must be a decimal value.
• Address : effective address of a file table entry. Symbols, hexadecimal values,
or hexadecimal expressions can be used in specification of the address.
Introduction  The following table lists the system table subcommands and their matching crash/lldb subcommands, when available.
var sub command  The var subcommand prints the var structure and the system configuration of the machine, including:
devsw sub command  The dev subcommand displays device switch table entries. If no argument is specified, all entries are displayed. To display a specific entry, use:
trb sub command  The trb subcommand displays Timer Request Block (TRB) information. If this subcommand is entered without arguments, a menu is displayed allowing selection of the data to be displayed. Otherwise, you can use the following arguments:
• * : selects display of Timer Request Block (TRB) information for TRBs on all
CPUs. The information displayed will be summary information for some
options.
• cpu x : selects display of TRB information for the specified CPU. Note, the
characters "cpu" must be included in the input. The value x is a hexadecimal
number.
• option - the option number indicating the data to be displayed. The available
option numbers are :
• 1. TRB Maintenance Structure - Routine Addresses
• 2. System TRB
• 3. Thread Specified TRB
• 4. Current Thread TRB's
• 5. Address Specified TRB
• 6. Active TRB Chain
• 7. Free TRB Chain
• 8. Clock Interrupt Handler Information
• 9. Current System Time - System Timer Constants
slk, clk and dla sub commands  slk and clk respectively display simple and complex locks. If no argument is specified, a list of major locks will be displayed. Then, you can use the address of the lock to display the lock structure.
iplcb sub command  The iplcb subcommand displays the IPL Control Block structure using the following parameters:
• [cpu] : print the IPL CB (displays all information, including cpu information for [cpu]).
• * : print summary of all processors
• -dir : print directory information
• -proc [cpu] : print processor information
• -mem : print memory region information
• -sys : print system information
• -user : print user information
• -numa : print NUMA information
trace sub command  The trace subcommand displays data in the kernel trace buffers. Data is entered into these buffers via the trace shell command. The trace subcommand accepts the following parameters:
Examples (0)> !ls -al /dev/cd0 <== find the cd0 major number
br--r--r-- 1 root system 14, 0 Sep 08 11:18 /dev/cd0
(0)> lke 57 <== load the kernext for scsidd
ADDRESS FILE FILESIZE FLAGS MODULE NAME
57 049D6B00 00DB9740 000070D8 00080262 s_scsidd64/usr/lib/drivers/pci/s_scsidd
le_flags....... TEXT DATAINTEXT DATA DATAEXISTS 64
le_next........ 049D6900 le_svc_sequence 00000000.
.
.
.
.
Example continued
(0)> trace <== show the trace buffers; trace was started for proc events
Trace channel[0 - 7]: 0
Trace Channel 0 (7 entries)
Current queue starts at F1000097231F2000 and ends at F100009723232000
Current entry is #7 of 7 at F1000097231F2130
Hook ID: SYSC_EXECVE (00000134) Hook Type: Timestamped|Generic C000
ThreadIdent: 00003F0B
Timestamp: 26E264B2F6
Subhook ID/HookData: 0000
Data Length: 0007 bytes
D0: 00000001
*Variable Length Buffer: F1000097231F2140
Current queue starts at F1000097231F2000 and ends at F100009723232000
Current entry is #6 of 7 at F1000097231F2108
.
.
Introduction  The following table lists the network subcommands and their matching crash/lldb subcommands, when available.
ifnet sub command  The ifnet subcommand prints interface information. If no argument is specified, information is displayed for each entry in the ifnet table. Data for individual entries can be displayed by specifying:
• slot : specifies the slot number within the ifnet table for which data is to be
displayed. This value must be a decimal number.
• Address : effective address of an ifnet entry to display. Symbols, hexadecimal
values, or hexadecimal expressions can be used in specification of the address.
tcpcb and sock sub commands  The tcpcb and sock subcommands respectively print:
• tcpcb information for TCP/UDP blocks.
• socket information for TCP/UDP blocks.
If no argument is specified, tcpcb information is displayed for all TCP and UDP blocks. tcpcb and sock accept the following parameters:
tcb and udb sub commands  The tcb and udb subcommands can be used respectively to display:
• tcb block information plus socket information.
• udb block information plus socket information.
The tcb and udb subcommands accept the following parameters:
• slot : specifies the slot number within the table for which data is to be displayed. This value must be a decimal number.
• Address : effective address of a udb entry to display. Symbols, hexadecimal
values, or hexadecimal expressions can be used in specification of the address.
Introduction  The following table lists the VMM subcommands and their matching crash/lldb subcommands, when available.
vmker, pfhdata, vmstat, vmaddr, vmwait, zproc, vmlog, vrld and vmlocks sub commands  These subcommands will display VMM information about:
• vmker : virtual memory kernel data.
• pfhdata : virtual memory control variables.
• vmstat : virtual memory statistics.
• vmaddr : addresses of VMM structures.
• vmwait : displays VMM wait status using the address of a wait channel.
• zproc : displays information about the VMM zeroing kproc.
• vmlog : displays the current VMM error log entry.
• vrld : displays the VMM reload xlate table. This information is only used on SMP PowerPC machines, to prevent VMM reload deadlock.
• vmlocks : displays VMM spin lock data.
scb sub command  The scb subcommand provides options for display of information about VMM segment control blocks. The scb subcommand will prompt a menu to display scbs using the following options:
• 1 : index
• 2 : sid
• 3 : srval
• 4 : search on sibits
• 5 : search on npsblks
• 6 : search on nvpages
• 7 : search on npages
• 8 : search on npseablks
• 9 : search on lock
• a : search on segment type
• b : add total scb_vpages
• c : search on segment class
• d : search on segment pvproc
ames sub command  The ames subcommand provides options for display of the process address map for either the current or a specified process. The ames subcommand will prompt a menu to display the address map using the following options:
• 1 : current process
• 2 : specified process
• 3 : specified address map
pft sub command  The pft subcommand provides options for display of information about the VMM page frame table. The pft subcommand will prompt a menu to display page frame information using the following options:
pte sub command  The pte subcommand provides options for display of information about VMM page table entries. The pte subcommand will prompt a menu to display ptes using the following options:
• 1 : index
• 2 : sid,pno
• 3 : page frame
• 4 : PTE group
pta sub command  The pta subcommand displays data from the VMM PTA segment. The following optional arguments may be used to determine the data to be displayed:
pdt sub command  The pdt subcommand displays entries of the paging device table. An argument of * results in all entries being displayed in a summary. Details for a specific entry can be displayed using a slot number.
rmap sub command  The rmap subcommand displays the real address range mapping table. If an argument of * is specified, a summary of all entries is displayed. If a slot number is specified, only that entry is displayed. If no argument is specified, the user is prompted for a slot number, and data for that and all higher slots is displayed, as well as the page intervals utilized by the VMM.
ste sub command  The ste subcommand provides options for display of information about segment table entries for 64-bit processes. The ste subcommand will prompt a menu to display segments using the following options:
• 1 : esid
• 2 : sid
• 3 : dump hash class (input=esid)
• 4 : dump entire stab
sr64 sub command  The sr64 subcommand displays segment registers for a 64-bit process, using the following parameters:
• none : the segment registers will be displayed for the current process.
• -p pid : process ID of a 64-bit process. This must be a decimal or hexadecimal
value depending on the setting of the hexadecimal_wanted switch.
• esid : first segment register to display (lower register numbers are ignored).
This argument must be a hexadecimal value.
• size : value to be added to esid to determine the last segment register to display.
This argument must be a hexadecimal value.
apt sub command  The apt subcommand provides options for display of information from the alias page table. The apt subcommand will prompt a menu to display aliases using the following options:
• 1 : index
• 2 : sid,pno
• 3 : page frame
segst64 sub command  The segst64 subcommand displays segment state information for a 64-bit process. The information displayed can be filtered using:
ipc sub command  The ipc subcommand reports interprocess communication facility information. The ipc subcommand will prompt a menu to display ipc information using the following options:
• ***TBD***
lockanch, lockhash and lockword sub commands  These subcommands will display VMM lock information for:
• lockanch : anchor data and data for the transaction blocks in the transaction block table.
• lockhash : lock hash list.
• lockword : lock words.
lockanch, lockhash and lockword accept the following parameters :
• slot : slot number of an entry in the VMM lock table. This argument must be a
decimal value.
• Address : effective address of an entry in the VMM lock table. Symbols,
hexadecimal values, or hexadecimal expressions may be used in specification
of the address.
vmdmap sub command  The vmdmap subcommand displays VMM disk maps. To look at other disk maps, it is necessary to initialize segment register 13 with the corresponding srval. vmdmap accepts the following arguments:
• no arguments : all paging and file system disk maps are displayed.
• slot : Page Device Table (pdt) slot number. This argument must be a decimal
value.
examples ***TBD***
Introduction  The following table lists the SMP subcommands and their matching crash/lldb subcommands, when available.
start, stop and cpu sub commands  The start, stop and cpu subcommands allow you to:
• start a cpu.
• stop a cpu.
• display status or switch to another cpu.
These subcommands accept a cpu number as a parameter.
Examples ***TBD***
Introduction  The following table lists the block address translation subcommands and their matching crash/lldb subcommands, when available.
dbat and ibat sub commands  On PowerPC machines, the dbat and ibat subcommands may be used to display the dbat and ibat registers. dbat and ibat accept the following arguments:
mdbat and mibat sub commands  On PowerPC machines, the mdbat and mibat subcommands may be used to modify the dbat and ibat registers. The processor data bat register is altered immediately. KDB takes care of the valid bit; the word containing the valid bit is set last. mdbat and mibat accept the following arguments:
KDB data and instruction block address translation sub commands -- continued
Introduction  The following table lists the bat/brat subcommands and their matching crash/lldb subcommands, when available.
btac, lbtac, cbtac and lcbtac sub commands  The btac and lbtac subcommands can be used to stop when Branch Target Address Compare is true, using the hardware registers HID1 and HID2 on PowerPC systems, in the following conditions:
Introduction  The IADB is the kernel debugger used on AIX 5L running on the IA-64 platform.
# iadb
(0)> d dbg_avail
E000000004755BD8: 0000000000000001
loading IADB In AIX5L, the IADB is included in the unix_ia64 kernel located in /usr/lib/boot.
In order to use it, the IADB must be loaded at boot time. To allow IADB to load
use the following command :
You must reboot the system in order to take these changes into account.
starting IADB  The IADB may be started, if loaded, under the following circumstances:
• If bosboot or bosdebug was run with -I, the tty attached to a native serial port will show the IADB prompt just after the kernel is loaded.
• You may invoke the IADB manually from a tty attached to a native serial port, using Ctrl-Alt-Numpad4 on a native keyboard. For example:
Debugger entered by hitting cntrl-atl-numpad4
AIX/IA64 KERNEL DEBUGGER ENTERED Due to...
Debugger entered via keyboard with key in SERVICE position using
numpad 4
IP->E00000000008C910 waitproc_find_run_queue()+210: { .mib
==>0: adds sp = 0x40, sp
1: mov.i ar.lc = r33
2: br.ret.sptk.few rp
;; }
>CPU0>
IADB concept When the IADB Kernel Debugger is invoked, it is the only running program until
you exit IADB or you use the start sub command to start another cpu. All
processes are stopped and interrupts are disabled. The IADB Kernel Debugger
runs with its own Machine State Save Area (mst) and a special stack. In addition,
the IADB Kernel Debugger does not run operating system routines. Though this requires that kernel code be duplicated within IADB, it makes it possible to break anywhere within the kernel code. When exiting the IADB Kernel Debugger, all
processes continue to run unless the debugger was entered via a system halt.
iadb command
Introduction  The iadb command, unlike the IADB kernel debugger, allows examination of an operating system image produced on IA-64 systems.
The iadb command may be used on a running system but will not provide all
functions available with the IADB kernel debugger.
Parameters  The iadb command may be used with the following parameters:
• no parameter : iadb will use /dev/mem as the system image file and /usr/lib/boot/unix as the kernel file. In this case, root permissions are required.
• -d system_image_file : iadb will use the system image file provided.
• -u kernel_file : iadb will use the kernel file provided. This is required to analyze a system dump from a system running a different unix level.
• -i : include file list (may be comma-separated).
• -u : user modules list for any symbol retrieval (comma-separated list).
Loading errors If the system image file provided doesn’t contain a valid dump or the kernel file
doesn’t match the system image file, the following message may be issued by the
iadb command:
Introduction  The following table lists the breakpoint and step subcommands and their matching crash/lldb subcommands, when available.
br sub command  The br subcommand can be used to set and display software breakpoints. The br subcommand accepts the following options:
c sub command  The c subcommand can be used to clear some or all breakpoints. The c subcommand accepts the following parameters:
Examples  The following example shows the use of the br, c and s subcommands:
No Active Breakpoints
See Ya!
Introduction The following table lists the dump/display/decode subcommands and their matching crash/lldb subcommands, when available.
d subcommand The d subcommand can be used to display virtual memory using the following parameters:
dp subcommand The dp subcommand can be used to display physical memory using:
• address : physical address to dump
• ordinal : access size in bytes (1, 2, 4, or 8)
• count : number of elements to dump (of size 'ordinal')
dio subcommand The dio subcommand can be used to display the I/O space using the following parameters:
dis subcommand The dis subcommand can be used to list instructions at a given address using:
registers subcommands The following subcommands can be used to display register information:
• b : Display Branch Register(s)
• cfm : Display Current Stacked Register
• fpr : Display FPR(s) (f0 - f127)
• iip : Display or Modify Instruction Pointer
• iipa : Display Instruction Previous Address
• ifa : Display Fault Address
• intr : Display Interrupt Registers
• ipsr : Display/Decode IPSR
• isr : Display/Decode ISR
• itc : Display Time Registers ITC ITM & ITV
• kr : Display Kernel Register(s)
• p : Display Predicate Register(s)
• perfr : Display Performance Register(s)
• r : Display General Register(s)
• rr : Display Region Register(s)
• rse : Display Register Stack Registers
dpci subcommand The dpci subcommand can be used to display PCI device configuration space using the following parameters:
00000FFFFC0FDBF6: 50FF000000006F60
P.....o‘
E0000000040CF150: 0000000000000000
E0000000040CF150: 00000043
E0000000040CF150: 0000000000000043
0000000000005000: FFFFFFFFFFFFFFFF
0000000000005000: 1122334455667788
0000000000005000: 1122334455667788
Introduction The following table lists the modify memory subcommands and their matching crash/lldb subcommands, when available.
m subcommand The m subcommand can be used to modify virtual memory contents using:
mp subcommand The mp subcommand can be used to modify physical memory contents with the following parameters:
registers subcommands The following subcommands can be used to modify register information:
• b : Set Branch Register(s)
• iip : Modify Instruction Pointer
• kr : Set Kernel Register(s)
• p : Set Predicate Register(s)
• r : Set General Register(s)
• rr : Set Region Register(s)
mio subcommand The mio subcommand can be used to modify I/O space using:
• addr : I/O port address to modify
• ordinal : size of each data element (1,2,4,8)
• data1 : first data element to be stored with access of size 'ordinal'
• data2.. : subsequent data elements to be stored
b00:E00000000008E050 waitproc()+1B0
b01:BADC0FFEE0DDF00D
b02:BADC0FFEE0DDF00D
b03:BADC0FFEE0DDF00D
b04:BADC0FFEE0DDF00D
b05:BADC0FFEE0DDF00D
b06:E00000000008DEA0 waitproc()+0
b07:BADC0FFEE0DDF00D
IIP : E00000000008E000:waitproc()+160
r32:C000006013200000 [0]
r33:C000006013200290 [0]
r34:E00000971404B11C [0]
r35:E00000971404B120 [0]
r36:E0000000040C6060 [0]
r37:E0000000040C6068 [0]
r38:0000000000000186 [0]
r39:0000000000000009 [0]
r40:0000000000000001 [0]
r41:0000000000000001 [0]
rr0:0000000000480931
rr1:0000000000200431
rr2:0000000000280531
rr3:0000000000000030
rr4:0000000000000030
rr5:0000000000180331
rr6:0000000000100269
rr7:0000000000080131
The following table lists the name list/symbol subcommands and their matching crash/lldb subcommands, when available.
map subcommand The map subcommand can be used to translate a symbol into an address, and the reverse, and accepts the following as a parameter:
Examples >CPU0> map (r34) <== Lookup symbol for address in r34
>CPU0> map 0xe000000000000000 <== Lookup symbol for
address 0xe000000000000000
>CPU0> map foo+0x100 <== Lookup symbol for symbol
‘foo’+0x100
Introduction The following table lists the watch breakpoint subcommands and their matching crash/lldb subcommands, when available.
dbr subcommand The dbr subcommand can be used to set a breakpoint on data access using:
• action : the access to break on:
• r == Break on Read
• w == Break on Write
• rw == Break on Read or Write
• mask : bit mask of which address bits to match
• plvl_mask : bit mask of which privilege levels to match
• 0x1 == CPL 0 (Kernel)
• 0x2 == CPL 1 (unused)
• 0x4 == CPL 2 (unused)
• 0x8 == CPL 3 (User)
• addr : the address to trigger on
cdbr subcommand The cdbr subcommand can be used to clear previously set data breakpoints using:
• index : index of DBR breakpoint (from dbr cmd)
• all : clear all DBRs
Introduction The following table lists the trace subcommands and their matching crash/lldb subcommands, when available.
sys subcommand The sys subcommand displays the following information:
• Build level and build date
• Number and type of processors
• Memory size
• Processor Speed
• Bus Speed
reason subcommand The reason subcommand displays the reason the debugger was entered, along with the IP and the assembly code of the bundle at that IP.
Introduction The following table lists the kernel extension loader subcommands and their matching crash/lldb subcommands, when available.
kext The kext subcommand displays all loaded kernel extensions and their text and data load addresses.
ldsyms and unldsyms subcommands The ldsyms and unldsyms subcommands load or unload a kernel extension's symbols using:
• -p [path] : where path is the absolute file path of the kernel extension
• module : the module name
Introduction The following table lists the address translation subcommands and their matching crash/lldb subcommands, when available.
parameters x addr
where:
Introduction The following table lists the process/thread subcommands and their matching crash/lldb subcommands, when available.
ppda The ppda subcommand displays the Per Processor Descriptor Area and accepts the following parameters:
mst The mst subcommand displays the Machine State Stack using:
• -p : process id (PID)
• -t : Thread id (TID)
• * : All processes
• -b {bucket} : detailed info for threads in bucket of all run queue slots
• -g : global info for run queues
• -q [ number ] : detailed info for all queues
• -v {address} : detailed info for threads at run queue address
Examples ***TBD
Introduction The following table lists the LVM subcommands and their matching crash/lldb subcommands, when available.
parameters
Examples
Introduction The following table lists the SCSI subcommands and their matching crash/lldb subcommands, when available.
parameters
Examples
Introduction The following table lists the memory allocator subcommands and their matching crash/lldb subcommands, when available.
parameters
Examples
Introduction The following table lists the file system subcommands and their matching crash/lldb subcommands, when available.
parameters
Examples
Introduction The following table lists the system table subcommands and their matching crash/lldb subcommands, when available.
dev The dev subcommand displays the device switch table using:
iplcb The iplcb subcommand displays the IPL control block.
Examples
Introduction The following table lists the network subcommands and their matching crash/lldb subcommands, when available.
parameters
Examples
Introduction The following table lists the VMM subcommands and their matching crash/lldb subcommands, when available.
parameters
Examples
Introduction The following table lists the SMP subcommands and their matching crash/lldb subcommands, when available.
cpu The cpu subcommand can be used to display or change the current CPU you are working on using:
• num : logical CPU number to switch to
Introduction The following table lists the block address translation subcommands and their matching crash/lldb subcommands, when available.
parameters
Examples
Introduction The following table lists the bat/brat subcommands and their matching crash/lldb subcommands, when available.
parameters
Examples
Introduction The following table lists the miscellaneous subcommands and their matching crash/lldb subcommands, when available.
help subcommand The help subcommand can be used without a parameter to display the command listing, or with a command as a parameter to display help for that command.
kdbx subcommand The kdbx subcommand can be used to set the symbols needed to use kdb with the kdbx interface.
The following variables are set by kdbx and modify the output of certain subcommands:
• kdbx_addrd : Display breakpoint address instead of symbol name
• kdbx_bindisp : Display output in binary format instead of ASCII format
go subcommand The go subcommand is used to leave the KDB; this starts the dump process if the KDB was entered while the system was crashing.
set subcommand The set subcommand can be used to set or display the following kdb parameters:
• rows=number : set number of rows on the current display
• mltrace={on|off} : mltrace on/off; only on DEBUG kernel
• sctrace={on|off} : verbose syscall prints on/off; only on DEBUG kernel
• itrace={on|off} : enable/disable tracing on/off; only on DEBUG kernel
• umon={on|off} : enable/disable umon performance tool
• exectrace={on|off} : verbose exec prints on/off; only on DEBUG kernel
• excpenter={on|off} : debugger entry on exception on/off
• ldrprint={on|off} : verbose loader prints on/off; only on DEBUG kernel
• kprintvga={on|off} : kernel prints to VGA on/off
• dbgtty={on|off} : use debugger TTY as console on/off
• dbgmsg={on|off} : Tee Console and LED output to TTY
• hotkey={on|off} : enter debugger on key press on/off; only on DEBUG kernel
Examples
Exercise
Introduction In this exercise you will configure the system to enable the live debugger
and invoke both the live and image debugger for your system.
> stat
PowerPC: kdb
> dw kdb_avail
>q
IA-64: iadb
> d dbg_avail
> go
6. Execute the following truss command:
# exit
Exercise -- continued
Ctrl-Alt-NUMAPAD4
12. Enter the cpu command. What is the status
of CPU0?
________________________________
13. Exit the live debugger.
Platform
This lesson is independent of platform.
Lesson Objectives
At the end of the lesson you will be able to:
• List and describe the states of a process.
• List the steps taken by the kernel to create a new process as the
result of a fork() system call, and the steps taken to create a new
thread of execution.
• Describe what happens when a process terminates.
• List the three thread models available in AIX 5.
• Identify the relationship between the internal structures proc,
thread, user and u_thread.
• Use the kernel debugging tool to locate and examine processes,
proc, thread, user and u_thread data structures.
• Manage process scheduling using available commands, manage processes and threads on an SMP system (to best employ cache affinity scheduling), and manage processes on a ccNUMA system (to best employ quad affinity scheduling).
• List the factors determining what action the threads of a process will take when a signal is received.
• Write a simple C program that uses the fork() system call to spawn new processes, uses the wait() system call to retrieve the exit status of a child process, creates a simple multi-threaded program by using the pthread_create() call, and uses the exec() system call to load a new program into memory.
Process A process can be defined by the list of items which builds it. A process
definition consists of:
• A process table entry
• A process ID (PID)
• Virtual address space
- User-area (U-area)
- Program “text”
- Data
- User and kernel stacks
• Statistical information
Definition of process management Process management consists of the tools and the ability to have many processes and threads existing simultaneously in a system, sharing usage of the CPU or, in an SMP system, CPUs. Process management also includes the ability to start, stop, and force a stop of a process.
The tools and information used to manage the processes
• A process is a self-contained entity that consists of the information required to run a single program, such as a user application.
• The kernel contains a table entry for each process, called the proc entry.
• The proc entry contains information necessary to keep track of the
current state and location of page tables for the process.
• The proc entry resides in a slot in an array of proc entries.
• The kernel is configured with a fixed number of slots.
• All processes have a process ID or PID.
• The PID is assigned when the process is created and provides a
convenient way for users to refer to the other processes.
• The process contains a list of virtual memory addresses that the
process is allowed to access.
• The user-area (u_area) of a process contains additional information
about the process when it is running.
• The kernel tracks statistical information for the process, such as the
amount of time the process uses the CPU, the amount of memory the
process is using, etc. The statistical information is used by the kernel for
managing its resources and for accounting purposes.
Process operations Four basic operations define the lifetime of a process in the system:
• fork - Process creation
• exec - Loading of programs in process
• exit - Death of process
• wait - The parent process notification of the death of the child process.
Fork new processes The fork system call is the way to create a new process.
• All processes in the system (except the boot process) are created
from other processes through the fork mechanism.
• All processes are descendants of the init process (process 1).
• A process that forks creates a child process that is nearly a duplicate
of the original parent process.
• The child has a new proc entry (slot), PID, and registers.
• Statistical information is reset, and the child initially shares most of
the virtual memory space with the parent process.
• The child process initially runs the same program as the parent
process. The child may use the exec() call to run another program.
The fork() system call The parent process has an entry in the Process and Thread tables before the fork() system call; after the fork() system call, another independent process exists with its own entries in the Process and Thread tables.
[Figure: the fork() system call. Before the call, the parent process has one entry each in the Process Table and the Thread Table; after the call, the child process has its own entries in both tables.]
Inherited attributes after a fork() system call The illustration shows what happens when the fork() system call is issued. The caller creates a child process that is almost an exact copy of the process itself. The child process inherits many attributes of the parent, but receives a new user block and data region.
The child process inherits the following attributes from the parent process:
• Environment
• Close-on-exec flags and signal handling settings
• Set user ID mode bit and Set group ID mode bit
• Profiling on and off status
• Nice value
• All attached shared libraries
• Process group ID and tty group ID
• Current directory and Root directory
• File-mode creation mask and File size limit
• Attached shared memory segments and Attached mapped file
segments
• Debugger process ID and multiprocess flag, if the parent process has
multiprocess debugging enabled (described in the ptrace subroutine).
Attributes not inherited from the parent process Not all attributes are inherited from the parent. The child process differs from the parent process in the following ways:
• The child process has only one user thread: the one that called the fork subroutine, no matter how many threads the parent process had.
• The child process has a unique process ID.
• The child process ID does not match any active process group ID.
• The child process has a different parent process ID.
• The child process has its own copy of the file descriptors for the parent
process. However, each file descriptor of the child process shares a
common file pointer with the corresponding file descriptor of the parent
process.
• All semadj values are cleared.
• Process locks, text locks, and data locks are not inherited by the child
process.
• If multiprocess debugging is turned on, the trace flags are inherited from
the parent; otherwise, the trace flags are reset.
• The child process utime, stime, cutime, and cstime are set to 0.
• Any pending alarms are cleared in the child process.
• The set of signals pending for the child process is initialized to the
empty set.
• The child process can have its own copy of the message catalogue for
the parent process.
The fork() system call code example The following code illustrates the usage of the fork() system call. After the call there are two processes executing two different copies of the same code. A process can determine whether it is the parent or the child from the return code.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int statuslocation;
pid_t proc_id, proc_id2;

proc_id = fork();
if ( proc_id < 0 ) {
    printf ("fork error \n");
    exit (-1);
}
if ( proc_id > 0 ) {
    /* Parent process waiting for child to terminate */
    proc_id2 = wait(&statuslocation);
}
if ( proc_id == 0 ) {
    /* I'm the child process */
    /* ............. */
}
Listing processes with the ps command after fork() Executing the test program creates two processes, which can be listed with the ps command. The program name in the example is fork, and that name is listed as the command for both the parent and the child. Note that the child's PPID is equal to the PID of the parent.
F S UID PID PPID C PRI NI ADDR SZ TTY TIME CMD
Processes without the parent process The previous example showed how the PID of the calling process becomes the PPID of the child process. This example shows what happens if the parent process terminates before the child process does. If we rewrite the program so that the parent process terminates after fork() without waiting for the child, the system replaces the PPID with 1, the init process. The init process then picks up the SIGCHLD signal so that the system can free the process table entry, even though the parent process no longer exists. This situation is shown below:
F S UID PID PPID C PRI NI ADDR SZ TTY TIME CMD
240001 A 0 10346 10236 0 60 20 5b8b 496 pts/1 0:00 ksh
40001 A 0 10996 1 0 68 24 8330 44 pts/1 0:00 fork
200001 A 0 11216 10346 3 61 20 dbbb 244 0:00 ps
Zombie processes If, for some reason, no process receives the SIGCHLD signal from the child, the empty slot remains in the process table, even though other resources are released. Such a process is called a zombie and is listed in ps as <defunct>. The example below shows some of these zombie processes.
.....F S UID PID PPID C PRI NI ADDR SZ TTY TIME CMD
200003 A 0 1 0 0 60 20 500a 704 - 0:03 init
240401 A 0 2502 1 0 60 20 d2da 40 - 0:00 uprintfd
240001 A 0 2622 2874 0 60 20 2965 5208 - 0:46 X
40001 A 0 2874 1 0 60 20 c959 384 - 0:00 dtlogin
50005 Z 0 3776 1 1 68 24 0:00 <defunct>
40401 A 0 3890 1 0 60 20 91d2 480 - 0:00 errdemon
240001 A 0 4152 1 0 60 20 39c7 88 - 0:21 syncd
240001 A 0 4420 4648 0 60 20 4b29 220 - 0:00 writesrv
240001 A 0 4648 1 0 60 20 b1d6 308 - 0:00 srcmstr
50005 Z 0 10072 1 0 68 24 0:00 <defunct>
50005 Z 0 10454 1 0 68 24 0:00 <defunct>
Exec system call to load a new program The exec subroutine does not create a new process; it loads a new program into the process.
Valid program files for the exec() system call The fork() system call creates a new process with a copy of the environment, and the exec() system call loads a new program into the current process, overlaying the current program with a new one (called the new-process image). The new-process image file can be one of three file types:
Inherited attributes after the exec() system call The new-process image inherits the following attributes from the calling process image: session membership, PID, PPID, supplementary group IDs, process signal mask, and pending signals.
The exec() system call The illustration shows how the Process and Thread table entries remain unchanged after the exec() system call.
[Figure: the exec() system call. The process issues exec(); its entries in the Process Table and the Thread Table remain unchanged.]
The exec() system call code example The following code illustrates the usage of the execv() system call. After the call, the current process is overlaid with the new program. To illustrate the function, the output from the program is listed after the program.
The program first defines two variables. The first is a pointer to the program name to be executed, and the second is a pointer to the arguments (by convention the first argument passed is the program name itself). The program source for sleeping.c is not supplied, as any program can be used for this example.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int returncode;
char *argumentp[4], arg1[50], arg2[50], arg3[50];
const char *Path = "/home/olc/prog/thread/sleeping";

int main(int argc, char **argv)
{
    strcpy (arg1, "/home/olc/prog/thread/sleeping");
    strcpy (arg2, "test param 1");
    strcpy (arg3, "test param 2");
    argumentp[0] = arg1;
    argumentp[1] = arg2;
    argumentp[2] = arg3;
    argumentp[3] = NULL;   /* the argument list passed to execv() must end with NULL */
    printf ("before execv \n");
    returncode = execv(Path, argumentp);
    printf ("after execv \n");
    exit (0);
}
and the program output:
before execv
I’m the sleeping process
The exec() system call While the program in the example is being executed, we can examine the process status with the ps command. Notice that the program name for the example is “exec,” and the program name for the called program is “sleeping.” As we see in the listing from the ps command, the current program is replaced with the new one, and we never reach the print statement "after execv\n". The program prints “I’m the sleeping process,” because the main program has been replaced with the program in the path variable. If we look closer at the output from the ps -l command before and after the system call, we can tell that the program name has been replaced, but the PID and PPID remain the same.
#> ps -l
. F S UID PID PPID C PRI NI ADDR SZ TTY TIME CMD
240001 A 0 10346 10236 0 60 20 5b8b 492 pts/1 0:00 ksh
200001 A 0 10696 10346 2 61 20 6bad 240 pts/1 0:00 ps
200001 A 0 10964 10346 0 68 24 4388 40 pts/1 0:00 exec
And after the exec() system call, the exec program is replaced with
sleeping:
#> ps -l
. F S UID PID PPID C PRI NI ADDR TTY TIME CMD
240001 A 0 10346 10236 0 60 20 5b8b pts/1 0:00 ksh
200001 A 0 10698 10346 2 61 20 a354 pts/1 0:00 ps
200001 A 0 10964 10346 0 68 24 4388 pts/1 0:00 sleeping
Exit: what happens when a process terminates The exit system call is executed at the end of every process. The system call cleans up and releases memory, text, and data, but leaves an entry in the process table so that a return value and other status information can be passed to the parent process if needed.
• exit - termination of a process
• When a program no longer needs to run or execute other programs, it can exit.
• A program that exits causes the process to enter the zombie state.
Exiting from a program There are basically three ways a process can terminate: the program flow reaches an explicit exit(exit_value) statement; the program flow ends without an exit() statement (in which case the runtime startup code automatically calls exit); or the running program receives a signal from an external source, such as a keyboard interrupt (<Ctrl-c>) from the user. If the program receives an interrupt, the program path switches to the interrupt handling routine, either in the program or the system default routine, which terminates the program with an exit.
When executing the exit() system call, all memory and other resources are freed, and the parameter supplied to exit() is placed in the process table as the exit value for the process. After completion of the exit() system call, a SIGCHLD signal is issued to the parent process (the process at this stage is nothing but the process table entry). This is called the zombie state. When the parent process reacts to the SIGCHLD signal and reads the return code from the process table, the system can clean up and free the process table entry.
On rare occasions when the parent process cannot respond to the signal immediately, we can see the zombie in the process table with the ps command. A zombie is listed as <defunct>.
Waiting for the death of a child process The wait system call is placed at the end of a program; normally it is placed there by the programmer as the system call wait(), but if not, the system adds one automatically. The wait call is used to notify the parent process of the death of the child process and to release the child's process slot.
• The parent process can be notified of the death of the child by waiting with a system call or catching the proper signal.
• Once the parent process acknowledges the death of a child process, the child process' slot is freed.
Idle state When a process is being created, it is first in the idle state. This state is temporary until the fork mechanism is able to allocate all of the necessary resources for the creation and fill in the process table entry for the new process.
Active state Once the new child process creation is completed, it is placed in the active state. The active state is the normal process state, and threads in the process can be running or ready to run.
Swapped processes If a process is swapped, it means that another process is running, and the process, and any of its threads, cannot run until the scheduler makes it active again.
Zombie process When a process terminates with an exit system call, it first goes into the zombie state; such a process has most of its resources freed. However, a small part of the process remains, such as the exit value that the parent process uses to determine why the child process died. If the parent process issues a wait system call, the exit status is returned to the parent, the remaining resources of the child process are freed, and the process ceases to exist. The slot can then be used by another newly created process.
If the parent process no longer exists when a child process exits, the init process frees the remaining resources held by the child. Sometimes we can see a zombie staying in the process list for a longer time; one example of this situation is a process that exited while its parent process was busy or waiting in the kernel and unable to read the return code.
State transitions for AIX processes The illustration shows how a process is started with a fork() system call, becomes an active process, and how an active process can move between the swapped, active, and stopped states. A terminating process becomes a zombie until the entire process is removed.
[Figure: process state transitions. fork() takes a process from non-existing to idle, then active; an active process can move between the active, swapped, and stopped states; a terminating process passes through the zombie state back to non-existing.]
Kernel Processes
Kernel processes - Kproc Kernel processes:
• Are created by the kernel.
• Have a private u-area/kernel stack.
• Share "text" and data with the rest of the kernel.
• Are not affected by signals.
• Cannot use shared library object code or other user protection domain
code.
• Run in the Kernel Protection Domain.
Thread Fundamentals
Threads • Threads allow multiple execution units to share the same address
space.
• The thread is the fundamental unit of execution.
• A thread has an ID (TID), just as a process has an ID (PID).
• A thread is an independent flow of control within a process.
• In a multi-threaded process, each thread can execute different code concurrently.
• Managing threads requires fewer resources than managing processes.
• Inter-thread communication is more efficient than inter-process
communication, especially because variables can be shared.
Threads share data and address space
Threads reduce the need for IPC operations because they allow multiple
execution units to share the same address space, and thereby easily
share data. On the other hand, this adds complexity and risk to the
programming: for example, synchronization and locking have to be
controlled by the threads.
Threads are the unit of execution
The thread is the fundamental unit of execution, and the scheduler and
dispatcher work only with threads. Therefore, every process has at least
one thread.
Thread IDs (TID) and Process IDs (PID)
TIDs are listed for all threads in the thread table; TIDs are always odd.
PIDs are listed for all processes in the process table; PIDs are always
even, except for the init process, where PID = 1. Threads represent
independent flows within a process; the system does not provide
synchronization, and the control must be in the thread itself.
One of the main reasons for using threads is that managing threads
requires fewer resources than managing processes. Inter-thread
communication is more efficient than inter-process communication.
AIX Thread
AIX Threads • A thread is an independent flow of control that operates within the same
address space as other independent flows of control within a process.
In other operating systems, threads are sometimes called "lightweight
processes," or the meaning of the word "thread" is sometimes slightly
different.
• Multiple threads of control allow an application to overlap operations
such as reading from a terminal or writing to a disk file. This also allows
an application to service requests from multiple users at the same time.
• Multiple threads of control within a single process are required by
application developers to be able to provide these capabilities without
the overhead of multiple processes.
• Multiple threads of control within a single process allow application
developers to exploit the throughput of multiprocessor (MP) hardware.
TID format Thread IDs have the following format for 32-bit kernels:
    bits 31-24: 0    bits 23-8: INDEX    bits 7-1: COUNT    bit 0: 1
For 64-bit kernels:
    bits 63-24: 0    bits 23-8: INDEX    bits 7-1: COUNT    bit 0: 1
TID format listed with kdb
The following is a 64-bit slot in the thread table listed with kdb. The TID is
002143 hex, so the INDEX = 21 hex and the COUNT = 43 hex; 21 hex = 33
decimal. According to the figure, this is the slot number in the thread table;
the value is listed in the next line of the output from kdb.
(0)> thread 33
SLOT NAME STATE TID PRI RQ CPUID CL WCHAN
pvthread+001080 33 sendmail SLEEP 002143 03C 0 0
(0)> d pvthread+001080
pvthread+001080: 0000 0000 0000 2143 0000 0000 0000 0000
(0)>
Thread Concepts
Thread mapping models
• User threads are mapped to kernel threads by the threads library. The
way this mapping is done is called the thread model. There are three
possible thread models, corresponding to three different ways to map
user threads to kernel threads:
• M:1 model
• 1:1 model
• M:N model.
• The AIX Version 4.1 and later threads support is based on the OSF/1
libpthreads implementation. It supports what is referred to as the 1:1
model. This means that for every thread visible in an application, there
is a corresponding kernel thread. Architecturally, it is possible to have an
M:N libpthreads model, where "M" user threads are multiplexed on "N"
kernel threads. This is supported in AIX 4.3.1 and AIX 5L.
• The mapping of user threads to kernel threads is done using virtual
processors. A virtual processor (VP) is a library entity that is usually
implicit. For a user thread, the virtual processor behaves as a CPU for a
kernel thread. In the library, the virtual processor is a kernel thread or a
structure bound to a kernel thread.
• The libpthreads implementation is provided for application developers to
develop portable multi-threaded applications. The libpthreads.a library
is written to the POSIX 1003.4a Draft 10 specification in AIX 4.3;
previous versions of AIX support the POSIX 1003.4a Draft 7
specification. The libpthreads is a linkable user library that provides
user-space threads services to an application. The libpthreads_compat.a
library provides the POSIX 1003.4a Draft 7 pthreads model on AIX 4.3.
Threads Models
M:1 threads model
In the M:1 model, all user threads are mapped to one kernel thread and all
user threads run on one VP. The mapping is handled by a library
scheduler. All user threads programming facilities are completely handled
by the library. This model can be used on any system, especially on
traditional single-threaded systems.
[Figure: threads library with a library scheduler multiplexing all user threads onto one VP and one kernel thread]
1:1 threads model
In the 1:1 model, each user thread is mapped to one kernel thread and
each user thread runs on one VP. Most of the user threads programming
facilities are directly handled by the kernel threads.
[Figure: threads library mapping each user thread to its own VP and kernel thread]
M:N threads model
In the M:N model, all user threads are mapped to a pool of kernel threads
and all user threads run on a pool of virtual processors. A user thread may
be bound to a specific VP, as in the 1:1 model. All unbound user threads
share the remaining VPs. This is the most efficient and most complex
thread model; the user threads programming facilities are shared between
the threads library and the kernel threads.
[Figure: library scheduler multiplexing a pool of user threads onto a pool of VPs and kernel threads]
Thread states
Thread states In AIX, the kernel allows many threads to run at the same time, but there
can only be one thread executing on each CPU at a time. The thread state
is kept in t_state in the thread table (for detailed information, look in the
/usr/include/sys/thread.h file).
Idle state When processes and threads are being created, they are first in the idle
state. This state is temporary until the fork mechanism is able to allocate
all of the necessary resources for the creation and fill in the thread table for
a new thread.
Ready to run Once the new thread creation is completed, it is placed in the ready-to-run
state. The thread waits in this state until it is run. When the thread
is running, it continues to run until it has used a time slice, gives up the
CPU, or is preempted by a higher-priority thread.
Running thread
A thread in the running state is the thread executing on the CPU. The
thread state will change between running and ready to run until the thread
finishes execution; the thread then goes to the zombie state.
Sleeping Whenever the thread is waiting for an event, the thread is said to be
sleeping.
Swapped Though swapping takes place at the process level and all threads of a
process are swapped at the same time, the thread table is updated
whenever the thread is swapped.
Zombie The zombie state is an intermediate state for the thread, lasting only until
all resources owned by the thread are given up.
State transitions for AIX threads
The illustration shows the states for AIX threads. Threads typically change
between running, ready to run, sleeping, and stopped during the lifetime of
the thread.
[Figure: thread state transitions -- ready to run, swapped, zombie, non existing]
Thread Management
Thread / process relationship
• The diagram below shows how the process shares most of the data
among the threads; although each thread has its own copy of the
registers, some kernel threads have specific data, and therefore have a
private stack. Thus, data can be passed between threads via global
variables.
• A conventional single-threaded UNIX process can only harm itself (if
incorrectly coded).
• All threads in a process share the same address space, so in an
incorrectly coded program, one thread can damage the stack and data
areas associated with other threads in that process.
• Except for such areas as explicitly shared memory segments, a process
cannot directly affect other processes.
• There is some kernel data that is shared between the threads, but the
kernel also maintains thread-specific data.
• Per-process data that is needed even when the process is swapped out
is in the pvproc structure. The pvproc structure is pinned.
• Per-process data that is needed only when the process is swapped in is
in the user structure.
• Per-thread data that is needed even when the process is swapped out is
in the pvthread structure. The pvthread structure is pinned.
• Per-thread data that is needed only when the process is swapped in is
in the uthread structure.
Data placement overview
[Figure: program code, data, and BSS shared by the process, with a private stack and kernel thread data for each thread]
Process swapping
Thread Scheduling
Thread scheduling
Scheduling and dispatching is the ability to assign CPU time to threads in
the system in an efficient and fair way. The problem is to design the system
to handle many simultaneous threads and at the same time still be
responsive to events.
Clock ticks and time slices
The division of time among the threads on the AIX system relies on clock
ticks. Every 1/100 of a second, or 100 times a second, the dispatcher is
called and does the following:
• Increases the running tick counter for the running process.
• Scans run queues for the thread with the highest priority.
• Dispatches the most favored thread.
Once every real second, the scheduler wakes up and recalculates the
priority for all threads.
Thread priority • AIX priority has 128 (0-127) levels that are called run queue levels.
• The higher the run queue level, the lower the priority.
• Priority 127 can only be used by the wait process.
• User processes can have their priority changed by -20 to +20 levels
(renice).
• User processes are in the range 40 - 80.
• A clock tick interrupt decreases thread priority.
• The scheduler (swapper) increases thread priority.
The priority is based on the basic priority level, the initial nice value, the
renice value and a penalty.
[Figure: priority components stacked; a higher value means a lower priority]
• Base priority: default value = 40
• Nice value: default = 20
• Renice value: -20 to +20
• Penalty based on runtime
SCHED_RR threads scheduling algorithms
• SCHED_RR
- This is a round-robin scheduling mechanism in which the thread is
time-sliced at fixed priority.
- This scheme is similar to creating a fixed-priority, real-time process.
- The thread must have root authority to be able to use this
scheduling mechanism.
SCHED_FIFO threads scheduling algorithms
• SCHED_FIFO
- A non-preemptive scheduling scheme.
- The thread runs at fixed priority and is not time-sliced.
- It will be allowed to run on a processor until it voluntarily
relinquishes the processor by blocking or yielding.
- A thread using SCHED_FIFO must also have root authority to use
it.
- It is possible to create a thread with SCHED_FIFO that has a high
enough priority that it could monopolize the processor.
SCHED_OTHER threads scheduling algorithms
• SCHED_OTHER
- The default AIX scheduling.
- Priority degrades with CPU usage.
Thread scheduling
Like most UNIX systems, AIX uses a multilevel round-robin model for
process and thread scheduling. Processes and threads at the same
priority level are linked together and placed on a run queue. AIX has 128
run queues, 0-127, each representing one of the 128 possible priorities.
When a process starts running, it is given a priority based on
the nice value, and the process is linked with other processes at the same
level. As the process runs and consumes CPU resources, the priority
decreases until it finishes, or until the priority is so low that other
processes get CPU time. If a process does not run, the priority increases
until it can get CPU time again. The drawing below illustrates the 128 run
queue levels and six processes: three at priority 60 and three at 70.
[Figure: run queue levels 20, 40, 60, 80, 100, 120; level 127 holds the idle process]
Thread scheduling algorithm
The scheduler uses the following algorithm to calculate priorities for the
running processes:
Invariants:
-20 <= nice <= 20
0 <= r <= 32
0 <= d <= 32
0 <= ticks <= 120
0 <= p <= 126
The r and d values control how a process is impacted by its run time; r
controls how severely a process is penalized for used CPU time, while d
controls how fast the system “forgives” previous CPU consumption.
The r and d values can be set with the schedtune [-r r_val] [-d d_val] command.
The Dispatcher
Context switch The dispatcher switches context to make a different thread execute:
• As a thread executes in the CPU, its priority becomes less favored.
• The scheduler re-calculates the priority of the executing thread and
measures the new priority against the priorities of the threads that
are runnable.
• In AIX, the run queues are divided into 128 separate priority queues
with priority 0 being the most-favored priority and priority 127 the
least-favored.
• Threads at the same priority level are on the same run queue for
quick determination of the next runnable process.
• All of the threads on a more-favored priority queue run before
threads on a less-favored priority queue.
• Queue 127 contains the wait threads. There is one wait thread per
CPU, and these run only when there are no other runnable threads.
Thread In AIX, the kernel allows preemption of both user and kernel threads.
preemption
• Preemption allows the kernel to respond to real-time processes
much faster.
• On most UNIX systems, when a thread is in kernel mode, no other
thread can execute until the thread in kernel mode returns to user
mode or voluntarily gives up the CPU.
• In AIX, other higher priority threads may preempt threads running in
kernel mode.
• This feature supports real-time processing where a real-time process
must respond to an action immediately.
• Some sections of code have been determined to be critical sections
where preemption is not possible because preemption may cause
inconsistent kernel data structures. These sections are protected
either by preventing preemption (by disabling interrupts) or by
holding a lock.
• The kernel can use locks to serialize access to global kernel data that
could be corrupted by preemption.
• The thread holding the lock for a piece of data is guaranteed to run at
a higher priority than the set of threads waiting for the lock. This is
called priority promotion.
• However, other threads running at higher priority and not asking for
the lock on the same piece of data can preempt the locking thread.
There is one global priority-based, multi-level run queue (runq). All threads
that are runnable are linked into one of these runq entries. There are
currently 128 priorities (0-127). The scheduler periodically scans the list of
all active threads and recalculates thread priorities based on the amount of
processor time used.
Multiple run queues (MRQ)
• AIX 4.3.3 uses multiple specialized run queues instead of just one
global queue.
• Each processor has its own local run queue, and each node has a
global run queue.
• Processors dispatch threads from the local and the global run queue.
[Figure: twelve local run queues RQ 0-11, one per CPU 0-11]
If a process has the variable RT_GRQ=ON set, it will sacrifice cache
optimization for the best possible real-time behavior. That is, the process
will be on the global run queue and run on the first available CPU. Threads
can be bound to one CPU and will then never be on the global run queue.
Each local run queue has its own lock. This reduces the lock contention
and makes the lock handling faster. The local queue makes the scan faster
because there is no special handling of bound threads, and simple
handling of soft affinity with one CPU per run queue. The kernel cache
contention is reduced because each CPU updates its own dispatcher state,
and the structures for threads in the local run queue are more likely to be in
the local cache.
Initial load balancing
When new unbound threads are created, they should initially be placed so
that the system load remains balanced. This has to be handled differently
for new processes and for additional threads in an existing process.
Idle load balancing
Idle load balancing occurs when a CPU goes idle and starts looking for
work in other run queues. The criteria for permitting a thread steal are:
• Foreign run queue threads are greater than 1/4 load factor of the
node.
• There is at least one stealable (unbound) thread available.
• There is at least one unstealable (bound) thread available.
• The number of threads stolen from this run queue during the current
clock interval is less than 20.
• Should multiple run queues meet these criteria, the one with the most
threads will be used.
• If this run queue’s lock is available, its best priority unbound thread
will be stolen, assuming its p_lock_d is available.
• Note that failure to lock the run queue or the thread will cause the
dispatcher to loop through waitproc, thereby opening up a periodic
enablement window.
Process and thread management data structure overview
Four main data structures are used for process management:
• proc
• thread
• user
• uthread
The figure below shows how the tables are linked together.
Thread management data structure overview
The diagram above shows that the thread structures contain
pointers to all the other structures required to run that particular
thread. This is a reflection of the fact that the thread is the
schedulable entity, and the system must be able to access all
structures from the pointers in the thread table. The thread table
entries are doubly and circularly linked to all other threads for a
particular process. Note that the ublock structure contains the user
structure plus the uthread structure for the initial thread. The uthread
structures for all other threads are in the uthread (and kernel thread
stacks) segment. The first uthread structure is kept separate within
the ublock so that the fields it contains can be addressed directly and
so that fork and exec can operate with only the process private
segment to deal with.
The proc and thread structures are maintained in the kernel extension
segment as a part of the process and thread tables of the kernel. Every
in-use entry in these tables is pinned, such that the information there is
always available to the kernel. The user and uthread structures are
maintained in the process private segment of the corresponding process.
These structures are only pinned when the process is not swapped out.
When the process is swapped out, they are unpinned.
Process and The previous diagram shows how the tables are linked together. Each
thread links process in the system has an entry in the process table. Each process
entry has a pointer to the list of threads for the process, and the thread list
has a pointer back to the process table. The thread list is a double circular
linked list of all the threads owned by the process, and the pvthreads
entries point to the user area and uthread field in the process data area.
Proc structure fields and pointers
The following is an extract of the fields in the proc table to show the
pointers. Note that each entry in the proc table starts with a pointer to a
pvproc structure (we will later discuss the pvproc structure). The proc table
holds the number of threads, and the pvproc table has a pointer
pv_threadlist that points to the first thread for the process in the thread
table. A complete listing of the structures can be found in the file
/usr/include/sys/proc.h.
struct proc {
/* thread fields */
ushort p_threadcount; /* number of threads */
ushort p_active; /* number of active threads */
.......
};
struct pvproc {
/* identifier fields */
pid_t pv_pid; /* unique process identifier */
pid_t pv_ppid; /* parent process identifier */
/* thread fields */
struct pvthread *pv_threadlist; /* head of list of threads */
.......};
Thread table fields and links to the process table and the ublock
Like the process table, the thread table is divided into two tables: a
pvthread table and a thread table. The complete structures can be found in
the file /usr/include/sys/thread.h. The structures listed contain only
selected variables.
struct thread {
struct t_uaddress {
struct uthread *uthreadp; /* local data */
struct user *userp; /* owner process’ ublock (const)*/
} t_uaddress;
......
};
struct pvthread {
/* identifier fields */
tid_t tv_tid; /* unique thread identifier */
......
};
Process and thread tables' addresses in the kernel
AIX 5L has a 64-bit kernel and the addresses are 64 bits long. Both
process and thread tables are kept in the kernel extension segment at
fixed addresses.
• The proc table starts at 0xF100008080000000.
• The thread table starts at 0xF100008090000000.
AIX 4.3.3 has a 32-bit kernel and the addresses are only 32 bits long; the
values for an AIX 4.3.3 32-bit kernel are:
• The proc table starts at 0xe2000000.
• The thread table starts at 0xe6000000.
The slot number can be derived from PID or TID bits 8-23. See the
example and list from the process table on an AIX 5L POWER system.
• The generation count for each slot is incremented every time a PID or
TID is created in that slot.
Looking at AIX 4 process structures with kdb
Looking at the process table with kdb, we can tell that there is a difference
between AIX 4 and AIX 5. List the process table with the p subcommand in
kdb. The process table starts at address proc and the process slot used by
kdb is 7936, which is offset by 326000 (hex) from the start of the process
table. The size of proc is 326000 (hex) / 7936 (dec) = 416 (dec) = 1A0
(hex).
The size of each process slot can be verified with the p * subcommand. In
the following list, each slot is offset by 1A0 bytes.
(0)> nm proc
Symbol Address : E2000000
TOC Address : 001F9EF8
Looking at AIX 5 process structures with kdb
The same lists will look different on an AIX 5 system. First, in a list of the
proc table, we can tell that the structure used is no longer proc but pvproc,
and each pvproc slot is 6680 (hex) / 41 (dec) = 280 (hex) long.
(0)> p
SLOT NAME STATE PID PPID ADSPACE CL #THS
pvproc+006680 41*kdb_64 ACTIVE 0002996 00037D8 00000000200040AA 0 0001
Listing the first three slots shows that the offset is 280(hex) between the
slots.
(0)> p *
SLOT NAME STATE PID PPID ADSPACE CL
pvproc+000000 0 swapper ACTIVE 0000000 0000000 0000000000000B00 0
pvproc+000280 1 init ACTIVE 0000001 0000000 000000000000E2FD 00
pvproc+000500 2 wait ACTIVE 0000204 0000000 0000000000001B02 0
(0)> nm pvproc
Symbol Address : F100008080000000
TOC Address : 0046AC80
(0)>
Process data structure changes in AIX 5
The changes in the process table are made to support the NUMA
(Non-Uniform Memory Access) structure in AIX 5L.
A NUMA system consists of one or more separate nodes connected by a
very fast interconnect. The nodes operate as one computer, running one
copy of AIX. The name NUMA refers to the fact that the memory access
time is not constant. A CPU accessing memory on its own node will get the
memory fast (accessed via the local bus). A CPU accessing remote
memory will have to get the data from a remote node, and the access will
be slower.
In order to make the system efficient, we want to keep all parts of a
process close together so that memory access is fast; therefore, the proc
structure has been rearranged and divided into two parts: struct pvproc
holds global process data, and the rest is still in struct proc. This
change allows the NUMA system to move processes around between
CPUs or “QUADS” and still have most of the process table local to the
process. However, some of the process table must be kept at the main
node in a NUMA system.
Because of things like shared memory, processes can form migration
groups. These are groups of processes, shared memory, files, and so on,
that are logically attached to each other. The most common form of logical
attachment involves one item being intrinsically tied in with another process.
For example, a process that creates a shared memory segment is logically
attached to it. If another process uses the shared memory segment, it is
logically attached to it, and as a result is in a migration group with the first
process. Additionally, the user is allowed to create logical attachments
between items through the NUMA APIs.
The proc structure in an AIX 5 system starts with a pvproc structure and
continues with process flags. The start of the structure is listed here; for a
full listing, see the file /usr/include/sys/proc.h.
struct proc {
struct pvproc *p_pvprocp; /* my global process data*/
pid_t p_pid; /* unique process identifier*/
uint p_flag; /* process flags */
......
};
Process ID (PID) and process table slot number
The process ID or thread ID is composed of a process slot number and a
generation count; bit 0 tells us whether it is a PID or a TID (all PIDs are
even). The next 7 bits are the generation count; the generation count
prevents the rapid reuse of process IDs. Bits 8 to 23 are the slot number in
the process table. The information can be verified from the pvproc list,
where bits 8-23 in the PID field match the process slot number in the
pvproc table.
    bits 63-24: 0    bits 23-8: Slot Number    bits 7-1: Generation Count    bit 0: 0 if PID, 1 if TID
Priority boost
Priority boost is a facility that ensures that higher-priority processes
get CPU time, and that the time such processes have to wait for lower-priority
processes is minimized. Priority boost was implemented in AIX
4.3 and is further enhanced in AIX 5.
[Figure: locks A, B, and C held or requested by process 1 (low priority), process 2 (high priority), and process 3 (medium priority)]
User area in User64
The user structure is much larger in the 64-bit kernel than in the 32-bit
kernel. To improve efficiency and performance in the 32-bit kernel, two
structures are maintained: a 32-bit and a 64-bit version. This ensures that
the kernel does not copy data areas which are not used.
What is system hang detection and why do we need it?
Runaway processes and hanging systems are hard to tell apart from
locked systems, and methods to detect the runaway process are needed.
• Misbehaving high-priority applications are a recurring problem.
• When one or more processes or threads are stuck in the running
state, they can prevent any other lower-priority threads from running.
• If the priority is above the default user priority, the machine can
appear to be hung.
• The hung situation is very difficult to debug since the administrator
cannot tell what is happening on the system.
The solution to the hang problem is system hang detection. It is
implemented by the shdaemon, which runs at the highest user priority.
Shdaemon monitors the lowest-priority process that has run on the system
in a given period of time, and if the system fails to run processes below a
given priority threshold, an action is taken. System hang detection can be
configured with the shconf command, but the easiest way is to use the
smit panel. There are five distinct actions that can be taken, and for each
of them a timeout value and a threshold priority value can be set.
Signals
What are signals?
• Signals are a way of notifying a process or thread of a system event.
• Signals also provide a means of interprocess communication.
• A signal is represented as a single bit within a bit field in the kernel.
• The bit field used for signals is 64 bits wide, but only about 40 signals
are defined.
• AIX 4.3.3 defines only 37 signals for the user.
• AIX 5 has 44 defined signals, but three of them are not used.
Types of signals
• There are two types of signals in AIX: synchronous and asynchronous.
• Synchronous signals are only delivered to a thread, usually as a result
of an error condition or exception caused by the thread; for example,
SIGILL is delivered to a thread that tries to execute an illegal instruction.
• Asynchronous signals are generated externally to the current thread or
process.
• Asynchronous signals may be delivered to a process (that is, kill()) or to
another thread within the same process (that is, thread_kill() or tidsig()).
Signal mechanism
When an event triggers a signal, the kernel sets the corresponding bit in
the pending signal bit field for the process (p_sig) or thread (t_sig).
• All signals are enabled by default, and when returning from the kernel,
threads are looking for signals.
• If the signal is being ignored (masked), nothing happens.
Signal handling
Signal delivery
When a signal has been generated but not yet handled, it is said to be
pending.
• Pending signals are detected when returning from a system call.
• Pending signals are detected when resuming in user mode.
• Pending signals are detected entering or during an interruptible
sleep.
• Signals may be caught, blocked or ignored by a process.
Signal handling
Signal handling is done at the process level and signal masking is done at
the thread level. That is, each thread in a process must use the signal
handler set up by the process, but each has its own signal mask.
• If a pending signal is not specifically handled by the process, it is
delivered to all threads in the process.
• If the signal is handled by the process, the signal is delivered to a
thread that is not blocking the signal.
• If all threads are blocking a signal, it is left pending for the process
until one thread unmasks the signal or the signal is removed from the
pending list.
• If more than one signal is pending, only one is chosen for delivery at
a time.
• When a signal is being handled, it is moved to the p_cursig or
t_cursig field in the pvproc or pvthread structure.
Signal handler routines
There is a default system handler for all signals, but most signals have a
local handler routine, or the signal is ignored or blocked.
• SIGKILL and SIGSTOP cannot be handled by a local routine; these
signals will always be handled by the system default routine.
• SIGKILL and SIGSTOP cannot be blocked; the process will always
handle the signal.
Signal actions The default action for a signal depends on the signal, but may be one of
the following:
• Abort: This will generate a core dump and terminate the process.
• Exit: This will terminate the process without generating a core dump.
• Ignore: The signal is ignored.
• Stop: This action will suspend the process or thread.
• Continue: This action will resume a suspended process or thread.
Signals -- continued
Signal data structures
The file /usr/include/sys/proc.h defines the proc structure, and the following
information about signals is kept in the proc structure.
/* Signal information */
sigset_t p_sig; /* pending signals */
sigset_t p_sigignore; /* signals being ignored */
sigset_t p_sigcatch; /* signals being caught */
sigset_t p_siginfo; /* keep siginfo_t for these */
Signals • A signal is a bit set in an array with enough bits set aside for each signal
number.
• The bits are turned on by kernel code as the process is executing in
kernel mode or by the processing of interrupts that are determined to be
assigned to the process.
• Signals can also be sent from one process to another process through
the use of system calls.
• Signals are delivered to the process when:
- The process returns to the User Protection Domain.
- There is a transition from ready-to-run state to running state.
• To deliver a signal, the kernel checks whether the process is receiving
the signal.
• If the signal is being received, the kernel sets the receiving process to
perform the appropriate action.
• The appropriate action may be to invoke the signal handler for that
particular signal, kill the process, or ignore the signal.
• If the signal is blocked by the process, it is left pending until the process
is no longer blocking the signal.
• Signals can be delivered to a group of processes.
• Signals can be sent to a process or a thread.
• A thread receives the signal if:
- The signal is synchronous and attributable to a particular thread,
for example, SIGSEGV.
- The signal is sent by a thread in the same process via the
thread_kill system call.
• Otherwise, the signal goes to the process.
Signals -- continued
Signals to a process
• If a signal is not being caught, the signal action applies to the entire process.
- Every thread is terminated, stopped, or continued, depending on
the action.
• If a signal is being caught:
- Pick one thread that is not blocking the signal to receive it.
- If all threads are blocking the signal, it is left pending on the process.
Exercises
Exercises after In this exercise, the student will be supplied with programs that create
this module processes and threads using the available thread models. The programs
are very simple and the source will be supplied to the student. Kernel
debugging tools (running on a live kernel) are then used to interrogate the
kernel structures associated with the process and threads of the program.
The first code example explores the fork() system call and shows how
variables are private to each process. The second example shows how
threads are created and how global variables are shared because all
threads share the user address space, while local variables in functions
are not shared because they are kept on the stack to make each
procedure reentrant. The third example is a signal handler example.
Exercises -- continued
C code Use C code to create children with the fork() system call; notice that the
example to variable is private to each process.
explore fork()
and wait()
system calls
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int status;          /* filled in by wait() */
pid_t proc_id;
pid_t proc_id2;

int main(int argc, char **argv)
{
    int this = 7;
    proc_id = fork();
    /* error routine */
    if (proc_id < 0) {
        printf("fork error\n");
        exit(-1);
    }
    if (proc_id > 0) {               /* parent */
        this = this + 4;
        printf("waiting for child\n");
        proc_id2 = wait(&status);
        printf("I'm the parent, variable = %d\n", this);
        exit(0);
    }
    if (proc_id == 0) {              /* child */
        printf("I'm the child process\n");
        sleep(1);
        printf("I'm the child, the variable is %d\n", this);
        printf("I'm the child, terminating\n");
        exit(0);
    }
}
Exercises -- continued
C sample code The program explores process priority; the program runs long enough
to explore that there is time to look at the process table and the nice value with the
renice and ps command.
process priority
#include <stdlib.h>
#include <unistd.h>

int i, ii;
long ll;

long ll1(long v)     /* trivial worker; defined here so the program links */
{
    return v + 1;
}

int main(int argc, char *argv[])
{
    if (argc < 2)
        return 1;    /* expects the nice increment as argv[1] */
    i = atoi(argv[1]);
    ii = nice(i);
    ll = 1;
    for (i = 1; i < 5000; i++) {   /* raise the bound to keep the process alive longer */
        ll = ll1(ll);
        ll++;
    }
    return 0;
}
Exercises -- continued
C code The signal code sample catches signals and prints a message whenever
example to a signal is caught. What happens if the same signal is sent twice? And
explore signal
handling how can this behaviour be changed?
#include <stdio.h>
#include <signal.h>
#include <unistd.h>

int i;
void sig1(int), sig2(int), sig3(int);

int main(void)
{
    signal(SIGHUP, sig1);
    signal(SIGINT, sig2);
    signal(SIGQUIT, sig3);
    for (;;)            /* wait for signals until the process is killed */
        pause();
}
void sig1(int sig)
{
    printf("interrupt 1 received\n");
}
void sig2(int sig) {
    printf("interrupt 2 received\n");
}
void sig3(int sig) {
    printf("interrupt 3 received\n");
}
Objectives
After completing this unit, you should be able to describe the
common features of VMM on POWER and IA64:
• virtual memory
• page mapping
• memory objects
• VMM tuning parameters
• object types
• shared memory objects
References
Memory The virtual memory system divides real memory into fixed-length pages
management and allocates pages to a program as it requires them. Such a system
allows multiple programs to reside in memory and execute simultaneously.
The virtual memory system is responsible for keeping track of which pages
of a program are resident in memory and which are on secondary storage
(disk).
It handles interrupts from the address translation hardware in the system
to determine when pages must be retrieved from secondary storage and
placed in real memory.
When all of real memory is in use, it decides which program’s pages are to
be replaced and paged out to secondary storage.
Each time a process accesses a virtual address, the virtual address is
mapped (if not already mapped) by the VMM to the physical address
where the data is located.
Access The VMM also provides access protection to prevent illegal access to
Protection data. This protects programs from incorrectly accessing kernel memory or
memory belonging to other programs. Access protection also allows
programs to set up memory that may be shared between processes.
VMM on In this lesson the common features of VMM on POWER and IA64 are
POWER described. For the most part, the IA64 VMM design inherits the design of
as opposed to
IA-64 VMM the Power architecture. The majority of data structures, the serialization
model, and the majority of code are common between the two. Separate
lessons will describe the POWER- and IA64-specific VMM context.
Introduction The following terms relating to virtual memory concepts will be defined in
this section:
• Page
• Frame
• Address space
• Effective address
• Virtual memory
• Physical address
• Paging Space
Illustration Follow this diagram as you read about the virtual memory concepts.
[Diagram: the effective address spaces of Process 1 and Process 2 map
through the virtual address space to physical memory and to paging
space.]
Page A page is a fixed-size chunk of contiguous storage that is treated as the
basic entity transferred between memory and disk. Pages are kept
separate from each other; they do not overlap in the virtual address
space. AIX 5L uses a fixed page size of 4096 bytes for both Power and
IA64. The smallest unit of memory managed by hardware and software is
one page.
Frame The place in real memory used to hold a page is called a frame. You can
think of the page as the collection of information and the frame as the
place in memory that holds that information.
Address Space Address space is the set of addresses available to a program that it can
use to access memory. This lesson describes three types of address
space:
• Effective address space.
• Virtual address space.
• Physical address space.
Effective Effective addresses are the addresses referenced by the machine
Address instructions of a program or the kernel. The effective address space is the
range of addresses defined by the instruction set, 64 bits on AIX 5L. The
effective address space is mapped to a different physical address space
or disk files for each process. Programs/processes see one contiguous
address space.
Virtual The virtual address space is the set of all memory objects that could be
Address made addressable by the hardware. The virtual address is bigger (has
more address bits) than the effective address. Processes have access to a
limited range of virtual addresses given to them by the kernel.
Physical The physical address space depends on how much memory (memory
Address chips) is on the machine. The physical address space maps one-to-one
with the machine's hardware memory.
Paging space Paging space is a disk area used by the memory manager to hold inactive
memory pages with no other home. In AIX the paging space is mainly used
to hold pages from working storage (process data pages). If a memory
page is not in physical memory it may be loaded from disk; this is called a
page-in. Writing a modified page to disk is called a page-out.
Demand Paging
Introduction AIX is a demand paging system. Physical pages (frames) are not allocated
for virtual pages until they are needed (referenced).
• Data is copied to a physical page only when referenced.
• Paging is done on the fly and is invisible to the user.
• Data comes from:
• A page from the page space.
• A page from a file on disk.
When a virtual address is referenced on a page that has no mapping to a
frame, the mapping is done on the fly and the page frame is loaded from
where it is mapped. The loading is invisible to the user process. Demand
paging saves much of the overhead of creating new processes because
the pages for execution do not have to be loaded unless they are needed.
If a process never uses parts of its virtual space, valuable physical memory
is never spent on them.
Page Faults A page fault occurs when a program tries to access a page that is not
currently in real memory. Memory that has been recently used is kept in
real memory, while memory that has not been recently used is kept aside
in paging space.
For speed, most systems have the mapping of virtual addresses to real
addresses done in the hardware. This mapping is done on a page-by-page
basis. When the hardware finds that there is no mapping to real
memory, it raises a page fault condition. The operating system software
must handle these faults in such a way that the page fault is transparent to
the user program.
Virtual Memory The job of a virtual memory management system is to handle page faults
manager so that they are transparent to the thread using virtual memory addresses.
Pool of A pager daemon attempts to keep a pool of physical pages free. If the
Physical Free number of free pages goes below a low-water mark threshold, the
Pages
pager frees the oldest (referenced furthest back in time) pages until a
high-water mark threshold is reached.
Pageable AIX's kernel is pageable. Only some of the kernel is in physical memory at
Kernel one time. Kernel pages that are not currently being used can be paged
out.
Pinned Pages Some parts of the kernel are required to stay in memory because it is not
possible to perform a page-in while those pieces of code execute. These
pages are said to be pinned. The bottom halves of device drivers
(interrupt processing) are pinned. Only a small part of the kernel is
required to be pinned.
Memory Objects
Introduction A fundamental feature of AIX 5L’s Virtual Memory Manager is the use of
addressable memory objects.
POWER VMM The POWER architecture provides for efficient access to 256MB objects
Design against (segments in POWER terminology) in the global virtual address space.
IA-64 Design
The 256MB objects are also used in the IA-64 VMM implementation;
however, segments are implemented in software instead of hardware.
The terms "segment" and "object" have the same meaning, but keep in
mind that the term "segment" on IA64 should be understood in a software
context.
Working Working objects (also called working storage and working segments) are
Objects temporary segments used during the execution of a program for its stack
and data areas. Process data segments are created by the loader at run
time and are paged in and out of paging space. A working storage
segment keeps track of the amount of paging space allocated to the
pages in the segment. Part of the AIX kernel is also pageable and is part
of working storage.
Persistent The VMM is used for performing I/O operations for file systems. Persistent
Objects objects are used to hold file data for the local file systems. When a
process opens a file, the data pages are paged in. When the contents of
the file change, the page is marked as modified and is eventually paged
out directly to its original disk location. File system reads and writes occur
by attaching the appropriate file system object and performing
loads/stores between the mapped object and the user buffer. File data
pages and program text are both part of persistent storage; however, the
program text pages are read-only pages and are paged in but never
paged out to disk. Persistent pages do not use paging space.
Client Objects Client objects are used for pages of client file systems (all file system
types other than JFS). When remote pages are modified they are marked
and eventually paged out to their original disk location across the network.
Remote program text pages (read-only pages) are paged out to paging
space, from where they can be paged in later if needed.
Log Objects Log objects are used for writing or reading JFS file systems logs during
journalling operations.
Mapping Mapping objects are used to support the mmap() interfaces, which allow
Objects an application to map multiple objects to the same memory segment.
Page Mapping
Introduction This section describes the page mapping functions in the VMM.
VMM Function The main function of the virtual memory manager is to translate effective
addresses to real addresses.
Hardware The exact procedure used by the VMM depends heavily on the hardware
differences processor used by the system. As AIX 5L runs on both Power and IA-64
processors, this lesson describes the process in general terms. More
exact descriptions of address translation can be found in the hardware-
specific lessons.
Diagram This diagram shows the overall relationship among the major AIX data
structures involved in mapping a virtual page to a real page or to paging
space.
[Diagram: the software page frame table, the external page tables (XPT),
and the SID table, with links to paging space and to the file inode in the
filesystem.]
Software Page Software Page Frame Tables (SWPFT) are extensions of the hardware
Frame Table frame table and are used and managed by the VMM software. The
SWPFT contains information associated with a page, such as the page-in
and page-out flags, the free-list flag, and the block number. It also
contains the device information (PDT) used to obtain the proper page
from disk.
Page Faults Page faults occur when the hardware has looked through its page frame
tables but cannot find a real page mapping for a virtual page.
A page fault causes AIX Virtual Memory Manager (VMM) to do the bulk of
its work. It handles the fault by first verifying that the requested page is
valid. If the page is valid, the VMM determines the location of the page,
recovers the page if necessary, and updates the hardware's page frame
table with the location of the page. A faulted page will be recovered from
one of the following locations:
• In physical memory (but not in the hardware PFT).
• On a paging disk (working object)
• On a filesystem object (persistent object)
Protection A protection fault occurs when a page is in memory but the process has
Fault no right to access it.
Introduction The size of the hardware page tables is limited; therefore, the hardware
cannot satisfy all address translation requests. The VMM software must
supplement the hardware tables with software-managed page tables.
Procedure The procedure used for page fault handling when the page is not found in
the hardware-specific tables, but is in physical memory, consists of
several steps detailed in this illustration and the following table.
[Diagram: the virtual page number is looked up in the software page frame
table; the SID table, the external page tables (XPT), paging space, and the
file inode in the filesystem are also shown.]
Procedure (continued)
Note: these steps assume the memory page is in memory, just not in the
hardware page tables.
Step 1: A page fault is generated by the address translation hardware.
The page might be in real memory, just not in the hardware-specific table
due to its size limits.
Step 2: The AIX Virtual Memory Manager first verifies that the requested
page is valid. If the page is not valid, a kernel exception is generated.
Step 3: If the page is valid, the VMM starts looking through the software
PFT for the page. This processing almost duplicates the hardware
processing, but uses the software page tables. The software PFTs are
pinned.
Step 4: If the page is found:
• The hardware-specific table is updated with the real page number for
this page and the process resumes execution.
• No page-in of the page occurs.
It is important to remember that the dispatcher is not run. The faulting
thread just continues execution at the instruction that caused the fault.
PTEGs PowerPC processors hash the PFT into Page Table Entry Groups
(PTEGs), and these groups can hold only 16 page entries each.
Since there may be more than 16 pages that hash into one PTEG, the
VMM has to decide which ones are not in the PTEG. Then, when a page
fault occurs for one of these pages, the VMM only has to reload the PTEG
with the page in question, replacing some other page.
Introduction If the page was not found in real memory, the VMM determines whether it
is in paging space or elsewhere on disk. If the page is in paging space,
the disk block containing the page is located and the page is loaded into a
free memory frame.
Waiting for I/O Copying a page from paging space to an available frame is not a
synchronous process. Any process or thread waiting for a page fault to
be handled is put to sleep until the page is available.
Procedure The procedure for loading a page from paging space is shown in this
illustration and in the table that follows.
[Diagram: the virtual page number is resolved through the segment ID
table; the XPT address and page number lead to the external page tables
(XPT), which supply the disk block number in paging space. The software
page frame table and the file inode in the filesystem are also shown.]
Procedure (continued)
Step 1: The VMM looks up the object ID for this address in the Segment
ID table and gets the External Page Table (XPT) root pointer.
Step 2: The VMM finds the correct XPT direct block from the XPT root.
Step 3: The VMM gets the paging space disk block number from the XPT
direct block.
Step 4: The VMM takes the first available frame from the free frame list
(the free list contains one entry for each free frame of real memory).
Step 5: If the free frame list is empty, the VMM uses an algorithm to select
several active pages to steal.
• If a page to be stolen is modified, an I/O request is issued to write the
contents of the selected page to disk.
• Once written, the frames containing the stolen pages are added to the
free list, and one is selected to hold the page from paging space.
Step 6: The VMM indicates the device and logical block for the page. An
I/O request loads the frame with the data for the faulting page.
Step 7: When the I/O completes, the VMM is notified and the thread
waiting on the frame is awakened.
Step 8: The disk block is loaded from paging space or the file system.
Step 9: The hardware PFT is updated, and the process/thread resumes at
the faulting instruction.
The net effect is that the process or thread has no knowledge that a page
fault occurred, except for a delay in its processing.
External Page The XPT maps a page within a working storage segment to a disk block
Table (XPT) on external storage. The XPT is a two-level tree structure.
The first level of the tree is the XPT root block. The second level consists
of 256 direct blocks. Each word in the root block is a pointer to one of the
direct blocks. Each word of a direct block contains the page state and disk
block information for a single page in the segment.
Each XPT direct block covers 1MB of the 256MB segment.
[Diagram: the pages of a segment mapped to disk blocks in paging
space.]
Paging Space AIX offers two policies for allocating paging space. If the environment
Allocation variable PSALLOC=early, the early allocation policy is used, which
Policy
causes a disk block to be allocated whenever a memory request is
made. This guarantees that the paging space will be available if it is
needed.
If the environment variable is not set, the default late allocation
policy is used and a disk block is not allocated until it becomes necessary
to page out the page. This policy decreases paging space requirements on
large-memory systems which do little paging.
Paging Device The Paging Device Table (PDT) contains an entry for every device
Table (PDT) referenced by the VMM.
It is used for filesystem, paging, log and remote pages.
There is a pending I/O list associated with PDT.
The pending I/O list contains all page frames awaiting I/O for the device.
Page frames are removed from the list as soon as the I/O has been
dispatched to the device.
Introduction Persistent pages do not use the XPT (eXternal Page Table). The VMM
uses the information contained in the file's inode structure to locate the
pages for the file.
Procedure Persistent pages are paged from local files located on a filesystem. A local
file has a segment allocated and has an entry (SID) in the Segment
Information Table. The inode is pointed to by the SID entry, allowing the
VMM to find and page in the faulting block.
[Diagram: for a persistent page the virtual page number is resolved
through the segment ID table to the inode address; the file inode in the
filesystem supplies the disk block number. The external page tables (XPT)
and paging space are not used.]
Filesystem I/O
Introduction The paging functions of the VMM are also used to perform reads and
writes to files by processes.
File system File system reads and writes occur by attaching the appropriate file system
objects object and performing loads/stores between the mapped object and the
user buffer. This means that file objects are not directly addressable in the
current address space but instead are temporarily attached.
A local file has a segment allocated and has an entry (SID) in the Segment
Information Table. The file's gnode records which segment belongs to the
particular file.
Persistent AIX uses a large portion of memory as the filesystem buffer cache. The
pages pages for files compete for storage the same way as other pages. The
VMM schedules modified persistent pages to be written to their original
location on disk when:
• The VMM needs the frame for another page
• The file is closed
• A sync operation is performed
The sync operation can be performed by the syncd daemon running on the
system (by default the syncd daemon runs every 60 seconds), by calling
the sync() function, or by running the sync command. Scheduling does not
mean that the data are written to disk at once.
Introduction To maintain system performance, the VMM always wants some physical
memory to be available for page-ins. This section describes the free
memory list and the algorithms used to keep pages on the list.
Free memory The VMM maintains a linked list containing all the currently free real
list memory pages in the system. When a page fault occurs, VMM just takes
the first page from this list to assign to the faulting page. When the free
frame list is empty and a page fault occurs, VMM selects several active
pages to be stolen (usually around 20 or so), and all these pages are then
added to the free list. This reduces the amount of time spent starting and
running the steal routines.
Page The method used to select a page to be replaced is called the page
Replacement replacement algorithm. The mechanism used to determine which pages
Algorithm
to steal is a pseudo-LRU (Least Recently Used) algorithm called the
clock-hand algorithm. This algorithm is commonly used in operating
systems when the hardware provides only a reference bit for each page in
physical memory. The hardware automatically sets the reference bit for a
page translation whenever the page is referenced. The clock-hand
algorithm checks frames by frame number, looking for pages that have not
been referenced since the last time the algorithm looked at the page. If a
page has been referenced since the last time the algorithm looked at the
frame, the algorithm clears the reference bit and goes on to look at the
next frame. If the page has not been referenced since the last time the
algorithm looked at the frame, the page is stolen.
Clock Hand The algorithm is called the clock-hand algorithm because the algorithm
acts like a clock hand that is constantly pointing at frames in order. The
clock-hand advances whenever the algorithm advances to the next frame.
If a modified page is stolen, the clock-hand algorithm writes the page to
disk (to paging space or a file system) before stealing the page.
[Diagram: the clock hand rotates over the frames in order; frames with
Reference = 0 are eligible to be stolen, while a frame with Reference = 1
has its bit cleared and is passed over.]
vmtune
Introduction A certain number of pages of each type must remain in memory to
maintain system performance. The VMM keeps statistics for each page
type and enforces thresholds in the page replacement algorithm. When
the number of pages of a type approaches a threshold, the page
replacement algorithm selects the proper pages for replacement and
favors the other pages. The VMM takes appropriate action to bring the
state of memory back within bounds.
VMM Tunable The vmtune command changes operational parameters of the Virtual
Parameters Memory Manager controlling the thresholds.
Parameter Description
minfree Page replacement is invoked whenever the number of free page
frames falls below this threshold.
maxfree The page replacement algorithm replaces enough pages so that
this number of frames are free when it completes.
LruBucket Specifies the size (in 4K pages) of the least recently used (lru)
page-replacement bucket size. This is the number of page frames
which will be examined at one time for possible page-outs when a
free frame is needed. A lower number will result in lower latency
when looking for a free frame, but will also result in behavior that is
not as much like a true lru algorithm.
MaxPin Specifies the maximum percentage of real memory that can be
pinned. The default value is 80. If this value is changed, the new
value should ensure that at least 4MB of real memory will be left
unpinned for use by the kernel.
minperm Specifies the point below which file pages are protected from the
re-page algorithm. This value is a percentage of the total real-
memory page frames in the system. The specified value must be
greater than or equal to 5.
MaxPerm Specifies the point above which the page stealing algorithm steals
only file pages. This value is expressed as a percentage of the total
real-memory page frames in the system. The specified value must
be greater than or equal to 5.
MinPgAhead Specifies the number of pages with which sequential read-ahead
starts. This value can range from 0 through 4096. It should be a
power of 2.
MaxPgAhead Specifies the maximum number of pages to be read ahead. This
value can range from 0 through 4096. It should be a power of 2 and
should be greater than or equal to MinPgAhead.
NpsWarn Specifies the number of free paging-space pages at which the
operating system begins sending the SIGDANGER signal to
processes. The default value is 512.
Introduction Not all page and protection faults can be handled by the OS. When a
fault occurs that cannot be handled by the OS, the system will panic and
immediately halt.
Fatal memory In all of the following cases, the VMM bypasses all kernel exception
exceptions handlers and immediately halts the system:
• A page fault occurs in the interrupt environment.
• A page fault occurs with interrupts partially disabled.
• A protection fault occurs while in kernel mode on kernel data.
• The system is out of paging space, or an I/O error occurs on kernel
data.
• An instruction storage exception occurs while in kernel mode.
• A memory exception occurs while in kernel mode without an exception
handler set up.
Introduction Each segment has a unique segment ID in the segment table. There are a
number of important segment types in AIX:
• kernel
• user text
• shared library text
• shared data
• process private
• shared library data
Kernel This segment is described separately for Power and IA-64 in their lessons.
segment
User text The user text segment contains the code of the program. Threads in user
mode have read-only access to the text segment to prevent modification
while the program is running. This protection allows a single copy of a
text segment to be shared by all processes associated with the same
program. For example, if two threads in the system are running the
ls command, the instructions of ls are shared between them.
Shared Library The shared library text segment contains mappings whose addresses are
Text common across all processes. A shared library segment:
• Contains a copy of the program text (instructions) for the shared
libraries currently in use in the system.
• These segments are added to the user address space by the loader
when the first shared library is loaded.
• Each process using text from this segment has a copy of the
corresponding data in the per- process shared library data segment.
Executable modules list the shared libraries they need at exec() time.
The shared library text is loaded into this segment when a module is
loaded via the exec() system call, or a program may issue load() calls
to get additional shared modules.
Per-Process Functions in the shared library may have data that cannot be shared
Shared Library between processes; such data is loaded as process-private data.
Data Segment
• This segment holds items required by modules in the shared text
segment(s).
• There is one of these segments for each process.
• Addresses of data items are generally the same across processes.
• The data itself is not shared.
The shared library data segment acts like an extension of the process
private segment.
Shared data Mapped memory regions, also called shared memory areas, can serve as
a large pool for exchanging data among processes.
Process Process Private Segment is not shared between other processes. The
private process private segment contains:
• user data (for 32-bit programs that aren’t maxdata programs)
• the user stack (for 32-bit programs)
• text and data from explicitly loaded modules (for 32-bit programs)
• kernel per-process data (accessible only in kernel mode)
• primary kernel thread stack (accessible only in kernel mode)
• per-process loader data (accessible only in kernel mode)
Introduction Mapped memory regions, also called shared memory areas, can serve as
a large pool for exchanging data among processes.
• A process can create and/or attach a shared data segment that is
accessible by other processes.
• A shared data segment can represent a single memory object or a
collection of memory objects.
• Shared memory can be attached read-only or read-write.
Benefit Shared memory areas can be most beneficial when the amount of data to
be exchanged between processes is too large to transfer with messages,
or when many processes maintain a common large database.
Shared The shared memory is process based and can be attached at different
memory effective addresses in different processes.
address
[Diagram: process A and process B attach the same real memory through
the VMM at different effective addresses in their effective address
spaces.]
Introduction The shmat services are typically used to create and use shared memory
objects from a program.
shmat Your program can use the following functions to create and manage
functions shared memory segments:
• shmctl() - Controls shared memory operations
• shmget() - Gets or creates a shared memory segment
• shmat() - Attaches a shared memory segment to a process
• shmdt() - Detaches a shared memory segment from a process
• disclaim() - Removes a mapping from a specified address range
within a shared memory segment
Using shmat The shmget() system call is used to create a shared memory region;
when supporting objects larger than 256MB, it creates multiple
segments.
The shmat() system call is used to gain addressability to a shared
memory region.
Limitations Right now shmget() on the 64-bit kernel is limited to 8 segments, even for
64-bit applications. Thus, the largest shared memory region that one can
create is 2GB. This limitation will be removed when a 64-bit application
performs the shmget(); there will then be no explicit limitation other than
what system resources will bear. 32-bit applications will still retain the
2GB limitation.
When to use Use the shmat() services under the following circumstances:
Introduction Shared segments can be used to map any ordinary file directly into
memory.
• Instead of reading and writing the file, the program just loads or
stores in the segment.
• This avoids buffering of the I/O data in the kernel.
• This provides easy random access, as the file data is always available.
• This avoids the system call overhead of read() and write().
• Either shmat() or mmap() system calls can be used
File mapping The system allows file mapping at the user level. This allows a program to
access file data through loads and stores to its virtual address space. This
single level store approach can also greatly improve performance by
creating a form of Direct Memory Access (DMA) file access. Instead of
buffering the data in the kernel and copying the data from kernel to user,
the file data is mapped directly into the user’s address space.
Shared files The file can be shared between multiple processes even if some are
using mapping and others are using the read/write system call interface.
Of course, this may require some sort of synchronization scheme between
the processes.
shmat to map When using shmat to map a file, an open file descriptor is used in
files place of the shared memory ID. Once the file segment is mapped, it is treated
like any other shared segment and can be shared with other processes.
mmap services The mmap services are typically used for mapping files, although they may
be used for creating shared memory segments as well.
• madvise() - Advises the system of a process' expected paging
behavior
• mincore() - Determines residency of memory pages
• mmap() - Maps an object file into virtual memory
• mprotect() - Modifies the access protections of memory mapping
• msync() - Synchronizes a mapped file with its underlying storage
device
• munmap() - Un-maps a mapped memory region
Both the mmap and shmat services provide the capability for multiple
processes to map the same region of an object such that they share
address ability to that object. However, the mmap subroutine extends this
capability beyond that provided by the shmat subroutine by allowing a
relatively unlimited number of such mappings to be established.
Read-Write Read-write mapping allows loads and stores in the segment to behave
Mapping like reads and writes to the corresponding file. If a thread loads beyond
the end of the file, the load will load zero values.
Read-only Read only mapping allows only loads from the segment. The operating
Mapping system generates a SIGSEGV signal if a program attempts an access that
exceeds the access permission given to a memory region. Just as with
read-write access, a thread that loads beyond the end of the file loads zero
values.
Deferred Deferred update mapping also allows loads and stores to the segment to
Update behave like reads and writes to the corresponding file. The difference
Mapping
between this mapping and read-write mapping is that the modifications are
delayed. Any storing into the segment modifies the segment but does not
modify the corresponding file.
With deferred update, the application can begin modifying the file data (by
memory mapped loads and stores) and then either commit the
modifications to the file system (via fsync()) or discard the modifications
completely. This can greatly simplify error recovery and allows the
application to avoid a costly temporary file that might otherwise be required.
Data written to a file that a process has opened for deferred update (with
the O_DEFER flag) is not written to permanent storage until another
process issues an fsync() subroutine against this file or runs a
synchronous write subroutine (with the O_SYNC flag) on this file.
Objectives
After completing this unit, you should be able to:
• List the size of the effective and virtual address space on the IA64
platform.
• Show how regions, region registers, and region IDs are used in AIX
5L.
• Name the region register that is used to identify a process's
private region.
• Given an address, identify the region it belongs to.
References
• Intel IA-64 Architecture Software Developer's Manual
Introduction AIX 5L on the IA-64 platform is designed as a 64-bit kernel. Unlike the
POWER version of AIX 5L, no 32-bit kernel is available. This lesson
describes the address translation mechanism used by AIX 5L on the IA64
platform.
Overview The IA-64 platform provides an effective address space that is 64 bits
wide.
• The effective address space is divided into eight regions.
• Each region has a region register associated with it (rr0 - rr7).
• The region registers, under control of the OS, supply an additional 24
bits of addressing, creating an 85-bit virtual address space.
Regions
Introduction The 64-bit effective address is broken into 8 regions. This section
describes how the regions are addressed.
Region The 64-bit effective address space consists of 8 regions, each region
selector addressed by 61 bits. A region is selected by the upper 3 bits of the
effective address. Each region has a region register associated with it
(rr0 - rr7) that contains a 24-bit Region IDentifier for the region. When
translating effective addresses to virtual addresses, the 24-bit region
identifier is combined with the lower 61 bits of the effective address to
form an 85-bit virtual address.
(Diagram: bits 63-61 of the effective address select the region; bits 60-0
are the 61-bit offset. The selected region register supplies a 24-bit region
ID, so 2^61 * 2^24 = 2^85 bytes of virtual address space.)
Managing The AIX 5L operating system manages the contents of the region registers.
region An address space is made accessible to a process by loading the
registers
proper RID into one of the eight region registers.
Region Registers
Introduction Each region register contains a Region IDentifier (RID) and region
attributes.
Region Register
field description
rv    Reserved
ve    VHPT walker enable:
      1 - VHPT walker is enabled for the region
      0 - VHPT walker is disabled for the region
ps    Preferred page size. Selects the virtual address bits used by the
      hash function for the TLB or VHPT.
rid   24-bit region identifier
Address Translation
Introduction The VMM software in AIX 5L works closely with the hardware to translate
an effective address to an address in physical memory.
VMM hardware This diagram and the table on the next page describe the hardware
components and the process used to perform address translations.
(Diagram: the 3-bit VRN of the effective address selects one of the region
registers rr0 - rr7; the 24-bit region ID it supplies is combined with the
virtual page number to form the virtual address used to search the
TLB/VHPT. A successful match yields the physical page number, which is
concatenated with the page offset to form the physical address.)
Step Action
1 The effective address contains three parts:
• Virtual Region Number (VRN)
• Virtual Page Number (VPN)
• Page Offset
2 The 3 VRN bits are used to select a region register.
3 The region register provides a 24-bit region ID.
4 The region ID and the virtual page number are used to
search for an address translation in the TLB or the
hardware-maintained page tables.
5 If no match is found, a page fault is generated, transferring
control to the OS. The OS must resolve the fault by making
a page available and updating the translation tables.
6 A successful translation produces a physical page number.
This page number is combined with the page offset to
produce a physical address.
32-bit Address 32-bit address translation is done the same way as 64-bit translation.
Translation Unlike POWER, there is no bit in the processor hardware indicating
whether the hardware is working in 32-bit or 64-bit mode.
Introduction The IA-64 model provides for either a single or a multiple address
space model. These models are described in this section.
Single In a single address space model, all processes on the system share a single
Address Space address space. Such a model is possible due to the enormous size of a
(SAS)
64-bit address space as opposed to a 32-bit one. The term single address
space refers to the use of shared regions containing objects mapped at a
unique global address. For such a mapping, a common region ID and page
number are provided.
Multiple In this model each process has a private address space. Not all of the 8
Address Space regions can be used by a process, because the operating system must be
(MAS)
mapped on top of one or more of the regions. Each process private
region has a unique RID associated with it.
Address Space The address space model used by AIX on IA-64 combines attributes of
on IA-64 both MAS (multiple address space) and SAS (single address space).
Region 0 is defined by the operating system to be a process private region.
Each process is assigned a unique RID for that region, which is loaded into
the region register each time the process is dispatched. Therefore region 0
provides what is effectively a MAS model.
All other regions are treated as shared address space (SAS); as such, the
region IDs for those regions are constant and don't need to be changed at
context switch. SAS usage is necessary to achieve the desired degree of
sharing of address translations for shared objects: to achieve a single
translation for an object, all accesses must be made through a common
global address.
Introduction The region identifier (RID), much like the POWER segment identifier (SID),
participates in the hardware address translation such that, in order to share
the same address translation, the same RID must be used. For a process
to share a memory region with another process (or the kernel), the same
RID must be loaded in the region register in both processes' contexts.
Region Usage The following table shows the kernel usage model for the 8 virtual regions
Table
VRN Style   Name    Usage
0   MAS     Private Process data, stack, heap, mmap, ILP32 shared
                    library text, ILP32 main text, u-block, kernel
                    thread stacks/msts
1   SAS/MAS Text    LP64 shared library text, LP64 main text
2   SAS             LP64 shmat
3   SAS             LP64 shmat
4   n/a             Reserved
5   SAS     Temp    Kernel temporary attach, global buffer pool
6   SAS     Kernel2 Kernel global w/large page size
7   SAS     Kernel  Kernel global
ILP32 The address space of a 32-bit program (using the ILP32 instruction set)
runs from 0 to 4GB and is solely contained in region 0.
Private Providing process data, heap, and stack as well as per-process kernel
segment information such as the u-block in a single private segment means that only
that segment needs to be copied across fork (e.g., copy-on-write
semantics).
Memory Protection
Introduction The IA-64 architecture provides two methods for applying protection to a
page:
• Access rights for each translation.
• Protection keys
Protection Protection keys are used to control which processes have access to
Keys individual objects in the single address space to achieve a “shared-by-
some” semantic, such as exists for shmat objects.
There is a special bit in hardware; when this bit is turned on (1),
memory references go through protection key access checks during
address translation.
There are also protection key registers (at least 16); the VMM manages
them and keeps track of which keys each entry holds.
field usage
v    Valid bit. When 1, the register contains a valid key.
wd   Write disable. When 1, write permission is denied.
rd   Read disable. When 1, read permission is denied.
xd   Execute disable. When 1, execute permission is denied.
key  Protection key (18-24 bits)
Process The process of memory access using protection keys is described in this
table.
Step Action
1 During an address translation by the hardware a protection
key is identified for the page being translated.
2 The protection key of the translation is checked against
protection keys found in protection key registers (stored by
the OS).
3 If the match succeeds then protection rights are applied to
the translation. The access can be allowed or not allowed
based on the protection key value.
4 If the access is not allowed, then the protection key
permission fault is raised and control goes to VMM.
5 If no match is found (in step 2), a protection key miss fault is
raised and the VMM inserts the correct protection key into the
protection key registers.
Protection Key An example of protection key usage is described in this illustration and
Example table.
(Diagram: processes A and B both map the same shared object at a
single global address in their virtual address spaces.)
Step Action
1 A shared object is assigned the protection key 0x1.
2 Processes A and B share the object with the following
permissions:
• Process A has read/write access to the object.
• Process B has read-only access to the object.
3 When A is running, the VMM loads the protection key register
with 0x1 and the 'wd' and 'rd' bits cleared. The process can
read and write all pages in the object.
4 When B is running, the VMM loads the protection key register
with 0x1, the 'rd' bit cleared, and the 'wd' bit set. The process
can only read pages in the object.
Access Rights In addition to the protection key mechanism, the IA-64 architecture
provides page protection by associating access and privilege level
information with each translation. However, the majority of the page access
rights support in AIX 5L is in the common code base shared with POWER.
Therefore the software mechanisms for dealing with page protection were
all left as is, so that the upper layers conform to the POWER access rights
mechanisms. These consist of:
• per segment K bits
• POWER style per-page protection bits.
At the low, platform-dependent layer, these POWER-style protections are
translated to the IA-64 hardware format.
Introduction Segments and segment services are used for management of objects both
on POWER and IA64.
Segments on The segment model was originally developed with the POWER hardware
IA64 architecture in mind. A segment can be thought of as a hardware object
on POWER: selection of the segment is made directly by the hardware's
translation of a virtual address. As we have seen, the IA64 hardware
addresses memory by regions. A region is a much larger area of the
virtual address space than a segment. On IA64 the software manages
segments on top of the region model; therefore, on IA64 a segment is a
software object, not a hardware one.
Introduction The layout of the 4GB ILP32 address space is principally the same as that
for POWER 32-bit applications. The motivations for preserving this layout
for IA64 are compatibility and performance.
This table details the segment usage for the ILP32 model:
Big Data Model A big data model is supported for 32-bit applications on POWER. This
allows an application to specify maximum requirements for heap, data, and
stack. Such a model is required for programs which exceed the limits
imposed by the normal 32-bit address space (i.e., a shared 256MB
segment for heap, data, and stack). This model will also be supported on
IA64 for 32-bit applications in future releases.
Exercise
Introduction Complete the following written exercise and the lab exercise on the
following page.
1. What is the size of the effective address space on the IA64 platform?
A. 32 bits
B. 64 bits
C. 84 bits
2. What is the size of the virtual address space on the IA64 platform?
A. 32 bits
B. 64 bits
C. 84 bits
3. One of eight region registers is used for each address translation. How
is the region register selected?
Exercise -- continued
Step Action
1 Log on to your IA64 lab system.
2 su to root and start the iadb utility.
$ su
# iadb
3 Display the thread structure for the current context using the
command:
0> th
5 Look for the field labeled userp; this contains a pointer to
the thread's user area. Examine this address. What region
is this address in?
Lesson Objectives
At the end of the module the student should have gained knowledge
about:
Have an overview of the LVM, and identify LVM components such as:
• Logical volume
• Physical volume
• Mirroring, and parameters for mirroring
• Striping and parameters for striping
Physical disk layout Power
Physical disk layout IA-64
LVM Physical layout including VGDA and VGSA
Know the function of LVM Passive Mirror Write Consistency
Know the function of LVM Hot spare disk
Know the function of LVM Hot spot management
Know the function of LVM Online backup (4.3.3.)
Know the function of LVM Variable logical track group (LTG)
Know the function of each of the High-Level LVM commands
Trace LVM commands with the trace command
Know the function of LVM Library calls
Know briefly about Disk Device Calls
Know briefly about Disk low level Device Calls such as SCSI calls and
SSA
Furthermore it is an objective that the student gains experience from
exercises with the content of this section. The exercises will:
• Examine the physical disk layout of a logical volume and a physical
volume.
• Examine the impact of LVM Passive Mirror Write Consistency
• Examine the function of LVM LTG
• Trace some LVM system activity.
Platform
This lesson is independent of platform.
References
http://w3.austin.ibm.com/:/projects/tteduc/ Technology Transfer Home Page
Introduction The Logical Volume Manager (LVM) is the layer between the operating
system (AIX) and the physical hard drives; the LVM provides reliable data
storage (logical volumes) to the OS. The LVM makes use of the underlying
physical storage, but hides the actual physical drives and drive layout. This
section explains how this is done, how the data can be traced, and which
parameters impact the performance in different scenarios.
Physical disks A disk must be designated as a physical volume and be put into an
available state before AIX can assign it to a volume group. A physical
volume has certain configuration and identification information written on it.
This information includes a physical volume identifier and, for IA-64,
partition information for the disk. When a disk becomes a physical volume,
it is divided into 512-byte physical blocks.
The first time you start up the system after connecting a new disk, AIX
detects the disk and examines it to see if it already has a unique physical
volume identifier in its boot record. If it does, the disk is designated as a
physical volume and a physical volume name (typically, hdiskx where x is a
unique number on the system) is permanently associated with that disk
until you undefine it.
Volume groups The physical volume must become part of a volume group before it can be
utilized by LVM. A volume group is a collection of 1 to 32 physical volumes
of varying sizes and types. A physical volume may belong to only one
volume group. The system will by default allow you to define up to 256
logical volumes per volume group, but the actual number you can define
depends on the total amount of physical storage defined for that volume
group and the size of the logical volumes you define.
There can be up to 255 volume groups per system.
A VG that is created with standard physical and logical volume limits can
be converted to big format which can hold up to 128 PVs and up to 512
more LVs. This operation requires that there be enough free partitions on
every PV in the VG for the Volume group descriptor area (VGDA)
expansion.
MAXPVS: 32 (128 big VG) MAXLVS: 255 (512 big VG)
Physical In the design of LVM, each logical partition maps to one physical partition.
partitions PP And, each physical partition maps to a number of disk sectors. The design
of LVM limits the number of Physical Partitions that LVM can track per disk
to 1016. In most cases, not all the possible 1016 tracking partitions are
used by a disk. The default size of each physical partition during a "mkvg"
command is 4 MB, which implies that individual disks up to 4 GB can be
included in a volume group.
If a disk larger than 4 GB is added to a volume group (based on usage of
the 4 MB size for physical partitions), the disk addition will fail with a
warning message that the physical partition size needs to be increased. There are
two instances where this limitation will be enforced. The first case is when
the user tries to use "mkvg" to create a volume group where the number of
physical partitions on one of the disks in the volume group would exceed
1016. In this case, the user must pick from the
available physical partition size ranges of 1, 2, (4), 8, 16, 32, 64, 128, and
256 megabytes and use the "-s" option to "mkvg". The second case is
where the disk which violates the 1016 limitation is attempting to join a pre-
existing volume group with the "extendvg" command. The user can either
recreate the volume group with a larger physical partition size (which will
allow the new disk to work with the 1016 limitation) or the user can create a
stand-alone volume group (consisting of a larger physical partition size) for
the new disks.
Device driver The figure shows the interfaces to the LVM at different layers. Starting from
hierarchy and the top: the file system (JFS or J2) uses the LVM DD interface to access
interface to
LVM devices LVs; the LVM DD uses the disk DD to access the physical disk, which is
handled by the SCSI DD or the SSA DD depending on the type of disk. There
are also interfaces and commands to manipulate the LVM system: the
high-level commands, such as mklv, are complex commands written as shell
scripts. These scripts use basic LVM commands, such as lcreatelv, which are
AIX binaries that perform the operations. The basic commands are written in
C and use the LVM API liblvm.a to access the LVM.
(Diagram: JFS → LVM DD → disk DD → SCSI DD or SSA DD; high-level
commands → basic commands → liblvm.a → LVM DD.)
VGDA The VGDA is an area at the front of each disk which contains information
description about the volume group, the logical volumes that reside on the volume
group and disks that make up the volume group. For each disk in a volume
group, there exists a VGDA concerning that volume group. This VGDA
area is also used in quorum voting.
The VGDA contains information about what other disks make up the
volume group. This information is what allows the user to just specify one
of the disks in the volume group when they are using the "importvg"
command to import a volume group into an AIX system. The importvg will
go to that disk, read the VGDA and find out what other disks (by PVID)
make up the volume group and automatically import those disks into the
system. The information about neighboring disks can sometimes be useful
in data recovery. For the logical volumes that exist on that disk, the VGDA
gives information about that logical volume so anytime some change is
done to the status of the logical volume (creation, extension, or deletion),
then the VGDA on that disk and the others in the volume group
must be updated.
The VGDA space, which allows for 32 disks, is a fixed size which is part of
the LVM design. Large disks require more management mapping space in
the VGDA, which reduces the number and size of disks that can be added
to the existing volume group. When a disk is added to a volume group, not
only does the new disk get a copy of the updated VGDA, but as mentioned
before, all existing drives in the volume group must be able to accept the
new, updated VGDA.
VGSA The Volume Group Status Area (VGSA) records information on stale
description partitions for mirroring.
The VGSA is comprised of 127 bytes; each bit in those bytes represents
one of the up to 1016 physical partitions that reside on each disk. The bits
of the VGSA are used as a quick bit-mask to determine which physical
partitions, if any, have become stale. This is only important in the case of
mirroring, where there exists more than one copy of the physical partition.
Stale partitions are flagged by the VGSA. Unlike the VGDA, the VGSAs
are specific only to the drives on which they exist. They do not contain
information about the status of partitions on other drives in the same
volume group. The VGSA is also used to determine which physical
partitions must undergo data resyncing when mirror copy resolution is
performed.
BIG VGDA The original design of the VGDA and VGSA limits the number of disks that
Volume Group can be added to a volume group to 32, and the total number of logical
Design
(BigVG) volumes to 256 (including one reserved for LVM internal use). With the
implemented proliferation of disk arrays, the need for increased capacity in a single
in AIX 4.3.2 volume group is growing.
This section describes the requirements for the new big Volume Group
Descriptor Area and Volume Group Status Area, hereafter referred to as
the VGDA and VGSA.
Objectives
• Increase maximum number of disk per VG from 32 to 128
• Increase maximum number of logical volumes per VG to 512
• Provide migration path for small VG to big VG
Changes in commands:
• mkvg
• -B option is added to create big VGs.
• -t If the t flag (factor value) is not used, the default limit of
1016 physical partitions per physical volume will be set. Using
the factor value will change the physical partitions per disk to 1016 *
factor and the total number of disks per VG to 64/factor. A BigVG cannot
be imported/activated on systems with pre-AIX 4.3.2 versions.
• chvg
• -B option added to convert a small VG to the bigVG format. The -B flag
can be used to convert a small VG to the bigVG format. This operation
will expand the VGDA/VGSA to change the total number of disks that
can be added to the volume group from 32 to 64. Once converted,
these volume groups cannot be imported/activated on systems
running pre-AIX 4.3.2 versions. If both the t and B flags are specified,
the factor will be updated first and then the VG is converted to bigVG
format (sequential operation).
LVM Flexibility LVM offers great flexibility for the system administrator and users, such as:
• Real-time Volume Group and Logical Volume expansion/deletion
• Ability to customize data integrity check
• Use of Logical Volume under file system
• Use of Logical Volume as raw data storage
• User customized logical volumes
Real-time Typical UNIX operating systems have static file systems that require the
Volume Group archiving, deletion, and recreation of larger file systems in order for an
and Logical
Volume existing file system to expand. LVM allows the user to add disks to the
expansion / system without bringing the system down and allows the real-time
deletion expansion of the file system through the use of the logical volume. All file
systems exist on top of logical volumes. However, logical volumes can
exist without the presence of a file system. When a file system is created,
the system first creates a logical volume, then places the journaled file
system (jfs) "layer" on top of that logical volume. When a file system is
expanded, the logical volume associated with that file system is first
"grown", then the jfs is "stretched" to match the grown logical volume.
Ability to The user has the ability to control which levels of data integrity checks are
customize data placed in the LVM code in order to tune the system performance. The user
integrity
checks can change the mirror write consistency check, create mirroring, and
change the requirement for quorum in a volume group.
Use of Logical The logical volume is a logical to physical entity which allows the mapping
Volume under of data. The jfs maps files defined in its file system in its own logical way
a file system
and then translates file actions to a logical request. This logical request is
sent to the LVM device driver which converts this logical request into a
physical request. When the LVM device driver sends this physical request
to the disk device driver, it is further translated into another physical
mapping. At this level, LVM does not care about where the data is truly
located on the disk platter. But with this logical to physical abstraction, LVM
provides for the easy expansion of a file system, ease in mirroring data for
a file system, and the performance improvement of file access in certain
LVM configurations.
Use of Logical As stated before, the logical volume can run without the existence of the jfs
Volumes as file system to hold data. Typically, database programs use the "raw" logical
raw data
storage volume as a data "device" or "disk". They use the LVM logical volumes
(rather than the raw disk itself) because LVM allows them to control on
which disks the data resides, allows the flexibility to add disks and "grow" the
logical volume, and gives data integrity with the mirroring of the data via
the logical volume mirroring capability.
User The user can create logical volumes, using a map file, that will allow them
customized to specify the exact disk(s) the logical volume will inhabit and the exact
logical
volumes order on the disk(s) that the logical volume will be created in. This ability
allows the user to tune the creation of their logical volumes for
performance cases.
Write Verify There is a capability in LVM to specify that an extra level of data
LVM setting integrity be assured every time you write data to the disk. This ability is
known as write verify. This capability is given to each logical volume in a
volume group. When you have write verify enabled, every write to a
physical portion of a disk that’s part of a logical volume causes the disk
device driver to issue the Write and Verify SCSI command to the disk. This
means that after each write, the disk will reread the data and do an IOCC
parity check on the data to see if what the platter wrote exactly matched
what the write request buffer contained. This type of extra check
understandably adds more time to the completion length of a write request,
but it adds to the integrity of the system.
Quorum Quorum checking is the voting that goes on between disks in a volume
checking for group to see if a majority of disks exist to form a quorum that will allow the
LVM volume
groups disks in a volume group to become and stay activated. LVM runs many of
its commands and strategies based on having the most current copy of
some data. Thus, it needs a method to compare data on two or more disks
and figure out which one contains the most current information. This gives
rise to the need for a quorum. If a quorum cannot be established
during a varyonvg command, the volume group will not vary on.
Additionally, if a disk dies during normal operation and the loss of the disk
causes volume group quorum to be lost, then the volume group will notify
the user that it is ceasing to allow any more disk i/o to the remaining disks
and enforces this by performing a self varyoffvg. However, the user can
turn off this quorum check and its actions by telling LVM that it always
wants to varyon or stay up regardless of the dependability of the system.
Or, the user can force the varyon of a volume group that doesn’t have
quorum. At this point, the user is responsible for any strange behavior from
that volume group.
Mirroring, and When discussing mirrors in LVM, it is easier to refer to each copy,
parameters for regardless of when it was created, as a copy. The exception to this is when
mirroring
one discusses sequential mirroring. In sequential mirroring, there is a
distinct PRIMARY copy and SECONDARY copies. However, the majority
of mirrors created on AIX systems are of the parallel type. In parallel
mode, there is no PRIMARY or SECONDARY mirror; all copies in a
mirrored set are just referred to as copies, regardless of which one was
created first. Since the user can remove any copy from any disk, at any
time, there can be no ordering of copies.
AIX allows up to three copies of a logical volume and the copies may be in
sequential or parallel arrangements. Mirrors improve the data integrity of a
system by providing more than one source of identical data. With multiple
copies of a logical volume, if one copy cannot provide the data, one or two
secondary copies may be accessed to provide the desired data.
Sequential Sequential vs. parallel mirroring: what good is sequential mirroring?
Mirroring
Parallel In Parallel mirroring, all copies are of equal ordering. Thus, when a read
Mirroring request arrives to the LVM, there is no first or favorite copy that is
accessed for the read. A search is done on the request queues for the
drives which contain the mirror physical partition that is required. The drive
that has the fewest requests is picked as the disk drive which will service
the read request. On write requests, the LVM driver will broadcast to all
drives which have a copy of the physical partition that needs updating.
Only when all write requests return will the write be considered complete
and the write-complete message will be returned to the calling program.
(Diagram: a parallel mirrored write broadcasts a write request to every
drive holding a copy of the partition; the write completes only when all
drives have returned a write acknowledgment.)
Mirror Write Mirror Write Consistency Check (MWCC) is a method of tracking the last
Consistency 62 writes to a mirrored logical volume. If the AIX system crashes, upon
Check
reboot the last 62 writes to mirrors are examined and one of the mirrors is
used as a "source" to synchronize the mirrors (based on the last 62 disk
locations that were written). This "source" is of importance to parallel
mirrored systems. In sequentially mirrored systems, the "source" is always
picked to be the Primary disk. If that disk fails to respond, the next disk in
the sequential ordering will be picked as the "source" copy. There is a
chance that the mirror picked as "source" to correct the other mirrors was
not the one that received the latest write before the system crashed. Thus,
a write that completed on one copy but remained incomplete on another
mirror could be lost.
AIX does not guarantee that the absolute, latest write request completed
before a crash will be there after the system reboots. But, AIX will
guarantee that the parallel mirrors will be consistent with each other. If the
mirrors are consistent with each other, then the user will be able to realize
which writes were considered successful before the system crashed and
which writes will be retried. The point here is not data accuracy, but data
consistency. The use of the Primary mirror copy
as the source disk is the basic reason that sequential mirroring is offered.
Not only is data consistency guaranteed with MWCC, but the use of the
Primary mirror as the source disk increases the chance that all the copies
have the latest write that occurred before the mirrored system crashed.
Ability to detect and correct stale mirror copies
The Volume Group Status Area (VGSA) tracks the status of 1016 physical partitions per disk per volume group. During a read or write, if the LVM device driver detects a failure in fulfilling a request, the VGSA notes the physical partition(s) that failed and marks the partition(s) "stale". When a partition is marked stale, this is logged by AIX error logging, and the LVM device driver will not send further data requests to that stale partition; this avoids wasting time on I/O requests to a partition that most likely will not respond. When the physical problem is corrected, the VGSA tells the mirror-synchronization code which partitions need to be updated so that the mirrors again contain the same data.
LVM Striping
Striping and parameters for striping
Disk striping is the concept of spreading sequential data across more than one disk to improve disk I/O. The theory is that if you have data that is sequential, and you can divide the request into more than one disk I/O, you will reduce the time it takes to get the entire piece of data. The request must be handled so that it is transparent to the user: the user does not know which pieces of the data reside on which disk, and does not see the data until all the disk I/O has completed (in the case of a read) and the data has been reassembled. Since LVM already has the concept of a logical-to-physical mapping built into its design, disk striping is an easy evolution. Striping is described by the "width" of a stripe and the "stripe length". The width is how many disks the sequential data lies across; the stripe length is how many sequential bytes reside on one disk before the data jumps to another disk to continue the sequential information path.
Striping Example
An example shows the benefit of striping. A piece of data stored on the disk is 100 bytes, and the physical cache of the system is only 25 bytes. Thus, it takes four read requests to the same disk to complete the reading of 100 bytes: since the data is all on the same disk, four sequential reads are required.

If this logical volume were created with a stripe width of 4 (how many disks) and a stripe size of 25 (how many consecutive bytes before going to the next disk), then each disk requires only one read request, and the time to gather all 100 bytes is reduced four-fold. However, there is still the bottleneck of having the four independent data disks channel through one adapter card. This can be remedied with the expensive option of putting each disk on an independent adapter card. Note the effect of using striping: the user has now lost the use of three disks that could have been used for other volume groups.
LVM Performance
Performance with disk mirroring
Disk mirroring can improve the read performance of a system, but at a cost to the write performance. Of the two mirroring strategies, parallel and sequential, parallel is the better of the two in terms of disk I/O. In parallel mirroring, when a read request is received, the LVM device driver looks at the queued requests (read and write) and finds the disk with the least number of requests waiting to execute. This is a change from AIX 3.2, where a complex algorithm tried to approximate the disk that would be "closest" to the required data (regardless of how many jobs it had queued up). In AIX 4.1, it was decided that this complex algorithm did not significantly improve the I/O behavior of mirroring, so the complex logic was scrapped. It is easy to see how the new strategy of finding the shortest wait line improves the read time. And with mirroring, two independent requests to two different locations can be issued at the same time without causing disk contention, because the requests will be issued to two independent disks. However, along with the improvement to reads that comes from having multiple identical sources, the LVM disk driver must now perform more writes in order to complete a write request: with mirroring, all disks that make up a mirror are issued write commands, and each disk must complete its write before the LVM device driver considers the write request complete.
Changeable parameters that affect LVM performance
There are a few parameters that the user can change per logical volume which affect the performance of the logical volume in terms of data-access efficiency. From experience, however, many people have different views of how to achieve that efficiency, so no specific "right" recommendation can be given in these notes.
Inter-policy - This comes in two variations, min and max. The two choices tell LVM how the user wishes the logical volume to be spread over the disks in the volume group. With min, LVM is told that the logical volume should be spread over as few disks as possible. The max policy directs LVM to spread the logical volume over as many disks as are defined in the volume group, limited by the "Upper Bound" variable. Some users try to use this variation to form a cheap version of disk striping on systems below AIX 4.1. However, it must be stated that the inter-policy is a "recommendation" to the allocp binary (the partition-allocation routine), not a strict requirement. In certain cases, depending on what is free on a disk, these allocation policies may not be achievable.
Intra-policy - There are five regions on a disk platter defined by the intra-policy: edge, inner-edge, middle, inner-middle, and center. This policy tells the LVM the preferred location of the logical volume on the disk platter. Depending on the value also provided for the inter-policy, this preference may or may not be satisfied by LVM. Many users have different ideas as to which portion of the disk is the "best", so no recommendation is given in these notes.
Mirror write consistency check - As mentioned before, the mirror write consistency check tracks the last 62 distinct writes to physical partitions. If the user turns this off, the path length involved in a disk write is shortened (although only slightly). However, the trade-off may be inconsistent mirrors if the system crashes during a write call.
Write verify - This is turned off by default when a logical volume is created. If it is turned on for a logical volume, additional time accumulates during writes, as the IOCC check is performed for each write to the disk platter.
Physical Connections
Mirroring on different disks - The default for disk mirroring is that the copies should exist on different disks. This is for performance as well as data integrity. With copies residing on different disks, if one disk is extremely busy, then a read request can be completed from the other copy residing on a less busy disk. Although it might seem the cost would be the same for writes, the section "Command tag queuing" should show that writing to two copies on the same disk is worse than writing to two copies on separate disks.
Mirroring across different adapters - Another method to improve disk throughput is to mirror the copies across adapters. This gives you a better chance not only of finding a copy on a disk that is least busy, but also of finding an adapter that is not as busy. LVM does not realize, nor care, whether the two disks reside on the same adapter. If the copies are on the same adapter, there is still the bottleneck of getting your data through the flow of other data coming from other devices sharing the same adapter card. With multiple adapters, the throughput through the adapter channel should improve.
Command tag queuing - This is a feature found only on SCSI-2 devices. In SCSI-1, an adapter may get many requests, but it will only send out one command at a time. Thus, if the SCSI device driver receives three requests for I/O, it buffers the last two requests until the response to the first one comes back; it then picks the next one in line and issues that command. The target device therefore only receives one command at a time. With command tag queuing on SCSI-2 devices, multiple commands may be sent to the same device at once. The two device drivers (disk and SCSI adapter) are capable of determining which command returned and what to do with it. Thus, disk I/O throughput can be improved.
Physical Placement of Logical Partitions
One important ability of LVM is letting the user dictate where on the disk platter the logical volume should be placed. This is done with the map file that can be used with the "mklv" and "mklvcopy" commands. The map file allows the user to assign a distinct physical partition number to a distinct logical partition number. Thus, people with different theories on the optimal layout for data partitions can customize their systems according to their personal preferences.
Performance consideration with Disk Striping
Disk striping was introduced in AIX 4.1; it is another name for the RAID 0 implementation in software. The functionality is based on the assumption that large amounts of data can be retrieved more efficiently if the request is broken up into smaller requests given to multiple disks. And if the multiple disks are on multiple adapters, then the theory works even better, as mentioned in the previous sections on mirroring across different disks and adapters. In those sections, we described the efficiency gained for mirrors; here, the same efficiency is gained with data spread across disks and adapters, but without mirroring. Thus there is a saving on the write case, as compared to mirrors. But there is a slight loss in the read case, as compared to mirrors, because there is no longer more than one copy to read from if one disk is busier than the other.
AIX 4.3.3 and AIX 5 IDs
This section explores the physical disk layout on the Power platform.

There are three identifiers commonly used within LVM: the Physical Volume Identifier (PVID), the Volume Group Identifier (VGID), and the Logical Volume Identifier (LVID). The last two, VGID and LVID, are closely tied: the LVID is simply a dot "." and a minor number appended to the end of the VGID. The VGID is a combination of the machine's unique processor serial number (uname -a) and the date that the volume group was created.

The implementation of LVM has always assumed that the VGID of a system was made up of two 32-bit words. Throughout the code, however, the VGID/LVID is represented with the system data type struct unique_id, which is made up of four 32-bit words; the LVM library, driver, and commands have always assumed or enforced the notion that the last two words, word 3 and word 4 of this structure, are zeroes.

AIX 5 is now changed such that all four 32-bit words are used, for a total of 128 bits or 32 hex digits. The most significant 32 bits are copied from the processor ID, and the remaining 96 bits are the millisecond time stamp at creation time.
AIX 4.3.3
  PVID:  Byte8 Byte7 Byte6 Byte5 Byte4 Byte3 Byte2 Byte1
  VGID:  Byte8 Byte7 Byte6 Byte5 Byte4 Byte3 Byte2 Byte1
           0     0     0     9     0     2     7     7
  LVID:  Byte9 Byte8 Byte7 Byte6 Byte5 Byte4 Byte3 Byte2 Byte1
           0     0     0     9     0     2     7     7    .X

AIX 5
  PVID:  Byte16 Byte15 Byte14 Byte13 Byte12 Byte11 Byte10 Byte9 Byte8 Byte7 Byte6 Byte5 Byte4 Byte3 Byte2 Byte1
  VGID:  Byte16 Byte15 Byte14 Byte13 Byte12 Byte11 Byte10 Byte9 Byte8 Byte7 Byte6 Byte5 Byte4 Byte3 Byte2 Byte1
  LVID:  Byte17 Byte16 Byte15 Byte14 Byte13 Byte12 Byte11 Byte10 Byte9 Byte8 Byte7 Byte6 Byte5 Byte4 Byte3 Byte2 Byte1
Example IDs from AIX 4 and AIX 5L systems, showing how IDs are constructed from the processor ID
The processor ID is 64 bits in AIX 5; the uname function cuts out bits 33 to 47, so that the result is the first word plus the last 16 bits of the last word. The LVID and VGID combine the 64-bit processor ID and a 64-bit time stamp to form an ID. PVIDs are made of the 32-bit processor ID and bits from the timestamp.

Example from an AIX 5 Power system:
PVID hdisk0: 00071483229d06620000000000000000
PVID hdisk1: 00071483b50bbaee0000000000000000
LVID hd1: 0007148300004c00000000e19f7c5aa3.8
LVID hd2: 0007148300004c00000000e19f7c5aa3.5
LVID hd3: 0007148300004c00000000e19f7c5aa3.7
LVID hd4: 0007148300004c00000000e19f7c5aa3.4
VGID rootvg: 0007148300004c00000000e19f7c5aa3
VGID testvg: 0007148300004c00000000e1b50bc8ec
uname -a: 000714834C00
In an AIX 4 system, all the IDs are made of the most significant 32 bits of the processor ID and a 32-bit time stamp.
Physical volume, with logical volume testlv defined
The following example shows a disk dump from sector 0 on a Power system. "Uninitialized" marks data not written by the LVM; sections holding only 00's, and uninitialized sections, are cut out for clarity. The IDs are those listed in the previous section.
000000 ¦ C9 C2 D4 C1 00 00 00 00 00 00 00 00 00 00 00 00
000010 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000070 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000090 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0001F0 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000200 ¦ -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- Uninitialized
000400 ¦ 39 C7 F2 9F 14 87 93 46 00 00 00 00 00 00 00 00
000410 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0005E0 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[The remainder of this dump is illegible in these notes. Its annotations identified, in order: the _LVM record (struct lvm_rec, defined in lvmrec.h) carrying the VGID of testvg, the DEFECT (bad block) list, a second copy of the _LVM record and DEFECT list, the VGSA with its beginning and ending time stamps, and the VGDA.]
21A5F0 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ¦......
21A600 ¦ 74 65 73 74 6C 76 00 00 00 00 00 00 00 00 00 00 ¦testlv
21A610 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ¦......
[The dump of the logical volume control block is likewise illegible. Its annotations identified an "AIX LVCB" for a jfs logical volume named testlv, with creation and last-update time stamps (Tue Sep ...), and a label of "None".]
lvm_rec structure from file /usr/include/lvmrec.h
The structure lvm_rec is used by the LVM routines to define the disk layout. The listing below is reconstructed from the fragments in these notes and the field table that follows; comments are abbreviated.

/* structure which describes the physical volume LVM record */
struct lvm_rec
{
    __long32_t lvm_id;        /* LVM id field which identifies the record */
    struct unique_id vg_id;   /* the volume group id of this physical volume */
    __long32_t lvmarea_len;   /* length of the LVM reserved area */
    __long32_t vgda_len;      /* length of the volume group descriptor area */
    daddr32_t vgda_psn[2];    /* physical sector numbers of the two VGDA copies */
    daddr32_t reloc_psn;      /* PSN of the start of the pool of blocks (located
                                 at the end of the PV) which are reserved for
                                 bad block relocation */
    __long32_t reloc_len;     /* the length in number of sectors of the pool of
                                 bad block relocation blocks */
    short int pv_num;         /* the physical volume number within the volume
                                 group of this physical volume */
    short int pp_size;        /* physical partition size */
    __long32_t vgsa_len;      /* length of the volume group status area */
    daddr32_t vgsa_psn[2];    /* physical sector numbers of the two VGSA copies */
    short int version;        /* the version number of this volume group
                                 descriptor and status area */
    short int vg_type;        /* volume group type */
    int ltg_shift;            /* logical track group size shift */
    char res1[444];           /* reserved */
};
If we use the string "_LVM" we can locate the above structure in the previous disk dump and assign values to the variables:
struct lvm_rec field                 Value
__long32_t lvm_id;                   0x5F4C564D (#define LVM_LVMID, "_LVM")
struct unique_id vg_id;              0007148300004C00000000E1B50BC8EC
__long32_t lvmarea_len;              00001074
__long32_t vgda_len;                 00000832
daddr32_t vgda_psn[2];               00000088 000008C2
daddr32_t reloc_psn;                 00867C2D
__long32_t reloc_len;                00000100
short int pv_num;                    0001
short int pp_size;                   0018
__long32_t vgsa_len;                 00000008
daddr32_t vgsa_psn[2];               00000080 000008BA
int ltg_shift;                       0001
char res1[444];                      Uninitialized
VGSA structure
The header excerpts below are abridged: several fields were elided from these notes, and the #ifdef _KERNEL conditionals select the kernel (fixed 32-bit) or user-space time-stamp representation.

struct vgsa_area {
        /* ... time stamps and per-PV fields elided ... */
        /* Bit per PV */
        /* Stale PP bits */
        uchar stalepp[MAXPVS][VGSA_BT_PV];
        /* ... */
};

struct big_vgsa_area {
#ifdef _KERNEL
        struct timestruc32_t b_tmstamp;   /* Beginning time stamp */
#else
        char e_tmbuf64bit[24];
#endif
        /* ... stale-partition fields elided ... */
#ifdef _KERNEL
        /* ... kernel ending time stamp elided ... */
#else
        struct timestruc_t e_tmstamp;     /* Ending time stamp */
#endif
};
Introduction to AIX 5L on IA-64 and EFI partitioned disks
IA-64 systems have a different design from Power systems; some, if not all, IA-64 systems will use the Extensible Firmware Interface (EFI). EFI has defined a new disk-partitioning scheme to replace the legacy DOS partitioning support.

When booting from a disk device, the EFI firmware utilizes one or more system partitions containing an EFI file system (FAT32) to locate EFI applications and drivers, including the OS boot loader. These applications and drivers provide ways to extend the firmware or to assist the operating system during boot time or runtime. In addition, it is expected that operating systems will define partitions unique to the operating system. EFI applications will also have the capability to display, and potentially create, additional partitions before the OS is booted.
AIX traditionally has not supported partitioned disks, because AIX was the only OS running on RS/6000 systems. Therefore the entire disk is defined by an hdisk ODM object and a /dev/hdiskn special file, with a single major and minor number assigned to the physical disk. In AIX 4.3.3, when a disk becomes a physical volume (acquiring a PVID), an old-style MBR (master boot record), renamed the IPL control block, which contains the PVID, is written into the first sector of the disk.
The overall design for disk partitioning on AIX 5L on IA-64 is to introduce disk partitioning at the disk-driver level. An hdisk ODM object will still refer to the physical disk; however, multiple special files will be created and associated with the partitions on the disk. Besides the EFI system partitions, the AIX 5L on IA-64 disk configure method will recognize IA-64 physical volume partitions.

AIX 5L on IA-64 supports a maximum of 4 partitions; of these, one partition can be a physical volume partition, and the other partitions are EFI system partitions. Therefore only one AIX PV, and hence one volume group, can be defined per physical disk.
A new command, efdisk, acts as a partition manager.

Special files will be created for the following partition types:
• Entire physical disk n access (used by efdisk): /dev/hdiskn_all
• System partition index y on physical disk n: /dev/hdiskn_sy
• Physical volume partition on physical disk n: /dev/hdiskn
• Unknown partition index x on physical disk n: /dev/hdiskn_px
Creating new partitions on an IA-64 system
AIX 5L on IA-64 will partition disks under the following circumstances:
• Under the direction of the user/administrator via the efdisk command
• During BOS install, after the designation of a "boot" disk (install target)
• When adding a disk that is not yet a physical volume to a VG
• Under the direction of the "chdev -l hdiskx -a pv=yes" command
The disk system after a default installation
After installing AIX 5L on a system with one disk, the physical drive and the /dev special files can be listed:

lsdev -Cc disk
hdisk0 Available 00-19-10 Other IDE Disk Drive

The EFI system partition holds hardware information and EFI firmware data; the partition is DOS formatted and can be accessed through DOS utilities.
Creating partitions with efdisk
After creating four partitions, we can list the starting block number and length of each with the efdisk command:
------------------------------------------------------
Partition Index: 0
Partition Type: Physical Volume
StartingLBA: 1 (0x1)
Number of blocks: 819200 blocks (0xc8000)
Partition Index: 1
Partition Type: System Partition
StartingLBA: 819201 (0xc8001)
Number of blocks: 409600 blocks (0x64000)
Partition Index: 2
Partition Type: System Partition
StartingLBA: 1228801 (0x12c001)
Number of blocks: 614400 blocks (0x96000)
Partition Index: 3
Partition Type: System Partition
StartingLBA: 1843201 (0x1c2001)
Number of blocks: 614400 blocks (0x96000)
Disk layout on IA-64 systems
The following disk dump lists the data in hex format; the six leftmost digits are the byte offset from the physical start of the disk, and each line lists 16 bytes. The data was read on an IBM Power system with the same utility as in the previous examples; where byte swapping is mentioned, it is relative to what the data would look like on a disk connected to an AIX Power system.
000010 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
length = 0xc8000
0001D0 ¦ FF FF EF FF FF FF 01 80 0C 00 00 40 06 00 00 FF -start LBA = 0x0c8001
length = 0x064000
length = 0x096000
000200 ¦ C1 D4 C2 C9 00 00 00 00 00 00 00 00 00 00 00 00 - AMBI in ebcdic = IBMA byte swapped
[The remainder of this dump is illegible in these notes. Its annotations identified: "MVL_" (the _LVM magic, byte-swapped), the lvm_rec struct at an offset from where it sits in the Power dump (the data in the partition is placed as it would be at the start of a PV), the DEFECT list, the VGSA beginning and ending time stamps, the VGDA start time stamp, and the VGID for iavg.]
For reference, the PVID of the disk and the LVIDs and VGID of the iavg volume group were listed at this point. [The hex values are illegible in these notes.]
AIX 5L Passive Mirror Write Consistency Check
The previous Mirror Write Consistency Check (MWCC) algorithm has been in place since AIX 3.1. This original design has served the Logical Volume Manager (LVM) well, but it has always slowed the performance of mirrored logical volumes that perform massive and varied writes. A new design is implemented in AIX 5 to supplement the original MWCC design.
AIX 4 MWCC algorithm
The AIX 4 MWCC method uses a table called the mwc table. This table is kept in memory as well as on the disk platter. The table has 62 entries, and the entries track the last 62 distinct Logical Track Group (LTG) writes. An LTG is 128 kilobytes. The mwc table is concerned only with writes, not reads. The algorithm can be expressed in pseudo-code:
if (action is a write)
{
    if (LTG to be written is already in the mwc table array in memory)
    {
        proceed and issue the write to the mirrors
        wait until all mirrored writes complete
        return to calling process
    }
    else
    {
        update the mwc table with this latest LTG number, overwriting
            the oldest LTG entry in the mwc table (in memory)
        write the in-memory mwc table to the edge of the platter of all
            disks in the volume group
        wait for the mwc table writes to complete - when the mwc table
            write of the disk that holds the LTG in question returns,
            the mwc table write is considered complete
        issue the parallel mirror writes to all the mirrors
        wait until all mirrored writes complete
        return to calling process
    }
}
else
    process the read
MWCC usage for recovery
The reason for having MWCC is recovery from a crash while I/O is proceeding on a mirrored logical volume. By implication, this means that MWCC is ignored for non-mirrored logical volumes. A key phrase is data "in flight", which means that a write has been issued to a disk and the write order has not come back from the disk with a confirmation that the action is complete; thus, there is no certainty that the data did in fact get written to the disk. MWCC tracks the last 62 write orders so that upon reboot, this table can be used to rewrite the last 62 mirror writes. It is more than likely that all of those writes finished before the system crash; nevertheless, LVM goes to each of the 62 distinct LTGs, reads one copy of the mirror, and writes it to the other mirror(s) that exist. Note that MWCC does not guarantee that the absolute latest write is made available to the user. MWCC just guarantees that the images on the mirrors are consistent (identical).
AIX 4 MWCC performance implications
The current MWCC algorithm carries a penalty for heavy random writes: there is a performance sag associated with doing an extra write for each write you perform. A good example, taken from a customer, is a mail server with mirrored accounts. Thousands of users were constantly writing or deleting files in their mail accounts, so the LTG entries were constantly being changed and written to disk. In addition to that overhead, if the mwc table has been dispatched to be written, new requests that come into the LVM work queue are held until the mwc table write returns, so that the table can be updated and once more sent down to the disk platters.
Current AIX 4 MWCC workaround
Currently, the only way customers can work around the performance penalty associated with MWCC is to turn the functionality off. But in order to ensure data consistency, they must then run syncvg -f <vgname> immediately after a system crash and reboot to synchronize the data. Since there is no mwc table on the platter, there is no way to determine which LTGs need resyncing, so a forced resync of ALL partitions is required. Omitting this synchronization may leave inconsistent data.
AIX 5 LVM Passive Mirror Write Consistency Check
The MWCC implementation in AIX 5 provides a new passive algorithm, but only for big VGs. The reason for this is that space is needed for a dirty flag for each logical volume, and only the VGSA of big VGs provides this space.
AIX 5 Passive MWCC algorithm
The new MWCC algorithm sets a flag when the mirrored LV is opened in read-write mode, and the flag is not cleared until the last close on the device. The flag is then examined during subsequent boots. The algorithm implemented is:
1. The user opens a mirrored logical volume.
2. The LVM driver marks a bit in the VGDA which states that, for purposes of passive MWCC, the LV is "dirty".
3. Reads and writes occur to the mirrored LV with no (traditional) mwc table writes.
4. The machine crashes.
5. Upon reboot, the volume group automatically varies on. As part of this varyonvg, checks are made to see whether a dirty bit exists for each LV.
6. For each logical volume that is dirty, a "syncvg -f -l <lvname>" is performed, regardless of whether or not the user wants to do this.
Advantage:
The behavior of a mirrored write will be the same as that of a mirrored logical volume with no MWCC. Since crashes are very rare, the need for an MWCC resync is negligible; thus a mostly unnecessary write (the mwc table update) is avoided.

Disadvantage:
After a crash, the entire logical volume is considered dirty, although only a few blocks may have changed. Until all the partitions have been resynced, the logical volume will always be considered dirty while it is open. Additionally, reads will be a bit slower, as a read-then-sync operation must be performed.
Commands affected by the Passive MWCC algorithm
The varyonvg command will inform the user that a background forced sync may be occurring as part of passive MWCC recovery.

The syncvg command will inform the user that a non-forced sync on a logical volume with passive MWCC will result in a forced background sync.

The lslv command has been altered such that the output shows whether passive MWCC is set and active.

To set passive sync:
• mklv -w p = use the passive MWCC algorithm
• chlv -w p = use the passive MWCC algorithm

hd_ioctl: this will return additional status and tell the user whether the logical volume is currently marked as needing to undergo, or is actually undergoing, passive MWCC recovery (all reads result in a resync of the mirrors).

Changes in hdpin.exp: export the call hd_sa_update so that hd_top can update the VGSA as well with the modified lv_dirty_bit as a result of hd_open or hd_close.
AIX 5 Hot Spare disk: chpv command
chpv [-h HotSpare] ... existing flags ... PhysicalVolume

-h hotspare
Sets the sparing characteristics of the physical volume, so that the physical volume can be used as a hot spare, and sets the allocation permission for physical partitions on the physical volume specified by the PhysicalVolume parameter. This flag has no meaning for non-mirrored logical volumes. The HotSpare variable can be either:
• y
  Marks the disk as a hot spare disk within the VG it belongs to and prohibits the allocation of physical partitions on the physical volume. The disk must not have any partitions allocated to logical volumes to be successfully marked as a hot spare disk.
• n
  Removes the disk from the hot spare pool of the volume group in which it resides and allows allocation of physical partitions on the physical volume.
AIX 5 Hot Spare disk: chvg command
chvg [-s Sync] [-h HotSpare] ... existing flags ... VolumeGroup

-h hotspare
Sets the sparing characteristics for the volume group specified by the VolumeGroup parameter: it either allows the automatic migration of failed disks or prohibits it. This flag has no meaning for non-mirrored logical volumes.
• y
  Allows the automatic migration of failed disks, using one-for-one migration of partitions from one failed disk to one spare disk. The smallest disk in the volume group spare pool that is big enough for a one-for-one migration will be used.
• Y
  Allows the automatic migration of failed disks, potentially using the entire pool of spare disks to migrate to, as opposed to a one-for-one migration of partitions to a single spare.
• n
  Prohibits the automatic migration of failed disks. This is the default value for a volume group.
• r
  Removes all disks from the hot spare pool of the volume group.

-s sync
Sets the synchronization characteristics for the volume group specified by the VolumeGroup parameter: it either allows the automatic synchronization of stale partitions or prohibits it. This flag has no meaning for non-mirrored logical volumes.
• y
  Attempts to automatically synchronize stale partitions.
• n
  Prohibits automatic synchronization of stale partitions. This is the default for a volume group.

• lsvg -p will show the status of all physical volumes in the VG.
• lsvg will show the current state of sparing and synchronization.
• lspv will show whether a disk is a spare.
AIX 5 LVM Hot Spot Management
This facility provides tools to determine which logical partitions have high I/O traffic and allows the migration of those logical partitions to other disks. Its benefits are:
• Improved performance, by eliminating hot spots.
• The ability to migrate particular logical partitions for maintenance.

The lvmstat command generates two types of report: per-logical-partition statistics within a logical volume, and per-logical-volume statistics within a volume group. The reports have the following format:
# lvmstat -l hd3
Log_part mirror# iocnt Kb_read Kb_wrtn Kbps
1 1 0 0 0 0.00
2 1 0 0 0 0.00
3 1 0 0 0 0.00
# lvmstat -v rootvg
Logical Volume iocnt Kb_read Kb_wrtn Kbps
hd2 1592 5620 880 0.05
hd9var 71 32 28 0.00
hd8 71 0 284 0.00
hd4 13 8 60 0.00
hd1 11 1 21 0.00
Examples
To move the first logical partition of logical volume lv00 to hdisk1, type:
migratelp lv00/1 hdisk1
To move the second mirror copy of the third logical partition of logical volume hd2 to hdisk5, type:
migratelp hd2/3/2 hdisk5
Splitting and reintegrating a mirror
For a long time it has been a desire to be able to make online backups; especially in installations with mirrored volumes, it has been a requested feature to be able to split the mirror and use one side of it for online backups. It has long been possible to do a manual split and later reintegration, but it has been rather complicated and therefore unsafe. In AIX 4.3.3 this feature has been made available with an easy command interface.

A mirrored LV can be divided with the chfs command; in the example, the LV mounted on /testfs is split and copy number 3 is mounted at /backup.
AIX 5 introduces Variable LTG size to improve disk performance
Today the Logical Volume Manager (LVM) shipped with all versions of AIX
has a constant maximum transfer size of 128K, also known within LVM as
the Logical Track Group (LTG). All I/O within LVM must be on a Logical
Track Group boundary. When AIX was first released, all disks supported
128K. Today many disks go beyond 128K, and the efficiency of many disks,
such as RAID arrays, is impacted if the I/O is not a multiple of the stripe
size, and the stripe size is normally larger than 128K.
Find out what is the root cause of the error
The first question to be asked is whether this problem is really in the LVM
layer. The sections that detail how an I/O request is handed down from layer
to layer might help clarify all the sections that must be considered. The most
important initial determination is whether the problem is above the LVM
layer, in the LVM layer, or below the LVM layer. For instance, an application
program such as Oracle or HACMP/6000 that accesses the LVM directly
might have a problem. If you can determine what actions these failing
programs are attempting on the LVM, then try to recreate the action by
hand using a method that is not based on those application programs. If
your attempt by hand works, then the focus of the problem shifts "up" to
the application program. Obviously, if it fails, then you have isolated the
problem at the LVM layer (or below). Or, the problem could simply be corruption of
the data needed by LVM; the programs are behaving correctly, but data
needed by LVM is corrupted which is causing LVM to behave strangely. An
additional bonus for the field investigator is the fact that most high-level
commands are shell scripts. Thus, investigators familiar with shell
programming can turn on shell tracing and watch the execution of
the shell commands to observe the failure point. This might add
helpful information to the problem record. Finally, if
there is corruption or loss of data required by LVM (such as a disk
accidentally erased from a volume group), it helps to find the exact steps
performed (or even not performed) by the user so that the investigator can
deduce the state of the system and what useful LVM information is left
behind.
Can this problem be distilled to the simplest case?
Many problem reports from the field to Level 3 concerning LVM are
difficult to investigate because clarification is required to determine the
root cause of the problem, or because the problem is described in terms of
a complex user configuration. If possible, the most basic action of the LVM
is the one that should be investigated. This is not always possible, as some
problems may only be exposed when running in a complex environment.
However, whenever possible one should try to distill the case into how an
action on a logical volume is causing the system to misbehave. In that
clarification, a non-LVM root cause may be discovered instead.
What has been lost and what has been left behind?
This type of question is typically asked of the system when some sort of
accident has resulted in data corruption or loss of LVM-required
information. Given the state of the system before the corruption, the steps
that most likely caused the corruption, and the current state of the
machine, one can deduce what is left to work with. Sometimes one will
receive conflicting information. This is because part of the ODM disagrees
with part of the VGDA. The ODM is the one that is easily alterable
(compared to the VGDA).
Is this situation repairable?
Sometimes you have enough information to know what is missing and
what should be done to repair the system. However, the design of the ODM,
the system configurator, and LVM prevents the repair. By fixing one
problem, another is spawned. And, one is caught in a deadlock situation
that cannot be fixed unless one wrote very specific kernel code to repair
the internal aspects of the LVM (most likely the VGDA). This is not a trivial
solution, but it is possible. It is only through experience that a judgement
can be made if recovery can be attempted.
Although this might seem a trivial step, when you attempt problem
recovery, most of the time you must alter or destroy an important internal
structure within the LVM (such as the VGDA). Once this is done, if the
recovery attempt didn't work, the user's system is usually in worse shape
than before the recovery attempt. Many users will decline the recovery
attempt once this warning is given. However, it is better to warn them
ahead of time!
Gather all While the volume group is still partially accessible, gather all possible data
possible data about the current volume group. The VGDA will provide information about
missing logical volumes, which will be important. Once the recovery
procedure starts, important reference information such as that gathered
from the VGDA will be lost for good. And if your information is incomplete,
then you may be stuck with nowhere to go.
Save off what Before starting the recovery, make a copy of files that can be restored in
can be saved case something goes wrong. A good example would be something like the
ODM database files that reside in /etc/objrepos. Sometimes the recovery
steps involve deleting information from those databases; once deleted, if
one is unsure of their form, one cannot recreate some of the structures or
values.
Each case is different, so each solution must be
Since each LVM problem is most likely going to be unique to that system,
these notes cannot provide a list of steps one would take in a repair. Once
again, the recovery steps must be based on individual experience with
LVM. The LVM lab exercise on recovery provides a glimpse of the
complexity and information required to repair a system. However, this lab
is just an example, not a template of how all fixes should be attempted.
List of Logical The library of LVM subroutines is a main component of the Logical Volume
Volume Manager.
Subroutines
LVM subroutines define and maintain the logical and physical volumes of a
volume group. They are used by the system management commands to
perform system management for the logical and physical volumes of a
system. The programming interface for the library of LVM subroutines is
available to anyone who wishes to provide alternatives to or expand the
function of the system management commands for logical volumes.
Note: The LVM subroutines use the sysconfig system call, which requires
root user authority, to query and update kernel data structures describing a
volume group. You must have root user authority to use the services of the
LVM subroutine library.
LVM logical The Logical Volume Device Driver (LVDD) is a pseudo-device driver that
volume device operates on logical volumes through the /dev/lvn special file. Like the
driver
physical disk device driver, this pseudo-device driver provides character
and block entry points with compatible arguments. Each volume group has
an entry in the kernel device switch table. Each entry contains entry points
for the device driver and a pointer to the volume group data structure. The
logical volumes of a volume group are distinguished by their minor device
numbers.
• Attention: Each logical volume has a control block located in the first
512 bytes. Data begins in the second 512-byte block. Care must be
taken when reading and writing directly to the logical volume, because
the control block is not protected from writes. If the control block is
overwritten, commands that use it can no longer be used.
Character I/O requests are performed by issuing a read or write request on
a /dev/rlvn character special file for a logical volume. The read or write is
processed by the file system SVC handler, which calls the LVDD ddread or
ddwrite entry point. The ddread or ddwrite entry point transforms the
character request into a block request. This is done by building a buffer for
the request and calling the LVDD ddstrategy entry point.
Block I/O requests are performed by issuing a read or write on a block
special file /dev/lvn for a logical volume. These requests go through the
SVC handler to the bread or bwrite block I/O kernel services. These
services build buffers for the request and call the LVDD ddstrategy entry
point. The LVDD ddstrategy entry point then translates the logical address
to a physical address (handling bad block relocation and mirroring) and
calls the appropriate physical disk device driver.
On completion of the I/O, the physical disk device driver calls the iodone
kernel service on the device interrupt level. This service then calls the
LVDD I/O completion-handling routine. Once this is completed, the LVDD
calls the iodone service again to notify the requester that the I/O is
completed.
The LVDD is logically split into top and bottom halves. The top half
contains the ddopen, ddclose, ddread, ddwrite, ddioctl, and ddconfig entry
points. The bottom half contains the ddstrategy entry point, which contains
block read and write code. This is done to isolate the code that must run
fully pinned and has no access to user process context. The bottom half of
the device driver runs on interrupt levels and is not permitted to page fault.
The top half runs in the context
of a process address space and can page fault.
scsidisk, SCSI This driver supports the small computer system interface (SCSI) and the
Disk Device Fibre Channel Protocol for SCSI (FCP) fixed disk, CD-ROM (compact disk
Driver
read only memory), and read/write optical (optical memory) devices.
Syntax
#include <sys/devinfo.h>
#include <sys/scsi.h>
#include <sys/scdisk.h>
Device-Dependent Subroutines
Typical fixed disk, CD-ROM, and read/write optical drive operations are
implemented using the open, close, read, write, and ioctl subroutines.
rhdisk Special File Provides raw I/O access to the physical volumes (fixed-
disk) device driver.
The rhdisk special file provides raw I/O access and control functions to
physical-disk device drivers for physical disks. Raw I/O access is provided
through the /dev/rhdisk0, /dev/rhdisk1, ..., character special files.
SCSI Adapter Device Driver
The SCSI adapter device driver has access to the physical disk (if a SCSI
disk). The driver supports data transfers via read and write, and control
commands via ioctl calls. The disk device driver uses the adapter device
driver to access and control the physical storage device.
Syntax
#include <sys/scsi.h>
#include <sys/devinfo.h>
Description
The /dev/scsin and /dev/vscsin special files provide interfaces for access
for both initiator and target mode device instances. The host adapter is an
initiator for access to devices such as disks, tapes, and CD-ROMs. The
adapter is a target when accessed from devices such as computer
systems, or other devices that can act as SCSI initiators.
Exercises
Examine the physical disk layout of a logical volume and a physical volume.
Use a tool such as edhx, hexit, dd or another to look at a physical volume.
Identify the PVID, the VGID, and the LVM structure.
Hint: which device should you use to access these data? It may be easier to
copy data from the drive to a file with the dd command:
dd if=/dev/xxx of=/tmp/Myfile bs=1024k count=<number of MB>
Use another device to look at the logical volume. Does the data match that
from the physical device?
Examine the impact of LVM Passive Mirror Write Consistency
This exercise will look at the performance impact of enabling and disabling
MWC. To do this we need a reproducible write load. One way to get this is
to write a C program to create the load. Remember, the file has to be really
big to exceed the cache size, or force a sync to occur before terminating.
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

void writetstfile()
{
    char buffer[512];
    char *filename = "/test/a_large_file";
    register int i;
    int fildes;

    memset(buffer, 'x', sizeof(buffer)); /* fill the write pattern */
    if ((fildes = creat(filename, 0640)) < 0) {
        printf("cannot create file\n");
        exit(1);
    }
    close(fildes);
    if ((fildes = open(filename, O_WRONLY)) < 0) {
        printf("cannot open file for write\n");
        exit(1);
    }
    /* write 512MB (adjust to exceed the machine's cache size) */
    for (i = 0; i < 1024 * 1024; i++)
        write(fildes, buffer, sizeof(buffer));
    fsync(fildes); /* force the data to disk before terminating */
    close(fildes);
}
Exercises -- continued
Examine the function of LVM LTG
The LTG is the LVM Logical Track Group, the amount of data read from or
written to the disk in each operation. Try to monitor the data rate and the
number of disk transactions per second during I/O. Both can be monitored
with the iostat command.
Test the split mirror facility
Test the “Splitting and reintegrating” facility of a mirror. First create a
mirrored LV and write data to it. Then split the mirror and access data from
both sides. Change data on the “primary side”, and then reintegrate the
mirror. What happens?
How fast are the mirrors reintegrated?
Are they really synchronized?
Exercise: Trace LVM system activity
In this exercise we will use the trace command to monitor LVM activity.
Start, stop, and list the results from an LVM trace with the trace, trcstop,
and trcrpt commands.
Try to unmount a filesystem, mount the filesystem again, create a file, and
write data into the file to create some activity in the LVM trace file.
Objectives
After completing this unit, you should be able to:
• List the difference between the terms aggregate and fileset.
• Identify the various data structures that make up the JFS2
filesystem.
• Use the fsdb command to trace the various data structures that
make up the logical and virtual file system.
Numbers The following table lists some general information about JFS2:
Function                        Value
Block size                      512 - 4096 bytes, configurable
Architectural max. file size    4 petabytes
Max. file size tested           1 terabyte
Max. file system size           1 terabyte
Number of inodes                Dynamic, limited by disk space
Directory organization          B-tree
Aggregate
Introduction The term aggregate is defined in this section. The layout of a JFS2
aggregate is described.
Definitions JFS2 separates the notion of a disk space allocation pool, called an
aggregate, from the notion of a mountable file system sub-tree, called a
fileset. The rules that define aggregates and filesets in JFS2 are:
• There is exactly one aggregate per logical volume.
• There may be multiple filesets per aggregate.
• In the first release of AIX 5L, only one fileset per aggregate is
supported.
• The meta-data has been designed to support multiple filesets, and this
feature may be introduced in a future release of AIX 5.
The terms aggregate and fileset in this document correspond to their DCE/
DFS (Distributed Computing Environment Distributed File System) usage.
Aggregate An aggregate has a fixed block size (number of bytes per block) that is
block size defined at configuration time. The aggregate block size defines the
smallest unit of space allocation supported on the aggregate. The block
size cannot be altered, and must be no smaller than the physical block size
(currently 512 bytes). Legal aggregate block sizes are:
• 512 bytes
• 1024 bytes
• 2048 bytes
• 4096 bytes.
Do not confuse the aggregate block size with the logical volume block size,
which defines the smallest unit of I/O.
Aggregate -- continued
Aggregate layout
The following diagram and table detail the layout of the aggregate.
Note: the aggregate block size is 1KB (one aggregate block) in this example.
[Diagram: the aggregate starts with a 32KB reserved area (blocks 0-31),
followed by the primary aggregate superblock, the Aggregate Inode Table
(32 inodes in 16KB, with a control page and IAG), and the secondary
aggregate superblock. Sample aggregate inodes are shown: inode #1
(“self”, size 8192, whose first extent addresses the Aggregate Inode
Allocation Map), inode #2 (block map, size 16384), and inode #16
(fileset 0, size 12288), each owned by root with permissions -rwx------.]
Part Function
Reserved area A 32K area at the front not used by JFS2. The first
block is used by the LVM.
Primary The primary aggregate superblock (defined as a
aggregate struct superblock) contains aggregate-wide
superblock information such as the:
• size of the aggregate
• size of allocation groups
• aggregate block size
The superblocks are at fixed locations, which
allows them always to be found without
depending on any other information.
Secondary The secondary aggregate superblock is a direct
aggregate copy of the primary aggregate superblock. The
superblock secondary aggregate superblock is used if the
primary aggregate superblock is corrupted.
Aggregate -- continued
Part Function
Aggregate inode Contains inodes that describe the aggregate-wide
table control structures; these inodes are described
below.
Secondary Contains replicated inodes from the Aggregate
aggregate inode Inode Table. Since the inodes in the Aggregate
table Inode Table are critical for finding any file system
information they will each be replicated in the
Secondary Aggregate Inode Table. The actual
data for the inodes will not be repeated, just the
addressing structures used to find the data and
the inode itself.
Aggregate inode Describes the Aggregate Inode Table. It contains
allocation map allocation state information on the aggregate
inodes as well as their on-disk location.
Secondary Describes the Secondary Aggregate Inode Table.
aggregate inode
allocation map
Block allocation Describes the control structures for allocating and
map freeing aggregate disk blocks within the
aggregate. The Block Allocation Map maps one-
to-one with the aggregate disk blocks.
fsck working Provides space for fsck to be able to track the
space aggregate block allocations. This space is
necessary - for a very large aggregate there might
not be enough memory to track this information in
memory when fsck is run. The space is described
by the superblock. One bit is needed for every
aggregate block. The fsck working space always
exists at the end of the aggregate.
In-line log Provides space for logging of the meta-data
changes of the aggregate. The space is described
by the superblock. The in-line log always exists
following the fsck working space.
Aggregate -- continued
Aggregate Inodes
When the aggregate is initially created, the first inode extent is allocated;
additional inode extents are allocated and de-allocated dynamically as
needed. These aggregate inodes each describe certain aspects of the
aggregate itself, as follows:
Inode # Description
0 Reserved
1 Called the “self” inode, this inode describes the aggregate
disk blocks comprising the aggregate inode map. This is a
circular representation, in that aggregate inode one is itself
in the file that it describes. The obvious circular
representation problem is handled by forcing at least the
first aggregate inode extent to appear at a well-known
location, namely, 4K after the Primary Aggregate
Superblock. Therefore, JFS2 can easily find Aggregate
Inode one, and from there it can find the rest of the
Aggregate Inode table by following the B+–tree in inode one
2 Describes the Block Allocation Map.
3 Describes the In-line Log when mounted. This inode is
allocated but no data is saved to disk.
4 - 15 Reserved for future extensions.
16 - Starting at aggregate inode 16 there is one inode per fileset,
the Fileset Allocation Map Inode. These inodes describe the
control structures that represent each fileset. As additional
filesets are added to the aggregate, the aggregate inode
table itself may have to grow to accommodate additional
fileset inodes
Allocation Groups
Introduction Allocation Groups (AG) divide the space on an aggregate into chunks, and
allow JFS2 resource allocation policies to use well known methods for
achieving good JFS2 I/O performance.
Allocations When locating data on the disk JFS2 will attempt to:
policies
• Group disk blocks for related data and inodes close together.
• Distribute unrelated data throughout the aggregate.
Allocation Group Sizes
Allocation group sizes must be selected to yield allocation groups that
are sufficiently large to provide for contiguous resource allocation over
time. The allocation group size is stored in the aggregate superblock. The
rules for setting the allocation group size are:
• The maximum number of allocation groups per aggregate is 128.
• The minimum size of an allocation group is 8192 aggregate blocks.
• The allocation group size must always be a power-of-2 multiple of the
number of blocks described by one dmap page (i.e. 1, 2, 4, 8, ... dmap
pages).
Partial Allocation Group
An aggregate whose size is not a multiple of the allocation group size
contains a partial allocation group: it is not fully covered by disk blocks.
This partial allocation group is treated as a complete allocation group,
except that the non-existent disk blocks are marked allocated in the Block
Allocation Map.
Filesets
Layout The following illustration and table detail the layout of a fileset.
[Diagram: the Fileset Inode Table: 32 inodes (numbered 0-31) preceded by
a control page, with IAG pages describing them.]
Part Function
Fileset Inode Contains inodes describing the fileset-wide control
table structures. The Fileset Inode Table logically
contains an array of inodes.
Fileset Inode A Fileset Inode Allocation Map which describes
allocation map the Fileset Inode Table. The Fileset Inode
Allocation Map contains allocation state
information on the fileset inodes as well as their
on-disk location.
Inodes Objects. Every JFS2 object is represented by an
inode, which contains the expected object-specific
information such as time stamps, file type (regular
vs. directory, etc.). They also “contain” a B+–tree
to record the allocation of extents. Note
specifically that all JFS2 meta data structures
(except for the superblock) are represented as
“files.” By reusing the inode structure for this data,
the data format (on-disk layout) becomes
inherently extensible.
Filesets -- continued
Super Inodes
Super Inodes, found in the Aggregate Inode Table (#16 and greater),
describe the Fileset Inode Allocation Map and other fileset information.
Since the Aggregate Inode Table is replicated, there is also a secondary
version of each such inode which points to the same data.
Inodes When the fileset is initially created, the first inode extent is
allocated; additional inode extents are allocated and de-allocated
dynamically as needed. The inodes in a fileset are allocated as follows:
Fileset Description
Inode #
0 reserved
1 additional fileset information that would not fit in the Fileset
Allocation Map Inode in the Aggregate Inode Table.
2 The root directory inode for the fileset.
3 The ACL file for the fileset.
4- Fileset inodes from four onwards are used by ordinary fileset
objects, user files, directories, and symbolic links.
Extents
Extent Allocation Descriptor
Extents are described by an xad structure. The two main values describing
an extent are its length and its address. In an xad, both the length and
address are expressed in units of the aggregate block size. Details of
the xad data structure are shown below.
struct xad {
        uint8   xad_flag;
        uint16  xad_reserved;
        uint40  xad_offset;
        uint24  xad_length;
        uint40  xad_address;
};
Member Description
xad_flag Flags set on this extent. See /usr/include/j2/j2_xtree.h for
a list of flags.
xad_reserved Reserved for future use.
xad_offset Extents are generally grouped together to form a larger
group of disk blocks. The xad_offset describes the
logical byte address this extent represents in the larger
group.
xad_length A 24-bit field containing the length of the extent in
aggregate blocks. An extent can range in size from 1 to
2^24 - 1 aggregate blocks.
xad_address A 40-bit field containing the address of the first block of
the extent. The address is in units of aggregate blocks
and is the block offset from the beginning of the
aggregate.
Extents -- continued
Allocation In general, the allocation policy for JFS2 tries to maximize contiguous
Policy allocation by allocating a minimum number of extents, with each extent as
large and contiguous as possible. This allows for larger I/O transfer
resulting in improved performance. However, in special cases this is not
always possible. For example, a copy-on-write clone of a segment will cause
a contiguous extent to be partitioned into a sequence of smaller
contiguous extents. Another case is restriction of the extent size. For
example the extent size is restricted for compressed files since we must
read the entire extent into memory and decompress it. We have a limited
amount of memory available so we must ensure we will have enough room
for the decompressed extent.
Introduction Objects in JFS2 are stored in groups of extents arranged in
binary trees. The concepts of binary trees are introduced in this section.
Trees Binary trees consist of nodes arranged in a tree structure. Each node
contains a header describing the node. A flag in the node header
identifies the role of the node in the tree.
[Diagram: a tree with a root node (header flags=BT_ROOT) pointing to an
internal node (header flags=BT_INTERNAL) and leaf nodes (header
flags=BT_LEAF). Each node contains a header followed by an array of
extent descriptors (xad entries); leaf nodes point to the extents
containing the object's data.]
Header flags This table describes the binary tree header flags.
Flag Description
BT_ROOT The root or top of the tree.
BT_LEAF The bottom of a branch of a tree. Leaf nodes point to
the extents containing the objects data.
BT_INTERNAL An internal node points to two or more leaf nodes or
other internal nodes.
B+-tree index There is one generic B+–tree index structure for all index objects in JFS2
except for directories. The data being indexed depends upon the object.
The B+–tree is keyed by offset of the xad structure of the data being
described by the tree. The entries are sorted by the offsets of the xad
structures, each of which is an entry in a node of a B+–tree.
Root node The file j2_xtree.h describes the header for the root of the B+–tree in struct
header xtpage_t.
Leaf node The file j2_btree.h describes the header for an internal node or a leaf node
header in struct btpage_t.
typedef struct {
int64 next; /* 8: right sibling bn */
int64 prev; /* 8: left sibling bn */
uint8 flag; /* 1: */
uint8 rsrvd[7]; /* 7: type specific */
int64 self; /* 8: self address */
uint8 entry[4064]; /* 4064: */
} btpage_t;
Inodes
Overview Every file on a JFS2 filesystem is described by an on-disk inode.
The inode holds the root header for the extent binary tree. File attribute
data and block allocation maps are also kept in the inode.
Inode Layout The inode is a 512-byte structure, split into four 128-byte
sections described here.
[Diagram: the four sections of the inode: (1) POSIX attributes;
(2) extended attributes, block allocation maps, inode allocation maps,
and headers describing the inode data; (3) in-line data or xad’s;
(4) extended attributes, more in-line data, or additional xad’s.]
Section Description
1 This section describes the POSIX attributes of the JFS2 object,
including the inode and fileset number, object type, object size,
user id, group id, creation time, access time, modification time,
and more.
2 This section contains several parts:
• descriptors for extended attributes
• block allocation maps
• inode allocation maps
• Header pointing to the data (b+-tree root, directory, in-line
data)
3 This section can contain one of the following:
• In-line File data - for very small files (up to 128 bytes)
• The first 8 xad structures describing the extents for this file.
4 This section extends section 3 by providing additional storage for
more attributes, xad structures or in-line data.
Inodes -- continued
The third and fourth 128-byte sections are declared in the on-disk inode
structure as unions (excerpt):
union {
uint8 _data[80];
/* symbolic link.
* link is stored in inode if its length is less than
* IDATASIZE. Otherwise stored like a regular file. */
struct {
uint8 _fastsymlink[128];
} _symlink;
} _data3;
/* IV. type-dependent extension area (128 bytes)
* user-defined attribute, or
* inline data continuation, or
* B+-tree root node continuation */
union {
uint8 _data[128];
} _data4;
}
Inodes -- continued
Inode extents Inodes are allocated dynamically by allocating inode
extents, which are simply contiguous chunks of inodes on the disk. By
definition, a JFS2 inode extent contains 32 inodes. With a 512-byte inode
size, an inode extent therefore occupies 16KB on the disk.
Inodes -- continued
Inode initialization
When a new inode extent is allocated, the extent is not initialized; but for
fsck to be able to check whether an inode is in use, JFS2 needs some
information in it. Once an inode in an extent is marked in-use, its fileset
number, inode number, inode stamp, and the inode allocation group block
address are initialized. Thereafter, the link field is sufficient to
determine whether the inode is currently in use.
Inode Generation Numbers
Inode generation numbers are simply counters that increment each
time an inode is reused. Network file system protocols such as NFS
(implicitly) require them; they form part of the file identifier manipulated by
VNOP_FID() and VFS_VGET().
The static-inode-allocation practice of storing a per-inode generation
counter doesn’t work with dynamic inode allocation, because when an
inode becomes free its disk space may literally be reused for something
other than an inode (e.g., the space may be reclaimed for ordinary file data
storage). Therefore, in JFS2 there is simply one inode generation counter
that is incremented on every inode allocation (rather than one counter per
inode that would be incremented when that inode is reused).
Although a fileset-wide generation counter will recycle faster than a per-
inode generation counter, a simple calculation shows that the 32-bit value
is still sufficient to meet NFS or DFS requirements.
Overview This section introduces the data structures used to describe
where a file’s data is stored.
In-line data If a file contains a small amount of data, the data may be
stored in the inode itself. This is called in-line storage. The header found
in the second section of the inode points to the data, which is stored in
the third and fourth sections of the inode.
[Diagram: an inode whose header points to in-line data stored in the
inode itself.]
Binary trees When more storage is needed than can be provided in-line,
the data must be placed in extents. The header in the inode now becomes
the binary tree root header. If there are 8 or fewer extents for the file,
the xad structures describing the extents are contained in the inode. An
inode containing 8 or fewer xad structures would look like:
[Diagram: an inode whose B+-tree root header is followed by 8 xad entry
slots, three of them in use: (offset 0, addr 68, length 4) pointing to
16KB of data, (offset 4096, addr 84, length 12) pointing to 48KB, and
(offset 26624, addr 256, length 2) pointing to 8KB.]
INLINEEA bit Once the 8 xad structures in the inode are filled, an attempt is made to use
the last quadrant of the inode for more xad structures. If the INLINEEA bit
is set in the di_mode field of the inode, then the last quadrant of the inode
is available for 8 more xad structures.
More extents Once all of the available xad structures in the inode are used, the B+–tree
must be split. 4K of disk space is allocated for a leaf node of the B+–tree,
which is logically an array of xad entries with a header. The 8 xad entries
are moved from the inode to the leaf node, and the header is initialized to
point to the 9th entry as the first free entry. The first xad structure in the
inode is updated to point to the newly allocated leaf node, and the inode
header is updated to indicate that only one xad structure is now being
used, and that it contains the pure root of a B+-tree. The offset for this new
xad structure contains the offset of the first entry in the leaf node.
The organization of the inode now looks like:
[Figure: inode whose first xad (offset 0 / addr 412 / length 4) now points to
a 4K leaf node holding a header and 254 xad entries, which in turn describe
the 16KB, 48KB and 8KB data extents; the remaining inode xads are unused]
Continuing to As new extents are added to the file, they continue to be added to the
add extents leaf node in the necessary order, until the node fills. Once the node fills, a
new 4K of disk space is allocated for another leaf node of the B+–tree, and
the second xad structure from the inode is set to point to this newly
allocated node. The tree now looks like:
[Figure: inode with two xads in use -- the first pointing to the original leaf
node (addr 412), the second (addr 560 / length 4) pointing to the newly
allocated leaf node; each leaf holds a header and 254 xad entries describing
the data extents]
Another split As extents are added to the file, this behavior continues until all 8 xad
structures in the inode contain leaf node xad structures, at which time
another split of the B+–tree will occur. This split creates an internal node of
the B+–tree which is used purely to route the searches of the tree. An
internal node looks exactly like a leaf node. 4K of disk space is allocated
for the internal node of the B+–tree, the 8 xad entries of the leaf nodes are
moved from the inode to the newly created internal node, and the internal
node header is initialized to point to the 9th entry as the first free entry. The
root of the B+–tree is then updated by making the inode’s first xad
structure point to the newly allocated internal node, and the header in the
inode is updated to indicate that now only 1 xad structure is being used for
the B+–tree.
As extents continue to be added, additional leaf nodes are created to
contain the xad structures for the extents, and these leaf nodes are added
to the internal node.
Once the first internal node is filled, a second internal node is allocated,
and the inode’s second xad structure is updated to point to the new internal
node.
This behavior continues until all 8 of the inode’s xad structures contain
internal nodes.
[Figure: two-level B+-tree -- the inode’s xads (e.g. offset 0 / addr 380 /
length 4 and offset 8340 / addr 212 / length 4) point to internal nodes, each
holding a header and 254 xad entries that route searches down to the leaf
nodes describing the data extents]
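The routing role of internal nodes can be sketched as follows. This is a simplified in-memory model (on disk each node is read by its block address, and the entries are the xad structures themselves); the names and types here are illustrative.

```c
/* Simplified model of routing a B+-tree search: descend through
 * internal nodes to the leaf whose entry covers a given file offset.
 * (In-memory sketch; types and field names are illustrative.) */
#include <stdint.h>
#include <stddef.h>

struct node;
struct entry {
    uint64_t offset;        /* first file offset covered by this entry */
    struct node *child;     /* child node (internal nodes only) */
};
struct node {
    int leaf;               /* non-zero for leaf nodes */
    int n;                  /* entries in use */
    struct entry e[254];    /* up to 254 xad entries per 4K node */
};

/* Index of the last entry whose starting offset is <= off. */
static int route(const struct node *nd, uint64_t off)
{
    int i = 0;
    while (i + 1 < nd->n && nd->e[i + 1].offset <= off)
        i++;
    return i;
}

/* Walk internal nodes down to the leaf covering 'off'. */
static const struct node *find_leaf(const struct node *nd, uint64_t off)
{
    while (!nd->leaf)
        nd = nd->e[route(nd, off)].child;
    return nd;
}
```

Because internal nodes hold only routing entries, each lookup touches one node per tree level, regardless of how many extents the file has.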
fsdb Utility
Introduction The fsdb command enables you to examine, alter, and debug a file
system.
Starting fsdb It is best to run fsdb against an unmounted filesystem. Use the following
syntax to start fsdb:
fsdb <path to logical volume>
For example:
# fsdb /dev/lv00
Aggregate Block Size: 512
>
Supported fsdb supports both the JFS and JFS2 file systems. The commands
filesystems available in fsdb differ depending on the filesystem type it is
running against. The following explains how to use fsdb with a JFS2 file
system.
Commands The commands available in fsdb can be viewed with the help command
as shown here.
> help
Xpeek Commands
Exercise 1 - fsdb
Introduction In this lab you will run the fsdb utility against a JFS2 filesystem that was
created for you. The filesystem should not be mounted when running
fsdb. The filesystem may be mounted to examine the files; just be sure to
unmount it before running fsdb.
# fsdb /dev/lv00
Display the root inode for the file set. What command did
you use?
Using fsdb - In the next few steps you will locate and display fileA’s data.
continued
Step Action
4 Display the inode of fileA, what command did you use?
FileB and fileC Use the commands and techniques you learned in the last section to
examine fileB, fileC and fileD. Answer the following questions about these
files:
1. What number inodes are used for fileB, fileC and fileD?
2. How many xad structures are used to describe fileB’s data blocks?
3. How many xad structures are used to describe fileC’s data blocks?
4. Examine the inode for fileD. How big is this file (as shown in di_size)?
Are enough aggregate blocks allocated to store the entire file? Explain
your answer.
Directory
Directory entry Stored in an array, the directory entries link the names of the objects in the
directory to an inode number. The directory entry has the following
members.
Member Description
inumber Inode number
namelen Length of the name.
name[30] File name, up to 30 characters.
next If more than 30 characters are
needed, additional entries are linked
using the next pointer.
Directory -- continued
Member Description
idotdot Inode number of parent directory.
flag Indicates whether the node is an internal or leaf node, and
whether it is the root of the binary tree.
nextindex Last used slot in the directory entry slot array.
freecnt Number of free slots in the directory entry array.
freelist Slot number of the head of the free list.
stbl[8] Indices to the directory entry slots that are currently in
use. The entries are sorted alphabetically by name.
slot[9] Array of directory entry slots (the header is stored in the
first slot, leaving 8 slots for entries).
Leaf and When more than 8 directory entries are needed a leaf or internal node is
internal node added. The directory internal and leaf node headers are similar to the root
header node header, except that they describe up to 128 directory entry slots. The
page header is defined by a dpage_t structure contained in
/usr/include/j2/j2_dtree.h.
Directory -- continued
Directory slot The Directory Slot Array (stbl[]) is a sorted array of indices to the directory
array slots that are currently in use. The entries are sorted alphabetically by
name. This limits the amount of shifting necessary when directory entries
are added or deleted, since the array is much smaller than the entries
themselves. A binary search can be used on this array to search for
particular directory entries.
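The binary search over stbl[] can be sketched like this. The slot format is reduced to a plain C struct with single-byte characters (the real slots are the structures described above), so the code illustrates only the search technique.

```c
/* Sketch: binary search of a directory through the sorted stbl[]
 * array. 'slot' is the directory entry slot array; stbl[i] holds the
 * slot index of the i-th name in alphabetical order. Types are
 * simplified stand-ins for the on-disk structures. */
#include <string.h>

struct dtentry {
    int  inumber;           /* inode number of the object */
    char name[32];          /* simplified: full name in one slot */
};

/* Return the inumber for 'name', or -1 if not present. */
static int dir_lookup(const struct dtentry *slot, const unsigned char *stbl,
                      int nextindex, const char *name)
{
    int lo = 0, hi = nextindex - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        int cmp = strcmp(name, slot[stbl[mid]].name);
        if (cmp == 0)
            return slot[stbl[mid]].inumber;
        if (cmp < 0)
            hi = mid - 1;   /* name sorts before slot[stbl[mid]] */
        else
            lo = mid + 1;
    }
    return -1;
}
```

Because only the one-byte stbl[] indices move when files are added or removed, the larger directory entries themselves stay in place.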
In this example the directory entry table contains four files. The stbl table
contains the slot numbers of the entries ordering the entries alphabetically.
Directory entry table: slot 1 = def, slot 2 = abc, slot 3 = xyz, slot 4 = hij
(slots 5-8 free)
STBL[8] = {2, 1, 4, 3, 0, 0, 0, 0}
. and .. A directory does not contain specific entries for self (“.”) and parent (“..”).
Instead these will be represented in the inode itself. Self is the directory’s
own inode number, and the parent inode number is held in the “idotdot”
field in the header.
Directory -- continued
Growing As the number of files in the directory grows, the directory tables must
directory size increase in size. This table describes the steps used.
Step Action
1 Initial directory entries are stored in directory inode in-line
data area.
2 When the in-line data area of the directory inode becomes
full JFS2 allocates a leaf node the same size as the
aggregate block size.
3 When that initial leaf node becomes full and the leaf node is
not yet 4K, double the current size. First attempt to double
the extent in place; if there is not room to do this, a new
extent must be allocated and the data from the old extent
must be copied to the new extent. The directory slot array
will only have been big enough to reference the slots of the
smaller page, so a new slot array will have to be created.
Use the slots from the beginning of the newly allocated
space for the larger array and copy the old array data to the
new location. Update the header to point to this array and
add the slots for the old array to the free list.
4 If the leaf node again becomes full and is still not 4K repeat
step 3. Once the leaf node reaches 4K allocate a new leaf
node. Every leaf node after the initial one will be allocated
as 4K to start.
5 When all entries are free in a leaf page, the page will be
removed from the B+–tree. When all the entries in the last
leaf page are deleted, the directory will shrink back into the
directory inode in-line data area.
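Steps 2 through 4 above amount to a simple growth policy, sketched here with sizes in bytes (the starting size would be the aggregate block size; the constants are illustrative):

```c
/* Growth policy for a directory leaf (steps 2-4 above): double the
 * node until it reaches 4K; after that, new leaves start at 4K. */
static int grow_leaf(int cur_size)
{
    if (cur_size < 4096) {
        int next = cur_size * 2;    /* step 3: double, in place if possible */
        return next > 4096 ? 4096 : next;
    }
    return 4096;                    /* step 4: further leaves start at 4K */
}
```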
Directory Examples
Introduction This section demonstrates how the directory structures change over time.
Small Initial directory entries are stored in the directory inode in-line data area.
Directories Examine this example of a small directory. In this example all the
directory entries fit into the in-line data area:
# ls -ai
69651 .
2 ..
69652 foobar1
69653 foobar12
69654 foobar3
69655 longnamedfilewithover22charsinitsname
1 inumber: 69652
next: -1
namelen: 7
name: foobar1
2 inumber: 69653
next: -1
namelen: 8
name: foobar12
3 inumber: 69654
next: -1
namelen: 7
name: foobar2
4 inumber: 69655
next: 5
namelen: 37
name:longnamedfilewithover2
5
next: -1
cnt: 0
name: 2charsinitsname
Note: the file with a long name has its name split across two slots.
Adding a file An additional file called “afile” is created. The details for this file are added
at the next free slot (slot 6). As this is now, alphabetically, the first file in
the directory, the search table array (stbl[]) is re-organized, such that the
entry for slot 6 is now the first entry.
# ls -ai
69651 .
2 ..
69656 afile
69652 foobar1
69653 foobar2
69654 foobar3
69655 longnamedfilewithover22charsinitsname
flag: BT_ROOT BT_LEAF
nextindex: 5
freecnt: 2
freelist: 7
idotdot: 2
stbl: {6,1,2,3,4,0,0,0}
1 inumber: 69652
next: -1
namelen: 7
name: foobar1
2 inumber: 69653
next: -1
namelen: 8
name: foobar12
3 inumber: 69654
next: -1
namelen: 7
name: foobar2
4 inumber: 69655
next: 5
namelen: 37
name:longnamedfilewithover2
5
next: -1
cnt: 0
name: 2charsinitsname
6 inumber: 69656
next: -1
namelen: 5
name: afile
Adding a leaf When the directory grows to the point where there are more entries than
node can be stored in the in-line data area of the inode, JFS2 allocates a leaf
node the same size as the aggregate block size. The in-line entries are
moved to a leaf node as illustrated.
Inode root node:
flag: BT_ROOT BT_INTERNAL
nextindex: 1
freecnt: 7
freelist: 2
idotdot: 2
stbl: {1,2,3,4,5,6,7,8}
1 xd.len: 1
xd.addr1: 0
xd.addr2: 52
next: -1
namelen: 0
name: file0

Leaf node (block 52):
flag: BT_LEAF
nextindex: 20
freecnt: 103
freelist: 25
maxslot: 128
stbl: {1,2,15, ... 8,13,14}
1 inumber: 5
next: -1
namelen: 5
name: file0
2 inumber: 6
next: -1
namelen: 5
name: file1
3 inumber: 15
next: -1
namelen: 6
name: file10
...
19 inumber: 23
next: -1
namelen: 6
name: file18
20 inumber: 24
next: -1
namelen: 6
name: file19
Once the leaf is full, an internal node is added at the next free in-line data
slot in the inode, which will contain the address of the next leaf node.
Note: the internal node entry contains the name of the first file (in
alphabetical order) for that leaf node.
Adding an Once all the in-line slots have been filled by internal nodes, a separate
internal node node block is allocated, the entries from the in-line data slots are moved to
this new node, and the first in-line data slot is updated with the address of
the new internal node.
After many extra files have been added to the directory, two layers of
internal nodes are required to reference all the files.
Note that now the internal node entries in the inode contain the name of
the alphabetically first entry referenced by each of the second-level internal
nodes, and each entry in these references the name of the alphabetically
first entry in each leaf node.
Exercise 2 - Directories
Introduction In this exercise you will use the fsdb utility to examine directory inodes in
a jfs2 filesystem.
Small Run fsdb on the sample filesystem. Use the following steps to examine
directories the directory node for /mnt/small.
Step Action
1 Find the inode for directory small:
> dir 2
2 Display the inode found in the last step.
> i <inode number>
3 Using the t sub-command display the directory node root
header.
#touch /mnt/small/a
Predict what the stbl[] table for directory small will look like
now?
Larger In this section you will examine the directory node structures for some
directories larger directories.
Step Action
1 What is the inode for the directory called medium?
2 Display the inode and look at the root tree header. The
flags should indicate that this is an internal header. One
entry should be found for each leaf node. Display the
entries with the <enter> key. How many leaf nodes are
there?
3 Use the down subcommand to display the first leaf node
header. How many entries is this header currently
describing?
Objectives
After completing this unit, you should be able to:
• Identify the various components that make up the logical and virtual
file systems
• Use the debugger (kdb/iadb) to display these components.
References
Introduction This lesson covers the interface and services that AIX 5L provides to
physical file systems. The Logical File System (LFS), Virtual File System
(VFS) and the interface between these components and physical file
systems are discussed in this lesson.
Supported file Using the structure of the logical file system and the virtual filesystem,
systems AIX 5L can support a number of different file system types transparently to
application programs. These file systems reside below the LFS/VFS and
operate relatively independently of each other. Currently AIX 5L supports
the following physical filesystem implementations:
• Enhanced Journaled Filesystem (JFS2)
• Journaled filesystem (JFS)
• Network File System (NFS)
• A CD-ROM File system which supports ISO-9660, High Sierra and
Rock Ridge formats.
Extensible The LFS/VFS interface also provides a relatively easy means by which
third party filesystem types can be added without any changes to the LFS.
System Call
• System calls
• Logical File System (LFS)
• Virtual File System (VFS)
• File System Implementation - support of individual file system layout.
• Fault Handler - device page fault handler support in the VMM.
• Device Driver - actual device driver code to interface with the device.
It is invoked by the page fault handler when the file system
[Figure: layered stack -- System Call, Logical File System, Virtual File
System, File System Implementation, Fault Handler]
Internal data This illustration shows the major data structures that will be discussed in
structures this lesson. This illustration is repeated throughout the lesson
highlighting the areas being discussed.
[Figure: LFS/VFS data structures -- the u-block’s User File Descriptor Table
points into the System File Table, whose entries point to vnodes; each vnode
links to a gnode (embedded in an inode) and to a vfs, which links to its gfs,
vmount, vnodeops and vfsops]
LFS Data The data structures discussed in this section are the System Open File
Structures Table and the User File Descriptor Table. The system open file table has
one entry for each open file on the system. The user file descriptor table
(one per process) contains entries for each of the process’s open files.
Operations The LFS provides a standard set of operations to support the system call
interface; its routines manage the open file table entries and the per-
process file descriptors. It provides:
• the User File Descriptor Table.
• the System File table. An open file table entry records the authorization
of a process’s access to a file system object.
The LFS abstraction specifies the set of file system operations that an
implementation must include in order to carry out logical file system
requests. Physical file systems can differ in how they implement these
predefined operations, but they must present a uniform interface to the
LFS. It supports UNIX-like file system access semantics, but other non-
UNIX file systems can also be supported.
User interface A user can refer to an open file table entry through a file descriptor held in
the thread’s ublock, or by accessing the virtual memory to which the file
was mapped. The file descriptor table entry is created when the file is
initially opened, via the open() system call and will remain until either the
user closes the file via the close() system call, or the process terminates.
The LFS is the level of the file system at which users can request file
operations by using system calls, such as open(), close(), read(), write()
etc. For all these calls (except open()), the file descriptor number is passed
as an argument to the call. The system calls implement services that are
exported to users, and provide a consistent user mode programming
interface to the LFS that is independent of the underlying file system type.
System calls that carry out file system requests:
• Map the user’s parameters to a file system object. This requires that the
system call component use the vnode (virtual node) component to
follow the object’s path name. In addition, the system call must resolve
a file descriptor or establish implicit (mapped) references using the open
file component.
• Verify that a requested operation is applicable to the type of the
specified object.
• Dispatch a request to the file system implementation to perform
operations.
Description The user file descriptor table is contained in the user area, and is a per-
process resource. Each entry references an open file, device, or socket
from the process’s perspective. The index into the table for a specific file is
the value returned by the open() system call when the file is opened - the
file descriptor.
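Conceptually, resolving a file descriptor is just an index into this table. A simplified sketch (the structures here are reduced from the real ufd and file declarations):

```c
/* Sketch: a file descriptor is an index into the per-process user
 * file descriptor table; each in-use entry points at a system file
 * table entry. Types are simplified from the real declarations. */
#include <stddef.h>

struct file;                        /* system open file table entry */

struct ufd {
    struct file   *fp;              /* system file table entry, or NULL */
    unsigned short flags;
    unsigned short count;
};

/* Resolve fd to its file table entry; NULL corresponds to EBADF. */
static struct file *fd_to_file(struct ufd *ufdtab, int nfd, int fd)
{
    if (fd < 0 || fd >= nfd || ufdtab[fd].fp == NULL)
        return NULL;
    return ufdtab[fd].fp;
}
```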
Table One or more slots of the file descriptor table are used for each open file.
Management The file descriptor table can extend beyond the first page of the ublock, and
is page-able. There is a fixed upper limit of 32768 open file descriptors per
process (defined as OPEN_MAX in /usr/include/sys/limits.h). This value is
fixed, and may not be changed.
User File The user file descriptor table consists of an array of user file descriptor
Descriptor table structures defined in /usr/include/sys/user.h in the structure ufd:
Table structure
struct ufd {
struct file * fp;
unsigned short flags;
unsigned short count;
#ifdef __64BIT_KERNEL
unsigned int reserved;
#endif /* __64BIT_KERNEL */
};
Description The system file table is a global resource, and is shared by all processes
on the system. One unique entry is allocated for each unique open of a file,
device, or socket in the system.
Table The table is a large array, and is partly initialized. It grows on demand, and
Management is never shrunk. Once entries are freed, they are added back onto the free
list (ffreelist). The table can contain a maximum of 1,000,000 entries, and
is not configurable.
Table entries The file table array consists of struct file data elements. Several of the
key members of this data structure are described in this table.
Member Description
f_count A reference count field detailing the current
number of opens on the file. This value is
increased each time the file is opened, and
decremented on each close(). Once the
reference count is zero, the slot is considered
free, and may be re-used.
f_flag various flags described in fcntl.h
f_type a type field describing the type of file:
/* f_type values */
#define DTYPE_VNODE 1 /* file */
#define DTYPE_SOCKET 2 /* communications endpoint */
#define DTYPE_GNODE 3 /* device */
#define DTYPE_OTHER -1 /* unknown */
f_offset a read/write pointer.
f_data Defined as f_up.f_uvnode, it is a pointer to
another data structure representing the object,
typically the vnode structure.
f_ops a structure containing pointers to functions for
the following file operations: rw (read/write),
ioctl, select, close, fstat.
Overview The Virtual File System (VFS) defines a standard set of operations on an
entire file system. Operations performed by a process on a file or file
system are mapped through the VFS to the file system below. In this way,
the process need not know the specifics of different file systems (such as
JFS, J2, NFS or CDROM).
Functional For the purpose of this lesson the VFS will be broken into three sections
sections and described separately. These sections are:
• Vnode-VFS interface
• File and File System Operations
• The gnode
Vnode/vfs interface
Overview The interface between the logical file system and the underlying file
system implementations is referred to as the vnode/vfs interface. This
interface provides a logical boundary between generic objects understood
at the LFS layer and the file system specific objects that the underlying file
system implementation must manage such as inodes and super blocks.
The LFS is relatively unaware of the underlying file system data structures
since they can be radically different for the various file system types.
Data Vnodes and vfs structures are the primary data structures used to
Structures communicate through the interface (with help from the vmount).
• vnode - represents a file
• vfs - represents a mounted file system
• vmount - contains specifics of the mount request.
History The vnode and vfs structures of the LFS were created by Sun
Microsystems and have evolved into a de-facto industry standard, thanks
in part to NFS.
Vnodes
Overview The vnode provides a standard set of operations within the file system,
and provides system calls with a mechanism for local name resolution.
This allows the logical file system to access multiple file system
implementations through a uniform name space.
Detail Vnodes are the primary handles by which the operating system references
files, and represent access to an object within a virtual file system. Each
time an object (file) within a file system is located (even if it is not opened),
a vnode for that object is located (if already in existence), or created, as
are the vnodes for any directory that has to be searched to resolve the
path to the object.
As a file is created, a vnode is also created, and will be re-used for every
subsequent reference made to the file by a path name. Every path name
known to the logical file system can be associated with, at most, one file
system object, and each file system object can have several names
because it can be mounted in different locations. Symbolic links and hard
links to an object always get the same vnode if accessed through the same
mount point.
vnode Vnodes are created by the vfs-specific code when needed, using the
Management vn_get kernel service. Vnodes are deleted with the vn_free kernel service.
Vnodes are created as the result of a path resolution.
Description When a new file system is mounted, a vfs and a vmount structure are
created. The vmount structure contains specifics of the mount request,
such as the object being mounted, and the stub over which it is being
mounted. The vfs structure is the connecting structure which links the
vnodes (representing files) with the vmount information, and the gfs
structure that helps define the operations that can be performed on the
filesystem and its files.
vfs The vfs structure is the connecting structure which links the vnodes
(representing files) with the vmount information, and the gfs structure
which provides a path to the operations that can be performed on the
filesystem and its files.
Element Description
*vfs_next vfs structures form a linked list, with the first
vfs entry addressed by the rootvfs variable,
which is private to the kernel.
*vfs_gfs path back to the gfs structure and its file
system specific subroutines through the
vfs_gfs pointer.
vfs_mntd The vfs_mntd pointer points to the vnode
within the file system which generally
represents the root directory of the file
system.
vfs_mntdover The vfs_mntdover pointer points to a vnode
within another file system, also usually
representing a directory, which indicates
where the file system is mounted. In this
sense, the vfs_mntd pointer corresponds to
the object within the vmount structure
referenced by the vfs_mdata pointer, and
the vfs_mntdover pointer corresponds to the
stub within the vmount structure referenced
by the vfs_mdata pointer.
vfs_nodes Pointer to all vnodes for this file system.
vfs_mdata Pointer to the vmount providing mount
information for this filesystem
vfs The mount helper creates the vmount structure, and calls the vmount
Management subroutine. The vmount subroutine then creates the vfs structure, partially
populates it, and invokes the file system dependent vfs_mount subroutine
which completes the vfs structure, and performs any operations required
internally by the particular file system implementation.
There is one vfs structure for each file system currently mounted. New vfs
structures are created with the vmount subroutine. This subroutine calls
the vfs_mount subroutine found within the vfsops structure for the
particular virtual file system type. The vfs entries are removed with the
uvmount subroutine. This subroutine calls the vfs_umount subroutine from
the vfsops structure for the virtual file system type.
vmount The vmount structure contains specifics of the mount request. The vmount
structure is defined in /usr/include/sys/vmount.h
struct vmount {
uint vmt_revision; /* I revision level, currently 1 */
uint vmt_length; /* I total length of structure & data */
fsid_t vmt_fsid; /* O id of file system */
int vmt_vfsnumber; /* O unique mount id of file system */
uint vmt_time; /* O time of mount */
uint vmt_timepad; /* O (in future, time is 2 longs) */
int vmt_flags; /* I general mount flags */
/* O MNT_REMOTE is output only */
int vmt_gfstype; /* I type of gfs, see MNT_XXX above */
struct vmt_data {
short vmt_off; /* I offset of data, word aligned */
short vmt_size; /* I actual size of data in bytes */
} vmt_data[VMT_LASTINDEX + 1];
};
Overview Each file system type extension provides functions to perform operations
on the filesystem and its files. Pointers to these functions are stored in the
vfsops (filesystem operations) and vnodeops (file operations) structures.
gfs
Description There is one gfs structure for each type of virtual file system currently
installed on the machine. For each gfs entry, there may be any number of
vfs entries.
Purpose The operating system uses the gfs entries as an access point to the virtual
file system functions on a type-by-type basis. There is no direct link from a
gfs entry to all of the vfs entries of a particular gfs type. The file system
code generally uses the gfs structure as a pointer to the vnodeops
structure and the vfsops structure for a particular type of file system.
gfs The gfs structures are stored within a global array accessible only by the
management kernel. The gfs entries are inserted with the gfsadd() kernel service, and
only one gfs entry of a given gfs_type can be inserted into the array.
Generally, gfs entries are added by the CFG_INIT section of the
configuration code of the file system kernel extension. The gfs entries are
removed with the gfsdel() kernel service. This is usually done within the
CFG_TERM section of the configuration code of the file system kernel
extension.
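The add/delete discipline can be sketched as a small registry keyed by gfs_type. This is an illustrative model of the behavior described above, not the kernel's gfsadd()/gfsdel() code; the table size and field names are assumptions.

```c
/* Illustrative model of the gfs registry: one entry per filesystem
 * type, added at CFG_INIT and removed at CFG_TERM. Not the kernel's
 * actual gfsadd()/gfsdel(); sizes and names are assumptions. */
#include <stddef.h>

#define GFS_MAX 16                  /* illustrative table size */

struct gfs_entry {
    int         gfs_type;
    const void *vfs_ops;            /* would point to a vfsops */
    const void *vn_ops;             /* would point to a vnodeops */
    int         used;
};

static struct gfs_entry gfs_table[GFS_MAX];

/* Register a filesystem type; fails if the type is already present. */
static int gfs_add(int type, const void *vfsops, const void *vnops)
{
    if (type < 0 || type >= GFS_MAX || gfs_table[type].used)
        return -1;                  /* only one entry per gfs_type */
    gfs_table[type].gfs_type = type;
    gfs_table[type].vfs_ops  = vfsops;
    gfs_table[type].vn_ops   = vnops;
    gfs_table[type].used     = 1;
    return 0;
}

/* Remove a previously registered type. */
static int gfs_del(int type)
{
    if (type < 0 || type >= GFS_MAX || !gfs_table[type].used)
        return -1;
    gfs_table[type].used = 0;
    return 0;
}
```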
vnodeops
vnodeops There is one vnodeops structure per filesystem kernel extension loaded
management (that is, one per unique filesystem type); it is initialized when the extension
is loaded.
struct vnodeops {
/* creation/naming/deletion */
int (*vn_link)(struct vnode *, struct vnode *, char *,
struct ucred *);
int (*vn_mkdir)(struct vnode *, char *, int32long64_t,
struct ucred *);
int (*vn_mknod)(struct vnode *, caddr_t, int32long64_t,
dev_t, struct ucred *);
int (*vn_remove)(struct vnode *, struct vnode *, char *,
struct ucred *);
int (*vn_rename)(struct vnode *, struct vnode *, caddr_t,
struct vnode *,struct vnode *,caddr_t,struct ucred *);
int (*vn_rmdir)(struct vnode *, struct vnode *, char *,
struct ucred *);
vfsops
vfsops There is one vfsops structure per filesystem kernel extension loaded (that
management is, one per unique filesystem type); it is initialized when the extension is
loaded.
The Gnode
Creation A gnode refers directly to a file (regular, directory, special, and so on), and
is usually embedded within a file system implementation specific structure
(such as an inode). Gnodes are created as needed by file system specific
code at the same time as creating implementation specific structures. This
is normally immediately followed by a call to the vn_get kernel service to
create a matching vnode. The gnode structure is usually deleted either
when the file it refers to is deleted, or when the implementation specific
structure is being reused for another file.
gnode and The gnode is typically embedded in an in-core inode. The member
inode gnode->gn_data points to the start of the inode.
[Figure: gnode embedded in an in-core inode, with gnode->gn_data pointing
back to the start of the inode]
Exercise 1
Overview This exercise will test your knowledge of the data structures of the LFS and
VFS and the relationships between them.
lab Use the following list of terms to best complete the statements below.
File / File system / System File Table / vfs / vnodeops / vmount
[Figure: partially labeled diagram of the LFS/VFS data structures --
u-block, inode, gnode, User File Descriptor Table, System File Table,
vnodeops, vfs, vfsops]
6. Label the blocks representing the vnode, vmount and gfs structures
7. Draw a line representing the file pointer in the ufd to an entry in the
system file table.
Lab Exercise 1
Overview In the following exercise you will run a small C program that opens a file,
initializes it by writing a few bytes to it, then pauses. The pause allows us
to investigate the various LFS structures that are created by opening the
file, using the appropriate system debugger.
The close() then open() is required to ensure that the write is committed to
disk and hence that the inode is updated.
Save this code to a file called t.c, and compile it using “make t”.
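The program listing itself is missing from these materials; a minimal reconstruction consistent with the description might look like this. The file name "foo" is an assumption (it matches the later "ls -lia foo" step), and the core is factored into a function here so it can be exercised directly.

```c
/* Sketch of t.c -- the original listing is not reproduced here, so
 * this reconstruction follows the description above. The file name
 * "foo" and the exact bytes written are assumptions. */
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

int open_write_reopen(const char *path)
{
    int fd = open(path, O_CREAT | O_RDWR, 0644);
    write(fd, "some data", 9);
    close(fd);                  /* close, then open again, so the write
                                   is committed and the inode updated */
    fd = open(path, O_RDWR);    /* with 0-2 as the std streams, this is 3 */
    printf("fd = %d\n", fd);
    return fd;
}

/* In t.c proper, main() would call open_write_reopen("foo") and then
 * pause(), keeping the file open while it is inspected from kdb. */
```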
Stage Description
1 Enter the C program from above, save it to a file called t.c
and compile with the command:
$ make t
2 Execute the program created in the last step. It will print the
file descriptor number of the file it creates, then pause.
$ ./t
fd = 3
3 From another shell on the same system, enter the system
debugger (kdb or iadb).
Lab
Stage Description
4 Initially, we need to find the address of the file structure for
the open file. We know that the file descriptor for our
program is number 3, so we have to find the mapping
between the file descriptor number and the file structure.
This mapping is done from the file descriptor table in the
uarea structure for the process. To find the uarea, find the
slot number in the thread table that our “t” process occupies;
the uarea slot number will be the same.
For kdb use the “th *” command to display all the threads.
Page down through the entries until you find the correct
entry:
(0)> th *
SLOT NAME STATE TID PRI RQ CPUID CL WCHAN
lab
Stage Description
6 The file structure for file descriptor 3 is at address
F100009600007700. Use the “file” command along with this
address to display the contents of the structure:
(0)> file F100009600007700
ADDR COUNT OFFSET DATA TYPE FLAGS
F100009600007700 … … F10000971528A380 VNODE READ
node … slot …
f_flag … f_count …
f_options … f_type …
f_data F10000971528A380 f_offset …
f_dir_off … f_cred …
f_lock @ … f_lock
f_offset_lock @ … f_offset_lock
f_vinfo … f_ops … vnodefops
VNODE F10000971528A380
v_flag … v_count …
v_vfsgen … v_vfsp …
v_lock @ … v_lock
v_mvfsp … v_gnode …
v_next … v_vfsnext …
v_vfsprev … v_pfsvnode …
v_audit …
Note that halfway down the output, the address of the
vnode structure that corresponds to this file is printed,
followed by the contents of this vnode structure.
(We could also display the vnode structure separately by
running the kdb command “vnode” with the address
F10000971528A380.)
8 There are two items that we are interested in from the vnode
structure displayed in the last step: the v_vfsp address,
which points to the filesystem that contains the vnode, and
the v_gnode address, which points to the gnode structure
for the file. From the gnode we can display the inode
structure for the file.
Lab
Step Action
9 The inode address is contained in the gn_data field, in this
case F10000971528A3D8. Use the kdb command “inode”
to display this structure:
(0)> inode F10000971528A3D8
DEV NUMBER CNT TYPE FLAGS
KERN_heap … … REG …
forw … back …
next … prev …
gnode @ … number …
dev … ipmnt …
flag … locks … bigexp … compress …
cflag … count … sync … sn … id …
moved … frag … open … event FFFFFFFFFFFFFFFF
hip … nodelock …
nodelock @ … dquot[USR] …
dquot[GRP] … dinode @ …
cluster … rcluster … diocnt … nondio …
size … gets …
GNODE …
gn_type … gn_flags …
gn_seg …
gn_mwrcnt … gn_mrdcnt … gn_rdcnt …
gn_wrcnt … gn_excnt … gn_rshcnt …
gn_ops … jfs_vops
gn_vnode F10000971528A380 gn_rdev …
gn_chan … gn_reclk_event FFFFFFFF
gn_reclk_lock @ … gn_reclk_lock
gn_filocks … gn_data F10000971528A3D8
gn_type REG
di_gen … di_mode … di_nlink …
di_acct … di_uid … di_gid …
di_nblocks … di_acl …
di_mtime … di_atime … di_ctime …
di_size_hi … di_size_lo … di_sec …
di_rdaddr …
di_vindirect … di_rindirect …
di_privoffset … di_privflags … di_priv …
VNODE F10000971528A380
v_flag … v_count …
v_vfsgen … v_vfsp …
v_lock @ … v_lock
v_mvfsp … v_gnode …
v_next … v_vfsnext …
v_vfsprev … v_pfsvnode …
v_audit …
10 The inode command displays the inode, gnode and vnode
structures.
$ ls -lia foo
Lab Exercise 2
Overview The instructor will create a shell script that simply prints its process
id, then pauses.
Both the “ps” command and the process and thread table entries for this
script will list the name of the shell that is executing it (e.g. “ksh”) rather
than the name of the script.
Objective To determine the name of the script that the instructor is running.
Tips • Remember that the shell will have to open() the script prior to executing
it.
• The command find . -inum xxx can be used to find the name of a
file, given the filesystem mount point and an inode number.
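The effect of find . -inum can be approximated in a few lines of Python, which may help visualize what the command does (a sketch; real find also supports pruning and -xdev to stay on one filesystem, which this helper does not):

```python
import os

def find_by_inum(mount_point, inum):
    """Walk a directory tree and return paths whose inode number matches,
    roughly what `find <mount_point> -inum <inum>` does."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(mount_point):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_ino == inum:
                    matches.append(path)
            except OSError:
                pass  # entry vanished or is unreadable; skip it
    return matches
```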
Objectives
After completing this unit, you should be able to:
• List and locate boot components and their usage
• Understand the three phases of rc.boot
• Understand the contents and usage of a RAMFS
• Understand the ODM structure and the usage of ODM classes
• Create new boot images
• Debug boot problems
What is boot
Definition Boot is the process that begins when the computer is powered on and
continues until the entries in /etc/inittab have been processed.
ROS process System ROS (Read-Only Storage) contains firmware, independent of the
operating system, that initializes the hardware and loads AIX.
All platforms except RS6K use an intermediate boot process called:
• Softros : (/usr/lib/boot/aixmon_chrp) for CHRP systems
• Softros : (/usr/lib/boot/aixmon_rspc) for RSPC systems
• Boot loader : (/usr/lib/boot/boot_elf) for IA-64 systems
AIX process AIX begins execution after the system ROS firmware or the intermediate
boot process finishes its execution:
• firmware information is set up
• the kernel initializes
• configuration runs from the RAM filesystem
• control is passed to files based in the permanent filesystem (this may be
a disk or network filesystem)
• /etc/inittab entries are processed. This usually includes enabling the
user login process.
Configuration The boot process can use one of the following boot configurations :
• standalone
• diskless/dataless (Not supported on IA64 platform)
• operating system installation/software maintenance
• diagnostics
Hard disk boot The hard disk boot has the following characteristics :
• the boot image resides on the hard disk
• the RAM filesystem contains the files necessary for configuring the hard
disk(s), and then accessing the filesystems that reside in the root
volume group (rootvg)
• this is the most common system configuration
• these types of systems are also known as “standalone” systems
• these types of systems may also be booted into the diagnostics
functions
CDROM boot The CDROM boot may be used in the following situations:
• operating system installation
• diagnostics
• hard disk boot failure recovery/maintenance
Network boot The network boot can be used for the following purposes :
• boot and install the operating system
• the operating system is installed on a hard disk with NIM
• subsequent boots are from the hard disk
• supported diskless/dataless configurations
• diagnostics
• hard disk boot failure recovery/maintenance
The centralized boot/filesystem servers offer convenient administration.
Introduction In order to successfully boot a system, the AIX kernel needs basic
commands, configuration files, kernel extensions and device drivers to be
able to configure a minimum environment.
All the needed files are included in the RAMFS, which is created with the
following command:
mkfs -V jfs -p <proto> <temp_filesystem_file>
prototype file A prototype file is a list of files and file descriptions that are needed
description to create a RAMFS.
A prototype file entry has the following format:
<dest_file_name> <type> <mode> 0 0 <full_path_name>
Where:
• <dest_file_name> : is the name of the file, directory, link or device as it
will be written to the RAMFS
• <type> : defines the type of the entry and can be :
• d--- : a directory entry (this will change the relative path of the
following entries).
• l--- : a link (the target will be listed in the <full_path_name>
parameter)
• b--- : a block device (the <full_path_name> parameter will represent
the major and minor numbers)
• c--- : a character device (the <full_path_name> parameter will
represent the major and minor numbers)
• ---- : a file
• <mode> : represents the file permissions in octal format
• <full_path_name> : value will depend on the <type> as described
before.
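A minimal parser for the entry format above can be sketched in Python (an illustration of the format only; the grammar accepted by mkfs is richer than this single-entry form, and parse_proto_entry is a hypothetical helper name):

```python
def parse_proto_entry(line):
    """Parse one prototype entry of the form:
       <dest_file_name> <type> <mode> 0 0 <full_path_name>"""
    dest, ftype, mode, uid, gid, target = line.split(None, 5)
    kinds = {"d": "directory", "l": "link", "b": "block device",
             "c": "character device", "-": "file"}
    return {
        "dest": dest,
        "kind": kinds[ftype[0]],   # first character of the 4-character type
        "mode": int(mode, 8),      # permissions are octal
        "uid": int(uid),
        "gid": int(gid),
        "target": target,          # path, link target, or major/minor pair
    }
```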
prototype files Prototype files are divided into several parts according to their specific
types use:
• Prototype files located in /usr/lib/boot are the base prototypes used for
a platform according to the boot device type; they come with the
platform base system device fileset
• Prototype files located in /usr/lib/boot/network are specific to each
general kind of network boot device; they come with the platform base
system device fileset
• Prototype files located in /usr/lib/boot/protoext are used for a specific
type of boot device; they come with the device-specific fileset
Introduction In order to successfully boot from a device, the administrator needs to
run commands that create the boot structure.
bosboot The bosboot command is the most commonly used on AIX because it
command manages all verification tasks and environment setup for the administrator.
The administrator can also use the mkboot command, but must then take
care of all these preliminary checks manually.
The bosboot command is also used by other commands, such as mksysb,
and by the installp post-installation process when installing packages that
need to build a new boot image.
argument description
-a Create complete boot image and device.
-w file Copy given boot image file to device.
-r file Create ROS Emulation boot image.
-d device Device for which to create the boot image.
-U Create uncompressed boot image.
-p proto Use given proto file for RAM disk file system.
-k kernel Use given kernel file for boot image.
-l lvdev Target boot logical volume for boot image.
-b file Use given file name for boot image name.
-D Load Low Level Debugger.
-I Load and Invoke Low Level Debugger.
-L Enable MP locks instrumentation (MP kernels)
-M norm|serv|both Boot mode - normal or service
-O offset boot image offset for CDROM file system.
-q Query size required for boot image.
AIX 5L Distributions
Power CDROM The distribution CDROM that IBM provides to customers has three
Distributions boot images: one for the RS6K computers, a second for the RSPC
computers, and a third for CHRP (/ppc/chrp/bootfile.exe). RS6K, RSPC,
and CHRP UP computers can all use the MP kernel, which is the method
implemented for distribution media that goes to customers. In other words,
when a customer receives boot/install media from IBM, there is no need to
determine whether the system is UP or MP; the boot image is created
using the MP kernel. The UP kernel is more efficient on uniprocessor
systems, but the strategy of a single boot image for both hardware
platform types lowers distribution cost and is more convenient for
customers.
IA-64 CDROM
Distributions
Checkpoint
Introduction Take a few minutes to answer the following questions. We will review the
questions as a group when everyone has finished.
Quiz 1. What is the name of the file used as a SOFTROS on CHRP systems?
3. What are the common functions of the ROS, SOFTROS and EFI boot
loader?
Instructor Notes
Introduction This section explains the boot mechanism used by Power family
systems.
Boot overview When the system is powered on, the ROS or the firmware looks for the
boot record on the device pointed to by the bootlist to find the boot entry
point.
The softros on RSPC and CHRP executes and, if needed, uncompresses
the boot image using the bootexpand process.
It then loads the kernel, which initializes itself.
The kernel then calls init (in fact /usr/lib/boot/ssh at this stage).
The ssh then calls rc.boot for PHASE I and PHASE II, which are specific to
each boot device type.
Then init executes rc.boot phase 3 and the remaining common code in
rc.boot for disk and network boot devices.
Boot diagram The following summarizes the high-level boot process shown in the
original flowchart:
1. The system ROS or firmware executes.
2. The boot record is read from the boot device.
3. If the boot image is compressed, bootexpand uncompresses it.
4. The kernel initializes.
5. The kernel calls init (/usr/lib/boot/ssh).
6. The init ssh calls rc.boot PHASE I & II.
7. init exits to newroot.
8. init calls rc.boot PHASE III from inittab and processes the rest of the
inittab entries.
The same chart shows the boot disk layout: bootrecord, VGDA, boot
logical volume (softros on chrp and rspc, bootexpand, compressed kernel,
compressed RAM filesystem, base customized data) and the rest of the
disk.
bootrecord A 512-byte block containing the size and location of the boot image. The
boot record is the first block on a disk or CDROM and is therefore
separated from the boot image. The boot image on a disk is placed in the
boot logical volume, which is a reserved contiguous area.
kernel The AIX 32-bit UP, 32-bit MP or 64-bit MP kernel, to which control
passes after expansion by bootexpand. The kernel initializes itself and
then passes control to the simple shell init (ssh) in the RAM filesystem.
RAM Filesystem used during the boot process that contains programs and data
filesystem for initializing devices and subsystems in order to install AIX, execute
diagnostics, or to access and bring up the rest of AIX.
base Area of the hard disk boot logical volume containing user configured
customized ODM device configuration information that is used by the system
data configuration process.
Introduction On Power systems, the boot record is located at the beginning of the boot
device and contains the following information:
• The IPL record
• The boot partition table used by chrp and rspc systems.
IPL record The following table describes the content of the boot record.
description
boot partition The boot record contains four partition table entries starting at offset
table 0x1be. Each entry contains the following information:
boot partition The RS6K platform doesn’t use a boot partition table. The four boot
tables entries partition table entries are used for:
• CHRP boot images
• CHRP and First RSPC boot image
• CHRP and Second RSPC boot image
• CHRP Third RSPC boot image
Example The following chart represents an AIX 5L boot record from a CHRP
system. It was obtained using:
od -Ax -x /dev/hdisk0|pg
0000000 c9c2 d4c1 0000 0000 0000 0000 0000 0000
0000010 0000 0000 0000 0000 0000 0000 0000 0000
0000020 0000 0000 0000 2cc1 0000 0000 0000 1100
0000030 0000 0000 0000 0000 0000 0000 0000 0000
0000040 0100 0100 0000 3cdc 0000 3cdc 0000 0000
0000050 0000 0000 0000 0000 0000 0000 0000 0000
0000060 0000 0000 0000 2cc1 0000 0000 0000 1100
0000070 0000 0000 0000 0000 0000 0000 0000 0000
0000080 0007 1483 229d 0662 0000 0000 0000 0000
0000090 0000 0000 0000 0000 0000 0000 0000 0000
The margin labels of the original chart identify, in this dump, the IBMA
signature (c9c2 d4c1 in EBCDIC), base_cs_start, boot_lv_start,
boot_code_len, base_cn_length, base_cs_length, PVID, serv_code_length,
base_cn_start, ser_lv_start and, further down at offset 0x1be, the
boot_partition_table and BOOT_SIGNATURE.
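The leading bytes c9c2 d4c1 are the EBCDIC encoding of the IPL record signature “IBMA”, which can be verified with a couple of lines of Python (cp037 is a standard EBCDIC code page):

```python
# Decode the first four bytes of the boot record dump from EBCDIC.
sig = bytes.fromhex("c9c2d4c1")
print(sig.decode("cp037"))  # -> IBMA
```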
Instructor Notes
Little endian The RBA and sector count information in the boot partition table is in
format little-endian format.
To obtain the actual values, you need to swap the two bytes of each
16-bit word as displayed by the od command.
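The swap can be illustrated in Python (the sample value is only an arithmetic example, not taken from a real boot record):

```python
def swap16(word):
    """Swap the two bytes of a 16-bit word as printed by od -x."""
    return ((word & 0xFF) << 8) | ((word >> 8) & 0xFF)

print(hex(swap16(0x1100)))  # -> 0x11
```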
Introduction Depending on the architecture, the boot image will not always contain the
same elements, due to the needs of the ROS and firmware specifications.
RS6K boot The rs6k platform doesn’t need a softros emulation, so the boot image
image starts with the bootexpand program. The bootexpand program is loaded
first to uncompress the kernel and the RAMFS.
RSPC boot On rspc, the aixmon_rspc softros is located at the beginning of the boot
image image, but the xcoff header is replaced by a hints structure as defined in
/usr/include/sys/boot.h. So an RSPC boot image will contain the following
sections:
sections :
• The hints structure
• The aixmon_rspc file stripped of its xcoff header, in fact starting at
its entry point
• The bootexpand program
• The compressed kernel
• The compressed RAMFS
• The saved base customization.
CHRP boot On chrp, the aixmon_chrp softros is located at the beginning of the boot
image image, but the xcoff format is replaced by an ELF format. So a CHRP boot
image will contain:
• The ELF structure
• The aixmon_chrp file stripped of its xcoff header, in fact starting at
its entry point.
• The bootexpand program
• The compressed kernel
• The compressed RAMFS
• The saved base customization.
RSPC boot The following output represents the hints header obtained with the
image following example command:
example # dd if=<boot_disk> bs=512 skip=<RBA> count=1 | od -Ax -x
introduction On chrp systems, the aixmon xcoff header is replaced by an ELF header.
The aixmon_chrp file is copied to the boot image after the ELF header,
starting at its entry point.
ELF header The following table describes the ELF header structure:
structure
description
size name description
16 e_ident ELF identification
2 e_type object file type
2 e_machine architecture
4 e_version object file version
4 e_entry entry point
4 e_phoff prog hdr byte offset
4 e_shoff section hdr byte offset
4 e_flags processor specific flags
2 e_ehsize ELF header size
2 e_phentsize prog hdr table entry size
2 e_phnum prog hdr table entry count
2 e_shentsize section header size
2 e_shnum section header entry count
2 e_shstrndx sect name string tbl idx
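The field widths in the table match a standard 32-bit ELF header, so it can be unpacked with Python's struct module (a sketch: byte order is hard-coded to little-endian here, whereas a real parser would first check e_ident[EI_DATA]):

```python
import struct

# 16-byte e_ident, two 2-byte fields, five 4-byte fields, six 2-byte
# fields: the same layout as the table above (52 bytes in total).
ELF32_HDR = struct.Struct("<16sHHIIIIIHHHHHH")
FIELDS = ("e_ident", "e_type", "e_machine", "e_version", "e_entry",
          "e_phoff", "e_shoff", "e_flags", "e_ehsize", "e_phentsize",
          "e_phnum", "e_shentsize", "e_shnum", "e_shstrndx")

def parse_elf32_header(raw):
    """Unpack the first 52 bytes of an ELF image into a field dict."""
    return dict(zip(FIELDS, ELF32_HDR.unpack_from(raw)))
```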
Note, load 1 The following table describes the structure used to format note, loader 1
and load2 and loader 2 segments :
segments
descriptions
Note data The following table represent the note data description structure :
description
Boot loader The following table describes the boot loader structure :
parameters
description
size name description
4 timestamp date when the boot image was created
4 bootimage_size equivalent to the number of sectors for the
blv found in the bootrecord
4 boot_loader_size size of the aixmon in bytes
4 inst_offset jump offset in boot image
4 rmalloc_size Percent of memory for kernel heap
4 reserved1
4 reserved2
4 reserved3
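Since the table is eight consecutive 4-byte fields, the parameter block can be unpacked the same way (a sketch; big-endian byte order is an assumption for Power, and the field values in the usage below are invented):

```python
import struct

# Eight 4-byte fields, in the order given by the table above.
BOOT_LDR = struct.Struct(">8I")
PARAM_FIELDS = ("timestamp", "bootimage_size", "boot_loader_size",
                "inst_offset", "rmalloc_size", "reserved1", "reserved2",
                "reserved3")

def parse_boot_loader_params(raw):
    """Unpack the 32-byte boot loader parameter block into a field dict."""
    return dict(zip(PARAM_FIELDS, BOOT_LDR.unpack_from(raw)))
```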
Exercise
Introduction This exercise shows you how to locate the different parts of the boot
image using the boot record.
Procedure Follow this procedure to locate the main parts of the boot image.
Step Action
1 Locate the boot disk using :
# bootinfo -b
2 Determine the architecture of your system using :
# bootinfo -p
3 Find the boot record located at the beginning of the disk
found in step 1 using :
# dd if=<boot_disk> bs=512 count=1 |od -Ax -x
4 • On RSPC or CHRP, locate in the boot partition table the
RBA and sectors from output of step 3.
• On RS6K, locate in the record, the boot_prg_start and
boot_code_length
5 Create a file using the offset and sector count found in
step 4 using:
# dd if=<boot_disk> bs=512 skip=<offset>
count=<sectors> of=/tmp/myfile
6 Using the what command, try to find what is included in this
file.
What is missing from the what output?
Why?
7 Create a file starting at the offset found in step 4 plus the
size of the boot loader:
# dd if=<boot_disk> bs=1
skip=<(offset*512)+boot_loader_size>
count=512 of=/tmp/myfile2
8 What is myfile2?
9 Using the results from step 3, locate the base customization
sector start and length; use these values to create a new
file:
# dd if=<boot_disk> bs=512 skip=<base_cn_start>
count=<base_cn_length> of=/tmp/myfile3
10 Create a directory <dir1>, copy /etc/objrepos/* to dir1, then run:
# /usr/lib/boot/restbase -o myfile3 -d dir1 -v
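Because dd's skip argument counts blocks of bs bytes, mixing a sector offset with a byte-sized boot_loader_size is easiest with bs=1. A small helper (hypothetical, for illustrating the arithmetic in step 7) computes byte-granular dd arguments:

```python
SECTOR = 512  # assumed sector size, as used throughout this exercise

def dd_after_loader(rba_sectors, loader_size_bytes, grab_bytes=512):
    """Build a dd command that reads `grab_bytes` starting just past the
    boot loader inside the boot image."""
    skip = rba_sectors * SECTOR + loader_size_bytes
    return ("dd if=<boot_disk> bs=1 skip=%d count=%d of=/tmp/myfile2"
            % (skip, grab_bytes))
```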
Instructor Notes
ROS On RS6K platforms, the hardware ROS performs some basic hardware
configuration and tests, and creates the IPL Control Block before
transferring control to the kernel’s entry point.
Softros The RSPC and CHRP families of computers require a boot image with
special software known as SOFTROS, which provides functions that AIX
requires but that are not provided by the hardware firmware. The
SOFTROS performs some basic hardware configuration and tests, and
also sets up some data structures to provide an environment for AIX that
more closely resembles the environment provided by RS6K system ROS.
On CHRP systems the firmware device tree is also appended to the IPL
Control Block. Then the SOFTROS transfers control to the kernel’s entry
point.
IPLCB on Power
Definition The IPLCB (Initial Program Load Control Block) defines the RAM-resident
interface between the IPL boot process and the operating system.
The ROS or softros initializes the IPLCB structure using interfaces to
the firmware or ROS (on the RS6K platform).
When loaded, the kernel uses the IPLCB structure to initialize its
runtime structures.
IPLCB The following screen output shows the IPLCB on a CHRP system captured
directory using the kdb iplcb -dir subcommand:
example on a
CHRP system IPL directory [10000080]
ipl_control_block_id.........ROSIPL
ipl_cb_and_bit_map_offset...00000000 ipl_cb_and_bit_map_size....00008898
bit_map_offset..............000087A8 bit_map_size...............00000007
ipl_info_offset.............000002E8 ipl_info_size..............00000598
iocc_post_results_offset....00000000 iocc_post_results_size.....00000000
nio_dskt_post_results_offset00000000 nio_dskt_post_results_size.00000000
sjl_disk_post_results_offset00000000 sjl_disk_post_results_size.00000000
scsi_post_results_offset....00000000 scsi_post_results_size.....00000000
eth_post_results_offset.....00000000 eth_post_results_size......00000000
tok_post_results_offset.....00000000 tok_post_results_size......00000000
ser_post_results_offset.....00000000 ser_post_results_size......00000000
par_post_results_offset.....00000000 par_post_results_size......00000000
rsc_post_results_offset.....00000000 rsc_post_results_size......00000000
lega_post_results_offset....00000000 lega_post_results_size.....00000000
keybd_post_results_offset...00000000 keybd_post_results_size....00000000
ram_post_results_offset.....00000000 ram_post_results_size......00000000
sga_post_results_offset.....00000000 sga_post_results_size......00000000
fm2_post_results_offset.....00000000 fm2_post_results_size......00000000
net_boot_results_offset.....00000000 net_boot_results_size......00000000
csc_results_offset..........00000000 csc_results_size...........00000000
menu_results_offset.........00000000 menu_results_size..........00000000
console_results_offset......00000000 console_results_size.......00000000
diag_results_offset.........00000000 diag_results_size..........00000000
rom_scan_offset.............00000000 rom_scan_size..............00000000
sky_post_results_offset.....00000000 sky_post_results_size......00000000
global_offset...............00000000 global_size................00000000
mouse_offset................00000000 mouse_size.................00000000
vrs_offset..................00000000 vrs_size...................00000000
taur_post_results_offset....00000000 taur_post_results_size.....00000000
ent_post_results_offset.....00000000 ent_post_results_size......00000000
vrs40_offset................00000000 vrs40_size.................00000000
gpr_save_area1............@ 10000178
system_info_offset..........00000880 system_info_size...........0000009C
buc_info_offset.............0000091C buc_info_size..............00000150
processor_info_offset.......00000A6C processor_info_size........00000310
fm2_io_info_offset..........00000000 fm2_io_info_size...........00000000
processor_post_results_off..00000000 processor_post_results_size00000000
system_vpd_offset...........00000000 system_vpd_size............00000000
mem_data_offset.............00000000 mem_data_size..............00000000
l2_data_offset..............00000D7C l2_data_size...............000000C0
fddi_post_results_offset....00000000 fddi_post_results_size.....00000000
golden_vpd_offset...........00000000 golden_vpd_size............00000000
nvram_cache_offset..........00000000 nvram_cache_size...........00000000
user_struct_offset..........00000000 user_struct_size...........00000000
residual_offset.............00000E3C residual_size..............0000776C
numatopo_offset.............00000E3C numatopo_size..............00000000
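Every entry in the IPL directory is an offset/size pair into the IPLCB image, with 0/0 meaning the section is absent on this platform. Carving a section out of a raw IPLCB dump is therefore a simple slice (a sketch using a fabricated zero-filled image; the ipl_info offset and size are the ones shown in the directory above):

```python
def extract_section(iplcb, offset, size):
    """Slice one section out of a raw IPLCB image using an offset/size
    pair from the IPL directory; 0/0 means the section is not present."""
    if offset == 0 and size == 0:
        return None
    return iplcb[offset:offset + size]

iplcb = bytes(4096)                              # fake, zero-filled image
ipl_info = extract_section(iplcb, 0x2E8, 0x598)  # ipl_info section
```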
Checkpoint
Introduction Take a few minutes to answer the following questions. We will review the
questions as a group when everyone has finished.
Instructor Notes
Introduction This section explains the boot mechanism used by the IA-64 platform.
Definitions EFI stands for Extensible Firmware Interface. EFI provides a standard
interface between the hardware and the operating system on IA-64
platforms.
Boot overview When the system is powered on, the EFI loads first.
EFI loads the BIOS for devices that need it.
EFI then prompts to enter the setup for a timeout period.
EFI then prompts with the EFI boot menu for another timeout period, after
which it scans the bootlist in order to find a boot device.
The EFI boot loader prompts for the boot loader menu and, after the
timeout or exit from the menu, initializes the IPL Control Block.
Then it locates and loads the kernel, which initializes itself.
The kernel then calls init (in fact /usr/lib/boot/ssh at this stage).
The ssh then calls rc.boot for Phase I and Phase II, which are specific to
each boot device type.
Then init executes rc.boot Phase III and the remaining common code in
rc.boot for disk and network boot devices.
If no boot device is found, EFI starts the EFI Shell on IA-64 platforms that
support it.
Boot diagram The following summarizes the high-level boot process shown in the
original flowchart:
1. The EFI firmware executes and loads any needed BIOS.
2. EFI prompts for setup; on timeout (or an OS boot menu request) it
continues.
3. The EFI boot manager menu is displayed; if no valid boot device is
found, the EFI Shell is started.
4. The AIX boot loader runs; if a key is entered during the timeout, the
AIX boot loader menu is displayed.
5. The kernel initializes.
6. The kernel calls init (/usr/lib/boot/ssh).
7. The init ssh calls rc.boot PHASE I & II.
8. init exits to newroot.
9. init calls rc.boot PHASE III from inittab and processes the rest of the
inittab entries.
Boot image The original chart shows the layout of an AIX 5L on IA-64 boot disk
overview (hdisk0_all):
• the PMBR, EFI Partition Header and partition entries at the start of the
disk
• the Physical Volume partition (hdisk0), containing the VGDA, hd5 (with
the base customized data) and the rest of the disk
• the IA-64 System partition (hdisk0_s0), containing the EFI boot loader
PMBR, EFI On the IA-64 platform, AIX 5L must be aware of EFI disk partitioning.
Partition
Header and During installation, two partitions are created on the target disk
entries (hdisk0_all):
• A Physical Volume partition (hdisk0 in the AIX environment), known as
a block device in the EFI environment (blkXX).
• An IA-64 System partition (hdisk0_s0 in the AIX environment), known
as an IA-64 System partition in the EFI environment (fsXX).
kernel On the IA-64 platform, the 64-bit kernel (unix_ia64) can be used as the
kernel for either UP or MP systems. The kernel initializes itself and then
passes control to the simple shell init (ssh) in the RAM filesystem.
RAM Filesystem used during the boot process that contains programs and data
filesystem for initializing devices and subsystems in order to install AIX, execute
diagnostics, or to access and bring up the rest of AIX.
base Area of the hard disk boot logical volume containing user configured ODM
customized device configuration information that is used by the system configuration
data process.
EFI boot The EFI boot loader resides in an IA-64 System Partition, placed
loader physically after the Physical Volume Partition by the installation process.
Introduction At boot time, EFI will prompt for the EFI boot manager menu to be entered
for a timeout period.
The timeout period is customizable via the boot maintenance menu.
boot manager At boot time, the boot manager displays the bootlist and prompts for a
timeout period.
If the timeout is reached, the boot manager scans the bootlist in boot
order to find a valid boot device.
If a key is entered before the timeout period, the user will be able to:
• select a boot device from the list to boot for this session
• start the EFI Shell, on platforms that support it
• enter the boot maintenance manager
boot The boot maintenance manager menu will allow the administrator to :
maintenance
manager menu • boot from a file
• add/delete boot options
• change boot order
• manage boot next setting
• set autoboot timeout
• select active console output devices (output, input and error)
• do a cold reset
Introduction The EFI Shell allows you to configure the boot process used by the IA-64
platform. The main functions are to:
• locate and identify different boot devices
• set environment variables
• use debugging subcommands
• boot from the selected boot device
EFI Shell The EFI Shell startup displays information about the current EFI level
startup and device mapping, as follows:
example
EFI version x.xx [xx.xx] Build flags : EIF64 Running on Merced EFI_DEBUG
EFI IA-64 SDV/FDK (BIOS CallBacks) [Fri Mar 31 13:21:32 2000] - INTEL
Cache Enabled. This image Main entry is at address 000000003F2BA000
Stack = 000000003F2B6FF0 BSP = 000000003F293000
INT Stack = 000000003F292FF0 INT BSP = 000000003F26F000
EFI Shell version x.xx [xx.xx]
Device mapping table
fs0 : VenHw(Unknown Device:80)/HD(Part1,Sig0CBCBA54)
blk0 : VenHw(Unknown Device:01)/HD
blk1 : VenHw(Unknown Device:80)/HD
blk2 : VenHw(Unknown Device:81)/HD
blk3 : VenHw(Unknown Device:ff)/HD
blk4 : VenHw(Unknown Device:80)/HD(Part1,Sig0CBCBA54)
blk5 : VenHw(Unknown Device:80)/HD(Part2,Sig0CBCBA54)
EFI Shell In the EFI Shell you will be able to use the following subcommands:
subcommands
Introduction The AIX 5L EFI boot loader provides the interface between EFI and the
kernel.
On disk drives, the AIX boot loader is located in the system partition.
Before loading the kernel, the boot loader prompts the user to enter the
boot loader menu.
The boot loader then makes use of the EFI interface to initialize the IPL
Control Block.
The boot loader then locates the kernel, which resides in hd5, itself
contained in the AIX PV partition.
Finally, the boot loader passes control to the kernel entry point.
boot loader The boot loader makes use of the EFI boot services to load file
and EFI images such as the kernel, the RAM filesystem file and the base
interactions customized data, and to locate various system tables such as the System
Abstraction Layer (SAL) System Table (SST) and the Advanced
Configuration and Power Interface (ACPI) Specification tables. The boot
loader then creates the Initial Program Load Control Block (IPLCB) and
sets up Translation Registers (TR) before transferring control to the
kernel’s entry point.
EFI boot The boot loader menu can be used to set parameters that may affect
loader menu kernel loading and the operating environment, such as:
• enable the kernel debugger
• invoke the kernel debugger
• override the RMALLOC memory reservation
• set the boot loader debug flag
• set the service/diagnostics flag
• select the amount of memory to enable
• select the number of CPUs to use
• toggle single/multi dispersal mode
Introduction The IPLCB (Initial Program Load Control Block) defines the RAM-resident
interface between the IPL boot process and the operating system. The
boot loader initializes the IPLCB structure using interfaces to EFI.
When loaded, the kernel uses the IPLCB structure to initialize its
runtime structures.
IPLCB The following screen shows the IPLCB directory on an IA-64 system
directory captured using the IADB iplcb -dir subcommand:
example on an
IA64 system > iplcb -dir
Directory Information
ipl_control_block_id......................= IA64_IPL
ipl_cb_and_bit_map_offset.................= 0x0
ipl_cb_and_bit_map_size...................= 0x7F0
bit_map_offset............................= 0x448
bit_map_size..............................= 0x27
ipl_info_offset...........................= 0xD8
ipl_info_size.............................= 0x7C
system_info_offset........................= 0x3D8
system_info_size..........................= 0x50
processor_info_offset.....................= 0x250
processor_info_size.......................= 0x188
io_xapic_info_offset......................= 0x428
io_xapic_info_size........................= 0x18
handoff_info_offset.......................= 0x158
handoff_info_size.........................= 0xF0
platform_int_info_offset..................= 0x440
platform_int_size.........................= 0x8
residual_offset...........................= 0x0
residual_size.............................= 0x0
Checkpoint
Introduction Take a few minutes to answer the following questions. We will review the
questions as a group when everyone has finished.
Instructor Notes
Introduction The main goal here is to get the devices configured and the ODM
initialized.
Hard disk The original flowchart for the hard disk boot phase I process:
Phase I
diagram
• restbase restores the base configuration from the boot disk
• if the restbase return code is not 0: led 548
• otherwise: led 510, then led 511, exit 0
Introduction The main objective in hard disk boot phase II is to varyon rootvg and
mount standard filesystems.
Hard disk The original flowchart for the hard disk boot phase II process:
Phase II
diagram
• led 511
• ipl_varyon -v
• if the ipl_varyon return code is not 0: led 552, 554 or 556
• otherwise: led 517
• if the key is in the service position, or a dump is present in hd6:
execute the service procedure
• otherwise: copy /etc/vg and objrepos to disk, merge devices, unmount
filesystems, remount filesystems
• led 553, exit 0
Introduction The main objective in hard disk boot phase III is to mount the runtime
/tmp, sync rootvg and then fall through to the phase III common process.
Hard disk The following chart represents the hard disk boot phase III process.
Phase III
diagram
Introduction The main objective of the CDROM boot process is to configure the
devices needed for installation and maintenance procedures and start the
bi_main process.
CDROM boot The original flowchart for CDROM boot phases I, II and III is, by phase
phases I,II and number:
III diagram
• Phase 1: led 510, run the configuration manager (phase I), led 511,
exit 0
• Phase 2: led 512, configure the remaining devices needed for install,
mount the CDROM SPOT, led 517, exec bi_main
• Phase 3: exit 0
Introduction The main objective of the Tape boot process is to configure the devices
needed for installation and maintenance procedures and start the
bi_main process.
Tape boot The original flowchart for Tape boot phases I, II and III is, by phase
phases I,II and number:
III diagram
• Phase 1: led 510, run the configuration manager (phase I), exit 0
• Phase 2: led 512, run the configuration manager (phase II),
exec bi_main
• Phase 3: exit 0
Introduction The main objective of the Network boot process is to configure devices,
configure additional network options (network address, mask and default
route) and run the $RC_CONFIG script.
Network boot The original flowchart for Network boot phases I, II and III:
phases I,II and
III diagram
• if booting from atm0: configure ATM (pvc, svc and muxatmd);
otherwise configure the native network boot device (ifconfig)
• if the return code is not 0: led 607
• tftp the miniroot
• set the NIM environment
• create /etc/hosts and routes
• NFS-mount the SPOT
• run $RC_CONFIG from the SPOT
• exit 0
Introduction The common Phase III boot code is run for disk and network boot only.
Common boot The common boot phase III process:
Phase III
diagram
• ensure 1024K of free space in /tmp
• load streams modules
• fix the secondary dump device
• swapon hd6 if no dump is present
• run the savebase recovery procedure
• if the key is in the service position: run the configuration manager
phase III and disable the controlling tty; otherwise: clean the ODM for
alternate disk install and run the configuration manager phase II
• set up System Hang Detection
• run the graphical boot if needed
• run savebase
• clean unavailable ttys from inittab
• sync the files to hard disk
• run /etc/rc.B1 if it exists
• start the syncd daemon
• start the errdemon daemon
• clean /etc/locks and /etc/nologin
• start the mirrord daemon
• start the cfgchk daemon
• run diagsrv if supported by the platform
• “System initialization completed”, exit 0
Introduction As seen in the Network Boot Process (Phases I, II and III), these scripts
are run by rc.boot when booting from a network device, in phases I and II.
These scripts are located in the /usr/lib/boot/network directory.
They are loaded from the SPOT on the NIM server during the network
boot process.
Introduction The Object Data Manager (ODM) is widely used in AIX to store and retrieve
various system information.
For this purpose, AIX defines a number of standard ODM classes.
Any application can create and use its own ODM classes to manage its
own information.
Devices ODM The Devices classes are used by the configuration manager, device
Classes drivers and AIX device-related commands (lsdev, lsattr, lspv, lsvg ...).
The following table lists the Devices ODM classes and their definitions:
Class Definition
PdDv Predefined Devices
PdCn Predefined Connection
PdAt Predefined Attribute
PdAtXtd Extended Predefined Attribute
Config_Rules Configuration Rules
CuDv Customized Devices
CuDep Customized Dependency
CuAt Customized Attribute
CuDvDr Customized Device Driver
CuVPD Customized Vital Product Data
CuPart EFI partitions
CuPath Customized Paths
CuPathAt Customized Path Attributes
SWVPD ODM The SWVPD classes are used by fileset related commands like installp,
Classes instfix, lslpp, oslevel.
SWVPD is divided into three parts:
• root : classes are in /etc/objrepos
• usr : classes are in /usr/lib/objrepos
• share : classes are located in /usr/share/lib/objrepos
The following table lists the Software Vital Product Data ODM classes and
their definitions:
Class Definition
lpp The lpp object class contains information about the installed
software products, including the current software product
state.
inventory The inventory object class contains information about the files
associated with a software product.
history The history object class contains historical information about
the installation and updates of software products.
product The product object class contains product information about
the installation and updates of software products and their
prerequisites.
SRC ODM SRC classes are used by the srcmstr daemon and related commands: lssrc,
Classes startsrc, stopsrc and chssys.
The following table lists the System Resource Controller ODM classes and
their definitions:
Class Definition
SRCsubsys The subsystem object class contains the descriptors for all
SRC subsystems. A subsystem must be configured in this
class before it can be recognized by the SRC.
SRCsubsvr An object must be configured in this class if a subsystem
has subservers and the subsystem expects to receive
subserver-related commands from the srcmstr daemon.
SRCnotify This class provides a mechanism for the srcmstr daemon
to invoke subsystem-provided routines when the failure of a
subsystem is detected.
SRCextmeth
SMIT ODM The SMIT ODM classes are used by the smit and smitty commands.
Classes
The following table lists the SMIT ODM classes and their definitions:
RAS ODM The RAS classes are used by the errdaemon, shdaemon, shconf and alog
Classes commands.
The following table lists the RAS ODM classes and their definitions:
Class Definition
errnotify Used by errlog notification process
SWservAt Used by errorlog, system dumps, System
Hang Detection and alog
The following table lists the Diagnostics ODM classes and their definitions:
Class Definition
PDiagRes Predefined Diagnostic Resource Object Class
PDiagAtt Predefined Diagnostic Attribute Device Object Class
PDiagTask Predefined Diagnostic Task Object Class
CDiagAtt Customized Diagnostic Attribute Object Class
TMInput Test Mode Input Object Class
MenuGoal Menu Goal Object Class
FRUB Fru Bucket Object Class
FRUs Fru Reporting Object Class
DAVars Diagnostic Application Variables Object Class
PDiagDev Predefined Diagnostic Devices Object Class
DSMOptions Diagnostic Supervisor Menu Options Object Class
ODM The following table lists the ODM commands and their usage:
commands
Command Definition
odmadd Adds objects to an object class. The odmadd command
takes an ASCII stanza file as input and populates object
classes with objects found in the stanza file.
odmchange Changes specific objects in a specified object class.
odmcreate Creates empty object classes. The odmcreate command
takes an ASCII file describing object classes as input and
produces C language .h and .c files to be used by the
application accessing objects in those object classes.
odmdelete Removes objects from an object class.
odmdrop Removes an entire object class.
odmget Retrieves objects from object classes and puts the object
information into odmadd command format.
odmshow Displays the description of an object class. The
odmshow command takes an object class name as input
and puts the object class information into odmcreate
command format.
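As an illustration of the odmadd input format, the following is a minimal sketch of an ASCII stanza file. Each stanza names the object class, followed by descriptor = value lines; the descriptor names shown match the CuAt class described above, but the values are illustrative examples, not taken from a live system:

```
CuAt:
        name = "sys0"
        attribute = "maxuproc"
        value = "256"
        type = "R"
        generic = "DU"
        rep = "nr"
        nls_index = 20
```

Such a file would be loaded with the odmadd command after setting ODMDIR to the appropriate object repository.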
ODM The following table lists the ODM subroutines and their use:
subroutines
Subroutine Definition
odm_add_obj Adds a new object to the object class.
odm_change_obj Changes the contents of an object.
odm_close_class Closes an object class.
odm_create_class Creates an empty object class.
odm_err_msg Retrieves a message string.
odm_free_list Frees memory allocated for the odm_get_list
subroutine.
odm_get_by_id Retrieves an object by specifying its ID.
odm_get_first Retrieves the first object that matches the specified
criteria in an object class.
odm_get_list Retrieves a list of objects that match the specified
criteria in an object class.
odm_get_next Retrieves the next object that matches the specified
criteria in an object class.
odm_get_obj Retrieves an object that matches the specified criteria
from an object class.
odm_initialize Initializes an ODM session.
odm_lock Locks an object class or group of classes.
odm_mount_class Retrieves the class symbol structure for the specified
object class.
odm_open_class Opens an object class.
odm_rm_by_id Removes an object by specifying its ID.
odm_rm_obj Removes all objects that match the specified criteria
from the object class.
odm_run_method Invokes a method for the specified object.
odm_rm_class Removes an object class.
odm_set_path Sets the default path for locating object classes.
odm_unlock Unlocks an object class or group of classes.
odm_terminate Ends an ODM session.
ODM paths As the ODM classes can be found in three paths (root, usr and share), the
user must decide which path to use before running ODM commands or
ODM subroutines.
For ODM commands, the user can set the path using:
# export ODMDIR=/usr/share/lib/objrepos
In a C program, the user should use:
odm_set_path("/usr/lib/objrepos");
Introduction It can be useful to quickly retrieve the log files used during boot or
installation to help solve problems.
The alog command can be used to recover these system logs.
log types The alog command is used by installation and boot processes to log
information or errors for the following topics:
• boot : log for the boot process
• bosinst : log used for the AIX installation process
• console : log used to store console messages
• nim : log used to store NIM messages
• dumpsymp : used to store dump symptom messages
example The following example outputs the last 15 lines of the boot log:
# alog -t boot -o|tail -15
Saving Base Customize Data to boot disk
Starting the sync daemon
Starting the error daemon
A device that was previously detected could not be found.
Run "diag -a" to update the system configuration.
System initialization completed.
Starting Multi-user Initialization
Introduction For debugging boot problems, it can be useful to get a detailed output of
the boot process, including rc.boot output.
entering boot To enter boot debugging, the administrator should first make sure the
debug KDB kernel debugger will be loaded and invoked at boot time using:
# bosboot -I -ad /dev/ipldevice
The next reboot will launch the KDB on the native serial connection.
At the KDB prompt you will need to toggle the rc.boot debug flag and
optionally the exec debug flag in order to get rc.boot output on the native
serial connection.
Note that the exec tracing will continue after the end of rc.boot.
Introduction For debugging boot problems, it can be useful to get a detailed output of
the boot process, including rc.boot output.
Prerequisites In order to get boot debug output you will need a device (TTY, Thinkpad or
another system's serial port) connected to the native serial port and
configured at 115200-8-N-1.
Step Action
1 If you want the IADB to be invoked at boot time, use :
# bosboot -I -ad /dev/ipldevice
You can also choose not to do this and set the debugger flags manually in
the boot loader menu
2 Boot or reboot the system
3 If you are using another system as the TTY, you may want
to set some tracing/capture options to capture the
debugging output.
4 If the autoboot flag is not set in EFI set the file system and
boot using :
Shell> fs0:
fs0> boot
5 The boot loader menu should come up with the debugger
flags set to “ON” if you ran bosboot in step 1.
Otherwise, press a key to enter the boot loader menu and
set the debugger flags, then exit the boot loader menu
6 The boot loader will load the IADB, which will prompt on the
native serial port.
At the IADB prompt type :
CPU0> set dbgmsg=on
CPU0> set exectrace=on
CPU0> go
boot The following example shows the beginning of what you can see on the
debugging native serial port when debugging the boot process:
output
example MEDIEVAL DEBUGGER ENTERED interrupt.
IP->E00000000001D2F2 brkpoint()+2: { .mfi
0: nop.m 0x100001
;; }
>CPU0> set dbgmsg=on
>CPU0> set exectrace=on <== here we ask for debugging
>CPU0> go <== here we go
See Ya!
Performing Hostile Takeover of the System Console...
AIX Version 5.0
Starting CPU#001... done.
+ ODMSTRNG=attribute=keylock and value=service
+ HOME=/
+ LIBPATH=/usr/lib:/lib:/usr/sbin:/etc:/usr/bin
+ SHOWLED=showled
+ SYSCFG_PHASE=BOOT
+ export HOME LIBPATH ODMDIR PATH SHOWLED SYSCFG_PHASE
+ umask 077
+ set -x
+ [ 1 -ne 1 ]
+ PHASE=1
+ + bootinfo -p
PLATFORM=ia64
+ [ ! -x /usr/lib/boot/bin/bootinfo_ia64 ]
+ [ 1 -eq 1 ]
+ 1> /)
+ + bootinfo -t
BOOTYPE=3
+ [ 0 -ne 0 ]
+ [ -z 3 ]
+ unset pdev_to_ldev undolt
Packaging Changes
Introduction The lpp packaging has been reviewed to reflect the need for platform-
dependent packages.
Packaging The installp, bffcreate, inutoc and instfix commands are updated to reflect
commands these changes.
By default, packaging commands will process only packages related to the
platform where the command is run.
A “-M” flag has been added to these commands; it accepts the following
sub-options:
• I : To process Intel-related packages
• R : To process Power-related packages
• N : To process platform-neutral packages
• A : To process all kinds of packages
installp The installp command will only accept the -M flag with the -l or -L options.
options
The installp -L output will include platform information.
bffcreate The bffcreate command will accept all -M sub-options to allow transit of
options packages regardless of the current platform. This is needed for NIM
operations.
instfix options The instfix command, like installp, will only accept the -M flag
when used in conjunction with -T (the list flag).
Checkpoint
Introduction Take a few minutes to answer the following questions. We will review the
questions as a group when everyone has finished.
Instructor Notes
Introduction /proc is a file system that provides access to the state of each active
process and Light Weight Process (LWP) in the system.
/proc The contents of the /proc filesystem have the same appearance as any
filesystem other files and directories in a Unix filesystem. Each top-level entry in the
/proc directory is a sub-directory named by the decimal number
corresponding to a process ID, and the owner of each is determined by the
user ID of the process.
Access to process state is provided by additional files contained within
each sub-directory; this hierarchy is described more completely below.
Except where otherwise specified, ‘‘/proc file’’ refers to a non-directory
file within the hierarchy rooted at /proc.
Filesystem The directory structure for the /proc directory is described below. The pid
hierarchy represents the process ID number and the lwp# represents the light-weight
process number.
Accessing / Standard system call interfaces are used to access /proc files: open(2),
proc files close(2), read(2), and write(2). Most files describe process state and can
only be opened for reading. An open for writing allows process control; a
read-only open allows inspection, but not control.
Types of Files
Introduction Listed below are descriptions of the files that are contained in the /proc
filesystem hierarchy. These files are described in more detail on the
following pages.
The as File
Introduction The as file contains the address-space image of the process and can be
opened for both reading and writing.
Accessing the lseek(2) is used to position the file at the virtual address of interest, and
file the address space can then be examined or changed through a read or write.
Introduction The ctl file is a write-only file to which structured messages are written
directing the system to change some aspect of the process’s state or
control its behavior in some way. The seek offset is not relevant when
writing to this file.
Control Individual LWPs also have associated lwpctl files. Process state changes
messages are effected through control messages written either to the ctl file of the
process or to a specific lwpctl file. All control messages consist of an int
naming the specific operation followed by additional data containing
operands (if any). The effect of a control message is immediately reflected
in the state of the process visible through appropriate status and
information files.
Multiple control messages can be combined in a single write(2) to a control
file, but no partial writes are permitted; that is, each control message
(operation code plus operands) must be presented in its entirety to the
write and not in pieces over several system calls.
Introduction The status file contains state information about the process and one of its
LWPs (chosen according to the rules described below).
File format The file is formatted as a struct pstatus containing the following members:
long pr_flags; /* Flags */
ushort_t pr_nlwp; /* Total number of lwps in the process */
sigset_t pr_sigpend; /* Set of process pending signals */
vaddr_t pr_brkbase; /* Address of the process heap */
ulong_t pr_brksize; /* Size of the process heap, in bytes */
vaddr_t pr_stkbase; /* Address of the process stack */
ulong_t pr_stksize; /* Size of the process stack, in bytes */
pid_t pr_pid; /* Process id */
pid_t pr_ppid; /* Parent process id */
pid_t pr_pgid; /* Process group id */
pid_t pr_sid; /* Session id */
timestruc_t pr_utime; /* Process user cpu time */
timestruc_t pr_stime; /* Process system cpu time */
timestruc_t pr_cutime; /* Sum of children’s user times */
timestruc_t pr_cstime; /* Sum of children’s system times */
sigset_t pr_sigtrace; /* Mask of traced signals */
fltset_t pr_flttrace; /* Mask of traced faults */
sysset_t pr_sysentry; /* Mask of system calls traced on entry */
sysset_t pr_sysexit; /* Mask of system calls traced on exit */
lwpstatus_t pr_lwp; /* "representative" LWP */
Member Description
pr_flags A bit mask holding flags (flags are described below)
pr_nlwp Total number of LWPs in the process
pr_brkbase Virtual address of the process heap
pr_brksize Size of process heap in bytes. The address formed by
the sum of these values is the process break (see
brk(2)).
pr_stkbase Virtual address of the process stack
pr_stksize Size of the process stack in bytes. Each LWP runs on a
separate stack; the process stack is distinguished in that
the operating system will grow it as necessary.
pr_pid Process ID
pr_ppid Parent process ID
pr_pgid Process group ID
pr_sid Session ID of the process
pr_utime User CPU time consumed by the process in seconds
and nanoseconds
pr_stime System CPU time consumed by the process in seconds
and nanoseconds
pr_cutime Cumulative user CPU time consumed by the process in
seconds and nanoseconds
pr_cstime Cumulative system CPU time consumed by the process
in seconds and nanoseconds
pr_sigtrace Set of signals that are being traced (see PCSTRACE)
pr_flttrace Set of hardware faults that are being traced (see
PCSFAULT)
pr_sysentry Set of system calls being traced on entry (see
PCSENTRY)
pr_sysexit Set of system calls being traced on exit (see PCSEXIT)
pr_lwp If the process is not a zombie, pr_lwp contains an
lwpstatus_t structure describing a representative LWP.
The contents of this structure have the same meaning as if
it were read from an lwpstatus file.
The pr_flags field is a bit mask that can contain the following flags:
Flag Description
PR_ISSYS System process (see PCSTOP)
PR_FORK Has its inherit-on-fork flag set (see PCSET)
PR_RLC Has its run-on-last-close flag set (see PCSET)
PR_KLC Has its kill-on-last-close flag set (see PCSET)
PR_ASYNC Has its asynchronous-stop flag set (see PCSET)
Multi-threaded When the process has more than one LWP, its representative LWP is
applications chosen by the /proc implementation. The chosen LWP is a stopped LWP
only if all the process’s LWPs are stopped, is stopped on an event of
interest only if all the LWPs are so stopped, or is in a PR_REQUESTED
stop only if there are no other events of interest to be found. The chosen
LWP remains fixed as long as all the LWPs are stopped on events of
interest and PCRUN is not applied to any of them.
When applied to the process control file, every /proc control operation that
must act on an LWP uses the same algorithm to choose which LWP to act
on. Together with synchronous stopping (see PCSET), this enables an
application to control a multiple-LWP process using only the process-level
status and control files if it so chooses. More fine-grained control can be
achieved using the LWP-specific files.
Introduction The psinfo file contains information about the process needed by the ps(1)
command. If the process contains more than one LWP, a representative
LWP (chosen according to the rules described for the status file) is used to
derive the status information.
File format The file is formatted as a struct psinfo containing the following members:
ulong_t pr_flag; /* process flags */
ulong_t pr_nlwp; /* number of LWPs in process */
uid_t pr_uid; /* real user id */
gid_t pr_gid; /* real group id */
pid_t pr_pid; /* unique process id */
pid_t pr_ppid; /* process id of parent */
pid_t pr_pgid; /* pid of process group leader */
pid_t pr_sid; /* session id */
caddr_t pr_addr; /* internal address of process */
long pr_size; /* size of process image in pages */
long pr_rssize; /* resident set size in pages */
timestruc_t pr_start; /* process start time, time since epoch */
timestruc_t pr_time; /* usr+sys cpu time for this process */
dev_t pr_ttydev; /* controlling tty device (or PRNODEV)*/
char pr_fname[PRFNSZ]; /* last component of exec()ed pathname*/
char pr_psargs[PRARGSZ]; /* initial characters of arg list */
struct lwpsinfo pr_lwp; /* "representative" LWP */
Platform Some of the entries in psinfo, such as pr_flag and pr_addr, refer to internal
specific data kernel data structures and should not be expected to retain their meanings
across different versions of the operating system. They have no meaning
to a program and are only useful for manual interpretation by a user aware
of the implementation details.
Representative pr_lwp describes the representative LWP chosen as described under the
LWP pstatus file above. If the process is a zombie, pr_nlwp and pr_lwp.pr_lwpid
are zero and the other fields of pr_lwp are undefined.
Introduction The map file contains information about the virtual address map of the
process. The file contains an array of prmap structures, each of which
describes a contiguous virtual address region in the address space of the
traced process.
Member Description
pr_vaddr Virtual address of the mapping within the traced process
pr_size Size of mapping in bytes
pr_mapname If not the empty string, contains the name of a file in the object
directory that can be opened for reading to yield a file descriptor
for the object to which the virtual address is mapped.
pr_off Offset within the mapped object (if any) to which the virtual
address is mapped
pr_mflags Protection and attribute flags (see below)
pr_filler For future use
Flag Description
MA_READ Mapping is readable by the traced process
MA_WRITE Mapping is writable by the traced process
MA_EXEC Mapping is executable by the traced process
MA_SHARED Changes to the mapping are shared with the mapped object
Contiguous A contiguous area of the address space having the same underlying
address space mapped object may appear as multiple mappings because of varying read,
write, execute, and shared attributes. The underlying mapped object does
not change over the range of a single mapping. An I/O operation to a
mapping marked MA_SHARED fails if applied at a virtual address not
corresponding to a valid page in the underlying mapped object. Reads and
writes to private mappings always succeed. Reads and writes to
unmapped addresses always fail.
Introduction The cred file contains a description of the credentials associated with the
process.
File format The file is formatted as a struct prcred containing the following members:
uid_t pr_euid; /* Effective user id */
uid_t pr_ruid; /* Real user id */
uid_t pr_suid; /* Saved user id (from exec) */
gid_t pr_egid; /* Effective group id */
gid_t pr_rgid; /* Real group id */
gid_t pr_sgid; /* Saved group id (from exec) */
uint_t pr_ngroups; /* Number of supplementary groups */
gid_t pr_groups[1]; /* Array of supplementary groups */
Introduction The sigact file contains an array of sigaction structures describing the
current dispositions of all signals associated with the traced process.
Signal numbers are displaced by 1 from array indexes, so that the action
for signal number n appears in position n-1 of the array.
lwp/lwpctl file
Introduction The lwpctl file is a write-only control file. The messages written to this file
affect only the associated LWP rather than the process as a whole (where
appropriate).
File format The lwp/lwpstatus file is formatted as a struct lwpstatus containing the
following members:
long pr_flags; /* Flags */
short pr_why; /* Reason for stop (if stopped) */
short pr_what; /* More detailed reason */
lwpid_t pr_lwpid; /* Specific LWP identifier */
short pr_cursig; /* Current signal */
siginfo_t pr_info; /* Info associated with signal or fault */
struct sigaction pr_action; /* Signal action for current signal */
sigset_t pr_lwppend; /* Set of LWP pending signals */
stack_t pr_altstack; /* Alternate signal stack info */
short pr_syscall; /* System call number (if in syscall) */
short pr_nsysarg; /* Number of arguments to this syscall */
long pr_sysarg[PRSYSARGS];/* Arguments to this syscall */
char pr_clname[PRCLSZ]; /* Scheduling class name */
ucontext_t pr_context; /* LWP context */
pfamily_t pr_family; /* Processor family-specific information */
Member Description
pr_flags A bit mask holding flags (described below)
pr_why Reason for LWP stop (if stopped). Possible values are listed
below.
pr_what More detailed reason for LWP stop. pr_why and pr_what
together describe the reason for a stopped LWP.
pr_lwpid Specific LWP identifier.
pr_cursig Names the current signal; that is, the next signal to be
delivered to the LWP.
pr_info When the LWP is in a PR_SIGNALLED or PR_FAULTED
stop, pr_info contains additional information pertinent to the
particular signal or fault. (See sys/siginfo.h)
pr_action Contains signal action information about the current signal
(see sigaction(2)). It is undefined if pr_cursig is zero.
pr_lwppend Identifies any synchronously-generated or LWP-directed
signals pending for the LWP. Does not include signals
pending at the process level.
pr_altstack Contains the alternate signal stack information for the LWP.
(see sigaltstack(2)).
pr_syscall Number of the system call, if any, being executed by the
LWP. It is nonzero if and only if the LWP is stopped on
PR_SYSENTRY or PR_SYSEXIT or is asleep within a
system call (PR_ASLEEP is set).
pr_nsysarg If pr_syscall is non-zero, pr_nsysarg is the number of
arguments to the system call
pr_sysarg Array of arguments to the system call.
pr_clname Contains the name of the scheduling class of the LWP.
pr_context Contains the user context of the LWP, as if it had called
getcontext(2). If the LWP is not stopped, all context values
are undefined.
pr_family Contains the CPU-family specific information about the
LWP. Use of this field is not portable across different
architectures.
Flag Description
PR_STOPPED LWP is stopped
PR_ISTOP LWP is stopped on an event of interest (see PCSTOP)
PR_DSTOP LWP has a stop directive in effect (see PCSTOP)
PR_STEP LWP has a single-step directive in effect
PR_ASLEEP LWP is in an interruptible sleep within a system call
PR_PCINVAL LWP program counter register does not point to a valid
address
Value Description
PR_REQUESTED Shows that the stop occurred in response to a
stop directive, normally because PCSTOP was
applied or because another LWP stopped on an
event of interest and the asynchronous-stop flag
(see PCSET) was not set for the process.
pr_what is unused in this case.
Introduction The lwp/lwpsinfo file contains information about the LWP needed by ps(1).
This information also is present in the psinfo file of the process for its
representative LWP if it has one.
File format The file is formatted as a struct lwpsinfo containing the following members:
Control Messages
Introduction Process state changes are effected through messages written to the ctl file
of the process or to the lwpctl file of an individual LWP.
Sending All control messages consist of an int naming the specific operation
control followed by additional data containing operands (if any). Multiple control
messages messages can be combined in a single write(2) to a control file, but no
partial writes are permitted; that is, each control message (operation code
plus operands) must be presented in its entirety to the write and not in
pieces over several system calls.
ENOENT Note that writing a message to a control file for a process or LWP that has
exited elicits the error ENOENT.
Introduction There are three control messages that stop LWPs, each behaving in a
different way. They are:
• PCSTOP
• PCDSTOP
• PCWSTOP
PCSTOP When applied to the process control file, directs all LWPs to stop and waits
for them to stop. Completes when every LWP has stopped on an event of
interest.
When applied to an LWP control file, directs the specific LWP to stop and
waits until it has stopped. Completes when the LWP stops on an event of
interest, immediately if already so stopped.
PCDSTOP When applied to the process control file, directs all LWPs to stop without
waiting for them to stop.
When applied to an LWP control file, directs the specific LWP to stop
without waiting for it to stop.
PCWSTOP When applied to the process control file, simply waits for all LWPs to stop.
Completes when every LWP has stopped on an event of interest.
When applied to an LWP control file, simply waits for the LWP to stop.
Completes when the LWP stops on an event of interest, immediately if
already so stopped.
PCRUN
Introduction The control message PCRUN makes an LWP runnable again after a stop.
The operand is a set of flags, contained in a ulong_t, describing optional
additional actions.
Flag Description
PRCSIG Clears the current signal, if any (see PCSSIG)
PRCFAULT Clears the current fault, if any (see PCCFAULT)
PRSTEP Directs the LWP to execute a single machine
instruction. On completion of the instruction, a trace
trap occurs. If FLTTRACE is being traced, the LWP
stops, otherwise it is sent SIGTRAP; if SIGTRAP is
being traced and not held, the LWP stops. When the
LWP stops on an event of interest the single-step
directive is cancelled, even if the stop occurs before
the instruction is executed. This operation requires
hardware and operating system support and may not
be implemented on all processors.
PRSABORT Is significant only if the LWP is in a PR_SYSENTRY
stop or is marked PR_ASLEEP; it instructs the LWP
to abort execution of the system call (see
PCSENTRY, PCSEXIT).
Using PCRUN When applied to an LWP control file PCRUN makes the specific LWP
on an LWP runnable. The operation fails (EBUSY) if the specific LWP is not stopped
on an event of interest.
Using PCRUN When applied to the process control file an LWP is chosen for the
on a process operation as described for /proc/pid/status. The operation fails (EBUSY) if
the chosen LWP is not stopped on an event of interest. If PRSTEP or
PRSTOP were requested, the chosen LWP is made runnable; otherwise,
the chosen LWP is marked PR_REQUESTED. If as a result all LWPs are
in the PR_REQUESTED stop state, they are all made runnable.
Once an LWP has been made runnable by PCRUN, it is no longer stopped
on an event of interest even if, because of a competing mechanism, it
remains stopped.
PCSTRACE
Introduction PCSTRACE defines a set of signals to be traced in the process: the receipt
of one of these signals by an LWP causes the LWP to stop. The set of
signals is defined using an operand sigset_t contained in the control
message.
Held signals If a signal that is included in a held signal set of an LWP is sent to the LWP,
the signal is not received and does not cause a stop until it is removed
from the held signal set, either by the LWP itself or by setting the held
signal set with PCSHOLD or the PRSHOLD option of PCRUN.
PCSSIG
Introduction
PCSSIG The current signal and its associated signal information for the specific or
chosen LWP are set according to the contents of the operand siginfo
structure. If the specified signal number is zero, the current signal is
cleared. An error (EBUSY) is returned if the LWP is not stopped on an
event of interest. The semantics of this operation are different from those
of kill(2), _lwp_kill(2), or PCKILL in that the signal is delivered to the LWP
immediately after execution is resumed (even if the signal is being held)
and an additional PR_SIGNALLED stop does not intervene even if the
signal is being traced. Setting the current signal to SIGKILL ends the
process immediately.
PCKILL, PCUNKILL
Introduction
PCKILL If applied to the process control file, a signal is sent to the process with
semantics identical to those of kill(2). If applied to an LWP control file, a
signal is sent to the LWP with semantics identical to those of _lwp_kill(2).
The signal is named in an operand int contained in the message. Sending
SIGKILL ends the process or LWP immediately.
PCUNKILL A signal is deleted, that is, it is removed from the set of pending signals. If
applied to the process control file, the signal is deleted from the process’s
pending signals. If applied to an LWP control file, the signal is deleted from
the LWP’s pending signals. The current signal (if any) is unaffected. The
signal is named in an operand int in the control message. It is an error
(EINVAL) to attempt to delete SIGKILL.
PCSHOLD
Introduction PCSHOLD sets the held signal set for the specific or chosen LWP (signals
whose delivery will be delayed if sent to the LWP) according to the operand
sigset_t structure. SIGKILL and SIGSTOP cannot be held; if specified, they
are silently ignored.
PCSFAULT
Fault names Some fault names may not occur on all processors; there may be
processor-specific faults in addition to these. Fault names include the
following:
When not traced, a fault normally results in the posting of a signal to the
LWP that incurred the fault. If an LWP stops on a fault, the signal is posted
to the LWP when execution is resumed unless the fault is cleared by
PCCFAULT or by the PRCFAULT option of PCRUN. FLTPAGE is an
exception; no signal is posted. There may be additional processor-specific
faults like this.
PCCFAULT
PCCFAULT The current fault (if any) is cleared; the associated signal is not sent to the
specific or chosen LWP.
PCSENTRY, PCSEXIT
PCSENTRY, These control operations instruct the process’s LWPs to stop on entry to or
PCSEXIT exit from specified system calls. The set of system calls to be traced is
defined via an operand sysset_t structure.
When entry to a system call is being traced, an LWP stops after having
begun the call to the system but before the system call arguments have
been fetched from the LWP. When exit from a system call is being traced,
an LWP stops on completion of the system call just before checking for
signals and returning to user level. At this point all return values have been
stored into the LWP’s registers.
If an LWP is stopped on entry to a system call (PR_SYSENTRY) or when
sleeping in an interruptible system call (PR_ASLEEP is set), it may be
instructed to go directly to system call exit by specifying the PRSABORT
flag in a PCRUN control message. Unless exit from the system call is
being traced, the LWP returns to user level showing error EINTR.
PCSET, PCSET sets one or more modes of operation for the traced process;
PCRESET PCRESET resets these modes. The modes to be set or reset are
specified by flags in an operand long in the control message. The
flags are described below:
Flag Description
PR_FORK (inherit-on-fork) When set, the tracing flags of the process are
inherited by the child of a fork(2) or vfork(2).
When reset, child processes start with all tracing
flags cleared.
PR_RLC (run-on-last-close) When set and the last writable /proc file
descriptor referring to the traced process or any
of its LWPs is closed, all the tracing flags of the
process are cleared, any outstanding stop
directives are canceled, and if any LWPs are
stopped on events of interest, they are set
running as though PCRUN had been applied to
them. When reset, the process’s tracing flags are
retained and LWPs are not set running on last
close.
PR_KLC (kill-on-last-close) When set and the last writable /proc file
descriptor referring to the traced process or any
of its LWPs is closed, the process is terminated
with SIGKILL.
EINVAL It is an error (EINVAL) to specify flags other than those described above or
to apply these operations to a system process. The current modes are
reported in the pr_flags field of /proc/pid/status.
PCSREG PCSREG sets the general registers for the specific or chosen LWP
according to the operand gregset_t structure. There may be machine-
specific restrictions on the allowable set of changes. PCSREG fails
(EBUSY) if the LWP is not stopped on an event of interest.
PCSFPREG PCSFPREG sets the floating-point registers for the specific or chosen
LWP according to the operand fpregset_t structure. An error (EINVAL) is
returned if the system does not support floating-point operations (no
floating-point hardware and the system does not emulate floating-point
machine instructions). PCSFPREG fails (EBUSY) if the LWP is not
stopped on an event of interest.
PCNICE The nice(2) value of the specific or chosen LWP is incremented by the
amount contained in the operand int. Only the super-user may improve an
LWP’s priority in this way; any user may make the priority worse. This
operation is significant only when applied to an LWP in the time-sharing
scheduling class.
Directories
Object directory The object directory contains read-only files with names as they
appear in the entries of the map file, corresponding to objects
mapped into the address space of the target process. Opening such a
file yields a descriptor for the mapped file associated with a
particular address-space region. The name a.out also appears in the
directory as a synonym for the executable file associated with the
“text” of the running process.
lwp directory The lwp directory contains entries each of which names an LWP within the
containing process. These entries are directories containing additional files
and are described beginning on page 15.
Code Example
Introduction The following code is a simple example of how one process can use the
/proc file system to access the address space of another. Given a
single argument (the ID of a currently running process), it prints the
name and arguments of that process from its psinfo structure.
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/procfs.h>
int main(int argc, char *argv[])
{
    char fname[64];          /* path of the psinfo file, e.g. /proc/209/psinfo */
    struct psinfo p;
    int fd;
    sprintf(fname, "/proc/%s/psinfo", argv[1]);
    fd = open(fname, O_RDONLY);
    read(fd, &p, sizeof (struct psinfo));
    printf("process pid %s: exec path/args: %s %s\n",
        argv[1], p.pr_fname, p.pr_psargs);
    close(fd);
}