Student Guide
Version 20001015
Trademarks
IBM® is a registered trademark of International Business Machines Corporation.
UNIX is a registered trademark in the United States, other countries, or both and is licensed
exclusively through X/Open Company Limited.
<<< list any other trademarks used in the course materials >>>
The information contained in this document has not been submitted to any formal IBM test and is distributed on
an “as is” basis without any warranty either express or implied. The use of this information or the
implementation of any of these techniques is a customer responsibility and depends on the customer’s ability
to evaluate and integrate them into the customer’s operational environment. While each item may have been
reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will
result elsewhere. Customers attempting to adapt these techniques to their own environments do so at their
own risk.
© Copyright International Business Machines Corporation 2000. All rights reserved. This document may not be
reproduced in whole or in part without the prior written permission from IBM. Information in this course is
subject to change without notice.
Contents
Kernel Overview
Kernel Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-2
Kernel states . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-8
Kernel exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-10
Kernel Limits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-12
Kernel Limits Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-16
64-bit Kernel base enablement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-17
64-bit Kernel stack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-24
CPU big- and little-endian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-26
Multi Processor dependent designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-28
Command and Utility compatibility for 32-bit and 64-bit kernels . . . . . . . . . . . . . . . . 1-29
Exceptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-30
Interrupts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-33
Interrupt handling in AIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-35
Handling CPU state information at interrupt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-36
Handling CPU state information at interrupt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-37
IA-64 Hardware Overview
IA-64 Hardware Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-2
IA-64 formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-3
IA-64 memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-5
IA-64 Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-8
IA-64 Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-12
IA-64 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-19
Power Hardware Overview
Power Hardware Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-2
Power CPU Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-8
64 bit CPU Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-15
SMP Hardware Overview
SMP Hardware Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-2
Configuring System Dumps on AIX 5L
About This Lesson . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-3
System Dump Facility in AIX5L . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-5
Configuring for System Dumps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-7
Obtaining a Crash Dump . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-16
Dump Status and completion codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-17
dumpcheck utility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-19
Verify the dump . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-21
Packaging the dump . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-26
Introduction to Dump Analysis Tools
About This Lesson . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-2
System Dump Analysis Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-6
dump components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-7
Dump creation process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-8
Component dump routines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-9
bosdebug command . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-10
Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-127
Process Management
Process Management Fundamentals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-2
Process operations fork() system call . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-3
Process operations exec() system call . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-8
Process operations exec() system call . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-10
Process operations exit system call . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-12
Process operations, wait() system call . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-13
Kernel Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-16
Thread Fundamentals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-17
AIX Thread . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-19
Thread Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-21
Threads Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-22
Thread states . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-25
Thread Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-27
Process swapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-28
Thread Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-29
The Dispatcher . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-33
AIX run queues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-36
Process and Threads data structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-39
Process and Threads data structures addresses . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-43
What is new in AIX 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-48
Signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-50
Signal handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-51
Signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-53
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-57
Memory Management
Overview of Virtual Memory Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-2
Memory Management Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-3
Demand Paging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-5
Memory Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-7
Memory Object types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-8
Page Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-9
Page Not In Hardware Frame Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-11
Page on Paging Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-13
Loading Pages From The Filesystem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-17
Filesystem I/O . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-18
Free Memory and Page Replacement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-19
vmtune . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-21
Fatal Memory Exceptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-22
Memory Objects (Segments) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-23
Shared Memory segments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-25
shmat Memory Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-26
Memory Mapped Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-28
IA-64 Virtual Memory Manager
IA-64 Addressing Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-2
Regions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-3
Region Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-4
Address Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-5
© Copyright IBM Corp. 2000 Version 20001015 Contents v
Course materials may not be reproduced in whole or in part
without the prior written permission of IBM.
Student Guide Draft Version for review, Sunday, 15. October 2000, intTOC.fm
LVM
Logical Volume Manager overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-3
Data Integrity and LVM Mirroring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-12
LVM Striping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-15
LVM Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-17
Physical disk layout Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-21
VGSA structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-30
Physical disk layout IA-64 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-31
LVM Passive Mirror Write Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-36
AIX 5 LVM Hot Spare Disk in a Volume group. . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-40
LVM Hot spot management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-42
LVM split mirror AIX 4.3.3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-45
LVM Variable logical track group (LTG) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-46
LVM command overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-47
LVM Problem Determination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-48
Trace LVM commands with the trace command . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-51
LVM Library calls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-56
logical volume device driver LVMDD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-57
Disk Device Calls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-58
Disk low level Device Calls such as SCSI calls . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-60
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-61
Enhanced Journaled File System
J2 - Enhanced Journaled File System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-2
Aggregate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-3
Allocation Groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-7
Filesets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-8
Extents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-10
Binary Trees of Extents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-12
inodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-15
File Data Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-19
fsdb Utility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-23
Exercise 1 - fsdb . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-24
Directory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-27
Directory Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-31
Exercise 2 - Directories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-35
Logical and Virtual File Systems
General File System Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-2
Logical File System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-4
User File Descriptor Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-6
System File Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-7
Virtual File System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-9
Vnode/vfs interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-10
vi AIX 5L Internals Version 20001015 © Copyright IBM Corp. 2000
Vnodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-11
vfs and vmount . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-12
File and Filesystem Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-14
gfs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-15
vnodeops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-16
vfsops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-17
The Gnode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-18
Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-20
Lab Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-21
Lab Exercise 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-26
AIX 5L boot
What is boot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-2
Various Types of boot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-3
Systems types and Kernel images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-5
RAMFS and prototype files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-6
Boot Image Creation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-8
AIX 5L Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-10
Checkpoint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-11
Instructor Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-12
The Power Boot Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-13
Power boot disk layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-14
AIX 5L Power boot record . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-16
Instructor Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-20
Power boot images structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-21
RSPC boot image hints header . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-22
CHRP Boot image ELF structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-24
CHRP boot image ELF structure - Continued . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-25
CHRP boot image ELF structure - Continued . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-26
Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-27
Instructor Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-28
Power ROS and Softros . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-30
IPLCB on Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-31
Checkpoint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-33
Instructor Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-34
The IA-64 Boot Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-35
IA-64 boot disk layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-37
EFI boot manager and boot maintenance manager overview . . . . . . . . . . . . . . . . 14-39
EFI Shell Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-40
IA-64 Boot Loader . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-43
IA-64 Initial Program Load Control Block . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-44
Checkpoint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-45
Instructor Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-46
Hard Disk Boot process (rc.boot Phase I) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-47
Hard Disk Boot process (rc.boot Phase II) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-48
Hard Disk Boot process (rc.boot Phase III) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-49
CDROM Boot process (rc.boot Phases I, II and III) . . . . . . . . . . . . . . . . . . . . . . . . 14-50
Tape Boot process (rc.boot Phases I, II and III) . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-51
Network Boot process (rc.boot Phases I, II and III) . . . . . . . . . . . . . . . . . . . . . . . . 14-52
Common Boot process (rc.boot Phase III) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-53
Kernel Overview
Introduction
Up until AIX 5L, the kernel was a 32-bit kernel for the Power architecture only. AIX Version 4.3 introduced 64-bit application enablement on Power: there was still a 32-bit kernel, but a 64-bit environment was available through a kernel extension that performed the appropriate remapping. Now AIX 5L features both a 32-bit and a 64-bit kernel on Power systems, and a 64-bit kernel on the IA-64 architecture.
This overview describes the concepts used in the kernel in general, and in the 64-bit kernel specifically.
Functions of the kernel
The kernel provides the system with the following functions:
• Create, manage and delete processes.
• Schedule and balance resources.
• Provide access to devices.
• Handle asynchronous events.
The kernel manages resources so they can be shared simultaneously among many processes and users. A resource can be physical, like the CPU, memory or an adapter, or virtual, like a lock or a slot in the process table.
Uniprocessor support
The 64-bit kernel is aimed at the high-end server environment and multiprocessor hardware. As a result, it is optimized strictly for the multiprocessor environment and no separate uniprocessor version is provided.
64-bit vs. 32-bit kernel
The primary purpose of the 64-bit AIX kernel is to address the fundamental need for workload scalability. This is achieved through a kernel address space which is large enough to support increases in software resources.
32-bit kernel life time
Customers have made and will continue to make significant investments in 32-bit RS/6000 hardware systems and need system software that protects this investment. Thus, AIX also offers a 32-bit kernel. The RS/6000 software plan is to eventually drop support for the 32-bit kernel. However, support will not be withdrawn before 2002, and not before the initial 64-bit kernel release. This process is driven by end-of-life plans for 32-bit hardware systems, as well as by the fact that customers require a bridge period during which both the 32-bit and 64-bit kernels are available for 64-bit hardware systems and offer the same basic functionality. This period is needed to ease migration to the 64-bit kernel.
Compatibility
Customers need system software that protects their investment in existing applications and provides binary and source compatibility. AIX 5L will therefore maintain support for existing 32-bit applications.
Kernels supported by hardware platform
The table below shows which kernels are supported on different systems. In general, a 64-bit kernel and application can only run on 64-bit hardware, but 64-bit hardware can execute 32- and 64-bit kernels and applications. Currently, there are three different CPU types in the RS/6000 systems (only the PowerPC 604e CPU is 32-bit).

CPU            Type
PowerPC 604e   32-bit
Power3-II      64-bit
RS64 II        64-bit
RS64 III       64-bit
Binary compatibility and limitations
The 64-bit kernel offers binary compatibility to existing applications, both 32-bit and 64-bit. However, this does not extend to the minority of applications that are built non-shared or have intimate knowledge of internal details, such as programs accessing /dev/kmem or /dev/mem. This is consistent with the general AIX policy for these two classes of applications.
Source compatibility
Source code compatibility is preserved for applications and 32-bit kernel extensions. Consistent with general AIX policy, this extends to makefiles (build mechanisms), but not to the small set of applications that rely upon shipped header file contents that are provided only for use by the kernel. Programs accessing /dev/mem or /dev/kmem serve as an example of such applications.
32-bit vs. 64-bit kernel performance on Power
The 64-bit kernel is intended to increase the scalability of the RS/6000 product family and is optimized for running 64-bit applications on the upcoming Gigaprocessor systems (Power4, which will be announced in 2001). The performance of 64-bit applications running on the 64-bit kernel on Gigaprocessor-based systems is better than if the same application were running on the same hardware with the 32-bit kernel. This is because the 64-bit kernel allows 64-bit applications to be supported without requiring system call parameters to be remapped or reshaped. The 64-bit kernel may also be compiler-optimized specifically for the Gigaprocessor system, whereas the 32-bit kernel may be optimized for a more general platform.
32-bit application performance on 32-bit and 64-bit kernels
The 64-bit kernel will also be optimized for 32-bit applications (to the extent possible), because 32-bit applications now dominate the application space and will continue to do so for some time. In fact, performance trade-offs involving 32-bit versus 64-bit applications should be made in favor of 32-bit applications. However, 32-bit applications on the 64-bit kernel will typically perform worse than on the 32-bit kernel, because call parameter reshaping is required for 32-bit applications on the 64-bit kernel.
64-bit application and 64-bit kernel performance on non-Gigaprocessor systems
The performance of 64-bit applications under the 64-bit kernel on non-Gigaprocessor systems may be less than that of the same applications on the same hardware under the 32-bit kernel. This is due to the fact that non-Gigaprocessor systems are intended as a bridge to Gigaprocessor systems and lack some of the support that is needed for optimal 64-bit kernel performance. In addition, efforts should be made to optimize 64-bit kernel performance for non-Gigaprocessor systems, but performance trade-offs are made in favor of the Gigaprocessor.
32-bit and 64-bit kernel extension performance on Gigaprocessor systems
The performance of 64-bit kernel extensions on Gigaprocessor systems should be the same as or better than that of their 32-bit counterparts on the same hardware. However, the performance of 64-bit kernel extensions on non-Gigaprocessor machines may be less than that of 32-bit kernel extensions on the same hardware. This follows from the fact that the 64-bit kernel is optimized for Gigaprocessor systems.
Kernel characteristics
Since the kernel is a program itself, it behaves almost like any other program. Its features are:
• Preemptable
• Pageable
• Segmented
• 64-bit
• Dynamically loadable
Preemptable means that the kernel can be in the middle of a system call
and be interrupted by a more important task. The preemption causes a
context switch to another thread inside the kernel.
Some parts of the kernel are pageable, which means they are not needed
in memory all the time, and can be paged to paging space.
Both the 32-bit kernel and the 64-bit kernel implement virtual address
translation by using segments. In previous versions of AIX, segment
registers were used to map segments to thread contexts. Now segment
tables are being used.
Kernel states

Kernel system diagram (Power)
[Diagram: user programs and libraries sit at the user level and enter the kernel through the system call interface (trap). Within the kernel level are the buffer cache, the process control subsystem with the scheduler and memory management, and the character and block device drivers. Hardware control forms the boundary to the hardware level and the hardware itself.]
This diagram shows how the kernel is the interface between the user level
and the hardware. Applications live at the user level, and they can only
access hardware, like a disk or printer, through the kernel.
Process execution modes
Processes can run in two different execution modes: kernel mode and user mode. These modes are also referred to as Supervisor State and Problem State.
User mode protection domain
A process running in user mode can only affect its own execution environment and runs in the processor's unprivileged state. In user mode, a process has read/write access to the user data in its process private segment and to the shared library data segment. It also has access to shared memory segments using the shared memory functions. A process in user mode has read access to the user text and shared library text segments.
User mode processes can still use kernel functions by means of a system call. Access to functions that directly or indirectly invoke system calls is typically provided by programming libraries, which give access to operating system functions.
Kernel mode protection domain
Code running in this mode has read/write access to the global kernel space, and access to kernel data in the process private segment when running within the process context. Code in interrupt handlers, the base kernel and kernel extensions runs in kernel mode. If a program running in kernel mode needs to access user data, a kernel service is used to do so. Programs running in kernel mode can use kernel services, can access global system data, are exempt from all security restraints, and run in the processor's privileged state.
In short: the kernel state is part of the thread state, so this information is typically kept in the thread's Machine State area (MST).
Kernel exercise

Exercise: figuring out thread state on Power
Look at the value of the Machine State Register (MSR) for the thread of interest:

# echo "mst <thread slot>" | kdb | grep msr
iar : 0000000000009444 msr : A0000000000010B2 cr : 31384935
From /usr/include/sys/machine.h :
This means that if bit 15 of the MSR is set, the thread is running in user mode; that is, when the fourth nibble from the right is 4, 5, 6, 7 or C, D, E, F.
Exercise: figuring out thread state on IA-64
Look at the value of the Interrupt Processor State Register (IPSR) for the thread of interest.

On an interrupt, if PSR.ic (Interrupt Collection) is 1, the IPSR receives
the value of the PSR. The IPSR, IIP and IFS are used to restore the
processor state on a Return From Interrupt (rfi). The IPSR has the same
format as PSR. IPSR.ri is set to 0, after any interruption from the IA-32
instruction set.
# iadb
(0)> ut -t <thread-ID>
*ut_save: 0x0003ff002ff3b400 *ut_rsesave: 0x0003ff002ff3bf50
System call state: ut_psr: 0x00001053080ee030
From /usr/include/sys/machine.h :
#define PSR_PK 15
00001010080AE030 (HEX) =
100000001000000001000000010101110000000110000 (Binary)
Bit 15 is set, which means that the thread has the Protection Key set, and
hence is in a problem state.
Kernel Limits
Kernel Limits
Most of the settings in the kernel are dynamic and do not need to be tuned. Their maximum values are chosen so that they should never be reached during normal system usage. Some limits chosen as a maximum could technically be even higher.
The following table lists kernel system limits as of AIX 5L Version 5.0.
Checking kernel values
The purpose of this exercise is to find actual limits or settings in a running kernel. From the file /usr/include/sys/msginfo.h, we obtain the structure msginfo, which holds four integers. To list the contents in the running kernel, we use kdb on Power and iadb on the IA-64 platform. On both systems, we display 16 bytes, equal to four integers.
/*
 * Message information structure.
 */
struct msginfo {
        int msgmax,     /* max message size */
            msgmnb,     /* max # bytes on queue */
            msgmni,     /* # of message queue identifiers */
            msgmnm;     /* max # messages per queue identifier */
};
Power # kdb
(0)> d msginfo
IA-64 # iadb
> d msginfo 4 4
e00000000415cfb0: 00400000 00400000 00020000 00080000
msgmax msgmnb msgmni msgmnm
64-bit Kernel base enablement
Several components of base enablement support are provided to make it possible for kernel subsystems and kernel extensions to run in 64-bit mode and use a large address space.
State management support
Support is provided for saving and restoring 64-bit kernel context, including full 64-bit GPR contents. This support also extends to the area of kernel exception handling, where setjmpx() and longjmpx() must deal with 64-bit kernel context. In addition, state management is extended to include the 64-bit kernel address space as part of the kernel context.
Temporary attachment
The 64-bit kernel provides kernel subsystems and kernel extensions with the capability to change the contents of the kernel address space. This includes the capability to change segments within the address space temporarily for a specific thread of execution, and is consistent with the segmented virtual memory architecture of the hardware and the legacy 32-bit kernel programming model.
A total of four concurrent temporary attachments will be supported under a
single thread of execution. This limitation is consistent with the limitation
imposed by the 32-bit kernel and is made to restrict the amount of kernel
state that must be saved and restored at context switch.
Global attachment
While the temporary attachment model is maintained, the 64-bit kernel also provides a model under which subsystem data is placed within the global kernel address space and made visible to all kernel code for the entire life of its usefulness, rather than temporarily attaching segments as needed and in the context of a single thread.
This global attachment model does more than allow the 64-bit kernel to
provide sufficient space for subsystems to place their data in the global
kernel heap. Rather, it includes the capability to place subsystem
segments within the global address space. This capability is needed for
two reasons:
• Different memory characteristics
• Data organized around segment
Some subsystems require virtual memory characteristics that are different from those of the kernel heap. For the most part, these characteristics are defined at the segment level and typically must be reflected by segment types that are different from those used for the kernel heap. Also, some subsystems organize their data around segments and require sizes and alignments that are inappropriate for the kernel heap.
The global attachment model is also of value in cases where only a small
number of subsystem segments are involved. Segments are attached to
the global kernel addresses space only once, typically at subsystem
initialization, and are accessible from then on without requiring individual
subsystem operations to incur the path length cost of segment attachment.
This is not to say that the global attachment model is without its own path
length costs; specifically, use of this model may result in more segment
lookaside buffer (SLB) reloads. This is because it provides no opportunity to
prime the SLB table with virtual segment IDs (VSIDs) for soon-to-be-
accessed segments. Rather, it relies upon the caching nature of the SLB
table and updates SLBs with new VSIDs only when satisfying reload faults.
This differs from the temporary attachment model where VSIDs are placed
in the SLB as part of segment attachment.
Finally, this model simplifies the general kernel programming model. Subsystems are not required to deal with the complexity of segments, segment offsets, or segment attachments in accessing their data. Rather,
data accesses are made simply and naturally using addresses within the
flat kernel address space.
The specific subsystem segments that will be placed in the kernel address
space under the global attachment model include:
• Kernel Heap
Although traditionally part of the global address space, the
kernel heap segments will be placed in this space through
global attachment.
• mbuf Segments
The mbuf pool has long been a part of global space and this will
continue under the 64-bit kernel.
• VMM Segments
These segments are privately attached in the 32-bit kernel
legacy and hold the software page frame table, segment control
blocks, paging device table, file system lockwords, external
page tables, and address space map entries.
All segments added to the global kernel address space through global
attachment will be strictly read/write for the kernel and no-access for users.
In addition, unaligned accesses to these segments will not be supported
and will result in a protection exception.
Data isolation
While placing subsystem data in the global kernel address space provides significant benefits, it eliminates the data isolation that is provided by the
temporary attachment model. Under this model, data is typically made
accessible only while running subsystem code and is not generally
exposed to other subsystems. Unrelated interrupt handlers may gain
accessibility to data by interrupting subsystem code. However, this
exposure is more limited than that which occurs by placing data in global
space where all kernel code has accessibility.
Isolation is critical for some classes of subsystem data. As a result, not all
subsystem data should be placed in the global kernel address space. In
particular, file systems should continue to use temporary attachments to
provide isolation for user data.
Kernel address space layout
The kernel address space layout preserves the existing 32-bit and 64-bit user address layouts that are found under the legacy 32-bit kernel. In addition, a common global kernel and per-process user address space is provided. This is required for a number of performance reasons.
Temporary attachments are not included as part of the common address space. This is for a number of reasons. First, data isolation would be impacted for temporary attachments if they were placed in the common
address space. This is because the attached data would be accessible in
the kernel by all threads of a process rather than only by the thread that
performed the temporary attachment. Second, it would be inefficient for the
common address space to include temporary attachments. This is due to
the fact that changes to the common address space would have to be
serialized among all threads of a process.
I/O space mapping
The 64-bit kernel supports I/O space at locations below and above 4 GB within the hardware system memory map. Under the 64-bit kernel, I/O
space is virtually mapped through the page translation hardware and made
accessible through segments on all supported hardware system
implementations. In the legacy 32-bit kernel on current hardware systems,
I/O space virtual access is achieved through block address translation
(BAT) registers, but this capability is not provided by the Gigaprocessor
hardware.
Performance when accessing I/O addresses
The capability to place portions of I/O space within the global kernel address space must be provided to allow temporary attachment overhead to be avoided. This capability is built upon the global attachment model. Along with services to support this, other services are provided that allow
portions of I/O space to be temporarily attached. However, these services
will form an I/O space temporary attachment model that is slightly different
from the one now found under the 32-bit kernel. Specifically, I/O space
mappings must be created prior to any temporary attachments and
destroyed once all temporary attachments are complete. These mapping
operations are performed by individual device drivers through new
services and typically occur at the time of device configuration and deconfiguration. Compare this to the existing model under the 32-bit kernel, where no separate mapping operations are present.
I/O mapping in 64-bit kernel mode
The mapping operations are provided under the 64-bit kernel model for a number of reasons. The first is performance. While the 32-bit kernel model
does not require I/O space to be mapped before it is attached, it does
require each temporary attachment to perform some level of mapping.
Under the 64-bit kernel model, each device driver maps its portion of I/O
once at initialization time and incurs no additional mapping overhead in
performing temporary attachments. Next, the presence of the mapping
operations provides efficient use of system resources. I/O space is mapped
in virtual memory through the page table and segments under the 64-bit
kernel and these system resources are only consumed for portions of I/O
space that are actually in use. In the absence of mapping operations, the
64-bit kernel itself would have to map all of I/O space into virtual memory
and possibly waste resources for unused portions. In addition to potentially
wasting resources, arming the kernel with the responsibility of mapping I/O
space would lead to arbitrary layouts of I/O space in virtual memory and
would not support data isolation. Finally, the interfaces for performing
temporary attachments are simplified, as no I/O mapping information must
be specified. This implies new interfaces for attaching and detaching from
I/O space.
The new I/O space temporary attachment model and supporting services are provided not only under the 64-bit kernel but under the 32-bit kernel as
well. This is required to ease the migration of 32-bit device drivers to the
64-bit kernel environment and to make it simpler to maintain 32-bit and 64-
bit versions of a single device driver.
Rather than placing their respective portions of I/O space in the global
kernel address space, most device drivers should continue to access I/O
space through temporary attachments. This is because a large proportion
of these accesses occur under interrupts and would more than likely miss
the SLB table if the accesses were performed using the global attachment
model. While the temporary attachment model adds overhead to I/O space
accesses, it typically avoids the SLB miss performance penalty by priming
the SLB table.
LP64 C language data model
The 64-bit kernel uses the LP64 (Long Pointer 64-bit) C language data model. This data model was chosen for a number of reasons. First, the LP64 data model is also used by 64-bit AIX applications, and this allows the 64-bit kernel to support these applications in a straightforward manner. Of the
prevailing 64-bit data models, including ILP64 and LLP64, the LP64 data
model is most consistent with the ILP32 data model used by 32-bit
applications. This consistency simplifies 32-bit application support under
the 64-bit kernel and allows 32-bit and 64-bit applications to be supported
in fairly common ways. Next, LP64 has been chosen as the data model for
the 64-bit kernel implementations provided by key UNIX vendors, including
SGI, SUN, and H-P. Use of a common data model simplifies matters for
ISVs, and enables AIX to use industry wide solutions to some problems.
Finally, the 64-bit kernel requires no new compiler functionality and can
use the existing 64-bit mode compiler.
Register conventions
The register conventions used in the 64-bit kernel environment are the same as those used in the 64-bit application environment. This means that general purpose register 13 is reserved for operating system use.
Kernel stack
64-bit code has greater stack requirements than 32-bit code. This is for two reasons. First, the amount of stack space required to hold subroutine
linkage information increases for 64-bit code, since this information is
made up of register and pointer values and these values are larger 64-bit
quantities. Second, long and pointer values are 64-bit quantities for 64-bit
code and consume more space when maintained as stack variables.
The larger stack requirements of 64-bit code also means that stack-related
sizes under the 64-bit kernel are increased over those of the 32-bit kernel.
In fact, most existing stack sizes will double.
Minimum stack size
Under the 64-bit kernel, the components of the common subroutine linkage, such as the link register and TOC pointer, are 64-bit quantities. As a result, the minimum stack frame size is 112 bytes.
Process context stack size
As in the 32-bit kernel, the kernel stacks for use in process context are 96 KB in size. This size should prove sufficient for the 64-bit kernel, since 96 KB has been found to be twice what is actually needed for the 32-bit kernel.
Interrupt stack size
The interrupt stack is 8 KB in size under the 64-bit kernel. This size is clearly warranted, since some interrupt handlers find the 4 KB interrupt stack size of the 32-bit kernel to be insufficient.
Dynamic resource pools
To allow scalability, resource pools are allocated dynamically from the kernel heap and through separately created segments intended for this purpose. This means that some existing resource pools, like the shared memory, message queue, and semaphore ID pools, are relocated from the kernel BSS.
Kernel heap
The kernel heap is the home of most kernel data structures, and is
sufficiently large to allow subsystems to scale fixed resource pools, while
at the same time, providing adequate space for dynamically allocated
resources. To provide this, the kernel heap is expanded to encompass a
larger number of segments and placed above 4 GB within the global kernel
address space to accommodate its larger size.
While the kernel heap is extended and moved above 4 GB, the interfaces
provided for the allocation and freeing from this heap are the same as
those provided under the 32-bit kernel. The use of these interfaces is
pervasive, so common interfaces eases the 64-bit kernel porting effort for
kernel subsystems and kernel extensions and makes it simpler to support
both kernels.
The kernel heap is now expanded to 16 segments, for a total of about 4 GB
of allocatable space. This is more than eight times larger than the space
available under the 32-bit kernel.
Allocation requests are only limited in size by the amount of available heap
space, rather than by some arbitrary limit. This means that the segments
that make up the kernel heap are laid out contiguously within the address
space, and requests for more than a segment's worth of data are granted
if sufficient free space is available. It also means that a request can be
satisfied with space that crosses segment boundaries.
A separate global heap reserved for the loader is provided in segment zero
(that is, the kernel segment). This heap is used to hold the system call
table and svc_instructions code for 32-bit applications and must be placed
in segment zero, because it is the only global segment that is mapped into
the 32-bit user address space. This heap is also used to hold the system
call table for 64-bit applications and loader sections for kernel extensions.
This data is located in the loader heap because it must be readable in user
mode. This type of access is not supported for the kernel heap.
Memory view for big- and little-endian systems
Although both Power and IA-64 architectures support big-endian and little-endian implementations, the endianness of AIX 5L running on IA-64 and of AIX 5L on PowerPC differ: AIX 5L for IA-64 is little-endian, and AIX 5L for PowerPC is big-endian.
When you look at system memory, you can view it in two ways. The example shows 100 bytes of memory seen both ways. Try writing the number 1234567890 at addresses 0-9 in both figures. What digit is in the byte at address two?
(Figure: the same 100 bytes of memory drawn twice, ten bytes per row. On the left, addresses run from 99 down to 0; on the right, they run from 0 up to 99.)
Register and memory byte order
Computers address memory in bytes while manipulating data in words (of multiple bytes). When a word is placed in memory, starting from the lowest address, there are only two options: either place the least significant byte first (known as little-endian) or place the most significant byte first (known as big-endian).
register (bit 63 ... bit 0):   a b c d e f g h

big-endian memory:     a b c d e f g h
address:               0 1 2 3 4 5 6 7

little-endian memory:  h g f e d c b a
address:               0 1 2 3 4 5 6 7
In the register layout shown in the figure above, “a” is the most significant
byte, and “h” is the least significant byte. The figure also shows the byte
order in memory. On big-endian systems, the most significant byte will be
placed at the lowest memory address. On little-endian systems, the least
significant byte will be placed at the lowest memory address.
Kernel lock
The kernel lock is not supported under the 64-bit kernel. This lock was originally provided to allow subsystems to deal with the preemptive nature of the AIX kernel on uniprocessor hardware, and was later used as a means of ensuring correctness for non-MP-safe subsystems on MP hardware. At a minimum, all 64-bit kernel subsystems and kernel extensions must be MP-safe, with most required to be MP-efficient to meet performance requirements. As a result, the kernel lock is no longer required.
Device funneling
Under the 64-bit kernel, no support is provided for device funneling. This means that all device drivers must be MP-safe and identify themselves as such when registering devices and interrupt handlers.
Device funneling was originally provided under the 32-bit kernel so that
non-MP-safe device drivers could run correctly on multi-processor
hardware with no change. However, all device drivers must change to
some extent under the 64-bit kernel and this provides the opportunity to
simplify the 64-bit kernel by not providing device funneling support and
requiring additional changes for the set of device drivers that are not MP-
safe.
Of the existing IBM Austin-owned device drivers, only the X.25 and
graphics device drivers are not MP-safe. However, this is of no concern,
since X.25 will not be provided under the 64-bit kernel and the (new)
graphics drivers that will be provided in the time frame of the 64-bit kernel
will be MP-safe.
Commands and utilities
A number of AIX-supplied commands and utilities deal directly with kernel details and require different implementations under the different kernels. Commands based upon /dev/kmem or /dev/mem serve as an example.
Exceptions
Exceptions and interrupts distinction
The distinction between the terms "exception" and "interrupt" is often blurred. The bulk of AIX documentation refers to both classes generically as "interrupts," while the hardware documentation (such as the PowerPC 60x User's Manuals) makes the distinction. We will try to keep the terms separate.
Definition of exceptions
Exceptions are synchronous events that are normally caused by the process doing something illegal.
An exception is a condition caused by a process attempting to perform an
action that is not allowed, such as writing to a memory location not owned
by the process, or trying to execute illegal operations. For illegal
operations, the kernel traps the offending action and delivers a signal to the process causing the exception (or crashes, if the process was in
kernel mode). Exceptions can also be caused by a page fault. A page fault
is a reference to a virtual memory location for which the associated real
data is not in physical memory.
Determine the action taken on an exception
The result of an exception is either to send a signal to the process or to crash the machine. The decision is based upon what kind of exception occurred and whether the process was executing in user mode or kernel mode:
• Exceptions are caused within the context of a process.
• A process may NOT decide how to react to the exception.
• Exception handlers are kernel code and run without regard to the process, except
to cleanly handle the exception generated by the process.
• Some exceptions result in the death of the process.
• Some exception types can be found in <sys/m_except.h>
A process can decide how to respond to the signal generated by the
exception in certain cases. For example, a process can decide to catch the
signal for SIGILL, which is delivered when a process in user mode
executes an illegal instruction.
An exception is also a mechanism to change to supervisor state as a result
of:
• Program errors
• Unusual conditions
• Program requests
Branching to exception handlers
After an exception, the system switches to supervisor state and branches to an exception handler routine. The branch address is found in the content of a specific memory location called a "vector."
System reset exception
The system reset exception is used when a system reset is initiated by the system administrator. This generally causes a "soft" reboot of the system.
Machine check exception
The machine check exception is generated when a hardware machine check occurs. This generally indicates either a hardware bus error or a bad real address access. If a machine check occurs with the ME bit off, a machine checkstop occurs. Generally, a machine check exception causes a kernel crash dump to be generated. A machine checkstop causes no kernel crash dump to be generated, though a checkstop record is generated.
Data storage exception
Data storage interrupt (DSI) and instruction storage interrupt (ISI) exceptions are caused by the hardware not being able to find a translation for an instruction fetch or load/store operation. These generally result in a page fault.
Floating point unavailable exception
The floating point unavailable exception is caused when a thread executes a floating point instruction when floating point operations are not allowed.
This generally indicates that a thread has not executed any floating point
instructions yet or that another thread’s floating point data is currently in
the processor’s floating point registers. AIX does not save a thread’s
floating point register values until it first uses the floating point registers.
On UP systems, AIX does not save off floating point registers for the
currently running thread when another thread is dispatched. Often, no
other thread will use the floating point registers before the thread is again
dispatched. This saves AIX having to save and restore the floating point
registers on every thread dispatch.
Decrementer exception
The decrementer exception is caused when the decrementer register has reached the value zero. This indicates that a timer operation has completed.
System call exception
The system call exception occurs whenever a thread executes a system call.
Interrupts
Description of interrupts
Interrupts are asynchronous events that may be generated by the system or a device, and which interrupt the execution of the current process.
Interrupts usually occur when a process is running and some
asynchronous event occurs such as disk I/O completion or a clock tick.
The event usually has nothing to do with the current running process. The
kernel immediately preempts the current running process to handle the
interrupt. The state of the machine is saved on the stack and the interrupt
is handled. The user process has no knowledge that the interrupt
occurred.
Interrupts are one of the major reasons that AIX cannot be a hard real-time
system. No guarantee can be made as to how long it may take for some
action to occur as it may get interrupted any number of times during the
action.
Interrupts are caused outside the context of a process. In general, a process may NOT decide how to react to the interrupt. Interrupt handlers are kernel code and run without regard to the process, unless the nature of the interrupt is to update some process-related structure, statistics, and so on.
Interrupt levels
Each interrupt has a level and an associated priority; the level is a value that is used to differentiate between interrupts, and the priority ranks the importance of each one.
Devices with interrupt facilities, such as adapter cards, have an associated interrupt level. When the system receives an interrupt with that level, AIX knows that it was caused by the device at that level.
In AIX, devices may share interrupt levels, such that more than one adapter may share the same level.
Controlling Interrupts
A kernel process can disable some or all types of interrupts for short periods. The interrupted process will safely return to continue execution. Some interrupt types can be found in <sys/m_intr.h>.
Most interrupts are not concerned with which process is getting
interrupted. The major counterexample is the clock interrupt. This is used
to update the run-time statistics for the currently running process.
Critical sections
A critical section is a code section that must be executed without any break, for example when data is examined and then changed based on its value. A process disables interrupts across a critical section to ensure that the section is executed without breaks.
Out-of-order instruction execution and interrupts
On modern processors, such as Power and IA-64, many instructions are being executed at one time. When a hardware interrupt occurs, instructions already in progress are executed to completion and any following instructions are terminated with no effect on the processor registers or memory; results from out-of-order instructions are discarded. This is what is meant by "interrupts are guaranteed to occur between the execution of instructions": the processor makes sure that the effect of its operations is equivalent to an interrupt occurring between the execution of instructions.
Interrupt handling
When an interrupt is received, AIX performs several steps to handle the interrupt properly.
Saving and restoring machine state
AIX maintains a set of machine state save (mstsave) areas. Each processor has a pointer to the mstsave area it should use when the next interrupt occurs. This pointer is called the current save area, or csa, pointer. When state needs to be saved, AIX uses this area. When an interrupt handler returns, AIX restores the machine state that was in effect when the interrupt occurred.
mstsave area description
Because the mstsave (machine state) areas are linked together, the mstsave areas provide an interrupt history stack.
(Figure: the csa pointer references the current mstsave area, and each area's prev pointer links to the one before it, forming a chain.)
Size limitation on mstsave area and interrupt stack
The stack used by an interrupt handler is kept in the same page as the mstsave area. This limits the stack to 4 KB on the 32-bit kernel and 8 KB on the 64-bit kernel, minus the size of the mstsave area. Using this area for the stack ensures that the stack is pinned, which is required for interrupt handlers.
Saving base level machine state
The thread's base level state save area is in the initial thread's uthread block. In the 32-bit kernel, there is also the user64 area, which is used to save the 64-bit user registers for 64-bit processes. (Figure: the uthread block within the process ublock, with the user64 area present in the 32-bit kernel only.)
The user64 area is only used when the process is a 64-bit process on a 32-bit kernel. If the user64 area is being used, it is initialized and pinned. The area is created when a process calls exec() for a 64-bit executable. It is destroyed when a 64-bit process exits or calls exec() for a 32-bit executable.
The portion of the base level state save area that contains the 32-bit registers is unused for 64-bit processes.
In the 32-bit kernel, only the base level state save (MST) area needs to have a 64-bit register state save area (user64) associated with it. Since all interrupt handlers run in 32-bit kernel mode, all state save areas other than the base level state save area only need to save 32-bit state (even on 64-bit hardware). In the 64-bit kernel, all MST areas are 64-bit.
IA-64 formats
(Figure: IA-64 data formats, with bit positions 63, 31, 15, 7, 0 marked on one format and 79, 63, 31, 0 on another.)
The basic IA-64 data type is 8 bytes. Apart from a few exceptions, all integer operations are on 64-bit data, and registers are always written as 64 bits. Therefore, 1-, 2-, and 4-byte operands loaded from memory are zero-extended to 64 bits.
Instruction format
A typical IA-64 instruction is a three-operand instruction, with the following syntax:

Simple instruction:
add r1 = r2, r3
Predicated instruction:
(p4) add r1 = r2, r3
Instruction with immediate:
add r1 = r2, r3, 1
Instruction with completer:
cmp.eq p3 = r2, r4
IA-64 memory
Memory organization
IA-64 defines a single, uniform, linear address space of 2^64 bytes, which is divided into 8 regions of size 2^61. A single space means that both data and instructions share the same memory range. Uniform means that there
and instructions share the same memory range. Uniform means that there
are no address regions with predefined functionality. Linear means that the
address space contains no segments; all 2^64 bytes are consecutive.
All code is stored in little-endian byte order in memory. Data is typically
stored in little-endian byte order. IA-64 also provides support for big-endian
code and operating systems.
Moving data between registers and memory is performed strictly through the load (ld) and store (st) instructions. IA-64 supports loads and stores of all data types. Because registers are written as 64 bits, loads are zero-extended. Stores always write the exact number of bytes for the required format.
The size of the memory location is specified in the opcode as a number:
• st1/ld1 = byte (8 bits)
• st2/ld2 = halfword (16 bits)
• st4/ld4 = word (32 bits)
• st8/ld8 = doubleword (64 bits)
Example:

// Load 32 bits from address 4 + r30 into r31; high 32 bits are cleared on a
// 64-bit processor
add r31 = 4, r30
ld4 r31 = [r31]
Region Usage
On IA-64, the 64-bit linear address space consists of 8 regions of size 2^61, with the upper 3 bits of the address selecting a virtual region, a physical region register, and an associated region identifier. The region identifier (RID), much like the POWER segment identifier (SID), participates in the hardware address translation, such that in order to share the same address translation, the same RID must be used. The sharing semantic (private, globally shared, shared-by-some) is determined by whether or not multiple processes utilize the same RID.
For example, a process’s private storage resides within a region whose
RID is mapped only by that process. Therefore, address space usage is in
a large part determined by assigning the desired sharing semantics to
each of the 8 virtual regions and mapping the appropriate objects into
those regions that require those semantics.
There are two imporant properties associated with this region usage. First,
the mapping of objects to regions is many-to-one. That is, multiple objects
map into a single region. Second, mapping the same object to different
regions results in aliases. This is a distinct difference from the POWER
architecture where an object (a.k.a. SID) is addressed the same
regardless of the virtual address used. Aliases simply additional address
translations on IA64 and thus a likelyhood for decreased performance and
so their use should be minimized.
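The region selection described above can be sketched as a simple bit extraction. This is an illustrative model (the helper names are not from any IA-64 or AIX documentation): the upper 3 bits of a 64-bit virtual address pick one of the 8 regions, and the remaining 61 bits are the offset within that region.

```python
def region_of(va):
    """Return the virtual region number (0-7) selected by the
    upper 3 bits of a 64-bit IA-64 virtual address."""
    return (va >> 61) & 0x7

def region_offset(va):
    """Return the offset within the 2^61-byte region."""
    return va & ((1 << 61) - 1)
```

For example, an address whose top three bits are all set falls in region 7.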
Another significant departure from AIX is that the majority of the 64-bit address space is managed using Single Address Space (SAS) semantics. This is necessary to achieve the desired degree of sharing of address translations for shared objects: to achieve a single translation for an object, all accesses must be made through a common global address. Such a semantic is possible by virtue of the IA-64 protection keys, which provide additional access control beyond address translation. A process that maps a region thus has access only to those objects within the region for which it holds the appropriate protection key. Note that AIX manages some parts of the process address space as SAS -- for example, the shared library text segment contains mappings whose addresses are common across all processes. The AIX use of the SAS style of management is minimal because the POWER architecture provides for sharing on a segment basis regardless of the virtual address used to map the segment. To achieve the same degree of sharing on IA-64, a shared object must be mapped at a global address.
Region usage (continued): In addition to the sharing semantics, additional properties influence the location of objects within regions. First, to preserve the flat address space with a logical boundary between user and kernel space, it is useful to place user and kernel objects at opposite ends of the address space whenever feasible. Next, the IA-64 architecture provides for multiple page sizes and a preferred page size per region, so objects with similar page size requirements are most naturally colocated within the same region. Finally, certain object types, such as executable text, have properties and uses that mandate that they be isolated to a separate region.

Given these general guidelines, the following table shows the selected region usage, and subsequent sections describe each region's use in greater detail. These selections dedicate 4 regions to user space and 3 to the kernel for the initial release.
IA-64 Instructions
IA-64 processor: Instruction groups reduce the need to optimize the code for each new microarchitecture. Processors with additional resources will take advantage of the existing ILP in the instruction group.

The template field maps each instruction to an execution unit. This allows the processor to dispatch all three instructions in parallel. A 128-bit bundle holds three 41-bit instruction slots and a 5-bit template field:

bits 127-87 = instruction slot 2, bits 86-46 = instruction slot 1, bits 45-5 = instruction slot 0, bits 4-0 = template
Template: The set of templates defines the combinations of functional units that can be invoked by executing a single bundle. This in turn lets the compiler schedule the functional units in an order that avoids contention. The template can also indicate a stop. The 24 available templates are:

MII   MIIs     MMI   MMIs     MFI   MFIs     MIB   MIBs
MIsI  MIsIs    MsMI  MsMIs    MMF   MMFs     MBB   MBBs
MLX*  MLXs*    BBB   BBBs     MMB   MMBs     MFB   MFBs

Where:
M - is a memory function
I - is an integer function
F - is a floating point function
B - is a branch function
L - is a function involving a long immediate
"s" indicates a stop.
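The bundle layout above can be sketched as a simple field extraction. This is a minimal sketch, assuming the bundle is available as a 128-bit integer; decoding the 41-bit slots into actual instructions is not attempted.

```python
def decode_bundle(bundle):
    """Split a 128-bit IA-64 bundle into its template and three
    41-bit instruction slots (bits 4-0 = template, bits 45-5 =
    slot 0, bits 86-46 = slot 1, bits 127-87 = slot 2)."""
    mask41 = (1 << 41) - 1
    template = bundle & 0x1F
    slot0 = (bundle >> 5) & mask41
    slot1 = (bundle >> 46) & mask41
    slot2 = (bundle >> 87) & mask41
    return template, slot0, slot1, slot2
```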
Branch instructions: All instructions beginning with "br." are branches. The IA-64 architecture provides three branch types:
IA-64 Registers
Registers: IA-64 provides several register files that are visible to the programmer:
• 128 64-bit general registers (gr0-gr127)
• 128 82-bit floating-point registers (fr0-fr127)
• 64 1-bit predicate registers (p0-p63)
• 8 64-bit branch registers (br0-br7)
• 128 64-bit application registers (ar0-ar127)
• the 64-bit instruction pointer (IP)
General registers: The 128 64-bit general registers (gr0-gr127) are used for integer computation. gr0 always reads as zero and cannot be written.
Floating-point registers: There are:
• 32 static floating-point registers (fr0-fr31)
• 96 rotating floating-point registers (fr32-fr127), for software pipelining
The first two registers (fr0 and fr1) are read-only: fr0 always reads +0.0 and fr1 always reads +1.0.

Predicate registers: The 64 1-bit predicate registers (p0-p63) are used for:
• validating/invalidating instructions
• eliminating branches in if/then/else logic blocks
p0 always reads 1.
With predication, IA-64 can execute both paths of a conditional in parallel, using the predicate registers to cancel the path that is not taken. In this way the processor avoids branch penalties by simply executing both alternatives and committing only the results whose predicate is true.
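Predicated execution of an if/then/else can be modeled in a few lines. This is a sketch of the idea, not real IA-64 semantics: cmp.eq sets a complementary predicate pair, both predicated moves "execute," and only the one whose predicate is true commits its result.

```python
def predicated_select(a, b, then_val, else_val):
    """Model of if/then/else compiled to predicated code."""
    p2 = (a == b)          # cmp.eq p2, p3 = a, b
    p3 = not p2
    r1 = None
    if p2:                 # (p2) mov r1 = then_val -- commits only if p2
        r1 = then_val
    if p3:                 # (p3) mov r1 = else_val -- commits only if p3
        r1 = else_val
    return r1
```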
Branch registers: Eight 64-bit branch registers (br0-br7) are used to specify the branch target addresses for indirect branches.
Application registers: The 128 64-bit application registers (ar0-ar127) include:
• ar0-ar7: KR0-KR7 (kernel registers)
• ar36: UNAT (user NaT collection)
• ar40: FPSR (floating-point status register)
• ar44: ITC (interval time counter)
• ar64: PFS (previous function state)
• ar65: LC (loop count)
• ar66: EC (epilog count)
Instruction pointer (IP): The 64-bit instruction pointer holds the address of the bundle containing the currently executing instruction. The IP cannot be directly read or written; it increments as instructions are executed, and branch instructions set it to a new value. The IP is always 16-byte aligned.
Register validity: Each general register has an associated NaT (Not a Thing) bit that tracks whether its contents are valid.

If data needs to travel from memory to the processor, there is always a delay before it arrives. This is called memory latency. In an attempt to hide this time, the processor tries to read the memory beforehand.

If data has been read in advance and other data has then been written back to that exact location, the data already read in becomes invalid.
IA-64 Operations
Rotating registers are registers that are rotated by one register position on each loop iteration. The logical names of the registers are rotated in a wrap-around fashion, so that the value in logical register X appears in logical register X+1 after one rotation. The predicate, floating-point, and general registers can be rotated.
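A one-position rotation can be sketched as follows. This is a simplified model: real hardware renames the registers rather than moving any values.

```python
def rotate_once(regs):
    """After one rotation, the value that was visible in logical
    register X becomes visible in logical register X+1, with
    wrap-around from the last register back to the first."""
    return [regs[-1]] + regs[:-1]
```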
IA-64 provides support for special branch instructions. One example is the
br.cloop instruction, used for simple counted loops.
The cloop branch instruction uses the LC application register, rather than a qualifying predicate, to determine the branch condition.
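The br.cloop behavior can be sketched as follows, under the usual semantics: while LC is nonzero, the branch decrements LC and loops back; when LC reaches zero, the loop falls through, so in this model the body runs LC+1 times.

```python
def cloop(lc, body):
    """Model of a counted loop driven by the LC application register
    and br.cloop (no qualifying predicate is involved)."""
    while True:
        body()
        if lc == 0:        # br.cloop falls through when LC == 0
            break
        lc -= 1            # otherwise decrement LC and branch back
```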
IA-64 allows you to eliminate many memory accesses through the use of
large register files to manage work in progress, and by allowing better
control of the memory hierarchy.
(Figure: an early load (ld) creates a dependency that a later check instruction validates.)
Memory access is supported through the load (ld) and store (st)
instructions. All other integer, floating-point and branch instructions use the
registers as operands.
IA-64 enables you to hide the memory latency of the remaining load instructions by placing speculative loads prior to code barriers, so the stall caused by memory latency is minimized. This also creates more opportunities for parallelism. When you use speculative loads, error/exception detection is deferred until the final result is actually required:
• If no error/exception is detected, the latency is hidden.
• If an error/exception is detected, the memory accesses and dependent instructions must be redone by an exception handler.
IA-64 provides an advanced load instruction (ld.a) that allows you to move potentially data-dependent loads earlier in the code. To verify the data speculation, a check load instruction (ld.c) must be placed at the location of the original load instruction.
If the contents of the memory address have not changed since the advanced load, the speculation succeeded and the memory latency is hidden. If the contents have been changed by a store instruction, the ld.c instruction repeats the load.

Data speculation does not defer exceptions. For example, page faults are taken immediately.
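The ld.a/ld.c pair can be modeled with a small table of advanced-load entries. This is an assumed simplification for illustration, not the actual hardware tracking structure: ld.a records the address it loaded from, an intervening store to that address invalidates the entry, and ld.c either keeps the speculative value (hit) or reloads from memory (miss).

```python
class AdvancedLoadTable:
    """Sketch of data speculation with ld.a / ld.c."""
    def __init__(self):
        self.entries = {}                 # target register -> address

    def ld_a(self, mem, reg, addr):
        """Advanced load: record the address, return the value."""
        self.entries[reg] = addr
        return mem[addr]

    def st(self, mem, addr, value):
        """Store: invalidate any advanced-load entry for this address."""
        mem[addr] = value
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def ld_c(self, mem, reg, addr, spec_value):
        """Check load: keep the speculative value on a hit, reload on a miss."""
        if self.entries.get(reg) == addr:
            return spec_value             # speculation succeeded; latency hidden
        return mem[addr]                  # conflicting store occurred; reload
```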
IA-64 also provides a control-speculative load instruction (ld.s), which executes the load while speculating on the outcome of the governing branch. Control-speculative loads are also referred to simply as speculative loads.

To verify the load, a check instruction (chk.s) is placed at the location of the original load. IA-64 uses a NaT bit/NaTVal to track the success of the load. If the NaT bit/NaTVal indicates a deferred exception, the chk.s instruction jumps to correction code that repeats all dependent instructions. The correction code is generated by the compiler or assembly writer.

If the load is successful, the speculation succeeded and the memory latency is hidden.
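The deferred-exception behavior of ld.s/chk.s can be sketched like this. It is an illustrative model of the NaT mechanism, not real instruction semantics: a failing speculative load sets a flag instead of faulting, and the later check routes execution to recovery code if the flag is set.

```python
def ld_s(mem, addr):
    """Speculative load: instead of faulting on a bad address,
    set the NaT flag so the exception is deferred."""
    if addr not in mem:
        return None, True                 # deferred exception: NaT set
    return mem[addr], False

def chk_s(value, nat, recovery):
    """Check: run recovery code if NaT is set; otherwise the
    speculative value is used and the latency was hidden."""
    return recovery() if nat else value
```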
Procedure calls: The traditional use of a procedure stack in memory for procedure call management incurs a large overhead. IA-64 instead uses the general register stack for procedure call management, eliminating the frequent memory accesses. The general register stack consists of 96 general registers, starting at r32, used to pass parameters to the called procedure and to store local variables for the currently executing procedure. This register stack structure allows:
• the caller procedure to pass parameters through registers to the called procedure
• dynamic allocation of local registers for the currently executing procedure
• allocating a maximum of 96 logical registers for each function
IA-32:
  Procedure A
    call B
  Procedure B
    save current register state
    ...
    restore previous register state
    return

IA-64:
  Procedure A
    call B
  Procedure B
    alloc -- no save!
    ...
    return -- no restore!
The called procedure can resize the frame to include its own input, local
and output area, using the alloc instruction. For each subsequent call, this
sequence is repeated, and a new procedure frame is created.
When the procedure returns, the processor unwinds the register stack, the
current frame is released, and the previous procedure’s frame is restored.
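The overlap of caller and callee frames can be sketched as follows. This is a simplified model under assumed conventions (frame bookkeeping as (base, size) pairs is illustrative; the hardware uses the PFS application register for this):

```python
class RegisterStack:
    """Sketch of IA-64 register-stack frames: stacked registers begin
    at r32; a call starts the callee's frame at the caller's output
    area, so parameters are passed without copying; alloc resizes the
    current frame; return restores the caller's frame."""
    def __init__(self):
        self.frames = [(32, 0)]           # (base register, frame size)

    def call(self, n_out):
        base, size = self.frames[-1]
        # The callee's frame begins at the caller's output registers.
        self.frames.append((base + size - n_out, n_out))

    def alloc(self, ins, locs, outs):
        base, _ = self.frames[-1]
        self.frames[-1] = (base, ins + locs + outs)

    def ret(self):
        self.frames.pop()                 # caller's frame is restored
```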
Register stack engine (RSE): IA-64 provides a Register Stack Engine (RSE), which operates transparently in the background to ensure that an overflow does not occur and that the contents of the registers are always available. The RSE is not visible to the software.

When the stack fills up, the RSE saves stacked registers to memory, thus freeing them. The stored registers are restored in the same way when necessary.
Floating point and multimedia: IA-64 provides high floating-point performance, with full IEEE floating-point support for single, double, and double-extended formats.

Multimedia instructions operate in parallel on the packed elements of a register (for example, computing a3+b3, a2+b2, a1+b1, a0+b0 in a single instruction):
• Addition and subtraction (including 3 forms of saturating arithmetic)
• Multiplication
• Left shift, signed and unsigned right shift
• Pack and unpack to convert between different element sizes

Each 82-bit floating-point register holds a sign bit (bit 81), an exponent (bits 80-64), and a significand (bits 63-0).
IA-64 provides four separate status fields (sf0-sf3), enabling four different computational environments. Each status field contains dynamic control and status information for floating-point operations.

The FPSR contains the four status fields and a traps field that enables masking of the IEEE exception events and denormal operand exceptions. This register also includes 6 reserved bits, which must be 0.
Interrupts: Interrupts are events that occur during IA-32 or IA-64 instruction processing, causing control to be passed to an interrupt handling routine. In the process, certain processor state is saved automatically by the processor. Upon completion of interrupt processing, a return from interrupt (rfi) is executed, which restores the saved processor state. Execution then proceeds with the interrupted IA-32 or IA-64 instruction.

Interrupt definitions: Depending on how an interrupt is serviced, interrupts are divided into IVA-based interrupts and PAL-based interrupts. By cause, interrupts are divided into four types: Aborts, Interrupts, Faults, and Traps.
Aborts
Processor Reset (RESET): A processor has been powered on or a reset request has been sent to it. The PALE_RESET entry point is entered to perform processor and system self-test and initialization.
Faults: The current IA-64 or IA-32 instruction requests an action that cannot or should not be carried out, or system intervention is required before the instruction is executed. Faults are synchronous with respect to the instruction stream. The processor completes state changes that have occurred in instructions prior to the faulting instruction. The faulting and subsequent instructions have no effect on machine state. Faults are IVA-based interrupts.
Traps: The IA-32 or IA-64 instruction just executed requires system intervention. Traps are synchronous with respect to the instruction stream. The trapping instruction and all previous instructions are completed. Subsequent instructions have no effect on machine state. Traps are IVA-based interrupts.
(Figure: RESET, MCA, INIT, and PMI are PAL-based interrupts; INT (NMI, ExtINT, ...) are IVA-based interrupts.)
Interrupt programming model: When an interrupt event occurs, hardware saves the minimum processor state required to enable software to resolve the event and continue. The state saved by hardware is held in a set of interrupt resources, and
together with the interrupt vector gives software enough information to
either resolve the cause of the interrupt, or surface the event to a higher
level of the operating system. Software has complete control over the
structure of the information communicated, and the conventions between
the low-level handlers and the high-level code. Such a scheme allows
software rather than hardware to dictate how to best optimize performance
for each of the interrupts in its environment. The same basic mechanisms
are used in all interrupts to support efficient IA-64 low-level fault handlers
for events such as a TLB fault, speculation fault, or a key miss fault.
PSR.ic (interrupt state collection bit): The PSR.ic bit supports an efficient nested interrupt model. Under normal circumstances the PSR.ic bit is enabled.
When an interrupt event occurs, the various interrupt resources are
overwritten with information pertaining to the current event. Prior to saving
the current set of interrupt resources, it is often advantageous in a miss
handler to perform a virtual reference to an area which may not have a
translation. To prevent the current set of resources from being overwritten
on a nested fault, the PSR.ic bit is cleared on any interrupt. This will
suppress the writing of critical interrupt resources if another interrupt
occurs while the PSR.ic bit is cleared. If a data TLB miss occurs while the
PSR.ic bit is zero, then hardware will vector to the Data Nested TLB fault
handler.
e-server p-series (RS/6000) introduction: This section introduces the RS/6000, giving a brief history of the products, an overview of the RS/6000 design, and a description of key RS/6000 technologies.
The RS/6000 family combines the benefits of UNIX computing with IBM's leading-edge RISC technology in a broad product line - from powerful desktop workstations ideal for mechanical design, to workgroup servers for departments and small businesses, to enterprise servers for medium to large companies running ERP and server consolidation applications, up to massively parallel RS/6000 SP systems that can handle demanding scientific and technical computing, business intelligence, and Web serving tasks. Along with AIX, IBM's award-winning UNIX operating system, and HACMP, the leading high-availability clustering solution, the RS/6000 platform provides the power to create change and the flexibility to manage it, with a wide variety of applications that provide real value.
RS/6000 The first RS/6000 was announced February 1990 and shipped June
History 1990. Since then, over 1,100,000 systems have shipped to over 132,000
customers.
The next figure summarizes the history of the RS/6000 product line,
classified by machine type. For each machine type, the I/O bus
architecture and range of processor clock speeds are indicated. The
figure shows the following:
• In the past, RS/6000 I/O buses were based on the Micro Channel Architecture (MCA). Today, RS/6000 I/O buses are based on the industry-standard Peripheral Component Interconnect (PCI) architecture.
• Processor speed, one key element of RS/6000 system performance,
has increased dramatically over time.
• There have been many machine types over the entire RS/6000
history. In recent years, there has been considerable effort to reduce
the complexity of the model offerings without creating gaps in the
market coverage.
RS/6000 history (1990 to today):
• 7011 (33 to 80 MHz) - Micro Channel Workstations
• 7248 (100 to 133 MHz) - PCI Workstations
• 7006 (80 to 120 MHz) - Micro Channel Entry Desktops
• 7009 (80 to 120 MHz) - Micro Channel Compact Servers
• 7013 (20 to 200 MHz) - Micro Channel Deskside Systems
• 7012 (20 to 200 MHz) - Micro Channel Desktop Systems
• 7015 (25 to 200 MHz) - Micro Channel Rack Systems
• 7024 (100 to 233 MHz) - PCI Deskside Systems
• 7025 (166 to 500 MHz) - PCI Workgroup Servers - Deskside Systems
• 7043 (166 to 375 MHz) - PCI Workstations & Workgroup Servers
• 7044 (333 to 400 MHz) - PCI Workstations & Workgroup Servers
• 7046 (375 MHz) - PCI Workgroup Servers - Rack Systems
• 7026 (166 to 500 MHz) - PCI Workgroup Servers - Rack Systems
• 7017 (125 to 450 MHz) - PCI Enterprise Servers
• SP1, SP2, SP - All Node Types
PowerPC and Power2 CPU family: PowerPC CPUs started as a joint effort between Motorola, Apple, and IBM. The family consists of the PPC601, PPC604, and PPC604e. These CPUs are very close to those produced by Motorola and used in Apple systems; currently the PPC604e CPU is used in the Models F50, B50, and 43P.
(Figure: CPU block diagram - two floating-point units (FPU1, FPU2), three fixed-point units (FXU1-FXU3), and two load/store units (LS1, LS2). CPU registers: 32 x 64-bit integer (fixed point) and 32 x 64-bit FP (floating point); register buffers for register renaming: 24 FP, 16 integer. Branch/dispatch unit with a 2048-entry branch history table and a 256-entry branch target cache. 32 KB, 128-way instruction cache and 64 KB, 128-way data cache, each with its own memory management unit. A bus interface unit (BIU) with L2 control and clock connects to the direct-mapped 1-16 MB L2 cache over a 32-byte path (at 200 MHz = 6.4 GB/s) and to the 6XX bus over a 16-byte path (at 100 MHz = 1.6 GB/s).)
RS64 and RS64 II CPUs: The RS64 microprocessor, based on the PowerPC Architecture, was designed for leading-edge performance in OLTP, e-business, BI, server consolidation, SAP, Notesbench, and Web serving for the commercial and server markets. It is the basis for at least four generations of RS/6000 and AS/400 enterprise server offerings.

The RS64 processor focuses on commercial performance, with emphasis on conditional branches with a zero- or one-cycle incorrect branch prediction penalty. It contains 64 KB L1 instruction and data caches, one-cycle load support, four superscalar fixed-point pipelines, and one floating-point pipeline. An on-board bus interface unit (BIU) controls both the 32 MB L2 bus interface and the memory bus interface.
RS64 and RS64 II are defined by the following specifications:
• 125 MHz RS64/262 MHz RS64 II on the RS/6000 Model S70
• 262 MHz RS64 II on the RS/6000 Model S70 Advanced
• 340 MHz RS64 II on the RS/6000 Model H70
• 64 KB on-chip, L1 instruction cache
• 64 KB on-chip four-way set associative data cache
• 32 MB L2 cache
• Superscalar design with integrated integer, floating-point, and branch
units
• Support for up to 64-way SMP configurations (currently 12-way)
• 128-bit data bus
• 64-bit real memory addressing
• Real memory support for up to one terabyte (2^40 bytes)
• CMOS 6S2 using a 162 mm2 die, 12.5 million transistors
(Figure: RS64 block diagram - simple fixed-point, complex fixed-point, floating-point, and load/store units; branch/dispatch unit; instruction and data caches, each with a memory management unit and a 32-byte path; bus interface unit (BIU) with L2 control and clock, connecting over a 32-byte interface to the 1-32 MB L2 cache and a 16-byte interface to the 6XX bus.)
RS64 III: The RS64 III processor is designed for applications that place heavy demands on system memory. The RS64 III architecture addresses both the need for very large working sets and the need for low latency. Latency is measured by the number of CPU cycles that elapse before requested data or instructions can be used by the processor.
The RS64 III processors combine IBM advanced copper chip technology
with a redesign of critical timing paths on the chip to achieve greater
throughput. The L1 instruction and data caches have been doubled to
128 KB each. New circuit design techniques were used to maintain the
one cycle load-to-use latency for the L1 data cache.
L2 cache performance on the RS64 III processor has been significantly
improved. Each processor has an on-chip L2 cache controller and an
on-chip directory of L2 cache contents. The cache is four-way set
associative. This means that directory information for all four sets is
accessed in parallel. Greater associativity results in more cache hits and
lower latency, which improves commercial performance.
Using a technique called Double Data Rate (DDR), the new 8 MB static RAM (SRAM) used for the L2 is capable of transferring data twice during each clock cycle. The L2 interface is 32 bytes wide and runs at 225 MHz (half processor speed), but, because of the use of DDR, it provides 14.4 GB/s of throughput.
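The quoted throughput follows directly from the figures in the text:

```python
# RS64 III L2 bandwidth: a 32-byte-wide interface at 225 MHz with
# two transfers per cycle (DDR).
width_bytes = 32
clock_hz = 225_000_000
transfers_per_cycle = 2                   # Double Data Rate
bandwidth_bytes_per_s = width_bytes * clock_hz * transfers_per_cycle
# 32 * 225e6 * 2 = 14.4e9 bytes/s, i.e. 14.4 GB/s
```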
System bus information: All current systems in the RS/6000 family are equipped with PCI buses. The PCI architecture provides an industry-standard specification and protocol that allows multiple adapters access to system resources through a set of adapter slots.

Each PCI bus has a limit on the number of slots (adapters) it can support, typically from two to six. To overcome this limit, the system design can implement multiple PCI buses. Two different methods can be used to add PCI buses to a system:
• Secondary PCI bus. The simplest method to add PCI slots when designing a system is to add a secondary PCI bus. This bus is bridged onto a primary bus using a PCI-to-PCI bridge chip.
• Multiple primary PCI buses. Another method of providing more PCI slots is to design the system with two or more primary PCI buses. This design requires a more sophisticated I/O interface with the system memory.
32-bit hardware characteristics: 32-bit POWER and PowerPC processors all have the following features in common:
User registers
• 32 general-purpose integer registers, each 32 bits wide (GPRs)
• 32 floating-point registers, each 64 bits wide (FPRs)
• A 32-bit Condition Register (CR)
• A 32-bit Link Register (LR)
• A 32-bit Count Register (CTR)
System Registers
• 16 Segment Registers (SRs)
• A Machine State Register (MSR)
• A Data Address Register (DAR)
• Two Save and Restore Registers (SRRs)
• 4 special purpose (SPRG) registers (PowerPC only)
All instructions are 32 bits long. The Data Address Register contains the memory address that caused the last memory-related exception.

The SRRs are used to save information when an interrupt occurs:
• SRR0 points to the instruction that was running when the interrupt occurred
• SRR1 contains the contents of the MSR when the interrupt occurred
General purpose registers: The General Purpose Registers (GPRs), often just called Rs, are used for loads, stores, and integer calculations.

No memory-to-memory operations are provided; data always moves through registers.
Condition register: The condition register (CR) contains bits set by the results of compare instructions. It is treated as eight 4-bit fields (CR0-CR7). The bits are used to test for less-than, greater-than, equal, and overflow conditions.
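Treating the CR as eight 4-bit fields can be sketched as a bit extraction (the helper name is illustrative; CR0 occupies the most significant bits of the 32-bit register):

```python
def cr_field(cr, n):
    """Extract the 4-bit field CRn from the 32-bit condition register.
    Within a field, the most significant bit is LT, then GT, then EQ,
    then the summary-overflow bit."""
    return (cr >> (28 - 4 * n)) & 0xF
```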
Link register: The link register (LR) is set by some branch instructions. Its contents point to the instruction to be executed immediately after the branch. It is typically used in subroutine calls to determine where to return to.
Machine state register: The MSR controls many of the current operating characteristics of the processor. Among these are:
• Privilege Level (Supervisor vs. Problem or Kernel vs. User)
• Addressing modes (virtual vs. real)
• Interrupt enabling
• Little-endian vs. Big-endian mode
Instruction set
A single instruction generally modifies only one register or one memory location. Exceptions to this are the "multiple" and "update" operations.

Register to register operations: These operations always list at least two registers, where the first is the target for the result of the instruction and the others provide the input to the operation.
Examples:
• or r3,r4,r5   # Logical ORs r4 and r5, result into r3
• addi r1,r1,0x48   # Adds 0x48 to r1, result into r1
Register to memory operations: Register-memory operations always have one register and one memory location. The register is always listed first. The size of the memory location is specified in the opcode:
• b = byte (8 bits)
• h = halfword (16 bits)
• w = word (32 bits)
• d = doubleword (64 bits)
All opcodes beginning with “l” are loads and all opcodes beginning with
“st” are stores.
Register to memory operation examples:
• lwz r31,4(r30)   # Loads 32 bits from address 4+r30 into r31. High 32 bits cleared on a 64-bit processor
• std r3,-8(r29)   # Stores 64 bits from r3 to address r29 - 8. Invalid operation on a 32-bit processor
• lbz r0,27(r1)   # Loads 8 bits from address 27+r1 into r0. Top 24/56 bits are cleared
• sth r3,0x56(r1)   # Stores low 16 bits from r3 to address 0x56+r1
Notice that the load instructions also have a "z" in their mnemonics. The "z" stands for "zero," and is intended to make clear that these instructions clear any bits in the target register that were not actually copied from memory.

In case you were wondering, there are load instructions without the "z". lwa and lha are "algebraic" loads, meaning that the value being loaded is sign-extended to fill out the rest of the register. This is used when loading a signed value: if a halfword held a negative value, lhz would make it positive, but lha would preserve the value's "negativeness."
Compare instructions: There are four variations of compare instructions, all beginning with "cmp". They compare two values. The result of the comparison is placed in the Condition Register (CR), where the various bits that can be set are:
• LT = less than
• GT = greater than
• EQ = equal
• OV = overflow (a.k.a. carry bit)
Branch instructions: All instructions beginning with a "b" are branches. They change the address of the next instruction to be run.

Branches can be conditional, depending on whether the option bit matches the specified bit in the CR. A branch instruction can specify which CR field to use; CR0 is assumed unless otherwise specified. Extended mnemonics are defined by the assembler to cover most combinations.
The conditional branch instruction is central to any computer
architecture. However, most architectures (including POWER and
PowerPC) avoid putting comparisons directly into their branch
instructions (to keep things simple). They provide compare instructions
that set “condition bits.” These bits are what are used on branch
instructions to make the actual decision.
The assembler (and crash’s disassembler) provides extended
mnemonics that combine a type of branch and the condition register bit
that determines whether the branch is taken. Another bit in the branch
opcode determines whether the CR bit must be on or off for the branch to
take place. This bit is also incorporated into the extended mnemonics
(the “not” versions of the branches). For maximum flexibility, the
assembler usually also allows you to specify the “not” cases as the
logically-opposite case. For example, bnl (branch not less than) can also
be written as bge (branch greater than or equal to). Either case is still
saying, “branch if the LT bit is turned off.”
Examples
• blt 0x38c00   # Branches to address 0x38c00 if LT bit is on in CR0
• bge cr3,0x...   # Branches if LT bit is off in CR3
• bnelr cr7   # Branches to address in LR if EQ bit is off in CR7
• blea cr2,0x3600   # Branches to absolute address 0x3600 if GT bit is off in CR2
Trap instructions: Most mnemonics beginning with a "t" are traps, and generate a program exception if the specified condition is met. There are two variations of the trap instruction.

The "w" mnemonics are the PowerPC indication that these trap instructions work on 32-bit values. As with branches, extended mnemonics are defined to provide various traps; in this context 'lt', 'gt', 'eq', etc. have the same meaning as in branch mnemonics.
Examples
• tweq r3,r4   # Traps if r3 equals r4
• twnei r31,0   # Traps if r31 is not equal to 0
Trap instructions are the only instructions in this architecture that perform
a comparison and take some action, all in one instruction. They do not
set or use condition register bits.
Special register operations: The Special Purpose Registers (SPRs) can only be copied to or from GPRs.
• mfspr r3,8   # Copies SPR 8 into r3
• mtspr 9,r3   # Copies r3 into SPR 9
Interrupt vectors: Interrupt vectors are addresses of short sections of code that save the state of the processor and then branch to a handler routine. Some examples are:
64-bit hardware characteristics: With full hardware 32-bit binary compatibility as the baseline, the features that characterize a PowerPC processor as 64-bit include:
• 64-bit general registers
• 64-bit instructions for loading and storing 64-bit data operands, and
for performing 64-bit arithmetic and logical operations.
• two execution modes: 32-bit and 64-bit. Whereas 32-bit processors
have implicitly only one mode of operation, 32-bit execution mode
on a 64-bit processor causes instructions and addressing to
behave the same as on a 32-bit processor. As a separate mode,
64-bit execution mode creates a true 64-bit environment, with 64-bit
addressing and instruction behavior.
• 64-bit physical memory addressing facilities
• additional supervisor instructions, as needed to set up and control the execution mode. A key feature the PowerPC 64-bit architecture provides is execution mode at a per-process level, helping AIX to create, at the system level, a mixed environment of concurrent 32-bit and 64-bit processes.
Segment table: The 64-bit virtual address space is represented with a segment table, which acts as an in-memory set-associative cache of the 256 most recently used segment-number-to-segment-ID mappings. The current segment table is pointed to by the 64-bit Address Space Register (ASR). The ASR has a valid bit to indicate whether a segment table is in use; in 32-bit mode on 64-bit processors, this bit indicates that the segment table is not being used.

IBM "bridge extensions" to the PowerPC 64-bit architecture allow segment register operations to work in 32-bit mode, letting the kernel continue to manipulate segment registers: the "bridge extensions" are used to load and store "segment registers" instead.
Symmetric multiprocessing: On uniprocessor systems, bottlenecks exist in the form of the address and data bus restricting transfers to one at a time, and the single program counter forcing instructions to be executed in strict sequence. Some performance improvement was achieved by constantly improving the speeds of these uniprocessor machines.
Types of Multiprocessors:
• Loosely-coupled MP
• Tightly-coupled MP
• Symmetric MP
Loosely coupled MP: Has different systems on a communication link, with the systems functioning independently and communicating when necessary. The separate systems can access each other's files and may even offload tasks to a lightly loaded CPU to achieve some load balancing.
Tightly Uses a single storage shared by the various processors and a single
coupled MP operating system that controls all the processors and system hardware.
Symmetric MP All of the processors are functionally equivalent and can perform I/O and
computation.
Multi- In order to have all CPUs work together, there must be some sort of
processor organization. There are three ways to do that:
organization
• Master/slave multiprocessing organization.
• Separate executives organization.
• Symmetric multi-processing organization.
Master slave One processor is designated as the master and the others are the slaves.
organization The master is a general purpose processor and performs input/output as
well as computation. The slave processors perform only computation.
The processors are considered asymmetric (not equivalent) since only the
master can do I/O as well as computation. Utilization of a slave may be
poor if the master does not service slave requests efficiently enough.
Another disadvantage concerns I/O-bound jobs, which may not run
efficiently since only the master does I/O.
Separate With this organization each processor has its own operating system and
executives responds to interrupts from users running on that processor. A process is
organization
assigned to run on a particular processor and runs to completion.
It is possible for some of the processors to remain idle while other
processors execute lengthy processes. Some tables are global to the
entire system and access to these tables must be carefully controlled.
Each processor controls its own dedicated resources, such as files and I/O
devices.
Symmetric All of the processors are functionally equivalent and can perform I/O and
multi- computation. The operating system manages a pool of identical
processing
organization processors, any one of which may be used to control any I/O devices or
reference any storage unit. Conflicts between processors attempting to
access the same storage at the same time are ordinarily resolved by
hardware. Multiple tables in the kernel can be accessed by different
processes simultaneously. Conflicts in access to systemwide tables are
ordinarily resolved by software. A process may be run at different times by
any of the processors and, at any given time, several processors may
execute operating system functions in kernel mode.
Multi- There are two ways of identifying separate processors. You can identify
processor them by:
definitions
The lowest number will start from ‘0’ on Power systems, but will start
from ‘1’ on IA-64.
One processor will be known as the default, or master, processor; this
concept is used for funneling. It is not a master processor in the sense of
master/slave processing - the term is used only to designate which
processor will be the default processor. It is defined by the value of
MP_MASTER in the <sys/processor.h> file.
MP safe MP safe code will run on any processor. It’s modified to prevent resource
clashes by adding locking code in order to serialize its execution.
MP efficient MP efficient code is MP safe code that also has data locking mechanisms
to serialize data access. This makes it easier to spread whatever the
code does across the available CPUs.
The MP efficient approach is intended for high-throughput device drivers.
Purpose This lesson describes how to configure and take system dumps on a node
running the AIX5L operating system.
Accountability You will be able to measure your progress with the following:
• Exercises using your lab system.
• Check-point activity
• Lesson review
Reference
Redbooks
Organization This lesson consists of information followed by exercises that allow you to
of this lesson practice what you’ve just learned. Sometimes, as the information is being
presented, you are required to do something - pull down a menu, enter a
response, etc. This symbol, in the left hand side-head, is an indication that
an action is required.
Introduction An AIX5L system can generate a system dump (or crash dump) when it
encounters a severe system error, such as an unexpected exception in
kernel mode or one that the kernel cannot handle. A dump can also be
initiated by the system administrator when the system has hung.
Analysis of the dump can be done on another machine, away from the
production machine, at a convenient time and location by a skilled kernel
person.
Process The process of taking a system dump is illustrated in the following chart.
The process involves two stages: in stage one, the contents of memory
are copied to a temporary disk location; in stage two, AIX5L is booted and
the memory image is moved to a permanent location in the /var/adm/ras
directory.
Process
continued
AIX5L in production
Introduction When the operating system is installed, parameters regarding the dump
device are configured with default settings. To ensure that a system dump
is taken successfully, the system dump parameters need to be configured
properly.
The system dump parameters are stored in system configuration objects
within the SWservAt ODM object class. Objects within the SWservAt
object class define where and how a system dump should be handled.
SWservAt The SWservAt ODM object class is stored in the /etc/objrepos directory.
object class Objects included within the object class are:
Dump Device When selecting the primary or secondary dump device the following rules
selection rules must be observed:
• A mirrored paging space may be used as a dump device.
• Do not use a diskette drive as your dump device.
• If you use a paging device, only use hd6, the primary paging device.
Preparing for a To ensure that a system dump will be successfully captured, complete the
system dump following steps:
Step Action
1. Estimate the size of the dump. This can be done through smit
by following the fast path:
# smit dump_estimate
Or, using the sysdumpdev command:
# sysdumpdev -e
(With Compression turned on)
0453-041 Estimated dump size in bytes:11744051
(With Compression turned off)
0453-041 Estimated dump size in bytes:58720256
Using the above example, the dump will require 12MB (with
compression on), or 59MB (with compression turned off) of
device storage. This value can change based on the activity of
the system. It is best to run this command when the machine is
under its heaviest workload. Size the dump device four times
the value reported by the sysdumpdev command in order to
handle a system dump during peak system activity.
IA-64 Systems - Compression must be turned off to gather
a valid system dump. (Errata)
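The sizing rule above can be checked with a quick shell calculation (a hedged sketch: the byte count is the compression-off estimate from the sample output, and the four-times multiplier is the rule of thumb stated above):

```shell
# Rule-of-thumb dump device sizing: four times the sysdumpdev -e
# estimate, rounded up to whole megabytes.
estimate=58720256                          # bytes, compression off
recommended=$(( estimate * 4 ))            # headroom for peak workload
recommended_mb=$(( (recommended + 1048575) / 1048576 ))
echo "size the dump device to at least ${recommended_mb} MB"
```

With the sample estimate, this works out to a 224MB dump device.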
Preparing for a
system dump
continued
Step Action
2 Create a primary dump device named dumplv. Calculate the
required number of PPs for the dump device. Get the PP size
of the volume group by using the lsvg command:
# lsvg rootvg
VOLUME GROUP:   rootvg               VG IDENTIFIER:  db1010a
VG STATE:       active               PP SIZE:        16 megabyte(s)
VG PERMISSION:  read/write           TOTAL PPs:      1626 (26016 megabytes)
MAX LVs:        256                  FREE PPs:       1464 (23424 megabytes)
LVs:            11                   USED PPs:       162 (2592 megabytes)
OPEN LVs:       8                    QUORUM:         2
TOTAL PVs:      3                    VG DESCRIPTORS: 3
STALE PVs:      0                    STALE PPs:      0
ACTIVE PVs:     3                    AUTO ON:        yes
MAX PPs per PV: 1016                 MAX PVs:        32
LTG size:       128 kilobyte(s)      AUTO SYNC:      no
HOT SPARE:      no
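The PP arithmetic for this step can be sketched in shell (a hedged example: the 16MB PP size comes from the lsvg output above, the 224MB target follows the four-times sizing rule, and mklv with a dump-type logical volume is the assumed creation command):

```shell
# Physical partitions needed for a dump logical volume of dump_mb MB.
dump_mb=224                                  # target dump device size
pp_mb=16                                     # PP SIZE from lsvg rootvg
pps=$(( (dump_mb + pp_mb - 1) / pp_mb ))     # round up to whole PPs
echo "mklv -y dumplv -t dump rootvg ${pps}"
```

The printed mklv command would then be run to create the dumplv logical volume in rootvg.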
Preparing for a
system dump
continued
Step Action
3. Verify the size of the device /dev/dumplv.
Enter the following command:
# lslv dumplv
LOGICAL VOLUME: dumplv               VOLUME GROUP:   rootvg
LV IDENTIFIER:  e59bd8               PERMISSION:     read/write
VG STATE:       active/complete      LV STATE:       opened/syncd
TYPE:           dump                 WRITE VERIFY:   off
MAX LPs:        512                  PP SIZE:        16 megabyte(s)
COPIES:         1                    SCHED POLICY:   parallel
LPs:            15                   PPs:            15
STALE PPs:      0                    BB POLICY:      relocatable
INTER-POLICY:   minimum              RELOCATABLE:    no
INTRA-POLICY:   middle               UPPER BOUND:    32
MOUNT POINT:    N/A                  LABEL:          None
MIRROR WRITE CONSISTENCY: off
EACH LP COPY ON A SEPARATE PV?: yes
# sysdumpdev -p /dev/dumplv -P
primary /dev/dumplv
secondary /dev/sysdumpnull
copy directory /var/adm/ras
forced copy flag FALSE
always allow dump FALSE
dump compression OFF
Preparing for a
system dump
continued
Step Action
5. Create a secondary dump device. The secondary dump device
is used to back up the primary dump device. If an error occurs
during a system dump to the primary dump device, the
system attempts to dump to the secondary device (if it is
defined).
# sysdumpdev -s /dev/hd7 -P
primary /dev/dumplv
secondary /dev/hd7
copy directory /var/adm/ras
forced copy flag FALSE
always allow dump FALSE
dump compression OFF
Preparing for a
system dump
continued
Step Action
7. Verify that the size of the filesystem containing the copy
directory is large enough to hold a crash dump. Check the size
of the copy directory filesystem with the following command:
# df -k /var
Filesystem 1024-blocks Free %Used Iused %Iused Mounted on
/dev/hd9var 32768 31268 5% 143 64% /var
In this example the /var filesystem is 32MB. The chfs size
attribute is specified in 512-byte blocks. To increase the
size of the /var filesystem to 240MB (491520 blocks), use
the following command:
# chfs -a size=491520 /var
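Since the chfs size attribute is given in 512-byte blocks, the conversion from megabytes can be sketched as follows (a hedged example; the 240MB figure is the target used above):

```shell
# Convert a target filesystem size in MB into 512-byte blocks for chfs.
target_mb=240
blocks=$(( target_mb * 2048 ))     # 2048 512-byte blocks per megabyte
echo "chfs -a size=${blocks} /var"
```

This prints the chfs command with the block count filled in (491520 blocks for 240MB).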
Preparing for a
system dump
continued
Step Action
8. Configure the forced copy flag. If paging space is being used
as a dump device, the forced copy flag must be set to TRUE.
This will force the system boot sequence into menus that
allow copying of the dump to external media if the copy to
the copy directory fails. This gives you the opportunity
to save the crash dump to removable media if the default copy
directory is full or unavailable. To set the flag to TRUE,
specify the copy directory with the -D flag:
# sysdumpdev -D /var/adm/ras
10. Configure the compression flag. To enable compression of
the system dump prior to being written to the dump device,
the compression flag must be set to ON. To set the flag to
ON, use the following command:
# sysdumpdev -CP
Preparing for a
system dump
continued
Step Action
11. Configure the system for autorestart. A useful system attribute
is autorestart. If autorestart is TRUE, the system will
automatically reboot after a crash. This is useful if the machine
is physically distant or often unattended. To list the system
attributes, use the following command:
# smit chgsys
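The attribute can also be changed directly from the command line (a hedged transcript: chdev/lsattr on the sys0 device is the assumed interface, and the output lines are illustrative):

```shell
# chdev -l sys0 -a autorestart=true
sys0 changed
# lsattr -El sys0 -a autorestart
autorestart true Automatically REBOOT system after a crash True
```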
Introduction AIX5L has been designed to automatically collect a system crash dump
following a system panic. This section discusses the operator controls and
procedures used to obtain a system dump.
User initiated Under unattended hang conditions, or for other debugging purposes, the
dumps system administrator may use different techniques to force a dump:
• Using the sysdumpstart -p command (primary dump device) or the
sysdumpstart -s command (secondary dump device).
• Start a system dump with the Reset button by doing the following (this
procedure works for all system configurations and will work in
circumstances where other methods for starting a dump will not):
Step Action
1. Turn the machine’s mode switch to the Service position, or
set Always Allow System Dump to TRUE.
2. Press the Reset button. The system writes the dump
information to the primary dump device.
Progression A system crash will cause a number of status codes to be displayed. When
status codes a system has crashed, the LEDs will display a flashing 888. The system
may display the code 0c9 for a short period of time, indicating a system
dump is in progress. When the dump is complete, the dump status code
will change to 0c0 if the system was able to dump successfully.
If the Low-Level Debugger (LLDB) is enabled, a c20 will appear in the
LEDs, and an ASCII terminal connected to the s1 or s2 serial port will
show an LLDB screen. Typing quit dump will initiate a dump.
During the dump process, the following progression status codes may be
seen on the LED or LCD displays:
Error log If the dump was lost or did not save during system boot, the error log can
help determine the nature of the problem that caused the dump. To check
the error log, use the errpt command.
Step Action
1. # sysdumpstart -p
IA-64 Systems - For a dump that is approximately 120MB
in size, wait approximately 15 minutes before shutting
down the machine.
2. Reboot the system.
dumpcheck utility
Description The /usr/lib/ras/dumpcheck utility is used to check the disk resources used
by the system dump facility. The command logs an error if either the
largest dump device is too small to receive the dump or there is insufficient
space in the copy directory when the dump device is a paging space.
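On AIX5L this check is typically scheduled from root's crontab (a hedged example; the exact entry and schedule shipped on a given system may differ):

```shell
# crontab -l | grep dumpcheck
0 15 * * * /usr/lib/ras/dumpcheck >/dev/null 2>&1
```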
Error log entry The following is an example of an errorlog entry created by the dumpcheck
sample utility because of lack of space in the primary and secondary dump
devices:
----------------------------------------------------
LABEL: DMPCHK_TOOSMALL
IDENTIFIER: E87EF1BE
Description
The largest dump device is too small.
Probable Causes
Neither dump device is large enough to accommodate a
system dump at this time.
Recommended Actions
Increase the size of one or both dump devices.
Detail Data
Largest dump device
testdump
Description Before submitting a dump to IBM for analysis, it is important to verify that
the dump is valid and readable.
Dump analysis To verify the dump is valid, the dump must be examined by a kernel
tools debugger. The kernel debugger used to validate the dump depends on the
system architecture. If the system is running on Power PC, the debugger is
kdb. The kernel debugger for IA-64 platforms is iadb.
Verifying the The following procedure should be used to verify the dump
dump
Step Action
1. Locate the crash dump:
# sysdumpdev -L
0453-039
Device name: /dev/dumplv
Major device number: 10
Minor device number: 2
Size: 8837632 bytes
Uncompressed Size: 32900935 bytes
Date/Time: Fri Sep 22 13:01:41 PDT 2000
Dump status: 0
dump completed successfully
Dump copy filename: /var/adm/ras/vmcore.0.Z
2. Change directory to the dump location. In the above
example:
# cd /var/adm/ras
3. Decompress the vmcore file if necessary:
# uncompress vmcore.0.Z
Verifying the
dump
continued
Step Action
4. Start the kernel debugger:
Power PC:
# kdb /var/adm/ras/vmcore.0
The specified kernel file is a UP kernel
vmcore.1 mapped from @ 70000000 to @ 71fdba81
Preserving 880793 bytes of symbol table
First symbol __mulh
KERNEXT FUNCTION NAME CACHE (90112 bytes) allocated
KERNEXT COMMANDS SPACE (4096 bytes) allocated
Component Names:
1) dmp_minimal [5 entries]
....
Dump analysis on CHRP_UP_PCI POWER_PC POWER_604
machine with 1 cpu(s) (32-bit r
egisters)
Processing symbol table...
.......................done
(0)>
IA-64:
# iadb /var/adm/ras/vmcore.0
symbol capture using file: /unix
iadb: Probing a live system, with memfd as :4
Current Context:
cpu:0x1, thread slot: 77, process Slot: 51,
ad space: 0x8e44
thrd ptr: 0xe00000972a13b000, proc ptr:
e00000972a12e000
mst at:3ff002ff3b400
(1)>
Verifying the
dump
continued
Step Action
5. Issue the stat subcommand to verify the details of the dump.
Ensure the values are consistent with the dump that was
taken.
Power PC:
(0)> stat
SYSTEM_CONFIGURATION:
CHRP_UP_PCI POWER_PC POWER_604 machine with 1
cpu(s) (32-bit registers)
SYSTEM STATUS:
sysname... AIX
nodename.. kca41
release... 0
version... 5
machine... 000930134C00
nid....... 0930134C
time of crash: Thu Oct 5 10:37:57 2000
age of system: 3 min., 11 sec.
xmalloc debug: disabled
IA-64:
(1)>stat
SYSTEM_CONFIGURATION:
IA64 machine with 2 cpu(s)(64-bit registers)
SYSTEM STATUS:
sysname... AIX
nodename.. kca40
hostname.. kca40.hil.sequent.com
release... 0
version... 5
machine... 000000004C00
nid....... 0000004c
current time: Fri Oct 6 12:20:56 2000
age of system: 1 day, 1 hr., 1 min., 43 sec.
xmalloc debug: disabled
Verifying the
dump
continued
Step Action
6. Exit the kernel debugger:
Power PC:
(0) > q
IA-64:
(1) > q
Overview Once a valid dump has been identified, the next step is to package the
dump to be sent in for analysis.
Packaging the The following procedure will automatically collect the required files
dump pertaining to the system dump.
Step Action
1. Compress the vmcore file:
# compress /var/adm/ras/vmcore.0
2. Gather all of the files and information regarding the dump
using the following command:
# snap -Dkg
Checking space requirement for general
information............................................
........... done.
Checking space requirement for kernel
information.......... done.
Checking space requirement for dump information.....
done.
Checking for enough free space in filesystem... done.
********Checking and initializing directory structure
Creating /tmp/ibmsupt directory tree... done.
Creating /tmp/ibmsupt/dump directory tree... done.
Creating /tmp/ibmsupt/kernel directory tree... done.
Creating /tmp/ibmsupt/general directory tree... done.
Creating /tmp/ibmsupt/general/diagnostics directory
tree... done.
Creating /tmp/ibmsupt/testcase directory tree... done.
Creating /tmp/ibmsupt/other directory tree... done.
********Finished setting up directory /tmp/ibmsupt
Gathering general system
information........................done.
Gathering kernel system information........... done.
Gathering dump system information...... done.
Packaging the
dump
continued
Step Action
3. Copy the dump to external media. To copy the gathered
files to the /dev/rmt0 tape device, issue the following
command:
# snap -o /dev/rmt0
Packaging a A dump saved to external media needs to be gathered with other files to
dump stored provide a dump which is readable. To gather and pack the files, follow
on external
media these steps:
Step Action
1. Create a skeleton directory to contain the dump information.
# snap -D
# cd /tmp/ibmsupt/dump
# tar -xvf /dev/rmt
# mv dump_file dump
3. Copy the dump to external media. To copy the gathered files
to the /dev/rmt0 tape device, issue the following command:
# snap -o /dev/rmt0
Purpose This lesson describes the different tools that are available to debug a
system dump taken from an AIX5L system.
Table of
contents
continued
Topic See Page
KDB watch break point sub commands 44
KDB machine status sub commands 46
KDB kernel extension loader sub commands 48
KDB address translation sub commands 50
KDB process/thread sub commands 51
KDB Kernel stack sub commands 59
KDB LVM sub commands 61
KDB SCSI sub commands 63
KDB memory allocator sub commands 66
KDB file system sub commands 70
KDB system table sub commands 73
KDB network sub commands 78
KDB VMM sub commands 81
KDB SMP sub commands 87
KDB data and instruction block address translation sub 88
commands
KDB bat/brat sub commands 90
IADB kernel debugger 91
iadb command 93
Table of
contents
continued
Topic See Page
IADB break point and step sub commands 94
IADB dump/display/decode sub commands 97
IADB modify memory sub commands 101
IADB name list/symbol sub commands 106
IADB watch break point sub commands 107
IADB machine status sub commands 109
IADB kernel extension loader sub commands 111
IADB address translation sub commands 112
IADB process/thread sub commands 113
IADB LVM sub commands 115
IADB SCSI sub commands 116
IADB memory allocator sub commands 117
IADB file system sub commands 118
IADB system table sub commands 119
IADB network sub commands 120
IADB VMM sub commands 121
IADB SMP sub commands 123
IADB block address translation sub commands 124
IADB bat/brat sub commands 125
IADB miscellaneous sub commands 126
Exercise 128
Accountability You will be able to measure your progress with the following:
• Exercises using your lab system.
• Check-point activity
• Lesson review
Reference
• AIX5L docs
Organization This lesson consists of information followed by exercises that allow you to
of this lesson practice what you’ve just learned. Sometimes, as the information is being
presented, you are required to do something - pull down a menu, enter a
response, etc. This symbol, in the left hand side-head, is an indication that
an action is required.
Introduction AIX5L introduces new debugging tools. The main change from previous
releases of AIX is that the crash command has been replaced by:
• the IADB and KDB kernel debuggers for live system debugging
• the iadb and kdb commands for system image analysis
In addition, the following tools/commands are available to assist you with
debugging:
• bosdebug
• Memory Overlay Detection System (MODS)
• System Hang Detection
• truss
Typographic In the following sections we will use uppercase IADB and KDB when
conventions speaking about the live kernel debuggers, and lowercase iadb and kdb
when speaking about the commands.
dump components
Introduction In AIX5L, a dump image is not actually a full image of system memory but a
set of memory areas dumped by the dump process.
The Master A master dump table entry is a pointer to a function, provided by a kernel
dump Table extension, that will be called by the kernel dump routine when a system dump
occurs. These functions must return a pointer to a component dump table
structure. Both the functions and the component dump table entries must reside
in pinned global memory. They are registered with the kernel using the dmp_add
kernel service and unregistered using the dmp_del kernel service. Kernel-specific
areas are pre-loaded by kernel initialization.
Component Dump component tables are structures of type struct cdt. Component dump
dump tables tables are returned by the registered dmp functions when the dump process
starts. Each one is a structure made of:
• a CDT header
• an array of CDT entries
CDT entries CDT entries in the component dump tables will be one of cdt_entry64,
cdt_entry_vr, or cdt_entry32, according to the DMP_MAGIC number as defined
in /usr/include/sys/dump.h.
Process The following steps will be used to write a dump to the dump device:
overview
Step Action
1 Interrupts are disabled
2 0c9 or 0c2 are written to the LED display, if present
3 Header information about the dump is written to the dump device
4 The kernel steps through each entry in the master dump table,
calling each component dump routine twice:
• once to indicate that the kernel is starting to dump this
component (1 is passed as a parameter)
• again to say that the dump process is complete (2 is passed)
After the first call to a component dump routine, the kernel
processes the CDT that was returned.
For each CDT entry, the kernel:
• checks every page in the identified data area to see if it is in
memory or paged out
• builds a bitmap indicating each page's status
• writes a header, the bitmap, and those pages which are in
memory to the dump device
5 Once all dump routines have been called, the kernel enters an
infinite loop, displaying 0c0 or flashing 888
Note A component dump routine may or may not do a lot of work when called with a 1.
Many simply return the address of some previously-initialized CDT, but some (for
example, the thread table and process table dump routines) actually build the CDT
from scratch.
The original rationale for the second call to each dump routine was to provide
notification that the dump process had finished with that component's dump data.
In practice, however, no one really cares. The routines that just return an address
don't even bother to look at the parameter they were passed. The routines that
build the data on the fly look for a 2 and return immediately. The most that any
routine today does with this second call is to issue some debug printf call. This is
generally used to debug the component dump routine itself, by verifying that the
system dump facility was able to successfully process its CDT.
bosdebug command
Introduction The bosdebug command can be used to enable or disable the MODS feature as
well as other kernel debugging parameters.
Any changes made with the bosdebug command will not take effect until the
system is rebooted.
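A typical enable/verify sequence looks like the following (a hedged transcript: -M, -L, and the bosboot rebuild are the assumed steps for turning on MODS; check bosdebug's documentation on your level of AIX):

```shell
# bosdebug -M                    <== enable the Memory Overlay Detection System
# bosdebug -L                    <== list the current debug settings
# bosboot -ad /dev/ipldevice     <== rebuild the boot image
# shutdown -Fr                   <== reboot so the change takes effect
```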
-10 of 128 AIX 5L Internals Version 20001015 © Copyright IBM Corp. 2000
Course materials may not be reproduced in whole or in part
without the prior written permission of IBM.
Draft Version for review, Sunday, 15. October 2000, introSystemdump.fm Guide
Introduction The Memory Overlay Detection System (MODS) helps detect memory overlay
problems in the kernel, kernel extensions, and device drivers. The MODS can be
enabled using the bosdebug command.
Problems Some of the most difficult types of problems to debug are what are generally
detected called "memory overlays." Memory overlays include the following:
Note: This feature does not detect problems in application code; it only
watches kernel and kernel extension code.
How MODS The primary goal of the MODS feature is to produce a dump file that
works accurately identifies the problem.
MODS works by turning on additional checking to help detect the
conditions listed above. When any of these conditions is detected, your
system crashes immediately and produces a dump file that points directly
at the offending code. (Previously, a system dump might point to unrelated
code that happened to be running later when the invalid situation was
finally detected.)
If your system crashes while the MODS is turned on, then MODS has most
likely done its job.
To make it easier to detect that this situation has occurred, the IADB/
iadb and KDB/kdb commands have been extensively modified. The
stat subcommand now displays both:
• Whether the MODS (also called "xmalloc debug") has been turned on
• Whether this crash was the result of the MODS detecting an incorrect
situation.
The xmalloc subcommand provides details on exactly what memory
address (if any) was involved in the situation, and displays mini-tracebacks
for the allocation and/or free of this memory.
Similarly, the netm command displays allocation and free records for
memory allocated using the net_malloc kernel service (for example,
mbufs, mclusters, etc.).
You can use these commands, as well as standard crash techniques, to
determine exactly what went wrong.
MODS There are limitations to the Memory Overlay Detection System. Although it
limitations significantly improves your chances, MODS cannot detect all memory
overlays. Also, turning MODS on has a small negative impact on overall
system performance and causes somewhat more memory to be used in
the kernel and the network memory heaps. If your system is running at full
CPU utilization, or if you are already near the maximums for kernel
memory usage, turning on the MODS may cause performance
degradation and/or system hangs.
Our practical experience with the MODS, however, is that the great
majority of customers will be able to use it with minimal impact to their
systems.
MODS and kdb If a system crash occurs due to a MODS-detected problem, the kdb xm
subcommand will display status and traces for memory overlay problems.
Introduction System hang management allows users to run mission critical applications
continually while improving application availability. System hang detection
alerts the system administrator of possible problems and then allows the
administrator to log in as root or to reboot the system to resolve the
problem.
System Hang All processes (also known as threads) run at a priority. This priority is
Detection numerically inverted in the range 40-126: 40 is the highest priority and 126
is the lowest. The default priority for all threads is 60. The priority of
a process can be lowered by any user with the nice command. Anyone
with root authority can also raise a process’s priority.
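For example (a generic illustration of nice, not specific to AIX5L):

```shell
# Run a command with its nice value increased by 10 (lower priority);
# any user may do this.
nice -n 10 sleep 1
# Raising priority requires root authority, e.g. (hypothetical daemon):
#   nice -n -10 some_daemon
```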
The kernel scheduler always picks the highest priority runnable thread to
put on a CPU. It is therefore possible for a sufficient number of high priority
threads to completely tie up the machine such that low priority threads can
never run. If the running threads are at a priority higher than the default of
60, this can lock out all normal shells and logins to the point where the
system appears hung.
The System Hang Detection (SHD) feature provides a mechanism to
detect this situation and allow the system administrator a means to
recover. This feature is implemented as a daemon (shdaemon) that runs at
the highest process priority. This daemon queries the kernel for the lowest
priority thread run over a specified interval. If the priority is above a
configured threshold, the daemon can take one of several actions. Each of
these actions can be independently enabled, and each can be configured
to trigger at any priority and over any time interval. The actions and their
defaults are:
System Hang
Detection
continued
shconf Script The shconf command is invoked when System Hang Detection is
enabled. shconf configures which events are surveyed and what actions
are to be taken if such events occur.
The user can specify the five actions described below, as well as the
priority level to check, the time-out during which no process or thread
executes at a lower or equal priority, and the terminal device for the
warning and getty actions:
• Log an error in the error log file
• Display a warning message on the system console (alphanumeric
console) or on a specified TTY
• Reboot the system
• Give a special getty to allow the user to log in as root and launch
commands
• Launch a command
For the Launch a command and Give a special getty options,
SHD will launch the special getty or the specified command at the highest
priority. The special getty will print a warning message specifying that it is a
recovering getty running at priority 0. The following table lists the default
values when the SHD is enabled. Only one action is enabled per type of
detection.
Note: When Launch a recovering getty on a console is
enabled, the shconf script adds the -u flag to the getty line in the inittab
that is associated with the console login.
The shdaemon entry is set to off or respawn in the inittab each time the
shconf command disables or enables the sh_pp option.
SMIT Interface You can manage the SHD configuration from the SMIT System
Environments menu. From the System Environments menu, select
Manage System Hang Detection. The options in this menu allow
system administrators to enable or disable the detection mechanism.
Configuration of The shconf command can be used to configure System Hang Detection.
the SHD The following parameters may be used with shconf:
• -d: display the System Hang Detection status
• -R -l prio: reset the effective values to the defaults
• -D[O] -l prio: display the default values (the optional O outputs values
separated by colons)
• -E[O] -l prio: display the effective values (the optional O outputs values
separated by colons)
• -l prio [-a Attribute=Value]: change the Attribute to the new Value
Options The following options can be used to customize System Hang Detection:
name default description
sh_pp enable Enable Process Priority Problem
pp_errlog disable Log Error in the Error Logging
pp_eto 2 Detection Time-out
pp_eprio 60 Process Priority
pp_warning disable Display a warning message on a console
pp_wto 2 Detection Time-out
pp_wprio 60 Process Priority
pp_wterm /dev/console Terminal Device
pp_login enable Launch a recovering login on a console
pp_lto 2 Detection Time-out
pp_lprio 56 Process Priority
pp_lterm /dev/tty0 Terminal Device
pp_cmd disable Launch a command
pp_cto 2 Detection Time-out
pp_cprio 60 Process Priority
pp_cpath / Script
pp_reboot disable Automatically REBOOT system
pp_rto 5 Detection Time-out
pp_rprio 39 Process Priority
example The following output represents various uses of the shconf command:
# shconf -R -l prio <== restore default values
shconf: Default Problem Conf is restored.
shconf: Priority Problem Conf has changed.
# shconf -D -l prio <== display default values
sh_pp disable Enable Process Priority Problem
pp_errlog disable Log Error in the Error Logging
pp_eto 2 Detection Time-out
pp_eprio 60 Process Priority
pp_warning disable Display a warning message on a console
pp_wto 2 Detection Time-out
pp_wprio 60 Process Priority
pp_wterm /dev/console Terminal Device
pp_login enable Launch a recovering login on a console
pp_lto 2 Detection Time-out
pp_lprio 56 Process Priority
pp_lterm /dev/tty0 Terminal Device
pp_cmd disable Launch a command
pp_cto 2 Detection Time-out
pp_cprio 60 Process Priority
pp_cpath / Script
pp_reboot disable Automatically REBOOT system
pp_rto 5 Detection Time-out
pp_rprio 39 Process Priority
# shconf -l prio -a pp_lterm=/dev/console <== change terminal device to /dev/console
shconf: Priority Problem Conf has changed.
# shconf -l prio -a sh_pp=enable <== enable priority problem detection
shconf: Priority Problem Conf has changed.
# ps -ef|grep shd <== verify the shdaemon has been started
root 4982 1 0 17:08:17 - 0:00 /usr/sbin/shdaemon
root 9558 9812 1 17:08:22 0 0:00 grep shd
truss command
Description The truss command executes a specified command, or attaches to listed process
IDs, and produces a trace of the system calls, received signals, and machine faults
a process incurs. Each line of the trace output reports either the Fault or Signal
name, or the Syscall name with parameters and return values. The subroutines
defined in system libraries are not necessarily the exact system calls made to the
kernel. The truss command does not report these subroutines, but rather, the
underlying system calls they make. When possible, system call parameters are
displayed symbolically using definitions from relevant system header files. For
path name pointer parameters, truss displays the string being pointed to. By
default, undefined system calls are displayed with their name, all eight possible
arguments and the return value in hexadecimal format.
Options The following options can be used on the truss command line:
Option Description
-a Displays the parameter strings passed in each system call.
-c Counts traced system calls, faults, and signals rather than displaying
trace results line by line. A summary report is produced.
-e Displays the environment strings which are passed in each executed
system call.
-f Follows all children created by the fork system call.
-i Keeps interruptible sleeping system calls from being displayed. Causes
system calls to be reported only once, upon completion.
-m [!]Fault  Machine faults to trace/exclude. Faults may be specified by name or
             number (see the sys/fault.h header file). The default is -mall.
-o Outfile   Designates the file to be used for the trace output.
-p           Interprets the parameters to truss as a list of process IDs of existing
             processes rather than as a command to be executed. truss takes control of
             each process and begins tracing it.
-r [!]FileDescriptor  Displays the full contents of the I/O buffer for each read on any of the
             specified file descriptors. The output is formatted 32 bytes per line and
             shows each byte either as an ASCII character (preceded by one blank) or
             as a two-character C language escape sequence for control characters. If
             ASCII interpretation is not possible, the byte is shown as a two-character
             hexadecimal value. The default is -r!all.
-s [!]Signal  Permits listing signals to trace/exclude. The trace output reports the
             receipt of each specified signal even if the signal is being ignored, but not
             blocked, by the process. Blocked signals are not received until the process
             releases them. Signals may be specified by name or number (see sys/
             signal.h). The default is -s all.
-t [!]Syscall  Includes/excludes system calls from the trace. The default is -tall.
-w [!]FileDescriptor  Displays the contents of the I/O buffer for each write on any of the listed
             file descriptors (see -r). The default is -w!all.
-x [!]Syscall  Displays data from the specified parameters of traced system calls in raw
             format, usually hexadecimal, rather than symbolically. The default is -x!all.
Options that require a list must be given values separated by commas. You can use
all/!all to include or exclude all possible values of the list.
truss output example The following output represents an example of the use of the truss command:
# truss -a -e -i ... -w all -o ls.out ls
ls.out
# more ls.out
execve("/usr/bin/ls", ..., ...)  argc: 1
 argv: ls
 envp: _=/usr/bin/truss LANG=C LOGIN=root
 NLSPATH=/usr/lib/nls/msg/%L/%N:/usr/lib/nls/msg/%L/%N.cat
 PATH=/usr/bin:/etc:/usr/sbin:/usr/ucb:/usr/bin/X11:/sbin
 LC__FASTMSG=true LOGNAME=root MAIL=/usr/spool/mail/root
 LOCPATH=/usr/lib/nls/loc USER=root AUTHSTATE=compat
 SHELL=/usr/bin/ksh ODMDIR=/etc/objrepos HOME=/ TERM=aixterm
 MAILMSG=[YOU HAVE NEW MAIL] PWD=/home/alex TZ=PST8PDT
__get_kernel_tod_ptr(...) = ...
getuidx(...) = ...
kioctl(...) = ...
kioctl(...) = ...
sbrk(...) = ...
brk(...) = ...
sbrk(...) = ...
brk(...) = ...
statx(...) = ...
statx(...) = ...
open(..., O_RDONLY) = ...
getdirent(...) = ...
lseek(...) = ...
kfcntl(..., F_GETFD, ...) = ...
kfcntl(..., F_SETFD, ...) = ...
getdirent(...) = ...
getdirent(...) = ...
close(...) = ...
kioctl(...) = ...
kwrite(...) = ...
   l s . o u t\n
kfcntl(..., F_GETFL, ...) = ...
close(...) = ...
kfcntl(..., F_GETFL, ...) = ...
_exit(...)
Introduction The KDB is the kernel debugger used on AIX 5L running on Power systems.
#kdb
(0)> dw kdb_avail
kdb_avail+000000: 00000001 00000000 00000000 00000000
Loading KDB In AIX 5L, the KDB is included in all unix kernels found in /usr/lib/boot. In order
to use it, the KDB must be loaded at boot time. Use one of the following commands
to control whether KDB is loaded:
• bosboot -a -D -d /dev/ipldevice, or bosdebug -D: will
load KDB at boot time.
• bosboot -a -I -d /dev/ipldevice, or bosdebug -I: will load
and invoke the KDB at boot time.
• bosboot -ad /dev/ipldevice, or bosdebug -o: will not load or
invoke the KDB at boot time.
You must reboot the system for these changes to take effect.
Starting KDB The KDB interface may be started, if loaded, under the following
circumstances:
• If bosboot or bosdebug was run with -I, the tty attached to a native serial port
shows the KDB prompt just after the kernel is loaded.
• You may invoke the KDB manually from a tty attached to a native serial port
using Ctrl-4 or Ctrl-\, or from a native keyboard using Ctrl-alt-
Numpad4.
• An application makes a call to the breakpoint() kernel service or to the
breakpoint system call.
• A breakpoint previously set using the KDB has been reached.
• A fatal system error occurs. A dump might be generated on exit from the KDB.
KDB Concept When the KDB Kernel Debugger is invoked, it is the only running program until
you exit the KDB or you use the start sub command to start another cpu. All
processes are stopped and interrupts are disabled. The KDB Kernel Debugger runs
with its own Machine State Save Area (mst) and a special stack. In addition, the
KDB Kernel Debugger does not run operating system routines. Though this
requires that kernel code be duplicated within KDB, it is possible to break
anywhere within the kernel code. When exiting the KDB Kernel Debugger, all
processes continue to run unless the debugger was entered via a system halt.
kdb command
Introduction The kdb command, unlike the KDB kernel debugger, allows examination of an
operating system image, such as a system dump, on Power systems.
The kdb command may be used on a running system but does not provide all
functions available with the KDB kernel debugger.
Parameters The kdb command may be used with the following parameters:
• no parameter: kdb uses /dev/mem as the system image file and /usr/lib/
boot/unix as the kernel file. In this case root permissions are required.
• -m system_image_file: kdb uses the system image file provided.
• -u kernel_file: kdb uses the kernel file provided. This is required to analyze a
system dump from a system running a different level of the unix kernel.
• -k kernel_modules: a comma-separated list of kernel extension symbols to add.
• -w: view an XCOFF object.
• -v: print CDT entries.
• -h: print help.
• -l: disable the inline pager (more), useful for running non-interactive sessions.
Loading errors If the system image file provided doesn’t contain a valid dump or the kernel file
doesn’t match the system image file, the following message may be issued by the
kdb command:
Introduction The following table represents the miscellaneous sub commands and their
matching crash/lldb sub commands when available
reboot sub The reboot subcommand can be used to reboot the machine. This subcommand
command issues a prompt for confirmation that a reboot is desired before executing the
reboot. If the reboot request is confirmed, the soft reboot interface is called
(sr_slih(1)).
! sub command The ! sub command allows the user to run an AIX command without leaving the
kdb command or the KDB kernel debugger.
? sub command The help or ? sub command can be used to display a long sub command listing
or to display help by subject.
Help for a particular sub command can be displayed by entering the sub command
followed by ?.
q sub command For the KDB Kernel Debugger, this subcommand exits the debugger with all
breakpoints installed in memory. To exit the KDB Kernel Debugger without
breakpoints, the ca subcommand should be invoked to clear all breakpoints prior
to leaving the debugger.
The optional dump argument can be specified to force an operating system dump.
The method used to force a dump depends on how the debugger was invoked.
set sub command The set sub command can be used to toggle the kdb parameters. set accepts the
following parameters:
time sub The time command can be used to determine the elapsed time from the last time
command the KDB Kernel Debugger was left to the time it was entered.
debug sub command The debug subcommand may be used to print additional information during
KDB execution; the primary use of this subcommand is to aid in ensuring that the
debugger is functioning properly. The debug sub command can be used with the
following arguments:
hcal/dcal sub The hcal subcommand evaluates hexadecimal expressions and displays the
commands result in both hex and decimal.
The dcal subcommand evaluates decimal expressions and displays the result in
both hex and decimal.
Introduction The following table represents the dump/display/decode sub commands and their
matching crash/lldb sub commands when available
d/dw/dd/dp/dpw/dpd sub commands The d/dw/dd/dp/dpw/dpd sub commands are used to display memory with the
following sizes:
• d, dp: display bytes
• dw, dpw: display words
• dd, dpd: display double words
Addresses are specified by:
• virtual addresses for d, dw and dd
• physical addresses for dp, dpw and dpd
These sub commands accept the following arguments:
• Address - starting address of the area to be dumped. Hexadecimal values or
hexadecimal expressions can be used in specification of the address.
• count - number of bytes (d, dp), words (dw, dpw), or double words (dd, dpd)
to be displayed. The count argument is a hexadecimal value.
dc/dpc/dis sub commands The display code subcommands dc, dis and dpc may be used to decode
instructions. The address argument for the dc subcommand is an effective
address. The address argument for the dpc subcommand is a physical
address. They accept the following arguments:
ddvb/ddvh/ddvw/ddvd and ddpb/ddph/ddpw/ddpd sub commands IO space memory (Direct Store Segment (T=1)) cannot be accessed when
translation is disabled. BAT-mapped areas must also be accessed with translation
enabled, else cache controls are ignored.
The subcommands ddvb, ddvh, ddvw and ddvd can be used to access these areas
in translated mode, using an effective address already mapped.
The subcommands ddpb, ddph, ddpw and ddpd can be used to access these areas
in translated mode, using a physical address that will be mapped.
On 64-bit machines, correctly aligned double words are accessed (ddpd and ddvd)
with a single load (ld) instruction.
• Address - address of the starting memory area to display. This can either be an
effective or real address, dependent on the subcommand used. Symbols,
hexadecimal values, or hexadecimal expressions can be used in specification of
the address.
• count - number of bytes (ddvb, ddpb), half words (ddvh, ddph), words (ddvw,
ddpw), or double words (ddvd, ddpd) to display. The count argument is a
hexadecimal value.
find findp sub The find and findp subcommands can be used to search for a specific pattern in
commands memory. The find subcommand requires an effective address for the address
argument, whereas the findp subcommand requires a real address. find and findp
accept the following parameters:
ext/extp sub The ext and extp subcommands can be used to display a specific area from a
commands structure. If an array exists, it can be traversed displaying the specified area for
each entry of the array. These subcommands can also be used to traverse a linked
list displaying the specified area for each entry.
For the ext subcommand the address argument specifies an effective address. For
the extp subcommand the address argument specifies a physical address.
• -p: flag to indicate that the delta argument is the offset to a pointer to the next
area.
• Address: address at which to begin display of values. This can either be a
virtual (effective) or physical address depending on the subcommand used.
Symbols, hexadecimal values, or hexadecimal expressions can be used in
specification of the address.
• delta: offset to the next area to be displayed or offset from the beginning of the
current area to a pointer to the next area. This argument is a hexadecimal value.
• size: hexadecimal value specifying the number of words to display.
• count: hexadecimal value specifying the number of entries to traverse
dr sub command The display registers sub command can be used to display:
examples The following show examples of the use of display sub commands:
Introduction The following table represents the modify memory sub commands and their
matching crash/lldb sub commands when available
m/mp/mw/mpw/md/mpd sub commands The m/mp/mw/mpw/md/mpd sub commands are used to modify memory with the
following sizes:
• m, mp: modify bytes
• mw, mpw: modify words
• md, mpd: modify double words
Addresses are specified by:
• virtual addresses for m, mw and md
• physical addresses for mp, mpw and mpd
These sub commands accept the following arguments:
• Address - starting address of the area to be modified. Hexadecimal values or
hexadecimal expressions can be used in specification of the address.
The sub commands will prompt for new values until a “.” value is entered.
mr sub The mr sub command can be used to modify general purpose, segment, special, or
commands floating point registers. Individual registers can also be selected for modification
by register name. The current thread context is used to locate the register values to
be modified. The switch sub command can be used to change context to other
threads. When the register being modified is in the mst context, KDB alters the
mst. When the register being modified is a special one, the register is altered
immediately. Symbolic expressions are allowed as input.
The following arguments can be used:
• gp - modify general purpose registers.
• sr - modify segment registers.
• sp - modify special purpose registers.
• fp - modify floating point registers.
• reg_name - modify a specific register, by name.
mr prompts once for input if a register name was specified, or prompts for each
register in turn until a “.” is entered.
mdvb/mdvh/mdvd and mdpb/mdph/mdpd sub commands These subcommands are available to write to IO space memory. To avoid side
effects, memory is not read first; only the specified write is performed, with
translation enabled.
Access can be in bytes, half words, words or double words.
Address can be an effective address or a real address.
The subcommands mdvb, mdvh, mdvw and mdvd can be used to access these
areas in translated mode, using an effective address already mapped. The
subcommands mdpb, mdph, mdpw and mdpd can be used to access these areas in
translated mode, using a physical address that will be mapped. On 64-bit machines,
correctly aligned double words are accessed (mdpd and mdvd) with a single store
instruction. The DBAT interface is used to translate the address in cache-inhibited
mode (PowerPC only).
These subcommands accept the following parameters:
• Address - address of the memory to modify. This can either be a virtual
(effective) or physical address, dependent on the subcommand used. Symbols,
hexadecimal values, or hexadecimal expressions can be used in specification of
the address.
These sub commands will prompt for input until a “.” is entered.
Introduction The following table represents the trace sub commands and their matching crash/
lldb sub commands when available
bt sub command The trace point subcommand bt can be used to trace each execution of a specified
address. Each time a trace point is encountered during execution, a message is
displayed indicating that the trace point has been encountered. The displayed
message indicates the first entry from the stack.
The bt sub command can also take a test parameter, to break at the specified
address only if the test condition is true.
The conditional test requires two operands and a single operator. Values that can
be used as operands in a test subcommand include symbols, hexadecimal values,
and hexadecimal expressions. Comparison operators that are supported include:
==, !=, >=, <=, >, and <.
Additionally, the bitwise operators ^ (exclusive OR), & (AND), and | (OR) are
supported.
When bitwise operators are used, any non-zero result is considered to be true.
ct/cat sub The cat and ct sub commands erase all and individual trace points, respectively.
command The trace point cleared by the ct subcommand may be specified either by a slot
number or an address. These sub commands accept the following arguments:
examples The following example shows the use of the trace sub commands:
Introduction The following table represents the breakpoint and step sub commands and their
matching crash/lldb sub commands when available
b/lb sub The b subcommand sets a permanent global breakpoint in the code. KDB checks
command that a valid instruction will be trapped. If an invalid instruction is detected a
warning message is displayed. If the warning message is displayed the breakpoint
should be removed; otherwise, memory can be corrupted (the breakpoint has been
installed).
The lb sub command acts the same way as the b sub command, except that the
break point is local to the thread or CPU, depending on set option 14.
The following arguments may be used with the b/lb sub commands:
c/lc/ca sub c/lc and ca can be used to clear break points. The differences are:
commands
• c will clear general break points
• lc will clear local break points
• ca will clear all break points
The c and lc sub commands use the following parameters:
• ctx - context to be cleared for a local break. The context may either be a CPU or
thread specification.
r/gt sub A non-permanent breakpoint can be set using the subcommands r and gt. These
command subcommands set local breakpoints which are cleared after they have been hit.
The r subcommand sets a breakpoint on the address found in the lr register. In an
SMP environment, it is possible to hit this breakpoint on another processor, so it is
important to have thread/process local break points.
The gt subcommand performs the same as the r subcommand except that the
breakpoint address must be specified.
n/s sub The two subcommands n and s provide step functions. The s subcommand allows
command the processor to single step to the next instruction. The n subcommand also single
steps, but it steps over subroutine calls as though they were a single instruction.
• count: specify how many steps are executed before returning to the KDB
prompt.
S/B sub The S subcommand single steps but stops only on bl and br instructions. With that,
commands you can see every call and return of routines. A count can also be used to specify
how many times KDB continues before stopping.
• count: specify how many steps are executed before returning to the KDB
prompt.
Example The following example shows the use of break points:
# Debugger entered via keyboard.
.waitproc_find_run_queue+00006C srwi r29,r31,3 <00000000> r29=0,r31=0
KDB(0)> br open <== we set a break point on open.
.open+000000 (sid:00000000) permanent & global
KDB(0)> q <== we exit the kdb
# ls <== do some command that will certainly call open
Breakpoint <== open was called so we enter the KDB
.open+000000 li r6,0 <0000000000000000> r6=0
KDB(0)> s <== do one step
.open+000004 stdu stkp,FFFFFF80(stkp)
stkp=F00000002FF3B390,FFFFFF80(stkp)=F00000002FF3B310
KDB(0)> n <== another one
.open+000008 mflr r0 <.sys_call_ret+000000>
KDB(0)> dis .open+000008 32 <== now let’s find the following branch
.open+000008 mflr r0
.open+00000C extsw r4,r4
.open+000010 addi r7,stkp,70
.open+000014 std r0,90(stkp)
.open+000018 clrlwi r5,r5,0
.open+00001C bl <.copen> <== here it is
.open+000020 ori r0,r3,0
.open+000024 clrlwi r4,r3,0
KDB(0)> B <== this will break at the next branch that should be open+1c
.open+00001C bl <.copen> r3=0000000020008B88
KDB(0)> s <== we step that branch
.copen+000000 std r31,FFFFFFF8(stkp) r31=0,FFFFFFF8(stkp)=F00000002FF3B
308
KDB(0)> dr lr <== let’s see what is in the link register
lr : 0000000000387D24
.open+000020 ori r0,r3,0 <0000000020008B88> r0=0000000000003
77C,r3=0000000020008B88
KDB(0)> r <== break on the lr (we will return to the calling function)
.open+000020 ori r0,r3,0 <0000000000000000> r0=0000000000000030,r3=0
KDB(0)> ca <== clear all break points before leaving
KDB(0)> q <== exit the KDB
Introduction The following table represents the name list/symbol sub commands and their
matching crash/lldb sub commands when available
ns sub command The ns subcommand toggles symbolic name translation on and off.
ts sub command The ts subcommand translates addresses to symbolic representations. ts uses the
following argument:
examples (0)> nm kdb_avail <== display addresses for the kdb_avail symbol
Symbol Address: 0046AE70
TOC Address: 0046AC80
(0)> set 1 <== turn symbolic name translation off
Symbolic name translation off
(0)> ts 046AE70 <== get the symbol for 046AE70
0046AE70 <== didn’t get it because symbolic name translation is turned off
(0)> ns <== turn symbolic name translation back on
Symbolic name translation on
(0)> ts 046AE70 <== now we should get the symbol
kdb_avail+000000
Introduction The following table represents the watch break point sub commands and their
matching crash/lldb sub commands when available
wr, ww, wrw, lwr, lww, lwrw, cw and lcw sub commands On the PowerPC architecture, a watch register (the DABR, Data Address
Breakpoint Register, or HID5 on the Power 601) can be used to enter KDB when a
specified effective address is accessed. The register holds a double-word effective
address and bits to specify load and/or store operations.
So the watch break points can be used with the following rules:
examples KDB(0)> wr utsname 3 <== set a break on read of utsname for 3 bytes
CPU 0: utsname+000000 eaddr=001CB9C8 size=3 hit=0 mode=R Xlate ON
CPU 1: utsname+000000 eaddr=001CB9C8 size=3 hit=0 mode=R Xlate ON
KDB(0)> q <== exit the debugger
# uname -a <== run some command that will read the utsname
Watch trap: 001CB9C8 <utsname+000000>
.umem_move+000030 lbzx r7,r6,r3 r7=000000000000B6B4, r6=0, r3=00000000001CB9C8
KDB(0)> wr <== verify the number of hits -------v
CPU 0: utsname+000000 eaddr=001CB9C8 size=3 hit=1 mode=R Xlate ON
CPU 1: utsname+000000 eaddr=001CB9C8 size=3 hit=1 mode=R Xlate ON
KDB(0)> cw <== clear watch break points
KDB(0)> lwr utsname <== now set a local watch break point (only cpu 0)
CPU 0: utsname+000000 eaddr=001CB9C8 size=8 hit=0 mode=R Xlate ON
KDB(0)> lcw <== clear local watch break points
KDB(0)> q <== exit kdb, will resume the current thread
AIX oc3b42 0 5 000714834C00
Introduction The following table represents the status sub commands and their
matching crash/lldb sub commands when available
stat sub command The stat subcommand displays system statistics, including the last kernel printf()
messages still in memory. The following information is displayed for a processor
that has crashed:
Introduction The following table represents the kernel extension loader sub commands and
their matching crash/lldb sub commands when available
lke and stbl sub commands The subcommands lke and stbl can be used to display the current state of loaded
kernel extensions, using the following parameters:
rmst sub A symbol table can be removed from KDB using the rmst subcommand. This
command subcommand requires that either a slot number or the effective address for the
loader entry of the symbol table be specified.
exp sub The exp subcommand can be used to look for an exported symbol or to display the
command entire export list. If no argument is specified the entire export list is printed. If a
symbol name is specified as an argument, then all symbols which begin with the
input string are displayed.
Introduction The following table represents the address translation sub commands and their
matching crash/lldb sub commands when available
tr and tv sub The tr and tv sub commands can be used to display address translation
commands information. The tr sub command provides a short format; the tv subcommand a
detailed format.
For the tv subcommand, all double-hashed entries are dumped; when an entry
matches the specified effective address, the corresponding physical address and
protections are displayed. Page protection (K and PP bits) is displayed according
to the current segment register and machine state register values.
examples (0)> tr @iar <== display the physical address of the current instruction
Physical Address = 000000000002CB58
(0)> tv @iar <== display the physical mapping of the current instruction
eaddr 000000000002CB58 sid 0000000000000000 vpage 000000000000002C hash1 0000002
C
p64pte_cur_addr 0000000001001600 sid 0000000000000000 avpi 00 hsel 0 valid 1
rpn 000000000000002C refbit 1 modbit 0 wimg 2 key 3
____ 000000000002CB58 ____ K = 0 PP = 11 ==> read only
Introduction The following table represents the process/thread sub commands and their
matching crash/lldb sub commands when available
ppda sub command The ppda sub command displays per-processor data areas with the following
conditions:
intr sub command The intr sub command prints entries in the interrupt handler table with the
following conditions:
mst sub command The mst sub command prints the Machine State Save Area for:
• the current context: if no argument is provided
• slot: thread slot number. This value must be a decimal value.
• Address: effective address of an mst to display. Symbols, hexadecimal values,
or hexadecimal expressions can be used in specification of the address.
proc sub command The proc subcommand displays process table entries using:
• * : display a summary for all processes.
• -s flag : display only processes with a process state matching that specified by
flag. The allowable values for flag are: SNONE, SIDLE, SZOMB, SSTOP,
SACTIVE, and SSWAP.
• slot : process slot number. This value must be a decimal value.
• Address : effective address of a process table entry. Symbols, hexadecimal
values, or hexadecimal expressions can be used in specification of the address.
th sub command The thread subcommand displays thread table entries using:
sw sub By default, KDB shows the virtual space for the current thread. The sw
command subcommand allows selection of the thread to be considered the current thread.
Threads can be specified by slot number or address. The current thread can be
reset to its initial context by entering the sw subcommand with no arguments.
For the KDB Kernel Debugger, the initial context is also restored whenever
exiting the debugger.
example continued
(0)> ttid 70e <== now display threads for gil (70e), which should have 5 threads
SLOT NAME STATE TID PRI RQ CPUID CL WCHAN
pvthread+000380 7 gil SLEEP 00070F 025 1 65
pvthread+000580 11 gil SLEEP 000B17 025 1 65 netisr_servers
pvthread+000500 10 gil SLEEP 000A15 025 1 65 netisr_servers
pvthread+000480 9 gil SLEEP 000913 025 1 65 netisr_servers
pvthread+000400 8 gil SLEEP 000811 025 1 65 netisr_servers
(0)> user -ad 5 <== display address space for thread 5
User-mode address space mapping:
segs32_raddr.0000000000000000
uadspace node allocation......(U_unode) @ F00000002FF3E028
usr adspace 32bit process.(U_adspace32) @ F00000002FF3E048
segment node allocation.......(U_snode) @ F00000002FF3E008
segnode for 32bit process...(U_segnode) @ F00000002FF3E2A8
U_adspace_lock @ F00000002FF3E4E8
lock_word.....0000000000000000 vmm_lock_wait.0000000000000000
V_USERACC strtaddr:0x0000000000000000 Size:0x0000000000000000
vmmflags......00000000
(0)> sw 5 <== switch to the thread 5
Switch to thread: <pvthread+000280>
(0)> tpid <== display the current tpid that should be slot 5
SLOT NAME STATE TID PRI RQ CPUID CL WCHAN
pvthread+000280 5*xmgc SLEEP 00050B 03C 1 65 KERN_heap+ECD5730
(0)> sw <== switch back to initial thread
Switch to initial thread: <pvthread+001200>
(0)> tpid <== display the current tpid that should be initial pvthread+001200
SLOT NAME STATE TID PRI RQ CPUID CL WCHAN
pvthread+001200 36*kdb_64 RUN 002467 03C 0 0
Introduction  The following table lists the kernel stack subcommands and their matching crash/lldb subcommands, when available.
f sub command  The f subcommand displays all the stack frames from the current instruction, as deep as possible. Interrupts and system calls are crossed, and the user stack is also displayed. In user space, the traceback allows display of symbolic names.
The amount of data displayed may be controlled through the mst_wanted and
display_stacked_frames options of the set sub command. You can also request to
see the stacked registers using the display_stacked_regs set option.
examples (0)> f +x <== display the stack frame for the current thread
pvthread+000380 STACK:
[0002CB58]et_wait+00036C (0000000000212A0C, A0000000000010B2,
0000000000122A0C [??])
[000EF170]netthread_start+0000B8 ()
[00060F6C]procentry+000010 (??, ??, ??, ??)
(0)> f -x <==display the stack frame without addresses
pvthread+000380 STACK:
et_wait+00036C (.backt+000000, A0000000000010B2,
.v_prepin+000000 [??])
netthread_start+0000B8 ()
procentry+000010 (??, ??, ??, ??)
(0)> set 10 <== want to see the stacked registers
display_stacked_regs is true
(0)> f <== show the stack frame with stacked registers
pvthread+000380 STACK:
et_wait+00036C (.backt+000000, A0000000000010B2,
.v_prepin+000000 [??])
r31 : 0000000000000000 r30 : 0FFFFFFFF0100000 r29 : 0000000000205E38
r28 : 00000000DEADBEEF r27 : 00000000DEADBEEF r26 : 00000000DEADBEEF
r25 : 00000000DEADBEEF r24 : 00000000DEADBEEF r23 : 00000000DEADBEEF
r22 : 00000000DEADBEEF r21 : 00000000DEADBEEF r20 : 00000000DEADBEEF
r19 : 00000000DEADBEEF r18 : 00000000DEADBEEF r17 : 00000000DEADBEEF
r16 : 00000000DEADBEEF r15 : 00000000DEADBEEF r14 : 00000000DEADBEEF
netthread_start+0000B8 ()
r31 : 00000000DEADBEEF r30 : 00000000DEADBEEF r29 : 00000000DEADBEEF
procentry+000010 (??, ??, ??, ??)
Introduction  The following table lists the LVM subcommands and their matching crash/lldb subcommands, when available.
volgrp, pvol, lvol and pbuf sub commands  The volgrp, pvol, lvol and pbuf subcommands respectively display:
• volume group information (including lvol structures). volgrp addresses are
registered in the devsw table, in the DSDPTR field.
• physical volume information. pvol addresses are registered within the volgrp structure.
• logical volume information. lvol addresses are registered within the volgrp and
lvol structures.
• physical buffer information. pbuf addresses are registered within the volgrp and pvol structures.
examples (0)> dev 0xa <== get the device switch table entry for a volume group
Slot address F1000097140C3500
MAJOR: 00A
.
.
dump: 010E3D00
mpx: .nodev (0009E378)
revoke: .nodev (0009E378)
dsdptr: F10000971660D000 <== the pointer to the volgrp structure
selptr: 00000000
opts: 0000002A DEV_DEFINED DEV_MPSAFE
(0)> volgrp F10000971660D000
VOLGRP............. F10000971660D000
vg_lock............... FFFFFFFFFFFFFFFF partshift............. 0000000E
open_count............ 0000000A flags................. 00000000
lvols............... @ F10000971660D010 <== pointer to the lvol struct
pvols............... @ F10000971660E010 <== pointer to the pvol struct
major_num............. 0000000A
vg_id................. 0007148300004C00000000E12335DF7D
nextvg................ 00000000 opn_pin............. @ F10000971660E428
von_pid............... 00000A32 nxtactvg.............. 00000000
ca_freepvw............ 00000000 ca_pvwmem............. 00000000
ca_hld.............. @ F10000971660E488 ca_pv_wrt........... @ F10000971660E4A0
.
.
(0)> lvol F10000971E624E00 <== display one of the lvol structures
LVOL............ F10000971E624E00
work_Q.......... 00000000 lv_status....... 00000000
lv_options...... 00001000 nparts.......... 00000001
i_sched......... 00000000 nblocks......... 00034000
parts[0]........ F10000971E621A00
pvol@ F1000097163DF200 <== pointer to pvol structure
.............dev 8000000E00000001 start 002C9100
parts[1]........ 00000000
parts[2]........ 00000000
maxsize......... 00000000 tot_rds......... 00000000
complcnt........ 00000000 waitlist........ FFFFFFFF
stripe_exp...... 00000000 striping_width.. 00000000
lvol_intlock. @ F10000971E624E60 lvol_intlock.... 00000000
(0)> pvol F1000097163DF200 <== now display the pvol
PVOL............... F1000097163DF200
dev................ 8000000E00000001 xfcnt.............. 00000000
armpos............. 00000000 pvstate............ 00000000
pvnum.............. 00000000 vg_num............. 0000000A
fp................. F1000096000022F0 flags.............. 00000000
num_bbdir_ent...... 00000000 fst_usr_blk........ 00001100
beg_relblk......... 00867C2D next_relblk........ 00867C2D
max_relblk......... 00867D2C defect_tbl......... F1000097165F4C00
ca_pv............ @ F1000097163DF250 sa_area[0]....... @ F1000097163DF260
sa_area[1]....... @ F1000097163DF270
pv_pbuf.......... @ F1000097163DF280 <== pointer to pbuf
oclvm............ @ F1000097163DF3C8
Introduction  The following table lists the scsi subcommands and their matching crash/lldb subcommands, when available.
asc, vsc and csd sub commands  The asc, vsc and csd subcommands respectively print:
• ascsi adapter information : the ascsiddpin kernext is used to locate the adp_ctrl structure.
• vscsi adapter information : the vscsiddpin kernext is used to locate the vscsi_ptrs structure.
• scdisk disk information : the scdiskpin kernext is used to locate the scdisk_list structure.
If no argument is specified, the asc subcommand loads the slot numbers with addresses from the adp_ctrl structure. The asc and vsc subcommands can use the following arguments:
Example continued
mtu...................00141000
num_tcw_words.........00000011 shift.................00000000 tcw_word..............00000000
resvd1................00000000 cfg_close.............00000000
vpd_close.............00000000 locate_state..........00000004
locate_event..........FFFFFFFF rir_event.............FFFFFFFF
vpd_event.............FFFFFFFF eid_event.............FFFFFFFF
ebp_event.............FFFFFFFF eid_lock..............FFFFFFFF
recv_fn...............0124024C tm_recv_fn............00000000
tm_buf_info...........00000000 tm_head...............00000000
tm_tail...............00000000 tm_recv_buf...........00000000
tm_bufs_tot...........00000000 tm_bufs_at_adp........00000000
tm_bufs_to_enable.....00000000 tm_buf................00000000
tm_raddr..............00000000 proto_tag_e...........00000000
proto_tag_i...........00000000 adapter_check.........00000000
eid................ @ 5002E42C limbo_start_time......00000000
dev_eid............ @ 5002E4B0 tm_dev_eid......... @ 5002E8B0
pipe_full_cnt.........00000000 dump_state............00000000
pad...................00000000 adp_cmd_pending.......00000000
reset_pending.........00000000 epow_state............00000000
mm_reset_in_prog......00000000 sleep_pending.........00000000
bus_reset_in_prog.....00000000 first_try.............00000001
devs_in_use_I.........00000000 devs_in_use_E.........00000000
num_buf_cmds..........00000000 next_id...............00000045 next_id_tm............00000000
resvd4................00000000
ebp_flag..............00000000 tm_bufs_blocked.......00000000
tm_enable_threshold...00000000 limbo.................00000000
critical_path.........00000000 epow_reset_needed.....00000000
Introduction  The following table lists the memory allocator subcommands and their matching crash/lldb subcommands, when available.
kmstats sub command  The kmstats subcommand prints kernel memory allocator statistics. If no address is specified, all kernel allocator memory statistics are displayed. If an address is entered, only the specified statistics entry is displayed.
kmbucket sub command  The kmbucket subcommand prints kernel memory allocator buckets. If no arguments are specified, information is displayed for all allocator buckets for all CPUs. kmbucket accepts the following parameters:
xm sub command  The xmalloc subcommand may be used to display memory allocation information. Other than the -u option, these subcommands require that the Memory Overlay Detection System (MODS) is active. The MODS can be activated using the bosdebug command.
heap sub command  The heap subcommand displays information about heaps. If no argument is specified, information is displayed for the kernel heap. Information can be displayed for other heaps by specifying the address of a heap_t structure.
Examples continued
(0)> kmbucket <== display all kernel memory allocator buckets
displaying kmembucket for cpu 0 offset 5 size 0x00000020
address...............F100009715FA4C48 b_next..(x)...........F1000082C007BB80
b_calls..(x)..........0000000000000026 b_total..(x)..........0000000000000080
b_totalfree..(x)......000000000000005D b_elmpercl..(x).......0000000000000080
b_highwat..(x)........00000000000003F5 b_couldfree (sic).(x).0000000000000000
b_failed..(x).........0000000000000000 lock............... @ F100009715FA4C90
lock..(x).............0000000000000000
Introduction  The following table lists the file system subcommands and their matching crash/lldb subcommands, when available.
gnode, vnode, specnode, devnode, fifonode, rnode and hnode sub commands  The gnode, vnode, specnode, devnode, fifonode, rnode and hnode subcommands respectively display:
• generic node structure at the specified address.
• virtual node (vnode) table entries.
• special device node structure at the specified address.
• device node (devnode) table entries.
• fifo node table entries.
• remote node structure at the specified address.
• hash node table entries.
These subcommands accept:
• slot : slot number of a table entry. This argument must be a decimal value.
• Address : effective address of a table entry. Symbols, hexadecimal values, or
hexadecimal expressions can be used in specification of the address.
vfs sub command  The vfs subcommand displays entries of the virtual file system table. If no argument is entered, a summary is displayed with one line for each entry. Detailed information can be obtained for an entry by identifying the entry of interest. Individual entries can be displayed using:
• slot : slot number of a virtual file system table entry. This argument must be a
decimal value.
• Address : address of a virtual file system table entry. Symbols, hexadecimal
values, or hexadecimal expressions can be used in specification of the address.
gfs sub command  The gfs subcommand displays the generic file system structure at the specified address.
file sub command  The file subcommand displays file table entries. If no argument is entered, all file table entries are displayed in a summary. Used files are displayed first (count > 0), then the others. Detailed information can be displayed using:
• slot : slot number of a file table entry. This argument must be a decimal value.
• Address : effective address of a file table entry. Symbols, hexadecimal values,
or hexadecimal expressions can be used in specification of the address.
Introduction  The following table lists the system table subcommands and their matching crash/lldb subcommands, when available.
var sub command  The var subcommand prints the var structure and the system configuration of the machine, including:
devsw sub command  The dev subcommand displays device switch table entries. If no argument is specified, all entries are displayed. To display a specific entry, use:
trb sub command  The trb subcommand displays Timer Request Block (TRB) information. If this subcommand is entered without arguments, a menu is displayed allowing selection of the data to be displayed. Otherwise, you can use the following arguments:
• * : selects display of Timer Request Block (TRB) information for TRBs on all
CPUs. The information displayed will be summary information for some
options.
• cpu x : selects display of TRB information for the specified CPU. Note, the
characters "cpu" must be included in the input. The value x is a hexadecimal
number.
• option - the option number indicating the data to be displayed. The available
option numbers are :
• 1. TRB Maintenance Structure - Routine Addresses
• 2. System TRB
• 3. Thread Specified TRB
• 4. Current Thread TRB's
• 5. Address Specified TRB
• 6. Active TRB Chain
• 7. Free TRB Chain
• 8. Clock Interrupt Handler Information
• 9. Current System Time - System Timer Constants
slk, clk and dla sub commands  slk and clk respectively display simple and complex locks. If no argument is specified, a list of major locks will be displayed. Then, you can use the address of the lock to display the lock structure.
iplcb sub command  The iplcb subcommand displays the IPL Control Block structure using the following parameters:
• [cpu] : print the IPL CB (displays all information, including cpu information for [cpu]).
• * : print summary of all processors
• -dir : print directory information
• -proc [cpu] : print processor information
• -mem : print memory region information
• -sys : print system information
• -user : print user information
• -numa : print NUMA information
trace sub command  The trace subcommand displays data in the kernel trace buffers. Data is entered into these buffers via the trace shell command. The trace subcommand accepts the following parameters:
Examples (0)> !ls -al /dev/cd0 <== find the cd0 major number
br--r--r-- 1 root system 14, 0 Sep 08 11:18 /dev/cd0
(0)> lke 57 <== load the kernext for scsidd
ADDRESS FILE FILESIZE FLAGS MODULE NAME
57 049D6B00 00DB9740 000070D8 00080262 s_scsidd64/usr/lib/drivers/pci/s_scsidd
le_flags....... TEXT DATAINTEXT DATA DATAEXISTS 64
le_next........ 049D6900 le_svc_sequence 00000000.
.
.
.
.
Example continued
(0)> trace <== show the trace buffers; trace was started for proc events
Trace channel[0 - 7]: 0
Trace Channel 0 (7 entries)
Current queue starts at F1000097231F2000 and ends at F100009723232000
Current entry is #7 of 7 at F1000097231F2130
Hook ID: SYSC_EXECVE (00000134) Hook Type: Timestamped|Generic C000
ThreadIdent: 00003F0B
Timestamp: 26E264B2F6
Subhook ID/HookData: 0000
Data Length: 0007 bytes
D0: 00000001
*Variable Length Buffer: F1000097231F2140
Current queue starts at F1000097231F2000 and ends at F100009723232000
Current entry is #6 of 7 at F1000097231F2108
.
.
Introduction  The following table lists the network subcommands and their matching crash/lldb subcommands, when available.
ifnet sub command  The ifnet subcommand prints interface information. If no argument is specified, information is displayed for each entry in the ifnet table. Data for individual entries can be displayed by specifying:
• slot : specifies the slot number within the ifnet table for which data is to be
displayed. This value must be a decimal number.
• Address : effective address of an ifnet entry to display. Symbols, hexadecimal
values, or hexadecimal expressions can be used in specification of the address.
tcpcb and sock sub commands  The tcpcb and sock subcommands respectively print:
• tcpcb information for TCP/UDP blocks.
• socket information for TCP/UDP blocks.
If no argument is specified, tcpcb information is displayed for all TCP and UDP blocks. tcpcb and sock accept the following parameters:
tcb and udb sub commands  The tcb and udb subcommands can be used respectively to display:
• tcb block information plus socket information.
• udb block information plus socket information.
The tcb and udb subcommands accept the following parameters:
• slot : specifies the slot number within the table for which data is to be displayed. This value must be a decimal number.
• Address : effective address of a udb entry to display. Symbols, hexadecimal
values, or hexadecimal expressions can be used in specification of the address.
Introduction  The following table lists the VMM subcommands and their matching crash/lldb subcommands, when available.
vmker, pfhdata, vmstat, vmaddr, vmwait, zproc, vmlog, vrld and vmlocks sub commands  These subcommands will display VMM information about:
• vmker : virtual memory kernel data.
• pfhdata : virtual memory control variables.
• vmstat : virtual memory statistics.
• vmaddr : addresses of VMM structures.
• vmwait : displays VMM wait status using the address of a wait channel.
• zproc : displays information about the VMM zeroing kproc.
• vmlog : displays the current VMM error log entry.
• vrld : displays the VMM reload xlate table. This information is only used on SMP PowerPC machines, to prevent VMM reload deadlock.
• vmlocks : displays VMM spin lock data.
scb sub command  The scb subcommand provides options for display of information about VMM segment control blocks. The scb subcommand will prompt a menu to display scbs using the following options:
• 1 : index
• 2 : sid
• 3 : srval
• 4 : search on sibits
• 5 : search on npsblks
• 6 : search on nvpages
• 7 : search on npages
• 8 : search on npseablks
• 9 : search on lock
• a : search on segment type
• b : add total scb_vpages
• c : search on segment class
• d : search on segment pvproc
ames sub command  The ames subcommand provides options for display of the process address map for either the current or a specified process. The ames subcommand will prompt a menu to display the address map using the following options:
• 1 : current process
• 2 : specified process
• 3 : specified address map
pft sub command  The pft subcommand provides options for display of information about the VMM page frame table. The pft subcommand will prompt a menu to display page frame information using the following options:
pte sub command  The pte subcommand provides options for display of information about VMM page table entries. The pte subcommand will prompt a menu to display ptes using the following options:
• 1 : index
• 2 : sid,pno
• 3 : page frame
• 4 : PTE group
pta sub command  The pta subcommand displays data from the VMM PTA segment. The following optional arguments may be used to determine the data to be displayed:
pdt sub command  The pdt subcommand displays entries of the paging device table. An argument of * results in all entries being displayed in a summary. Details for a specific entry can be displayed using a slot number.
rmap sub command  The rmap subcommand displays the real address range mapping table. If an argument of * is specified, a summary of all entries is displayed. If a slot number is specified, only that entry is displayed. If no argument is specified, the user is prompted for a slot number, and data for that and all higher slots is displayed, as well as the page intervals utilized by the VMM.
ste sub command  The ste subcommand provides options for display of information about segment table entries for 64-bit processes. The ste subcommand will prompt a menu to display segments using the following options:
• 1 : esid
• 2 : sid
• 3 : dump hash class (input=esid)
• 4 : dump entire stab
sr64 sub command  The sr64 subcommand displays segment registers for a 64-bit process, using the following parameters:
• none : the segment registers will be displayed for the current process.
• -p pid : process ID of a 64-bit process. This must be a decimal or hexadecimal
value depending on the setting of the hexadecimal_wanted switch.
• esid : first segment register to display (lower register numbers are ignored).
This argument must be a hexadecimal value.
• size : value to be added to esid to determine the last segment register to display.
This argument must be a hexadecimal value.
apt sub command  The apt subcommand provides options for display of information from the alias page table. The apt subcommand will prompt a menu to display aliases using the following options:
• 1 : index
• 2 : sid,pno
• 3 : page frame
segst64 sub command  The segst64 subcommand displays segment state information for a 64-bit process. The information displayed can be filtered using:
ipc sub command  The ipc subcommand reports interprocess communication facility information. The ipc subcommand will prompt a menu to display ipc information using the following options:
• ***TBD***
lockanch, lockhash and lockword sub commands  These subcommands will display VMM lock information for:
• lockanch : anchor data and data for the transaction blocks in the transaction block table.
• lockhash : lock hash list.
• lockword : lock words.
lockanch, lockhash and lockword accept the following parameters :
• slot : slot number of an entry in the VMM lock table. This argument must be a
decimal value.
• Address : effective address of an entry in the VMM lock table. Symbols,
hexadecimal values, or hexadecimal expressions may be used in specification
of the address.
vmdmap sub command  The vmdmap subcommand displays VMM disk maps. To look at other disk maps, it is necessary to initialize segment register 13 with the corresponding srval. vmdmap accepts the following arguments:
• no arguments : all paging and file system disk maps are displayed.
• slot : Page Device Table (pdt) slot number. This argument must be a decimal
value.
examples ***TBD***
Introduction  The following table lists the SMP subcommands and their matching crash/lldb subcommands, when available.
start, stop and cpu sub commands  The start, stop and cpu subcommands allow you to:
• start a cpu.
• stop a cpu.
• display status or switch to another cpu.
These subcommands accept a cpu number as a parameter.
Examples ***TBD***
Introduction  The following table lists the block address translation subcommands and their matching crash/lldb subcommands, when available.
dbat and ibat sub commands  On PowerPC machines, the dbat and ibat subcommands may be used to display the dbat and ibat registers. dbat and ibat accept the following arguments:
mdbat and mibat sub commands  On PowerPC machines, the mdbat and mibat subcommands may be used to modify the dbat and ibat registers. The processor data bat register is altered immediately. KDB takes care of the valid bit; the word containing the valid bit is set last. mdbat and mibat accept the following arguments:
KDB data and instruction block address translation sub commands -- continued
Introduction  The following table lists the bat/brat subcommands and their matching crash/lldb subcommands, when available.
btac, lbtac, cbtac and lcbtac sub commands  The btac and lbtac subcommands can be used to stop when Branch Target Address Compare is true, using the hardware registers HID1 and HID2 on PowerPC systems, in the following conditions:
Introduction  The IADB is the kernel debugger used on AIX 5L running on the IA-64 platform.
# iadb
(0)> d dbg_avail
E000000004755BD8: 0000000000000001
loading IADB In AIX5L, the IADB is included in the unix_ia64 kernel located in /usr/lib/boot.
In order to use it, the IADB must be loaded at boot time. To allow IADB to load
use the following command :
You must reboot the system in order to take these changes into account.
starting IADB  The IADB may be started, if loaded, under the following circumstances:
• If bosboot or bosdebug was run with -I, the tty attached to a native serial port will show the IADB prompt just after the kernel is loaded.
• You may invoke the IADB manually from a tty attached to a native serial port, using Ctrl-Alt-Numpad4 on a native keyboard. For example:
Debugger entered by hitting cntrl-atl-numpad4
AIX/IA64 KERNEL DEBUGGER ENTERED Due to...
Debugger entered via keyboard with key in SERVICE position using
numpad 4
IP->E00000000008C910 waitproc_find_run_queue()+210: { .mib
==>0: adds sp = 0x40, sp
1: mov.i ar.lc = r33
2: br.ret.sptk.few rp
;; }
>CPU0>
IADB concept When the IADB Kernel Debugger is invoked, it is the only running program until
you exit IADB or you use the start sub command to start another cpu. All
processes are stopped and interrupts are disabled. The IADB Kernel Debugger
runs with its own Machine State Save Area (mst) and a special stack. In addition,
the IADB Kernel Debugger does not run operating system routines. Though this requires that kernel code be duplicated within IADB, it makes it possible to break anywhere within the kernel code. When exiting the IADB Kernel Debugger, all
processes continue to run unless the debugger was entered via a system halt.
iadb command
Introduction  The iadb command, unlike the IADB kernel debugger, allows examination of an operating system image produced on IA-64 systems.
The iadb command may be used on a running system but will not provide all
functions available with the IADB kernel debugger.
Parameters  The iadb command may be used with the following parameters:
• no parameter : iadb will use /dev/mem as the system image file and /usr/lib/boot/unix as the kernel file. In this case, root permissions are required.
• -d system_image_file : iadb will use the system image file provided.
• -u kernel_file : iadb will use the kernel file provided. This is required to analyze a system dump from a system running a different unix level.
• -i : include file list (may be comma-separated).
• -u : user modules list for any symbol retrieval (comma-separated list).
Loading errors If the system image file provided doesn’t contain a valid dump or the kernel file
doesn’t match the system image file, the following message may be issued by the
iadb command:
Introduction  The following table lists the breakpoint and step subcommands and their matching crash/lldb subcommands, when available.
br sub command  The br subcommand can be used to set and display software breakpoints. The br subcommand accepts the following options:
c sub command  The c subcommand can be used to clear some or all breakpoints. The c subcommand accepts the following parameters:
Examples  The following example shows the use of the br, c and s subcommands:
No Active Breakpoints
See Ya!
Introduction The following table lists the dump/display/decode subcommands and their matching crash/lldb subcommands, when available.
d subcommand The d subcommand can be used to display virtual memory using the following parameters:
dp subcommand The dp subcommand can be used to display physical memory using:
• address : physical address to dump
• ordinal : access size in bytes (1, 2, 4, or 8)
• count : number of elements to dump (of size 'ordinal')
dio subcommand The dio subcommand can be used to display the I/O space using the following parameters:
dis subcommand The dis subcommand can be used to list instructions at a given address using:
registers subcommands The following subcommands can be used to display register information:
• b : Display Branch Register(s)
• cfm : Display Current Stacked Register
• fpr : Display FPR(s) (f0 - f127)
• iip : Display or Modify Instruction Pointer
• iipa : Display Instruction Previous Address
• ifa : Display Fault Address
• intr : Display Interrupt Registers
• ipsr : Display/Decode IPSR
• isr : Display/Decode ISR
• itc : Display Time Registers ITC ITM & ITV
• kr : Display Kernel Register(s)
• p : Display Predicate Register(s)
• perfr : Display Performance Register(s)
• r : Display General Register(s)
• rr : Display Region Register(s)
• rse : Display Register Stack Registers
dpci subcommand The dpci subcommand can be used to display PCI device configuration space using the following parameters:
00000FFFFC0FDBF6: 50FF000000006F60
P.....o‘
E0000000040CF150: 0000000000000000
E0000000040CF150: 00000043
E0000000040CF150: 0000000000000043
0000000000005000: FFFFFFFFFFFFFFFF
0000000000005000: 1122334455667788
0000000000005000: 1122334455667788
Introduction The following table lists the modify memory subcommands and their matching crash/lldb subcommands, when available.
m subcommand The m subcommand can be used to modify virtual memory contents using:
mp subcommand The mp subcommand can be used to modify physical memory contents with the following parameters:
registers subcommands The following subcommands can be used to modify register information:
• b : Set Branch Register(s)
• iip : Modify Instruction Pointer
• kr : Set Kernel Register(s)
• p : Set Predicate Register(s)
• r : Set General Register(s)
• rr : Set Region Register(s)
mio subcommand The mio subcommand can be used to modify I/O space using:
• addr : I/O port address to modify
• ordinal : size of each data element (1,2,4,8)
• data1 : first data element to be stored with access of size 'ordinal'
• data2.. : subsequent data elements to be stored
b00:E00000000008E050 waitproc()+1B0
b01:BADC0FFEE0DDF00D
b02:BADC0FFEE0DDF00D
b03:BADC0FFEE0DDF00D
b04:BADC0FFEE0DDF00D
b05:BADC0FFEE0DDF00D
b06:E00000000008DEA0 waitproc()+0
b07:BADC0FFEE0DDF00D
IIP : E00000000008E000:waitproc()+160
r32:C000006013200000 [0]
r33:C000006013200290 [0]
r34:E00000971404B11C [0]
r35:E00000971404B120 [0]
r36:E0000000040C6060 [0]
r37:E0000000040C6068 [0]
r38:0000000000000186 [0]
r39:0000000000000009 [0]
r40:0000000000000001 [0]
r41:0000000000000001 [0]
rr0:0000000000480931
rr1:0000000000200431
rr2:0000000000280531
rr3:0000000000000030
rr4:0000000000000030
rr5:0000000000180331
rr6:0000000000100269
rr7:0000000000080131
The following table lists the name list/symbol subcommands and their matching crash/lldb subcommands, when available.
map subcommand The map subcommand can be used to translate a symbol into an address, and the reverse, and accepts the following as a parameter:
Examples >CPU0> map (r34) <== Lookup symbol for address in r34
>CPU0> map 0xe000000000000000 <== Lookup symbol for
address 0xe000000000000000
>CPU0> map foo+0x100 <== Lookup symbol for symbol
‘foo’+0x100
Introduction The following table lists the watch breakpoint subcommands and their matching crash/lldb subcommands, when available.
dbr subcommand The dbr subcommand can be used to set a breakpoint on data access using:
• action : the access to break on:
• r == Break on Read
• w == Break on Write
• rw == Break on Read or Write
• mask : bit mask of which address bits to match
• plvl_mask : bit mask of which privilege levels to match
• 0x1 == CPL 0 (Kernel)
• 0x2 == CPL 1 (unused)
• 0x4 == CPL 2 (unused)
• 0x8 == CPL 3 (User)
• addr : the address to trigger on
cdbr subcommand The cdbr subcommand can be used to clear previously set data breakpoints using:
• index : index of DBR breakpoint (from dbr cmd)
• all : clear all DBRs
Introduction The following table lists the trace subcommands and their matching crash/lldb subcommands, when available.
sys subcommand The sys subcommand displays the following information:
• Build level and build date
• Number and type of processors
• Memory size
• Processor Speed
• Bus Speed
reason subcommand The reason subcommand displays the reason the debugger was entered, along with the IP and the assembly code of the bundle at that IP.
Introduction The following table lists the kernel extension loader subcommands and their matching crash/lldb subcommands, when available.
kext The kext subcommand displays all loaded kernel extensions and their text and data load addresses.
ldsyms and unldsyms subcommands The ldsyms and unldsyms subcommands load or unload a kernel extension's symbols using:
• -p [path] : where path is the absolute file path of the kernel extension
• module : the module name
Introduction The following table lists the address translation subcommands and their matching crash/lldb subcommands, when available.
parameters x addr
where:
Introduction The following table lists the process/thread subcommands and their matching crash/lldb subcommands, when available.
ppda The ppda subcommand displays the Per Processor Descriptor Area and accepts the following parameters:
mst The mst subcommand displays the Machine State Stack using:
• -p : process id (PID)
• -t : Thread id (TID)
• * : All processes
• -b {bucket} : detailed info for threads in bucket of all run queue slots
• -g : global info for run queues
• -q [ number ] : detailed info for all queues
• -v {address} : detailed info for threads at run queue address
Examples ***TBD
Introduction The following table lists the LVM subcommands and their matching crash/lldb subcommands, when available.
parameters
Examples
Introduction The following table lists the SCSI subcommands and their matching crash/lldb subcommands, when available.
parameters
Examples
Introduction The following table lists the memory allocator subcommands and their matching crash/lldb subcommands, when available.
parameters
Examples
Introduction The following table lists the file system subcommands and their matching crash/lldb subcommands, when available.
parameters
Examples
Introduction The following table lists the system table subcommands and their matching crash/lldb subcommands, when available.
dev The dev subcommand displays the device switch table using:
iplcb The iplcb subcommand displays the IPL control block.
Examples
Introduction The following table lists the network subcommands and their matching crash/lldb subcommands, when available.
parameters
Examples
Introduction The following table lists the VMM subcommands and their matching crash/lldb subcommands, when available.
parameters
Examples
Introduction The following table lists the SMP subcommands and their matching crash/lldb subcommands, when available.
cpu The cpu subcommand can be used to display or change the current CPU you are working on using:
• num : logical CPU number to switch to
Introduction The following table lists the block address translation subcommands and their matching crash/lldb subcommands, when available.
parameters
Examples
Introduction The following table lists the bat/brat subcommands and their matching crash/lldb subcommands, when available.
parameters
Examples
Introduction The following table lists the miscellaneous subcommands and their matching crash/lldb subcommands, when available.
help subcommand The help subcommand can be used without a parameter to display the command listing, or with a command as a parameter to display help for that command.
kdbx subcommand The kdbx subcommand can be used to set the symbols needed to use kdb with the kdbx interface.
The following variables are set by kdbx and modify the output of certain subcommands:
• kdbx_addrd : Display breakpoint address instead of symbol name
• kdbx_bindisp : Display output in binary format instead of ASCII format
go subcommand The go subcommand is used to leave the KDB; this starts the dump process if the KDB was entered while the system was crashing.
set subcommand The set subcommand can be used to set or display the following kdb parameters:
• rows=number : set number of rows on the current display
• mltrace={on|off} : mltrace on/off; only on DEBUG kernel
• sctrace={on|off} : verbose syscall prints on/off; only on DEBUG kernel
• itrace={on|off} : enable/disable tracing on/off; only on DEBUG kernel
• umon={on|off} : enable/disable umon performance tool
• exectrace={on|off} : verbose exec prints on/off; only on DEBUG kernel
• excpenter={on|off} : debugger entry on exception on/off
• ldrprint={on|off} : verbose loader prints on/off; only on DEBUG kernel
• kprintvga={on|off} : kernel prints to VGA on/off
• dbgtty={on|off} : use debugger TTY as console on/off
• dbgmsg={on|off} : Tee Console and LED output to TTY
• hotkey={on|off} : enter debugger on key press on/off; only on DEBUG kernel
Examples
Exercise
Introduction In this exercise you will configure the system to enable the live debugger
and invoke both the live and image debugger for your system.
> stat
PowerPC: kdb
> dw kdb_avail
>q
IA-64: iadb
> d dbg_avail
> go
6. Execute the following truss command:
# exit
Exercise -- continued
Ctrl-Alt-NUMAPAD4
12. Enter the cpu command. What is the status
of CPU0?
________________________________
13. Exit the live debugger.
Platform
This lesson is independent of platform.
Lesson Objectives
At the end of the lesson you will be able to:
• List and describe the states of a process.
• List the steps taken by the kernel to create a new process as the
result of a fork() system call, and the steps taken to create a new
thread of execution.
• Describe what happens when a process terminates.
• List the three thread models available in AIX 5.
• Identify the relationship between the internal structures proc,
thread, user and u_thread.
• Use the kernel debugging tool to locate and examine processes,
proc, thread, user and u_thread data structures.
• Manage process scheduling using available commands, manage processes and threads on an SMP system (to best employ cache affinity scheduling), and manage processes on a ccNUMA system (to best employ quad affinity scheduling).
• List the factors determining what action the threads of a process will take when a signal is received.
• Write a simple C program that uses the fork() system call to spawn new processes, uses the wait() system call to retrieve the exit status of a child process, creates a simple multi-threaded program by using the pthread_create() call, and uses the exec() system call to load a new program into memory.
Process A process can be defined by the list of items which builds it. A process
definition consists of:
• A process table entry
• A process ID (PID)
• Virtual address space
- User-area (U-area)
- Program “text”
- Data
- User and kernel stacks
• Statistical information
Definition of process management Process management consists of the tools and the ability to have many processes and threads existing simultaneously in a system, sharing usage of the CPU or, in an SMP system, CPUs. Process management also includes the ability to start, stop, and force a stop of a process.
The tools and information used to manage the processes
• A process is a self-contained entity that consists of the information required to run a single program, such as a user application.
• The kernel contains a table entry for each process, called the proc entry.
• The proc entry contains information necessary to keep track of the
current state and location of page tables for the process.
• The proc entry resides in a slot in an array of proc entries.
• The kernel is configured with a fixed number of slots.
• All processes have a process ID or PID.
• The PID is assigned when the process is created and provides a
convenient way for users to refer to the other processes.
• The process contains a list of virtual memory addresses that the
process is allowed to access.
• The user-area (u_area) of a process contains additional information
about the process when it is running.
• The kernel tracks statistical information for the process, such as the
amount of time the process uses the CPU, the amount of memory the
process is using, etc. The statistical information is used by the kernel for
managing its resources and for accounting purposes.
Process operations Four basic operations define the lifetime of a process in the system:
• fork - Process creation
• exec - Loading of programs in process
• exit - Death of process
• wait - The parent process notification of the death of the child process.
Fork new processes The fork system call is the way to create a new process.
• All processes in the system (except the boot process) are created
from other processes through the fork mechanism.
• All processes are descendants of the init process (process 1).
• A process that forks creates a child process that is nearly a duplicate
of the original parent process.
• The child has a new proc entry (slot), PID, and registers.
• Statistical information is reset, and the child initially shares most of
the virtual memory space with the parent process.
• The child process initially runs the same program as the parent
process. The child may use the exec() call to run another program.
The fork() system call The parent process has an entry in the Process and Thread tables before the fork() system call; after the fork() system call, another independent process exists with its own entries in the Process and Thread tables.
[Figure: the fork() system call. Before the call, the parent process has one entry each in the Process Table and the Thread Table; after the call, the child process has its own entries in both tables.]
Inherited attributes after a fork() system call The illustration shows what happens when the fork() system call is issued. The caller creates a child process that is almost an exact copy of the process itself. The child process inherits many attributes of the parent, but receives a new user block and data region.
The child process inherits the following attributes from the parent process:
• Environment
• Close-on-exec flags and signal handling settings
• Set user ID mode bit and Set group ID mode bit
• Profiling on and off status
• Nice value
• All attached shared libraries
• Process group ID and tty group ID
• Current directory and Root directory
• File-mode creation mask and File size limit
• Attached shared memory segments and Attached mapped file
segments
• Debugger process ID and multiprocess flag, if the parent process has
multiprocess debugging enabled (described in the ptrace subroutine).
Attributes not inherited from the parent process Not all attributes are inherited from the parent. The child process differs from the parent process in the following ways:
• The child process has only one user thread: the one that called the fork subroutine, no matter how many threads the parent process had.
• The child process has a unique process ID.
• The child process ID does not match any active process group ID.
• The child process has a different parent process ID.
• The child process has its own copy of the file descriptors for the parent
process. However, each file descriptor of the child process shares a
common file pointer with the corresponding file descriptor of the parent
process.
• All semadj values are cleared.
• Process locks, text locks, and data locks are not inherited by the child
process.
• If multiprocess debugging is turned on, the trace flags are inherited from
the parent; otherwise, the trace flags are reset.
• The child process utime, stime, cutime, and cstime are set to 0.
• Any pending alarms are cleared in the child process.
• The set of signals pending for the child process is initialized to the
empty set.
• The child process can have its own copy of the message catalogue for
the parent process.
The fork() system call code example The following code illustrates the usage of the fork() system call. After the call there are two processes executing two different copies of the same code. A process can determine whether it is the parent or the child from the return code.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int statuslocation;
pid_t proc_id, proc_id2;

proc_id = fork();
if ( proc_id < 0 ) {
    printf ("fork error \n");
    exit (-1);
}
if ( proc_id > 0 ) {
    /* Parent process waiting for child to terminate */
    proc_id2 = wait(&statuslocation);
}
if ( proc_id == 0 ) {
    /* I'm the child process */
    /* ............. */
}
Listing processes with the ps command after fork() Executing the test program creates two processes, which can be listed with the ps command. The program name in the example is fork, and that name is listed as the command for both the parent and the child. Note that the child's PPID is equal to the PID of the parent.
F S UID PID PPID C PRI NI ADDR SZ TTY TIME CMD
Processes without the parent process The previous example showed how the PID of the calling process becomes the PPID of the child process. This example shows what happens if the parent process terminates before the child process does. If we rewrite the program so that the parent process terminates after fork() without waiting for the child, the system replaces the PPID with 1, the init process. The init process then picks up the SIGCHLD signal so that the system can free the process table entry, even though the parent process no longer exists. This situation is shown below:
F S UID PID PPID C PRI NI ADDR SZ TTY TIME CMD
240001 A 0 10346 10236 0 60 20 5b8b 496 pts/1 0:00 ksh
40001 A 0 10996 1 0 68 24 8330 44 pts/1 0:00 fork
200001 A 0 11216 10346 3 61 20 dbbb 244 0:00 ps
Zombie processes If, for some reason, no process receives the SIGCHLD signal from the child, the empty slot remains in the process table, even though other resources are released. Such a process is called a zombie and is listed in ps as <defunct>. The example below shows some of these zombie processes.
.....F S UID PID PPID C PRI NI ADDR SZ TTY TIME CMD
200003 A 0 1 0 0 60 20 500a 704 - 0:03 init
240401 A 0 2502 1 0 60 20 d2da 40 - 0:00 uprintfd
240001 A 0 2622 2874 0 60 20 2965 5208 - 0:46 X
40001 A 0 2874 1 0 60 20 c959 384 - 0:00 dtlogin
50005 Z 0 3776 1 1 68 24 0:00 <defunct>
40401 A 0 3890 1 0 60 20 91d2 480 - 0:00 errdemon
240001 A 0 4152 1 0 60 20 39c7 88 - 0:21 syncd
240001 A 0 4420 4648 0 60 20 4b29 220 - 0:00 writesrv
240001 A 0 4648 1 0 60 20 b1d6 308 - 0:00 srcmstr
50005 Z 0 10072 1 0 68 24 0:00 <defunct>
50005 Z 0 10454 1 0 68 24 0:00 <defunct>
Exec system call to load a new program The exec subroutine does not create a new process; it loads a new program into the process.
Valid program files for the exec() system call The fork() system call creates a new process with a copy of the environment, and the exec() system call loads a new program into the current process, overlaying the current program with a new one (called the new-process image). The new-process image file can be one of three file types:
Inherited attributes after the exec() system call The new-process image inherits the following attributes from the calling process image: session membership, PID, PPID, supplementary group IDs, process signal mask, and pending signals.
The exec() system call The illustration shows how the Process and Thread table entries remain unchanged after the exec() system call.
[Figure: the exec() system call. The process issues exec(); its entries in the Process Table and the Thread Table remain unchanged.]
The exec() system call code example The following code illustrates the usage of the execv() system call. After the call, the current process is overlaid with the new program. To illustrate the function, the output from the program is listed after the program.
The program first defines two variables. The first is a pointer to the program name to be executed, and the second is a pointer to the arguments (by convention the first argument passed is the program name itself). The program source for sleeping.c is not supplied, as any program can be used for this example.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int returncode;
char *argumentp[4], arg1[50], arg2[50], arg3[50];
const char *Path = "/home/olc/prog/thread/sleeping";

int main(int argc, char **argv)
{
    strcpy (arg1, "/home/olc/prog/thread/sleeping");
    strcpy (arg2, "test param 1");
    strcpy (arg3, "test param 2");
    argumentp[0] = arg1;
    argumentp[1] = arg2;
    argumentp[2] = arg3;
    argumentp[3] = NULL;   /* the argument list passed to execv() must end with NULL */
    printf ("before execv \n");
    returncode = execv(Path, argumentp);
    printf ("after execv \n");
    exit (0);
}
and the program output:
before execv
I’m the sleeping process
The exec() system call While the program in the example is being executed, we can examine the process status with the ps command. Notice that the program name for the example is “exec,” and the program name for the called program is “sleeping.” As we see in the listing from the ps command, the current program is replaced with the new one, and we never reach the print statement "after execv\n". The program prints “I’m the sleeping process,” because the main program has been replaced with the program in the path variable. If we look closer at the output from the ps -l command before and after the system call, we can tell that the program name has been replaced, but the PID and PPID remain the same.
#> ps -l
. F S UID PID PPID C PRI NI ADDR SZ TTY TIME CMD
240001 A 0 10346 10236 0 60 20 5b8b 492 pts/1 0:00 ksh
200001 A 0 10696 10346 2 61 20 6bad 240 pts/1 0:00 ps
200001 A 0 10964 10346 0 68 24 4388 40 pts/1 0:00 exec
And after the exec() system call, the exec program is replaced with
sleeping:
#> ps -l
. F S UID PID PPID C PRI NI ADDR TTY TIME CMD
240001 A 0 10346 10236 0 60 20 5b8b pts/1 0:00 ksh
200001 A 0 10698 10346 2 61 20 a354 pts/1 0:00 ps
200001 A 0 10964 10346 0 68 24 4388 pts/1 0:00 sleeping
Exit: what happens when a process terminates The exit system call is executed at the end of every process. The system call cleans up and releases memory, text, and data, but leaves an entry in the process table so that a return value and other status information can be passed to the parent process if needed.
• exit - termination of a process
• When a program no longer needs to run or execute other programs, it can exit.
• A program that exits causes the process to enter the zombie state.
Exiting from a program There are basically three ways a process can terminate: the program flow reaches an explicit exit(exit_value) statement; the program flow ends without an exit() statement (in which case the runtime startup code automatically calls exit); or the running program receives a signal from an external source, such as a keyboard interrupt (<Ctrl-c>) from the user. If the program receives an interrupt, the program path switches to the interrupt handling routine, either in the program or the system default routine, which terminates the program with an exit.
When executing the exit() system call, all memory and other resources are freed, and the parameter supplied to exit() is placed in the process table as the exit value for the process. After completion of the exit() system call, a SIGCHLD signal is issued to the parent process (the process at this stage is nothing but the process table entry). This is called the zombie state. When the parent process reacts to the SIGCHLD signal and reads the return code from the process table, the system can clean up and free the process table entry.
On rare occasions when the parent process cannot respond to the signal immediately, we can see the zombie in the process table with the ps command. A zombie is listed as <defunct>.
Waiting for the death of a child process The wait system call is placed at the end of a program; normally it is placed there by the programmer as the system call wait(), but if not, the system adds one automatically. The wait call is used to notify the parent process of the death of the child process and to release the child's process slot.
• The parent process can be notified of the death of the child by waiting with a system call or catching the proper signal.
• Once the parent process acknowledges the death of a child process, the child process' slot is freed.
Idle state When a process is being created, it is first in the idle state. This state is temporary until the fork mechanism is able to allocate all of the necessary resources for the creation and fill in the process table entry for the new process.
Active state Once the new child process creation is completed, it is placed in the active state. The active state is the normal process state, and threads in the process can be running or ready to run.
Swapped processes If a process is swapped, it means that another process is running, and the process, and any of its threads, cannot run until the scheduler makes it active again.
Zombie process When a process terminates with an exit system call, it first goes into the zombie state; such a process has most of its resources freed. However, a small part of the process remains, such as the exit value that the parent process uses to determine why the child process died. If the parent process issues a wait system call, the exit status is returned to the parent, the remaining resources of the child process are freed, and the process ceases to exist. The slot can then be used by another newly created process.
If the parent process no longer exists when a child process exits, the init process frees the remaining resources held by the child. Sometimes we can see a zombie staying in the process list for a longer time; one example of this situation is a process that exited while its parent process was busy or waiting in the kernel and unable to read the return code.
State transitions for AIX processes The illustration shows how a process is started with a fork() system call, becomes an active process, and how an active process can move between the swapped, active, and stopped states. A terminating process becomes a zombie until the entire process is removed.
[Figure: process state transitions. fork() takes a process from non-existing to idle, then active; an active process can move between the active, swapped, and stopped states; a terminating process passes through the zombie state back to non-existing.]
Kernel Processes
Kernel processes - Kproc Kernel processes:
• Are created by the kernel.
• Have a private u-area/kernel stack.
• Share "text" and data with the rest of the kernel.
• Are not affected by signals.
• Cannot use shared library object code or other user protection domain
code.
• Run in the Kernel Protection Domain.
Thread Fundamentals
Threads • Threads allow multiple execution units to share the same address
space.
• The thread is the fundamental unit of execution.
• A thread has an ID (TID), just as a process has an ID (PID).
• A thread is an independent flow of control within a process.
• In a multi-threaded process, each thread can execute different code concurrently.
• Managing threads requires fewer resources than managing processes.
• Inter-thread communication is more efficient than inter-process
communication, especially because variables can be shared.
Threads share data and address space
Threads reduce the need for IPC operations because they allow multiple
execution units to share the same address space, and thereby easily
share data. On the other hand, this adds complexity and risk to the
programming: for example, synchronization and locking have to be
controlled by the threads.
Threads are the unit of execution
The thread is the fundamental unit of execution, and the scheduler and
dispatcher work only with threads. Therefore, every process has at least
one thread.
Thread IDs (TID) and Process IDs (PID)
TIDs are listed for all threads in the thread table; TIDs are always odd.
PIDs are listed for all processes in the process table; PIDs are always
even, except for the init process, where PID = 1. Threads represent
independent flows within a process; the system does not provide
synchronization, and the control must be in the thread itself.
One of the main reasons for using threads is that managing threads
requires fewer resources than managing processes. Inter-thread
communication is more efficient than inter-process communication.
AIX Thread
AIX Threads • A thread is an independent flow of control that operates within the same
address space as other independent flows of control within a process.
In other operating systems, threads are sometimes called "lightweight
processes," or the meaning of the word "thread" is sometimes slightly
different.
• Multiple threads of control allow an application to overlap operations
such as reading from a terminal or writing to a disk file. This also allows
an application to service requests from multiple users at the same time.
• Multiple threads of control within a single process are required by
application developers to be able to provide these capabilities without
the overhead of multiple processes.
• Multiple threads of control within a single process allow application
developers to exploit the throughput of multiprocessor (MP) hardware.
TID format Thread IDs have the following format for 32-bit kernels:
    bits 31-24: 0    bits 23-8: INDEX    bits 7-1: COUNT    bit 0: 1
For 64-bit kernels:
    bits 63-24: 0    bits 23-8: INDEX    bits 7-1: COUNT    bit 0: 1
TID format listed with kdb
The following is a 64-bit slot in the thread table listed with kdb. The TID is
002143 hex, so the INDEX = 21 hex and the COUNT = 43 hex; 21 hex = 33
decimal. According to the figure, this is the slot number in the thread table;
the value is listed in the next line of the output from kdb.
(0)> thread 33
SLOT NAME STATE TID PRI RQ CPUID CL WCHAN
pvthread+001080 33 sendmail SLEEP 002143 03C 0 0
(0)> d pvthread+001080
pvthread+001080: 0000 0000 0000 2143 0000 0000 0000 0000
(0)>
Thread Concepts
Thread mapping models
• User threads are mapped to kernel threads by the threads library. The
way this mapping is done is called the thread model. There are three
possible thread models, corresponding to three different ways to map
user threads to kernel threads:
• M:1 model
• 1:1 model
• M:N model.
• The AIX Version 4.1 and later threads support is based on the OSF/1
libpthreads implementation. It supports what is referred to as the 1:1
model. This means that for every thread visible in an application, there
is a corresponding kernel thread. Architecturally, it is possible to have an
M:N libpthreads model, where "M" user threads are multiplexed on "N"
kernel threads. This is supported in AIX 4.3.1 and AIX 5L.
• The mapping of user threads to kernel threads is done using virtual
processors. A virtual processor (VP) is a library entity that is usually
implicit. For a user thread, the virtual processor behaves as a CPU for a
kernel thread. In the library, the virtual processor is a kernel thread or a
structure bound to a kernel thread.
• The libpthreads implementation is provided for application developers to
develop portable multi-threaded applications. The libpthreads.a library
is written to the POSIX 1003.4a Draft 10 specification in AIX 4.3;
previous versions of AIX support the POSIX 1003.4a Draft 7
specification. The libpthreads is a linkable user library that provides
user-space threads services to an application. The libpthreads_compat.a
library provides the POSIX 1003.4a Draft 7 pthreads model on AIX 4.3.
Threads Models
M:1 threads model
In the M:1 model, all user threads are mapped to one kernel thread and all
user threads run on one VP. The mapping is handled by a library
scheduler. All user threads programming facilities are completely handled
by the library. This model can be used on any system, especially on
traditional single-threaded systems.
[Figure: threads library with a library scheduler multiplexing all user threads onto one VP and one kernel thread]
1:1 threads model
In the 1:1 model, each user thread is mapped to one kernel thread and
each user thread runs on one VP. Most of the user threads programming
facilities are directly handled by the kernel threads.
[Figure: threads library mapping each user thread to its own VP and kernel thread]
M:N threads model
In the M:N model, all user threads are mapped to a pool of kernel threads
and all user threads run on a pool of virtual processors. A user thread may
be bound to a specific VP, as in the 1:1 model. All unbound user threads
share the remaining VPs. This is the most efficient and most complex
thread model; the user threads programming facilities are shared between
the threads library and the kernel threads.
[Figure: library scheduler multiplexing a pool of user threads onto a pool of VPs and kernel threads]
Thread states
Thread states In AIX, the kernel allows many threads to run at the same time, but there
can only be one thread executing on each CPU at a time. The thread state
is kept in t_state in the thread table (for detailed information, look in the
/usr/include/sys/thread.h file).
Idle state When processes and threads are being created, they are first in the idle
state. This state is temporary until the fork mechanism is able to allocate
all of the necessary resources for the creation and fill in the thread table for
a new thread.
Ready to run Once the new thread creation is completed, it is placed in the ready-to-run
state. The thread waits in this state until it is run. When the thread
is running, it continues to run until it has used a time slice, gives up the
CPU, or is preempted by a higher-priority thread.
Running thread
A thread in the running state is the thread executing on the CPU. The
thread state will change between running and ready to run until the thread
finishes execution; the thread then goes to the zombie state.
Sleeping Whenever the thread is waiting for an event, the thread is said to be
sleeping.
Swapped Though swapping takes place at the process level and all threads of a
process are swapped at the same time, the thread table is updated
whenever the thread is swapped.
Zombie The zombie state is an intermediate state for the thread, lasting only until
all resources owned by the thread are given up.
State transitions for AIX threads
The illustration shows the states for AIX threads. Threads typically change
between running, ready to run, sleeping, and stopped during the lifetime of
the thread.
[Figure: thread state transitions -- ready to run, swapped, zombie, non existing]
Thread Management
Thread / process relationship
• The diagram below shows how the process shares most of the data
among the threads; although each thread has its own copy of the
registers, some kernel threads have specific data, and therefore have a
private stack. Thus, data can be passed between threads via global
variables.
• A conventional single-threaded UNIX process can only harm itself (if
incorrectly coded).
• All threads in a process share the same address space, so in an
incorrectly coded program, one thread can damage the stack and data
areas associated with other threads in that process.
• Except for such areas as explicitly shared memory segments, a process
cannot directly affect other processes.
• There is some kernel data that is shared between the threads, but the
kernel also maintains thread-specific data.
• Per-process data that is needed even when the process is swapped out
is in the pvproc structure. The pvproc structure is pinned.
• Per-process data that is needed only when the process is swapped in is
in the user structure.
• Per-thread data that is needed even when the process is swapped out is
in the pvthread structure. The pvthread structure is pinned.
• Per-thread data that is needed only when the process is swapped in is
in the uthread structure.
Data placement overview
[Figure: program code, data, and BSS shared by the process, with a private stack and kernel thread data for each thread]
Process swapping
Thread Scheduling
Thread scheduling
Scheduling and dispatching is the ability to assign CPU time to threads in
the system in an efficient and fair way. The problem is to design the system
to handle many simultaneous threads and at the same time still be
responsive to events.
Clock ticks and time slices
The division of time among the threads on the AIX system relies on clock
ticks. Every 1/100 of a second, or 100 times a second, the dispatcher is
called and does the following:
• Increases the running tick counter for the running process.
• Scans run queues for the thread with the highest priority.
• Dispatches the most favored thread.
Once every real second, the scheduler wakes up and recalculates the
priority for all threads.
Thread priority • AIX priority has 128 (0-127) levels that are called run queue levels.
• The higher the run queue level, the lower the priority.
• Priority 127 can only be used by the wait process.
• User processes can have their priority changed by -20 to +20 levels
(renice).
• User processes are in the range 40 - 80.
• A clock tick interrupt decreases thread priority.
• The scheduler (swapper) increases thread priority.
The priority is based on the basic priority level, the initial nice value, the
renice value and a penalty.
[Figure: priority components stacked; a higher value means a lower priority]
• Base priority: default value = 40
• Nice value: default = 20
• Renice value: -20 to +20
• Penalty based on runtime
SCHED_RR threads scheduling algorithms
• SCHED_RR
- This is a round-robin scheduling mechanism in which the thread is
time-sliced at fixed priority.
- This scheme is similar to creating a fixed-priority, real-time process.
- The thread must have root authority to be able to use this
scheduling mechanism.
SCHED_FIFO threads scheduling algorithms
• SCHED_FIFO
- A non-preemptive scheduling scheme.
- The thread runs at fixed priority and is not time-sliced.
- It will be allowed to run on a processor until it voluntarily
relinquishes the processor by blocking or yielding.
- A thread using SCHED_FIFO must also have root authority to use
it.
- It is possible to create a thread with SCHED_FIFO that has a high
enough priority that it could monopolize the processor.
SCHED_OTHER threads scheduling algorithms
• SCHED_OTHER
- The default AIX scheduling.
- Priority degrades with CPU usage.
Thread scheduling
Like most UNIX systems, AIX uses a multilevel round-robin model for
process and thread scheduling. Processes and threads at the same
priority level are linked together and placed on a run queue. AIX has 128
run queues, 0-127, each representing one of the 128 possible priorities.
When a process starts running, it is given a priority based on
the nice value, and the process is linked with other processes at the same
level. As the process runs and consumes CPU resources, the priority
decreases until it finishes, or until the priority is so low that other
processes get CPU time. If a process does not run, the priority increases
until it can get CPU time again. The drawing below illustrates the 128 run
queue levels and six processes: three at priority 60 and three at 70.
[Figure: run queue levels 20, 40, 60, 80, 100, 120; level 127 holds the idle process]
Thread scheduling algorithm
The scheduler uses the following algorithm to calculate priorities for the
running processes:
Invariants:
-20 <= nice <= 20
0 <= r <= 32
0 <= d <= 32
0 <= ticks <= 120
0 <= p <= 126
The r and d values control how a process is impacted by its run time; r
controls how severely a process is penalized for used CPU time, while d
controls how fast the system “forgives” previous CPU consumption.
The r and d values can be set with the schedtune [-r r_val] [-d d_val] command.
The Dispatcher
Context switch The dispatcher switches context to make a different thread execute:
• As a thread executes in the CPU, its priority becomes less favored.
• The scheduler re-calculates the priority of the executing thread and
measures the new priority against the priorities of the threads that
are runnable.
• In AIX, the run queues are divided into 128 separate priority queues
with priority 0 being the most-favored priority and priority 127 the
least-favored.
• Threads at the same priority level are on the same run queue for
quick determination of the next runnable process.
• All of the threads on a more-favored priority queue run before
threads on a less-favored priority queue.
• Queue 127 contains the wait threads. There is one wait thread per
CPU, and these run only when there are no other runnable threads.
Thread In AIX, the kernel allows preemption of both user and kernel threads.
preemption
• Preemption allows the kernel to respond to real-time processes
much faster.
• On most UNIX systems, when a thread is in kernel mode, no other
thread can execute until the thread in kernel mode returns to user
mode or voluntarily gives up the CPU.
• In AIX, other higher priority threads may preempt threads running in
kernel mode.
• This feature supports real-time processing where a real-time process
must respond to an action immediately.
• Some sections of code have been determined to be critical sections
where preemption is not possible because preemption may cause
inconsistent kernel data structures. These sections are protected
either by preventing preemption (by disabling interrupts) or by
holding a lock.
• The kernel can use locks to serialize access to global kernel data that
could be corrupted by preemption.
• The thread holding the lock for a piece of data is guaranteed to run at
a higher priority than the set of threads waiting for the lock. This is
called priority promotion.
• However, other threads running at higher priority and not asking for
the lock on the same piece of data can preempt the locking thread.
There is one global priority-based, multi-level run queue (runq). All threads
that are runnable are linked into one of these runq entries. There are
currently 128 priorities (0-127). The scheduler periodically scans the list of
all active threads and recalculates thread priorities based on the amount of
processor time used.
Multiple run queues (MRQ)
• AIX 4.3.3 uses multiple specialized run queues instead of just one
global queue.
• Each processor has its own local run queue, and each node has a
global run queue.
• Processors dispatch threads from the local and the global run queue.
[Figure: twelve local run queues RQ 0-11, one per CPU 0-11]
If a process has the variable RT_GRQ=ON set, it will sacrifice cache
optimization for the best possible real-time behavior. That is, the process
will be on the global run queue and run on the first available CPU. Threads
can be bound to one CPU and will then never be on the global run queue.
Each local run queue has its own lock. This reduces the lock contention
and makes the lock handling faster. The local queue makes the scan faster
because there is no special handling of bound threads, and simple
handling of soft affinity with one CPU per run queue. The kernel cache
contention is reduced because each CPU updates its own dispatcher state,
and the structures for threads in the local run queue are more likely to be in
the local cache.
Initial load balancing
When new unbound threads are created, they should initially be placed so
that the system load remains balanced. This has to be handled differently
for new processes and for additional threads in an existing process.
Idle load balancing
Idle load balancing occurs when a CPU goes idle and starts looking for
work in other run queues. The criteria for permitting a thread steal are:
• Foreign run queue threads are greater than 1/4 load factor of the
node.
• There is at least one stealable (unbound) thread available.
• There is at least one unstealable (bound) thread available.
• The number of threads stolen from this run queue during the current
clock interval is less than 20.
• Should multiple run queues meet these criteria, the one with the most
threads will be used.
• If this run queue’s lock is available, its best priority unbound thread
will be stolen, assuming its p_lock_d is available.
• Note that failure to lock the run queue or the thread will cause the
dispatcher to loop through waitproc, thereby opening up a periodic
enablement window.
Process and thread management data structure overview
Four main data structures are used for process management:
• proc
• thread
• user
• uthread
The figure below shows how the tables are linked together.
Thread management data structure overview
The diagram above shows that the thread structures contain
pointers to all the other structures required to run that particular
thread. This is a reflection of the fact that the thread is the
schedulable entity, and the system must be able to access all
structures from the pointers in the thread table. The thread table
entries are doubly and circularly linked to all other threads for a
particular process. Note that the ublock structure contains the user
structure plus the uthread structure for the initial thread. The uthread
structures for all other threads are in the uthread (and kernel thread
stacks) segment. The first uthread structure is kept separate within
the ublock so that the fields it contains can be addressed directly and
so that fork and exec can operate with only the process private
segment to deal with.
The proc and thread structures are maintained in the kernel extension
segment as a part of the process and thread tables of the kernel. Every
in-use entry in these tables is pinned, such that the information there is
always available to the kernel. The user and uthread structures are
maintained in the process private segment of the corresponding process.
These structures are only pinned when the process is not swapped out.
When the process is swapped out, they are unpinned.
Process and The previous diagram shows how the tables are linked together. Each
thread links process in the system has an entry in the process table. Each process
entry has a pointer to the list of threads for the process, and the thread list
has a pointer back to the process table. The thread list is a double circular
linked list of all the threads owned by the process, and the pvthreads
entries point to the user area and uthread field in the process data area.
Proc structure fields and pointers
The following is an extract of the fields in the proc table to show the
pointers. Note that each entry in the proc table starts with a pointer to a
pvproc structure (we will later discuss the pvproc structure). The proc table
holds the number of threads, and the pvproc table has a pointer
pv_threadlist that points to the first thread for the process in the thread
table. A complete listing of the structures can be found in the file
/usr/include/sys/proc.h.
struct proc {
/* thread fields */
ushort p_threadcount; /* number of threads */
ushort p_active; /* number of active threads */
.......
};
struct pvproc {
/* identifier fields */
pid_t pv_pid; /* unique process identifier */
pid_t pv_ppid; /* parent process identifier */
/* thread fields */
struct pvthread *pv_threadlist; /* head of list of threads */
.......};
Thread table fields and links to the process table and the ublock
Like the process table, the thread table is divided into two tables: a
pvthread table and a thread table. The complete structures can be found in
the file /usr/include/sys/thread.h. The structures listed contain only
selected variables.
struct thread {
struct t_uaddress {
struct uthread *uthreadp; /* local data */
struct user *userp; /* owner process’ ublock (const)*/
} t_uaddress;
......
};
struct pvthread {
/* identifier fields */
tid_t tv_tid; /* unique thread identifier */
......
};
Process and thread tables' addresses in the kernel
AIX 5L has a 64-bit kernel and the addresses are 64 bits long. Both
process and thread tables are kept in the kernel extension segment at
fixed addresses.
• The proc table starts at 0xF100008080000000.
• The thread table starts at 0xF100008090000000.
AIX 4.3.3 has a 32-bit kernel and the addresses are only 32 bits long; the
values for an AIX 4.3.3 32-bit kernel are:
• The proc table starts at 0xe2000000.
• The thread table starts at 0xe6000000.
The slot number can be derived from PID or TID bits 8-23. See the
example and list from the process table on an AIX 5L POWER system.
• The generation count for each slot is incremented every time a PID or
TID is created in that slot.
Looking at AIX 4 process structures with kdb
Looking at the process table with kdb, we can tell that there is a difference
between AIX 4 and AIX 5. List the process table with the p subcommand in
kdb. The process table starts at address proc and the process slot used by
kdb is 7936, which is offset by 326000 (hex) from the start of the process
table. The size of proc is 326000 (hex) / 7936 (dec) = 416 (dec) = 1A0
(hex).
The size of each process slot can be verified with the p * subcommand. In
the following list, each slot is offset by 1A0 bytes.
(0)> nm proc
Symbol Address : E2000000
TOC Address : 001F9EF8
Looking at AIX 5 process structures with kdb
The same lists will look different on an AIX 5 system. First, in a list of the
proc table, we can tell that the structure used is no longer proc but pvproc,
and each pvproc slot is 6680 (hex) / 41 (dec) = 280 (hex) long.
(0)> p
SLOT NAME STATE PID PPID ADSPACE CL #THS
pvproc+006680 41*kdb_64 ACTIVE 0002996 00037D8 00000000200040AA 0 0001
Listing the first three slots shows that the offset is 280(hex) between the
slots.
(0)> p *
SLOT NAME STATE PID PPID ADSPACE CL
pvproc+000000 0 swapper ACTIVE 0000000 0000000 0000000000000B00 0
pvproc+000280 1 init ACTIVE 0000001 0000000 000000000000E2FD 00
pvproc+000500 2 wait ACTIVE 0000204 0000000 0000000000001B02 0
(0)> nm pvproc
Symbol Address : F100008080000000
TOC Address : 0046AC80
(0)>
Process data structure changes in AIX 5
The changes in the process table are made to support the NUMA
(Non-Uniform Memory Access) structure in AIX 5L.
A NUMA system consists of one or more separate nodes connected by a
very fast interconnect. The nodes operate as one computer, running one
copy of AIX. The name NUMA refers to the fact that the memory access
time is not constant. A CPU accessing memory on its own node will get the
memory fast (accessed via the local bus). A CPU accessing remote
memory will have to get the data from a remote node, and the access will
be slower.
In order to make the system efficient, we want to keep all parts of a
process close together so that memory access is fast; therefore, the proc
structure has been rearranged and divided into two parts: struct pvproc
holds global process data, and the rest is still in struct proc. This
change allows the NUMA system to move processes around between
CPUs or “QUADS” and still have most of the process table local to the
process. However, some of the process table must be kept at the main
node in a NUMA system.
Because of things like shared memory, processes can form migration
groups. These are groups of processes, shared memory, files, and so on,
that are logically attached to each other. The most common form of logical
attachment involves one item being intrinsically tied in with another process.
For example, a process that creates a shared memory segment is logically
attached to it. If another process uses the shared memory segment, it is
logically attached to it, and as a result is in a migration group with the first
process. Additionally, the user is allowed to create logical attachments
between items through the NUMA APIs.
The proc structure in an AIX 5 system starts with a pvproc structure and
continues with process flags. The start of the structure is listed here; for a
full listing, see the file /usr/include/sys/proc.h.
struct proc {
struct pvproc *p_pvprocp; /* my global process data*/
pid_t p_pid; /* unique process identifier*/
uint p_flag; /* process flags */
......
};
Process ID (PID) and process table slot number
The process ID or thread ID is composed of a process slot number and a
generation count; bit 0 tells us whether it is a PID or a TID (all PIDs are
even). The next 7 bits are the generation count; the generation count
prevents the rapid reuse of process IDs. Bits 8 to 23 are the slot number in
the process table. The information can be verified from the pvproc list,
where bits 8-23 in the PID field match the process slot number in the
pvproc table.
    bits 63-24: 0    bits 23-8: Slot Number    bits 7-1: Generation Count    bit 0: 0 if PID, 1 if TID
Priority boost
Priority boost is a facility that ensures that higher-priority processes
get CPU time, and that the time such processes have to wait for lower-priority
processes is minimized. Priority boost was implemented in AIX
4.3 and is further enhanced in AIX 5.
[Figure: locks A, B, and C held or requested by process 1 (low priority), process 2 (high priority), and process 3 (medium priority)]
User area in User64
The user structure is much larger in the 64-bit kernel than in the 32-bit
kernel. To improve efficiency and performance in the 32-bit kernel, two
structures are maintained: a 32-bit and a 64-bit version. This ensures that
the kernel does not copy data areas which are not used.
What is system hang detection and why do we need it?
Runaway processes and hanging systems are hard to tell apart from
locked systems, and methods to detect the runaway process are needed.
• Misbehaving high-priority applications are a recurring problem.
• When one or more processes or threads are stuck in the running
state, they can prevent any other lower-priority threads from running.
• If the priority is above the default user priority, the machine can
appear to be hung.
• The hung situation is very difficult to debug since the administrator
cannot tell what is happening on the system.
The solution to the hang problem is system hang detection. It is
implemented by the shdaemon, which runs at the highest user priority.
Shdaemon monitors the lowest-priority process that has run on the system
in a given period of time, and if the system fails to run processes below a
given priority threshold, an action is taken. System hang detection can be
configured with the shconf command, but the easiest way is to use the
smit panel. There are five distinct actions that can be taken, and for each
of them a timeout value and a threshold priority value can be set.
Signals
What are signals?
• Signals are a way of notifying a process or thread of a system event.
• Signals also provide a means of interprocess communication.
• A signal is represented as a single bit within a bit field in the kernel.
• The bit field used for signals is 64 bits wide, but only about 40 signals
are defined.
• AIX 4.3.3 defines only 37 signals for the user.
• AIX 5 has 44 defined signals, but three of them are not used.
Types of signals
• There are two types of signals in AIX: synchronous and asynchronous.
• Synchronous signals are only delivered to a thread, usually as a result
of an error condition or exception caused by the thread; for example,
SIGILL is delivered to a thread that tries to execute an illegal instruction.
• Asynchronous signals are generated externally to the current thread or
process.
• Asynchronous signals may be delivered to a process (that is, kill()) or to
another thread within the same process (that is, thread_kill() or tidsig()).
Signal mechanism
When an event triggers a signal, the kernel sets the corresponding bit in
the pending signal bit field for the process (p_sig) or thread (t_sig).
• All signals are enabled by default, and when returning from the kernel,
threads are looking for signals.
• If the signal is being ignored (masked), nothing happens.
Signal handling
Signal delivery
When a signal has been generated but not yet handled, it is said to be
pending.
• Pending signals are detected when returning from a system call.
• Pending signals are detected when resuming in user mode.
• Pending signals are detected entering or during an interruptible
sleep.
• Signals may be caught, blocked or ignored by a process.
Signal handling
Signal handling is done at the process level and signal masking is done at
the thread level. That is, each thread in a process must use the signal
handler set up by the process, but each has its own signal mask.
• If a pending signal is not specifically handled by the process, it is
delivered to all threads in the process.
• If the signal is handled by the process, the signal is delivered to a
thread that is not blocking the signal.
• If all threads are blocking a signal, it is left pending for the process
until one thread unmasks the signal or the signal is removed from the
pending list.
• If more than one signal is pending, only one is chosen for delivery at
a time.
• When a signal is being handled, it is moved to the p_cursig or
t_cursig field in the pvproc or pvthread structure.
Signal handler routines
There is a default system handler for all signals, but most signals have a
local handler routine, or the signal is ignored or blocked.
• SIGKILL and SIGSTOP cannot be handled by a local routine; these
signals will always be handled by the system default routine.
• SIGKILL and SIGSTOP cannot be blocked; the process will always
handle the signal.
Signal actions The default action for a signal depends on the signal, but may be one of
the following:
• Abort: This will generate a core dump and terminate the process.
• Exit: This will terminate the process without generating a core dump.
• Ignore: The signal is ignored.
• Stop: This action will suspend the process or thread.
• Continue: This action will resume a suspended process or thread.
Signals -- continued
Signal data structures
The file /usr/include/sys/proc.h defines the proc structure, and the following
information about signals is kept in the proc structure.
/* Signal information */
sigset_t p_sig; /* pending signals */
sigset_t p_sigignore; /* signals being ignored */
sigset_t p_sigcatch; /* signals being caught */
sigset_t p_siginfo; /* keep siginfo_t for these */
Signals • A signal is a bit set in an array with enough bits set aside for each signal
number.
• The bits are turned on by kernel code as the process is executing in
kernel mode or by the processing of interrupts that are determined to be
assigned to the process.
• Signals can also be sent from one process to another process through
the use of system calls.
• Signals are delivered to the process when:
- The process returns to the User Protection Domain.
- There is a transition from ready-to-run state to running state.
• To deliver a signal, the kernel checks whether the process is receiving
the signal.
• If the signal is being received, the kernel sets the receiving process to
perform the appropriate action.
• The appropriate action may be to invoke the signal handler for that
particular signal, kill the process, or ignore the signal.
• If the signal is blocked by the process, it is left pending until the process
is no longer blocking the signal.
• Signals can be delivered to a group of processes.
• Signals can be sent to a process or a thread.
• A thread receives the signal if:
- The signal is synchronous and attributable to a particular thread,
for example, SIGSEGV.
- The signal is sent by a thread in the same process via the
thread_kill system call.
• Otherwise, the signal goes to the process.
Signals -- continued
Signals to a process
• If a signal is not being caught, the signal action applies to the entire process.
- Every thread is terminated, stopped, or continued, depending on
the action.
• If a signal is being caught:
- Pick one thread that is not blocking the signal to receive it.
- If all threads are blocking the signal, it is left pending on the process.
Exercises
Exercises after In this exercise, the student will be supplied with programs that create
this module processes and threads using the available thread models. The programs
are very simple and the source will be supplied to the student. Kernel
debugging tools (running on a live kernel) are then used to interrogate the
kernel structures associated with the process and threads of the program.
The first code example explores the fork() system call and shows how
variables are private to each process. The second example shows how
threads are created and how global variables are shared because all
threads share the user address space, while local variables in functions
are not shared because they are kept on the stack to make each
procedure reentrant. The third example is a signal handler example.
Exercises -- continued
C code Use C code to create children with the fork() system call; notice that the
example to variable is private to each process.
explore fork()
and wait()
system calls
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int status;          /* filled in by wait() */
pid_t proc_id;
pid_t proc_id2;

int main(int argc, char **argv)
{
    int this = 7;
    proc_id = fork();
    /* error routine */
    if (proc_id < 0) {
        printf("fork error\n");
        exit(-1);
    }
    if (proc_id > 0) {               /* parent */
        this = this + 4;
        printf("waiting for child\n");
        proc_id2 = wait(&status);
        printf("I'm the parent, variable = %d\n", this);
        exit(0);
    }
    if (proc_id == 0) {              /* child */
        printf("I'm the child process\n");
        sleep(1);
        printf("I'm the child, the variable is %d\n", this);
        printf("I'm the child, terminating\n");
        exit(0);
    }
}
Exercises -- continued
C sample code The program explores process priority; the program runs long enough
to explore that there is time to look at the process table and the nice value with the
renice and ps command.
process priority
#include <stdlib.h>
#include <unistd.h>

int i, ii;
long ll;

long ll1(long v)     /* trivial worker; defined here so the program links */
{
    return v + 1;
}

int main(int argc, char *argv[])
{
    if (argc < 2)
        return 1;    /* expects the nice increment as argv[1] */
    i = atoi(argv[1]);
    ii = nice(i);
    ll = 1;
    for (i = 1; i < 5000; i++) {   /* raise the bound to keep the process alive longer */
        ll = ll1(ll);
        ll++;
    }
    return 0;
}
Exercises -- continued
C code The signal code sample catches signals and prints a message whenever
example to a signal is caught. What happens if the same signal is sent twice? And
explore signal
handling how can this behaviour be changed?
#include <stdio.h>
#include <signal.h>
#include <unistd.h>

int i;
void sig1(int), sig2(int), sig3(int);

int main(void)
{
    signal(SIGHUP, sig1);
    signal(SIGINT, sig2);
    signal(SIGQUIT, sig3);
    for (;;)            /* wait for signals until the process is killed */
        pause();
}
void sig1(int sig)
{
    printf("interrupt 1 received\n");
}
void sig2(int sig) {
    printf("interrupt 2 received\n");
}
void sig3(int sig) {
    printf("interrupt 3 received\n");
}
Objectives
After completing this unit, you should be able to describe the
common features of VMM on POWER and IA64:
• virtual memory
• page mapping
• memory objects
• VMM tuning parameters
• object types
• shared memory objects
References
Memory The virtual memory system divides real memory into fixed-length pages
management and allocates pages to a program as it requires them. Such a system
allows multiple programs to reside in memory and execute simultaneously.
The virtual memory system is responsible for keeping track of which pages
of a program are resident in memory and which are on secondary storage
(disk).
It handles interrupts from the address translation hardware in the system
to determine when pages must be retrieved from secondary storage and
placed in real memory.
When all of real memory is in use, it decides which program’s pages are to
be replaced and paged out to secondary storage.
Each time a process accesses a virtual address, the virtual address is
mapped (if not already mapped) by the VMM to the physical address
where the data is located.
Access The VMM also provides access protection to prevent illegal access to
Protection data. This protects programs from incorrectly accessing kernel memory or
memory belonging to other programs. Access protection also allows
programs to set up memory that may be shared between processes.
VMM on In this lesson the common features of VMM on POWER and IA64 are
POWER described. For the most part, the IA64 VMM design inherits the design of
as opposed to
IA-64 VMM the Power architecture. The majority of data structures, the serialization
model, and the majority of code are common between the two. Separate
lessons will describe the POWER- and IA64-specific VMM context.
Introduction The following terms relating to virtual memory concepts will be defined in
this section:
• Page
• Frame
• Address space
• Effective address
• Virtual memory
• Physical address
• Paging Space
Illustration Follow this diagram as you read about the virtual memory concepts.
[Diagram: the effective address spaces of Process 1 and Process 2 map
through the virtual address space to physical memory and to paging
space.]
Page A page is a fixed-size chunk of contiguous storage that is treated as the
basic entity transferred between memory and disk. Pages are kept
separate from each other; they do not overlap in the virtual address
space. AIX 5L uses a fixed page size of 4096 bytes for both Power and
IA64. The smallest unit of memory managed by hardware and software is
one page.
Frame The place in real memory used to hold a page is called a frame. You can
think of the page as the collection of information and the frame as the
place in memory that holds that information.
Address Space Address space is the set of addresses available to a program that it can
use to access memory. This lesson describes three types of address
space:
• Effective address space.
• Virtual address space.
• Physical address space.
Effective Effective addresses are the addresses referenced by the machine
Address instructions of a program or the kernel. The effective address space is the
range of addresses defined by the instruction set, 64 bits on AIX 5L. The
effective address space is mapped to a different physical address space
or disk files for each process. Programs/processes see one contiguous
address space.
Virtual The virtual address space is the set of all memory objects that could be
Address made addressable by the hardware. The virtual address is bigger (has
more address bits) than the effective address. Processes have access to a
limited range of virtual addresses given to them by the kernel.
Physical The physical address space depends on how much memory (memory
Address chips) is on the machine. The physical address space maps one-to-one
with the machine's hardware memory.
Paging space Paging space is a disk area used by the memory manager to hold inactive
memory pages with no other home. In AIX the paging space is mainly used
to hold pages from working storage (process data pages). If a memory
page is not in physical memory it may be loaded from disk; this is called a
page-in. Writing a modified page to disk is called a page-out.
Demand Paging
Introduction AIX is a demand paging system. Physical pages (frames) are not allocated
for virtual pages until they are needed (referenced).
• Data is copied to a physical page only when referenced.
• Paging is done on the fly and is invisible to the user.
• Data comes from:
• A page from the page space.
• A page from a file on disk.
When a virtual address is referenced on a page that has no mapping to a
frame, the mapping is done on the fly and the page frame is loaded from
where it is mapped. The loading is invisible to the user process. Demand
paging saves much of the overhead of creating new processes because
the pages for execution do not have to be loaded unless they are needed.
If a process never uses parts of its virtual space, valuable physical memory
is never spent on them.
Page Faults A page fault occurs when a program tries to access a page that is not
currently in real memory. Memory that has been recently used is kept in
real memory, while memory that has not been recently used is kept aside
in paging space.
For speed, most systems have the mapping of virtual addresses to real
addresses done in the hardware. This mapping is done on a page-by-page
basis. When the hardware finds that there is no mapping to real
memory, it raises a page fault condition. The operating system software
must handle these faults in such a way that the page fault is transparent to
the user program.
Virtual Memory The job of a virtual memory management system is to handle page faults
manager so that they are transparent to the thread using virtual memory addresses.
Pool of A pager daemon attempts to keep a pool of physical pages free. If the
Physical Free number of free pages goes below a low-water mark threshold, the
Pages
pager frees the oldest (referenced furthest back in time) pages until a
high-water mark threshold is reached.
Pageable AIX's kernel is pageable. Only some of the kernel is in physical memory at
Kernel one time. Kernel pages that are not currently being used can be paged
out.
Pinned Pages Some parts of the kernel are required to stay in memory because it is not
possible to perform a page-in while those pieces of code execute. These
pages are said to be pinned. The bottom halves of device drivers
(interrupt processing) are pinned. Only a small part of the kernel is
required to be pinned.
Memory Objects
Introduction A fundamental feature of AIX 5L’s Virtual Memory Manager is the use of
addressable memory objects.
POWER VMM The POWER architecture provides for efficient access to 256MB objects
Design against (segments in POWER terminology) in the global virtual address space.
IA-64 Design
The 256MB objects are also used in the IA-64 VMM implementation;
however, segments are implemented in software instead of hardware.
The terms "segment" and "object" have the same meaning, but keep in
mind that the term "segment" on IA64 should be understood in a software
context.
Working Working objects (also called working storage and working segments) are
Objects temporary segments used during the execution of a program for its stack
and data areas. Process data segments are created by the loader at run
time and are paged in and out of paging space. A working storage
segment keeps track of the amount of paging space allocated to the
pages in the segment. Part of the AIX kernel is also pageable and is part
of working storage.
Persistent The VMM is used for performing I/O operations for file systems. Persistent
Objects objects are used to hold file data for the local file systems. When a
process opens a file, the data pages are paged in. When the contents of
the file change, the page is marked as modified and is eventually paged
out directly to its original disk location. File system reads and writes occur
by attaching the appropriate file system object and performing
loads/stores between the mapped object and the user buffer. File data
pages and program text are both part of persistent storage; however, the
program text pages are read-only pages and are paged in but never
paged out to disk. Persistent pages do not use paging space.
Client Objects Client objects are used for pages of client file systems (all file system
types other than JFS). When remote pages are modified they are marked
and eventually paged out to their original disk location across the network.
Remote program text pages (read-only pages) are paged out to paging
space, from where they can be paged in later if needed.
Log Objects Log objects are used for writing or reading JFS file systems logs during
journalling operations.
Mapping Mapping objects are used to support the mmap() interfaces, which allow
Objects an application to map multiple objects to the same memory segment.
Page Mapping
Introduction This section describes the page mapping functions in the VMM.
VMM Function The main function of the virtual memory manager is to translate effective
addresses to real addresses.
Hardware The exact procedure used by the VMM depends heavily on the hardware
differences processor used by the system. As AIX 5L runs on both Power and IA-64
processors, this lesson describes the process in general terms. More
exact descriptions of address translation can be found in the hardware-
specific lessons.
Diagram This diagram shows the overall relationship among the major AIX data
structures involved in mapping a virtual page to a real page or to paging
space.
[Diagram: the software page frame table, the external page tables (XPT),
and the SID table, with links to paging space and to the file inode in the
filesystem.]
Software Page Software Page Frame Tables (SWPFT) are extensions of the hardware
Frame Table frame table and are used and managed by the VMM software. The
SWPFT contains information associated with a page, such as the page-in
and page-out flags, the free-list flag, and the block number. It also
contains the device information (PDT) used to obtain the proper page
from disk.
Page Faults Page faults occur when the hardware has looked through its page frame
tables but cannot find a real page mapping for a virtual page.
A page fault causes AIX Virtual Memory Manager (VMM) to do the bulk of
its work. It handles the fault by first verifying that the requested page is
valid. If the page is valid, the VMM determines the location of the page,
recovers the page if necessary, and updates the hardware's page frame
table with the location of the page. A faulted page will be recovered from
one of the following locations:
• In physical memory (but not in the hardware PFT).
• On a paging disk (working object)
• On a filesystem object (persistent object)
Protection A protection fault occurs when a page is in memory but the process has
Fault no right to access it.
Introduction The size of the hardware page tables is limited; therefore, the hardware
cannot satisfy all address translation requests. The VMM software must
supplement the hardware tables with software-managed page tables.
Procedure The procedure used for page fault handling when the page is not found in
the hardware-specific tables, but is in physical memory, consists of
several steps detailed in this illustration and the following table.
[Diagram: the virtual page number is looked up in the software page frame
table; the SID table, the external page tables (XPT), paging space, and the
file inode in the filesystem are also shown.]
Procedure (continued)
Note: these steps assume the memory page is in memory, just not in the
hardware page tables.
Step 1: A page fault is generated by the address translation hardware.
The page might be in real memory, just not in the hardware-specific table
due to its size limits.
Step 2: The AIX Virtual Memory Manager first verifies that the requested
page is valid. If the page is not valid, a kernel exception is generated.
Step 3: If the page is valid, the VMM starts looking through the software
PFT for the page. This processing almost duplicates the hardware
processing, but uses the software page tables. The software PFTs are
pinned.
Step 4: If the page is found:
• The hardware-specific table is updated with the real page number for
this page and the process resumes execution.
• No page-in of the page occurs.
It is important to remember that the dispatcher is not run. The faulting
thread just continues execution at the instruction that caused the fault.
PTEGs PowerPC processors hash the PFT into Page Table Entry Groups
(PTEGs), and these groups can hold only 16 page entries each.
Since there may be more than 16 pages that hash into one PTEG, the
VMM has to decide which ones are not in the PTEG. Then, when a page
fault occurs for one of these pages, the VMM only has to reload the PTEG
with the page in question, replacing some other page.
Introduction If the page was not found in real memory, the VMM determines whether it
is in paging space or elsewhere on disk. If the page is in paging space,
the disk block containing the page is located and the page is loaded into a
free memory frame.
Waiting for I/O Copying a page from paging space to an available frame is not a
synchronous process. Any process or thread waiting for a page fault to
be handled is put to sleep until the page is available.
Procedure The procedure for loading a page from paging space is shown in this
illustration and in the table that follows.
[Diagram: the virtual page number is resolved through the segment ID
table; the XPT address and page number lead to the external page tables
(XPT), which supply the disk block number in paging space. The software
page frame table and the file inode in the filesystem are also shown.]
Procedure (continued)
Step 1: The VMM looks up the object ID for this address in the Segment
ID table and gets the External Page Table (XPT) root pointer.
Step 2: The VMM finds the correct XPT direct block from the XPT root.
Step 3: The VMM gets the paging space disk block number from the XPT
direct block.
Step 4: The VMM takes the first available frame from the free frame list
(the free list contains one entry for each free frame of real memory).
Step 5: If the free frame list is empty, the VMM uses an algorithm to select
several active pages to steal.
• If a page to be stolen is modified, an I/O request is issued to write the
contents of the selected page to disk.
• Once written, the frames containing the stolen pages are added to the
free list, and one is selected to hold the page from paging space.
Step 6: The VMM indicates the device and logical block for the page. An
I/O request loads the frame with the data for the faulting page.
Step 7: When the I/O completes, the VMM is notified and the thread
waiting on the frame is awakened.
Step 8: The disk block is loaded from paging space or the file system.
Step 9: The hardware PFT is updated, and the process/thread resumes at
the faulting instruction.
The net effect is that the process or thread has no knowledge that a page
fault occurred, except for a delay in its processing.
External Page The XPT maps a page within a working storage segment to a disk block
Table (XPT) on external storage. The XPT is a two-level tree structure.
The first level of the tree is the XPT root block. The second level consists
of 256 direct blocks. Each word in the root block is a pointer to one of the
direct blocks. Each word of a direct block contains the page state and disk
block information for a single page in the segment.
Each XPT direct block covers 1MB of the 256MB segment.
[Diagram: the pages of a segment mapped to disk blocks in paging
space.]
Paging Space AIX offers two policies for allocating paging space. If the environment
Allocation variable PSALLOC=early, the early allocation policy is used, which
Policy
causes a disk block to be allocated whenever a memory request is
made. This guarantees that the paging space will be available if it is
needed.
If the environment variable is not set, the default late allocation
policy is used and a disk block is not allocated until it becomes necessary
to page out the page. This policy decreases paging space requirements on
large-memory systems which do little paging.
Paging Device The Paging Device Table (PDT) contains an entry for every device
Table (PDT) referenced by the VMM.
It is used for filesystem, paging, log and remote pages.
There is a pending I/O list associated with PDT.
The pending I/O list contains all page frames awaiting I/O for the device.
Page frames are removed from the list as soon as the I/O has been
dispatched to the device.
Introduction Persistent pages do not use the XPT (eXternal Page Table). The VMM
uses the information contained in the file's inode structure to locate the
pages for the file.
Procedure Persistent pages are paged from local files located on a filesystem. A local
file has a segment allocated and has an entry (SID) in the Segment
Information Table. The inode is pointed to by the SID entry, allowing the
VMM to find and page in the faulting block.
[Diagram: for a persistent page the virtual page number is resolved
through the segment ID table to the inode address; the file inode in the
filesystem supplies the disk block number. The external page tables (XPT)
and paging space are not used.]
Filesystem I/O
Introduction The paging functions of the VMM are also used to perform reads and
writes to files by processes.
File system File system reads and writes occur by attaching the appropriate file system
objects object and performing loads/stores between the mapped object and the
user buffer. This means that file objects are not directly addressable in the
current address space but instead are temporarily attached.
A local file has a segment allocated and has an entry (SID) in the Segment
Information Table. The file's gnode records which segment belongs to the
particular file.
Persistent AIX uses a large portion of memory as the filesystem buffer cache. The
pages pages for files compete for storage the same way as other pages. The
VMM schedules modified persistent pages to be written to their original
location on disk when:
• The VMM needs the frame for another page
• The file is closed
• A sync operation is performed
The sync operation can be performed by the syncd daemon running on the
system (by default the syncd daemon runs every 60 seconds), by calling
the sync() function, or by running the sync command. Scheduling does not
mean that the data are written to disk at once.
Introduction To maintain system performance, the VMM always wants some physical
memory to be available for page-ins. This section describes the free
memory list and the algorithms used to keep pages on the list.
Free memory The VMM maintains a linked list containing all the currently free real
list memory pages in the system. When a page fault occurs, VMM just takes
the first page from this list to assign to the faulting page. When the free
frame list is empty and a page fault occurs, VMM selects several active
pages to be stolen (usually around 20 or so), and all these pages are then
added to the free list. This reduces the amount of time spent starting and
running the steal routines.
Page The method used to select a page to be replaced is called the page
Replacement replacement algorithm. The mechanism used to determine which pages
Algorithm
to steal is a pseudo-LRU (Least Recently Used) algorithm called the
clock-hand algorithm. This algorithm is commonly used in operating
systems when the hardware provides only a reference bit for each page in
physical memory. The hardware automatically sets the reference bit for a
page translation whenever the page is referenced. The clock-hand
algorithm checks frames by frame number, looking for pages that have not
been referenced since the last time the algorithm looked at the page. If a
page has been referenced since the last time the algorithm looked at the
frame, the algorithm clears the reference bit and goes on to look at the
next frame. If the page has not been referenced since the last time the
algorithm looked at the frame, the page is stolen.
Clock Hand The algorithm is called the clock-hand algorithm because the algorithm
acts like a clock hand that is constantly pointing at frames in order. The
clock-hand advances whenever the algorithm advances to the next frame.
If a modified page is stolen, the clock-hand algorithm writes the page to
disk (to paging space or a file system) before stealing the page.
[Diagram: the clock hand rotates over the frames in order; frames with
Reference = 0 are eligible to be stolen, while a frame with Reference = 1
has its bit cleared and is passed over.]
vmtune
Introduction A certain number of pages of each type must remain in memory to
maintain system performance. The VMM keeps statistics for each page
type and enforces thresholds in the page replacement algorithm. When
the number of pages of a type approaches a threshold, the page
replacement algorithm selects the proper pages for replacement and
favors the other pages. The VMM takes appropriate action to bring the
state of memory back within bounds.
VMM Tunable The vmtune command changes operational parameters of the Virtual
Parameters Memory Manager controlling the thresholds.
Parameter Description
minfree Page replacement is invoked whenever the number of free page
frames falls below this threshold.
maxfree The page replacement algorithm replaces enough pages so that
this number of frames are free when it completes.
LruBucket Specifies the size (in 4K pages) of the least recently used (lru)
page-replacement bucket size. This is the number of page frames
which will be examined at one time for possible page-outs when a
free frame is needed. A lower number will result in lower latency
when looking for a free frame, but will also result in behavior that is
not as much like a true lru algorithm.
MaxPin Specifies the maximum percentage of real memory that can be
pinned. The default value is 80. If this value is changed, the new
value should ensure that at least 4MB of real memory will be left
unpinned for use by the kernel.
minperm Specifies the point below which file pages are protected from the
re-page algorithm. This value is a percentage of the total real-
memory page frames in the system. The specified value must be
greater than or equal to 5.
MaxPerm Specifies the point above which the page stealing algorithm steals
only file pages. This value is expressed as a percentage of the total
real-memory page frames in the system. The specified value must
be greater than or equal to 5.
MinPgAhead Specifies the number of pages with which sequential read-ahead
starts. This value can range from 0 through 4096. It should be a
power of 2.
MaxPgAhead Specifies the maximum number of pages to be read ahead. This
value can range from 0 through 4096. It should be a power of 2 and
should be greater than or equal to MinPgAhead.
NpsWarn Specifies the number of free paging-space pages at which the
operating system begins sending the SIGDANGER signal to
processes. The default value is 512.
Introduction Not all page and protection faults can be handled by the OS. When a
fault occurs that cannot be handled by the OS, the system will panic and
immediately halt.
Fatal memory In all of the following cases, the VMM bypasses all kernel exception
exceptions handlers and immediately halts the system:
• A page fault occurs in the interrupt environment.
• A page fault occurs with interrupts partially disabled.
• A protection fault occurs while in kernel mode on kernel data.
• The system is out of paging space, or an I/O error occurs on kernel
data.
• An instruction storage exception occurs while in kernel mode.
• A memory exception occurs while in kernel mode without an exception
handler set up.
Introduction Each segment has a unique segment ID in the segment table. There are a
number of important segment types in AIX:
• kernel
• user text
• shared library text
• shared data
• process private
• shared library data
Kernel This segment is described separately for Power and IA-64 in their lessons.
segment
User text The user text segment contains the code of the program. Threads in user
mode have read-only access to the text segment to prevent modification
while the program is running. This protection allows a single copy of a
text segment to be shared by all processes associated with the same
program. For example, if two threads in the system are running the
ls command, the instructions of ls are shared between them.
Shared Library The shared library text segment contains mappings whose addresses are
Text common across all processes. A shared library segment:
• Contains a copy of the program text (instructions) for the shared
libraries currently in use in the system.
• These segments are added to the user address space by the loader
when the first shared library is loaded.
• Each process using text from this segment has a copy of the
corresponding data in the per- process shared library data segment.
Executable modules list the shared libraries they need at exec() time.
The shared library text is loaded into this segment when a module is
loaded via the exec() system call, or a program may issue load() calls
to get additional shared modules.
Per-Process Functions in the shared library may have data that cannot be shared
Shared Library between processes; such data is loaded as process-private data.
Data Segment
• This segment holds items required by modules in the shared text
segment(s).
• There is one of these segments for each process.
• Addresses of data items are generally the same across processes.
• The data itself is not shared.
The shared library data segment acts like an extension of the process
private segment.
Shared data Mapped memory regions, also called shared memory areas, can serve as
a large pool for exchanging data among processes.
Process Process Private Segment is not shared between other processes. The
private process private segment contains:
• user data (for 32-bit programs that aren’t maxdata programs)
• the user stack (for 32-bit programs)
• text and data from explicitly loaded modules (for 32-bit programs)
• kernel per-process data (accessible only in kernel mode)
• primary kernel thread stack (accessible only in kernel mode)
• per-process loader data (accessible only in kernel mode)
Introduction Mapped memory regions, also called shared memory areas, can serve as
a large pool for exchanging data among processes.
• A process can create and/or attach a shared data segment that is
accessible by other processes.
• A shared data segment can represent a single memory object or a
collection of memory objects.
• Shared memory can be attached read-only or read-write.
Benefit Shared memory areas can be most beneficial when the amount of data to
be exchanged between processes is too large to transfer with messages,
or when many processes maintain a common large database.
Shared The shared memory is process based and can be attached at different
memory effective addresses in different processes.
address
[Diagram: process A and process B attach the same real memory through
the VMM at different effective addresses in their effective address
spaces.]
Introduction The shmat services are typically used to create and use shared memory
objects from a program.
shmat Your program can use the following functions to create and manage
functions shared memory segments:
• shmctl() - Controls shared memory operations
• shmget() - Gets or creates a shared memory segment
• shmat() - Attaches a shared memory segment to a process
• shmdt() - Detaches a shared memory segment from a process
• disclaim() - Removes a mapping from a specified address range
within a shared memory segment
Using shmat The shmget() system call is used to create a shared memory region;
when supporting objects larger than 256MB, it creates multiple
segments.
The shmat() system call is used to gain addressability to a shared
memory region.
Limitations Right now shmget() on the 64-bit kernel is limited to 8 segments, even for
64-bit applications. Thus, the largest shared memory region that one can
create is 2GB. This limitation will be removed when a 64-bit application
performs the shmget(); there will then be no explicit limitation other than
what system resources will bear. 32-bit applications will still retain the
2GB limitation.
When to use Use the shmat() services under the following circumstances:
Introduction Shared segments can be used to map any ordinary file directly into
memory.
• Instead of reading and writing the file, the program just loads or
stores in the segment.
• This avoids buffering of the I/O data in the kernel.
• This provides easy random access, as the file data is always available.
• This avoids the system call overhead of read() and write().
• Either shmat() or mmap() system calls can be used
File mapping The system allows file mapping at the user level. This allows a program to
access file data through loads and stores to its virtual address space. This
single level store approach can also greatly improve performance by
creating a form of Direct Memory Access (DMA) file access. Instead of
buffering the data in the kernel and copying the data from kernel to user,
the file data is mapped directly into the user’s address space.
Shared files The file can be shared between multiple processes even if some are
using mapping and others are using the read/write system call interface.
Of course, this may require some sort of synchronization scheme between
the processes.
shmat to map When using shmat to map a file, an open file descriptor is used in
files place of the shared memory ID. Once the file segment is mapped, it is treated
like any other shared segment and can be shared with other processes.
mmap services The mmap services are typically used for mapping files, although they may
be used for creating shared memory segments as well.
• madvise() - Advises the system of a process' expected paging
behavior
• mincore() - Determines residency of memory pages
• mmap() - Maps an object file into virtual memory
• mprotect() - Modifies the access protections of memory mapping
• msync() - Synchronizes a mapped file with its underlying storage
device
• munmap() - Un-maps a mapped memory region
Both the mmap and shmat services provide the capability for multiple
processes to map the same region of an object such that they share
address ability to that object. However, the mmap subroutine extends this
capability beyond that provided by the shmat subroutine by allowing a
relatively unlimited number of such mappings to be established.
Read-Write Read-write mapping allows loads and stores in the segment to behave
Mapping like reads and writes to the corresponding file. If a thread loads beyond
the end of the file, the load will load zero values.
Read-only Read only mapping allows only loads from the segment. The operating
Mapping system generates a SIGSEGV signal if a program attempts an access that
exceeds the access permission given to a memory region. Just as with
read-write access, a thread that loads beyond the end of the file loads zero
values.
Deferred Deferred update mapping also allows loads and stores to the segment to
Update behave like reads and writes to the corresponding file. The difference
Mapping
between this mapping and read-write mapping is that the modifications are
delayed. Any storing into the segment modifies the segment but does not
modify the corresponding file.
With deferred update, the application can begin modifying the file data (by
memory mapped loads and stores) and then either commit the
modifications to the file system (via fsync()) or discard the modifications
completely. This can greatly simplify error recovery and allows the
application to avoid a costly temporary file that might otherwise be required.
Data written to a file that a process has opened for deferred update (with
the O_DEFER flag) is not written to permanent storage until another
process issues an fsync() subroutine against this file or runs a
synchronous write subroutine (with the O_SYNC flag) on this file.
Objectives
After completing this unit, you should be able to:
• List the size of the effective and virtual address space on the IA64
platform.
• Show how regions, region registers, and region IDs are used in AIX
5L.
• Name the region register that is used to identify a process's
private region.
• Given an address, identify the region it belongs to.
References
• Intel IA-64 Architecture Software Developer's Manual
Introduction AIX 5L on the IA-64 platform is designed as a 64-bit kernel. Unlike the
POWER version of AIX 5L, no 32-bit kernel is available. This lesson
describes the address translation mechanism used by AIX 5L on the IA64
platform.
Overview The IA-64 platform provides an effective address space that is 64 bits
wide.
• The effective address space is divided into eight regions.
• Each region has a region register associated with it (rr0 - rr7).
• The region registers, under control of the OS, supply an additional 24
bits of addressing, creating an 85-bit virtual address space.
Regions
Introduction The 64-bit effective address is broken into 8 regions. This section
describes how the regions are addressed.
Region The 64-bit effective address space consists of 8 regions, each region
selector addressed by 61 bits. A region is selected by the upper 3 bits of the
effective address. Each region has a region register associated with it
(rr0 - rr7) that contains a 24-bit Region IDentifier for the region. When
translating effective addresses to virtual addresses, the 24-bit region
identifier is combined with the lower 61 bits of the effective address to
form an 85-bit virtual address.
(Diagram: bits 63-61 of the effective address select the region; bits 60-0
are the 61-bit offset. The selected region register supplies a 24-bit region
ID, so 2^61 * 2^24 = 2^85 bytes of virtual address space.)
Managing The AIX 5L operating system manages the contents of the region registers.
region An address space is made accessible to a process by loading the
registers
proper RID into one of the eight region registers.
Region Registers
Introduction Each region register contains a Region IDentifier (RID) and region
attributes.
Region Register
field description
rv    Reserved
ve    VHPT walker enable:
      1 - VHPT walker is enabled for the region
      0 - VHPT walker is disabled for the region
ps    Preferred page size. Selects the virtual address bits used by the
      hash function for the TLB or VHPT.
rid   24-bit region identifier
Address Translation
Introduction The VMM software in AIX 5L works closely with the hardware to translate
an effective address to an address in physical memory.
VMM hardware This diagram and the table on the next page describe the hardware
components and the process used to perform address translations.
(Diagram: the 3-bit VRN of the effective address selects one of the region
registers rr0 - rr7; the 24-bit region ID it supplies is combined with the
virtual page number to form the virtual address used to search the
TLB/VHPT. A successful match yields the physical page number, which is
concatenated with the page offset to form the physical address.)
Step Action
1 The effective address contains three parts:
• Virtual Region Number (VRN)
• Virtual Page Number (VPN)
• Page Offset
2 The 3 VRN bits are used to select a region register.
3 The region register provides a 24-bit region ID.
4 The region ID and the virtual page number are used to
search for an address translation in the TLB or the
hardware-maintained page tables.
5 If no match is found, a page fault is generated, transferring
control to the OS. The OS must resolve the fault by making
a page available and updating the translation tables.
6 A successful translation produces a physical page number.
This page number is combined with the page offset to
produce a physical address.
32-bit Address 32-bit address translation is done the same way as 64-bit translation.
Translation Unlike POWER, there is no bit in the processor hardware indicating
whether the hardware is working in 32-bit or 64-bit mode.
Introduction The IA-64 model provides for either a single or a multiple address
space model. These models are described in this section.
Single In a single address space model, all processes on the system share a single
Address Space address space. Such a model is possible due to the enormous size of a
(SAS)
64-bit address space as opposed to a 32-bit one. The term single address
space refers to the use of shared regions containing objects mapped at a
unique global address. For such a mapping, a common region ID and page
number are provided.
Multiple In this model each process has a private address space. Not all of the 8
Address Space regions can be used by a process, because the operating system must be
(MAS)
mapped on top of one or more of the regions. Each process private
region has a unique RID associated with it.
Address Space The address space model used by AIX on IA-64 combines attributes of
on IA-64 both MAS (multiple address space) and SAS (single address space).
Region 0 is defined by the operating system to be a process private region.
Each process is assigned a unique RID for that region, which is loaded into
the region register each time the process is dispatched. Therefore region 0
provides what is effectively a MAS model.
All other regions are treated as shared address space (SAS); as such, the
region IDs for those regions are constant and don't need to be changed at
context switch. SAS usage is necessary to achieve the desired degree of
sharing of address translations for shared objects: to achieve a single
translation for an object, all accesses must be made through a common
global address.
Introduction The region identifier (RID), much like the POWER segment identifier (SID),
participates in the hardware address translation such that, in order to share
the same address translation, the same RID must be used. For a process
to share a memory region with another process (or the kernel), the same
RID must be loaded in the region register in both processes' contexts.
Region Usage The following table shows the kernel usage model for the 8 virtual regions
Table
VRN Style   Name    Usage
0   MAS     Private Process data, stack, heap, mmap, ILP32 shared
                    library text, ILP32 main text, u-block, kernel
                    thread stacks/msts
1   SAS/MAS Text    LP64 shared library text, LP64 main text
2   SAS             LP64 shmat
3   SAS             LP64 shmat
4   n/a             Reserved
5   SAS     Temp    Kernel temporary attach, global buffer pool
6   SAS     Kernel2 Kernel global w/large page size
7   SAS     Kernel  Kernel global
ILP32 The address space of a 32-bit program (using the ILP32 instruction set)
runs from 0 to 4GB and is solely contained in region 0.
Private Providing process data, heap, and stack as well as per-process kernel
segment information such as the u-block in a single private segment means that only
that segment needs to be copied across fork (e.g., copy-on-write
semantics).
Memory Protection
Introduction The IA-64 architecture provides two methods for applying protection to a
page:
• Access rights for each translation.
• Protection keys
Protection Protection keys are used to control which processes have access to
Keys individual objects in the single address space to achieve a “shared-by-
some” semantic, such as exists for shmat objects.
There is a special bit in hardware; when this bit is turned on (1),
memory references go through protection key access checks during
address translation.
There are also protection key registers (at least 16); the VMM manages
them and keeps track of which keys each entry holds.
field usage
v    Valid bit. When 1, the register contains a valid key.
wd   Write disable. When 1, write permission is denied.
rd   Read disable. When 1, read permission is denied.
xd   Execute disable. When 1, execute permission is denied.
key  Protection key (18-24 bits)
Process The process of memory access using protection keys is described in this
table.
Step Action
1 During an address translation by the hardware a protection
key is identified for the page being translated.
2 The protection key of the translation is checked against
protection keys found in protection key registers (stored by
the OS).
3 If the match succeeds then protection rights are applied to
the translation. The access can be allowed or not allowed
based on the protection key value.
4 If the access is not allowed, then the protection key
permission fault is raised and control goes to VMM.
5 If no match is found (in step 2), a protection key miss fault is
raised and the VMM inserts the correct protection key into the
protection key registers.
Protection Key An example of protection key usage is described in this illustration and
Example table.
(Diagram: processes A and B both map the same shared object at a
single global address in their virtual address spaces.)
Step Action
1 A shared object is assigned the protection key 0x1.
2 Processes A and B share the object with the following
permissions:
• Process A has read/write access to the object.
• Process B has read-only access to the object.
3 When A is running, the VMM loads the protection key register
with 0x1 and the 'wd' and 'rd' bits cleared. The process can
read and write all pages in the object.
4 When B is running, the VMM loads the protection key register
with 0x1, the 'rd' bit cleared, and the 'wd' bit set. The process
can only read pages in the object.
Access Rights In addition to the protection key mechanism, the IA-64 architecture
provides page protection by associating access and privilege level
information with each translation. However, the majority of the page access
rights support in AIX 5L is in the common code base shared with POWER.
Therefore the software mechanisms for dealing with page protection were
all left as is, so that the upper layers conform to the POWER access rights
mechanisms. These consist of:
• per segment K bits
• POWER style per-page protection bits.
At the low, platform-dependent layer, these POWER-style protections are
translated to the IA-64 hardware format.
Introduction Segments and segment services are used for management of objects both
on POWER and IA64.
Segments on The segment model was originally developed with the POWER hardware
IA64 architecture in mind. A segment can be thought of as a hardware object
on POWER: selection of the segment is made directly by the hardware's
translation of a virtual address. As we have seen, the IA64 hardware
addresses memory by regions. A region is a much larger area of the
virtual address space than a segment. On IA64 the software manages
segments on top of the region model; therefore, on IA64 a segment is a
software object, not a hardware one.
Introduction The layout of the 4GB ILP32 address space is principally the same as that
for POWER 32-bit applications. The motivations for preserving this layout
for IA64 are compatibility and performance.
This table details the segment usage for the ILP32 model:
Big Data Model A big data model is supported for 32-bit applications on POWER. This
allows an application to specify maximum requirements for heap, data, and
stack. Such a model is required for programs which exceed the limits
imposed by the normal 32-bit address space (i.e., a shared 256MB
segment for heap, data, and stack). This model will also be supported on
IA64 for 32-bit applications in future releases.
Exercise
Introduction Complete the following written exercise and the lab exercise on the
following page.
1. What is the size of the effective address space on the IA64 platform?
A. 32 bits
B. 64 bits
C. 84 bits
2. What is the size of the virtual address space on the IA64 platform?
A. 32 bits
B. 64 bits
C. 84 bits
3. One of eight region registers is used for each address translation. How
is the region register selected?
Exercise -- continued
Step Action
1 Log on to your IA64 lab system.
2 su to root and start the iadb utility.
$ su
# iadb
3 Display the thread structure for the current context using the
command:
0> th
5 Look for the field labeled userp; this contains a pointer to
the thread's user area. Examine this address. What region
is this address in?
Lesson Objectives
At the end of the module the student should have gained knowledge
about:
Have an overview of the LVM, and identify LVM components such as:
• Logical volume
• Physical volume
• Mirroring, and parameters for mirroring
• Striping and parameters for striping
Physical disk layout Power
Physical disk layout IA-64
LVM Physical layout including VGDA and VGSA
Know the function of LVM Passive Mirror Write Consistency
Know the function of LVM Hot spare disk
Know the function of LVM Hot spot management
Know the function of LVM Online backup (4.3.3.)
Know the function of LVM Variable logical track group (LTG)
Know the function of each of the High-Level LVM commands
Trace LVM commands with the trace command
Know the function of LVM Library calls
Know briefly about Disk Device Calls
Know briefly about Disk low level Device Calls such as SCSI calls and
SSA
Furthermore it is an objective that the student gains experience from
exercises with the content of this section. The exercises will:
• Examine the physical disk layout of a logical volume and a physical
volume.
• Examine the impact of LVM Passive Mirror Write Consistency
• Examine the function of LVM LTG
• Trace some LVM system activity.
Platform
This lesson is independent of platform.
References
http://w3.austin.ibm.com/:/projects/tteduc/ Technology Transfer Home Page
Introduction The Logical Volume Manager (LVM) is the layer between the operating
system (AIX) and the physical hard drives; the LVM provides reliable data
storage (logical volumes) to the OS. The LVM makes use of the underlying
physical storage, but hides the actual physical drives and drive layout. This
section explains how this is done, how the data can be traced, and which
parameters impact the performance in different scenarios.
Physical disks A disk must be designated as a physical volume and be put into an
available state before AIX can assign it to a volume group. A physical
volume has certain configuration and identification information written on it.
This information includes a physical volume identifier and, for IA-64,
partition information for the disk. When a disk becomes a physical volume,
it is divided into 512-byte physical blocks.
The first time you start up the system after connecting a new disk, AIX
detects the disk and examines it to see if it already has a unique physical
volume identifier in its boot record. If it does, the disk is designated as a
physical volume and a physical volume name (typically, hdiskx where x is a
unique number on the system) is permanently associated with that disk
until you undefine it.
Volume groups The physical volume must become part of a volume group before it can be
utilized by LVM. A volume group is a collection of 1 to 32 physical volumes
of varying sizes and types. A physical volume may belong to only one
volume group. The system will by default allow you to define up to 256
logical volumes per volume group, but the actual number you can define
depends on the total amount of physical storage defined for that volume
group and the size of the logical volumes you define.
There can be up to 255 volume groups per system.
A VG that is created with standard physical and logical volume limits can
be converted to big format which can hold up to 128 PVs and up to 512
more LVs. This operation requires that there be enough free partitions on
every PV in the VG for the Volume group descriptor area (VGDA)
expansion.
MAXPVS: 32 (128 big VG) MAXLVS: 255 (512 big VG)
Physical In the design of LVM, each logical partition maps to one physical partition.
partitions PP And, each physical partition maps to a number of disk sectors. The design
of LVM limits the number of Physical Partitions that LVM can track per disk
to 1016. In most cases, not all the possible 1016 tracking partitions are
used by a disk. The default size of each physical partition during a "mkvg"
command is 4 MB, which implies that individual disks up to 4 GB can be
included in a volume group.
If a disk larger than 4 GB is added to a volume group (based on usage of
the 4 MB size for physical partitions), the disk addition will fail with a
warning message that the physical partition size needs to be increased. There are
two instances where this limitation will be enforced. The first case is when
the user tries to use "mkvg" to create a volume group where the number of
physical partitions on one of the disks in the volume group would exceed
1016. In this case, the user must pick from the
available physical partition size ranges of 1, 2, (4), 8, 16, 32, 64, 128, and
256 megabytes and use the "-s" option to "mkvg". The second case is
where the disk which violates the 1016 limitation is attempting to join a pre-
existing volume group with the "extendvg" command. The user can either
recreate the volume group with a larger physical partition size (which will
allow the new disk to work with the 1016 limitation) or the user can create a
stand-alone volume group (consisting of a larger physical partition size) for
the new disks.
Device driver The figure shows the interfaces to the LVM at different layers. Starting from
hierarchy and the top: the file system (JFS or J2) uses the LVM DD interface to access
interface to
LVM devices LVs; the LVM DD uses the disk DD to access the physical disk, which is
handled by the SCSI DD or the SSA DD depending on the type of disk. There
are also interfaces and commands to manipulate the LVM system: the
high-level commands, such as mklv, are complex commands written as shell
scripts. These scripts use basic LVM commands, such as lcreatelv, which are
AIX binaries that perform the operations. The basic commands are written in
C and use the LVM API liblvm.a to access the LVM.
(Diagram: JFS → LVM DD → disk DD → SCSI DD or SSA DD; high-level
commands → basic commands → liblvm.a → LVM DD.)
VGDA The VGDA is an area at the front of each disk which contains information
description about the volume group, the logical volumes that reside on the volume
group and disks that make up the volume group. For each disk in a volume
group, there exists a VGDA concerning that volume group. This VGDA
area is also used in quorum voting.
The VGDA contains information about what other disks make up the
volume group. This information is what allows the user to just specify one
of the disks in the volume group when they are using the "importvg"
command to import a volume group into an AIX system. The importvg will
go to that disk, read the VGDA and find out what other disks (by PVID)
make up the volume group and automatically import those disks into the
system. The information about neighboring disks can sometimes be useful
in data recovery. For the logical volumes that exist on that disk, the VGDA
gives information about that logical volume so anytime some change is
done to the status of the logical volume (creation, extension, or deletion),
then the VGDA on that disk and the others in the volume group
must be updated.
The VGDA space, which allows for 32 disks, is a fixed size which is part of
the LVM design. Large disks require more management mapping space in
the VGDA, which reduces the number and size of disks that can be added
to the existing volume group. When a disk is added to a volume group, not
only does the new disk get a copy of the updated VGDA, but as mentioned
before, all existing drives in the volume group must be able to accept the
new, updated VGDA.
VGSA The Volume Group Status Area (VGSA) records information on stale
description partitions for mirroring.
The VGSA is comprised of 127 bytes; each bit in those bytes represents
one of the up to 1016 physical partitions that reside on each disk. The bits
of the VGSA are used as a quick bit-mask to determine which physical
partitions, if any, have become stale. This is only important in the case of
mirroring, where there exists more than one copy of the physical partition.
Stale partitions are flagged by the VGSA. Unlike the VGDA, the VGSAs
are specific only to the drives on which they exist. They do not contain
information about the status of partitions on other drives in the same
volume group. The VGSA is also used to determine which physical
partitions must undergo data resyncing when mirror copy resolution is
performed.
BIG VGDA The original design of the VGDA and VGSA limits the number of disks that
Volume Group can be added to a volume group to 32, and the total number of logical
Design
(BigVG) volumes to 256 (including one reserved for LVM internal use). With the
implemented proliferation of disk arrays, the need for increased capacity in a single
in AIX 4.3.2 volume group is growing.
This section describes the requirements for the new big Volume Group
Descriptor Area and Volume Group Status Area, hereafter referred to as
the VGDA and VGSA.
Objectives
• Increase maximum number of disk per VG from 32 to 128
• Increase maximum number of logical volumes per VG to 512
• Provide migration path for small VG to big VG
Changes in commands:
• mkvg
• -B option is added to create big VGs.
• -t If the t flag (factor value) is not used, the default limit of
1016 physical partitions per physical volume will be set. Using
the factor value will change the physical partitions per disk to 1016 *
factor and the total number of disks per VG to 64/factor. A BigVG cannot
be imported/activated on systems with pre-AIX 4.3.2 versions.
• chvg
• -B option added to convert a small VG to the bigVG format. The -B flag
can be used to convert a small VG to the bigVG format. This operation
will expand the VGDA/VGSA to change the total number of disks that
can be added to the volume group from 32 to 64. Once converted,
these volume groups cannot be imported/activated on systems
running pre-AIX 4.3.2 versions. If both the t and B flags are specified,
the factor will be updated first and then the VG is converted to bigVG
format (sequential operation).
LVM Flexibility LVM offers great flexibility for the system administrator and users, such as:
• Real-time Volume Group and Logical Volume expansion/deletion
• Ability to customize data integrity check
• Use of Logical Volume under file system
• Use of Logical Volume as raw data storage
• User customized logical volumes
Real-time Typical UNIX operating systems have static file systems that require the
Volume Group archiving, deletion, and recreation of larger file systems in order for an
and Logical
Volume existing file system to expand. LVM allows the user to add disks to the
expansion / system without bringing the system down and allows the real-time
deletion expansion of the file system through the use of the logical volume. All file
systems exist on top of logical volumes. However, logical volumes can
exist without the presence of a file system. When a file system is created,
the system first creates a logical volume, then places the journaled file
system (jfs) "layer" on top of that logical volume. When a file system is
expanded, the logical volume associated with that file system is first
"grown", then the jfs is "stretched" to match the grown logical volume.
Ability to The user has the ability to control which levels of data integrity checks are
customize data placed in the LVM code in order to tune the system performance. The user
integrity
checks can change the mirror write consistency check, create mirroring, and
change the requirement for quorum in a volume group.
Use of Logical The logical volume is a logical to physical entity which allows the mapping
Volume under of data. The jfs maps files defined in its file system in its own logical way
a file system
and then translates file actions to a logical request. This logical request is
sent to the LVM device driver which converts this logical request into a
physical request. When the LVM device driver sends this physical request
to the disk device driver, it is further translated into another physical
mapping. At this level, LVM does not care about where the data is truly
located on the disk platter. But with this logical to physical abstraction, LVM
provides for the easy expansion of a file system, ease in mirroring data for
a file system, and the performance improvement of file access in certain
LVM configurations.
Use of Logical As stated before, the logical volume can run without the existence of the jfs
Volumes as file system to hold data. Typically, database programs use the "raw" logical
raw data
storage volume as a data "device" or "disk". They use the LVM logical volumes
(rather than the raw disk itself) because LVM allows them to control on
which disks the data resides, allows the flexibility to add disks and "grow" the
logical volume, and gives data integrity with the mirroring of the data via
the logical volume mirroring capability.
User The user can create logical volumes, using a map file, that will allow them
customized to specify the exact disk(s) the logical volume will inhabit and the exact
logical
volumes order on the disk(s) that the logical volume will be created in. This ability
allows the user to tune the creation of their logical volumes for
performance cases.
Write Verify There is a capability in LVM to specify that an extra level of data
LVM setting integrity be assured every time you write data to the disk. This ability is
known as write verify. This capability is given to each logical volume in a
volume group. When you have write verify enabled, every write to a
physical portion of a disk that’s part of a logical volume causes the disk
device driver to issue the Write and Verify SCSI command to the disk. This
means that after each write, the disk will reread the data and do an IOCC
parity check on the data to see if what the platter wrote exactly matched
what the write request buffer contained. This type of extra check
understandably adds more time to the completion length of a write request,
but it adds to the integrity of the system.
Quorum Quorum checking is the voting that goes on between disks in a volume
checking for group to see if a majority of disks exist to form a quorum that will allow the
LVM volume
groups disks in a volume group to become and stay activated. LVM runs many of
its commands and strategies based on having the most current copy of
some data. Thus, it needs a method to compare data on two or more disks
and figure out which one contains the most current information. This gives
rise to the need for a quorum. If a quorum cannot be established
during a varyonvg command, the volume group will not vary on.
Additionally, if a disk dies during normal operation and the loss of the disk
causes volume group quorum to be lost, then the volume group will notify
the user that it is ceasing to allow any more disk i/o to the remaining disks
and enforces this by performing a self varyoffvg. However, the user can
turn off this quorum check and its actions by telling LVM that it always
wants to varyon or stay up regardless of the dependability of the system.
Or, the user can force the varyon of a volume group that doesn’t have
quorum. At this point, the user is responsible for any strange behavior from
that volume group.
Mirroring, and When discussing mirrors in LVM, it is easier to refer to each copy,
parameters for regardless of when it was created, as a copy. The exception to this is when
mirroring
one discusses sequential mirroring. In sequential mirroring, there is a
distinct PRIMARY copy and SECONDARY copies. However, the majority
of mirrors created on AIX systems are of the parallel type. In parallel
mode, there is no PRIMARY or SECONDARY mirror; all copies in a
mirrored set are just referred to as copies, regardless of which one was
created first. Since the user can remove any copy from any disk, at any
time, there can be no ordering of copies.
AIX allows up to three copies of a logical volume and the copies may be in
sequential or parallel arrangements. Mirrors improve the data integrity of a
system by providing more than one source of identical data. With multiple
copies of a logical volume, if one copy cannot provide the data, one or two
secondary copies may be accessed to provide the desired data.
Sequential Sequential vs. parallel mirroring: what good is sequential mirroring?
Mirroring
Parallel In Parallel mirroring, all copies are of equal ordering. Thus, when a read
Mirroring request arrives to the LVM, there is no first or favorite copy that is
accessed for the read. A search is done on the request queues for the
drives which contain the mirror physical partition that is required. The drive
that has the fewest requests is picked as the disk drive which will service
the read request. On write requests, the LVM driver will broadcast to all
drives which have a copy of the physical partition that needs updating.
Only when all write requests return will the write be considered complete
and the write-complete message will be returned to the calling program.
(Diagram: a parallel mirrored write broadcasts a write request to every
drive holding a copy of the partition; the write completes only when all
drives have returned a write acknowledgment.)
Mirror Write Mirror Write Consistency Check (MWCC) is a method of tracking the last
Consistency 62 writes to a mirrored logical volume. If the AIX system crashes, upon
Check
reboot the last 62 writes to mirrors are examined and one of the mirrors is
used as a "source" to synchronize the mirrors (based on the last 62 disk
locations that were written). This "source" is of importance to parallel
mirrored systems. In sequentially mirrored systems, the "source" is always
picked to be the Primary disk. If that disk fails to respond, the next disk in
the sequential ordering will be picked as the "source" copy. There is a
chance that the mirror picked as "source" to correct the other mirrors was
not the one that received the latest write before the system crashed. Thus,
a write that completed on one copy but remained incomplete on another
mirror could be lost.
AIX does not guarantee that the absolute, latest write request completed
before a crash will be there after the system reboots. But, AIX will
guarantee that the parallel mirrors will be consistent with each other. If the
mirrors are consistent with each other, then the user will be able to realize
which writes were considered successful before the system crashed and
which writes will be retried. The point here is not data accuracy, but data
consistency. The use of the Primary mirror copy
as the source disk is the basic reason that sequential mirroring is offered.
Not only is data consistency guaranteed with MWCC, but the use of the
Primary mirror as the source disk increases the chance that all the copies
have the latest write that occurred before the mirrored system crashed.
Ability to detect and correct stale mirror copies
The Volume Group Status Area (VGSA) tracks the status of 1016 physical partitions per disk per volume group. During a read or write, if the LVM device driver detects a failure in fulfilling a request, the VGSA notes the physical partition(s) that failed and marks the partition(s) "stale". When a partition is marked stale, this is logged by AIX error logging, and the LVM device driver will not send further data requests to that stale partition; this avoids wasting time on I/O requests to a partition that most likely will not respond. When the physical problem is corrected, the VGSA tells the mirror-synchronization code which partitions need to be updated so that the mirrors again contain the same data.
LVM Striping
Striping and parameters for striping
Disk striping is the concept of spreading sequential data across more than one disk to improve disk I/O. The theory is that if you have data that is sequential, and you can divide the request into more than one disk I/O, you will reduce the time it takes to get the entire piece of data. The request must be handled so that it is transparent to the user: the user does not know which pieces of the data reside on which disk, and does not see the data until all the disk I/O has completed (in the case of a read) and the data has been reassembled. Since LVM already has the concept of a logical-to-physical mapping built into its design, disk striping is an easy evolution. Striping is described by the "width" of a stripe and the "stripe length". The width is how many disks the sequential data lies across; the stripe length is how many sequential bytes reside on one disk before the data jumps to another disk to continue the sequential information path.
Striping Example
An example shows the benefit of striping. A piece of data stored on the disk is 100 bytes, and the physical cache of the system is only 25 bytes. Thus, it takes four read requests to the same disk to complete the reading of 100 bytes: since the data is all on the same disk, four sequential reads are required.

If this logical volume were created with a stripe width of 4 (how many disks) and a stripe size of 25 (how many consecutive bytes before going to the next disk), then each disk requires only one read request, and the time to gather all 100 bytes is reduced four-fold. However, there is still the bottleneck of having the four independent data disks channel through one adapter card. This can be remedied with the expensive option of putting each disk on an independent adapter card. Note the effect of using striping: the user has now lost the use of three disks that could have been used for other volume groups.
LVM Performance
Performance with disk mirroring
Disk mirroring can improve the read performance of a system, but at a cost to the write performance. Of the two mirroring strategies, parallel and sequential, parallel is the better of the two in terms of disk I/O. In parallel mirroring, when a read request is received, the LVM device driver looks at the queued requests (read and write) and finds the disk with the least number of requests waiting to execute. This is a change from AIX 3.2, where a complex algorithm tried to approximate the disk that would be "closest" to the required data (regardless of how many jobs it had queued up). In AIX 4.1, it was decided that this complex algorithm did not significantly improve the I/O behavior of mirroring, so the complex logic was scrapped. It is easy to see how the new strategy of finding the shortest wait line improves the read time. And with mirroring, two independent requests to two different locations can be issued at the same time without causing disk contention, because the requests will be issued to two independent disks. However, along with the improvement to reads that comes from having multiple identical sources, the LVM disk driver must now perform more writes in order to complete a write request: with mirroring, all disks that make up a mirror are issued write commands, and each disk must complete its write before the LVM device driver considers the write request complete.
Changeable parameters that affect LVM performance
There are a few parameters that the user can change per logical volume which affect the performance of the logical volume in terms of data-access efficiency. From experience, however, many people have different views of how to achieve that efficiency, so no specific "right" recommendation can be given in these notes.
Inter-policy - This comes in two variations, min and max. The two choices tell LVM how the user wishes the logical volume to be spread over the disks in the volume group. With min, LVM is told that the logical volume should be spread over as few disks as possible. The max policy directs LVM to spread the logical volume over as many disks as are defined in the volume group, limited by the "Upper Bound" variable. Some users try to use this variation to form a cheap version of disk striping on systems below AIX 4.1. However, it must be stated that the inter-policy is a "recommendation" to the allocp binary (the partition-allocation routine), not a strict requirement. In certain cases, depending on what is free on a disk, these allocation policies may not be achievable.
Intra-policy - There are five regions on a disk platter defined by the intra-policy: edge, inner-edge, middle, inner-middle, and center. This policy tells the LVM the preferred location of the logical volume on the disk platter. Depending on the value also provided for the inter-policy, this preference may or may not be satisfied by LVM. Many users have different ideas as to which portion of the disk is the "best", so no recommendation is given in these notes.
Mirror write consistency check - As mentioned before, the mirror write consistency check tracks the last 62 distinct writes to physical partitions. If the user turns this off, the path length involved in a disk write is shortened (although only slightly). However, the trade-off may be inconsistent mirrors if the system crashes during a write call.
Write verify - This is turned off by default when a logical volume is created. If it is turned on for a logical volume, additional time accumulates during writes, as the IOCC check is performed for each write to the disk platter.
Physical Connections
Mirroring on different disks - The default for disk mirroring is that the copies should exist on different disks. This is for performance as well as data integrity. With copies residing on different disks, if one disk is extremely busy, then a read request can be completed from the other copy residing on a less busy disk. Although it might seem the cost would be the same for writes, the section "Command tag queuing" should show that writing to two copies on the same disk is worse than writing to two copies on separate disks.
Mirroring across different adapters - Another method to improve disk throughput is to mirror the copies across adapters. This gives you a better chance not only of finding a copy on a disk that is least busy, but also of finding an adapter that is not as busy. LVM does not realize, nor care, whether the two disks reside on the same adapter. If the copies are on the same adapter, there is still the bottleneck of getting your data through the flow of other data coming from other devices sharing the same adapter card. With multiple adapters, the throughput through the adapter channel should improve.
Command tag queuing - This is a feature found only on SCSI-2 devices. In SCSI-1, an adapter may get many requests, but it will only send out one command at a time. Thus, if the SCSI device driver receives three requests for I/O, it buffers the last two requests until the response to the first one comes back; it then picks the next one in line and issues that command. The target device therefore only receives one command at a time. With command tag queuing on SCSI-2 devices, multiple commands may be sent to the same device at once. The two device drivers (disk and SCSI adapter) are capable of determining which command returned and what to do with it. Thus, disk I/O throughput can be improved.
Physical Placement of Logical Partitions
One important ability of LVM is letting the user dictate where on the disk platter the logical volume should be placed. This is done with the map file that can be used with the "mklv" and "mklvcopy" commands. The map file allows the user to assign a distinct physical partition number to a distinct logical partition number. Thus, people with different theories on the optimal layout for data partitions can customize their systems according to their personal preferences.
Performance consideration with Disk Striping
Disk striping was introduced in AIX 4.1; it is another name for the RAID 0 implementation in software. The functionality is based on the assumption that large amounts of data can be retrieved more efficiently if the request is broken up into smaller requests given to multiple disks. And if the multiple disks are on multiple adapters, then the theory works even better, as mentioned in the previous sections on mirroring across different disks and adapters. In those sections, we described the efficiency gained for mirrors; here, the same efficiency is gained with data spread across disks and adapters, but without mirroring. Thus there is a saving on the write case, as compared to mirrors. But there is a slight loss in the read case, as compared to mirrors, because there is no longer more than one copy to read from if one disk is busier than the other.
AIX 4.3.3 and AIX 5 IDs
This section explores the physical disk layout on the Power platform.

There are three identifiers commonly used within LVM: the Physical Volume Identifier (PVID), the Volume Group Identifier (VGID), and the Logical Volume Identifier (LVID). The last two, VGID and LVID, are closely tied: the LVID is simply a dot "." and a minor number appended to the end of the VGID. The VGID is a combination of the machine's unique processor serial number (uname -a) and the date that the volume group was created.

The implementation of LVM has always assumed that the VGID of a system was made up of two 32-bit words. Throughout the code, however, the VGID/LVID is represented with the system data type struct unique_id, which is made up of four 32-bit words; the LVM library, driver, and commands have always assumed or enforced the notion that the last two words, word 3 and word 4 of this structure, are zeroes.

AIX 5 is now changed such that all four 32-bit words are used, for a total of 128 bits or 32 hex digits. The most significant 32 bits are copied from the processor ID, and the remaining 96 bits are the millisecond time stamp at creation time.
AIX 4.3.3
  PVID:  Byte8 Byte7 Byte6 Byte5 Byte4 Byte3 Byte2 Byte1
  VGID:  Byte8 Byte7 Byte6 Byte5 Byte4 Byte3 Byte2 Byte1
           0     0     0     9     0     2     7     7
  LVID:  Byte9 Byte8 Byte7 Byte6 Byte5 Byte4 Byte3 Byte2 Byte1
           0     0     0     9     0     2     7     7    .X

AIX 5
  PVID:  Byte16 Byte15 Byte14 Byte13 Byte12 Byte11 Byte10 Byte9 Byte8 Byte7 Byte6 Byte5 Byte4 Byte3 Byte2 Byte1
  VGID:  Byte16 Byte15 Byte14 Byte13 Byte12 Byte11 Byte10 Byte9 Byte8 Byte7 Byte6 Byte5 Byte4 Byte3 Byte2 Byte1
  LVID:  Byte17 Byte16 Byte15 Byte14 Byte13 Byte12 Byte11 Byte10 Byte9 Byte8 Byte7 Byte6 Byte5 Byte4 Byte3 Byte2 Byte1
Example IDs from AIX 4 and AIX 5L systems, showing how IDs are constructed from the processor ID
The processor ID is 64 bits in AIX 5; the uname function cuts out bits 33 to 47, so that the result is the first word plus the last 16 bits of the last word. The LVID and VGID combine the 64-bit processor ID and a 64-bit time stamp to form an ID. PVIDs are made of the 32-bit processor ID and bits from the timestamp.

Example from an AIX 5 Power system:
PVID hdisk0: 00071483229d06620000000000000000
PVID hdisk1: 00071483b50bbaee0000000000000000
LVID hd1: 0007148300004c00000000e19f7c5aa3.8
LVID hd2: 0007148300004c00000000e19f7c5aa3.5
LVID hd3: 0007148300004c00000000e19f7c5aa3.7
LVID hd4: 0007148300004c00000000e19f7c5aa3.4
VGID rootvg: 0007148300004c00000000e19f7c5aa3
VGID testvg: 0007148300004c00000000e1b50bc8ec
uname -a: 000714834C00
In an AIX 4 system, all the IDs are made of the most significant 32 bits of the processor ID and a 32-bit time stamp.
Physical volume, with logical volume testlv defined
The following example shows a disk dump from sector 0 on a Power system. "Uninitialized" marks data not written by the LVM; sections holding only 00's, and uninitialized sections, are cut out for clarity. The IDs are those listed in the previous section.
000000 ¦ C9 C2 D4 C1 00 00 00 00 00 00 00 00 00 00 00 00
000010 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000070 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000090 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0001F0 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000200 ¦ -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- Uninitialized
000400 ¦ 39 C7 F2 9F 14 87 93 46 00 00 00 00 00 00 00 00
000410 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0005E0 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[The remainder of this dump is illegible in these notes. Its annotations identified, in order: the _LVM record (struct lvm_rec, defined in lvmrec.h) carrying the VGID of testvg, the DEFECT (bad block) list, a second copy of the _LVM record and DEFECT list, the VGSA with its beginning and ending time stamps, and the VGDA.]
21A5F0 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ¦......
21A600 ¦ 74 65 73 74 6C 76 00 00 00 00 00 00 00 00 00 00 ¦testlv
21A610 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ¦......
[The dump of the logical volume control block is likewise illegible. Its annotations identified an "AIX LVCB" for a jfs logical volume named testlv, with creation and last-update time stamps (Tue Sep ...), and a label of "None".]
lvm_rec structure from file /usr/include/lvmrec.h
The structure lvm_rec is used by the LVM routines to define the disk layout. The listing below is reconstructed from the fragments in these notes and the field table that follows; comments are abbreviated.

/* structure which describes the physical volume LVM record */
struct lvm_rec
{
    __long32_t lvm_id;        /* LVM id field which identifies the record */
    struct unique_id vg_id;   /* the volume group id of this physical volume */
    __long32_t lvmarea_len;   /* length of the LVM reserved area */
    __long32_t vgda_len;      /* length of the volume group descriptor area */
    daddr32_t vgda_psn[2];    /* physical sector numbers of the two VGDA copies */
    daddr32_t reloc_psn;      /* PSN of the start of the pool of blocks (located
                                 at the end of the PV) which are reserved for
                                 bad block relocation */
    __long32_t reloc_len;     /* the length in number of sectors of the pool of
                                 bad block relocation blocks */
    short int pv_num;         /* the physical volume number within the volume
                                 group of this physical volume */
    short int pp_size;        /* physical partition size */
    __long32_t vgsa_len;      /* length of the volume group status area */
    daddr32_t vgsa_psn[2];    /* physical sector numbers of the two VGSA copies */
    short int version;        /* the version number of this volume group
                                 descriptor and status area */
    short int vg_type;        /* volume group type */
    int ltg_shift;            /* logical track group size shift */
    char res1[444];           /* reserved */
};
If we use the string "_LVM" we can locate the above structure in the previous disk dump and assign values to the variables:
struct lvm_rec field                 Value
__long32_t lvm_id;                   0x5F4C564D (#define LVM_LVMID, "_LVM")
struct unique_id vg_id;              0007148300004C00000000E1B50BC8EC
__long32_t lvmarea_len;              00001074
__long32_t vgda_len;                 00000832
daddr32_t vgda_psn[2];               00000088 000008C2
daddr32_t reloc_psn;                 00867C2D
__long32_t reloc_len;                00000100
short int pv_num;                    0001
short int pp_size;                   0018
__long32_t vgsa_len;                 00000008
daddr32_t vgsa_psn[2];               00000080 000008BA
int ltg_shift;                       0001
char res1[444];                      Uninitialized
VGSA structure
The header excerpts below are abridged: several fields were elided from these notes, and the #ifdef _KERNEL conditionals select the kernel (fixed 32-bit) or user-space time-stamp representation.

struct vgsa_area {
        /* ... time stamps and per-PV fields elided ... */
        /* Bit per PV */
        /* Stale PP bits */
        uchar stalepp[MAXPVS][VGSA_BT_PV];
        /* ... */
};

struct big_vgsa_area {
#ifdef _KERNEL
        struct timestruc32_t b_tmstamp;   /* Beginning time stamp */
#else
        char e_tmbuf64bit[24];
#endif
        /* ... stale-partition fields elided ... */
#ifdef _KERNEL
        /* ... kernel ending time stamp elided ... */
#else
        struct timestruc_t e_tmstamp;     /* Ending time stamp */
#endif
};
Introduction to AIX 5L on IA-64 and EFI partitioned disks
IA-64 systems have a different design from Power systems; some, if not all, IA-64 systems will use the Extensible Firmware Interface (EFI). EFI has defined a new disk-partitioning scheme to replace the legacy DOS partitioning support.

When booting from a disk device, the EFI firmware utilizes one or more system partitions containing an EFI file system (FAT32) to locate EFI applications and drivers, including the OS boot loader. These applications and drivers provide ways to extend the firmware or to assist the operating system during boot time or runtime. In addition, it is expected that operating systems will define partitions unique to the operating system. EFI applications will also have the capability to display, and potentially create, additional partitions before the OS is booted.
AIX traditionally has not supported partitioned disks, because AIX was the only OS running on RS/6000 systems. Therefore the entire disk is defined by an hdisk ODM object and a /dev/hdiskn special file, with a single major and minor number assigned to the physical disk. In AIX 4.3.3, when a disk becomes a physical volume (acquiring a PVID), an old-style MBR (master boot record), renamed the IPL control block, which contains the PVID, is written into the first sector of the disk.
The overall design for disk partitioning on AIX 5L on IA-64 is to introduce disk partitioning at the disk-driver level. An hdisk ODM object will still refer to the physical disk; however, multiple special files will be created and associated with the partitions on the disk. Besides the EFI system partitions, the AIX 5L on IA-64 disk configure method will recognize IA-64 physical volume partitions.

AIX 5L on IA-64 supports a maximum of 4 partitions; of these, one partition can be a physical volume partition, and the other partitions are EFI system partitions. Therefore only one AIX PV, and hence one volume group, can be defined per physical disk.
A new command, efdisk, acts as a partition manager.

Special files will be created for the following partition types:
• Entire physical disk n access (used by efdisk): /dev/hdiskn_all
• System partition index y on physical disk n: /dev/hdiskn_sy
• Physical volume partition on physical disk n: /dev/hdiskn
• Unknown partition index x on physical disk n: /dev/hdiskn_px
Creating new partitions on an IA-64 system
AIX 5L on IA-64 will partition disks under the following circumstances:
• Under the direction of the user/administrator via the efdisk command
• During BOS install, after the designation of a "boot" disk (install target)
• When adding a disk that is not yet a physical volume to a VG
• Under the direction of the "chdev -l hdiskx -a pv=yes" command
The disk system after a default installation
After installing AIX 5L on a system with one disk, the physical drive and the /dev special files can be listed:

lsdev -Cc disk
hdisk0 Available 00-19-10 Other IDE Disk Drive

The EFI system partition holds hardware information and EFI firmware data; the partition is DOS formatted and can be accessed through DOS utilities.
Creating partitions with efdisk
After creating four partitions, we can list the starting block number and length of each with the efdisk command:
------------------------------------------------------
Partition Index: 0
Partition Type: Physical Volume
StartingLBA: 1 (0x1)
Number of blocks: 819200 blocks (0xc8000)
Partition Index: 1
Partition Type: System Partition
StartingLBA: 819201 (0xc8001)
Number of blocks: 409600 blocks (0x64000)
Partition Index: 2
Partition Type: System Partition
StartingLBA: 1228801 (0x12c001)
Number of blocks: 614400 blocks (0x96000)
Partition Index: 3
Partition Type: System Partition
StartingLBA: 1843201 (0x1c2001)
Number of blocks: 614400 blocks (0x96000)
Disk layout on IA-64 systems
The following disk dump lists the data in hex format; the six leftmost digits are the byte offset from the physical start of the disk, and each line lists 16 bytes. The data was read on an IBM Power system with the same utility as in the previous examples; where byte swapping is mentioned, it is relative to what the data would look like on a disk connected to an AIX Power system.
000010 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
length = 0xc8000
0001D0 ¦ FF FF EF FF FF FF 01 80 0C 00 00 40 06 00 00 FF -start LBA = 0x0c8001
length = 0x064000
length = 0x096000
000200 ¦ C1 D4 C2 C9 00 00 00 00 00 00 00 00 00 00 00 00 - AMBI in ebcdic = IBMA byte swapped
[The remainder of this dump is illegible in these notes. Its annotations identified: "MVL_" (the _LVM magic, byte-swapped), the lvm_rec struct at an offset from where it sits in the Power dump (the data in the partition is placed as it would be at the start of a PV), the DEFECT list, the VGSA beginning and ending time stamps, the VGDA start time stamp, and the VGID for iavg.]
For reference, the PVID of the disk and the LVIDs and VGID of the iavg volume group were listed at this point. [The hex values are illegible in these notes.]
AIX 5L Passive Mirror Write Consistency Check
The previous Mirror Write Consistency Check (MWCC) algorithm has been in place since AIX 3.1. This original design has served the Logical Volume Manager (LVM) well, but it has always slowed the performance of mirrored logical volumes that perform massive and varied writes. A new design is implemented in AIX 5 to supplement the original MWCC design.
AIX 4 MWCC algorithm
The AIX 4 MWCC method uses a table called the mwc table. This table is kept in memory as well as on the disk platter. The table has 62 entries, and the entries track the last 62 distinct Logical Track Group (LTG) writes. An LTG is 128 kilobytes. The mwc table is concerned only with writes, not reads. The algorithm can be expressed in pseudo-code:
if (action is a write)
{
    if (LTG to be written is already in the mwc table array in memory)
    {
        proceed and issue the write to the mirrors
        wait until all mirrored writes complete
        return to calling process
    }
    else
    {
        update the mwc table with this latest LTG number, overwriting
            the oldest LTG entry in the mwc table (in memory)
        write the in-memory mwc table to the edge of the platter of all
            disks in the volume group
        wait for the mwc table writes to complete - when the mwc table
            write of the disk that holds the LTG in question returns,
            the mwc table write is considered complete
        issue the parallel mirror writes to all the mirrors
        wait until all mirrored writes complete
        return to calling process
    }
}
else
    process the read
MWCC usage for recovery
The reason for having MWCC is recovery from a crash while I/O is proceeding on a mirrored logical volume. By implication, this means that MWCC is ignored for non-mirrored logical volumes. A key phrase is data "in flight", which means that a write has been issued to a disk and the write order has not come back from the disk with a confirmation that the action is complete; thus, there is no certainty that the data did in fact get written to the disk. MWCC tracks the last 62 write orders so that upon reboot, this table can be used to rewrite the last 62 mirror writes. It is more than likely that all of those writes finished before the system crash; nevertheless, LVM goes to each of the 62 distinct LTGs, reads one copy of the mirror, and writes it to the other mirror(s) that exist. Note that MWCC does not guarantee that the absolute latest write is made available to the user. MWCC just guarantees that the images on the mirrors are consistent (identical).
AIX 4 MWCC performance implications
The current MWCC algorithm carries a penalty for heavy random writes: there is a performance sag associated with doing an extra write for each write you perform. A good example, taken from a customer, is a mail server with mirrored accounts. Thousands of users were constantly writing or deleting files in their mail accounts, so the LTG entries were constantly being changed and written to disk. In addition to that overhead, if the mwc table has been dispatched to be written, new requests that come into the LVM work queue are held until the mwc table write returns, so that the table can be updated and once more sent down to the disk platters.
Current AIX 4 MWCC workaround
Currently, the only way customers can work around the performance penalty associated with MWCC is to turn the functionality off. But in order to ensure data consistency, they must then run syncvg -f <vgname> immediately after a system crash and reboot to synchronize the data. Since there is no mwc table on the platter, there is no way to determine which LTGs need resyncing, so a forced resync of ALL partitions is required. Omitting this synchronization may leave inconsistent data.
AIX 5 LVM Passive Mirror Write Consistency Check
The MWCC implementation in AIX 5 provides a new passive algorithm, but only for big VGs. The reason for this is that space is needed for a dirty flag for each logical volume, and only the VGSA of big VGs provides this space.
AIX 5 Passive MWCC algorithm
The new MWCC algorithm sets a flag when the mirrored LV is opened in read-write mode, and the flag is not cleared until the last close on the device. The flag is then examined during subsequent boots. The algorithm implemented is:
1. The user opens a mirrored logical volume.
2. The LVM driver marks a bit in the VGDA which states that, for purposes of passive MWCC, the LV is "dirty".
3. Reads and writes occur to the mirrored LV with no (traditional) mwc table writes.
4. The machine crashes.
5. Upon reboot, the volume group automatically varies on. As part of this varyonvg, checks are made to see whether a dirty bit exists for each LV.
6. For each logical volume that is dirty, a "syncvg -f -l <lvname>" is performed, regardless of whether or not the user wants to do this.
Advantage:
The behavior of a mirrored write will be the same as that of a mirrored logical volume with no MWCC. Since crashes are very rare, the need for an MWCC resync is negligible; thus a mostly unnecessary write (the mwc table update) is avoided.

Disadvantage:
After a crash, the entire logical volume is considered dirty, although only a few blocks may have changed. Until all the partitions have been resynced, the logical volume will always be considered dirty while it is open. Additionally, reads will be a bit slower, as a read-then-sync operation must be performed.
Commands affected by the Passive MWCC algorithm
The varyonvg command will inform the user that a background forced sync may be occurring as part of passive MWCC recovery.

The syncvg command will inform the user that a non-forced sync on a logical volume with passive MWCC will result in a forced background sync.

The lslv command has been altered such that the output shows whether passive MWCC is set and active.

To set passive sync:
• mklv -w p = use the passive MWCC algorithm
• chlv -w p = use the passive MWCC algorithm

hd_ioctl: this will return additional status and tell the user whether the logical volume is currently marked as needing to undergo, or is actually undergoing, passive MWCC recovery (all reads result in a resync of the mirrors).

Changes in hdpin.exp: export the call hd_sa_update so that hd_top can update the VGSA as well with the modified lv_dirty_bit as a result of hd_open or hd_close.
AIX 5 Hot Spare disk: chpv command
chpv [-h HotSpare] ... existing flags ... PhysicalVolume

-h hotspare
Sets the sparing characteristics of the physical volume, so that the physical volume can be used as a hot spare, and sets the allocation permission for physical partitions on the physical volume specified by the PhysicalVolume parameter. This flag has no meaning for non-mirrored logical volumes. The HotSpare variable can be either:
• y
  Marks the disk as a hot spare disk within the VG it belongs to and prohibits the allocation of physical partitions on the physical volume. The disk must not have any partitions allocated to logical volumes to be successfully marked as a hot spare disk.
• n
  Removes the disk from the hot spare pool of the volume group in which it resides and allows allocation of physical partitions on the physical volume.
AIX 5 Hot Spare disk: chvg command
chvg [-s Sync] [-h HotSpare] ... existing flags ... VolumeGroup

-h hotspare
Sets the sparing characteristics for the volume group specified by the VolumeGroup parameter: it either allows the automatic migration of failed disks or prohibits it. This flag has no meaning for non-mirrored logical volumes.
• y
  Allows the automatic migration of failed disks, using one-for-one migration of partitions from one failed disk to one spare disk. The smallest disk in the volume group spare pool that is big enough for a one-for-one migration will be used.
• Y
  Allows the automatic migration of failed disks, potentially using the entire pool of spare disks to migrate to, as opposed to a one-for-one migration of partitions to a single spare.
• n
  Prohibits the automatic migration of failed disks. This is the default value for a volume group.
• r
  Removes all disks from the hot spare pool of the volume group.

-s sync
Sets the synchronization characteristics for the volume group specified by the VolumeGroup parameter: it either allows the automatic synchronization of stale partitions or prohibits it. This flag has no meaning for non-mirrored logical volumes.
• y
  Attempts to automatically synchronize stale partitions.
• n
  Prohibits automatic synchronization of stale partitions. This is the default for a volume group.

• lsvg -p will show the status of all physical volumes in the VG.
• lsvg will show the current state of sparing and synchronization.
• lspv will show whether a disk is a spare.
AIX 5 LVM Hot Spot Management
This facility provides tools to determine which logical partitions have high I/O traffic and allows the migration of those logical partitions to other disks. Its benefits are:
• Improved performance, by eliminating hot spots.
• The ability to migrate particular logical partitions for maintenance.

The lvmstat command generates two types of report: per-logical-partition statistics within a logical volume, and per-logical-volume statistics within a volume group. The reports have the following format:
# lvmstat -l hd3
Log_part mirror# iocnt Kb_read Kb_wrtn Kbps
1 1 0 0 0 0.00
2 1 0 0 0 0.00
3 1 0 0 0 0.00
# lvmstat -v rootvg
Logical Volume iocnt Kb_read Kb_wrtn Kbps
hd2 1592 5620 880 0.05
hd9var 71 32 28 0.00
hd8 71 0 284 0.00
hd4 13 8 60 0.00
hd1 11 1 21 0.00
Examples
To move the first logical partition of logical volume lv00 to hdisk1, type:
migratelp lv00/1 hdisk1
To move the second mirror copy of the third logical partition of logical volume hd2 to hdisk5, type:
migratelp hd2/3/2 hdisk5
Splitting and reintegrating a mirror
For a long time it has been a desire to be able to make online backups; especially in installations with mirrored volumes, it has been a requested feature to be able to split the mirror and use one side of it for online backups. It has long been possible to do a manual split and later reintegration, but it has been rather complicated and therefore unsafe. In AIX 4.3.3 this feature has been made available with an easy command interface.

A mirrored LV can be divided with the chfs command; in the example, the LV mounted on /testfs is split and copy number 3 is mounted at /backup.
AIX 5 introduces Variable LTG size to improve disk performance
Today the Logical Volume Manager (LVM) shipped with all versions of AIX
has a constant maximum transfer size of 128K, also known within LVM as
the Logical Track Group (LTG). All I/O within LVM must be on a Logical
Track Group boundary. When AIX was first released, all disks supported
128K. Today many disks go beyond 128K, and the efficiency of many disks,
such as RAID arrays, is impacted if the I/O is not a multiple of the stripe
size, and the stripe size is normally larger than 128K.
Find out what is the root cause of the error
The first question to be asked is whether this problem is really in the LVM
layer. The sections that detail how an I/O request is handed down from layer
to layer might help clarify all the sections that must be considered. The most
important initial determination is whether the problem is above the LVM
layer, in the LVM layer, or below the LVM layer. For instance, an application
program such as Oracle or HACMP/6000 that accesses the LVM directly
might have a problem. If you can determine what actions these failing
programs are attempting on the LVM, then try to recreate the action by
hand using a method that is not based on those application programs. If
your attempt by hand works, then the focus of the problem shifts "up" to
the application program. Obviously, if it fails, then you have isolated the
problem at the LVM layer (or below). Or, the problem could simply be corruption of
the data needed by LVM; the programs are behaving correctly, but data
needed by LVM is corrupted which is causing LVM to behave strangely. An
additional bonus for the field investigator is the fact that most high-level
commands are shell scripts. Thus, investigators familiar with shell
programming can turn on shell tracing and watch the execution of
the shell commands to observe the failure point. This might add
helpful information to the problem record. Finally, if
there is corruption or loss of data required by LVM (such as a disk
accidentally erased from a volume group), it helps to find the exact steps
performed (or even not performed) by the user so that the investigator can
deduce the state of the system and what useful LVM information is left
behind.
Can this problem be distilled to the simplest case?
Many problem reports from the field to Level 3 concerning LVM are
difficult to investigate because clarification is required to determine the
root cause of the problem, or because the problem is described in terms of
a complex user configuration. If possible, the most basic action of the LVM
is the one that should be investigated. This is not always possible, as some
problems may only be exposed when running in a complex environment.
However, whenever possible one should try to distill the case into how an
action on a logical volume is causing the system to misbehave. In that
clarification, a non-LVM root cause may be discovered instead.
What has been lost and what has been left behind?
This type of question is typically asked of the system when some sort of
accident has resulted in data corruption or loss of LVM-required
information. Given the state of the system before the corruption, the steps
that most likely caused the corruption, and the current state of the
machine, one can deduce what is left to work with. Sometimes one will
receive conflicting information. This is because part of the ODM disagrees
with part of the VGDA. The ODM is the one that is easily alterable
(compared to the VGDA).
Is this situation repairable?
Sometimes you have enough information to know what is missing and
what should be done to repair the system. However, the design of the ODM,
the system configurator, and LVM prevents the repair. By fixing one
problem, another is spawned. And, one is caught in a deadlock situation
that cannot be fixed unless one wrote very specific kernel code to repair
the internal aspects of the LVM (most likely the VGDA). This is not a trivial
solution, but it is possible. It is only through experience that a judgement
can be made if recovery can be attempted.
Although this might seem a trivial step, when you attempt problem
recovery, most of the time you must alter or destroy an important internal
structure within the LVM (such as the VGDA). Once this is done, if the
recovery attempt didn't work, the user's system is usually in worse shape
than before the recovery attempt. Many users will decline the recovery
attempt once this warning is given. However, it is better to warn them
ahead of time!
Gather all While the volume group is still partially accessible, gather all possible data
possible data about the current volume group. The VGDA will provide information about
missing logical volumes, which will be important. Once the recovery
procedure starts, important reference information such as that gathered
from the VGDA will be lost for good. And if your information is incomplete,
then you may be stuck with nowhere to go.
Save off what Before starting the recovery, make a copy of files that can be restored in
can be saved case something goes wrong. A good example would be something like the
ODM database files that reside in /etc/objrepos. Sometimes the recovery
steps involve deleting information from those databases; once deleted, if
one is unsure of their form, one cannot recreate some of the structures or
values.
Each case is different, so each solution must be
Since each LVM problem is most likely going to be unique to that system,
these notes cannot provide a list of steps one would take in a repair. Once
again, the recovery steps must be based on individual experience with
LVM. The LVM lab exercise on recovery provides a glimpse of the
complexity and information required to repair a system. However, this lab
is just an example, not a template of how all fixes should be attempted.
List of Logical The library of LVM subroutines is a main component of the Logical Volume
Volume Manager.
Subroutines
LVM subroutines define and maintain the logical and physical volumes of a
volume group. They are used by the system management commands to
perform system management for the logical and physical volumes of a
system. The programming interface for the library of LVM subroutines is
available to anyone who wishes to provide alternatives to or expand the
function of the system management commands for logical volumes.
Note: The LVM subroutines use the sysconfig system call, which requires
root user authority, to query and update kernel data structures describing a
volume group. You must have root user authority to use the services of the
LVM subroutine library.
LVM logical The Logical Volume Device Driver (LVDD) is a pseudo-device driver that
volume device operates on logical volumes through the /dev/lvn special file. Like the
driver
physical disk device driver, this pseudo-device driver provides character
and block entry points with compatible arguments. Each volume group has
an entry in the kernel device switch table. Each entry contains entry points
for the device driver and a pointer to the volume group data structure. The
logical volumes of a volume group are distinguished by their minor device
numbers.
• Attention: Each logical volume has a control block located in the first
512 bytes. Data begins in the second 512-byte block. Care must be
taken when reading and writing directly to the logical volume, because
the control block is not protected from writes. If the control block is
overwritten, commands that use it can no longer be used.
Character I/O requests are performed by issuing a read or write request on
a /dev/rlvn character special file for a logical volume. The read or write is
processed by the file system SVC handler, which calls the LVDD ddread or
ddwrite entry point. The ddread or ddwrite entry point transforms the
character request into a block request. This is done by building a buffer for
the request and calling the LVDD ddstrategy entry point.
Block I/O requests are performed by issuing a read or write on a block
special file /dev/lvn for a logical volume. These requests go through the
SVC handler to the bread or bwrite block I/O kernel services. These
services build buffers for the request and call the LVDD ddstrategy entry
point. The LVDD ddstrategy entry point then translates the logical address
to a physical address (handling bad block relocation and mirroring) and
calls the appropriate physical disk device driver.
On completion of the I/O, the physical disk device driver calls the iodone
kernel service on the device interrupt level. This service then calls the
LVDD I/O completion-handling routine. Once this is completed, the LVDD
calls the iodone service again to notify the requester that the I/O is
completed.
The LVDD is logically split into top and bottom halves. The top half
contains the ddopen, ddclose, ddread, ddwrite, ddioctl, and ddconfig entry
points. The bottom half contains the ddstrategy entry point, which contains
block read and write code. This is done to isolate the code that must run
fully pinned and has no access to user process context. The bottom half of
the device driver runs on interrupt levels and is not permitted to page fault.
The top half runs in the context
of a process address space and can page fault.
scsidisk, SCSI This driver supports the small computer system interface (SCSI) and the
Disk Device Fibre Channel Protocol for SCSI (FCP) fixed disk, CD-ROM (compact disk
Driver
read only memory), and read/write optical (optical memory) devices.
Syntax
#include <sys/devinfo.h>
#include <sys/scsi.h>
#include <sys/scdisk.h>
Device-Dependent Subroutines
Typical fixed disk, CD-ROM, and read/write optical drive operations are
implemented using the open, close, read, write, and ioctl subroutines.
rhdisk Special File Provides raw I/O access to the physical volumes (fixed-
disk) device driver.
The rhdisk special file provides raw I/O access and control functions to
physical-disk device drivers for physical disks. Raw I/O access is provided
through the /dev/rhdisk0, /dev/rhdisk1, ..., character special files.
SCSI Adapter Device Driver
The SCSI adapter device driver has access to the physical disk (if a SCSI
disk). The driver supports data transfers via read and write, and control
commands via ioctl calls. The disk device driver uses the adapter device
driver to access and control the physical storage device.
Syntax
#include <sys/scsi.h>
#include <sys/devinfo.h>
Description
The /dev/scsin and /dev/vscsin special files provide interfaces for access
for both initiator and target mode device instances. The host adapter is an
initiator for access to devices such as disks, tapes, and CD-ROMs. The
adapter is a target when accessed from devices such as computer
systems, or other devices that can act as SCSI initiators.
Exercises
Examine the physical disk layout of a logical volume and a physical volume.
Use a tool such as edhx, hexit, dd or another to look at a physical volume.
Identify the PVID, the VGID, and the LVM structure.
Hint: which device should you use to access these data? It may be easier to
copy data from the drive to a file with the dd command:
dd if=/dev/xxx of=/tmp/Myfile bs=1024k count=<number of MB>
Use another device to look at the logical volume. Does the data match that
from the physical device?
Examine the impact of LVM Passive Mirror Write Consistency
This exercise will look at the performance impact of enabling and disabling
MWC. To do this we need a reproducible write load. One way to get this is
to write a C program to create the load. Remember, the file has to be really
big to exceed the cache size, or force a sync to occur before terminating.
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

void writetstfile()
{
    char buffer[512];
    char *filename = "/test/a_large_file";
    register int i;
    int fildes;

    memset(buffer, 'x', sizeof(buffer)); /* fill the write pattern */
    if ((fildes = creat(filename, 0640)) < 0) {
        printf("cannot create file\n");
        exit(1);
    }
    close(fildes);
    if ((fildes = open(filename, O_WRONLY)) < 0) {
        printf("cannot open file for write\n");
        exit(1);
    }
    /* write 512MB (adjust to exceed the machine's cache size) */
    for (i = 0; i < 1024 * 1024; i++)
        write(fildes, buffer, sizeof(buffer));
    fsync(fildes); /* force the data to disk before terminating */
    close(fildes);
}
Exercises -- continued
Examine the function of LVM LTG
The LTG is the LVM Logical Track Group, the amount of data read from or
written to the disk in each operation. Try to monitor the data rate and the
number of disk transactions per second during I/O. Both can be monitored
with the iostat command.
Test the split mirror facility
Test the “Splitting and reintegrating” facility of a mirror. First create a
mirrored LV and write data to it. Then split the mirror and access data from
both sides. Change data on the “primary side”, and then reintegrate the
mirror. What happens?
How fast are the mirrors reintegrated?
Are they really synchronized?
Exercise: Trace LVM system activity
In this exercise we will use the trace command to monitor LVM activity.
Start, stop, and list the results from an LVM trace with the trace, trcstop,
and trcrpt commands.
Try to unmount a filesystem, mount the filesystem again, create a file, and
write data into the file to create some activity in the LVM trace file.
Objectives
After completing this unit, you should be able to:
• List the difference between the terms aggregate and fileset.
• Identify the various data structures that make up the JFS2
filesystem.
• Use the fsdb command to trace the various data structures that
make up the logical and virtual file system.
Numbers The following table lists some general information about JFS2:
Function                        Value
Block size                      512 - 4096 bytes, configurable
Architectural max. file size    4 petabytes
Max. file size tested           1 terabyte
Max. file system size           1 terabyte
Number of inodes                Dynamic, limited by disk space
Directory organization          B-tree
Aggregate
Introduction The term aggregate is defined in this section. The layout of a JFS2
aggregate is described.
Definitions JFS2 separates the notion of a disk space allocation pool, called an
aggregate, from the notion of a mountable file system sub-tree, called a
fileset. The rules that define aggregates and filesets in JFS2 are:
• There is exactly one aggregate per logical volume.
• There may be multiple filesets per aggregate.
• In the first release of AIX 5L, only one fileset per aggregate is
supported.
• The meta-data has been designed to support multiple filesets, and this
feature may be introduced in a future release of AIX 5.
The terms aggregate and fileset in this document correspond to their DCE/
DFS (Distributed Computing Environment Distributed File System) usage.
Aggregate An aggregate has a fixed block size (number of bytes per block) that is
block size defined at configuration time. The aggregate block size defines the
smallest unit of space allocation supported on the aggregate. The block
size cannot be altered, and must be no smaller than the physical block size
(currently 512 bytes). Legal aggregate block sizes are:
• 512 bytes
• 1024 bytes
• 2048 bytes
• 4096 bytes.
Do not confuse the aggregate block size with the logical volume block size,
which defines the smallest unit of I/O.
Aggregate -- continued
Aggregate layout
The following diagram and table detail the layout of the aggregate.
Note: the aggregate block size is 1KB (one aggregate block) in this example.
[Diagram: the aggregate starts with a 32KB reserved area (blocks 0-31),
followed by the primary aggregate superblock, the Aggregate Inode Table
(32 inodes in 16KB, with a control page and IAG), and the secondary
aggregate superblock. Sample aggregate inodes are shown: inode #1
(“self”, size 8192, whose first extent addresses the Aggregate Inode
Allocation Map), inode #2 (block map, size 16384), and inode #16
(fileset 0, size 12288), each owned by root with permissions -rwx------.]
Part Function
Reserved area A 32K area at the front not used by JFS2. The first
block is used by the LVM.
Primary The primary aggregate superblock (defined as a
aggregate struct superblock) contains aggregate-wide
superblock information such as the:
• size of the aggregate
• size of allocation groups
• aggregate block size
The superblocks are at fixed locations, which
allows them always to be found without
depending on any other information.
Secondary The secondary aggregate superblock is a direct
aggregate copy of the primary aggregate superblock. The
superblock secondary aggregate superblock is used if the
primary aggregate superblock is corrupted.
Aggregate -- continued
Part Function
Aggregate inode Contains inodes that describe the aggregate-wide
table control structures; these inodes are described
below.
Secondary Contains replicated inodes from the Aggregate
aggregate inode Inode Table. Since the inodes in the Aggregate
table Inode Table are critical for finding any file system
information they will each be replicated in the
Secondary Aggregate Inode Table. The actual
data for the inodes will not be repeated, just the
addressing structures used to find the data and
the inode itself.
Aggregate inode Describes the Aggregate Inode Table. It contains
allocation map allocation state information on the aggregate
inodes as well as their on-disk location.
Secondary Describes the Secondary Aggregate Inode Table.
aggregate inode
allocation map
Block allocation Describes the control structures for allocating and
map freeing aggregate disk blocks within the
aggregate. The Block Allocation Map maps one-
to-one with the aggregate disk blocks.
fsck working Provides space for fsck to be able to track the
space aggregate block allocations. This space is
necessary - for a very large aggregate there might
not be enough memory to track this information in
memory when fsck is run. The space is described
by the superblock. One bit is needed for every
aggregate block. The fsck working space always
exists at the end of the aggregate.
In-line log Provides space for logging of the meta-data
changes of the aggregate. The space is described
by the superblock. The in-line log always exists
following the fsck working space.
Aggregate -- continued
Aggregate Inodes
When the aggregate is initially created, the first inode extent is allocated;
additional inode extents are allocated and de-allocated dynamically as
needed. These aggregate inodes each describe certain aspects of the
aggregate itself, as follows:
Inode # Description
0 Reserved
1 Called the “self” inode, this inode describes the aggregate
disk blocks comprising the aggregate inode map. This is a
circular representation, in that aggregate inode one is itself
in the file that it describes. The obvious circular
representation problem is handled by forcing at least the
first aggregate inode extent to appear at a well-known
location, namely, 4K after the Primary Aggregate
Superblock. Therefore, JFS2 can easily find Aggregate
Inode one, and from there it can find the rest of the
Aggregate Inode table by following the B+–tree in inode one
2 Describes the Block Allocation Map.
3 Describes the In-line Log when mounted. This inode is
allocated but no data is saved to disk.
4 - 15 Reserved for future extensions.
16 - Starting at aggregate inode 16 there is one inode per fileset,
the Fileset Allocation Map Inode. These inodes describe the
control structures that represent each fileset. As additional
filesets are added to the aggregate, the aggregate inode
table itself may have to grow to accommodate additional
fileset inodes
Allocation Groups
Introduction Allocation Groups (AG) divide the space on an aggregate into chunks, and
allow JFS2 resource allocation policies to use well known methods for
achieving good JFS2 I/O performance.
Allocations When locating data on the disk JFS2 will attempt to:
policies
• Group disk blocks for related data and inodes close together.
• Distribute unrelated data throughout the aggregate.
Allocation Group Sizes
Allocation group sizes must be selected to yield allocation groups that
are sufficiently large to provide for contiguous resource allocation over
time. The allocation group size is stored in the aggregate superblock. The
rules for setting the allocation group size are:
• The maximum number of allocation groups per aggregate is 128.
• The minimum size of an allocation group is 8192 aggregate blocks.
• The allocation group size must always be a power-of-2 multiple of the
number of blocks described by one dmap page (i.e. 1, 2, 4, 8, ... dmap
pages).
Partial Allocation Group
An aggregate whose size is not a multiple of the allocation group size
contains a partial allocation group: it is not fully covered by disk blocks.
This partial allocation group is treated as a complete allocation group,
except that the non-existent disk blocks are marked allocated in the Block
Allocation Map.
Filesets
Layout The following illustration and table detail the layout of a fileset.
[Diagram: the Fileset Inode Table: 32 inodes (numbered 0-31) preceded by
a control page, with IAG pages describing them.]
Part Function
Fileset Inode Contains inodes describing the fileset-wide control
table structures. The Fileset Inode Table logically
contains an array of inodes.
Fileset Inode A Fileset Inode Allocation Map which describes
allocation map the Fileset Inode Table. The Fileset Inode
Allocation Map contains allocation state
information on the fileset inodes as well as their
on-disk location.
Inodes Objects. Every JFS2 object is represented by an
inode, which contains the expected object-specific
information such as time stamps, file type (regular
vs. directory, etc.). They also “contain” a B+–tree
to record the allocation of extents. Note
specifically that all JFS2 meta data structures
(except for the superblock) are represented as
“files.” By reusing the inode structure for this data,
the data format (on-disk layout) becomes
inherently extensible.
Filesets -- continued
Super Inodes
Super Inodes, found in the Aggregate Inode Table (#16 and greater),
describe the Fileset Inode Allocation Map and other fileset information.
Since the Aggregate Inode Table is replicated, there is also a secondary
version of each such inode which points to the same data.
Inodes When the fileset is initially created, the first inode extent is
allocated; additional inode extents are allocated and de-allocated
dynamically as needed. The inodes in a fileset are allocated as follows:
Fileset Description
Inode #
0 reserved
1 additional fileset information that would not fit in the Fileset
Allocation Map Inode in the Aggregate Inode Table.
2 The root directory inode for the fileset.
3 The ACL file for the fileset.
4- Fileset inodes from four onwards are used by ordinary fileset
objects, user files, directories, and symbolic links.
Extents
Extent Allocation Descriptor
Extents are described by an xad structure. The two main values describing
an extent are its length and its address. In an xad, both the length and
address are expressed in units of the aggregate block size. Details of
the xad data structure are shown below.
struct xad {
        uint8   xad_flag;
        uint16  xad_reserved;
        uint40  xad_offset;
        uint24  xad_length;
        uint40  xad_address;
};
Member Description
xad_flag Flags set on this extent. See /usr/include/j2/j2_xtree.h for
a list of flags.
xad_reserved Reserved for future use.
xad_offset Extents are generally grouped together to form a larger
group of disk blocks. The xad_offset describes the
logical byte address this extent represents in the larger
group.
xad_length A 24-bit field containing the length of the extent in
aggregate blocks. An extent can range in size from 1 to
2^24 - 1 aggregate blocks.
xad_address A 40-bit field containing the address of the first block of
the extent. The address is in units of aggregate blocks
and is the block offset from the beginning of the
aggregate.
Extents -- continued
Allocation In general, the allocation policy for JFS2 tries to maximize contiguous
Policy allocation by allocating a minimum number of extents, with each extent as
large and contiguous as possible. This allows for larger I/O transfer
resulting in improved performance. However, in special cases this is not
always possible. For example, a copy-on-write clone of a segment will cause
a contiguous extent to be partitioned into a sequence of smaller
contiguous extents. Another case is restriction of the extent size. For
example the extent size is restricted for compressed files since we must
read the entire extent into memory and decompress it. We have a limited
amount of memory available so we must ensure we will have enough room
for the decompressed extent.
Introduction Objects in JFS2 are stored in groups of extents arranged in
binary trees. The concepts of binary trees are introduced in this section.
Trees Binary trees consist of nodes arranged in a tree structure. Each node
contains a header describing the node. A flag in the node header
identifies the role of the node in the tree.
[Diagram: a tree with a root node (header flags=BT_ROOT) pointing to an
internal node (header flags=BT_INTERNAL) and leaf nodes (header
flags=BT_LEAF). Each node contains a header followed by an array of
extent descriptors (xad entries); leaf nodes point to the extents
containing the object's data.]
Header flags This table describes the binary tree header flags.
Flag Description
BT_ROOT The root or top of the tree.
BT_LEAF The bottom of a branch of a tree. Leaf nodes point to
the extents containing the objects data.
BT_INTERNAL An internal node points to two or more leaf nodes or
other internal nodes.
B+-tree index There is one generic B+–tree index structure for all index objects in JFS2
except for directories. The data being indexed depends upon the object.
The B+–tree is keyed by offset of the xad structure of the data being
described by the tree. The entries are sorted by the offsets of the xad
structures, each of which is an entry in a node of a B+–tree.
Root node The file j2_xtree.h describes the header for the root of the B+–tree in struct
header xtpage_t.
Leaf node The file j2_btree.h describes the header for an internal node or a leaf node
header in struct btpage_t.
typedef struct {
int64 next; /* 8: right sibling bn */
int64 prev; /* 8: left sibling bn */
uint8 flag; /* 1: */
uint8 rsrvd[7]; /* 7: type specific */
int64 self; /* 8: self address */
uint8 entry[4064]; /* 4064: */
} btpage_t;
Inodes
Overview Every file on a JFS2 filesystem is described by an on-disk inode.
The inode holds the root header for the extent binary tree. File attribute
data and block allocation maps are also kept in the inode.
Inode Layout The inode is a 512-byte structure, split into four 128-byte
sections described here.
[Diagram: the four sections of the inode: (1) POSIX attributes;
(2) extended attributes, block allocation maps, inode allocation maps,
and headers describing the inode data; (3) in-line data or xad’s;
(4) extended attributes, more in-line data, or additional xad’s.]
Section Description
1 This section describes the POSIX attributes of the JFS2 object,
including the inode and fileset number, object type, object size,
user id, group id, creation time, access time, modification time,
and more.
2 This section contains several parts:
• descriptors for extended attributes
• block allocation maps
• inode allocation maps
• Header pointing to the data (b+-tree root, directory, in-line
data)
3 This section can contain one of the following:
• In-line File data - for very small files (up to 128 bytes)
• The first 8 xad structures describing the extents for this file.
4 This section extends section 3 by providing additional storage for
more attributes, xad structures or in-line data.
Inodes -- continued
The third and fourth 128-byte sections are declared in the on-disk inode
structure as unions (excerpt):
union {
uint8 _data[80];
/* symbolic link.
* link is stored in inode if its length is less than
* IDATASIZE. Otherwise stored like a regular file. */
struct {
uint8 _fastsymlink[128];
} _symlink;
} _data3;
/* IV. type-dependent extension area (128 bytes)
* user-defined attribute, or
* inline data continuation, or
* B+-tree root node continuation */
union {
uint8 _data[128];
} _data4;
}
Inodes -- continued
Inode extents Inodes are allocated dynamically by allocating inode
extents, which are simply contiguous chunks of inodes on the disk. By
definition, a JFS2 inode extent contains 32 inodes. With a 512-byte inode
size, an inode extent therefore occupies 16KB on the disk.
Inodes -- continued
Inode initialization
When a new inode extent is allocated, the extent is not initialized; but for
fsck to be able to check whether an inode is in use, JFS2 needs some
information in it. Once an inode in an extent is marked in-use, its fileset
number, inode number, inode stamp, and the inode allocation group block
address are initialized. Thereafter, the link field is sufficient to
determine whether the inode is currently in use.
Inode Generation Numbers
Inode generation numbers are simply counters that increment each
time an inode is reused. Network file system protocols such as NFS
(implicitly) require them; they form part of the file identifier manipulated by
VNOP_FID() and VFS_VGET().
The static-inode-allocation practice of storing a per-inode generation
counter doesn’t work with dynamic inode allocation, because when an
inode becomes free its disk space may literally be reused for something
other than an inode (e.g., the space may be reclaimed for ordinary file data
storage). Therefore, in JFS2 there is simply one inode generation counter
that is incremented on every inode allocation (rather than one counter per
inode that would be incremented when that inode is reused).
Although a fileset-wide generation counter will recycle faster than a per-
inode generation counter, a simple calculation shows that the 32-bit value
is still sufficient to meet NFS or DFS requirements.
Overview This section introduces the data structures used to describe
where a file’s data is stored.
In-line data If a file contains a small amount of data, the data may be
stored in the inode itself. This is called in-line storage. The header found
in the second section of the inode points to the data, which is stored in
the third and fourth sections of the inode.
[Diagram: an inode whose header points to in-line data stored in the
inode itself.]
Binary trees When more storage is needed than can be provided in-line,
the data must be placed in extents. The header in the inode now becomes
the binary tree root header. If there are 8 or fewer extents for the file,
the xad structures describing the extents are contained in the inode. An
inode containing 8 or fewer xad structures would look like:
[Diagram: an inode whose B+-tree root header is followed by 8 xad entry
slots, three of them in use: (offset 0, addr 68, length 4) pointing to
16KB of data, (offset 4096, addr 84, length 12) pointing to 48KB, and
(offset 26624, addr 256, length 2) pointing to 8KB.]
INLINEEA bit Once the 8 xad structures in the inode are filled, an attempt is made to use
the last quadrant of the inode for more xad structures. If the INLINEEA bit
is set in the di_mode field of the inode, then the last quadrant of the inode
is available for 8 more xad structures.
More extents Once all of the available xad structures in the inode are used, the B+–tree
must be split. 4K of disk space is allocated for a leaf node of the B+–tree,
which is logically an array of xad entries with a header. The 8 xad entries
are moved from the inode to the leaf node, and the header is initialized to
point to the 9th entry as the first free entry. The first xad structure in the
inode is updated to point to the newly allocated leaf node, and the inode
header is updated to indicate that only one xad structure is now being
used, and that it contains the pure root of a B+-tree. The offset for this new
xad structure contains the offset of the first entry in the leaf node.
The organization of the inode now looks like:
[Figure: inode whose first xad (offset 0 / addr 412 / length 4) now points to
a 4K leaf node holding a header and 254 xad entries, which in turn describe
the 16KB, 48KB and 8KB data extents; the remaining inode xads are unused]
Continuing to As new extents are added to the file, they continue to be added to the
add extents leaf node in the necessary order, until the node fills. Once the node fills, a
new 4K of disk space is allocated for another leaf node of the B+–tree, and
the second xad structure from the inode is set to point to this newly
allocated node. The tree now looks like:
[Figure: inode with two xads in use -- the first pointing to the original leaf
node (addr 412), the second (addr 560 / length 4) pointing to the newly
allocated leaf node; each leaf holds a header and 254 xad entries describing
the data extents]
Another split As extents are added to the file, this behavior continues until all 8 xad
structures in the inode contain leaf node xad structures, at which time
another split of the B+–tree will occur. This split creates an internal node of
the B+–tree which is used purely to route the searches of the tree. An
internal node looks exactly like a leaf node. 4K of disk space is allocated
for the internal node of the B+–tree, the 8 xad entries of the leaf nodes are
moved from the inode to the newly created internal node, and the internal
node header is initialized to point to the 9th entry as the first free entry. The
root of the B+–tree is then updated by making the inode’s first xad
structure point to the newly allocated internal node, and the header in the
inode is updated to indicate that now only 1 xad structure is being used for
the B+–tree.
As extents continue to be added, additional leaf nodes are created to
contain the xad structures for the extents, and these leaf nodes are added
to the internal node.
Once the first internal node is filled, a second internal node is allocated,
and the inode’s second xad structure is updated to point to the new internal
node.
This behavior continues until all 8 of the inode’s xad structures contain
internal nodes.
[Figure: two-level B+-tree -- the inode’s xads (e.g. offset 0 / addr 380 /
length 4 and offset 8340 / addr 212 / length 4) point to internal nodes, each
holding a header and 254 xad entries that route searches down to the leaf
nodes describing the data extents]
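The routing role of internal nodes can be sketched as follows. This is a simplified in-memory model (on disk each node is read by its block address, and the entries are the xad structures themselves); the names and types here are illustrative.

```c
/* Simplified model of routing a B+-tree search: descend through
 * internal nodes to the leaf whose entry covers a given file offset.
 * (In-memory sketch; types and field names are illustrative.) */
#include <stdint.h>
#include <stddef.h>

struct node;
struct entry {
    uint64_t offset;        /* first file offset covered by this entry */
    struct node *child;     /* child node (internal nodes only) */
};
struct node {
    int leaf;               /* non-zero for leaf nodes */
    int n;                  /* entries in use */
    struct entry e[254];    /* up to 254 xad entries per 4K node */
};

/* Index of the last entry whose starting offset is <= off. */
static int route(const struct node *nd, uint64_t off)
{
    int i = 0;
    while (i + 1 < nd->n && nd->e[i + 1].offset <= off)
        i++;
    return i;
}

/* Walk internal nodes down to the leaf covering 'off'. */
static const struct node *find_leaf(const struct node *nd, uint64_t off)
{
    while (!nd->leaf)
        nd = nd->e[route(nd, off)].child;
    return nd;
}
```

Because internal nodes hold only routing entries, each lookup touches one node per tree level, regardless of how many extents the file has.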
fsdb Utility
Introduction The fsdb command enables you to examine, alter, and debug a file
system.
Starting fsdb It is best to run fsdb against an unmounted filesystem. Use the following
syntax to start fsdb:
fsdb <path to logical volume>
For example:
# fsdb /dev/lv00
Aggregate Block Size: 512
>
Supported fsdb supports both the JFS and JFS2 file systems. The commands
filesystems available in fsdb differ depending on the filesystem type it is
running against. The following explains how to use fsdb with a JFS2 file
system.
Commands The commands available in fsdb can be viewed with the help command
as shown here.
> help
Xpeek Commands
Exercise 1 - fsdb
Introduction In this lab you will run the fsdb utility against a JFS2 filesystem that was
created for you. The filesystem should not be mounted when running
fsdb. The filesystem may be mounted to examine the files; just be sure to
unmount it before running fsdb.
# fsdb /dev/lv00
Display the root inode for the file set. What command did
you use?
Using fsdb - In the next few steps you will locate and display fileA’s data.
continued
Step Action
4 Display the inode of fileA, what command did you use?
FileB and fileC Use the commands and techniques you learned in the last section to
examine fileB, fileC and fileD. Answer the following questions about these
files:
1. What number inodes are used for fileB, fileC and fileD?
2. How many xad structures are used to describe fileB’s data blocks?
3. How many xad structures are used to describe fileC’s data blocks?
4. Examine the inode for fileD. How big is this file (as shown in di_size)?
Are enough aggregate blocks allocated to store the entire file? Explain
your answer.
Directory
Directory entry Stored in an array, the directory entries link the names of the objects in the
directory to an inode number. The directory entry has the following
members.
Member Description
inumber Inode number
namelen Length of the name.
name[30] File name, up to 30 characters.
next If more than 30 characters are
needed, additional entries are linked
using the next pointer.
Directory -- continued
Member Description
idotdot Inode number of parent directory.
flag Indicates whether the node is an internal or leaf node, and
whether it is the root of the binary tree.
nextindex Last used slot in the directory entry slot array.
freecnt Number of free slots in the directory entry array.
freelist Slot number of the head of the free list.
stbl[8] Indices to the directory entry slots that are currently in
use. The entries are sorted alphabetically by name.
slot[9] Array of directory entry slots (the header is stored in the
first slot, leaving 8 slots for entries).
Leaf and When more than 8 directory entries are needed a leaf or internal node is
internal node added. The directory internal and leaf node headers are similar to the root
header node header, except that they describe up to 128 directory entry slots. The
page header is defined by a dpage_t structure contained in
/usr/include/j2/j2_dtree.h.
Directory -- continued
Directory slot The Directory Slot Array (stbl[]) is a sorted array of indices to the directory
array slots that are currently in use. The entries are sorted alphabetically by
name. This limits the amount of shifting necessary when directory entries
are added or deleted, since the array is much smaller than the entries
themselves. A binary search can be used on this array to search for
particular directory entries.
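The binary search over stbl[] can be sketched like this. The slot format is reduced to a plain C struct with single-byte characters (the real slots are the structures described above), so the code illustrates only the search technique.

```c
/* Sketch: binary search of a directory through the sorted stbl[]
 * array. 'slot' is the directory entry slot array; stbl[i] holds the
 * slot index of the i-th name in alphabetical order. Types are
 * simplified stand-ins for the on-disk structures. */
#include <string.h>

struct dtentry {
    int  inumber;           /* inode number of the object */
    char name[32];          /* simplified: full name in one slot */
};

/* Return the inumber for 'name', or -1 if not present. */
static int dir_lookup(const struct dtentry *slot, const unsigned char *stbl,
                      int nextindex, const char *name)
{
    int lo = 0, hi = nextindex - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        int cmp = strcmp(name, slot[stbl[mid]].name);
        if (cmp == 0)
            return slot[stbl[mid]].inumber;
        if (cmp < 0)
            hi = mid - 1;   /* name sorts before slot[stbl[mid]] */
        else
            lo = mid + 1;
    }
    return -1;
}
```

Because only the one-byte stbl[] indices move when files are added or removed, the larger directory entries themselves stay in place.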
In this example the directory entry table contains four files. The stbl table
contains the slot numbers of the entries ordering the entries alphabetically.
Directory entry table: slot 1 = def, slot 2 = abc, slot 3 = xyz, slot 4 = hij
(slots 5-8 free)
STBL[8] = {2, 1, 4, 3, 0, 0, 0, 0}
. and .. A directory does not contain specific entries for self (“.”) and parent (“..”).
Instead these will be represented in the inode itself. Self is the directory’s
own inode number, and the parent inode number is held in the “idotdot”
field in the header.
Directory -- continued
Growing As the number of files in the directory grows, the directory tables must
directory size increase in size. This table describes the steps used.
Step Action
1 Initial directory entries are stored in directory inode in-line
data area.
2 When the in-line data area of the directory inode becomes
full JFS2 allocates a leaf node the same size as the
aggregate block size.
3 When that initial leaf node becomes full and the leaf node is
not yet 4K, double the current size. First attempt to double
the extent in place; if there is not room to do this, a new
extent must be allocated and the data from the old extent
must be copied to the new extent. The directory slot array
will only have been big enough to reference the slots of the
smaller page, so a new slot array will have to be created.
Use the slots from the beginning of the newly allocated
space for the larger array and copy the old array data to the
new location. Update the header to point to this array and
add the slots for the old array to the free list.
4 If the leaf node again becomes full and is still not 4K repeat
step 3. Once the leaf node reaches 4K allocate a new leaf
node. Every leaf node after the initial one will be allocated
as 4K to start.
5 When all entries are free in a leaf page, the page will be
removed from the B+–tree. When all the entries in the last
leaf page are deleted, the directory will shrink back into the
directory inode in-line data area.
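Steps 2 through 4 above amount to a simple growth policy, sketched here with sizes in bytes (the starting size would be the aggregate block size; the constants are illustrative):

```c
/* Growth policy for a directory leaf (steps 2-4 above): double the
 * node until it reaches 4K; after that, new leaves start at 4K. */
static int grow_leaf(int cur_size)
{
    if (cur_size < 4096) {
        int next = cur_size * 2;    /* step 3: double, in place if possible */
        return next > 4096 ? 4096 : next;
    }
    return 4096;                    /* step 4: further leaves start at 4K */
}
```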
Directory Examples
Introduction This section demonstrates how the directory structures change over time.
Small Initial directory entries are stored in the directory inode in-line data area.
Directories Examine this example of a small directory. In this example all the
directory entries fit into the in-line data area:
# ls -ai
69651 .
2 ..
69652 foobar1
69653 foobar12
69654 foobar3
69655 longnamedfilewithover22charsinitsname
1 inumber: 69652
next: -1
namelen: 7
name: foobar1
2 inumber: 69653
next: -1
namelen: 8
name: foobar12
3 inumber: 69654
next: -1
namelen: 7
name: foobar2
4 inumber: 69655
next: 5
namelen: 37
name:longnamedfilewithover2
5
next: -1
cnt: 0
name: 2charsinitsname
Note: the file with a long name has its name split across two slots.
Adding a file An additional file called “afile” is created. The details for this file are added
at the next free slot (slot 6). As this is now, alphabetically, the first file in
the directory, the search table array (stbl[]) is re-organized, such that the
entry for slot 6 is now the first entry.
# ls -ai
69651 .
2 ..
69656 afile
69652 foobar1
69653 foobar2
69654 foobar3
69655 longnamedfilewithover22charsinitsname
flag: BT_ROOT BT_LEAF
nextindex: 5
freecnt: 2
freelist: 7
idotdot: 2
stbl: {6,1,2,3,4,0,0,0}
1 inumber: 69652
next: -1
namelen: 7
name: foobar1
2 inumber: 69653
next: -1
namelen: 8
name: foobar12
3 inumber: 69654
next: -1
namelen: 7
name: foobar2
4 inumber: 69655
next: 5
namelen: 37
name:longnamedfilewithover2
5
next: -1
cnt: 0
name: 2charsinitsname
6 inumber: 69656
next: -1
namelen: 5
name: afile
Adding a leaf When the directory grows to the point where there are more entries than
node can be stored in the in-line data area of the inode, JFS2 allocates a leaf
node the same size as the aggregate block size. The in-line entries are
moved to a leaf node as illustrated.
Inode root node:
flag: BT_ROOT BT_INTERNAL
nextindex: 1
freecnt: 7
freelist: 2
idotdot: 2
stbl: {1,2,3,4,5,6,7,8}
1 xd.len: 1
xd.addr1: 0
xd.addr2: 52
next: -1
namelen: 0
name: file0

Leaf node (block 52):
flag: BT_LEAF
nextindex: 20
freecnt: 103
freelist: 25
maxslot: 128
stbl: {1,2,15, ... 8,13,14}
1 inumber: 5
next: -1
namelen: 5
name: file0
2 inumber: 6
next: -1
namelen: 5
name: file1
3 inumber: 15
next: -1
namelen: 6
name: file10
...
19 inumber: 23
next: -1
namelen: 6
name: file18
20 inumber: 24
next: -1
namelen: 6
name: file19
Once the leaf is full, an internal node is added at the next free in-line data
slot in the inode, which will contain the address of the next leaf node.
Note: the internal node entry contains the name of the first file (in
alphabetical order) for that leaf node.
Adding an Once all the in-line slots have been filled by internal nodes, a separate
internal node node block is allocated, the entries from the in-line data slots are moved to
this new node, and the first in-line data slot is updated with the address of
the new internal node.
After many extra files have been added to the directory, two layers of
internal nodes are required to reference all the files.
Note that now the internal node entries in the inode contain the name of
the alphabetically first entry referenced by each of the second-level internal
nodes, and each entry in these references the name of the alphabetically
first entry in each leaf node.
Exercise 2 - Directories
Introduction In this exercise you will use the fsdb utility to examine directory inodes in
a jfs2 filesystem.
Small Run fsdb on the sample filesystem. Use the following steps to examine
directories the directory node for /mnt/small.
Step Action
1 Find the inode for directory small:
> dir 2
2 Display the inode found in the last step.
> i <inode number>
3 Using the t sub-command display the directory node root
header.
#touch /mnt/small/a
Predict what the stbl[] table for directory small will look like
now?
Larger In this section you will examine the directory node structures for some
directories larger directories.
Step Action
1 What is the inode for the directory called medium?
2 Display the inode and look at the root tree header. The
flags should indicate that this is an internal header. One
entry should be found for each leaf node. Display the
entries with the <enter> key. How many leaf nodes are
there?
3 Use the down subcommand to display the first leaf node
header. How many entries is this header currently
describing?
Objectives
After completing this unit, you should be able to:
• Identify the various components that make up the logical and virtual
file systems
• Use the debugger (kdb/iadb) to display these components.
References
Introduction This lesson covers the interface and services that AIX 5L provides to
physical file systems. The Logical File System (LFS), Virtual File System
(VFS) and the interface between these components and physical file
systems are discussed in this lesson.
Supported file Using the structure of the logical file system and the virtual filesystem,
systems AIX 5L can support a number of different file system types transparently to
application programs. These file systems reside below the LFS/VFS and
operate relatively independently of each other. Currently AIX 5L supports
the following physical filesystem implementations:
• Enhanced Journaled Filesystem (JFS2)
• Journaled filesystem (JFS)
• Network File System (NFS)
• A CD-ROM File system which supports ISO-9660, High Sierra and
Rock Ridge formats.
Extensible The LFS/VFS interface also provides a relatively easy means by which
third party filesystem types can be added without any changes to the LFS.
System Call
• System calls
• Logical File System (LFS)
• Virtual File System (VFS)
• File System Implementation - support of individual file system layout.
• Fault Handler - device page fault handler support in the VMM.
• Device Driver - actual device driver code to interface with the device.
It is invoked by the page fault handler when the file system
[Figure: layered stack -- System Call, Logical File System, Virtual File
System, File System Implementation, Fault Handler]
Internal data This illustration shows the major data structures that will be discussed in
structures this lesson. This illustration is repeated throughout the lesson
highlighting the areas being discussed.
[Figure: LFS/VFS data structures -- the u-block’s User File Descriptor Table
points into the System File Table, whose entries point to vnodes; each vnode
links to a gnode (embedded in an inode) and to a vfs, which links to its gfs,
vmount, vnodeops and vfsops]
LFS Data The data structures discussed in this section are the System Open File
Structures Table and the User File Descriptor Table. The system open file table has
one entry for each open file on the system. The user file descriptor table
(one per process) contains entries for each of the process’s open files.
Operations The LFS provides a standard set of operations to support the system call
interface; its routines manage the open file table entries and the per-
process file descriptors. It provides:
• the User File Descriptor Table.
• the System File table. An open file table entry records the authorization
of a process’s access to a file system object.
The LFS abstraction specifies the set of file system operations that an
implementation must include in order to carry out logical file system
requests. Physical file systems can differ in how they implement these
predefined operations, but they must present a uniform interface to the
LFS. It supports UNIX-like file system access semantics, but other non-
UNIX file systems can also be supported.
User interface A user can refer to an open file table entry through a file descriptor held in
the thread’s ublock, or by accessing the virtual memory to which the file
was mapped. The file descriptor table entry is created when the file is
initially opened, via the open() system call and will remain until either the
user closes the file via the close() system call, or the process terminates.
The LFS is the level of the file system at which users can request file
operations by using system calls, such as open(), close(), read(), write()
etc. For all these calls (except open()), the file descriptor number is passed
as an argument to the call. The system calls implement services that are
exported to users, and provide a consistent user mode programming
interface to the LFS that is independent of the underlying file system type.
System calls that carry out file system requests:
• Map the user’s parameters to a file system object. This requires that the
system call component use the vnode (virtual node) component to
follow the object’s path name. In addition, the system call must resolve
a file descriptor or establish implicit (mapped) references using the open
file component.
• Verify that a requested operation is applicable to the type of the
specified object.
• Dispatch a request to the file system implementation to perform
operations.
Description The user file descriptor table is contained in the user area, and is a per-
process resource. Each entry references an open file, device, or socket
from the process’s perspective. The index into the table for a specific file is
the value returned by the open() system call when the file is opened - the
file descriptor.
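Conceptually, resolving a file descriptor is just an index into this table. A simplified sketch (the structures here are reduced from the real ufd and file declarations):

```c
/* Sketch: a file descriptor is an index into the per-process user
 * file descriptor table; each in-use entry points at a system file
 * table entry. Types are simplified from the real declarations. */
#include <stddef.h>

struct file;                        /* system open file table entry */

struct ufd {
    struct file   *fp;              /* system file table entry, or NULL */
    unsigned short flags;
    unsigned short count;
};

/* Resolve fd to its file table entry; NULL corresponds to EBADF. */
static struct file *fd_to_file(struct ufd *ufdtab, int nfd, int fd)
{
    if (fd < 0 || fd >= nfd || ufdtab[fd].fp == NULL)
        return NULL;
    return ufdtab[fd].fp;
}
```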
Table One or more slots of the file descriptor table are used for each open file.
Management The file descriptor table can extend beyond the first page of the ublock, and
is page-able. There is a fixed upper limit of 32768 open file descriptors per
process (defined as OPEN_MAX in /usr/include/sys/limits.h). This value is
fixed, and may not be changed.
User File The user file descriptor table consists of an array of user file descriptor
Descriptor table structures defined in /usr/include/sys/user.h in the structure ufd:
Table structure
struct ufd {
struct file * fp;
unsigned short flags;
unsigned short count;
#ifdef __64BIT_KERNEL
unsigned int reserved;
#endif /* __64BIT_KERNEL */
};
Description The system file table is a global resource, and is shared by all processes
on the system. One unique entry is allocated for each unique open of a file,
device, or socket in the system.
Table The table is a large array, and is partly initialized. It grows on demand, and
Management is never shrunk. Once entries are freed, they are added back onto the free
list (ffreelist). The table can contain a maximum of 1,000,000 entries, and
is not configurable.
Table entries The file table array consists of struct file data elements. Several of the
key members of this data structure are described in this table.
Member Description
f_count A reference count field detailing the current
number of opens on the file. This value is
increased each time the file is opened, and
decremented on each close(). Once the
reference count is zero, the slot is considered
free, and may be re-used.
f_flag various flags described in fcntl.h
f_type a type field describing the type of file:
/* f_type values */
#define DTYPE_VNODE 1 /* file */
#define DTYPE_SOCKET 2 /* communications endpoint */
#define DTYPE_GNODE 3 /* device */
#define DTYPE_OTHER -1 /* unknown */
f_offset a read/write pointer.
f_data Defined as f_up.f_uvnode, it is a pointer to
another data structure representing the object,
typically the vnode structure.
f_ops a structure containing pointers to functions for
the following file operations: rw (read/write),
ioctl, select, close, fstat.
Overview The Virtual File System (VFS) defines a standard set of operations on an
entire file system. Operations performed by a process on a file or file
system are mapped through the VFS to the file system below. In this way,
the process need not know the specifics of different file systems (such as
JFS, J2, NFS or CDROM).
Functional For the purpose of this lesson the VFS will be broken into three sections
sections and described separately. These sections are:
• Vnode-VFS interface
• File and File System Operations
• The gnode
Vnode/vfs interface
Overview The interface between the logical file system and the underlying file
system implementations is referred to as the vnode/vfs interface. This
interface provides a logical boundary between generic objects understood
at the LFS layer and the file system specific objects that the underlying file
system implementation must manage such as inodes and super blocks.
The LFS is relatively unaware of the underlying file system data structures
since they can be radically different for the various file system types.
Data Vnodes and vfs structures are the primary data structures used to
Structures communicate through the interface (with help from the vmount).
• vnode - represents a file
• vfs - represents a mounted file system
• vmount - contains specifics of the mount request.
History The vnode and vfs structures of the LFS were created by Sun
Microsystems and have evolved into a de-facto industry standard, thanks
in part to NFS.
Vnodes
Overview The vnode provides a standard set of operations within the file system,
and provides system calls with a mechanism for local name resolution.
This allows the logical file system to access multiple file system
implementations through a uniform name space.
Detail Vnodes are the primary handles by which the operating system references
files, and represent access to an object within a virtual file system. Each
time an object (file) within a file system is located (even if it is not opened),
a vnode for that object is located (if already in existence), or created, as
are the vnodes for any directory that has to be searched to resolve the
path to the object.
As a file is created, a vnode is also created, and will be re-used for every
subsequent reference made to the file by a path name. Every path name
known to the logical file system can be associated with, at most, one file
system object, and each file system object can have several names
because it can be mounted in different locations. Symbolic links and hard
links to an object always get the same vnode if accessed through the same
mount point.
vnode Vnodes are created by the vfs-specific code when needed, using the
Management vn_get kernel service. Vnodes are deleted with the vn_free kernel service.
Vnodes are created as the result of a path resolution.
Description When a new file system is mounted, a vfs and a vmount structure are
created. The vmount structure contains specifics of the mount request,
such as the object being mounted, and the stub over which it is being
mounted. The vfs structure is the connecting structure which links the
vnodes (representing files) with the vmount information, and the gfs
structure that helps define the operations that can be performed on the
filesystem and its files.
vfs The vfs structure is the connecting structure which links the vnodes
(representing files) with the vmount information, and the gfs structure
which provides a path to the operations that can be performed on the
filesystem and its files.
Element Description
*vfs_next vfs structures form a linked list, with the first
vfs entry addressed by the rootvfs variable,
which is private to the kernel.
*vfs_gfs path back to the gfs structure and its file
system specific subroutines through the
vfs_gfs pointer.
vfs_mntd The vfs_mntd pointer points to the vnode
within the file system which generally
represents the root directory of the file
system.
vfs_mntdover The vfs_mntdover pointer points to a vnode
within another file system, also usually
representing a directory, which indicates
where the file system is mounted. In this
sense, the vfs_mntd pointer corresponds to
the object within the vmount structure
referenced by the vfs_mdata pointer, and
the vfs_mntdover pointer corresponds to the
stub within the vmount structure referenced
by the vfs_mdata pointer.
vfs_nodes Pointer to all vnodes for this file system.
vfs_mdata Pointer to the vmount providing mount
information for this filesystem
vfs The mount helper creates the vmount structure, and calls the vmount
Management subroutine. The vmount subroutine then creates the vfs structure, partially
populates it, and invokes the file system dependent vfs_mount subroutine
which completes the vfs structure, and performs any operations required
internally by the particular file system implementation.
There is one vfs structure for each file system currently mounted. New vfs
structures are created with the vmount subroutine. This subroutine calls
the vfs_mount subroutine found within the vfsops structure for the
particular virtual file system type. The vfs entries are removed with the
uvmount subroutine. This subroutine calls the vfs_umount subroutine from
the vfsops structure for the virtual file system type.
vmount The vmount structure contains specifics of the mount request. The vmount
structure is defined in /usr/include/sys/vmount.h
struct vmount {
uint vmt_revision; /* I revision level, currently 1 */
uint vmt_length; /* I total length of structure & data */
fsid_t vmt_fsid; /* O id of file system */
int vmt_vfsnumber; /* O unique mount id of file system */
uint vmt_time; /* O time of mount */
uint vmt_timepad; /* O (in future, time is 2 longs) */
int vmt_flags; /* I general mount flags */
/* O MNT_REMOTE is output only */
int vmt_gfstype; /* I type of gfs, see MNT_XXX above */
struct vmt_data {
short vmt_off; /* I offset of data, word aligned */
short vmt_size; /* I actual size of data in bytes */
} vmt_data[VMT_LASTINDEX + 1];
};
Overview Each file system type extension provides functions to perform operations
on the filesystem and its files. Pointers to these functions are stored in the
vfsops (filesystem operations) and vnodeops (file operations) structures.
gfs
Description There is one gfs structure for each type of virtual file system currently
installed on the machine. For each gfs entry, there may be any number of
vfs entries.
Purpose The operating system uses the gfs entries as an access point to the virtual
file system functions on a type-by-type basis. There is no direct link from a
gfs entry to all of the vfs entries of a particular gfs type. The file system
code generally uses the gfs structure as a pointer to the vnodeops
structure and the vfsops structure for a particular type of file system.
gfs The gfs structures are stored within a global array accessible only by the
management kernel. The gfs entries are inserted with the gfsadd() kernel service, and
only one gfs entry of a given gfs_type can be inserted into the array.
Generally, gfs entries are added by the CFG_INIT section of the
configuration code of the file system kernel extension. The gfs entries are
removed with the gfsdel() kernel service. This is usually done within the
CFG_TERM section of the configuration code of the file system kernel
extension.
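The add/delete discipline can be sketched as a small registry keyed by gfs_type. This is an illustrative model of the behavior described above, not the kernel's gfsadd()/gfsdel() code; the table size and field names are assumptions.

```c
/* Illustrative model of the gfs registry: one entry per filesystem
 * type, added at CFG_INIT and removed at CFG_TERM. Not the kernel's
 * actual gfsadd()/gfsdel(); sizes and names are assumptions. */
#include <stddef.h>

#define GFS_MAX 16                  /* illustrative table size */

struct gfs_entry {
    int         gfs_type;
    const void *vfs_ops;            /* would point to a vfsops */
    const void *vn_ops;             /* would point to a vnodeops */
    int         used;
};

static struct gfs_entry gfs_table[GFS_MAX];

/* Register a filesystem type; fails if the type is already present. */
static int gfs_add(int type, const void *vfsops, const void *vnops)
{
    if (type < 0 || type >= GFS_MAX || gfs_table[type].used)
        return -1;                  /* only one entry per gfs_type */
    gfs_table[type].gfs_type = type;
    gfs_table[type].vfs_ops  = vfsops;
    gfs_table[type].vn_ops   = vnops;
    gfs_table[type].used     = 1;
    return 0;
}

/* Remove a previously registered type. */
static int gfs_del(int type)
{
    if (type < 0 || type >= GFS_MAX || !gfs_table[type].used)
        return -1;
    gfs_table[type].used = 0;
    return 0;
}
```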
vnodeops
vnodeops There is one vnodeops structure per filesystem kernel extension loaded
management (that is, one per unique filesystem type); it is initialized when the extension
is loaded.
struct vnodeops {
/* creation/naming/deletion */
int (*vn_link)(struct vnode *, struct vnode *, char *,
struct ucred *);
int (*vn_mkdir)(struct vnode *, char *, int32long64_t,
struct ucred *);
int (*vn_mknod)(struct vnode *, caddr_t, int32long64_t,
dev_t, struct ucred *);
int (*vn_remove)(struct vnode *, struct vnode *, char *,
struct ucred *);
int (*vn_rename)(struct vnode *, struct vnode *, caddr_t,
struct vnode *,struct vnode *,caddr_t,struct ucred *);
int (*vn_rmdir)(struct vnode *, struct vnode *, char *,
struct ucred *);
vfsops
vfsops There is one vfsops structure per filesystem kernel extension loaded (that
management is, one per unique filesystem type); it is initialized when the extension is
loaded.
The Gnode
Creation A gnode refers directly to a file (regular, directory, special, and so on), and
is usually embedded within a file system implementation specific structure
(such as an inode). Gnodes are created as needed by file system specific
code at the same time as creating implementation specific structures. This
is normally immediately followed by a call to the vn_get kernel service to
create a matching vnode. The gnode structure is usually deleted either
when the file it refers to is deleted, or when the implementation specific
structure is being reused for another file.
gnode and The gnode is typically embedded in an in-core inode. The member
inode gnode->gn_data points to the start of the inode.
[Figure: gnode embedded in an in-core inode, with gnode->gn_data pointing
back to the start of the inode]
Exercise 1
Overview This exercise will test your knowledge of the data structures of the LFS and
VFS and the relationships between them.
lab Use the following list of terms to best complete the statements below.
File / File system / System File Table / vfs / vnodeops / vmount
[Figure: partially labeled diagram of the LFS/VFS data structures --
u-block, inode, gnode, User File Descriptor Table, System File Table,
vnodeops, vfs, vfsops]
6. Label the blocks representing the vnode, vmount and gfs structures
7. Draw a line representing the file pointer in the ufd to an entry in the
system file table.
Lab Exercise 1
Overview In the following exercise you will run a small C program that opens a file,
initializes it by writing a few bytes to it, then pauses. The pause allows us
to investigate the various LFS structures that are created by opening the
file, using the appropriate system debugger.
The close() then open() is required to ensure that the write is committed to
disk and hence that the inode is updated.
Save this code to a file called t.c, and compile it using “make t”.
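The program listing itself is missing from these materials; a minimal reconstruction consistent with the description might look like this. The file name "foo" is an assumption (it matches the later "ls -lia foo" step), and the core is factored into a function here so it can be exercised directly.

```c
/* Sketch of t.c -- the original listing is not reproduced here, so
 * this reconstruction follows the description above. The file name
 * "foo" and the exact bytes written are assumptions. */
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

int open_write_reopen(const char *path)
{
    int fd = open(path, O_CREAT | O_RDWR, 0644);
    write(fd, "some data", 9);
    close(fd);                  /* close, then open again, so the write
                                   is committed and the inode updated */
    fd = open(path, O_RDWR);    /* with 0-2 as the std streams, this is 3 */
    printf("fd = %d\n", fd);
    return fd;
}

/* In t.c proper, main() would call open_write_reopen("foo") and then
 * pause(), keeping the file open while it is inspected from kdb. */
```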
Stage Description
1 Enter the C program from above, save it to a file called t.c
and compile with the command:
$ make t
2 Execute the program created in the last step. It will print the
file descriptor number of the file it creates, then pause.
$ ./t
fd = 3
3 From another shell on the same system, enter the system
debugger (kdb or iadb).
Lab
Stage Description
4 Initially, we need to find the address of the file structure for
the open file. We know that the file descriptor for our
program is number 3, so we have to find the mapping
between the file descriptor number and the file structure.
This mapping is done from the file descriptor table in the
uarea structure for the process. To find the uarea, find the
slot number in the thread table that our “t” process occupies;
the uarea slot number will be the same.
For kdb use the “th *” command to display all the threads.
Page down through the entries until you find the correct
entry:
(0)> th *
SLOT NAME STATE TID PRI RQ CPUID CL WCHAN
lab
Stage Description
6 The file structure for file descriptor 3 is at address
F100009600007700. Use the “file” command along with this
address to display the contents of the structure:
(0)> file F100009600007700
ADDR COUNT OFFSET DATA TYPE FLAGS
F100009600007700 … … F10000971528A380 VNODE READ
node … slot …
f_flag … f_count …
f_options … f_type …
f_data F10000971528A380 f_offset …
f_dir_off … f_cred …
f_lock @ … f_lock
f_offset_lock @ … f_offset_lock
f_vinfo … f_ops … vnodefops
VNODE F10000971528A380
v_flag … v_count …
v_vfsgen … v_vfsp …
v_lock @ … v_lock
v_mvfsp … v_gnode …
v_next … v_vfsnext …
v_vfsprev … v_pfsvnode …
v_audit …
Note that halfway down the output, the address of the
vnode structure that corresponds to this file is printed,
followed by the contents of this vnode structure.
(We could also display the vnode structure separately by
running the kdb command “vnode” with the address
F10000971528A380.)
8 There are two items that we are interested in from the vnode
structure displayed in the last step: the v_vfsp address,
which points to the filesystem that contains the vnode, and
the v_gnode address, which points to the gnode structure
for the file. From the gnode we can display the inode
structure for the file.
Lab
Step Action
9 The inode address is contained in the gn_data field, in this
case F10000971528A3D8. Use the kdb command “inode”
to display this structure:
(0)> inode F10000971528A3D8
DEV NUMBER CNT TYPE FLAGS
KERN_heap … … REG …
forw … back …
next … prev …
gnode @ … number …
dev … ipmnt …
flag … locks … bigexp … compress …
cflag … count … sync … sn … id …
moved … frag … open … event FFFFFFFFFFFFFFFF
hip … nodelock …
nodelock @ … dquot[USR] …
dquot[GRP] … dinode @ …
cluster … rcluster … diocnt … nondio …
size … gets …
GNODE …
gn_type … gn_flags …
gn_seg …
gn_mwrcnt … gn_mrdcnt … gn_rdcnt …
gn_wrcnt … gn_excnt … gn_rshcnt …
gn_ops … jfs_vops
gn_vnode F10000971528A380 gn_rdev …
gn_chan … gn_reclk_event FFFFFFFF
gn_reclk_lock @ … gn_reclk_lock
gn_filocks … gn_data F10000971528A3D8
gn_type REG
di_gen … di_mode … di_nlink …
di_acct … di_uid … di_gid …
di_nblocks … di_acl …
di_mtime … di_atime … di_ctime …
di_size_hi … di_size_lo … di_sec …
di_rdaddr …
di_vindirect … di_rindirect …
di_privoffset … di_privflags … di_priv …
VNODE F10000971528A380
v_flag … v_count …
v_vfsgen … v_vfsp …
v_lock @ … v_lock
v_mvfsp … v_gnode …
v_next … v_vfsnext …
v_vfsprev … v_pfsvnode …
v_audit …
10 The inode command displays the inode, gnode and vnode
structures.
$ ls -lia foo
Lab Exercise 2
Overview The instructor will create a shell script that simply prints its process
id, then pauses.
Both the “ps” command and the process and thread table entries for this
script will list the name of the shell that is executing it (e.g. “ksh”) rather
than the name of the script.
Objective To determine the name of the script that the instructor is running.
Tips • Remember that the shell will have to open() the script prior to executing
it.
• The command find . -inum xxx can be used to find the name of a
file, given the filesystem mount point and an inode number.
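The effect of find . -inum can be approximated in a few lines of Python, which may help visualize what the command does (a sketch; real find also supports pruning and -xdev to stay on one filesystem, which this helper does not):

```python
import os

def find_by_inum(mount_point, inum):
    """Walk a directory tree and return paths whose inode number matches,
    roughly what `find <mount_point> -inum <inum>` does."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(mount_point):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_ino == inum:
                    matches.append(path)
            except OSError:
                pass  # entry vanished or is unreadable; skip it
    return matches
```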
Objectives
After completing this unit, you should be able to:
• List and locate boot components and their usage
• Understand the three phases of rc.boot
• Understand the contents and usage of a RAMFS
• Understand the ODM structure and the usage of ODM classes
• Create new boot images
• Debug boot problems
What is boot
Definition Boot is the process that begins when the computer is powered on and
continues until the entries in /etc/inittab have been processed.
ROS process System ROS (Read-Only Storage) contains firmware, independent of the
operating system, that initializes the hardware and loads AIX.
All platforms except RS6K use an intermediate boot process called:
• Softros : (/usr/lib/boot/aixmon_chrp) for CHRP systems
• Softros : (/usr/lib/boot/aixmon_rspc) for RSPC systems
• Boot loader : (/usr/lib/boot/boot_elf) for IA-64 systems
AIX process AIX begins execution after the system ROS firmware or the intermediate
boot process finishes its execution:
• firmware information is set up
• the kernel initializes
• configuration runs from the RAM filesystem
• control is passed to files based in the permanent filesystem (this may be
a disk or network filesystem)
• /etc/inittab entries are processed. This usually includes enabling the
user login process.
Configuration The boot process can use one of the following boot configurations :
• standalone
• diskless/dataless (Not supported on IA64 platform)
• operating system installation/software maintenance
• diagnostics
Hard disk boot The hard disk boot has the following characteristics :
• the boot image resides on the hard disk
• the RAM filesystem contains the files necessary for configuring the hard
disk(s), and then accessing the filesystems that reside in the root
volume group (rootvg)
• this is the most common system configuration
• these types of systems are also known as “standalone” systems
• these types of systems may also be booted into the diagnostics
functions
CDROM boot The CDROM boot may be used in the following situations:
• operating system installation
• diagnostics
• hard disk boot failure recovery/maintenance
Network boot The network boot can be used for the following purposes :
• boot and install the operating system
• the operating system is installed on a hard disk with NIM
• subsequent boots are from the hard disk
• supported diskless/dataless configurations
• diagnostics
• hard disk boot failure recovery/maintenance
The centralized boot/filesystem servers offer convenient administration.
Introduction In order to successfully boot a system, the AIX kernel needs basic
commands, configuration files, kernel extensions and device drivers to be
able to configure a minimum environment.
All the needed files are included in the RAMFS, which is created with the
following command:
mkfs -V jfs -p <proto> <temp_filesystem_file>
prototype file A prototype file is a list of files and file descriptions that are needed
description to create a RAMFS.
A prototype file entry has the following format:
<dest_file_name> <type> <mode> 0 0 <full_path_name>
Where:
• <dest_file_name> : is the name of the file, directory, link or device as it
will be written to the RAMFS
• <type> : defines the type of the entry and can be :
• d--- : a directory entry (this will change the relative path of the
following entries).
• l--- : a link (the target will be listed in the <full_path_name>
parameter)
• b--- : a block device (the <full_path_name> parameter will represent
the major and minor numbers)
• c--- : a character device (the <full_path_name> parameter will
represent the major and minor numbers)
• ---- : a file
• <mode> : represents the file permissions in octal format
• <full_path_name> : value will depend on the <type> as described
before.
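A minimal parser for the entry format above can be sketched in Python (an illustration of the format only; the grammar accepted by mkfs is richer than this single-entry form, and parse_proto_entry is a hypothetical helper name):

```python
def parse_proto_entry(line):
    """Parse one prototype entry of the form:
       <dest_file_name> <type> <mode> 0 0 <full_path_name>"""
    dest, ftype, mode, uid, gid, target = line.split(None, 5)
    kinds = {"d": "directory", "l": "link", "b": "block device",
             "c": "character device", "-": "file"}
    return {
        "dest": dest,
        "kind": kinds[ftype[0]],   # first character of the 4-character type
        "mode": int(mode, 8),      # permissions are octal
        "uid": int(uid),
        "gid": int(gid),
        "target": target,          # path, link target, or major/minor pair
    }
```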
prototype files Prototype files are divided into several parts according to their specific
types use:
• Prototype files located in /usr/lib/boot are the base prototypes used for
a platform according to the boot device type; they come with the
platform base system device fileset
• Prototype files located in /usr/lib/boot/network are specific to each
general kind of network boot device; they come with the platform base
system device fileset
• Prototype files located in /usr/lib/boot/protoext are used for a specific
type of boot device; they come with the device-specific fileset
Introduction In order to successfully boot from a device, the administrator needs to
run commands that create the boot structure.
bosboot The bosboot command is the most commonly used on AIX because it
command manages all verification tasks and environment setup for the administrator.
The administrator can also use the mkboot command, but must then take
care of all these preliminary checks manually.
The bosboot command is also used by other commands, such as mksysb,
and by the installp post-installation process when installing packages that
need to build a new boot image.
argument description
-a Create complete boot image and device.
-w file Copy given boot image file to device.
-r file Create ROS Emulation boot image.
-d device Device for which to create the boot image.
-U Create uncompressed boot image.
-p proto Use given proto file for RAM disk file system.
-k kernel Use given kernel file for boot image.
-l lvdev Target boot logical volume for boot image.
-b file Use given file name for boot image name.
-D Load Low Level Debugger.
-I Load and Invoke Low Level Debugger.
-L Enable MP locks instrumentation (MP kernels)
-M norm|serv|both Boot mode - normal or service
-O offset boot image offset for CDROM file system.
-q Query size required for boot image.
AIX 5L Distributions
Power CDROM The distribution CDROM that IBM provides to customers has three
Distributions boot images: one for the RS6K computers, a second for the RSPC
computers, and a third for CHRP (/ppc/chrp/bootfile.exe). RS6K, RSPC,
and CHRP UP computers can all use the MP kernel, which is the method
implemented for distribution media that goes to customers. In other words,
when a customer receives boot/install media from IBM, there is no need to
determine whether the system is UP or MP; the boot image is created
using the MP kernel. The UP kernel is more efficient on uniprocessor
systems, but the strategy of a single boot image for both hardware
platform types lowers distribution cost and is more convenient for
customers.
IA-64 CDROM
Distributions
Checkpoint
Introduction Take a few minutes to answer the following questions. We will review the
questions as a group when everyone has finished.
Quiz 1. What is the name of the file used as a SOFTROS on CHRP systems?
3. What are the common functions of the ROS, SOFTROS and EFI boot
loader?
Instructor Notes
Introduction This section explains the boot mechanism used by Power family
systems.
Boot overview When the system is powered on, the ROS or the firmware looks for the
boot record on the device pointed to by the bootlist to find the boot entry
point.
The softros on RSPC and CHRP executes and, if needed, uncompresses
the boot image using the bootexpand process.
It then loads the kernel, which initializes itself.
The kernel then calls init (in fact /usr/lib/boot/ssh at this stage).
The ssh then calls rc.boot for PHASE I and PHASE II, which are specific to
each boot device type.
Then init executes rc.boot phase 3 and the remaining common code in
rc.boot for disk and network boot devices.
Boot diagram The following summarizes the high-level boot process shown in the
original flowchart:
1. The system ROS or firmware executes.
2. The boot record is read from the boot device.
3. If the boot image is compressed, bootexpand uncompresses it.
4. The kernel initializes.
5. The kernel calls init (/usr/lib/boot/ssh).
6. The init ssh calls rc.boot PHASE I & II.
7. init exits to newroot.
8. init calls rc.boot PHASE III from inittab and processes the rest of the
inittab entries.
The same chart shows the boot disk layout: bootrecord, VGDA, boot
logical volume (softros on chrp and rspc, bootexpand, compressed kernel,
compressed RAM filesystem, base customized data) and the rest of the
disk.
bootrecord A 512-byte block containing the size and location of the boot image. The
boot record is the first block on a disk or CDROM and is therefore
separated from the boot image. The boot image on a disk is placed in the
boot logical volume, which is a reserved contiguous area.
kernel The AIX 32-bit UP, 32-bit MP or 64-bit MP kernel, to which control
passes after expansion by bootexpand. The kernel initializes itself and
then passes control to the simple shell init (ssh) in the RAM filesystem.
RAM Filesystem used during the boot process that contains programs and data
filesystem for initializing devices and subsystems in order to install AIX, execute
diagnostics, or to access and bring up the rest of AIX.
base Area of the hard disk boot logical volume containing user configured
customized ODM device configuration information that is used by the system
data configuration process.
Introduction On Power systems, the boot record is located at the beginning of the boot
device and contains the following information:
• The IPL record
• The boot partition table used by chrp and rspc systems.
IPL record The following table describes the content of the boot record.
description
boot partition The boot record contains four partition table entries starting at offset
table 0x1be. Each entry contains the following information:
boot partition The RS6K platform doesn’t use a boot partition table. The four boot
tables entries partition table entries are used for:
• CHRP boot images
• CHRP and First RSPC boot image
• CHRP and Second RSPC boot image
• CHRP Third RSPC boot image
Example The following chart represents an AIX 5L boot record from a CHRP
system. It was obtained using:
od -Ax -x /dev/hdisk0|pg
0000000 c9c2 d4c1 0000 0000 0000 0000 0000 0000
0000010 0000 0000 0000 0000 0000 0000 0000 0000
0000020 0000 0000 0000 2cc1 0000 0000 0000 1100
0000030 0000 0000 0000 0000 0000 0000 0000 0000
0000040 0100 0100 0000 3cdc 0000 3cdc 0000 0000
0000050 0000 0000 0000 0000 0000 0000 0000 0000
0000060 0000 0000 0000 2cc1 0000 0000 0000 1100
0000070 0000 0000 0000 0000 0000 0000 0000 0000
0000080 0007 1483 229d 0662 0000 0000 0000 0000
0000090 0000 0000 0000 0000 0000 0000 0000 0000
The margin labels of the original chart identify, in this dump, the IBMA
signature (c9c2 d4c1 in EBCDIC), base_cs_start, boot_lv_start,
boot_code_len, base_cn_length, base_cs_length, PVID, serv_code_length,
base_cn_start, ser_lv_start and, further down at offset 0x1be, the
boot_partition_table and BOOT_SIGNATURE.
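The leading bytes c9c2 d4c1 are the EBCDIC encoding of the IPL record signature “IBMA”, which can be verified with a couple of lines of Python (cp037 is a standard EBCDIC code page):

```python
# Decode the first four bytes of the boot record dump from EBCDIC.
sig = bytes.fromhex("c9c2d4c1")
print(sig.decode("cp037"))  # -> IBMA
```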
Instructor Notes
Little endian The RBA and sector count information in the boot partition table is in
format little-endian format.
To obtain the actual values, you need to swap the two bytes of each
16-bit word as displayed by the od command.
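The swap can be illustrated in Python (the sample value is only an arithmetic example, not taken from a real boot record):

```python
def swap16(word):
    """Swap the two bytes of a 16-bit word as printed by od -x."""
    return ((word & 0xFF) << 8) | ((word >> 8) & 0xFF)

print(hex(swap16(0x1100)))  # -> 0x11
```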
Introduction Depending on the architecture, the boot image will not always contain the
same elements, due to the needs of the ROS and firmware specifications.
RS6K boot The rs6k platform doesn’t need a softros emulation, so the boot image
image starts with the bootexpand program. The bootexpand program is loaded
first to uncompress the kernel and the RAMFS.
RSPC boot On rspc, the aixmon_rspc softros is located at the beginning of the boot
image image, but the xcoff header is replaced by a hints structure as defined in
/usr/include/sys/boot.h. So an RSPC boot image will contain the following
sections:
sections :
• The hints structure
• The aixmon_rspc file stripped of its xcoff header, in fact starting at
its entry point
• The bootexpand program
• The compressed kernel
• The compressed RAMFS
• The saved base customization.
CHRP boot On chrp, the aixmon_chrp softros is located at the beginning of the boot
image image, but the xcoff format is replaced by an ELF format. So a CHRP boot
image will contain:
• The ELF structure
• The aixmon_chrp file stripped of its xcoff header, in fact starting at
its entry point.
• The bootexpand program
• The compressed kernel
• The compressed RAMFS
• The saved base customization.
RSPC boot The following output represents the hints header obtained with the
image following example command:
example # dd if=<boot_disk> bs=512 skip=<RBA> count=1 | od -Ax -x
introduction On chrp systems, the aixmon xcoff header is replaced by an ELF header.
The aixmon_chrp file is copied to the boot image after the ELF header,
starting at its entry point.
ELF header The following table describes the ELF header structure:
structure
description
size name description
16 e_ident ELF identification
2 e_type object file type
2 e_machine architecture
4 e_version object file version
4 e_entry entry point
4 e_phoff prog hdr byte offset
4 e_shoff section hdr byte offset
4 e_flags processor specific flags
2 e_ehsize ELF header size
2 e_phentsize prog hdr table entry size
2 e_phnum prog hdr table entry count
2 e_shentsize section header size
2 e_shnum section header entry count
2 e_shstrndx sect name string tbl idx
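The field widths in the table match a standard 32-bit ELF header, so it can be unpacked with Python's struct module (a sketch: byte order is hard-coded to little-endian here, whereas a real parser would first check e_ident[EI_DATA]):

```python
import struct

# 16-byte e_ident, two 2-byte fields, five 4-byte fields, six 2-byte
# fields: the same layout as the table above (52 bytes in total).
ELF32_HDR = struct.Struct("<16sHHIIIIIHHHHHH")
FIELDS = ("e_ident", "e_type", "e_machine", "e_version", "e_entry",
          "e_phoff", "e_shoff", "e_flags", "e_ehsize", "e_phentsize",
          "e_phnum", "e_shentsize", "e_shnum", "e_shstrndx")

def parse_elf32_header(raw):
    """Unpack the first 52 bytes of an ELF image into a field dict."""
    return dict(zip(FIELDS, ELF32_HDR.unpack_from(raw)))
```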
Note, load 1 The following table describes the structure used to format note, loader 1
and load2 and loader 2 segments :
segments
descriptions
Note data The following table represent the note data description structure :
description
Boot loader The following table describes the boot loader structure :
parameters
description
size name description
4 timestamp date when the boot image was created
4 bootimage_size equivalent to the number of sectors for the
blv found in the bootrecord
4 boot_loader_size size of the aixmon in bytes
4 inst_offset jump offset in boot image
4 rmalloc_size Percent of memory for kernel heap
4 reserved1
4 reserved2
4 reserved3
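Since the table is eight consecutive 4-byte fields, the parameter block can be unpacked the same way (a sketch; big-endian byte order is an assumption for Power, and the field values in the usage below are invented):

```python
import struct

# Eight 4-byte fields, in the order given by the table above.
BOOT_LDR = struct.Struct(">8I")
PARAM_FIELDS = ("timestamp", "bootimage_size", "boot_loader_size",
                "inst_offset", "rmalloc_size", "reserved1", "reserved2",
                "reserved3")

def parse_boot_loader_params(raw):
    """Unpack the 32-byte boot loader parameter block into a field dict."""
    return dict(zip(PARAM_FIELDS, BOOT_LDR.unpack_from(raw)))
```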
Exercise
Introduction This exercise shows you how to locate the different parts of the boot
image using the boot record.
Procedure Follow this procedure to locate the main parts of the boot image.
Step Action
1 Locate the boot disk using :
# bootinfo -b
2 Determine the architecture of your system using :
# bootinfo -p
3 Find the boot record located at the beginning of the disk
found in step 1 using :
# dd if=<boot_disk> bs=512 count=1 |od -Ax -x
4 • On RSPC or CHRP, locate in the boot partition table the
RBA and sectors from output of step 3.
• On RS6K, locate in the record, the boot_prg_start and
boot_code_length
5 Create a file using the offset and sector count found in
step 4 using:
# dd if=<boot_disk> bs=512 skip=<offset>
count=<sectors> of=/tmp/myfile
6 Using the what command, try to find what is included in this
file.
What is missing from the what output?
Why?
7 Create a file starting at the offset found in step 4 plus the
size of the boot loader:
# dd if=<boot_disk> bs=1
skip=<(offset*512)+boot_loader_size>
count=512 of=/tmp/myfile2
8 What is myfile2?
9 Using the results from step 3, locate the base customization
sector start and length; use these values to create a new
file:
# dd if=<boot_disk> bs=512 skip=<base_cn_start>
count=<base_cn_length> of=/tmp/myfile3
10 Create a directory <dir1>, copy /etc/objrepos/* to dir1, then run:
# /usr/lib/boot/restbase -o myfile3 -d dir1 -v
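Because dd's skip argument counts blocks of bs bytes, mixing a sector offset with a byte-sized boot_loader_size is easiest with bs=1. A small helper (hypothetical, for illustrating the arithmetic in step 7) computes byte-granular dd arguments:

```python
SECTOR = 512  # assumed sector size, as used throughout this exercise

def dd_after_loader(rba_sectors, loader_size_bytes, grab_bytes=512):
    """Build a dd command that reads `grab_bytes` starting just past the
    boot loader inside the boot image."""
    skip = rba_sectors * SECTOR + loader_size_bytes
    return ("dd if=<boot_disk> bs=1 skip=%d count=%d of=/tmp/myfile2"
            % (skip, grab_bytes))
```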
Instructor Notes
ROS On RS6K platforms, the hardware ROS performs some basic hardware
configuration and tests, and creates the IPL Control Block before
transferring control to the kernel’s entry point.
Softros The RSPC and CHRP families of computers require a boot image with
special software known as SOFTROS, which provides functions that AIX
requires but that are not provided by the hardware firmware. The
SOFTROS performs some basic hardware configuration and tests, and
also sets up some data structures to provide an environment for AIX that
more closely resembles the environment provided by RS6K system ROS.
On CHRP systems the firmware device tree is also appended to the IPL
Control Block. Then the SOFTROS transfers control to the kernel’s entry
point.
IPLCB on Power
Definition The IPLCB (Initial Program Load Control Block) defines the RAM-resident
interface between the IPL boot process and the operating system.
The ROS or softros initializes the IPLCB structure using interfaces to
the firmware or ROS (on the RS6K platform).
When loaded, the kernel uses the IPLCB structure to initialize its
runtime structures.
IPLCB The following screen output shows the IPLCB on a CHRP system captured
directory using the kdb iplcb -dir subcommand:
example on a
CHRP system IPL directory [10000080]
ipl_control_block_id.........ROSIPL
ipl_cb_and_bit_map_offset...00000000 ipl_cb_and_bit_map_size....00008898
bit_map_offset..............000087A8 bit_map_size...............00000007
ipl_info_offset.............000002E8 ipl_info_size..............00000598
iocc_post_results_offset....00000000 iocc_post_results_size.....00000000
nio_dskt_post_results_offset00000000 nio_dskt_post_results_size.00000000
sjl_disk_post_results_offset00000000 sjl_disk_post_results_size.00000000
scsi_post_results_offset....00000000 scsi_post_results_size.....00000000
eth_post_results_offset.....00000000 eth_post_results_size......00000000
tok_post_results_offset.....00000000 tok_post_results_size......00000000
ser_post_results_offset.....00000000 ser_post_results_size......00000000
par_post_results_offset.....00000000 par_post_results_size......00000000
rsc_post_results_offset.....00000000 rsc_post_results_size......00000000
lega_post_results_offset....00000000 lega_post_results_size.....00000000
keybd_post_results_offset...00000000 keybd_post_results_size....00000000
ram_post_results_offset.....00000000 ram_post_results_size......00000000
sga_post_results_offset.....00000000 sga_post_results_size......00000000
fm2_post_results_offset.....00000000 fm2_post_results_size......00000000
net_boot_results_offset.....00000000 net_boot_results_size......00000000
csc_results_offset..........00000000 csc_results_size...........00000000
menu_results_offset.........00000000 menu_results_size..........00000000
console_results_offset......00000000 console_results_size.......00000000
diag_results_offset.........00000000 diag_results_size..........00000000
rom_scan_offset.............00000000 rom_scan_size..............00000000
sky_post_results_offset.....00000000 sky_post_results_size......00000000
global_offset...............00000000 global_size................00000000
mouse_offset................00000000 mouse_size.................00000000
vrs_offset..................00000000 vrs_size...................00000000
taur_post_results_offset....00000000 taur_post_results_size.....00000000
ent_post_results_offset.....00000000 ent_post_results_size......00000000
vrs40_offset................00000000 vrs40_size.................00000000
gpr_save_area1............@ 10000178
system_info_offset..........00000880 system_info_size...........0000009C
buc_info_offset.............0000091C buc_info_size..............00000150
processor_info_offset.......00000A6C processor_info_size........00000310
fm2_io_info_offset..........00000000 fm2_io_info_size...........00000000
processor_post_results_off..00000000 processor_post_results_size00000000
system_vpd_offset...........00000000 system_vpd_size............00000000
mem_data_offset.............00000000 mem_data_size..............00000000
l2_data_offset..............00000D7C l2_data_size...............000000C0
fddi_post_results_offset....00000000 fddi_post_results_size.....00000000
golden_vpd_offset...........00000000 golden_vpd_size............00000000
nvram_cache_offset..........00000000 nvram_cache_size...........00000000
user_struct_offset..........00000000 user_struct_size...........00000000
residual_offset.............00000E3C residual_size..............0000776C
numatopo_offset.............00000E3C numatopo_size..............00000000
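Every entry in the IPL directory is an offset/size pair into the IPLCB image, with 0/0 meaning the section is absent on this platform. Carving a section out of a raw IPLCB dump is therefore a simple slice (a sketch using a fabricated zero-filled image; the ipl_info offset and size are the ones shown in the directory above):

```python
def extract_section(iplcb, offset, size):
    """Slice one section out of a raw IPLCB image using an offset/size
    pair from the IPL directory; 0/0 means the section is not present."""
    if offset == 0 and size == 0:
        return None
    return iplcb[offset:offset + size]

iplcb = bytes(4096)                              # fake, zero-filled image
ipl_info = extract_section(iplcb, 0x2E8, 0x598)  # ipl_info section
```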
Checkpoint
Introduction Take a few minutes to answer the following questions. We will review the
questions as a group when everyone has finished.
Instructor Notes
Introduction This section explains the boot mechanism used by the IA-64 platform.
Definitions EFI stands for Extensible Firmware Interface. EFI provides a standard
interface between the hardware and the operating system on IA-64
platforms.
Boot overview When the system is powered on, the EFI loads first.
EFI loads the BIOS for devices that need it.
EFI then prompts to enter the setup for a timeout period.
EFI then prompts with the EFI boot menu for another timeout period, after
which it scans the bootlist in order to find a boot device.
The EFI boot loader prompts for the boot loader menu and, after the
timeout or exit from the menu, initializes the IPL Control Block.
Then it locates and loads the kernel, which initializes itself.
The kernel then calls init (in fact /usr/lib/boot/ssh at this stage).
The ssh then calls rc.boot for Phase I and Phase II, which are specific to
each boot device type.
Then init executes rc.boot Phase III and the remaining common code in
rc.boot for disk and network boot devices.
If no boot device is found, EFI starts the EFI Shell on IA-64 platforms that
support it.
Boot diagram The following summarizes the high-level boot process shown in the
original flowchart:
1. The EFI firmware executes and loads any needed BIOS.
2. EFI prompts for setup; on timeout (or an OS boot menu request) it
continues.
3. The EFI boot manager menu is displayed; if no valid boot device is
found, the EFI Shell is started.
4. The AIX boot loader runs; if a key is entered during the timeout, the
AIX boot loader menu is displayed.
5. The kernel initializes.
6. The kernel calls init (/usr/lib/boot/ssh).
7. The init ssh calls rc.boot PHASE I & II.
8. init exits to newroot.
9. init calls rc.boot PHASE III from inittab and processes the rest of the
inittab entries.
Boot image The original chart shows the layout of an AIX 5L on IA-64 boot disk
overview (hdisk0_all):
• the PMBR, EFI Partition Header and partition entries at the start of the
disk
• the Physical Volume partition (hdisk0), containing the VGDA, hd5 (with
the base customized data) and the rest of the disk
• the IA-64 System partition (hdisk0_s0), containing the EFI boot loader
PMBR, EFI On the IA-64 platform, AIX 5L must be aware of EFI disk partitioning.
Partition
Header and During installation, two partitions are created on the target disk
entries (hdisk0_all):
• A Physical Volume partition (hdisk0 in the AIX environment), known as
a block device in the EFI environment (blkXX).
• An IA-64 System partition (hdisk0_s0 in the AIX environment), known
as an IA-64 System partition in the EFI environment (fsXX).
kernel On the IA-64 platform, the 64-bit kernel (unix_ia64) can be used as the
kernel for either UP or MP systems. The kernel initializes itself and then
passes control to the simple shell init (ssh) in the RAM filesystem.
RAM Filesystem used during the boot process that contains programs and data
filesystem for initializing devices and subsystems in order to install AIX, execute
diagnostics, or to access and bring up the rest of AIX.
base Area of the hard disk boot logical volume containing user configured ODM
customized device configuration information that is used by the system configuration
data process.
EFI boot The EFI boot loader resides in an IA-64 System Partition, placed
loader physically after the Physical Volume Partition by the installation process.
Introduction At boot time, EFI will prompt for the EFI boot manager menu to be entered
for a timeout period.
The timeout period is customizable via the boot maintenance menu.
boot manager At boot time, the boot manager displays the bootlist and prompts for a
timeout period.
If the timeout is reached, the boot manager scans the bootlist in boot
order to find a valid boot device.
If a key is entered before the timeout period, the user will be able to:
• select a boot device from the list to boot for this session
• start the EFI Shell, on platforms that support it
• enter the boot maintenance manager
boot The boot maintenance manager menu will allow the administrator to :
maintenance
manager menu • boot from a file
• add/delete boot options
• change boot order
• manage boot next setting
• set autoboot timeout
• select active console output devices (output, input and error)
• do a cold reset
Introduction The EFI Shell allows you to configure the boot process used by the IA-64
platform. The main functions are to:
• locate and identify different boot devices
• set environment variables
• use debugging subcommands
• boot from the selected boot device
EFI Shell The EFI Shell startup displays information about the current EFI level
startup and device mapping, as follows:
example
EFI version x.xx [xx.xx] Build flags : EIF64 Running on Merced EFI_DEBUG
EFI IA-64 SDV/FDK (BIOS CallBacks) [Fri Mar 31 13:21:32 2000] - INTEL
Cache Enabled. This image Main entry is at address 000000003F2BA000
Stack = 000000003F2B6FF0 BSP = 000000003F293000
INT Stack = 000000003F292FF0 INT BSP = 000000003F26F000
EFI Shell version x.xx [xx.xx]
Device mapping table
fs0 : VenHw(Unknown Device:80)/HD(Part1,Sig0CBCBA54)
blk0 : VenHw(Unknown Device:01)/HD
blk1 : VenHw(Unknown Device:80)/HD
blk2 : VenHw(Unknown Device:81)/HD
blk3 : VenHw(Unknown Device:ff)/HD
blk4 : VenHw(Unknown Device:80)/HD(Part1,Sig0CBCBA54)
blk5 : VenHw(Unknown Device:80)/HD(Part2,Sig0CBCBA54)
EFI Shell In the EFI Shell you will be able to use the following subcommands:
subcommands
Introduction The AIX 5L EFI boot loader provides the interface between EFI and the
kernel.
On disk drives, the AIX boot loader is located in the system partition.
Before loading the kernel, the boot loader prompts the user to enter the
boot loader menu.
The boot loader then makes use of the EFI interface to initialize the IPL
Control Block.
The boot loader then locates the kernel, which resides in hd5, itself
contained in the AIX PV partition.
Finally, the boot loader passes control to the kernel entry point.
boot loader The boot loader makes use of the EFI boot services to load file
and EFI images such as the kernel, the RAM filesystem file and the base
interactions customized data, and to locate various system tables such as the System
Abstraction Layer (SAL) System Table (SST) and the Advanced
Configuration and Power Interface (ACPI) Specification tables. The boot
loader then creates the Initial Program Load Control Block (IPLCB) and
sets up Translation Registers (TR) before transferring control to the
kernel’s entry point.
EFI boot The boot loader menu can be used to set parameters that may affect
loader menu kernel loading and the operating environment, such as:
• enable the kernel debugger
• invoke the kernel debugger
• override the RMALLOC memory reservation
• set the boot loader debug flag
• set the service/diagnostics flag
• select the amount of memory to enable
• select the number of CPUs to use
• toggle single/multi dispersal mode
Introduction The IPLCB (Initial Program Load Control Block) defines the RAM-resident
interface between the IPL boot process and the operating system. The
boot loader initializes the IPLCB structure using interfaces to EFI.
When loaded, the kernel uses the IPLCB structure to initialize its
runtime structures.
IPLCB The following screen shows the IPLCB directory on an IA-64 system
directory captured using the IADB iplcb -dir subcommand:
example on an
IA64 system > iplcb -dir
Directory Information
ipl_control_block_id......................= IA64_IPL
ipl_cb_and_bit_map_offset.................= 0x0
ipl_cb_and_bit_map_size...................= 0x7F0
bit_map_offset............................= 0x448
bit_map_size..............................= 0x27
ipl_info_offset...........................= 0xD8
ipl_info_size.............................= 0x7C
system_info_offset........................= 0x3D8
system_info_size..........................= 0x50
processor_info_offset.....................= 0x250
processor_info_size.......................= 0x188
io_xapic_info_offset......................= 0x428
io_xapic_info_size........................= 0x18
handoff_info_offset.......................= 0x158
handoff_info_size.........................= 0xF0
platform_int_info_offset..................= 0x440
platform_int_size.........................= 0x8
residual_offset...........................= 0x0
residual_size.............................= 0x0
Checkpoint
Introduction Take a few minutes to answer the following questions. We will review the
questions as a group when everyone has finished.
Instructor Notes
Introduction The main goal here is to get the devices configured and the ODM
initialized.
Hard disk The original flowchart for the hard disk boot phase I process:
Phase I
diagram
• restbase restores the base configuration from the boot disk
• if the restbase return code is not 0: led 548
• otherwise: led 510, then led 511, exit 0
Introduction The main objective in hard disk boot phase II is to varyon rootvg and
mount standard filesystems.
Hard disk The original flowchart for the hard disk boot phase II process:
Phase II
diagram
• led 511
• ipl_varyon -v
• if the ipl_varyon return code is not 0: led 552, 554 or 556
• otherwise: led 517
• if the key is in the service position, or a dump is present in hd6:
execute the service procedure
• otherwise: copy /etc/vg and objrepos to disk, merge devices, unmount
filesystems, remount filesystems
• led 553, exit 0
Introduction The main objective in hard disk boot phase III is to mount the runtime
/tmp, sync rootvg and then fall through to the phase III common process.
Hard disk The following chart represents the hard disk boot phase III process.
Phase III
diagram
Introduction The main objective of the CDROM boot process is to configure the
devices needed for installation and maintenance procedures and start the
bi_main process.
CDROM boot The original flowchart for CDROM boot phases I, II and III is, by phase
phases I,II and number:
III diagram
• Phase 1: led 510, run the configuration manager (phase I), led 511,
exit 0
• Phase 2: led 512, configure the remaining devices needed for install,
mount the CDROM SPOT, led 517, exec bi_main
• Phase 3: exit 0
Introduction The main objective of the Tape boot process is to configure the devices
needed for installation and maintenance procedures and start the
bi_main process.
Tape boot The original flowchart for Tape boot phases I, II and III is, by phase
phases I,II and number:
III diagram
• Phase 1: led 510, run the configuration manager (phase I), exit 0
• Phase 2: led 512, run the configuration manager (phase II),
exec bi_main
• Phase 3: exit 0
Introduction The main objective of the Network boot process is to configure devices,
configure additional network options (network address, mask and default
route) and run the $RC_CONFIG script.
Network boot The original flowchart for Network boot phases I, II and III:
phases I,II and
III diagram
• if booting from atm0: configure ATM (pvc, svc and muxatmd);
otherwise configure the native network boot device (ifconfig)
• if the return code is not 0: led 607
• tftp the miniroot
• set the NIM environment
• create /etc/hosts and routes
• NFS-mount the SPOT
• run $RC_CONFIG from the SPOT
• exit 0
Introduction The common Phase III boot code is run for disk and network boot only.
Common boot The common boot phase III process:
Phase III
diagram
• ensure 1024K of free space in /tmp
• load streams modules
• fix the secondary dump device
• swapon hd6 if no dump is present
• run the savebase recovery procedure
• if the key is in the service position: run the configuration manager
phase III and disable the controlling tty; otherwise: clean the ODM for
alternate disk install and run the configuration manager phase II
• set up System Hang Detection
• run the graphical boot if needed
• run savebase
• clean unavailable ttys from inittab
• sync the files to hard disk
• run /etc/rc.B1 if it exists
• start the syncd daemon
• start the errdemon daemon
• clean /etc/locks and /etc/nologin
• start the mirrord daemon
• start the cfgchk daemon
• run diagsrv if supported by the platform
• “System initialization completed”, exit 0
Introduction As seen in the Network Boot Process (Phases I, II and III), these scripts
are run by rc.boot when booting from a network device, in phases I and II.
These scripts are located in the /usr/lib/boot/network directory.
They are loaded from the SPOT on the NIM server during the network
boot process.
Introduction The Object Data Manager (ODM) is widely used in AIX to store and retrieve
various system information.
For this purpose, AIX defines a number of standard ODM classes.
Any application can create and use its own ODM classes to manage its
own information.
Devices ODM The Devices classes are used by the configuration manager, device
Classes drivers and AIX device-related commands (lsdev, lsattr, lspv, lsvg ...).
The following table lists the Devices ODM classes and their definitions:
Class Definition
PdDv Predefined Devices
PdCn Predefined Connection
PdAt Predefined Attribute
PdAtXtd Extended Predefined Attribute
Config_Rules Configuration Rules
CuDv Customized Devices
CuDep Customized Dependency
CuAt Customized Attribute
CuDvDr Customized Device Driver
CuVPD Customized Vital Product Data
CuPart EFI partitions
CuPath Customized Paths
CuPathAt Customized Path Attributes
SWVPD ODM The SWVPD classes are used by fileset related commands like installp,
Classes instfix, lslpp, oslevel.
SWVPD is divided into three parts:
• root : classes are in /etc/objrepos
• usr : classes are in /usr/lib/objrepos
• share : classes are located in /usr/share/lib/objrepos
The following table lists the Software Vital Product Data ODM classes and
their definitions:
Class Definition
lpp The lpp object class contains information about the installed
software products, including the current software product
state.
inventory The inventory object class contains information about the files
associated with a software product.
history The history object class contains historical information about
the installation and updates of software products.
product The product object class contains product information about
the installation and updates of software products and their
prerequisites.
SRC ODM SRC classes are used by the srcmstr daemon and related commands: lssrc,
Classes startsrc, stopsrc and chssys.
The following table lists the System Resource Controller ODM classes and
their definitions:
Class Definition
SRCsubsys The subsystem object class contains the descriptors for all
SRC subsystems. A subsystem must be configured in this
class before it can be recognized by the SRC.
SRCsubsvr An object must be configured in this class if a subsystem
has subservers and the subsystem expects to receive
subserver-related commands from the srcmstr daemon.
SRCnotify This class provides a mechanism for the srcmstr daemon
to invoke subsystem-provided routines when the failure of a
subsystem is detected.
SRCextmeth
SMIT ODM The SMIT ODM classes are used by the smit and smitty commands.
Classes
The following table lists the SMIT ODM classes and their definitions:
RAS ODM The RAS classes are used by the errdaemon, shdaemon, shconf and alog
Classes commands.
The following table lists the RAS ODM classes and their definitions:
Class Definition
errnotify Used by errlog notification process
SWservAt Used by errorlog, system dumps, System
Hang Detection and alog
The following table lists the Diagnostics ODM classes and their definitions:
Class Definition
PDiagRes Predefined Diagnostic Resource Object Class
PDiagAtt Predefined Diagnostic Attribute Device Object Class
PDiagTask Predefined Diagnostic Task Object Class
CDiagAtt Customized Diagnostic Attribute Object Class
TMInput Test Mode Input Object Class
MenuGoal Menu Goal Object Class
FRUB Fru Bucket Object Class
FRUs Fru Reporting Object Class
DAVars Diagnostic Application Variables Object Class
PDiagDev Predefined Diagnostic Devices Object Class
DSMOptions Diagnostic Supervisor Menu Options Object Class
ODM The following table lists the ODM commands and their usage:
commands
Command Definition
odmadd Adds objects to an object class. The odmadd command
takes an ASCII stanza file as input and populates object
classes with objects found in the stanza file.
odmchange Changes specific objects in a specified object class.
odmcreate Creates empty object classes. The odmcreate command
takes an ASCII file describing object classes as input and
produces C language .h and .c files to be used by the
application accessing objects in those object classes.
odmdelete Removes objects from an object class.
odmdrop Removes an entire object class.
odmget Retrieves objects from object classes and puts the object
information into odmadd command format.
odmshow Displays the description of an object class. The
odmshow command takes an object class name as input
and puts the object class information into odmcreate
command format.
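As an illustration of the odmadd input format, the following is a minimal sketch of an ASCII stanza file. Each stanza names the object class, followed by descriptor = value lines; the descriptor names shown match the CuAt class described above, but the values are illustrative examples, not taken from a live system:

```
CuAt:
        name = "sys0"
        attribute = "maxuproc"
        value = "256"
        type = "R"
        generic = "DU"
        rep = "nr"
        nls_index = 20
```

Such a file would be loaded with the odmadd command after setting ODMDIR to the appropriate object repository.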
ODM The following table lists the ODM subroutines and their use:
subroutines
Subroutine Definition
odm_add_obj Adds a new object to the object class.
odm_change_obj Changes the contents of an object.
odm_close_class Closes an object class.
odm_create_class Creates an empty object class.
odm_err_msg Retrieves a message string.
odm_free_list Frees memory allocated for the odm_get_list
subroutine.
odm_get_by_id Retrieves an object by specifying its ID.
odm_get_first Retrieves the first object that matches the specified
criteria in an object class.
odm_get_list Retrieves a list of objects that match the specified
criteria in an object class.
odm_get_next Retrieves the next object that matches the specified
criteria in an object class.
odm_get_obj Retrieves an object that matches the specified criteria
from an object class.
odm_initialize Initializes an ODM session.
odm_lock Locks an object class or group of classes.
odm_mount_class Retrieves the class symbol structure for the specified
object class.
odm_open_class Opens an object class.
odm_rm_by_id Removes an object by specifying its ID.
odm_rm_obj Removes all objects that match the specified criteria
from the object class.
odm_run_method Invokes a method for the specified object.
odm_rm_class Removes an object class.
odm_set_path Sets the default path for locating object classes.
odm_unlock Unlocks an object class or group of classes.
odm_terminate Ends an ODM session.
ODM paths As the ODM classes can be found in three paths (root, usr and share), the
user must decide which path to use before running ODM commands or
ODM subroutines.
For ODM commands, the user can set the path using:
# export ODMDIR=/usr/share/lib/objrepos
In a C program, the user should use:
odm_set_path("/usr/lib/objrepos");
Introduction It can be useful to quickly retrieve the log files used during boot or
installation to help solve problems.
The alog command can be used to recover these system logs.
log types The alog command is used by installation and boot processes to log
information or errors for the following topics:
• boot : log for the boot process
• bosinst : log used for the AIX installation process
• console : log used to store console messages
• nim : log used to store NIM messages
• dumpsymp : used to store dump symptom messages
example The following example outputs the last 15 lines of the boot log:
# alog -t boot -o|tail -15
Saving Base Customize Data to boot disk
Starting the sync daemon
Starting the error daemon
A device that was previously detected could not be found.
Run "diag -a" to update the system configuration.
System initialization completed.
Starting Multi-user Initialization
Introduction For debugging boot problems, it can be useful to get a detailed output of
the boot process, including rc.boot output.
entering boot To enter boot debugging, the administrator should first make sure the
debug KDB kernel debugger will be loaded and invoked at boot time using:
# bosboot -I -ad /dev/ipldevice
The next reboot will launch the KDB on the native serial connection.
At the KDB prompt you will need to toggle the rc.boot debug flag and
optionally the exec debug flag in order to get rc.boot output on the native
serial connection.
Note that the exec tracing will continue after the end of rc.boot.
Introduction For debugging boot problems, it can be useful to get a detailed output of
the boot process, including rc.boot output.
Prerequisites In order to get boot debug output you will need a device (TTY, Thinkpad or
another system's serial port) connected to the native serial port and
configured at 115200-8-N-1.
Step Action
1 If you want the IADB to be invoked at boot time, use :
# bosboot -I -ad /dev/ipldevice
You can also choose not to do this and set the debugger flags manually in
the boot loader menu
2 Boot or reboot the system
3 If you are using another system as the TTY, you may want
to set some tracing/capture options to capture the
debugging output.
4 If the autoboot flag is not set in EFI set the file system and
boot using :
Shell> fs0:
fs0> boot
5 The boot loader menu should come up with the debugger
flags set to “ON” if you ran bosboot in step 1.
Otherwise, press a key to enter the boot loader menu and
set the debugger flags, then exit the boot loader menu
6 The boot loader will load the IADB, which will prompt on the
native serial port.
At the IADB prompt type :
CPU0> set dbgmsg=on
CPU0> set exectrace=on
CPU0> go
boot The following example shows the beginning of what you can see on the
debugging native serial port when debugging the boot process:
output
example MEDIEVAL DEBUGGER ENTERED interrupt.
IP->E00000000001D2F2 brkpoint()+2: { .mfi
0: nop.m 0x100001
;; }
>CPU0> set dbgmsg=on
>CPU0> set exectrace=on <== here we ask for debugging
>CPU0> go <== here we go
See Ya!
Performing Hostile Takeover of the System Console...
AIX Version 5.0
Starting CPU#001... done.
+ ODMSTRNG=attribute=keylock and value=service
+ HOME=/
+ LIBPATH=/usr/lib:/lib:/usr/sbin:/etc:/usr/bin
+ SHOWLED=showled
+ SYSCFG_PHASE=BOOT
+ export HOME LIBPATH ODMDIR PATH SHOWLED SYSCFG_PHASE
+ umask 077
+ set -x
+ [ 1 -ne 1 ]
+ PHASE=1
+ + bootinfo -p
PLATFORM=ia64
+ [ ! -x /usr/lib/boot/bin/bootinfo_ia64 ]
+ [ 1 -eq 1 ]
+ 1> /)
+ + bootinfo -t
BOOTYPE=3
+ [ 0 -ne 0 ]
+ [ -z 3 ]
+ unset pdev_to_ldev undolt
Packaging Changes
Introduction The lpp packaging has been reviewed to reflect the need for platform-
dependent packages.
Packaging The installp, bffcreate, inutoc and instfix commands are updated to reflect
commands these changes.
By default, packaging commands will process only packages related to the
platform where the command is run.
A “-M” flag has been added to these commands; it accepts the following
sub-options:
• I : To process Intel-related packages
• R : To process Power-related packages
• N : To process platform-neutral packages
• A : To process all kinds of packages
installp The installp command will only accept the -M flag with the -l or -L options.
options
The installp -L output will include platform information.
bffcreate The bffcreate command will accept all -M sub-options to allow transit of
options packages regardless of the current platform. This is needed for NIM
operations.
instfix options The instfix command, like installp, will only accept the -M flag
when used in conjunction with -T (the list flag).
Checkpoint
Introduction Take a few minutes to answer the following questions. We will review the
questions as a group when everyone has finished.
Instructor Notes
Introduction /proc is a file system that provides access to the state of each active
process and Light Weight Process (LWP) in the system.
/proc The contents of the /proc filesystem have the same appearance as any
filesystem other files and directories in a Unix filesystem. Each top-level entry in the
/proc directory is a sub-directory named by the decimal number
corresponding to a process ID, and the owner of each is determined by the
user ID of the process.
Access to process state is provided by additional files contained within
each sub-directory; this hierarchy is described more completely below.
Except where otherwise specified, ‘‘/proc file’’ refers to a non-directory
file within the hierarchy rooted at /proc.
Filesystem The directory structure for the /proc directory is described below. The pid
hierarchy represents the process ID number and the lwp# represents the light-weight
process number.
Accessing / Standard system call interfaces are used to access /proc files: open(2),
proc files close(2), read(2), and write(2). Most files describe process state and can
only be opened for reading. An open for writing allows process control; a
read-only open allows inspection, but not control.
Types of Files
Introduction Listed below are descriptions of the files that are contained in the /proc
filesystem hierarchy. These files are described in more detail on the
following pages.
The as File
Introduction The as file contains the address-space image of the process and can be
opened for both reading and writing.
Accessing the lseek(2) is used to position the file at the virtual address of interest, and
file the address space can then be examined or changed through a read or write.
Introduction The ctl file is a write-only file to which structured messages are written
directing the system to change some aspect of the process’s state or
control its behavior in some way. The seek offset is not relevant when
writing to this file.
Control Individual LWPs also have associated lwpctl files. Process state changes
messages are effected through control messages written either to the ctl file of the
process or to a specific lwpctl file. All control messages consist of an int
naming the specific operation followed by additional data containing
operands (if any). The effect of a control message is immediately reflected
in the state of the process visible through appropriate status and
information files.
Multiple control messages can be combined in a single write(2) to a control
file, but no partial writes are permitted; that is, each control message
(operation code plus operands) must be presented in its entirety to the
write and not in pieces over several system calls.
Introduction The status file contains state information about the process and one of its
LWPs (chosen according to the rules described below).
File format The file is formatted as a struct pstatus containing the following members:
long pr_flags; /* Flags */
ushort_t pr_nlwp; /* Total number of lwps in the process */
sigset_t pr_sigpend; /* Set of process pending signals */
vaddr_t pr_brkbase; /* Address of the process heap */
ulong_t pr_brksize; /* Size of the process heap, in bytes */
vaddr_t pr_stkbase; /* Address of the process stack */
ulong_t pr_stksize; /* Size of the process stack, in bytes */
pid_t pr_pid; /* Process id */
pid_t pr_ppid; /* Parent process id */
pid_t pr_pgid; /* Process group id */
pid_t pr_sid; /* Session id */
timestruc_t pr_utime; /* Process user cpu time */
timestruc_t pr_stime; /* Process system cpu time */
timestruc_t pr_cutime; /* Sum of children’s user times */
timestruc_t pr_cstime; /* Sum of children’s system times */
sigset_t pr_sigtrace; /* Mask of traced signals */
fltset_t pr_flttrace; /* Mask of traced faults */
sysset_t pr_sysentry; /* Mask of system calls traced on entry */
sysset_t pr_sysexit; /* Mask of system calls traced on exit */
lwpstatus_t pr_lwp; /* "representative" LWP */
Member Description
pr_flags A bit mask holding flags (flags are described below)
pr_nlwp Total number of LWPs in the process
pr_brkbase Virtual address of the process heap
pr_brksize Size of process heap in bytes. The address formed by
the sum of these values is the process break (see
brk(2)).
pr_stkbase Virtual address of the process stack
pr_stksize Size of the process stack in bytes. Each LWP runs on a
separate stack; the process stack is distinguished in that
the operating system will grow it as necessary.
pr_pid Process ID
pr_ppid Parent process ID
pr_pgid Process group ID
pr_sid Session ID of the process
pr_utime User CPU time consumed by the process in seconds
and nanoseconds
pr_stime System CPU time consumed by the process in seconds
and nanoseconds
pr_cutime Cumulative user CPU time consumed by the process in
seconds and nanoseconds
pr_cstime Cumulative system CPU time consumed by the process
in seconds and nanoseconds
pr_sigtrace Set of signals that are being traced (see PCSTRACE)
pr_flttrace Set of hardware faults that are being traced (see
PCSFAULT)
pr_sysentry Set of system calls being traced on entry (see
PCSENTRY)
pr_sysexit Set of system calls being traced on exit (see PCSEXIT)
pr_lwp If the process is not a zombie, pr_lwp contains an
lwpstatus_t structure describing a representative LWP.
The contents of this structure have the same meaning as if
it were read from an lwpstatus file.
The pr_flags field is a bit mask that can contain the following flags:
Flag Description
PR_ISSYS System process (see PCSTOP)
PR_FORK Has its inherit-on-fork flag set (see PCSET)
PR_RLC Has its run-on-last-close flag set (see PCSET)
PR_KLC Has its kill-on-last-close flag set (see PCSET)
PR_ASYNC Has its asynchronous-stop flag set (see PCSET)
Multi-threaded When the process has more than one LWP, its representative LWP is
applications chosen by the /proc implementation. The chosen LWP is a stopped LWP
only if all the process’s LWPs are stopped, is stopped on an event of
interest only if all the LWPs are so stopped, or is in a PR_REQUESTED
stop only if there are no other events of interest to be found. The chosen
LWP remains fixed as long as all the LWPs are stopped on events of
interest and PCRUN is not applied to any of them.
When applied to the process control file, every /proc control operation that
must act on an LWP uses the same algorithm to choose which LWP to act
on. Together with synchronous stopping (see PCSET), this enables an
application to control a multiple-LWP process using only the process-level
status and control files if it so chooses. More fine-grained control can be
achieved using the LWP-specific files.
Introduction The psinfo file contains information about the process needed by the ps(1)
command. If the process contains more than one LWP, a representative
LWP (chosen according to the rules described for the status file) is used to
derive the status information.
File format The file is formatted as a struct psinfo containing the following members:
ulong_t pr_flag; /* process flags */
ulong_t pr_nlwp; /* number of LWPs in process */
uid_t pr_uid; /* real user id */
gid_t pr_gid; /* real group id */
pid_t pr_pid; /* unique process id */
pid_t pr_ppid; /* process id of parent */
pid_t pr_pgid; /* pid of process group leader */
pid_t pr_sid; /* session id */
caddr_t pr_addr; /* internal address of process */
long pr_size; /* size of process image in pages */
long pr_rssize; /* resident set size in pages */
timestruc_t pr_start; /* process start time, time since epoch */
timestruc_t pr_time; /* usr+sys cpu time for this process */
dev_t pr_ttydev; /* controlling tty device (or PRNODEV)*/
char pr_fname[PRFNSZ]; /* last component of exec()ed pathname*/
char pr_psargs[PRARGSZ]; /* initial characters of arg list */
struct lwpsinfo pr_lwp; /* "representative" LWP */
Platform Some of the entries in psinfo, such as pr_flag and pr_addr, refer to internal
specific data kernel data structures and should not be expected to retain their meanings
across different versions of the operating system. They have no meaning
to a program and are only useful for manual interpretation by a user aware
of the implementation details.
Representative pr_lwp describes the representative LWP chosen as described under the
LWP pstatus file above. If the process is a zombie, pr_nlwp and pr_lwp.pr_lwpid
are zero and the other fields of pr_lwp are undefined.
Introduction The map file contains information about the virtual address map of the
process. The file contains an array of prmap structures, each of which
describes a contiguous virtual address region in the address space of the
traced process.
Member Description
pr_vaddr Virtual address of the mapping within the traced process
pr_size Size of mapping in bytes
pr_mapname If not the empty string, contains the name of a file in the object
directory that can be opened for reading to yield a file descriptor
for the object to which the virtual address is mapped.
pr_off Offset within the mapped object (if any) to which the virtual
address is mapped
pr_mflags Protection and attribute flags (see below)
pr_filler For future use
Flag Description
MA_READ Mapping is readable by the traced process
MA_WRITE Mapping is writable by the traced process
MA_EXEC Mapping is executable by the traced process
MA_SHARED Changes to the mapping are shared with the mapped object
Contiguous A contiguous area of the address space having the same underlying
address space mapped object may appear as multiple mappings because of varying read,
write, execute, and shared attributes. The underlying mapped object does
not change over the range of a single mapping. An I/O operation to a
mapping marked MA_SHARED fails if applied at a virtual address not
corresponding to a valid page in the underlying mapped object. Reads and
writes to private mappings always succeed. Reads and writes to
unmapped addresses always fail.
Introduction The cred file contains a description of the credentials associated with the
process.
File format The file is formatted as a struct prcred containing the following members:
uid_t pr_euid; /* Effective user id */
uid_t pr_ruid; /* Real user id */
uid_t pr_suid; /* Saved user id (from exec) */
gid_t pr_egid; /* Effective group id */
gid_t pr_rgid; /* Real group id */
gid_t pr_sgid; /* Saved group id (from exec) */
uint_t pr_ngroups; /* Number of supplementary groups */
gid_t pr_groups[1]; /* Array of supplementary groups */
Introduction The sigact file contains an array of sigaction structures describing the
current dispositions of all signals associated with the traced process.
Signal numbers are displaced by 1 from array indexes, so that the action
for signal number n appears in position n-1 of the array.
lwp/lwpctl file
Introduction The lwpctl file is a write-only control file. The messages written to this file
affect only the associated LWP rather than the process as a whole (where
appropriate).
File format The lwp/lwpstatus file is formatted as a struct lwpstatus containing the
following members:
long pr_flags; /* Flags */
short pr_why; /* Reason for stop (if stopped) */
short pr_what; /* More detailed reason */
lwpid_t pr_lwpid; /* Specific LWP identifier */
short pr_cursig; /* Current signal */
siginfo_t pr_info; /* Info associated with signal or fault */
struct sigaction pr_action; /* Signal action for current signal */
sigset_t pr_lwppend; /* Set of LWP pending signals */
stack_t pr_altstack; /* Alternate signal stack info */
short pr_syscall; /* System call number (if in syscall) */
short pr_nsysarg; /* Number of arguments to this syscall */
long pr_sysarg[PRSYSARGS];/* Arguments to this syscall */
char pr_clname[PRCLSZ]; /* Scheduling class name */
ucontext_t pr_context; /* LWP context */
pfamily_t pr_family; /* Processor family-specific information */
Member Description
pr_flags A bit mask holding flags (described below)
pr_why Reason for LWP stop (if stopped). Possible values are listed
below.
pr_what More detailed reason for LWP stop. pr_why and pr_what
together describe the reason for a stopped LWP.
pr_lwpid Specific LWP identifier.
pr_cursig Names the current signal; that is, the next signal to be
delivered to the LWP.
pr_info When the LWP is in a PR_SIGNALLED or PR_FAULTED
stop, pr_info contains additional information pertinent to the
particular signal or fault. (See sys/siginfo.h)
pr_action Contains signal action information about the current signal
(see sigaction(2)). It is undefined if pr_cursig is zero.
pr_lwppend Identifies any synchronously-generated or LWP-directed
signals pending for the LWP. Does not include signals
pending at the process level.
pr_altstack Contains the alternate signal stack information for the LWP.
(see sigaltstack(2)).
pr_syscall Number of the system call, if any, being executed by the
LWP. It is nonzero if and only if the LWP is stopped on
PR_SYSENTRY or PR_SYSEXIT or is asleep within a
system call (PR_ASLEEP is set).
pr_nsysarg If pr_syscall is non-zero, pr_nsysarg is the number of
arguments to the system call
pr_sysarg Array of arguments to the system call.
pr_clname Contains the name of the scheduling class of the LWP.
pr_context Contains the user context of the LWP, as if it had called
getcontext(2). If the LWP is not stopped, all context values
are undefined.
pr_family Contains the CPU-family specific information about the
LWP. Use of this field is not portable across different
architectures.
Flag Description
PR_STOPPED LWP is stopped
PR_ISTOP LWP is stopped on an event of interest (see PCSTOP)
PR_DSTOP LWP has a stop directive in effect (see PCSTOP)
PR_STEP LWP has a single-step directive in effect
PR_ASLEEP LWP is in an interruptible sleep within a system call
PR_PCINVAL LWP program counter register does not point to a valid
address
Value Description
PR_REQUESTED Shows that the stop occurred in response to a
stop directive, normally because PCSTOP was
applied or because another LWP stopped on an
event of interest and the asynchronous-stop flag
(see PCSET) was not set for the process.
pr_what is unused in this case.
Introduction The lwp/lwpsinfo file contains information about the LWP needed by ps(1).
This information also is present in the psinfo file of the process for its
representative LWP if it has one.
File format The file is formatted as a struct lwpsinfo containing the following members:
Control Messages
Introduction Process state changes are effected through messages written to the ctl file
of the process or to the lwpctl file of an individual LWP.
Sending All control messages consist of an int naming the specific operation
control followed by additional data containing operands (if any). Multiple control
messages messages can be combined in a single write(2) to a control file, but no
partial writes are permitted; that is, each control message (operation code
plus operands) must be presented in its entirety to the write and not in
pieces over several system calls.
ENOENT Note that writing a message to a control file for a process or LWP that has
exited elicits the error ENOENT.
Introduction There are three control messages that stop LWPs, each behaving in a
different way. They are:
• PCSTOP
• PCDSTOP
• PCWSTOP
PCSTOP When applied to the process control file, directs all LWPs to stop and waits
for them to stop. Completes when every LWP has stopped on an event of
interest.
When applied to an LWP control file, directs the specific LWP to stop and
waits until it has stopped. Completes when the LWP stops on an event of
interest, immediately if already so stopped.
PCDSTOP When applied to the process control file, directs all LWPs to stop without
waiting for them to stop.
When applied to an LWP control file, directs the specific LWP to stop
without waiting for it to stop.
PCWSTOP When applied to the process control file, simply waits for all LWPs to stop.
Completes when every LWP has stopped on an event of interest.
When applied to an LWP control file, simply waits for the LWP to stop.
Completes when the LWP stops on an event of interest, immediately if
already so stopped.
PCRUN
Introduction The control message PCRUN makes an LWP runnable again after a stop.
The operand is a set of flags, contained in a ulong_t, describing optional
additional actions.
Flag Description
PRCSIG Clears the current signal, if any (see PCSSIG)
PRCFAULT Clears the current fault, if any (see PCCFAULT)
PRSTEP Directs the LWP to execute a single machine
instruction. On completion of the instruction, a trace
trap occurs. If FLTTRACE is being traced, the LWP
stops, otherwise it is sent SIGTRAP; if SIGTRAP is
being traced and not held, the LWP stops. When the
LWP stops on an event of interest the single-step
directive is cancelled, even if the stop occurs before
the instruction is executed. This operation requires
hardware and operating system support and may not
be implemented on all processors.
PRSABORT Is significant only if the LWP is in a PR_SYSENTRY
stop or is marked PR_ASLEEP; it instructs the LWP
to abort execution of the system call (see
PCSENTRY, PCSEXIT).
Using PCRUN When applied to an LWP control file PCRUN makes the specific LWP
on an LWP runnable. The operation fails (EBUSY) if the specific LWP is not stopped
on an event of interest.
Using PCRUN When applied to the process control file an LWP is chosen for the
on a process operation as described for /proc/pid/status. The operation fails (EBUSY) if
the chosen LWP is not stopped on an event of interest. If PRSTEP or
PRSTOP were requested, the chosen LWP is made runnable; otherwise,
the chosen LWP is marked PR_REQUESTED. If as a result all LWPs are
in the PR_REQUESTED stop state, they are all made runnable.
Once an LWP has been made runnable by PCRUN, it is no longer stopped
on an event of interest even if, because of a competing mechanism, it
remains stopped.
PCSTRACE
Introduction PCSTRACE defines a set of signals to be traced in the process: the receipt
of one of these signals by an LWP causes the LWP to stop. The set of
signals is defined using an operand sigset_t contained in the control
message.
Held signals If a signal that is included in a held signal set of an LWP is sent to the LWP,
the signal is not received and does not cause a stop until it is removed
from the held signal set, either by the LWP itself or by setting the held
signal set with PCSHOLD or the PRSHOLD option of PCRUN.
PCSSIG
Introduction
PCSSIG The current signal and its associated signal information for the specific or
chosen LWP are set according to the contents of the operand siginfo
structure. If the specified signal number is zero, the current signal is
cleared. An error (EBUSY) is returned if the LWP is not stopped on an
event of interest. The semantics of this operation are different from those
of kill(2), _lwp_kill(2), or PCKILL in that the signal is delivered to the LWP
immediately after execution is resumed (even if the signal is being held)
and an additional PR_SIGNALLED stop does not intervene even if the
signal is being traced. Setting the current signal to SIGKILL ends the
process immediately.
PCKILL, PCUNKILL
Introduction
PCKILL If applied to the process control file, a signal is sent to the process with
semantics identical to those of kill(2). If applied to an LWP control file, a
signal is sent to the LWP with semantics identical to those of _lwp_kill(2).
The signal is named in an operand int contained in the message. Sending
SIGKILL ends the process or LWP immediately.
PCUNKILL A signal is deleted, that is, it is removed from the set of pending signals. If
applied to the process control file, the signal is deleted from the process’s
pending signals. If applied to an LWP control file, the signal is deleted from
the LWP’s pending signals. The current signal (if any) is unaffected. The
signal is named in an operand int in the control message. It is an error
(EINVAL) to attempt to delete SIGKILL.
PCSHOLD
Introduction PCSHOLD sets the held signal set for the specific or chosen LWP (signals
whose delivery will be delayed if sent to the LWP) according to the operand
sigset_t structure. SIGKILL and SIGSTOP cannot be held; if specified, they
are silently ignored.
PCSFAULT
Fault names Some fault names may not occur on all processors; there may be
processor-specific faults in addition to these. Fault names include the
following:
When not traced, a fault normally results in the posting of a signal to the
LWP that incurred the fault. If an LWP stops on a fault, the signal is posted
to the LWP when execution is resumed unless the fault is cleared by
PCCFAULT or by the PRCFAULT option of PCRUN. FLTPAGE is an
exception; no signal is posted. There may be additional processor-specific
faults like this.
PCCFAULT
PCCFAULT The current fault (if any) is cleared; the associated signal is not sent to the
specific or chosen LWP.
PCSENTRY, PCSEXIT
PCSENTRY, These control operations instruct the process’s LWPs to stop on entry to or
PCSEXIT exit from specified system calls. The set of system calls to be traced is
defined via an operand sysset_t structure.
When entry to a system call is being traced, an LWP stops after having
begun the call to the system but before the system call arguments have
been fetched from the LWP. When exit from a system call is being traced,
an LWP stops on completion of the system call just before checking for
signals and returning to user level. At this point all return values have been
stored into the LWP’s registers.
If an LWP is stopped on entry to a system call (PR_SYSENTRY) or when
sleeping in an interruptible system call (PR_ASLEEP is set), it may be
instructed to go directly to system call exit by specifying the PRSABORT
flag in a PCRUN control message. Unless exit from the system call is
being traced, the LWP returns to user level showing error EINTR.
PCSET, PCSET sets one or more modes of operation for the traced process;
PCRESET PCRESET resets these modes. The modes to be set or reset are
specified by flags in an operand long in the control message. The
flags are described below:
Flag Description
PR_FORK (inherit-on-fork) When set, the tracing flags of the process are
inherited by the child of a fork(2) or vfork(2).
When reset, child processes start with all tracing
flags cleared.
PR_RLC (run-on-last-close) When set and the last writable /proc file
descriptor referring to the traced process or any
of its LWPs is closed, all the tracing flags of the
process are cleared, any outstanding stop
directives are canceled, and if any LWPs are
stopped on events of interest, they are set
running as though PCRUN had been applied to
them. When reset, the process’s tracing flags are
retained and LWPs are not set running on last
close.
PR_KLC (kill-on-last-close) When set and the last writable /proc file
descriptor referring to the traced process or any
of its LWPs is closed, the process is terminated
with SIGKILL.
EINVAL It is an error (EINVAL) to specify flags other than those described above or
to apply these operations to a system process. The current modes are
reported in the pr_flags field of /proc/pid/status.
PCSREG PCSREG sets the general registers for the specific or chosen LWP
according to the operand gregset_t structure. There may be machine-
specific restrictions on the allowable set of changes. PCSREG fails
(EBUSY) if the LWP is not stopped on an event of interest.
PCSFPREG PCSFPREG sets the floating-point registers for the specific or chosen
LWP according to the operand fpregset_t structure. An error (EINVAL) is
returned if the system does not support floating-point operations (no
floating-point hardware and the system does not emulate floating-point
machine instructions). PCSFPREG fails (EBUSY) if the LWP is not
stopped on an event of interest.
PCNICE The nice(2) value of the specific or chosen LWP is incremented by the
amount contained in the operand int. Only the super-user may improve an
LWP’s priority in this way; any user may make the priority worse. This
operation is significant only when applied to an LWP in the time-sharing
scheduling class.
Directories
Object directory The object directory contains read-only files with names as they
appear in the entries of the map file, corresponding to objects
mapped into the address space of the target process. Opening such a
file yields a descriptor for the mapped file associated with a
particular address-space region. The name a.out also appears in the
directory as a synonym for the executable file associated with the
“text” of the running process.
lwp directory The lwp directory contains entries each of which names an LWP within the
containing process. These entries are directories containing additional files
and are described beginning on page 15.
Code Example
Introduction The following code is a simple example of how one process can use the
/proc file system to access the address space of another. Given a
single argument (the ID of a currently running process), it prints the
name and arguments of that process from its psinfo structure.
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/procfs.h>
int main(int argc, char *argv[])
{
    char fname[64];          /* path of the psinfo file, e.g. /proc/209/psinfo */
    struct psinfo p;
    int fd;
    sprintf(fname, "/proc/%s/psinfo", argv[1]);
    fd = open(fname, O_RDONLY);
    read(fd, &p, sizeof (struct psinfo));
    printf("process pid %s: exec path/args: %s %s\n",
        argv[1], p.pr_fname, p.pr_psargs);
    close(fd);
}