The Evolving Solaris Kernel

Past, Present & Future

Jim Mauro
Senior Staff Engineer - Performance & Availability Engineering
Sun Microsystems, Inc.
400 Atrium Drive, Somerset, NJ 08812
james.mauro@Sun.COM

Richard McDougall
Senior Staff Engineer - Performance & Availability Engineering
Sun Microsystems, Inc.
richard.mcdougall@sun.com

copyright (c) 2002 Jim Mauro and Richard McDougall Nov 2002 1

Agenda
• Introduction
• Solaris Overview
• Distribution
• Releases
• System Overview & Kernel Features
• 64-bits
• The Evolution
• Things added, things changed
• Tips and tidbits along the way...
• Major Features Review
• Solaris 7
• Solaris 8
• Solaris 9


Introduction
• What is Solaris?
• A complete operating environment, built on a modular, dynamic
kernel
• The Solaris Operating Environment (SOE)
• SunOS - the kernel (the 5.X thing)
• Windowing - desktop environment. CDE default, OpenWindows
still included
• GNOME 2 Beta Available
• GNOME is the strategic direction
• Open Network Computing (ONC+). NFS (V2 & V3), NIS/NIS+,
RPC/XDR, LDAP


Solaris Distribution
• Many CDs in the distribution
- WEB start CD (Installation)
- OS bits, disks 1 and 2
- Software Supplement (more optional bits)
- Flash PROM Update
- Maintenance Update
- Sun Management Center
- Forté Workshop (try n’ buy)
• Bonus Software
- Software Companion (GNU, etc)
- StarOffice 6
- SunONE Advantage Software (2 CDs)
- Oracle Enterprise Server


Releases
• Base release, followed by quarterly update releases
• Solaris 8 - released 2/00
• Solaris 8, 6/00 (update 1)
• Solaris 8, 10/00 (update 2)
• Solaris 8, 1/01 (update 3)
• Solaris 8, 4/01 (update 4)
• Solaris 8, 7/01 (update 5)
• Solaris 8, 10/01 (update 6)
• Solaris 8, 2/02 (update 7)
• Solaris 9 - base release, May, 2002
• The model is designed to
• Provide predictability for planning
• Provide a vehicle for getting new features, functionality and
patches out in a regular and timely fashion


Releases (cont)
• So, which release am I running?

sunsys> cat /etc/release
Solaris 8 6/00 s28s_u1wos_08 SPARC
Copyright 2000 Sun Microsystems, Inc. All Rights Reserved.
Assembled 26 April 2000
sunsys>

• Check out the “What’s New” document for a specific release at http://docs.sun.com


Kernel Features


System Overview

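[Figure: block diagram of the kernel. The System Call Interface sits on top. Beneath it are thread scheduling and process management (scheduling classes TS/IA, RT, FX, FSS), the Virtual File System framework (UFS, NFS, SPECFS), the virtual memory system, networking (sockets, TCP, IP) and kernel services (clocks & timers, callouts). Below those are the Hardware Address Translation (HAT) layer and the bus and device drivers (SD, SSD), resting on the hardware.]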


Solaris Kernel Features


• Dynamic Kernel
• Small core unix modules
• Major subsystems implemented as dynamically loadable modules
(file systems, scheduling classes, STREAMS modules, system
calls).
• Dynamic resource sizing & allocation (processes, files, locks,
memory, etc)
• Dynamic sizing based on system size
• Goal is to minimize/eliminate the need to use /etc/system tunable parameters


Solaris Kernel Features


• Preemptive kernel
• Does NOT require interrupt disable/blocking via PIL for
synchronization
• Most kernel code paths are preemptable
• A few non-preemption points in critical code paths
• SCALABILITY & LOW LATENCY INTERRUPTS
• Well-defined, layered interfaces
• Module support, synchronization primitives, etc


Solaris Kernel Features


• Multithreaded kernel
• Kernel threads perform core system services
• Fine grained locking for concurrency
• Threaded subsystems
• Multithreaded process model
• User level threads and synchronization primitives
• Solaris (UI) & POSIX threads
• Two-level (M x N) model, evolved to one-level model
• Alternate thread library in Solaris 8
• Default thread library Solaris 9


Solaris Kernel Features


• Table-driven dispatcher with multiple scheduling class
support
• Dynamically loadable/modifiable table values
• Relatively “easy” to add new scheduling classes
• FSS and FX in Solaris 9
• Realtime support with preemptive kernel
• Additional kernel support for realtime applications (memory page
locking, asynchronous I/O, processor sets, interrupt control, high-
res clock)
• Kernel tuning via text file (/etc/system, driver.conf)
• Some things can be done “on the fly” with mdb(1)


Solaris Kernel Features


• Tightly integrated virtual memory and file system support
• Dynamic page cache memory implementation
• Virtual File System (VFS) Implementation
• Object-like abstractions for files and file systems
• Facilitates new features/functionality
• Kernel sockets via sockfs
• procfs (/proc) enhancements
• Doors (doorfs)
• fdfs, swapfs, tmpfs
• Disk-based, distributed & pseudo file systems


Solaris Kernel Features


• 32-bit and 64-bit kernel
• 64-bit kernel required for UltraSPARC-III based systems
(SunBlade, SunFire)
• 32-bit apps run just fine...
• Solaris DDI/DKI Implementation
• Device driver interfaces
• Includes interfaces for dynamic attach/detach/pwr
• Rich set of standards-compliant interfaces
• POSIX, UNIX International


Solaris Kernel Features


• Integrated networking facilities
• TCP/IP: IPv4, IPSec, IPv6
• Name services - DNS, NIS, NIS+, LDAP
• NFS - de facto standard distributed file system, NFS-V2 & NFS-V3
• Remote Procedure Call/External Data Representation (RPC/XDR)
facilities
• Sockets, TLI, Federated Naming APIs


64-Bits


64-bit Solaris
• Since Solaris 7, full 32-bit binary compatibility
• A simple directory namespace rule lets 32-bit and 64-bit binaries co-exist on a 64-bit Solaris system:
for every directory that contains binary object files (executables, shared object libraries, etc), there is a
sparcv9 subdirectory containing the 64-bit versions
• All kernel modules must be of the same data model: ILP32
(32-bit data model) or LP64 (64-bit data model)
• 64-bit kernel required to run 64-bit apps


32 bit limits
• Solaris 2.5
• Heap is limited to 2GB, malloc will fail beyond 2GB
• Solaris 2.5.1
• Heap limited to 2GB by default
• Can go beyond 2GB with kernel patch 103640-08+
• Can raise the limit to 3.75GB with ulimit or setrlimit(2) if uid=root
• Do not need to be root with 103640-23+
• Solaris 2.6
• Heap limited to 2GB by default
• Can raise the limit to 3.75GB with ulimit or setrlimit(2)
• Solaris 7 & 8
• Limits are raised by default
• 32 bit program can malloc 3.99GB
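
The limit-raising step above (ulimit in a shell, or the getrlimit/setrlimit calls in a program) can be sketched with Python's standard `resource` module, which wraps those calls on Unix systems. This is only an illustration: it assumes the data-segment limit (RLIMIT_DATA) is the one that caps the heap, and the function names are hypothetical.

```python
import resource

def raise_data_limit():
    """Raise the soft data-segment (heap) limit to the hard limit.

    An unprivileged process may raise its soft limit up to, but not
    beyond, the hard limit; this mirrors what `ulimit -d` does in a shell.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_DATA)
    if soft != hard:
        resource.setrlimit(resource.RLIMIT_DATA, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_DATA)

def describe(limit):
    """Render a limit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else f"{limit} bytes"

if __name__ == "__main__":
    soft, hard = raise_data_limit()
    print("data segment soft limit:", describe(soft))
    print("data segment hard limit:", describe(hard))
```

Raising the hard limit itself is the part that needs privilege, which is why the early patch levels listed above required uid=root.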


Solaris/SPARC V8/V9 Data Model


• Defines the width of integral data types
• 32-bit Solaris - ILP32
• 64-bit Solaris - LP64
C data type    ILP32   LP64    (widths in bits)
char               8      8
short             16     16
int               32     32
long              32     64
long long         64     64
pointer           32     64
enum              32     32
float             32     32
double            64     64
quad             128    128


64-bit Performance
• 64 Bit Virtual Address Space
• (+) Free from the 3.9GB barrier
• (+) Memory map large files
• 64 Bit data types
• (+) 64 Bit Arithmetic, 64 Bit Registers
• (-) Pointers/Longs require moving 8 bytes
• Typically ~5% delta
• Larger cache footprint
• (-) Larger Stack
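
To make the "larger cache footprint" point concrete, here is a small sketch that lays out a C struct under each data model. The type widths come from the ILP32/LP64 table; the alignment rule (each member aligned to its own size) and the example `node` struct are assumptions for illustration, not something the slides specify.

```python
# Sizes in bytes of C types under each data model (from the ILP32/LP64 table).
DATA_MODELS = {
    "ILP32": {"char": 1, "short": 2, "int": 4, "long": 4, "pointer": 4},
    "LP64":  {"char": 1, "short": 2, "int": 4, "long": 8, "pointer": 8},
}

def struct_size(fields, model):
    """Size of a C struct, assuming natural alignment: each member is
    aligned to its own size and the struct is padded to its widest member."""
    sizes = DATA_MODELS[model]
    offset = 0
    widest = 1
    for ctype in fields:
        size = sizes[ctype]
        offset = (offset + size - 1) // size * size   # pad up to alignment
        offset += size
        widest = max(widest, size)
    return (offset + widest - 1) // widest * widest   # tail padding

# Hypothetical list node: struct node { int key; long count; struct node *next; }
NODE = ["int", "long", "pointer"]

if __name__ == "__main__":
    for model in ("ILP32", "LP64"):
        print(model, struct_size(NODE, model), "bytes per node")
```

Under these assumptions the node doubles in size when compiled LP64, so the same cache holds half as many nodes; pointer-heavy data structures are where the 64-bit penalty shows up.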


Which Data Model Is Booted?


• Use isainfo(1)
sunsys> isainfo
sparcv9 sparc
sunsys> isainfo -b
64
sunsys> isainfo -v
64-bit sparcv9 applications
32-bit sparc applications

• Or isalist(1)
sunsys> isalist
sparcv9+vis sparcv9 sparcv8plus+vis sparcv8plus sparcv8 sparcv8-fsmuld sparcv7 sparc
sunsys>

• man isaexec(3C)
• Invoke isa-specific executable
• To create wrappers for shipping both 32-bit and 64-bit binaries,
and automatically launching the correct one
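
The wrapper idea can be sketched as a search over the isalist output: try each instruction set name in order and run the first matching binary found in a subdirectory of that name. This is a simplified model of the behaviour, not the isaexec(3C) implementation; the directory layout and the `myapp` paths are hypothetical.

```python
def pick_isa_binary(isalist, available, directory, name):
    """Return the first existing <directory>/<isa>/<name>, or None.

    isalist:   ISA names, best-performing first (the order isalist(1) prints)
    available: set of paths that exist (stands in for the file system)
    """
    for isa in isalist.split():
        candidate = f"{directory}/{isa}/{name}"
        if candidate in available:
            return candidate
    return None

# isalist output from a 64-bit system, as shown above.
ISALIST = ("sparcv9+vis sparcv9 sparcv8plus+vis sparcv8plus "
           "sparcv8 sparcv8-fsmuld sparcv7 sparc")

# Hypothetical install: one 64-bit and one 32-bit build of "myapp".
INSTALLED = {"/opt/app/bin/sparcv9/myapp", "/opt/app/bin/sparcv7/myapp"}

if __name__ == "__main__":
    print(pick_isa_binary(ISALIST, INSTALLED, "/opt/app/bin", "myapp"))
```

On a system booted with the 32-bit kernel the list would not contain the sparcv9 entries, so the same wrapper would fall through to the 32-bit build.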


Evolving Features & Technical Tidbits


The Evolution

[Figure: timeline of Solaris releases from 1992 to 2002 (Solaris 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.5.1, 2.6, 7, 8 and 9), annotated with what each brought. The early releases grow from uniprocessor-only to 4-way, 8-way and 20-way SMP and add sun4d and sun4u support, a new DNLC, a new KMA (slab allocator), Cachefs, CDE, kernel large pages, VFS/Vnode changes, ISM, NFS V3, Doors, large UFS, large files, processor sets, kernel sockets, lockstat, UFS directio and DR. Solaris 7 and 8 add the 64-bit kernel and 64-bit processes, UFS logging, priority paging, cyclics, T2, US-III, SunFire and StarCat support, freeware and UFS++. Solaris 9 adds SVM, MPSS, MPO, Resource Pools, FSS and FX.]


General Priorities
• Reliability, scalability, performance
• on-going
• Standards compliance
• SunOS 4.X binary compatibility
• Threads / SMP scalability
• Big systems performance
• VM & I/O
• Lessons learned on threads
• Resource management
• Consolidation, ROI, TCO
• Resource Pools, Service Containers, Resource Virtualization


Virtual Memory & The Dynamic Page Cache
• Creating a dynamic page cache allows for all of physical
memory to be used as disk buffer cache (read(2), write(2))
• The evolution of systems hardware, RAID and general I/O
tuning can create environments where the buffer cache
throttles the VM system
• The VM roller coaster (keeping the freelist sane)
• Priority paging (2.6 & 7) provided a band-aid
• Using directio bypasses the page cache for UFS reads/writes
• Solaris 8 implements a new cyclic page cache


Global Memory Management


• Demand Paged
• Not recently used (NRU) algorithm
• Dynamic file system cache
• Where has all my memory gone?
• Page scanner
• Operates bottom up from physical pages
• Default mode treats all memory equally


The Old Page Cache

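[Figure: the old page cache. Boxes for kernel memory, segmap, and process memory (heap, data, stack). Pages pushed out of segmap and pages freed by the page scanner land on the free list, from which they can be reclaimed.]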


The Cyclic Page Cache

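[Figure: the cyclic page cache. Boxes for kernel memory, segmap, and process memory (heap, data, stack). Pages pushed out of segmap go to a separate cache list, which sits alongside the free list and from which they can be reclaimed.]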


Global Paging Dynamics


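[Figure: scan rate versus free memory for a 1GB example. The vertical axis runs from slowscan (100) to fastscan (8192). The free-memory axis marks throttlefree, minfree, desfree, lotsfree, cachefree and cachefree+deficit, with 4MB, 4MB, 8MB, 16MB and 32MB shown for the first five; pages_before_pager is also marked.]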
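
The numbers in the 1GB example can be reproduced with a short sketch. The ratios below (lotsfree at 1/64 of memory, each lower threshold half the one above it, cachefree at twice lotsfree) and the straight-line ramp between slowscan and fastscan are assumptions chosen to match the figure, not values quoted from the kernel source.

```python
MB = 1024 * 1024

def thresholds(physmem_bytes):
    """Paging thresholds in bytes, using ratios that match the 1GB example
    (assumed: lotsfree = 1/64 of memory, each lower threshold half the one
    above it, throttlefree = minfree, cachefree = 2 x lotsfree)."""
    lotsfree = physmem_bytes // 64
    desfree = lotsfree // 2
    minfree = desfree // 2
    return {
        "throttlefree": minfree,
        "minfree": minfree,
        "desfree": desfree,
        "lotsfree": lotsfree,
        "cachefree": 2 * lotsfree,
    }

def scan_rate(free_bytes, start_bytes, slowscan=100, fastscan=8192):
    """Pages/sec scanned: zero at or above the start threshold, then a
    straight line from slowscan at the threshold to fastscan at zero free."""
    if free_bytes >= start_bytes:
        return 0
    shortfall = (start_bytes - free_bytes) / start_bytes
    return round(slowscan + shortfall * (fastscan - slowscan))

if __name__ == "__main__":
    t = thresholds(1024 * MB)
    for name, value in t.items():
        print(f"{name:12s} {value // MB}MB")
    print("scan rate with 8MB free:", scan_rate(8 * MB, t["lotsfree"]))
```

Passing `t["cachefree"]` instead of `t["lotsfree"]` as the start threshold models the priority paging case, where scanning begins earlier.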


Priority Paging
• Solaris 7 FCS or Solaris 2.6 with T-105181-09
• http://www.sun.com/sun-on-net/performance/priority_paging.html
• Set priority_paging=1 or cachefree in /etc/system
• Solaris 7 Extended vmstat
• ftp://playground.sun.com/pub/rmc/memstat
• Solaris 8
• New VM system, priority paging implemented at the core (make
sure it’s disabled in Sol 8!)
• New vmstat flag, “-p”
• Solaris 9
• Multiple page size support (MPSS)
• Memory Placement Optimizations (MPO)


Memory Monitoring
• Use vmstat or the memstat command on Solaris 7
• ftp://playground.sun.com/pub/rmc/memstat
# vmstat 3
procs memory page disk faults cpu
r b w swap free re mf pi po fr de sr f0 s0 s4 s6 in sy cs us sy id
0 0 0 269776 21160 0 0 0 0 0 0 0 0 0 0 2 154 200 92 0 0 100
0 0 0 269776 21152 0 0 0 0 0 0 0 0 0 0 2 155 203 113 0 0 99
0 0 0 269720 3896 5 17 80 0 109 0 59 0 0 0 2 221 773 134 0 2 98
0 0 0 269616 3792 0 0 160 0 160 0 76 0 0 0 2 279 242 130 0 1 99
0 0 0 269616 3792 0 0 192 0 192 0 105 0 0 0 2 294 225 138 0 1 99
0 0 0 269616 3800 1 90 234 5 232 0 99 0 0 0 2 323 964 305 5 3 92
0 0 0 269656 3832 0 0 106 0 106 0 51 0 0 0 2 237 212 121 0 1 99

# memstat 3 (Solaris 7 Only)
or
# vmstat -p 3 (Solaris 8+)
memory ---------- paging ----------- - executable - - anonymous - -- filesys -- --- cpu ---
free re mf pi po fr de sr epi epo epf api apo apf fpi fpo fpf us sy wt id
21160 0 22 0 5 5 0 0 0 0 0 0 0 0 0 5 5 0 1 0 99
21152 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 100
21152 0 18 34 2 2 0 0 0 0 0 0 0 0 34 2 2 0 1 0 99
11920 0 0 277 106 272 0 153 0 0 32 0 98 149 277 8 90 0 3 0 97
11888 0 0 256 69 224 0 106 0 0 16 0 69 178 256 0 29 0 3 1 96
11896 0 0 213 106 261 0 124 0 0 26 0 106 232 213 0 2 0 3 13 84
11904 0 0 245 66 242 0 122 0 0 16 0 64 221 245 2 5 0 2 0 98
11896 0 0 245 64 224 0 132 0 0 21 0 64 189 245 0 13 0 2 0 98


Simple Memory Rule:


• Identifying a memory shortage without PP:
• Scanner not scanning -> no memory shortage
• Scanner running, page ins and page outs, swap device activity ->
potential memory shortage
• (use separate swap disk or 2.6 iostat -p to measure swap partition
activity)
• Identifying a memory shortage with PP on Sol 7:
• api and apo should be zero in memstat; non-zero values are a clear sign of
memory shortage
• Identifying a memory shortage on Sol 8:
• scan rate != 0
• freemem is real
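
These rules are easy to apply mechanically to the extended vmstat output. The sketch below parses one row using the column order of the `vmstat -p` header shown earlier and applies the Solaris 8 test (scan rate non-zero), reporting anonymous paging as corroboration. The function names are illustrative only.

```python
COLUMNS = ("free re mf pi po fr de sr epi epo epf "
           "api apo apf fpi fpo fpf us sy wt id").split()

def parse_row(line):
    """Map one data line of the extended vmstat output to {column: int}."""
    return dict(zip(COLUMNS, map(int, line.split())))

def memory_shortage(row):
    """Solaris 8 rule: a non-zero scan rate (sr) means a memory shortage.
    Anonymous page-ins/outs (api, apo) are reported as supporting evidence."""
    return {
        "shortage": row["sr"] != 0,
        "anon_paging": row["api"] != 0 or row["apo"] != 0,
    }

# Two rows in the format of the sample output: one idle, one under pressure.
IDLE = "21152 " + "0 " * 19 + "100"
BUSY = "11920 0 0 277 106 272 0 153 0 0 32 0 98 149 277 8 90 0 3 0 97"

if __name__ == "__main__":
    for label, line in (("idle", IDLE), ("busy", BUSY)):
        print(label, memory_shortage(parse_row(line)))
```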


Memory Summary
• Solaris 9
# mdb -k
> ::memstat
Page Summary Pages MB %Tot
------------ ---------------- ---------------- ----
Kernel 21146 165 9%
Anon 16891 131 7%
Exec and libs 8389 65 3%
Page cache 8248 64 3%
Free (cachelist) 2490 19 1%
Free (freelist) 190309 1486 77%

Total 247473 1933

• Solaris 8 and earlier
# prtmem
Total memory: 1933 Megabytes
Kernel Memory: 164 Megabytes
Application: 128 Megabytes
Executable & libs: 65 Megabytes
File Cache: 64 Megabytes
Free, file cache: 19 Megabytes
Free, free: 1491 Megabytes
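
The MB and %Tot columns of the ::memstat output follow directly from the page counts. The sketch below reproduces them, assuming an 8K base page size and inferring from the sample numbers that megabytes are truncated while percentages are rounded; neither detail is stated in the slides.

```python
PAGE_SIZE = 8192  # bytes; assumed base page size (8K)

# Page counts from the ::memstat sample above.
PAGES = {
    "Kernel": 21146,
    "Anon": 16891,
    "Exec and libs": 8389,
    "Page cache": 8248,
    "Free (cachelist)": 2490,
    "Free (freelist)": 190309,
}

def summarize(pages, page_size=PAGE_SIZE):
    """Return (total_pages, {category: (megabytes, percent_of_total)})."""
    total = sum(pages.values())
    rows = {}
    for name, count in pages.items():
        mb = count * page_size // (1024 * 1024)   # truncated, as in the sample
        pct = round(100 * count / total)          # rounded, as in the sample
        rows[name] = (mb, pct)
    return total, rows

if __name__ == "__main__":
    total, rows = summarize(PAGES)
    for name, (mb, pct) in rows.items():
        print(f"{name:18s} {PAGES[name]:8d} {mb:6d} {pct:3d}%")
    print(f"{'Total':18s} {total:8d} {total * PAGE_SIZE // (1024 * 1024):6d}")
```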


The Threads Model


• Original 2-level, MxN model design goals
• Scalability
• Lightweight threads
• Pools of Virtual Processors (LWPs)
• Bound threads available
• Lessons learned...
• User level thread scheduling is complex
• Signal delivery is, at times, a nightmare
• Kernel threads are not as expensive as they used to be
• What we have today
• Alternate thread library in Solaris 8 (/usr/lib/lwp/libthread.so)
• 1-level is the default in Solaris 9 (/usr/lib/libthread.so)


2-Level MxN Model


[Figure: the 2-level MxN model in three layers. User layer: processes (proc 1 to proc 4) containing user threads that run on LWPs. Kernel layer: kernel threads, including an unattached kernel thread, scheduled by the dispatcher. Hardware layer: processors (CPUs).]

• The 1 level model is effectively all bound threads (proc 4)


Resource Management
• Effective management of hardware resources to applications
• Large application performance
• Multiple apps per Solaris instance (consolidation)
• Provide boundaries on resource consumption by applications
• Resource categories
• Processors (CPUs)
• Memory (physical memory)
• Disk IO bandwidth/latency/IOPS
• Network bandwidth/latency
• This is an on-going effort, with significant improvements in
subsequent Solaris 9 quarterly releases


Processor Control Commands


• CPU related commands
• psrinfo(1M) - provides information about the
processors on the system. Use “-v” for verbose
• psradm(1M) - online/offline processors, interrupt
enable/disable
• psrset(1M) - creation and management of processor
sets
• pbind(1M) - original processor bind command. Does
not provide exclusive binding
• processor_bind(2), processor_info(2),
pset_bind(2), pset_info(2), pset_creat(2),
p_online(2): system calls to do things
programmatically


Solaris 9 Resource Management


• Tasks, Projects & Extended Accounting
• Task - A collection of processes
• Project - A collection of tasks

[Figure: a project containing three tasks, each task containing several processes.]


Solaris 9 Resource Management


• Tasks & Projects provide abstractions for binding together
related processes, for the purpose of
• Resource management. Tasks and Projects can be bound to
processor sets, have scheduler changes applied to them, etc.
• Resource controls. Resource limits can be applied at the Project or
Task level.
• Resource monitoring. Tools have been enhanced to monitor
utilization at the Project or Task level.
• “prstat -J” - Display statistics for processes and projects
• “prstat -T” - Display statistics for processes and tasks
• Extended accounting. The accounting facility has been enhanced to provide
project and task level accounting data.


Solaris 9 Resource Controls


• The following resource controls are available
project.cpu-shares: Number of CPU shares (FSS) available to this project
task.max-cpu-time: Maximum CPU time available to the processes in this task (milliseconds)
task.max-lwps: Maximum number of LWPs available to the processes in this task
process.max-cpu-time: Max CPU time available to this process
process.max-file-descriptor: Max number of open files for this process
process.max-file-size: Max file size
process.max-core-size: Max core file size
process.max-data-size: Max size of the process’s data segment
process.max-stack-size: Max size of the process’s stack


Solaris 9 Fair Share Scheduler


• Share based (versus priority based) process scheduling
• Designed to provide a guaranteed minimum amount of CPU
resources to a specific application (project/task)
• Defining a maximum, or ceiling, not currently available
• Shares are allocated to projects
• Shares are not percentages
• Shares allocated are relative to shares allocated to other projects
• The total number of shares allocated also matters
• FSS can be used in conjunction with processor sets
• Finer grained management and control


FSS & Processor Sets

Processor Set 1 (2 CPUs, 25% of System)
    Project A   16.66%  (1/6)
    Project B   33.33%  (2/6)
    Project C   50%     (3/6)
Processor Set 2 (4 CPUs, 50% of System)
    Project B   40%     (2/5)
    Project C   60%     (3/5)
Processor Set 3 (2 CPUs, 25% of System)
    Project C   100%    (3/3)
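
The percentages in this example follow from one rule: within a processor set, a project's entitlement is its shares divided by the total shares of the projects active in that set. The sketch below reproduces the figure; the share counts (A=1, B=2, C=3) are inferred from the fractions shown, and the function names are illustrative.

```python
from fractions import Fraction

SHARES = {"A": 1, "B": 2, "C": 3}   # shares per project, inferred from the figure
PSETS = {                            # processor set -> (CPUs, projects running there)
    1: (2, ["A", "B", "C"]),
    2: (4, ["B", "C"]),
    3: (2, ["C"]),
}

def entitlement(pset):
    """Fraction of one processor set each active project is entitled to."""
    _, projects = PSETS[pset]
    total = sum(SHARES[p] for p in projects)
    return {p: Fraction(SHARES[p], total) for p in projects}

def system_fraction(project):
    """Fraction of the whole machine a project is entitled to."""
    all_cpus = sum(cpus for cpus, _ in PSETS.values())
    return sum(entitlement(s).get(project, 0) * Fraction(cpus, all_cpus)
               for s, (cpus, _) in PSETS.items())

if __name__ == "__main__":
    for pset in PSETS:
        print("set", pset, {p: str(f) for p, f in entitlement(pset).items()})
    for project in SHARES:
        print("project", project, "system-wide:", system_fraction(project))
```

The same three shares buy Project C half of set 1, three fifths of set 2 and all of set 3, which is the sense in which shares are relative and are not percentages.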


Resource Pools
• Provides a facility for stateful (persistent) processor sets and
project binding, as well as scheduling class assignment
• Resource pool management is done via pooladm(1M),
poolbind(1M), and poolcfg(1M).
• /etc/pooladm.conf provides persistence across reboots
(managed via poolcfg(1M))
• poolbind(1M) provides for binding of projects or tasks to a
resource pool
• /etc/project can define a resource pool for a project or task


Solaris Release Features Summary


Solaris 7 - New Features


• 64-Bits
• Kernel
• 64-bit binary support
• Full binary compatibility for 32-bit executables
• UFS logging
• mount -o logging
• Logs to spare blocks in cylinder group
• No fsck
• UFS noatime
• Disable access time update to inodes
• pgrep & pkill
• Ends ps -ef | grep proc_name | awk ‘ { print $2 }’


Solaris 7 - New Features


• traceroute bundled
• dumpadm(1M)
• Configure a separate raw partition for dumps
• Dump running systems
• LDAP Client Library
• TCP with SACK
• Selective Acknowledgement - RFC 2018
• libdevinfo(3)
• Device configuration information APIs
• truss(1) Enhanced
• User level function tracing. “-u”, “-U”


Solaris 8 - New Features


• Cyclic Page Cache
• Enhanced VM page management functionality
• Priority for page allocation given to process segments
• freemem is real!
• System Message IDs
• Numeric ID generated for syslog messages
• devfsadm(1M)
• One tool for device configuration/management
• DR events managed through devfsadmd
• mmap MAP_ANON
• a = mmap(addr, len, prot, flags | MAP_ANON, -1, off);
• POSIX High Resolution Timers
• CLOCK_HIGHRES via new Cyclics kernel subsystem
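
The MAP_ANON call listed above requests memory with no backing file, which previously meant opening /dev/zero and mapping that. Python's standard `mmap` module exposes the same flag on Unix systems, so the idea can be shown in a few lines; the helper name is hypothetical.

```python
import mmap

def anon_buffer(length):
    """Map `length` bytes of private anonymous memory (file descriptor -1,
    no backing file). The pages read as zeros until written."""
    return mmap.mmap(-1, length,
                     flags=mmap.MAP_PRIVATE | mmap.MAP_ANON,
                     prot=mmap.PROT_READ | mmap.PROT_WRITE)

if __name__ == "__main__":
    buf = anon_buffer(2 * mmap.PAGESIZE)
    buf[:5] = b"hello"
    print(buf[:5], len(buf))
    buf.close()
```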


Solaris 8 - New Features


• prstat(1)
• Top-like curses based process monitor utility
• apptrace(1)
• truss-like utility for tracing user-level library calls
• /proc tools enhanced to work on core files
• pstack(1), pcred(1), pfiles(1)
• coreadm(1M)
• System-wide core file management
• mdb(1)
• New kernel debugger - replaces adb & crash
• Supports use of adb macros and crash utilities
• Evolved to manage user code debugging (Sol 9)


Solaris 8 - New Features


• User Level Priority Inheritance
• User defined mutex locks attribute
• Forced unmount
• umount -f
• Alternate threads library
• /usr/lib/lwp/libthread.so - provides all bound threads
• Does not require re-compilation
• Freeware CD
• apache, bash, bzip2, tcsh, gcc, mkisofs, less, zsh, Glib, GTK+, etc,
etc,...


Solaris 9 - New Features


• Many, many (but not all) Solaris 9 features have been
backported to Solaris 8
• Available in various Solaris 8 update releases
• Resource Management
• Resource pools - configure boundaries on resources consumed by
processes and tasks
• Processors today, memory coming
• Resource pools cross reboots (unlike processor sets and bindings)
• See prctl(1), pooladm(1M), poolcfg(1M), poolbind(1M),
rctladm(1M), project(4)
• Fixed-Priority Scheduling Class (FX)
• TS class priority range, but priorities remain fixed
• Fair Share Scheduling Class (FSS)
• Share-based (versus priority-based) CPU allocation


Solaris 9 - New Features


• Command line process facilities
• pargs(1) - dump args and env associated with a live process, or
core file
• preap(1) - remove zombies (Harry Cooper & Ben could have used
this in 1968!)
• du(1), df(1M) and ls(1) - New “-h” option
• “-h” - provide human-readable output format.
• Lists sizes in Kbytes, Mbytes, Gbytes, etc...
• Multiple Page Size Support (MPSS)
• Support of pages larger than 8k for process stack, heap and
mmap’d anonymous memory
• Actual supported page sizes hardware dependent
• UltraSPARC-III supports 8k, 64k, 512k, 4MB...
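
Why larger pages help can be seen with a little arithmetic: the number of translations needed to map a segment falls in proportion to the page size, and the address range a fixed number of TLB entries can cover grows with it. The sketch uses the four UltraSPARC-III page sizes listed above; the 1GB heap and the 512-entry TLB are hypothetical figures for illustration.

```python
PAGE_SIZES = [8192, 65536, 524288, 4194304]   # bytes: 8k, 64k, 512k, 4MB

def pages_needed(segment_bytes, page_size):
    """Number of pages (and so translations) needed to map a segment."""
    return -(-segment_bytes // page_size)      # ceiling division

def tlb_reach(entries, page_size):
    """Bytes of address space that `entries` TLB entries can cover."""
    return entries * page_size

if __name__ == "__main__":
    heap = 1 << 30          # hypothetical 1GB heap
    entries = 512           # hypothetical TLB size
    for size in PAGE_SIZES:
        print(f"{size:8d}-byte pages: {pages_needed(heap, size):7d} translations, "
              f"TLB reach {tlb_reach(entries, size) // (1 << 20)}MB")
```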


Solaris 9 - New Features


• MPSS (cont)
jurassic> pagesize -a
8192
65536
524288
4194304
jurassic>

• New Threads Library/Model


• 1 Level threads model - all bound threads
• What was the alternate threads library in Solaris 8 is the default (in
/usr/lib) in Solaris 9.
• Dynamic Intimate Shared Memory (DISM)
• Allows database to dynamically shrink/grow the shared segment
• Original ISM implementation was a big performance win (shared
translation information, large pages), but was fixed in size
• DISM gives the best of both worlds


Solaris 9 - New Features


• Security
• Internet Key Exchange (IKE) Protocol
• Secure Shell (ssh) - SSHv1 & SSHv2
• Kerberos Key Distribution Center (KDC) & Admin Tools
• Secure LDAP
• 128-bit Encryption
• Role-Based Access Controls (RBAC) Enhanced
• tcp-wrappers 7.6 in freeware CD
• Xserver encrypted connections supported


Solaris 9 - New Features


• iPlanet Directory Server
• LDAP Server bundled/integrated
• LDAP Name Service Support
• NIS+ - to - LDAP Migration Tool
• FTP Server
• Based on the WU-FTPD server
• PPP 4.0
• Includes PPPoE (Solaris 8 7/01)
• IP Network Multipathing (Solaris 8 10/00)
• Solaris Volume Manager
• Formerly Solaris DiskSuite
• Soft partitions and Device ID support


Summary
• Steady, sustained progress on key areas - scalability,
reliability, performance, features
• Going forward
• Resource management - memory, service containers
• Observability - More & better tools
• Resilience - fault detection, isolation, containment
• Management - Zero downtime admin
• patches, upgrades
• Reliability, performance, always at the top


Supplemental Slides


Kernel Statistics
• Solaris uses a central mechanism for kernel statistics
• “kstat”
• Kernel providers
• raw statistics (c structure)
• typed data
• classed statistics
• Perl and C API
• kstat(1M) command
# kstat -n system_misc
module: unix instance: 0
name: system_misc class: misc
avenrun_15min 90
avenrun_1min 86
avenrun_5min 87
boot_time 1020713737
clk_intr 2999968
crtime 64.1117776
deficit 0
lbolt 2999968
ncpus 2


Memory Accounting
• The ps command
• SZ = Virtual Size
• RSS = Resident Set Size (including shared)
# /usr/ucb/ps -aux
USER PID %CPU %MEM SZ RSS TT S START TIME COMMAND
root 22998 12.0 0.8 4584 1992 ? S 10:05:30 3:22 /usr/sbin/nsr/nsrc
root 23672 1.0 0.7 1736 1592 pts/16 O 10:22:54 0:00 /usr/ucb/ps -aux
root 3 0.4 0.0 0 0 ? S Sep 28 166:38 fsflush
root 733 0.4 1.0 6352 2496 ? S Sep 28 174:29 /opt/SUNWsymon/jre
root 345 0.3 0.7 2968 1736 ? S Sep 28 55:39 /usr/sbin/nsr/nsrd
root 23100 0.2 0.5 3880 1104 ? S Oct 15 0:25 rpc.rstatd
root 732 0.2 2.5 9920 6304 ? S Sep 28 94:43 esd - init topolog


The pmap command


• Verbose Process mappings
• Solaris 8 private/shared
• Solaris 9 private=Anon, shared=RSS-Anon
# pmap -x 123
Address Kbytes RSS Anon Locked Mode Mapped File
00010000 8 8 - - r-x-- mmap
00020000 8 8 8 - rwx-- mmap
01000000 1024 1024 - - rw-s- dev:0,2 ino:5304657
02000000 1024 1024 512 - rw--- dev:0,2 ino:5304657
03000000 1024 1024 512 - rw--R dev:0,2 ino:5304657
04000000 1024 1024 1024 - rw--- [ anon ]
05000000 512 512 512 - rw--R [ anon ]
FF280000 680 680 - - r-x-- libc.so.1
FF33A000 32 32 32 - rwx-- libc.so.1
FF380000 16 16 - - r-x-- libc_psr.so.1
FF3A0000 8 8 - - r-x-- libdl.so.1
FF3B0000 8 8 8 - rwx-- [ anon ]
FF3C0000 152 152 - - r-x-- ld.so.1
FF3F6000 8 8 8 - rwx-- ld.so.1
FFBFA000 24 24 24 - rwx-- [ stack ]
-------- ------- ------- ------- -------
total Kb 5552 5552 2640 -
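
The Solaris 9 rule above (private = Anon, shared = RSS - Anon) can be applied to the sample output directly. The sketch parses the `pmap -x` rows shown here and computes both totals; the parsing is a minimal assumption about the column layout (seven whitespace-separated fields, with "-" meaning zero), not a general pmap parser.

```python
SAMPLE = """\
00010000 8 8 - - r-x-- mmap
00020000 8 8 8 - rwx-- mmap
01000000 1024 1024 - - rw-s- dev:0,2 ino:5304657
02000000 1024 1024 512 - rw--- dev:0,2 ino:5304657
03000000 1024 1024 512 - rw--R dev:0,2 ino:5304657
04000000 1024 1024 1024 - rw--- [ anon ]
05000000 512 512 512 - rw--R [ anon ]
FF280000 680 680 - - r-x-- libc.so.1
FF33A000 32 32 32 - rwx-- libc.so.1
FF380000 16 16 - - r-x-- libc_psr.so.1
FF3A0000 8 8 - - r-x-- libdl.so.1
FF3B0000 8 8 8 - rwx-- [ anon ]
FF3C0000 152 152 - - r-x-- ld.so.1
FF3F6000 8 8 8 - rwx-- ld.so.1
FFBFA000 24 24 24 - rwx-- [ stack ]
"""

def to_kb(field):
    """pmap prints '-' where a column is zero."""
    return 0 if field == "-" else int(field)

def parse_pmap_x(text):
    """Parse pmap -x data rows into dicts (sizes in Kbytes)."""
    rows = []
    for line in text.strip().splitlines():
        addr, kbytes, rss, anon, _locked, mode, name = line.split(None, 6)
        rows.append({"addr": addr, "kbytes": to_kb(kbytes), "rss": to_kb(rss),
                     "anon": to_kb(anon), "mode": mode, "name": name})
    return rows

def private_and_shared(rows):
    """Solaris 9 rule: private = Anon, shared = RSS - Anon (both in Kbytes)."""
    rss = sum(r["rss"] for r in rows)
    private = sum(r["anon"] for r in rows)
    return private, rss - private

if __name__ == "__main__":
    private, shared = private_and_shared(parse_pmap_x(SAMPLE))
    print("private Kb:", private, "shared Kb:", shared)
```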


SWAP Space ctd...


# swap -s
total: 101456k bytes allocated + 12552k reserved = 114008k used, 597736k available

should read:

total: 101456k bytes unallocated + 12552k allocated = 114008k reserved, 597736k available
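
Whichever labels are used, the arithmetic of the `swap -s` line is fixed: the first two figures sum to the third, and the third plus the fourth is the total virtual swap. A minimal sketch that checks this on the sample line, keeping the command's own labels as dictionary keys:

```python
import re

def parse_swap_s(line):
    """Extract the four kilobyte figures from a `swap -s` line,
    keyed by the labels the command itself prints."""
    allocated, reserved, used, available = map(int, re.findall(r"(\d+)k", line))
    return {"allocated": allocated, "reserved": reserved,
            "used": used, "available": available}

LINE = ("total: 101456k bytes allocated + 12552k reserved = "
        "114008k used, 597736k available")

if __name__ == "__main__":
    s = parse_swap_s(LINE)
    print("first two sum to third:", s["allocated"] + s["reserved"] == s["used"])
    print("total virtual swap (k):", s["used"] + s["available"])
```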


Swap:
# ./prtswap -l
Swap Reservations:
--------------------------------------------------------------------------
Total Virtual Swap Configured:                          767MB =
RAM Swap Configured:                                    255MB
Physical Swap Configured:                             + 512MB

Total Virtual Swap Reserved Against:                    513MB =
RAM Swap Reserved Against:                                1MB
Physical Swap Reserved Against:                       + 512MB

Total Virtual Swap Unresv. & Avail. for Reservation:    253MB =
Physical Swap Unresv. & Avail. for Reservations:          0MB
RAM Swap Unresv. & Avail. for Reservations:           + 253MB

Swap Allocations: (Reserved and Phys pages allocated)
--------------------------------------------------------------------------
Total Virtual Swap Configured:                          767MB
Total Virtual Swap Allocated Against:                   467MB

Physical Swap Utilization: (pages swapped out)
--------------------------------------------------------------------------
Physical Swap Free (should not be zero!):               232MB =
Physical Swap Configured:                               512MB
Physical Swap Used (pages swapped out):               - 279MB


The pmap command


• Swap reservations (Solaris 9):
# pmap -S 123
Address Kbytes Swap Mode Mapped File
00010000 8 - r-x-- mmap
00020000 8 8 rwx-- mmap
01000000 1024 - rw-s- dev:0,2 ino:5304657
02000000 1024 1024 rw--- dev:0,2 ino:5304657
03000000 1024 512 rw--R dev:0,2 ino:5304657
04000000 1024 1024 rw--- [ anon ]
05000000 512 512 rw--R [ anon ]
FF280000 680 - r-x-- libc.so.1
FF33A000 32 32 rwx-- libc.so.1
FF380000 16 - r-x-- libc_psr.so.1
FF3A0000 8 - r-x-- libdl.so.1
FF3B0000 8 8 rwx-- [ anon ]
FF3C0000 152 - r-x-- ld.so.1
FF3F6000 8 8 rwx-- ld.so.1
FFBFA000 24 24 rwx-- [ stack ]
-------- ------- -------
total Kb 5552 3152


Shared Memory
• System V Intimate Shared Memory (ISM)
• Shared translation data structures
• 4MB TLB Page Size
• Locked pages
• Invoke with an additional flag to shmat(2) - SHM_SHARE_MMU
• Default shared memory mode for Oracle RDBMS
• System V Dynamic Intimate Shared Memory (DISM)
• Solaris 8 U3
• Pageable variant of ISM
• Integrated with Oracle 9i (dynamic SGA)
• 8k TLB Page Size for Solaris 8
• 4MB TLB Page Size for Solaris 9 U1


The pmap command


# pmap -x 15492
15492: ./maps
Address Kbytes RSS Anon Locked Mode Mapped File
00010000 8 8 - - r-x-- maps
00020000 8 8 8 - rwx-- maps
00022000 20344 16248 16248 - rwx-- [ heap ]
03000000 1024 1024 - - rw-s- dev:0,2 ino:4628487
04000000 1024 1024 512 - rw--- dev:0,2 ino:4628487
05000000 1024 1024 512 - rw--R dev:0,2 ino:4628487
06000000 1024 1024 1024 - rw--- [ anon ]
07000000 512 512 512 - rw--R [ anon ]
08000000 8192 8192 - 8192 rwxs- [ dism shmid=0x5 ]
09000000 8192 4096 - - rwxs- [ dism shmid=0x4 ]
0A000000 8192 8192 - 8192 rwxsR [ ism shmid=0x2 ]
0B000000 8192 8192 - 8192 rwxsR [ ism shmid=0x3 ]
FF280000 680 672 - - r-x-- libc.so.1
FF33A000 32 32 32 - rwx-- libc.so.1
FF390000 8 8 - - r-x-- libc_psr.so.1
FF3A0000 8 8 - - r-x-- libdl.so.1
FF3B0000 8 8 8 - rwx-- [ anon ]
FF3C0000 152 152 - - r-x-- ld.so.1
FF3F6000 8 8 8 - rwx-- ld.so.1
FFBFA000 24 24 24 - rwx-- [ stack ]
-------- ------- ------- ------- -------
total Kb 50464 42264 18888 16384


Multiple Page Size Support


• Solaris 8
• Large (4MB) pages with ISM/DISM for shared memory
• Solaris 9
• "Multiple Page Size Support"
• Optional large pages for heap/stack
• A wrapper for unchanged programs (ppgsz)
• Programmatically via memcntl(2)
• Shared library for existing binaries (LD_PRELOAD) (/usr/lib/libmpss.so)
• pmap enhancements to observe page sizes (pmap -sx)
• Tool to observe potential gains (trapstat -T)


TLB Trap CPU Accounting


# trapstat -t 3
cpu | itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim | %tim
-----+-------------------------------+-------------------------------+-----
0 k| 25 0.0 0 0.0 | 29558 0.5 6 0.0 | 0.6
0 u| 9728 0.1 1 0.0 | 17943 0.3 3 0.0 | 0.5
-----+-------------------------------+-------------------------------+-----
1 k| 0 0.0 0 0.0 | 19001 1.2 3 0.0 | 1.2
1 u| 7872 0.2 0 0.0 | 16300 0.5 0 0.0 | 0.8
=====+===============================+===============================+=====
ttl | 17625 0.2 1 0.0 | 82802 1.3 12 0.0 | 1.5


The pmap command


# pmap -xs 15492
Address Kbytes RSS Anon Locked Pgsz Mode Mapped File
00010000 8 8 - - 8K r-x-- maps
00020000 8 8 8 - 8K rwx-- maps
00022000 3960 3960 3960 - 8K rwx-- [ heap ]
00400000 8192 8192 8192 - 4M rwx-- [ heap ]
00C00000 4096 - - - - rwx-- [ heap ]
01000000 4096 4096 4096 - 4M rwx-- [ heap ]
03000000 1024 1024 - - 8K rw-s- dev:0,2 ino:4628487
08000000 8192 8192 - 8192 - rwxs- [ dism shmid=0x5 ]
09000000 4096 4096 - - 8K rwxs- [ dism shmid=0x4 ]
0A000000 4096 - - - - rwxs- [ dism shmid=0x2 ]
0B000000 8192 8192 - 8192 4M rwxsR [ ism shmid=0x3 ]
FF280000 136 136 - - 8K r-x-- libc.so.1
...
FF390000 8 8 - - 8K r-x-- libc_psr.so.1
FF3A0000 8 8 - - 8K r-x-- libdl.so.1
FF3B0000 8 8 8 - 8K rwx-- [ anon ]
FF3C0000 152 152 - - 8K r-x-- ld.so.1
FF3F6000 8 8 8 - 8K rwx-- ld.so.1
FFBFA000 24 24 24 - 8K rwx-- [ stack ]
-------- ------- ------- ------- -------
total Kb 50464 42264 18888 16384


Memory Placement Optimization


• Memory locality optimization for non-uniform memory
architectures
• Solaris 9 Update 1
• Ex800 machines are slightly non-uniform
• F15k systems are slightly more non-uniform
• Machine is described as locality groups (lgroups), grouped by memory latency
• Unit is typically a system board (processors+memory)
• Lgroups are an artifact of the hardware architecture (not user
configurable)
• Threads are assigned a “home” lgroup
• Memory allocated “close” to the threads accessing it
• Program heap and stack are allocated on the same lgroup
• Shared memory allocated round robin across boards in the system
or processor set. Different programmatic policies also provided.
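
The two default placement policies described above can be pictured with a toy model. This is only a sketch: the lgroup numbering, the four-board machine and the function names are assumptions for illustration, not kernel interfaces.

```python
from itertools import cycle


def place_private(home_lgroup, npages):
    """Heap and stack pages: all placed in the thread's home lgroup."""
    return [home_lgroup] * npages


def place_shared(lgroups, npages):
    """Shared memory pages: spread round robin across the lgroups."""
    next_lgroup = cycle(lgroups)
    return [next(next_lgroup) for _ in range(npages)]


lgroups = [0, 1, 2, 3]  # hypothetical machine with four system boards
heap_pages = place_private(home_lgroup=2, npages=4)
shm_pages = place_shared(lgroups, npages=6)
print("heap:", heap_pages)
print("shm: ", shm_pages)
```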

Lock Statistics - mpstat


# mpstat 1
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
8 0 0 6611 456 300 1637 7 26 1110 0 135 33 45 2 21
9 1 0 1294 250 100 2156 3 29 1659 0 68 9 63 0 28
10 0 0 3232 308 100 2357 2 36 1893 0 104 2 66 2 30
11 0 0 647 385 100 1952 1 19 1418 0 21 4 83 0 13
12 0 0 190 225 100 307 0 1 589 0 0 0 98 0 2
13 0 0 624 373 100 1689 2 14 1175 0 87 7 80 2 12
14 0 0 392 312 100 1810 1 12 1302 0 49 2 80 2 15
15 0 0 146 341 100 2586 2 13 1676 0 8 0 82 1 17
16 0 0 382 355 100 1968 2 7 1628 0 4 0 88 0 12
17 0 0 88 283 100 689 0 4 474 0 95 1 94 2 3
18 0 0 3571 152 104 568 0 7 2007 0 15 0 93 1 6
19 0 0 3133 278 100 2043 2 24 1307 0 113 7 69 1 22
20 0 0 385 242 127 2127 2 22 1296 0 36 0 73 0 26
21 0 0 152 369 100 2259 0 10 1400 0 140 2 84 2 12
22 0 0 3964 241 120 1754 3 25 1085 0 91 11 62 1 26
23 0 2 555 193 100 1827 2 23 1148 0 288 7 64 7 22
24 0 0 811 245 113 1327 2 23 1228 0 110 3 76 4 17
25 0 0 105 500 100 2369 0 11 1736 0 6 0 88 0 11
26 0 0 163 395 131 2383 2 16 1487 0 64 2 79 1 18
27 0 1 718 1278 1051 2073 4 23 1311 0 237 9 67 6 19
28 0 0 868 271 100 2287 4 27 1309 0 139 9 55 0 36
29 0 0 931 302 103 2480 3 29 1569 0 165 9 66 2 23
30 0 0 2800 303 100 2146 2 13 1266 0 152 11 70 3 16
31 0 1 1778 320 100 2368 2 24 1381 0 261 11 56 5 28


Lock Statistics - lockstat


# lockstat sleep 10

Adaptive mutex spin: 293311 events in 10.015 seconds (29288 events/sec)

Count indv cuml rcnt spin Lock Caller


-------------------------------------------------------------------------------
218549 75% 75% 1.00 3337 0x71ca3f50 entersq+0x314
26297 9% 83% 1.00 2533 0x71ca3f50 putnext+0x104
19875 7% 90% 1.00 4074 0x71ca3f50 strlock+0x534
14112 5% 95% 1.00 3577 0x71ca3f50 qcallbwrapper+0x274
2696 1% 96% 1.00 3298 0x71ca51d4 putnext+0x50
1821 1% 97% 1.00 59 0x71c9dc40 putnext+0xa0
1693 1% 97% 1.00 2973 0x71ca3f50 qdrain_syncq+0x160
683 0% 97% 1.00 66 0x71c9dc00 putnext+0xa0
678 0% 98% 1.00 55 0x71c9dc80 putnext+0xa0
586 0% 98% 1.00 25 0x71c9ddc0 putnext+0xa0
513 0% 98% 1.00 42 0x71c9dd00 putnext+0xa0
507 0% 98% 1.00 28 0x71c9dd80 putnext+0xa0
407 0% 98% 1.00 42 0x71c9dd40 putnext+0xa0
349 0% 98% 1.00 4085 0x8bfd7e1c putnext+0x50
264 0% 99% 1.00 44 0x71c9dcc0 putnext+0xa0
187 0% 99% 1.00 12 0x908a3d90 putnext+0x454
183 0% 99% 1.00 2975 0x71ca3f50 putnext+0x45c
170 0% 99% 1.00 4571 0x8b77e504 strwsrv+0x10
168 0% 99% 1.00 4501 0x8dea766c strwsrv+0x10
154 0% 99% 1.00 3773 0x924df554 strwsrv+0x10
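
One way to read this output is to group events by lock address rather than by caller. The sketch below does that for the first seven rows copied from the table above; `events_per_lock` is a hypothetical helper that assumes whitespace-separated columns, and the total of 293311 events is taken from the summary line.

```python
from collections import defaultdict

# First seven rows of the adaptive mutex spin table above:
# count, indv, cuml, rcnt, spin, lock, caller.
ROWS = """\
218549 75% 75% 1.00 3337 0x71ca3f50 entersq+0x314
26297 9% 83% 1.00 2533 0x71ca3f50 putnext+0x104
19875 7% 90% 1.00 4074 0x71ca3f50 strlock+0x534
14112 5% 95% 1.00 3577 0x71ca3f50 qcallbwrapper+0x274
2696 1% 96% 1.00 3298 0x71ca51d4 putnext+0x50
1821 1% 97% 1.00 59 0x71c9dc40 putnext+0xa0
1693 1% 97% 1.00 2973 0x71ca3f50 qdrain_syncq+0x160
"""

TOTAL_EVENTS = 293311  # from the "Adaptive mutex spin" summary line


def events_per_lock(text):
    """Sum the Count column for each lock address."""
    counts = defaultdict(int)
    for line in text.splitlines():
        fields = line.split()
        counts[fields[5]] += int(fields[0])
    return dict(counts)


per_lock = events_per_lock(ROWS)
hottest = max(per_lock, key=per_lock.get)
share = per_lock[hottest] / TOTAL_EVENTS
print(hottest, per_lock[hottest], f"{share:.1%}")
```

Grouped this way, a single lock reached from several STREAMS callers accounts for nearly all of the spin events.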


Lock Statistics - lockstat


Adaptive mutex block: 2818 events in 10.015 seconds (281 events/sec)

Count indv cuml rcnt nsec Lock Caller


-------------------------------------------------------------------------------
2134 76% 76% 1.00 1423591 0x71ca3f50 entersq+0x314
272 10% 85% 1.00 893097 0x71ca3f50 strlock+0x534
152 5% 91% 1.00 753279 0x71ca3f50 putnext+0x104
134 5% 96% 1.00 654330 0x71ca3f50 qcallbwrapper+0x274
65 2% 98% 1.00 872630 0x71ca51d4 putnext+0x50
9 0% 98% 1.00 260444 0x71ca3f50 qdrain_syncq+0x160
7 0% 98% 1.00 1390807 0x8dea766c strwsrv+0x10
6 0% 99% 1.00 906048 0x88876094 strwsrv+0x10
5 0% 99% 1.00 2266267 0x8bfd7e1c putnext+0x50
4 0% 99% 1.00 468550 0x924df554 strwsrv+0x10
3 0% 99% 1.00 834125 0x8dea766c cv_wait_sig+0x198
2 0% 99% 1.00 759290 0x71ca3f50 drain_syncq+0x380
2 0% 99% 1.00 1906397 0x8b77e504 cv_wait_sig+0x198
2 0% 99% 1.00 645358 0x71dd69e4 qdrain_syncq+0xa0


Lock Statistics - lockstat


Spin lock spin: 52335 events in 10.015 seconds (5226 events/sec)

Count indv cuml rcnt spin Lock Caller


-------------------------------------------------------------------------------
23531 45% 45% 1.00 4352 turnstile_table+0x79c turnstile_lookup+0x48
1864 4% 49% 1.00 71 cpu[19]+0x40 disp+0x90
1420 3% 51% 1.00 74 cpu[18]+0x40 disp+0x90
1228 2% 54% 1.00 23 cpu[10]+0x40 disp+0x90
1159 2% 56% 1.00 60 cpu[16]+0x40 disp+0x90
1138 2% 58% 1.00 22 cpu[24]+0x40 disp+0x90
1108 2% 60% 1.00 57 cpu[17]+0x40 disp+0x90
1082 2% 62% 1.00 24 cpu[11]+0x40 disp+0x90
1039 2% 64% 1.00 25 cpu[29]+0x40 disp+0x90
1009 2% 66% 1.00 17 cpu[23]+0x40 disp+0x90
1007 2% 68% 1.00 21 cpu[31]+0x40 disp+0x90
882 2% 70% 1.00 29 cpu[13]+0x40 disp+0x90
846 2% 71% 1.00 25 cpu[28]+0x40 disp+0x90
833 2% 73% 1.00 27 cpu[30]+0x40 disp+0x90


Lock Statistics - lockstat


Thread lock spin: 1232 events in 10.015 seconds (123 events/sec)

Count indv cuml rcnt spin Lock Caller


-------------------------------------------------------------------------------
468 38% 38% 1.00 1018 turnstile_table+0x79c ts_tick+0x8
251 20% 58% 1.00 683 turnstile_table+0x79c turnstile_block+0x1f4
180 15% 73% 1.00 152 sleepq_head+0x7f4 ts_tick+0x8
68 6% 78% 1.00 35 sleepq_head+0x7f4 turnstile_block+0x1f4
31 3% 81% 1.00 650 sleepq_head+0x7f4 ts_update_list+0x60
17 1% 82% 1.00 34 cpu[27]+0x64 cv_wait+0x18
7 1% 83% 1.00 64 cpu[13]+0x64 cv_wait+0x18
7 1% 84% 1.00 146 cpu[30]+0x64 ts_tick+0x8
6 0% 84% 1.00 56 cpu[29]+0x64 ts_tick+0x8
6 0% 84% 1.00 37 cpu[8]+0x64 turnstile_block+0x1f4
6 0% 85% 1.00 96 cpu[9]+0x64 ts_tick+0x8


Lock Statistics - lockstat


R/W writer blocked by writer: 1 events in 10.015 seconds (0 events/sec)

Count indv cuml rcnt nsec Lock Caller


-------------------------------------------------------------------------------
1 100% 100% 1.00 169634 0x9d42d620 segvn_pagelock+0x150
-------------------------------------------------------------------------------

R/W reader blocked by writer: 3 events in 10.015 seconds (0 events/sec)

Count indv cuml rcnt nsec Lock Caller


-------------------------------------------------------------------------------
3 100% 100% 1.00 1841415 0x75b7abec mir_wsrv+0x18
-------------------------------------------------------------------------------


lockstat - kernel profiling


# lockstat -kIi997 sleep 10

Profiling interrupt: 10596 events in 5.314 seconds (1994 events/sec)

Count indv cuml rcnt nsec CPU+PIL Caller


-------------------------------------------------------------------------------
5122 48% 48% 1.00 1419 cpu[0] default_copyout
1292 12% 61% 1.00 1177 cpu[1] splx
1288 12% 73% 1.00 1118 cpu[1] idle
911 9% 81% 1.00 1169 cpu[1] disp_getwork
695 7% 88% 1.00 1170 cpu[1] i_ddi_splhigh
440 4% 92% 1.00 1163 cpu[1]+11 splx
414 4% 96% 1.00 1163 cpu[1]+11 i_ddi_splhigh
254 2% 98% 1.00 1176 cpu[1]+11 disp_getwork
27 0% 99% 1.00 1349 cpu[0] uiomove
27 0% 99% 1.00 1624 cpu[0] bzero
24 0% 99% 1.00 1205 cpu[0] mmrw
21 0% 99% 1.00 1870 cpu[0] (usermode)
9 0% 99% 1.00 1174 cpu[0] xcopyout
8 0% 99% 1.00 650 cpu[0] ktl0
6 0% 99% 1.00 1220 cpu[0] mutex_enter
5 0% 99% 1.00 1236 cpu[0] default_xcopyout
3 0% 100% 1.00 1383 cpu[0] write
3 0% 100% 1.00 1330 cpu[0] getminor
3 0% 100% 1.00 333 cpu[0] utl0
2 0% 100% 1.00 961 cpu[0] mmread
2 0% 100% 1.00 2000 cpu[0]+10 read_rtc


Kernel Process Model


• Processes
• All processes begin life as a program
• All processes begin life as a disk file (ELF object)
• All processes have “state” or context that defines their execution
environment, which can be further divided into “hardware” context and
“software” context
• Hardware context
• The processor state, which is CPU architecture dependent.
• In general, the state of the hardware registers (general registers,
privileged registers)
• Maintained in the LWP
• Software context
• Address space, credentials, open files, resource limits, etc. - state
shared by all the threads in a process


Dispatcher Views

[Figure: dispatcher views. Each process carries software context (open
files, credentials, address space, process group, session control, ...)
and contains user threads, LWPs and kthreads.]
• Unbound user threads are scheduled within the threads library, where
the selected user thread is linked to an available LWP. This does not
apply to bound threads.
• Each LWP holds the machine state and is paired with a kthread
• The kthreads are the kernel dispatcher view; the dispatcher places them
on a CPU


Dispatcher & Scheduling Classes


• Solaris supports multiple scheduling classes
• Allows for the co-existence of different priority schemes and
scheduling algorithms (policies) within the kernel
• Each scheduling class provides a class-specific function to
manage thread priorities, administration, creation, termination, etc.
• The class-specific functions are called using a MACRO scheme,
similar to what is used at the VFS layer
...
CL_PREEMPT(thread) -> ts_preempt()
...
• Each scheduling class is assigned a range of priorities
• For each loaded scheduling class, the priority range falls within the
system's total range of global priorities
• The dispatcher is the kernel subsystem that manages the
dispatch queues (run queues), handles thread selection,
context switching, preemption, etc.
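
The macro scheme amounts to an indirect call through a per-class table of functions. The sketch below models that idea only; the table layout, the thread dictionary and the `cid` key are assumptions for illustration, and the real kernel does this in C with function pointers.

```python
def ts_preempt(thread):
    """Stand-in for the timeshare class preempt routine."""
    return "ts_preempt(" + thread["name"] + ")"


def rt_preempt(thread):
    """Stand-in for the realtime class preempt routine."""
    return "rt_preempt(" + thread["name"] + ")"


# One operations table per scheduling class (illustrative layout).
CLASS_OPS = {
    "TS": {"preempt": ts_preempt},
    "RT": {"preempt": rt_preempt},
}


def CL_PREEMPT(thread):
    """Dispatch to the preempt routine of the thread's scheduling class."""
    return CLASS_OPS[thread["cid"]]["preempt"](thread)


ts_thread = {"name": "t1", "cid": "TS"}
rt_thread = {"name": "t2", "cid": "RT"}
print(CL_PREEMPT(ts_thread))
print(CL_PREEMPT(rt_thread))
```

The caller never names a class-specific routine, which is what lets a new scheduling class be loaded without changing the dispatcher.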


Scheduling Classes
• SunOS currently implements the following scheduling
classes
• Timeshare (TS)
• Fixed Priority (FX)
• Fair Share (FSS)
• Interactive (IA)
• System (SYS)
• Realtime (RT)
Global priority ranges, from highest (best) to lowest (worst) priority:

Priority   Use
160-169    Interrupt thread priorities, above system (if the realtime
           class is not loaded, priorities 100-109)
100-159    Realtime
60-99      System
0-59       Timesharing and interactive
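
The ranges above can be written down as a small lookup. This is a sketch under the slide's numbers only; the function names and the `rt_loaded` parameter are hypothetical.

```python
def priority_ranges(rt_loaded=True):
    """Global priority ranges as (name, low, high), lowest range first."""
    ranges = [("timeshare/interactive", 0, 59), ("system", 60, 99)]
    if rt_loaded:
        ranges += [("realtime", 100, 159), ("interrupt", 160, 169)]
    else:
        # Without the realtime class, interrupts sit directly above system.
        ranges += [("interrupt", 100, 109)]
    return ranges


def classify(global_pri, rt_loaded=True):
    """Return the name of the range that a global priority falls in."""
    for name, low, high in priority_ranges(rt_loaded):
        if low <= global_pri <= high:
            return name
    raise ValueError("priority out of range: %d" % global_pri)


for pri in (0, 59, 60, 100, 165):
    print(pri, classify(pri))
print(105, classify(105, rt_loaded=False))
```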


Scheduling Classes - Priorities

[Figure: per-class user priority ranges mapped onto the global priority
range (0 to 169): timeshare -60 to +60, interactive -60 to +60, system,
realtime 0 to 59, and interrupts 1 to 10.]


Quick Tidbit
• Use dispadmin(1M) or mdb(1) for scheduling class info
# dispadmin -l
CONFIGURED CLASSES
==================

SYS (System Class)
TS (Time Sharing)
FX (Fixed Priority)
IA (Interactive)

# mdb -k
> ::class
SLOT NAME INIT FCN CLASS FCN
0 SYS sys_init sys_classfuncs
1 TS ts_init ts_classfuncs
2 FX fx_init fx_classfuncs
3 IA ia_init ia_classfuncs
4 0 0
5 0 0

• Note the RT class is not loaded


Thread Priorities & Scheduling


• Every thread has 2 priorities: a global priority, derived from
its scheduling class, and (potentially) an inherited
priority
• Priority inherited from parent, alterable via priocntl(1)
command or system call
• Typically, threads run as either TS or IA threads
• IA threads created when thread is associated with a windowing
system
• RT threads are explicitly created
• SYS class used by kernel threads, and for TS/IA threads when
a higher priority is warranted
• A temporary boost when an important resource is being held
• Interrupts run at interrupt priority

File System Types


Filesystem  Type     Device          Description
ufs         Regular  Disk            Unix Fast Filesystem, default in Solaris
pcfs        Regular  Disk            MSDOS filesystem
hsfs        Regular  Disk            High Sierra File System (CDROM)
tmpfs       Regular  Memory          Uses memory and swap
nfs         Pseudo   Network         Network filesystem
cachefs     Pseudo   Filesystem      Uses a local disk as cache for another NFS file system
autofs      Pseudo   Filesystem      Uses a dynamic layout to mount other file systems
specfs      Pseudo   Device Drivers  Filesystem for the /dev devices
procfs      Pseudo   Kernel          /proc filesystem representing processes
sockfs      Pseudo   Network         Filesystem of socket connections
fifofs      Pseudo   Files           FIFO File System


The virtual file system framework


[Figure: the virtual file system framework inside the kernel, top to
bottom.]
System Call Interface
VNODE OPERATIONS: open(), close(), read(), write(), seek(), ioctl(),
creat(), link(), unlink(), rename(), mkdir(), rmdir(), fsync()
VFS OPERATIONS: mount(), umount(), statfs(), sync()
VFS - File System Independent Layer (VFS & VNODE INTERFACES)
UFS PCFS HSFS VxFS NFS PROCFS


The VFS Interface


[Figure: the VFS interface. *rootvfs heads the list of mounted file
systems (/, /usr, /var, /opt), one VFS structure per mount point.]

VFSOP_xxx     UFS implementation
mount()       ufs_mount()
unmount()     ufs_unmount()
root()        ufs_root()
statvfs()     ufs_statvfs()
sync()        ufs_sync()
vget()        ufs_vget()
mountroot()   ufs_mountroot()
swapvp()      ufs_swapvp()

Mount point VFS fields: vnode, VFS type (index into vfssw[]: ufs, nfs,
etc.), blocksize, flags, device, synclist, hashlist


The vnode interface


VNODE Ops     UFS implementation
close()       ufs_close()
read()        ufs_read()
write()       ufs_write()
ioctl()       ufs_ioctl()
create()      ufs_create()
link()        ufs_link()
...           ...

[Figure: each VNODE also references its memory pages and a filesystem
pointer, and has a VNODE type: regular file, directory, block device,
character device, link, FIFO, process, socket.]

File system Caching


• Solaris file systems use the VM system to cache
and move data
• Regular reads are page ins, delayed writes are
page outs
• VM parameters and load dramatically affect file
system performance
• Solaris 8 gives executable, stack and heap pages priority
over file system pages


File System Caching


[Figure: file system caching layers between a process and the storage
devices.]
• Process level: read()/write(), fread()/fwrite() through STDIO buffers,
and mmap(); the process stack, heap, binary (data) and binary (text)
pages are held in the page cache
• Directory Name Lookup Cache (DNLC, sized by ncsize): caches file name
lookups. The DNLC cache hit ratio can be observed with vmstat -s
• Inode cache (sized by ufsninode), direct blocks
• Level 1 page cache: the segmap cache (256MB on Ultra). The cache hit
ratio of the segmap cache can be measured with netstat -k segmap
• Level 2 page cache: the dynamic page cache. Files mapped with mmap()
bypass the segmap cache
• Buffer cache (sized by BUFHWM). The buffer cache hit ratio can be
observed with sar -b
• Storage devices


UFS
• Block based allocation
• 2TB Max file system size
• A file can grow to the max file system size
• Triple indirect blocks are implemented
• Prior to 2.6, the max file size was 2GB
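
A little arithmetic shows why triple indirect blocks matter for a file that grows to the file system limit. The sketch assumes the classic UFS inode layout (12 direct block pointers, 4-byte block addresses) and an 8 KB block size; those values are assumptions stated here, not taken from the slide.

```python
BLOCK_SIZE = 8192   # assumed file system block size in bytes
POINTER_SIZE = 4    # assumed size of one block address in bytes
DIRECT_BLOCKS = 12  # assumed number of direct pointers in the inode

ptrs_per_block = BLOCK_SIZE // POINTER_SIZE

direct = DIRECT_BLOCKS * BLOCK_SIZE       # bytes reachable directly
single = ptrs_per_block * BLOCK_SIZE      # via the single indirect block
double = ptrs_per_block ** 2 * BLOCK_SIZE  # via the double indirect block
triple = ptrs_per_block ** 3 * BLOCK_SIZE  # via the triple indirect block

GB = 2 ** 30
TB = 2 ** 40
print("without triple indirect: %.2f GB" % ((direct + single + double) / GB))
print("with triple indirect:    %.2f TB" % ((direct + single + double + triple) / TB))
```

Under these assumptions, direct, single and double indirect blocks together reach only about 32 GB, so a file approaching the 2TB file system limit needs the triple indirect level.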


UFS Block Allocation


# filestat /home/bigfile
Inodes per cyl group: 64
Inodes per block: 64
Cylinder Group no: 0
Cylinder Group blk: 64
File System Block Size: 8192
Device block size: 512
Number of device blocks: 204928
Start Block End Block Length (Device Blocks)
----------- ----------- ----------------------
66272 -> 66463 192
66480 -> 99247 32768
1155904 -> 1188671 32768
1277392 -> 1310159 32768
1387552 -> 1420319 32768
1497712 -> 1530479 32768
1607872 -> 1640639 32768
1718016 -> 1725999 7984
1155872 -> 1155887 16
Number of extents: 9
Average extent size: 22769 Blocks

Note: The filestat command is shown for demonstration purposes, and is not yet
included with the Solaris operating system


UFS Logging
• Beginning in Solaris 7, UFS logging became a mount option
• Log to spare blocks in the file system (no metadevice)
• Fast reboots - no fsck required


UFS Direct I/O


• File systems cause a lot of paging activity
• Solaris 2.6 introduces a mechanism to bypass the
VM system
• Forces completely unbuffered I/Os
• Very slow writes (synchronous)
• Useful for copying large files or when application does caching e.g.
Oracle
• mount -o forcedirectio /dev/xyz /mountpt
• directio (fd, DIRECTIO_ON | DIRECTIO_OFF)


Direct I/O Checklist


• I/O must be aligned
• Sector aligned (512-byte boundary)
• File must not be mapped (mmap)
• Buffer must be word aligned
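
The checklist can be expressed as a single predicate. This is a sketch only: it assumes that both the file offset and the transfer length must be sector aligned and that a word is 4 bytes, neither of which the slide states, and `directio_eligible` is a hypothetical name.

```python
SECTOR_SIZE = 512  # from the checklist above
WORD_SIZE = 4      # assumed word size in bytes


def directio_eligible(offset, length, buffer_address, file_is_mapped):
    """Return True if a request meets every item on the checklist."""
    sector_aligned = offset % SECTOR_SIZE == 0 and length % SECTOR_SIZE == 0
    word_aligned = buffer_address % WORD_SIZE == 0
    return sector_aligned and word_aligned and not file_is_mapped


print(directio_eligible(0, 8192, 0x20000, False))    # aligned request
print(directio_eligible(100, 8192, 0x20000, False))  # misaligned offset
print(directio_eligible(0, 8192, 0x20000, True))     # file is mapped
```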


UFS Write Throttle


• A throttle exists in UFS to limit the amount of
memory UFS can saturate, per file
• Controlled by three parameters
• ufs_WRITES (1 = enabled)
• ufs_HW = 393216 bytes (high water mark to suspend IO)
• ufs_LW = 262144 bytes (low water mark to start IO)
• Almost always need to set this higher to get
maximum sequential write performance (in /etc/system)
• set ufs:ufs_LW=4194304
• set ufs:ufs_HW=67108864
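
The high and low water marks form a simple hysteresis, modeled below for one file. This is a toy sketch, not the UFS code: the class, its method names and the exact comparisons (suspend when outstanding bytes exceed the high water mark, resume once they fall to the low water mark) are assumptions for illustration.

```python
class WriteThrottle:
    """Toy model of a per-file write throttle with default water marks."""

    def __init__(self, high_water=393216, low_water=262144):
        self.high_water = high_water
        self.low_water = low_water
        self.outstanding = 0    # bytes queued but not yet on disk
        self.throttled = False

    def try_write(self, nbytes):
        """Return True if the write is accepted, False if the writer waits."""
        if self.outstanding > self.high_water:
            self.throttled = True
        if self.throttled:
            return False
        self.outstanding += nbytes
        return True

    def io_done(self, nbytes):
        """Account for completed I/O and release the writer when drained."""
        self.outstanding -= nbytes
        if self.outstanding <= self.low_water:
            self.throttled = False


throttle = WriteThrottle()
accepted = 0
while throttle.try_write(8192):
    accepted += 1
print("8K writes accepted before suspension:", accepted)
```

With the defaults, a writer is suspended after a few dozen 8K writes, which is why large sequential writers benefit from higher marks.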


UFS Performance
• Adjacent blocks are grouped and written together
or read ahead
• Controlled by the maxcontig parameter
• Defaults to 128k on most platforms, 1MB on SPARCstorage Array 100, 200
• Must be set higher to achieve adequate write performance
• maxphys must be raised beyond 128k also
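
The interaction between maxcontig and maxphys can be sketched as follows. The model assumes maxcontig is counted in file system blocks, an 8 KB block size, and that the resulting I/O size is capped by maxphys; the function name is hypothetical.

```python
def max_cluster_bytes(maxcontig_blocks, fs_block_size=8192, maxphys=131072):
    """Largest clustered I/O in bytes under the assumptions stated above."""
    return min(maxcontig_blocks * fs_block_size, maxphys)


default_cluster = max_cluster_bytes(16)                 # 16 x 8K = 128k
capped_cluster = max_cluster_bytes(128)                 # still limited by maxphys
raised_cluster = max_cluster_bytes(128, maxphys=1048576)
print(default_cluster, capped_cluster, raised_cluster)
```

Raising maxcontig alone leaves the cluster size at 128k in this model, which matches the note that maxphys has to be raised as well.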
