
Front cover

DS8000 Performance Monitoring and Tuning


Understand the performance aspects of the DS8000 architecture
Configure the DS8000 to fully exploit its capabilities
Use planning and monitoring tools with the DS8000

Bert Dufrasne, Brett Allison, John Barnes, Jean Iyabi, Rajesh Jeyapaul, Peter Kimmel,
Chuck Laing, Anderson Nobre, Rene Oehme, Gero Schmidt, Paulus Usong

ibm.com/redbooks

International Technical Support Organization
DS8000 Performance Monitoring and Tuning
March 2009

SG24-7146-01

Note: Before using this information and the product it supports, read the information in Notices on page xiii.

Second Edition (March 2009)
This edition applies to the IBM System Storage DS8000 with Licensed Machine Code 5.4.1.xx.xx (Code bundles 64.1.x.x).

© Copyright International Business Machines Corporation 2009. All rights reserved.
Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Contents
Notices
Trademarks

Preface
The team that wrote this IBM Redbooks publication
Special thanks
Become a published author
Comments welcome

Chapter 1. DS8000 characteristics
1.1 The storage server challenge
1.1.1 Performance numbers
1.1.2 Recommendations and rules
1.1.3 Modeling your workload
1.1.4 Allocating hardware components to workloads
1.2 Meeting the challenge: DS8000
1.2.1 DS8000 models and characteristics
1.3 DS8000 performance characteristics overview
1.3.1 Advanced caching techniques
1.3.2 IBM System Storage multipath Subsystem Device Driver (SDD)
1.3.3 Performance characteristics for System z

Chapter 2. Hardware configuration
2.1 Storage system
2.2 Processor memory and cache
2.2.1 Cache and I/O operations
2.2.2 Determining the right amount of cache storage
2.3 RIO-G interconnect and I/O enclosures
2.3.1 RIO-G loop
2.3.2 I/O enclosures
2.4 Disk subsystem
2.4.1 Device adapters
2.4.2 Fibre Channel disk architecture in the DS8000
2.4.3 Disk enclosures
2.4.4 Fibre Channel drives compared to FATA and SATA drives
2.4.5 Arrays across loops
2.4.6 Order of installation
2.4.7 Performance Accelerator feature (Feature Code 1980)
2.5 Host adapters
2.5.1 Fibre Channel and FICON host adapters
2.5.2 ESCON host adapters
2.5.3 Multiple paths to Open Systems servers
2.5.4 Multiple paths to System z servers
2.5.5 Spreading host attachments
2.6 Tools to aid in hardware planning
2.6.1 White papers
2.6.2 Disk Magic
2.6.3 Capacity Magic

Chapter 3. Understanding your workload
3.1 General workload types
3.1.1 Standard workload
3.1.2 Read intensive cache unfriendly workload
3.1.3 Sequential workload
3.1.4 Batch jobs workload
3.1.5 Sort jobs workload
3.2 Database workload
3.2.1 DB2 query workload
3.2.2 DB2 logging workload
3.2.3 DB2 transaction environment workload
3.2.4 DB2 utilities workload
3.3 Application workload
3.3.1 General file serving
3.3.2 Online transaction processing
3.3.3 Data mining
3.3.4 Video on demand
3.3.5 Data warehousing
3.3.6 Engineering and scientific applications
3.3.7 Digital video editing
3.4 Profiling workloads in the design phase
3.5 Understanding your workload type
3.5.1 Monitoring the DS8000 workload
3.5.2 Monitoring the host workload

Chapter 4. Logical configuration concepts and terminology
4.1 RAID levels and spares
4.1.1 RAID 5 overview
4.1.2 RAID 6 overview
4.1.3 RAID 10 overview
4.1.4 Spare creation
4.2 The abstraction layers for logical configuration
4.2.1 Array sites
4.2.2 Arrays
4.2.3 Ranks
4.2.4 Extent pools
4.2.5 Logical volumes
4.2.6 Space Efficient volumes
4.2.7 Allocation, deletion, and modification of LUNs and CKD volumes
4.2.8 Logical subsystems (LSS)
4.2.9 Address groups
4.2.10 Volume access
4.2.11 Summary of the logical configuration hierarchy
4.3 Understanding the array to LUN relationship
4.3.1 How extents are formed together to make DS8000 LUNs
4.3.2 Understanding data I/O placement on ranks and extent pools

Chapter 5. Logical configuration performance considerations
5.1 Basic configuration principles for optimal performance
5.1.1 Workload isolation
5.1.2 Workload resource-sharing
5.1.3 Workload spreading
5.1.4 Using workload isolation, resource-sharing, and spreading
5.2 Analyzing application workload characteristics
5.2.1 Determining isolation requirements
5.2.2 Reviewing remaining workloads for feasibility of resource-sharing
5.3 Planning allocation of disk and host connection capacity
5.3.1 Planning DS8000 hardware resources for isolated workloads
5.3.2 Planning DS8000 hardware resources for resource-sharing workloads
5.4 Planning volume and host connection spreading
5.4.1 Spreading volumes for isolated and resource-sharing workloads
5.4.2 Spreading host connections for isolated and resource-sharing workloads
5.5 Planning array sites
5.5.1 DS8000 configuration example 1: Array site planning considerations
5.5.2 DS8000 configuration example 2: Array site planning considerations
5.5.3 DS8000 configuration example 3: Array site planning considerations
5.6 Planning RAID arrays and ranks
5.6.1 RAID-level performance considerations
5.6.2 RAID array considerations
5.6.3 Rank considerations
5.7 Planning extent pools
5.7.1 Single-rank and multi-rank extent pools
5.7.2 Extent allocation methods for multi-rank extent pools
5.7.3 Balancing workload across available resources
5.7.4 Assigning workloads to extent pools
5.7.5 Planning for multi-rank extent pools
5.7.6 Planning for single-rank extent pools
5.8 Plan address groups, LSSs, volume IDs, and CKD PAVs
5.8.1 Volume configuration scheme using application-related LSS/LCU IDs
5.8.2 Volume configuration scheme using hardware-bound LSS/LCU IDs
5.9 Plan I/O port IDs, host attachments, and volume groups
5.9.1 DS8000 configuration example 1: I/O port planning considerations
5.9.2 DS8000 configuration example 2: I/O port planning considerations
5.9.3 DS8000 configuration example 3: I/O port planning considerations
5.10 Implement and document DS8000 logical configuration

Chapter 6. Performance management process
6.1 Introduction
6.2 Purpose
6.3 Operational performance subprocess
6.3.1 Inputs
6.3.2 Outputs
6.3.3 Tasks, actors, and roles
6.3.4 Performance troubleshooting
6.4 Tactical performance subprocess
6.4.1 Inputs
6.4.2 Outputs
6.4.3 Tasks, actors, and roles
6.5 Strategic performance subprocess
6.5.1 Inputs
6.5.2 Outputs
6.5.3 Tasks, actors, and roles

Chapter 7. Performance planning tools
7.1 Disk Magic
7.1.1 The need for performance planning and modeling tools
7.1.2 Overview and characteristics

7.1.3 Output information
7.1.4 Disk Magic modeling
7.2 Disk Magic for System z (zSeries)
7.2.1 Process the DMC file
7.2.2 zSeries model to merge the two ESS-800s to a DS8300
7.2.3 Disk Magic performance projection for zSeries model
7.2.4 Workload growth projection for zSeries model
7.3 Disk Magic for Open Systems
7.3.1 Process the TotalStorage Productivity Center csv output file
7.3.2 Open Systems model to merge the two ESS-800s to a DS8300
7.3.3 Disk Magic performance projection for an Open Systems model
7.3.4 Workload growth projection for an Open Systems model
7.4 Workload growth projection
7.5 Input data needed for Disk Magic study
7.5.1 z/OS environment
7.5.2 Open Systems environment
7.6 Configuration guidelines

Chapter 8. Practical performance management
8.1 Introduction to practical performance management
8.2 Performance management tools
8.2.1 TotalStorage Productivity Center overview
8.2.2 TotalStorage Productivity Center data collection
8.2.3 TotalStorage Productivity Center measurement of DS8000 components
8.2.4 General TotalStorage Productivity Center measurement considerations
8.3 TotalStorage Productivity Center data collection
8.3.1 Timestamps
8.3.2 Duration
8.3.3 Intervals
8.4 Key performance metrics
8.4.1 DS8000 key performance indicator thresholds
8.5 TotalStorage Productivity Center reporting options
8.5.1 Alerts
8.5.2 Predefined performance reports in TotalStorage Productivity Center
8.5.3 Ad hoc reports
8.5.4 Batch reports
8.5.5 TPCTOOL
8.5.6 Volume Planner
8.5.7 TPC Reporter for Disk
8.6 Monitoring performance of a SAN switch or director
8.6.1 SAN configuration examples
8.6.2 TotalStorage Productivity Center for Fabric alerts
8.6.3 TotalStorage Productivity Center for Fabric reporting
8.6.4 TotalStorage Productivity Center for Fabric metrics
8.7 End-to-end analysis of I/O performance problems
8.7.1 Performance analysis examples
8.8 TotalStorage Productivity Center for Disk in mixed environment

Chapter 9. Host attachment
9.1 DS8000 host attachment
9.2 Attaching Open Systems hosts
9.2.1 Fibre Channel
9.2.2 SAN implementations

9.2.3 Multipathing
9.3 Attaching IBM System z and S/390 hosts
9.3.1 ESCON
9.3.2 FICON
9.3.3 FICON configuration and performance considerations
9.3.4 z/VM, z/VSE, and Linux on System z attachment

Chapter 10. Performance considerations with Windows Servers
10.1 General Windows performance tuning
10.2 I/O architecture overview
10.3 Windows Server 2008 I/O Manager enhancements
10.4 Filesystem
10.4.1 Windows filesystem overview
10.4.2 NTFS guidelines
10.5 Volume management
10.5.1 Microsoft Logical Disk Manager (LDM)
10.5.2 Microsoft LDM software RAID
10.5.3 Veritas Volume Manager (VxVM)
10.5.4 Determining volume layout
10.6 Multipathing and the port layer
10.6.1 SCSIport scalability issues
10.6.2 Storport scalability features
10.6.3 Subsystem Device Driver
10.6.4 Subsystem Device Driver Device Specific Module
10.6.5 Veritas Dynamic MultiPathing (DMP) for Windows
10.7 Host bus adapter (HBA) settings
10.8 I/O performance measurement
10.8.1 Key I/O performance metrics
10.8.2 Windows Performance console (perfmon)
10.8.3 Performance log configuration and data export
10.8.4 Collecting configuration data
10.8.5 Correlating performance and configuration data
10.8.6 Analyzing performance data
10.8.7 Windows Server Performance Analyzer
10.9 Task Manager
10.9.1 Starting Task Manager
10.10 I/O load testing
10.10.1 Types of tests
10.10.2 Iometer

Chapter 11. Performance considerations with UNIX servers
11.1 Planning and preparing UNIX servers for performance
11.1.1 UNIX disk I/O architecture
11.2 AIX disk I/O components
11.2.1 AIX Journaled File System (JFS) and Journaled File System 2 (JFS2)
11.2.2 Veritas File System (VxFS) for AIX
11.2.3 General Parallel FileSystem (GPFS)
11.2.4 IBM Logical Volume Manager (LVM)
11.2.5 Veritas Volume Manager (VxVM)
11.2.6 IBM Subsystem Device Driver (SDD) for AIX
11.2.7 MPIO with SDDPCM
11.2.8 Veritas Dynamic MultiPathing (DMP) for AIX
11.2.9 FC adapters

11.2.10 Virtual I/O Server (VIOS)
11.3 AIX performance monitoring tools
11.3.1 AIX vmstat
11.3.2 pstat
11.3.3 AIX iostat
11.3.4 lvmstat
11.3.5 topas
11.3.6 nmon
11.3.7 fcstat
11.3.8 filemon
11.4 Solaris disk I/O components
11.4.1 UFS
11.4.2 Veritas FileSystem (VxFS) for Solaris
11.4.3 SUN Solaris ZFS
11.4.4 Solaris Volume Manager (formerly Solstice DiskSuite)
11.4.5 Veritas Volume Manager (VxVM) for Solaris
11.4.6 IBM Subsystem Device Driver for Solaris
11.4.7 MPxIO
11.4.8 Veritas Dynamic MultiPathing (DMP) for Solaris
11.4.9 Array Support Library (ASL)
11.4.10 FC adapter
11.5 Solaris performance monitoring tools
11.5.1 fcachestat and directiostat
11.5.2 Solaris vmstat
11.5.3 Solaris iostat
11.5.4 vxstat
11.5.5 dtrace
11.6 HP-UX Disk I/O architecture
11.6.1 HP-UX High Performance File System (HFS)
11.6.2 HP-UX Journaled File System (JFS)
11.6.3 HP Logical Volume Manager (LVM)
11.6.4 Veritas Volume Manager (VxVM) for HP-UX
11.6.5 PV Links
11.6.6 Native multipathing in HP-UX
11.6.7 Subsystem Device Driver (SDD) for HP-UX
11.6.8 Veritas Dynamic MultiPathing (DMP) for HP-UX
11.6.9 Array Support Library (ASL) for HP-UX
11.6.10 FC adapter
11.7 HP-UX performance monitoring tools
11.7.1 HP-UX sar
11.7.2 vxstat
11.7.3 GlancePlus and HP Perfview/Measureware
11.8 SDD commands for AIX, HP-UX, and Solaris
11.8.1 HP-UX SDD commands
11.8.2 Sun Solaris SDD commands
11.9 Testing and verifying DS8000 Storage
11.9.1 Using the dd command to test sequential rank reads and writes
11.9.2 Verifying your system

Chapter 12. Performance considerations with VMware
12.1 Disk I/O architecture overview
12.2 Multipathing considerations
12.3 Performance monitoring tools

12.3.1 Virtual Center Performance Statistics
12.3.2 Performance monitoring with esxtop
12.3.3 Guest-based performance monitoring
12.4 VMware specific tuning for maximum performance
12.4.1 Workload spreading
12.4.2 Virtual Machines sharing the same LUN
12.4.3 ESX filesystem considerations
12.4.4 Aligning partitions
12.5 Tuning of Virtual Machines

Chapter 13. Performance considerations with Linux
13.1 Supported platforms and distributions
13.2 Linux disk I/O architecture
13.2.1 I/O subsystem architecture
13.2.2 Cache and locality of reference
13.2.3 Block layer
13.2.4 I/O device driver
13.3 Specific configuration for storage performance
13.3.1 Host bus adapter for Linux
13.3.2 Multipathing in Linux
13.3.3 Software RAID functions
13.3.4 Logical Volume Manager
13.3.5 Tuning the disk I/O scheduler
13.3.6 Filesystems
13.4 Linux performance monitoring tools
13.4.1 Disk I/O performance indicators
13.4.2 Finding disk bottlenecks

Chapter 14. IBM System Storage SAN Volume Controller attachment
14.1 IBM System Storage SAN Volume Controller
14.1.1 SAN Volume Controller concepts
14.1.2 SAN Volume Controller multipathing
14.1.3 SVC Advanced Copy Services
14.2 SAN Volume Controller performance considerations
14.3 DS8000 performance considerations with SVC
14.3.1 DS8000 array
14.3.2 DS8000 rank format
14.3.3 DS8000 extent pool implications
14.3.4 DS8000 volume considerations with SVC
14.3.5 Volume assignment to SAN Volume Controller
14.3.6 Managed Disk Group for DS8000 Volumes
14.4 Performance monitoring
14.4.1 Using TotalStorage Productivity Center for Disk to monitor the SVC
14.5 Sharing the DS8000 between a server and the SVC
14.5.1 Sharing the DS8000 between Open Systems servers and the SVC
14.5.2 Sharing the DS8000 between System i server and the SVC
14.5.3 Sharing the DS8000 between System z server and the SVC
14.6 Advanced functions for the DS8000
14.6.1 Cache-disabled VDisks
14.7 Configuration guidelines for optimizing performance

Chapter 15. System z servers
15.1 Overview
15.2 Parallel Access Volumes
15.2.1 Static PAV, Dynamic PAV, and HyperPAV
15.2.2 HyperPAV compared to dynamic PAV test
15.2.3 PAV and large volumes
15.3 Multiple Allegiance
15.4 How PAV and Multiple Allegiance work
15.4.1 Concurrent read operation
15.4.2 Concurrent write operation
15.5 I/O Priority Queuing
15.6 Logical volume sizes
15.6.1 Selecting the volume size
15.6.2 Larger volume compared to smaller volume performance
15.6.3 Planning the volume sizes of your configuration
15.7 FICON
15.7.1 Extended Distance FICON
15.7.2 High Performance FICON
15.7.3 MIDAW
15.8 z/OS planning and configuration guidelines
15.8.1 Channel configuration
15.8.2 Extent pool
15.8.3 Considerations for mixed workloads
15.9 DS8000 performance monitoring tools
15.10 RMF
15.10.1 I/O response time
15.10.2 I/O response time components
15.10.3 IOP/SAP
15.10.4 FICON host channel
15.10.5 FICON director
15.10.6 Processor complex
15.10.7 Cache and NVS
15.10.8 DS8000 FICON/Fibre port and host adapter
15.10.9 Extent pool and rank
15.11 RMF Magic for Windows
15.11.1 RMF Magic analysis process
15.11.2 Data collection step
15.11.3 RMF Magic reduce step
15.11.4 RMF Magic analyze step
15.11.5 Data presentation and reporting step
15.11.6 Hints and tips

Chapter 16. Databases
16.1 DB2 in a z/OS environment
16.1.1 Understanding your database workload
16.1.2 DB2 overview
16.1.3 DB2 storage objects
16.1.4 DB2 dataset types
16.2 DS8000 considerations for DB2
16.3 DB2 with DS8000 performance recommendations
16.3.1 Know where your data resides
16.3.2 Balance workload across DS8000 resources
16.3.3 Take advantage of VSAM data striping
16.3.4 Large volumes
16.3.5 Modified Indirect Data Address Words (MIDAWs)
16.3.6 Adaptive Multi-stream Prefetching (AMP)

16.3.7 DB2 burst write
16.3.8 Monitoring DS8000 performance
16.4 DS8000 DB2 UDB in an Open Systems environment
16.4.1 DB2 UDB storage concepts
16.5 DB2 UDB with DS8000 performance recommendations
16.5.1 Know where your data resides
16.5.2 Balance workload across DS8000 resources
16.5.3 Use DB2 to stripe across containers
16.5.4 Selecting DB2 logical sizes
16.5.5 Selecting the DS8000 logical disk sizes
16.5.6 Multipathing
16.6 IMS in a z/OS environment
16.6.1 IMS overview
16.6.2 IMS logging
16.7 DS8000 considerations for IMS
16.8 IMS with DS8000 performance recommendations
16.8.1 Know where your data resides
16.8.2 Balance workload across DS8000 resources
16.8.3 Large volumes
16.8.4 Monitoring DS8000 performance

Chapter 17. Copy Services performance
17.1 Copy Services introduction
17.2 FlashCopy
17.2.1 FlashCopy performance considerations
17.2.2 Performance planning for IBM FlashCopy SE
17.3 Metro Mirror
17.3.1 Metro Mirror configuration considerations
17.3.2 Metro Mirror performance considerations
17.3.3 Scalability
17.4 Global Copy
17.4.1 Global Copy configuration considerations
17.4.2 Global Copy performance consideration
17.4.3 Scalability
17.5 Global Mirror
17.5.1 Global Mirror performance considerations
17.5.2 Global Mirror Session parameters
17.5.3 Avoid unbalanced configurations
17.5.4 Growth within Global Mirror configurations
17.6 z/OS Global Mirror
17.6.1 z/OS Global Mirror control dataset placement
17.6.2 z/OS Global Mirror tuning parameters
17.6.3 z/OS Global Mirror enhanced multiple reader
17.6.4 zGM enhanced multiple reader performance improvement
17.6.5 XRC Performance Monitor
17.7 Metro/Global Mirror
17.7.1 Metro/Global Mirror performance
17.7.2 z/OS Metro/Global Mirror
17.7.3 z/OS Metro/Global Mirror performance

Appendix A. Logical configuration examples
A.1 Considering hardware resource availability for throughput
A.2 Resource isolation or sharing

Scenario 1: Spreading everything with no isolation
Scenario 2: Spreading data I/O with partial isolation
Scenario 3: Grouping unlike RAID types together in the extent pool
Scenario 4: Grouping like RAID types in the extent pool
Scenario 5: More isolation of RAID types in the extent pool
Scenario 6: Balancing mixed RAID type ranks and capacities

Appendix B. Windows server performance log collection
B.1 Windows Server 2003 log file configuration
Configuring logging of disk metrics Windows Server 2003
Saving counter log settings
Importing counter logs properties
Analyzing disk performance from collected data
Retrieving data from a counter log file
Exporting logged data on Windows Server 2003
B.2 Windows Server 2008 log file configuration
Windows Server 2008 Export

Appendix C. UNIX shell scripts
C.1 Introduction
C.2 vgmap
C.3 lvmap
C.4 vpath_iostat
C.5 ds_iostat
C.6 test_disk_speeds
C.7 lsvscsimap.ksh
C.8 mkvscsimap.ksh

Appendix D. Post-processing scripts
D.1 Introduction
D.2 Dependencies
D.2.1 Running the scripts

Appendix E. Benchmarking
E.1 Goals of benchmarking
E.2 Requirements for a benchmark
Define the benchmark architecture
Define the benchmark workload
Monitoring the performance
Define the benchmark time frame
E.3 Caution using benchmark results to design production

Related publications
IBM Redbooks publications
Other publications
Online resources
How to get IBM Redbooks publications
Help from IBM

556 561 562 563 565 566 571 572 572 575 576 576 576 578 580 584 587 588 588 589 590 594 597 598 602 607 608 608 609 623 624 624 625 625 626 626 627 629 629 630 630 630 631

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 633


Notices
This information was developed for products and services offered in the U.S.A. IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service. IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A. The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you. This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice. Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk. IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you. Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental. COPYRIGHT LICENSE: This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. 
You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.


Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. These and other IBM trademarked terms are marked on their first occurrence in this information with the appropriate symbol (® or ™), indicating US registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at http://www.ibm.com/legal/copytrade.shtml The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both:
AIX 5L AIX alphaWorks CICS DB2 Universal Database DB2 DS4000 DS6000 DS8000 ECKD Enterprise Storage Server ESCON eServer FICON FlashCopy GDPS Geographically Dispersed Parallel Sysplex GPFS HACMP i5/OS IBM iSeries Iterations OMEGAMON OS/390 Parallel Sysplex POWER5 POWER5+ POWER6 PowerHA PowerPC PowerVM POWER pSeries Rational Redbooks Redbooks (logo) RS/6000 S/390 Sysplex Timer System i System p5 System p System Storage System x System z10 System z9 System z Tivoli Enterprise Console Tivoli TotalStorage xSeries z/Architecture z/OS z/VM z/VSE z9 zSeries

The following terms are trademarks of other companies: Acrobat, and Portable Document Format (PDF) are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, other countries, or both. Disk Magic, IntelliMagic, and the IntelliMagic logo are trademarks of IntelliMagic BV in the United States, other countries, or both. ITIL is a registered trademark, and a registered community trademark of the Office of Government Commerce, and is registered in the U.S. Patent and Trademark Office. Novell, SUSE, the Novell logo, and the N logo are registered trademarks of Novell, Inc. in the United States and other countries. Oracle, JD Edwards, PeopleSoft, Siebel, and TopLink are registered trademarks of Oracle Corporation and/or its affiliates. QLogic, and the QLogic logo are registered trademarks of QLogic Corporation. SANblade is a registered trademark in the United States. SAP R/3, SAP, and SAP logos are trademarks or registered trademarks of SAP AG in Germany and in several other countries. VMotion, VMware, the VMware "boxes" logo and design are registered trademarks or trademarks of VMware, Inc. in the United States and/or other jurisdictions. J2EE, Java, JNI, S24, Solaris, Solstice, Sun, ZFS, and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. Excel, Internet Explorer, Microsoft, MS-DOS, MS, PowerPoint, SQL Server, Visual Basic, Windows NT,
Windows Server, Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. Intel, Intel logo, Intel Inside logo, and Intel Centrino logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries. Linux is a trademark of Linus Torvalds in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of others.



Preface
This IBM Redbooks publication provides guidance about how to configure, monitor, and manage your IBM System Storage DS8000 to achieve optimum performance. It describes the DS8000 performance features and characteristics and how they can be exploited with the various server platforms that attach to the DS8000. Then, in separate chapters, we detail specific performance recommendations and discussions that apply for each server environment, as well as for database and DS8000 Copy Services environments. We also outline the various tools available for monitoring and measuring I/O performance for different server environments, as well as describe how to monitor the performance of the entire DS8000 subsystem.

The team that wrote this IBM Redbooks publication


This book was produced by a team of specialists from around the world working with the International Technical Support Organization, San Jose Center at the ESCC lab in Mainz, Germany. Bertrand Dufrasne is an IBM Certified Consulting I/T Specialist and Project Leader for System Storage disk products at the International Technical Support Organization, San Jose Center. He has worked at IBM in various I/T areas. Bertrand has written many IBM Redbooks publications and has also developed and taught technical workshops. Before joining the ITSO, he worked for IBM Global Services as an Application Architect in the retail, banking, telecommunication, and healthcare industries. He holds a Masters degree in Electrical Engineering from the Polytechnic Faculty of Mons (Belgium). Brett Allison has performed distributed systems performance-related work since 1997, including performance analysis of J2EE applications, UNIX/Windows NT systems, and SAN Storage technologies. He is currently the storage performance and capacity technical focal point for IBM Global Services Technology Delivery. He has designed and developed tools, processes, and service offerings to support storage performance and capacity. He has spoken at a number of conferences and is the author of several White Papers on performance. John Barnes is a Senior IT Specialist in IBM Global Services in the UK. John started his IBM career 30 years ago as a Large Systems hardware CE. After an assignment to the UK Hardware Support Centre, he moved to a career in Availability Management. John then joined the UK Storage and SAN Services Team in 2000, specializing in IBM TotalStorage Enterprise Storage Server (ESS) and SAN implementations. He now works in the UK STG Storage Services team, specializing in DS8000 and SAN implementations, including SAN Volume Controller (SVC) and Copy Services. Jean Iyabi is an active member of the IBM ESCC (European Storage Competence Center) in Mainz, Germany since 2001. As a Product Field Engineer, he acted as last level support for High End storage disk. Jean has extensive experience in DS8000 support and focuses on Host Attachment (System z), Extended Copy Services Functions, and Geographically Dispersed Parallel Sysplex (GDPS). He was assigned for two years as the EMEA field support interface with the DS8000 development and test teams in Tucson, AZ. He holds a degree in Electrical Engineering from the University of Applied Sciences of Wiesbaden (Germany).


Rajesh Jeyapaul is an AIX Development Support Specialist in IBM India. He has nine years of experience in AIX, specializing in investigating the performance impact of processes running in AIX. Currently, he is leading a technical team responsible for providing Development support to various AIX components. He holds a Masters Degree in Software Systems from the University of BITS, India, and an MBA from University of MKU, India. His areas of expertise include System p, AIX, and High-Availability Cluster Multi-Processing (HACMP). Peter Kimmel is an IT Specialist and the ATS team lead of the Enterprise Disk Performance team at the European Storage Competence Center in Mainz, Germany. He joined IBM Storage in 1999 and since then worked with SSA, VSS, the various ESS generations, and DS8000/DS6000. He has been involved in all Early Shipment Programs (ESPs), early installs for the Copy Services rollouts, and has co-authored several DS8000 IBM Redbooks publications so far. Peter holds a Diploma (MSc) degree in Physics from the University of Kaiserslautern. Chuck Laing is a Senior IT Architect and Master Certified IT Specialist with The Open Group. He is also an IBM Certified IT Specialist, specializing in IBM Enterprise Class and Midrange Disk Storage Systems/Configuration Management in the Americas ITD. He has co-authored eight previous IBM Redbooks publications about the IBM TotalStorage Enterprise Storage Server and the DS8000/6000. He holds a degree in Computer Science. He has worked at IBM for over ten years. Before joining IBM, Chuck was a hardware CE on UNIX systems for ten years and taught Computer Science at Midland College for six and a half years in Midland, Texas. Anderson Ferreira Nobre is a Certified IT Specialist and Certified Advanced Technical Expert - IBM System p5 in Strategic Outsourcing in Hortolndia (Brazil). He has 10 years of experience with UNIX (mainly with AIX). He was assigned to the UNIX team in 2005 to plan, manage, and support the UNIX, SAN, and Storage environments for IBM Outsourcing clients. Rene Oehme is an IBM Certified Specialist for High-End Disk Solutions, working for the Germany and CEMAAS Hardware Support Center in Mainz, Germany. Rene has more than six years of experience in IBM hardware support, including Storage Subsystems, SAN, and Tape Solutions, as well as System p and System z. Currently, he provides support for clients and service representatives with High End Disk Subsystems, such as the DS8000, DS6000, and ESS. His main focus is Open Systems attachment of High-End Disk Subsystems, including AIX, Windows, Linux, and VMware. He holds a degree in Information Technology from the University of Cooperative Education (BA) Stuttgart. Gero Schmidt is an IT Specialist in the IBM ATS technical sales support organization in Germany. He joined IBM in 2001 working at the European Storage Competence Center (ESCC) in Mainz, providing technical support for a broad range of IBM storage products (SSA, ESS, DS4000, DS6000, and DS8000) in Open Systems environments with a primary focus on storage subsystem performance. During his seven years of experience with IBM storage products, he participated in various beta test programs for ESS 800 and especially in the product rollout and beta test program for the DS6000/DS8000 series. He holds a degree in Physics (Dipl.-Phys.) from the Technical University of Braunschweig, Germany. Paulus Usong started his IBM career in Indonesia decades ago. He rejoined IBM at the Santa Teresa Lab (now known as the Silicon Valley Lab). 
In 1995, he joined the Advanced Technical Support group in San Jose. Currently, he is a Certified Consulting I/T Specialist, and his main responsibilities are handling mainframe DASD performance critical situations and performing Disk Magic studies and remote copy sizing for clients who want to implement the IBM solution for their disaster recovery system.

The team: Rene, Bert, John, Brett, Gero, Anderson, Jean, Paulus, and Peter

Special thanks
For hosting this residency at the ESCC in Mainz, Germany, we want to thank:
Rainer Zielonka - Director ESCC
Dr. Friedrich Gerken - Manager Services and Technical Sales Support
Rainer Erkens - Manager ESCC Service & Support Management
Bernd Müller - Manager Enterprise Disk High-End Solutions Europe, for dedicating so many resources to this residency
Stephan Weyrich - Opportunity Manager ESCC Workshops
We especially want to thank Lee La Frese (IBM, Tucson) for being our special advisor and development contact for this book.
Many thanks to those people in IBM in Mainz, Germany, who helped us with access to equipment as well as technical information and review: Uwe Heinrich Mueller, Uwe Schweikhard, Guenter Schmitt, Joerg Zahn, Werner Deul, Mike Schneider, Markus Oscheka, Hartmut Bohnacker, Gerhard Pieper, Alexander Warmuth, Kai Jehnen, Frank Krueger, and Werner Bauer
Special thanks to:
John Bynum - DS8000 World Wide Technical Support Marketing Lead


Thanks to the following people for their contributions to this project:
Mary Anne Bromley, Garry Bennet, Jay Kurtz, Rosemary McCutchen, Brian J. Smith, Sonny Williams
IBM US
Nick Clayton, Patrick Keyes, Andy Wharton, Barry Whyte
IBM UK
Brian Sherman
IBM Canada

Become a published author


Join us for a two-week to six-week residency program. Help write an IBM Redbooks publication dealing with specific products or solutions, while getting hands-on experience with leading edge technologies. You will team with IBM technical professionals, IBM Business Partners, and clients. Your efforts will help increase product acceptance and client satisfaction. As a bonus, you will develop a network of contacts in IBM development labs, and increase your productivity and marketability. Find out more about the residency program, browse the residency index, and apply online at: ibm.com/redbooks/residencies.html

Comments welcome
Your comments are important to us. We want our IBM Redbooks publications to be as helpful as possible. Send us your comments about this or other IBM Redbooks publications in one of the following ways: Use the online Contact us review IBM Redbooks publication form found at: ibm.com/redbooks Send your comments in an e-mail to: redbook@us.ibm.com Mail your comments to: IBM Corporation, International Technical Support Organization Dept. HYTD Mail Station P099 2455 South Road Poughkeepsie, NY 12601-5400


Chapter 1. DS8000 characteristics
This chapter contains a high level discussion and introduction to the storage server performance challenge. Then, we provide an overview of the DS8000 model characteristics that allow the DS8000 to meet this performance challenge.


1.1 The storage server challenge


One of the primary criteria in judging a storage server is performance: how fast it responds to a read or write request from an application server. How well a storage server accomplishes this task depends on the design of its hardware and firmware. Data continually moves from one component to another within a storage server. The objective of server design is to have hardware of sufficient throughput to keep that data flowing smoothly, without waits because a component is busy. When data stops flowing because a component is busy, a bottleneck has formed. Obviously, it is desirable to minimize the frequency and severity of bottlenecks. The ideal storage server is one in which all components are well utilized and bottlenecks are few. This scenario is the case if:
The machine is designed well, with all hardware components in balance. To provide this balance over a range of workloads, a storage server must allow a range of hardware component options.
The machine is sized well for the client's workload. That is, where options exist, the right quantities of each option were chosen.
The machine is set up well. That is, where options exist in hardware installation and logical configuration, these options are chosen correctly.

1.1.1 Performance numbers


Raw performance numbers provide evidence that a particular storage server is better than the previous generation model, or better than the competition's product. But isolated performance numbers often say little about how the storage server will behave with a particular production workload. It is important to understand how raw performance numbers relate to the performance of the storage server in processing that workload. Throughput numbers are usually achieved in controlled tests, which have the objective of pushing as much data as possible through the storage server as a whole, or perhaps through just a single component. At the point of maximum throughput, the system is usually so overloaded that response times are greatly extended. Trying to achieve such throughput numbers in a normal business environment brings protests from the users of the system, because response times are extremely poor. To assure yourself that the DS8000 offers the latest and fastest technology, look at the performance numbers for the individual disks, adapters, and other components of the DS8000, as well as for the total device. You will find that the DS8000 uses the most current technology available. But use a more rigorous approach when planning the DS8000 hardware configuration to meet the requirements of a specific environment.
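To see why data sheet throughput and acceptable response times rarely coincide, we can use a small Python sketch based on a simple single-server (M/M/1) queueing approximation. This is purely illustrative; it is not the model used by the DS8000 or by Disk Magic, and the 5 ms service time is an assumed figure.

def response_time_ms(service_ms, utilization):
    # Simple M/M/1 approximation: R = S / (1 - U).
    # As utilization approaches 100%, response time grows without bound.
    return service_ms / (1.0 - utilization)

for u in (0.30, 0.50, 0.70, 0.85, 0.95, 0.99):
    print(f"component utilization {u:4.0%} -> response time {response_time_ms(5.0, u):7.1f} ms")

The throughput measured at 99% utilization looks impressive on a data sheet, but the response time at that point is already 100 times the raw service time.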

1.1.2 Recommendations and rules


Hardware selections are sometimes based on general recommendations and rules. A general rule is a simple guideline for making a selection based on limited information. The advantage is that it allows you to make a quick decision, with little effort, that provides a solution that works acceptably well most of the time. The disadvantage is that it does not work all the time; sometimes, the solution is not at all what the client needs. You can increase the chances that the solution will work by making it more conservative. However, a conservative solution generally involves more hardware, which means a more expensive solution. In this chapter, we will provide recommendations and general rules for different hardware components. Just remember, only use general rules when there is no information available to make a more informed decision.

1.1.3 Modeling your workload


A much better way to determine the hardware requirements for your workload is to run a Disk Magic model. Disk Magic is a modeling tool, which shows the throughput and response time of a storage server based on workload characteristics and the hardware resources of the storage server. By converting the results of performance runs into mathematical formulas, Disk Magic allows the results to be applied to a wide range of workloads. Disk Magic allows many variables of hardware to be brought together so the effect of each variable is integrated, producing a result that shows the overall performance of the storage server. For additional information about this tool, refer to 7.1, Disk Magic on page 162.

1.1.4 Allocating hardware components to workloads


There are two contrasting methods to allocate the use of hardware components to workloads. The first method is spreading the workloads across components, which means that you try to share the use of hardware components across all, or at least many, workloads. The more hardware components are shared among multiple workloads, the more effectively the hardware components are utilized, which reduces total cost of ownership (TCO). For example, to attach multiple hosts, you can use the same host adapters for all hosts instead of acquiring a separate set of host adapters for each host. However, the more that components are shared, the more potential there is that one workload will dominate use of the component. The second method is isolating workloads to specific hardware components, which means that specific hardware components are used for one workload, and other hardware components are used for different workloads. The downside of isolating workloads is that certain components are unused when their workload is not demanding service. On the upside, it means that when that workload does demand service, the component is available immediately, and the workload does not have to contend with other workloads for that resource. Spreading the workload maximizes the utilization and performance of the storage server as a whole. Isolating a workload is a way to maximize that individual workloads performance, making it run as fast as possible. For a detailed discussion, refer to 5.1, Basic configuration principles for optimal performance on page 64.
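We can make this trade-off concrete with a small sketch. Assume two workloads that peak at different times of day and two host adapters; every number below is hypothetical and serves only to illustrate the principle.

# Peak MB/s per time interval for two workloads (assumed figures).
oltp_day    = [800, 900, 950, 300, 200]
batch_night = [150, 200, 250, 850, 900]
adapter_limit = 1000   # nominal MB/s per host adapter (assumed)

# Isolation: each workload owns one adapter, so each adapter must absorb its workload's peak.
isolated_peaks = [max(oltp_day), max(batch_night)]

# Spreading: both workloads share both adapters, so each adapter carries half of the combined load.
shared_per_adapter = [(d + n) / 2 for d, n in zip(oltp_day, batch_night)]

print("isolated: adapter peaks =", isolated_peaks)
print("spread  : adapter load  =", shared_per_adapter, "peak =", max(shared_per_adapter))

Spreading keeps each adapter well below its limit and uses the hardware more evenly, but if both workloads ever peak together, they contend for the same adapters; isolation guarantees each workload its resources at the price of idle capacity.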

1.2 Meeting the challenge: DS8000


The DS8000 is a member of the DS product family. It offers disk storage servers with a wide range of hardware component options to fit many workload requirements, in terms of both type and size. It has the capability to scale very well to the highest disk storage capacities. The scalability is supported by design functions that allow installation of additional components without disruption. The IBM System Storage DS8000 has the performance to allow multiple workloads to be easily consolidated into a single storage subsystem.


1.2.1 DS8000 models and characteristics


The DS8000 series currently has three Turbo models available: the DS8100 Turbo Model 931 and the DS8300 Turbo Models 932 and 9B2. The difference in models is in the processors and in the capability of storage system logical partitions (LPARs). The predecessors of the DS8000 series Turbo models were the DS8000 series Models 921, 922, and 9A2. The base frame houses the processor complexes, including system memory, up to 16 host adapters, and up to 128 disk modules. The first expansion frame houses up to 16 additional host adapters (for a total of 32) and up to 256 additional disk modules (for a total of 384). A second expansion frame houses up to another 256 disk modules (for a grand total of 640). The third and fourth expansion frame houses up to 256 and 128 (for a grand total of 1024) additional disk modules respectively. There are no additional host adapters installed for the second, third, and fourth expansion frames. Table 1-1 provides an overview of the DS8000 models, including processor, memory, host adapter, and disk specifications for each model. Note that the DS8300 LPAR model is essentially the same as the non-LPAR model in terms of hardware components. However, the LPAR model provides a 50/50, a 75/25, or a 25/75 split of processors and system memory, and up to half the maximum number of host adapters and disk modules on each system image.
Table 1-1 DS8000 processor models overview (columns: DS8100 Turbo model 931 / DS8300 Turbo model 932 / DS8300 Turbo LPAR model 9B2)
Number of processor complexes: 2 / 2 / 2
Number of processors per complex: 2 / 4 / 4
Number of Storage Facility Images (SFIs): 1 / 1 / 2 (each SFI has half the total processor resources)
Processor speed: 2.2 GHz / 2.2 GHz / 2.2 GHz
Processor memory options (cache): 16, 32, 64, or 128 GB / 32, 64, 128, or 256 GB / 32, 64, 128, or 256 GB (each SFI has half the total memory)
Expansion frames, minimum - maximum: 0 - 1 / 0 - 4 / 0 - 4
Expansion frame model: Model 92E / Model 92E / Model 9AE
Host adapters, minimum - maximum: 2 - 16 / 2 - 32 / 4 - 32 (each SFI can have 2 - 16)
Ports per Fibre Channel Protocol (FCP)/Fibre Channel connection (FICON) host adapter: 4 / 4 / 4
Ports per Enterprise Systems Connection (ESCON) host adapter: 2 / 2 / 2
Disk drive modules (DDMs), minimum - maximum: 16 - 384 / 16 - 1024 / 32 - 1024

Next, we provide a short description of the main hardware components.

POWER5+ processor technology


The DS8000 series exploits the IBM POWER5+ technology, which is the foundation of the storage system LPARs. The DS8100 Model 931 utilizes the 64-bit microprocessors dual 2-way processor complexes, and the DS8300 Model 932/9B2 uses the 64-bit dual 4-way processor complexes. Within the POWER5+ servers, the DS8000 series offers up to 256 GB of cache, which is up to four times as much as the previous ESS models.

Internal fabric
The DS8000 comes with a high bandwidth, fault tolerant internal interconnection, which is also used in the IBM System p servers. It is called RIO-2 (Remote I/O) and can operate at speeds up to 1 GHz and offers a 2 GB/s sustained bandwidth per link.

Switched Fibre Channel Arbitrated Loop (FC-AL)


The disk interconnection has changed in comparison to the previous ESS. Instead of the Serial Storage Architecture (SSA) loops, there is now a switched FC-AL implementation. This implementation offers a point-to-point connection to each drive and adapter, so that there are four paths available from the controllers to each disk drive.

Disk drives
The DS8000 offers a selection of industry standard Fibre Channel (FC) disk drives. There are 15k rpm FC drives available with 146 GB, 300 GB, or 450 GB capacity. The 500 GB Fibre Channel Advanced Technology Attachment (FATA) drives (7200 rpm) allow the system to scale up to 512 TB of capacity.

Host adapters
The DS8000 offers enhanced connectivity with the availability of four-port Fibre Channel/FICON host adapters. The 4 Gb/s Fibre Channel/FICON host adapters, which are offered in longwave and shortwave, can also auto-negotiate to 2 Gb/s or 1 Gb/s link speeds. This flexibility enables immediate exploitation of the benefits offered by the higher performance, 4 Gb/s storage area network (SAN)-based solutions, while also maintaining compatibility with existing 2 Gb/s infrastructures. In addition, the four ports on the adapter can be configured with an intermix of Fibre Channel Protocol (FCP) and FICON, which can help protect your investment in Fibre adapters, and increase your ability to migrate to new servers. The DS8000 also offers two-port ESCON adapters. A DS8000 can support up to a maximum of 32 host adapters, which provide up to 128 Fibre Channel/FICON ports.

1.3 DS8000 performance characteristics overview


The IBM System Storage DS8000 offers optimally balanced performance, which is over six times the throughput of the Enterprise Storage Server Model 800. This throughput is possible, because the DS8000 incorporates many performance enhancements, such as the dual-clustered POWER5+ servers, four-port 4 Gb Fibre Channel/FICON host adapters, new Fibre Channel disk drives, and the high-bandwidth, fault-tolerant internal interconnections.

With all these new components, the DS8000 is positioned at the top of the high performance category. As previously mentioned in this chapter, the following components contribute to the high performance of the DS8000: Redundant Array of Independent Disks (RAID), array across loops (AAL), POWER5+ processors, RIO, and the FC-AL implementation with a truly switched FC back end. In addition to these, there are even more contributions to performance as illustrated in the following sections.

1.3.1 Advanced caching techniques


The DS8000 benefits from advanced caching techniques.

Sequential Prefetching in Adaptive Replacement Cache (SARC)


Another performance enhancer is the use of the new self-learning cache algorithms. The DS8000 series caching technology improves cache efficiency and enhances cache hit ratios. One of the patent-pending algorithms that is used in the DS8000 series and the DS6000 series is called Sequential Prefetching in Adaptive Replacement Cache (SARC). SARC provides:
Sophisticated, patented algorithms to determine what data to store in cache based upon the recent access and frequency needs of the hosts
Prefetching, which anticipates data prior to a host request and loads it into cache
Self-learning algorithms to adapt and dynamically learn what data to store in cache based upon the frequency needs of the hosts

Adaptive Multi-Stream Prefetching (AMP)


AMP introduces an autonomic, workload-responsive, self-optimizing prefetching technology that adapts both the amount of prefetch and the timing of prefetch on a per-application basis in order to maximize the performance of the system. AMP provides provably optimal sequential read performance, maximizing the aggregate sequential read throughput of the system.

1.3.2 IBM System Storage multipath Subsystem Device Driver (SDD)


SDD is a pseudo device driver on the host system that is designed to support the multipath configuration environments in IBM products. It provides load balancing and enhanced data availability capability. By distributing the I/O workload over multiple active paths, SDD provides dynamic load balancing and eliminates data-flow bottlenecks. SDD also helps eliminate a potential single point of failure by automatically rerouting I/O operations when a path failure occurs. SDD is provided with the DS8000 series at no additional charge. Fibre Channel (Small Computer System Interface (SCSI)-FCP) attachment configurations are supported in the AIX, Hewlett-Packard UNIX (HP-UX), Linux, Microsoft Windows, Novell NetWare, and Sun Solaris environments.
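The following Python sketch illustrates the general idea of load balancing and path failover in a multipath driver. It is not SDD code and does not use any SDD interface; the path names and the fewest-outstanding-I/Os selection policy are assumptions chosen for illustration.

class Path:
    def __init__(self, name):
        self.name = name
        self.online = True
        self.inflight = 0   # outstanding I/Os currently using this path

def select_path(paths):
    # Load balancing: pick the online path with the fewest outstanding I/Os.
    candidates = [p for p in paths if p.online]
    if not candidates:
        raise IOError("no paths left to the logical volume")
    return min(candidates, key=lambda p: p.inflight)

paths = [Path("path0"), Path("path1"), Path("path2"), Path("path3")]
paths[2].online = False          # simulate a failed path; I/O is rerouted automatically
for i in range(6):
    chosen = select_path(paths)
    chosen.inflight += 1         # dispatch the I/O on the chosen path
    print(f"I/O {i} routed over {chosen.name}")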

1.3.3 Performance characteristics for System z


The DS8000 series supports the following IBM performance innovations for System z environments: FICON extends the ability of the DS8000 series system to deliver high bandwidth potential to the logical volumes needing it, when they need it. Older technologies are limited by the

bandwidth of a single disk drive or a single ESCON channel, but FICON, working together with other DS8000 series functions, provides a high-speed pipe supporting a multiplexed operation. High Performance FICON for z (zHPF) takes advantage of the hardware available today, with enhancements that are designed to reduce the overhead associated with supported commands, that can improve FICON I/O throughput on a single DS8000 port by 100%. Enhancements have been made to the z/Architecture and the FICON interface architecture to deliver improvements for online transaction processing (OLTP) workloads. When exploited by the FICON channel, the z/OS operating system, and the control unit, zHPF is designed to help reduce overhead and improve performance. Parallel Access Volume (PAV) enables a single System z server to simultaneously process multiple I/O operations to the same logical volume, which can help to significantly reduce device queue delays. This function is achieved by defining multiple addresses per volume. With Dynamic PAV, the assignment of addresses to volumes can be managed automatically to help the workload meet its performance objectives and reduce overall queuing. PAV is an optional feature on the DS8000 series. HyperPAV allows an alias address to be used to access any base on the same control unit image per I/O base. This capability also allows different HyperPAV hosts to use one alias to access different bases, which reduces the number of alias addresses required to support a set of bases in a System z environment with no latency in targeting an alias to a base. This functionality is also designed to enable applications to achieve equal or better performance than possible with the original PAV feature alone while also using the same or fewer z/OS resources. Multiple Allegiance expands the simultaneous logical volume access capability across multiple System z servers. This function, along with PAV, enables the DS8000 series to process more I/Os in parallel, helping to improve performance and enabling greater use of large volumes. I/O priority queuing allows the DS8000 series to use I/O priority information provided by the z/OS Workload Manager to manage the processing sequence of I/O operations.
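The benefit of the PAV and HyperPAV aliases described above can be illustrated with a toy simulation of I/Os arriving for a single logical volume: with one device address, only one I/O is in flight and the rest queue, while a base plus aliases lets several I/Os proceed in parallel. The arrival rate and service time are assumed figures, not DS8000 measurements.

import heapq

def average_wait_ms(arrival_times, service_ms, addresses):
    free_at = [0.0] * addresses          # when each device address becomes free
    heapq.heapify(free_at)
    total_wait = 0.0
    for t in arrival_times:
        earliest_free = heapq.heappop(free_at)
        start = max(t, earliest_free)    # wait if every address is busy
        total_wait += start - t
        heapq.heappush(free_at, start + service_ms)
    return total_wait / len(arrival_times)

arrivals = [i * 0.5 for i in range(2000)]   # one I/O every 0.5 ms to the same volume
print("1 base address           :", round(average_wait_ms(arrivals, 1.2, 1), 1), "ms average queue time")
print("1 base + 3 alias devices :", round(average_wait_ms(arrivals, 1.2, 4), 1), "ms average queue time")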


Chapter 2. Hardware configuration
In this chapter, we look at DS8000 hardware configuration, specifically as it pertains to the performance of the device. Understanding the hardware components, including the functions performed by each component, and the technology that they use will help you in making selections of the components to order and the quantities of each component. However, do not focus too much on any one hardware component. Instead, make sure to have a good balance of components that will work together effectively. The ultimate criteria as to whether a storage server is performing well depends on how good its total throughput is. We look at the major DS8000 hardware components: Storage unit, processor complex, and storage logical partitions (LPARs) Cache RIO-G interconnect Disk subsystem and device adapters Host adapters


2.1 Storage system


It is important to understand the naming conventions that are used to describe DS8000 components and constructs in order to fully appreciate the discussion.

Storage unit
A storage unit consists of a single DS8000 (including expansion frames). A storage unit can consist of several frames: one base frame and up to four expansion frames. The storage unit ID is the DS8000 base frame serial number, ending in 0 (for example, 75-06570).

Processor complex
A DS8000 processor complex is one POWER5+ p570 copper-based symmetric multiprocessor (SMP) system unit. On the DS8100 Turbo Model 931, each processor complex has 2-way servers running at 2.2 GHz. On the DS8300 Turbo Models 932 and 9B2, each processor complex has 4-way servers running at 2.2 GHz. On all DS8000 models, there are two processor complexes (servers), which are housed in the base frame. These processor complexes form a redundant pair so that if either processor complex fails, the surviving processor complex continues to run the workload.

Storage Facility Image (SFI)


In a DS8000, an SFI is a union of two logical partitions (processor LPARs), one from each processor complex. Each LPAR hosts one server. The SFI has control of one or more device adapter pairs and two or more disk enclosures. Sometimes, an SFI might also be referred to as a storage image or a storage LPAR. In a DS8000, a server is effectively the software that uses a processor logical partition (a processor LPAR) and that has access to a percentage of the memory and processor resources available on a processor complex. Models 931 and 932 are single SFI models, and consequently, can only have one storage LPAR using 100% of the resources. The DS8300 model 9B2 allows the creation of four servers. In Figure 2-1 on page 11, we have two Storage Facility Images (SFIs). The upper server 0 and upper server 1 form SFI 1. The lower server 0 and lower server 1 form SFI 2. In each SFI, server 0 is the darker color (green) and server 1 is the lighter color (yellow). SFI 1 and SFI 2 can share common hardware (the processor complexes), but they are completely separate from an operational point of view. Note: You might think that the lower server 0 and lower server 1 must be called server 2 and server 3. While this might make sense from a numerical point of view (for example, there are four servers so why not number them from 0 to 3), each SFI is not aware of the other SFIs existence. Each SFI must have a server 0 and a server 1, regardless of how many SFIs or servers there are in a DS8000 storage unit.


Figure 2-1 DS8300 Storage Facility Images (each processor complex hosts one server processor LPAR for Storage Facility Image 1 and one for Storage Facility Image 2)

Each of the two Storage Facility Images has parts of the following DS8000 resources dedicated to its use:
Processors
Cache and persistent memory
I/O enclosures
Disk enclosures

Note: Licensed Machine Code (LMC) level 5.4.0xx.xx or later supports variable SFIs (or storage LPARs) on DS8000 Models 9B2 and 9A2. You can configure the two storage LPARs (or SFIs) for a 50/50 or 25/75% ratio.

The two SFIs can actually have different amounts of disk drives and host adapters available.


2.2 Processor memory and cache


The DS8100 Turbo Model 931 offers processor memory options of 16 GB, 32 GB, 64 GB, and 128 GB. The DS8300 Turbo Models 932 and Model 9B2 offer processor memory options of 32 GB, 64 GB, 128 GB, and 256 GB. On all DS8000 models, each processor complex has its own system memory. Within each processor complex, the system memory is divided into:
Memory used for the DS8000 control program
Cache
Persistent cache
The amount actually allocated as persistent memory scales according to the processor memory selected.

2.2.1 Cache and I/O operations


Caching is a fundamental technique for hiding I/O latency. Cache is used to keep both the
data read and written by the host servers. The host does not need to wait for the hard disk drive to either obtain or store the data that is needed, because cache can be used as an intermediate repository. Prefetching data from the disk for read operations, as well as operations of writing to the disk, are done by the DS8000 asynchronously from the host I/O processing. Cache processing significantly improves the performance of the I/O operations done by the host systems that attach to the DS8000. Cache size and the efficient internal structure and algorithms that the DS8000 uses are factors that improve I/O performance. The significance of this benefit will be determined by the type of workload that is run.

Read operations
When a host sends a read request to the DS8000:
A cache hit occurs if the requested data resides in the cache. In this case, the I/O operation will not disconnect from the channel/bus until the read is complete. A read hit provides the highest performance.
A cache miss occurs if the data is not in the cache. The I/O operation is logically disconnected from the host, allowing other I/Os to take place over the same interface, and a stage operation from the disk subsystem takes place.
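The effect of the read hit ratio on the average read response time is a simple weighted average of the two cases. In this sketch, the 0.4 ms hit time and 8 ms miss (disk stage) time are assumptions chosen only to illustrate the calculation.

def average_read_ms(hit_ratio, hit_ms=0.4, miss_ms=8.0):
    # Weighted average of cache-hit and cache-miss service times.
    return hit_ratio * hit_ms + (1.0 - hit_ratio) * miss_ms

for h in (0.50, 0.70, 0.90, 0.99):
    print(f"read hit ratio {h:4.0%} -> average read response {average_read_ms(h):.2f} ms")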

Write operations - fast writes


A fast write hit occurs when the write I/O operation completes as soon as the data received from the host is transferred to the cache and a copy is made in the persistent memory. Data written to a DS8000 is almost 100% fast write hits. The host is notified that the I/O operation is complete as soon as the data is stored in the two locations, providing very fast write operations. The data remains in the cache and persistent memory until it is destaged, at which point it is flushed from cache. Destage operations of sequential write operations to RAID 5 arrays are done in parallel mode, writing a stripe to all disks in the RAID set as a single operation. An entire stripe of data is written across all the disks in the RAID array, and the parity is generated once for all the data simultaneously and written to the parity disk, dramatically reducing the parity generation penalty associated with write operations to RAID 5 arrays. For RAID 6, data is striped on a block level across a set of drives, similar to RAID 5 configurations, and a second set of parity is calculated and written across all the drives. This


technique does not apply for the RAID 10 arrays, because there is no parity generation required and therefore no penalty involved when writing to RAID 10 arrays. It is possible that the DS8000 cannot copy write data to the persistent cache because it is full, which can occur if all data in the persistent cache is still waiting for destage to disk. In this case, instead of a fast write hit, the DS8000 sends a command to the host to retry the write operation. Having full persistent cache is obviously not a good situation, because it delays all write operations. On the DS8000, the amount of persistent cache is sized according to the total amount of system memory and is designed so that there is a low probability of full persistent cache occurring in normal processing.
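The classic RAID write penalties behind this discussion can be summarized with a little arithmetic. The figures below are the textbook back-end operation counts for small random writes; as described above, full-stripe sequential destages to RAID 5 largely avoid this penalty, and all of these operations occur during destage, after the host has already received completion from persistent cache.

# Back-end disk operations per small random host write (classic RAID write penalties).
write_penalty = {
    "RAID 5":  4,   # read old data, read old parity, write new data, write new parity
    "RAID 6":  6,   # as RAID 5, but with a second parity to read and rewrite
    "RAID 10": 2,   # write the data block and its mirror
}

host_write_iops = 2000          # assumed host write rate
for raid_type, penalty in write_penalty.items():
    print(f"{raid_type:7}: {host_write_iops} host writes/s -> "
          f"{host_write_iops * penalty} back-end disk operations/s")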

Cache management
The DS8000 system offers superior caching algorithms, the Sequential Prefetching in Adaptive Replacement Cache (SARC) algorithm and the Adaptive Multi-stream Prefetching (AMP) algorithm, which were developed by IBM Storage Development in partnership with IBM Research. We explain these technologies next.

Sequential Prefetching in Adaptive Replacement Cache (SARC)


SARC is a self-tuning, self-optimizing solution for a wide range of workloads with a varying mix of sequential and random I/O streams. This cache algorithm attempts to determine four things:
When data is copied into the cache
Which data is copied into the cache
Which data is evicted when the cache becomes full
How the algorithm dynamically adapts to different workloads
The DS8000 cache is organized in 4K byte pages called cache pages or slots. This unit of allocation ensures that small I/Os do not waste cache memory. The decision to copy some amount of data into the DS8000 cache can be triggered from two policies: demand paging and prefetching.

Demand paging means that eight disk blocks (a 4K cache page) are brought in only on a
cache miss. Demand paging is always active for all volumes and ensures that I/O patterns with some locality find at least some recently used data in the cache.

Prefetching means that data is copied into the cache even before it is requested. To prefetch,
a prediction of likely future data accesses is required. Because effective, sophisticated prediction schemes need extensive history of the page accesses, the algorithm uses prefetching only for sequential workloads. Sequential access patterns are commonly found in video-on-demand, database scans, copy, backup, and recovery. The goal of sequential prefetching is to detect sequential access and effectively preload the cache with data in order to minimize cache misses. For prefetching, the cache management uses tracks. A track is a set of 128 disk blocks (16 cache pages). To detect a sequential access pattern, counters are maintained with every track to record if a track has been accessed together with its predecessor. Sequential prefetching becomes active only when these counters suggest a sequential access pattern. In this manner, the DS8000 monitors application read patterns and dynamically determines whether it is optimal to stage into cache: Just the page requested The page requested plus remaining data on the disk track An entire disk track or multiple disk tracks that have not yet been requested


The decision of when and what to prefetch is essentially made on a per-application basis (rather than a system-wide basis) to be responsive to the different data reference patterns of various applications that can be running concurrently. To decide which pages are flushed when the cache is full, sequential and random (non-sequential) data is separated into different lists as illustrated in Figure 2-2.

Figure 2-2 Sequential Prefetching in Adaptive Replacement Cache (SARC)

In Figure 2-2, a page, which has been brought into the cache by simple demand paging, is added to the Most Recently Used (MRU) head of the RANDOM list. With no further references to that page, it moves down to the Least Recently Used (LRU) bottom of the list. A page, which has been brought into the cache by a sequential access or by sequential prefetching, is added to the MRU head of the sequential (SEQ) list and then moves down in that list as more sequential reads are done. Additional rules control the management of pages between the lists in order to not keep the same pages in memory twice. To follow workload changes, the algorithm trades cache space between the RANDOM and SEQ lists dynamically. Trading cache space allows the algorithm to prevent one-time sequential requests from filling the entire cache with blocks of data that have a low probability of being read again. The algorithm maintains a desired size parameter for the SEQ list. The desired size is continually adapted in response to the workload. Specifically, if the bottom portion of the SEQ list is found to be more valuable than the bottom portion of the RANDOM list, the desired size of the SEQ list is increased; otherwise, the desired size is decreased. The constant adaptation strives to make optimal use of limited cache space and delivers greater throughput and faster response times for a given cache size.
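A greatly simplified sketch of this two-list scheme follows. It is not the SARC algorithm itself (real SARC continuously adapts the desired size of the SEQ list, while this toy version keeps it fixed); it only shows how MRU insertion, LRU eviction, and a size target for the SEQ list keep a one-time sequential stream from flooding the cache.

from collections import OrderedDict

class TwoListCache:
    def __init__(self, capacity, seq_desired):
        self.capacity = capacity          # total cache pages
        self.seq_desired = seq_desired    # target size of the SEQ list (fixed here)
        self.random = OrderedDict()       # RANDOM list, oldest (LRU) first
        self.seq = OrderedDict()          # SEQ list, oldest (LRU) first

    def _evict_one(self):
        # Evict from the SEQ list if it exceeds its desired size, else from RANDOM.
        if len(self.seq) > self.seq_desired or not self.random:
            self.seq.popitem(last=False)
        else:
            self.random.popitem(last=False)

    def access(self, page, sequential=False):
        hit = page in self.random or page in self.seq
        self.random.pop(page, None)
        self.seq.pop(page, None)
        target = self.seq if sequential else self.random
        target[page] = True                               # (re)insert at the MRU end
        while len(self.random) + len(self.seq) > self.capacity:
            self._evict_one()
        return hit

cache = TwoListCache(capacity=6, seq_desired=3)
for page in (1, 2, 3, 1, 2):
    cache.access(page)                       # random pages with re-references
for page in (100, 101, 102, 103, 104):
    cache.access(page, sequential=True)      # one long sequential stream
print("RANDOM list:", list(cache.random), " SEQ list:", list(cache.seq))

Even though the sequential stream touches more pages than the random workload, it ends up holding only its target share of the cache, and the random pages with re-reference potential survive.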

SARC Performance
IBM performed a simulation comparing cache management with and without the SARC algorithm. The new algorithm, with no change in hardware, provided:
Effective cache space: 33% greater
Cache miss rate: 11% reduced
Peak throughput: 12.5% increased
Response time: 50% reduced
Figure 2-3 on page 15 shows the improvement in response time due to SARC.

Figure 2-3 Response time improvement with SARC

Adaptive Multi-stream Prefetching (AMP)


As described previously, SARC dynamically divides the cache between the RANDOM and SEQ lists, where the SEQ list maintains pages brought into the cache by sequential access or sequential prefetching. The SEQ list is managed by the Adaptive Multi-stream Prefetching (AMP) technology, developed by IBM research. AMP introduces an autonomic, workload-responsive, self-optimizing prefetching technology that adapts both the amount of prefetch and the timing of prefetch on a per-application basis in order to maximize the performance of the system. The AMP algorithm solves two problems that plague most other prefetching algorithms: Prefetch wastage occurs when prefetched data is evicted from the cache before it can be used. Cache pollution occurs when less useful data is prefetched instead of more useful data. By wisely choosing the prefetching parameters, AMP provides optimal sequential read performance, maximizing the aggregate sequential read throughput of the system. The amount prefetched for each stream is dynamically adapted according to the applications needs and the space available in the SEQ list. The timing of the prefetches is also continuously adapted for each stream to avoid misses and, at the same time, to avoid any cache pollution. AMP dramatically improves performance for common sequential and batch processing workloads. It also provides excellent performance synergy with DB2 by preventing table scans from being I/O bound and improves performance of index scans and DB2 utilities, such as Copy and Recover. Furthermore, AMP reduces the potential for array hot spots, which result from extreme sequential workload demands.
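The adaptation that AMP performs can be caricatured by a tiny rule that grows the prefetch amount when a sequential stream outruns the prefetcher and shrinks it when prefetched tracks age out of the SEQ list unused. This is only a caricature of the idea, not the AMP algorithm, and the cap of 64 tracks is an arbitrary assumption.

def adapt_prefetch_amount(current_tracks, sequential_miss, wasted_prefetch):
    # Grow when the stream ran ahead of the prefetcher (a sequential miss occurred),
    # shrink when prefetched tracks were evicted before being read (prefetch wastage).
    if sequential_miss:
        return min(current_tracks * 2, 64)
    if wasted_prefetch:
        return max(current_tracks // 2, 1)
    return current_tracks

amount = 4
for miss, wasted in [(True, False), (True, False), (False, False), (False, True)]:
    amount = adapt_prefetch_amount(amount, miss, wasted)
    print("prefetch amount is now", amount, "tracks")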

SARC and AMP play complementary roles. While SARC carefully divides the cache between the RANDOM and the SEQ lists to maximize the overall hit ratio, AMP manages the contents of the SEQ list to maximize the throughput obtained for the sequential workloads. While SARC impacts cases that involve both random and sequential workloads, AMP helps any workload that has a sequential read component, including pure sequential read workloads.

2.2.2 Determining the right amount of cache storage


A common question is: "How much cache do I need in my DS8000?" Unfortunately, there is no quick and easy answer. A number of factors influence cache requirements:
Is the workload sequential or random?
Are the attached host servers System z or Open Systems?
What is the mix of reads to writes?
What is the probability that data will be needed again after its initial access?
Is the workload cache friendly (a cache friendly workload performs much better with relatively large amounts of cache)?
It is a common approach to base the amount of cache on the amount of disk capacity. The most common general rules are:
For Open Systems, each TB of capacity needs between 2 GB and 4 GB of cache.
For System z, each TB of disk capacity needs between 2 GB and 5 GB of cache.
Most storage servers support a mix of workloads. These general rules can work acceptably well, but many times, they do not. Use a general rule only if you have no other information on which to base your selection. When coming from an existing disk storage server environment and you intend to consolidate this environment into DS8000s, follow these recommendations:
Choose a cache size for the DS8000 series that has a similar ratio between cache size and disk storage to that of the configuration that you currently use.
When you consolidate multiple disk storage servers, configure the sum of all cache from the source disk storage servers for the target DS8000 processor memory or cache size.
For example, consider replacing four ESS Model 800s, each of which has 3.2 TB and 16 GB cache, with a single DS8100. The ratio between cache size and disk storage for each ESS Model 800 is 0.5% (16 GB/3.2 TB). The new DS8100 is configured with 18 TB to consolidate the four 3.2 TB Model 800s, plus provide capacity for growth. This DS8100 requires 90 GB of cache to keep the original cache-to-disk storage ratio. Round up to the next available memory size, which is 128 GB for this DS8100 configuration.
Note that the cache size is not an isolated factor when estimating the overall DS8000 performance; it must be considered together with other important factors, such as the DS8000 model, the capacity and speed of the disk drives, and the number and type of host adapters. Larger cache sizes mean that a higher percentage of reads are satisfied from the cache, which reduces the load on device adapters and DDMs associated with reading data from disk. To see the effects that different amounts of cache can have on the performance of the DS8000, we recommend that you run a Disk Magic model. Refer to 7.1, Disk Magic on page 162.
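The consolidation recommendation above is easy to script. The small function below reproduces the ESS Model 800 example from the text; the only assumption is the use of 1 TB = 1024 GB when computing the ratio.

def cache_needed_gb(source_boxes, target_capacity_tb):
    # source_boxes: list of (capacity_tb, cache_gb) for the disk servers being consolidated.
    total_tb = sum(tb for tb, _ in source_boxes)
    total_cache_gb = sum(gb for _, gb in source_boxes)
    ratio = total_cache_gb / (total_tb * 1024)        # cache GB per GB of disk capacity
    return target_capacity_tb * 1024 * ratio

ess_800s = [(3.2, 16)] * 4                            # four ESS 800s, 3.2 TB and 16 GB cache each
required = cache_needed_gb(ess_800s, target_capacity_tb=18)
print(f"cache needed to keep the ratio: {required:.0f} GB "
      f"-> round up to the next orderable size, 128 GB")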


2.3 RIO-G interconnect and I/O enclosures


The RIO-G interconnect provides connectivity among the processors and I/O enclosures.

2.3.1 RIO-G loop


Because RIO-G connections go from one component to another component in sequence, and then back to the first component, the connections are called a RIO-G loop. Each RIO-G port can operate at 1 GB/s in bidirectional mode. Also, each RIO-G port is capable of passing data in each direction on each cycle of the port, which creates a redundant high-speed interconnect that allows servers on either storage complex to access resources on any RIO-G loop. The DS8100 and DS8100 Turbo have a single RIO-G loop. The DS8300 and DS8300 Turbo have two RIO-G loops. On the DS8300 LPAR model, which has two Storage Facility Images (SFIs), one of the RIO-G loops is dedicated to SFI 1, and the other RIO-G loop is dedicated to SFI 2. On the LPAR model, all I/O enclosures on the RIO-G loop with the associated host adapters and drive adapters are dedicated to the Storage Facility Image that owns the RIO-G loop. As a result of the strict separation of the two images, the following configuration options exist: Each Storage Facility Image is assigned to one dedicated RIO-G loop; if an image is offline, its RIO-G loop is not available. All I/O enclosures on a given RIO-G loop are dedicated to the image that owns the RIO-G loop. The host adapter and the device adapters on a given loop are dedicated to the associated image that owns this RIO-G loop. Disk enclosures and storage devices behind a given device adapter pair are dedicated to the image that owns the RIO-G loop. Configuring capacity to an image is managed through the placement of disk enclosures on a specific DA pair dedicated to this image.

2.3.2 I/O enclosures


The I/O enclosures hold the device adapters and host adapters, and they provide connectivity between these adapters and the processors via the RIO-G loop. All I/O enclosures within the RIO interconnect fabric are equally served from either processor complex. The DS8100 always has four I/O enclosures, and the DS8300 always has eight I/O enclosures. Each I/O enclosure has 6 adapter slots: 2 slots for device adapters (DAs) and 4 slots for host adapters (HAs). However, depending on the number of disks installed and the number of host connections required, several of these I/O enclosures might not contain any adapters. DAs are referred to as DA pairs, because DAs are always installed in quantities of two (one DA is attached to each processor complex). The members of a DA pair are split across two I/O enclosures for redundancy. The number of disk devices installed determines the number of device adapters required. In any given I/O enclosure, the number of individual DAs installed can be zero, one, or two. Host adapters (HAs) are installed as required to support host connectivity. In any given I/O enclosure, the number of HAs installed can be any number between zero and four.


2.4 Disk subsystem


The DS8000 series offers a selection of Fibre Channel (FC) and FC Advanced Technology Attachment (FATA) disk drives, including 450 GB drives, allowing a DS8100 to scale up to 192 TB of capacity and a DS8300 to scale up to 512 TB of capacity. The disk subsystem consists of three components:
- First, located in the I/O enclosures are the device adapters. These device adapters are RAID controllers that are used by the storage images to access the RAID arrays.
- Second, the device adapters connect to switched controller cards in the disk enclosures, creating a switched Fibre Channel disk network.
- Finally, we have the disks themselves. The disks are commonly referred to as disk drive modules (DDMs).

2.4.1 Device adapters


Each DS8000 device adapter (DA), installed in the I/O enclosure, provides four Fibre Channel ports. These ports are used to connect the processor complexes to the disk enclosures. The device adapter is responsible for managing, monitoring, and rebuilding the disk RAID arrays. The RAID device adapter is built on PowerPC technology with four Fibre Channel ports and high function, high performance application-specific integrated circuits (ASICs). To ensure maximum data integrity, the RAID device adapter supports metadata creation and checking. RAID device adapters operate at 2 Gbps. Device adapters are installed in pairs, because each processor complex requires its own device adapter to connect to each disk enclosure for redundancy. DAs in a pair are installed in separate I/O enclosures to eliminate the I/O enclosure as a single point of failure.

2.4.2 Fibre Channel disk architecture in the DS8000


The DS8000 uses the same Fibre Channel disks that are used in conventional Fibre Channel Arbitrated Loop (FC-AL)-based storage systems. All disks are on a common loop. For commands and data to get to a particular disk, they must traverse all disks ahead of it in the loop. The main shortcomings with conventional FC-AL are:
- As the term arbitration implies, each individual disk within an FC-AL loop competes with the other disks to get on the loop, because the loop supports only one operation at a time.
- In the event of a failure of a disk or connection in the loop, it can be difficult to identify the failing component, leading to lengthy problem determination. Problem determination is even more difficult when the problem is intermittent.
- A third issue with conventional FC-AL is the increasing amount of time that it takes to complete a loop operation as the number of devices increases in the loop.

Switched disk architecture


To overcome these FC-AL shortcomings, the DS8000 architecture adds a switch-based approach, creating switched FC-AL loops. The result is called a Fibre Channel switched disk subsystem, which is illustrated in Figure 2-4 on page 19.


Figure 2-4 DS8000 switched disk architecture (front and rear storage enclosures with 8 or 16 DDMs each, attached through FC switches and point-to-point FC-AL links to the four FC-AL ports of the device adapters in processor complex 0 and processor complex 1)

These switches use FC-AL protocol and attach FC-AL drives through a point-to-point connection. The arbitration message of a drive is captured in the switch and processed and propagated back to the drive without routing it through all the other drives in the loop. Performance is enhanced, because both DAs connect to the switched Fibre Channel disk subsystem. Note that each DA port can concurrently send and receive data.

2.4.3 Disk enclosures


DS8000 disks are mounted in a disk enclosure. Each enclosure holds 16 disks. Disk enclosures are referred to as disk enclosure pairs or expansion enclosure pairs, because you order and install them in groups of two. One disk enclosure in each pair is installed in the front of the DS8000 frame, and the other disk enclosure is installed in the rear of the frame. So the terms front enclosure and rear enclosure are used to distinguish the two disk enclosures. DDMs are added in increments of 16. For each group of 16 disks, eight are installed in the front enclosure and eight are installed in the rear enclosure. These 16 disks form two array sites of eight DDMs each, from which RAID arrays will be built during the logical configuration process. For each array site, four disks are in the front enclosure and four disks are in the rear enclosure. All disks within a disk enclosure pair must be the same capacity and rotation speed. A disk enclosure pair that contains only 16 DDMs must also contain 16 dummy carriers called fillers. These fillers are used to maintain airflow.

DDMs
The DS8000 provides a choice of several DDM types:
- 146 GB, 15K rpm FC disk
- 300 GB, 15K rpm FC disk
- 450 GB, 15K rpm FC disk
- 1000 GB, 7200 rpm Serial Advanced Technology Attachment (SATA) disk
For existing installations:
- 73 GB, 15K rpm FC disk
- 500 GB, 7200 rpm FC Advanced Technology Attachment (FATA) disk
These disks provide a range of options to meet the capacity and performance requirements of various workloads.

2.4.4 Fibre Channel drives compared to FATA and SATA drives


The Fibre Channel (FC) drives that are used for the System Storage DS8000 series run at 15000 rpm, and the FATA and SATA drives run at 7200 rpm. Because the rotational speed of the disk drives is an important aspect of performance, in particular, online transaction processing (OLTP) performance, choose FC drives for all performance critical applications. For sequential streaming loads, FATA and SATA drives deliver almost the same throughput, but for random workloads, the performance difference is huge and directly translates into major response time differences.

FATA and SATA drives have large capacities and a good price-per-TB ratio. However, due to their slower random performance, they are only recommended for very small I/O access densities. Access density is the amount of I/Os per second per Gigabyte of storage (IOPS/GB).

Note that FC and FATA/SATA disk drives have different mechanical properties:
- FC drives, as well as SAS or Small Computer System Interface (SCSI) disk drives, are built for the highest availability demands and a 100% duty cycle.
- FATA and SATA drives, however, are built for a much lower duty cycle, typically for loads that are only active approximately 20% of the time. Using a FATA/SATA drive for a higher duty cycle might heat up the drive so that it enters a speed throttling mode, which degrades its performance even more.

Important: We recommend FC drives whenever there are high requirements for duty cycles and performance.
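The following minimal sketch shows the access density calculation. The workload figures and the 0.5 IOPS/GB cut-off are assumed values for illustration only, not DS8000 rules; model your own configuration with Disk Magic before choosing drive types.

def access_density(iops, capacity_gb):
    """Access density in IOPS per GB of allocated capacity."""
    return iops / capacity_gb

workload_iops = 1200       # measured or estimated host I/O rate (assumed example)
allocated_gb = 8000        # capacity allocated to this workload (assumed example)

density = access_density(workload_iops, allocated_gb)
print(f"Access density: {density:.2f} IOPS/GB")
if density < 0.5:          # assumed illustrative threshold
    print("Low access density: FATA/SATA drives might be acceptable")
else:
    print("Higher access density: prefer 15K rpm FC drives")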

A third aspect regarding the difference between these drive types is the RAID rebuild time after a drive failure. Because this rebuild time grows with larger capacity drives, RAID 6 can be advantageous for the large-capacity SATA and FATA drives, because it protects against a second disk failing during the rebuild of the first failed disk, which would otherwise cause a loss of data. We explain RAID 6 in more detail in 4.1.2, RAID 6 overview on page 43.

2.4.5 Arrays across loops


Each array site consists of eight DDMs. Four DDMs are taken from the front enclosure, and four DDMs are taken from the rear enclosure, of an enclosure pair. When a RAID array is created on the array site, half of the array is in each enclosure. Because the front enclosures are on one switched loop, and the rear enclosures are on a second switched loop, the array is split across two loops, which is called array across loops (AAL). By putting half of each array on one loop and half of each array on another loop, there are more data paths into each array. This design provides a performance benefit, particularly in situations where there is a large amount of I/O going to one array, such as sequential processing and array rebuilds.


2.4.6 Order of installation


A disk enclosure pair holds 32 DDMs or 16 DDMs plus 16 fillers. Each disk enclosure pair is installed in a specific physical location within a DS8000 frame and in a specific sequence. Disk enclosure pairs are installed from the top to the bottom in frame one, then top to bottom in frame two, and (for the DS8300 only) top to bottom in frame three.

DS8100
On the DS8100 Turbo, there can be up to twelve disk enclosure pairs spread across two frames. These twelve disk enclosure pairs are supported on four DA pairs, numbered 0-3. Note that these DA pair numbers are simply labels and do not indicate the order of installation, which is illustrated in Figure 2-5.

Figure 2-5 DS8100 DA pair installation order (up to four device adapter pairs, installed in the order 2, 0, 3, 1; DA pairs 2 and 0 attach up to 128 DDMs each, and DA pairs 3 and 1 attach up to 64 DDMs each)

In Figure 2-5, DA pair 2 attaches to the first two disk enclosure pairs, which contain a total of 64 DDMs. DA pair 0 attaches to the next two disk enclosure pairs, which again contain a total of 64 DDMs. This method continues in a like manner for DA pairs 3 and 1 (in that order). At this point, all four DA pairs are in use, and there are 256 DDMs installed. DA pair 2 (which already has two disk enclosure pairs attached) is now used to attach the next two disk enclosure pairs. Then, DA pair 0 (which also has two disk enclosure pairs already attached) is used to attach two more disk enclosure pairs. At this point, the DS8100 holds its maximum configuration of 384 DDMs. DA pairs 2 and 0 each have 128 DDMs attached. DA pairs 3 and 1 each have 64 DDMs attached. Note: If you have more than 256 DDMs installed, several DA pairs will have more than 64 DDMs. For large configurations, modeling your configuration is even more important to ensure that your DS8100 has sufficient DAs and other resources to handle the workload.


DS8300
The DS8300 Turbo models can connect to one, two, three, or four Expansion Frames, which provides the following configuration alternatives:
- With one Expansion Frame, the storage capacity and number of adapters of the DS8300 models can expand:
  - Up to 384 DDMs in total (as with the DS8100), for a maximum disk storage capacity of 172.8 TB when using 450 GB FC DDMs.
  - Up to 32 host adapters (HAs), which can be an intermix of Fibre Channel/FICON (four-port) adapters and ESCON (two-port) adapters.
- With two Expansion Frames, the disk capacity of the DS8300 models expands up to 640 DDMs in total, for a maximum disk storage capacity of 288 TB when using 450 GB FC DDMs.
- With three Expansion Frames, the disk capacity of the DS8300 models expands up to 896 DDMs in total, for a maximum disk storage capacity of 403.2 TB when using 450 GB FC DDMs.
- With four Expansion Frames, the disk capacity of the DS8300 models expands up to 1024 DDMs in total, for a maximum disk storage capacity of 460.8 TB when using 450 GB FC DDMs (512 TB with the 500 GB FATA DDMs).

There are no additional DAs installed for the second, third, and fourth Expansion Frames. Installing all possible 1024 DDMs results in their even distribution over all the DA pairs; refer to Figure 2-6.

Figure 2-6 DS8300 DA pair installation order (up to eight device adapter pairs; on an LPAR model, DA pairs 2, 6, 7, and 3 are dedicated to SFI 1, and DA pairs 0, 4, 5, and 1 are dedicated to SFI 2)

DA pair 2 attaches to the first two disk enclosure pairs, which contain a total of 64 DDMs. DA pair 0 attaches to the next two disk enclosure pairs, which again contain a total of 64 DDMs. This method continues in a like manner for DA pairs 6, 4, 7, 5, 3, and 1 (in that order). At this point, all eight DA pairs are in use, and there are 512 DDMs installed.


DA pair 2 (which already has two disk enclosure pairs attached) is now used to attach the next two disk enclosure pairs. Then, DA pair 0 (which also has two disk enclosure pairs already attached) is used to attach two more disk enclosure pairs. At this point, the DS8300 holds a configuration of 640 DDMs. DA pairs 2 and 0 have 128 DDMs attached each. DA pairs 6, 4, 7, 5, 3, and 1 have 64 DDMs attached each. The installation sequence for the third and fourth Expansion Frames mirrors the installation sequence of the first and second Expansion Frames with the exception of the last 128 DDMs in the fourth Expansion Frame. Note: If you have more than 512 DDMs installed, several DA pairs will have more than 64 DDMs. For large configurations, modeling your configuration is even more important to ensure that your DS8000 has sufficient DAs and other resources to handle the workload.

DS8300 LPAR
The rules for adding DDMs on the DS8300 LPAR model are the same as the DS8300 non-LPAR model. Disks can be added to one storage image without regard to the number of disks in the other storage image. Each storage image within the DS8300 LPAR model can have up to half the disk hardware components of the DS8300. DA pairs 2, 6, 7, and 3 (and associated disk enclosures) are dedicated to storage image one and DA pairs 0, 4, 5, and 1 (and associated disk enclosures) are dedicated to storage image two. Each storage image can have no more than ten disk enclosures and no more than 512 DDMs. As is the case with the non-LPAR models, DA pairs 2 and 0 can have 128 DDMs each. All other DA pairs have 64 DDMs each.

2.4.7 Performance Accelerator feature (Feature Code 1980)


By default, the DS8300 Turbo (as well as the DS8100 Turbo) comes with a new pair of device adapters for each 64 DDMs. Consequently, each DA pair serves a minimum of eight ranks. If you order a system with 128 drives, you get two device adapter (DA) pairs. When ordering 512 disk drives, you get eight DA pairs, which is currently the maximum number of DA pairs that IBM offers for the DS8300 Turbo model. Having many DA pairs is important to achieve a higher throughput level as required by certain sequential workloads, such as data warehouse installations requiring a throughput of 1 GB/s or more. In your environment, perhaps your sequential throughput requirements are high, but your capacity requirements are low. For instance, you might have capacity requirements for 256 disks only, but you still want the full sequential throughput potential of all DAs. For these situations, IBM offers the Performance Accelerator feature (FC1980). When this feature is enabled, you will get one new DA pair for each 32 DDMs. This feature is offered for 932 and 922 models, which have one base frame and one expansion frame. For example, with this feature, you get six DA pairs with just 192 disk drives, or with 256 drives, you get the maximum of eight DA pairs. Figure 2-7 on page 24 shows how important it is for high sequential throughputs to install and use as many DA pairs as possible. Measurements were done on a DS8300 922 (non-Turbo) model.


Figure 2-7 Sequential throughput increasing when using additional DA pairs (922 model)
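The following rough sketch is based only on the ratios mentioned above (one DA pair per 64 DDMs by default, one per 32 DDMs with FC1980, up to eight DA pairs on the models discussed). It simply shows how the feature changes the DA pair count for a given number of drives and is not a configuration tool.

import math

MAX_DA_PAIRS = 8                      # current maximum discussed above

def da_pairs(ddm_count, performance_accelerator=False):
    """Estimate the number of DA pairs for a given DDM count."""
    ddms_per_da_pair = 32 if performance_accelerator else 64
    return min(math.ceil(ddm_count / ddms_per_da_pair), MAX_DA_PAIRS)

for ddms in (128, 192, 256, 512):
    print(f"{ddms} DDMs: default {da_pairs(ddms)} DA pairs, "
          f"with FC1980 {da_pairs(ddms, True)} DA pairs")
# 128 DDMs: 2 vs 4, 192 DDMs: 3 vs 6, 256 DDMs: 4 vs 8, 512 DDMs: 8 vs 8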

2.5 Host adapters


The DS8000 supports two types of host adapters: Fibre Channel/FICON and ESCON for existing installations. It does not support SCSI adapters. Host adapters are installed in I/O enclosures. There is no affinity between the host adapter and the processor complex. Either processor complex can access any host adapter.

2.5.1 Fibre Channel and FICON host adapters


Fibre Channel is a technology standard that allows data to be transferred from one node to another node at high speeds and great distances (up to 10 km (6.2 miles) and beyond). The DS8000 uses Fibre Channel protocol to transmit SCSI traffic inside Fibre Channel frames. It also uses Fibre Channel to transmit FICON traffic, which uses Fibre Channel frames to carry System z I/Os.

Each DS8000 Fibre Channel card offers four 4-Gbps or 2-Gbps Fibre Channel ports. Each port independently auto-negotiates to 4, 2, or 1 Gbps link speed. Each of the four ports on a DS8000 adapter can also independently be configured as either Fibre Channel protocol (FCP) or FICON, although the ports are initially defined as switched point-to-point FCP. Selected ports are configured to FICON automatically based on the definition of a FICON host. The personality of a port is changeable via the DS Storage Manager GUI. A port cannot be both FICON and FCP simultaneously, but it can be changed as required.

The card itself is PCI-X 64-bit 133 MHz. The card is driven by a new high function, high performance application-specific integrated circuit (ASIC). To ensure maximum data integrity, it supports metadata creation and checking. The host adapter is illustrated in Figure 2-8 on page 25. Each Fibre Channel port supports a maximum of 509 host login IDs, which allows for the creation of very large storage area networks (SANs).


Figure 2-8 DS8000 host adapter card (two Fibre Channel protocol engines with QDR buffers, a data protection data mover ASIC with flash and buffer, a protocol chipset, and a 1 GHz PowerPC 750GX processor)

These adapters are designed to hold four Fibre Channel ports, which can be configured to support either FCP or FICON. They are also enhanced in their configuration flexibility and provide more logical paths: from 256 per FICON port on an ESS to 2048 per FICON port on the DS8000 series. The front end with the 4 Gbps ports scales up to 128 ports for a DS8300, which results in a theoretical aggregated host I/O bandwidth of 128 times 4 Gbps and outperforms an ESS by a factor of eight. The DS8100 still provides four times more bandwidth at the front end than an ESS. However, note that the 4 Gbps and the 2 Gbps HBAs essentially have the same architecture, with the exception of the protocol engine chipset. Hence, while the throughput of an individual port doubles from the 2 Gb HBA to the 4 Gb HBA, the aggregated throughput of the overall HBA when using all ports stays practically the same. For high performance configurations requiring the highest sequential throughputs, we recommend that you actively use only two of the four ports of a 2 Gb HBA, or only one port of a 4 Gb HBA. The remaining ports can serve for pure attachment purposes.
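As a rough planning sketch only (not an official sizing method), the following example estimates how many host adapters a sequential throughput target needs when following the port guideline above. The per-port effective throughput figures are assumptions for illustration; use Disk Magic or the IBM white papers for real planning numbers.

import math

def hbas_needed(target_mb_s, hba_type="4Gb"):
    """Estimate HBAs for a sequential target under the active-port guideline above."""
    if hba_type == "4Gb":
        active_ports, mb_per_port = 1, 400    # assumed effective MB/s per 4 Gbps port
    else:
        active_ports, mb_per_port = 2, 200    # assumed effective MB/s per 2 Gbps port
    return math.ceil(target_mb_s / (active_ports * mb_per_port))

print("4 Gb HBAs for 1600 MB/s:", hbas_needed(1600, "4Gb"))   # -> 4
print("2 Gb HBAs for 1600 MB/s:", hbas_needed(1600, "2Gb"))   # -> 4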

Fibre Channel distances


There are two types of host adapters: long-wave ports and short-wave ports. With long-wave laser, you can connect nodes at distances of up to 10 km (6.2 miles). With short wave, you are limited to a distance of 300 m (984.3 feet) to 500 m (1640.5 feet). The type of connection (long wave or short wave) depends on the type of host adapter that you order. It is not user configurable. All four ports on each host adapter are the same type, either long wave or short wave.

2.5.2 ESCON host adapters


The ESCON host adapter in the DS8000 has two ports for connection to older System z hosts that do not support FICON. ESCON is fibre technology, but has a much slower speed, running at 17 MB/s.

2.5.3 Multiple paths to Open Systems servers


When you have determined your workload requirements in terms of throughput, you must choose the appropriate number of connections to put between your Open Systems servers and the DS8000 to sustain this throughput.


Because host connections frequently go through various external connections between the server and the DS8000, an availability-oriented approach is to have enough host connections for each server so that if half of the connections fail, processing can continue at the same level as before the failure. This approach requires that each connection carry only half the data traffic that it otherwise might carry. These multiple lightly loaded connections also help to minimize the instances when spikes in activity might cause bottlenecks at the host adapter or port. A multiple-path environment requires at least two connections. Four connections are typical, and eight connections are not unusual.
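A minimal sketch of this availability-oriented sizing follows: plan each path at no more than half of its usable bandwidth so that losing half of the paths still sustains the workload. The per-path usable bandwidth used here is an assumed illustrative value, not a DS8000 specification.

import math

def paths_needed(peak_mb_s, path_usable_mb_s=360, survive_half_failure=True):
    """Paths required when each path is planned at 50% of its usable bandwidth."""
    planned_per_path = path_usable_mb_s * (0.5 if survive_half_failure else 1.0)
    return max(math.ceil(peak_mb_s / planned_per_path), 2)   # always at least two paths

print(paths_needed(600))    # for example, a 600 MB/s peak -> 4 paths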

SAN switches and directors


Because a large number of hosts, each using multiple paths, can be connected to the DS8000, it might not be feasible to directly connect each path to the DS8000. The solution is to use SAN switches or directors to switch logical connections from multiple hosts. Each adapter on the host server attaches to the SAN device, and then, the SAN device attaches to the DS8000. When using switches to connect multiple paths between servers and a DS8000, we recommend using two separate switches to avoid a single point of failure.

2.5.4 Multiple paths to System z servers


In the System z environment, the normal practice is to provide multiple paths from each host to a disk subsystem. Typically, four paths are installed. The channels in each host that can access each Logical Control Unit (LCU) in the DS8000 are defined in the hardware configuration definition (HCD) or I/O configuration dataset (IOCDS) for that host. Dynamic Path Selection (DPS) allows the channel subsystem to select any available (non-busy) path to initiate an operation to the disk subsystem. Dynamic Path Reconnect (DPR) allows the DS8000 to select any available path to a host to reconnect and resume a disconnected operation, for example, to transfer data after disconnection due to a cache miss. These functions are part of the System z architecture and are managed by the channel subsystem on the host and the DS8000. In a System z environment, you need to select a SAN switch or director that also supports FICON. ESCON-attached hosts might need an ESCON director. An availability-oriented approach applies to the System z environments just as it does for Open Systems. Plan enough host connections for each server so that if half of the connections fail, processing can continue at the same level as before the failure.

2.5.5 Spreading host attachments


Each Fibre Channel/FICON host adapter provides four ports. A common question is how to distribute the server connections. Take the example of four host connections from each of four servers, all running a similar type of workload. We recommend spreading the host connections with each host attaching to one port on each of four adapters. Now, consider the scenario where the workloads are different. You probably want to isolate mission-critical workloads (for example, customer order processing) from workloads that are lower priority but I/O intensive (for example, data mining) to prevent the I/O intensive workload from dominating the host adapter. If one of the four servers is running an I/O intensive workload, we recommend acquiring two additional host adapters and attaching the I/O intensive server's four connections to these adapters, two host connections on each adapter. The other three servers remain attached to the original four adapters, one host connection per adapter.


We also offer the following general guidelines:
- If you run on a non-LPAR DS8300 Turbo (which has two RIO-G loops), spread the host adapters across the RIO-G loops.
- For Fibre Channel and FICON paths with high or moderate utilization, use only two or three ports on each host adapter, which might increase the number of host adapters required.
- Spread multiple paths from a single host as widely as possible across host adapters, I/O enclosures, and RIO-G loops to maximize performance and minimize the points where a failure causes outages on multiple paths.

2.6 Tools to aid in hardware planning


In this chapter, we have discussed the hardware components of the DS8000. Each component is designed to provide high performance in its specific function and to mesh well with the other hardware components to provide a well-balanced, high performance storage system. There are a number of tools that can assist you in planning your specific hardware configuration.

2.6.1 White papers


IBM regularly publishes white papers that document the performance of specific DS8000 configurations. Typically, workloads are run on multiple configurations, and performance results are compiled so that you can compare the configurations. For example, workloads can be run using different DS8000 models, different numbers of host adapters, or different types of DDMs. By reviewing these white papers, you can make inferences about the relative performance benefits of different components to help you to choose the type and quantities of components to best fit your particular workload requirements. Your IBM representative or IBM Business Partner has access to these white papers and can provide them to you.

2.6.2 Disk Magic


A knowledge of DS8000 hardware components will help you understand the device and its potential performance. However, we recommend using Disk Magic to model your planned DS8000 hardware configuration to ensure that it will handle the required workload. Your IBM representative or IBM Business Partner has access to this tool and can run a Disk Magic study to configure a DS8000 based on your specific workloads. For additional information about the capabilities of this tool, refer to 7.1, Disk Magic on page 162. The tool can also be acquired by clients directly from IntelliMagic B.V. at: http://www.intellimagic.net

2.6.3 Capacity Magic


Determining usable capacity of a disk configuration is a complex task, dependent on DDM types, the RAID technique, and the type of logical volumes that are created. We recommend using the Capacity Magic tool to determine effective utilization. Your IBM representative or IBM Business Partner has access to this tool and can use it to validate that the planned physical disk configuration will provide enough effective capacity to meet your storage requirements.


Chapter 3.

Understanding your workload


In this chapter, we present and discuss the various workload types that an application can generate. This characterization can be useful for understanding performance documents and reports, as well as categorizing the various workloads in your installation. Information in this chapter is not just dedicated to the IBM System Storage DS8000. You can apply this information more generally to other disk storage systems.


3.1 General workload types


Next, we describe the workload type definitions that are used in several of the IBM performance documents.

3.1.1 Standard workload


This workload is characterized by random access of small-sized I/O records (less than or equal to 16 KB). This workload is a mix of 70% reads and 30% writes. This workload is also characterized by moderate read hit ratios in the disk subsystem cache (approximately 50 percent). This workload might be representative of a variety of online applications (for example, the SAP R/3 application, many database applications, and filesystems).

3.1.2 Read intensive cache unfriendly workload


This workload is characterized by extremely random 4 KB reads. The accesses are extremely random, such that virtually no cache hits occur in the disk subsystem cache. This workload might be representative of decision support or business intelligence applications, where virtually all of the cache hits are absorbed in the host memory buffers.

3.1.3 Sequential workload


In many user environments, sequential performance is critical due to the heavy use of sequential processing during the batch window. The types of sequential I/O requests that play an important role in batch processing cover a wide range.

3.1.4 Batch jobs workload


Batch workloads have several common characteristics:
- Frequently, batch workloads consist of a mixture of random database accesses, skip-sequential, pure sequential, and sorting.
- Batch workloads include large data transfers and high path utilizations.
- Batch workloads are often constrained to operate within a particular window of time when online operation is restricted or shut down. Poor or improved performance is often not recognized unless it impacts this window.

3.1.5 Sort jobs workload


Most sorting applications, such as the z/OS DFSORT, are characterized by large transfers for input, output, and work datasets. Adding to the preceding list of characteristics, Table 3-1 on page 31 provides a summary of the characteristics of the various types of workloads.


Table 3-1 Workload types

Workload type | Characteristics | Representative of
Sequential read | Large record reads (QSAM half track, Open 64 KB blocks); large files from disk | Database backups, Large queries, Batch, Reports
Sequential write | Large record writes; large files to disk | Database restores and loads, Batch
z/OS cache uniform | Random 4 KB record, R/W ratio 3.4, read hit ratio 84% | Average database, CICS/VSAM, IMS
z/OS cache standard | Random 4 KB record, R/W ratio 3.0, read hit ratio 78% | Representative of typical database conditions
z/OS cache friendly | Random 4 KB record, R/W ratio 5.0, read hit ratio 92% | Interactive, Existing software
z/OS cache hostile | Random 4 KB record, R/W ratio 2.0, read hit ratio 40% | DB2 logging, Very large DB
Open read-intensive | Random 4 KB record, read% = 67%, hit ratio 28% | DB2 OLTP
Open standard | Random 4 KB record, read% = 70%, hit ratio 50% | Filesystem
Open read-intensive | Random 4 KB record, read% = 100%, hit ratio 0% | Decision support, Warehousing, Large DB inquiry

3.2 Database workload


Analyzing and discussing database workload characteristics is a broad subject. In this section, we limit our discussion to DB2 I/O situations as an example of the database workload demands. In addition to the information discussed in this chapter, refer to Chapter 16, Databases on page 485. The DB2 environment is often difficult to typify, because there can be wide differences in I/O characteristics. DB2 Query has high read content and is of a sequential nature. Transaction environments have more random content and are sometimes cache unfriendly but other times have good hit ratios. DB2 has also implemented several changes that affect I/O characteristics, such as sequential prefetch and exploitation of I/O priority queuing. Users need to understand the unique characteristics of their installation's processing before generalizing about DB2 performance.


3.2.1 DB2 query workload


DB2 query workloads can typically be characterized by:
- High read content
- Large transfer size
A DB2 query workload mostly has the same characteristics as a sequential read workload. The storage subsystem implements sequential prefetch algorithms. This functionality, which caches data that has the most probability to be accessed, provides good performance improvements for most DB2 queries.

3.2.2 DB2 logging workload


DB2 logging is mostly a very cache unfriendly workload with a high sequential write component. A high sequential write capability storage subsystem provides excellent performance for DB2 logging.

3.2.3 DB2 transaction environment workload


DB2 transaction workload characteristics can include:
- Low to moderate read hits, depending upon the size of the DB2 buffers
- Cache unfriendly behavior for certain applications
- Deferred writes that can cause low write hit ratios
- Deferred write chains with multiple locate-record commands in chain
- A low read/write ratio due to reads being satisfied in the large DB2 buffer pool

The enhanced prefetch cache algorithms, together with the high storage back-end bandwidth, provide high subsystem throughput and high transaction rates for DB2 transaction-based workloads. One of DB2's key advantages is the exploitation of a large buffer pool in processor storage. When managed properly, the buffer pool can avoid a large percentage of the accesses to disk. Depending on the application and the size of the buffer pool, this large buffer pool can translate to poor cache hit ratios for synchronous reads in DB2. You can spread data across several RAID arrays to increase the throughput even if all accesses are read misses. DB2 administrators often require that tablespaces and their indexes are placed on separate volumes. This configuration improves both availability and performance.

3.2.4 DB2 utilities workload


DB2 utilities, such as loads, reorganizations, copies, and recovers, generate high read and write sequential and sometimes random operations. This type of workload takes advantage of the sequential bandwidth performance of the back-end storage connection, such as Fibre Channel technology, and back-end storage disk technology (15K rpm speed for DDMs).

3.3 Application workload


This section categorizes various types of common applications according to their I/O behavior. There are four typical categories:
1. Need for high throughput. These applications need more bandwidth (the more, the better). Transfers are large, read only I/Os and typically, sequential access. These applications use database management systems (DBMSs); however, random DBMS access might also exist.
2. Need for high throughput and a mix of R/W, similar to category 1 (large transfer sizes). In addition to 100% read operations, this situation has a mixture of reads and writes in the 70/30 and 50/50 ratios. Here, the DBMS is typically sequential, but random and 100% write operations also exist.
3. Need for high I/O rate and throughput. This category requires both performance characteristics of IOPS and MBps. Depending upon the application, the profile is typically sequential access, medium to large transfer sizes (16 KB, 32 KB, and 64 KB), and 100/0, 0/100, and 50/50 R/W ratios.
4. Need for high I/O rate. With many users and applications running simultaneously, this category can consist of a combination of small to medium-sized transfers (4 KB, 8 KB, 16 KB, and 32 KB), 50/50 and 70/30 R/W ratios, and a random DBMS.

Note: Certain applications have synchronous activities, such as locking database tables during an online backup, or logging activities. These types of applications are highly sensitive to any increase in disk response time and must be handled with extreme care.

Table 3-2 summarizes these workload categories and common applications that can be found at any installation.
Table 3-2 Application workload types

Category | Application | Read/write ratio | I/O size | Access type
4 | General file serving | All simultaneously | 4 KB - 32 KB | Random and sequential
4 | Online transaction processing | 50/50, 70/30 | 4 KB, 8 KB | Random
4 | Batch update | 50/50 | 16 KB, 32 KB | Random and sequential
1 | Data mining | 100/0 | 32 KB, 64 KB, or larger | Mainly sequential, some random
1 | Video on demand | 100/0 | 64 KB or larger | Sequential
2 | Data warehousing | 100/0, 70/30, 50/50 | 64 KB or larger | Mainly sequential, random easier
2 | Engineering and scientific | 100/0, 0/100, 70/30, 50/50 | 64 KB or larger | Sequential
3 | Digital video editing | 100/0, 0/100, 50/50 | 32 KB, 64 KB | Sequential
3 | Image processing | 100/0, 0/100, 50/50 | 16 KB, 32 KB, 64 KB | Sequential
1 | Backup | 100/0 | 64 KB or larger | Sequential

3.3.1 General file serving


This application type consists of many users, running many different applications, all with varying file access sizes and mixtures of read/write ratios, all occurring simultaneously. Applications can include file servers, LAN storage, disk arrays, and even Internet/intranet servers. There is no standard profile here, other than the chaos principle of file access. General file serving fits this application type, because this profile covers almost all transfer sizes and R/W ratios.


3.3.2 Online transaction processing


This application category typically has many users, all accessing the same disk storage subsystem and a common set of files. The file access typically is under control of a DBMS and each user might work on the same or unrelated activities. The I/O requests are typically spread across many files; therefore, the file sizes are typically small and randomly accessed. A typical application consists of a network file server or a disk subsystem that is being accessed by a sales department entering order information.

3.3.3 Data mining


Databases are the repository of most data, and every time that information is needed, a database is accessed. Data mining is the process of extracting valid, previously unknown, and ultimately comprehensible information from large databases and using it to make crucial business decisions. This application category consists of a number of operations, each of which is supported by a variety of techniques, such as rule induction, neural networks, conceptual clustering, association discovery, and so on. In these applications, the DBMS only extracts large sequential or possibly random files depending on the DBMS access algorithms.

3.3.4 Video on demand


Video on demand consists of video playback that can be used to broadcast quality video for either satellite transmission or a commercial application, such as in-room movies. Fortunately for the storage industry, the data rates needed for this type of transfer have been reduced dramatically due to data compression developments. A broadcast quality MPEG2 video stream now only needs about 3.7 MBps of bandwidth to serve a single user. These advancements have reduced the need for higher speed interfaces and can be serviced with the current interface. However, these applications now demand numerous concurrent users interactively accessing multiple files within the same storage subsystem. This requirement has changed the environment of video applications in that a storage subsystem is now specified by the number of video streams that it can service simultaneously. In this application, the DBMS only extracts large sequential files.

3.3.5 Data warehousing


A data warehouse supports information processing by providing a solid platform of integrated, historical data from which to do analysis. A data warehouse organizes and stores the data needed for informational and analytical processing over a long historical time period. A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data that is used to support the management's decision-making process. A data warehouse is always a physically separate store of data that spans a spectrum of time, and there are many relationships found in the data warehouse. An example of a data warehouse is a design around a financial institution and its functions, such as loans, savings, bank cards, and trusts. In this application, there are basically three kinds of operations: initial loading of the data, access to the data, and updating of the data. However, due to the fundamental characteristics of a warehouse, these operations can occur simultaneously. At times, this application can perform 100% reads when accessing the warehouse; 70% reads and 30% writes when accessing data while record updating occurs simultaneously; or even 50% reads and 50% writes when the user load is heavy. Remember that the data within the warehouse is a series of snapshots and after the snapshot of data is made, the data in the warehouse does not change. Therefore, there is typically a higher read ratio when using the data warehouse.


3.3.6 Engineering and scientific applications


The engineering and scientific arena includes hundreds of different applications. Typical applications are CAD, Finite Element Analysis, simulations and modeling, large scale physics applications, and so on. Transfers can consist of 1 GB of data for 16 users, while other transfers might require 20 GB of data and hundreds of users. The engineering and scientific areas of business are more concerned with the manipulation of spatial data as well as the manipulation of series data. This application typically goes beyond standard relational DBMS systems, which manipulate only flat (two-dimensional) data. Spatial or multi-dimensional issues and the ability to handle complex data types are commonplace in engineering and scientific applications. Object-Relational DBMS (ORDBMS) are now being developed, and they not only offer traditional relational DBMS features, but will additionally support complex data types. Objects can be stored and manipulated, and complex queries at the database level can be run. Object data is data about real-world objects, including information about their location, geometry, and topology. Location describes their position, geometry relates to their shape, and topology includes their relationship to other objects. These applications essentially have an identical profile to that of the data warehouse application.

3.3.7 Digital video editing


Digital video editing is popular in the movie industry. The idea that a film editor can load entire feature films onto disk storage and interactively edit and immediately replay the edited clips has become a reality. This application combines the ability to store huge volumes of digital audio and video data onto relatively affordable storage devices to process a feature film. In the near future, when films are shot on location, there will be no need for standard 35 mm film, because all cameras will be directly fed into storage devices and film takes will be immediately reviewed. If the captured action does not turn out as expected, it will be redone immediately. Digital video editing has also been used to generate the latest high tech films that require sophisticated computer-generated special effects. Depending on the host and operating system that are used to perform this application, transfers are typically medium to large in size and access is always sequential.

Image processing consists of moving huge image files for the purpose of editing. In these applications, the user regularly moves huge high-resolution images between the storage device and the host system. These applications service many desktop publishing and workstation applications. Editing sessions can include loading large files of up to 16 MB into host memory, where users edit, render, modify, and eventually store data back onto the storage system. High interface transfer rates are needed for these applications, or the users waste huge amounts of time waiting to see results. If the interface can move data to and from the storage device at over 32 MBps, an entire 16 MB image can be stored and retrieved in less than one second. The need for throughput is all important to these applications, and, along with the additional load of many users, I/O operations per second are also a major requirement.

3.4 Profiling workloads in the design phase


Assessing the I/O profile prior to the build and deployment of the application requires methods of evaluating the workload profile without measurement data. In these cases, we suggest using a combination of general rules based on application type and the development of an application I/O profile by the application architect or the performance architect. The following examples are basic and are designed to provide an idea of how to approach workload profiling in the design phase.


For general rules for application types, refer to Table 3-1 on page 31. Requirements for developing an application I/O profile include:

User population
Determining the user population requires understanding the total number of potential users, which for an online banking application might represent the total number of customers. From this total population, you need to derive the active population that represents the average number of persons using the application at any given time, which is usually derived from experiences with other similar applications. In Table 3-3, we use 1% of the total population. From the average population, we estimate the peak. The peak workload is some multiplier of the average and is typically derived based on experience with similar applications. In this example, we use a multiple of 3.
Table 3-3 User Population

Total potential users | Average active users | Peak active users
50000 | 500 | 1500

Transaction distribution
Table 3-4 breaks down the number of times that key application transactions are executed by the average user and how much I/O is generated per transaction. Detailed knowledge of the application and database are required in order to identify the number of I/Os and the type of I/Os per transaction. The following information is a sample.
Table 3-4 Transaction distribution

Transaction | Iterations per user | I/Os | I/O type
Look up savings account | 1 | 4 | Random read
Look up checking account | 1 | 4 | Random read
Transfer money to checking | .5 | 4 reads/4 writes | Random read/write
Configure new bill payee | .5 | 4 reads/4 writes | Random read/write
Submit payment | 1 | 4 writes | Random write
Look up payment history | 1 | 24 reads | Random read

Logical I/O profile
An I/O profile is created by combining the user population and the transaction distribution. Table 3-5 provides an example of a logical I/O profile.
Table 3-5 Logical I/O profile from user population and transaction profiles

Transaction | Iterations per user | I/Os | I/O type | Average user I/Os | Peak user I/Os
Look up savings account | 1 | 4 | Random read I/Os (RR) | 2000 | 6000
Look up checking account | 1 | 4 | RR | 2000 | 6000
Transfer money to checking | .5 | 4 reads/4 writes | RR, random write I/Os (RW) | 1000, 1000 | 3000 R/W
Configure new bill payee | .5 | 4 reads/4 writes | RR, RW | 1000, 1000 | 3000 R/W
Submit payment | 1 | 4 writes | RW | 2000 | 6000 R/W
Look up payment history | 1 | 24 reads | RR | 12000 | 36000

Physical I/O profile
The physical I/O profile is based on the logical I/O with the assumption that the database will provide cache hits to 90% of the read I/Os. All write I/Os are assumed to require a physical I/O. This physical I/O profile results in a read miss ratio of (1 - 0.9) = 0.1, or 10%. Table 3-6 is an example, and every application will have different characteristics.
Table 3-6 Physical I/O profile

Transaction | Average user logical I/Os | Average active users physical I/Os | Peak active users physical I/Os
Look up savings account | 2000 | 200 RR | 600 RR
Look up checking account | 2000 | 200 RR | 600 RR
Transfer money to checking | 1000, 1000 | 100 RR, 1000 RW | 300 RR, 3000 RW
Configure new bill payee | 1000, 1000 | 100 RR, 1000 RW | 300 RR, 3000 RW
Submit payment | 2000 | 200 RR | 600 RR
Look up payment history | 12000 | 1200 SR | 3600 RR
Totals | 20000 R, 2000 W | 2000 RR, 2000 RW | 6000 RR, 6000 RW

As you can see in Table 3-6, in order to meet the peak workloads, you need to design an I/O subsystem to support 6000 random reads/sec and 6000 random writes/sec:
- Physical I/Os: The number of physical I/Os per second from the host perspective
- RR: Random Read I/Os
- RW: Random Write I/Os
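The logical-to-physical conversion above can also be expressed as a short calculation. The following minimal sketch reuses the planning assumptions stated in this section (1% active users, a 3x peak factor, and a 90% database read-hit ratio) and reproduces two rows of the tables; it only illustrates the method and is not a sizing tool.

total_users = 50000
active_users = total_users * 0.01          # 500 average active users
peak_users = active_users * 3              # 1500 active users at peak
read_miss_ratio = 1 - 0.90                 # 10% of logical reads reach the disk

def physical_io_per_sec(iterations, reads, writes, users):
    """Convert one transaction's logical I/O into physical reads and writes."""
    logical_reads = iterations * reads * users
    logical_writes = iterations * writes * users
    return round(logical_reads * read_miss_ratio), round(logical_writes)

# "Look up payment history": 1 iteration per user, 24 reads, 0 writes
print(physical_io_per_sec(1, 24, 0, peak_users))    # -> (3600, 0)  reads/s, writes/s
# "Transfer money to checking": 0.5 iterations per user, 4 reads, 4 writes
print(physical_io_per_sec(0.5, 4, 4, peak_users))   # -> (300, 3000)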

To determine the appropriate configuration to support your unique workload, refer to Chapter 6, Performance management process on page 147.


3.5 Understanding your workload type


To understand the workload generated by your application and applied to the storage subsystem, monitoring tools are available at various levels, from the storage subsystem's point of view and from the host's point of view.

3.5.1 Monitoring the DS8000 workload


To understand the type of workload that your application generates, you can monitor the workload on the resources of your storage subsystem. Most storage subsystems have a performance monitoring tool, which indicates the characteristics of the host's workload. These tools provide workload information, such as read and write rate, read and write ratio, sequential and random type, average I/O size, and so on, at various levels in the storage subsystem (controller, logical volume, and RAID array). IBM TotalStorage Productivity Center for Disk is the tool to monitor the workload on your DS8000. Refer to 8.2.1, TotalStorage Productivity Center overview on page 205.

3.5.2 Monitoring the host workload


The following sections list the host-based performance measurements and reporting tools under the UNIX, Linux, Windows, i5/OS, and z/OS environments.

Open Systems servers


Here are the most common tools that are available on Open Systems servers to monitor the workload.

UNIX and Linux Open Systems servers


To get host information about I/O subsystems, CPU activities, virtual memory, and physical memory use, you can use the following common UNIX and Linux tools:
- iostat
- vmstat
- sar
These commands are standard tools that are available with most UNIX and UNIX-like (Linux) systems. We recommend using iostat for the data that you need to evaluate your host I/O levels. Specific monitoring tools are also available for AIX, Linux, Hewlett-Packard UNIX (HP-UNIX), and Sun Solaris. For more information, refer to Chapter 11, Performance considerations with UNIX servers on page 307 and Chapter 13, Performance considerations with Linux on page 401.

Intel Open Systems servers


Common Windows 2000 Server, Windows Server 2003, and Windows Server 2008 monitoring tools include the Windows Performance Monitor (perfmon). Performance Monitor gives you the flexibility to customize the monitoring to capture various categories of Windows server system resources, including CPU and memory. You can also monitor disk I/O through perfmon. For more information, refer to Chapter 10, Performance considerations with Windows Servers on page 281.


System i environment
Here are the most popular tools on System i:
- Collection Services
- Disk Watcher
- Job Watcher
- iSeries Navigator Monitors
- IBM Performance management for System i
- Performance Tools for System i
Most of these comprehensive planning tools address the entire spectrum of workload performance on System i, including CPU, system memory, disks, and adapters. The main IBM System i performance data collector is called Collection Services. It is designed to run 24x7x365 and is documented in detail in the System i Information Center at:
http://publib.boulder.ibm.com/infocenter/systems/scope/i5os/index.jsp
Collection Services is a sample-based engine (usually 5 to 15 minute intervals) that looks at jobs, threads, CPU, disk, and communications. It also has a set of specific statistics for the DS8000. For the new systems that are IOP-less hardware (System i POWER6 with IOP-less Fibre Channel), the i kernel keeps track of service and wait time (and therefore response times (S+W)) in buckets per logical unit number (LUN).

Disk Watcher is a new function of IBM i5/OS that provides disk data to help identify the
source of disk-related performance problems on the System i platform. It can either collect information about every I/O in trace mode or collect information in buckets in statistics mode. In statistics mode, it can run much more often than Collection Services to see more granular statistics. The command strings and file layouts are documented in the System i Information Center. The usage is covered in the following articles:
- A New Way to Look at Disk Performance: http://www.ibmsystemsmag.com/i5/may07/administrator/15631p1.aspx
- Analyzing Disk Watcher Data: http://www.ibmsystemsmag.com/i5/may08/tipstechniques/20662p1.aspx
Disk Watcher gathers detailed information associated with I/O operations to disk units. Disk Watcher provides data beyond the data that is available in tools, such as Work with Disk Status (WRKDSKSTS), Work with System Status (WRKSYSSTS), and Work with System Activity (WKSYSACT). Disk Watcher, like other tools, provides data about disk I/O, paging rates, CPU use, and temporary storage use. But Disk Watcher goes further by simultaneously collecting the program, object, job, thread, and task information that is associated with disk I/O operations.

Job Watcher is an advanced tool for collecting and analyzing performance information as a
means to effectively manage your system or to analyze a performance issue. It is job-centric and thread-centric and can collect data at intervals of seconds. The collection contains vital information, such as job CPU and wait statistics, call stacks, SQL statements, objects waited on, sockets, TCP, and more. For more information about using Job Watcher, refer to Web Power - New browser-based Job Watcher tasks help manage your IBM i performance at: http://www.ibmsystemsmag.com/i5/november08/administrator/22431p1.aspx

System z environment
The z/OS systems have proven performance monitoring and management tools available to use for performance analysis. Resource Measurement Facility (RMF), a z/OS performance tool, collects performance data and reports it for the desired interval. It also provides cache reports. The cache reports are similar to the disk-to-cache and cache-to-disk reports that are available in the TotalStorage Productivity Center for Disk, except that RMF's cache reports are provided in text format. RMF collects the performance statistics of the DS8000 that are related to the link or port and also to the rank and extent pool. The REPORTS(ESS) parameter in the RMF report generator produces the reports that are related to those resources. For more information, refer to Chapter 15, System z servers on page 441.


Chapter 4.

Logical configuration concepts and terminology


This chapter summarizes the important concepts that need to be understood in preparation of the DS8000 logical configuration and performance tuning. You can obtain in-depth information in IBM System Storage DS8000 Architecture and Implementation, SG24-6786.

Tip: If you are already familiar with the DS8000 virtualization concepts and terminology, you can skip this chapter.

In this chapter, we review:
- Redundant Array of Independent Disks (RAID) levels
- Storage virtualization layers


4.1 RAID levels and spares


The DS8000 currently supports RAID 5, RAID 6, and RAID 10.

4.1.1 RAID 5 overview


RAID 5 is one of the most commonly used forms of RAID protection.

RAID 5 theory
The DS8000 series supports RAID 5 arrays. RAID 5 is a method of spreading volume data plus parity data across multiple disk drives. RAID 5 provides faster performance by striping data across a defined set of disk drive modules (DDMs). Data protection is provided by the generation of parity information for every stripe of data. If an array member fails, its contents can be regenerated by using the parity data.

RAID 5 implementation in the DS8000


In a DS8000, a RAID 5 array built on one array site will contain either seven or eight disks depending on whether the array site is supplying a spare. A seven-disk array effectively uses one disk for parity, so it is referred to as a 6+P array (where P stands for parity). The reason only seven disks are available to a 6+P array is that the eighth disk in the array site used to build the array was used as a spare. This array is referred to as a 6+P+S array site (where S stands for spare). An 8-disk array also effectively uses one disk for parity, so it is referred to as a 7+P array.

Drive failure with RAID 5


When a disk drive module fails in a RAID 5 array, the device adapter (DA) starts an operation to reconstruct the data that was on the failed drive onto one of the spare drives. The spare that is used will be chosen based on a smart algorithm that looks at the location of the spares and the size and location of the failed DDM. The device adapter performs the rebuild by reading the corresponding data and parity in each stripe from the remaining drives in the array, performing an exclusive-OR operation to recreate the data, and then writing this data to the spare drive. While this data reconstruction occurs, the device adapter can still service read and write requests to the array from the hosts. There might be degradation in performance while the sparing operation is in progress, because the device adapter and switched network resources are used to perform the reconstruction. Due to the switch-based architecture, this effect is minimal. Additionally, any read requests for data on the failed drive require data to be read from the other drives in the array, and then, the DA performs an operation to reconstruct the data. Performance of the RAID 5 array returns to normal when the data reconstruction onto the spare device completes. The time taken for sparing can vary, depending on the size of the failed DDM and the workload on the array, the switched network, and the DA. The use of arrays across loops (AAL) both speeds up rebuild time and decreases the impact of a rebuild.


4.1.2 RAID 6 overview


RAID 6 protection provides more fault tolerance than RAID 5 in the case of disk failures and uses less raw disk capacity than RAID 10.

RAID 6 theory
Starting with Licensed Machine Code 5.4.0.xx.xx, the DS8000 supports RAID 6 protection. RAID 6 presents an efficient method of data protection in case of double disk errors, such as two drive failures, two coincident medium errors, or a drive failure and a medium error. RAID 6 allows for additional fault tolerance by using a second independent distributed parity scheme (dual parity). Data is striped on a block level across a set of drives, similar to RAID 5 configurations, and a second set of parity is calculated and written across all the drives. RAID 6 is best used in combination with large capacity disk drives, such as the 500 GB Fibre ATA (FATA) drives, because these drives have a longer rebuild time, but RAID 6 can also be used with Fibre Channel (FC) drives when the primary concern is higher reliability.

RAID 6 implementation in the DS8000


A RAID 6 array in one array site of a DS8000 can be built on either seven or eight disks: In a seven disk array, two disks are always used for parity, while the eighth disk of the array site is needed as a spare. This kind of a RAID 6 array is hereafter referred to as a 5+P+Q+S array, where P and Q stand for parity and S stands for spare. A RAID 6 array, consisting of eight disks, is built when all necessary spare drives are available. An eight disk RAID 6 array also always uses two disks for parity, so it is referred to as a 6+P+Q array.

Drive failure with RAID 6


In the case of a single drive failure, the rebuild process and completion time are comparable to a RAID 5 rebuild, but slower than rebuilding a RAID 10 array. When two drives need to be rebuilt, the reconstruction times are noticeably longer than a rebuild with any other RAID array type.

4.1.3 RAID 10 overview


RAID 10 is not as commonly used as RAID 5, mainly because more raw disk capacity is needed for every GB of effective capacity.

RAID 10 theory
RAID 10 provides high availability by combining features of RAID 0 and RAID 1. RAID 0 optimizes performance by striping volume data across multiple disk drives at a time. RAID 1 provides disk mirroring, which duplicates data between two disk drives. By combining the features of RAID 0 and RAID 1, RAID 10 provides a second optimization for fault tolerance. Data is striped across half of the disk drives in the RAID 1 array. The same data is also striped across the other half of the array, creating a mirror. Access to data is preserved if one disk in each mirrored pair remains available. RAID 10 offers faster data reads and writes than RAID 5, because it does not need to manage parity. However, with half of the DDMs in the group used for data and the other half to mirror that data, RAID 10 disk groups have less capacity than RAID 5 disk groups.

RAID 10 implementation in the DS8000


In the DS8000, the RAID 10 implementation is achieved using either six or eight DDMs. If spares exist on the array site, six DDMs are used to make a three-disk RAID 0 array, which is


then mirrored. If spares do not exist on the array site, eight DDMs are used to make a four-disk RAID 0 array, which is then mirrored.

Drive failure with RAID 10


When a disk drive module (DDM) fails in a RAID 10 array, the controller starts an operation to reconstruct the data from the failed drive onto one of the hot spare drives. The spare that is used will be chosen based on a smart algorithm that looks at the location of the spares and the size and location of the failed DDM. Remember that a RAID 10 array is effectively a RAID 0 array that is mirrored. Thus, when a drive fails in one of the RAID 0 arrays, we can rebuild the failed drive by reading the data from the equivalent drive in the other RAID 0 array. While this data reconstruction occurs, the DA can still service read and write requests to the array from the hosts. There might be degradation in performance while the sparing operation is in progress, because DA and switched network resources are used to do the reconstruction. Due to the switch-based architecture of the DS8000, this effect is minimal. Read requests for data on the failed drive are not affected, because they can all be directed to the good RAID 1 array. Write operations are not affected. Performance of the RAID 10 array returns to normal when the data reconstruction onto the spare device completes. The time taken for sparing can vary, depending on the size of the failed DDM and the workload on the array and the DA. RAID 10 sparing completion time is a little faster than a RAID 5 rebuild, because rebuilding a RAID 5 6+P configuration requires six reads plus one parity operation for each write, whereas a RAID 10 3+3 configuration requires one read and one write (essentially a direct copy).

4.1.4 Spare creation


When the array sites are created on a DS8000, the DS8000 microcode determines which sites will contain spares. The first four array sites normally each contribute one spare to the DA pair, with two spares being placed on each loop. In general, each device adapter pair thus has access to four spares.
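To check how the spares are currently distributed in a specific configuration, the DS CLI can be used. The following line is a minimal sketch only; the storage image ID is a placeholder, and the exact output columns depend on the code level, so verify against the DS CLI reference:
# List all DDMs of the storage image; the output includes a state/usage
# indication that identifies which DDMs currently act as spares
lsddm IBM.2107-7512345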

Floating spares
The DS8000 implements a smart floating technique for spare DDMs. The DS8000 microcode might choose to allow the hot spare to remain where it has been moved, but it can instead choose to migrate the spare to a more optimum position. This move is done to better balance the spares across the DA pairs, the loops, and the enclosures. It might be preferable that a DDM that is currently in use as an array member is converted to a spare. In this case, the data on that DDM will be migrated in the background onto an existing spare. This process does not fail the disk that is being migrated, though it does reduce the number of available spares in the DS8000 until the migration process is complete. A smart process is used to ensure that the larger or higher rpm DDMs always act as spares. This design is preferable, because if we rebuild the contents of a 146 GB DDM onto a 300 GB DDM, approximately half of the 300 GB DDM will be wasted, because that space is not needed. The problem here is that the failed 146 GB DDM will be replaced with a new 146 GB DDM. So, the DS8000 microcode will most likely migrate the data back onto the recently replaced 146 GB DDM. When this process completes, the 146 GB DDM will rejoin the array and the 300 GB DDM will become the spare again. Another example is if a 73 GB 15K rpm DDM fails and its contents are rebuilt onto a 146 GB 10K rpm DDM. The data has now moved to a slower DDM, but the replacement DDM will be the same as the failed DDM. The array will have a mix of rpms, which is not desirable. Again, a smart migration of the data will be performed when suitable spares become available.


4.2 The abstraction layers for logical configuration


This section describes the terminology and the necessary steps to configure logical volumes that can be accessed from attached hosts.

4.2.1 Array sites


An array site is a group of eight DDMs. The DDMs that make up an array site are predetermined by the DS8000, but note that there is no predetermined processor complex affinity for array sites. The DDMs that are selected for an array site are chosen from two disk enclosures on different loops; refer to Figure 4-1. The DDMs in the array site are of the same DDM type, which means they are the same capacity and the same speed (rpm).

Figure 4-1 Array site

As you can see from Figure 4-1, array sites span loops. Four DDMs are taken from loop 1 and another four DDMs from loop 2.

4.2.2 Arrays
An array is created from one array site. Forming an array means defining it as a specific RAID type. The supported RAID types are RAID 5, RAID 6, and RAID 10 (refer to 4.1, RAID levels and spares on page 42). For each array site, you can select a RAID type. The process of selecting the RAID type for an array is also called defining an array. Note: In the DS8000 implementation, one array is defined using one array site. Figure 4-2 on page 46 shows the creation of a RAID 5 array with one spare, which is also called a 6+P+S array (capacity of 6 DDMs for data, capacity of one DDM for parity, and a spare drive). According to the RAID 5 rules, parity is distributed across all seven drives in this example.


On the right side in Figure 4-2, the terms D1, D2, D3, and so on stand for the set of data contained on one disk within a stripe on the array. If, for example, 1 GB of data is written, it is distributed across all the disks of the array.

Figure 4-2 Creation of an array

So, an array is formed using one array site, and while the array can be accessed by each adapter of the device adapter pair, it is managed by one device adapter. You define which adapter and which server manage this array later in the configuration process.
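In practice, arrays are defined from array sites with the DS CLI (or the DS Storage Manager GUI). The following lines are a hedged sketch: the array site IDs are examples, and the resulting array IDs (A0, A1, and so on) are assigned by the DS8000:
# Display the array sites, their DA pair, and their state
lsarraysite -l
# Define a RAID 5 array from array site S1
mkarray -raidtype 5 -arsite S1
# RAID 6 and RAID 10 arrays are defined the same way
mkarray -raidtype 6 -arsite S2
mkarray -raidtype 10 -arsite S3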

4.2.3 Ranks
In the DS8000 virtualization hierarchy, there is another logical construct, a rank. When defining a new rank, its name is chosen by the DS Storage Manager, for example, R1, R2, R3, and so on. You have to add an array to a rank.
Note: In the DS8000 implementation, a rank is built using just one array.
The available space on each rank will be divided into extents. The extents are the building blocks of the logical volumes. An extent is striped across all disks of an array as shown in Figure 4-3 on page 47 and indicated by the small squares in Figure 4-4 on page 48. The process of forming a rank performs two jobs:
- The array is formatted for either fixed block (FB) type data (Open Systems) or count key data (CKD) (System z). This formatting determines the size of the set of data contained on one disk within a stripe on the array.
- The capacity of the array is subdivided into equal-sized partitions, which are called extents. The extent size depends on the extent type: FB or CKD. An FB rank has an extent size of 1 GB (where 1 GB equals 2^30 bytes).


Figure 4-3 shows an example of an array that is formatted for FB data with 1 GB extents (the squares in the rank just indicate that the extent is composed of several blocks from different DDMs).

Figure 4-3 Forming an FB rank with 1 GB extents
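The corresponding DS CLI step is sketched below; the array IDs are examples, and the rank IDs (R0, R1, and so on) are assigned by the DS8000:
# Form a fixed block rank from array A0
mkrank -array A0 -stgtype fb
# Form a CKD rank from array A1
mkrank -array A1 -stgtype ckd
# Display the ranks, their state, and their extent counts
lsrank -l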

4.2.4 Extent pools


An extent pool is a logical construct to aggregate the extents from a set of ranks to form a domain for extent allocation to a logical volume. Typically, the set of ranks in the extent pool have the same RAID type and the same disk rpm characteristics so that the extents in the extent pool have homogeneous characteristics. Best practice: Do not mix ranks with different RAID types or disk rpms in the same extent pool. There is no predefined affinity of ranks or arrays to a storage server. The affinity of the rank (and its associated array) to a given server is determined at the point that the rank is assigned to an extent pool. One or more ranks with the same extent type (FB or CKD) can be assigned to an extent pool. One rank can be assigned to only one extent pool. There can be as many extent pools as there are ranks.

Storage Pool Striping was made available with Licensed Machine Code 5.3.xx.xx and allows
you to create logical volumes striped across multiple ranks, which typically enhances performance. To benefit from Storage Pool Striping (see Storage Pool Striping extent rotation on page 51), more than one rank in an extent pool is required.


Storage Pool Striping can significantly enhance performance. However, when you lose one rank, not only is the data of this rank lost, but also, all of the data in this extent pool is lost, because data is striped across all ranks. Therefore, you must keep the number of ranks in an extent pool in the range of four to eight. The minimum number of extent pools is two, with one extent pool assigned to server 0 and the other extent pool assigned to server 1 so that both servers are active. In an environment where both FB type data and CKD type data are to go onto the DS8000 storage server, four extent pools will provide one FB pool for each server and one CKD pool for each server, to balance the capacity between the two servers. Figure 4-4 is an example of a mixed environment with CKD and FB extent pools. Additional extent pools might also be desirable to segregate ranks with different DDM types. Extent pools are expanded by adding more ranks to the pool. Ranks are organized in two rank groups; rank group 0 is controlled by server 0 and rank group 1 is controlled by server 1.

Figure 4-4 Extent pools
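A minimal DS CLI sketch for creating extent pools and assigning ranks to them follows. The pool names, rank IDs, and rank group assignments are illustrative only and should reflect your own balancing plan between server 0 and server 1:
# Create one FB extent pool per rank group so that both servers are used
# (the pool IDs P0, P1, and so on are assigned by the DS8000)
mkextpool -rankgrp 0 -stgtype fb FB_pool_0
mkextpool -rankgrp 1 -stgtype fb FB_pool_1
# Assign ranks to the pools, which also sets the server affinity of the ranks
chrank -extpool P0 R0
chrank -extpool P1 R1
# Verify the pools and their available capacity
lsextpool -l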

4.2.5 Logical volumes


A logical volume is composed of a set of extents from one extent pool. On a DS8000, up to 65280 (we use the abbreviation 64K in this discussion, even though it is actually 65536 - 256, which is not quite 64K in binary) volumes can be created (64K CKD volumes, 64K FB volumes, or a mixture of both types of volumes, but the sum cannot exceed 64K).

Fixed Block LUNs


A logical volume composed of fixed block extents is called a logical unit number (LUN). A fixed block LUN is composed of one or more 1 GB (2^30 bytes) extents from one FB extent pool. A LUN cannot span multiple extent pools, but a LUN can have extents from different ranks within the same extent pool. You can construct LUNs up to a size of 2 TB (2^41 bytes).
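As a hedged DS CLI sketch (the extent pool, capacity, names, and volume IDs are examples, and the -eam parameter requires a code level that supports Storage Pool Striping), FB LUNs might be created as follows:
# Create four 100 GB LUNs (volume IDs 1000-1003) in extent pool P0,
# striping their extents across the ranks of the pool
mkfbvol -extpool P0 -cap 100 -name prod_#h -eam rotateexts 1000-1003
# Display the attributes of one of the volumes, including its
# extent allocation method
showfbvol 1000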

CKD volumes
A System z CKD volume is composed of one or more extents from one CKD extent pool. CKD extents are the size of 3390 Model 1, which has 1113 cylinders. However, when you define a System z CKD volume, you do not specify the number of 3390 Model 1 extents but the number of cylinders that you want for the volume. Prior to Licensed Machine Code 5.4.0.xx.xx, the maximum size for a CKD volume was 65520 cylinders. Now, you can define CKD volumes with up to 262668 cylinders, which is about 223 GB. This new volume capacity is called Extended Address Volume (EAV) and the device type is 3390 Model A. For more information about EAV volumes, refer to DS8000 Series: Architecture and Implementation, SG24-6786. Important: EAV volumes can only be exploited by z/OS Version 1.10 or later. If the number of specified cylinders is not an exact multiple of 1113 cylinders, part of the space in the last allocated extent is wasted. For example, if you define 1114 or 3340 cylinders, 1112 cylinders are wasted. For maximum storage efficiency, consider allocating volumes that are exact multiples of 1113 cylinders. In fact, consider multiples of 3339 cylinders for future compatibility. A CKD volume cannot span multiple extent pools, but a volume can have extents from different ranks in the same extent pool or you can stripe a volume across the ranks (see Storage Pool Striping extent rotation on page 51). The allocation process for FB volumes is similar, and it is shown in Figure 4-5.

Figure 4-5 Creation of an FB LUN
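For CKD volumes, the DS CLI equivalent is mkckdvol, with the capacity specified in cylinders. The following sketch assumes a CKD extent pool P2 in rank group 1 and LCU X'01'; treat the IDs and parameters as examples to verify against the DS CLI reference:
# The LCU for LSS X'01' must exist before base volumes can be created
# in it (see the sketch in 4.2.8)
# Create four 3390 base volumes of 3339 cylinders (an exact multiple
# of 1113 cylinders) with volume IDs 0100-0103
mkckdvol -extpool P2 -cap 3339 -name zprod_#h -eam rotateexts 0100-0103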

4.2.6 Space Efficient volumes


When a normal LUN or CKD volume is created, it occupies the defined capacity on the physical drives. Starting with Licensed Machine Code 5.3.xx.xx, you can define another type of LUN or volume, which is called Space Efficient volumes.

A Space Efficient volume does not occupy physical capacity when it is created. Space gets allocated when data is actually written to the volume. The amount of space that gets physically allocated is a function of the amount of data changes that are performed on the volume. The sum of all defined Space Efficient volumes can be larger than the physical capacity available. This function is also called over provisioning or thin provisioning. Note: In the current implementation (Licensed Machine Code 5.4.1.xx.xx), Space Efficient volumes are supported as FlashCopy target volumes only.
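Space Efficient volumes require a repository in the extent pool before they can be defined. The following DS CLI lines are a sketch only; the command parameters and capacities are assumptions to verify against the DS CLI reference for your code level:
# Create the Space Efficient repository in extent pool P0: 100 GB of
# physical (repository) capacity backing 500 GB of virtual capacity
mksestg -extpool P0 -repcap 100 -vircap 500
# Create a Track Space Efficient volume to be used as a FlashCopy target
mkfbvol -extpool P0 -cap 100 -sam tse -name setgt_#h 1006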

4.2.7 Allocation, deletion, and modification of LUNs and CKD volumes


All extents of the ranks assigned to an extent pool are independently available for allocation to logical volumes. The extents for a LUN or volume are logically ordered, but they do not have to come from one rank and the extents do not have to be contiguous on a rank. This construction method of using fixed extents to form a logical volume in the DS8000 allows flexibility in the management of the logical volumes. Because the extents are cleaned after you have deleted a LUN or CKD volume, it can take time until these extents are available for reallocation. The reformatting of the extents is a background process. Figure 4-6 shows one rank in the extpool. When a LUN is first created out of the extpool, the extents are grouped together sequentially.

Figure 4-6 One rank per extent pool relationship

Figure 4-6 shows how the extents are grouped the first time that the LUNs are created in a one to one (1:1) ratio between ranks and extent pools. In this example, we show 40 extents used in a sequential pattern to create the first LUN. It is important to note that if one or more LUNs are deleted, new LUNs are created with free extents as illustrated in Figure 4-7 on page 51.


In Figure 4-7, we show 13 colored extents to represent the free extents in the Extpool on the left that make up the 13 GB LUN on the right side of the diagram. The extents that are not shaded or colored (red) represent used extents.

Figure 4-7 Existing free extents that can be used to form a LUN after LUN deletions occur

The DS8000 provides two extent allocation algorithms: the rotated volume allocation method and the Storage Pool Striping extent allocation method.

Rotated volume allocation method


Extents can be allocated sequentially. In this case, all extents are taken from the same rank until we have enough extents for the requested volume size or the rank is full, in which case the allocation continues with the next rank in the extent pool. If more than one volume is created in one operation, the allocation for each volume starts in another rank. When allocating several volumes, the system rotates through the ranks. You might want to consider this allocation method when you prefer to manage performance manually. The workload of one volume is going to one rank, which makes the identification of performance bottlenecks easier; however, by putting all of the volume's data onto just one rank, you might introduce a bottleneck.

Storage Pool Striping extent rotation


The second and preferred storage allocation method is Storage Pool Striping. Storage Pool Striping is an option (introduced with Licensed Machine Code 5.3.xx.xx) when a LUN/volume is created. The extents of a volume can be striped across several ranks. It is obvious that you need an extent pool with more than one rank to be able to use this storage allocation method. The DS8000 keeps a sequence of ranks. The first rank in the list is randomly picked each time that the storage subsystem powers on. The DS8000 keeps track of the rank in which the last allocation started. The allocation of a first extent for the next volume starts from the next rank in that sequence. The next extent for that volume is taken from the next rank in sequence and so on. So, the system rotates the extents across the ranks. Figure 4-8 on page 52 shows an example of how volumes get allocated within the extent pool.


When you create striped volumes and non-striped volumes in an extent pool, a rank can be filled before the other ranks. A full rank is skipped when you create new striped volumes. There is no reorganization function for the extents in an extent pool. If you add one or more ranks to an existing extent pool, the existing extents are not redistributed. Tip: If you have to add capacity to an extent pool, because it is nearly full, it is better to add several ranks at one time instead of just one rank. This method allows new volumes to be striped across the added ranks.

Figure 4-8 illustrates the extent allocation methods with an example: the rank at which the first volume starts is determined at power on (say, R2); a striped volume with two extents is created; the next striped volume (five extents in this example) starts at the next rank (R3) after the rank from which the previous volume was started; a non-striped volume then starts at the next rank (R1), going in a round-robin; finally, another striped volume starts at the next rank (R2), using extents 13 to 15.
Figure 4-8 Extent allocation methods
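The allocation algorithm is selected per volume at creation time through the extent allocation method (-eam) parameter. The following DS CLI sketch contrasts the two methods; the -rank option of showfbvol, used here to display the per-rank extent distribution, is an assumption to verify for your code level:
# Rotated volume allocation: the extents of volume 1004 are taken from
# a single rank until that rank is full
mkfbvol -extpool P0 -cap 100 -eam rotatevols 1004
# Storage Pool Striping: the extents of volume 1005 are rotated across
# all ranks of the extent pool
mkfbvol -extpool P0 -cap 100 -eam rotateexts 1005
# Display how the extents of the striped volume are spread over the ranks
showfbvol -rank 1005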

By using striped volumes, you distribute the I/O load of a LUN/CKD volume across more than just one set of eight disk drives. The ability to distribute a workload to many physical drives can greatly enhance performance for a logical volume. In particular, operating systems that do not have a volume manager that can perform striping will benefit most from this allocation method. However, if you have extent pools with many ranks and all volumes are striped across the ranks and you lose just one rank, for example, because there are two disk drives in the same rank that fail at the same time and it is not a RAID 6 rank, you will lose a significant portion of your data. Therefore, it might be better to have extent pools with only about four to eight ranks. However, if you already perform, for example, Physical Partition striping in AIX, double striping probably will not improve performance any further, which is also true when the DS8000 LUNs are used by a SAN Volume Controller (SVC) that stripes data across LUNs. If you decide to use Storage Pool Striping, it is probably better to use this allocation method for all volumes in the extent pool to keep the ranks equally filled and utilized.


Tip: If you configure a new DS8000, do not mix striped volumes and non-striped volumes in an extent pool.

4.2.8 Logical subsystems (LSS)


A logical subsystem (LSS) is another logical construct. It groups logical volumes into sets of up to 256 logical volumes. You can define up to 255 LSSs. System z users are familiar with a logical control unit (LCU). System z operating systems configure LCUs to create device addresses. There is a one to one relationship between an LCU and a CKD LSS (LSS X'ab' maps to LCU X'ab'). When creating CKD logical volumes and assigning their logical volume numbers, consider whether Parallel Access Volumes (PAVs) are required on the LCU and reserve some of the addresses on the LCU for alias addresses. For more information about PAV, refer to 15.2, Parallel Access Volumes on page 442.
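For CKD, the LCU must exist before base volumes are created in the corresponding LSS, and PAV aliases are then defined against the base volumes. The following DS CLI lines are illustrative only; the SSID and the mkaliasvol parameters in particular are assumptions that must match your HCD/IODF definitions:
# Create LCU X'01' with a subsystem ID (SSID) that matches the I/O definition
mklcu -qty 1 -id 01 -ss 0010
# After the base volumes exist (see 4.2.5), define PAV aliases for them,
# assigning alias addresses from the top of the address range downward
mkaliasvol -base 0100-0103 -order decrement -qty 4 01FF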

4.2.9 Address groups


Address groups are created automatically when the first LSS associated with the address group is created, and deleted automatically when the last LSS in the address group is deleted. LSSs are grouped into address groups of 16 LSSs. Figure 4-9 shows the concept of LSSs and address groups.

Figure 4-9 Logical storage subsystems

LSSs are numbered X'ab' where a is the address group and b denotes an LSS within the address group. So, for example, X'10' to X'1F' are LSSs in address group 1.


All LSSs within one address group have to be of the same type, either CKD or FB. The first LSS defined in an address group fixes the type of that address group. Important: System z users who still want to use ESCON to attach hosts to the DS8000 must be aware that ESCON supports only the 16 LSSs of address group 0 (LSS X'00' to X'0F'). Therefore, this address group must be reserved for ESCON-attached CKD devices in this case and not used as FB LSSs. The LUN identifications X'gabb' are composed of the address group X'g', and the LSS number within the address group X'a', and the position of the LUN within the LSS X'bb'. For example, LUN X'2101' denotes the second (X'01') LUN in LSS X'21' of address group 2.

4.2.10 Volume access


A DS8000 provides mechanisms to control host access to LUNs. In most cases, a server has two or more host bus adapters (HBAs) and the server needs access to a group of LUNs. For easy management of server access to logical volumes, the DS8000 introduced the concept of host attachments and volume groups.

Host attachment
Host bus adapters (HBAs) are identified to the DS8000 in a host attachment construct that specifies the HBAs worldwide port names (WWPNs). A set of host ports can be associated through a port group attribute that allows a set of HBAs to be managed collectively. This port group is referred to as host attachment within the GUI. Each host attachment can be associated with a volume group to define which LUNs that HBA is allowed to access. Multiple host attachments can share the same volume group. The host attachment can also specify a port mask that controls which DS8000 I/O ports the HBA is allowed to log in to. Whichever ports the HBA logs in to, it sees the same volume group that is defined in the host attachment associated with this HBA. The maximum number of host attachments on a DS8000 is 8192.

Volume group
A volume group is a named construct that defines a set of logical volumes. When used in conjunction with CKD hosts, there is a default volume group that contains all CKD volumes and any CKD host that logs in to a FICON I/O port has access to the volumes in this volume group. CKD logical volumes are automatically added to this volume group when they are created and automatically removed from this volume group when they are deleted. When used in conjunction with Open Systems hosts, a host attachment object that identifies the HBA is linked to a specific volume group. You must define the volume group by indicating which fixed block logical volumes are to be placed in the volume group. Logical volumes can be added to or removed from any volume group dynamically.
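For an Open Systems host, the volume group and host attachment definitions might look like the following DS CLI sketch. The WWPN, host type, volume group ID, and I/O port IDs are placeholders; the -ioport parameter is the optional port mask described above:
# Put LUNs 1000-1003 into a volume group using SCSI mask mapping
mkvolgrp -type scsimask -volume 1000-1003 AIX_prod_vg
# Define one host attachment per HBA and link it to the volume group
# (assuming the volume group was created as V0); optionally restrict
# which DS8000 I/O ports the HBA can log in to
mkhostconnect -wwname 10000000C9A1B2C3 -hosttype pSeries -volgrp V0 -ioport I0000,I0130 aix_prod_fcs0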

4.2.11 Summary of the logical configuration hierarchy


Going through the virtualization hierarchy, we started with just a bunch of disks that were grouped in array sites. An array site was transformed into an array, eventually with spare disks. The array was further transformed into a rank with extents formatted for FB data or CKD. Next, the extents were added to an extent pool that determined which storage server served the ranks and aggregated the extents of all ranks in the extent pool for subsequent allocation to one or more logical volumes. Within the extent pool, we can reserve storage for Space Efficient volumes.

Next, we created logical volumes within the extent pools (optionally striping the volumes), assigning them a logical volume number that determined to which logical subsystem they will be associated and which server will manage them. Space Efficient volumes can be created within the repository of the extent pool. Then, the LUNs can be assigned to one or more volume groups. Finally, the HBAs were configured into a host attachment that is associated with a volume group. This virtualization concept provides for greater flexibility. Logical volumes can dynamically be created, deleted, and resized. They can be grouped logically to simplify storage management. Large LUNs and CKD volumes reduce the total number of volumes, which also contributes to a reduction of the management effort. Figure 4-10 summarizes the virtualization hierarchy.
Figure 4-10 Virtualization hierarchy

4.3 Understanding the array to LUN relationship


Knowing how LUNs are formed out of the arrays in the extent pool is an important concept for tuning performance and throughput from an extent pool perspective. This information will help you group the arrays in the extent pools appropriately, avoiding the performance impact otherwise encountered by placing arrays with unlike stripe widths and characteristics together when you place ranks into the extent pools. Last, this information will help you understand the concept of extent rotation when it comes to creating LUNs from multiple ranks within an extent pool.

1 GB FB

1 GB FB


4.3.1 How extents are formed together to make DS8000 LUNs


Note: It is important to understand that the greatest amount of isolation that you can achieve within the DS8000 is at the array/rank level, not the physical drive level. No matter how you slice the disks or no matter what virtualization engine you use, such as SVC, you will only be able to isolate I/O at the array/rank level. The next sections describe how LUNs are formed and why LUN formation is a limitation.

How extents are formed from a RAID 5 6+P+S array


Figure 4-11 shows a DS8000 array of eight DDMs as though you were looking down onto seven disks that have been carved into five logical extents. The eighth disk is the hot spare in this array. The first logical extent, which is called logical-extent1, is formed from sections (numbered 1) along the outer edge of each of the DDMs making up the array. Subsequent logical extents are formed from areas on the DDMs increasingly toward the center and inner edge. A single 1 GB extent does not reside on just one physical disk, but instead, it is made up of strips from the seven physical disks that make up the array shown in this example in Figure 4-11 (the eighth DDM is the spare).

Figure 4-11 How logical extents are formed from the DS8000 in a 6+P+S type array format

Logical-extent3 is created from the middle sections of the DDMs. In this example, we only show that five extents equal the capacity of an entire DS8000 array. This example is for illustration purposes only, and in reality, the number of extents equals the capacity of the entire array. For example, a RAID 5 array that consists of 300 GB raw DDMs actually produces 1576 logical extents. Parity is not distributed on one disk, but instead, it is striped across all of the disks in the array. It is important to keep in mind that RAID 5 arrays consist of one disk's worth of parity, which is striped throughout the array. In Figure 4-12 on page 57, we have a 6+P+S (one spare). One of the chunks in the array is parity, which is striped throughout the seven disks.


Figure 4-12 The parity stripe in a 6+P+S RAID configuration

How extents are formed from a RAID 5 7+P array


Figure 4-13 shows a DS8000 array of eight DDMs as though you were looking down onto eight disks that have been carved into five logical extents.

Figure 4-13 How logical extents are formed from the DS8000 in a 7+P type array format

For fine-tuning of performance intensive workloads, you might realize that a 7+P RAID 5 array performs better than a 6+P array, and a 4x4 RAID 10, as shown in Figure 4-20 on page 60, performs better than a 3x3, as shown in Figure 4-19 on page 60. For random I/O, you might see up to 15% greater throughput on a 7+P and a 4x4 array than on a 6+P and a 3x3. For sequential applications, the differences are minimal. As a general rule though, try to balance workload activity evenly across RAID arrays, regardless of the size. It is not worth the management effort to do otherwise. It is important to remember that RAID 5 arrays consist of one disk's worth of parity, which is striped throughout the array. In Figure 4-14 on page 58, we show a 7+P (no spare). One of the chunks (shaded) in the array is parity, which is striped throughout the eight disks.


Figure 4-14 The parity stripe in a 7+P RAID configuration

How extents are formed from a RAID 6 5+P+Q+S array


There is one less drive's worth of capacity in this RAID 6 configuration than there is in the RAID 5 6+P+S configuration. This configuration has only five disks' worth of available capacity. When the extents are formed, they are made up of stripes across seven raw disks as shown in Figure 4-15.

Figure 4-15 How logical extents are formed from the DS8000 in a 5+P+Q+S type array format

Two of the chunks in the array are parity and are striped throughout the seven disks as shown in Figure 4-16.

Figure 4-16 The parity stripe in a 5+P+Q+S RAID configuration


How extents are formed from a RAID 6 6+P+Q array


Figure 4-17 shows a RAID 6 6+P+Q configuration, which means that there are eight disks across which to stripe, but only six disks' worth of available usable space.

Figure 4-17 How logical extents are formed from the DS8000 in a 6+P+Q type array format

Two of the chunks in the array are parity stripes, which are striped throughout the eight disks as shown in Figure 4-18.

Figure 4-18 The parity stripe in a 6+P+Q RAID configuration

How extents are formed from a RAID 10 3X3 array


Figure 4-19 on page 60 shows a RAID 10 3X3+S+S configuration, which means that there are only three disks across which to stripe, and twice that for the mirrored copy. The available usable space is the capacity of only three disks. Figure 4-19 on page 60 shows a DS8000 array of eight DDMs. Although two sections with the number 1, (outer edge), are formed, these extents are mirrored and presented to the extent pool as one usable extent of 1GB. When the LUN is created from the extents in the extent pool from this configuration, the extent is written across twice. Figure 4-19 on page 60 shows two spares. There is no parity stripe in this RAID 10 configuration. Note: The spares in the mirrored RAID 10 configuration act independently; they are not mirrored spares.


Figure 4-19 How logical extents are formed from the DS8000 in a 3X2+S+S mirror array format

How extents are formed from a RAID 10 4X2 array


Figure 4-20 shows a RAID 10 4X2 configuration with no spares, which means that there are only four disks across which to stripe, and twice that for the mirrored copy. The available usable space is the capacity of only four disks.

Figure 4-20 How logical extents are formed from the DS8000 in a 4X2 mirror type array format

It is important to note that the stripe widths differ in size from the stripes in a 3X2+S+S RAID array configuration. Note: Due to the different stripe widths that make up the extent from each type of RAID array, it is important not to intermix the RAID array types within the same extent pool.


4.3.2 Understanding data I/O placement on ranks and extent pools


It might be easier to understand how workloads share disk arrays by referring to Figure 4-21. A database workload or application can consist of two of the LUNs (logical volumes 1 and 3) in the extent pool, and another application can consist of the other two logical volumes (logical volumes 5 and 7) on the ranks in the extpool that is shown in Figure 4-21. The volumes in the extent pool can even be assigned to two different servers. Because the disks share the same physical heads and spindles, I/O contention can result if the two applications' workloads peak at the same time.

Figure 4-21 LUNs sharing the same array/rank in one extpool

In Figure 4-21, we show that the rotate extents function was used to create LUNs by spreading the extents across ranks in the extent pool. LUN 1 is assigned to host A and LUN 5 is assigned to host B. These LUNs share the same heads and spindles in the array/rank and are not isolated. To obtain true disk capacity isolation at the rank level, all the LUNs in the array must be assigned to one database or workload. Workloads must be strategically placed and distributed as evenly as possible for proper sharing or isolation. When the rotate extents function is not used, then each rank operates independently in the extent pool. LUN isolation to one rank is more achievable but can still spread across multiple ranks when space on one rank is depleted. To achieve even more isolation, we recommend that you place fewer ranks or even just one rank in an extent pool.


Chapter 5. Logical configuration performance considerations


Important: Before reading this chapter, familiarize yourself with the material covered in Chapter 4, Logical configuration concepts and terminology on page 41.
This chapter introduces a step-by-step approach to configuring the DS8000 with regard to workload and performance considerations. The chapter includes:
- Understand the basic configuration principles for optimal performance: workload isolation, workload resource-sharing, and workload spreading
- Analyze workload characteristics to determine isolation or resource-sharing
- Plan allocation of DS8000 disk and host connection capacity to identified workloads
- Plan spreading volumes and host connections for the identified workloads
- Plan array sites
- Plan RAID arrays and ranks and RAID-level performance considerations
- Plan extent pools with single-rank and multi-rank extent pool considerations
- Plan address groups, logical subsystems (LSSs), volume IDs, and count key data (CKD) Parallel Access Volumes (PAVs)
- Plan I/O port IDs, host attachments, and volume groups
- Implement and document the DS8000 logical configuration


5.1 Basic configuration principles for optimal performance


There are three major principles for achieving a logical configuration on a DS8000 subsystem for optimal performance:
- Workload isolation
- Workload resource-sharing
- Workload spreading

5.1.1 Workload isolation


Workload isolation can mean providing a high priority workload with dedicated DS8000
hardware resources to reduce the impact of less important workloads. Workload isolation can also mean limiting a lower priority workload to a subset of DS8000 hardware resources so that it will not impact more important workloads by fully utilizing all hardware resources. Isolation provides guaranteed availability of the hardware resources that are dedicated to the isolated workload. It removes contention with other applications for those resources. However, isolation limits the isolated workload to a subset of the total DS8000 hardware so that its maximum potential performance might be reduced. Unless an application has an entire DS8000 Storage Image dedicated to its use, there is potential for contention with other applications for any hardware (such as cache and processor resources), which are not dedicated. However, typically, isolation is implemented to improve the performance of all workloads by separating different workload types. One recommended approach to isolation is to identify lower priority workloads with heavy I/O demands and to separate them from all of the more important workloads. You might be able to isolate multiple lower priority workloads with heavy I/O demands to a single set of hardware resources and still meet their lower service-level requirements, particularly if their peak I/O demands are at different times. Important: For convenience, this chapter sometimes discusses isolation in terms of a single isolated workload in contrast to multiple resource-sharing workloads, but the approach also applies to multiple isolated workloads.

DS8000 disk capacity isolation


The level of disk capacity isolation required for a workload will depend on the scale of its I/O demands as compared to DS8000 array and device adapter capabilities, as well as organizational considerations, such as the importance of the workload and application administrator requests for workload isolation. You can subset DS8000 disk capacity for isolation at several levels:
- Rank level: Certain ranks are dedicated to a workload. That is, volumes for only one workload will be allocated on these ranks. The ranks can be a different disk type (capacity or speed), a different RAID array type (RAID 5, RAID 6, or RAID 10, arrays with spares or arrays without spares), or a different storage type (CKD or fixed block (FB)) than the disk types, RAID array types, or storage types that are used by other workloads. Workloads requiring different storage types will dictate rank, extent pool, and address group isolation. You need to consider workloads with heavy random activity for rank isolation.
- DA level: All ranks on one or more device adapter (DA) pairs are dedicated to a workload. That is, volumes for only one workload will be allocated on the ranks that are associated with one or more DAs. These ranks can be a different disk type (capacity or speed), RAID array type (RAID 5, RAID 6, or RAID 10, arrays with spares or arrays without spares), or storage type (CKD or FB) than the disk types, RAID types, or storage types that are used by other workloads. You must consider workloads with heavy, large blocksize and sequential activity for DA-level isolation, because these workloads tend to consume all of the DA resources that are available to them.
- Processor complex level: All ranks assigned to extent pools managed by processor complex 0 or all ranks assigned to extent pools managed by processor complex 1 are dedicated to a workload. We typically do not recommend this approach, because it can reduce the processor and cache resources available to the workload by 50%.
- Storage Image level: This level applies to the 2107 Models 9A2 or 9B2 only. All ranks owned by one Storage Image (Storage Image 1 or Storage Image 2) are dedicated to a workload. That is, an entire logical DS8000 is dedicated to a workload.
- Storage unit level: All ranks in a physical DS8000 are dedicated to a workload. That is, the physical DS8000 runs only one workload.

DS8000 host connection isolation


The level of host connection isolation required for a workload will depend on the scale of its I/O demands as compared to DS8000 I/O port and host adapter capabilities, as well as organizational considerations, such as the importance of the workload and administrator requests for workload isolation. DS8000 host connection subsetting for isolation can also be done at several levels:
- I/O port level: Certain DS8000 I/O ports are dedicated to a workload. This subsetting is quite common. Workloads requiring FICON and FCP protocols must be isolated at the I/O port level, because each I/O port on the 4-port FCP/FICON-capable host adapter card can be configured to support only one of these protocols. Although Open Systems host servers and remote mirroring links use the same protocol (FCP), they are typically isolated to different I/O ports. You must also consider workloads with heavy large block sequential activity for I/O port isolation, because they tend to consume all of the I/O port resources that are available to them.
- Host adapter level: Certain host adapters are dedicated to a workload. You must isolate workloads requiring ESCON access from FICON or FCP workloads at the host adapter level, because they require a unique 2-port host adapter card. FICON and FCP workloads do not necessarily require host adapter isolation, because separate I/O ports on the same 4-port FCP/FICON-capable host adapter card can be configured to support each protocol (FICON or FCP). However, host connection requirements might dictate a unique type of host adapter card (Long wave (LW) or Short wave (SW)) for a workload. Workloads with heavy large block sequential activity must be considered for host adapter isolation, because they tend to consume all of the I/O port resources that are available to them.
- I/O enclosure level: Certain I/O enclosures are dedicated to a workload. This approach is not generally necessary.

5.1.2 Workload resource-sharing


Workload resource-sharing means multiple workloads use a common set of DS8000 hardware resources, such as:
- Ranks
- Device adapters
- I/O ports
- Host adapters


Multiple resource-sharing workloads can have logical volumes on the same ranks and can access the same DS8000 host adapters or even I/O ports. Resource-sharing allows a workload to access more DS8000 hardware than can be dedicated to the workload, providing greater potential performance, but this hardware sharing can result in resource contention between applications that impacts performance at times. It is important to allow resource-sharing only for workloads that will not consume all of the DS8000 hardware resources that are available to them. It might be easier to understand the resource-sharing principle for workloads on disk arrays by referring to Figure 5-1. An application workload (for example, a database) can use two logical unit numbers (LUNs), such as LUNs 1 and 3, in an extent pool and another application (for example, another database) can use another two LUNs, LUNs 5 and 7, in the same extent pool. If the extents of these LUNs also share the same ranks (for example, using Storage Pool Striping or the rotate extents volume allocation algorithm as shown in Figure 5-1), I/O contention can easily occur if the two application workloads peak at the same time, because these LUNs physically share the same disk heads and disk spindles in the arrays.

Figure 5-1 Example for resource-sharing workloads with different LUNs sharing the same ranks

5.1.3 Workload spreading


Workload spreading means balancing and distributing workload evenly across all of the DS8000 hardware resources available, including:
- Processor complex 0 and processor complex 1
- Device adapters
- Ranks
- I/O enclosures
- Host adapters

Spreading applies to both isolated workloads and resource-sharing workloads.


You must allocate DS8000 hardware resources to either an isolated workload or multiple resource-sharing workloads in a balanced manner. That is, you must allocate either an isolated workload or resource-sharing workloads to DS8000 ranks that are assigned to device adapters (DAs) and both processor complexes in a balanced manner. You must allocate either type of workload to I/O ports that are spread across host adapters and I/O enclosures in a balanced manner.

You must distribute volumes and host connections for either an isolated workload or a resource-sharing workload in a balanced manner across all DS8000 hardware resources that have been allocated to that workload. You must create volumes as evenly as possible across all ranks and DAs allocated to those workloads. You can then use host-level striping (Open Systems Logical Volume Manager (LVM) striping or z/OS storage groups) across all of the volumes belonging to either type of workload. You can obtain more information about host-level striping in the appropriate chapters for the various operating systems and platforms in this book.

One exception to the recommendation of spreading volumes is when specific files or datasets will never be accessed simultaneously, such as multiple log files for the same application where only one log file will be in use at a time. In that case, you can optimize the overall workload performance by placing all volumes required by these datasets or files on a single DS8000 rank.

You must also configure host connections as evenly as possible across the I/O ports, host adapters, and I/O enclosures available to either an isolated or a resource-sharing workload. Then, you can use host server multipathing software to optimize performance over multiple host connections. For more information about multipathing software, refer to Chapter 9, Host attachment on page 265.

5.1.4 Using workload isolation, resource-sharing, and spreading


A recommended approach to optimizing performance on the DS8000 is to begin by identifying any workload that has the potential to negatively impact the performance of other workloads by fully utilizing all of the DS8000 I/O ports and DS8000 ranks available to it. Additionally, you must identify any workload that is so critical that its performance can never be allowed to be negatively impacted by other workloads. Then, identify the remaining workloads that are considered appropriate for resource-sharing.

Next, define a balanced set of hardware resources that can be dedicated to any isolated workloads. Then, allocate the remaining DS8000 hardware for sharing among the resource-sharing workloads.

The next step is planning extent pools and assigning volumes and host connections to all workloads in a way that is balanced and spread, either across all dedicated resources (for any isolated workload) or across all shared resources (for the multiple resource-sharing workloads). For spreading workloads evenly across a set of ranks, consider multi-rank extent pools using Storage Pool Striping, which we introduce in 5.7.5, Planning for multi-rank extent pools on page 106. The final step is the implementation of host-level striping and multipathing software if desired.


5.2 Analyzing application workload characteristics


The first and most important step in creating a successful logical configuration for the DS8000 is analyzing the workload characteristics for the applications that will access the DS8000, so that DS8000 hardware resources, such as RAID arrays and I/O ports, can be properly allocated to workloads with regard to isolation and resource-sharing considerations. You need to perform this workload analysis during the DS8000 capacity planning process and you need to complete it prior to ordering the DS8000 hardware.

5.2.1 Determining isolation requirements


The objective of this analysis is to identify workloads that require isolated (dedicated) DS8000 hardware resources, because this determination will ultimately affect the total amount of disk capacity required and the total number of disk drive types required, as well as the number and type of host adapters required. The result of this first analysis indicates which workloads require isolation and the level of isolation that is required.

You must also consider organizational and business considerations in determining which workloads to isolate. Workload priority (the importance of a workload to the business) is a key consideration. Application administrators typically request dedicated resources for high priority workloads. For example, certain database online transaction processing (OLTP) workloads might require dedicated resources in order to guarantee service levels. The most important consideration is preventing lower priority workloads with heavy I/O requirements from impacting higher priority workloads. Lower priority workloads with heavy random activity need to be evaluated for rank isolation, and lower priority workloads with heavy, large blocksize, sequential activity must be evaluated for DA and I/O port isolation.

Workloads that require different disk drive types (capacity and speed), different RAID types (RAID 5, RAID 6, or RAID 10), or different storage types (CKD or FB) will dictate isolation to different DS8000 arrays/ranks. For more information about the performance implications of various RAID types, refer to 5.6.1, RAID-level performance considerations on page 81. Workloads that use different I/O protocols (FCP or FICON) will dictate isolation to different I/O ports. However, even workloads that use the same disk drive types, RAID type, storage type, and I/O protocol need to be evaluated for separation or isolation requirements. Workloads with very heavy, continuous I/O access patterns must be considered for isolation to prevent them from consuming all available DS8000 hardware resources and impacting the performance of other types of workloads. Workloads with large blocksize and sequential activity must be considered for separation from those workloads with small blocksize and random activity.

Isolation of only a few workloads that are known to have high I/O demands can allow all the remaining workloads (even the high priority workloads) to share hardware resources and achieve acceptable levels of performance. More than one workload with high I/O demands might be able to share the same isolated DS8000 resources, depending on the service level requirements and the times of peak activity.

Examples of I/O workloads, files, or datasets that might have heavy and continuous I/O access patterns are:
- Sequential workloads (especially those workloads with large blocksize transfers)
- Log files or datasets
- Sort or work datasets or files
- Business Intelligence and Data Mining
- Disk copies (including Point-in-Time Copy background copies, remote mirroring target volumes, and tape simulation on disk)
- Video/imaging applications
- Engineering/scientific applications
- Certain batch workloads

You must consider workloads for all applications for which DS8000 storage will be allocated, including current workloads that will be migrated from other installed storage subsystems and new workloads that are planned for the DS8000. Also, consider projected growth for both current and new workloads.

For existing applications, consider historical experience first. For example, is there an application where certain datasets or files are known to have heavy, continuous I/O access patterns? Is there a combination of multiple workloads that might result in unacceptable performance if their peak I/O times occur simultaneously? Consider workload importance (workloads of critical importance and workloads of lesser importance).

For existing applications, you can also use performance monitoring tools that are available for the existing storage subsystems and server platforms to understand current application workload characteristics, such as:
- Read/Write ratio
- Random/sequential ratio
- Average transfer size (blocksize)
- Peak workload (I/Os per second for random access and MB per second for sequential access)
- Peak workload periods (time of day and time of month)
- Copy Services requirements (Point-in-Time Copy and Remote Mirroring)
- Host connection utilization and throughput (FCP host connections and FICON and ESCON channels)
- Remote mirroring link utilization and throughput

Estimate the requirements for new application workloads and for current application workload growth. You can obtain information about general workload characteristics in Chapter 3, Understanding your workload on page 29. As new applications are rolled out and current applications grow, you must monitor performance and adjust projections and allocations. You can obtain more information in Chapter 6, Performance management process on page 147 and in Chapter 8, Practical performance management on page 203. You can use the Disk Magic modeling tool to model the current or projected workload and estimate the required DS8000 hardware resources. We introduce Disk Magic in 7.1, Disk Magic on page 162.
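As an illustration of the workload characteristics listed above, the following minimal Python sketch turns basic interval counters, as most host or storage performance monitors report them, into the ratios and peak rates used for this analysis. The counter names and sample values are illustrative assumptions only; they do not correspond to the output format of any specific monitoring tool.

# Minimal sketch: derive workload characteristics from interval counters.
# All field names and sample values are illustrative assumptions.

def workload_profile(reads, writes, read_kb, write_kb,
                     sequential_ios, interval_seconds):
    """Summarize one measurement interval."""
    total_ios = reads + writes
    total_kb = read_kb + write_kb
    return {
        "read_pct": 100.0 * reads / total_ios if total_ios else 0.0,
        "sequential_pct": 100.0 * sequential_ios / total_ios if total_ios else 0.0,
        "avg_transfer_kb": total_kb / total_ios if total_ios else 0.0,
        "iops": total_ios / interval_seconds,
        "mb_per_sec": total_kb / 1024.0 / interval_seconds,
    }

# Example: 540,000 reads and 180,000 writes of 4 KB in a 900 second peak interval
print(workload_profile(540000, 180000, 2160000, 720000, 72000, 900))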


5.2.2 Reviewing remaining workloads for feasibility of resource-sharing


After workloads with the highest priority or the highest I/O demands have been identified for isolation, the I/O characteristics of the remaining workloads must be reviewed to determine whether a single group of resource-sharing workloads is appropriate, or whether it makes sense to split the remaining applications into multiple resource-sharing groups. The result of this step is the addition of one or more groups of resource-sharing workloads to the DS8000 configuration plan.

5.3 Planning allocation of disk and host connection capacity


You need to plan the allocation of specific DS8000 hardware first for any isolated workload, and then, for the resource-sharing workloads. Use the workload analysis from 5.2.1, Determining isolation requirements on page 68 again, this time to define the disk capacity and host connection capacity required for the workloads. For any workload, the required disk capacity will be determined by both the amount of space needed for data and the number of arrays (of a specific speed) needed to provide the desired level of performance. The result of this step will be a plan indicating the number of ranks (including disk drive type) and associated DAs and the number of I/O adapters and associated I/O enclosures required for any isolated workload and for any group of resource-sharing workloads.

5.3.1 Planning DS8000 hardware resources for isolated workloads


For DS8000 disk allocation, isolation requirements might dictate the allocation of certain individual ranks or all of the ranks on certain device adapters to one workload. For DS8000 I/O port allocation, isolation requirements might dictate the allocation of certain I/O ports or all of the I/O ports on certain host adapters to one workload. Choose the DS8000 resources that will be dedicated in a balanced manner. If ranks are planned for workloads in multiples of two, half of the ranks can later be assigned to extent pools managed by processor complex 0, and the other ranks can be assigned to extent pools managed by processor complex 1. You must also note the device adapters to be used. If I/O ports are allocated in multiples of four, they can later be spread evenly across all I/O enclosures in a DS8000 frame if four or more host adapter (HA) cards are installed. If I/O ports are allocated in multiples of two, they can later be spread evenly across left and right I/O enclosures.

5.3.2 Planning DS8000 hardware resources for resource-sharing workloads


There might be one or more groups of resource-sharing workloads. By now, you should have identified and assigned the shared set of DS8000 hardware resources that these groups of workloads will use. Review the DS8000 resources that will be shared for balance. If ranks are planned for resource-sharing workloads in multiples of two, half of the ranks can later be assigned to processor complex 0 extent pools, and the other ranks can be assigned to processor complex 1 extent pools. You must also identify the device adapters that you will use. If I/O ports are allocated for resource-sharing workloads in multiples of four, they can later be spread evenly across all I/O enclosures in a DS8000 frame if four or more HA cards are installed. If I/O ports are allocated in multiples of two, they can later be spread evenly across left and right I/O enclosures.

70

DS8000 Performance Monitoring and Tuning

5.4 Planning volume and host connection spreading


After hardware resources have been allocated for both isolated and resource-sharing workloads, plan the volume and host connection spreading for all of the workloads.

Note: In this chapter, we use host connection in a general sense to refer to a connection between a host server (either z/OS or Open Systems) and the DS8000.

The result of this step is a plan indicating:
- The specific number and size of volumes for each isolated workload or group of resource-sharing workloads and how they will be allocated to ranks and DAs
- The specific number of I/O ports for each workload or group of resource-sharing workloads and how they will be allocated to host adapters and I/O enclosures

After the spreading plan is complete, use the DS8000 hardware resources identified in the plan as input to the DS8000 hardware ordering process.

5.4.1 Spreading volumes for isolated and resource-sharing workloads


At this point, consider each workload's requirements for the number and size of logical volumes. For a given amount of required disk capacity from the perspective of the DS8000, there are typically no significant DS8000 performance implications of using more small volumes as compared to fewer large volumes, but using one, or a small number, of standard volume sizes can simplify management.

However, there are host server performance considerations related to the number and size of volumes. For example, for System z servers, the number of Parallel Access Volumes (PAVs) that are needed can vary with volume size. For more information about PAVs, see 15.2, Parallel Access Volumes on page 442. For System i servers, we recommend a volume size that is half the size of the disk drives used. There also can be Open Systems host server or multipathing software considerations related to the number or the size of volumes, so you must consider these factors in addition to workload requirements.

There are significant performance implications with the assignment of logical volumes to ranks and DAs. The goal of the entire logical configuration planning process is to ensure that volumes for each workload are on ranks and device adapters that will allow all workloads to meet performance objectives. Follow these steps for spreading volumes across allocated hardware for each isolated workload, and then for each workload in a group of resource-sharing workloads:
1. Review the required number and the size of the logical volumes that are identified during the workload analysis.
2. Review the number of ranks allocated to the workload (or group of resource-sharing workloads) and the associated DA pairs.
3. Evaluate the use of multi-rank extent pools using Storage Pool Striping for the DS8000 volume allocation algorithm to spread workloads evenly across a set of ranks within an extent pool. Refer to 5.7.5, Planning for multi-rank extent pools on page 106 for more details about Storage Pool Striping.


4. Assign each required logical volume to a different rank or a different set of aggregated ranks (which means an extent pool with multiple ranks using Storage Pool Striping) if possible (a minimal round-robin sketch follows these steps):
   - If the number of volumes required is less than the number of ranks (or sets of aggregated ranks), assign the volumes evenly to ranks or extent pools that are owned by processor complex 0 and ranks or extent pools that are owned by processor complex 1, on as many DA pairs as possible.
   - If the number of volumes required is greater than the number of ranks (or sets of aggregated ranks), assign additional volumes to the ranks and DAs in a balanced manner. Ideally, the workload has the same number of logical volumes on each of its ranks, on each DA available to it.
5. Then, you can use host-level striping (such as Open Systems Logical Volume Manager striping or z/OS storage groups) across all logical volumes.
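To illustrate step 4, the following minimal Python sketch assigns a list of planned volumes to extent pools in round-robin order, alternating between pools owned by processor complex 0 and processor complex 1 so that each pool receives roughly the same number of volumes. The pool and volume names are illustrative assumptions for planning purposes; this is not a DS8000 function.

# Minimal planning sketch: spread planned volumes across extent pools in
# round-robin order. Assumes an equal number of pools per processor complex,
# matching the recommendation to plan ranks in multiples of two.

from itertools import cycle

def spread_volumes(volumes, pools_complex0, pools_complex1):
    """Return a mapping of volume -> extent pool, balanced across both
    processor complexes and across the pools of each complex."""
    # Interleave pools from both complexes: P0, P1, P2, P3, ...
    interleaved = [p for pair in zip(pools_complex0, pools_complex1) for p in pair]
    assignment = {}
    for volume, pool in zip(volumes, cycle(interleaved)):
        assignment[volume] = pool
    return assignment

volumes = [f"db_vol{i:02d}" for i in range(8)]
print(spread_volumes(volumes, ["P0", "P2"], ["P1", "P3"]))
# -> db_vol00:P0, db_vol01:P1, db_vol02:P2, db_vol03:P3, db_vol04:P0, ...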

5.4.2 Spreading host connections for isolated and resource-sharing workloads


Next, consider the requirements of each workload for the number and type of host connections. In addition to workload requirements, you also might need to consider the host server or multipathing software in relation to the number of host connections. For more information about multipathing software, see Chapter 9, Host attachment on page 265.

There are significant performance implications from the assignment of host connections to I/O ports, host adapters, and I/O enclosures. The goal of the entire logical configuration planning process is to ensure that host connections for each workload access I/O ports and host adapters that will allow all workloads to meet the performance objectives. Follow these steps for spreading host connections across allocated hardware for each isolated workload, and then for each workload in a group of resource-sharing workloads:
1. Review the required number and type (SW, LW, FCP or FICON, or ESCON) of host connections that are identified in the workload analysis. You must use a minimum of two host connections to different DS8000 host adapter cards to ensure availability.
2. Review the host adapters that are allocated to the workload (or group of resource-sharing workloads) and the associated I/O enclosures.
3. Assign each required host connection to a different host adapter in a different I/O enclosure if possible, balancing across the left and right I/O enclosures (a sketch of this selection logic follows these steps):
   - If the required number of host connections is less than the available number of I/O enclosures (which can be typical for certain Open Systems servers), an equal number of host connections must be assigned to the left I/O enclosures (0, 2, 4, and 6) and the right I/O enclosures (1, 3, 5, and 7) if possible. Within an I/O enclosure, assign each required host connection to the host adapter of the required type (SW FCP/FICON-capable, LW FCP/FICON-capable, or ESCON) with the greatest number of unused ports. When host adapters have an equal number of unused ports, assign the host connection to the adapter that has the least number of connections for this workload.
   - If the number of required host connections is greater than the number of I/O enclosures, assign the additional connections to different host adapters with the greatest number of unused ports within the I/O enclosures. When host adapters have an equal number of unused ports, assign the host connection to the adapter that has the least number of connections for this workload.
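The adapter selection rule in step 3 is essentially a greedy choice: prefer the adapter with the most unused ports, and break ties with the fewest connections already assigned for this workload. The short Python sketch below expresses that rule; the adapter records and identifiers are illustrative assumptions, not DS8000 output.

# Minimal sketch of the selection rule in step 3. Each candidate host
# adapter is described by a small record; values are illustrative.

def pick_adapter(adapters, workload):
    """Pick the adapter with the most unused ports; on a tie, the one
    with the fewest connections already assigned to this workload."""
    return max(
        adapters,
        key=lambda a: (a["unused_ports"], -a["workload_connections"].get(workload, 0)),
    )

adapters = [
    {"id": "HA-enclosure0-slot1", "unused_ports": 2, "workload_connections": {"db1": 1}},
    {"id": "HA-enclosure1-slot4", "unused_ports": 3, "workload_connections": {}},
    {"id": "HA-enclosure3-slot1", "unused_ports": 3, "workload_connections": {"db1": 2}},
]
print(pick_adapter(adapters, "db1")["id"])   # -> HA-enclosure1-slot4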


5.5 Planning array sites


This step takes place after the DS8000 hardware has been installed. The result of this step is a mapping of specific array site numbers (with associated disk drive type and DA pair number) to planned ranks.

During DS8000 installation, array sites are dynamically created and assigned to DA pairs. Array site IDs (Sx) do not have any fixed or predetermined relationship to disk drive physical locations or to the disk enclosure installation order. The relationship between array site IDs and physical disk locations or DA assignment can differ between DS8000s, even on DS8000s with the same number and type of disk drives.

After the DS8000 hardware has been installed, you can use the output of the DSCLI lsarraysite command to display and document array site information, including disk drive type and DA pair. You must check the disk drive type and DA pair for each array site to ensure that arrays, ranks, and ultimately volumes created from the array site are created on the DS8000 hardware resources required for the isolated or resource-sharing workloads. The result of this step is the addition of specific array site IDs to the plan of workload assignment to ranks.

Now, we look at three examples of DS8000 configurations to review the disk drive types, DA pairs, and array sites available and to discuss several isolation and spreading considerations.

5.5.1 DS8000 configuration example 1: Array site planning considerations


Configuration example 1 is a DS8000 model with a partially populated base frame. Two pairs of disk enclosures are populated with disk drives, and a single device adapter is used (DA2). Figure 5-2 shows a schematic of this DS8000.

[Figure: DS8000 base frame with two disk enclosure pairs populated (16 front + 16 rear disk drives per enclosure pair, 64 disk drives total); eight array sites (S1 - S8), all on DA pair 2; one device adapter card in I/O enclosure 2 (lower left) and one in I/O enclosure 3 (lower right)]
Figure 5-2 DS8000 configuration example 1 with two disk enclosures populated (base frame only)


Note: In the schematic, each of the two green rectangles at the top of the DS8000 represents one disk enclosure pair, or 16 disk drives in the front and 16 disk drives in the rear for a total of 32 disk drives. The two green rectangles together make up a pair of disk enclosure pairs, for a total of 64 disk drives. The number 2 in the green rectangles indicates that the disk enclosure pairs are cabled to DA 2, which is shown by the boxes in I/O enclosures 2 and 3 at the bottom of the DS8000.

Example 5-1 shows the output of the lsarraysite command issued for this DS8000. The lsarraysite output shows that this DS8000 has:
- Eight array sites (S1 - S8)
- One DA in use (DA2). DA0 will not be used until the remaining two disk enclosure pairs in the base frame are populated with disk drives. DA3 and DA1 will not be used until an expansion frame is added.
- Homogeneous disk drives (all 146 GB and 10K rpm)
Example 5-1 DS8000 configuration example 1: Array sites, DA pairs, and disk drive types

dscli> lsarraysite -l -dev ibm.2107-7506571
Date/Time: September 4, 2005 2:30:15 PM EDT IBM DSCLI Version: 5.0.5.52 DS:IBM.2107-7506571
arsite DA Pair dkcap (10^9B) diskrpm State      Array
===================================================
S1     2       146.0         10000   Unassigned -
S2     2       146.0         10000   Unassigned -
S3     2       146.0         10000   Unassigned -
S4     2       146.0         10000   Unassigned -
S5     2       146.0         10000   Unassigned -
S6     2       146.0         10000   Unassigned -
S7     2       146.0         10000   Unassigned -
S8     2       146.0         10000   Unassigned -
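When documenting the array site plan, it can help to summarize saved lsarraysite -l output by DA pair and drive type. The following minimal Python sketch is only an illustration of that bookkeeping step; it assumes the columnar output shown above has been captured to a plain text file (the file name is an assumption), and it is not a DS8000-provided tool.

# Minimal sketch: summarize saved "lsarraysite -l" output by DA pair and
# drive type. Assumes the column layout shown in Example 5-1 was saved
# to a text file.

from collections import defaultdict

def summarize_array_sites(path):
    summary = defaultdict(list)          # (DA pair, capacity, rpm) -> [array sites]
    with open(path) as f:
        for line in f:
            fields = line.split()
            # Data rows start with an array site ID such as S1, S2, ...
            if fields and fields[0].startswith("S") and fields[0][1:].isdigit():
                arsite, da_pair, dkcap, diskrpm = fields[0], fields[1], fields[2], fields[3]
                summary[(da_pair, dkcap, diskrpm)].append(arsite)
    return summary

for (da_pair, dkcap, diskrpm), sites in summarize_array_sites("lsarraysite.txt").items():
    print(f"DA{da_pair}: {len(sites)} array sites, {dkcap} GB {diskrpm} rpm -> {sites}")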

Workload isolation considerations


Because only one DA pair is available for this DS8000, no DA pair isolation is possible. The only practical form of workload isolation is on an individual rank basis. That is, if workload isolation is required, one or more ranks can be dedicated to a workload, but the DA pair cannot be dedicated to the workload. Only order this single DA pair hardware configuration for workloads that do not require DA pair isolation. As an example, if you require that two ranks are dedicated to one workload, you can choose array sites S1 and S2 for the isolated workload, leaving six array sites (S3 - S8) available for a group of resource-sharing applications.

Workload spreading considerations


Continuing our example with two ranks dedicated to an isolated workload, you can plan for array site S1 to become an array and then a rank assigned to a processor complex 0 extent pool, and you can plan for array site S2 to become an array and then a rank assigned to a processor complex 1 extent pool. You can plan for volumes for the isolated workload to be created evenly on both ranks. In this way, the isolated workload is able to take advantage of the performance capabilities of both ranks, as well as the processor and cache resources on both DS8000 processor complexes.

For the resource-sharing workloads, you can plan for three ranks (for example, S3, S5, and S7) to be used for arrays and then ranks assigned to processor complex 0 extent pools, and you can plan for three ranks (S4, S6, and S8) to be used for arrays and then ranks assigned to processor complex 1 extent pools. Again, you can create volumes for the resource-sharing workloads evenly on all ranks, so that all workloads are able to take advantage of all six ranks' performance capabilities as well as the processor and cache resources of both processor complexes.

5.5.2 DS8000 configuration example 2: Array site planning considerations


Logical configuration example 2 is a DS8000 with a fully populated base frame (four pairs of disk enclosures) and one fully populated expansion frame (eight pairs of disk enclosures). Six DA pairs are in use (2, 0, 6, 4, 7, and 5). DA3 and DA1 will not be used until a second expansion frame is added. Figure 5-3 shows a schematic of this DS8000.

Note: In the schematic, each of the two green rectangles at the top of the DS8000 represents one disk enclosure pair, or 16 disk drives in the front and 16 disk drives in the rear for a total of 32 disk drives. The two green rectangles together make up a pair of disk enclosure pairs for a total of 64 disk drives. The number 2 in the green rectangles indicates that the disk enclosure pairs are cabled to DA 2, which is shown by the boxes in I/O enclosures 2 and 3 at the bottom of the DS8000.

[Figure: DS8000 base frame A and first expansion frame B with 12 disk enclosure pairs (384 disk drives, 48 array sites S1 - S48) on six DA pairs: DA2 (I/O enclosures 2, 3), DA0 (I/O enclosures 0, 1), DA6 (I/O enclosures 6, 7), DA4 (I/O enclosures 4, 5), DA7 (I/O enclosures 6, 7), and DA5 (I/O enclosures 4, 5); DA1 and DA3 are not used unless a second expansion frame is added]
Figure 5-3 DS8000 configuration example 2 with 12 disk enclosure pairs

Example 5-2 on page 76 shows output from the DSCLI lsarraysite command issued for this DS8000. Because of the two fully populated frames, the lsarraysite output shows a total of 48 array sites (384 disk drives):
- Eight array sites on DA2 (S1 - S8)
- Eight array sites on DA0 (S9 - S16)
- Eight array sites on DA7 (S17 - S24)
- Eight array sites on DA6 (S25 - S32)
- Eight array sites on DA5 (S33 - S40)
- Eight array sites on DA4 (S41 - S48)


A total of six DA pairs are used (DA 2, 0, 7, 6, 5, and 4). DA3 and DA1 will not be used until a second expansion frame is added. All disk drives are the same capacity and speed (73 GB and 15K rpm).

Important: It is important to note the association of array sites and DA pairs as shown in the lsarraysite output, because array sites do not have any fixed or predetermined relationship to physical disk drive locations in the DS8000. Array sites are created and assigned to device adapters dynamically during the DS8000 installation and can vary from one DS8000 to another DS8000. In this example, the association between array sites (S1 - S48) and DA pairs is not the same as the order of installation of disks on DA pairs (DA2, DA0, DA6, DA4, DA7, and DA5).
Example 5-2 DS8000 configuration example 2: Array sites, DA pairs, and disk drive types

dscli> lsarraysite -dev ibm.2107-7520331 -l
Date/Time: September 9, 2005 2:57:27 PM EDT IBM DSCLI Version: 5.0.5.52 DS:IBM.2107-7520331
arsite DA Pair dkcap (10^9B) diskrpm State      Array
===================================================
S1     2       73.0          15000   Unassigned -
S2     2       73.0          15000   Unassigned -
S3     2       73.0          15000   Unassigned -
S4     2       73.0          15000   Unassigned -
S5     2       73.0          15000   Unassigned -
S6     2       73.0          15000   Unassigned -
S7     2       73.0          15000   Unassigned -
S8     2       73.0          15000   Unassigned -
S9     0       73.0          15000   Unassigned -
S10    0       73.0          15000   Unassigned -
S11    0       73.0          15000   Unassigned -
S12    0       73.0          15000   Unassigned -
S13    0       73.0          15000   Unassigned -
S14    0       73.0          15000   Unassigned -
S15    0       73.0          15000   Unassigned -
S16    0       73.0          15000   Unassigned -
S17    7       73.0          15000   Unassigned -
S18    7       73.0          15000   Unassigned -
S19    7       73.0          15000   Unassigned -
S20    7       73.0          15000   Unassigned -
S21    7       73.0          15000   Unassigned -
S22    7       73.0          15000   Unassigned -
S23    7       73.0          15000   Unassigned -
S24    7       73.0          15000   Unassigned -
S25    6       73.0          15000   Unassigned -
S26    6       73.0          15000   Unassigned -
S27    6       73.0          15000   Unassigned -
S28    6       73.0          15000   Unassigned -
S29    6       73.0          15000   Unassigned -
S30    6       73.0          15000   Unassigned -
S31    6       73.0          15000   Unassigned -
S32    6       73.0          15000   Unassigned -
S33    5       73.0          15000   Unassigned -
S34    5       73.0          15000   Unassigned -
S35    5       73.0          15000   Unassigned -
S36    5       73.0          15000   Unassigned -
S37    5       73.0          15000   Unassigned -
S38    5       73.0          15000   Unassigned -
S39    5       73.0          15000   Unassigned -
S40    5       73.0          15000   Unassigned -
S41    4       73.0          15000   Unassigned -
S42    4       73.0          15000   Unassigned -
S43    4       73.0          15000   Unassigned -
S44    4       73.0          15000   Unassigned -
S45    4       73.0          15000   Unassigned -
S46    4       73.0          15000   Unassigned -
S47    4       73.0          15000   Unassigned -
S48    4       73.0          15000   Unassigned -

Workload isolation considerations


Because multiple DA pairs are available for this DS8000, DA pair isolation is possible in addition to individual rank isolation. That is, if workload isolation is required:
- One or more ranks can be dedicated to a workload.
- One or more DA pairs can be dedicated to a workload.

For example, all eight ranks on DA5 (S33 - S40) can be dedicated to one isolated workload, with the remaining 40 ranks (S1 - S32 and S41 - S48 on DAs 2, 0, 6, 4, and 7) available to be shared by the remaining workloads.

Workload spreading considerations


Again, plan for half of the array sites allocated to the isolated workload to become arrays and then ranks that will be assigned to processor complex 0 extent pools, and plan for the other half to ultimately be associated with processor complex 1. Then, plan for the volumes for the isolated workload to be allocated evenly across all the ranks. For the resource-sharing workloads, there might be more array sites than the number of volumes required for any single workload. Volumes for each workload must be allocated evenly across array sites on different DAs and across array sites that are planned to become arrays and ranks in processor complex 0 extent pools and processor complex 1 extent pools. For example, a resource-sharing workload requiring ten volumes might have volumes planned for array sites S1, S2, S9, S10, S17, S18, S25, S26, S41, and S42. Array sites S1, S9, S17, S25, and S41 are planned for ranks in processor complex 0 extent pools, and array sites S2, S10, S18, S26, and S42 are planned for ranks in processor complex 1 extent pools.

5.5.3 DS8000 configuration example 3: Array site planning considerations


Logical configuration example 3 is a DS8000 model with two Storage Images, a fully populated base frame (four pairs of disk enclosures), and a single expansion frame with four pairs of fully populated disk enclosures (that are supported by DA6 and DA4) and four pairs of partially populated disk enclosures (that are supported by DA7 and DA5). In a dual Storage Image DS8000, 50% of the I/O enclosures are dedicated to each Storage Image:
- I/O enclosures 0, 1, 4, and 5 (and disks supported by DA pairs 0, 1, 4, and 5) are dedicated to Storage Image 1.
- I/O enclosures 2, 3, 6, and 7 (and disks supported by DA pairs 2, 3, 6, and 7) are dedicated to Storage Image 2.

Figure 5-4 on page 78 shows a schematic of DS8000 logical configuration example 3.


[Figure: DS8000 Model 9B2 with 10 disk enclosure pairs total and six DA pairs. Storage Image 1: 160 disk drives and 20 array sites on DAs 0, 4, and 5 (I/O enclosures 0, 1, 4, and 5), with only one populated disk enclosure pair on DA5. Storage Image 2: 160 disk drives and 20 array sites on DAs 2, 6, and 7 (I/O enclosures 2, 3, 6, and 7), with only one populated disk enclosure pair on DA7. DA1 and DA3 are not used without a second expansion frame.]
Figure 5-4 DS8000 configuration example 3 with dual Storage Image and 10 disk enclosure pairs

In order to see all of the array sites in this DS8000, you must issue the DS command-line interface (CLI) lsarraysite command twice, one time for each Storage Image:
- Storage Image 1: IBM.2107-7566321
- Storage Image 2: IBM.2107-7566322

The DSCLI lsarraysite output for Storage Image 1 (IBM.2107-7566321) in Example 5-3 on page 79 shows a total of 20 array sites (160 disk drives):
- Eight array sites on DA0 (S1 - S8): 64 x 73 GB, 15K rpm disk drives.
- Four array sites on DA5 (S9 - S12): 32 x 73 GB, 15K rpm disk drives. The second disk enclosure pair on DA5 is not populated with disk drives.
- Eight array sites on DA4 (S13 - S20): 32 x 300 GB, 10K rpm disk drives and 32 x 73 GB, 15K rpm disk drives.

A total of three DA pairs are used (DA0, DA4, and DA5). Storage Image 1 does not show array sites on DA1, because DA1 will not be used until a second expansion frame is added. The disk drives are not homogeneous. There are:
- 128 x 73 GB 15K rpm drives.
- 32 x 300 GB 10K rpm drives.


Important: Note the association of array sites and DA pairs as shown in the lsarraysite output, because array sites do not have any fixed or predetermined relationship to physical disk drive locations in the DS8000. Array sites are created and assigned to device adapters dynamically during DS8000 installation and can vary from one DS8000 to another DS8000. In this example, the association between array sites (S1 - S20 on each Storage Image) and DA pairs is not the same as the order of installation of disks on DA pairs (DA0, DA4, and DA5 for Storage Image 1 and DA2, DA6, and DA7 for Storage Image 2).
Example 5-3 DS8000 example 3 Storage Image 1: Array sites, DA pairs, and disk drive types

dscli> lsarraysite -l -dev ibm.2107-7566321
Date/Time: September 4, 2005 2:23:20 PM EDT IBM DSCLI Version: 5.0.5.52 DS:IBM.2107-7566321
arsite DA Pair dkcap (10^9B) diskrpm State      Array
===================================================
S1     0       73.0          15000   Unassigned -
S2     0       73.0          15000   Unassigned -
S3     0       73.0          15000   Unassigned -
S4     0       73.0          15000   Unassigned -
S5     0       73.0          15000   Unassigned -
S6     0       73.0          15000   Unassigned -
S7     0       73.0          15000   Unassigned -
S8     0       73.0          15000   Unassigned -
S9     5       73.0          15000   Unassigned -
S10    5       73.0          15000   Unassigned -
S11    5       73.0          15000   Unassigned -
S12    5       73.0          15000   Unassigned -
S13    4       300.0         10000   Unassigned -
S14    4       300.0         10000   Unassigned -
S15    4       300.0         10000   Unassigned -
S16    4       300.0         10000   Unassigned -
S17    4       73.0          15000   Unassigned -
S18    4       73.0          15000   Unassigned -
S19    4       73.0          15000   Unassigned -
S20    4       73.0          15000   Unassigned -

The DSCLI lsarraysite output for Storage Image 2 (IBM.2107-7566322) in Example 5-4 on page 80 also shows a total of 20 array sites (160 disk drives):
- Four array sites on DA7 (S1 - S4): 32 x 73 GB, 15K rpm disk drives. The second disk enclosure pair on DA7 is not populated with disk drives.
- Eight array sites on DA2 (S5 - S12): 64 x 73 GB, 15K rpm disk drives.
- Eight array sites on DA6 (S13 - S20): 32 x 300 GB, 10K rpm disk drives (S13 - S16) and 32 x 73 GB, 15K rpm disk drives (S17 - S20).

A total of three DA pairs are used (DA2, DA6, and DA7). Storage Image 2 shows no array sites on DA3, because DA3 will not be used until a second expansion frame is added. The disk drives are not homogeneous. There are:
- 128 x 73 GB 15K rpm drives.
- 32 x 300 GB 10K rpm drives.

Important: Note the association of array sites and DA pairs as shown in the lsarraysite output, because array sites do not have any fixed or predetermined relationship to physical disk drive locations in the DS8000. Array sites are created and assigned to device adapters dynamically during DS8000 installation and can vary from one DS8000 to another DS8000. In this example, the association between array sites (S1 - S20 on each Storage Image) and DA pairs is not the same as the order of installation of disks on DA pairs (DA0, DA4, and DA5 for Storage Image 1 and DA2, DA6, and DA7 for Storage Image 2).
Example 5-4 DS8000 example 3 Storage Image 2: Array sites, DA pairs, and disk drive types

dscli> lsarraysite -l -dev ibm.2107-7566322
Date/Time: September 4, 2005 2:23:34 PM EDT IBM DSCLI Version: 5.0.5.52 DS:IBM.2107-7566322
arsite DA Pair dkcap (10^9B) diskrpm State      Array
===================================================
S1     7       73.0          15000   Unassigned -
S2     7       73.0          15000   Unassigned -
S3     7       73.0          15000   Unassigned -
S4     7       73.0          15000   Unassigned -
S5     2       73.0          15000   Unassigned -
S6     2       73.0          15000   Unassigned -
S7     2       73.0          15000   Unassigned -
S8     2       73.0          15000   Unassigned -
S9     2       73.0          15000   Unassigned -
S10    2       73.0          15000   Unassigned -
S11    2       73.0          15000   Unassigned -
S12    2       73.0          15000   Unassigned -
S13    6       300.0         10000   Unassigned -
S14    6       300.0         10000   Unassigned -
S15    6       300.0         10000   Unassigned -
S16    6       300.0         10000   Unassigned -
S17    6       73.0          15000   Unassigned -
S18    6       73.0          15000   Unassigned -
S19    6       73.0          15000   Unassigned -
S20    6       73.0          15000   Unassigned -

Workload isolation considerations


Because two Storage Images are available for this DS8000, Storage Image isolation is possible in addition to DA pair isolation and individual rank isolation. That is, if workload isolation is required:
- One or more individual ranks can be dedicated to a workload.
- All ranks of one drive type can be dedicated to a workload (for example, all 300 GB drives).
- One or more DA pairs can be dedicated to a workload.
- An entire Storage Image can be dedicated to a workload.

Generally, a dual Storage Image DS8000 has been ordered with the intention of isolating workloads at the Storage Image level. That is, certain workloads are planned for Storage Image 1 and different workloads are planned for Storage Image 2. Additionally, heterogeneous disk drives are usually ordered with the intention of workload isolation. Within each Storage Image, different workloads are planned for the 300 GB drives and the 73 GB drives.


Workload spreading considerations


In either Storage Image, any workloads requiring 300 GB drives can have volumes spread across array sites S13 - S16, and workloads requiring 73 GB drives can have volumes spread across array sites S1 - S12 and S17 - S20.

5.6 Planning RAID arrays and ranks


The next step is planning the RAID arrays and ranks, which means taking the specific array sites planned for isolated or resource-sharing workloads and defining their assignment to RAID arrays and CKD or FB ranks, including planning array IDs and rank IDs. Because there is a one-to-one correspondence between an array and a rank on the DS8000, array and rank planning can be done in a single step. However, array and rank creation require separate steps.

It is important to note that the sequence of steps when creating the arrays and ranks finally determines the numbering scheme of the array IDs and rank IDs, because these IDs are chosen automatically by the system during creation. The logical configuration does not depend on a specific ID numbering scheme, but a consistent scheme can make configuration planning and performance management easier.

Note: Array sites, arrays, and ranks do not have a fixed or predetermined relationship to any DS8000 processor complex before they are finally assigned to an extent pool and thus a rank group (rank group 0/1 is managed by processor complex 0/1).

5.6.1 RAID-level performance considerations


When configuring arrays from array sites, you need to specify the RAID level: RAID 5, RAID 6, or RAID 10. These RAID levels meet different requirements regarding performance, usable storage capacity, and data protection. However, the choice of the proper RAID types as well as the physical disk drives (speed and capacity) should already have been made when ordering the DS8000 hardware, with regard to initial workload performance objectives as well as capacity requirements and availability considerations. For more information about implementing the various RAID levels on the DS8000, refer to 4.1, RAID levels and spares on page 42.

RAID 5 is one of the most commonly used levels of RAID protection, because it optimizes cost-effective performance while emphasizing the use of usable capacity through data striping. It provides fault tolerance if one disk drive fails by using XOR parity for redundancy. Hot spots within an array are avoided by distributing data and parity information across all of the drives in the array. The capacity of one drive in the RAID array is lost for holding the parity information. RAID 5 provides a good balance of performance and usable storage capacity.

RAID 6 provides a higher level of fault tolerance than RAID 5 in the case of disk failures, but it also provides less usable capacity than RAID 5, because the capacity of two drives in the array is set aside for holding the parity information. As with RAID 5, hot spots within an array are avoided by distributing data and parity information across all of the drives in the array. Still, RAID 6 offers more usable capacity than RAID 10 by providing an efficient method of data protection in case of double disk errors, such as two drive failures, two coincident medium errors, or a drive failure and a medium error during a rebuild. As the likelihood of medium errors increases with the capacity of the physical disk drives, particularly consider the use of RAID 6 in conjunction with large capacity disk drives and higher data availability requirements, for example, with 500 GB FC Advanced Technology Attachment (FATA), 1 TB Serial Advanced Technology Attachment (SATA), and even 450 GB FC disk drives where the array rebuild in case of a drive failure takes an extremely long time. RAID 6 can, of course, also be used with smaller FC drives, when the primary concern is a higher level of data protection than is provided by RAID 5.

RAID 10 optimizes high performance while maintaining fault tolerance for disk drive failures. The data is striped across several disks, and the first set of disk drives is mirrored to an identical set. RAID 10 can tolerate at least one, and in most cases, even multiple disk failures as long as the primary and secondary copy of a mirrored disk pair do not fail at the same time.

In addition to the considerations for data protection and capacity requirements, the question typically arises about which RAID level performs better, RAID 5, RAID 6, or RAID 10. As with most complex issues, the answer is that it depends. There are a number of workload attributes that influence the relative performance of RAID 5, RAID 6, or RAID 10, including the use of cache, the relative mix of read as compared to write operations, and whether data is referenced randomly or sequentially.

Regarding read I/O operations, either random or sequential, there is generally no noteworthy difference between RAID 5, RAID 6, and RAID 10. When a DS8000 subsystem receives a read request from a host system, it first checks if the requested data is already in cache. If the data is in cache (that is, a read cache hit), there is no need to read the data from disk, and the RAID level on the arrays does not matter at all. For reads that must actually be satisfied from disk (that is, the array or the back end), performance of RAID 5, RAID 6, and RAID 10 is roughly equal, because the requests are spread evenly across all disks in the array. In RAID 5 and RAID 6 arrays, data is striped across all disks, so I/Os are spread across all disks. In RAID 10, data is striped and mirrored across two sets of disks, so half of the reads are processed by one set of disks, and half of the reads are processed by the other set, reducing the utilization of individual disks.

Regarding random write I/O operations, the different RAID levels vary considerably in their performance characteristics. With RAID 10, each write operation at the disk back end initiates two disk operations to the rank. With RAID 5, an individual random small block write operation to the disk back end typically causes a RAID 5 write penalty, which initiates four I/O operations to the rank by reading the old data and the old parity block before finally writing the new data and the new parity block. For RAID 6 with two parity blocks, the write penalty increases to six required I/O operations at the back end for a single random small block write operation. Note that this assumption is a worst-case scenario that is quite helpful for understanding the back-end impact of random workloads with a given read:write ratio for the various RAID levels. It permits a rough estimation of the expected back-end I/O workload and helps to plan for the proper number of arrays. On a heavily loaded system, it might actually take fewer I/O operations on average than expected for RAID 5 and RAID 6 arrays. The optimization of the queue of write I/Os waiting in cache for the next destage operation can lead to a high number of partial or even full stripe writes to the arrays with fewer back-end disk operations required for the parity calculation.
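As a rough illustration of this worst-case estimation, the following Python sketch computes the expected back-end I/O rate for a given host I/O rate, read ratio, and RAID level, using the per-write disk operation counts described above (RAID 10: 2, RAID 5: 4, RAID 6: 6) and ignoring read cache hits and stripe-write optimizations. It is only a planning aid under those assumptions, not a DS8000 formula.

# Worst-case back-end I/O estimate for small block random workloads,
# based on the write penalties described above. Read cache hits and
# full or partial stripe write optimizations are ignored.

WRITE_PENALTY = {"RAID5": 4, "RAID6": 6, "RAID10": 2}

def backend_iops(host_iops, read_ratio, raid_level):
    reads = host_iops * read_ratio
    writes = host_iops * (1.0 - read_ratio)
    return reads + writes * WRITE_PENALTY[raid_level]

# Example: 1000 host IOPS with a 70:30 read:write ratio
for raid in ("RAID5", "RAID6", "RAID10"):
    print(raid, backend_iops(1000, 0.70, raid))
# RAID5  -> 700 + 300 * 4 = 1900 back-end IOPS
# RAID6  -> 700 + 300 * 6 = 2500 back-end IOPS
# RAID10 -> 700 + 300 * 2 = 1300 back-end IOPS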
It is important to understand that on modern disk systems, such as the DS8000, write operations are generally cached by the storage subsystem and thus handled asynchronously with short write response times for the attached host systems, so that any RAID 5 or RAID 6 write penalties are generally shielded from the attached host systems in terms of disk response time. Typically, a write request that is sent to the DS8000 subsystem is written into storage server cache and persistent cache, and the I/O operation is then acknowledged immediately to the host system as completed. As long as there is room in these cache areas, the response time seen by the application is only the time to get data into the cache, and it does not matter whether RAID 5, RAID 6, or RAID 10 is used. However, if the host systems send data to the cache areas faster than the storage server can destage the data to the arrays (that is, move it from cache to the physical disks), the cache can occasionally fill up with no space for the next write request, and therefore, the storage server will signal the host system to retry the I/O write operation. In the time that it takes the host system to retry the I/O write operation, the storage server will likely have time to destage part of the data, providing free space in the cache and allowing the I/O operation to complete on the retry attempt.

When random small block write data is destaged from cache to disk, RAID 5 and RAID 6 arrays can experience a severe write penalty with four or six required back-end disk operations, while RAID 10 always requires only two disk operations per small block write request. Because RAID 10 performs only half the disk operations of RAID 5, for random writes, a RAID 10 destage completes faster and thereby reduces the busy time of the disk subsystem. So with steady and heavy random write workloads, the back-end write operations to the ranks (the physical disk drives) can become a limiting factor, so that only a RAID 10 configuration (instead of additional RAID 5 or RAID 6 arrays) will provide enough back-end disk performance at the rank level to meet the workload performance requirements.

While RAID 10 clearly outperforms RAID 5 and RAID 6 with regard to small block random write operations, RAID 5 and also RAID 6 show excellent performance with regard to sequential write I/O operations. With sequential write requests, all of the blocks required for the RAID 5 parity calculation can be accumulated in cache, and thus the destage operation with parity calculation can be done dynamically as a full stripe write without the need for additional disk operations to the array. So with only one additional parity block for a full stripe write (for example, seven data blocks plus one parity block for a 7+P RAID 5 array), RAID 5 requires fewer disk operations at the back end than RAID 10, which always requires twice the amount of write operations due to data mirroring. RAID 6 also benefits from sequential write patterns, with most of the data blocks required for the double parity calculation staying in cache and thus reducing the amount of additional disk operations to the back end considerably. For sequential writes, a RAID 5 destage completes faster and thereby reduces the busy time of the disk subsystem.

Comparing RAID 5 to RAID 6, the performance of small block random reads and of sequential reads is roughly equal. Due to the higher write penalty, the RAID 6 small block random write performance is distinctly lower than with RAID 5. Also, the maximum sequential write throughput is slightly less with RAID 6 than with RAID 5 due to the additional second parity calculation. However, RAID 6 rebuild times are close to RAID 5 rebuild times (for the same size disk drive modules (DDMs)), because rebuild times are primarily limited by the achievable write throughput to the spare disk during data reconstruction. So, RAID 6 mainly is a significant reliability enhancement with a trade-off in random write performance. It is most effective for large capacity disks that hold mission critical data and that are properly sized for the expected write I/O demand. Workload planning is especially important before implementing RAID 6 for write intensive applications, including Copy Services targets and FlashCopy Space Efficient (SE) repositories.

RAID 10 is not as commonly used as RAID 5 for two key reasons. First, RAID 10 requires more raw disk capacity for every GB of effective capacity. Second, when you consider a standard workload with a typically high number of read operations and only a small amount of write operations, RAID 5 generally offers the best trade-off between overall performance and usable capacity.
In many cases, RAID 5 write performance is adequate, because disk systems tend to operate at I/O rates below their maximum throughputs, and differences between RAID 5 and RAID 10 will primarily be observed at maximum throughput levels. Consider using RAID 10 for critical workloads with a high percentage of steady random write requests, which can easily become rank-limited. Here, RAID 10 provides almost twice the throughput of RAID 5 (because of the write penalty). The trade-off for better performance with RAID 10 is about 40% less usable disk capacity. Larger drives can be used with RAID 10 to get the random write performance benefit while maintaining about the same usable capacity as a RAID 5 array with the same number of disks.

The individual performance characteristics of the RAID arrays can be summarized as:


- For read operations from disk, either random or sequential, there is no significant difference in RAID 5, RAID 6, and RAID 10 performance.
- For random writes to disk, RAID 10 outperforms RAID 5 and RAID 6.
- For random writes to disk, RAID 5 performs better than RAID 6.
- For sequential writes to disk, RAID 5 tends to perform better.

Table 5-1 shows a short overview of the advantages and disadvantages for the RAID levels with regard to reliability, space efficiency, and random write performance.
Table 5-1 RAID-level comparison with regard to reliability, space efficiency, and write penalty

RAID level       Reliability            Space            Performance write penalty
                 (number of erasures)   efficiency (a)   (number of disk operations)
RAID 5 (7+P)     1                      87.5%            4
RAID 6 (6+P+Q)   2                      75%              6
RAID 10 (4x2)    At least 1             50%              2

a. The space efficiency in this table is based on the number of disks remaining available for data storage. The actual usable, decimal capacities are up to 5% less.

In general, workloads that make effective use of storage subsystem cache for reads and writes see little difference between RAID 5 and RAID 10 configurations. For workloads that perform better with RAID 5, the difference in RAID 5 performance over RAID 10 is typically small. However, for workloads that perform better with RAID 10, the difference in RAID 10 performance over RAID 5 performance or even RAID 6 performance can be significant.

Because RAID 5, RAID 6, and RAID 10 basically perform equally well for both random and sequential read operations, RAID 5 and RAID 6 might be a good choice with regard to space efficiency and performance for standard workloads with a high percentage of read requests. RAID 6 offers a higher level of data protection than RAID 5, especially for large capacity drives, but the random write performance of RAID 6 is less due to the second parity calculation. Therefore, we highly recommend a proper performance sizing, especially for RAID 6. RAID 5 tends to have a slight performance advantage for sequential writes, whereas RAID 10 performs better for random writes. RAID 10 is generally considered to be the RAID type of choice for business-critical workloads with a high amount of random write requests (typically more than 35% writes) and low response time requirements.

For array rebuilds, RAID 5, RAID 6, and RAID 10 require approximately the same elapsed time, although RAID 5 and RAID 6 require significantly more disk operations and therefore are more likely to impact other disk activity on the same disk array. Note that you can select RAID types for each individual array site. So, you can select the RAID type based on the specific performance requirements of the data that will be located there.

The best way to compare the performance of a given workload using RAID 5, RAID 6, or RAID 10 is to run a Disk Magic model. For additional information about the capabilities of this tool, refer to 7.1, Disk Magic on page 162. For workload planning purposes, it might be convenient to have a general idea of the I/O performance that a single RAID array can provide. Figure 5-5 on page 85 and Figure 5-6 on page 86 show measurement results1 for a single array built from eight 146 GB 15k FC disk drives when configured as RAID 5, RAID 6, or RAID 10. These numbers are not DS8000-specific, because they simply represent the limits that you can expect from a simple set of eight physical disks forming a RAID array.

[Chart: Single Rank (8x FC146GB15k) - Random Read 4kB Workload; response time in ms (0 - 100) versus IOps (0 - 2500) for RAID5 (7+P), RAID6 (6+P+Q), and RAID10 (4x2)]

Figure 5-5 Single rank RAID-level comparison for a 4 KB random workload with 100% reads

For small block random read workloads, there is no significant performance difference between RAID 5, RAID 6, and RAID 10, as seen in Figure 5-5. Without taking any read cache hits into account, 1800 read IOPS with high back-end response times above 30 ms mark the upper limit of the capabilities of a single array for random read access using all of the available capacity of that array.

Small block random writes, however, make the difference between the various RAID levels with regard to performance. Even for a typical 70:30 random small block workload with 70% reads (no read cache hits) and 30% writes, as shown in Figure 5-6 on page 86, the different performance characteristics between the RAID levels already become evident. With an increasing amount of random writes, RAID 10 clearly outperforms RAID 5 and RAID 6. Here, for a standard random small block 70:30 workload, 1500 IOPS mark the upper limit of a RAID 10 array, 1100 IOPS for a RAID 5 array, and 900 IOPS for a RAID 6 array.

Note that in both Figure 5-5 and Figure 5-6 on page 86, no read cache hits have been considered. Furthermore, the I/O requests were spread across the entire available capacity of each RAID array. So, depending on the read cache hit ratio of a given workload and the capacity used on the array (using less capacity on an array simply means reducing disk arm movements and thus reducing average access times), you can expect typically lower overall response times and even higher I/O rates. Also, the read:write ratio, as well as the access pattern of a particular workload, either random or sequential, determine the achievable performance of a rank. Figure 5-5 and Figure 5-6 on page 86 (examples of small block random I/O requests) simply help to give you an idea of the performance capabilities of a single rank for different RAID levels.

1 The measurements were done with IOmeter (http://www.iometer.org) on Windows Server 2003 utilizing the entire available capacity on the array for I/O requests. The performance data contained herein was obtained in a controlled, isolated environment. Actual results that might be obtained in other operating environments can vary significantly. There is no guarantee that the same or similar results will be obtained elsewhere.

(Figure: Single Rank (8x FC146GB15k) - Random 70:30 4kB Workload; Response Time [ms] versus IOps for RAID5 (7+P), RAID6 (6+P+Q), and RAID10 (4x2).)

Figure 5-6 Single rank RAID-level comparison for a 4 KB random workload with 70% reads 30% writes

Despite the different RAID levels and the actual workload pattern (read:write ratio, sequential or random access), it is also important to note that the maximum I/O rate per rank depends on the type of disk drives used. As a mechanical device, each disk drive is only capable of processing a limited number of random I/O operations per second, depending on the drive characteristics. So the mere number of disk drives used for a given amount of storage capacity finally determines the achievable random IOPS performance. The 15k drives offer approximately 30% more random IOPS performance than 10k drives.

As a general rule for random IOPS planning calculations, you can use 160 IOPS per 15k FC drive and 120 IOPS per 10k FC drive. Be aware that at these levels of disk utilization, you might already see elevated response times. So for excellent response time expectations, consider even lower IOPS limits.

Slower spinning, large capacity FATA or SATA disk drives offer a considerably lower maximum random access I/O rate per drive (approximately half of a 15k FC drive). Therefore, they are only intended for environments with fixed content, data archival, reference data, or near-line applications that require large amounts of data at low cost and do not require drive duty cycles greater than 20%. Note that the DS8000 enforces this limit on FATA drives for data protection reasons by throttling them if the duty cycle exceeds 20%.
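To illustrate how these per-drive planning numbers can be used, the following minimal Python sketch estimates the host I/O rate that a single rank can sustain at the planning threshold. It is not part of any DS8000 tool, and the write penalties of 2 (RAID 10), 4 (RAID 5), and 6 (RAID 6) back-end operations per random write are general rules of thumb rather than DS8000-specific values.

# Rough planning estimate of host IOPS for a single RAID rank.
# Assumptions (rules of thumb, not DS8000-specific measurements):
#   - 160 back-end IOPS per 15k FC drive (120 for 10k FC)
#   - write penalty: RAID 10 = 2, RAID 5 = 4, RAID 6 = 6 back-end ops per random write

WRITE_PENALTY = {"RAID5": 4, "RAID6": 6, "RAID10": 2}

def rank_host_iops(drives, iops_per_drive, raid, read_ratio, read_hit_ratio=0.0):
    """Estimate sustainable host IOPS for one rank at the planning threshold."""
    backend_capacity = drives * iops_per_drive
    # Back-end operations generated per host I/O:
    # read misses go to disk once, writes are multiplied by the RAID write penalty.
    backend_per_host_io = (read_ratio * (1 - read_hit_ratio)
                           + (1 - read_ratio) * WRITE_PENALTY[raid])
    return backend_capacity / backend_per_host_io

# Example: 8 x 146 GB 15k FC drives, 70:30 read:write, no read cache hits
for raid in ("RAID5", "RAID6", "RAID10"):
    print(raid, round(rank_host_iops(8, 160, raid, read_ratio=0.7)))

The relative ordering of the results matches the measured curves in Figure 5-6, while the absolute values are lower, because 160 IOPS per drive is a conservative planning threshold rather than the saturation point of the drives.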

5.6.2 RAID array considerations


A DS8000 RAID array supports only a single RAID type (either RAID 5, RAID 6, or RAID 10). If different RAID types are required for certain workloads, the workloads must be isolated in different RAID arrays. On the DS8000, four spare drives (the minimum) are required in a fully populated DA pair, so certain arrays will contain spares, such as:
RAID 5: 6+P+S
RAID 6: 5+P+Q+S
RAID 10: 3x2+2S
And, other arrays do not contain any spares, such as:
RAID 5: 7+P
RAID 6: 6+P+Q
RAID 10: 4x2
This requirement essentially leads to arrays with different storage capacities and performance characteristics, although they have been created from array sites with identical disk types and RAID levels. The spares are assigned during array creation. Typically, the first arrays created from an unconfigured set of array sites on a given DA pair contain spare drives until the minimum requirements as outlined in 4.1.4, Spare creation on page 44 are met.

With regard to the distribution of the spare drives, you might need to plan the sequence of array creation carefully if a mixture of RAID 5, RAID 6, and RAID 10 arrays is required on the same DA pair. Otherwise, you might not meet your initial capacity requirements and end up with more spare drives on the system than actually required, simply wasting storage capacity. For example, if you plan for two RAID 10 (3x2+2S) arrays on a given DA pair with homogeneous array sites, you might start with the creation of these arrays first, because these arrays will already reserve two spare drives per array, so that the final RAID 5 or RAID 6 arrays will not contain any spares. Or, if you prefer to obtain RAID 10 (4x2) arrays without spare drives, you can instead start with the creation of four RAID 5 or RAID 6 arrays, which then contain the required number of spare drives, before creating the RAID 10 arrays.

In order to spread the available storage capacity and thus the overall workload evenly across both DS8000 processor complexes, you must assign an equal number of arrays containing spares to processor complex 0 (rank group 0) and processor complex 1 (rank group 1). Furthermore, note that performance can differ between RAID arrays containing spare drives and RAID arrays without spare drives, because the arrays without spare drives offer more storage capacity and also provide more active disk spindles for processing I/O operations.

Note: You must confirm spare allocation after array creation and make any necessary adjustments to the logical configuration plan before creating ranks and assigning them to extent pools.

When creating arrays from array sites, it might help to order the array IDs by DA pair, array size (that is, arrays with or without spares), RAID level, or even disk type depending on the available hardware resources and workload planning considerations. The mapping of the array sites to particular DA pairs can be taken from the output of the DSCLI lsarraysite command as shown in Example 5-5 on page 87. Array sites are numbered starting with S1, S2, and so forth by the DS8000 microcode. Arrays are numbered with system-generated IDs starting with A0, A1, and so forth in the sequence that they are created.
Example 5-5 Array sites and DA pair association as taken from the DSCLI lsarraysite command
dscli> lsarraysite -l
Date/Time: 27 October 2008 17:45:59 CET IBM DSCLI Version: 5.4.2.257 DS: IBM.2107-75GB192
arsite DA Pair dkcap (10^9B) diskrpm State      diskclass encrypt
=========================================================================
S1     2       146.0         15000   Unassigned ENT       unsupported
S2     2       146.0         15000   Unassigned ENT       unsupported
S3     2       146.0         15000   Unassigned ENT       unsupported
S4     2       146.0         15000   Unassigned ENT       unsupported
S5     2       146.0         15000   Unassigned ENT       unsupported
S6     2       146.0         15000   Unassigned ENT       unsupported
S7     2       146.0         15000   Unassigned ENT       unsupported
S8     2       146.0         15000   Unassigned ENT       unsupported
S9     6       146.0         15000   Unassigned ENT       unsupported
S10    6       146.0         15000   Unassigned ENT       unsupported
S11    6       146.0         15000   Unassigned ENT       unsupported
S12    6       146.0         15000   Unassigned ENT       unsupported
S13    6       146.0         15000   Unassigned ENT       unsupported
S14    6       146.0         15000   Unassigned ENT       unsupported
S15    6       146.0         15000   Unassigned ENT       unsupported
S16    6       146.0         15000   Unassigned ENT       unsupported
S17    7       146.0         15000   Unassigned ENT       unsupported
S18    7       146.0         15000   Unassigned ENT       unsupported
S19    7       146.0         15000   Unassigned ENT       unsupported
S20    7       146.0         15000   Unassigned ENT       unsupported
S21    7       146.0         15000   Unassigned ENT       unsupported
S22    7       146.0         15000   Unassigned ENT       unsupported
S23    7       146.0         15000   Unassigned ENT       unsupported
S24    7       146.0         15000   Unassigned ENT       unsupported
For configurations using only single-rank extent pools with the maximum control of volume placement and performance management, consider creating the arrays (A0, A1, and so forth) ordered by DA pair, which in most cases means simply following the sequence of array sites (S1, S2, and so forth) as shown in Example 5-6. But, note that the sequence of array sites is initially determined by the system and might not always strictly follow the DA pair order. Refer to Configuration technique for simplified performance management on page 112 for more information about this specific configuration strategy with single-rank extent pools and a hardware-related volume and logical subsystem (LSS)/logical control unit (LCU) ID configuration concept.
Example 5-6 Array ID sequence sorted by DA pair
dscli> lsarray -l
Date/Time: 24 October 2008 11:35:08 CEST IBM DSCLI Version: 5.4.2.257 DS: IBM.2107-75GB192
Array State      Data   RAIDtype    arsite rank DA Pair DDMcap (10^9B) diskclass encrypt
===========================================================================================
A0    Unassigned Normal 6 (5+P+Q+S) S1          2       146.0          ENT       unsupported
A1    Unassigned Normal 6 (5+P+Q+S) S2          2       146.0          ENT       unsupported
A2    Unassigned Normal 6 (5+P+Q+S) S3          2       146.0          ENT       unsupported
A3    Unassigned Normal 6 (5+P+Q+S) S4          2       146.0          ENT       unsupported
A4    Unassigned Normal 6 (6+P+Q)   S5          2       146.0          ENT       unsupported
A5    Unassigned Normal 6 (6+P+Q)   S6          2       146.0          ENT       unsupported
A6    Unassigned Normal 6 (6+P+Q)   S7          2       146.0          ENT       unsupported
A7    Unassigned Normal 6 (6+P+Q)   S8          2       146.0          ENT       unsupported
A8    Unassigned Normal 6 (5+P+Q+S) S9          6       146.0          ENT       unsupported
A9    Unassigned Normal 6 (5+P+Q+S) S10         6       146.0          ENT       unsupported
A10   Unassigned Normal 6 (5+P+Q+S) S11         6       146.0          ENT       unsupported
A11   Unassigned Normal 6 (5+P+Q+S) S12         6       146.0          ENT       unsupported
A12   Unassigned Normal 6 (6+P+Q)   S13         6       146.0          ENT       unsupported
A13   Unassigned Normal 6 (6+P+Q)   S14         6       146.0          ENT       unsupported
A14   Unassigned Normal 6 (6+P+Q)   S15         6       146.0          ENT       unsupported
A15   Unassigned Normal 6 (6+P+Q)   S16         6       146.0          ENT       unsupported
A16   Unassigned Normal 6 (5+P+Q+S) S17         7       146.0          ENT       unsupported
A17   Unassigned Normal 6 (5+P+Q+S) S18         7       146.0          ENT       unsupported
A18   Unassigned Normal 6 (5+P+Q+S) S19         7       146.0          ENT       unsupported
A19   Unassigned Normal 6 (5+P+Q+S) S20         7       146.0          ENT       unsupported
A20   Unassigned Normal 6 (6+P+Q)   S21         7       146.0          ENT       unsupported
A21   Unassigned Normal 6 (6+P+Q)   S22         7       146.0          ENT       unsupported
A22   Unassigned Normal 6 (6+P+Q)   S23         7       146.0          ENT       unsupported
A23   Unassigned Normal 6 (6+P+Q)   S24         7       146.0          ENT       unsupported

For initial configurations using multi-rank extent pools on a storage unit especially with a homogeneous hardware base (for example, a single type of DDMs using only one RAID level) and resource-sharing workloads, consider configuring the arrays in a round-robin fashion across all available DA pairs by creating the first array from the first array site on the first DA pair, then the second array from the first array site on the second DA pair, and so on. This sequence also sorts the arrays by array size (that is, arrays with or without spares), creating the smaller capacity arrays with spare drives first as shown in Example 5-7. If the ranks are finally created in the same ascending ID sequence from the arrays, the rank ID sequence will also cycle through all DA pairs in a round-robin fashion as seen in Example 5-8 on page 90, which might enhance the distribution of volumes across ranks from different DA pairs within multi-rank extent pools. The creation of successive volumes (using the rotate volumes allocation method) or extents (using the rotate extents allocation method) within a multi-rank extent pool also follows the ascending numerical sequence of rank IDs.
Example 5-7 Array ID sequence sorted by array size (with and without spares) and cycling through all available DA pairs
dscli> lsarray -l
Date/Time: 24 October 2008 11:35:08 CEST IBM DSCLI Version: 5.4.2.257 DS: IBM.2107-75GB192
Array State      Data   RAIDtype    arsite rank DA Pair DDMcap (10^9B) diskclass encrypt
===========================================================================================
A0    Unassigned Normal 6 (5+P+Q+S) S1          2       146.0          ENT       unsupported
A1    Unassigned Normal 6 (5+P+Q+S) S9          6       146.0          ENT       unsupported
A2    Unassigned Normal 6 (5+P+Q+S) S17         7       146.0          ENT       unsupported
A3    Unassigned Normal 6 (5+P+Q+S) S2          2       146.0          ENT       unsupported
A4    Unassigned Normal 6 (5+P+Q+S) S10         6       146.0          ENT       unsupported
A5    Unassigned Normal 6 (5+P+Q+S) S18         7       146.0          ENT       unsupported
A6    Unassigned Normal 6 (5+P+Q+S) S3          2       146.0          ENT       unsupported
A7    Unassigned Normal 6 (5+P+Q+S) S11         6       146.0          ENT       unsupported
A8    Unassigned Normal 6 (5+P+Q+S) S19         7       146.0          ENT       unsupported
A9    Unassigned Normal 6 (5+P+Q+S) S4          2       146.0          ENT       unsupported
A10   Unassigned Normal 6 (5+P+Q+S) S12         6       146.0          ENT       unsupported
A11   Unassigned Normal 6 (5+P+Q+S) S20         7       146.0          ENT       unsupported
A12   Unassigned Normal 6 (6+P+Q)   S5          2       146.0          ENT       unsupported
A13   Unassigned Normal 6 (6+P+Q)   S13         6       146.0          ENT       unsupported
A14   Unassigned Normal 6 (6+P+Q)   S21         7       146.0          ENT       unsupported
A15   Unassigned Normal 6 (6+P+Q)   S6          2       146.0          ENT       unsupported
A16   Unassigned Normal 6 (6+P+Q)   S14         6       146.0          ENT       unsupported
A17   Unassigned Normal 6 (6+P+Q)   S22         7       146.0          ENT       unsupported
A18   Unassigned Normal 6 (6+P+Q)   S7          2       146.0          ENT       unsupported
A19   Unassigned Normal 6 (6+P+Q)   S15         6       146.0          ENT       unsupported
A20   Unassigned Normal 6 (6+P+Q)   S23         7       146.0          ENT       unsupported
A21   Unassigned Normal 6 (6+P+Q)   S8          2       146.0          ENT       unsupported
A22   Unassigned Normal 6 (6+P+Q)   S16         6       146.0          ENT       unsupported
A23   Unassigned Normal 6 (6+P+Q)   S24         7       146.0          ENT       unsupported
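The round-robin creation order shown in Example 5-7 can also be derived programmatically from the array site inventory. The following Python sketch is a hypothetical planning aid (not a DSCLI function) that interleaves the array sites of all DA pairs; because the DS8000 assigns the spares to the first arrays created on each DA pair, this sequence naturally creates the smaller, spare-bearing arrays first. The site lists shown are taken from Example 5-5.

# Hypothetical planning aid: derive a round-robin array creation order across DA pairs.
sites_by_da = {
    "2": ["S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8"],
    "6": ["S9", "S10", "S11", "S12", "S13", "S14", "S15", "S16"],
    "7": ["S17", "S18", "S19", "S20", "S21", "S22", "S23", "S24"],
}

def round_robin_order(sites_by_da):
    """Interleave array sites across DA pairs: 1st site of each DA pair, then 2nd, ..."""
    order = []
    da_pairs = sorted(sites_by_da)
    depth = max(len(sites) for sites in sites_by_da.values())
    for position in range(depth):
        for da in da_pairs:
            if position < len(sites_by_da[da]):
                order.append((sites_by_da[da][position], da))
    return order

for array_id, (site, da) in enumerate(round_robin_order(sites_by_da)):
    print(f"A{array_id}: create from {site} (DA pair {da})")

Running the sketch reproduces the array ID to array site mapping of Example 5-7 (A0 from S1 on DA pair 2, A1 from S9 on DA pair 6, A2 from S17 on DA pair 7, and so forth).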

Note that depending on the installed hardware resources in the DS8000 storage subsystem, you might have different numbers of DA pairs and even different numbers of arrays per DA pair. Also, be aware that you might not be able to strictly follow your initial array ID numbering scheme anymore when upgrading storage capacity by adding array sites to the Storage Unit later.


5.6.3 Rank considerations


A DS8000 rank supports only a single storage type (either CKD or FB). If workloads require different storage types (CKD and FB), the workloads must be isolated to different DS8000 ranks. Ranks are numbered with system-generated IDs starting with R0, R1, and so forth in the sequence in which they are created. For ease of management and performance analysis, it might be preferable to create the ranks by simply following the order of the ascending array ID sequence, so that, for example, rank R27 can be associated with array A27. The association between ranks, arrays, array sites, and DA pairs can be taken from the output of the DSCLI command lsarray -l as shown in Example 5-8.
Example 5-8 Rank, array, array site, and DA pair association as provided by the DSCLI lsarray -l command
dscli> lsarray -l
Date/Time: 24 October 2008 17:43:27 CEST IBM DSCLI Version: 5.4.2.257 DS: IBM.2107-75GB192
Array State    Data   RAIDtype    arsite rank DA Pair DDMcap (10^9B) diskclass encrypt
===========================================================================================
A0    Assigned Normal 6 (5+P+Q+S) S1     R0   2       146.0          ENT       unsupported
A1    Assigned Normal 6 (5+P+Q+S) S9     R1   6       146.0          ENT       unsupported
A2    Assigned Normal 6 (5+P+Q+S) S17    R2   7       146.0          ENT       unsupported
A3    Assigned Normal 6 (5+P+Q+S) S2     R3   2       146.0          ENT       unsupported
A4    Assigned Normal 6 (5+P+Q+S) S10    R4   6       146.0          ENT       unsupported
A5    Assigned Normal 6 (5+P+Q+S) S18    R5   7       146.0          ENT       unsupported
A6    Assigned Normal 6 (5+P+Q+S) S3     R6   2       146.0          ENT       unsupported
A7    Assigned Normal 6 (5+P+Q+S) S11    R7   6       146.0          ENT       unsupported
A8    Assigned Normal 6 (5+P+Q+S) S19    R8   7       146.0          ENT       unsupported
A9    Assigned Normal 6 (5+P+Q+S) S4     R9   2       146.0          ENT       unsupported
A10   Assigned Normal 6 (5+P+Q+S) S12    R10  6       146.0          ENT       unsupported
A11   Assigned Normal 6 (5+P+Q+S) S20    R11  7       146.0          ENT       unsupported
A12   Assigned Normal 6 (6+P+Q)   S5     R12  2       146.0          ENT       unsupported
A13   Assigned Normal 6 (6+P+Q)   S13    R13  6       146.0          ENT       unsupported
A14   Assigned Normal 6 (6+P+Q)   S21    R14  7       146.0          ENT       unsupported
A15   Assigned Normal 6 (6+P+Q)   S6     R15  2       146.0          ENT       unsupported
A16   Assigned Normal 6 (6+P+Q)   S14    R16  6       146.0          ENT       unsupported
A17   Assigned Normal 6 (6+P+Q)   S22    R17  7       146.0          ENT       unsupported
A18   Assigned Normal 6 (6+P+Q)   S7     R18  2       146.0          ENT       unsupported
A19   Assigned Normal 6 (6+P+Q)   S15    R19  6       146.0          ENT       unsupported
A20   Assigned Normal 6 (6+P+Q)   S23    R20  7       146.0          ENT       unsupported
A21   Assigned Normal 6 (6+P+Q)   S8     R21  2       146.0          ENT       unsupported
A22   Assigned Normal 6 (6+P+Q)   S16    R22  6       146.0          ENT       unsupported
A23   Assigned Normal 6 (6+P+Q)   S24    R23  7       146.0          ENT       unsupported

The DSCLI command lsrank -l, as illustrated in Example 5-9 on page 91, shows the actual capacity of the ranks and, after their assignment to extent pools, the association to extent pools and rank groups. This information is important for subsequently configuring the extent pools with regard to the planned workload and capacity requirements. It is important to note that unassigned ranks do not have a fixed or predetermined relationship to any DS8000 processor complex. Each rank can be assigned to any extent pool or any rank group. Only when assigning a rank to an extent pool, and thus rank group 0 or rank group 1, does the rank become associated with processor complex 0 or processor complex 1. Ranks from rank group 0 (even-numbered extent pools: P0, P2, P4, and so forth) are managed by processor complex 0, and ranks from rank group 1 (odd-numbered extent pools: P1, P3, P5, and so forth) are managed by processor complex 1.

For a balanced distribution of the overall workload across both processor complexes, half of the ranks must be assigned to rank group 0 and half of the ranks must be assigned to rank group 1. Also, the ranks with and without spares must be spread evenly across both rank groups. Furthermore, it is important that the ranks from each DA pair are distributed evenly across both processor complexes; otherwise, you might seriously limit the available back-end bandwidth and thus the system's overall throughput. If, for example, all ranks of a DA pair are assigned to only one processor complex, only one DA card of the DA pair is used to access the set of ranks, and thus, only half of the available DA pair bandwidth is available.
Example 5-9 Rank, array, and capacity information provided by the DSCLI command lsrank -l
dscli> lsrank -l
Date/Time: 28 October 2008 14:28:24 CET IBM DSCLI Version: 5.4.2.257 DS: IBM.2107-75GB192
ID  Group State      datastate array RAIDtype extpoolID extpoolnam stgtype exts usedexts encryptgrp
=======================================================================================================
R0  -     Unassigned Normal    A0    6                             fb      634           -
R1  -     Unassigned Normal    A1    6                             fb      634           -
R2  -     Unassigned Normal    A2    6                             fb      634           -
R3  -     Unassigned Normal    A3    6                             fb      634           -
R4  -     Unassigned Normal    A4    6                             fb      634           -
R5  -     Unassigned Normal    A5    6                             fb      634           -
R6  -     Unassigned Normal    A6    6                             fb      634           -
R7  -     Unassigned Normal    A7    6                             fb      634           -
R8  -     Unassigned Normal    A8    6                             fb      634           -
R9  -     Unassigned Normal    A9    6                             fb      634           -
R10 -     Unassigned Normal    A10   6                             fb      634           -
R11 -     Unassigned Normal    A11   6                             fb      634           -
R12 -     Unassigned Normal    A12   6                             fb      763           -
R13 -     Unassigned Normal    A13   6                             fb      763           -
R14 -     Unassigned Normal    A14   6                             fb      763           -
R15 -     Unassigned Normal    A15   6                             fb      763           -
R16 -     Unassigned Normal    A16   6                             fb      763           -
R17 -     Unassigned Normal    A17   6                             fb      763           -
R18 -     Unassigned Normal    A18   6                             fb      763           -
R19 -     Unassigned Normal    A19   6                             fb      763           -
R20 -     Unassigned Normal    A20   6                             fb      763           -
R21 -     Unassigned Normal    A21   6                             fb      763           -
R22 -     Unassigned Normal    A22   6                             fb      763           -
R23 -     Unassigned Normal    A23   6                             fb      763           -

5.7 Planning extent pools


After planning the arrays and the ranks, the next step is to plan the extent pools, which means taking the planned ranks and defining their assignment to extent pools and rank groups, including planning the extent pool IDs. Extent pools are automatically numbered with system-generated IDs starting with P0, P1, and so forth in the sequence in which they are created. Extent pools that are created for rank group 0 are managed by processor complex 0 and have even-numbered IDs (P0, P2, P4, and so forth). Extent pools that are created for rank group 1 are managed by processor complex 1 and have odd-numbered IDs (P1, P3, P5, and so forth). Only in the case of a failure condition or during a concurrent code load will the ownership of a given rank group temporarily be moved to the alternate processor complex. A rank can be assigned to any extent pool or rank group. Each rank provides a particular number of storage extents of a certain storage type (either FB or CKD) to an extent pool. An extent pool finally aggregates the extents from the assigned ranks and provides the logical storage capacity for the creation of logical volumes for the attached host systems.


The assignment of the ranks to extent pools, together with an appropriate concept for the logical configuration and volume layout, is the most essential step to optimize overall subsystem performance. When an appropriate DS8000 hardware base has been selected for the planned workloads (that is, isolated and resource-sharing workloads), the next goal is to provide a logical configuration concept that widely guarantees a balanced workload distribution across all available hardware resources within the storage subsystem at any time: from the beginning, when only part of the available storage capacity is used, up to the end, when almost all of the capacity of the subsystem is allocated. Next, we outline several concepts for the logical configuration for spreading the identified workloads evenly across the available hardware resources.

5.7.1 Single-rank and multi-rank extent pools


Using single-rank or multi-rank extent pools in general does not have any influence on the achievable overall I/O performance. The performance aspect is only related to the final distribution of the volumes and I/O workloads across the available ranks within the extent pools. In order to achieve a uniform subsystem I/O performance and avoid single resources becoming bottlenecks (called hot spots), it is desirable to distribute volumes and workloads evenly across all of the ranks (disk spindles) and DA pairs that are dedicated to a workload. There is no need to strictly go for only single-rank extent pools or only multi-rank extent pools on the whole storage subsystem. You can base your decision on individual considerations for each workload group that is assigned to a set of ranks and thus extent pools. However, if for performance management reasons a configuration concept on the rank level is preferred that strictly relates LSS/LCU IDs to specific ranks, single-rank extent pools might be appropriate (refer to 5.8, Plan address groups, LSSs, volume IDs, and CKD PAVs on page 118).

The decision to use single-rank or multi-rank extent pools also depends on the logical configuration concept that is chosen for the distribution of the identified workloads or workload groups with regard to isolation and resource-sharing considerations. A single mkfbvol or mkckdvol command can create a set of volumes with successive volume IDs from the specified extent pool. If the logical configuration concept requires the creation of a large number of successive volumes for a given workload on a single rank (for example, creating a set of volumes with LUN IDs 0000 - 00ff on a single rank for System z when you prefer a rank-related LCU assignment), using single-rank extent pools might be preferred for these workloads. If the logical configuration concept aims to balance certain workloads or workload groups (especially large resource-sharing workload groups) across multiple ranks with the allocation of volumes or extents on successive ranks, use multi-rank extent pools for these workloads. In this case, you simply take advantage of the DS8000 volume allocation algorithms to spread the volumes evenly across the ranks, which will achieve a well balanced volume distribution with considerably less management effort. With single-rank extent pools, you must distribute these volumes manually.

Single-rank extent pools provide full control of performance management down to the rank level. Multi-rank extent pools using Storage Pool Striping reduce overall management effort by shifting performance management from the rank level to the extent pool level, with an extent pool simply representing a set of merged ranks (a larger set of disk spindles) with a uniform workload distribution. For performance management and analysis reasons, it is crucial to be able to easily relate volumes, which are related to a given I/O workload, to ranks, which finally provide the physical disk spindles for servicing the workloads' I/O requests and determining the I/O processing capabilities. An overall logical configuration concept that easily relates volumes to workloads, extent pools, and ranks is desirable.

Single-rank extent pools provide an easy one-to-one mapping between ranks and extent pools. Because a volume is always created from a single extent pool, single-rank extent pools allow you to precisely control the volume placement across selected ranks and thus manually manage the I/O performance of the different workloads at the rank level. Furthermore, you can obtain the relationship of a volume to its extent pool from the output of the DSCLI lsfbvol or lsckdvol command. Thus, with single-rank extent pools, there is a direct relationship between volumes and ranks based on the volume's extent pool, which makes performance management and analysis easier, especially with host-based tools, such as Resource Measurement Facility (RMF) on System z, and a preferred hardware-related assignment of LSS/LCU IDs.

However, the administrative effort increases, because you have to create the volumes for a given workload in multiple steps from each extent pool separately when distributing the workload across multiple ranks. Furthermore, you choose a configuration design that limits the capabilities of a created volume to the capabilities of a single rank with regard to capacity and performance. With single-rank extent pools, a single volume cannot exceed the capacity or the I/O performance provided by a single rank. So, for demanding workloads, consider creating multiple volumes from different ranks and using host-level-based techniques, such as volume striping, to distribute the workload. You can also waste storage capacity and are likely to benefit less from features, such as dynamic volume expansion (DVE), if unused extents are left on ranks in different extent pools, because a single volume can only be created from extents within a single extent pool, not across extent pools. The decision to strictly use single-rank extent pools also limits the use of features, such as Storage Pool Striping or FlashCopy Space Efficiency, which exploit the capabilities of multiple ranks within a single extent pool.

Multi-rank extent pools allow you to fully exploit the features of the DS8000 virtualization architecture, providing ease of use and also a more efficient usage of all of the available storage capacity in the ranks. Consider multi-rank extent pools especially for workloads that are to be evenly spread across multiple ranks. The DS8000 has always supported multi-rank extent pools with constantly developing volume allocation algorithms in a history of regular performance and usability enhancements. Multi-rank extent pools help to simplify management and volume creation, and they also allow the creation of single volumes that can span multiple ranks and thus even exceed the capacity and performance limits of a single rank.

With a properly planned concept for the extent pools and a reasonable volume layout with regard to the various workloads and the workload planning principles outlined in 5.1, Basic configuration principles for optimal performance on page 64, the latest DS8000 volume allocation algorithms, such as rotate volumes (-eam rotatevols) and rotate extents (-eam rotateexts), take care of spreading the volumes and thus the individual workloads evenly across the ranks within homogeneous multi-rank extent pools. Multi-rank extent pools using Storage Pool Striping reduce the level of complexity for standard performance and configuration management by shifting the overall effort from managing a large number of individual ranks (micro-performance management) to a small number of multi-rank extent pools (macro-performance management). In most standard cases, manual allocation of ranks or even the use of single-rank extent pools is obsolete, because it only achieves the same result as multi-rank extent pools using the rotate volumes algorithm, but with higher administrative effort and the limitations for single-rank extent pools that we previously outlined. Especially when using homogeneous extent pools, which strictly contain only identical ranks of the same RAID level, DDM type, and capacity, together with standard volume sizes, multi-rank extent pools can help to considerably reduce management efforts while still achieving a well balanced distribution of the volumes across the ranks.

Furthermore, even multi-rank extent pools provide full control of volume placement across the ranks in cases where it is necessary to manually enforce a special volume allocation scheme. You can use the DSCLI command chrank -reserve to reserve all of the extents of a rank in an extent pool from being used for the next creation of volumes. Alternatively, you can use the DSCLI command chrank -release to release a rank and make the extents available again.

The major drawback when using multi-rank extent pools compared to single-rank extent pools with regard to performance monitoring and analysis is a slightly higher effort in figuring out the exact relationship of volumes to ranks, because volumes within a multi-rank extent pool can be located on different ranks depending on the extent allocation method that is used and the availability of extents on the ranks during volume creation. The extent pool ID alone, which is given by the output of the lsfbvol or lsckdvol command, generally is insufficient to tell which ranks contribute extents to a given volume. While single-rank extent pools offer a direct relationship between volume, extent pool, and rank due to the one-to-one mapping of ranks to extent pools, you must use the DSCLI commands showfbvol or showckdvol -rank or showrank with multi-rank extent pools in order to determine the location of the volumes on the ranks. The showfbvol or showckdvol -rank command (Example 5-10) lists all of the ranks that contribute extents to a specific volume, and the showrank command (Example 5-11 on page 95) reveals a list of all of the volumes that use extents from the specific rank. When gathering the logical configuration of a whole subsystem, you might prefer the use of the showrank command for each rank, because there are typically considerably fewer ranks than volumes on a DS8000 subsystem. Using a showfbvol -rank or showckdvol -rank command for each volume on a DS8000 can take a considerable amount of time and is more appropriate when investigating the particular extent distribution of individual volumes in question.
Example 5-10 Use of showfbvol -rank command to relate volumes to ranks when using rotate extents
dscli> showfbvol -rank 1a10
Date/Time: 05 November 2008 11:33:52 CET IBM DSCLI Version: 5.4.2.257 DS: IBM.2107-75GB192
Name            w2k_1A10
ID              1A10
accstate        Online
datastate       Normal
configstate     Normal
deviceMTM       2107-900
datatype        FB 512
addrgrp         1
extpool         P2
exts            192
captype         DS
cap (2^30B)     192.0
cap (10^9B)
cap (blocks)    402653184
volgrp          V0
ranks           6
dbexts          0
sam             Standard
repcapalloc
eam             rotateexts     ## Volume 1A10 uses Storage Pool Striping
reqcap (blocks) 402653184
==============Rank extents==============
rank extents
============
R2   32
R3   32
R4   32
R5   32
R6   32
R7   32                        ## Volume 1A10 has 32 extents on ranks
                               ## R2, R3, R4, R5, R6 and R7

Example 5-11 Use of showrank command to relate volumes to ranks in multi-rank extent pools
dscli> showrank r2
Date/Time: 05 November 2008 11:34:06 CET IBM DSCLI Version: 5.4.2.257 DS: IBM.2107-75GB192
ID         R2
SN
Group      0
State      Normal
datastate  Normal
Array      A18
RAIDtype   6
extpoolID  P2
extpoolnam fb_146GB15k_RAID6_SPS_0
volumes    1A00,1A10     ## Volumes 1A00 and 1A10 have extents on rank R2
stgtype    fb
exts       763
usedexts   64
widearrays 1
nararrays  0
trksize    128
strpsize   384
strpesize  0
extsize    16384
encryptgrp -
dscli> showrank r3
Date/Time: 05 November 2008 11:34:11 CET IBM DSCLI Version: 5.4.2.257 DS: IBM.2107-75GB192
ID         R3
SN
Group      0
State      Normal
datastate  Normal
Array      A19
RAIDtype   6
extpoolID  P2
extpoolnam fb_146GB15k_RAID6_SPS_0
volumes    1A01,1A10     ## Volumes 1A01 and 1A10 have extents on rank R3,
stgtype    fb            ## so Volume 1A10 has extents on ranks R2 and R3
exts       763
usedexts   64
widearrays 1
nararrays  0
trksize    128
strpsize   384
strpesize  0
extsize    16384
encryptgrp -
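When gathering the configuration of a whole subsystem, the per-rank volume lists from showrank can be inverted into a volume-to-ranks map. The following Python sketch is a hypothetical helper that post-processes saved DSCLI output (it is not a DSCLI feature); it assumes the volumes field of a showrank command issued for every rank has already been collected into a small dictionary, and the R4 entry shown is purely illustrative.

from collections import defaultdict

# 'volumes' field from a showrank command issued for each rank (as in Example 5-11),
# collected manually or by a wrapper script; the R4 value is illustrative only.
volumes_per_rank = {
    "R2": "1A00,1A10",
    "R3": "1A01,1A10",
    "R4": "1A02,1A10",
}

def volume_to_ranks(volumes_per_rank):
    """Invert the rank -> volumes relationship into volume -> ranks."""
    mapping = defaultdict(list)
    for rank, volume_list in volumes_per_rank.items():
        for volume in volume_list.split(","):
            mapping[volume.strip()].append(rank)
    return mapping

for volume, ranks in sorted(volume_to_ranks(volumes_per_rank).items()):
    print(f"Volume {volume} has extents on ranks: {', '.join(ranks)}")
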

Single-rank extent pools originally were recommended primarily for reasons of performance management, especially in conjunction with the initially released DS8000 extent allocation method, which followed a simple Fill and Spill algorithm. Therefore, in configurations where performance was a major concern, single-rank extent pools were preferred for providing full control of volume placement, as well as performance management. However, single-rank extent pools implicitly require operating system (OS) striping or database (DB) striping to prevent hot volumes. Single-rank extent pools are no guarantee against hot spots. They are just a tool to facilitate strict host-based striping techniques, and they require careful performance planning.

Multi-rank extent pools have always offered advantages with respect to ease of use and space efficiency. And especially with the latest rotate extents algorithm, multi-rank extent pools provide both ease of use and good performance for standard environments, and therefore, they are a good choice to start with for workload groups that have a sufficient number of ranks dedicated to them.

5.7.2 Extent allocation methods for multi-rank extent pools


When creating a volume on a DS8000 with microcode bundle 63.x (or later) using the mkfbvol or mkckdvol DSCLI command, you can specify the extent allocation method (eam) manually for each volume, using either the option -eam rotatevols (rotate volumes, which is the default) or -eam rotateexts (rotate extents). The extent allocation method determines how a volume is created within a multi-rank extent pool with regard to the allocation of the extents on the available ranks.

The extent allocation method can be selected at the volume level and is not an attribute of an extent pool. Basically, you can have volumes created with both methods in the same multi-rank extent pool. In order not to lose the benefits of the rotate extents algorithm with a uniform workload distribution across the ranks, do not carelessly use both extent allocation algorithms together within the same extent pool. For example, consider using rotate volumes as the extent allocation method for specific workloads where host-level striping or application-based striping is preferred, and use the rotate extents algorithm for other workloads in the same multi-rank extent pool where host-level striping is not an option.

The rotate extents algorithm spreads the extents (1 GB for FB volumes and 1113 cylinders or approximately 0.94 GB for CKD volumes) of a single volume, and hence the I/O activity of each volume, across all the ranks in an extent pool and thus across more disks. This approach considerably reduces the occurrences of I/O hot spots at the rank level within the storage subsystem. Storage Pool Striping especially helps to balance the overall workload more evenly across the back-end resources, and thus, it considerably reduces the risk of single ranks becoming performance bottlenecks while providing ease of use with less administration effort.

When using the default rotate volumes (rotatevols) extent allocation method, each volume, one volume after another, is placed on a single rank with a successive distribution across all ranks in a round-robin fashion. With the optional rotate extents (rotateexts) algorithm, the extents of each single volume are spread across all ranks within the extent pool (provided the size of the volume in extents is at least equal to or larger than the number of ranks in the extent pool). The maximum granularity for distributing a single DS8000 volume can easily be achieved by using the rotate extents volume allocation algorithm.

The older volume allocation algorithms before rotate volumes and rotate extents are now referred to as legacy algorithms when listed by the lsfbvol or lsckdvol -l command. Table 5-2 on page 97 shows an overview of the various DS8000 code releases and the extent allocation methods.


History of DS8000 volume allocation algorithms


The first algorithm is the Fill and Spill algorithm (February 2005). When a volume was allocated, it obtained extents from the lowest numbered rank with available extents in an extent pool and continued using extents on that rank until the volume was fully allocated or additional extents on the next rank were required to complete the allocation. This algorithm worked well for the intended use: very large volumes (larger than a single RAID array). However, it did not lead to a balanced distribution of the volumes across the ranks.
Table 5-2 Overview of legacy and current DS8000 extent allocation methods

DS8000 microcode release < 6.0.500.46 (February 2005):
Fill and Spill (now referred to as legacy with the lsfbvol command). LUNs are created on the first rank in the extent pool until all extents are used, and then volume creation continues on the next rank in the extent pool. This initial allocation method does not lead to a balanced distribution of the volumes across multiple ranks in an extent pool.

DS8000 microcode release >= 6.0.500.46 (August 2005):
Most Empty (now referred to as legacy with the lsfbvol command). Each new LUN is created on the rank (in the specified extent pool) with the largest total number of available extents. If more than one rank in the specified extent pool has the same total number of free extents, the volume is allocated on the rank with the lowest rank ID (Rx). If the required volume capacity is larger than the number of free extents on any single rank, volume allocation begins on the rank with the largest total number of free extents and continues on the next rank in ascending numerical sequence of rank IDs (Rx). All extents for a volume are on a single rank unless the volume is larger than the size of a rank or the volume starts towards the end of one rank and spills over onto another rank. If all ranks in the extent pool have the same amount of available extents and if LUNs of the same size are created, a balanced distribution of the volumes across all ranks in ascending rank ID sequence can be achieved.

DS8000 microcode release >= 6.2.420.21 (September 2006):
Rotate LUNs/Rotate Volumes (rotatevols). This more advanced volume allocation algorithm ensures more strictly that successive LUN allocations to a multi-rank extent pool are assigned to different ranks by using an internal pointer, which points to the next rank within the extent pool to start with when creating the next volume. This algorithm especially improves the LUN distribution across the ranks within a multi-rank extent pool independent of LUN sizes or the available free capacity on the ranks.

DS8000 microcode release >= 63.0.104.0 (December 2007):
Rotate extents (rotateexts, which is also referred to as Storage Pool Striping). In addition to the rotate volumes extent allocation method, which remains the default, the new rotate extents algorithm is introduced as an additional option of the mkfbvol command (mkfbvol or mkckdvol -eam rotateexts), which evenly distributes the extents of a single volume across all the ranks within a multi-rank extent pool. This new algorithm, which is also known as Storage Pool Striping (SPS), provides the maximum granularity available on the DS8000 (that is, on the extent level = 1 GB for FB volumes and 0.94 GB or 1113 cylinders for CKD volumes), spreading each single volume across multiple ranks and thus evenly balancing the workload within an extent pool.
The second generation algorithm is called the Most Empty algorithm, which was introduced with DS8000 code level 6.0.500.46 (August 2005). Each new volume was created on whichever rank in the specified extent pool happened to have the largest total number of available extents. If more than one rank in the specified extent pool had the same total number of free extents, the volume was allocated to the rank with the lowest rank ID (Rx). If the required volume capacity was larger than the number of free extents on any single rank, volume allocation began on the rank with the largest total number of free extents and continued on the next rank in ascending numerical sequence of rank IDs (Rx). All extents for a volume were on a single rank unless the volume was larger than the size of a rank or the volume started toward the end of one rank and spilled over onto another rank. If all ranks in the extent pool had the same amount of available extents, and if multiple volumes of the same capacity were created, they were allocated on different ranks in ascending rank ID sequence.

With DS8000 code level 6.2.420.21 (September 2006), the algorithm was further improved and finally replaced by the third volume allocation algorithm called rotate volumes. New volumes are now allocated to ranks in a round-robin fashion, as long as the rank has enough available extents. Typically, volumes have a relationship to a single rank unless the capacity of available extents on a single rank is exceeded. In this case, the allocation continues on subsequent ranks until the volume is fully provisioned. In most respects, this algorithm looks similar to the second algorithm. However, it avoids stacking many small capacity LUNs that were created in sequence on a single rank. The algorithm now more strictly ensures that successive LUN allocations to a multi-rank extent pool are assigned to different ranks by using an internal pointer which points to the next rank within the extent pool to be used when creating the next volume. It especially improves the LUN distribution across the ranks within a multi-rank extent pool when different LUN sizes are used.

With DS8000 code level 63.0.102.0 (December 2007), the fourth and latest volume allocation algorithm called rotate extents or Storage Pool Striping was introduced as an option in addition to the default rotate volumes algorithm. It tries to evenly distribute the extents of a single volume across all the ranks within a multi-rank extent pool in a round-robin fashion. If a rank runs out of available extents, it is skipped. Also, the next new volume to be allocated will start on a different rank from the starting rank of the previous volume, provided there is another rank with available extents. This algorithm further ensures that volumes start on different ranks. Where the volumes end depends on the volume size and the number of ranks containing available extents. This new algorithm provides the maximum granularity available on the DS8000 to spread single volumes across several ranks and to evenly balance the workload within an extent pool.
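To make the difference between the two current allocation methods concrete, the following Python sketch simulates, in a highly simplified way, how rotate volumes and rotate extents place a series of volumes on the ranks of a multi-rank extent pool. It is an illustration only, not the actual microcode logic: free-extent bookkeeping, spill-over, and the skipping of full ranks are deliberately ignored.

# Simplified illustration of the two current allocation methods for a multi-rank
# extent pool (not the actual DS8000 microcode logic).

def rotate_volumes(volume_sizes, num_ranks):
    """Each volume is placed on a single rank; successive volumes use the next rank."""
    placement, next_rank = {}, 0
    for vol, size in enumerate(volume_sizes):
        placement[f"vol{vol}"] = {f"R{next_rank}": size}
        next_rank = (next_rank + 1) % num_ranks
    return placement

def rotate_extents(volume_sizes, num_ranks):
    """The extents of each volume are spread round-robin across all ranks."""
    placement, next_rank = {}, 0
    for vol, size in enumerate(volume_sizes):
        extents_per_rank = {}
        for _ in range(size):                  # one loop iteration per 1 GB extent
            rank = f"R{next_rank}"
            extents_per_rank[rank] = extents_per_rank.get(rank, 0) + 1
            next_rank = (next_rank + 1) % num_ranks
        placement[f"vol{vol}"] = extents_per_rank
    return placement

# Four 6-extent volumes in a 4-rank extent pool
print(rotate_volumes([6, 6, 6, 6], 4))
print(rotate_extents([6, 6, 6, 6], 4))

With rotate volumes, each of the four volumes lands entirely on one rank; with rotate extents, every volume receives extents from all four ranks, and each new volume starts on a different rank than the previous one.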

Considerations for using Storage Pool Striping (rotate extents)


The rotate extents algorithm is the fourth in a series of algorithms that have been used with the DS8000 for allocating volumes. It is also referred to as Storage Pool Striping, and with multi-rank extent pools, it provides both ease of use and good performance for standard environments. By evenly distributing data and, hence, the I/O activity across multiple ranks, Storage Pool Striping might be a good choice to start with for large workload groups with a sufficient number of ranks dedicated to them, for example, a group of resource-sharing workloads of different host operating systems that do not all commonly use or even support host-level striping or application-level striping techniques.

The major decision points for multi-rank extent pool configurations are the number of pools per DS8000 and the number (and type) of RAID arrays in each pool, depending on the identified workload groups for isolation and workload sharing. The workload spreading within an extent pool is then simply managed by the DS8000 volume allocation algorithm. In many cases, the ease of management makes Storage Pool Striping the best choice. Still, you must not ignore the use of host-level striping techniques or application striping techniques in conjunction with Storage Pool Striping in order to balance the overall I/O activity of a given workload across assigned volumes from different extent pools.

Note: Use a performance monitoring tool, such as TotalStorage Productivity Center for Disk, to make sure that the I/O is spread as effectively across the ranks as the volumes and data are spread.


Prior to the introduction of Storage Pool Striping on the DS8000, the maximum I/O performance of a single DS8000 LUN was simply limited by the I/O performance of the underlying rank, because a LUN was generally located on a single rank only. Using host-level striping with LUNs created from several ranks was the recommended way to achieve a single logical volume on the attached host system that was capable of a considerably higher random I/O performance. Now, with the new rotate extents algorithm, even single LUNs can be created that deliver the I/O performance of multiple ranks, taking advantage of all the available disk spindles within an extent pool.

When Storage Pool Striping is used, you typically expect the extents, and thus the workload, to be evenly distributed across all ranks within an extent pool, so generally a volume can simply be related to an extent pool again, which in this case represents a set of evenly used ranks instead of only a single rank. So with Storage Pool Striping, the level of depth for standard performance management and analysis is shifted from a large number of individual ranks (micro-performance management) to a small number of extent pools (macro-performance management), which considerably reduces the overall management effort.

However, if an extent pool is not homogeneous and is created from ranks of different capacities (simply due to the use of RAID arrays with and without spares), a closer investigation of the individual ranks in an extent pool and the distribution of extents across the ranks for given volumes might be required with regard to performance management. Certain volumes that were created from the last available extents in this type of extent pool might only be spread across a smaller number of large capacity ranks in the extent pool. In this case, the DSCLI commands showfbvol or showckdvol -rank or showrank can help to determine the association of volumes to ranks.

Furthermore, the DS8000 only provides performance metrics on the I/O port, rank, and volume level. There are no DS8000 performance metrics available for extents to provide a hot spot analysis on the extent level. So when using rotate extents as the volume allocation algorithm, where the extents of each volume are spread across multiple ranks, you cannot tell how much I/O workload a certain extent or even a volume contributes to a specific rank. Typically, the workload is well balanced across the ranks with rotate extents, so that a single rank becoming a hot spot within a multi-rank extent pool is highly unlikely.

Note: The extents for a single volume are not spread across ranks in a multi-rank extent pool by default. You need to manually specify the -eam rotateexts option of the mkfbvol or mkckdvol command in order to spread the extents of a volume across multiple ranks in an extent pool.

Certain application environments might particularly benefit from the use of Storage Pool Striping. Examples of such environments include:
Operating systems that do not directly support host-level striping
VMware datastores
Microsoft Exchange 2003 or Exchange 2007 databases
Windows clustering environments
Older Solaris environments
Environments that need to sub-allocate storage from a large pool
Resource-sharing workload groups dedicated to a large number of ranks with a variety of different host operating systems, which do not all commonly use or even support host-level striping techniques or application-level striping techniques
Applications with multiple volumes and volume access patterns that differ from day to day


There also are many valid reasons for not using Storage Pool Striping, mainly to avoid unnecessary additional layers of striping and reorganizing I/O requests, which might only increase latency and do not actually help you to achieve a more evenly balanced workload distribution. Multiple independent striping layers might even be counterproductive under certain circumstances. For example, we do not recommend that you create a number of volumes from a single multi-rank extent pool using Storage Pool Striping and then, additionally, use host-level striping or application-based striping on the same set of volumes. In this case, two layers of striping are combined with no overall performance benefit at all. In contrast, creating four volumes from four different extent pools from both rank groups using Storage Pool Striping and then using host-based striping or application-based striping on these four volumes to aggregate the performance of the ranks in all four extent pools and both processor complexes is reasonable.

Examples where you might consider single-rank extent pools or multi-rank extent pools using the default rotate volumes algorithm (with volumes assigned to distinct ranks) are:

SAN Volume Controller (SVC): It is preferable to use SVC with dedicated LUN to rank associations with a small number of LUNs in each rank (for example, one or two volumes per rank). These LUNs become SVC managed disks (MDisks) in MDisk groups. OS logical volumes are virtual disks (VDisks) sub-allocated from MDisk groups, usually striping allocation extents across all the MDisks in the MDisk group. There are many similarities between the design of SVC MDisk groups and DS8000 striped storage pools; however, SVC provides more granular and even customizable stripe sizes than the DS8000.

System i: System i controls its own striping and has its own recommendations about how to allocate storage volumes. So there are no common recommendations to use Storage Pool Striping, although there is likely no adverse consequence in using it. For large System i installations with a large number of ranks, it might be a valid option to use Storage Pool Striping with two or more identical ranks within an extent pool simply to reduce the overall management effort.

System z: System z also controls its own striping and thus is not dependent on striping at the subsystem level. Furthermore, it has its own recommendations about how to allocate storage volumes. Often, multiple successive volumes are created on single ranks with a common LCU ID assignment scheme in place that is related to physical ranks. So here, single-rank extent pools and the use of System z storage management subsystem (SMS) striping might offer a benefit with regard to configuration and performance management. With a direct relation of volumes to ranks and a reasonable strategy for LCU and volume ID numbering, performance management and analysis can be done more easily just with native host-based tools, such as Resource Measurement Facility (RMF), without the need for additional DSCLI outputs in order to relate volumes to ranks. However, for System z installations with a large number of ranks, it might still be a valid option to use Storage Pool Striping with two or more ranks of the same type within an extent pool simply to reduce the overall administration effort by shifting management from the rank level to the extent pool level, with an extent pool simply representing a set of aggregated ranks with an even workload distribution. Refer to 15.8.2, Extent pool on page 461 for more information about System z-related recommendations with regard to multi-rank extent pools.

Database volumes: If these volumes are used by databases or applications that explicitly manage the workload distribution by themselves, these applications might achieve maximum performance simply by using their native techniques for spreading their workload across independent LUNs from different ranks. Especially with IBM DB2 or Oracle, where the vendor recommends specific volume configurations, for example, DB2 balanced configuration units (BCUs) or Oracle Automatic Storage Management (ASM), it is preferable to simply follow those recommendations.


Applications that have evolved particular storage strategies over a long period of time, which have proven their benefits, and where it is not clear whether they will additionally benefit from using Storage Pool Striping. When in doubt, simply follow the vendor recommendations.

Note: For environments where dedicated volume-to-rank allocations are preferred or even required, you can use either single-rank or multiple-rank extent pools if the workloads need to be spread across multiple ranks with successive volume IDs. Multi-rank extent pools using the default rotate volumes extent allocation method together with a carefully planned volume layout will in most cases achieve the same volume distribution with less administration effort than can be achieved with single-rank extent pools.

DS8000 Storage Pool Striping is based on spreading extents across different ranks. So with extents of 1 GB (FB) or 0.94 GB (1113 cylinders/CKD), the size of a data chunk is rather large. For distributing random I/O requests, which are evenly spread across each volume's capacity, this chunk size generally is quite appropriate. However, depending on the individual access pattern of a given application and the distribution of the I/O activity across the volume capacity, certain applications might provide a higher overall performance with more granular stripe sizes for optimizing the distribution of their I/O requests across different RAID arrays, either by using host-level striping techniques or by having the application manage the workload distribution across independent volumes from different ranks.

Additional considerations for the use of Storage Pool Striping for selected applications or environments include:

DB2: Excellent opportunity to simplify storage management using Storage Pool Striping. You probably will still prefer to use DB2 traditional recommendations for DB2 striping for performance sensitive environments.

DB2 and similar data warehouse applications, where the database manages storage and parallel access to data: Here, consider generally independent volumes on individual ranks with a careful volume layout strategy that does not use Storage Pool Striping. Containers or database partitions are configured according to recommendations from the database vendor.

Oracle: Excellent opportunity to simplify storage management for Oracle. You will probably prefer to use Oracle traditional recommendations involving ASM and Oracle's striping capabilities for performance sensitive environments.

Small, highly active logs or files: Small highly active files or storage areas smaller than 1 GB with an extraordinarily high access density might require spreading across multiple ranks for performance reasons. However, Storage Pool Striping only offers a striping granularity on the extent level of around 1 GB, which is too large in this case. Here, continue to exploit host-level striping techniques or application-level striping techniques that support smaller stripe sizes. For example, assume that there is a 0.8 GB log file with extreme write content, and you want to spread this log file across several RAID arrays. Assume that you intend to spread its activity across four ranks. At least four 1 GB extents must be allocated, one extent on each rank (which is the smallest possible allocation). Creating four separate volumes, each with a 1 GB extent from a different rank, and then using Logical Volume Manager (LVM) striping with a relatively small stripe size (for example, 16 MB) effectively distributes the workload across all four ranks. Creating a single LUN of four extents, which is also distributed across the four ranks using DS8000 Storage Pool Striping, cannot effectively spread the file's workload evenly across all four ranks simply due to the large stripe size of one extent, which is larger than the actual size of the file (see the short sketch after this list of considerations).


Tivoli Storage Manager storage pools: Tivoli Storage Manager storage pools work well in striped pools. But in adherence to long standing Tivoli Storage Manager recommendations, the Tivoli Storage Manager databases need to be allocated in a separate pool or pools.

AIX volume groups (VGs): LVM and physical partition (PP) striping continue to be powerful tools for managing performance. In combination with Storage Pool Striping, now considerably fewer stripes are required for common environments. Instead of striping across a large set of volumes from many ranks (for example, 32 volumes from 32 ranks), striping is only required across a small number of volumes from just a small set of different multi-rank extent pools from both DS8000 rank groups using Storage Pool Striping (for example, four volumes from four extent pools, each with eight ranks). For specific workloads, even using the advanced AIX LVM striping capabilities with a smaller granularity on the KB or MB level, instead of Storage Pool Striping with 1 GB extents (FB), might be preferable in order to achieve the highest possible level of performance.

Windows volumes: Typically, only a small number of large LUNs per host system are preferred, and host-level striping is not commonly used. So, basically Storage Pool Striping is an ideal option for Windows environments. It easily allows the creation of single, large capacity volumes that offer the performance capabilities of multiple ranks. A single volume no longer is limited by the performance limits of a single rank, and the DS8000 simply handles spreading the I/O load across multiple ranks.

Microsoft Exchange: Storage Pool Striping makes it much easier for the DS8000 to conform to Microsoft sizing recommendations for Microsoft Exchange databases and logs.

Microsoft SQL Server: Storage Pool Striping makes it much easier for the DS8000 to conform to Microsoft sizing recommendations for Microsoft SQL Server databases and logs.

VMware Datastore for Virtual Machine Storage Technologies (VMware ESX Server Filesystem (VMFS) or virtual raw device mapping (RDM) access): Because datastores concatenate LUNs rather than striping them, just allocate the LUNs inside a striped storage pool. Estimating the number of disks (or ranks) to support any given I/O load is straightforward based on the given requirements.

In general, Storage Pool Striping helps to improve overall performance and reduce the effort of performance management by evenly distributing workloads across a larger set of ranks, reducing skew and hot spots. Certain application workloads can also benefit from the higher number of disk spindles behind only a single volume. But, there are cases where host-level striping or application-level striping might achieve even higher performance, of course at the cost of higher overall administration effort. Storage Pool Striping still might deliver good performance in these cases, but manual striping with careful configuration planning is required to achieve the best possible performance. So with regard to overall performance and ease of use, Storage Pool Striping might still offer an excellent compromise for many environments, especially for larger workload groups where host-level striping techniques or application-level striping techniques are not widely used or not even available.
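The following minimal Python sketch illustrates the stripe size effect described in the "Small, highly active logs or files" item above. It is illustrative only; the 16 MB value is just an example LVM stripe size, and it simply counts how many ranks a 0.8 GB file actually touches when striped in 1 GB extents versus 16 MB LVM stripes across four ranks.

def ranks_touched(file_mb, stripe_mb, num_ranks):
    """Return how many of the ranks hold at least one stripe of the file."""
    stripes = -(-file_mb // stripe_mb)       # ceiling division: number of stripes used
    return min(stripes, num_ranks)

print(ranks_touched(800, 1024, 4))   # 1 GB extents (Storage Pool Striping): 1 rank
print(ranks_touched(800, 16, 4))     # 16 MB LVM stripes: 4 ranks

The 0.8 GB file fits entirely within one 1 GB extent and therefore drives only a single rank, whereas 16 MB host-level stripes spread its activity across all four ranks.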
Note: Business and performance critical applications always require careful configuration planning and individual decisions on a case-by-case basis about whether to use Storage Pool Striping or LUNs from dedicated ranks together with host-level striping techniques or application-level striping techniques for the best performance.

Storage Pool Striping is best suited for completely new extent pools. Adding new ranks to an existing extent pool will not restripe volumes (LUNs) that are already allocated in that pool. So, adding single ranks to an extent pool that uses Storage Pool Striping when running out of available capacity simply undermines the concept of Storage Pool Striping and can easily lead to hot spots on the added ranks. Thus, you need to perform capacity planning using Storage Pool Striping on an extent pool level and not on a rank level. In order to upgrade capacity on a DS8000 using Storage Pool Striping, simply add new extent pools (preferably in groups of two: one for each rank group or processor complex) with a specific number of ranks per extent pool based on your individual configuration concept (for example, using multi-rank extent pools with four to eight ranks).

5.7.3 Balancing workload across available resources


To achieve a balanced utilization of all available resources of the DS8000 storage subsystem, you need to distribute the I/O workloads evenly across the available subsystem back-end resources:
- Ranks (disk drive modules)
- Device adapter (DA) pairs

And, you need to distribute the I/O workloads evenly across the available subsystem front-end resources:
- I/O ports
- Host adapter (HA) cards
- I/O enclosures

You need to distribute the I/O workloads evenly across both DS8000 processor complexes, as well.

Configuring the extent pools determines the balance of the workloads across the available back-end resources: ranks, DA pairs, and both processor complexes. Each extent pool is associated with an extent pool ID (P0, P1, P2, and so forth). Each rank has a relation to a specific DA pair and can be assigned to only one extent pool. There can be as many (non-empty) extent pools as there are ranks. Extent pools can simply be expanded by adding more ranks to the pool. However, when assigning a rank to a specific extent pool, the affinity of this rank to a specific DS8000 processor complex is determined. There is no predefined affinity of ranks to a processor complex by hardware. All ranks assigned to even-numbered extent pools (P0, P2, P4, and so forth) form rank group 0 and are serviced by DS8000 processor complex 0. All ranks assigned to odd-numbered extent pools (P1, P3, P5, and so forth) form rank group 1 and are serviced by DS8000 processor complex 1.

For a balanced distribution of the overall workload across both processor complexes and both DA cards of each DA pair, apply the following rules for each type of rank with regard to RAID level, storage type (FB or CKD), and disk drive characteristics (disk type, rpm speed, and capacity):
- Assign half of the ranks to even-numbered extent pools (rank group 0) and assign half of them to odd-numbered extent pools (rank group 1).
- Spread ranks with and without spares evenly across both rank groups.
- Distribute ranks from each DA pair evenly across both rank groups.

It is important to understand that you might seriously limit the available back-end bandwidth and thus the system's overall throughput if, for example, all ranks of a DA pair are assigned to only one rank group and thus a single processor complex. In this case, only one DA card of the DA pair is used to service all the ranks of this DA pair, and thus only half of the available DA pair bandwidth is available.
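As a hypothetical DSCLI sketch of these rules, the following commands create one FB extent pool per rank group and assign the two ranks of one DA pair to different rank groups. The pool names and rank IDs are illustrative only, and the exact syntax of mkextpool and chrank -extpool should be verified against the DSCLI reference for your code level:
# mkextpool -rankgrp 0 -stgtype fb fb_pool_da2_0
# mkextpool -rankgrp 1 -stgtype fb fb_pool_da2_1
# chrank -extpool P0 R0
# chrank -extpool P1 R1
Repeating this pattern for every DA pair ensures that both DA cards of each DA pair, and both processor complexes, service an equal share of the ranks.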


5.7.4 Assigning workloads to extent pools


Extent pools can only contain ranks of the same storage type, either fixed block (FB) for Open Systems or System i or count key data (CKD) for System z. The ranks within an extent pool must have the same RAID type and the same disk drive characteristics (type, size, and rpm speed), so that the storage extents in the extent pool have identical performance characteristics. Multiple extent pools, each with different rank characteristics, easily allow tiered storage concepts. For example, you can have extent pools with slow, large-capacity drives for backup purposes (for example, 300 GB, 10k rpm) and other extent pools with high-speed, small-capacity drives (for example, 73 GB, 15k rpm) for performance-critical transaction applications. Furthermore, using dedicated extent pools with an appropriate number of ranks and DA pairs is a suitable approach for isolating workloads.

The minimum number of required extent pools depends on the following considerations:
- The number of isolated and resource-sharing workload groups
- The number of different storage tiers and service level agreements
- The number of RAID levels, disk drive modules, and storage types (FB or CKD)

Although you are not restricted from assigning all ranks to only one extent pool, the minimum number of extent pools, even with only one workload on a homogeneously configured DS8000, needs to be two (for example, P0 and P1) with one extent pool for each rank group, so that the overall workload is balanced across both processor complexes. To optimize performance, even the ranks for each workload group (either isolated or resource-sharing workload groups) need to be split across at least two extent pools with an equal number of ranks from each rank group, so that each workload is also balanced across both processor complexes at the workload level. Typically, you need to assign an equal number of ranks from each DA pair to extent pools assigned to processor complex 0 (rank group 0: P0, P2, P4, and so forth) and to extent pools assigned to processor complex 1 (rank group 1: P1, P3, P5, and so forth). In environments with FB and CKD storage (Open Systems and System z), you additionally need separate extent pools for CKD and FB volumes, which leads to a recommended minimum of four extent pools to balance the capacity and I/O workload between the two DS8000 processor complexes.

Additional extent pools might be desirable in order to meet individual needs, such as ease of use, implementing tiered storage concepts, or simply separating ranks with regard to different DDM types, RAID types, clients, applications, performance, or Copy Services requirements. The maximum number of extent pools, however, is given by the number of available ranks (that is, creating one extent pool for each rank).

Creating dedicated extent pools on the DS8000 with dedicated back-end resources for separate workloads allows individual performance management for business and performance-critical applications. Compared to easier-to-manage share-and-spread-everything storage subsystems that offer no way to implement workload isolation concepts, this capability is an outstanding feature of the DS8000 as an enterprise-class subsystem. It allows you to consolidate and manage various application demands with different performance profiles, which are typical in enterprise environments, on a single storage subsystem, but at the cost of higher administration effort.
Before actually configuring the extent pools, we advise you to collect all the hardware-related information of each rank with regard to the associated DA pair, disk type, available storage capacity, RAID level, and storage type (CKD or FB) in a spreadsheet and then to plan the distribution of the workloads across the ranks and their assignments to extent pools.
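You can collect the required hardware information with the standard DSCLI list commands before planning the extent pools, as in the following sketch (the -l option shows the detailed output; the exact set of output columns depends on the code level):
# lsarraysite -l
# lsarray -l
# lsrank -l
The array site list shows the DA pair and disk drive information for each array site, the array list shows the RAID type of each array, and the rank list shows the storage type, extent counts, and (once assigned) the extent pool of each rank.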


As the first step, you can visualize the rank and DA pair association in a simple spreadsheet based on the graphical scheme given in Figure 5-7.

Figure 5-7 Basic scheme for rank and DA pair association with regard to extent pool planning (the figure arranges the 6+P+S and 7+P ranks of DA pairs DA2, DA0, DA3, and DA1 in two columns, one for processor complex 0 and one for processor complex 1)

This example represents a homogeneously configured DS8100 with four DA pairs and 32 ranks all configured to RAID 5. Based on the specific DS8000 hardware and rank configuration, the scheme typically becomes more complex with regard to the number of DA pairs, ranks, different RAID levels, disk drives, spare distribution, and storage types. Based on this scheme, you can easily start planning an initial assignment of ranks to your planned workload groups, either isolated or resource-sharing, and extent pools with regard to your capacity requirements as shown in Figure 5-8.
Figure 5-8 Initial spreadsheet for workload assignments to ranks and extent pools with regard to capacity requirements (in this example, workload A is the OLTP main business application, workload B comprises various Open Systems workloads, and workload C is an isolated System z workload; the spreadsheet lists each rank with its DA pair, RAID level, capacity, disk type, storage type, assigned workload, extent pool, and extent allocation method)

After this initial assignment of ranks to extent pools and appropriate workload groups, you can create additional spreadsheets to hold more details about the logical configuration and finally the volume layout with regard to array site IDs, array IDs, rank IDs, DA pair association, extent pool IDs, and even volume IDs, as well as their assignment to volume groups and host connections, as shown, for example, in Figure 5-9 on page 106.

Figure 5-9 Example of a detailed spreadsheet for planning and documenting the logical configuration

5.7.5 Planning for multi-rank extent pools


This section outlines ideas for configuring multi-rank extent pools based on the latest rotate volumes and rotate extents volume allocation algorithms without specifying in general how many storage pools or how many RAID ranks per storage pool to use. The previously introduced concepts of workload isolation compared to resource-sharing still apply, and you must identify the workload groups, as well as the dedicated resources, carefully. As an example of the various ways to configure multi-rank extent pools, consider a set of uniformly configured ranks from a DS8100 (same RAID level, same type disk drives, and same storage type) as shown in Figure 5-10. All ranks are dedicated to a single workload group (for example, a large resource-sharing workload group). The ranks have already been marked by different colors with regard to their capacity and planned rank group assignment.
Figure 5-10 Example of a set of uniformly configured ranks with their association to DA pairs (32 ranks, 6+P+S and 7+P, on DA pairs DA2, DA0, DA3, and DA1, split between processor complex 0 and processor complex 1)


The minimum recommended number of extent pools in this case is two (P0 and P1) on a uniformly equipped subsystem in order to spread the workload evenly across both DS8000 processor complexes as shown in Figure 5-11 on page 108. Note that here you have two extent pools, which are equal in available capacity, but contain ranks with different numbers of extents per rank. So here, you need to be aware that the last volumes will only be created from extents of the large capacity ranks as soon as the capacity of the small capacity ranks is exceeded.

Two large extent pools might be a good choice when planning to use the rotate volumes extent allocation method with dedicated volumes on multiple ranks and a standard volume size as shown in Figure 5-11 on page 108. A two extent pool configuration might be convenient if FlashCopy SE is not used and all workloads are meant to share the same resources (workload resource-sharing). Here, simply distributing the ranks from each DA pair evenly across extent pools P0 and P1 offers all of the flexibility as well as ease of use. Using a standard volume size and aligning the number of volumes of a given workload to the number of ranks within the dedicated extent pools will help achieve a balanced workload distribution. The use of host-level striping or application-level striping will further optimize workload distribution evenly across the hardware resources.

However, with regard to the different rank sizes and due to the distribution of spare drives (6+P+S and 7+P arrays for RAID 5 in this example), you might also consider using four strictly homogeneous extent pools (P0, P1, P2, and P3), where each extent pool has only ranks of the same capacity as shown in Figure 5-11 on page 108. Note that with homogeneous RAID 5 and RAID 6 configurations, you might have four arrays with spares on a DA pair, but only two with a RAID 10 configuration. So with RAID 10 configurations, there might not be enough 3x2+2S arrays available to use only homogeneous extent pools. Of course, your configuration also depends on the number of available arrays, as well as the number of ranks required per extent pool.

In either case, try to spread ranks from each available DA pair evenly across all extent pools, so that the overall workload is spread evenly across all DA pairs. Be aware that you still need to manually distribute the volumes of each application workload evenly across all extent pools dedicated to this workload group. Furthermore, when using Storage Pool Striping, typically, you plan for a total of four to eight ranks per extent pool for an optimum performance benefit.


Figure 5-11 Example of basic multi-rank extent pool configurations (left: a configuration with two extent pools, P0 and P1, each containing sixteen 6+P+S and 7+P ranks; right: a configuration with four extent pools, P0 and P1 with the 6+P+S ranks and P2 and P3 with the 7+P ranks)

Homogeneous extent pools with ranks of the same capacity offer the best basis for the DS8000 volume allocation algorithms in order to achieve a strictly balanced distribution of the volumes across all ranks within an extent pool, especially with standard volume sizes. Such pools lead to extent pools with different amounts of available capacity, but they reduce management effort by providing identical performance characteristics for all of the volumes created from the same extent pool up to the allocation of the last extents, especially when using rotate extents as the preferred extent allocation method.

Alternatively, if having extent pools that are equal in size is a major concern, consider four non-homogeneous extent pools with a mixed number of ranks with and without spares as shown in Figure 5-12 on page 109. In this case, however, additional management effort and care are required to control the volume placement when the capacity of the smaller ranks is exceeded, especially when using the rotate extents allocation method. The administrator needs to be aware that the volumes that are created from the last available extents provide lower performance, because they are only distributed across a smaller number of large capacity ranks and thus use fewer disk spindles when compared to the initially created volumes that span all ranks. Because there is no warning message when this condition applies, one method to control the usage of these final extents in non-homogeneous extent pools, and to reserve them for less demanding applications, is to initially create dummy volumes on these large capacity ranks with the help of the chrank -reserve and chrank -release DSCLI commands, thus setting aside these additional extents.


Figure 5-12 Alternate example of a four extent pool configuration with equal extent pool capacities (each of the pools P0, P1, P2, and P3 contains a mix of 6+P+S and 7+P ranks from DA pairs DA2, DA0, DA3, and DA1)

You can achieve a homogeneous extent pool configuration with extent pools of different rank capacities simply by following these steps after the initial creation of the extent pools:
1. Identify the number of available extents on the ranks and the assignment of the ranks to the extent pools from the output of the lsrank -l command. Calculate the number of extents that make up the difference between the small and large ranks.
2. Use the DSCLI command chrank -reserve against all smaller (for example, 6+P+S) ranks in order to reserve all extents on these ranks within each extent pool from being used for the creation of the dummy volumes in the next step.
3. Now, create a number of dummy volumes using the mkfbvol command from each extent pool according to the number of large ranks in the extent pool and the additional capacity of these ranks in comparison to the smaller arrays. For example, with 16 ranks in extent pool P0 as shown in Figure 5-11 on page 108, you have eight small capacity (6+P+S) ranks and eight large capacity (7+P) ranks. With 73 GB disk drives, you get 388 extents per 6+P+S rank and 452 extents per 7+P rank. In this case, you need to create eight dummy volumes of 64 extents in size (volume size = 64 GB, binary) per extent pool using two mkfbvol DSCLI commands:
# mkfbvol -extpool P0 -cap 64 -type ds -name dummy_vol ee00-ee07
# mkfbvol -extpool P1 -cap 64 -type ds -name dummy_vol ef00-ef07
In this example, we use LSS ee with volume IDs ee00-ee07 for P0 (even extent pool) dummy volumes and LSS ef with volume IDs ef00-ef07 for P1 dummy volumes. The rotate extents volume allocation algorithm automatically distributes the volumes across the ranks.
4. Use the DSCLI command chrank -release against all smaller (6+P+S) ranks in order to release all extents on these ranks again so that finally all ranks in the extent pools are available for the creation of volumes for the attached host systems.
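The following consolidated sketch shows one possible DSCLI command sequence for this procedure on extent pools P0 and P1. The rank IDs used with chrank are hypothetical (they depend on your configuration and must be taken from the lsrank output), and the exact command syntax should be verified against the DSCLI reference for your code level:
# lsrank -l
# chrank -reserve R0
(Repeat the chrank -reserve command for each remaining 6+P+S rank in P0 and P1.)
# mkfbvol -extpool P0 -cap 64 -type ds -name dummy_vol ee00-ee07
# mkfbvol -extpool P1 -cap 64 -type ds -name dummy_vol ef00-ef07
# chrank -release R0
(Repeat the chrank -release command for each previously reserved rank.)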


Now, we have effectively created a homogeneous extent pool configuration with equal usable capacity per rank. You can remove the dummy volumes when the last amount of storage capacity needs to be allocated. However, remember that the volumes that will be created from these final extents on the large (7+P) arrays are distributed only across half of the ranks in the extent pool, so consider using this capacity primarily for applications with lower I/O demands. Of course, you can also apply a similar procedure to a four extent pool configuration with 6+P+S and 7+P ranks mixed in each extent pool if you prefer four identical extent pools with exactly the same amount of storage capacity. However, simply using separate extent pools for 6+P+S and 7+P ranks reduces the administration effort.

Another consideration for the number of extent pools to create is the usage of Copy Services, such as FlashCopy Space Efficient (FlashCopy SE). If you use FlashCopy SE, you also might consider a minimum of four extent pools with two extent pools per rank group, as shown in Figure 5-13. Because the FlashCopy SE repository for the Space Efficient target volumes is distributed across all available ranks within the extent pool (comparable to using Storage Pool Striping), we recommend that you distribute the source and target volumes across different extent pools (that is, different ranks) from the same DS8000 processor complex (that is, the same rank group) for the best FlashCopy performance. Each extent pool can have FlashCopy source volumes, as well as repository space for Space Efficient FlashCopy target volumes whose source volumes are in the alternate extent pool. However, for certain environments, consider a dedicated set of extent pools using RAID 10 arrays just for FlashCopy SE target volumes, while the other extent pools using RAID 5 arrays are only used for source volumes.

You can still separate workloads using different extent pools with regard to the principles of workload isolation, as seen in Figure 5-14 on page 111, using, for example, rotate extents (Storage Pool Striping) as the extent allocation method in the extent pools for the resource-sharing workloads and rotate volumes in the extent pools for the isolated workload. The isolated workload can further use host-level striping or application-level striping. The workload isolation in this example is done on the DA pair level.

Figure 5-13 Example of a four extent pool configuration using FlashCopy SE (7+P ranks in pools P2 and P3 with Space Efficient repository pools, split between processor complex 0 and processor complex 1)


Figure 5-14 Example of multi-rank extent pools with workload isolation on the DA pair level (an isolated workload with its own extent pools and FlashCopy SE repository pools on a dedicated DA pair, and shared extent pools on the remaining DA pairs, split between processor complex 0 and processor complex 1)

5.7.6 Planning for single-rank extent pools


Single-rank extent pools can make it easier to control volume assignment to particular ranks. Furthermore, they can simplify performance analysis and management, because volumes are related to an extent pool and thus are directly related to a single rank. So typically, no additional DSCLI commands, such as showrank or showfbvol/showckdvol -rank, are required to associate a volume with a rank when analyzing performance statistics at the host level, especially if volume IDs with hardware-related LSS/LCU IDs are used as introduced in 5.8.2, Volume configuration scheme using hardware-bound LSS/LCU IDs on page 124. If a workload is not meeting its performance objectives, any resource contention can be identified rather quickly with a logical configuration strategy that uses a consistent and intuitive relationship between volumes and ranks.

If full control of volume placement and manual performance management on the rank level are required, with successive volumes on distinct ranks using hardware-related LSS/LCU concepts, and no usage of FlashCopy SE or Storage Pool Striping, consider single-rank extent pools for these workloads. Note that careful planning of the logical configuration and the distribution of the workloads with regard to their performance objectives is required in this case, with an increase in overall management effort. Furthermore, single-rank extent pools implicitly require host-level striping techniques or application-level striping techniques to prevent hot volumes and to spread the overall workload across the available resources in a balanced manner. Single-rank extent pools are no guarantee against hot spots, but they offer the highest level of performance management, because the workload of a given volume is directly related to a single rank. Hence, they require more careful performance planning to reduce skew and to prevent single ranks from becoming bottlenecks.
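In contrast, for volumes in multi-rank extent pools, the volume-to-rank mapping is typically queried with the DSCLI, as in the following sketch. The volume and rank IDs are hypothetical, and the exact output depends on your code level; the -rank option displays the rank extent distribution of the specified volume:
# showfbvol -rank 1000
# showckdvol -rank 0a00
# showrank R16
The first two commands list the ranks (and extents per rank) behind an FB or CKD volume, and showrank displays the attributes of a given rank, including the volumes that have extents allocated on it.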


Under certain circumstances, you might consider modifying this strict single-rank approach to extent pools with more than just one rank using Storage Pool Striping, for example:
- A volume size larger than a single rank is required. In this case, use single-rank extent pools where possible, and assign the minimum number of ranks needed to multi-rank extent pools. Here, you can even consider the use of Storage Pool Striping so that the volume is evenly related to the set of ranks associated with this extent pool.
- Required volume sizes result in unacceptable unused capacity if single-rank extent pools are used. In this case, use as low a rank-to-extent pool ratio as possible. Even here, you might consider the usage of Storage Pool Striping for these extent pools so that the volumes are evenly associated with all the ranks within the same extent pool.
- Installations with a large number of ranks require considerable administration effort. If performance management on the individual rank level is not actually required, you can reduce this overall administration effort by using extent pools with two or more identical ranks and Storage Pool Striping instead of single-rank extent pools.

Note that in these cases, when using multi-rank extent pools with Storage Pool Striping, the level of performance management is shifted from the rank to the extent pool level (a set of uniformly utilized ranks) and the capability to control performance on the rank level for the volumes within such an extent pool is lost.

Configuration technique for simplified performance management


A logical configuration strategy that easily relates volume IDs to extent pools and ranks can help to simplify performance management and analysis. This section describes the initial steps of such a logical configuration process, configuring the arrays, ranks, and extent pools in a certain way when using single-rank extent pools with a hardware-related volume and LSS/LCU ID configuration concept as further described in 5.8.2, Volume configuration scheme using hardware-bound LSS/LCU IDs on page 124. It is easiest to implement using the DSCLI, so it might not apply to all DS8000 environments. The following procedure is one way to establish distinct relationships between volumes, extent pools, ranks, and DA pairs:
1. Create arrays, ranks, and extent pools in a sequence that will illustrate the association to DAs as described next.
2. Create a separate extent pool for each rank (single-rank extent pools).
3. Plan for unique address groups for major workloads or standard volume types.
4. Plan for unique LSS/LCU IDs for each rank when creating the volumes. For CKD environments, it might be convenient to choose LCU IDs that match the first two digits of the z/OS device ID. Refer to Example configurations using single-rank extent pools on page 126 for more details.
5. Use DSCLI volume nicknames that identify either the workload or the DS8000 hardware resources associated with the volume.
6. Use extent pool nicknames that identify either the workload or the DS8000 hardware resources associated with the extent pool.

Note: This logical configuration approach requires single-rank extent pools and is easiest to implement using the DSCLI. Other approaches to making the relationship between extent pools, ranks, DA pairs, and volumes easy to see might include using volume nicknames that indicate rank and DA pair or using LSS IDs that indicate DA pair and rank.
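As a hypothetical illustration of steps 1 and 2, the following DSCLI sketch creates the arrays, ranks, and single-rank extent pools for the first two array sites on DA2. The array site IDs, RAID type, and pool nicknames are illustrative only, and the exact parameters of mkarray, mkrank, mkextpool, and chrank should be verified against the DSCLI reference for your code level:
# mkarray -raidtype 5 -arsite S1
# mkarray -raidtype 5 -arsite S2
# mkrank -array A0 -stgtype fb
# mkrank -array A1 -stgtype fb
# mkextpool -rankgrp 0 -stgtype fb DA2_R0_pool
# mkextpool -rankgrp 1 -stgtype fb DA2_R1_pool
# chrank -extpool P0 R0
# chrank -extpool P1 R1
Creating the resources strictly in this order keeps the system-generated array, rank, and extent pool IDs aligned (A0/R0/P0, A1/R1/P1, and so on), which is the basis for the ID scheme described next.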


Array sites are assigned system-generated IDs in sequence beginning with S1 at DS8000 installation. Arrays, ranks, and extent pools are assigned system-generated IDs when they are created by the user during logical configuration. They are sequentially numbered in order of resource creation. For example, the first array created will be assigned array ID A0, the second array created will be assigned array ID A1, and so on, independent of which array sites are used to create the arrays. The same applies to ranks with rank IDs R0, R1, R2, and so on, which are assigned in the sequence of their creation independent of the array ID that is used. Extent pool IDs are also system-generated and sequentially numbered in order of creation, but extent pool IDs are also affected by the user assignment of the extent pool to processor complex 0 (even-numbered IDs) or processor complex 1 (odd-numbered IDs). The first extent pool created and assigned to processor complex 0 (rank group 0) will be assigned extent pool ID P0. The second extent pool created and assigned to processor complex 0 will be assigned extent pool ID P2, and so on. The first extent pool created and assigned to processor complex 1 (rank group 1) will be assigned extent pool ID P1. The second extent pool created and assigned to processor complex 1 will be assigned extent pool ID P3, and so on.

Note: When an array, rank, or extent pool is deleted, the corresponding resource ID is freed and will be used for the next resource of the same resource type that is created. For example, if ranks R0 - R7 were created on array sites S1 - S8, and R0 (on array site S1) is deleted, when array site S9 is used to create an array, it will be assigned array ID A0. If array site S1 is then used to create an array, it will be assigned array ID A8.

A consistent association of array site, array, rank, extent pool, and LSS/LCU IDs simplifies performance analysis and management, because the volume ID (which includes the LSS ID) implies a corresponding rank and DA pair, and the reverse is true also. The consistent association shown in the following example is based on the order of installation of disks on DA pairs (DA2, DA0, DA6, DA4, DA7, DA5, DA3, DA1, DA2, DA0, DA6, DA4, DA7, DA5, DA3, DA1). For more information about the order of installation of disks on DA pairs, refer to 2.4.6, Order of installation on page 21.

Note: Because array site IDs begin at 1 and array IDs begin at 0, it is not uncommon to have odd-numbered array sites associated with even-numbered array IDs, or to have array site IDs that are one greater than array IDs. However, array, rank, and extent pool IDs need to have the same number (for example, A10, R10, and P10) by following the process that we describe next. The LSS ID needs to be the hexadecimal equivalent (for example, LSS 10 means 0x0A).

You can achieve a consistent association between array site, array, rank, extent pool, and LSS/LCU IDs by following these steps:
1. Create arrays from all array sites associated with DA2 in ascending order of array site ID. That is, if DA2 has array sites S1 - S8, create an array on array site S1 first, create an array on array site S2 next, and so on.

Note: DA2/DA0 can have more than eight array sites if disk enclosure pairs are installed in a second/third expansion frame.


2. Repeat creating arrays from all array sites in ascending order of array site ID for array sites associated with DA0, DA6, DA4, DA7, DA5, DA3, and DA1 in order as long as array sites exist. Note that the order of array sites assigned to DAs in this way might not necessarily follow the system-generated sequential numbering of array site IDs.

Note: DA0 might have more than eight array sites if disk enclosure pairs are installed in a second expansion frame.

3. Create ranks from arrays in array ID numerical order. That is, create the first rank from array A0, the second rank from array A1, and so on.
4. Create as many extent pools as ranks, with half of the extent pools assigned to processor complex 0 (which is also referred to as server0, managing rank group 0) and half of the extent pools assigned to processor complex 1 (which is also referred to as server1, managing rank group 1).
5. Assign ranks with even-numbered rank IDs to even-numbered extent pools, in ascending order. That is, assign rank R0 to extent pool P0, assign rank R2 to extent pool P2, and so on.
6. Assign ranks with odd-numbered rank IDs to odd-numbered extent pools, in ascending order. That is, assign rank R1 to extent pool P1, assign rank R3 to extent pool P3, and so on.
7. When planning the volumes, use hexadecimal volume IDs whose first two digits match the hexadecimal equivalent of the extent pool ID. For example, create volumes 00zz from extent pool P0, volumes 01zz from extent pool P1, and so on.

If additional volume addresses are required (more than 256 per rank), one or more unique LSS/LCU IDs can be added to each rank. Refer to Volume configuration scheme using hardware-bound LSS/LCU IDs on page 124 for more details about this hardware-related configuration concept. You can simplify performance analysis if you follow a similar approach for logical configurations on all DS8000s.

Next, we apply this configuration strategy to a DS8300 with a fully populated base frame and one fully populated expansion frame as introduced in 5.5.2, DS8000 configuration example 2: Array site planning considerations on page 75. Figure 5-15 on page 115 shows a schematic of this DS8300 with a fully populated base frame and one fully populated expansion frame.


Figure 5-15 DS8000 configuration example 2 (fully populated base and expansion frame): 12 disk enclosure pairs, 384 disk drives total, and 48 array sites (S1-S48) on 6 DA pairs: DA2 (I/O enclosures 2 and 3), DA0 (I/O enclosures 0 and 1), DA6 (I/O enclosures 6 and 7), DA4 (I/O enclosures 4 and 5), DA7 (I/O enclosures 6 and 7), and DA5 (I/O enclosures 4 and 5). DA1 and DA3 are not used unless a second expansion frame is added.

Figure 5-16 on page 116 and Figure 5-17 on page 117 show a schematic of an example logical configuration for this DS8000 with:
- A unique extent pool for each rank.
- One unique LSS for each rank. If additional addresses are needed (for example, to provide additional small CKD volumes or PAVs, or to identify different workloads), you can add additional unique LSSs.
- A one-to-one relationship between the array, rank, extent pool, and LSS, so that each horizontal box in the schematic represents an array, a rank, an extent pool, and an LSS.
- All volumes on a given rank associated with the same unique extent pool and LSS.
- Array site, array, rank, and extent pool IDs that match up to show the association between these hardware resources and imply a DA association that is based on the convention of using array sites in the order of the installation of disks on DAs.

Figure 5-16 on page 116 shows the logical configuration for the DS8000 base frame. For the base frame, the sequence of array site IDs matches the order of the installation of disks on DA pairs (DA2 followed by DA0). Array sites on DA2 and DA0 were configured in ascending order, so the array site IDs (which begin with S1) are all one greater than the corresponding array, rank, and pool numbers. The LSS IDs are the hexadecimal equivalent of the corresponding array, rank, and pool IDs.


Figure 5-16 DS8000 configuration example 2: Base frame

Figure 5-17 on page 117 shows a schematic of the logical configuration for the first expansion frame. Logical configuration of this expansion frame also implements a one-to-one relationship between the array, rank, extent pool, and LSS, so again each horizontal box in the schematic represents an array, a rank, an extent pool, and an LSS, and all volumes on a given rank will be associated with the same unique extent pool and LSS. For this expansion frame, the sequence of array site IDs (S17 - S48) created dynamically at DS8000 installation does not show a clear association to the order of installation of disks on DA pairs (DA6, DA4, DA7, and DA5). For example, the next array site ID (array site S17) is on DA7 rather than DA6. To ensure that the array, rank, extent pool, and LSS IDs reflect a clear and consistent association with a DA pair, the arrays, ranks, extent pools, and LSSs were configured in sequence according to the order of installation of disks on DA pairs (DA2, DA0, DA6, DA4, DA7, DA5, DA3, and DA1). Using this approach for logical configuration, a volume ID (which includes the LSS ID) can be used to deduce the associated DA pair as well as the array, rank, and extent pool IDs. For example, a volume ID of 0000 indicates the array, rank, pool, and LSS 0 and implies DA2. A volume ID of 0900 indicates array, rank, pool, and LSS 9 and implies DA0.


Array sites were configured in array site ID sequence within the sequence of the order of installation of disks associated with DA pairs. Array site S17 on DA7 is the next candidate for array creation in pure array site ID sequence (S1 - S48). However, the disks on DA6 are next in order of installation, so the lowest-numbered array site on DA6 (S25) is used to create the next array (A16). Array A16 was used to create the next rank (R16), which was assigned to the next extent pool on processor complex 0 (P16), which was used to create volumes in the next LSS (0x10). Then, the next array site on DA6 (S26) is used to create array A17. After all array sites on DA6 have been used to create arrays, the first array site on DA4 is used. After all array sites on DA4 have been configured, the array sites on DA7 will be configured, and finally the array sites on DA5 will be configured. Therefore, all the array, rank, and Pool IDs are the same (and the LSS IDs are the hexadecimal equivalent) for this expansion frame, while the array site IDs do not have a consistent numerical relationship. The clear association of DA pair to rank, extent pool, and LSS ID allows quick identification of rank and DA pair based on the volume ID that contains the LSS ID. For example, given a hexadecimal volume ID 10zz (LSS 0x10), we can deduce that the volume is in extent pool P16, rank R16, and array A16 on DA 6.

Figure 5-17 DS8000 configuration example 2: First expansion frame


5.8 Plan address groups, LSSs, volume IDs, and CKD PAVs
After creating the extent pools and evenly distributing the back-end resources (DA pairs and ranks) across both DS8000 processor complexes, you can start with the creation of host volumes from these extent pools. When creating the host volumes, it is important to follow a strict volume layout scheme that evenly spreads the volumes of each application workload across all ranks and extent pools that are dedicated to this workload, in order to achieve a balanced I/O workload distribution across ranks, DA pairs, and DS8000 processor complexes. So, the next step is to plan the volume layout and thus the mapping of address groups and LSSs to volumes created from the various extent pools with regard to the identified workloads and workload groups.

For performance analysis reasons, it is important to easily identify the association of given volumes to ranks or extent pools when investigating resource contention. Although the mapping of volumes to ranks can be taken from the DSCLI showrank or showfbvol/showckdvol -rank commands, performance management and analysis are significantly easier if a well-planned logical configuration strategy is in place using a numbering scheme that easily relates volume IDs to workloads, extent pools, and ranks. Each volume is associated with a hexadecimal 4-digit volume ID that has to be specified when creating the volume, as shown, for example, in Table 5-3 for volume ID 1101.
Table 5-3 Understanding the volume ID relation to address groups and LSSs/LCUs (volume ID 1101)
- 1st digit (1xxx): Address group (0-F); there are 16 address groups on a DS8000 subsystem.
- 1st and 2nd digits (11xx): Logical subsystem (LSS) ID for FB or logical control unit (LCU) ID for CKD (x0-xF: 16 LSSs or LCUs per address group).
- 3rd and 4th digits (xx01): Volume number within an LSS or LCU (00-FF: 256 volumes per LSS or LCU).

The first digit of the hexadecimal volume ID specifies the address group, 0 to F, of that volume. Each address group can only be used by a single storage type, either FB or CKD. Note that volumes accessed by ESCON channels need to be defined in address group 0, using LCUs 00 to 0F. So if ESCON channels are used, reserve address group 0 for these volumes with a range of volume IDs from 0000 to 0FFF.

The first and second digit together specify the logical subsystem ID (LSS ID) for Open Systems volumes (FB) or the logical control unit ID (LCU ID) for System z volumes (CKD), providing 16 LSS/LCU IDs per address group. The third and fourth digits specify the volume number within the LSS/LCU, 00-FF, providing 256 volumes per LSS/LCU. The volume with volume ID 1101 is the volume with volume number 01 of LSS 11 belonging to address group 1 (first digit).

Important: You must define volumes accessed by ESCON channels in address group 0 using LCUs 00 through 0F.

The LSS/LCU ID is furthermore related to a rank group. Even LSS/LCU IDs are restricted to volumes created from rank group 0, which are serviced by processor complex 0. Odd LSS/LCU IDs are restricted to volumes created from rank group 1, which are serviced by processor complex 1. So the volume ID also reflects the affinity of that volume to a DS8000
processor complex. All volumes which are created from even-numbered extent pools (P0, P2, P4, and so forth) have even LSS IDs and are managed by DS8000 processor complex 0, whereas all volumes created from odd-numbered extent pools (P1, P3, P5, and so forth) have odd LSS IDs and are managed by DS8000 processor complex 1. There is no direct DS8000 performance implication as a result of the number of LSSs/LCUs or the association of LSSs/LCUs and ranks, with the exception of additional CKD PAVs that are potentially available with multiple LCUs assigned to a single rank. For the z/OS CKD environment, a DS8000 volume ID is required for each PAV. The maximum of 256 addresses per LCU includes both CKD base volumes and PAVs, so the number of volumes and PAVs determines the number of LCUs required. So especially with single-rank extent pools, more than one LCU per rank might be needed. When planning the volume layout, you can basically decide on two concepts for LSS/LCU IDs. You can either try to strictly relate them to hardware resources (as on former IBM 2105 subsystems) or to application workloads. Especially with single-rank extent pools, it is a common practice to actually relate LSS/LCU IDs to the physical back-end resources, such as ranks and extent pools. So the relation of each volume to a specific rank (or set of ranks) simply can be taken from the volume ID. In certain homogeneous environments, this concept can reduce the effort of performance management and analysis. You can use this concept initially simply with native host-based tools without actually having to use additional DSCLI configuration data to investigate the relation of volumes to ranks. This concept is suitable in environments where performance management is the major concern and full control of performance on the rank level is required. Typically, this concept also requires host-level striping techniques or application-level striping techniques to balance workloads and increases configuration effort if volumes of a given workload are spread across multiple ranks. The other approach is to relate a LSS/LCU to a specific application workload with a meaningful numbering scheme for the volume IDs with regard to the specific applications, ranks, and extent pools. Each LSS can have 256 volumes, with volume numbers ranging from 00 to ff. So relating the LSS/LCU to a certain application workload and the volume number to the physical location of the volume, such as the rank (when using the rotate volumes algorithm with multi-rank extent pools or single-rank extent pools) or the extent pool (when using the rotate extents algorithm with multi-rank extent pools) might be a reasonable choice. Because the volume IDs are transparent to the attached host systems, this approach helps the administrator of the host system to determine the relation of volumes to ranks simply from the volume ID and thus easily identify independent volumes from different ranks when setting up host-level striping or when separating, for example, DB table spaces from DB logs onto volumes from physically different arrays. Ideally, all volumes that belong to a certain application workload or a group of related host systems are within the same LSS. 
However, because the volumes need to also be spread evenly across both DS8000 processor complexes, at least two logical subsystems are typically required per application workload: one even LSS for the volumes managed by processor complex 0 and one odd LSS for volumes managed by processor complex 1 (for example, LSS 10 and LSS 11). Furthermore, the assignment of LSS IDs to application workloads significantly reduces management effort when using DS8000-related Copy Services, because basic management steps (such as establishing Peer-to-Peer Remote Copy (PPRC) paths and consistency groups) are related to LSSs. Even if Copy Services are not currently planned, plan the volume layout accordingly, because management will be easier if you need to introduce Copy Services in the future (even, for example, when migrating to a new subsystem using Copy Services). Note that a single mkfbvol or mkckdvol command can create a set of volumes with successive volume IDs from the specified extent pool. So if you want to assign LSS/LCU IDs to specific hardware resources, such as ranks, where you need to create a large number of successive
volumes on a single rank, consider single rank extent pools. You can also extend this concept to assigning LSS/LCU IDs to extent pools with more than one rank using Storage Pool Striping. In this case, the LSS/LCU ID simply is related to an extent pool as a set of aggregated and evenly utilized ranks. If you want to assign LSS/LCUs based on application workloads and typically need to spread volumes across multiple ranks with the allocation of volumes or extents on successive ranks, multi-rank extent pools might be a good choice, simply taking advantage of the DS8000 volume allocation algorithms to spread the volumes and thus the workload evenly across the ranks with considerably less management effort. Single rank extent pools require much more configuration effort to distribute a set of successive volumes of a given workload with unique LSS/LCU IDs across multiple ranks. However, the actual strategy for the assignment of LSS/LCU IDs to resources and workloads can vary depending on the particular requirements in an environment. The following subsections introduce suggestions for LSS/LCU and volume ID numbering schemes to help to relate volume IDs to application workloads, extent pools, and ranks.

5.8.1 Volume configuration scheme using application-related LSS/LCU IDs


Typically, when using LSS/LCU IDs that are related to application workloads and distributing these workloads across multiple ranks, you can consider multi-rank extent pools using either rotate volumes or rotate extents (Storage Pool Striping) extent allocation methods. The DS8000 volume allocation algorithms take sufficient care of the distribution of volumes across the ranks within an extent pool. In order to achieve a uniform volume distribution across the ranks within a multi-rank extent pool using the rotate volumes algorithm, a standard volume size might slightly optimize the workload distribution, especially for large resource-sharing workload groups. You can choose a standard volume size based on the required granularity for average workload capacity, and align it to the available capacity on the ranks in order to be able to allocate most extents on the subsystem in a balanced manner without wasting capacity. Another way to spread the workload of an application evenly across all ranks and extent pools dedicated to this workload instead of using a standard volume size is to divide the overall capacity of each workload by the number of available ranks and then create an appropriate number of volumes if distinct ranks are used for each volume. Alternatively, when using multi-rank extent pools with Storage Pool Striping, the extents of each volume must spread evenly across all ranks within an extent pool. In this case, the volume size ideally is aligned to the number of ranks within the extent pools by choosing a volume size in GB that is a multiple of the number of ranks within the extent pool. For example, with 8 ranks in an extent pool, the volume size must be a multiple of 8 extents (8 GB for FB), such as 8 GB, 16 GB, and so forth. Therefore, each volume has the same number of extents on each rank. Furthermore, the extent allocation algorithm will even prevent successive volumes all starting on the same rank. It will start allocating extents of successively created volumes on different ranks. Aligned volume sizes as introduced can help to achieve a more strictly controlled volume distribution with a slightly more balanced overall workload distribution. However, the DS8000 volume allocation algorithms generally will take care of the distribution of volumes across the ranks within an extent pool even without aligned volume sizes.


Example configuration using the default rotate volumes algorithm


Here are several suggestions for a volume ID numbering scheme when using the rotate volumes algorithm with multi-rank extent pools and volumes on dedicated ranks. Note that these examples refer to a workload group where multiple workloads share the same set of resources configured in two or four extent pools. With a two extent pool configuration as shown in Figure 5-18 with 16 ranks in each extent pool (P0 and P1), consider creating 16 equally sized volumes with volume IDs 1000-100F in extent pool P0 and 16 volumes with volume IDs 1100-110f in extent pool P1 for a given application workload that requires 32 volumes and is associated with LSS IDs 10 and 11. If you need another 32 volumes for that application workload, for example, assigned to a second host system of the same host group, consider creating another set of volumes for the same LSS ID pair with volume IDs 1010-101F and 1110-111F, so that the fourth digit of the volume ID can be used to relate to a specific rank in the appropriate rank group (even or odd LSS/LCU ID) or just in the particular extent pool. Another workload that is sharing the same ranks simply follows the same scheme for the volume numbers but uses a different LSS ID pair, for example, LSS 2a and 2b. Here, two LSS/LCU IDs (one even, one odd) are used for a given workload to spread the I/O activity evenly across both processor complexes (both rank groups). This approach also allows the quick creation of volumes for a given workload with successive volume IDs per extent pool with just a single DSCLI mkfbvol/mkckdvol command. When creating the volumes in a balanced way, the volume allocation algorithm will spread the volumes across the arrays accordingly (refer to History of DS8000 volume allocation algorithms on page 97 for more details about how the algorithms work). You can also apply the same scheme to a four extent pool configuration as shown in Figure 5-19 on page 122. Dependent on the number of volumes required for a given workload, as well as the available number of ranks and extent pools assigned to that workload, you can apply variations of this concept or even totally different approaches.
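In DSCLI terms, such a layout might be created as in the following hypothetical sketch; the capacity and nicknames are illustrative, and with the default rotate volumes algorithm the 16 volumes per pool are placed on the 16 ranks of each pool in turn:
# mkfbvol -extpool P0 -cap 50 -type ds -name appA_host1 1000-100f
# mkfbvol -extpool P1 -cap 50 -type ds -name appA_host1 1100-110f
# mkfbvol -extpool P0 -cap 50 -type ds -name appB_host1 2a00-2a0f
# mkfbvol -extpool P1 -cap 50 -type ds -name appB_host1 2b00-2b0f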
Figure 5-18 Volume layout example for two shared extent pools using the rotate volumes algorithm (application A uses volume IDs 1000-101f in P0 and 1100-111f in P1; application B uses volume IDs 2a00-2a0f in P0 and 2b00-2b0f in P1)


Figure 5-19 Volume layout example for four shared extent pools using the rotate volumes algorithm (application A uses volume IDs 1000-101f spread across P0 and P2 and 1100-111f spread across P1 and P3; application B uses volume IDs 2a00-2a0f and 2b00-2b0f)

When the volumes of each host or application are distributed evenly across the ranks, DA pairs, and both processor complexes using the rotate volumes algorithm, consider using striping techniques on the host system in order to achieve a uniform distribution of the I/O activity across the assigned volumes (and thus, the DS8000 hardware resources). Refer to Figure 5-20.

Figure 5-20 Example of using host-level striping with volumes on distinct ranks (one AIX volume group with LUNs 1000-100f and 1100-110f and a second AIX volume group with LUNs 1010-101f and 1110-111f, each with AIX logical volumes spread across all LUNs of the volume group)


We typically recommend for host-level striping a large-granularity stripe size of at least 8 MB or 16 MB for standard workloads in order to span multiple full stripe sets of a DS8000 rank and not to disable the sequential read-ahead detection mechanism of the DS8000 cache algorithms. (The stripe size that is internally used on the DS8000 depends on the storage type, either FB or CKD, and the RAID level: FB uses a stripe size of 256 KB segments per disk drive for RAID 5 and RAID 10 arrays and 192 KB for RAID 6 arrays; CKD uses a stripe size of 224 KB for RAID 5 and RAID 10 arrays and 168 KB for RAID 6 arrays.)

For example, using an AIX host system with AIX Logical Volume Manager (LVM) means building an AIX LVM volume group (VG) with LUNs 1000-100f and LUNs 1100-110f (as shown in Figure 5-20 on page 122) and creating AIX logical volumes from this volume group with an INTERDISK-POLICY of maximum and PP sizes of at least 8 MB in order to spread the workload across all LUNs in the volume group (which is called AIX PP striping). If you have another set of volumes from the same ranks, for example, for a second host system, you simply configure them in another AIX LVM VG, as also shown in Figure 5-20 on page 122.

Note that this is a general recommendation for standard workloads. If you know your workload characteristics in more detail, you might also consider another stripe size on your host system that is aligned to your particular I/O request size, as well as to the DS8000 array stripe set size. However, test with the real application to evaluate whether the chosen stripe size really yields better performance before implementing it in the production environment.
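As a minimal sketch of this AIX PP striping approach (assuming the 32 LUNs appear as hdisk2 through hdisk33 on the host; the device names, sizes, and names are hypothetical), the volume group is created with a large physical partition size and the logical volume is spread across the maximum number of physical volumes:
# mkvg -S -s 16 -y appvg hdisk2 hdisk3 ... hdisk33
# mklv -y applv -e x -t jfs2 appvg 2048
Here, -S creates a scalable volume group, -s 16 sets a 16 MB physical partition size, and -e x sets the inter-physical volume allocation policy to maximum, so that the 2048 logical partitions of the logical volume are distributed across all 32 LUNs in the volume group.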

Example configuration using the rotate extents algorithm


Here are several suggestions for a volume ID numbering scheme when using the rotate extents algorithm (Storage Pool Striping) with multi-rank extent pools and volumes evenly spread across a set of ranks. These examples refer to a workload group where multiple workloads share the same set of resources. With this allocation method, the volume ID numbering scheme can be chosen to relate volume IDs to extent pools, and thus to a set of aggregated and uniformly utilized ranks. In addition to the LSS IDs assigned to a given application workload, the range of volume numbers can provide a simple mapping between a range of volume IDs and an extent pool, so that you can easily identify the physical resources behind a volume (here, a set of shared ranks within a certain extent pool) simply from the volume ID. An example of such a volume numbering scheme for a four extent pool configuration using the rotate extents volume allocation method is given in Figure 5-21.

Note: The stripe size that is internally used on the DS8000 depends on the storage type, either FB or CKD, and the RAID level. FB uses a stripe size of 256 KB segments per disk drive for RAID 5 and RAID 10 arrays and 192 KB for RAID 6 arrays. CKD uses a stripe size of 224 KB for RAID 5 and RAID 10 arrays and 168 KB for RAID 6 arrays.


Figure 5-21 Volume layout example for four shared extent pools using the rotate extents algorithm

Here, the third digit of the volume ID is used to relate to an extent pool and thus a set of uniformly utilized ranks. Two LSS/LCU IDs (one even, one odd) are assigned to a workload to spread the I/O activity evenly across both processor complexes (both rank groups). Furthermore, you can quickly create volumes with successive volume IDs for a given workload per extent pool with a single DSCLI mkfbvol/mkckdvol command. Using host-level striping or application-level striping across volumes from different extent pools still is a valid option to balance the overall workload across the extent pools.
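For illustration, volumes with consecutive volume IDs in one extent pool can be created with a single mkfbvol command. The following sketch uses assumed values for the device ID, extent pool IDs, capacity, and volume names, and assumes that the rotate extents allocation method is requested with the -eam rotateexts option; verify the exact option names against your DSCLI level:

# 16 FB volumes 1000-100F in extent pool P0 (rank group 0), striped across all ranks in the pool
mkfbvol -dev IBM.2107-75XXXXX -extpool P0 -cap 100 -eam rotateexts -name appA_#h 1000-100F

# The matching set 1100-110F in extent pool P1 (rank group 1)
mkfbvol -dev IBM.2107-75XXXXX -extpool P1 -cap 100 -eam rotateexts -name appA_#h 1100-110F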

5.8.2 Volume configuration scheme using hardware-bound LSS/LCU IDs


If, for performance management reasons, a controlled mapping of LSS/LCU IDs to specific back-end hardware resources, such as ranks, is preferred, you typically assign unique LSS/LCU IDs to extent pools. In this case, all volumes created from an extent pool use LSS/LCU IDs that are unique to that pool, so that the volume ID actually relates to a specific extent pool and thus to the hardware resources in that extent pool. If performance management at the rank level is preferred, consider single-rank extent pools. Managing performance at the rank level is the most granular level available on the DS8000. However, if such a deep level of performance management is not actually required, you can also apply this concept to multi-rank extent pools using Storage Pool Striping with two or more identical ranks per extent pool. In this case, the volume and LSS/LCU ID are related to a set of uniformly utilized ranks instead of just a single rank, considerably reducing the overall management effort.

Example configuration using multi-rank extent pools with SPS


You can extend the single-rank extent pool configuration approach, as introduced in Configuration technique for simplified performance management on page 112, to homogeneously configured multi-rank extent pools with identical ranks using Storage Pool Striping (SPS). Here, the LSS/LCU ID is not related to only a single rank but instead to a set of aggregated ranks within an extent pool. Using SPS ensures that the workload is evenly spread across all the ranks in the extent pool, so performance and configuration management can simply be shifted from the rank level to the extent pool level. Instead of a large number of ranks, now only a small number of extent pools need to be managed, with considerably less overall administration effort. This approach combines the benefits of a strict volume-to-physical-resource association, as provided with single-rank extent pools, with ease of use and a reduction of the overall management effort. You still need to apply appropriate considerations, as outlined in 5.7.5, Planning for multi-rank extent pools on page 106, for evenly distributing the ranks and DA pairs across the extent pools with regard to the workload configuration principles.

A simple example of assigning unique LSS/LCU IDs to extent pools is given in Figure 5-22 on page 125. Here, you can easily relate a volume ID to a set of ranks; for example, volume 1108 is related to the set of ranks in extent pool P1. However, you cannot relate the workload of these volumes to individual ranks anymore, because all ranks in an extent pool are shared by all volumes in that extent pool.

When using only identical ranks within an extent pool, as shown in this example, be aware that the extent pools, as aggregations of multiple ranks, can still have different performance capabilities, depending on the number of disk drives actively servicing I/O requests. Here, extent pools P0 and P1 are built from ranks with spare drives, so they have a smaller number of disks actively servicing I/O requests than extent pools P2 and P3, which were built from the same type of ranks but without spare drives. Thus, extent pools P0 and P1 have less total storage capacity and also provide slightly lower overall performance capabilities than extent pools P2 and P3. However, the overall I/O activity of a given workload typically scales evenly with its capacity allocation, based on a given I/O access density, so the difference in capacity for those extent pools might scale accordingly to their performance capabilities with regard to this workload and its space allocation.

Figure 5-22 Volume layout example for hardware-related volume IDs with multi-rank extent pools

Note: The access density is a measure of I/O rate per unit of usable storage capacity, expressed in IOPS per usable GB of storage capacity.
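As a simple, hedged illustration of this measure with assumed numbers (not taken from the configuration above): a workload driving 2400 IOPS against 3000 GB of allocated capacity has an access density of 2400 / 3000 = 0.8 IOPS per usable GB. If extent pool P0 offers roughly 14% less usable capacity than P2 because of its spare drives (6+P+S compared to 7+P data drives), a capacity-proportional allocation of that workload would also direct roughly 14% less of its I/O to P0, which is broadly in line with the smaller number of disk drives actively servicing I/O in that pool.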


Example configurations using single-rank extent pools


A controlled mapping of hexadecimal LSS/LCU IDs to specific ranks is easiest to achieve with single-rank extent pools. With only a single rank assigned to an extent pool, one or more unique LSS/LCU IDs can be associated with that rank by specifying volume IDs that contain only that LSS ID (or IDs) for volumes created from that rank or extent pool. Refer also to Configuration technique for simplified performance management on page 112 for the initial steps to prepare the arrays, ranks, and extent pools before finally creating the volumes with appropriate volume IDs reflecting specific LSS/LCU ID assignments to ranks and extent pools.

For example, in order to associate LSS 00 and LSS 50 with rank R0, any volumes created from the extent pool containing rank R0 (for example, extent pool P0) are given hexadecimal volume IDs taken from the following ranges:
0000 through 00FF
5000 through 50FF

If one or more unique LSSs are used for each rank, it is easy to associate a volume ID with a rank, which can simplify performance analysis and management. In the previous example, any hexadecimal volume with an ID beginning with 00 or 50 is known to be allocated on rank R0. If cross-rank LSS assignment is required for remote mirroring consistency group considerations, as discussed in Chapter 4, Logical configuration concepts and terminology on page 41, you can use the DSCLI showrank command to identify the volumes on a rank. Additionally, for z/OS volumes, you can use the RMF Cache Subsystem Device Report together with the ESS Rank Statistics and ESS Extent Pool Statistics reports to identify the extent pool and rank for any z/OS volume serial number.

You must define volumes accessed by ESCON channels in address group 0 (LSS IDs 00 through 0F). For example, for a DS8000 that will support ESCON and FB access, ESCON volumes must be in the range 0000 through 0FFF, and FB volumes can be defined in any of the remaining LSSs (10-FE).

Important: For ESCON devices, you must use address group 0 (LCU 00 through 0F). Volumes for CKD workloads and FB workloads must be defined in different address groups.

For a DS8000 that will support FICON and FB access, CKD volumes and FB volumes can be defined in any address group, but they cannot be defined in the same address group. That is, FB volumes can be defined in the range 0000 through 0FFF, and CKD volumes can be defined in any of the remaining LSSs (10-FE). Conversely, CKD volumes can be defined in the range 0000 through 0FFF, and FB volumes can be defined in any of the remaining LSSs (10-FE).

Important: All LSSs in an address group are a single storage type (either CKD or FB). That is, all 16 LSSs in one address group are available to be used for CKD volumes, or all 16 LSSs in the address group are available for use with FB volumes.

Using different address groups for volumes belonging to different workloads might make performance analysis and management easier. For example, one workload can use address group 0 (volume IDs in the range 0000 through 0FFF), and another workload can use address group 1 (volume IDs in the range 1000 through 1FFF). Different address groups can also be used to identify volumes of different standard sizes.
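A minimal sketch of this association for FB volumes follows; the device ID, extent pool ID, capacities, and volume names are illustrative assumptions, and the exact command parameters should be verified against your DSCLI level:

# Volumes in LSS 00 and LSS 50, both created from single-rank extent pool P0 (rank R0)
mkfbvol -dev IBM.2107-75XXXXX -extpool P0 -cap 50 -name r0_#h 0000-000F
mkfbvol -dev IBM.2107-75XXXXX -extpool P0 -cap 50 -name r0_#h 5000-500F

# List rank R0 and the volumes allocated on it
showrank -dev IBM.2107-75XXXXX R0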

We now return to our example DS8000 configurations that were introduced in 5.3, Planning allocation of disk and host connection capacity on page 70 to see several possible LSS assignment options. For this discussion, we focus on isolation considerations for different storage types (CKD and FB). However, the same isolation strategies can be used for the isolation of different workloads of a single storage type.

DS8000 configuration example 1: LSS/LCU ID planning


Example 1 is a DS8000 populated with two disk enclosure pairs (64 disk drives), so a maximum of eight ranks can be configured. Figure 5-23 on page 127 shows two options for LSS assignment to the eight ranks.

The first LSS assignment option (Single storage type) can only support a single storage type (either CKD or FB), because it uses LSSs from a single address group (address group 0, LSS IDs 00-0F).

The second LSS assignment option (Mixed storage types) shown in Figure 5-23 on page 127 is appropriate for mixed storage types (CKD and FB), because it uses LSSs from two different address groups:
Address group 0 (LSS IDs 00-0F) is associated with ranks R0, R1, R4, and R5.
Address group 1 (LSS IDs 10-1F) is associated with ranks R2, R3, R6, and R7.

Address group 0 can be used for CKD volumes, and address group 1 can be used for FB volumes. If the CKD volumes will be accessed by ESCON channels, address group 0 must be used for those volumes. If the CKD volumes will be accessed by FICON channels only, address group 1 can be used for CKD volumes, and address group 0 can be used for FB volumes.

The second LSS assignment option (Mixed storage types) can also be used for the isolation of workloads of a single storage type: one workload can be allocated volumes in address group 0, and another workload can be allocated volumes in address group 1. In either of the LSS assignment examples shown in Figure 5-23 (single or mixed storage types), additional unique LSSs can be added to each rank if more volume addresses are required.
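A hedged sketch of how the mixed storage types option could be implemented with the DSCLI follows. The device ID, extent pool assignments, capacities, SSID, and volume names are illustrative assumptions, and the exact command parameters should be checked against your DSCLI level; note that a CKD LCU must exist before CKD volumes can be created in it:

# CKD volumes in address group 0: LCU 00 and 3390-3 volumes 0000-0007 from extent pool P0 (rank R0)
mklcu -dev IBM.2107-75XXXXX -qty 1 -id 00 -ss FF00
mkckdvol -dev IBM.2107-75XXXXX -extpool P0 -cap 3339 -name ckd_#h 0000-0007

# FB volumes in address group 1: LUNs 1000-1007 from extent pool P2 (rank R2)
mkfbvol -dev IBM.2107-75XXXXX -extpool P2 -cap 50 -name fb_#h 1000-1007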

Figure 5-23 Hardware-related LSS assignment example for DS8000 configuration 1


DS8000 configuration example 2: LSS/LCU ID planning


Example 2 is a DS8000 with a fully populated base frame and a fully populated first expansion frame. For our example of LSS assignment here, we consider only the 16 ranks that can be configured in the base frame. Figure 5-24 on page 129 shows one option for LSS assignment for the base frame. This LSS assignment option is appropriate for either single or mixed storage types, as well as for a configuration with large disk drives and many small volumes (and CKD PAVs). If a large number of addresses is not necessary, the number of LSSs per rank can be reduced.

Four LSSs are assigned to each rank, providing a maximum of 1024 volume addresses for the rank. Each rank contains unique LSSs from a single address group. For example, rank R0 contains LSSs 00, 02, 04, and 06. A total of four ranks on the same DA pair contain LSSs for the same address group. For example, ranks R0, R1, R2, and R3 on DA 2 contain LSSs for address group 0. The remaining four ranks on the same DA pair contain LSSs for a different address group, which can be of a different storage type. For example, ranks R4, R5, R6, and R7 contain LSSs for address group 1.

To support mixed storage types, alternating address groups on each DA pair can be used for CKD volumes and for FB volumes. Each storage type must have dedicated ranks, but both storage types can share the DA resources. For example, ranks 0 - 3 on DA2 can support CKD workloads, while ranks 4 - 7 on DA2 can support FB workloads. The same pattern of LSS assignment can also be used for isolation of workloads of a single storage type. If all 48 ranks in the base and first expansion frame were configured in this way, a total of 12 address groups (192 LSSs) would be used.

Because unique address groups are supported by each DA pair, workload isolation at the DA pair level is also possible. For example, one workload can be limited to DA pair 2 (address groups 0 and 1 on ranks 0 - 7), and another workload (of the same or a different storage type) can be limited to DA pair 0 (address groups 2 and 3 on ranks 8 - 15). If a workload has volumes on ranks assigned to processor complex 0 and volumes on ranks assigned to processor complex 1, the workload can access all DS8000 processor and cache resources. The workload can achieve maximum performance from those resources, or it might sometimes experience contention with other workloads for processor and cache resources. Each workload will not be impacted by contention for resources that are dedicated to it; however, it will not be able to take advantage of resources that are dedicated to another workload. Its maximum potential performance is limited to the capability of its dedicated resources.


Figure 5-24 Hardware-related LSS assignment example for DS8000 configuration 2

If only a single storage type is required, another option for LSS assignment to 16 ranks in the base frame of DS8000 2 is shown in Figure 5-25 on page 130. Only a single storage type can be supported, because LSSs from the same address groups (address group 2 and address group 3) are used on all ranks. This strategy of LSS assignment is suitable for resource-sharing (such as multiple workloads sharing ranks). Each workload can be assigned volumes from a single address group that are spread across multiple ranks, DA pairs, and servers, increasing the maximum potential performance of the workload. Another benefit of this LSS assignment pattern is that it is easy to define additional volumes on all ranks by simply adding an LSS from another address group to each rank. Assigning LSSs to ranks in this way might also be convenient for differentiating between different standard sizes of volumes. For example, for CKD volumes, address group 2 can be used for 3390 Mod3 volumes, address group 3 for 3390 Mod9 volumes, and so on.


Figure 5-25 Alternate hardware-related LSS assignment example for DS8000 configuration 2

DS8000 configuration example 3: LSS/LCU ID planning


Example 3 is a DS8000 with two Storage Images, a fully populated base frame, and a partially populated first expansion frame (Figure 5-26). We consider only the 16 ranks in the base frame for this example of LSS assignment. The upper two pairs of disk enclosures (supported by DA pair 2) are owned by Storage Image 2, and the lower two pairs of disk enclosures (supported by DA pair 0) are owned by Storage Image 1.

Figure 5-26 Hardware-related LSS assignment example for DS8000 configuration 3

Figure 5-26 shows one option for LSS assignment to ranks belonging to the two Storage Images. This simple pattern, with a one-to-one correspondence between a rank and an LSS, is only appropriate when a maximum of 256 volume addresses is required per rank. However, if additional volume addresses are needed, one or more additional unique LSSs can be added to each rank. As shown in the schematic, the same LSSs can be used on both Storage Images for two identically configured logical DS8000s (for example, for production and test, or for workloads isolated to one Storage Image). However, performance analysis and management can be simplified by assigning different LSSs in different address groups to the ranks in each Storage Image.

Note: Any of the LSS assignment patterns shown for DS8000 configuration examples 1 and 2 can also be applied to one or both Storage Images in example 3.

5.9 Plan I/O port IDs, host attachments, and volume groups
Finally, when planning the attachment of the host systems to the storage subsystem's host adapter I/O ports, you also need to achieve a balanced workload distribution across the available front-end resources for each workload, with appropriate isolation and resource-sharing considerations. Therefore, distribute the FC connections from the host systems evenly across the DS8000 host adapter (HA) ports, HA cards, I/O enclosures, and, if available, RIO-G loops.

For high availability, each host system must use a multipathing device driver, such as Subsystem Device Driver (SDD), and have a minimum of two host connections to HA cards in different I/O enclosures on the DS8000, preferably using one left side (even-numbered) I/O enclosure and one right side (odd-numbered) I/O enclosure, so that there is a shortest path via the RIO-G loop to either DS8000 processor complex for a good balance of the I/O requests to both rank groups. For a host system with four FC connections to a DS8100, consider using one HA port in each of the four I/O enclosures. If a host system with four FC connections is attached to a DS8300, consider spreading two connections across two different I/O enclosures in the first RIO-G loop and two connections across different I/O enclosures in the second RIO-G loop.

The number of host connections per host system is primarily determined by the required bandwidth. Use an appropriate number of HA cards in order to satisfy high throughput demands. Because the overall bandwidth of one HA card scales well up to two ports (2 Gbps), while the other two ports simply provide additional connectivity, use only one of the upper pair of FCP ports and one of the lower pair of FCP ports of a single HA card for workloads with high sequential throughputs, and spread the workload across several HA cards. However, with typical transaction-driven workloads showing high numbers of random, small block-size I/O operations, all four ports can be used alike. For the best performance of workloads with very different I/O characteristics, consider isolation of large block sequential and small block random workloads at the I/O port level or even the HA card level. The best practice is using dedicated I/O ports for Copy Services paths and host connections. For more information about performance aspects related to Copy Services, refer to 17.3.1, Metro Mirror configuration considerations on page 519.

In order to assign FB volumes to the attached Open Systems hosts using LUN masking, these volumes need to be grouped in DS8000 volume groups. A volume group can be assigned to multiple host connections, and each host connection is specified by the worldwide port name (WWPN) of the host's FC port. A set of host connections from the same host system is called a host attachment. Each host connection can only be assigned to a single volume group. You cannot assign the same host connection to multiple volume groups, but the same volume group can be assigned to multiple host connections. In order to share volumes between multiple host systems, the most convenient way is to create a separate volume group for each host system and assign the shared volumes to each of the individual volume groups as required, because a single volume can be assigned to multiple volume groups. Only if a group of host systems shares exactly the same set of volumes, and there is no need to assign additional non-shared volumes independently to particular hosts of this group, can you consider using a single shared volume group for all host systems in order to simplify management. Typically, there are no significant DS8000 performance implications due to the number of DS8000 volume groups or the assignment of host attachments and volumes to DS8000 volume groups.

Do not omit additional host attachment and host system considerations, such as SAN zoning, multipathing software, and host-level striping. For additional information, refer to Chapter 9, Host attachment on page 265, Chapter 11, Performance considerations with UNIX servers on page 307, Chapter 13, Performance considerations with Linux on page 401, and Chapter 15, System z servers on page 441.

After the DS8000 has been installed, you can use the DSCLI lsioport command to display and document I/O port information, including the I/O ports, host adapter type, I/O enclosure location, and WWPN. Use this information to add specific I/O port IDs, the required protocol (FICON, FCP, or ESCON), and DS8000 I/O port WWPNs to the plan of host and remote mirroring connections identified in 5.3, Planning allocation of disk and host connection capacity on page 70. Additionally, the I/O port IDs might be required as input to DS8000 host definitions if host connections need to be restricted to specific DS8000 I/O ports using the -ioport option of the mkhostconnect DSCLI command. If host connections are configured to allow access to all DS8000 I/O ports, which is the default, typically the paths must be restricted by SAN zoning. Here, the I/O port WWPNs will be required as input for SAN zoning. The lshostconnect -login DSCLI command might help to verify the final allocation of host attachments to DS8000 I/O ports, because it lists host port WWPNs that are logged in, sorted by the DS8000 I/O port IDs for known connections. The lshostconnect -unknown DSCLI command might further help to identify host port WWPNs that have not yet been configured as host connections when creating host attachments using the mkhostconnect DSCLI command.

The DSCLI lsioport output will identify:
The number of I/O ports on each host adapter installed
The type of host adapters installed (SW FCP/FICON-capable, LW FCP/FICON-capable, or ESCON)
The distribution of host adapters across I/O enclosures
The WWPN of each I/O port

DS8000 I/O ports have predetermined, fixed DS8000 logical port IDs in the form I0xyz where:
x = I/O enclosure
y = slot number within the I/O enclosure
z = port within the adapter card

For example, I0101 is the I/O port ID for:
I/O enclosure 1
Slot 0
Second port
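Returning to the host attachment and volume group definitions described above, a minimal hedged sketch with the DSCLI follows. The WWPNs, host type, volume group ID, connection names, and I/O port IDs are illustrative assumptions, and the exact parameter names and list syntax should be verified against your DSCLI level:

# Create a volume group containing the LUNs of this host
mkvolgrp -dev IBM.2107-75XXXXX -type scsimask -volume 1000-100F appA_vg

# Define one host connection per host FC port, restricted to two I/O ports in
# different I/O enclosures (one left side, one right side)
mkhostconnect -dev IBM.2107-75XXXXX -wwname 10000000C9123456 -hosttype pSeries -volgrp V0 -ioport I0000,I0130 appA_fcs0
mkhostconnect -dev IBM.2107-75XXXXX -wwname 10000000C9123457 -hosttype pSeries -volgrp V0 -ioport I0002,I0132 appA_fcs1

# Verify which host ports are logged in to which DS8000 I/O ports
lshostconnect -dev IBM.2107-75XXXXX -login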


Note: The slot numbers for logical I/O port IDs are one less than the physical location numbers for host adapter cards as shown on the physical labels and in TotalStorage Productivity Center for Disk; for example, I0101 is R1-XI2-C1-T2.

A simplified example of spreading the DS8000 I/O ports evenly across two redundant SAN fabrics is given in Figure 5-27. Note that SAN implementations can vary depending on individual requirements, workload considerations for isolation and resource-sharing, and the available hardware resources.

Figure 5-27 Example of spreading DS8000 I/O ports evenly across two redundant SAN fabrics

Now, we return again to the three DS8000 configuration examples introduced in 5.3, Planning allocation of disk and host connection capacity on page 70 to see the I/O enclosure, host adapter, and I/O port resources available.

5.9.1 DS8000 configuration example 1: I/O port planning considerations


The DS8000 from configuration example 1 is a base-frame-only DS8000 model with four partially populated I/O enclosures. A schematic of the I/O enclosures is shown in Figure 5-28 on page 134. It shows that each of the four base frame I/O enclosures (0 - 3) contains two 4-port FCP/FICON-capable host adapters (one LW card and one SW card). I/O enclosures 2 and 3 also contain one 2-port ESCON adapter each.


Figure 5-28 DS8000 example 1: I/O enclosures, host adapters, and I/O ports

The DSCLI lsioport command output for this DS8000 in Example 5-12 shows a total of 36 I/O ports available:
Four multi-mode (shortwave) 4-port FCP/FICON-capable adapters, one in each base frame I/O enclosure (I/O enclosures 0 - 3): IDs I0000 - I0003, I0130 - I0133, I0240 - I0243, and I0310 - I0313
Four single-mode (longwave) 4-port FCP/FICON-capable adapters, one in each base frame I/O enclosure (I/O enclosures 0 - 3): IDs I0030 - I0033, I0100 - I0103, I0200 - I0203, and I0330 - I0333
Two 2-port ESCON adapters: one in I/O enclosure 2 (IDs I0230 and I0231) and one in I/O enclosure 3 (IDs I0300 and I0301)
Example 5-12 DS8000 example 1: I/O enclosures, host adapters, and I/O ports dscli> lsioport -l -dev ibm.2107-7506571 Date/Time: September 4, 2005 2:35:44 PM EDT IBM DSCLI Version: 5.0.5.52 DS:IBM.2107-7506571 ID WWPN State Type topo portgrp =============================================================== I0000 50050763030000B6 Online Fibre Channel-SW FC-AL 0 I0001 50050763030040B6 Online Fibre Channel-SW FC-AL 0 I0002 50050763030080B6 Online Fibre Channel-SW FC-AL 0 I0003 500507630300C0B6 Online Fibre Channel-SW FC-AL 0 I0030 50050763030300B6 Online Fibre Channel-LW FICON 0 I0031 50050763030340B6 Online Fibre Channel-LW FICON 0 I0032 50050763030380B6 Online Fibre Channel-LW FICON 0


I0033 500507630303C0B6 Online Fibre Channel-LW FICON 0
I0100 50050763030800B6 Online Fibre Channel-LW FICON 0
I0101 50050763030840B6 Online Fibre Channel-LW FICON 0
I0102 50050763030880B6 Online Fibre Channel-LW FICON 0
I0103 500507630308C0B6 Online Fibre Channel-LW FICON 0
I0130 50050763030B00B6 Online Fibre Channel-SW FC-AL 0
I0131 50050763030B40B6 Online Fibre Channel-SW FC-AL 0
I0132 50050763030B80B6 Online Fibre Channel-SW FC-AL 0
I0133 50050763030BC0B6 Online Fibre Channel-SW FC-AL 0
I0200 50050763031000B6 Online Fibre Channel-LW FICON 0
I0201 50050763031040B6 Online Fibre Channel-LW FICON 0
I0202 50050763031080B6 Online Fibre Channel-LW FICON 0
I0203 500507630310C0B6 Online Fibre Channel-LW FICON 0
I0230                  Online ESCON
I0231                  Online ESCON
I0240 50050763031400B6 Online Fibre Channel-SW FC-AL 0
I0241 50050763031440B6 Online Fibre Channel-SW FC-AL 0
I0242 50050763031480B6 Online Fibre Channel-SW FC-AL 0
I0243 500507630314C0B6 Online Fibre Channel-SW FC-AL 0
I0300                  Online ESCON
I0301                  Online ESCON
I0310 50050763031900B6 Online Fibre Channel-SW FC-AL 0
I0311 50050763031940B6 Online Fibre Channel-SW FC-AL 0
I0312 50050763031980B6 Online Fibre Channel-SW FC-AL 0
I0313 500507630319C0B6 Online Fibre Channel-SW FC-AL 0
I0330 50050763031B00B6 Online Fibre Channel-LW FICON 0
I0331 50050763031B40B6 Online Fibre Channel-LW FICON 0
I0332 50050763031B80B6 Online Fibre Channel-LW FICON 0
I0333 50050763031BC0B6 Online Fibre Channel-LW FICON 0

Note: As seen in Example 5-12 on page 134, the default I/O topology configurations are FICON for longwave (LW) host adapters and FC-AL for shortwave (SW) host adapters.

Workload isolation considerations


ESCON connections will be isolated to the two ESCON adapters. Individual ESCON workloads can be isolated at the I/O port level, although with this small number of I/O ports, that approach results in very limited I/O bandwidth.

Important: If there is an ESCON workload on a DS8000, the ESCON volumes must be configured in address group 0 (LCUs 00 through 0F).

If the DS8000 in this example will support both FICON and FCP access (either for FCP host access or remote mirroring links), individual I/O ports must be dedicated either to FICON or to FCP. However, I/O ports on the same host adapter can be configured independently for the FCP protocol and the FICON protocol. As an example, on each of the four LW cards:
The first port can be configured for FICON access: I0030, I0100, I0200, and I0330.
The second port can be configured for FCP host access: I0031, I0101, I0201, and I0331.
On two (or more) of the four LW cards, the third port can be configured for remote mirroring links (two links might be sufficient for the remote mirroring requirements): I0032 and I0102.

It is possible to have Open Systems host workloads and remote mirroring workloads share the same I/O ports, but if enough I/O ports are available, the best practice is to separate FCP host connections and FCP remote mirroring connections to dedicated I/O ports.

In each I/O enclosure of this DS8000, there is one SW FICON/FCP-capable host adapter and one LW FICON/FCP-capable host adapter. Each port on a DS8000 4-port SW adapter or a DS8000 4-port LW adapter can be configured either to the FCP protocol for Open Systems host connections and DS8000 remote mirroring links, or to the FICON protocol for z/OS host connections. Generally, SAN distance considerations and host server adapter types determine whether SW or LW adapters are appropriate. In our case, the fact that both LW and SW adapters have been ordered for this DS8000 implies two sets of host server or SAN requirements, so it is likely that certain workloads will be separated by adapter type.

Note: For host server connections or remote mirroring connections known to require high throughput, we recommend a maximum of two I/O ports on a single DS8000 host adapter.
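A hedged sketch of dedicating individual I/O ports to a protocol with the DSCLI follows; the setioport command and its topology keywords (ficon for FICON, scsi-fcp for FCP) are given as assumptions to be verified against your DSCLI level:

# Dedicate the first port of the LW card in I/O enclosure 0 to FICON
setioport -dev IBM.2107-7506571 -topology ficon I0030

# Dedicate the second port of the same card to FCP host access
setioport -dev IBM.2107-7506571 -topology scsi-fcp I0031

# Repeat accordingly for I0100/I0101, I0200/I0201, and I0330/I0331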

Workload spreading considerations


A typical application of spreading for a FICON or FCP workload on this DS8000 is to spread host connections in multiples of four across all four I/O enclosures, beginning with one I/O port on each of four host adapters of the appropriate type (LW or SW). If additional host connections are required, one or more additional ports can be used on each of the same host adapter cards. (If both LW and SW host adapter cards are appropriate, the additional connections can be placed on different host adapter cards.)

5.9.2 DS8000 configuration example 2: I/O port planning considerations


The DS8000 given in this example has a base frame with four partially populated I/O enclosures and one expansion frame with four partially populated I/O enclosures. Figure 5-29 on page 137 is a schematic of the I/O enclosures in the base frame. It shows that each of the four base frame I/O enclosures (0 - 3) contains two 4-port LW FCP/FICON-capable host adapters and one 4-port SW FCP/FICON-capable host adapter. Figure 5-30 on page 137 is a schematic of the I/O enclosures in the first expansion frame of the DS8000 in this example. Each of the four expansion frame I/O enclosures (4 - 7) contains one 4-port LW FCP/FICON-capable host adapter.

Note: The host adapters installed in the DS8000 in this example do not necessarily follow the current order of installation of host adapters.


Figure 5-29 DS8000 example 2: Base frame I/O enclosures, host adapters, and I/O ports

Figure 5-30 DS8000 example 2: First expansion frame I/O enclosures, host adapters, and I/O ports


Example 5-13 shows output from the DSCLI lsioport command for the DS8000 in this example. As shown in the lsioport output, a total of 16 host adapters (64 I/O ports) are installed:
Four 4-port SW FCP/FICON-capable host adapters in the base frame (enclosures 0 - 3): IDs I0010 - I0013, I0140 - I0143, I0210 - I0213, and I0340 - I0343
Eight 4-port LW FCP/FICON-capable host adapters in the base frame (enclosures 0 - 3): IDs I0030 - I0033, I0040 - I0043, I0100 - I0103, I0110 - I0113, I0230 - I0233, I0240 - I0243, I0300 - I0303, and I0310 - I0313
Four 4-port LW FCP/FICON-capable host adapters in the expansion frame (enclosures 4 - 7): IDs I0400 - I0403, I0530 - I0533, I0600 - I0603, and I0730 - I0733
Example 5-13 DS8000 example 2: I/O enclosures, host adapters, and I/O ports dscli> lsioport -l -dev ibm.2107-7520331 Date/Time: September 9, 2005 3:06:36 PM EDT IBM DSCLI Version: 5.0.5.52 DS:IBM.2107-7520331 ID WWPN State Type topo portgrp =============================================================== I0010 5005076303010194 Online Fibre Channel-SW FC-AL 0 I0011 5005076303014194 Online Fibre Channel-SW FC-AL 0 I0012 5005076303018194 Online Fibre Channel-SW FC-AL 0 I0013 500507630301C194 Online Fibre Channel-SW FC-AL 0 I0030 5005076303030194 Online Fibre Channel-LW FICON 0 I0031 5005076303034194 Online Fibre Channel-LW FICON 0 I0032 5005076303038194 Online Fibre Channel-LW FICON 0 I0033 500507630303C194 Online Fibre Channel-LW FICON 0 I0040 5005076303040194 Online Fibre Channel-LW FICON 0 I0041 5005076303044194 Online Fibre Channel-LW FICON 0 I0042 5005076303048194 Online Fibre Channel-LW FICON 0 I0043 500507630304C194 Online Fibre Channel-LW FICON 0 I0100 5005076303080194 Online Fibre Channel-LW FICON 0 I0101 5005076303084194 Online Fibre Channel-LW FICON 0 I0102 5005076303088194 Online Fibre Channel-LW FICON 0 I0103 500507630308C194 Online Fibre Channel-LW FICON 0 I0110 5005076303090194 Online Fibre Channel-LW FICON 0 I0111 5005076303094194 Online Fibre Channel-LW FICON 0 I0112 5005076303098194 Online Fibre Channel-LW FICON 0 I0113 500507630309C194 Online Fibre Channel-LW FICON 0 I0140 50050763030C0194 Online Fibre Channel-SW FC-AL 0 I0141 50050763030C4194 Online Fibre Channel-SW FC-AL 0 I0142 50050763030C8194 Online Fibre Channel-SW FC-AL 0 I0143 50050763030CC194 Online Fibre Channel-SW FC-AL 0 I0210 5005076303110194 Online Fibre Channel-SW FC-AL 0 I0211 5005076303114194 Online Fibre Channel-SW FC-AL 0 I0212 5005076303118194 Online Fibre Channel-SW FC-AL 0 I0213 500507630311C194 Online Fibre Channel-SW FC-AL 0 I0230 5005076303130194 Online Fibre Channel-LW FICON 0 I0231 5005076303134194 Online Fibre Channel-LW FICON 0 I0232 5005076303138194 Online Fibre Channel-LW FICON 0 I0233 500507630313C194 Online Fibre Channel-LW FICON 0 I0240 5005076303140194 Online Fibre Channel-LW FICON 0 I0241 5005076303144194 Online Fibre Channel-LW FICON 0 I0242 5005076303148194 Online Fibre Channel-LW FICON 0


I0243 500507630314C194 Online Fibre Channel-LW FICON 0
I0300 5005076303180194 Online Fibre Channel-LW FICON 0
I0301 5005076303184194 Online Fibre Channel-LW FICON 0
I0302 5005076303188194 Online Fibre Channel-LW FICON 0
I0303 500507630318C194 Online Fibre Channel-LW FICON 0
I0310 5005076303190194 Online Fibre Channel-LW FICON 0
I0311 5005076303194194 Online Fibre Channel-LW FICON 0
I0312 5005076303198194 Online Fibre Channel-LW FICON 0
I0313 500507630319C194 Online Fibre Channel-LW FICON 0
I0340 50050763031C0194 Online Fibre Channel-SW FC-AL 0
I0341 50050763031C4194 Online Fibre Channel-SW FC-AL 0
I0342 50050763031C8194 Online Fibre Channel-SW FC-AL 0
I0343 50050763031CC194 Online Fibre Channel-SW FC-AL 0
I0400 5005076303200194 Online Fibre Channel-LW FICON 0
I0401 5005076303204194 Online Fibre Channel-LW FICON 0
I0402 5005076303208194 Online Fibre Channel-LW FICON 0
I0403 500507630320C194 Online Fibre Channel-LW FICON 0
I0530 50050763032B0194 Online Fibre Channel-LW FICON 0
I0531 50050763032B4194 Online Fibre Channel-LW FICON 0
I0532 50050763032B8194 Online Fibre Channel-LW FICON 0
I0533 50050763032BC194 Online Fibre Channel-LW FICON 0
I0600 5005076303300194 Online Fibre Channel-LW FICON 0
I0601 5005076303304194 Online Fibre Channel-LW FICON 0
I0602 5005076303308194 Online Fibre Channel-LW FICON 0
I0603 500507630330C194 Online Fibre Channel-LW FICON 0
I0730 50050763033B0194 Online Fibre Channel-LW FICON 0
I0731 50050763033B4194 Online Fibre Channel-LW FICON 0
I0732 50050763033B8194 Online Fibre Channel-LW FICON 0
I0733 50050763033BC194 Online Fibre Channel-LW FICON 0

Notes:
1. The port IDs shown in the lsioport command output are logical port IDs. In logical port IDs, the slot numbers are one less than the physical location numbers for host adapter cards listed below the host adapter cards in the DS8000.
2. As shown in Example 5-13 on page 138, the default I/O topology configurations are FICON for LW host adapters and Fibre Channel Arbitrated Loop (FC-AL) for SW host adapters.

Workload isolation considerations


Because there are multiple FICON/FCP-capable host adapters on the DS8000 in this example, workload isolation is available at the individual I/O port level and at the host adapter level. Because there are multiple types of FICON/FCP-capable host adapters (SW and LW), you can also isolate FICON and FCP host connections by host adapter type.

If the DS8000 in this example will support both FICON and FCP access (either for FCP host access or remote mirroring links), separate I/O ports must be dedicated to FICON or to FCP. However, I/O ports on the same host adapter can be configured independently for the FCP protocol and the FICON protocol. For example, on each of the four LW cards:
The first port can be configured for FICON access.
The second port can be configured for FCP host access.
The third port on two or more of the host adapters can be configured for FCP remote mirroring links.


It is possible for Open Systems host workloads and remote mirroring workloads to share the same I/O ports, but if enough I/O ports are available, separation of the FCP host connections and the FCP remote mirroring connections might simplify performance management.

Note: For host server connections or remote mirroring connections known to require high throughput, we recommend a maximum of two I/O ports on a single DS8000 host adapter.

Workload spreading considerations


A typical application of spreading for a FICON or FCP workload on DS8000 configuration example 2 spreads host connections evenly across host adapters of the required type (SW or LW) and evenly across I/O enclosures.

5.9.3 DS8000 configuration example 3: I/O port planning considerations


The DS8000 in configuration example 3 has two Storage Images, a base frame with four partially populated I/O enclosures, and one expansion frame with four partially populated I/O enclosures (Figure 5-31). All installed host adapters are the same type (LW 4-port FCP/FICON-capable). Separate I/O enclosures are dedicated to each Storage Image: I/O enclosures 0, 1, 4, and 5 are dedicated to Storage Image 1. I/O enclosures 2, 3, 6, and 7 are dedicated to Storage Image 2.

Figure 5-31 DS8000 example 3: Base frame I/O enclosures, host adapters, and I/O ports


Figure 5-31 on page 140 and Figure 5-32 are schematics of the I/O enclosures in the base frame and the first expansion frame. A total of 12 4-port LW FCP/FICON-capable host adapters are installed in this DS8000, for a total of 48 available I/O ports. Six LW host adapters are installed in the base frame (I/O enclosures 0 - 3), with three host adapters dedicated to each Storage Image. Six LW host adapters are installed in the expansion frame (I/O enclosures 4 - 7), with three host adapters dedicated to each Storage Image. In this example, each Storage Image owns two host adapters in its first I/O enclosure in the base frame and expansion frame (I/O enclosures 0 and 4 for Storage Image 1 and I/O enclosures 2 and 6 for Storage Image 2). Each Storage Image owns one host adapter in its second I/O enclosure in the base frame and expansion frame (I/O enclosures 1 and 5 for Storage Image 1 and I/O enclosures 3 and 7 for Storage Image 2).

Figure 5-32 DS8000 example 3: First expansion frame I/O enclosures, host adapters, and I/O ports

The DS8000 in this example is a DS8000 model with dual Storage Images, so in order to see all the host adapters, you must issue the DSCLI lsioport command twice, once for each Storage Image:
Storage Image 1: IBM.2107-7566321
Storage Image 2: IBM.2107-7566322


As shown in the output from the lsioport command for Storage Image 1 in Example 5-14 and the output from the lsioport command for Storage Image 2 in Example 5-15, half of the I/O enclosures are dedicated to each Storage Image:
I/O enclosures 0, 1, 4, and 5 (as indicated by I/O port IDs I00yz, I01yz, I04yz, and I05yz) appear in the output for Storage Image 1 (IBM.2107-7566321).
I/O enclosures 2, 3, 6, and 7 (as indicated by I/O port IDs I02yz, I03yz, I06yz, and I07yz) appear in the output for Storage Image 2 (IBM.2107-7566322).
Example 5-14 DS8000 example 3: Storage Image 1 I/O enclosures, host adapters, and I/O ports dscli> lsioport -l -dev ibm.2107-7566321 Date/Time: September 4, 2005 5:32:12 PM EDT IBM DSCLI Version: 5.0.5.52 DS:IBM.2107-7566321 ID WWPN State Type topo portgrp ============================================================ I0000 50050763030003BD Online Fibre Channel-LW FICON 0 I0001 50050763030043BD Online Fibre Channel-LW FICON 0 I0002 50050763030083BD Online Fibre Channel-LW FICON 0 I0003 500507630300C3BD Online Fibre Channel-LW FICON 0 I0030 50050763030303BD Online Fibre Channel-LW FICON 0 I0031 50050763030343BD Online Fibre Channel-LW FICON 0 I0032 50050763030383BD Online Fibre Channel-LW FICON 0 I0033 500507630303C3BD Online Fibre Channel-LW FICON 0 I0100 50050763030803BD Online Fibre Channel-LW FICON 0 I0101 50050763030843BD Online Fibre Channel-LW FICON 0 I0102 50050763030883BD Online Fibre Channel-LW FICON 0 I0103 500507630308C3BD Online Fibre Channel-LW FICON 0 I0400 50050763032003BD Online Fibre Channel-LW FICON 0 I0401 50050763032043BD Online Fibre Channel-LW FICON 0 I0402 50050763032083BD Online Fibre Channel-LW FICON 0 I0403 500507630320C3BD Online Fibre Channel-LW FICON 0 I0430 50050763032303BD Online Fibre Channel-LW FICON 0 I0431 50050763032343BD Online Fibre Channel-LW FICON 0 I0432 50050763032383BD Online Fibre Channel-LW FICON 0 I0433 500507630323C3BD Online Fibre Channel-LW FICON 0 I0500 50050763032803BD Online Fibre Channel-LW FICON 0 I0501 50050763032843BD Online Fibre Channel-LW FICON 0 I0502 50050763032883BD Online Fibre Channel-LW FICON 0 I0503 500507630328C3BD Online Fibre Channel-LW FICON 0 Example 5-15 DS8000 example 3: Storage Image 2 I/O enclosures, host adapters, and I/O ports dscli> lsioport -l -dev ibm.2107-7566322 Date/Time: September 4, 2005 5:32:15 PM EDT IBM DSCLI Version: 5.0.5.52 DS:IBM.2107-7566322 ID WWPN State Type topo portgrp =============================================================== I0200 5005076303100BBD Online Fibre Channel-LW FICON 0 I0201 5005076303104BBD Online Fibre Channel-LW FICON 0 I0202 5005076303108BBD Online Fibre Channel-LW FICON 0 I0203 500507630310CBBD Online Fibre Channel-LW FICON 0 I0230 5005076303130BBD Online Fibre Channel-LW FICON 0 I0231 5005076303134BBD Online Fibre Channel-LW FICON 0 I0232 5005076303138BBD Online Fibre Channel-LW FICON 0 I0233 500507630313CBBD Online Fibre Channel-LW FICON 0 I0300 5005076303180BBD Online Fibre Channel-LW FICON 0 I0301 5005076303184BBD Online Fibre Channel-LW FICON 0 I0302 5005076303188BBD Online Fibre Channel-LW FICON 0 I0303 500507630318CBBD Online Fibre Channel-LW FICON 0 I0600 5005076303300BBD Online Fibre Channel-LW FICON 0 I0601 5005076303304BBD Online Fibre Channel-LW FICON 0


I0602 5005076303308BBD Online Fibre Channel-LW FICON 0
I0603 500507630330CBBD Online Fibre Channel-LW FICON 0
I0630 5005076303330BBD Online Fibre Channel-LW FICON 0
I0631 5005076303334BBD Online Fibre Channel-LW FICON 0
I0632 5005076303338BBD Online Fibre Channel-LW FICON 0
I0633 500507630333CBBD Online Fibre Channel-LW FICON 0
I0700 5005076303380BBD Online Fibre Channel-LW FICON 0
I0701 5005076303384BBD Online Fibre Channel-LW FICON 0
I0702 5005076303388BBD Online Fibre Channel-LW FICON 0
I0703 500507630338CBBD Online Fibre Channel-LW FICON 0

Notes:
1. The port IDs shown in the lsioport command output are logical port IDs. In logical port IDs, the slot numbers are one less than the physical location numbers for host adapter cards on labels below the host adapter cards at the bottom of the DS8000.
2. As shown in the DSCLI output, the default I/O topology configurations are FICON for LW host adapters and FC-AL for SW host adapters.
3. The host adapters installed in the DS8000 in this example do not necessarily follow the current order of installation of host adapters.

Workload isolation considerations


There are two Storage Images on the DS8000 model in this example, and a workload is typically isolated to a single Storage Image, so host connections for the workload are planned for only the I/O enclosures belonging to the appropriate Storage Image. Because there are multiple FICON/FCP-capable host adapters, further workload isolation is also available at the individual I/O port level and at the host adapter level.

Workload spreading considerations


A typical application of spreading for a FICON or FCP workload on this DS8000 spreads host connections evenly across host adapters and evenly across I/O enclosures owned by a single Storage Image.

5.10 Implement and document DS8000 logical configuration


After the logical configuration has been completely planned, you can use either the DS Storage Manager or the DSCLI to implement it on the DS8000 in the following steps:
1. Change the password for the default user (admin) for DS Storage Manager and DSCLI.
2. Create additional user IDs for DS Storage Manager and DSCLI.
3. Apply DS8000 authorization keys.
4. Create arrays.
5. Create ranks.
6. Create extent pools.
7. Assign ranks to extent pools.
8. Create CKD Logical Control Units (LCUs).
9. Create CKD volumes.
10. Create CKD PAVs.


11. Create FB LUNs.
12. Create Open Systems host definitions.
13. Create Open Systems DS8000 volume groups.
14. Assign Open Systems hosts and volumes to DS8000 volume groups.
15. Configure I/O ports.
16. Implement SAN zoning, multipathing software, and host-level striping as desired.

After the logical configuration has been created on the DS8000, it is important to document it. You can use the DS Storage Manager to export information in spreadsheet format (that is, save it as a comma-separated values (csv) file). You can use this information together with the planning spreadsheet, as shown in Figure 5-9 on page 106, to document the logical configuration.

The DSCLI provides a set of list (ls) and show commands, which can be redirected and appended into a plain text or csv file. A list of selected DSCLI commands as shown in Example 5-16 can easily be invoked as a DSCLI script (using the DSCLI command dscli -script) to collect the logical configuration of a DS8000 Storage Image. This output can be used as a text file or imported into a spreadsheet to document the logical configuration. You can obtain more advanced scripts to collect the logical configuration of a DS8000 subsystem in Appendix C, UNIX shell scripts on page 587.

Example 5-16 only collects a minimum set of DS8000 logical configuration information, but it illustrates a simple DSCLI script implementation and runs quickly within a single DSCLI command session. Depending on the environment, you can modify this script to include more commands to provide more information, for example, about Copy Services configurations and source/target relations. Note that a DSCLI script terminates with the first command that returns an error, which, for example, can even be a simple lslcu command if no LCUs are defined. You can adjust the output of the ls commands in a DSCLI script to meet special formatting and delimiter requirements using appropriate options for format, delim, or header in the specified DS8000 profile file or for selected ls commands.
Example 5-16 Example of a minimum DSCLI script get_config.dscli to gather the logical configuration
> dscli -cfg profile/DEVICE.profile -script get_config.dscli > DEVICE_SN_config.out
CMMCI9029E showrank: rank R48 does not exist.
> cat get_config.dscli
ver -l
lssu -l
lssi -l
lsarraysite -l
lsarray -l
lsrank -l
lsextpool -l
lsaddressgrp
lslss         # Use only if FB volumes have been configured
#lslcu        # Use only if CKD volumes and LCUs have been configured
              # otherwise the command returns an error and the script terminates.
lsioport -l
lshostconnect
lsvolgrp
lsfbvol -l    # Use only if FB volumes have been configured
#lsckdvol -l  # Use only if CKD volumes have been configured
showrank R0   # otherwise the command returns an error and the script terminates.
showrank R1   # Modify this list of showrank commands so that the showrank command
showrank R2   # is run on all available ranks!
showrank R3   # Note that an error is returned if the specified rank is not present.
...           # The script terminates on the first non-existing rank.
showrank R128 # Check for gaps in the rank ID sequence.
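As noted earlier, the output of the ls commands can be adjusted through settings in the DSCLI profile that is passed with -cfg. The following excerpt is only a sketch; the key names (format, delim, header, banner) and their accepted values are assumptions that should be verified against the DSCLI documentation for your code level:

# Excerpt of a DSCLI profile (for example, profile/DEVICE.profile) intended to
# produce comma-delimited output that imports easily into a spreadsheet
devid:  IBM.2107-75XXXXX
format: delim
delim:  ,
header: on
banner: off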


Chapter 6.

Performance management process


This chapter describes the need for performance management and the processes and approaches that are available for managing performance of the DS8000:
Introduction
Purpose
Operational performance subprocess
Tactical performance subprocess
Strategic performance subprocess


6.1 Introduction
The IBM System Storage DS8000 series is designed to support the most demanding business applications with its exceptional performance and superior data throughput. This strength, combined with its world-class resiliency features, makes it an ideal storage platform for supporting today's 24x7, global business environment. Moreover, with its tremendous scalability, broad server support, and flexible virtualization capabilities, the DS8000 can help simplify the storage environment and consolidate multiple storage systems onto a single DS8000 system. This power is the potential of the DS8000, but careful planning and management are essential to realize that potential in a complex IT environment. Even a well-configured system will be subject to changes over time that affect performance, such as:
Additional host systems
Increasing workload
Additional users
Additional DS8000 capacity

A typical case
To demonstrate the performance management process, we look at a typical situation where DS8000 performance has become an issue. Users begin to open incident tickets to the IT Help Desk claiming that the system is slow and therefore is delaying the processing of orders from their clients and the submission of invoices. IT Support investigates and detects that there is contention in I/O to the host systems. The Performance and Capacity team is involved and analyzes performance reports together with the IT Support teams. Each IT Support team (operating system, storage, database, and application) issues its report defining the actions necessary to resolve the problem. Certain actions might have a marginal effect but are faster to implement; other actions might be more effective but need more time and resources to put in place. Among the actions, the Storage Team and the Performance and Capacity Team report that additional storage capacity is required to support the I/O workload of the application and ultimately to resolve the problem.

IT Support presents its findings and recommendations to the company's Business Unit, requesting application downtime to implement the changes that can be made immediately. The Business Unit accepts the report but says it has no money for the purchase of new storage. They ask the IT department how they can ensure that the additional storage will resolve the performance issue. Additionally, the Business Unit asks the IT department why the need for additional storage capacity was not submitted as a draft proposal three months ago when the budget was finalized for next year, knowing that the system is one of the most critical systems of the company.

Incidents, such as this one, make us realize the distance that can exist between the IT department and the company's business strategy. In many cases, the IT department plays a key role in determining the company's strategy. Therefore, the questions to consider are:
How can we avoid situations like those just described?
How can we make performance management become more proactive and less reactive?
What are best practices for performance management?
What are the key performance indicators of the IT infrastructure, and what do they mean from the business perspective?
Are the defined performance thresholds adequate?
How can we identify the risks in managing the performance of assets (servers, storage systems, and applications) and mitigate them?

In the following pages, we present a method to implement a performance management process. The goal is to give you ideas and insights with particular reference to the DS8000. We assume in this instance that data from IBM TotalStorage Productivity Center is available. To better align the understanding between the business and the technology, we use as a guide the Information Technology Infrastructure Library (ITIL) to develop a process for performance management as applied to DS8000 performance and tuning.

6.2 Purpose
The purpose of performance management is to ensure that the performance of the IT infrastructure matches the demands of the business. The activities involved are:
- Definition and review of performance baselines and thresholds
- Performance data collection from the DS8000
- Checking whether the performance of resources is within the defined thresholds
- Performance analysis, using the DS8000 performance data collected, and tuning recommendations
- Definition and review of standards and IT architecture related to performance
- Analysis of performance trends
- Sizing of new storage capacity requirements

Certain activities are operational, such as the analysis of the performance of DS8000 components; others are tactical, such as performance analysis and tuning; and others are strategic, such as storage capacity sizing. We can therefore split the process into three subprocesses:
- Operational performance subprocess: Analyze the performance of DS8000 components (processor complexes, device adapters, host adapters, ranks, and so forth) and ensure that they are within the defined thresholds and the service level objectives (SLOs) and service level agreements (SLAs).
- Tactical performance subprocess: Analyze performance data and generate reports for tuning recommendations and for the review of baselines and performance trends.
- Strategic performance subprocess: Analyze performance data and generate reports for storage sizing and for the review of standards and architectures that are related to performance.

Every subprocess is composed of the following elements:
- Inputs: Data and information required for analysis. The possible inputs are:
  - Performance data collected by TotalStorage Productivity Center
  - Historical performance reports
  - Product specifications (benchmark results, performance thresholds, and performance baselines)
  - User specifications (SLOs and SLAs)
- Outputs: The deliverables or results from the process. The possible outputs are:
  - Performance reports and tuning recommendations
  - Performance trends
  - Performance alerts
- Tasks: The activities that are the smallest unit of work of a process. These tasks can be:
  - Performance data collection
  - Performance report generation
  - Analysis and tuning recommendations
- Actors: A department or person in the organization that is specialized to perform a certain type of work. Actors can vary from organization to organization. In smaller organizations, a single person can own multiple actor responsibilities, for example:
  - Capacity and performance team
  - Storage team
  - Server teams
  - Database team
  - Application team
  - Operations team
  - IT Architect
  - IT Manager
  - Clients
- Roles: The tasks that need to be executed by an actor; another actor might own the activity, and other actors might just be consulted. Knowing the roles is helpful when you define the steps of the process and who is going to do what. The roles can be:
  - Responsible: The person that executes the task but is not necessarily the owner of that task. Suppose that the capacity team is the owner for the generation of the performance report with the tuning recommendations, but the specialist that holds the skill to make tuning recommendations for the DS8000 is the Storage Administrator.
  - Accountable: The owner of the activity. There can be only one owner.
  - Consulted: The people that are consulted and whose opinions are considered. Suppose that the IT Architect proposes a new architecture for the storage; normally, the opinion of the Storage Administrator is requested.
  - Informed: The people who are kept up-to-date on progress. The IT Manager normally wants to know the evolution of activities.

When assigning the tasks, you can use a Responsible, Accountable, Consulted, and Informed (RACI) matrix to list the actors and the roles that are necessary to define a process or subprocess. A RACI diagram, or RACI matrix, describes the roles and responsibilities of various teams or people to deliver a project or perform an operation. It is especially useful in clarifying roles and responsibilities in cross-functional and cross-departmental projects and processes.


6.3 Operational performance subprocess


The operational performance subprocess is related to daily activities that can be executed every few minutes, hourly, or daily. For example, you can monitor the DS8000 storage system and, if there is utilization above a defined threshold, automatically send an alert to the designated person. TotalStorage Productivity Center allows you to set performance thresholds for two major categories:
- Status change alerts
- Configuration change alerts

Note: TotalStorage Productivity Center for Disk is not designed to monitor hardware nor to report hardware failures. You can configure the DS8000 Hardware Management Console (HMC) to send alerts via Simple Network Management Protocol (SNMP) or e-mail when a hardware failure occurs.

You might also need to compare the DS8000 performance with the users' performance requirements. Often, these requirements are explicitly defined in formal agreements between IT management and user management. These agreements are referred to as service level agreements (SLAs) or service level objectives (SLOs). They provide a framework for measuring IT resource performance requirements against IT resource fulfillment.

Performance SLA
A performance SLA is a formal agreement between IT Management and User representatives concerning the performance of the IT resources. Often, these SLAs provide goals for end-to-end transaction response times. In the case of storage, these types of goals typically relate to average disk response times for different types of storage. Missing the technical goals described in the SLA results in financial penalties to the IT service management providers.

Performance SLO
Performance SLOs are similar to SLAs, with the exception that misses do not carry financial penalties. While SLO misses do not carry financial penalties, they are a breach of contract in many cases and can lead to serious consequences if not remedied.

Reports that show how many alerts and how many SLO/SLA misses have occurred over time are very important. They tell how effective your storage strategy is (standards, architectures, and policy allocation) in the steady state. In fact, the numbers in those reports are inversely proportional to the effectiveness of your storage strategy: the more effective your storage strategy, the fewer performance threshold alerts are registered and the fewer SLO/SLA targets are missed.

It is not necessary to have implemented SLOs or SLAs to discover the effectiveness of your current storage strategy. The definition of an SLO/SLA requires a deep and clear understanding of your storage strategy and how well your DS8000 is running. That is why, before implementing this process, we recommend that you start with the tactical performance subprocess:
- Generate the performance reports
- Define tuning recommendations
- Review the baseline after implementing the tuning recommendations
- Generate performance trends reports


Then, redefine the thresholds with fresh performance numbers. If you do not, you will spend your time dealing with performance incident tickets caused by false-positive alerts instead of analyzing performance and making tuning recommendations for your DS8000. Let us look at the characteristics of this subprocess.

6.3.1 Inputs
The inputs necessary to make this process effective are:
- Performance trends reports of DS8000 components: Many people ask for the IBM recommended thresholds. In our opinion, the best thresholds are those that fit your environment. The best thresholds depend heavily on the configuration of your DS8000 and the type of workloads. For example, you need to define thresholds for IOPS if your application is a transactional system; if the application is a data warehouse, you need to define thresholds for throughput. Also, you must not expect the same performance from different ranks where one set of ranks has 73 GB, 15k revolutions per minute (rpm) Fibre Channel disk drive modules (DDMs) and another set of ranks has 500 GB, 7200 rpm Fibre Channel Advanced Technology Attachment (FATA) DDMs. Check the outputs generated from the tactical performance subprocess for additional information.
- Performance SLO and performance SLA: You can define the SLO/SLA requirements in two ways:
  - By hardware (IOPS by rank or MB/s by port): This performance report is the easiest way to implement an SLO or SLA but the most difficult method for which to get client agreement. The client normally does not understand the technical aspects of a DS8000.
  - By host or application (IOPS by system or MB/s by host): Most probably, this performance report is the only way that you are going to get an agreement from the client, but even this agreement is not certain. As we said before, the client does not normally understand the technical aspects of the IT infrastructure. The most typical way to define a performance SLA is by the average execution time or response time of a transaction in the application. So, the performance SLA/SLO for the DS8000 is normally an internal agreement among the support teams, which creates additional work for you to generate those reports, and there is no predefined solution. It depends on your environment's configuration and the conditions that define those SLOs/SLAs.

  We recommend that, when configuring the DS8000 with SLO/SLA requirements, you separate the applications or hosts by LSS (reserve two LSSs, one even and one odd, for each host, system, or instance). The benefit of generating performance reports using this method is that they are more meaningful to the other support teams and to the client. Consequently, the level of communication will increase significantly, which reduces the chances for misunderstandings.

Note: When defining a DS8000-related SLA or SLO, ensure that the goals are based on empirical evidence of performance within the environment. Application architects with applications that are highly sensitive to changes in I/O throughput or response time need to consider the measurement of percentiles or standard deviations as opposed to average values over an extended period of time. IT management must ensure that the technical requirements are appropriate for the technology.
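As a simple illustration of the point in the note above, the following Python sketch (with invented response-time samples) shows how an average can look healthy while the 95th percentile exposes the spikes that response-time-sensitive applications actually feel.

import numpy as np

# Hypothetical disk response-time samples in milliseconds; a few spikes included.
samples_ms = np.array([4.1, 3.8, 4.5, 4.0, 3.9, 4.2, 18.7, 4.3, 4.1, 22.4,
                       4.0, 3.7, 4.4, 4.2, 19.9, 4.1, 4.0, 3.9, 4.3, 4.2])

mean_rt = samples_ms.mean()
p95_rt = np.percentile(samples_ms, 95)

print(f"Average response time : {mean_rt:.1f} ms")
print(f"95th percentile       : {p95_rt:.1f} ms")
# The average can look acceptable while the 95th percentile reveals the spikes
# that the application users actually experience.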

In cases where contractual penalties are associated with production performance SLA or SLO misses, be extremely careful in the management and implementation of the DS8000.
Even in the cases where no SLA or SLO exists, users have performance expectations that are not formally communicated. In these cases, they will let IT management know when the performance of the IT resources is not meeting their expectations. Unfortunately, by the time they communicate their missed expectations, they are often frustrated, and their ability to manage their business is severely impacted by performance issues. While there might not be any immediate financial penalties associated with missed user expectations, prolonged negative experiences with under-performing IT resources will result in low user satisfaction.

6.3.2 Outputs
The outputs generated by this process are:
- Documentation of the defined DS8000 performance thresholds: It is important to document the agreed-to thresholds, not just for you, but also for other members of your team or other teams that need to know.
- DS8000 alerts for performance utilization: These alerts are generated when a DS8000 component reaches a defined level of utilization. With TotalStorage Productivity Center for Disk, you can automate the performance data collection and also configure TotalStorage Productivity Center to send an alert when this type of event occurs.
- Performance reports comparing the performance utilization of the DS8000 with the performance SLO and SLA.

6.3.3 Tasks, actors, and roles


It is easier to visualize and understand the tasks, actors, and roles when they are combined using a Responsible, Accountable, Consulted, and kept Informed (RACI) matrix, as shown in Figure 6-1.

Figure 6-1 Operational tasks, actors, and roles

Figure 6-1 is an example of a RACI matrix for the operational performance subprocess, with all the tasks, actors, and roles identified and defined. The tasks include:
- Provide the performance trends report: This report is an important input for the operational performance subprocess. With this data, you can identify and define the thresholds that best fit your DS8000. Consider how the workload is distributed among the internal components of the DS8000: host adapters, processor complexes, device adapters, and ranks. This analysis avoids the definition of thresholds that generate false-positive performance alerts and ensures that you monitor only what is relevant to your environment.


- Define the thresholds to be monitored and their respective values, severity, queue to open the ticket, and additional instructions: In this task, using the baseline performance report, you can identify and set the relevant threshold values. You can use TotalStorage Productivity Center to create alerts when these thresholds are exceeded. For example, you can configure TotalStorage Productivity Center to send the alerts via SNMP traps to Tivoli Enterprise Console (TEC) or via e-mail. However, the opening of an incident ticket needs to be performed by the Monitoring team, who will need to know the severity to set, on which queue to open the ticket, and any additional information that is required in the ticket. Figure 6-2 is an example of the required details, and a small sketch of how such a table can be encoded follows the figure.

Figure 6-2 Thresholds definitions table
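A table such as the one in Figure 6-2 can also be captured in a small script that the monitoring automation consults. The following Python sketch uses hypothetical metric names, limits, severities, and queue names purely for illustration; they are not IBM-recommended values.

# Hypothetical thresholds table; replace the values with the ones agreed for
# your own DS8000 and workloads.
thresholds = {
    "rank_total_iops":   {"limit": 1500, "severity": 3, "queue": "STORAGE"},
    "port_total_mbps":   {"limit": 280,  "severity": 3, "queue": "STORAGE"},
    "rank_read_resp_ms": {"limit": 15,   "severity": 2, "queue": "STORAGE"},
}

def check_sample(metric, value):
    """Return an alert record if the sample exceeds its defined threshold."""
    t = thresholds.get(metric)
    if t and value > t["limit"]:
        return {"metric": metric, "value": value,
                "severity": t["severity"], "queue": t["queue"]}
    return None

# Example: a rank driving 1750 IOPS would open a severity-3 ticket on the STORAGE queue.
print(check_sample("rank_total_iops", 1750))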

- Implement performance monitoring and alerting: After you have defined which DS8000 components to monitor, set their corresponding threshold values. For detailed information about how to configure TotalStorage Productivity Center, refer to the IBM TotalStorage Productivity Center documentation, which can be found at:
  http://publib.boulder.ibm.com/infocenter/tivihelp/v4r1/index.jsp
- Publish the documentation to the IT Management team: After you have implemented the monitoring, send the respective documentation to those people who need to know.

6.3.4 Performance troubleshooting


If an incident ticket is opened for a performance issue, you might be asked to investigate. If that occurs, the following tips can help during your problem determination.

Sample questions for an AIX host


The following questions and comments are examples of the type of questions to ask when engaged to analyze a performance issue. They might not all be appropriate for every environment:
- What was running during the sample period (backups, production batch, online queries, and so on)?
- Describe your application (complex data warehouse, online order fulfillment, DB2, and so forth).
- Explain the type of delays experienced. What other factors might indicate a SAN-related or DS8000-related I/O issue?
- Describe any recent changes or upgrades to the application, operating system, microcode, or database.
- When did the issue start, and is there a known frequency (specify the time zone)?

- Does the error report show any disk, logical volume (LV), host adapter card, or other I/O-type errors?
- What is the inter-policy of the logical volumes (maximum or minimum)?
- Describe any striping, mirroring, or RAID configurations in the affected LVs.
- Is this a production or development workload, and is it associated with benchmarking or breakpoint testing?

In addition to the answers to these questions, the client must provide server performance and configuration data. Refer to the relevant host chapters in this book for more detail.
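As a minimal sketch of gathering that data on the AIX host in question, the following Python wrapper runs a few standard AIX commands and saves their output for the performance team. The command options and the output location are assumptions and can vary by AIX level; adjust them for your environment.

import subprocess

# Commands whose output supports the questions above; options can vary by AIX level.
commands = {
    "error_report":    ["errpt"],                     # disk, LV, and adapter errors
    "volume_groups":   ["lsvg", "-o"],                # active volume groups
    "physical_vols":   ["lspv"],                      # hdisk to volume group mapping
    "iostat_extended": ["iostat", "-D", "60", "5"],   # five 60-second extended disk samples
}

for name, cmd in commands.items():
    result = subprocess.run(cmd, capture_output=True, text=True)
    with open(f"/tmp/perfdata_{name}.txt", "w") as out:   # hypothetical output location
        out.write(result.stdout)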

Identify problems and recommend a fix


To identify problems in the environment, the storage and performance management team must have the tools to monitor, collect, and analyze performance data. While the tools might vary, these processes are of little value without storage resource management tools. The following sample process provides a way to correct a DS8000 disk bottleneck:
1. The performance management team identifies the hot RAID array and the logical unit number (LUN) that needs to be migrated to alleviate the disk array bottleneck.
2. The performance team identifies a target RAID array with low utilization.
3. The client using the hot LUN is contacted and requested to open a change request to allocate a new LUN on a less utilized array.
4. The client opens a ticket and requests a new LUN.
5. The storage management team defines and zones the new LUN.
6. The client migrates data from the old LUN to the new LUN. The specifics of this step are operating system (OS)-specific and application-specific and are not detailed here.
7. The client opens a change request to delete the old LUN.
8. Performance management evaluates the change and confirms success. If the disk array still has disk contention, further changes might be recommended.

6.4 Tactical performance subprocess


This process deals with activities that occur over a cycle of weeks or months. These activities are related to the collection of data and the generation of reports for performance and utilization trends. With these reports, you can produce tuning recommendations. The process allows you to gain a better understanding of the workload of each application or host system. You can also verify that you are getting the expected benefits of new tuning recommendations implemented on the DS8000. The performance reports generated by this process are also used as inputs for the operational and strategic performance subprocesses.

Note: We recommend the tactical performance subprocess as the starting point for the implementation of a performance management process. Regardless of the environment, the implementation of proactive processes to identify potential performance issues before they become a crisis will save time and money.

All the methods that we describe depend on storage management reporting tools. These tools must provide for the long-term gathering of performance metrics, allow thresholds to be set, and provide for alerts to be sent. These capabilities permit Performance Management to effectively analyze the I/O workload and establish proactive processes to manage potential problems. Key performance indicators provide information to identify consistent performance hot spots or workload imbalances.

6.4.1 Inputs
The inputs necessary to make this process effective are:
- Product specifications: Documents that describe the characteristics and features of the DS8000, such as data sheets, Announcement Letters, and planning manuals
- Product documentation: Documents that provide information about the installation and use of the DS8000, such as user manuals, white papers, and IBM Redbooks publications
- Performance SLOs and performance SLAs: The documentation of the performance SLOs/SLAs to which the client has agreed for the DS8000

6.4.2 Outputs
Performance reports with tuning recommendations and performance trends reports are the outputs that are generated by this process.

Performance reports with tuning recommendations


We recommend that you create your performance report with at least three chapters:
- Hardware view: This view provides information about the performance of each component of the DS8000. You can determine the health of the storage subsystem by analyzing key performance metrics for the host adapter, port, array, and volume. You can generate workload profiles at the DS8000 subsystem or component level by gathering key performance metrics for each component.
- Application or host system view: This view helps you identify the workload profile of each application or host system accessing the DS8000. It helps you understand which RAID configuration performs best or whether it is advisable to move a specific system from ranks with DDMs of 73 GB/15k rpm to ranks with DDMs of 300 GB/15k rpm, for example. This information also helps other support teams, such as the Database Administrators, the IT Architect, the IT Manager, or even the clients. This type of report can make your meetings more interactive, with more people asking questions.
- Conclusions and recommendations: This section is the information that the IT Manager, the clients, and the IT Architect read first. Based on the recommendations, agreement can be reached about the actions to implement to mitigate any performance issue.

Performance trends reports


It is important for you to see how the performance of the DS8000 is changing over time. As in the performance report with tuning recommendations, we recommend that you create three chapters:
- Hardware view: Analysis of the key performance indicators referenced in 8.3, TotalStorage Productivity Center data collection on page 214.
- Application or host system view: One technique for creating workload profiles is to group volume performance data into logical categories based on their application or host system.
- Conclusions and recommendations: The reports might help you recommend changing the threshold values for your DS8000 performance monitoring or alerting. The report might show that a specific DS8000 component will soon reach its capacity or performance limit.


6.4.3 Tasks, actors, and roles


Figure 6-3 shows an example of a RACI matrix for the tactical performance subprocess with all the tasks, actors, and roles identified and defined.

Figure 6-3 Tactical tasks, actors, and roles

The tasks include:
- Collect configuration and raw performance data: Use TotalStorage Productivity Center for Disk daily probes to collect configuration data. Set up the Subsystem Performance Monitor to run indefinitely and to collect data at 15-minute intervals for each DS8000.
- Generate performance graphics: Produce one key metric for each physical component in the DS8000 over time. For example, if your workload is evenly distributed across the entire day, show the average daily disk utilization for each disk array. Configure thresholds in the chart to identify when a performance constraint might occur. Use a spreadsheet to create a linear trend line based on the data previously collected and identify when a constraint might occur (a small sketch of this kind of trend projection follows this list).
- Generate performance reports with tuning recommendations: Collect and review host performance data on a regular basis. On discovery of a performance issue, Performance Management must work with the Storage team and the client to develop a plan to resolve it. This plan can involve some form of data migration on the DS8000. For key I/O-related performance metrics, refer to the chapters in this book for your specific operating system. Typically, the end-to-end I/O response time is measured, because it provides the most direct measurement of the health of the SAN and disk subsystem.
- Generate performance reports for trend analysis: Methodologies typically applied in capacity planning can be applied to the storage performance arena. These methodologies rely on workload characterization, historical trends, linear trending techniques, I/O workload modeling, and what-if scenarios. You can obtain details about Disk Magic in Chapter 7, Performance planning tools on page 161.
- Schedule meetings with the involved areas: The frequency depends on the dynamism of your IT environment. The rate of change, such as the deployment of new systems, upgrades of software and hardware, fix management, allocation of new LUNs, implementation of Copy Services, and so on, determines the frequency of meetings. For the performance reports with tuning recommendations, we recommend weekly meetings. For performance trends reports, we recommend monthly meetings. You might want to change the frequency of these meetings after you gain more confidence and familiarity with the performance management process. At the end of these meetings, define with the other support teams and the IT Manager the actions to resolve any potential issues that have been identified.
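The following Python sketch shows the kind of linear trend projection described in the second task, using invented weekly utilization samples and an example threshold; it is an illustration, not a sizing tool.

import numpy as np

weeks = np.arange(1, 13)                       # 12 weeks of history
util  = np.array([31, 33, 32, 35, 36, 38, 37, 40, 41, 43, 44, 46])  # percent busy
threshold = 60.0                               # example planning threshold

slope, intercept = np.polyfit(weeks, util, 1)  # least-squares linear fit
weeks_to_limit = (threshold - intercept) / slope

print(f"Growth of about {slope:.1f}% per week; "
      f"threshold of {threshold:.0f}% reached around week {weeks_to_limit:.0f}")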


6.5 Strategic performance subprocess


The strategic performance subprocess is related to activities that occur over a cycle of six to 12 months. These activities define or review standards and architecture or size new or existing DS8000s. It might be an obvious observation, but it is important to remember that IT resources are finite and some day they will run out. In the same way, the money to invest in the IT infrastructure is limited, which is why this process is important.

In each company, there is normally a time when the budget for the next year is decided. Even if you present a list of requirements with performance reports to justify the investments, you might not be successful. The timing of the request and the benefit of the investment to the business are also important considerations. Just keeping the IT systems up and running is not enough. The IT Manager and Chief Information Officer (CIO) need to show business benefit for the company. Usually, this benefit means providing the service at the lowest cost but also showing a financial advantage that the services provide. This benefit is how the IT industry has grown over the years while, at the same time, it has increased productivity, reduced costs, and enabled new opportunities.

You need to check with your IT Manager or IT Architect to learn when the budget will be set and start three to four months before this date. You can then define the priorities for the IT infrastructure for the coming year to meet the business requirements.

6.5.1 Inputs
The inputs required to make this process effective are:
- Performance reports with tuning recommendations
- Performance trends reports

6.5.2 Outputs
The outputs generated by this process are:
- Standards and architectures: Documents that specify:
  - Naming conventions for the DS8000 components: ranks, extent pools, volume groups, host connections, and LUNs
  - Rules to format and configure the DS8000: arrays, RAID, ranks, extent pools, volume groups, host connections, logical subsystems (LSSs), and LUNs
  - Policy allocation: When to pool the applications or host systems on the same set of ranks and when to segment the applications or host systems on different ranks. Which type of workload must use RAID 5, RAID 6, or RAID 10? Which type of workload must use DDMs of 73 GB/15k rpm, 146 GB/15k rpm, or 300 GB/15k rpm?
- Sizing of a new or existing DS8000: According to the business demands, what are the recommended capacity, cache, and host ports for a new or existing DS8000?
- Planned configuration of a new DS8000: What is the planned configuration of the new DS8000, based on your standards and architecture and according to the workload of the systems that will be deployed?


6.5.3 Tasks, actors, and roles

Figure 6-4 Strategic tasks, actors, and roles

Figure 6-4 is an example of a RACI matrix for the strategic performance subprocess with all the tasks, actors, and roles identified and defined:
- Define priorities of new investments: In defining the priorities of where to invest, you must consider these four objectives:
  - Reduce cost: The simplest example is storage consolidation. There might be several storage systems in your data center that are nearing the end of their useful life. The costs of maintenance are increasing, and the storage subsystems use more energy than new models. The IT Architect can create a case for storage consolidation but will need your help to specify and size the new storage.
  - Increase availability: There are production systems that need to be available 24x7. The IT Architect needs to submit a new solution for this case to provide data mirroring. The IT Architect will require your help to specify the new storage for the secondary site and to provide figures for the necessary performance.
  - Mitigate risks: Consider a case where a system is running on an old storage model without a support contract from the vendor. That system started as a pilot with no importance; over time, it demonstrated great performance and is now a key application for the company. The IT Architect needs to submit a proposal to migrate to a new storage system. Again, the IT Architect will need your help to specify the new storage requirements.
  - Business unit demands: Depending on the target results that each business unit has to meet, the business units might require additional IT resources. The IT Architect will require information about the additional capacity that is required.
- Define and review standards and architectures: After you have defined the priorities, you might need to review the standards and architecture. New technologies will appear, so you might need to specify new standards for new storage models. Or maybe, after a period of time analyzing the performance of your DS8000, you discover that for a certain workload you need to change a standard.
- Size a new or existing DS8000: Modeling tools, such as Disk Magic, which is described in 7.1.4, Disk Magic modeling on page 163, can gather multiple workload profiles based on host performance data into one model and provide a method to assess the impact of one or more changes to the I/O workload or DS8000 configuration.

Note: For environments with multiple applications on the same physical servers or on logical partitions (LPARs) using the same Virtual I/O Servers, defining new requirements can be quite challenging. We recommend building profiles at the DS8000 level first and eventually moving into a more in-depth study and understanding of the other shared resources in the environment.


- Plan the configuration of the new DS8000: Configuring the DS8000 to meet the specific I/O performance requirements of an application reduces the probability of production performance issues. To produce a design that meets these requirements, Storage Management needs to know the following (a rough sizing sketch follows this list):
  - I/Os per second
  - Read-to-write ratio
  - I/O transfer size
  - Access type: sequential or random
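As a rough illustration of how these workload attributes translate into back-end load, the following Python sketch applies the common RAID write penalties (RAID 5 = 4, RAID 6 = 6, RAID 10 = 2 back-end operations per random write) to a hypothetical requirement. The per-rank IOPS figure is an assumed planning value, not a DS8000 specification; real sizing belongs in Disk Magic.

front_end_iops = 12000
read_ratio     = 0.70                 # 70% reads, 30% writes (from the workload profile)
write_penalty  = {"RAID5": 4, "RAID6": 6, "RAID10": 2}
assumed_iops_per_rank = 1400          # hypothetical planning value for a 15K rpm rank

reads  = front_end_iops * read_ratio
writes = front_end_iops * (1 - read_ratio)

for raid, penalty in write_penalty.items():
    back_end = reads + writes * penalty          # back-end disk operations per second
    ranks = -(-back_end // assumed_iops_per_rank)  # ceiling division
    print(f"{raid}: ~{back_end:,.0f} back-end IOPS -> at least {ranks:.0f} ranks")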

For help in translating application profiles to I/O workload, refer to Chapter 3, Understanding your workload on page 29. After the I/O requirements have been identified, documented, and agreed upon, the DS8000 layout and logical planning can begin. Refer to Chapter 5, Logical configuration performance considerations on page 63 for additional detail and considerations for planning for performance.

Note: A lack of communication between the Application Architects and the Storage Management team regarding I/O requirements will likely result in production performance issues. It is essential that these requirements are clearly defined.

For existing applications, you can use Disk Magic to analyze an application's I/O profile. Details about Disk Magic are in Chapter 7, Performance planning tools on page 161.


Chapter 7.

Performance planning tools


In this chapter, we present the Disk Magic tool that can be used for DS8000 capacity and performance planning. It is available for both System z and Open Systems. Typically, the IBM Business Partner or the IBM marketing support representative uses the Disk Magic tool. It is also available for users through IntelliMagic.


7.1 Disk Magic


We describe Disk Magic and its use. We include examples that show the required input data, how the data is fed into the tool, and also show the output reports and information that Disk Magic provides.

Note: Disk Magic is available for the use of IBM Business Partners, IBM representatives, and users. Clients must contact their IBM representative to run the Disk Magic tool when planning for their DS8000 hardware configurations. Disk Magic for Windows is a product of IntelliMagic. Their Web site is:
http://www.intellimagic.net/

7.1.1 The need for performance planning and modeling tools


Trying to apply the results of a lab benchmark to another environment and making extrapolations and other adjustments for the differences (and there are usually a significant number) generally leads to a low-confidence result. One of the problems with general rules is that they are based on assumptions about the workload, which have been lost or were never documented in the first place. For example, a general rule for I/O rates that applies to 4 KB transfers does not necessarily apply to 256 KB transfers. A particular rule only applies if the workload is the same and all the hardware components are the same, which of course they seldom are. Disk Magic overcomes this inherent lack of flexibility in rules by allowing the person running the model to explicitly specify the details of the workload and the details of the hardware. Disk Magic then computes the result in terms of response time and resource utilization of running that workload on that hardware.

Disk Magic models are often based on estimates of the workload. For example, what is the maximum I/O rate that the storage server will see? This I/O rate is obviously dependent on identifying the historical maximum I/O rate (which can require a bit of searching) and then possibly applying adjustment factors to account for anticipated changes. The more you can substitute hard data or even reasonable estimates for assumptions and guesses, the more accurate the results will be. In any event, a Disk Magic model is likely to be far more accurate than results obtained by adjusting benchmark results or by applying general rules. Disk Magic is calibrated to match the results of lab runs documented in sales materials and white papers. You can view it as an encoding of the data obtained in benchmarks and reported in white papers.

When the Disk Magic model is run, it is important to size each component of the storage server for its peak usage period, usually a 15 or 30 minute interval. Using a longer period tends to average out the peaks and non-peaks, which does not give a true reading of the maximum demand. Different components can peak at different times. For example, a processor-intensive online application might drive processor utilization to a peak while users are actively using the system. However, disk utilization might be at a peak when the files are backed up during off-hours. So, you might need to model multiple intervals to get a complete picture of your processing environment.
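To illustrate why the interval length matters, the following Python sketch compares the peak 15-minute average with the hourly average of the same invented utilization samples; the sample values and resample intervals are assumptions for illustration only.

import pandas as pd
import numpy as np

# One hour of invented utilization samples, one per minute (percent busy).
idx = pd.date_range("2009-03-01 10:00", periods=60, freq="1min")
util = pd.Series(np.r_[np.full(45, 30.0), np.full(15, 90.0)], index=idx)

hourly_avg = util.resample("1h").mean().max()     # 45% busy: looks comfortable
peak_15min = util.resample("15min").mean().max()  # 90% busy: the real peak demand

print(f"Hourly average peak : {hourly_avg:.0f}%")
print(f"15-minute peak      : {peak_15min:.0f}%")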


7.1.2 Overview and characteristics


Disk Magic is a Windows-based disk subsystem performance modeling tool. Disk Magic can be used to help plan the DS8000 hardware configuration. With Disk Magic, you model the DS8000 performance when migrating from another disk subsystem or when making changes to an existing DS8000 configuration and its I/O workload. Disk Magic is for use with both System z (zSeries) and Open Systems server workloads. When running the DS8000 modeling, you start from one of these scenarios:
- An existing, non-DS8000 disk subsystem, which you want to migrate to a DS8000. This system can be an IBM product, such as an IBM 3990-6 or an IBM TotalStorage Enterprise Storage Server (ESS), or a non-IBM disk subsystem. Because a DS8000 can have much greater storage and throughput capacity than other disk storage subsystems, with Disk Magic you can merge the workload from several existing disk subsystems into a single DS8000.
- An existing DS8000 workload.
- A planned new workload, even if you do not have the workload currently running on any disk subsystem. Here, you need an estimate of the workload characteristics, such as disk capacity, I/O rate, and cache statistics, which provides an estimate of the DS8000 performance results. Only use an estimate for extremely rough planning purposes.

7.1.3 Output information


Disk Magic models the DS8000 performance based on the I/O workload and the DS8000 hardware configuration. Thus, it helps in the DS8000 capacity planning and sizing decisions. The major DS8000 components that can be modeled using Disk Magic are:
- DS8000 model: DS8100, DS8300, or DS8300 logical partitioning (LPAR)
- Cache size for the DS8000
- Number, capacity, and speed of disk drive modules (DDMs)
- Number of arrays and RAID type
- Type and number of DS8000 host adapters
- Type and number of channels
- Remote Copy option

When working with Disk Magic, always make sure to feed in accurate and representative workload information, because the Disk Magic results depend on the input data provided. Also, carefully estimate future demand growth, which is fed into Disk Magic for modeling projections on which the hardware configuration decisions will be made.

7.1.4 Disk Magic modeling


The process of modeling with Disk Magic starts with creating a base model for the existing disk subsystems. Initially, you load the input data that describes the hardware configuration and workload information of those disk subsystems. When you create the base model, Disk Magic validates the hardware and workload information that you entered, and if everything is acceptable, a valid base model is created. If not, Disk Magic provides messages explaining why the base model cannot be created, and it shows the errors on the log. After the valid base model is created, you proceed with your modeling. Essentially, you change the hardware configuration options of the base model to decide what is your best DS8000 configuration for a given workload. Or, you can modify the workload values that you initially entered, so, for example, you can see what happens when your workload grows or its characteristics change.


Welcome to Disk Magic


When we launch the Disk Magic program, we start with the Welcome to Disk Magic panel (refer to Figure 7-1). In this panel, we have the option to either:
- Open an existing Disk Magic file, which has an extension of DM2.
- Open an automated input file:
  - For zSeries modeling, this input file is called the Disk Magic Automated Input file, known as the DMC file, and is created using Resource Measurement Facility (RMF) Magic (refer to 15.10, RMF on page 464).
  - A comma separated values (csv) output file from TotalStorage Productivity Center.
  - Performance IOSTAT and TXT files from Open Systems.
  - A PT Report file from System i (iSeries).
- Create a new project by entering the input data manually. The options are:
  - zSeries servers
  - Open servers
  - iSeries servers
  - A Transaction Processing Facility (TPF) project
  - A SAN Volume Controller project

Figure 7-1 Welcome to Disk Magic


7.2 Disk Magic for System z (zSeries)


We explain how to use Disk Magic as a modeling tool for System z (still designated as zSeries in Disk Magic). In the example presented, we merge two ESS-800s into one DS8300. We need the DMC file for the period for which we want to create the Disk Magic model. The periods to model are usually:
- Peak I/O period
- Peak Read + Write throughput in MBps
- Peak Write throughput in MBps

DMC files are created by RMF Magic (refer to 15.10, RMF on page 464). The file name uses a dmc suffix, shows the date and time of the corresponding RMF period, and uses the following naming convention:
- The first two characters represent an abbreviation of the month.
- The next two digits represent the date.
- The following four digits show the time period.

For example, JL292059 means that the DMC file was created for the RMF period of July 29 at 20:59.
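The naming convention can be decoded with a few lines of Python, shown here as a small illustrative sketch; only the JL (July) month abbreviation is documented in this example, so the month code is returned as-is rather than mapped to a month number.

def parse_dmc_name(filename):
    stem = filename.upper().split(".")[0]   # e.g. "JL292059"
    month_code = stem[0:2]                  # month abbreviation
    day = int(stem[2:4])                    # day of month
    time = f"{stem[4:6]}:{stem[6:8]}"       # end of the RMF period
    return month_code, day, time

print(parse_dmc_name("JL292059.dmc"))       # ('JL', 29, '20:59') -> July 29 at 20:59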

7.2.1 Process the DMC file


The steps to process the DMC file are: 1. From the panel in Figure 7-1 on page 164, first we select zSeries or WLE Automated Input - with extension DMC from the Create New Project Using Automated Input options. Figure 7-2 shows the new window that opens and where you can select the DMC file that you want to use.

Figure 7-2 zSeries select DMC file

2. In this particular example, we select the JL292059.dmc file, which opens the following window, shown in Figure 7-3 on page 166.


Figure 7-3 zSeries DMC file opened

3. Here, we see that there are four LPARs (SYSA, SYSB, SYSC, and SYSD) and two disk subsystems (IBM-12345 and IBM-67890). Clicking the IBM-12345 icon opens the general information that is related to this disk subsystem (Figure 7-4). It shows that this is an ESS-800 with 32 GB of cache, and that it was created in RMF Magic using the subsystem identifier (SSID) or logical control unit (LCU) level. The number of subsystem identifiers (SSIDs) or LCUs is 12, as shown in the Number of zSeries LCUs field.

Figure 7-4 zSeries general information of the disk subsystem

4. Selecting Hardware Details in Figure 7-4 brings up the window in Figure 7-5 on page 167 and allows you to change the following features, based on the actual hardware configuration of the ESS-800:
- SMP Type
- Number of host adapters
- Number of device adapters
- Cache Size


Figure 7-5 zSeries configuration details

5. Next, click the Interface tab shown in Figure 7-4 on page 166. We see that each LPAR connects to the disk subsystem through eight FICON Express2 2 Gb channels. If this is not correct, you can change it by clicking Edit.

Figure 7-6 zSeries LPAR to disk subsystem Interfaces

6. Selecting From Disk Subsystem in Figure 7-6 shows the interface used by the disk subsystem. Figure 7-7 on page 168 indicates that ESS IBM-12345 uses eight FICON Ports. In this panel, you also indicate if there is a Remote Copy relationship between this ESS-800 and a remote disk subsystem. You also get a choice to define the connections used between the Primary site and the Secondary site.


Figure 7-7 zSeries disk subsystem to LPAR interfaces

7. The next step is to look at the DDM by clicking the zSeries Disk tab. The DDM type shows up here as 36 GB/15K rpm. Because the DDM used is actually 73 GB/10K rpm, we update this information by clicking Edit. The 3390 types or models here are 3390-3 and 3390-9 (Figure 7-8). Because any 3390 model that has a greater capacity than a 3390-9 model will show up as a 3390-9 in the DMC file, we need to know the actual models of the 3390s. Generally, there is a mixture of 3390-9, 3390-27, and 3390-54.

Figure 7-8 zSeries DDM option

8. To see the last option, select zSeries Workload. Because this DMC file is created using the SSID or LCU option, here we see the I/O statistics for each LPAR by SSID (Figure 7-9 on page 169). If we click the Average tab to the right of SYSA (the Average tab at the top in Figure 7-9 on page 169) and scroll to the right of SSID 4010 and click the Average tab (the Average tab at the bottom in Figure 7-10 on page 169), we get the total I/O rate from all four LPARs to this ESS-800, which is 9431.8 IOPS (Figure 7-10 on page 169).


Figure 7-9 zSeries I/O statistics from SYSA on SSID 4010

Figure 7-10 zSeries I/O statistics from all LPARs to this ESS-800

9. Clicking Base creates the base model for this ESS-800. If the workload statistics mean that the base model cannot be created, the cause might be an excessive CONN time, for example; in that case, we have to find another DMC file from a different time period and try to create the base model from that DMC file. After creating this base model for IBM-12345, we must also create the base model for IBM-67890, following this same procedure.


7.2.2 zSeries model to merge the two ESS-800s to a DS8300


Now, we start the procedure to merge the two ESS-800s onto a DS8300:
1. In Figure 7-11, right-click IBM-12345, and click Merge in the window that opens. In the next window, select Add to Merge Source Collection and create New Target. This option creates Merge Target1, which is the new disk subsystem that we use as the merge target (Figure 7-12).

Figure 7-11 zSeries merge and create new target disk subsystem

Figure 7-12 zSeries merge target disk subsystem

2. Because we want to merge the ESS-800s to a DS8300, we need to modify this Merge Target1. Clicking IBM DS8100 on the Hardware Type option opens a window presenting choices, where we can select the IBM DS8300 Turbo. We also select Parallel Access Volumes so that Disk Magic will model the DS8300 to take advantage of this feature.


3. Selecting Hardware Details opens the window in Figure 7-13. If we had selected the IBM DS8300 LPAR Turbo option, this option also allows us to select from the Processor Percentage option, with the choices of 25, 50, or 75. The Failover Mode option allows you to model the performance of the DS8000 when one processor server with its associated processor storage has been lost. Here, we can select the cache size; in this case, we select 64 GB, because the two ESS-800s each have 32 GB of cache. In a DS8300, this selection automatically also determines the nonvolatile storage (NVS) size. Disk Magic computes the number of host adapters on the DS8000 based on the specification on the Interfaces page, but you can, to a certain extent, override these numbers. We recommend that you use one host adapter for every two ports, for both the Fibre Channel connection (FICON) ports and the Fibre ports. The Fibre ports are used for Peer-to-Peer Remote Copy (PPRC) links. Here, we select 4 FICON Host Adapters because we are using eight FICON ports on the DS8300 (refer to the Count column in Figure 7-15 on page 172).

Figure 7-13 zSeries target hardware details option

4. Clicking the Interfaces tab opens the From Servers dialog (Figure 7-14 on page 172). Because the DS8300 FICON ports are running at 4 Gbps, we need to update this option on all four LPARs and also on the From Disk Subsystem (Figure 7-15 on page 172) dialog. If the Host CEC uses different FICON channels than what is specified here, it also needs to be updated. At this point, you select and determine the Remote Copy Interfaces. You need to select the Remote Copy type and the connections used for the Remote Copy links.


Figure 7-14 zSeries target interfaces from LPARs

Figure 7-15 zSeries target interfaces from disk subsystem

5. To select the DDM capacity and rpm used, click the zSeries Disk tab in Figure 7-15. Now, you can select the DDM type used by clicking Edit in Figure 7-16 on page 173. In our example, we select DS8000 146GB/15k DDM. Usually, you do not specify the number of volumes used, but let Disk Magic determine it by adding up all the 3390s coming from the merge source disk subsystems. If you know the configuration that will be used as the target subsystem and want the workload to be spread over all the DDMs in that configuration, you can select the number of volumes on the target subsystem so that it will reflect the number of ranks configured. You can also specify the RAID type used for this DDM set.


Figure 7-16 zSeries target DDM option

6. Merge the second ESS onto the target subsystem. In Figure 7-17, right-click IBM-67890, select Merge, and then, select Add to Merge Source Collection.

Figure 7-17 zSeries merge second ESS-800

7. Perform the merge procedure. From the Merge Target window (Figure 7-18 on page 174), click Start Merge.


Figure 7-18 zSeries start the merge

8. This selection initiates Disk Magic to merge the two ESS-800s onto the new DS8300 and creates Merge Result1 (Figure 7-19).

Figure 7-19 zSeries disk subsystem created as the merge result

9. To see the DDM configured for the DS8300, select zSeries Disk on MergeResult1. Here, you can see the total capacity configured based on the total number of volumes on the two ESS-800s (Figure 7-20 on page 175). There are 11 ranks of 146 GB/15K rpm DDM required.


Figure 7-20 zSeries DDM of the new DS8300

10.Selecting zSeries Workload shows the Disk Magic predicted performance of the DS8300. You can see that the modelled DS8300 will have an estimated response time of 1.1 msec. Note that Disk Magic assumes that the workload is spread evenly among the ranks within the extent pool configured for the workload (Figure 7-21).

Figure 7-21 zSeries performance statistics of the DS8300

11. Clicking Utilization brings up the utilization statistics of the various DS8300 components. In Figure 7-22 on page 176, you can see that the Average FICON HA Utilization is 39.8% and has a darker (amber) background color. This amber background is an indication that the utilization of that resource is approaching its limit. This percentage is still acceptable at this point, but it is a warning that workload growth might push this resource to its limit. Any resource that is a bottleneck will be shown with a red background. If a resource has a red background, you need to increase the size of that resource to resolve the bottleneck.


Figure 7-22 zSeries DS8300 utilization statistics

12. Figure 7-23 on page 177 can be used as a guideline for the various resources in a DS8000. The middle column has an amber background color, and the rightmost column has a red background color. The amber number indicates a warning: if the resource utilization reaches this number, an increase in the workload might soon cause the resource to reach its limit. The red numbers are the utilization levels at which the resource is already saturated and will cause an increase in one of the components of the response time.
13. Of course, it is better if the merge result shows that none of the resource utilizations falls into the amber category.


Figure 7-23 Disk Magic utilization guideline

7.2.3 Disk Magic performance projection for zSeries model


Based on the previous modeling results, we can create a chart comparing the performance of the original ESS-800s with the performance of the new DS8300: 1. Clicking 2008/07/29 20:59 puts the LPAR and disk subsystems in the right column.

Figure 7-24 zSeries panel for performance comparison


2. Hold the Ctrl key down on the keyboard and select IBM-12345, IBM-67890, and MergeResult1. Right-click any of them and a small window pops up. Select Graph from this window. In the panel that appears (Figure 7-25), select Clear to clear up any graph option that might have been set up before.

Figure 7-25 zSeries graph option panel

3. Click Plot to produce the response time components graph of the three disk subsystems that you selected in a Microsoft Excel spreadsheet. Figure 7-26 is the graph that was created based on the numbers from the Excel spreadsheet.

(Chart: response time components, in msec, for ESS-12345 at 9,432 I/O per second, ESS-67890 at 6,901 I/O per second, and the DS8300 at 16,332 I/O per second, broken down into Conn, Disc, Pend, and IOSQ time.)

Figure 7-26 zSeries response time comparison


You might have noticed that the DS8300 Response Time on the chart shown in Figure 7-26 on page 178 is 1.0 msec, while the Disk Magic projected Response Time of the DS8300 in Figure 7-28 on page 180 is 1.1 msec. They differ, because Disk Magic rounds the performance statistics up to one decimal place.

7.2.4 Workload growth projection for zSeries model


A useful feature of Disk Magic is the capability to create a workload growth projection and observe the impact of this growth on the various DS8000 resources and also on the response time. To run this workload growth projection, click Graph in Figure 7-21 on page 175 to open the panel that is shown in Figure 7-27. Click Range Type, and choose I/O Rate, which fills the from field with the I/O rate of the current workload, which is 16,332.4 I/O per second, the to field with 20,332.4, and the by field with 1,000. We can change these numbers. In our case, we change the numbers to 16000, 50000, and 2000. Select Clear to clear any graph option that might have been set up before. Now, select Plot. An error message indicates a host adapter Utilization > 100%. This message means that you cannot increase the I/O rate up to 50000 I/O per second because of a FICON host adapter bottleneck. Click OK to complete the graph creation. The graph is based on the I/O rate increase as shown in Figure 7-28 on page 180. Next, select Utilization Overview in the Graph Data choices, then click Clear and Plot to produce the chart shown in Figure 7-29 on page 180.

Figure 7-27 zSeries workload growth projection dialog


(Chart: response time components, in msec, plotted against the I/O rate from 16,000 to 40,000 I/O per second; the modeled response time grows from about 1.0 msec at 16,000 I/O per second to 3.6 msec at 40,000 I/O per second, with the Conn, Disc, Pend, and IOSQ components shown.)


Figure 7-28 zSeries workload growth impact on response time components

In Figure 7-29, observe that the FICON host adapter started to reach the red area at 22000 I/O per second. The workload growth projection stops at 40000 I/O per second, because the FICON host adapter reaches 100% utilization when the I/O rate is greater than 40000 I/O pr second.
(Table: utilization of the Average SMP, Average Bus, Average Logical Device, Highest DA, Highest HDD, Average FICON HA, and Highest FICON Port at I/O rates from 16,000 to 40,000 I/O per second, together with their amber and red thresholds. The Average FICON HA utilization rises from 39.0% at 16,000 I/O per second to 97.6% at 40,000 I/O per second, crossing its 50% red threshold at approximately 22,000 I/O per second.)

Figure 7-29 zSeries workload growth impact on resource utilization

7.3 Disk Magic for Open Systems


In this section, we show how to use Disk Magic as a modeling tool for Open Systems. We illustrate the example of merging two ESS-800s into one DS8300. In this example, we use the TotalStorage Productivity Center comma separated values (csv) output file for the period that we want to model with Disk Magic. The periods to model are usually:
- Peak I/O period
- Peak Read + Write throughput in MBps
- Peak Write throughput in MBps

TotalStorage Productivity Center for Disk creates the TotalStorage Productivity Center csv output files.
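These intervals can also be picked out of the TotalStorage Productivity Center csv export with a short script, for example the following Python sketch; the column names used here (interval, io_rate, read_mbps, write_mbps) are hypothetical and must be matched to the headings in your own export.

import pandas as pd

df = pd.read_csv("ESS11_TPC.csv")                 # one row per measurement interval
df["total_mbps"] = df["read_mbps"] + df["write_mbps"]

peak_io    = df.loc[df["io_rate"].idxmax()]       # peak I/O period
peak_total = df.loc[df["total_mbps"].idxmax()]    # peak read + write throughput
peak_write = df.loc[df["write_mbps"].idxmax()]    # peak write throughput

for label, row in [("Peak I/O", peak_io), ("Peak total MBps", peak_total),
                   ("Peak write MBps", peak_write)]:
    print(label, row["interval"], row["io_rate"], row["total_mbps"])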

7.3.1 Process the TotalStorage Productivity Center csv output file


We used these steps to process the csv files: 1. From Figure 7-1 on page 164, we select Open and iSeries Automated Input and click OK. This selection opens a window where you can select the csv files to use. In this case, we select the ESS11_TPC.csv, and then, we select ESS14_TPC.csv while holding the Ctrl key down. Then, click Open as shown in Figure 7-30.

Figure 7-30 Open Systems select input file

2. The result is shown in Figure 7-31 on page 182. To include both csv files, click Select All and then click Process to display the I/O Load Summary by Interval table (refer to Figure 7-35 on page 184). This table shows the combined load of both ESS11 and ESS14 for all the intervals recorded in the TotalStorage Productivity Center csv file.


Figure 7-31 Open Systems Multiple File Open panel

3. Selecting Excel in Figure 7-35 on page 184 creates a spreadsheet with graphs for the I/O rate (Figure 7-32), Total MBps (Figure 7-33 on page 183), and Write MBps (Figure 7-34 on page 183) for the combined workload on both of the ESS-800s by time interval. Figure 7-32 shows the I/O Rate graph and that the peak is just over 18,000 IOPS.

[Chart: I/O rate by interval time, 1-Aug to 23-Aug; the peak is approximately 18,000 I/Os per second]
Figure 7-32 Open Systems I/O rate by time interval


Figure 7-33 shows the total MBps graph; the peak is approximately 4700 MBps. This peak appears out of line compared to the total MBps numbers from the other periods. Before using this interval to model the peak total MBps, investigate whether this peak is real or whether something unusual happened during this period that caused this anomaly.

[Chart: total MBps by interval time, 1-Aug to 23-Aug; the peak is approximately 4,700 MBps]
Figure 7-33 Open Systems total MBps by time interval

Also, investigate the situation for the peak write MBps on Figure 7-34. The peak period here, as expected, coincides with the peak period of the total MBps.

[Chart: write MBps by interval time, 1-Aug to 23-Aug]
Figure 7-34 Open Systems write MBps by time interval


4. Clicking the I/O Rate column header in Figure 7-35 highlights the peak I/O rate for the combined ESS11 and ESS14 disk subsystems (Figure 7-35).

Figure 7-35 Open Systems I/O Load Summary by Interval

5. From this panel, select Add Model and then select Finish. A pop-up window prompts you with Did you add a Model for all the intervals you need? because you can include multiple workload intervals in the model. However, we just model one workload interval, so we respond Yes. The window in Figure 7-36 opens.

Figure 7-36 Open Systems disk subsystems

6. In Figure 7-36, double-click ESS11 to get the general information related to this disk subsystem, as shown in Figure 7-37 on page 185, which shows that this disk subsystem is an ESS-800 with 16 GB of cache and 2 GB of NVS.


Figure 7-37 Open Systems general information about the ESS11 disk subsystem

7. Selecting Hardware Details in Figure 7-37 allows you to change the following features, based on the actual hardware configuration of the ESS-800:
- SMP type
- Number of host adapters
- Number of device adapters
- Cache size

In this example (Figure 7-38), we change the cache size to the actual cache size of the ESS11, which is 32 GB.

Figure 7-38 Open Systems configuration details

8. Next, click the Interface tab in Figure 7-37. The From Servers panel (Figure 7-39 on page 186) shows that each server connects to the disk subsystem through four 2 Gb Fibre Channels. If this information is not correct, you can change it by clicking Edit.


Figure 7-39 Open Systems server to disk subsystem interfaces

9. Selecting the From Disk Subsystem option in Figure 7-39 displays the interfaces used by the disk subsystem. Figure 7-40 shows that ESS11 uses eight 2 Gb Fibre ports. We need to know how many Fibre ports are actually used here: two servers access this ESS-800 and each server uses four Fibre Channels, so there can be up to eight Fibre ports on the ESS-800. In this particular case, there are eight Fibre ports on the ESS-800. If there are more (or fewer) Fibre ports, you can update this information by clicking Edit. 10.On this panel, you also indicate whether there is a Remote Copy relationship between this ESS-800 and a remote disk subsystem. It also gives you a choice to define the connections used between the Primary site and the Secondary site.

Figure 7-40 Open Systems disk subsystem to server interfaces


11.The next step is to look at the DDM by choosing the Open Disk tab. Figure 7-41 shows DDM options by server. Here, we fill in or select the actual configuration specifics of ESS11, which is accessed by server Sys_ESS11. The configuration details are:
- Total capacity: 12000 GB
- DDM type: 73 GB/10K rpm
- RAID type: RAID 5
12.Next, click Sys_ESS14 in Figure 7-41 and leave the Total Capacity at 0, because ESS11 is accessed by Sys_ESS11 only.

Figure 7-41 Open Systems DDM option by server

13.Selecting the Total tab in Figure 7-41 displays the total capacity of the ESS-800, which is 12 TB on 28 RAID ranks of 73 GB/10K rpm DDMs, as shown in Figure 7-42.

Figure 7-42 Open Systems disk subsystem total capacity


14.To see the last option, select Open Workload in Figure 7-42 on page 187. Figure 7-43 shows that the I/O rate from Sys_ESS11 is 6376.8 IOPS and the service time is 4.5 msec. If we click Average, we observe the same I/O statistics, because ESS11 is accessed by Sys_ESS11 only. Clicking Base creates the base model for this ESS-800. 15.We can now also create the base model for ESS14 by following the same procedure.

Figure 7-43 Open Systems workload statistics

7.3.2 Open Systems model to merge the two ESS-800s to a DS8300


Now, we start the merge procedure: 1. In Figure 7-44, right-click ESS11, and click Merge from the pop-up menu. Then, select Add to Merge Source Collection and create New Target from the cascading menu. This selection creates Merge Target1, which is the new disk subsystem that we will use as the merge target (Figure 7-45 on page 189).

Figure 7-44 Open Systems merge and create a new target disk subsystem


2. Because we want to merge the ESS-800s to a DS8300, we need to modify Merge Target1. Clicking the Hardware Type option in Figure 7-45 opens a list box where we select IBM DS8300 Turbo.

Figure 7-45 Open Systems merge target disk subsystem

3. Selecting Hardware Details opens the window shown in Figure 7-46. If we had selected the IBM DS8300LPAR Turbo option, this window would allow us to select from the Processor Percentage option, which offers the choices of 25, 50, or 75. The Failover Mode option allows you to model the performance of the DS8000 when one processor server with its associated processor storage has been lost. Here, we can select the cache size, in this case 64 GB, which is the sum of the cache sizes of ESS11 and ESS14. In a DS8300, this selection automatically also determines the NVS size. Disk Magic computes the number of host adapters on the DS8000 based on what is specified on the Interfaces page, but you can, to a certain extent, override these numbers. We recommend that you use one host adapter for every two Fibre ports. In this case, we select four Fibre host adapters, because we are using eight Fibre ports. 4. Clicking the Interface option in Figure 7-45 opens the dialog that is shown in Figure 7-47.

Figure 7-46 Open Systems target hardware details option


5. Because the DS8300 Fibre ports run at 4 Gbps, we need to update this option on both servers and also on the From Disk Subsystem dialog (Figure 7-48). If the servers use Fibre Channels different from the ones specified here, update this information. Select and determine the Remote Copy Interfaces: you need to select the Remote Copy type and the connections used for the Remote Copy links.

Figure 7-47 Open Systems target interfaces from server

Figure 7-48 Open Systems target interfaces from disk subsystem

6. To select the DDM capacity and rpm used, click the Open Disk tab in Figure 7-49 on page 191. Then, select the DDM used by clicking Add. Select the HDD type used (146GB/15K rpm) and the RAID type (RAID 5), and enter capacity in GB (24000). Now, click OK.


Figure 7-49 Open Systems target DDM option

7. Now, merge the second ESS onto the target subsystem. In Figure 7-50, right-click ESS14, select Merge, and then, select Add to Merge Source Collection.

Figure 7-50 Open Systems merge second ESS-800

8. To start the merge, in the Merge Target window that is shown in Figure 7-51 on page 192, click Start Merge. This selection initiates Disk Magic to merge the two ESS-800s onto the new DS8300. A pop-up window allows you to select whether to merge all workloads or only a subset of the workloads. Here, we select I want to merge all workloads on the selected DSSs, which creates Merge Result1 (refer to Figure 7-52).

Figure 7-51 Open Systems start the merge

9. Clicking the Open Disk tab in Figure 7-52 shows the disk configuration. In this case, it is 24 TB on 28 ranks of 146 GB/15K rpm DDMs (Figure 7-53 on page 193).

Figure 7-52 Open Systems merge result disk subsystem


Figure 7-53 Open Systems DDM of the new DS8300

10.Selecting the Open Workload tab in Figure 7-53 shows the Disk Magic predicted performance of the DS8300. Here, we see that the modelled DS8300 will have an estimated service time of 5.9 msec (Figure 7-54).

Figure 7-54 Open Systems performance statistics on the DS8300

11.Click Utilizations in Figure 7-54 to show the utilization statistics of the various components of the DS8300. In Figure 7-55 on page 194, we see that the Highest HDD Utilization is 60.1% and has a darker (amber) background color. This amber background is an indication that the utilization of that resource is approaching its limit. It is still acceptable at this point, but the color is a warning that a workload increase might push this resource to its limit. Any resource that is a bottleneck will be shown with a red background. If a resource shows a red background, you need to increase that resource to resolve the bottleneck.


Figure 7-55 Open Systems DS8300 utilization statistics

Use Figure 7-23 on page 177 as a guideline for the various resources in a DS8000. The middle column has an amber background color and the rightmost column has a red background color. The amber number indicates a warning that if the resource utilization reaches this number, a workload increase might soon cause the resource to reach its limit. The red numbers are utilization numbers that will cause an increase in one of the components of the response time.

7.3.3 Disk Magic performance projection for an Open Systems model


Based on the previous modeling results, we can create a chart comparing the performance of the original ESS-800s with the performance of the new DS8300. In Figure 7-56, clicking Sat Aug 04 02:00 copies the server and disk subsystems to the right column.

Figure 7-56 Open Systems panel for performance comparison

Hold the Ctrl key down and select ESS11, ESS14, and MergeResult1. Right-click any of them, and a small window appears. Select Graph from this window. On the panel that appears (Figure 7-57 on page 195), select Clear to clear any graph option that might have been set up before. Click Plot to produce the service time graph of the three disk subsystems selected (Figure 7-58 on page 195).


Figure 7-57 Open Systems graph option panel

[Bar chart: service time (msec) by configuration; 4.5 msec for ESS11 @ 6377 IO/s, 11.4 msec for ESS14 @ 12490 IO/s, and 5.9 msec for the DS8300 @ 18867 IO/s]
Figure 7-58 Open Systems service time comparison

7.3.4 Workload growth projection for an Open Systems model


A useful feature of Disk Magic is the capability to project workload growth and observe the impact of this growth on the various DS8000 resources and also on the service time. To create this workload growth projection, click Graph in Figure 7-54 on page 193, which opens the panel that is shown in Figure 7-59 on page 196. Click Range Type, and choose I/O Rate. This selection fills the from field with the I/O Rate of the current workload, which is 18,867.2 I/Os per second, the to field with 22,867.2, and the by field with 1,000. You can change these numbers. In our case, we change the numbers to 18000, 40000, and 1000. Select Line for the graph type, and then select Clear to clear any graph that might have been set up before. Now, select Plot. An error message displays, showing that the HDD utilization > 100%. This message indicates that we cannot increase the I/O rate up to 40000 I/Os per second because of the DDM bottleneck. Clicking OK completes the graph creation.

Figure 7-59 Open Systems workload growth projection dialog

The graph shows the service time plotted against the I/O rate increase as shown in Figure 7-60.

[Line chart: service time (msec) versus I/O rate from 18,000 to 30,000 I/Os per second; the service time grows from 5.7 msec at 18,000 to 33.4 msec at 30,000 I/Os per second]

Figure 7-60 Open Systems service time projection with workload growth


The DS8300 resource utilization also grows with the increase in I/O rate. We can observe that the HDD utilization starts to reach the red area at an I/O rate greater than 24000 IOPS. This utilization impacts the service time, as can be seen in Figure 7-60 on page 196, where at greater than 24000 IOPS the service time increases more rapidly. After selecting Utilization Overview in the Graph Data choices, click Clear and click Plot, which produces the resource utilization table in Figure 7-61.
Total I/O Rate (I/Os per second)
Utilizations (amber/red threshold)   18000   20000   22000   24000   26000   28000   30000
Average SMP (60%/80%)                24.1%   26.8%   29.5%   32.2%   34.8%   37.5%   40.2%
Average Bus (70%/90%)                12.1%   13.5%   14.8%   16.2%   17.5%   18.9%   20.2%
Average Logical Device (n/a)          3.6%    4.6%    5.8%    7.7%   10.8%   17.4%   42.2%
Highest DA (60%/80%)                  8.9%    9.9%   10.9%   11.9%   12.9%   13.9%   14.9%
Highest HDD (60%/80%)                57.3%   63.7%   70.1%   76.5%   82.8%   89.2%   95.6%
Average FICON HA (35%/50%)            0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
Highest FICON Port (35%/50%)          0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
Average Fibre HA (60%/80%)           28.8%   32.0%   35.2%   38.4%   41.6%   44.8%   48.1%
Highest Fibre Port (60%/80%)         26.7%   29.7%   32.7%   35.6%   38.6%   41.6%   44.5%

Figure 7-61 Open Systems workload growth impact on resource utilization

7.4 Workload growth projection


The workload running on any disk subsystem always grows over time, which is why it is important to project how the new disk subsystem will perform as the workload grows. Two of the growth projection options are:
- I/O rate: This projection models the disk subsystem performance when the I/O rate grows. Use this option when the I/O rate is expected to grow without a simultaneous growth of cache or the backstore capacity of the subsystem. In other words, you expect the Access Density (the number of I/Os per second per GB) to increase. In this case, Disk Magic keeps the cache hit rate the same for each step.
- I/O rate with capacity growth: This projection models the disk subsystem performance when both the I/O rate and the capacity grow. With this selection, Disk Magic grows the workload and the backstore capacity at the same rate (the Access Density remains constant) while the cache size remains the same. Automatic Cache modeling is used to compute the negative effect that this has on the cache hit ratio.
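To make the Access Density concept more concrete, the following minimal Python sketch (not part of Disk Magic; the I/O rate, capacity, and growth factor are illustrative assumptions) shows how the two growth options affect Access Density:

# Minimal sketch: how Access Density behaves under the two growth options.
# All numbers are illustrative assumptions, not measured data.

def access_density(io_rate, capacity_gb):
    """Access Density = I/Os per second per GB of backstore capacity."""
    return io_rate / capacity_gb

base_io_rate = 18867.0   # I/Os per second (example from the model above)
base_capacity = 24000.0  # GB of backstore capacity (example DS8300 configuration)
growth = 1.5             # 50% workload growth, assumed for illustration

# Option 1: I/O rate growth only. Capacity stays flat, so Access Density rises,
# and Disk Magic keeps the cache hit rate constant for each step.
ad_io_only = access_density(base_io_rate * growth, base_capacity)

# Option 2: I/O rate with capacity growth. Both grow at the same rate, so Access
# Density stays constant, but the unchanged cache now covers a larger backstore,
# which is why Disk Magic models a lower cache hit ratio.
ad_with_capacity = access_density(base_io_rate * growth, base_capacity * growth)

print(f"Base access density:           {access_density(base_io_rate, base_capacity):.3f} IO/s per GB")
print(f"I/O rate growth only:          {ad_io_only:.3f} IO/s per GB")
print(f"I/O rate with capacity growth: {ad_with_capacity:.3f} IO/s per GB")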

7.5 Input data needed for Disk Magic study


To perform the Disk Magic study, we need to get information about the characteristics of the current workload. Depending on the environment where the current workload is running, there are different methods of collecting data.


7.5.1 z/OS environment


For each control unit to be modelled (current and proposed), we need:
- Control unit type and model
- Cache size
- NVS size
- DDM size and speed
- Number, type, and speed of channels
- PAV, and whether it is used
- For Remote Copy: Remote Copy type, distance, and number of links
In a z/OS environment, running a Disk Magic model requires the System Management Facilities (SMF) record types 70 through 78. The easiest way to send the SMF data to IBM is through FTP. To avoid huge dataset sizes, separate the SMF data by SYSID or by date. The SMF dataset needs to be tersed before putting it on the FTP site. Example 7-1 shows the instructions to terse the dataset.
Example 7-1 How to terse the SMF dataset
Use TRSMAIN with the PACK option to terse the SMF dataset. Do NOT run TRSMAIN against an SMF dataset that resides on tape, which will cause problems with the terse process of the dataset. The SMF record length on tape can be greater than 32760 bytes, which TRSMAIN will not be able to handle properly.

Example 7-2 shows how to FTP the tersed dataset.


Example 7-2 FTP instructions
The FTP site is: testcase.boulder.ibm.com
userid = anonymous
password = your e-mail address, for example: abc@defg.com
directory to put in: eserver/toibm/zseries
Notify IBM with the filename that you use to create your FTP file.

7.5.2 Open Systems environment


In an Open Systems environment, we need the following information for each control unit to be included in this study:
- Storage controller make, machine type, model, and serial number
- The number, size, and speed of the disk drives installed on each controller
- The number, speed, and type of channels
- The cache size
- Whether the control unit is direct-attached or SAN-attached
- How many servers are allocated and sharing these disks
- For Remote Copy: Remote Copy type, distance, and number of links


Data collection
The preferred data collection method for a Disk Magic study is by using TotalStorage Productivity Center. For each control unit to be modeled, collect performance data, create a report for each control unit, and export each report as a comma separated values (csv) file. You can obtain the detailed instructions for this data collection from the IBM representative.

Other data collection techniques


If TotalStorage Productivity Center is not available or cannot be used for the existing disk subsystems, other data collection techniques are available. Contact your IBM representative. The Help function in Disk Magic documents how to gather various Open Systems types of performance data by using commands, such as iostat in Linux/Unix and perfmon in Windows.

7.6 Configuration guidelines


The following configuration guidelines are designed to assist your storage management planning by estimating the hardware components required to support a workload with known I/O characteristics. For more information about workload, refer to Chapter 3, Understanding your workload on page 29. These guidelines do not represent any type of guarantee of performance and do not replace the more accurate estimates that can be obtained from a Disk Magic study.

Workload
The example is based on an online workload with the assumption that the transfer size is 4K and all the read and write operations are random I/Os. The workload is a 70/30/50 online transaction processing (OLTP) workload, which is an online workload with 70% reads, 30% writes, and a 50% read-hit-ratio. The workload estimated characteristics are:
- Maximum host I/O rate is 10000 IOPS.
- Write efficiency is 33%, which means that 67% of the writes are destaged.
- Ranks use a RAID 5 configuration.

DS8000 DDM configuration


The DDM characteristics follow.

DDM speed
For I/O intensive workloads, consider 15K rpm DDMs.

DDM capacity
Determine the choice among the 146 GB, 300 GB, or 450 GB DDMs based on:
- Total capacity needed in GB
- Estimated read and write I/O rates
- RAID type used
For a discussion about this topic, refer to 5.6, Planning RAID arrays and ranks on page 81.


Based on the workload characteristics, the calculation is:
- Reads: Read misses are 10000 x 70% x 50% = 3500 IOPS
- Writes: 10000 x 30% x 67% x 4 = 8040 IOPS
- Total = 11540 IOPS

Note: The RAID 5 write penalty of four I/O operations per write is shown in Table 5-1 on page 84 under the heading Performance Write Penalty.

Calculate the number of ranks required based on these assumptions:
- A 15K rpm DDM can sustain a maximum of 200 IOPS/DDM. For a 10K rpm DDM, reduce this number by 33%.
- For planning purposes, use a DDM utilization of 50%. The (6+P) rank will then be able to sustain 200 x 7 x 50% = 700 IOPS. The (7+P) rank will be able to handle a higher IOPS. To be on the conservative side, calculate these estimates based on the throughput of a (6+P) rank.

Based on the 11540 total IOPS, this calculation yields 11540 / 700 = 17 ranks (rounded up). Because you can only order DDMs based on a multiple of two ranks, requiring a capacity with 17 ranks will require a DS8000 configuration with 18 ranks. Depending on the DDM size, the table in Figure 7-62 shows how much capacity you get with 18 ranks. Knowing the total GB capacity needed for this workload, use this chart to select the DDM size that will meet the capacity requirement.

Total GB capacity for 18 ranks    146 GB    300 GB    450 GB
zSeries                           15,963    32,389    48,623
Open                              16,173    32,826    49,268

Figure 7-62 Total rank capacity in GBs by DDM size

Note: Only use larger DDM sizes for applications that are less I/O intensive.
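The rank estimate above can be condensed into a small Python sketch. It is only a restatement of the guideline arithmetic (200 IOPS per 15K rpm DDM, 50% planned utilization, RAID 5 write penalty of 4); the variable names are ours and do not correspond to any IBM sizing tool:

# Minimal sketch of the rank estimation in this section. All constants come from
# the guideline text above; treat the result as a planning estimate only.
import math

host_iops     = 10000   # maximum host I/O rate
read_ratio    = 0.70    # 70/30/50 OLTP workload
read_hit      = 0.50
destage_ratio = 0.67    # 33% write efficiency -> 67% of writes are destaged
raid5_penalty = 4       # physical disk operations per random destaged write

backend_read_iops  = host_iops * read_ratio * (1 - read_hit)                       # 3500
backend_write_iops = host_iops * (1 - read_ratio) * destage_ratio * raid5_penalty  # 8040
backend_iops       = backend_read_iops + backend_write_iops                        # 11540

ddm_iops    = 200        # 15K rpm DDM; reduce by 33% for 10K rpm
utilization = 0.50       # planning target
rank_iops   = ddm_iops * 7 * utilization   # conservative (6+P) rank -> 700 IOPS

ranks_needed   = math.ceil(backend_iops / rank_iops)   # 17
ranks_to_order = ranks_needed + (ranks_needed % 2)     # DDMs are ordered in pairs of ranks -> 18

print(backend_iops, ranks_needed, ranks_to_order)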

DS8000 FICON or Fibre ports and host adapters


If the workload runs on a zSeries and the CEC is a z10 with FICON Express4 channels, perform the following estimate:
- The maximum throughput on the FICON channel is 14000 IOPS. At 50% utilization, you get a throughput of 7000 IOPS.
- Divide the total IOPS by this throughput: 10000/7000 = 1.43 and round this up to 2. This result means that you will need two FICON Express4 channels. For connectivity reasons, increase this to four FICON Express4 channels.
- Generally, we recommend that you use only two FICON ports in each FICON host adapter, which means that you will need two host adapters.


For Open Systems loads, make sure that the DS8000 adapters and ports do not reach their throughput limits (MB/s values). Hence, if you have to fulfill specific throughput requirements,
make sure that each port under peak conditions is only loaded with a maximum of approximately 70% of its nominal throughput and leave every second port on a DS8000 idle. If no throughput requirements are given, use the following rule to initially estimate the correct number of host adapters: For each TB of capacity used, configure a nominal throughput of 100 MB/s. For instance, a 16 TB disk capacity then leads to 1600 MB/s required nominal throughput. With 2-Gbps ports assumed, you need eight ports. Following the recommendation of using every second port only, you need four host adapters.
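The port and host adapter estimates can be sketched the same way. The constants are the guideline values from this section (14000 IOPS per FICON Express4 channel, 50% target utilization, 100 MB/s of nominal throughput per TB, roughly 200 MB/s nominal per 2-Gbps port, and two ports used per host adapter); treat the result as a starting point, not a guarantee:

# Minimal sketch of the port and host adapter estimates above, using the
# guideline constants from the text. Illustrative only.
import math

# z/OS example: FICON Express4 channels
host_iops = 10000
ficon_max_iops = 14000
ficon_effective = ficon_max_iops * 0.50                     # 7000 IOPS at 50% utilization
ficon_channels = math.ceil(host_iops / ficon_effective)     # 2 (increase to 4 for connectivity)

# Open Systems example: throughput-based rule
capacity_tb = 16
required_mbps = capacity_tb * 100                           # 100 MB/s per TB -> 1600 MB/s nominal
port_mbps = 200                                             # 2-Gbps port nominal throughput
ports = math.ceil(required_mbps / port_mbps)                # 8 ports
host_adapters = math.ceil(ports / 2)                        # use every second port -> 4 host adapters

print(ficon_channels, ports, host_adapters)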

DS8000 cache
Use Figure 7-63 as a guideline for the DS8000 cache size if there is no workload experience that can be used.

Figure 7-63 Cache guideline based on the DS8000 capacity

Recommendation
To finalize your DS8000 configuration, contact your IBM representative to further validate the workload against the selected configuration.



Chapter 8.

Practical performance management


This chapter describes the tools, data, and activities available for supporting DS8000 performance management processes. In this chapter, we present:
- Introduction to practical performance management
- Performance management tools
- TotalStorage Productivity Center data collection
- Key performance metrics
- TotalStorage Productivity Center reporting options
- Monitoring performance of a SAN switch or director
- End-to-end analysis of I/O performance problems
- TotalStorage Productivity Center for Disk in mixed environment


8.1 Introduction to practical performance management


In Chapter 6, Performance management process on page 147, we discussed performance management processes and inputs, actors, and roles. Performance management processes include operational processes, such as data collection and alerting, tactical processes, such as performance problem determination and analysis, and strategic processes, such as long-term trending. The purpose of this chapter is to define the tools, metrics, and processes required to support the operational, tactical, and strategic performance management processes.

8.2 Performance management tools


Tools for collecting, monitoring, and reporting on DS8000 performance are critical to the performance management processes. At the time of the writing of this book, the storage resource management tool with the most DS8000 performance management capabilities is IBM TotalStorage Productivity Center for Disk. TotalStorage Productivity Center for Disk supports the DS8000 performance management processes with the features shown in Table 8-1, which are discussed in greater detail in the remainder of this chapter.
Table 8-1 TotalStorage Productivity Center supported activities for performance processes
- Process: Operational; Activity: Performance data collection for port, array, volume, and switch metrics; Feature: Performance monitor jobs using Common Information Model (CIM) object managers (CIMOMs)
- Process: Operational; Activity: Alerting; Feature: Alerts and constraint violations
- Process: Tactical/Strategic; Activity: Performance reporting of port, array, volume, and switch; Feature: Predefined, Adhoc, Batch, TPCTOOL, and BIRT
- Process: Tactical; Activity: Performance analysis and tuning; Feature: Tool facilitates via data collection and reporting
- Process: Tactical; Activity: Short-term trending; Feature: GUI charting facilitates trend lines, and reporting options facilitate export to analytical tools
- Process: Tactical; Activity: Workload profiling; Feature: Volume Planner
- Process: Strategic; Activity: Long-term trending; Feature: GUI charting facilitates trend lines, and reporting options facilitate export to analytical tools

Certain features that are required to support the performance management processes are not provided in TotalStorage Productivity Center. These features are shown in Table 8-2 on page 205.


Table 8-2 Other tools required
- Process: Strategic; Activity: Sizing; Alternative: Disk Magic, general rules, refer to Chapter 7, Performance planning tools on page 161.
- Process: Strategic; Activity: Planning; Alternative: Logical configuration performance considerations, refer to Chapter 5, Logical configuration performance considerations on page 63.
- Process: Operational; Activity: Host data collection performance and alerting; Alternative: Native host tools. Refer to the OS chapters for more detail.
- Process: Tactical; Activity: Host performance analysis and tuning; Alternative: Native host tools. Refer to the OS chapters for more detail.

8.2.1 TotalStorage Productivity Center overview


IBM TotalStorage Productivity Center Standard Edition (SE) is designed to reduce the complexity of managing SAN storage devices by allowing administrators to configure, manage, and monitor storage devices and switches from a single console. TotalStorage Productivity Center includes the following components for managing your storage infrastructure:
- TotalStorage Productivity Center for Disk: Facilitates monitoring and performance management of Storage Management Initiative Specification (SMI-S) compliant disk storage devices.
- TotalStorage Productivity Center for Fabric: Facilitates monitoring and performance management of SAN switches.
- TotalStorage Productivity Center for Data: Facilitates consolidated data collection and reporting of servers at the disk and file level. TotalStorage Productivity Center for Data does not provide any performance management capabilities.
For a full list of the features that are provided in each of the TotalStorage Productivity Center SE components, visit the IBM Web site at:
http://www.ibm.com/systems/storage/software/center/standard/index.html

8.2.2 TotalStorage Productivity Center data collection


TotalStorage Productivity Center for Disk uses the Common Information Model (CIM) Agent as an interface to the disk subsystem. TotalStorage Productivity Center uses the CIM Agent for discovery, probes, and performance monitoring. TotalStorage Productivity Center also gets CIM indications from the CIM Agent. In this book, we use Common Information Model Object Manager (CIMOM) and Common Information Model (CIM) agent interchangeably. Figure 8-1 on page 206 shows a high-level design overview of a TotalStorage Productivity Center implementation.


[Diagram: TPC for Disk, TPC for Data, and TPC for Fabric sit on top of TotalStorage Productivity Center for Devices, which provides APIs for storage management applications. The Device Server performs discovery, real-time availability monitoring, and configuration of the SAN and storage devices. The Data Server provides the common schema and database and a time-based data collection engine that collects data on a periodic basis for capacity and performance analysis purposes. A common GUI provides a highly scalable topology and status display and the display and configuration of SMI-S devices. The infrastructure communicates with storage devices through CIMOMs and SNMP and with hosts through the common host agent framework.]

Figure 8-1 TotalStorage Productivity Center: SMI-S design overview

There are many components in a TotalStorage Productivity Center environment. An example of the complexity of a TotalStorage Productivity Center environment is provided in Figure 8-2.

[Diagram: example physical layout with the TPC server, multiple proxy CIMOMs and GUIs, a backup master console, the SVC Master Console (proxy CIMOM and GUI), and internal and external HMCs]

Figure 8-2 TotalStorage Productivity Center physical layout

Figure 8-2 does not include TotalStorage Productivity Center for Fabric or switch components. This architecture was designed by the Storage Networking Industry Association (SNIA), an industry workgroup. The architecture is not simple, but it is open, meaning any company can use SMI-S standard CIMOMs to manage and monitor storage and switches.


For additional information about the configuration and deployment of TotalStorage Productivity Center, refer to:
- TotalStorage Productivity Center V3.3 Update Guide, SG24-7490:
  http://www.redbooks.ibm.com/abstracts/sg247490.html?Open
- Storage Subsystem Performance Monitoring using TotalStorage Productivity Center, which is in Monitoring Your Storage Subsystems with TotalStorage Productivity Center, SG24-7364

8.2.3 TotalStorage Productivity Center measurement of DS8000 components


To gather performance data, you first need to set up a job that is called a Subsystem Performance Monitor. When this job starts, TotalStorage Productivity Center tells the CIMOM to start collecting data. Later, TotalStorage Productivity Center regularly queries the performance counters on that CIMOM for a specific device and stores the information in its database as a sample. After two samples are in the database, TotalStorage Productivity Center can calculate the differences between them. After calculating the gathered data, TotalStorage Productivity Center can use several metrics to display the data in a meaningful way.

TotalStorage Productivity Center can collect DS8000 and SAN fabric performance data. TotalStorage Productivity Center for Disk can collect the components shown in Figure 8-3 on page 208. Displaying a metric within TotalStorage Productivity Center depends on the ability of the storage subsystem and the Common Information Model (CIM) object manager (CIMOM) to provide the performance data and related information, such as the values that are assigned to processor complexes. We guide you through the diagram in Figure 8-3 on page 208 by drilling down from the overall subsystem level.

Note: A metric is a numerical value that is derived from the information that is provided by a device. It is not just the raw data, but a calculated value. For example, the raw data is just the transferred bytes, but the metric uses this value and the interval to tell you the bytes/second.
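As a simple illustration of the note above, the following Python sketch shows how a rate metric can be derived from two successive raw samples. The sample values and field names are invented for the example and are not the actual CIM counter names:

# Minimal sketch: turning two raw counter samples into a rate metric.
# Timestamps and counter values are illustrative assumptions only.
from datetime import datetime

sample1 = {"time": datetime(2009, 3, 1, 10, 0, 0),  "bytes_read": 1_200_000_000_000}
sample2 = {"time": datetime(2009, 3, 1, 10, 15, 0), "bytes_read": 1_290_000_000_000}

interval_s = (sample2["time"] - sample1["time"]).total_seconds()   # 900 seconds
read_data_rate_mbps = (sample2["bytes_read"] - sample1["bytes_read"]) / interval_s / 1_000_000

print(f"Read data rate: {read_data_rate_mbps:.1f} MB/s over a {interval_s:.0f} s interval")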


[Diagram: physical view of a storage device; front-end ports for server connections (metrics by port and by subsystem), Controller 1 and Controller 2 with their read cache, write cache, and write cache mirror (metrics by controller), back-end ports for connection to the disks, and the arrays, sometimes called enclosures or 8-packs (metrics by array)]

Figure 8-3 Storage device: Physical view

The amount of the available information or available metrics depends on the type of subsystem involved. The SMI-S standard does not require vendors to provide detailed performance data. For the DS8000, IBM provides extensions to the standard that include much more information than required by the SMI-S standard.

Subsystem
On the subsystem level, metrics have been aggregated from multiple records to a single value per metric in order to give the performance of a storage subsystem from a high-level view, based on the metrics of other components. This is done by adding values, or calculating average values, depending on the metric.

Cache
Notice the cache in Figure 8-3. The cache is shown as a subcomponent of the subsystem, because the cache plays a crucial role in the performance of any storage subsystem. You do not find the cache as a selection in the Navigation Tree in TotalStorage Productivity Center, but there are available metrics that provide information about cache. Cache metrics for the DS8000 are available in the following report types:
- Subsystem
- Controller
- Array
- Volume

Cache metrics
Metrics, such as disk-to-cache operations, show the number of data transfer operations from disks to cache, referred to as staging, for a specific volume. Disk-to-cache operations are directly linked to read activity from hosts. When data is not found in the DS8000 cache, the data is first staged from back-end disks into the cache of the DS8000 server and then transferred to the host.


Read hits occur when all the data requested for a read data access is located in cache. The DS8000 improves the performance of read caching by using Sequential Prefetching in Adaptive Replacement Cache (SARC) staging algorithms. Refer to 1.3.1, Advanced caching techniques on page 6 for more information about the SARC algorithm. The SARC algorithm seeks to keep in cache those data tracks that have the greatest probability of being accessed by a read operation.

The cache-to-disk operation shows the number of data transfer operations from cache to disks, referred to as destaging, for a specific volume. Cache-to-disk operations are directly linked to write activity from hosts to this volume. Data written is first stored in the persistent memory (also known as nonvolatile storage (NVS)) at the DS8000 server and then destaged to the back-end disk. The DS8000 destaging is enhanced automatically by striping the volume across all the disk drive modules (DDMs) in one or several ranks (depending on your configuration). This striping provides automatic load balancing across DDMs in ranks and an elimination of hot spots.

The DASD fast write delay percentage due to persistent memory allocation gives us information about the cache usage for write activities. The DS8000 stores data in the persistent memory before sending an acknowledgement to the host. If the persistent memory is full of data (no space available), the host receives a retry for its write request. In parallel, the subsystem has to destage data stored in its persistent memory to the back-end disk before accepting new write operations from any host. If a volume is facing write operations delayed due to a persistent memory constraint, consider moving the volume to a rank that is less used or spreading this volume over multiple ranks (increasing the number of DDMs used). If this solution does not fix the persistent memory constraint problem, you can consider adding cache capacity to your DS8000.

Controller
TotalStorage Productivity Center refers to the DS8000 processor complexes as controllers. A DS8000 has two processor complexes, and each processor complex independently provides major functions for the disk subsystem. Examples include directing host adapters for data transfer to and from host processors, managing cache resources, and directing lower device interfaces for data transfer to and from physical disks. To analyze performance data, you need to know that most volumes can only be assigned to and used by one controller at a time. You can use the controller reports to identify whether the DS8000's processor complexes are busy and whether persistent memory is sufficient. Write delays can occur due to write performance limitations on the back-end disk (at the rank level) or limitations of the persistent memory size.

Ports
The port information reflects the performance metrics for the front-end DS8000 ports that connect the DS8000 to the SAN switches or hosts. The DS8000 host adapter (HA) card has four ports. The SMI-S standards do not reflect this aggregation, so TotalStorage Productivity Center does not show any grouping of ports belonging to the same HA. Monitoring and analyzing the ports belonging to the same card is beneficial, because the aggregate throughput is less than the sum of the stated bandwidth of the individual ports. For more information about DS8000 port cards, refer to 2.5.1, Fibre Channel and FICON host adapters on page 24. Note: TotalStorage Productivity Center reports on many port metrics; therefore, be aware that the ports on the DS8000 are the front-end part of the storage device.


TotalStorage Productivity Center array reports


A TotalStorage Productivity Center for Disk array refers to a DS8000 array site. Refer to Figure 8-4 on page 211 for more information. A DS8000 array site is composed of a group of eight DDMs. A DS8000 array is defined on an array site with a specific RAID type. A rank is a logical construct to which an array is assigned. A rank provides a pool of extents, which are used to create one or several volumes. A volume can use DS8000 extents from one or several ranks. Refer to 4.2.1, Array sites on page 45, 4.2.2, Arrays on page 45, and 4.2.3, Ranks on page 46 for further discussion. Note: On the DS8000, there is a 1:1 relationship between the array site, array, and rank although the numbering will not always be the same. In Example 8-1, we only display information that is required to illustrate the point.
Example 8-1 TotalStorage Productivity Center array-# = DSCLI arraysite S#
dscli> showrank r8
ID        R8
Array     A8
extpoolID P0
volumes   8000,8001,8002,8003,A001,A004,A007,A050,A200,A230,A250,A260

dscli> lsarray -l a8
Date/Time: 21. Mai 2008 14:43:14 CEST IBM DSCLI Version: 5.4.0.520 DS: IBM.2107-75Z0281
Array State    Data   RAIDtype arsite rank DA Pair DDMcap (10^9B) diskclass
===========================================================================
A8    Assigned Normal 5 (7+P)  S5     R8   0       146.0          ENT

dscli> lsarraysite -l s5
Date/Time: 21. Mai 2008 14:43:23 CEST IBM DSCLI Version: 5.4.0.520 DS: IBM.2107-75Z0281
arsite DA Pair dkcap (10^9B) diskrpm State    array diskclass
=============================================================
S5     0       146.0         15000   Assigned A8    ENT

dscli> showfbvol -rank 8000
Date/Time: 21. Mai 2008 14:52:47 CEST IBM DSCLI Version: 5.4.0.520 DS: IBM.2107-75Z0281
Name    eRMM_8000
ID      8000
extpool P0
eam     rotateexts
==============Rank extents==============
rank extents
============
R0   14
R2   13
R8   14
R10  14


Figure 8-4 TotalStorage Productivity Center array site to volume

The array reports include both front-end metrics and back-end metrics. The back-end metrics are specified by the keyword Backend. They provide metrics from the perspective of the controller to the back-end array sites. The front-end metrics relate to the activity between the server and the controller. There is a relationship between array operations, cache hit ratio, and percentage of read requests. When the cache hit ratio is low, the DS8000 has frequent transfers from DDMs to cache (staging). When the percentage of read requests is high and cache hit ratio is also high, most of the I/O requests can be satisfied without accessing the DDMs due to the cache management prefetching algorithm. When the percentage of read requests is low, the DS8000 write activity to the DDMs can be high. The DS8000 has frequent transfers from cache to DDMs (destaging). Comparing the performance of different arrays shows if the global workload is equally spread on the DDMs of your DS8000. Spreading data across multiple arrays increases the number of DDMs used and optimizes the overall performance. Important: Back-end write metrics do not include the RAID overhead. In reality, the RAID 5 write penalty adds additional unreported I/O operations.

Volumes
The volumes, which are also called logical unit numbers (LUNs), are shown in Figure 8-5 on page 212. The host server sees the volumes as physical disk drives and treats them as physical disk drives.


Figure 8-5 DS8000 volume

Analysis of volume data facilitates the understanding of the I/O workload distribution among volumes as well as workload characteristics (random or sequential and cache hit ratios). A DS8000 volume can belong to one or several ranks as shown in Figure 8-5. For more information about volumes, refer to 4.2.5, Logical volumes on page 48. Analysis of volume metrics will show how busy the volumes are on your DS8000. This information helps to:
- Determine where the most accessed data is located and what performance you get from the volume.
- Understand the type of workload your application generates (sequential or random and the read or write operation ratio).
- Determine the cache benefits for the read operation (cache management prefetching algorithm SARC).
- Determine cache bottlenecks for write operations.
- Compare the I/O response observed on the DS8000 with the I/O response time observed on the host.

8.2.4 General TotalStorage Productivity Center measurement considerations


In order to understand the TotalStorage Productivity Center measurements of the DS8000 components, it is helpful to understand the context for the measurement. The measurement facilitates insight into the behavior of the DS8000 and its ability to service I/O requests. The DS8000 handles various types of I/O requests differently. Table 8-3 on page 213 shows the behavior of the DS8000 for various I/O types.


Table 8-3 DS8000 I/O types and behavior
- Sequential read: Pre-stage reads in cache to increase cache hit ratio.
- Random read: Attempt to find data in cache. If not present in cache, read from back end.
- Sequential write: Write data to NVS of processor complex owning volume and send copy of data to cache in other processor complex. Upon back-end destaging, perform prefetching of read data and parity into cache to reduce the number of disk operations on the back end.
- Random write: Write data to NVS of processor complex owning volume and send copy of data to cache in other processor complex. Destage modified data from NVS to disk as determined by microcode.

Understanding writes to a DS8000


When the DS8000 accepts a write request, it processes it without physically writing to the DDMs. The data is written into both the processor complex to which the volume belongs and the persistent memory of the second processor complex in the DS8000. Later, the DS8000 asynchronously destages the modified data out to the DDMs. In cases where back-end resources are constrained, NVS delays might occur. TotalStorage Productivity Center reports on these conditions with the following front-end metrics: Write Cache Delay I/O Rate and Write Cache Delay I/O Percentage. The DS8000's lower interfaces use switched Fibre Channel connections, which provide a high data transfer bandwidth. In addition, the destage operation is designed to avoid the write penalty of RAID 5, if possible. For example, there is no write penalty when the modified data to be destaged is contiguous enough to fill the unit of a RAID 5 stride. A stride is a full RAID 5 stripe. However, when all of the write operations are completely random across a RAID 5 array, the DS8000 cannot avoid the write penalty.
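As a simple illustration (not a TotalStorage Productivity Center formula, and with invented example values), the following sketch derives a write-cache delay percentage from the two front-end metrics named above and flags a possible NVS constraint:

# Minimal sketch: deriving a write-cache delay percentage from the front-end
# counters described above. Variable names and example values are assumptions;
# TotalStorage Productivity Center reports the equivalent metrics directly.
write_cache_delay_io_rate = 12.0   # delayed I/Os per second (example value)
total_io_rate = 4800.0             # total I/Os per second (example value)

delay_pct = 100.0 * write_cache_delay_io_rate / total_io_rate
print(f"Write cache delay percentage: {delay_pct:.2f}%")

# Any sustained value above 0% suggests NVS (persistent memory) or back-end
# constraints and warrants a closer look at the busiest ranks and volumes.
if delay_pct > 0.0:
    print("Investigate NVS or back-end constraints on the volumes showing delays.")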

Understanding reads on a DS8000


If the DS8000 cannot satisfy the read I/O requests within the cache, it transfers data from the DDMs. The DS8000 suspends the I/O request until it has read the data. This situation is called cache-miss. If an I/O request is cache-miss, the response time will include the data transfer time between host and cache, and also the time that it takes to read the data from DDMs to cache before sending it to the host. The various read hit ratio metrics show how efficiently cache works on the DS8000. The read hit ratio depends on the characteristics of data on your DS8000 and applications that use the data. If you have a database and it has a high locality of reference, it will show a high cache hit ratio, because most of the data referenced can remain in the cache. If your database has a low locality of reference, but it has the appropriate sets of indexes, it might also have a high cache hit ratio, because the entire index can remain in the cache. A database can be cache-unfriendly by nature. An example of a cache-unfriendly workload is a workload consisting of large sequential reads to a highly fragmented filesystem. If an application reads this file, the cache hit ratio will be very low, because the application never reads the same data, due to the nature of sequential access. In this case, defragmentation of the filesystem improves the performance. You cannot determine if increasing the size of cache improves the I/O performance without knowing the characteristics of data on your DS8000.


We recommend that you monitor the read hit ratio over an extended period of time:
- If the cache hit ratio has been historically low, it is most likely due to the nature of the data access patterns. Defragmenting the filesystem and making indexes if none exist might help more than adding cache.
- If you have a high cache hit ratio initially and it is decreasing as the workload increases, adding cache or moving part of the data to volumes associated with the other processor complex might help.

Interpreting read-to-write ratio


The read-to-write ratio depends on how the application programs issue I/O requests. In general, the overall average read-to-write ratio is in the range of 75% to 80% reads. For a logical volume that has sequential files, it is key to understand what kind of applications access those sequential files. Normally, these sequential files are used for either read-only or write-only at the time of use. The DS8000 cache management prefetching algorithm (SARC) determines if the data access pattern is sequential or not. If the access is sequential, contiguous data is prefetched into cache in anticipation of the next read request. TotalStorage Productivity Center for Disk reports the reads and writes via various metrics. 8.3, TotalStorage Productivity Center data collection on page 214 describes these metrics in greater detail.

8.3 TotalStorage Productivity Center data collection


In this section, we discuss the performance data collection considerations, such as time stamps, durations, and intervals.

8.3.1 Timestamps
TotalStorage Productivity Center server uses the timestamp of the source devices when it inserts data into the database and does not add any additional offset if the TotalStorage Productivity Center server clock is not synchronized with the rest of your environment; keep this in mind, because you might need to compare the performance data of the DS8000 with the data gathered on a server. Although the devices' time information is written to the database, reports are always based on the time of the TotalStorage Productivity Center server. TotalStorage Productivity Center actually receives the time zone information from the devices (or the CIMOMs) and uses this information to adjust the time in the reports to the local time. Certain devices might convert the time into Greenwich mean time (GMT) timestamps and not provide any time zone information.

This complexity is necessary to be able to compare the information from two subsystems located in different time zones from a single administration point. This administration point is the GUI, not the TotalStorage Productivity Center server. If you open the GUI in different time zones, a performance diagram might show a distinct peak at different times, depending on its local time zone. When using TotalStorage Productivity Center to compare data from a server (for example, iostat data) with the data of the storage subsystem, it is important to know the time stamp of the storage subsystem. Unfortunately, TotalStorage Productivity Center does not provide a report to show the time zone information for a device, most likely because the devices or CIMOMs convert the timestamps into GMT before they are sent.
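The following minimal Python sketch illustrates the kind of timestamp alignment described above: converting a device sample reported in GMT to the local time zone before comparing it with host data such as iostat output. The timestamps and the offset are illustrative assumptions:

# Minimal sketch: aligning a GMT device timestamp with local host data.
# The sample timestamp and the time zone offset are assumptions for illustration.
from datetime import datetime, timedelta, timezone

device_sample_gmt = datetime(2009, 3, 1, 14, 30, 0, tzinfo=timezone.utc)
local_tz = timezone(timedelta(hours=-5))        # for example, a UTC-5 site

device_sample_local = device_sample_gmt.astimezone(local_tz)
print("Device sample (GMT):  ", device_sample_gmt.isoformat())
print("Device sample (local):", device_sample_local.isoformat())
# Compare the local time against the iostat interval timestamps on the host.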


In order to ensure that the timestamps on the DS8000 are synchronized with the other infrastructure components, the DS8000 provides features for configuring a Network Time Protocol (NTP) server. In order to modify the time and configure the hardware management console (HMC) to utilize an NTP server, the following steps are required:
1. Log on to the HMC.
2. Select HMC Management.
3. Select Change Date and Time.
4. A dialog box similar to Figure 8-6 will appear. Change the Time here to match the current time for the time zone.

Figure 8-6 DS8000 Date and Time

5. In order to configure an NTP server, select the NTP Configuration tab. A dialog box similar to Figure 8-7 on page 216 will display.


Figure 8-7 DS8000 NTP Configuration

6. Select Add NTP Server and provide the IP address and the NTP version.
7. Check Enable NTP service on this HMC and click OK.
Note: These configuration changes will require a reboot of the HMC. These steps were tested on DS8000 code Version 4.0 and later.

8.3.2 Duration
TotalStorage Productivity Center provides the ability to collect data continuously. From a performance management perspective, collecting data continuously means performance data exists to facilitate reactive, proactive, and even predictive processes as described in Chapter 8, Practical performance management on page 203. For ongoing performance management of the DS8000, we recommend one of the following approaches to data collection:
- Run continuously. The benefit of this approach is that, at least in theory, data always exists. The downside is that if a component of TotalStorage Productivity Center goes into a bad state, it will not always generate an alert. In these cases, data collection might stop with only a warning, and a Simple Network Management Protocol (SNMP) alert will not be generated. In certain cases, the only obvious indication of a problem is a lack of performance data.
- Restart collection every n number of hours. In this approach, configure the collection to run for somewhere between 23 and 168 hours. For larger environments, a significant delay period might need to be configured between the last interval and the first interval in the next data collection. The benefit of this approach is that data collection failures result in an alert every time that the job fails. You can configure this alert to go to an operational monitoring tool, such as Tivoli Enterprise Console (TEC). In this case, performance data loss is limited to the configured duration. The downside of this approach is that there will always be data missing for a period of time as TotalStorage Productivity Center begins to start the data collection on all of the devices. For large environments, this technique might not be tenable for an interval less than 72 hours, because the start-up costs related to starting the collection on a large number of devices can be significant.


In Table 8-4, we list the advantages and disadvantages of these methods.


Table 8-4 Scheduling considerations

Job startup behavior
- Scheduled with a duration of n hours, repeating every n hours: Job tries to connect to any CIMOM to which the device is connected. If the connection fails, the job fails, and an alert, if defined, is generated. In any case, the job is scheduled to run after n hours, so n number of hours is the maximum that you lose.
- Set to run indefinitely: Job tries to connect to any CIMOM to which the device is connected. If the connection fails, the job fails, and an alert, if defined, is generated. If you do not fix the problem and restart the job manually, the job never automatically restarts, even though the problem might be temporary.

CIMOM fails after successful start of job
- Scheduled every n hours: Performance data collection job fails, and an alert, if defined, is generated. You lose up to n number of hours of information in addition to the one hour pause, depending on when this happens.
- Set to run indefinitely: The job tries to reconnect to the CIMOM within the defined intervals and recover if the communication can be reestablished. But, there might be situations where this recovery does not work. For example, if the CIMOM was restarted, it might not be able to resume the performance collection from the device until TotalStorage Productivity Center restarts the data collection job.

Data completeness
- Scheduled every n hours: There is at least a gap of one hour every n number of hours.
- Set to run indefinitely: Data is gathered as completely as possible.

Alerts
- Scheduled every n hours: You get alerts for every job that fails.
- Set to run indefinitely: You get an alert only one time.

Manual restart
- Scheduled every n hours: Manually restarted jobs can cause trouble, because the jobs can easily overlap with the next scheduled job, which prevents the scheduled job from starting.
- Set to run indefinitely: No problems.

Logfiles
- Scheduled every n hours: A logfile is created for each scheduled run. The Navigation Tree shows the status of the current and past jobs, and whether the job was successful in the past.
- Set to run indefinitely: Usually, you only see a single logfile. You see multiple logfiles only if you have stopped and restarted the job manually.

8.3.3 Intervals
In TotalStorage Productivity Center, the data collection interval is referred to as the sample interval. The sample interval for DS8000 performance data collection tasks is from five minutes to 60 minutes. A shorter sample interval results in a more granular view of performance data at the expense of requiring additional database space. The appropriate sample interval depends on the objective of the data collection. Table 8-5 on page 218 displays example data collection objectives and reasonable values for a sample interval.


Table 8-5 Sample interval examples
- Problem determination/service level agreement (SLA): 5-minute sample interval
- Ongoing performance management: 15-minute sample interval
- Baseline or capacity planning: 60-minute sample interval

To reduce the growth of the TotalStorage Productivity Center database while watching for potential performance issues, TotalStorage Productivity Center has the ability to only store samples in which an alerting threshold is reached. This skipping function is useful for SLA reporting and longer term capacity planning. In support of ongoing performance management, a reasonable sample interval is 15 minutes. An interval of 15 minutes usually provides enough granularity to facilitate reactive performance management. In certain cases, the level of granularity required to identify the performance issue is less than 15 minutes. In these cases, you can reduce the sample interval. TotalStorage Productivity Center also provides reporting at higher intervals, including hourly and daily. TotalStorage Productivity Center provides these views automatically.
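To illustrate the trade-off between granularity and database growth, the following sketch estimates how many component samples are stored per day for different sample intervals. The component counts are illustrative assumptions for a single DS8000, not a sizing formula:

# Minimal sketch: samples stored per day as a function of the sample interval.
# The component counts below are illustrative assumptions only.
components = {"volumes": 512, "arrays": 32, "ports": 16, "controllers": 2}

for interval_min in (5, 15, 60):
    samples_per_day = (24 * 60 // interval_min) * sum(components.values())
    print(f"{interval_min:>2}-minute interval: ~{samples_per_day:,} component samples per day")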

8.4 Key performance metrics


TotalStorage Productivity Center for Disk has a significant number of metrics available for reporting the health and performance of the DS8000. The metrics provided can be categorized as either front-end or back-end metrics. Front-end metrics relate to the activity between the server and the storage cache, whereas back-end metrics relate to the activity between the storage controller and the disk arrays. In TotalStorage Productivity Center, back-end statistics are clearly delineated with the key word Backend whereas front-end metrics do not contain any keyword. Table 8-6 provides a matrix of key DS8000 metrics for each component.
Table 8-6 TotalStorage Productivity Center subsystem, controller, port, volume, and array metrics
- Read I/O Rate (overall): Average number of read operations per second for the sample interval.
- Write I/O Rate (overall): Average number of write operations per second for the sample interval.
- Total I/O Rate (overall): Average number of read and write operations per second for the sample interval.
- Read Cache Hits Percentage (overall): Percentage of reads during the sample interval that are found in the cache. A storage subsystem-wide target is 50%, although this percentage will vary depending on the workload.
- Write Cache Hits Percentage (overall): Percentage of writes that are handled in cache. This number needs to be 100%.

218

DS8000 Performance Monitoring and Tuning

Subsystem

Controller

Key DS8000 metrics

Definition Volume Array The rate of I/Os (actually writes) that are delayed during the sample interval because of write cache. This must be 0. Average read data rate in megabytes per second during the sample interval. Average write data rate in megabytes per second during the sample interval. Average total (read+write) data rate in megabytes per second during the sample interval. Average response time in milliseconds for reads during the sample interval. For this report, this metric is an average of read hits in cache as well as read misses. Average response time in milliseconds for writes during the sample interval. Average response time in milliseconds for all I/O in the sample interval, including both cache hits as well as misses to back-end storage if required. Average transfer size in kilobytes for reads during the sample interval. Average transfer size in kilobytes for writes during the sample interval. Average transfer size in kilobytes for all I/O during the sample interval. The average read rate in reads per second caused by read misses. This rate is the read rate to the back-end storage for the sample interval. The average write rate in writes per second caused by front-end write activity. This rate is the write rate to the back-end storage for the sample interval. These writes are logical writes, and the actual number of physical I/O operations depends on the type of RAID architecture. The average write rate in writes per second caused by front-end write activity. This rate is the write rate to the back-end storage for the sample interval. These writes are logical writes and the actual number of physical I/O operations depends on the type of RAID architecture. Average number of megabytes per second read from back-end storage during the sample interval. Port

Write-cache Delay I/O Rate

Read Data Rate Write Data Rate Total Data Rate

Read Response Time

Write Response Time Overall Response Time

Read Transfer Size Write Transfer Size Overall Transfer Size Backend Read I/O Rate

Backend Write I/O Rate

Total Backend I/O Rate

Backend Read Data Rate

Chapter 8. Practical performance management

219

Subsystem

Controller

Key DS8000 metrics

Definition Volume Array Average number of megabytes per second written to back-end storage during the sample interval. Sum of the Backend Read and Write Data Rates for the sample interval. Average response time in milliseconds for read operations to the back-end storage. Average response time in milliseconds for write operations to the back-end storage. This time might include several physical I/O operations, depending on the type of RAID architecture. The weighted average of Backend Read and Write Response Times during the sample interval. Average disk utilization during the sample interval. This percentage is also the utilization of the RAID array, because the activity is uniform across the array. The average rate per second for operations that send data from an I/O port, typically to a server. This operation is typically a read from the servers perspective. The average rate per second for operations where the storage port receives data, typically from a server. This operation is typically a write from the servers perspective. Average read plus write I/O rate per second at the storage port during the sample interval. The average data rate in megabytes per second for operations that send data from an I/O port, typically to a server. The average data rate in megabytes per second for operations where the storage port receives data, typically from a server. Average (read+write) data rate in megabytes per second at the storage port during the sample interval. Average number of milliseconds that it took to service each port send (server read) operation for a particular port over the sample interval. Average number of milliseconds that it took to service each port receive (server write) operation for a particular port over the sample interval. Port

Backend Write Data Rate

Total Backend Data Rate Backend Read Response Time Backend Write Response Time

Overall Backend Response Time Disk Utilization Percentage

Port Send I/O Rate

Port Receive I/O Rate

Total Port I/O Rate Port Send Data Rate

Port Receive Data Rate

Total Port Data Rate

Port Send Response Time

Port Receive Response Time

220

DS8000 Performance Monitoring and Tuning

Subsystem

Controller

Key DS8000 metrics

Definition Volume Array Weighted average port send and port receive time over the sample interval. Average size in kilobytes per Port Send operation during the sample interval. Average size in kilobytes per Port Receive operation during the sample interval. Average size in kilobytes per port transfer during the sample interval. Threshold < 60 >1 > 50% > 250 > 1000 > 35 > 35 > 35 Port

Total Port Response Time Port Send Transfer Size Port Receive Transfer Size Total Port Transfer Size

The following Redpaper provides a more exhaustive list of TotalStorage Productivity Center metrics for DS8000: http://www.redbooks.ibm.com/abstracts/redp4347.html?Open&pdfbookmark

8.4.1 DS8000 key performance indicator thresholds


As seen in Table 8-6 on page 218, TotalStorage Productivity Center for Disk provides an overwhelming number of metrics for performance management. In this section, we provide additional information about a subset of critical metrics. We provide suggested threshold values as general rules for identifying constrained DS8000 components. As with any rules, you must adjust them for the performance requirements of your environment. Table 8-7 lists the thresholds by component.
Table 8-7 DS8000 key performance indicator thresholds

Component | Metric | Threshold | Comment
Controller | Cache Holding Time | < 60 | Indicates high cache track turnover and possibly cache constraint.
Controller | Write Cache Delay I/O Rate | > 1 | Indicates writes delayed due to insufficient memory resources.
Array | Disk Utilization Percentage | > 50% | Indicates disk saturation.
Array | Write I/O Rate (overall) | > 250 | RAID 5 with four operations per write indicates a saturated array.
Array | Total I/O Rate (overall) | > 1000 | Even if all I/Os are reads, this metric indicates busy disks.
Array | Overall Backend Response Time | > 35 | Indicates busy disks.
Array | Backend Write Response Time | > 35 | Indicates busy disks.
Array | Backend Read Response Time | > 35 | Indicates busy disks.


Volume | Read Cache Hits Percentage (overall) | > 90 | Look for opportunities to move volume data to application or database cache.
Volume | Write Cache Hits Percentage (overall) | < 100 | Cache misses can indicate a busy back end or the need for additional cache.
Volume | Read I/O Rate (overall) | N/A | Look for high rates.
Volume | Write I/O Rate (overall) | N/A | Look for high rates.
Volume | Read Response Time | > 20 | Indicates disk or port contention.
Volume | Write Response Time | > 5 | Indicates cache misses, busy back end, and possible front-end contention.
Volume | Write Cache Delay I/O Rate | > 1 | Cache misses may indicate a busy back end or the need for additional cache.
Volume | Read Transfer Size | > 100 | Indicates throughput intensive workload.
Volume | Write Transfer Size | > 100 | Indicates throughput intensive workload.
Port | Total Port I/O Rate | > 2500 | Indicates transaction intensive load.
Port | Total Port Data Rate | ~= 2/4 Gb | If the port data rate is close to the bandwidth, this rate indicates saturation.
Port | Port Send Response Time | > 20 | Indicates contention on the I/O path from the DS8000 to the host.
Port | Port Receive Response Time | > 20 | Indicates potential issue on the I/O path or DS8000 back end.
Port | Total Port Response Time | > 20 | Indicates potential issue on the I/O path or DS8000 back end.
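One practical way to apply these general rules is to scan exported performance data against them. The following Python sketch assumes a CSV export (for example, from a batch report as described later in this chapter) whose columns are named after the array metrics in Table 8-7; the file name and the Time and Component column headings are assumptions that must be adjusted to match your export.

import csv

# Suggested array-level thresholds from Table 8-7 (adjust for your environment).
ARRAY_THRESHOLDS = {
    "Disk Utilization Percentage": 50.0,
    "Total I/O Rate (overall)": 1000.0,
    "Overall Backend Response Time": 35.0,
}

def find_exceptions(csv_path):
    """Print samples in an exported CSV file that exceed the suggested thresholds."""
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            for metric, limit in ARRAY_THRESHOLDS.items():
                raw = (row.get(metric) or "").strip()
                if not raw:
                    continue
                try:
                    value = float(raw)
                except ValueError:
                    continue
                if value > limit:
                    print(f"{row.get('Time', '?')} {row.get('Component', '?')}: "
                          f"{metric} = {value} exceeds {limit}")

# find_exceptions("array_batch_report.csv")   # hypothetical file name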

8.5 TotalStorage Productivity Center reporting options


TotalStorage Productivity Center provides numerous ways of reporting about DS8000 performance data. In this section, we provide an overview of the various options and their appropriate usage in ongoing performance management of a DS8000 (Table 8-8 on page 223). The categories of usage are based on definitions in 6.4, Tactical performance subprocess on page 155.


Table 8-8 Report category, usage, and considerations

Alerting/Constraints (performance process: Operational)
- Advantages: Facilitates operational reporting for certain failure conditions and threshold exception reporting in support of SLAs and service level objectives (SLOs).
- Disadvantages: Requires a thorough understanding of the workload to configure appropriate thresholds.

Predefined performance reports (Tactical)
- Advantages: Ease of use. Top 10 reports provide a method for quickly viewing the entire environment.
- Disadvantages: Limited metrics. Lacks scheduling. Inflexible charting. Limited to 2500 rows displayed.

Ad hoc reports (Tactical, Strategic)
- Advantages: Ease of use. Flexible. Drill downs with preestablished relationships.
- Disadvantages: Can only export multiple metrics of the same data type at a time. Lack of scheduling. Inflexible charting. Limited to 2500 rows displayed.

Batch reports (Tactical, Strategic)
- Advantages: Ease of use. Ability to export all metrics available. Can be scheduled.
- Disadvantages: Time stamps are in AM/PM format. Volume data does not contain array correlation. No charting.

TPCTOOL (Tactical, Strategic)
- Advantages: Flexible. Programmable.
- Disadvantages: Non-intuitive. Output to flat files that must be post-processed in a spreadsheet or other reporting tool.

Custom Reports (BIRT) (Tactical, Strategic)
- Advantages: Highly customizable.
- Disadvantages: Requires some DB and reporting skills.

Analytical (Tactical, Strategic)
- Advantages: Easy to use. Detailed.
- Disadvantages: Does not take into account future or potential changes to the environment.

All of the reports utilize the metrics available for the DS8000 as described in Table 8-6 on page 218. In the remainder of this section, we describe each of the report types in detail.

8.5.1 Alerts
TotalStorage Productivity Center provides support for the performance management operational subprocesses via performance alerts and constraint violations. In this section, we discuss the difference between the alerts and constraint violations and how to implement them.


While TotalStorage Productivity Center is not an online performance monitoring tool, it uses the term performance monitor as the name of the job that is set up to gather data from a subsystem. The performance monitor is a performance data collection task. TotalStorage Productivity Center collects information at certain intervals and stores the data in its database. After the data is inserted, it is available for analysis using the several methods that we discuss in this section. Because the intervals are usually 5 - 15 minutes, TotalStorage Productivity Center is not an online or real-time monitor. You can use TotalStorage Productivity Center to define performance-related alerts that trigger an event when the defined thresholds are reached. Even though TotalStorage Productivity Center works without user intervention, in a similar manner to a monitor, the checks are still performed at the intervals specified during the definition of the performance monitor job. Before discussing alerts, we must clarify the terminology.

Alerts
Generally, alerts are the notifications defined for different jobs. TotalStorage Productivity Center creates an alert on certain conditions, for example, when a probe or scan fails. There are various ways to be notified: SNMP traps, Tivoli Enterprise Console (TEC) events, and e-mail are the most common methods. All the alerts are always stored in the Alert Log, even if you have not set up notification. This log can be found in the Navigation Tree at IBM TotalStorage Productivity Center → Alerting → Alert Log. In addition to the alerts that you set up when you define a certain job, you can also define alerts that are not directly related to a job, but instead to specific conditions, such as a new subsystem being discovered. This type of alert is defined in Disk Manager → Alerting → Storage Subsystem Alerts. These types of alerts are either condition-based or threshold-based. When we discuss setting up a threshold, we really mean setting up an alert that defines a threshold. The same is true if someone says that they set up a constraint: they really set up an alert that defines a constraint. These values or conditions need to be exceeded or met in order for an alert to be generated.

Constraints
In contrast to the alerts that are defined with a probe or a scan job, the alerts defined in the Alerting navigation subtree are kept in a special constraint report available in the Disk Manager → Reporting → Storage Subsystem Performance → Constraint Violation navigation subtree. This report lists all the threshold-based alerts, which can be used to identify hot spots within the storage environment. In order to effectively utilize thresholds, the analyst must be familiar with the workloads. Figure 8-8 on page 225 shows all the available constraint violations. Unfortunately, most of them are not applicable to the DS8000.


Figure 8-8 Available constraint violations

Table 8-9 shows the constraint violations applicable to the DS8000. For those constraints without predefined values, we provide suggestions. You need to configure the exact values appropriately for the environment. Most of the metrics that are used for constraint violations are I/O rates and I/O throughput. It is difficult to configure thresholds based on these metrics, because absolute threshold values depend on the hardware capabilities and the workload. It might be perfectly acceptable for a tape backup to utilize the full bandwidth of the storage subsystem ports during backup periods. If the thresholds are configured to identify a high data rate, a threshold violation will be generated. In these cases, the thresholds are exceeded, but the information does not necessarily indicate a problem. These types of exceptions are called false positives. Other metrics, such as Disk Utilization Percentage, Overall Port Response Time, Write Cache Delay Percentage, and perhaps Cache Holding Time, tend to be more predictive of actual resource constraints and need to be configured in every environment. These constraints are listed first in Table 8-9.
Table 8-9 DS8000 constraints

Condition | Critical stress | Warning stress | Comment
Disk Utilization Percentage Threshold | 80 | 50 | Can be effective in identifying consistent disk hot spots.
Overall Port Response Time Threshold | 20 | 10 | Can be used to identify hot ports.
Write Cache Delay Percentage Threshold | 10 | 3 | Percentage of total I/O operations per processor complex delayed due to write cache space constraints.
Cache Holding Time Threshold | 30 | 60 | Amount of time in seconds that the average track persisted in cache per processor complex.
Total Port I/O Rate Threshold | Depends | Depends | Indicates highly active ports.
Total Port Data Rate Threshold | Depends | Depends | Indicates highly active ports.
Total I/O Rate Threshold | Depends | Depends | Difficult to use, because I/O rates vary depending on workload and configuration.
Total Data Rate Threshold (MB) | Depends | Depends | Difficult to use, because data rates vary depending on workload and configuration.

For information about the exact meaning of these metrics and thresholds, refer to 8.3, TotalStorage Productivity Center data collection on page 214.

Figure 8-9 on page 227 illustrates the four threshold levels that create five regions. Stress alerts define levels that, when exceeded, trigger an alert. An idle threshold level triggers an alert when the data value drops below the defined idle boundary. There are two types of alerts for both the stress category and the idle category:
- Critical Stress: No warning stress alert is created, because both (warning and critical) levels are exceeded within the interval.
- Warning Stress: It does not matter that the metric shows a lower value than in the last interval. An alert is triggered, because the value is still above the warning stress level.
- Normal workload and performance: No alerts are generated.
- Warning Idle: The workload drops significantly, and this drop might indicate a problem (which does not have to be performance-related).
- Critical Idle: The same applies as for critical stress.


Figure 8-9 Alert levels
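The region logic shown in Figure 8-9 can be expressed in a few lines of code. The following Python sketch is illustrative only; the threshold values in the usage example are hypothetical and would normally come from the alert definition.

def classify_sample(value, critical_stress, warning_stress, warning_idle, critical_idle):
    """Map a metric sample to one of the five alert regions shown in Figure 8-9."""
    if value >= critical_stress:
        return "Critical Stress"      # only the critical alert is generated
    if value >= warning_stress:
        return "Warning Stress"
    if value <= critical_idle:
        return "Critical Idle"
    if value <= warning_idle:
        return "Warning Idle"
    return "Normal"

# Hypothetical example: Disk Utilization Percentage with critical/warning stress
# levels of 80/50 and warning/critical idle levels of 5/1 (idle levels are optional).
print(classify_sample(67.0, 80, 50, 5, 1))   # prints: Warning Stress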

It is unnecessary to specify a threshold value for all levels. In order to configure a constraint, perform the following steps:
1. Go to Disk Manager → Alerting.
2. Right-click Storage Subsystem Alerts.
3. Select Create Storage Subsystem Alert. A window appears that is similar to Figure 8-10.
4. Select the triggering condition from the list box and scroll down until you see the desired metrics.

Figure 8-10 Create Storage Subsystem Alert


5. Select the Condition to configure.
6. Set the Critical Stress and Warning Stress levels.
7. On the Storage Subsystems tab, select the systems to which the constraint applies.
8. Configure any Triggered Actions, such as SNMP Trap, TEC Event, Login Notification, Windows Event Log, Run Script, or email.
9. Save the alert and provide a name.
10. You can view the alerts in the Disk Manager → Reporting → Storage Subsystem Performance → Constraint Violation navigation subtree.

Limitations to alert definitions


There are a few limitations to alert levels:
- There are only a few constraints that apply to the DS8000.
- Thresholds are always active. They cannot be set to exclude specific time periods.
- Detailed knowledge of the workload is required to utilize the constraints effectively.
- An alert is reissued for every sample that exceeds the threshold.
- If an idle threshold and a stress threshold are both exceeded, only the stress alert is generated.

Note: Configuring constraint violations too conservatively can lead to an excessive number of false positive alerts.

8.5.2 Predefined performance reports in TotalStorage Productivity Center


TotalStorage Productivity Center has several predefined reports to facilitate tactical performance management processes. To navigate to these reports, select IBM TotalStorage Productivity Center → My Reports → System Reports → Disk.

Figure 8-11 TotalStorage Productivity Center predefined reports

The predefined TotalStorage Productivity Center performance reports are preconfigured reports. The Top Volume reports show only a single metric over a given time period. These reports provide a way to identify the busiest volumes in the entire environment or by storage subsystem. You can use Selection and Filter for these reports. We describe the Selection and Filter options in detail in 8.5.4, Batch reports on page 232.

8.5.3 Ad hoc reports


In contrast to the predefined reports, the reports generated in Disk Manager contain all the metrics that apply to that component of the subsystem (for example, controller, array, and volume). Use the Selection and Filter options to eliminate unnecessary data. We recommend creating and saving reports for reuse later. Create reports in the Reporting panel as follows:
1. Select Disk Manager → Reporting → Storage Subsystem Performance (or the corresponding Fabric Manager Navigation Tree) as shown in Figure 8-12, and select any of the available metrics for the DS8000 under Storage Subsystem Performance:
   - By Storage Subsystem
   - By Controller
   - By Array
   - By Volume
   - By Port
   Use the selection and filter criteria as described in 8.5.4, Batch reports on page 232. Regard these reports as toolkits for building reusable reports.
2. Include only key columns and save them by selecting the Save icon or File → Save. They will now show up in the Navigation Tree under IBM TotalStorage Productivity Center → My Reports (under your user ID's reports).
3. To generate a chart, select the chart icon.
4. Select the metric to display as shown in Figure 8-13 on page 230. Click OK. A chart, such as the chart shown in Figure 8-14 on page 230, will be displayed.
5. Often the report is difficult to analyze due to the layout and the space occupied by the trend lines. Removing the trend lines improves the usability of the chart. Remove trends by right-clicking in the chart area, selecting Customize Chart, and then clearing the check mark in Show Trends.
6. Another option for viewing the data is to export the data to a comma separated values (CSV) file, which can then be post-processed (see the sketch after these steps). Select File → Export Data. Provide a file name as shown in Figure 8-15 on page 231.
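Once the data has been exported to a CSV file, it can be post-processed outside of the GUI. The following Python sketch ranks the exported rows by one metric column; the file name and the column headings are assumptions and must match the columns that you included in the report.

import csv

def top_n(csv_path, metric="Total I/O Rate (overall)", n=10):
    """Return the n exported rows with the highest value for the given metric column."""
    with open(csv_path, newline="") as f:
        rows = [r for r in csv.DictReader(f) if (r.get(metric) or "").strip()]
    rows.sort(key=lambda r: float(r[metric]), reverse=True)
    return rows[:n]

# Hypothetical usage with an exported volume report:
# for row in top_n("volume_report.csv"):
#     print(row.get("Volume"), row.get("Total I/O Rate (overall)"))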

Figure 8-12 Storage Subsystem Performance reports


Figure 8-13 Chart metrics

Figure 8-14 Read I/O rate chart


Figure 8-15 Export chart

Drill up and drill down


When you see the report as displayed in Figure 8-16, there are two small icons displayed on the left side of the table (we have copied and resized the icons on the top of the picture).

Figure 8-16 Drill down and drill up

If you click the drill down icon in Figure 8-16, you get a report containing all the volumes that are stored on that specific array. If you click the drill up icon, you get a performance report at the controller level. In Figure 8-17 on page 232, we show the DS8000 components and levels to which you can drill down. TotalStorage Productivity Center refers to the DS8000 processor complexes as controllers.

Figure 8-17 Drill-down path and drill-up path

8.5.4 Batch reports


The TotalStorage Productivity Center predefined and ad hoc reports provide a basic level of reporting, but they have a few limitations. Use batch reports if you want to perform additional analysis and reporting or to regularly schedule performance data extracts. In the following section, we describe the necessary steps for generating batch reports:
1. Expand the Batch Reports subtree as shown in Figure 8-18.

Figure 8-18 Batch Reports


2. Right-click Batch Reports and select Create Batch Report.
3. Select the report type and provide a description as shown in Figure 8-19.

Figure 8-19 Batch report type

4. On the Selection tab, select the date and time range, the interval, and the Subsystem components as shown in Figure 8-20.

Figure 8-20 Batch selection options

Note: Avoid using the Selection button when extracting volume data. In this case, we recommend using Filter. Use the following syntax to gather volumes for only the subsystem of interest: DS8000-2107-#######-IBM. Refer to Figure 8-21 for an example. Replace ####### with the seven character DS8000 serial number.


Figure 8-21 Filter on Subsystem

5. In order to reduce the amount of data, we suggest creating a filter that requires the selected component to have a Total I/O Rate (overall) (ops/s) of at least 1, as shown in Figure 8-22.

Figure 8-22 Filter on Total I/O Rate

6. Click the Options tab and select Include Headers.
7. Leave the radio button selected for CSV File. This option exports the data to a comma separated values file that can be analyzed with spreadsheet software.
8. Select an agent computer. Usually, this batch report runs on the TotalStorage Productivity Center server. Refer to Figure 8-23 for an example.

Figure 8-23 Batch options


9. Another consideration is When to Run. Click When to Run to see the available options. The default is Run Now. While this option is fine for ad hoc reporting, you might also schedule the report to Run Once at a certain time or Run Repeatedly. This tab also contains an option for setting the time zone for the report. The default is to use the local time in each time zone. Refer to the discussion about time stamps in 8.3.1, Timestamps on page 214 for more information.
10. Prior to running the job, configure any desired alerts in the Alert tab, which provides a means for sending alerts if the job fails. This feature can be useful if the job is a regularly scheduled job.
11. In order to run the batch report, immediately click the Save icon (diskette) in the toolbar as shown in Figure 8-24.

Figure 8-24 Batch report run

12. When you click the Save icon, a prompt displays "Specify a Batch Report name". Enter a name that is descriptive enough for later reference.
13. After submitting the job, it will either be successful or unsuccessful. Examine the log under Batch Reports to perform problem determination on unsuccessful jobs.

Note: The location of the batch file reports is not intuitive. They are located in the TotalStorage Productivity Center installation directory as shown in Figure 8-25.
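Because the output is written under the installation directory rather than to a location that you choose, a small script can locate the most recent extract for further analysis. The following Python sketch is an example under stated assumptions: the directory is a placeholder that must be adjusted to the location shown in Figure 8-25, the Subsystem column heading is an assumption, and the serial number simply reuses the example subsystem from this chapter.

import csv, glob, os

TPC_DIR = r"C:\Program Files\IBM\TPC"   # placeholder; adjust to the location shown in Figure 8-25

def newest_report(pattern="*.csv"):
    """Return the path of the most recently written batch report CSV under TPC_DIR."""
    files = glob.glob(os.path.join(TPC_DIR, "**", pattern), recursive=True)
    return max(files, key=os.path.getmtime) if files else None

def rows_for_subsystem(csv_path, serial="1303241"):
    """Yield only the rows that belong to one DS8000, identified by its serial number."""
    wanted = f"DS8000-2107-{serial}-IBM"
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            if row.get("Subsystem") == wanted:   # column name is an assumption
                yield row

# latest = newest_report()
# if latest:
#     print(latest, sum(1 for _ in rows_for_subsystem(latest)), "rows for this DS8000")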


Figure 8-25 Batch file output location

8.5.5 TPCTOOL
You can use the TPCTOOL command line interface to extract data from the TotalStorage Productivity Center database. While it requires no knowledge of the TotalStorage Productivity Center schema or SQL query skills, you need to understand how to use the tool; it is not obvious. Nevertheless, it has advantages over the TotalStorage Productivity Center GUI:
- Multiple components: Extract information about multiple components, such as volumes and arrays, by specifying a list of component IDs. If the list is omitted, every component for which data has been gathered is returned.
- Multiple metrics: The multiple metrics feature is probably the most important feature of the TPCTOOL reporting function. While exporting data from a history chart allows data from multiple samples for multiple components, it is limited to a single metric type. In TPCTOOL, the metrics are specified by the columns parameter.
- The data extraction can be completely automated. TPCTOOL, when used in conjunction with shell scripting, can provide an excellent way to automate the TotalStorage Productivity Center data extracts, which can be useful for loading data into a consolidated performance history repository for custom reporting and data correlation with other data sources.
- TPCTOOL can be useful if you need to create your own metrics using the supplied metrics or counters. For example, you can create a metric that shows the access density: the number of I/Os per GB. For this metric, you also need information from other TotalStorage Productivity Center reports that include the volume capacity. Manipulating the data will require additional work.

Nevertheless, TPCTOOL also has a few limitations:
- Single subsystem or fabric: Reports can only include data of a single subsystem or a single fabric, regardless of the components, ctypes, and metrics that you specify.
- Identification: The identification of components, subsystems, and fabrics is not so easy, because TPCTOOL uses worldwide names (WWNs) and Globally Unique Identifiers (GUIDs) instead of the user-defined names or labels. At least for certain commands, you can tell TPCTOOL to return more information by using the -l parameter. For example, lsdev also returns the user-defined label when using the -l parameter.
- Correlation: The drill-down relationships provided in the GUI are not maintained in the TPCTOOL extracts. Manual correlation of volume data with the TotalStorage Productivity Center array can be done, or a script can be used to automate this process. A script is provided in "Correlate TotalStorage Productivity Center volume batch reports with rank data obtained from DSCLI" on page 618 for mapping volume data with the TotalStorage Productivity Center arrays. The script is specific to volume data extracted using batch reports; however, the logic can be applied to TPCTOOL extracted volume data.

TPCTOOL has one command for creating reports and several list commands (starting with ls) for querying the information needed to generate a report. To generate a report with TPCTOOL:
1. Launch TPCTOOL by clicking tpctool.bat in the installation directory. Typically, tpctool.bat is in C:\Program Files\IBM\TPC\cli.
2. List the devices using lsdev as shown in Figure 8-26. Note the devices from which to extract data. In this example, we use 2107.1303241+0.

Figure 8-26 TPCTOOL lsdev output

3. Determine the component type to report by using the lstype command as shown in Figure 8-27 on page 238.


Figure 8-27 TPCTOOL lstype command output list

4. Next, decide which metrics to include in the report. The metrics returned by the lsmetrics command are the same as the columns in the TotalStorage Productivity Center GUI. Figure 8-28 provides an example of the lsmetrics command.

tpctool> lsmetrics -user <USERID> -pwd <PASSWORD> -ctype subsystem -url localhost:9550 -subsys 2107.1303241+0
Metric                                      Value
==================================================
Read I/O Rate (normal)                      801
Read I/O Rate (sequential)                  802
Read I/O Rate (overall)                     803
Write I/O Rate (normal)                     804
Write I/O Rate (sequential)                 805
Write I/O Rate (overall)                    806
Total I/O Rate (normal)                     807
Total I/O Rate (sequential)                 808
Total I/O Rate (overall)                    809
Read Cache Hit Percentage (normal)          810
Record Mode Read I/O Rate                   828
Read Cache Hits Percentage (sequential)     811
Read Cache Hits Percentage (overall)        812
Write Cache Hits Percentage (normal)        813
Write Cache Hits Percentage (sequential)    814
Write Cache Hits Percentage (overall)       815
Total Cache Hits Percentage (normal)        816
Total Cache Hits Percentage (sequential)    817
Total Cache Hits Percentage (overall)       818
Cache Holding Time                          834
Read Data Rate                              819
Write Data Rate                             820
Total Data Rate                             821
Read Response Time                          822
Write Response Time                         823
Overall Response Time                       824
Read Transfer Size                          825
Write Transfer Size                         826
Overall Transfer Size                       827
Record Mode Read Cache Hit Percentage       829
Disk to Cache Transfer Rate                 830
Cache to Disk Transfer Rate                 831
NVS Full Percentage                         832
NVS Delayed I/O Rate                        833
Backend Read I/O Rate                       835
Backend Write I/O Rate                      836
Total Backend I/O Rate                      837
Backend Read Data Rate                      838
Backend Write Data Rate                     839
Total Backend Data Rate                     840
Backend Read Response Time                  841
Backend Write Response Time                 842
Overall Backend Response Time               843
Backend Read Transfer Size                  847
Backend Write Transfer Size                 848
Overall Backend Transfer Size               849

Figure 8-28 The lsmetrics command output

5. Determine the start date and time and put it in the following format: YYYY.MM.DD:HH:MM:SS.
6. Determine the duration of the report in seconds, for example, 86400 (1 day).
7. Determine the summarization level: sample, hourly, or daily.

8. Run the report using the getrpt command as shown in Figure 8-29. The command output can be redirected to a file for analysis in a spreadsheet. The <USERID> and <PASSWORD> variables need to be replaced with the correct values for your environment.
tpctool> getrpt -user <USERID> -pwd <PASSWORD> -ctype array -url localhost:9550 -subsy 2107.1303241+0 -level hourly -start 2008.11.04:10:00:00 -duration 86400 -columns 801,8 Timestamp Interval Device Component 801 802 ================================================================================ 2008.11.04:00:00:00 3600 DS8000-2107-1303241-IBM 2107.1303241-10 78.97 26.31 2008.11.04:01:00:00 3600 DS8000-2107-1303241-IBM 2107.1303241-10 54.73 14.85 2008.11.04:02:00:00 3600 DS8000-2107-1303241-IBM 2107.1303241-10 43.72 11.13 2008.11.04:03:00:00 3600 DS8000-2107-1303241-IBM 2107.1303241-10 40.92 8.36 2008.11.04:04:00:00 3600 DS8000-2107-1303241-IBM 2107.1303241-10 50.92 10.03

Figure 8-29 The getrpt command sample
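The access density metric mentioned earlier (I/Os per GB) is an example of what can be derived by post-processing the getrpt output. The following Python sketch assumes that the report was redirected to a file using -fs ";" as the field separator and that volume capacities were exported separately to a CSV file with Volume and Capacity (GB) columns; these file names and column headings are assumptions. Column 809 corresponds to Total I/O Rate (overall) in the lsmetrics output.

import csv

def parse_getrpt(path, sep=";"):
    """Parse a redirected tpctool getrpt extract into a list of dictionaries."""
    with open(path) as f:
        lines = [line.rstrip("\n") for line in f if line.strip()]
    header = [h.strip() for h in lines[0].split(sep)]
    rows = []
    for line in lines[1:]:
        if set(line.strip()) <= set("= " + sep):   # skip the ===== separator line
            continue
        rows.append(dict(zip(header, (v.strip() for v in line.split(sep)))))
    return rows

def access_density(io_rate_by_volume, capacity_gb_by_volume):
    """I/Os per second per GB for each volume present in both inputs."""
    return {vol: rate / capacity_gb_by_volume[vol]
            for vol, rate in io_rate_by_volume.items()
            if capacity_gb_by_volume.get(vol)}

# Hypothetical usage (keeps the last sample per volume for simplicity):
# rows = parse_getrpt("volume_getrpt.txt")
# io_rate = {r["Component"]: float(r["809"]) for r in rows}     # 809 = Total I/O Rate (overall)
# with open("volume_capacity.csv", newline="") as f:
#     capacity = {r["Volume"]: float(r["Capacity (GB)"]) for r in csv.DictReader(f)}
# for vol, density in sorted(access_density(io_rate, capacity).items(),
#                            key=lambda item: item[1], reverse=True)[:10]:
#     print(vol, round(density, 2), "IOPS per GB")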

Tip: If you want to import the data into Excel later, we recommend using a semi-colon as the field separator (-fs parameter). A comma can easily be mistaken as a decimal or decimal grouping symbol. The book titled TotalStorage Productivity Center Advanced Topics, SG24-7438, contains instructions for importing TPCTOOL data into Excel. The book also provides a Visual Basic macro that can be used to modify the time stamp to the international standard.

The lstime command is extremely helpful, because it provides information that can be used to determine if performance data collection is running. It provides three fields:
- Start: The date and time of the start of performance data collection
- Duration: The number of seconds that the job ran
- Option: The location

Example 8-2 Using the TPCTOOL lstime command

tpctool> lstime -user <USERID> -pwd <PASSWORD> -ctype array -url localhost:9550 -level hourly -subsys 2107.1303241+0
Start               Duration Option
===================================
2008.10.23:13:00:00 370800   server
2008.10.27:20:00:00 928800   server

In order to identify if the performance job is still running, use the following logic:
1. Identify the start time of the last collection (2008.10.27 at 20:00:00).
2. Identify the duration (928800).
3. Add the start time to the duration (use Excel: =SUM(2008.10.27 20:00:00+(928800/86400))).
4. Compare the result to the current time. The result is 2008.11.07 at 14:00, which happens to be the current time. This result indicates that data collection is running.
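The same arithmetic can be scripted so that the check does not depend on a spreadsheet. The following Python sketch adds the duration to the start time of the most recent lstime entry and compares the result to the current time; the two-hour tolerance is an assumption.

from datetime import datetime, timedelta

def collection_is_running(start_str, duration_seconds, tolerance_hours=2):
    """True if the lstime start + duration is close to the current time."""
    start = datetime.strptime(start_str, "%Y.%m.%d:%H:%M:%S")
    end_of_data = start + timedelta(seconds=duration_seconds)
    return datetime.now() - end_of_data < timedelta(hours=tolerance_hours)

# Values from the lstime output in Example 8-2; at the time that output was taken,
# start + duration matched the current time, so the collection was running.
print(collection_is_running("2008.10.27:20:00:00", 928800))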

8.5.6 Volume Planner


The Volume Planner helps administrators plan for the provisioning of subsystem storage based on capacity, storage controller type, number of volumes, volume size, performance requirements, RAID level, performance utilization, and capacity utilization if performance data for the underlying storage subsystem exists. If a prior performance monitor has not been run for that storage subsystem, the Volume Planner will make its decisions based on capacity only. The Volume Planner generates a plan that presents the storage controllers and storage pools that can satisfy the request. If you explicitly specify the storage pool and controller information, the Volume Planner checks to see whether the input performance and capacity requirements can be satisfied.

8.5.7 TPC Reporter for Disk


The IBM TPC Reporter for Disk is a Java 2 Platform application that connects remotely to a server running IBM TotalStorage Productivity Center software. The TPC Reporter extracts storage subsystem information and hourly performance statistics from the TotalStorage Productivity Center server. Extracted statistics are compiled locally and transcribed into a white paper-style PDF file, which is saved on the local machine. The report contains information detailing your storage server utilization. The automatically generated report contains an overview of the subsystem information, basic attributes of each subsystem component, a performance summary of each component, aggregate statistics of each component, and charts detailing information about each component instance. DS8000 component types reported are subsystem, ports, arrays, and volumes.

Configuring TPC Reporter for Disk


In order to run the TPC Reporter for Disk:
1. Download the package to a system connected to the same network as the TotalStorage Productivity Center server. Download the software from:
   http://www.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/PRS2618
2. Install the software per the accompanying readme file.
3. Start the program by clicking the TPC Reporter for Disk icon. A start window will display as shown in Figure 8-30. Click Start.

Figure 8-30 TPC for Disk start

4. A prompt will ask for host name (IP), user ID, and password. Click OK.
5. If it connects properly, it will then prompt you to select one or more device serial numbers as shown in Figure 8-31 on page 241.


Figure 8-31 TPC Reporter for Disk: Select Serial Number

6. Select the date range as shown in Figure 8-32.

Figure 8-32 TPC Reporter for Disk: Select Date Range

7. Select the components: ports, arrays, and volumes.
8. Enter the client information and click Continue.
9. Enter the IBM/Business Partner information and click Continue.
10. Select the file name to save the report as and click Save.
11. Click Exit to close the reporter.

A report is now in the location selected in the previous steps. The report will contain the following information:
- Configuration and capacity
- Performance overview: subsystem-level averages
- Subsystem-level charts of key metrics
- Subsystem definitions
- Port information
- Port performance summary
- Port detail charts
- Port metric definitions
- Array configuration information
- Array performance summary
- Array detail charts
- Array metric definitions
- Volume information
- Volume performance summary
- Volume detail charts
- Volume metric definitions


Note: TPC Reporter for Disk is an excellent way to generate regular performance health check reports, especially for port and array data. Due to the quantity of reports generated, we suggest excluding the volumes from the report unless a problem is identified with the array or port data that requires additional detail.

8.6 Monitoring performance of a SAN switch or director


All SAN switch and director vendors provide management software that includes performance monitoring capabilities. The real-time SAN statistics, such as port utilization and throughput information available from SAN management software, can be used to complement the performance information provided by host servers or storage subsystems. For additional information about monitoring performance through a SAN switch or director point product, refer to:
- http://www.brocade.com
- http://www.cisco.com

Most SAN management software includes options to create Simple Network Management Protocol (SNMP) alerts based on performance criteria, and to create historical reports for trend analysis. Certain SAN vendors offer advanced performance monitoring capabilities, such as measuring I/O traffic between specific pairs of source and destination ports, and measuring I/O traffic for specific LUNs.

In addition to the vendor point products, TotalStorage Productivity Center for Fabric can be used as a central data repository and reporting tool for switch environments. While it lacks real-time capabilities, TotalStorage Productivity Center for Fabric collects and reports on data at a 5 - 60 minute interval for later analysis. TotalStorage Productivity Center for Fabric provides facilities to report on fabric topology, configuration and configuration changes, and switch and port performance and errors. In addition, it provides the ability to configure alerts or constraints for Total Port Data Rate and Total Port Packet Rate. Configuration options allow the creation of events to be triggered if the constraints are exceeded. While TotalStorage Productivity Center for Fabric does not provide real-time monitoring, it has several advantages over traditional vendor point products:
- Ability to store performance data from multiple switch vendors in a common database
- Advanced reporting and correlation between host data and switch data via custom reports
- Centralized management and reporting
- Aggregation of port performance data for the entire switch

In general, you need to analyze SAN statistics to:
- Ensure that there are no SAN bottlenecks limiting DS8000 I/O traffic. (For example, analyze any link utilization over 80%.)
- Confirm that multipathing/load balancing software operates as expected.
- Isolate the I/O activity contributed by adapters on different host servers sharing storage subsystem I/O ports.
- Isolate the I/O activity contributed by different storage subsystems accessed by the same host server.


8.6.1 SAN configuration examples


We look at four example SAN configurations where SAN statistics might be beneficial for monitoring and analyzing DS8000 performance. The first example configuration, which is shown in Figure 8-33, has host server Host_1 connecting to DS8000_1 through two SAN switches or directors (SAN Switch/Director_1 and SAN Switch/Director_2). There is a single inter-switch link (ISL) between the two SAN switches. In this configuration, the performance data available from the host and from the DS8000 will not be able to show the performance of the ISL. For example, if the Host_1 adapters and the DS8000_1 adapters do not achieve the expected throughput, the SAN statistics for utilization of the ISL must be checked to determine whether the ISL is limiting I/O performance.

Figure 8-33 Inter-switch link (ISL) configuration

A second type of configuration in which SAN statistics can be useful is shown in Figure 8-34 on page 244. In this configuration, host bus adapters or channels from multiple servers access the same set of I/O ports on the DS8000 (server adapters 1 - 4 share access to DS8000 I/O ports 5 and 6). In this environment, the performance data available from only the host server or only the DS8000 might not be enough to confirm load balancing or to identify each server's contribution to I/O port activity on the DS8000, because more than one host is accessing the same DS8000 I/O ports. If DS8000 I/O port 5 is highly utilized, it might not be clear whether Host_A, Host_B, or both hosts are responsible for the high utilization. Taken together, the performance data available from Host_A, Host_B, and the DS8000 might be enough to isolate each server connection's contribution to I/O port utilization on the DS8000; however, the performance data available from the SAN switch or director might make it easier to see load balancing and relationships between I/O traffic on specific host server ports and DS8000 I/O ports at a glance, because it can provide real-time utilization and traffic statistics for both host server SAN ports and DS8000 SAN ports in a single view, with a common reporting interval and metrics.


TotalStorage Productivity Center for Fabric can be used for analysis of historical data, but it does not collect data in real time.
Figure 8-34 Shared DS8000 I/O ports

SAN statistics can also be helpful in isolating the individual contributions of multiple DS8000s to I/O performance on a single server. In Figure 8-35 on page 245, host bus adapters or channels 1 and 2 from a single host (Host_A) access I/O ports on multiple DS8000s (I/O ports 3 and 4 on DS8000_1 and I/O ports 5 and 6 on DS8000_2). In this configuration, the performance data available from either the host server or from the DS8000 might not be enough to identify each DS8000's contribution to adapter activity on the host server, because the host server is accessing I/O ports on multiple DS8000s. For example, if adapters on Host_A are highly utilized or if I/O delays are experienced, it might not be clear whether this is due to traffic that is flowing between Host_A and DS8000_1, between Host_A and DS8000_2, or between Host_A and both DS8000_1 and DS8000_2. The performance data available from the host server and from both DS8000s can be used together to identify the source of high utilization or I/O delays. Additionally, you can use TotalStorage Productivity Center for Fabric or vendor point products to gather performance data for both host server SAN ports and DS8000 SAN ports.


Figure 8-35 Single server accessing multiple DS8000s

Another configuration in which SAN statistics can be important is a remote mirroring configuration, such as the configuration shown in Figure 8-36. Here, two DS8000s are connected through a SAN for synchronous or asynchronous remote mirroring or remote copying, and the SAN statistics can be collected to analyze traffic for the remote mirroring links.

Figure 8-36 Remote mirroring configuration

You must check SAN statistics to determine if there are SAN bottlenecks limiting DS8000 I/O traffic. You can also use SAN link utilization or throughput statistics to break down the I/O activity contributed by adapters on different host servers to shared storage subsystem I/O ports. Conversely, you can use SAN statistics to break down the I/O activity contributed by different storage subsystems accessed by the same host server. SAN statistics can also highlight whether multipathing/load balancing software is operating as desired, or whether there are performance problems that need to be resolved.

8.6.2 TotalStorage Productivity Center for Fabric alerts


TotalStorage Productivity Center for Fabric provides the ability to configure two types of alerts:
- Discovery: Alert due to a type of change in the environment. These alerts are relevant to performance from the perspective of identifying changes in the environment.
- Threshold: Alert due to a type of performance threshold exception or constraint violation.
Table 8-10 TotalStorage Productivity Center for Fabric alert types

Alert component | Condition | Type
Fabric | Fabric Discovered | Discovery
Switch | Switch Discovered | Discovery
Switch | Switch State Change | Discovery
Switch | Switch Property Change | Discovery
Switch | Switch Status Degraded | Discovery
Switch | Switch Status Improved | Discovery
Switch | Switch Version Change | Discovery
Switch | Switch to Port Change | Discovery
Switch | Switch Blade Change | Discovery
Switch | Switch Blade Change Online | Discovery
Switch | Switch Blade Change Offline | Discovery
Switch | Total Port Data Rate Threshold | Threshold
Switch | Link Failure Rate Threshold | Threshold
Switch | Error Frame Rate Threshold | Threshold
Switch | Total Port Packet Rate Threshold | Threshold
Endpoint Device Alert | Endpoint Discovered | Discovery
Endpoint Device Alert | Endpoint State Changed | Discovery
Endpoint Device Alert | Endpoint to Node Changed | Discovery
Endpoint Device Alert | Endpoint Version Change | Discovery

The available TotalStorage Productivity Center for Fabric thresholds are the conditions of type Threshold in Table 8-10. Unfortunately, the available thresholds are aggregated at the switch level and are not granular enough to identify individual port saturation.


Configuring TotalStorage Productivity Center for Fabric alerts


The discovery alerts are enabled by default and reported during a probe. The threshold alerts require configuration. The following steps illustrate how to configure a new threshold alert:
1. In the Navigation Tree, expand Fabric Manager and Alerting. Right-click Switch Alerts and select Create Switch Alerts. A window similar to Figure 8-37 will appear.

Figure 8-37 TotalStorage Productivity Center for Fabric: Switch alert threshold

2. Configure the Critical Stress and Warning Stress rates.
3. Enable any Triggered-Actions and save.

8.6.3 TotalStorage Productivity Center for Fabric reporting


TotalStorage Productivity Center for Fabric provides the ability to create several types of performance reports, which are shown in Table 8-11.
Table 8-11 TotalStorage Productivity Center for Fabric reports

Report type | Report name | Comments
Predefined | Switch Performance | Aggregate of all ports for a switch.
Predefined | Total Switch Port Data Rate | Graph individual switch port metrics.
Predefined | Total Switch Port Packet Rate | Graph individual switch port metrics.
Ad hoc | Create line chart with up to 10 ports and any supported metric | Useful for identifying port hot spots over time.
Batch | Export port performance data | Useful for exporting data for analysis in spreadsheet software.
TPCTOOL | Command line tool for extracting data from TPC | Extract data for analysis in spreadsheet software. Can be automated.
Custom | Create custom queries using BIRT | Useful for creating reports not available in TotalStorage Productivity Center.

The process of using TotalStorage Productivity Center for Fabric to create reports is similar to the process that is used to create reports in TotalStorage Productivity Center for Disk as described in 8.5, TotalStorage Productivity Center reporting options on page 222.

8.6.4 TotalStorage Productivity Center for Fabric metrics


Table 8-12 shows the TotalStorage Productivity Center for Fabric metrics that you can collect.
Table 8-12 TotalStorage Productivity Center for Fabric metrics

Port Send Packet Rate: Average number of packets per second for send operations, for a particular port during the sample interval.
Port Receive Packet Rate: Average number of packets per second for receive operations, for a particular port during the sample interval.
Total Port Packet Rate: Average number of packets per second for send and receive operations, for a particular port during the sample interval.
Port Send Data Rate: Average number of megabytes (2^20 bytes) per second that were transferred for send (write) operations, for a particular port during the sample interval.
Port Receive Data Rate: Average number of megabytes (2^20 bytes) per second that were transferred for receive (read) operations, for a particular port during the sample interval.
Total Port Data Rate: Average number of megabytes (2^20 bytes) per second that were transferred for send and receive operations, for a particular port during the sample interval.
Port Peak Send Data Rate: Peak number of megabytes (2^20 bytes) per second that were sent by a particular port during the sample interval.
Port Peak Receive Data Rate: Peak number of megabytes (2^20 bytes) per second that were received by a particular port during the sample interval.
Port Send Packet Size: Average number of KB sent per packet by a particular port during the sample interval.
Port Receive Packet Size: Average number of KB received per packet by a particular port during the sample interval.
Overall Port Packet Size: Average number of KB transferred per packet by a particular port during the sample interval.
Error Frame Rate: The average number of frames per second that were received in error during the sample interval.
Dumped Frame Rate: The average number of frames per second that were lost due to a lack of available host buffers during the sample interval.
Link Failure Rate: The average number of link errors per second during the sample interval.
Loss of Sync Rate: The average number of times per second that synchronization was lost during the sample interval.
Loss of Signal Rate: The average number of times per second that the signal was lost during the sample interval.
CRC Error Rate: The average number of frames received per second in which the cyclic redundancy check (CRC) in the frame did not match the CRC computed by the receiver during the sample interval.

The most important metric for determining if a SAN bottleneck exists is the Total Port Data Rate. When used in conjunction with the port configuration information, you can identify port saturation. For example, if the inter-switch links (ISLs) between two switches are rated at 4 Gbit/sec, then a throughput of greater than or equal to 3.5 Gbit/sec indicates saturation.
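Because TotalStorage Productivity Center reports the Total Port Data Rate in MB per second (2^20 bytes), a small conversion is needed before comparing it to the link speed. The following Python sketch flags samples that reach roughly the 3.5 Gbit/sec point mentioned above for a 4 Gbit/sec link; applying the same 87.5% ratio to other link speeds is an assumption.

def mbps_to_gbits(mb_per_sec):
    """Convert a reported data rate (MB/s, where 1 MB = 2^20 bytes) to Gbit/s."""
    return mb_per_sec * (2 ** 20) * 8 / 1e9

def is_saturated(mb_per_sec, link_gbits=4.0, ratio=0.875):
    """True if the observed rate is at or above ~87.5% of the nominal link speed."""
    return mbps_to_gbits(mb_per_sec) >= link_gbits * ratio

# A sustained Total Port Data Rate of 420 MB/s on a 4 Gbit/sec ISL:
print(round(mbps_to_gbits(420), 2), "Gbit/s ->",
      "saturated" if is_saturated(420) else "OK")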

TotalStorage Productivity Center For Fabric topology viewer


TotalStorage Productivity Center provides a tool called the topology viewer for viewing the connectivity in an environment. Figure 8-38 shows an example of this feature. This example shows the connectivity from the host x346-tic-5 to the switch and from the switch to the back-end storage. Additional detail can be provided by drilling down on each of the paths. Alerts for a component are indicated by a red exclamation point (!).

Figure 8-38 TotalStorage Productivity Center topology viewer

8.7 End-to-end analysis of I/O performance problems


In order to support tactical performance management processes, problem determination skills and processes must exist. In this section, we discuss the logical steps required to perform successful problem determination for I/O performance issues. The process of I/O performance problem determination consists of the following logical steps:
- Define the problem.
- Classify the problem.
- Identify the I/O bottleneck.
- Implement changes to remove the I/O bottleneck.
- Validate that the changes resolved the issue.

Perceived or actual I/O bottlenecks can result from hardware failures on the I/O path, contention on the server, contention on the SAN Fabric, contention on the DS8000 front-end ports, or contention on the back-end disk adapters or disk arrays. In this section, we provide a process for diagnosing these scenarios using TotalStorage Productivity Center and external data. This process was developed for identifying specific types of problems and is not a substitute for common sense, knowledge of the environment, and experience. Figure 8-39 shows the high-level process flow.

Figure 8-39 I/O performance analysis process

I/O bottlenecks as referenced in this section relate to one or more components on the I/O path that have reached a saturation point and can no longer achieve the I/O performance requirements. I/O performance requirements are typically throughput-oriented or transaction-oriented. Heavy sequential workloads, such as tape backups or data warehouse environments, might require maximum bandwidth and use large sequential transfers. However, they might not have stringent response time requirements. Transaction-oriented workloads, such as online banking systems, might have stringent response time requirements but have no requirements for throughput. If a server CPU or memory resource shortage is identified, it is important to take the necessary remedial actions. These actions might include but are not limited to adding additional CPUs, optimizing processes or applications, or adding additional memory. In general, if there are not any resources constrained on the server but the end-to-end I/O response time is higher than expected for the DS8000 (See General rules on page 293), there is likely a resource constraint in one or more of the SAN components. In order to troubleshoot performance problems, TotalStorage Productivity Center for Disk and TotalStorage Productivity Center for Fabric data must be augmented with host performance and configuration data. Figure 8-40 on page 251 shows a logical end-to-end view from a measurement perspective.


Figure 8-40 End-to-end measurement

As shown in Figure 8-40, TotalStorage Productivity Center does not provide host performance, configuration, or error data. TotalStorage Productivity Center for Fabric provides performance and error log information about SAN switches. TotalStorage Productivity Center for Disk provides DS8000 storage performance and configuration information.

Process assumptions
This process assumes that:
- The server is connected to the DS8000 natively.
- Tools exist to collect the necessary performance and configuration data for each component along the I/O path (server disk, SAN fabric, and DS8000 arrays, ports, and volumes).
- Skills exist to utilize the tools, extract data, and analyze data.
- Data is collected in a continuous fashion to facilitate performance management.

Process flow
The order in which you conduct the analysis is important. We suggest the following process:
1. Define the problem. A sample questionnaire is provided in Sample questions for an AIX host on page 154. The goal is to assist in determining the problem background and to understand how the performance requirements are not being met.

Note: Before proceeding any further, ensure that adequate discovery is pursued to identify any changes in the environment. In our experience, there is a significant correlation between changes in the environment and sudden unexpected performance issues.


2. Properly classify the problem by identifying hardware or configuration issues. Hardware failures often manifest themselves as performance issues, because I/O is significantly degraded on one or more paths. If a hardware issue is identified at this point, focus all problem determination efforts on identifying the root cause of the hardware errors:
a. Gather any errors on any of the host paths.
Note: If you notice significant errors in the datapath query device or pcmpath query device output and the error counts increase, there is likely a problem with a physical component on the I/O path.
b. Gather the host error report and look for Small Computer System Interface (SCSI) or FIBRE errors.
Note: Often, a hardware error relating to a component on the I/O path manifests itself as a TEMP error. A TEMP error does not necessarily exclude a hardware failure. You must perform diagnostics on all hardware components in the I/O path, including the host bus adapter (HBA), SAN switch ports, and DS8000 HBA ports.
c. Gather the SAN switch configuration and errors. Every switch vendor provides different management software. All of the SAN switch software provides error monitoring and a way to identify whether there is a hardware failure with a port or application-specific integrated circuit (ASIC). Refer to your vendor-specific manuals or contact vendor support for more information about identifying hardware failures.
Note: As you move from the host to external resources, look for patterns. A common error pattern involves errors affecting only those paths on the same HBA. If both paths on the same HBA experience errors, the errors are likely the result of a common component: the host HBA, the cable from the host HBA to the SAN switch, or the SAN switch port itself. Ensure that all of these components are thoroughly reviewed before proceeding.
d. If errors exist on one or more of the host paths, determine if there are any DS8000 hardware errors. Log on to the HMC as customer/cust0mer and verify that there are no hardware alerts. Figure 8-41 provides a sample of a healthy DS8000. If there are any errors, you might need to open a problem ticket (PMH) with DS8000 hardware support (2107 engineering).

Figure 8-41 DS8000 healthy HMC
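To make the path-error checks in steps a and d repeatable, you can capture the multipathing output to time-stamped files and flag any path with a nonzero error counter. The following shell sketch is illustrative only: it assumes that SDD (datapath) or SDDPCM (pcmpath) is installed, and because the output column layout can differ between driver levels, verify which field holds the error counter before relying on the filter:

# Capture path and adapter status for later comparison (file names are arbitrary).
STAMP=$(date +%Y%m%d_%H%M)
for CMD in datapath pcmpath; do
  if command -v $CMD >/dev/null 2>&1; then
    $CMD query device  > /tmp/${CMD}_device_${STAMP}.out
    $CMD query adapter > /tmp/${CMD}_adapter_${STAMP}.out
  fi
done
# Flag path lines whose last field (the error counter at the SDD level we used) is nonzero.
awk '$1 ~ /^[0-9]+$/ && $NF ~ /^[0-9]+$/ && $NF != 0' /tmp/*_device_${STAMP}.out

Comparing two captures taken a few hours apart shows whether the error counters are still increasing, which is the key indicator called out in the Note for step a.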

3. After validating that no hardware failures exist, analyze server performance data and identify any disk bottlenecks. The fundamental premise of this methodology is that I/O performance degradation relating to SAN component contention can be observed at the server via analysis of key

server-based I/O metrics. Degraded end-to-end I/O response time is the strongest indication of I/O path contention. Typically, server physical disk response times measure the time that a physical I/O request takes from the moment that the request is initiated by the device driver until the device driver receives an interrupt from the controller that the I/O completed. The measurements are displayed as either service time or response time, and they are usually averaged over the measurement interval. Typically, server wait or queue metrics refer to time spent waiting at the HBA, which is usually an indication of HBA saturation. In general, interpret the service times as response times, because they include potential queuing at various storage subsystem components, for example, the switch, storage HBA, storage cache, storage back-end disk controller, storage back-end paths, and disk drives.
Note: Subsystem-specific load balancing software usually does not add any performance overhead and can be viewed as a pass-through layer.
In addition to the disk response time and disk queuing data, gather the disk activity rates, including read I/Os, write I/Os, and total I/Os, because they show which disks are active:
a. Gather performance data as shown in Table 8-13.
Table 8-13 Native tools and key metrics
OS | Native tool | Command/Object | Metric/Counter
AIX | iostat (5.3), filemon | iostat -D, filemon -o /tmp/fmon.log -O all | read time (ms), write time (ms), reads, writes, queue length
Hewlett-Packard UNIX (HP-UX) | sar | sar -d | avserv (ms), avque, blks/s
Linux | *iostat | iostat -d | svctm (ms), avgqu-sz, tps
Solaris | iostat | iostat -xn | svc_t (ms), avque, blks/s
Windows server | perfmon | Physical Disk | Avg Disk Sec/Read, Avg Disk Sec/Write, Read Disk Queue Length, Write Disk Queue Length, Disk Reads/sec, Disk Writes/sec
System z | Resource Measurement Facility (RMF)/System Management Facilities (SMF) | Chapter 15, System z servers on page 441 | N/A
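For example, on an AIX host the statistics listed in Table 8-13 can be captured over a defined window with a short script like the following sketch (the interval, count, and output file names are arbitrary choices, not requirements):

# Collect extended per-disk statistics every 60 seconds for one hour.
HOST=$(hostname)
STAMP=$(date +%Y%m%d_%H%M)
iostat -D 60 60 > /tmp/iostat_D_${HOST}_${STAMP}.out &
# Optionally collect a one-minute filemon trace for a more detailed breakdown,
# then stop the trace so that filemon writes its report.
filemon -o /tmp/fmon_${HOST}_${STAMP}.log -O all
sleep 60
trcstop

Collect the host data during the same window and at a comparable interval as the TotalStorage Productivity Center collection so that the host and DS8000 measurements can be correlated later.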


Note: The number of total I/Os per second indicates the relative activity of the device. This relative activity provides a metric to prioritize the analysis. Those devices with high response times and high activity are obviously more important to understand than devices with high response times and infrequent access. If analyzing the data in a spreadsheet, consider creating a combined metric of Average I/Os x Average Response Time to provide a method for identifying the most I/O-intensive disks. You can obtain additional detail about OS-specific server analysis in the OS-specific chapters.
b. Gather configuration data (Subsystem Device Driver (SDD)/Subsystem Device Driver Path Control Module (SDDPCM)) as shown in Table 8-14. In addition to the multipathing configuration data, you need to collect configuration information for the host and DS8000 HBAs, including the bandwidth of each adapter.
Table 8-14 Path configuration data
OS | Tool | Command | Key | Other information
All UNIX | SDD/SDDPCM | datapath query essmap, pcmpath query essmap | LUN serial | *Rank, logical subsystem (LSS), Storage subsystem
Windows | SDD/Subsystem Device Driver Device Specific Module (SDDDSM) | datapath query essmap | LUN serial | *Rank, LSS, Storage subsystem
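The essmap output provides the host disk name to DS8000 LUN serial mapping that is needed to correlate host statistics with DS8000 volumes later in the process. The following is a minimal sketch only; the column positions in the essmap output differ between SDD and SDDPCM levels, so check the header line of your output and adjust the awk field numbers accordingly:

# Save the raw mapping, then extract "host disk name, LUN serial" pairs.
datapath query essmap > /tmp/essmap_$(hostname).out
# On the SDD level we used, the disk name and the LUN serial were the 2nd and
# 4th columns; verify this against the header line before using the result.
awk 'NR > 2 {print $2 "," $4}' /tmp/essmap_$(hostname).out > /tmp/hdisk2lun.csv

The resulting file can then be joined with the formatted iostat data (step c) on the disk name, so that every response time record is tagged with its DS8000 LUN serial.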

Note: The rank column is not meaningful for multi-rank extent pools on the DS8000. For single rank extent pools, it only provides a mechanism for understanding that different volumes are located on different ranks.

Note: Ensure that multipathing behaves as designed. For example, if there are two paths zoned per HBA to the DS8000, there must be four paths active per LUN. Both SDD and SDDPCM use an active/active configuration of multipathing, which means that traffic flows across all of the paths fairly evenly. For native DS8000 connections, the absence of activity on one or more paths indicates a problem with the SDD behavior.
c. Format the data and correlate the host LUNs with their associated DS8000 resources. Formatting the data is not required for analysis, but it is easier to analyze formatted data in a spreadsheet. The following steps represent the logical steps required to perform the formatting, and they do not represent literal steps. You can codify these steps in scripts. You can obtain examples of these scripts in Appendix D, Post-processing scripts on page 607:
i. Read the configuration file.
ii. Build an hdisk hash with key = hdisk and value = LUN SN.
iii. Read the I/O response time data.
iv. Create hashes for each of the following values with hdisk as the key: Date, Start time, Physical Volume, Reads, Avg Read Time, Avg Read Size, Writes, Avg Write Time, and Avg Write Size.
v. Print the data to a file with headers and commas to separate the fields.
vi. Iterate through the hdisk hash and use the common hdisk key to index into the other hashes and print those hashes that have values.
d. Analyze the host performance data:
i. Determine if I/O bottlenecks exist by summarizing the data and analyzing key performance metrics for values in excess of the thresholds discussed in General rules on page 293. Identify those vpaths/LUNs with poor response time. We show an example in 10.8.6, Analyzing performance data on page 297. At this point, you need to have excluded hardware errors and multipathing configuration issues, and you must have identified the hot LUNs. Proceed to step 4 to determine the root cause of the performance issue.
ii. If no degraded disk response times exist, the issue is likely related to something internal to the server.
4. If disk constraints were identified, continue the identification of the root cause by collecting and analyzing DS8000 configuration and performance data:
a. Gather the configuration information. A script called DS8K-Config-Gatherer.cmd is provided in Appendix C, UNIX shell scripts on page 587. TotalStorage Productivity Center can also be used to gather configuration data via the topology viewer or from the Data Manager Reporting Asset By Storage Subsystem report as shown in Figure 8-42.

Figure 8-42 By Storage Subsystem Asset report
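If you prefer the command line, the same basic configuration information can also be pulled directly with the DSCLI. The following is a sketch only; the profile name and the storage image ID (IBM.2107-75ABCD1) are placeholders, and the exact flags should be verified against your DSCLI level:

dscli -cfg ds8000.profile lsioport -dev IBM.2107-75ABCD1
dscli -cfg ds8000.profile lsarray -l -dev IBM.2107-75ABCD1
dscli -cfg ds8000.profile lsrank -l -dev IBM.2107-75ABCD1
dscli -cfg ds8000.profile lsextpool -l -dev IBM.2107-75ABCD1
dscli -cfg ds8000.profile lsfbvol -dev IBM.2107-75ABCD1

Keeping this output with the performance data makes it possible to map volumes to extent pools, ranks, arrays, and device adapters during the analysis in the following steps.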

Note: While analysis of the SAN fabric and the DS8000 performance data can be completed in either order, SAN bottlenecks occur much less frequently than disk bottlenecks, so it is more efficient to analyze the DS8000 performance data first.
b. Use TotalStorage Productivity Center to gather DS8000 performance data for subsystem port, array, and volume. Compare the key performance indicators from Table 8-7 on page 221 with the performance data. Follow these steps to analyze the performance:
i. For those server LUNs that had poor response time, analyze the associated volumes during the same period. If the problem is on the DS8000, a correlation exists between the high response times observed on the host and the volume response times observed on the DS8000.


Note: Meaningful correlation of the host performance measurements with the previously identified hot LUNs requires analysis of the DS8000 performance data for the same time period during which the host data was collected. Refer to 8.3.1, Timestamps on page 214 for more information about time stamps.
ii. Correlate the hot LUNs with their associated disk arrays. When using the TotalStorage Productivity Center GUI, the relationships are provided automatically within the drill-down feature. If using batch exports and you want to correlate the volume data with the rank data, you can perform this correlation manually or by using the script provided in Correlate TotalStorage Productivity Center volume batch reports with rank data obtained from DSCLI. In the case of multiple ranks per extent pool and Storage Pool Striping, one volume can exist on multiple ranks.
Note: TotalStorage Productivity Center performance reports always refer to the hardware numbering scheme, which is bound to the array sites Sxy and not to the array number (DSCLI array number - 1). Refer to Example 8-2 and Figure 8-4 on page 211 for more information.
iii. Analyze the storage subsystem ports associated with the server in question.
5. Continue the identification of the root cause by collecting and analyzing SAN fabric configuration and performance data:
a. Gather the connectivity information and establish a visual diagram of the environment. If you have TotalStorage Productivity Center for Fabric, you can use the Topology Viewer to quickly create a visual representation of your SAN environment as shown in Figure 8-38 on page 249.
Note: Sophisticated tools are not necessary for creating this type of view; however, the configuration, zoning, and connectivity information must be available in order to create a logical visual representation of the environment.
b. Gather the SAN performance data. Each vendor provides SAN management applications that provide alerting and some level of performance management. Often, the performance management software is limited to real-time monitoring, and historical data collection features require additional licenses. In addition to the vendor-provided solutions, TotalStorage Productivity Center provides a component called TotalStorage Productivity Center for Fabric, which can collect the metrics that are shown in Table 8-12 on page 248.
c. Consider graphing the Overall Port Response Time and Total Port Data Rate metrics to determine if any of the ports along the I/O path are saturated during the time when the response time was degraded. If the Total Port Data Rate is close to the maximum expected throughput for the link, this link is likely a contention point. You can add additional bandwidth to mitigate this type of issue either by adding additional links or by adding faster links, which might require upgrades of the server HBAs and the DS8000 host adapter cards in order to take advantage of the additional switch link capacity.
Besides the ability to create ad hoc reports using TotalStorage Productivity Center for Fabric metrics, TotalStorage Productivity Center provides the following reports:
i. IBM Tivoli Storage Productivity Center Reporting System Reporting Fabric Switch Performance

ii. IBM Tivoli Storage Productivity Center Reporting System Reporting Fabric Top Switch Port Data Rate
iii. IBM Tivoli Storage Productivity Center Reporting System Reporting Fabric Top Switch Port Packet Rate

8.7.1 Performance analysis examples


This section provides sample performance data, analysis, and recommendations for the following performance scenarios using the process described in 8.7, End-to-end analysis of I/O performance problems on page 249. The examples highlight the key performance data appropriate for each problem type. We provide the host configuration or errors only in the cases where they are critical to determining the outcome.

DS8000 disk array bottleneck sample


The most common type of performance problem is a disk array bottleneck. Similar to other types of I/O performance problems, a disk array bottleneck usually manifests itself in high disk response time on the host. In many cases, the write response times are excellent due to cache hits, while reads often require immediate disk access.

Problem definition
The application owner complains of poor response time for transactions during certain times of the day.

Problem classification
There are no hardware errors, configuration issues, or host performance constraints.

Identification
Figure 8-43 on page 258 shows the average read response time for a Windows Server 2003 server performing a random workload in which the response time increases steadily over time.


Figure 8-43 Windows Server 2003 perfmon average physical disk read response time (chart of average disk read response time in ms for Disk1 through Disk6, 1-minute intervals, 16:55 to 18:55)

At approximately 18:39, the average read response time jumps from approximately 15 ms to 25 ms. Further investigation on the host reveals that the increase in response time correlates with an increase in load as shown in Figure 8-44.

Figure 8-44 Average Disk Reads/sec (chart of disk reads per second for Disk1 through Disk6, 1-minute intervals)


As discussed in 8.7, End-to-end analysis of I/O performance problems on page 249, there are several possibilities for high average disk read response time:
DS8000 array contention
DS8000 port contention
SAN fabric contention
Host HBA saturation
Because the most probable reason for the elevated response times is the disk utilization on the array, gather and analyze this metric first. Figure 8-45 shows the disk utilization on the DS8000.

Figure 8-45 TotalStorage Productivity Center array disk utilization (chart of disk utilization in percent for six DS8000 arrays on storage image 2107.75GB192, 5-minute intervals)

Recommend changes
We recommend adding volumes on additional disk arrays. For environments where host striping is configured, you might need to recreate the host volumes to spread the I/O of the existing workload across the new volumes.

Validate changes
Gather performance data and determine if the issue is resolved.

Hardware connectivity sample one


Connectivity issues caused by broken or damaged components in the I/O path occur infrequently, but the following example illustrates the steps required to identify and resolve these types of issues:
Define the problem
The online transactions for a Windows Server 2003 SQL Server system appear to be taking longer than normal and are timing out in certain cases.


Classify the problem
After reviewing the hardware configuration and the error reports for all hardware components, we determined that there are errors on the paths associated with one of the host HBAs, as shown in Figure 8-46 on page 260. The output shows the errors on path 0 and path 1, which are both on the same HBA (SCSI port 1). For a Windows Server 2003 server running SDD, additional information about the host adapters is available via the gethba.exe command. The command that you use to identify errors depends on the multipathing software installed.

Figure 8-46 Example of datapath query device

Identify the root cause
A further review of the switch software revealed significant errors on the switch port associated with the paths in question. A visual inspection of the environment revealed that the cable from the host to the switch was kinked.
Implement changes to resolve the problem
Replace the cable.
Validate the problem resolution
Since implementing the change, the error counts have stopped increasing, and nightly backups have completed within the backup window.

Hardware connectivity sample two


Connectivity issues caused by broken or damaged components in the I/O path occur infrequently, but the following example illustrates the steps required to identify and resolve these types of issues:
Define the problem
Users report that the data warehouse application on an AIX server is not completing jobs in a reasonable amount of time. Online transactions are also timing out.
Classify the problem
A review of the host error log shows a significant number of hardware errors. We provide an example of the errors in Figure 8-47 on page 261.


Figure 8-47 AIX error log

Identify the root cause
The IBM service support representative (SSR) ran the IBM diagnostics on the host HBA, and the card did not pass diagnostics.
Note: In cases where a path shows significant errors, you can disable that path with the multipathing software, which allows the non-working path to be disabled without causing performance degradation to the working paths. With SDD, disable the path by using datapath set device # path # offline.
Implement changes to resolve the problem
Replace the card.
Validate the problem resolution
The errors did not persist after the card was replaced and the paths were brought online.
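As an illustration only (the device and path numbers below are placeholders; use the values reported by datapath query device for the failing HBA):

# Confirm which path is accumulating errors, then take it out of service.
datapath query device 3
datapath set device 3 path 1 offline
# After the HBA or cable has been replaced, return the path to service and re-check.
datapath set device 3 path 1 online
datapath query device 3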

Performance analysis example: DS8000 port bottleneck


While DS8000 port bottlenecks do not occur often, the host ports are a component that is typically oversubscribed:
Define the problem
The production server batch runs exceed their batch window.
Classify the problem
There are no hardware errors, configuration issues, or host performance constraints.
Identify the root cause
The production server throughput diminishes significantly at approximately 18:30 daily. At the same time, development workloads running on the same DS8000 ports increase significantly. Figure 8-48 on page 262 shows the overall workload from both the production server and the development server.


Figure 8-48 Production throughput compared to development throughput (chart of total disk throughput in KB/sec for production and development disks, 1-minute intervals)

DS8000 port data reveals a peak throughput of around 300 MBps per port.

Figure 8-49 Total port data rate (chart of total MB/sec for DS8000 ports R1-I3-C4-T0 and R1-I3-C1-T0, 5-minute intervals)

Implement changes to resolve the problem
Rezone ports for production servers and development servers so that they do not use the same DS8000 ports. Add additional ports so that each server HBA is zoned to two DS8000 ports according to best practices.
Validate the problem resolution
After implementing the new zoning that separated the production server and the development server, the storage ports were no longer the bottleneck.


Performance analysis example: Server HBA bottleneck


Although rare, server HBA bottlenecks occur, usually as the result of a highly sequential workload with underconfigured HBAs. We discuss an example of the type of workload and configuration that leads to this type of problem in 10.8.6, Analyzing performance data on page 297.

8.8 TotalStorage Productivity Center for Disk in mixed environment


A benefit of IBM TotalStorage Productivity Center for Disk is the capability to analyze both Open Systems fixed block (FB) and System z count key data (CKD) workloads. When the DS8000 subsystems are attached to multiple hosts running on different platforms, Open Systems hosts might affect your System z workload, and the System z workload might affect the Open Systems workloads. If you use a mixed environment, looking at the RMF reports will not be sufficient; you also need the information about the Open Systems hosts. IBM TotalStorage Productivity Center for Disk informs you about the cache and I/O activity.
Before beginning the diagnostic process, you must understand your workload and your physical configuration. You need to know how your system resources are allocated, as well as understand your path and channel configuration for all attached servers.
Let us assume that you have an environment with a DS8000 attached to a z/OS host, an AIX System p host, and several Windows 2000 Server hosts. You have noticed that your z/OS online users experience a performance degradation between 7:30 a.m. and 8:00 a.m. each morning. You might notice that there are 3390 volumes indicating high disconnect times, or high device busy delay time for several volumes in the RMF device activity reports. Unlike UNIX or Windows 2000 Server hosts, z/OS reports the response time and its breakdown into connect, disconnect, pending, and IOS queuing (IOSQ) time.

Disconnect time is an indication of cache miss activity or destage wait (due to high persistent memory utilization) for logical disks behind the DS8000s. Device busy delay is an indication that another system has locked up a volume, or that an extent conflict has occurred among z/OS hosts, or among applications in the same host, when using Parallel Access Volumes.
The DS8000 multiple allegiance and Parallel Access Volume capabilities allow it to process multiple I/Os against the same volume at the same time. However, if a read or write request against an extent is pending while another I/O is writing to the extent, or if a write request against an extent is pending while another I/O is reading or writing data from the extent, the DS8000 delays the I/O by queuing it. This condition is referred to as an extent conflict. Queuing time due to extent conflict is accumulated into the device busy (DB) delay time. An extent is a sphere of access; the unit of increment is a track. Usually, I/O drivers or system routines decide and declare the sphere.
To determine the possible cause of high disconnect times, check the read cache hit ratios, read-to-write ratios, and bypass I/Os for those volumes. If you see that the cache hit ratio is lower than usual while you have not added other workload to your System z environment, I/Os against Open Systems fixed block volumes might be the cause of the problem. Possibly, FB volumes defined on the same server had a cache-unfriendly workload, thus impacting the hit ratio of your System z volumes. In order to get more information about cache usage, you can check the cache statistics of the FB volumes that belong to the same server. You might be able to identify the FB volumes that have a low read hit ratio and a short cache holding time.
Moving the workload of these Open Systems logical disks, or of the System z CKD volumes about which you are concerned, to the other side of the cluster, so that cache-friendly I/O workload is concentrated on one cluster, will improve the situation. If you cannot, or if the condition has not improved after this move, consider balancing the I/O distribution across more ranks, which optimizes the staging and destaging operations.
The approaches for using the data of other tools in conjunction with IBM TotalStorage Productivity Center for Disk, as described in this chapter, do not cover all of the possible situations that you will encounter. However, if you understand how to interpret the DS8000 performance reports, and you also have a good understanding of how the DS8000 works, you will be able to develop your own ideas about how to correlate the DS8000 performance reports with other performance measurement tools when approaching specific situations in your production environment.


Chapter 9. Host attachment
This chapter discusses the attachment considerations between host systems and the DS8000 series for availability and performance. Topics include:
DS8000 attachment types
Attaching Open Systems hosts
Attaching System z hosts
We provide detailed information about performance tuning considerations for specific operating systems in subsequent chapters of this book.


9.1 DS8000 host attachment


The DS8000 enterprise storage solution provides a variety of host attachments allowing exceptional performance and superior data throughput. We recommend a minimum of two connections to any host, and the connections need to be on different host adapters in different I/O enclosures. You can consolidate storage capacity and workloads for Open Systems hosts and IBM System z hosts using the following adapter types and protocols:
Fibre Channel Protocol (FCP)-attached Open Systems hosts
Fibre Channel (FCP)/Fibre Connection (FICON)-attached System z hosts
Enterprise Systems Connection Architecture (ESCON)-attached System z hosts
The DS8100 Model 921 and Turbo Model 931 support a maximum of 16 FCP/FICON host adapters and four device adapter pairs of 2 Gbps or 4 Gbps (up to 64 host ports). The DS8300 Models 922/9A2 and the Turbo Models 932/9B2 support a maximum of 32 FCP/FICON host adapters and eight device adapter pairs of 2 Gbps or 4 Gbps (up to 128 host ports). All the ports can be intermixed and independently configured. ESCON adapters are also supported. They contain only two ports or links, which means up to 32 host ports for the DS8100 and up to 64 host ports for the DS8300.
Cable host attachments that require maximum I/O port performance to host adapters that have no more than two I/O ports in use. Note that all connections on a host adapter card share bandwidth in a balanced manner. You must allocate host connections across I/O ports, host adapters, and I/O enclosures in a balanced manner (workload spreading).
The DS8000 can support host systems and remote mirroring links using Peer-to-Peer Remote Copy (PPRC) on the same I/O port. However, we recommend that you have dedicated I/O ports for remote mirroring links.
For z/OS environments with FICON host connections, use a 1:1 ratio for FICON Express2 connections (one server port to one DS8000 I/O port). For FICON Express connections, a 2:1 ratio might be acceptable (two server ports to one DS8000 I/O port), but a FICON switch will be necessary. The primary change in FICON Express4 channels compared to FICON Express2 channels is the speed of the link that connects the channel to a director or control unit (CU) port. FICON Express4 features support up to 4 Gbps (link data rate), while FICON Express2 and FICON Express features support up to 2 Gbps. For more information about DS8000 FICON support, refer to IBM System Storage DS8000 Host Systems Attachment Guide, SC26-7917, and FICON Native Implementation and Reference Guide, SG24-6266.

9.2 Attaching Open Systems hosts


This section describes the host system requirements and attachment considerations for attaching Open Systems hosts running AIX, Linux, Hewlett-Packard UNIX (HP-UX), Sun Solaris, Novell Netware, and Windows to the DS8000 series with Fibre Channel adapters.
Note: There is no direct Small Computer System Interface (SCSI) attachment support for the DS8000.


9.2.1 Fibre Channel


Fibre Channel is a 100, 200, or 400 MBps, full-duplex, serial communications technology to interconnect I/O devices and host systems that might be separated by tens of kilometers. The DS8000 supports 4, 2, and 1 Gbps connections, negotiating the link speed automatically.

Supported Fibre Channel attached hosts


For specific considerations that apply to each server platform, as well as for the most current information about supported servers (the list is updated periodically), refer to these Web sites: http://www.ibm.com/systems/support/storage/config/ssic http://www.ibm.com/systems/resources/systems_storage_disk_ds8000_interop.pdf

Fibre Channel topologies


The DS8000 architecture supports all three Fibre Channel interconnection topologies:
Direct connect
Arbitrated loop
Switched fabric
For maximum flexibility and performance, we recommend a switched fabric topology. We discuss recommendations for implementing a switched fabric in more detail in 9.2.2, SAN implementations on page 267.

9.2.2 SAN implementations


In this section, we describe a basic SAN network and how to implement it for maximum performance and availability. We show examples of a properly connected SAN network to maximize the throughput of disk I/O.

Description and characteristics of a SAN


A Storage Area Network (SAN) allows you to connect heterogeneous Open Systems servers to a high speed (4 Gbps or 2 Gbps now; 8 Gbps in the future) network, sharing storage devices, such as disk storage and tape libraries. Instead of each server having its own locally attached storage and tape drives, a SAN allows you to share centralized storage components and easily allocate storage to hosts.

SAN cabling for availability and performance


For availability, we need to connect to different adapters in different I/O enclosures whenever possible. You must use multiple Fibre Channel switches or directors to avoid a potential single point of failure. You can use inter-switch links (ISLs) for connectivity.
For performance, a general rule for host adapters is to have four host adapters, with eight host adapter ports connected to the SAN, to support a DS8000 configuration with up to 128 disk drive modules (DDMs). For a DS8000 configuration that consists of 128 - 256 DDMs, you need to have eight host adapters with 16 host adapter ports connected to the SAN. In a DS8100, you achieve eight host adapters with 16 host adapter ports connected to the SAN by having two host adapters in each of the I/O enclosures. In a DS8300, you have eight host adapters with 16 host adapter ports connected to the SAN by having a single host adapter in each of the I/O enclosures. For even larger capacity DS8000s, consider additional host adapters, up to 32 adapters for the largest DS8300 configuration.


In this largest configuration, you can support up to 128 direct connect host attachments, but the current implementation of the DS8300 allows for the installation of even more host adapters, up to 64 adapters or 256 ports. Consider these additional ports for connectivity purposes if you choose not to use Fibre Channel switched connections. Do not expect additional performance or throughput capabilities beyond the installation of 32 host adapters.

Importance of establishing zones


For Fibre Channel attachments in a SAN, it is important to establish zones to prevent interaction between host adapters. Every time that a host adapter joins the fabric, it issues a Registered State Change Notification (RSCN), which does not cross zone boundaries but affects every device or host adapter in the same zone. If a host adapter fails and starts logging in and out of the switched fabric, or a server must be rebooted several times, you do not want it to disturb the I/O to other hosts. Figure 9-1 on page 270 shows zones that only include a single host adapter and multiple DS8000 ports, which is the recommended way to create zones to prevent interaction between server host adapters.
Tip: Each zone contains a single host system adapter with the desired number of ports attached to the DS8000. By establishing zones, you reduce the possibility of interactions between system adapters in switched configurations.
You can establish the zones by using either of two zoning methods:
Port number
Worldwide port name (WWPN)
You can configure switch ports that are attached to the DS8000 in more than one zone, which enables multiple host system adapters to share access to the DS8000 host adapter ports. Shared access to a DS8000 host adapter port might be from host platforms that support a combination of bus adapter types and operating systems.
Note: A DS8000 host adapter port configured to run with the FICON topology cannot be shared in a zone with non-z/OS hosts, and ports with non-FICON topology cannot be shared in a zone with z/OS hosts.

LUN masking
In Fibre Channel attachment, logical unit number (LUN) affinity is based on the worldwide port name (WWPN) of the adapter on the host, independent of the DS8000 host adapter port to which the host is attached. This LUN masking function on the DS8000 is provided through the definition of DS8000 volume groups. A volume group is defined using the DS Storage Manager or dscli, and host WWPNs are connected to the volume group. The LUNs that are to be accessed by the hosts connected to the volume group are then defined to reside in that volume group. While it is possible to limit through which DS8000 host adapter ports a given WWPN will connect to volume groups, we recommend that you define the WWPNs to have access to all available DS8000 host adapter ports. Then, using the recommended process of creating Fibre Channel zones as discussed in Importance of establishing zones on page 268, you can limit the desired host adapter ports through the Fibre Channel zones. In a switched fabric with multiple connections to the DS8000, this concept of LUN affinity enables the host to see the same LUNs on different paths.
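As a sketch of the corresponding DSCLI definitions (the volume IDs, WWPNs, object names, and the resulting volume group ID V11 are placeholders, and the flags should be verified against your DSCLI level), first create the volume group that contains the LUNs, then define one host connection per host HBA WWPN and associate it with that volume group:

mkvolgrp -type scsimask -volume 1000-100F ITSO_host1_vg
mkhostconnect -wwname 10000000C9AABB01 -hosttype pSeries -volgrp V11 ITSO_host1_fcs0
mkhostconnect -wwname 10000000C9AABB02 -hosttype pSeries -volgrp V11 ITSO_host1_fcs1
showvolgrp V11

Here, V11 stands for the volume group ID that mkvolgrp reports when it creates the group. In line with the preceding recommendation, access to particular DS8000 I/O ports is then restricted through the switch zoning rather than in the host connection definition.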


Configuring logical disks in a SAN


In a SAN, care must be taken in planning the configuration to prevent a large number of disk device images from being presented to the attached hosts. A large number of disk devices presented to a host can cause longer failover times in cluster environments. Also, boot times can take longer, because the device discovery steps take more time. The number of times that a DS8000 logical disk is presented as a disk device to an open host depends on the number of paths from each host adapter to the DS8000. The number of paths from an open server to the DS8000 is determined by:
The number of host adapter cards installed in the server
The number of connections between the SAN switches and the DS8000
The zone definitions created by the SAN switch software
Note: Each physical path to a logical disk on the DS8000 is presented to the host operating system as a disk device.
Consider a SAN configuration as shown in Figure 9-1: The host has two connections to the SAN switches, and each SAN switch in turn has four connections to the DS8000. Zone A includes one Fibre Channel card (FC0) and two paths from SAN switch A to the DS8000. Zone B includes one Fibre Channel card (FC1) and two paths from SAN switch B to the DS8000. This host is only using four of the eight possible paths to the DS8000 in this zoning configuration. By cabling the SAN components and creating zones as shown in Figure 9-1, each logical disk on the DS8000 is presented to the host server four times, because there are four unique physical paths from the host to the DS8000. As shown in the figure, Zone A gives FC0 access through DS8000 host ports I0140 and I0310, and Zone B gives FC1 access through DS8000 host ports I0010 and I0230. So, in combination, this configuration provides four paths to each logical disk presented by the DS8000. If Zone A and Zone B were modified to include four paths each to the DS8000, the host would have a total of eight paths to the DS8000. In that case, each logical disk assigned to the host is presented as eight physical disks to the host operating system. Additional DS8000 paths are shown as connected to Switch A and Switch B, but are not in use for this example.


Figure 9-1 Zoning in a SAN environment (a host with adapters FC0 and FC1 attached to SAN switch A and SAN switch B, which connect to a DS8000 Model 921; Zone A: FC0 with DS8000 ports I0140 and I0310; Zone B: FC1 with DS8000 ports I0010 and I0230)
You can see how the number of logical devices presented to a host can increase rapidly in a SAN environment if you are not careful in selecting the size of logical disks and the number of paths from the host to the DS8000. Typically, we recommend that you cable the switches and create zones in the SAN switch software for dual-attached hosts so that each server host adapter has two to four paths from the switch to the DS8000. With hosts configured this way, you can let the multipathing module balance the load across the four host adapters in the DS8000. Zoning more paths, such as eight connections from the host to DS8000, generally does not improve SAN performance and only causes twice as many devices to be presented to the operating system.

9.2.3 Multipathing
Multipathing describes a technique that allows you to attach one host to an external storage device via more than one path. Using multipathing can improve the fault tolerance and the performance of the overall system, because the failure of a single component in the environment can be tolerated without an impact to the host, and the overall system bandwidth is increased, which positively influences the performance of the system.
As illustrated in Figure 9-2, attaching a host system using a single-path connection creates a solution with several single points of failure. In this example, a single link failure, either between the host system and the switch or between the switch and the storage system, as well as a failure of the host adapter in the host system, of the DS8000 storage system port, or of the switch itself, leads to a loss of access for the host system. Additionally, the path performance of the whole system is limited by the slowest component in the link.


Figure 9-2 SAN single-path connection (host adapter, SAN switch, and DS8000 host port I0001 each form a single point of failure on the path to the logical disk)

Adding additional paths requires multipathing software; otherwise, the operating system handles the same LUN behind each path as a separate disk, which does not allow failover support. Multipathing provides DS8000-attached Open Systems hosts running Windows, AIX, HP-UX, Sun Solaris, or Linux with:
Support for several paths per LUN.
Load balancing between multiple paths when there is more than one path from a host server to the DS8000. This approach might eliminate I/O bottlenecks that occur when many I/O operations are directed to common devices via the same I/O path, thus improving the I/O performance.
Automatic path management, failover protection, and enhanced data availability for users that have more than one path from a host server to the DS8000. It eliminates a potential single point of failure by automatically rerouting I/O operations from a failed data path to the remaining active paths.
Dynamic reconfiguration after changes to the configuration environment, including zoning, LUN masking, and adding or removing physical paths.


Figure 9-3 DS8000 multipathing implementation using two paths (host-resident multipathing module over two host adapters, two SAN switches, and DS8000 host ports I0001 and I0131 leading to the same LUN)

The DS8000 supports several multipathing implementations. Depending on the environment, host type, and operating system, only a subset of those implementations is available. This section introduces their concepts and gives general information about the implementation, usage, and specific benefits.
Note: Do not intermix several multipathing solutions within one host system; usually, the multipathing software solutions cannot coexist.

Subsystem Device Driver (SDD)


The IBM Subsystem Device Driver (SDD) software is a generic host-resident pseudo device driver that is designed to support the multipath configuration environments in the DS8000. SDD resides on the host system with the native disk device driver and manages redundant connections between the host server and the DS8000, providing enhanced performance and data availability. SDD is provided and maintained by IBM for the host operating systems AIX, Linux, HP-UX, Sun Solaris, Novell Netware, and Windows.
The Subsystem Device Driver can operate under different modes or configurations:
Concurrent data access mode: A system configuration where simultaneous access to data on common LUNs by more than one host is controlled by system application software, such as Oracle Parallel Server, or file access software that has the ability to deal with address conflicts. The LUN is not involved in access resolution.
Non-concurrent data access mode: A system configuration where there is no inherent system software control of simultaneous access to the data on a common LUN by more than one host. Therefore, access conflicts must be controlled at the LUN level by a hardware-locking facility, such as Small Computer System Interface (SCSI) Reserve/Release.


Note: Do not share LUNs among multiple hosts without the protection of Persistent Reserve (PR). If you share LUNs among hosts without PowerHA, you are exposed to data corruption situations. You must also use PR when using FlashCopy.
It is important to note that the IBM Subsystem Device Driver does not support booting from, or placing a system primary paging device on, an SDD pseudo device. For certain servers running AIX, booting off the DS8000 is supported. In that case, LUNs used for booting are manually excluded from the SDD configuration by using the querysn command to create an exclude file. You can obtain more information in querysn for multi-booting AIX off the DS8000 on page 370.
For more information about installing and using SDD, refer to IBM System Storage Multipath Subsystem Device Driver Users Guide, GC52-1309. This publication and other information are available at:
http://www.ibm.com/servers/storage/support/

SDD load balancing


SDD automatically adjusts data routing for optimum performance. Multipath load balancing of data flow prevents a single path from becoming overloaded and causing the I/O congestion that occurs when many I/O operations are directed to common devices along the same I/O path. The policy that is specified for the device determines the path that is selected for an I/O operation. The available policies are:
Load balancing (default): The path to use for an I/O operation is chosen by estimating the load on the adapter to which each path is attached. The load is a function of the number of I/O operations currently in process. If multiple paths have the same load, a path is chosen at random from those paths.
Round-robin: The path to use for each I/O operation is chosen at random from those paths not used for the last I/O operation. If a device has only two paths, SDD alternates between the two paths.
Failover only: All I/O operations for the device are sent to the same (preferred) path until the path fails because of I/O errors. Then, an alternate path is chosen for subsequent I/O operations.
Normally, path selection is performed on a global rotating basis; however, the same path is used when two sequential write operations are detected.
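The active policy is visible in the POLICY field of the datapath query device output, and it can be changed per device with datapath set device. The following is a brief sketch only; the device number is a placeholder, and the accepted policy keywords (typically df, lb, rr, and fo) can vary with the SDD level, so verify them on your system:

# Display the current policy for device 0 (POLICY field in the output).
datapath query device 0
# Switch device 0 to round-robin, then back to the default load balancing.
datapath set device 0 policy rr
datapath set device 0 policy lb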

Single path mode


SDD does not support concurrent download and installation of the Licensed Machine Code (LMC) to the DS8000 if hosts use a single-path mode. However, SDD does support a single-path Fibre Channel connection from your host system to a DS8000. It is possible to create a volume group or a vpath device with only a single path. Note: With a single-path connection, which we do not recommend, SDD cannot provide failure protection and load balancing.

Single FC adapter with multiple paths


A host system with a single Fibre Channel (FC) adapter that connects through a switch to multiple DS8000 ports is considered to have multiple Fibre Channel paths.


From an availability point of view, we discourage this configuration because of the single fiber cable from the host to the SAN switch. However, this configuration is better than a single path from the host to the DS8000, and this configuration can be useful for preparing for maintenance on the DS8000.

Path failover and online recovery


SDD can automatically and non-disruptively redirect data to an alternate data path. When a path failure occurs, the IBM SDD automatically reroutes the I/O operations from the failed path to the other remaining paths, which eliminates the possibility of a data path being a single point of failure.

SDD datapath command


SDD provides commands that you can use to display the status of the adapters that are used to manage disk devices, or to display the status of the disk devices themselves. You can also set individual paths online or offline, or set all paths connected to an adapter online or offline at one time.
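For example (a sketch only; the adapter number is a placeholder taken from the datapath query adapter output):

# Show the state and cumulative error counts of each host adapter and device.
datapath query adapter
datapath query device
# Quiesce all paths through adapter 1, for example before HBA maintenance,
# and bring them back online afterward.
datapath set adapter 1 offline
datapath set adapter 1 online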

Multipath I/O
Multipath I/O (MPIO) summarizes native multipathing technologies that are available in several operating systems, such as AIX, Linux, and Windows. Although the implementation differs for each of the operating systems, the basic concept is almost the same:
The multipathing module is delivered with the operating system.
The multipathing module supports failover and load balancing for standard SCSI devices, such as simple SCSI disks or SCSI arrays.
To add device-specific support and functions for a specific storage device, each storage vendor might provide a device-specific module implementing advanced functions for managing the specific storage device.
IBM currently provides a device-specific module for the DS8000 for AIX, Linux, and Windows according to the information in Table 9-1.
Table 9-1 Available DS8000-specific MPIO path control modules
Operating system | Multipathing solution | Device-specific module | Acronym
AIX | MPIO | SDD Path Control Module | SDDPCM
Windows | MPIO | SDD Device Specific Module | Subsystem Device Driver Device Specific Module (SDDDSM)
Linux | Device-Mapper Multipath | DM-MPIO configuration file | DM-MPIO
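Whichever module is used, verify after installation that all of the expected paths to each DS8000 LUN are present and active. As a quick sketch, each driver provides a query command for this purpose:

# AIX with SDDPCM:
pcmpath query device
# Linux with DM-MPIO:
multipath -ll
# Windows with SDDDSM (run from a command prompt on the server):
datapath query device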

External multipathing software


In addition to the SDD and MPIO solutions, third-party multipathing software is available for specific host operating systems and configurations. For example, Veritas provides an alternative to the IBM provided multipathing software. Veritas relies on the Microsoft implementation of MPIO and Device Specific Modules (DSMs) that rely on the Storport driver. The Storport driver is not available for all versions of Windows. The Veritas Dynamic MultiPathing (DMP) software is also available for Sun Solaris.


Check the System Storage Interoperation Center (SSIC) Web site for your specific hardware configuration: http://www.ibm.com/systems/support/storage/config/ssic/

9.3 Attaching IBM System z and S/390 hosts


This section describes the host system requirements and attachment considerations for attaching the IBM System z and S/390 hosts (z/OS, z/VM, z/VSE, Linux on System z, and Transaction Processing Facility (TPF)) to the DS8000 series. The attachment is either through an ESCON adapter or a FICON adapter.
Note: z/VM, z/VSE, and Linux for System z can also be attached to the DS8000 series with FCP.

9.3.1 ESCON
The ESCON adapter in the DS8000 has two ports and is intended for connection to older System z hosts that do not support FICON. For good performance and high availability, spread the ESCON adapters (refer to 2.5.2, ESCON host adapters on page 25) across all I/O enclosures. ESCON attachment provides the following configuration characteristics:
Access to only the first 16 (3390) logical control units (LCUs)
Up to 32 ESCON links for the DS8100 and 64 ESCON links for the DS8300; two per ESCON host adapter
A maximum of 64 logical paths per port or link and 256 logical paths per control unit image (or logical subsystem (LSS))
Access to all 16 LCUs (4096 CKD devices) over a single ESCON port
17 MB/s native data rate
For System z environments with ESCON attachment, it is not possible to take full advantage of the DS8000 performance capacity. When configuring for ESCON, consider these general recommendations (refer to Figure 9-4 on page 276):
Use 4-path or 8-path groups (preferably eight) between each System z host and LSS.
Plug channels for a 4-path group into four host adapters across different I/O enclosures.
Plug channels for an 8-path group into four host adapters across different I/O enclosures (using both ports per adapter) or into eight host adapters across different I/O enclosures.
One 8-path group is better than two 4-path groups. This way, the host system and the DS8000 are able to balance all of the work across the eight available paths.


Figure 9-4 DS8000 ESCON attachment (two CECs with LPARs A through D attached through two ESCON Directors to control units CU 0200-0270, LCUs 00-07, volumes 2000-27FF)

You can use ESCON cables to attach the DS8000 directly to a S/390 or System z host, or to an ESCON director, channel extender, or a dense wave division multiplexer (DWDM). ESCON cables cannot be used to connect to another DS8000, either directly or via an ESCON director or DWDM, for Remote Copy (PPRC). The maximum unrepeated distance of an ESCON link from the DS8000 to the host channel port, ESCON switch, or extender is 3 km (1.86 miles) using 62.5 micron fiber or 2 km (1.24 miles) using 50 micron fiber. The FICON bridge card in the ESCON Director 9032 Model 5 enables connections to ESCON host adapters in the storage unit. The FICON bridge architecture supports up to 16384 devices per channel.
Note: The IBM ESCON Director 9032 (including all models and features) has been withdrawn from marketing. There is no IBM replacement for the IBM 9032. Third-party vendors might be able to provide functionality similar to the IBM 9032 Model 5 ESCON Director.

9.3.2 FICON
FICON is the Fibre Connection protocol used with System z servers. Each storage unit host adapter has four ports, and each port has a unique worldwide port name (WWPN). You can configure the port to operate with the FICON upper layer protocol. When configured for FICON, the storage unit provides the following configuration characteristics:
Either fabric or point-to-point topology
A maximum of 64 host ports for DS8100 Models 921/931 and a maximum of 128 host ports for DS8300 Models 922/9A2 and 932/9B2
A maximum of 2048 logical paths on each Fibre Channel port
Access to all 255 control unit images (65280 CKD devices) over each FICON port
The connection speeds are 100 - 200 MB/s, which is similar to Fibre Channel for Open Systems. FICON channels were introduced in the IBM 9672 G5 and G6 servers with the capability to run at 1 Gbps. These channels were enhanced to FICON Express channels and then to FICON Express2 channels, and both were capable of running at transfer speeds of 2 Gbps. The fastest links currently available are FICON Express4 channels. They are designed to support 4 Gbps link speeds and can also auto-negotiate to 1 or 2 Gbps link speeds depending on the capability of the director or control unit port at the other end of the link. Operating at 4 Gbps speeds, FICON Express4 channels are designed to achieve up to 350 MBps for a mix of large sequential read and write I/O operations, as depicted in the following chart. Figure 9-5 shows a comparison of the overall throughput capabilities of various generations of channel technology.

Figure 9-5 Measurements of channel performance over several generations of channels

As you can see, the FICON Express4 channel on the IBM System z9 EC and z9 BC represents a significant improvement in maximum bandwidth capability compared to FICON Express2 channels and previous FICON offerings. The response time improvements are expected to be noticeable for large data transfers. The speed at which data moves across a 4 Gbps link is effectively 400 MBps compared to 200 MBps with a 2 Gbps link. The maximum number of I/Os per second that was measured on a FICON Express4 channel running an I/O driver benchmark with a 4 KB per I/O workload is approximately 13000, which is the same as what was measured with a FICON Express2 channel. Changing the link speed has no effect on the number of small block (4 KB per I/O) I/Os that can be processed. The greater performance capabilities of the FICON Express4 channel make it a good match with the performance characteristics of the new DS8000 host adapters.


Note: FICON Express2 SX/LX and FICON Express SX/LX are supported on System z10 and System z9 servers only if carried forward on an upgrade. The FICON Express LX feature is required to support CHPID type FCV.
The System z10 and System z9 servers offer FICON Express4 SX and LX features that have four (or two for the 2-port SX and LX features) independent channels. Each feature occupies a single I/O slot and utilizes one CHPID per channel. Each channel supports 1 Gbps, 2 Gbps, and 4 Gbps link data rates with auto-negotiation to support existing switches, directors, and storage devices.
Note: FICON Express4-2C SX/4KM LX are only available on z10 BC and z9 BC. FICON Express4 is the last feature to support 1 Gbps link data rates. Future FICON features will not support auto-negotiation to 1 Gbps link data rates.
For any generation of FICON channels, you can attach directly to a DS8000 or you can attach via a FICON-capable Fibre Channel switch. When you use a Fibre Channel/FICON host adapter to attach to FICON channels, either directly or through a switch, the port is dedicated to FICON attachment and cannot be simultaneously attached to FCP hosts. When you attach a DS8000 to FICON channels through one or more switches, the maximum number of FICON logical paths is 2048 per DS8000 host adapter port. The directors provide extremely high availability with redundant components and no single points of failure. Figure 9-6 on page 279 shows an example of FICON attachment that connects a System z server through FICON switches, using 16 FICON channel paths to eight host adapter ports on the DS8000 and addressing eight logical control units (LCUs). This channel consolidation might be possible when your host workload does not exceed the performance capabilities of the DS8000 host adapter, and it is most appropriate when connecting to the original generation of FICON channels. It is likely, again depending on your workload, that FICON Express2 channels must be configured one to one with a DS8000 host adapter port.

Figure 9-6 DS8000 FICON attachment

9.3.3 FICON configuration and performance considerations


When configuring for FICON, consider the following recommendations:
- Eight host adapters in a DS8100 with two 8-path path groups (16 FICON channels using two ports from each host adapter, spread evenly across the four I/O enclosures)
- Sixteen host adapters in a DS8300 with four 8-path path groups (32 FICON channels using two ports from each host adapter, spread evenly across the eight I/O enclosures)

For more information about DS8000 FICON support, refer to IBM System Storage DS8000 Host Systems Attachment Guide, SC26-7917, and FICON Native Implementation and Reference Guide, SG24-6266.

9.3.4 z/VM, z/VSE, and Linux on System z attachment


FICON channels in Fibre Channel Protocol (FCP) mode provide full fabric and point-to-point attachments of Fixed Block devices to the operating system images, which allows z/VM, z/VSE, and Linux on System z to access industry-standard FCP storage controllers and devices. This capability can facilitate the consolidation of UNIX server farms onto System z servers, protecting investments in Small Computer System Interface (SCSI)-based storage.

The FICON features provide support of Fibre Channel and SCSI devices in z/VM, z/VSE, and Linux on System z. The Fibre Channel Protocol (FCP) allows z/VM, z/VSE, and Linux on System z to access industry-standard SCSI devices. For disk applications, these FCP storage devices utilize Fixed Block (512-byte) sectors rather than Extended Count Key Data (ECKD) format.


Linux FCP connectivity


You can use either direct or switched attachment to attach a storage unit to a System z host system that runs SUSE SLES 8, 9, or 10, or Red Hat Enterprise Linux 4.4 and later with current maintenance updates for ESCON and FICON. FCP attachment to System z Linux systems is only through a switched-fabric configuration. You cannot attach the host through a direct configuration.


Chapter 10. Performance considerations with Windows Servers


This chapter discusses performance considerations for supported Microsoft Windows servers attached to the IBM System Storage DS8000. In the context of this chapter, the term Windows servers refers to native servers as opposed to Windows servers running as guests on VMware. You can obtain the most current list of supported Windows servers (at the time of writing this book) from the Interoperability matrix at:
http://www.ibm.com/systems/support/storage/config/ssic/displayesssearchwithoutjs.wss?start_over=yes

Disk throughput and I/O response time for any server connected to a DS8000 are affected by the workload and configuration of the server and DS8000, data layout and volume placement, connectivity characteristics, and the performance characteristics of the DS8000 itself. While the health and tuning of all of the system components affect the overall performance management and tuning of a Windows server, this chapter limits discussion to the following items:
- General Windows performance tuning
- I/O architecture overview
- Filesystem
- Volume management
- Multipathing and the port layer
- Host bus adapter (HBA) settings
- Windows Server 2008 I/O enhancements
- I/O performance measurement
- Problem determination
- Load testing


10.1 General Windows performance tuning


Windows 2000, Windows Server 2003, and Windows Server 2008 are largely self-tuning. Typically, leaving the system defaults in place is reasonable from a performance perspective. In this section, we discuss the general considerations for improving the disk throughput or response time for either a file server or a database server:
- Install enough memory on the system to provide sufficient space for database, application, and filesystem cache. As a general rule, database buffer hit ratios must be greater than 90%. Increasing the cache hit ratios is the most important tuning consideration for I/O performance of databases, because it reduces the amount of physical I/O that is required.
- Defragment drives daily if possible. The native defragment tool on Windows (defrag.exe) will not defragment files that are open. Defragmenting application, system (Paging File), or database files that are always open requires a more sophisticated defrag utility or coordinating server restarts with disk defrag activities.
- Schedule processes that are CPU-intensive, memory-intensive, or disk-intensive during after-hours operations. Examples of these processes are virus scanners, backups, and disk defragment utilities. These types of processes must be scheduled to run when the server is least active.
- Specify the server type to determine how system cache is allocated and used. You can improve the performance of file caching by optimizing the server service for the file and print server (refer to the following links for additional detail).
- Evaluate carefully the installed Windows services to determine whether they are needed for your environment or can be provided for by another server. Consider stopping, manually starting, or disabling the following services: Alerter, Clipbook Server, Computer Browser, Messenger, Network dynamic data exchange (DDE), Object Linking and Embedding (OLE) Schedule, and Spooler.
- Follow the Microsoft recommendation that large dedicated file servers or database servers are not configured as backup domain controllers (BDC), due to the overhead associated with the netlogon service.
- Set up the Windows tasking relative to your workload types. More often than not, many applications, such as SQL Server, run as background tasks, and therefore setting the background and foreground tasks to run equally can be of benefit.
- Optimize the Paging File configuration (refer to the following links for additional detail).
- Set the NTFS log file to 64 MB to reduce the frequency of NTFS log file expansion. Log file expansion is costly, because it locks the volume for the duration of the log file expansion operation. Set the NTFS log file to 64 MB through the following command line entry: chkdsk x: /L:65536 (see the example after the reference list below).

For detailed instructions about these tuning suggestions, refer to the following publications:
- Tuning IBM System x Servers for Performance, SG24-5287
- Tuning Windows Server 2003 on IBM System x Servers, REDP-3943

Also, refer to these Web sites:
http://www.microsoft.com/windowsserver2003/evaluation/performance/perfscaling.mspx
http://download.microsoft.com/download/9/c/5/9c5b2167-8017-4bae-9fde-d599bac8184a/Perf-tun-srv.docx
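As a quick illustration of the NTFS log file item in the preceding list, the standard chkdsk command can display the current log size before you change it. The drive letter X: is only a placeholder for your data volume:

   chkdsk X: /L
   chkdsk X: /L:65536

The first form, with no size specified, only reports the current NTFS log file size; the second form sets it to 64 MB (65536 KB), as recommended above.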


10.2 I/O architecture overview


At a high level, the Windows I/O architecture is similar to the I/O architecture of most Open Systems. Figure 10-1 shows a generic view of the I/O layers and examples of how they are implemented.

Figure 10-1 Windows I/O stack

In order to initiate an I/O request, an application issues an I/O request using one of the supported I/O request calls. The I/O manager receives the application I/O request and passes the I/O request packet (IRP) from the application to each of the lower layers that route the IRP to the appropriate device driver, port driver, and adapter-specific driver.

Windows server filesystems can be configured as FAT, FAT32, or NTFS. The file structure is specified for a particular partition or logical volume. A logical volume can contain one or more physical disks. All Windows volumes are managed by the Windows Logical Disk Management utility.

For additional information relating to the Windows Server 2003 and Windows Server 2008 I/O stacks and performance, refer to the following documents:
http://download.microsoft.com/download/5/6/6/5664b85a-ad06-45ec-979e-ec4887d715eb/Storport.doc
http://download.microsoft.com/download/5/b/9/5b97017b-e28a-4bae-ba48-174cf47d23cd/STO089_WH06.ppt

10.3 Windows Server 2008 I/O Manager enhancements


There are several I/O performance enhancements to the Windows Server 2008 I/O subsystem. We summarize them in this section.

I/O priorities
The Windows Server 2008 I/O subsystem provides a mechanism to specify I/O processing priorities. Windows will primarily use this mechanism to prioritize critical I/O requests over background I/O requests. API extensions exist to provide application vendors file-level I/O priority control. The prioritization code has some processing overhead and can be disabled for disks that are targeted for similar I/O activities (such as an SQL database).


I/O completion and cancellation


The Windows Server 2008 I/O subsystem provides a more efficient way to manage the initiation and completion of I/O requests, resulting in a reduced number of context switches, lower CPU utilization, and reduced overall I/O response time.

I/O request size


The maximum I/O size was increased from 64 KB per I/O request in Windows Server 2003 to 1024 KB in Windows Server 2008. For large sequential workloads, such as backups, this increase can significantly improve the disk throughput. You can obtain additional information at:
http://blogs.technet.com/askperf/archive/2008/02/07/ws2008-memory-management-dynamic-kernel-addressing-memory-priorities-and-i-o-handling.aspx

10.4 Filesystem
A filesystem is a part of the operating system that determines how files are named, stored, and organized on a volume. A filesystem manages files, folders, and the information needed to locate and access these files and folders for local or remote users.

10.4.1 Windows filesystem overview


Microsoft Windows 2000 Server, Windows Server 2003, and Windows Server 2008 all support the FAT/FAT32 filesystem and NTFS. However, we recommend using NTFS for the following reasons:
- NTFS provides considerable performance benefits by using a B-tree structure as the underlying data structure for the filesystem. This type of structure improves performance for large filesystems by minimizing the number of times that the disk is accessed, which makes it faster than FAT/FAT32.
- NTFS provides significant scalability over FAT/FAT32 in terms of maximum volume size. In theory, the maximum file size is 2^64. However, on a Windows 32-bit system using 64 KB clusters, the maximum volume size is 256 TB, and the maximum file size is 16 TB.
- NTFS provides recoverability via a journaled filesystem functionally similar to UNIX journaled filesystems.
- NTFS fully supports the Windows NT security model and supports multiple data streams. No longer is a data file a single stream of data. Additionally, under NTFS, a user can add user-defined attributes to a file.
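If you need to confirm which filesystem an existing volume uses before applying the guidelines that follow, one quick check is the standard fsutil utility (the drive letter is only a placeholder):

   fsutil fsinfo volumeinfo C:

The output includes the file system name (for example, NTFS) along with the volume flags.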

10.4.2 NTFS guidelines


Follow these guidelines:

Allocation
Format the logical volumes with 64 KB allocation units. Setting the allocation size to 64 KB improves the efficiency of the NTFS filesystem by reducing fragmentation of the filesystem and reducing the number of allocation units required for large file allocations.

Compression
While it is an easy way to reduce space on volumes, NTFS filesystem compression is not appropriate for enterprise file servers. Implementing compression places an unnecessary overhead on the CPU for all disk operations. Consider options for adding disk, near-line storage, or archiving data before seriously considering filesystem compression. In addition to causing additional CPU overhead, the I/O subsystem will not honor asynchronous I/O calls made to compressed files. Refer to the following link for additional detail:
http://support.microsoft.com/kb/156932

Defragment disks
Over time, files become fragmented in noncontiguous clusters across disks, and disk response time suffers as the disk head jumps between tracks to seek and reassemble the files when they are required. We recommend regularly defragmenting volumes.

Block alignment
For Windows 2000 Server and Windows Server 2003 servers, use diskpar.exe and diskpart.exe respectively to force sector alignment. Windows Server 2008 automatically enforces a 1 MB offset for the first sector in the partition, which negates the need for using diskpart.exe. For additional information, refer to the following documents:
http://download.microsoft.com/download/9/c/5/9c5b2167-8017-4bae-9fde-d599bac8184a/Perf-tun-srv.docx
http://support.microsoft.com/kb/929491

Note: The start sector offset must be 256 KB due to the stripe size on the DS8000. Workloads with small, random I/Os (<16 KB) will not likely experience any significant performance improvement from sector alignment on DS8000 LUNs.
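As a combined illustration of the Allocation and Block alignment guidelines on Windows Server 2003, the following sketch uses the standard diskpart and format commands. The disk number, drive letter, and volume label are placeholders for your environment, and the 256 KB alignment value reflects the DS8000 stripe size noted above:

   diskpart
   DISKPART> select disk 4
   DISKPART> create partition primary align=256
   DISKPART> assign letter=E
   DISKPART> exit

   format E: /FS:NTFS /A:64K /V:DS8K_DATA /Q

On Windows 2000 Server, use diskpar.exe instead to set the starting offset; on Windows Server 2008, the alignment step is unnecessary because the default 1 MB offset is already a multiple of 256 KB.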

10.5 Volume management


Volume managers provide an abstraction layer between the physical resources and the filesystem and allow administrators to group multiple physical resources into a single volume.

10.5.1 Microsoft Logical Disk Manager (LDM)


The LDM provides an abstraction layer between the NTFS filesystem layer and the physical storage layer. A single DS8000 LUN appears as a physical disk to Windows. Disks are created as either dynamic or basic:

Dynamic volume
A single partition that covers the entire physical disk. It can contain more than one volume represented on the system by an alphabetical letter (that is, C:, D:, E:, and so forth), or it can be grouped with other physical disks under a volume. Dynamic disks can be expanded. Microsoft recommends no more than 32 physical disks per dynamic volume. Volumes spanning more than one physical disk are referred to as spanned volumes or concatenated volumes.

Basic disk
The use of basic disks stems from legacy DOS disk partitions. Disks are divided into a primary partition and extended partitions. Basic disks cannot be expanded.

All dynamic disks contain an LDM database that keeps track of changes to the volume state and synchronizes the databases across disks for the purpose of recovery. If all dynamic disks exist on the SAN and there is an unplanned outage to the SAN disks, the LDM on the SAN disks will all be in the same state. If you have dynamic disks both locally and on the SAN, there is a high probability that the LDMs will be out of synch if you take an outage to your SAN disks only. For this reason, Microsoft recommends the following approach when configuring a system with SAN-attached disks, such as DS8000.


Use dynamic disks for SAN-attached storage and basic disks for local storage, or use basic disks for SAN-attached storage and dynamic disks for local storage. For more information about this recommendation, refer to the following article at:
http://support.microsoft.com/kb/816307

Note: Concatenated volumes provide capacity scalability but do not distribute allocation units across the physical disks. Volume space is allocated sequentially starting at the first drive in the drive set, which often leads to hot spots on certain volumes within a concatenated volume.
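If you decide to use dynamic disks for the SAN-attached DS8000 LUNs, the conversion can be performed with the standard diskpart utility. The disk number below is a placeholder, and the conversion is effectively one-way in practice, so settle the basic-versus-dynamic decision above before converting:

   diskpart
   DISKPART> select disk 4
   DISKPART> convert dynamic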

10.5.2 Microsoft LDM software RAID


The LUNs provisioned from the DS8000 will have some level of hardware RAID and will be presented to the Windows server as a single physical disk. The DS8000 provides options for configuring hardware RAID as RAID 5, RAID 10, or RAID 6. These RAID versions are explained in greater detail in 4.1, RAID levels and spares on page 42. Microsoft provides facilities for configuring software RAID, which combines multiple Windows physical disks (DS8000 LUNs) into a volume and performs an additional level of RAID across those disks. Microsoft currently supports three types of software RAID for dynamic disks:

RAID 0
RAID 0 performs a round-robin distribution of 64 KB stripes across the physical disks (DS8000 LUNs) in the dynamic volume. RAID 0 provides a mechanism for distributing the data across multiple physical disks, which typically results in an improvement of sequential I/O performance.

RAID 1
RAID 1 mirrors the 64 KB blocks across two physical disks. RAID 1 provides an additional level of reliability at the expense of doubling the number of write operations and doubling the capacity required.

RAID 5
RAID 5 provides a mechanism for spreading data across multiple physical disks. Data and parity blocks are spread across all the physical volumes in the dynamic volume. If the physical disks (DS8000 LUNs) reside on the same DS8000 rank, there is no additional availability.

Notes: RAID 0 provides no availability improvement. If the two physical disks (DS8000 LUNs) reside on the same DS8000 rank, there is no additional availability. While RAID 0 has some potential performance benefits for sequential I/O streams, the Microsoft LDM implementation does not allow a software RAID 0 volume to be extended, which makes it impractical for enterprise class servers. We do not recommend using Microsoft software RAID in conjunction with physical disks provisioned from a DS8000.

10.5.3 Veritas Volume Manager (VxVM)


While Microsoft LDM provides basic features for managing volumes, it cannot address the more sophisticated requirements that are typically demanded by some enterprise clients. Veritas Storage Foundation for Windows provides the Veritas Volume Manager (VxVM), a comprehensive solution for managing Windows server volumes.


VxVM includes the following features:
- Support of concatenated, striped (RAID 0), mirrored (RAID 1), mirrored striped (RAID 1+0), and RAID 5 volumes
- Dynamic expansion of all volume types
- Dynamic MultiPathing (DMP) as an optional component
- Support for Microsoft Cluster Service (might require additional hardware and software)
- Support for up to 256 physical disks in a dynamic volume

The Veritas Storage Foundation Administrator Guide contains additional information relating to VxVM, and you can refer to it at:
http://seer.entsupport.symantec.com/docs/286744.htm

Note: For applications requiring high sequential throughput, consider using striped volumes. Striped volumes must be extended by the number of drives in the stripe set. For example, a volume striped across a series of four physical disks will require four physical disks to be added during any extension of the striped volume. To have any performance benefit, the physical disks (DS8000 LUNs) have to reside on separate DS8000 ranks.

10.5.4 Determining volume layout


From a performance perspective, there are two general approaches to volume layout. The first approach is to spread everything everywhere. The second approach is to isolate volumes based on the nature of the workload or application. We discuss these approaches in greater detail in Chapter 5, Logical configuration performance considerations on page 63. There is nothing inherently different about a Windows server from any other distributed server that significantly alters the decision about whether to isolate or share. In either case, you can spread the volumes and files supporting the workload appropriately in order to support the performance and capacity requirements of the application. In general, there are two types of workloads that require isolation:
- Workloads that are highly sensitive to increases in I/O response time are generally better to isolate. An example of this type of application is a real-time banking application that performs synchronous updates, where any degradation in disk response time will result in delays to users at automatic teller machines (ATMs) or other interfaces.
- Applications that consistently demand high I/O throughput are generally better to isolate, because they tend to impact other applications. An example of this type of application is a large data mining or business intelligence (BI) application that performs large sequential reads of a large database.

Table 10-1 demonstrates examples of typical Windows server workloads and categorizes them as potential candidates for either shared or isolated configurations.
Table 10-1 Windows server workload matrix

Application                                      Throughput    Sensitivity    Candidate for isolation or spreading
SQL Server online transaction processing (OLTP)  < 50 MBps     Low            Spreading/Shared
SQL Server OLTP Real-time                        > 100 MBps    High           Isolation
IIS Server                                       < 10 MBps     Medium         Spreading/Shared
SQL Server Data Mining/BI                        > 200 MBps    Low            Isolation
File servers                                     > 100 MBps    Medium         Spreading/Shared

Note: Consider the applications listed in Table 10-1 as general examples only and not specific rules.

Approaches to spreading volumes for Windows servers on DS8000


The goal of spreading volumes is to increase the probability that I/Os are evenly spread across all the DS8000 resources in order to eliminate or reduce contention. In order to achieve spreading, there are two general approaches on the DS8000:
- Use host-based striping to stripe data blocks across Windows server physical disks configured on DS8000 LUNs, which in general is only practical for Windows servers implementing Veritas Storage Foundation for Windows. 10.5.3, Veritas Volume Manager (VxVM) on page 286 discusses this approach. Because the DS8000 uses a stripe size of 256 KB, the host-based stripe must be a multiple of the DS8000 stripe.
- Use the DS8000 to spread the volumes, either by Storage Pool Striping or by the Rotating Volume algorithm, as described in 5.7.1, Single-rank and multi-rank extent pools on page 92.

We discuss approaches for spreading in Workload spreading considerations on page 74.

Note: In general, there is a performance benefit from one level of striping. We recommend the most granular level available so long as it is beneficial for the workload. For random workloads, we recommend a stripe size > 256 KB. Unfortunately, in Windows LDM, the maximum stripe size is 64 KB. For most Windows server installations, the DS8000 Storage Pool Striping (SPS) provides a reliable method for distributing the I/O workload across multiple arrays.

10.6 Multipathing and the port layer


The multipathing, storage port, and adapter drivers exist in three separate logical layers; however, they function together to provide access and multipathing facilities to the DS8000. The purpose of multipathing is to provide redundancy and scalability. Redundancy is facilitated through multiple physical paths from the server to a DS8000. Scalability is implemented by allowing the server to have multiple paths to the storage and to balance the traffic across the paths. There are several methods available for configuring multipathing for Windows servers attached to an IBM DS8000:
- Windows 2000 Server: IBM Subsystem Device Driver (SDD) and Veritas DMP
- Windows Server 2003: IBM SDD, IBM Subsystem Device Driver Device Specific Module (SDDDSM), and Veritas DMP
- Windows Server 2008: IBM SDDDSM and Veritas DMP


On Windows servers, the implementations of multipathing rely on either native multipathing (Microsoft MPIO + Storport driver) or non-native multipathing and the Small Computer System Interface (SCSI) port driver or SCSIport. The following sections discuss the performance considerations for each of these implementations.

10.6.1 SCSIport scalability issues


Microsoft originally designed the SCSIport storage driver for parallel SCSI interfaces. IBM SDD and older versions of Veritas DMP still rely on it. HBA miniport device drivers compliant with the SCSIport driver have a number of performance and scalability limitations. The following list summarizes the key scalability issues with this architecture:

Adapter limits
The Microsoft SCSIport driver was originally designed for parallel SCSI interfaces and is incapable of taking full advantage of the Fibre Channel interconnect and hardware RAID provided by the DS8000. SCSIport is limited to 254 outstanding I/O requests per adapter regardless of the number of physical disks associated with the adapter.

Serialized I/O request processing
The SCSIport driver synchronizes I/O start requests and completion requests. Additionally, when the miniport (host bus adapter (HBA)) interrupt processing begins a request, it will queue any new requests until after completion of the current request. The net result is that the SCSIport driver cannot fully take advantage of the parallel processing capabilities available on modern enterprise class servers and the DS8000.

Elevated Interrupt Request Levels (IRQLs)
On Windows, system processes are prioritized from Low (0) to High (31), and the processes with the highest priorities are processed first. For Fibre Channel adapters, the SCSIport miniport drivers must perform significant processing to translate the SCSI Request Block from the SCSIport driver into a controller-specific format. The priority of this processing is at the same elevated level as all the other interrupts for that device. There is a high probability that other higher priority processes might run on the same processors as the device interrupts, which on I/O intensive systems can cause a significant queuing of interrupts, resulting in slower I/O throughput.

Data buffer processing overhead
The SCSIport driver exchanges physical address information with the miniport driver one element at a time instead of in a batch, which is inefficient, especially with large data transfers, and results in the slow processing of large requests.

I/O queue scalability limitations
SCSIport maintains a queue for devices and a queue for adapters (first in first out (FIFO), maximum queue = 254). Each I/O request must access a spinlock at each queue in order to pass the request to the next layer. SCSIport does not provide a means for managing the queues in the case of high loads. One of the possible results of this architecture is that one highly active device can dominate the adapter queue, resulting in latency for other non-busy disks.

10.6.2 Storport scalability features


In response to significant performance and scalability advances in storage technology, such as hardware RAID and high performing storage arrays, Microsoft has developed a new storage driver called Storport. The architecture and capabilities of this driver address most of the scalability limitations existing in the SCSIport driver.


Key features addressed by the Storport driver include:

Adapter limits removed
There are no adapter limits. There is a limit of 254 requests queued per device.

Improvement in I/O request processing
Storport decouples the StartIo and interrupt processing, enabling parallel processing of start and completion requests.

Improved IRQL processing
Storport provides a mechanism to perform part of the I/O request preparation work at a low priority level, reducing the number of requests queued at the same elevated priority level.

Improvement in data buffer processing
Lists of information are exchanged between the Storport driver and the miniport driver as opposed to single element exchanges.

Improved queue management
Granular queue management functions provide HBA vendors and device driver developers the ability to improve management of queued I/O requests.

For additional information about the Storport driver, refer to the following document:
http://download.microsoft.com/download/5/6/6/5664b85a-ad06-45ec-979e-ec4887d715eb/Storport.doc

10.6.3 Subsystem Device Driver


The IBM Subsystem Device Driver (SDD) provides path failover/failback processing for the Windows server attached to the IBM System Storage DS8000. SDD relies on the existing Microsoft SCSIport system-supplied port driver and HBA vendor-provided miniport driver. It also provides I/O load balancing. For each I/O request, SDD dynamically selects one of the available paths to balance the load across all possible paths. To receive the benefits of path balancing, ensure that the disk drive subsystem is configured so that there are multiple paths to each LUN, which will enable performance benefits from the SDD path balancing and also will prevent the loss of access to data in the event of a path failure. We discuss the Subsystem Device Driver in further detail in Subsystem Device Driver (SDD) on page 272.

10.6.4 Subsystem Device Driver Device Specific Module


Subsystem Device Driver Device Specific Module (SDDDSM) provides multipath I/O support based on Microsoft MPIO technology for Windows Server 2003 and Windows Server 2008 servers. A Storport-based driver is required for the Fibre Channel adapter. SDDDSM uses a device-specific module designed to provide support of specific storage arrays. The DS8000 supports most versions of Windows Server 2003 and Windows Server 2008 servers as specified at the System Storage Interoperation Center (SSIC):
http://www.ibm.com/systems/support/storage/config/ssic/displayesssearchwithoutjs.wss?start_over=yes

You can obtain additional information about SDDDSM in the SDD User's Guide:
http://www.ibm.com/support/docview.wss?rs=540&context=ST52G7&uid=ssg1S7000303

In Windows Server 2003, the MPIO drivers are provided as part of the SDDDSM package. On Windows Server 2008, they ship with the OS.

Note: For non-clustered environments, we recommend using SDDDSM for the performance and scalability improvements as previously described.
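After SDD or SDDDSM is installed, you can confirm that each DS8000 LUN is reachable over multiple paths and see how I/O is being distributed with the datapath utility that ships with both packages, run from the Subsystem Device Driver Management command window (output columns vary slightly by version):

   datapath query adapter
   datapath query device

A path that reports errors or is not available indicates a connectivity problem that can also show up as degraded or unbalanced performance.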

10.6.5 Veritas Dynamic MultiPathing (DMP) for Windows


For enterprises with significant investment in Veritas software and skills, Veritas provides an alternative to the multipathing software provided by IBM. Veritas relies on the Microsoft implementation of MPIO and Device Specific Modules (DSMs), which rely on the Storport driver. This implementation is not available for all versions of Windows. Check the System Storage Interoperation Center (SSIC) for your specific hardware configuration:
http://www.ibm.com/systems/support/storage/config/ssic/displayesssearchwithoutjs.wss?start_over=yes

10.7 Host bus adapter (HBA) settings


For each HBA, there are BIOS and driver settings that are suitable for connecting to your DS8000. Configuring these settings incorrectly can affect performance or cause the HBA to not work properly. To configure the HBA, refer to the IBM System Storage DS8000 Host Systems Attachment Guide, SC26-7917-02. This guide contains detailed procedures and recommended settings. You also need to read the readme file and manuals for the driver, BIOS, and HBA. Obtain a list of supported HBAs, firmware, and device driver information at:
http://www.ibm.com/servers/storage/support/config/hba/index.wss

Note: When configuring the HBA, we strongly recommend that you install the newest version of the driver and BIOS. Newer versions include functional enhancements and problem fixes that can improve performance and reliability, availability, and serviceability (RAS).

10.8 I/O performance measurement


Throughout this chapter, we have focused on describing the various layers in the I/O stack and the performance considerations at each layer. As anyone managing production environments knows, even when proper care and consideration have been taken during the planning and configuration phase, problems can occur in production. Problems can occur as a result of improper configuration, application changes, changes in workload patterns, workload growth, or hardware failures. In this section, we provide an overview of several tools, metrics, and techniques that are available for diagnosing I/O performance issues. Figure 10-2 demonstrates the I/O layer on the left side and an example of its corresponding implementation on the right side.


Figure 10-2 I/O layer and example implementations

At the application layer, there are application-specific tools and metrics available for monitoring and analyzing application performance on Windows servers. Application-specific objects and counters are outside the scope of this text. The I/O Manager provides a control mechanism for interacting with the lower layer devices. Many of the I/O Manager calls are monitored and recorded in Event Tracing for Windows (ETW), which is available in the Windows Performance console (perfmon). While the information provided from ETW can be excellent for problem determination, it is often complex to interpret and much too detailed for general disk performance issues, particularly in Windows 2000 Server and Windows Server 2003. The usability of ETW was improved when Microsoft provided a utility called Windows Server Performance Analyzer (SPA) that works on Windows Server 2003 servers. It provides a simple way to collect system performance statistics (volume metrics), as well as ETW information, and to process and correlate it in a user-friendly report. In Windows Server 2008, all of the functionality of SPA was incorporated into perfmon.

In an effort to appeal to the widest possible audience, we take a generic approach that can be applied across Windows servers. In this approach, we utilize the basic perfmon logging facilities to collect key PhysicalDisk and LogicalDisk counters to diagnose the existence of disk performance issues. We do not provide analysis of the ETW events, although you can analyze the ETW events with the use of SPA. We also demonstrate how to correlate the Windows physical disks to the DS8000 LUNs using the configuration information provided with IBM SDD/SDDDSM.

In this section, we discuss:
- Overview of I/O metrics
- Overview of perfmon
- Overview of logs
- Mechanics of logging
- Mechanics of exporting data
- Collecting multipath data
- Correlating the configuration and performance data
- Analyzing the performance data

10.8.1 Key I/O performance metrics


The Windows I/O subsystem retrieves application code and data from disk in response to application I/O requests. If the data is in the application buffers or the system file cache, the I/O request is returned in nanoseconds. If the data does not reside in memory, it will require access from the disk subsystem, resulting in delays of milliseconds per I/O request. Due to the relatively long processing times for I/O requests, the disk subsystem often composes the largest component of end-to-end application response time. As a result, the disk subsystem can be the single most important aspect of overall application performance. In this section, we discuss the key performance metrics available for diagnosing performance issues on Windows servers.

In Windows servers, there are two kinds of disk counters: PhysicalDisk object counters and LogicalDisk object counters. PhysicalDisk object counters are used to monitor single disks or hardware RAID arrays (DS8000 LUNs) and are enabled by default. LogicalDisk object counters are used to monitor logical disks or software RAID arrays and are enabled by default on Windows Server 2003 and Windows Server 2008 servers. In Windows 2000 Server, the logical disk performance counters are disabled by default but can be enabled by typing the command DISKPERF -ye and then restarting the server.

Tip: When attempting to analyze disk performance bottlenecks, always use physical disk counters to identify performance issues with individual DS8000 LUNs.

Table 10-2 describes the key I/O-related metrics that are reported by perfmon.
Table 10-2 Performance monitoring counters (all counters belong to the Physical Disk object)

Average Disk sec/Read: The average amount of time in seconds to complete an I/O read request. Because most I/Os complete in milliseconds, three decimal places are appropriate for viewing this metric. These results are end-to-end disk response times.
Average Disk sec/Write: The average amount of time in seconds to complete an I/O write request. Because most I/Os complete in milliseconds, three decimal places are appropriate for viewing this metric. These results are end-to-end disk response times.
Disk Reads/sec: The average number of disk reads per second during the collection interval.
Disk Writes/sec: The average number of disk writes per second during the collection interval.
Disk Read bytes/sec: The average number of bytes read per second during the collection interval.
Disk Write bytes/sec: The average number of bytes written per second during the collection interval.
Average Disk Read Queue Length: Indicates the average number of read I/O requests waiting to be serviced.
Average Disk Write Queue Length: Indicates the average number of write I/O requests waiting to be serviced.
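Before configuring a log, you can confirm which of these counters and disk instances are available on a particular server with the standard typeperf utility (generic Windows commands, not DS8000-specific):

   typeperf -q PhysicalDisk
   typeperf -qx PhysicalDisk

The first form lists the PhysicalDisk counters; the second form also lists the instances, that is, the individual physical disks (DS8000 LUNs).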

General rules
We provide the following rules based on our field experience. These rules are provided as general guidelines and do not represent service level agreements (SLAs) or service level objectives (SLOs) that have been endorsed by IBM for your DS8000. Prior to using these rules for anything specific, such as a contractual SLA, you must carefully analyze and consider these technical requirements: disk speeds, RAID format, workload variance, workload growth, measurement intervals, and acceptance of response time and throughput variance. The general rules are:
- In general, average write disk response times for Fibre Channel-based DS8000 LUNs must be between 2 and 6 ms.
- In general, average read disk response times for Fibre Channel-based DS8000 LUNs must be between 5 and 15 ms. Average values higher than the top end of these ranges indicate contention either in the fabric or in the DS8000.
- It is beneficial to look at both the I/O rates and the disk response times, because often the disks with the highest response times have extremely low I/O rates. Focus on those disks that have both high I/O rates and high response times.
- Shared storage environments are more likely to have a variance in disk response time. If your application is highly sensitive to variance in response time, you need to isolate the application at either the processor complex, device adapter (DA), or rank level.
- On average, the total I/Os to any single volume (partial rank - DS8000 LUN) must not exceed 500 IOPS, particularly if this workload is a write-intensive workload on a RAID 5 LUN. If you consistently issue more than 500 IOPS to a single LUN, look at spreading out the data in the volume across more than one physical disk, or use DS8000 Storage Pool Striping (SPS) to rotate the extents across ranks.

Note: By default, Windows provides the response time in seconds. In order to convert to milliseconds, you must multiply by 1000, which is done automatically in the perfmon-essmap.pl script that is provided in Running the scripts on page 609.

10.8.2 Windows Performance console (perfmon)


The Windows Performance console (perfmon) is one of the most valuable monitoring tools available to Windows server administrators. It is commonly used to monitor server performance and to isolate disk bottlenecks. The tool provides real-time information about server subsystem performance. It also provides the ability to log performance data to a file for later analysis. The data collection interval can be adjusted based on your requirements.

Monitoring disk performance in real time


Monitoring disk activity in real time permits you to view current disk activity on local or remote disk drives. Only the current, and not the historical, level of disk activity is shown in the chart. If you want to determine whether excessive disk activity on your system is slowing performance, log the disk activity of the desired disks to a file, over a period of time that represents typical use of your system. View the logged data in a chart and export it to a spreadsheet-readable file to see if disk activity is affecting system performance. The logging feature of the Windows Performance console makes it possible to store, append, chart, export, and analyze data captured over time. Products, such as SQL Server and Exchange, provide additional monitors that allow the Performance console to extend its usefulness beyond the operating system level. The Performance console includes two tools:
- System Monitor
- Performance Logs and Alerts
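A counter log of the disk metrics in Table 10-2 can also be created from the command line with the standard logman utility instead of the GUI. The collection name, 1-minute interval, CSV format, and output path below are only illustrative placeholders:

   logman create counter DS8K_Disk -c "\PhysicalDisk(*)\*" -si 00:01:00 -f csv -o C:\PerfLogs\DS8K_Disk
   logman start DS8K_Disk
   logman stop DS8K_Disk

Collect the log over a period that represents typical use of the system, as noted above.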


Windows Server 2003 perfmon


Figure 10-3 shows the main Performance console window for Windows Server 2003.

Figure 10-3 Main Performance console window in Windows Server 2003

The Performance console is a snap-in for Microsoft Management Console (MMC). You use the Performance console to configure the System Monitor and Performance Logs and Alerts tools. You can open the Performance console by clicking Start → Programs → Administrative Tools → Performance or by typing perfmon on the command line.

Windows Server 2008 perfmon


The Windows Server 2008 performance console (perfmon) has additional features, including these key new features:
- Data Collector Sets: Provide the ability to configure data collection templates for collection of system and trace counters.
- Resource Overview: Provides a high-level view of key system resources, including CPU% total usage, disk aggregate throughput, network bandwidth utilization, and memory hard faults/sec, which is shown in Figure 10-4 on page 296.
- Reports: Integration of SPA functionality. This feature provides the ability to quickly report on collected counter and trace data in a way that is both user friendly and provides substantial detail.


Figure 10-4 Windows Server 2008 perfmon console

As with Windows Server 2003, you can open the Performance console by clicking Start → Programs → Administrative Tools → Performance or by typing perfmon on the command line.

10.8.3 Performance log configuration and data export


With large numbers of physical disks and long collection periods required to identify certain disk bottlenecks, it is impractical to use real-time monitoring. In these cases, disk performance data can be logged for analysis over extended periods. Instructions for collecting the necessary log data and exporting the data to a spreadsheet-readable format for both Windows Server 2003 and Windows Server 2008 are located in Appendix B, Windows server performance log collection on page 571. Refer to these instructions for collecting the necessary performance data to perform analysis. The remaining sections assume that you have collected performance data.
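As a supplement to the instructions in Appendix B, note that a counter log collected in the binary (.blg) format can be converted to the spreadsheet-readable comma-separated format with the standard relog utility (file names are placeholders):

   relog C:\PerfLogs\DS8K_Disk.blg -f csv -o C:\PerfLogs\DS8K_Disk.csv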

10.8.4 Collecting configuration data


This section only pertains to those systems utilizing either SDD or SDDDSM. In the case of SDDDSM, the program name contains DSM:
1. Click Start → Programs → Subsystem Device Driver → Subsystem Device Driver Management. An MS-DOS window opens.
2. Enter datapath query essmap as shown in Figure 10-5 on page 297.


Figure 10-5 SDD datapath query essmap output

3. By default, the output shows on the display. In order to save the data, you need to redirect the output to a file. Enter datapath query essmap > $servername.essmap.txt where $servername is the actual name of the server and press Enter as shown in Example 10-1.
Example 10-1 The datapath query essmap command

C:\Program Files\IBM\Subsystem Device Driver>datapath query essmap > $servername.essmap.txt

4. Place the $servername.essmap.txt in the same directory as the performance comma-separated values (csv) file.
5. In addition to the multipathing configuration data, gather the host HBA configuration information, including but not limited to the HBA bandwidth, errors, and HBA queue settings.

10.8.5 Correlating performance and configuration data


The first step in the analysis is to correlate performance and configuration data. At this point, the performance and configuration data must already have been collected. A script is provided for correlating the SDD output with the performance output. You can find the script and instructions for running the script in Running the scripts on page 609. The script performs several functions:
- Correlates the DS8000 LUNs with the PhysicalDisks seen on the server
- Reformats the perfmon headers so that they are easier to place on charts
- Reformats the data so that it is easier to use Excel Pivot tables
- Reformats the data so that it is less likely to encounter maximum Excel column limitations
- Converts bytes to KBs
- Converts seconds to milliseconds

10.8.6 Analyzing performance data


Performance analysis can be complicated. There are many methods for arriving at the same conclusion. We refer collectively to the methods used for identifying a disk bottleneck as a methodology. We provide a methodology for analyzing disk I/O issues in 8.7, End-to-end analysis of I/O performance problems on page 249. The techniques and mechanics described in this section are based on this methodology. The approach is a top-down approach in which you start at the highest level of the I/O stack and work your way down. In the following case, we start at the Windows server volume level. We assume that there are no issues above this layer. In your actual environment, you must analyze all of the layers in the I/O stack to identify if any issues exist at higher layers.

The following case involves a Windows Server 2003 server running two 2 Gb/sec QLogic HBAs. A highly sequential read workload ran on the system. After the performance data is correlated to the DS8000 LUNs and reformatted, open the performance data file in Microsoft Excel. It looks similar to Figure 10-6.
Figure 10-6 The perfmon-essmap.pl script output (columns include date, time, subsystem, LUN serial, disk, Disk Reads/sec, Avg Read RT(ms), Avg Read Queue Length, and Read KB/sec)

Summary
Microsoft Excel provides an excellent way to summarize and analyze data via pivot tables. After you have the normalized data, you can create a pivot table. After you have created the pivot table, you can summarize the performance data by placing the disk and the LUN data in the rows section and all of the key metrics in the data section. Figure 10-7 shows the summarized data.
Figure 10-7 Summary of perfmon-sdd output (per-LUN averages of Disk Reads/sec, Avg Read RT(ms), Avg Read Queue Length, and Read KB/sec)

Observations:
- The average read response time of approximately 20 ms indicates a bottleneck.
- I/Os are spread evenly across the disks. There is not a single disk that has a bottleneck.
- The sum of the average combined read throughput is approximately 396,360 KB/sec (roughly 396 MBps), which is extremely close to the theoretical limit of two 2 Gb/sec HBAs.


Note: In order to have a meaningful summary or averaged data, you must collect the data for a period of time that reflects the goal of the collection. High response times during a backup period might not be problematic, whereas high response times during an online period might indicate problems. If the data is collected over too long of a period of time, the averages can be misleading and reflect multiple workloads.

Graph key metrics


Using pivot tables or other methods, graph the key performance metrics. In this case, the write workload was minimal, so we focus on the read metrics. Figure 10-8 displays the average Disk Read KB/sec during the analysis period.

Figure 10-8 Disk Read KB/sec (KBytes/sec per DS8000 LUN, plotted at 1-minute intervals)

Observations:
- Throughput peaked early during the measurement period and resulted in a horizontal line at 400,000 KB/sec. It appears that this workload's throughput was limited from the beginning of the collection period.
- We know from previously gathered data that the system has two HBAs with a 2 Gb/sec capacity, which equates to roughly 200 MBps for each adapter.

Recommendations
Confirm that the other SAN fabric components and the DS8000 host adapters are 4 Gbps. If they are 2 Gb/sec, replacing the current 2 Gb/sec cards with 4 Gbps cards will increase the bandwidth available to the host for additional throughput.


Removing disk bottlenecks


After you have detected a disk bottleneck, you might perform several of these actions:
- If the disk bottleneck is a result of another user in the shared environment causing disk contention, request a LUN on a less utilized rank and migrate data files from the current physical disk to the new physical disk.
- If the disk bottleneck is caused by too much load generated from the Windows server to a single DS8000 LUN, spread the I/O activity across more DS8000 ranks, which might require the allocation of additional LUNs. If host-based striping is not practical, consider using Storage Pool Striping LUNs and migrate the current volume's files to multiple additional files spread across the DS8000.
- On Windows Server 2008, for sequential workloads with large transfer sizes (256 KB), consider host-based striping with a transfer size that is at least as big as the DS8000 stripe size (256 KB). Ideally, the host-based stripe size needs to be at least four times the size of the DS8000 stripe size. Because Windows Server 2003 limits the I/O transfer size on the host to 64 KB, this suggestion is impractical there; however, based on internal tests, there appears to be a benefit to host-based striping on Windows Server 2003 for sequential workloads.
- Use faster disks: 15000 rpm arrays instead of 10000 rpm arrays on the DS8000. Using faster disks means migrating the data to another rank when incorporating the new disks.
- Because the DS8000 arrays can be RAID 5 and RAID 10, swap RAID arrays from one to the other and run a new set of measurements. For example, if the database workload activity is mostly sequential write operations, using RAID 5 can improve the performance. Follow this list:
  - Sequential reads and writes work fine on RAID 5.
  - High I/O random writes work better on RAID 10.
  - Low I/O random reads and writes can work as well on RAID 5 as on RAID 10.
- Off-load processing to another system in the network (either users, applications, or services).
- Add more RAM. Adding memory increases the system memory disk cache, which might reduce the number of required physical I/Os, indirectly reducing disk response times.
- If the problem is a result of a lack of bandwidth on the HBAs, install additional HBAs to provide more bandwidth to the DS8000.

For more information about Windows server disk subsystem tuning, refer to the following document:
http://www.microsoft.com/whdc/archive/subsys_perf.mspx

10.8.7 Windows Server Performance Analyzer


Another approach to performing disk performance analysis for Windows Server 2003 servers is to utilize the Windows Server Performance Analyzer (SPA). SPA is a data collection and post-processing tool that correlates and reports on Windows Performance Monitor (perfmon) logs and Event Tracing for Windows (ETW). ETW provides a means for monitoring and reporting on process-level file I/O on a physical disk level. While this technique does not apply to Windows 2000 servers, it maps directly to the functionality provided in the Windows Server 2008 version of perfmon. The logical activities for this approach are:
1. Download and install SPA.
2. Collect data.
3. Produce a report.
4. Analyze the report.

At the time of the writing of this book, there was an excellent guide about using SPA to perform disk performance diagnosis available at the following link:
http://www.codeplex.com/PerfTesting/Wiki/View.aspx?title=How%20To%3a%20Identify%20a%20Disk%20Performance%20Bottleneck%20Using%20SPA1&referringTitle=How%20Tos

10.9 Task Manager


In addition to the Performance console, Windows also includes the Task Manager, a utility that allows you to view the status of processes and applications and gives you real-time information about memory usage.

10.9.1 Starting Task Manager


You can run Task Manager using any one of the following three methods:
- Right-click a blank area of the task bar and select Task Manager.
- Press Ctrl+Alt+Delete and click Task Manager.
- Press Ctrl+Shift+Esc.

Figure 10-9 shows that Task Manager has three views: Applications, Processes, and Performance. The latter two views are of interest to us in this discussion.

Processes tab
In Figure 10-9, you can see the resources being consumed by each of the processes currently running. You can click the column headings to change the sort order, which will be based on that column.

Figure 10-9 Windows Task Manager Processes tab


Click View → Select Columns. This selection displays the window shown in Figure 10-10, from which you can select additional data to be displayed for each process.

Figure 10-10 Select columns for the Processes view

Table 10-3 shows the columns available in the Windows Server 2003 operating system that are related to disk I/O.
Table 10-3 Task Manager disk-related columns

Paged Pool: The paged pool (user memory) usage of each process. The paged pool is virtual memory available to be paged to disk. It includes all of the user memory and a portion of the system memory.
Non-Paged Pool: The amount of memory reserved as system memory and not pageable for this process.
Base Priority: The process's base priority level (low/normal/high). You can change the process's base priority by right-clicking it and selecting Set Priority. This option remains in effect until the process stops.
I/O Reads: The number of read input/output (file, network, and disk device) operations generated by the process.
I/O Read bytes: The number of bytes read in input/output (file, network, and disk device) operations generated by the process.
I/O Writes: The number of write input/output operations (file, network, and disk device) generated by the process.
I/O Write bytes: The number of bytes written in input/output operations (file, network, and device) generated by the process.
I/O Other: The number of input/output operations generated by the process that are neither reads nor writes (for example, a control type of operation).
I/O Other bytes: The number of bytes transferred in input/output operations generated by the process that are neither reads nor writes (for example, a control type of operation).


Performance tab
The Performance view shows you performance indicators, as shown in Figure 10-11.

Figure 10-11 Task Manager Performance view

The charts show you the CPU and memory usage of the system as a whole. The bar charts on the left show the instantaneous values, and the line graphs on the right show the history since Task Manager was started. The four sets of numbers under the charts are:
- Totals:
  - Handles: Current total handles of the system
  - Threads: Current total threads of the system
  - Processes: Current total processes of the system
- Physical Memory (K):
  - Total: Total RAM installed (in KB)
  - Available: Total RAM available to processes (in KB)
  - File Cache: Total RAM released to the file cache on demand (in KB)
- Commit Charge (K):
  - Total: Total amount of virtual memory in use by all processes (in KB)
  - Limit: Total amount of virtual memory (in KB) that can be committed to all processes without adjusting the size of the paging file
  - Peak: Maximum virtual memory used in the session (in KB)
- Kernel Memory (K):
  - Total: Sum of paged and non-paged kernel memory (in KB)
  - Paged: Size of paged pool allocated to the operating system (in KB)
  - Non-paged: Size of non-paged pool allocated to the operating system (in KB)


10.10 I/O load testing


There are a variety of reasons for conducting I/O load tests. They all start with a hypothesis and have well-defined performance requirements. The objective of the test is to determine whether the hypothesis is true or false. For example, a hypothesis might be that a DS8000 with 18 disk arrays and 128 GB of cache can support 10,000 IOs/sec with a 70/30/50 workload and these response time requirements:
- Read response times: 95th percentile < 10 ms
- Write response times: 95th percentile < 5 ms
The generic steps for testing are:
1. Define the hypothesis.
2. Simulate the workload using an artificial or actual workload.
3. Measure the workload.
4. Compare the workload measurements with the objectives.
5. If the results support your hypothesis, publish the results and make recommendations. If the results do not support your hypothesis, determine why and make adjustments.

The following section provides several examples of types of tests related to I/O performance.

10.10.1 Types of tests


The following examples of tests might be appropriate for a Windows environment:
- Pre-deployment hardware validation. Ensure that the operating system, multipathing, and HBA drivers are at the latest levels and supported. Prior to deployment of any solution, and especially a complex solution, such as Microsoft cluster servers, ensure that the configuration is supported. Check the interoperability site:
  http://www.ibm.com/systems/support/storage/config/ssic/displayesssearchwithoutjs.wss?start_over=yes
- Application-specific requirements. Often, we receive inquiries about DS8000 and Microsoft Exchange. The Exchange workloads vary significantly depending on the version of Exchange deployed. For example, Microsoft Exchange 2007 significantly reduces the I/O throughput requirements as compared with previous versions of Microsoft Exchange. We suggest testing with the Microsoft-provided tools (Jetstress) to validate and test I/O throughput requirements in your environment. The typical Microsoft Exchange workload is 50% reads/50% writes. This type of workload is extremely I/O intensive on RAID 5 devices. Therefore, we recommend ensuring that an adequate number of physical disks are available to support the workload. For additional information about Jetstress, refer to:
  http://www.msexchange.org/tutorials/Disk-Performance-Testing-Jetstress-2007.html
  http://technet.microsoft.com/en-us/library/bb643093.aspx
  Note: For extremely high performance requirements, particularly where mailbox sizes are small (for example, 50 MB per user), we recommend investigating a RAID 10 configuration for maximum throughput.
- Troubleshoot performance problems. Often, performance problems are challenging to diagnose. In these cases, it can be beneficial to separate the performance characteristics of the I/O subsystem from the performance characteristics of the application in question. You can utilize a synthetic workload (refer to 10.10.2, Iometer on page 305). By using a
synthetic workload, you can validate the performance characteristics of the I/O subsystem without the added complication of the application. For example, if an application owner states that the I/O subsystem is under-performing, take the following steps:
  a. Gather the I/O workload characteristics of the application (reads/writes and sequential/random).
  b. Conduct a load test using a synthetic load tool.
  c. Gather measurements.
  d. Analyze the data.
  e. Compare the results to the hypothesis: does the I/O subsystem perform reasonably, or is it under-performing?
- Perform various what-if scenarios. Often, it is necessary to understand the implications of major future changes to the environment or workload. In these types of situations, you perform the same steps as described previously in 10.10, I/O load testing on page 304.

10.10.2 Iometer
Iometer is an I/O subsystem measurement and characterization tool for single and clustered
systems. Formerly, Iometer was owned by Intel Corporation, but Intel has discontinued work on Iometer, and it was given to the Open Source Development Lab. For more information about Iometer, go to:
http://www.iometer.org/
Iometer is both a workload generator (it performs I/O operations in order to stress the system) and a measurement tool (it examines and records the performance of its I/O operations and their impact on the system). It can be configured to emulate the disk or network I/O load of any program or benchmark, or it can be used to generate entirely synthetic I/O loads. It can generate and measure loads on single or multiple (networked) systems.
Iometer can be used for the measurement and characterization of:
- Performance of disk and network controllers
- Bandwidth and latency capabilities of buses
- Network throughput to attached drives
- Shared bus performance
- System-level hard drive performance
- System-level network performance
Iometer consists of two programs, Iometer and Dynamo:
- Iometer is the controlling program. Using Iometer's graphical user interface, you configure the workload, set operating parameters, and start and stop tests. Iometer tells Dynamo what to do, collects the resulting data, and summarizes the results in output files. Only run one copy of Iometer at a time. It is typically run on the server machine.
- Dynamo is the workload generator. It has no user interface. At Iometer's command, Dynamo performs I/O operations, records performance information, and then returns the data to Iometer. There can be more than one copy of Dynamo running at a time. Typically, one copy runs on the server machine, and one additional copy runs on each client machine. Dynamo is multi-threaded; each copy can simulate the workload of multiple client programs. Each running copy of Dynamo is called a manager. Each thread within a copy of Dynamo is called a worker.


Iometer provides the ability to configure:
- Read/write ratios
- Sequential/random access
- Arrival rate and queue depth
- Blocksize
- Number of concurrent streams
With these configuration settings, you can simulate and test most types of workloads. Specify the workload characteristics to reflect the workload in your environment.


Chapter 11. Performance considerations with UNIX servers


This chapter discusses performance considerations for attaching the DS8000 to several of the supported UNIX operating systems. In this chapter, we discuss:
- Planning and preparing UNIX servers for performance
- UNIX disk I/O architecture
- AIX disk I/O components
- AIX performance monitoring tools
- Solaris disk I/O components
- Solaris performance monitoring tools
- Hewlett-Packard UNIX (HP-UX) disk I/O components
- HP-UX performance monitoring tools
- Subsystem Device Driver (SDD) commands for AIX, HP-UX, and Solaris
- Testing and verifying DS8000 storage


11.1 Planning and preparing UNIX servers for performance


Planning and configuring a UNIX system for performance is never a simple task. There are numerous factors to take into account before tuning parameters and deciding on the ideal settings. Consider the following factors:
- Type of application: There are thousands of applications available on the market, but it is possible to group them into a few types, based on their I/O profile. The I/O profile helps you decide the best configuration at the operating system level. For more details about identifying and classifying applications, consult Logical configuration performance considerations on page 63.
- Platform and version of operating system: Although, in general terms, they have the same performance characteristics, there can be differences in how each operating system implements these functions. In addition, newer versions or releases of an operating system bring performance parameters with optimized default values for certain workload types.
- Type of environment: Another significant factor is whether the environment is for production, testing, or development. Normally, production environments are hosted on servers with several CPUs, dozens of gigabytes of memory, and terabytes of disk space. They demand high levels of availability and therefore are difficult to schedule for downtime; hence, the need for a more detailed plan. Quality assurance (or testing) environments are normally smaller in size, but they need to sustain the performance tests. Development environments are typically much smaller than their respective production environments, and normally performance is not a concern.
Before planning for performance, validate the configuration of your environment. Refer to the System Storage Interoperation Center (SSIC) at:
http://www.ibm.com/systems/support/storage/config/ssic/index.jsp
The DS8000 interoperability matrix provides the dependencies among firmware levels, operating system patch levels, and adapter models:
http://www.ibm.com/servers/storage/disk/ds8000/interop.html
Also, for host bus adapter (HBA) interoperability, visit:
http://www-03.ibm.com/systems/support/storage/config/hba/index.wss
Check the IBM Support Web site to download the latest version of firmware for the Fibre Channel (FC) adapters for AIX servers:
http://www14.software.ibm.com/webapp/set2/firmware/gjsn
Download the latest fix packs according to the AIX version from the following site:
http://www-933.ibm.com/eserver/support/fixes/fixcentral/main/pseries/aix
Solaris 8, Solaris 9, and Solaris 10 require patches to ensure that the host and DS8000 function correctly. Refer to the following Web site for the most current list of Solaris-SPARC patches and the Solaris-x86 patch for Solaris 8, Solaris 9, and Solaris 10:
http://sunsolve.sun.com/show.do?target=patchpage
For HP-UX servers, download device drivers from:
http://www.hp.com/country/us/eng/support.html


Apply the recommended patches. Go to the required patches Web page for additional information about how to download and install them:
http://www.hp.com/products1/unix/java/patches/index.html
Also, always consult the IBM System Storage DS8000 Host System Attachment Guide, SC26-7917-02, for detailed information about how to attach and configure a host system to a DS8000:
http://www-01.ibm.com/support/docview.wss?rs=1114&context=HW2B2&dc=DA400&q1=ssg1*&uid=ssg1S7001161&loc=en_US&cs=utf-8&lang=en

11.1.1 UNIX disk I/O architecture


It is fundamental to understand the UNIX I/O subsystem to adequately tune your system. The I/O subsystem can be represented by a set of layers. Figure 11-1 provides an overview of those layers:

Figure 11-1 UNIX disk I/O architecture

I/O requests normally go through these layers:
- Application/database layer: This layer is the top-level layer where many of the I/O requests start. Each application generates I/Os that follow a pattern or profile. The characteristics that compose an application's I/O profile are:
  - IOPS: The number of I/Os (reads and writes) per second.
  - Throughput: How much data is transferred in a given sample time. Typically, the throughput is measured in MB/s or KB/s.
  - I/O size: The result of MB/s divided by IOPS.
  - Read ratio: The percentage of I/O reads compared to the total of I/Os.
  - Disk space: The total amount of disk space needed by the application.
- I/O system calls layer: Through the system calls provided by the operating system, the application issues I/O requests to the storage. By default, all I/O operations are synchronous. Many operating systems also provide asynchronous I/O, which is a facility that allows an application to overlap processing time while it issues I/Os to the storage. Typically, databases take advantage of this feature.
- Filesystem layer: The filesystem is the operating system's way to manipulate data in the form of files. Many filesystems support buffered and unbuffered I/Os. If your application has its own caching mechanism and supports a type of direct I/O, we
recommend that you enable it, because it avoids double-buffering and reduces CPU utilization. Otherwise, your application can take advantage of features such as file caching, read-ahead, and write-behind.
- Volume manager layer: A volume manager is a key component to distribute the I/O workload over the logical unit numbers (LUNs) of the DS8000. You need to understand how volume managers work to combine strategies of spreading the workload at the storage level with spreading the workload at the operating system level, and consequently maximize the I/O performance of the database.
- Multipathing/disk layer: Today, there are several multipathing solutions available: hardware multipathing, software multipathing, and operating system multipathing. It is usually better to adopt the operating system multipathing solution. However, depending on the environment, you might face limitations and prefer to use a hardware or software multipathing solution. Always try not to exceed a maximum of four paths for each LUN unless required.
- FC adapter layer: The need to make configuration changes in the FC adapters depends on the operating system and vendor model. Always consult the DS8000 Host Attachment Guide for specific instructions about how to set up the FC adapters. Also, check the compatibility matrix for dependencies among the firmware level, operating system patch levels, and adapter models.
- Fabric layer: The Storage Area Network (SAN) is used to interconnect storage devices and servers.
- Array layer: The array layer is the DS8000 in our case.
Normally, in each of these layers there are performance indicators that enable you to assess how that particular layer impacts performance.

Typical performance indicators


The typical performance indicators are:
- The first performance indicator used to assess whether there is an I/O bottleneck is the wait I/O time (wio). It is essential to realize that the wio is calculated differently depending on the operating system:
  - In AIX, up to Version 4.3.2, if the CPU was not busy in user or system mode and there was any disk I/O in progress, the wio counter was incremented. Starting with AIX Version 4.3.3, if the CPU was not busy in user or system mode and there was outstanding I/O started by that CPU, the wio counter was incremented. In this way, systems with several CPUs have a less inflated wio. Moreover, even in AIX Version 4.3.3, wio from filesystems mounted via the Network File System (NFS) is also recorded.
  - In Solaris, wio is calculated with the idle time. In addition, the wio counter in Solaris is incremented by disk I/O. It is also incremented by I/Os from filesystems and Small Computer System Interface (SCSI) tape devices.
  For more details about the wait I/O time in AIX, consult the following link:
  http://publib.boulder.ibm.com/infocenter/systems/scope/aix/topic/com.ibm.aix.prftungd/doc/prftungd/wait_io_time_reporting.htm?tocNode=int_138909
  For more details about the wait I/O time in Solaris, consult the following link:
  http://sunsite.uakom.sk/sunworldonline/swol-08-1997/swol-08-insidesolaris.html


- The wio might be an indication that there is a disk I/O bottleneck, but it is not enough to assume from the wio alone that there is a disk I/O constraint. We must observe other counters, such as the blocked processes in the kernel threads column and the statistics generated by iostat or an equivalent tool.
- Disk technology has evolved significantly. In the past, disks were only capable of 120 I/Os per second and had no cache memory. Consequently, utilization levels of 10 to 30% were considered extremely high. Today, with arrays of the DS8000 class (supporting tens or hundreds of gigabytes of cache memory and hundreds of physical disks at the back end), even utilization levels above 80 or 90% still might not indicate an I/O performance problem. It is fundamental to check the queue length, the service time, and the I/O size averages being reported in the disk statistics:
  - If the queue length and the service time are low, there is no performance problem.
  - If the queue length is low and the service time and the I/O size are high, it is also not evidence of a performance problem.
- Performance thresholds: They might only indicate that something has changed in the system. However, they are not able to say why or how it has changed. Only a good interpretation of the data is able to answer these types of questions. Here is a real case: a user was complaining that a transactional system had a performance degradation of 12% in the average transaction response time. The Database Administrator (DBA) argued that the database spent a good part of the time in disk I/O operations. At the operating system and storage level, all performance indicators were excellent. The only curious fact was that the system had a cache hit ratio of 80%, which did not make much sense, because the access characteristic of a transactional system is random. High cache hit levels indicated that somehow the system was reading sequentially. By analyzing the application, it was discovered that about 30% of the database workload was related to eleven queries. These eleven queries were accessing the tables without the help of an index. The fix was the creation of those indexes with specific fields to optimize the access to disk.

11.2 AIX disk I/O components


Since AIX Version 6, a set of tunable parameters from six tuning commands (vmo, ioo, schedo, raso, no, and nfso) is preset with default values optimized for most types of workloads, and these tunable parameters are classified as restricted tunables. Only change them if instructed to do so by IBM support. Refer to the IBM AIX Version 6.1 Differences Guide, SG24-7559, section 6.2 Restricted Tunables, for additional information:
http://www.redbooks.ibm.com/abstracts/sg247559.html?Open
Consequently, with AIX 6 (and the DS8000), you just need to use the default parameters and install the filesets for multipathing and host attachment, which already provide basic performance defaults for queue length and SCSI timeout. For additional information about setting up the volume layout, refer to 11.2.4, IBM Logical Volume Manager (LVM) on page 316.
For AIX 5.3 or below, the recommendations in the rest of 11.2, AIX disk I/O components on page 311 apply in most cases. If you want additional information about the tuning commands and their parameters for a specific configuration, consult the AIX 5.3 documentation for further details:
http://publib.boulder.ibm.com/infocenter/pseries/v5r3/index.jsp?topic=/com.ibm.aix.cmds/doc/aixcmds3/ioo.htm

For a complete discussion of AIX tuning, refer to the following links:
- AIX 6.1 Performance Tuning Manual, SC23-5253-00:
  http://publib.boulder.ibm.com/infocenter/systems/scope/aix/topic/com.ibm.aix.prftungd/doc/prftungd/file_sys_perf.htm?tocNode=int_214554
- AIX 5.3 Performance Tuning Manual, SC23-4905-04:
  http://publib.boulder.ibm.com/infocenter/pseries/v5r3/topic/com.ibm.aix.prftungd/doc/prftungd/file_sys_perf.htm
- The Performance Tuning Manual discusses the relationships between the Virtual Memory Manager (VMM) and the buffers used by the filesystems and the Logical Volume Manager (LVM):
  http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.prftungd/doc/prftungd/vmm_page_replace_tuning.htm
- A paper providing tuning recommendations for Oracle on AIX:
  http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/WP100883
- A paper discussing the setup and tuning of Direct I/O with SAS 9:
  http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/WP100890
- A paper describing the performance improvement of an Oracle database with concurrent I/O (CIO):
  http://www-03.ibm.com/systems/resources/systems_p_os_aix_whitepapers_db_perf_aix.pdf
- A paper discussing how to optimize Sybase ASE on AIX:
  http://www.ibm.com/servers/enable/site/peducation/wp/b78a/b78a.pdf

11.2.1 AIX Journaled File System (JFS) and Journaled File System 2 (JFS2)
JFS and JFS2 are the AIX standard filesystems. JFS was created for 32-bit kernels. Both implement the concept of a transactional filesystem, where all of the I/O operations on metadata information are kept in a log. The practical impact is that in the case of a recovery of a filesystem, the fsck command looks at that log to see which I/O operations were completed and rolls back only those operations that were not completed. Of course, from a performance point of view, there is overhead. However, it is generally an acceptable compromise to ensure the recovery of a corrupted filesystem.
The JFS file organization method is a linear algorithm. You can mount JFS filesystems with the Direct I/O option. You can adjust the mechanisms of sequential read ahead, sequential and random write behind, delayed write operations, and others. You can tune its buffers to increase performance. It also supports asynchronous I/O.
JFS2 was created for 64-bit kernels. Its file organization method is a B+ tree algorithm. It supports all the features described for JFS, with the exception of delayed write operations. It also supports concurrent I/O (CIO).

AIX filesystem caching


In AIX, you can limit the amount of memory used for filesystem cache and the behavior of the page replacement algorithm. The parameters to be configured and their recommended values are:
- minperm% = 3
- maxperm% = 90
- maxclient% = 90
- strict_maxperm = 0
- strict_maxclient = 1
- lru_file_repage = 0
- lru_poll_interval = 10
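These are VMM tunables that are changed with the vmo command on AIX 5.3 (on AIX 6, most of them are restricted tunables that are best left at their defaults). A minimal sketch, assuming an AIX 5.3 host and that you want the values to persist across reboots:

# Filesystem cache and page replacement tunables (AIX 5.3); -p makes them permanent.
vmo -p -o minperm%=3 -o maxperm%=90 -o maxclient%=90
vmo -p -o strict_maxperm=0 -o strict_maxclient=1
vmo -p -o lru_file_repage=0 -o lru_poll_interval=10

# Verify the resulting values.
vmo -a | grep -E "perm%|maxclient%|lru_file_repage|lru_poll_interval"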

File System I/O buffers for AIX 5.3


AIX keeps track of I/O to disk using pinned memory buffers (pbufs). In AIX 5.3, these buffers are controlled by the following system-wide tuning parameters and suggested values:
- numfsbufs = 1568
- nfs_v2_vm_bufs = 1000
- nfs_v3_vm_bufs = 1000
- nfs_v4_vm_bufs = 1000
- j2_nBufferPerPagerDevice = 2048
- j2_dynamicBufferPreallocation = 16
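The numfsbufs and j2_* values are ioo tunables (the nfs_v*_vm_bufs values are NFS buffer tunables that are set separately). A minimal sketch for AIX 5.3; note that new filesystem buffer values only take effect for a filesystem when it is remounted:

# Filesystem I/O buffer tunables (AIX 5.3); -p makes the change permanent.
ioo -p -o numfsbufs=1568
ioo -p -o j2_nBufferPerPagerDevice=2048 -o j2_dynamicBufferPreallocation=16

# Remount the affected filesystems (or reboot) so that the new buffer counts are used.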

Read ahead
JFS and JFS2 have read ahead algorithms that can be configured to buffer data for sequential reads into the filesystem cache before the application requests it. Ideally, this feature reduces the percent of I/O wait (%iowait) and increases I/O throughput as seen from the operating system. Configuring the read ahead algorithms too aggressively will result in unnecessary I/O. The VMM tunable parameters that control read ahead behavior are:
- For JFS:
  - minpgahead = max(2, <application's blocksize> / <filesystem's blocksize>)
  - maxpgahead = max(256, (<application's blocksize> / <filesystem's blocksize>) * <application's read ahead block count>)
- For JFS2:
  - j2_minPgReadAhead = max(2, <application's blocksize> / <filesystem's blocksize>)
  - j2_maxPgReadAhead = max(256, (<application's blocksize> / <filesystem's blocksize>) * <application's read ahead block count>)
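As a worked example of these formulas (the application values are hypothetical), assume an application that reads 64 KB at a time from a JFS2 filesystem with a 4 KB block size and reads ahead 16 blocks: 64 / 4 = 16, so j2_minPgReadAhead = max(2, 16) = 16 and j2_maxPgReadAhead = max(256, 16 * 16) = 256. On AIX 5.3, the values are set with ioo:

# Hypothetical read ahead settings for a 64 KB sequential reader on a
# 4 KB block size JFS2 filesystem.
ioo -p -o j2_minPgReadAhead=16 -o j2_maxPgReadAhead=256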

I/O pacing
The purpose of I/O pacing is to manage concurrency to files and segments by limiting the CPU resources for processes that exceed a specified number of pending write I/Os to a discrete file or segment. When a process exceeds the maxpout limit (high-water mark), it is put to sleep until the number of pending write I/Os to the file or segment is less than minpout (low-water mark). This pacing allows another process to access the file or segment.
Disabling I/O pacing (the default) improves backup times and sequential throughput. Enabling I/O pacing ensures that no single process dominates the access to a file or segment. Typically, we recommend leaving I/O pacing disabled. There are certain circumstances where it is appropriate to have I/O pacing enabled:
- For HACMP, we recommend enabling I/O pacing to ensure that heartbeat activities complete. If you enable it, start with settings of maxpout=321 and minpout=240.
- Beginning with AIX 5.3, I/O pacing can be enabled at the filesystem level with mount command options.
- In AIX Version 6, I/O pacing is technically enabled but with such high settings that it will not become active except under extreme situations.
In summary, enabling I/O pacing improves user response time at the expense of throughput.
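A minimal sketch of enabling I/O pacing (the mount point is hypothetical): system-wide pacing is an attribute of the sys0 device, and, beginning with AIX 5.3, pacing can also be set per filesystem as mount options:

# Enable system-wide I/O pacing with the HACMP starting values mentioned above.
chdev -l sys0 -a maxpout=321 -a minpout=240

# AIX 5.3 and later: enable pacing for a single filesystem at mount time
# (/data01 is a hypothetical mount point).
mount -o maxpout=321,minpout=240 /data01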


Write behind
This parameter enables the operating system to initiate I/O that is normally controlled by the syncd. Writes are triggered when a specified number of sequential 16 KB clusters are updated:
- Sequential write behind:
  - numclust for JFS
  - j2_nPagesPerWriteBehindCluster and j2_nRandomCluster for JFS2
- Random write behind:
  - maxrandwrt for JFS
  - j2_maxRandomWrite for JFS2
Note that setting j2_nPagesPerWriteBehindCluster to 0 disables JFS2 sequential write behind, and setting j2_maxRandomWrite to 0 also disables JFS2 random write behind.

Mount options
Use release behind mount options when appropriate:
- The release behind mount options can reduce syncd and lrud overhead. They modify the filesystem behavior in such a way that it does not maintain data in the JFS2 cache. You use these options if you know that data going into or out of certain filesystems will not be requested again by the application before the data is likely to be paged out. Therefore, the lrud daemon has less work to do to free up cache, and any syncd overhead for this filesystem is eliminated. One example of a situation where you can use these options is a Tivoli Storage Manager server with disk storage pools in filesystems, where you have configured the read ahead mechanism to increase the throughput of data, especially when a migration takes place from disk storage pools to tape storage pools. The options are:
  - -rbr for release behind after a read
  - -rbw for release behind after a write
  - -rbrw for release behind after a read or a write
- Direct I/O (DIO):
  - Bypasses the JFS/JFS2 cache
  - No read ahead
  - An option of the mount command
  - Useful for databases that use filesystems rather than raw logical volumes. If an application has its own cache, it does not make sense to also have the data in the filesystem cache.
- Concurrent I/O (CIO): The same as DIO but without inode locking, so the application must ensure data integrity for multiple simultaneous I/Os to a file.
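A minimal sketch of selecting these behaviors with JFS2 mount options (the mount points are hypothetical):

# Release behind after reads and writes, for example for a Tivoli Storage Manager
# disk storage pool filesystem.
mount -o rbrw /tsm/diskpool1

# Direct I/O or concurrent I/O for database filesystems that do their own caching.
mount -o dio /oracle/archivelogs
mount -o cio /oracle/data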

Asynchronous I/O
Asynchronous I/O is the AIX facility that allows an application to issue an I/O request and continue processing without waiting for the I/O to finish:
- Since AIX 5.2, there are two types of asynchronous I/O: the legacy AIO and the new POSIX-compliant AIO. Many databases already take advantage of legacy AIO, so normally AIX legacy AIO will be enabled.


- For additional information about the two types of asynchronous I/O, consult AIX 5L Differences Guide Version 5.2 Edition, SG24-5765, section 2.5 POSIX-compliant AIO (5.2.0):
  http://www.redbooks.ibm.com/abstracts/sg245765.html?Open
- With AIX 5.3 Technology Level (TL) 05, a new aioo command was shipped with the AIO fileset (bos.rte.aio) that allows you to increase the values of three tunable parameters (minservers, maxservers, and maxreqs) online without a reboot. However, a reduction of any of these values requires a server reboot to take effect.
- With AIX Version 6, the tunables fastpath and fsfastpath are classified as restricted tunables and are now set to a value of 1 by default. Therefore, all asynchronous I/O requests to a raw logical volume are passed directly to the disk layer using the corresponding strategy routine (legacy AIO or POSIX-compliant AIO), or all asynchronous I/O requests for files opened with CIO are passed directly to LVM or disk using the corresponding strategy routine. Also, there are no longer AIO devices in the Object Data Manager (ODM), and all their parameters have become tunables of the ioo command. The newer aioo command is removed. For additional information, refer to IBM AIX Version 6.1 Differences Guide, SG24-7559, at:
  http://www.redbooks.ibm.com/abstracts/sg247559.html?Open

11.2.2 Veritas File System (VxFS) for AIX


Veritas File System was developed by Veritas. It is implemented similarly across Solaris, HP-UX, and other platforms. Refer to 11.4.2, Veritas FileSystem (VxFS) for Solaris on page 344 for a brief description of the features and additional information.

11.2.3 General Parallel FileSystem (GPFS)


GPFS is a concurrent filesystem that can be shared among the nodes that compose a cluster through a SAN or through a high-speed TCP/IP network. Beginning with Version 2.3, GPFS does not require an LVM or an HACMP cluster. You use it for sharing data among the servers of a cluster concurrently without losing the facility to manipulate files that a standard filesystem provides. It implements Direct I/O and filesystem caching among other features.

GPFS memory buffers


- The pagepool sets the size of the buffer cache on each node. For a database system, 100 MB might be enough. For an application with a large number of small files, you might need to increase this setting to 2 GB - 4 GB of RAM.
- The maxFilesToCache sets the number of inodes to cache for recently used files. The default value is 1000. If the application has a large number of small files, consider increasing this value. There is a limit of 300000 tokens.
- The maxStatCache sets the number of inodes to keep in the stat cache. This value needs to be four times the maxFilesToCache value.
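A minimal sketch of how these cluster-wide GPFS settings are usually changed with mmchconfig (the values are illustrative only; size them for your workload, and note that, depending on the attribute, the GPFS daemon might need to be restarted for the change to take effect):

# Increase the GPFS buffer cache and inode caches cluster-wide (illustrative values).
mmchconfig pagepool=2G
mmchconfig maxFilesToCache=10000,maxStatCache=40000

# Display the resulting configuration.
mmlsconfig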

Number of threads
The workerThreads parameter controls the maximum number of concurrent file operations at any instant. The recommended value is the same as the maxservers value in AIX 5.3. There is a limit of 550 threads.


The prefetchThreads parameter controls the maximum possible number of threads dedicated to prefetching data for files that are read sequentially or to handling sequential write behind. For Oracle RAC, set this value to 548. There is a limit of 550 threads.

maxMBpS
Increase the maxMBpS to 80% of the total bandwidth for all HBAs in a single host. The default value is 150 MB/s.
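For example (hypothetical configuration), a host with two 4 Gbps HBAs has roughly 2 x 400 MB/s = 800 MB/s of total bandwidth, so maxMBpS would be set to about 80% of that:

# 80% of ~800 MB/s total HBA bandwidth on a host with two 4 Gbps HBAs.
mmchconfig maxMBpS=640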

maxblocksize
Configure the GPFS blocksize (maxblocksize) to match the application's I/O size, the RAID stripe size, or a multiple of the RAID stripe size. For example, if you use an Oracle database, it is better to adjust a value that matches the product of the DB_BLOCK_SIZE and DB_FILE_MULTIBLOCK_READ_COUNT parameters. If the application does a lot of sequential I/O, it is better to configure a blocksize of 8 to 16 MB to take advantage of the sequential prefetching algorithm on the DS8000.
For additional information, consult the following links:
- GPFS manuals:
  http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.gpfs.doc/gpfsbooks.html
- Tuning considerations in the Concepts, Planning, and Installation Guide V3.2.1, FA76-0413-02:
  http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/topic/com.ibm.cluster.gpfs321.install.doc/bl1ins_tuning.html
- Deploying Oracle 10g RAC on AIX V5 with GPFS, SG24-7541:
  http://www.redbooks.ibm.com/abstracts/sg247541.html?Open
- Configuration and Tuning GPFS for Digital Media Environments, SG24-6700:
  http://www.redbooks.ibm.com/abstracts/sg246700.html?Open

11.2.4 IBM Logical Volume Manager (LVM)


IBM LVM is the standard LVM that comes with AIX. It is an abstraction layer that allows storage virtualization at the operating system level. It is possible to implement RAID 0 or RAID 1, or a combination of both RAID types. It is also possible to spread the data over the LUNs in a round-robin manner. You can configure the buffer sizes to optimize performance. Figure 11-2 shows an overview.


Figure 11-2 IBM LVM overview

In Figure 11-2, the DS8000 LUNs that are under the control of the LVM are called physical volumes (PVs). The LVM splits the disk space into smaller pieces, which are called physical partitions (PPs). A logical volume (LV) is composed of several logical partitions (LPs). A filesystem can be mounted over an LV, or the LV can be used as a raw device. Each LP can point to up to three corresponding PPs. The ability of the LV to point a single LP to multiple PPs is the way in which LVM implements mirroring (RAID 1).
To set up the volume layout with DS8000 LUNs, you can adopt one of the following strategies:
- Storage Pool Striping (SPS): In this case, you are spreading the workload at the storage level. At the operating system level, you just need to create the LVs with the inter-policy attribute set to minimum, which is the default option when creating an LV.
- PP Striping: A set of LUNs is created in different ranks inside the DS8000. When the LUNs are recognized in AIX, a volume group (VG) is created and the LVs are spread evenly over the LUNs by setting the inter-policy to maximum, which is the most common method used to distribute the workload. The advantage of this method compared to SPS is the granularity of the data spread over the LUNs: with SPS, the data is spread in chunks of 1 GB, while in a VG you can create PP sizes from 8 MB to 16 MB. The advantage of this method compared to LVM Striping is that you have more flexibility to manage the LVs, such as adding more disks and redistributing the LVs evenly across all disks by reorganizing the VG.
- LVM Striping: As in PP Striping, a set of LUNs is created in different ranks inside the DS8000. After the LUNs are recognized in AIX, a VG is created with larger PP sizes, such as 128 MB or 256 MB, and the LVs are spread evenly over the LUNs by setting the stripe size of the LV from 8 MB to 16 MB.
From a performance standpoint, LVM Striping and PP Striping provide the same performance. You might see an advantage in a scenario of HACMP with LVM Cross-Site and VGs of 1 TB or more when you perform cluster verification, or you might see that operations related to creating, modifying, or deleting LVs are faster.


Volume group limits


When creating the volume group, there are LVM limits to consider along with the potential expansion of the volume group. The key LVM limits for a volume group are shown in Table 11-1.
Table 11-1 Volume group characteristics
Limit              Standard VG   Big VG    Scalable VG
Maximum PVs/VG     32            128       1024
Maximum LVs/VG     256           512       4096
Maximum PPs/VG     32512         130048    2097152

Note: We recommend using AIX scalable volume groups whenever possible.

PP Striping
Figure 11-3 shows an example of PP Striping. The volume group contains four LUNs and has created 16 MB physical partitions on the LUNs. The logical volume in this example is composed of a group of 16 MB physical partitions from four logical disks: hdisk4, hdisk5, hdisk6, and hdisk7.

(Figure 11-3 shows four 8 GB LUNs, hdisk4 through hdisk7, each on a different DS8000 Extent Pool. With a 16 MB PP size, each LUN holds about 500 physical partitions (pp1-pp500). The logical volume /dev/inter-disk_lv is made up of eight logical partitions, lp1 through lp8, allocated round-robin across the four LUNs: 8 x 16 MB = 128 MB.)

Figure 11-3 Inter-disk policy logical volume

The first step is to create a volume group. We recommend that you create a VG with a set of DS8000 LUNs where each LUN is located in a separate Extent Pool. If you are going to add a new set of LUNs to a host, define another VG, and so on. To create the volume group data01vg with a PP size of 16 MB, execute the following command:
mkvg -S -s 16 -y data01vg hdisk4 hdisk5 hdisk6 hdisk7


Note: To create the volume group, if you use SDD, use the mkvg4vp command; if you use SDDPCM, use the mkvg command. All the flags for the mkvg command apply to the mkvg4vp command.
After you create the VG, the next step is to create the LVs. For a VG with four disks (LUNs), we recommend that you create the LVs as a multiple of the number of disks in the VG times the PP size. In our case, we create the LVs in multiples of 64 MB. You can implement PP Striping by using the option -e x. To create an LV of 1 GB, execute the following command:
mklv -e x -t jfs2 -y inter-disk_lv data01vg 64 hdisk4 hdisk5 hdisk6 hdisk7
Preferably, use inline logs for JFS2 logical volumes, because then there is one log for every filesystem and it is automatically sized. Having one log per filesystem improves performance, because it avoids serialization of access when multiple filesystems make metadata changes. The disadvantage of inline logs is that they cannot be monitored for I/O rates, which can provide an indication of the rate of metadata changes for a filesystem.
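As a follow-on sketch (the mount point is hypothetical), a JFS2 filesystem with an inline log can be created directly on the logical volume from the previous command; the logname=INLINE attribute is what selects the inline log:

# Create a JFS2 filesystem with an inline log on the new logical volume
# (/data01 is a hypothetical mount point; -A yes mounts it at boot).
crfs -v jfs2 -d inter-disk_lv -m /data01 -A yes -a logname=INLINE
mount /data01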

LVM Striping
Figure 11-4 shows an example of a striped logical volume. The logical volume called /dev/striped_lv uses the same capacity as /dev/inter-disk_lv (shown in Figure 11-3), but it is created differently.

(Figure 11-4 shows the same four 8 GB LUNs, hdisk4 through hdisk7, each on a different DS8000 Extent Pool, this time with a 256 MB PP size, which gives about 32 physical partitions per LUN (pp1-pp32). The striped logical volume /dev/striped_lv is made up of eight 256 MB logical partitions, lp1 through lp8, and each logical partition is subdivided into equal stripe-size chunks, so that /dev/striped_lv = lp1.1 + lp2.1 + lp3.1 + lp4.1 + lp1.2 + lp2.2 + lp3.2 + lp4.2 + ...)

Figure 11-4 Striped logical volume

Notice that /dev/striped_lv is also made up of eight 256 MB physical partitions, but each partition is then subdivided into 32 chunks of 8 MB; only three of the 8 MB chunks are shown per logical partition for space reasons.


Again, the first step is to create a VG. To create a VG for LVM Striping, execute the following command:
mkvg -S -s 256 -y data01vg hdisk4 hdisk5 hdisk6 hdisk7
To create a striped LV, you need to combine the following options:
- Stripe width (-C): This option sets the maximum number of disks across which to spread the data. The default value is taken from the upperbound option.
- Copies (-c): This option is only required when you create mirrors. You can set from 1 to 3 copies. The default value is 1.
- Strict allocation policy (-s): This option is only required when you create mirrors, and it is necessary to use the value s (superstrict).
- Stripe size (-S): This option sets the size of a chunk of a sliced PP. Since AIX 5.3, the valid values include 4K, 8K, 16K, 32K, 64K, 128K, 256K, 512K, 1M, 2M, 4M, 8M, 16M, 32M, 64M, and 128M.
- Upperbound (-u): This option sets the maximum number of disks for a new allocation. If you set the allocation policy to superstrict, the upperbound value must be the result of the stripe width times the number of copies that you want to create.
Important: Do not set the option -e with LVM Striping.
Execute the following command to create a striped LV with an 8 MB stripe size:
mklv -C 4 -c 1 -s s -S 8M -t jfs2 -u 4 -y striped_lv data01vg 4 hdisk4 hdisk5 hdisk6 hdisk7
AIX 5.3 implemented a new feature, the striped column. With this feature, you can extend an LV onto a new set of disks after the current disks where the LV is spread are full.

Memory buffers
Adjust the LVM memory buffers (pv_min_pbuf) to increase performance. Set pv_min_pbuf to 1568 for AIX 5.3.
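A minimal sketch, assuming AIX 5.3: the system-wide pv_min_pbuf value is an ioo tunable, and the pbuf count can also be raised for the physical volumes of one volume group with lvmo:

# Raise the system-wide minimum number of pbufs per physical volume (AIX 5.3).
ioo -p -o pv_min_pbuf=1568

# Optionally raise the pbuf count for the physical volumes of a single volume group.
lvmo -v data01vg -o pv_pbuf_count=1024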

Scheduling policy
If you have a dual-site cluster solution using PowerHA with LVM Cross-Site, you can reduce the link requirements among the sites by changing the scheduling policy of each LV to parallel write/sequential read (ps). You must remember that the first copy of the mirror needs to point to the local storage.
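The scheduling policy of an existing logical volume is changed with the chlv -d option; as we understand it, parallel write/sequential read corresponds to the value ps. A minimal sketch (the LV name is hypothetical):

# Set the scheduling policy of a mirrored LV to parallel write/sequential read
# (mirrlv01 is a hypothetical logical volume name).
chlv -d ps mirrlv01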

11.2.5 Veritas Volume Manager (VxVM)


VxVM is another LVM. In the case of AIX, VxVM can replace the IBM LVM for rootvg. The Veritas Volume Manager for AIX is similar to other platforms, such as Solaris and HP-UX. Refer to 11.6.4, Veritas Volume Manager (VxVM) for HP-UX on page 362 for a brief description of the features and additional information.


11.2.6 IBM Subsystem Device Driver (SDD) for AIX


SDD is IBM proprietary multipathing software that only works with the DS8000 and other IBM storage devices. The Subsystem Device Driver is discussed in further detail in Subsystem Device Driver (SDD) on page 272. For additional information about SDD, consult the following link, and always check the interoperability matrix of SDD to see which SDD version is supported:
http://www-01.ibm.com/support/docview.wss?rs=540&context=ST52G7&dc=DA400&uid=ssg1S7001350&loc=en_US&cs=utf-8&lang=en#AIXSDD
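When SDD is installed, its datapath command gives a quick health check of the vpath devices and their paths, which is useful before and after any performance test:

# List SDD vpath devices and the state of each path.
datapath query device

# Show per-adapter path counts and errors.
datapath query adapter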

11.2.7 MPIO with SDDPCM


MPIO is another multipathing device driver. It was introduced in AIX 5.2. The reason for AIX providing its own multipathing solution is that in a SAN environment you might want to connect to several storage subsystems from a single host. Each storage vendor has its own multipathing solution that is not interoperable with the multipathing solutions of other storage vendors, which increases the complexity of managing the compatibility of operating system fix levels, HBA firmware levels, and multipathing software versions. AIX provides the base MPIO device driver; however, it is still necessary to also install the MPIO device driver provided by the storage vendor to take advantage of all of the features of a multipathing solution.
We prefer to use MPIO with SDDPCM rather than SDD with AIX whenever possible. MPIO with SDDPCM is not supported with PowerHA/XD. For additional information about MPIO, consult the following links:
- Multi-path I/O for AIX 5L Version 5.2 white paper:
  http://www-03.ibm.com/systems/resources/systems_p_os_aix_whitepapers_multi_path.pdf
- A paper from Oracle providing the best practices of Automatic Storage Management (ASM) with multipathing:
  http://www.oracle.com/technology/products/database/asm/pdf/asm%20and%20multipathing%20best%20practices%20info%20matrix%203-5-08.pdf
- A paper from Oracle titled ASM Overview and Technical Best Practices:
  http://www.oracle.com/technology/products/database/asm/pdf/asm_10gr2_bestpractices%209-07.pdf
- Check the interoperability matrix of SDDPCM (MPIO) to see which version is supported:
  http://www-01.ibm.com/support/docview.wss?rs=540&context=ST52G7&dc=DA400&uid=ssg1S7001350&loc=en_US&cs=utf-8&lang=en#AIXSDDPCM
- A paper providing an example scenario using MPIO multipathing, Virtual I/O Server (VIOS), and SAN Volume Controller (SVC). You can apply this scenario to the DS8000, although the host attachment device driver for the DS8000 and the MPIO multipathing device driver for the DS8000 differ:
  http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/PRS2999
- MPIO is only supported with HACMP if you configure the VGs in Enhanced Concurrent Mode. Refer to this link for additional information about HACMP with MPIO:
  http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/FLASH10504


If you use a multipathing solution with Virtual I/O Server (VIOS), use MPIO. There are several limitations when using SDD with VIOS. Refer to 11.2.10, Virtual I/O Server (VIOS) on page 323 and the VIOS support site for additional information: http://www14.software.ibm.com/webapp/set2/sas/f/vios/documentation/datasheet.ht ml#multipath
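With SDDPCM installed, the pcmpath command provides the equivalent health check for MPIO hdisk devices:

# List MPIO hdisk devices managed by SDDPCM and the state of each path.
pcmpath query device

# Show per-adapter path statistics.
pcmpath query adapter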

11.2.8 Veritas Dynamic MultiPathing (DMP) for AIX


Veritas Dynamic MultiPathing (DMP) is a device driver that is provided by Veritas to work with VxVM. If you use VxVM, use DMP instead of SDD. There is also an option to use VxVM with SDD. Refer to the corresponding DMP discussions for Solaris and HP-UX later in this chapter for additional information.

11.2.9 FC adapters
FC adapters or host bus adapters (HBAs) provide the connection between the host and the storage devices. There are three important parameters that we recommend that you configure:
- num_cmd_elems: This parameter sets the maximum number of commands to queue to the adapter. When a large number of supported storage devices are configured, you can increase this attribute to improve performance. The default value is 200. The maximum values are:
  - LP10000 adapters: 2048
  - LP9000 adapters: 2048
  - LP7000 adapters: 1024
- dyntrk: Beginning with AIX 5.2 TL1, the AIX Fibre Channel (FC) driver supports FC dynamic device tracking, which enables dynamically changing FC cable connections on switch ports or on supported storage ports without unconfiguring and reconfiguring the hdisk and SDD vpath devices.
  Note: The disconnected cable must be reconnected within 15 seconds.
- fc_err_recov: Beginning with AIX 5.1 and AIX 5.2 TL02, the fc_err_recov attribute enables fast failover during error recovery. Enabling this attribute can reduce the amount of time that the AIX disk driver takes to fail I/O in certain conditions and, therefore, reduce the overall error recovery time. The default value for fc_err_recov is delayed_fail. To enable FC adapter fast failover, change the value to fast_fail.
Note: Only change the attributes fc_err_recov to fast_fail and dyntrk to yes if you use a multipathing solution with more than one path.
Example 11-1 is the output of the attributes of an fcs device.
Example 11-1 Output of an fcs device
# lsattr -El fcs0
bus_intr_lvl  8673       Bus interrupt level                                False
bus_io_addr   0xffc00    Bus I/O address                                    False
bus_mem_addr  0xfffbf000 Bus memory address                                 False
init_link     al         INIT Link flags                                    True
intr_priority 3          Interrupt priority                                 False
lg_term_dma   0x800000   Long term DMA                                      True
max_xfer_size 0x100000   Maximum Transfer Size                              True
num_cmd_elems 200        Maximum number of COMMANDS to queue to the adapter True
pref_alpa     0x1        Preferred AL_PA                                    True
sw_fc_class   2          FC Class for Fabric                                True
#

Example 11-2 is the output of attributes of an fscsi device.


Example 11-2 Output of an fscsi device
# lsattr -El fscsi0
attach       switch       How this adapter is CONNECTED         False
dyntrk       no           Dynamic Tracking of FC Devices        True
fc_err_recov delayed_fail FC Fabric Event Error RECOVERY Policy True
scsi_id      0x490e00     Adapter SCSI ID                       False
sw_fc_class  3            FC Class for Fabric                   True
#
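A minimal sketch of changing these attributes (the adapter instance numbers are hypothetical; -P only updates the ODM, so the change takes effect after the devices are reconfigured or the system is rebooted):

# Increase the queue depth of the FC adapter (the maximum depends on the adapter model).
chdev -l fcs0 -a num_cmd_elems=1024 -P

# Enable dynamic tracking and fast failover on the FC protocol device
# (only with a multipathing solution and more than one path).
chdev -l fscsi0 -a dyntrk=yes -a fc_err_recov=fast_fail -P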

For additional information, consult the SDD Users Guide at the following link: http://www-01.ibm.com/support/docview.wss?rs=540&context=ST52G7&uid=ssg1S7000303

11.2.10 Virtual I/O Server (VIOS)


Virtual I/O Server (VIOS) is an appliance that provides virtual storage and Shared Ethernet Adapter capability to client logical partitions (LPARs) on POWER5 and POWER6. It is built on top of AIX and uses a default AIX user, padmin, running in a restricted shell to execute only the available commands provided by the ioscli command line interface. VIOS allows a physical adapter with disks attached at the VIOS partition level to be shared by one or more partitions, enabling clients to consolidate and potentially minimize the number of required physical adapters. Refer to Figure 11-5 on page 324 for an illustration.
VIOS works this way:
- On the DS8000 side, the same set of LUNs for the VIOSs is assigned to the corresponding volume groups for LUN masking.
- At the VIOS level, there are at least two LPARs defined. They are considered SCSI servers providing access to the DS8000 LUNs for the other LPARs. We recommend that you have two or more HBAs installed on each VIOS.
- For every LPAR VSCSI client, there is a Virtual SCSI server (VSCSI target) defined in both VIOSs through the Hardware Management Console (HMC). Similarly, for every VIOS, there is a VSCSI client device (VSCSI initiator) defined for the corresponding LPAR VSCSI client.
- The host attachment fileset for SDDPCM and SDDPCM for AIX are installed in the VIOSs through the oem_setup_env command, just as on an ordinary AIX server.
- For all DS8000 LUNs that you map directly to the corresponding LPAR client's vscsi device, you also need to disable the SCSI reservation first.
- The LPAR that is the VSCSI client only needs the basic MPIO device driver (MPIO will work only in failover mode). You can use the VSCSI disk devices like any other ordinary disk in AIX.


Figure 11-5 Multipathing with VIOS overview

When you assign several LUNs from the DS8000 to the VIOS and then map those LUNs to the LPAR clients, over time trivial activities, such as upgrading the SDDPCM device driver, can become somewhat challenging. To ease the complexity, we created two scripts: the first script generates a list of the mappings between the LUNs and LPAR clients; the second script, based on that output, creates the commands needed to recreate the mappings between the LUNs and LPAR clients. The scripts are available in Appendix C, UNIX shell scripts on page 587.
For additional information about VIOS, refer to the following links:
- Introduction to PowerVM Editions on IBM p5 Servers:
  http://www.redbooks.ibm.com/redpieces/abstracts/sg247940.html
- IBM System p PowerVM Editions Best Practices:
  http://www.redbooks.ibm.com/abstracts/redp4194.html
- Virtual I/O Server and Integrated Virtualization Manager command descriptions:
  http://publib.boulder.ibm.com/infocenter/systems/scope/hw/topic/iphcg/iphcg.pdf
- Also, check the VIOS frequently asked questions (FAQs) that explain in more detail several restrictions and limitations, such as the lack of the load balancing feature for AIX VSCSI MPIO devices:
  http://www14.software.ibm.com/webapp/set2/sas/f/vios/documentation/faq.html


Performance recommendations
Here are our performance recommendations when configuring Virtual SCSI for performance:
- CPU:
  - Typical entitlement is .25 Virtual CPU of 2
  - Always run uncapped
  - Run at higher priority (weight factor >128)
  - More CPU power with high network loads
- Memory:
  - Typically >= 1 GB (at least 512 MB of memory is required; the minimum is 512 MB + 4 MB per hdisk)
  - Add more memory if there are extremely high device (vscsi and hdisk) counts
  - Small LUNs drive up the memory requirements
For multipathing with VIOS, check the configuration of the following parameters:
- fscsi devices on the VIOS:
  - The attribute fc_err_recov is set to fast_fail
  - The attribute dyntrk is set to yes with the command chdev -l fscsiX -a dyntrk=yes
- hdisk devices on the VIOS:
  - The attribute algorithm is set to load_balance
  - The attribute reserve_policy is set to no_reserve
  - The attribute hcheck_mode is set to nonactive
  - The attribute hcheck_interval is set to 20
- vscsi devices in client LPARs:
  - The attribute vscsi_path_to is set to 30
- hdisk devices in the client:
  - The attribute algorithm is set to failover
  - The attribute reserve_policy is set to no_reserve
  - The attribute hcheck_mode is set to nonactive
  - The attribute hcheck_interval is set to 20

Note: Only change the reserve_policy parameter to no_reserve if you are going to map the LUNs of the DS8000 directly to the client LPAR.
For additional information, refer to the following link:
http://publib.boulder.ibm.com/infocenter/systems/scope/hw/index.jsp?topic=/iphb1/iphb1_vios_planning_vscsi_sizing.htm
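A minimal sketch of setting these attributes (the device instance numbers are hypothetical; on the VIOS, the root-level chdev command is reached with oem_setup_env, and -P defers the change until the device is reconfigured):

# On the VIOS (after oem_setup_env): FC protocol devices and DS8000 hdisks.
chdev -l fscsi0 -a fc_err_recov=fast_fail -a dyntrk=yes -P
chdev -l hdisk4 -a algorithm=load_balance -a reserve_policy=no_reserve -a hcheck_mode=nonactive -a hcheck_interval=20 -P

# On the client LPAR: virtual SCSI adapter and virtual disk.
chdev -l vscsi0 -a vscsi_path_to=30 -P
chdev -l hdisk0 -a algorithm=failover -a reserve_policy=no_reserve -a hcheck_mode=nonactive -a hcheck_interval=20 -P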

11.3 AIX performance monitoring tools


In this section, we briefly review the various AIX commands and utilities that are useful for performance monitoring.


11.3.1 AIX vmstat


The vmstat utility is a useful tool for taking a quick snapshot of the system's performance. It is the first step toward understanding the performance issue and specifically in determining whether the system is I/O bound. Example 11-3 shows how vmstat can help monitor filesystem activity using the command vmstat -I.
Example 11-3 The vmstat -I utility output for filesystem activity analysis
[root@p520-tic-3]# vmstat -I 1 5
System Configuration: lcpu=30 mem=3760MB
 kthr       memory                   page                    faults          cpu
-------- ----------------- ------------------------ ------------------- -----------
 r  b  p    avm      fre    fi   fo pi po  fr  sr    in    sy    cs   us sy id wa
10  1 42 5701952 10391285  338  715  0  0 268 541  2027 46974 13671  18  5 60 16
10  0 88 5703217 10383443    0 6800  0  0   0   0  9355 72217 26819  14 10 15 62
 9  0 88 5697049 10381654    0 7757  0  0   0   0 10107 77807 28646  18 11 11 60
 8  1 70 5692366 10378376    0 8171  0  0   0   0  9743 80831 26634  20 12 13 56
 5  0 74 5697938 10365625    0 6867  0  0   0   0 11986 63476 28737  13 10 13 64
10  0 82 5698586 10357280    0 7745  0  0   0   0 12178 66806 29431  14 11 12 63
12  0 80 5704760 10343915    0 7272  0  0   0   0 10730 74279 29453  16 11 11 62
 6  0 84 5702459 10337248    0 9193  0  0   0   0 12071 72015 30684  15 12 11 62
 6  0 80 5706050 10324435    0 9183  0  0   0   0 11653 72781 31888  16 10 12 62
 8  0 76 5700390 10321102    0 9227  0  0   0   0 11822 82110 31088  18 14 12 56
[root@p520-tic-3]#

In an I/O-bound system, look for:
- A high I/O wait percentage, as shown in the cpu column under the wa sub-column. Example 11-3 shows that a majority of CPU cycles are waiting for I/O operations to complete.
- A high number of blocked processes, as shown in the kthr column under the sub-columns b and p, which are the wait queue (b) and the wait queue for raw devices (p), respectively. A high number of blocked processes normally indicates I/O contention among the processes.
- Paging activity, as seen under the page column. High file page-in (fi) and file page-out (fo) values indicate intensive file caching activity.
Example 11-4 shows you another option that you can use, vmstat -v, from which you can understand whether the blocked I/Os are due to a shortage of buffers.


Example 11-4 The vmstat -v utility output for filesystem buffer activity analysis
[root@p520-tic-3]# vmstat -v | tail -7
                 0 pending disk I/Os blocked with no pbuf
                 0 paging space I/Os blocked with no psbuf
              2484 filesystem I/Os blocked with no fsbuf
                 0 client filesystem I/Os blocked with no fsbuf
                 0 external pager filesystem I/Os blocked with no fsbuf
                 0 Virtualized Partition Memory Page Faults
              0.00 Time resolving virtualized partition memory page faults
[root@p520-tic-3]#

In Example 11-4, notice that filesystem buffer (fsbuf) and LVM buffer (pbuf) space are used to hold the I/O requests at the filesystem and LVM level, respectively. If a substantial number of I/Os are blocked due to insufficient buffer space, both buffers can be increased using the ioo command, but a value that is too large will result in overall poor system performance. Hence, we suggest that you increase the buffers incrementally and monitor the system performance with each increase. For the best practice values, consult the application papers listed under AIX filesystem caching on page 312.
Using lvmo, you can also check whether contention is happening due to a lack of LVM memory buffers, which is illustrated in Example 11-5.
Example 11-5 Output of lvmo -a
[root@p520-tic-3]# lvmo -a -v rootvg
vgname = rootvg
pv_pbuf_count = 512
total_vg_pbufs = 1024
max_vg_pbuf_count = 16384
pervg_blocked_io_count = 0
pv_min_pbuf = 512
global_blocked_io_count = 0
[root@p520-tic-3]#

As you can see in Example 11-5, there are two incremental counters: pervg_blocked_io_count and global_blocked_io_count. The first counter indicates how many times an I/O was blocked because of a lack of the Logical Volume Manager's (LVM's) pinned memory buffers (pbufs) on that VG. The second counter counts how many times an I/O was blocked due to a lack of LVM pinned memory buffers (pbufs) in the whole operating system.
Other indicators of an I/O-bound system can be seen in the disk xfer part of the vmstat output when run against the physical disks, as shown in Example 11-6.
Example 11-6 Output of vmstat for disk xfer
# vmstat hdisk0 hdisk1 1 8
kthr     memory             page               faults        cpu          disk xfer
----  -----------  ------------------------  ------------  -----------  ----------
 r  b   avm   fre  re  pi  po  fr   sr  cy  in    sy  cs  us sy id wa   1  2  3  4
 0  0  3456 27743   0   0   0   0    0   0 131   149  28   0  1 99  0   0  0
 0  0  3456 27743   0   0   0   0    0   0 131    77  30   0  1 99  0   0  0
 1  0  3498 27152   0   0   0   0    0   0 153  1088  35   1 10 87  2   0 11
 0  1  3499 26543   0   0   0   0    0   0 199  1530  38   1 19  0 80   0 59
 0  1  3499 25406   0   0   0   0    0   0 187  2472  38   2 26  0 72   0 53
 0  0  3456 24329   0   0   0   0    0   0 178  1301  37   2 12 20 66   0 42
 0  0  3456 24329   0   0   0   0    0   0 124    58  19   0  0 99  0   0  0
 0  0  3456 24329   0   0   0   0    0   0 123    58  23   0  0 99  0   0  0
The disk xfer part provides the number of transfers per second to the specified physical volumes that occurred in the sample interval. This count does not imply an amount of data that was read or written. You can consult the man page of vmstat for additional details: http://publib.boulder.ibm.com/infocenter/systems/scope/aix/topic/com.ibm.aix.cmds/ doc/aixcmds6/vmstat.htm?tocNode=int_216867
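The preceding discussion mentions raising the fsbuf and pbuf counts when vmstat -v and lvmo show blocked I/Os. The following is a minimal sketch of those tuning commands; the values shown are illustrative assumptions, not recommendations, so increase them in small steps and observe the effect:

# Display the current filesystem and LVM buffer tunables
ioo -a | grep -E "numfsbufs|pv_min_pbuf"
# Raise the JFS/JFS2 filesystem buffer count (takes effect for newly mounted filesystems)
ioo -p -o numfsbufs=1024
# Raise the LVM pbuf count system-wide (value is per physical volume; illustrative)
ioo -p -o pv_min_pbuf=1024
# Alternatively, raise pbufs for a single volume group with lvmo (datavg is a hypothetical VG name)
lvmo -v datavg -o pv_pbuf_count=1024

After each change, rerun vmstat -v and lvmo -a to verify that the blocked I/O counters stop increasing.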

11.3.2 pstat
The pstat command counts how many legacy asynchronous I/O servers are in use on the server. There are two asynchronous I/O (AIO) subsystems: legacy AIO and Posix AIO. In AIX Version 5.3, you can use the command pstat -a | grep aioserver | wc -l to get the number of legacy AIO servers that are running, and the command pstat -a | grep posix_aioserver | wc -l to see the number of Posix AIO servers.
Example 11-7 pstat -a output to measure the legacy AIO activity
[root@p520-tic-3]# pstat -a | grep aioserver | wc -l
       0
[root@p520-tic-3]#

Note: If you use raw devices, you have to use ps -k instead of pstat -a to measure the legacy AIO activity.

Example 11-7 shows that the host does not have any AIO servers running; this subsystem is not enabled by default. You can enable it with mkdev -l aio0 or by using SMIT. For Posix AIO, substitute posix_aio for aio0. In AIX Version 6, both AIO subsystems are loaded by default but are activated only when an AIO request is initiated by an application. Use the command pstat -a | grep aio to see the AIO subsystems loaded, as shown in Example 11-8.
Example 11-8 pstat -a output to show the AIO subsystem defined in AIX 6
[root@p520-tic-3]# pstat -a | grep aio
 18 a   1207c      1  1207c     0     0     1  aioLpool
 33 a   2104c      1  2104c     0     0     1  aioPpool
[root@p520-tic-3]#

In AIX Version 6, you can use the new ioo tunables to show whether the AIO is being used. An illustration is given in Example 11-9 on page 329.


Example 11-9 ioo -a output to show the AIO subsystem activity in AIX 6
[root@p520-tic-3]# ioo -a | grep aio
                  aio_active = 0
                 aio_maxreqs = 65536
              aio_maxservers = 30
              aio_minservers = 3
       aio_server_inactivity = 300
            posix_aio_active = 0
           posix_aio_maxreqs = 65536
        posix_aio_maxservers = 30
        posix_aio_minservers = 3
 posix_aio_server_inactivity = 300
[root@p520-tic-3]#

From Example 11-9, aio_active and posix_aio_active show whether the AIO is being used. The parameters aio_server_inactivity and posix_aio_server_inactivity show how long an AIO server will sleep without servicing an I/O request. To check the Asynchronous I/O configuration in AIX 5.3, just type the following commands shown in Example 11-10.
Example 11-10 lsattr -El aio0 output to list the configuration of legacy AIO
[root@p520-tic-3]# lsattr -El aio0
autoconfig defined  STATE to be configured at system restart  True
fastpath   enable   State of fast path                        True
kprocprio  39       Server PRIORITY                           True
maxreqs    4096     Maximum number of REQUESTS                True
maxservers 10       MAXIMUM number of servers per cpu         True
minservers 1        MINIMUM number of servers                 True
[root@p520-tic-3]#

Notes:
- If your AIX 5.3 level is between TL05 and TL08, you can also use the aioo command to list and increase the values of maxservers, minservers, and maxreqs.
- In AIX Version 6, there are no longer Asynchronous I/O devices in the ODM and the aioo command has been removed; you must use the ioo command to change these tunables.

The general rule is to monitor the I/O wait using the vmstat command. If the I/O wait is more than 25%, consider enabling AIO, which reduces the I/O wait but does not help disks that are overly busy. You can monitor busy disks by using iostat, which we explain in the next section.
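Before moving on, the following is a minimal sketch of checking and enabling the AIO subsystem; the minservers, maxservers, and maxreqs values are illustrative assumptions to be tuned against your workload, not recommendations:

# AIX 5.3: list the current legacy AIO attributes
lsattr -El aio0
# Make legacy AIO available now and configure it automatically at restart
chdev -l aio0 -a autoconfig=available
mkdev -l aio0
# Optionally raise the server and request limits (illustrative values;
# some attribute changes require -P and take effect at the next reboot)
chdev -l aio0 -a minservers=10 -a maxservers=30 -a maxreqs=8192

# AIX 6.1: no aio0 device exists; inspect and change the equivalent tunables with ioo
ioo -a | grep aio
ioo -o aio_maxservers=30    # illustrative value; rarely needed because AIO activates on demand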

11.3.3 AIX iostat


The iostat command is used for monitoring system I/O device load. It can be used to determine and balance the I/O load between physical disks and adapters. The lsattr -E -l sys0 -a iostat command option indicates whether the iostat statistic collection is enabled. To enable the collection of iostat data, use chdev -l sys0 -a iostat=true. The disk and adapter-level system throughput can be observed using the iostat -aDR command.


The -a option provides the adapter-level details, the -D option provides the disk-level details, and the -R option resets the min* and max* values at each interval. Refer to Example 11-11.
Example 11-11 Disk-level and adapter-level details using iostat -aDR
[root@p520-tic-3]# iostat -aDR 1 1

System configuration: lcpu=2 drives=1 paths=1 vdisks=1 tapes=0

Vadapter: vscsi0
  xfer:     Kbps      tps   bkread   bkwrtn  partition-id
            29.7      3.6      2.8      0.8             0
  read:      rps  avgserv  minserv  maxserv
             0.0    48.2S      1.6     25.1
  write:     wps  avgserv  minserv  maxserv
         30402.8      0.0      2.1     52.8
  queue: avgtime  mintime  maxtime  avgwqsz  avgsqsz  sqfull
             0.0      0.0      0.0      0.0      0.0     0.0

Paths/Disks: hdisk0
  xfer:  %tm_act      bps      tps    bread    bwrtn
             1.4    30.4K      3.6    23.7K     6.7K
  read:      rps  avgserv  minserv  maxserv  timeouts    fails
             2.8      5.7      1.6     25.1         0        0
  write:     wps  avgserv  minserv  maxserv  timeouts    fails
             0.8      9.0      2.1     52.8         0        0
  queue: avgtime  mintime  maxtime  avgwqsz  avgsqsz  sqfull
            11.5      0.0     34.4      0.0      0.0     0.9

When analyzing the output of iostat:
- Check whether the number of I/Os is balanced among the disks. If not, it might indicate a problem in the distribution of PPs over the LUNs. With the information provided by lvmstat or filemon, pick the most active LV and, with the lslv -m command, check whether its PPs are distributed evenly among the disks of the VG. If the PPs are not distributed evenly and the logical volume's inter-policy attribute is set to minimum, change the attribute to maximum and reorganize the VG.
- Check in the read section whether avgserv is larger than 15 ms. This can indicate that your bottleneck is in a lower layer, which can be the HBA, the SAN, or even the storage. Also check whether the same problem occurs on other disks of the same VG. If yes, add up the number of I/Os per second and the throughput by vpath (if applicable), rank, and host, and compare with the performance numbers from TotalStorage Productivity Center for Disk.
- Check in the write section whether avgserv is larger than 3 ms. Writes averaging significantly and consistently higher indicate that the write cache is full and there is a bottleneck in the disk.
- Check in the queue section whether avgwqsz is larger than avgsqsz, and compare with other disks in the storage. Check whether the PPs are distributed evenly across all disks in the VG. If avgwqsz is smaller than avgsqsz, compare with other disks in the storage; if there are differences and the PPs are distributed evenly in the VG, the imbalance might be at the rank level.
The following example shows how multipathing needs to be considered to interpret the iostat output.


In this example, a server has two Fibre Channel adapters and is zoned so that it uses four paths to the DS8000. In order to determine the I/O statistics for vpath0 in the example given in Figure 11-6, you need to add up the iostats for hdisk1 - hdisk4. One way to find out which disk devices make up a vpath is to use the datapath query essmap command that is included with SDD.

Tip: When using iostat on a server that is running SDD with multiple attachments to the DS8000, each disk device is really just a single path to the same logical disk (LUN) on the DS8000. To understand how busy a logical disk is, you need to add up the iostats for each disk device making up a vpath.

Figure 11-6 Devices presented to iostat (the OS sees vpath0 from SDD, which load balances and fails over across four hdisk paths through two FC adapters and two SAN switches to a single DS8000 LUN; iostat reports on the individual hdisk devices)

Another way is shown in Example 11-12. The command, datapath query device 0, lists the paths (hdisks) to vpath0. In this example, the logical disk on the DS8000 has LUN serial number 75065513000. The disk devices presented to the operating system are hdisk4, hdisk12, hdisk20, and hdisk28, so we can add up the iostats for these four hdisk devices to see how busy vpath0 is.


Example 11-12 The datapath query device command
{CCF-part2:root}/ -> datapath query device 0

DEV#:   0  DEVICE NAME: vpath0  TYPE: 2107900  POLICY: Optimized
SERIAL: 75065513000
==========================================================================
Path#      Adapter/Hard Disk    State     Mode      Select     Errors
    0      fscsi0/hdisk4        OPEN      NORMAL       155          0
    1      fscsi0/hdisk12       OPEN      NORMAL       151          0
    2      fscsi1/hdisk20       OPEN      NORMAL       144          0
    3      fscsi1/hdisk28       OPEN      NORMAL       131          0
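Building on Example 11-12, the following is a minimal sketch of summing the per-hdisk iostat counters for the paths of one vpath. The hdisk names come from the example, and the awk field positions assume the standard AIX iostat disk report layout (disk name, % tm_act, Kbps, tps, Kb_read, Kb_wrtn); adjust them to your output:

# Take two 30-second samples; the first report shows averages since boot, so the
# awk below keeps only the lines of the last report (assumption: 4 monitored disks)
iostat -d hdisk4 hdisk12 hdisk20 hdisk28 30 2 | grep '^hdisk' | tail -4 | \
  awk '{ kbps += $3; tps += $4 } END { printf "vpath0 totals: %.1f KBps, %.1f tps\n", kbps, tps }'

The nmon disk-group feature described later in this chapter provides a similar rollup automatically.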

The option shown in Example 11-13 provides details in a record format, which can be used to sum up the disk activity.
Example 11-13 Output of iostat -alDRT
[root@p520-tic-3]# iostat -alDRT 1 5

System configuration: lcpu=6 drives=32 paths=32 vdisks=0 tapes=0

Adapter:       -------------------- xfers --------------------                        time
                   bps      tps     bread     bwrtn
fcs1               0.0      0.0       0.0       0.0                               15:43:22

Disks:   ------ xfers ------   ---------- read ----------   --------- write ----------   -------- queue --------    time
         %tm  bps  tps bread   rps  avg  min  max time fail  wps  avg  min  max time fail  avg  min  max  avg  avg  serv
         act            bwrtn      serv serv serv outs           serv serv serv outs      time time time wqsz sqsz qfull
hdisk10  0.0  0.0  0.0  0.0 0.0  0.0  0.0  0.0  0.0   0   0  0.0  0.0  0.0  0.0   0   0   0.0  0.0  0.0  0.0  0.0  0.0  15:43:22
hdisk18  0.0  0.0  0.0  0.0 0.0  0.0  0.0  0.0  0.0   0   0  0.0  0.0  0.0  0.0   0   0   0.0  0.0  0.0  0.0  0.0  0.0  15:43:22
hdisk9   0.0  0.0  0.0  0.0 0.0  0.0  0.0  0.0  0.0   0   0  0.0  0.0  0.0  0.0   0   0   0.0  0.0  0.0  0.0  0.0  0.0  15:43:22
hdisk5   0.0  0.0  0.0  0.0 0.0  0.0  0.0  0.0  0.0   0   0  0.0  0.0  0.0  0.0   0   0   0.0  0.0  0.0  0.0  0.0  0.0  15:43:22
hdisk15  0.0  0.0  0.0  0.0 0.0  0.0  0.0  0.0  0.0   0   0  0.0  0.0  0.0  0.0   0   0   0.0  0.0  0.0  0.0  0.0  0.0  15:43:22
[root@p520-tic-3]#

It is not unusual to see a device reported by iostat as 90% to 100% busy, because a DS8000 volume that is spread across an array of multiple disks can sustain a much higher I/O rate than a single physical disk. A device being 100% busy is generally a problem for a single physical disk, but it is probably not a problem for a volume on a RAID 5 array.

Asynchronous I/O can be further monitored through iostat -A for legacy AIO and iostat -P for Posix AIO. Because the asynchronous I/O queues are assigned by filesystem, it is more useful to measure the queues per filesystem: if you have several instances of the same application, each using its own set of filesystems, you can see which instances are consuming more resources. Execute the iostat -AQ command to see legacy AIO activity, as shown in Example 11-14. Similarly, for POSIX-compliant AIO statistics, use iostat -PQ.
Example 11-14 iostat -AQ output to measure legacy AIO activity by filesystem
[root@p520-tic-3]# iostat -AQ 1 2

System configuration: lcpu=4

aio: avgc avfc maxg maif  maxr  avg-cpu: % user % sys % idle % iowait
        0    0    0    0 16384             0.0   0.1   99.9      0.0

Queue#    Count    Filesystems
129       0        /
130       0        /usr
132       0        /var
133       0        /tmp
136       0        /home
137       0        /proc
138       0        /opt

aio: avgc avfc maxg maif  maxr  avg-cpu: % user % sys % idle % iowait
        0    0    0    0 16384             0.0   0.1   99.9      0.0

Queue#    Count    Filesystems
129       0        /
130       0        /usr
132       0        /var
133       0        /tmp
136       0        /home
137       0        /proc
138       0        /opt

Refer to the paper Oracle Architecture and Tuning on AIX (v1.2) listed under 11.2, AIX disk I/O components on page 311 for recommended configuration values for legacy AIO. This paper lists several considerations about the implementation of legacy AIO in AIX 5.3 as well as in AIX 6.1. If your AIX system is in a SAN environment, you might have so many hdisks that iostat will not give you too much information. We recommend using nmon, which can report iostats based on vpaths or ranks, as discussed in Interactive nmon options for DS8000 performance monitoring on page 336. For detailed information about the implementation of asynchronous I/O statistics in iostat, consult AIX 5L Differences Guide Version 5.3 Edition, SG24-7463. Refer to section 6.3 Asynchronous I/O statistics: http://www.redbooks.ibm.com/abstracts/sg247463.html?Open Or, consult the iostat man pages: http://publib.boulder.ibm.com/infocenter/systems/scope/aix/index.jsp?topic=/com.ib m.aix.cmds/doc/aixcmds3/iostat.htm&tocNode=int_215801

11.3.4 lvmstat
The lvmstat command reports input and output statistics for logical partitions, logical volumes, and volume groups. It is useful for determining the I/O rates to LVM volume groups, logical volumes, and logical partitions, and for dealing with unbalanced I/O situations where the data layout was not considered initially.

Enabling volume group I/O using lvmstat


By default, statistics collection is not enabled. If it has not been enabled for the volume group or logical volume that you want to monitor, lvmstat reports an error such as:

#lvmstat -v rootvg
0516-1309 lvmstat: Statistics collection is not enabled for this logical device.
        Use -e option to enable.

To enable statistics collection for all logical volumes in a volume group (in this case, the rootvg volume group), use the -e option together with the -v <volume group> flag, as the following example shows:

#lvmstat -v rootvg -e

When you no longer need to collect statistics with lvmstat, disable it, because it impacts the performance of the system. To disable statistics collection for all logical volumes in a volume group, use the -d option together with the -v <volume group> flag:

#lvmstat -v rootvg -d

This command disables the collection of statistics on all logical volumes in the volume group. The first report section generated by lvmstat provides statistics covering the time since statistics collection was enabled. Each subsequent report section covers the time since the previous report. All statistics are reported each time that lvmstat runs. The report consists of a header row, followed by a line of statistics for each logical partition or logical volume, depending on the flags specified.

Monitoring volume group I/O using lvmstat


Once a volume group is enabled for lvmstat monitoring, such as rootvg in this example, you only need to run lvmstat -v rootvg to monitor all activity to rootvg. An example of the lvmstat output is shown in Example 11-15.
Example 11-15 The lvmstat command example
#lvmstat -v rootvg

Logical Volume       iocnt    Kb_read    Kb_wrtn    Kbps
lv05                682478         16    8579672   16.08
loglv00                  0          0          0    0.00
datalv                   0          0          0    0.00
lv07                     0          0          0    0.00
lv06                     0          0          0    0.00
lv04                     0          0          0    0.00
lv03                     0          0          0    0.00

Notice that lv05 is busy performing writes.
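To drill into a busy logical volume such as lv05, a minimal sketch of per-logical-volume monitoring follows; the interval, count, and LV name are illustrative:

# Enable collection for the single logical volume and report every 5 seconds, 10 times
lvmstat -l lv05 -e
lvmstat -l lv05 5 10
# Report only the 5 busiest logical partitions of that LV
lvmstat -l lv05 -c 5

The per-logical-partition view helps confirm whether activity is concentrated on a few partitions, which often points to a data layout problem.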


The lvmstat tool has powerful options, such as reporting on a specific logical volume or only reporting busy logical volumes in a volume group. For additional information about usage, check the following links: AIX 5L Differences Guide Version 5.2 Edition, SG24-5765: http://www.redbooks.ibm.com/abstracts/sg245765.html?Open AIX 5L Performance Tools Handbook, SG24-6039 http://www.redbooks.ibm.com/abstracts/sg246039.html?Open The man page of lvmstat: http://publib.boulder.ibm.com/infocenter/systems/scope/aix/index.jsp?topic=/com .ibm.aix.cmds/doc/aixcmds3/lvmstat.htm&tocNode=int_215986

11.3.5 topas
The interactive AIX tool, topas, is convenient if you want to get a quick overall view of the systems current activity. A fast snapshot of memory usage or user activity can be a helpful starting point for further investigation. Figure 11-7 contains a sample topas output.

Figure 11-7 topas output in AIX 6.1

With AIX 6.1, the topas monitor has enhanced monitoring capabilities and now also provides I/O statistics for filesystems:
- Enter f twice (the first f turns the filesystem panel on, the next f expands it) to expand the filesystem I/O statistics.
- Type F to get an exclusive and even more detailed view of the filesystem I/O statistics.
- Expanded disk I/O statistics can be obtained by typing dd or D in the topas initial window.
Consult the topas manual page for more details:
http://publib.boulder.ibm.com/infocenter/systems/scope/aix/index.jsp?topic=/com.ibm.aix.cmds/doc/aixcmds5/topas.htm&tocNode=int_123659


11.3.6 nmon
The nmon tool and analyzer for AIX and Linux is a great storage performance analysis resource, and it is free. It was written by Nigel Griffiths who works for IBM in the United Kingdom. We use this tool, among others, when we perform client benchmarks. It is available at: http://www.ibm.com/developerworks/eserver/articles/analyze_aix/ Note: The nmon tool is not formally supported. No warranty is given or implied, and you cannot obtain help or maintenance from IBM. The nmon tool currently comes in two versions to run on different levels of AIX: The nmon Version 12e for AIX 5L and AIX 6.1 The nmon Version 9 for AIX 4.X. This version is functionally established and will not be developed further. The interactive nmon tool is similar to monitor or topas, which you might have used before to monitor AIX, but it offers many more features that are useful for monitoring DS8000 performance. We will explore these interactive options. Unlike topas, the nmon tool can also record data that can be used to establish a baseline of performance for comparison later. Recorded data can be saved in a file and imported into the nmon analyzer (a spreadsheet format) for easy analysis and graphing.

Interactive nmon options for DS8000 performance monitoring


The interactive nmon tool is an excellent way to show comprehensive AIX system monitoring information, of your choice, on one display. When used interactively, nmon updates statistics every two seconds. You can change the refresh rate. To run the tool, just type nmon and press Enter. Then, press the keys corresponding to the areas of interest. For this book, we are interested in monitoring storage. The options relating to storage are a, d, D, and e. For example, type nmon to start the tool, then select A (adapter), then D (disk I/O graphs or disk statistics, but not both at the same time), and then E (DS8000 vpath statistics). It is also helpful to only view just the busiest disk devices, so type a (.) period to turn on this viewing feature. We also want to look at C (CPU utilization) and sometimes T (top processes). The nmon tool has its own way of ordering the topics that you choose on the display. The different options you can select when running nmon Version 12 are shown in Example 11-16.
Example 11-16 The nmon tool options lqHELPqqqqqqqqqmost-keys-toggle-on/offqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqk xh = Help information q = Quit nmon 0 = reset peak counts x x+ = double refresh time - = half refresh r = ResourcesCPU/HW/MHz/AIXx xc = CPU by processor C=upto 128 CPUs p = LPAR Stats (if LPAR) x xl = CPU avg longer term k = Kernel Internal # = PhysicalCPU if SPLPAR x xm = Memory & Paging M = Multiple Page Sizes P = Paging Space x xd = DiskI/O Graphs D = DiskIO +Service times o = Disks %Busy Map x xa = Disk Adapter e = ESS vpath stats V = Volume Group stats x x^ = FC Adapter (fcstat) O = VIOS SEA (entstat) v = Verbose=OK/Warn/Danger x xn = Network stats N=NFS stats (NN for v4) j = JFS Usage stats x xA = Async I/O Servers w = see AIX wait procs "="= Net/Disk KB<-->MB x xb = black&white mode g = User-Defined-Disk-Groups (see cmdline -g) x xt = Top-Process ---> 1=basic 2=CPU-Use 3=CPU(default) 4=Size 5=Disk-I/O x xu = Top+cmd arguments U = Top+WLM Classes . = only busy disks & procsx xW = WLM Section S = WLM SubClasses) x xNeed more details? Then stop nmon and use: nmon -? x


xOr try http://www.ibm.com/collaboration/wiki/display/WikiPtype/nmon x x x xnmon version 12e build=5300-06 - written by Nigel Griffiths, nag@uk.ibm.com x x x x x x x x x mqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq

The nmon adapter performance


Example 11-17 displays the ability of nmon to show I/O performance based on system adapters and Fibre Channel statistics. Notice the output shows one SCSI controller and two Fibre Channel adapters. The great thing about the Fibre Channel statistics is that nmon can get real FC statistics, which are useful to measure the throughput of a tape drive attached to the FC adapter.
Example 11-17 The nmon tool adapter statistics lqnmon12eqqqqqS=WLMsubclassesqqqqHost=p520-tic-3qqqqqRefresh=1 secsqqq12:57.11qk x Fibre-Channel-Adapter qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqx x Adapter Receive Transmit -Requests--Size KB-x xNumber Name KB/s KB/s In Out In Out x x 1 fcs0 0.0 0.0 0.0 0.0 0.0 0.0 x x 2 fcs1 0.0 0.0 0.0 0.0 0.0 0.0 x x Totals 0.0 0.0 MB/s 0.0 0.0 x x Disk-Adapter-I/O qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqx xName %busy read write xfers Disks Adapter-Type x xscsi0 0.0 0.0 0.0 KB/s 0.0 2 PCI-X Dual Channel Ulx xide0 0.0 0.0 0.0 KB/s 0.0 1 ATA/IDE Controller Dex xsisscsia0 9.8 0.0 109.9 KB/s 26.5 2 PCI-X Dual Channel Ulx xfcs0 0.0 0.0 0.0 KB/s 0.0 4 FC Adapter x xTOTALS 4 adapters 0.0 109.9 KB/s 26.5 9 TOTAL(MB/s)=0.1 x xqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqx x x x x x x x x x x x x x x x x mqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq

The nmon tool vpath performance


The e option of nmon shows I/O activity based on vpaths as shown in Example 11-18.
Example 11-18 The nmon tool vpath option
 Name   Size(GB) AvgBusy  read-KB/s  write-KB/s  TotalMB/s   xfers/s    vpaths=2
 vpath0     10.0   22.9%     1659.2       796.3        2.4    4911.0
 vpath1     10.0   23.5%     1673.0       765.5        2.4    4877.0
 TOTALS             23.2%     3332.2      1561.8        4.8    9788.1
+------------------------------------------------------------------------------+

The nmon tool disk group performance

The nmon Version 10 tool has a feature called disk grouping. For example, you can create a disk group based on your AIX volume groups. First, you need to create a file that maps hdisks to nicknames. For example, you can create a map file like that shown in Example 11-19 on page 338.
Example 11-19 The nmon tool disk group mapping file
vi /tmp/vg-maps
rootvg hdisk0 hdisk1
6000vg hdisk2 hdisk3 hdisk4 hdisk5 hdisk6 hdisk7 hdisk8 hdisk9 hdisk10 hdisk11 hdisk12 hdisk13 hdisk14 hdisk15 hdisk16 hdisk17
8000vg hdisk26 hdisk27 hdisk28 hdisk29 hdisk30 hdisk31 hdisk32 hdisk33

Then, type nmon with the -g flag to point to the map file: nmon -g /tmp/vg-maps When nmon starts, press the G key to view statistics for your disk groups. An example of the output is shown in Example 11-20.
Example 11-20 The nmon tool disk-group output
--nmon-v10p---N=NFS--------------Host=san5198b-------Refresh=1 secs---14:02.10----
 Disk-Group-I/O
 Name           Disks AvgBusy  Read|Write-KB/s   TotalMB/s  xfers/s  R:W-SizeKB
 rootvg             2    0.0%        0.0|0.0           0.0      0.0     0.0
 6000vg            16   45.4%      882.6|93800.1      92.5   2131.5    44.4
 8000vg             8   95.3%     1108.7|118592.0    116.9   2680.7    44.7
 Groups= 3 TOTALS  26    5.4%     1991.3|212392.2    209.4   4812.2

Notice that:
- The nmon tool reports real-time iostats for different disk groups. In this case, the disk groups that we created correspond to volume groups, but you can create logical groupings of hdisks for any kind of group that you want.
- You can make multiple disk-group map files and start nmon -g <map-file> to report on different groups.
- To enable nmon to report iostats based on ranks, you can make a disk-group map file listing ranks with their associated hdisk members. Use the SDD command datapath query essmap to provide a view of your host system's logical configuration on the DS8000 or DS6000. You can, for example, create an nmon disk group per storage type (DS8000 or DS6000), logical subsystem (LSS), rank, or port to give you unique views into your storage performance. A sketch of generating a volume-group map file automatically follows this list.
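The following is a minimal sketch of generating a volume-group based map file from lspv output instead of editing it by hand; it assumes the standard lspv column layout (hdisk name, PVID, volume group, state) and skips disks that belong to no volume group:

# Build /tmp/vg-maps with one line per VG: "<vgname> hdiskA hdiskB ..."
lspv | awk '$3 != "None" { vg[$3] = vg[$3] " " $1 } END { for (v in vg) print v vg[v] }' > /tmp/vg-maps
nmon -g /tmp/vg-maps      # press G inside nmon to display the disk-group statistics

A rank-based map file can be produced the same way by parsing the Rank column of datapath query essmap output.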

Recording nmon information for import into the nmon analyzer tool
A great benefit that the nmon tool provides is the ability to collect data over time to a file and then import the file into the nmon analyzer tool, which is available at:
http://www.ibm.com/developerworks/aix/library/au-nmon_analyser/
To collect nmon data in comma-separated value (csv) file format for easy spreadsheet import:
1. Run nmon with the -f flag. Refer to nmon -h for the details; as an example, to run nmon for an hour capturing data snapshots every 30 seconds, use: nmon -f -s 30 -c 120
2. This command creates the output file in the current directory, called <hostname>_date_time.nmon.

The nmon analyzer is a macro-customized Microsoft Excel spreadsheet. After transferring the output file to the machine running the nmon analyzer, simply start the nmon analyzer, enabling the macros, and click Analyze nmon data. You will be prompted to select your spreadsheet and then to save the results. Many spreadsheets have fixed numbers of columns and rows. We suggest that you collect up to a maximum of 300 snapshots to avoid experiencing these issues. When you capture data to a file, the nmon tool disconnects from the shell to ensure that it continues running even if you log out, which means that nmon can appear to fail, but it is still running in the background until the end of the analysis period.

11.3.7 fcstat
The fcstat command displays statistics from a specific Fibre Channel adapter. Example 11-21 shows the output of the fcstat command.
Example 11-21 The fcstat command output
# fcstat fcs0

FIBRE CHANNEL STATISTICS REPORT: fcs0
skipping.........
FC SCSI Adapter Driver Information
  No DMA Resource Count: 0
  No Adapter Elements Count: 0
  No Command Resource Count: 99023
skipping.........

The No Command Resource Count indicates how many times the num_cmd_elems value was exceeded since AIX was booted. You can keep taking snapshots every 3 to 5 minutes during a peak period to evaluate if you need to increase the value of num_cmd_elems. For additional information, consult the man pages of the fcstat command: http://publib.boulder.ibm.com/infocenter/systems/scope/aix/index.jsp?topic=/com.ib m.aix.cmds/doc/aixcmds2/fcstat.htm&tocNode=int_122561
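A minimal sketch of checking and raising num_cmd_elems on an FC adapter follows; the value 1024 is an illustrative assumption, and the change is deferred with -P until the adapter is reconfigured or the system is rebooted:

# Show the current value and the allowed range for adapter fcs0
lsattr -El fcs0 -a num_cmd_elems
lsattr -Rl fcs0 -a num_cmd_elems
# Raise the limit; -P records the change in the ODM to be applied at the next reconfiguration
chdev -l fcs0 -a num_cmd_elems=1024 -P

After the change takes effect, continue sampling fcstat to confirm that No Command Resource Count stops growing.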

11.3.8 filemon
The filemon command monitors a trace of filesystem and I/O system events, and reports performance statistics for files, virtual memory segments, logical volumes, and physical volumes. The filemon command is useful to individuals whose applications are believed to be disk-bound, and who want to know where and why. The filemon command provides a quick test to determine if there is an I/O problem by measuring the I/O service times for reads and writes at the disk and logical volume level. The filemon command resides in /usr/bin and is part of the bos.perf.tools file set, which can be installed from the AIX base installation media.

filemon measurements
To provide a more complete understanding of filesystem performance for an application, the filemon command monitors file and I/O activity at four levels:

Logical filesystem: The filemon command monitors logical I/O operations on logical files. The monitored operations include all read, write, open, and seek system calls, which might or might not result in actual physical I/O, depending on whether the files are already buffered in memory. I/O statistics are kept on a per-file basis.

Virtual memory system: The filemon command monitors physical I/O operations (that is, paging) between segments and their images on disk. I/O statistics are kept on a per-segment basis.

Logical volumes: The filemon command monitors I/O operations on logical volumes. I/O statistics are kept on a per-logical-volume basis.

Physical volumes: The filemon command monitors I/O operations on physical volumes. At this level, physical resource utilizations are obtained. I/O statistics are kept on a per-physical-volume basis.

filemon examples
A simple way to use filemon is to run the command shown in Example 11-22, which will: Run filemon for two minutes and stop the trace. Store output in /tmp/fmon.out. Just collect logical volume and physical volume output.
Example 11-22 Using filemon #filemon -o /tmp/fmon.out -T 500000 -PuvO lv,pv; sleep 120; trcstop

Note: When setting the trace buffer size with the -T option, a general starting point is 2 MB per logical CPU.

For additional information about filemon, check the filemon man pages:
http://publib.boulder.ibm.com/infocenter/systems/scope/aix/index.jsp?topic=/com.ibm.aix.cmds/doc/aixcmds2/filemon.htm&tocNode=int_215661

To produce sample output for filemon, we ran a sequential write test in the background and started a filemon trace, as shown in Example 11-23. We used the lmktemp command to create a 2 GB file full of nulls while filemon gathered I/O statistics.
Example 11-23 Using filemon with a sequential write test cd /interdiskfs time lmktemp 2GBtest 2000M & filemon -o /tmp/fmon.out -T 500000 -PuvO lv,pv; sleep 120; trcstop

In Example 11-24 on page 341, we look at parts of the /tmp/fmon.out file. When analyzing the output from filemon, focus on:
- Most active physical volumes: Look for balanced I/O across disks. Lack of balance might indicate a data layout problem.
- I/O service times at the physical volume layer: Writes to cache that average less than 3 ms are good. Writes averaging significantly and consistently longer times indicate that the write cache is full and there is a bottleneck in the disk.


- Reads averaging less than 10 ms to 20 ms are good. The disk subsystem read cache hit rate affects this value considerably; higher read cache hit rates result in lower I/O service times, often near 5 ms or less. If reads average greater than 15 ms, it can indicate that something between the host and the disk is a bottleneck, though it usually indicates a bottleneck in the disk subsystem.
- Look for consistent I/O service times across physical volumes. Inconsistent I/O service times can indicate unbalanced I/O or a data layout problem. Longer I/O service times can be expected for I/Os that average greater than 64 KB in size.
- Look at the difference between the I/O service times at the logical volume and physical volume layers. A significant difference indicates queuing or serialization in the AIX I/O stack.

The fields in the filemon report are:
util         Utilization of the volume (fraction of time busy). The rows are sorted by this field, in decreasing order. The first number, 1.00, means 100 percent.
#rblk        Number of 512-byte blocks read from the volume.
#wblk        Number of 512-byte blocks written to the volume.
KB/sec       Total transfer throughput in kilobytes per second.
volume       Name of the volume.
description  Contents of the volume; either a filesystem name or a logical volume type (jfs2, paging, jfslog, jfs2log, boot, or sysdump). Also indicates whether the filesystem is fragmented or compressed.

Example 11-24 The filemon most active logical volumes report
Thu Oct  6 21:59:52 2005
System: AIX CCF-part2 Node: 5 Machine: 00E033C44C00
Cpu utilization:  73.5%

Most Active Logical Volumes
------------------------------------------------------------------------
util   #rblk      #wblk      KB/s  volume          description
------------------------------------------------------------------------
0.73       0   20902656   86706.2  /dev/305glv     /interdiskfs
0.00       0        472       2.0  /dev/hd8        jfs2log
0.00       0         32       0.1  /dev/hd9var     /var
0.00       0         16       0.1  /dev/hd4        /
0.00       0        104       0.4  /dev/jfs2log01  jfs2log

Most Active Physical Volumes
------------------------------------------------------------------------
util   #rblk      #wblk      KB/s  volume          description
------------------------------------------------------------------------
0.99       0     605952    2513.5  /dev/hdisk39    IBM FC 2107
0.99       0     704512    2922.4  /dev/hdisk55    IBM FC 2107
0.99       0     614144    2547.5  /dev/hdisk47    IBM FC 2107
0.99       0     684032    2837.4  /dev/hdisk63    IBM FC 2107
0.99       0     624640    2591.1  /dev/hdisk46    IBM FC 2107
0.99       0     728064    3020.1  /dev/hdisk54    IBM FC 2107
0.98       0     612608    2541.2  /dev/hdisk38    IBM FC 2107
skipping...........

------------------------------------------------------------------------
Detailed Logical Volume Stats (512 byte blocks)
------------------------------------------------------------------------
VOLUME: /dev/305glv  description: /interdiskfs
writes:                 81651   (0 errs)
  write sizes (blks):   avg   256.0     min    256     max     256     sdev       0.0
  write times (msec):   avg   1.816     min  1.501     max   2.409     sdev     0.276
  write sequences:      6
  write seq. lengths:   avg 3483776.0   min 423936     max 4095744     sdev 1368402.0
seeks:                  6       (0.0%)
  seek dist (blks):     init  78592,    avg 4095744.0  min 4095744     max 4095744    sdev 0.0
time to next req(msec): avg   1.476     min  0.843     max 13398.588   sdev    56.493
throughput:             86706.2 KB/sec
utilization:            0.73
skipping...........

------------------------------------------------------------------------
Detailed Physical Volume Stats (512 byte blocks)
------------------------------------------------------------------------
VOLUME: /dev/hdisk39  description: IBM FC 2107
writes:                 2367    (0 errs)
  write sizes (blks):   avg   256.0    min    256      max     256     sdev     0.0
  write times (msec):   avg   1.934    min  0.002      max   2.374     sdev   0.524
  write sequences:      2361
  write seq. lengths:   avg   256.7    min    256      max     512     sdev    12.9
seeks:                  2361    (99.7%)
  seek dist (blks):     init 14251264, avg  1928.4     min    256      max  511232    sdev 23445.5
  seek dist (%tot blks):init 10.61802, avg 0.00144     min 0.00019     max 0.38090    sdev 0.01747
time to next req(msec): avg  50.666    min  1.843      max 14010.230   sdev 393.436
throughput:             2513.5 KB/sec
utilization:            0.99

VOLUME: /dev/hdisk55  description: IBM FC 2107
writes:                 2752    (0 errs)
  write sizes (blks):   avg   256.0    min    256      max     256     sdev     0.0
  write times (msec):   avg   1.473    min  0.507      max   1.753     sdev   0.227
  write sequences:      2575
  write seq. lengths:   avg   273.6    min    256      max     512     sdev    64.8
seeks:                  2575    (93.6%)
  seek dist (blks):     init 14252544, avg  1725.9     min    256      max  511232    sdev 22428.8
  seek dist (%tot blks):init 10.61897, avg 0.00129     min 0.00019     max 0.38090    sdev 0.01671
time to next req(msec): avg  43.573    min  0.844      max 14016.443   sdev 365.314
throughput:             2922.4 KB/sec
utilization:            0.99
skipping to end.....................


In the filemon output in Example 11-24 on page 341, we notice:
- The most active logical volume is /dev/305glv (/interdiskfs), with an average data rate of 87 MBps.
- The Detailed Logical Volume Stats show an average write time of 1.816 ms for /dev/305glv.
- The Detailed Physical Volume Stats show an average write time of 1.934 ms for the busiest disk, /dev/hdisk39, and 1.473 ms for the next busiest disk, /dev/hdisk55.
The filemon command is a useful tool to determine where a host is spending its I/O. More details about the filemon options and reports are available in AIX 5L Performance Tools Handbook, SG24-6039, which you can download from:
http://www.redbooks.ibm.com/abstracts/sg246039.html?Open

11.4 Solaris disk I/O components


Solaris 10 comes with several performance enhancements and new features. The following sections give configuration recommendations according to a typical environment, the type of application, and the setup of the volume layout.

11.4.1 UFS
UFS is the standard filesystem of Solaris. You can configure its journaling feature, adjust filesystem caching parameters, implement Direct I/O, and adjust the sequential read-ahead mechanism. For more information about UFS, consult the following links:
- Introduction to the Solaris filesystem:
  http://www.solarisinternals.com/si/reading/sunworldonline/swol-05-1999/swol-05filesystem.html
  http://www.solarisinternals.com/si/reading/fs2/fs2.html
  http://www.solarisinternals.com/si/reading/sunworldonline/swol-07-1999/swol-07filesystem3.html
- Tunable Parameters Reference Manual for Solaris 8: http://docs.sun.com/app/docs/doc/816-0607
- Tunable Parameters Reference Manual for Solaris 9: http://docs.sun.com/app/docs/doc/806-7009
- Tunable Parameters Reference Manual for Solaris 10: http://docs.sun.com/app/docs/doc/817-0404
More information about SUN Solaris commands and tuning options is available from http://www.solarisinternals.com
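Direct I/O on UFS, mentioned above, is typically enabled per filesystem at mount time. The following is a minimal sketch; the device, mount point, and vfstab entry are hypothetical:

# Mount a UFS filesystem with Direct I/O enabled (bypasses the filesystem page cache)
mount -F ufs -o forcedirectio /dev/dsk/c1t1d0s6 /u01
# Or make it persistent with a hypothetical /etc/vfstab line:
# /dev/dsk/c1t1d0s6  /dev/rdsk/c1t1d0s6  /u01  ufs  2  yes  forcedirectio

Direct I/O generally benefits databases that manage their own caches; filesystems serving many small cached reads usually perform better with the default buffered behavior.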


11.4.2 Veritas FileSystem (VxFS) for Solaris


The Veritas FileSystem supports 32-bit and 64-bit kernels and is a journaled filesystem. Its method of file organization is based on a hash algorithm. It implements Direct I/O and Discovered Direct I/O. With the Veritas FileSystem, you can adjust the mechanisms of sequential read ahead and sequential and random write behind. You can tune its buffers to increase performance, and it supports asynchronous I/O.

Filesystem blocksize
The smallest allocation unit of a filesystem is the blocksize. In VxFS, you can choose from 512 bytes to 8192 bytes. To decide which size is best for your application, consider the average size of the application's files. If the application is a file server and the average file size is about 1 KB, choose a blocksize of 1 KB; if the application is a database with a few large files, choose the maximum size of 8 KB. The default blocksize is 2 KB. In addition, when creating and allocating file space inside VxFS with standard tools, such as mkfile (Solaris only) or database commands, you might see performance degradation. For additional information, refer to:
http://seer.entsupport.symantec.com/docs/192660.htm
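The blocksize is fixed when the filesystem is created. A minimal sketch follows; the disk group, volume name, mount point, and 8 KB value are illustrative assumptions:

# Create a VxFS filesystem with an 8 KB block size on a hypothetical VxVM volume
mkfs -F vxfs -o bsize=8192 /dev/vx/rdsk/datadg/datavol
mount -F vxfs /dev/vx/dsk/datadg/datavol /ora01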

Direct I/O and Discovered Direct I/O


Direct I/O works in a similar manner to AIX Direct I/O and Solaris UFS Direct I/O. It is a way to execute an I/O request bypassing the filesystem cache. A Discovered Direct I/O is similar to a Direct I/O request, but it does not need a synchronous commit of the inode when the file is extended or blocks are allocated. Whenever the filesystem gets an I/O request larger than the attribute discovered_direct_iosz, it tries to use Direct I/O on the request. The default value is 256 KB.

Quick I/O
Quick I/O is a licensed feature from Storage Foundation for Oracle, DB2, or Sybase databases. However, the binaries come in the VRTSvxfs package. It allows a database access to pre-allocated VxFS files as raw character devices. It combines the performance of a raw device with the convenience to manipulate files that a filesystem can provide. There is an interesting document describing the use of Quick I/O compared to other I/O access methods such as Direct I/O at: http://eval.veritas.com/webfiles/docs/qiowp.pdf

Read Ahead and Write Behind


VxFS uses the following parameters to adjust the configuration of Read Ahead and Write Behind:
- read_pref_io: Sets the preferred read request size. The default is 64 KB.
- write_pref_io: Sets the preferred write request size. The default is 64 KB.
- read_nstream: Sets the number of parallel read requests issued at once. The default is 1.
- write_nstream: Sets the number of parallel write requests issued at once. The default is 1.
For additional usage information, consult the following links:
- Veritas File System 5.0 - Administrators Guide for Solaris:
  ftp://exftpp.symantec.com/pub/support/products/Foundation_Suite/283879.pdf
- Veritas File System 5.0 - Administrators Guide for AIX:
  ftp://exftpp.symantec.com/pub/support/products/FileSystem_UNIX/284315.pdf


- Veritas File System 5.0 - Administrators Guide for HP-UX:
  ftp://exftpp.symantec.com/pub/support/products/FileSystem_UNIX/283704.pdf
- Veritas File System 5.0 - Administrators Guide for Linux:
  ftp://exftpp.symantec.com/pub/support/products/FileSystem_UNIX/283836.pdf
In addition, there is an IBM Redbooks publication about VERITAS Storage Foundation Suite:
http://www.redbooks.ibm.com/abstracts/sg246619.html?Open
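The read ahead and write behind parameters described above, as well as discovered_direct_iosz, are per-filesystem tunables that are normally changed with vxtunefs. The following is a minimal sketch; the mount point and values are illustrative assumptions and should be validated against your VxFS release:

# Display the current tunables for a mounted VxFS filesystem
vxtunefs /ora01
# Match read ahead to a striped layout: preferred I/O size and number of parallel streams
vxtunefs -o read_pref_io=1048576 /ora01
vxtunefs -o read_nstream=4 /ora01
# Let requests of 1 MB or larger bypass the page cache via Discovered Direct I/O
vxtunefs -o discovered_direct_iosz=1048576 /ora01

To make such settings persistent across mounts, they are usually recorded in /etc/vx/tunefstab.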

11.4.3 SUN Solaris ZFS


ZFS is the third generation of UNIX Filesystem, developed by SUN. It was introduced in Solaris 10. It is an entirely redesigned filesystem. It is a 128-bit filesystem and does not need a Volume Manager. It is a transactional filesystem, and it implements dynamic striping, multiple blocksizes, and intelligent prefetch, among other features.

Storage pools
ZFS implements built-in volume management features. You define a storage pool and add disks to that pool; it is not necessary to partition the disks and create filesystems on top of those partitions. Instead, you simply define the filesystems, and ZFS allocates disk space dynamically. ZFS does for storage what virtual memory does for physical memory: it abstracts the physical devices behind a common address space. You can also implement quotas and reserve space for a specific filesystem.

Note: ZFS is not supported with SDD, only with MPxIO.
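A minimal sketch of the pool and filesystem workflow follows; the pool name, device names, and quota and reservation values are hypothetical:

# Create a pool from two DS8000 LUNs, then carve filesystems out of it dynamically
zpool create datapool c6t0d0 c6t1d0
zfs create datapool/oradata
zfs set quota=200g datapool/oradata        # optional space limit for this filesystem
zfs set reservation=50g datapool/oradata   # optional guaranteed space
zpool status datapool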

Data Integrity Model


ZFS implements its own transactional model: in case of a system crash, there is no need to run fsck. It also checks the integrity of each data block every time that it executes a read. If it detects an error in a mirrored storage pool, ZFS automatically reads the data from a mirror and fixes the corrupted data. There are five options to configure the checksum:
- on: Enables checksumming using the fletcher2 method. This option is the default.
- off: Disables checksumming. We do not recommend that you disable checksumming.
- fletcher2: A 256-bit hash method used for checksumming. This option is the fastest and simplest checksum option available in ZFS.
- fletcher4: Another 256-bit hash method for checksumming. It is stronger than fletcher2, but it still is not a cryptographic hash algorithm.
- sha256: A 256-bit cryptographically strong hash method used for checksumming. This option is the highest level of checksumming available in ZFS.
For detailed information about how to check and configure the checksumming method, refer to this link:
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Checksums


Another important detail is that ZFS does not know whether the storage subsystem protects its write cache with nonvolatile random access memory (NVRAM). Therefore, after writes it requests that the storage flush the data from cache to disk. If your Solaris release is Solaris 10 11/06 or later, you can disable this cache flush by adding the entry set zfs:zfs_nocacheflush=1 to /etc/system. Additional details are provided at the following link:
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#FLUSH

Dynamic striping
With the storage pool model, after you add more disks to the storage pool, ZFS automatically stripes newly written data across all of the disks, gradually spreading the workload over the added devices.

Multiple blocksize
In ZFS, there is no need to define a blocksize for each filesystem; ZFS tries to match the blocksize to the application I/O size. However, if your application is a database, we recommend that you set the blocksize to match the database blocksize. The property is recordsize and can range from 512 bytes to 128 KB. For example, to configure a record size of 8 KB, type the command zfs set recordsize=8k <pool name>/<filesystem>.

Cache management
Cache management is implemented by a modified version of the Adaptive Replacement Cache (ARC) algorithm. By default, the ARC grows to use most of the system's free real memory as utilization increases. Therefore, if your application also uses a lot of memory, such as a database, you might need to limit the amount of memory available to the ZFS ARC. For detailed information and instructions about how to limit the ZFS ARC, check the following link:
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#ARCSIZE
For more information about ZFS, consult the following links:
- A presentation that introduces ZFS and its new features:
  http://www.sun.com/software/solaris/zfs_lc_preso.pdf
- A Web site with ZFS documentation and a link to a guide about ZFS best practices:
  http://opensolaris.org/os/community/zfs/docs/
- The ZFS Evil Tuning Guide, a Web site with the latest recommendations about how to tune ZFS:
  http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide
- A blog with performance recommendations for tuning ZFS with a database:
  http://blogs.sun.com/realneel/entry/zfs_and_databases
- The Solaris ZFS Administrators Guide:
  http://dlc.sun.com/pdf/817-2271/817-2271.pdf

11.4.4 Solaris Volume Manager (formerly Solstice DiskSuite)


Solaris Volume Manager is an LVM developed by SUN. Since Solaris 9, it has been bundled with the operating system. It comes with a new feature called soft partition and allows you to implement RAID 0, RAID 1, and RAID 5.


For detailed information, consult the following links: A paper explaining the advantages to move from VxVM to SVM: http://www.sun.com/software/whitepapers/solaris9/transition_volumemgr.pdf A paper providing the performance best practices with SVM: http://www.sun.com/blueprints/1103/817-4368.pdf The Solaris Volume Manager Administration Guide - Solaris 10: http://docs.sun.com/app/docs/doc/816-4520

11.4.5 Veritas Volume Manager (VxVM) for Solaris


Veritas Volume Manager (VxVM) for Solaris is a third-party LVM developed by Veritas. It supports RAID 0, RAID 1, and RAID 5 and allows you to spread data over subdisks that make up plexes (a volume can contain up to 32 plexes). Figure 11-8 gives an overview.

Figure 11-8 Veritas Volume Manager (VxVM) overview

The DS8000 LUNs that are under the control of VxVM are called VM Disks. You can split those VM Disks into smaller pieces called subdisks. A plex, which acts like one mirror of a volume, is composed of a set of subdisks; it is at the plex level that you configure RAID 0, RAID 5, or simple concatenation of the subdisks. A volume is composed of one or more plexes, and when you add more than one plex to a volume, you implement RAID 1. On top of that volume, you can create a filesystem or simply use it as a raw device. To set up the volume layout with DS8000 LUNs, you can adopt one of the following strategies:
- Storage Pool Striping (SPS): You spread the workload at the storage level. At the operating system level, you just need to create the plexes with the layout attribute set to concat (the default option when creating a plex).
- Striped plex: A set of LUNs is created on different ranks inside the DS8000. After the LUNs are recognized in Solaris, a Disk Group (DG) is created, the plexes are spread evenly over the LUNs, and the stripe size of a plex is set from 8 MB to 16 MB.

RAID considerations
When using VxVM with DS8000 LUNs, spread the workload over several DS8000 LUNs by creating RAID 0 (striped) plexes. Base the stripe size on the I/O size of your application: if your application issues 1 MB I/Os, define the stripe size as 1 MB. If your application performs a lot of sequential I/O, it is better to configure stripe sizes of 4 MB or more to take advantage of the DS8000 prefetch algorithm. Refer to Chapter 11, Performance considerations with UNIX servers on page 307 for details about RAID configuration.
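The following vxassist sketch creates a striped volume across four DS8000 LUNs; the disk group, volume name, size, and 1 MB stripe unit are illustrative assumptions:

# Create a 4-column striped volume with a 1 MB stripe unit in disk group datadg
vxassist -g datadg make datavol 200g layout=stripe ncol=4 stripeunit=1m
# Verify the resulting plex and subdisk layout
vxprint -g datadg -ht datavol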

vxio:vol_maxio
When you use VxVM on DS8000 LUNs, set the VxVM maximum I/O size parameter (vol_maxio) to match the I/O size of your application or the stripe size of the VxVM RAID 0 layout. For example, if the I/O size of your application is 1 MB, edit /etc/system and add the entry set vxio:vol_maxio=2048; the value is in blocks of 512 bytes. For detailed information about how to configure and use VxVM on several platforms, consult the following links:
- Veritas Volume Manager 5.0 - Administrators Guide for Solaris:
  ftp://exftpp.symantec.com/pub/support/products/Foundation_Suite/283916.pdf
- Veritas Volume Manager 5.0 - Administrators Guide for AIX:
  ftp://exftpp.symantec.com/pub/support/products/VolumeManager_UNIX/284310.pdf
- Veritas Volume Manager 5.0 - Administrators Guide for HP-UX:
  ftp://exftpp.symantec.com/pub/support/products/VolumeManager_UNIX/283742.pdf
- Veritas Volume Manager 5.0 - Administrators Guide for Linux:
  ftp://exftpp.symantec.com/pub/support/products/VolumeManager_UNIX/283835.pdf
- Veritas Storage Foundation 5.0 - Administrators Guide for Windows 2000 Server and Windows Server 2003:
  ftp://exftpp.symantec.com/pub/support/products/Storage_Foundation_for_Windows/286744.pdf

11.4.6 IBM Subsystem Device Driver for Solaris


IBM Subsystem Device Driver for Solaris is the IBM multipathing device driver that is available for SUN Solaris machines.

11.4.7 MPxIO
MPxIO is the multipathing device driver that comes with Solaris and is required when implementing SUN Clusters. For additional information, consult the following links: A presentation providing an overview of MPxIO: http://opensolaris.org/os/project/mpxio/files/mpxio_toi_sio.pdf In addition, the home page of MPxIO: http://opensolaris.org/os/project/mpxio The Solaris SAN Configuration and Multipathing Guide: http://docs.sun.com/app/docs/doc/820-1931


11.4.8 Veritas Dynamic MultiPathing (DMP) for Solaris


Veritas Dynamic MultiPathing (DMP) for Solaris is a device driver provided by Veritas to work with VxVM for Solaris. General performance recommendations or tips for multipathing software:
- For SDD, check the compatibility matrix to download the correct version for Solaris:
  http://www-01.ibm.com/support/docview.wss?rs=540&context=ST52G7&dc=DA400&uid=ssg1S7001350&loc=en_US&cs=utf-8&lang=en#SolarisSDD
- When configuring the DS8000 with MPxIO, follow the instructions in the Host System Attachment Guide, SG26-7628:
  http://publib.boulder.ibm.com/infocenter/dsichelp/ds8000ic/topic/com.ibm.storage.ssic.help.doc/f2c_attsunhoststms_19yp60.html
- Read a very interesting paper about configuring and tuning DMP:
  http://www.symantec.com/content/en/us/enterprise/media/stn/pdfs/Articles/dynamic_multi_pathing.pdf

11.4.9 Array Support Library (ASL)


The Array Support Library (ASL) is a software package used by the Device Discovery Layer (DDL), a component of VxVM, to support third-party arrays. For additional information, check the following links:
- A technical note about ASL that provides additional information:
  http://seer.entsupport.symantec.com/docs/249446.htm
- The compatibility matrix on the Veritas Web site to find the right package:
  http://www.symantec.com/business/products/otherresources.jsp?pcid=pcat_storage&pvid=203_1
- The Host System Attachment Guide, SC26-7917-02, for an updated list of the updates that Sun Solaris needs for attachment to the DS8000. Download this guide from:
  http://www-01.ibm.com/support/docview.wss?rs=1114&context=HW2B2&dc=DA400&q1=ssg1*&uid=ssg1S7001161&loc=en_US&cs=utf-8&lang=en

Adjust the maximum I/O size for each device driver if you must work with I/Os larger than 1 MB. For example, to change to an I/O size of 4 MB:
- sd (SCSI device driver): Edit the file /kernel/drv/sd.conf and add a line with sd_max_xfer_size=0x400000
- ssd (Fibre Channel device driver): Edit the file /kernel/drv/ssd.conf and add a line with ssd_max_xfer_size=0x400000
- st (tape device driver): Edit the file /kernel/drv/st.conf and add a line with st_max_xfer_size=0x400000

Notes:
- For Solaris 8 and Solaris 9, you need to add this line at the end of the respective file.
- For Solaris 10 SPARC, the sd and ssd transfer size settings default to maxphys.
- Certain Fibre Channel HBAs do not support requests greater than 8 MB.
- Do not forget to test the new values before you put them in production.


11.4.10 FC adapter
AMCC (formerly JNI), Emulex, QLogic, and SUN FC adapters are described in the DS8000 Host System Attachment Guide, SC26-7917-02 with recommended performance parameters. For more information, refer to the following link: http://publib.boulder.ibm.com/infocenter/dsichelp/ds8000ic/index.jsp?topic=/com.ib m.storage.ssic.help.doc/f2c_agrs62105inst_1atzyy.html

11.5 Solaris performance monitoring tools


You can use the following tools or commands.

11.5.1 fcachestat and directiostat


There are two very useful tools to measure performance at the filesystem level: fcachestat and directiostat.

fcachestat
The fcachestat command and its output are illustrated in Example 11-25.
Example 11-25 The fcachestat command output
[root@v480-1]# fcachestat 1
 --- dnlc ----   --- inode ---   -- ufsbuf --   -- segmap --   -- segvn ---
  %hit   total    %hit   total    %hit  total    %hit  total    %hit  total
 99.68  145.4M   17.06  512720   99.95  17.6M   42.05  11.0M   80.97   9.7M
100.00       1    0.00       0   99.99   8653    0.00   4288   50.01   4289
100.00       8    0.00       0   99.99   7747    0.00   3840   50.00   3840
100.00       1    0.00       0   99.99   9653    0.00   4780   50.92   4694
100.00       1    0.00       0   99.99   8533    0.00   4215   49.15   4279
100.00       1    0.00       0   99.99   8651    0.00   4289   51.18   4195
100.00       4    0.00       0   99.98   9358    0.00   4636   48.82   4752
100.00       1    0.00       0   99.99   8781    0.00   4352   50.00   4352
100.00       7    0.00       0   99.99   8387    0.00   4154   50.00   4154
  0.00       0    0.00       0   99.99   9215    0.00   4566   50.00   4566
  0.00       0    0.00       0   99.99   8237    0.00   4080   50.14   4069
  0.00       0    0.00       0   99.99   7789    0.00   3842   49.86   3853
[root@v480-1]#

With this tool, you can measure the filesystem buffer utilization. Ignore the first line, because it is an accumulation of statistics since the server was booted:
- Directory Name Lookup Cache (DNLC) and inode cache: Every time a process looks up a directory, inode, or file metadata, it first looks in cache memory; if the information is not there, it must go to disk. If the hit ratio does not stay above 90%, you might need to increase the size of bufhwm.
- UFS buffer cache: Whenever a process accesses a file, it first checks the buffer cache to see whether the pages of that file are still there; if not, it has to get the data from disk. If you see a cache hit percentage below 90%, you might have problems with data buffering in memory. You can check the actual buffer size with the sysdef command shown in Example 11-26.
Example 11-26 The sysdef command output
[root@v480-1]# sysdef
skipping...
*
* Tunable Parameters
*
 85114880  maximum memory allowed in buffer cache (bufhwm)
     30000  maximum number of processes (v.v_proc)
        99  maximum global priority in sys class (MAXCLSYSPRI)
     29995  maximum processes per user id (v.v_maxup)
        30  auto update time limit in seconds (NAUTOUP)
        25  page stealing low water mark (GPGSLO)
         1  fsflush run rate (FSFLUSHR)
        25  minimum resident memory for avoiding deadlock (MINARMEM)
        25  minimum swapable memory for avoiding deadlock (MINASMEM)
skipping...
[root@v480-1]#

To make a change, edit /etc/system and set the parameter bufhwm, as sketched below. For additional information about fcachestat and to download the tool, go to the following link:
http://www.brendangregg.com/cachekit.html
You can also use sar -a, sar -b, and sar -v to check the DNLC and inode cache utilizations. Check the following links for more details about how to use sar in Solaris:
- The sar -a command: http://docs.sun.com/app/docs/doc/817-0403/enueh?a=view
- The sar -b command: http://docs.sun.com/app/docs/doc/817-0403/enuef?a=view
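A minimal sketch of raising the buffer cache limit follows; the value is an illustrative assumption (bufhwm is specified in KB in /etc/system) and a reboot is required for the change to take effect:

# Check the current limit (reported in bytes by sysdef, as in Example 11-26)
sysdef | grep bufhwm
# Then add a line such as the following to /etc/system and reboot:
# set bufhwm=131072       (131072 KB = 128 MB; illustrative value)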

directiostat
The directiostat command and its output are illustrated in Example 11-27.
Example 11-27 directiostats output
[root@v480-1]# ./directiostat 1 5
 lreads lwrites  preads pwrites     Krd     Kwr holdrds  nflush
      0       0       0       0       0       0       0       0
      0       0       0       0       0       0       0       0
      0       0       0       0       0       0       0       0
      0       0       0       0       0       0       0       0
      0       0       0       0       0       0       0       0
[root@v480-1]#

With this tool, you can measure the I/O requests being executed against filesystems mounted with the Direct I/O option enabled. For additional information and to download this tool, go to the following link:
http://www.solarisinternals.com/wiki/index.php/Direct_I/O

11.5.2 Solaris vmstat


The vmstat tool reports virtual memory activity. In the case of intense I/O activity, it allows you to determine whether you really have a memory shortage or whether the server is simply doing a lot of file cache paging in and out; certain operating systems do not let you distinguish the two types of paging activity. Execute the command shown in Example 11-28.


Example 11-28 The vmstat -p command output
[root@v480-1]# vmstat -p 1 5
      memory            page             executable     anonymous     filesystem
    swap    free  re  mf  fr  de  sr  epi epo epf  api apo apf  fpi fpo fpf
 3419192 3244504   3   3   0   0   0    0   0   0    0   0   0    0   0   0
 3686632 3669336   8  19   0   0   0    0   0   0    0   0   0    0   0   0
 3686632 3669392   0   0   0   0   0    0   0   0    0   0   0    0   0   0
 3686632 3669392   0   0   0   0   0    0   0   0    0   0   0    0   0   0
 3686632 3669392   0   0   0   0   0    0   0   0    0   0   0    0   0   0
[root@v480-1]#

The vmstat output has five major columns (memory, page, executable, anonymous, and filesystem). The filesystem column contains three sub-columns:
- fpi: File pages in. It tells how many file pages were copied from disk to memory.
- fpo: File pages out. It tells how many file pages were copied from memory to disk.
- fpf: File pages free. It tells how many file pages are freed in each sample interval.
If you see no paging activity in the anonymous columns (api/apo) and activity only in the file page columns (fpi/fpo), you do not have a memory constraint, but there is a lot of file page activity that you might need to optimize. One way is to enable Direct I/O in the filesystems of your application. Another way is to adjust the read ahead mechanism, if it is enabled, or to adjust the scanner parameters of virtual memory. The recommended values for the scanner parameters are:
- fastscan: This parameter sets how many memory pages are scanned per second. Configure it for 1/4 of real memory with a limit of 1 GB.
- handspreadpages: This parameter sets the distance between the two hands of the clock algorithm that looks for candidate memory pages to be reclaimed when memory is low. Configure it with the same value set for the fastscan parameter.
- maxpgio: This parameter sets the maximum number of pages that can be queued by the Virtual Memory Manager. Configure it for 1024 if you use eight or more ranks in the DS8000 and if you use a high-end server.
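These scanner parameters are also set in the /etc/system file. The following lines are only a sketch of the syntax with assumed values (a system with 8 KB pages where 1/4 of real memory, capped at 1 GB, works out to 131072 pages); size them for your own configuration and reboot for the changes to take effect.

* /etc/system (illustrative values only)
set fastscan=131072
set handspreadpages=131072
set maxpgio=1024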

11.5.3 Solaris iostat


The iostat tool reports I/O activities from disks and slices in the operating system. You have to ignore the first line, because it is an accumulation of statistics since the server was booted. Execute the command as shown in Example 11-29.
Example 11-29 The iostat command output
[root@v480-1]# iostat -xn 1 5
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.9   0   0 c0t0d0
    0.0    0.4    0.1    1.4  0.0  0.0    0.6   23.0   0   0 c1t0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t1d0
[root@v480-1]#

When analyzing the output:
- Look to see whether the number of I/Os is balanced among the disks. If not, it might indicate that you have problems in the distribution of subdisks or plexes (VxVM), or volumes (SVM), over the LUNs. You can use the option -p in iostat to see the workload among the slices of a LUN.
- If r/s is larger than w/s and the asvc_t is larger than 15 ms, your bottleneck is in a lower layer, which can be the HBA, the SAN, or even the storage; for some reason, it is taking too much time to process the I/O requests. Also, check whether the same problem occurs with other disks of the same Disk Set or Disk Group (DG). If yes, add up the number of I/Os per second and the throughput by vpath (if applicable), rank, and host, and compare them with the performance numbers from TotalStorage Productivity Center for Disk.
- If r/s is smaller than w/s and the asvc_t is larger than 3 ms, with write times averaging significantly and consistently higher, the write cache is probably full and there is a bottleneck in the disk. Confirm this information with the TotalStorage Productivity Center for Disk reports.
- If wait is greater than actv, compare with other disks of the storage subsystem. Check whether the distribution of subdisks or plexes (VxVM) or volumes (SVM) is even among all disks in the Disk Set or DG.
- If wait is smaller than actv, compare with other disks of the storage. If there are differences even though the subdisks or plexes (VxVM) or volumes (SVM) are distributed evenly in the Disk Set or DG, it might indicate that the imbalance is at the rank level. Confirm this information with the TotalStorage Productivity Center for Disk reports.
For detailed information about the options of iostat, check the following link:
http://docs.sun.com/app/docs/doc/816-5166/iostat-1m?a=view

11.5.4 vxstat
The vxstat performance tool is part of VxVM. With this tool, you can collect performance data related to VM disks, subdisks, plexes, and volumes. It can provide the following information:
- Operations (reads/writes): The number of I/Os over the sample interval
- Blocks (reads/writes): The number of 512-byte blocks over the sample interval
- Avg time (reads/writes): The average response time for reads and writes over the sample interval
With the DS8000, we recommend that you collect performance information from VM disks and subdisks. To display 10 sets of disk statistics, with intervals of one second, use vxstat -i 1 -c 10 -d. To display 10 sets of subdisk statistics, with intervals of one second, use vxstat -i 1 -c 10 -s. You need to discard the first sample, because it provides statistics since the boot of the server.
When analyzing the output of vxstat, focus on:
- Whether the number of I/Os is balanced among the disks. If not, it might indicate that you have problems in the distribution of subdisks or plexes (VxVM) over the LUNs.
- If operations/read is larger than operations/write and the Avg time/read is longer than 15 ms, your bottleneck is in a lower layer, which can be the HBA, the SAN, or even the storage; for some reason, it takes too much time to process the I/O requests. Also, check whether the same problem occurs with other disks of the same Disk Group (DG) or other DS8000 LUNs. If yes, add up the number of I/Os per second and the throughput by VM disk, rank, and host, and compare them with the performance numbers from TotalStorage Productivity Center for Disk.
- If operations/read is smaller than operations/write and the Avg time/write is greater than 3 ms, with write times averaging significantly and consistently higher, the write cache might be full and there might be a bottleneck in the disk. Confirm this information with the TotalStorage Productivity Center for Disk reports.
- Whether the distribution of subdisks or plexes (VxVM) is even across all disks in the DG. If there are performance differences even though the subdisks or plexes (VxVM) are distributed evenly in the DG, the imbalance might be at the rank level.

11.5.5 dtrace
DTrace is not just a trace tool; it is a framework for dynamically tracing the Solaris kernel and the applications that run on top of Solaris 10. You can write your own tools for performance analysis by using the D programming language. The syntax is based on the C programming language with several specific constructs for tracing instrumentation. Many ready-made scripts are available that you can use for performance analysis. You can start by downloading the DTrace Toolkit from the following link:
http://www.brendangregg.com/dtrace.html#DTraceToolkit
Follow the instructions at the Web site to install the DTrace Toolkit. When installed, set your PATH environment variable to avoid having to type the full path every time, as shown in Example 11-30 on page 354.
Example 11-30 Setting PATH environment variable for DTrace Toolkit
[root@v480-1]# export PATH=$PATH:/opt/DTT:/opt/DTT/Bin
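Before using the prepared scripts, you can verify that DTrace itself works with a simple one-liner. The io:::start probe and the count() aggregation used here are standard DTrace features; the command simply counts block I/O requests by process name until you press Ctrl-C.

# Count block I/O requests by process name until Ctrl-C is pressed
[root@v480-1]# dtrace -n 'io:::start { @[execname] = count(); }'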

One example is a very large sequential I/O that might be reaching a limit. Refer to the script in Example 11-31.
Example 11-31 bitesize.d script
[root@v480-1]# dd if=/dev/zero of=/export/home/test_file.dd bs=2048k count=1000 &
[1] 6516
[root@v480-1]# bitesize.d
Tracing... Hit Ctrl-C to end.
1000+0 records in
1000+0 records out
^C
     PID  CMD
       0  sched\0

           value  ------------- Distribution ------------- count
             256 |                                         0
             512 |@@@@@@@@@                                59
            1024 |@@@                                      22
            2048 |                                         1
            4096 |                                         0
            8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@             191
           16384 |                                         0

       3  fsflush\0

           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |                                         1
            2048 |                                         1
            4096 |                                         0
            8192 |@@@@@@@                                  15
           16384 |@@@@@@                                   14
           32768 |@@@@@                                    11
           65536 |@@@@@@                                   12
          131072 |@@                                       5
          262144 |@@                                       4
          524288 |@@@@@@@@@@@                              24
         1048576 |                                         0

    6516  dd if=/dev/zero of=/export/home/test_file.dd bs=2048k count=1000\0

           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |                                         2
            8192 |@@                                       118
           16384 |@                                        29
           32768 |@                                        47
           65536 |@@                                       142
          131072 |@                                        62
          262144 |@                                        47
          524288 |@@@                                      150
         1048576 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           1692
         2097152 |                                         0

[1] + Done    dd if=/dev/zero of=/export/home/test_file.dd bs=2048k count=1000 &
[root@v480-1]#

In the previous example, we executed a dd command with a blocksize of 2 MB, but when we measure the I/O activity, we can see that in fact the maximum I/O size is 1 MB and not 2 MB, which might be related to the maximum physical I/O size that can be executed in the operating system. Let us confirm the maxphys value (Example 11-32).
Example 11-32 Checking the maxphys parameter
[root@v480-1]# grep maxphys /etc/system
[root@v480-1]#

As you can see, the maxphys parameter is not set in the /etc/system configuration file. It means that Solaris 10 is using the default value, which is 1 MB. You can increase the value of maxphys to increase the size of I/O requests.
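If you decide to raise it, the change is again made in /etc/system. The value below (2 MB, specified in bytes) is only an example; check the tuning guidance for your Solaris release and your volume manager before changing it, and reboot for the change to take effect.

* /etc/system (illustrative value only)
set maxphys=2097152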


For additional information about how to use DTrace, check the following links:
- Introduction to DTrace: http://www.solarisinternals.com/wiki/index.php/DTrace_Topics_Intro
- The DTrace Toolkit: http://www.solarisinternals.com/wiki/index.php/DTraceToolkit

11.6 HP-UX Disk I/O architecture


HP-UX 11i v3 comes with several performance enhancements and new features. The two most important from an I/O perspective are LVM Version 2 and the new Mass Storage Subsystem. Refer to the following sections for the recommended configuration and additional details about those new features and the setup of the volume layout.

11.6.1 HP-UX High Performance File System (HFS)


High Performance File System (HFS) is the HP-UX product similar to the UNIX File System (UFS). It was designed for small files. It is not a journaled filesystem, and it also does not support Direct I/O.

Blocksize, fragment size, cylinders per cylinder group, and minfree


The default blocksize in HFS is 1 KB. You have to adjust this value based on the average size of the application files. For database applications, for example, files can easily be bigger than 2 GB. In that case, set the filesystem blocksize equal to the database blocksize. Use the same value defined in blocksize for fragment size. In the case of cylinders per cylinder group, if your application is a database, set it to 32. For minfree, if your application is a database, set it to 2% or zero.
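As an illustration only, an HFS filesystem with an 8 KB block and fragment size and a 2% minfree could be created with newfs; the device name and values here are assumptions, so verify the options against the newfs(1M) man page for your release before using them.

# Hypothetical example: 8 KB blocks and fragments, 2% minfree
newfs -F hfs -b 8192 -f 8192 -m 2 /dev/vg01/rlvol1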

HFS read ahead


The read ahead mechanism in HFS is set by the parameter hfs_ra_per_disk, and the value is in KB. This value needs to match the I/O size of your application. For example, if you use the Oracle database, it is better use a value that matches the product of the values of the DB_BLOCK_SIZE and DB_FILE_MULTIBLOCK_READ_COUNT parameters. For DB2, the read ahead value needs to match the product of the databases blocksize and extent size.
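Because hfs_ra_per_disk is a kernel tunable, it can typically be inspected and changed with kctune on HP-UX 11i v2 and v3 (kmtune or SAM on older releases); the value below (1024 KB of read ahead) is only an example of the syntax.

# Hypothetical example: set HFS read ahead to 1024 KB per disk
kctune hfs_ra_per_disk=1024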

Asynchronous I/O
Asynchronous I/O is a feature of HP-UX that is not enabled by default. It allows the application to keep processing while issuing I/O requests without needing to wait for a reply, consequently reducing the application's response time. Normally, database applications take advantage of this feature. If your application supports asynchronous I/O, enable it in the operating system as well. For detailed information about how to configure asynchronous I/O, refer to the appropriate application documentation:
- Oracle 11g with HP-UX using asynchronous I/O: http://download.oracle.com/docs/cd/B28359_01/server.111/b32009/appb_hpux.htm#BABBFDCI
- Sybase Adaptive Server Enterprise (ASE) 15.0 using asynchronous I/O: http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.dc35823_1500/html/uconfig/BBCBEAGF.htm


Note: If there is no recommendation for the parameter max_async_ports, set it to 1024.
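On HP-UX 11i v2 and v3, max_async_ports is a kernel tunable that can typically be changed with kctune (kmtune or SAM on older releases); the command below is only a sketch of the syntax for that recommendation.

# Hypothetical example: allow up to 1024 asynchronous I/O ports
kctune max_async_ports=1024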

Dynamic buffer cache


Other important options for HP-UX I/O performance are dbc_min_pct and dbc_max_pct. These kernel parameters control the lower and upper limit, respectively, as a percentage of system memory, that can be allocated for buffer cache. The number of memory pages allocated for buffer cache use at any given time is determined by the system needs, but the two parameters ensure that the allocated memory never drops below dbc_min_pct and never exceeds dbc_max_pct of the total system memory. The default value for dbc_max_pct is 50%, which is usually too much. If you want to use a dynamic buffer cache, set the dbc_max_pct value to 25%. If you have 4 GB of memory or more, start with an even smaller value. With a large buffer cache, the system is likely to have to page out or shrink the buffer cache to meet the application memory needs, which causes I/O to the paging space. You want to avoid that and set the memory buffers to favor applications over cached files.
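As a sketch of the syntax only, these limits can typically be changed with kctune (kmtune or SAM on older releases) on systems where these tunables are supported; the percentages below are examples based on the guidance above.

# Hypothetical example: cap the dynamic buffer cache at 25% of memory
kctune dbc_max_pct=25
kctune dbc_min_pct=5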

11.6.2 HP-UX Journaled File System (JFS)


Journaled File System (JFS) is the HP-UX version of the Veritas File System (VxFS). Basically, it provides almost the same features as VxFS. There is also OnlineJFS, an optional product that provides the ability to change the size of a JFS filesystem online. The latest version of VxFS is 5.0. Refer to the Solaris section 11.4.2, Veritas FileSystem (VxFS) for Solaris on page 344 for additional information about VxFS. Also, check the following links:
- JFS Tuning and Performance v3.5: http://docs.hp.com/en/5576/JFS_Tuning.pdf
- Supported File and File System Sizes for HFS and JFS: http://docs.hp.com/en/5992-4023/5992-4023.pdf
- Veritas documentation at the HP Web site: http://docs.hp.com/en/oshpux11iv3.html#VERITAS%20Volume%20Manager%20and%20File%20System
- A paper discussing the most commonly misconfigured resources, among them the file system cache mechanisms: http://docs.hp.com/en/5992-0732/5992-0732.pdf
- A paper providing the recommended server tuning parameters for most types of workloads: http://docs.hp.com/en/5992-4222ENW/5992-4222ENW.pdf
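Where an application that manages its own caching (such as a database) benefits from Direct I/O, JFS/VxFS can provide it through mount options. The /etc/fstab entry below is only a sketch; the device and mount point are assumptions, and options such as mincache=direct and convosync=direct may require the optional OnlineJFS functionality.

# Hypothetical /etc/fstab entry enabling direct I/O semantics on a VxFS (JFS) filesystem
/dev/vg01/lvol1 /oradata vxfs delaylog,mincache=direct,convosync=direct 0 2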

11.6.3 HP Logical Volume Manager (LVM)


There are two versions of the HP LVM, Version 1 and Version 2. Version 2 was introduced with the March 2008 release of HP-UX 11i v3 (11.31). LVM supports RAID 0 or RAID 1, or, with Version 2, a combination of both RAID 0 and RAID 1. It is also possible to spread the data among the disks through LVM Distributed Allocation Policy. Figure 11-9 on page 358 gives an overview.


Figure 11-9 HP-UX LVM overview

The DS8000 LUNs that are under the control of LVM are called physical volumes (PVs). The LVM splits the disk space into smaller pieces called physical extents (PEs). A logical volume (LV) is composed of several logical extents (LEs). A filesystem is created on top of an LV, or the LV is simply used as a raw device. Each LE can point to up to two corresponding PEs in LVM Version 1.0 and up to five corresponding PEs in LVM Version 2.0/2.1, which is how LVM implements mirroring (RAID 1).
Note: In order to implement mirroring (RAID 1), it is necessary to install an optional product, HP MirrorDisk/UX.
To set up the volume layout with DS8000 LUNs, you can adopt one of the following strategies:
- Storage Pool Striping (SPS): In this case, you spread the workload at the storage level. At the operating system level, you just need to create the LVs with the inter-policy attribute set to minimum (the default option when creating an LV).
- Distributed Allocation Policy: A set of LUNs is created in different ranks inside the DS8000. When the LUNs are recognized by HP-UX, a VG is created and the LVs are spread evenly over the LUNs with the option -D. The advantage of this method compared to SPS is the granularity of the data spread over the LUNs. Whereas in SPS the data is spread in chunks of 1 GB, you can create PE sizes from 8 MB to 16 MB in a VG.
- LVM Striping: As with the Distributed Allocation Policy, a set of LUNs is created in different ranks inside the DS8000. When the LUNs are recognized by HP-UX, a VG is created with larger PE sizes, such as 128 MB or 256 MB, and the LVs are spread evenly over the LUNs by setting the stripe size of the LV from 8 MB to 16 MB.
From a performance standpoint, LVM Striping and the Distributed Allocation Policy will provide the same results.


Volume group limits


When creating the volume group, there are LVM limits to consider along with the potential expansion of the volume group. The main LVM limits for a volume group are shown in Table 11-2.
Table 11-2 HP-UX volume group characteristics

                   LVM 1.0    LVM 2.0     LVM 2.1
Maximum PVs/VG     255        2047        2047
Maximum LVs/VG     255        2047        2047
Maximum PEs/PV     65535      16777216    16777216

Note: We recommend that you use LVM Version 2.0 or later.

LVM Distributed Allocation Policy


Figure 11-10 shows an example of the inter-disk policy logical volume. The LVM has created a volume group containing four LUNs and has created 16 MB physical partitions on the LUNs. The logical volume in this example is a group of 16 MB physical partitions from four different logical disks: disk4, disk5, disk6, and disk7.

Inter-disk logical volume
- disk4, disk5, disk6, and disk7 are hardware-striped LUNs on different DS8000 Extent Pools
- 8 GB / 16 MB partitions ~ 500 physical extents per LUN (pe1-pe500)
- /dev/inter-disk_lv is made up of eight logical extents (le1 + le2 + le3 + le4 + le5 + le6 + le7 + le8) = 8 x 16 = 128 MB

Figure 11-10 Inter-disk policy logical volume

The first step is to initialize the PVs with the following commands:
pvcreate /dev/rdsk/disk4
pvcreate /dev/rdsk/disk5
pvcreate /dev/rdsk/disk6
pvcreate /dev/rdsk/disk7
The next step is to create the VG. We recommend that you create a VG with a set of DS8000 LUNs where each LUN is located in a different Extent Pool. If you add a new set of LUNs to a host, define another VG, and so on. To create the VG data01vg with a PE size of 16 MB:
1. Create the directory /dev/data01vg with a character special file called group:
   mkdir /dev/data01vg
   mknod /dev/data01vg/group c 70 0x020000
2. Create the VG with the following command:
   vgcreate -g data01pvg01 -s 16 /dev/data01vg /dev/dsk/disk4 /dev/dsk/disk5 /dev/dsk/disk6 /dev/dsk/disk7
Then, you can create the LVs with the option -D, which stripes the logical volume from one LUN to the next LUN in chunks the size of the physical extent size of the volume group. For instance:
lvcreate -D y -l 16 -m 1 -n inter-disk_lv -s g /dev/data01vg

LVM striping
An example of a striped logical volume is shown in Figure 11-11. The logical volume called /dev/striped_lv uses the same four LUNs as /dev/inter-disk_lv (shown in Figure 11-10 on page 359), but it is created differently. Notice that /dev/striped_lv is made up of eight 256 MB physical extents, and each extent is then subdivided into 64 chunks of 4 MB (only three of the 4 MB chunks are shown per logical extent for space reasons).


Striped logical volume
- disk4, disk5, disk6, and disk7 are hardware-striped LUNs on different DS8000 Extent Pools
- 8 GB / 256 MB partitions ~ 32 physical extents per LUN (pe1-pe32)
- /dev/striped_lv is made up of eight logical partitions (8 x 256 = 2048 MB)
- Each logical partition is divided into 64 equal parts of 4 MB (only three of the 4 MB parts are shown for each logical partition)
- /dev/striped_lv = le1.1 + le2.1 + le3.1 + le4.1 + le1.2 + le2.2 + le3.2 + le4.2 + le5.1 ...

Figure 11-11 Striped logical volume

As with the LVM Distributed Allocation Policy, the first step is to initialize the PVs with the following commands:
pvcreate /dev/rdsk/disk4
pvcreate /dev/rdsk/disk5
pvcreate /dev/rdsk/disk6
pvcreate /dev/rdsk/disk7

Then, you need to create the VG with a PE size of 256 MB:
1. Create the directory /dev/data01vg with a character special file called group:
   mkdir /dev/data01vg
   mknod /dev/data01vg/group c 70 0x020000
2. Create the VG with the following command:
   vgcreate -g data01pvg01 -s 256 /dev/data01vg /dev/dsk/disk4 /dev/dsk/disk5 /dev/dsk/disk6 /dev/dsk/disk7
The last step is to create all of the needed LVs with a stripe size of 8 MB. To create a striped LV, you need to combine the following options:
- Number of LEs (-l): This option sets the size of your LV. In our case, we want to create a 2 GB LV. Knowing that the PE size is 256 MB, we only need to divide 2048 by 256 to find out how many LEs are needed.
- Stripes (-i): This option sets the number of disks across which the data is spread.
- Stripe size (-I): This option sets the size in kilobytes of the stripe.
- Name of LV (-n): This option sets the name of the LV.

For each LV, execute the following command:
lvcreate -l 8 -i 4 -I 8192 -n striped_lv /dev/data01vg
For additional information, refer to:
- LVM Limits White Paper: http://docs.hp.com/en/6054/LVM_Limits_White_Paper_V4.pdf
- LVM Version 2.0 Volume Groups in HP-UX 11i v3: http://docs.hp.com/en/lvm-v2/L2_whitepaper_8.pdf
- LVM New Features in HP-UX 11i v3: http://docs.hp.com/en/LVM-11iv3features/LVM_New_Features_11iv3_final.pdf
- LVM documentation at docs.hp.com: http://docs.hp.com/en/oshpux11iv3.html#LVM%20Volume%20Manager

11.6.4 Veritas Volume Manager (VxVM) for HP-UX


The Veritas Volume Manager (VxVM) is another LVM available for HP-UX. It is similar to VxVM for Solaris and AIX. Refer to 11.2.5, Veritas Volume Manager (VxVM) on page 320 for a brief description of features and additional information.

11.6.5 PV Links
PV Links is the multipathing solution that comes with HP-UX. It primarily provides a failover capability, but if the storage allows it, you can use the alternate path for load balancing. It is important to first check the DS8000 Interoperability Matrix to decide which multipathing solution to use:
http://www.ibm.com/systems/resources/systems_storage_disk_ds8000_interop.pdf

11.6.6 Native multipathing in HP-UX


Native multipathing is the multipathing solution embedded in HP-UX 11i v3. Native multipathing uses a new way of representing storage devices, known as the agile view. Instead of having the device names of disks and tape drives represented by a hardware path, the system uses device names based on objects. In other words, in the old way (now called the legacy view), the name of a LUN looked like /dev/dsk/c4t2d0 and corresponded to controller 4, SCSI target 2, and SCSI LUN 0. With the new way, the name of a LUN looks like /dev/dsk/disk7. This new way of representing storage devices resulted in a series of necessary modifications to existing commands, implementation of new commands, modifications in LVM, a new hardware addressing model, and changes in tunable parameters. In addition, the storage subsystem needs to be Asymmetric Logical Unit Access (ALUA)-compliant. For additional information, refer to the following links:
- The Next Generation Mass Storage Stack HP-UX 11i v3: http://docs.hp.com/en/MassStorageStack/The_Next_Generation_Mass_Storage_Stack.pdf
- HP-UX 11i v3 Native Multi-Pathing for Mass Storage: http://docs.hp.com/en/native-multi-pathing/native_multipathing_wp_AR0709.pdf
- HP-UX 11i v3 Mass Storage I/O Performance Improvements: http://docs.hp.com/en/11iv3IOPerf/IOPerformanceWhitePaper.pdf


- LVM Migration from Legacy to Agile Naming Model: http://docs.hp.com/en/LVMmigration1/LVM_Migration_to_Agile.pdf

11.6.7 Subsystem Device Driver (SDD) for HP-UX


SDD is the IBM multipathing device driver that is available for HP-UX machines. Check the compatibility matrix to download the correct version with HP-UX: http://www-01.ibm.com/support/docview.wss?rs=540&context=ST52G7&dc=DA400&uid=ssg1S 7001350&loc=en_US&cs=utf-8&lang=en#HPSDD

11.6.8 Veritas Dynamic MultiPathing (DMP) for HP-UX


Veritas DMP is the device driver provided by Veritas to work with VxVM. Check the Solaris section 11.4.8, Veritas Dynamic MultiPathing (DMP) for Solaris on page 349 for additional information.

11.6.9 Array Support Library (ASL) for HP-UX


The Array Support Library (ASL) is a software package used by Device Discovery Layer, which is a component of VxVM to support third-party arrays. Check the Solaris section 11.4.9, Array Support Library (ASL) on page 349 for additional information.

11.6.10 FC adapter
The FC adapter provides the connection between the host and the storage devices.

11.7 HP-UX performance monitoring tools


Next, we describe the commands available to monitor the performance of each layer of the HP-UX I/O subsystem architecture.

11.7.1 HP-UX sar


System Activity Report (SAR) is a tool that reports the contents of certain cumulative activity counters within the UNIX operating system. SAR has numerous options, providing paging, TTY, CPU busy, and many other statistics. Used with the appropriate command flag (-u), sar provides a quick way to tell whether a system is I/O-bound. There are three possible modes in which to use the sar command:
- Real-time sampling and display
- System activity accounting via cron
- Displaying previously captured data

Real-time sampling and display


One way to run sar is to specify a sampling interval and the number of times that you want it to run. To collect and display system statistic reports immediately, run sar -u 1 5. Example 11-33 shows an example of sar output.
Example 11-33 HP-UX sar sample output
[root@rx4640-1]# sar -u 1 5

HP-UX rx4640-1 B.11.31 U ia64    11/13/08

10:57:56    %usr    %sys    %wio   %idle
10:57:57       3      23      12      62
10:57:58       1       6      42      51
10:57:59       1       9      36      54
10:58:00       1      10      37      52
10:58:01       1       1      70      28

Average        1      10      39      50
[root@rx4640-1]#

Not all sar options are the same for AIX, HP-UX, and Sun Solaris, but the sar -u output is the same. The output in the example shows CPU information every 1 second, 5 times. To check if a system is I/O-bound, the important column to check is the %wio column. The %wio includes time spent waiting on I/O from all drives, including internal and DS8000 logical disks. If %wio values exceed 40, you need to investigate to understand storage I/O performance. The next action is to look at I/O service times reported by the sar -Rd command.
Example 11-34 HP-UX sar -d output
[root@rx4640-1]# sar -Rd 1 5

HP-UX rx4640-1 B.11.31 U ia64    11/13/08

11:01:11   device   %busy   avque   r/s   w/s   blks/s   avwait   avserv
11:01:12   disk3     9.90    0.50    11     6     1006     0.00    28.01
11:01:13   disk3    15.00    3.46    27    28      792     9.11    11.45
11:01:14   disk3     9.00    4.35     4    23      402    16.18    22.43
11:01:15   disk3     9.09    2.38     6    18      598     7.86    19.53
11:01:16   disk3    11.88    2.05     9    20      388     8.36    16.87

Average    disk3    10.98    2.85    11    19      638     9.00    17.56
[root@rx4640-1]#

The avwait and avserv columns show the average times spent in the wait queue and the service queue, respectively. The avque column represents the average number of I/Os in the queue of that device.
With HP-UX 11i v3, the sar command has new options to monitor performance:
- -H reports I/O activity by HBA
- -L reports I/O activity by lunpath
- -R with option -d splits the number of I/Os per second between reads and writes
- -t reports I/O activity by tape device
When analyzing the output of sar:
- Check whether the number of I/Os is balanced among the disks. If not, it might indicate that you have problems in the distribution of logical volumes (LVs) over the LUNs.
- Check whether the I/O is balanced among the controllers (HBAs). If not, certain paths might be in a failed state.
- If r/s is larger than w/s and the avserv is larger than 15 ms, your bottleneck is in a lower layer, which can be the HBA, the SAN, or even the storage; it is taking too much time to process the I/O requests. Also, check whether the same problem occurs with other disks of the same volume group (VG) or other DS8000 LUNs. If yes, add up the number of I/Os per second and the throughput by LUN, rank, and host, and compare them with the performance numbers from TotalStorage Productivity Center for Disk.
- If r/s is smaller than w/s, the avserv is greater than 3 ms, and writes are averaging significantly and consistently higher, it might indicate that the write cache is full and there is a bottleneck in the disk. Confirm this information with the TotalStorage Productivity Center for Disk reports.
- Check that the LVs are evenly distributed across all disks in the VG. If there are performance differences even though the LVs are distributed evenly in the VG, it might indicate that the imbalance is at the rank level.
For additional information about the HP-UX sar command, go to:
http://docs.hp.com/en/B2355-60130/sar.1M.html

System activity accounting via cron


The sar data collection program is unintrusive, because it just extracts data from information collected by the system. You do, however, need to configure a system to collect data, and the frequency of the data collection can affect performance and the size of the data files collected. To configure a system to collect data for sar, you can run the sadc command or the modified sa1 and sa2 commands. Here is more information about the sa commands and how to configure sar data collection:
- The sa1 and sa2 commands are shell procedure variants of the sadc command.
- The sa1 command collects and stores binary data in the /var/adm/sa/sadd file, where dd is the day of the month.
- The sa2 command is designed to be run automatically by the cron command and run concurrently with the sa1 command. The sa2 command generates a daily report called /var/adm/sa/sardd. It also removes reports more than one week old.
- The /var/adm/sa/sadd file contains the daily data file, and /var/adm/sa/sardd contains the daily report file, where dd represents the day of the month. Note the r in /var/adm/sa/sardd for sa2 output.
To configure a system to collect data, edit the root crontab file. For our example, if we just want to run sa1 every 15 minutes every day, and the sa2 program to generate ASCII versions of the data just before midnight, we change the cron schedule to look like:
0,15,30,45 * * * 0-6 /usr/lib/sa/sa1
55 23 * * 0-6 /usr/lib/sa/sa2 -A

Display previously captured data


After the sa1 and sa2 commands are configured in cron and data collection starts, you will see binary report files in /var/adm/sa/sadd, where dd represents the day of the month. You can view performance information from these files with:
sar -f /var/adm/sa/sadd
where dd is the day that you are interested in. You can also focus on a certain time period, for example, 8 a.m. to 5:15 p.m., with:
sar -s 8:00 -e 17:15 -f /var/adm/sa/sadd

Remember, sa2 removes data collection files over a week old as scheduled in cron. You can save sar information to view later with the following commands:
sar -A -o data.file interval count > /dev/null &    [SAR data saved to data.file]
sar -f data.file                                    [Read SAR information back from the saved file]
All data is captured in binary form and saved to a file (data.file). The data can then be selectively displayed with the sar command using the -f option.

sar summary
The sar tool helps to tell quickly if a system is I/O-bound. Remember though that a busy system can mask I/O issues, because io_wait counters are not increased if the CPUs are busy. The sar tool can help to save a history of I/O performance so you have a baseline measurement for each host. You can then verify if tuning changes make a difference. You might want, for example, to collect sar data for a week and create reports: 8 a.m. - 5 p.m. Monday - Friday if that time is the prime time for random I/O, and 6 p.m. - 6 a.m. Saturday Sunday if those times are batch/backup windows.

11.7.2 vxstat
The vxstat tool is a performance tool that comes with VxVM. Refer to 11.5.4, vxstat on page 353 for additional information.

11.7.3 GlancePlus and HP Perfview/Measureware


HP has graphical tools to measure system performance, including:
- HP Perfview/Measureware
- GlancePlus
HP Perfview/Measureware is good for recording performance measurements and maintaining a baseline of system performance data for reference. The tool can show statistics for each physical disk in graphical format, and you can change the time scale easily to your liking.

11.8 SDD commands for AIX, HP-UX, and Solaris


For availability and performance, we recommend dual attaching the host to the DS8000 or SAN fabric and using a product that provides a multipath configuration environment, such as SDD. We describe SDD and the common commands that it provides in Subsystem Device Driver (SDD) on page 272. SDD provides commands that are specific for each platform, and we will describe a few of the AIX, HP-UX, and Sun Solaris SDD commands here. All three platforms use the helpful datapath query SDD command. Table 11-3 shows a summary of the SDD commands and their operating system platforms.

366

DS8000 Performance Monitoring and Tuning

Table 11-3 SDD command matrix

SDD command       AIX    HP-UX    Solaris
chgvpath                   X
ckvpath                    X
datapath           X       X        X
defvpath                   X        X
get_root_disks             X        X
gettrace                   X
hd2vp              X       X
pathtest           X       X        X
querysn            X       X
rmvpath                    X        X
showvpath                  X        X
vp2hd              X       X
vpathmkdev                          X
cfgvpath                            X
addpaths           X
cfallvpath         X
dpovgfix           X
extendvg4vp        X
lquerypr           X
lsvpcfg            X
mkvg4vp            X
restvg4vp          X
savevg4vp          X

AIX SDD commands


There are particular SDD commands for AIX that you will want to use to:
- Verify that a host uses vpath devices properly for redundancy and load balancing
- Configure volume groups using vpaths instead of hdisks
- Add paths to vpaths dynamically
Table 11-4 lists the AIX SDD commands.

Chapter 11. Performance considerations with UNIX servers

367

Table 11-4 AIX SDD commands

Command       Description
addpaths      Dynamically adds paths to SDD devices while they are in the Available state.
lsvpcfg       Queries the SDD configuration state.
dpovgfix      Fixes an SDD volume group that has mixed vpath and hdisk physical volumes.
hd2vp         The SDD script that converts a DS8000 hdisk device volume group to a Subsystem Device Driver vpath device volume group.
vp2hd         The SDD script that converts an SDD vpath device volume group to a DS8000 hdisk device volume group.
querysn       The SDD driver tool to query unique serial numbers of DS8000 devices. This tool is used to exclude certain LUNs from SDD, for example, boot disks.
lquerypr      The SDD driver persistent reserve command tool.
mkvg4vp       Creates an SDD volume group.
extendvg4vp   Extends SDD devices to an SDD volume group.
savevg4vp     Backs up all files belonging to a specified volume group with SDD devices.
pathtest      Used with tracing functions.
cfallvpath    Fast-path configuration method to configure the SDD pseudo-parent dpo and all of the SDD vpath devices.
restvg4vp     Restores all files belonging to a specified volume group with SDD devices.

addpaths
In a SAN environment, where servers are attached to SAN switches, the paths from the server to the DS8000 are controlled by zones created with the SAN switch software. You might want to add a new path and remove another path for planned maintenance on the DS8000 or for proper load balancing. You can take advantage of the addpaths command to make the changes live.

lsvpcfg
To display which DS8000 vpath devices are available to provide failover protection, run the lsvpcfg command. You will see output similar to that shown in Example 11-35.
Example 11-35 The lsvpcfg command for AIX output
# lsvpcfg
vpath0 (Avail pv vpathvg) 018FA067 = hdisk1 (Avail )
vpath1 (Avail ) 019FA067 = hdisk2 (Avail )
vpath2 (Avail ) 01AFA067 = hdisk3 (Avail )
vpath3 (Avail ) 01BFA067 = hdisk4 (Avail ) hdisk27 (Avail )
vpath4 (Avail ) 01CFA067 = hdisk5 (Avail ) hdisk28 (Avail )
vpath5 (Avail ) 01DFA067 = hdisk6 (Avail ) hdisk29 (Avail )
vpath6 (Avail ) 01EFA067 = hdisk7 (Avail ) hdisk30 (Avail )
vpath7 (Avail ) 01FFA067 = hdisk8 (Avail ) hdisk31 (Avail )
vpath8 (Avail ) 020FA067 = hdisk9 (Avail ) hdisk32 (Avail )
vpath9 (Avail pv vpathvg) 02BFA067 = hdisk20 (Avail ) hdisk44 (Avail )
vpath10 (Avail pv vpathvg) 02CFA067 = hdisk21 (Avail ) hdisk45 (Avail )
vpath11 (Avail pv vpathvg) 02DFA067 = hdisk22 (Avail ) hdisk46 (Avail )
vpath12 (Avail pv vpathvg) 02EFA067 = hdisk23 (Avail ) hdisk47 (Avail )
vpath13 (Avail pv vpathvg) 02FFA067 = hdisk24 (Avail ) hdisk48 (Avail )

368

DS8000 Performance Monitoring and Tuning

Notice in the example that vpath0, vpath1, and vpath2 all have a single path (hdisk device) and, therefore, will not provide failover protection, because there is no alternate path to the DS8000 LUN. The other SDD vpath devices have two paths and, therefore, can provide failover protection and load balancing.

dpovgfix


It is possible for certain commands, such as chdev, on an hdisk device to cause a pvid (physical volume ID) to move back to an hdisk (single path to DS8000 LUN) instead of remaining on the vpath device. For example, look at the output shown in Example 11-36. The lsvpcfg command shows that hdisk46 is part of the volume group vpathvg and has a pvid assigned. You need to ensure that the PVIDs are assigned to the vpaths and not the underlying hdisks. The command lsvg -p vpathvg lists the physical volumes making up the volume group vpathvg. Notice that hdisk46 is listed among the other vpath devices, which is not correct for failover and load balancing, because access to the DS8000 logical disk with serial number 02DFA067 is using a single path hdisk46 instead of vpath11. The system is operating in a mixed mode with vpath pseudo devices and partially uses hdisk devices.
Example 11-36 AIX loss of a device path
# lsvpcfg
vpath11 (Avail pv vpathvg) 02DFA067 = hdisk22 (Avail ) hdisk46 (Avail pv vpathvg)
vpath12 (Avail pv vpathvg) 02EFA067 = hdisk23 (Avail ) hdisk47 (Avail )
vpath13 (Avail pv vpathvg) 02FFA067 = hdisk24 (Avail ) hdisk48 (Avail )
# lsvg -p vpathvg
vpathvg:
PV_NAME   PV STATE   TOTAL PPs   FREE PPs   FREE DISTRIBUTION
vpath10   active     29          4          00..00..00..00..04
hdisk46   active     29          4          00..00..00..00..04   ! MIXED MODE - HDISKs and VPATHS !
vpath12   active     29          4          00..00..00..00..04
vpath13   active     29          28         06..05..05..06..06
To fix this problem, run the command dpovgfix volume_group_name. Then, rerun the lsvpcfg or lsvg command to verify. Note: In order for the dpovgfix shell script to be executed, you must unmount all mounted filesystems of this volume group. After successful completion of the dpovgfix shell script, mount the filesystems again.

hd2vp and vp2hd


SDD provides two conversion scripts to move volume group devices to vpaths or hdisks: The hd2vp script converts a volume group from DS8000 hdisks into SDD vpaths. The syntax for the hd2vp script is hd2vp vgname. The vp2hd script converts a volume group from SDD vpaths into DS8000 hdisks. Use the vp2hd program when you want to configure your applications back to original DS8000 hdisks, or when you want to remove the SDD from your host system. The syntax for the vp2hd script is: vp2hd vgname. These conversion programs require that a volume group contain either all original DS8000 hdisks or all SDD vpaths. The program fails if a volume group contains both kinds of device special files (mixed volume group). You might need to use dpovgfix first to fix a volume group to contain all of one kind of device or another.

Chapter 11. Performance considerations with UNIX servers

369

querysn for multi-booting AIX off the DS8000


AIX supports Fibre Channel boot capability for selected System p systems, which allows you to select a DS8000 Fibre Channel device as the boot device. However, a multipathing boot device is not supported. If you plan to select a device as a boot device, do not configure that DS8000 device with multipath configuration. The SDD driver will automatically exclude any DS8000 devices from the SDD configuration, if these DS8000 boot devices are the physical volumes of an active rootvg. Tip: If you require dual or multiple boot capabilities on a server, and multiple operating systems are installed on multiple DS8000 boot devices, use the querysn command to manually exclude all DS8000 boot devices that belong to multiple non-active rootvg volume groups on the server. SDD V1.3.3.3 allows you to manually exclude DS8000 devices from the SDD configuration. The querysn command reads the unique serial number of a DS8000 device (hdisk) and saves the serial number in an exclude file, /etc/vpexclude. During the SDD configuration, SDD configure methods read all the serial numbers in this exclude file and exclude these DS8000 devices from the SDD configuration. The exclude file, /etc/vpexclude, holds the serial numbers of all inactive DS8000 devices (hdisks) in the system. If an exclude file exists, the querysn command will add the excluded serial number to that file. If no exclude file exists, the querysn command will create an exclude file. There is no user interface to this file. Tip: You must not use the querysn command on the same logical device multiple times. Using the querysn command on the same logical device multiple times results in duplicate entries in the /etc/vpexclude file, and the system administrator will have to administer the file and its content. The syntax for querysn is querysn <-d> -l device-name.

Managing secondary system paging space for AIX


For better performance, you might want to place a secondary paging space on the DS8000. SDD 1.3.2.6 (or later) supports secondary system paging on a multi-path Fibre Channel vpath device from an AIX 4.3.3 or AIX 5.1.0 host system to a DS8000. It is worth noting that your host system must be tuned so that little or no I/O goes to page space. If this is not possible, the system needs more memory. The benefits are multipathing to your paging spaces. All the same commands for hdisk-based volume groups apply to using vpath-based volume groups for paging spaces. Important: IBM does not recommend moving the primary paging space out of rootvg. Doing so might mean that no paging space is available during the system startup. Do not redefine your primary paging space using vpath devices.

lquerypr
The lquerypr command implements certain SCSI-3 persistent reservation commands on a device. The device can be either hdisk or SDD vpath devices. This command supports persistent reserve service actions or read reservation key, release persistent reservation, preempt-abort persistent reservation, and clear persistent reservation.

370

DS8000 Performance Monitoring and Tuning

The syntax and options are: lquerypr [[-p]|[-c]|[-r]][-v][-V][-h/dev/PVname] Flags: p If the persistent reservation key on the device differs from the current host reservation key, it preempts the persistent reservation key on the device. If there is a persistent reservation key on the device, it removes any persistent reservation and clears all reservation key registration on the device. Removes the persistent reservation key on the device made by this host. Displays the persistent reservation key if it exists on the device. Verbose mode. Prints detailed message.

r v V

To query the persistent reservation on a device, type:
lquerypr -h /dev/vpath30
This command queries the persistent reservation on the device. If there is a persistent reserve on a disk, it returns 0 if the device is reserved by the current host and 1 if the device is reserved by another host. Use caution with this command, especially when implementing the preempt-abort or clear persistent reserve service action. With the preempt-abort service action, the current persistent reserve key is preempted; it also aborts tasks on the LUN that originated from the initiators that are registered with the preempted key. With the clear service action, both the persistent reservation and the reservation key registrations are cleared from the device or LUN. This command is useful if a disk was attached to one system and was not varied off, leaving the SCSI reserves on the disks and preventing another system from accessing them.

mkvg4vp, extendvg4vp, savevg4vp, and restvg4vp


The mkvg4vp, extendvg4vp, savevg4vp, and restvg4vp commands have the same functionality as their counterpart commands without the 4vp extension; use the 4vp versions when operating on vpath devices. These commands maintain pvids on vpaths and keep SDD working properly. It is a good idea to check periodically to make sure that none of the volume groups use hdisks instead of vpaths. You can verify the path status several ways. Several useful commands are:
- lspv (look for hdisks with volume group names listed)
- lsvpcfg
- lsvg -p <vgname>
Remember to change any scripts that you might have that call savevg or restvg, and change the calls to savevg4vp and restvg4vp.

11.8.1 HP-UX SDD commands


SDD for HP-UX adds the specific commands shown in Table 11-5.

Chapter 11. Performance considerations with UNIX servers

371

Table 11-5 HP-UX SDD commands

Command                       Description
showvpath                     Lists the configuration mapping between SDD devices and underlying disks.
chgvpath                      Configures SDD vpath devices. Updates the information in /etc/vpath.cfg and /etc/vpathsave.cfg. The -c option updates the configuration file. The -r option updates the device configuration without a system reboot.
defvpath
datapath
rmvpath [-all, -vpathname]
hd2vp
get_root_disks
querysn
pathtest
gettrace
vp2hd

showvpath
The showvpath command for HP-UX is similar to the lsvpcfg command for AIX. Use showvpath to verify that an HP-UX vpath uses multiple paths to the DS8000. We show an example of the output from showvpath in Example 11-37.
Example 11-37 The showvpath command for HP-UX
# /opt/IBMsdd/bin/showvpath
vpath1:
    /dev/rdsk/c12t0d0
    /dev/rdsk/c10t0d0
    /dev/rdsk/c7t0d0
    /dev/rdsk/c5t0d0
vpath2:
    /dev/rdsk/c12t0d1

Notice that vpath1 in the example has four paths to the DS8000. The vpath2, however, has a single point of failure, because it only uses a single path. Tip: You can use the output from showvpath to modify iostat or sar information to report statistics based on vpaths instead of hdisks. Gather iostats to a file, and then replace the disk names with the corresponding vpaths.

372

DS8000 Performance Monitoring and Tuning

hd2vp and vp2hd


The hd2vp and vp2hd commands work the same for HP-UX as they do for AIX. Use hd2vp to convert volume groups to use vpaths and take advantage of the failover and load balancing features of SDD. When removing SDD, you can move the volume group devices back to disk devices using the vp2hd command.

11.8.2 Sun Solaris SDD commands


SDD for Solaris adds the specific commands that are shown in Table 11-6.
Table 11-6 Solaris SDD commands

Command                       Description
cfgvpath                      Configures SDD vpath devices. Updates the information in /etc/vpath.cfg and /etc/vpathsave.cfg. The -c option updates the configuration file. The -r option updates the device configuration without a system reboot.
defvpath
datapath
rmvpath [-all, -vpathname]
get_root_disks
vpathmkdev
showvpath
pathtest

On Sun Solaris, SDD resides above the Sun SCSI disk driver (sd) in the protocol stack. For more information about how SDD works, refer to Subsystem Device Driver (SDD) on page 272. SDD is supported for the DS8000 on Solaris 8/9. There are specific commands that SDD provides to Sun Solaris, which we list next, as well as the steps to update SDD after making DS8000 logical disk configuration changes for a Sun server.

cfgvpath
The cfgvpath command configures vpath devices using the following process:
- Scan the host system to find all DS8000 devices (LUNs) that are accessible by the Sun host.
- Determine which DS8000 devices (LUNs) are the same devices that are accessible through different paths.
- Create the configuration file /etc/vpath.cfg to save the information about the DS8000 devices.
With the -c option, cfgvpath exits without initializing the SDD driver; the SDD driver is initialized after reboot. This option is used to reconfigure SDD after a hardware reconfiguration. Without the -c option, cfgvpath initializes the SDD device driver vpathdd with the information stored in /etc/vpath.cfg and creates the pseudo-vpath devices /devices/pseudo/vpathdd*.
Chapter 11. Performance considerations with UNIX servers

373

Note: Do not use cfgvpath without the -c option after hardware reconfiguration, because the SDD driver is already initialized with the previous configuration information. A reboot is required to properly initialize the SDD driver with the new hardware configuration information.

vpathmkdev
The vpathmkdev command creates files vpathMsN in the /dev/dsk/ and /dev/rdsk/ directories by creating links to the pseudo-vpath devices /devices/pseudo/vpathdd*, which are created by the SDD driver. Files vpathMsN in the /dev/dsk/ and /dev/rdsk/ directories provide block and character access to an application the same way as the cxtydzsn devices created by the system. The vpathmkdev command is executed automatically during SDD package installation and must be executed manually to update files vpathMsN after hardware reconfiguration.

showvpath
The showvpath command lists all SDD devices and their underlying disks. We illustrate the showvpath command in Example 11-38.
Example 11-38 Sun Solaris showvpath command output
# showvpath
vpath0c:
    c1t8d0s2    /devices/pci@1f,0/pci@1/scsi@2/sd@1,0:c,raw
    c2t8d0s2    /devices/pci@1f,0/pci@1/scsi@2,1/sd@1,0:c,raw

Tip: Note that you can use the output from showvpath to modify iostat or sar information to report statistics based on vpaths instead of hdisks. Gather iostats to a file, and then replace the disk device names with the corresponding vpaths.

Changing an SDD hardware configuration in Sun Solaris


When adding or removing multi-port SCSI devices from a Sun Solaris system, you must reconfigure SDD to recognize the new devices. Perform the following steps to reconfigure SDD:
1. Shut down the system. Type shutdown -i0 -g0 -y and press Enter.
2. Perform a configuration restart. From the OK prompt, type boot -r and press Enter. This command uses the current SDD entries during restart, not the new entries. The restart forces the new disks to be recognized.
3. Run the SDD configuration utility to make the changes in the directory /opt/IBMdpo/bin. Type cfgvpath -c and press Enter.
4. Shut down the system. Type shutdown -i6 -g0 -y and press Enter.
5. After the restart, change to the /opt/IBMdpo/bin directory by typing cd /opt/IBMdpo/bin.
6. Type devfsadm and press Enter to reconfigure all the drives.
7. Type vpathmkdev and press Enter to create all the vpath devices.
For specific information about SDD commands, refer to IBM TotalStorage Multipath Subsystem Device Driver User's Guide, SC30-4096.

374

DS8000 Performance Monitoring and Tuning

11.9 Testing and verifying DS8000 Storage


To characterize storage performance from a host perspective, we enter into a multidimensional discussion involving considerations that include throughput (IOPS and MB/s), read or write, random or sequential, blocksize, and response time. Here are a few rules that help characterize these terms:
- Throughput or bandwidth is measured in two separate and opposing metrics: input/output operations per second (IOPS) and data transfer rate (MB/s). Generally, random workloads are characterized by IOPS and sequential workloads by MB/s.
- Maximum IOPS are experienced when moving data with a smaller blocksize (4 KB). Maximum data transfer rates are experienced when moving data with a larger blocksize (1024 KB).
- Writes are faster than reads because of the effects of the disk subsystem's cache.
- Random workloads are a mixture of reads and writes to storage.
- Sequential workloads generally have higher MB/s throughput than random workloads.
Refer to Chapter 3, Understanding your workload on page 29 for an understanding of workloads.
The UNIX dd command is a great tool to drive sequential read or sequential write workloads against the DS8000. It will be rare that you can actually drive the DS8000 at the maximum data rates that you see in published performance benchmarks. But, after you understand how your total configuration (for instance, a DS8000 attached with four SDD paths through two SANs to your host with two HBAs, four CPUs, and 1 GB of memory) performs against certain dd commands, you will have a baseline from which you can compare operating system kernel parameter changes or different logical volume striping techniques in order to improve performance. In this section, we discuss how to:
- Determine the sequential read speed that an individual vpath (LUN) can provide in your environment.
- Measure sequential read and write speeds for filesystems.
While running the dd command in one host session, we recommend that you use the UNIX commands and shell scripts presented earlier in this chapter. We assume that, at a minimum, you will have the AIX nmon tool running with the c, a, e, and d features turned on. Below, we run many kinds of dd commands. If, at any time, you want to make sure that there are no processes with the string dd running on your system, execute the following kill-grep-awk command:
kill -kill `ps -ef | grep dd | awk '{ print $2 }'`
Note that the previous command might kill processes other than the dd command that you might not want to terminate.
Important: Use extreme caution when using the dd command to perform a sequential write operation. Ensure that the dd command does not write to a device file that is part of the UNIX operating system.

Chapter 11. Performance considerations with UNIX servers

375

11.9.1 Using the dd command to test sequential rank reads and writes
To test the sequential read speed of a rank, you can run the command:
time dd if=/dev/rvpath0 of=/dev/null bs=128k count=781
The rvpath0 is the character, or raw, device file for the LUN presented to the operating system by SDD. This command reads 100 MB off rvpath0 and reports how long it takes in seconds. Take 100 MB and divide it by the number of seconds reported to determine the MB/s read speed. If you determine that the average read speed of your vpaths is, for example, 50 MB/s, you know you need to stripe your future logical volumes across at least four different ranks to achieve 200 MB/s sequential read speeds. We show an example of the output from the test_disk_speeds sample shell script in Example 11-39.
Example 11-39 The test_disk_speeds script output
# test_disk_speeds rvpath0
100 MB/sec   100 MB   bs=128k
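The test_disk_speeds script itself is not listed in this book. The following is only a minimal sketch, written in ksh, of what such a script might contain; the script name and the 781 x 128 KB (roughly 100 MB) read size are taken from the examples in this section, and everything else, including the whole-second timing granularity, is an assumption:

#!/usr/bin/ksh
# test_disk_speeds (hypothetical sketch): read about 100 MB sequentially from the
# raw device named in $1 and report the rate, using the ksh SECONDS timer.
DEVICE=$1
START=$SECONDS
dd if=/dev/$DEVICE of=/dev/null bs=128k count=781 2>/dev/null
ELAPSED=$((SECONDS - START))
[ $ELAPSED -eq 0 ] && ELAPSED=1        # whole-second resolution; avoid division by zero
print "$((100 / ELAPSED)) MB/sec  100 MB  bs=128k"

Because the timer only resolves whole seconds, a larger count value gives a more meaningful result on fast configurations.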

Let us explore the dd command more. Issue the following command: dd if=/dev/rvpath0 of=/dev/null bs=128k Your nmon monitor (the e option) reports that this previous command has imposed a sustained 100 MB/s bandwidth with a blocksize=128k on vpath0. Notice the xfers/sec column; xfers/sec is IOPS. Now, if your dd command has not already errored out because it reached the end of the disk, press Ctrl-C to stop the process. Now, nmon reports idle. Next, issue the following dd command with a 4 KB blocksize and put it in the background: dd if=/dev/rvpath0 of=/dev/null bs=4k & For this command, nmon reports a lower MB/s but a higher IOPS, which is the nature of I/O as a function of blocksize. Use the previous kill-grep-awk command to clear out all the dd processes from your system. Try your dd sequential read command with a bs=1024 and you see a high MB/s but a reduced IOPS. Now, start several of these commands and watch your throughput increase until it reaches a plateau; something in your configuration (CPU?, HBAs?, DS8000, or rank?) has become a bottleneck. This plateau is as fast as your hardware configuration can perform sequential reads for a specific blocksize. The kill-grep-awk script will clear everything out of the process table for you. Try loading up another raw vpath device (vpath1) device. Watch the performance of your HBAs (nmon a option) approach 200 MBps. You can perform the same kinds of tests against the block vpath device, vpath0. What is interesting here is that you will always observe the same I/O characteristics, no matter what blocksize you specify. That is because, in AIX anyway, the Logical Volume Manager breaks everything up into 4K blocks, reads and writes. Run the following two commands separately. The nmon command reports about the same for both: dd if=/dev/vpath0 of=/dev/null bs=128k dd if=/dev/vpath0 of=/dev/null bs=4k Use caution when using the dd command to test sequential writes. If LUNs have been incorporated into the operating system using logical volume manager (LVM) commands, and the dd command is used to write to the LUNs, they will not be part of the operating system anymore, and the operating system will not like that. For example, if you want to write to a vpath, that vpath must not be part of a LVM volume group. And if you want to write to a LVM


logical volume, it must not have a filesystem on it, and if the logical volume has a logical volume control block (LVCB), you must skip over the LVCB when writing to the logical volume. It is possible to create a logical volume without a LVCB by using the mklv -T O option. Important: Use extreme caution when using the dd command to perform a sequential write operation. Ensure that the dd command is not writing to a device file that is part of the UNIX operating system. The following commands will perform sequential writes to your LUNs: dd if=/dev/zero of=/dev/rvpath0 bs=128k dd if=/dev/zero of=/dev/rvpath0 bs=1024k time dd if=/dev/zero of=/dev/rvpath0 bs=128k count=781 Try different blocksizes, different raw vpath devices, and combinations of reads and writes. Run the commands against the block device (/dev/vpath0) and notice that blocksize does not affect performance.
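As a hedged illustration of the mklv -T O approach, the following sequence creates a small scratch logical volume without an LVCB and drives sequential I/O against its raw device. The volume group name, logical volume name, and size are made-up examples; adjust them to your environment and be certain that the logical volume contains nothing you need:

mklv -T O -y ddtestlv testvg 8               # scratch LV with no LVCB (example names and size)
dd if=/dev/zero of=/dev/rddtestlv bs=128k    # sequential write to the raw LV device
dd if=/dev/rddtestlv of=/dev/null bs=128k    # sequential read back
rmlv -f ddtestlv                             # remove the scratch LV when you are finished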

11.9.2 Verifying your system


So far, we have been demonstrating the dd command. At each step along the way, there are tests that you must run to test the storage infrastructure. We will discuss these tests here so that you will be equipped with testing techniques as you configure your storage for the best performance.

Verify the storage subsystem


After the LUNs have been assigned to the host system, and multipathing software, such as SDD, has discovered the LUNs, it is important to test the storage subsystem. The storage subsystem includes the SAN infrastructure, the host system HBAs, and the DS8000: 1. The first step is to run the following command to review that your storage allocation from the DS8000 is working well with SDD: datapath query essmap Make sure that the LUNs that you have are what you expect. Is the number of paths to the LUNs correct? Are all of the LUNs from different ranks? Are the LUN sizes correct? Output from the command looks like the output in Example 11-40.
Example 11-40 The datapath query essmap command output
{CCF-part2:root}/tmp/perf/scripts -> datapath query essmap
Disk    Path     P Location      adapter LUN SN       Type          Size LSS Vol Rank  C/A S Connection  port RaidMode
------- -------- - ------------- ------- ------------ ------------- ---- --- --- ----- --- - ----------- ---- --------
vpath0  hdisk4     7V-08-01[FC]  fscsi0  75065513000  IBM 2107-900  10.7  48   0  0000  0e Y R1-B2-H1-ZA  100 RAID10
vpath0  hdisk12    7V-08-01[FC]  fscsi0  75065513000  IBM 2107-900  10.7  48   0  0000  0e Y R1-B2-H1-ZB  101 RAID10
vpath0  hdisk20    7k-08-01[FC]  fscsi1  75065513000  IBM 2107-900  10.7  48   0  0000  0e Y R1-B1-H3-ZC   32 RAID10
vpath0  hdisk28    7k-08-01[FC]  fscsi1  75065513000  IBM 2107-900  10.7  48   0  0000  0e Y R1-B1-H3-ZD   33 RAID10
vpath1  hdisk5     7V-08-01[FC]  fscsi0  75065513001  IBM 2107-900  10.7  48   1  fffe  0e Y R1-B2-H1-ZA  100 RAID10
vpath1  hdisk13    7V-08-01[FC]  fscsi0  75065513001  IBM 2107-900  10.7  48   1  fffe  0e Y R1-B2-H1-ZB  101 RAID10
vpath1  hdisk21    7k-08-01[FC]  fscsi1  75065513001  IBM 2107-900  10.7  48   1  fffe  0e Y R1-B1-H3-ZC   32 RAID10
vpath1  hdisk29    7k-08-01[FC]  fscsi1  75065513001  IBM 2107-900  10.7  48   1  fffe  0e Y R1-B1-H3-ZD   33 RAID10
vpath2  hdisk6     7V-08-01[FC]  fscsi0  75065513002  IBM 2107-900  10.7  48   2  fffc  0e Y R1-B2-H1-ZA  100 RAID10
vpath2  hdisk14    7V-08-01[FC]  fscsi0  75065513002  IBM 2107-900  10.7  48   2  fffc  0e Y R1-B2-H1-ZB  101 RAID10
vpath2  hdisk22    7k-08-01[FC]  fscsi1  75065513002  IBM 2107-900  10.7  48   2  fffc  0e Y R1-B1-H3-ZC   32 RAID10
vpath2  hdisk30    7k-08-01[FC]  fscsi1  75065513002  IBM 2107-900  10.7  48   2  fffc  0e Y R1-B1-H3-ZD   33 RAID10
vpath3  hdisk7     7V-08-01[FC]  fscsi0  75065513003  IBM 2107-900  10.7  48   3  fffa  0e Y R1-B2-H1-ZA  100 RAID10
vpath3  hdisk15    7V-08-01[FC]  fscsi0  75065513003  IBM 2107-900  10.7  48   3  fffa  0e Y R1-B2-H1-ZB  101 RAID10
vpath3  hdisk23    7k-08-01[FC]  fscsi1  75065513003  IBM 2107-900  10.7  48   3  fffa  0e Y R1-B1-H3-ZC   32 RAID10
vpath3  hdisk31    7k-08-01[FC]  fscsi1  75065513003  IBM 2107-900  10.7  48   3  fffa  0e Y R1-B1-H3-ZD   33 RAID10
vpath4  hdisk8     7V-08-01[FC]  fscsi0  75065513100  IBM 2107-900  10.7  49   0  ffff  17 Y R1-B2-H1-ZA  100 RAID10
vpath4  hdisk16    7V-08-01[FC]  fscsi0  75065513100  IBM 2107-900  10.7  49   0  ffff  17 Y R1-B2-H1-ZB  101 RAID10
vpath4  hdisk24    7k-08-01[FC]  fscsi1  75065513100  IBM 2107-900  10.7  49   0  ffff  17 Y R1-B1-H3-ZC   32 RAID10
vpath4  hdisk32    7k-08-01[FC]  fscsi1  75065513100  IBM 2107-900  10.7  49   0  ffff  17 Y R1-B1-H3-ZD   33 RAID10

vpath5  hdisk9     7V-08-01[FC]  fscsi0  75065513101  IBM 2107-900  10.7  49   1  fffd  17 Y R1-B2-H1-ZA  100 RAID10
vpath5  hdisk17    7V-08-01[FC]  fscsi0  75065513101  IBM 2107-900  10.7  49   1  fffd  17 Y R1-B2-H1-ZB  101 RAID10
vpath5  hdisk25    7k-08-01[FC]  fscsi1  75065513101  IBM 2107-900  10.7  49   1  fffd  17 Y R1-B1-H3-ZC   32 RAID10
vpath5  hdisk33    7k-08-01[FC]  fscsi1  75065513101  IBM 2107-900  10.7  49   1  fffd  17 Y R1-B1-H3-ZD   33 RAID10
vpath6  hdisk10    7V-08-01[FC]  fscsi0  75065513102  IBM 2107-900  10.7  49   2  fffb  17 Y R1-B2-H1-ZA  100 RAID10
vpath6  hdisk18    7V-08-01[FC]  fscsi0  75065513102  IBM 2107-900  10.7  49   2  fffb  17 Y R1-B2-H1-ZB  101 RAID10
vpath6  hdisk26    7k-08-01[FC]  fscsi1  75065513102  IBM 2107-900  10.7  49   2  fffb  17 Y R1-B1-H3-ZC   32 RAID10
vpath6  hdisk34    7k-08-01[FC]  fscsi1  75065513102  IBM 2107-900  10.7  49   2  fffb  17 Y R1-B1-H3-ZD   33 RAID10

2. Next, run sequential reads and writes to all of the vpath devices (raw or block) for about an hour. Use the commands that we discussed in 11.9.1, Using the dd command to test sequential rank reads and writes on page 376. Then, look at your SAN infrastructure to see how it performs. Look at the UNIX error report. Problems will show up as storage errors, disk errors, or adapter errors. If there are problems, they will not be hard to find in the error report, because there will be a lot of them. The source of the problem can be hardware problems on the storage side of the SAN, Fibre Channel cables or connections, down-level device drivers, or device (HBA) microcode. If you see errors similar to the errors shown in Example 11-41, stop and get them fixed.
Example 11-41 SAN problems reported in UNIX error report
IDENTIFIER  TIMESTAMP   T  C  RESOURCE_NAME  DESCRIPTION
3074FEB7    0915100805  T  H  fscsi0         ADAPTER ERROR
3074FEB7    0915100805  T  H  fscsi3         ADAPTER ERROR
3074FEB7    0915100805  T  H  fscsi3         ADAPTER ERROR
825849BF    0915100705  T  H  fcs0           ADAPTER ERROR
3074FEB7    0915100705  T  H  fscsi3         ADAPTER ERROR
3074FEB7    0915100705  T  H  fscsi0         ADAPTER ERROR
3074FEB7    0914175405  T  H  fscsi0         ADAPTER ERROR
3074FEB7    0914175405  T  H  fscsi0         ADAPTER ERROR
3074FEB7    0914175305  T  H  fscsi0         ADAPTER ERROR
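One possible way to generate this kind of sustained load and then scan the error report is sketched below. The vpath device names and the one-hour duration are assumptions; substitute the devices that exist on your system, and remember the earlier caution about the kill-grep-awk command:

# start background sequential readers against several raw vpath devices (example names)
for DEV in rvpath0 rvpath1 rvpath2 rvpath3
do
    dd if=/dev/$DEV of=/dev/null bs=128k &
done
sleep 3600                                            # let the load run for about an hour
kill -kill `ps -ef | grep dd | awk '{ print $2 }'`    # stop the dd processes
errpt | grep -E "fscsi|fcs|hdisk|vpath"               # look for adapter, disk, and storage errors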

Ensure that after running an hour's worth of dd commands on all your vpaths, there are no storage errors in the UNIX error report.
3. Next, issue the following command to see if SDD is correctly load balancing across paths to the LUNs:
datapath query device
Output from this command looks like Example 11-42.
Example 11-42 The datapath query device command output
{CCF-part2:root}/tmp/perf/scripts -> datapath query device|more
Total Devices : 16

DEV#:   0  DEVICE NAME: vpath0  TYPE: 2107900  POLICY: Optimized
SERIAL: 75065513000
==========================================================================
Path#      Adapter/Hard Disk       State   Mode      Select     Errors
    0      fscsi0/hdisk4           OPEN    NORMAL    220544          0
    1      fscsi0/hdisk12          OPEN    NORMAL    220396          0
    2      fscsi1/hdisk20          OPEN    NORMAL    223940          0
    3      fscsi1/hdisk28          OPEN    NORMAL    223962          0

DEV#:   1  DEVICE NAME: vpath1  TYPE: 2107900  POLICY: Optimized
SERIAL: 75065513001
==========================================================================
Path#      Adapter/Hard Disk       State   Mode      Select     Errors
    0      fscsi0/hdisk5           OPEN    NORMAL    219427          0
    1      fscsi0/hdisk13          OPEN    NORMAL    219163          0
    2      fscsi1/hdisk21          OPEN    NORMAL    223578          0
    3      fscsi1/hdisk29          OPEN    NORMAL    224349          0

DEV#:   2  DEVICE NAME: vpath2  TYPE: 2107900  POLICY: Optimized
SERIAL: 75065513002
==========================================================================
Path#      Adapter/Hard Disk       State   Mode      Select     Errors
    0      fscsi0/hdisk6           OPEN    NORMAL    218881          0
    1      fscsi0/hdisk14          OPEN    NORMAL    219835          0
    2      fscsi1/hdisk22          OPEN    NORMAL    222697          0
    3      fscsi1/hdisk30          OPEN    NORMAL    223918          0

DEV#:   3  DEVICE NAME: vpath3  TYPE: 2107900  POLICY: Optimized
SERIAL: 75065513003
...

Check to make sure, for every LUN, that the counters under the Select column are the same and that there are no errors. 4. Next, spot-check the sequential read speed of the raw vpath device. The following command is an example of the command run against a LUN called vpath0. For the LUNs that you test, ensure that they each yield the same results: time dd if=/dev/rvpath0 of=/dev/null bs=128k count=781

Tip: For this dd command, for the first time that it is run against rvpath0, the I/O must be read from disk and staged to the DS8000 cache. The second time that this dd command is run, the I/O is already in cache. Notice the shorter read time when we get an I/O cache hit. Of course, if any of these LUNs are on ranks that are also used by another application, you see a variation in the throughput. If there is a large variation in the throughput, perhaps you need to give that LUN back to the storage administrator; trade for another LUN. You want all your LUNs to have the same performance. If everything looks good, continue with the configuration of volume groups and logical volumes.
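To spot-check several LUNs in one pass, a loop like the following might be used. The device names are examples; also remember that the second read of the same device is likely to be served from the DS8000 cache, as explained in the Tip above:

for DEV in rvpath0 rvpath1 rvpath2 rvpath3      # example vpath device names
do
    echo "Testing /dev/$DEV"
    time dd if=/dev/$DEV of=/dev/null bs=128k count=781
done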

Verify the logical volumes


The next time to stop and look at how your DS8000 storage performs is after the logical volumes have been created. Remember that after volume groups and logical volumes have been created, it is a disastrous idea to use the dd command to perform sequential writes to the raw vpath device, so do not do that. It is a great idea to create a temporary raw logical volume on the vpaths and use it for testing: 1. Put the nmon monitor up for a quick check on I/O throughput performance and vpath balance. 2. Test the sequential read speed on every raw logical volume device, if practical, or at least a decent sampling if you have too many to test each one. The following command is an example of the command run against a logical volume called 38glv. Perform this test against all your logical volumes to ensure that they each yield the same results: time dd if=/dev/r38glv of=/dev/null bs=128k count=781 3. Use the dd command without the time or count options to perform sequential reads and writes against all your logical volumes, raw or block devices. Watch nmon for each LUNs Mb/s and IOPS. Monitor the adapter. Notice the following characteristics:

- Performance is the same for all the logical volumes.
- Raw logical volume devices (/dev/rlvname) are faster than the counterpart block logical volume devices (/dev/lvname) as long as the blocksize specified is more than 4 KB. Larger blocksizes result in higher MB/s but reduced IOPS for raw logical volumes.
- The blocksize does not affect the throughput of a block (not raw) logical volume, because, in AIX, the LVM imposes an I/O blocksize of 4 KB. Verify this size by running the dd command against a raw logical volume with a blocksize of 4 KB. This performance is the same as running the dd command against the non-raw logical volume.
- Reads are faster than writes.
- With inter-disk logical volumes, nmon does not report that all the LUNs have input at the same time, as with a striped logical volume. This result is normal and has to do with the nmon refresh rate and the characteristics of inter-disk logical volumes.
4. Ensure that the UNIX errorlog is clear of storage-related errors.

Verify the filesystems and characterize performance


After the filesystems have been created, it is a good idea to take time to characterize and document the filesystems' performance. A simple way to test sequential write and read speeds for filesystems is to time how long it takes to create a large sequential file and then how long it takes to copy the same file to /dev/null. After creating the file for the write test, you need to take care that the file is not still cached in host memory, which will invalidate the read test, because the data will come from RAM instead of disk. The lmktemp command, used next, is currently on AIX 5.3 and has been around for a long time. It creates a file and lets you control how large the file needs to be. It does not appear to be supported by any AIX documentation and therefore might disappear in future releases of AIX.
A simple sequential write test:
cd /interdiskfs
time lmktemp 2GBtestfile 2000M
real 0m18.83s
user 0m0.15s
sys  0m18.68s
Divide 2000 MB by 18.83 seconds = 107 MB/s sequential write speed.
Sequential read speed:
cd /
unmount /interdiskfs     (this command flushes the file from operating system (jfs, jfs2) memory)
mount /interdiskfs
cd -                     (cd back to the previous directory, /interdiskfs)
time dd if=/interdiskfs/2GBtestfile of=/dev/null bs=128k
real 0m11.19s
user 0m0.39s
sys  0m10.31s
Divide 2000 MB by 11.19 seconds = 178 MB/s read speed.
Now that the DS8000 cache is primed, run the test again. When we ran the test again, we got 4.08 seconds. Priming the cache is a good idea for isolated application read testing. If you have an application, such as a database, and you perform several isolated fixed reads, ignore the first run and measure the second run to take advantage of read hits from


the DS8000 cache, because these results are a more realistic measure of how the application will perform. For HP-UX, use the prealloc command instead of lmktemp for AIX to create large files. For Sun Solaris, use the mkfile command. Note: The prealloc command for HP-UX and the lmktemp command for AIX have a 2 GB size limitation. Those commands are not able to create a file greater than 2 GB in size. If you want a file larger than 2 GB for a sequential read test, concatenate a couple of 2 GB files.
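For reference, the corresponding file-creation commands on the other platforms, and one way to obtain a larger read-test file by concatenation, might look like the following; the file names and sizes are examples, and the exact options should be verified against the respective man pages:

mkfile 2000m /interdiskfs/2GBtestfile          # Sun Solaris
prealloc /interdiskfs/2GBtestfile 2097152000   # HP-UX (the size is given in bytes)
cat 2GBtestfile 2GBtestfile > 4GBtestfile      # join two 2 GB files for a larger read test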


12

Chapter 12.

Performance considerations with VMware


This chapter discusses the monitoring and tuning tools and techniques that can be used with VMware ESX to optimize throughput and performance when attaching to the DS8000. It addresses the following topics:
- I/O architecture from a VMware perspective
- Initial planning considerations for optimum performance of VMware host systems using the DS8000 in a storage area network (SAN)
- Specific VMware performance measuring tools and tuning options
- SAN multipathing considerations
- Testing and verifying the DS8000 storage attached to VMware host systems
- Configuring VMware logical storage for optimum performance
- VMware operating system tuning considerations for maximum storage performance


12.1 Disk I/O architecture overview


The DS8000 currently supports the VMware high-end virtualization solution Virtual Infrastructure 3 and the included VMware ESX Server starting with Version 2.5. Other VMware products, such as VMware Server and Workstation, are not intended for the data center-class environments where the DS8000 is typically used. This chapter introduces the relevant logical configuration concepts needed to attach VMware ESX Server to a DS8000 and will focus on performance-relevant configuration and measuring options. For further information about how to set up ESX Server with the DS8000, refer to IBM System Storage DS8000 Architecture and Implementation, SG24-6786. You can obtain general recommendations about how to set up VMware with IBM hardware from Tuning IBM System x Servers for Performance, SG24-5287. VMware ESX Server supports the usage of external storage that can reside on a DS8000 system. DS8000 storage is typically connected by Fibre Channel (FC) and usually accessed over a Storage Area Network (SAN). Each logical volume that is accessed by a ESX Server is configured in a specific way, and this storage can be presented to the Virtual Machines (VM) as virtual disks. To understand how storage is configured in ESX Server, it is necessary to understand the layers of abstraction that are shown in Figure 12-1.

Figure 12-1 Storage stack for ESX Server (virtual disks (vmdk) in the virtual machine, a VMFS volume in the ESX Server, and a DS8000 logical volume as the external storage)

For VMware to use external storage, VMware needs to be configured with logical volumes that are defined in accordance with the users expectations, which might include using RAID or striping at a storage hardware level. These logical volumes have to be presented to the ESX Server. For the DS8000, host access to the volumes includes the proper configuration of logical volumes, host mapping, correct LUN masking, and zoning of the involved SAN fabric. At the ESX Server layer, these logical volumes can be addressed as a VMware ESX Server File System (VMFS) volume or as a raw disk using Raw Device Mapping (RDM). A VMFS volume is a storage resource that can serve several VMs as well as several ESX Servers as


consolidated storage, whereas a RDM volume is intended for usage as isolated storage by a
single VM. On the Virtual Machine layer, you can configure one or several Virtual Disks (VMDKs) out of a single VMFS volume. These Virtual Disks can be configured to be used by several VMs.

VMware datastore concept


ESX Server uses specially formatted logical containers called datastores. These datastores can reside on various types of physical storage devices, local disks inside the ESX Server, and FC-attached disks, as well as iSCSI disks and Network File System (NFS) disks. The virtual machines disks are stored as files within a VMware ESX Server File System (VMFS). When a guest operating system issues a Small Computer System Interface (SCSI) command to its virtual disks, the VMware virtualization layer converts this command to VMFS file operations. From the standpoint of the virtual machine operating system, each virtual disk is recognized as a direct-attached SCSI drive connected to a SCSI adapter. Device drivers in the virtual machines operating system communicate with the VMware virtual SCSI controllers. Figure 12-2 illustrates the virtual disk mapping within VMFS.

Figure 12-2 Mapping of virtual disks to LUNs within VMFS

VMFS is optimized to run multiple virtual machines as one workload to minimize disk I/O overhead. A VMFS volume can be spanned across several logical volumes, but there is no striping available to improve disk throughput in these configurations. Each VMFS volume can be extended by adding additional logical volumes while the virtual machines use this volume. A raw device mapping (RDM) is implemented as a special file in a VMFS volume that acts as a proxy for a raw device. RDM combines advantages of direct access to physical devices with the advantages of virtual disks in the VMFS. In certain special configurations, you must use RDM raw devices, such as in Microsoft Cluster Services (MSCS) clustering, using virtual


machine snapshots, or using VMotion, which enables the migration of running virtual machines from one ESX Server to another with zero downtime. With RDM volumes, ESX Server supports the usage of N_Port ID Virtualization (NPIV). This HBA virtualization technology allows a single physical host bus adapter (HBA) port to function as multiple logical ports, each with its own separate worldwide port name (WWPN). This function can be helpful when migrating virtual machines between ESX Servers using VMotion, as well as to separate the workloads of multiple VMs configured to the same paths at the HBA level for performance measuring purposes. The VMware ESX virtualization of datastores is shown in Figure 12-3.

Figure 12-3 VMware virtualization of datastores

12.2 Multipathing considerations


In ESX Server, the name of the storage device is displayed as a sequence of three to four numbers separated by colons, for example, vmhba2:0:1:1. This naming has the following meaning: <HBA>:<SCSI target>:<SCSI LUN>:<disk partition> The abbreviation vmhba refers to the physical Host Bus Adapter (HBA) types: either a Fibre Channel HBA, a SCSI adapter, or even an iSCSI initiator. The SCSI target and SCSI LUN numbers are assigned during scanning the HBAs for available storage and usually will not change afterward. The fourth number indicates a partition on a disk that a VMFS datastore occupies and must never change for a selected disk. Thus, in this example, vmhba2:0:1:1 refers to the first partition on SCSI LUN1, SCSI target 1, and accessed through HBA 2. In multipathing environments, which are the standard configuration in properly implemented SAN environments, the same LUN can be accessed via several paths. The same LUN has


actually two or more storage device names. After a rescan or reboot, the path information displayed by ESX Server might change; however, the name still refers to the same physical device. In Figure 12-4, the identical LUN can be addressed as vmhba2:0:1 or vmhba2:1:1, which can easily be verified in the Canonical Path column. If more than one HBA is used to connect to the SAN-attached storage for redundancy reasons, this LUN can also be addressed via a different HBA, in this configuration, for example, vmhba5:0:1 and vmhba5:1:2.

Figure 12-4 Storage Adapter properties view in the VI Client

ESX Server provides a built-in multipathing support, which means it is not necessary to install any additional failover driver. Any external failover drivers, such as SDD, are not supported for VMware ESX. In the current implementation, ESX Server only supports path failover, which means that only one of the paths of one LUN is active at a time. Only in the case of a failure of any element in the active path will the multipathing module perform a path failover, which means change the active path to another path that is still available. VMware ESX currently does not support dynamic load balancing. However, there is already a Round-robin load balancing scheme available, which is considered experimental and is not to be used in production environments. ESX Server 3 provides two multipathing policies for usage in production environments: Fixed and Most Recently Used (MRU). MRU policy is designed for usage by active/passive storage devices, such as DS4000, which only have one active controller available per LUN. The Fixed policy is the best choice for attaching a DS8000 to VMware ESX, because this policy makes sure that the designated preferred path to the storage is used whenever available. During a path failure, an alternate path will be used, and as soon as the preferred path is available again, the multipathing module will switch back to it as the active path. The multipathing policy and the preferred path can be configured from the VI Client or by using the command line tool esxcfg-mpath. Figure 12-5 shows how the preferred path is changed from the VI Client.


Figure 12-5 Manage Paths window in the VI Client
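The same path information can also be checked from the ESX Server service console. For example, the following command lists each LUN together with its paths, the multipathing policy in effect, and the preferred path (the exact output format depends on the ESX release); the options for changing the policy and the preferred path are shown in the usage help of the same tool:

esxcfg-mpath -l     # list all LUNs, their paths, the active policy, and the preferred path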

By using the Fixed multipathing policy, you can implement a static load balancing if several LUNs are attached to the VMware ESX Server. The multipathing policy is set on a per LUN basis, as well as the preferred path is chosen for each LUN individually. If the ESX Server is connected over four paths to its DS8000 storage, we recommend that you spread the preferred paths over all four available physical paths. For example, when you want to configure four LUNs, assign the preferred path of LUN0 via the first path, the one for LUN1 via the second path, the preferred path for LUN2 via the third path, and the one for LUN3 via the fourth path. This method allows you to spread the throughput over all physical paths in the SAN fabric and thus results in an optimized performance regarding the physical connections between ESX Server and DS8000. If the workload varies greatly between the accessed LUNs, it might be a good approach to monitor the performance on the paths and adjust the configuration according to the workload. It might be necessary to assign one path as preferred to just one LUN having high workload but sharing another path as preferred between five separate LUNs showing moderate workloads. Of course, this static load balancing will only work if all paths are available. As soon as one path fails, all LUNs having selected this failing path as preferred will fail over to another path and put additional workload onto those paths. Furthermore, there is no capability to influence the failover algorithm to which path the failover will occur. When the active path fails, for example, due to a physical path failure, I/O might pause for about 30 to 60 seconds until the FC driver determines that the link is down and fails over to one of the remaining paths. This behavior can cause the virtual disks used by the operating systems of the virtual machines to appear unresponsive. After failover is complete, I/O resumes normally. The timeout value for detecting a failed link can be adjusted, it is usually set in the HBA BIOS or driver and thus the way to set this option depends on the HBA hardware and vendor. In general, the recommended failover timeout value is 30 seconds. With VMware ESX, you can adjust this value by editing the device driver options for the installed HBAs in /etc/vmware/esx.conf. Additionally, you can increase the standard disk timeout value in the virtual machines operating system to make sure that the operating system is not extensively disrupted and to make sure that the system is logging permanent errors during the failover phase. Adjusting this timeout again depends on the operating system that is used; refer to the appropriate technical documentation for details.


12.3 Performance monitoring tools


This section reviews performance monitoring tools available with VMware ESX.

12.3.1 Virtual Center Performance Statistics


Virtual Center (VC) is the entry point for virtual platform management in VMware ESX. It also includes a module to view and analyze performance statistics and counters. The Virtual Center performance counters collection is reduced by default to a minimum level, but you can modify the settings to allow a detailed analysis. VC includes real-time performance counters displaying the past hour (which is not archived), as well as archived statistics that are stored in a database. The real-time statistics are collected every 20 seconds and presented in the Virtual Infrastructure Client (VI Client) for the past 60 minutes (Figure 12-6 on page 390). These real-time counters are also the basis for the archived statistics, but to avoid the performance database expanding too much, the granularity is recalculated according to the age of the performance counters. Virtual Center collects those real-time counters, aggregates them for a data point every five minutes and stores them as past day statistics in the database. After one day, these counters are aggregated once more to a 30 minute interval for the past week statistics, for the past month, a data point is available every two hours, and for the last year, one datapoint is stored per day. In general, the Virtual Center statistics are a good basis to get an overview about the actual performance statistics and to further analyze performance counters over a longer period of time, for example, several days or weeks. If a granularity of 20 second intervals is sufficient for your individual performance monitoring perspective, VC can be a good data source after configuration. You can obtain more information about how to use the Virtual Center Performance Statistics at: http://communities.vmware.com/docs/DOC-5230


Figure 12-6 Real-time Performance Statistics in VI Client

12.3.2 Performance monitoring with esxtop


The esxtop command line tool provides the finest granularity among the performance counters available within VMware ESX Server. The tool is available on the ESX Server service console, and you must have root privileges to use the tool. Esxtop is available for usage either in Interactive Mode or in Batch Mode. When using Interactive Mode, the performance statistics are displayed inside the command line console, whereas Batch Mode allows you to collect and save the performance counters in a file. The esxtop utility reads its default configuration from a file called ~/.esxtop3rc. The best way to configure this default configuration to fit your needs is to change and adjust it for a running esxtop process and then save this configuration using the W interactive command. Example 12-1 illustrates the basic adjustments required to monitor disk performance on a SAN-attached storage device.
Example 12-1 Esxtop basic adjustment for monitoring disk performance
esxtop                          #starts esxtop in Interactive Mode
(esxtop screen output follows: PCPU utilization and the running worlds, such as
 drivers, vmotion, console, vmware-vmkauthd, Virtual Center, Windows1-3, and Linux 1)
d                               #changes to disk storage utilization panels
e vmhba2                        #selects expanded display of vmhba2
a vmhba2:0                      #selects expanded display of SCSI channel 0
t vmhba2:0:0                    #selects expanded display mode of SCSI target 0
W                               #writes the current configuration into ~/.esxtop3rc file


After this initial configuration, the performance counters are displayed as shown in Example 12-2.
Example 12-2 Disk performance metrics in esxtop
1:25:39pm up 12 days 23:37, 86 worlds; CPU load average: 0.36, 0.14, 0.17 ADAPTR CID TID LID vmhba1 vmhba2 0 0 0 vmhba2 0 0 1 vmhba2 0 0 2 vmhba2 0 0 3 vmhba2 0 0 4 vmhba2 0 0 5 vmhba2 0 0 6 vmhba2 0 0 7 vmhba2 0 0 8 vmhba2 0 0 9 vmhba2 0 1 vmhba3 vmhba4 vmhba5 NCHNS NTGTS NLUNS NVMS AQLEN LQLEN WQLEN ACTV %USD 2 1 1 32 238 0 0 1 1 1 10 4096 32 0 8 25 1 1 1 10 4096 32 0 0 0 1 1 1 10 4096 32 0 0 0 1 1 1 9 4096 32 0 0 0 1 1 1 4 4096 32 0 0 0 1 1 1 17 4096 32 0 0 0 1 1 1 4 4096 32 0 0 0 1 1 1 4 4096 32 0 0 0 1 1 1 4 4096 32 0 0 0 1 1 1 4 4096 32 0 0 0 1 1 10 76 4096 0 0 1 2 4 16 4096 0 0 1 2 4 16 4096 0 0 1 2 20 152 4096 0 0 LOAD CMDS/s READS/s WRITES/s MBREAD/s 4.11 0.20 3.91 0.00 0.25 25369.69 25369.30 0.39 198.19 0.00 0.00 0.00 0.00 0.00 0.00 0.39 0.00 0.39 0.00 0.00 0.39 0.00 0.39 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.78 0.00 0.78 0.00

Additionally, you can change the field order and select or deselect various performance counters in the view. The minimum refresh rate is 2 seconds, and the default setting is 5 seconds. When using esxtop in Batch Mode, always include all of the counters by using the option -a. To collect the performance counters every 10 seconds for 100 iterations and save them to file, invoke esxtop this way: esxtop -b -a -d 10 -n 100 > perf_counters.csv Additional information about how to use esxtop is provided in the ESX Server Resource Management Guide: http://www.vmware.com/pdf/vi3_301_201_resource_mgmt.pdf

12.3.3 Guest-based performance monitoring


Because the operating systems running in the virtual machines are hosting the applications performing the host workload, it makes sense to use performance monitoring in these operating systems as well. We describe the tools that you use in Chapter 10, Performance considerations with Windows Servers on page 281 and Chapter 13, Performance considerations with Linux on page 401. The guest operating system is, of course, unaware of the underlying VMware ESX virtualization layer, so any performance data captured inside the VMs can be misleading and must only be analyzed and interpreted in conjunction with the actual configuration and performance data gathered in ESX Server or on a disk or SAN layer. There is one additional benefit of using the Windows Performance Monitor Perfmon (refer to 10.8.2, Windows Performance console (perfmon) on page 294). When you use esxtop in Batch Mode with option -a, it collects all available performance counters and thus the collected comma-separated values (csv) data gets very large and cannot be easily parsed. Perfmon can be very helpful to quickly analyze results or to reduce the amount of csv data to a subset of counters that can be analyzed more easily using other utilities. You can obtain more information about importing the esxtop csv output into Perfmon from: http://communities.vmware.com/docs/DOC-5100


12.4 VMware specific tuning for maximum performance


Due to the special VMware ESX Server setup and the additional virtualization layer implemented in ESX Server, it is necessary to focus on additional topics and configuration options when discussing tuning VMware ESX. This section focuses on important points when discussing tuning VMware ESX Server with attached DS8000 Storage to achieve maximum performance.

12.4.1 Workload spreading


We generally recommend to spread the I/O workload across the available hardware. This method is the most effective way to avoid any hardware limitations of either the HBA, processor complex, device adapter, or disk drives that negatively impact the potential performance of your ESX Server. As discussed in 5.1, Basic configuration principles for optimal performance on page 64, it is also important to identify and separate specific workloads, because they can negatively influence other workloads that might be more business critical. Within ESX Server, it is not possible to configure striping over several LUNs for one datastore. It is possible to add more than one LUN to a datastore, but adding more than one LUN to a datastore just extends the available amount of storage by concatenating one or more additional LUNs without balancing the data over the available logical volumes. The easiest way to implement striping over several hardware resources is to use the Storage Pool Striping (SPS) function (refer to Storage Pool Striping extent rotation on page 51 for further information) of the attached DS8000. The only other possibility to achieve striping at the Virtual Machine level is to configure several virtual disks for a VM that are located on different hardware resources, such as different HBAs, device adapters, or servers, and then configure striping of those virtual disks within the host operating system layer. We recommend that you use Storage Pool Striping, because the striping only has to be implemented once when configuring the DS8000. Implementing the striping on the host operating system level requires you to configure it for each of the VMs separately. Furthermore, according to the VMware documentation, host-based striping is currently only supported using striping within Windows dynamic disks. For performance monitoring purposes, be careful with spanned volumes or even avoid these configurations. When configuring more than one LUN to a VMFS datastore, the volume space is spanned across the multiple LUNs, which can cause an imbalance in the utilization of those LUNs. If several virtual disks are initially configured within a datastore and the disks are mapped to different virtual machines, it is no longer possible to identify in which area of the configured LUNs the data of each VM is allocated. Thus, it is no longer possible to pinpoint which host workload causes a possible performance problem. In summary, avoid using spanned volumes and configure your systems with just one LUN per datastore.
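As a hedged illustration of what enabling Storage Pool Striping looks like on the DS8000 itself, a fixed block volume whose extents are rotated across the ranks of a multi-rank extent pool can be created with the DS CLI by selecting the rotate extents allocation method. The storage image ID, extent pool, capacity, nickname, and volume IDs below are made-up examples, and the exact parameter names should be verified against the DS CLI reference for your microcode level:

mkfbvol -dev IBM.2107-75XY123 -extpool P4 -cap 100 -eam rotateexts -name esx_#h 1000-1003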

12.4.2 Virtual Machines sharing the same LUN


The SCSI protocol allows multiple commands to be active for the same LUN at a time. A configurable parameter called LUN queue depth determines how many commands can be active at one time for a given LUN. This queue depth parameter is handled by the SCSI driver for a specific HBA. Depending on the HBA type, you can configure up to 255 outstanding

commands for a QLogic HBA, and Emulex supports up to 128. The default value for both vendors is 32. If a Virtual Machine generates more commands to a LUN than the LUN queue depth, these additional commands are queued in the ESX kernel, which increases the latency. The queue depth is defined on a per LUN basis, not per initiator. An HBA (SCSI initiator) supports many more outstanding commands. For ESX Server, if two Virtual Machines access their virtual disks on two different LUNs, each VM can generate as many active commands as the LUN queue depth. But if those two Virtual Machines have their virtual disks on the same LUN (within the same VMFS volume), the total number of active commands that the two VMs combined can generate without queuing I/Os in the ESX kernel is equal to the LUN queue depth. Therefore, when several Virtual Machines share a LUN, the maximum number of outstanding commands to that LUN from all those VMs together must not exceed the LUN queue depth. Within ESX Server, there is a configuration parameter Disk.SchedNumReqOutstanding, which can be configured from the Virtual Center. If the total number of outstanding commands from all Virtual Machines for a specific LUN exceeds this parameter, the remaining commands are queued to the ESX kernel. This parameter must always be set at the same value as the queue depth for the HBA. To reduce latency, it is important to make sure that the sum of active commands from all Virtual Machines of an ESX Server does not frequently exceed the LUN queue depth. If the LUN queue depth is exceeded regularly, you might either increase the queue depth or move the virtual disks of a few Virtual Machines to different VMFS volumes. Therefore, you lower the number of Virtual Machines accessing a single LUN. The maximum LUN queue depth per ESX Server must not exceed 64. The maximum LUN queue depth per ESX Server can be up to 128 only when a server has exclusive access to a LUN. VMFS is a filesystem for clustered environments, and it uses SCSI reservations during administrative operations, such as creating or deleting virtual disks or extending VMFS volumes. A reservation makes sure that at a given time, a LUN is only available to one ESX Server exclusively. These SCSI reservations usually are only used for administrative tasks that require a metadata update. To avoid SCSI reservation conflicts in a productive environments with several ESX Servers accessing shared LUNs, it might be helpful to perform those administrative tasks at off-peak hours. If this is not possible, perform the administrative tasks from an ESX Server that also hosts I/O-intensive Virtual Machines, which will be less impacted because the SCSI reservation is set on SCSI initiator level, which means for the complete ESX Server. The maximum number of Virtual Machines that can share the same LUN depends on several conditions. In general, Virtual Machines with heavy I/O activity result in a smaller number of possible VMs per LUN. Additionally, you must consider the already discussed LUN queue depth limits per ESX Server and the storage system specific limits.
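On ESX Server 3.x, these two settings are typically adjusted from the service console along the following lines. The QLogic module name varies with the HBA model and ESX release (Emulex HBAs use a different module and option name), so treat the commands below as a sketch and verify the module name with esxcfg-module -l and the VMware SAN configuration guide first:

esxcfg-module -s ql2xmaxqdepth=64 qla2300_707       # set the QLogic LUN queue depth to 64 (module name is an example)
esxcfg-boot -b                                      # rebuild the boot configuration; a reboot is required afterward
esxcfg-advcfg -s 64 /Disk/SchedNumReqOutstanding    # keep Disk.SchedNumReqOutstanding equal to the LUN queue depth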

12.4.3 ESX filesystem considerations


VMware ESX Server offers two possibilities to manage Virtual Disks: VMware ESX Server File System (VMFS) and Raw Device Mapping (RDM). VMFS is a clustered file system that allows concurrent access by multiple hosts. RDM is implemented as a proxy for a raw physical device. It uses a mapping file containing metadata, and all disk traffic is redirected to the physical device. RDM can only be accessed by one Virtual Machine exclusively.


RDM offers two configuration modes: virtual compatibility mode and physical compatibility mode. When using physical compatibility mode, all SCSI commands toward the virtual disk are passed directly to the device, which means all physical characteristics of the underlying hardware become apparent. Within virtual compatibility mode, the virtual disk is mapped as a file within a VMFS volume, allowing advanced file locking support and the usage of snapshots. Figure 12-7 compares both possible RDM configuration modes and VMFS.

Figure 12-7 Comparison of RDM virtual and physical modes with VMFS

The implementations of VMFS and RDM imply a possible impact on the performance of the virtual disks; therefore, all three possible implementations have been tested together with the DS8000. This section summarizes the outcome of those performance tests. In general, it turned out that the filesystem selection has only a very limited impact on the performance:
- For random workloads, the measured throughput is almost equal between VMFS, RDM physical, and RDM virtual. Only for read requests of 32 KB, 64 KB, and 128 KB transfer sizes do both RDM implementations show a slight performance advantage (Figure 12-8 on page 395).
- For sequential workloads, for all transfer sizes, a slight performance advantage for both RDM implementations was verified against VMFS. For all sequential write and certain read requests, the measured throughput for RDM virtual was slightly higher than for RDM physical mode, which might be caused by additional caching of data within the virtualization layer, which is not used in RDM physical mode (Figure 12-9 on page 395).


Figure 12-8 Result of random workload test for VMFS, RDM physical, and RDM virtual

Figure 12-9 Result of sequential workload test for VMFS, RDM physical, and RDM virtual (sequential throughput in MBps versus transfer size in KB)

To summarize, the choice between the available filesystems, VMFS and RDM, only has a very limited influence on the data performance of the Virtual Machines. These tests verified a possible performance increase of about 2 - 3%.


12.4.4 Aligning partitions


In a RAID array, the smallest hardware unit used to build a logical volume (LUN) is called a stripe. These stripes are distributed onto several physical drives in the array according to the RAID algorithm that is used. Usually, stripe sizes are much larger than sectors. For the DS8000, we use a 256 KB stripe size for RAID 5 and RAID 10 and 192 KB for RAID 6 in an Open Systems attachment. Thus, a SCSI request that intends to read a single sector in reality reads one stripe from disk. When using VMware ESX, each VMFS datastore segments the allocated LUN into blocks, which can be between 1 MB and 8 MB in size. The filesystem used by the Virtual Machines operating system optimizes I/O by grouping several sectors into one cluster, the cluster size usually is in the range of several KB. If the VMs operating system reads a single cluster from its virtual disk, at least one block (within VMFS) and all the corresponding stripes on physical disk need to be read. Depending on the sizes and the starting sector of the clusters, blocks, and stripes, reading one cluster might require reading two blocks and all of the corresponding stripes. Figure 12-10 illustrates that in an unaligned structure, a single I/O request can cause additional I/O operations. Thus, an unaligned partition setup results in additional I/O incurring a penalty on throughput and latency and leads to lower performance for the host data traffic.

Figure 12-10 Processing of a data request in an unaligned structure

An aligned partition setup makes sure that a single I/O request results in a minimum number of physical disk I/Os, eliminating the additional disk operations, which, in fact, results in an overall performance improvement. Operating systems using the x86 architecture create partitions with a master boot record (MBR) of 63 sectors. This design is a relic of legacy BIOS code from personal computers that used cylinder, head, and sector addressing instead of Logical Block Addressing (LBA). The first track is always reserved for the MBR, and the first partition starts at the second track


(cylinder 0, head 1, and sector 1), which is sector 63 in LBA. Also, in todays operating systems, the first 63 sectors cannot be used for data partitions. The first possible start sector for a partition is 63. In a VMware ESX environment, because of the additional virtualization layer implemented by ESX, this partition alignment has to be performed for both layers: VMFS and the host filesystems. Because of that additional layer, using properly aligned partitions is considered to have even a higher performance effect than in the usual host setups without an additional virtualization layer. Figure 12-11 illustrates how a single I/O request is fulfilled within an aligned setup without causing additional physical disk I/O.

Figure 12-11 Processing a data request in an aligned structure

Partition alignment is a known issue in filesystems, but its effect on performance is somehow controversial. In performance lab tests, it turned out that in general all workloads show a slight increase in throughput when the partitions are aligned. A significant effect can only be verified on sequential workloads. Starting with transfer sizes of 32 KB and larger, we recognized performance improvements of up to 15%. In general, aligning partitions can improve the overall performance. For random workloads, we only identified a slight effect, whereas for sequential workloads, a possible performance gain of about 10% seems to be realistic. So, we can recommend to align partitions especially for mainly sequential workload characteristics. Aligning partitions within an ESX Server environment requires two steps. First, the VMFS partition needs to be aligned, and then, the partitions within the VMware guest systems filesystems have to be aligned as well for maximum effectiveness. You can only align the VMFS partition when configuring a new datastore. When using the VI client, the new partition is automatically configured to an offset of 128 sectors = 64 KB. But, in fact, this configuration is not ideal when using DS8000 disk storage. As the DS8000 uses larger stripe sizes, the offset must be configured to at least the stripe size. For RAID 5 and

RAID 10 in Open Systems attachment, the stripe size is 256 KB, and it is a good approach to set the offset to 256 KB (or 512 sectors). To configure an individual offset is only possible from the ESX Server command line. Example 12-3 shows how to create an aligned partition with an offset of 512 using fdisk.
Example 12-3 Creating an aligned VMFS partition using fdisk fdisk /dev/sdf #invoke fdisk for /dev/sdf Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel Building a new DOS disklabel. Changes will remain in memory only, until you decide to write them. After that, of course, the previous content won't be recoverable.

The number of cylinders for this disk is set to 61440. There is nothing wrong with that, but this is larger than 1024, and could in certain setups cause problems with: 1) software that runs at boot time (e.g., old versions of LILO) 2) booting and partitioning software from other OSs (e.g., DOS FDISK, OS/2 FDISK) Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite) Command (m for help): n #create a new partition Command action e extended p primary partition (1-4) p Partition number (1-4): 1 First cylinder (1-61440, default 1): Using default value 1 Last cylinder or +size or +sizeM or +sizeK (1-61440, default 61440): Using default value 61440 Command (m for help): t Selected partition 1 Hex code (type L to list codes): fb Changed system type of partition 1 to fb (Unknown) Command (m for help): x Expert command (m for help): b Partition number (1-4): 1 New beginning of data (32-125829119, default 32): 512 Expert command (m for help): w The partition table has been altered! Calling ioctl() to re-read partition table. Syncing disks. fdisk -lu /dev/sdf #check the partition config #set partitions system id #fb = VWware VMFS volume

#enter expert mode #set starting block number #partition offset set to 512 #save changes

Disk /dev/sdf: 64.4 GB, 64424509440 bytes 64 heads, 32 sectors/track, 61440 cylinders, total 125829120 sectors Units = sectors of 1 * 512 = 512 bytes Device Boot /dev/sdf1 Start End Blocks 512 125829119 62914304 Id fb System Unknown


Afterwards a new VMFS filesystem has to be created within the aligned partition using the vmkfstools command as shown in Example 12-4.
Example 12-4 Creating a VMFS volume using vmkfstools vmkfstools -C vmfs3 -b 1m -S LUN0 vmhba2:0:0:1 Creating vmfs3 file system on "vmhba2:0:0:1" with blockSize 1048576 and volume label "LUN0". Successfully created new volume: 490a0a3b-cabf436e-bf22-001a646677d8

As the last step, all the partitions at the virtual machine level must be aligned as well. This task needs to be performed from the operating system of each VM using the available tools. For example, for Windows, use the diskpart utility as shown in Example 12-5. Windows only allows you to align basic partitions, and the offset size is set in KB (not in sectors).
Example 12-5 Creating an aligned NTFS partition using diskpart
DISKPART> create partition primary align=256
DiskPart succeeded in creating the specified partition.
DISKPART> list partition
  Partition ###   Type              Size     Offset
  -------------   ----------------  -------  -------
* Partition 1     Primary             59 GB   256 KB
DISKPART>

You can obtain additional information about aligning VMFS partitions and the performance effects from the document VMware Infrastructure 3: Recommendations for aligning VMFS partitions at: http://www.vmware.com/pdf/esx3_partition_align.pdf

12.5 Tuning of Virtual Machines


After the ESX Server environment has been tuned to optimal performance, the individual Virtual Machines and their operating systems have to be examined more closely. In general, all the information discussed in the chapters for the Windows and Linux operating systems also applies to those environments where these operating systems run in a Virtual Machine in an ESX environment. Due to the additional virtualization layer implemented by ESX Server, it is necessary to pay attention to the following specific information: ESX Server emulates either a BusLogic or an LSI Logic SCSI adapter. The specifics of the implementation of the SCSI driver that is used inside the Virtual Machine's operating system can affect disk I/O performance. For the BusLogic adapters, VMware provides a driver customized for Windows that is recommended for high performance. The BusLogic driver is part of the VMware tools and will be installed automatically when the VMware tools are installed.


In Virtual Machines running Windows using the LSILogic driver, I/Os larger than 64 KB might be split into multiple I/Os of a maximum size of 64 KB, which might negatively affect I/O performance. You can improve I/O performance by editing the registry setting: HKLM\SYSTEM\CurrentControlSet\Services\Symmpi\Parameters\Device\MaximumSGList For further information, refer to Large Block Size Support at: http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc& externalId=9645697&sliceId=1


13

Chapter 13.

Performance considerations with Linux


This chapter discusses the monitoring and tuning tools and techniques that can be used with Linux systems to optimize throughput and performance when attaching the DS8000. We also discuss the supported distributions of Linux when using the DS8000, as well as the tools that can be helpful for the monitoring and tuning activity:
- Linux disk I/O architecture
- Host Bus Adapter (HBA) considerations
- Multipathing
- Software RAID functions
- Logical Volume Manager (LVM)
- Disk I/O schedulers
- Filesystem considerations


13.1 Supported platforms and distributions


Linux tends to be the only operating system that is available for almost all hardware platforms. IBM currently supports Linux on the following platforms:
- On x86-based servers in 32-bit and 64-bit mode
- On System p servers in 32-bit and 64-bit mode
- On System z servers in 31-bit and 64-bit mode
- Running as a guest in VMware ESX
- Running as a guest in z/VM
IBM currently supports the following major Linux distributions:
- Red Hat Enterprise Linux (RHEL) 2.1, 3.0, 4.0, and 5
- Novell SuSE Linux Enterprise (SLES) 8.0, 9.0, and 10
- Red Flag Linux (Asianux) 4.1
For further clarification and the most current information about supported Linux distributions and hardware prerequisites, refer to the System Storage Interoperation Center (SSIC) Web site:
http://www.ibm.com/systems/support/storage/config/ssic
Further information about supported kernel versions and additional restrictions can be obtained from the IBM Subsystem Device Driver for Linux Web site:
http://www-01.ibm.com/support/docview.wss?rs=540&context=ST52G7&uid=ssg1S4000107
This chapter introduces the relevant logical configuration concepts needed to attach Linux operating systems to a DS8000 and focuses on performance-relevant configuration and measuring options. For further information about hardware-specific Linux implementation and general performance considerations with regard to the hardware setup, refer to the following documentation:
For a general Linux implementation overview:
- Linux Handbook A Guide to IBM Linux Solutions and Resources, SG24-7000
For x86-based architectures:
- Tuning IBM System x Servers for Performance, SG24-5287
- Tuning Red Hat Enterprise Linux on IBM eServer xSeries Servers, REDP-3861
- Tuning SUSE LINUX Enterprise Server on IBM eServer xSeries Servers, REDP-3862
For System p hardware:
- Virtualizing an Infrastructure with System p and Linux, SG24-7499
- Tuning Linux OS on System p The POWER Of Innovation, SG24-7338
For System z hardware:
- Linux on IBM System z: Performance Measurement and Tuning, SG24-6926
- Linux for IBM System z9 and IBM zSeries, SG24-6694
- z/VM and Linux on IBM System z, SG24-7492

13.2 Linux disk I/O architecture


Before discussing relevant disk I/O-related performance topics, this section briefly introduces the Linux disk I/O architecture. We look at the Linux disk I/O subsystem to have a better understanding of the components that have a major effect on system performance.


The architecture discussed here applies to Open Systems servers attached to DS8000 using the Fibre Channel Protocol (FCP). If Linux is installed on System z servers, a special disk I/O setup might apply depending on the specific hardware implementation and configuration. For further information about disk I/O setup and configuration for System z, refer to the IBM Redbooks publication Linux for IBM System z9 and IBM zSeries, SG24-6694.

13.2.1 I/O subsystem architecture


Figure 13-1 illustrates the basic I/O subsystem architecture.
Figure 13-1 I/O subsystem architecture

For a quick overview of overall I/O subsystem operations, we use the example of writing data to a disk. The following sequence outlines the fundamental operations that occur when a disk-write operation is performed, assuming that the file data is on sectors on the disk platters, has already been read, and is in the page cache:
1. A process requests to write a file through the write() system call.
2. The kernel updates the page cache mapped to the file.
3. A pdflush kernel thread takes care of flushing the page cache to disk.
4. The filesystem layer puts each block buffer together into a bio struct (refer to 13.2.3, Block layer on page 405) and submits a write request to the block device layer.


5. The block device layer gets requests from upper layers, performs an I/O elevator operation, and puts the requests into the I/O request queue.
6. A device driver, such as the Small Computer System Interface (SCSI) driver or another device-specific driver, takes care of the write operation.
7. The disk device firmware performs hardware operations, such as the head seek, rotation, and data transfer to the sector on the platter.

This sequence is simplified, because it only reflects I/Os to local physical disks (a SCSI disk attached via a native SCSI adapter). Storage configurations using additional virtualization layers and SAN attachment, such as the DS8000 storage system, require additional operations and layers.

13.2.2 Cache and locality of reference


Achieving a high cache hit rate is the key for performance improvement. In Linux, the technique called locality of reference is used. This technique is based on the following principles:
- The data most recently used has a high probability of being used in the near future (temporal locality).
- The data that resides close to data that has already been used has a high probability of being used (spatial locality).
Figure 13-2 illustrates this principle.
Figure 13-2 Locality of reference

Linux uses this principle in many components, such as page cache, file object cache (i-node cache, directory entry cache, and so on), read ahead buffer, and more.


Flushing a dirty buffer


When a process reads data from disk, the data is copied to memory. The process and other processes can retrieve the same data from the copy of the data cached in memory. When a process tries to change the data, the process changes the data in memory first. At this time, the data on disk and the data in memory are not identical, and the data in memory is referred to as a dirty buffer. The dirty buffer must be synchronized to the data on disk as soon as possible, or the data in memory can be lost if a sudden crash occurs. The synchronization process for a dirty buffer is called flush. In the Linux kernel 2.6 implementation, the pdflush kernel thread is responsible for flushing data to the disk. The flush occurs on a regular basis (kupdate) and when the proportion of dirty buffers in memory exceeds a certain threshold (bdflush). The threshold is configurable in the /proc/sys/vm/dirty_background_ratio file.
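As a minimal sketch of working with this threshold (the value shown is illustrative only, not a recommendation), the flush behavior can be inspected and adjusted at run time through the /proc/sys/vm interface or the sysctl command:

# Display the current background flush threshold (percentage of dirty memory)
cat /proc/sys/vm/dirty_background_ratio

# Temporarily lower the threshold so pdflush starts writing back earlier
sysctl -w vm.dirty_background_ratio=5

# Make the change persistent across reboots
echo "vm.dirty_background_ratio = 5" >> /etc/sysctl.conf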

13.2.3 Block layer


The block layer handles all of the activity related to block device operation (refer to Figure 13-1 on page 403). The key data structure in the block layer is the bio structure, which is the interface between the filesystem layer and the block layer. When a write is performed, the filesystem layer tries to write to the page cache, which is made up of block buffers. It builds a bio structure by putting the contiguous blocks together and then sends the bio to the block layer (refer to Figure 13-1 on page 403). The block layer handles the bio request and links these requests into a queue called the I/O request queue. This linking operation is called the I/O elevator or I/O scheduler. While the Linux kernel 2.4 used a single, general purpose I/O elevator, the kernel 2.6 offers a choice of four elevators. Because the Linux operating system can be used for a wide range of tasks, both I/O devices and workload characteristics change significantly. A notebook computer probably has different I/O requirements than a 10000 user database system. To accommodate these differences, four I/O elevators are available. I/O elevator implementation and tuning is discussed further in 13.3.5, Tuning the disk I/O scheduler on page 412.

13.2.4 I/O device driver


The Linux kernel takes control of devices using a device driver. The device driver is usually a separate kernel module and is provided for each device (or group of devices) to make the device available for the Linux operating system. After the device driver is loaded, it runs as a part of the Linux kernel and takes full control of the device. Here, we describe SCSI device drivers.

SCSI
The Small Computer System Interface (SCSI) is the most commonly used I/O device technology, especially in the enterprise server environment. In Linux kernel implementations, SCSI devices are controlled by device driver modules. They consist of the following types of modules (Figure 13-3 on page 406):
- Upper level drivers: sd_mod, sr_mod (SCSI-CDROM), st (SCSI tape), sg (SCSI generic device), and so on. These drivers provide the functionality to support several types of SCSI devices, such as SCSI CD-ROM, SCSI tape, and so on.


- Middle level driver: scsi_mod. Implements the SCSI protocol and common SCSI functionality.
- Low level drivers. Provide lower level access to each device. A low level driver is basically specific to a hardware device and is provided for each device, for example, ips for the IBM ServeRAID controller, qla2300 for the QLogic HBA, mptscsih for the LSI Logic SCSI controller, and so on.
- Pseudo driver: ide-scsi. Used for IDE-SCSI emulation.

Figure 13-3 Structure of SCSI drivers

If specific functionality is implemented for a device, it must be implemented in device firmware and the low level device driver. The supported functionality depends on which hardware you use and which version of device driver you use. The device itself must also support the desired functionality. Specific functions are usually tuned by a device driver parameter.
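To see which of these SCSI driver layers are actually loaded on a system, the standard kernel interfaces can be queried; a brief sketch (the module names in the grep pattern are examples):

# List the loaded SCSI mid level and low level driver modules
lsmod | grep -i -E "scsi|qla|lpfc|mptscsi|ips"

# Show the SCSI devices that the kernel has discovered
cat /proc/scsi/scsi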

13.3 Specific configuration for storage performance


Many specific parameters influence overall system performance as well as the performance of a specific application, which, of course, also applies to Linux systems. The focus of this chapter is disk I/O performance; thus, we do not discuss the influence of CPU usage, memory usage, and paging, or the specific performance tuning possibilities for these areas. For further general performance and tuning recommendations, refer to Linux Performance and Tuning Guidelines, REDP-4285.

13.3.1 Host bus adapter for Linux


IBM supports several host bus adapters (HBAs) in a large number of possible configurations. To confirm that a specific HBA is supported by IBM, check the latest information in the System Storage Interoperation Center (SSIC): http://www.ibm.com/systems/support/storage/config/ssic/index.jsp For each HBA, there are BIOS levels and driver versions available. The supported version for each Linux kernel level, distribution, and related information are available from the following link: http://www-03.ibm.com/systems/support/storage/config/hba/index.wss


To configure the HBA properly, refer to the IBM TotalStorage DS8000: Host Systems Attachment Guide, SC26-7625, which includes detailed procedures and recommended settings. Also, read the readme files and manuals of the driver, BIOS, and HBA.

Each HBA driver allows you to configure several parameters. The list of available parameters depends on the specific HBA type and driver implementation. If these settings are not configured correctly, performance might suffer or the system might not work properly. You can configure each parameter either temporarily or persistently. For temporary configurations, you can use the modprobe command. Persistent configuration is performed by editing the following file (based on the distribution):
- /etc/modprobe.conf for RHEL
- /etc/modprobe.conf.local for SLES

To set the queue depth of an Emulex HBA to 20, add the following line to modprobe.conf(.local):
options lpfc lpfc_lun_queue_depth=20

Specific HBA types, for example, QLogic HBAs, support a failover at the HBA level. When using Device Mapper - Multipath I/O (DM-MPIO) or the Subsystem Device Driver (SDD) for multipathing, this failover at the HBA level needs to be manually disabled. To disable failover on a QLogic qla2xxx adapter, simply add the following line to modprobe.conf(.local):
options qla2xxx ql2xfailover=0

For performance reasons, the queue depth parameter and the various timeout and retry parameters that apply in the case of path errors can be interesting. Changing the queue depth allows you to queue more outstanding I/Os at the adapter level, which, in certain configurations, can have a positive effect on throughput. However, increasing the queue depth cannot be generally recommended, because it can slow performance or cause delays, depending on the actual configuration. Thus, the complete setup needs to be checked carefully before adjusting the queue depth. Change the I/O timeout and retry values when using DM-MPIO, which handles the path failover and recovery scenarios. In those cases, we recommend that you decrease those values to allow a fast reaction of the multipathing module in case of path or adapter problems.
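As a brief sketch of applying and verifying such a setting (the Emulex parameter is the one named above; the SCSI device address 0:0:0:0 is an example, and the sysfs attribute might not be present on every supported kernel level):

# Load the driver temporarily with a larger LUN queue depth (Emulex example)
modprobe lpfc lpfc_lun_queue_depth=20

# Verify the queue depth that is effective for a given SCSI device
cat /sys/bus/scsi/devices/0:0:0:0/queue_depth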

13.3.2 Multipathing in Linux


In a Linux environment, IBM supports two multipathing solutions for the DS8000:
- Subsystem Device Driver (SDD)
- Device Mapper - Multipath I/O (DM-MPIO)

The Subsystem Device Driver (SDD) is a generic device driver designed to support multipath configuration environments with the DS8000. SDD is provided and maintained by IBM for several operating systems, including Linux. SDD is the older multipathing solution. Starting with kernel Version 2.6, new, smarter multipathing support is available for Linux. The Multipath I/O support included in Linux 2.6 kernel versions is based on Device Mapper (DM), a new module of the Linux kernel that supports logical volume management. With Device Mapper, a virtual block device is presented where blocks can be mapped to any existing physical block device. Using the multipath module, the virtual block device can be mapped to several paths toward the same physical target block device. The purpose of

Device Mapper is to balance the workload of I/O operations across all available paths as well as to detect defective links and fail over to the remaining links.

Currently (as of November 2008), IBM only supports SDD for the SLES 8 and 9 and RHEL 3 and 4 versions. For the new distribution releases, SLES 10 and RHEL 5, DM-MPIO is the only supported multipathing solution. We recommend using DM-MPIO if possible for your system configuration. DM-MPIO is already the preferred multipathing solution for most Linux 2.6 kernels. It is also available for 2.4 kernels but needs to be manually included and configured during kernel compilation. DM-MPIO is the required multipathing setup for LVM2. Further information about supported distribution releases, kernel versions, and multipathing software is documented at the IBM Subsystem Device Driver for Linux Web site:
http://www.ibm.com/support/docview.wss?rs=540&context=ST52G7&uid=ssg1S4000107

DM-MPIO provides round-robin load balancing for up to eight paths per LUN. The userspace component is responsible for automated path discovery and grouping, as well as path handling and retesting of previously failed paths. The framework is extensible for hardware-specific functions and additional load balancing or failover algorithms.

IBM provides a device-specific configuration file for the DS8000 for the supported levels of Red Hat Enterprise Linux (RHEL) and SUSE Linux Enterprise Server (SLES). This file needs to be copied to /etc/multipath.conf before the multipath driver and multipath tools are started. It sets default parameters for the scanned LUNs and creates user friendly names for the multipath devices that are managed by DM-MPIO. Further configuration, such as adding aliases for certain LUNs or blacklisting specific devices, can be done manually by editing this file. Using DM-MPIO, you can configure various path failover policies, path priorities, and failover priorities. This type of configuration can be done for each device individually in the /etc/multipath.conf setup.

When using DM-MPIO, consider changing the default HBA timeout settings. If a path fails, the failure must be reported to the multipath module as fast as possible to avoid delays caused by I/O retries at the HBA level. DM-MPIO is then able to react quickly and fail over to one of the remaining healthy paths. This setting needs to be configured at the HBA level by editing the file /etc/modprobe.conf or /etc/modprobe.conf.local (depending on the distribution). For example, you can use the following settings for a QLogic qla2xxx HBA (Example 13-1).
Example 13-1 HBA settings in /etc/modprobe.conf.local

cat /etc/modprobe.conf.local
#
# please add local extensions to this file
#
options qla2xxx qlport_down_retry=1 ql2xfailover=0 ql2xretrycount=5
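After the configuration file is in place and the multipath services are running, the resulting path groups can be verified with the multipath tools; a brief sketch:

# List all multipath devices, their paths, and the status of each path
multipath -ll

# Rescan for new paths and rebuild the multipath maps if needed
multipath -v2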

For further configuration and setup information, refer to the following publications:
- For SLES:
  http://www.novell.com/documentation/sles10/stor_evms/index.html?page=/documentation/sles10/stor_evms/data/mpiotools.html


- For RHEL:
  http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/en-US/RHEL510/DM_Multipath/
- Considerations and comparisons between IBM SDD for Linux and DM-MPIO:
  http://www.ibm.com/support/docview.wss?uid=ssg1S7001664&rs=555

13.3.3 Software RAID functions


Generally speaking, with redundant array of independent disks (RAID) technology, you can spread the I/O over multiple physical disk spindles. When attaching external disk storage systems, such as the DS8000, part of the hardware RAID functionality is already included within that storage system. Additionally, when configuring DS8000 LUNs using the Extent Allocation Method rotateextents (also called the Storage Pool Striping (SPS) function), it is possible to configure a LUN that is striped across several ranks or even device adapter (DA) pairs. However, as discussed in 5.7.5, Planning for multi-rank extent pools on page 106, consider adding an additional RAID layer at the host level, particularly for applications that mainly deal with small transfer sizes. Because SPS uses the 1 GB extent as the stripe size, hot spots might still emerge on a single rank if an application has very high I/O demands against data within a single 1 GB extent. In this case, it can be beneficial to configure host-based striping over several LUNs to distribute the data over several ranks.

Software RAID in the Linux 2.6 kernel distributions is implemented through the md device driver. This driver implementation is device-independent and therefore flexible, and it allows many types of disk storage to be configured as a RAID array. Supported software RAID levels are RAID 0 (striping), RAID 1 (mirroring), RAID 5 (striping with parity), RAID 6 (striping with double parity), and RAID 10 (a combination of mirroring and striping).

We recommend that you do not create a stripe size of less than 64 KB. If many applications address the same array, software striping at the operating system level does not help much for host performance improvement, but it also does not hurt the DS8000 performance. For sequential I/O, it is better to have larger stripe sizes so that the LVM does not have to split write requests. You can create a stripe size of up to 512 MB, the maximum single I/O size, to fully utilize the FC connection. For random I/O, the stripe size is not as important.

The mdadm tool provides the functionality of the legacy programs mdtools and raidtools. Example 13-2 shows how to create a RAID 0 across two DM-MPIO devices with a stripe size of 128 KB.
Example 13-2 Creating a RAID 0 using mdadm

x345-tic-20:/dev/disk/by-name # mdadm --create /dev/md0 --run --level=raid0 --chunk=128 --raid-devices=2 /dev/disk/by-name/mpathb /dev/disk/by-name/mpathc
mdadm: array /dev/md0 started.

Using the command mdadm --detail --scan returns the configuration of all software RAIDs.
Example 13-3 Displaying the configuration of all software RAIDs

x345-tic-20:/dev/disk/by-name # mdadm --detail --scan
ARRAY /dev/md0 level=raid0 num-devices=2 UUID=9fd0170a:d18e3d37:2d44e795:d9cc87b4
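The state and any resynchronization progress of the software RAID devices can be checked at any time through the standard kernel interfaces; for example:

# Show the state of all md devices, their member disks, and any rebuild progress
cat /proc/mdstat

# Show detailed information for a single array
mdadm --detail /dev/md0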


You can obtain further documentation about how to use the command line RAID tools in Linux from: http://tldp.org/HOWTO/Software-RAID-HOWTO-5.html

13.3.4 Logical Volume Manager


Starting with kernel version 2.6, the Logical Volume Manager Version 2 (LVM2) is available and is downward-compatible with the previous LVM. LVM2 does not require any kernel patches. It makes use of the Device Mapper (DM) framework integrated in kernel 2.6. With kernel 2.6, only LVM2 is supported. Therefore, this section always refers to LVM Version 2. Instead of LVM2, you can use the Enterprise Volume Management System (EVMS), which offers a uniform interface for logical volumes and RAID volumes.

Logical Volume Manager (LVM2)


With the use of LVM, you can configure logical partitions that can reside on multiple physical drives or LUNs. Each LUN mapped from the DS8000 is divided into one or more physical volumes (PVs). Several of those PVs can be added to a logical volume group (VG), and later on, logical volumes (LVs) are configured out of a volume group. Each physical volume (PV) consists of a number of fixed-size physical extents (PEs). Similarly, each logical volume (LV) consists of a number of fixed-size logical extents (LEs). A logical volume (LV) is created by mapping logical extents (LEs) to physical extents (PEs) within a volume group. An LV can be created with a size from just one extent to all available extents in a VG. With LVM2, you can influence the way LEs (for a logical volume) are mapped to the available PEs. With LVM linear mapping, the extents of several PVs are concatenated to built a larger logical volume. Figure 13-4 illustrates a logical volume across several physical volumes. With striped mapping, groups of contiguous physical extents, which are called stripes, are mapped to a single physical volume. With this functionality, it is possible to configure striping between several LUNs within LVM, which provides approximately the same performance benefits as the Software RAID functions.

Figure 13-4 LVM striped mapping of three LUNs to a single logical volume
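As a hedged sketch of configuring the striped mapping shown in Figure 13-4 with LVM2 (the multipath device names, volume group name, logical volume size, stripe count, and 128 KB stripe size are examples only, not recommendations):

# Initialize the multipath devices as physical volumes
pvcreate /dev/mapper/mpathb /dev/mapper/mpathc /dev/mapper/mpathd

# Create a volume group from the three DS8000 LUNs
vgcreate vg_ds8000 /dev/mapper/mpathb /dev/mapper/mpathc /dev/mapper/mpathd

# Create a 100 GB logical volume striped across all three PVs with a 128 KB stripe size
lvcreate -L 100G -i 3 -I 128 -n lv_data vg_ds8000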


Furthermore, LVM2 offers additional functions and flexibility:
- Logical volumes can be resized during operation.
- Data from one physical volume can be relocated during operation, for example, in data migration scenarios.
- Logical volumes can be mirrored between several physical volumes for redundancy.
- Logical volume snapshots can be created for backup purposes.

With the Linux 2.6 kernel levels, only LVM2 is supported. This configuration also uses the Device Mapper multipathing functionality; thus, every LUN can be an mpath device that is available via several paths for redundancy. Both basic functions, host-based striping and host-based mirroring, can be implemented either by using the Software RAID functions with mdadm or the Logical Volume Management functions with LVM2. From a performance point of view, both solutions deliver comparable results, perhaps with a slight performance advantage for mdadm because of less implementation overhead compared to LVM. Both implementations can also be configured using the EVMS management functions. Further documentation about LVM2 can be obtained from:
http://sources.redhat.com/lvm2/
http://www.tldp.org/HOWTO/LVM-HOWTO/index.html

Enterprise Volume Management System (EVMS)


The Enterprise Volume Management System (EVMS) is an open source logical volume manager that integrates all aspects of storage management within an extensible framework. EVMS integrates disk partitioning, logical volume management (LVM2), Device Mapper multipath (DM-MPIO), software RAID management, and filesystem operations within a single management tool. EVMS provides a common management interface for the plugged-in functions of all included management layers. EVMS offers a graphical user interface (EVMS GUI), a menu-driven interface (EVMS Ncurses), and a command line interpreter (EVMS CLI). Depending on your needs, all three available user interfaces offer about the same functionality. Using EVMS does not influence the performance of a Linux system directly, but with EVMS, the LVM, filesystem, and Software RAID functions can be configured to influence the storage performance of a Linux system. You can obtain more information about using EVMS from:
http://evms.sourceforge.net/user_guide/


13.3.5 Tuning the disk I/O scheduler


The Linux kernel 2.6 employs a new I/O elevator model. While the Linux kernel 2.4 used a single, general purpose I/O scheduler, kernel 2.6 offers a choice of four schedulers, or elevators. The I/O scheduler forms the interface between the generic block layer and the low-level device drivers. Functions provided by the block layer can be utilized by the filesystems and the virtual memory manager to submit I/O requests to the block devices. These requests are transformed by the I/O scheduler and then passed to the low-level device drivers. Red Hat Enterprise Linux AS 4 and SUSE Linux Enterprise Server 10 support four types of I/O schedulers. You can obtain additional details about configuring and setting up I/O schedulers in Tuning Linux OS on System p The POWER Of Innovation, SG24-7338.

Descriptions of the available I/O schedulers


The four available I/O schedulers are:
- Deadline I/O scheduler: The deadline I/O scheduler incorporates a per-request, expiration-based approach and operates on five I/O queues. The basic idea behind this I/O scheduler is that all read requests are satisfied within a specified time period, while write requests do not have any specific deadlines. Web servers have been found to perform better when configured with the deadline I/O scheduler and ext3.
- Noop I/O scheduler: The Noop I/O scheduler performs and provides only basic merging and sorting functions. In large I/O subsystems that incorporate RAID controllers and a vast number of contemporary physical disk drives (TCQ drives), the Noop I/O scheduler has the potential to outperform the other three I/O schedulers as the workload increases.
- Completely Fair Queuing I/O scheduler: The Completely Fair Queuing (CFQ) I/O scheduler is built on the concept of fair allocation of I/O bandwidth among all the initiators of I/O requests. It strives to manage per-process I/O bandwidth and provide fairness at the level of process granularity. Sequential writes perform better when the CFQ I/O scheduler is configured with the eXtended Filesystem (XFS). The number of requests fetched from each queue is controlled by the cfq_quantum tunable parameter.
- Anticipatory I/O scheduler: The anticipatory I/O scheduler attempts to reduce the per-thread read response time. It introduces a controlled delay component into the dispatching equation. The delay is invoked on any new read request to the device driver. File servers have been found to perform better when the anticipatory I/O scheduler is configured with ext3. Sequential reads perform better when the anticipatory I/O scheduler is configured with XFS or ext3.

Selecting the right I/O elevator for a selected type of workload


For most server workloads, either the Complete Fair Queuing (CFQ) elevator or the Deadline elevator is an adequate choice, because both of them are optimized for the multiuser and multi-process environment in which a typical server operates. Enterprise distributions typically default to the CFQ elevator. However, on Linux for IBM System z, the Deadline scheduler is favored as the default elevator. Certain environments can benefit from selecting a different I/O elevator. With Red Hat Enterprise Linux 5.0 and Novell SUSE Linux Enterprise Server 10, the I/O schedulers can now be selected on a per disk subsystem basis as opposed to the global setting in Red Hat Enterprise Linux 4.0 and Novell SUSE Linux Enterprise Server 9.

With the capability to have different I/O elevators per disk subsystem, the administrator can now isolate a specific I/O pattern on a disk subsystem (such as write intensive workloads) and select the appropriate elevator algorithm (a sketch of checking and changing the elevator per device follows this list):
- Synchronous filesystem access: Certain types of applications need to perform filesystem operations synchronously, which can be true for databases that might even use a raw filesystem or for large disk subsystems where caching asynchronous disk accesses simply is not an option. In those cases, the anticipatory elevator usually has the least throughput and the highest latency. The three other schedulers perform equally well up to an I/O size of roughly 16 KB, where the CFQ and NOOP elevators begin to outperform the deadline elevator (unless disk access is very seek-intense).
- Complex disk subsystems: Benchmarks have shown that the NOOP elevator is an interesting alternative in high-end server environments. When using configurations with enterprise-class disk subsystems, such as the DS8000, the lack of ordering capability of the NOOP elevator becomes its strength. Enterprise-class disk subsystems can contain multiple SCSI or Fibre Channel disks that each have individual disk heads and data striped across the disks. It becomes difficult for an I/O elevator to correctly anticipate the I/O characteristics of such complex subsystems, so you might often observe at least equal performance at less overhead when using the NOOP I/O elevator. Most large scale benchmarks that use hundreds of disks most likely use the NOOP elevator.
- Database systems: Due to the seek-oriented nature of most database workloads, some performance gain can be achieved by selecting the deadline elevator for these workloads.
- Virtual machines: Virtual machines, regardless of whether they run in VMware or z/VM on System z, usually communicate through a virtualization layer with the underlying hardware. So, a virtual machine is not aware of whether the assigned disk device consists of a single SCSI device or an array of Fibre Channel disks on a DS8000. The virtualization layer takes care of the necessary I/O reordering and the communication with the physical block devices.
- CPU-bound applications: While certain I/O schedulers can offer superior throughput, they can at the same time create more system overhead. The overhead that, for instance, the CFQ or deadline elevators cause comes from aggressively merging and reordering the I/O queue. Sometimes, the workload is not so much limited by the performance of the disk subsystem as by the performance of the CPU, for example, with a scientific workload or a data warehouse processing very complex queries. In these scenarios, the NOOP elevator offers an advantage over the other elevators, because it causes less CPU overhead, as shown in Figure 13-5 on page 414. However, when comparing CPU overhead to throughput, the deadline and CFQ elevators are still the best choices for most access patterns to asynchronous filesystems.
- Single ATA or SATA disk subsystems: If you choose to use a single physical ATA or SATA disk, for example, for the boot partition of your Linux system, consider using the anticipatory I/O elevator, which reorders disk writes to accommodate the single disk head found in these devices.
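As a minimal sketch of inspecting and changing the elevator at run time on these newer distributions (the device name sdb and the chosen elevators are examples only, not recommendations):

# Show the available elevators for a device; the active one is displayed in brackets
cat /sys/block/sdb/queue/scheduler

# Switch this device to the noop elevator without a reboot
echo noop > /sys/block/sdb/queue/scheduler

# Alternatively, select an elevator globally at boot time with a kernel parameter,
# for example, elevator=deadline in the grub or lilo configuration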


13.3.6 Filesystems
The filesystems that are available for Linux have been designed with different workload and availability characteristics in mind. If your Linux distribution and the application allow the selection of a different filesystem, it might be worthwhile to investigate whether Ext, the Journal File System (JFS), ReiserFS, or the eXtended File System (XFS) is the optimal choice for the planned workload. Generally speaking, ReiserFS is more suited to accommodate small I/O requests, whereas XFS and JFS are tailored toward large filesystems and large I/O sizes. Ext3 fits the gap between ReiserFS and JFS and XFS, because it can accommodate small I/O requests while offering good multiprocessor scalability.

The workload patterns that JFS and XFS are best suited for are high-end data warehouses, scientific workloads, large Symmetric Multi Processor (SMP) servers, and streaming media servers. ReiserFS and Ext3 are typically used for file, Web, or mail serving. For write-intense workloads that create smaller I/Os up to 64 KB, ReiserFS might have an edge over Ext3 with the default journaling mode. However, this advantage is only true for synchronous file operations.

An option to consider is the Ext2 filesystem. Due to its lack of journaling abilities, Ext2 outperforms ReiserFS and Ext3 for synchronous filesystem access regardless of the access pattern and I/O size. So, Ext2 might be an option when performance is more important than data integrity.

Figure 13-5 Random write throughput comparison between Ext and ReiserFS (synchronous)

In the most common scenario of an asynchronous filesystem, ReiserFS most often delivers solid performance and outperforms Ext3 with the default journaling mode (data=ordered). However, Ext3 is equal to ReiserFS as soon as the default journaling mode is switched to writeback.


Using ionice to assign I/O priority


A feature of the CFQ I/O elevator is the option to assign priorities at the process level. Using the ionice utility, you can restrict the disk subsystem utilization of a specific process. There are three priority classes:
- Idle: A process with the assigned I/O priority idle is only granted access to the disk subsystems if no other processes with a priority of best-effort or higher request access to the data. This setting is useful for tasks that only run when the system has free resources, such as the updatedb task.
- Best-effort: As a default, all processes that do not request a specific I/O priority are assigned to this class. The class provides eight priority levels, and processes inherit their I/O priority from their respective CPU nice level.
- Real time: The highest available I/O priority is real time, which means that the respective process is always given priority access to the disk subsystem. The real time class also accepts eight priority levels. Use caution when assigning a thread the real time priority, because this process can cause the other tasks to be unable to access the disk subsystem.
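A brief sketch of ionice usage (the command and process ID are placeholders); remember that these I/O priorities are a feature of the CFQ elevator:

# Run a backup job in the idle class so it only uses otherwise unused disk bandwidth
ionice -c3 tar -czf /backup/data.tar.gz /data

# Change an already running process (PID 1234) to best-effort, highest priority level
ionice -c2 -n0 -p 1234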

Access time updates


The Linux filesystem keeps records of when files are created, updated, and accessed. Default operations include updating the last-time-read attribute for files during reads and writes to files. Because writing is an expensive operation, eliminating unnecessary I/O can lead to overall improved performance. However, under most conditions, disabling file access time updates will only yield a very small performance improvement. Mounting filesystems with the noatime option prevents inode access times from being updated. If file and directory update times are not critical to your implementation, as in a Web serving environment, an administrator might choose to mount filesystems with the noatime flag in the /etc/fstab file as shown in Example 13-4. The performance benefit of disabling access time updates to be written to the filesystem ranges from 0 to 10% with an average of 3% for file server workloads.
Example 13-4 Update /etc/fstab file with noatime option set on mounted filesystems

/dev/sdb1 /mountlocation ext3 defaults,noatime 1 2

Selecting the journaling mode of the filesystem


Three journaling options for most filesystems can be set with the data option of the mount command. However, the journaling mode has the biggest effect on performance for Ext3 filesystems, so we suggest using this tuning option mainly for Red Hat's default filesystem:
- data=journal: This journaling option provides the highest form of data consistency by causing both file data and metadata to be journaled. It also has the highest performance overhead.
- data=ordered (default): In this mode, only metadata is journaled. However, file data is guaranteed to be written first. This setting is the default.
- data=writeback: This journaling option provides the fastest access to the data at the expense of data consistency. The data is guaranteed to be consistent because the metadata is still being logged. However, no special handling of actual file data is done, which can lead to old data appearing in files after a system crash. Note that the type of metadata journaling implemented when using the writeback mode is comparable to the defaults of ReiserFS, JFS, or XFS. The writeback journaling mode improves Ext3 performance especially for


small I/O sizes as shown in Figure 13-6 on page 416. The benefit of using writeback journaling declines as I/O sizes grow. Also, note that the journaling mode of your filesystem only impacts write performance. Therefore, a workload that performs mainly reads (for example, a Web server) will not benefit from changing the journaling mode.

Figure 13-6 Random write performance impact of data=writeback

There are three ways to change the journaling mode on a filesystem:
- When executing the mount command (/dev/sdb1 is the filesystem being mounted):
  mount -o data=writeback /dev/sdb1 /mnt/mountpoint
- Including it in the options section of the /etc/fstab file:
  /dev/sdb1 /testfs ext3 defaults,data=writeback 0 0
- If you want to modify the default data=ordered option on the root partition, make the change to the /etc/fstab file, and then execute the mkinitrd command to scan the changes in the /etc/fstab file and create a new image. Update grub or lilo to point to the new image.

Blocksizes

The blocksize, the smallest amount of data that can be read or written to a drive, can have a direct impact on a server's performance. As a guideline, if your server is handling a lot of small files, a smaller blocksize will be more efficient. If your server is dedicated to handling large files, a larger blocksize might improve performance. Blocksizes cannot be changed dynamically on existing filesystems, and only a reformat will modify the current blocksize. Most Linux distributions allow blocksizes of 1 KB, 2 KB, and 4 KB. As benchmarks have shown, there is hardly any performance improvement to be gained from changing the blocksize of a filesystem, so it is generally better to leave it at the default of 4 KB.
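If a non-default blocksize is nevertheless required, it has to be selected when the filesystem is created; a minimal sketch (the device name is an example, and mkfs destroys all data on the device):

# Display the blocksize of an existing Ext3 filesystem
tune2fs -l /dev/sdb1 | grep "Block size"

# Recreate the filesystem with a 4 KB blocksize
mkfs.ext3 -b 4096 /dev/sdb1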


13.4 Linux performance monitoring tools


This section introduces the commonly used performance measurement tools. Additionally, Linux offers several other performance measurement tools, most of them similar to the tools that are available in the UNIX operating systems.

13.4.1 Disk I/O performance indicators


The disk subsystem is often the most important aspect of server performance and is usually the most common bottleneck. Applications are considered to be I/O-bound when CPU cycles are wasted simply waiting for I/O tasks to finish. However, problems can be hidden by other factors, such as lack of memory. The symptoms that show that the server might be suffering from a disk bottleneck (or a hidden memory problem) are shown in Table 13-1.
Table 13-1 Disk I/O performance indicators

Disk I/O numbers and wait time: Analyze the number of I/Os to the LUN. This data can be used to discover whether reads or writes are the cause of the problem. Use iostat to get the disk I/Os. Use stap ioblock.stp to get read/write blocks, and scsi.stp to get the SCSI wait times and the requests submitted and completed. Long wait times might also mean that the I/O goes to specific disks and is not spread out.

Disk I/O size: The memory buffer available for the block I/O request might not be sufficient, and the page cache size can be smaller than the maximum disk I/O size. Use stap ioblock.stp to get request sizes and iostat to get the blocksizes.

Disk I/O scheduler: An incorrect I/O scheduler might cause performance bottlenecks. A certain I/O scheduler performs better if configured with the appropriate filesystem.

Disk filesystem: An incorrect filesystem might cause performance bottlenecks. The appropriate filesystem must be chosen based on the requirements.

Disk I/O to physical device: If all the disk I/Os are directed to the same physical disk, it might cause a disk I/O bottleneck. Directing the disk I/O to different physical disks increases the performance.

Filesystem blocksize: If the filesystem is created with small-sized blocks, creating files larger than the blocksize might cause a performance bottleneck. Creating a filesystem with a proper blocksize improves the performance.


Swap device/area: If a single swap device/area is used, it might cause performance problems. To improve the performance, create multiple swap devices or areas.

13.4.2 Finding disk bottlenecks


A server exhibiting the following symptoms might suffer from a disk bottleneck (or a hidden memory problem):
- Slow disks will result in memory buffers filling with write data (or waiting for read data), which will delay all requests, because free memory buffers are unavailable for write requests (or the response is waiting for read data in the disk queue).
- Insufficient memory, as in the case of not enough memory buffers for network requests, will cause synchronous disk I/O.
- Disk utilization, controller utilization, or both will typically be very high.
- Most local area network (LAN) transfers will happen only after disk I/O has completed, causing long response times and low network utilization.
- Disk I/O can take a relatively long time and disk queues will become full, so the CPUs will be idle or have low utilization, because they wait long periods of time before processing the next request.

Linux offers command line tools to monitor performance-relevant information. Several of these tools are extremely helpful for getting performance metrics for disk I/O-relevant areas.

The vmstat command


One way to track disk usage on a Linux system is by using the vmstat tool (Example 13-5). The important columns in vmstat with respect to I/O are the bi and bo fields. These fields monitor the movement of blocks in and out of the disk subsystem. Having a baseline is key to being able to identify any changes over time.
Example 13-5 The vmstat tool output

[root@x232 root]# vmstat 2
 r  b  swpd  free  buff    cache   si   so    bi     bo   in   cs  us  sy  id  wa
 2  1     0  9004  47196 1141672    0    0     0    950  149   74  87  13   0   0
 0  2     0  9672  47224 1140924    0    0    12  42392  189   65  88  10   0   1
 0  2     0  9276  47224 1141308    0    0   448      0  144   28   0   0   0 100
 0  2     0  9160  47224 1141424    0    0   448   1764  149   66   0   1   0  99
 0  2     0  9272  47224 1141280    0    0   448     60  155   46   0   1   0  99
 0  2     0  9180  47228 1141360    0    0  6208  10730  425  413   0   3   0  97
 1  0     0  9200  47228 1141340    0    0 11200      6  631  737   0   6   0  94
 1  0     0  9756  47228 1140784    0    0 12224   3632  684  763   0  11   0  89
 0  2     0  9448  47228 1141092    0    0  5824  25328  403  373   0   3   0  97
 0  2     0  9740  47228 1140832    0    0   640      0  159   31   0   0   0 100

The iostat command


Performance problems can be encountered when too many files are opened, read, and written to, and then closed repeatedly. Problems can become apparent as seek times (the time it takes to move to the exact track where the data is stored) start to increase. Using the


iostat tool, you can monitor the I/O device loading in real time. Various options enable you to drill down even deeper to gather the necessary data. Example 13-6 shows a potential I/O bottleneck on the device /dev/sdb1. This output shows average wait times (await) of about 2.7 seconds and service times (svctm) of 270 ms.
Example 13-6 Sample of an I/O bottleneck as shown with iostat 2 -x /dev/sdb1

[root@x232 root]# iostat 2 -x /dev/sdb1

avg-cpu:  %user   %nice    %sys   %idle
          11.50    0.00    2.00   86.50

Device:    rrqm/s  wrqm/s   r/s    w/s   rsec/s   wsec/s    rkB/s     wkB/s  avgrq-sz avgqu-sz   await   svctm  %util
/dev/sdb1  441.00 3030.00  7.00  30.50  3584.00 24480.00  1792.00  12240.00    748.37   101.70 2717.33  266.67 100.00

avg-cpu:  %user   %nice    %sys   %idle
          10.50    0.00    1.00   88.50

Device:    rrqm/s  wrqm/s   r/s    w/s   rsec/s   wsec/s    rkB/s     wkB/s  avgrq-sz avgqu-sz   await   svctm  %util
/dev/sdb1  441.00 3030.00  7.00  30.00  3584.00 24480.00  1792.00  12240.00    758.49   101.65 2739.19  270.27 100.00

avg-cpu:  %user   %nice    %sys   %idle
          10.95    0.00    1.00   88.06

Device:    rrqm/s  wrqm/s   r/s    w/s   rsec/s   wsec/s    rkB/s     wkB/s  avgrq-sz avgqu-sz   await   svctm  %util
/dev/sdb1  438.81 3165.67  6.97  30.35  3566.17 25576.12  1783.08  12788.06    781.01   101.69 2728.00  268.00 100.00

Example 13-7 shows the output of the iostat command on an LPAR configured with 1.2 CPUs running Red Hat Enterprise Linux AS 4 while the server issues writes to the disks sda and dm-2. The transfers per second are 130 for sda and 692 for dm-2, and the iowait is 6.37%.
Example 13-7 Shows output of iostat

# iostat
Linux 2.6.9-42.EL (rhel)        09/29/2006

avg-cpu:  %user   %nice    %sys %iowait   %idle
           2.70    0.11    6.50    6.37   84.32

Device:    tps     Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
sda        130.69     1732.56     5827.90    265688    893708
sda1         1.24        2.53        0.00       388         0
sda2         4.80        5.32        0.03       816         4
sda3       790.73     1717.40     5827.87    263364    893704
dm-0        96.19     1704.71      292.40    261418     44840
dm-1         0.29        2.35        0.00       360         0
dm-2       692.66        6.38     5535.47       978    848864

Example 13-8 shows the output of the iostat command on the same LPAR (1.2 CPUs, Red Hat Enterprise Linux AS 4) while the server issues heavier writes to the disks sda and dm-2. The transfers per second are now 428 for sda and 4024 for dm-2, and the iowait has gone up to 12.42%.
Example 13-8 Shows output of iostat to illustrate disk I/O bottleneck

# iostat
Linux 2.6.9-42.EL (rhel)        09/29/2006

avg-cpu:  %user   %nice    %sys %iowait   %idle
           2.37    0.20   27.22   12.42   57.80

Device:    tps     Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
sda        428.14      235.64    32248.23    269840  36928420
sda1         0.17        0.34        0.00       388         0
sda2         0.64        0.71        0.00       816         4
sda3      4039.46      233.61    32248.17    267516  36928352
dm-0        14.63      231.80       52.47    265442     60080
dm-1         0.04        0.31        0.00       360         0
dm-2      4024.58        0.97    32195.76      1106  36868336

Changes made to the elevator algorithm as described in 13.3.5, Tuning the disk I/O scheduler on page 412 will be seen in avgrq-sz (average size of request) and avgqu-sz (average queue length). As the latencies are lowered by manipulating the elevator settings, avgrq-sz decreases. You can also monitor the rrqm/s and wrqm/s to see the effect on the number of merged reads and writes that the disk can manage.

sar command
The sar command, which is included in the sysstat package, uses the standard system activity daily data file to generate a report. The system has to be configured to collect this information and log it; therefore, a cron job must be set up. Add the lines shown in Example 13-9 to /etc/crontab to enable automatic log reporting with cron.
Example 13-9 Example of automatic log reporting with cron

....
# 8am-7pm activity reports every 10 minutes during weekdays.
0 8-18 * * 1-5 /usr/lib/sa/sa1 600 6 &
# 7pm-8am activity reports every hour during weekdays.
0 19-7 * * 1-5 /usr/lib/sa/sa1 &
# Activity reports every hour on Saturday and Sunday.
0 * * * 0,6 /usr/lib/sa/sa1 &
# Daily summary prepared at 19:05
5 19 * * * /usr/lib/sa/sa2 -A &
....

You get a detailed overview of your CPU utilization (%user, %nice, %system, %idle), memory paging, network I/O and transfer statistics, process creation activity, activity for block devices, and interrupts/second over time. Using sar -A (the -A is equivalent to -bBcdqrRuvwWy -I SUM -I PROC -n FULL -U ALL, which selects the most relevant counters of the system) is the most effective way to collect all relevant performance counters. Using sar is recommended to analyze whether a system is disk I/O-bound and spending a lot of time waiting, which results in filled-up memory buffers and low CPU usage. Furthermore, this method is useful to monitor the overall system performance over a longer period of time, for example, days or weeks, to further understand at which times a claimed performance bottleneck is seen. A number of additional performance data collection utilities are available for Linux, most of them transferred from UNIX systems. You can obtain more details about those additional tools in Chapter 11, Performance considerations with UNIX servers on page 307.
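Besides the cron-based collection, sar can also be run interactively; a brief sketch (interval and count values are examples):

# CPU utilization including %iowait, sampled every 10 seconds, 6 samples
sar -u 10 6

# Activity per block device (transfers per second and average queue and wait figures)
sar -d 10 6

# Report everything that was collected by the cron jobs for the current day
sar -A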


Chapter 14. IBM System Storage SAN Volume Controller attachment


This chapter describes the guidelines and procedures to make the most of the performance available from your DS8000 storage subsystem when attached to the IBM SAN Volume Controller. We discuss:
- IBM System Storage SAN Volume Controller
- SAN Volume Controller performance considerations
- DS8000 performance considerations with SVC
- Performance monitoring
- Sharing the DS8000 between a server and the SVC
- Advanced functions for the DS8000
- Configuration guidelines for optimizing performance


14.1 IBM System Storage SAN Volume Controller


The IBM System Storage SAN Volume Controller (SVC) is designed to increase the flexibility of your storage infrastructure by introducing an in-band virtualization layer between the servers and the storage systems. The SAN Volume Controller can enable a tiered storage environment to increase flexibility in storage management. It combines the capacity from multiple disk storage systems into a single storage pool, which can be managed from a central point. This is simpler to manage and helps increase disk capacity utilization. It also allows you to apply advanced Copy Services across storage systems from many vendors to help further simplify operations. For more information about the SAN Volume Controller, refer to Implementing the IBM System Storage SAN Volume Controller V4.3, SG24-6423.

14.1.1 SAN Volume Controller concepts


The SAN Volume Controller is a Storage Area Network (SAN) appliance that attaches storage devices to supported Open Systems servers. The SAN Volume Controller provides symmetric virtualization by creating a pool of managed disks from the attached storage subsystems, which are then mapped to a set of virtual disks for use by various attached host computer systems. System administrators can view and access a common pool of storage on the SAN, which enables them to use storage resources more efficiently and provides a common base for advanced functions.

The SAN Volume Controller solution is designed to reduce both the complexity and costs of managing your SAN-based storage. With the SAN Volume Controller, you can:
- Simplify management and increase administrator productivity by consolidating storage management intelligence from disparate storage controllers into a single view.
- Improve application availability by enabling data migration between disparate disk storage devices non-disruptively.
- Improve disaster recovery and business continuance needs by applying and managing Copy Services across disparate disk storage devices within the SAN.
- Provide advanced features and functions to the entire SAN, such as:
  - Large scalable cache
  - Advanced Copy Services
  - Space management
  - Mapping based on desired performance characteristics
  - Quality of Service (QoS) metering and reporting

SAN Volume Controller clustering


The SAN Volume Controller is a collection of up to eight cluster nodes added in pairs. These eight nodes are managed as a set (cluster) and present a single point of control to the administrator for configuration and service activity. For I/O purposes, SAN Volume Controller nodes within the cluster are grouped into pairs (called I/O Groups), with a single pair being responsible for serving I/O on a given virtual disk (VDisk). One node within the I/O Group will represent the preferred path for I/O to a given VDisk, and the other node represents the non-preferred path. This preference will alternate between nodes as each VDisk is created within an I/O Group to balance the workload evenly between the two nodes.


Note: The preferred node by no means signifies absolute ownership. The data will still be accessed by the partner node in the I/O Group in the event of a failure or if the preferred node workload becomes too high.

Beyond automatic configuration and cluster administration, the data transmitted from attached application servers is also treated in the most reliable manner. When data is written by the server, the preferred node within the I/O Group stores a write in its own write cache and the write cache of its partner (non-preferred) node before sending an I/O complete status back to the server application. To ensure that data is written in the event of a node failure, the surviving node empties its write cache and proceeds in write-through mode until the cluster is returned to a fully operational state.

Note: Write-through mode is where the data is not cached in the nodes but is written directly to the disk subsystem instead. While operating in this mode, performance is slightly degraded.

Furthermore, each node in the I/O Group is protected by its own dedicated uninterruptible power supply (UPS).

SAN Volume Controller virtualization


The SAN Volume Controller provides block aggregation and volume management for disk storage within the SAN. In simpler terms, the SAN Volume Controller manages a number of back-end storage controllers and maps the physical storage within those controllers to logical disk images that can be seen by application servers and workstations in the SAN. The SAN must be zoned in such a way that the application servers cannot see the back-end storage, preventing any possible conflict between the SAN Volume Controller and the application servers both trying to manage the back-end storage.

In the SAN fabric, three distinct zones are defined:
- In the server zone, the server systems can identify and address the nodes. You can have more than one server zone. Generally, you will create one server zone per server attachment.
- In the disk zone, the nodes can identify the disk storage subsystems. Generally, you will create one zone for each distinct storage subsystem.
- In the SVC zone, all SVC node ports are permitted to communicate for cluster management. Where remote Copy Services are to be used, an inter-cluster zone must be created.

The SAN Volume Controller I/O Groups are connected to the SAN in such a way that all back-end storage and all application servers are visible to all of the I/O Groups. The SAN Volume Controller I/O Groups see the storage presented to the SAN by the back-end controllers as a number of disks, known as Managed Disks (MDisks). Because the SAN Volume Controller does not attempt to provide recovery from physical disk failures within the back-end controllers, MDisks are usually, but not necessarily, part of a RAID array. MDisks are collected into one or several groups, known as Managed Disk Groups (MDGs). When an MDisk is assigned to an MDG, the MDisk is divided into a number of extents (default minimum size 16 MB, maximum size of 2 GB), which are numbered sequentially from the start to the end of each MDisk.


Note: For performance considerations, we recommend that you create Managed Disk Groups using only MDisks that have the same characteristics in terms of performance or reliability.

An MDG provides a pool of capacity (extents), which is used to create volumes, known as Virtual Disks (VDisks). When creating VDisks, the default option of striped allocation is normally the best choice. This option helps to balance I/Os across all the managed disks in an MDG, which optimizes overall performance and helps to reduce hot spots. Conceptually, this method is represented in Figure 14-1.

Figure 14-1 Extents being used to create Virtual Disks

The virtualization function in the SAN Volume Controller maps the VDisks seen by the application servers onto the MDisks provided by the back-end controllers. I/O traffic for a particular VDisk is, at any one time, handled exclusively by the nodes in a single I/O Group. Thus, although a cluster can have many nodes within it, the nodes handle I/O in independent pairs, which means that the I/O capability of the SAN Volume Controller scales well (almost linearly), because additional throughput can be obtained by simply adding additional I/O Groups. Figure 14-2 on page 425 summarizes the various relationships that bridge the physical disks through to the virtual disks within the SAN Volume Controller architecture.


Figure 14-2 Relationship between physical and virtual disks

14.1.2 SAN Volume Controller multipathing


Each SAN Volume Controller node presents a VDisk to the SAN through multiple paths. We recommend that a VDisk be visible in the SAN through four paths. In normal operation, two nodes provide redundant paths to the same storage, which means that, depending on zoning and SAN architecture, a single server might see eight paths to each LUN presented by the SAN Volume Controller. Each server host bus adapter (HBA) port needs to be zoned to a single port on each SVC node. Because most operating systems cannot resolve multiple paths back to a single physical device, IBM provides a multipathing device driver. The multipathing driver supported by the SAN Volume Controller is the IBM Subsystem Device Driver (SDD). SDD groups all available paths to a virtual disk device and presents it to the operating system as a single device. SDD performs all the path handling and selects the active I/O paths. SDD supports the concurrent attachment of ESS, DS8000, DS6000, and SVC storage systems to the same host system. Where one or more alternate storage systems are to be attached, you can identify which version of SDD is required at this Web site:
http://www.ibm.com/support/docview.wss?rs=540&uid=ssg1S7001350#WindowsSDD
Note: You can use SDD with the native Multipath I/O (MPIO) device driver on AIX and on Windows Server 2003 and Windows Server 2008. The AIX MPIO-capable device driver with the supported storage devices Subsystem Device Driver Path Control Module (SDDPCM) enhances data availability and I/O load balancing. The Subsystem Device Driver Device Specific Module (SDDDSM) provides multipath I/O support based on the MPIO technology of Microsoft.
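After zoning and LUN mapping, the SDD or SDDPCM command line can be used as a quick sanity check that the operating system sees the expected number of paths to each VDisk (output not shown here; the expected path count depends on your zoning):

datapath query adapter     (SDD: list the HBAs and their state)
datapath query device      (SDD: list each vpath device and its paths)
pcmpath query device       (SDDPCM on AIX: list each MPIO device and its paths)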


14.1.3 SVC Advanced Copy Services


The SAN Volume Controller provides Advanced Copy Services that enable you to copy VDisks using FlashCopy and Remote Copy functions. These Copy Services are available for all supported servers that are connected to the SAN Volume Controller.
FlashCopy includes:
- Single Target FlashCopy (FC)
- Multiple Target FlashCopy (MTFC)
- Cascaded FlashCopy (CFC)
- Incremental FlashCopy (IFC)
FlashCopy makes an instant, point-in-time copy from a source VDisk to a target VDisk. A FlashCopy can be made only to a VDisk within the same SAN Volume Controller.
Remote Copy includes:
- Metro Mirror
- Global Mirror
Metro Mirror makes a synchronous remote copy, which provides a consistent copy of a source VDisk to a target VDisk. Metro Mirror can copy between VDisks on separate SAN Volume Controllers or between VDisks within the same I/O Group on the same SAN Volume Controller. Global Mirror makes an asynchronous remote copy, which provides a remote copy over extended distances. Global Mirror can copy between VDisks on separate SAN Volume Controllers or between VDisks within the same I/O Group on the same SAN Volume Controller.
Enhancements to Advanced Copy Services functions are introduced with new releases of SAN Volume Controller software. With V4, the functions introduced at each release are:
- V4.1/V4.1.1: Global Mirror
- V4.2: Multiple FlashCopy targets, 40 TB FlashCopy per I/O Group, 40 TB Remote Copy per I/O Group
- V4.2.1: Incremental FlashCopy, Cascaded FlashCopy, up to 256 TB FlashCopy per I/O Group, up to 256 TB Remote Copy per I/O Group
- V4.3: Space Efficient VDisks, Virtual Disk Mirroring (local HA), Space Efficient FlashCopy, consistency management with FlashCopy at the Metro Mirror or Global Mirror target, FlashCopy targets to 256

Note: SAN Volume Controller Copy Services functions are not compatible with the ESS, DS6000, and DS8000 Copy Services.


For details about configuration and management of SAN Volume Controller Copy Services, refer to SVC V4.2.1 Advanced Copy Services, SG24-7574. A FlashCopy mapping can be created between any two VDisks in a cluster. It is not necessary for the VDisks to be in the same I/O Group or in the same Managed Disk Group. This functionality provides the ability to optimize your storage allocation using a secondary storage subsystem (with, for example, lower performance) as the target of the FlashCopy. In this case, the resources of your high performance storage subsystem will be dedicated for production, while your low-cost (lower performance) storage subsystem will be used for a secondary application (for example, backup or development). An advantage of SAN Volume Controller remote copy is that we can implement such relationships between two SAN Volume Controller clusters with different back-end disk subsystems. In this case, you can reduce the overall cost of the disaster recovery infrastructure. The production site can use high performance back-end disk subsystems, and the recovery site can use low-cost back-end disk subsystems, even where back-end disk subsystems Copy Services functions are not compatible (for example, different models or different manufacturers). This relationship is established at the VDisk level and does not depend on the back-end disk storage subsystem Copy Services. Important: For Metro Mirror copies, the recovery site VDisks need to have performance characteristics similar to the production site VDisks when a high write I/O rate is present in order to maintain the I/O response level for the host system.
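To illustrate the FlashCopy mapping described above, the following SVC CLI sketch creates and starts a mapping between two VDisks that can reside in different Managed Disk Groups (the VDisk and mapping names are hypothetical; verify the options against your SVC code level):

svctask mkfcmap -source VDISK_PROD -target VDISK_BKUP -name FCMAP_PROD_BKUP
svctask startfcmap -prep FCMAP_PROD_BKUP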

14.2 SAN Volume Controller performance considerations


The SAN Volume Controller cluster is scalable up to eight nodes. The performance is almost linear when adding more I/O Groups to a SAN Volume Controller cluster, until it becomes limited by other components in the storage infrastructure. While virtualization with the SAN Volume Controller provides a great deal of flexibility, it does not diminish the necessity to have a SAN and disk subsystems that can deliver the desired performance. In the following section, we present the SAN Volume Controller concepts and discuss the performance of the SAN Volume Controller. In this section, we assume there are no bottlenecks in the SAN or on the disk subsystem.

Determine the number of I/O Groups


Growing or adding new I/O Groups to an SVC cluster is a decision that has to be made either when a configuration limit is reached or when the I/O load reaches a point where a new I/O Group is needed. To determine the number of I/O Groups needed, you can use TotalStorage Productivity Center to monitor the CPU performance of each node. The CPU performance is related to I/O performance, and when the CPUs become consistently 70% busy, you must consider either:
- Adding more nodes to the cluster and moving part of the workload onto the new nodes
- Moving some VDisks to another I/O Group, if the other I/O Group is not busy
Note: A VDisk can only be moved to another I/O Group if there is no I/O activity on that VDisk. Any data in cache on the server must be destaged to disk. The SAN zoning and port masking might need to be updated to give access.
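If TotalStorage Productivity Center shows one I/O Group consistently busy while another has headroom, a VDisk can be reassigned after its I/O has been quiesced. A minimal sketch, assuming the VDisk and I/O Group names shown (quiesce host I/O and verify the zoning and multipathing configuration first):

svctask chvdisk -iogrp io_grp1 VDISK1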


To see how busy your CPUs are, you can use the TotalStorage Productivity Center performance report by selecting CPU Utilization. Several of the activities that affect CPU utilization are:
- VDisk activity: The preferred node is responsible for I/Os for the VDisk and coordinates sending the I/Os to the alternate node. While both nodes exhibit similar CPU utilization, the preferred node is a little busier. To be precise, a preferred node is always responsible for the destaging of writes for VDisks that it owns.
- Cache management: The purpose of the cache component is to improve performance of read and write commands by holding some read or write data in SVC memory. Because the nodes in a caching pair have physically separate memories, the cache component must keep the caches on both nodes consistent.
- FlashCopy activity: Each node (of the FlashCopy source) maintains a copy of the bitmap; CPU utilization is similar on both nodes.
- Mirror Copy activity: The preferred node is responsible for coordinating copy information to the target and also for ensuring that the I/O Group is up-to-date with the copy progress information or change block information. As soon as Global Mirror is enabled, there is an additional 10% overhead on I/O work due to the buffering and general I/O overhead of performing asynchronous remote copy.
With a newly added I/O Group, the SVC cluster can potentially double the I/O rate (IOPS) that it can sustain. An SVC cluster can be scaled up to an eight node cluster, which quadruples the total I/O rate.

Number of ports in the SAN used by SAN Volume Controller


Each SAN Volume Controller node has four Fibre Channel ports, so you need eight SAN ports per I/O Group: four ports in each fabric (two from each node).

Number of paths from SAN Volume Controller to disk subsystem


All SAN Volume Controller nodes in a cluster must be able to see the same set of storage subsystem ports on each device. Any configuration in which two nodes do not see the same set of ports on the same device is degraded, and the system logs errors that request a repair action. For the DS8000, there is no controller affinity for the LUNs, so a single zone containing all SVC ports and up to eight DS8000 host adapter (HA) ports must be defined on each fabric. The DS8000 HA ports must be distributed over as many HA cards as available and dedicated to SVC use if possible. Using no more than two ports on each HA card provides the maximum bandwidth. Configure a minimum of eight controller ports to the SVC per controller, regardless of the number of nodes in the cluster. Configure 16 controller ports for large controller configurations where more than 48 ranks are being presented to the SVC cluster.

Optimal Managed Disk Group configurations


A Managed Disk Group provides the pool of storage from which virtual disks will be created. It is therefore necessary to ensure that the entire pool of storage provides the same performance and reliability characteristics. For the DS8000, all LUNs in the same Managed Disk Group must:
- Use disk drive modules (DDMs) of the same capacity and speed
- Have arrays of the same RAID type
- Use LUNs that are the same size

Note: The SAN Volume Controller extent size does not have a great impact on the performance of an SVC installation, and the most important consideration is to be consistent across MDGs. You must use the same extent size in all MDGs within an SVC cluster to avoid limitations when migrating VDisks from one MDG to another MDG. For additional information, refer to the SAN Volume Controller Best Practices and Performance Guidelines, SG24-7521.
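As a minimal sketch (MDisk and group names are hypothetical), two Managed Disk Groups that separate 300 GB ranks from 146 GB ranks while keeping the same 512 MB extent size might be created as follows:

svctask mkmdiskgrp -name MDG_300GB_R5 -ext 512 -mdisk mdisk0:mdisk1:mdisk2:mdisk3
svctask mkmdiskgrp -name MDG_146GB_R5 -ext 512 -mdisk mdisk4:mdisk5:mdisk6:mdisk7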

14.3 DS8000 performance considerations with SVC


This section presents the principal DS8000 configuration recommendations to optimize the performance of your virtualized environment.

14.3.1 DS8000 array


The DS8000 storage system provides protection against the failure of individual Disk Drive Modules (DDMs) by the use of RAID arrays, which is important because the SAN Volume Controller provides no protection for the MDisks within a Managed Disk Group.

Array RAID configuration


A DS8000 array is a RAID 5, RAID 6, or RAID 10 array made up of eight DDMs. A DS8000 array is created from one array site:
- RAID 5 arrays are either 6+P+S or 7+P.
- RAID 6 arrays are either 5+P+Q+S or 6+P+Q.
- RAID 10 arrays are either 3+3+2S or 4+4.
A number of workload attributes influence the relative performance of RAID 5 compared to RAID 10, including the use of cache, the relative mix of read as opposed to write operations, and whether data is referenced randomly or sequentially. Consider that:
- For either sequential or random reads from disk, there is no significant difference between RAID 5 and RAID 10 performance, except at high I/O rates.
- For random writes to disk, RAID 10 performs better.
- For sequential writes to disk, RAID 5 performs better.
- RAID 6 performance is slightly inferior to RAID 5 but provides protection against two DDM failures.
For more details regarding the RAID 5 and RAID 10 difference, refer to 5.6, Planning RAID arrays and ranks on page 81.

14.3.2 DS8000 rank format


A rank is created for each array. There is a one-to-one relationship between arrays and ranks. The SAN Volume Controller requires that ranks are created in Fixed Block format, which divides each rank into 1 GB extents (where 1 GB = 2^30 bytes). A rank must be assigned to an extent pool to be available for LUN creation.


It is when the rank is assigned to an extent pool that the DS8000 processor complex (or server group) affinity is determined. To balance the ranks on each device adapter (DA), we recommend that you assign equal numbers of ranks of a given capacity on each DA pair to each DS8000 processor complex, as shown in Table 14-1 on page 430.

14.3.3 DS8000 extent pool implications


In the DS8000 architecture, extent pools are used to manage one or more ranks. An extent pool is visible to both processor complexes in the DS8000, but it is directly managed by only one of them. You must define a minimum of two extent pools, with one created for each processor complex, to fully exploit the resources. Note: Although it is possible to configure an extent pool with many ranks, for SAN Volume Controller performance optimization, we do not recommend more than one rank per extent pool. Table 14-1 on page 430 shows an example based on a DS8100 with two DDM sizes where all ranks will be assigned to the SAN Volume Controller. All arrays are configured as RAID 5, which gives ranks of four distinct capacities; each rank is assigned to an individual extent pool with affinity alternating between the two DS8000 processor complexes. Define one or two LUNs on each rank and allocate all LUNs to the SAN Volume Controller volume group. Create four Managed Disk Groups on the SAN Volume Controller and assign all LUNs of equal capacity to a specific group. If all DDMs are of equal capacity, we recommend that you create only two Managed Disk Groups and assign all LUNs of equal capacity on both DA pairs to a specific group. This method of allocation of LUNs to Managed Disk Groups can be extended to DS8000s with many DA pairs. The aim is to have LUNs in each Managed Disk Group from as many DA pairs as possible while maintaining a balance of LUNs on each DS8000 processor complex. Table 14-1 shows the assignment of ranks to extent pools; a DSCLI sketch of this layout follows the table.
Table 14-1 Assignment of ranks to extent pools

Device adapter  Array site  DDM size (GB)  Array  RAID type   Array capacity (GB)  Rank  Extent pool  Server group
2               S1          300            A0     5 (6+P+S)   1582                 R0    P0           0
2               S2          300            A1     5 (6+P+S)   1582                 R1    P1           1
2               S3          300            A2     5 (6+P+S)   1582                 R2    P2           0
2               S4          300            A3     5 (6+P+S)   1582                 R3    P3           1
2               S5          300            A4     5 (7+P)     1844                 R4    P4           0
2               S6          300            A5     5 (7+P)     1844                 R5    P5           1
2               S7          300            A6     5 (7+P)     1844                 R6    P6           0
2               S8          300            A7     5 (7+P)     1844                 R7    P7           1
0               S9          146            A8     5 (6+P+S)   779                  R8    P8           0
0               S10         146            A9     5 (6+P+S)   779                  R9    P9           1
0               S11         146            A10    5 (6+P+S)   779                  R10   P10          0
0               S12         146            A11    5 (6+P+S)   779                  R11   P11          1
0               S13         146            A12    5 (7+P)     909                  R12   P12          0
0               S14         146            A13    5 (7+P)     909                  R13   P13          1
0               S15         146            A14    5 (7+P)     909                  R14   P14          0
0               S16         146            A15    5 (7+P)     909                  R15   P15          1
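As a hedged DSCLI sketch of the first two rows of Table 14-1 (the device, pool, and volume IDs are illustrative, the array and rank creation steps are omitted, and option syntax can differ slightly between DSCLI releases):

# One fixed block extent pool per rank, alternating between rank group 0 and rank group 1
mkextpool -rankgrp 0 -stgtype fb SVC_P0
mkextpool -rankgrp 1 -stgtype fb SVC_P1
# One LUN per rank that consumes the full extent pool capacity, for use as an SVC MDisk
mkfbvol -extpool P0 -cap 1582 -name SVC_R0 1000
mkfbvol -extpool P1 -cap 1582 -name SVC_R1 1100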

Multiple ranks per extent pool configuration


This section shows why configuring several ranks in an extent pool is not optimized for performance in an SAN Volume Controller environment. To clearly explain this performance limitation, we can use as an example the configuration presented in Figure 14-3.

Figure 14-3 Example showing configuration with multiple ranks to an extent pool

In this example, an extent pool (Extent Pool 1) is defined on a DS8000. This extent pool includes three ranks, each of which is 519 GB. The overall capacity of this extent pool is 1.5 TB. This capacity is available for LUN creation as a set of 1 GB DS8000 extents.

In this pool of available extents, we create one DS8000 Logical Volume, called Volume0, which contains all the extents in the extent pool. Volume0 is 1.5 TB. Due to the DS8000 internal Logical Volume creation algorithm, the extents from rank1 are assigned first, then the extents of rank2, and then the extents of rank3. In this case, the data stored on the first third of Volume0 is physically located on rank1, the second third on rank2, and the last third on rank3. When Volume0 is assigned to the SAN Volume Controller, the Logical Volume is identified by the SAN Volume Controller cluster as a Managed Disk, MDiskB. MDiskB is assigned to a Managed Disk Group, MDG0, where the SAN Volume Controller extent size is defined as 512 MB. Two other Managed Disks, MDiskA and MDiskC, both 1.5 TB, from the same DS8000 but from different extent pools, are defined in this Managed Disk Group. These extent pools are configured similarly to Extent Pool 1. The overall capacity of the Managed Disk Group is 4.5 TB. This capacity is available through a set of 512 MB SAN Volume Controller extents. Next, a SAN Volume Controller Virtual Disk called VDisk0 is created in Managed Disk Group 0. VDisk0 is 50 GB, which is 100 SAN Volume Controller extents. VDisk0 is created in SAN Volume Controller Striped mode in the hope of obtaining optimum performance. But actually, a performance bottleneck was just created. When VDisk0 was created, it was assigned sequentially one SAN Volume Controller extent from MDiskA, then one SAN Volume Controller extent from MDiskB, then one extent from MDiskC, and so on. In total, VDisk0 was assigned the first 34 extents of MDiskA, the first 33 of MDiskB, and the first 33 of MDiskC. This is where the bottleneck occurs. All of the first 33 extents used from MDiskB are physically located at the beginning of Volume0, which means that all of these extents belong to DS8000 rank1. This configuration does not follow the performance recommendation to spread the workload assigned to VDisk0 across all the ranks defined in the extent pool. In this case, performance is limited to the performance of a single rank. Furthermore, if the configuration of MDiskA and MDiskC is equivalent to MDiskB, the data stored on VDisk0 is spread across only three of the nine ranks available within the three DS8000 extent pools used by the SAN Volume Controller. This example shows the bottleneck for VDisk0, but more generally, almost all of the VDisks created in this Managed Disk Group will be spread across only three ranks instead of the nine ranks available. Important: The configuration presented in Figure 14-3 on page 431 is not optimized for performance in an SAN Volume Controller environment.

Note: The rotate extents algorithm for Storage Pool Striping within an extent pool is not recommended for use with the SVC. There is no perceived advantage, because the SVC will stripe across MDisks by default.

One rank per extent pool configuration


This section shows why configuring one rank in an extent pool is optimized for performance in an SAN Volume Controller environment. To clearly explain this performance optimization, we can use as an example the configuration presented in Figure 14-4 on page 433.


Figure 14-4 Example showing configuration with a single rank per extent pool

In this example, nine extent pools are defined on a DS8000. Each extent pool includes only one rank of 519 GB, so the overall capacity of each extent pool is 519 GB. This capacity is available through a set of 1 GB DS8000 extents. In each extent pool, we create one volume that uses all the capacity of the extent pool. The nine volumes created each have a size of 519 GB. Volume1 through Volume9 are assigned to the SAN Volume Controller. These volumes are identified by the SAN Volume Controller cluster as Managed Disks, MDisk1 through MDisk9. These Managed Disks are assigned to a Managed Disk Group, MDG0, where the SAN Volume Controller extent size is defined as 512 MB. The overall capacity of the Managed Disk Group is 4.5 TB. A Virtual Disk (VDisk0) of 50 GB (100 SAN Volume Controller extents) is created in this storage pool. The Virtual Disk is created in SAN Volume Controller Striped mode in order to obtain the greatest performance. This mode implies that VDisk0 is allocated sequentially one extent from MDisk1, then one extent from MDisk2, and so on until one extent has been taken from MDisk9, after which allocation returns to MDisk1. VDisk0 is allocated the first 12 extents of MDisk1 and the first 11 extents of MDisk2 through MDisk9. In this case, all the SAN Volume Controller extents assigned to VDisk0 are physically located on all nine ranks of the DS8000. This configuration permits you to spread the workload applied to VDisk0 across all nine ranks of the DS8000. In this case, we efficiently use the hardware available for each VDisk of the Managed Disk Group.


Important: The configuration presented in Figure 14-4 on page 433 is optimized for performance in an SAN Volume Controller environment.

14.3.4 DS8000 volume considerations with SVC


This section details the recommendations regarding volume creation on the DS8000 when assigned to SAN Volume Controller.

Number of volumes per extent pool


The DS8000 provides a mechanism to create multiple volumes from a single extent pool, which is useful when the storage subsystem is directly presenting storage to the servers. In a SAN Volume Controller environment, we recommend that you define one or two volumes. Tests show a small response time advantage to the two LUNs per array configuration and a small IOPS advantage to the one LUN per array configuration for sequential workloads. Overall, the performance differences between these configurations are minimal.

Volume size consideration


Because we recommend defining one or two volumes per extent pool, the volume size is determined by the overall capacity of the extent pool. The maximum volume size currently supported by the DS8000 and the SVC MDisk is 2 TB. If this recommendation is not possible in your specific environment, we recommend that you at least assign DS8000 LUNs of the same size to the SAN Volume Controller for each Managed Disk Group. In this configuration, the workload applied to a Virtual Disk is equally balanced across the Managed Disks within the Managed Disk Group.

Dynamic Volume Expansion


Dynamic Logical Volume Expansion on the DS8000 is not supported for the SAN Volume Controller. All available capacity in the extent pool must be defined to the SAN Volume Controller MDisk to allow VDisk striping over the maximum number of extents. It is better to have the free space on the SAN Volume Controller from the start, because adding space to an existing Managed Disk Group does not redistribute the extents already defined. A DS8000 LUN assigned as an MDisk can be expanded only if the MDisk is removed from the Managed Disk Group first, which automatically redistributes the defined VDisk extents to other MDisks in the group, provided there is space available. The LUN can then be expanded, detected as a new MDisk, and reassigned to a Managed Disk Group.

FATA or SATA drives


DS8000 Fibre Channel Advanced Technology Attachment (FATA) or Serial Advanced Technology Attachment (SATA) drives are not suited for use as MDisks by the SAN Volume Controller for high performance applications.

14.3.5 Volume assignment to SAN Volume Controller


On the DS8000, we recommend creating one volume group that includes all the volumes to be managed by the SAN Volume Controller, and assigning that volume group to the host connections defined for all the SAN Volume Controller node ports.
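A minimal DSCLI sketch of this approach (the volume IDs and volume group name are hypothetical; a host connection for each SVC node port must also be created and assigned to this volume group):

# Create one SCSI-mask volume group containing all the LUNs that the SVC will manage
mkvolgrp -type scsimask -volume 1000-100F SVC_VG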


New volumes can be added dynamically to the SAN Volume Controller. When the volume has been added to the volume group, run the command svctask detectmdisk on the SAN Volume Controller to add it as a new MDisk. Before you delete or unmap a volume allocated to the SAN Volume Controller, remove the MDisk from the Managed Disk Group, which will automatically migrate any extents for defined VDisks to other MDisks in the Managed Disk Group provided there is space available. When it has been unmapped on the DS8000, run the command svctask detectmdisk and then run the maintenance procedure on the SAN Volume Controller to confirm its removal.
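A hedged sketch of the add and remove flows just described (the object IDs are illustrative):

# DS8000: add a new volume to the SVC volume group (here volume group V0)
chvolgrp -action add -volume 1010 V0
# SVC: scan the fabric so the new LUN appears as an unmanaged MDisk
svctask detectmdisk
# Before unmapping a LUN, remove the MDisk from its group; -force migrates its extents
# to the remaining MDisks in the group, provided free space is available
svctask rmmdisk -mdisk mdisk8 -force MDG0
# After the LUN has been unmapped on the DS8000, rescan and run the maintenance procedure
svctask detectmdisk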

14.3.6 Managed Disk Group for DS8000 Volumes


As already stated in Optimal Managed Disk Group configurations on page 428, all DS8000 volumes in a particular Managed Disk Group must have identical properties. This consistency will permit even performance on the VDisks defined in the group. Also, you must consider the location of those volumes within the DS8000 with respect to the server and the device adapter (DA). Create each Managed Disk Group using an even number of MDisks with half the volumes associated with each DS8000 processor complex. If you assign a large number of MDisks from a single DS8000, performance can be improved by selecting MDisks from ranks on different device adapters as shown in Figure 14-5. We recommend between four and twelve MDisks per Managed Disk Group for the DS8000.

Figure 14-5 Managed Disk Group using multiple DS8000 DA pairs

If a volume is added to an existing Managed Disk Group, the performance can become unbalanced due to the extents already assigned. These extents can be rebalanced manually or by using the Perl script provided as part of the SVCTools package from the alphaWorks Web site. Refer to SAN Volume Controller Best Practices and Performance Guidelines, SG24-7521, for information about performing this task.

14.4 Performance monitoring


You can use IBM TotalStorage Productivity Center to manage the IBM SAN Volume Controller and monitor its performance. IBM TotalStorage Productivity Center and IBM TotalStorage Productivity Center for Disk are described in:
- TotalStorage Productivity Center V3.3 Update Guide, SG24-7490
- Monitoring Your Storage Subsystems with TotalStorage Productivity Center, SG24-7364
All new DS8000 and SAN Volume Controller shipments include the IBM System Storage Productivity Center (SSPC), which is a storage system management console that includes TotalStorage Productivity Center Basic Edition, which provides:
- Storage Topology Viewer
- The ability to monitor, alert, report, and provision storage
- Status dashboard
- IBM System Storage DS8000 GUI integration
The TotalStorage Productivity Center Basic Edition provided with the SSPC can be upgraded to the full TotalStorage Productivity Center Standard Edition license if required.

14.4.1 Using TotalStorage Productivity Center for Disk to monitor the SVC
To configure TotalStorage Productivity Center for Disk to monitor IBM SAN Volume Controller, refer to SAN Volume Controller Best Practices and Performance Guidelines, SG24-7521.

Data collected from SAN Volume Controller


The two most important metrics when measuring I/O subsystem performance are response time in milliseconds and throughput in I/Os per second (IOPS):
- Response time in non-SVC environments is measured from when the server issues a command to when the storage controller reports the command as completed. With the SVC, we have to consider response time from the server to the SVC nodes and also from the SVC nodes to the storage controllers.
- Throughput, however, can be measured at a variety of points along the data path, and the SVC adds additional points where throughput is of interest and measurements can be obtained.
TotalStorage Productivity Center offers many disk performance reporting options that support the SVC environment well, and also the storage controller back end for a variety of storage controller types. The most relevant storage components where performance metrics can be collected when monitoring storage controller performance are:
- Subsystem
- Controller
- Array
- Managed Disk
- Managed Disk Group
- Port

SAN Volume Controller thresholds


Thresholds are used to determine watermarks for warning and error indicators for an assortment of storage metrics. The SAN Volume Controller has the following thresholds with their default properties:
- VDisk I/O rate: Total number of virtual disk I/Os for each I/O Group
- VDisk bytes per second: Virtual disk bytes per second for each I/O Group
- MDisk I/O rate: Total number of managed disk I/Os for each Managed Disk Group
- MDisk bytes per second: Managed disk bytes per second for each Managed Disk Group
The default status for these properties is Disabled, with the Warning and Error options set to None. Only enable a particular threshold after the minimum values for warning and error levels have been defined.
Tip: In TotalStorage Productivity Center for Disk, default threshold warning or error values of -1.0 indicate that there is no recommended minimum value for the threshold, which is therefore entirely user defined. You can choose to provide any reasonable value for these thresholds based on the workload in your environment.

14.5 Sharing the DS8000 between a server and the SVC


The DS8000 can be shared between servers and a SAN Volume Controller. This sharing can be useful if you want to have direct attachment for specific Open Systems servers or if you need to share your DS8000 between the SAN Volume Controller and an unsupported server, such as System i or System z. For the latest list of hardware that is supported for attachment to the SVC, refer to: http://www.ibm.com/support/docview.wss?rs=591&uid=ssg1S1003277

14.5.1 Sharing the DS8000 between Open Systems servers and the SVC
If you have a mixed environment that includes the IBM SAN Volume Controller and Open Systems servers, we recommend sharing as many DS8000 resources as possible between both environments. Our storage configuration recommendation is to create one extent pool per rank. In each extent pool, create one volume allocated to the IBM SAN Volume Controller environment and allocate one or more other volumes to the Open Systems servers. In this configuration, each environment can benefit from the DS8000 overall performance. If an extent pool has multiple ranks, we recommend that all SVC volumes be created using the rotate volumes algorithm. Server volumes can use the rotate extents algorithm if desired.


IBM supports sharing a DS8000 between a SAN Volume Controller and an Open Systems server. However, if a DS8000 port is in the same zone as a SAN Volume Controller port, that same DS8000 port must not be in the same zone as another server.

14.5.2 Sharing the DS8000 between System i server and the SVC
The IBM SAN Volume Controller does not support System i server attachment. If you have a mixed server environment that includes the IBM SAN Volume Controller and System i servers, you have to share your DS8000 to provide direct access to the System i volumes and access to the Open Systems server volumes through the IBM SAN Volume Controller. In this case, we recommend sharing as many DS8000 resources as possible between both environments. Our storage configuration recommendation is to create one extent pool per rank. In each extent pool, create one volume allocated to the IBM SAN Volume Controller environment and create one or more other volumes allocated to the System i servers. IBM supports sharing a DS8000 between a SAN Volume Controller and System i servers. However, if a DS8000 port is in the same zone as a SAN Volume Controller port, that same DS8000 port must not be in the same zone as System i servers.

14.5.3 Sharing the DS8000 between System z server and the SVC
The IBM SAN Volume Controller does not support System z server attachment. If you have a mixed server environment that includes the IBM SAN Volume Controller and System z servers, you have to share your DS8000 to provide direct access to the System z volumes and access to the Open Systems server volumes through the IBM SAN Volume Controller. In this case, you have to split your DS8000 resources between the two environments. Some of the ranks have to be created using the count key data (CKD) format (used for System z access), and the other ranks have to be created in Fixed Block (FB) format (used for IBM SAN Volume Controller access). In this case, both environments get performance related to the allocated DS8000 resources. A DS8000 port will not support a shared attachment between System z and the IBM SAN Volume Controller, because System z servers use Fibre Channel connection (FICON) and the IBM SAN Volume Controller only supports Fibre Channel Protocol (FCP) connections.

14.6 Advanced functions for the DS8000


The DS8000 provides Copy Services functions that are not compatible with the SAN Volume Controller Advanced Copy Services. Prior to SAN Volume Controller 3.1, enabling any Copy Services function in a RAID array controller for a LUN that was being virtualized by the SAN Volume Controller was not supported, because the behavior of the write-back cache in the SAN Volume Controller led to data corruption. With the advent of cache-disabled VDisks, it becomes possible to enable Copy Services in the underlying RAID array controller for these LUNs.

14.6.1 Cache-disabled VDisks


Where Copy Services are used in the DS8000, the controller LUNs at both the source and destination must be mapped through the SAN Volume Controller as image mode cache-disabled VDisks. Note that, of course, it is possible to access either the source or the target of the remote copy from a server directly, rather than through the SAN Volume Controller. The SAN Volume Controller Copy Services can be usefully employed with the image mode VDisk representing the primary of the controller copy relationship, but it does not make sense to use SAN Volume Controller Copy Services with the VDisk at the secondary site, because the SAN Volume Controller does not see the data flowing to this LUN through the controller. Cache-disabled VDisks are primarily used when virtualizing an existing storage infrastructure and you need to retain the existing storage system Copy Services. You might want to use cache-disabled VDisks where there is a lot of intellectual capital in existing Copy Services automation scripts. We recommend that you keep the use of cache-disabled VDisks to a minimum for normal workloads. Another case where you might need to use cache-disabled VDisks is where you have servers, such as System i or System z, that are not supported by the SAN Volume Controller, but you need to maintain a single Global Mirror session for consistency between all servers. In this case, the DS8000 Global Mirror must be able to manage the LUNs for all server systems.
Important: When configuring cache-disabled VDisks, consider the guidelines for configuring the DS8000 for host system attachment, which will have as great an impact on performance as the SAN Volume Controller. Because the SAN Volume Controller will not perform any striping of VDisks, it might be an advantage to use extent pools with multiple ranks to allow volumes to be created using the rotate extents algorithm. The guidelines for the use of different DA pairs for FlashCopy source and target LUNs with affinity to the same DS8000 server will also apply.
Cache-disabled VDisks can also be used to control the allocation of cache resources. By disabling the cache for certain VDisks, more cache resources will be available to cache I/Os to other VDisks in the same I/O Group. This technique is particularly effective where an I/O Group is serving some VDisks that will benefit from cache and other VDisks where the benefits of caching are small or non-existent. Currently, there is no direct way to enable the cache for previously cache-disabled VDisks. You will need to remove the VDisk from the SAN Volume Controller and redefine it with cache enabled.
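As an illustration (a sketch only: the MDisk must be in unmanaged mode, and the names are hypothetical), an image mode, cache-disabled VDisk is created by specifying the cache parameter at creation time:

svctask mkvdisk -mdiskgrp MDG_IMAGE -iogrp 0 -vtype image -mdisk mdisk9 -cache none -name VDISK_MM_PRIMARY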

14.7 Configuration guidelines for optimizing performance


Note: The guidelines given here are not unique to configuring the DS8000 for SAN Volume Controller attachment. In general, any server will benefit from a balanced configuration that makes use of the maximum available bandwidth of the DS8000.
Follow the guidelines and procedures outlined in this section to make the most of the performance available from your DS8000 storage subsystems and avoid potential I/O problems:
- Use multiple host adapter cards on the DS8000. Where possible, use no more than two ports on each card.
- Create a one-to-one relationship between extent pool and rank.
- Avoid splitting an extent pool into multiple volumes at the DS8000 layer. Where possible, create one or at most two volumes on the entire capacity of the extent pool.
- Ensure that you have an equal number of extent pools and volumes spread equally across the device adapters and the two processor complexes of the DS8000 storage subsystem.
- Ensure that Managed Disk Groups contain MDisks with similar characteristics and the same capacity. Consider the following factors:
  - The number of DDMs in the array, for example, 6+P+S or 7+P
  - The disk rotational speed: 10k rpm or 15k rpm
  - The DDM attachment: Fibre Channel or FATA/SATA
  - The underlying RAID type: RAID 5, RAID 6, or RAID 10
  Using the same MDisk capacity within a Managed Disk Group makes efficient use of the SAN Volume Controller striping.
- Do not mix MDisks of differing performance in the same Managed Disk Group. The overall group performance will be limited by the slowest MDisk in the group.
- Do not mix MDisks from different controllers in the same Managed Disk Group.
- For Metro Mirror configurations, always use DS8000 MDisks with similar characteristics for both the master VDisk and the auxiliary VDisk.


Chapter 15. System z servers
In this chapter, we describe the performance features and other enhancements that enable higher throughput and lower response time when connecting a DS8000 to your System z. We also review several of the monitoring tools and their usage with the DS8000.


15.1 Overview
The special synergy between DS8000 features and the System z operating systems (mainly the z/OS operating system) makes the DS8000 an outstanding performer in that environment. The specific DS8000 performance features as they relate to application I/O in a z/OS environment include:
- Parallel Access Volumes (PAVs)
- Multiple Allegiance
- I/O priority queuing
- Logical volume sizes
- Fibre Channel connection (FICON)
In the following sections, we describe those DS8000 features and discuss how to best use them to boost performance.

15.2 Parallel Access Volumes


Simply stated, Parallel Access Volumes (PAVs) allow multiple concurrent I/Os to the same volume at the same time from applications running on the same z/OS system image. This concurrency helps applications better share the same logical volumes with reduced contention. The ability to send multiple concurrent I/O requests to the same volume nearly eliminates I/O queuing in the operating system, thus reducing I/O responses times. Traditionally, access to highly active volumes has involved manual tuning, splitting data across multiple volumes, and more actions in order to avoid those hot spots. With PAV and the z/OS Workload Manager, you can now almost forget about manual device level performance tuning or optimizing. The Workload Manager is able to automatically tune your PAV configuration and adjust it to workload changes. The DS8000 in conjunction with z/OS has the ability to meet the highest performance requirements. PAV is implemented by defining alias addresses to the conventional base address. The alias address provides the mechanism for z/OS to initiate parallel I/O to a volume. As its name implies, an alias is just another address/unit control block (UCB) that can be used to access the volume defined on the base address. An alias can only be associated with a base address defined in the same logical control unit (LCU). The maximum number of addresses that you can define in an LCU is 256. Theoretically, you can define one base address, plus 255 aliases in an LCU.

15.2.1 Static PAV, Dynamic PAV, and HyperPAV


Aliases are initially defined to be associated with a certain base address. In a static PAV environment, the alias is always associated with the same base address, while in a dynamic PAV and HyperPAV environment, an alias can be reassigned to any base address as need dictates. With dynamic PAV, you do not need to assign as many aliases in an LCU as compared to a static PAV environment, because the aliases are moved around to the base addresses that need an extra alias to satisfy an I/O request. The z/OS Workload Manager (WLM) is used to implement dynamic PAVs. This function is called dynamic alias management. With dynamic alias management, WLM can automatically perform alias device reassignments from one base device to another base device to help meet its goals and to minimize IOS queuing as workloads change. WLM manages PAVs across all the members of a sysplex. When making decisions on alias reassignment, WLM considers I/O from all systems in the sysplex. By default, the function is turned off, and it must be explicitly activated for the sysplex through an option in the WLM service definition, and through a device-level option in the hardware configuration definition (HCD). Dynamic alias management requires your sysplex to run in WLM Goal mode. In a HyperPAV environment, WLM is no longer involved in managing alias addresses. When the base address UCB is in use by an I/O operation, each additional I/O against that address is assigned an alias that can be picked from a pool of alias addresses within the same LCU. This approach eliminates the latency caused by WLM having to manage the alias movement from one base address to another base address. Also, as soon as the I/O that uses the alias is finished, it drops that alias, which makes the alias available again in the alias pool. HyperPAV allows different hosts to use one alias to access different base addresses, which reduces the number of alias addresses required to support a set of base addresses in a System z environment.
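HyperPAV must also be enabled at the z/OS level in addition to the licensed function on the DS8000. As a hedged reminder of the typical operator interface (check the documentation for your z/OS level for the exact syntax and prerequisites):

SETIOS HYPERPAV=YES      (enable HyperPAV dynamically; HYPERPAV=YES can also be coded in the IECIOSxx parmlib member)
D IOS,HYPERPAV           (display whether HyperPAV mode is active)
D M=DEV(dddd)            (display base and alias information for device dddd)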

15.2.2 HyperPAV compared to dynamic PAV test


The following simple test shows the advantage of HyperPAV compared to dynamic PAV. The test environment includes:
- The jobs run consist of six IEBGENER read jobs and five IEBDG write jobs. Every job uses its own dataset, and all datasets reside on one volume, so all these jobs cause contention on the base address where the volume resides. At the end of the job, the job resubmits itself, so that all the jobs keep running until they are cancelled.
- On the dynamic PAV test, we reset the PAV to one, which means that the base address starts with no aliases assigned to it. With HyperPAV, the base address also starts with no aliases, because there is no I/O activity against that address.
- Every job is assigned the same WLM service class. The group of read and write jobs is assigned to a separate report class.
Figure 15-1 on page 444 shows the test result. The following observations can be concluded from this test result:
- Response time: In the HyperPAV environment, the response time is about the same from the start of the test until the end of the test. In the dynamic PAV environment, the response time is extremely high initially, because the IOSQ is very high due to the lack of aliases assigned to the base address. The aliases are added by WLM one at a time.


- I/O rate: With HyperPAV, the I/O rate reaches its maximum almost immediately; the I/O rate achieved is around 5780 IOPS. With dynamic PAV, the I/O rate starts to creep up from the beginning of the test and takes several minutes to reach its maximum of about 5630 IOPS.

Figure 15-1 HyperPAV compared to dynamic PAV test result

Figure 15-2 shows the number of PAVs assigned to the base address.


Figure 15-2 PAVs assigned using HyperPAV and dynamic PAV

- HyperPAV: The number of PAVs almost immediately jumps to around 10 and fluctuates between 9 and 11.
- Dynamic PAV: The number of PAVs starts at one, and WLM gradually increases the PAVs one at a time until it reaches a maximum of nine here.
In this test, we see that HyperPAV assigns more aliases compared to dynamic PAV. But we also see that HyperPAV reaches a higher I/O rate compared to dynamic PAV. Note that this is an extreme test that tries to show how HyperPAV reacts to a very high concurrent I/O rate to a single volume, as compared to how dynamic PAV responds to this condition. The conclusion here is that HyperPAV can and will react immediately to a condition where there is a high concurrent demand on a volume. The other advantage of HyperPAV is that there is no overhead for assigning and releasing an alias for every single I/O operation that needs an alias.

15.2.3 PAV and large volumes


By using queuing models, we can see the performance impact on the IOS queuing time when comparing a 3390-3 to larger volume sizes with various numbers of aliases. This modeling shows that one 3390-9 with 3 aliases (for a total of 4 UCBs) will have less IOSQ time as compared to three 3390-3s with 1 alias each (for a total of 6 UCBs). Using larger volumes reduces the number of total UCBs required. HyperPAV will reduce the number of aliases required even further. IBM Storage Advanced Technical Support provides an analysis based on your Resource Measurement Facility (RMF). As a result of this study, they will provide a recommendation of how many UCB aliases you will need to define per LCU on your DS8000.



15.3 Multiple Allegiance


Normally, if a System z host image (server or logical partition (LPAR)) does an I/O request to a device address for which the storage subsystem is already processing an I/O originating from another System z host image, the storage subsystem will send back a device busy indication and the I/O has to be retried. This response delays the new request and adds to processor and channel overhead (this delay is reported in the RMF Device Activity Report PEND time column). With older storage subsystems (before the DS8000 or ESS), a device has an implicit allegiance, that is, a relationship created in the disk control unit between the device and a channel path group, when an I/O operation was accepted by the device. The allegiance caused the control unit to guarantee access (no busy status presented) to the device for the remainder of the channel program over the set of paths associated with the allegiance. The DS8000, thanks to Multiple Allegiance (MA), can accept multiple parallel I/O requests from different hosts to the same device address, increasing parallelism and reducing channel overhead. The requests are accepted by the DS8000 and all requests will be processed in parallel, unless there is a conflict when writing data to the same extent of the count key data (CKD) logical volume. Still, good application access patterns can improve the global parallelism by avoiding reserves, limiting the extent scope to a minimum, and setting an appropriate file mask, for example, if no write is intended. In systems without Multiple Allegiance, all except the first I/O request to a shared volume are rejected, and the I/Os are queued in the System z channel subsystem, showing up as PEND time in the RMF reports. Multiple Allegiance provides significant benefits for environments running a sysplex or System z systems sharing access to volumes. Multiple Allegiance and PAV can operate together to handle multiple requests from multiple hosts. The DS8000 ability to run channel programs to the same device in parallel can dramatically reduce the IOSQ and the PEND time components in shared environments. In particular, different workloads, for example, batch and online, running in parallel on different systems can have an unfavorable impact on each other. In such cases, Multiple Allegiance can dramatically improve the overall throughput.

15.4 How PAV and Multiple Allegiance work


These two functions allow multiple I/Os to be executed concurrently against the same volume in a z/OS environment. In the case of PAV, the I/Os come from the same LPAR or z/OS system, while for Multiple Allegiance, the I/Os come from different LPARs or z/OS systems. First, we look at a disk subsystem that does not support either of these functions. If there is an outstanding I/O operation to a volume, all subsequent I/Os have to wait, as illustrated in Figure 15-3. I/Os coming from the same LPAR wait in the LPAR, and this wait time is recorded in IOSQ Time. I/Os coming from different LPARs wait in the disk control unit and are recorded in Device Busy Delay Time, which is part of PEND Time. In the ESS and DS8000, all these I/Os are executed concurrently using PAV and Multiple Allegiance, as shown in Figure 15-4 on page 447. I/O from the same LPAR is executed concurrently using UCB 1FF, which is an alias of base address 100. I/O from a different LPAR is accepted by the disk control unit and executed concurrently. All these I/O operations will be satisfied from either the cache or one of the disk drive modules (DDMs) on a rank where the volume resides.

Figure 15-3 Concurrent I/O prior to PAV and Multiple Allegiance

Figure 15-4 Concurrent I/O with PAV and Multiple Allegiance

15.4.1 Concurrent read operation


Figure 15-5 shows that concurrent read operations from the same LPAR or different LPARs can be executed at the same time, even if they are accessing the same record on the same volume.


Figure 15-5 Concurrent read operations

15.4.2 Concurrent write operation


Figure 15-6 shows the concurrent write operation. If the write I/Os access different domains on the volume, all the write I/Os are executed concurrently. If the write operations are directed to the same domain, the first write to that domain is executed, and other writes to the same domain have to wait until the first write finishes. This wait time is included in the Device Busy Delay time, which is part of PEND time. Note: The domain of an I/O covers the specified extents to which the I/O operation applies. It is identified by the Define Extent command in the channel program. The domain covered by the Define Extent used to be much larger than the domain covered by the I/O operation. When concurrent I/Os to the same volume were not allowed, this was not an issue, because subsequent I/Os had to wait anyway. With the availability of PAV and Multiple Allegiance, this extent conflict might prevent multiple I/Os from being executed concurrently. An extent conflict can occur when multiple I/O operations try to execute against the same domain on the volume. The solution is to update the channel programs so that they minimize the domain that each channel program covers. For a random I/O operation, the domain must be the one track where the data resides. If a write operation is being executed, any read or write to the same domain has to wait. Similarly, if a read to a domain has started, subsequent I/Os that want to write to the same domain have to wait until the read operation is finished. To summarize, all reads can be executed concurrently, even if they are going to the same domain on the same volume. A write operation cannot be executed concurrently with any other read or write operation that accesses the same domain on the same volume. The purpose of serializing write operations to the same domain is to maintain data integrity.


Figure 15-6 Concurrent write operation

15.5 I/O Priority Queuing


The DS8000 can manage multiple channel programs concurrently, as long as the data accessed by one channel program is not altered by another channel program. If I/Os cannot run in parallel, for example, due to extent conflicts, and must be serialized to ensure data consistency, the DS8000 internally queues I/Os. Channel programs that cannot execute in parallel are processed in the order that they are queued. A fast system cannot monopolize access to a device that is also accessed from a slower system; each system gets a fair share. The DS8000 can also queue I/Os from different z/OS system images in a priority order. z/OS Workload Manager can make use of this prioritization and prioritize I/Os from one system against the others. You can activate I/O Priority Queuing in WLM Goal mode with the I/O priority management option in the WLM Service Definition settings. When a channel program with a higher priority comes in and is put ahead of the queue of channel programs with lower priorities, the priorities of the lower priority programs are increased. This priority increase prevents high priority channel programs from dominating lower priority ones and gives each system a fair share, based on the priority assigned by WLM.

15.6 Logical volume sizes


The DS8000 supports CKD logical volumes of any size from one cylinder up to 262668 cylinders. The term custom volume denotes that a user has the flexibility to select the size of a volume, and does not need to match the size of the standard real devices, such as the 3339 cylinders of the 3390-3, or the 10017 cylinders of the 3390-9.


Besides these standard models, there is the 3390-27 that supports up to 32760 cylinders and the 3390-54 that supports up to 65520 cylinders. With the availability of the Extended Address Volume (EAV), we now have the capability to support extremely large volumes of up to 262668 cylinders.

15.6.1 Selecting the volume size


A key factor to consider when planning the CKD volume configuration and sizes is the 256-device limit per logical subsystem (LSS). You need to define volumes with enough capacity so that you can use all of your installed capacity with at most 256 devices. On Enterprise Systems Connection (ESCON)-attached systems, the number of devices can be even smaller, 128 or even 64, due to ESCON constraints. If using PAV, a portion of the 256 addresses will be used for aliases. When planning the configuration, also consider future growth. You might want to define more alias addresses than needed, so that in the future you can add an additional rank on this LCU, if needed.

Figure 15-7 on page 451 shows the number of volumes that can be defined on one (6+P) RAID 5 rank for different 3390 models on different DDM sizes. The 3390 models used in this chart are:
- 3390-3: 3339 cylinders
- 3390-9: 10017 cylinders
- 3390-27: 30051 cylinders, which is 27 times the capacity of a 3390-1
- 3390-54: 60102 cylinders, which is 54 times the capacity of a 3390-1
- EAV: 240408 cylinders, which is 216 times the capacity of a 3390-1, or four times the capacity of the 3390-54

It is obvious that if you define 3390-3 volumes on a 146 GB DDM array, you cannot define all 291 volumes on one LCU due to the 256-address limitation of the LCU. In this case, you have to define multiple LCUs on that rank. A better option is to use the larger 3390 models, especially if you have multiple ranks that you want to define under one LCU. In this example, you only need to define four EAVs to occupy the 146 GB DDM array.


(Chart data, number of volumes per (6+P) RAID 5 rank:)

             146 GB DDM   300 GB DDM   450 GB DDM
  3390-3         291          590          886
  3390-9          98          196          295
  3390-27         32           65           97
  3390-54         16           32           49
  EAV              4            8           12

Figure 15-7 Number of volumes on a (6+P) RAID 5 rank
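
To check a candidate volume size against the 256-device limit, a quick calculation of how many volumes of each model fit on a rank can help. The following is a minimal Python sketch, assuming a hypothetical figure of about 880 usable extents for a (6+P) RAID 5 rank of 146 GB DDMs; the 1113-cylinder extent size and the model cylinder counts are those listed above.

import math

EXTENT_CYLS = 1113  # one CKD extent is the size of a 3390-1

# Cylinder counts of the 3390 models discussed in this chapter
MODELS = {
    "3390-3": 3339,
    "3390-9": 10017,
    "3390-27": 30051,
    "3390-54": 60102,
    "EAV": 240408,
}

def volumes_per_rank(usable_extents, model_cylinders):
    """How many volumes of a given model fit on a rank with that many usable extents."""
    extents_per_volume = math.ceil(model_cylinders / EXTENT_CYLS)
    return usable_extents // extents_per_volume

# Assumption: roughly 880 usable extents on a (6+P) RAID 5 rank of 146 GB DDMs
usable_extents = 880
for model, cylinders in MODELS.items():
    count = volumes_per_rank(usable_extents, cylinders)
    note = "needs more than one 256-device LCU" if count > 256 else "fits in one LCU"
    print(f"{model:8s}: {count:4d} volumes  ({note})")

With these assumptions, only the 3390-3 count exceeds the 256-address limit of a single LCU, which matches the discussion above.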

15.6.2 Larger volume compared to smaller volume performance


The performance of configurations using larger custom volumes has been measured against configurations of smaller volumes with equal total capacity, using various online and batch workloads. In this section, we include measurement examples that can help you evaluate the performance implications of using larger volumes.

Note: Even though the benchmarks were performed on an ESS-F20, the comparative results are similar on the DS8000.

Random workload
The measurements for DB2 and IMS online transaction workloads showed only a slight difference in device response time between a configuration of six 3390-27 volumes and a configuration of 60 3390-3 volumes of equal capacity on the ESS-F20 using FICON channels. The measurements for DB2 are shown in Figure 15-8. Note that even when the device response time for a large volume configuration is higher, the online transaction response time can sometimes be lower due to the reduced system overhead of managing fewer volumes.


(Chart: device response time in milliseconds versus total I/O rate, at 2101 and 3535 I/Os per second, comparing the 3390-3 and 3390-27 configurations.)


Figure 15-8 DB2 large volume performance

The measurements were carried out so that all volumes were initially assigned with zero or one alias. WLM dynamic alias management then assigned additional aliases as needed. The number of aliases at the end of the test run reflects the number that was adequate to keep IOSQ down. For this DB2 benchmark, the alias assignment done by WLM resulted in an approximately 4:1 reduction in the total number of UCBs used.

Sequential workload
Figure 15-9 on page 453 shows elapsed time comparisons between nine 3390-3s and one 3390-27 when a DFSMSdss full volume physical dump and a full volume physical restore are executed. The workloads were run on a 9672-XZ7 processor connected to an ESS-F20 with eight FICON channels. The volumes are dumped to or restored from a single 3590E tape drive attached to an A60 control unit with one FICON channel. No PAV aliases were assigned to any volumes for this test, even though an alias might have improved the performance.


(Chart: elapsed time in seconds for a full volume dump and a full volume restore, comparing nine 3390-3s and one 3390-27; the y-axis ranges from 0 to 1500 seconds.)

Figure 15-9 DSS dump large volume performance

15.6.3 Planning the volume sizes of your configuration


From a simplified storage management perspective, we recommend that you select and use a uniform volume size for the majority of your volumes. With a uniform volume size configuration in your DS8000, you do not have to keep track of the size of each of your volumes. Several functions, such as FlashCopy, Peer-to-Peer Remote Copy (PPRC), and full volume restore, require that the target volume is not smaller than the source; a uniform volume size therefore simplifies your storage administration activities and avoids mistakes.

Larger volumes
To avoid potential I/O bottlenecks when using large volumes, you might also consider the following recommendations:
- Use PAVs to reduce IOS queuing. Parallel Access Volume (PAV) is of key importance when using large volumes. PAV enables one z/OS system image to initiate multiple I/Os to a device concurrently, which keeps IOSQ times down even with many active datasets on the same volume. PAV is a practical must with large volumes. In particular, we recommend using HyperPAV.
- Multiple Allegiance is a function that the DS8000 provides automatically. Multiple Allegiance allows multiple I/Os from different z/OS systems to be executed concurrently, which reduces the Device Busy Delay time, which is part of PEND time.
- Eliminate unnecessary reserves. As volume sizes grow larger, more data and datasets reside on a single CKD device address. Thus, the larger the volume, the greater the multi-system performance impact when serializing volumes with RESERVE processing. You need to exploit a Global Resource Serialization (GRS) Star configuration and convert all RESERVEs possible into system ENQ requests.


- Some applications might use poorly designed channel programs that define the whole volume, or the whole extent of the dataset being accessed, as their extent range or domain, instead of just the actual track on which the I/O operates. This design prevents other I/Os from running simultaneously if a write I/O is being executed against that volume or dataset, even when PAV is used. You need to identify such applications and allocate the datasets on volumes where they do not conflict with other applications. Custom volumes are an option here. For an Independent Software Vendor (ISV) product, asking the vendor for an updated version might help solve the problem.

Other benefits of using large volumes can be briefly summarized as follows:
- Reduced number of UCBs required. We reduce the number of UCBs by consolidating smaller volumes into larger volumes, and we also reduce the total number of aliases required, as explained in 15.2.3, PAV and large volumes on page 445.
- Simplified storage administration.
- Larger pools of free space, thus reducing the number of X37 abends and allocation failures.
- Reduced number of multivolume datasets to manage.

15.7 FICON
FICON provides several benefits as compared to ESCON, from simplified system connectivity to the greater throughput that can be achieved when using FICON to attach the host to the DS8000. FICON allows you to significantly reduce the batch window processing time.

Response time improvements accrue particularly for data stored using larger blocksizes. The data transfer portion of the response time is greatly reduced because of the much higher data rate during transfer with FICON. This improvement leads to significant reductions in the connect time component of the response time. The larger the transfer, the greater the reduction as a percentage of the total I/O service time. The pending time component of the response time that is caused by director port busy is totally eliminated, because collisions in the director are eliminated with the FICON architecture. For users whose ESCON directors are experiencing as much as 45 - 50% busy conditions, FICON provides a significant response time reduction.

Another performance advantage delivered by FICON is that the DS8000 accepts multiple channel command words (CCWs) concurrently without waiting for completion of the previous CCW, which allows setup and execution of multiple CCWs from a single channel to happen concurrently. Contention among multiple I/Os accessing the same data is handled in the FICON host adapter and queued according to the I/O priority indicated by the Workload Manager.

FICON Express2 and FICON Express4 channels on the z9 EC and z9 BC systems also support the Modified Indirect Data Address Word (MIDAW) facility and a maximum of 64 open exchanges per channel, compared to the maximum of 32 open exchanges available on FICON, FICON Express, and IBM System z 990 (z990) and IBM System z 890 (z890) FICON Express2 channels.

Significant performance advantages can also be realized by users accessing data remotely. FICON eliminates the data rate droop effect for distances up to 100 km (62.1 miles) for both read and write operations by using enhanced data buffering and pacing schemes. FICON thus extends the DS8000's ability to deliver high bandwidth potential to the logical volumes needing it, when they need it.


For additional information about FICON, refer to 9.3.2, FICON on page 276.

15.7.1 Extended Distance FICON


DS8000 Extended Distance FICON (EDF) requires DS8000 Release 3.1 or later, a System z10 mainframe post GA (April 2008), and DFSMS SDM APAR OA24218 (March 2008). EDF is designed to help enable greater throughput over distance for IBM z/OS Global Mirror (XRC) by eliminating handshakes between the channel and the control unit.

EDF provides a performance improvement in long distance (greater than 100 km (62.1 miles) at 1 Gbps, 50 km (31 miles) at 2 Gbps and at 4 Gbps) XRC System Data Mover configurations without using channel extender equipment, or using lower-cost channel extenders based on frame-forwarding technology. Performance is equal to or slightly better than that of channel extenders that perform spoofing, even when the extension technology used does not provide spoofing.

With a new protocol for persistent Information Unit (IU) pacing, an EDF channel remembers the last pacing information and uses it on subsequent operations, thus avoiding performance degradation at the start of a new operation. IU pacing helps to optimize link utilization and simplifies the requirements for channel extension equipment, because more commands can be in flight. EDF is transparent to the operating systems and is applicable to FICON Express2 and FICON Express4 features defined with CHPID type FC.

15.7.2 High Performance FICON


High Performance FICON (zHPF) is a protocol extension of FICON that communicates in a single packet a set of commands that need to be executed (it looks like a Small Computer System Interface (SCSI) command descriptor block from an interface standpoint). It allows the control unit to stream the data for multiple commands back in a single data transfer section for I/Os initiated by Media Manager, which significantly improves channel throughput (up to 2 times) on small block (4K) transfers.

The zHPF implementation in the DS8000 is exclusively for I/Os that transfer less than a single track of data. The maximum number of I/Os is designed to be improved by up to 100% for small data transfers that can exploit zHPF. Realistic production workloads with a mix of data transfer sizes can see up to 30 to 70% of FICON I/Os utilizing zHPF, resulting in up to a 10 to 30% savings in channel utilization. Sequential I/Os transferring less than a single track size (for example, 12 x 4K bytes per I/O) can also benefit.

Enhancements made to both the z/Architecture and the FICON interface architecture deliver optimizations for online transaction processing (OLTP) workloads by using channel Fibre ports to transfer commands and data. zHPF helps to reduce the overhead related to the supported commands and improves performance.

From the z/OS point of view, the existing FICON architecture is called command mode, and the zHPF architecture is called transport mode. Both modes are supported by the FICON Express4 and FICON Express2 features. A parameter in the Operation Request Block (ORB) is used to determine whether the FICON channel is running in command or transport mode. There is a microcode requirement on both the control unit side and the channel side that enables zHPF dynamically at the link level; the links initialize themselves to allow zHPF as long as both sides support zHPF.


zHPF is available as a licensed feature (7092 and 0709) on DS8000 series Turbo Models with R4.1. The software requirements are as follows. zHPF support for CHPID type FC on the System z10 EC requires at a minimum:
- z/OS V1.8, V1.9, or V1.10 with PTFs
- z/OS V1.7 with the IBM Lifecycle Extension for z/OS V1.7 (5637-A01) with PTFs

DS8000 HPF (HA) Performance


Figure 15-10 shows a 4k read hit performance with 4 Gbps FICON and with HPF. On a single port, the read hit is 13.9 kIOPS for the 4 Gbps FICON, and 28.5 kIOPS for HPF (49% improvement). The read improvement is even higher (52%) on a single card.

Figure 15-10 4k Read Hit Performance

Figure 15-11 shows a 4k write hit performance with 4 Gbps FICON and with HPF. On a single port, the write is 11.8 kIOPS for the 4 Gbps FICON and 20.7 KIOPS for HPF (57% improvement). The write performance is almost the same (55%) for a single card.


Figure 15-11 4K Write Hit Performance

Figure 15-12 shows a 4k read/write hit performance with 4 Gbps FICON and with HPF. On a single port, the read/write is 12 kIOPS for the 4 Gbps FICON and 23.5 kIOPS for HPF (51% improvement). The read/write improvement is even higher (57%) for a single card.

Figure 15-12 4k Read/Write Hit Performance

15.7.3 MIDAW
The IBM System z9 server introduces a Modified Indirect Data Address Word (MIDAW) facility, which in conjunction with the DS8000 and the FICON Express4 channels delivers enhanced I/O performance for Media Manager applications running under z/OS 1.7 and z/OS 1.8. It is also supported under z/OS 1.6 with PTFs.


The MIDAW facility is a modification to a channel programming technique that has existed since S/360 days. MIDAW is a method of gathering and scattering data into and from noncontiguous storage locations during an I/O operation, thus decreasing channel, fabric, and control unit overhead by reducing the number of channel command words (CCWs) and frames processed. No tuning is needed to use the MIDAW facility. The requirements to be able to take advantage of the MIDAW facility are:
- A z9 server
- Applications that use Media Manager
- Applications that use long chains of small blocks

The biggest performance benefit comes with FICON Express4 channels running on 4 Gbps links, especially when processing extended format datasets. Compared to ESCON channels, using FICON channels improves performance. This improvement is more significant for I/Os with larger blocksizes, because FICON channels can transfer data much faster, which reduces the connect time. The improvement for I/Os with smaller blocksizes is not as significant. In those cases where chains of small records are processed, MIDAW can significantly improve FICON Express4 performance if the I/Os use Media Manager.

Figure 15-13 on page 459 shows the results for a 32x4K READ channel program. Without MIDAWs, a FICON Express4 channel is pushed to 100% channel processor utilization at just over 100 MBps of throughput, which is about the limit of a 1 Gigabit/s FICON link. Two FICON Express4 channels are needed to get to 200 MBps with a 32x4K channel program. With MIDAWs, 100 MBps is achieved at only 30% channel utilization, and 200 MBps, which is about the limit of a 2 Gigabit/s FICON link, is achieved at about 60% channel utilization. A FICON Express4 channel operating at 4 Gigabit/s link speeds can achieve over 325 MBps. This measurement was done with a single FICON Express4 channel connected through a FICON director to two 4 Gigabit/s control unit (CU) ports.


Figure 15-13 Channel utilization limits for hypothetical workloads

15.8 z/OS planning and configuration guidelines


This section discusses general configuration guidelines and recommendations for planning the DS8000 configuration. For a less generic and more detailed analysis that takes into account your particular environment, the Disk Magic modeling tools are available to IBM personnel and IBM Business Partners who can help you in the planning activities. Disk Magic can be used to help understand the performance effects of various configuration options, such as the number of ports and host adapters, disk drive capacity, number of disks, and so on. Refer to 7.1, Disk Magic on page 162.

15.8.1 Channel configuration


The following generic guidelines can be complemented with the information in Chapter 9, Host attachment on page 265:
- If you use eight FICON channels, define all eight FICON channels as an 8-path group to all the volumes on the DS8000.
- If you decide to use more than eight FICON channels, divide the FICON channels evenly between the LCUs, so that LCUs on the same processor complex (Server) are assigned to the same channel-path group.

Figure 15-14 shows the maximum throughput of a FICON 2Gb and 4Gb port on the DS8000 as compared to the maximum throughput of FICON Express channels on the System z servers. Considering that the maximum throughput of a DS8000 FICON 4Gb port is not that much higher than the maximum throughput of a FICON Express4 channel, in general we do not recommend daisy chaining several FICON Express4 channels from multiple CECs onto the same DS8000 4Gb port.


Daisy chaining can go either way:
- Connecting more than one FICON channel to one DS8000 port
- Connecting one FICON channel to more than one DS8000 port

(Chart: maximum throughput in MB/sec of a DS8000 2Gb port, a DS8000 4Gb port, and FICON Express 2Gb, FICON Express2 2Gb, and FICON Express4 channels; the y-axis ranges from 0 to 450 MB/sec.)

Figure 15-14 FICON port and channel throughput

However, if you have multiple DS8000s installed, it might be a good option to balance the channel load on the System z server. You can double the number of required FICON ports on the DS8000s and daisy chain these FICON ports to the same channels on the System z server. This design provides the advantage of being able to balance the load on the FICON channels, because the load on the DS8000 fluctuates during the day.

Figure 15-15 on page 461 shows configuration A with no daisy chaining. In this configuration, each DS8000 uses 8 FICON ports, and each port is connected to a separate FICON channel on the host. In this case, we have two sets of 8 FICON ports connected to 16 FICON channels on the System z host. In configuration B, we double the number of FICON ports on both DS8000s and keep the same number of FICON channels on the System z server. We can now connect each FICON channel to two FICON ports, one on each DS8000. The advantages of configuration B are:
- Workload from each DS8000 is now spread across more FICON ports, which lowers the load on the FICON ports and FICON host adapters.
- Any imbalance in the load going to the two DS8000s is now spread more evenly across the 16 FICON channels.


(Diagram: configuration A (no daisy chaining) and configuration B (daisy chained), each showing one CEC and two DS8000s. Assumption: each line from a FICON channel in the CEC and each line from a FICON port in the DS8000 represents a set of 8 paths.)

Figure 15-15 Daisy chaining DS8000s

15.8.2 Extent pool


An extent is the size of a 3390-1, which is 1113 cylinders. You assign ranks to an extent pool. One extent pool can have one to n ranks, where n can be any number up to the total number of ranks that is defined in a processor complex (server). For an in-depth discussion about extent pool planning, refer to 5.7, Planning extent pools on page 91.

One rank on one extent pool


One rank per extent pool was the original recommendation with the first DS8000 microcode versions. With this configuration, performance management is simplified, because we can more easily identify the reason for a rank saturation: we know exactly which volumes reside on the rank. For the performance discussion about this subject, refer to One rank on one extent pool on page 474.

Multiple ranks on one extent pool


Even though the multi-rank extent pool was also available early on, the recommendation from a performance standpoint was still to allocate one rank per extent pool. This recommendation changes with the availability of the DS8000 Release 3.0 microcode. In this new release, a new algorithm, called Storage Pool Striping (SPS), was introduced. With SPS, a volume is allocated by striping the extents (1113 cylinders) in round-robin fashion across all the ranks defined in an extent pool. Now, we suggest that you allocate four to eight ranks in one SPS extent pool.


Many performance issues are caused by a rank that is over-driven by the I/O load of the volumes that reside on it. The only real solution is to spread the I/O over multiple ranks. Traditionally, you spread the I/O over multiple ranks by moving volumes from a busier rank to a rank or multiple ranks with lower utilization. This might not be a practical solution, because the heavily loaded volumes on one day often differ from the heavily loaded volumes on another day. With SPS, we proactively involve the mechanical capabilities of multiple ranks in handling the I/O activity of those volumes. We utilize those multiple resources, and the load is balanced across all the ranks within the SPS extent pool. This load balancing is the main advantage compared to the single-rank extent pool and multi-rank extent pool configurations using the previous allocation methods. For the SPS performance considerations, refer to Multiple ranks on one extent pool using Storage Pool Striping on page 476.

Tip: Considering the performance benefits, we highly recommend that you configure the DS8000 using Storage Pool Striping.

Using SPS does not mean that we can now allocate all the volumes of one application on one single extent pool. It is still a prudent practice to spread those volumes across all extent pools, including extent pools on the other processor complex (Server). If an extent pool is used exclusively for SPS volumes, the showrank command of all the ranks within that extent pool shows the same list of volumes, except when:
- The ranks defined in the extent pool have different numbers of DDMs, such as (6+P) and (7+P). In this case, a condition can occur where all the extents on the (6+P) ranks are fully occupied, so new volumes will be located on the (7+P) ranks only.
- Several volumes have a capacity of less than (n x 1113) cylinders, where n = the number of ranks defined in the extent pool. For example, a 3390-1 will only be allocated on one rank.

Example 15-1 is the output of a showckdvol command for an SPS volume. Here, we can observe the following for the volume with ID 9730:
- It is allocated on extent pool P1, which is a multi-rank extent pool with two ranks: R4 and R7.
- The volume is defined as a 3390-27 with 27 extents (30051 cylinders).
- It is allocated with the rotateexts (rotate extents) algorithm, which is the DSCLI and DS Storage Manager term for SPS.
- Because there are only two ranks in this P1 extent pool, volume 9730 is allocated with 13 extents on R4 and 14 extents on R7.
Example 15-1 The showckdvol command output

dscli> showckdvol -rank 9730
Date/Time: November 13, 2008 7:47:29 AM PST IBM DSCLI Version: 5.4.2.257 DS: IBM.2107-7512321
Name          ITSO_RotExts
ID            9730
accstate      Online
datastate     Normal
configstate   Normal
deviceMTM     3390-9
volser
datatype      3390
voltype       CKD Base
orgbvols
addrgrp       9
extpool       P1
exts          27
cap (cyl)     30051
cap (10^9B)   25.5
cap (2^30B)   23.8
ranks         2
sam           Standard
repcapalloc
eam           rotateexts
reqcap (cyl)  30051
==============Rank extents==============
rank extents
============
R4   13
R7   14
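
The rotate extents behavior can be illustrated with a small sketch that simply deals extents to the ranks of an extent pool in round-robin order. This is only an illustration of the concept, not the actual microcode algorithm (which, for example, also considers where the previous allocation ended); the 27-extent volume and two-rank pool mirror Example 15-1.

def rotate_extents(num_extents, ranks):
    """Distribute a volume's extents across the ranks of an extent pool in round-robin order."""
    allocation = {rank: 0 for rank in ranks}
    for i in range(num_extents):
        allocation[ranks[i % len(ranks)]] += 1
    return allocation

# A custom 3390-27 volume of 30051 cylinders uses 27 extents of 1113 cylinders each
print(rotate_extents(27, ["R7", "R4"]))   # {'R7': 14, 'R4': 13}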

15.8.3 Considerations for mixed workloads


The larger capacity of the DS8000 allows you to combine data and workloads from several different kinds of independent servers into a single DS8000. Examples of mixed workloads include:
- z/OS and Open Systems
- Mission-critical production and test

Sharing resources in a DS8000 has advantages from a storage administration and resource-sharing perspective, but it does have implications for workload planning. Resource sharing has the benefit that a larger resource pool (for example, disk drives or cache) is available for critical applications. However, be careful to ensure that uncontrolled or unpredictable applications do not interfere with mission-critical work. If you have a workload that is truly mission-critical, you might want to consider isolating it from other workloads, particularly if those other workloads are unpredictable in their demands. There are several ways to isolate the workloads:
- Place the data on separate DS8000s, or separate DS8000 LPARs. This option is, of course, the best choice.
- Place the data on separate DS8000 servers. This option isolates the use of memory buses, microprocessors, and cache resources. However, before doing that, make sure that half a DS8000 provides sufficient performance to meet the needs of your important application. Note that Disk Magic provides a way to model the performance of half a DS8000 by specifying the Failover Mode. Consult your IBM representative for a Disk Magic analysis.
- Place the data behind separate device adapters.
- Place the data on separate ranks, which reduces contention for the use of DDMs.

Note: z/OS and Open Systems data can only be placed on separate extent pools.


15.9 DS8000 performance monitoring tools


There are several tools available that can help with monitoring the performance of the DS8000:
- Resource Management Facility (RMF)
- RMF Magic
- OMEGAMON: This is an IBM Tivoli product. More information is available at:
  http://www.ibm.com/software/tivoli/
- TotalStorage Productivity Center: TotalStorage Productivity Center is usually not a required performance monitoring tool for a DS8000 running in a z/OS environment. In a Remote Mirror and Copy environment, you might need TotalStorage Productivity Center to monitor the performance of the remote disk subsystem if there is no z/OS system that accesses this subsystem. More information about TotalStorage Productivity Center is in 8.2.1, TotalStorage Productivity Center overview on page 205.

In the following sections, we explain RMF and RMF Magic.

15.10 RMF
Resource Management Facility (RMF), which is part of the z/OS operating system, provides performance information for the DS8000 and other disk subsystems. RMF can help with monitoring the following performance components:
- I/O response time
- IOP/SAP
- FICON host channel
- FICON director
- SMP
- Cache and nonvolatile storage (NVS)
- FICON/Fibre port and host adapter
- Extent pool and rank/array

15.10.1 I/O response time


The RMF DIRECT ACCESS DEVICE ACTIVITY report (refer to Example 15-2 on page 465) is probably the first report to use to monitor the disk subsystem's performance. Also, if a Service Level Agreement (SLA) is not being met and the problem might be related to storage, use this report as a starting point in your performance analysis. If possible, rank the volumes related to the application by I/O intensity, which is the I/O rate multiplied by the service time (PEND + DISC + CONN time). Concentrate on the largest component of the response time and try to identify the bottleneck that is causing the problem. We provide more detailed explanations in the following sections.

The device activity report accounts for all activity to a base address and all of its associated alias addresses. Activity on alias addresses is not reported separately; it is accumulated into the base address.
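
The ranking by I/O intensity suggested above can be done with a few lines of code once the rate and time components are extracted from the report or the SMF records. This is a minimal Python sketch; the sample values are approximate figures read from Example 15-2.

def io_intensity(rate, pend, disc, conn):
    """I/O intensity = I/O rate x service time (PEND + DISC + CONN), in ms per second."""
    return rate * (pend + disc + conn)

# (device, I/O rate, PEND, DISC, CONN in ms) -- values read from Example 15-2
volumes = [
    ("A103", 11.218, 0.237, 0.846, 0.389),
    ("A105", 0.903, 0.237, 12.0, 0.640),
    ("A108", 0.741, 0.245, 4.96, 1.30),
]
for dev, rate, pend, disc, conn in sorted(volumes, key=lambda v: io_intensity(*v[1:]), reverse=True):
    print(f"{dev}: I/O intensity {io_intensity(rate, pend, disc, conn):6.1f} ms/sec")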


Starting with z/OS Release 1.10, this report also shows the number of cylinders allocated to the volume. Example 15-2 shows 3390-9 volumes that have either 10017 or 30051 cylinders.
Example 15-2 RMF Direct Access Device Activity (DASD) report
D I R E C T A C C E S S AVG RESP TIME 1.39 1.29 5.39 1.47 2.73 12.9 .803 2.53 6.50 3.31 1.60 AVG IOSQ TIME .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 D E V I C E AVG CMR DLY .128 .128 .152 .140 .144 .140 .139 .149 .142 .136 .133 AVG DB DLY .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 A C T I V I T Y AVG PEND TIME .213 .207 .252 .237 .242 .237 .238 .250 .245 .232 .224 AVG DISC TIME .836 .788 3.21 .846 1.89 12.0 .228 .492 4.96 2.58 .914 AVG CONN TIME .341 .295 1.92 .389 .597 .640 .337 1.78 1.30 .497 .463 % DEV CONN 0.00 0.00 0.01 0.09 0.01 0.01 0.01 0.01 0.02 0.00 0.00 % DEV UTIL 0.00 0.00 0.02 0.28 0.05 0.23 0.02 0.01 0.09 0.01 0.00 % DEV RESV 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0.0 AVG NUMBER ALLOC 0.0 0.0 10.0 9.0 6.0 11.0 18.0 0.0 10.0 7.0 1.0 % % ANY MT ALLOC PEND 100.0 0.0 100.0 0.0 100.0 0.0 100.0 0.0 100.0 0.0 100.0 0.0 100.0 0.0 100.0 0.0 100.0 0.0 100.0 0.0 100.0 0.0

STORAGE GROUP

DBS0 DBS0 DBS0 DBS0 DBS1 PRD0 PRD0 PRD0 PRD0

DEV NUM A100 A101 A102 A103 A104 A105 A106 A107 A108 A109 A10A

DEVICE TYPE 33903 33903 33909 33909 33909 33909 33909 33909 33909 33909 33909

NUMBER OF CYL 3339 3339 30051 30051 30051 30051 30051 10017 10017 10017 10017

VOLUME PAV SERIAL AISL00 4 AISL01 4 D26454 5 D26455 5 D26456 5 D26457 5 D26458 8* P26531 5 P26532 5 P26533 5 P26534 5

DEVICE LCU ACTIVITY RATE 007E 0.017 007E 0.014 007E 0.198 007E 11.218 007E 0.990 007E 0.903 007E 2.654 007E 0.154 007E 0.741 007E 0.186 007E 0.111

PAV
PAV is the number of addresses assigned to a UCB, which includes the base address plus the number of aliases assigned to that base address. RMF reports the number of PAV addresses (in RMF terms, exposures) that have been used by a device. In a dynamic PAV environment, when the number of exposures has changed during the reporting interval, there is an asterisk next to the PAV number. Example 15-2 shows that address A106 has a PAV of 8*; the asterisk indicates that the number of PAVs was either lower or higher than 8 during the previous RMF period.

For HyperPAV, the number of PAVs is shown in the format n.nH. The H indicates that this volume is supported by HyperPAV, and n.n is a one-decimal number showing the average number of PAVs assigned to the address during the RMF report period. Example 15-3 shows that address 9505 has an average of 9.6 PAVs assigned to it during this RMF period. When a volume has no I/O activity, the PAV is always 1, which means that there is no alias assigned to this base address, because in HyperPAV, an alias is used or assigned to a base address only for the time required to execute an I/O. The alias is then released and put back into the alias pool after the I/O is completed.

Note: The number of PAVs includes the base address plus the number of aliases assigned to it. Thus, PAV=1 means that the base address has no aliases assigned to it.
Example 15-3 RMF DASD report for HyperPAV volumes (report created on pre-z/OS 1.10)
D I R E C T A C C E S S AVG CMR DLY 0.3 0.0 0.0 0.0 0.0 0.4 D E V I C E AVG DB DLY 0.0 0.0 0.0 0.0 0.0 0.0 A C T I V I T Y % DEV CONN 0.04 0.00 0.00 0.00 0.00 60.44 % DEV UTIL 0.04 0.00 0.00 0.00 0.00 60.83 % DEV RESV 0.0 0.0 0.0 0.0 0.0 0.0 AVG % NUMBER ANY ALLOC ALLOC 0.0 0.0 0.0 0.0 0.0 10.9 100.0 100.0 100.0 100.0 100.0 100.0 % MT PEND 0.0 0.0 0.0 0.0 0.0 0.0

STORAGE GROUP

DEV NUM 9500 9501 9502 9503 9504 9505

DEVICE TYPE 3390 3390 3390 3390 3390 3390

VOLUME PAV SERIAL HY9500 HY9501 HY9502 HY9503 HY9504 HY9505 1.0H 1.0H 1.0H 1.0H 1.0H 9.6H

DEVICE AVG AVG LCU ACTIVITY RESP IOSQ RATE TIME TIME 0227 0.900 0227 0.000 0227 0.000 0227 0.000 0227 0.000 0227 5747.73 0.8 0.0 0.0 0.0 0.0 1.6 0.0 0.0 0.0 0.0 0.0 0.0

AVG AVG AVG PEND DISC CONN TIME TIME TIME 0.4 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.0 0.0 0.0 0.0 0.4 0.0 0.0 0.0 0.0 1.0
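
A small helper can make the three forms of the PAV field explicit. This is a hypothetical sketch based only on the formats described above (a plain count, a count with a trailing asterisk for dynamic PAV changes, and n.nH for HyperPAV averages).

def parse_pav(field):
    """Interpret the RMF PAV column: '4', '8*', or '9.6H'."""
    field = field.strip()
    hyperpav = field.endswith("H")              # HyperPAV: average PAV count over the interval
    changed = field.endswith("*")               # dynamic PAV: count changed during the interval
    pav = float(field.rstrip("*H"))
    return {"pav": pav, "aliases": pav - 1,     # PAV count includes the base address
            "hyperpav": hyperpav, "changed_during_interval": changed}

for sample in ("4", "8*", "9.6H", "1.0H"):
    print(sample, parse_pav(sample))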


15.10.2 I/O response time components


The components of the response time are:
- IOSQ time: queuing at the host/CEC
- PEND time: overhead
- DISC time: non-productive time
- CONN time: data transfer time

IOSQ time
IOSQ time is the time measured when an I/O request is being queued in the LPAR by z/OS.
The following situations can cause high IOSQ time:
- One of the other response time components is high. When you see a high IOSQ time, look at the other response time components to investigate where the problem actually exists.
- The IOSQ time is sometimes due to the unavailability of aliases to initiate an I/O request.
- There is also a slight possibility that the IOSQ is caused by a long busy condition during device error recovery.

To reduce a high IOSQ time:
- Reduce the other components of the response time. Lowering the other components' response times automatically lowers the IOSQ time.
- Lower the I/O load through data in memory or use faster storage devices.
- Provide more aliases. Using HyperPAV is the best option.

PEND time
PEND time represents the time that an I/O request waits in the hardware. The PEND time can be increased by:
- High FICON director port or DS8000 FICON port utilization. This can be caused by a high activity rate on those ports. More commonly, it is due to daisy chaining multiple FICON channels from different CECs to the same port on the FICON director or the DS8000 FICON host adapter. In this case, the FICON channel utilization as seen from the host might be low, but the combined utilization of the channels that share the same port (either on the director or the DS8000) can be significant. For more information, refer to FICON director on page 469 and DS8000 FICON/Fibre port and host adapter on page 472.
- High FICON host adapter utilization. Using too many ports within a DS8000 host adapter can overload the host adapter. We recommend that only two out of the four ports in a host adapter are used.
- I/O Processor (IOP/SAP) contention at the System z host. More IOPs might be needed. The IOP is the processor in the CEC that is assigned to handle I/Os. For more information, refer to IOP/SAP on page 468.
- CMR Delay, which is a component of PEND time (refer to Example 15-2 on page 465). It is the initial selection time for the first I/O command in a chain for a FICON channel. It can be elongated by contention downstream from the channel, such as a busy control unit.
- Device Busy Delay, which is also a component of PEND time (refer to Example 15-2 on page 465). Device Busy Delay is caused by a domain conflict, because of a read or write operation against a domain that is in use for update. A high Device Busy Delay time can be caused by the domain of the I/O not being limited to the track that the I/O operation is accessing. If you use an Independent Software Vendor (ISV) product, ask the vendor for an updated version, which might help solve this problem.

DISC time
If the major cause of delay is the DISC time, you need to dig deeper to find the cause. The most probable cause of high DISC time is waiting while data is staged from the DS8000 rank into cache because of a read miss. The DISC time can be elongated by:
- A low read hit ratio. Refer to Cache and NVS on page 470. The lower the read hit ratio, the more read operations have to wait for the data to be staged from the DDMs to the cache. Adding cache to the DS8000 can increase the read hit ratio.
- High DDM utilization. You can verify high DDM utilization from the ESS Rank Statistics report. Refer to Extent pool and rank on page 474. Look at the rank read response time. As a general rule, this number must be less than 35 ms. If it is higher than 35 ms, it is an indication that this rank is too busy, because the DDMs are saturated. If the rank is too busy, consider spreading the busy volumes allocated on this rank to other ranks that are not as busy.
- A persistent memory full condition, that is, a nonvolatile storage (NVS) full condition. Refer to Cache and NVS on page 470.
- In a Metro Mirror environment, a significant transmission delay between the primary and the secondary site also causes a higher DISC time.

CONN time
For each I/O operation, the channel subsystem measures the time that the DS8000, channel, and CEC are connected during the data transmission. When there is a high level of utilization of resources, significant time can be spent in contention, rather than transferring data. There are several reasons for high CONN time:
- FICON channel saturation. If the channel or BUS utilization at the host exceeds 50%, it elongates the CONN time. Refer to FICON host channel on page 468. In FICON channels, the data transmitted is divided into frames, and when the channel is busy with multiple I/O requests, the frames from one I/O are multiplexed with the frames from other I/Os, thus elongating the elapsed time that it takes to transfer all of the frames that belong to that I/O. The total of this time, including the transmission time of the other multiplexed frames, is counted as CONN time.
- Contention in the FICON director, FICON port, and FICON host adapter elongates the PEND time, which also has the same effect on CONN time. Refer to the PEND time discussion in PEND time on page 466.
- Rank saturation caused by high DDM utilization increases DISC time, which also increases CONN time. Refer to the DISC time discussion in DISC time on page 467.
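
The reasoning in this section can be summarized in a simple decision helper that picks the dominant response time component and points to the usual suspects. This is a simplified sketch of the analysis steps above, not a replacement for examining the channel, port, cache, and rank reports themselves.

def classify_response_time(iosq, pend, disc, conn):
    """Return the dominant response time component and its typical causes (all values in ms)."""
    hints = {
        "IOSQ": "queuing in z/OS: not enough aliases (consider HyperPAV), or another component is high",
        "PEND": "hardware wait: busy director/DS8000 ports, host adapter, IOP/SAP, or device busy delay",
        "DISC": "staging from the ranks: low read hit ratio, saturated DDMs, NVS full, or Metro Mirror distance",
        "CONN": "data transfer contention: channel/BUS utilization above about 50%, or downstream contention",
    }
    components = {"IOSQ": iosq, "PEND": pend, "DISC": disc, "CONN": conn}
    dominant = max(components, key=components.get)
    return dominant, hints[dominant]

# Example: a volume whose response time is dominated by 12 ms of DISC time
print(classify_response_time(iosq=0.0, pend=0.24, disc=12.0, conn=0.64))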


15.10.3 IOP/SAP
The IOP/SAP is the CEC processor that handles I/O operations. Check the I/O QUEUING ACTIVITY report (Example 15-4) to determine whether the IOP is saturated. An average queue length greater than 1 indicates that the IOP is saturated; an average queue length greater than 0.5 is already considered a warning sign. A burst of I/O can also trigger a high average queue length.

If only certain IOPs are saturated, redistributing the channels assigned to the disk subsystems can help balance the load on the IOPs, because an IOP is assigned to handle a certain set of channel paths. Assigning all of the channels from one IOP to access a very busy disk subsystem can cause saturation of that particular IOP. Refer to the appropriate hardware manual of the CEC that you use.
Example 15-4 I/O Queuing Activity report
I/O z/OS V1R6 Q U E U I N G A C T I V I T Y

SYSTEM ID SYS1 DATE 06/15/2005 INTERVAL 00.59.997 RPT VERSION V1R5 RMF TIME 11.34.00 CYCLE 1.000 SECONDS TOTAL SAMPLES = 59 IODF = 94 CR-DATE: 06/15/2005 CR-TIME: 09.34.25 ACT: ACTIVATE - INITIATIVE QUEUE ------- IOP UTILIZATION -------- % I/O REQUESTS RETRIED --------- RETRIES / SSCH --------IOP ACTIVITY AVG Q % IOP I/O START INTERRUPT CP DP CU DV CP DP CU DV RATE LNGTH BUSY RATE RATE ALL BUSY BUSY BUSY BUSY ALL BUSY BUSY BUSY BUSY 00 278.930 0.00 1.25 278.930 471.907 0.2 0.0 0.2 0.0 0.0 0.00 0.00 0.00 0.00 0.00 01 551.228 0.04 1.55 551.228 553.744 17.2 17.2 0.0 0.0 0.0 0.21 0.21 0.00 0.00 0.00 SYS 830.158 0.02 1.40 830.158 1025.651 12.2 12.1 0.1 0.0 0.0 0.14 0.14 0.00 0.00 0.00

15.10.4 FICON host channel


The FICON report, which is shown in Example 15-5 on page 469, shows the FICON channel-related statistics. The PART utilization is the FICON channel utilization related to the I/O activity of this logical partition (LPAR), and the Total utilization is the total utilization of the FICON channel from all LPARs that are defined on the CEC. The general rule for the total FICON channel utilization and the BUS utilization is to keep them under 50%. If these numbers exceed 50%, you see an elongated CONN time. For small block transfers, the BUS utilization is less than the FICON channel utilization, and for large block transfers, the BUS utilization is greater than the FICON channel utilization.

The Generation (G) field in the channel report tells you which generation of FICON channel is being used and the speed of the FICON channel link for this CHPID at the time of the machine IPL. The G field does not include any information about the link between the director and the DS8000:
- G=5 means the link between the channel and the director runs at 4Gbps, which is applicable to a FICON Express4 channel.
- G=4 means the link between the channel and the director runs at 2Gbps, which is applicable to a FICON Express4 or FICON Express2 channel.
- G=3 means the link between the channel and the director runs at 1Gbps, which is applicable to a FICON Express4 or FICON Express2 channel.
- G=2 means the link between the channel and the director runs at 2Gbps, which is applicable to a FICON Express channel.
- G=1 means the link between the channel and the director runs at 1Gbps, which is applicable to a FICON Express channel.


The link between the director and the DS8000 can run at 1, 2, or 4Gbps. If the channel is point-to-point connected to the DS8000 FICON port, the G field indicates the speed that was negotiated between the FICON channel and the DS8000 port.
Example 15-5 Channel Path Activity report
C H A N N E L z/OS V1R6 SYSTEM ID SYS1 RPT VERSION V1R5 RMF P A T H A C T I V I T Y INTERVAL 00.59.997 CYCLE 1.000 SECONDS

DATE 06/15/2005 TIME 11.34.00

IODF = 94 CR-DATE: 06/15/2005 CR-TIME: 09.34.25 ACT: ACTIVATE MODE: LPAR CPMF: EXTENDED MODE --------------------------------------------------------------------------------------------------------------------------------DETAILS FOR ALL CHANNELS --------------------------------------------------------------------------------------------------------------------------------CHANNEL PATH UTILIZATION(%) READ(MB/SEC) WRITE(MB/SEC) CHANNEL PATH UTILIZATION(%) READ(MB/SEC) WRITE(MB/SEC) ID TYPE G SHR PART TOTAL BUS PART TOTAL PART TOTAL ID TYPE G SHR PART TOTAL BUS PART TOTAL PART TOTAL 2E FC_S 2 Y 0.15 0.66 4.14 0.02 0.13 0.05 0.08 36 FC_? OFFLINE 2F FC_S 2 Y 0.15 0.66 4.14 0.02 0.12 0.05 0.08 37 FC_? OFFLINE 30 FC_S 2 Y 0.02 0.14 3.96 0.00 0.00 0.00 0.00 38 FC_S 2 Y 0.00 5.17 4.45 0.00 0.00 0.00 0.00 31 FC_S 2 Y 0.02 0.14 3.96 0.00 0.00 0.00 0.00 39 FC_S 2 Y 0.00 4.47 4.37 0.00 0.00 0.00 0.00 32 FC_? 2 Y 0.00 0.13 3.96 0.00 0.00 0.00 0.00 3A FC_S 2 Y 0.02 0.14 3.96 0.00 0.00 0.00 0.00 3B FC_S 2 Y 9.20 9.20 10.20 0.00 0.00 13.28 13.28 43 FC_S OFFLINE 3C FC_S 2 Y 3.09 3.14 6.53 6.37 6.37 0.00 0.00 44 FC_S 2 Y 9.27 9.27 10.37 0.00 0.00 14.07 14.07 3D FC_S 2 Y 3.25 3.31 6.50 6.34 6.34 0.00 0.00 45 FC 2 0.00 0.13 3.96 0.00 0.00 0.00 0.00

15.10.5 FICON director


The FICON director is the switch used to connect the host FICON channel to the DS8000 FICON port. FICON director performance statistics are collected in the System Management Facilities (SMF) record type 74 subtype 7. The FICON DIRECTOR ACTIVITY report (Example 15-6) provides information about director and port activities. This report assists in analyzing performance problems and in capacity planning.
The measurements provided for a port in this report include the I/O for the system on which the report is taken and also all I/Os that are directed through this port, regardless of which LPAR requests the I/O. The CONNECTION column in this report shows where the port is connected:
- CHP: The port is connected to a FICON channel on the host.
- CHP-H: The port is connected to a FICON channel on the host that requested this report.
- CU: The port is connected to a port on a disk subsystem.
- SWITCH: The port is connected to another FICON director.

The important performance metric here is the AVG FRAME PACING. This metric shows the average time (in microseconds) that a frame had to wait before it was transmitted. The higher the contention on the director port, the higher the average frame pacing.
Example 15-6 FICON Director Activity report
F I C O N SWITCH DEVICE:0414 PORT ADDR 05 07 09 0B 12 13 14 15 -CONNECTIONUNIT ID CHP FA CHP 4A CHP FC CHP-H F4 CHP D5 CHP C8 SWITCH ---CU C800 SWITCH ID:01 AVG FRAME PACING 0 0 0 0 0 0 0 0 D I R E C T O R MODEL:001 A C T I V I T Y MAN:MCD PLANT:01 SERIAL:00000MK00109 ERROR COUNT 0 0 0 0 0 1 0 0

TYPE:005000

AVG FRAME SIZE READ WRITE 808 285 149 964 558 1424 872 896 73 574 868 1134 962 287 1188 731

PORT BANDWIDTH (MB/SEC) --READ ----WRITE -50.04 10.50 20.55 5.01 50.07 10.53 50.00 10.56 20.51 5.07 70.52 2.08 50.03 10.59 20.54 5.00


15.10.6 Processor complex


There is no SMF record that provides reports on the DS8000 processor complex utilization. A saturation of the processor complex usually occurs when running at an extremely high I/O rate or running an extremely heavy write remote copy workload. The general indications of a saturation are extremely high PEND and CONN times that are not related to high channel utilization or high port utilization, either for the FICON director port or the DS8000 FICON port. Sometimes, you see a saturation when you run a benchmark to test a new disk subsystem. Usually, in a benchmark, you try to run at the highest possible I/O rate on the disk subsystem.

15.10.7 Cache and NVS


The RMF CACHE SUBSYSTEM ACTIVITY report provides useful information for analyzing the reasons for high DISC time. Example 15-7 shows a sample cache report by LCU. Example 15-8 on page 471 is the continuation of this report, and it shows the cache statistics by volume.

The report shows the I/O requests by read and by write. It shows the rate, the hit rate, and the hit ratio of the read and the write activities. The read-to-write ratio is also calculated. Note that the total I/O requests here can be higher than the I/O rate shown in the DASD report. In the DASD report, one channel program is counted as one I/O. However, in the cache report, if there is more than one Locate Record command in a channel program, each Locate Record command is counted as one I/O request.

In this report, check the value of the read hit ratio. Low read hit ratios contribute to higher DISC time. For a cache friendly workload, we see a read hit ratio better than 90%. The write hit ratio is usually 100%.

A high DFW BYPASS value is an indication that persistent memory (NVS) is overcommitted. DFW BYPASS actually means DASD Fast Write I/Os that are retried because persistent memory is full. Calculate the quotient of DFW BYPASS divided by the total I/O rate; as a general rule, if this number is higher than 1%, the write retry operations have a significant impact on the DISC time.

Check the DISK ACTIVITY part of the report. The read response time must be less than 35 ms. If it is higher than 35 ms, it is an indication that the DDMs on the rank where this LCU resides are saturated.
Example 15-7 Cache Subsystem Activity summary
C A C H E z/OS V1R6 S U B S Y S T E M A C T I V I T Y

SYSTEM ID SYS1 DATE 06/15/2005 INTERVAL 00.59.997 RPT VERSION V1R5 RMF TIME 11.34.00 SUBSYSTEM 2107-01 CU-ID 7015 SSID 1760 CDATE 06/15/2005 CTIME 11.34.02 CINT 00.59 TYPE-MODEL 2107-922 MANUF IBM PLANT 75 SERIAL 000000012331 -----------------------------------------------------------------------------------------------------------------------------------CACHE SUBSYSTEM STATUS -----------------------------------------------------------------------------------------------------------------------------------SUBSYSTEM STORAGE NON-VOLATILE STORAGE STATUS CONFIGURED 31104M CONFIGURED 1024.0M CACHING - ACTIVE AVAILABLE 26290M PINNED 0.0 NON-VOLATILE STORAGE - ACTIVE PINNED 0.0 CACHE FAST WRITE - ACTIVE OFFLINE 0.0 IML DEVICE AVAILABLE - YES -----------------------------------------------------------------------------------------------------------------------------------CACHE SUBSYSTEM OVERVIEW -----------------------------------------------------------------------------------------------------------------------------------TOTAL I/O 19976 CACHE I/O 19976 CACHE OFFLINE 0 TOTAL H/R 0.804 CACHE H/R 0.804 CACHE I/O -------------READ I/O REQUESTS----------------------------------WRITE I/O REQUESTS---------------------% REQUESTS COUNT RATE HITS RATE H/R COUNT RATE FAST RATE HITS RATE H/R READ NORMAL 14903 252.6 10984 186.2 0.737 5021 85.1 5021 85.1 5021 85.1 1.000 74.8


SEQUENTIAL 0 0.0 0 0.0 N/A CFW DATA 0 0.0 0 0.0 N/A TOTAL 14903 252.6 10984 186.2 0.737 -----------------------CACHE MISSES----------------------REQUESTS READ RATE WRITE RATE TRACKS RATE NORMAL 3919 SEQUENTIAL 0 CFW DATA 0 TOTAL 3919 ---CKD STATISTICS--WRITE WRITE HITS 0 0 66.4 0 0.0 3921 0.0 0 0.0 0 0.0 0 0.0 RATE 66.4 ---RECORD CACHING--READ MISSES WRITE PROM 0 3456 66.5 0.0

52 0 5073

0.9 52 0.9 0.0 0 0.0 86.0 5073 86.0 ------------MISC-----------COUNT RATE DFW BYPASS 0 0.0 CFW BYPASS 0 0.0 DFW INHIBIT 0 0.0 ASYNC (TRKS) 3947 66.9 ----HOST ADAPTER ACTIVITY--BYTES BYTES /REQ /SEC READ 6.1K 1.5M WRITE 5.7K 491.0K

52 0 5073

0.9 1.000 0.0 0.0 N/A N/A 86.0 1.000 74.6 ------NON-CACHE I/O----COUNT RATE ICL 0 0.0 BYPASS 0 0.0 TOTAL 0 0.0

--------DISK ACTIVITY------RESP BYTES BYTES TIME /REQ /SEC READ 6.772 53.8K 3.6M WRITE 12.990 6.8K 455.4K

Following the report in Example 15-7 on page 470 is the CACHE SUBSYSTEM ACTIVITY report by volume serial number, as shown in Example 15-8. Here, you can see to which extent pool each volume belongs. With the following setup, it is easier to perform the analysis if a performance problem happens on the LCU:
- One extent pool has one rank.
- All volumes on an LCU belong to the same extent pool.

If we look at the rank statistics in the report in Example 15-11 on page 474, we know that all the I/O activity on that rank comes from the same LCU. So, we can concentrate our analysis on the volumes on that LCU only.

Note: Depending on the DDM size used and the 3390 model selected, you can put multiple LCUs on one rank, or you can also have an LCU that spans more than one rank.
Example 15-8 Cache Subsystem Activity by volume serial number
C A C H E z/OS V1R6 S U B S Y S T E M A C T I V I T Y

SYSTEM ID SYS1 DATE 06/15/2005 INTERVAL 00.59.997 RPT VERSION V1R5 RMF TIME 11.34.00 SUBSYSTEM 2107-01 CU-ID 7015 SSID 1760 CDATE 06/15/2005 CTIME 11.34.02 CINT 00.59 TYPE-MODEL 2107-922 MANUF IBM PLANT 75 SERIAL 000000012331 -----------------------------------------------------------------------------------------------------------------------------------CACHE SUBSYSTEM DEVICE OVERVIEW -----------------------------------------------------------------------------------------------------------------------------------VOLUME DEV XTNT % I/O ---CACHE HIT RATE-----------DASD I/O RATE---------ASYNC TOTAL READ WRITE % SERIAL NUM POOL I/O RATE READ DFW CFW STAGE DFWBP ICL BYP OTHER RATE H/R H/R H/R READ *ALL 100.0 338.6 186.2 86.0 0.0 66.4 0.0 0.0 0.0 0.0 66.9 0.804 0.737 1.000 74.6 *CACHE-OFF 0.0 0.0 *CACHE 100.0 338.6 186.2 86.0 0.0 66.4 0.0 0.0 0.0 0.0 66.9 0.804 0.737 1.000 74.6 PR7000 7000 0000 22.3 75.5 42.8 19.2 0.0 13.5 0.0 0.0 0.0 0.0 14.4 0.821 0.760 1.000 74.6 PR7001 7001 0000 11.5 38.8 20.9 10.5 0.0 7.5 0.0 0.0 0.0 0.0 7.6 0.807 0.736 1.000 73.1 PR7002 7002 0000 11.1 37.5 20.4 9.5 0.0 7.6 0.0 0.0 0.0 0.0 7.0 0.797 0.729 1.000 74.7 PR7003 7003 0000 11.3 38.3 22.0 8.9 0.0 7.4 0.0 0.0 0.0 0.0 6.8 0.806 0.747 1.000 76.8 PR7004 7004 0000 3.6 12.0 6.8 3.0 0.0 2.3 0.0 0.0 0.0 0.0 2.6 0.810 0.747 1.000 75.2 PR7005 7005 0000 3.7 12.4 6.8 3.2 0.0 2.4 0.0 0.0 0.0 0.0 2.7 0.808 0.741 1.000 74.1 PR7006 7006 0000 3.8 12.8 6.5 3.6 0.0 2.6 0.0 0.0 0.0 0.0 3.1 0.796 0.714 1.000 71.5 PR7007 7007 0000 3.6 12.3 6.9 3.1 0.0 2.4 0.0 0.0 0.0 0.0 2.5 0.806 0.742 1.000 75.2 PR7008 7008 0000 3.6 12.2 6.7 3.4 0.0 2.2 0.0 0.0 0.0 0.0 2.7 0.821 0.753 1.000 72.5 PR7009 7009 0000 3.6 12.2 6.8 2.9 0.0 2.5 0.0 0.0 0.0 0.0 2.3 0.796 0.732 1.000 76.4

If you specify REPORTS(CACHE(DEVICE)) when running the cache report, you will get the detailed report by volume as in Example 15-9 on page 472. This report gives you the detailed cache statistics of each volume. By specifying REPORTS(CACHE(SSID(nnnn))), you can limit this report to only certain LCUs. The report basically shows the same performance statistics as in Example 15-7 on page 470, but at the level of each volume.


Example 15-9 Cache Device Activity report detail by volume


C A C H E z/OS V1R9 D E V I C E A C T I V I T Y

SYSTEM ID WIN5 DATE 11/05/2008 INTERVAL 00.59.976 CONVERTED TO z/OS V1R10 RMF TIME 21.54.00 SUBSYSTEM 2107-01 CU-ID C01C SSID 0847 CDATE 11/05/2008 CTIME 21.54.01 CINT 01.00 TYPE-MODEL 2107-932 MANUF IBM PLANT 75 SERIAL 0000000AB171 VOLSER @9C02F NUM C02F extent POOL 0000 -------------------------------------------------------------------------------------------------------------------------CACHE DEVICE STATUS -------------------------------------------------------------------------------------------------------------------------CACHE STATUS DUPLEX PAIR STATUS CACHING - ACTIVE DUPLEX PAIR - NOT ESTABLISHED DASD FAST WRITE - ACTIVE STATUS - N/A PINNED DATA - NONE DUAL COPY VOLUME - N/A -------------------------------------------------------------------------------------------------------------------------CACHE DEVICE ACTIVITY -------------------------------------------------------------------------------------------------------------------------TOTAL I/O 3115 CACHE I/O 3115 CACHE OFFLINE N/A TOTAL H/R 0.901 CACHE H/R 0.901 CACHE I/O -------------READ I/O REQUESTS----------------------------------WRITE I/O REQUESTS---------------------% REQUESTS COUNT RATE HITS RATE H/R COUNT RATE FAST RATE HITS RATE H/R READ NORMAL 2786 46.4 2477 41.3 0.889 329 5.5 329 5.5 329 5.5 1.000 89.4 SEQUENTIAL 0 0.0 0 0.0 N/A 0 0.0 0 0.0 0 0.0 N/A N/A CFW DATA 0 0.0 0 0.0 N/A 0 0.0 0 0.0 0 0.0 N/A N/A TOTAL 2786 46.4 2477 41.3 0.889 329 5.5 329 5.5 329 5.5 1.000 89.4 -----------------------CACHE MISSES----------------------------------MISC-----------------NON-CACHE I/O----REQUESTS READ RATE WRITE RATE TRACKS RATE COUNT RATE COUNT RATE DFW BYPASS 0 0.0 ICL 0 0.0 NORMAL 309 5.1 0 0.0 311 5.2 CFW BYPASS 0 0.0 BYPASS 0 0.0 SEQUENTIAL 0 0.0 0 0.0 0 0.0 DFW INHIBIT 0 0.0 TOTAL 0 0.0 CFW DATA 0 0.0 0 0.0 ASYNC (TRKS) 173 2.9 TOTAL 309 RATE 5.1 ---CKD STATISTICS-----RECORD CACHING------HOST ADAPTER ACTIVITY----------DISK ACTIVITY------BYTES BYTES RESP BYTES BYTES WRITE 0 READ MISSES 0 /REQ /SEC TIME /REQ /SEC WRITE HITS 0 WRITE PROM 111 READ 4.1K 190.1K READ 14.302 55.6K 288.4K WRITE 4.0K 21.8K WRITE 43.472 18.1K 48.1K

15.10.8 DS8000 FICON/Fibre port and host adapter


The report in Example 15-10 on page 473 shows the port report on the DS8000, which includes the FICON ports and also the PPRC link ports. The SAID is the port ID:
- The first two characters denote the enclosure number (refer to Figure 15-16 on page 473).
- The third character denotes the host adapter number within the enclosure, numbered 0, 1, 3, and 4.
- The last character denotes the port ID within that host adapter, numbered 0, 1, 2, and 3.

The report shows that the ports are running at 2Gbps. There are FICON ports, shown under the LINK TYPE heading as ECKD READ and ECKD WRITE, and there are also PPRC ports, shown as PPRC SEND and PPRC RECEIVE.

The I/O INTENSITY is the product of the operations per second and the response time per operation. For FICON ports, it is calculated for both the read and write operations, while for PPRC ports, it is calculated for both the send and receive operations. The total I/O intensity is the sum of those two numbers on each port. For FICON ports, if the total I/O intensity reaches 4000, the response time is significantly impacted, most probably the PEND and CONN times. When this number approaches 2000, proactive actions might be needed to prevent a further increase in the total I/O intensity. Refer to the discussion of PEND and CONN times in PEND time on page 466 and CONN time on page 467. This rule does not apply to PPRC ports, especially if the distance between the primary site and the secondary site is significant.

If the DS8000 is shared between System z and Open Systems, the report in Example 15-10 also shows the port activity used by the Open Systems. It shows up as SCSI READ and SCSI WRITE on ports 0200 and 0201 in Example 15-16 on page 484.
Example 15-10 DS8000 link statistics
                                 E S S   L I N K   S T A T I S T I C S
 z/OS V1R7                 SYSTEM ID SYSA               DATE 02/01/2008     INTERVAL 14.59.778
                           CONVERTED TO z/OS V1R10 RMF  TIME 01.14.00       CYCLE 1.000 SECONDS
 SERIAL NUMBER 00000ABC01  TYPE-MODEL 002107-921        CDATE 02/01/2008    CTIME 01.14.01    CINT 14.59

 ------ADAPTER------                   BYTES      BYTES        OPERATIONS   RESP TIME      I/O
  SAID   TYPE        --LINK TYPE--     /SEC       /OPERATION   /SEC         /OPERATION   INTENSITY
  0000   FIBRE 2Gb   ECKD READ         17.2M       9.9K          1735.2        0.1          131.6
                     ECKD WRITE         7.7M      14.5K           533.9        0.2          123.4
                                                                                           ------
                                                                                            255.0
  0001   FIBRE 2Gb   ECKD READ          9.1M       8.4K          1087.2        0.1           79.9
                     ECKD WRITE         7.7M      17.0K           455.9        0.2          101.4
                                                                                           ------
                                                                                            181.2
  0101   FIBRE 2Gb   PPRC SEND          6.0M      53.1K           112.2        9.1         1024.9
                     PPRC RECEIVE        0.0        0.0             0.0        0.0            0.0
                                                                                           ------
                                                                                           1024.9
  0102   FIBRE 2Gb   PPRC SEND          6.2M      53.1K           115.9        8.6          998.0
                     PPRC RECEIVE        0.0        0.0             0.0        0.0            0.0
                                                                                           ------
                                                                                            998.0
  0200   FIBRE 2Gb   SCSI READ         10.8M      30.7K           352.4        0.2           67.5
                     SCSI WRITE         1.9M      31.5K            60.9        1.4           83.3
                                                                                           ------
                                                                                            150.8
  0201   FIBRE 2Gb   SCSI READ          9.0M      38.7K           232.0        0.2           53.3
                     SCSI WRITE       135.0K      10.7K            12.6        0.3            3.5
                                                                                           ------
                                                                                             56.8
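To make the rule of thumb easier to apply across many ports and intervals, the multiplication can be scripted. The following Python sketch is illustrative only: the port values are transcribed from Example 15-10, and the 2000 and 4000 figures are the planning and saturation thresholds discussed above (remember that the rule does not apply to PPRC ports).

   # Illustrative sketch: apply the FICON I/O-intensity rule of thumb to
   # (operations/sec, response time per operation) pairs reported per port.
   # Note: the report computes intensity from unrounded response times, so
   # the results differ slightly from the I/O INTENSITY column above.
   ports = {
       # SAID: [(link type, ops/sec, resp time in ms/op), ...] - from Example 15-10
       "0000": [("ECKD READ", 1735.2, 0.1), ("ECKD WRITE", 533.9, 0.2)],
       "0001": [("ECKD READ", 1087.2, 0.1), ("ECKD WRITE", 455.9, 0.2)],
       "0200": [("SCSI READ", 352.4, 0.2), ("SCSI WRITE", 60.9, 1.4)],
   }

   for said, rows in sorted(ports.items()):
       enclosure, adapter, port = said[:2], said[2], said[3]  # SAID decoding described above
       total = sum(ops * resp for _, ops, resp in rows)       # total I/O intensity of the port
       if total >= 4000:
           state = "saturated: expect elongated PEND and CONN times"
       elif total >= 2000:
           state = "approaching the limit: plan corrective action"
       else:
           state = "OK"
       print(f"SAID {said} (enclosure {enclosure}, adapter {adapter}, port {port}): "
             f"I/O intensity {total:7.1f} - {state}")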

[Figure 15-16 layout: the 2107 host I/O port numbers as reported by the DS CLI, I0000 through I0343. Each of the four I/O enclosures (0 to 3) holds host adapters in slots 0, 1, 3, and 4, with the device adapters in the remaining slots, and each host adapter provides ports 0 to 3. For example, I0000 is enclosure 0, adapter 0, port 0, and I0343 is enclosure 3, adapter 4, port 3, matching the SAID decoding described above.]

Figure 15-16 DS8000 port numbering


15.10.9 Extent pool and rank


One entire rank can only be assigned to one extent pool. First, we discuss the configuration where one extent pool only has one rank. Then, we discuss the configuration for multiple ranks without and with Storage Pool Striping.

One rank on one extent pool


Example 15-11 shows the rank performance statistics with one rank per extent pool. The important metric here is the read response time per operation. If this number is greater than 35 ms, it is an indication that the DDMs within the rank are saturated. This saturation happens if too many I/O operations are executed against this rank. In this case, we need to identify the volumes on this rank that have extremely high I/O rates or extremely high (read + write) MBps throughput. Move a few of these volumes to other, less busy ranks. A high write response time is not a concern, because the write operation is an asynchronous operation. The actual write operation from the cache/NVS to the rank is performed after the host write I/O is considered completed, which is when the updated record is already written to both the cache and NVS. When NVS is filled up to a high-water mark percentage, the least recently used data is written down to the rank until NVS usage is reduced to a low-water mark percentage. During this activity, multiple write requests can be queued to the same DDM, which can result in the elongation of the write response time. High write response times are not usually an indication of a performance problem.
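As a quick filter when there are many ranks to review, the 35 ms rule of thumb can be applied to the read response time column of the rank report. The sketch below is only an illustration; the tuples mirror the ID, RRID, read OPS/SEC, and read RTIME/OP columns of Example 15-11 and would normally be fed from your own report extract.

   # Illustrative sketch: flag ranks whose read response time per operation
   # exceeds 35 ms, the saturation indicator discussed above.
   READ_RT_LIMIT_MS = 35.0

   # (extent pool, RRID, read ops/sec, read RTIME/OP in ms) - sample rows from Example 15-11
   ranks = [
       ("0005", "0007",  329.8,  5.6),
       ("0008", "000C", 1495.4, 58.9),
       ("0009", "000D",  579.5, 13.8),
   ]

   for pool, rrid, read_ops, read_rt in ranks:
       if read_rt > READ_RT_LIMIT_MS:
           # Candidate for relief: find the volumes with the highest I/O rate or
           # read+write MBps on this rank and move a few to less busy ranks.
           print(f"Extent pool {pool}, rank {rrid}: {read_rt} ms/op read response "
                 f"time at {read_ops} ops/sec - DDMs probably saturated")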

One LCU on one extent pool


When one LCU is assigned to one extent pool, analyzing the LCU's back-end performance is as simple as looking at the ESS RANK STATISTICS report for the extent pool/rank. Refer to Example 15-11.

One LCU on multiple extent pools


If an LCU uses multiple extent pools, we can still see the back-end performance of each extent pool/rank that belongs to that LCU. If a certain rank is saturated, we can identify which volumes of the LCU reside on that particular rank. Refer to the report in Example 15-8 on page 471 to find out which volumes are on which ranks.

Multiple LCUs on one extent pool


Here, we have a configuration where multiple LCUs are allocated on the same extent pool and thus the same rank. If there is a rank saturation, it is difficult to determine which LCU is causing the problem. The LCU with the highest response time might just be a victim and not the perpetrator of the problem. The perpetrator is usually the LCU that is flooding the rank with I/Os.
Example 15-11 Rank statistics for one rank per extent pool
                                E S S   R A N K   S T A T I S T I C S
 z/OS V1R8                 SYSTEM ID SYSA               DATE 10/29/2008     INTERVAL 14.59.875
                           CONVERTED TO z/OS V1R10 RMF  TIME 02.14.00       CYCLE 0.750 SECONDS
 SERIAL NUMBER 00000ABCD1  TYPE-MODEL 002107-921        CDATE 10/29/2008    CTIME 02.14.01    CINT 14.59

 --EXTENT POOL--         ------ READ OPERATIONS ------   ----- WRITE OPERATIONS -----   --ARRAY--  MIN  RANK   RAID
  ID    TYPE      RRID     OPS   BYTES   BYTES   RTIME     OPS   BYTES   BYTES   RTIME  NUM WDTH   RPM  CAP    TYPE
                          /SEC   /OP     /SEC    /OP      /SEC   /OP     /SEC    /OP
 0000   CKD 1Gb   0000     3.7   35.8K  133.6K     9.6    11.3   57.9K  656.8K    61.9    1   6    15   876G   RAID 5
 0001   CKD 1Gb   0001    32.9   48.3K    1.6M     4.5    11.7   35.0K  410.7K    24.1    1   6    15   876G   RAID 5
 0002   CKD 1Gb   0002    11.2   43.4K  484.2K     7.2    37.9  123.1K    4.7M   120.1    1   6    15   876G   RAID 5
 0003   CKD 1Gb   0004    57.7   49.9K    2.9M     8.8    88.1  145.2K   12.8M   126.3    1   6    15   876G   RAID 5
 0004   CKD 1Gb   0006   153.3   53.4K    8.2M     9.5    87.2  143.5K   12.5M   135.1    1   6    15   876G   RAID 5
 0005   CKD 1Gb   0007   329.8   45.5K   15.0M     5.6    28.4  156.6K    4.4M    68.3    1   6    15   876G   RAID 5
 0006   CKD 1Gb   0008   110.7   46.1K    5.1M     7.2    25.6   16.3K  418.3K    53.4    1   6    15   876G   RAID 5
 0007   CKD 1Gb   000A   149.5   54.9K    8.2M     3.9    29.3   87.2K    2.6M    88.6    1   6    15   876G   RAID 5
 0008   CKD 1Gb   000C  1495.4   53.5K   80.1M    58.9   140.1  182.8K   25.6M   325.5    1   6    15   876G   RAID 5
 0009   CKD 1Gb   000D   579.5   54.3K   31.5M    13.8    84.1  162.3K   13.6M   103.5    1   6    15   876G   RAID 5
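For the multiple-LCU case described above, a small amount of scripting can make the perpetrator easier to spot: sum the device-level load of the volumes that reside on the shared extent pool by LCU and rank the result. The sketch below is hypothetical; the device numbers, LCU IDs, and MBps figures are invented placeholders, and in practice they would come from a device-level RMF report or csv extract.

   # Hypothetical sketch: when several LCUs share one extent pool/rank, sum the
   # device-level back-end load by LCU to see which LCU is flooding the rank.
   from collections import defaultdict

   # (device number, LCU, read+write MB/s) for volumes known to reside on the
   # saturated rank - placeholder values for illustration.
   volumes = [
       ("8000", "80", 12.4),
       ("8001", "80",  9.8),
       ("8102", "81",  1.1),
       ("8203", "82",  0.4),
   ]

   load_by_lcu = defaultdict(float)
   for devnum, lcu, mbps in volumes:
       load_by_lcu[lcu] += mbps

   for lcu, mbps in sorted(load_by_lcu.items(), key=lambda kv: -kv[1]):
       print(f"LCU {lcu}: {mbps:6.1f} MB/s to the shared extent pool")
   # The LCU at the top of this list is usually the perpetrator; the other LCUs
   # with high response times may simply be victims.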

Multiple ranks on one extent pool without Storage Pool Striping (SPS)
Now, we examine the effect on performance analysis when we have multiple ranks defined on one extent pool. Example 15-12 shows the rank statistics for that configuration. In this example, extent pool 0000 contains ranks with RRID 0001, 0003, 0005, 0007, 0009, 000B, 000D, 000F, 0011, 0013, and 001F. Each rank's performance statistics, as well as the weighted average performance of the extent pool, are reported here. Regardless of the way in which we define the LCU in relationship to the extent pool, identifying the cause of a performance problem is complicated, because we can only see the association of a volume to an extent pool and not to a rank. Refer to Example 15-8 on page 471. The DSCLI showrank command can provide a list of all of the volumes that reside on the rank that we want to investigate further. Analyzing performance based on this showrank output can be difficult, because it can show volumes from multiple LCUs that reside on this one rank.

Note: Depending on the technique used to define the CKD volumes on the extent pool (refer to Table 5-2 on page 97), certain volumes can be allocated on multiple ranks within the extent pool, which complicates the performance analysis even further.
Example 15-12 Rank statistics for multiple ranks on one extent pool without SPS
                                E S S   R A N K   S T A T I S T I C S
 z/OS V1R8                 SYSTEM ID SYSE               DATE 09/28/2008     INTERVAL 14.59.942
                           CONVERTED TO z/OS V1R10 RMF  TIME 11.59.00       CYCLE 1.000 SECONDS
 SERIAL NUMBER 00000BCDE1  TYPE-MODEL 002107-921        CDATE 09/28/2008    CTIME 11.59.01    CINT 14.59

 --EXTENT POOL--         ------ READ OPERATIONS ------   ----- WRITE OPERATIONS -----   --ARRAY--  MIN  RANK     RAID
  ID    TYPE      RRID     OPS   BYTES   BYTES   RTIME     OPS   BYTES   BYTES   RTIME  NUM WDTH   RPM  CAP      TYPE
                          /SEC   /OP     /SEC    /OP      /SEC   /OP     /SEC    /OP
 0000   CKD 1Gb   0001   458.1   53.1K   24.3M     2.5    16.8   10.6K  178.5K    15.2    1   6    15    876G    RAID 5
                  0003    95.2   41.3K    3.9M     3.4    19.4   11.9K  230.5K    13.5    1   7    15   1022G    RAID 5
                  0005   580.4   54.0K   31.3M     2.7    15.8   22.8K  359.4K    16.4    1   6    15    876G    RAID 5
                  0007   146.6   45.5K    6.7M     5.0    14.6   14.4K  210.4K    12.6    1   7    15   1022G    RAID 5
                  0009    30.8   22.2K  685.7K     6.1    14.5   31.8K  462.2K    14.2    1   6    15    876G    RAID 5
                  000B   167.6   47.2K    7.9M     4.2    20.7   52.9K    1.1M    13.1    1   7    15   1022G    RAID 5
                  000D    49.2   26.1K    1.3M     5.9    12.5   12.3K  152.9K    12.1    1   6    15    876G    RAID 5
                  000F   255.4   53.1K   13.5M     3.1    11.8   26.8K  317.5K    15.4    1   6    15    876G    RAID 5
                  0011   103.0   39.3K    4.1M     7.2    20.4   21.5K  437.2K    15.0    1   7    15   1022G    RAID 5
                  0013   127.0   47.5K    6.0M     3.2     7.3   39.3K  285.9K    13.7    1   6    15    876G    RAID 5
                  001F     1.0    9.4K    9.6K     7.5     1.8   43.6K   78.7K    22.7    1   7    15   1022G    RAID 5
                  POOL  2014.3   49.5K   99.8M     3.4   155.5   24.5K    3.8M    14.2   11  71    15  10366G    RAID 5
 0001   CKD 1Gb   0000   129.4   51.3K    6.6M     2.3     6.8   21.9K  149.3K    12.5    1   6    15    876G    RAID 5
                  0002   228.6   50.9K   11.6M     2.0    16.7   38.9K  648.1K    14.7    1   7    15   1022G    RAID 5
                  0004    51.0   36.1K    1.8M     5.2     7.3   17.1K  125.8K    11.0    1   6    15    876G    RAID 5
                  0006   189.3   52.6K   10.0M     1.8     7.3   17.3K  126.4K    11.6    1   6    15    876G    RAID 5
                  0008   160.5   49.1K    7.9M     2.1     8.8   28.1K  246.5K    13.6    1   6    15    876G    RAID 5
                  000A    71.4   38.2K    2.7M     6.3    12.2   11.5K  140.1K    10.4    1   6    15    876G    RAID 5
                  000C   230.1   50.9K   11.7M     2.7    11.8   15.0K  176.4K    13.1    1   7    15   1022G    RAID 5
                  000E   183.0   50.6K    9.3M     2.6    16.6   86.8K    1.4M    15.1    1   6    15    876G    RAID 5
                  0010   150.1   50.3K    7.6M     2.6     6.2   22.0K  136.2K    12.5    1   7    15   1022G    RAID 5
                  0012   193.2   51.6K   10.0M     4.4     7.6   24.1K  182.8K    14.3    1   7    15   1022G    RAID 5
                  001E     4.5   49.8K  224.8K    16.6     0.6   57.0K   32.7K    83.1    1   7    15   1022G    RAID 5
                  POOL  1591.2   49.9K   79.4M     2.9   101.8   33.4K    3.4M    13.5   11  71    15  10366G    RAID 5


Multiple ranks on one extent pool using Storage Pool Striping


The same performance analysis challenges exist whether we use the SPS or the non-SPS technique, because in both cases there are multiple volumes from multiple LCUs allocated on each rank within the extent pool. The benefit of using SPS is that the probability of rank saturation is significantly reduced compared to not using SPS.

Example 15-13 shows extent pool 0000, which contains RRID 0000, 0002, 0008, 000A, 0010, 0012, 0018, and 001A. This report shows a balanced load on these ranks. All of the performance metrics, such as OPS/SEC, BYTES/OP, BYTES/SEC, and RTIME/OP, are balanced across all ranks that reside on this extent pool for the read and also the write operations. Compare Example 15-13 to the previous example. In Example 15-12 on page 475, we see that the OPS/SEC and the BYTES/SEC vary widely between the ranks within an extent pool, which can cause certain ranks to reach saturation while others do not. If we had used SPS in the previous configuration example, the possibility of rank saturation would have been reduced significantly.

Note: Considering the advantages of SPS, consider it as the first choice in configuring a new DS8000, unless there is a compelling reason not to use it.
Example 15-13 Rank statistics for multiple ranks on one extent pool using SPS
                                E S S   R A N K   S T A T I S T I C S
 z/OS V1R9                 SYSTEM ID SYS5               DATE 11/05/2008     INTERVAL 01.00.003
                           CONVERTED TO z/OS V1R10 RMF  TIME 21.24.00       CYCLE 0.333 SECONDS
 SERIAL NUMBER 00000AB321  TYPE-MODEL 002107-932        CDATE 11/05/2008    CTIME 21.24.00    CINT 01.00

 --EXTENT POOL--         ------ READ OPERATIONS ------   ----- WRITE OPERATIONS -----   --ARRAY--  MIN  RANK    RAID
  ID    TYPE      RRID     OPS   BYTES   BYTES   RTIME     OPS   BYTES   BYTES   RTIME  NUM WDTH   RPM  CAP     TYPE
                          /SEC   /OP     /SEC    /OP      /SEC   /OP     /SEC    /OP
 0000   CKD 1Gb   0000    40.9   49.1K    2.0M     6.4    11.3   13.1K  148.5K     7.1    1   6    15    438G   RAID 5
                  0002    39.2   47.6K    1.9M     6.8    11.9   15.8K  187.9K     7.5    1   6    15    438G   RAID 5
                  0008    40.9   48.2K    2.0M     6.4    10.9   16.5K  179.1K     7.2    1   6    15    438G   RAID 5
                  000A    37.7   48.4K    1.8M     6.5    10.3   13.8K  142.0K     7.2    1   6    15    438G   RAID 5
                  0010    40.9   48.9K    2.0M     6.2    11.4   14.4K  163.8K     7.3    1   6    15    438G   RAID 5
                  0012    40.0   47.8K    1.9M     6.2    10.5   12.3K  128.9K     7.1    1   6    15    438G   RAID 5
                  0018    42.1   47.8K    2.0M     6.6    12.1   12.4K  150.7K     7.2    1   6    15    438G   RAID 5
                  001A    45.9   48.6K    2.2M     6.7    12.7   15.1K  192.2K     7.3    1   6    15    438G   RAID 5
                  POOL   327.8   48.3K   15.8M     6.5    91.1   14.2K    1.3M     7.2    8  48    15   3504G   RAID 5
 0001   CKD 1Gb   0004    40.8   47.8K    1.9M     6.7    11.2   14.0K  157.3K     9.2    1   7    15    511G   RAID 5
                  0006    42.0   48.8K    2.0M     6.9    11.9   16.1K  192.2K     9.2    1   7    15    511G   RAID 5
                  000C    38.4   49.4K    1.9M     6.5     9.5   16.5K  157.3K     8.8    1   7    15    511G   RAID 5
                  000E    43.0   49.0K    2.1M     6.6    12.1   16.6K  201.0K     8.5    1   7    15    511G   RAID 5
                  0014    40.4   49.2K    2.0M     6.6    10.8   15.2K  163.8K     8.4    1   7    15    511G   RAID 5
                  0016    40.5   48.8K    2.0M     6.5    11.5   16.0K  183.5K     8.5    1   7    15    511G   RAID 5
                  001C    37.9   47.0K    1.8M     6.2    10.5   14.5K  152.9K     8.2    1   7    15    511G   RAID 5
                  001E    39.6   48.6K    1.9M     7.2    10.9   16.5K  179.1K    10.0    1   7    15    511G   RAID 5
                  POOL   322.7   48.6K   15.7M     6.7    88.4   15.7K    1.4M     8.9    8  56    15   4088G   RAID 5
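One way to express the balance that Example 15-13 shows is to compute how far each rank deviates from the extent pool average. The following sketch is illustrative only; the read operations-per-second values are those reported for extent pool 0000 in Example 15-13.

   # Illustrative sketch: quantify how evenly the back-end read load is spread
   # across the ranks of one extent pool (values from Example 15-13, pool 0000).
   from statistics import mean, pstdev

   read_ops_per_rank = [40.9, 39.2, 40.9, 37.7, 40.9, 40.0, 42.1, 45.9]

   avg = mean(read_ops_per_rank)
   imbalance_pct = pstdev(read_ops_per_rank) / avg * 100   # coefficient of variation
   print(f"average {avg:.1f} ops/sec per rank, imbalance {imbalance_pct:.1f}%")
   # With Storage Pool Striping the imbalance is typically a few percent; in the
   # non-SPS configuration of Example 15-12, individual ranks run at many times
   # the pool average, which is what makes rank saturation more likely there.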

15.11 RMF Magic for Windows


RMF Magic for Windows is a tool that is available from IntelliMagic BV, a company that specializes in storage performance and modeling software. Information for obtaining this product is available through the IntelliMagic Web site at:
http://www.intellimagic.net

RMF Magic provides consolidated performance reporting about your z/OS disk subsystems from the point of view of those disk subsystems, rather than from the host perspective, even when disk subsystems are shared between multiple sysplexes. This disk-centric approach makes it much easier to analyze the I/O configuration and performance. RMF Magic automatically determines the I/O configuration from your RMF data, showing the relationship between the disk subsystem serial numbers, SSIDs, LCUs, device numbers, and device types. With RMF Magic, there is no need to manually consolidate printed RMF reports.

While RMF Magic reports are based on information from RMF records, the analysis and reporting go beyond what RMF provides. In particular, it computes accurate estimates for the read and write bandwidth (MB/s) for each disk subsystem, down to the device level. With this unique capability, RMF Magic can size the links in a future remote copy configuration, because RMF Magic knows the bandwidth that is required for the links, both in I/O requests and in megabytes per second, for each point in time.

RMF Magic consolidates the information from RMF records with channel, disk, LCU, and cache information into one view per disk subsystem, per SSID (LCU), and per storage group. RMF Magic gives insight into your performance and workload data for each RMF interval within the period selected for the analysis, which can span weeks. Where RMF postprocessor reports are sorted by host and LCU, RMF Magic reports are sorted by disk subsystem and SSID (LSS). With this information, you can plan migrations and consolidations more effectively, because RMF Magic provides a detailed insight into the workload, both from a disk subsystem and a storage group perspective.

RMF Magic's graphical capabilities allow you to find hot spots and tuning opportunities in your disk configuration. Based on user-defined criteria, RMF Magic can automatically identify peaks within the analysis period. On top of that, the graphical reports make the peaks and anomalies stand out immediately, which helps you analyze and identify performance bottlenecks.

You can use RMF Magic to analyze subsystem performance for z/OS hosts. If your DS8000 also provides storage for Open Systems hosts, the tool provides reports on rank statistics and host port link statistics. The DS8000 storage subsystem provides these Open Systems statistics to RMF whenever performance data is reported to RMF, and they are then available for reporting through RMF Magic. Of course, if you have a DS8000 that has only Open Systems activity and does not include any z/OS 3390 volumes, this data cannot be collected by RMF or reported on by RMF Magic.

15.11.1 RMF Magic analysis process


The input to the RMF Magic tool is a set of RMF type 7x records that are collected on the z/OS host. The first step of the analysis process is to reduce these records, selecting the data that is necessary for the analysis.

RMF Magic consists of two components that offer three main functions:
- A Graphical User Interface (GUI) for the Windows platform that provides:
  - A Run Control Center, which provides an easy interface allowing the user to prepare, initiate, and supervise the execution of the batch component when executed on the Windows platform. The Run Control Center is also used to load the data into a Reporting database.
  - A Reporter Control Center, where the user can interactively analyze the data in the Reporting database by requesting the creation of Microsoft Excel tables and charts.
- A batch component that validates, reduces, extracts, completes, and summarizes the input. All of this activity is done in two steps, Reduce and Analyze, which we describe next in the RMF Magic reduce and RMF Magic analyze steps. You can execute the batch component on either the z/OS or Windows platform.


You execute an RMF Magic study in four steps:
1. Data collection: RMF records are collected at the site to be analyzed and sorted in time sequence.
2. RMF Magic reduce step: This step compresses the input data (better than 10:1) and creates a database. The reduce step also validates the input data.
3. RMF Magic analyze step: Based on information found in the RMF data, this step computes supplemental values, such as write MB/s. The analyst defines the groups into which to summarize the data for Remote Copy sizing or performance reporting. The output of this step consists of:
   a. Reports that are created in the form of comma separated values (csv) files to use as input for the next step.
   b. Top-10 interval lists based on user criteria.
   The csv files are loaded in a reporting database or in a Microsoft Excel workbook.
4. Data presentation and analysis: RMF Magic for Windows is used to create graphical summaries of the data stored in the reporting database. The analyst can now profile the workload, looking at various workload characteristics from the storage unit point of view. This process might require additional analysis runs, as various interest groups (or application data groups) are identified for analysis.

15.11.2 Data collection step


You always perform this step of the process at the z/OS host, because it is the process of collecting the RMF data that has been captured by the host system. As the RMF data is being captured at the host, be sure that you are gathering data for disk, channel, cache, and DS8000 performance statistics by specifying DEVICE(DASD), CACHE, and ESS in your ERBRMFxx member of PARMLIB. You must gather the RMF data for all of the systems that have access to the disk subsystems that you are going to evaluate.

When preparing the data for processing by RMF Magic for Windows, be sure that the data is sorted by time stamp; the RMF Magic package comes with a sample set of JCL that you can use for sorting the data, similar to Example 15-14. We recommend that the input to the SORT step contains only the RMF records that are recorded by the SMF subsystem. This subset of SMF data can be obtained by executing the IFASMFDP program and selecting only TYPE(70-78) records. You must not run SORT directly against the SMF dataset, because this sometimes produces invalid output datasets.
Example 15-14 Sample job for sorting the RMF data
//RMFSORT  EXEC PGM=SORT
//SORTIN   DD DISP=SHR,DSN=USERID1.RMFMAGIC.RMFDATA
//*        DD DISP=SHR,DSN=<INPUT_RMFDATA_SYSTEM2>
//*        :
//*        :
//*        DD DISP=SHR,DSN=<INPUT_RMFDATA_SYSTEMN>
//SORTOUT  DD DISP=(NEW,CATLG),UNIT=(SYSDA,5),SPACE=(CYL,(500,500)),
//            DSN=USERID1.SORTED.RMFMAGIC.RMFDATA
//SORTWKJ1 DD DISP=(NEW,DELETE),UNIT=SYSDA,SPACE=(CYL,(500))
//SORTWKJ2 DD DISP=(NEW,DELETE),UNIT=SYSDA,SPACE=(CYL,(500))
//SORTWKJ3 DD DISP=(NEW,DELETE),UNIT=SYSDA,SPACE=(CYL,(500))
//SYSPRINT DD SYSOUT=*
//SYSOUT   DD SYSOUT=*
//SYSIN    DD *
  INCLUDE COND=(06,1,BI,GE,X'46',AND,06,1,BI,LE,X'4E')
  SORT FIELDS=(11,4,CH,A,7,4,CH,A),EQUALS
  MODS E15=(ERBPPE15,500,,N),E35=(ERBPPE35,500,,N)
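The INCLUDE statement in the SYSIN keeps only the SMF records whose type byte (position 6 in the record) falls between X'46' and X'4E', which are simply the hexadecimal values of the RMF record types 70 through 78. The following small check is illustrative only.

   # SMF/RMF type 70 is X'46' and type 78 is X'4E', so the INCLUDE COND above
   # selects exactly the RMF type 70-78 (7x) records needed by RMF Magic.
   for smf_type in range(70, 79):
       print(f"SMF record type {smf_type} = X'{smf_type:02X}'")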

15.11.3 RMF Magic reduce step


This step of the process can either be run on the z/OS host or it can be run on the workstation where RMF Magic for Windows is installed. If you have the resources available at the z/OS host, we recommend that you run the reduce step there, because that way less data needs to be transferred from the host to the workstation. If you choose to run the reduce step on the workstation, RMF Magic for Windows provides a utility to package the RMF data for transfer to the workstation. This utility compresses the data for more efficient data transfer and also packages the data so that the Variable Blocked Spanned (VBS) records that are used for collecting the data at the host do not result in data transfer problems. The JCL that is used for either option is provided with the RMF Magic for Windows product.

15.11.4 RMF Magic analyze step


Again, the analyze step can either be run at the z/OS host or on the workstation. In this analyze process, you define interest groups, which can be a combination of any of:
- A set of disk subsystems
- A set of application volumes
- A set of storage management subsystem (SMS) storage groups
- A set of interval periods
- Performance criteria and thresholds

If the analyze step is done on the host, and after looking at the reports, you decide that you need a different interest group view, you analyze the data again and send the resulting database down to the workstation again. Performing the analyze step at the workstation results in a single data transfer from the host to the workstation, where you can experiment with different interest groups.

15.11.5 Data presentation and reporting step


RMF Magic for Windows uses a combination of Microsoft Excel and Microsoft PowerPoint for reporting performance data and building reports from the data. Using the spreadsheet, you can look at the details of the data that has been collected and analyzed or review charts that RMF Magic has created. When you are interested in presenting your findings to an audience, RMF Magic provides an automated interface that allows you to select specific charts from your analysis and to generate a Microsoft PowerPoint presentation file using those charts. The data presentation that is provided by RMF Magic is a powerful approach to understanding your data. The sample chart and spreadsheet shown in this section are based on the data from an installation using three DS8100s. At the highest level, you can see a summary of performance information at a glance in a series of charts as shown in Figure 15-17 on page 480. These charts show the I/O Rate and the response time components of the combined three disk subsystems and also the same graph for the selected volumes defined in the analyze parameter. In this case, the selected volumes include all the volumes on the three disk subsystems, so the graphs on the left and the right are exactly the same.

The other standard charts show the breakdown by disk subsystem or by storage group, and also the data transfer rate in MBps. These charts show the power of being able to see a graphical representation of your storage subsystem over time. Figure 15-17 is a sampling of the standard summary charts that are automatically created by the RMF Magic reporting tool. Additional standard reports include back-end RAID activity rates, read hit percentages, and a variety of breakdowns of I/O response time components.

Figure 15-17 Sample set of RMF Magic workload summary charts

In addition to the graphical view of your performance data, RMF Magic also provides detailed spreadsheet views of important measurement data, with highlighting of spreadsheet cells for more visual access to the data. Figure 15-18 on page 481 is an example of the Performance Summary for a single disk subsystem. For each of the data measurement points (for example, Column C shows I/O Rate while Column D shows Response Time), rows 10 through 15 show a summary of the information. Rows 10 through 12 show the maximum rate that was measured, the RMF interval in which that rate occurred, and which row of the spreadsheet shows that maximum number. Easy navigation to that hot spot is available by selecting a cell in the desired column and then clicking Goto Max. For example, selecting cell D10 and then clicking Goto Max moves the cursor to highlight cell D22, which is the peak response time of 2.15 ms. At this position, the same row also shows the I/O rate and data rate during this interval.

Figure 15-18 on page 481 also shows color-coded highlighting for cells that contain the top intervals of measurement data. For example, if you are viewing this text in color, you see that column F has cells that are highlighted in pink (in this particular view, there is only one cell highlighted). The pink cells represent all of the measurement intervals that have values higher than the 95th percentile. This is a feature of the RMF Magic tool, which provides visual access to potential performance hot spots represented by your measurement data.


Figure 15-18 I/O and data rate summary for a single subsystem

Figure 15-19 on page 482 shows a spreadsheet similar in appearance to Figure 15-18 but, in this case, shows the summary of the cache measurement data. Again, the tool highlights those cells that have the highest measurement intervals for ease of navigation within the data.


[Figure 15-19 layout: an RMF Magic (RMFM4W 4.3.9) Performance Summary for each DSS with cache information. For all disk subsystems combined and for one DS8100 (IBM-ABMA1, 60608 MB cache, 8 FICON channels), it lists the I/O rate (IO/s), response time (ms), read hit percentage, and read/write ratio for each RMF interval, together with the maximum value, the interval and row where the maximum occurred, the 98th and 95th percentile values, and the average.]

Figure 15-19 Cache summary for all disk subsystems

In summary, you use the RMF Magic for Windows tool to get a view of I/O activity from the disk subsystem point of view, as opposed to the operating system point of view shown by the standard RMF reporting tools. This approach allows you to analyze the effect that each of the operating system images has on the various disk subsystem resources.

15.11.6 Hints and tips


The previous charts and spreadsheet can be obtained through the RMF Magic Reporter Control Center. If you need to perform specific analysis, you can create your own tailored spreadsheet by using the csv files that are created by the analyze step. You can also modify the analyze parameters to obtain information that is more specific to your needs.

Using the csv file


As an example, if you see an increase in response time that is caused by an increase in CONN time only, and you do not see a high channel utilization, you might want to look at the xxx@DLNK.csv file, which contains the information for the extended count key data (ECKD) and Fibre link ports of each DS8000.


Figure 15-20 shows the spreadsheet created from the csv file. Only the ECKD links and their performance statistics are shown on this spreadsheet. The following fields are calculated using these formulas:

   Read KB/op  = (Read MBps)  x (1000) / (Read op/sec)
   Write KB/op = (Write MBps) x (1000) / (Write op/sec)

At this RMF interval period, the average is 12.5 KB/op for reads and 8.1 KB/op for writes. You can plot this data over a period of time where the CONN time is higher and then compare the chart to a period in the past to see if the transfer size in KB/op has increased. If the transfer size in KB/op has increased, this increase explains the increase in the CONN time.

DateTime        Serial     LinkID  Gbps  Rd MBs  Wrt MBs  Rd Ops    Wrt Ops   Rd R/T  Wrt R/T  Rd KB/op  Wrt KB/op
9/25/2008 3:14  IBM-54321  0030    2GBs  31.95   10.51    2574.46   1277.36   0.16    0.33     12.4      8.2
9/25/2008 3:14  IBM-54321  0031    2GBs  32.41   10.62    2609.07   1285.70   0.16    0.35     12.4      8.3
9/25/2008 3:14  IBM-54321  0100    2GBs  32.04   10.66    2576.95   1280.37   0.17    0.34     12.4      8.3
9/25/2008 3:14  IBM-54321  0101    2GBs  31.90   10.14    2569.85   1262.28   0.17    0.33     12.4      8.0
9/25/2008 3:14  IBM-54321  0200    2GBs  32.74   10.44    2615.53   1278.64   0.27    0.49     12.5      8.2
9/25/2008 3:14  IBM-54321  0202    2GBs  32.40   10.05    2590.79   1262.86   0.26    0.46     12.5      8.0
9/25/2008 3:14  IBM-54321  0300    2GBs  32.01   10.35    2571.53   1270.57   0.17    0.35     12.4      8.1
9/25/2008 3:14  IBM-54321  0301    2GBs  32.20   10.23    2579.33   1265.87   0.17    0.36     12.5      8.1
                                                                                      average: 12.5      8.1

Figure 15-20 DS8000 link performance statistics
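The same KB-per-operation calculation can be scripted directly against the xxx@DLNK.csv output instead of being done in the spreadsheet. The sketch below is hypothetical: the link values are transcribed from Figure 15-20, and the exact column names and layout of your csv file may differ.

   # Hypothetical sketch: derive the average transfer size (KB per operation)
   # per ECKD link from read/write MB/s and operations/s, as in Figure 15-20.
   links = [
       # (LinkID, read MB/s, write MB/s, read ops/s, write ops/s)
       ("0030", 31.95, 10.51, 2574.46, 1277.36),
       ("0031", 32.41, 10.62, 2609.07, 1285.70),
   ]

   for link, rd_mbs, wr_mbs, rd_ops, wr_ops in links:
       rd_kb_per_op = rd_mbs * 1000 / rd_ops   # Read KB/op = Read MBps x 1000 / Read op/sec
       wr_kb_per_op = wr_mbs * 1000 / wr_ops   # Write KB/op = Write MBps x 1000 / Write op/sec
       print(f"Link {link}: read {rd_kb_per_op:.1f} KB/op, write {wr_kb_per_op:.1f} KB/op")
   # A transfer size that grows over time is one explanation for a growing CONN time.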

Tailoring the analyze parameters


If there is saturation in an extent pool of a DS8000 that uses SPS, it is hard to determine which volumes are the main contributors to the heavy load. You can run the analyze step using your own tailored parameters to help analyze this situation. First, determine which ranks belong to the problem extent pool. This information is available from the rank statistics report, as in Example 15-13 on page 476. Running the DSCLI showrank command for all of these ranks gives you the addresses that reside on the extent pool. Now, you can run the analyze step using the following examples:
- Example 15-15: Specify the DS8000 serial number, and replace begin and end with the range of addresses from the showrank output. The busydevice criterion creates a list of individual addresses that have an I/O rate greater than (in this example) 99 IOPS.
- Example 15-16 on page 484: The busydevice criterion creates a list of individual addresses that have a total read + write throughput greater than (in this example) 5 MBps. The miniorate parameter filters out any volumes that have an I/O rate of less than 1 IOPS. Use miniorate when you use response time as the busydevice criterion.
Example 15-15 Parameters for I/O rate criteria
interestgroup sps,
  limitto=(dss=(ibm-12345))
  select=(devnum=((begin1, end1),..,(beginN, endN)),
  busydevice=(criterion=iorate,threshold=99);

Example 15-16 Parameters for read + write throughput criteria
interestgroup sp2,
  limitto=(dss=(ibm-12345))
  select=(devnum=((begin1, end1),..,(beginN, endN)),
  busydevice=(criterion=mbs,threshold=5,miniorate=1);

The output that you want to look at is the xxx$DDEV file, which has an entry for each individual volume that meets the above criteria. Based on either or both csv files, you should be able to determine which volumes are the biggest contributors to the rank saturation. After identifying these volumes, you will need to move some of them to other, less busy extent pools. If all the other extent pools are just as busy, then it is time to add more ranks to be able to handle the workload. Figure 15-21 shows the spreadsheet created from the xxx$DDEV.csv file for the read + write throughput criteria. The r+w mbs column is calculated from (writembs) + (readmbs). You can sort this spreadsheet by the r+w mbs column to identify the volumes with the largest load contribution.
addr  Volser  iorate  resp  writerate  writetrack  writembs  readrate  readmbs  iosq  pend  conn  disc  rhr    r+w mbs
813F  LD2994   915.0  0.57        0.0         0.0       0.0     915.0     17.2   0.0  0.18  0.33  0.07  0.998     17.2
804C  LD29UC   472.4  0.57       27.5        27.5       0.5     444.7      8.0   0.0  0.19  0.32  0.06  0.994      8.5
8450  LD29SS   385.4  0.57        0.0         0.0       0.0     385.6      7.0   0.0  0.17  0.32  0.08  0.994      7.0
8068  LD29U4   379.4  0.58        3.0         3.0       0.1     376.3      6.9   0.0  0.19  0.32  0.07  0.994      7.0
806D  LD29U9   397.0  0.57        0.0         0.0       0.0     397.0      6.8   0.0  0.18  0.31  0.08  0.989      6.8
8634  LD29IU   366.4  0.56        0.0         0.0       0.0     366.3      5.8   0.0  0.16  0.30  0.09  0.994      5.8
8065  LD29U1   276.5  0.57        0.0         0.0       0.0     276.4      4.8   0.0  0.19  0.31  0.07  0.994      4.8
8147  LD299C   123.0  0.93        2.4         2.4       0.1     120.5      3.9   0.0  0.22  0.44  0.27  0.977      4.0
8145  LD299A   374.4  0.48        0.0         0.0       0.0     374.5      3.6   0.0  0.16  0.25  0.07  0.994      3.6
854D  LD29S3    87.6  0.99        0.0         0.0       0.0      87.6      3.2   0.0  0.22  0.48  0.29  0.977      3.2
8263  LD29BT   264.1  0.51        0.0         0.0       0.0     264.1      3.1   0.0  0.16  0.27  0.08  0.994      3.1
855D  LD29S6   134.3  0.79        0.0         0.1       0.0     134.3      2.8   0.0  0.21  0.34  0.24  0.979      2.8
813D  LD2992   126.9  0.84        0.0         0.0       0.0     126.9      2.7   0.0  0.20  0.35  0.30  0.971      2.7
8031  LD295Y   121.6  0.78       29.1        29.1       0.7      92.2      2.0   0.0  0.18  0.34  0.26  0.971      2.7
8367  LD29W1    56.5  0.70       46.3        46.3       2.2       9.1      0.4   0.0  0.18  0.49  0.03  0.979      2.6
813E  LD2993    80.6  0.96        0.0         0.0       0.0      80.5      2.6   0.0  0.23  0.44  0.29  0.980      2.6
8451  LD29ST   219.4  0.59        0.0         0.0       0.0     219.4      2.6   0.0  0.18  0.26  0.14  0.989      2.6
863F  LD29SD   111.9  0.87        0.1         0.1       0.0     111.8      2.5   0.0  0.19  0.36  0.32  0.984      2.5

Figure 15-21 DDEV spreadsheet for read + write MBps criteria
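Rather than sorting the spreadsheet by hand, the xxx$DDEV csv can also be processed with a short script. The following sketch is hypothetical: the file name is a placeholder, and the column names (addr, Volser, iorate, readmbs, writembs) are assumed to match the headings shown in Figure 15-21; adjust them to whatever your csv actually contains.

   # Hypothetical sketch: read the xxx$DDEV csv produced by the analyze step,
   # compute read+write MB/s per volume, and list the heaviest contributors.
   import csv

   with open("sps$DDEV.csv", newline="") as f:      # placeholder file name
       rows = list(csv.DictReader(f))

   for row in rows:
       row["rw_mbs"] = float(row["readmbs"]) + float(row["writembs"])

   for row in sorted(rows, key=lambda r: r["rw_mbs"], reverse=True)[:10]:
       print(f'{row["addr"]} {row["Volser"]}: {row["rw_mbs"]:.1f} MB/s '
             f'at {float(row["iorate"]):.0f} IO/s')
   # The volumes at the top of this list are the candidates to move to a less
   # busy extent pool, as described above.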


Chapter 16. Databases

This chapter reviews the major IBM database systems and the performance considerations when they are used with the DS8000 disk subsystem. We limit our discussion to the following databases:
- DB2 Universal Database (DB2 UDB) in a z/OS environment
- DB2 UDB in an Open Systems environment
- IMS in a z/OS environment

You can obtain additional information at this Web site:
http://www.ibm.com/software/data/db2/udb


16.1 DB2 in a z/OS environment


In this section, we provide a description of the characteristics of the various database workloads, as well as the types of data-related objects used by DB2 (in a z/OS environment). Also, we discuss the performance considerations and general guidelines for using DB2 with the DS8000, as well as a description of the tools and reports that can be used for monitoring DB2.

16.1.1 Understanding your database workload


To better understand and position the performance of your particular database system, it is helpful to first learn about the following common database profiles and their unique workload characteristics.

DB2 online transaction processing (OLTP)


OLTP databases are among the most mission-critical and widely deployed of all. The primary defining characteristic of OLTP systems is that the transactions are processed in real time or online and often require immediate response back to the user. Examples include:
- A point-of-sale terminal in a retail business
- An automated teller machine (ATM) used for bank transactions
- A telemarketing site processing sales orders and checking inventories

From a workload perspective, OLTP databases typically:
- Process a large number of concurrent user sessions
- Process a large number of transactions using simple SQL statements
- Process a single database row at a time
- Are expected to complete transactions in seconds, not minutes or hours

OLTP systems process the day-to-day operation of businesses and, therefore, have strict user response and availability requirements. They also have extremely high throughput requirements and are characterized by large numbers of database inserts and updates. They typically serve hundreds, or even thousands, of concurrent users.

Decision support systems (DSSs)


DSSs differ from the typical transaction-oriented systems in that they most often use data extracted from multiple sources for the purpose of supporting user decision making. The types of processing consist of:
- Data analysis applications using predefined queries
- Application-generated queries
- Ad hoc user queries
- Reporting requirements

DSS systems typically deal with substantially larger volumes of data than OLTP systems due to their role in supplying users with large amounts of historical data. Whereas 100 GB of data is considered large for an OLTP environment, a large DSS system might be 1 TB of data or more. The increased storage requirements of DSS systems can also be attributed to the fact that they often contain multiple, aggregated views of the same data.

While OLTP queries are mostly related to one specific business function, DSS queries are often substantially more complex. The need to process large amounts of data results in many CPU-intensive database sort and join operations. The complexity and variability of these types of queries must be given special consideration when estimating the performance of a DSS system.


16.1.2 DB2 overview


DB2 is a database management system based on the relational data model. Most users choose DB2 for applications that require good performance and high availability for large amounts of data. This data is stored in datasets mapped to DB2 tablespaces and distributed across DB2 databases. Data in tablespaces is often accessed using indexes that are stored in index spaces.

Data tablespaces can be divided into two groups: system tablespaces and user tablespaces. Both of these tablespaces have identical data attributes. The difference is that system tablespaces are used to control and manage the DB2 subsystem and user data. System tablespaces require the highest availability and special considerations: user data cannot be accessed if the system data is not available. In addition to data tablespaces, DB2 requires a group of traditional datasets, not associated with tablespaces, that are used by DB2 to provide data availability: the backup and recovery datasets.

In summary, the three major dataset types in a DB2 subsystem are:
- DB2 system tablespaces
- DB2 user tablespaces
- DB2 backup and recovery datasets

The following sections describe the objects and datasets that DB2 uses.

16.1.3 DB2 storage objects


DB2 manages data by associating it to a set of DB2 objects. These objects are logical entities, and several of them are kept in storage. The DB2 data objects are:
- TABLE
- TABLESPACE
- INDEX
- INDEXSPACE
- DATABASE
- STOGROUP

Here, we briefly describe each of them.

TABLE
All data managed by DB2 is associated to a table. The table is the main object used by DB2 applications.

TABLESPACE
A tablespace is used to store one or more tables. A tablespace is physically implemented with one or more datasets. Tablespaces are VSAM linear datasets (LDS). Because tablespaces can be larger than the largest possible VSAM dataset, a DB2 tablespace can require more than one VSAM dataset.

INDEX
A table can have one or more indexes (or can have no index). An index contains keys. Each key points to one or more data rows. The purpose of an index is to get direct and faster access to the data in a table.


INDEXSPACE
An index space is used to store an index. An index space is physically represented by one or more VSAM LDSs.

DATABASE
The database is a DB2 representation of a group of related objects. Each of the previously named objects has to belong to a database. DB2 databases are used to organize and manage these objects.

STOGROUP
A DB2 storage group is a list of storage volumes. STOGROUPs are assigned to databases, tablespaces, or index spaces when using DB2-managed objects. DB2 uses STOGROUPs for disk allocation of the table and index spaces. Installations that are storage management subsystem (SMS)-managed can define STOGROUP with VOLUME(*). This specification implies that SMS assigns a volume to the table and index spaces in that STOGROUP. In order to assign a volume to the table and index spaces in the STOGROUP, SMS uses automatic class selection (ACS) routines to assign a storage class, a management class, and a storage group to the table or index space.

16.1.4 DB2 dataset types


As already mentioned, DB2 uses system and user tablespaces for the data, as well as a group of datasets not associated with tablespaces that are used by DB2 to provide data availability; these datasets are backup and recovery datasets.

DB2 system tablespaces


DB2 uses databases to control and manage its own operation and the application data:
- The catalog and directory databases: Both databases contain DB2 system tables. DB2 system tables hold data definitions, security information, data statistics, and recovery information for the DB2 system. The DB2 system tables reside in DB2 system tablespaces. The DB2 system tablespaces are allocated when a DB2 system is first created. DB2 provides the IDCAMS statements required to allocate these datasets as VSAM LDSs.
- The work database: The work database is used by DB2 to resolve SQL queries that require temporary work space. Multiple tablespaces can be created for the work database.

DB2 application tablespaces


All application data in DB2 is organized in the objects as described in 16.1.3, DB2 storage objects on page 487. Application tablespaces and index spaces are VSAM LDS with the same attributes as DB2 system tablespaces and index spaces. The difference between system and application data is made only because they have different performance and availability requirements.

DB2 recovery datasets


In order to provide data integrity, DB2 uses datasets for recovery purposes. We briefly describe these DB2 recovery datasets. These datasets are described in further detail in the Administration Guide of DB2 for OS/390, SC26-8957.


The recovery datasets are:
- Bootstrap dataset: DB2 uses the bootstrap dataset (BSDS) to manage recovery and other DB2 subsystem information. The BSDS contains information needed to restart and to recover DB2 from any abnormal circumstance. For example, all log datasets are automatically recorded in the BSDS. While DB2 is active, the BSDS is open and updated. DB2 always requires two copies of the BSDS, because they are critical for data integrity. For availability reasons, the two BSDS datasets must be placed on separate servers of the DS8000 or in separate logical control units (LCUs).
- Active logs: The active log datasets are used for data recovery and to ensure data integrity in case of software or hardware errors. DB2 uses active log datasets to record all updates to user and system data. The active log datasets are open as long as DB2 is active. Active log datasets are reused when the total active log space is used up, but only after the active log (to be overlaid) has been copied to an archive log. DB2 supports dual active logs, and we strongly recommend that you use dual active logs for all DB2 production environments. For availability reasons, the log datasets must be placed on separate servers of the DS8000 or in separate LCUs.
- Archive logs: Archive log datasets are DB2-managed backups of the active log datasets. Archive log datasets are automatically created by DB2 whenever an active log is filled. DB2 supports dual archive logs, and we recommend that you use dual archive log datasets for all production environments. Archive log datasets are sequential datasets that can be defined on disk or on tape and migrated and deleted with standard procedures.

16.2 DS8000 considerations for DB2


Benefits of using the DS8000 in a DB2 environment include:
- DB2 takes advantage of the parallel access volume (PAV) function, which allows multiple concurrent I/Os to the same volume at the same time from applications running on a z/OS system image.
- There is less disk contention when accessing the same volumes from different systems in a DB2 data sharing group, thanks to the Multiple Allegiance function.
- The higher bandwidth of the DS8000 allows higher I/O rates to be handled by the disk subsystem, thus allowing for higher application transaction rates.

16.3 DB2 with DS8000 performance recommendations


When using DS8000, the following generic recommendations will be useful when planning for good DB2 performance.

16.3.1 Know where your data resides


DB2 storage administration can be done using SMS to simplify disk use and control, or also without using SMS. In both cases, it is important that you know where your data resides.

If you want optimal performance from DS8000, do not treat it totally like a black box. Understand how DB2 tables map to underlying volumes and how the volumes map to RAID arrays.

16.3.2 Balance workload across DS8000 resources


You can balance workload activity across DS8000 resources by:
- Spreading DB2 data across DS8000s, if practical
- Spreading DB2 data across the servers in each DS8000
- Spreading DB2 data across DS8000 device adapters
- Spreading DB2 data across as many extent pools/ranks as practical

You can intermix tables and indexes, as well as system, application, and recovery datasets, on DS8000 ranks. The overall I/O activity will be more evenly spread, and I/O skews will be avoided.

16.3.3 Take advantage of VSAM data striping


Before VSAM data striping was available, sequential processing of a multi-extent, multi-volume VSAM dataset did not present any type of parallelism. When an I/O operation was executed for an extent on a volume, no other activity from the same task was scheduled to the other volumes. VSAM data striping addresses this problem with two modifications to the traditional data organization:
- The records are not placed in key ranges along the volumes; instead, they are organized in stripes.
- Parallel I/O operations are scheduled to sequential stripes on different volumes.

By striping data, the VSAM control intervals (CIs) are spread across multiple devices. This format allows a single application request for records in multiple tracks and CIs to be satisfied by concurrent I/O requests to multiple volumes. The result is improved data transfer to the application. The scheduling of I/O to multiple volumes in order to satisfy a single application request is referred to as an I/O path packet. We can stripe across ranks, device adapters, servers, and DS8000s.

In a DS8000 with Storage Pool Striping (SPS), the implementation of VSAM striping still provides a performance benefit. Because DB2 uses two engines for the list prefetch operation, VSAM striping increases the parallelism of DB2 list prefetch I/Os. This parallelism exists with respect to the channel operations as well as the disk access. If you plan to enable VSAM I/O striping, refer to DB2 for z/OS and OS/390 Version 7 Performance Topics, SG24-6129.

16.3.4 Large volumes


With large volume support, which allows up to 65520 cylinders per volume, System z users can allocate the larger capacity volumes in the DS8000. From the DS8000 perspective, the capacity of a volume does not determine its performance. From the z/OS perspective, PAVs reduce or eliminate any additional enqueues that might originate from the increased I/O on the larger volumes. From the storage administration perspective, configurations with larger volumes are simpler to manage.


Measurements oriented to determine how large volumes can impact DB2 performance have shown that similar response times can be obtained using larger volumes compared to using the smaller 3390-3 standard size volumes (refer to 15.6.2, Larger volume compared to smaller volume performance on page 451 for a discussion).

16.3.5 Modified Indirect Data Address Words (MIDAWs)


MIDAWs, as described in 15.7.3, MIDAW on page 457, help improve performance when accessing large chains of small blocks of data. In order to get this benefit, the dataset must be accessed by Media Manager. The bigger benefit is realized for the following datasets:
- Extended Format (EF) datasets
- Datasets that have small blocksizes (4K)

Examples of DB2 applications that benefit from MIDAWs are DB2 prefetch and DB2 utilities.

16.3.6 Adaptive Multi-stream Prefetching (AMP)


As described in 2.2, Processor memory and cache on page 12, AMP works in conjunction with DB2 sequential and dynamic prefetch. It works even better for dynamic prefetch than it does for most other sequential applications, because dynamic prefetch uses two prefetch engines. DB2 has the ability to explicitly request that the DS8000 prefetch from the disks. AMP simply adjusts the prefetch quantity that the DS8000 uses to prefetch tracks from the disks into the cache.

16.3.7 DB2 burst write


When DB2 updates a record, it first updates the copy of the record residing in the buffer pool. If the percentage of changed records in the buffer pool reaches the threshold defined by the vertical deferred write threshold (VDWQT), DB2 starts to flush and write out these updated records. This write activity arrives at the disk subsystem as a large burst of write I/Os, especially if the buffer pool is large and the VDWQT is high. The burst can saturate nonvolatile storage (NVS), because NVS is flooded with too many writes. It shows up in the Resource Measurement Facility (RMF) cache report as DASD Fast Write Bypass (DFWBP or DFW Bypass).

The term bypass is actually misleading. In the 3990/3390 era, when the NVS was full, the write I/O bypassed the NVS and the data was written directly to the disk drive module (DDM). In the DS8000, when the NVS is full, the write I/O is retried from the host until NVS space becomes available, so DFW Bypass must be interpreted as DFW Retry for the DS8000. If RMF shows that the DFW Bypass rate divided by the total I/O rate is greater than 1%, that is an indication of NVS saturation.

If this NVS saturation happens, we recommend that you set the VDWQT to 0 or 1. Setting the VDWQT to 0 does not mean that every record update will cause a write I/O to be triggered: despite the 0% threshold, the DB2 buffer pool still sets aside 40 buffers, and only the 32 least recently updated buffers are scheduled for write. This allows multiple successive updates to the same record to be absorbed without writing the record out every time it is updated, preventing multiple write I/Os to the same record on the disk subsystem. Lowering the VDWQT has a cost: it increases the CPU utilization, which shows up as a higher DBM1 SRM CPU time.
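The 1% rule of thumb is easy to check against the RMF cache report values for a given interval. The following sketch is illustrative only; the two input values are hypothetical and would be taken from your own report.

   # Illustrative sketch: NVS saturation check based on the 1% rule described above.
   dfw_bypass_rate = 2.5      # DFW Bypass (that is, DFW Retry) per second - hypothetical value
   total_io_rate = 180.0      # total I/O per second - hypothetical value

   bypass_pct = dfw_bypass_rate / total_io_rate * 100
   if bypass_pct > 1.0:
       print(f"DFW retry ratio {bypass_pct:.1f}% is above 1%: NVS saturation - "
             "consider lowering VDWQT to 0 or 1")
   else:
       print(f"DFW retry ratio {bypass_pct:.2f}%: no NVS saturation indicated")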

16.3.8 Monitoring DS8000 performance


You can use RMF to monitor the performance of DS8000. For a detailed discussion, see 15.9, DS8000 performance monitoring tools on page 464.


16.4 DS8000 DB2 UDB in an Open Systems environment


This section discusses the performance considerations when using DB2 UDB with the DS8000 in an Open Systems environment. The information presented in this section is further discussed in detail in (and liberally borrowed from) the book IBM ESS and IBM DB2 UDB Working Together, SG24-6262. Many of the concepts presented there are applicable to the DS8000, and we highly recommend this book. However, based on client solution experiences using SG24-6262, there are two corrections that we want to point out:
- Section 3.2.2 of IBM ESS and IBM DB2 UDB Working Together, SG24-6262, Balance workload across ESS resources, suggests that a data layout policy must be established that allows partitions and containers within partitions to be spread evenly across ESS resources. It further suggests that you can choose either a horizontal mapping, in which every partition has containers on every available ESS rank, or a vertical mapping, in which DB2 partitions are isolated to specific arrays, with containers spread evenly across those ranks. We now recommend the vertical mapping approach. The vertical, isolated storage approach is typically easier to configure, manage, and diagnose if problems arise in production.
- Another data placement consideration suggests that it is important to place frequently accessed files in the space allocated from the middle of an array. This suggestion was an error in the original publication. The intent of the section was to discuss how the placement considerations commonly used with older, non-RAID disk technology have less significance in ESS environments.

Note: Based on experience, we now recommend a vertical data mapping approach (shared nothing between data partitions). We also want to emphasize that you must not try to micro-manage data placement on storage.

16.4.1 DB2 UDB storage concepts


DB2 Universal Database (DB2 UDB) is the IBM object-relational database for UNIX, Linux, and Windows operating environments. The database object that maps the physical storage is the tablespace. Figure 16-1 on page 493 illustrates how DB2 UDB is logically structured and how the tablespace maps the physical object.


[Figure 16-1 layout: the DB2 UDB logical database objects (system, instances, databases, tablespaces, and the tables, indexes, and long data within them) and their equivalent physical objects. Tablespaces are where tables are stored; for an SMS tablespace, each container is a directory in the file space of the operating system (for example, /fs.rb.T1.DA3a1), while for a DMS tablespace, each container is a fixed, pre-allocated file or a physical device such as a disk.]

Figure 16-1 DB2 UDB logical structure

Instances
An instance is a logical database manager environment where databases are cataloged and configuration parameters are set. An instance is similar to an image of the actual database manager environment. You can have several instances of the database manager product on the same database server. You can use these instances to separate the development environment from the production environment, tune the database manager to a particular environment, and protect sensitive information from a particular group of users. For database partitioning features (DPF) of the DB2 Enterprise Server Edition (ESE), all data partitions reside within a single instance.

Databases
A relational database structures data as a collection of database objects. The primary database object is the table (a defined number of columns and any number of rows). Each database includes a set of system catalog tables that describe the logical and physical structure of the data, configuration files containing the parameter values allocated for the database, and recovery logs. DB2 UDB allows multiple databases to be defined within a single database instance. Configuration parameters can also be set at the database level, thus allowing you to tune, for example, memory usage and logging.

Database partitions
A partition number in DB2 UDB terminology is equivalent to a data partition. Databases with multiple data partitions and residing on an SMP system are also called multiple logical partition (MLN) databases.

Partitions are identified by the physical system where they reside as well as by a logical port number with the physical system. The partition number, which can be from 0 to 999, uniquely defines a partition. Partition numbers must be in ascending sequence (gaps in the sequence are allowed). The configuration information of the database is stored in the catalog partition. The catalog partition is the partition from which you create the database.

Partitiongroups
A partitiongroup is a set of one or more database partitions. For non-partitioned implementations (all editions except for DPF), the partitiongroup is always made up of a single partition.

Partitioning map
When a partitiongroup is created, a partitioning map is associated to it. The partitioning map in conjunction with the partitioning key and hashing algorithm is used by the database manager to determine which database partition in the partitiongroup will store a given row of data. Partitioning maps do not apply to non-partitioned databases.

Containers
A container is the way of defining where on the storage device the database objects will be stored. Containers can be assigned from filesystems by specifying a directory. These containers are identified as PATH containers. Containers can also reference files that reside within a directory. These containers are identified as FILE containers, and a specific size must be identified. Containers can also reference raw devices. These containers are identified as DEVICE containers, and the device must already exist on the system before the container can be used. All containers must be unique across all databases; a container can belong to only one tablespace.

Tablespaces
A database is logically organized in tablespaces. A tablespace is a place to store tables. To spread a tablespace over one or more disk devices, you simply specify multiple containers. For partitioned databases, the tablespaces reside in partitiongroups. In the create tablespace command execution, the containers themselves are assigned to a specific partition in the partitiongroup, thus maintaining the shared nothing character of DB2 UDB DPF. Tablespaces can be either system-managed space (SMS) or data-managed space (DMS). For an SMS tablespace, each container is a directory in the filesystem, and the operating system file manager controls the storage space (Logical Volume Manager (LVM) for AIX). For a DMS tablespace, each container is either a fixed-size pre-allocated file, or a physical device, such as a disk (or in the case of the DS8000, a vpath), and the database manager controls the storage space. There are three major types of user tablespaces: Regular (index and data), temporary, and long. In addition to these user-defined tablespaces, DB2 requires that you define a system tablespace, the catalog tablespace. For partitioned database systems, this catalog tablespace resides on the catalog partition.

Tables, indexes, and LOBs


A table is a named data object consisting of a specific number of columns and unordered rows. Tables are uniquely identified units of storage maintained within a DB2 tablespace. They consist of a series of logically linked blocks of storage that have been given the same name. They also have a unique structure for storing information that allows the information to be related to information on other tables. When creating a table, you can choose to have certain objects, such as indexes and large object (LOB) data, stored separately from the rest of the table data, but you must define this table to a DMS tablespace. Indexes are defined for a specific table and assist in the efficient retrieval of data to satisfy queries. They can also be used to assist in the clustering of data. Large objects (LOBs) can be stored in columns of the table. These objects, although logically referenced as part of the table, can be stored in their own tablespace when the base table is defined to a DMS tablespace, which allows for more efficient access of both the LOB data and the related table data.

Pages
Data is transferred to and from devices in discrete blocks that are buffered in memory. These discrete blocks are called pages, and the memory reserved to buffer a page transfer is called an I/O buffer. DB2 UDB supports various page sizes, including 4 KB, 8 KB, 16 KB, and 32 KB. When an application accesses data randomly, the page size determines the amount of data transferred. This size corresponds to the size of the data transfer request made to the DS8000, which is sometimes referred to as the physical record. Sequential read patterns can also influence the page size selected. Larger page sizes for workloads with sequential read patterns can enhance performance by reducing the number of I/Os.

Extents
An extent is a unit of space allocation within a container of a tablespace for a single tablespace object. This allocation consists of multiple pages. The extent size (number of pages) for an object is set when the tablespace is created. An extent is a group of consecutive pages defined to the database. The data in the tablespace is striped by extent across all of the containers of that tablespace.

Buffer pools
A buffer pool is main memory allocated on the host processor to cache table and index data pages as they are being read from disk or being modified. The purpose of the buffer pool is to improve system performance. Data can be accessed much faster from memory than from disk; therefore, the fewer times that the database manager needs to read from or write to disk (I/O), the better the performance. Multiple buffer pools can be created.
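For example, a buffer pool dedicated to 16 KB pages can be created as follows (the name and size are arbitrary and would be tuned to the available server memory):

   CREATE BUFFERPOOL BP16K IMMEDIATE SIZE 100000 PAGESIZE 16 K

Tablespaces that use a 16 KB page size can then reference this buffer pool with the BUFFERPOOL BP16K clause.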

DB2 prefetch (reads)


Prefetching is a technique for anticipating data needs and reading ahead from storage in
large blocks. By transferring data in larger blocks, fewer system resources are used and less time is required. Sequential prefetch reads consecutive pages into the buffer pool before they are needed by DB2. List prefetches are more complex. In this case, the DB2 optimizer optimizes the retrieval of randomly located data.

The amount of data being prefetched determines the amount of parallel I/O activity. Ordinarily, the database administrator defines a prefetch value large enough to allow parallel use of all of the available containers. Consider the following example: A tablespace is defined with a page size of 16 KB using raw DMS. The tablespace is defined across four containers, each container residing on a separate logical device, and the logical devices are on different DS8000 ranks. The extent size is defined as 16 pages (or 256 KB). The prefetch value is specified as 64 pages (number of containers x extent size). A user makes a query that results in a tablespace scan, which then results in DB2 performing a prefetch operation. DB2, recognizing that this prefetch request for 64 pages (1 MB) evenly spans four containers, makes four parallel I/O requests, one against each of those containers. The request size to each container is 16 pages (or 256 KB). After receiving several of these requests, the DS8000 recognizes that these DB2 prefetch requests are arriving as sequential accesses, causing the DS8000 sequential prefetch to take effect, which results in all of the disks in all four DS8000 ranks operating concurrently, staging data to the DS8000 cache to satisfy the DB2 prefetch operations.
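As a sketch, the tablespace in this example might be created with DDL similar to the following (the device names are hypothetical; each raw device is assumed to be a DS8000 logical disk on a separate rank, and BP16K is the 16 KB buffer pool from the earlier example):

   CREATE TABLESPACE TS_SALES
      PAGESIZE 16 K
      MANAGED BY DATABASE
      USING (DEVICE '/dev/rvpath4' 1048576,
             DEVICE '/dev/rvpath5' 1048576,
             DEVICE '/dev/rvpath6' 1048576,
             DEVICE '/dev/rvpath7' 1048576)
      EXTENTSIZE 16
      PREFETCHSIZE 64
      BUFFERPOOL BP16K

The container sizes are in pages (1048576 pages of 16 KB equals 16 GB per container). The extent size of 16 pages equals 256 KB, and the prefetch size of 64 pages (4 containers x 16 pages) drives one 256 KB request to each container.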

Page cleaners
Page cleaners are present to make room in the buffer pool before prefetchers read pages on
disk storage and move them into the buffer pool. For example, if a large amount of data has been updated in a table, many data pages in the buffer pool might be updated but not written into disk storage (these pages are called dirty pages). Because prefetchers cannot place fetched data pages onto the dirty pages in the buffer pool, these dirty pages must be flushed to disk storage and become clean pages so that prefetchers can place fetched data pages from disk storage.

Logs
Changes to data pages in the buffer pool are logged. Agent processes updating a data record in the database update the associated page in the buffer pool and write a log record into a log buffer. The written log records in the log buffer will be flushed into the log files asynchronously by the logger. To optimize performance, neither the updated data pages in the buffer pool nor the log records in the log buffer are written to disk immediately. They are written to disk by page cleaners and the logger, respectively. The logger and the buffer pool manager cooperate and ensure that the updated data page is not written to disk storage before its associated log record is written to the log. This behavior ensures that the database manager can obtain enough information from the log to recover and protect a database from being left in an inconsistent state when the database has crashed as a result of an event, such as a power failure.

Parallel operations
DB2 UDB extensively uses parallelism to optimize performance when accessing a database. DB2 supports several types of parallelism, including query and I/O parallelism.

Query parallelism
There are two dimensions of query parallelism: Inter-query parallelism and intra-query parallelism. Inter-query parallelism refers to the ability of multiple applications to query a database at the same time. Each query executes independently of the other queries, but they are all executed at the same time. Intra-query parallelism refers to the simultaneous processing of parts of a single query, using intra-partition parallelism, inter-partition parallelism, or both: Intra-partition parallelism subdivides what is usually considered a single database operation, such as index creation, database loading, or SQL queries, into multiple parts, many or all of which can be run in parallel within a single database partition. Inter-partition parallelism subdivides what is usually considered a single database operation, such as index creation, database loading, or SQL queries, into multiple parts, many or all of which can be run in parallel across multiple partitions of a partitioned database on one machine or on multiple machines. Inter-partition parallelism only applies to DPF.

I/O parallelism
When there are multiple containers for a tablespace, the database manager can exploit parallel I/O. Parallel I/O refers to the process of writing to, or reading from, two or more I/O devices simultaneously. Parallel I/O can result in significant improvements in throughput. DB2 implements a form of data striping by spreading the data in a tablespace across multiple containers. In storage terminology, the part of a stripe that is on a single device is a strip. The DB2 term for strip is extent. If your tablespace has three containers, DB2 will write one extent to container 0, the next extent to container 1, the next extent to container 2, and then back to container 0. The stripe width (a generic term not often used in DB2 literature) is equal to the number of containers, or three in this case. Extent sizes are normally measured in numbers of DB2 pages. Containers for a tablespace are ordinarily placed on separate physical disks, allowing work to be spread across those disks, and allowing disks to operate in parallel. Because the DS8000 logical disks are striped across the rank, the database administrator can allocate DB2 containers on separate logical disks residing on separate DS8000 arrays, which takes advantage of the parallelism both in DB2 and in the DS8000. For example, four DB2 containers residing on four DS8000 logical disks on four different 7+P ranks will have data spread across 32 physical disks.

16.5 DB2 UDB with DS8000 performance recommendations


When using a DS8000, the following generic recommendations will be useful when planning for good DB2 UDB performance. For a more detailed and accurate approach that takes into consideration the particularities of your DB2 UDB environment, contact your IBM representative, who can assist you with the DS8000 capacity and configuration planning.

16.5.1 Know where your data resides


Know where your data resides. Understand how DB2 containers map to DS8000 logical disks, and how those logical disks are distributed across the DS8000 ranks. Spread DB2 data across as many DS8000 ranks as possible.

If you want optimal performance from the DS8000, do not treat it completely like a black box. Establish a storage allocation policy that allocates data using several DS8000 ranks. Understand how DB2 tables map to underlying logical disks, and how the logical disks are allocated across the DS8000 ranks. One way of making this process easier to manage is to maintain a modest number of DS8000 logical disks.

16.5.2 Balance workload across DS8000 resources


Balance workload across DS8000 resources. Establish a storage allocation policy that allows balanced workload activity across RAID arrays. You can take advantage of the inherent balanced activity and parallelism within DB2, spreading the work for DB2 partitions and containers across the DS8000 arrays. This spreading applies to both OLTP and DSS workload types. If you do that, and have planned sufficient resource, many of the other decisions become secondary. Consider the following general recommendations:
- DB2 query parallelism allows workload to be balanced across CPUs and, if DB2 Universal Database Partitioning Feature (DPF) is installed, across data partitions.
- DB2 I/O parallelism allows workload to be balanced across containers.
As a result, you can balance activity across DS8000 resources by following these rules:
- Span DS8000 storage units.
- Span ranks (RAID arrays) within a storage unit.
- Engage as many arrays as possible.
Figure 16-2 illustrates this technique for a single tablespace consisting of eight containers.

Figure 16-2 Allocating DB2 containers using a spread your data approach

In addition, consider that:
- You can intermix data, indexes, and temp spaces on the DS8000 ranks. Your I/O activity is more evenly spread, and thus, you avoid the skew effect, which you otherwise see if the components are isolated.
- For DPF systems, establish a policy that allows partitions and containers within partitions to be spread evenly across DS8000 resources. Alternatively, choose a vertical mapping, in which DB2 partitions are isolated to specific arrays, with containers spread evenly across those arrays.

16.5.3 Use DB2 to stripe across containers


Use the inherent striping of DB2, placing the containers for a tablespace on separate DS8000 logical disks on different DS8000 ranks. This striping eliminates the need for using underlying operating system or logical volume manager striping. Look again at Figure 16-2 on page 498. In this case, we are striping across arrays, disk adapters, clusters, and DS8000s, which can all be done using the striping capabilities of DB2's container and shared nothing concepts. This approach eliminates the need to employ AIX logical volume striping.

16.5.4 Selecting DB2 logical sizes


The three settings in a DB2 system that primarily affect the movement of data to and from the disk subsystem work together. These settings are:
- Page size
- Extent size
- Prefetch size

Page size
Page sizes are defined for each tablespace. There are four supported page sizes: 4 K, 8 K, 16 K, and 32 K. Factors that affect the choice of page size include: The maximum number of records per page is 255. To avoid wasting space on a page, do not make page size greater than 255 times the row size plus the page overhead. The maximum size of a tablespace is proportional to the page size of its tablespace. In SMS, the data and index objects of a table have limits, as shown in Table 16-1. In DMS, these limits apply at the tablespace level.
Table 16-1 Page size relative to tablespace size

Page size    Maximum data/index object size
4 KB         64 GB
8 KB         128 GB
16 KB        256 GB
32 KB        512 GB

Select a page size that can accommodate the total expected growth requirements of the objects in the tablespace.

For OLTP applications that perform random row read and write operations, a smaller page size is preferable, because it wastes less buffer pool space with unwanted rows. For DSS applications that access large numbers of consecutive rows at a time, a larger page size is better, because it reduces the number of I/O requests that are required to read a specific number of rows. Tip: Experience indicates that page size can be dictated to a certain degree by the type of workload. For pure OLTP workloads, we recommend a 4 KB page size. For a pure DSS workload, we recommend a 32 KB page size. For a mixture of OLTP and DSS workload characteristics, we recommend either an 8 KB page size or a 16 KB page size.

Extent size
If you want to stripe across multiple arrays in your DS8000, assign a LUN from each rank to be used as a DB2 container. During writes, DB2 will write one extent to the first container, the next extent to the second container, and so on until all eight containers have been addressed before cycling back to the first container. DB2 stripes across containers at the tablespace level. Because DS8000 stripes at a fairly fine granularity (256 KB), selecting multiples of 256 KB for the extent size makes sure that multiple DS8000 disks are used within a rank when a DB2 prefetch occurs. However, keep your extent size below 1 MB. I/O performance is fairly insensitive to the selection of extent sizes, mostly due to the fact that DS8000 employs sequential detection and prefetch. For example, even if you picked an extent size, such as 128 KB, which is smaller than the full array width (it will involve accessing only four disks in the array), the DS8000 sequential prefetch keeps the other disks in the array busy.

Prefetch size
The tablespace prefetch size determines the degree to which separate containers can operate in parallel. Although larger prefetch values might enhance the throughput of individual queries, mixed applications generally operate best with moderate-sized prefetch and extent parameters. You will want to engage as many arrays as possible in your prefetch to maximize throughput. It is worthwhile to note that the prefetch size is tunable: it can be altered after the tablespace has been defined and data loaded, which is not true for the extent and page sizes, which are set at tablespace creation time and cannot be altered without redefining the tablespace and reloading the data.

Tip: The prefetch size must be set so that as many arrays as desired can be working on behalf of the prefetch request. For other than the DS8000, the general recommendation is to calculate the prefetch size to be equal to a multiple of the extent size times the number of containers in your tablespace. For the DS8000, you can work with a multiple of the extent size times the number of arrays underlying your tablespace.
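Because the prefetch size can be changed after the tablespace is populated, it can be tuned online; for example (the tablespace name and value are illustrative):

   ALTER TABLESPACE TS_SALES PREFETCHSIZE 128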

16.5.5 Selecting the DS8000 logical disk sizes


The DS8000 gives you great flexibility when it comes to disk allocation. This flexibility is particularly helpful, for example, when you need to attach multiple hosts. However, this flexibility can present a challenge as you plan for future requirements.

The DS8000 supports a high degree of parallelism and concurrency on a single logical disk. As a result, a single logical disk the size of an entire array achieves the same performance as many smaller logical disks. However, you must consider how logical disk size affects both the host I/O operations and the complexity of your organization's systems administration. Smaller logical disks provide more granularity, with their associated benefits, but they also increase the number of logical disks seen by the operating system. Select a DS8000 logical disk size that allows for granularity and growth without proliferating the number of logical disks. Take into account your container size and how the containers will map to AIX logical volumes and DS8000 logical disks. In the simplest situation, the container, the AIX logical volume, and the DS8000 logical disk are the same size.

Tip: Try to strike a reasonable balance between flexibility and manageability for your needs. Our general recommendation is that you create no fewer than two logical disks in an array, and that the minimum logical disk size be 16 GB. Unless you have an extremely compelling reason, standardize on a unique logical disk size throughout the DS8000.

Consider the advantages and disadvantages of larger and smaller logical disk sizes:
Advantages of smaller logical disks:
- Easier to allocate storage for different applications and hosts.
- Greater flexibility in performance reporting; for example, PDCU reports statistics for logical disks.
Disadvantages of smaller logical disks:
- Small logical disk sizes can contribute to a proliferation of logical disks, particularly in SAN environments and large configurations.
- Administration gets complex and confusing.
Advantages of larger logical disks:
- Simplifies understanding of how data maps to arrays.
- Reduces the number of resources used by the operating system.
- Storage administration is simpler, thus more efficient, with fewer chances for mistakes.
Disadvantages of larger logical disks:
- Less granular storage administration, resulting in less flexibility in storage allocation.

Examples
Let us assume a 6+P array with 146 GB disk drives. Suppose you wanted to allocate disk space on your 16-array DS8000 as flexibly as possible. You can carve each of the 16 arrays up into 32 GB logical disks or logical unit numbers (LUNs), resulting in 27 logical disks per array (with a little left over). This design yields a total of 16 x 27 = 432 LUNs. Then, you can implement 4-way multipathing, which in turn makes 4 x 432 = 1728 hdisks visible to the operating system. Not only does this approach create an administratively complex situation, but at every reboot, the operating system will query each of those 1728 disks. Reboots might take a long time.

Alternatively, you create just 16 large logical disks. With multipathing and attachment of four Fibre Channel ports, you have 4 x 16 = 128 hdisks visible to the operating system. Although this number is large, it is certainly more manageable, and reboots are much faster. Having overcome that problem, you can then use the operating system logical volume manager to carve this space up into smaller pieces for use. There are problems with this large logical disk approach as well, however. If the DS8000 is connected to multiple hosts or it is on a SAN, disk allocation options are limited when you have so few logical disks. You have to allocate entire arrays to a specific host, and if you wanted to add additional space, you must add it in array-size increments.

16.5.6 Multipathing
Use DS8000 multipathing along with DB2 striping to ensure the balanced use of Fibre Channel paths. Multipathing is the hardware and software support that provides multiple avenues of access to your data from the host computer. You need to provide at least two Fibre Channel paths from the host computer to the DS8000. Paths are defined by the number of host adapters on the DS8000 that service a certain host system's LUNs, the number of Fibre Channel host bus adapters on the host system, and the SAN zoning configuration. The total number of paths ultimately includes consideration for the throughput requirements of the host system. If the host system requires more than 400 MBps (2 x 200 MBps) of throughput, two host bus adapters are not adequate. DS8000 multipathing requires the installation of multipathing software. For AIX, you have two choices: the Subsystem Device Driver Path Control Module (SDDPCM) or the IBM Subsystem Device Driver (SDD); we recommend SDDPCM. We discuss these products in Chapter 11, Performance considerations with UNIX servers on page 307 and Chapter 9, Host attachment on page 265. There are several benefits you receive from using multipathing: higher availability, higher bandwidth, and easier management. A high availability implementation is one in which your application can still access data using an alternate resource if a component fails. Easier management means that the multipathing software automatically balances the workload across the paths.
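To confirm that the workload is actually spread across all configured paths, you can display the per-path statistics of the multipathing driver on AIX (the exact output format varies by driver level):

   datapath query device      (SDD: lists each path and its I/O counts for every vpath)
   pcmpath query device       (SDDPCM: lists each path and its I/O counts for every MPIO hdisk)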

16.6 IMS in a z/OS environment


This section discusses IMS, its logging, and the performance considerations when IMS datasets are placed on the DS8000.

16.6.1 IMS overview


IMS consists of three components: the Transaction Manager (TM) component, the Database Manager (DB) component, and a set of system services that provides common services to the other two components.

IMS Transaction Manager


IMS Transaction Manager provides a network with access to the applications running under IMS. The users can be people at terminals or workstations, or other application programs.

IMS Database Manager


IMS Database Manager provides a central point of control and access to the data that is processed by IMS applications. The Database Manager component of IMS supports databases using the hierarchic database model of IMS. It provides access to the databases from the applications running under the IMS Transaction Manager, the CICS transaction monitor, and z/OS batch jobs. IMS Database Manager provides functions for preserving the integrity of databases and maintaining the databases. It allows multiple tasks to access and update the data, while ensuring the integrity of the data. It also provides functions for reorganizing and restructuring the databases. The IMS databases are organized internally using a number of IMS internal database organization access methods. The database data is stored on disk storage using the normal operating system access methods.

IMS system services


There are a number of functions that are common to the Database Manager and Transaction Manager:
- Restart and recovery of the IMS subsystem after failures
- Security: controlling access to IMS resources
- Managing the application programs: dispatching work, loading application programs, and providing locking services
- Providing diagnostic and performance information
- Providing facilities for the operation of the IMS subsystem
- Providing an interface to other z/OS subsystems that interface with the IMS applications

16.6.2 IMS logging


IMS logging is one of the most write-intensive operations in a database environment. During IMS execution, all information necessary to restart the system in the event of a failure is recorded on a system log dataset. The IMS logs are made up of the following information.

IMS log buffers


The log buffers are used to write the information that needs to be logged.

Online log datasets (OLDS)


The OLDS are datasets that contain all the log records required for restart and recovery. These datasets must be pre-allocated on DASD and will hold the log records until they are archived. The OLDS are made of multiple datasets that are used in a wraparound manner. At least three datasets must be allocated for the OLDS to allow IMS to start, while an upper limit of 100 datasets is supported. Only complete log buffers are written to OLDS in order to enhance performance. If any incomplete buffers need to be written out, they are written to the write ahead datasets (WADS).

Write ahead datasets (WADS)


The WADS is a small direct access dataset that contains a copy of committed log records that are in OLDS buffers, but they have not yet been written to OLDS. When IMS processing requires writing of a partially filled OLDS buffer, a portion of the buffer is written to the WADS. If IMS or the system fails, the log data in the WADS is used to terminate the OLDS, which can be done as part of an emergency restart, or as an option on the IMS Log Recovery Utility. The WADS space is continually reused after the appropriate log data has been written to the OLDS. This dataset is required for all IMS systems, and must be pre-allocated and formatted at IMS startup when first used. When using a DS8000 with Storage Pool Striping, define the WADS volumes as 3390-Mod.1 and allocate them consecutively so that they are allocated to different ranks.
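As a rough sketch only (the storage image ID, extent pool, volume IDs, and volume name are hypothetical, and the exact DSCLI parameters can differ by code level), four 3390 Model 1 volumes (1113 cylinders each) for the WADS might be created consecutively in a multi-rank extent pool so that they land on different ranks:

   dscli> mkckdvol -dev IBM.2107-75ABCDE -extpool P2 -cap 1113 -name ims_wads_#h 0900-0903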

System log datasets (SLDS)


The SLDS is created by the IMS log archive utility, preferably after every OLDS switch. It is usually placed on tape, but it can reside on disk. The SLDS can contain the data from one or more OLDS datasets.

Recovery log datasets (RLDS)


When the IMS log archive utility is run, the user can request creation of an output dataset that contains all of the log records needed for database recovery. This dataset is the RLDS, and it is also known to DBRC. The RLDS is optional.

16.7 DS8000 considerations for IMS


Benefits of using the DS8000 in an IMS environment include:
- IMS takes advantage of the PAV function that allows multiple concurrent I/Os to the same volume at the same time from applications running on a z/OS system image.
- Less disk contention when accessing the same volumes from different systems in an IMS data sharing group, using the Multiple Allegiance function.
- Higher bandwidth on the DS8000 allows higher I/O rates to be handled by the disk subsystem, thus allowing for higher application transaction rates.

16.8 IMS with DS8000 performance recommendations


When using DS8000, the following generic recommendations will be useful when planning for good IMS performance.

16.8.1 Know where your data resides


IMS storage administration can be done using SMS to simplify disk use and control, or also without using SMS. In both cases, it is extremely important that you know where your data resides. If you want optimal performance from the DS8000, do not treat it totally like a black box. Understand how your IMS datasets map to underlying volumes, and how the volumes map to RAID arrays.

16.8.2 Balance workload across DS8000 resources


You can balance workload activity across DS8000 resources by:
- Spreading IMS data across DS8000s if practical
- Spreading IMS data across the servers in each DS8000
- Spreading IMS data across DS8000 device adapters
- Spreading IMS data across as many extent pools/ranks as practical
You can intermix IMS databases and log datasets on DS8000 ranks. The overall I/O activity will be more evenly spread, and I/O skews will be avoided.

16.8.3 Large volumes


With large volume support, which supports up to 65520 cylinders per volume, System z users can allocate the larger capacity volumes in the DS8000. From the DS8000 perspective, the capacity of a volume does not determine its performance. From the z/OS perspective, PAVs reduce or eliminate any additional enqueues that might originate from the increased I/O on the larger volumes. From the storage administration perspective, configurations with larger volumes are simpler to manage. Measurements to determine how large volumes can impact IMS performance have shown that similar response times can be obtained when using larger volumes as when using the smaller 3390-3 standard size volumes. Figure 16-3 illustrates the device response times when using 32 3390-3 volumes compared to four large 3390-27 volumes on an ESS-F20 using FICON channels. Even though the benchmark was performed on an ESS-F20, the results are similar on the DS8000. The results show that with the larger volumes, the response times are similar to the standard size 3390-3 volumes.

Figure 16-3 IMS large volume performance (device response time in msec versus total I/O rate in IO/sec for 32 3390-3 volumes compared to four 3390-27 volumes)

16.8.4 Monitoring DS8000 performance


You can use RMF to monitor the performance of the DS8000. For a detailed discussion, see 15.9, DS8000 performance monitoring tools on page 464.

Chapter 17. Copy Services performance


In this chapter, we discuss the performance-related considerations when implementing Copy Services for the DS8000. Copy Services is a collection of functions provided by the DS8000 that facilitate disaster recovery, data migration, and data duplication functions. Copy Services are optional licensed features that run on the DS8000 Storage Facility Image and support all attached host systems. We review the Copy Services functions and give recommendations about best practices for configuration and performance:
- Copy Services introduction
- FlashCopy
- Metro Mirror
- Global Copy
- Global Mirror
- z/OS Global Mirror
- Metro/Global Mirror


17.1 Copy Services introduction


The DS8000 series offers an array of advanced functions for data backup, remote mirroring, and disaster recovery. The DS8000's advanced two-site and three-site business continuity capabilities provide synchronous and asynchronous data replication for mission critical applications, giving availability when needed during both planned and unplanned system outages. There are two primary types of Copy Services functions: Point-in-Time Copy and Remote Mirror and Copy. Generally, the Point-in-Time Copy functions are used for data duplication, and the Remote Mirror and Copy functions are used for data migration and disaster recovery. Table 17-1 is a reference chart for the Copy Services. The copy operations available for each function are:
Point-in-Time Copy:
- FlashCopy Standard
- FlashCopy Space Efficient (SE)
Remote Mirror and Copy:
- Global Mirror
- Metro Mirror
- Global Copy
- 3-site Metro/Global Mirror with Incremental Resync
- z/OS Global Mirror, previously known as Extended Remote Copy (XRC)
- z/OS Metro/Global Mirror across three sites with Incremental Resync
Table 17-1 Reference chart for DS Copy Services on DS8000

DS8000 function            ESS 800 Version 2 function   Formerly known as
FlashCopy                  FlashCopy                    FlashCopy
FlashCopy SE               N/A                          N/A
Global Mirror              Global Mirror                Asynchronous PPRC
Metro Mirror               Metro Mirror                 Synchronous PPRC
Global Copy                Global Copy                  PPRC Extended Distance
z/OS Global Mirror         z/OS Global Mirror           Extended Remote Copy (XRC)
z/OS Metro/Global Mirror   z/OS Metro/Global Mirror     3-site solution using Sync PPRC and XRC
Metro/Global Mirror        Metro/Global Mirror          2- or 3-site Asynchronous Cascading PPRC

Refer to the Interoperability Matrixes for the DS8000 and ESS to confirm which products are supported on a particular disk subsystem.

Copy Services management interfaces


Copy Services functions can be managed through a number of network and in-band interfaces:
- DSCLI
- DS Storage Manager (DS SM)
- DS Open API
- TotalStorage Productivity Center for Replication

Note: TotalStorage Productivity Center for Replication currently does not manage Global Copy sessions.

In addition to using these methods, there are several possible interfaces available specifically for System z users to manage DS8000 Copy Services relationships. Table 17-2 lists these tools, which are:
- TSO
- ICKDSF
- DFSMSdss
- The ANTRQST macro
- Native TPF commands (for z/TPF only)
Table 17-2 Copy Services management tools

Tool                                               Runs on z/OS   Runs on Open Systems server   Manages CKD   Manages FB
TSO                                                Yes            No                            Yes           Yes (1)
ANTRQST                                            Yes            No                            Yes           Yes (1)
ICKDSF                                             Yes            No                            Yes           No
DSCLI                                              No             Yes                           Yes           Yes
TotalStorage Productivity Center for Replication   Yes            Yes                           Yes           Yes
GDPS                                               Yes            No                            Yes           Yes (1)

(1) A CKD unit address (and host unit control block (UCB)) must be defined in the same DS8000 server against which host I/O can be issued to manage Open Systems (FB) LUNs.

Refer to the following IBM Redbooks publications for detailed information about DS8000 Copy Services:
- IBM System Storage DS8000: Copy Services with IBM System z, SG24-6787
- IBM System Storage DS8000: Copy Services in Open Environments, SG24-6788

17.2 FlashCopy
FlashCopy can help reduce or eliminate planned outages for critical applications. FlashCopy is designed to allow read and write access to the source data and the copy almost immediately following the FlashCopy volume pair establishment. Standard FlashCopy uses a normal volume as the target volume. This target volume must be the same size (or larger) than the source volume, and the space is allocated in the storage subsystem. IBM FlashCopy SE, introduced with Licensed Machine Code 5.3.0.xxx, uses Space Efficient volumes as FlashCopy target volumes. A Space Efficient target volume has a virtual size that is equal to or greater than the source volume size. However, space is not allocated for this volume when the volume is created and the FlashCopy initiated. Only when updates are made to the source volume are the original tracks of the source volume that will be modified copied to the Space Efficient volume. Space in the repository is allocated for just these tracks (or for any write to the target itself).

FlashCopy objectives
The FlashCopy feature of the DS8000 provides you with the capability of making an immediate copy of a logical volume at a specific point in time, which we also refer to as a point-in-time-copy, instantaneous copy, or time zero copy (T0 copy), within a single DS8000 Storage Facility Image. There are several points to consider when you plan to use FlashCopy that might help you minimize any impact that the FlashCopy operation can have on host I/O performance. As Figure 17-1 illustrates, when FlashCopy is invoked, a relationship (or session) is established between the source and target volumes of the FlashCopy pair. This session includes the creation of the necessary bitmaps and metadata information needed to control the copy operation. This FlashCopy establish process completes quickly, at which point: The FlashCopy relationship is fully established. Control returns to the operating system or task that requested the FlashCopy. Both the source volume and its time zero (T0) target volume are available for full read/write access. At this time, a background task within the DS8000 starts copying the tracks from the source to the target volume. Optionally, you can suppress this background copy task using the nocopy option, which is efficient, for example, if you are making a temporary copy just to take a backup to tape. In this case, the FlashCopy relationship remains until explicitly withdrawn. FlashCopy SE can be used in this instance. FlashCopy SE provides most of the function of FlashCopy but can only be used to provide a nocopy copy.

Figure 17-1 FlashCopy: Starting and ending the relationship

The DS8000 keeps track of which data has been copied from source to target. As Figure 17-1 shows, if an application wants to read data from the target that has not yet been copied, the data is read from the source. Otherwise, the read can be satisfied from the target volume.

FlashCopy SE is designed for temporary copies. Copy duration generally does not last longer than 24 hours unless the source and target volumes have little write activity. FlashCopy SE is optimized for use cases where a small percentage of the source volume is updated during the life of the relationship. If much more than 20% of the source is expected to change, there might be trade-offs in terms of performance as opposed to space efficiency. In this case, standard FlashCopy might be the better alternative. FlashCopy has several options, and not all options are available through all user interfaces. It is important right from the beginning to know the purpose for which the target volume will be used afterward. Knowing this purpose, the options to be used with FlashCopy can be identified, and the environment that supports the selected options can be chosen.
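As a simple illustration (the storage image ID and volume IDs are hypothetical), a nocopy relationship used only to take a tape backup can be established and later withdrawn with the DSCLI:

   dscli> mkflash -dev IBM.2107-75ABCDE -nocp 1000:1100
   (run the tape backup from target volume 1100)
   dscli> rmflash -dev IBM.2107-75ABCDE 1000:1100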

17.2.1 FlashCopy performance considerations


Many parameters can affect the performance of FlashCopy operations. It is important to review the data processing requirements and, hence, select the proper FlashCopy options. We examine when to use copy as opposed to no copy and where to place the FlashCopy source and target volumes/LUNs. We also discuss when and how to use incremental FlashCopy, which you definitely need to evaluate for use in most applications. IBM FlashCopy SE has special considerations. We discuss IBM FlashCopy SE in 17.2.2, Performance planning for IBM FlashCopy SE on page 516. The relative performance overheads of FlashCopy and FlashCopy SE are shown in the following charts in Figure 17-2 and Figure 17-3 on page 512. These charts are based on 5.3.1.xxx DS8000 code.

Figure 17-2 FlashCopy with sequential write to source (MB/sec, base compared to FlashCopy and FlashCopy SE, for a single rank and for a full box with RAID 5)


Figure 17-3 FlashCopy with random workloads (IO/sec, base compared to FlashCopy and FlashCopy SE, for a single 6+P rank and for a full box with RAID 5)

Note: This chapter is equally valid for System z volumes and Open Systems LUNs. In the following sections of the present chapter, we use only the terms volume or volumes, but the text is equally valid if the terms LUN and LUNs are used, unless otherwise noted.

Distribution of the workload: Location of source and target volumes


In general, you can achieve the best performance by distributing the load across all of the resources of the DS8000. Carefully plan your usage so that the load is:
- Spread evenly across disk subsystems
- Within each disk subsystem, spread evenly across processor complexes
- Within each server, spread evenly across device adapters
- Within each device adapter, spread evenly across ranks
Refer to Chapter 5, Logical configuration performance considerations on page 63.

It is always best to locate the FlashCopy target volume on the same DS8000 processor complex as the FlashCopy source volume, which allows you to take advantage of code optimization to reduce overhead when source and target are on the same processor complex. It is also good practice to locate the FlashCopy target volume on a different device adapter (DA) than the source volume, particularly when background copy is used. Another available choice is whether to place the FlashCopy target volumes on the same ranks as the FlashCopy source volumes. In general, it is best not to place these two volumes in the same rank. Refer to Table 17-3 for a summary of the volume placement considerations.

Table 17-3 FlashCopy source and target volume location

                                   Processor complex   Device adapter             Rank
FlashCopy establish performance    Same server         Unimportant                Different ranks
Background copy performance        Same server         Different device adapter   Different ranks
FlashCopy impact to applications   Same server         Unimportant                Different ranks

Tip: To find the relative location of your volumes, you can use the following procedure:
1. Use the lsfbvol command to learn which extent pool contains the relevant volumes.
2. Use the lsrank command to display both the device adapter and the rank for each extent pool.
3. To determine which processor complex contains your volumes, look at the extent pool name. Even-numbered extent pools are always from Server 0, while odd-numbered extent pools are always from Server 1.
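For example (the volume IDs are hypothetical), the DSCLI queries in the Tip look similar to the following; the detailed output shows the extent pool for each volume and how the ranks map to arrays and extent pools:

   dscli> lsfbvol -l 1000-100F
   dscli> lsrank -l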

Rank characteristics
Normal performance planning also includes the tasks to select the disk drives (capacity and rpms) and the RAID configurations that best match the performance needs of the applications. Be aware that with FlashCopy nocopy relations, the DS8000 does copy-on-write for each first change to a source volume track. If the disks of the target volume are slower than the disks of the source volume, copy-on-write might slow down production I/O. A full copy FlashCopy produces an extremely high write activity on the disk drives of the target volume. Therefore, it is always a good practice to use target volumes on ranks with the same characteristics as the source volumes. Finally, you can achieve a small performance improvement by using identical rank geometries for both the source and target volumes. In other words, if the source volumes are located on a rank with a 7 + P configuration, the target volumes are also located on a rank configured as 7 + P.

FlashCopy establish performance


The FlashCopy of a volume has two distinct periods:
- The initial logical FlashCopy (also called the establish)
- The physical FlashCopy (also called the background copy)
The FlashCopy establish phase is the period of time when the microcode is preparing things, such as the bitmaps, necessary to create the FlashCopy relationship so the microcode can properly process subsequent reads and writes to the related volumes. It takes only a few seconds to establish the FlashCopy relationships for tens to hundreds or more volume pairs. The copy is then immediately available for both read and write access. During this logical FlashCopy period, no writes are allowed to the source and target volume. However, this period is very short. After the logical relationship has been established, normal I/O activity is allowed to both source and target volumes according to the options selected.


There is a modest performance impact to logical FlashCopy establish performance when using incremental FlashCopy. In the case of incremental FlashCopy, the DS8000 must create additional metadata (bitmaps). However, the impact is negligible in most cases. Finally, the placement of the FlashCopy source and target volumes has an effect on the establish performance. You can refer to the previous section for a discussion of this topic as well as to Table 17-3 on page 513 for a summary of the recommendations.

Background copy performance


The background copy phase is the actual movement of the data from the source volume to the target volume. If the copy option was selected, upon completion of the logical FlashCopy establish phase, the source will be copied to the target in an expedient manner. If a large number of volumes have been established, do not expect to see all pairs actively copying data as soon as their logical FlashCopy relationship is completed. The DS8000 microcode has algorithms that will limit the number of active pairs copying data. This algorithm will try to balance active copy pairs across the DS8000 device adapter resources. Microcode gives higher preference to application activity than copy activity. Figure 17-4 shows the effect on the background copy with I/O present. Tip: When creating many FlashCopy pairs, we recommend that all commands are submitted simultaneously, and you allow the DS8000 microcode to manage the internal resources. If using the DSCLI, we recommend that you use single commands for many devices, rather than many commands each with one device.

Figure 17-4 Background FlashCopy rate
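For example (volume IDs are hypothetical), one DSCLI command can establish several FlashCopy pairs at the same time, letting the microcode schedule the background copies across its internal resources:

   dscli> mkflash -dev IBM.2107-75ABCDE 1000:1100 1001:1101 1002:1102 1003:1103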

The recommended placement of the FlashCopy source and target volumes was discussed in the previous section. Refer to Table 17-3 on page 513 for a summary of the conclusions. For the best background copy performance, always place the source and target volumes in different ranks. There are additional criteria to consider if the FlashCopy is a full box copy that involves all ranks.

Note: The term full box copy implies that all rank resources are involved in the copy process. Either all or nearly all ranks have both source and target volumes, or half the ranks have source volumes and half the ranks have target volumes.

For full box copies, still place the source and target volumes in different ranks. When all ranks are participating in the FlashCopy, you can still place the source and target volumes in different ranks by doing a FlashCopy of volumes on rank R0 onto rank R1 and volumes on rank R1 onto rank R0 (for example). Additionally, if there is heavy application activity in the source rank, performance is less affected if the background copy target was in another rank that has lighter application activity.

Note: If Storage Pool Striping is used when allocating volumes, all ranks will be more or less equally busy. Therefore, there is less need to be so concerned about the placement of data, but make sure to still keep the source and the target on the same processor complex. If background copy performance is of high importance in your environment, use incremental FlashCopy as much as possible. Incremental FlashCopy will greatly reduce the amount of data that needs to be copied and, therefore, greatly reduces the background copy time. If the FlashCopy relationship was established with the -nocp (no copy) option, only write updates to the source volume will force a copy from the source to the target. This forced copy is also called a copy-on-write. Note: The term copy-on-write describes a forced copy from the source to the target, because a write to the source has occurred. This forced copy occurs on the first write to a track. Note that because the DS8000 writes to nonvolatile cache, there is typically no direct response time delay on host writes. A write to the source will result in a copy of the track.

FlashCopy impact on applications


One of the most important considerations when implementing FlashCopy is to achieve an implementation that has minimal impact on the performance of the users' applications.

Note: As already mentioned, the recommendations discussed in this chapter only consider the performance aspects of a FlashCopy implementation. But FlashCopy performance is only one aspect of an intelligent system design. You must consider all business requirements when designing a total solution. These additional requirements, together with the performance considerations, will guide you when choosing FlashCopy options, such as copy or no copy and incremental, as well as when making choices about source and target volume location.

The placement of the source and target volumes has a significant impact on application performance, as we discussed in Distribution of the workload: Location of source and target volumes on page 512. In addition to the placement of volumes, the selection of copy or no copy is also an important consideration in regard to impact to application performance. Typically, the choice of copy or no copy depends primarily on how the FlashCopy will be used and for what interval of time the FlashCopy relationship exists. From a purely performance point of view, the choice of whether to use copy or no copy depends a great deal on the type of workload. The general answer is to use no copy, but this choice is not always the best choice. For most workloads, including online transaction processing (OLTP) workloads, no copy typically is the preferred option. However, workloads that contain a large number of random writes and are not cache friendly might benefit from using the copy option.

FlashCopy nocopy
In a FlashCopy nocopy relationship, a copy-on-write is done whenever a write to a source track occurs for the first time after the FlashCopy was established. This type of FlashCopy is ideal when the target volumes are needed for a short time only, for example, to run the backup jobs. FlashCopy nocopy puts only a minimum additional workload on the back-end adapters and disk drives. However, it affects most of the writes to the source volumes as long as the relationship exists. When you plan to keep your target volumes for a long time, this choice might not be the best solution.

FlashCopy full copy


When you plan to use the target volumes for a longer time, or you plan to use them for production and you do not plan to repeat the FlashCopy often, then the full copy FlashCopy will be the right choice. A full copy FlashCopy puts a high additional workload on the back-end device adapters and disk drives. But this high additional workload lasts only for a few minutes or hours, depending on the capacity. After that, there is no additional overhead any more.

Incremental FlashCopy
Another important performance consideration is whether to use incremental FlashCopy. Use incremental FlashCopy when you perform FlashCopies always to the same target volumes on regular time intervals. The first FlashCopy will be a full copy, but subsequent FlashCopy operations will copy only the tracks of the source volume that had been modified since the last FlashCopy. Incremental FlashCopy has the least impact on applications. During normal operation, no copy-on-write is done (as in a nocopy relation), and during a resync, the load on the back end is much lower compared to a full copy. There is only a very small overhead for the maintenance of out-of-sync bitmaps for the source and target volumes. Note: The incremental FlashCopy resyncflash command does not have a -nocp (no copy) option. Using resyncflash will automatically use the copy option, regardless of whether the original FlashCopy was copy or no copy.
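A minimal DSCLI sketch of incremental FlashCopy (volume IDs are hypothetical): the initial establish enables change recording and persistence, and each later refresh copies only the tracks changed since the previous FlashCopy:

   dscli> mkflash -dev IBM.2107-75ABCDE -record -persist 1000:1100
   (later, refresh the target)
   dscli> resyncflash -dev IBM.2107-75ABCDE 1000:1100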

Choosing the correct FlashCopy type


The type of copy required depends on the purpose for which the copy is made. FlashCopy nocopy is typically the best choice to minimize rank and DA activity within the physical DS8000. The choice depends on what the copy will be used for:
- Is the copy only going to be used for creating a tape backup? If yes, use nocopy, and the relationship is withdrawn after the tape backup is complete.
- Is the copy going to be used for testing or development? If yes, nocopy again is typically the best choice.
- Will you need a copy of the copy? Use background copy so that the target will be withdrawn from its relationship after all of the tracks are copied, thereby allowing it to be a source in a new relationship. Possibly use the nocopy to copy option.
- Is the workload OLTP? If yes, nocopy typically is the best choice. Or, if there are a large number of random writes and they are not cache friendly, copy might be the best choice.

17.2.2 Performance planning for IBM FlashCopy SE


FlashCopy SE has additional overhead compared to standard FlashCopy. Data from source volumes is copied to Space Efficient target volumes. Actually, the data is written to a repository, and there is a mapping mechanism to map the physical tracks to the logical tracks. Refer to Figure 17-5. Each time that a track in the repository is accessed, it has to go through this mapping process. Consequently, the attributes of the volume hosting the repository are important when planning a FlashCopy SE environment.

Figure 17-5 Updates to source volumes in an IBM FlashCopy SE relationship

Because of space efficiency, data is not physically ordered in the same sequence on the repository disks as it is on the source. Processes that might access the source data in a sequential manner might not benefit from sequential processing when accessing the target. Another important consideration for FlashCopy SE is that we always have nocopy relationships. A full copy or incremental copy is not possible.

If there are many source volumes that have targets in the same extent pool, all updates to these source volumes cause write activity to this one extent pool's repository. We can consider a repository as something similar to a volume. So, we have writes to many source volumes being copied to just one volume, the repository. Where a dedicated extent pool is defined specifically for use as a FlashCopy SE repository, there will be less space in the repository than the total capacity (sum) of the source volumes. You might be tempted to use fewer disk spindles or disk drive modules (DDMs) for this extent pool. By definition, fewer spindles mean less performance, and so careful planning is needed to achieve the required throughput and response times from the Space Efficient volumes. A good strategy is to keep the number of spindles roughly equivalent but just use smaller, faster drives (but do not use Fibre Channel Advanced Technology Attachment (FATA) or Serial Advanced Technology Attachment (SATA) drives). For example, if your source volumes are 300 GB 15K rpm disks, using 146 GB 15K rpm disks on the repository can provide both space efficiency and excellent repository performance.

Another possibility is to consider RAID 10 for the repository, although that goes somewhat against space efficiency (you might be better off using standard FlashCopy with RAID 5 than SE with RAID 10). However, there might be cases where trading off some of the space efficiency gains for a performance boost justifies RAID 10. Certainly, if RAID 10 is used at the source, consider it for the repository (note that the repository will always use striping when in a multi-rank extent pool).

Note: There is no advantage in using RAID 6 for the repository other than resilience. Only consider it where RAID 6 is used as the standard throughout the DS8000.

Storage Pool Striping has good synergy with the repository function. With Storage Pool Striping, the repository space is striped across multiple RAID arrays in an extent pool, which helps to balance the volume skew that might appear on the sources. It is generally best to use four RAID arrays in the multi-rank extent pool intended to hold the repository, and no more than eight RAID arrays. Finally, as mentioned before, try to use at least the same number of disk spindles on the repository as on the source volumes. Avoid severe fan-in configurations, such as 32 ranks of source disk being mapped to an eight-rank repository. This type of configuration will likely have performance problems unless the update rate to the source is extremely modest.

It is possible to share the repository with production volumes on the same extent pool, but use caution when doing so, because contention between the repository and the production volumes can impact performance. In this case, the repository can be placed in a different extent pool from the source volumes so that source and target volumes are on different ranks but on the same processor complex.

To summarize, we can expect a high random write workload for the repository. To prevent the repository from becoming overloaded, you can take the following precautions:
- Have the repository in an extent pool with several ranks (a repository is always striped).
- Use fast 15K rpm disk drives for the repository ranks.
- Consider using RAID 10 instead of RAID 5, because RAID 10 can sustain a higher random write workload.
- Avoid placing standard source volumes and repository target volumes in the same extent pool.

Because FlashCopy SE does not need a lot of capacity if your update rate is not too high, you might want to make several FlashCopies from the same source volume. For example, you might want to make a FlashCopy several times a day to set checkpoints to protect your data against viruses or for other reasons. Of course, creating more than one FlashCopy SE relationship for a source volume will increase the overhead, because each first change to a source volume track has to be copied several times, once for each FlashCopy SE relationship. Therefore, keep the number of concurrent FlashCopy SE relationships to a minimum, or test how many relationships you can have without affecting your application performance too much. Note that from a performance standpoint, avoiding multiple relationships also applies to standard FlashCopy. There are no restrictions on the amount of virtual space or the number of SE volumes that can be defined for either z/OS or Open Systems storage.
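As a sketch of the setup (the storage image ID, extent pool, capacities, and volume IDs are hypothetical, and the exact parameters depend on your DSCLI level), the repository and the Space Efficient target volumes are created before the FlashCopy SE relationships are established:

   dscli> mksestg -dev IBM.2107-75ABCDE -extpool P4 -repcap 500 -vircap 2000
   dscli> mkfbvol -dev IBM.2107-75ABCDE -extpool P4 -cap 100 -sam tse 1100-110F

Here, 500 GB of real repository capacity backs 2000 GB of virtual capacity, and the 16 target volumes are created as track space efficient (tse) volumes.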

17.3 Metro Mirror


Metro Mirror provides real-time mirroring of logical volumes between two DS8000s that can be located up to 300 km (186.4 miles) (more distance is supported via RPQ) from each other. It is a synchronous copy solution where write operations are completed on both copies before they are considered to be complete.


It is typically used for applications that cannot suffer any data loss in the event of a failure. As data is transferred synchronously, the distance between primary and secondary disk subsystems will determine the effect on application response time. Figure 17-6 illustrates the sequence of a write update with Metro Mirror.

Figure 17-6 Metro Mirror sequence

When the application performs a write update operation to a primary volume, this process happens:
1. Write to primary volume (DS8000 cache)
2. Write to secondary (DS8000 cache)
3. Signal write complete on the secondary DS8000
4. Post I/O complete to host server

The Fibre Channel connection between primary and secondary subsystems can be direct, through a Fibre Channel SAN switch, via a SAN router using Fibre Channel over Internet Protocol (FCIP), or through other supported distance solutions, such as Dense Wave Division Multiplexing (DWDM).

17.3.1 Metro Mirror configuration considerations


Metro Mirror pairs are set up between volumes, usually in different disk subsystems, which are normally in separate locations. To establish a Metro Mirror pair, there must be a Metro Mirror path between the logical subsystems (LSSs) in which the volumes reside. These paths can be shared by any Metro Mirror pairs in the same LSS to the secondary LSS in the same direction. A path must be explicitly defined in the reverse direction if required, but it can utilize the same Fibre Channel link. For bandwidth and redundancy, more than one path, up to a maximum of eight paths, can be created between the same LSSs. Metro Mirror will balance the workload across the available paths between the primary and secondary. The logical Metro Mirror paths are transported over physical links between the disk subsystems. The physical link includes the host adapter in the primary DS8000, the cabling, switches or directors, any wide band or long distance transport devices (DWDM, channel extenders, or WAN), and the host adapters in the secondary disk subsystem. Physical links can carry multiple logical Metro Mirror paths as shown in Figure 17-7 on page 521.

Metro Mirror Fibre Channel links


A DS8000 Fibre Channel port can simultaneously be:
- Sender for a Metro Mirror primary
- Receiver for a Metro Mirror secondary
- Target for Fibre Channel Protocol (FCP) host I/O

Although one Fibre Channel (FC) link has sufficient bandwidth for most Metro Mirror environments, for redundancy reasons, we recommend configuring at least two Fibre Channel links between each primary and secondary disk subsystem. For better performance, use as many as the supported maximum of eight links. These links must take diverse routes between the DS8000 locations. Metro Mirror Fibre Channel links can be direct-connection or connected by up to two switches.

Dedicating Fibre Channel ports for Metro Mirror use guarantees no interference from host I/O activity. We recommend dedicated ports with Metro Mirror, because it is time-critical and must not be impacted by host I/O activity. The Metro Mirror ports that are used provide connectivity for all LSSs within the DS8000 and can carry multiple logical Metro Mirror paths.

Distance
The distance between your primary and secondary DS8000 subsystems will have an effect on the response time overhead of the Metro Mirror implementation. Note that with the requirement of diverse connections for availability, it is common to have certain paths that are longer distance than others. Contact your IBM Field Technical Sales Specialist (FTSS) to assist you in assessing your configuration and the distance implications if necessary. The maximum supported distance for Metro Mirror is 300 km (186.4 miles). There is approximately a 1 ms overhead per 100 km (62 miles) for write I/Os (this relation between latency and physical distance might be different when using a WAN). The DS8000 Interoperability Matrix gives details of SAN, network, and DWDM supported devices. Distances of over 300 km (186.4 miles) are possible and are supported by RPQ. Due to network configuration variability, the client must work with the channel extender vendor to determine the appropriate configuration to meet their requirements.

Logical paths
A Metro Mirror logical path is a logical connection between the sending LSS and the receiving LSS. An FC link can accommodate multiple Metro Mirror logical paths. Figure 17-7 on page 521 shows an example where we have a 1:1 mapping of source to target LSSs, and where the three logical paths are accommodated in one Metro Mirror link:
- LSS1 in DS8000 1 to LSS1 in DS8000 2
- LSS2 in DS8000 1 to LSS2 in DS8000 2
- LSS3 in DS8000 1 to LSS3 in DS8000 2

Alternatively, if the volumes in each of the LSSs of DS8000 1 map to volumes in all three secondary LSSs in DS8000 2, there will be nine logical paths over the Metro Mirror link (not fully illustrated in Figure 17-7 on page 521). Note that we recommend a 1:1 LSS mapping.
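As a sketch of how such a 1:1 path is defined with the DSCLI, the following commands first list the usable I/O port pairs and then create the path; the storage image ID, WWNN, and port IDs are purely hypothetical:

   # Which port pairs can carry a path from LSS 01 to LSS 01 on the secondary?
   lsavailpprcport -remotewwnn 5005076303FFC663 01:01
   # Define the Metro Mirror path for the 1:1 LSS mapping over two physical links
   mkpprcpath -remotedev IBM.2107-75XYZ02 -remotewwnn 5005076303FFC663 -srclss 01 -tgtlss 01 I0010:I0110 I0040:I0140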


Figure 17-7 Logical paths for Metro Mirror
Metro Mirror links have certain architectural limits, which include:
- A primary LSS can maintain paths to a maximum of four secondary LSSs. Each secondary LSS can reside in a separate DS8000.
- Up to eight logical paths per LSS-to-LSS relationship can be defined. Each Metro Mirror path requires a separate physical link.
- An FC port can host up to 2048 logical paths. These paths are the logical and directional paths that are made from LSS to LSS.
- An FC path (the physical path from one port to another port) can host up to 256 logical paths (Metro Mirror paths).
- An FC port can accommodate up to 126 different physical paths (DS8000 port to DS8000 port through the SAN).

For Metro Mirror, consistency requirements are managed through use of the consistency group or Critical Mode option when you are defining Metro Mirror paths between pairs of LSSs. Volumes or LUNs, which are paired between two LSSs whose paths are defined with the consistency group option, can be considered part of a consistency group. Consistency is provided by means of the extended long busy (for z/OS) or queue full (for Open Systems) condition. These conditions are triggered when the DS8000 detects a condition where it cannot update the Metro Mirror secondary volume. The volume pair that first detects the error will go into the extended long busy or queue full condition, so that it will not perform any I/O. For z/OS, a system message will be issued (IEA494I state change message); for Open Systems, an SNMP trap message will be issued. These messages can be used as a trigger for automation purposes to provide data consistency by use of the Freeze/Run (or Unfreeze) commands.

Metro Mirror itself does not offer a means of controlling this scenario; it offers the consistency group and Critical attributes, which, along with appropriate automation solutions, can manage data consistency and integrity at the remote site. The Metro Mirror volume pairs are always consistent, due to the synchronous nature of Metro Mirror. However, cross-volume or cross-LSS data consistency must have an external management method. IBM offers TotalStorage Productivity Center for Replication to deliver solutions in this area.

Note: The consistency group and Critical attributes must not be set for any PPRC paths or LSSs unless there is an automation solution in place to manage the freeze and run commands. If it is set and there is no automation tool to manage a freeze, the host systems will lose access to the volumes until the timeout occurs at 300 seconds.


Bandwidth
Prior to establishing your Metro Mirror solution, you must determine what your peak bandwidth requirement will be. Determining your peak bandwidth requirement will help to ensure that you have enough Metro Mirror links in place to support that requirement.

To avoid any response time issues, establish the peak write rate for your systems and ensure that you have adequate bandwidth to cope with this load and to allow for growth. Remember that only writes are mirrored across to the target volumes after synchronization. There are tools to assist you, such as TotalStorage Productivity Center (TPC) or operating system-dependent tools, such as iostat. Another method, although not quite so exact, is to monitor the traffic over the FC switches using FC switch tools and other management tools, again remembering that only writes will be mirrored by Metro Mirror. You can also get an idea of the proportion of reads to writes by issuing datapath query devstats on SDD-attached servers.

A single 2 Gb Fibre Channel link can provide approximately 200 MBps throughput for the Metro Mirror establish. This capability scales up linearly with additional links up to six links. The maximum of eight links for an LSS pair provides a throughput of approximately 1400 MBps.

Note: A minimum of two links is recommended between each DS8000 pair for resilience. The remaining capacity with a failed link is capable of maintaining synchronization.
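To quantify the read/write split, the host tools mentioned above can be sampled directly; the invocations below are generic examples (the output itself varies by installation), and the SDD statistics include cumulative read and write I/O counters per vpath device from which a rough read-to-write ratio can be derived:

   # SDD cumulative I/O statistics per vpath device (read and write counts)
   datapath query devstats
   # Sample per-disk throughput every 30 seconds, 10 samples (AIX example)
   iostat -d 30 10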

LSS design
Because the DS8000 has made the LSS a topological construct, which is not tied to a physical array as in the ESS, the design of your LSS layout can be simplified. It is now possible to assign LSSs to applications, for example, without concern about the under-allocation or the over-allocation of physical disk subsystem resources. Assigning LSSs to applications can also simplify the Metro Mirror environment, because it is possible to reduce the number of commands that are required for data consistency.

Symmetrical configuration
As an aid to planning and management of your Metro Mirror environment, we recommend that you maintain a symmetrical configuration in terms of both physical and logical elements. As well as making the maintenance of the Metro Mirror configuration easier, a symmetrical configuration has the added benefit of helping to balance the workload across the DS8000. Figure 17-8 on page 523 shows a logical configuration; the idea applies equally to the physical aspects of the DS8000. You need to attempt to balance workload and apply symmetrical concepts to all aspects of your DS8000, which has the following benefits:
- Ensure even performance: The secondary site volumes must be created on ranks with DDMs of the same capacity and speed as the primary site.
- Simplify management: It is easy to see where volumes will be mirrored, and processes can be automated.
- Reduce administrator overhead: There is less administrator overhead due to automation and the simpler nature of the solution.
- Ease the addition of new capacity into the environment: New arrays can be added in a modular fashion.
- Ease problem diagnosis: The simple structure of the solution will aid in identifying where any problems might exist.


Figure 17-8 on page 523 shows this idea in graphical form. DS8000 #1 has Metro Mirror paths defined to DS8000 #2, which is in a remote location. On DS8000 #1, volumes defined in LSS 00 are mirrored to volumes in LSS 00 on DS8000 #2 (volume P1 is paired with volume S1, P2 with S2, P3 with S3, and so on). Volumes in LSS 01 on DS8000 #1 are mirrored to volumes in LSS 01 on DS8000 #2, and so on. Requirements for additional capacity can also be met in a symmetrical way by the addition of volumes into existing LSSs, and by the addition of new LSSs when needed (for example, the addition of two volumes in LSS 03 and LSS 05 and one volume in LSS 04 will bring them to the same number of volumes as the other LSSs). Additional volumes can then be distributed evenly across all LSSs, or additional LSSs can be added.

Figure 17-8 Symmetrical Metro Mirror configuration

Consider an asymmetrical configuration where the primary site has volumes defined on ranks comprised of 146 GB DDMs. The secondary site has ranks comprised of 300 GB DDMs. Because the capacity of the destination ranks is double that of the source ranks, it seems feasible to define twice as many LSSs per rank on the destination side. However, this situation, where four primary LSSs on four ranks were feeding into four secondary LSSs on two ranks, creates a performance bottleneck on the secondary rank and slows down the entire Metro Mirror process.

Volumes
You will need to consider which volumes to mirror to the secondary site. One option is to mirror all volumes, which is advantageous for the following reasons:
- You will not need to consider whether any required data has been missed.
- Users will not need to remember which logical pool of volumes is mirrored and which is not.
- The addition of volumes to the environment is simplified; you will not have two processes for the addition of disk (one process for mirrored volumes and another process for non-mirrored volumes).
- You will be able to move data around your disk environment easily without concern about whether the target volume is a mirrored volume or not.


Note: Consider the bandwidth that you need to mirror all volumes. The amount of bandwidth might not be an issue if there are many volumes with a low write I/O rate. Review data from TotalStorage Productivity Center for Disk if it is available.

You can choose not to mirror all volumes (for example, swap devices for Open Systems or temporary work volumes for z/OS can be omitted). In this case, you will need careful control over what data is placed on the mirrored volumes (to avoid any capacity issues) and what data is placed on the non-mirrored volumes (to avoid missing any required data). You can place all mirrored volumes in a particular set of LSSs, in which all volumes have Metro Mirror enabled, and direct all data requiring mirroring to these volumes.

For testing purposes, additional volumes can be configured at the remote site. These volumes can be used to take a FlashCopy of a consistent Metro Mirror image on the secondary volume and then allow the synchronous copy to restart while testing is performed.

Figure 17-9 Metro Mirror environment for testing

To create a consistent copy for testing, the host I/O needs to be quiesced, or automation code, such as Geographically Dispersed Parallel Sysplex (GDPS) or TotalStorage Productivity Center for Replication, needs to be used to create a consistency group on the primary disks so that all dependent writes are copied to the secondary disks.
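Where no automation product is in place, the following DSCLI sketch outlines one way to produce such a consistent test copy for paths that were defined with the consistency group option. The storage image IDs, LSS and volume IDs, WWNN, and port pair are hypothetical, and the sequence is deliberately simplified:

   # Freeze the LSS pair: the paths are removed, the pairs suspend, and writes are held
   freezepprc -dev IBM.2107-75ABC01 -remotedev IBM.2107-75XYZ02 01:01
   # Release the held write I/Os as soon as the freeze has completed on all LSS pairs
   unfreezepprc -dev IBM.2107-75ABC01 -remotedev IBM.2107-75XYZ02 01:01
   # The suspended secondaries are now consistent; FlashCopy them to the tertiary test volumes
   mkflash -dev IBM.2107-75XYZ02 -nocp 0100:0200 0101:0201
   # Redefine the path and resynchronize the pairs while testing proceeds on the tertiary copies
   mkpprcpath -remotedev IBM.2107-75XYZ02 -remotewwnn 5005076303FFC663 -srclss 01 -tgtlss 01 -consistgrp I0010:I0110
   resumepprc -dev IBM.2107-75ABC01 -remotedev IBM.2107-75XYZ02 -type mmir 0100:0100 0101:0101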

17.3.2 Metro Mirror performance considerations


Basic considerations when designing an infrastructure to support a Metro Mirror environment:
- The process of getting the primary and secondary Metro Mirror volumes into a synchronized state is called the initial establish. Each link I/O port will provide a maximum throughput. Copying multiple LUNs in the initial establish will quickly saturate the links; the combined rate is referred to as the aggregate copy rate and is dependent primarily on the number of links, or bandwidth, between sites. It is important to understand this copy rate to have a realistic expectation about how long the initial establish will take to complete. Use Global Copy during the initial copy or for resynchronization to minimize the impact of performing both synchronous replication and the bulk copy at the same time.
- Production I/O will be given priority over DS8000 replication I/O activity. High production I/O activity will negatively affect both initial establish data rates and synchronous copy data rates.
- We recommend that you do not share the Metro Mirror link I/O ports with host attachment ports. If you share them, it might cause unpredictable Metro Mirror performance and a much more complicated search in case of performance problems.
- Distance is an important value for both the initial establish data rate and synchronous write performance. Data must go to the other site, and the acknowledgement goes back. Add possible latency times of certain active components on the way. A good rule is to calculate 1 ms of additional response time per 100 km (62 miles) of site separation for a write I/O. Distance also affects the establish data rate.
- Know your workload characteristics. Factors, such as blocksize, read/write ratio, and random or sequential processing, are all key Metro Mirror performance considerations.
- Monitor link performance to determine if links are becoming overutilized. Use tools, such as TotalStorage Productivity Center and Resource Measurement Facility (RMF), or review SAN switch statistics.

Testing Metro Mirror performance has been done for the following configuration with different workload patterns (Figure 17-10). This test is an example of the performance that you can expect. Use these results for guidance only:
- Primary: DS8300 Turbo, eight DA pairs, 512 73 GB/15K rpm disks, and RAID 10
- Secondary: DS8300, eight DA pairs, 512 73 GB/15K rpm disks, and RAID 10
- PPRC links: 2 Gb
- Host I/Os are driven from the AIX host (AIX 5.3.0.40) with 8 host paths
- PPRC distance simulation:
  - Four links
  - Channel extender: CNT Edge Router (1 Gb FC adapter and 1 Gb Ethernet network interface)
  - Distance simulator: Empirix PacketSphere (10 microseconds bidirectional = 1 km (0.62 miles))

Figure 17-10 70/30/50 Metro Mirror with channel extender


Note: 70/30/50 is an open workload where read/write ratio = 2.33, read hits = 50%, destage rate = 17.2%, and transfer size = 4 K.

In this instance, it shows the minimal impact of Metro Mirror on I/O response times at a distance between sites of 50 km (31 miles), although it does fall off more sharply and at a lower I/O rate than with no copy running.

17.3.3 Scalability
The DS8000 Metro Mirror environment can be scaled up or down as required. If new volumes are added to the DS8000 that require mirroring, they can be dynamically added. If additional Metro Mirror paths are required, they also can be dynamically added.

Note: The mkpprcpath command is used to add Metro Mirror paths. If paths are already established for the LSS pair, they must be included in the mkpprcpath command together with any additional path, or the existing paths will be removed.
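For example, if a path already exists over port pair I0010:I0110 and a second physical link is to be added, both port pairs must appear in the command (the storage image ID, WWNN, and port IDs are hypothetical):

   # Re-issue the path definition with the existing and the new port pair
   mkpprcpath -remotedev IBM.2107-75XYZ02 -remotewwnn 5005076303FFC663 -srclss 01 -tgtlss 01 I0010:I0110 I0040:I0140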

Adding capacity to the same DS8000


If you are adding capacity to an existing DS8000, providing that your Metro Mirror link bandwidth is not close to or over capacity, it is possible that you only have to add volume pairs to your configuration. If you are adding more LSSs, you must define Metro Mirror paths before adding volume pairs.
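A hypothetical example of adding new volume pairs to an existing LSS pair with the DSCLI (the volume and storage image IDs are illustrative only):

   # Establish the additional pairs; -type mmir selects synchronous Metro Mirror mode
   mkpprc -dev IBM.2107-75ABC01 -remotedev IBM.2107-75XYZ02 -type mmir 0104:0104 0105:0105
   # Check that the new pairs reach the Full Duplex state
   lspprc -dev IBM.2107-75ABC01 0104-0105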

Adding capacity in new DS8000s


If you are adding new DS8000s to your configuration, you must add physical Metro Mirror links before defining your Metro Mirror paths and volume pairs. We recommend a minimum of two Metro Mirror paths per DS8000 pair for redundancy reasons. Your bandwidth analysis will indicate if you require more than two paths.

17.4 Global Copy


Global Copy is an asynchronous remote copy function for z/OS and Open Systems for greater distances than are possible with Metro Mirror. With Global Copy, write operations complete on the primary storage system before they are received by the secondary storage system. This capability is designed to prevent the primary system's performance from being affected by wait time from writes on the secondary system. Therefore, the primary and secondary copies can be separated by any distance. This function is appropriate for remote data migration, off-site backups, and the transmission of inactive database logs at virtually unlimited distances. Refer to Figure 17-11 on page 527.


Figure 17-11 Global Copy

In Figure 17-11:
1. The host server requests a write I/O to the primary DS8000. The write is staged through cache and nonvolatile storage (NVS).
2. The write returns to the host server's application.
3. A few moments later, in a nonsynchronous manner, the primary DS8000 sends the necessary data so that the updates are reflected on the secondary volumes. The updates are grouped in batches for efficient transmission. Note also that if the data is still in cache, only the changed sectors are sent. If the data is no longer in cache, the full track is read from disk.
4. The secondary DS8000 returns write completed to the primary DS8000 when the updates are secured in the secondary DS8000 cache and NVS. The primary DS8000 then resets its change recording information.

The primary volume remains in the Copy Pending state while the Global Copy session is active. This status only changes if a command is issued or the links between the storage subsystems are lost.

17.4.1 Global Copy configuration considerations


The requirements for establishing a Global Copy relationship are essentially the same as for Metro Mirror as described in 17.3.1, Metro Mirror configuration considerations on page 519. A path must be established between the source LSS and target LSS over a Fibre Channel link. The major difference is the distance over which Global Copy can operate. Because it is a nonsynchronous copy, the distance is effectively unlimited. Consistency must be manually created by the user.


Note: The consistency group is not specified on the establish path command. Data on Global Copy secondaries is not consistent so there is no need to maintain the order of dependent writes.

The decision about when to use Global Copy can depend on a number of factors, such as:
- The recovery of the system does not need to be current with the primary application system.
- There is a minor impact to application write I/O operations at the primary location.
- The recovery uses copies of data created by the user on tertiary volumes.
- Distances beyond ESCON limits or FCP limits are required: 103 km (64 miles) for ESCON links and 300 km (186 miles) for FCP links (RPQ for greater distances).

You can use Global Copy as a tool to migrate data between data centers.
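A hypothetical DSCLI fragment for setting up Global Copy pairs once the physical links exist (all IDs, WWNN, and port pairs are illustrative):

   # Define the path; no consistency group option is used for Global Copy
   mkpprcpath -remotedev IBM.2107-75XYZ02 -remotewwnn 5005076303FFC663 -srclss 02 -tgtlss 02 I0010:I0110
   # Establish the pairs in Global Copy (extended distance) mode
   mkpprc -dev IBM.2107-75ABC01 -remotedev IBM.2107-75XYZ02 -type gcp 0200:0200 0201:0201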

Distance
The maximum (supported) distance for a direct Fibre Channel connection is 10 km (6.2 miles). If you want to use Global Copy over longer distances, you can use the following connectivity technologies to extend this distance:
- Fibre Channel routers using Fibre Channel over Internet Protocol (FCIP)
- Channel extenders over Wide Area Network (WAN) lines
- Dense Wavelength Division Multiplexers (DWDM) on fiber

Global Copy Fibre Channel extender support


Channel extender vendors connect DS8000 systems through a variety of Wide Area Network (WAN) connections, including Fibre Channel, Ethernet/IP, ATM-OC3, and T1/T3. When using channel extender products with Global Copy, the channel extender vendor will determine the maximum distance supported between the primary and secondary DS8000. You must contact the channel extender vendor for their distance capability, line quality requirements, and WAN attachment capabilities. A complete and current list of Global Copy supported environments, configurations, networks, and products is available in the DS8000 Interoperability Matrix. You must contact the channel extender vendor regarding hardware and software prerequisites when using their products in a DS8000 Global Copy configuration. Evaluation, qualification, approval, and support of Global Copy configurations using channel extender products is the sole responsibility of the channel extender vendor.

Global Copy Wave Division Multiplexer support


Wavelength Division Multiplexing (WDM) and Dense Wavelength Division Multiplexing (DWDM) are the basic technologies of fiber optic networking. DWDM is a technique for carrying many separate and independent optical channels on a single fiber. A simple way to envision DWDM is to consider that at the primary end, multiple fiber optic input channels, such as ESCON, Fibre Channel, FICON, or Gbit Ethernet, are combined by the DWDM into a single fiber optic cable. Each channel is encoded as light of a different wavelength. You might think of each individual channel as an individual color; the DWDM system is transmitting a rainbow. At the receiving end, the DWDM fans out the different optical channels. DWDM, by the very nature of its operation, provides the full bandwidth capability of the individual channel. Because the wavelength of light is, from a practical perspective, infinitely divisible, DWDM technology is only limited by the sensitivity of its receptors for the total possible aggregate bandwidth.

A complete and current list of Global Copy supported environments, configurations, networks, and products is available in the DS8000 Interoperability Matrix. You must contact the multiplexer vendor regarding hardware and software prerequisites when using the vendor's products in a DS8000 Global Copy configuration.

Other planning considerations


When planning to use Global Copy for point-in-time backup solutions as shown in Figure 17-12, you must also consider the configuration of a second volume at the secondary site. If you plan to have tertiary copies, you must have available a set of volumes ready to become the FlashCopy target within the target Storage Facility Image. If your next step is to dump the tertiary volumes onto tapes, you must ensure that the tape resources are capable of handling these dump operations in between the point-in-time checkpoints unless you have additional sets of volumes ready to become alternate FlashCopy targets within the secondary storage subsystems.

Figure 17-12 Global Copy environment for testing

The user creates consistent data by following these steps:
1. Quiesce I/O.
2. Suspend the pairs (go-to-sync and suspend). FREEZE can be used; extended long busy will not be returned to the server, because the consistency group option was not specified on the establish path.
3. FlashCopy secondary to tertiary. The tertiary will have consistent data.
4. Reestablish paths (if necessary).
5. RESYNC (resumepprc) Global Copy.
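A minimal DSCLI sketch of these steps, assuming that host I/O can be fully quiesced and drained; the device and volume IDs are hypothetical, and in practice an automation product usually drives this sequence:

   # 1. Quiesce application I/O at the primary host (outside the DSCLI).
   # 2. Wait until the out-of-sync track count reaches zero, then suspend the pairs
   lspprc -dev IBM.2107-75ABC01 -l 0200:0200
   pausepprc -dev IBM.2107-75ABC01 -remotedev IBM.2107-75XYZ02 0200:0200
   # 3. FlashCopy the now consistent secondary to the tertiary at the remote site
   mkflash -dev IBM.2107-75XYZ02 -nocp 0200:0300
   # 4./5. Resume application I/O and resynchronize Global Copy
   resumepprc -dev IBM.2107-75ABC01 -remotedev IBM.2107-75XYZ02 -type gcp 0200:0200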

17.4.2 Global Copy performance consideration


As the distance between DS8000s increases, Metro Mirror response time is proportionally affected, which negatively impacts the application performance. When you need implementations over extended distances, Global Copy becomes an excellent trade-off solution.

You can estimate the Global Copy application impact as that of the application when working with Metro Mirror suspended volumes. For the DS8000, there is additional work to do with the Global Copy volumes compared to the suspended volumes, because with Global Copy, the changes have to be sent to the remote DS8000. But this impact is negligible overhead for the application compared with the typical synchronous overhead. There are no host system resources consumed by Global Copy volume pairs, excluding any management solution, because Global Copy is managed by the DS8000 subsystem.

If you take a FlashCopy at the recovery site in your Global Copy implementation, consider the influence between Global Copy and the FlashCopy background copy. If you use FlashCopy with the nocopy option at the recovery site, when the Global Copy target receives an update, the track on the FlashCopy source, which is also the Global Copy target, has to be copied to the FlashCopy target before the data transfer operation completes. This copy operation to the FlashCopy target can complete by using the DS8000 cache and NVS without waiting for a physical write to the FlashCopy target. However, this data movement can influence the Global Copy activity. So, when considering the network bandwidth, consider that the FlashCopy effect on the Global Copy activity might in fact decrease the bandwidth utilization during certain intervals.

17.4.3 Scalability
The DS8000 Global Copy environment can be scaled up or down as required. If new volumes that require mirroring are added to the DS8000, they can be dynamically added. If additional Global Copy paths are required, they also can be dynamically added.

Addition of capacity
As we have previously mentioned, the logical nature of the LSS has made a Global Copy implementation on the DS8000 easier to plan, implement, and manage. However, if you need to add more LSSs to your Global Copy environment, your management and automation solutions must be set up to add this capacity.

Adding capacity to the same DS8000


If you add capacity into an existing DS8000, providing your Global Copy link bandwidth is not close to or over capacity, you might only need to add volume pairs into your configuration. If you add more LSSs, you will need to define Metro Mirror paths before adding volume pairs. Remember that when you add capacity that you want to use for Global Copy, you might also have to purchase capacity upgrades to the appropriate feature code for Global Copy.

Adding capacity in new DS8000s


If you add new DS8000s into your configuration, you will need to add physical Global Copy links prior to defining your Global Copy paths and volume pairs. We recommend a minimum of two Global Copy paths per DS8000 pair for redundancy reasons. Your bandwidth analysis indicates whether you require more than two paths.

17.5 Global Mirror


Global Mirror copying provides a two-site extended distance remote mirroring function for z/OS and Open Systems servers by combining and coordinating Global Copy and FlashCopy operations (Figure 17-13 on page 531). With Global Mirror, the data that the host writes to the storage unit at the local site is asynchronously copied to the storage unit at the remote site. A consistent copy of the data is then periodically and automatically maintained on the storage unit at the remote site by forming a consistency group at the local site, and subsequently creating a tertiary copy of the data at the remote site with FlashCopy. This two-site data mirroring function is designed to provide a high performance, cost-effective, global distance data replication and disaster recovery solution.

Global Mirror objectives


The goal of Global Mirror is to provide:
- Capability to achieve a recovery point objective (RPO) of 3 to 5 seconds with sufficient bandwidth and resources.
- No impact on production applications when insufficient bandwidth and/or resources are available.
- Scalability, providing consistency across multiple primary and secondary disk subsystems.
- Allowance for removal of duplicate writes within a consistency group before sending data to the remote site.
- Allowance for less than peak bandwidth to be configured by allowing the RPO to increase without restriction at these times.
- Consistency between System z and Open Systems data and between different platforms on Open Systems.

Figure 17-13 Global Mirror overview


The DS8000 manages the sequence to create a consistent copy at the remote site (Figure 17-14):
- Asynchronous long distance copy (Global Copy) with little to no impact to application writes.
- Momentarily pause application writes (a fraction of a millisecond to a few milliseconds).
- Create a point-in-time consistency group across all primary subsystems in the out-of-sync (OOS) bitmap. New updates are saved in the Change Recording bitmap.
- Restart application writes and complete the write (drain) of point-in-time consistent data to the remote site.
- Stop the drain of data from the primary after all consistent data has been copied to the secondary.
- Logically FlashCopy all data to the C volumes to preserve consistent data.
- Restart Global Copy writes from the primary.
- Automatically repeat this sequence every few seconds, minutes, or hours (this choice is selectable and can be immediate).

Figure 17-14 How Global Mirror works (automatic cycle in an active Global Mirror session: 1. create a consistency group of volumes at the local site; 2. send the increment of consistent data to the remote site; 3. FlashCopy at the remote site; 4. resume Global Copy, copying out-of-sync data only; 5. repeat all the steps according to the defined time period)

The data at the remote site is current within 3 to 5 seconds, but this recovery point (RPO) depends on the workload and bandwidth available to the remote site.

Note: The copy created with the consistency group is a power-fail consistent copy, not necessarily an application-based consistent copy. When you use this copy for recovery, you might need to perform additional recovery operations, such as the fsck command in an AIX filesystem.

This section discusses performance aspects when planning and configuring for Global Mirror together with the potential impact to application write I/Os caused by the process used to form a consistency group.


We also consider distributing the target Global Copy and target FlashCopy volumes across various ranks to balance load over the entire target storage server and minimize the I/O load for selected busy volumes.
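For orientation, the overall DSCLI flow to bring up a small, single-LSS Global Mirror session looks roughly like the following sketch. All storage image IDs, volume IDs, LSS 10, and session number 01 are hypothetical, and the complete procedure (including subordinate storage servers) is described in the Copy Services Redbooks referenced later in this chapter:

   # A->B Global Copy pairs over previously defined PPRC paths
   mkpprc -dev IBM.2107-75ABC01 -remotedev IBM.2107-75XYZ02 -type gcp 1000:1000 1001:1001
   # B->C FlashCopy at the remote site with the attributes that Global Mirror requires
   mkflash -dev IBM.2107-75XYZ02 -record -persist -nocp -tgtinhibit 1000:1100 1001:1101
   # Define the Global Mirror session on the primary LSS and add the A volumes to it
   mksession -dev IBM.2107-75ABC01 -lss 10 01
   chsession -dev IBM.2107-75ABC01 -lss 10 -action add -volume 1000-1001 01
   # Start the Global Mirror master; consistency groups now form automatically
   mkgmir -dev IBM.2107-75ABC01 -lss 10 -session 01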

17.5.1 Global Mirror performance considerations


Global Mirror is composed of Global Copy and FlashCopy functions and combines both functions to create a distributed solution that will provide consistent data at a remote site. We analyze the performance at the production site and at the recovery site, as well as between both sites, with the objective of providing a stable RPO without significantly impacting production:
- At the production site, where production I/O always has a higher priority over DS8000 replication I/O activity, the storage server needs resources to handle both loads. If your primary storage server is already overloaded with production I/O, the potential delay before a consistency group can be formed might become unacceptable.
- The bandwidth between both sites needs to be sized for production load peaks. This sizing needs to allow for the loss of a link and still maintain the desired RPO.
- At the recovery site, even if there is no local production I/O workload, the recovery site hosts the target Global Copy volumes, handles the inherent FlashCopy processing, and needs its performance evaluated.

The performance of the Global Mirror session, the storage subsystems, and the links between subsystems must be monitored to collect data to allow performance issues to be investigated. There are a number of tools that can assist with these tasks:
- TotalStorage Productivity Center can collect data from storage subsystems and SAN switches, which gives details of the utilization of the individual components, such as I/O rates for individual volumes.
- RMF data can be collected for z/OS systems and analyzed using RMF Magic, which is available from IntelliMagic on the Web at:
  http://www.intellimagic.net/en/index.phtml?p=Home
- DiskMagic, also available from IntelliMagic, can provide planning information on proposed workloads or additional workloads and give estimates on the inter-site bandwidth required.
- The Global Mirror Monitor is a tool available from IBM Field Technical Sales Specialists (FTSS), which provides information about the consistency group interval and details of the out-of-sync (OOS) tracks.
- TotalStorage Productivity Center for Replication can monitor the status of the session and send SNMP alerts for:
  - Session state change
  - Configuration change
  - Suspending-event notification
  - Communication failure
  - High-availability state change
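The DSCLI itself can also be queried for the session state and consistency group statistics; for example (the device and master LSS IDs are hypothetical, and the availability of the -metrics output depends on the code level):

   # Show the Global Mirror state and current settings for the master LSS
   showgmir -dev IBM.2107-75ABC01 10
   # Show cumulative statistics, such as successful and failed consistency groups
   showgmir -dev IBM.2107-75ABC01 -metrics 10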

Primary site DS8000 performance


Configure the primary DS8000 according to the recommendations made in Chapter 5, Logical configuration performance considerations on page 63. A balanced configuration that makes full use of the internal DS8000 resources gives the most consistent performance. If the primary DS8000 is already configured, you can measure the current performance using tools, such as TotalStorage Productivity Center for Disk, which allows you to fix any performance bottlenecks caused by the configuration before the remote copy is established.

The PPRC links need to use dedicated DS8000 host adapter ports to avoid any conflict with host I/O. If any subordinate storage subsystems are included in the Global Mirror session, the FC links to those subsystems must also use dedicated host adapter ports.

Global Mirror Fibre Channel links


The links between primary and secondary sites require bandwidth sufficient to maintain the desired RPO. The goal is typically to provide the synchronization of thousands of volumes on multiple primary and secondary storage subsystems with an RPO of 3 to 5 seconds. The cost of providing these links can be high if they are sized to provide this RPO under all circumstances. If it is acceptable to allow the RPO to increase slightly during the highest workload times, you might be able to reduce the bandwidth and hence the cost of the link significantly. In many instances, the highest write rate occurs overnight during backup processing, and increased RPO can be tolerated. The difference in bandwidth and, hence, costs for maintaining an RPO of a few seconds might be double that of maintaining an RPO of a few minutes at peak times. Recovery to the latest consistency group is immediate once the peak passes; there is no catch-up time. Refer to the Global Mirror Whitepaper for further details and examples, which you can obtain at: http://www.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/WP100642

FlashCopy in Global Mirror


This section looks at the aggregate impact of Global Copy and FlashCopy in the overall performance of Global Mirror. Remember that Global Copy itself has minimal or no significant impact on the response time of an application write I/O to a Global Copy primary volume. The FlashCopy used as an integral part of this Global Copy operation is running in nocopy mode and will cause additional internally triggered I/Os within the target storage server for each write I/O to the FlashCopy source volume, that is, the Global Copy target volume. This I/O is to preserve the last consistency group. Each Global Copy write to its secondary volume during the time period between the formation of successive consistency groups causes an actual FlashCopy write I/O operation on the target DS8000 server, which we describe in Figure 17-15 where we summarize approximately what happens between two consistency group creation points when the application writes are received.

Figure 17-15 Global Copy with write hit at the remote site


The steps of the FlashCopy write I/O operation on the target DS8000 server are (refer to Figure 17-15):
1. The application write I/O completes immediately to volume A1 at the local site.
2. Global Copy nonsynchronously replicates the application I/O and reads the data at the local site to send to the remote site.
3. The modified track is written across the link to the remote B1 volume.
4. FlashCopy nocopy sees that the track is about to change.
5. The track is written to the C1 volume before the write to the B1 volume.

This process is an approximation of the sequence of internal I/O events. There are optimization and consolidation effects, which make the entire process quite efficient. Figure 17-15 showed the normal sequence of I/Os within a Global Mirror configuration. The critical path is between points (2) and (3). Usually (3) is simply a write hit in NVS in B1, and some time later and after (3) completes, the original FlashCopy source track is copied from B1 to C1. If NVS is overcommitted in the secondary storage server, there is a potential impact on the performance of the Global Copy data replication operations. Refer to Figure 17-16.

Figure 17-16 Application write I/O within two consistency group points

Figure 17-16 summarizes roughly what happens when NVS in the remote storage server is overcommitted. A read (3) and a write (4) to preserve the source track and write it to the C volume are required before the write (5) can complete. Eventually, the track gets updated on the B1 volume to complete the write (5). But usually, all writes are quick writes to cache and persistent memory and happen in the order outlined in Figure 17-14 on page 532. You can obtain a more detailed explanation of this processing in IBM System Storage DS8000: Copy Services with IBM System z, SG24-6787, and IBM System Storage DS8000: Copy Services in Open Environments, SG24-6788.

17.5.2 Global Mirror Session parameters


Global Mirror has three tunable values to modify its behavior.

Note: In most environments, use the default values. These default values are maximum intervals, and in practice, the actual interval will usually be shorter.


Maximum Coordination Time


The maximum coordination time is the maximum time that Global Mirror will allow for the determination of a consistent set of data before failing this consistency group. Having this cut-off ensures that even if there is an error recovery event or communications problem, the production applications will not experience significant impact from consistency group formation. The default for the maximum coordination time is 50 ms, which is an extremely small value compared to other I/O timeout values, such as MIH (30 seconds) or Small Computer System Interface (SCSI) I/O timeouts. Hence, even in error situations where we might trigger this timeout, Global Mirror will protect production performance rather than impacting production in an attempt to form consistency groups in a time where there might be error recovery or other problems occurring.

Performance considerations at coordination time


When looking at the three phases that Global Mirror goes through to create a set of data consistent volumes at the secondary site, the first question that comes to mind is whether the coordination window imposes an impact to the application write I/O. Refer to Figure 17-17.

Maximum

Maximum

coordination
time Serialize all Global Copy primary volumes

drain
time Perform FlashCopy

Drain data from local to remote site

Write
Hold write I/O
A1
Primary

B1

I/O

PPRC path(s)

Secondary

C1
Tertiary

A2
Primary

B2
Secondary

C2
Tertiary

Local site

Remote Site

Figure 17-17 Coordination time and how it impacts application write I/Os

The coordination time, which you can limit by specifying a number of milliseconds, is the maximum impact to an application's write I/Os that you allow when forming a consistency group. The intention is to keep the coordination time value as small as possible. The default of 50 ms might be high in a transaction processing environment. A valid number might also be in the single digit range. The required communication between the Master storage server and potential Subordinate storage servers is in-band over PPRC paths between the Master and Subordinates. This communication is highly optimized and allows you to minimize the potential application write I/O impact to 3 ms, for example. There must be at least one PPRC FC link between a Master storage server and each Subordinate storage server, although for redundancy we recommend that you use two PPRC FC links.

One of the key design objectives for Global Mirror is to not impact the production applications. The consistency group formation process involves the holding of production write activity in order to create dependent write consistency across multiple devices and multiple disk subsystems. This process must therefore be fast enough that the impact is extremely small. With Global Mirror, the process of forming a consistency group is designed to take 1 to 3 ms. If we form consistency groups every 3 to 5 seconds, the percentage of production writes impacted and the degree of impact are therefore very small.

The following example shows the type of impact that might be seen from consistency group formation in a Global Mirror environment. We assume that we are doing 24,000 I/Os per second with a 3:1 read/write ratio. We perform 6,000 write I/Os per second, each write I/O takes 0.5 ms, and it takes 3 ms to create a consistent set of data. Approximately 0.0035 x 6000 = 21 write I/Os are affected by the creation of consistency. If each of these 21 I/Os experiences a 3 ms delay, and this delay happens every 3 seconds, we have an average response time (RT) delay of (21 x 0.003)/18000 = 0.0035 ms. A 0.0035 ms average impact to a 0.5 ms write is a 0.7% increase in response time, and normal performance reporting tools will not detect this level of impact.

Maximum drain time


The maximum drain time is the maximum amount of time that Global Mirror will spend draining a consistency group out-of-sync bitmap before failing the consistency group. If the maximum drain time is exceeded, Global Mirror will transition to Global Copy mode for a period of time in order to catch up in the most efficient manner. While in Global Copy mode, the overhead will be lower than continually trying and failing to create consistency groups. The previous consistency group will still be available on the C devices so the effect of this situation will simply be that the RPO increases for a short period. The primary disk subsystem will evaluate when it is possible to continue to form consistency groups and will restart consistency group formation at this time. The default for the maximum drain time is 30 seconds, which allows a reasonable time to send a consistency group while ensuring that if there is a non-fatal network or communications issue that we do not wait too long before evaluating the situation and potentially dropping into Global Copy mode until the situation is resolved. In this way, we again protect the production performance rather than attempting (and possibly failing) to form consistency groups at a time when forming consistency groups might not be appropriate. If we are unable to form consistency groups for 30 minutes, by default, Global Mirror will form a consistency group without regard to the maximum drain time. It is possible to change this time if this behavior is not desirable in a particular environment.

Consistency group drain time


This drain period is the time required to replicate all remaining data for the consistency group from the primary to the secondary storage server. This drain period needs to fall within a time limit set by maximum drain time, which can also be limited. The default is 30 seconds, and this drain period might be too short in an environment with a write-intensive workload. The actual replication process usually does not impact the application write I/O. There is a slight chance that the very same track within a consistency group is updated before this track is replicated to the secondary site within the specified drain time period. When this unlikely event happens, the affected track is immediately (synchronously) replicated to the secondary storage server before the application write I/O modifies the original track. In this exceptional case, the application write I/O is impacted, because it must wait for the write to complete at the remote site as in a Metro Mirror synchronous configuration.

Note that further subsequent writes to this very same track do not experience any delay, because the track has already been replicated to the remote site.

Consistency group Interval


The consistency group interval is the amount of time that Global Mirror will spend in Global Copy mode between the formation of each consistency group. The effect of increasing this value will be to increase the RPO and can increase the efficiency of the bandwidth utilization by increasing the number of duplicate updates that occur between consistency groups that then do not need to be sent from the primary to the secondary disk subsystems. However, because it also increases the time between successive FlashCopies, increasing this value is not necessary and might be counterproductive in high-bandwidth environments, because frequent consistency group formation will reduce the overhead of Copy on Write processing. The default for the consistency group Interval is 0 seconds so Global Mirror will continuously form consistency groups as fast as the environment will allow. In most situations, we recommend leaving this parameter at the default and allowing Global Mirror to form consistency groups as fast as possible given the workload, because Global Mirror will automatically move to Global Copy mode for a period of time if the drain time is exceeded.
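These values are specified when the Global Mirror master is started, and the defaults apply when they are omitted. A hypothetical example that sets a 3 ms coordination window, a 30-second maximum drain time, and a 0-second consistency group interval (the option names and units should be checked against the DSCLI documentation for your code level):

   # Start Global Mirror with explicit tunable values; showgmir displays the values in effect
   mkgmir -dev IBM.2107-75ABC01 -lss 10 -session 01 -coordinate 3 -drain 30 -cginterval 0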

17.5.3 Avoid unbalanced configurations


When the load distribution is unknown in your configuration, consider using TotalStorage Productivity Center for Disk or RMF to gather information about rank and volume utilization. There are two loads to consider: the production load on the production site and the Global Mirror load on both sites.

There are only production volumes on the production site. Configure the DS8000 for the best performance as we discussed in Chapter 5, Logical configuration performance considerations on page 63. At the same time, the storage server needs to be able to handle both production and replication workloads. In general, create a balanced configuration that makes use of all device adapters, ranks, and processor complexes.

At the recovery site, you only have to consider Global Mirror volumes, but there are two types: target Global Copy and target FlashCopy. Where the B volume is used for production in a failover situation, the DDM size can be double that of the production site DDM size. Global Mirror will still give an identical number of spindles and capacity, because the FlashCopy volume will not be in use in this situation. Where a fourth volume is used to facilitate disaster recovery testing without removing the Global Mirror copy facility for the duration of the test, only 50% more drives are required if you use double capacity DDMs.

Remote DS8000 configuration


There will be I/O skews and hot spots in storage servers for both the local and remote storage servers. In local storage servers, consider a horizontal pooling approach and spread each volume type across all ranks. Volume types in this context are, for example, DB2 database volumes, logging volumes, batch volumes, temporary work volumes, and so on. Your goal can be to have the same number of each volume type within each rank. Through a one-to-one mapping from local to remote storage server, you achieve the same configuration at the remote site for the B volumes and the C volumes. Figure 17-18 on page 539 proposes to spread the B and C volumes across different ranks at the remote storage server so that the FlashCopy target is on a different rank than the FlashCopy source.


Figure 17-18 Remote storage server configuration: All ranks contain equal numbers of volumes

The goal is to put the same number of each volume type into each rank. The volume types that we discuss here refer to B volumes and C volumes within a Global Mirror configuration. In order to avoid performance bottlenecks, spread busy volumes over multiple ranks. Otherwise, hot spots can be concentrated on single ranks when you put the B and C volumes on the same rank. We recommend spreading B and C volumes as Figure 17-18 suggests.

With mixed DDM capacities and different speeds at the remote storage server, consider spreading B volumes not only over the fast DDMs but over all ranks. Basically, follow a similar approach as Figure 17-18 recommends. You might keep particularly busy B volumes and C volumes on the faster DDMs. If the DDMs used at the remote site are double the capacity but the same speed as those DDMs used at the production site, an equal number of ranks can be formed. In a failover situation when the B volume is used for production, it will provide the same performance as the production site, because the C volume will not then be in use.

Note: Keep the FlashCopy target C volume on the same processor complex as the FlashCopy source B volume.

Figure 17-19 on page 540 introduces the D volumes.


Figure 17-19 Remote storage server with D volumes

Figure 17-19 shows, besides the three Global Mirror volumes, the addition of D volumes that you can create for test purposes. Here we suggest, as an alternative, a rank with larger and perhaps slower DDMs. The D volumes can be read from another host, and any other I/O to the D volumes does not impact the Global Mirror volumes in the other ranks. Note that a nocopy relationship between B and D volumes will read the data from B when coming through the D volume. So, you might consider a physical COPY when you create D volumes on a different rank, which will separate additional I/O to the D volumes from I/O to the ranks with the B volumes.

If you plan to use the D volumes as the production volumes at the remote site in a failover situation, the D volume ranks must be configured in the same way as the A volume ranks and use identical DDMs. You must make a full copy to the D volume for both testing and failover. When using TotalStorage Productivity Center for Replication, the Copy Sets for Global Mirror Failover/Failback w/ Practice are defined in this way. The TotalStorage Productivity Center for Replication volume definitions are:
- A volume defined as H1 volume (Host site 1)
- B volume defined as I2 volume (Intermediate site 2)
- C volume defined as J2 volume (Journal site 2)
- D volume defined as H2 volume (Host site 2)

The use of FlashCopy SE for the C volume requires a different configuration again. These volumes are physically allocated in a data repository. A repository volume per extent pool is used to provide physical storage for all Space Efficient volumes in that extent pool. Figure 17-20 on page 541 shows an example of a Global Mirror setup with FlashCopy SE. In this example, the FlashCopy targets use a common repository.


Figure 17-20 Remote disk subsystem with Space Efficient FlashCopy target volumes

FlashCopy SE is optimized for use cases where less than 20% of the source volume is updated during the life of the relationship. In most cases, Global Mirror is configured to create consistency groups at an interval of a few seconds, which means that only a small amount of data is copied to the FlashCopy targets between consistency groups. From this point of view, Global Mirror is a well-suited application for FlashCopy SE. In contrast, standard FlashCopy will generally deliver better performance than FlashCopy SE.

The FlashCopy SE repository is performance critical. When provisioning a repository, Storage Pool Striping (SPS) is automatically used with a multi-rank extent pool to balance the load across the available disks. In general, we recommend that the extent pool contain a minimum of four RAID arrays. Depending on the logical configuration of the DS8000, you might also consider using multiple Space Efficient repositories for the FlashCopy target volumes in a Global Mirror environment, at least one on each processor complex. Note that the repository extent pool can also contain additional non-repository volumes, but contention can arise if the extent pool is shared.

After the repository is defined, you cannot expand it, so it is important to plan its size carefully to make sure that it will be large enough. If the repository fills, the FlashCopy SE relationships will fail, and Global Mirror will no longer be able to successfully create consistency groups.
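The repository itself is created with the DSCLI before any Space Efficient target volumes are allocated. The following minimal sketch assumes a hypothetical storage image ID and extent pool ID, and the capacities shown are illustrative only; size the physical repository capacity (repcap) and virtual capacity (vircap) for your own change rate.

# Create a Space Efficient repository in multi-rank extent pool P4
mksestg -dev IBM.2107-75ABCD1 -extpool P4 -repcap 500 -vircap 2000

# Check repository allocation over time
showsestg -dev IBM.2107-75ABCD1 P4

Monitoring the allocated repository capacity regularly helps you avoid the repository-full condition described above.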

17.5.4 Growth within Global Mirror configurations


When a session is active and running, you can alter the Global Mirror environment to add or remove volumes. You can also add storage disk subsystems to a Global Mirror session, or you can change the interval between the formation of consistency groups. When a large number of volumes are used with Global Mirror, it is important that you configure sufficient cache memory to provide the best possible overall function and performance. To accommodate the additional data structures that are required to efficiently run a Global Mirror environment, Table 17-4 on page 542 lists the maximum number of Global Mirror volume relationships that are currently recommended on a DS8000 as a function of the secondary subsystem cache size. This table also applies to the tertiary DS8000 in a Metro/Global Mirror (MGM) relationship.

Effective with the availability of DS8000 Release 3.0 licensed internal code (LIC), the maximum Global Mirror volume relationships on a secondary DS8000 subsystem have been increased three-fold. Table 17-4 includes the old and new values.
Table 17-4 Global Mirror secondary volume guidelines

Disk subsystem/cache    Recommended maximum devices    Recommended maximum devices
                        prior to R3.0                  R3.0 or later
ESS M800                1000                           1000
DS8000/16 GB            1500                           4500
DS8000/32 GB            1500                           4500
DS8000/64 GB            3000                           9000
DS8000/128 GB           6000                           18000
DS8000/256 GB           12000                          36000

Note: The recommendations are based solely on the number of Global Mirror volume relationships; the capacity of the volumes is irrelevant. One way to avoid exceeding these recommendations is to use fewer, larger volumes with Global Mirror.

Adding to or removing volumes from the Global Mirror session


Volumes can be added to the session at any time after the session number is defined to the LSS where the volumes reside. After the session is started, volumes can be added to or removed from the session at any time. Volumes can be added to a session in any state, for example, simplex or pending. Volumes that have not completed their initial copy phase stay in a join pending state until the first initial copy is complete. If a volume in a session is suspended, it causes consistency group formation to fail.

We recommend that you add only Global Copy source volumes that have completed their initial copy or first pass, although the microcode itself stops volumes from joining the Global Mirror session until the first pass is complete. Also, we recommend that you wait until the first initial copy is complete before you create the FlashCopy relationship between the B and the C volumes.

Note: You cannot add a Metro Mirror source volume to a Global Mirror session. Global Mirror supports only Global Copy pairs. When Global Mirror detects a volume that, for example, has been converted from Global Copy to Metro Mirror, the next formation of a consistency group will fail.

When you add a rather large number of volumes at one time to an existing Global Mirror session, the available Global Copy resources within the affected ranks can be consumed by the initial copy pass. To minimize the impact to the production servers when adding many new volumes, consider adding the new volumes to an existing Global Mirror session in stages. If you use TotalStorage Productivity Center for Replication, the new copy sets can be added using the GUI or Copy Services Manager CLI, which manage the copy so that a volume is not added to the session before the Global Copy first pass has completed.
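The following DSCLI sketch illustrates the recommended sequence; the storage image IDs, LSS, session number, and volume ranges are hypothetical. The sequence is to establish the Global Copy pairs, verify that the first pass has completed, create the B-to-C FlashCopy relationships, and only then add the A volumes to the Global Mirror session.

# Establish Global Copy pairs A -> B
mkpprc -dev IBM.2107-7512341 -remotedev IBM.2107-7556781 -type gcp 1000-1003:2000-2003

# Wait until the First Pass Status column shows True for all pairs
lspprc -dev IBM.2107-7512341 -l 1000-1003

# Create the B -> C FlashCopy relationships at the remote site
mkflash -dev IBM.2107-7556781 -record -persist -nocp 2000-2003:2100-2103

# Add the A volumes to the existing Global Mirror session 01 on LSS 10
chsession -dev IBM.2107-7512341 -lss 10 -action add -volume 1000-1003 01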


Suspending a Global Copy pair that belongs to an active Global Mirror session impacts the formation of consistency groups. When you intend to remove Global Copy volumes from an active Global Mirror session, follow these steps:
1. Remove the desired volumes from the Global Mirror session.
2. Withdraw the FlashCopy relationship between the B and C volumes.
3. Terminate the Global Copy pair to bring volume A and volume B into simplex mode.

Note: When you remove A volumes without pausing Global Mirror, you might see this situation reflected as an error condition with the showgmir -metrics command, indicating that the consistency group formation failed. However, this error condition does not mean that you have lost a consistent copy at the remote site, because Global Mirror does not take the FlashCopy (B to C) for the failed consistency group data. This message indicates that just one consistency group formation has failed, and Global Mirror will retry the sequence.
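Using the same hypothetical IDs as in the previous sketch, the removal steps map to DSCLI commands along these lines:

# 1. Remove the A volumes from Global Mirror session 01
chsession -dev IBM.2107-7512341 -lss 10 -action remove -volume 1000-1003 01

# 2. Withdraw the B -> C FlashCopy relationships at the remote site
rmflash -dev IBM.2107-7556781 2000-2003:2100-2103

# 3. Terminate the Global Copy pairs to return A and B to simplex
rmpprc -dev IBM.2107-7512341 -remotedev IBM.2107-7556781 1000-1003:2000-2003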

Adding or removing storage disk subsystems or LSSs


When you plan to add a new subordinate storage disk subsystem to an active session, you have to stop the session first. Then, add the new subordinate storage disk subsystem and start the session again. The session start command then contains the new subordinate storage disk subsystem. The same procedure applies when you remove a storage disk subsystem from a Global Mirror session, which can be a subordinate only. In other words, you cannot remove the master storage disk subsystem.

When you add a new LSS to an active session, and this LSS belongs to a storage disk subsystem that already has another LSS in this Global Mirror session, you can add the new LSS to the session without stopping and starting the session again. This is true for either the master or a subordinate storage disk subsystem.

If you use TotalStorage Productivity Center for Replication, the new subsystem can be added using the GUI or Copy Services Manager CLI. The paths must then be added for the new LSS pairs. TotalStorage Productivity Center for Replication adds only one path if the paths are not already defined. The copy sets can then be added to the session after the new subsystem is recognized by TotalStorage Productivity Center for Replication.

Note: When using TotalStorage Productivity Center for Replication to manage Copy Services, do not use the DSCLI to make any configuration changes. Only make changes with the Copy Services Manager CLI (CSM CLI) or the TotalStorage Productivity Center for Replication GUI.

17.6 z/OS Global Mirror


z/OS Global Mirror (zGM, formerly known as XRC) is a remote data mirroring function available for the z/OS and OS/390 operating systems. It involves a host-based System Data Mover (SDM) that is a component of OS/390 and z/OS. z/OS Global Mirror maintains a copy of the data asynchronously at a remote location and can be implemented over unlimited distances. It is a combined hardware and software solution offering data integrity and data availability that can be used as part of business continuance solutions, for workload movement, and for data migration. The z/OS Global Mirror function is an optional function known as Remote Mirror for z/OS. The DS8000 function code is RMZ. For a schematic overview of z/OS Global Mirror processing, refer to Figure 17-21 on page 544.


Figure 17-21 z/OS Global Mirror data flow

Figure 17-21 illustrates a simplified view of the z/OS Global Mirror components and the data flow logic. When a z/OS Global Mirror pair is established, the host system's DFSMSdfp software starts to time-stamp all subsequent write I/Os to the primary volumes, which provides the basis for managing data consistency across multiple LCUs. If these primary volumes are shared by systems running on different CECs, an IBM Sysplex Timer is required to provide a common time reference for these time stamps. If all the primary systems are running in different LPARs within the same CEC, the system time-of-day clock can be used. z/OS Global Mirror is implemented in a cooperative way between the DS8000s at the primary site and the DFSMSdfp host system software component System Data Mover (SDM). The logic for the data flow is (refer to Figure 17-21):
1. The primary system writes to the primary volumes.
2. The application I/O operation is signalled complete when the data is written to primary DS8000 cache and NVS, which is when channel end and device end are returned to the primary system. The application write I/O operation has now completed, and the updated data will be mirrored asynchronously according to the following steps.
3. The DS8000 groups the updates into record sets, which are asynchronously off-loaded from the cache to the SDM system. Because z/OS Global Mirror uses this asynchronous copy technique, there is no performance impact on the primary application's I/O operations.
4. The record sets, perhaps from multiple primary storage subsystems, are processed into consistency groups (CGs) by the SDM. The CG contains records that have their order of update preserved across multiple LCUs within a DS8000, across multiple DS8000s, and across other storage subsystems participating in the same z/OS Global Mirror session. This preservation of order is absolutely vital for dependent write I/Os, such as databases and their logs. The creation of CGs guarantees that z/OS Global Mirror copies data to the secondary site with update sequence integrity.

5. When a CG is formed, it is written from the SDM real storage buffers to the Journal datasets.
6. Immediately after the CG has been hardened on the Journal datasets, the records are written to their corresponding secondary volumes. Those records are also written from the SDM's real storage buffers. Because of the data in transit between the primary and secondary sites, the currency of the data on the secondary volumes lags slightly behind the currency of the data at the primary site.
7. The control dataset is updated to reflect that the records in the CG have been written to the secondary volumes.

17.6.1 z/OS Global Mirror control dataset placement


The z/OS Global Mirror control datasets (the Journal, Control, and State datasets) are critical to the performance of your z/OS Global Mirror environment. The Journals are particularly important, and you need to consider how to allocate them for best performance. You have two options, depending on your available space and your configuration:
- Dedicate an LSS to the Journals. This avoids any interference from other workloads with the Journal I/O. This option is more attractive now than it might have been on previous generations of hardware, because dedicating an LSS no longer means dedicating an entire rank.
- Distribute the Journals across all available disks, which can help balance the workload and avoid any potential hot spots.

Where these control datasets are to be allocated on the secondary (or target) disk subsystem, also consider the impact of the I/O activity from the mirrored volumes on that same disk subsystem when making this decision.

17.6.2 z/OS Global Mirror tuning parameters


This section discusses the tuning parameters available for z/OS Global Mirror. For additional information about the topics discussed in this section, refer to z/OS DFSMS Advanced Copy Services, SC35-0428. z/OS Global Mirror provides the flexibility of allowing System Data Mover (SDM) operations to be tailored to installation requirements and also supports the modification of key parameters, either from the PARMLIB dataset or through the XSET command.

PARMLIB members
When z/OS Global Mirror first starts up using the XSTART command, it searches for member ANTXIN00 in SYS1.PARMLIB. When a parameter value in the ANTXIN00 member differs from that value specified in the XSTART command, the version in the XSTART command will override the value found in ANTXIN00. After processing the ANTXIN00 member, z/OS Global Mirror then looks for member ALL in the dataset hlq.XCOPY.PARMLIB, where hlq is the value found for the hlq parameter in the ANTXIN00 member, or the overriding value entered in the XSTART command. The ALL member can contain values that you want to be common across all logical sessions on a system.

XSET PARMLIB command


To invoke PARMLIB support at times other than z/OS Global Mirror start-up, you can issue the XSET PARMLIB command. You can use the XSET PARMLIB command both before and after you issue an XSTART command. If XSET PARMLIB is invoked before an XSTART command, it will verify parameter syntax without applying any of the parameters.

PAGEFIX parameter in XSET command


This parameter specifies the number of megabytes that are assigned to the SDM as permanent page-fixed real storage. This parameter is the maximum amount of storage that remains page-fixed for the duration of the z/OS Global Mirror session. The default PAGEFIX value is 8 MB.

Each Storage Control (SC) session requires up to 35 MB for its data buffers. If you have n SC sessions running concurrently in the SDM address space, up to n x 35 MB data buffers will be required. For example, with six primary LCUs and two SC sessions for each LCU, up to 420 MB data buffers will be required. Page-fixing most or all of those data buffers will minimize the required MIPS and maximize the potential throughput. The SDM will dynamically fix and free additional real storage as required, but this approach uses more processor resources and potentially extends I/O processing times. Changes specified with the PAGEFIX parameter take place when the next SC buffers are processed.

Note: The PARMLIB option STORAGE PermanentFixedPages parameter has the equivalent function.
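As an illustration of the arithmetic above (12 SC sessions x 35 MB), the page-fixed allowance could be raised with an XSET command similar to the following line; the session identifier SESS1 is hypothetical, and the operand format should be verified against z/OS DFSMS Advanced Copy Services, SC35-0428.

XSET SESS1 PAGEFIX(420)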

TIMEOUT parameter in XSET command


This parameter specifies the maximum primary system impact that the z/OS Global Mirror session will allow. Set this parameter to a value that is acceptable to your application production environment. We recommend that you do not set this value lower than the Missing Interrupt Handler (MIH) value for the primary volumes. If you use channel extenders between the primary site and the SDM location, do not set the TIMEOUT value lower than the channel extender value.

You can specify the subsystem ID (SSID) to which the TIMEOUT value will apply. Each LCU in a DS8000 has a unique SSID, which is set during the installation and configuration process. If you do not specify any SSID, the TIMEOUT value applies to all LCUs in the z/OS Global Mirror session. Do not confuse this TIMEOUT parameter with the TIMEOUT parameter of the XSUSPEND command; they are used for different purposes.

Note: The PARMLIB parameter SHADOW StorageControlTimeout provides the equivalent function.

TIMEOUT parameter in XSUSPEND command


When you suspend a session with the XSUSPEND command, you must specify a TIMEOUT value, or you can specify that an existing LCU default value, previously set in the LCU (normally 5 minutes), is to be used. The SDM communicates this value to all primary LCUs. This value specifies the maximum time that an LCU can wait for the z/OS Global Mirror session to be restarted with an XSTART command.

This type of suspension is a session suspension accomplished by using the XSUSPEND TIMEOUT command. This command is issued when you want to terminate the SDM for a planned activity, such as a maintenance update, or moving the SDM to a different site or a different LPAR. The XSUSPEND TIMEOUT command will end the ANTASnnn address space and inform the involved LCUs that their z/OS Global Mirror session has been suspended. The DS8000 will then record changed tracks in the hardware bitmap and will free the write updates from the cache. When the z/OS Global Mirror session is restarted with the XSTART command and volumes are added back to the session with the XADDPAIR command, the hardware bitmap maintained by the DS8000 while the session was suspended will be used to resynchronize the volumes, and thus full volume resynchronization is avoided.

RFREQUENCY/RTRACKS parameters in XSET command


One of a pair of hardware bitmaps in the DS8000 is continuously updated to reflect changed tracks on primary volumes. This bitmap is used to avoid a full volume synchronization when returning suspended volumes back to a session. SDM also maintains one of a pair of software bitmaps for each volume as long as the ANTASnnn address space is active. This software bitmap is kept in the State dataset and, with XRC Version 2, was used during resynchronization of a suspended volume. If the ANTASnnn address space is not active, or the data path between the SDM and the primary LCU is broken, SDM is unable to update this bitmap. Unplanned outages were therefore not supported in XRC Version 2.

You can use the RFREQUENCY and the RTRACKS parameters in the XSET command to tell the DS8000 (for hardware bitmaps) and the SDM (for software bitmaps) when those bitmaps must be reset. The DS8000 and SDM will reset the bitmaps for all active volumes to their alternate bitmap at this time, according to these two parameters. Bitmaps for suspended volumes will not be reset. In the RFREQUENCY parameter, you can specify how often, in hours, minutes, and seconds, the bitmaps must be reset. With the RTRACKS parameter, you specify the number of tracks that must change before the bitmaps are reset.

The default for RFREQUENCY is 30 minutes. If you specify a value of 0, z/OS Global Mirror will not reset the bitmaps as a result of elapsed time. The default for RTRACKS is 7500 tracks. If you specify a value of 0 (or a value greater than the number of tracks on a volume), z/OS Global Mirror will not reset the bitmaps as a result of the number of tracks that have changed. z/OS Global Mirror bitmap toggling uses whichever happens first based on the RFREQUENCY or RTRACKS specifications.

Specifying high values in the RTRACKS and RFREQUENCY parameters (for RFREQUENCY, the effect depends on the update rate on the primary volumes) means that the resynchronization time can take longer. For example, if you have specified RTRACKS 7500 (the default, and let us assume that RFREQUENCY is 0) and a volume is suspended when 4000 tracks have changed since the bitmap was last reset, 4000 tracks have to be resynchronized. Specifying a low value in the RTRACKS and RFREQUENCY parameters can put a greater demand on DS8000 and SDM processing resources. We recommend that you start by using the defaults unless you have specific needs.
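For illustration only, bitmap toggling could be requested every 15 minutes or every 5000 changed tracks, whichever occurs first, with an XSET command along the following lines; the session name SESS1 is hypothetical, and the time operand format shown (hours.minutes.seconds) is an assumption to be verified in z/OS DFSMS Advanced Copy Services, SC35-0428.

XSET SESS1 RFREQUENCY(00.15.00) RTRACKS(5000)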


Note: PARMLIB parameters BITMAP ChangedTracks and DelayTime have an equivalent function.

DONOTBLOCK parameter in the XADDPAIR command


z/OS Global Mirror tries to balance the needs of the primary system and z/OS Global Mirror operations when data starts to accumulate in the cache. Data paths to primary volumes with high write activity will progressively be blocked. That is, writes from the primary host will be inhibited until SDM can catch up. If there are volumes in your z/OS Global Mirror configuration for which you do not want to have device blocking active, you can specify the DONOTBLOCK parameter when setting up the XADDPAIR command for those volumes. Note: Specify the DONOTBLOCK option only for high performance, sensitive volumes that generate short updates, such as the IMS WADS volumes and spool datasets.
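A hypothetical XADDPAIR invocation that exempts one pair from device blocking might look like the following line, where SESS1 is the session identifier and PRIM01 and SEC01 are the primary and secondary volume serials; verify the operand syntax in z/OS DFSMS Advanced Copy Services, SC35-0428.

XADDPAIR SESS1 VOLUME(PRIM01,SEC01) DONOTBLOCK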

DVCBLOCK parameter in the XADDPAIR command


You can use the DVCBLOCK parameter of the XADDPAIR command to control whether write pacing is enabled, disabled, or has a different value than the default for the session for the volumes for which it is specified. It is mutually exclusive with the DONOTBLOCK parameter.

Write Pacing
Write Pacing works by injecting a small delay as each zGM record set is created in cache for a given volume. As the device residual count increases, so does the magnitude of the pacing, eventually reaching a maximum value at a target residual count. You can specify both this maximum value and the target residual count at which it becomes effective for each volume through the XRC XADDPAIR command. Write pacing provides a greater level of flexibility than Device Blocking. Furthermore, the device remains ready to process I/O requests, allowing application read activity to continue while the device is being paced, which is not the case with Device Blocking. You can set the write pacing levels to one of fifteen fixed values between 0.02 ms and 2 ms. The write pacing levels are used for volumes with high rates of small blocksize writes, such as database logs, where the response time impact must be minimized.

A dynamic workload balancing algorithm was introduced with z/OS Global Mirror (zGM) Version 2. The objective of this mechanism is to balance the write activity from the primary systems and the SDM's capability to off-load cache during write peaks or a temporary lack of SDM resources, with minimal impact to the primary systems. In situations where the SDM off-load rate falls behind the primary systems' write activity, data starts to accumulate in cache. This accumulation is dynamically detected by the primary DS8000 microcode, which responds by slowly but progressively reducing the available write bandwidth for the primary systems, thus giving the SDM a chance to catch up.

The DS8000 implements device-level blocking. The update rate for a volume continues unrestricted unless the volume reaches a threshold of residual record sets waiting to be collected by the SDM. Whenever that threshold is exceeded, application updates to that single volume are paused to allow the SDM to read the record sets from the cache of the subsystem.


Tip: By using the DONOTBLOCK parameter of the XADDPAIR command, you can request that z/OS Global Mirror does not block specific devices. You can use this option for IMS WADs, DB2 logs, CICS logs, or spool datasets that use small blocksizes, perform numerous updates, and are critical to application response time.

PRIORITY parameter in the XSET command


The PRIORITY parameter in the XSET command specifies the priority that the command uses for selecting the next volume to synchronize or resynchronize. The first priority is to resynchronize existing volume pairs and then to add new volume pairs to the z/OS Global Mirror session. Tip: We recommend that you specify PRIORITY(FIFO), which ensures that synchronizing occurs in the order that the volumes were originally listed in their XADDPAIR command. PRIORITY (FIFO) provides you with a method of ensuring that your important volumes are the first volumes to be synchronized or resynchronized.
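A minimal illustration of this setting, with a hypothetical session name and the operand syntax to be confirmed in z/OS DFSMS Advanced Copy Services, SC35-0428:

XSET SESS1 PRIORITY(FIFO)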

17.6.3 z/OS Global Mirror enhanced multiple reader


In a zGM environment, you define a logical session. This logical session is made up of multiple physical sessions, with at least one physical session for each logical subsystem (LSS) that contains volumes participating in the zGM logical session. Based on the bandwidth study and the configuration of volumes in an LSS, a single LSS can have more than one physical zGM session. There is a limit of 64 physical sessions per LSS, which also includes other session types, such as Concurrent Copy.

With the zGM single reader implementation, you must carefully plan to balance the primary volume update rates for all zGM volumes in an LSS against the SDM update drain rate, because all updates for a physical session on an LSS are read by the SDM through a single SDM reader. If the updates occur at a faster rate than the rate at which the SDM can off-load them, record sets accumulate in the cache. When the cache fills up, the storage subsystem, coupled with the SDM, begins to execute the pacing or device-level blocking algorithms. Pacing or device-level blocking affects the performance of the host application. If the effect of pacing and device-level blocking is insufficient, zGM will eventually suspend.

When you initially set up a zGM configuration during the bandwidth study, the MBps update rate for each volume is determined, and the volumes are placed in an LSS based on their update rates and the associated SDM reader off-load rate. Sometimes more than one physical zGM session is required to be able to drain the updates for all the volumes residing in an LSS. In this case, the SDM must manage multiple physical sessions for the LSS.

With zGM multiple reader support, the SDM can now drain the record set updates using multiple fixed utility base addresses or a single base address that has aliases assigned to it. Through the single physical zGM session on an LSS, multiple reader paths can be used by the SDM to drain all the updates, which can reduce the number of base addresses required per LSS for zGM fixed utility addresses and can be even more dynamic in nature when HyperPAV is exploited. This support enables the SDM to off-load the record updates on an LSS through multiple paths against the same sidefile while maintaining the record set time-sequenced order. zGM multiple reader support permits the SDM to balance the updates across multiple readers, enabling simpler planning for zGM.

The SDM can manage a combination of physical sessions with single or multiple readers, depending on whether the multiple reader support is installed and active for each subsystem involved in the zGM logical session. Multiple reader support can also help to simplify the move to larger devices, and it can reduce the sensitivity of zGM in draining updates as workload characteristics change or capacity growth occurs. Less manual effort is required to manage the SDM off-load process. For more information about multiple reader, refer to IBM System Storage DS8000: Copy Services with IBM System z, SG24-6787.

17.6.4 zGM enhanced multiple reader performance improvement


The following charts show the performance improvement of zGM running with multiple readers compared to zGM running with a single reader. In this case, zGM uses four readers. In general, the longer the distance, the better the multiple readers perform compared to the single reader. Figure 17-22 shows the performance improvement when running a workload that performs 4 KB sequential writes to a single volume. At a distance of 3200 km (1988.3 miles), the multiple reader I/O rate is three times better than the single reader I/O rate.

Figure 17-22 Test of 4 KB sequential write workload to one volume (application I/O rate in thousands of I/Os per second, single reader compared to multiple readers, for XRC over direct FICON and over Brocade 7500 at 0 km, 1600 km, and 3200 km)

Figure 17-23 on page 551 shows the comparison when running a 27 KB sequential write workload to a single volume. Here, we compare the MB per second throughput. Even though the improvement is not as dramatic as on the 4 KB sequential write workload, we still see that the multiple reader provides better performance compared to the single reader.


Figure 17-23 Test of 27 KB sequential write workload to one volume (application throughput in MB per second, single reader compared to multiple readers, for XRC over direct FICON and over Brocade 7500 at 0 km, 1600 km, and 3200 km)

Figure 17-24 on page 552 shows the benchmark result where we compare the application throughput, measured by the total I/O rate, when running a database random write workload to one LSS. Here, we again see a significant improvement in throughput when running with multiple readers.


Figure 17-24 Database random write to one LSS (application I/O rate in thousands of I/Os per second, single reader compared to multiple readers, for XRC over direct FICON and over Brocade 7500 at 0 km, 1600 km, and 3200 km)

17.6.5 XRC Performance Monitor


The IBM XRC Performance Monitor is a licensed IBM program product that can be used to monitor and evaluate a z/OS Global Mirror system in terms of tuning, system constraint determination, evaluation of system growth, and capacity planning. For more details, refer to IBM TotalStorage XRC Performance Monitor Installation and User's Guide, GC26-7479.

17.7 Metro/Global Mirror


Metro/Global Mirror is a 3-site, multi-purpose, replication solution for both System z and Open Systems data. As shown in Figure 17-25 on page 553, Metro Mirror provides high availability replication from a local site (site A) to an intermediate site (site B), while Global Mirror provides long distance disaster recovery replication from an intermediate site (site B) to a remote site (site C).


Figure 17-25 Metro/Global Mirror overview diagram

IBM offers services and solutions for the automation and management of the Metro Mirror environment, which include GDPS for System z and TotalStorage Productivity Center for Replication. You can obtain more details about GDPS at the following Web site: http://www.ibm.com/systems/z/advantages/gdps

17.7.1 Metro/Global Mirror performance


The configuration of the Metro/Global Mirror environment must consider both the Metro Mirror function and the Global Mirror function. When setting up the configuration of the storage subsystems and links, consider the factors discussed earlier in this chapter. Metro/Global Mirror also has the added requirement of links from both Metro Mirror storage subsystems to the remote third site. In a normal configuration, the synchronous copy is from A to B, with Global Mirror from B to C. In the event that site B is lost, the links must already be in place from A to C to maintain the Global Mirror function, and they need to provide the same bandwidth as the B to C links.
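As an illustration of preparing this standby A-to-C connectivity, the Remote Mirror paths can be predefined with the DSCLI along the following lines; the storage image IDs, WWNN, LSS pair, and I/O port pair are hypothetical placeholders for your own configuration.

# Define paths from the A (local) subsystem directly to the C (remote) subsystem
mkpprcpath -dev IBM.2107-75AAAA1 -remotedev IBM.2107-75CCCC1 -remotewwnn 5005076303FFC111 -srclss 10 -tgtlss 10 I0010:I0110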

17.7.2 z/OS Metro/Global Mirror


z/OS Metro/Global Mirror (MzGM) uses z/OS Global Mirror to mirror primary site data to a remote location and also uses Metro Mirror for primary site data to a location within Metro Mirror distance limits. This approach (shown in Figure 17-26 on page 554) gives you a three-site high-availability and disaster recovery solution.


Figure 17-26 Example three-site configuration with z/OS Metro/Global Mirror

In the example that is shown in Figure 17-26, the System z environment in the Local Site is normally accessing the DS8000 disk in the Local Site. These disks are mirrored back to the Intermediate Site with Metro Mirror to another DS8000. At the same time, the Local Site disk has z/OS Global Mirror pairs established to the Remote Site to another DS8000, which can be at continental distances from the Local Site.

17.7.3 z/OS Metro/Global Mirror performance


Apply the performance considerations discussed in this chapter under Metro Mirror and zGM to the respective parts of the MzGM environment.


Appendix A.

Logical configuration examples


This appendix contains examples of logical DS8000 configurations that have been implemented by various clients. These examples are real in all cases and will aid you in understanding the difference between optimal and problematic logical configurations. We have included both good and bad examples. We present the following concepts:
- Considering hardware resource availability for throughput
- Understanding data I/O resource isolation as opposed to sharing through practical examples: scenarios and logical configuration layout designs that can simplify DS8000 performance analysis and management
- Configuring arrays in extent pools to meet performance requirements
- Understanding data placement at the extent pool level


A.1 Considering hardware resource availability for throughput


An important factor in logically configuring your DS8000 for optimal performance throughput is to assess your hardware resource availability. Take inventory of the amount of cache, the number of device adapter (DA) pairs, the ranks (arrays and array sites) and their affinity to a specific server complex, the number of I/O enclosures, the number of host adapters, and the number of I/O ports available on each adapter. In this particular appendix, we focus on:
- The array sites per DA pair and the total number of arrays/ranks
- The DA pairs
- The processor complexes (server 0 and 1)

The idea is to evenly distribute the data across as many hardware resources as possible without oversaturating those hardware resources through an imbalanced configuration. When you first configure a DS8000, performance issues might not arise, but as the environment changes, performance is affected. Aspects that can change are:
- With databases, the saturation of certain hardware resources increases as the database grows.
- New data, files, or filesystems are created on the same set of hardware resources, thus changing the performance of all data residing on those resources.
- Additional hardware resources are introduced in the environment, such as drivesets, arrays, and DA pairs, necessitating a redistribution for better utilization.

A.2 Resource isolation or sharing


The following subsections discuss the differences between sharing and isolating data I/O and applications at the server complex, rank, DA pair, and extent pool level. The examples in scenarios 1 and 2 can help you to visualize the concept of balancing the data I/O and strategically distributing I/O evenly across all of the hardware resources. The goal is to get the best throughput while keeping the configuration easy to manage. The examples assume uniformity of disk drive module (DDM) sizes and RAID types.

Scenario 1: Spreading everything with no isolation


There are at least two ways to spread the I/O across all the ranks, DA pairs, and server complexes. The first method is to group all the arrays into one extent pool. The second method is to isolate all or some of the ranks into their own extent pools but spread each application's I/O across all the hardware resources (all ranks, all DA pairs, and both server complexes).

In this first scenario, the client has requested a total of approximately 14 TB of capacity using 146 GB disks at 15K rpm. The I/O requirements call for as much spreading as possible for maximum sharing of I/O across the spindles. Oracle is the database that will be used with this configuration, and the client has stated that most of the applications have random I/O access patterns. The client is extremely concerned about performance priorities for three of the six hosts.

The storage administration team has used Capacity Magic to get a view of the available capacity and plan the logical layout. A representation is given in Figure A-1 on page 557. To the right, we show how the team decided to group the arrays/ranks into extent pools. The yellow circle surrounding the capacity represents the raw size and speed of the DDMs. The blue box surrounding the extpools and ranks represents the RAID characteristics, such as RAID 5 and the array type of 6+P+S or 7+P. The red circle surrounding the ranks represents the extent pools created. The ranks, designated by Rn, such as R0, represent the arrays logically grouped from eight physical DDMs (disks). In this example, all arrays were formatted as RAID 5, and due to the sparing rules, four arrays from each DA pair have a spare and are designated as 6+P+S. The remaining four arrays are configured as 7+P types.
Figure A-1 Simple scenario of array/rank to extent pool grouping

The team has decided to place each rank into its own extent pool. This method is an excellent way to isolate each rank from every other rank, but because of the way that the team laid out the LUNs in each extent pool, no isolation at the host application level was achieved. The view shown in Figure A-1 can be transposed to a spreadsheet to further define the granularity of the logical layout and the LUNs configured from each extent pool. The chart in Figure A-2 on page 558 shows further granularity of the capacity resources available at the DA pair, server complex, rank, extent pool, logical subsystem (LSS), and logical unit number (LUN) level. Included in the chart is a specific color coding of the LUNs to their assigned hosts and the hosts' host bus adapter (HBA) worldwide port names (WWPNs). Orienting yourself with this type of chart and understanding how to read and use it is beneficial in understanding the illustrations presented in this appendix.

In Figure A-2 on page 558, we show a layout of LUNs with the respective numbering associated to the server complexes, extpools, and LSSs, numbered in rows 6 - 38 and columns A, B, and C. In this illustration, we have made the array numbering equal to the rank numbering, and the rank numbering equal to the extpool and LSS numbering. For example, S1=A0=R0=Extpool0=LSS0, as shown by the box around rows 6 and 7, surrounded by the box in red (horizontally). In this example, the LUNs are numbered sequentially (in hexadecimal format) from left to right:
- Box number 1 (column A) shows the server complexes, numbered 0 and 1.
- Box number 2 (column B) shows the extpools, numbered 1 - 23.
- Box number 3 (column C) shows the LSSs.

The chart is read vertically by server, extpool, LSS, and the associated LUNs carved from the respective arrays/ranks/extpools. Box number 4 shows the arrays owned by the DA pairs; for example, there are two DA pairs in this illustration. Box number 5 shows a number of LUNs, color coded to the hosts that own them. Box number 6 marks the production host LUNs, and box number 7 marks the nonproduction host LUNs.

Figure A-2 Before hardware capacity resource chart to track LUN placement at the rank to extent pool level

The chart shown in Figure A-2 has value in that you can quickly view the logical layout at a glance without having to run a number of commands to visualize your environment. (Notice that all but two LUNs were spread across every array/rank/extent pool and both server complexes.) Hot spots can be more readily pinpointed and then confirmed with tools, such as Tivoli Productivity Center. Using this type of chart can, however, help you reconfigure the logical layout more easily and quickly when hot spots are found. Migrating LUNs to other, less busy arrays, DA pairs, or server complexes can be planned more easily.

In the example shown in Figure A-2, notice that all host data LUNs are spread across all the hardware resources, regardless of server complex, DA pair, array characteristics, and so forth. The initial belief of the storage team was that the more the I/O was spread across spindles, the better the performance. After the implementation, the client experienced performance hot spots on the production hosts, identified in box number 6. One of these production host applications' workload peaked at the same time as one of the non-priority hosts, identified in box number 7. Upon investigation and reconfiguration, the problem was solved by separating the production hosts from the nonproduction hosts and placing the production host LUNs on the arrays with a 7+P type format.

The chart shown in Figure A-3 on page 559 shows the remediated logical configuration. Notice that the production host LUNs were isolated from the nonproduction hosts and placed on arrays that have a 7+P type. Realize that 7+P type arrays will outperform 6+P+S type arrays simply because there are more physical heads and spindles across which to spread. For further information and discussion about array type characteristics, refer to 4.3, Understanding the array to LUN relationship on page 55.

Another key factor in the performance throughput gain was separating and isolating the production servers' application I/O at the array level from the nonproduction servers' application I/O. For example, the production hosts' I/O resides in the LUNs shown in the box numbered 1, while the nonproduction LUNs now reside in separate arrays, shown in the box numbered 2. Separating the host applications' I/O at the array level kept the peaking workloads from contending on the same arrays. This new configuration still accommodates a spread of I/O across several arrays, eight to be exact (4, 5, 6, 7, 12, 13, 14, and 15), two DA pairs (0 and 2), and both server complexes (0 and 1).

Figure A-3 After hardware capacity resource chart to track LUN placement at the rank to extent pool level

The ideal point between isolating and spreading was achieved as shown in Figure A-3. Spreading applies to both isolated workloads and resource-sharing workloads.


In scenario 1, we used a logical configuration for maximum isolation of spindles at the rank/extent pool level. To further illustrate this concept and familiarize you with the charts used in many of the illustrations in the remainder of this appendix, refer to Figure A-4. This chart is similar to the previous chart shown in Figure A-3 on page 559, except for the LUN to array granularity.

In Figure A-4, we show a logical configuration of 48 ranks. Each rank is placed in its own extent pool to isolate the rank spindles from all other rank spindles. The boxes surrounding each rank represent eight physical disks grouped into a logical array. Each array is represented in columns A through E, over two rows. For example, columns A - E, rows 3 and 4, represent an array (in this example, rank 0, shown on the left). Each of the arrays contains information identifying the server complex, extent pool, rank number, capacity in GB, and RAID type. In the middle of the diagram, we show a representation of the eight physical disks for ranks 0, 33, and 15, just to illustrate that each rank is made up of eight physical DDMs.

Figure A-4 Rank to Extpool 1 to 1 ratio illustration

We have isolated all the ranks from each other. The benefit of maximum isolation is for application requirements that call for the most spindle separation. SAN Volume Controller (SVC) and other virtualization appliances benefit from this type of configuration, because they use their own virtualization. Note: Although the ranks are isolated from each other, the I/O traffic is still shared within the associated owning DA pair and server complex for each rank.


Scenario 2: Spreading data I/O with partial isolation


Finding the ideal point between isolation and spreading or sharing can be challenging. For example, in Figure A-5, we show a DS8000 unit consisting of 48 arrays/ranks numbered 0 - 47. In this example, we show six DA pairs, which are numbered 0, 2, 6, 4, 5, and 7. Under each DA pair are eight arrays of two different capacities. The rectangles making up the rows and columns are the arrays. For example, the box surrounding columns A - E, rows 6 and 7, shows an array that is a 6+P with a total capacity of 1582 GB. R1 (Rank1) has been assigned to extent pool 1, which belongs to server complex 1.

The boxes on the left and right of the diagram each surround one 6+P array from each DA pair available in this configuration. The raw drives are all 300 GB, and each array has been formatted for RAID 5. As you can see, four arrays in each DA pair render a 6+P+S configuration, and the remaining four arrays render a 7+P configuration. The 6+P arrays render 1582 GB and the 7+P arrays render 1844 GB of capacity.

The boxes in the middle of the diagram are a representation of the ranks. Each rank represents eight physical disks but is shown here as one volume, because a rank operates as one large physical disk. For example, the box surrounding the six ranks labeled Extpool 1 is made up of six arrays/ranks, numbered R1, R8, R17, R24, R33, and R40, one from each DA pair.

Figure A-5 DS8000 grouping of arrays in extent pools

To achieve a well-balanced DS8000 from the array to extent pool perspective, take an array/rank from each DA pair of equal size and the same characteristics and group them into an extent pool as shown in Figure A-5. With this configuration, and considering the large number of arrays and DA pairs, a good general rule is to create as many extent pools as there are arrays in a DA pair. For example, in Figure A-5 on page 561, we show six DA pairs and eight ranks in each DA pair. Next, we show that one rank from each DA pair has been grouped into one of the eight extent pools. This way, you can achieve a well balanced configuration in the extent pool, because it leverages and distributes the I/O evenly across all the hardware resources, such as the arrays, DA pairs, and server complexes.
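As a minimal DSCLI sketch of this grouping, extent pool 1 could be built from one rank per DA pair as follows; the storage image ID is a hypothetical placeholder, and the new extent pool is assumed to be assigned ID P1.

# Create an extent pool in rank group 1 (server complex 1) for fixed block storage
mkextpool -dev IBM.2107-7512341 -rankgrp 1 -stgtype fb extpool_1

# Assign one rank from each DA pair to the new extent pool
chrank -dev IBM.2107-7512341 -extpool P1 R1 R8 R17 R24 R33 R40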

Scenario 3: Grouping unlike RAID types together in the extent pool


The example in Figure A-6 depicts sharing half of the ranks' spindles within one extent pool. Note how the ranks are distributed between the DA pairs, server complexes, and capacities. This example is not a good one, in that all the 6+P+S ranks are grouped together under server complex 0, and all the 7+P arrays are grouped together under server complex 1.

Figure A-6 A bad example of maximum sharing between two extent pools

Although the ranks were divided between the two extent pools to separate the 6+P+S from the 7+P RAID types, we are unable to achieve an even balance across the server complexes. All the LUNs created and assigned to a host from extent pool 0 route their traffic through server complex 0, and all the LUNs assigned to a host from extpool 1 route their I/O traffic through server complex 1. In order to achieve I/O balance from a host perspective, you need to assign LUNs from both extent pools to each host. By spreading the application across both extent pools, we give up any isolation: all I/O is shared between all hosts and workloads.

The advantage of this type of configuration is that I/O is spread across the maximum number of heads and spindles while maintaining reasonable balance across the DA pairs and server complexes. The disadvantage is a lack of I/O isolation. Everything is spread and striped, which introduces I/O contention between all hosts and applications. If a hot spot is encountered, you have nowhere to move the data to improve I/O throughput for the application, especially if two or more applications peak at the same time. You also negate any performance gains from the 7+P arrays by mixing them with the 6+P array types. A properly utilized and balanced system is configured as shown in scenario 4, which is illustrated in Figure A-7.

Scenario 4: Grouping like RAID types in the extent pool


In this example, we have used the same number and capacity of arrays shown in scenario 3, but we created four extent pools as shown in Figure A-7.

Figure A-7 A good example of maximum sharing to maintain balance and proper utilization

In Figure A-7, we show a suggested logical configuration. Two ranks from each DA pair are grouped together in an extent pool, which provides excellent balance for optimal, evenly distributed I/O throughput. This configuration also allows you to take advantage of the aggregate throughput of 12 ranks in each extpool. In extent pools 0 and 1, you spread the I/O across 84 physical drives, because one drive in each rank is a spare. Extent pools 2 and 3 allow you to take advantage of spreading the I/O over 96 physical drives, thereby outperforming I/O residing in extent pools 0 and 1. For a more detailed understanding of why the 7+P arrays outperform the 6+P+S arrays, refer to 4.3, Understanding the array to LUN relationship on page 55. By configuring the unit this way, you can control the balance of I/O more evenly for LUN to host assignments, taking advantage of I/O from both server complexes and all DA pairs, for example, by assigning LUNs to a host from extent pools 0 and 1, or from extent pools 2 and 3.
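The balanced LUN-to-host assignment described here could be scripted with the DSCLI along the following lines; the storage image ID, volume IDs, capacity, and volume group name are hypothetical, and extent pools P0 and P1 are assumed to correspond to the even and odd extent pools of Figure A-7.

# Create one LUN in an even (rank group 0) and one in an odd (rank group 1) extent pool
mkfbvol -dev IBM.2107-7512341 -extpool P0 -cap 100 -type ds -name hostA_#h 1000
mkfbvol -dev IBM.2107-7512341 -extpool P1 -cap 100 -type ds -name hostA_#h 1100

# Group both LUNs into one volume group so that the host's I/O uses both server complexes
mkvolgrp -dev IBM.2107-7512341 -type scsimask -volume 1000,1100 hostA_vg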


The advantages are:
1. Partial built-in isolation, separating arrays by RAID array type
2. Faster I/O throughput from the use of the larger arrays
3. At least two extent pools of each array characteristic, such as 6+P and 7+P

If I/O hot spots are encountered, you have the ability to move the hot spot to less busy hardware resources that perform differently at the rank level. This move aids in analyzing and confirming whether I/O can be improved at the DS8000 disk level. To see the LUN level granularity with this type of configuration, refer to Figure A-8 for the LUN assignments from the extent pools configured in this manner.

In Figure A-8, we show that hot spots were encountered on both 512 GB LUNs configured in extent pools 0 and 1. These LUNs are shown in cells G2 and N2. The data residing on these LUNs can be migrated by using a host migration technique to extent pools 2 and 3. Extent pools 2 and 3 contain RAID arrays of the 7+P type, which perform faster than extent pools 0 and 1, which are made up of the 6+P+S array type.

Figure A-8 More sharing of ranks in extent pools

Figure A-8 shows an example of four extent pools and the associated LUN mappings to the hosts from each extent pool.

Note: To move a data I/O hot spot between extent pools, use a host migration technique.

As the database grows and I/O increases, there might be a need for further isolation to remediate I/O contention at the spindle level. Scenario 5 shows how to further isolate at the rank to extent pool level while still taking advantage of the aggregate throughput gained from spreading the I/O across multiple arrays/ranks.

Scenario 5: More isolation of RAID types in the extent pool


In this example, we have created eight extent pools as shown in Figure A-9 for further isolation of the spindles.

Figure A-9 More isolation but maintaining balance and spread across hardware resources

We suggest the logical configuration shown in Figure A-9, where one rank from each DA pair of like capacity and characteristics is grouped together into an extent pool, which results in more extent pools, as shown in Figure A-10 on page 566. More extent pools allow you to isolate to a finer granularity. Be careful when assigning LUNs to hosts to preserve this isolation; remember that you can quickly lose this granularity of isolation by spreading the same application's I/O across multiple extent pools.


Figure A-10 How extent pools might look with this configuration

In Figure A-10, we show the extent pool breakout at the LUN level and have isolated the host LUN assignments to take advantage of this granularity. We still have a good spread of arrays to DA pairs, in that a rank from each DA pair has been pooled together for the aggregate throughput of all the DA pairs in each extent pool. We show that each host takes advantage of the load balancing offered by LUNs assigned from both DS8000 server complexes. The odd-numbered extent pools are owned by server complex 1 and the even-numbered extent pools are owned by server complex 0. At least one LUN from an even-numbered and an odd-number extent pool has been assigned to each host. For example, two LUNs from extpool 4 and two LUNs from extent pool 5 are assigned to HostJ, volume group 13 (V13).

Scenario 6: Balancing mixed RAID type ranks and capacities


In this example, we take an extreme case scenario, where the application requirements called for these specific hardware resources. In Figure A-11 on page 567, we show an extreme case of diverse capacities and RAID types per DA pair. For reasons of cost and availability, the following format has been required, and our job now is to group the ranks for the best performance. This example is a suggested rank to extent pool configuration. If the objective is to spread the ranks in the extent pools across as many DA pairs and server complexes as possible, for balance (so that we have an equal number of ranks owned by server complex 0 as by server complex 1), this suggested configuration is a start.


[Figure content: DS8300 grouping of arrays into extent pools Exp0 through Exp23. Ranks R0 through R63 on 72.8 GB, 145.6 GB, 300 GB, 450 GB, and 500 GB drives are configured as RAID 10 (3X2+S+S and 4X2), RAID 5 (6+P+S and 7+P), and RAID 6 (5+P+Q+S and 6+P+Q) arrays and are grouped into extent pools by like capacity and RAID type.]
Figure A-11 DS8000 mixed RAID and density drives example of logical configuration for performance

The example shown in Figure A-11 might not be an ideal logical configuration, but the balance is nearly perfect. Notice that in DA pair 0, we have one rank configured as 3X2+S+S and another as 4X2. Rank 2 is owned by server complex 0, and rank 3 is owned by server complex 1. Each rank is tied to either server complex 0 or 1 without another rank of the same size or RAID format type from the same server complex.

In Figure A-12 on page 568, we show a chart that we have converted from Figure A-11 so that we can better understand at a glance how this DS8000 is configured and why. We can more easily see the RAID types (RAID 5, RAID 6, or RAID 10), the capacity rendered by each array and RAID type, the spread of arrays/ranks across the DA pairs, and the spread of the extent pools across the server complexes. By knowing all of this information, you will be better informed as to how to provision the LUNs from each extent pool to exploit the performance rendered by each LUN for the best possible balanced throughput.


[Figure content: chart of the DS8300 grouping of arrays into extent pools, with columns for DA pair, server complex number, extent pool number, rank number, array capacity, and RAID type.]

Figure A-12 An extreme case of DS8000 grouping of arrays in extent pools

Figure A-12 shows the characteristics and relationship of each array by size, RAID type, capacity, extent pool, DA pair, and server complex ownership.

Note: As a reminder, if you can take advantage of the aggregate throughput rendered by spreading the application's I/O across the ranks, DA pairs, and server complexes, performance is almost always better. Separation of array/rank types is an important factor as well and enables the option of host server-level striping at RAID 0 for striping granularity. Mixing array type characteristics is allowed but can make performance worse at the host level due to the different strip widths rendered by the extent creation. For a better understanding of the relationship of the strip width to the RAID type characteristic, refer to 4.3, Understanding the array to LUN relationship on page 55.

Figure A-13 on page 569 shows the breakdown of the LUNs configured in each extent pool, according to the mapping relationship shown in Figure A-12.
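Before looking at Figure A-13, here is a hedged sketch of the host server-level striping mentioned in the note. On an AIX host, a logical volume can be striped across two LUNs taken from extent pools owned by different server complexes; the names datavg, stripelv, hdisk10, and hdisk20, the 64 KB strip size, and the 100 logical partitions are assumptions for illustration only:

# mkvg -y datavg hdisk10 hdisk20
# mklv -y stripelv -S 64K datavg 100 hdisk10 hdisk20

Because each logical partition of stripelv is then spread across both LUNs, the stripe behaves predictably only when the underlying ranks have like capacity and RAID characteristics, which is one reason for the separation of array/rank types discussed in the note.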


Figure A-13 Array to LUN and host mapping relationship

In Figure A-13, we show the LUN assignments made to the hosts. The important thing to note here is the balance and spread across the hardware resources, such as the server complexes, DA pairs, and arrays. Notice that we did not spread everything across everything. Instead, we split all the arrays of equal capacity and characteristics per DA pair, which allows us to create extent pools equally divided between the server complexes and accommodates more achievable load balancing at the host server level. For example, HostL has four LUNs assigned from four extent pools. Each LUN is spread across at least two ranks, two DA pairs, and two server complexes.


Appendix B. Windows server performance log collection


This appendix describes the procedures necessary to collect Windows Server 2003 and Windows Server 2008 server disk performance data for later analysis.


B.1 Windows Server 2003 log file configuration


The Performance Logs and Alerts window, shown in Figure B-1, lets you collect performance data manually or automatically from local or remote systems. Saved data can be displayed in System Monitor or data can be exported to a spreadsheet or database.

Figure B-1 Performance Logs and Alerts

Performance Logs and Alerts provides the following functions:

Counter logs: This function lets you create a log file with specific system objects and counters and their instances. Log files can be saved in different formats (file name + file number, or file name + file creation date) for use in System Monitor or for exporting to database or spreadsheet applications. You can schedule the logging of data, or the counter log can be started manually using program shortcuts. Counter log settings can also be saved in HTML format for use in a browser either locally or remotely via TCP/IP.

Trace logs: This function lets you create trace logs that contain trace data provider objects. Trace logs differ from counter logs in that they measure data continuously rather than at specific intervals. You can log operating system or application activity using event providers. There are two kinds of providers: system and non-system providers.

Alerts: This function lets you track objects and counters to ensure that they are within a specified range. If the counter's value is under or over the specified value, an alert is issued.

Configuring logging of disk metrics on Windows Server 2003


To configure logging:
1. From the main Performance (perfmon) console, right-click Counter Logs and select New Log Settings.
2. Enter the new log settings name, for example: Disk Performance.
3. The log settings configuration window will be displayed as shown in Figure B-2 on page 573.


Figure B-2 Windows Server 2003 Performance log General tab

4. Click Add Objects, select the PhysicalDisk and LogicalDisk objects, and select Add.
5. In the General tab, Sample data every: is used to set how frequently you capture the data. When configuring the interval, specify an interval that provides enough granularity to allow you to identify the issue. For example, if you have a performance problem that only lasts 15 seconds, you need to set the interval to at most 15 seconds, or it will be difficult to capture the issue.

Note: As a general rule for post-processing, more than 2000 intervals is difficult to analyze in spreadsheet software. When configuring the interval, remember both the problem profile and the planned collection duration in order to not capture more data than can be reasonably analyzed. You might require multiple log files.

6. In the Run As field, enter the account with sufficient rights to collect the information about the server to be monitored, and then click Set Password to enter the relevant password.
7. The Log Files tab, shown in Figure B-3 on page 574, lets you set the type of the saved file, the suffix that is appended to the file name, and an optional comment. You can use two types of suffixes in a file name: numbers or dates. The log file types are listed in Table B-1 on page 574. If you click Configure, you can also set the location, file name, and file size for a log file. The Binary File format takes the least amount of space and is suggested for most logging.


Figure B-3 Windows Server 2003 Log Files tab

Table B-1 Counter log file formats

Log file type          Description
Text file - CSV        Comma-delimited values log file (csv extension). Use this format to export to a spreadsheet.
Text file - TSV        Tab-delimited log file (TSV extension). Use this format to export the data to a spreadsheet program.
Binary file            Sequential, binary log file (BLG extension). Use this format to capture data intermittently (stopping and resuming log operation).
Binary circular file   Circular, binary-format log file (BLG extension). Use this format to store log data to the same log file, overwriting old data.
SQL Database           Logs are output as an SQL database.

8. The Schedule tab, shown in Figure B-4 on page 575, lets you specify when this log is started and stopped. You can select the option box in the start log and stop log sections to manage this log manually using the Performance console shortcut menu. You can also configure the log to start a new log file or to run a command when this log file closes.


Figure B-4 Windows Server 2003 log file Schedule tab

9. After clicking OK, the logging starts automatically. If for any reason it does not, simply right-click the log settings file in the perfmon window and click Start.
10. To stop the counter log, click the Stop the selected log icon on the toolbar.
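If you prefer to define the counter log from a command prompt rather than through the GUI, the logman utility that ships with Windows Server 2003 can create a comparable log. The following commands are only a sketch: the log name Disk_Performance, the 15-second sample interval, and the output path are assumptions that you should adapt to your environment.

C:\> logman create counter Disk_Performance -c "\PhysicalDisk(*)\*" "\LogicalDisk(*)\*" -si 15 -f bin -o "C:\PerfLogs\Disk_Performance"
C:\> logman start Disk_Performance
C:\> logman stop Disk_Performance

The -c option lists the counter paths to collect, -si sets the sample interval in seconds, and -f bin produces the binary .blg format that System Monitor can read.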

Saving counter log settings


You can save the counter log settings to use them later. To save log settings:
1. Select the counter log in which you want to save settings.
2. Right-click the log. The window shown in Figure 9-15 on page 237 appears.
3. Click Save Setting As.
4. Select a location and enter a file name, and then click Save (saving to an HTML file is the only option). This log settings file can then be opened with Internet Explorer.

You can also use the pop-up menu to start, stop, and save the logs, as shown in Figure B-5 on page 576.


Figure B-5 Windows Server 2003 Counter log pop-up menu

Importing counter log properties


You can import counter log settings from saved files. To import settings:
1. Right-click in the right pane of the window.
2. Select New Log Settings From.
3. The Open dialog window appears. Choose the location, select a file name, and then click Open.
4. The Name dialog box appears. If you want to change the log setting name, you can enter a name; otherwise, click OK.
5. A dialog window is displayed where you can add or remove counters and change log file settings and scheduling.
6. If you want to edit the settings, change the required fields; otherwise, click OK.

Analyzing disk performance from collected data


Disk activity is best analyzed using a log, because real-time monitoring provides a view of only the current disk activity, not the historical disk usage over an extended period of time. The information in the log can be viewed and evaluated at a later time. The following section describes the process for viewing and exporting data collected in a log file.

Retrieving data from a counter log file


After you have saved data to a log file, you can retrieve that data and process it. By default, System Monitor displays real-time data. In order to display previously logged data:
1. Click the View log data file icon on the System Monitor toolbar.
2. The System Monitor Properties dialog shown in Figure B-6 on page 577 opens at the Source tab. Select Log Files, and then click Add. The Select Log File dialog opens. Select the log file that you want and click Open.


Figure B-6 Windows Server 2003 System Monitor Properties (Source tab)

3. At the System Monitor Properties dialog, select the Data tab. You now see any counter that you specified when setting up the Counter Log as shown in Figure B-7. If you only selected counter objects, the Counters section will be empty. To add counters from an object, simply click Add, and then select the appropriate counters.

Figure B-7 System Monitor Properties (Data tab)


Exporting logged data on Windows Server 2003


To export logged data:
1. Stop the data collection by right-clicking the log settings file and selecting Stop.
2. Open the System Monitor and remove any existing counters.
3. Select the Log file icon, select your log file as shown in Figure B-8, and click OK.

Figure B-8 Windows Server 2003 Log file selection

4. In the Monitor View, click Add, select all the key PhysicalDisk counters for all instances, and click Add.
5. You will see a window similar to Figure B-9 on page 579.


Figure B-9 Windows System Monitor using Performance Log

6. Right-click the chart side, and click Save Data As.
7. In the Save Data As window, select a file type of Text File (Comma delimited)(*.csv).
8. Provide a file name, such as Disk_Performance.
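As an alternative to exporting through the System Monitor GUI, a binary counter log can be converted to a comma-delimited file with the relog utility; the file names below are assumptions:

C:\> relog "C:\PerfLogs\Disk_Performance.blg" -f csv -o "C:\PerfLogs\Disk_Performance.csv"

The resulting .csv file can then be opened directly in a spreadsheet.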


B.2 Windows Server 2008 log file configuration


In Windows Server 2008, the Performance Monitor is started in the same manner as on Windows Server 2003, but there are differences in the 2008 implementation of perfmon. Windows Server 2008 refers to log configuration files as Data Collector Sets. These sets are logical groupings of collectors that, unlike the Windows Server 2003 counter logs, can contain both system performance counters and event trace data in one collector set. Data Collector Sets can use a default collection set or can be manually configured; refer to Figure B-10. For the purpose of symmetry with the techniques that we used on Windows Server 2003, we use a manual configuration process.

Figure B-10 Windows Server 2008 Reliability and Performance Monitor

To run the Performance Monitor:
1. Expand the Data Collector Sets element in the tree view, as referenced in Figure B-11 on page 581, and right-click User Defined.


Figure B-11 Windows Server 2008 User Defined Data Collector Sets

2. Select New Data Collector Set.
3. The Data Collector Set wizard will be displayed as shown in Figure B-12. Click Next.

Figure B-12 Windows Server 2008 Data Collector Set wizard

4. Provide a file name, such as Disk_Performance.
5. Select Create manually (Advanced) and click Next.
6. As shown in Figure B-13 on page 582, check the box next to Performance counter and click Next.


Figure B-13 Windows Server 2008 Advanced Data Collector Set configuration

7. As shown in Figure B-14, you will be asked to select the desired performance counters. Click Add and you will see a window similar to Figure B-15. Select all instances of the disks except the Total and manually select all the individual counters identified in Table 10-2 on page 293. Select the desired computer as well. Click OK.

Figure B-14 Windows Server 2008 Add performance counters


Figure B-15 Windows Server 2008 performance counters

8. You will now be at the Create new Data Collector Set wizard window. In the Sample interval list box, set how frequently you capture the data. When configuring the interval, specify an interval that provides enough granularity to allow you to identify the issue. For example, if you have a performance problem that only lasts 15 seconds, you need to set the interval to at most 15 seconds, or it will be difficult to capture the issue. Click Next.

Note: As a general rule for post-processing, more than 2000 intervals is difficult to analyze in spreadsheet software. When configuring the interval, remember both the problem profile and the planned collection duration in order to not capture more data than can be reasonably analyzed. You might need multiple log files.

9. You will be prompted to enter the location of the log file and the file name as shown in Figure B-16 on page 584. In this example, the default file name is shown. After entering the file name and directory, click Finish.


Figure B-16 Windows Server 2008 log file placement

10. You will see a window similar to Figure B-17.

Figure B-17 Windows Server 2008 Performance Monitor with new Data Collector Set

11. Right-click the Data Collector Set and select Start.
12. After you have collected the data for the desired period, you can select the Stop icon or right-click the Data Collector Set and select Stop.

Windows Server 2008 Export


To export the log file:
1. Under Monitoring Tools, select the Performance Monitor.
2. Select the log file icon, select your log file as shown in Figure B-8 on page 578, and click OK. This step in the process is identical to Windows Server 2003.


3. In the Monitor View, click Add, select all the key PhysicalDisk counters for all instances, and click Add.
4. You will see a window similar to Figure B-17 on page 584.

Figure 17-27 Windows Server 2008 Performance Monitor

5. Right-click the chart side and click Save Data As.
6. In the Save Data As window, select a file type of Text File (Comma delimited)(*.csv).
7. Provide a file name, such as Disk_Performance.


Appendix C. UNIX shell scripts


This appendix includes scripts that are helpful to manage disk devices and monitor I/O for servers attached to the DS8000. We describe the implementation of these scripts in Chapter 11, Performance considerations with UNIX servers on page 307.


C.1 Introduction
The scripts presented in this appendix were written and tested on AIX servers, but you can modify them to work with SUN Solaris and Hewlett-Packard UNIX (HP-UX). They have been modified slightly from the scripts published in earlier versions of performance and tuning IBM Redbooks publications. These modifications are mainly due to the absence of the ESS utility commands, lsess and lssdd, which are not available for the DS8000. We use the Subsystem Device Driver (SDD) datapath query essmap command instead. By downloading the Acrobat PDF version of this publication, you can copy and paste these scripts for easy installation on your host systems.

To function properly, the scripts presented here rely on:
An AIX host running AIX 5L
Subsystem Device Driver (SDD) for AIX Version 1.3.1.0 or later

The scripts presented in this appendix are:
vgmap
lvmap
vpath_iostat
ds_iostat
test_disk_speeds

Important: These scripts are provided on an as is basis. They are not supported or maintained by IBM in any formal way. No warranty is given or implied, and you cannot obtain help with these scripts from IBM.
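After copying the scripts onto the host, you typically need to make them executable before use. The following is a minimal sketch that assumes the scripts were placed in /usr/local/bin; any directory in your PATH works equally well:

# cd /usr/local/bin
# chmod +x vgmap lvmap vpath_iostat ds_iostat test_disk_speeds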

C.2 vgmap
The vgmap script displays which vpaths a volume group uses and also to which rank each vpath belongs. Use this script to determine if a volume group is made up of vpaths on several different ranks and which vpaths to use for creating striped logical volumes. Example output of the vgmap command is shown in Example C-1. The vgmap shell script is in Example C-2.
Example: C-1 The vgmap output
# vgmap testvg

PV_NAME    RANK    PV STATE    TOTAL PPs    FREE PPs
testvg:
vpath0     1100    active      502          502
vpath2     1000    active      502          502

Example: C-2 The vgmap shell script #!/bin/ksh ############################## # vgmap # usage: vgmap <vgname> # # Displays DS8000 logical disks and RANK ids for each # disk in the volume group # # # Author: Pablo Clifton pablo@compupro.com


# Date: August 28, 2005
##############################################
datapath query essmap > /tmp/lssdd.out
lssddfile=/tmp/lssdd.out
workfile=/tmp/work.$0
sortfile=/tmp/sort.$0
# AIX
lsvg -p $1 | grep -v "PV_NAME" > $workfile
echo "\nPV_NAME    RANK    PV STATE    TOTAL PPs    FREE PPs    Free D"

for i in `cat $workfile | grep vpath | awk '{print $1}'` do #echo "$i ... rank" rank=`grep -w $i $lssddfile | awk '{print $11}' | head -n 1` sed "s/$i /$i $rank/g" $workfile > $sortfile cp $sortfile $workfile done cat $workfile rm $workfile rm $sortfile ########################## THE END ######################

C.3 lvmap
The lvmap script displays which vpaths and ranks a logical volume uses. Use this script to determine if a logical volume spans vpaths on several different ranks. The script does not tell you if a logical volume is striped or not. Use lslv <lv_name> for that information or modify this script. An example output of the lvmap command is shown in Example C-3. The lvmap shell script is in Example C-4.
Example: C-3 The lvmap output
# lvmap.ksh 8000stripelv

LV_NAME            RANK    COPIES          IN BAND    DISTRIBUTION
8000stripelv:N/A
vpath4             0000    010:000:000     100%       010:000:000:000:000
vpath5             ffff    010:000:000     100%       010:000:000:000:000

Example: C-4 The lvmap shell script
#!/bin/ksh
##############################################
# LVMAP
# usage: lvmap <lvname>
#
# displays logical disk and rank ids for each
# disk a logical volume resides on
# Note: the script depends on correct lssdd info in
# /tmp/lssdd.out
#


# Before running the first time, run: # Author: Pablo Clifton pablo@compupro.com # Date: August 28, 2005 ############################################## datapath query essmap > /tmp/lssdd.out lssddfile=/tmp/lssdd.out workfile=/tmp/work.$0 sortfile=/tmp/sort.$0 lslv -l $1 | grep -v " COPIES " > $workfile for i in `cat $workfile | grep vpath | awk '{print $1}'` do #echo "$i ... rank" rank=`grep -w $i $lssddfile | awk '{print $11}' | head -n 1` sed "s/$i /$i $rank/g" $workfile > $sortfile cp $sortfile $workfile done echo "\nLV_NAME cat $workfile rm $workfile rm $sortfile ###################### End ####################### RANK COPIES IN BAND DISTRIBUTION"

C.4 vpath_iostat
The vpath_iostat script is a wrapper program for AIX that converts iostat information based on hdisk devices to vpaths instead. The script first builds a map file to list hdisk devices and their associated vpaths and then converts iostat information from hdisks to vpaths. To run the script, make sure that the SDD datapath query essmap command is working properly; that is, all volume groups are using vpaths instead of hdisk devices. The command syntax is:
vpath_iostat (Ctrl-C to break out)
Or:
vpath_iostat <interval> <iteration>
An example of the output vpath_iostat produces is shown in Example C-5. The vpath_iostat shell script is in Example C-6 on page 591.
Example: C-5 The vpath_iostat output
garmo-aix: Total VPATHS used: 8    16:16 Wed 26 Feb 2003    5 sec interval
garmo-aix  Vpath:    MBps      tps     KB/trans   MB_read   MB_wrtn
garmo-aix  vpath0    12.698    63.0    201.5      0.0       63.5
garmo-aix  vpath6    12.672    60.6    209.1      0.0       63.4
garmo-aix  vpath14   11.238    59.8    187.9      0.0       56.2
garmo-aix  vpath8    11.314    44.6    253.7      0.0       56.6
garmo-aix  vpath2    6.963     44.2    157.5      0.0       34.8
garmo-aix  vpath12   7.731     30.2    256.0      0.0       38.7
garmo-aix  vpath4    3.840     29.4    130.6      0.0       19.2


garmo-aix vpath10 2.842 13.2 215.3 0.0 14.2 -----------------------------------------------------------------------------------------garmo-aix TOTAL READ: 0.00 MB TOTAL WRITTEN: 346.49 MB garmo-aix READ SPEED: 0.00 MB/sec WRITE SPEED: 70.00 MB/sec Example: C-6 The vpath_iostat shell script #!/bin/ksh ##################################################################### # Usage: # vpath_iostat (default: 5 second intervals, 1000 iterations) # vpath_iostat <interval> <count> # # Function: # Gather IOSTATS and report on DS8000 VPATHS instead of disk devices # AIX hdisks # HP-UX [under development ] # SUN [under development ] # Linux [under development ] # # Note: # # A small amount of free space < 1MB is required in /tmp # # Author: Pablo Clifton pablo@compupro.com # Date: August 28, 2005 ##################################################################### ########################################################## # set the default period for number of seconds to collect # iostat data before calculating average period=5 iterations=1000 essfile=/tmp/disk-vpath.out ifile=/tmp/lssdd.out ds=`date +%d%H%M%S` # hname=`hostname` # ofile=/tmp/vstats # wfile=/tmp/wvfile # wfile2=/tmp/wvfile2 # pvcount=`iostat | grep hdisk | wc # File to store output from lssdd command # Input file containing LSSDD info time stamp get Hostname raw iostats work file work file -l | awk '{print $1}'`

############################################# # Create a list of the vpaths this system uses # Format: hdisk DS-vpath # datapath query essmap output MUST BE correct or the IO stats reported # will not be correct ############################################# if [ ! -f $ifile ] then echo "Collecting DS8000 info for disk to vpath map..." datapath query essmap > $ifile fi cat $ifile | awk '{print $2 "\t" $1}' > $essfile

######################################### # ADD INTERNAL SCSI DISKS to RANKS list


######################################### for internal in `lsdev -Cc disk | grep SCSI | awk '{print $1}'` do echo "$internal $internal" >> $essfile done ############################################### # Set interval value or leave as default if [[ $# -ge 1 ]] then period=$1 fi ########################################## # Set <iteration> value if [[ $# -eq 2 ]] then iterations=$2 fi ################################################################# # ess_iostat <interval> <count> i=0 while [[ $i -lt $iterations ]] do iostat $period 2 > $ofile

# run 2 iterations of iostat # first run is IO history since boot

grep hdisk $ofile > $ofile.temp # only gather hdisk info- not cd # other devices tail -n $pvcount $ofile.temp | grep -v "0.0 0" | sort +4 -n -r | head -n 100 > $wfile ########################################### #Converting hdisks to vpaths.... # ########################################### for j in `cat $wfile | awk '{print $1}'` do vpath=`grep -w $j $essfile | awk '{print $2}'` sed "s/$j /$vpath/g" $wfile > $wfile2 cp $wfile2 $wfile done ########################################### # Determine Number of different VPATHS used ########################################### numvpaths=`cat $wfile | awk '{print $1} ' | grep -v hdisk | sort -u | wc -l` dt=`date +"%H:%M %a %d %h %Y"` print "\n$hname: Total VPATHS used: $numvpaths $dt $period sec interval" printf "%s\t%s\t\t%-9s\t%-9s\t%-9s\t%-9s\t%-9s\n" "$hname" "Vpath:" "MBps" "tps" \ "KB/trans" "MB_read" "MB_wrtn" ########################################### # Sum Usage for EACH VPATH and Internal Hdisk ########################################### 0.0 0.0 0 \


for x in `cat $wfile | awk '{ print $1}' | sort -u` do cat $wfile | grep -w $x | awk '{ printf ("%4d\t\t%-9s\t%-9s\t%-9s\t%-9s\t%-9s\n" , \ $1, $2, $3, $4, $5, $6) }' | awk 'BEGIN { } { tmsum=tmsum+$2 } { kbpsum=kbpsum+$3 } { tpsum=tpsum+$4 } { kbreadsum=kbreadsum+$5 } { kwrtnsum=kwrtnsum+$6 } END { if ( tpsum > 0 ) printf ("%-7s\t%4s\t\t%-9.3f\t%-9.1f\t%-9.1f\t%-9.1f\t%-9.1f\n" , hname, \ vpath, kbpsum/1000, tpsum, kbpsum/tpsum , kbreadsum/1000, kwrtnsum/1000) else printf ("%-7s\t%4s\t\t%-9.3f\t%-9.1f\t%-9.1f\t%-9.1f\t%-9.1f\n" , hname, \ vpath, kbpsum/1000, tpsum, "0", kbreadsum/1000, kwrtnsum/1000) }' hname="$hname" vpath="$x" >> $wfile2.tmp done ############################################# # Sort VPATHS/hdisks by NUMBER of TRANSACTIONS ############################################# if [[ -f $wfile2.tmp ]] then cat $wfile2.tmp | sort +3 -n -r rm $wfile2.tmp fi ############################################################## # SUM TOTAL IO USAGE for ALL DISKS/LUNS over INTERVAL ############################################################## #Disks: % tm_act Kbps tps Kb_read Kb_wrtn # field 5 read field 6 written tail -n $pvcount $ofile.temp | grep -v "0.0 0.0 0.0 0" | awk 'BEGIN { } { rsum=rsum+$5 } { wsum=wsum+$6 } END { rsum=rsum/1000 wsum=wsum/1000 printf ("------------------------------------------------------------------------------------------\n") if ( divider > 1 ) { printf ("%-7s\t%14s\t%4.2f\t%s\t%14s\t%4.2f\t%s\n", hname, "TOTAL READ: ", \ rsum, "MB", "TOTAL WRITTEN: ", wsum, "MB") } printf ("%-7s\t%14s\t%4.2f\t%s\t%14s\t%4.2f\t%s\n\n\n", hname, "READ SPEED: ", \ rsum/divider, "MB/sec", "WRITE SPEED: ", wsum/divider, "MB/sec" ) }' hname="$hname" divider="$period" let i=$i+1 done


# rm rm rm rm

$ofile $wfile $wfile2 $essfile

################## THE END ##################

C.5 ds_iostat
The ds_iostat script is a wrapper program for AIX that converts iostat information based on hdisk devices to ranks instead. The ds_iostat script depends on the SDD datapath query essmap command and iostat. The script first builds a map file to list hdisk devices and their associated ranks and then converts iostat information from hdisks to ranks. To run the script, enter:
ds_iostat (Ctrl-C to break out)
Or:
ds_iostat <interval> <iteration>
An example of the ds_iostat output is shown in Example C-7. The ds_iostat shell script is in Example C-8.
Example: C-7 The ds_iostat output
# ds_iostat 5 1
garmo-aix: Total RANKS used: 12    20:01 Sun 16 Feb 2003    5 sec interval
garmo-aix  Ranks:  MBps     tps     KB/trans   MB_read   MB_wrtn
garmo-aix  1403    9.552    71.2    134.2      47.8      0.0
garmo-aix  1603    6.779    53.8    126.0      34.0      0.0
garmo-aix  1703    5.743    43.0    133.6      28.8      0.0
garmo-aix  1503    5.809    42.8    135.7      29.1      0.0
garmo-aix  1301    3.665    32.4    113.1      18.4      0.0
garmo-aix  1601    3.206    27.2    117.9      16.1      0.0
garmo-aix  1201    2.734    22.8    119.9      13.7      0.0
garmo-aix  1101    2.479    22.0    112.7      12.4      0.0
garmo-aix  1401    2.299    20.4    112.7      11.5      0.0
garmo-aix  1501    2.180    19.8    110.1      10.9      0.0
garmo-aix  1001    2.246    19.4    115.8      11.3      0.0
garmo-aix  1701    2.088    18.8    111.1      10.5      0.0
------------------------------------------------------------------------------------------
garmo-aix  TOTAL READ:  430.88 MB       TOTAL WRITTEN:  0.06 MB
garmo-aix  READ SPEED:  86.18 MB/sec    WRITE SPEED:    0.01 MB/sec

Example: C-8 The ds_iostat shell script
#!/bin/ksh
#set -x
#########################
#Usage:
# ds_iostat (default: 5 second intervals, 1000 iterations)
# ds_iostat <interval> <count>
#


# Function: # Gather IOSTATS and report on DS8000 RANKS instead of disk devices # AIX hdisks # HP-UX # SUN # Linux # # Note: # ds_iostat depends on valid rank ids from the datapath query essmap command # # A small amount of free space < 1MB is required in /tmp # # Author: Pablo Clifton pablo@compupro.com # Date: Feb 28, 2003 ##################################################################### ########################################################## # set the default period for number of seconds to collect # iostat data before calculating average period=5 iterations=1000 essfile=/tmp/lsess.out ds=`date +%d%H%M%S` # time stamp hname=`hostname` # get Hostname ofile=/tmp/rstats # raw iostats wfile=/tmp/wfile # work file wfile2=/tmp/wfile2 # work file pvcount=`iostat | grep hdisk | wc -l | awk '{print $1}'` ############################################# # Create a list of the ranks this system uses # Format: hdisk DS-rank # datapath query essmap output MUST BE correct or the IO stats reported # will not be correct ############################################# datapath query essmap|grep -v "*"|awk '{print $2 "\t" $11}' > $essfile ######################################### # ADD INTERNAL SCSI DISKS to RANKS list ######################################### for internal in `lsdev -Cc disk | grep SCSI | awk '{print $1}'` do echo "$internal $internal" >> $essfile done ############################################### # Set interval value or leave as default if [[ $# -ge 1 ]] then period=$1 fi ########################################## # Set <iteration> value if [[ $# -eq 2 ]] then iterations=$2 fi


################################################################# # ess_iostat <interval> <count> i=0 while [[ $i -lt $iterations ]] do iostat $period 2 > $ofile

# run 2 iterations of iostat # first run is IO history since boot

grep hdisk $ofile > $ofile.temp # only gather hdisk info- not cd # other devices tail -n $pvcount $ofile.temp | grep -v "0.0 | sort +4 -n -r | head -n 100 > $wfile ########################################### #Converting hdisks to ranks.... # ########################################### for j in `cat $wfile | awk '{print $1}'` do rank=`grep -w $j $essfile | awk '{print $2}'` sed "s/$j /$rank/g" $wfile > $wfile2 cp $wfile2 $wfile done ########################################### # Determine Number of different ranks used ########################################### numranks=`cat $wfile | awk '{print $1} ' | grep -v hdisk | cut -c 1-4| sort -u -n | wc -l` dt=`date +"%H:%M %a %d %h %Y"` print "\n$hname: Total RANKS used: $numranks $dt $period sec interval" printf "%s\t%s\t\t%-9s\t%-9s\t%-9s\t%-9s\t%-9s\n" "$hname" "Ranks:" "MBps" "tps" \ "KB/trans" "MB_read" "MB_wrtn" ########################################### # Sum Usage for EACH RANK and Internal Hdisk ########################################### for x in `cat $wfile | awk '{ print $1}' | sort -u` do cat $wfile | grep -w $x | awk '{ printf ("%4d\t\t%-9s\t%-9s\t%-9s\t%-9s\t%-9s\n" , \ $1, $2, $3, $4, $5, $6) }' | awk 'BEGIN { } { tmsum=tmsum+$2 } { kbpsum=kbpsum+$3 } { tpsum=tpsum+$4 } { kbreadsum=kbreadsum+$5 } { kwrtnsum=kwrtnsum+$6 } END { if ( tpsum > 0 ) printf ("%-7s\t%4s\t\t%-9.3f\t%-9.1f\t%-9.1f\t%-9.1f\t%-9.1f\n" , hname, \ rank, kbpsum/1000, tpsum, kbpsum/tpsum , kbreadsum/1000, kwrtnsum/1000) else printf ("%-7s\t%4s\t\t%-9.3f\t%-9.1f\t%-9.1f\t%-9.1f\t%-9.1f\n" , hname, \ rank, kbpsum/1000, tpsum, "0", kbreadsum/1000, kwrtnsum/1000) }' hname="$hname" rank="$x" >> $wfile2.tmp done 0.0 0.0 0 0" \


############################################# # Sort RANKS/hdisks by NUMBER of TRANSACTIONS ############################################# if [[ -f $wfile2.tmp ]] then cat $wfile2.tmp | sort +3 -n -r rm $wfile2.tmp fi ############################################################## # SUM TOTAL IO USAGE for ALL DISKS/LUNS over INTERVAL ############################################################## #Disks: % tm_act Kbps tps Kb_read Kb_wrtn # field 5 read field 6 written tail -n $pvcount $ofile.temp | grep -v "0.0 0.0 0.0 'BEGIN { } { rsum=rsum+$5 } { wsum=wsum+$6 } END { rsum=rsum/1000 wsum=wsum/1000 printf ("------------------------------------------------------------------------------------------\n") if ( divider > 1 ) { printf ("%-7s\t%14s\t%4.2f\t%s\t%14s\t%4.2f\t%s\n", hname, "TOTAL READ: ", \ rsum, "MB", "TOTAL WRITTEN: ", wsum, "MB") } printf ("%-7s\t%14s\t%4.2f\t%s\t%14s\t%4.2f\t%s\n\n\n", hname, "READ SPEED: ",\ rsum/divider, "MB/sec", "WRITE SPEED:", wsum/divider, "MB/sec" ) }' hname="$hname" divider="$period" let i=$i+1 done rm $ofile rm $wfile rm $wfile2 rm $essfile ################################## THE END ##########################

0" \

| awk

C.6 test_disk_speeds
Use the test_disk_speeds script to test a 100 MB sequential read against one raw vpath (rvpath0) and record the speed at different times throughout the day to get an average read speed that a rank is capable of in your environment.


You can change the amount of data read, the block size, and the vpath by editing the script and changing these variables:
tsize=100      # MB
bs=128         # KB
vpath=rvpath0  # disk to test
The test_disk_speeds script is shown in Example C-9.
Example: C-9 The test_disk_speeds #!/bin/ksh ########################################################## # test_disk_speeds # Measure disk speeds using dd # # tsize = total test size in MB # bs = block size in KB # testsize= total test size in KB; tsize*1000 # count = equal to the number of test blocks to read which is # testsize/bsize # Author: Pablo Clifton pablo@compupro.com # Date: August 28, 2005 ######################################################### # SET these 2 variables to change the block size and total # amount of data read. Set the vpath to test tsize=100 # MB bs=128 # KB vpath=rvpath0 # disk to test ######################################################### let testsize=$tsize*1000 let count=$testsize/$bs # calculate start time, dd file, calculate end time stime=`perl -e "print time"` dd if=/dev/$vpath of=/dev/null bs="$bs"k count=$count etime=`perl -e "print time"` # get total run time in seconds let totalt=$etime-$stime let speed=$tsize/$totalt printf "$vpath\t%4.1f\tMB/sec\t$tsize\tMB\tbs="$bs"k\n" $speed ########################## THE END ###############################

C.7 lsvscsimap.ksh
When you assign several logical unit numbers (LUNs) from the DS8000 to a Virtual I/O Server (VIOS) and then map those LUNs to the logical partition (LPAR) clients, over time even trivial activities, such as upgrading the Subsystem Device Driver Path Control Module (SDDPCM) device driver, can become challenging. Because of that, we created two scripts: the first script (Example C-10 on page 599) generates a list of the mappings among the LUNs and LPAR clients. The second script, based on that output, creates the commands needed to recreate the mappings among the LUNs and LPAR clients.


To list the configuration, execute the following commands:
# cd /home/padmin
# ./lsvscsimap.ksh
To save the configuration in a file, type the following command:
# ./lsvscsimap.ksh -s test.out
Example: C-10 The lsvscsimap.ksh script #!/usr/bin/ksh93 ######################################################################### # # # Name of script: lsvscsimap.ksh # # Path: /home/padmin # # Node(s): # # Info: Script to generate the configuration mappings among the LUNs # # and LPAR Clients, based on the S/N of disks. # # # # Author: Anderson F. Nobre # # Creation date: 14/03/2008 # # # # Modification data: ??/??/???? # # Modified by: ???????? # # Modifications: # # - ????????????????? # # # ######################################################################### #-----------------------------------------------------------------------# Function: usage #-----------------------------------------------------------------------function usage { printf "Usage: %s [-s value]\n" $0 printf "Where:\n" printf " -s <value>: Generate the mappings among the LUNs \n" printf " and LPAR Clients\n" } #-----------------------------------------------------------------------# Function: lsmapbyvhost #-----------------------------------------------------------------------function lsmapbyvhost { cat <<EOF > /tmp/lsmapbyvhost.awk /^vhost/ {INI=1; VHOST=\$1; VSCSISRVSLOT=substr(\$2, index(\$2, "-C")+2); LPAR=\$3} /^VTD/ {VTDISK=\$2} /^Backing device/ {HDISK=\$3} /^Physloc/ {printf "%s %s %s %s %d\n", VHOST, HDISK, VTDISK, VSCSISRVSLOT, LPAR} EOF ioscli lsmap -all > /tmp/lsmap-all.out cat /tmp/lsmap-all.out | awk -f /tmp/lsmapbyvhost.awk > /tmp/lsmapbyvhost.out } #-----------------------------------------------------------------------Appendix C. UNIX shell scripts


# Function: ppqdhdbysn #-----------------------------------------------------------------------function ppqdhdbysn { cat <<EOF > /tmp/ppqdhdbysn.awk /DEVICE NAME:/ {DEVNUM=\$2; HDISK=\$5} /SERIAL:/ {print \$2, HDISK} EOF pcmpath query device > /tmp/ppqd.out cat /tmp/ppqd.out | awk -f /tmp/ppqdhdbysn.awk > /tmp/ppqdhdbysn.out } #-----------------------------------------------------------------------# Function: lslvbyvg #-----------------------------------------------------------------------function lslvbyvg { lsvg -o | while read vg do ppsize=$(lsvg ${vg} | awk '/PP SIZE:/ {print $6}') lsvg -l ${vg} | egrep "raw|jfs|jfs2" | egrep -v "jfs2log|jfslog" | awk '{print $1, $3}' | while read lv ppnum do lvhdisks=$(lslv -l ${lv} | egrep "hdisk|vpath" | awk '{print $1}' | tr '\n' ',') printf "%s %s %s %s\n" ${vg} ${lv} $(expr ${ppsize} \* ${ppnum}) $(echo ${lvhdisks} | sed -e 's/,$//') done done > /tmp/lslvbyvg.out } #-----------------------------------------------------------------------# Function: lspvidbyhd #-----------------------------------------------------------------------function lspvidbyhd { lspv > /tmp/lspv.out } #-----------------------------------------------------------------------# Declaring global environment variables... #-----------------------------------------------------------------------export PATH=/usr/bin:/etc:/usr/sbin:/usr/ucb:/usr/bin/X11:/sbin:/usr/java14/jre/bin:/usr/java14/bin:/usr/ios/cli:/ usr/ios/utils:/usr/ios/lpm/bin:/usr/ios/oem:/usr/ios/ldw/bin:$HOME ######################################################################### # Main Logic Script... # ######################################################################### SFLAG=


while getopts s:h name do case $name in s ) SFLAG=1 SVAL=$OPTARG ;; h|\? ) usage exit -1 ;; esac done shift $(($OPTIND - 1)) #-----------------------------------------------------------------------# Collecting the necessary information and formating... #-----------------------------------------------------------------------lsmapbyvhost ppqdhdbysn lslvbyvg lspvidbyhd #-----------------------------------------------------------------------# Generating the file with the configuration mappings... #-----------------------------------------------------------------------# The config file of VSCSI Server has the following fields: # Name of VIOS Server typeset VIOSNAME=$(hostname) # List of hdisks where is the LV or hdisk of storage mapped typeset -A ALVHDISKS # Name of VG of LV mapped typeset -A AVGNAME # Name of LV typeset -A ALV # Size of LV/hdisk in MB typeset -A ALVHDSIZE # Virtual SCSI Server Device typeset -A AVSCSISRV # Server slot typeset -A ASRVSLOT # Client LPAR ID typeset -A ACLTLPAR # Client slot. Always "N/A" CLTSLOT="N/A" # Virtual target device typeset -A AVTDEVICE # PVID, case an LV is mapped typeset -A ALUNPVID # S/N, case a LUN is mapped typeset -A ALUNSN cat /tmp/ppqdhdbysn.out | while read LUNSN LVHDISKS do


ALUNSN[${LVHDISKS}]=${LUNSN} done cat /tmp/lslvbyvg.out | while read VGNAME LV LVHDSIZE LVHDISKS do AVGNAME[${LV}]=${VGNAME} ALVHDSIZE[${LV}]=${LVHDSIZE} ALVHDISKS[${LV}]=${LVHDISKS} done cat /tmp/lspv.out | while read LVHDISKS LUNPVID VGNAME VGSTATUS do ALUNPVID[${LVHDISKS}]=${LUNPVID} done if [[ ${SFLAG} -eq 1 ]] then cat /tmp/lsmapbyvhost.out | while read VSCSISRV LVHDISKS VTDEVICE SRVSLOT CLTLPAR do if [[ ${LVHDISKS} == @(*hdisk*) ]] then printf "%s %s %s %s %s %s %s %s %s %s %s %s\n" ${VIOSNAME} ${LVHDISKS} "N/A" "N/A" "N/A" ${VSCSISRV} ${SRVSLOT} ${CLTLPAR} ${CLTSLOT} ${VTDEVICE} ${ALUNPVID[${LVHDISKS}]} ${ALUNSN[${LVHDISKS}]} else printf "%s %s %s %s %s %s %s %s %s %s %s %s\n" ${VIOSNAME} ${ALVHDISKS[${LVHDISKS}]} ${AVGNAME[${LVHDISKS}]} ${LVHDISKS} ${ALVHDSIZE[${LVHDISKS}]} ${VSCSISRV} ${SRVSLOT} ${CLTLPAR} ${CLTSLOT} ${VTDEVICE} "N/A" "N/A" fi done | sort -k6 > ${SVAL} else cat /tmp/lsmapbyvhost.out | while read VSCSISRV LVHDISKS VTDEVICE SRVSLOT CLTLPAR do if [[ ${LVHDISKS} == @(*hdisk*) ]] then printf "%s %s %s %s %s %s %s %s %s %s %s %s\n" ${VIOSNAME} ${LVHDISKS} "N/A" "N/A" "N/A" ${VSCSISRV} ${SRVSLOT} ${CLTLPAR} ${CLTSLOT} ${VTDEVICE} ${ALUNPVID[${LVHDISKS}]} ${ALUNSN[${LVHDISKS}]} else printf "%s %s %s %s %s %s %s %s %s %s %s %s\n" ${VIOSNAME} ${ALVHDISKS[${LVHDISKS}]} ${AVGNAME[${LVHDISKS}]} ${LVHDISKS} ${ALVHDSIZE[${LVHDISKS}]} ${VSCSISRV} ${SRVSLOT} ${CLTLPAR} ${CLTSLOT} ${VTDEVICE} "N/A" "N/A" fi done | sort -k6 fi

C.8 mkvscsimap.ksh
Suppose that you need to remove the configuration mappings among the LUNs and LPAR clients, remove the hdisk devices, update the multipath I/O (MPIO) device driver, and then rediscover the hdisk devices. You still need to rebuild the mappings among the LUNs and LPARs, but the names of the hdisk devices are not in the same order after the upgrade. Based on the saved configuration file, you can execute the following command to recreate the mappings with the script shown in Example C-11:
# ./mkvscsimap.ksh -c test.out -s test2.out -r


Example: C-11 The mkvscsimap.ksh script #!/usr/bin/ksh93 ######################################################################### # # # Name of script: mkvscsimap.ksh # # Path: /home/padmin # # Node(s): # # Info: Script to create the commands to map the LUNs and vhosts, # # based on S/N of disks. # # # # Author: Anderson F. Nobre # # Creation date: 14/03/2008 # # # # Modification date: ??/??/???? # # Modified by: ???????? # # Modifications: # # - ????????????????? # # # ######################################################################### #-----------------------------------------------------------------------# Function: usage #-----------------------------------------------------------------------function usage { printf "Usage: %s [-c value] [-s value] [-x] [-r]\n" $0 printf "Where:\n" printf " -c <value>: Configuration file of VSCSI mappings\n" printf " -s <value>: Generates the script with the commands to map the LUNs in the VSCSI devices\n" printf " -x: Creates the mappings of LUNs in the VSCSI devices\n" printf " -r: Recreates the mappings with the new hdisks\n" } #-----------------------------------------------------------------------# Function: ppqdhdbysn #-----------------------------------------------------------------------function ppqdhdbysn { cat <<EOF > /tmp/ppqdhdbysn.awk /DEVICE NAME:/ {DEVNUM=\$2; HDISK=\$5} /SERIAL:/ {print \$2, HDISK} EOF pcmpath query device > /tmp/ppqd.out cat /tmp/ppqd.out | awk -f /tmp/ppqdhdbysn.awk > /tmp/ppqdhdbysn.out } #-----------------------------------------------------------------------# Declaring the global environment variables... #-----------------------------------------------------------------------export PATH=/usr/bin:/etc:/usr/sbin:/usr/ucb:/usr/bin/X11:/sbin:/usr/java14/jre/bin:/usr/java14/bin:/usr/ios/cli:/ usr/ios/utils:/usr/ios/lpm/bin:/usr/ios/oem:/usr/ios/ldw/bin:$HOME


######################################################################### # Main Logic of Script... # ######################################################################### # If there are no arguments, then print the usage... if [[ $# -eq 0 ]] then usage exit -1 fi CFLAG= SFLAG= XFLAG= RFLAG= while getopts c:s:xr name do case $name in c ) CFLAG=1 CVAL=$OPTARG ;; s ) SFLAG=1 SVAL=$OPTARG ;; x ) XFLAG=1 ;; r ) RFLAG=1 ;; h|\? ) usage exit -1 ;; esac done shift $(($OPTIND - 1)) # If the configuration file won't be informed, then finishes with error... if [[ $CFLAG -eq 0 ]] then printf "The configuration file needs to be informed!!!\n" usage exit -1 fi # If the options 's' and 'x' has been selected, then finishes with error... if [[ $SFLAG -eq 1 && $XFLAG -eq 1 ]] then printf "The options '-s' and '-x' can't be used together!!!\n" usage exit -1 fi


if [[ $SFLAG -eq 1 && $RFLAG -eq 0 ]] then cat $CVAL | while read VIOSNAME LVHDISKS AVGNAME LV LVHDSIZE VSCSISRV SRVSLOT CLTLPAR CLTSLOT VTDEVICE LUNPVID LUNSN do if [[ ${AVGNAME} == @(N/A) ]] then printf "mkvdev -vdev ${LVHDISKS} -vadapter ${VSCSISRV} -dev ${VTDEVICE}\n" else printf "mkvdev -vdev ${LV} -vadapter ${VSCSISRV} -dev ${VTDEVICE}\n" fi done > ${SVAL} elif [[ $SFLAG -eq 1 && $RFLAG -eq 1 ]] then typeset -A ALUNHD ppqdhdbysn cat /tmp/ppqdhdbysn.out | while read LUNSN LUNHD do ALUNHD[${LUNSN}]=${LUNHD} done cat $CVAL | while read VIOSNAME LVHDISKS AVGNAME LV LVHDSIZE VSCSISRV SRVSLOT CLTLPAR CLTSLOT VTDEVICE LUNPVID LUNSN do if [[ ${AVGNAME} == @(N/A) ]] then printf "mkvdev -vdev ${ALUNHD[${LUNSN}]} -vadapter ${VSCSISRV} -dev vt${ALUNHD[${LUNSN}]}\n" else printf "mkvdev -vdev ${LV} -vadapter ${VSCSISRV} -dev ${VTDEVICE}\n" fi done > ${SVAL} fi if [[ $XFLAG -eq 1 && $RFLAG -eq 0 ]] then cat $CVAL | while read VIOSNAME LVHDISKS AVGNAME LV LVHDSIZE VSCSISRV SRVSLOT CLTLPAR CLTSLOT VTDEVICE LUNPVID LUNSN do if [[ ${AVGNAME} == @(N/A) ]] then printf "mkvdev -vdev ${LVHDISKS} -vadapter ${VSCSISRV} -dev ${VTDEVICE}\n" # ioscli mkvdev -vdev ${LVHDISKS} -vadapter ${VSCSISRV} -dev ${VTDEVICE} else printf "mkvdev -vdev ${LV} -vadapter ${VSCSISRV} -dev ${VTDEVICE}\n" # ioscli mkvdev -vdev ${LV} -vadapter ${VSCSISRV} -dev ${VTDEVICE} fi done elif [[ $XFLAG -eq 1 && $RFLAG -eq 1 ]] then typeset -A ALUNHD ppqdhdbysn cat /tmp/ppqdhdbysn.out | while read LUNSN LUNHD do


ALUNHD[${LUNSN}]=${LUNHD} done cat $CVAL | while read VIOSNAME LVHDISKS AVGNAME LV LVHDSIZE VSCSISRV SRVSLOT CLTLPAR CLTSLOT VTDEVICE LUNPVID LUNSN do if [[ ${AVGNAME} == @(N/A) ]] then printf "mkvdev -vdev ${ALUNHD[${LUNSN}]} -vadapter ${VSCSISRV} -dev vt${ALUNHD[${LUNSN}]}\n" # ioscli mkvdev -vdev ${ALUNHD[${LUNSN}]} -vadapter ${VSCSISRV} -dev vt${ALUNHD[${LUNSN}]} else printf "mkvdev -vdev ${LV} -vadapter ${VSCSISRV} -dev ${VTDEVICE}\n" # ioscli mkvdev -vdev ${LV} -vadapter ${VSCSISRV} -dev ${VTDEVICE} fi done fi


Appendix D. Post-processing scripts
This appendix provides several scripts for post-processing and correlating data from different sources.


D.1 Introduction
In our experience, it is often necessary to correlate data from different data sources. There are several methods to correlate data from different sources, including manual correlation, Microsoft Excel macros, or shell scripts. This appendix demonstrates how to use Perl scripts to correlate and post-process data from multiple sources. While these scripts can be converted to run on UNIX fairly easily, we assume you will be running them on a Windows system. In this appendix, we describe:
Dependencies
Running the scripts
perfmon-essmap.pl
iostat_aix53_essmap.pl
iostat_sun-mpio.pl
tpc_volume-lsfbvol-showrank.pl

Note: The purpose of these scripts is not to provide a toolkit that addresses every possible configuration scenario; rather, it is to demonstrate several of the possibilities available.

D.2 Dependencies
In order to execute the scripts described in this section, you need to prepare your system.

Software
You need to have the following software installed:
Active Perl 5.6.x or later. At the time of writing this book, you can download Perl from the following Web sites:
http://www.download.com/3000-2229_4-10634113.html
http://www.activestate.com/Products/activeperl/features.plex
While it is not absolutely necessary, users preferring a UNIX-style bash shell can download and install CYGWIN. All the samples provided in this chapter assume that CYGWIN is installed. At the time of writing this book, you can download CYGWIN from:
http://www.cygwin.com/

Script location
There is nothing magical about the location of the script; however, the shell from which you will run the script must be aware of the location of the script. We suggest placing the scripts that you plan on reusing in a common directory, such as either:
C:\Performance\scripts
C:\Performance\bin
For simplicity, we place the script in the same directory as the performance and capacity configuration data.

Script creation
In order to run these scripts, you need to copy the entire contents of the script into a file and name the file with a .pl suffix. You can copy the contents and name the file in Notepad, but you need to ensure that the save as type: option is set to All files as shown in Figure D-1 on page 609.


Figure D-1 Using Notepad to save scripts

D.2.1 Running the scripts


In order to run the scripts, you need to collect performance and configuration data. This section describes the logical steps required for post-processing the perfmon and essmap data. The other scripts have a similar logical flow:
1. Collect Windows server performance statistics as described in Appendix B, Windows server performance log collection on page 571.
2. Collect Windows server Subsystem Device Driver (SDD) output as described in 10.8.4, Collecting configuration data on page 296.
3. Place the output of the performance data and the configuration data in the same directory.
4. Open a shell.
5. Run the script. The script takes two input parameters. The first parameter is the file name for the Windows performance data, and the second parameter is the file name for the datapath query essmap output. Execute the script as shown in Figure D-2.

Figure D-2 Run perfmon-essmap.pl

6. Open the file in Excel. In this case, we redirected the output to Win2003_ITSO_Rotate_Volumes_RAID0_RANDOM.csv. Note: If your file is blank or only contains headers, you need to determine if the issue is with the input files, or if you damaged the script when creating the script. For interpretation and analysis of Windows data, refer to 10.8.6, Analyzing performance data on page 297.


By default, all the scripts will print to standard out. If you want to place the output in a file, simply redirect the output to a file.
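For example, assuming that the Windows performance log was exported as Disk_Performance.csv and that the datapath query essmap output was saved as essmap.txt (both file names are hypothetical), the correlated output can be captured in a file like this:

C:\Performance> perl perfmon-essmap.pl Disk_Performance.csv essmap.txt > correlated_output.csv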

D.2.1.1 perfmon-essmap.pl
The purpose of the perfmon-essmap.pl script (Example D-1 on page 610) is to correlate Windows server disk performance data with SDD datapath query essmap data. The rank column is invalid for DS8000s with multi-rank extent pools. The script takes two input arguments, and no flags are required. The first parameter has to be the perfmon data, and the second parameter has to be the datapath query essmap data. An example of the output of perfmon-essmap.pl is shown in Figure D-3.

Figure D-3 Sample perfmon-sdd output

Several of the columns have been hidden in order to fit the example output into the space provided.
Example: D-1 The perfmon-essmap.pl script #!C:\Perl\bin\Perl.exe ############################################################### # Script: perfmon-essmap.pl # Purpose: Correlate perfmon data with datapath query essmap # and create a normalized data for input into spreadsheet # The following perfmon counters are needed: # Avg Disk sec/Read # Avg Disk sec/Write # Disk Reads/sec # Disk Writes/sec # Disk Total Response Time - Calculated to sum avg rt * avg i/o rate # Disk Read Queue # Disk Write Queue # Disk Read KB/sec - calc # Disk Write KB/sec - calc ############################################################### $file = $ARGV[0]; # Set perfmon csv output to 1st arg open(PERFMON,$file) or die "cannot open perfmon $file\n"; $datapath = $ARGV[1];# Set 'datapath query essmap' output to 2nd arg open(DATAPATH,$datapath) or die "cannot open datapath $datapath\n"; ######################################################################## # Read in essmap and create hash with hdisk as key and LUN SN as value # ######################################################################## while (<DATAPATH>) { if (/^$/) { next; }# Skip empty lines if (/^Disk /) { next; }# Skip empty lines if (/^--/) { next; }# Skip empty lines @line = split(/\s+|\t/,$_);# Build temp array of current line


$lun = $line[4];# Set lun ID $path = $line[1];# Set path $disk = $line[0];# Set disk# $hba = $line[2];# Set hba port - use sdd gethba.exe to get wwpn $size = $line[7];# Set size in gb $lss = $line[8];# Set ds lss $vol = $line[9];# Set DS8K volume $rank = $line[10];# Set rank - DOES NOT WORK FOR ROTATE VOLUME OR ROTATE extent $c_a = $line[11]; # Set the Cluster and adapter accessing rank $dshba = $line[13];# Set shark hba - this is unusable with perfmon which isn't aware of paths $dsport = $line[14];# Set shark port physical location - this is unusable with perfmon which isn't aware of paths $lun{$disk} = $lun;# Set the LUN in hash with disk as key for later lookup $disk{$lun} = $disk;# Set vpath in hash with lun as key $lss{$lun} = $lss;# Set lss in hash with lun as key $rank{$lun} = $rank;# Set rank in hash with lun as key $dshba{$lun} = $dshba;# Set dshba in hash with lun as key - this is unusable with perfmon which isn't aware of paths $dsport{$lun} = $dsport;# Set dsport in hash with lun as key - this is unusable with perfmon which isn't aware of paths if (length($lun) > 8) { $ds = substr($lun,0,7);# Set the DS8K serial } else { $ds = substr($lun,3,5);# Set the ESS serial } $ds{$lun} = $ds; # set ds8k in hash with LUN as key } ################ # Print Header # ################ print "DATE,TIME,Subsystem Serial,Rank,LUN,Disk,Disk Reads/sec, Avg Read RT(ms),Disk Writes/sec,Avg Write RT(ms),Avg Total Time,Avg Read Queue Length,Avg Write Queue Length,Read KB/sec,Write KB/sec\n"; ################################################################################################## # Read in perfmon and create record for each hdisk and split the first column into date and time # ################################################################################################## while (<PERFMON>) { if (/^$/) { next; } # Skip empty lines if (/^--/) { next; } # Skip empty lines if (/PDH-CSV/) { @header = split(/,/,$_); # Build header array shift(@header); # Remove the date element unshift(@header,"Date","Time"); # Add in date, time next; # Go to next line } @line = split(/\t|,/,$_); # Build temp array for current line @temp = split(/\s|\s+|\"/,$line[0]);# Split the first element into array $date = $temp[1]; # Set date to second element of array $time = $temp[2]; # Set time to third element of array shift(@line); # Remove old date unshift(@line,"\"$date\"","\"$time\"");# Add in the new date and time chomp(@line); # Remove carriage return at end for ($i=0; $i<=$#line; $i++) { # Loop through each element in line $line[$i] =~ s/"//g; # Remove double quotes from input line $header[$i] =~ s/"//g; # Remove double quotes from header array @arr = split(/\\/,$header[$i]);# Split current header array element $hostname = $arr[2]; # Extract hostname from header $disk = $arr[3]; # Set disk to the 4th element $counter = $arr[4]; # Set counter to 5th element if ($disk =~ /Physical/) { # If we find Physical Object


      if ($disk =~ /Total/) { next; }        # If disk instance is Total then skip
      @tmpdisk = split(/Physical|\(|\s|\)/,$disk);       # Create temp array of disk name
      $newdisk = $tmpdisk[1] . $tmpdisk[2];  # Create newly formatted disk name to match SDD output
      if ($counter =~ /Avg. Disk sec\/Read/) {           # If counter is Avg. Disk sec/Read
        $diskrrt{$date}{$time}{$newdisk} = $line[$i]*1000;       # Then set disk read response time hash
      }
      if ($counter =~ /Avg. Disk sec\/Write/) {          # If counter is Avg. Disk sec/Write
        $diskwrt{$date}{$time}{$newdisk} = $line[$i]*1000;       # Then set disk write response time hash
      }
      if ($counter =~ /Disk Reads\/sec/) {               # If counter is Disk Reads/sec
        $diskreads{$date}{$time}{$newdisk} = $line[$i];  # Then set Disk Reads/sec hash
      }
      if ($counter =~ /Disk Writes\/sec/) {              # If counter is Disk Writes/sec
        $diskwrites{$date}{$time}{$newdisk} = $line[$i]; # Then set Disk Writes/sec hash
      }
      if ($counter =~ /Avg. Disk Read Queue Length/) {   # If counter is Disk Read Queue Length
        $diskrql{$date}{$time}{$newdisk} = $line[$i];    # Then set Disk Read Queue Length hash
      }
      if ($counter =~ /Avg. Disk Write Queue Length/) {  # If counter is Disk Write Queue Length
        $diskwql{$date}{$time}{$newdisk} = $line[$i];    # Then set Disk Write Queue Length hash
      }
      if ($counter =~ /Disk Read Bytes\/sec/) {          # If counter is Disk Read Bytes/sec
        $diskrkbs{$date}{$time}{$newdisk} = $line[$i]/1024;      # Then calc kb and set in hash
      }
      if ($counter =~ /Disk Write Bytes\/sec/) {         # If counter is Disk Write Bytes/sec
        $diskwkbs{$date}{$time}{$newdisk} = $line[$i]/1024;      # Then calc kb and set in hash
      }
    }
  }
}
### Print out the data here - key is date, time, disk
while (($date,$times) = each(%diskrrt)) {    # Loop through each date-time hash
  while (($time,$disks) = each(%$times)) {   # Nested loop through each time-disk hash
    while (($disk,$value) = each(%$disks)) { # Nested loop through disk-value hash
      $diskrrt = $diskrrt{$date}{$time}{$disk};          # Set shortnames for easier print
      $diskwrt = $diskwrt{$date}{$time}{$disk};
      $diskreads = $diskreads{$date}{$time}{$disk};
      $diskwrites = $diskwrites{$date}{$time}{$disk};
      $total_time = ($diskrrt*$diskreads)+($diskwrt*$diskwrites);
      $diskrql = $diskrql{$date}{$time}{$disk};
      $diskwql = $diskwql{$date}{$time}{$disk};
      $diskrkbs = $diskrkbs{$date}{$time}{$disk};
      $diskwkbs = $diskwkbs{$date}{$time}{$disk};
      $lun = $lun{$disk};                    # Lookup lun for current disk
      print "$date,$time,$ds{$lun},$rank{$lun},$lun,$disk,$diskreads,$diskrrt,$diskwrites,$diskwrt,$total_time,$diskrql,$diskwql,$diskrkbs,$diskwkbs\n";
    }
  }
}
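The script reads the perfmon CSV export as its first argument and the datapath query essmap output as its second argument, and it writes the correlated records to stdout. A typical invocation therefore looks like the following sketch; the file names are only placeholders for your own capture files:

   C:\> perl perfmon-essmap.pl perfmon_capture.csv essmap.out > perfmon_essmap.csv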


D.2.1.2 iostat_aix53_essmap.pl
The purpose of the iostat_aix53_essmap.pl script (Example D-2 on page 613) is to correlate AIX 5.3 iostat -D data with SDD datapath query essmap output for analysis. Beginning with AIX 5.3, the iostat command can continuously collect read and write response times; prior to AIX 5.3, filemon was required to collect disk read and write response times. Example output of iostat_aix53_essmap.pl is shown in Figure D-4.
[Figure D-4 sample data: one comma-separated record per hdisk and interval, with the columns TIME, STORAGE, DS HBA, DS PORT, LSS, RANK, LUN, VPATH, HDISK, #BUSY, KBPS, KB READ, and KB WRITE.]

Figure D-4 The iostat_aix53_essmap.pl sample output

Several of the columns have been hidden in order to fit the example output into the space provided.
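Because the script writes its report to stdout, a typical invocation (the file names below are only placeholders for your own capture files) redirects the output to a file that you can then open in a spreadsheet:

   C:\> perl iostat_aix53_essmap.pl -i iostat_D.out -e essmap.out > iostat_essmap.csv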

Example D-2   iostat_aix53_essmap.pl

#!C:\Perl\bin\Perl.exe
###############################################################
# Script: iostat_aix53_essmap.pl
# Purpose: Process AIX 5.3 iostat disk and essmap
#          normalize data for input into spreadsheet
###############################################################
use Getopt::Std;
#######################################################################################
main();                           # Start main logic
sub main{
  parseparms();                   # Get input parameters
  readessmap($essmap);            # Invoke routine to read datapath query essmap output
  readiostat($iostat);            # Invoke routine to read iostat
}
############## Parse input parameters ##################################################
sub parseparms {
  $rc = getopts('i:e:');          # Define inputs
  $iostat = $opt_i;               # Set value for iostat
  $essmap = $opt_e;               # Set value for dp query essmap
  (defined $iostat) or usage();   # If iostat is not set exit
  (defined $essmap) or usage();   # If essmap is not set exit
}
############# Usage ###################################################################
sub usage {
  print "\nUSAGE: iostat_aix53_essmap.pl [-ie] -h";
  print "\n -i The file containing the iostat output";
  print "\n -e The file containing the datapath query essmap output";
  print "\n ALL ARGUMENTS ARE REQUIRED!\n";
  exit ;
}
### Read in pcmpath and create hash with hdisk as key and LUN SN as value #############
$file = $ARGV[0];                 # Set iostat -D for 1st arg
$essmap = $ARGV[1];               # Set 'datapath query essmap' output to 2nd arg


### Read in essmap and create hash with hdisk as key and LUN SN as value
sub readessmap($essmap) {
  open(ESSMAP,$essmap) or die "cannot open $essmap\n";
  while (<ESSMAP>) {
    if (/^$/) { next; }           # Skip empty lines
    if (/^--/) { next; }          # Skip separator lines
    if (/^Disk/) { next; }        # Skip header
    @line = split(/\s+|\t/,$_);   # Build temp array
    $lun = $line[4];              # set lun
    $hdisk = $line[1];            # set hdisk
    $vpath = $line[0];            # set vpath
    $hba = $line[3];              # set hba
    $lss = $line[8];              # set lss
    $rank = $line[10];            # set rank
    $dshba = $line[13];           # set shark hba
    $dsport = $line[14];          # set shark port
    $vpath{$lun} = $vpath;        # Set vpath in hash
    $lss{$lun} = $lss;            # Set lss in hash
    $rank{$lun} = $rank;          # Set rank in hash
    $dshba{$hdisk} = $dshba;      # Set dshba in hash
    $dsport{$hdisk} = $dsport;    # Set dsport in hash
    $lun{$hdisk} = $lun;          # Hash with hdisk as key and lun as value
    if (length($lun) > 8) {
      $ds = substr($lun,0,7);     # Set ds serial to first 7 chars
    } else {
      $ds = substr($lun,3,5);     # or this is ESS and only 5 chars
    }
    $ds{$lun} = $ds;              # set the ds serial in a hash
  }
}
### Read in iostat and create record for each hdisk
sub readiostat($iostat) {
  ### Print Header
  print "TIME,STORAGE SN,DS HBA,DS PORT,LSS,RANK,LUN,VPATH,HDISK,#BUSY,KBPS,KB READ PS,KB WRITE PS,TPS,RPS,READ_AVG_SVC,READ_MIN_SVC,READ_MAX_SVC,READ_TO,WPS,WRITE_AVG_SVC,WRITE_MIN_SVC,WRITE_MAX_SVC,WRITE_TO,AVG QUE,MIN QUE, MAX QUE,QUE SIZE\n";
  $time = 0;                      # Set time variable to 0
  $cnt = 0;                       # Set count to zero
  open(IOSTAT,$iostat) or die "cannot open $iostat\n";    # Open iostat file
  while (<IOSTAT>) {              # Read in iostat file
    if ($time == 0) {
      if (/^Mon|^Tue|^Wed|^Thu|^Fri|^Sat|^Sun/) {         # This only works if a time stamp was in file
        $date_found = 1;          # Set flag for date
        @line = split(/\s+|\s|\t|,|\//,$_);   # build temp array
        $date = $line[1] . " " . $line[2] . " " . $line[5];       # Create date
        $time = $line[3];         # Set time
        $newtime = $time;         # Set newtime
        $interval = 60;           # Set interval to 60 seconds
        next;
      } else {
        $time++;                  # Set time counter to 1 if no time is in file
      }
    }
    if (/^#|System/) { next; }    # Skip notes
    if (/^hdisk|^dac/) {          # If line starts with hdisk
      @line = split(/\s+|\t|,|\//,$_);        # build temp array
      $pv = $line[0];             # Set physical disk to 1st element
      $xfer = 1;                  # we are in the transfer stanza now


      next;
    }
    if (/-------/) {
      $cnt++;                     # count the number of stanzas
      if ($date_found == 1) {     # If date flag is set
        $newtime = gettime($time,$interval,$cnt);         # get time based on original time, cnt, and interval
      } else {
        $time++;
        $newtime = 'Time_' . $time;           # Set a relative time stamp
      }
      next;                       # Go to next line
    }
    if ($xfer == 1) {             # If in transfer section
      @line = split(/\s+|\t|,|\//,$_);        # build temp array
      $busy{$pv} = $line[1];      # Set busy time
      if ($line[2] =~ /[K]/) {    # If K is in value
        $line[2] =~ s/K//g;       # remove K
        $kbps{$pv} = $line[2];    # Set value
      } elsif ($line[2] =~ /[M]/) {           # If M
        $line[2] =~ s/M//g;       # remove M
        $kbps{$pv} = $line[2]*1024;           # Multiply value by 1024
      } else {
        $kbps{$pv} = $line[2]/1024;           # Else its bytes so div by 1024
      }
      $tps{$pv} = $line[3];       # Set transfer per sec
      if ($line[4] =~ /[K]/) {    # If K then remove K
        $line[4] =~ s/K//g;
        $bread{$pv} = $line[4];
      } elsif ($line[4] =~ /[M]/) {           # If Mbytes convert to K
        $line[4] =~ s/M//g;
        $bread{$pv} = $line[4]*1024;
      } else {
        $bread{$pv} = $line[4]/1024;          # Else bytes and convert to K
      }
      if ($line[5] =~ /[K]/) {    # Same logic as above
        $line[5] =~ s/K//g;
        $bwrtn{$pv} = $line[5];
      } elsif ($line[5] =~ /[M]/) {
        $line[5] =~ s/M//g;
        $bwrtn{$pv} = $line[5]*1024;
      } else {
        $bwrtn{$pv} = $line[5]/1024;
      }
      $xfer = 0;
      next;
    }
    if (/read:/) { $read = 1; next; }
    if ($read == 1) {             # If read flag is set
      @line = split(/\s+|\t|,|\//,$_);        # build temp array
      $rps{$pv} = $line[1];       # set reads per second
      $rrt{$pv} = $line[2];       # set read rt
      $rminrt{$pv} = $line[3];    # set read min rt
      if ($line[4] =~ /[S]/) {
        $line[4] =~ s/S//g;
        $rmaxrt{$pv} = $line[4]*1000;         # set read max rt, convert from seconds if necessary
      } else {
        $rmaxrt{$pv} = $line[4];
      }
      $rto{$pv} = $line[5];
      $read = 0;


      next;
    }
    if (/write:/) { $write = 1; next; }       # If write flag is set
    if ($write == 1) {
      @line = split(/\s+|\t|,|\//,$_);        # build temp array
      $wps{$pv} = $line[1];       # set writes per sec
      $wrt{$pv} = $line[2];       # set write rt
      $wminrt{$pv} = $line[3];    # set min write rt
      if ($line[4] =~ /[S]/) {
        $line[4] =~ s/S//g;
        $wmaxrt{$pv} = $line[4]*1000;         # set max rt and convert from secs if necessary
      } else {
        $wmaxrt{$pv} = $line[4];
      }
      $wto{$pv} = $line[5];
      $write = 0;
      next;
    }
    if (/queue:/) { $queue = 1; next; }       # If queue flag is set
    if ($queue == 1) {
      @line = split(/\s+|\t|,|\//,$_);        # build temp array
      $qt{$pv} = $line[1];        # set queue time
      $qmint{$pv} = $line[2];     # set queue min time
      $qmaxt{$pv} = $line[3];     # Set queue max time
      $qsize{$pv} = $line[4];
      $queue = 0;
      $time{$pv} = $newtime;
      $lun = $lun{$pv};
      print "$time{$pv},$ds{$lun},$dshba{$pv},$dsport{$pv},$lss{$lun},$rank{$lun},$lun,$vpath{$lun},$pv,$busy{$pv},$kbps{$pv},$bread{$pv},$bwrtn{$pv},$tps{$pv},$rps{$pv},$rrt{$pv},$rminrt{$pv},$rmaxrt{$pv},$rto{$pv},$wps{$pv},$wrt{$pv},$wminrt{$pv},$wmaxrt{$pv},$wto{$pv},$qt{$pv},$qmint{$pv},$qmaxt{$pv},$qsize{$pv}\n";
      next;
    }
    if (/^--/) { $time++; }
  }
}
############# Convert Time ####################################################################
sub gettime() {
  my $time = $_[0];
  my $interval = $_[1];
  my $cnt = $_[2];
  my $hr = substr($time,0,2);
  my $min = substr($time,3,2);
  my $sec = substr($time,6,2);
  $hrsecs = $hr * 3600;
  $minsecs = $min * 60;
  my $addsecs = $interval * $cnt;
  my $totsecs = $hrsecs + $minsecs + $sec + $addsecs;
  $newhr = int($totsecs/3600);
  $newsecs = $totsecs%3600;
  $newmin = int($newsecs/60);
  $justsecs = $newsecs%60;
  $newtime = $newhr . ":" . $newmin . ":" . $justsecs;
  return $newtime;
}


D.2.1.3 iostat_sun-mpio.pl
The purpose of the iostat_sun-mpio.pl script (Example D-3 on page 617) is to reformat Solaris iostat -xn data so that it can be analyzed in a spreadsheet. The LUN identification works properly only with Solaris systems running MPxIO; with MPxIO, the iostat -xn output shows only one disk per LUN. The iostat_sun-mpio.pl script takes only one argument, the iostat file; no flags are required. An example of the output is shown in Figure D-5.
[Figure D-5 sample data: comma-separated output with the columns DATE, TIME, LUN, HDISK, #READS, AVG_READ_KB, WRITES, AVG_WRITE_KB, AVG_WAIT, and AVG_SVC.]

Figure D-5 Solaris iostat -xn with MPxIO output
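The script reads the iostat capture file given as its only argument and writes its report to stdout, so a typical invocation (the file names are placeholders) looks like this:

   C:\> perl iostat_sun-mpio.pl iostat_xn.out > iostat_sun.csv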

Example D-3   The iostat_sun-mpio.pl script

#!C:\Perl\bin\Perl.exe
###############################################################
# Script: iostat_sun-mpio.pl
# Purpose: Correlate disk# with Sun iostat and extract LUN
#          normalize data for input into spreadsheet
###############################################################
use Getopt::Std;
$file = $ARGV[0];                 # Set iostat -xcn to 1st arg
open(IOSTAT,$file) or die "cannot open $file\n";          # Open file
### Print Header
print "DATE,TIME,LUN,HDISK,#READS,AVG_READ_KB,WRITES,AVG_WRITE_KB,AVG_WAIT,AVG_SVC\n";
### Read in iostat and create record for each hdisk
$cnt=0;
$time=0;
while (<IOSTAT>) {
  if ($time == 0) {
    if (/^Mon|^Tue|^Wed|^Thu|^Fri|^Sat|^Sun/) {
      @line = split(/\s+|\s|\t|,|\//,$_);     # build temp array
      $date = $line[1] . " " . $line[2] . " " . $line[5];
      $time = $line[3];
      $interval = 60;
      next;
    } else {
      $date = 'Date Not Available';
      $time++;                    # Set time counter to 1 if no time is in file
    }
  }
  if (/tty|tin|md|r\/s|^$|#/) { next;}
  if (/extended/) {
    $cnt++;                       # count the number of stanzas
    if ($date_found == 1) {       # If date flag is set
      $newtime = gettime($time,$interval,$cnt);           # get time based on original time, cnt, and interval
    } else {
      $newtime = 'Time' . $time;  # Set a relative time stamp
      $time++;
    }


    next;
  }
  @line = split(/\s+|\t|,|\//,$_);            # Build temp array for each line
  $pv = $line[11];                # Set pv to 11th element
  $lun = substr($pv,31,4);        # Set lun to substring of pv
  $date{$pv} = $date;             # Set date to date hash
  $time{$pv} = $newtime;          # Set time to time hash
  $reads{$pv} = $line[1];         # Set read hash
  $writes{$pv} = $line[2];        # Set write hash
  $readkbps{$pv} = $line[3];      # Set read kbps
  $writekbps{$pv} = $line[4];     # Set write kbps
  $avg_wait{$pv} = $line[7];      # Set avg wait time
  $avg_svc{$pv} = $line[8];       # Set avg response time
  print "$date{$pv},$time{$pv},$lun,$pv,$reads{$pv},$readkbps{$pv},$writes{$pv},$writekbps{$pv},$avg_wait{$pv},$avg_svc{$pv}\n";
}

sub gettime() {
  my $time = $_[0];
  my $interval = $_[1];
  my $cnt = $_[2];
  my $hr = substr($time,0,2);
  my $min = substr($time,3,2);
  my $sec = substr($time,6,2);
  $hrsecs = $hr * 3600;
  $minsecs = $min * 60;
  my $addsecs = $interval * $cnt;
  my $totsecs = $hrsecs + $minsecs + $sec + $addsecs;
  $newhr = int($totsecs/3600);
  $newsecs = $totsecs%3600;
  $newmin = int($newsecs/60);
  $justsecs = $newsecs%60;
  $newtime = $newhr . ":" . $newmin . ":" . $justsecs;
  return $newtime;
}

D.2.1.4 tpc_volume-lsfbvol-showrank.pl
The tpc_volume-lsfbvol-showrank.pl script correlates TotalStorage Productivity Center volume batch reports with rank data obtained from the DSCLI. While TotalStorage Productivity Center provides many excellent drill-down and correlation features, you might want the data in a spreadsheet for further analysis and graphing. Unfortunately, when you export volume performance data using Batch Reports (refer to 8.5.4, Batch reports on page 232), the relationships between the volumes and the array sites are lost. The purpose of this script is to re-establish the relationship between the array site, the DS8000 extent pool, and the volume. The script (Example D-5 on page 619) requires data obtained from the DSCLI to establish these relationships. The script also splits the time stamp into two fields, date and time, and converts the time field into a 24-hour time stamp.


The script requires four parameters as shown in Example D-4.


Example D-4   The tpc_volume-lsfbvol-showrank.pl arguments

$ c:/Performance/bin/tpc_volume-lsfbvol-showrank.pl

USAGE: tpc_volume-lsfbvol-showrank.pl [-fart] -h
 -f The file containing the lsfbvol output
 -a The file containing the lsarray output
 -r The file containing the showrank output
 -t The file containing the tpc volume performance output
 ALL ARGUMENTS ARE REQUIRED!

Figure D-6 shows the example output of tpc_volume-lsfbvol-showrank.pl. Several of the columns have been hidden in order to fit the example output into the space provided.
[Figure D-6 sample data: the report columns include Date, Time, Subsystem, Volume, Interval, Read I/O Rate (normal, sequential, overall), Write I/O Rate (normal, sequential, overall), VG, CAPACITY, and EXTENTPOOL.]

Figure D-6 The tpc_volume-lsfbvol-showrank.pl sample output

Note: The output delimiter is the pipe (|) character. When you open the data in Excel, you therefore need to set the delimiter to |. Do not redirect the output to a file with a .csv suffix, or Excel will open it assuming that the delimiter is a comma and the columns will not line up correctly.
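If you prefer a delimiter that spreadsheet tools pick up automatically, one option (not part of the script itself) is to pass the saved report through a small Perl filter that converts the pipe characters into tabs; the file names below are placeholders:

   C:\> perl -pe "s/\|/\t/g" tpc_volume_report.txt > tpc_volume_report_tab.txt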
Example D-5   The tpc_volume-lsfbvol-showrank.pl

#!C:\Perl\bin\Perl.exe
###############################################################
# Script: tpc_volume-lsfbvol-showrank.pl
# Purpose: correlate tpc volume data to extent pool rank(s) and put in csv output
# Requirements: Must contain lsfbvol output from DS8K-Config-Gatherer-v1.2.cmd
###############################################################
use Getopt::Std;
#######################################################################################
main();                           # Start main logic
sub main{
  parseparms();                   # Get input parameters
  readlsfbvol($lsfbvol);          # Invoke routine to read lsfbvol
  readshowrank($showrank);        # Invoke routine to read showrank output
  readlsarray($lsarray);          # Invoke routine to read lsarray output
  readtpcvol($tpcvol);            # Invoke routine to read tpc volume
}
#######################################################################################
sub parseparms {
  $rc = getopts('f:a:r:t:');      # Define inputs
  $lsfbvol = $opt_f;              # Set value for lsfbvol
  $lsarray = $opt_a;              # Set value for lsarray
  $showrank = $opt_r;             # Set value for showrank
  $tpcvol = $opt_t;               # Set value for tpcvol
  (defined $lsfbvol) or usage();  # If lsfbvol is not set exit


  (defined $lsarray) or usage();  # If lsarray is not set exit
  (defined $showrank) or usage(); # If showrank is not set exit
  (defined $tpcvol) or usage();   # If tpcvol is not set exit
}
#######################################################################################
sub usage {
  print "\nUSAGE: tpc_volume-lsfbvol-showrank.pl [-fart] -h";
  print "\n -f The file containing the lsfbvol output";
  print "\n -a The file containing the lsarray output";
  print "\n -r The file containing the showrank output";
  print "\n -t The file containing the tpc volume performance output";
  print "\n ALL ARGUMENTS ARE REQUIRED!\n";
  exit ;
}
############# BEGIN PROCESS LSFBVOL OUTPUT ######################################
sub readlsfbvol($lsfbvol) {       # Define subroutine
  open(LSFBVOL,$lsfbvol) or die "cannot open $lsfbvol\n"; # Open lsfbvol file
  while (<LSFBVOL>) {             # Loop through every line in lsfbvol
    if (/^$|^Date|^Name|^==/) { next; }       # Skip empty and header lines
    @line = split(/:/,$_);        # Build temp array of each line
    $vol_id = $line[1];           # Set volume ID to the 2nd element
    $ep = $line[7];               # Set the extent pool to the 8th element
    $cap = $line[10];             # Set capacity in gb to 11th element
    $vg = $line[13];              # Set volume group to 14th element
    chomp($vg);                   # Remove carriage return
    $vg{$vol_id} = $vg;           # Create vg hash with vol as key
    $cap{$vol_id} = $cap;         # Create capacity hash with vol as key
    $ep{$vol_id} = $ep;           # Create extent pool hash with vol as key
  }
}
############# END PROCESS LSFBVOL OUTPUT ######################################
############# BEGIN PROCESS SHOWRANK OUTPUT ######################################
sub readshowrank($showrank) {     # Define subroutine
  open(SHOWRANK,$showrank) or die "cannot open $showrank\n";        # Open showrank file
  while (<SHOWRANK>) {            # Iterate through each line
    if (/^$/) { next; }           # skip empty lines
    chomp($_);                    # Remove carriage return
    $_ =~ s/\"//g;                # Remove quotations
    @temp = split(/\t|\s+|\s/, $_);           # build array of data for each line
    if (/^ID/) {                  # If line begins with ID
      $rank = $temp[1];           # Set rank to the 2nd element
    }
    if (/^Array/) {               # If line begins with array
      $array = $temp[1];          # Set array to 2nd element
    }
    if (/^extpoolID/) {           # If line begins with extpoolID
      $ep = $temp[1];             # Set ep to 2nd element
    }
    if (/^volumes/) {             # If line begins with volumes
      @vol_list = split(/,/,$temp[1]);        # Split list of vols into array
      foreach $vol (@vol_list) {  # loop through each volume
        if ($array_vols{$vol}) {  # If its already existing
          if ($array_vols{$vol} =~ $array) {  # check to see if we have it stored, move on
            next;
          } else {
            $array_vols{$vol} = $array_vols{$vol}.','.$array;       # else add it to existing list
          }
        } else {
          $array_vols{$vol} = $array;


        }
      }
    }
  }
}
############# END PROCESS SHOWRANK OUTPUT ######################################

############# BEGIN PROCESS LSARRAY OUTPUT ######################################
sub readlsarray($lsarray) {       # Define subroutine
  open(LSARRAY,$lsarray) or die "cannot open $lsarray\n"; # Open lsarray file
  while (<LSARRAY>) {             # Iterate through each line
    if (/^$/) { next; }           # skip empty lines
    chomp($_);                    # Remove carriage return
    $_ =~ s/\"//g;                # Remove quotations
    @temp = split(/,/, $_);       # build array of data for each line
    local $array = $temp[0];      # Set array to first element
    $array_site{$array} = $temp[4];           # Build hash with array as key to array site value
  }
}
############# END PROCESS LSARRAY OUTPUT ######################################
############# PROCESS TPC VOLUME DATA #########################################
sub readtpcvol($tpcvol) {         # Define subroutine
  open(TPCVOL,$tpcvol) or die "cannot open $tpcvol\n";    # Open tpc volume file
  while (<TPCVOL>) {
    if (/^$/) { next; }           # Skip empty lines
    if (/^-/) { next; }           # Skip separator lines
    if (/^Subsystem/) {           # Header line
      @line = split(/,/,$_);      # build temp array
      $subsystem = $line[0];      # Capture subsystem
      $volume = $line[1];         # Capture volume
      @temp = split(/\s|\s+/,$line[2]);       # Split the first element into Date and Time
      $time = $temp[0];           # Capture time
      shift(@line); shift(@line); shift(@line);           # Drop first 3 elements
      unshift(@line,'Date',$time,$subsystem,$volume);     # Add back in elements in desired order
      $cnt=0;                     # Start counter
      for ($i=0;$i<=$#line;$i++) {            # Start a loop through elements in header line
        $header[$cnt] = $line[$i];            # Build header array
        if($i == $#line) {        # if end of line remove carriage return
          chomp($header[$cnt]);
        }
        print "$header[$cnt]|";   # Print out header with pipe delimiter
        $cnt++;                   # Increment count
      }
      print "VG|CAPACITY|EXTENTPOOL|TPC-ARRAY\n";         # Print out extra new headers
      next;                       # Next line please
    }
    @line = split(/,|\(|\)|\:|\s/,$_);        # Build temp array
    $subsystem = $line[0];        # Set subsystem to first element
    $volume = $line[4];           # Set volume to 5th element
    $volume =~ tr/a-z/A-Z/;       # Make volume upper case
    $date = $line[6];             # Set date to 7th element
    $hr = twelveto24($line[7],$line[9]);      # Convert time from am-pm to 24hr
    $time = $hr . ":" . $line[8]; # Put hr and min together
    shift(@line); shift(@line); shift(@line); shift(@line); shift(@line);
    shift(@line); shift(@line); shift(@line); shift(@line); shift(@line);
    unshift(@line,$date,$time,$subsystem,$volume);        # After removal of elements re-add for proper order
    $serial = substr($line[2],12,7);          # Set DS8K serial #


    $ep = $ep{$volume};
    $tpc_array = array2arraysite($array_vols{$volume});   # Get array
    chomp($_);                    # Remove cr
    foreach $ele (@line) {        # Iterate through elements
      print "$ele|";              # Print each element
    }
    print "$vg{$volume}|$cap{$volume}|$ep{$volume}|$tpc_array\n";   # Print the extra elements
  }
}
############# ITERATE THROUGH ARRAY LIST AND CONVERT TO array site ################
sub array2arraysite{
  my $list = $_[0];
  my @array_list = split(/,/,$list);
  my $arr;
  my $ars_list;
  my $cnt = 0;
  foreach $arr (@array_list) {
    $ars = $array_site{$arr};     # Lookup array site
    $tpc_array = '2107' . "." . $serial . "-" . substr($ars,1);     # Convert array site to TPC array#
    if ($cnt == 0) {
      $ars_list = $tpc_array;
    } else {
      $ars_list = $ars_list . "," . $tpc_array;
    }
    $cnt++;
  }
  return $ars_list;
}
############# END PROCESS TPC VOLUME DATA #########################################
sub twelveto24{
  ($hr, $per) = @_;
  $per =~ tr/A-Z/a-z/;
  if ($hr == 12 && $per eq 'am') { return 0; }
  elsif ($hr != 12 && $per eq 'pm') { return $hr+12; }
  else { return $hr };
}


Appendix E.

Benchmarking
Benchmarking storage systems has become extremely complex over the years, given the many hardware and software components involved. In this appendix, we discuss the goals of benchmarking and the ways to conduct an effective storage benchmark.


E.1 Goals of benchmarking


Today, clients face difficult choices among a number of storage vendors and their product portfolios. Performance information provided by storage vendors can be generic and is often not representative of real client environments. To help in making decisions, benchmarking is a way to get an accurate representation of a storage product's performance in simulated application environments. The main objective of benchmarking is to identify the performance capabilities of a specific production environment and to compare the performance of two or more storage systems. Ideally, the benchmark includes real production data.

To conduct a benchmark, you need a solid understanding of all parts of your environment. This understanding includes not only the storage system requirements, but also the SAN infrastructure, the server environments, and the applications. Recreating a representative emulation of the environment, including actual applications and data along with user simulation, provides an efficient and accurate analysis of the performance of the storage system under test. A key characteristic of a performance benchmark is that the results must be reproducible to validate the test's integrity.

Benchmark key indicators


The popularity of a benchmark is based on how representative the workload is and whether the results are meaningful. Three key indicators can be taken from the benchmark results to evaluate a storage system:

- Performance results in a real application environment
- Reliability
- Total cost of ownership (TCO)

Performance is not the only component to consider in benchmark results; reliability and cost-effectiveness must also be considered. Balancing the benchmark performance results with the reliability, functionality, and total cost of ownership of the storage system gives you a global view of the storage product's value.

To help clients understand the intrinsic value of storage products in the marketplace, vendor-neutral independent organizations have developed several generic benchmarks. The two best-known organizations are:

- Storage Performance Council (SPC): http://www.storageperformance.org
- Transaction Processing Performance Council (TPC): http://www.tpc.org

The popularity of such benchmarks depends on how meaningful the workload is compared to the workloads that companies deploy today. If the generic benchmark workloads are representative of your production, you can use the published results to identify the product to implement in your production environment. But if the generic benchmark definition is not representative, or does not cover your requirements or restrictions, running a dedicated benchmark designed around your own workload gives you the ability to choose the right storage system.

E.2 Requirements for a benchmark


You need to carefully review your requirements before you set up a storage benchmark and use these requirements to develop a detailed but reasonable benchmark specification and time frame. Furthermore, you need to clearly identify the objective of the benchmark with all of the participants and precisely define the success criteria of the results.

Define the benchmark architecture


The benchmark architecture obviously includes the specific storage equipment that you want to test, but also the servers that host your application, the servers used to generate the workload, and the SAN equipment used to interconnect the servers and the storage subsystem. The monitoring equipment and software are also part of the benchmark architecture.

Define the benchmark workload


Your application environment can include different categories of data processing. In most cases, two types can be identified: online transaction processing (OLTP) and batch processing.

The OLTP category typically has many users, all accessing the same disk storage subsystem and a common set of files. The requests are typically spread across many files; therefore, the files are typically small and randomly accessed. A typical example is a network file server or disk subsystem being accessed by a sales department entering order information.

Batch workloads are frequently a mixture of random database accesses, skip-sequential, pure sequential, and sorting. They generate large data transfers and result in high path utilizations. Batch jobs are often constrained to operate within a particular window of time, during which online operation is restricted or shut down; poorer or better performance is often not recognized unless it impacts this window.

To identify the characteristics of your production workload, you can use the monitoring tools that are available at the operating system level.

In a benchmark environment, there are two ways to generate workload. The first way, the most complex, is to set up the production environment itself, including the application software and the application data. In this case, you have to ensure that the application is correctly configured and optimized on the server operating system, and that the data volume is representative of the production environment. Depending on your application, the workload can be generated using application scripts or an external transaction simulation tool. These kinds of tools simulate users accessing your application and provide application stress from end to end. To configure an external simulation tool, you first record a standard request from a single user and then replay this request many times. This process can emulate hundreds or thousands of concurrent users, put the application through the rigors of real-life user loads, and measure the response times of key business processes. Examples of available software include IBM Rational software and Mercury LoadRunner.

The other way to generate the workload is to use a standard workload generator. These tools, specific to each operating system, produce different kinds of workloads. You can configure and tune them to match your application workload; the main tuning parameters include the type of workload (sequential or random), the read/write ratio, the I/O block size, the number of I/Os per second, and the test duration. With a minimum of setup, these simulation tools can help you recreate your production workload without setting up all of the software components. Examples of available software include IOzone and Iometer.
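The read/write ratio, average transfer size, and I/O rate that you feed into a workload generator can be derived from cumulative counters collected on the production host. The following minimal Perl sketch, in the spirit of the post-processing scripts in Appendix D, only illustrates the arithmetic; the counter values are hypothetical placeholders that you replace with totals taken from iostat, sar, or RMF reports for your own measurement interval.

   #!/usr/bin/perl
   # Minimal sketch: derive workload-generator settings from cumulative I/O counters.
   # All input values below are hypothetical placeholders.
   my $reads      = 1250000;     # read operations during the interval
   my $writes     =  430000;     # write operations during the interval
   my $kb_read    = 10000000;    # KB read during the interval
   my $kb_written =  6880000;    # KB written during the interval
   my $seconds    = 3600;        # length of the measurement interval in seconds

   my $total_ios = $reads + $writes;
   printf "Read ratio         : %.1f %%\n", 100 * $reads / $total_ios;
   printf "Average read size  : %.1f KB\n", $kb_read / $reads;
   printf "Average write size : %.1f KB\n", $kb_written / $writes;
   printf "Average I/O rate   : %.1f IO/s\n", $total_ios / $seconds;
   printf "Average throughput : %.1f MB/s\n", ($kb_read + $kb_written) / 1024 / $seconds;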


Important: Each workload test must be defined with a minimum duration in order to eliminate side effects of the warm-up period, such as cache population, which can generate incorrect results.

Monitoring the performance


Monitoring is a critical component of benchmarking and has to be fully integrated into the benchmark architecture. The more information we have about component activity at each level of the benchmark environment, the better we understand where the weaknesses of the solution are. With this information, we can precisely identify bottlenecks, optimize component utilization, and improve the configuration. A minimum set of monitoring tools is required at the different levels of a storage benchmark architecture:

- Storage level: Monitoring at the storage level provides information about the intrinsic performance of the storage equipment components. Most of these monitoring tools report storage server utilization, storage cache utilization, volume performance, RAID array performance, disk drive module (DDM) performance, and adapter utilization.
- SAN level: Monitoring at the SAN level provides information about the interconnect workloads between servers and storage subsystems. This monitoring helps to check that the workload is well balanced between the different paths used for production and to verify that the interconnection is not a performance bottleneck.
- Server level: Monitoring at the server level provides information about server component utilization (processor, memory, storage adapter, filesystem, and so forth). This monitoring also helps you understand what type of workload the application hosted on the server generates and evaluate the storage performance, in terms of response time and bandwidth, from the application's point of view.
- Application level: This monitoring is the most powerful in terms of performance analysis, because it measures performance from the user's point of view and highlights bottlenecks of the entire solution. Monitoring the application is not always possible; only a few applications provide a performance module for monitoring application processes.

Note that monitoring can itself have an impact on component performance. In that case, implement the monitoring tools in the first sequence of tests to understand your workload, and then disable them to eliminate any impact that can distort the performance results.
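At the server level, a simple collection wrapper is often enough to capture statistics for every run. The following Perl sketch is one possible approach, not part of the book's script set: it starts iostat with a fixed interval and sample count sized to one test run and names the output file after the run. The interval, count, and run label are assumptions that you adapt to your test duration and operating system.

   #!/usr/bin/perl
   # Minimal sketch: capture server-level disk statistics for one benchmark run.
   # Interval, sample count, and run label are placeholder values.
   my $run      = $ARGV[0] || "run1";   # label used in the output file name
   my $interval = 60;                   # seconds between samples
   my $count    = 30;                   # number of samples (30 x 60s = 30 minutes)

   my $outfile = "iostat_$run.out";
   # iostat <interval> <count> is available on AIX, Linux, and Solaris;
   # add platform-specific flags (for example, -D on AIX 5.3) as needed.
   system("iostat $interval $count > $outfile") == 0
     or die "iostat collection failed: $?\n";
   print "Collected $count samples at ${interval}s intervals in $outfile\n";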

Define the benchmark time frame


Several things need to be considered when you plan the benchmark time frame:

- The time to set up the environment (hardware and software), restore your data, and validate that the solution works.
- The time to execute each scenario, considering that each scenario has to be run several times.
- The time to analyze the monitoring data that is collected.

After each run, the benchmark data might have been changed, inserted, deleted, or otherwise modified, so it has to be restored before another test iteration. In that case, also consider the time needed to restore the original data after each run. During a benchmark, each scenario has to be run several times: first, to understand how the different components are performing (using the monitoring tools) and to identify bottlenecks, and then to test different ways to obtain an overall performance improvement by tuning each of the different components.

E.3 Cautions when using benchmark results to design production


When you take benchmark results as the foundation for building your production infrastructure, look closely at the benchmark performance information and watch out for the following points:

- Benchmark hardware resources (servers, storage subsystems, SAN equipment, and network) are dedicated to the performance test only.
- The benchmark infrastructure configuration must be fully documented. Ensure that there are no dissimilar configurations (for example, dissimilar cache sizes or different disk capacities and speeds).
- Benchmarks focus on the core application performance and often do not consider interference from other applications that occurs in the real infrastructure.
- The benchmark configuration must be realistic (technical choices regarding performance compared to usability or availability).
- The scenarios built must be representative in different ways:
  - Volume of data
  - Extreme or unrealistic workload
  - Timing execution
  - Workload ramp-up
  - Avoidance of side effects, such as populating cache, which can generate incorrect results
- Be sure to note the detailed optimization actions performed on each of the components of the infrastructure (including application, servers, storage subsystems, and so forth).
- Server and storage performance reports must be fully detailed, including bandwidth, I/Os per second, and response time.
- Understand the availability level of the solution tested, considering the impact of each component failure (host bus adapter (HBA), switch, RAID array, and so forth). Balance the performance results with your high availability requirements. First, consider the advanced Copy Services available on the specific storage subsystem to duplicate or mirror your production data within the subsystem or to a remote backup site. Then, be aware that performance is not equivalent when your data is mirrored or duplicated.



Related publications
The publications listed in this section are considered particularly suitable for a more detailed discussion of the topics covered in this book.

IBM Redbooks publications


For information about ordering these publications, see How to get IBM Redbooks publications on page 630. Note that some of the documents referenced here might be available in softcopy only:

- IBM System Storage DS8000 Architecture and Implementation, SG24-6786
- IBM System Storage Solutions Handbook, SG24-5250
- IBM System Storage DS8000: Copy Services with IBM System z, SG24-6787
- IBM System Storage DS8000: Copy Services in Open Environments, SG24-6788
- FICON Native Implementation and Reference Guide, SG24-6266
- Introduction to Storage Area Networks, SG24-5470
- SAN Volume Controller Best Practices and Performance Guidelines, SG24-7521
- SVC V4.2.1 Advanced Copy Services, SG24-7574
- Implementing the IBM System Storage SAN Volume Controller V4.3, SG24-6423
- IBM TotalStorage Productivity Center V2.3: Getting Started, SG24-6490
- TotalStorage Productivity Center V3.3 Update Guide, SG24-7490
- Managing Disk Subsystems using IBM TotalStorage Productivity Center, SG24-7097
- Using IBM TotalStorage Productivity Center for Disk to Monitor the SVC, REDP-3961
- Monitoring Your Storage Subsystems with TotalStorage Productivity Center, SG24-7364
- Implementing Linux with IBM Disk Storage, SG24-6261
- Tuning IBM System x Servers for Performance, SG24-5287
- Tuning Linux OS on System p: The POWER of Innovation, SG24-7338
- Tuning Red Hat Enterprise Linux on IBM eServer xSeries Servers, REDP-3861
- Tuning SUSE LINUX Enterprise Server on IBM eServer xSeries Servers, REDP-3862
- Tuning Windows Server 2003 on IBM System x Servers, REDP-3943
- Linux with zSeries and ESS: Essentials, SG24-7025
- Linux for IBM System z9 and IBM zSeries, SG24-6694
- Linux Handbook: A Guide to IBM Linux Solutions and Resources, SG24-7000
- Linux on IBM System z: Performance Measurement and Tuning, SG24-6926
- Virtualizing an Infrastructure with System p and Linux, SG24-7499
- z/VM and Linux on IBM System z, SG24-7492
- Linux Performance and Tuning Guidelines, REDP-4285
- DB2 for z/OS and OS/390 Version 7 Performance Topics, SG24-6129
- High Availability and Scalability Guide for DB2 on Linux, UNIX, and Windows, SG24-7363

- IBM ESS and IBM DB2 UDB Working Together, SG24-6262
- AIX 5L Performance Tools Handbook, SG24-6039
- AIX 5L Differences Guide Version 5.3 Edition, SG24-7463

Other publications
These publications are also relevant as further information sources:

- IBM TotalStorage DS8000 Command-Line Interface User's Guide, SC26-7625
- IBM System Storage DS8000 Host Systems Attachment Guide, SC26-7917
- IBM TotalStorage DS8000: Introduction and Planning Guide, GC35-0495
- IBM TotalStorage DS8000: User's Guide, SC26-7623
- IBM TotalStorage DS Open Application Programming Interface Reference, GC35-0493
- IBM TotalStorage DS8000 Messages Reference, GC26-7659
- z/OS DFSMS Advanced Copy Services, SC35-0428
- Device Support Facilities: User's Guide and Reference, GC35-0033
- DB2 for OS/390 V5 Administration Guide, SC26-8957
- IBM System Storage Multipath Subsystem Device Driver User's Guide, GC52-1309
- IBM TotalStorage XRC Performance Monitor Installation and User's Guide, GC26-7479

Online resources
These Web sites and URLs are also relevant as further information sources:

- IBM Disk Storage Feature Activation (DSFA)
  http://www.ibm.com/storage/dsfa
- Documentation for the DS8000
  http://www.ibm.com/systems/storage/disk/ds8000/index.html
- IBM System Storage Interoperation Center (SSIC)
  http://www.ibm.com/systems/support/storage/config/ssic/index.jsp

How to get IBM Redbooks publications


You can search for, view, or download IBM Redbooks publications, Redpapers, Hints and Tips, draft publications and Additional materials, as well as order hardcopy IBM Redbooks publications or CD-ROMs, at this Web site:
ibm.com/redbooks


Help from IBM


IBM Support and downloads
ibm.com/support

IBM Global Services


ibm.com/services


Index
A
AAL 6, 20 addpaths command 368 address groups 53 Advanced Copy Services 422, 426, 438 affinity 47 agile view 362 AIO 314, 328329 aioo 315, 329 AIX filemon command 339341 lvmstat command 334 nmon command 336 SDD commands 367, 371 secondary system paging 370 topas command 335 aligned partition 396, 398399 allocation 47, 4950 anticipatory 413 anticipatory I/O Scheduler 412 ANTXIN00 member 545 application workload 32 arbitrated loop 267 array site 20, 45 array sites 4445, 54 arrays 42, 4445, 47 arrays across loops see AAL Assymetric Logical Unit Access (ALUA) 362 cache management 13 performance of SARC 14 capacity 526, 530 Capacity Magic 27 cfgvpath command 373 cfvgpath 373 characterize performance 380 choosing CKD volume size 450 disk size with DB2 UDB 500 CKD volumes 4950, 54 allocation and deletion 50 commands addpaths 368 cfgvpath 373 cron 365 datapath 331, 366 dd 355, 375376 filemon 339341 lquerypr 370 lsvpcfg 368369, 372 lvmstat 334 nmon 336 sar 363, 365, 420 SDD commands in AIX 367 SDD commands in HP-UX 371 SDD commands in Sun 373 SDD datapath 274 showvpath 372, 374 topas 335 vpathmkdev 374 compatability mode 394 Complete Fair Queuing 412413, 415 Completely Fair Queuing I/O Scheduler 412 concurrent write operation 448 CONN time 464, 466467 connectivity Global Copy 528 considerations DB2 performance 489 DB2 UDB performance 492, 497 logical disks in a SAN 269 z/OS planning 459 zones 268 Consistency Group drain time 537 constraints definition 224 containers 492, 494496, 498 control data sets 545 Copy Pending 527 Copy Services 507 FlashCopy 508510 FlashCopy objectives 510 introduction 508

B
balanced DB2 workload 490 IMS workload 505 bandwidth 519520, 522 benchmark cautions 627 requirements 624 benchmarking 623624, 626 goals 624 bio structure 405 block layer 405 block size 416417 buffer pools 495 BusLogic 399

C
cache 1214, 208, 211405, 417 as intermediate repository 12 fast writes 12 cache and I/O operations read operations 12 write operations 12 cache information 208


Metro Mirror 518 Metro Mirror configuration considerations 519 cron command 365

D
DA 1719 data mining 34 data rate droop effect 454 data warehousing 34 databases 485488, 493 buffer pools 495 containers 492, 494, 496 DB2 in a z/OS environment 486 DB2 overview 487 DB2 prefetch 490491, 495 extent size 495496, 500 extents 495 IMS in a z/OS environment 502 logs 489, 493, 496, 503 MIDAWs 491 multi-pathing 501502 page cleaners 496 page size 495496, 499 pages 495497 parallel operations 496 partitiongroups 494 partitioning map 494 partitions 493 prefetch size 500 selecting DB2 logical sizes 499 tables, indexes and LOBs 494 tablespace 494, 496 workload 31, 486 dataset placement 545 datastore 385 DB2 485487 in a z/OS environment 486 logging 3132 overview 487 prefetch 490491, 495 query 32 recovery data sets 488 selecting logical sizes 499 storage objects 487 transaction environment 32 DB2 UDB 492494 container 492, 494495, 498 instance 493 performance considerations 492, 497 striping 499 tablespace 487, 492, 494 dd command 355, 375376 DDMs 16, 1819 deadline 412413 deadline I/O scheduler 412 Dense Wavelength Division Multiplexing (DWDM) 528 description and characteristics of a SAN 267 device adapter see DA device busy 446 Device Mapper - Mutlipath I/O (DM-MPIO) 407

DFW Bypass 491 DFWBP 491 digital video editing 35 direct connect 267268 Direct I/O 309, 312, 314, 344 dirty buffer 405 flushing 405 DISC time 466467, 470 Discovered Direct I/O 344 disk bottlenecks iostat command 418419 vmstat command 418 disk enclosure 19, 21 disk I/O subsystem block layer 405 cache ??403403405, 417 I/O subsystem architecture 403 Disk Magic 161163 modeling 163 Disk Magic for open systems 180 Disk Magic for zSeries 165 hardware configuration 179, 195 disk subsystem 18 fragmentation 285 disk xfer 327328 diskpart 399 disks configuring for SAN 269 size with DB2 UDB 500 distance 519520 distributions Linux 401402 DONOTBLOCK 548549 drain period 537 drain time 537538 ds_iostat 588, 594 DS8000 AAL cache management 13 Copy Services 507 DA DDMs 16, 1819 disk enclosure 19, 21 FICON 56 FICON port concurrency 469 hardware overview 3 host adapter 5 host attachment 266 I/O priority queuing 7 Multiple Allegiance 7 PAV POWER5 5 priority queuing 449 SDD z/OS Global Mirror DS8000 nI/O ports 527 DS8100 21 processor memory 12 DS8300 10, 17, 2122 disk subsystem 18

634

DS8000 Performance Monitoring and Tuning

LPAR 17, 23, 27 DWDM 276 dynamic alias management 442, 452 dynamic buffer cache 357 Dynamic Path Reconnect (DPR) 26 Dynamic Path Selection (DPS) 26 dynamic PAV 442444

E
EAV 49 elevator model 405 engineering and scientific applications 35 Enterprise Volume Management System (EVMS) 411 ESCON 26 ess_iostat script 594 ESX Server multipathing 387 ESX server 384385, 387 esxcfg-mpath 387 esxtop 390391 EVMS (Enterprise Volume Management System) 410 examples datapath query output 331, 366 ESCON connection 275 FICON connection 278 filemon command outputs 340 larger vs. smaller volumes - random workload 451 rank device spreading 318, 359 sar output 363 zoning in a SAN environment 269 Ext2 414 Ext3 414415 Extended Address Volume (EAV) 49 Extended Distance FICON (EDF) 455 eXtended File System 414 Extended Remote Copy see z/OS Global Mirror extent pool 47, 513, 517518 extent pools 4749 extent rotation 51, 55 extent size 495496, 500 extent type 4647 extents 495

FICON 56, 22, 2426, 276, 442, 451452, 454 host adapters 22, 2425, 27 host attachment 278 FICON Express2 454455, 468 FICON Express4 266, 277278, 454455, 457458 FICON Open Exchange 469 filemon 330, 339340 filemon command 339341 filemon measurements 339 fixed block LUNs 48 Fixed policy 387 FlashCopy 508510 performance 510511 FlashCopy objectives 510 fragmentation, disk 285 fsbuf 327

G
Global Copy 508509, 524, 526 capacity 530 connectivity 528 distance considerations 528 DS8000 I/O ports 527 performance 526, 529531, 533535 planning considerations 529 scalability 530 Global Mirror add or remove storage servers or LSSs 543 avoid unbalanced configurations 538 Consistency Group drain time 537 growth within configurations 541 performance 532533 performance aspects 533 performance considerations at coordination time 536 remote storage server configuration 538 volumes 536, 538539, 542 guidelines z/OS planning 459

H
hardware configuration planning allocating hardware components to workloads 3 array sites, RAID, and spares 20 cache as intermediate 12 Capacity Magic 27 disk subsystem 18 DS8000 overview 3 DS8300 10, 17, 2122 DS8300 LPAR 17, 23, 27 Fibre Channel and FICON host adapters 24 host adapters 11, 1617, 24 I/O enclosures 1718 modeling your workload 3 multiple paths to open systems servers 25 multiple paths to zSeries servers 26 order of installation 21 performance numbers 2 processor complex 10, 12, 17 processor memory 12, 16

F
fast write 1213 fast writes 12 FATA 517 FC adapter 273 FC-AL overcoming shortcomings 18 switched 5 FCP supported servers 267 fcstat 336, 339 Fibre Channel 267 distances 2425 host adapters 22, 2425, 27 topologies 267268 Fibre Channel and FICON host adapters 24

Index

635

RIO-G interconnect and I/O enclosures 17 RIO-G loop 17 spreading host attachments 26 storage server challenge 2 switched disk architecture 18 tools to aid in hardware planning 27 whitepapers 27 HCD (hardware configuration definition) 26 High performance FICON (zHPF) 455 High Performance FileSystem (HFS) 356 host adapter (HA) 5 host adapters 11, 1617, 24 Fibre Channel 22, 2425, 27 FICON 22, 2425, 27 host attachment 5455 description and characteristics of a SAN 267 FC adapter 273 Fibre Channel 267 Fibre Channel topologies 267 FICON 276 SAN implementations 267 SDD load balancing 273 single path mode 273 supported Fibre Channel 267 HP MirrorDisk/UX 358 HyperPAV 442443, 445

J
Journal File System 414415 Journaled File System (JFS) 357 journaling mode 414416 Journals 545

L
large volume support planning volume size 453 legacy AIO 314, 328329 links Fibre Channel 520 Linux 401403 load balancing SDD 273 locality of reference 404 log buffers 503 logging 502503 logical configuration disks in a SAN 269 logical control unit (LCU) 53 logical device choosing the size 450 configuring in a SAN 269 planning volume size 453 logical paths 520 Logical Session 549 logical subsystem see LSS logical volumes 4548 logs 489, 493, 496, 503 lquerypr 367368, 370371 lquerypr command 370 LSI Logic 399 LSS 5354, 519520, 543 add or remove in Global Mirror 543 LSS design 522 lsvpcfg command 368369, 372 LUN masking 268, 271 LUN queue depth 392393 LUNs allocation and deletion 50 fixed block 48 LVM 2 410411 lvmap 588589 lvmap script 589 lvmstat 330, 334335 lvmstat command 334

I
I/O priority queuing 449 I/O elevator 404405, 415 anticipatory 413 Complete Fair Queuing 412413, 415 deadline 412413 NOOP 413 I/O enclosure 1718 I/O latency 12 I/O priority queuing 7 I/O scheduler 405, 417 I/O subsystem architecture 403 IBM TotalStorage Multipath Subsystem Device Driver see SDD IBM TotalStorage XRC Performance Monitor 552 implement the logical configuration 143 IMS 485, 502503 logging 502503 performance considerations 502, 504 WADS 503504 IMS in a z/OS environment 502 index 487488, 494 indexspace 488 installation planning host attachment 266 instance 493 IOCDS 26 Iometer 305 ionice 415 IOP/SAP 464, 467468 IOSQ time 445, 466 iostat command 418419

M
Managed Disk Group 427429 Managed Disks Group 423 maximum drain time 537 metric 207, 226, 228 Metro Mirror 518 adding capacity in new DS6000s 526 addition of capacity 526 bandwidth 519520, 522

636

DS8000 Performance Monitoring and Tuning

distance 519520 Fibre Channel links 520 logical paths 520 LSS design 522 performance 520, 523524 scalability 526 symmetrical configuration 522523 volumes 518, 520, 523 Metro Mirror configuration considerations 519 Metro/Global Mirror performance 541, 553 Microsoft Cluster Services (MSCS) 385 MIDAW 491 MIDAWs 491 modeling your workload 3 monitor host workload 38 open systems servers 38 monitoring DS8000 workload 38 monitoring tools Disk Magic 161163 Iometer 305 Performance console 294296 Task Manager 301 mount command 415416 MPIO 254 MRU policy 387 multi-pathing 501502 multipathing DB2 UDB environment 502 SDD 288, 290, 292 Multiple Allegiance 7, 442, 446, 453 Multiple Allegiance (MA) 446 multiple reader 549550 multi-rank 518, 541

tuning I/O buffers 313 verify the storage subsystem 377 vpathmkdev 367, 373374 order of installation 21 DS8100 21 over provisioning 50 overhead 512, 516, 518

N
N_Port ID Virtualization (NPIV) 386
nmon command 336
noatime 415
nocopy 510, 513, 515, 517
NOOP 413
Noop 412
Noop I/O Scheduler 412

O
OLDS 503–504
OLTP 486, 498, 500
open systems servers 38
  cfvgpath 373
  characterize performance 380
  dynamic buffer cache 357
  filemon 330, 339–340
  lquerypr 367–368, 370–371
  lvmstat 330, 334–335
  performance logs 296
  removing disk bottlenecks 300
  showvpath 367, 372, 374
  topas 335–336
  tuning I/O buffers 313
  verify the storage subsystem 377
  vpathmkdev 367, 373–374
order of installation 21
  DS8100 21
over provisioning 50
overhead 512, 516, 518

P
page cache 403
page cleaners 496
page size 495–496, 499
PAGEFIX 546
pages 495–497
Parallel 496
Parallel Access Volume (PAV) 453
Parallel Access Volumes (PAV) 53
Parallel Access Volumes see PAV
parallel operations 496
partitiongroups 494
partitioning map 494
path failover 274
PAV 7, 442–445, 465
PAV and Multiple Allegiance 446
pbuf 327
pdflush 403, 405
PEND time 446, 448, 466
performance 459, 507, 511, 517, 529, 533–535
  AIX secondary system paging 370
  DB2 considerations 489
  DB2 UDB considerations 492, 497
  Disk Magic tool 161–163
  FlashCopy overview 510–511
  IMS considerations 502, 504
  planning for UNIX systems 308
  size of cache 16
  tuning Windows systems 281–282
Performance Accelerator feature 23
Performance console 294–296
performance data collection 217, 224, 239
performance data collection task 224
performance logs 296
performance monitor 224, 239
performance numbers 2
plan Address Groups and LSSs 118
plan RAID Arrays and Ranks 81
planning
  logical volume size 453
  UNIX servers for performance 308
planning and monitoring tools
  Disk Magic 162
  Disk Magic for open systems 180
  Disk Magic for zSeries 165
  Disk Magic modeling 163
  Disk Magic output information 163
  workload growth projection 197
planning considerations 529
ports 209, 225
posix AIO 328
POWER5 5
prefetch size 500
priority queuing 449
processor complex 10, 12, 17
processor memory 12, 16
pstat 328

R
RAID 20
RAID 10 517–518, 525
  drive failure 43–44
  implementation 43
  theory 43
RAID 5 517–518
  drive failure 42
  implementation 42–43
  theory 42
RAID 6 42–43, 45, 52
ranks 46–49
RAS
  RAID 42
  RAID 10 43
  spare creation 44
Raw Device Mapping (RDM) 384, 393
RDM 384–386, 393
real-time sampling and display 363
recovery data sets 488
Red Hat 402, 408, 415
Redbooks Web site 630
  Contact us xx
ReiserFS 414–415
reorg 52
Resource Management Facility (RMF) 464
RFREQUENCY 547
RIO-G loop 17
RMF 525, 533, 538
RMZ 543
rotate extents 51, 55
Rotated volume 51
rotated volume 51
RTRACKS 547

S
SAN 26
  cabling for availability 267
  implementation 267
  zoning example 269
SAN implementations 267
SAN Statistics
  monitoring performance 242
SAN Volume Controller (SVC) 424
SAR
  display previously captured data 365
  real-time sampling and display 363
  sar summary 366
sar command 363, 365, 420
sar summary 366
SARC 6
scalability 526, 530
scripts
  ess_iostat 594
  lvmap 589
  test_disk_speeds 597
  vgmap 588
  vpath_iostat 590
SCSI See Small Computer System Interface
SCSI reservation 393
SDD 6, 254, 288, 290, 292
  addpaths 368
  commands in AIX 367
  commands in HP-UX 371
  commands in Sun 373
  DB2 UDB environment 502
  lsvpcfg command 368–369, 372
SDD load balancing 273
sequential
  measuring with dd command 375
Sequential Prefetching in Adaptive Replacement Cache see SARC
server affinity 47
setting I/O schedulers 412
showpath command 374
showvpath 367, 372, 374
showvpath command 372
single path mode 273
Small Computer System Interface 405
SMI-S standard
  IBM extensions 208
  ports 209
space efficient volume 49
spares 20, 44
  floating 44
spindle 517
spreading host attachments 26
static and dynamic PAVs 442
storage facility image 10
storage image 10, 23
storage LPAR 10
Storage Pool Striping 47–48, 51, 53, 515, 518, 541
Storage pool striping 51
storage server challenge 2
storage servers
  add or remove in Global Mirror 543
storage unit 10
striped volume 52
striping
  DB2 UDB 499
  VSAM data striping 490
Subsystem Device Driver (SDD) 407
Sun
  SDD commands 373
supported Fibre Channel 267
supported servers
  FCP attachment 267
SuSE 402
SVC 254
switched fabric 267–268
switched FC-AL 5
symmetrical configuration 522–523
System Data Mover (SDM) 543
System Management Facilities (SMF) 198, 469
System z servers
  concurrent write operation 448
  CONN time 464, 466–467
  DISC time 466–467, 470
  IOSQ time 445, 466
  PAV and Multiple Allegiance 446
  PEND time 446, 448, 466

T
table 487–488, 493
tables, indexes, and LOBs 494
tablespace 487, 492, 494, 496
  application 488
  system 488, 494
Task Manager 301
test_disk_speeds script 597
thin provisioning 50
TIMEOUT 546–547
timestamps 214, 223
topas 335–336
topas command 335
topologies
  arbitrated loop 267
  direct connect 267–268
  switched fabric 267–268
tpctool CLI 236
tpctool lstype command 237
tuning
  disk subsystem 412
  Windows systems 281–282
tuning I/O buffers 313
tuning parameters 545

U
UNIX shell scripts
  ds_iostat 588, 594
  introduction 572, 588, 608
  ivmap 588–589
  vgmap 588
using Task Manager 301

V
VDisk 424–425
Veritas Dynamic MultiPathing (DMP) 291
vgmap 588
vgmap script 588
video on demand 34
Virtual Center (VC) 389
Virtual Machine File System (VMFS) 393
virtualization
  address groups 53
  array site 45
  array sites 44–45, 54
  arrays 42, 44–45, 47
  extent pools 47–49
  hierarchy 46, 54–55
  host attachment 54–55
  logical volumes 45–48
  ranks 46–49
  volume group 54–55
VMFS datastore 386, 392
VMotion 386
vmstat 326–328
vmstat command 418
VMware datastore 385
volume
  space efficient 49
volume groups 54–55
volume manager 52
volumes 209–211, 518, 520, 523
  add or remove 536, 538–539, 542
  CKD 49–50, 54
vpath_iostat script 590
vpathmkdev 367, 373–374
vpathmkdev command 374

W
WADS 503–504
Wavelength Division Multiplexing (WDM) 528
whitepapers 27
Windows
  Iometer 305
  Task Manager 301
  tuning 281–282
Windows Server 2003
  fragmentation 285
workload 512, 515–516, 518
  databases 486
workload growth projection 197
Workload Manager 442, 449, 454
write ahead data sets, see WADS
Write Pacing 548

X
XADDPAIR 547–549
XRC PARMLIB members 545
XRC Performance Monitor 552
XRC see z/OS Global Mirror
xSeries servers
  Linux 401–403
XSTART command 545–547
XSUSPEND 546–547

Z
z/OS planning guidelines 459
z/OS Global Mirror 508, 543–544
  dataset placement 545
  IBM TotalStorage XRC Performance Monitor 552
  tuning parameters 545
z/OS Workload Manager 442
zGM multiple reader 549
zSeries servers
  overview 442
  PAV 442–443, 445, 465
  static and dynamic PAVs 442


Back cover

DS8000 Performance Monitoring and Tuning

Understand the performance aspects of the DS8000 architecture
Configure the DS8000 to fully exploit its capabilities
Use planning and monitoring tools with the DS8000

This IBM Redbooks publication provides guidance about how to configure, monitor, and manage your IBM System Storage DS8000 to achieve optimum performance. We describe the DS8000 performance features and characteristics and how they can be exploited with the different server platforms that can attach to it. Then, in consecutive chapters, we detail specific performance recommendations and discussions that apply for each server environment, as well as for database and DS8000 Copy Services environments.

We also outline the various tools available for monitoring and measuring I/O performance for different server environments, as well as describe how to monitor the performance of the entire DS8000 subsystem.

This book is intended for individuals who want to maximize the performance of their DS8000 and investigate the planning and monitoring tools that are available.

INTERNATIONAL TECHNICAL SUPPORT ORGANIZATION

BUILDING TECHNICAL INFORMATION BASED ON PRACTICAL EXPERIENCE


IBM Redbooks are developed by the IBM International Technical Support Organization. Experts from IBM, Customers and Partners from around the world create timely technical information based on realistic scenarios. Specific recommendations are provided to help you implement IT solutions more effectively in your environment.

For more information: ibm.com/redbooks


SG24-7146-01 ISBN 0738432695
