RH436
Red Hat Enterprise Clustering and Storage Management
RH436-RHEL5u4-en-17-20110428

Table of Contents
RH436 - Red Hat Enterprise Clustering and Storage Management

RH436: Red Hat Enterprise Clustering and Storage Management
    Copyright
    Welcome
    Red Hat Enterprise Linux
    Red Hat Enterprise Linux Variants
    Red Hat Subscription Model
    Contacting Technical Support
    Red Hat Network
    Red Hat Services and Products
    Fedora and EPEL
    Classroom Setup
    Networks
    Notes on Internationalization

Lecture 1 - Storage Technologies
    Objectives
    The Data
    Data Storage Considerations
    Data Availability
    Planning for the Future
    The RHEL Storage Model
    Volume Management
    SAN versus NAS
    SAN Technologies
    Fibre Channel
    Host Bus Adapter (HBA)
    Fibre Channel Switch
    Internet SCSI (iSCSI)
    End of Lecture 1
    Lab 1: Data Management and Storage
        Lab 1.1: Evaluating Your Storage Requirements
        Lab 1.2: Configuring the Virtual Cluster Environment

Lecture 2 - iSCSI Configuration
    Objectives
    Red Hat iSCSI Driver
    iSCSI Data Access
    iSCSI Driver Features
    iSCSI Device Names and Mounting
    iSCSI Target Naming
    Configuring iSCSI Targets
    Manual iSCSI configuration
    Configuring the iSCSI Initiator Driver
    iSCSI Authentication Settings
    Configuring the open-iscsi Initiator
    First-time Connection to an iSCSI Target
    Managing an iSCSI Target Connection
    Disabling an iSCSI Target
    End of Lecture 2
    Lab 2: iSCSI Configuration
        Lab 2.1: iSCSI Software Target Configuration
        Lab 2.2: iSCSI Initiator Configuration

Lecture 3 - Kernel Device Management
    Objectives
    udev Features
    Event Chain of a Newly Plugged-in Device
    /sys Filesystem
    udev
    Configuring udev
    udev Rules
    udev Rule Match Keys
    Finding udev Match Key Values
    udev Rule Assignment Keys
    udev Rule Substitutions
    udev Rule Examples
    udevmonitor
    Dynamic storage management
    Tuning the disk queue
    Tuning the deadline scheduler
    Tuning the anticipatory scheduler
    Tuning the noop scheduler
    Tuning the (default) cfq scheduler
    Fine-tuning the cfq scheduler
    End of Lecture 3
    Lab 3: udev and device tuning
        Lab 3.1: Persistent Device Naming

Lecture 4 - Device Mapper and Multipathing
    Objectives
    Device Mapper
    Device Mapping Table
    dmsetup
    Mapping Targets
    Mapping Target - linear
    Mapping Target - striped
    Mapping Target - error
    Mapping Target - snapshot-origin
    Mapping Target - snapshot
    LVM2 Snapshots
    LVM2 Snapshot Example
    Mapping Target - zero
    Device Mapper Multipath Overview
    Device Mapper Components
    Multipath Priority Groups
    Mapping Target - multipath
    Setup Steps for Multipathing FC Storage
    Multipathing and iSCSI
    Multipath Configuration
    Multipath Information Queries
    End of Lecture 4
    Lab 4: Device Mapper Multipathing
        Lab 4.1: Device Mapper Multipathing
Lecture 5 - Red Hat Cluster Suite Overview
    Objectives
    What is a Cluster?
    Red Hat Cluster Suite
    Cluster Topology
    Clustering Advantages
    Advanced Configuration and Power Interface (ACPI)
    Cluster Network Requirements
    Broadcast versus Multicast
    Ethernet Channel Bonding
    Channel Bonding Configuration
    Red Hat Cluster Suite Components
    Security
    Cluster Configuration System (CCS)
    CMAN - Cluster Manager
    Cluster Quorum
    OpenAIS
    rgmanager - Resource Group Manager
    The Conga Project
    luci
    ricci
    Deploying Conga
    luci Deployment Interface
    Clustered Logical Volume Manager (CLVM)
    Distributed Lock Manager (DLM)
    Fencing
    End of Lecture 5
    Lab 5: Cluster Deployment using Conga
        Lab 5.1: Building a Cluster with Conga

Lecture 6 - Logical Volume Management
    Objectives
    An LVM2 Review
    LVM2 - Physical Volumes and Volume Groups
    LVM2 - Creating a Logical Volume
    Files and Directories Used by LVM2
    Changing LVM options
    Moving a volume group to another host
    Clustered Logical Volume Manager (CLVM)
    CLVM Configuration
    End of Lecture 6
    Lab 6: Clustered Logical Volume Manager
        Lab 6.1: Configure the Clustered Logical Volume Manager
Lecture 7 - Global File System 2
    Objectives
    Global File System 2
    GFS2 Limits
    GFS2 Enhancements
    Creating a GFS2 File System
    Lock Managers
    Distributed Lock Manager (DLM)
    Mounting a GFS2 File System
    Journaling
    Quotas
    Growing a GFS2 File System
    GFS2 Super Block Changes
    GFS2 Extended Attributes (ACL)
    Repairing a GFS2 File System
    End of Lecture 7
    Lab 7: Global File System 2
        Lab 7.1: Creating a GFS2 file system with Conga
        Lab 7.2: Create a GFS2 filesystem on the commandline
        Lab 7.3: GFS1 Conversion
        Lab 7.4: GFS2: Working with images
        Lab 7.5: GFS2: Growing the filesystem
Lecture 8 - Quorum and the Cluster Manager
    Objectives
    Cluster Quorum
    Cluster Quorum Example
    Modifying and Displaying Quorum Votes
    CMAN - two node cluster
    CCS Tools - ccs_tool
    cluster.conf Schema
    Updating an Existing RHEL4 cluster.conf for RHEL5
    cman_tool
    cman_tool Examples
    CMAN - API
    CMAN - libcman
    End of Lecture 8
    Lab 8: Adding Cluster Nodes and Manually Editing cluster.conf
        Lab 8.1: Extending Cluster Nodes
        Lab 8.2: Manually Editing the Cluster Configuration
        Lab 8.3: GFS2: Adding Journals

Lecture 9 - Fencing and Failover
    Objectives
    No-fencing Scenario
    Fencing Components
    Fencing Agents
    Power Fencing versus Fabric Fencing
    SCSI Fencing
    Fencing From the Command Line
    The Fence Daemon - fenced
    Manual Fencing
    Fencing Methods
    Fencing Example - Dual Power Supply
    Handling Software Failures
    Handling Hardware Failures
    Failover Domains and Service Restrictions
    Failover Domains and Prioritization
    NFS Failover Considerations
    clusvcadm
    End of Lecture 9
    Lab 9: Fencing and Failover
        Lab 9.1: Node Priorities and Service Relocation

Lecture 10 - Quorum Disk
    Objectives
    Quorum Disk
    Quorum Disk Communications
    Quorum Disk Heartbeating and Status
    Quorum Disk Heuristics
    Quorum Disk Configuration
    Working with Quorum Disks
    Example: Two Cluster Nodes and a Quorum Disk Tiebreaker
    Example: Keeping Quorum When All Nodes but One Have Failed
    End of Lecture 10
    Lab 10: Quorum Disk
        Lab 10.1: Quorum Disk
Lecture 11 - rgmanager
    Objectives
    Resource Group Manager
    Cluster Configuration - Resources
    Resource Groups
    Start/Stop Ordering of Resources
    Resource Hierarchical Ordering
    NFS Resource Group Example
    Resource Recovery
    Service Status Checking
    Custom Service Scripts
    Displaying Cluster and Service Status
    Cluster Status (luci)
    Cluster Status Utility (clustat)
    Cluster Service States
    Cluster SNMP Agent
    Starting/Stopping the Cluster Software on a Member Node
    Cluster Shutdown Tips
    Troubleshooting
    Logging
    End of Lecture 11
    Lab 11: Cluster Manager
        Lab 11.1: Adding an NFS Service to the Cluster
        Lab 11.2: Configuring SNMP for Red Hat Cluster Suite

Lecture 12 - Comprehensive Review
    Objectives
    Start from scratch
    End of Lecture 12
    Lab 12: Comprehensive Review
        Lab 12.1: Rebuild your environment
        Lab 12.2: Setup iSCSI and multipath
        Lab 12.3: Build a three node cluster
        Lab 12.4: Add a quorum disk
        Lab 12.5: Add a GFS2 filesystem
        Lab 12.6: Add an NFS service to your cluster


Appendix A - Advanced RAID
    Objectives
    Redundant Array of Inexpensive Disks
    RAID0
    RAID1
    RAID5
    RAID5 Parity and Data Distribution
    RAID5 Layout Algorithms
    RAID5 Data Updates Overhead
    RAID6
    RAID6 Parity and Data Distribution
    RAID10
    Stripe Parameters
    /proc/mdstat
    Verbose RAID Information
    SYSFS Interface
    /etc/mdadm.conf
    Event Notification
    Restriping/Reshaping RAID Devices
    Growing the Number of Disks in a RAID5 Array
    Improving the Process with a Critical Section Backup
    Growing the Size of Disks in a RAID5 Array
    Sharing a Hot Spare Device in RAID
    Renaming a RAID Array
    Write-intent Bitmap
    Enabling Write-Intent on a RAID1 Array
    Write-behind on RAID1
    RAID Error Handling and Data Consistency Checking
    Appendix A: Lab: Advanced RAID
        Lab A.1: Improve RAID1 Recovery Times with Write-intent Bitmaps
        Lab A.2: Improve Data Reliability Using RAID 6
        Lab A.3: Improving RAID reliability with a Shared Hot Spare Device
        Lab A.4: Online Data Migration
        Lab A.5: Growing a RAID5 Array While Online
        Lab A.6: Clean Up
        Lab A.7: Rebuild Virtual Cluster Nodes

Introduction

RH436: Red Hat Enterprise


Clustering and Storage Management



Copyright 1

• The contents of this course and all its modules and related materials, including handouts to audience members,
are Copyright © 2011 Red Hat, Inc.
• No part of this publication may be stored in a retrieval system, transmitted or reproduced in any way, including,
but not limited to, photocopy, photograph, magnetic, electronic or other record, without the prior written
permission of Red Hat, Inc.
• This instructional program, including all material provided herein, is supplied without any guarantees from
Red Hat, Inc. Red Hat, Inc. assumes no liability for damages or legal action arising from the use or misuse of
contents or details contained herein.
• If you believe Red Hat training materials are being used, copied, or otherwise improperly distributed please
email training@redhat.com or phone toll-free (USA) +1 866 626 2994 or +1 919 754 3700.



Welcome 2

Please let us know if you need any special assistance while visiting our training facility.

Please introduce yourself to the rest of the class!

Welcome to Red Hat Training!


Welcome to this Red Hat training class! Please make yourself comfortable while you are here. If you have
any questions about this class or the facility, or need special assistance while you are here, please feel free
to ask the instructor or staff at the facility for assistance. Thank you for attending this course.

Telephone and network availability


Please only make telephone calls during breaks. Your instructor will direct you to the telephone to use.
Network access and analog phone lines may be available; if so, your instructor will provide information
about these facilities. Please turn pagers and cell phones to off or to silent or vibrate during class.

Restrooms
Your instructor will notify you of the location of restroom facilities and provide any access codes or keys
which are required to use them.

Lunch and breaks


Your instructor will notify you of the areas to which you have access for lunch and for breaks.

In Case of Emergency
Please let us know if anything comes up that will prevent you from attending or completing the class this
week.

Access
Each training facility has its own opening and closing times. Your instructor will provide you with this
information.



Red Hat Enterprise Linux 3

• Enterprise-targeted Linux operating system


• Focused on mature open source technology
• Extended release cycle between major versions
• With periodic minor releases during the cycle
• Certified with leading OEM and ISV products
• All variants based on the same code
• Certify once, run any application/anywhere/anytime
• Services provided on subscription basis

The Red Hat Enterprise Linux product family is designed specifically for organizations planning to use Linux
in production settings. All products in the Red Hat Enterprise Linux family are built on the same software
foundation, and maintain the highest level of ABI/API compatibility across releases and errata. Extensive
support services are available: a one year support contract and Update Module entitlement to Red Hat
Network are included with purchase. Various Service Level Agreements are available that may provide up
to 24x7 coverage with a guaranteed one hour response time for Severity 1 issues. Support will be available
for up to seven years after a particular major release.

Red Hat Enterprise Linux is released on a multi-year cycle between major releases. Minor updates to major
releases are released roughly every six months during the lifecycle of the product. Systems certified on
one minor update of a major release continue to be certified for future minor updates of the major release.
A core set of shared libraries have APIs and ABIs which will be preserved between major releases. Many
other shared libraries are provided, which have APIs and ABIs which are guaranteed within a major release
(for all minor updates) but which are not guaranteed to be stable across major releases.

Red Hat Enterprise Linux is based on code developed by the open source community and adds
performance enhancements, intensive testing, and certification on products produced by top independent
software and hardware vendors such as Dell, IBM, Fujitsu, BEA, and Oracle. Red Hat Enterprise Linux
provides a high degree of standardization through its support for five processor architectures (Intel x86-
compatible, AMD64/Intel 64, Intel Itanium 2, IBM POWER, and IBM mainframe on System z). Furthermore,
we support the 3000+ ISV certifications on Red Hat Enterprise Linux whether the RHEL operating system
those applications are using is running on "bare metal", in a virtual machine, as a software appliance, or in
the cloud using technologies such as Amazon EC2.



Red Hat Enterprise Linux Variants 4

• Red Hat Enterprise Linux Advanced Platform


• Unlimited server size and virtualization support
• HA clusters and cluster file system
• Red Hat Enterprise Linux
• Basic server solution for smaller non-mission-critical servers
• Virtualization support included
• Red Hat Enterprise Linux Desktop
• Productivity desktop environment
• Workstation option adds tools for software and network service development
• Multi-OS option for virtualization

Currently, on the x86 and x86-64 architectures, the product family includes:

Red Hat Enterprise Linux Advanced Platform: the most cost-effective server solution, this product includes
support for the largest x86-compatible servers, unlimited virtualized guest operating systems, storage
virtualization, high-availability application and guest fail-over clusters, and the highest levels of technical
support.

Red Hat Enterprise Linux: the basic server solution, supporting servers with up to two CPU sockets and up
to four virtualized guest operating systems.

Red Hat Enterprise Linux Desktop: a general-purpose client solution, offering desktop applications such
as the OpenOffice.org office suite and Evolution mail client. Add-on options provide support for high-end
technical and development workstations and for running multiple operating systems simultaneously through
virtualization.

Two standard installation media kits are used to distribute variants of the operating system. Red Hat
Enterprise Linux Advanced Platform and Red Hat Enterprise Linux are shipped on the Server media kit.
Red Hat Enterprise Linux Desktop and its add-on options are shipped on the Client media kit. Media kits
may be downloaded as ISO 9660 CD-ROM file system images from Red Hat Network or may be provided
in a boxed set on DVD-ROMs.

Please visit http://www.redhat.com/rhel/ for more information about the Red Hat Enterprise Linux
product family. Other related products include realtime kernel support in Red Hat Enterprise MRG, the thin
hypervisor node in Red Hat Enterprise Virtualization, and so on.



Red Hat Subscription Model 5

• Red Hat sells subscriptions that entitle systems to receive a set of services that
support open source software
• Red Hat Enterprise Linux and other Red Hat/JBoss solutions and applications
• Customers are charged an annual subscription fee per system
• Subscriptions can be migrated as hardware is replaced
• Can freely move between major revisions, up and down
• Multi-year subscriptions are available
• A typical service subscription includes:
• Software updates and upgrades through Red Hat Network
• Technical support (web and phone)
• Certifications, stable APIs/versions, and more

Red Hat doesn't exactly sell software. What we sell is service through support subscriptions.

Customers are charged an annual subscription fee per system. This subscription includes the ability to
manage systems and download software and software updates through our Red Hat Network service; to
obtain technical support (through the World-Wide Web or by telephone, with terms that vary depending on
the exact subscription purchased), and extended software warranties and IP indemnification to protect the
customer from service interruption due to software bugs or legal issues.

In turn, the subscription-based model gives customers more flexibility. Subscriptions are tied to a service
level, not to a release version of a product; therefore, upgrades (and downgrades!) of software between
major releases can be done on a customer's own schedule. Management of versions to match the
requirements of third-party software vendors is simplified as well. Likewise, as hardware is replaced, the
service entitlement which formerly belonged to a server being decommissioned may be freely moved to a
replacement machine without requiring any assistance from Red Hat. Multi-year subscriptions are available
as well to help customers better tie software replacement cycles to hardware refresh cycles.

Subscriptions are not just about access to software updates. They provide unlimited technical support;
hardware and software certifications on tested configurations; guaranteed long-term stability of a major
release's software versions and APIs; the flexibility to move entitlements between versions, machines, and
in some cases processor architectures; and access to various options through Red Hat Network and add-
on products for enhanced management capabilities.

This allows customers to reduce deployment risks. Red Hat can deliver new technology as it becomes
available in major releases. But you can choose when and how to move to those releases, without needing
to relicense to gain access to a newer version of the software. The subscription model helps reduce
your financial risk by providing a road map of predictable IT costs (rather than suddenly having to buy
licenses just because a new version has arrived). Finally, it allows us to reduce your technological risk
by providing a stable environment tested with software and hardware important to the enterprise. Visit
http://www.redhat.com/rhel/benefits/ for more information about the subscription model.



Contacting Technical Support 6

• Collect information needed by technical support:


• Define the problem
• Gather background information
• Gather relevant diagnostic information, if possible
• Determine the severity level
• Contacting technical support by WWW:
• http://www.redhat.com/support/

• Contacting technical support by phone:


• See http://www.redhat.com/support/policy/sla/contact/
• US/Canada: 888-GO-REDHAT (888-467-3342)

Information on the most important steps to take to ensure your support issue is resolved by Red Hat as
quickly and efficiently as possible is available at http://www.redhat.com/support/process/production/.
This is a brief summary of that information for your convenience. You may be able to
resolve your problem without formal technical support by looking for your problem in Knowledgebase
(http://kbase.redhat.com/).

Define the problem. Make certain that you can articulate the problem and its symptoms before you contact
Red Hat. Be as specific as possible, and detail the steps you can use (if any) to reproduce the problem.

Gather background information. What version of our software are you running? Are you using the latest
update? What steps led to the failure? Can the problem be recreated and what steps are required? Have
any recent changes been made that could have triggered the issue? Were messages or other diagnostic
messages issued? What exactly were they (exact wording may be critical)?

Gather relevant diagnostic information. Be ready to provide as much relevant information as possible; logs,
core dumps, traces, the output of sosreport, etc. Technical Support can assist you in determining what is
relevant.
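For example, the sosreport utility (from the sos package) collects system logs and configuration details into
a single archive that can be attached to a support case. A minimal run as root might look like the following
(illustrative; by default the archive is written under /tmp):

# Install the collection tool if it is not already present
yum install sos

# Generate the diagnostic archive; you are prompted for a name and case number
sosreport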

Determine the Severity Level of your issue. Red Hat uses a four-level scale to indicate the criticality of
issues; criteria may be found at http://www.redhat.com/support/policy/GSS_severity.html.

Red Hat Support may be contacted through a web form or by phone depending on your support level.
Phone numbers and business hours for different regions vary; see http://www.redhat.com/support/policy/sla/contact/
for exact details. When contacting us about an issue, please have the following
information ready:

Red Hat Customer Number


Machine type/model
Company name
Contact name
Preferred means of contact (phone/e-mail) and telephone number/e-mail address at which you can be reached
Related product/version information
Detailed description of the issue
Severity Level of the issue in respect to your business needs



Red Hat Network 7

• A systems management platform providing lifecycle management of the operating


system and applications
• Installing and provisioning new systems
• Updating systems
• Managing configuration files
• Monitoring performance
• Redeploying systems for a new purpose
• "Hosted" and "Satellite" deployment architectures

Red Hat Network's modular service model allows you to pick and choose the features you need to manage
your enterprise.

The basic Update service is provided as part of all Red Hat Enterprise Linux subscriptions. Through it,
you can use Red Hat Network to easily download and install security patches and updates from an RHN
server. All content is digitally signed by Red Hat for added security, so that you can ensure that packages
actually came from us. The yum (or older up2date) utility automatically resolves dependencies to ensure
the integrity of your system when you use it to initiate an update from the managed station itself. You can
also log into a web interface on the RHN server to remotely add software or updates or remove undesired
software packages, or set up automatic updates to allow systems to get all fixes immediately.
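For example, on a registered system the yum utility can be used to review and apply the updates published
through your subscribed channels (commands are illustrative; the package name is only an example):

# List packages for which updates are available
yum check-update

# Download and install all pending updates, resolving dependencies automatically
yum update

# Install a single package (and its dependencies)
yum install device-mapper-multipath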

The add-on Management module allows you to organize systems into management groups and perform
update or other management operations on all members of the group. You can also set up subaccounts in
Red Hat Network which have access to machines in some of your groups but not others. Powerful search
capabilities allow you to identify systems based on their packages or hardware characteristics, and you can
also compare package profiles of two systems.

The Provisioning module makes it easier for you to deploy new systems or redeploy existing systems
using predetermined profiles or through system cloning. You can use RHN to store, manage, and deploy
configuration files as well as software package files. You can use tools to help write automated installation
Kickstart configurations and apply them to selected systems. You can undo problematic changes through
a roll-back feature. Management and Provisioning modules are included as part of a Red Hat Enterprise
Linux Desktop subscription at no additional fee.

The Monitoring module is only available with RHN Satellite, and allows you to set up to dozens of low-impact
probes for each system and many applications (including Oracle, MySQL, BEA, and Apache) to track
availability and performance.

RHN is initially deployed in a "hosted" model, where the central update server is located at a Red Hat
facility and is contacted over the Internet using HTTP/SSL. To reduce bandwidth, you can site a RHN Proxy
Server at your facility which caches packages requested by your systems. For maximum flexibility, you may
use RHN Satellite, which places the RHN server at your site under your control; this can run disconnected
from the Internet, or may be connected to the Internet to download update content from the official hosted
RHN servers to populate its service channels.



Red Hat Services and Products 8

• Red Hat supports software products and services beyond Red Hat Enterprise Linux
• JBoss Enterprise Middleware
• Systems and Identity Management
• Infrastructure products and distributed computing
• Training, consulting, and extended support
• http://www.redhat.com/products/

Red Hat offers a number of additional open source application products and operating system
enhancements which may be added to the standard Red Hat Enterprise Linux operating system. As with
Red Hat Enterprise Linux, Red Hat provides a range of maintenance and support services for these add-
on products. Installation media and software updates are provided through the same Red Hat Network
interface used to manage Red Hat Enterprise Linux systems.

For additional information, see the following web pages:

• General product information: http://www.redhat.com/products/

• Red Hat Solutions Guide: http://www.redhat.com/solutions/guide/



Fedora and EPEL 9

• Open source projects sponsored by Red Hat


• Fedora distribution is focused on latest open source technology
• Rapid six month release cycle
• Available as free download from the Internet
• EPEL provides add-on software for Red Hat Enterprise Linux
• Open, community-supported proving grounds for technologies which may be used
in upcoming enterprise products
• Red Hat does not provide formal support

Fedora is a rapidly evolving, technology-driven Linux distribution with an open, highly scalable development
and distribution model. It is sponsored by Red Hat but created by the Fedora Project, a partnership of free
software community members from around the globe. It is designed to be a fully-operational, innovative
operating system which also is an incubator and test bed for new technologies that may be used in later
Red Hat enterprise products. The Fedora distribution is available for free download from the Internet.

The Fedora Project produces releases of Fedora on a short, roughly six month release cycle, to bring the
latest innovations of open source technology to the community. This may make it attractive for power users
and developers who want access to cutting-edge technology and can handle the risks of adopting rapidly
changing new technology. Red Hat does not provide formal support for Fedora.

The Fedora Project also supports EPEL, Extra Packages for Enterprise Linux. EPEL is a volunteer-based
community effort to create a repository of high-quality add-on packages which can be used with Red
Hat Enterprise Linux and compatible derivatives. It accepts legally-unencumbered free and open source
software which does not conflict with packages in Red Hat Enterprise Linux or Red Hat add-on products.
EPEL packages are built for a particular major release of Red Hat Enterprise Linux and will be updated by
EPEL for the standard support lifetime of that major release.

Red Hat does not provide commercial support or service level agreements for EPEL packages. While
not supported officially by Red Hat, EPEL provides a useful way to reduce support costs for unsupported
packages which your enterprise wishes to use with Red Hat Enterprise Linux. EPEL allows you to distribute
support work you would need to do by yourself across other organizations which share your desire to
use this open source software in RHEL. The software packages themselves go through the same review
process as Fedora packages, meaning that experienced Linux developers have examined the packages for
issues. As EPEL does not replace or conflict with software packages shipped in RHEL, you can use EPEL
with confidence that it will not cause problems with your normal software packages.

For developers who wish to see their open source software become part of Red Hat Enterprise Linux, often
a first stage is to sponsor it in EPEL so that RHEL users have the opportunity to use it, and so experience is
gained with managing the package for a Red Hat distribution.

Visit http://fedoraproject.org/ for more information about the Fedora Project.

Visit http://fedoraproject.org/wiki/EPEL/ for more information about EPEL.



Classroom Setup 10

• Instructor machine: instructor.example.com, 192.168.0.254
  • Provides DNS, DHCP, Internet routing
  • Class material: /var/ftp/pub/

• Student machines: stationX.example.com, 192.168.0.X
  • Provide virtual machines, iSCSI storage
  • Uses multiple internal bridges for cluster traffic
  • Virtual machines: node0, node1, node2, node3
  • node0 is kickstarted as a template
  • node1-3 are snapshots of node0

The instructor system provides a number of services to the classroom network, including:

• A DHCP server

• A web server. The web server distributes RPMs at http://instructor.example.com/pub.

• An FTP server. The FTP server distributes RPMs at ftp://instructor.example.com/pub.

• An NFS server. The NFS server distributes RPMs at nfs://instructor.example.com/var/ftp/pub.

• An NTP (network time protocol) server, which can be used to assist in keeping the clocks of classroom
computers synchronized.

In addition to the local classroom machine, virtual machines will be used by each student. The physical host
has a script (rebuild-cluster) that is used to create the template virtual machine. The same script is
used to create the cluster machines, which are really logical volume snapshots of the Xen virtual machine.
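For example, a student machine could point yum at the classroom web server with a repository definition
similar to the following sketch (the file name is hypothetical, and the exact directory layout under pub/ may
differ in your classroom):

# /etc/yum.repos.d/classroom.repo (illustrative)
[classroom]
name=Classroom RPMs from instructor.example.com
baseurl=http://instructor.example.com/pub
enabled=1
gpgcheck=0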



Networks 11

• 192.168.0.0/24
  • classroom network
  • instructor.example.com eth0 192.168.0.254
  • stationX.example.com eth0 192.168.0.X

• 172.16.0.0/16
  • public application network
  • bridged to classroom net
  • Instructor: instructor.example.com eth0:1 172.16.255.254
  • Workstation: cXn5.example.com eth0:0 172.16.50.X5
  • Virtual Nodes: cXnN.example.com eth0 172.16.50.XN

• 172.17.X.0/24
  • private cluster network
  • internal bridge on workstations
  • Workstation: dom0.clusterX.example.com cluster 172.17.X.254
  • Virtual Nodes: nodeN.clusterX.example.com eth1 172.17.X.N

• 172.17.100+X.0/24
  • first iSCSI network
  • internal bridge on workstations
  • Workstation: storage1.clusterX.example.com storage1 172.17.100+X.254
  • Virtual Nodes: nodeN-storage1.clusterX.example.com eth2 172.17.100+X.N

• 172.17.200+X.0/24
  • second iSCSI network
  • internal bridge on workstations
  • Workstation: storage2.clusterX.example.com storage2 172.17.200+X.254
  • Virtual Nodes: nodeN-storage2.clusterX.example.com eth3 172.17.200+X.N



Notes on Internationalization 12

• Red Hat Enterprise Linux supports nineteen languages


• Default system-wide language can be selected
• During installation
• With system-config-language (System->Administration->Language)

• Users can set personal language preferences

• From graphical login screen (stored in ~/.dmrc)
• For interactive shell (with LANG environment variable in ~/.bashrc)
• Alternate languages can be used on a per-command basis:

[user@host ~]$ LANG=ja_JP.UTF-8 date

Red Hat Enterprise Linux 5 supports nineteen languages: English, Bengali, Chinese (Simplified), Chinese
(Traditional), French, German, Gujarati, Hindi, Italian, Japanese, Korean, Malayalam, Marathi, Oriya,
Portuguese (Brazilian), Punjabi, Russian, Spanish and Tamil. Support for Assamese, Kannada, Sinhalese
and Telugu are provided as technology previews.

The operating system's default language is normally set to US English (en_US.UTF-8), but this can be
changed during or after installation. To use other languages, you may need to install extra packages to
provide the appropriate fonts, translations and so forth. These can be selected during system installation or
with system-config-packages (Applications->Add/Remove Software).

A system's default language can be changed with system-config-language (System->Administration->Language),
which affects the /etc/sysconfig/i18n file.

Users may prefer to use a different language for their own desktop environment or interactive shells than is
set as the system default. This is indicated to the system through the LANG environment variable.

This may be set automatically for the GNOME desktop environment by selecting a language from the
graphical login screen by clicking on the Language item at the bottom left corner of the graphical login
screen immediately prior to login. The user will be prompted about whether the language selected should
be used just for this one login session or as a default for the user from now on. The setting is saved in the
user's ~/.dmrc file by GDM.

If a user wants to make their shell environment use the same LANG setting as their graphical environment
even when they login through a text console or over ssh, they can set code similar to the following in their
~/.bashrc file. This will set their preferred language if one is saved in ~/.dmrc and use the system
default if not:

i=$(grep 'Language=' ${HOME}/.dmrc | sed 's/Language=//')

if [ "$i" != "" ]; then
    export LANG=$i
fi

Languages with non-ASCII characters may have problems displaying in some environments. Kanji
characters, for example, may not display as expected on a virtual console. Individual commands can be
made to use another language by setting LANG on the command-line:

[user@host ~]$ LANG=fr_FR.UTF-8 date



mer. août 19 17:29:12 CDT 2009

Subsequent commands will revert to using the system's default language for output. The locale command
can be used to check the current value of LANG and other related environment variables.

SCIM (Smart Common Input Method) can be used to input text in various languages under X if the
appropriate language support packages are installed. Type Ctrl-Space to switch input methods.



Lecture 1

Storage Technologies

Upon completion of this unit, you should be able to:


• Define storage technologies
• Describe the Red Hat Storage Model
• Connect to and configure lab environment equipment



The Data 1-1

• User versus System data


• Availability requirements
• Frequency and type of access
• Directory location
• /home versus /var/spool/mail

• Application data
• Shared?
• Host or hardware-specific data

User data often has more demanding requirements and challenges than system data. System data is
often easily re-created from installation CDs and a relatively small amount of backed-up configuration files.
System data can often be reused for similar architecture machines, whereas user data is highly specific to
each user.

Some user data lies outside of typical user boundaries, like user mailboxes.

Would the data ideally be shared among many machines?

Is the data specific to a particular type of architecture?



Data Storage Considerations 1-2

• Is it represented elsewhere?
• Is it private or public?
• Is it nostalgic or pertinent?
• Is it expensive or inexpensive?
• Is it specific or generic?

Is the data unique, or are there readily-accessible copies of it elsewhere?

Does the data need to be secured, or is it available to anyone who requests it?

Is the data stored for historical purposes, or are old and new data being accessed just as frequently?

Was the data difficult or expensive to obtain? Could it just be calculated from other already-available data,
or is it one of a kind?

Is the data specific to a particular architecture or OS type? Is it specific to one application, or one version of
one application?



Data Availability 1-3

• How available must it be?


• Data lifetime
• Archived or stored?
• Frequency and method of access
• Read-only or modifiable
• Application-specific or direct access
• Network configuration and security
• Is performance a concern?
• Applications "data starved"?
• Where are my single points of failure (SPOF)?

What happens if the data becomes unavailable? What needs to be done in the event of data
downtime?

How long is the data going to be kept around? Is it needed to establish a historical profile, or is it no longer
valid after a certain time period?

Is this data read-only, or is it frequently modified? What exactly is modified? Is modification a privilege of
only certain users or applications?

Are applications or users limited in any way by the performance of the data storage? What happens when
an application is put into a wait-state for the data it needs?

With regard to the configuration environment and resources used, where are my single points of failure?



Planning for the Future 1-4

• Few data requirements ever diminish


• Reduce complexity
• Increase flexibility
• Storage integrity

Few data requirements ever diminish: the number of users, the size of stored data, the frequency of access,
etc.... What mechanisms are in place to aid this growth?

A reduction in complexity often means a simpler mechanism for its management, which often leads to less
error-prone tools and methods.



The RHEL Storage Model 1-5

[Figure: the RHEL storage model layers - File System Driver, Block Device Driver, Volume]
The Red Hat Enterprise Linux (RHEL) Storage Model for an individual host includes physical volumes,
kernel device drivers, the Virtual File System and application data structures. All file access, both to the data
and to the meta-data organizing that data, is managed in the same way by the same kernel I/O system.

RHEL includes many computing applications, each with its own file or data structure, including network
services, document processing, databases and other media. With respect to data storage, the file type depends
less on the way it is stored than on the method by which an application at this layer accesses it.

The Virtual File System, or VFS, layer is the interface which handles file system related system calls for the
kernel. It provides a uniform mechanism for these calls to be passed to any one of a variety of different file
system implementations in the kernel such as ext3, msdos, GFS, NFS, CIFS, and so on. For example, if
a file on an ext3-formatted file system is opened by a program, VFS transparently passes the program's
open() system call to the kernel code (device driver) implementing the ext3 file system.

The file system device driver then typically sends low-level requests to the device driver implementing the
block device containing the filesystem. This could be a local hardware device (IDE, SCSI), a logical device
(software RAID, LVM), or a remote device (iSCSI), for example.

Volumes are contrived through device driver access. Whether the volume is provided through a local
system bus, or over an IP network infrastructure, it always provides logical bounds through which a file (or
record) data structure is accessible. Volumes do not organize data, but provide the logical "size" of such an
organizing structure.
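The layers described above can be inspected on a running system; a minimal sketch (device names and file
system types will differ from host to host):

# File system implementations currently registered with the VFS
cat /proc/filesystems

# Block devices the kernel currently knows about (local, logical, or remote)
ls /sys/block

# Which block device and file system type back each mounted volume
df -T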



Volume Management 1-6

• A volume defines some form of block aggregation


• Many devices may be combined as one
• Optimized through low-level device configuration (often in hardware)
• Striping, Concatenation, Parity

• Consistent name space


• LUN
• UUID

A volume is some form of block aggregation that describes the physical bounds of data. These bounds
represent physical constraints of hardware and its abstraction or virtualization. Device capabilities,
connectivity and reliability all influence the availability of this data "container." Data cannot exceed these
bounds; therefore, block aggregation must be flexible.

Often times, volumes are made highly available or are optimized at the hardware level. For example,
specialty hardware may provide RAID 5 "behind the scenes" but present simple virtual SCSI devices to be
used by the administrator for any purpose, such as creating logical volumes.

If the RAID controller has multi-LUN support (is able to simulate multiple SCSI devices from a single one
or aggregation), larger storage volumes can be carved into smaller pieces, each of which is assigned a
unique SCSI Logical Unit Number (LUN). A LUN is simply a SCSI address used to reference a particular
volume on the SCSI bus. LUNs can be masked, which provides the ability to exclusively assign a LUN to
one or more host connections. LUN masking does not use any special type of connection, it simply hides
unassigned LUNs from specific hosts (similar to an unlisted telephone number).

The Universally Unique IDentifier (UUID) is a reasonably guaranteed-to-be-unique 128 bit number used
to uniquely identify objects within a distributed system (such as a shared LUN, physical volume, volume
group, or logical volume).

UUIDs may be viewed using the blkid command:

# blkid
/dev/mapper/VolGroup00-LogVol01: TYPE="swap"
/dev/mapper/VolGroup00-LogVol00: UUID="9924e91b-1e5c-44e2-bd3c-d1fbc82ce488" SEC_TYPE="ext2" TYPE="ext3"
/dev/sda1: LABEL="/boot" UUID="e000084b-26b9-4289-b1d9-efae190c22f5" SEC_TYPE="ext2" TYPE="ext3"
/dev/VolGroup00/LogVol01: TYPE="swap"
/dev/sdb1: UUID="111a7953-85a5-4b28-9cff-b622316b789b" SEC_TYPE="ext2" TYPE="ext3"
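Because a UUID identifies the volume itself rather than its current device path, it can be used to locate and
mount a filesystem even if the underlying /dev name changes. A minimal sketch using the last UUID from the
listing above (the /mnt/data mount point is hypothetical):

# Resolve a UUID to whatever device node currently carries it
findfs UUID=111a7953-85a5-4b28-9cff-b622316b789b

# Mount the filesystem by UUID instead of by device name
mount UUID=111a7953-85a5-4b28-9cff-b622316b789b /mnt/data

# Equivalent /etc/fstab entry
UUID=111a7953-85a5-4b28-9cff-b622316b789b  /mnt/data  ext3  defaults  0 0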



SAN versus NAS 1-7

• Two shared storage technologies trying to accomplish the same thing -- data
delivery
• Network Attached Storage (NAS)
• The members are defined by the network
• Scope of domain defined by IP domain
• NFS/CIFS/HTTP over TCP/IP
Delivers file data blocks

• Storage Area Network (SAN)


• The network is defined by its members
• Scope of domain defined by members
• Encapsulated SCSI over fibre channel
Delivers volume data blocks

Though often used interchangeably, Storage Area Network (SAN) and Network Attached Storage (NAS) differ.
NAS is best described as IP network access to File/Record data. A SAN represents a collection of
hardware components which, when combined, present the disk blocks comprising a volume over a fibre
channel network. The iSCSI-SCSI layer communication over IP also satisfies this definition: the delivery of
low-level device blocks to one or more systems equally.

NAS servers generally run some form of a highly optimized embedded OS designed for file sharing. The
NAS box has direct attached storage, and clients connect to the NAS server just like a regular file server,
over a TCP/IP network connection. NAS deals with files/records.

Contrast this with most SAN implementations in which Fibre-channel (FC) adapters provide the
physical connectivity between servers and disk. Fibre-channel uses the SCSI command set to handle
communications between the computer and the disks; done properly, every computer connected to the disk
view it as if it were direct attached storage. SANs deal with disk blocks.

A SAN essentially becomes a secondary LAN, dedicated to interconnecting computers and storage
devices. The advantages are that SCSI is optimized for transferring large chunks of data across a reliable
connection, and having a second network can off-load much of the traffic from the LAN, freeing up capacity
for other uses.



SAN Technologies 1-8

• Different mechanisms of connecting storage devices to machines over a network


• Used to emulate a SCSI device by providing transparent delivery of SCSI protocol
to a storage device
• Provide the illusion of locally-attached storage
• Fibre Channel
• Networking protocol and hardware for transporting SCSI protocol across fiber optic equipment
• Internet SCSI (iSCSI)
• Network protocol that allows the use of the SCSI protocol over TCP/IP networks
• "SAN via IP"
• Global Network Block Device (GNBD)
• Client/Server kernel modules that provide block-level storage access over an Ethernet LAN
• Deprecated by iSCSI; included for compatibility only

Most storage devices use the SCSI (Small Computer System Interface) command set to communicate.
This is the same command set that was developed to control storage devices attached to a SCSI parallel
bus. The SCSI command set is not tied to the originally-used bus and is now commonly used for all storage
devices with all types of connections, including fibre channel. The command set is still referred to as the
SCSI command set.

The LUN on a SCSI parallel bus is actually used to electrically address the various devices. The concept of
a LUN has been adapted to fibre channel devices to allow multiple SCSI devices to appear on a single fibre
channel connection.

It is important to distinguish between a SCSI device and a fibre channel (or iSCSI, or GNBD) device. A fibre
channel device is an abstract device that emulates one or more SCSI devices at the lowest level of storage
virtualization. There is not an actual SCSI device, but one is emulated by responding appropriately to the
SCSI protocol.

SCSI over fibre channel is similar to speaking a language over a telephone connection. The low level
connection (fibre channel) is used to transport the conversation's language (SCSI command set).



Fibre Channel 1-9

• Common enterprise-class network connection to storage technology


• Major components:
• Fiber optic cable
• Interface card (Host Bus Adaptor)
• Fibre Channel switching technology

Fibre Channel is a storage networking technology that provides flexible connectivity options to storage
using specialized network switches, fiber optic cabling, and optic connectors.

While the most common connecting cable for fibre channel is fiber optic, it can also run over twisted-pair
copper wire, despite the implied limitation of the technology's name. Transmitting the data via light signals,
however, allows the cabling lengths to far exceed those of normal copper wiring and makes the link far more
resistant to electrical interference.

The Host Bus Adaptor (HBA), in its many forms, is used to convert the light signals transmitted over the
fiber-optic cables to electrical signals (and vice-versa) for interpretation by the endpoint host and storage
technologies.

The fibre channel switch is the foundation of a fibre channel network, defining the topology of how the
network ports are arranged and the data path's resistance to failure.



Host Bus Adapter (HBA) 1-10

• Used to connect hosts to the fibre channel network


• Appears as a SCSI adapter
• Relieves the host microprocessor of data I/O tasks
• Multipathing capable

An HBA is simply the hardware on the host machine that connects it to, for example, a fibre channel
networked device. The hardware can be a PCI, Sbus, or motherboard-embedded IC that translates signals
on the local computer to frames on the fibre channel network.

An operating system treats an HBA exactly like it does a SCSI adapter. The HBA takes the SCSI
commands it was sent and translates them into the fibre channel protocol, adding network headers and
error handling. The HBA then makes sure the host operating system gets return information and status
back from the storage device across the network, just like a SCSI adapter would.

Some HBAs offer more than one physical pathway to the fibre channel network. This is referred to as
multipathing.

While the analogy can be drawn to NICs and their purpose, HBAs tend to be far more intelligent: switch
negotiation, tracking devices on the network, I/O processing offloading, network configuration monitoring,
load balancing, and failover management. Critical to the HBA is the driver that controls it and communicates
with the host operating system.

In the case of iSCSI-like technologies, TCP Offloading Engine (TOE) cards can be used instead of ordinary
NICs for performance enhancement.



Fibre Channel Switch 1-11

• Foundation of a Fibre channel SAN, providing:


• High-speed non-blocking interconnect between devices
• Fabric services
• Additional ports for scalability
• Linking capability of the SAN over a wide distance
• Switch topologies
• Point-to-Point - A simple two-device connection
• Arbitrated loop - All devices are arranged in a loop connection
• Switched fabric - All devices are connected to one or more interconnected Fibre Channel switches,
and the switches manage the resulting "fabric" of communication channels

The fibre channel fabric refers to one or more interconnected switches that can communicate with each
other independently instead of having to share the bandwidth, such as in a looped network connection.

Additional fiber channel switches can be combined into a variety of increasingly complex wired connection
patterns to provide total redundancy so that failure of any one switch will not harm the fabric connection and
still provide maximum scalability.

Fibre channel switches can provide fabric services. The services provided are conceptually distributed
(independent of direct switch attachment) and include a login server (fabric device authentication), name
server (a distributed database that registers all devices on a fabric and responds to requests for address
information), time server (so devices can maintain system time with each other), alias server (like a name
server for multicast groups), and others.

Fibre channel is capable of communicating up to 100km.



Internet SCSI (iSCSI) 1-12

• A protocol that enables clients (initiators) to send SCSI commands to remote
storage devices (targets)
• Uses TCP/IP (tcp:3260, by default)
• Often seen as a low-cost alternative to Fibre Channel because it can run over
existing switches and network infrastructure

iSCSI sends storage traffic over TCP/IP, so that inexpensive Ethernet equipment may be used instead
of Fibre Channel equipment. FC currently has a performance advantage, but 10 Gigabit Ethernet will
eventually allow TCP/IP to surpass FC in overall transfer speed despite the additional overhead of TCP/IP
to transmit data. TCP offload engines (TOE) can be used to remove the burden of doing TCP/IP from the
machines using iSCSI. iSCSI is routable, so it can be accessed across the Internet.



End of Lecture 1

• Questions and Answers


• Summary
• How best to manage your data
• Describe Red Hat Storage Model
• Explain Common Storage Hardware



Lab 1.1: Evaluating Your Storage Requirements
Instructions:

1. What is the largest amount of data you manage, including all types and all computing
platforms?

2. What is the smallest significant group of data that must be managed?

3. How many applications require access to your largest data store? Are these applications running
on the same computing platform?

4. How many applications require access to your smallest data store? Are these applications
running on the same computing platform?

5. How would you best avoid redundancy of data stored while optimizing data access and
distribution? How many copies of the same data are available directly to each host? How many
are required?

6. When was the last time you reduced the size of a data storage environment, including the
amount of data and the computing infrastructure it supported? Why was this necessary?

7. Which data store is the most unpredictable (categorize by growth, access, or other means)?
What accounts for that unpredictability?

8. Which is the most predictable data store you manage? What makes this data store so
predictable?

9. List your top five most commonly encountered data management issues and categorize them
according to whether they are hardware, software, security, user related, or other.



10. What does data unavailability "cost" your organization?

11. What percentage of your data storage is archived, or "copied" to other media to preserve its
state at a point in time? Why do you archive data? What types of data would you never archive,
and why? How often do you archive your data?

12. What is the least important data store of your entire computing environment? What makes it
unimportant?

Lab 1.2: Configuring the Virtual Cluster Environment
Scenario: The root password is redhat for your classroom workstation and for all
virtual machines.

Deliverable: Create, install, and test the virtual cluster machines hosted by your
workstation.

Instructions:

1. Configure your physical machine to recognize the hostnames of your virtual machines:

stationX# cat RH436/HelpfulFiles/hosts-table >> /etc/hosts



2. The virtual machines used for your labs still need to be created. Execute the script
rebuild-cluster -m. This script will build a master Xen virtual machine (cXn0.example.com,
172.16.50.X0, hereafter referred to as 'node0') within a logical volume. The node0 Xen
virtual machine will be used as a template to create three snapshot images. These snapshot
images will, in turn, become our cluster nodes.

stationX# rebuild-cluster -m


This will create or rebuild the template node (node0).
Continue? (y/N): y

If you are logged in graphically, a virt-viewer window will automatically be created; otherwise your
terminal will automatically become the console window for the install.

The installation process for this virtual machine template will take approximately 10-15
minutes.

3. Once your node0 installation is complete and the node has shut down, your three cluster
nodes:

cXn1.example.com 172.16.50.X1
cXn2.example.com 172.16.50.X2
cXn3.example.com 172.16.50.X3

can now be created. Each cluster node is created as a logical volume snapshot of node0.

The pre-created rebuild-cluster script simplifies the process of creating and/or
rebuilding your three cluster nodes. Feel free to inspect the script's contents to see what it is
doing. Passing any combination of numbers in the range 1-3 as an option to rebuild-cluster
creates or rebuilds those corresponding cluster nodes in a process that takes only a
few minutes.
At this point, create three new nodes:

stationX# rebuild-cluster -123


This will create or rebuild node(s): 1 2 3
Continue? (y/N): y
Monitor the boot process of one or all three nodes using the command:

stationX# xm console nodeN

where N is a node number in the range 1-3. Console mode can be exited at any time with the
keystroke combination: Ctrl-].
To rebuild only node3, execute the following command (Do not worry if it has not finished
booting yet):

stationX# rebuild-cluster -3
Because the cluster nodes are snapshots of an already-created virtual machine, the rebuilding
process is dramatically reduced in time, compared to building a virtual machine from scratch, as
we did with node0.

You should be able to log into all three machines once they have completed the boot process.

For your convenience, an /etc/hosts table has already been preconfigured on your cluster
nodes with name-to-IP mappings of your assigned nodes.

If needed, ask your instructor for assistance.





Lecture 2

iSCSI Configuration

Upon completion of this unit, you should be able to:


• Describe the iSCSI Mechanism
• Define iSCSI Initiators and Targets
• Explain iSCSI Configuration and Tools



Red Hat iSCSI Driver 2-1

• Provides a host with the ability to access storage via IP


• iSCSI versus SCSI/FC access to storage:

[Figure: comparison of the two storage stacks -- host applications talk to the SCSI driver, which sits either
on the iSCSI driver over TCP/IP and the network drivers (reaching storage through a storage router or
gateway), or directly on a SCSI or FC adapter driver.]

The iSCSI driver provides a host with the ability to access storage through an IP network.

The driver uses the iSCSI protocol (IETF-defined) to transport SCSI requests and responses over an IP
network between the host and an iSCSI target device. For more information about the iSCSI protocol, refer
to RFC 3720 (http://www.ietf.org/rfc/rfc3720.txt).

Architecturally, the iSCSI driver combines with the host's TCP/IP stack, network drivers, and Network
Interface Card (NIC) to provide the same functions as a SCSI or a Fibre Channel (FC) adapter driver with a
Host Bus Adapter (HBA).



iSCSI Data Access 2-2

• Clients (initiators) send SCSI commands to remote storage devices (targets)


• Uses TCP/IP (tcp:3260, by default)
• Initiator
• Requests remote block device(s) via discovery process
• iSCSI device driver required
• iscsi service enables target device persistence
• Package: iscsi-initiator-utils-*.rpm
• Target
• Exports one or more block devices for initiator access
• Supported starting with RHEL 5.3
• Package: scsi-target-utils-*.rpm

An initiating device is one that actively seeks out and interacts with target devices, while a target is a
passive device.

The host ID is unique for every target. The LUN ID is assigned by the iSCSI target.

The iSCSI driver provides a transport for SCSI requests and responses to storage devices via an IP
network instead of using a direct attached SCSI bus channel or an FC connection. The Storage Router, in
turn, transports these SCSI requests and responses received via the IP network between it and the storage
devices attached to it.

Once the iSCSI driver is installed, the host will proceed with a discovery process for storage devices as
follows:

• The iSCSI driver requests available targets through a discovery mechanism as configured in the
/etc/iscsi/iscsid.conf configuration file.

• Each iSCSI target sends available iSCSI target names to the iSCSI driver.

• The iSCSI target accepts the login and sends target identifiers.

• The iSCSI driver queries the targets for device information.

• The targets respond with the device information.

• The iSCSI driver creates a table of available target devices.

Once the table is completed, the iSCSI targets are available for use by the host using the same commands
and utilities as a direct attached (e.g., via a SCSI bus) storage device.



iSCSI Driver Features 2-3

• Header and data digest support
• Two-way CHAP authentication
• R2T flow control support with a target
• Multipath support (RHEL4-U2)
• Target discovery mechanisms
• Dynamic target discovery
• Async event notifications for portal and target changes
• Immediate Data Support
• Dynamic driver reconfiguration
• Auto-mounting for iSCSI filesystems after a reboot
Header and data digest support - The iSCSI protocol defines a 32-bit CRC digest on an iSCSI packet to
detect corruption of the headers (header digest) and/or data (data digest) because the 16-bit checksum
used by TCP is considered too weak for the requirements of storage on long distance data transfer.

Two way Challenge Handshake Authentication Protocol (CHAP) authentication - Used to control access to
the target, and for verification of the initiator.

Ready-to-Transfer (R2T) flow control support - A type of target communications flow control.

Red Hat multi-path support - iSCSI target access via multiple paths and an automatic failover mechanism. Available since RHEL4-U2.

Sendtargets discovery mechanism - A mechanism by which the driver can submit requests for available
targets.
Dynamic target discovery - Targets can be changed dynamically.


Async event notifications for portal and target changes - Changes occurring at the target can be
communicated to the initiator as asynchronous messages.

Immediate Data Support - The ability to send an unsolicited data burst with the iSCSI command protocol
data unit (PDU).

Dynamic driver reconfiguration - Changes can be made on the initiator without restarting all iSCSI sessions.

Auto-mounting for iSCSI filesystems after a reboot - Ensures the network is up before attempting to auto-mount
iSCSI targets.



iSCSI Device Names and Mounting 2-4

• Standard default kernel names are used for iSCSI devices


• Linux assigns SCSI device names dynamically whenever detected
• Naming may vary across reboots
• SCSI commands may be sent to the wrong logical unit
• Persistent device naming (2.6 kernel)
• udev
• UUID and LABEL-based mounting
• Important /etc/fstab option: _netdev
• Without this, rc.sysinit attempts to mount target before network or iscsid services have started

The iSCSI driver uses the default kernel names for each iSCSI device the same way it would with other
SCSI devices and transports like FC/SATA.

Since Linux assigns SCSI device nodes dynamically whenever a SCSI logical unit is detected, the mapping
from device nodes (e.g., /dev/sda or /dev/sdb) to iSCSI targets and logical units may vary. Factors
such as variations in process scheduling and network delay may contribute to iSCSI targets being mapped
to different kernel device names every time the driver is started, opening up the possibility that SCSI
commands might be sent to the wrong target.

We therefore need persistent device naming for iSCSI devices, and can take advantage of some 2.6 kernel
features to manage this:

udev - udev can be used to provide persistent names for all types of devices. The scsi_id program, which
provides a serial number for a given block device, is integrated with udev and can be used for persistence.

UUID and LABEL-based mounting - Filesystems and LVM provide the needed mechanisms for mounting
devices based upon their UUID or LABEL instead of their device name.
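
For example (a minimal sketch; the device name, label, and mount point below are only illustrations and are not from the lab setup), a filesystem label can be assigned and then used in /etc/fstab so the mount no longer depends on the kernel-assigned device name:

# e2label /dev/sdb1 iscsidata
# blkid /dev/sdb1
# echo "LABEL=iscsidata /mnt/iscsi ext3 _netdev 0 0" >> /etc/fstab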



iSCSI Target Naming 2-5

• iSCSI Qualified Name (IQN)


• Must be globally unique
• The IQN string format:

iqn.<datecode>.<reversed domain>.<string>[:<substring>]

• The IQN sub-fields:

• Required type designator (iqn)
• Date code (yyyy-mm)
• Reversed domain name (tld.domain)
• Any string guaranteeing uniqueness (string[.string]...)
• Optional colon-delimited sub-group string ([:substring])
• Example:

iqn.2007-01.com.example.sales:sata.rack2.disk1

The format for the iSCSI target name is required to start with a type designator (for example, 'iqn', for
'iSCSI Qualified Name') and must be followed by a multi-field (delimited by a period character) unique
name string that is globally unique. There is a second type designator we won't discuss here, eui, that
uses a naming authority similar to that of Fibre Channel world-wide names (an EUI-64 address in ASCII
hexadecimal).

The first sub-field consists of the reversed domain name owned by the person or organization creating the
iSCSI name. For example: com.example.

The second sub-field consists of a date code in yyyy-mm format. The date code must be a date during
which the naming authority owned the domain name used in this format, and should be the date on which
the domain name was acquired by the naming authority. The date code is used to guarantee uniqueness in
the event the domain name was transferred to another party and both parties wish to use the same domain
name.

The third field is an optional string identifier of the owner's choosing that can be used to guarantee
uniqueness. Additional fields can be used if necessary to guarantee uniqueness.

Delimited from the name string by a colon character, an optional sub-string qualifier may also be used to
signify sub-groups of the domain.

See the document at http://www3.ietf.org/proceedings/01dec/I-D/draft-ietf-ips-iscsi-name-disc-03.txt for more details.



Configuring iSCSI Targets 2-6

• Install scsi-target-utils package


• Modify /etc/tgt/targets.conf
• Start the tgtd service
• Verify configuration with tgt-admin -s
• Reprocess the configuration with tgt-admin --update
• Changing parameters of a 'busy' target is not possible this way
• Use tgtadm instead

Configuring a Linux server as an iSCSI target is supported in RHEL 5.3 onwards, based on the
scsi-target-utils package (developed at http://stgt.berlios.de/).

After installing the package, the userspace tgtd service must be started and configured to start at boot.
Then new targets and LUNs can be defined using /etc/tgt/targets.conf. Targets have an iSCSI
name associated with them that is universally unique and which serves the same purpose as the SCSI ID
number on a traditional SCSI bus. These names are set by the organization creating the target, with the
iqn method defined in RFC 3721 being the most commonly used.

/etc/tgt/targets.conf parameters:

Parameter                              Description
backing-store <device>                 Defines a virtual device on the target.
direct-store <device>                  Creates a device with the same VENDOR_ID and
                                       SERIAL_NUM as the underlying storage.
initiator-address <address>            Limits access to only the specified IP address.
                                       Defaults to all.
incominguser <username> <password>     Only the specified user can connect.
outgoinguser <username> <password>     The target will use this user to authenticate
                                       against the initiator.

Example:

<target iqn.2009-10.com.example.cluster20:iscsi>
# List of files to export as LUNs
backing-store /dev/volO/iscsi

initiator-address 172.17.120.1
initiator-address 172.17.120.2
initiator-address 172.17.120.3
</target>
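
After editing the file, the configuration can be reprocessed and then inspected; a quick sketch, assuming the tgtd service is already running and no initiator is currently logged in to the target:

# tgt-admin --update ALL
# tgt-admin -s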



Manual iSCSI configuration 2-7

• Create a new target


• # tgtadm --lld iscsi --op new --mode target --tid 1 -T iqn.2008-02.com.example:disk1

• Export local block devices as LUNs and configure target access

• # tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 -b /dev/vol0/iscsi1
• # tgtadm --lld iscsi --op bind --mode target --tid 1 -I 192.0.2.15

To create a new target manually and not persistently, with target ID 1 and the name
iqn.2008-02.com.example:disk1, use:

[root@station5]# tgtadm --lld iscsi --op new --mode target --tid 1 -T iqn.2008-02.com.example:disk1

Then that target needs to provide one or more disks, each assigned to a logical unit number or LUN. These
disks are arbitrary block devices which will only be accessed by iSCSI initiators and are not mounted as
local file systems on the target. To set up LUN 1 on target ID 1 using the existing logical volume
/dev/vol0/iscsi1 as the block device to export:

[root@station5]# tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 -b /dev/vol0/iscsi1

Finally, the target needs to allow access to one or more remote initiators. Access can be allowed by IP
address:

[root@station5]# tgtadm --lld iscsi --op bind --mode target --tid 1 -I 192.168.0.6
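
The resulting non-persistent configuration can be verified at any time by listing the defined targets, their LUNs, and their access control lists; for example:

[root@station5]# tgtadm --lld iscsi --op show --mode target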



Configuring the iSCSI Initiator Driver 2-8

• /etc/iscsi/iscsid.conf
• Default configuration works unmodified (no authentication)
• Settings:
• Startup - automatic or manual
• CHAP - usernames and passwords
• Timeouts - connections, login/logout
• iSCSI - flow control, payload size, digest checking

The following settings can be configured in /etc/iscsi/iscsid.conf.

Startup settings:

node.startup                               automatic or manual

CHAP settings:

node.session.auth.authmethod               Enable CHAP authentication (CHAP). Default is NONE.
node.session.auth.username                 CHAP username for initiator authentication by the target
node.session.auth.password                 CHAP password for initiator authentication by the target
node.session.auth.username_in              CHAP username for target authentication by the initiator
node.session.auth.password_in              CHAP password for target authentication by the initiator
discovery.sendtargets.auth.authmethod      Enable CHAP authentication (CHAP) for a discovery
                                           session to the target. Default is NONE.
discovery.sendtargets.auth.username        Discovery session CHAP username for initiator
                                           authentication by the target
discovery.sendtargets.auth.password        Discovery session CHAP password for initiator
                                           authentication by the target
discovery.sendtargets.auth.username_in     Discovery session CHAP username for target
                                           authentication by the initiator
discovery.sendtargets.auth.password_in     Discovery session CHAP password for target
                                           authentication by the initiator

For more information about iscsid.conf settings, refer to the file comments.
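
As an illustration (the username and password values below are placeholders, not part of the course setup), enabling one-way CHAP on the initiator amounts to setting a few of these parameters in /etc/iscsi/iscsid.conf:

node.session.auth.authmethod = CHAP
node.session.auth.username = iscsiuser
node.session.auth.password = password123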



iSCSI Authentication Settings 2-9

• Two-way authentication can be configured using CHAP


• Target must also be capable/configured
• No encryption of iSCSI communications
• Authentication based on CHAP implies that:
• Username and challenge are sent in cleartext
• Authenticator is a hash (based on challenge and password)
• If the username, challenge, and authenticator are sniffed, an offline brute-force attack is possible
• Standard (RFC 3720) recommends use of IPSec

• Consider running on an isolated storage-only network

CHAP (Challenge Handshake Authentication Protocol) is defined as a one-way authentication method
(RFC 1334), but CHAP can be used in both directions to create two-way authentication.

The following sequence of events describes, for example, how the initiator authenticates with the target
using CHAP: After the initiator establishes a link to the target, the target sends a challenge message back
to the initiator. The initiator responds with a value obtained by using its authentication credentials in a
one-way hash function. The target then checks the response by comparing it to its own calculation of the
expected hash value. If the values match, the authentication is acknowledged; otherwise the connection is
terminated.

The maximum length for the username and password is 256 characters each.

For two-way authentication, the target will need to be configured also.
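
As a sketch of the corresponding target-side configuration (the usernames and passwords below are placeholders), the incominguser and outgoinguser parameters described earlier in /etc/tgt/targets.conf provide the two directions of authentication:

<target iqn.2008-02.com.example:disk1>
    backing-store /dev/vol0/iscsi1
    incominguser iscsiuser password123
    outgoinguser targetuser password456
</target>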



Configuring the open-iscsi Initiator 2-10

• iscsiadm
• open-iscsi administration utility
• Manages discovery and login to iSCSI targets
• Manages access and configuration of open-iscsi database
• Many operations require the iscsid daemon to be running

• Files:
• /etc/iscsi/iscsid.conf - main configuration file
• /etc/iscsi/initiatorname.iscsi - sets initiator name and alias
• /var/lib/iscsi/nodes/ - node and target information
• /var/lib/iscsi/send_targets - portal information

/etc/iscsi/iscsid.conf - configuration file read upon startup of iscsid and iscsiadm

/etc/iscsi/initiatorname.iscsi - file containing the iSCSI InitiatorName and
InitiatorAlias read by iscsid and iscsiadm on startup.

/var/lib/iscsi/nodes/ - This directory describes information about the nodes and their targets.

/var/lib/iscsi/send_targets - This directory contains the portal information.

For more information, see the file /usr/share/doc/iscsi-initiator-utils-*/README.
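
For illustration, /etc/iscsi/initiatorname.iscsi typically contains just the initiator name and an optional alias; the exact IQN below is only an example, not a value from the classroom setup:

InitiatorName=iqn.1994-05.com.redhat:node1
InitiatorAlias=node1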



First-time Connection to an iSCSI Target 2-11

• Start the initiator service:


• # service iscsi start
• Discover available targets:
• # iscsiadm -m discovery -t sendtargets -p 172.16.36.1:3260
172.16.36.71:3260,1 iqn.2007-01.com.example:storage.disk1
• Login to the target session:
• # iscsiadm -m node -T iqn.2007-01.com.example:storage.disk1 -p 172.16.36.1:3260 -l
• View information about the targets:
• # iscsiadm -m node -P N (N=0, 1)
• # iscsiadm -m session -P N (N=0 - 3)
• # iscsiadm -m discovery -P N (N=0, 1)

The iSCSI driver has a SysV initialization script that will report information on each detected device to the
console or in dmesg(8) output.

Anything that has an iSCSI device open must close the iSCSI device before shutting down iscsi. This
includes filesystems, volume managers, and user applications.

If iSCSI devices are open and an attempt is made to stop the driver, the script will error out and stop iscsid
instead of removing those devices in an attempt to protect the data on the iSCSI devices from corruption. If
you want to continue using the iSCSI devices, it is recommended that the iscsi service be started again.

Once logged into the iSCSI target volume, it can then be partitioned for use as a mounted filesystem.

When mounting iSCSI volumes, use of the _netdev mount option is recommended. The _netdev
mount option is used to indicate a filesystem that requires network access, and is usually used as a
preventative measure to keep the OS from mounting these file systems until the network has been enabled.
It is recommended that all filesystems mounted on iSCSI devices, either directly or on virtual devices
(LVM, MD) that are made up of iSCSI devices, use the '_netdev' mount option. With this option, they will
automatically be unmounted by the netfs initscript (before iscsi is stopped) during normal shutdown, and
you can more easily see which filesystems are in network storage.
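
A minimal sketch of such an /etc/fstab entry (the device name and mount point are placeholders; a LABEL or UUID is preferable to a raw kernel device name, as discussed earlier):

/dev/sdb1   /mnt/class   ext3   _netdev   0 0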



Managing an iSCSI Target Connection 2-12

To disconnect from an iSCSI target:


• Discontinue usage
• Log out of the target session:
• # iscsiadm -m node -T iqn.2007-01.com.example:storage.disk1 -p 172.16.36.1:3260 -u

To later reconnect to an iSCSI target:


• Log in to the target session
• # iscsiadm -m node -T iqn.2007-01.com.example:storage.disk1 -p 172.16.36.1:3260 -l

or restart the iscsi service


• # service iscsi restart

The iSCSI initiator "remembers" previously-discovered targets. Because of this, the iSCSI initiator will
automatically log into the aforementioned target(s) at boot time or when the iscsi service is restarted.



Disabling an iSCSI Target 2-13

To disable automatic iSCSI Target connections at boot time or iscsi service restarts:
• Discontinue usage
• Log out of the target session
• # iscsiadm -m node -T iqn.2007-01.com.example:storage.disk1 -p 172.16.36.1:3260 -u
• Delete the target's record ID
• # iscsiadm -m node -o delete -T iqn.2007-01.com.example:storage.disk1 -p 172.16.36.1:3260

Deleting the target's record ID will clean up the entries for the target in the /var/lib/iscsi directory
structure.

Alternatively, the entries can be deleted by hand when the iscsi service is stopped.



End of Lecture 2

• Questions and Answers


• Summary
• Describe the iSCSI Mechanism
• Define iSCSI Initiators and Targets
• Explain iSCSI Configuration and Tools




Lab 2.1: iSCSI Software Target Configuration
Scenario: For a test cluster you have been assigned to configure a software iSCSI
target as backend storage.

Deliverable: A working iSCSI software target we can use to practice configuration of an
iSCSI initiator.

Instructions:

1. Install the scsi-target-utils package on your physical machine.

2. Create a 5GiB logical volume named iscsi inside the vol0 volume group to be exported as
the target volume.

3. Modify /etc/tgt/targets.conf so that it exports the volume to the cluster nodes:

IQN                   iqn.2009-10.com.example.clusterX:iscsi
Backing Store         /dev/vol0/iscsi
Initiator Addresses   172.17.(100+X).1, 172.17.(100+X).2, 172.17.(100+X).3
4. Start the tgtd service and make sure that it will start automatically on reboot.

5. Check to see that the iSCSI target volume is being exported to the correct host(s).





Lab 2.2: iSCSI Initiator Configuration

Deliverable: A working iSCSI initiator on the virtual machine that can connect to the
iSCSI target.

System Setup: It is assumed that you have a working iSCSI target from the previous
exercise. All tasks are done on node1.

Instructions:

1. The iscsi-initiator-utils RPM should already be installed on your virtual machines.
Verify.

2. Set the initiator alias to node1 in /etc/iscsi/initiatorname.iscsi.

3. Start the iSCSI service and make sure it survives a reboot.

Check the command output and /var/log/messages for any errors and correct them
before continuing on with the lab.

4. Discover any targets being offered to your initiator by the target.

The output of the iscsiadm discovery command should show the target volume that is available
to the initiator in the form: <target_IP:port> <target_iqn_name>.

5. View information about the newly discovered target.

Note: The discovery process also loads information about the target in the directories:
/var/lib/iscsi/{nodes,send_targets}

6. Log in to the iSCSI target.

7. Use fdisk to view the newly available device. It should appear as an unpartitioned 5GiB
volume.

8. Log out of the iSCSI target. Is the volume still there?

9. Restart the iscsi service. Is the volume visible now?

10. Log out of the iSCSI service one more time, but this time also delete the record ID for the
target.

11. Restart the iscsi service. Is the volume visible now?

12. Re-discover and log into the target volume again.




13. Use the volume to create a 100MB partition (of type Linux). Format the newly-created
partition with an ext3 filesystem. Create a directory named /mnt/class and mount the
partition to it. Test that you are able to write to it. Create a new entry in /etc/fstab for the
filesystem and test that the mount is able to persist a reboot of the machine.

14. Remove the fstab entry when you are finished testing and umount the volume.










Lab 2.1 Solutions
1. Install the scsi-target-utils package on your physical machine.

stationX# yum install -y scsi-target-utils

2. Create a 5GiB logical volume named iscsi inside the vol0 volume group to be exported as
the target volume.

stationX# lvcreate vol0 -n iscsi -L 5G

3. Modify /etc/tgt/targets.conf so that it exports the volume to the cluster nodes:

IQN                   iqn.2009-10.com.example.clusterX:iscsi
Backing Store         /dev/vol0/iscsi
Initiator Addresses   172.17.(100+X).1, 172.17.(100+X).2, 172.17.(100+X).3

Edit /etc/tgt/targets.conf so that it reads:

<target iqn.2009-10.com.example.clusterX:iscsi>
    backing-store /dev/vol0/iscsi
    initiator-address 172.17.(100+X).1
    initiator-address 172.17.(100+X).2
    initiator-address 172.17.(100+X).3
</target>
4. Start the tgtd service and make sure that it will start automatically on reboot.

stationX# service tgtd start; chkconfig tgtd on

5. Check to see that the iSCSI target volume is being exported to the correct host(s).

# tgt-admin -s
Target 1: iqn.2009-10.com.example.clusterX:iscsi
System information:
Driver: iscsi
State: ready
I_T nexus information:
LUN information:
LUN: 0
Type: controller
SCSI ID: deadbeaf1:0
SCSI SN: beaf10
Size: 0 MB
Online: Yes
Removable media: No
Backing store: No backing store
LUN: 1
Type: disk
SCSI ID: deadbeaf1:1
SCSI SN: beaf11
Size: 1074 MB
Online: Yes
Removable media: No
Backing store: /dev/vol0/iscsi
Account information:
ACL information:
172.17.(100+X).1
172.17.(100+X).2
172.17.(100+X).3



Lab 2.2 Solutions
1. The iscsi-initiator-utils RPM should already be installed on your virtual machines.
Verify.

cXn1# rpm -q iscsi-initiator-utils

2. Set the initiator alias to node1 in /etc/iscsi/initiatorname.iscsi.

# echo "InitiatorAlias=node1" >> /etc/iscsi/initiatorname.iscsi

3. Start the iSCSI service and make sure it survives a reboot.

# service iscsi start
# chkconfig iscsi on

Check the command output and /var/log/messages for any errors and correct them
before continuing on with the lab.
4. Discover any targets being offered to your initiator by the target.
# iscsiadm -m discovery -t sendtargets -p 172.17.(100+X).254
172.17.(100+X).254:3260,1 iqn.2009-10.com.example.clusterX:iscsi

The output of the iscsiadm discovery command should show the target volume that is available
to the initiator in the form: <target_IP:port> <target_iqn_name>.
5. View information about the newly discovered target.

# iscsiadm -m node -T <target_iqn_name> -p 172.17.(100+X).254

Note: The discovery process also loads information about the target in the directories:
/var/lib/iscsi/{nodes,send_targets}

6. Log in to the iSCSI target.


# iscsiadm -m node -T <target_iqn_name> -p 172.17.(100+X).254 -l

7. Use fdisk to view the newly available device. It should appear as an unpartitioned 5GiB
volume.

# fdisk -l

8. Log out of the iSCSI target. Is the volume still there?

# iscsiadm -m node -T <target_iqn_name> -p 172.17.(100+X).254 -u
# fdisk -l

It should no longer be visible in the output of fdisk -l.

9. Restart the iscsi service. Is the volume visible now?



# service iscsi restart

Because the record ID information about the previously-discovered target is still stored in the
/var/lib/iscsi directory structure, it should have automatically made the volume available
again.

10. Log out of the iSCSI service one more time, but this time also delete the record ID for the
target.

# iscsiadm -m node -T <target_iqn_name> -p 172.17.(100+X).254 -u
# iscsiadm -m node -T <target_iqn_name> -p 172.17.(100+X).254 -o delete

11. Restart the iscsi service. Is the volume visible now?

# service iscsi restart

It should no longer be available. We must re-discover and log in to make the volume available
again.

12. Re-discover and log into the target volume, again.

# iscsiadm -m discovery -t sendtargets -p 172.17.(100+X).254
# iscsiadm -m node -T <target_iqn_name> -p 172.17.(100+X).254 -l

13. Use the volume to create a 100MB partition (of type Linux). Format the newly-created
partition with an ext3 filesystem. Create a directory named /mnt/class and mount the
partition to it. Test that you are able to write to it. Create a new entry in /etc/fstab for the
filesystem and test that the mount is able to persist a reboot of the machine.

# mkdir /mnt/class
# fdisk <target_volume_dev_name>
# mkfs -t ext3 <target_volume_dev_name>
# echo "<target_volume_dev_name> /mnt/class ext3 _netdev 0 0" >> /etc/fstab
# mount /mnt/class
# cd /mnt/class
# dd if=/dev/zero of=myfile bs=1M count=10

14. Remove the fstab entry when you are finished testing and umount the volume.

# umount /mnt/class
# rmdir /mnt/class
# vi /etc/fstab



Lecture 3

Kernel Device Management

Upon completion of this unit, you should be able to:

• Understand how udev manages device names.
• Describe the role of the /sys filesystem.
• Learn how to write udev rules for custom device names.
• Dynamically add storage to the system.

e
e
e
e
e
e
e
e
e
e
e
e For use only by a student enrollad in a Red HM training course taught by Red HM, Inc. or a Red HM Certified Training Partner. No pert of this publication may be photocopied,

o duplicated, Mored in a retrieval system, or otherwise reproduced without prior written consent of Red HM, Inc. It you believe Red HM training materials are being improperty usad,
copiad, or distributed pisase ornad ctraining4redhat . cota> or phone toll-free (USA) +1 (966) 626 2994 or +1 (919) 754 3700.

Copyright © 2011 Red Hat, Inc. RH436-RHEL5u4-en-17-20110428 / 6ae830ef


udev Features 3-1

• Only populates /dev with devices currently present in the system


• Device major/minor numbers are irrelevant
• Provides the ability to name devices persistently
• Userspace programs can query for device existence and name
• Moves all naming policies out of kernel and into userspace
• Follows LSB device naming standard but allows customization
• Very small

The /dev directory was unwieldy and big, holding a large number of static entries for devices that might be
attached to the system (18,000 at one point). udev, in comparison, only populates /dev with devices that
are currently present in the system. udev also solves the problem of dynamic allocation of entries as new
devices are plugged (or unplugged) into the system.

Developers were running out of major/minor numbers for devices. Not only does udev not care about
major/minor numbers, but in fact the kernel could randomly assign them and udev would be fine.

Users wanted a way to persistently name their devices, no matter how many other similar devices were
attached, where they were attached to the system, and the order in which the device was attached. For
example, a particular disk might always be named /dev/bootdisk no matter where it might be plugged
into a SCSI chain.

Userspace programs needed a way to detect when a device was plugged in or unplugged, and what /dev
entry is associated with that device.

udev follows the Linux Standards Base (LSB) for naming conventions, but allows userspace customization
of assigned device names.

udev is small enough that embedded devices can use it, as well.



Event Chain of a Newly Plugged-in Device 3-2

1. Kernel discovers device and exports the device's state to sysfs


2. udev is notified of the event via a netlink socket
3. udev creates the device node and/or runs programs (rule files)
4. udev notifies hald of the event via a socket
5. HAL probes the device for information
6. HAL populates device object structures with the probed information and that from
several other sources
7. HAL broadcasts the event over D-Bus
8. A user-space application watching for such events processes the information

When a device is plugged into the system, the kernel detects the plug-in and populates sysfs (/sys)
with state information about the device. sysfs is a device virtual file system that keeps track of all devices
supported by the kernel.

Via a netlink socket (a connectionless socket which is a convenient method of transferring information
between the kernel and userspace), the kernel then notifies udev of the event.

udev, using the information passed to it by the kernel and a set of user-configurable rule files in
/etc/udev/rules.d, creates the device file and/or runs one or more programs configured for that device (e.g.
modprobe), before then notifying HAL of the event via a regular socket (see /etc/udev/rules.d/90-hal.rules
for the RUN+="socket:/org/freedesktop/hal/udev_event" event). udev events can
be monitored with udevmonitor --env.

When HAL is notified of the event, it then probes the device for information and populates a structured
object with device properties using a merge of information from several different sources (kernel,
configuration files, hardware databases, and the device itself).

hald then broadcasts the event on D-Bus (a system message bus) for receipt by user-space applications.

Those same applications also have the ability to send messages back to hald via the D-Bus to, for
example, invoke a method on the HAL device object, potentially invoking the kernel. For example, the
mounting of a filesystem might be requested by gnome-volume-manager. The actual mounting is done by
HAL, but the request and configuration came from a user-space application.



/sys Filesystem 3-3

• Virtual filesystem like procfs


• Used to manage individual devices (see the example below)
• Used by udev to identify devices
• Changes to /sys are not persistent
• Use a udev rule to persist
• Alternative: ktune (RHEL 5.4 or newer)

• Important directories
• /sys/class/scsi_host/: contains all detected SCSI adapters
• /sys/block/: contains all block devices
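
For example (a sketch; the host and device names depend on your hardware), rescanning a SCSI/iSCSI adapter for new LUNs and changing a block device tunable are both done by writing to files under /sys, and both changes are lost at the next reboot unless reapplied by a udev rule or by ktune:

# echo "- - -" > /sys/class/scsi_host/host0/scan
# cat /sys/block/sda/queue/scheduler
# echo deadline > /sys/block/sda/queue/scheduler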



udev 3-4

• Upon receipt of device add/remove events from the kemel, udev will parse:
• user-customizable rules in /etc/udev/rules .d
• output from commands within those rules (optional)
• information about the device in /sys
• Based upon the information udev has gathered:
• Handles device naming (based on rules)
• Determines what device files or symlinks to create
• Determines device file attributes to set
• Determines what, if any, actions to take
• udevmonitor [--env]

When a device is added to or removed from the system, the kernel sends a message to udevd and
advertises information about the device through /sys. udev then looks up the device information in /sys
and determines, based on user customizable rules and the information found in /sys, what device node
files or symlinks to create, what their attributes are, and/or what actions to perform.

sysfs is used by udev for querying attributes about all devices in the system (location, name, serial
number, major/minor number, vendor/product IDs, etc...).

udev has a sophisticated userspace rule-based mechanism for determining device naming and actions to
perform upon device loading/unloading.

udev accesses device information from sysfs using libsysfs library calls. libsysfs has a standard,
consistent interface for all applications that need to query sysfs for device information.

The udevmonitor command is useful for monitoring kernel and udev events, such as the plugging and
unplugging of a device. The --env option to udevmonitor increases the command's verbosity.




Configuring udev 3-5
• /etc/udev/udev.conf
• udev_root - location of created device files (default is /dev)
• udev_rules - location of udev rules (default is /etc/udev/rules.d)
• udev_log - syslog(3) priority (default is err)

• Run-time: udevcontrol log_priority=<value>

All udev configuration files are placed in /etc/udev and every file consists of a set of lines of text. All
empty lines or lines beginning with '#' will be ignored.

The main configuration file for udev is /etc/udev/udev.conf, which allows udev's default configuration
variables to be modified.

The following variables can be defined:

udev_root - Specifies where to place the created device nodes in the filesystem. The default value is
/dev.

udev_rules - The name of the udev rules file or directory to look for files with the suffix ".rules".
Multiple rule files are read in lexical order. The default value is /etc/udev/rules.d.

udev_log - The priority level to use when logging to syslog(3). To debug udev at run-time, the logging
level can be changed with the command "udevcontrol log_priority=<value>". The default value is err.
Possible values are: err, info and debug.
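A minimal /etc/udev/udev.conf simply sets these variables; the following sketch writes the defaults out
explicitly (values are illustrative and match the defaults described above):

# /etc/udev/udev.conf
udev_root="/dev"
udev_rules="/etc/udev/rules.d"
udev_log="err"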

udev Rules 3-6

• Filename location/format:
• /etc/udev/rules.d/<rulename>.rules
• Examples:
• 50-udev.rules
• 75-custom.rules
• Rule format:
• <match-key><op>value [, ...] <assignment-key><op>value [, ...]
• Example:
• BUS=="usb", SYSFS{serial}=="20043512321411d34721", SYMLINK+="usb_backup"
• Rule files are read once and cached
• Touch a rule file to force an update

By default, the udev mechanism reads files with a ".rules" suffix located in the directory
/etc/udev/rules.d. If there is more than one rule file, they are read one at a time by udev in lexical
order. By convention, the name of a rule file consists of a 2-digit integer, followed by a dash, followed
by a descriptive name for the rules within it, and ends with a ".rules" suffix. For example, a udev rule
file named 50-udev.rules would be read by udev before a file named 75-usb_custom.rules because
50 comes before 75.

The format of a udev rule is logically broken into two separate pieces on the same line: one or more match
key-value pairs used to match a device's attributes and/or characteristics to some value, and one or more
assignment key-value pairs that assign a value to the device, such as a name.

If no matching rule is found, the default device node name is used.

In the example above, a USB device with serial number 20043512321411d34721 will be assigned an
additional symlink /dev/usb_backup (presuming no other rules override it later).
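Because rule files are cached after the first read, a quick way to make udev pick up an edited rule and
re-apply it to existing devices on RHEL 5 (command names are those shipped with the udev package of that
era) is:

# touch /etc/udev/rules.d/75-custom.rules      # force the rules to be re-read
# udevcontrol reload_rules                     # tell udevd to reload its rule set
# udevtrigger                                  # replay add events for existing devices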



udev Rule Match Keys 3-7

• Operators:
• == Compare for equality
• != Compare for non-equality
• Match key examples:
• ACTION=="add"
• KERNEL=="sd[a-z]1"
• BUS=="scsi"
• DRIVER!="ide-cdrom"
• SYSFS{serial}=="20043512321411d34721"
• PROGRAM=="customapp.pl" RESULT=="some return string"

• udev(7)

The following keys can be used to match a device:


ACTION Match the name of the event action (add or remove). Typically used to run a
program upon adding or removing of a device on the system.
KERNEL Match the name of the device.
DEVPATH Match the devpath of the device.
SUBSYSTEM Match the subsystem of the device.
BUS Search the devpath upwards for a matching device subsystem name.
DRIVER Search the devpath upwards for a matching device driver name.
ID Search the devpath upwards for a matching device name.
SYSFS{filename} Search the devpath upwards for a device with matching sysfs attribute values.
Up to five SYSFS keys can be specified per rule. All attributes must match on
the same device.
ENV{key} Match against the value of an environment variable (up to five ENV keys can
be specified per rule). This key can also be used to export a variable to the
environment.
PROGRAM Execute an external program and return true if the program returns with exit code
0. The whole event environment is available to the executed program. The
program's output, printed to stdout, is available for the RESULT key.
RESULT Match the returned string of the last PROGRAM call. This key can be used in
the same or in any later rule after a PROGRAM call.

Most of the fields support a form of pattern matching:

*       Matches zero or more characters
?       Matches any single character
[...]   Matches any single character specified within the brackets
[a-z]   Matches any single character in the range a to z
[!a]    Matches any single character except for the letter a
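As an illustration of how several match keys and patterns combine on one line (the serial number is the
same placeholder used earlier; the symlink assignment is only there so the rule has a visible effect):

ACTION=="add", KERNEL=="sd[a-z][1-9]", BUS=="scsi", \
    SYSFS{serial}=="20043512321411d34721", SYMLINK+="backup%n"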



Finding udev Match Key Values 3-8

• udevinfo -a -p $(udevinfo -q path -n /dev/<device>)


• The sysfs device path of the device in question is all that is needed
• Produces a list of attributes that can be used in match rules
• Choose attributes which identify the device in a persistent and easily-recognizable way
• Can combine attributes of device and a single parent device
• Padding spaces at the end of attribute values can be omitted
• Also useful:
• scsi_id -g -x -s /block/sdx
• /lib/udev/ata_id /dev/hdx
• /lib/udev/usb_id /block/sdx

Finding key values to match a particular device to a custom rule is made easier with the udevinfo
command, which outputs attributes and unique identifiers for the queried device.

The "inner" udevinfo command aboye first determines the sysfs (/sys) path of the device, so the "outer"
udevinfo command can query it for all the attributes of the device and its parent devices.

Examples:

# udevinfo -a -p $(udevinfo -q path -n /dev/sdal)

# udevinfo -a -p /sys/class/net/eth0

Other examples of commands that might provide useful information for udev rules:

# scsi_id -g -s /block/sda

# scsi_id -g -x -s /block/sda/sda3

# /lib/udev/ata_id /dev/hda

# /lib/udev/usb_id /block/sda



udev Rule Assignment Keys 3-9

• Operators:
• = Assign a value to a key
• += Add the value to a key
• := Assign a value to a key, disallowing changes by any later rules
• Assignment key examples:
• NAME="usbcrypto"
• SYMLINK+="datal"
• OWNER="student"
• MODE="0600"
• LABEL="testrulesend"
• GOTO="testrulesend"
• RUN="myapp"

The following keys can be used to assign a value/attribute to a device:


NAME The name of the node to be created, or the name the network interface should
be renamed to. Only one rule can set the node name, all later rules with a
NAME key will be ignored. It is not recommended to rename devices this way
because tools like fdisk expect certain naming conventions. Use symlinks
instead.
SYMLINK The name of a symlink targeting the node. Every matching rule can add this
value to the list of symlinks to be created along with the device node. Multiple
symlinks may be specified by separating the names by the space character.
OWNER, GROUP, The permissions for the device node. Every specified value overwrites the
MODE compiled-in default value.
ENV{key} Export a variable to the environment. This key can also be used to match
against an environment variable.
RUN Add a program to the list of programs to be executed for a specific device.
This can only be used for very short running tasks. Running an event process
for a long period of time may block all further events for this or a dependent
device. Long running tasks need to be immediately detached from the event
process itself.
LABEL Named label where a GOTO can jump to.
GOTO Jumps to the next LABEL with a matching name
IMPORT{type} Import the printed result or the value of a file in environment key format into
the event environment. program will execute an external program and read
its output. file will import a text file. If no option is given, udev will determine
it from the executable bit of the file permissions.
WAIT_FOR_SYSFS Wait for the specified sysfs file of the device to be created. Can be used to
fight against kernel sysfs timing issues.
OPTIONS last_rule - No later rules will have any effect, ignore_device - Ignore
this event completely, ignore_remove - Ignore any later remove event
for this device, all_partitions - Create device nodes for all available
partitions of a block device.
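Combining match and assignment keys, a hypothetical rule like the following creates a symlink with
restricted ownership and logs a message when the device appears (the serial number and names are
placeholders):

KERNEL=="sd?1", BUS=="usb", SYSFS{serial}=="20043512321411d34721", \
    SYMLINK+="usb_backup", OWNER="student", GROUP="student", MODE="0600", \
    RUN+="/usr/bin/logger usb_backup attached"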



udev Rule Substitutions 3-10

• printf-Iike string substitutions


• Can simplify and abbreviate rules
• Supported by NAME, SYMLINK, PROGRAM, OWNER, GROUP and RUN keys
• Example:

KERNEL=="sda*", SYMLINK+="iscsi%n"

Substitutions are applied while the individual rule is being processed (except for RUN; see udev(7)).

The available substitutions are:


$kernel, %k The kernel name for this device (e.g. sdb1)
$number, %n The kernel number for this device (e.g. %n is 3 for 'sda3')
$devpath, %p The devpath of the device (e.g. /block/sdb/sdb1, not /sys/block/sdb/
sdb1).
$id, %b Device name matched while searching the devpath upwards for BUS,
ID, DRIVER and SYSFS.
$sysfs{file}, The value of a sysfs attribute found at the current or parent device.
%s{file}
$env{key}, %E{key} The value of an environment variable.
$major, %M The kernel major number for the device.
$minor %m The kernel minor number for the device.
$result, %c The string returned by the external program requested with PROGRAM. A
single part of the string, separated by a space character may be selected by
specifying the part number as an attribute: %c{N}. If the number is followed by
the '+' char this part plus all remaining parts of the result string are substituted:
%c{N+}
$parent, %P The node name of the parent device.
$root, %r The udev_root value.
$tempnode, %N The name of a created temporary device node to provide access to the device
from a external program before the real node is created.
%% The '%' character itself.
$$ The '$' character itself.

The count of characters to be substituted may be limited by specifying the format length value. For
example, '%3s{file}' will only insert the first three characters of the sysfs attribute.

For example, using the rule:

KERNEL=="sda*", SYMLINK+="iscsi%n"

any newly created partitions on the /dev/sda device (e.g. /dev/sda5) would trigger udev to also create
a symbolic link named iscsi with the same kernel-assigned partition number appended to it
(/dev/iscsi5, in this case).




udev Rule Examples 3-11
• Examples:
• BUS=="scsi", SYSFS{serial}=="123456789", NAME="byLocation/rack1-shelf2-disk3"
• KERNEL=="sd*", BUS=="scsi", PROGRAM=="/lib/udev/scsi_id -g -s %p",
  RESULT=="SATA ST340014AS 3JX8LVCA", NAME="backup%n"
• KERNEL=="sd*", SYSFS{idVendor}=="0781", SYSFS{idProduct}=="5150",
  SYMLINK+="keycard", OWNER="student", GROUP="student", MODE="0600"
• KERNEL=="sd?1", BUS=="scsi", SYSFS{model}=="DSCT10", SYMLINK+="camera"
• ACTION=="add", KERNEL=="ppp0", RUN+="/usr/bin/wall PPP Interface Added"
• KERNEL=="ttyUSB*", BUS=="usb", SYSFS{product}=="Palm Handheld", SYMLINK+="pda"


The first example demonstrates how to assign a SCSI drive with serial number "123456789" a meaningful
device name of /dev/byLocation/rack1-shelf2-disk3. Subdirectories are created automatically.

In the second example, any device whose name begins with the letters "sd" (assigned by the kernel)
will have its devpath substituted for the "%p" in the command "/lib/udev/scsi_id -g -s %p" (e.g.
/block/sda3 for /sys/block/sda3). If the command is successful (zero exit code) and its output is
equivalent to "SATA ST340014AS 3JX8LVCA", then the device name "backup%n" will be assigned to it,
where %n is the number portion of the kernel-assigned name (e.g. 3 if sda3).

In the third example, any SCSI device that matches the listed vendor and product IDs will have a symbolic
link named /dev/keycard point to the device. The device name will have owner/group associations with
student and permissions mode 0600.

The fourth example shows how to create a unique device name for a USB camera, which otherwise would •
appear like a normal USB memory stick.

The fifth example executes the wall command-line shown whenever the ppp0 interface is added to the •
machine.



The sixth example shows how to make a PDA always available at /dev/pda.




udevmonitor 3-12

• Continually monitors kernel and udev rule events


• Presents device paths and event timing for analysis and debugging

udevmonitor continuously monitors kernel and udev rule events and prints them to the console whenever
hardware is added to or removed from the machine.



Dynamic storage management 3-13

• SCSI or FC devices require active scanning to be detected by the OS
• Scan can be triggered manually:

# echo "- - -" > /sys/class/scsi_host/host<n>/scan

• Disks can be removed by executing:

# echo 1 > /sys/block/<device>/device/delete

• sg3_utils in RHEL 5.5 or newer provides the rescan-scsi-bus.sh script
• For iSCSI, use iscsiadm

RHEL 5.5 and newer offer the rescan-scsi-bus.sh script in the sg3_utils package. This tool makes
scanning of the SCSI bus easy. It can automatically update the logical unit configuration of the host as
needed after a device has been added to the system, and issues a Loop Initialization Primitive (LIP) on
supported devices.

In order for rescan-scsi-bus.sh to work properly, LUN0 must be the first mapped logical unit; otherwise
nothing is detected, even with the --nooptscan option. During the first scan, rescan-scsi-bus.sh
only adds LUN0; all other logical units are added in the second scan (race condition). A bug in the
rescan-scsi-bus.sh script incorrectly executes the functionality for recognizing a change in logical unit
size when the --remove option is used.
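As a sketch of a typical sequence (host number, device name and iSCSI options are placeholders; exact
iscsiadm options depend on the open-iscsi version installed):

# echo "- - -" > /sys/class/scsi_host/host0/scan   # scan all channels/targets/LUNs on host0
# echo 1 > /sys/block/sdc/device/delete            # cleanly remove /dev/sdc before unmapping it
# rescan-scsi-bus.sh                               # RHEL 5.5+, from the sg3_utils package
# iscsiadm -m session -R                           # rescan LUNs on all active iSCSI sessions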



Tuning the disk queue 3-14

• Queue length

/sys/block/sda/queue/nr_requests

• Scheduler algorithm

/sys/block/sda/queue/scheduler

• Longer queues...
• Allow reads to be merged before writes
• Enable more efficient merging but add latency

Reads are critical; applications cannot do anything without data!

When scheduling I/O requests, the kernel must balance two somewhat conflicting goals. The most efficient
way to access disk drives is to keep the access pattern as sequential as possible, so requests are typically
ordered in the direction of increasing logical block address on disk. That is, an I/O request for disk block 500
would get scheduled before a request for disk block 5000. At the same time, the kernel must ensure that all
processes receive some I/O in a timely fashion to avoid I/O starvation. The I/O scheduler must ensure that
I/O requests for a location near the 'tail' end of the disk don't wind up getting continually pushed to the back
of the I/O request queue.

When adding an entry to the queue, the kernel will first try to enlarge an existing request by merging the
new request with a request that is already in the queue. If the request cannot be merged with an existing
request, then the new request will be assigned a position in the queue based on several different factors
according to the current elevator algorithm.

The default I/O scheduler is determined as a kernel compile option:

grep -i cfq /boot/config-*

CONFIG_IOSCHED_CFQ=y
CONFIG_DEFAULT_CFQ=y
CONFIG_DEFAULT_IOSCHED="cfq"

There are numerous files at /usr/share/doc/kernel-doc-*/Documentation/block/* providing
information on the elevator algorithms.

The deadline scheduler gives up some efficiency to gain predictable, low wait times. The anticipatory
scheduler accepts longer wait times to gain more efficiency. The noop scheduler gives up all but the
simplest sorting in order to save CPU cycles. The cfq scheduler attempts to compromise on all points in
order to achieve equality.
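For example (sda is a placeholder device; the output will look similar to the following, with the active
scheduler shown in brackets):

# cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]
# cat /sys/block/sda/queue/nr_requests
128
# echo 512 > /sys/block/sda/queue/nr_requests      # allow a longer queue, trading latency for merging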



Tuning the deadline scheduler 3-15

• Goal: predictable wait time (W)

echo deadline > /sys/block/sda/queue/scheduler

• Primary tunables in /sys/block/sda/queue/iosched/


• Max queue time

read_expire
write_expire

• Should we attempt to front-merge contiguous requests?

front_merges

With the deadline I/O scheduler, each request is assigned an expiration time. When this time is reached, the
scheduler will move to that request's location on disk. In order to prevent excessive seek movement, the
deadline I/O scheduler will also serve other I/O requests that are near to the new location on disk.

The tunable settings, under /sys/block/<dev>/queue/iosched/, are:

read_expire - the number of milliseconds before each read I/O request expires.

write_expire - the number of milliseconds before each write I/O request expires.

fifo_batch - the number of requests to move from the scheduler list to the block device request queue.

writes_starved - determines how much preference reads have over writes. After writes_starved
reads have been moved to the block device request queue, some writes will be queued.

front_merges - normally I/O requests are merged into the tail of the request queue; this boolean value
controls whether attempts should be made to merge requests into the front of the queue, which takes more
work. A value of 0 means front merges are disabled.
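A brief illustration of switching sda to the deadline scheduler and adjusting its tunables (the values are
arbitrary examples, not recommendations):

# echo deadline > /sys/block/sda/queue/scheduler
# echo 250 > /sys/block/sda/queue/iosched/read_expire     # reads expire after 250 ms
# echo 2500 > /sys/block/sda/queue/iosched/write_expire   # writes expire after 2500 ms
# echo 0 > /sys/block/sda/queue/iosched/front_merges      # disable front merging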



Tuning the anticipatory scheduler 3-16
• Goal: optimize completion rate (C) for dependent reads

echo anticipatory > /sys/block/sda/queue/scheduler

• Primary tunables in /sys/block/sda/queue/iosched/
• How long to wait for another, nearby read

antic_expire

• Max queue time

read_expire
write_expire
In many cases, an application that has issued a read may, after a short amount of think time, issue a
request for the next disk block after the one that was just read. Of course, by this time, the I/O scheduler
will most likely have moved the disk read/write head to a different spot on the disk drive, resulting in
another seek back to the same spot on disk to read the next block of data. This results in additional latency
for the application. The anticipatory I/O scheduler will wait for a short amount of time after servicing an I/O
request for another request near the block of data that was just read. This can result in greatly improved
throughput for certain types of loads.

Reads and writes are processed in batches. Each batch of requests is allocated a specific amount of time in
which to complete. Read batches should be allowed longer times than write batches.

The tunables at /sys/block/<dev>/queue/iosched/ for the anticipatory scheduler are documented at:

/usr/share/doc/kernel-doc-*/Documentation/block/as-iosched.txt
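For example (sda is a placeholder; check the documentation above for the meaning and units of each
tunable on your kernel):

# echo anticipatory > /sys/block/sda/queue/scheduler
# cat /sys/block/sda/queue/iosched/antic_expire
# echo 10 > /sys/block/sda/queue/iosched/antic_expire    # wait slightly longer for a nearby read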



Tuning the noop scheduler 3-17

• Goal: conserve CPU clock cycles


echo noop > /sys/block/sda/queue/scheduler

• No tunable settings required


• Use when CPU clock cycles are "too expensive"
• Host CPU cycles are usually cheaper than SAN CPU cycles
• Some controllers perform elevator functions
• Tagged command queuing
• Available on SCSI and some SATA drives
• Sorting is still useful for iSCSI and GNBD
The no-op scheduler is just what it sounds like: with this I/O scheduler, requests are queued as they are
sent to the I/O sub-system and it is left up to the disk hardware to optimize things. This scheduler may be
appropriate for certain types of workloads and hardware (RAM disks, TCQ disks, etc.).

A feature of many modern drives that can boost performance is tagged command queuing. Tagged
command queuing (TCQ) can improve disk performance by allowing the disk controller to re-order I/O
requests in such a way that head seek movement is minimized. The requests are tagged with an identifier
by the disk controller so that the block(s) requested for a particular I/O operation can be returned in the
proper sequence in which they were received.

For certain applications, you may wish to limit the queue depth (number of commands that can be queued
up). Many device driver modules for devices that support TCQ accept a parameter that will control the
queue depth. This parameter is typically passed as an argument to the kernel at system boot time. For
example, to limit the queue depth for the SCSI disk at LUN 2 on an Adaptec controller to 64 requests, you
would append the following to the 'kernel' line in /boot/grub/grub.conf:

aic7xxx=tag_info:{{0,0,64,0,0,0,0}}
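The scheduler itself can also be selected for all block devices at boot time with the elevator= kernel
parameter; for example, a kernel line in /boot/grub/grub.conf might look like this (kernel version and root
device are illustrative):

kernel /vmlinuz-2.6.18-164.el5 ro root=/dev/VolGroup00/LogVol00 elevator=noop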



Tuning the (default) cfq scheduler 3-18

• Goal: differentiated I/O service per application

echo cfq > /sys/block/sda/queue/scheduler

• Class- and priority-based I/O queuing
• Uses 64 internal queues
• Fills internal queues using round-robin
• Requests are dispatched from non-empty queues
• Sort occurs at dispatch queue

• Primary tunables in /sys/block/sda/queue/iosched/


• Max requests per internal queue

queued

• Number of requests dispatched to device per cycle

quantum

The goal of the completely fair queuing (CFQ) I/O scheduler is to divide available I/O bandwidth equally
among all processes that are doing I/O. Internally, the CFQ I/O scheduler maintains 64 request queues. I/O
requests are assigned in round-robin fashion to one of these internal request queues. Requests are pulled
from non-empty internal queues and assigned to a dispatch queue, where they are serviced. I/O requests
are ordered to minimize seek head movement when they are placed on the dispatch queue.

The tunable settings for the CFQ I/O scheduler are:

quantum - the total number of requests placed on the dispatch queue per cycle,

queued - the maximum number of requests allowed per internal request queue.
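As an illustration (sda and the value 8 are placeholders; tunable names can vary between kernel versions,
so list the iosched directory first to see what your kernel exposes):

# echo cfq > /sys/block/sda/queue/scheduler
# ls /sys/block/sda/queue/iosched/
# cat /sys/block/sda/queue/iosched/quantum
# echo 8 > /sys/block/sda/queue/iosched/quantum    # dispatch more requests per cycle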



Fine-tuning the cfq scheduler 3-19

• Class-based, prioritized queuing

• Class 1 (real-time): first access to disk, can starve other classes
• Priorities 0 (most important) through 7 (least important)

• Class 2 (best-effort): round-robin access, the default
• Priorities 0 (most important) through 7 (least important)

• Class 3 (idle): receives disk I/O only if no other requests in queue
• No priorities

• Example

ionice -p1
ionice -p1 -n7 -c2
ionice -p1
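A short walk-through of what such a sequence does (PID 1 is only an example target; any PID or a command
invocation can be used):

# ionice -p1                  # show the current I/O class and priority of PID 1
# ionice -p1 -c2 -n7          # set PID 1 to class 2 (best-effort), priority 7 (lowest)
# ionice -c3 updatedb         # run a command entirely in the idle class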



End of Lecture 3

• Questions and Answers


• Summary
• Understand how udev manages device names.
• Learn how to write udev rules for custom device names.



Lab 3.1: Persistent Device Naming
Scenario: The order in which devices are attached or recognized by the system may
dictate the device name attached to it, which can be problematic. We will
learn how to map a specific device or set of devices to a persistent device
name that will always be the same.

Deliverable: Statically defined device names for storage devices.

Instructions:

1. Create and implement a udev rule on node 1 that, upon reboot, will create a symbolic link
named /dev/iscsiN that points to any partition device matching /dev/sdaN, where N is
the partition number (any value between 1-9).

Test your udev rule on an existing partition by rebooting the machine and verifying that the
symbolic link is made correctly. If you don't have any partitions on /dev/sda, create one
before rebooting.

The reboot can be avoided if, after verifying the correct operation of your udev rule, you
create a new partition on /dev/sda and update the in-memory copy of the partition table
(partprobe).



Lab 3.1 Solutions
1. Create and implement a udev rule on node 1 that, upon reboot, will create a symbolic link
named /dev/iscsiN that points to any partition device matching /dev/sdaN, where N is
the partition number (any value between 1-9).

Test your udev rule on an existing partition by rebooting the machine and verifying that the
symbolic link is made correctly. If you don't have any partitions on /dev/sda, create one
before rebooting.

The reboot can be avoided if, after verifying the correct operation of your udev rule, you
create a new partition on /dev/sda and update the in-memory copy of the partition table
(partprobe).

a. There are several variations the student could come up with, but one is to
create a file with priority 75, possibly named /etc/udev/rules.d/75-
classlab-remote.rules, with the contents:

KERNEL=="sda[1-9]", \
PROGRAM=="scsi_id -g -u -s /block/sda/sda%n", \
RESULT=="S_beaf11", \
SYMLINK+="iscsi%n"

(Replace the RESULT field with the output you get from running the command: scsi_id -g
-u -s /block/sda)

If you are having problems making this work, double-check your file for typos and make sure
that you have the correct number of equals signs for each directive as shown above.
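One way to sanity-check the rule without rebooting (assuming a partition such as sda1 already exists;
udevtest is part of the RHEL 5 udev package and only simulates the rule processing):

# partprobe /dev/sda                # re-read the partition table, generating udev events
# udevtest /block/sda/sda1          # show which rules would be applied to this devpath
# ls -l /dev/iscsi1                 # verify that the symlink was created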



Lecture 4


Device Mapper and Multipathing



Upon completion of this unit, you should be able to:
• Understand how Device Mapper works and how to
configure it
• Understand Multipathing and its configuration

Device Mapper 4-1

• Generic device mapping platform


• Used by applications requiring block device mapping:
• LVM2 (e.g. logical volumes, snapshots)
• Multipathing
• Manages the mapped devices (create, remove, ...)
• Configured using plain text mapping tables (load, reload, ...)
• Online remapping
• Maps arbitrary block devices
• Mapping devices can be stacked (e.g. RAID10)
• Kernel mapping-targets are dynamically loadable

The goal of this driver is to support volume management. The driver enables the creation of new logical
block devices composed of ranges of sectors from existing, arbitrary physical block devices (e.g. (i)SCSI).
This can be used to define disk partitions, or logical volumes. This kernel component supports user-space
tools for logical volume management.

Mapped devices can be more than 2TiB in 2.6 and newer versions of the kernel (CONFIG_LBD).

Device mapper has a user space library (libdm) that is interfaced by Device/Volume Management
applications (e.g. dmraid, LVM2) and a configuration and testing tool: dmsetup. The library creates nodes
for the mapped devices in /dev/mapper.



Device Mapping Table 4-2

• Meta-devices are created by loading a mapping table


• Table specifies the physical-to-logical mapping of every sector in the logical device
• Each table line specifies:
• logical device starting sector
• logical device number of sectors (size)
• target type
• target arguments

Each device mapper meta-device is defined by a text file-based table of ordered rules that map each and
every sector (512 bytes) of the logical device to a corresponding arbitrary physical device's sector.

Each line of the table has the format:

logicalStartSector numSectors targetType [targetArgs...]

The target type refers to the kernel device driver that should be used to handle the type of mapping of
sectors that is needed. For example, the linear target type accepts arguments (sector ranges) consistent
with mapping to contiguous regions of physical sectors, whereas the striped target type accepts sector
ranges and arguments consistent with mapping to physical sectors that are spread across multiple disk
devices.



dmsetup 4-3

• Creates, manages, and queries logical devices that use the device-mapper driver
• Mapping table information can be fed to dmsetup via stdin or as a command-line
argument
• Usage example:
• dmsetup create mydevice map_table

A new logical device can be created using dmsetup. For example, the command:

dmsetup create mydevice map_table

will read a file named map_table for the mapping rules to create a new logical device named mydevice. If
successful, the new device will appear as /dev/mapper/mydevice. The logical device can be referred to
by its logical device name (e.g. mydevice), its UUID (-u), or device number (-j major -m minor).

The command:

echo "0 `blockdev --getsize /dev/sda1` linear /dev/sda1 0" | dmsetup create mypart

first figures out how many sectors there are in device /dev/sda1 (blockdev --getsize /dev/sda1), then
uses that information to create a simple linear target mapping to a new logical device named
/dev/mapper/mypart.

See dmsetup(8) for a complete list of commands, options, and syntax.
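Other dmsetup sub-commands are useful once a device exists; for example (using the device created
above):

# dmsetup table mydevice      # show the loaded mapping table
# dmsetup info mydevice       # show state, open count, UUID and major:minor
# dmsetup suspend mydevice    # quiesce I/O, e.g. before loading a new table
# dmsetup resume mydevice
# dmsetup remove mydevice     # tear the mapping down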



Mapping Targets 4-4

• Dynamically loadable (kernel modules)


• Mapping targets (dmsetup targets):
• linear - contiguous allocation
• striped - segmented allocation across devices
• error - defines "out-of-bounds" area
• snapshot - copy-on-write device
• snapshot-origin - device map of original volume
• zero - sparse block devices
• multipath - alternate I/O routes to a device

Mapping targets are specific-purpose drivers that map ranges of sectors for the new logical device onto
'mapping targets' according to a mapping table. The different mapping targets accept different arguments
that are specific to their purpose.

Mapping targets are dynamically loadable and register with the device mapper core.

The crypt mapping target is not discussed in this course.

For more information about the targets and their options, see the text files in
/usr/share/doc/kernel-doc-<version>/Documentation/device-mapper installed by the kernel-doc RPM.



Mapping Target - linear 4-5

• dm-linear driver
• Linearly maps ranges of physical sectors to create a new logical device
• Parameters:
• physical device path
• offset
• Example:
• dmsetup create mydevice map_table

where the file map_table contains the lines:

0 20000 linear /dev/sda1 0
20000 60000 linear /dev/sdb2 0

The linear target maps (creates) a logical device from the concatenation of one or more regions of sectors
from specified physical devices, and is the basic building block of LVM.

In the above example, a logical device named /dev/mapper/mydevice is created by mapping the first
(offset 0) 20000 sectors of /dev/sda1 and the first 60000 sectors of /dev/sdb2 to the logical device.
sda1's sectors make up the first 20000 logical device sectors (starting at sector 0) and sdb2's 60000
sectors make up the rest, starting at offset 20000 of the logical device:

[0 <-(0-20000 of /dev/sda1)-> 20000 <-(0-60000 of /dev/sdb2)-> 80000]

The /dev/mapper/mydevice logical device would appear as a single new device with 80000 contiguous
(linearly mapped) sectors.

As another example, the following script concatenates two devices in their entirety (both provided as the
first two arguments to the command, e.g. scriptname /dev/sda /dev/sdb), to create a single new logical
device named /dev/mapper/combined:

#!/bin/bash
size1=$(blockdev --getsize $1)
size2=$(blockdev --getsize $2)
echo -e "0 $size1 linear $1 0\n$size1 $size2 linear $2 0" | dmsetup create combined



Mapping Target - striped 4-6

• dm-stripe driver
• Maps linear range of one device to segments of sectors spread round-robin across
multiple devices
• Parameters are:
• number of devices
• chunk size
• device path
• offset
• Example:
• dmsetup create mydevice map_table

where the file map_table contains the line:

0 1024 striped 2 256 /dev/sda1 0 /dev/sdb1 0

The striped target maps a linear (contiguous) range of logical device sectors to sectors that have been
striped across several devices, in round-robin fashion. The number of sectors placed on each device before
continuing to the next device is determined by the chunk size, with consecutive chunks rotating among the
underlying devices. This striping of data can potentially provide improved I/O throughput by utilizing several
physical devices in parallel.

In the above example, a 1024-sector logical device named /dev/mapper/mydevice is created
from two partitions, /dev/sda1 and /dev/sdb1. If A(x-y) represents sectors x through y of /dev/sda1
and B(x-y) similarly represents /dev/sdb1, the logical device is created from a mapping of:
A(0-255)B(0-255)A(256-511)B(256-511).

Parameters: <num_devs> <chunk_size> [<dev_path> <offset>] ...


<num_devs> Number of underlying devices.
<chunk_size> Size, in sectors, of each chunk of data written to each device (must be a
power-of-2 and at least as large as the system's PAGE_SIZE).
<dev_path> Full pathname to the underlying block device, or a "major:minor"-formatted
device number.
<offset> Starting sector on the device.

One or more underlying devices can be specified with additional <dev_path> <offset> pairings. The
striped device size must be a multiple of the chunk size and a multiple of the number of underlying devices.

The following script creates a new logical device named /dev/mapper/mystripe that stripes its data
across two equally-sized devices (whose names are specified via command-line arguments) with a chunk
size of 128kiB:

#!/bin/bash
chunk_size=$[ 128 * 2 ]   # 128 KiB expressed in 512-byte sectors
num_devs=2
size1=$(blockdev --getsize $1)
size2=$(blockdev --getsize $2)
size_total=$(( $size1 + $size2 ))
echo -e "0 $size_total striped $num_devs $chunk_size $1 0 $2 0" | dmsetup create mystripe






Mapping Target - error 4-7

• Causes any I/O to the mapped sectors to fail


• Useful for defining gaps in a logical device
• Example:
• dmsetup create mydevice map_table

where the file map_table contains the lines:

0 80 linear /dev/sda1 0
80 100 error
180 200 linear /dev/sdb1 0

The error target causes any I/O to the mapped sectors to fail. This is useful for defining gaps in a logical
device.

In the above example, a gap is defined between sectors 80 and 180 in the logical device.
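A quick way to see the effect (device name taken from the example above; the second read falls in the
error region):

# dmsetup table mydevice                                            # confirm the three mapping lines
# dd if=/dev/mapper/mydevice of=/dev/null bs=512 count=1 skip=10    # succeeds (linear region)
# dd if=/dev/mapper/mydevice of=/dev/null bs=512 count=1 skip=100   # fails with an I/O error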



Mapping Target - snapshot-origin 4-8

• dm-snapshot driver


• dm mapping of the original source volume
• Any reads of unchanged data will be mapped directly to the underlying source
volume
• Works in conjunction with snapshot
• Writes are allowed, but original data is saved to snapshot-mapped COW device first
• Parameters are:
• origin device
• Example:
• dmsetup create mydevice map_table

where the file map_table contains the line:

0 1000 snapshot-origin /dev/sda1

The snapshot-origin mapping target is a dm mapping to the original source volume device that is being
snapshot'd. Whenever a change is made to the snapshot-origin-mapped copy of the original data, the
original data is first copied to the snapshot-mapped COW device.

In the above example, the first 1000 sectors of /dev/sda1 are configured as a snapshot's origin device
(when used with the snapshot mapping target).

Parameters: <origin_device>

<origin_device> - The original underlying device that is being snapshot'd.



Mapping Target - snapshot 4-9

• dm-snapshot driver
• Works in conjunction with snapshot-origin

• Copies origin device data to a separate copy-on-write (COW) block device for
storage before modification
• Snapshot reads come from COW device, or from underlying origin for unchanged
data
• Used by LVM2 snapshot
• Parameters are:
• origin device
• COW device
• persistent?
• chunk size
• Example:
• dmsetup create mydevice map_table

where the file map_table contains the line:

0 1000 snapshot /dev/sda1 /dev/vg0/realdev P 16

In the above example, a 1000-sector snapshot of the block device /dev/sda1 (the origin device) is
created. Before any changes to the origin device are made, the 16-sector chunk (the chunk_size parameter)
of data that the change is part of is first backed up to the COW device, /dev/vg0/realdev. The COW
device contains only chunks that have changed on the original source volume or data written directly to it.
Any writes to the snapshot are written only to the COW device. Any reads of the snapshot will come from
the COW device or the origin device (for unchanged data only). The COW device can usually be smaller
than the origin device, but if it fills up it will become disabled. Fortunately, LVM2 snapshot COW devices
are themselves logical volumes, so extending one is relatively easy to do with the lvextend command
without taking the snapshot offline. This snapshot will persist across reboots.

Parameters: <origin_device> <COW_device> <persistent?> <chunk_size>


<origin_device> The original underlying device that is being snapshot'd
<COW_device> Any blocks written to the snapshot volume are stored here. The original
version of blocks changed on the original volume are also stored here.
<persistent?> Will this survive a reboot? Default is 'P' (yes). 'N' = not persistent. If this is a
transient snapshot, 'N' may be preferable because metadata can be kept in
memory by the kernel instead of having to save it to the disk.
<chunk_size> Modified data chunks of chunk size (default is 16 sectors, or 8kiB) will be
stored on the COW device.

Snapshots are useful for "moment-in-time" backups, testing against production data without actually using
the original production data, making copies of large volumes that require only minor modification to the
source volume for other tasks (without redundant copies of the non-changing data), etc.
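For example, if an LVM2 snapshot's COW space is nearly full, it can be grown online (volume names are
illustrative and match the LVM2 snapshot example later in this lecture):

# lvs vg0                            # the Snap% column shows how full the COW device is
# lvextend -L +100M /dev/vg0/snap    # add 100MiB of COW space without taking the snapshot offline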



LVM2 Snapshots 4-10

• Four dm devices are used by LVM2 snapshots:


• The original mapping of the source volume (<name>-real)
• A snapshot mapping of the original source volume (snapshot-origin)
• The COW device (snapshot)
• The visible snapshot device, consisting of the first and third (<name>)

• Allows block device to remain writable, without altering the original

When you create an LVM2 snapshot of a volume, four dm devices are used, as explained in more detail on
the next slide.



LVM2 Snapshot Example 4-11

lvcreate -L 500M -n original vg0


lvcreate -L 100M -n snap --snapshot /dev/vg0/original

# ll /dev/mapper | grep vg0
total 0
brw-rw---- 1 root disk 253, 0 Mar 21 12:28 vg0-original
brw-rw---- 1 root disk 253, 2 Mar 21 12:28 vg0-original-real
brw-rw---- 1 root disk 253, 1 Mar 21 12:28 vg0-snap
brw-rw---- 1 root disk 253, 3 Mar 21 12:28 vg0-snap-cow

# dmsetup table | grep vg0 | sort
vg0-original: 0 1024000 snapshot-origin 253:2
vg0-original-real: 0 1024000 linear 8:17 384
vg0-snap: 0 1024000 snapshot 253:2 253:3 P 16
vg0-snap-cow: 0 204800 linear 8:17 1024384

# dmsetup ls --tree
vg0-snap (253:1)
 ├─vg0-snap-cow (253:3)
 │  └─(8:17)
 └─vg0-original-real (253:2)
    └─(8:17)
vg0-original (253:0)
 └─vg0-original-real (253:2)
    └─(8:17)

For example, create a logical volume:

pvcreate /dev/sdb1
vgcreate vg0 /dev/sdb1
lvcreate -L 500M -n original vg0

Then take a snapshot of it:

lvcreate -L 100M -n snap --snapshot /dev/vg0/original

Looking at the output of the commands below, we can see that dm utilizes four devices to manage the
snapshot: the original linear mapping of the source volume (vg0-original-real), a "forked" snapshot-
origin mapping of the original source volume (vg0-original), the linear-mapped COW device (vg0-
snap-cow), and the visible snapshot-mapped device (vg0-snap). Note that reads can come from the
original source volume or the COW device. The snapshot-origin device allows more than one snapshot
device to be based on it (several snapshots of a source volume).

# ll /dev/mapper | grep vg0



total 0
brw-rw---- 1 root disk 253, 3 Mar 18 11:35 vg0-original
brw-rw---- 1 root disk 253, 5 Mar 18 11:37 vg0-original-real
brw-rw---- 1 root disk 253, 4 Mar 18 11:37 vg0-snap
brw-rw---- 1 root disk 253, 6 Mar 18 11:37 vg0-snap-cow

# dmsetup table | grep vg0 | sort
vg0-original: 0 1024000 snapshot-origin 253:5
vg0-original-real: 0 1024000 linear 8:17 384
vg0-snap: 0 1024000 snapshot 253:5 253:6 P 16
vg0-snap-cow: 0 204800 linear 8:17 1024000

# dmsetup ls --tree
vg0-snap (253:4)
 ├─vg0-snap-cow (253:6)
 │  └─(8:17)
 └─vg0-original-real (253:5)
    └─(8:17)
vg0-original (253:3)
 └─vg0-original-real (253:5)
    └─(8:17)



Mapping Target - zero 4-12

• dm-zero driver


• Same as /dev/zero, but a block device
• Always returns zero'd data on reads
• Silently drops writes
• Useful for creating sparse devices for testing
• "Fake" very large files and filesystems
• Example:
• dmsetup create mydevice map_table

where the file map_table contains the line:

0 10000000 zero

Device-Mapper's "zero" target provides a block-device that always returns zero'd data on reads and silently
drops writes. This is similar behavior to /dev/ zero, but as a block-device instead of a character-device.
dm- zero has no target-specific parameters.

In the above example, a 10000000-sector (approximately 5 GB) logical device is created named /dev/mydevice.

One interesting use of dm-zero is for creating "sparse" devices in conjunction with dm-snapshot. A
sparse device can report a device size larger than the amount of actual storage space available for that
device. A user can write data anywhere within the sparse device and read it back like a normal device.
Reads from previously-unwritten areas will return zero'd data. When enough data has been written to fill up
the actual underlying storage space, the sparse device is deactivated. This can be useful for testing device
and filesystem limitations.

To create a huge (say, 100TiB) sparse device on a machine with not nearly that much available disk space,
first create a logical volume device that will serve as the true target for any data written to the zero device.
For example, let's assume we pre-created a 1GiB logical volume named /dev/vg0/bigdevice.

Next, create a dm-zero device that's the desired size of the sparse device. For example, the following script
creates a 100TiB dm-zero device named /dev/mapper/zerodev:

#!/bin/bash
HUGESIZE=$[100 * (2**40) / 512] # 100 TiB, in sectors
echo "0 $HUGESIZE zero" | dmsetup create zerodev

Now create a snapshot of the zero device using our previously-created logical volume, /dev/vg0/
bigdevice, as the COW device:

#!/bin/bash
HUGESIZE=$[100 * (2**40) / 512] # 100 TiB, in sectors
echo "0 $HUGESIZE snapshot /dev/mapper/zerodev /dev/vg0/bigdevice P 16" | \
    dmsetup create hugedevice

We now have a device that appears to be a 100TiB device, named /dev/mapper/hugedevice. The size
of the snapshot COW device (1 GiB in this case) will ultimately determine the amount of real disk space that



is available to the sparse device for writing. Writing more than this underlying logical volume can hold will
result in I/O errors.

We can test our "100TiB"-sized device with the following command, which writes out one 1MB-sized block of
zeroes to our sparse device, starting at an offset of 1000000 1MB-sized blocks into the device:

dd if=/dev/zero of=/dev/mapper/hugedevice bs=1M count=1 seek=1000000
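How much of the 1GiB COW volume has actually been consumed can then be checked with dmsetup status
(a hedged aside; the snapshot status line reports allocated versus total sectors, and its exact format varies
between device-mapper versions):

dmsetup status hugedevice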



Device Mapper Multipath Overview 4-13

• Provides redundancy: more than one communication path to the same physical
storage device
• Monitors each path and auto-fails over to alternate path, if necessary
• Provides failover and failback that is transparent to applications
• Creates dm-multipath device aliases (e.g. /dev/dm-2)
• Device-Mapper multipath is cluster-aware and supported with GFS
• Multipath using mdadm is not

Enterprise storage needs redundancy -- in this case more than one path of communication to its storage
devices (e.g. connection from an HBA port to a storage controller port, or an interface used to access an
iSCSI storage volume) -- in the event of a storage communications path failure.

Device Mapper Multipath facilitates this redundancy. As paths fail and new paths come up, dm-multipath
reroutes the I/O over the available paths.

When there are multiple paths to storage, each path appears as a separate device. Device mapper
multipath creates a new meta device on top of those devices. For example, a node with two HBAs, each of
which has two ports attached to a storage controller, sees four devices: /dev/sda, /dev/sdb, /dev/sdc,
and /dev/sdd. Device mapper multipath creates a single device, /dev/dm-2 (for example), that reroutes
I/O to those four underlying devices.

Multipathing iSCSI with dm-multipath is supported in RHEL4 U2 and newer.
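A quick, hedged way to see this relationship on a running system (the device names here are purely
illustrative) is to compare the individual SCSI devices with the multipath map built on top of them:

multipath -ll                   # each mpath device lists sda, sdb, sdc and sdd as its paths
ls -l /dev/mpath /dev/mapper    # the multipath device nodes created on top of those paths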



Device Mapper Components 4-14

• Components:
• Multipath priority groups
• dm-multipath kernel module
• Mapping Target: multipath

• multipath - lists and configures multipath devices


• multipathd daemon - monitors paths
• kpartx - creates dm devices for the partitions

Device mapper multipath consists of the following components:


Multipath priority groups   Used to group together and prioritize shared storage paths.
dm-multipath kernel module  This module reroutes I/O and fails over paths and path groups.
multipath                   Lists and configures multipath devices. Normally started up with a SysV init
                            script, it can also be started up by udev whenever a block device is added.
multipathd daemon           Monitors paths; as paths fail and come back, it may initiate path group
                            switches. Provides for interactive changes to multipath devices. It must be
                            restarted for any changes to the /etc/multipath.conf file.
kpartx                      Creates device mapper devices for the partitions on a device.
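For example (a hedged sketch; mpath0 is an illustrative map name), kpartx can also be run by hand against
an existing multipath map:

kpartx -l /dev/mapper/mpath0    # list the partition mappings that would be created
kpartx -a /dev/mapper/mpath0    # create /dev/mapper/mpath0p1, mpath0p2, ... for its partitions
kpartx -d /dev/mapper/mpath0    # remove those partition mappings again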



Multipath Priority Groups 4-15

• Storage device paths are organized into Priority Groups


• Each group is assigned a priority (0-1024)
• I/O is dispatched to next highest priority path upon failure
• Only one group is active at a time
• Active/active (parallel) paths are members of the same group
• Default scheduling policy: round-robin

• Active/passive paths are members of different groups


• Passive paths remain inactive until needed

The different paths to shared storage are organized into priority groups, each with an assigned priority
(0-1024). The lower the priority value, the higher the preference for that priority group. If a path fails, the I/O
gets dispatched to the priority group with the next-highest priority (next lowest number). If that path is also
faulty, the I/O continues to be dispatched to the next-highest priority group until all path options have been
exhausted. Only one priority group is ever in active use at a time.

The actual action to take upon failure of one priority group is configured by the path_grouping_policy
parameter in the defaults section of /etc/multipath.conf. This parameter is typically configured to
have the value failover.

Placing more than one path in the same priority group results in an "active/active" configuration: more
than one path being used at the same time. Separating the paths into different priority groups results in an
"active/passive" configuration: active paths are in use, passive paths remain inactive until needed because
of a failure in the active path.

Each priority group has a scheduling policy that is used to distribute the I/O among the different paths within
it (e.g. round-robin). The scheduling policy is specified as a parameter to the multipath mapping target and
by the default_selector/path_selector parameters in /etc/multipath.conf. A minimal configuration sketch
follows below.
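The fragment below is a minimal sketch, not a complete configuration, showing how the grouping policy and
the per-group scheduler are typically expressed in the defaults section of /etc/multipath.conf (the parameter
names match those used in the lab at the end of this unit):

defaults {
        path_grouping_policy    failover        # one path per priority group (active/passive)
        #path_grouping_policy   multibus        # all paths in one priority group (active/active)
        selector                "round-robin 0" # scheduler used to spread I/O within the active group
}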



Mapping Target - multipath 4-16

• dm-multipath driver
• Parameters (after multipath keyword) are:
• Number of priority groups for the segment
• The first priority group's parameters:
• scheduler used to spread I/O inside priority group
• number of paths in the priority group
• number of paths parameters (usually 0)
• list of paths for this priority group

• Additional priority group parameter sections

Parameters: <num_pg> <sched> <num_paths> <num_paths_parms> <path_list> [<sched> <num_paths> <num_paths_parms> <path_list>]...

Parameter definitions:
<num_pg> The number of priority groups
<sched> The scheduler used to spread the I/O inside the priority group
<num_paths> The number of paths in the priority group
<num_paths_parms> The number of paths parameters in the priority group (usually 0)
<path_list> A list of paths for this priority group

Additional priority groups can be appended.

Here we list some multipath examples. The first defines a 2147483648-sector (1TiB) storage device with two priority groups.
Each priority group round-robins the I/O across two separate paths.

0 2147483648 multipath 2 round-robin 2 0 /dev/sda /dev/sdb round-robin 2 0 /dev/sdc /dev/sdd

This example demonstrates a failover target (4 priority groups, each with one path):

0 2147483648 multipath 4 round-robin 1 0 /dev/sda round-robin 1 0 /dev/sdb round-robin 1 0 /dev/sdc round-robin 1 0 /dev/sdd

This example spreads out (multibus) the target I/O using a single priority group:

0 2147483648 multipath 1 round-robin 4 0 /dev/sda /dev/sdb /dev/sdc /dev/sdd

The following command determines the multipath device assignments on a system, and then creates the
multipath devices for each partition:

/sbin/dmsetup ls --target multipath --exec "/sbin/kpartx -a"



Setup Steps for Multipathing FC Storage 4-17

• Install device-mapper-multipath RPM


• Configure /etc/multipath.conf


• modprobe dm_multipath
• modprobe dm-round-robin

• chkconfig multipathd on
• service multipathd start
• multipath -l

Note: while the actual device drivers are named dm-multipath.ko and dm-round-robin.ko (see the
files in /lib/modules/<kernel-version>/kernel/drivers/md), underscores are used in place of the
dash characters in the output of the lsmod command and either naming form can be used with modprobe.

Available SCSI devices are viewable via /proc/scsi/scsi:

# cat /proc/scsi/scsi
Attached devices:


Host: scsi0 Channel: 00 Id: 00 Lun: 00
Vendor: SEAGATE Model: ST318305LC Rev: 2203
Type: Direct-Access ANSI SCSI revision: 03
Host: scsil Channel: 00 Id: 00 Lun: 00
Vendor: ATA Model: ST340014AS Rev: 8.05
Type: Direct-Access ANSI SCSI revision: 05
Host: scsi3 Channel: 00 Id: 00 Lun: 08
Vendor: IET Model: VIRTUAL-DISK Rev: 0
Type: Direct-Access ANSI SCSI revision: 04

If you need to re-do a SCSI scan, you can run the command:

echo "- - -" > /sys/class/scsi_host/hostO/scan



where host0 is replaced by the HBA you wish to use. You also can do a fabric rediscover with the
commands:

echo "1" > /sys/class/fc_host/hostO/issue_lip



echo "- - -" > /sys/class/scsi_host/hostO/scan

This sends a LIP (loop initialization primitive) to the fabric. During the initialization, HBA access may be
slow and/or experience timeouts.






Multipathing and iSCSI 4-18

• Similar to FC multipathing setup


• Can use either:
• dm-multipath
• interface bonding

iSCSI can be multipathed. The iSCSI target is presented to the initiator via a completely independent
pathway. For example, two different interfaces, eth0 and eth1, configured on different subnets, can provide
the same exact device to the initiator via different pathways.

In Linux, when there are multiple paths to a storage device, each path appears as a separate block
device. The separate block devices, with the same WWID, are used by multipath to create a new multipath
block device. Device mapper multipath then creates a single block device that re-routes I/O through the
underlying block devices.

In the event of a failure on one interface, multipath transparently changes the route for the device to be the
other network interface.

Ethernet interface bonding provides a partial alternative to dm-multipath with iSCSI, where one of the
Ethernet links can fail between the node and the switch, and the network traffic to the target's IP address
can switch to the remaining Ethernet link without involving the iSCSI block device at all. This does not
necessarily address the issue of a failure of the switch or of the target's connection to the switch.
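As a hedged sketch (the portal addresses and the target IQN placeholder are illustrative; the same
commands are used in the lab at the end of this unit), presenting one iSCSI target over two independent
pathways simply means discovering and logging in to it once per portal:

iscsiadm -m discovery -t sendtargets -p 172.17.(100+X).254
iscsiadm -m node -T <target_iqn_name> -p 172.17.(100+X).254 -l
iscsiadm -m discovery -t sendtargets -p 172.17.(200+X).254
iscsiadm -m node -T <target_iqn_name> -p 172.17.(200+X).254 -l

The second login produces a second sd device with the same WWID, which dm-multipath then folds into a
single mpath device.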



Multipath Configuration 4-19

• /etc/multipath.conf sections:


• defaults - multipath tools default settings
• blacklist - list of specific device names to not consider for multipathing
• blacklist_exceptions - list of multipathing candidates that would otherwise be blacklisted
• multipaths - list of multipath characteristic settings
• devices - list of per storage controller settings
• Allows regular expression description syntax
• Only specify sections that are needed
defaults                A section that lists default settings for the multipath tools. See the file
                        /usr/share/doc/device-mapper-multipath-<version>/multipath.conf.annotated
                        for more details.
blacklist               By default, all devices are blacklisted (devnode "*"). Usually, the default
                        blacklist section is commented out and/or modified by more specific rules in
                        the blacklist_exceptions and secondary blacklist sections.
blacklist_exceptions    Allows devices to be multipathing candidates that would otherwise be
                        blacklisted (a hedged example follows the blacklist example below).
multipaths              Specifies multipath-specific characteristics.
Secondary blacklist     To blacklist entire types of devices (e.g. SCSI devices), use a devnode line in
                        the secondary blacklist section. To blacklist specific devices, use a World-
                        Wide IDentification (WWID) line. Unless it is statically mapped by udev rules,
                        there may be no guarantee that a specific device will have the same name
                        on reboot (e.g. it could change from /dev/sda to /dev/sdb). Therefore it is
                        generally recommended to not use devnode lines for blacklisting specific
                        devices. Examples:

blacklist {
        wwid 26353900f02796769
        devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
        devnode "^hd[a-z]"
        devnode "^cciss!c[0-9]d[0-9]*"
}
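The blacklist_exceptions section uses the same matching syntax. A minimal, hedged sketch (the WWID shown
is the same placeholder used in the multipaths example later in this section) that re-admits one specific LUN
while everything else stays blacklisted:

blacklist_exceptions {
        wwid 3600508b4000156d700012000000b0000
}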

Multipath attributes that can be set:


wwid The container index
alias Symbolic name for the multipath
path_checker Path checking algorithm used to check path state
path_selector The path selector algorithm used for this multipath
failback Whether the group daemon should manage path group failback or not
no_path_retry Should retries queue (never stop queuing until the path is fixed), fail (no
queuing), or try N times before disabling queuing (N>0)
rr_min_io The number of I/Os to route to a particular path before switching to the next in
the same path group
rr_weight Used to assign weights to the path
prio_callout Executable used to obtain a path weight for a block device. Weights are
summed for each path group to determine the next path group to use in case
of path failure



Example:

multipaths {
        multipath {
                wwid                    3600508b4000156d700012000000b0000
                alias                   yellow
                path_grouping_policy    multibus
                path_checker            readsector0
                path_selector           "round-robin 0"
                failback                manual
                rr_weight               priorities
                no_path_retry           5
        }
        multipath {
                wwid                    1DEC 321816758474
                alias                   red
        }
}



Multipath Information Queries 4-20

• multipath [-l | -ll | -v[0|1|2]]


• dmsetup ls --target multipath
• dmsetup table
• Example:

# multipath -l
mpath1 (3600d0230003228bc000339414edb8101)
[size=10 GB][features="0"][hwhandler="0"]
\_ round-robin 0 [prio=1][active]
 \_ 2:0:0:6 sdb 8:16 [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 3:0:0:6 sdc 8:64 [active][ready]

For each multipath device, the first two lines of output are interpreted as follows:

action_if_any: alias (WWID_if_different_from_alias)
[size][features][hardware_handler]

• action_if_any : If multipath is performing an action, while running the command this action will be
displayed here. An action can be reload, create or switchpg.

• alias : The name of the multipath device as can be found in /dev/mapper.

• WWID : The unique identifier of the LUN.

• size : The size of the multipath device.

• features : A list of all the options enabled for this multipath device (e.g. queue_if_no_path).

• hardware_handler : 0 if no hardware handler is in use, or 1 and the name of the hardware handler
kernel module if in use.

For each path group:

\_ scheduling_policy [path_group_priority] [path_group_status]

• scheduling_policy : Path selector algorithm in use for this path group (defined in /etc/
multipath.conf).

• path_group_priority : If known. Each path can have a priority assigned to it by a callout program. Path
priorities can be used to group paths by priority and change their relative weights for the algorithm that
defines the scheduling policy.

• path_group_status : If known. The status of the path can be one of: active (path group currently
receiving I/O requests), enabled (path groups to try if the active path group has no paths in the ready
state), and disabled (path groups to try if the active path group and all enabled path groups have no
paths in the active state).

For each path:

\_ host:channel:id:lun devnode major:minor [path_status] [dm_status_if_known]




• host:channel:id:lun : The SCSI host, channel, ID, and LUN variables that identify the LUN.

• devnode : The name of the device.

• major:minor : The major and minor numbers of the block device.

• path_status : One of the following: ready (path is able to handle I/O requests), shaky (path is up, but
temporarily not available for normal operations), faulty (path is unable to handle I/O requests), and
ghost (path is a passive path, on an active/passive controller).

• dm_status_if_known : Similar to the path status, but from the kernel's point of view. The dm status has
two states: failed (analogous to faulty), and active, which covers all other path states.

If the path is up and ready for I/O, the state of the path is [ready] [active]. If the path is down, the
state will be [faulty] [failed] . The path state is updated periodically by the multipathd daemon based
on the polling interval defined in /etc/multipath. conf. The dm status is similar to the path status, but
from the kernel's point of view.

NOTE: When a multipath device is being created or modified, the path group status and the dm status
are not known. Also, the features are not always correct. When a multipath device is being listed, the path
group priority is not known.

To find out which device mapper entries match the systems multipathed devices, perform the following:

• multipath -ll

• Determine which long numbers are needed for the device mapper entries.

• dmsetup ls --target multipath

This will return the long number. Examine the part that reads "(253, #)". The '#' is the device mapper
minor number. The numbers can then be compared to find out which dm device corresponds to the multipathed
device, for example /dev/dm-3.
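For illustration only (the map name and minor number below are placeholders, not output captured from a
real system), the matching might look like:

# dmsetup ls --target multipath
mpath0  (253, 2)

Here the minor number 2 indicates that the mpath0 map is backed by the device node /dev/dm-2.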



End of Lecture 4

• Questions and Answers


• Summary
• We learned how the system maps ordinary physical devices into very useful logical devices.



Lab 4.1: Device Mapper Multipathing
Scenario: The iSCSI target volume on your workstation has been previously
created. The node1 machine accesses this iSCSI volume via its eth2
(172.17.(100+X).1) interface.
We want to make this iSCSI volume more fault tolerant by providing a
second, independent network pathway to the iSCSI storage. We also want
our operating system to ensure uninterrupted access to the storage device
by automatically detecting any failure in the iSCSI pathway and redirecting
all traffic to the alternate path.
The /dev/sda device on your node1 is an iSCSI-provided device,
discovered on your iSCSI target's 172.17.(100+X).254 interface.

Deliverable: Create a second, independent network pathway on node1 to our iSCSI
storage device and configure device-mapper multipathing so that access to
the iSCSI target device continues in the event of a network failure.

Instructions:
1. If you did not rebuild node1 at the end of the last lab, do so now using the rebuild-
cluster script.

2. Note: In this lab we will be performing multipath failovers using our iSCSI SAN. In
node1's iSCSI initiator configuration file, /etc/iscsi/iscsid.conf, the default
iSCSI timeout parameters (node.session.timeo.replacement_timeout and
node.session.err_timeo.lu_reset_timeout) are set to 120 and 20 seconds,
respectively. Left unchanged, failovers would take a while to complete. Edit these parameters to
something smaller (e.g. 10, for both) and restart the iscsid service to put them into effect.

3. Before we can use the second interface on the initiator side, we need to modify the target
configuration. Add 172.17.(200+X).1, 172.17.(200+X).2, and 172.17.(200+X).3 as
valid initiator addresses to /etc/tgt/targets.conf

4. Restart tgtd to activate the changes.

Note that this will not change targets that have active connections. In this case either stop these
connections first, or use tgtadm --lld iscsi --op bind --mode target --tid 1 -I initiator-ip

5. Let's start by discovering the target on the first interface. Also set the initiator alias again to
node1

6. Log into node1 via ssh (do not use the console). Currently, node1's network interfaces are
configured as:

eth0 -> 172.16.50.X1/16
eth1 -> 172.17.X.1/24 (will be used for cluster messaging later)
eth2 -> 172.17.(100+X).1/24 (first path to the iSCSI target)
eth3 -> 172.17.(200+X).1/24 (second path to the iSCSI target)

Note that eth3 is on a different subnet than eth2.

7. On node1, make sure there are exactly two 1GiB partitions on /dev/sda (/dev/sda1 and
/dev/sda2). Delete any extras or create new ones if necessary.

8. Discover and login to the target on the second interface (172.17.(200+X).254).

9. Re-examine the output of the command 'fdisk -l'. Notice the addition of the new /dev/sdb
device, which is really the same underlying device as /dev/sda (notice their partitions have
the same characteristics), but provided to the machine a second time via a second pathway.

We can prove it is the same device by, for example, comparing the output of the following
commands:

cXn1# scsi_id -g -u -s /block/sda
cXn1# scsi_id -g -u -s /block/sdb

or

cXn1# scsi_id -g -p 0x83 -s /block/sda
cXn1# scsi_id -g -p 0x83 -s /block/sdb

See scsi_id(8) for explanation of the output and options used.

10. If not already installed, install the device-mapper-multipath RPM on node1.

11. Make the following changes to /etc/multipath.conf:

Comment out the first blacklist section:

# blacklist {
#        devnode "*"
# }

Uncomment the device-mapper default behavior section that looks like the following:

defaults {
        udev_dir                /dev
        polling_interval        10
        selector                "round-robin 0"
        path_grouping_policy    multibus
        getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
        prio_callout            /bin/true
        path_checker            readsector0
        rr_min_io               100
        rr_weight               priorities
        failback                immediate
        no_path_retry           fail
        user_friendly_names     yes
}

Change the path_grouping_policy to failover, instead of multibus, to enable
simple failover.

defaults {
        udev_dir                /dev
        polling_interval        10
        selector                "round-robin 0"
        path_grouping_policy    failover
        getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
        prio_callout            /bin/true
        path_checker            readsector0
        rr_min_io               100
        rr_weight               priorities
        failback                immediate
        no_path_retry           fail
        user_friendly_names     yes
}

Uncomment the blacklist section just below it. This filters out all the devices that are not
normally multipathed, such as IDE hard drives and floppy drives.

Save the configuration file and exit the editor.

12. Before we start the multipathd service, make sure the proper modules are loaded:
dm_multipath, dm_round_robin. List all available dm target types currently available in
the kernel.

13. Open a console window to node1 from your workstation and, in a separate terminal window,
log in to node1 and monitor /var/log/messages.

14. Now start the multipathd service and make it persistent across reboots.

15. View the result of starting multipathd by running the commands:

cXn1# fdisk -l
cXn1# ll /dev/mpath

The device mappings, in this case, are as follows:

      /--sda--\                      /-- dm-3 (sd[ab]1)
LUN --+        +-- dm-2 (mpath0) ----+
      \--sdb--/                      \-- dm-4 (sd[ab]2)

These device mappings follow the pattern of:



SAN (iSCSI storage) --> NIC (eth0/eth1, or HBA) --> device (/dev/sda) --> dm device
(/dev/dm-2) --> dm-mp device (/dev/mpath/mpath0).

Notice how device mapper combines multiple paths into a single device node. For example,
/dev/dm-2 represents both paths to our iSCSI target LUN. The device node /dev/dm-3
singularly represents both paths to the first partition on the device, and the device node /dev/
dm-4 singularly represents both paths to the second partition on the device.

You will notice that /dev/dm-3 is also referred to as /dev/mpath/mpath0p1 and /dev/
mapper/mpath0p1. Only the /dev/mapper/mpath* device names are persistent and are
created early enough in the boot process to be used for creating logical volumes or filesystems.
Therefore these are the device names that should be used to access the multipathed devices.

Keep in mind that fdisk cannot be used with /dev/dm-# devices. If the multipathed device
needs to be repartitioned, use fdisk on the underlying disks instead. Afterward, execute the
command 'kpartx -a /dev/dm-#' to recognize any newly created partitions. The device-mapper
multipath maps will get updated and create /dev/dm-# devices for them.



16. View the multipath device assignments using the command:

cXn1# multipath -ll
mpath0 (S_beaf11) dm-2 IET,VIRTUAL-DISK
[size=10G][features=0][hwhandler=0]
\_ round-robin 0 [prio=0][active]
 \_ 0:0:0:1 sda 8:0 [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 1:0:0:1 sdb 8:16 [active][ready]

The first line shows the name of the multipath (mpath0), its SCSI ID, and device-mapper
device node. The second line helps to identify the device vendor and model. The third line
specifies device attributes. The remaining lines show the participating paths of the multipath
device, and their state. The "0:0:0:1" portion represents the host, bus (channel), target (SCSI
id) and LUN, respectively, of the device (compare to the output of the command cat /proc/scsi/
scsi).
scsi).

17. Test our multipathed device to make sure it really will survive a failure of one of its pathways.

Create a filesystem on /dev/mapper/mpath0p1 (which is really the first partition of our
multipathed device), create a mount point named /mnt/data, and then mount it.

Create a file in the /mnt/data directory that we can use to verify we still have access to the
disk device.
18. To test that our filesystem can survive a failure of either path (eth2 or eth3) to the device,
we will systematically bring down the two interfaces, one at a time, and test that we still have
access to the remote device's contents. To do this, we will need to work from the console

window of node1, which you opened earlier, otherwise open a new console connection now.

19. Test the first path. From the console, verify that device access survives if we bring down eth3,
and that we still have read/write access to /mnt/data/passwd.

Note: if the iSCSI parameters were not trimmed to smaller values properly, the following
multipath command and log output could take up to 120 seconds to complete.

If you monitor the tail end of /var/log/messages, you will see messages similar to
(trimmed for brevity):

avahi-daemon[1768]: Interface eth3.IPv6 no longer relevant for mDNS.
kernel: sd 1:0:0:1: SCSI error: return code = 0x00020000
kernel: end_request: I/O error, dev sdb, sector 4544
kernel: device-mapper: multipath: Failing path 8:16.
multipathd: sdb: readsector0 checker reports path is down
multipathd: checker failed path 8:16 in map mpath0
multipathd: mpath0: remaining active paths: 1
iscsid: Nop-out timedout after 15 seconds on connection 2:0 state (3). Dropping session.

The output of multipath also provides information:

node1-console# multipath -ll
sdb: checker msg is "readsector0 checker reports path is down"
mpath0 (16465616462656166313a3100000000000000000000000000) dm-2 IET,VIRTUAL-DISK
[size=10G][features=0][hwhandler=0]
\_ round-robin 0 [prio=0][active]
 \_ 0:0:0:1 sda 8:0 [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 1:0:0:1 sdb 8:16 [failed][faulty]

Notice that the eth3 path (/dev/sdb) has failed, but the other path is still ready and active
for all access requests.

Bring the eth3 interface back up when you are finished verifying. Ensure that both paths are
active and ready before continuing.

20. Now test the other path.

Repeat the process by bringing down the eth2 interface, and again verifying that you still have
read/write access to the device's contents.

Bring the eth2 interface back up when you are finished verifying.

21. Rebuild node1 when done (execute rebuild-cluster -1 on your workstation).



Lab 4.1 Solutions
1. If you did not rebuild node1 at the end of the last lab, do so now using the rebuild-
cluster script.

stationX# rebuild-cluster -1

2. Note: In this lab we will be performing multipath failovers using our iSCSI SAN. In
node1's iSCSI initiator configuration file, /etc/iscsi/iscsid.conf, the default
iSCSI timeout parameters (node.session.timeo.replacement_timeout and
node.session.err_timeo.lu_reset_timeout) are set to 120 and 20 seconds,
respectively. Left unchanged, failovers would take a while to complete. Edit these parameters to
something smaller (e.g. 10, for both) and restart the iscsid service to put them into effect.

node1# vi /etc/iscsi/iscsid.conf
node1# service iscsid restart

3. Before we can use the second interface on the initiator side, we need to modify the target
configuration. Add 172.17.(200+X).1, 172.17.(200+X).2, and 172.17.(200+X).3 as
valid initiator addresses to /etc/tgt/targets.conf

/etc/tgt/targets.conf:
<target iqn.2009-10.com.example.clusterX:iscsi>
# List of files to export as LUNs
backing-store /dev/vol0/iscsi
initiator-address 172.17.(100+X).1
initiator-address 172.17.(100+X).2
initiator-address 172.17.(100+X).3

initiator-address 172.17.(200+X).1
initiator-address 172.17.(200+X).2
initiator-address 172.17.(200+X).3
</target>

4. Restart tgtd to activate the changes.

Note that this will not change targets that have active connections. In this case either stop these
connections first, or use tgtadm --lld iscsi --op bind --mode target --tid 1 -I initiator-ip

stationX# /sbin/service tgtd stop; /sbin/service tgtd start

or if the target is in use:

stationX# tgtadm --lld iscsi --op bind --mode target --tid 1 -I 172.17.(200+X).1
stationX# tgtadm --lld iscsi --op bind --mode target --tid 1 -I 172.17.(200+X).2
stationX# tgtadm --lld iscsi --op bind --mode target --tid 1 -I 172.17.(200+X).3



5. Let's start by discovering the target on the first interface. Also set the initiator alias again to
node1.

cXn1# echo 'InitiatorAlias=node1' >> /etc/iscsi/initiatorname.iscsi
cXn1# service iscsi start
cXn1# chkconfig iscsi on
cXn1# iscsiadm -m discovery -t sendtargets -p 172.17.(100+X).254
cXn1# iscsiadm -m node -T <target_iqn_name> -p 172.17.(100+X).254 -l
6. Log into node1 via ssh (do not use the console). Currently, node1's network interfaces are
configured as:

eth0 -> 172.16.50.X1/16
eth1 -> 172.17.X.1/24 (will be used for cluster messaging later)
eth2 -> 172.17.(100+X).1/24 (first path to the iSCSI target)
eth3 -> 172.17.(200+X).1/24 (second path to the iSCSI target)

Note that eth3 is on a different subnet than eth2.

7. On node1, make sure there are exactly two 1GiB partitions on /dev/sda (/dev/sda1 and
/dev/sda2). Delete any extras or create new ones if necessary.

cXn1# fdisk -l

8. Discover and login to the target on the second interface (172.17.(200+X).254).

cXn1# iscsiadm -m discovery -t sendtargets -p 172.17.(200+X).254
cXn1# iscsiadm -m node -T <target_iqn_name> -p 172.17.(200+X).254 -l
9. Re-examine the output of the command 'fdisk -l'. Notice the addition of the new /dev/sdb
device, which is really the same underlying device as /dev/sda (notice their partitions have
the same characteristics), but provided to the machine a second time via a second pathway.

We can prove it is the same device by, for example, comparing the output of the following
commands:

cXn1# /sbin/scsi_id -g -u -s /block/sda
cXn1# /sbin/scsi_id -g -u -s /block/sdb

See scsi_id(8) for explanation of the output and options used.

10. If not already installed, install the device-mapper-multipath RPM on node1.

cXn1# yum -y install device-mapper-multipath

11. Make the following changes to /etc/multipath.conf:

Comment out the first blacklist section:

# blacklist {





devnode "*"
# }

Uncomment the device-mapper default behavior section that looks like the following:

defaults {
        udev_dir                /dev
        polling_interval        10
        selector                "round-robin 0"
        path_grouping_policy    multibus
        getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
        prio_callout            /bin/true
        path_checker            readsector0
        rr_min_io               100
        rr_weight               priorities
        failback                immediate
        no_path_retry           fail
        user_friendly_names     yes
}

Change the path_grouping_policy to failover, instead of multibus, to enable
simple failover.

defaults {
        udev_dir                /dev
        polling_interval        10
        selector                "round-robin 0"
        path_grouping_policy    failover
        getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
        prio_callout            /bin/true
        path_checker            readsector0
        rr_min_io               100
        rr_weight               priorities
        failback                immediate
        no_path_retry           fail
        user_friendly_names     yes
}

Uncomment the blacklist section just below it. This filters out all the devices that are not
normally multipathed, such as IDE hard drives and floppy drives.

Save the configuration file and exit the editor.


••
12. Before we start the multipathd service, make sure the proper modules are loaded:
dm_multipath, dm_round_robin. List all available dm target types currently available in
the kernel.

cXn1# modprobe dm_multipath
cXn1# modprobe dm_round_robin
cXn1# lsmod | grep dm_



13. Open a console window to node1 from your workstation and, in a separate terminal window,
log in to node1 and monitor /var/log/messages.

stationX# xm console node1
stationX# ssh node1
cXn1# tail -f /var/log/messages

14. Now start the multipathd service and make it persistent across reboots.

cXn1# chkconfig multipathd on
cXn1# service multipathd start

15. View the result of starting multipathd by running the commands:

cXn1# fdisk -l
cXn1# ll /dev/mpath

The device mappings, in this case, are as follows:

      /--sda--\                      /-- dm-3 (sd[ab]1)
LUN --+        +-- dm-2 (mpath0) ----+
      \--sdb--/                      \-- dm-4 (sd[ab]2)

These device mappings follow the pattern of:

SAN (iSCSI storage) --> NIC (eth0/eth1, or HBA) --> device (/dev/sda) --> dm device
(/dev/dm-2) --> dm-mp device (/dev/mpath/mpath0).

Notice how device mapper combines multiple paths into a single device node. For example,
/dev/dm-2 represents both paths to our iSCSI target LUN. The device node /dev/dm-3
singularly represents both paths to the first partition on the device, and the device node /dev/
dm-4 singularly represents both paths to the second partition on the device.

You will notice that /dev/dm-3 is also referred to as /dev/mpath/mpath0p1 and /dev/
mapper/mpath0p1. Only the /dev/mapper/mpath* device names are persistent and are
created early enough in the boot process to be used for creating logical volumes or filesystems.
Therefore these are the device names that should be used to access the multipathed devices.

Keep in mind that fdisk cannot be used with /dev/dm-# devices. If the multipathed device
needs to be repartitioned, use fdisk on the underlying disks instead. Afterward, execute the
command 'kpartx -a /dev/dm-#' to recognize any newly created partitions. The device-mapper
multipath maps will get updated and create /dev/dm-# devices for them.

16. View the multipath device assignments using the command:

cXn1# multipath -ll
mpath0 (S_beaf11) dm-2 IET,VIRTUAL-DISK
[size=10G][features=0][hwhandler=0]
\_ round-robin 0 [prio=0][active]
 \_ 0:0:0:1 sda 8:0 [active][ready]
\_ round-robin 0 [prio=0][enabled]



\_ 1:0:0:1 sdb 8:16 [active][ready]

The first line shows the name of the multipath (mpath0), its SCSI ID, and device-mapper
device node. The second line helps to identify the device vendor and model. The third line
specifies device attributes. The remaining lines show the participating paths of the multipath
device, and their state. The "0:0:0:1" portion represents the host, bus (channel), target (SCSI
id) and LUN, respectively, of the device (compare to the output of the command cat /proc/scsi/
scsi).

17. Test our multipathed device to make sure it really will survive a failure of one of its pathways.

Create a filesystem on /dev/mapper/mpath0p1 (which is really the first partition of our
multipathed device), create a mount point named /mnt/data, and then mount it.

cXn1# mke2fs -j /dev/mapper/mpath0p1
cXn1# mkdir /mnt/data
cXn1# mount /dev/mapper/mpath0p1 /mnt/data

Create a file in the /mnt/data directory that we can use to verify we still have access to the
disk device.
disk device.

cXn1# cp /etc/passwd /mnt/data

18. To test that our filesystem can survive a failure of either path (eth2 or eth3) to the device,
we will systematically bring down the two interfaces, one at a time, and test that we still have
access to the remote device's contents. To do this, we will need to work from the console
window of node1, which you opened earlier, otherwise open a new console connection now.

19. Test the first path. From the console, verify that device access survives if we bring down eth3,
and that we still have read/write access to /mnt/data/passwd.

cXn1# ifdown eth3
cXn1# cat /mnt/data/passwd
cXn1# echo "HELLO" >> /mnt/data/passwd

Note: if the iSCSI parameters were not trimmed to smaller values properly, the following
multipath command and log output could take up to 120 seconds to complete.

If you monitor the tail end of /var/log/messages, you will see messages similar to
(trimmed for brevity):

avahi-daemon[1768]: Interface eth3.IPv6 no longer relevant for mDNS.
kernel: sd 1:0:0:1: SCSI error: return code = 0x00020000
kernel: end_request: I/O error, dev sdb, sector 4544
kernel: device-mapper: multipath: Failing path 8:16.
multipathd: sdb: readsector0 checker reports path is down
multipathd: checker failed path 8:16 in map mpath0
multipathd: mpath0: remaining active paths: 1
iscsid: Nop-out timedout after 15 seconds on connection 2:0 state (3). Dropping session.



The output of multipath also provides information:

node1-console# multipath -ll
sdb: checker msg is "readsector0 checker reports path is down"
mpath0 (16465616462656166313a3100000000000000000000000000) dm-2 IET,VIRTUAL-DISK
[size=10G][features=0][hwhandler=0]
\_ round-robin 0 [prio=0][active]
 \_ 0:0:0:1 sda 8:0 [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 1:0:0:1 sdb 8:16 [failed][faulty]

Notice that the eth3 path (/dev/sdb) has failed, but the other path is still ready and active
for all access requests.

Bring the eth3 interface back up when you are finished verifying. Ensure that both paths are
active and ready before continuing.

cXn1# ifup eth3
cXn1# multipath -ll
mpath0 (16465616462656166313a3100000000000000000000000000) dm-2 IET,VIRTUAL-DISK
[size=10G][features=0][hwhandler=0]
\_ round-robin 0 [prio=0][active]
 \_ 0:0:0:1 sda 8:0 [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 1:0:0:1 sdb 8:16 [active][ready]

20. Now test the other path.

Repeat the process by bringing down the eth2 interface, and again verifying that you still have
read/write access to the device's contents.

cXn1# ifdown eth2
cXn1# cat /mnt/data/passwd
cXn1# echo "LINUX" >> /mnt/data/passwd
cXn1# multipath -ll

Bring the eth2 interface back up when you are finished verifying.

21. Rebuild node1 when done (execute rebuild-cluster -1 on your workstation).

stationX# rebuild-cluster -1
This will create or rebuild node(s): 1
Continue? (y/N): y

The rebuild process can be monitored from node1's console window:

stationX# xm console node1




Lecture 5


Red Hat Cluster Suite Overview

Upon completion of this unit, you should be able to:
• Provide an overview of Red Hat Cluster Suite and its major
components







What is a Cluster? 5-1

• A group of machines that work together to perform a task.


• The goal of a cluster is to provide one or more of the following:
• High Performance
• High Availability
• Load Balancing
• Red Hat's cluster products are enablers of these goals
• Red Hat Cluster Suite
• Global File System (GFS)
• Clustered Logical Volume Manager (CLVM)
• Piranha

High performance, or computational, clusters, sometimes referred to as GRID computing, use the CPUs
of several systems to perform concurrent calculations. Working in parallel, many applications, such as
animation rendering or a wide variety of simulation and modeling problems, can improve their performance
considerably.

High-availability application clusters are also sometimes referred to as fail-over clusters. Their intended
purpose is to provide continuous availability of some service by eliminating single points of failure. Through
redundancy in both hardware and software, a highly available system can provide virtually continuous
availability for one or more services. Fail-over clusters are usually associated with services that involve both
reading and writing data. Fail-over of read-write mounted file systems is a complex process, and a fail-over
system must contain provisions for maintaining data integrity as a system takes over control of a service
from a failed system.

Load-balancing clusters dispatch network service requests to multiple systems in order to spread the
request load over multiple systems. Load-balancing provides cost-effective scalability, as more systems can
be added as requirements change over time. Rather than investing in a single, very expensive system, it is
possible to invest in multiple commodity x86 systems. If a member server in the cluster fails, the clustering
software detects this and sends any new requests to other operational servers in the cluster. An outside
client should not notice the failure at all, since the cluster looks like a single large server from the outside.
Therefore, this form of clustering also makes the service highly-available, able to survive system failures.

What distinguishes a high availability system from a load-balancing system is the relationship of fail-over
systems to data storage. For example, web service might be provided through a load-balancing router that
dispatches requests to a number of real web servers. These web servers might read content from a fail-
over cluster providing a NFS export or running a database server.



Red Hat Cluster Suite 5-2

• Open Source Clustering Solution


• Provides infrastructure for Clustered Application
• rgmanager: makes off-the-shelf applications highly available
• GFS2: Clustered filesystem
High availability clusters, like Red Hat Cluster Suite, provide the necessary infrastructure for monitoring and
failure resolution of a service and its resources.

Red Hat Cluster Suite allows services to be relocated to another node in the event of an unresolvable
failure on the original node. The service itself does not need to be aware of the other nodes, the status of its
own resources, or the relocation process.

Shared storage among the cluster nodes may be useful so that the services' data remains available
after being relocated to another node, but shared storage is not required for the cluster to keep a service
available.

The ability to prevent access to a resource (hard disk, etc...) for a cluster node that loses contact with the
rest of the nodes in the cluster is called fencing, and is a requirement for multi-machine (as opposed to
single machine, or virtual machine instances) support. Fencing can be accomplished at the network level
(e.g. SCSI reservations or a fibre channel switch) or at the power level (e.g. networked power switch).



Cluster Topology 5-3

[Diagram: Red Hat Cluster Suite cluster topology]

Of the several types of clusters described, this course will focus on Highly Available (HA) service clusters
utilizing a shared-access Global File System (GFS).

Red Hat Cluster Suite includes and provides the infrastructure for both HA failover cluster domains and
GFS.

HA clusters provide the capability for a given service to remain highly available on a group of cluster nodes
by "failing over" ("relocating") to a still-functional node within its "failover domain" (group of pre-defined and
cluster nodes to which it can be relocated) when its current node fails in some way.

GFS complements the Cluster Suite by providing cluster-aware volume management and concurrent file
system access to more than one kernel I/O system (shared storage).

HA failover clusters are independent of GFS clusters, but they can co-exist and work together.



Clustering Advantages 5-4

• Flexibility
• Configurable node groupings for failover
• Additional failover nodes can be added on the fly
• Cost effective configurations
• Utilize excess capacity of other nodes
• Online resource management
• Services can be updated without shutting down
• Hardware can be managed without loss of service



Advanced Configuration and Power Interface (ACPI) 5-5

• Useful for managing power consumption


• Ability to "step down" idle CPUs
• DRAC, iLO, and IPMI
• Firmware version could be an issue
• A virtual power button press may be translated as "shutdown -h now"
• The cluster wants "power off NOW"
• Evicted nodes can't count on ACPI running properly
• Disable at command line
• service acpid stop
• chkconfig acpid off

• Disable at boot time


• acpi=off

ACPI was developed to overcome the deficiencies in APM. ACPI allows control of power management from
within the operating system.

Some BIOSes allow ACPI's behavior to be toggled as to whether it is a "soft" or "hard" power off. A hard
power off is preferred.

Integrated Lights-Out (iLO) is a vendor-specific autonomous management processor that resides on a


system board. Among its other functions, a cluster node with iLO can be power cycled or powered off
over a TCP/IP network connection, independent of the state of the host machine. Newer firmware versions
of iLO make a distinction between "press power button" and "hold power button", but older versions may
only have the equivalent of "press power button". Make sure the iLO fencing agent you are using properly
controls the power off so that it is immediate.

Other iLO-like integrated system management configurations that Red Hat supports in a clustered
environment are Intelligent Platform Management Interface (IPMI) and the Dell Remote Access Card
(DRAC).

The IPMI specification defines an operating-system-independent set of common interfaces to computer
hardware and firmware which system administrators can use to remotely (via a direct serial connection, a local
area network (LAN) or a serial over LAN (SOL) connection) monitor system health and manage the system.
Inclusive to its management functions, IPMI provides remote power control.

The DRAC has its own processor, memory, battery, network connection, and access to the system bus,
giving it the ability to provide power management and a remote console via a web browser.

Using the software-based ACPI mechanism isn't always reliable. For example, if a node has been evicted
from the cluster due to a kernel panic, it likely will be in a state that is unable to process the necessary
power cycle.
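
As a sketch of the boot-time method (the kernel version and root device shown are placeholders; adapt them
to the entry already present on your system), acpi=off is appended to the kernel line in /boot/grub/grub.conf:

title Red Hat Enterprise Linux Server (2.6.18-164.el5)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-164.el5 ro root=/dev/VolGroup00/LogVol00 acpi=off
        initrd /initrd-2.6.18-164.el5.img

The change takes effect at the next reboot; the service and chkconfig commands above disable acpid
immediately and on subsequent boots.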

Cluster Network Requirements 5-6

• Separate traffic by type:
• Cluster communication
• External application access
• Storage traffic (if using iSCSI)
• Administrative access
• Provide redundancy
• Ideally all network interfaces except iSCSI use channel bonding
• Use multipathing for iSCSI/Fibre Channel devices
• Switches must be interconnected
• Configure multicasting
• OpenAIS uses multicasting for cluster messages
• Check if multicasting is enabled on the switches
• The multicast address used can be configured

The primary reason for requiring at least two network interfaces for clustering is to separate cluster traffic
from all other network traffic. Cluster traffic consists of heartbeats and internode communication and
is normally confined to a private (local) network. Clusters are immensely dependent on their inter-node
communication (aka: heartbeats) for their integrity. If a node loses access to this network, it will be evicted
from the cluster (fenced) by other members.

Storage traffic can saturate even Gigabit links completely. This could prevent cluster messages from reaching
the other nodes in time, thus causing a timeout and a fencing action. Always separate iSCSI or FCoE traffic
from cluster communication.

Network outages due to broken cables, NIC or switch failures can easily be prevented by configuring
channel bonding (sometimes called EtherChannel or link aggregation). If multiple switches are used in
this scenario (recommended), make sure that these switches are interconnected (e.g. by crossover cable).
Often, additional switch configuration is necessary.
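
As a quick sanity check (a sketch; the interface names are placeholders for your private and public
interfaces), you can confirm that cluster heartbeat traffic stays on the private network by watching the
OpenAIS/cman UDP ports (5404/5405, listed later in this unit):

# tcpdump -ni eth1 udp port 5404 or udp port 5405     (cluster interface: traffic expected)
# tcpdump -ni eth0 udp port 5404 or udp port 5405     (public interface: should stay quiet)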

Broadcast versus Multicast 5-7

• Broadcast - send to all connected recipients
• e.g. Radio
• Multicast - send to designated recipients
• e.g. Conference call
• Not all hardware supports multicasting
• Multicast required for IPv6

[Diagram: one-to-many versus many-to-many communication]

Broadcasting is a one-to-all technique in which messages are sent to everybody. Internet routers block
broadcasts from propagating everywhere.

IP multicast allows one-to-many network communication, where a single source sends traffic to many
interested recipients. Multicast groups are identified by a single IP address on the 224.0.0.0/4 network.
Hosts may join or leave a multicast group at any time -- the sender may not restrict the recipient list.
Multicasts can allow more efficient communication since hosts uninterested in the traffic do not need to be
sent that traffic, unlike broadcasts, which are sent to all nodes on the network.

Multicasting is required for IPv6 because there is no broadcast.
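
If the default multicast address must be overridden (for example to fit a switch's multicast policy), it can be
set in cluster.conf. A minimal sketch, assuming the RHEL 5 cluster.conf schema; the address shown is just
an example from the administratively scoped range:

<cman>
    <multicast addr="239.192.100.1"/>
</cman>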

Ethernet Channel Bonding 5-8

• Highly available network interface
• Avoids single point of failure
• Aggregating bandwidth and load balancing are possible
• Many NICs can be bonded into a single virtual interface
• Plug each interface into different switches on the same network
• Network driver must be able to detect link
• Configuration steps:
• Load and configure the bonding module in /etc/modprobe.conf
• Configure the bond0 interface and its slave interfaces
• Check status in /proc/net/bonding/bond0
The Ethernet bonding driver can be used to provide a highly-available networking connection. More than
one network interface card (NIC) can be bonded into a single virtual interface. If, for example, two NICs
are plugged into different switches in the same broadcast domain, the interface will survive the failure of a
single switch, NIC, or cable connection.

Configuring Ethernet channel bonding is a two-step process: configure/load the bonding module in
/etc/modprobe.conf, and configure the master/slave bonding interfaces in /etc/sysconfig/network-scripts.

After networking is restarted, the current state of the bond0 interface can be found in
/proc/net/bonding/bond0.

A number of things can affect how fast failure recovery occurs, including traffic pattern, whether the active
interface was the one that failed, and the nature of the switching hardware. One of the strongest effects on
fail-over time is how long it takes the attached switches to expire their forwarding tables, which may take
many seconds.

Channel Bonding Configuration 5-9

• /etc/modprobe.conf

alias bond0 bonding

• Common bonding module options:
• mode=[0|balance-rr] - provides load balancing and fault tolerance (default)
• mode=[1|active-backup] - provides fault tolerance
• primary - specifies which slave is the primary device (e.g. eth0)
• use_carrier - how to determine link status
• miimon - link monitoring frequency in milliseconds

• Bonding interface configuration files (/etc/sysconfig/network-scripts)

ifcfg-bond0:                                    ifcfg-eth0:              ifcfg-eth1:
DEVICE=bond0                                    DEVICE=eth0              DEVICE=eth1
IPADDR=192.0.2.1                                MASTER=bond0             MASTER=bond0
NETMASK=255.255.255.0                           SLAVE=yes                SLAVE=yes
GATEWAY=192.0.2.254                             ONBOOT=yes               ONBOOT=yes
ONBOOT=yes                                      BOOTPROTO=static         BOOTPROTO=static
BOOTPROTO=static
BONDING_OPTS="mode=0 miimon=100 use_carrier=0"
The bonding module is configured in /etc/modprobe.conf to persist across reboots.

The default mode, mode=[0|balance-rr], sends packets sequentially through all slaves in a round-robin
fashion, evenly distributing the load.

mode=[1|active-backup] uses only one slave in the bond at a time (e.g. primary=eth0). active-backup mode
should work with any layer-2 switch. A different slave becomes active if, and only if, the active slave fails.

The miimon setting specifies how often, in milliseconds, the network interface is checked for link.

The use_carrier setting specifies how to check the link status; 1 works with drivers that support the
netif_carrier_ok() kernel function (the default), 0 works with any driver that works with mii-tool or
ethtool.

See Documentation/networking/bonding.txt in the kernel-doc RPM for additional modes and
information.
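
After the interfaces are brought up, the bond can be verified from the bonding driver's status file
(the output below is abbreviated and illustrative):

# service network restart
# cat /proc/net/bonding/bond0
Bonding Mode: load balancing (round-robin)
MII Status: up
MII Polling Interval (ms): 100
Slave Interface: eth0
MII Status: up
Slave Interface: eth1
MII Status: up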

Red Hat Cluster Suite Components 5-10

• Infrastructure
• ccsd - Cluster Configuration System
• aisexec - OpenAIS Cluster Framework
• cman - Cluster Manager: quorum, membership
• fenced - I/O Fencing
• DLM - Distributed Locking Manager
• Applications
• clvmd - Clustered Logical Volume Manager
• rgmanager - Cluster resource group manager
• GFS2 - Clustered File System
• Deployment
• luci/ricci - Conga project

The Infrastructure of Red Hat Cluster Suite is made from independent best-of-class components that
perform a specific function.

DLM (Distributed Locking Manager) is used to manage shared volume file locking between nodes for GFS.

dlm_controld is a daemon that connects to cman to manage the DLM groups. The command cman_tool
services shows these groups. dlm_controld listens for node up/down events from cman and requests to
create/destroy lockspaces. Those requests are passed into DLM, which is still in the kernel, via configfs.

lock_dlmd is a daemon that manages the interaction between the DLM and GFS.

clvmd is unchanged from RHEL4. libdlm is also unchanged, provided any applications using it are
dynamically linked.

aisexec is part of the OpenAIS cluster framework. This framework is an Open Source standard for cluster
communication. It provides messaging and encryption. This is the underlying framework for cman.

cman uses OpenAIS for cluster communication. It provides a configuration interface into CCS, a quorum disk
API, a mechanism for conditional shutdown, and functions for managing quorum.
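
On a running cluster member, the groups managed by these daemons can be listed with cman_tool
(illustrative output; group names and IDs will differ):

# cman_tool services
type             level name       id       state
fence            0     default    00010001 none
[1 2]
dlm              1     clvmd      00020001 none
[1 2]
dlm              1     rgmanager  00030001 none
[1 2]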

Security 5-11

• Cluster inter-node communications
• Multicast
• Encrypted by default
• OpenAIS
• Firewall must allow for ports used by the cluster and GFS

All inter-node communications are encrypted by default. OpenAIS uses the cluster name as the encryption
key. While not a good isolation strategy, it does make sure that clusters on the same multicast address/port
don't mistakenly interfere with each other and that there is some minimal form of encryption.

The following ports should be enabled for the corresponding service:

PORT NUMBER            SERVICE                         PROTOCOL
5404, 5405             cman                            udp
11111                  ricci                           tcp
14567                  gnbd                            tcp
16851                  modclusterd (part of Conga)     tcp
21064                  dlm                             tcp
50006, 50008, 50009    ccsd                            tcp
50007                  ccsd                            udp
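
As a sketch (assuming iptables is in use and eth1 is the cluster interface; adapt to your own firewall
policy), rules along these lines open the core cluster ports:

# iptables -I INPUT -i eth1 -p udp -m udp --dport 5404:5405 -j ACCEPT
# iptables -I INPUT -i eth1 -p tcp -m tcp --dport 11111 -j ACCEPT
# iptables -I INPUT -i eth1 -p tcp -m tcp --dport 21064 -j ACCEPT
# iptables -I INPUT -i eth1 -p tcp -m tcp --dport 50006:50009 -j ACCEPT
# iptables -I INPUT -i eth1 -p udp -m udp --dport 50007 -j ACCEPT
# service iptables save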

Cluster Configuration System (CCS) 5-12

• Daemon runs on each node in the cluster (ccsd)


• Provides cluster configuration info to all cluster components
• Configuration file:
• /etc/cluster/cluster.conf
• Stored in XML format
• cluster.conf(5)
• Finds most recent version among cluster nodes at startup
• Facilitates online (active cluster) reconfigurations
• Propagates updated file to other nodes
• Updates cluster manager's information

CCS consists of a daemon and a library. The daemon stores the XML file in memory and responds to
requests from the library (or other CCS daemons) to get cluster information. There are two operating
modes: quorate and non-quorate. Quorate operation ensures consistency of information among nodes.
Non-quorate mode connections are only allowed if forced. Updates to the CCS can only happen in quorate
mode.

If no cluster.conf exists at startup, a cluster node may grab the first one it hears about by a multicast
announcement.

The OpenAIS parser is a "plugin" that can be replaced at run time. The cman service that plugs into
OpenAIS provides its own configuration parser, ccsd. This means /etc/ais/openais.conf is not used
if cman is loaded into OpenAIS; ccsd is used for configuration, instead.
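
A minimal illustrative skeleton of /etc/cluster/cluster.conf (element and attribute names follow the RHEL 5
schema; all values are placeholders):

<?xml version="1.0"?>
<cluster name="cluster1" config_version="1">
    <clusternodes>
        <clusternode name="node1.cluster1.example.com" nodeid="1" votes="1">
            <fence/>
        </clusternode>
        <clusternode name="node2.cluster1.example.com" nodeid="2" votes="1">
            <fence/>
        </clusternode>
    </clusternodes>
    <fencedevices/>
    <rm/>
</cluster>

When the file is edited by hand, config_version must be incremented so that CCS can recognize and
propagate the newer version (for example with ccs_tool update /etc/cluster/cluster.conf).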

CMAN - Cluster Manager 5-13

• Main component of cluster suite
• Calculates quorum - an indication of the cluster's health
• Started by the cman SysV init script
• Also starts other components (ccsd, fenced, ...)
• Must be started in parallel on cluster members
• Uses /etc/cluster/cluster.conf
• All cluster applications (CLVM, GFS2, rgmanager, ...) require this service
• Uses the OpenAIS framework for communication

The cluster manager, an OpenAIS service, is the mechanism for configuring, controlling, querying, and
calculating quorum for the cluster. The cluster manager is configured via /etc/cluster/cluster.conf
(ccsd), and is responsible for the quorum disk API and functions for managing cluster quorum.
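
For example (output abbreviated and illustrative), the cluster manager is started through its init script and
queried with cman_tool:

# service cman start
# cman_tool status
...
Nodes: 2
Expected votes: 2
Total votes: 2
Quorum: 2
...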

Cluster Quorum 5-14

• Most cluster operations require the cluster to be "quorate"
• Generally more than half the nodes must be available
• Exact quorum requirements can be configured
• Prevents accidental resource usage by errant nodes (split-brain situation)
• If quorum is lost, no cluster resources may be started
• Two node clusters do not use this feature

CMAN keeps track of cluster quorum by monitoring the count of cluster nodes. This feature is only used for
clusters with more than two nodes. If more than half the nodes are active, the cluster has quorum. If half the
nodes (or fewer) are active, the cluster does not have quorum, and all cluster activity is stopped.

Cluster quorum prevents the occurrence of a "split-brain" condition — a condition where two instances of
the same cluster are running. A split-brain condition would allow each cluster instance to access cluster
resources without knowledge of the other cluster instance, resulting in corrupted cluster integrity.
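
The two-node special case is declared in cluster.conf so that a single surviving node can retain quorum
(a sketch following the RHEL 5 schema):

<cman two_node="1" expected_votes="1"/>

With this setting, fencing (covered later in this unit) is what protects the shared storage when the two
nodes lose contact with each other.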

OpenAIS 5-15

• A cluster manager
• Underlying Cluster Communication Framework
• Provides cluster membership and messaging foundation
• All components that can be in user space are in user space
• Allows closed process groups (libcpg)
• Advantages:
• Failures do not cause kernel crashes and are easier to debug
• Faster node failure detection
• Other OpenAIS services now possible
• Larger development community
• Advanced, well researched membership/messaging protocols
• Encrypted communication

OpenAIS has several subsystems that already provide membership/locking/events/communications
services and other features. In this sense, OpenAIS is a cluster manager in its own right. OpenAIS's core
messaging system is called "totem", and it provides reliable messaging with predictable delivery
ordering.

While standard OpenAIS callbacks are relative to the entire cluster for tasks such as message delivery
and configuration/membership changes, OpenAIS also allows for Closed Process Groups (libcpg) so
processes can join a closed group for callbacks that are relative to the group. For example, communication
can be limited to just host nodes that have a specific GFS filesystem mounted, currently using a DLM lock-
space, or a group of nodes that will fence each other.

The core of OpenAIS is the modular aisexec daemon, into which various services load. Because cman is
a service module that loads into aisexec, it can now take advantage of the OpenAIS's totem messaging
system. Another module that loads into aisexec is the CPG (Closed Process Groups) service, used to
manage trusted service partners.

cman, to some extent, still exists largely as a compatibility layer for existing cluster applications. A
configuration interface into CCS, quorum disk API, mechanism for conditional shutdown, and functions for
managing quorum are among its still-remaining tasks.

rgmanager - Resource Group Manager 5-16

• Useful for making off-the-shelf applications highly available
• Applications not required to be cluster-aware
• ...but may require configuration tweaks
• Warm/hot failovers often require application modification
• Uses a "virtual service" design
• Preferred nodes and/or restricted sets of nodes on which a service should run
• Simple dependency tree for services: only touch the affected parts
• Alter any part of a service: rgmanager will only restart the affected parts of the service
• If a part of a service fails, rgmanager will only restart the affected part

rgmanager provides "cold failover" (usually means "full application restad") for off-the-shelf applications
and does the "heavy lifting" involved in resource group/service failover. Services can take advantage of the
cluster's extensible resource script framework API, or simply use a SysV-style init script that accepts start,
stop, restart, and status arguments.

Without rgmanager, when a node running a service fails and is subsequently fenced, the service it was
running will be unavailable until that node comes back online.
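
Day-to-day service control is done with clustat and clusvcadm (a sketch; "webby" and the node name are
placeholders for a configured service and cluster member):

# clustat                                             (show members and service status)
# clusvcadm -e webby                                  (enable/start the service)
# clusvcadm -r webby -m node2.cluster1.example.com    (relocate it to another node)
# clusvcadm -d webby                                  (disable/stop the service)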

The Conga Project 5-17

• Unified management platform for easily building and managing clusters


• Web-based project with two components
• ricci - authentication component on each cluster node
• luci - centralized web management interface
• A single web interface for all cluster and storage management tasks
• Automated deployment of cluster data and supporting packages
• Cluster configuration
• RPMs
• Easy integration with existing clusters
• Integration of cluster status and logs
• Fine-grained control over user permissions
Users frequently commented that while they found value in the GUI interfaces provided for cluster
configuration, they did not routinely install X and Gtk libraries on their production servers. Conga solves this
problem by providing an agent that is resident on the production servers and is managed through a web
interface, but the GUI is located on a machine more suited for the task.

luci and ricci interact as follows:

Conga is available in versions equal to or newer than Red Hat Cluster Suite 4 Update 5 and Red Hat
Cluster Suite 5.

The elements of this architecture are:

luci is an application server which serves as a central point for managing one or more clusters, and
cannot run on one of the cluster nodes. luci is ideally a machine with X already loaded and with network
connectivity to the cluster nodes. luci maintains a database of node and user information. Once a system
running ricci authenticates with a luci server, it will never have to re-authenticate unless the certificate
used is revoked. There will typically be only one luci server for any and all clusters, though that doesn't
have to be the case.

ricci is an agent that is installed on all servers being managed.

Web Client is typically a Browser, like Firefox, running on a machine in your network.

The interaction is as follows. Your web client securely logs into the luci server. Using the web interface,
the administrator issues commands which are then forwarded to the ricci agents on the nodes being
managed.

luci 5-18

• Web interface for cluster management
• Create new clusters or import old configuration
• Can create users and determine what privileges they have
• Can grow an online cluster by adding new systems
• Only have to authenticate a remote system once
• Node fencing
• View system logs for each node

Conga is an agent/server architecture for remote administration of systems. The agent component is called
ricci, and the server is called luci. One luci server can communicate with many ricci agents installed
on systems.

ricci 5-19

• An agent that runs on any cluster node to be administered by luci
• One-time certificate authentication with luci
• All communication between luci and ricci is via XML

When a system is added to a luci server to be administered, authentication is done once. No authentication
is necessary from then on (unless the certificate used is revoked by a CA). Through the UI provided by
luci, users can configure and administer storage and cluster behavior on remote systems. Communication
between luci and ricci is done via XML.

Deploying Conga 5-20

• Install luci on the management node
• Install and start the ricci service on the cluster nodes
• Initialize luci

# luci_admin init
# service luci restart

• https://localhost:8084/
• Ensure that luci starts up automatically

# chkconfig luci on

• Log in to luci and configure the cluster

Once luci is installed, its database must be initialized and an admin account must be set up. These tasks
are accomplished with the luci_admin command line utility. After this, the luci service should be started
(and configured to persist a reboot):

# service luci start
Starting luci:                                             [  OK  ]

Please, point your web browser to https://myhostname:8084 to access luci

Now luci can be logged into for cluster configuration and deployment.

Other useful luci_admin commands for troubleshooting/repairing luci (all require that the luci service be
stopped):

Command                  Description
luci_admin password      Change the admin user's password
luci_admin backup        Backs up the luci config to an XML file: /var/lib/luci/var/luci_backup.xml
luci_admin restore       Restores the luci config from an XML file

luci Deployment Interface 5-21

[Screenshot: the luci cluster management interface, listing each cluster node with its name, its status
(e.g. "Cluster Member"), and the services currently running on it]

Clustered Logical Volume Manager (CLVM) 5-22

• CLVM is the clustered version of LVM2
• Aims to provide the same functionality as single-machine LVM
• Provides for storage virtualization
• Based on LVM2
• Device mapper (kernel)
• LVM2 tools (user space)

CLVM is required for GFS. Without it, any changes to a shared logical volume on one cluster node would
go unrecognized by the other cluster nodes.

To configure CLVM, the locking type must be changed to 3.

# lvm dumpconfig | grep locking_type
  locking_type=1
# lvmconf --enable-cluster
# lvm dumpconfig | grep locking_type
  locking_type=3
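
In addition to the locking type, the clvmd service must be running on every cluster node (a sketch; cman
must already be started on the node):

# service clvmd start
# chkconfig clvmd on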

Distributed Lock Manager (DLM) 5-23

• Lockspace for resources
• When nodes join/leave the lockspace, recovery is done
• A resource is a named object that can be locked
• One node in the lockspace is "master" of the resource
• Other nodes need to contact this node to lock the resource
• First node to take a lock on a resource becomes its master (when using a resource directory)
• Resource directory says which node is the master of a resource
• Divided across all nodes, rebuilt during recovery
• Node weighting

DLM (Distributed Lock Manager) is the only supported lock management provided in Red Hat Cluster Suite.
DLM provides a good performance and reliability profile.

In previous versions, GULM was promoted for use with node counts over 32 and special configurations with
Oracle RAC. In RHEL5, the scalability issues of DLM beyond 32 nodes have been addressed. Furthermore,
DLM nodes can be configured as dedicated lock managers in high lock traffic configurations, making GULM
redundant.

Fencing 5-24

• Fencing separates a cluster node from its storage


• Power fencing
• Fabric fencing
• Fencing is necessary to prevent corruption of resources
• Fencing is required for a supportable configuration
• Watchdog timers and manual fencing are NOT supported

Fencing is the act of immediately and physically separating a cluster node from its storage to prevent the
node from continuing any form of I/O whatsoever.

A cluster must be able to guarantee a fencing action against a cluster node that loses contact with the other
nodes in the cluster, and is therefore no longer working cooperatively with them.

Without fencing, an errant node could continue I/O to the storage device, totally unaware of the I/O from
other nodes, resulting in corruption of a shared filesystem.
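
Once fencing has been configured, it can be exercised manually from a cluster member to prove that it
works (a sketch; the node name is a placeholder, and the victim node really will be fenced):

# fence_node node2.cluster1.example.com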

End of Lecture 5

• Questions and Answers


• Summary
• Red Hat Cluster Suite

Lab 5.1: Building a Cluster with Conga
Scenario: We will use the workstation as the luci deployment node to create a cluster
from nodes 1 and 2. We will configure an Apache web server resource
group on the cluster nodes that accesses a shared ext3 filesystem for our
DocumentRoot.

Instructions:

1. Recreate node1 and node2 if necessary with the rebuild-cluster tool.

2. It is best practice to put the cluster traffic on a private network. For this purpose eth1 of your
virtual machines is connected to a private bridge named cluster on your workstation.

Cluster suite picks the network that is associated with the hostname as its cluster
communication network. It is considered best practice to use a separate private network for that.

Configure the hostname of both virtual machines so that it points to
nodeN.clusterX.example.com (replace N with the node number and X with your
cluster number).

Make sure that the setting is persistent.

3. Make sure that the iSCSI target is available on both nodes. You can use
/root/RH436/HelpfulFiles/setup-initiator -bl.

4. From any node in the cluster, delete any pre-existing partitions on our shared storage (the
/root/RH436/HelpfulFiles/wipe_sda script makes this easy), then make sure the OS
on each node has its partition table updated using the partprobe command.

5. Install the luci RPM on your workstation and the ricci and httpd RPMs on node1 and
node2 of your assigned cluster.

6. Start the ricci service on node1 and node2, and configure it to start on boot.

7. Initialize the luci service on your workstation and create an administrative user named
admin with a password of redhat.

8. Restart luci (and configure to persist a reboot) and open the web page the command output
suggests. Use the web browser on your local classroom machine to access the web page.

9. Log in to luci using admin as the Login Name and redhat as the Password.

10. From the "Luci Homebase" page, select the cluster tab near the top and then select "Create
a New Cluster" from the left sidebar. Enter a cluster name of clusterX, where X is
your assigned cluster number. Enter the fully-qualified name for your two cluster nodes
(nodeN.clusterX.example.com) and the password for the root user on each. Make

sure that "Download packages" is pre-selected, then select the "Check if node passwords are
identical" option. All other options can be left as-is. Do not click the Submit button yet!
11. Before submitting the node information to luci and beginning the Install, Reboot, Configure,
and Join phases, open a console window to node1 and node2, so you can monitor each node's
progress. Once you have completed the previous step and have prepared your consoles, click
the Submit button to send your configuration to the cluster nodes.

12. Once luci has completed (once all four circles have been filled-in in the luci interface), you
will be automatically re-directed to a General Properties page for your cluster.

Select the Fence tab.

In the XVM fence daemon key distribution section, enter dom0.clusterX.example.com
in the first box (node hostname from the host cluster) and
node1.clusterX.example.com in the second box (node hostname from the hosted
(virtual) cluster). Click on the Retrieve cluster nodes button.

At the next screen, in the same section, make sure both cluster nodes are selected and click on
the Create and distribute keys button.
13. From the left-hand menu select Failover Domains, then select Add a Failover Domain.

In the "Add a Failover Domain" window, enter pre f er_nodel as the "Failover Domain
Name". Select the Prioritized and Restrict failover to this domain's members boxes.

In the "Failover domain membership" section, make sure both nodes are selected as members,
and that nodel has a priority of 1 and node2 has a priority of 2 (lower priority).

Click the Submit button when finished.

14. We must now configure fencing (the ability of the cluster to quickly and absolutely
remove a node from the cluster). Fencing will be performed by your workstation
(dom0.clusterX.example.com), as this is the only node that can execute the xm destroy
<node_name> command necessary to perform the fencing action.

First, create a shared fence device that will be used by all cluster nodes. From the left-hand
menu select Shared Fence Devices, then select Add a Fence Device. In the Fencing Type drop-
down menu, select Virtual Machine Fencing. Choose the name xenfenceX (where X is your
cluster number) and click the Add this shared fence device button.

15. Second, we associate each node with our shared fence device.

From the left-hand menu select Nodes. From the lower left area of the first node in luci's
main window (node1) select Manage Fencing for this Node. Scroll to the bottom, and in the
Main Fencing Method section, click the Add fence device to this level link. In the drop-down
menu, select xenfenceX (Virtual Machine Fencing). In the Domain box, type node1 (the
name that would be used in the command: xm destroy <node_name> to fence the node), then
click the Update main fence properties button at the bottom.

Repeat the process for each node in the cluster (using the appropriate node name for each in the
Domain box).

16. To complete the fencing setup, we need to run fence_xvmd on your workstation. First, install the
cman packages on your workstation, but do not start the cman service.

stationX# yum -y install cman

Second, copy /etc/cluster/fence_xvm.key from one of the cluster nodes to /etc/cluster
on stationX.

Note:

If the fence key was not created automatically by the GUI, it is possible to create one
manually:

# dd if=/dev/urandom of=/etc/cluster/fence_xvm.key bs=4k count=1

Third, add the command /sbin/fence_xvmd -L -I cluster to /etc/rc.local and execute
rc.local. This starts the fence daemon without a running cluster (-L) and lets it listen on the
cluster bridge (-I cluster).

17. Before we add our resources to luci, we need to make sure one of them is in place: a partition
we will use for an Apache Web Server DocumentRoot filesystem.

From a terminal window connected to node1, create an ext3-formatted 100MiB partition
on the /dev/sda shared storage volume. Make sure it is recognized by both node1 and
node2, and run the partprobe command, if not. Temporarily mount it and place a file named
index.html in it with permissions mode 0644 and contents "Hello". Unmount the partition
when finished, and do not place any entries for it in /etc/fstab.

18. Next we build our clustered service by first creating the resources that make it up. Back in the
luci interface window, select Add a Resource, then from the Select a Resource Type menu,
select IP Address.

Choose 172.16.50.X6 for the IP address and make sure the Monitor link box is selected.
Click the Submit button when finished.

19. Select Add a Resource from the left-hand-side menu, and from the drop-down menu select File
system.

Enter the following parameters:

Name: docroot
File system type: ext3
Mount point: /var/www/html
Device: /dev/sdal

All other parameters can be left at their default. Click the Submit button when finished.

20. Once more, select Add a Resource from the left-hand-side menu, and from the drop-down menu
select Apache.

Choose httpd for the Name. Set Shutdown Wait to 5 seconds. This parameter defines how
long stopping the service may take before Cluster Suite declares it failed. Click the Submit
button when finished.

21. Now we collect together our three resources to create a functional web server service. From the
left-hand-side menu, select Services, then Add a Service.

Choose webby for the Service Name, prefer_node1 as the Failover Domain, and a
Recovery Policy of Relocate. Leave all other options at their defaults. Click the Add a
resource to this service button when finished.

Under the Use an existing global resource drop-down menu, choose the previously-created IP
Address resource, then click the Add a resource to this service button again.

Under the Use an existing global resource drop-down menu, choose the previously-created File
System resource, then click the Add a resource to this service button again.

Finally, under the Use an existing global resource drop-down menu, choose the previously-
created Apache Server resource. When ready, click the Submit button at the bottom of the
window.

If you want webby to start automatically, set the autostart option.

22. From the left-hand menu, select Cluster List. Notice the brief description of the cluster just
created, including services, nodes, and status of the cluster service, indicated by the color of
the cluster name. A green-colored name indicates the cluster service is functioning properly. If
your cluster name is colored red, wait a minute and refresh the information by selecting Cluster
List from the left-hand side menu, again. The service should autostart (an option in the service
configuration window). If it remains a red color, that may indicate a problem with your cluster
configuration.

23. Verify the web server is working properly by pointing a web browser on your local workstation
to the URL: http://172.16.50.X6/index.html or running the command:

local# elinks -dump http://172.16.50.X6/index.html

Verify the virtual IP address and cluster status with the following commands:

node1# ip addr list

node1,2# clustat
24. If the previous step was successful, try to relocate the service using the luci interface onto the
other node in the cluster, and verify it worked.

25. While continuously monitoring the cluster service status from node1, reboot node2 and
watch the state of webby.

Lab 5.1 Solutions
1. Recreate node1 and node2 if necessary with the rebuild-cluster tool.

2. It is best practice to put the cluster traffic on a private network. For this purpose eth1 of your
virtual machines is connected to a private bridge named cluster on your workstation.

Cluster suite picks the network that is associated with the hostname as its cluster
communication network. It is considered best practice to use a separate private network for that.

Configure the hostname of both virtual machines so that it points to
nodeN.clusterX.example.com (replace N with the node number and X with your
cluster number).

Make sure that the setting is persistent. Either edit /etc/sysconfig/network manually or
use the following perl statement:

cXn1# perl -pi -e "s/HOSTNAME=.*/HOSTNAME=nodeN.clusterX.example.com/" /etc/sysconfig/network
cXn1# hostname nodeN.clusterX.example.com

Repeat for node2.

3. Make sure that the iSCSI target is available on both nodes. You can use
/root/RH436/HelpfulFiles/setup-initiator -bl.

cXn1# /root/RH436/HelpfulFiles/setup-initiator -bl

cXn2# /root/RH436/HelpfulFiles/setup-initiator -bl

4. From any node in the cluster, delete any pre-existing partitions on our shared storage (the
/root/RH436/HelpfulFiles/wipe_sda script makes this easy), then make sure the OS
on each node has its partition table updated using the partprobe command.

node1# /root/RH436/HelpfulFiles/wipe_sda

node1,2# partprobe /dev/sda

5. Install the luci RPM on your workstation and the ricci and httpd RPMs on node1 and
node2 of your assigned cluster.

stationX# yum -y install luci

node1,2# yum -y install ricci httpd

6. Start the ricci service on node1 and node2, and configure it to start on boot.

node1,2# service ricci start

node1,2# chkconfig ricci on

7. Initialize the luci service on your workstation and create an administrative user named
admin with a password of redhat.
stationX# luci_admin init

8. Restart luci (and configure to persist a reboot) and open the web page the command output
suggests. Use the web browser on your local classroom machine to access the web page.

stationX# chkconfig luci on; service luci restart

Open https://stationX.example.com:8084/ in a web browser, where X is your
cluster number.

(If presented with a window asking if you wish to accept the certificate, click the 'OK' button)

9. Log in to luci using admin as the Login Name and redhat as the Password.

10. From the "Luci Homebase" page, select the cluster tab near the top and then select "Create
a New Cluster" from the left sidebar. Enter a cluster name of clusterX, where X is
your assigned cluster number. Enter the fully-qualified name for your two cluster nodes
(nodeN.clusterX.example.com) and the password for the root user on each. Make
sure that "Download packages" is pre-selected, then select the "Check if node passwords are
identical" option. All other options can be left as-is. Do not click the Submit button yet!

node1.clusterX.example.com    redhat
node2.clusterX.example.com redhat

11. Before submitting the node information to luci and beginning the Install, Reboot, Configure,
and Join phases, open a console window to node1 and node2, so you can monitor each node's
progress. Once you have completed the previous step and have prepared your consoles, click
the Submit button to send your configuration to the cluster nodes.

stationX# xm console node1

stationX# xm console node2

12. Once luci has completed (once all four circles have been filled-in in the luci interface), you
will be automatically re-directed to a General Properties page for your cluster.

Select the Fence tab.

In the XVM fence daemon key distribution section, enter dom0.clusterX.example.com
in the first box (node hostname from the host cluster) and
node1.clusterX.example.com in the second box (node hostname from the hosted
(virtual) cluster). Click on the Retrieve cluster nodes button.

At the next screen, in the same section, make sure both cluster nodes are selected and click on
the Create and distribute keys button.

13. From the left-hand menu select Failover Domains, then select Add a Failover Domain.

In the "Add a Failover Domain" window, enter prefer_nodel as the "Failover Domain
Name". Select the Prioritized and Restrict failover to this domain's members boxes.

In the "Failover domain membership" section, make sure both nodes are selected as members,
and that node1 has a priority of 1 and node2 has a priority of 2 (lower priority).

Click the Submit button when finished.

14. We must now configure fencing (the ability of the cluster to quickly and absolutely
remove a node from the cluster). Fencing will be performed by your workstation
(dom0.clusterX.example.com), as this is the only node that can execute the xm destroy
<node_name> command necessary to perform the fencing action.

First, create a shared fence device that will be used by all cluster nodes. From the left-hand
menu select Shared Fence Devices, then select Add a Fence Device. In the Fencing Type drop-
down menu, select Virtual Machine Fencing. Choose the name xenfenceX (where X is your
cluster number) and click the Add this shared fence device button.

15. Second, we associate each node with our shared fence device.

From the left-hand menu select Nodes. From the lower left area of the first node in luci's
main window (node1) select Manage Fencing for this Node. Scroll to the bottom, and in the
Main Fencing Method section, click the Add fence device to this level link. In the drop-down
menu, select xenfenceX (Virtual Machine Fencing). In the Domain box, type node1 (the
name that would be used in the command: xm destroy <node_name> to fence the node), then
click the Update main fence properties button at the bottom.

Repeat the process for each node in the cluster (using the appropriate node name for each in the
Domain box).

16. To complete the fencing setup, we need to configure your workstation as a simple single-node
cluster with the same fence_xvm.key as the cluster nodes. Complete the following three
steps:

First, install the cman packages on your workstation, but do not start the cman service yet.

Second, copy /etc/cluster/fence_xvm.key from one of the cluster nodes to /etc/cluster
on stationX.

stationX# scp node1:/etc/cluster/fence_xvm.key /etc/cluster

Note:

If the fence key was not created automatically by the GUI, it is possible to create one
manually:

# dd if=/dev/urandom of=/etc/cluster/fence_xvm.key bs=4k count=1

Third, add the command /sbin/fence_xvmd -L -I cluster to /etc/rc.local and execute
rc.local. This starts the fence daemon without a running cluster (-L) and lets it listen on the
cluster bridge (-I cluster).

stationX# echo '/sbin/fence_xvmd -L -I cluster' >> /etc/rc.local

stationX# /etc/rc.local

17. Before we add our resources to luci, we need to make sure one of them is in place: a partition
we will use for an Apache Web Server DocumentRoot filesystem.

From a terminal window connected to node1, create an ext3-formatted 100MiB partition
on the /dev/sda shared storage volume. Make sure it is recognized by both node1 and
node2, and run the partprobe command, if not. Temporarily mount it and place a file named
index.html in it with permissions mode 0644 and contents "Hello". Unmount the partition
when finished, and do not place any entries for it in /etc/fstab.

node1# fdisk /dev/sda    (size=+100M, /dev/sda1; this partition may differ on your machine)
node1,2# partprobe /dev/sda
node1# mkfs -t ext3 /dev/sda1
node1# mount /dev/sda1 /mnt
node1# echo "Hello" > /mnt/index.html
node1# chmod 644 /mnt/index.html
node1# umount /mnt

18. Next we build our clustered service by first creating the resources that make it up. Back in the
luci interface window, select Add a Resource, then from the Select a Resource Type menu,
select IP Address.

Choose 172.16.50.X6 for the IP address and make sure the Monitor link box is selected.
Click the Submit button when finished.

19. Select Add a Resource from the left-hand-side menu, and from the drop-down menu select File
system.

Enter the following parameters:

Name: docroot
File system type: ext3
Mount point: /var/www/html
Device: /dev/sdal

All other parameters can be left at their default. Click the Submit button when finished.

20. Once more, select Add a Resource from the left-hand-side menu, and from the drop-down menu
select Apache.

Choose httpd for the Name. Set Shutdown Wait to 5 seconds. This parameter defines how
long stopping the service may take before Cluster Suite declares it failed. Click the Submit
button when finished.

21. Now we collect together our three resources to create a functional web server service. From the
left-hand-side menu, select Services, then Add a Service.

Choose webby for the Service Name, prefer_node1 as the Failover Domain, and a
Recovery Policy of Relocate. Leave all other options at their defaults. Click the Add a
resource to this service button when finished.

Under the Use an existing global resource drop-down menu, choose the previously-created IP
Address resource, then click the Add a resource to this service button again.
Under the Use an existing global resource drop-down menu, choose the previously-created File
System resource, then click the Add a resource to this service button again.

Finally, under the Use an existing global resource drop-down menu, choose the previously-
created Apache Server resource. When ready, click the Submit button at the bottom of the
window.

If you want webby to start automatically, set the autostart option.

22. From the left-hand menu, select Cluster List. Notice the brief description of the cluster just
created, including services, nodes, and status of the cluster service, indicated by the color of
the cluster name. A green-colored name indicates the cluster service is functioning properly. If
your cluster name is colored red, wait a minute and refresh the information by selecting Cluster
List from the left-hand side menu, again. The service should autostart (an option in the service
configuration window). If it remains a red color, that may indicate a problem with your cluster
configuration.

23. Verify the web server is working properly by pointing a web browser on your local workstation
to the URL: http://172.16.50.X6/index.html or running the command:

local# elinks -dump http://172.16.50.X6/index.html

Verify the virtual IP address and cluster status with the following commands:

node1# ip addr list

node1,2# clustat

24. If the previous step was successful, try to relocate the service using the luci interface onto the
other node in the cluster, and verify it worked (you may need to refresh the luci status screen
to see the service name change from the red to green color, otherwise you can continuously
monitor the service status with the clustat -i 1 command from one of the node terminal
windows).

Cluster List --> clusterX --> Services --> Choose a Task... -->
Relocate this service to node2.clusterX.example.com --> Go

Note: the service can also be manually relocated using the command:

node1# clusvcadm -r webby -m node2.clusterX.example.com

from any active node in the cluster.

25. While continuously monitoring the cluster service status from node1, reboot node2 and
watch the state of webby.

From one terminal window on node1:

node1# clustat -i 1

From another terminal window on node1:

node1# tail -f /var/log/messages

Lecture 6

Logical Volume Management

Upon completion of this unit, you should be able to:


• Understand advanced LVM topics
• Move and rename volume groups
• Set up clustered logical volumes

An LVM2 Review 6-1

• Review of LVM2 layers:

[Diagram: the LVM2 stack - block devices at the bottom are initialized as physical volumes (pvcreate),
grouped into a volume group (vgcreate), from which logical volumes are created (lvcreate)]

LVM2 - Physical Volumes and Volume Groups 6-2

• Creating a physical volume (PV) initializes a whole disk or a partition for use in a
logical volume
• pvcreate /dev/sda5 /dev/sdb
• Using the space of one or more PVs, create a volume group (VG) named vg0
• vgcreate vg0 /dev/sda5 /dev/sdb
• Display information
• pvdisplay, pvs, pvscan
• vgdisplay, vgs, vgscan

Whole disk devices or just a partition can be turned into a physical volume (PV), which is really just a way of
initializing the space for later use in a logical volume. If converting a partition into a physical volume, first set
its partition type to LVM (8e) within a partitioning tool like fdisk. Whole disk devices must have their partition
table wiped by zeroing out the first sector of the device (dd if=/dev/zero of=<physical volume> bs=512
count=1). Up to 2^32 PVs can be created in LVM2.

One or more PVs can be used to create a volume group (VG). When PVs are used to create a VG, its
disk space is "quantized" into 4MB extents, by default. This extent is the minimum amount by which the
logical volume (LV) may be increased or decreased in size. In LVM2, there is no restriction on the number
of allowable extents, and large numbers of them will have no impact on the I/O performance of the LV. The only
downside (if it can be considered one) to a large number of extents is that it will slow down the tools.

The following commands display useful PV/VG information in a brief format:

# pvscan
PV /dev/sdb2 VG vg0 lvm2 [964.00 MB / 0 free]
PV /dev/sdc1 VG vg0 lvm2 [964.00 MB / 428.00 MB free]
PV /dev/sdc2 lvm2 [964.84 MB]
Total: 3 [2.83 GB] / in use: 2 [1.88 GB] / in no VG: 1 [964.84 MB]
# pvs -o pv_name,pv_size -O pv_free
PV PSize
/dev/sdb2 964.00M
/dev/sdc1 964.00M
/dev/sdc2 964.84M
# vgs -o vg_name,vg_uuid -O vg_size
VG VG UUID
vg0 18IoBt-hAFn-lUsj-dai2-UGry-Ymgz-w6AfD7



LVM2 - Creating a Logical Volume 6-3

• From VG vg0's free extents, "carve" out a 50GB logical volume (LV) named gfslv:
• lvcreate -L 50G -n gfslv vg0

• Create a striped LV across 2 PVs with a stripe size of 64kB:
• lvcreate -L 50G -i2 -I64 -n gfslv vg0

• Allocate space for the LV from a specific PV in the VG:
• lvcreate -L 50G -n gfslv vg0 /dev/sdb

• Display LV information
• lvdisplay, lvs, lvscan

One or more LVs are then "carved" from a VG according to need, using the VG's free physical extents.

Data in an LV is not written contiguously by default; it is written using a "next free" principle. This can be
overridden with the -C option to lvcreate.

Striping offers a performance enhancement by writing to a predetermined number of physical volumes in
round-robin fashion. Theoretically, with proper hardware configuration, I/O can be done in parallel, resulting
in a near-linear performance gain for each additional physical volume in the stripe. The stripe size used
should be tuned to a power of 2 between 4kB and 512kB, and matched to the I/O of the application that is using
the striped volume. The -I option to lvcreate specifies the stripe size in kilobytes.

The underlying PVs used to create an LV can be important if the PV needs to be removed, so careful
consideration may be necessary at LV creation time. Removing a PV from a VG (vgreduce) has the side
effect of removing any LV using physical extents from the removed PV.

vgreduce vg0 /dev/sdb
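If the data on that PV must be preserved, one possible approach (not part of the slide above, shown only as a sketch) is to migrate its extents to the remaining PVs with pvmove before reducing the VG:

# pvmove /dev/sdb          # relocate any allocated extents to other PVs in vg0
# vgreduce vg0 /dev/sdb    # the emptied PV can now be removed safely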

Up to 2^32 LVs can be created in LVM2.

The following commands display useful LV information in a brief format:

# lvscan
ACTIVE            '/dev/vg0/gfslv' [1.46 GB] inherit
# lvs -o lv_name,lv_attr -O -lv_name
LV    Attr
gfslv -wi-ao



Files and Directories Used by LVM2 6-4

• /etc/lvm/lvm.conf
• Central configuration file read by the tools
• /etc/lvm/cache/.cache
• Device name filter cache file
• /etc/lvm/backup/
• Directory for automatic VG metadata backups
• /etc/lvm/archive/
• Directory for automatic VG metadata archives
• /var/lock/lvm
• Lock files to prevent parallel tool runs from corrupting the metadata

Understanding the purpose of these files and their contents can help troubleshoot and/or fix most common
LVM2 issues.

To view a summary of LVM configuration information after loading lvm.conf(8) and any other configuration
files:

lvm dumpconfig

To scan the system looking for LVM physical volumes on all devices visible to LVM2:

lvmdiskscan



Changing LVM options 6-5

• pvchange
• changing allocation permission on physical volumes
• disable allocation on a physical volume: pvchange -x n device
• allow allocation on all physical volumes (default): pvchange -ax y
• vgchange
• mainly for activating/deactivating volume groups
• activation: vgchange -ay vgname
• deactivation: vgchange -an vgname
• lvchange
• used for controlling visibility of the logical volume to the kernel
• activation: lvchange -ay lvname
• deactivation: lvchange -an lvname
• marking a volume read-only: lvchange -p r lvname
• marking it read-write again: lvchange -p rw lvname
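A short worked example tying these commands together; the vg0 and gfslv names follow the earlier lecture examples and the sequence is illustrative only:

# pvchange -x n /dev/sdb         # stop allocating new extents on /dev/sdb
# lvchange -p r /dev/vg0/gfslv   # make the LV read-only
# lvchange -p rw /dev/vg0/gfslv  # make it writable again
# vgchange -an vg0               # deactivate the whole volume group
# vgchange -ay vg0               # reactivate it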



Moving a volume group to another host 6-6

• Removal:
• unmount the filesystems on the "old" machine
• mark the volume group inactive: vgchange -an vgname
• export the volume group: vgexport vgname
• shut down the box and remove the drive(s)

• Adding:
• connect the drive(s) to the new machine
• pvscan should show the new physical volumes
• import the volume group: vgimport vgname
• activate the volume group: vgchange -ay vgname
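Put together, the procedure might look like the following sketch; the old#/new# prompts, the vgname volume group, and the /data mount point are placeholders, not values from the labs:

old# umount /data                # unmount everything using the volume group
old# vgchange -an vgname         # deactivate the volume group
old# vgexport vgname             # mark it exported
# (shut down the old box and move the drives to the new machine)
new# pvscan                      # the PVs should show up, flagged as exported
new# vgimport vgname             # import the volume group
new# vgchange -ay vgname         # activate it
new# mount /dev/vgname/lvname /data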



Clustered Logical Volume Manager (CLVM) 6-7

• CLVM is the clustered version of LVM2


• Aims to provide the same functionality of single-machine LVM
• Provides for storage virtualization
• Based on LVM2
• Device mapper (kernel)
• LVM2 tools (user space)
• Relies on a cluster infrastructure
• Used to coordinate logical volume changes between nodes

• CLVMD allows LV metadata changes only if the following conditions are true:
• All nodes in the cluster are running
• Cluster is quorate

To change between a CLVMD-managed (clustered) LV and an "ordinary" LV, it's as simple as modifying the
locking_type specified in LVM2's configuration file (/etc/lvm/lvm.conf).



CLVM Configuration 6-8

• Requires changes to /etc/lvm/lvm.conf


• lvmconf --enable-cluster
• locking_type = 3

• When using multipathing, set preferred_names


• Setup logical volumes
• Manually, via pvcreate/vgcreate/lvcreate

• GUI (system-config-lvm)

• Start CLVM on all nodes (service clvmd start)

The cluster-aware version of system-config-lvm is available since the RHEL4 Update 3 release.

There are three locking types to choose from in LVM2: locking_type = 1 (stand-alone node locking),
locking_type = 2 (uses external locking), and locking_type = 3 (built-in cluster-wide locking). GFS requires
CLVM to be using locking type 3.

To modify the locking_type parameter, manually edit the file, or run the command:

/usr/sbin/lvmconf {--disable-cluster,--enable-cluster}

When using multipathed devices for CLVM, you need to make sure that LVM uses the mpath devices and
not the individual partitions. By default, LVM only looks for "user-friendly" devices (/dev/mpath{0,1,..}).
If you use WWN-based names instead, you must adjust the preferred_names parameter in /etc/lvm/
lvm.conf.
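For illustration, the relevant sections of /etc/lvm/lvm.conf might then look something like the lines below; the exact regular expressions are only an example, not a required configuration:

global {
    locking_type = 3
}
devices {
    preferred_names = [ "^/dev/mpath/", "^/dev/mapper/mpath", "^/dev/[hs]d" ]
}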



End of Lecture 6

• Questions and Answers


• Summary
• Describe LVM
• Setup and troubleshoot LVM
• Configure CLVM



Lab 6.1: Configure the Clustered Logical Volume Manager
Instructions:

1. Disable the webby service so that we can change its filesystem from ext3 to gfs2 (the GFS2
filesystem will be created in the next lab).

2. From luci's interface, select the storage tab near the top and then select your cluster's first
node (node1.clusterX.example.com) from the left-hand side "System List" menu.

3. Select the "sda" link from the Partition Tables section of the window, then click on the
Remove button.

4. Select New Partition Table from the Partition Tables section. Create a new msdos partition
table on device /dev/sda.

5. Create a Physical Volume that uses only 4 GB of the unused space of the sda device. The
PV should not be assigned to a volume group yet.

6. Create a new clustered volume group named ClusterVG that uses the newly added physical
volume.

7. Select the other cluster nodes in the system list and make sure that the newly created
ClusterVG volume group shows up. If it doesn't, select Reprobe Storage.

Lab 6.1 Solutions
1. Disable the webby service so that we can change its filesystem from ext3 to gfs2 (the GFS2
filesystem will be created in the next lab).

node1# clusvcadm -d webby

2. From luci's interface, select the storage tab near the top and then select your cluster's first
node (node1.clusterX.example.com) from the left-hand side "System List" menu.

3. Select the "sda" link from the Partition Tables section from the window, then click on the
Remove button.

4. Select New Partition Table from the Partition Tables section. Create a new msdos partition
table on device /dev/sda

Set the label to msdos and mark the disk /dev/sda, then click Create

5. Create a Physical Volume that uses 4 GB only of the unused space of the sda device. The
PV should not be assigned to a volume group yet.

Select the sda partition table, then click on the Unused Space area of the physical partition
graphic.

Leave Size and Partition Type unchanged. Choose Physical Volume and change the
Volume Group Name to empty.

Click on Create and confirm.

6. Create a new clustered volume group named ClusterVG that uses the newly added physical
volume.

Click on Volume Groups/New Volume Group. Change the Volume Group Name to
ClusterVG. Select the Physical Volume /dev/sdal. Leave the other values at their
defaults.

You should now see the new volume group in the list. Notice that the Clustered flag is set to
true.

7. Select the other cluster nodes in the system list and make sure that the newly created
ClusterVG volume group shows up. If it doesn't, select Reprobe Storage.



Lecture 7

Global File System 2

Upon completion of this unit, you should be able to:


• Describe GFS2
• create a clustered GFS2 filesystem
• maintain GFS2



Global File System 2 7-1
• Symmetric, shared-disk, cluster file system
• Relies on the cluster storage infrastructure
• Guarantees data consistency through:
• data/metadata journaling
• inter-machine locking
• fencing of errant nodes
• 64-bit file system, can store up to 8 EB (exabytes)
• POSIX compliant
• Supports ACLs, SELinux contexts
• Online management

GFS2 is a completely symmetric filesystem with direct storage access. It avoids central data structures and
therefore avoids bottlenecks of a server/client architecture.

A GFS2 file system can be implemented in a standalone system or as part of a cluster configuration.

When implemented as a cluster file system, GFS2 employs distributed metadata and multiple journals; it
interfaces directly with the Linux kernel file system interface (VFS layer).

Each node has its own journal that is accessible by all the other nodes in the cluster. If an errant node is
power cycled, other cluster nodes have access to its journal to replay it and put the filesystem back into a
clean state for continued access, without waiting for the fenced node to come back into the cluster.

GFS2 supports extended attributes such as Access Control List (ACL), filesystem quotas and SELinux
contexts.

File system meta-data is stored in file system data blocks and allocated dynamically on an as-needed
basis. GFS2 file systems can be grown while online, with no loss in performance or downtime.





GFS2 Limits 7-2

• GFS2 is capable of:


• 16TB file systems on 32bit
• 8EB file systems on 64bit
• Currently supported by Red Hat
• 25TB file systems
• Can run mixed 32/64-bit architectures across x86/EM64T/AMD64/ia64

GFS2 is based on a 64-bit architecture, which can theoretically accommodate an 8 EB file system.
However, the current supported maximum size of a GFS2 file system is 25 TB. If your system requires
GFS2 file systems larger than 25 TB, contact your Red Hat service representative.

GFS2 has no problems mixing 32/64-bit architectures across different CPU types.

Mixed 32/64-bit architectures limit GFS2 to 16TB (the 32-bit limit).

Typical disk devices are addressed in units of 512 byte blocks. The size of the address in the SCSI
command determines the maximum device size. The SCSI subsystem in the 2.6 kernel has support for
commands with 64-bit block addresses. To support disks larger than 2TB, the Host Bus Adapter (HBA), the
HBA driver, and the storage device must also support 64-bit block addresses.




GFS2 Enhancements 7-3





• Better performance for concurrent access on a single directory
• Faster async I/O
• Faster direct_io for preallocated files
• Included in upstream kernels (2.6.19, backported to RHEL 2.6.18 kernels)
• Improved metadata operations (df, statfs, atime, journal updates)
• Supports lsattr/chattr via standard ioctls
• Nanosecond timestamps
• Common tunables are set using mount options instead of gfs_tool
• Existing GFS file systems can be converted with gfs2_convert

In general, the functionality of GFS2 is identical to GFS. GFS2 was designed so that upgrading from GFS
would be a simple procedure, but there are a few interesting architectural differences:

• GFS2 uses regular (system) files for journals, whereas GFS uses special extents
• GFS2 has some other "per_node" system files
• The layout of inodes is slightly different
• The layout of indirect blocks is slightly different

The journaling systems of GFS2 and GFS are not compatible with each other. Upgrading is possible by
means of a tool (gfs2_convert) which is run with the filesystem off-line to update the metadata. Some spare
blocks in the GFS2 journals are used to create the (very small) per_node files required by GFS2 during the
update process. Most of the data remains in place. More details and differences between GFS2 and GFS
can be found at: http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/
Global_File_System_2/s1-ov-newfeatures-GFS2.html









Creating a GFS2 File System 7-4

• Required information:
• Lock manager type
• lock_nolock
• lock_dlm

• Lock file name
• clustername:fs_name

• Number of journals
• One per cluster node accessing the GFS2 is required
• Extras are useful to have prepared in advance

• Size of journals
• File system block size
• Example:
• gfs2_mkfs -p lock_dlm -t cluster1:gfslv -j 3 /dev/vg0/gfslv

The following is an example of making a GFS2 file system that utilizes DLM lock management, is a valid
resource of a cluster named "cluster1", is placed on a logical volume named "gfslv" that was created from
a volume group named "vg0", and creates 3 journals, each of which takes up 128MB of space in the logical
volume.

gfs2_mkfs -p lock_dlm -t cluster1:gfslv -j 3 /dev/vg0/gfslv

The lock file name consists of two elements that are delimited from each other by a colon character: the
name of the cluster for which the GFS2 filesystem is being created, and a unique (among all filesystems in
the cluster) 1-16 character name for the filesystem.

All of a GFS2 file system's attributes, including those specified at creation time, can be retrieved with the
following command if it is currently mounted:

gfs2_tool df <gfs2_mount_point>

The size of the journals created is specified with the -J option, and defaults to 128MB. The minimum
journal size is 32MB.

The GFS2 block size is specified with the -b option, and defaults to 4096 bytes. The block size is a power
of two between 512 bytes and the machine's page size (usually 4096 bytes).
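For example, the journal size and block size could be given explicitly at creation time; the values below are purely illustrative:

gfs2_mkfs -p lock_dlm -t cluster1:gfslv -j 3 -J 64 -b 1024 /dev/vg0/gfslv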

Lock Managers 7-5

• Via Red Hat Cluster Suite, GFS2 can use the following lock architectures:
• DLM
• nolock

The type of locking used for a previously-existing GFS2 file system can be viewed in the output of the
command gfs2_tool df <mount_point>.

DLM (Distributed Lock Manager) provides lock management throughout a Red Hat cluster, requiring no
nodes to be specifically configured as lock management nodes (though they can be configured that way, if
desired).

nolock Literally, no clustered lock management. For single node operation only. Automatically turns
on local flocks (use local VFS layer for file locking and file descriptor control instead of GFS2),
localcaching (so GFS2 can turn on some block caching optimizations that can't be used when running
in cluster mode), and oopses_ok (won't automatically kernel panic on oops).



Distributed Lock Manager (DLM) 7-6

• DLM manages distribution of lock management across nodes in the cluster


• Availability
• Performance

DLM runs algorithms used internally to distribute the lock management across all nodes in the cluster,
removing bottlenecks while remaining fully recoverable given the failure of any node or number of nodes.

Availability - DLM offers the highest form of availability. There is no number of nodes or selection of nodes
that can fail such that DLM cannot recover and continue to operate.

Performance - DLM increases the likelihood of local processing, resulting in greater performance. Each
node becomes the master of its own locks, so requests for locks are immediate, and don't require a
network request. In the event there is contention for a lock between nodes of a cluster, the lock arbitration
management is distributed among all nodes in the cluster, avoiding the slowdown of a heavily loaded single
lock manager. Lock management overhead becomes negligible.



Mounting a GFS2 File System 7-7

• At GFS2 mount time:


• GFS2 requires that the node be a member of the cluster
• Checks the cluster name encoded into the GFS2 superblock
• Necessary to prevent nodes from different clusters from mounting the same file system at the same time and corrupting it

• gfs2_mount(8)
• mount -o StdMountOpts,GFSOptions -t gfs2 DEVICE MOUNTPOINT

• GFS-specific mount options, for example:


• lockproto=[lock_dlm,lock_nolock]
• locktable=clustername:fsname
• acl

Many mount(8) options are perfectly valid when mounting GFS2 volumes. gfs2_mount(8) describes
additional mount options that are specific to GFS2 file systems. The -t gfs2 option is a requirement for
mounting GFS2 volumes.

The device is usually (and recommended to be) a CLVM2-managed logical volume for ease of
administration, but it is not a requirement. Later in this course we will use GNBD and iSCSI devices for our
GFS2 volumes.

The lockproto option allows a different lock manager to be used at mount time. For example, a GFS2 file
system created with lock_dlm may need to be mounted on a single node for recovery after a total cluster
failure.

The locktable option allows specification of an alternate cluster that a GFS2 filesystem should be made
available to.

The acl option enables a subset of the POSIX Access Control List acl(5) support within GFS2. One
notable missing ACL capability is the use of default ACLs. Without the acl option, users are still able to
view ACL settings (getfacl), but are not allowed to set or change them (setfacl).
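As an illustration, a GFS2 volume might be mounted with ACL support, or mounted on a single node for recovery with the nolock protocol; the device and mount point below follow the earlier examples and are not mandated by GFS2:

mount -o acl -t gfs2 /dev/vg0/gfslv /mnt/gfs
mount -o lockproto=lock_nolock -t gfs2 /dev/vg0/gfslv /mnt/gfs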



Journaling 7-8

• Normally only journals metadata
• One journal per node
• better performance than a common journal
• other nodes can access the journal during recovery
• stored in hidden files within the file system
• Two journaling modes
• Ordered: flushes I/O to disk before committing the journal (default)
• Write-Back: faster, but can lead to data loss (not recommended)
• Data journaling can be enabled per file or directory
• chattr +j file
• chattr +j directory enables it for all new files in this directory
• New journals can be added online with gfs2_jadd -j <number of journals>

Ordinarily, GFS2 writes only metadata to its journal. File contents are subsequently written to disk by the
kernel's periodic sync that flushes the file system buffers. An fsync() call on a file causes the file's data
to be written to disk immediately and returns when the disk reports that all data is safely written.

Applications relying on fsync() to sync file data may see improved performance using data journaling.
Because an fsync() returns as soon as the data is written to the journal (which can be much faster than
writing the file to the main file system), data journaling can result in a reduced fsync() time, especially for
small files.

Note:

On GFS1 the command gfs_tool setflag jdata file was used. While this is still possible, using
chattr is the preferred method.
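For example, data journaling could be enabled on a directory and then verified with lsattr; the path is illustrative only:

# chattr +j /mnt/gfs/dbfiles
# lsattr -d /mnt/gfs/dbfiles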



Quotas 7-9

• Requires additional communication between nodes


• reduces performance
• quota sync between nodes is delayed to reduce overhead
• Activated by mount option
• quota=off : quotas disabled (default)
• quota=account : usage is monitored, but quotas are not enforced
• quota=on : quotas are enforced

• Edit quotas with gfs2_quota

# gfs2_quota limit -u user -l size

# gfs2_quota warn -u user -l size

• Display quotas for a single user with gfs2_quota get -u user

• gfs2_quota list -f <mount point> generates a report of all users and groups

File-system quotas are used to limit the amount of file system space a user or group can use. A user or
group does not have a quota limit until one is set. GFS2 keeps track of the space used by each user and
group even when there are no limits in place. GFS2 updates quota information in a transactional way so
system crashes do not require quota usages to be reconstructed.

To prevent a performance slowdown, a GFS2 node synchronizes updates to the quota file only periodically.
The "fuzzy" quota accounting can allow users or groups to slightly exceed the set limit. To minimize this,
GFS2 dynamically reduces the synchronization period as a "hard" quota limit is approached.

GFS2 uses its gfs2_quota command to manage quotas. This command only needs to be issued on a single
node. Other Linux quota facilities cannot be used with GFS2.

Two quota settings are available for each user ID (UID) or group ID (GID): a hard limit and a warn limit. A
hard limit is the amount of space that can be used. The file system will not let the user or group use more
than that amount of disk space. A hard limit value of zero means that no limit is enforced. A warn limit is
usually a value less than the hard limit. The file system will notify the user or group when the warn limit is
reached to warn them of the amount of space they are using. A warn limit value of zero means that no limit
is enforced.
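A brief worked example; the user name, sizes, and mount point are placeholders, not values from the labs:

# gfs2_quota limit -u jane -l 1024 -f /mnt/gfs   # set jane's hard limit
# gfs2_quota warn -u jane -l 900 -f /mnt/gfs     # set jane's warn limit
# gfs2_quota get -u jane -f /mnt/gfs             # display jane's usage and limits
# gfs2_quota list -f /mnt/gfs                    # report for all users and groups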



Growing a GFS2 File System 7-10

• Consider if space is also needed for additional journals


• Grow the underlying volume
• Create additional physical volumes
• pvcreate /dev/sdc /dev/sdd

• Extend the current volume group


• vgextend vg0 /dev/sdc /dev/sdd

• Extend the logical volume
• lvextend -L +100G /dev/vg0/gfslv

• Grow the existing GFS2 file system into the additional space
• gfs2_grow -v <DEVICE | MOUNT POINT>

To grow a GFS2 file system, the underlying logical volume on which it was built must be grown first. This
is also a good time to consider whether additional nodes will be added to the cluster, because each new node will
require room for its journal (journals consume 128MB, by default) in addition to the data space.
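If journals for additional nodes are needed, they can be added to the mounted filesystem as well; a sketch with an illustrative mount point:

gfs2_jadd -j 1 /mnt/gfs    # add one more journal to the mounted GFS2 filesystem
gfs2_grow -v /mnt/gfs      # grow the filesystem into the newly-extended logical volume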



GFS2 Super Block Changes 7-11

• It is sometimes necessary to make changes directly to GFS2 super block settings

• GFS2 file system should be unmounted from all nodes before changes are applied
• Lock manager
• gfs2_tool sb <dev> proto [lock_dlm,lock_nolock]

• Lock table name
• gfs2_tool sb <dev> table cluster1:gfslv

• List superblock information
• gfs2_tool sb <dev> all

GFS2 file systems are told at creation time (gfs2_mkfs) what type of locking manager (protocol) will be

used. If this should ever change, the locking manager type can easily be changed with gfs2_tool.

For example, suppose a single-node GFS2 filesystem created with the lock_nolock locking manager is
now going to be made highly available by adding additional nodes and clustering the service between them.


We can change its locking manager using:

gfs2_tool sb <dev> proto lock_dlm



GFS2 Extended Attributes (ACL) 7-12

• Access Control Lists (ACL) are supported under GFS2 file systems
• ACLs allow additional "owners/groups" to be assigned to a file or directory
• Each additional owner or group can have customized permissions
• File system must be mounted with acl option
• Add 'acl' to the /etc/fstab entry
• mount -o remount <file_system>

• getfacl - view ACL settings


• setfacl - set ACL permissions

The file system on which ACLs are to be used must be mounted with the acl option. Place 'acl' in the
options field of the file system's line entry in /etc/fstab and run the command mount -o remount
<file_system>. Run the mount command to verify the acl option is in effect.

ACLs add additional owners and groups to a file or directory. For example, suppose the following file must
have read-write permissions for user jane, and read-only permissions for the group 'users':

-rw-r----- 1 jane users 0 Dec 17 18:33 data.0

Now suppose the 'boss' user also wants read-write permissions, and one particular user who is a member
of the users group, 'joe', shouldn't have any access to the file at all. This is easy to do with ACLs.

The following command assigns user 'boss' as an additional owner (user) with read-write permissions, and
'joe' as an additional owner with no privileges:

setfacl -m u:boss:rw,u:joe:- data.0

Because owner permission masks are checked before group permission masks, user joe's group
membership has no effect; the check never gets that far, stopping once it identifies joe as an owner with no
permissions.
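The resulting ACL can then be inspected with getfacl; output along these lines would be expected (shown as an illustration, not captured from a real system):

# getfacl data.0
# file: data.0
# owner: jane
# group: users
user::rw-
user:boss:rw-
user:joe:---
group::r--
mask::rw-
other::---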



Repairing a GFS2 File System 7-13

• In the event of a file system corruption, brings it back into a consistent state
• File system must be unmounted from all nodes
• gfs2_fsck <block_device>

While the command is running, verbosity of output can be increased (-v, -vv) or decreased (-q, -qq). The
-y option specifies a 'yes' answer to any question that may be asked by the command, and is usually used
to run the command in "automatic" mode (discover and fix). The -n option does just the opposite, and is
usually used to run the command and open the file system in read-only mode to discover what errors, if
any, there are without actually trying to fix them.

For example, the following command would search for file system inconsistencies and automatically
perform necessary changes (e.g. attempt to repair) to the file system without querying the user's permission
to do so first.
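For instance, with the device path following the earlier lecture examples:

gfs2_fsck -y /dev/vg0/gfslv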



End of Lecture 7

• Questions and Answers


• Summary
• Describe GFS2
• Setup and maintain GFS2




Lab 7.1: Creating a GFS2 file system with Conga
Instructions:
1. Create a 2 GB logical volume named webfs containing a gfs2 filesystem named webfs. Set
the mount point to /var/www/html. Leave the other values at their defaults.

2. Before adding the filesystem to the webby service, let's mount the filesystem manually on a
node and add some content for the web service. On node1 mount the filesystem to /var/
www/html and create the file index.html with some content.

3. On node2 verify the GFS2 functionality by mounting the filesystem manually and checking
the content of the index.html file.
4. Umount the GFS2 filesystem on both nodes.
5. Create a GFS resource using the newly added filesystem. Use the following parameters and
leave the others at their default value.

Name             webfs
Mount Point      /var/www/html
Device           /dev/ClusterVG/webfs
Filesystem Type  GFS2
6. Remove the doc root resource from the webby service. Notice that this also removes the
httpd child resource.
7. Add the GFS resource webfs as a child to the IP address.

8. Re-add the httpd resource as a child to the webfs resource.

9. Save the changes.


10. Return to the Services list and enable the webby service. Confirm its operation by pointing
your web browser to http://172.16.50.100+X.

Lab 7.2: Create a GFS2 filesystem on the commandline
Scenario: We've seen how easy it is to configure a GFS2 filesystem from within
luci, but what if we want to configure a GFS2 filesystem for a non-
clustered application? In this lab we explore how to create and manage a
GFS2 filesystem from the command line.
Instructions:

1. Because we've already configured a GFS2 filesystem from within luci, the required RPMs
have already been installed for us.

GFS2 requires only gfs2-utils. The kernel module is already provided by the installed
kernel RPM. If GFS2 is installed on top of CLVM, lvm2-cluster is also required.

Note that luci has also installed the GFS1-specific RPMs which we will use in the next exercise.
Verify which of the above RPMs are already installed on your cluster nodes.
2. Verify that the GFS2 kernel module is loaded.

node1# lsmod | head -1; lsmod | grep -E "(gfs|dlm|kmod)"

3. Verify that Conga converted the default LVM locking type from 1 (local file-based locking) to
3 (clustered locking), and that clvmd is running.

node1,2# grep locking_type /etc/lvm/lvm.conf

node1,2# service clvmd status

Note:

To convert the locking type without Conga's help, use the following command before
starting clvmd:

node1,2# lvmconf --enable-cluster

4. In the next step we will create a clustered LVM2 logical volume as the GFS2 "container".
Before doing so, we briefly review LVM2 and offer some troubleshooting tips.

First, so long as we are running the clvmd service on all participating GFS cluster nodes,
we only need to create the logical volume on one node and the others will automatically be
updated.

Second, the following are helpful commands to know and use for displaying information about
the different logical volume elements:

pvdisplay, pvs
vgdisplay [-v], vgs
lvdisplay, lvs
service clvmd status

Possible errors you may encounter:

If, when viewing the LVM configuration the tools show or complain about missing physical
volumes, volume groups, or logical volumes which no longer exist on your system, you may
need to flush and re-scan LVM's cached information:

# rm -f /etc/lvm/cache/.cache
# pvscan
# vgscan
# lvscan

If, when creating your logical volume it complains about a locking error ("Error locking on
node..."), stop clvmd on every cluster node, then start it on all cluster nodes again. You may
even have to clear the cache and re-scan the logical volume elements before starting clvmd
again. The output of:

# lvdisplay | grep "LV Status"

should change from:

LV Status NOT available

to:

LV Status available

and the LV should be ready to use.

If you need to dismantle your LVM to start from scratch for any reason, the following sequence
of commands will be helpful:
1. Remove any /etc/fstab entries referencing the LVM:  vi /etc/fstab
2. Make sure it is unmounted:                           umount /dev/ClusterVG/gfslv
3. Deactivate the logical volume:                       lvchange -an /dev/ClusterVG/gfslv
4. Remove the logical volume:                           lvremove /dev/ClusterVG/gfslv
5. Deactivate the volume group:                         vgchange -an ClusterVG
6. Remove the volume group:                             vgremove ClusterVG
7. Remove the physical volumes:                         pvremove /dev/sd??
8. Stop clvmd:                                          service clvmd stop

5. Create a 1GB logical volume named gfslv from volume group ClusterVG that will be used
for the GFS.

6. The GFS locktable name is created from the cluster name and a uniquely defined name of your
choice. Verify your cluster's name.



7. Create a GFS2 file system on the gfslv logical volume with journal support for two (do not
create any extras at this time) nodes. The GFS2 file system should use the default DLM to
manage its locks across the cluster and should use the unique name "gfslv".

Note: GFS2 journals consume 32MB, by default, each.

8. Create a new mount point named /mnt/gfs on both nodes and mount the newly created file
system to it, on both nodes.

Look at the tail end of /var/log/messages to see that it has properly acquired a journal
lock.

9. Add an entry to both nodes' /etc/fstab file so that the shared file system persists across
reboots.

10. Copy into or create some data in /mnt/gfs from either node and verify that the other node
can see and access it.



Lab 7.3: GFS1: Conversion
Scenario: While GFS1 is still supported, many customers choose to upgrade their
existing GFS1 filesystems to GFS2 for better performance. In this exercise
we create a GFS1 filesystem and then convert it to GFS2. Again we use the
command line. Notice the strong similarity between the GFS1 and GFS2
commands.

Instructions:

1. Luci has already installed the necessary gfs-utils and kmod-gfs-xen for you.

2. Create a 1GB logical volume named gfs1 from volume group ClusterVG that will be used
for the GFS.

3. Create a GFS1 file system on the gfs1 logical volume with journal support for two nodes
using the mkfs.gfs command. The GFS1 file system should use the default DLM to manage its
locks across the cluster and should use the unique name "gfs1".

Note: journals in GFS1 consume 128MB, by default, each.

4. Create a new mount point named /mnt/gfs1 on both nodes and mount the newly created file
system to it, on both nodes.

Look at the tail end of /var/log/messages to see that it has properly acquired a journal
lock.

5. Add an entry to both nodes' /etc/fstab file so that the shared file system persists across
reboots.

6. Copy into or create some data in /mnt/gfs1 from either node and verify that the other node
can see and access it.

7. It is now time to convert this filesystem to GFS2. This conversion has to be done offline.
Umount the filesystem on both nodes.

8. Before the filesystem is converted it is strongly recommended to backup the filesystem and
perform a filesystem check. Use the tool gfs_fsck to test the integrity of the data.

9. Convert the filesystem to GFS2.

10. Update the filesystem type in /etc/fstab on both cluster nodes.

11. Now mount the filesystem again. Use mount -a to verify consistency of /etc/fstab.

12. Cleanup: Umount the filesystem and delete the /etc/fstab reference on both nodes and
remove the logical volume /dev/ClusterVG/gfs1.



Lab 7.4: GFS2: Working with images
Scenario: In this exercise we explore how to access an image of a GFS2 filesystem on
a node outside of the cluster.

Instructions:

1. Begin by creating a 512MB LVM volume with a GFS2 filesystem. Use imagetest as the
filesystem and volume name.

2. Mount the filesystem on a single node and put some data in it. You don't need to mount the
filesystem persistently across boots.

3. Before taking the image, umount the filesystem. Images taken from a live filesystem will not be
consistent.

4. Now use dd to create an image of the logical volume. The GFS2 filesystem you have created in
the first part of the lab has enough space to store the file.

5. We won't need the filesystem anymore. Remove it.

6. Now copy this image to your physical system.

7. You have two options to access this image: Either by creating a logical volume or partition and
"dd'ing" the image into that container or by loop-mounting the image directly. In both cases you
have to manually disable the DLM locking mechanism, since the station is not a member of your
cluster.

Let's use the loop-mounting way: Use losetup to point the loop1 device to the image file,
perform a filesystem check and then mount the filesystem.

8. How would you mount such an image persistently without creating the loop device manually?



Lab 7.5: GFS2: Growing the filesystem
Instructions:

1. We still have room left in volume group ClusterVG, so let's expand our logical volume and
GFS2 filesystem to use the rest of the space.

First, expand the logical volume into the remaining volume group space.

2. Now grow the GFS2 filesystem into the newly-available logical volume space, and verify the
additional space is available. Note: GFS2 must be mounted, and we only need to do this on one
node in the cluster.


Lab 7.1 Solutions


1. Create a 2 GB logical volume named webfs containing a gfs2 filesystem named webfs. Set
the mount point to /var/www/html. Leave the other values at their defaults.

Go to the storage tab and select one of your cluster nodes. Select Volume Groups/ClusterVG.
Click on New Logical Volume.

Set the Logical Volume Name to webfs, the size to 2 GB and the Content Type to GFS2 -
Global FS v.2.

Choose webfs as the Unique GFS Name and /var/www/html as the mount point. Leave the
other values at their default values.

Click on the Create button and confirm.


2. Before adding the filesystem to the webby service, let's mount the filesystem manually on a
node and add some content for the web service. On node1 mount the filesystem to /var/
www/html and create the file index.html with some content.

node1# mount /dev/ClusterVG/webfs /var/www/html
node1# echo 'Hello GFS2!' > /var/www/html/index.html

3. On node2 verify the GFS2 functionality by mounting the filesystem manually and checking
the content of the index.html file.

node2# mount /dev/ClusterVG/webfs /var/www/html
node2# cat /var/www/html/index.html
Hello GFS2!

4. Umount the GFS2 filesystem on both nodes.

node1# umount /var/www/html
node2# umount /var/www/html

5. Create a GFS resource using the newly added filesystem. Use the following parameters and
leave the others at their default value.

Name             webfs
Mount Point      /var/www/html
Device           /dev/ClusterVG/webfs
Filesystem Type  GFS2

In Luci, go to the cluster tab and click on the clusterX link.

Select Resources, then Add a Resource.

Choose the Resource Type GFS file system. Set the Name, Mount Point, Device, and
Filesystem Type as defined above and leave the other values unchanged.

Click on Submit and confirm.
6. Remove the doc root resource from the webby service. Notice that this also removes the
httpd child resource.

Go to Services and click on the webby link. Scroll down to the File System Resource
Configuration and click the button Delete this Resource. Confirm.

As you see, the changed Service Composition only lists the IP Address resource.

7. Add the GFS resource webfs as a child to the IP address.

At IP Address Resource Configuration click on Add a child. From Use an existing global
resource select webfs (GFS).

8. Re-add the httpd resource as a child to the webfs resource.

At GFS Resource Configuration click on Add a child. From Use an existing global resource
select httpd (Apache Server).

9. Save the changes.

Scroll down and click on Save changes. Confirm.

10. Return to the Services list and enable the webby service. Confirm its operation by pointing
your web browser to http://172.16.50.100+X.

Select Services, then choose Enable this service from webby's task list. Click on Go
and confirm.

Open a new web browser window and enter http://172.16.50.100+X as the URL. Do
you see the content of your index.html file?



Lab 7.2 Solutions
1. Because we've already configured a GFS2 filesystem from within luci, the required RPMs
have already been installed for us.

GFS2 requires only gfs2-utils. The kernel module is already provided by the installed
kernel RPM. If GFS2 is installed on top of CLVM, lvm2-cluster is also required.

Note: Luci has also installed the GFS1-specific RPMs which we will use in the next exercise.
Verify which of the above RPMs are already installed on your cluster nodes.

node1# rpm -qa | grep -E "(gfs|lvm2)"

2. Verify that the GFS2 kernel module is loaded.

node1# lsmod | head -1; lsmod | grep -E "(gfs|dlm|kmod)"
3. Verify that Conga converted the default LVM locking type from 1 (local file-based locking) to
3 (clustered locking), and that clvmd is running.

node1,2# grep locking_type /etc/lvm/lvm.conf

node1,2# service clvmd status

Note: to convert the locking type without Conga's help, use the following command
before starting clvmd:

node1,2# lvmconf --enable-cluster

4. In the next step we will create a clustered LVM2 logical volume as the GFS2 "container".
Before doing so, we briefly review LVM2 and offer some troubleshooting tips.

First, so long as we are running the clvmd service on all participating GFS cluster nodes,
we only need to create the logical volume on one node and the others will automatically be
updated.

Second, the following are helpful commands to know and use for displaying information about
the different logical volume elements:

pvdisplay, pvs
vgdisplay [-v], vgs
lvdisplay, lvs
service clvmd status

Possible errors you may encounter:
If, when viewing the LVM configuration the tools show or complain about missing physical
volumes, volume groups, or logical volumes which no longer exist on your system, you may
need to flush and re-scan LVM's cached information:

# rm -f /etc/lvm/cache/.cache
# pvscan
# vgscan
# lvscan

If, when creating your logical volume it complains about a locking error ("Error locking on
node..."), stop c lvmd on every cluster node, then start it on all cluster nodes again. You may
even have to clear the cache and re-scan the logical volume elements before starting clvmd
again. The output of:

# lvdisplay | grep "LV Status"

should change from:

LV Status NOT available

to:

LV Status available

and the LV should be ready to use.

If you need to dismantle your LVM to start from scratch for any reason, the following sequence
of commands will be helpful:
1. Remove any /etc/fstab entries referencing the LVM:  vi /etc/fstab
2. Make sure it is unmounted:                           umount /dev/ClusterVG/gfslv
3. Deactivate the logical volume:                       lvchange -an /dev/ClusterVG/gfslv
4. Remove the logical volume:                           lvremove /dev/ClusterVG/gfslv
5. Deactivate the volume group:                         vgchange -an ClusterVG
6. Remove the volume group:                             vgremove ClusterVG
7. Remove the physical volumes:                         pvremove /dev/sd??
8. Stop clvmd:                                          service clvmd stop

5. Create a 1GB logical volume named gfslv from volume group ClusterVG that will be used
for the GFS.

node1# lvcreate -L 1G -n gfslv ClusterVG

This command will create the /dev/ClusterVG/gfslv device file and it should be visible
on all nodes of the cluster.

6. The GFS locktable name is created from the cluster name and a uniquely defined name of your
choice. Verify your cluster's name.

node1# cman_tool status | grep "Cluster Name"



7. Create a GFS2 file system on the gfslv logical volume with journal support for two (do not
create any extras at this time) nodes. The GFS2 file system should use the default DLM to
manage its locks across the cluster and should use the unique name "gfslv".

Note: journals consume 32MB, by default, each.

Substitute your cluster's number for the character X in the following command:

node1# mkfs.gfs2 -t clusterX:gfslv -j 2 /dev/ClusterVG/gfslv

8. Create a new mount point named /mnt/gfs on both nodes and mount the newly created file
system to it, on both nodes.

Look at the tail end of /var/log/messages to see that it has properly acquired a journal
lock.

node1,2# mkdir /mnt/gfs
node1,2# mount /dev/ClusterVG/gfslv /mnt/gfs
node1,2# tail /var/log/messages

9. Add an entry to both nodes' /etc/fstab file so that the shared file system persists across
reboots.

/dev/ClusterVG/gfslv /mnt/gfs gfs2 defaults 0 0

10. Copy into or create some data in /mnt/gfs from either node and verify that the other node
can see and access it.

node1# cp /etc/group /mnt/gfs
node2# cat /mnt/gfs/group



Lab 7.3 Solutions
1. Luci has already installed the necessary gfs-utils and kmod-gfs-xen for you.

2. Create a 1GB logical volume named gfs1 from volume group ClusterVG that will be used
for the GFS.

node1# lvcreate -L 1G -n gfs1 ClusterVG

This command will create the /dev/ClusterVG/gfs1 device file and it should be visible
on all nodes of the cluster.

3. Create a GFS1 file system on the gfs1 logical volume with journal support for two nodes
using the mkfs.gfs command. The GFS1 file system should use the default DLM to manage its
locks across the cluster and should use the unique name "gfs1".

Note: journals consume 128MB, by default, each.



Substitute your cluster's number for the character X in the following command:

node1# mkfs.gfs -t clusterX:gfs1 -j 2 /dev/ClusterVG/gfs1


4. Create a new mount point named /mnt/gfs1 on both nodes and mount the newly created file
system to it, on both nodes.

Look at the tail end of /var/log/messages to see that it has properly acquired a journal
lock.

node1,2# mkdir /mnt/gfs1
node1,2# mount /dev/ClusterVG/gfs1 /mnt/gfs1
node1,2# tail /var/log/messages

5. Add an entry to both nodes' /etc/fstab file so that the shared file system persists across
reboots.

/dev/ClusterVG/gfs1 /mnt/gfs1 gfs defaults 0 0

6. Copy into or create some data in /mnt/gfs1 from either node and verify that the other node
can see and access it.

node1# cp /etc/group /mnt/gfs1
node2# cat /mnt/gfs1/group

7. It is now time to convert this filesystem to GFS2. This conversion has to be done offline.
Unmount the filesystem on both nodes.

node1,2# umount /mnt/gfs1

8. Before the filesystem is converted it is strongly recommended to back up the filesystem and
perform a filesystem check. Use the tool gfs_fsck to test the integrity of the data.

node1# gfs_fsck /dev/ClusterVG/gfs1

9. Convert the filesystem to GFS2.

node1# gfs2_convert /dev/ClusterVG/gfs1

10. Update the filesystem type in /etc/fstab on both cluster nodes.

Edit both nodes' /etc/fstab to read:

/etc/fstab:
/dev/ClusterVG/gfs1 /mnt/gfs1 gfs2 defaults 0 0

11. Now mount the filesystem again. Use mount -a to verify consistency of /etc/fstab.

node1,2# mount -a

12. Cleanup: Unmount the filesystem and delete the /etc/fstab reference on both nodes and
remove the logical volume /dev/ClusterVG/gfs1.

node1,2# umount /dev/ClusterVG/gfs1
node1,2# vim /etc/fstab
node1# lvremove /dev/ClusterVG/gfs1



Lab 7.4 Solutions
1. Begin by creating a 512MB LVM volume with a GFS2 filesystem. Use imagetest as the
filesystem and volume name.

node1# lvcreate -n imagetest -L 512M ClusterVG

node1# mkfs.gfs2 -t clusterX:imagetest /dev/ClusterVG/imagetest

2. Mount the filesystem on a single node and put some data in it. You don't need to mount the
filesystem persistently across boots.

node1# mkdir /mnt/imagetest

node1# mount /dev/ClusterVG/imagetest /mnt/imagetest
node1# cp /etc/services /mnt/imagetest

3. Before taking the image, unmount the filesystem. Images taken from a live filesystem will not be
consistent.

node1# umount /dev/ClusterVG/imagetest

4. Now use dd to create an image of the logical volume. The GFS2 filesystem you have created in
the first part of the lab has enough space to store the file.

node1# dd if=/dev/ClusterVG/imagetest of=/mnt/gfs/imagetest.img bs=4M

5. We won't need the filesystem anymore. Remove it.

node1# lvremove /dev/ClusterVG/imagetest

6. Now copy this image to your physical system.

node1# scp /mnt/gfs/imagetest.img stationX:/tmp

7. You have two options to access this image: Either by creating a logical volume or partition and
"dd'ing" the image into that container, or by loop-mounting the image directly. In both cases you
have to manually disable the DLM locking mechanism, since the station is not a member of your
cluster.

Let's use the loop-mounting way: Use losetup to point the loop1 device to the image file,
perform a filesystem check and then mount the filesystem to /mnt/imagetest.

stationX# losetup /dev/loop1 /tmp/imagetest.img
stationX# gfs2_fsck /dev/loop1
stationX# mkdir /mnt/imagetest
stationX# mount -o lockproto=lock_nolock /dev/loop1 /mnt/imagetest

8. How would you mount such an image persistently without creating the loop device manually?

/etc/fstab:
/tmp/imagetest.img /mnt/imagetest gfs2 loop,lockproto=lock_nolock 0 0



Lab 7.5 Solutions
1. We still have room left in volume group ClusterVG, so let's expand our logical volume and
GFS2 filesystem to use the rest of the space.

First, expand the logical volume into the remaining volume group space.

Determine the number of free physical extents (PE) in ClusterVG:

node1# vgdisplay ClusterVG | grep Free

Free PE / Size 516 / 2.98 GB

then grow the logical volume by that amount (alternatively, you can use the option "-l
+100%FREE" to lvextend to do the same thing in fewer steps):

node1# lvextend -l +516 /dev/ClusterVG/gfslv

and verify the additional space in the logical volume:

node1# lvdisplay /dev/ClusterVG/gfslv

2. Now grow the GFS2 filesystem into the newly-available logical volume space, and verify the
additional space is available. Note: GFS2 must be mounted, and we only need to do this on one
node in the cluster.

node1# gfs2_grow -v /mnt/gfs

node1# df



Lecture 8

Quorum and the Cluster Manager

Upon completion of this unit, you should be able to:


• Define Quorum
• Understand how Quorum is Calculated
• Understand why the Cluster Manager Depends Upon Quorum


Cluster Quorum 8-1

• Majority voting scheme to deal with split-brain situations


• Each node has a configurable number of votes (default=1)
• <clusternode name="foo" nodeid="1" votes="1"/>
• Total votes = sum of all cluster node votes
• Expected votes = initially, the Total votes value, but modifiable
• Quorum is calculated from Expected votes value
• If the sum of current member votes is greater than half of Expected votes, then
quorum is achieved
• Two-node special case is the exception
• The cluster and its applications only operate if the cluster has quorum
Quorum is an important concept in a high-availability application cluster. The cluster manager can suffer
from a "split-brain" condition in the event of a network partition. That is, two groups of nodes that have been
partitioned could both form their own cluster of the same name. If both clusters were to access the same
shared data, that data would be corrupted. Therefore, the cluster manager must guarantee, using a quorum
majority voting scheme, that only one of the two split clusters becomes active.

To this end, the cluster manager safely copes with split-brain scenarios by having each node broadcast or
multicast a network heartbeat indicating to the other cluster members that it is on-line. Each cluster node
also listens for these messages from other nodes. Each node constructs an internal view of which other
nodes it thinks is on-line. Whenever a node is detected to have come on-line or gone off-line, a member
transition is said to have occurred. Member transitions trigger an election, in which one node proposes
a view and all the other nodes report whether the proposed view matches their internal view. The cluster
manager will then form a view of which nodes are on-line and will tally up their respective quorum votes. If
exactly half or more of the expected votes disappear, a quorum no longer exists (except in the two-node
special case). Only nodes which have quorum may run a virtual cluster service.

The voting values described above can be viewed in the output of the command cman_tool status.

As new nodes are added to the cluster, the number of total votes increases dynamically. The total vote
count is never decreased dynamically.

If there is quorum, an exit code of 0 (zero) should be returned to the shell when the clustat -Q command
(which produces no output) is run:

# clustat -Q
# echo $?
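For example (a minimal sketch, not part of the course commands), the exit code can drive a simple
shell test:

# clustat -Q && echo "cluster is quorate" || echo "no quorum"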



Cluster Quorum Example 8-2

• Required votes for quorum = (expected_votes / 2) + 1


• Fractions are rounded down
• Two-node case is special
• Ten-node cluster example:

2 nodes @ 10 votes each = 20 votes


8 nodes @ 1 vote each = 8 votes

Needed for Quorum = 15 votes

In this ten-node cluster example, two of the machines have 10 votes while the other 8 machines have only
1 vote, each. We are assuming that the expected votes has not been modified and is equal to the number
of total votes.

The reasons for giving one machine more voting power than another are varied, but possibly the 10-vote
machines have a cleaner and more reliable power source, they can handle much more computational load,
they have redundant connections to storage or the network, etc....

Scenario 1: All 8 1-vote machines fail, but the 10-vote machines are still operational. The cluster maintains
quorum.

Scenario 2: One 10-vote machine fails. We need at least 5 of the 1-vote machines to remain operational in
order for the cluster to maintain quorum.

Scenario 3: Both 10-vote machines fail, but all 8 of the 1-vote machines are still operational. The cluster
loses quorum.
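As a quick check of the integer arithmetic used above (a shell sketch only; the 28 expected votes
come from the example, and bash integer division rounds down):

# expected=28
# echo $(( expected / 2 + 1 ))
15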



Modifying and Displaying Quorum Votes 8-3

• The number of votes assigned to each node can be modified:


• system-config-cluster
• Manually edit cluster configuration file (cman(5) for syntax)
• The Expected votes can be modified for flexibility:
• cman_tool expected -e <votes>

• Displaying Voting Information:


• cman_tool status
• ccs_tool lsnode

An administrator can manually change the expected votes value in a running cluster with the command
(Warning: exercise care that a split-brain cluster does not become quorate!):

# cman_tool expected -e <votes>

This command can be very handy when a quorate number of nodes has failed, but the service must be
brought up again quickly on the remaining less-than-optimal number of nodes. It tells CMAN there is a
new value of expected votes and instructs it to recalculate quorum based on this value. Remember, votes
required for quorum = (expected_votes / 2) + 1.

To display Expected votes and number of votes needed for quorum:

# cman_tool status
Version: 6.0.1
Config Version: 12
Cluster Name: cluster1
Cluster Id: 26777
Cluster Member: Yes
Cluster Generation: 12
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Total votes: 3
Quorum: 2
Active subsystems: 7
Flags:
Ports Bound: 0 177
Node name: node1.cluster-1.example.com
Node ID: 2
Multicast addresses: 239.192.104.2
Node addresses: 172.16.36.11

Two-node output:

# cman_tool status
Version: 6.0.1
Config Version: 3
Cluster Name: test1
Cluster Id: 3405
Cluster Member: Yes



Cluster Generation: 12
Membership state: Cluster-Member
Nodes: 2
Expected votes: 1
Total votes: 2
Quorum: 1
Active subsystems: 7
Flags: 2node
Ports Bound: 0 177
Node name: node1.cluster-1.example.com
Node ID: 2
Multicast addresses: 239.192.13.90
Node addresses: 172.16.36.11

To view how many votes each node in a cluster carries:

# ccs_tool lsnode

Cluster name: test1, config version: 19

Nodename Votes Nodeid Fencetype


node2.cluster-1.example.com      1    1      apc1
node1.cluster-1.example.com      1    2      apc1
node3.cluster-1.example.com      1    3      apc1

To modify votes assigned to the current node:

# cman_tool votes -v <votes>



CMAN - two node cluster 8-4

• There is a two node parameter that can be set when there are only two nodes in
the cluster
• Quorum is disabled in two node mode
• Because one node can have quorum, a split-brain is possible
• Safe because both nodes race to fence each other before enabling GFS/DLM
• Race winner enables GFS/DLM, loser reboots
• This is a poor solution when there's a persistent network partition and both nodes
can still fence each other
• Reboot-then-fence cycle

For the two-node special case, we want to preserve quorum when one of the two nodes fails. To this end,
two-node clusters are an exception to the "normal" quorum decision process: in order for one node to
continue to operate when the other is down, the cluster enters a special mode called, literally, two_node
mode. two_node mode is entered automatically when two-node clusters are built in the GUI, or manually by
setting the two_node and expected_votes values to 1 in the cman configuration section:

<cman two_node="1" expected_votes="1"></cman>



CCS Tools - ccstool 8-5

• Must be used whenever updating configuration file by hand


• Sequence of events:
• Edit cluster.conf with changes
• Increment config version number in cluster.conf
• Inform CCS and cman of updated version and propagate the new cluster.conf to the other cluster
nodes
• ccs_tool update /etc/cluster/cluster.conf

• Only need to run on one of the cluster nodes

The cluster configuration GUIs generally take care of propagating any cluster configuration changes to /
etc/cluster/cluster.conf to the other nodes in the cluster. The system-config-cluster GUI has a
button in the upper right corner of the tool labeled "Send to Cluster" that will accomplish this for you.

If, however, you are maintaining your cluster.conf file by hand and want to manually propagate it to the
rest of the cluster, the following example will guide you:

1. Edit /etc/cluster/cluster.conf

2. Inform the Cluster Configuration System (CCS) and Cluster Manager (cman) about the change, and
propagate the changes to all cluster nodes:

# ccs_tool update /etc/cluster/cluster.conf


Config file updated from version 2 to 3

Update complete.

3. Verify CMAN's information and the changes to cluster.conf were propagated to the other nodes by
examining the output of either of the following commands:

# cman_tool status | grep "Config version"


Config Version: 3
# cman_tool version
6.0.1 config 3

The integer version number must be incremented by hand whenever any changes are manually made to
the configuration file.



cluster.conf Schema 8-6
• XML Schema

• http://sources.redhat.com/cluster/doc/cluster_schema.html

• Hierarchical layout of XML:

CLUSTER
 \ CMAN
 \ CLUSTERNODES
 |    \ CLUSTERNODE+
 |         \ FENCE
 |              \ METHOD+
 |                   \ DEVICE+
 \ FENCEDEVICES
 |    \ FENCEDEVICE+
 \ RM (Resource Manager Block)
 |    \ FAILOVERDOMAINS
 |    |    \ FAILOVERDOMAIN*
 |    |         \ FAILOVERDOMAINNODE*
 |    \ RESOURCES
 |    \ SERVICE*
 \ FENCE_DAEMON
In the diagram above, * means "zero or more", and + means "one or more".

An explanation of the XML used for cluster.conf can be found at the above URL. There are over 200
cluster attributes that can be defined for the cluster. The most common attributes are most easily defined
using the GUI configuration tools available.





Updating an Existing RHEL4 8-7
cluster.conf for RHEL5

• Every node listed in cluster.conf must have a node ID


• Update a pre-existing cluster.conf file:
• ccs_tool addnodeids
• Propagate cluster.conf to all cluster nodes (a short sketch of the sequence follows below).
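A minimal sketch of that sequence when the file is maintained by hand (an illustration only;
remember to increment config_version before propagating if the tool has not already done so):

# ccs_tool addnodeids
# vi /etc/cluster/cluster.conf        (bump config_version if needed)
# ccs_tool update /etc/cluster/cluster.conf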



cman_tool 8-8
• Manages the cluster management subsystem, CMAN
• Can be used on a quorate cluster

• Can be used to:
• Join the node to a cluster
• Leave the cluster
• Kill another cluster node
• Display or change the value of expected votes of a cluster
• Get status and service/node information

Example output (modified for brevity):

# cman_tool status
Version: 6.0.1
Config Version: 12
Cluster Name: cluster1
Cluster Id: 26777
Cluster Member: Yes
Cluster Generation: 12
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Total votes: 3
Quorum: 2
Active subsystems: 7
Flags:
Ports Bound: 0 177
Node name: node-1.cluster-1.example.com
Node ID: 2
Multicast addresses: 239.192.104.2
Node addresses: 172.16.36.11

The status of the Service Manager:

# cman_tool services
type             level name       id       state
fence            0     default    00010003 none
[1 2 3]
dlm              1     rgmanager  00010001 none
[1 2 3]

Listing of quorate cluster nodes and when they joined the cluster:

# cman_tool nodes
Node  Sts  Inc  Joined               Name
   1   M    12  2007-04-11 17:01:53  node-2.cluster-1.example.com
   2   M     4  2007-04-11 17:01:14  node-1.cluster-1.example.com
   3   M     8  2007-04-11 17:01:14  node-3.cluster-1.example.com



cman_tool Examples 8-9

• cman_tool join
• Join the cluster
• cman_tool leave
• Leave the cluster
• Fails if systems are still using the cluster
• cman_tool status
• Local view of cluster status
• cman_tool nodes
• Local view of cluster membership

In a CMAN cluster, there is a join protocol that all nodes have to go through to become a member, and
nodes will only talk to known members.

By default, cman will use UDP port 6809 for internode communication. This can be changed by setting a
port number in cluster . conf as follows:

<cman port="6809"></cman>

or at cluster join time using the command:

cman_tool join -p 6809



CMAN - API 8-10

• Provides interface to cman libraries


• Cluster Membership API
• Backwards-compatible with RHEL4

The libcman library provides a cluster membership API. It can be used to get a count of nodes in the
cluster, a list of nodes (name, address), whether it is quorate, the cluster name, and join times.



CMAN - libcman 8-11

• For developers
• Backwards-compatible with RHEL4
• Cluster Membership API
• cman_get_node_count()
• cman_get_nodes()
• cman_get_node()
• cman_is_quorate()
• cman_get_cluster()
• cman_send_data()

The libcman library provides a cluster membership API. It can be used to get a count of nodes in the
cluster, a list of nodes (name, address), whether it is quorate, the cluster name, and join times.



End of Lecture 8


• Questions and Answers
• Summary

• Define Quorum
• Understand how Quorum is Calculated
• Understand why the Cluster Manager Depends Upon Quorum




Lab 8.1: Extending Cluster Nodes
Scenario: In this exercise we will extend our two-node cluster by adding a third node.
System Setup: Students should already have a working two-node cluster from the previous
lab.

Instructions:

1. Recreate node3 if you have not already done so, by executing the command:

stationX# rebuild-cluster -3

2. Make sure the node's hostname is set persistently to node3.clusterX.example.com

Configure your cluster's node3 for being added to the cluster by installing the ricci and
httpd RPMs, starting the ricci service, and making sure the ricci service survives a
reboot.

Make sure that node3's iSCSI initiator is configured and the partition table is consistent with
node1 and node2.

3. If not already, log into luci's administrative interface. From the cluster tab, select Cluster List
from the clusters menu on the left-side of the window.

From the "Choose a cluster to administer" section of the page, click on the cluster name.

4. From the clusterX menu on the left side, select Nodes, then select Add a Node.

Enter the fully-qualified name of your node3 (node3.clusterX.example.com) and the
root password. Click the Submit button when finished. Monitor node3's progress via its
console and the luci interface.

5. Provide node3 with a copy of /etc/cluster/fence_xvm.key from one of the other
nodes, and then associate node3 with the xenfenceX shared fence device we created earlier.

6. Make sure that cman and rgmanager start automatically on node3 by setting the Enabled at
start up flag.

7. Once finished, select Failover Domains from the menu on the left-hand side of the window,
then click on the Failover Domain Name (prefer_node1).

In the "Failover Domain Membership" section, node3 should be listed.

Make it a member and set its priority to 2. Click the Submit button when finished.

8. Relocate the webby service to node3 to test the new configuration, while monitoring the
status of the service.

Verify the web page is accessible and that node3 is the node with the 172.16.50.X6 IP
address.



9. Troubleshooting: In rare cases luci fails to propagate /etc/cluster/cluster.conf to a
newly added node. Without the config file cman cannot start properly. If the third node cannot
join the cluster, check if the file exists on node3. If it doesn't, copy the file manually from another
node and restart the cman service manually.

10. View the current voting and quorum values for the cluster, either from luci's Cluster List view
or from the output of the command cman_tool status on any cluster node.

11. Currently, the cluster needs a minimum of 2 nodes to remain quorate. Let's test this by shutting
down our nodes one by one.

On node1, continuously monitor the status of the cluster with the clustat command, then
poweroff -f node3.

Which node did the service failover to, and why?

Verify the web page is still accessible.

12. Check the values for cluster quorum and votes again.

Go ahead and poweroff -f node2.

13. Does the service stop or fail? Why or why not?

Check the values for cluster quorum and votes again.

14. Re-start nodes 2 and 3, and once again query the cluster quorum and voting values. Have they
returned to their original settings?



Lab 8.2: Manually Editing the Cluster Configuration
Scenario: Building a cluster from scratch or making changes to the cluster's
configuration within the luci GUI is convenient. Propagating changes
to the other cluster nodes can be as simple as pressing a button within the
interface.

There are times, however, that you will want to tweak the
cluster.conf file by hand: avoiding the overhead of a GUI, modifying
a parameter that can be specified in the XML but isn't handled by the
GUI, or maybe changes best implemented by a script that edits the
cluster.conf file directly.

Command line interface (CLI) changes are straightforward, as you will see,
but there is a process that must be followed.

Deliverable: In this lab section, we will make a very simple change to the
cluster.conf file, propagate the new configuration and update the in-
memory CCS information.

Instructions:

1. First, inspect the current post_join_delay and config_version parameters on both
node1 and node2.

2. On node1, edit the cluster configuration file, /etc/cluster/cluster.conf, and
increment the post_join_delay parameter from its default setting to a value that is one
integer greater (e.g. change post_join_delay=3 to post_join_delay=4). Do not exit
the editor, yet, as there is one more change we will need to make.

3. Whenever the cluster.conf file is modified, it must be updated with a new integer version
number. Increment your cluster.conf config_version value (keep the double
quotes around the value) and save the file.

4. On node2, verify (but do not edit) its cluster.conf still has the old values for the
post_join_delay and config_version parameters.

5. On node1, update the CCS with the changes, then use ccsd to propagate them to the other
nodes in the cluster. Re-verify the information on node2. Was the post_join_delay and
config_version updated on node2? Is cman on node2 aware of the update?

Lab 8.3: GFS2: Adding Journals
Scenario: Every node in the cluster that wants access to the GFS needs its own
journal. Each journal is 128MB in size, by default. We specified 2 journals
were to be created (-j 2 option to mkfs.gfs2) when we first created our
GFS filesystem, and so only nodel and node2 were able to mount it.
We now want to extend GFS2's reach to our third node, node3. In order
to do that, we need to add an additional journal. We will actually add two
additional journals; it is always helpful to have spares for future growth.

System Setup: GFS2 must be mounted, and we only have to do this on one node.

Instructions:

1. First, verify our current number of journals.

2. Confirm that the third node can currently not mount the GFS2 filesystem.

3. Verify that there is enough space on the filesystem to add another 128MB journal.

4. Add two more journals with the same size.

5. Verify that the available space has been reduced by 2*128MB

6. Now mount the GFS2 filesystem on the third node. This time the command should succeed.



Lab 8.1 Solutions
1. Recreate node3 if you have not already done so, by executing the command:

stationX# rebuild-cluster -3

2. Make sure the node's hostname is set persistently to node3.clusterX.example.com

cXn3# perl -pi -e "s/HOSTNAME=.*/HOSTNAME=node3.clusterX.example.com/" /etc/sysconfig/network
cXn3# hostname node3.clusterX.example.com

Configure your cluster's node3 for being added to the cluster by installing the ricci and
httpd RPMs, starting the ricci service, and making sure the ricci service survives a
reboot.

node3# yum -y install ricci httpd
node3# service ricci start; chkconfig ricci on

Make sure that node3's iSCSI initiator is configured and the partition table is consistent with
node1 and node2.

node3# /root/RH436/HelpfulFiles/setup-initiator -bl
node3# partprobe /dev/sda

3. If not already, log into luci's administrative interface. From the cluster tab, select Cluster List
from the clusters menu on the left-side of the window.

From the "Choose a cluster to administer" section of the page, click on the cluster name.

4. From the clusterX menu on the left side, select Nodes, then select Add a Node.

Enter the fully-qualified name of your node3 (node3.clusterX.example.com) and the
root password. Click the Submit button when finished. Monitor node3's progress via its
console and the luci interface.

5. Provide node3 with a copy of /etc/cluster/fence_xvm.key from one of the other
nodes, and then associate node3 with the xenfenceX shared fence device we created earlier.

node1# scp /etc/cluster/fence_xvm.key node3:/etc/cluster

To associate node3 with our shared fence device, follow these steps: From the left-hand
menu select Nodes, then select node3.clusterX.example.com just below it. In luci's
main window, scroll to the bottom, and in the "Main Fencing Method" section, click the "Add
fence device to this level" link. In the drop-down menu, select "xenfenceX (Virtual Machine
Fencing)". In the "Domain" box, type node3, then click the Update main fence properties
button at the bottom.

6. Make sure that cman and rgmanager start automatically on node3 by setting the Enabled at
start up flag.
7. Once finished, select Failover Domains from the menu on the left-hand side of the window,
then click on the Failover Domain Name (prefer_node1).

In the "Failover Domain Membership" section, node3 should be listed.

Make it a member and set its priority to 2. Click the Submit button when finished.

8. Relocate the webby service to node3 to test the new configuration, while monitoring the
status of the service.

Monitor the service from luci's interface, or from any node in the cluster run the clustat -i 1
command.

To relocate the service in luci, traverse the menus to the webby service (Cluster List -->
webby), then choose "Relocate this service to node3.clusterX.example.com" from the Choose a
Task... drop-down menu near the top. Click the Go button when finished.

Alternatively, from any cluster node run the command:

node1# clusvcadm -r webby -m node3.clusterX.example.com

Verify the web page is accessible and that node3 is the node with the 172.16.50.X6 IP
address (Note: the ifconfig command won't show the address, you must use the ip command).

stationX# elinks -dump http://172.16.50.X6/index.html

node3# ip addr list

9. Troubleshooting: In rare cases luci fails to propagate /etc/cluster/cluster.conf to a
newly added node. Without the config file cman cannot start properly. If the third node cannot
join the cluster, check if the file exists on node3. If it doesn't, copy the file manually from another
node and restart the cman service manually.

10. View the current voting and quorum values for the cluster, either from luci's Cluster List view
or from the output of the command cman_tool status on any cluster node.

node1# cman_tool status

Nodes: 3
Expected votes: 3
Total votes: 3
Quorum: 2
(output truncated for brevity)

11. Currently, the cluster needs a minimum of 2 nodes to remain quorate. Let's test this by shutting
down our nodes one by one.

On node1, continuously monitor the status of the cluster with the clustat command, then
poweroff -f node3.

node1# clustat -i 1

node3# poweroff -f



Which node did the service failover to, and why?

The service should have failed over to node1 because it has a higher priority in the
prefer_node1 failover domain (the name is a clue!).

Verify the web page is still accessible.

stationX# elinks -dump http://172.16.50.X6/index.html

12. Check the values for cluster quorum and votes again.

node1# cman_tool status


Nodes: 2
Expected votes: 3
Total votes: 2
Quorum: 2

(There can be a delay in the information update. If your output does not agree with this, wait a
minute and run the command again.)

Go ahead and poweroff -f node2.

node2# poweroff -f

13. Does the service stop or fail? Why or why not?

Now only a single node is online, the cluster lost quorum and the service is no longer active.

Check the values for cluster quorum and votes again.

node1# cman_tool status


Nodes: 1
Expected votes: 3
Total votes: 1
Quorum: 2 Activity blocked

14. Re-start nodes 2 and 3, and once again query the cluster quorum and voting values. Have they
returned to their original settings?

stationX# xm create node2

stationX# xm create -c node3

Verify all three nodes have rejoined the cluster by running the clustat command
and ensuring that all three nodes have "Online, rgmanager" listed in their status field.

As soon as the two nodes are online again, the cluster adjusts the values back to their original
state automatically.

node3# cman_tool status


Nodes: 3
Expected votes: 3
Total votes: 3



Quorum: 2



Lab 8.2 Solutions
1. First, inspect the current post_join_delay and config_version parameters on both
node1 and node2.

node1,2# cd /etc/cluster
node1,2# grep config_version cluster.conf
node1,2# grep post_join_delay cluster.conf
node1,2# cman_tool version
node1,2# cman_tool status | grep Version

2. On node1, edit the cluster configuration file, /etc/cluster/cluster.conf, and
increment the post_join_delay parameter from its default setting to a value that is one
integer greater (e.g. change post_join_delay=3 to post_join_delay=4). Do not exit
the editor, yet, as there is one more change we will need to make.

3. Whenever the cluster.conf file is modified, it must be updated with a new integer version
number. Increment your cluster.conf's config_version value (keep the double
quotes around the value) and save the file.

4. On node2, verify (but do not edit) its cluster.conf still has the old values for the
post_join_delay and config_version parameters.

a. node2# cd /etc/cluster

node2# grep config_version cluster.conf
node2# grep post_join_delay cluster.conf
node2# cman_tool version
node2# cman_tool status | grep Version

5. On node1, update the CCS with the changes, then use ccsd to propagate them to the other
nodes in the cluster. Re-verify the information on node2. Was the post_join_delay and
config_version updated on node2? Is cman on node2 aware of the update?

a. node1# ccs_tool update /etc/cluster/cluster.conf

node2# grep config_version cluster.conf
node2# grep post_join_delay cluster.conf
node2# cman_tool version
node2# cman_tool status | grep "Config Version"



b. The changes should have been propagated to node2 (and node3) and cman updated by
the ccs_tool command.



Lab 8.3 Solutions
1. First, verify our current number of journals.

node1# gfs2_tool journals /mnt/gfs

2. Confirm that the third node can currently not mount the GFS2 filesystem.

node3# mkdir /mnt/gfs
node3# mount /dev/ClusterVG/gfslv /mnt/gfs
/sbin/mount.gfs2: error mounting /dev/mapper/ClusterVG-gfslv on /mnt/gfs:
Invalid argument

3. Verify that there is enough space on the filesystem to add another 128MB journal.

node1# df -h | grep gfs

4. Add two more journals with the same size.

node1# gfs2_jadd -j 2 /mnt/gfs

5. Verify that the available space has been reduced by 2*128MB.

node1# df -h | grep gfs

6. Now mount the GFS2 filesystem on the third node. This time the command should succeed.

node3# mount /dev/ClusterVG/gfslv /mnt/gfs

Lecture 9

Fencing and Failover

Upon completion of this unit, you should be able to:


• Define Fencing

• Describe Fencing Mechanisms
• Explain CCS Fencing Configuration



No-fencing Scenario 9-1

• What could happen if we didn't use fencing?


• The live-hang scenario:
• Three-node cluster: nodes A, B, C
• Node A hangs with I/Os pending to a shared file system
• Node B and node C decide that node A is dead, so they recover resources allocated by node A,
including the shared file system
• Node A "wakes up" and resumes normal operation
• Node A completes I/Os to the shared file system
• Data corruption ensues...

If a node has a lock on GFS metadata and live-hangs long enough for the rest of the cluster to think it is
dead, the other nodes in the cluster will take over its I/O for it. A problem occurs if the (wrongly considered
dead) node wakes up and still thinks it has that lock. If it proceeds to alter the metadata, thinking it is safe to
do so, it will corrupt the shared file system.

If you're lucky, gfs_fsck will fix it; if you're not, you'll need to restore from backup. I/O fencing prevents the
"dead" node from ever trying to resume its I/O to the storage device.



Fencing Components 9-2

• The I/O fencing system has two components:


• Fence daemon: receives fencing requests as service events from cman
• Fence agent: a program to interface with a specific type of fencing hardware
• The fencing daemon determines how to fence the failed node by looking up the
information in CCS
• Starting and stopping fenced
• Automatically by cman service script
• Manually using fence_tool

The fenced daemon is started automatically by the cman service:

# service cman start


Starting cluster:
Loading modules... done
Mounting configfs... done
Starting ccsd... done
Starting cman... done
Starting daemons... done
Starting fencing... done
[ OK ]

fence_tool is used to join or leave the default fence domain, by either starting fenced on the node to
join, or killing fenced to leave. Before joining or leaving the fence domain, fence_tool waits for the
cluster to be in a quorate state.

The fence_tool join -w command waits until the join has actually completed before returning. It is the same
as fence_tool join; fence_tool wait.
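A minimal sketch of joining and leaving the default fence domain with the commands named above:

# fence_tool join -w
# fence_tool leave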



Fencing Agents 9-3

• Customized script/program for popular hardware fence devices


• Included in the cman package
• /sbin/fence_*
• Usually Perl or Python scripts
• "fence_<agenb h" or view man page to display agent's options
-

• fence_node
• Generic command-line agent that queries CCS for the proper agent and parameters to use
• Supported fence devices:
• http://www.redhat.com/cluster_suite/hardware

Example fencing device CCS definition in cluster. conf:

<fencedevices>
<fencedevice agent="fence_apc" ipaddr="172.16.36.107" login="nps"
 name="apc" passwd="password"/>
</fencedevices>

The fence_node program accumulates all the necessary CCS information for I/O fencing a particular node
and then performs the fencing action by issuing a call to the proper fencing agent.

The following fencing agents are provided by Cluster Suite at the time of this writing:

fence_ack_manual - Acknowledges a manual fence


fence_apc - APC power switch
fence_bladecenter - IBM Blade Center
fence_brocade - Brocade Fibre Channel fabric switch.
fence_bullpap - Bull PAP
fence_drac - DRAC
fence_egenera - Egenera SAN controller
fence_ilo - HP iLO device
fence_ipmilan - IPMI Lan
fence_manual - Requires human interaction
fence_mcdata - McData SAN switch
fence_rps10 - RPS10 Serial Switch
fence_rsa - IBM RSA II Device
fence_rsb - Fujitsu-Siemens RSB management interface
fence_sanbox2 - QLogic SANBox2
fence_scsi - SCSI persistent reservations
fence_scsi_test - Tests SCSI persistent reservations capabilities
fence_vixel - Vixel SAN switch
fence_wti - WTI network power switch
fence_xvm - Xen virtual machines
fence_xvmd - Xen virtual machines

Because manufacturers come out with new models and new microcode all the time, forcing us to change
our fence agents, we recommend that the source code in CVS be consulted for the very latest devices to
see if yours is mentioned: http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/fence/
agents/?cvsroot=cluster



Power Fencing versus Fabric Fencing 9-4

• Power fencing
• Networked power switch (STONITH)
• Configurable action:
• Turn off power outlet, wait N seconds, turn outlet back on
• Turn off power outlet

• Fabric fencing
• At the switch
• At the device (e.g. iSCSI)
• Both fencing mechanisms:
• Separate a cluster node from its storage
• Must be accessible to all cluster nodes
• Are supported configurations
• Can be combined (cascade fencing, or both at once)

Two types of fencing are supported: fabric (e.g. Fibre Channel switch or SCSI reservations) and power
(e.g. a networked power switch). Power fencing is also known as STONITH ("Shoot The Other Node In The
Head"), a gruesome analogy to a mechanism for bringing an errant node down completely and quickly.

While both do the job of separating a cluster node from its storage, Red Hat recommends power fencing
because a system that is forced to power off or reboot is an effective way of preventing (and sometimes
fixing) a system from wrongly and continually attempting an unsafe I/O operation on a shared storage
resource. Power fencing is the only way to be completely sure a node has no buffers waiting to flush to the
storage device after it has been fenced.

Arguments for fabric fencing include the possibility that the node might have a reproducible error that keeps
occurring across reboots, another mission-critical non-clustered application on the node in question that
must continue, or simply that the administrator wants to debug the issue before resetting the machine.

Combining both fencing types is discussed in a later slide (Fencing Methods).



SCSI Fencing 9-5

• Components
• /etc/init.d/scsi_reserve
• generates a unique key
• creates a registration with discovered storage devices
• creates a reservation if necessary.

• /sbin/fence_scsi
• removes registration/reservation of failed node
• that node is no longer able to access the volume

• fence_scsi_test
• tests if a storage device is supported.

• Limitations
• all nodes must have access to all storage devices
• requires at least three nodes
• multipathing only supported with dm-multipath
• the TGTD software target does not support scsi fencing at the moment

Registration: A registration occurs when a node registers a unique key with a device. A device can have
many registrations. For scsi fencing, each node will create a registration on each device.

Reservation: A reservation dictates how a device can be accessed. In contrast to registrations, there
can be only one reservation on a device at any time. The node that holds the reservation is known as the
"reservation holder". The reservation defines how other nodes may access the device.

For example, fence_scsi uses a "Write Exclusive, Registrants Only" reservation. This type of reservation
indicates that only nodes that have registered with that device may write to the device.

Fencing. The fence_scsi agent is able to perform fencing via SCSI persistent reservations by simply
removing a node's registration key from all devices. When a node failure occurs, the fence_scsi agent will
remove the failed node's key from all devices, thus preventing it from being able to write to those devices.
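As a side note (not part of the fence_scsi agent itself), if the sg3_utils package is installed you
can inspect the current registrations and reservation on a shared device; a sketch, assuming
/dev/sda is one of the shared devices:

# sg_persist -i -k -d /dev/sda      (list registered keys)
# sg_persist -i -r -d /dev/sda      (show the current reservation)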



Fencing From the Command Line 9-6

• Faster/Easier than a manual login to a networked power switch


• Power switches usually allow only one login at a time
• Using the fencing agent directly:

fence_apc -a 172.16.36.101 -l nps -p password -n 3 -v -o reboot

• Using CCS for proper fencing agent and options:

fence_node node1.clusterX.example.com

• Using CMAN:

cman_tool kill -n node1.clusterX.example.com

Manually logging in to a network power switch (NPS) to power cycle a node has two related problems: the
(relatively slow) human interaction and the power switch potentially being tied up while the slow interaction
completes.

Most power switches allow (or are configured to allow) only one login at a time. While you are negotiating
the menu structure of the switch, what happens if another node needs to be fenced?

Best practices dictate that command-line fencing be scripted or a "do-everything" command line be used to
get in and out of the network switch as fast as possible.

In the example above where the fencing agent is accessed directly, the command connects to an APC
network power switch using its customized fencing script with a userid/password of "nps/password", reboots
node 3, and logs the action in /tmp/apclog.

The command: fence_<agent> -h can be used to display the full set of options available from a fencing
agent.



The Fence Daemon - fenced 9-7

• Started automatically by cman service script


• Depends upon CMAN's cluster membership information for "when" and "who" to
fence
• Depends upon CCS for "how" to fence
• Fencing does not occur unless the cluster has quorum
• The act of initiating a fence must complete before GFS can be recovered
• Joining a fence domain implies being subject to fencing and possibly being asked to
fence other domain members

A node that is not running fenced is not permitted to mount GFS file systems. Any node that starts fenced,
but is not a member of the cluster, will be automatically fenced to ensure its status with the cluster.

Failed nodes are not fenced unless the cluster has quorum. If the failed node causes the loss of quorum, it
will not be fenced until quorum has been re-established.

If an errant node that caused the loss of quorum rejoins the cluster (maybe it was just very busy and
couldn't communicate a heartbeat to the rest of the cluster), any pending fence requests are bypassed for
that node.



Manual Fencing 9-8

• Not supported!
• Useful only in special non-production environment cases
• Agents: fence_manual / fence_ack_manual
• Evicts node from cluster / cuts off access to shared storage
• Manual intervention required to bring the node back online
• Do not use as a primary fencing agent
• Manual fencing (as primary fencing agent) sequence example:
1. Nodes A and B are in the same cluster fence domain
2. B dies
3. A automatically fences B using fence_manual and prints a message to syslog
4. System administrator power cycles B manually or otherwise ensures that it has been fenced from the shared
storage by other actions
5. System administrator runs fence_ack_manual on A to acknowledge successful fencing of the failed node, B
6. A replays B's journal
7. Services from B failover to A

The fence_manual agent is used to evict a member node from the cluster. Human interaction is required
on behalf of the faulty node to rejoin the cluster, often resulting in more overhead and longer downtimes.

The system administrator must manually reset the faulty node and then manually acknowledge that the
faulty node has been reset (fence_ack_manual) from another quorate node before the node is allowed to
rejoin the cluster.

If the faulty node is manually rebooted and is able to successfully rejoin the cluster after bootup, that is also
accepted as an acknowledgment and completes the fencing. Do not use this as a primary fencing device!

Example cluster.conf section for manual fencing:

<clusternodes>
<clusternode name="node1" votes="1">
<fence>
<method name="single">
<device name="human" ipaddr="10.10.10.1"/>
</method>
</fence>
</clusternode>
</clusternodes>
<fencedevices>
<device name="human" agent="fence_manual"/>
</fencedevices>



Fencing Methods 9-9

• Grouping mechanism for fencing agents


• Allows for "cascade fencing"
• A fencing method must succeed as a unit or the next method is tried
• Fencing method example:

<fence>
    <method name="1">
        <device name="fence1" port="1" option="reboot"/>
        <device name="fence1" port="2" option="reboot"/>
    </method>
    <method name="2">
        <device name="brocade" port="1"/>
    </method>
</fence>

A <method> block can be used when more than one fencing device should be triggered for a single fence
action, or for cascading fence events to define a backup method in case the first fence method fails.

The fence daemon will call each fence method in the order they are specified within the <fence> tags.
Each <method> block should have a unique name parameter defined.

Within a <method> block, more than one device can be listed. In this case, the fence daemon will run the agent for each device listed before determining if the fencing action was a success or failure.

For the above example, imagine a dual power supply node that fails and needs to be fenced. Fencing method "1" power cycles both network power switch ports (the order is indeterminate), and they must succeed as a unit to properly remove power from the node. If only one succeeds, the fencing action should fail as a whole.

If fencing method "1" fails, the fencing method named "2" is tried next. In this case, fabric fencing is used as
the backup method.

This is sometimes referred to as "cascade fencing".

Fencing Example - Dual Power Supply 9-10

• Must guarantee a point at which both outlets are off at the same time
• Two different examples for fencing a dual power supply node:

<fence>
    <method name="1">
        <device name="fence1" port="1" option="off"/>
        <device name="fence1" port="2" option="reboot"/>
        <device name="fence1" port="1" option="on"/>
    </method>
    <method name="2">
        <device name="fence1" port="1" option="off"/>
        <device name="fence2" port="2" option="off"/>
        <device name="fence1" port="1" option="on"/>
        <device name="fence2" port="2" option="on"/>
    </method>
</fence>

Some devices have redundant power supplies, both of which need to be power cycled in the event of a
node failure.

Consider the differences between the different fence methods above. In fencing methods 1 and 2, there is no point at which the first outlet could possibly be turned back on before the second outlet is turned off. This is the proper mechanism to ensure fencing with dual-power supply nodes.

Notice also that in method 2, if the fence1 and fence2 networked power switches are powered by two separate UPS devices, a failure of any one UPS will not cause our machine to lose power. This is not the case for method 1. For this reason, method 2 is far preferred in High Availability (HA) solutions with redundant power supplies.

A less deterministic solution is to configure a longer delay in the outlet power cycle (if the switch is capable
of it), but this will also delay the entire fencing procedure, which is never a good idea.

In the case where fencing fails altogether, the cluster will retry the operation.

What could go wrong in the following method?

<method name="3">
    <device name="fence1" port="1" option="reboot"/>
    <device name="fence1" port="2" option="reboot"/>
</method>

In this fencing method, if the network power switch's outlet off/on cycle is very short, and/or if fenced hangs
between the two, there exists the possibility that the first power source might have completed its power
cycle before the other is cycled, resulting in no effective power loss to the node at all. When the second
fencing action completes, the cluster will think that the errant node has been turned off and file system
corruption is sure to follow.



Handling Software Failures 9-11

• Not all failures result in a fencing action


• Resource agents
• Monitor individual resources
• Handle restarts, if necessary
• If a resource fails and is correctly restarted, no other action is taken
• If a resource fails to restart, the action is per-service configurable:
• Relocate
• Restart
• Disable

Resource agents are scripts or executables which handle operations for a given resource (such as start,
stop, restart, status, etc...).

In the event a resource fails to restart, each service is configurable in the resulting action. The service can
either be relocated to another quorate node in the cluster, restarted on the same node, or disabled.

"Restart" tríes to restart failed parts of this resource group locally before attempting to relocate (default);
"relocate" does not bother trying to restart the service locally; "disable" disables the resource group if any
component fails. Note that any resource which can be recovered without a restart will be.
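As a sketch only (the service name is arbitrary and not taken from the course lab), the per-service recovery policy is carried by the recovery attribute of the <service> tag in cluster.conf:

<service name="webby" autostart="1" recovery="relocate">
    <!-- member resources (fs, ip, script, ...) go here -->
</service>

Valid recovery values are restart (the default), relocate, and disable.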

Handling Hardware Failures 9-12

• Hardware/Cluster failures
• If service status fails to respond, node is assumed to be errant
• Errant node's services are relocated/restarted/disabled, and the node is fenced
• If a NIC or cable fails, the service will be relocated/restarted/disabled
• Double faults
• Usually difficult or impossible to choose a universally correct course of action to take

If the cluster infrastructure evicts a node from the cluster, the cluster manager selects new nodes for the
services that were running based on the failover domain, if one exists.

If a NIC fails or a cable is pulled (but the node is still a member of the cluster), the service will be either
relocated, restarted, or disabled.

With double hardware faults, it is usually difficult or impossible to choose a universally correct course of action when one occurs. For example, consider a node that loses power versus one that has all of its network cables pulled. Has that node stopped I/O to disk or not?



Failover Domains and Service Restrictions 9-13

• Failover domain: list of nodes to which a service may be bound


• Specifies where cluster manager should relocate a failed node's service
• Restricted
• A service may only run on nodes in its domain
• If no nodes are available, the service is stopped
• Unrestricted
• A service may run on any cluster node, but prefers its domain
• If a service is running outside its domain, and a domain node becomes available, the service will
migrate to that domain node
• Exclusive Service
• May affect list of nodes available to service
• Specifies service will only start on a node which has no other services running

Which cluster nodes may run a particular virtual service is controlled through failover domains. A failover
domain is a named subset of the nodes in the cluster which may be assigned to take over a service in case
of failure.

An unrestricted failover domain is a list of nodes which are preferred for a particular network service. If none
of those nodes are available, the service may run on any other node in the cluster, even though it is not in
the failover domain for that service. A restricted failover domain mandates that the virtual service may only
run on nodes which are members of the failover domain. Unrestricted is the default.

Exclusive service, an attribute of the service itself and not of the failover domain, is used to failover a
service to a node if and only if no other services are running on that node.

In RHEL 5.2 versions of Conga and newer, there is a new nofailback option that can be configured in the failoverdomain section of cluster.conf. Enabling this option for an ordered failover domain will prevent automated fail-back after a more-preferred node rejoins the cluster. For example:

<failoverdomain name="test_failover_domain" ordered="1" restricted="1" nofailback="1">

Failover Domains and Prioritization 9-14

• Prioritized (Ordered)
• Each node is assigned a priority between 1-100 (1=highest)
• Higher priority nodes are preferred by the service
• If a node of higher priority transitions, the service will migrate to it
• Non-prioritized (Unordered)
• All cluster nodes have the same priority and may run the service
• Services always migrate to members of their domain whenever possible
• Any combination of ordered/unordered and restricted/unrestricted is allowed

In a prioritized failover domain, services from failed nodes will be moved preferentially to similarly prioritized
nodes if they exist, or to a node with the next highest priority. In a non-prioritized failover domain, a service
may be started up on any available node in the domain (they all have the same priority). Non-prioritized is
the default.

Failover domains are particularly useful in multiple-node clusters which run multiple virtual services in an Active-Active mode. For instance, consider two services running on a four-node cluster in two unrestricted failover domains, each made up of two nodes. In normal operation the services, in effect, have their own private Active-Passive two-node cluster. If both nodes in one failover domain fail, the service may move onto one of the remaining two nodes normally used by the other service.
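A cluster.conf sketch of an ordered, unrestricted failover domain tied to a service (the domain, node, and service names here are illustrative only, not part of this course's lab setup):

<rm>
    <failoverdomains>
        <failoverdomain name="prefer_node1" ordered="1" restricted="0" nofailback="0">
            <failoverdomainnode name="node1.example.com" priority="1"/>
            <failoverdomainnode name="node2.example.com" priority="2"/>
        </failoverdomain>
    </failoverdomains>
    <service name="webby" domain="prefer_node1" .../>
</rm>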

NFS Failover Considerations 9-15

• Filesystem Identification (fsid)


• Identifies an NFS export
• GUI tools automatically generate a default value
• Normally derived from block device's major/minor number
• Can be manually set to a non-zero 32-bit integer value
• Must be unique amongst all exported filesystems
• Device mapper doesn't guarantee same major/minor for all cluster nodes
• Ensures failover servers use the same NFS file handles for shared filesystems
• Avoids stale file handles

The fsid=N (where N is a 32-bit positive integer) NFS mount option forces the filesystem identification portion of the exported NFS file handle and the file attributes used in cluster NFS communications to be N instead of a number derived from the major/minor numbers of the block device on which the filesystem is mounted. The fsid must be unique amongst all the exported filesystems.

During NFS failover, a unique hard-coded fsid ensures that the same NFS file handles for the shared file system are used, avoiding stale file handles after NFS service failover.

Note:
Typically the fsid would be specified as part of the NFS Client resource options, but that would be very bad if that NFS Client resource were reused by another service: the same client could potentially have the same fsid on multiple mounts.

Starting with RHEL4 update 3, the Cluster Configuration GUI allows users to view and modify an auto-
generated default fsid value.
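For illustration only (the resource name, target network, and fsid value below are invented), a hard-coded fsid supplied through an NFS Client resource's export options might look like:

<nfsclient name="webclients" target="172.16.0.0/16" options="rw,fsid=4242"/>

As the note above warns, such an NFS Client resource should not be reused by a second service, or the same client could end up with the same fsid on multiple mounts.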

clusvcadm 9-16

• Cluster service administration utility



• Requires cluster daemons be running (and quorate) on invoking system
• Base capabilities:
• Enable/Disable/Stop

• Restart
• Relocate
• Can specify target for service relocation
• Example:

# clusvcadm -r webby -m node1.example.com


There is a subtle difference between a stopped and disabled service. When the service is stopped, any
cluster node transition causes the service to start again. When the service is disabled, the service remains
disabled even when another cluster node is transitioned.

A service named webby can be manually relocated to another machine in the cluster, named node1.example.com, using the following command, so long as the machine on which the command was executed is running all the cluster daemons and the cluster is quorate:

# clusvcadm -r webby -m node1.example.com




End of Lecture 9

• Questions and Answers


• Summary
• Define Fencing
• Describe Fencing Mechanisms
• Explain CCS Fencing Configuration

Lab 9.1: Node Priorities and Service Relocation
Scenario: Services can be configured to be relocated to another cluster node upon failure of a resource tied to the service. In other situations, the system administrator might purposely want to relocate the service to another node in the cluster in order to safely bring down and work on a node currently running a critical service. In both situations, it is important to understand how to configure the automatic relocation of a service and how to manually relocate a service.

Deliverable: In this lab we use your previously-created three-node cluster to explore node priorities and their effect on service relocation sites.

Instructions:

1. Starting with your previously-created 3-node cluster, log into the luci interface from your
local machine.

2. From the "Lucí Homebase" page, select the cluster tab near the top and then select "Cluster
List" from the left sidebar. From the "Choose a cluster to administer" page, select the first node
in your cluster (nodel . c lus t e rX . example . com).

In a separate terminal window, log into node2 and monitor the cluster status.

3. With the cluster status window in clear view, go back to the luci interface and select the drop-down menu near the Go button in the upper right corner.

From the drop-down menu, select "Reboot this node" and press the Go button.

What happens to the webby service while node1 is rebooting?

What happens to the webby service after node1 comes back online (wait up to 1 minute after it is back online)? Why?

4. Navigate within luci to the "Nodes" view of your cluster. This view shows which services are running on which nodes (note: you may have to click the refresh button in your browser for an updated view), and the failover domain each node is a member of (prefer_node1 in this case).

5. node1 might require a longer outage (for example, if it required maintenance). Select "Have node leave cluster" from the "Choose a Task..." drop-down menu and press the Go button.

Once node1 has left the cluster, clustat will report it as "offline", and luci (which might require refreshing) will show the cluster node's name in red (as opposed to green).

Bring node1 back into the cluster ("Have node join cluster") once it is offline. The webby service should migrate back to node1.

6. The service can also be restarted, disabled, re-enabled, and relocated from the command line
using the clusvcadm command.



While monitoring the cluster status on one of the cluster nodes from a separate terminal window, execute the following commands on node1 (it is assumed that the service is currently running on node1) to see the effect each command has on the service's location.

clusvcadm -r webby
clusvcadm -d webby
clusvcadm -e webby
clusvcadm -s webby
clusvcadm -e webby -m node1.clusterX.example.com
clusvcadm -s webby
clusvcadm -r webby -m node2.clusterX.example.com
clusvcadm -d webby
clusvcadm -r webby -m node1.clusterX.example.com
clusvcadm -e webby
clusvcadm -r webby
clusvcadm -r webby

What's the difference between stopped and disabled? (Hint: what happens when any node in the
cluster transitions (joins/leaves the cluster) when in each state?)

7. Make sure the service is currently running on node1. On node2 run the command:

clustat -i 1

8. While viewing the output of clustat in one window, open a console connection to node1 and run the command:

ifdown eth1

What happens? (Note: it could take 30s or so to see the action begin.) Once node1 is back online, where is the service running now?

9. You can also try the same experiment by rebooting a node directly, or using the CLI interface to
fence a cluster node from any cluster node. For example, to reboot node3:

fence_xvm -H node3

A node can also be fenced using the command:

fence_node node1.clusterX.example.com

Note:

In the first instance, the node name must correspond to the name of the node's virtual
machine as known by Xen, and in the second instance the node name is that which is
defined in the cluster.conf file.



Lab 9.1 Solutions
1. Starting with your previously-created 3-node cluster, log into the luci interface from your
local machine.

# firefox https://stationX.example.com:8084/

(Login Name: admin, Password: redhat)

2. From the "Lucí Homebase" page, select the cluster tab near the top and then select "Cluster
List" from the left sidebar. From the "Choose a cluster to administer" page, select the first node
in your cluster (nodel . c lust e rX. example . com ).

In a separate terminal window, log into node2 and monitor the cluster status.

node2# clustat -i 1

3. With the cluster status window in clear view, go back to the luci interface and select the drop-down menu near the Go button in the upper right corner.

From the drop-down menu, select "Reboot this node" and press the Go button.

What happens to the webby service while node1 is rebooting? [The service is stopped and relocated to another valid cluster node.]

What happens to the webby service after node1 comes back online (wait up to 1 minute after it is back online)? Why? [Up to 1 minute after node1 is back online, the service is relocated back to node1. It does this because we specified that node1 had a higher priority in our failover domain definition (prefer_node1).]

4. Navigate within luci to the "Nodes" view of your cluster. This view shows which services are running on which nodes (note: you may have to click the refresh button in your browser for an updated view), and the failover domain each node is a member of (prefer_node1 in this case).

5. node1 might require a longer outage (for example, if it required maintenance). Select "Have node leave cluster" from the "Choose a Task..." drop-down menu and press the Go button.

Once node1 has left the cluster, clustat will report it as "offline", and luci (which might require refreshing) will show the cluster node's name in red (as opposed to green).

Bring node1 back into the cluster ("Have node join cluster") once it is offline. The webby service should migrate back to node1.

6. The service can also be restarted, disabled, re-enabled, and relocated from the command line
using the clusvcadm command.

While monitoring the cluster status on one of the cluster nodes from a separate terminal window, execute the following commands on node1 (it is assumed that the service is currently running on node1) to see the effect each command has on the service's location.

clusvcadm -r webby [relocates service from node1]


clusvcadm -d webby [disables service]
clusvcadm -e webby [re-enables service]
clusvcadm -s webby [stops service]
clusvcadm -e webby -m node1.clusterX.example.com [starts/enables service on node1]
clusvcadm -s webby [stops service]
clusvcadm -r webby -m node2.clusterX.example.com [starts and relocates service to node2]
clusvcadm -d webby [disables service]
clusvcadm -r webby -m node1.clusterX.example.com [Invalid operation, remains disabled]
clusvcadm -e webby [starts/enables service on node1]
clusvcadm -r webby [relocates service to node2]
clusvcadm -r webby [relocates service to node1]

What's the difference between stopped and disabled? (Hint: what happens when any node in the
cluster transitions (joins/leaves the cluster) when in each state?)
When the service is stopped, any cluster node transition causes the service to start again.
When the service is disabled, the service remains disabled even when another cluster node is
transitioned.
7. Make sure the service is currently running on node1. On node2 run the command:

clustat -i 1

8. While viewing the output of clustat in one window, open a console connection to node1 and run the command:

ifdown eth1

What happens? (Note: it could take 30s or so to see the action begin.) Once node1 is back online, where is the service running now?
9. You can also try the same experiment by rebooting a node directly, or using the CLI interface to
fence a node. For example, to reboot node3:
fence_xvm -H node3

A node can also be fenced using the command:


fence_node node3.clusterX.example.com

Note:

In the first instance, the node name must correspond to the name of the node's virtual
machine as known by Xen, and in the second instance the node name is that which is
defined in the cluster.conf file.




Lecture 10



Quorum Disk

Upon completion of this unit, you should be able to:
• Become more familiar with quorum disk and how it affects quorum voting.
• Understand heuristics.





Quorum Disk 10-1

• A partial solution to two-node cluster fencing races


• "Tie-breaker"
• Appears like a voting node to cman

• Allows flexibility in the number of cluster nodes required to maintain quorum


• Requires no user intervention
• Mechanism to add quorum votes based on whether arbitrary tests pass on a
particular node
• One or more user-configurable tests, or "heuristics", must pass
• qdisk daemon runs on each node to heartbeat test status through shared storage independent of
cman heartbeat

• Part of the cman package


• Available since RHEL4U4

A quorum disk allows the configuration of arbitrary, cluster-independent heuristics each cluster member
can use to determine its fitness for participating in a cluster, especially for the handling of network-partition
("split-brain") scenarios or when a majority of the cluster members fail. The quorum disk contains the cluster
state and timestamp information.

The fitness information is communicated to other cluster members via a "quorum disk" residing on shared
storage.

The quorum disk daemon requires a shared block device with concurrent read/write access from all nodes
in the cluster. The shared block device can be a multi-port SCSI RAID array, a Fiber-Channel RAID SAN, a
RAIDed iSCSI target, or even GNBD. The Quorum daemon uses O_DIRECT to write to the device.

Quorum disks are limited to 16 nodes in a cluster. Cluster node IDs must be statically configured in cluster.conf and must be numbered sequentially from 1 to 16 (gaps in the numbering are allowed). The cman service must be running before the quorum disk can start.
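A minimal sketch of the static node ID numbering in cluster.conf (the node names here are generic placeholders):

<clusternodes>
    <clusternode name="node1" nodeid="1" votes="1" .../>
    <clusternode name="node2" nodeid="2" votes="1" .../>
    <clusternode name="node3" nodeid="3" votes="1" .../>
</clusternodes>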

Quorum Disk Communications 10-2

• Quorum Disk communicates with:
• cman - quorum-device availability or heuristics result
• ccsd - configuration information
• Shared storage - check and record states

Quorum Disk communicates with cman, ccsd (the Cluster Configuration System daemon), and shared
storage. It communicates with cman to advertise quorum-device availability. It communicates with ccsd to
obtain configuration information. It communicates with shared storage to check and record states.



Quorum Disk Heartbeating and Status 10-3

• Cluster nodes update individual status blocks on the quorum disk


• Heartbeat parameters are configured in cluster.conf's quorumd block
• Update frequency is every interval seconds
• The timeliness and content of the write provides an indication of node health
• Other nodes inspect the updates to determine if a node is hung or not
• A node is declared offline after tko failed status updates
• A node is declared online after a tko_up number of status updates
• Quorum disk node status information is communicated to cman via an elected
quorum disk master node
• cman's eviction timeout (post_fail_delay) should be 2x the quorum daemon's
• Helps provide adequate time during failure and load spike situation

Every interval seconds, nodes write some basic information to its own individual status block on the
quorum disk. This information (timestamp, status (available/unavailable), bitmask of other nodes it thinks
are online, etc...) is inspected by all the other nodes to determine if a node is hung or has otherwise lost
access to the shared storage device. If a node fails to update its status tko times in a row, it is declared
offline and is unable to count the quorum disk votes when its quorum status is calculated. If a node starts
to write to the quorum disk again, it will be declared online after a tko_up number of status updates
(default=tko/3).

Example opening quorumd block tag in cluster.conf:

<quorumd interval="1" tko="10" votes="1" label="testing">

Quorum Disk Heuristics 10-4


• A quorum disk may contribute votes toward the cluster quorum calculation
• 1 to 10 arbitrary heuristics (tests) are used to determine if the votes are contributed or not
• Heuristics are in a <heuristic> block contained within the <quorumd> block
• Each heuristic is configured with score number of points
• Heuristic
• Any command executable by sh -c "command string" producing a true/false result
• Allows quorum decisions to be made based upon external, cluster-independent tests
• Should help determine a node's usefulness to the cluster or clients
• Outcome determination:
• min_score defined in the quorumd block, or
• floor((n+1)/2), where n is the sum total points of all heuristics
• Example:

<quorumd interval="1" tko="10" votes="1" label="testing">
    <heuristic program="ping A -c1 -t1" score="1" interval="2" tko="3"/>
</quorumd>

A heuristic is an arbitrary test executed in order to help determine a result.




The quorum disk mechanism uses heuristics to help determine a node's fitness as a cluster node in addition
to what the cluster heartbeat provides. It can, for example, check network paths (e.g. ping'ing routers) and
availability of shared storage.

The administrator can configure 1 to 10 purely arbitrary heuristics. Nodes scoring over 1/2 of the total points offered by all heuristics (or min_score, if it is defined) become eligible to claim the votes offered by the quorum daemon in cluster quorum calculations.

The heuristics themselves can be any command string executable by sh -c <string>. For example:

<heuristic program="[ -f /quorum ]" score="1" interval="2"/>

This shell command tests for the existence of a file called "/quorum". Without that file, the node would claim it was unavailable.




Quorum Disk Configuration 10-5

• A quorum disk can be configured using any of:


• Conga
• system-config-cluster
• Manual edit of /etc/cluster/cluster.conf

• With two-node clusters as a tie-breaker:


• Give the quorum disk one vote and set <cman two_node="0" ...> in cluster.conf

• To allow quorum even if only one node is up:


• Votes should be set so that the votes of the qdisk and a single node meet or exceed the quorum requirement
• If all nodes have a single vote, qdisk should receive (number of nodes / 2) - 1 votes

• Requirements:
• The quorum disk must reside on a shared storage device
• Cluster node IDs should be numbered sequentially in cluster.conf
• qdiskd uses node IDs for logging

• The cman service must be running


• service qdiskd start ; chkconfig qdiskd on

Quorum Disk was first made available in RHEL4U4. For that release only, Quorum Disk must be configured by manually editing the cluster configuration file, /etc/cluster/cluster.conf. In all releases since then, Quorum Disk is also configurable using system-config-cluster (only at cluster creation time) and Conga.

If the quorum disk is on a logical volume, qdiskd cannot start until clvmd is first started. A potential issue
is that clvmd cannot start until the cluster has established quorum, and quorum may not be possible
without qdiskd. A suggested workaround for this circular issue is to not set the cluster's expected votes to
include the qdiskd daemon's votes. Bring all nodes online, and start the qdiskd daemon only after the whole
cluster is running. This allows the expected votes to increase naturally.
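As a rough sketch of that workaround (assuming the stock RHEL5 init scripts are used; this is not a required procedure), the per-node start order would look something like:

# service cman start       # join the cluster and establish quorum first
# service clvmd start      # clustered LVM, so a quorum disk on a logical volume becomes visible
# service qdiskd start     # start the quorum disk daemon only once the whole cluster is up
# chkconfig qdiskd on      # make qdiskd persistent across reboots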

More information about Quorum Disk is available in the following man pages: mkqdisk(8), qdiskd(8), and
qdisk(5).

Working with Quorum Disks 10-6

• Constructing a quorum disk:
• mkqdisk -c device -l label
• Listing all quorum disks:
• mkqdisk -L
• Getting information on a quorum disk:
• mkqdisk -d -f label

Before creation of the quorum disk, it is assumed that the cluster is configured and running. This is because
it is not possible to configure the quorum heuristics from the system-config-cluster tool.

To create a quorum disk use the Cluster Quorum Disk (mkqdisk) Utility. The mkqdisk command is used
to create a new quorum disk or display existing quorum disks accessible from a given cluster node. To
create the quorum disk use the command as:

mkqdisk -c <device> -l <label>

This will initialize a new cluster quorum disk.

Warning: This will destroy all data on the given device.

For further information, please see the mkqdisk(8) and qdisk(5) man pages.
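A short usage sketch (the device name and label below are examples only):

# mkqdisk -c /dev/sdb1 -l myqdisk    # initialize a new quorum disk (destroys data on /dev/sdb1)
# mkqdisk -L                         # list quorum disks visible from this node
# mkqdisk -d -f myqdisk              # detailed information for the disk labeled "myqdisk"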



Example: Two Cluster Nodes and a Quorum Disk Tiebreaker 10-7

<cman two_node="0" expected_votes="3" .../>

<clusternodes>
    <clusternode name="node1" votes="1" .../>
    <clusternode name="node2" votes="1" .../>
</clusternodes>
<quorumd interval="1" tko="10" votes="1" label="testing">
    <heuristic program="ping -c1 -t1 hostA" score="1" interval="2" tko="3"/>
</quorumd>

For tiebreaker operation in a two-node cluster:

1) In the <cman> block, unset the two_node flag (or set it to 0) so that a single node with a single vote is
no longer enough to maintain quorum.

2) Also in the <cman> block, set expected_votes to 3, so that a minimum of 2 votes is necessary to
maintain quorum.

3) Set each node's votes parameter to 1, and set qdisk's votes count to 1. Because quorum requires
2 votes, a single surviving node must meet the requirement of the heuristic (be able to ping -c1 -t1 hostA, in this case) to earn the extra vote offered by the quorum disk daemon and keep the cluster alive.

This will allow the cluster to operate if either both nodes are online, or if a single node and the heuristics are
met. If there is a partition in the network preventing cluster communications between nodes, only the node
with 2 votes will remain quorate.

The heuristic is run every 2 seconds (interval), and reports failure if it is unsuccessful after 3 cycles
(tko), causing the node to lose the quorumd vote.

If the heuristic is not satisfied after 10 seconds (quorumd interval multiplied by quorumd tko value), the
node is declared dead to cman, and it will be fenced.

The worst case scenario for improperly configured quorum heuristics, or if the two nodes are partitioned
from each other but can still meet the heuristic requirement, is a race to fence each other, which is the
original outcome of a split-brain two-node cluster.

Example: Keeping Quorum When All Nodes but One Have Failed 10-8

<cman expected_votes="6" .../>

<clusternodes>
    <clusternode name="node1" votes="1" .../>
    <clusternode name="node2" votes="1" .../>
    <clusternode name="node3" votes="1" .../>
</clusternodes>
<quorumd interval="1" tko="10" votes="3" label="testing">
    <heuristic program="ping A -c1 -t1" score="1" interval="2" tko="3"/>
    <heuristic program="ping B -c1 -t1" score="1" interval="2" tko="3"/>
    <heuristic program="ping C -c1 -t1" score="1" interval="2" tko="3"/>
</quorumd>

What if two out of three of your cluster nodes fail, but the remaining node is perfectly functional and can still
communicate with its clients? The remaining machine's viability can be tested and quorum maintained with
a quorum disk configuration.

In this example, the expected_votes are increased to 6 from the normal value of 3 (3 nodes at 1 vote
each), so that 4 votes are required in order for the cluster to remain quorate.

A quorum disk is configured that will contribute 3 votes (<quorumd votes="3" ... >) to the cluster if it
scores more than half of the total possible heuristic test score, and remains writable.

The quorum disk has three heuristic tests defined, each of which is configured to score 1 point
(<heuristic program="ping A -cl -ti" score="1" ... >) if it can ping a different router (A, a,
or c), for a total of 3 possible points.

To get the 2 out of 3 points needed to pass the heuristic tests, at least two out of the three routers must be
up. If they are, and the quorum disk remains writable, we get all 3 of quorumd's votes.

If, on the other hand, no routers or only one router is up, we do not score enough points to pass and get NO
votes from the quorum disk. Likewise, if the quorum disk is not writable, we get no votes from the quorum
disk no matter how many heuristics pass.

As a result, if only a single node remains functional, the cluster can remain quorate so long as the
remaining node can ping two of the three routers (earning a passing score) and can write to the quorum
disk, which gains it the extra three votes it needs for quorum.

The <quorumd> and <heuristic> block's tko parameters set the number of failed attempts before it is
considered failed, and interval defines the frequency (seconds) of read/write attempts to the quorum
disk and at which the heuristic is polled, respectively.

End of Lecture 10

• Questions and Answers


• Summary
• qdisk
• heuristics
• quorum

Lab 10.1: Quorum Disk
Scenario: In a two-node cluster where both nodes have a single vote, a split-brain
problem (neither node can communicate with the other, but each still sees
itself as perfectly functional) can result in a fencing war, as both nodes continuously try to "correct" the other.

In this lab we demonstrate how configuring a quorum disk heuristic can help the split-brain cluster nodes decide (though not always absolutely) which node is OK and which is errant.

The heuristic we will use will be a query of which remaining node can still
ping the IP address 172.16.255.254.

Instructions:

1. Create a two-node cluster by gracefully withdrawing node3 from the cluster and deleting it from luci's cluster configuration.

Once completed, rebuild node3 using the rebuild-cluster script.

2. View the cluster's current voting/quorum values so we can compare changes later.

3. Create a new 10MB quorum partition named /dev/sdaN and assign it the label myqdisk.

4. Configure the cluster with the quorum partition using luci's interface and the
following characteristics.

Quorum should be communicated through a shared partition named /dev/sdaN with label
myqdisk. The frequency of reading/writing the quorum disk is once every 2 seconds. A node
must have a minimum score of 1 to consider itself "alive". If the node misses 10 cycles of
quorum disk testing it should be declared "dead". The node should advertise an additional vote
(for a total of 2) to the cluster manager when its heuristic is successful.

Add a heuristic that pings the IP address 172.17.X.254 once every 2 seconds. The heuristic
should have a weight/score of 1.

5. Using a file editor, manually modify the following values in cluster.conf:

expected_votes="3"
two_node="0"

Observe the quorumd-tagged section in cluster.conf.

Increment cluster.conf's version number (config_version), save the file, and then update the cluster configuration with the changes.
6. Start qdiskd on both nodes and make sure the service starts across reboots.



7. Monitor the output of clustat. When the quorum partition finally becomes active, what does
the cluster manager view it as?

8. Now that the quorum partition is functioning, whichever node is able to satisfy its heuristic
becomes the "master" cluster node in the event of a split-brain scenario. Note: this does not cure
split-brain, but it may help prevent it in specific circumstances.

View the cluster's new voting/quorum values and compare to before.

9. What happens if one of the nodes is unable to complete the heuristic command (ping)? Open a terminal window on whichever node is running the service and monitor messages in /var/log/messages. On the other node, firewall any traffic to 172.17.X.254.

10. Clean up. Stop and disable the qdiskd service on both nodes.

11. Disable the quorum partition in luci's interface.

12. Add node3 back into the cluster as you have done before. You will need to set the hostname, enable the initiator, re-install the ricci and httpd RPMs and start the ricci service before adding it back in with luci. Don't forget to copy /etc/cluster/fence_xvm.key to it and reconfigure its fencing mechanism!

Lab 10.1 Solutions
1. Create a two-node cluster by gracefully withdrawing node3 from the cluster and deleting it from luci's cluster configuration.

To gracefully withdraw from the cluster, navigate within luci's interface and choose the Nodes link from the left sidebar menu. In the section of the window describing node3.clusterX.example.com, select "Have node leave cluster" from the "Choose a Task..." drop-down menu, then press the Go button.

To delete node3 from the cluster configuration, wait for the previous action to complete,
choose "Delete this node" from the same drop-down menu, and then press the Go button.

Once completed, rebuild node3 using the rebuild-cluster script.

stationX# rebuild-cluster -3

2. View the cluster's current voting/quorum values so we can compare changes later.

node1# cman_tool status

3. Create a new 10MB quorum partition named /dev/sdaN and assign it the label myqdisk.

node1# fdisk /dev/sda

node1,2# partprobe /dev/sda
node1# mkqdisk -c /dev/sdaN -l myqdisk

Verify the quorum partition was made correctly:

node1# mkqdisk -L

4. Configure the cluster with the quorum partition using luci's interface and the
following characteristics.

Quorum should be communicated through a shared partition named /dev/sdaN with label
myqdisk. The frequency of reading/writing the quorum disk is once every 2 seconds. A node
must have a minimum score of 1 to consider itself "alive". If the node misses 10 cycles of
quorum disk testing it should be declared "dead". The node should advertise an additional vote
(for a total of 2) to the cluster manager when its heuristic is successful.

Add a heuristic that pings the IP address 172.17.X.254 once every 2 seconds. The heuristic
should have a weight/score of 1.

In luci, navigate to the cluster tab near the top, and then select the clusterX link. Select the Quorum Partition tab. In the "Quorum Partition Configuration" menu, select "Use a Quorum Partition", then fill in the fields with the following values:

Interval: 2
Votes: 1
TKO: 10
Minimum Score: 1
Device: /dev/sdaN
Label: myqdisk

Heuristics
Path to Program: ping -c1 -t1 172.17.X.254
Interval: 2
Score: 1

5. Using a file editor, manually modify the following values in cluster.conf:

expected_votes="3"
two_node="0"

Observe the quorumd-tagged section in cluster.conf.

Increment cluster.conf's version number (config_version), save the file, and then
update the cluster configuration with the changes.

node1# vi /etc/cluster/cluster.conf

node1# ccs_tool update /etc/cluster/cluster.conf

6. Start qdiskd on both nodes and make sure the service starts across reboots.

node1,2# service qdiskd start; chkconfig qdiskd on

7. Monitor the output of clustat. When the quorum partition finally becomes active, what does
the cluster manager view it as?

node1# clustat -i 1

The cluster manager treats it as if it were another node in the cluster, which is why we
incremented the expected_votes value to 3 and disabled two_node mode, above.

8. Now that the quorum partition is functioning, whichever node is able to satisfy its heuristic
becomes the "master" cluster node in the event of a split-brain scenario. Note: this does not cure
split-brain, but it may help prevent it in specific circumstances.

View the cluster's new voting/quorum values and compare to before.

cman_tool status


Nodes: 2
Expected votes: 3
Total votes: 2
Quorum: 2

(truncated for brevity)

9. What happens if one of the nodes is unable to complete the heuristic command (ping)? Open
a terminal window on whichever node is running the service and monitor messages in /var/
log/messages. On the other node, firewall any traffic to 172.17.X.254.

If node1 is the node running the service, then:



node1# tail -f /var/log/messages
node2# iptables -A OUTPUT -d 172.17.X.254 -j REJECT

Because the heuristic will not be able to complete the ping successfully, it will declare the node
dead to the cluster manager. The messages in /var/log/messages should indicate that
node2 is being removed from the cluster and that it was successfully fenced.

10. Clean up. Stop and disable the qdiskd service on both nodes.

node1,2# service qdiskd stop; chkconfig qdiskd off

11. Disable the quorum partition in luci's interface.

Navigate to the Cluster List and click on the clusterX link. Select the Quorum Partition tab,
then select "Do not use a Quorum Partition", and press the Apply button near the bottom.

12. Add node3 back into the cluster as you have done before. You will need to set the hostname, enable the initiator, re-install the ricci and httpd RPMs and start the ricci service before adding it back in with luci. Don't forget to copy /etc/cluster/fence_xvm.key to it
and reconfigure its fencing mechanism!

cXn3# perl -pi -e "s/HOSTNAME=.*/HOSTNAME=node3.clusterX.example.com/" /etc/sysconfig/network
cXn3# hostname node3.clusterX.example.com

node3# /root/RH436/HelpfulFiles/setup-initiator -bl

node3# yum -y install ricci httpd

node3# service ricci start; chkconfig ricci on

node3# scp node1:/etc/cluster/fence_xvm.key /etc/cluster



Lecture 11

rgmanager

Upon completion of this unit, you should be able to:


• Understand the function of the Service Manager
• Understand resources and services

Resource Group Manager 11-1

• Provides failover of user-defined resources collected into groups (services)


• rgmanager improves the mechanism for keeping a service highly available
• Designed primarily for "cold" failover (application restarts entirely)
• Warm/hot failovers often require application modification
• Most off-the-shelf applications work with minimal configuration changes
• Uses SysV-style init script (rgmanager) or API
• No dependency on shared storage
• Distributed resource group/service state
• Uses CCS for all configuration data
• Uses OpenAIS for cluster infrastructure communication
• Failover Domains provide preferred node ordering and restrictions
• Hierarchical service dependencies

rgmanager provides "cold failover" (usually means "full application restad") for off-the-shelf applications
and does the "heavy lifting" involved in resource group/service failover. Services can take advantage of the
cluster's extensible resource script framework API, or simply use a SysV-style init script that accepts start,
stop, restad, and status arguments.
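As an illustration only (the myappd daemon name is invented, and a production script needs proper locking and exit codes), a minimal SysV-style init script of the kind rgmanager can drive as a Script resource might look like:

#!/bin/bash
# Minimal init-style wrapper usable as a cluster Script resource (hypothetical daemon "myappd")
. /etc/rc.d/init.d/functions
case "$1" in
  start)   daemon /usr/sbin/myappd ;;                 # start the application
  stop)    killproc myappd ;;                         # stop the application
  restart) $0 stop; $0 start ;;                       # "cold" restart of the whole application
  status)  status myappd ;;                           # rgmanager polls this; exit 0 only when healthy
  *)       echo "Usage: $0 {start|stop|restart|status}"; exit 2 ;;
esac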

Without rgmanager, when a node running a service fails and is subsequently fenced, the service it was
running will be unavailable until that node comes back online.

rgmanager uses OpenAIS for talking to the cluster infrastructure, and uses a distributed model for its
knowledge of resource group/service states.

It is not always desirable for a service (a resource group) to fail over to a particular node. Perhaps the
service should only run on certain nodes in the cluster, or certain nodes in the cluster never run services but
mount GFS volumes used by the cluster.

rgmanager registers as a "service" with CMAN:

# cman_tool services
type level name id state
fence 0 default 00010003 none
[1 2 3]
dlm 1 rgmanager 00030003 none
[1 2 3]



Cluster Configuration - Resources 11-2

• A cluster service is comprised of resources


• Many describe additional settings that are application-specific
• Resource types:
• GFS file system
• Non-GFS file system (ext2, ext3)
• IP Address
• NFS Mount
• NFS Client
• NFS Export
• Script
• Samba
• Apache
• LVM
• MySQL
• OpenLDAP
• PostgreSQL 8
• Tomcat 5

The luci GUI currently has more resource types to choose from than system-config-cluster.

GFS file system - requires name, mount point, device, and mount options.

Non-GFS file system - requires name, file system type (ext2 or ext3), mount point, device, and mount
options. This resource is used to provide non-GFS file systems to a service.

IP Address - requires valid IP address. This resource is used for floating service IPs that follow relocated
services to the destination cluster node. Monitor Link can be specified to continuously check on the
interface's link status so it can failover in the event of, for example, a downed network interface. The
IP won't be associated with a named interface, so the command: ip addr list must be used to view its
configuration.

The NFS resource options can sometimes be confusing. The following two lines illustrate, via command-line examples, some of the most important options that can be specified for NFS resources:

showmount -e <host>
mount -t nfs <host>:<export_path> <mount_point>

NFS Mount - requires name, mount point, host, export path, NFS version (NFS, NFSv4), and mount
options. This resource details an NFS share to be imported from another host.

NFS Client - requires name, target (who has access to this share), permissions (ro, rw), and export options. This
resource essentially details the information normally listed in /etc/exports.

NFS Export - requires a name for the export. This resource is used to identify the NFS export with a unique
name.
Script - requires a name for the script, and a fully qualified pathname to the script. This resource is often used
for the service script in /etc/init.d used to control the application and check on its status.



The GFS, non-GFS, and NFS mount file system resources have force umount options.

The several different application resource types (Apache, Samba, MySQL, etc.) describe additional
configuration parameters that are specific to that particular application. For example, the Apache resource
allows the specification of ServerRoot, the location of httpd.conf, additional httpd options, and the
number of seconds to wait before shutdown.
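
For illustration only (the attribute names below are an assumption drawn from the description above, not
taken from this course's lab files), an Apache resource carrying these settings might look roughly like this
in /etc/cluster/cluster.conf:

<resources>
    <!-- server_root, config_file, httpd_options and shutdown_wait correspond to the
         ServerRoot, httpd.conf location, extra httpd options and shutdown delay
         mentioned above; the values shown are examples only -->
    <apache name="docroot-web" server_root="/etc/httpd"
            config_file="conf/httpd.conf"
            httpd_options="-DExampleDefine" shutdown_wait="10"/>
</resources>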



Resource Groups 11-3

• One or more resources combine to form a resource group, or cluster service


• Example: Apache service
• Filesystem (e.g. an ext3-formatted filesystem on /dev/sdb2 mounted at /var/www/html)
• IP Address (floating)
• Script (e.g. /etc/init.d/httpd)

We will see that different resource types have different default start and stop priorities when used within the
same resource group.
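
To make the grouping concrete, here is a sketch of how such an Apache resource group could be expressed
in /etc/cluster/cluster.conf; the names and values are illustrative, and the exact attributes depend on the
resource agent versions in use:

<rm>
    <service name="webby" autostart="1" recovery="restart">
        <!-- ext3 filesystem holding the DocumentRoot -->
        <fs name="webdata" device="/dev/sdb2" mountpoint="/var/www/html" fstype="ext3"/>
        <!-- floating IP address that follows the service -->
        <ip address="172.16.50.X7" monitor_link="1"/>
        <!-- SysV-style init script that starts/stops/checks httpd -->
        <script name="httpd" file="/etc/init.d/httpd"/>
    </service>
</rm>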



Start/Stop Ordering of Resources 11-4

• Within a resource group, the start/stop order of resources when enabling a service
is important
• Examples:
• Should the Apache service be started before its DocumentRoot is mounted?
• Should the NFS server's IP address be up before the allowed clients have been defined?
• Several "special" resources have default start/stop ordering values built-in
• /usr/share/cluster/service.sh
• Order dependencies can be resolved in the service properties configuration (GUI)

From /usr/share/cluster/service.sh (XML file), we can see the built-in resource ordering defaults:

<special tag="rgmanager">
    <attributes root="1" maxinstances="1"/>
    <child type="fs" start="1" stop="8"/>
    <child type="clusterfs" start="2" stop="7"/>
    <child type="netfs" start="3" stop="6"/>
    <child type="nfsexport" start="4" stop="5"/>
    <child type="nfsclient" start="5" stop=""/>
    <child type="ip" start="6" stop="2"/>
    <child type="smb" start="7" stop="3"/>
    <child type="script" start="7" stop="1"/>
</special>

We can see that different resource types have different default start and stop priorities when used within the
same resource group.

Parent/child ordering relationships can be established within the GUI. At creation of a new service, or when
editing a pre-existing service, the buttons "Add a Shared Resource to this Service" and "Add a Shared
Resource to the Selection" create top-level peers and children of resources, respectively.



Resource Hierarchical Ordering 11-5

• Some resources do not have a pre-defined start/stop order


• There is no guaranteed ordering among similar resource types
• Hierarchically structured resources:
• Parent/child resource relationships can guarantee order
• Child resources are started before continuing to the next parent resource
• Stop ordering is exactly the opposite of the defined start ordering
• Allows children to be added or restarted without affecting parent resources

After a resource is started, it follows down its in-memory tree structure that was defined by external
XML rules passed on to CCS, and starts all dependent children. Before a resource is stopped, all of its
dependent children are first stopped.

Because of this structure, it is possible to make on-line service modifications and intelligently add or restart
child resources (for instance, an "NFS client" resource) without affecting its parent (for example, an "export"
resource) after a new configuration is received.

For example, consider the following use of sub-mount points:

Incorrect:

<service ... >
    <fs mountpoint="/a" ... />
    <fs mountpoint="/a/b" ... />
    <fs mountpoint="/a/c" ... />
</service>

Correct:

<service ... >
    <fs mountpoint="/a" ... >
        <fs mountpoint="/a/b" ... />
        <fs mountpoint="/a/c" ... />
    </fs>
</service>

In the correct example, "/a" is mounted before the others. There is no guaranteed ordering of which will
be mounted next, either "/a/b" or "/a/c". Also, in the correct example, "/a" will not be unmounted until its
children have first been unmounted.



NFS Resource Group Example 11-6

• Consider an NFS resource group with the following resources and start order:

<service ... >
    <fs ... >
        <nfsexport ... >
            <nfsclient ... />
            <nfsclient ... />
        </nfsexport>
    </fs>
    <ip ... />
</service>

• The stop ordering would be just the opposite

The NFS resource group tree can be generally summarized as follows (with some extra, commonly used
resources thrown in for good measure):

group
    file system...
        NFS export...
            NFS client...
            NFS client...
    ip address...
    samba share(s)...
    script...

This default ordering comes from the <special tag="rgmanager"> section of /usr/share/cluster/service.sh.

Proper ordering should provide graceful startup and shutdown of the service. In the slide's example above,
the order is: (1) the file system to be exported must be mounted before all else, (2) the file system is exported,
(3)(4) the two client specifications are added to the exports access list, and (5) finally, the IP address on which
the service runs is enabled. We have no guaranteed ordering of which client will be added to the access
list first, but it is irrelevant because the service won't be available until the IP address is enabled.

When the service is stopped, the order is reversed. It is usually preferable (especially in the case of a
service restart or migration to another node) to have the NFS server IP taken down first so clients will hang
on the connection, rather than produce errors if the NFS service is still accessible but the filesystem holding
the data is not mounted.






Resource Recovery 11-7

• Resource recovery policy is defined at the time the service is created


• Policies:
• Restart - tries to restart failed parts of the resource group locally before attempting to relocate the service
(default)
• Relocate - does not bother trying to restart the service locally
• Disable - disables the entire service if any component resource fails

"Restad" tries to restart failed parts of this resource group locally before attempting to relocate (default);
"relocate" does not bother trying to restart the service locally; "disable" disables the resource group if any
component fails. Note that any resource which can be recovered without a restart will be.
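
As a sketch (the service name and contents are illustrative), the chosen policy shows up as the recovery
attribute of the service element in /etc/cluster/cluster.conf:

<!-- recovery can be "restart" (default), "relocate" or "disable" -->
<service name="webby" autostart="1" recovery="relocate">
    ...
</service>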



Service Status Checking 11-8

• Service status checking is done via scripts


• /usr/share/cluster/*.sh
• Not supposed to consume system resources
• Frequency of checks can be modified
• Default is 30s
• <5s not supported
• Checking is per resource, not per resource group (service)
• Do not set the status interval too low

Service status checking is done per-resource, and not per-service, because it takes more system time to
check one resource type versus another resource type. For example, a check on a "script" might happen
every 30s, whereas a check on an "ip" might happen every 20s.

Example setting (service.sh):

<action name="status" interval="30s" timeout="0"/>

Example of nested status checking (ip.sh):

<!-- Checks if the IP is up and (optionally) the link is working -->


<action name="status" interval="20" timeout="10"/>
<!-- Checks if we can ping the IP address locally -->
<action name="status" depth="10" interval="60" timeout="20"/>
<!-- Checks if we can ping the router -->
<action name="status" depth="20" interval="2m" timeout="20"/>

Red Hat Enterprise Linux is not a real-time system, so modifying the interval to some other value may result
in status checks that are slightly different than that specified.
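
Some later rgmanager releases also allow the interval to be tuned per resource instance directly in
/etc/cluster/cluster.conf by nesting an action element inside the resource; treat the exact syntax below as an
assumption to verify against your version's documentation (the service and script shown are illustrative):

<service name="webby" ... >
    <script name="httpd" file="/etc/init.d/httpd">
        <!-- ask rgmanager to run this resource's status checks every 60s -->
        <action name="status" depth="*" interval="60"/>
    </script>
</service>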

Two popular ways people get into trouble:

1. No status check at all is done ("Why is my service not being checked?")

2. Setting the status check interval way too low (e.g. 10s for an Oracle service)

If the status check interval is set lower than the actual time it takes to check on the status of a service,
you end up with the problem of endless status checking, which is a waste of resources and could slow the
cluster.



Custom Service Scripts 11-9

• Similar to SysV init scripts


• Required to support start, stop, restart, and status arguments
• Stop must be able to be called at any time, even before or during a start
• All successful operations must return a 0 exit code
• All failed operations must return a non-zero exit code
• Example script: http://kbase.redhat.com/faq/docs/DOC-5913

Note:
Service scripts that intend to interact with the cluster must follow the Linux Standard Base (LSB)
project's standard return values for successful stop operations, including that a stop operation on a
service that isn't running (already stopped) should return 0 (success) as its errorlevel (exit status).
Starting an already started service should also provide an exit status of 0.

On start, if a service script fails the cluster will try to start the service on the other nodes that have quorum.
If all nodes fail to start it, then the cluster will try to stop it on all nodes that have quorum. If this fails as well,
then the service is marked as FAILED. A failed service must be manually disabled and should have the
error cleared or fixed before it is re-enabled.

If a status check fails, then the current node will try to restart the service first. If that fails, the service will be
failed over to another node that has quorum.

In addition to straightforward success, the following situations are also to be considered successful (see the sketch after this list):

• running start on a service already running

• running stop on a service already stopped or not running

• running restart on a service already stopped or not running
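
A minimal sketch of such a script, assuming a hypothetical myapp daemon (the daemon name, paths and
process matching are placeholders, not part of the course labs):

#!/bin/bash
# /etc/init.d/myapp -- minimal cluster-friendly service script (sketch only)

start() {
    # starting a service that is already running must still exit 0
    status >/dev/null 2>&1 && return 0
    /usr/local/bin/myapp --daemon        # placeholder start command
}

stop() {
    # stopping a service that is already stopped must still exit 0
    status >/dev/null 2>&1 || return 0
    killall myapp
}

status() {
    # a non-zero exit code here tells rgmanager the resource has failed
    pgrep -x myapp >/dev/null
}

case "$1" in
    start)   start ;;
    stop)    stop ;;
    restart) stop && start ;;
    status)  status ;;
    *)       echo "Usage: $0 {start|stop|restart|status}"; exit 2 ;;
esac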



Displaying Cluster and Service Status 11-10

• Helpful tools
• luci Interface
• system-config-cluster's Cluster Management tab
• clustat
• cman_tool

• Must be a member of the cluster that is to be monitored

The node on which the cluster/service status tool is used must have the cluster software installed and be a
member of the cluster.

Active monitoring of cluster and service status can help bring problems to the system administrator's
attention and also provide clues to help identify and resolve problems. The tools listed above are
commonly used for these purposes.
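
For example, a quick status sweep from a node that is already a cluster member might look like this:

nodeY# cman_tool status        # quorum, vote counts and cluster generation information
nodeY# cman_tool nodes         # membership view of each node
nodeY# clustat                 # node and service status as seen from this node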



Cluster Status (luci) 11-11

• luci
• Cluster Management interface

[Screenshot: the luci web interface's cluster status page, showing the cluster's nodes and services]



Cluster Status Utility (clustat) 11-12

• Used to display the status of the cluster


• From the viewpoint of the machine it is running on...
• Shows:
• Membership information
• Quorum view
• State of all configured user services
• Built-in snapshot refresh capability
• XML output capable

To view the cluster status from the viewpoint of a particular node (node-1 in this example) and to refresh the
status every 2 seconds, the following clustat command can be used:

node-1# clustat -i 2
Member Status: Quorate

  Member Name                          Status
  node-1                               Online, Local, rgmanager
  node-2                               Online, rgmanager
  node-3                               Online, rgmanager

  Service Name        Owner (Last)                 State
  webby               node-1                       started

Note:

This output may look different if an older version of rgmanager (rgmanager-1.9.39-0) is
installed.

If a cluster member status indicates "Online", it is properly communicating with other nodes in the cluster.
If it is not communicating with the other nodes or is not a valid member, it simply will not be listed in the
output.
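
clustat can also emit the same snapshot as XML, which is convenient for scripted monitoring:

node-1# clustat -x     # one-shot cluster and service status report in XML format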



Cluster Service States 11-13

• Started - service resources are configured and available


• Pending - service has failed on one node and is pending start on another
• Disabled - service is disabled, and will not be restarted automatically
• Stopped - service is temporarily stopped, and waiting on a capable member to start
it
• Failed - service has failed to start or stop

Started - The service resources are configured and available.

Pending - The service has failed on one node in the cluster, and is awaiting being started on another
capable cluster member.

Disabled - The service has been disabled and has no assigned owner, and will not be automatically
restarted on another capable member. A total restart of the entire cluster will attempt to restart the service
on a capable member unless the cluster software is disabled (chkconfig <service> off).

Stopped - The service is temporarily stopped, and is awaiting a capable cluster member to start it. A
service can be configured to remain in the stopped state if the autostart checkbox is disabled (in the cluster
configuration GUI: Cluster -> Managed Resources -> Services -> Edit Service Properties, "Autostart This
Service" checkbox).

Failed - The service has failed to start on the cluster and cannot successfully stop. A failed service is never
automatically restarted on a capable cluster member.
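
The usual tool for moving a service between these states by hand is clusvcadm; for example:

nodeY# clusvcadm -d mynfs      # disable the service (e.g. after clearing the cause of a FAILED state)
nodeY# clusvcadm -e mynfs      # enable (start) the service again
nodeY# clusvcadm -r mynfs -m node2.clusterX.example.com    # relocate it to a specific member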



Cluster SNMP Agent 11-14



• Work in progress
• Storage MIB (FS, LVM, CLVM, GFS) subject to change
• OID
• 1.3.6.1.4.1.2312.8
• REDHAT-CLUSTER-MIB:RedHatCluster

The cluster-snmp package provides extensions to the net-snmp agent to allow SNMP monitoring of the
cluster. The MIB definitions and other features are still a work in progress.

After installing the relevant RPMs and configuring /etc/snmp/snmpd.conf to recognize the new
RedHatCluster space, the output of the following command shows the MIB tree associated with the cluster:

# snmptranslate -Os -Tp REDHAT-CLUSTER-MIB:RedHatCluster

+--RedHatCluster(8)
   |
   +--rhcMIBInfo(1)
   |  |
   |  +-- -R-- Integer32 rhcMIBVersion(1)
   |
   +--rhcCluster(2)
   |  |
   |  +-- -R-- String    rhcClusterName(1)
   |  +-- -R-- Integer32 rhcClusterStatusCode(2)
   |  +-- -R-- String    rhcClusterStatusString(3)
   |  +-- -R-- Integer32 rhcClusterVotes(4)
   |  +-- -R-- Integer32 rhcClusterVotesNeededForQuorum(5)
   |  +-- -R-- Integer32 rhcClusterNodesNum(6)
   |  +-- -R-- Integer32 rhcClusterAvailNodesNum(7)
   |  +-- -R-- Integer32 rhcClusterUnavailNodesNum(8)
   |  +-- -R-- Integer32 rhcClusterServicesNum(9)
   |  +-- -R-- Integer32 rhcClusterRunningServicesNum(10)
   |  +-- -R-- Integer32 rhcClusterStoppedServicesNum(11)
   |  +-- -R-- Integer32 rhcClusterFailedServicesNum(12)
   |
   +--rhcTables(3)
      |
      +--rhcNodesTable(1)
      |  |
      |  +--rhcNodeEntry(1)
      |     |  Index: rhcNodeName
      |     |
      |     +-- -R-- String    rhcNodeName(1)
      |     +-- -R-- Integer32 rhcNodeStatusCode(2)
      |     +-- -R-- String    rhcNodeStatusString(3)
      |     +-- -R-- Integer32 rhcNodeRunningServicesNum(4)
      |
      +--rhcServicesTable(2)
         |
         +--rhcServiceEntry(1)
            |  Index: rhcServiceName
            |
            +-- String    rhcServiceName(1)
            +-- Integer32 rhcServiceStatusCode(2)
            +-- String    rhcServiceStatusString(3)
            +-- String    rhcServiceStartMode(4)
            +-- String    rhcServiceRunningOnNode(5)
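
Once the agent is running, individual scalars from this tree can be queried directly; for example, assuming
the read-only community string guests that is configured in the lab at the end of this lecture:

nodeY# snmpget -v1 -c guests localhost REDHAT-CLUSTER-MIB::rhcClusterName.0
nodeY# snmpget -v1 -c guests localhost REDHAT-CLUSTER-MIB::rhcClusterRunningServicesNum.0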




Starting/Stopping the Cluster
Software on a Member Node
11-15

• service cman start


• service qdiskd start (if using qdisk)
• service clvmd start (if using LVs)
• service gfs start (if using GFS)
• service rgmanager start

Reverse the above process to remove a node from the cluster. Don't forget to make services persistent
across reboots (chkconfig servicename on).

To temporarily disable a node from rejoining the cluster after a reboot:

for i in rgmanager gfs clvmd qdiskd cman
> do
>   chkconfig --level 2345 $i off
> done

Race conditions can sometimes arise when running the service commands in a bash shell loop structure. It
is recommended that each command be run one at a time at the command line.
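
For reference, a sketch of the reverse (shutdown) order, running each command individually on the node
being removed:

nodeY# service rgmanager stop
nodeY# service gfs stop        # if using GFS
nodeY# service clvmd stop      # if using LVs
nodeY# service qdiskd stop     # if using qdisk
nodeY# service cman stop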


1

e


1



Cluster Shutdown Tips 11-16

• Timing issue with respect to shutting down all cluster nodes


• Partial shutdown problem due to lost quorum
• Operations such as unmounting GFS or leaving the fence domain will block
• Solution 1:
• cman_tool leave remove
• Solution 2:
• Forcibly decrease the number of expected votes to regain quorum
• cman_tool expected <votes>

When shutting down all or most nodes in a cluster, there is a timing issue: as the nodes are shutting
down, if quorum is lost, remaining members that have not yet completed fence_tool leave will be stuck.
Operations such as unmounting GFS file systems or leaving the fence domain will block while the cluster is
inquorate and will be incapable of completing until quorum is regained.

One simple solution is to execute the command cman_tool leave remove, which automatically reduces the
number of votes needed for quorum as each node leaves, preventing the loss of quorum and allowing the
last nodes to cleanly shutdown. Care should be exercised when using this command to avoid a split-brain
problem.

If you end up with stuck nodes, another solution is to have enough of the nodes rejoin the cluster to regain
quorum, so stuck nodes can complete their shutdown (potentially then making the rejoined nodes get
stuck).

Yet another option is to forcibly reduce the number of expected votes for the cluster (cman_tool expected
<votes>) so it can become quorate again.
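
As a sketch of the two workarounds (the expected-votes value shown is only an example; check
cman_tool(8) for the exact form your release uses):

# Option 1: as each node is taken down, shrink the expected-votes count on the way out
nodeY# cman_tool leave remove

# Option 2: on a remaining node, force the expected-votes count down so quorum is regained,
# e.g. with the man page form that passes the value via -e
nodeY# cman_tool expected -e 1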



Troubleshooting 11-17

• A common service configuration problem is improperly written user scripts

• Is the service status being checked too frequently?


• Are service resources available in the correct order?
• Is a proper exit code being sent to the cluster?


• cman_tool {status,nodes}
• clustat

The number one field problem with respect to service configuration has been improperly written user
scripts.

Again, it's important to make sure that the script delivers an exit code of 0 (zero) back to the cluster for all
successful operations.

Also, make sure not to lower the status checking defaults without good reason and thorough testing after
having done so. If too low a time value is chosen, you won't have to wait long before the cluster
becomes sluggish as it eventually spends most of its time checking the status of the service or one of its
resources.
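
A quick way to catch script problems before the cluster does is to exercise the script by hand and inspect
its exit codes (myapp is a placeholder for your own init script):

nodeY# /etc/init.d/myapp start;  echo $?     # must print 0 on success
nodeY# /etc/init.d/myapp status; echo $?     # 0 while running, non-zero once stopped
nodeY# /etc/init.d/myapp stop;   echo $?     # must print 0
nodeY# /etc/init.d/myapp stop;   echo $?     # stopping an already-stopped service must still print 0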













Logging 11-18

• Most of the cluster infrastructure uses daemon.*


• Older cluster versions used local4.* via syslogd
• Log level can be adjusted in /etc/cluster/cluster.conf
• clulog

To send most cluster-related messages and all kernel messages to the console using syslogd, edit /etc/
syslog.conf and include the following lines:

kern.*          /dev/console
daemon.info     /dev/console

then restart/reload syslog.

Log events can be generated and sent to syslogd(8) using the clulog command:

clulog -s 7 "cluster: My custom message"

The -s option specifies a severity level (0-7; 0=ALERT, 7=DEBUG).

The log level for the resource group manager can be adjusted in /etc/cluster/cluster.conf:

<rm log_level="6" log_facility="daemon"/>

The available log levels are:


0 system is unusable, emergency
1 action must be taken immediately
2 critical conditions
3 error conditions
4 warning conditions
5 normal but significant condition
6 informational
7 debug-level messages

Higher-numbered log levels include events recorded by all lower-numbered log levels. The log_facility
parallels the facilities in syslog(3) and therefore may be selected from the following list:

auth, authpriv, cron, daemon, kern, lpr, mail, news, syslog, user, uucp, and local0 through
local7




End of Lecture 11

• Questions and Answers
• Summary
• rgmanager
• Resources
• Services




Lab 11.1: Adding an NFS Service to the Cluster
Scenario: We can affect the order in which our service's resources are made available
by configuring them in a parent/child hierarchy. We can demonstrate this
by adding an NFS service whose resources are defined in such a hierarchy.

Deliverable: Add an NFS failover service to our existing cluster.

Instructions:

1. Create an ext3-formatted filesystem mounted at /mnt/nfsdata using /dev/sda2 (a
500MB-sized "0x83 Linux" partition). Copy the file /usr/share/dict/words to the
/mnt/nfsdata filesystem for testing purposes. Unmount the filesystem when you are done
copying the file to it.

2. Create a failover domain named prefer_node2 that allows services to use any node in the
cluster, but prefers to run on node2 (node2 should have a higher priority (lower priority
value) than the other nodes).

3. Using luci's interface, create the resources necessary for an NFS service. This service should
provide data from our just-created /mnt/nfsdata filesystem. All remote hosts should have
read-write access to this NFS filesystem at 172.16.50.X7.

As a hint, you will need the following resources: IP Address, File System, NFS Export, and
NFS Client.

4. Create a new NFS service from these four resources named mynfs, that uses the
prefer_node2 failover domain and has a relocate recovery policy. Make sure that the
NFS Export resource is a child of the File System resource, and that the NFS Client
resource is a child of the NFS Export resource.

5. Monitor the mynfs cluster service's status until you see that it has started successfully.

6. When the NFS service finally starts, on which node is it running? What about the Web service?
Why might you want to "criss-cross" service node domains like this?




Lab 11.2: Configuring SNMP for Red Hat Cluster Suite
Scenario: In this lab we configure simple SNMP client access to cluster resources.

All of the following commands are to be executed on node1, for
simplicity.

Instructions:

1. On node1, install the following RPMs: cluster-snmp, net-snmp, net-snmp-utils.

2. Backup the original SNMP daemon configuration file /etc/snmp/snmpd.conf.

3. Edit snmpd.conf so that it contains only the following two lines:

dlmod RedHatCluster /usr/lib/cluster-snmp/libClusterMonitorSnmp.so
rocommunity guests 127.0.0.1

4. Start the SNMP service, and make sure it survives a reboot:

5. "Walk" the MIB space and test that your SNMP server is functioning properly.

6. Examine the part of the MIB tree that is specific to the Red Hat Cluster Suite
(REDHAT-CLUSTER-MIB:RedHatCluster) in a tree-like format.

7. View the values assigned to the OIDs in the cluster's MIB tree.

8. Note that part of the MIB tree has tabled information (e.g. rhcNodesTable,
rhcServicesTable, etc.) and some has scalar (singular valued) information. Compare the
output of the following commands (you will likely need a wide terminal window and/or small
font to view the snmptable output properly):

node1# snmpwalk -v 1 -c guests localhost REDHAT-CLUSTER-MIB::rhcCluster

node1# snmpwalk -v 1 -c guests localhost REDHAT-CLUSTER-MIB::rhcClusterServicesNames

node1# snmpwalk -v 1 -c guests localhost REDHAT-CLUSTER-MIB::rhcClusterStatusDesc

node1# snmptable -v 1 -c guests localhost REDHAT-CLUSTER-MIB::rhcNodesTable

node1# snmptable -v 1 -c guests localhost REDHAT-CLUSTER-MIB::rhcServicesTable



9. What SNMP command could you use to examine the total number of votes in your cluster? The
number of votes needed in order to make the cluster quorate?



Lab 11.1 Solutions
1. Create an ext3-formatted filesystem mounted at /mnt/nfsdata using /dev/sda2 (a
500MB-sized "0x83 Linux" partition). Copy the file /usr/share/dict/words to the
/mnt/nfsdata filesystem for testing purposes. Unmount the filesystem when you are done
copying the file to it.

a. node1# fdisk /dev/sda

(create the partition and exit fdisk, then run partprobe on all three nodes)

node1,2,3# mkdir /mnt/nfsdata

node1# mkfs -t ext3 /dev/sda2

node1# mount /dev/sda2 /mnt/nfsdata

node1# cp /usr/share/dict/words /mnt/nfsdata

node1# umount /mnt/nfsdata

b. Do not place an entry for the filesystem in /etc/fstab; we want the cluster software to
handle the mounting and unmounting of the filesystem for us.

2. Create a failover domain named prefer_node2 that allows services to use any node in the
cluster, but prefers to run on node2 (node2 should have a higher priority (lower priority
value) than the other nodes).

From the left-hand menu select Failover Domains, then select Add a Failover Domain. Choose
the following values for its parameters and leave all others at their default.

Failover Domain Name --> prefer_node2


Prioritized --> yes
Restrict failover to... --> yes
node1.clusterX.example.com --> Member: yes --> Priority: 2
node2.clusterX.example.com --> Member: yes --> Priority: 1
node3.clusterX.example.com --> Member: yes --> Priority: 2

Click the Submit button to save your choices.


3. Using luci's interface, create the resources necessary for an NFS service. This service should
provide data from our just-created /mnt/nfsdata filesystem. All remote hosts should have
read-write access to this NFS filesystem at 172.16.50.X7.

As a hint, you will need the following resources: IP Address, File System, NFS Export, and
NFS Client.

From the left-hand menu select Resources, then select Add a Resource. Add the following
resources, one at a time:

IP Address --> 172.16.50.X7

File System --> Name: mydata
                FS Type: ext3
                Mount Point: /mnt/nfsdata
                Device: /dev/sda2

NFS Client --> Name: myclients
               Target: *

NFS Export --> Name: myexport

(Note: target specifies which remote clients will have access to the NFS export). Leave all
other options at their default.

4. Create a new NFS service from these four resources named mynfs, that uses the
prefer_node2 failover domain and has a relocate recovery policy. Make sure that the
NFS Export resource is a child of the File System resource, and that the NFS Client
resource is a child of the NFS Export resource.

From the left-hand menu select Services, then select Add a Service. Choose the following
values for its parameters and leave all others at their default.

Service name --> mynfs
Failover Domain --> prefer_node2
Recovery policy --> relocate

Click the Add a resource to this service button. From the "Use an existing global resource"
drop-down menu, choose: 172.16.50.X7 (IP Address).

Click the Add a resource to this service button again. From the "Use an existing global
resource" drop-down menu, choose: mydata (File System).

This time, click the Add a child button in the "File System Resource Configuration" section of
the window. From the "Use an existing global resource" drop-down menu, choose: myexport
(NFS Export).

Now click the Add a child button in the "NFS Export Resource Configuration" section of the
window. From the "Use an existing global resource" drop-down menu, choose: myclients
(NFS Client).

At the very bottom of the window (you may have to scroll down), click the Submit button to
save your choices.

5. Monitor the mynfs cluster service's status until you see that it has started successfully.

clustat -i 1

and/or refresh luci's Services screen.

6. When the NFS service finally starts, on which node is it running? What about the Web service?
Why might you want to "criss-cross" service node domains like this?

a. The NFS Service should have started on node2.

b. The Web Service should still be running on node1.



c. This configuration allows the two services to minimize contention for resources by
running on their own machine. Only when there is a failure of one node will the two
services have to share the other.

Note:
Your service locations may differ, depending upon where the webby service was at the
time the NFS service started.









Lab 11.2 Solutions
1. On node1, install the following RPMs: cluster-snmp, net-snmp, net-snmp-utils.

node1# yum -y install cluster-snmp net-snmp net-snmp-utils

2. Backup the original SNMP daemon configuration file /etc/snmp/snmpd.conf.

node1# cp /etc/snmp/snmpd.conf /etc/snmp/snmpd.conf.orig

3. Edit snmpd.conf so that it contains only the following two lines:

dlmod RedHatCluster /usr/lib/cluster-snmp/libClusterMonitorSnmp.so
rocommunity guests 127.0.0.1

The first line loads the proper MIB module for Red Hat Cluster Suite. The second line creates a read-
only community named guests with full access to the entire MIB tree, so long as the request
originates from 127.0.0.1.

4. Start the SNMP service, and make sure it survives a reboot:

node1# service snmpd start

node1# chkconfig snmpd on

5. "Walk" the MIB space and test that your SNMP server is functioning properly.

node1# snmpwalk -v1 -c guests localhost

6. Examine the part of the MIB tree that is specific to the Red Hat Cluster Suite
(REDHAT-CLUSTER-MIB:RedHatCluster) in a tree-like format.

node1# snmptranslate -Os -Tp REDHAT-CLUSTER-MIB:RedHatCluster

7. View the values assigned to the OIDs in the cluster's MIB tree.

node1# snmpwalk -v 1 -c guests localhost REDHAT-CLUSTER-MIB::RedHatCluster

8. Note that part of the MIB tree has tabled information (e.g. rhcNodesTable,
rhcServicesTable, etc.) and some has scalar (singular valued) information. Compare the
output of the following commands (you will likely need a wide terminal window and/or small
font to view the snmptable output properly):

node1# snmpwalk -v 1 -c guests localhost REDHAT-CLUSTER-MIB::rhcCluster

node1# snmpwalk -v 1 -c guests localhost REDHAT-CLUSTER-MIB::rhcClusterServicesNames

node1# snmpwalk -v 1 -c guests localhost REDHAT-CLUSTER-MIB::rhcClusterStatusDesc

node1# snmptable -v 1 -c guests localhost REDHAT-CLUSTER-MIB::rhcNodesTable

node1# snmptable -v 1 -c guests localhost REDHAT-CLUSTER-MIB::rhcServicesTable

9. What SNMP command could you use to examine the total number of votes in your cluster? The
number of votes needed in order to make the cluster quorate?

node1# snmpget -v1 -c guests localhost REDHAT-CLUSTER-MIB::rhcClusterVotes.0

node1# snmpget -v1 -c guests localhost REDHAT-CLUSTER-MIB::rhcClusterVotesNeededForQuorum.0



Lecture 12

Comprehensive Review

Upon completion of this unit, you should be able to:


• Setup a HA cluster from scratch including a quorum-disk,
clustered LVM, GFS2 and an NFS service



Start from scratch 12-1

• Rebuild your workstation
• Build a new cluster
• Ask questions
• Ask more questions

In this unit's lab you'll be building a new cluster from scratch. We will begin by reinstalling your workstation
to make sure that we'll begin with a pristine environment.

After your machine has finished reinstalling you will build your virtual clusternodes for the actual cluster.

Once we've got our new clusternodes we will build a new three-node cluster with multipathed iscsi-disks, a
quorum-disk, clustered LVM, GFS2 and an NFS-export service.

This lab is an excellent opportunity to ask your instructor questions on any subjects that need further
clarification.













End of Lecture 12

• Questions and Answers


• Summary
• Build a cluster from scratch



Lab 12.1: Rebuild your environment
Scenario: We will start by reinstalling and prepping your workstation and virtual
machines.

Deliverable: A basic environment on which to build our cluster

Instructions:

1. PXE-boot your workstation and at the prompt that appears choose ws. On most machines PXE-
boot can be selected by pressing F12 when the BIOS screen appears. Ask your instructor for
assistance if your hardware differs.

2. Once your system has finished installing, log in as student, elevate your privileges to root and
run the command

stationX# rebuild-cluster -m

Answer Y when asked if you are certain that you want to continue

3. Build your clusternodes using the command rebuild-cluster -123

4. Since we want our cluster-communication to happen over our private network we need to set
the hostnames for all our clusternodes to the form of nodeY.clusterX.example.com



Lab 12.2: Setup iscsi and multipath
Scenario: In this sequence we will setup your desktop to export a 5GiB iscsi-volume
to your clusternodes.

Deliverable: Clusternodes with multipathed iscsi access

Instructions:

1. On your desktop create a 5GiB logical volume called /dev/vol0/iscsi

2. Make sure the scsi-target-utils package is installed on your desktop.

3. Export your /dev/vol0/iscsi logical volume over iscsi with an IQN of
iqn.2009-10.com.example.stationX:iscsi to 172.17.{100,200}+X.{1..3}

4. On all three of your clusternodes discover and log in to the iscsi-target created above using both of
the portals available. (172.17.100+X.254 and 172.17.200+X.254)

5. Setup multipathing on all three of your clusternodes to create /dev/mpath/mpath0 from
your newly discovered iscsi-targets.

Lab 12.3: Build a three node cluster
Scenario: Now we will build a three node cluster out of your virtual machines

Deliverable: A working three-node cluster with fencing

Instructions:

1. Install, enable and start ricci on all three of your clusternodes

2. Install, enable, configure and start luci on your desktop.

3. Using luci build a cluster named clusterX, where X is your cluster number.

4. Setup "Virtual machine fencing" for your cluster nodes. Do not forget to start /sbin/fence_xvmd
on your desktop.



Lab 12.4: Add a quorum-disk
Scenario: Now that we have our cluster configured we will add a quorum-disk to help
redundancy.

Deliverable: A working three-node cluster with fencing and a quorum-disk

Instructions:

1. Add a quorum disk on your multipathed iscsi, giving it a label of qdiskX, 3 votes and using a
heuristic of ping -c1 -W1 172.16.255.254 with an interval of 2 and a TKO of 5.



Lab 12.5: Add a GFS2 filesystem
Scenario: In this exercise we will add our GFS2 filesystem to our cluster

Deliverable: A working three-node cluster with fencing, a quorum-disk, clustered LVM


and a GFS2 filesystem.

Instructions:

1. Add a new 4 GiB clustered LVM volume group called ClusterVG to your cluster, using your
multipathed iscsi as the backend

2. Add a new 3GiB clustered Logical Volume to your cluster called /dev/ClusterVG/nfsdata.

3. Create a new GFS2 filesystem called nfsdata on your /dev/ClusterVG/nfsdata logical
volume, being sure to add one spare journal. Do not add this filesystem to your /etc/fstab
right now.

4. Test your new GFS2 filesystem by mounting it on /nfsdata on all three nodes
simultaneously and creating a file on it. Do not forget to unmount afterwards.



Lab 12.6: Add a NFS-service to your cluster
Scenario: With our cluster up and running we can now add an NFS-service.
Deliverable: A working highly available NFS service.

Instructions:

1. Add an NFS-service to your cluster, listening on the ip-address 172.16.50.X7 and exporting
your nfsdata GFS2 filesystem read-only from /nfsdata to the entire world.

When possible the service should be running from your node1 machine.


Lab 12.1 Solutions
1. PXE-boot your workstation and at the prompt that appears choose ws. On most machines PXE-
boot can be selected by pressing F12 when the BIOS screen appears. Ask your instructor for
assistance if your hardware differs.

2. Once your system has finished installing, log in as student, elevate your privileges to root and
run the command

stationX# rebuild-cluster -m

Answer Y when asked if you are certain that you want to continue

stationX$ su -
Password: redhat
stationX# rebuild-cluster -m
This will create or rebuild the template node
Continue? (y/N): y

3. Build your clusternodes using the command rebuild-cluster -123

stationX# rebuild-cluster -123
This will create or rebuild nodes 1 2 3
Continue? (y/N): y

4. Since we want our cluster-communication to happen over our private network we need to set
the hostnames for all our clusternodes to the form of nodeY.clusterX.example.com

stationX# for I in node{1..3}.clusterX.example.com
> do
>   ssh -o stricthostkeychecking=no ${I} hostname ${I}
>   ssh -o stricthostkeychecking=no ${I} sed -i \
>   "s/HOSTNAME=.*/HOSTNAME=${I}/" /etc/sysconfig/network
> done



Lab 12.2 Solutions
1. On your desktop create a 5GiB logical volume called /dev/vol0/iscsi

stationX# lvcreate -L 5G -n iscsi vol0

2. Make sure the scsi-target-utils package is installed on your desktop.

stationX# yum install scsi-target-utils

3. Export your /dev/vol0/iscsi logical volume over iscsi with an IQN of
iqn.2009-10.com.example.stationX:iscsi to 172.17.{100,200}+X.{1..3}

Edit /etc/tgt/targets.conf to contain the following:

<target iqn.2009-10.com.example.stationX:iscsi>
    backing-store /dev/vol0/iscsi
    initiator-address 172.17.100+X.1
    initiator-address 172.17.100+X.2
    initiator-address 172.17.100+X.3
    initiator-address 172.17.200+X.1
    initiator-address 172.17.200+X.2
    initiator-address 172.17.200+X.3
</target>

Then start and enable the tgtd service:

stationX# chkconfig tgtd on
stationX# service tgtd start

Finally test the service:

stationX# tgt-admin -s

4. On all three of your clusternodes discover and log in to the iscsi-target created above using both of
the portals available. (172.17.100+X.254 and 172.17.200+X.254)

stationX# for I in node{1..3}; do
> ssh ${I} "echo InitiatorAlias=${I} >> /etc/iscsi/initiatorname.iscsi;
> iscsiadm -m discovery -t st -p 172.17.100+X.254;
> iscsiadm -m discovery -t st -p 172.17.200+X.254;
> service iscsi restart"
> done

5. Setup multipathing on all three of your clusternodes to create /dev/mpath/mpath0 from
your newly discovered iscsi-targets.

On all three of your clusternodes perform the following:

nodeY# yum -y install device-mapper-multipath

Edit /etc/multipath.conf and remove/comment the blacklist block, remove/comment
the defaults block and uncomment the larger defaults block just below.

Execute:

nodeY# chkconfig multipathd on

nodeY# service multipathd start

nodeY# multipath -ll



Lab 12.3 Solutions
1. Install, enable and start ricci on all three of your clusternodes

stationX# for I in node{1..3}; do
> ssh ${I} 'yum -y install ricci; chkconfig ricci on; service ricci start';
> done

2. Install, enable, configure and start luci on your desktop.

stationX# yum -y install luci
stationX# luci_admin init
stationX# chkconfig luci on
stationX# service luci start

3. From the "Luci Homebase" page, select the cluster tab near the top and then select "Create
a New Cluster" from the left sidebar. Enter a cluster name of clusterX, where X is
your assigned cluster number. Enter the fully-qualified name for your three cluster nodes
(nodeN.clusterX.example.com) and the password for the root user on each. Make
sure that "Download packages" is pre-selected, then select the "Check if node passwords are
identical" option. All other options can be left as-is.

4. Setup "Virtual machine fencing" for your cluster nodes. Do not forget to start /sbin/fence_xvmd
on your desktop.

In luci select the cluster tab, select your cluster and then click on the fence
tab. Enter dom0.clusterX.com as the hostname for the host cluster, and
node1.clusterX.example.com as a hostname from the virtual cluster. Click on Retrieve
cluster nodes, and then on Create and distribute keys.

Now we need to make your desktop listen to fencing requests:

stationX# yum -y install cman
stationX# scp node1:/etc/cluster/fence_xvm.key /etc/cluster/
stationX# echo "/sbin/fence_xvmd -L -I cluster" >> /etc/rc.local
stationX# /etc/rc.local

In luci select the cluster tab, select your cluster and then click on Shared Fence Devices in the
left-hand column.

Select Add a Fence Device, choose Virtual Machine Fencing and give it a name of
xenfence0

From the Nodes list on the left select Manage fencing for this node for each of your three nodes
and add a primary fencing device pointing to xenfence0 with a domain corresponding to the
proper node (node1 for node1, etc.)

Test fencing for all your nodes by using a different host to run

nodeY# fence_node nodeZ.clusterX.example.com
Lab 12.4 Solutions
1. Add a quorum disk on your multipathed iscsi, giving it a label of qdiskX, 3 votes and using a
heuristic of ping -c1 -W1 172.16.255.254 with an interval of 2 and a TKO of 5.

Execute the following on one of your cluster nodes:

nodeY# fdisk /dev/mpath/mpath0

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
Partition number (1-4): 1
First cylinder (1-652, default 1): Enter
Using default value 1
Last cylinder or +size or +sizeM or +sizeK (1-652, default 652): +128M
Command (m for help): w

Now to let our systems know about the new partition perform the following on all three
clusternodes:

nodeY# partprobe
nodeY# kpartx -a /dev/mpath/mpath0

Now to create the qdisk perform the following on one of your nodes:

nodeY# mkqdisk -c /dev/mpath/mpath0p1 -l qdiskX

Add the qdisk to your cluster by selecting the cluster tab in luci, selecting your cluster followed
by selecting the Quorum Partition tab. Select Use A Quorum Partition and use the following
table to fill in the corresponding values:

Interval   2
Votes      3
TKO        5
Label      qdiskX

Also add a heuristic with ping -c1 -W1 172.16.255.254 as the program, an interval of
2 and a score of 1

The last thing left is to start and enable the qdiskd service on all three of our cluster nodes.
Perform the following on all three of your clusternodes:

nodeY# chkconfig qdiskd on
nodeY# service qdiskd start


Lab 12.5 Solutions
1. Make sure that LVM will prefer our multipathed devices over our normal devices by editing
/etc/lvm/lvm.conf so that it includes the following line on all your clusternodes:

preferred_names = [ "^/dev/mpath/", "^/dev/mapper/mpath", "^/dev/[hs]d" ]

Also make sure that LVM is setup for clustered operation by executing the following command
on all your clusternodes:

nodeY# lvmconf --enable-cluster

Add a new partition on your /dev/mpath/mpath0 device:

nodeY# fdisk /dev/mpath/mpath0

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
Partition number (1-4): 2
First cylinder (18-652, default 18): Enter
Using default value 18
Last cylinder or +size or +sizeM or +sizeK (18-652, default 652): +4G
Command (m for help): w

Now to let our systems know about the new partition perform the following on all three
clusternodes:

nodeY# partprobe
nodeY# kpartx -a /dev/mpath/mpath0

Now to create a Volume Group:

nodeY# pvcreate /dev/mpath/mpath0p2

nodeY# vgcreate ClusterVG /dev/mpath/mpath0p2

Make certain that all our clusternodes pick up on the new clustered Volume Group; execute the
following on all clusternodes:

nodeY# chkconfig clvmd on

nodeY# service clvmd restart

2. On one of your clusternodes perform the following:

lvcreate -L 3G -n nfsdata ClusterVG



3. Create a new GFS2 filesystem called nfsdata on your /dev/ClusterVG/nfsdata logical
volume, being sure to add one spare journal. Do not add this filesystem to your /etc/fstab
right now.

nodeY# mkfs.gfs2 -j4 -t cluster20:nfsdata /dev/ClusterVG/nfsdata

4. Test your new GFS2 filesystem by mounting it on /nfsdata on all three nodes
simultaneously and creating a file on it. Do not forget to unmount afterwards.

Run the following commands on all three cluster nodes:

nodeY# mkdir /nfsdata


nodeY# mount /dev/ClusterVG/nfsdata /nfsdata

On one of your nodes, create a file in /nfsdata:

nodeY# echo "testing, testing, 123" > /nfsdata/test

On your other nodes, attempt to read the file:

nodeY# cat /nfsdata/test

Finally, unmount the new filesystem by running the following command on all your nodes:

nodeY# umount /nfsdata

Lab 12.6 Solutions
1. Add an NFS service to your cluster, listening on the IP address 172.16.50.X7 and exporting
your nfsdata GFS2 filesystem read-only from /nfsdata to the entire world.

When possible, the service should be running from your node1 machine.
Let's start by adding a failover domain prefer_node1 to our cluster. In the luci interface,
select your cluster from the cluster tab, and then in the left-hand column select Failover
Domains. Click Add a Failover Domain and in the following screen create a new failover
domain with a name of prefer_node1, checking both the Prioritized and
Restricted checkboxes. Add all three of your nodes, giving node1 a priority of 1 and both
your other nodes a priority of 2.
Next, let's add the individual resources that will make up our service. From the cluster page in
luci, navigate via your cluster to the Resources screen. From here, use the Add a Resource link to
add the following resources:
Name            Type             Options
172.16.50.X7    IP address       • Monitor link: True
nfsdata         GFS filesystem   • Mountpoint: /nfsdata
                                 • Device: /dev/ClusterVG/nfsdata
                                 • Type: GFS2
                                 • Filesystem ID: 513
nfsexportX      NFS export
world           NFS client       • Target: *
                                 • Options: ro
                                 • Allow Recover: True
With our resources and failover domain defined, we can now add a service. From the Services
screen select Add a service. Use mynfs for the name and select Automatically start this
service and Enable NFS lock workarounds. Set the Recovery policy to Relocate and select
prefer_node1 as the Failover Domain.

Use the Add a resource to this service button to add your IP address and GFS2 filesystem as
direct children of your service, then use the Add a child button to add your NFS export resource
as a child to your GFS2 filesystem, and your NFS client resource as a child to your NFS export.
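Once luci has propagated the configuration, the service can be checked and a test relocation performed from any node (the hostnames below follow the lab naming convention and are assumptions, not part of the printed solution):

nodeY# clustat                                              # mynfs should show as started on node1
nodeY# clusvcadm -r mynfs -m node2.clusterX.example.com     # relocate the service to another node
nodeY# clusvcadm -r mynfs -m node1.clusterX.example.com     # move it back to the preferred node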

Appendix A


Advanced RAID


Upon completion of this unit, you should be able to:
• Understand the different types of RAID supported by Red Hat
• Learn how to administer software RAID
• Learn how to optimize software RAID
• Plan for and implement storage growth










Redundant Array of Inexpensive Disks A-1

• Software RAID
• 0, 1, 5, 6, 10
• Software versus Hardware RAID
• Provides
• Data integrity
• Fault-tolerance
• Throughput
• Capacity
• mdadm
• Creates device files named /dev/md0, /dev/md1, etc...
• -a yes option for non-precreated device files (/dev/md1 and higher)

RAID originally stood for Redundant Array of Inexpensive Disks, but has come to also stand for Redundant
Array of Independent Disks.

RAID combines multiple hard drives into a single logical unit. The operating system ultimately sees only
one block device, which may really be made up of several different block devices. How the different block
devices are organized differentiates one type of RAID from another.

Software RAID is provided by the operating system. Software RAID provides a layer of abstraction between
the logical disks (RAID arrays) and the physical disks or partitions participating in a RAID array. This
abstraction layer requires some processing power, normally provided by the main CPU in the host system.

Hardware RAID requires a special-purpose RAID controller, and is often provided in a stand-alone
enclosure by a third-party vendor. Hardware RAID uses its controller to off-load any processing power
required by the chosen RAID level (such as parity calculations) from the main CPU, and simply present a
logical disk to the operating system. Another advantage of hardware RAID is that most implementations support
hot swapping of disks, allowing failed drives to be replaced without having to take the system offline.
Additional features of hardware RAID are as varied as the vendors provided it.

The RAID type you choose will be dictated by your needs: data integrity, fault tolerance, throughput, and/or
capacity. Choosing one particular level is largely a matter of trade-offs and compromises.



RAID0 A-2

• Striping without parity


• Data is segmented
• Segments are round-robin written to multiple physical devices
• Provides greatest throughput
• Not fault-tolerant
• Minimum 2 (practical) block devices
• Storage efficiency: 100%
• Example:
• mdadm --create /dev/md0 --level=0 --raid-devices=2 --chunk=64 /dev/sd[ab]1
• mke2fs -j -b 4096 -E stride=16 /dev/md0

RAID0 (software), or striping without parity, segments the data, so that the different segments can be
written to multiple physical devices (usually disk drives) in a round-robin fashion. The storage efficiency is
maximized if identical-sized drives are used.

The size of the segment written to each device in a round-robin fashion is determined at array creation time,
and is referred to as the chunk size.

The advantage of striping is increased performance. It has the best overall performance of the non-nested
RAID levels. The disadvantage of striping is fault-tolerance: if one disk in the RAID array is lost, all data on
the RAID array is lost, because each segmented file it hosts will have lost any segments that were placed
on the failed drive.

The size of the array is originally taken from the smallest of its member block devices at build time. The
size of the array can be grown if all the drives are, one at a time, removed and replaced with larger block
devices, followed by a grow of the RAID array. A resynchronization process would then start to make sure
that new parts of the array are synchronized. The filesystem would then need to be grown into the newly
available array space.

Recommended usage: non-critical, infrequently changed and/or regularly backed up data requiring high-
speed I/O (particularly writes) with a low cost of implementation.



RAID1 A-3

• Mirroring
• Data is replicated
• Provides greater fault-tolerance
• Greater read performance
• Minimum 2 (practical) block devices
• Storage efficiency: (100/N)%, where N=#mirrors
• Example:
• mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sd[ab]1

RAID1 (software), or mirroring, replicates a block device onto one or more, separate, block devices in real
time to ensure continuous availability of the data. The storage efficiency is maximized if identical-sized
drives are used.

Additional devices can be added, at which time a synchronization of the data is performed (so they hold a
valid copy). Failed devices are automatically taken out of the array and the administrator can be notified of
the event via e-mail. So long as there remains at least one copy in the mirror, the data remains available.

While not its primary goal, mirroring does provide some performance benefit for read operations. Because
each block device has an independent copy of the same exact data, mirroring can allow each disk to be
accessed separately, and in parallel.

Each block device used for a mirrored copy of the data must be the same size as the others, and should be
relatively equal in performance so the load is distributed evenly.

The size of the array is originally taken from the smallest of its member block devices at build time. The
size of the array can be grown if all the drives are, one at a time, removed and replaced with larger block
devices, followed by a grow of the RAID array. A resynchronization process would then start to make sure
that new parts of the array are synchronized. The filesystem would then need to be grown into the newly
available array space.

Mirroring can also be used for periodic backups. If, for example, a third equally-sized disk is added to an
active two-disk mirror, the new disk will not become an active participant in the RAID array until the already-
active participants synchronize their data onto the newly added disk (making it the third copy of the data).
Once completed, and the new disk is an active third copy of the data, it can then be removed. If it is re-
added every week, for example, it effectively becomes a weekly backup of the RAID array data.
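A rough sketch of that rotating-backup idea (the device name /dev/sdc1 and the exact sequence are illustrative assumptions, not part of the course text) might look like this with mdadm:

# mdadm --grow /dev/md0 --raid-devices=3                          # make room for a third active mirror
# mdadm --manage /dev/md0 --add /dev/sdc1                         # add the backup disk; a resync starts
# watch -n 1 'cat /proc/mdstat'                                   # wait for the resync to complete
# mdadm --manage /dev/md0 --fail /dev/sdc1 --remove /dev/sdc1     # detach the now-complete third copy
# mdadm --grow /dev/md0 --raid-devices=2                          # return to a clean two-way mirror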

Recommended usage: data requiring the highest fault tolerance, with reduced emphasis on cost, capacity,
and/or performance.

RAID5 A-4

• Block-level striping with distributed parity


• Increased performance and fault tolerance
• Survives the failure of one array device
• Degraded mode
• Hot spare
• Requires 3 or more block devices
• Storage efficiency: 100*(1 - 1/N)%, where N=#devices
• Example:
• mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sd[abc]1

RAID5 (software), or striping with distributed parity, stripes both data and parity information across three or
more block devices. Striping the parity information eliminates single-device bottlenecks and provides some
parallelism advantages. Placing the parity information for a block of data on a different device helps ensure
fault tolerance.

The storage efficiency is maximized if identical-sized drives are used, and increases as more drives are
used in the RAID array.

When a single RAID5 array device is lost, the array's data remains available by regenerating the failed
drive's lost data on the fly. This is called degraded mode, because the RAID array's performance is
degraded while having to calculate the missing data.

Performance can be tuned by experimenting with and/or tuning the stripe size.

Recommended usage: data requiring a combination of read performance and fault-tolerance, lesser
emphasis on write performance, and minimum cost of implementation.

RAID5 Parity and Data Distribution A-5

Assuming: Chunk Size = 16KiB, Stripe Width = 64KiB

Devices    1                2                3                4
Stripe 1   Data             Data             Data             4 x 4KiB Parity
Stripe 2   Data             Data             4 x 4KiB Parity  Data
Stripe 3   Data             4 x 4KiB Parity  Data             Data
Stripe 4   4 x 4KiB Parity  Data             Data             Data
Stripe 5   Data             Data             Data             4 x 4KiB Parity
Stripe 6   Data             Data             4 x 4KiB Parity  Data

Parity calculations add extra data, and therefore require more storage space. The benefit to the extra parity
information is that it is possible to recover data from errors. It can be recreated from the parity information.

Data is written to this RAID starting in stripe 1, going across the RAID devices from 1 to 4, then proceeding
across stripe 2, 3, etc....

The above diagram illustrates left-symmetric parity, the default in Red Hat Enterprise Linux.

RAID5 Layout Algorithms A-6

• RAID5-specific option to mdadm:


• --layout=<type>
• Default is Left Symmetric

Left Asymmetric                 Right Asymmetric

sda1 sdb1 sdc1 sde1             sda1 sdb1 sdc1 sde1
D0   D1   D2   P                P    D0   D1   D2
D3   D4   P    D5               D3   P    D4   D5
D6   P    D7   D8               D6   D7   P    D8
P    D9   D10  D11              D9   D10  D11  P
D12  D13  D14  P                P    D12  D13  D14

Left Symmetric                  Right Symmetric

sda1 sdb1 sdc1 sde1             sda1 sdb1 sdc1 sde1
D0   D1   D2   P                P    D0   D1   D2
D4   D5   P    D3               D5   P    D3   D4
D8   P    D6   D7               D7   D8   P    D6
P    D9   D10  D11              D9   D10  D11  P
D12  D13  D14  P                P    D12  D13  D14

The --layout=<type> option to mdadm defines how data and parity information is placed on the array
segments. The different types are listed here:

left-asymmetric: Data stripes are written round-robin from the first array segment to the last array segment
(sdal to sdel). The parity's position in the striping sequence round-robins from the last segment to the first.

right-asymmetric: Data stripes are written round-robin from the first array segment to the last array segment
(sdal to sdel). The parity's position in the striping sequence round-robins from the first segment to the last.

left-symmetric: This is the default for RAID5 and is the fastest stripe mechanism for large reads. Data
stripes are written following the parity, always beginning the next stripe on the segment immediately following
the parity segment, then round-robining to complete the stripe. The parity's position in the striping sequence
round-robins from the last segment to the first.

right-symmetric: Data stripes are written following the parity, always beginning the next stripe on the segment
immediately following the parity segment, then round-robining to complete the stripe. The parity's position in
the striping sequence round-robins from the first segment to the last.
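A non-default layout is simply selected at creation time; for instance (a hypothetical four-device example using the same device names as the diagrams above):

# mdadm --create /dev/md0 --level=5 --raid-devices=4 --layout=right-asymmetric /dev/sd[abce]1

The layout of an existing array can be read back afterwards with mdadm --detail /dev/md0.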

RAID5 Data Updates Overhead A-7

• Each data update requires 4 I/O operations


• Data to be updated is read from disk
• Updated data written back, but parity incorrect
• Read all other blocks from same stripe and calculate parity
• Write out updated data and parity

RAID5 takes a performance hit whenever updating on-disk data. Before changed data can be updated on a
RAID5 device, all data from the same RAID stripe across all array devices must first be read back in so that
a new parity can be calculated. Once calculated, the updated data and parity can be written out.

The net effect is that a single RAID5 data update operation requires 4 I/O operations. The performance
impact can, however, be masked by a large subsystem cache.

RAID6 A-8

• Block-level striping with dual distributed parity


• Comparable to RAID5, with differences:
• Decreased write performance
• Greater fault tolerance
• Survives the failure of up to 2 array devices
• Degraded mode
• Protection during single-device rebuild
• SATA drives become more viable
• Requires 4 or more block devices
• Storage efficiency: 100*(1 - 2/N)%, where N=#devices
• Example:
• mdadm --create /dev/md0 --level=6 --raid-devices=4 /dev/sd[abcd]1

RAID6 (software), or striping with dual distributed parity, is similar to RAID5 except that it calculates two
sets of parity information for each segment of data.

The duplication of parity improves fault tolerance by allowing the failure of any two drives (instead of one as
with RAID5) in the array, but at the expense of slightly slower write performance due to the added overhead
of the increased parity calculations.

While protection from two simultaneous disk failures is nice, it is a fairly unlikely event. The biggest benefit
of RAID6 is protection against sector failure events during rebuild mode (when recovering from a single disk
failure). Other benefits to RAID6 include making less expensive drives (e.g. SATA) viable in an enterprise
storage solution, and providing the administrator additional time to perform rebuilds.

The storage efficiency is maximized if identical-sized drives are used, and increases as more drives are
used in the RAID array.

RAID6 reads can be slightly faster due to the possibility of data being spread out over one additional disk.
Performance can be tuned by experimenting with and/or tuning the chunk size.

Performance degradation can be substantial after the failure of an array member, and during the rebuild
process.

Recommended usage: data requiring a combination of read performance and higher level of fault-tolerance
than RAID5, with lesser emphasis on write performance, and minimum cost of implementation.

RAID6 Parity and Data Distribution A-9

Assuming: Chunk Size = 16KiB, Stripe Width = 64KiB

Devices    1                2                3                4
Stripe 1   Data             Data             4 x 4KiB Parity  4 x 4KiB Parity
Stripe 2   Data             4 x 4KiB Parity  4 x 4KiB Parity  Data
Stripe 3   4 x 4KiB Parity  4 x 4KiB Parity  Data             Data
Stripe 4   4 x 4KiB Parity  Data             Data             4 x 4KiB Parity
Stripe 5   Data             Data             4 x 4KiB Parity  4 x 4KiB Parity
Stripe 6   Data             4 x 4KiB Parity  4 x 4KiB Parity  Data

The key to understanding how RAID6 can withstand the loss of two devices is that the two parities (on
device 3 and 4 of stripe 1 in the diagram above) are separate parity calculations. The parity information on
3 might have been calculated from the information on devices 1-3, and the parity information on device 4
might be for devices 2-4.

If devices 1 and 2 failed, the parity on 4 combined with the data on 3 can be used to rebuild the data for 2.
Once 2 is rebuilt, its data combined with the parity information on device 3 can be used to rebuild device 1.

RAID10 A-10

• A stripe of mirrors (nested RAID)


• Increased performance and fault tolerance
• Requires 4 or more block devices
• Storage efficiency: (100/N)%, where N=#devices/mirror
• Example:
• mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sd[abcd]1

At the no-expense-spared end of RAID, RAID6 usually loses out to nested RAID solutions such as RAID10,
which provides the multiple-drive redundancy fault tolerance of RAID1 while still offering the maximum
performance of RAID0.

RAID10 is a striped array across elements which themselves are mirrors. For example, a similar RAID10 as
the one created by the command in the slide aboye (but with name /dev/md2) could be created using the
following three commands:

# mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sd[ab]1
# mdadm --create /dev/md1 -a yes --level=1 --raid-devices=2 /dev/sd[cd]1
# mdadm --create /dev/md2 -a yes --level=10 --raid-devices=2 /dev/md[01]

(Note: --level=0 could be substituted for --level=10 in the last command.)

See mdadm(8) for more information.

Stripe Parameters A-11

• RAID0, RAID5, RAID6


• Chunk size (mdadm -c N):
• Amount of space to use on each round-robin device before moving on to the next
• 64 kiB default
• Stride (mke2fs -E stride=N):
• (chunk size) / (filesystem block size)
• Can be used to offset ext2-specific data structures across the array devices for more even
distribution

Tuning stripe parameters is important to optimizing striping performance.

Chunk Size is the amount (segment size) of data read/written from/to each device before moving on to
the next in round-robin fashion, and should be an integer multiple of the block size. The chunk size is
sometimes also referred to as the granularity of the stripe.

Decreasing chunk size means files will be broken into smaller and smaller pieces, increasing the number of
drives a file will use to hold all its data blocks. This may increase transfer performance, but may decrease
positioning performance (some hardware implementations don't perform a write until an entire stripe width's
worth of data is written, wiping out any positional effects). Increasing chunk size has just the opposite effect.

Stride is a parameter used by mke2fs in an attempt to optimize the distribution of ext2-specific data
structures across the different devices in a striped array.

All things being equal, the read and write performance of a striped array increases as the number of
devices increase, because there is greater opportunity for parallel/simultaneous access to individual drives,
reducing the overall time for I/O to complete.
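As a quick worked example using the same values as the earlier RAID0 slide: with a 64KiB chunk and a 4KiB filesystem block, stride = 64KiB / 4KiB = 16, so a matching array and filesystem could be created with:

# mdadm --create /dev/md0 --level=0 --raid-devices=2 --chunk=64 /dev/sd[ab]1
# mke2fs -j -b 4096 -E stride=16 /dev/md0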


/proc/mdstat A-12




• Lists and provides information on all active RAID arrays
• Used by mdadm during --scan
• Monitor array reconstruction (watch -n .5 'cat /proc/mdstat')
• Examples:

Initial sync'ing of a RAID1 (mirror):

Personalities : [raid1]
md0 : active raid1 sda5[1] sdb5[0]
      987840 blocks [2/2] [UU]
      resync = 35.7% (354112/987840) finish=0.9min speed=10743K/sec

Active functioning RAID1:

# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda5[1] sdb5[0]
      987840 blocks [2/2] [UU]

unused devices: <none>

Failed half of a RAID1:

# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda5[1](F) sdb5[0]
      987840 blocks [2/1] [U_]

unused devices: <none>




Verbose RAID Information A-13

• Example (RAID1 rebuilding after failed member):

# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Tue Mar 13 14:20:58 2007
     Raid Level : raid1
     Array Size : 987840 (964.85 MiB 1011.55 MB)
    Device Size : 987840 (964.85 MiB 1011.55 MB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Tue Mar 13 14:25:34 2007
          State : clean, degraded, recovering
 Active Devices : 1
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 1

 Rebuild Status : 60% complete

           UUID : 1ad0a27b:b5d6d1d7:296539b4:f69e34ed
         Events : 0.6

    Number   Major   Minor   RaidDevice   State
       0       3       5        0         active sync        /dev/sda5
       1       3       6        1         spare rebuilding   /dev/sdb5

The --detail option to mdadm shows much more verbose information regarding a RAID array and
its current state. In the case above, the RAID array is clean (data is fully accessible from the one active
array member, /dev/sda5), running in degraded mode (we aren't really mirroring at the moment), and
recovering (a spare array member, /dev/sdb5, is being synced with valid data from /dev/sda5).

Once the spare is fully synced with the active member, it will be converted to another active member and
the state of the array will change to clean.

SYSFS Interface A-14

• /sys/block/mdX/md
• level
• raid_disks
• chunk_size (RAID0,5,6,10)
• component_size
• new_dev
• safe_mode_delay
• sync_speed_{min,max}
• sync_action

See Documentation/md.txt for a full explanation of all the files.

level                 Indicates the RAID level of this array.
raid_disks            Number of devices in a fully functional array.
chunk_size            Size of 'chunks' (bytes), and only relevant to striping RAID arrays.
component_size        For mirrored RAID arrays, this is the valid size (sectors) that all members have
                      agreed upon (all members should be the same size).
new_dev               Write-only file expecting a "major:minor" character string of a device that
                      should be attached to the array.
safe_mode_delay       If no write requests have been made in the past amount of time determined by
                      this file (200ms default), then md declares the array to be clean.
sync_speed_{min,max}  Current goal rebuild speed for times when the array has ongoing non-rebuild
                      activity. Similar to /proc/sys/dev/raid/speed_limit_{min,max}, but
                      they only apply to this particular RAID array. If "(system)" appears, then it is
                      using the system-wide value, otherwise a locally set value shows "(local)".
                      The system-wide value is set by writing the word system to this file. The
                      speed is kiB/s.
sync_action           Used to monitor and control the rebuild process. Contains one word: resync,
                      recover, idle, check, or repair. The 'check' parameter is useful
                      to check for consistency (will not correct any discrepancies). A count of
                      problems found will be stored in mismatch_cnt. Writing 'idle' will stop
                      the checking process.
stripe_cache_size     Used for synchronizing all read and write operations to the array. Increasing
                      this number may increase performance at the expense of system memory.
                      RAID5 only (currently). Default is 128 pages per device in the stripe cache
                      (min=16, max=32768).
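A few quick illustrations of reading and writing these files (md0 is only an example device name):

# cat /sys/block/md0/md/level
# cat /sys/block/md0/md/raid_disks
# echo 50000 > /sys/block/md0/md/sync_speed_min    # raise the minimum rebuild speed for this array only
# echo check > /sys/block/md0/md/sync_action       # start a background consistency check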

/etc/mdadm.conf A-15

• Used to simplify and configure RAID array construction


• Allows grouping of arrays to share a spare drive
• Leading white space treated as line continuation
• DEVICE is optional (assumes DEVICE partitions)
• Create for an existing array: mdadm --verbose --examine --scan
• Example:

DEVICE partitions
ARRAY /dev/md0 level=raid1 num-devices=2
   UUID=c5dac4d3:2d6b9861:ab54c1f6:27c15a12
   devices=/dev/sda2,/dev/sdc2
ARRAY /dev/md1 level=raid0 num-devices=2
   UUID=4ed6e3cc:f12c94b1:a2044461:19e09821
   devices=/dev/sda1,/dev/sdc1

DEVICE - Lists devices that might contain a component of a RAID array. Using the word 'partitions'
causes mdadm to read and include all partitions from /proc/partitions. DEVICE partitions is the
default, and so specifying it is optional. More than one line is allowed and it may use wild cards.

ARRAY - Specifies information about how to identify RAID arrays and what their attributes are, so that they
can be activated.

ARRAY attributes

uuid          Universally Unique IDentifier of a device
super-minor   The integer identifier of the RAID array (e.g. 3 from /dev/md3) that is stored
              in the superblock when the RAID device was created (usually the minor
              number of the metadevice)
name          A name, stored in the superblock, given to the array at creation time
devices       Comma-delimited list of devices in the array
level         RAID level
num-devices   Number of devices in a complete, active array
spares        The expected number of spares an array should have
spare-group   A name for a group of arrays, within which a common spare device can be
              shared
auto          Create the array device if it doesn't exist or has the wrong device number. Its
              value can also indicate if the array is partitionable (mdp or partition) or
              non-partitionable (yes or md).
bitmap        The file holding write-intent bitmap information
metadata      Specifies the metadata format of the array
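A typical way to build this file for arrays that already exist (the mail address here is an example only):

# echo 'DEVICE partitions' > /etc/mdadm.conf
# mdadm --verbose --examine --scan >> /etc/mdadm.conf
# echo 'MAILADDR root@example.com' >> /etc/mdadm.conf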

Event Notification A-16

• Make sure e-mail works


• /etc/mdadm.conf
• MAILADDR root@example.com
• MAILFROM root@nodel.example.com
• PROGRAM /usr/sbin/my-RAID-script
• Test:
• mdadm --monitor --scan --oneshot --test
• Implement continuous monitoring of the array:
• chkconfig mdmonitor on; service mdmonitor start

The MAILADDR line in /etc/mdadm.conf provides an e-mail address to which alerts should be sent
when mdadm is running in "--monitor --scan" mode. There should only be one MAILADDR line and it
should have only one address.

The MAILFROM line in /etc/mdadm.conf provides the "From" address for the event e-mails sent out. The
default is root with no domain.
A copy of /proc/mdstat is sent along with the event e-mail.

These values cannot be set via the mdadm command line, only via /etc/mdadm.conf.

A shorter form of the above test command is: mdadm -Fs1t

A program may also be run (PROGRAM in /etc/mdadm.conf) when "mdadm --monitor" detects
potentially interesting events on any of the arrays that it is monitoring. The program is passed two
arguments: the event and md device (a third argument may be passed: the related component device).

The mdadm daemon can also be put into continuous-monitor mode using the command: mdadm
--daemonise --monitor --scan --mail root@example.com but this will not survive a reboot and should only
be used for testing.
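A minimal sketch of such a PROGRAM handler (the path, variable names and use of logger are illustrative assumptions, not part of the course material):

#!/bin/bash
# /usr/sbin/my-RAID-script - called by "mdadm --monitor" with:
#   $1 = event name, $2 = md device, $3 = component device (optional)
EVENT=$1
MD_DEVICE=$2
COMPONENT=$3
# record the event in syslog; any other action (paging, e-mail, etc.) could go here
logger -t mdadm-event "${EVENT} on ${MD_DEVICE} ${COMPONENT}"

Remember to make the script executable (chmod +x) before referencing it from /etc/mdadm.conf.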

Restriping/Reshaping RAID Devices A-17

• Re-arrange the data stored in each stripe into a new layout


• Necessary after changing:
• Number of devices
• Chunk size
• Arrangement of data
• Parity location/type
• Must back up the Critical Section

When beginning the process of stripe reorganization, there is a "critical section" during which live data is
being over-written on disk with the new organization layout. While md works on the critical section, it is in
peril of losing data in the event of a crash or power failure.

Once beyond the critical section, data is then only written to areas of the array which no longer hold live
data -- those areas have already been cleared out and placed elsewhere.

For example, to increase the number of members in a RAID5 array, the critical section consists of the first
few (old number of devices multiplied by new number of devices) stripes.

To avoid the possibility of data loss, mdadm will:

1) disable writes to the critical section of the array
2) backup the critical section
3) continue the reshaping process
4) eventually invalidate the backup and restore write access once the critical section is passed

mdadm also provides a mechanism for restoring critical data before restarting an array that was interrupted
during the critical section.

For reshaping operations that don't change the size of the array (e.g. changing the chunk size), the
entire process remains in the critical section. In this case, the reshaping happens in sections. Each section
is marked read-only, backed up, reshaped, and finally written back to the array device.

Growing the Number of Disks in a RAID5 Array A-18

• Requires a reshaping of on-disk data


• Add a device to the active 3-device RAID5 (starts as a spare):
• mdadm --add /dev/md0 /dev/hda8
• Grow into the new device (reshape the RAID5):
• mdadm --grow /dev/md0 --raid-devices=4
• Monitor progress and estimated time to finish
• watch -n 1 'cat /proc/mdstat'
• Expand the FS to fill the new space while keeping it online:
• resize2fs /dev/md0

In 2.6.17 and newer kernels, a new disk can be added to a RAID5 array (e.g. go from 3 disks to 4, and not
just as a spare) while the filesystem remains online.

This allows you to expand your RAID5 on the fly without having to fail-out (one at a time) all 3 disks for
larger spare ones before doing a filesystem grow.

The reshaping of the RAID5 can be slow, but can be tuned by adjusting the kernel tunable minimum
reconstruction speed (default=1000):

echo 25000 > /proc/sys/dev/raid/speed_limit_min

The steps for adding a new disk are:

1. Add the new disk to the active 3-device RAID5 (starts as a spare):

mdadm --add /dev/md0 /dev/hda8

2. Reshape the RAID5:

mdadm --grow /dev/md0 --raid-devices=4

3. Monitor the reshaping process and estimated time to finish:

watch -n 1 'cat /proc/mdstat'

4. Expand the FS to fill the new space:

resize2fs /dev/md0

Improving the Process with a Critical Section Backup A-19

• During the first stages of a reshape, the critical section is backed up, by default, to:
• a spare device, if one exists
• otherwise, memory
• If the critical section is backed up to memory, it is prone to loss in the event of a failure
• Backup critical section to a file during reshape:
• mdadm --grow /dev/md0 --raid-devices=4 --backup-file=/tmp/md0.bu
• Once past the critical section, mdadm will delete the file
• In the event of a failure during the critical section:
• mdadm --assemble /dev/md0 --backup-file=/tmp/md0.bu /dev/sd[a-d]

To modify the chunk size, add new devices, modify arrangement of on-disk data, or change the
parity location/type of a RAID array, the on-disk data must be "reshaped". Reshaping striped data is
accomplished using the command:

mdadm --grow /dev/md0 --raid-devices=4 --backup-file=/tmp/md0-backup

For the process of reshaping, mdadm will copy the first few stripes to /tmp/md0-backup (in this example)
and start the reshape. Once it gets past the critical section, mdadm will remove the file. If the system
happens to crash during the critical section, the only way to assemble the array would be to provide
mdadm the backup file:

mdadm --assemble /dev/md0 --backup-file=/tmp/md0-backup /dev/sd[a-d]

Note that a spare device is used by default, if it exists, for the backup. If none exists, and a file is not
specified (as above), then memory is used for the backup and is therefore prone to loss as a result of any
error.

Growing the Size of Disks in a RAID5 Array A-20

• One at a time:
• Fail a device
• Grow its size
• Re-add to array
• Then, grow the array into the new space
• Finally, grow the filesystem into the new space

Additional space for an array can come from growing each member device (especially a logical volume)
within the array, or replacing each device with a larger one.

Assume for the moment that our array devices are logical volumes (/dev/vg0/disk{1,2,3}) and that
we have the ability to extend them by 100GB, each from a volume group named vg0.

To grow the size of our RAID5 array, one at a time (do NOT do this to more than one disk at a time, or
move on to the next disk while the array is still rebuilding, or data loss will occur!), fail and remove each
device, grow it, then re-add it back into the array.

mdadm --manage /dev/md0 --fail /dev/vg0/disk1 --remove /dev/vg0/disk1
(...array is now running in degraded mode...)
lvextend -L +100G /dev/vg0/disk1
mdadm --manage /dev/md0 --add /dev/vg0/disk1
watch -n 1 'cat /proc/mdstat'

Once the array has completed building, do the same thing for the 2nd and 3rd devices.

Once all three devices are grown and re-added to the array, now it's time to grow the array into the newly
available space, to the largest size that fits on all current drives:
mdadm --grow /dev/md0 --size=max

Now that the array device is larger, the filesystem must be grown into the new space (while keeping the
filesystem online):
resize2fs /dev/md0

If we were replacing each drive with a larger one, the process would mostly be the same, except we would
add all three new drives into the array at the start as spares. With each removal of the smaller drive, the
array would rebuild using one of the newer spare drives. After all three drives are introduced and the array
rebuilds three times, the array and filesystem would be grown into the new space.

Sharing a Hot Spare Device in RAID A-21

• Ensure at least one array has a spare drive
• Populate /etc/mdadm.conf with current array data
• mdadm --verbose --examine --scan >> /etc/mdadm.conf
• Choose a name for the shared spare-group (e.g. share1)
• Configure each participating ARRAY entry with the same spare-group name
• spare-group=share1

For example, if RAID array /dev/md1 has a spare drive, /dev/sde1, that should be shared with another
RAID array:

DEVICE /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1

ARRAY /dev/md0 level=raid1 num-devices=2
   UUID=c5dac4d3:2d6b9861:ab54c1f6:27c15a12
   devices=/dev/sda1,/dev/sdb1 spare-group=share1
ARRAY /dev/md1 level=raid1 num-devices=2
   UUID=4ed6e3cc:f12c94b1:a2044461:19e09821
   devices=/dev/sdc1,/dev/sdd1,/dev/sde1 spare-group=share1

Now mdadm can be put in daemon mode to continuously poll the devices. By default it will scan every 60
seconds, but that can be altered with the --delay=<#seconds> option.

If mdadm senses that a device has failed, it will look for a hot spare device in all arrays sharing the same
spare-group identifier. If it finds one, it will make it available to the array that needs it, and begin the rebuild
process.

The hot spare can and should be tested by failing and removing a device from /dev/md0:

mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1

When mdadm next polls the device, it should make /dev/sde1 available to /dev/md0 and rebuild the
array, automatically. Additional hot spares can be added dynamically.

Hot spares can also be configured at array creation time:

mdadm -C /dev/md0 -l 5 -n 4 -x 1 -c 64 spare-group=mygroupname /dev/sd{a,b,c,d,e}1

This configures a RAID5 with 4 disks, 1 spare, chunk size of 64k, and is associated with the spare-group
named mygroupname.

Renaming a RAID Array A-22

• Moving a RAID array to another system


• What if /dev/md0 is already in use?
• Example: rename /dev/md0 to /dev/md3
• Stop the array:
• mdadm --stop /dev/md0

• Reassemble it as /dev/md3:
• mdadm --assemble /dev/md3 --super-minor=0 --update=super-minor /dev/sda5 /dev/sdb5

How do we rename a RAID array if it needs to move to another system, which already has an array with the
same name?

In the following example, /dev/md0 is the original and /dev/md3 is the new md device. /dev/sda5 and /
dev/sdb5 are the two partitions that make up the RAID device.

First stop the RAID device:

mdadm --stop /dev/md0

Now reassemble the RAID device as /dev/md3:

mdadm --assemble /dev/md3 --super-minor=0 --update=super-minor /dev/sda5 /dev/sdb5

This reassembly process looks for devices which have an existing minor number of 0 (referring to the zero
in /dev/md0 in this case, so option --super-minor=0), and then updates the array's superblocks to the
new number (the 3 in /dev/md3).

The array device can now be plugged into the other system and be immediately recognized as /dev/md3
without issue, so long as no existing array is already named /dev/md3.

Write-intent Bitmap A-23

• The RAID driver periodically writes out bitmap information describing portions of the array that have changed
• After failed sync events, only changed portions need be re-synced
• Power loss before array components have a chance to sync
• Temporary failure and/or removal of a RAID1 member
• Faster RAID recovery times
• Allows --write-behind on --write-mostly disks using RAID1

A write-intent bitmap is used to record which areas of a RAID component have been modified since the
RAID array was last in sync. The RAID driver periodically writes this information to the bitmap.

In the event of a power loss before all drives are in sync, when the array starts up again a full sync is
normally needed. With a write-intent bitmap, only the changed portions need to be re-synced, dramatically
reducing recovery time.

Also, if a drive fails and is removed from the array, md stops clearing bits in the bitmap. If that same drive
is re-added to the array again, md will notice and only recover the portions of the drive that the bitmap
indicates have changed. This allows devices to be temporarily removed and then re-added to the array
without incurring a lengthy recovery/resync.

Write-behind is discussed in an upcoming slide.

Enabling Write-Intent on a RAID1 Array A-24

• Internal (metadata area) or external (file)


• Can be added to (or removed from) active array
• Enabling write-intent bitmap:
• RAID volume must be in sync
• Must have a persistent superblock
• Internal
• mdadm --grow /dev/mdx --bitmap=internal

• External
• mdadm --grow /dev/mdx --bitmap=/root/filename

• Filename must contain at least one slash ('/') character


• ext2/ext3 Filesystems only

The bitmap file should not pre-exist when creating it. If an internal bitmap is chosen (-b internal), then
the bitmap is stored with the metadata on the array, and so is replicated on all devices. If an external bitmap
is chosen, the name of the bitmap must be an absolute pathname to the bitmap file, and it must be on a
different filesystem than the RAID array it describes, or the system will deadlock.

Before write-intent can be turned on for an already-active array, the array must already be in sync and have
a persistent superblock. Verify this by running the command:

mdadm --detail /dev/mdx

and making sure the State and Persistence attributes read:

State : active
Persistence : Superblock is persistent

If both attributes are OK, then add the write-intent bitmap (in this case, an internal one):

mdadm /dev/mdx --grow --bitmap=internal

The status of the bitmap as writes are performed can be monitored with the command:

watch -n .1 'cat /proc/mdstat'

To turn off the write-intent bitmapping:

mdadm /dev/mdx --grow --bitmap=none

Write-behind on RAID1 A-25

• --write-behind=256 (default)
• Required:
• write-intent ( --bitmap= )
• --write-mostly

• Facilitates slow-link RAID1 mirrors


• Mirror can be on a remote network
• Write-intent bitmap prevents application from blocking during writes

If a write-intent (--bitmap=) bitmap is combined with the --write-behind option, then write requests to
--write-mostly devices will not wait for the requests to complete before reporting the write as complete
to the filesystem (non-blocking).

RAID1 with write-behind can be used for mirroring data over a slow link to a remote computer. The extra
latency of the remote link will not slow down the system doing the writing, and the remote system will still
have a fairly current copy of all data.

If an argument is specified to --write-behind, it will set the maximum number of outstanding writes
allowed. The default value is 256.
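As an illustration (device names and the internal-bitmap choice are assumptions), a RAID1 whose second member sits behind a slow link might be created like this:

# mdadm --create /dev/md0 --level=1 --raid-devices=2 \
    --bitmap=internal --write-behind=256 \
    /dev/sda1 --write-mostly /dev/sdb1

Here /dev/sdb1 is flagged write-mostly (reads are directed to /dev/sda1 when possible) and up to 256 writes to it may be outstanding before the array starts blocking.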

RAID Error Handling and Data Consistency Checking A-26

• RAID passively detects bad blocks


• Tries to fix read errors, evicts device from array otherwise
• The larger the disk, the more likely a bad block encounter
• Initiate consistency and bad block check:

echo check >> /sys/block/mdX/md/sync_action


Normally, RAID will passively detect bad blocks. If a read error occurs when attempting a block read (soft
error), an attempt is made to reconstruct (rewrite) the data using another valid copy in the array. If the
reconstruction of the block fails (hard error), the errant device is evicted from the active array.

This becomes more problematic during, for example, a RAID5 reconstruction. If, during the reconstruction,
an unrecoverable block error is discovered in one of the devices used to rebuild a failed drive within a
RAID5 array, recovery of the data or array may be impossible. The larger the disk, the more likely the
problem.

As a result, it is imperative to actively seek out bad blocks in an array.

In kernel versions 2.6.16 or greater, the following command will initiate an active data consistency and bad
block check, and attempt to fix any bad blocks:

echo check >> /sys/block/mdX/md/sync_action

The progress of the consistency check can be monitored using:

watch -n 1 'cat /proc/mdstat'

It is a good idea to check periodically by adding the check to the proper crontab entries.
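For example, a (hypothetical) root crontab entry that checks /dev/md0 every Sunday at 01:00 could look like:

# minute hour day-of-month month day-of-week  command
0 1 * * 0  echo check > /sys/block/md0/md/sync_action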

Lab A.1: Improve RAID1 Recovery Times with Write-intent Bitmaps
Scenario: In this sequence you will measure the recovery time of a RAID mirror with
and without write-intent bitmaps.

Instructions:

1. Use fdisk to create four 500MiB partitions on your local workstation of type "Linux raid
autodetect (fd)". Run partprobe when you have finished so that the kernel recognizes
the partition table changes.
2. Create a RAID1 (mirror) array from the first two 500MiB partitions you have made.
3. Create another RAID1 (mirror) array from the 3rd and 4th 500MiB partitions you have made,
but this time with a write-intent bitmap.
4. Place an ext3-formatted filesystem on each of the two RAID1 arrays, and mount them to /data0
and /data1, respectively.
5. Open a new terminal window next to the first so that the two windows are in view at the same
time. In the second window, watch the status of the two arrays with a fast refresh time. We will
use this to monitor the rebuild process.
How could you tell from the status of the array which one has the write-intent bitmap?
6. One array at a time, fail and remove a device in an array, write some information to that array's
filesystem (which should still be online), then re-add the failed device back to the array. This
will force a rebuild of the (temporarily) failed device with information from the surviving
device. Wait for the array to finish rebuilding before doing the same thing to the other array.
Which array has the faster rebuild time? Why?

Lab A.2: Improve Data Reliability Using RAID 6
Scenario: In this sequence you will build a RAID 6 array and a RAID 5 array on
direct-attached storage devices and compare the reliability of each.

The RAID 6 array will be set up as an LVM physical volume in order to demonstrate
growing an existing array in a subsequent sequence in this lab.

Instructions:

1. On node1 of your cluster, create three 100MiB partitions on /dev/hda of type "Linux
raid autodetect (fd)". Three primary partitions already exist, so make /dev/hda4
an extended partition consisting of the remaining space on the disk, and then create
/dev/hda5, /dev/hda6 and /dev/hda7 as logical partitions.

Run partprobe when you have finished so that the kernel recognizes the partition table
changes.

2. On node1 of your cluster, create a RAID5 array from the three partitions you made.

3. Create an ext3 filesystem on the array, mount it to a directory named /raid5, and copy/create
a readable text file in it that is larger than the chunk size of the array (e.g. /usr/share/dict/words).
4. Check /proc/mdstat and verify that the RAID5 array has finished synchronizing. Once it
has, fail and remove one of the devices from the RAID5 array. Verify the status of the array in
/proc/mdstat, and that you can still see the contents of /raid5/words.

5. Fail a second device.

6. Can you still see the contents of /raid5/words? How is this possible?

7. Are you able to create new files in /raid5?

8. Is the device recoverable?

9. Completely disassemble, then re-create /dev/md0.

10. Create four more 100MiB partitions (of type "fd") on /dev/hda, then create a RAID6 array
from those partitions.

11. Wait for the RAID array to finish sync'ing, then create a volume group named vgraid using
the RAID6 array.

12. Determine the number of free extents.



13. Create a logical volume named lvraid using all free extents reported in the previous step.

14. Create and mount an ext3 filesystem to a directory named /raid6 using the logical volume.
Create a file in it named test, with contents "raid6".

15. Fail and remove one of the RAID6 array devices.

16. Fail and remove a second device. Is the data still accessible?

17. Can you still create new files on /raid6?

18. Recover the RAID6 devices and resync the array (Note: this may take a few minutes).



Lab A.3: Improving RAID Reliability with a Shared Hot Spare Device
Scenario: In this sequence you will create a hot spare device that is shared between
your RAID5 and RAID6 arrays.

System Setup: The RAID5 and RAID6 arrays from the previous exercise should still be in
place and active.

Instructions:

1. On node1 of your cluster, create a RAID configuration file (/etc/mdadm.conf).

2. Edit /etc/mdadm.conf to associate a spare group with each array:

3. Create a 100 MiB partition on /dev/hda of type fd. Add the new partition as a hot spare to
the RAID5 array and observe the array's status.

4. Fail and remove one device from the RAID6 array. Did the spare move from the RAID5 to the
RAID6 array? Why or why not?

5. In another terminal window, monitor the status of your RAID arrays (refresh every 0.5s) while
you perform the next step: Add an email address to /etc/mdadm.conf to instruct the
monitoring daemon to send mail alerts to root, then start mdmonitor.
6. What happened to the spare device?

Note: do not re-add /dev/hda8 at this point.

Lab A.4: Online Data Migration
Scenario: In this sequence you will migrate your RAID6 array from DASD to a SAN
without an outage. You will simultaneously expand the size of your array
and its associated file system.
System Setup: This lab presumes that your cluster node's DASD storage is /dev/hda
and the SAN storage is /dev/sda.
Instructions:

1. Delete any previously existing partitions on your SAN (/dev/sda) device, then create four
new 1GiB partitions of type fd such that the partition table looks like the following:

/dev/sda1 primary  type/ID=fd size=1GB
/dev/sda2 primary  type/ID=fd size=1GB
/dev/sda3 primary  type/ID=fd size=1GB
/dev/sda4 extended type/ID=5  size="remaining disk space"
/dev/sda5 logical  type/ID=fd size=1GB

2. In a different terminal, monitor the status of your RAID arrays. One device at a time, migrate
your RAID6 DASD members to the SAN.

3. Note the current size of the RAID6 array.

4. Grow the RAID6 array into the newly available space, while keeping it online. Note the new
size of the array when done.
5. Note the current size of the /raid6 filesystem, its logical volume, and the number of free
extents in your volume group.
6. Resize the /dev/md1 physical volume.

7. Now that the physical volume has been resized, check the number of free extents in the
volume group with vgdisplay. Resize the /dev/vgraid/lvraid logical volume, where
NN=number of free extents discovered previously.
8. Why did you not have to grow the volume group?

9. Note the current size of your filesystem, then grow the filesystem into the newly-available
space. Note the new filesystem size when you are done.

Lab A.5: Growing a RAID5 Array While Online
Scenario: In this sequence you will grow your RAID5 array, and keep it online while
doing so, by adding another device.

System Setup: It is assumed that you still have a working RAID5 array from the previous
exercises.

Instructions:

1. Create a new 100MiB partition on /dev/hda of type fd, and make sure the kernel is aware of
it.

2. Add the device to the RAID5 array.

3. Grow the array into the new space. Note that the array must be reshaped when adding disks.
Also note that all four slots of the array become filled ([UUUU]).

4. Grow the array again, this time without first adding a spare device, noting that the command
adds an empty slot since there are no spares available ([UUUU_]).

5. Explore this further to convince yourself that the array is growing in degraded (recovering)
mode:

6. Question: What would happen to your data if a device failed during the reshaping process with
no spares?



Lab A.6: Clean Up
Scenario: In this sequence we will disassemble the RAID arrays and Logical
Volumes we created in preparation for the next lab sequence.

Instructions:

1. Unmount any filesystems created in this lab.

2. Disassemble the logical volume that was created in this lab. (Note: your logical volume and its
components may be different than what is listed here. Double-check against the output of lvs,
vgs, and pvs.)
3. Disassemble the RAID arrays created in this lab (Note: your partitions may be different than
those listed here. Double-check against the output of "cat /proc/mdstat").


Lab A.7: Rebuild Virtual Cluster Nodes
Deliverable: Remove partitions and rebuild all virtual cluster nodes.

Instructions:

1. Clean up: On node1 remove all partitions on the iSCSI device with the /root/RH436/
HelpfulFiles/wipe_sda tool.

2. Clean up: Rebuild node1, node2, and node3 using the rebuild-cluster script.



Lab A.1 Solutions
1. Use fdisk to create four 500MiB partitions on your local workstation of type "Linux raid
autodetect (fd)". Run partprobe when you have finished so that the kernel recognizes
the partition table changes.

Note: we presume below your local disk device is /dev/sda, but yours may differ.

stationX# fdisk /dev/sda



stationX# partprobe /dev/sda
2. Create a RAID1 (mirror) array from the first two 500MiB partitions you have made (Note: your
partition numbers may differ depending upon partitions created in previous labs).
stationX# mdadm -C /dev/md0 -l1 -n2 /dev/sda{6,7} -a yes
3. Create another RAID1 (mirror) array from the 3rd and 4th 500MiB partitions you have made,
but this time with a write-intent bitmap.

stationX# mdadm -C /dev/md1 -l1 -n2 /dev/sda{8,9} -a yes -b internal
4. Place an ext3-formatted filesystem on each of the two RAID1 arrays, and mount them to /data0
and /data1, respectively.

stationX# mkdir /data0 /data1


stationX# mkfs -t ext3 /dev/md0

stationX# mkfs -t ext3 /dev/md1

stationX# mount /dev/md0 /data0


stationX# mount /dev/md1 /data1

5. Open a new terminal window next to the first so that the two windows are in view at the same
time. In the second window, watch the status of the two arrays with a fast refresh time. We will
use this to monitor the rebuild process.

stationX# watch -n .2 'cat /proc/mdstat'


How could you tell from the status of the array which one has the write-intent bitmap?

One of the RAID1 arrays will have a line in its /proc/mdstat output similar to:

bitmap: 0/121 pages [0KB], 4KB chunk

6. One array at a time, fail and remove a device in an array, write some information to that array's
filesystem (which should still be online), then re-add the failed device back to the array. This
will force a rebuild of the (temporarily) failed device with information from the surviving
device. Wait for the array to finish rebuilding before doing the same thing to the other array.

stationX# mdadm /dev/md0 -f /dev/sda6 -r /dev/sda6

stationX# dd if=/dev/urandom of=/data0/file bs=1M count=10

stationX# mdadm /dev/md0 -a /dev/sda6

stationX# mdadm /dev/md1 -f /dev/sda8 -r /dev/sda8

stationX# dd if=/dev/urandom of=/data1/file bs=1M count=10

stationX# mdadm /dev/md1 -a /dev/sda8

Which array has the faster rebuild time? Why?

The write-intent array, by far! The information written to the array when one-half of the mirror
was down was recorded in the write-intent bitmap. When the other half of the mirror was re-
added to the array, only the changes from the bitmap needed to be sent to the new device,
instead of having to synchronize, from scratch, the entire array's volume.
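
If you are curious, the internal write-intent bitmap itself can be examined on one of the member devices (the partition number follows this lab's example and may differ on your system):

stationX# mdadm -X /dev/sda8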



Lab A.2 Solutions
1. On node1 of your cluster, create three 100MiB partitions on /dev/hda of type "Linux
raid autodetect (fd)". Three primary partitions already exist, so make /dev/hda4
an extended partition consisting of the remaining space on the disk, and then create
/dev/hda5, /dev/hda6 and /dev/hda7 as logical partitions.

Run partprobe when you have finished so that the kernel recognizes the partition table
changes.

node1# fdisk /dev/hda

node1# partprobe /dev/hda

2. On node1 of your cluster, create a RAID5 array from the three partitions you made.

node1# mdadm -C /dev/md0 -l5 -n3 /dev/hda{5,6,7}


3. Create an ext3 filesystem on the array, mount it to a directory named /raid5, and copy/create
a readable text file in it that is larger than the chunk size of the array (e.g. /usr/share/dict/words).

Wait for the RAID array to complete its synchronization process (watch -n 1 'cat /proc/mdstat'),
then:

node1# mkfs -t ext3 -L raid5 /dev/md0

node1# mkdir /raid5

node1# mount LABEL=raid5 /raid5


node1# mdadm --detail /dev/md0 | grep Chunk

node1# cp /usr/share/dict/words /raid5

node1# echo "raid5" > /raid5/test


4. Check /proc/mdstat and verify that the RAID5 array has finished synchronizing. Once it
has, fail and remove one of the devices from the RAID5 array. Verify the status of the array in
/proc/mdstat, and that you can still see the contents of /raid5/words.

node1# cat /proc/mdstat

node1# mdadm /dev/md0 -f /dev/hda5 -r /dev/hda5

node1# cat /proc/mdstat

node1# cat /raid5/words

5. Fail a second device.

node1# mdadm /dev/md0 -f /dev/hda6 -r /dev/hda6



6. Can you still see the contents of /raid5/words? How is this possible?

Yes. Files larger than the chunk-size are readable only if still cached in memory from writing to
the block device. In this case, we recently wrote it, so it is still cached.
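
If you want to convince yourself of this (an optional side experiment, not part of the original exercise), flush the page cache and try the read again; it should now fail with an I/O error:

node1# echo 3 > /proc/sys/vm/drop_caches

node1# cat /raid5/words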
7. Are you able to create new files in /raid5?

No. The filesystem is marked read-only.
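
A quick way to verify this (a sketch, assuming the filesystem was automatically remounted read-only as described):

node1# touch /raid5/newfile          (should fail with "Read-only file system")

node1# grep raid5 /proc/mounts       (the mount options should include "ro")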

8. Is the device recoverable?

No. Adding the devices back into the array will not initiate recovery; they are treated as spares,
only. Attempting to reassemble the device results in a message indicating that there are not
enough valid devices to start the array.

node1# watch -n .5 'cat /proc/mdstat'

node1# mdadm /dev/md0 -a /dev/hda5

node1# umount /dev/md0

node1# mdadm -S /dev/md0

node1# mdadm --assemble /dev/md0 /dev/hda{5,6,7} --force

"mdadm: /dev/md0 assembled from 1 drive and 2 spares - not enough to start the array."

9. Completely disassemble, then re-create /dev/md0.

node1# umount /raid5

node1# mdadm -S /dev/md0

node1# mdadm --zero-superblock /dev/hda{5,6,7}

node1# mdadm -C /dev/md0 -l5 -n3 /dev/hda{5,6,7}

10. Create four more 100MiB partitions (of type "fd") on /dev/hda, then create a RAID6 array
from those partitions.

After using fdisk to create the partitions, be sure to run partprobe /dev/hda so the kernel is
aware of them, then:

node1# mdadm -C /dev/md1 -l6 -n4 /dev/hda{8,9,10,11} -a yes

11. Wait for the RAID array to finish sync'ing, then create a volume group named vgraid using
the RAID6 array.

node1# pvcreate /dev/md1

node1# vgcreate vgraid /dev/md1

12. Determine the number of free extents.



node1# vgdisplay vgraid | grep -i free

13. Create a logical volume named lvraid using all free extents reported in the previous step.

Run the following command, where NN is the number of free extents:

node1# lvcreate -l NN -n lvraid vgraid

14. Create and mount an ext3 filesystem to a directory named /raid6 using the logical volume.
Create a file in it named test, with contents "raid6".

node1# mkfs -t ext3 -L raid6 /dev/vgraid/lvraid

node1# mkdir /raid6

node1# mount LABEL=raid6 /raid6

node1# echo "raid6" > /raid6/test

15. Fail and remove one of the RAID6 array devices.

node1# mdadm /dev/md1 -f /dev/hda8 -r /dev/hda8

If the device cannot be removed, it is probably because the resynchronization process has not
yet completed. Wait until it is done, then try again.

node1# mdadm /dev/md1 -r /dev/hda8

16. Fail and remove a second device. Is the data still accessible?

node1# mdadm /dev/md1 -f /dev/hda9 -r /dev/hda9

node1# cat /raid6/test

Yes, the data should still be accessible.

17. Can you still create new files on /raid6?

node1# touch /raid6/newfile

Yes, it is still a read-write filesystem.

18. Recover the RAID6 devices and resync the array (Note: this may take a few minutes).

node1# mdadm /dev/md1 -a /dev/hda8

node1# mdadm /dev/md1 -a /dev/hda9

node1# watch -n .5 'cat /proc/mdstat'



Lab A.3 Solutions
1. On node1 of your cluster, create a RAID configuration file (/etc/mdadm.conf).

node1# mdadm --examine --verbose --scan > /etc/mdadm.conf

2. Edit /etc/mdadm.conf to associate a spare group with each array:

ARRAY /dev/md0 level=raid5 num-devices=3 UUID=... spare-group=1

ARRAY /dev/md1 level=raid6 num-devices=4 UUID=... spare-group=1

(Note: substitute the correct UUID value, it is truncated here for brevity.)

3. Create a 100 MiB partition on /dev/hda of type fd. Add the new partition as a hot spare to
the RAID5 array and observe the array's status.

After using fdisk to create the partition, be sure to run partprobe /dev/hda so the kernel is
aware of it. Then:

node1# mdadm /dev/md0 -a /dev/hda11

node1# cat /proc/mdstat

4. Fail and remove one device from the RAID6 array. Did the spare move from the RAID5 to the
RAID6 array? Why or why not?

node1# mdadm /dev/md1 -f /dev/hda8 -r /dev/hda8

It should not, because mdmonitor is not enabled.
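
To confirm (optional), check that the monitoring daemon is indeed not running at this point:

node1# service mdmonitor status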

5. In another terminal window, monitor the status of your RAID arrays (refresh every 0.5s) while
you perform the next step: Add an email address to /etc/mdadm.conf to instruct the
monitoring daemon to send mail alerts to root, then start mdmonitor.

node1# watch -n .5 'cat /proc/mdstat'

node1# echo 'MAILADDR root@localhost' >> /etc/mdadm.conf

node1# echo 'MAILFROM root@localhost' >> /etc/mdadm.conf

node1# chkconfig mdmonitor on

node1# service mdmonitor restart

6. What happened to the spare device?

Note: do not re-add /dev/hda8 at this point.

The spare device should have automatically migrated from the RAID5 to the RAID6 array.
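
One way to verify the migration (device names follow this lab, but yours may differ):

node1# cat /proc/mdstat

node1# mdadm --detail /dev/md1 | grep -i spare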



Lab A.4 Solutions
1. Delete any previously existing partitions on your SAN (/dev/sda) device, then create four
new 1GiB partitions of type fd such that the partition table looks like the following:
/dev/sda1 primary type/ID=fd size=1GB
/dev/sda2 primary type/ID=fd size=1GB
/dev/sda3 primary type/ID=fd size=1GB
/dev/sda4 extended type/ID=5 size="remaining disk space"
/dev/sda5 logical type/ID=fd size=1GB
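
The partitions themselves are created interactively with fdisk, followed by partprobe so the kernel re-reads the partition table (a sketch; the interactive fdisk keystrokes are omitted):

node1# fdisk /dev/sda

node1# partprobe /dev/sda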

2. In a different terminal, monitor the status of your RAID arrays. One device at a time, migrate
your RAID6 DASD members to the SAN.

node1# watch -n .5 'cat /proc/mdstat'

node1# mdadm /dev/md1 -a /dev/sda1

node1# mdadm /dev/md1 -f /dev/hda8 -r /dev/hda8

(...wait for recovery to complete...)

node1# mdadm /dev/md1 -a /dev/sda2

node1# mdadm /dev/md1 -f /dev/hda9 -r /dev/hda9

(...wait for recovery to complete...)

node1# mdadm /dev/md1 -a /dev/sda3

node1# mdadm /dev/md1 -f /dev/hda10 -r /dev/hda10

(...wait for recovery to complete...)

node1# mdadm /dev/md1 -a /dev/sda5

node1# mdadm /dev/md1 -f /dev/hda11 -r /dev/hda11

(...wait for recovery to complete...)


3. Note the current size of the RAID6 array.

node1# mdadm --detail /dev/md1 | grep -i size

4. Grow the RAID6 array into the newly available space, while keeping it online. Note the new
size of the array when done.

cXn1# mdadm -G /dev/md1 --size=max

cXn1# mdadm --detail /dev/md1 | grep -i size

5. Note the current size of the /raid6 filesystem, its logical volume, and the number of free
extents in your volume group.
cXn1# df

cXn1# lvdisplay /dev/vgraid/lvraid | grep -i size

cXn1# vgdisplay vgraid | grep -i size

6. Resize the /dev/md1 physical volume.

cXn1# pvresize /dev/md1

7. Now that the physical volume has been resized, check the number of free extents in the
volume group with vgdisplay. Resize the /dev/vgraid/lvraid logical volume, where
NN=number of free extents discovered previously.

node1# lvresize -l +NN /dev/vgraid/lvraid

8. Why did you not have to grow the volume group?

You did not have to grow the volume group because you did not add new physical volumes;
you only added to the number of extents already on the physical volumes that comprise the
volume group.

9. Note the current size of your filesystem, then grow the filesystem into the newly-available
space. Note the new filesystem size when you are done.

cXn1# df -Th | grep raid6

cXn1# resize2fs /dev/vgraid/lvraid

cXn1# df -Th | grep raid6



Lab A.5 Solutions
1. Create a new 100MiB partition on /dev/hda of type fd, and make sure the kernel is aware of
it.

After using fdisk to create the partition, be sure to run partprobe /dev/hda so the kernel is
aware of it.

cXn1# partprobe /dev/hda

2. Add the device to the RAID5 array.

cXn1# mdadm /dev/md0 -a /dev/hda12

3. Grow the array into the new space. Note that the array must be reshaped when adding disks.
Also note that all four slots of the array become filled ([UUUU]).

cXn1# mdadm -G /dev/md0 -n4 --backup-file=/tmp/critical-section

cXn1# watch -n .5 'cat /proc/mdstat'

4. Grow the array again, this time without first adding a spare device, noting that the command
adds an empty slot since there are no spares available ([UUUU_]).

cXn1# mdadm -G /dev/md0 -n5 --backup-file=/tmp/critical-section

cXn1# watch -n .5 'cat /proc/mdstat'


5. Explore this further to convince yourself that the array is growing in degraded (recovering)
mode:

cXn1# mdadm --detail /dev/md0 | grep -i state


6. Question: What would happen to your data if a device failed during the reshaping process with
no spares?

All data would be lost.



Lab A.6 Solutions
1. Unmount any filesystems created in this lab.

cXn1# umount /raid5

cXn1# umount /raid6

2. Disassemble the logical volume that was created in this lab. (Note: your logical volume and its
components may be different than what is listed here. Double-check against the output of lvs,
vgs, and pvs.)

cXn1# lvchange -an /dev/vgraid/lvraid

cXn1# lvremove /dev/vgraid/lvraid

cXn1# vgchange -an vgraid

cXn1# vgremove vgraid

cXn1# pvremove /dev/md1

3. Disassemble the RAID arrays created in this lab (Note: your partitions may be different than
those listed here. Double-check against the output of "cat /proc/mdstat").

cXn1# mdadm -S /dev/md0

cXn1# mdadm --zero-superblock /dev/hda{3,5,6,11,12}

cXn1# mdadm -S /dev/md1

cXn1# mdadm --zero-superblock /dev/sda{1,2,3,5}



Lab A.7 Solutions
1. Clean up: On node1 remove all partitions on the iSCSI device with the /root/RH436/
HelpfulFiles/wipe_sda tool.

cXn1# /root/RH436/HelpfulFiles/wipe_sda

2. Clean up: Rebuild node1, node2, and node3 using the rebuild-cluster script.

stationX# rebuild-cluster -123

