
HP 3PAR disk replacement

December 15, 2015

Regmen

HP 3PAR Tech Notes

HP 3PAR disk replacement. How to deal with a failed drive on 3PAR

This article covers disk replacement on 3PAR for administrators who want to know a little more about what happens in the background during a disk replacement.

3PAR logical layer


When talking about disk replacement on 3PAR, the logical layer cannot be omitted, as it is the fundamental concern behind the hard drive replacement procedure on 3PAR. The logical layer of 3PAR consists of a few levels. Overall, the structure is not complicated: it starts with physical disks and ends with Virtual Volumes.
physical disks (PD)
logical disks (LD)
Common Provisioning Groups (CPGs)
Virtual Volumes (VV)
Physical disks are divided into chunklets; starting with the 7000 series, chunklets have a fixed size of 1 GB. 3PAR then uses chunklets to build LDs. All of this happens without any administrator involvement. The chunklet is the basic logical unit in 3PAR terminology. Thanks to this approach, we get nicely virtualized storage with virtual RAID, which gives much more flexibility, also in terms of redundancy. On the other hand, if some blocks within a specific chunklet become unreadable, the whole chunklet (1 GB) is marked as failed.
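As a quick sanity check of the 1 GB chunklet size, the sketch below divides a disk's total capacity in MB (as reported by showpd) by 1024 MB. For the 900 GB drive used later in this article (838656 MB), it gives 819 chunklets, which matches the chunklet total that showpd -c reports for that disk. The function names are illustrative only and not part of any 3PAR tool.

```python
CHUNKLET_MB = 1024  # 1 GB chunklets (7000 series and later), per the text above

def chunklet_count(total_mb: int) -> int:
    """Whole chunklets that fit on a physical disk of total_mb MB."""
    return total_mb // CHUNKLET_MB

def capacity_lost_mb(failed_chunklets: int) -> int:
    """Capacity marked failed: one unreadable block fails its whole chunklet."""
    return failed_chunklets * CHUNKLET_MB

print(chunklet_count(838656))  # Total(MB) of pd 7 in the showpd -failed output
print(capacity_lost_mb(1))
```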

3PAR and RAID protection


3PAR offers a virtualized approach to RAID creation. RAID is defined when a CPG is created, and its behavior can be adjusted by the administrator as needed. RAID is based on chunklets, not on whole physical disks. Thanks to that, we can create CPGs based on performance, or use the slower sectors of physical disks for purposes where performance is less important, for example backup destaging. This gives us full control over shaping our storage resources and the environment underneath them.
For example, to see the details of already created CPGs, use the showcpg command with suitable parameters.
3PAR-cluster cli% showcpg -sdg

                  -------(MB)-------
Id Name           Warn Limit   Grow Args
 0 CPG_FC_R5         -     -  86304 -t r5 -ha cage -ssz 4 -ss 128 -ch first -p -devtype FC
 1 CPG_SSD_R5        -     -  23546 -t r5 -ha cage -ssz 4 -ss 64 -ch first -p -devtype SSD
 2 CPG_DESTAGING     -     -  86304 -t r5 -ssz 4 -ss 128 -ch last -p -devtype FC
 3 FC_r1             -     -  65536 -ssz 2 -ha cage -t r1 -p -devtype FC
 4 FC_r6             -     -  65536 -ssz 8 -ha cage -t r6 -p -devtype FC
 5 SSD_r1            -     -  16384 -ssz 2 -ha cage -t r1 -p -devtype SSD
 6 CPG_SSD_DEDUP     -     -  65536 -t r5 -ha cage -ssz 4 -ss 64 -ch first -p -devtype SSD
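The Args column is a plain list of options, so it is easy to inspect programmatically. The sketch below is a hypothetical helper (not a 3PAR tool) that turns one Args string into a dictionary; an option followed directly by another option, such as -p, which introduces the pattern sub-options, is stored as True. It assumes values never start with a dash.

```python
import shlex

def parse_cpg_args(args: str) -> dict:
    """Turn a showcpg -sdg 'Args' string into {option: value}."""
    tokens = shlex.split(args)
    opts = {}
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("-"):
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            if nxt is not None and not nxt.startswith("-"):
                opts[tok] = nxt          # option with a value, e.g. -t r5
                i += 2
                continue
            opts[tok] = True             # flag-style option, e.g. -p
        i += 1
    return opts

rows = {
    "CPG_FC_R5": "-t r5 -ha cage -ssz 4 -ss 128 -ch first -p -devtype FC",
    "CPG_DESTAGING": "-t r5 -ssz 4 -ss 128 -ch last -p -devtype FC",
}
for name, args in rows.items():
    o = parse_cpg_args(args)
    # -ha is absent for CPG_DESTAGING, so the default (cage) applies
    print(name, o.get("-ha", "cage (default)"), o["-ch"])
```

Note how CPG_DESTAGING uses -ch last, i.e. the inner, slower chunklets, which fits its backup destaging role.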

The exact working principle will be explained in another article. For now, it is good to know that 3PAR allows you to build RAID groups based on several capabilities:

-t: RAID type (for example r1, r5, r6).
-ha: high availability level, i.e. which component the chunklets of one RAID set are spread across so that the set survives its loss. The policy can be based on cage (default), magazine or back-end port.
-ssz: set size, expressed in chunklets. The default value depends on the RAID type, for example 2 chunklets for RAID-1, 4 for RAID-5 and 8 for RAID-6.
-ss: sets the step size in kilobytes.
-ch: which chunklets are preferred when building a stripe: first selects the lowest-numbered available chunklets (outer zones of the disks), last selects the highest-numbered ones (inner zones).
-p: pattern used to select disks when creating LDs, for example by disk type (-devtype FC, SSD or NL).
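The -t and -ssz options together determine how much of each RAID set holds user data. The sketch below applies the usual RAID arithmetic to chunklets, assuming RAID-1 keeps one data copy per set (the rest are mirrors), RAID-5 spends one chunklet per set on parity and RAID-6 spends two. It is an illustration only and ignores spares and metadata.

```python
from fractions import Fraction
from typing import Optional

# Default set sizes in chunklets, as listed in the -ssz description above.
DEFAULT_SSZ = {"r1": 2, "r5": 4, "r6": 8}

def data_fraction(raid: str, ssz: Optional[int] = None) -> Fraction:
    """Share of a RAID set's chunklets that holds user data (assumed layouts)."""
    size = ssz if ssz is not None else DEFAULT_SSZ[raid]
    overhead = {"r1": size - 1, "r5": 1, "r6": 2}[raid]
    return Fraction(size - overhead, size)

for raid, ssz in (("r1", 2), ("r5", 4), ("r6", 8)):
    print(raid, ssz, data_fraction(raid, ssz))
```

With the set sizes used by FC_r1, CPG_FC_R5 and FC_r6 in the showcpg output, this gives 1/2, 3/4 and 3/4 respectively.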

How 3PAR deals with spares


Some chunklets are designated as spares during the initial setup of the system. The 3PAR algorithms allocate chunklets from physical disks so as to maximize the use of the outer zones of each disk. As spare chunklets should be used only temporarily, in emergency situations, 3PAR places spare chunklets in the inner zones of the disks. The details of the spare chunklets on your 3PAR can be shown with the showspare command.
3PAR-cluster cli% showspare

No deep investigation is needed to see that the spare chunklets on each drive have high numbers; hence the chunklets designated as spares come from the inner zones of the disks, which is obviously good.

But that's not all. The number of chunklets that are candidates for spares is determined by a policy, which can follow one of the vendor-recommended presets or be chosen by us. We can distinguish the following sparing algorithms:

default: the equivalent of one full disk for every 40 disks, with a required minimum of 4 disks.
minimum: same as default, but without the required minimum number of drives.
maximum: the equivalent of one full disk per drive cage, so-called cage-level high availability.
custom: defined by the user; the administrator should remember to add spares when adding new drives.
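To get a feel for what these policies mean in capacity, the sketch below turns the rules listed above into disk-equivalents and chunklets. It assumes all disks are the same size (819 chunklets, like the 900 GB drive in this article) and that a partial group of 40 disks rounds up; the exact rounding used by 3PAR is not stated here, so treat the numbers as estimates.

```python
import math

CHUNKLETS_PER_DISK = 819  # 900 GB drive from this article (838656 MB / 1024)

def spare_disks(policy: str, n_disks: int, n_cages: int) -> int:
    """Disk-equivalents reserved as spares under the policies described above."""
    per_40 = math.ceil(n_disks / 40)  # assumption: partial groups round up
    if policy == "default":
        return max(4, per_40)
    if policy == "minimum":
        return per_40
    if policy == "maximum":
        return n_cages
    raise ValueError("custom sparing is defined by the administrator")

def spare_chunklets(policy, n_disks, n_cages, per_disk=CHUNKLETS_PER_DISK):
    return spare_disks(policy, n_disks, n_cages) * per_disk

for policy in ("default", "minimum", "maximum"):
    print(policy, spare_chunklets(policy, 24, 2))
```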

Follow the vendor recommendation to create spare chunklets during the first system initialization, as this is when the layout of the system is established and spare chunklets can be distributed evenly among all physical disks.

Logging logical disks and spares


In case of a physical disk failure, all new writes that would be committed to the failed drive are redirected to a logging logical disk. When the drive comes back online, or the time limit for logging is reached, the data is relocated to free chunklets marked as spares.
To see the logical disks that are used as logging disks, grep for log in the output of the showld command, as shown below.
ssh 3PAR-cluster showld | grep log
Id Name   RAID -Detailed_State- Own     SizeMB UsedMB Use Lgct LgId WThru MapV
 5 log0.0    1 normal           0/-/-/-  20480      0 log    0 ---
 6 log1.0    1 normal           1/-/-/-  20480      0 log    0 ---
 7 log2.0    1 normal           2/-/-/-  20480      0 log    0 ---
 8 log3.0    1 normal           3/-/-/-  20480      0 log    0 ---
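The same filtering can be done on captured output in a script. The sketch below is a hypothetical helper that parses the rows shown above, keeps those whose Use column is log, and sums their size. It assumes whitespace-separated columns, treats the first field of Own as the owning node (an assumption), and omits WThru and MapV because their values are not shown in this sample.

```python
SAMPLE = """\
Id Name   RAID -Detailed_State- Own     SizeMB UsedMB Use Lgct LgId
 5 log0.0    1 normal           0/-/-/-  20480      0 log    0 ---
 6 log1.0    1 normal           1/-/-/-  20480      0 log    0 ---
 7 log2.0    1 normal           2/-/-/-  20480      0 log    0 ---
 8 log3.0    1 normal           3/-/-/-  20480      0 log    0 ---
"""

COLUMNS = ["Id", "Name", "RAID", "State", "Own",
           "SizeMB", "UsedMB", "Use", "Lgct", "LgId"]

def logging_lds(text):
    """Return rows (as dicts) whose Use column is 'log'."""
    rows = []
    for line in text.splitlines()[1:]:        # skip the header line
        fields = line.split()
        if len(fields) < len(COLUMNS):
            continue                          # ignore blank or short lines
        row = dict(zip(COLUMNS, fields))
        if row["Use"] == "log":
            rows.append(row)
    return rows

lds = logging_lds(SAMPLE)
for ld in lds:
    owner = ld["Own"].split("/")[0]           # assumed: first entry = owning node
    print(ld["Name"], "node", owner, ld["SizeMB"], "MB")
print("total", sum(int(ld["SizeMB"]) for ld in lds), "MB")
```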

Following the official guide, we can interpret the information related to logging disks:

Column Use: log in this column means that the logical disk is used as a logging logical disk.
Column Lgct: the number of chunklets in the logical disk that are in logging mode.
Column LgId: the ID of the logging disk that is being used for logging by the logical disk.

Important information
The logging logical disk is an entity that is created and managed entirely by the 3PAR system.

Replacing failed disk


The most common task on any storage array is dealing with failed drives. Storage arrays do a tremendous amount of work with our data, especially if the cache hit ratio is not remarkably high. The question is how to deal with a failed drive and what deserves our attention.
In this case, the 3PAR started to spit out many alerts regarding one disk, signalling that the situation had become serious.
2015-12-06 04:36:38 GMT Informational Disk event hw_disk:5000C50075EB86C4 pd 7 port b0 on 0:0:1: cmdstat:0x00 (TE_PASS -Success), scsistat:0x02 (Check condition), snskey:0x01 (Recovered error), asc/ascq:0x5d/0x0 (Failure prediction threshold exceeded), info:0x0, cmd_spec:0x0, sns_spec:0x50000, host:0x0, abort:0, CDB:2A00562EE38800001800 (Write10), blk:0x562ee388, blkcnt 0x18, fru_cd:0x32, LUN:0, LUN_WWN:0000000000000000 after 0.007s, toterr:1808, deverr:1138
2015-12-06 04:37:41 GMT Degraded Disk abort hw_disk:5000C50075EB86C4;sw_pd:7 pd 7 port b0 on 0:0:1: scsi abort/sick/hwerr status TE_SMART_THRESH
2015-12-06 04:41:49 GMT Informational Disk event hw_disk:5000C50075EB86C4 pd 7 port b0 on 0:0:1: cmdstat:0x00 (TE_PASS -Success), scsistat:0x02 (Check condition), snskey:0x01 (Recovered error), asc/ascq:0x5d/0x0 (Failure prediction threshold exceeded), info:0x0, cmd_spec:0x0, sns_spec:0x50000, host:0x0, abort:0, CDB:2A000021D7C000004000 (Write10), blk:0x21d7c0, blkcnt 0x40, fru_cd:0x32, LUN:0, LUN_WWN:0000000000000000 after 0.008s, toterr:1813, deverr:1143
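The CDB field in these events is the raw SCSI command that was being executed. A WRITE(10) CDB is 10 bytes: opcode 0x2A, the logical block address in bytes 2 to 5 and the transfer length in blocks in bytes 7 and 8, both big-endian. The sketch below (a hypothetical helper, not part of the 3PAR CLI) decodes the two CDBs from the log and reproduces the blk and blkcnt values the array printed. The asc/ascq 0x5d/0x0 pair is the SMART "failure prediction threshold exceeded" condition, which is why the write itself passed but the disk was flagged.

```python
def decode_write10(cdb_hex: str) -> dict:
    """Decode a SCSI WRITE(10) CDB into its LBA and block count."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a WRITE(10) CDB")
    return {
        "lba": int.from_bytes(cdb[2:6], "big"),     # bytes 2-5: logical block address
        "blocks": int.from_bytes(cdb[7:9], "big"),  # bytes 7-8: transfer length
    }

for cdb in ("2A00562EE38800001800", "2A000021D7C000004000"):
    d = decode_write10(cdb)
    print(f"blk:{d['lba']:#x} blkcnt {d['blocks']:#x}")
```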

It was only a matter of time before the disk crashed completely.


2015-12-06 04:47:42 GMT 1 Informational Disk state change sw_pd:7 pd 7 wwn 5000C50075EB86C4 changed state from valid to missing because disk gone event was received for this disk.

2015-12-06 04:47:42 GMT Informational Disk state change hw_disk:5000C50075EB86C4 pd wwn 5000C50075EB86C4 changed state from valid to missing because disk gone event was received for this disk.
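When watching for events like this in a collected event log, a regular expression is often enough. The sketch below is a hypothetical helper matched only against the sw_pd message format shown in this article (other event types, including the hw_disk variant, are not covered and return None). It extracts the PD id, the WWN, the old and new state and the reason.

```python
import re

STATE_CHANGE = re.compile(
    r"sw_pd:(?P<pd>\d+) pd \d+ wwn (?P<wwn>[0-9A-F]{16}) "
    r"changed state from (?P<old>\w+) to (?P<new>\w+) "
    r"because (?P<reason>.+?)\.?$"
)

def parse_state_change(message):
    """Return the fields of a sw_pd state-change message, or None."""
    m = STATE_CHANGE.search(message)
    return m.groupdict() if m else None

msg = ("2015-12-06 04:47:42 GMT 1 Informational Disk state change sw_pd:7 pd 7 "
       "wwn 5000C50075EB86C4 changed state from valid to missing because disk "
       "gone event was received for this disk.")
print(parse_state_change(msg))
```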

Let's check the situation of the disk marked as failed. If you are uncertain which drive has failed, you can list the failed PDs using the showpd command with the -failed option. The showpd command displays information about the system's physical disks.
3PAR-cluster cli% showpd -failed
                           -Size(MB)-- ----Ports----
Id CagePos Type RPM State    Total Free A     B     Capacity(GB)
 7 0:7:0?  FC    10 failed  838656    0 ----- -----          900
----------------------------------------------------------------
 1 total                    838656

Use showpd again with the -c parameter, which gives visibility into the chunklets.


3PAR-cluster cli% showpd -c 7
                              ------- Normal Chunklets -------- ---- Spare Chunklets ----
                              - Used - -------- Unused -------- - Used - ---- Unused ----
Id CagePos Type State  Total  OK Fail Free Uninit Unavail Fail OK Fail Free Uninit Fail
7 0:7:0? FC failed 819 0 667 152 0
---------------------------------------------------------------------------------------
1 total 819 0 667 152 0

To see detailed information about the chunklets on a disk, use the showpdch command.
3PAR-cluster cli% showpdch 7

The -i parameter shows the disk details.


3PAR-cluster cli% showpd -i 7

Id CagePos State  ----Node_WWN---- --MFR-- -----Model------ -Serial- -FW_Rev- Protocol MediaType -----AdmissionTime-----
 7 0:7:0?  failed 5000C50075EB86C4 SEAGATE SLTN0900S5xnN010 S0N1L0LN 3P01     SAS      Magnetic  2014-07-15 12:40:41 IST
----------------------------------------------------------------------------------------------------------------------
 1 total

After the failed disk is replaced with a new one, the event log records it.
2015-12-08 12:03:06 GMT 1 Informational Disk state change sw_pd:160 pd 160 wwn 5000CCA0714AC3D3 changed state from new to valid because disk was admitted successfully.

2015-12-08 12:03:06 GMT Informational Disk state change hw_disk:5000CCA0714AC3D3 pd wwn 5000CCA0714AC3D3 changed state from new to valid because disk was admitted successfully.

2015-12-08 12:03:06 GMT Informational Object added hw_disk:5000CCA0714AC3D3 Disk 5000CCA0714AC3D3 added

Important information

Remember that the PD_ID of the replacement drive will differ from that of the failed drive. In this example, the pd_id of the failed drive is 7, and the replacement drive has been assigned the first available id, which is 160.
The array will keep signalling a degraded status until the relocation is completed.
After the relocation, the pd_id assigned to the new drive disappears, and the replacement disk becomes visible under the pd_id previously assigned to the failed drive.

However, all chunklets that resided on the failed drive must be relocated from the logging disk and the spare area back to the replacement disk. To see the progress, use the servicemag command or monitor the chunklets.
3PAR-cluster cli% servicemag status
Cage 0, magazine 7:
The magazine is being brought online due to a servicemag resume.
The last status update was at Tue Dec 8 12:04:07 2015.
Chunklets relocated: 404 in 3 hours, 17 minutes and 25 seconds
Chunklets remaining: 762
Chunklets marked for moving: 762
Estimated time for relocation completion based on 29 seconds per chunklet is: 6 hours, 8 minutes and 18 seconds
servicemag resume 0 7 -- is in Progress
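The estimate in this output can be reproduced by hand: 404 chunklets in 11845 seconds is about 29.3 seconds per chunklet, and 762 remaining chunklets at 29 seconds each is 22098 seconds, i.e. 6 hours, 8 minutes and 18 seconds. The sketch below does that arithmetic. Rounding down to whole seconds per chunklet is an assumption that happens to match this output; it is not the documented servicemag formula, and it also shows why the estimate can be off when the per-chunklet rate changes during the relocation.

```python
def hms(seconds: int) -> str:
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{h} hours, {m} minutes and {s} seconds"

def estimate(relocated: int, elapsed_s: int, remaining: int):
    """Seconds per chunklet so far (rounded down) and projected remaining time."""
    per_chunklet = elapsed_s // relocated
    return per_chunklet, per_chunklet * remaining

elapsed = 3 * 3600 + 17 * 60 + 25  # "404 in 3 hours, 17 minutes and 25 seconds"
per, eta = estimate(404, elapsed, 762)
print(per, hms(eta))
```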

Unfortunately, the method above is not 100% reliable: from time to time, the number of chunklets and the time for the operation are displayed incorrectly. To check how the process is really going, you can use the showpdch command with the -mov parameter. The sum of chunklets remaining to be moved is shown at the end of the output.
3PAR-cluster cli% showpdch -mov 160

What if I want to replace a disk manually?


If you see that it is only a matter of time before your disk fails, and you have a spare drive in your closet, you are free to replace it on your own.
There are at least two ways to do this. The most common is to use the servicemag utility, but from time to time the command fails for some reason and the disk cannot be replaced that way.
Let's say that our disk layout looks like the one in the image presented below.

The -log parameter means that write operations to the chunklets of the specified drive are committed to the logging disk. The -pdid parameter obviously stands for the disk ID. You can also add the -wait parameter to servicemag start; the task then does not run in the background, and you have visibility of the whole process.
Command to use: servicemag start -log -pdid 8, which is shown in the image below.

Check whether the command finished successfully with the servicemag status command.


3PAR-cluster cli% servicemag status -d

According to the documentation:
Any I/O on the chunklets marked normal,smag changes their state to logging, and the I/O is written to the logging logical disks.

Manual disk replacement procedure


If the servicemag command fails for some reason, you are forced to do the replacement manually, using a whole set of commands.
1. The first thing to do is to stop new logical disk allocations on the disk. 3PAR has a dedicated command for this.
3PAR-cluster cli% setpd ldalloc off <pd_id>

To see the detailed state of the disk, use the showpd -s command.


3PAR-cluster cli% showpd -s <pd_id>

2. Now you can start moving the data from the specified physical disk to locations chosen by the system, which is one of the main steps of a disk replacement.
The suitable command is movepdtospare with the -vacate option. The -vacate option makes the moves permanent and removes the source tags after relocation. The -f parameter means that no confirmation is required.

If this command fails, you will be forced to move the data manually, chunklet by chunklet.

3PAR-cluster cli% movech -perm -ovrd <pd_id>:<chunklet_location>

where:
-perm: chunklets are moved permanently and their original location is forgotten.
-ovrd: allows a chunklet to be moved to a destination even if this has an impact on quality. This option is necessary with the -perm parameter.
3. Now check whether there are any spare chunklets left on the disk designated for removal, as the previous step moved only data chunklets.
To display chunklets marked as spares, use the showpdch -spr command.
3PAR-cluster cli% showpdch -spr <pd_id>

4. Remove the spare chunklets from the disk with the command shown below. It removes all spare chunklets on the disk. After running it, check again whether any spares remain.
3PAR-cluster cli% removespare <pd_id>:a

5. After all the previous steps, you can safely remove the physical disk definition from the system. Do not physically replace the disk yet at this step.
3PAR-cluster cli% dismisspd <pd_id>

6. Check whether the dismissed disk now shows up as new. If it does, it can be safely removed from the magazine.

7. If you insert a new disk and it is not automatically added to the system, you have to add it manually.

First, determine the WWN of the disk. Check it with the showpd -i command.
3PAR-cluster cli% showpd -i <pd_id>

Then use the admitpd command to make the new disk operational for the system.
3PAR-cluster cli% admitpd <disk_wwn>

Finally, tunesys is necessary to restore a proper layout of chunklets within the CPGs.
3PAR-cluster cli% tunesys
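To avoid skipping a step when working under pressure, the manual procedure above can be written down as an ordered checklist. The sketch below only builds the command strings and executes nothing; it is a hypothetical helper, not an HP tool. The movepdtospare line is assembled from the -vacate and -f options described in step 2, and its exact argument order is an assumption; the example PD id and WWN are taken from this article's examples. Check each command against the CLI reference for your 3PAR OS version before running it.

```python
from typing import List, Optional

def manual_replacement_plan(pd_id: int, new_wwn: Optional[str] = None) -> List[str]:
    """Ordered CLI commands for the manual procedure; nothing is executed."""
    plan = [
        f"setpd ldalloc off {pd_id}",         # 1. stop new LD allocations
        f"showpd -s {pd_id}",                 # 1. verify the detailed state
        f"movepdtospare -vacate -f {pd_id}",  # 2. move data (assumed argument order)
        f"showpdch -spr {pd_id}",             # 3. any spare chunklets left?
        f"removespare {pd_id}:a",             # 4. remove all spare chunklets
        f"showpdch -spr {pd_id}",             # 4. check again
        f"dismisspd {pd_id}",                 # 5. remove the PD definition
    ]
    if new_wwn is not None:
        plan.append(f"admitpd {new_wwn}")     # 7. only if not admitted automatically
    plan.append("tunesys")                    # 7. rebalance chunklets within CPGs
    return plan

for cmd in manual_replacement_plan(8, "5000CCA0714AC3D3"):
    print("3PAR-cluster cli%", cmd)
```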
