
Unresponsive EVA Handling Guide HP Restricted

Version 1.0

Unresponsive EVA Handling Guide
A guide to best practices for handling an unresponsive EVA, with instructions on which
information to collect.
This information is strictly HP Restricted.
Do not share in any form (verbal, soft- or hardcopy) with others.

This document is intended for HP Services personnel who are servicing an Enterprise Virtual Array.

HP Restricted
Authors: H. van Sluis (EMEA), B. Martens (NA), R. Lustenhouwer (APJ)
Date: 13-Apr-2006

1 Introduction
1.1 General
1.2 Tools Required
1.3 Things Not to Do
1.4 How to Determine if an EVA is Considered “Down”
1.4.1 Visual Indicators
1.4.2 Disk Drive LEDs
1.4.3 LCD Displays (alternating) WWID and Storage Cell Name
1.4.4 LCD Displays Nothing
1.4.5 LCD Displays Something Else
1.4.6 Command View
2 Information to Collect
2.1 General
2.2 Version and Model Numbers
2.3 Status of Visual Indicators
2.4 Latest Controller Event Log
2.5 Latest EVA Configuration
2.6 Serial Line Capture
2.7 List of Questions towards the Customer
2.7.1 General
2.7.2 Does the customer prefer a fast restore of data availability versus a time consuming data capturing process?
2.7.3 Is this EVA in a CA configuration?
2.7.4 Is this system connected to ISEE?
3 Guidelines to Elevate a Priority 1 Escalation towards SWD Level 3
4 Appendix A: Questionnaire
5 Appendix B: Loop Switch Models
5.1 Old Model
5.2 New Model
6 Appendix C: Meaning of the Controller LEDs next to LCD
7 Appendix D: Some LCD Messages and their Meaning
8 BIST and MIST
8.1 Overview
8.2 BIST (Built-in Self-Test)
8.3 MIST (Minimum Integrity Diagnostics)
8.3.1 HSV100 and HSV110 Controller Models
8.3.2 HSV200 and HSV210 Controller Models


1 Introduction
1.1 General
This document describes best practices for recognizing and reacting to a situation in which an
EVA doesn’t respond to management commands (from Command View, SSSU, RSM, etc.), to host
I/O, or to both.
This document also lists the information to be collected onsite in such a situation.

Lastly, this document contains the criteria that HP Services must meet before escalating a case to
SWD. Some escalations can be handled by HP Services itself.

It’s important to realize that, with the later VCS and XCS releases, an “EVA Down” situation means that the
data is temporarily unavailable due to a hardware issue. Data availability is restored once the hardware
issue is resolved. It’s becoming rare that data is lost after an “EVA Down” situation. In most cases
where data was lost, human errors during the troubleshooting process were the cause.

Whatever you do, there’s no need to panic. Stay calm and maintain a professional attitude. Also,
adhere to the basics of proper troubleshooting: keep a written log of every action you
perform, with its timestamp and its result. The format might be as simple as:
- Time:
- Action:
- Result:
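
The Time / Action / Result format above can also be kept electronically. The following is a minimal sketch only; the class and method names (`LogEntry`, `ActionLog`, `record`, `render`) are hypothetical and not part of any HP tool. A pen-and-paper log serves the same purpose.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class LogEntry:
    """One troubleshooting step: when it was done, what was done, what happened."""
    time: datetime
    action: str
    result: str

    def render(self) -> str:
        # Mirrors the three-field Time / Action / Result format of this guide.
        stamp = self.time.strftime("%Y-%m-%d %H:%M")
        return f"- Time: {stamp}\n- Action: {self.action}\n- Result: {self.result}"


@dataclass
class ActionLog:
    """Ordered list of entries; hypothetical helper, not an HP utility."""
    entries: List[LogEntry] = field(default_factory=list)

    def record(self, action: str, result: str,
               when: Optional[datetime] = None) -> LogEntry:
        # Default to the current time when no explicit timestamp is given.
        entry = LogEntry(when or datetime.now(), action, result)
        self.entries.append(entry)
        return entry

    def render(self) -> str:
        # Entries are separated by a blank line, oldest first.
        return "\n\n".join(entry.render() for entry in self.entries)
```

Calling `record()` after every service action and saving the output of `render()` with the case keeps the log in the order in which the actions were performed.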

If at any time during troubleshooting a down array there is ambiguity or you are
not certain of your next action, elevate the case to your next level of support.


1.2 Tools Required


The following tools are required for troubleshooting an “EVA Down” situation:
1. A laptop, with at least one serial port, connected to mains. Don’t try to run on laptop batteries,
as they’re unreliable and run out of energy quickly. Capturing information via the serial port is
time consuming due to the possible massive amount of information.
2. A serial cable. Any standard null modem cable will do to connect the serial port of the laptop to the
UART port of the HSV controller. Cable PN BC16E and 2 x H8571J connectors are required.
Alternatively, use PN 213807-001.
3. Terminal emulation software such as HyperTerm, KEA or PowerTerm. Make sure you’re
familiar with the terminal emulator menu structure to enable capturing information to a file. The
communication settings are 19.2 kbit/s, 8 data bits, 1 stop bit, no parity, no flow control.
4. Your eyes. The EVA has a lot of visual indicators and they all tell you something, even if the
EVA is considered to be down.
5. Telephone, be it a cell phone or fixed line. In circumstances where an EVA is really down, you’ll
more than likely be discussing the situation with higher escalation levels.
6. A collection of the latest VCS and XCS releases, including disk drive firmware, placed on your
laptop or CD.
7. Controller Event Log translation tools like Navigator, NAPP or EVE installed on your laptop.
8. EVA Configuration Display utilities like Navigator, EVA-CD or EVE installed on your laptop.
9. The PDF file of the latest Service Manual for that particular EVA model placed on your laptop or
CD.
10. Access to the SMA or management station running Command View. It’s not enough to be able
to browse to Command View. You need access to the desktop of an account with
administrative privileges on that computer.
11. A pen and piece of paper to record your actions as you progress through the issue.

An optional, but recommended, part of the tool kit is an EMU serial cable. Connecting to the
EMU communications port requires cable part number 17-04875-04. This cable is not
normally readily available from HP. Alternatively, it can be ordered directly from Amphenol at +1
607-786-4345.

1.3 Things Not to Do


This section lists the actions that should not be performed without explicit instructions from a higher
escalation level.

1. Do not power off the HSV controllers. There might still be valuable information (controller
events) in volatile memory. If the controllers are powered off, that information is lost and a
root cause analysis might no longer be possible.
2. Do not remove the cache batteries from the HSV controller. The batteries provide hold-up
power to the data cache, which most likely still contains user data. Removal of the batteries will
lead to data loss.
3. Do not connect your laptop to the serial port unless explicitly asked to by higher escalation levels,
as it might interfere with the controller’s current recovery actions.
4. Do not swap parts at random; you may lose track of which part came from which
location. All parts need to be identifiable, and the location they were removed from must be
documented.


1.4 How to Determine if an EVA is Considered “Down”


1.4.1 Visual Indicators
If you experience an EVA that’s not responding to management requests from Command View or host
I/O, a visual inspection of the system is the first thing that needs to be performed. This information
should be collected and documented. It must also be updated after every service action performed on
the EVA.

Important visual indicators are:

1. Front Side of the EVA:
• The LCD of the OCP on each controller.
• The 4 LEDs next to the LCD on each controller.
• The status LEDs on each disk drive enclosure at the front side of the cabinet.
• The LEDs on every disk drive.
2. Back Side of the EVA:
• The port LEDs on each controller. The number of LEDs depends on the controller model.
• The port LEDs on every loop switch (if present).
• The status LED of every loop switch (if present and if present on that loop switch model).
• The LEDs on the I/O modules in every disk drive enclosure.
• The LEDs and alpha-numeric display on the EMU in every disk drive enclosure.
• The power LEDs on the power supplies.

The meaning of these visual indicators is explained in the Service Manual, which is part of the required
tool set.

1.4.2 Disk Drive LEDs


The disk drive LEDs should be monitored for activity. If the disks are displaying normal activity, the EVA
is operational. If the disk drive LEDs are not displaying activity, the EVA may be down (or there may
simply be little activity). The disk drive LEDs may display activity on one loop pair or the other loop pair.
During error recovery, all drives on the same loop pair will flash the activity LED when the controller
issues a LIP (Loop Initialization Primitive). Do not confuse error recovery with user I/O. User I/O will not
cause the LEDs on all drives on one loop pair to flash at the same time.


1.4.3 LCD Displays (alternating) WWID and Storage Cell Name


If the LCD of an HSV controller alternately displays the WWID and the storage cell name, then that
controller is operating. If both controllers do that, then it’s likely that the communication path between
the EVA and Command View or the servers is broken. If a server has lost connectivity to the EVA,
investigate whether the loss is restricted to that single server, to a group of servers, or to all servers
connected to the EVA. Most likely this is caused by a problem in the SAN configuration. It might also be
caused by a process lock-up within the EVA. In that case, a “CSM reset” will occur within one hour and
that will clear the situation.

The 4 Status LEDs are located to the left of the LCD.


[Figure: OCP front panel, showing the status LEDs, the LCD and the navigation pushbuttons.]

Appendix C contains a list of the LEDs and their meaning.


1.4.4 LCD Displays Nothing


If the LCD of an HSV controller is completely blank, it might mean that the controller is restarting
and is waiting for a “go” command because the debug flags are set incorrectly. In this case,
connect to the serial port of the controller, and enter the “G” command. After that, clear the “Prompt for
Go” bit in the debug flags from the field service page in Command View. You might have to wait 15
minutes before Command View picks up the EVA again. For instructions on how to reach the field
service page or debug flags, please refer to the Service Manual.

Another possible explanation is that the HSV controllers have a single power supply and that either the
power supply is broken or the complete input power is gone. In the latter case, all disk drive enclosures
operate on a single power supply as well. The corrective action is to either replace the controller (in
case of a broken power supply) or troubleshoot why the input power is no longer available.

1.4.5 LCD Displays Something Else


In case a controller is displaying something else in its LCD, it can be one of the following:
• POST (Power-On Self Test) messages (TE:xx in the display)
• Boot sequence (“Startup Complete”, “Scanning for disks”, “Activating STSYS”). These
messages shouldn’t stay on the LCD for more than 1 minute!
See also Appendix D: Some LCD messages and their meaning
• Termination Data (Termination Code, Location)
• LID/DDD recovery information
• Error information (“Stsys lost”, “Multiple Stsys”, “Multiple Disk failure”, etc.).
See also Appendix D: Some LCD messages and their meaning
• Fault Menu: “Restart”

The POST messages should never be displayed for more than one minute. If they are, the POST isn’t
completing due to an unexpected error condition in the controller hardware. A repetitive
sequence of POST messages likewise means that the controller fails to perform its power-on self test.
The corrective action in both cases is to replace the HSV controller. This is not an “EVA Down” situation,
and data availability is restored once the controller is replaced.

Boot sequence messages should disappear within one minute, after which the display should
switch to the storage cell name and WWID. If that doesn’t happen, typically only one
controller is displaying a boot sequence message, while the other one still displays the storage cell
name and WWID. This is most likely caused by a process lock-up within the firmware,
and a “CSM reset” should follow within an hour. This might be an “EVA Down” situation; the correct
action is to collect information and raise a Severity 1 escalation to the next escalation level. Monitor
each controller display and record the display output. It is important to note any patterns, as well as
the first and last displayed data.


The LID/DDD recovery messages should change periodically and then disappear. This recovery activity
shouldn’t be disrupted. If LID/DDD recovery continues for more than 15 minutes, it is probably due
to unexpected hardware conditions on the CAN-bus. The correct action is to restore data availability
first by executing the following sequence:
1. Halt both controllers by using the CTRL+H, CTRL+R keystroke sequence. Do not power off the
controllers, as vital event information is kept in volatile memory. Removing input power to the
controllers will lead to a loss of event information.
2. Unseat the EMUs in all disk drive enclosures, but do not remove them completely. You have to
unseat all of them before you move on to the next step.
3. Re-insert the EMUs in all disk drive enclosures.
4. Restart the top controller. The controller will boot and re-start the LID/DDD recovery routine.
This time it will find the faulty drive and isolate it.
5. Restart the bottom controller.
6. Check, via the field service page, that the print and debug flags are still set to the expected values.

Raise an escalation towards a higher level to register this event. Supply the controller event log, the
controller termination log, the actual EVA configuration and a list of the actions you performed together
with the escalation. The faulty disk drive must be physically removed from the system.

In all other cases, prepare yourself and collect the required information as listed in Appendix A:
Questionnaire. Do NOT power off the controllers unless explicitly directed by higher escalation levels. If
you power off the controllers, vital information (events that aren’t available yet in Command View) will
be lost, as it resides in volatile memory. The correct action is to collect information and raise a
Severity 1 escalation to the next escalation level.

You’ve now entered an “EVA Down” state.
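
The decision rules of sections 1.4.3 to 1.4.5 can be summarized as a small lookup. The sketch below is illustrative only: the enum, the function name `triage` and the returned wording are hypothetical, and the time limits (1 minute, 15 minutes) are the ones stated in the text above. It does not replace the judgment of the engineer onsite or the instructions of higher escalation levels.

```python
from enum import Enum


class LcdState(Enum):
    """What the LCD of one HSV controller shows (categories from section 1.4)."""
    WWID_AND_NAME = "alternating WWID and storage cell name"
    BLANK = "nothing"
    POST = "POST messages (TE:xx)"
    BOOT_SEQUENCE = "boot sequence messages"
    LID_DDD_RECOVERY = "LID/DDD recovery information"
    OTHER = "anything else"


def triage(state: LcdState, minutes_displayed: float) -> str:
    """Return the action this guide suggests for one controller display."""
    if state is LcdState.WWID_AND_NAME:
        return "Controller is operating: check the path to Command View and the hosts."
    if state is LcdState.BLANK:
        return "Check for a pending 'go' prompt (debug flags) or a loss of controller power."
    if state is LcdState.POST:
        if minutes_displayed > 1:
            return "Replace the HSV controller (not an EVA Down situation)."
        return "Wait: POST should complete within one minute."
    if state is LcdState.BOOT_SEQUENCE:
        if minutes_displayed > 1:
            return "Collect information and raise a Severity 1 escalation."
        return "Wait: boot messages should disappear within one minute."
    if state is LcdState.LID_DDD_RECOVERY:
        if minutes_displayed > 15:
            return "Halt controllers, reseat all EMUs, restart controllers, then escalate."
        return "Do not disrupt the recovery: keep monitoring."
    # Termination data, error information, fault menu and everything else.
    return "Collect the Appendix A information and raise a Severity 1 escalation."
```

In every branch the rule from section 1.3 still applies: do not power off the controllers unless a higher escalation level explicitly says so.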


1.4.6 Command View

1.4.6.1 Checking Web Browser Settings


Check the Command View status of the EVA. Ensure that your internet browser properties are correct and
that you’re not reading cached pages. In Internet Explorer, this can be checked under “Tools → Internet
Options → Settings”. Delete the temporary internet files and clear the history by clicking on the related
buttons.

Ensure you reload the current status to prevent Command View from displaying old status data. Use
the “Refresh” button of your web browser; in Internet Explorer, this is mapped onto the
CTRL+F5 keystroke combination.

An unresponsive EVA will generate a log on the management system indicating the first failing
command.

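[Figure: Internet Explorer “Internet Options” dialog, with three callouts: the button that clears the
internet files cache, the “Every visit to this page” setting (ensure it is selected), and the button that
clears the history of the visited pages.]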


Another thing to check is the proxy settings of the web browser. In Internet Explorer, these settings
can be found under “Tools → Internet Options → Connections Tab → LAN Settings”. If you’re working
on the SMA or management station, be sure that there’s no proxy access for “localhost” or “127.0.0.1”.

Within Command View, also use the “Re-discover” button. This will force Command View to re-scan the
SAN for any EVAs with which it’s allowed to communicate.

1.4.6.2 Verify Connectivity SMA towards EVA

There are several tools and log files that you can use to verify the communication between the
management server and the EVA.

The first step is to verify the physical connection between the management server and the SAN, and to
verify that FC communication is established. The second step is to verify the SCSI and SCMI
communication.

The FC path between the management server and the FC switch can be verified by using HBAnyware or
the Emulex configuration utility. The FC switch can provide the connection status between the management
station and the FC switch, as well as between the EVA and the FC switch. It can also tell you if the EVA is
known in the name server of the switch.


If there are no issues on the physical level and FC communication appears to be working, reviewing the
logs in \hsvmafiles might give an indication about the reason why the management station cannot
communicate with the EVA.

From Command View v5 onwards, a new tool, EVACT, is present. This tool verifies the path and the
communication between the management system and the EVA. It can be run from a DOS prompt, from
the field service page, or from the CV storage system page when the connection is lost. It will run from
any Windows-based, Fibre-attached server; it requires the program and its supporting files, so copy the
complete \hsvmafiles directory, including the supporting DLLs.

The status field can have 4 different values:
1. Ready. The SCMI layer is ready to accept commands.
2. Busy. The SCMI layer is busy processing a command and cannot accept another command.
Expect this during commands with a long processing time, such as Vdisk deletion.
3. Waiting. The SCMI layer is waiting for a part of the command. This might indicate a problem.
4. Not Available. The SCMI status command isn’t supported by this particular VCS version. In
other words, the communication is working, as the EVA responds with a SCMI error message.
You’ll most likely find an extra event logged in the controller event log.
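
When noting the EVACT result on the questionnaire, the four status values can be mapped to their meaning as in the sketch below. This is an illustration only: the exact strings that EVACT prints are assumed here to match the names in the list above, and `ScmiStatus` and `interpret` are hypothetical names, not part of EVACT.

```python
from enum import Enum


class ScmiStatus(Enum):
    """The four values of the status field, as named in this guide."""
    READY = "Ready"
    BUSY = "Busy"
    WAITING = "Waiting"
    NOT_AVAILABLE = "Not Available"


# Meaning of each value, paraphrased from the list above.
MEANING = {
    ScmiStatus.READY: "SCMI layer is ready to accept commands.",
    ScmiStatus.BUSY: "SCMI layer is processing a command; expected during long commands.",
    ScmiStatus.WAITING: "SCMI layer is waiting for part of a command; possible problem.",
    ScmiStatus.NOT_AVAILABLE: "Status command unsupported by this VCS version; "
                              "communication itself is working.",
}


def interpret(status_text: str) -> str:
    """Translate a status string into its meaning; raises ValueError if unknown."""
    # Assumption: the text matches the value names exactly, apart from whitespace.
    return MEANING[ScmiStatus(status_text.strip())]
```

An unknown string raises `ValueError` rather than being guessed at, so an unexpected EVACT output is noticed and can be recorded verbatim.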

If you run the command from the field service page, the trace file,
“\hsvmafiles\evact\path_trace.txt” is created.

If the utility is started from the DOS prompt, you can redirect the output to a file by using
“evact > myfile.txt”. The optional “–verbose” flag dumps out what the tool
is doing at every step. If things are running slowly, this is a good way to confirm that the tool isn’t
hung.


2 Information to Collect

2.1 General
If the situation qualifies as an “EVA Down” situation, the main priority of the higher escalation
level will be to restore data availability as soon as possible. To be able to do that, that team needs a
defined set of information to be provided upfront with the escalation. Providing correct and adequate
information allows higher escalation levels to quickly restore data availability. Appendix A of this
document contains a questionnaire, which needs to be filled in prior to raising the escalation.

2.2 Version and Model Numbers


The controller model (HSV100, HSV110, HSV200 and HSV210) is important and must be filled in on
the questionnaire. Furthermore, the model of the loop switches is important. Examples of the different
models are given in Appendix B: Loop switch models. The loop switch model number must be filled in
on the questionnaire as well.

2.3 Status of Visual Indicators


Carefully note the exact text displayed on the LCD of the OCP of each controller. Also note if any of
the device port LEDs, located on the rear side of the controllers, isn’t solid green.

Furthermore, check the LEDs on every I/O module. If there are no loop switches in the system, all I/O
modules should have three green LEDs. On systems with loop switches, there should be two green
LEDs (the top-2) on each I/O module.

Check the status of all LEDs on each loop switch. If one LED or group of LEDs shows different
behavior compared to the others, please note that on the questionnaire. Monitor both the LEDs on the I/O
modules and the LEDs on the loop switches for at least one minute. They may flash
occasionally, which may help the higher escalation levels with troubleshooting.

The EMU has an alpha-numeric display. If the yellow error LED is on (and “Er” is displayed in the
2-char display), read the error code. See the Service Manual for instructions. The error code needs to
be entered on the questionnaire in the N.N.NN.NN format.
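
A quick check that a noted code really has the N.N.NN.NN shape can prevent transcription errors. The sketch below assumes that every “N” stands for one decimal digit; that is an assumption, not something this guide states, and the function name is hypothetical.

```python
import re

# N.N.NN.NN, assuming each N is a single decimal digit (assumption).
EMU_CODE_PATTERN = re.compile(r"\d\.\d\.\d{2}\.\d{2}")


def is_valid_emu_code(text: str) -> bool:
    """Return True if the text has the N.N.NN.NN shape used on the questionnaire."""
    # fullmatch rejects any leading or trailing characters after stripping spaces.
    return EMU_CODE_PATTERN.fullmatch(text.strip()) is not None
```

For instance, a made-up value such as `0.3.02.01` passes the check, while `0.3.2.1` does not because the last two groups need two digits each.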

2.4 Latest Controller Event Log


In case of an unresponsive EVA, as in an “EVA Down” situation, the latest controller event log is
available on the SMA or management station that was managing the EVA just before it became
unresponsive. A directory called “\hsvmafiles” is located on the local hard drive of that SMA or
management station, typically on the C-drive. Within “\hsvmafiles” you will find “zip.exe”. Use
that executable to create a zip archive of the complete “\hsvmafiles” directory. If a customer has
multiple EVAs, please indicate on the questionnaire the WWID of the system that’s unresponsive.


2.5 Latest EVA Configuration


The actual EVA configuration is important troubleshooting information. It gives higher escalation
levels visibility into the number of disk groups, the location of quorum drives, and RSS placement. There
are 2 ways to collect this info:

• Manually and upon request. Think about config data captured via SSSU or Command View. There
may be old SSSU output available. It’s important to collect as much information as possible, and
to indicate at the same time which information is the most current. The customer can inform you
about the location of these archives.
• Automatically and periodically. There are several tools available that periodically collect this
information. These tools are:

1. WEBES v4.4.3 onwards. You’re able to find an EVA configuration that is at most 12 hours old in the
“C:\Program Files\Hewlett-Packard\svctools\specific\ca\data” directory. The
file is called “<hostname>_config.xml”.
2. SDC (Support Data Collector). This tool updates its configuration tracking database upon
detection of a change in the EVA configuration. You can find the home directory of this tool in
“C:\SDC”. Within this directory, you’ll find subdirectories. The tool also provides historical EVA
configuration data. Please create a zip archive of the “config”, “output” and “tmp” subdirectories.
3. EVE (Event View EVA). This tool, mainly used in EMEA, has the same characteristics as SDC.
Its home directory can be found in “C:\Hsvev”. Within this directory, you’ll find subdirectories.
Please create a zip archive of the “config”, “output” and “tmp” subdirectories.
4. ScanMaster. This tool periodically scans (for example, once a week or once a month) several SAN
components, including EVAs. It might be worthwhile to find this information as well, as it provides
some history on the EVA configuration.

2.6 Serial Line Capture


Do not capture any serial line output yet, unless explicitly asked for by higher escalation levels. By that
time, you’ll also receive explicit instructions on the actions you’ll need to take.


2.7 List of Questions towards the Customer


2.7.1 General
In case of an “EVA Down” situation, there are some questions that you can ask the customer. You may
be asked these questions later by higher escalation levels, specifically if the “EVA Down” situation
starts to last for a long time.

Sometimes a customer didn’t have the opportunity to fully back up the data that resides on the EVA. It’s
important to know whether there is a valid backup. In “EVA Down” situations that last longer than
expected, a trade-off must be made between continuing to troubleshoot and
restoring the data from tape. This can be best explained by using the following example:

“A customer needs to resume production by 6 PM on Sunday evening. Restore time is 10 hours and
there is a valid backup. That means that between 7–8 AM on Sunday morning a decision needs to be
made to continue troubleshooting or to start to roll back the data from tape, in order to allow the
customer to resume production at 6 PM.”
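
The arithmetic behind this example is simply the production deadline minus the restore time, optionally minus a safety margin. A minimal sketch follows; the calendar date and the one-hour margin are chosen for illustration and do not come from this guide.

```python
from datetime import datetime, timedelta


def latest_restore_start(resume_at: datetime,
                         restore_duration: timedelta,
                         margin: timedelta = timedelta(0)) -> datetime:
    """Latest moment at which a restore from tape must start."""
    return resume_at - restore_duration - margin


# Sunday 6 PM; 16-Apr-2006 is used only as an example of a Sunday.
resume = datetime(2006, 4, 16, 18, 0)
restore_time = timedelta(hours=10)

print(latest_restore_start(resume, restore_time))
# 2006-04-16 08:00:00
print(latest_restore_start(resume, restore_time, margin=timedelta(hours=1)))
# 2006-04-16 07:00:00
```

With no margin the decision must be made by 8 AM; with a one-hour margin, by 7 AM, which matches the 7–8 AM window in the example.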

There’s no need to ask these questions upon entry of the escalation, but be sensitive to remarks from
the customer. It will be important to understand the customer situation and business requirements
related to the operation of this EVA.

2.7.2 Does the customer prefer a fast restore of data availability versus a time consuming data capturing process?
There are “EVA Down” situations which can be quickly recovered by going through a complete power
down of the EVA. However, in such a case, vital information kept in the controller’s volatile memory about
the possible root cause of the situation will be lost. Depending on the VCS and XCS versions, it’s
possible to capture this information via the serial port. However, the capture process might take 1–2
hours. The customer needs to make a trade-off between a fast restore of data availability and
capturing the volatile data, which may allow, but does not guarantee, the determination of the root cause
of this incident.

2.7.3 Is this EVA in a CA configuration?


In case the EVA is part of a CA configuration, it may be worthwhile to consider the option to perform a
failover to the other EVA and resume production on that other EVA. In case a decision is taken to
perform a failover to the other EVA, block host access to the unresponsive EVA by changing the zoning
configuration in the SAN.

2.7.4 Is this system connected to ISEE?


If the answer is yes, higher escalation levels can see the call history of this system. That’s important
background information.


3 Guidelines to Elevate a Priority 1 Escalation towards SWD Level 3
This section lists the guidelines and criteria that must be fulfilled before an “EVA Down” escalation can
be raised to SWD Level 3 Support.

The following conditions must be met if involvement from Level 3 Support is requested:

1. The firmware currently running on the EVA must be a supported version. Exceptions can be
made by the “Gate Keeper”. In case the current firmware isn’t a supported one, a Root Cause
Analysis will not be provided by Level 3 Support.
2. The EVA must refuse to boot, even after a single restart attempt. The restart attempt MUST be
executed as follows:
a. Connect a laptop with a serial connection to the UART port of the bottom HSV
controller. Make sure the terminal emulation is in capturing mode and that the data bits and
baud rate are correctly set.
b. Halt this controller back to the MINDY prompt by using the CTRL+K keystroke. The “:”
prompt will appear.
c. Now connect the serial cable to the top HSV controller.
d. Halt this controller back to the MINDY prompt by using the CTRL+K keystroke. The “:”
prompt will appear.
e. Power down the disk drive enclosures by removing both power cords from each
enclosure. Power down all enclosures before moving on to the next step. Do NOT
use the main power switch, as this will power down the HSV controllers as well.
f. Power up the disk drive enclosures by re-inserting both power cords into the power
supplies of each disk drive enclosure.
g. Check that every EMU reports its correct shelf position.
h. Check that all disk drives are ready by verifying that the green LED on every drive is
ON.
i. Start the top controller by entering the “rs” command on the keyboard. The top
controller will now start and run through its standard boot sequence. One of two
outcomes follows:
i. The controller boots normally and starts to display the storage cell name
and WWID on the LCD. In that case, connect the serial cable to the bottom
controller and start this controller as well with the “rs” command. This
controller will also start and display the storage cell name and WWID upon
completion of its boot cycle.
ii. The top controller does not display its storage cell name and WWID after a
couple of minutes.
3. The restart attempt may be executed only one time, as a power cycle by its nature introduces the
risk of failing drives.
4. The Questionnaire must be filled in and presented with the case.

In case both controllers display the storage cell name and WWID, a P2 escalation towards SWD Level
3 Support is still required, also for tracking purposes. The questionnaire must be filled in and presented
with the case.

In case both controllers do not display the storage cell name and WWID, a P1 escalation is warranted.
Please provide the filled-out questionnaire together with the case.
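
The escalation rules above can be sketched as a small decision function. This is only an illustration of the rules as written in this section: the class, field, and function names are hypothetical, and the guide does not say what to do when only one controller displays the storage cell name and WWID, so that case is deliberately left undecided.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RestartOutcome:
    """Observations after the controlled restart attempt (hypothetical names)."""
    firmware_supported: bool          # condition 1, or exception granted by the Gate Keeper
    restart_attempts: int             # condition 3: exactly one attempt is allowed
    questionnaire_complete: bool      # condition 4
    top_shows_name_and_wwid: bool     # LCD of the top controller
    bottom_shows_name_and_wwid: bool  # LCD of the bottom controller


def escalation_priority(o: RestartOutcome) -> Optional[str]:
    """Return "P1", "P2", or None when this section gives no rule."""
    if not (o.firmware_supported and o.questionnaire_complete):
        return None  # preconditions for involving Level 3 are not met
    if o.restart_attempts != 1:
        return None  # the restart must have been executed once, and only once
    if o.top_shows_name_and_wwid and o.bottom_shows_name_and_wwid:
        return "P2"  # both controllers came up: escalate for tracking purposes
    if not o.top_shows_name_and_wwid and not o.bottom_shows_name_and_wwid:
        return "P1"  # neither controller came up: "EVA Down"
    return None      # mixed result: not covered by this section


if __name__ == "__main__":
    print(escalation_priority(RestartOutcome(True, 1, True, False, False)))
```

Returning None rather than guessing keeps the sketch honest about the cases the guide leaves open; in practice those cases need a human decision.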


4 Appendix A: Questionnaire

Number of Disk Drive Enclosures

Number of Disk Drives

VCS or XCS Version

Loop Switches installed? If so, specify type (see document)

Exact Text on LCD Display of Controllers

Status of port LEDs of controllers. Please specify if one or more are NOT green.

Status of status LED on each loop switch (if applicable).

Status of port LEDs on each loop switch. Please choose from on, off, blinking.

Location of latest controller event log (attached to case, ftp location)

Location of latest controller termination log (attached to case, ftp location)

Is EVA Configuration info available? (yes/no) If yes, specify location.

System connected to ISEE? (yes/no)

Serial Line capture available? (yes/no) If yes, specify location.


Does one of the EMUs indicate an error situation? (yes/no) If yes, what’s the error code? Use the
N.N.NN.NN format.

Is the EVA part of a CA configuration? (yes/no) If yes, “source”, “destination” or “bi-directional”.

List of actions and results performed up to this point in summary format.
1.
2.
3.
4.
5.
6.

Has a restart attempt, as described earlier in this document, been executed? If so, how many times?

Have the controllers been powered off by the customer? (yes/no)
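
Two of the questionnaire answers have a fixed form and can be checked before the case is submitted. The sketch below is a hypothetical helper, not an HP tool. It assumes that each "N" in the N.N.NN.NN format stands for one decimal digit, which the guide does not state explicitly.

```python
import re

# Assumption: every "N" in N.N.NN.NN is a single decimal digit.
EMU_CODE = re.compile(r"\d\.\d\.\d{2}\.\d{2}")

# The three CA roles the questionnaire allows.
CA_ROLES = {"source", "destination", "bi-directional"}


def is_valid_emu_code(text: str) -> bool:
    """True if the text matches the N.N.NN.NN layout asked for in the questionnaire."""
    return EMU_CODE.fullmatch(text.strip()) is not None


def is_valid_ca_role(text: str) -> bool:
    """True if the text is one of the roles listed in the questionnaire."""
    return text.strip().lower() in CA_ROLES


if __name__ == "__main__":
    print(is_valid_emu_code("0.3.02.01"))  # layout check only; the value is made up
    print(is_valid_ca_role("Source"))
```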


5 Appendix B: Loop Switch Models


5.1 Old Model

Power-On-Self-Test: All LEDs (except power) will flash.

System LEDs:

1. SFP Status is green if the SFP is operating normally.
2. Port Bypassed is amber if the port is bypassed. If OFF, there is no SFP inserted.
3. Over Temperature is amber if the temperature is too high.
4. Power-On-Self-Test is amber if the self-test failed.
5. Loop Operation is green when the loop completes initialization.
6. Power is green when the power supply is functional.

5.2 New Model

• This model has an Ethernet port with 2 LEDs adjacent to it. These LEDs indicate the
network connection status.
• The port LEDs operate differently compared to the old model:
o The green LED is solid on if there’s no FC activity (traffic) on that port and flashes
if there is activity.
o The yellow LED is normally off. If on, it indicates that the port is bypassed.
• The 4 system LEDs in the lower right-hand corner operate slightly differently compared to the
old model:
1. Fault LED (Top Left):
• When on, one or more of the ports has failed or the internal temperature
has exceeded acceptable levels.
• When flashing, all ports are operational but another error has occurred.
Errors appear in an event log. The level of error severity that will cause
flashing to start can be controlled using the config sys fault command in
the CLI. The default is level 3, Critical.
• Note: Whether the LED is lit or flashing, the switch will continue to operate. Switch
functionality may be impaired depending on the event. Regardless of the cause, the switch
requires immediate attention.
2. Power LED (Top Right): When on, the switch is plugged in and the internal power supply
is functional.
3. System Operational (Bottom Right): When on, indicates that the switch has completed
initialization for ports with inserted SFPs and that the switch is operational.
4. 2 Gbps enabled (Bottom Left):
• When on, the switch is set to operate at a speed of 2Gb/s.
• When off, the switch is set to 1Gb/s.
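
The new-model LED rules can be written down as a small lookup, which may help when recording the loop switch LED states for the questionnaire. This is a minimal sketch with hypothetical function names. It uses the questionnaire's own vocabulary ("on", "off", "blinking"), and it assumes that a Fault LED that is off means no fault is being signalled, which the text above implies but does not state.

```python
def describe_port_leds(green: str, yellow: str) -> str:
    """Interpret the two port LEDs of a new-model loop switch.

    Each argument is "on", "off", or "blinking".
    """
    if yellow == "on":
        return "port bypassed"
    if green == "blinking":
        return "FC activity on port"
    if green == "on":
        return "no FC activity on port"
    return "state not described in the guide"


def describe_fault_led(state: str) -> str:
    """Interpret the Fault LED (top left system LED) of a new-model loop switch."""
    if state == "on":
        return "port failure or over-temperature"
    if state == "blinking":
        return "ports operational, another error logged"
    if state == "off":
        return "no fault indicated"  # assumption, not stated in the guide
    return "unknown state"


if __name__ == "__main__":
    print(describe_port_leds("blinking", "off"))
    print(describe_fault_led("on"))
```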


6 Appendix C: Meaning of the Controller LEDs next to LCD


Fault LED
• When the amber LED to the right of this icon is ON (flashing) there is a
controller problem. Check either via Command View or the LCD Fault
Management displays for a definition of the problem and recommended
corrective action.

Host Link LED


• When the green LED next to this icon is ON, there is a link between the
storage system and a host.
• When the red LED next to this icon is ON, there is no link between the storage
system and a host.

Vdisk Good LED (not present on HSV1xx controllers)


• Used to indicate if system is presenting good virtual disks
• Lit (green) if virtual disks are healthy and functioning normally

Controller Heartbeat LED


• When the green LED next to this icon is flashing slowly, the controller
heartbeat is operating normally. This means that both controllers communicate
with each other via the dedicated mirror port or one of the device ports.
• When this LED is not flashing, this indicates a potential problem. It’s expected
behavior for the LED to be off in case a controller is unresponsive or halted by
commands entered via the serial port.

Cache Battery Assembly LED


• When the red LED next to this icon is OFF, the battery assembly is charged.
• When the LED is ON, the battery assembly is less than 50% charged or failed.

Controller Unit ID (not present on HSV1xx controllers)


• Used to locate the enclosure
• Lights the blue LED on the front and back (labeled UID)
• If you push the button (front or back) it will light the LED on both buttons


7 Appendix D: Some LCD Messages and their Meaning


• “Startup Complete”. This message indicates that the controller finished booting. It will now
start to look for drives and the other HSV controller.
• “Scanning for Disks”. This message indicates that the controller is in the process of detecting
the number and type of disk drives on the FC-AL back-end loops. If this message is displayed
for a long time, it usually means that the controller isn’t able to finish device discovery.
• “Activating StSys”. This message means that the controller is either creating a new storage cell
(the first controller, which becomes the master controller) or joining an existing storage cell (the
second controller, which will become the slave). If this message is displayed for a long time, it
usually means that the controller cannot join an existing storage cell, most likely as the CSM
process must (and will) be reset in 60 minutes.
• “Metadata unprotected”. This message means that only one quorum drive could be found.
Typically occurs if the system boots with only one disk-group while one or more disk-groups
are (temporarily) removed.
• “Too many disks”. This message means that the system discovered more devices than
supported. The EVA supports up to 240 disk drives. Only occurs on systems with an extension
cabinet and drives recently added to enclosures 17 and 20. Remember that in those two enclosures
the drive bays 9–14 should not be populated.
• “Too few disks”. This message means that the system discovered fewer than the minimum
number of four drives. Indicates severe back-end issues if there are physically more than four
drives.
• “Partitioned FC loop”. This message means that the golden quorum disks have only one port
active, so the controller can’t reserve these quorum disks.
• “Too many diff stsys”. This message means that the system found quorum disks from more
than four storage cells.
• “Switch on FC loop”. This message means that the system encountered a Fibre Channel switch
on the loop (other than the supported models).
• “Too many HSVs”. This message means that the controller encountered two other HSV
controllers. Can only occur if a wiring mistake is made with FC-AL cables.
• “StSys has been lost”. This message means that the system lost access to all quorum drives.
Typically occurs if all quorum drives are located on one loop-pair and both loops in that
loop-pair are down.
• “Mult disk failures”. This message means that the system cannot locate at least two golden
quorum drives. Typically means that a disk-group became inoperable due to multiple drive
failures in the same RSS.
• “Too many shelves”. This message means that more than ten disk drive enclosures are
detected on an FC-AL loop.
• “Mult stsys detected”. This message means that the system found quorum disks from another
storage cell. Remove the drives that were just added and reboot the controller.
• “Possible Bad Config”. This message means that the system discovered only one quorum
drive and may be running on a single controller.
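
The limits quoted in the messages above (at most 240 disk drives, at least four drives, at most ten enclosures per FC-AL loop) can be used to anticipate which LCD message a given configuration is likely to produce. The following is a rough sketch built only from those stated limits. The function and constant names are hypothetical, and the real firmware checks are certainly more involved.

```python
from typing import List

# Limits taken from the LCD message descriptions above.
MAX_DISKS = 240
MIN_DISKS = 4
MAX_SHELVES_PER_LOOP = 10


def expected_lcd_messages(disk_count: int, shelves_per_loop: List[int]) -> List[str]:
    """Return the LCD messages that the stated limits suggest for this configuration."""
    messages = []
    if disk_count > MAX_DISKS:
        messages.append("Too many disks")
    if disk_count < MIN_DISKS:
        messages.append("Too few disks")
    if any(count > MAX_SHELVES_PER_LOOP for count in shelves_per_loop):
        messages.append("Too many shelves")
    return messages


if __name__ == "__main__":
    # Hypothetical system: 244 discovered drives, loops with 9 and 11 enclosures.
    print(expected_lcd_messages(244, [9, 11]))
```

An empty list only means that none of these three limits is violated; it says nothing about the quorum-related messages.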


8 BIST and MIST

8.1 Overview
When the Enterprise hardware initializes, the first software to run will be the Built-In Self-Test (BIST)
image. The goal of the BIST image is to verify that the controller functions well enough to support the
execution of the functional (or main) firmware image. This goal requires that all components attached
directly to the PowerPC, and those accessible via the Quasar chip's "local" bus, are tested.

If any of the BIST diagnostic tests fail, or if the controller fails to execute BIST diagnostics, the amber
LED on the EVA OCP will be illuminated and all controller activity will halt. In this case the controller
must be replaced.

If the BIST completes successfully, a set of minimum integrity diagnostics (MIST) are then run to verify
that the bus hardware is functional, the program card contents are valid, and that shared memory is
good.


8.2 BIST (Built-in Self-Test)

The components are tested in the order listed below.

1.1. Real-time Environment Executor. The Real-time Environment eXecutor (REX) will execute the
different BIST diagnostic tests, and collect and report error information as required.

1.1.1. Processor Initialization. The BIST image must first configure the PowerPC processor for use
with the Enterprise design.

1.1.2. Quasar Configuration. The Quasar chip must be configured for access to the components on
the Quasar chip's bus.

1.1.3. Policy Memory. The DIMM for policy memory will be interrogated, using the Quasar's I²C
controller to determine size and configuration requirements. After the Quasar chip has been
configured for this DIMM, the memory will be tested for address and data integrity.

1.1.4. Program Card. The BIST image located on the program card must be verified, to ensure that
the code being executed is valid.

1.1.5. Glue Chip. The Glue chip (G3 for third glue chip) is an FPGA (Field Programmable Gate Array)
chip containing much of the miscellaneous control functionality that was implemented using
more discrete components in previous controller designs. This component contains features
that include PCI bus arbiters, interrupt controllers, bus watchdog timers, and general-purpose
I/O pins (i.e. diagnostic registers). The Glue chip will be read to verify initial values, written
where possible to verify address and data integrity and configured for use.

1.1.6. L2 Cache. The PowerPC will utilize an L2 cache on some versions of the Enterprise platform.
This cache must be verified to ensure functionality before it is enabled for use by the
processor. This test is fault tolerant and the controller will continue to run without the L2 cache if
the test fails.

1.1.7. NVRAM & TOY Clock. The NVRAM component of Enterprise is composed of a battery
backed-up SRAM chip. The NVRAM battery controller contains a Time of Year (TOY) Clock.
The battery controller, the battery and the memory must be checked for functionality, and the
TOY may need to be initialized. The functionality of the NVRAM system is not required to begin
execution of the functional image, but as it is the only component on the Quasar's bus that isn't,
it will be tested as a BIST component.

1.1.8. Functional Image. The functional image must be moved from the program card to the required
addresses in Policy memory. The functional image will be compressed on the program card.
The image will be verified as it is moved from the card to memory. After the functional image is
placed in memory, the BIST image will clean up and begin the execution of the functional
image.
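
The ordering and failure behaviour described above can be modelled as a simple sequential runner, which is a convenient way to reason about where a controller stopped. This is an illustrative model only, not controller firmware. The names are hypothetical. Only the L2 cache test is marked fault tolerant, because that is the only stage the text explicitly describes that way. Treating an NVRAM failure as fatal is an assumption that follows the general rule from the overview, since the text does not spell out that case.

```python
from typing import Dict, List, Tuple

# (stage name, fault tolerant?) in the order listed above.
BIST_STAGES: List[Tuple[str, bool]] = [
    ("Processor Initialization", False),
    ("Quasar Configuration", False),
    ("Policy Memory", False),
    ("Program Card", False),
    ("Glue Chip", False),
    ("L2 Cache", True),            # controller continues without the L2 cache
    ("NVRAM & TOY Clock", False),  # assumption: handled like any other failure
    ("Functional Image", False),
]


def run_bist(results: Dict[str, bool]) -> Tuple[bool, List[str]]:
    """Walk the stages in order; results maps stage name to pass (True) or fail (False).

    Stages missing from results are treated as passing.
    Returns (functional image may start, log lines).
    """
    log = []
    for name, tolerant in BIST_STAGES:
        if results.get(name, True):
            log.append(f"{name}: pass")
        elif tolerant:
            log.append(f"{name}: fail, continuing without it")
        else:
            log.append(f"{name}: fail, controller halts with amber LED")
            return False, log
    return True, log


if __name__ == "__main__":
    ok, lines = run_bist({"Program Card": False})
    print(ok)
    print("\n".join(lines))
```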


8.3 MIST (Minimum Integrity Diagnostics)

8.3.1 HSV100 and HSV110 Controller Models

8.3.1.1 General
TE 1. (0x01) - Temperature Sensor test
test 1 - check the values of the two temp sensors and processor temperature
test 2 - check rpm values and voltages

TE 2. (0x02) - HW code check and LCD setup
test 1 - Load the OCPs internal memory via IIC
test 2 - update Battery PIC code
test 3 - update GLUE code
test 4 - update CBIC code

TE 3. (0x03) - Serial and WWID Number test
test 1 - validate EEPROM WWID and Serial #

TE 10. (0x0a) - Near PCI configuration test
test 1 - configure the near PCI bus

TE 12. (0x0c) - Cache Memory test
test 1 - Near PCI low side bus test
test 2 - IIC access of DIMMs
test 3 - Configuring Surge
test 4 - Configuration Verification Tests
test 5 - Access size test
test 6 - DIMM address test
test 7 - Surge ECC tests
test 8 - DMA engine test (policy scripts)
test 9 - Diagnostic Page tests
test 10 - DMA engine test (cache scripts)
test 11 - Cache memory tests

TE 13. (0x0d) - Cache Battery test
test 1 - Access PIC
test 2 - Access Battery Bricks
test 3 - determine down time
test 4 - summarize battery status
test 5 - Interrupt tests
test 6 - Disable/Enable test
test 7 - fan and battery LED test
test 8 - EEPROM Write/Read test

TE 14. (0x0e) - Far PCI configuration
test 1 - configure the far PCI bus


TE 17. (0x11) - Port 4 (MP) test. See Tests used within the port tests for more details.

TE 18. (0x12) - Port 5 (FP1) test. See Tests used within the port tests for more details.

TE 19. (0x13) - Port 6 (FP2) test. See Tests used within the port tests for more details.

TE 20. (0x14) - Port 0 (DP-2A) test. See Tests used within the port tests for more details.

TE 21. (0x15) - Port 1 (DP-1A) test. See Tests used within the port tests for more details.

TE 22. (0x16) - Port 2 (DP-2B) test. See Tests used within the port tests for more details.

TE 23. (0x17) - Port 3 (DP-1B) test. See Tests used within the port tests for more details.

TE 24. (0x18) - All Ports test
test 1 - SFS Internal Loop-Back test
test 2 - SFS External Pad Loop-Back test 1k
test 3 - SFS External Pad Loop-Back test 2k
test 4 - SFS External Pad Loop-Back test 2k (Surge → Quasar memory)
test 5 - SFS External Pad Loop-Back test 2k (Quasar → Surge memory)
test 6 - SFS External Pad Loop-Back test 2k (Surge → Surge memory)
test 7 - SEST External Pad Loop-Back test 16k (Surge → Surge memory)
test 8 - SEST External Pad Loop-Back test 16k ns (Surge → Surge memory)

TE 25. (0x19) - Config & Init Port regs
test 1 - configure all passing ports for use by firmware

TE 28. (0x1c) - CBIC test
test 1 - communicate with CBIC chip

TE 31. (0x1f) - Hardware Revision test
test 1 - gather configure information

8.3.1.2 Tests Used within the Port Tests (TE17 – TE23):


test 1 -- Register read/write test. Includes SFF detect (errors ending in 0x500)
test 2 -- Data lines test
test 4 -- SFS DMA Read/Write test
test 5 -- Interrupt test
test 6 -- Internal Loop-Back N-Port Initialization test
test 7 -- Internal Loop-Back AL Initialization test
test 8 -- External Pad Loop-Back AL Initialization test
test 10 -- SFS Internal Loop-Back test
test 11 -- SFS External Pad Loop-Back test 1k
test 12 -- SFS External Pad Loop-Back test 2k
test 14 -- SFS External Pad Loop-Back test 2k (Surge → Quasar memory)
test 15 -- SFS External Pad Loop-Back test 2k (Quasar → Surge memory)


8.3.2 HSV200 and HSV210 Controller Models

8.3.2.1 General
TE 2. (0x02) - HW code check and LCD setup
test 1 - Load the OCPs internal memory via IIC
test 2 - update Battery PIC code
test 3 - update GLUE code
test 4 - update CBIC code

TE 7. (0x07) - Cache Memory test
test 1 - Near PCI low side bus test
test 2 - IIC access of DIMMs
test 3 - Configuring Surge
test 4 - Configuration Verification Tests
test 5 - Access size test
test 6 - DIMM address test
test 7 - Surge ECC tests
test 8 - DMA engine test (policy scripts)
test 9 - Diagnostic Page tests
test 10 - DMA engine test (cache scripts)
test 11 - Cache memory tests

TE 8. (0x08) - SDC test (prior to XCS 5030)

TE 12. (0x0c) - Port 0 test (SFP 7, DP-2A). See Tests used within port tests for more details.

TE 13. (0x0d) - Port 1 test (SFP 6, DP-1A). See Tests used within port tests for more details.

TE 14. (0x0e) - Port 2 test (SFP 5, MP2). See Tests used within port tests for more details.

TE 15. (0x0f) - Port 3 test (SFP 4, FP4). See Tests used within port tests for more details.

TE 16. (0x10) - Port 4 test (SFP 9, FP3). See Tests used within port tests for more details.

TE 17. (0x11) - Port 5 test (SFP 8, FP2). See Tests used within port tests for more details.

TE 18. (0x12) - Port 6 test (SFP 3, FP1). See Tests used within port tests for more details.

TE 19. (0x13) - Port 7 test (SFP 2, MP1). See Tests used within port tests for more details.

TE 20. (0x14) - Port 8 test (SFP 1, DP-2B). See Tests used within port tests for more details.

TE 21. (0x15) - Port 9 test (SFP 0, DP-1B). See Tests used within port tests for more details.


TE 22. (0x16) - All Ports test
test 1 - SFS Internal Loop-Back test
test 2 - SFS External Pad Loop-Back test 1k
test 3 - SFS External Pad Loop-Back test 2k
test 4 - SFS External Pad Loop-Back test 2k (Surge → Quasar memory)
test 5 - SFS External Pad Loop-Back test 2k (Quasar → Surge memory)
test 6 - SFS External Pad Loop-Back test 2k (Surge → Surge memory)
test 7 - SEST External Pad Loop-Back test 16k (Surge → Surge memory)
test 8 - SEST External Pad Loop-Back test 16k ns (Surge → Surge memory)

TE 23. (0x17) - Port-to-Port Test.

TE 24. (0x18) - Config & Init Port regs
test 1 - configure all passing ports for use by firmware

TE 28. (0x1c) - SDC test (XCS 5030 onwards)

TE 31. (0x1f) - Hardware Revision test
test 1 - gather configure information
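
When a failing TE number is read from the LCD or the serial capture of an HSV200 or HSV210 controller, the port test entries above translate it into a physical port. The lookup below is a small sketch built directly from that list. The function names are hypothetical, and it covers only the port tests TE 12 to TE 21.

```python
from typing import Optional

# TE number -> (port number, SFP number, port label), copied from the list above.
HSV200_PORT_TESTS = {
    12: (0, 7, "DP-2A"),
    13: (1, 6, "DP-1A"),
    14: (2, 5, "MP2"),
    15: (3, 4, "FP4"),
    16: (4, 9, "FP3"),
    17: (5, 8, "FP2"),
    18: (6, 3, "FP1"),
    19: (7, 2, "MP1"),
    20: (8, 1, "DP-2B"),
    21: (9, 0, "DP-1B"),
}


def te_hex(te: int) -> str:
    """Format a TE number the way the list shows it, e.g. 12 becomes 0x0c."""
    return f"0x{te:02x}"


def describe_port_test(te: int) -> Optional[str]:
    """Describe an HSV200/HSV210 port test, or return None for any other TE number."""
    entry = HSV200_PORT_TESTS.get(te)
    if entry is None:
        return None
    port, sfp, label = entry
    return f"TE {te} ({te_hex(te)}): port {port}, SFP {sfp}, {label}"


if __name__ == "__main__":
    print(describe_port_test(14))
```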

8.3.2.2 Tests Used within Port Tests (TE12 – TE21)


test 1 -- Register read/write test. Includes SFF detect (errors ending in 0x500)
test 2 -- Data lines test
test 4 -- SFS DMA Read/Write test
test 5 -- Interrupt test
test 6 -- iTR Link Mode test
test 7 -- iTR Wrap Mode test
test 8 -- Internal Loop-Back N-Port Initialization test
test 9 -- Internal Loop-Back AL Initialization test
test 10 -- External Pad Loop-Back AL Initialization test
test 12 -- SFS Internal Loop-Back test
test 13 -- SFS External Pad Loop-Back test 1k
test 14 -- SFS External Pad Loop-Back test 2k
