Unresponsive EVA Handling Guide
Version 1.0
A guide to best practices for handling an unresponsive EVA, with instructions on the information to
collect.
This information is strictly HP Restricted.
Do not share in any form (verbal, soft- or hardcopy) with others.
1 Introduction
1.1 General
1.2 Tools Required
1.3 Things Not to Do
1.4 How to Determine if an EVA is Considered “Down”
1.4.1 Visual Indicators
1.4.2 Disk Drive LEDs
1.4.3 LCD Displays (alternating) WWID and Storage Cell Name
1.4.4 LCD Displays Nothing
1.4.5 LCD Displays Something Else
1.4.6 Command View
2 Information to Collect
2.1 General
2.2 Version and Model Numbers
2.3 Status of Visual Indicators
2.4 Latest Controller Event Log
2.5 Latest EVA Configuration
2.6 Serial Line Capture
2.7 List of Questions towards the Customer
2.7.1 General
2.7.2 Does the customer prefer a fast restore of data availability versus a time consuming data capturing process?
2.7.3 Is this EVA in a CA configuration?
2.7.4 Is this system connected to ISEE?
3 Guidelines to Elevate a Priority 1 Escalation towards SWD Level 3
4 Appendix A: Questionnaire
5 Appendix B: Loop Switch Models
5.1 Old Model
5.2 New Model
6 Appendix C: Meaning of the Controller LEDs next to LCD
7 Appendix D: Some LCD Messages and their Meaning
8 BIST and MIST
8.1 Overview
8.2 BIST (Built-in Self-Test)
8.3 MIST (Minimum Integrity Diagnostics)
8.3.1 HSV100 and HSV110 Controller Models
8.3.2 HSV200 and HSV210 Controller Models
1 Introduction
1.1 General
This document describes best practices for recognizing and reacting to a situation where an
EVA doesn’t respond to management commands from Command View, SSSU, RSM, etc. and/or to host
I/O.
This document also lists the information to be collected onsite in such a situation.
Lastly, this document contains the criteria that apply when HP Services wants to escalate a case to
SWD. Some cases can be handled by HP Services without escalation.
It’s important to realize that, with the later VCS and XCS releases, an “EVA Down” situation means that the
data is temporarily unavailable due to a hardware issue. Data availability is restored once the hardware
issue is resolved. Data loss after an “EVA Down” situation is becoming rare. In most of the cases
where data was lost, human errors during the troubleshooting process were the cause.
Whatever you do, there’s no need to panic. Stay calm and maintain a professional attitude. Also,
adhere to the basics of proper troubleshooting: keep a written log of every action you
perform, with its timestamp and result. The format can be as simple as:
- Time:
- Action:
- Result:
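If the log is kept on a laptop rather than on paper, the same three fields can be written to a text file. The following is only a minimal sketch; the names `LogEntry` and `append_entry` and the log file path are illustrative assumptions, not part of any HP tool.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class LogEntry:
    """One troubleshooting action, mirroring the Time/Action/Result format."""
    time: datetime
    action: str
    result: str

    def render(self) -> str:
        return (
            f"- Time: {self.time:%Y-%m-%d %H:%M:%S}\n"
            f"- Action: {self.action}\n"
            f"- Result: {self.result}"
        )


def append_entry(path: str, action: str, result: str,
                 when: Optional[datetime] = None) -> LogEntry:
    """Append one entry to the log file at `path` (hypothetical location)."""
    entry = LogEntry(when or datetime.now(), action, result)
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(entry.render() + "\n\n")
    return entry
```

Appending rather than rewriting the file keeps earlier entries intact, so the log can be handed over with the escalation as recorded.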
If at any time during troubleshooting a down array there is ambiguity or you are
not certain of your next action, elevate the case to your next level of support.
1.2 Tools Required
An optional, but recommended, part of the tool kit is an EMU serial cable. Connecting to the EMU
communications port requires cable part number 17-04875-04. This cable is not
normally readily available from HP. Alternatively, it can be ordered directly from Amphenol at
+1 607-786-4345.
1.3 Things Not to Do
1. Do not power off the HSV controllers. There might still be valuable information (controller
events) in volatile memory. If the controllers are powered off, that information is lost and
a root cause analysis might no longer be possible.
2. Do not remove the cache batteries from the HSV controller. The batteries provide hold-up
power to the data cache, which most likely still contains user data. Removal of the batteries will
lead to data loss.
3. Do not connect your laptop to the serial port unless explicitly asked by higher escalation levels,
as it might interfere with the controller’s current recovery actions.
4. Do not swap parts at random. Every part must remain identifiable, and the location it was
removed from must be documented, so that you always know which part came from which
location.
The meaning of these visual indicators is explained in the Service Manual, which is part of the required
tool set.
Another possible explanation is that the HSV controllers have a single power supply and that either the
power supply is broken or the complete input power is gone. In the latter case, all disk drive enclosures
operate on a single power supply as well. The corrective action is to either replace the controller (in
case of a broken power supply) or troubleshoot why the input power is no longer available.
The POST messages should never be displayed for more than one minute. If they are, the POST cannot
complete because of an unexpected error condition in the controller hardware. If the POST messages
repeat in a cycle, the controller is failing its power-on self-test. The corrective action
in both cases is to replace the HSV controller. This is not an “EVA Down” situation and data availability
is restored once the controller is replaced.
Boot sequence messages should disappear within one minute, after which the display should
switch to showing the storage cell name and WWID. If that’s not the case, then typically only one
controller displays one of the boot sequence messages, while the other one still displays the storage cell
name and WWID. What you’re facing is most likely caused by a process lock-up within the firmware,
and a “CSM reset” should follow within an hour. This might be an “EVA Down” situation. The correct
action is to collect information and raise a Severity 1 escalation to the next escalation level. Monitor
each controller display and record the display output. It is important to note any patterns, or the
first and last data displayed.
The LID/DDD recovery messages should change periodically and disappear. This recovery activity
shouldn’t be disrupted. If LID/DDD recovery continues for more than 15 minutes, then it is probably due
to unexpected hardware conditions on the CAN-bus. The correct action is to restore data availability
first by executing the following sequence:
1. Halt both controllers by using the CTRL+H, CTRL+R keystroke sequence. Do not power off the
controllers, as vital event information is kept in volatile memory. Removing input power to the
controllers will lead to a loss of event information.
2. Unseat the EMUs in all disk drive enclosures by pulling them out of their connectors, but do not
remove them from the enclosures. You have to unseat all of them before you move on to the next step.
3. Re-insert the EMUs in all disk drive enclosures.
4. Restart the top controller. The controller will boot and re-start the LID/DDD recovery routine.
This time it will find the faulty drive and isolate it.
5. Restart the bottom controller.
6. Check, via the field service page, that the print and debug flags are still set to the expected values.
Raise an escalation to a higher level to register this event. With the escalation, supply the controller
event log, controller termination log, current EVA configuration and a list of the actions you performed.
The faulty disk drive must be physically removed from the system.
In all other cases, prepare yourself and collect the required information as listed in Appendix A:
Questionnaire. Do NOT power off the controllers unless explicitly directed by higher escalation levels. If
you power off the controllers, vital information (events that aren’t available yet in Command View) will
be lost, as it resides in volatile memory. The correct action is to collect information and raise a
Severity 1 escalation to the next escalation level.
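The time limits and actions described above for the different LCD contents can be summarized in a small lookup. This is only an illustrative sketch: the enum `LcdState`, the function `recommend` and the wording of the returned strings are assumptions made for the example, while the one-minute and 15-minute limits come from the text.

```python
from enum import Enum


class LcdState(Enum):
    POST = "POST messages"
    BOOT = "boot sequence messages"
    LID_DDD = "LID/DDD recovery messages"
    OTHER = "anything else"


# Time limits in minutes, as stated in the text above.
LIMITS = {LcdState.POST: 1, LcdState.BOOT: 1, LcdState.LID_DDD: 15}


def recommend(state: LcdState, minutes: float, repeating: bool = False) -> str:
    """Return the action the guide describes for what the LCD shows."""
    if state is LcdState.OTHER:
        return "collect Appendix A information; raise Severity 1 escalation"
    post_loop = state is LcdState.POST and repeating
    if minutes <= LIMITS[state] and not post_loop:
        return "wait and keep recording the display"
    if state is LcdState.POST:
        return "replace the HSV controller"
    if state is LcdState.BOOT:
        return "collect information; raise Severity 1 escalation"
    return "run the EMU re-seat sequence, then escalate"
```

For example, `recommend(LcdState.LID_DDD, 20)` returns the re-seat recommendation, while `recommend(LcdState.LID_DDD, 5)` advises waiting, because the recovery activity shouldn’t be disrupted.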
Ensure you reload the current status to prevent Command View from displaying old status data. Use
the “Refresh” button of your web browser; in case of Internet Explorer, this is mapped onto the
CTRL+F5 keystroke combination.
An unresponsive EVA will generate a log on the management system indicating the first failing
command.
Another thing to check is the proxy settings of the web browser. As for Internet Explorer, these settings
can be found in “Tools → Internet Options → Connections Tab → LAN Settings”. In case you’re working
on the SMA or management station, be sure that there’s no proxy access for “localhost” or “127.0.0.1”.
Within Command View, also use the “Re-discover” button. This will force Command View to re-scan the
SAN for any EVAs with which it’s allowed to communicate.
There are several tools and log files that you can use to verify the communication between the
management server and the EVA.
The first step is to verify the physical connection between the management server and the SAN, and to
verify that FC communication is established. The second step is to verify the SCSI and SCMI
communication.
The FC path between the management server and the FC switch can be verified by using the HBAnyware or
Emulex configuration utility. The FC switch can provide the connection status between the management
station and the FC switch, as well as between the EVA and the FC switch. It can also tell you if the EVA
is known in the name server of the switch.
If there are no issues on the physical level and FC communication appears to be working, reviewing the
logs in \hsvmafiles might give an indication about the reason why the management station cannot
communicate with the EVA.
From Command View v5 onwards, a new tool, EVACT, is present. This tool verifies the path and the
communication between the management system and the EVA. It can be run from a DOS prompt, from the
field service page, or from the CV storage system page when the connection is lost. It runs on any
Windows-based Fibre-attached server, but requires the program and its supporting files, so copy the
complete \hsvmafiles directory including the supporting DLLs.
If you run the command from the field service page, the trace file
“\hsvmafiles\evact\path_trace.txt” is created.
If the utility is started from the DOS prompt, you can redirect the output to a file by using
“evact > myfile.txt”. The optional “–verbose” flag makes the tool report what it
is doing at every step. If things are running slowly, this is a good way to confirm that the tool isn’t
hung.
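The same redirection can be scripted, for example to give every capture its own file name. The sketch below is an assumption-laden illustration: the helper `capture_output` is hypothetical, and it only presumes that the command passed in (such as `evact`) can be started from the current directory or PATH.

```python
import subprocess
from pathlib import Path
from typing import Sequence


def capture_output(command: Sequence[str], out_file: Path) -> int:
    """Run `command`, write its standard output to `out_file`,
    and return the exit code (the equivalent of `command > out_file`)."""
    with out_file.open("wb") as fh:
        completed = subprocess.run(command, stdout=fh, check=False)
    return completed.returncode


# Hypothetical usage on the management server:
#   capture_output(["evact"], Path("myfile.txt"))
```

Passing the command as a list avoids shell quoting issues. Any additional flags are supplied by the caller exactly as documented for the tool.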
2 Information to Collect
2.1 General
In case the situation is qualified as an “EVA Down” situation, the main priority of the higher escalation
level will be to restore data availability as soon as possible. To be able to do that, that team needs a
defined set of information to be provided upfront with the escalation. Providing correct and adequate
information allows higher escalation levels to quickly restore data availability. Appendix A of this
document contains a questionnaire, which must be filled in before raising the escalation.
Furthermore, check the LEDs on every I/O module. If there are no loop switches in the system, all I/O
modules should have three green LEDs. On systems with loop switches, there should be two green
LEDs (the top two) on each I/O module.
Check the status of all LEDs on each loop switch. If one LED or group of LEDs behaves differently
from the others, note that on the questionnaire. Monitor the LEDs on both the I/O
modules and the loop switches for at least one minute. They may flash
occasionally, and that information may help the higher escalation levels with troubleshooting.
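The I/O module rule above (three green LEDs without loop switches, two with) can be expressed as a simple check over the observed counts. This is a sketch only; the function names and the module labels used as dictionary keys are made up for the example.

```python
from typing import Dict, List


def expected_green_leds(has_loop_switches: bool) -> int:
    """Number of green LEDs each I/O module should show, per the text above."""
    return 2 if has_loop_switches else 3


def modules_to_note(green_counts: Dict[str, int], has_loop_switches: bool) -> List[str]:
    """Return the I/O modules whose green LED count differs from the expected one."""
    expected = expected_green_leds(has_loop_switches)
    return sorted(name for name, count in green_counts.items() if count != expected)
```

Any module returned by `modules_to_note` is a candidate for a remark on the questionnaire.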
The EMU has an alphanumeric display. If the yellow error LED is on (and “Er” is shown in the
two-character display), read the error code. See the Service Manual for instructions. Enter the error
code on the questionnaire in the N.N.NN.NN format.
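A quick format check can catch transcription mistakes before the questionnaire is submitted. The sketch below assumes that each “N” stands for a single hexadecimal digit; that is an assumption, not something this guide states, so adjust the pattern to whatever the Service Manual specifies.

```python
import re

# Assumption: each "N" in N.N.NN.NN is one hexadecimal digit (0-9, A-F).
_HEX = "[0-9A-Fa-f]"
_EMU_CODE = re.compile(rf"{_HEX}\.{_HEX}\.{_HEX}{{2}}\.{_HEX}{{2}}")


def is_valid_emu_code(text: str) -> bool:
    """Return True if `text` has the N.N.NN.NN shape."""
    return _EMU_CODE.fullmatch(text.strip()) is not None
```

The check validates only the shape of the code, not its meaning.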
• Manually and upon request. This covers configuration data captured via SSSU or Command View. There
may be old SSSU output available. It’s important to collect as much information as possible and, at the
same time, to indicate which information is the most current. The customer can tell you
where these archives are located.
• Automatically and periodically. Several tools collect this information periodically.
These tools are:
1. WEBES v4.4.3 onwards. You can find an EVA configuration that is at most 12 hours old in the
“C:\Program Files\Hewlett-Packard\svctools\specific\ca\data” directory. The
file is called “<hostname>_config.xml”.
2. SDC (Support Data Collector). This tool updates its configuration tracking database upon
detection of a change in the EVA configuration. You can find the home directory of this tool in
“C:\SDC”. Within this directory, you’ll find subdirectories. The tool also provides historical EVA
configuration data. Please create a zip archive of the “config”, “output” and “tmp” subdirectories.
3. EVE (Event View EVA). This tool, mainly used in EMEA, has the same characteristics as SDC.
Its home directory can be found in “C:\Hsvev”. Within this directory, you’ll find subdirectories.
Please create a zip archive of the “config”, “output” and “tmp” subdirectories.
4. ScanMaster. This tool periodically scans (once a week or once a month) several SAN components,
including EVAs. It might be worthwhile to find this information as well, as it provides some history of
the EVA configuration.
5. HPCC.
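The zip archives requested for SDC and EVE can be created with any archiving tool. As one possible approach, the sketch below uses Python’s standard `zipfile` module; the function name `archive_tool_data` is an illustrative assumption, and the archive should be written outside the three subdirectories being collected.

```python
import zipfile
from pathlib import Path
from typing import List

SUBDIRS = ("config", "output", "tmp")


def archive_tool_data(home: Path, archive: Path) -> List[str]:
    """Zip the config, output and tmp subdirectories of `home` into `archive`.

    Returns the relative names of the files stored. Missing subdirectories
    are skipped silently.
    """
    stored: List[str] = []
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for sub in SUBDIRS:
            base = home / sub
            if not base.is_dir():
                continue
            for path in sorted(base.rglob("*")):
                if path.is_file():
                    name = path.relative_to(home).as_posix()
                    zf.write(path, name)
                    stored.append(name)
    return stored


# Hypothetical usage: archive_tool_data(Path(r"C:\SDC"), Path(r"C:\Temp\sdc_data.zip"))
```

Storing paths relative to the home directory keeps the “config”, “output” and “tmp” structure visible inside the archive.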
Sometimes a customer hasn’t had the opportunity to fully back up the data that resides on the EVA. It’s
important to know whether there is a valid backup. In “EVA Down” situations that last longer than
expected, a decision must be made whether to continue troubleshooting or to
restore the data from tape. This is best explained with the following example:
“A customer needs to resume production by 6 PM on Sunday evening. Restore time is 10 hours and
there is a valid backup. That means that between 7 and 8 AM on Sunday morning a decision needs to be
made to continue troubleshooting or to start restoring the data from tape, in order to allow the
customer to resume production at 6 PM.”
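The arithmetic behind this example is a subtraction of the restore time, and optionally a safety margin, from the time production must resume. A minimal sketch follows; the function name and the one-hour margin are illustrative assumptions.

```python
from datetime import datetime, timedelta


def latest_restore_start(resume_by: datetime, restore_time: timedelta,
                         margin: timedelta = timedelta(0)) -> datetime:
    """Latest moment at which the restore from tape must start."""
    return resume_by - restore_time - margin


# The example above: resume by 6 PM, restore takes 10 hours.
resume = datetime(2024, 1, 7, 18, 0)             # an arbitrary Sunday
hard_deadline = latest_restore_start(resume, timedelta(hours=10))
with_margin = latest_restore_start(resume, timedelta(hours=10), timedelta(hours=1))
print(hard_deadline.strftime("%H:%M"))           # 08:00
print(with_margin.strftime("%H:%M"))             # 07:00
```

With no margin the restore must start by 8 AM; a one-hour margin moves the decision to 7 AM, which matches the 7 to 8 AM window in the example.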
There’s no need to ask these questions upon entry of the escalation, but be sensitive to remarks from
the customer. It will be important to understand the customer situation and business requirements
related to the operation of this EVA.
2.7.2 Does the customer prefer a fast restore of data availability versus a time consuming data capturing process?
There are “EVA Down” situations which can be quickly recovered by going through a complete power
down of the EVA. However, in such a case, vital information kept in the controller’s volatile memory about
the possible root cause of the situation will be lost. Depending on the VCS and XCS versions, it’s
possible to capture this information via the serial port. However, the capture process might take 1–2
hours. The customer needs to make a trade-off between a fast restore of data availability and
capturing the volatile data, which may allow, but does not guarantee, the determination of the root
cause of this incident.
3 Guidelines to Elevate a Priority 1 Escalation towards SWD Level 3
The following conditions must be met if involvement from Level 3 Support is requested:
1. The firmware currently running on the EVA must be a supported version. Exceptions can be
made by the “Gate Keeper”. In case the current firmware isn’t a supported one, a Root Cause
Analysis will not be provided by Level 3 Support.
2. The EVA must refuse to boot, even after a single restart attempt. The restart attempt MUST be
executed as follows:
a. Connect a laptop with a serial connection to the UART port of the bottom HSV
controller. Make sure the terminal emulation is in capturing mode and that the bits and
baud are correctly set.
b. Halt this controller back to the MINDY prompt by using the CTRL+K keystroke. The “:”
prompt will appear.
c. Connect the serial cable now to the top HSV controller.
d. Halt this controller back to the MINDY prompt by using the CTRL+K keystroke. The “:”
prompt will appear.
e. Power down the disk drive enclosures by unplugging both power cords from each
enclosure. Power down all enclosures first before moving on to the next step. Do NOT
use the main power switch, as this will power down the HSV controllers as well.
f. Power up the disk drive enclosures by re-inserting both power cords into the power
supplies of each disk drive enclosure.
g. Check that every EMU reports its correct shelf position.
h. Check that all disk drives are ready by verifying that the green LED on every drive is
on.
i. Start the top controller by entering the “rs” command on the keyboard. The top
controller will now start and run through its standard boot sequence. One of two things
will happen:
i. The controller boots normally and starts to display the storage cell name
and WWID on the LCD. In that case, connect the serial cable to the bottom
controller and start this controller as well with the “rs” command. This
controller will also start and display the storage cell name and WWID upon
completion of its boot cycle.
ii. The top controller does not display its storage cell name and WWID after a
couple of minutes.
3. The restart attempt may be executed only once, as a power cycle by its nature introduces the
risk of failing drives.
4. The Questionnaire must be filled in and be readily presented with the case.
In case both controllers display the storage cell name and WWID, a P2 escalation towards SWD Level
3 support is required, also for tracking purposes. The questionnaire must be filled in and presented
with the case.
In case both controllers do not display the storage cell name and WWID, a P1 escalation is warranted.
Please provide the filled-out questionnaire together with the case.
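The two outcomes above map directly onto an escalation priority. In the sketch below, the function name is an assumption, and “both controllers do not display” is read as “neither controller displays”. The case where only one controller shows the storage cell name and WWID is not covered by this text, so the function returns `None` for it rather than guessing.

```python
from typing import Optional


def escalation_priority(top_shows_name_wwid: bool,
                        bottom_shows_name_wwid: bool) -> Optional[str]:
    """Priority of the SWD Level 3 escalation after the single restart attempt."""
    if top_shows_name_wwid and bottom_shows_name_wwid:
        return "P2"   # both controllers recovered; escalate for tracking
    if not top_shows_name_wwid and not bottom_shows_name_wwid:
        return "P1"   # neither controller recovered
    return None       # mixed result: not specified in this section
```

In every case the filled-in questionnaire accompanies the case.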
4 Appendix A: Questionnaire
8.1 Overview
When the Enterprise hardware initializes, the first software to run will be the Built-In Self-Test (BIST)
image. The goal of the BIST image is to verify that the controller functions well enough to support the
execution of the functional (or main) firmware image. This goal requires that all components attached
directly to the PowerPC and those accessible via the Quasar chip's "local" bus, are tested.
If any of the BIST diagnostic tests fail, or if the controller fails to execute BIST diagnostics, the amber
LED on the EVA OCP will be illuminated and all controller activity will halt. In this case the controller
must be replaced.
If the BIST completes successfully, a set of minimum integrity diagnostics (MIST) is then run to verify
that the bus hardware is functional, the program card contents are valid, and that shared memory is
good.
1.1. Real-time Environment Executor. The Real-time Environment eXecutor (REX) executes the
different BIST diagnostic tests, and collects and reports error information as required.
1.1.1. Processor Initialization. The BIST image must first configure the PowerPC processor for use
with the Enterprise design.
1.1.2. Quasar Configuration. The Quasar chip must be configured for access to the components on
the Quasar chip's bus.
1.1.3. Policy Memory. The DIMM for policy memory will be interrogated, using the Quasar's I²C
controller to determine size and configuration requirements. After the Quasar chip has been
configured for this DIMM, the memory will be tested for address and data integrity.
1.1.4. Program Card. The BIST image located on the program card must be verified, to ensure that
the code being executed is valid.
1.1.5. Glue Chip. The Glue chip (G3 for third glue chip) is an FPGA (Field Programmable Gate Array)
chip containing much of the miscellaneous control functionality that was implemented using
more discrete components in previous controller designs. This component contains features
that include PCI bus arbiters, interrupt controllers, bus watchdog timers, and general-purpose
I/O pins (i.e. diagnostic registers). The Glue chip will be read to verify initial values, written
where possible to verify address and data integrity, and configured for use.
1.1.6. L2 Cache. The PowerPC uses an L2 cache on some versions of the Enterprise platform.
This cache must be verified to ensure functionality before it is enabled for use by the
processor. This test is fault tolerant, and the controller will continue to run without the L2 cache if
the test fails.
1.1.7. NVRAM & TOY Clock. The NVRAM component of Enterprise is composed of a battery
backed-up SRAM chip. The NVRAM battery controller contains a Time of Year (TOY) Clock.
The battery controller, the battery and the memory must be checked for functionality, and the
TOY may need to be initialized. The functionality of the NVRAM system is not required to begin
execution of the functional image, but as it is the only component on the Quasar's bus that isn't,
it will be tested as a BIST component.
1.1.8. Functional Image. The functional image must be moved from the program card to the required
addresses in Policy memory. The functional image will be compressed on the program card.
The image will be verified as it is moved from the card to memory. After the functional image is
placed in memory, the BIST image will clean up and begin the execution of the functional
image.
8.3.1.1 General
TE 1. (0x01) - Temperature Sensor test
test 1 - check the values of the two temp sensors and processor temperature
test 2 - check rpm values and voltages
TE 10. (0x0a) - Near PCI configuration
test 1 - configure the near PCI bus
TE 17. (0x11) - Port 4 (MP) test. See Tests used within the port tests for more details.
TE 18. (0x12) - Port 5 (FP1) test. See Tests used within the port tests for more details.
TE 19. (0x13) - Port 6 (FP2) test. See Tests used within the port tests for more details.
TE 20. (0x14) - Port 0 (DP-2A) test. See Tests used within the port tests for more details.
TE 21. (0x15) - Port 1 (DP-1A) test. See Tests used within the port tests for more details.
TE 22. (0x16) - Port 2 (DP-2B) test. See Tests used within the port tests for more details.
TE 23. (0x17) - Port 3 (DP-1B) test. See Tests used within the port tests for more details.
8.3.2.1 General
TE 2. (0x02) - HW code check and LCD setup
test 1 - Load the OCP’s internal memory via IIC
test 2 - update Battery PIC code
test 3 - update GLUE code
test 4 - update CBIC code
TE 12. (0x0c) - Port 0 test (SFP 7, DP-2A). See Tests used within port tests for more details.
TE 13. (0x0d) - Port 1 test (SFP 6, DP-1A). See Tests used within port tests for more details.
TE 14. (0x0e) - Port 2 test (SFP 5, MP2). See Tests used within port tests for more details.
TE 15. (0x0f) - Port 3 test (SFP 4, FP4). See Tests used within port tests for more details.
TE 16. (0x10) - Port 4 test (SFP 9, FP3). See Tests used within port tests for more details.
TE 17. (0x11) - Port 5 test (SFP 8, FP2). See Tests used within port tests for more details.
TE 18. (0x12) - Port 6 test (SFP 3, FP1). See Tests used within port tests for more details.
TE 19. (0x13) - Port 7 test (SFP 2, MP1). See Tests used within port tests for more details.
TE 20. (0x14) - Port 8 test (SFP 1, DP-2B). See Tests used within port tests for more details.
TE 21. (0x15) - Port 9 test (SFP 0, DP-1B). See Tests used within port tests for more details.
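Each test entry in these lists carries its number twice, once in decimal and once in hexadecimal, which is useful when a display or log shows only one of the two. The sketch below parses a line in the “TE n. (0xnn) - description” layout used above and checks that both numbers agree. The names `TeEntry` and `parse_te_line` are illustrative assumptions.

```python
import re
from typing import NamedTuple, Optional


class TeEntry(NamedTuple):
    number: int        # decimal test number, e.g. 12
    hex_code: int      # value of the hexadecimal form, e.g. 0x0c
    description: str


_TE_LINE = re.compile(r"TE\s+(\d+)\.\s*\((0x[0-9a-fA-F]+)\)\s*-\s*(.+)")


def parse_te_line(line: str) -> Optional[TeEntry]:
    """Parse one 'TE n. (0xnn) - description' line; return None for other lines."""
    match = _TE_LINE.match(line.strip())
    if match is None:
        return None
    return TeEntry(int(match.group(1)), int(match.group(2), 16), match.group(3).strip())


def is_consistent(entry: TeEntry) -> bool:
    """True when the decimal and hexadecimal test numbers denote the same value."""
    return entry.number == entry.hex_code
```

Sub-test lines such as “test 1 - ...” do not match the pattern and yield `None`, so a whole list can be passed through the parser line by line.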