Documenti di Didattica
Documenti di Professioni
Documenti di Cultura
CHAPTER 1
Troubleshooting Tools
This chapter describes the diagnostics tools available to the Netra 240 server. The chapter contains the following sections:
● System Prompts
● OpenBoot Commands
● OpenBoot Diagnostics
Accessibility and
ALOM Hardware and Monitors environmental Can function on Designed for remote
access.
LEDs Hardware Indicate status of overall Accessed from system Local, but can be
available.
Power-on self-test Firmware Tests core components Runs automatically on Local, but can be
not running.
OpenBoot Firmware Display various kinds of Available when the Local, but can be
running. of ALOM.
Solaris software Software Display various kinds of Requires operating Local, but can be
of ALOM.
System Prompts
The following default server prompts are used by the Netra 240 server:
FIGURE 1-1 shows the relationship between the three prompts and how to change from one to the other.
● Escape sequence: #.
Sun Advanced Lights Out Manager (ALOM) for the Netra 240 server provides a series of LED status indicators. This
section details the meaning of their status and how to turn them on and off. For more information on ALOM, see Chapter
3.
The server has three LED status indicators. They are located on the front bezel (FIGURE 1-2) and are repeated on the rear
LED LED
Activity Green On The server is powered on and is running the Solaris OS.
Service Yellow On The server has detected a problem and requires the attention of service
Required personnel.
Locator White On A continuous light turns on and identifies the server from others in a
You can turn the Locator LED on and off either from the system console or the ALOM command-line interface (CLI).
● As superuser, type:
# /usr/sbin/locator
sc> showlocator
● As superuser, type:
# /usr/sbin/locator -n
sc> setlocator on
● As superuser, type:
# /usr/sbin/locator -f
The dry contact alarm card has four LED status indicators that are supported by ALOM. They are located vertically on
the front bezel (FIGURE 1-2). Information about the alarm indicators and dry contact alarm states is provided in TABLE
1-3. For more information about alarm indicators, see the Sun Advanced Lights Out Manager Software User's Guide for
the Netra 240 Server (part number 817-3174). For more information about an API to control the alarm indicators, see
Appendix A.
Relay Relay
Indicator
Application System Alarm [1] [2]
and Relay NC NO
Indicator or Server Condition Indicator Indicator
Critical Red Server state No power Off Off Closed Open Default
not
on; Solaris
OS not fully
loaded.
successfully operating
loaded. state.
Solaris OS.
shutdown state.
initiated by
[4]
user .
power. state.
power state.
shutdown
initiated by
user.
Critical cleared.
alarm offii.
offii.
offii.
offii.
In all cases when the user sets an alarm, a message is displayed on the console. For example, when the critical alarm is
Note that in some instances when the critical alarm is set, the associated alarm indicator is not lit. This implementation is
verifies the core functionality of the system, including the CPU module(s), motherboard, memory, and some on-board I/O
devices. The software then generates messages that can be useful in determining the nature of a hardware failure. You can
POST detects most system faults and is located in the motherboard OpenBoot PROM. You can program the OpenBoot
software to run POST at power-on by setting two environment variables: the diag-switch? and the diag-level
flag. These two variables are stored on the system configuration card (SCC).
Note - The SCC contains information about the system's identity, including the Host ID, MAC address and NVRAM
settings. To increase the speed by which a system can be brought back online, your Sun Service representative might
transfer the SCC and the drive on which it resides to another server, enabling it to inherit the old server's identity without
being reconfigured. This procedure should only be done by trained Sun personnel.
POST runs automatically when the system power is applied, or following an automatic system reset, if all of the
If diag-level is set to min or max, POST performs an abbreviated or extended test, respectively.
If diag-level is set to menus, a menu of all the tests executed at power up is displayed.
You control POST diagnostics (and other aspects of the boot process) by setting OpenBoot configuration variables.
Changes to OpenBoot configuration variables take effect only after the system is restarted. TABLE 1-4 lists the most
important and useful of these variables. You can find instructions for changing OpenBoot configuration variables in To
OpenBoot Configuration
auto-boot Determines whether the operating system automatically starts up. Default is
true.
finished running.
● off--No testing.
diag-script Determines which devices are tested by OpenBoot diagnostics. Default is none
tested.
diag-switch? Toggles the system in and out of diagnostic mode. Default is false.
post-trigger These two variables specify the class of reset event that causes power-on self-
tests (or OpenBoot diagnostics tests) to run. These variables can accept single
[6]
obdiag-trigger
keywords as well as combinations of the first three keywords separated by
spaces. For details, see To View and Set OpenBoot Configuration Variables.
problem corrupts system state data. Examples include CPU and system
watchdog resets, fatal errors, and certain CPU reset events (default).
(default).
output-device Selects where diagnostic and other console output is displayed. Default is ttya
Note - These variables affect OpenBoot diagnostics tests as well as POST diagnostics.
Once POST diagnostics have finished running, POST reports back the status of each test that was run to the OpenBoot
If POST diagnostics do not uncover a fault, and your server still does not start up, run OpenBoot diagnostics tests.
1. Go to the ok prompt.
2. Type:
3. Type:
Where value is min, max, or menus, depending on the quantity of diagnostic information you want to see.
4. Type:
ok reset-all
The system runs POST diagnostics if post-trigger is set to user-reset. Status and error messages are
displayed in the console window. If POST detects an error, it displays an error message describing the failure.
5. When you have finished running POST, restore the value of diag-switch? to false by typing:
OpenBoot Commands
OpenBoot commands are commands you type from the ok prompt. OpenBoot commands that can provide useful
● probe-ide
● show-devs
The probe-scsi and probe-scsi-all commands diagnose problems with the SCSI devices.
Caution - If you used the haltcommand or the Stop-A key sequence to reach the okprompt, issuing the
probe-scsior probe-scsi-allcommand can hang the system.
The probe-scsi command communicates with all SCSI devices connected to on-board SCSI controllers. The probe-
scsi-all command also accesses devices connected to any host adapters installed in PCI slots.
For any SCSI device that is connected and active, the probe-scsi and probe-scsi-all commands display its loop
ID, host adapter, logical unit number, unique world-wide name (WWN), and a device description that includes type and
manufacturer.
{1} ok probe-scsi
Target 0
Target 1
{1} ok
{1} ok probe-scsi-all
/pci@1c,600000/scsi@2,1
/pci@1c,600000/scsi@2
Target 0
Target 1
{1} ok
probe-ide Command
The probe-ide command communicates with all Integrated Drive Electronics (IDE) devices connected to the IDE bus.
This is the internal system bus for media devices such as the DVD drive.
Caution - If you used the haltcommand or the Stop-A key sequence to reach the okprompt, issuing the
probe-idecommand can hang the system.
{1} ok probe-ide
Not Present
Not Present
Device 2 ( Secondary
Master )
Not Present
Not Present
{1} ok
show-devs Command
The show-devs command lists the hardware device paths for each device in the firmware device tree. CODE
Output
/pci@1d,700000
/pci@1c,600000
/pci@1e,600000
/pci@1f,700000
/memory-controller@1,0
/SUNW,UltraSPARC-IIIi@1,0
/memory-controller@0,0
/SUNW,UltraSPARC-IIIi@0,0
/virtual-memory
/memory@m0,0
/aliases
/options
/openprom
/chosen
/packages
/pci@1d,700000/network@2,1
/pci@1d,700000/network@2
/pci@1c,600000/scsi@2,1
/pci@1c,600000/scsi@2
/pci@1c,600000/scsi@2,1/tape
/pci@1c,600000/scsi@2,1/disk
/pci@1c,600000/scsi@2/tape
/pci@1c,600000/scsi@2/disk
/pci@1e,600000/ide@d
/pci@1e,600000/usb@a
/pci@1e,600000/pmu@6
/pci@1e,600000/isa@7
/pci@1e,600000/ide@d/cdrom
/pci@1e,600000/ide@d/disk.........
OpenBoot Diagnostics
Like POST diagnostics, OpenBoot diagnostics code is firmware-based and resides in the Boot PROM.
1. Type:
ok reset-all
2. Type:
ok obdiag
ok obdiag
_____________________________________________________________________________
| o b d i a g |
|_________________________ __________________________________________________|
| | | |
| 13 serial@0,3f8 | | |
|_________________________|_________________________|________________________|
|____________________________________________________________________________|
Note - If you have a PCI card installed inside the server, additional tests appear on the obdiag menu.
3. Type:
obdiag> test n
Where n represents the number corresponding to the test you want to run.
obdiag> help
Most of the OpenBoot configuration variables you use to control POST (see TABLE 1-4) also affect OpenBoot
diagnostics tests.
● Use the diag-level variable to control the OpenBoot diagnostics testing level.
By default, test-args is set to contain an empty string. You can modify test-args using one or more of the
Keyword Description
restore Attempts to restore original state of the device if the previous execution of the test
failed.
silent Displays only errors rather than the status of each test.
n
callers=0--Displays backtrace of all callers before the error.
If you want to customize the OpenBoot diagnostics testing, you can set test-args to a comma-separated list of
You can also run OpenBoot diagnostics tests directly from the ok prompt. To do this, type the test command, followed
by the full hardware path of the device (or set of devices) to be tested. For example:
ok test /pci@x,y/SUNW,qlc@2
ok test /usb@1,3:test-args={verbose,debug}
This syntax affects only the current test without changing the value of the
You can test all the devices in the device tree with the test-all command:
ok test-all
CODE EXAMPLE 1-5 displays a sample OpenBoot diagnostics test report where all tests have passed.
If you specify a path argument to test-all, only the specified device and its children are tested. The following
example shows the command to test the USB bus and all devices with self-tests that are connected to the USB bus:
ok test-all /pci@9,700000/usb@1,3
OpenBoot diagnostics error results are reported in a tabular format that contains a short summary of the problem, the
hardware device affected, the subtest that failed, and other diagnostic information. CODE EXAMPLE 1-6 displays a
Testing /pci@1e,600000/isa@7/flashprom@2,0
DEVICE : /pci@1e,600000/isa@7/flashprom@2,0
SUBTEST : selftest:crc-subtest
SERIAL# : 52965531
failed
multiuser mode, you have access to the software-based diagnostic tools and the SunVTS software. These tools enable you
Note - If you set the auto-boot? OpenBoot configuration variable to false, the operating system does not boot
following completion of the firmware-based tests.
In addition to the tools just mentioned, you can refer to error and system message log files and to Solaris software
information commands.
Error and other system messages are saved in the /var/adm/messages file. Messages are logged to this file from
many sources, including the operating system, the environmental control subsystem, and various software applications.
The following Solaris software system information commands display data that you can use when assessing the condition
● prtconf
● prtdiag
● prtfru
● psrinfo
● showrev
This section describes the information that these commands give you. For more information about using these commands,
prtconf Command
The prtconf command displays the Solaris software device tree. This tree includes all the devices probed by OpenBoot
firmware, as well as additional devices, such as individual disks that only the operating system software recognizes. The
output of prtconf also includes the total size of system memory. CODE EXAMPLE 1-7 shows an excerpt of prtconf
output.
# prtconf
SUNW,Netra-240
options, instance #0
memory-controller, instance #0
memory-controller, instance #1
The prtconf command -p option produces output similar to that of the OpenBoot
show-devs command. This output lists only those devices compiled by the system firmware.
prtdiag Command
The prtdiag command displays a table of diagnostic information that summarizes the status of system components.
The display format used by the prtdiag command can vary depending on what version of the Solaris OS is running on
your system. The following code example is an excerpt of some of the output produced by prtdiag on a functional
# prtdiag
Bus Freq
0 pci 66 2 network-pci14e4,1648.108e.16+
0 pci 66 2 network-pci14e4,1648.108e.16+
0 pci 66 2 scsi-pci1000,21.1000.1000.1 +
0 pci 66 2 scsi-pci1000,21.1000.1000.1 +
0 pci 66 2 network-pci14e4,1648.108e.16+
0 pci 66 2 network-pci14e4,1648.108e.16+
Segment Table:
-----------------------------------------------------------------------
-----------------------------------------------------------------------
--------------------------------------------------
--------------------------------------------------
0 0 MB/P0/B0/D0,MB/P0/B0/D1
--------------------------------------------------
--------------------------------------------------
1 0 MB/P1/B0/D0,MB/P1/B0/D1
In addition to the information in CODE EXAMPLE 1-8, prtdiag with the verbose option (-v) also reports on front
panel status, disk status, fan status, power supplies, hardware revisions, and system temperatures (see CODE EXAMPLE
1-9).
---------------------------------------------------------------
---------------------------------------------------------------
In the event of an overtemperature condition, prtdiag reports an error in the Status column (CODE EXAMPLE 1-10).
---------------------------------------------------------------
---------------------------------------------------------------
Similarly, if a particular component fails, prtdiag reports a fault in the appropriate status column (CODE EXAMPLE
1-11).
Fan Speeds:
-----------------------------------------
-----------------------------------------
MB/P0/F0 RS failed 0
rpm
F2 RS okay 2896
rpm
F3 RS okay 2576
rpm
---------------------------------
prtfru Command
The Netra 240 server maintains a hierarchical list of all field-replaceable units (FRUs) in the system, as well as specific
The prtfru command can display this hierarchical list, as well as data contained in the serial electrically-erasable
programmable read-only memory (SEEPROM) devices located on many FRUs. CODE EXAMPLE 1-12 shows an
excerpt of a hierarchical list of FRUs generated by the prtfru command with the -l option.
# prtfru -l
/frutree
/frutree/chassis (fru)
/frutree/chassis/MB?Label=MB
/frutree/chassis/MB?Label=MB/system-board (container)
/frutree/chassis/MB?Label=MB/system-board/SC?Label=SC
/frutree/chassis/MB?Label=MB/system-board/SC?Label=SC/sc (fru)
/frutree/chassis/MB?Label=MB/system-board/BAT?Label=BAT
/frutree/chassis/MB?Label=MB/system-board/BAT?Label=BAT/battery (fru)
/frutree/chassis/MB?Label=MB/system-board/P0?Label=P0
/frutree/chassis/MB?Label=MB/system-board/P0?Label=P0/cpu (fru)
/frutree/chassis/MB?Label=MB/system-board/P0?Label=P0/cpu/F0?Label=F0
/frutree/chassis/MB?Label=MB/system-board/P0?Label=P0/cpu/F0?Label=F0/fan-
unit
(fru)
/frutree/chassis/MB?Label=MB/system-board/P0?Label=P0/cpu/F1?Label=F1
/frutree/chassis/MB?Label=MB/system-board/P0?Label=P0/cpu/F1?Label=F1/fan-
unit
(fru)........
CODE EXAMPLE 1-13 shows an excerpt of SEEPROM data generated by the prtfru command with the -c option.
This output displays only the containers and their data and does not print the FRU tree hierarchy.
# prtfru -c
/frutree/chassis/MB?Label=MB/system-board (container)
SEGMENT: SD
/ManR
/ManR/Fru_Description: FRUID,INSTR,M'BD,2X1.28GHZ,CPU
/ManR/Manufacture_Loc: Hsinchu,Taiwan
/ManR/Sun_Part_No: 3753120
/ManR/Sun_Serial_No: 000615
/ManR/Initial_HW_Dash_Level: 02
/ManR/Initial_HW_Rev_Level: 0E
/ManR/Fru_Shortname: MOTHERBOARD
/SpecPartNo: 885-0076-11
/frutree/chassis/MB?Label=MB/system-board/P0?Label=P0/cpu/B0?Label=B0/bank/D0?La
bel=D0/mem-module (container)
/frutree/chassis/MB?Label=MB/system-board/P0?Label=P0/cpu/B0?Label=B0/bank/D1?La
bel=D1/mem-module (container)........
Data displayed by the prtfru command varies depending on the type of FRU. In general, it includes the following:
● FRU description
psrinfo Command
The psrinfo command displays the date and time that each CPU is introduced online. With the verbose (-v) option,
the command displays additional information about the CPUs, including their clock speed. CODE EXAMPLE 1-14
shows sample output from the psrinfo command with the -v option.
# psrinfo -v
showrev Command
The showrev command displays revision information for the current hardware and software. CODE EXAMPLE 1-15
# showrev
Hostname: vsp78-36
Hostid: 8328c87b
Release: 5.8
Domain: vsplab.SFBay.Sun.COM
When used with the -p option, the showrev command displays installed patches. CODE EXAMPLE 1-16 shows a
partial sample output from the showrev command with the -p option.
At a command prompt, type the command for the kind of system information you want to
display.
For more information, see Solaris Software System Information Commands. See TABLE 1-6 for a summary of the
commands.
prtfru FRU hierarchy and SEEPROM /usr/sbin/prtfru Use the -l option to display
psrinfo Date and time each CPU came /usr/sbin/psrinfo Use the -v option to obtain
showrev Hardware and software revision /usr/bin/showrev Use the -p option to show
1. Go to the ok prompt.
ok show-post-results
This command produces a system-dependent list of hardware components, along with an indication of which
OpenBoot diagnostics tests are performed. This section explains how to access and modify OpenBoot configuration
variables. For a list of important OpenBoot configuration variables, see TABLE 1-4.
● To display the current values of all OpenBoot configuration variables, use the printenv command.
ok printenv
● To set or change the value of an OpenBoot configuration variable, use the setenv command:
diag-level = max
● To set OpenBoot configuration variables that accept multiple keywords, separate keywords with a space.
The watch-net diagnostics test monitors Ethernet packets on the primary network interface. The watch-net-all
diagnostics test monitors Ethernet packets on the primary network interface and on any additional network interfaces
connected to the system board. Good packets received by the system are indicated by a period (.). Errors such as the
framing error and the cyclic redundancy check (CRC) error are indicated with an X and an associated error description.
To start the watch-net diagnostic test, type the watch-net command at the ok prompt (CODE
EXAMPLE 1-17).
Message
{0} ok watch-net
Link is -- up
To start the watch-net-all diagnostic test, type watch-net-all at the ok prompt (CODE
EXAMPLE 1-18).
Message
{0} ok watch-net-all
/pci@1f,0/pci@1,1/network@c,1
Link is -- up
Note - Automatic System Recovery (ASR) is not the same as Automatic Server Restart, which the Netra 240 server
also supports. For information about Automatic Server Restart, see Chapter 3.
Automatic System Recovery (ASR) consists of self-test features and an auto-configuring capability to detect failed
hardware components and unconfigure them. By enabling ASR, the server is able to resume operating after certain
If a component is monitored by ASR and the server is capable of operating without it, the server automatically reboots if
that component develops a fault or fails. This capability prevents a faulty hardware component from preventing the entire
If a fault is detected during the power-on sequence, the faulty component is disabled. If the system remains capable of
To support this degraded boot capability, the OpenBoot firmware uses the 1275 Client Interface (by means of the device
tree) to mark a device as either failed or disabled, by creating an appropriate status property in the device tree node. The
Solaris OS does not activate a driver for any subsystem marked in this way.
As long as a failed component is electrically dormant (not causing random bus errors or signal noise, for example), the
system reboots automatically and resumes operation while a service call is made.
Once a failed or disabled device is replaced with a new one, the OpenBoot firmware automatically modifies the status of
Note - ASR is not enabled until you activate it (see To Enable ASR).
Auto-Boot Options
The auto-boot? setting controls whether the firmware automatically boots the operating system after each reset. The
The auto-boot-on-error? setting controls whether the system attempts a degraded boot when a subsystem failure
is detected. Both the auto-boot? and auto-boot-on-error? settings must be set to true to enable an automatic
degraded boot.
Note - The default setting for auto-boot-on-error? is false. Therefore, the system does not attempt a
degraded boot unless you change this setting to true. In addition, the system does not attempt a degraded boot in
non-recoverable error, even if degraded booting is enabled. For examples of fatal non-recoverable errors, see Error-
Handling Summary.
Error-Handling Summary
Error handling during the power-on sequence can be summarized in the following three ways:
● If no errors are detected by POST or OpenBoot diagnostics, the system attempts to boot if auto-boot? is true
● If only nonfatal errors are detected by POST or OpenBoot diagnostics, the system attempts to boot if auto-
Note - If POST or OpenBoot diagnostics detects a nonfatal error associated with the normal boot device, the
OpenBoot firmware automatically unconfigures the failed device and tries the next-in-line boot device, as specified by
● If a fatal error is detected by POST or OpenBoot diagnostics, the system does not boot regardless of the settings of
Reset Scenarios
The standard system reset protocol bypasses POST and OpenBoot diagnostics unless diag-switch? is set to true.
The default setting for this variable is false. Because ASR relies on firmware diagnostics to detect faulty devices,
diag-switch? must be set to true for ASR to run. For instructions, see To Enable ASR.
To control which reset events, if any, automatically initiate firmware diagnostics, use obdiag-trigger and post-
trigger. For detailed explanations of these variables and their uses, see Controlling POST Diagnostics and Controlling
To Enable ASR
3. Type:
ok reset-all
The system permanently stores the parameter changes and boots automatically if the OpenBoot variable auto-boot? is
Note - To store parameter changes, you can also power-cycle the system by using the front panel On/Standby button.
To Disable ASR
2. Type:
ok reset-all
Note - To store parameter changes, you can also power-cycle the system by using the front panel On/Standby button.
1 (TableFootnote) NC state is the normally closed state. This state represents the default mode of the relay contacts in the normally
closed state.
2 (TableFootnote) NO state is the normally open state. This state represents the default mode of the relay contacts in the normally open
state.This table contains details on the alarm indicators and the dry contact alarm states. It has sixteen rows and nine columns.
4 (TableFootnote) The user can shut down the system using commands such as init0 and init6. This does not include the system
power shutdown.
5 (TableFootnote) Based on a determination of the fault conditions, the user can turn the alarm on using the Solaris platform alarm API
or ALOM CLI. For more information about the alarm API see Appendix A, and for more information about the ALOM CLI, refer to the
Sun Advanced Lights Out Manager Software User's Guide for the Netra 240 Server (part number 817-3174).
6 (TableFootnote) The post-trigger and obdiag-trigger variables are obsolete in releases of OBP after 4.16.2.
7 (TableFootnote) POST messages cannot be displayed on a graphics terminal. They are sent to ttya even when output-device
is set to screen.