Saturday, June 19, 2010

Sun Fire[TM] 12K/15K: showplatform reports "Could not get the signature for domain X"

- Problem Statement/Title: SF15K Troubleshooting Article:

showplatform reports "Could not get the signature for domain X"

- Symptoms:

% showplatform
...
Q - - Powered Off
R - - Powered Off
Could not get the signature for domain B
Could not get the signature for domain C
Could not get the signature for domain D

Resolution

- Troubleshooting:

'showplatform' examines the PCD for domain state. If the PCD
indicates the keyswitch position is ON|DIAG|SECURE, 'showplatform'
attempts to access the domain's Golden IOSRAM to determine the
domain signature. If the domain is not actually running, that access
fails and the error above is reported.

- Resolution:

The most common cause of this error is that the PCD does not
accurately reflect the true state of the platform. See "Additional
background information" for some scenarios that can result in a PCD
discrepancy.

To correct the discrepancy, perform a 'setkeyswitch standby' followed
by a 'setkeyswitch off' for all domains that report the signature
error. Answer 'y' to any queries.
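
As an illustration of the step above, here is a minimal Python sketch that
pulls the affected domain letters out of captured 'showplatform' output and
prints the matching command pairs. It assumes the output was saved to a
string on a machine that has Python; the function names are hypothetical,
and the '-d' form of 'setkeyswitch' is an assumption modeled on the
'setdefaults -d X -p' usage in this article, so check the man page before
running anything it prints.

```python
import re

# Error line wording taken from the Symptoms section of this article.
SIGNATURE_ERROR = re.compile(
    r"^Could not get the signature for domain ([A-R])\s*$", re.MULTILINE
)


def domains_with_signature_error(showplatform_output):
    """Return affected domain letters in first-seen order, without duplicates."""
    seen = []
    for domain in SIGNATURE_ERROR.findall(showplatform_output):
        if domain not in seen:
            seen.append(domain)
    return seen


def keyswitch_commands(domains):
    """Build the standby-then-off command pair for each domain.

    The '-d' option is an assumption; the commands are only printed, never run.
    """
    commands = []
    for domain in domains:
        commands.append(f"setkeyswitch -d {domain} standby")
        commands.append(f"setkeyswitch -d {domain} off")
    return commands


sample = """Q - - Powered Off
R - - Powered Off
Could not get the signature for domain B
Could not get the signature for domain C
Could not get the signature for domain D
"""

for command in keyswitch_commands(domains_with_signature_error(sample)):
    print(command)
```

Printing the commands rather than executing them keeps the operator in
control, since each 'setkeyswitch' may still prompt for confirmation.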

In cases of extreme PCD corruption, setkeyswitch operations may
not succeed. If this occurs, the PCD for the affected domain(s)
can be returned to defaults with the following procedure. Note
that this will null the board assignments and available component
lists (ACLs) for the affected domains.

- Note the configuration of affected domain(s). Include board
assignments, ACLs, etc.
- Issue 'setdefaults -d X -p', where X is the domain [A..R]. The
-p preserves the NVRAM settings.
- Reassign boards to the domain(s) with 'addboard' and set up any
ACLs with 'setupplatform'.

If the above procedure does not succeed, a restart of the SMS may be required:

- Turn OFF failover: setfailover off.

- Issue '/etc/init.d/sms stop'.

- After confirming that the SMS software is no longer running, issue
'/etc/init.d/sms start'.

- Turn ON failover: setfailover on.

- Repeat the first procedure.

- Additional background information:

Three scenarios that can lead to the above situation are:

1. Incomplete shutdown of domains prior to a platform poweroff

An 'init 0' is issued on the domain, so the domain is down, but the
PCD still records the keyswitch as ON|DIAG|SECURE. The SCs are then
shut down and the platform is powered off. When power is restored to
the platform, the PCD doesn't reflect the powered off state of the
domain.

This is different from a total power loss because SMS is shut down
gracefully. Nothing tells SMS that a power recovery is needed, so
domains the PCD lists as ON|DIAG|SECURE are taken at face value.

2. Restoration of an old/stale smsbackup file

The logic is similar to the above, but even if the domains were
properly set to OFF with setkeyswitch, a stale smsbackup file can
restore a PCD that has incorrect keyswitch states.

Also of note is that setkeyswitch operations to ON|DIAG|SECURE may
not be successful. PCD consistency is verified as part of POST and if
the PCD is inconsistent, POST does not continue as the state of the
platform is in question and further activity could interrupt running
domains. A typical POST failure indicative of PCD inconsistency is:

pcs_pcd_get_domain_info(): Golden sram for domain 7=H = IO14,
Not in active slot1 vector 00000
pcs_pcd_get_domain_info(): MAND Net for domain 7=H = IO14,
Not in active slot1 vector 00000
pcs_pcd_get_domain_info(): Golden sram for domain 8=I = IO16,
Not in active slot1 vector 00000
pcs_pcd_get_domain_info(): MAND Net for domain 8=I = IO16,
Not in active slot1 vector 00000
Exitcode = 44: Error accessing Physical Config Database
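
To see at a glance which domains and I/O boards POST is complaining about,
the failure text above can be summarized with a short parser. This is a
sketch only: it assumes the POST output has been captured to a string, the
function name is hypothetical, and the pattern is based solely on the
example lines shown in this article.

```python
import re

# Matches e.g. "pcs_pcd_get_domain_info(): Golden sram for domain 7=H = IO14,"
PCD_LINE = re.compile(
    r"pcs_pcd_get_domain_info\(\):\s+(Golden sram|MAND Net)"
    r" for domain (\d+)=([A-R]) = (IO\d+)"
)


def parse_pcd_inconsistencies(post_output):
    """Map each domain letter to the (resource, board) pairs POST flagged."""
    findings = {}
    for kind, _index, letter, board in PCD_LINE.findall(post_output):
        findings.setdefault(letter, []).append((kind, board))
    return findings


post_sample = """pcs_pcd_get_domain_info(): Golden sram for domain 7=H = IO14,
Not in active slot1 vector 00000
pcs_pcd_get_domain_info(): MAND Net for domain 7=H = IO14,
Not in active slot1 vector 00000
pcs_pcd_get_domain_info(): Golden sram for domain 8=I = IO16,
Not in active slot1 vector 00000
pcs_pcd_get_domain_info(): MAND Net for domain 8=I = IO16,
Not in active slot1 vector 00000
Exitcode = 44: Error accessing Physical Config Database
"""

for domain, items in parse_pcd_inconsistencies(post_sample).items():
    print(domain, items)
```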

3. Root filesystem on main system controller is full

If the root filesystem fills, the PCD will become unwritable, preventing SMS from making any changes.
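
Because a full root filesystem silently blocks PCD updates, a periodic
free-space check can catch the condition early. The sketch below uses
Python's standard shutil.disk_usage; it assumes Python is available on the
host being checked, and the 5% threshold and function names are
illustrative choices, not values from SMS documentation.

```python
import shutil


def is_nearly_full(free_bytes, total_bytes, min_free_fraction=0.05):
    """Return True when the free share of the filesystem is below the threshold."""
    if total_bytes <= 0:
        # Treat an unreadable or zero-size filesystem as a problem.
        return True
    return free_bytes / total_bytes < min_free_fraction


def root_fs_nearly_full(path="/", min_free_fraction=0.05):
    """Check the filesystem that holds 'path' (the root filesystem by default)."""
    usage = shutil.disk_usage(path)
    return is_nearly_full(usage.free, usage.total, min_free_fraction)


if root_fs_nearly_full():
    print("WARNING: root filesystem is nearly full; PCD writes may fail")
```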

In /var/adm/messages on SC0:

Dec 15 23:03:29 dchmc03 ufs: [ID 845546 kern.notice] NOTICE: alloc: /: file system full

In the domain messages on SC0:

Dec 17 06:55:18 2004 dchmc03 dsmd[4910]-E(): [2535 3947109096994210 NOTICE DomainsPatrol.cc 643] Reboot domain E request received, rebooting domain.
Dec 17 06:57:28 2004 dchmc03 dsmd[4910]-E(): [5304 3947239171030235 ERR SysControl.cc 1841] Domain failed by hpost: ecode=44
Dec 17 06:57:42 2004 dchmc03 dsmd[4910]-E(): [2507 3947253462410790 ERR Observers.cc 102] Failed to send DSMD_EVENT_DOMAIN_STOP 4 event to PCD, rc = 1711.

In platform messages file on SC0:

Dec 17 06:57:27 2004 dchmc03 pcd[4898]: [1711 3947237871910357 ERR DomainMgr.cc 356] Unable to write file: /var/opt/SUNWSMS/SMS1.3/.pcd/domain_info.tmp with errno = 28
Dec 17 06:57:28 2004 dchmc03 pcd[4898]: [1754 3947239235253794 ERR PCDApp.cc 2424] PCD chkpt WRITE failed. session id: 128, status: 8
Dec 17 06:57:28 2004 dchmc03 pcd[4898]: [1764 3947239236115050 ERR PCDApp.cc 1519] PCD unable to checkpoint Domain Down event sequence

Note that errno 28 is defined as:

errno 28 = #define ENOSPC 28 /* No space left on device */
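
The same lookup can be done programmatically. The following sketch extracts
the errno value from a pcd log line and translates it with Python's standard
errno module. It assumes the analysis runs on a machine whose errno
numbering matches the SC for this code (28 is ENOSPC on both Solaris and
Linux); the function name is hypothetical.

```python
import errno
import os
import re

# The pcd message in this article ends with "with errno = 28".
ERRNO_FIELD = re.compile(r"errno = (\d+)")


def explain_errno(log_line):
    """Return (code, symbolic name, description), or None if no errno is present."""
    match = ERRNO_FIELD.search(log_line)
    if match is None:
        return None
    code = int(match.group(1))
    name = errno.errorcode.get(code, "UNKNOWN")
    return code, name, os.strerror(code)


pcd_line = (
    "Dec 17 06:57:27 2004 dchmc03 pcd[4898]: [1711 3947237871910357 ERR "
    "DomainMgr.cc 356] Unable to write file: "
    "/var/opt/SUNWSMS/SMS1.3/.pcd/domain_info.tmp with errno = 28"
)

print(explain_errno(pcd_line))
```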

If failover had been properly enabled in this particular situation, the SC would have failed over to SC1 when the disk filled up.

- Keywords

15K, 12K, SF15K, SF12K, Sun Fire 15K, Enterprise, Server, Sun Fire 12K,
showplatform, pcs_pcd_get_domain_info, Could not get the signature for domain
