Tuesday, January 12, 2016

Solaris 11: How to Verify Which SRU a System Is Running

The following text shows how to identify which SRU is installed on a system running Oracle Solaris 11.1. There are also examples for:
  • An Oracle Solaris 11.1 system without an SRU -> Example 1
  • An Oracle Solaris 11 11/11 system with an SRU -> Example 2
  • An Oracle Solaris 11 11/11 system without an SRU -> Example 3
Display the information about pkg:/entire.
In this example, Oracle Solaris 11.1 is installed with SRU 18.4. The SRU number is the 18.4 that appears as the third and fourth numbers after Oracle Solaris 11.1 in the Summary and Version lines.
pkg info entire
Name: entire
Summary: entire incorporation including Support Repository Update (Oracle Solaris 11.1.18.4.0).
Description: This package constrains system package versions to the same build.
WARNING: Proper system update and correct package selection depend on the presence of this incorporation. Removing this package will result in an unsupported system. For more information see https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&doctype=REFERENCE&id=1501435.1.
Category: Meta Packages/Incorporations
State: Installed
Publisher: solaris
Version: 0.5.11 (Oracle Solaris 11.1.18.4.0)
Build Release: 5.11
Branch: 0.175.1.18.0.4.0
Packaging Date: April 1, 2014 07:58:03 PM 
Size: 5.46 kB
FMRI: pkg://solaris/entire@0.5.11,5.11-0.175.1.18.0.4.0:20140401T195803Z
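The SRU number can also be pulled straight out of the Branch string. A minimal sketch, assuming the branch follows the 0.175.&lt;release&gt;.&lt;sru&gt;.0.&lt;respin&gt;.0 pattern shown above (fields 4 and 6 are the SRU number and its respin):

```shell
# Extract "18.4" from the Branch value 0.175.1.18.0.4.0:
# field 4 is the SRU number and field 6 the SRU respin.
branch="0.175.1.18.0.4.0"
sru=$(echo "$branch" | awk -F. '{print $4 "." $6}')
echo "SRU $sru"
```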
Example 1: Package Information Displayed for Oracle Solaris 11.1 Without SRUs
If your system has Oracle Solaris 11.1 installed without any SRUs, the package information looks like the following. Note that the entries after the Build Release line differ from the 11/11 version.
pkg info entire
Name: entire
Summary: Incorporation to lock all system packages to the same build
Description: This package constrains system package versions to the same build.  WARNING: Proper system update and correct package selection depend on the presence of this incorporation.
Removing this package will result in an unsupported system.
Category: Meta Packages/Incorporations
State: Installed
Publisher: solaris
Version: 0.5.11
Build Release: 5.11
Branch: 0.175.1.0.0.24.2
Packaging Date: Wed Sep 19 19:01:35 2012
Size: 5.46 kB
FMRI: pkg://solaris/entire@0.5.11,5.11-0.175.1.0.0.24.2:20120919T190135Z

Example 2: Package Information Displayed for Oracle Solaris 11 11/11 With an SRU
If your system has Oracle Solaris 11 11/11 installed with an SRU, the package information looks like the following. In this example SRU 10.5 is installed. Note that the entries after the Build Release line are different because of the SRU.
pkg info entire
Name: entire
Summary: entire incorporation including Support Repository Update (Oracle Solaris 11 11/11 SRU 10.5).
Description: This package constrains system package versions to the same build.
WARNING: Proper system update and correct package selection depend on the presence of this incorporation. Removing this package will result in an unsupported system. For more information see https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&doctype=REFERENCE&id=1372094.1.
Category: Meta Packages/Incorporations
State: Installed
Publisher: solaris
Version: 0.5.11 (Oracle Solaris 11 SRU 10.5)
Build Release: 5.11
Branch: 0.175.0.10.0.5.0
Packaging Date: Fri Aug 03 18:26:27 2012
Size: 5.45 kB
FMRI: pkg://solaris/entire@0.5.11,5.11-0.175.0.10.0.5.0:20120803T182627Z
Example 3: Package Information Displayed for Oracle Solaris 11 11/11 Without SRUs
If your system has Oracle Solaris 11 11/11 installed without any SRUs, the package information looks like:
pkg info entire
Name: entire
Summary: Incorporation to lock all system packages to the same build
Description: This package constrains system package versions to the same build.  WARNING: Proper system update and correct package selection depend on the presence of this incorporation.
Removing this package will result in an unsupported system.
Category: Meta Packages/Incorporations
State: Installed
Publisher: solaris
Version: 0.5.11
Build Release: 5.11
Branch: 0.175.0.0.0.2.0
Packaging Date: October 20, 2011 02:38:22 PM
Size: 5.45 kB
FMRI: pkg://solaris/entire@0.5.11,5.11-0.175.0.0.0.2.0:20111020T143822Z

Solaris Cluster : pm_tick delay of xxxx ms exceeds 2147 ms message is continuously logged in /var/adm/messages


SYMPTOMS
The message "pm_tick delay of xxxx ms exceeds 2147 ms" has been continuously written to the file /var/adm/messages for several days. The delay reported (xxxx here) can vary.
The message is printed thousands of times and the messages files keep growing, but there is no other impact or symptom on the cluster nodes.
The key symptom is that the same message is repeated over time while the reported delays stay bounded and almost constant, rather than increasing.
In other words, the delays are not getting worse.
For example, checking in /var/adm/messages you could see:
grep pm_tick messages |more
Dec 27 01:07:38 node1 genunix: [ID 313806 kern.notice] NOTICE: pm_tick delay of 3589 ms exceeds 2147 ms <==== 1st delay logged
Dec 27 01:07:38 node1 genunix: [ID 313806 kern.notice] NOTICE: pm_tick delay of 3549 ms exceeds 2147 ms
............ snip.....
This message was printed many times and the delay is now stable:
/explorer.12345678.node1-2016.01.04.12.56/messages/--> grep -c pm_tick messages
3112523
/explorer.12345678.node1-2016.01.04.12.56/messages/--> grep pm_tick messages |awk '{print $13}' | sort -u
3549 <========= this is the lowest delay
3550
3551
3589 <========= this is the highest delay
/explorer.12345678.node1-2016.01.04.12.56/messages/--> grep pm_tick messages |awk '{print $13}' | grep -c 3550
1555990
/explorer.12345678.node1-2016.01.04.12.56/messages/--> grep pm_tick messages |awk '{print $13}' | grep -c 3549
1556529
It is clear that the delay is not getting worse: the majority of the messages report delays of 3549 and 3550 ms, and the delay range is bounded.
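The min/max check above can be scripted. A minimal sketch, assuming the syslog column layout shown (the delay is field 13); the sample lines here are copies of the messages above, and on a live system you would pipe in /var/adm/messages instead:

```shell
sample='Dec 27 01:07:38 node1 genunix: [ID 313806 kern.notice] NOTICE: pm_tick delay of 3589 ms exceeds 2147 ms
Dec 27 01:07:38 node1 genunix: [ID 313806 kern.notice] NOTICE: pm_tick delay of 3549 ms exceeds 2147 ms'
# Extract the delay column, sort numerically, and report the bounds:
# a narrow, stable range suggests the benign stuck-counter case.
range=$(printf '%s\n' "$sample" | awk '/pm_tick delay/ {print $13}' \
        | sort -n | awk 'NR==1{min=$1} {max=$1} END{print min"-"max}')
echo "delay range: $range ms"
```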

CAUSE
The pm_tick function mentioned in the message is used by the cluster framework to perform sanity checks, ensuring that the node itself is not hung and that cluster heartbeats are sent in a timely fashion.
Repeated delays in the execution of pm_tick are considered a warning of a potential hang or a serious performance problem.
The reported message is printed when pm_tick is delayed by more than 2147 milliseconds.
If the delays grew over time and eventually went beyond 5050 milliseconds, the node would panic with the message:
898738 Aborting node because pm_tick delay of %lld ms exceeds %lld ms
Description: The system is unable to send heartbeats for a long time. (This is half of the minimum of the timeout values of all the paths. If the timeout value for all the paths is 10 secs, then this value is 5 secs.) There is probably heavy interrupt activity causing the clock thread to get delayed, which in turn causes irregular heartbeats. The node is aborted because it is considered to be in a 'sick' condition and it is better to abort this node instead of causing other nodes (or the cluster) to go down.
Solution: Check to see what is causing the high interrupt activity and configure the system accordingly.
These figures are calculated from the heartbeat_timeout value, which is 10 seconds by default. You can check that value by running:
cluster show -t global |grep heartbeat_timeout
heartbeat_timeout: 10000
In the particular scenario described in the Symptoms section, the delays reported are bounded and not growing over time; in this specific scenario the messages are not the symptom of an ongoing problem.
In such a case, the reason for the messages is a one-shot actual delay that was large enough to trigger the first warning but left the pm_tick counters in an incorrect state that was never reset.
As a result, further messages are printed as if a constant delay were detected at every subsequent pm_tick execution, even though there are no more actual delays.

SOLUTION
The problem can be fixed by disabling and then re-enabling path monitoring on the fly (despite the name, this also works on a single-node cluster):
# /usr/cluster/lib/sc/cmm_ctl -d
Then, after a couple of minutes:
# /usr/cluster/lib/sc/cmm_ctl -e
This resets the internal counters used by pm_tick, zeroing out the historical delays that were carried over incorrectly.

Solaris 11: Recover "Lost" Root Password

Last week I encountered an issue where we were not able to switch to the root user (su - root) because someone had unfortunately changed the root password. Our goal is to reset the root password on a Solaris 11 server. NOTE: One restriction in Solaris 11 is that root cannot log in directly through the console, as it is considered a role account. You have to log in as a normal user and then switch to the root account.
High Level Plan:
  1. Boot the server from a bootable DVD (here I'm using an x86 server).
  2. Import the root pool.
  3. Mount the root pool dataset.
  4. Edit the shadow file.
  5. Reboot the server.
Boot the server from the bootable DVD and select the 3rd option (shell); this gets you into a shell prompt.
Identify the root pool. Most of the time rpool is the default root pool, but you can verify with the command below, which shows the details of all available pools.
zpool import |grep -i pool:
pool:rpool
pool:unixrepo
#
Import the root pool (in the example below, rpool is our root pool).
zpool import -f -R /tmp/rpool rpool
#
Set the root pool dataset's mountpoint to legacy
zfs set mountpoint=legacy rpool/ROOT/solaris
#
Mounting rpool dataset on /mnt
# mount -F zfs rpool/ROOT/solaris /mnt
#
Back up the files and modify the root password entry
cp /mnt/etc/shadow /mnt/etc/shadow_backup
cp /mnt/etc/passwd /mnt/etc/passwd_backup
# TERM=vt100;export TERM
# EDITOR=vi;export EDITOR
Remove the encrypted password entry for root
vi /mnt/etc/shadow
root:$5$hzZlNEsJ$iuZHgU4xIQ72LfnmyPj7jzsyn.kbwdl0Ig4lbztnEQ4:16273::::::13888
After removing the encrypted password, the line should look like the following:
grep ^root /mnt/etc/shadow
root::16273::::::13888
#
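Instead of editing the file in vi, the password field can also be blanked non-interactively. A hedged sketch using awk on a scratch copy (shadow.sample stands in for /mnt/etc/shadow; the hash below is a dummy):

```shell
# Create a scratch copy standing in for /mnt/etc/shadow (dummy hash).
printf 'root:$5$dummyhash:16273::::::13888\n' > shadow.sample
# Blank field 2 (the encrypted password) of the root entry only.
awk -F: 'BEGIN{OFS=":"} $1=="root"{$2=""} {print}' shadow.sample > shadow.new
cat shadow.new
```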
Now unmount the dataset, set the mountpoint back, export the pool, and halt the server so it can be rebooted in single-user mode.
umount /mnt
zfs set mountpoint=/ rpool/ROOT/solaris
zpool export rpool
halt

For x86: Boot from the hard disk in single-user mode by editing the GRUB menu: type "e", find the line that starts with $multiboot /ROOT, append -s to the end of that line, and let the system boot into single-user mode. Here we can change our root password and reboot the server.
For SPARC: From the OK prompt, just enter boot -s to enter single-user mode.
COOL...You have successfully recovered the root password on Solaris 11.

Thursday, March 19, 2015

Solaris 11: Resolve ZFS Device faults/alerts using fmadm


Procedure:
  1. Identify the faulted device with the fmadm faulty command.
  2. Replace the faulty or retired device, or clear the device error.
  3. Clear the FMA fault.
  4. Confirm that the fault is cleared.
1. Identify the faulted device with the fmadm faulty command. For example:
# fmadm faulty
--------------- ------------------------------------ ---------- --------
TIME            EVENT-ID                             MSG-ID     SEVERITY
--------------- ------------------------------------ ---------- --------
Jun 20 16:30:52 55c82fff-b709-62f5-b66e-b4e1bbe9dcb1 ZFS-8000-LR Major

Problem Status : solved
Diag Engine : zfs-diagnosis / 1.0
System Manufacturer : unknown
Name : ORCL,SPARC-T3-4
Part_Number : unknown
Serial_Number : 1120BDRCCD
Host_ID : 84a02d28

----------------------------------------
Suspect 1 of 1 :
Fault class : fault.fs.zfs.open_failed
Certainty : 100%
Affects : zfs://pool=86124fa573cad84e/vdev=25d36cd46e0a7f49/
pool_name=pond/vdev_name=id1,sd@n5000c500335dc60f/a
Status : faulted and taken out of service

FRU Name : "zfs://pool=86124fa573cad84e/vdev=25d36cd46e0a7f49/
pool_name=pond/vdev_name=id1,sd@n5000c500335dc60f/a"
Status : faulty

Description : ZFS device 'id1,sd@n5000c500335dc60f/a' in pool 'pond' failed to open.

Response : An attempt will be made to activate a hot spare if available.

Impact : Fault tolerance of the pool may be compromised.

Action : Use 'fmadm faulty' to provide a more detailed view of this event.
Run 'zpool status -lx' for more information. Please refer to the associated reference document at http://support.oracle.com/msg/ZFS-8000-LR for the latest service procedures and policies regarding this diagnosis.


2. Replace the faulty or retired device, or clear the device error.
If an intermittent device error occurred but the device was not replaced, you can attempt to clear the previous error. For example:
# zpool clear pond c0t5000C500335DC60Fd0

 

3. Clear the FMA fault. For example:
# fmadm repaired zfs://pool=86124fa573cad84e/vdev=25d36cd46e0a7f49/pool_name=pond/vdev_name=id1,sd@n5000c500335dc60f/a

fmadm: recorded repair to of zfs://pool=86124fa573cad84e/vdev=25d36cd46e0a7f49/pool_name=pond/vdev_name=id1,sd@n5000c500335dc60f/a
4. Confirm that the fault is cleared.
# fmadm faulty
If the error is cleared, the fmadm faulty command returns nothing.
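When scripting this cleanup, the event UUID from the fmadm faulty header line can be captured for use with fmadm repaired. A minimal sketch, run here against a copy of the header line from the output above rather than against live fmadm output:

```shell
# Sample header line captured from `fmadm faulty` output above.
line='Jun 20 16:30:52 55c82fff-b709-62f5-b66e-b4e1bbe9dcb1 ZFS-8000-LR Major'
# The event UUID is the 4th whitespace-separated field.
uuid=$(echo "$line" | awk '{print $4}')
echo "$uuid"
```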

Solaris 11: Administration of SCSI devices using cfgadm


Below are the operations that can be performed on a SCSI device
  • Connect a SCSI Controller
  • Add a SCSI Device to a SCSI Bus
  • Replace a SCSI Disk on a SCSI Controller 
  • Remove a SCSI Device 
Connect a SCSI Controller

Step 1: Verify that the device is disconnected before you connect it.
# cfgadm -al
Ap_Id Type Receptacle Occupant Condition
c2 unavailable disconnected configured unknown
c2::dsk/c2t0d0 unavailable disconnected configured unknown
c3 scsi-sas connected configured unknown
c3::dsk/c3t0d0 disk connected configured unknown
c3::dsk/c3t1d0 disk connected configured unknown
c3::dsk/c3t2d0 disk connected configured unknown
c3::dsk/c3t3d0 disk connected configured unknown

Step 2: Connect the SCSI controller.
# cfgadm -c connect c2

Step 3: Verify that the SCSI controller is connected.
# cfgadm -al
Ap_Id Type Receptacle Occupant Condition
c2 scsi-bus connected configured unknown
c2::dsk/c2t0d0 CD-ROM connected configured unknown
c3 scsi-sas connected configured unknown
c3::dsk/c3t0d0 disk connected configured unknown
c3::dsk/c3t1d0 disk connected configured unknown
c3::dsk/c3t2d0 disk connected configured unknown
c3::dsk/c3t3d0 disk connected configured unknown
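The cfgadm -al listings above are easy to filter with awk when you only want the disk attachment points. A sketch against a captured sample of the output (hypothetical here; on a live system, pipe cfgadm -al directly):

```shell
sample='Ap_Id Type Receptacle Occupant Condition
c2 scsi-bus connected configured unknown
c2::dsk/c2t0d0 CD-ROM connected configured unknown
c3 scsi-sas connected configured unknown
c3::dsk/c3t0d0 disk connected configured unknown
c3::dsk/c3t1d0 disk connected configured unknown'
# Print the Ap_Id of every attachment point whose Type is "disk".
printf '%s\n' "$sample" | awk '$2=="disk" {print $1}'
```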

Add a SCSI Device to a SCSI Bus

Step 1: Identify the current SCSI configuration.
# cfgadm -al
Ap_Id Type Receptacle Occupant Condition
c2 scsi-bus connected configured unknown
c2::dsk/c2t0d0 CD-ROM connected configured unknown
c3 scsi-sas connected configured unknown
c3::dsk/c3t0d0 disk connected configured unknown
c3::dsk/c3t1d0 disk connected configured unknown
c3::dsk/c3t2d0 disk connected configured unknown

Step 2: Add the SCSI device to the SCSI bus.

2a. Type the following cfgadm command.
For example:

# cfgadm -x insert_device c3
Adding device to SCSI HBA: /devices/pci@7c0/pci@0/pci@1/pci@0,2/LSILogic,sas@2
This operation will suspend activity on SCSI bus: c3

2b. Type y at the Continue (yes/no)? prompt to proceed.
Continue (yes/no)? y
SCSI bus quiesced successfully.
It is now safe to proceed with hotplug operation.
I/O activity on the SCSI bus is suspended while the hot-plug operation is in progress.

2c. Connect the device and then power it on.

2d. Type y at the Enter y if operation is complete or n to abort (yes/no)? prompt.
Enter y if operation is complete or n to abort (yes/no)? y


Step 3: Verify that the device has been added.
# cfgadm -al
Ap_Id Type Receptacle Occupant Condition
c2 scsi-bus connected configured unknown
c2::dsk/c2t0d0 CD-ROM connected configured unknown
c3 scsi-sas connected configured unknown
c3::dsk/c3t0d0 disk connected configured unknown
c3::dsk/c3t1d0 disk connected configured unknown
c3::dsk/c3t2d0 disk connected configured unknown
c3::dsk/c3t3d0 disk connected configured unknown
A new disk has been added to controller c3.

Replace a SCSI Disk on a SCSI Controller  


Step 1: Identify the current SCSI configuration.
# cfgadm -al
c2 scsi-bus connected configured unknown
c2::dsk/c2t0d0 CD-ROM connected configured unknown
c3 scsi-sas connected configured unknown
c3::dsk/c3t0d0 disk connected configured unknown
c3::dsk/c3t1d0 disk connected configured unknown
c3::dsk/c3t2d0 disk connected configured unknown
c3::dsk/c3t3d0 disk connected configured unknown

Step 2: Replace a device on the SCSI bus with another device of the same type.
2a. Type the following cfgadm command.
For example:

 # cfgadm -x replace_device c3::dsk/c3t3d0
Replacing SCSI device: /devices/pci@7c0/pci@0/pci@1/pci@0,2/LSILogic,sas@2/sd@3,0
This operation will suspend activity on SCSI bus: c3


2b. Type y at the Continue (yes/no)? prompt to proceed.
I/O activity on the SCSI bus is suspended while the hot-plug operation is in progress.

 Continue (yes/no)? y
SCSI bus quiesced successfully.
It is now safe to proceed with hotplug operation.

2c. Power off the device to be removed and remove it.

2d. Add the replacement device. Then, power it on.
The replacement device should be of the same type and at the same address (target and LUN) as the device to be removed.


2e. Type y at the Enter y if operation is complete or n to abort (yes/no)? prompt.
Enter y if operation is complete or n to abort (yes/no)? y


Step 3 : Verify that the device has been replaced.
# cfgadm -al
Ap_Id Type Receptacle Occupant Condition
c2 scsi-bus connected configured unknown
c2::dsk/c2t0d0 CD-ROM connected configured unknown
c3 scsi-sas connected configured unknown
c3::dsk/c3t0d0 disk connected configured unknown
c3::dsk/c3t1d0 disk connected configured unknown
c3::dsk/c3t2d0 disk connected configured unknown
c3::dsk/c3t3d0 disk connected configured unknown

Remove a SCSI Device 

Step 1: Identify the current SCSI configuration.
# cfgadm -al
Ap_Id Type Receptacle Occupant Condition
c2 scsi-bus connected configured unknown
c2::dsk/c2t0d0 CD-ROM connected configured unknown
c3 scsi-sas connected configured unknown
c3::dsk/c3t0d0 disk connected configured unknown
c3::dsk/c3t1d0 disk connected configured unknown
c3::dsk/c3t2d0 disk connected configured unknown
c3::dsk/c3t3d0 disk connected configured unknown

Step 2: Remove the SCSI device from the system.
2a. Type the following cfgadm command.
For example:

# cfgadm -x remove_device c3::dsk/c3t3d0
Removing SCSI device: /devices/pci@7c0/pci@0/pci@1/pci@0,2/LSILogic,sas@2/sd@3,0
This operation will suspend activity on SCSI bus: c3

2b. Type y at the Continue (yes/no)? prompt to proceed.
Continue (yes/no)? y
SCSI bus quiesced successfully.
It is now safe to proceed with hotplug operation.
I/O activity on the SCSI bus is suspended while the hot-plug operation is in progress.

2c. Power off the device to be removed and remove it.

2d. Type y at the Enter y if operation is complete or n to abort (yes/no)? prompt.
Enter y if operation is complete or n to abort (yes/no)? y
Note – This step must be performed if you are removing a SCSI RAID device from a SCSI RAID array.


Step 3: Verify that the device has been removed from the system.
# cfgadm -al
Ap_Id Type Receptacle Occupant Condition
c2 scsi-bus connected configured unknown
c2::dsk/c2t0d0 CD-ROM connected configured unknown
c3 scsi-sas connected configured unknown
c3::dsk/c3t0d0 disk connected configured unknown
c3::dsk/c3t1d0 disk connected configured unknown
c3::dsk/c3t2d0 disk connected configured unknown

Solaris 10/11: How to enable/disable automount debugging

If you are having problems with an automounter directory, you should always try to mount the partition by hand to verify whether the problem is related to the automounter rather than to NFS in general. If the manual mount fails, the problem is with NFS; if the mount works fine, the problem is with the automounter.
The automounter also has built-in debugging, which can be used to examine exactly what the automounter is doing. It is best to kill automount and restart it with the debug flags so that you can see everything from the start. The procedure below can be used to enable or disable automount debugging.

Solaris 10 
Enable Solaris 10 automount debug output
1. Uncomment the following lines in the /etc/default/autofs file:
# Verbose mode.  Notifies of autofs mounts, unmounts, or other
# non-essential events.  This is equivalent to the "-v" argument.
AUTOMOUNT_VERBOSE=TRUE
# Verbose.  Log status messages to the console.
# This is equivalent to the "-v" argument.
AUTOMOUNTD_VERBOSE=TRUE
# Trace.  Expand each RPC call and display it on standard output.
# This is equivalent to the "-T" argument.
AUTOMOUNTD_TRACE=3

DISABLE Solaris 10 automount debug output

2. Revert the /etc/default/autofs entries back to their defaults to shut off logging:
#AUTOMOUNT_VERBOSE=FALSE
#AUTOMOUNTD_VERBOSE=FALSE
#AUTOMOUNTD_TRACE=0

Refer to the file /etc/default/autofs.
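The uncomment/revert cycle can also be done with sed instead of an editor. A hypothetical sketch run against a scratch copy (autofs.sample stands in for /etc/default/autofs):

```shell
# Scratch copy with the default (commented-out) entries.
cat > autofs.sample <<'EOF'
#AUTOMOUNT_VERBOSE=FALSE
#AUTOMOUNTD_VERBOSE=FALSE
#AUTOMOUNTD_TRACE=0
EOF
# Uncomment each setting and switch it to its debug value.
sed -e 's/^#AUTOMOUNT_VERBOSE=.*/AUTOMOUNT_VERBOSE=TRUE/' \
    -e 's/^#AUTOMOUNTD_VERBOSE=.*/AUTOMOUNTD_VERBOSE=TRUE/' \
    -e 's/^#AUTOMOUNTD_TRACE=.*/AUTOMOUNTD_TRACE=3/' \
    autofs.sample > autofs.debug
cat autofs.debug
```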


Solaris 11
Enable Solaris 11 automount debug output 
1.  Alter the autofs debugging properties with sharectl:
root@my-nfs-server# sharectl get autofs
timeout=600
automount_verbose=false
automountd_verbose=false
nobrowse=false
trace=0
environment=


root@my-nfs-server# sharectl set -p automount_verbose=true autofs
root@my-nfs-server# sharectl set -p automountd_verbose=true autofs
root@my-nfs-server# sharectl set -p trace=3 autofs
root@my-nfs-server# sharectl get autofs
timeout=600
automount_verbose=true
automountd_verbose=true
nobrowse=false
trace=3
environment=
  
2. Tail the autofs log: 
root@my-nfs-server# tail -f /var/svc/log/system-filesystem-autofs:default.log
[ Jan 23 09:48:36 Stopping because service restarting. ]
[ Jan 23 09:48:36 Executing stop method ("/lib/svc/method/svc-autofs stop 76") ]
[ Jan 23 09:48:42 Method "stop" exited with status 0 ]
[ Jan 23 09:48:42 Executing start method ("/lib/svc/method/svc-autofs start") ]
[ Jan 23 09:48:42 Method "start" exited with status 0 ]
[ Jan 23 09:49:48 Stopping because service restarting. ]
[ Jan 23 09:49:48 Executing stop method ("/lib/svc/method/svc-autofs stop 8276539") ]
[ Jan 23 09:49:53 Method "stop" exited with status 0 ]
[ Jan 23 09:49:53 Executing start method ("/lib/svc/method/svc-autofs start") ]
[ Jan 23 09:49:53 Method "start" exited with status 0 ]

3. Restart the autofs service:

# svcadm restart autofs

4. Examine output from the tail -f command to ensure the logging is now enabled:
[ Jan 23 09:56:18 Stopping because service restarting. ]
[ Jan 23 09:56:18 Executing stop method ("/lib/svc/method/svc-autofs stop 8276563") ]
[ Jan 23 09:56:23 Method "stop" exited with status 0 ]
[ Jan 23 09:56:23 Executing start method ("/lib/svc/method/svc-autofs start") ]
t1      init_ldap: setting up for version 2
automount: /net mounted
automount: /home mounted
automount: no unmounts
[ Jan 23 09:56:23 Method "start" exited with status 0 ]

Reproduce or await the automount activity that triggers the failure, as appropriate. Attach or provide the /var/svc/log/system-filesystem-autofs:default.log file.

DISABLE Solaris 11 automount debug output
1. Restore the original values to the autofs debug properties:
root@my-nfs-server# sharectl set -p automount_verbose=false autofs
root@my-nfs-server# sharectl set -p automountd_verbose=false autofs
root@my-nfs-server# sharectl set -p trace=0 autofs
root@my-nfs-server# sharectl get autofs
timeout=600
automount_verbose=false
automountd_verbose=false
nobrowse=false
trace=0
environment=

2. Restart autofs service:
# svcadm restart autofs