Tuesday, January 12, 2016

Solaris Cluster: pm_tick delay of xxxx ms exceeds 2147 ms message is continuously logged in /var/adm/messages


SYMPTOMS
The message "pm_tick delay of xxxx ms exceeds 2147 ms" is continuoulsy written in the file /var/adm/messages since several days. The delay number reported (here xxxx) could be variable.
This message is printed thousands of times and the size of the messages files is growing but there is no other impact or symptom on the cluster nodes.
The key symptom is the fact the same message is repeated over time but the delays reported are bounded and almost the same over time rather than showing increasing delays.
In other words the delays are not getting worse.
For example, checking in /var/adm/messages you could see:
grep pm_tick messages |more
Dec 27 01:07:38 node1 genunix: [ID 313806 kern.notice] NOTICE: pm_tick delay of 3589 ms exceeds 2147 ms <==== 1st delay logged
Dec 27 01:07:38 node1 genunix: [ID 313806 kern.notice] NOTICE: pm_tick delay of 3549 ms exceeds 2147 ms
............ snip.....
This message was printed a large number of times and the delay has remained stable:
/explorer.12345678.node1-2016.01.04.12.56/messages/--> grep -c pm_tick messages
3112523
/explorer.12345678.node1-2016.01.04.12.56/messages/--> grep pm_tick messages |awk '{print $13}' | sort -u
3549 <========= this is the lowest delay
3550
3551
3589 <========= this is the highest delay
/explorer.12345678.node1-2016.01.04.12.56/messages/--> grep pm_tick messages |awk '{print $13}' | grep -c 3550
1555990
/explorer.12345678.node1-2016.01.04.12.56/messages/--> grep pm_tick messages |awk '{print $13}' | grep -c 3549
1556529
It is clear that the delay is not getting worse: the vast majority of entries report delays of 3549 and 3550 ms, and the delay range is bounded.

CAUSE
The pm_tick function mentioned in the message is used by the cluster framework to perform sanity checks, ensuring that the node itself is not hung and that cluster heartbeats are sent in a timely fashion.
Repeated delays in the execution of pm_tick are treated as a warning of a potential hang or serious performance problem.
The reported message is printed whenever pm_tick is delayed by more than 2147 milliseconds.
If the delays grew over time and eventually went beyond 5050 milliseconds, the node would panic with the message:
898738 Aborting node because pm_tick delay of %lld ms exceeds %lld ms
Description: The system is unable to send heartbeats for a long time. (The threshold is half of the minimum of the timeout values of all the paths; if the timeout value for all the paths is 10 secs, then this value is 5 secs.) There is probably heavy interrupt activity causing the clock thread to get delayed, which in turn causes irregular heartbeats. The node is aborted because it is considered to be in a 'sick' condition and it is better to abort this node instead of causing other nodes (or the cluster) to go down.
Solution: Check what is causing the high interrupt activity and configure the system accordingly.
These figures are derived from the heartbeat_timeout value, which is 10 seconds by default. You can check that value by running:
cluster show -t global |grep heartbeat_timeout
heartbeat_timeout: 10000
In the particular scenario described in the Symptoms section, the reported delays are bounded and do not grow over time; in this specific case the messages are not the symptom of an ongoing problem.
In such a case, the cause is a one-off real delay that was large enough to trigger the first warning but left the pm_tick counters in an incorrect state that was never reset.
As a result, further messages are printed as if a constant delay were detected at every subsequent pm_tick execution, even though no further actual delays occur.

SOLUTION
The problem can be fixed by disabling and then re-enabling path monitoring on the fly (despite the name, this also works on a single-node cluster):
# /usr/cluster/lib/sc/cmm_ctl -d
Then, after a couple of minutes:
 # /usr/cluster/lib/sc/cmm_ctl -e
This resets the internal counters used by pm_tick and thus zeroes out the stale delay that was being carried over incorrectly.
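To confirm that the stale delay has been cleared, you can watch the log right after re-enabling path monitoring; no new pm_tick notices should appear. This is just a generic check on /var/adm/messages, nothing cluster-specific:
# tail -f /var/adm/messages | grep pm_tick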

Solaris 11: Recover "Lost" Root Password

Last week I encountered an issue where we were not able to switch to the root user (su - root) because someone had unfortunately changed the root password. The goal now is to reset the root password on a Solaris 11 server. NOTE: one restriction in Solaris 11 is that root cannot log in directly on the console, since it is considered a role account; you have to log in as a normal user and switch to the root account.
High Level Plan:
  1. Boot the server from a bootable DVD (an x86 server is used here).
  2. Import the root pool.
  3. Mount the root pool dataset.
  4. Edit the shadow file.
  5. Reboot the server.
Boot the server from the bootable DVD and select the 3rd option; this drops you into a shell prompt.
Identify the root pool. Most of the time rpool is the default root pool, but you can verify with the command below, which shows all the pools available for import.
# zpool import | grep -i pool:
pool: rpool
pool: unixrepo
#
Import the root pool (in the example below, rpool is our root pool):
# zpool import -f -R /tmp/rpool rpool
#
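Optionally, confirm that the pool imported under the alternate root before continuing (the dataset name below assumes the default Solaris 11 BE name, solaris):
# zpool list rpool
# zfs list -r rpool/ROOT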
Configure the root pool dataset as legacy:
# zfs set mountpoint=legacy rpool/ROOT/solaris
#
Mount the rpool dataset on /mnt:
# mount -F zfs rpool/ROOT/solaris /mnt
#
Modify the root password. First back up the shadow and passwd files:
# cp /mnt/etc/shadow /mnt/etc/shadow_backup
# cp /mnt/etc/passwd /mnt/etc/passwd_backup
# TERM=vt100;export TERM
# EDITOR=vi;export EDITOR
Remove the encrypted password entry for root:
# vi /mnt/etc/shadow
root:$5$hzZlNEsJ$iuZHgU4xIQ72LfnmyPj7jzsyn.kbwdl0Ig4lbztnEQ4:16273::::::13888
After removing the encrypted password, the root line should look like this:
# grep root /mnt/etc/shadow
root::16273::::::13888
#
Now unmount the dataset, restore the original mountpoint, export the pool, and reboot the server into single-user mode:
# umount /mnt
# zfs set mountpoint=/ rpool/ROOT/solaris
# zpool export rpool
# halt

For x86: boot from the hard disk in single-user mode by editing the GRUB menu entry: press "e", find the line that starts with $multiboot /ROOT, add -s at the end of that line, and let the system boot into single-user mode. Here we can change our root password and reboot the server.
For SPARC: from the OK prompt, just enter boot -s to boot into single-user mode.
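When the single-user boot asks for the root password for system maintenance, just press Enter (the entry in the shadow file is now blank), then set a new password with the standard passwd command and bring the system up. A minimal sketch:
# passwd root
# init 6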
COOL... You have successfully recovered the root password on Solaris 11.

Thursday, March 19, 2015

Solaris 11: Resolve ZFS Device faults/alerts using fmadm


Procedure:
  1. Identify the faulted device with the fmadm faulty command.
  2. Replace the faulty or retired device or clear the device error.
  3. Clear the FMA fault.
  4. Confirm that the fault is cleared.
1. Identify the faulted device with the fmadm faulty command. 
For example:
# fmadm faulty
--------------- ------------------------------------ -------------- --------
TIME            EVENT-ID                             MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- --------
Jun 20 16:30:52 55c82fff-b709-62f5-b66e-b4e1bbe9dcb1 ZFS-8000-LR    Major

Problem Status : solved
Diag Engine : zfs-diagnosis / 1.0
System Manufacturer : unknown
Name : ORCL,SPARC-T3-4
Part_Number : unknown
Serial_Number : 1120BDRCCD
Host_ID : 84a02d28

----------------------------------------
Suspect 1 of 1 :
Fault class : fault.fs.zfs.open_failed
Certainty : 100%
Affects : zfs://pool=86124fa573cad84e/vdev=25d36cd46e0a7f49/
pool_name=pond/vdev_name=id1,sd@n5000c500335dc60f/a
Status : faulted and taken out of service

FRU Name : "zfs://pool=86124fa573cad84e/vdev=25d36cd46e0a7f49/
pool_name=pond/vdev_name=id1,sd@n5000c500335dc60f/a"
Status : faulty

Description : ZFS device 'id1,sd@n5000c500335dc60f/a' in pool 'pond' failed to open.

Response : An attempt will be made to activate a hot spare if available.

Impact : Fault tolerance of the pool may be compromised.

Action : Use 'fmadm faulty' to provide a more detailed view of this event.
Run 'zpool status -lx' for more information. Please refer to the associated reference document at http://support.oracle.com/msg/ZFS-8000-LR for the latest service procedures and policies regarding this diagnosis.


2. Replace the faulty or retired device, or clear the device error.
If an intermittent device error occurred but the device was not replaced, you can attempt to clear the previous error. For example:
# zpool clear pond c0t5000C500335DC60Fd0
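If the disk actually failed and has been physically replaced in the same slot, you would instead resilver onto the new device rather than just clearing errors (a sketch using the pool and device names from the example above; substitute your own):
# zpool replace pond c0t5000C500335DC60Fd0
# zpool status pond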

 

3. Clear the FMA fault. For example:
# fmadm repaired zfs://pool=86124fa573cad84e/vdev=25d36cd46e0a7f49/pool_name=pond/vdev_name=id1,sd@n5000c500335dc60f/a

fmadm: recorded repair to of zfs://pool=86124fa573cad84e/vdev=25d36cd46e0a7f49/pool_name=pond/vdev_name=id1,sd@n5000c500335dc60f/a
4. Confirm that the fault is cleared.
# fmadm faulty
If the error is cleared, the fmadm faulty command returns nothing.
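You can also confirm the state at the pool level with the standard zpool checks (pond is the pool from this example); once the fault is fully cleared, the pool should no longer be reported as unhealthy:
# zpool status -x
# zpool status pond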

Solaris 11: Administration of SCSI devices using cfgadm


Below are the operations that can be performed on a SCSI device with cfgadm:
  • Connect a SCSI Controller
  • Add a SCSI Device to a SCSI Bus
  • Replace a SCSI Disk on a SCSI Controller 
  • Remove a SCSI Device 
Connect a SCSI Controller

Step 1: Verify that the device is disconnected before you connect it.
# cfgadm -al
Ap_Id Type Receptacle Occupant Condition
c2 unavailable disconnected configured unknown
c2::dsk/c2t0d0 unavailable disconnected configured unknown
c3 scsi-sas connected configured unknown
c3::dsk/c3t0d0 disk connected configured unknown
c3::dsk/c3t1d0 disk connected configured unknown
c3::dsk/c3t2d0 disk connected configured unknown
c3::dsk/c3t3d0 disk connected configured unknown

Step 2: Connect the SCSI controller.
# cfgadm -c connect c2

Step 3: Verify that the SCSI controller is connected.
# cfgadm -al
Ap_Id Type Receptacle Occupant Condition
c2 scsi-bus connected configured unknown
c2::dsk/c2t0d0 CD-ROM connected configured unknown
c3 scsi-sas connected configured unknown
c3::dsk/c3t0d0 disk connected configured unknown
c3::dsk/c3t1d0 disk connected configured unknown
c3::dsk/c3t2d0 disk connected configured unknown
c3::dsk/c3t3d0 disk connected configured unknown

Add a SCSI Device to a SCSI Bus

Step 1: Identify the current SCSI configuration.
# cfgadm -al
Ap_Id Type Receptacle Occupant Condition
c2 scsi-bus connected configured unknown
c2::dsk/c2t0d0 CD-ROM connected configured unknown
c3 scsi-sas connected configured unknown
c3::dsk/c3t0d0 disk connected configured unknown
c3::dsk/c3t1d0 disk connected configured unknown
c3::dsk/c3t2d0 disk connected configured unknown

Step 2: Add the SCSI device to the SCSI bus.

2a. Type the following cfgadm command.
For example:

# cfgadm -x insert_device c3
Adding device to SCSI HBA: /devices/pci@7c0/pci@0/pci@1/pci@0,2/LSILogic,sas@2
This operation will suspend activity on SCSI bus: c3

2b. Type y at the Continue (yes/no)? prompt to proceed.
Continue (yes/no)? y
SCSI bus quiesced successfully.
It is now safe to proceed with hotplug operation.
I/O activity on the SCSI bus is suspended while the hot-plug operation is in progress.

2c. Connect the device and then power it on.

2d. Type y at the Enter y if operation is complete or n to abort (yes/no)? prompt.
Enter y if operation is complete or n to abort (yes/no)? y


Step 3: Verify that the device has been added.
# cfgadm -al
Ap_Id Type Receptacle Occupant Condition
c2 scsi-bus connected configured unknown
c2::dsk/c2t0d0 CD-ROM connected configured unknown
c3 scsi-sas connected configured unknown
c3::dsk/c3t0d0 disk connected configured unknown
c3::dsk/c3t1d0 disk connected configured unknown
c3::dsk/c3t2d0 disk connected configured unknown
c3::dsk/c3t3d0 disk connected configured unknown
A new disk has been added to controller c3.
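If the new disk is not visible yet (for example in format output), rebuilding the device links usually makes it appear; this is an optional, standard step rather than part of the cfgadm procedure itself:
# devfsadm -c disk
# echo | format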

Replace a SCSI Disk on a SCSI Controller  


Step 1: Identify the current SCSI configuration.
# cfgadm -al
c2 scsi-bus connected configured unknown
c2::dsk/c2t0d0 CD-ROM connected configured unknown
c3 scsi-sas connected configured unknown
c3::dsk/c3t0d0 disk connected configured unknown
c3::dsk/c3t1d0 disk connected configured unknown
c3::dsk/c3t2d0 disk connected configured unknown
c3::dsk/c3t3d0 disk connected configured unknown

Step 2: Replace a device on the SCSI bus with another device of the same type.
2a. Type the following cfgadm command.
For example:

 # cfgadm -x replace_device c3::dsk/c3t3d0
Replacing SCSI device: /devices/pci@7c0/pci@0/pci@1/pci@0,2/LSILogic,sas@2/sd@3,0
This operation will suspend activity on SCSI bus: c3


2b. Type y at the Continue (yes/no)? prompt to proceed.
I/O activity on the SCSI bus is suspended while the hot-plug operation is in progress.

 Continue (yes/no)? y
SCSI bus quiesced successfully.
It is now safe to proceed with hotplug operation.

2c. Power off the device to be removed and remove it.

2d. Add the replacement device. Then, power it on.
The replacement device should be of the same type and at the same address (target and LUN) as the device to be removed.


2e. Type y at the Enter y if operation is complete or n to abort (yes/no)? prompt.
Enter y if operation is complete or n to abort (yes/no)? y


Step 3 : Verify that the device has been replaced.
# cfgadm -al
Ap_Id Type Receptacle Occupant Condition
c2 scsi-bus connected configured unknown
c2::dsk/c2t0d0 CD-ROM connected configured unknown
c3 scsi-sas connected configured unknown
c3::dsk/c3t0d0 disk connected configured unknown
c3::dsk/c3t1d0 disk connected configured unknown
c3::dsk/c3t2d0 disk connected configured unknown
c3::dsk/c3t3d0 disk connected configured unknown

Remove a SCSI Device 

Step 1: Identify the current SCSI configuration.
# cfgadm -al
Ap_Id Type Receptacle Occupant Condition
c2 scsi-bus connected configured unknown
c2::dsk/c2t0d0 CD-ROM connected configured unknown
c3 scsi-sas connected configured unknown
c3::dsk/c3t0d0 disk connected configured unknown
c3::dsk/c3t1d0 disk connected configured unknown
c3::dsk/c3t2d0 disk connected configured unknown
c3::dsk/c3t3d0 disk connected configured unknown

Step 2: Remove the SCSI device from the system.
2a. Type the following cfgadm command.
For example:

# cfgadm -x remove_device c3::dsk/c3t3d0
Removing SCSI device: /devices/pci@7c0/pci@0/pci@1/pci@0,2/LSILogic,sas@2/sd@3,0
This operation will suspend activity on SCSI bus: c3

2b. Type y at the Continue (yes/no)? prompt to proceed.
Continue (yes/no)? y
SCSI bus quiesced successfully.
It is now safe to proceed with hotplug operation.
I/O activity on the SCSI bus is suspended while the hot-plug operation is in progress.

2c. Power off the device to be removed and remove it.

2d. Type y at the Enter y if operation is complete or n to abort (yes/no)? prompt.
Enter y if operation is complete or n to abort (yes/no)? y
Note – This step must be performed if you are removing a SCSI RAID device from a SCSI RAID array.


Step 3: Verify that the device has been removed from the system.
# cfgadm -al
Ap_Id Type Receptacle Occupant Condition
c2 scsi-bus connected configured unknown
c2::dsk/c2t0d0 CD-ROM connected configured unknown
c3 scsi-sas connected configured unknown
c3::dsk/c3t0d0 disk connected configured unknown
c3::dsk/c3t1d0 disk connected configured unknown
c3::dsk/c3t2d0 disk connected configured unknown

Solaris 10/11: How to enable/disable automount debugging

If you are having problems with an automounted directory, you should always first try to mount the file system by hand, to verify whether the problem is related to the automounter or to NFS in general. If the manual mount fails, the problem is with NFS; if it works, the problem is with the automounter.
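For example, a quick manual test might look like this (a sketch only; the server name and export path are placeholders, substitute your own):
# mkdir -p /mnt/test
# mount -F nfs nfsserver:/export/home /mnt/test
# ls /mnt/test
# umount /mnt/test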
The automounter also has built-in debugging, which can be used to examine exactly what it is doing. It is best to kill automount and restart it with the debug flags, so that you can see everything from the start. The procedure below can be used to enable or disable automount debugging.

Solaris 10 
Enable Solaris 10 automount debug output
1. Uncomment the following lines in the /etc/default/autofs file:
# Verbose mode.  Notifies of autofs mounts, unmounts, or other
# non-essential events.  This is equivalent to the "-v" argument.
AUTOMOUNT_VERBOSE=TRUE
# Verbose.  Log status messages to the console.
# This is equivalent to the "-v" argument.
AUTOMOUNTD_VERBOSE=TRUE
# Trace.  Expand each RPC call and display it on standard output.
# This is equivalent to the "-T" argument.
AUTOMOUNTD_TRACE=3
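After editing /etc/default/autofs, restart the autofs service so that automountd is relaunched with the new options (autofs is managed by SMF on Solaris 10):
# svcadm restart svc:/system/filesystem/autofs:default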

Disable Solaris 10 automount debug output

1. Revert the /etc/default/autofs entries back to their defaults to turn logging off:
#AUTOMOUNT_VERBOSE=FALSE
#AUTOMOUNTD_VERBOSE=FALSE
#AUTOMOUNTD_TRACE=0

Refer to the file /etc/default/autofs for details of these settings.
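As when enabling, restart the autofs service so the reverted settings take effect:
# svcadm restart autofs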


Solaris 11
Enable Solaris 11 automount debug output 
1.  Alter the autofs debugging properties with sharectl:
root@my-nfs-server# sharectl get autofs
timeout=600
automount_verbose=false
automountd_verbose=false
nobrowse=false
trace=0
environment=


root@my-nfs-server# sharectl set -p automount_verbose=true autofs
root@my-nfs-server# sharectl set -p automountd_verbose=true autofs
root@my-nfs-server# sharectl set -p trace=3 autofs
root@my-nfs-server# sharectl get autofs
timeout=600
automount_verbose=true
automountd_verbose=true
nobrowse=false
trace=3
environment=
  
2. Tail the autofs log: 
root@my-nfs-server# tail -f /var/svc/log/system-filesystem-autofs:default.log
[ Jan 23 09:48:36 Stopping because service restarting. ]
[ Jan 23 09:48:36 Executing stop method ("/lib/svc/method/svc-autofs stop 76") ]
[ Jan 23 09:48:42 Method "stop" exited with status 0 ]
[ Jan 23 09:48:42 Executing start method ("/lib/svc/method/svc-autofs start") ]
[ Jan 23 09:48:42 Method "start" exited with status 0 ]
[ Jan 23 09:49:48 Stopping because service restarting. ]
[ Jan 23 09:49:48 Executing stop method ("/lib/svc/method/svc-autofs stop 8276539") ]
[ Jan 23 09:49:53 Method "stop" exited with status 0 ]
[ Jan 23 09:49:53 Executing start method ("/lib/svc/method/svc-autofs start") ]
[ Jan 23 09:49:53 Method "start" exited with status 0 ]

3. Restart the autofs service:

# svcadm restart autofs

4. Examine output from the tail -f command to ensure the logging is now enabled:
[ Jan 23 09:56:18 Stopping because service restarting. ]
[ Jan 23 09:56:18 Executing stop method ("/lib/svc/method/svc-autofs stop 8276563") ]
[ Jan 23 09:56:23 Method "stop" exited with status 0 ]
[ Jan 23 09:56:23 Executing start method ("/lib/svc/method/svc-autofs start") ]
t1      init_ldap: setting up for version 2
automount: /net mounted
automount: /home mounted
automount: no unmounts
[ Jan 23 09:56:23 Method "start" exited with status 0 ]

Reproduce or wait for the automount activity that triggers the failure, as appropriate, then examine (or provide to support) /var/svc/log/system-filesystem-autofs:default.log.

Disable Solaris 11 automount debug output
1. Restore the original values to the autofs debug properties:
root@my-nfs-server# sharectl set -p automount_verbose=false autofs
root@my-nfs-server# sharectl set -p automountd_verbose=false autofs
root@my-nfs-server# sharectl set -p trace=0 autofs
root@my-nfs-server# sharectl get autofs
timeout=600
automount_verbose=false
automountd_verbose=false
nobrowse=false
trace=0
environment=

2. Restart autofs service:
# svcadm restart autofs

Wednesday, March 18, 2015

Solaris 10: Live upgrade with ZFS rpool - Example


Solaris 10 Live Upgrade with ZFS is really simple compared to some of the messes you could get into with SVM-mirrored root disks. Below is a simple Live Upgrade BE creation and patching example. Solaris Live Upgrade works the same as in previous releases when you use ZFS, with the same commands; as I said, it's just easier. Also, a really great feature is that you can now migrate from UFS file systems to a ZFS root pool and create new boot environments within a ZFS root pool. I'll show that in another blog entry at a later date.
# lucreate -n Dec2012
Analyzing system configuration.
No name for current boot environment.
INFORMATION: The current boot environment is not named – assigning name <s10s_u10wos_17b>.
Current boot environment is named <s10s_u10wos_17b>.
Creating initial configuration for primary boot environment <s10s_u10wos_17b>.
INFORMATION: No BEs are configured on this system.
The device </dev/dsk/c0t5000CCA02533AC20d0s0> is not a root device for any boot environment; cannot get BE ID.
PBE configuration successful: PBE name <s10s_u10wos_17b> PBE Boot Device </dev/dsk/c0t5000CCA02533AC20d0s0>.
Updating boot environment description database on all BEs.
Updating system configuration files.
Creating configuration for boot environment <Dec2012>.
Source boot environment is <s10s_u10wos_17b>.
Creating file systems on boot environment <Dec2012>.
Populating file systems on boot environment <Dec2012>.
Analyzing zones.
Duplicating ZFS datasets from PBE to ABE.
Creating snapshot for <rpool/ROOT/s10s_u10wos_17b> on <rpool/ROOT/s10s_u10wos_17b@Dec2012>.
Creating clone for <rpool/ROOT/s10s_u10wos_17b@Dec2012> on <rpool/ROOT/Dec2012>.
Mounting ABE <Dec2012>.
Generating file list.
Finalizing ABE.
Fixing zonepaths in ABE.
Unmounting ABE <Dec2012>.
Fixing properties on ZFS datasets in ABE.
Reverting state of zones in PBE <s10s_u10wos_17b>.
Making boot environment <Dec2012> bootable.
Population of boot environment <Dec2012> successful.
Creation of boot environment <Dec2012> successful.
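At this point you can confirm that both boot environments exist and that Dec2012 is marked complete before patching it (lustatus is the standard check; output omitted here):
# lustatus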

Now patch it. Let's move to the patch folder where I unzipped the patch cluster:
# cd /root/10_Recommended/patches
# luupgrade -n Dec2012 -s /root/10_Recommended/patches -t `cat patch_order`
Validating the contents of the media </root/10_Recommended/patches>.
The media contains 358 software patches that can be added.
Mounting the BE <Dec2012>.
---------------- SNIP -------------------------------------------------
Patch 146054-07 has been successfully installed.
See /a/var/sadm/patch/146054-07/log for details
Executing postpatch script…
Patch packages installed:
SUNWcsu
SUNWxcu6
Checking installed patches…
Executing prepatch script…
Installing patch packages…
Patch 125555-12 has been successfully installed.
See /a/var/sadm/patch/125555-12/log for details
Executing postpatch script…
Patch packages installed:
SUNWcsu
Checking installed patches…
Installing patch packages…
---------------- SNIP ------------------------------------------------
Unmount the BE <Dec2012>.
The patch add to the BE <Dec2012> completed.

Now we activate the new BE
# luactivate Dec2012
A Live Upgrade Sync operation will be performed on startup of boot environment <Dec2012>.
**********************************************************************
The target boot environment has been activated. It will be used when you
reboot. NOTE: You MUST NOT USE the reboot, halt, or uadmin commands. You
MUST USE either the init or the shutdown command when you reboot. If you
do not use either init or shutdown, the system will not boot using the
target BE.
**********************************************************************
In case of a failure while booting to the target BE, the following process
needs to be followed to fallback to the currently working boot environment:
1. Enter the PROM monitor (ok prompt).
2. Boot the machine to Single User mode using a different boot device
(like the Solaris Install CD or Network). Examples:
At the PROM monitor (ok prompt):
For boot to Solaris CD:  boot cdrom -s
For boot to network:     boot net -s
3. Mount the Current boot environment root slice to some directory (like
/mnt). You can use the following commands in sequence to mount the BE:
zpool import rpool
zfs inherit -r mountpoint rpool/ROOT/s10s_u10wos_17b
zfs set mountpoint=<mountpointName> rpool/ROOT/s10s_u10wos_17b
zfs mount rpool/ROOT/s10s_u10wos_17b
4. Run <luactivate> utility with out any arguments from the Parent boot
environment root slice, as shown below:
<mountpointName>/sbin/luactivate
5. luactivate, activates the previous working boot environment and
indicates the result.
6. Exit Single User mode and reboot the machine.
**********************************************************************
Modifying boot archive service
Activation of boot environment <Dec2012> successful.

All done. Let's see if it looks OK:
# lustatus
Boot Environment           Is       Active Active    Can    Copy
Name                       Complete Now    On Reboot Delete Status
-------------------------- -------- ------ --------- ------ ----------
s10s_u10wos_17b            yes      yes    no        no     -
Dec2012                    yes      no     yes       no     -

Yes, Dec2012 is now the active BE on reboot. Now we reboot, without using the reboot command:
# shutdown -i6 -y -g0
Shutdown started.    Thu Dec 20 13:11:38 EST 2012
Changing to init state 6 – please wait
Broadcast Message from root (pts/1) on mybox.ca Thu Dec 20 13:11:39…
THE SYSTEM mybox.ca IS BEING SHUT DOWN NOW ! ! !
Log off now or risk your files being damaged

The system has rebooted now. Let's take a look:
# uname -a
SunOS mybox.ca  5.10 Generic_147440-26 sun4v sparc sun4v

# lustatus
Boot Environment           Is       Active Active    Can    Copy
Name                       Complete Now    On Reboot Delete Status
-------------------------- -------- ------ --------- ------ ----------
s10s_u10wos_17b            yes      no     no        yes    -
Dec2012                    yes      yes    yes       no     -

New kernel, and Dec2012 is the booted and active BE. One optional cleanup step is shown below. Hope this helps.
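Once you are comfortable with the new BE and no longer need to fall back, the old boot environment can be deleted to reclaim its snapshot and clone space (standard Live Upgrade housekeeping):
# ludelete s10s_u10wos_17b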

Tuesday, March 17, 2015

Solaris 10: Migrating From UFS to ZFS with Live Upgrade


Solaris 10 Live Upgrade with ZFS is really simple compared to some of the messes you could get into with SVM-mirrored root disks. Below is a simple Live Upgrade BE creation and patching example. Solaris Live Upgrade works the same as in previous releases when you use ZFS, with the same commands; as I said, it's just easier. Also, a really great feature is that you can now migrate from UFS file systems to a ZFS root pool and create new boot environments within a ZFS root pool.

Create the new rpool
You will need a new disk to be used as your ZFS boot disk. The first task is to create a new root pool, or rpool. You then create a new boot environment in that rpool from the existing UFS boot and root file system.
In this example, the first zfs list command shows the ZFS root pool created by the zpool command. The second zfs list command shows the datasets created by the lucreate command.

# zpool create rpool c0t2d0s2

# zfs list
NAME USED AVAIL REFER MOUNTPOINT
rpool 12.4G 90.1G 20K /rpool

We now need to create the new boot environment (BE), using the existing UFS-based boot disk as the source and the newly created rpool as the destination.

Create a new ZFS Boot Environment

# lucreate -c c0t0d0 -n Nov2012-zfsBE -p rpool

# zfs list
NAME USED AVAIL REFER MOUNTPOINT
rpool 12.4G 90.1G 20K /rpool
rpool/ROOT 8.12G 90.1G 18K /rpool/ROOT
rpool/ROOT/Nov2012-zfsBE 8.12G 90.1G 551M /tmp/.alt.luupdall.899001
rpool/dump 3.95G - 3.95G -
rpool/swap 3.95G - 3.95G -
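Before patching, you can optionally mount and inspect the new BE (a sketch using the standard lumount/luumount commands; /mnt is an arbitrary mount point):
# lumount Nov2012-zfsBE /mnt
# ls /mnt
# luumount Nov2012-zfsBE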

This is so cool, and easy. All that is left is to patch the new Nov2012 BE, activate it, and reboot. So let's say we had the typical Oracle (old Sun) patch cluster.

Patch the new ZFS based Boot Environment Using luupgrade

To patch the new BE, I would do:

# luupgrade -n Nov2012-zfsBE -s /root/10_Recommended/patches -t `cat patch_order`
(snip - patching output removed)

Then we activate the new ZFS based BE so we can boot off of the new disk.

Activate the new ZFS based Boot Environment

# luactivate Nov2012-zfsBE
A Live Upgrade Sync operation will be performed on startup of boot environment <Nov2012-zfsBE>.
**********************************************************************
The target boot environment has been activated. It will be used when you
reboot. NOTE: You MUST NOT USE the reboot, halt, or uadmin commands. You
MUST USE either the init or the shutdown command when you reboot. If you
do not use either init or shutdown, the system will not boot using the
target BE.
**********************************************************************

 
In case of a failure while booting to the target BE, the following process
needs to be followed to fallback to the currently working boot environment:
1. Enter the PROM monitor (ok prompt).
2. Boot the machine to Single User mode using a different boot device
(like the Solaris Install CD or Network). Examples:
At the PROM monitor (ok prompt):
For boot to Solaris CD:  boot cdrom -s
For boot to network:     boot net -s
3. Mount the Current boot environment root slice to some directory (like
/mnt). You can use the following commands in sequence to mount the BE:
zpool import rpool
zfs inherit -r mountpoint rpool/ROOT/Nov2012-zfsBE
zfs set mountpoint=<mountpointName> rpool/ROOT/Nov2012-zfsBE
zfs mount rpool/ROOT/Nov2012-zfsBE
4. Run <luactivate> utility with out any arguments from the Parent boot
environment root slice, as shown below:
<mountpointName>/sbin/luactivate
5. luactivate, activates the previous working boot environment and
indicates the result.
6. Exit Single User mode and reboot the machine.
**********************************************************************
Modifying boot archive service
Activation of boot environment <Nov2012-zfsBE> successful.

Let's see if it worked. Yes, Nov2012-zfsBE is now Active on Reboot. Let's reboot with init 6.

Boot from your new ZFS root disk

# lustatus
Boot Environment           Is       Active Active    Can    Copy
Name                       Complete Now    On Reboot Delete Status
-------------------------- -------- ------ --------- ------ ----------
Nov2012-zfsBE        yes      no     yes       no     -

 
# init 6

Login and check patch level and see if the new software I installed is there.

# uname -a
SunOS mygreatbox 5.10 Generic_147440-26 sun4v sparc sun4v

How easy and cool was that? The handy thing with luactivate is that it changes the boot environment for you, meaning the OBP boot-device settings don't have to be changed manually; it is all handled by the operating system during the BE activation process.