Tuesday, January 12, 2016

Solaris 11.2: Control the size of the ZFS ARC cache dynamically

Solaris 11.2 deprecates the zfs_arc_max kernel parameter in favor of user_reserve_hint_pct and that’s cool.
ZFS has a very smart cache, the so-called ARC (Adaptive Replacement Cache). In general the ARC consumes as much memory as is available, and it frees memory again when other applications need more.
In theory this works very well: ZFS simply uses otherwise idle memory to speed up slow disk I/O. But it also has side effects once the ARC has consumed almost all unused memory. Applications that request more memory have to wait until the ARC frees some up. For example, if you restart a big database, the startup may be significantly delayed, because the ARC may already have grabbed the memory freed by the database shutdown. Additionally, such a database will typically request large memory pages; if the ARC holds only scattered free segments, the memory easily becomes fragmented.
That is why many users limit the total size with the zfs_arc_max kernel parameter, which configures the absolute maximum size of the ARC in bytes. For a long time I personally refused to use this parameter, because it feels like “breaking the legs” of ZFS, it is hard to standardize (an absolute value), and it needs a reboot to change. But for memory-intensive applications this hard limit was simply necessary, until now.
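For reference, the old static limit had to be set in /etc/system and only took effect after a reboot. A hypothetical entry capping the ARC at 4 GB (the value is in bytes) looks like this:

```
* Cap the ZFS ARC at 4 GB (0x100000000 bytes) -- requires a reboot
set zfs:zfs_arc_max = 0x100000000
```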
Solaris 11.2 finally addresses this pain point: the zfs_arc_max parameter is now deprecated. The new dynamic user_reserve_hint_pct kernel parameter allows the system administrator to tell ZFS which percentage of physical memory should be reserved for user applications. Without a reboot!
So if you know your application will use 90% of your physical memory, you can just set this parameter to 90.
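The resulting reservation follows directly from physical memory and the percentage. A quick back-of-the-envelope check in shell, using the 2031 MB machine from the test output in this post as an example:

```shell
#!/bin/sh
# physmem_mb: physical memory in MB (from the example output in this post)
# pct: the value of user_reserve_hint_pct
physmem_mb=2031
pct=35
# same formula the DTrace script below uses: reserved = pct * physmem / 100
echo $(( physmem_mb * pct / 100 ))    # -> 710
```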
Oracle provides a script called set_user_reserve.sh and additional documentation. Both can be found on My Oracle Support: “Memory Management Between ZFS and Applications in Oracle Solaris 11.2 (Doc ID 1663862.1)”. The script adjusts this parameter gracefully in steps, giving the ARC enough time to shrink.
According to my first tests, it works really nicely:
# ./set_user_reserve.sh -f 50
Adjusting user_reserve_hint_pct from 0 to 50
08:43:03 AM UTC : waiting for current value : 13 to grow to target : 15
08:43:11 AM UTC : waiting for current value : 15 to grow to target : 20
...
# ./set_user_reserve.sh -f 70
...
# ./set_user_reserve.sh 0

The following chart shows the memory consumption of the ZFS ARC and the memory available for user applications during my test. The line graph is the value of user_reserve_hint_pct, which is set gracefully by the script. During the test, I set it to 50%, 70% and back to 0%. At the same time I generated some I/O on the ZFS filesystem to cause caching in the ARC.

As you can see, the ARC shrinks and grows according to the new
parameter.
For generating the chart data, I wrote the following DTrace script, zfs_user_reserve_stat.d:

#!/usr/sbin/dtrace -s
#pragma D option quiet

dtrace:::BEGIN
{
        pagesize = `_pagesize;
        mb = 1024 * 1024;
        printf("    PHYS      ARC  USR avail (MB)  USR(%%) user_reserve_hint_pct (MB) user_reserve_hint_pct (%%)\n");
}

profile:::tick-$1sec
{
        /* total physical memory in MB */
        physmem = (`physmem * pagesize) / mb;
        user_reserved_pct = `user_reserve_hint_pct;
        user_reserved = (user_reserved_pct * physmem) / 100;
        /* current ARC size in MB */
        arc_size = `arc_stats.arcstat_size.value.ui64 / mb;
        /* memory locked by the kernel, i.e. not available to user space */
        used_by_kernel = (`kpages_locked * pagesize) / mb;
        current_mem_userspace = physmem - used_by_kernel;
        current_mem_userspace_pct = (current_mem_userspace * 100) / physmem;
        printf("%8d %8d %15d %7d %27d %26d\n", physmem, arc_size, current_mem_userspace,
            current_mem_userspace_pct, user_reserved, user_reserved_pct);
}

You can run the script while adjusting user_reserve_hint_pct, passing your desired interval as an argument, for example every ten seconds:

# ./zfs_user_reserve_stat.d 10

    PHYS      ARC  USR avail (MB)  USR(%) user_reserve_hint_pct (MB) user_reserve_hint_pct (%)
    2031     1136             288      14                          0                          0
    2031     1051             364      17                        304                         15
    2031      934             489      24                        507                         25
    2031      863             561      27                        609                         30
    2031      798             627      30                        710                         35
...

I definitely need to play more with this parameter, but so far it looks like a big improvement over zfs_arc_max and a very good replacement.

Solaris 11: How to Verify Which SRU a System Is Running

The following text shows how to identify which SRU is installed on a system running Oracle Solaris 11.1. There are also examples for:
  • An Oracle Solaris 11.1 system without an SRU -> Example 1
  • An Oracle Solaris 11 11/11 system with an SRU -> Example 2
  • An Oracle Solaris 11 11/11 system without an SRU -> Example 3
Display the information about pkg:/entire.
In this example, Oracle Solaris 11.1 with SRU 18.4 is installed. The SRU number is the 18.4 displayed as the third and fourth numbers after “Oracle Solaris 11.1” in the Summary and Version lines.
pkg info entire
Name: entire
Summary: entire incorporation including Support Repository Update (Oracle Solaris 11.1.18.4.0).
Description: This package constrains system package versions to the same build.
WARNING: Proper system update and correct package selection depend on the presence of this incorporation.
Removing this package will result in an unsupported system.  For more information see https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&doctype=REFERENCE&id=1501435.1.
Category: Meta Packages/Incorporations
State: Installed
Publisher: solaris
Version: 0.5.11 (Oracle Solaris 11.1.18.4.0)
Build Release: 5.11
Branch: 0.175.1.18.0.4.0
Packaging Date: April 1, 2014 07:58:03 PM 
Size: 5.46 kB
FMRI: pkg://solaris/entire@0.5.11,5.11-0.175.1.18.0.4.0:20140401T195803Z
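If you need the SRU number in a script, you can cut it out of the Branch string with awk. This is a sketch under the assumption that the branch always follows the 0.175.<release>.<SRU>.<x>.<respin>.<y> layout shown above:

```shell
# Branch string as printed by "pkg info entire";
# field 4 is the SRU number, field 6 the respin
echo "0.175.1.18.0.4.0" | awk -F. '{print $4 "." $6}'    # -> 18.4
```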
Example 1: Package Information Displayed for Oracle Solaris 11.1 Without SRUs. If your system has Oracle Solaris 11.1 installed without any SRUs, the package information will look like the following. Note that the entries after the Build Release line differ from the 11/11 version.
pkg info entire
Name: entire
Summary: Incorporation to lock all system packages to the same build
Description: This package constrains system package versions to the same build.  WARNING: Proper system update and correct package selection depend on the presence of this incorporation.
Removing this package will result in an unsupported system.
Category: Meta Packages/Incorporations
State: Installed
Publisher: solaris
Version: 0.5.11
Build Release: 5.11
Branch: 0.175.1.0.0.24.2
Packaging Date: Wed Sep 19 19:01:35 2012
Size: 5.46 kB
FMRI: pkg://solaris/entire@0.5.11,5.11-0.175.1.0.0.24.2:20120919T190135Z

Example 2: Package Information Displayed for Oracle Solaris 11 11/11 With an SRU. If your system has Oracle Solaris 11 11/11 installed with an SRU, the package information will look like the following. In this example, SRU 10.5 is installed. Note that the entries after the Build Release line are different because of the SRU.
pkg info entire
Name: entire
Summary: entire incorporation including Support Repository Update (Oracle Solaris 11 11/11 SRU 10.5).
Description: This package constrains system package versions to the same build.
WARNING: Proper system update and correct package selection depend on the presence of this incorporation.
Removing this package will result in an unsupported system.  For more information see https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&doctype=REFERENCE&id=1372094.1.
Category: Meta Packages/Incorporations
State: Installed
Publisher: solaris
Version: 0.5.11 (Oracle Solaris 11 SRU 10.5)
Build Release: 5.11
Branch: 0.175.0.10.0.5.0
Packaging Date: Fri Aug 03 18:26:27 2012
Size: 5.45 kB
FMRI: pkg://solaris/entire@0.5.11,5.11-0.175.0.10.0.5.0:20120803T182627Z
Example 3: Package Information Displayed for Oracle Solaris 11 11/11 Without SRUs. If your system has Oracle Solaris 11 11/11 installed without any SRUs, the package information looks like:
pkg info entire
Name: entire
Summary: Incorporation to lock all system packages to the same build
Description: This package constrains system package versions to the same build.  WARNING: Proper system update and correct package selection depend on the presence of this incorporation.
Removing this package will result in an unsupported system.
Category: Meta Packages/Incorporations
State: Installed
Publisher: solaris
Version: 0.5.11
Build Release: 5.11
Branch: 0.175.0.0.0.2.0
Packaging Date: October 20, 2011 02:38:22 PM
Size: 5.45 kB
FMRI: pkg://solaris/entire@0.5.11,5.11-0.175.0.0.0.2.0:20111020T143822Z

Solaris Cluster : pm_tick delay of xxxx ms exceeds 2147 ms message is continuously logged in /var/adm/messages


SYMPTOMS
The message "pm_tick delay of xxxx ms exceeds 2147 ms" has been written continuously to the file /var/adm/messages for several days. The delay reported (here xxxx) can vary.
This message is printed thousands of times and the size of the messages files is growing but there is no other impact or symptom on the cluster nodes.
The key symptom is that the same message is repeated over time, but the delays reported are bounded and almost identical, rather than increasing.
In other words, the delays are not getting worse.
For example, checking in /var/adm/messages you could see:
grep pm_tick messages |more
Dec 27 01:07:38 node1 genunix: [ID 313806 kern.notice] NOTICE: pm_tick delay of 3589 ms exceeds 2147 ms <==== 1st delay logged
Dec 27 01:07:38 node1 genunix: [ID 313806 kern.notice] NOTICE: pm_tick delay of 3549 ms exceeds 2147 ms
............ snip.....
This message was printed many times and the delay is now stable:
/explorer.12345678.node1-2016.01.04.12.56/messages/--> grep -c pm_tick messages
3112523
/explorer.12345678.node1-2016.01.04.12.56/messages/--> grep pm_tick messages |awk '{print $13}' | sort -u
3549 <========= this is the lowest delay
3550
3551
3589 <========= this is the highest delay
/explorer.12345678.node1-2016.01.04.12.56/messages/--> grep pm_tick messages |awk '{print $13}' | grep -c 3550
1555990
/explorer.12345678.node1-2016.01.04.12.56/messages/--> grep pm_tick messages |awk '{print $13}' | grep -c 3549
1556529
It is clear that the delay is not getting worse: most of the time we see delays of 3549 and 3550 milliseconds, and the delay range is bounded.

CAUSE
The pm_tick function mentioned in the message is used by the cluster framework to perform sanity checks, ensuring that the node itself is not hung and that cluster heartbeats are sent in a timely fashion.
Repeated delays in the execution of pm_tick are treated as a warning of a potential hang or a serious performance problem.
The reported message is printed when pm_tick is delayed by more than 2147 milliseconds.
If the delays grew over time and eventually exceeded 5050 milliseconds, the node would panic with the message:
898738 Aborting node because pm_tick delay of %lld ms exceeds %lld ms
Description: The system is unable to send heartbeats for a long time. (This is half of the minimum of the timeout values of all the paths. If the timeout value for all the paths is 10 secs, then this value is 5 secs.) There is probably heavy interrupt activity causing the clock thread to get delayed, which in turn causes irregular heartbeats. The node is aborted because it is considered to be in a 'sick' condition and it is better to abort this node instead of causing other nodes (or the cluster) to go down.
Solution: Check to see what is causing the high interrupt activity and configure the system accordingly.
These figures are calculated from the heartbeat_timeout value, which is 10 seconds by default. You can check that value by running:
# cluster show -t global | grep heartbeat_timeout
heartbeat_timeout: 10000
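The panic threshold quoted in the message description can be derived from this value: it is half of the minimum path timeout. A minimal sketch of the arithmetic, assuming the default heartbeat_timeout of 10000 ms on all paths:

```shell
#!/bin/sh
# heartbeat_timeout in milliseconds, as reported by "cluster show"
heartbeat_timeout=10000
# panic threshold = half of the minimum path timeout
echo $(( heartbeat_timeout / 2 ))    # -> 5000
```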
In the particular scenario described in the Symptoms section, the delays reported are bounded and not growing over time; in this specific case the messages are not the symptom of an ongoing problem.
Here, the reason for the messages is a one-shot actual delay that was large enough to trigger the first warning, but which left the pm_tick counters in an incorrect state that was never reset.
As a result, further messages are printed as if a constant delay were detected at every subsequent pm_tick execution, even though there are no more actual delays.

SOLUTION
The problem can be fixed by disabling and re-enabling path monitoring on the fly (despite the name, this also works on a single-node cluster):
# /usr/cluster/lib/sc/cmm_ctl -d
Then, after a couple of minutes:
# /usr/cluster/lib/sc/cmm_ctl -e
This resets the internal counters used by pm_tick and thus zeroes out the historical delays that were carried over incorrectly.

Solaris 11: Recover "Lost" Root Password

Last week I encountered an issue where we were not able to switch to the root user (su - root), because someone had unfortunately changed the root password. Our goal is to reset the root password on a Solaris 11 server. NOTE: One restriction of Solaris 11 is that root cannot log in directly on the console, because it is configured as a role account. We have to log in as a normal user and switch to the root account.
High Level Plan:
  1. Boot the server from a bootable DVD (here I’m using an x86 server).
  2. Import the root pool.
  3. Mount the root pool dataset.
  4. Edit the shadow file.
  5. Reboot the server.
Boot the server from the bootable DVD and select the 3rd option; this will drop you into a shell prompt.
Identify the root pool. Most of the time rpool is the default root pool; you can verify it with the command below, which shows the details of all available pools:
# zpool import | grep -i pool:
pool:rpool
pool:unixrepo
#
Import the root pool (in the example below, rpool is our root pool):
# zpool import -f -R /tmp/rpool rpool
#
Configure the root pool dataset as legacy:
# zfs set mountpoint=legacy rpool/ROOT/solaris
#
Mount the rpool dataset on /mnt:
# mount -F zfs rpool/ROOT/solaris /mnt
#
Modify the root password. First back up the shadow and passwd files:
# cp /mnt/etc/shadow /mnt/etc/shadow_backup
# cp /mnt/etc/passwd /mnt/etc/passwd_backup
# TERM=vt100;export TERM
# EDITOR=vi;export EDITOR
Remove the encrypted password entry for root:
# vi /mnt/etc/shadow
root:$5$hzZlNEsJ$iuZHgU4xIQ72LfnmyPj7jzsyn.kbwdl0Ig4lbztnEQ4:16273::::::13888
After removing the encrypted password, that line should look like the one below:
# grep root /mnt/etc/shadow
root::16273::::::13888
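Alternatively, instead of editing the file in vi, the password field can be blanked with sed. This is a sketch shown against a sample line; on the real system you would run it on a copy of /mnt/etc/shadow and verify the result before replacing the file:

```shell
#!/bin/sh
# blank the second (password) field of root's shadow entry
echo 'root:$5$hzZlNEsJ$iuZHgU4xIQ72LfnmyPj7jzsyn.kbwdl0Ig4lbztnEQ4:16273::::::13888' \
  | sed 's/^root:[^:]*:/root::/'    # -> root::16273::::::13888
```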
#
Now unmount the dataset, restore the mountpoint, export the pool and halt the server:
# umount /mnt
# zfs set mountpoint=/ rpool/ROOT/solaris
# zpool export rpool
# halt

For x86: Boot from the hard disk in single-user mode by editing the GRUB menu: press "e", find the line that starts with $multiboot /ROOT and add -s at the end of the line, then let the system boot into single-user mode. There we can change the root password and reboot the server.
For SPARC: From the OK prompt just enter boot -s, which brings you into single-user mode.
COOL...You have successfully recovered the root password on Solaris 11.