Tuesday, March 17, 2015

Solaris/LDOMs: using PCIe direct IO with LDOMs


Recent versions of Logical Domains (now Oracle VM Server for SPARC) allow you to assign individual PCIe devices to a guest LDOM, so that I/O from that LDOM does not have to go through the primary domain. I am setting this up for two FC HBAs on a T4-2 system with two domains, one for prod and one for test. Assigning direct I/O (DIO) devices to a guest domain (which then becomes an I/O domain) prevents live migration of that domain. It also adds a dependency on the primary domain: if the primary goes down or reboots, so does the PCIe bus, and with it the access to the HBA. But since we also boot from a ZFS-backed disk provided by the primary domain, this dependency was already there. Another option would be to assign a whole PCIe bus to a guest domain (making it a so-called root domain), but extra caution is needed if the primary domain boots from a disk controller attached to the PCIe bus you want to give away. Some more thought needs to be put into your networking configuration as well.

The whole process is well documented. This post basically repeats the steps that I took and adds the multipath configuration in the guest domain.
The first step is to identify the device names of the FC adapters using ldm list-io from the primary domain (abbreviated output below).

root@primary:~# ldm list-io -l
NAME TYPE BUS DOMAIN STATUS
---- ---- --- ------ ------
pci_0 BUS pci_0 primary
[pci@400]
niu_0 NIU niu_0 primary
[niu@480]
pci_1 BUS pci_1 primary
[pci@500]
niu_1 NIU niu_1 primary
[niu@580]
/SYS/MB/PCIE0 PCIE pci_0 primary OCC
[pci@400/pci@2/pci@0/pci@8]
SUNW,qlc@0/fp/disk
SUNW,qlc@0/fp@0,0
SUNW,qlc@0,1/fp/disk
SUNW,qlc@0,1/fp@0,0
/SYS/MB/PCIE1 PCIE pci_1 primary OCC
[pci@500/pci@2/pci@0/pci@a]
SUNW,qlc@0/fp/disk
SUNW,qlc@0/fp@0,0
SUNW,qlc@0,1/fp/disk
SUNW,qlc@0,1/fp@0,0
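
On a system with many slots, picking the HBA slots out of this listing by eye gets tedious. The following is a minimal sketch, not part of the original procedure, that filters the listing for slots with a qlc (QLogic FC) device node underneath. The function name fc_slots is made up for illustration, and the parsing assumes the line layout shown above.

```shell
#!/bin/sh
# List PCIe slots whose device tree contains a QLogic FC HBA (qlc driver).
# Reads `ldm list-io -l` style output on stdin; assumes the layout shown above.
fc_slots() {
    awk '
        $2 == "PCIE" { slot = $1; bus = $3; seen = 0; next }  # slot header line
        /SUNW,qlc@/ && slot != "" && !seen {                  # first qlc node below it
            print slot, bus
            seen = 1
        }
    '
}

# On the primary domain this would be:  ldm list-io -l | fc_slots
# Here the function is fed an excerpt of the output shown above.
fc_slots <<'EOF'
pci_0 BUS pci_0 primary
[pci@400]
/SYS/MB/PCIE0 PCIE pci_0 primary OCC
[pci@400/pci@2/pci@0/pci@8]
SUNW,qlc@0/fp/disk
SUNW,qlc@0,1/fp/disk
/SYS/MB/PCIE1 PCIE pci_1 primary OCC
[pci@500/pci@2/pci@0/pci@a]
SUNW,qlc@0/fp/disk
EOF
```

For the excerpt above this prints the two slot names together with their bus, which are the values needed for the set-io and remove-io steps.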

So in my case, these are /SYS/MB/PCIE0 and /SYS/MB/PCIE1, one on each PCIe bus. Next we'll enable I/O virtualization on both buses and remove the devices from the primary LDOM. The primary LDOM will need to be rebooted after this.


root@primary:~# ldm start-reconf primary
Initiating a delayed reconfiguration operation on the primary domain.
All configuration changes for other domains are disabled until the primary
domain reboots, at which time the new configuration for the primary domain
will also take effect.

 
root@primary:~# ldm set-io iov=on pci_0
------------------------------------------------------------------------------
Notice: The primary domain is in the process of a delayed reconfiguration.
Any changes made to the primary domain will only take effect after it reboots.
------------------------------------------------------------------------------

 
root@primary:~# ldm set-io iov=on pci_1
------------------------------------------------------------------------------
Notice: The primary domain is in the process of a delayed reconfiguration.
Any changes made to the primary domain will only take effect after it reboots.
------------------------------------------------------------------------------

 
root@primary:~# ldm remove-io /SYS/MB/PCIE0 primary
------------------------------------------------------------------------------
Notice: The primary domain is in the process of a delayed reconfiguration.
Any changes made to the primary domain will only take effect after it reboots.
------------------------------------------------------------------------------

 
root@primary:~# ldm remove-io /SYS/MB/PCIE1 primary
------------------------------------------------------------------------------
Notice: The primary domain is in the process of a delayed reconfiguration.
Any changes made to the primary domain will only take effect after it reboots.
------------------------------------------------------------------------------

 
root@primary:~# reboot -- -r
After the reboot the device(s) will show up as unassigned.

 
root@primary:~# ldm list-io -l /SYS/MB/PCIE0
NAME TYPE BUS DOMAIN STATUS
---- ---- --- ------ ------
/SYS/MB/PCIE0 PCIE pci_0 OCC
[pci@400/pci@2/pci@0/pci@8]
SUNW,assigned-device@0
SUNW,assigned-device@0,1

And we can now assign these devices to the guest domains. They need to be stopped first (test was not installed at this point). The last steps set up the dependency relationship to the primary LDOM so that the guests are also reset if the primary reboots.

root@primary:~# ldm stop-domain LDOM-prod
LDOM LDOM-prod stopped

 
root@primary:~# ldm stop-domain LDOM-test
Remote graceful shutdown or reboot capability is not available on LDOM-test
LDOM LDOM-test stopped

 
root@primary:~# ldm add-io /SYS/MB/PCIE0 LDOM-prod
root@primary:~# ldm add-io /SYS/MB/PCIE1 LDOM-test
root@primary:~# ldm set-domain failure-policy=reset primary
root@primary:~# ldm set-domain master=primary LDOM-prod
root@primary:~# ldm set-domain master=primary LDOM-test
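
If you have more than a couple of guests, it can help to generate the command list first and review it before touching anything. The sketch below is a dry run under that assumption: it only prints the ldm commands used above and executes none of them. The function name plan_dio is hypothetical.

```shell
#!/bin/sh
# Dry run: print the ldm commands for handing one PCIe slot to one guest.
# Nothing is executed; review the output and run the commands by hand.
plan_dio() {
    slot=$1
    guest=$2
    echo "ldm stop-domain $guest"
    echo "ldm add-io $slot $guest"
    echo "ldm set-domain master=primary $guest"
}

# The failure policy is set once, on the primary domain.
echo "ldm set-domain failure-policy=reset primary"
plan_dio /SYS/MB/PCIE0 LDOM-prod
plan_dio /SYS/MB/PCIE1 LDOM-test
```

The slot and domain names are the ones from this post; substitute your own.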

The last step is to boot the guest back up, verify that the device is available there, and set up multipathing.


root@primary:~# ldm start-domain LDOM-prod
LDOM LDOM-prod started

 
root@primary:~# telnet localhost 5001
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.

 
Connecting to console "LDOM-prod" in group "LDOM-prod" ....
Press ~? for control options ..
[... console login ...]

 
root@LDOM-prod:~# prtdiag -v
System Configuration: Oracle Corporation sun4v SPARC T4-2
Memory size: 204800 Megabytes
================================ Virtual CPUs ================================
CPU ID Frequency Implementation Status
------ --------- ---------------------- -------
0 2848 MHz SPARC-T4 on-line
1 2848 MHz SPARC-T4 on-line
2 2848 MHz SPARC-T4 on-line
3 2848 MHz SPARC-T4 on-line
================================ IO Devices ================================
Slot + Bus Name + Model Speed
Status Type Path
----------------------------------------------------------------------------
PCIE0 PCIE SUNW,qlc-pciex1077,2532 QLE2562 5.0GTx4
/pci@400/pci@2/pci@0/pci@8/SUNW,qlc@0
PCIE0 PCIE SUNW,qlc-pciex1077,2532 QLE2562 5.0GTx4
/pci@400/pci@2/pci@0/pci@8/SUNW,qlc@0,1

 
root@LDOM-prod:~# stmsboot -e
WARNING: stmsboot operates on each supported multipath-capable controller
detected in a host. In your system, these controllers are

 
/pci@400/pci@2/pci@0/pci@8/SUNW,qlc@0/fp@0,0
/pci@400/pci@2/pci@0/pci@8/SUNW,qlc@0,1/fp@0,0

 
If you do NOT wish to operate on these controllers, please quit stmsboot
and re-invoke with -D { fp | mpt | mpt_sas | pmcs} to specify which controllers you wish
to modify your multipathing configuration for.

 
Do you wish to continue? [y/n] (default: y) y
WARNING: This operation will require a reboot.
Do you want to continue ? [y/n] (default: y) y
The changes will come into effect after rebooting the system.
Reboot the system now ? [y/n] (default: y) y

And after that we can use the FC HBA directly from our LDOM with multipathing.

Solaris/Shell-scripts: Timestamps for Log Files


There are two common occasions when you might want to get a timestamp:
  • If you want to create a logfile called "myapp_log.11.Mar.2007"
  • If you want to write to a logfile with "myapp: 11 Mar 2007 22:14:44: Something Happened"
Either way, you want to get the current date, in the format you prefer – for example, it's easier if a filename doesn't include spaces.
For the purposes of this article, though for no particular reason, I am assuming that the current time is 10:14:44 PM on Sunday the 11th March 2007.
The tool to use is, naturally enough, called "date". It has a bucket-load of switches, but first, we'll deal with how to use them. For the full list, see the man page ("man date"), though I'll cover some of the more generally useful ones below.

Setting the Date/Time
The first thing to note is that date has two roles: it can report the system clock, and it can set it. To set the clock:

# date 031122142007.44

will set the clock to 03 11 22 14 2007 44 – that is, 03=March, 11=11th day, 22 = 10pm, 14 = 14 minutes past the hour, 2007 = year 2007, 44 = 44 seconds past the minute.
Heck, I don't even know why I bothered to spell it out, it's obvious. Of course the year should come between the minutes and the seconds (ahem).
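
To see what that argument would look like for the present moment, you can ask date to print the current time in the same layout. This is only a sketch for checking the field order (the function name now_as_set_arg is made up); it reads the clock and does not set anything.

```shell
#!/bin/sh
# Print the current time in the mmddHHMMccyy.SS layout described above.
# Read-only: this reports the clock, it does not change it.
now_as_set_arg() {
    date +%m%d%H%M%Y.%S
}

now_as_set_arg
```

Reading the output left to right gives month, day, hour, minute, four-digit year, then the seconds after the dot.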

Getting the Date/Time

The more commonly used feature of the date command is reporting the current system date and time, and that is what we shall focus on here. It doesn't follow tradition, in that it uses the "+" and "%" symbols, instead of the "-" symbol, for its switches.
H = Hours, M = Minutes, S = Seconds, so:

$ date +%H:%M:%S
22:14:44

Which means that you can name a logfile like this:

#!/bin/sh
LOGFILE=/tmp/log_`date +%H%M%S`.log
echo Starting work > $LOGFILE
do_stuff >> $LOGFILE
do_more_stuff >> $LOGFILE
echo Finished >> $LOGFILE

This will create a logfile called /tmp/log_221444.log
You can also write a timestamp in front of each message in the logfile:

#!/bin/sh
LOGFILE=/tmp/log_`date +%H%M%S`.log
echo "`date +%H:%M:%S`: Starting work" > $LOGFILE
do_stuff >> $LOGFILE
echo "`date +%H:%M:%S`: Done do_stuff" >> $LOGFILE
do_more_stuff >> $LOGFILE
echo "`date +%H:%M:%S`: Done do_more_stuff" >> $LOGFILE
echo Finished >> $LOGFILE

This will produce a logfile along the lines of:

$ cat /tmp/log_221444.log
22:14:44: Starting work
do_stuff : Doing stuff, takes a short while
22:14:53: Done do_stuff
do_more_stuff : Doing more stuff, this is quite time consuming.
22:18:35: Done do_more_stuff
$
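
The two formats mentioned at the start of this article (a dated filename and a dated message prefix) can be produced the same way. Here is a small sketch; the variable APP and the function log are names chosen for illustration, and LC_ALL=C is an assumption made so that the month abbreviation comes out in English whatever the local language is.

```shell
#!/bin/sh
# Dated logfile name plus a helper that prefixes every message with a timestamp.
APP=myapp                                        # hypothetical application name
LOGFILE=/tmp/${APP}_log.`LC_ALL=C date +%d.%b.%Y`

# Append one line of the form "myapp: 11 Mar 2007 22:14:44: message".
log() {
    echo "$APP: `LC_ALL=C date '+%d %b %Y %H:%M:%S'`: $*" >> "$LOGFILE"
}

log "Something Happened"
sed -n '$p' "$LOGFILE"                           # show the line just written
```

Putting the timestamp logic in one function keeps the format consistent and avoids repeating the backquoted date call on every echo line.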

Counting the Seconds

UNIX has 1st Jan 1970 as a "special" date, the start of the system clock; GNU date will tell you how many seconds have elapsed since midnight on 1st Jan 1970:

$ date +%s
1173651284

Whilst this information is not very useful in itself, it may be useful to know how many seconds have elapsed between two events:

$ cat list.sh
#!/bin/sh
start=`date +%s`
ls -R $1 > /dev/null 2>&1
end=`date +%s`

 
diff=`expr $end - $start`
echo "Started at $start : Ended at $end"
echo "Elapsed time = $diff seconds"
$ ./list.sh /usr/share
Started at 1173651284 : Ended at 1173651290
Elapsed time = 6 seconds
$
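
For longer jobs a raw number of seconds is hard to read. The sketch below turns it into HH:MM:SS. It assumes a shell with POSIX arithmetic expansion (ksh, bash, or a POSIX /bin/sh); the function name fmt_elapsed is made up for this example.

```shell
#!/bin/sh
# Convert a number of seconds into HH:MM:SS.
fmt_elapsed() {
    secs=$1
    h=$(( secs / 3600 ))
    m=$(( secs % 3600 / 60 ))
    s=$(( secs % 60 ))
    printf '%02d:%02d:%02d\n' "$h" "$m" "$s"
}

fmt_elapsed 6       # 00:00:06
fmt_elapsed 3725    # 01:02:05
```

In list.sh you would call it as fmt_elapsed $diff instead of echoing the raw difference.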

For more useful switches, see the man page, but here are a few handy ones:

$ date "+%a %b %d" # (in the local language)
Sun Mar 11
$ date +%D # (show the full date)
03/11/07
$ date +%F # (In another format)
2007-03-11
$ date +%j # (how many days into the year)
070
$ date +%u # (day of the week)
7
$

Solaris: The RSTCHOWN parameter & Changing File Ownership

By default, in Solaris, the owner of a file or directory cannot use the chown command to change its owner. The chgrp command is restricted as well: the owner can only change the group of the file to a group that the owner belongs to.

However, this behavior can be changed by modifying /etc/system file by adding the following line:

set rstchown=0

You need to reboot your system after this change. Only root can arbitrarily change the ownership of a file whether or not this option is in effect.
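
Before editing the file it is worth checking whether a rstchown line is already there. The following is a minimal sketch of such a check; the function name rstchown_setting is hypothetical, and it relies on the /etc/system convention that lines starting with "*" are comments. A temporary sample file stands in for /etc/system so the example can be run anywhere.

```shell
#!/bin/sh
# Print the rstchown value set in an /etc/system style file, or "default"
# when there is no such line. Lines starting with "*" are comments.
rstchown_setting() {
    awk '
        /^[ \t]*\*/ { next }                      # skip comment lines
        /^[ \t]*set[ \t]+rstchown[ \t]*=/ {
            sub(/^[^=]*=[ \t]*/, "")              # drop everything up to "="
            val = $1
        }
        END { if (val == "") print "default"; else print val }
    ' "$1"
}

# Sample file standing in for /etc/system.
sample=/tmp/system.sample.$$
cat > "$sample" <<'EOF'
* restrict chown (this line is a comment)
set rstchown=0
EOF
rstchown_setting "$sample"                        # prints 0
rm -f "$sample"
```

On a real system you would call rstchown_setting /etc/system. Note that this only shows what is configured for the next boot, not the value in the running kernel.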

Let's look at the difference in the system's behavior in the following examples. We have a user account named user1, which is a member of the groups admin and dba. With the default behavior, when user1 tries to change the owner of the data.log file to user2, he gets the following message:

$ chown user2 data.log
chown: data.log: Not owner

$ ls -l
total 12
-rw-rw-r-- 1 user1 admin 5345 May 11 05:53 data.log

However, user1 can change the group of the data.log file to dba, because that is one of the two groups he is a member of. But when user1 tries to change the group of the data.log file to manager, the attempt fails.

$ chgrp dba data.log

$ ls -l
total 12
-rw-rw-r-- 1 user1 dba 5345 May 11 05:53 data.log

$ chgrp manager data.log
chgrp: data.log: Not owner

Now, after the /etc/system file modification, user1 is free to change owner or group of his data.log file.

$ chown user2 data.log

$ ls -l
total 12
-rw-rw-r-- 1 user2 admin 5345 May 11 05:53 data.log

$ chgrp manager data.log


$ ls -l
total 12
-rw-rw-r-- 1 user1 manager 5345 May 11 05:53 data.log

It is also possible to change the rstchown parameter on the fly, without a reboot, using adb (mdb). This change will not persist across reboots, though.

# adb -w -k /dev/ksyms /dev/mem
physmem 1f425
rstchown/D
rstchown:
rstchown: 1
rstchown/W 0
rstchown: 0x1 = 0x0
rstchown/D
rstchown:
rstchown: 0

You need to provide the name of the parameter you want to change together with a valid value. In the example above we printed the current value of rstchown using /D and then assigned a new value with /W 0.

Solaris 11: Activate FASTREBOOT to significantly reduce boot time


The Fast Reboot feature of Oracle Solaris 11 is supported on both SPARC and x86 platforms. The Fast Reboot feature implements an in-kernel boot loader that loads the kernel into memory and then switches to that kernel, so that the reboot process occurs within seconds.

For x86 systems:

A fast reboot bypasses the firmware and the boot loader. The running kernel loads the new kernel into memory and transfers control to it. If disk or kernel arguments are specified, they must be specified before other boot arguments.

For SPARC systems:

A fast reboot speeds up rebooting by skipping some POST tests.

The Fast Reboot feature is controlled by SMF and implemented through a boot configuration service, svc:/system/boot-config. The boot-config service provides a means for setting or changing the default boot parameters.

The fastreboot_default property of the boot-config service enables an automatic fast reboot of the system when either the reboot or the init 6 command is used. When the config/fastreboot_default property is set to true, the system automatically performs a fast reboot, without the need to use the reboot -f command. By default, this property's value is set to true on an x86 based system and false on a SPARC based system.

 The svc:/system/boot-config:default service consists of the following properties:

   config/fastreboot_default
   config/fastreboot_onpanic
These properties can be configured by using the svccfg and svcadm commands.

For example, to disable the default behavior of the fastreboot_onpanic property on an x86 based system, you would set the property's value to false, as shown here:

# svccfg -s "system/boot-config:default" setprop config/fastreboot_onpanic=false
# svcadm refresh svc:/system/boot-config:default
Changing one property's value does not affect the default behavior of the other property.

The following example shows how to make a fast reboot the default behavior on a SPARC based system by setting the boot-config SMF service property to true.

# svccfg -s "system/boot-config:default" setprop config/fastreboot_default=true
# svcadm refresh svc:/system/boot-config:default
Setting the property's value to true accelerates the reboot process, which enables systems that support the Fast Reboot feature to bypass certain POST tests. When the property is set to true, you can perform a fast reboot of the system without having to use the -f option with the reboot command.

To perform a standard (slow) reboot of a system that has the Fast Reboot feature enabled, without reconfiguring the boot-config service to disable the feature, use the -p option with the reboot command, as shown here:

# reboot -p
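
To see how the two properties are currently set before changing anything, svcprop can read them back. This is a small sketch around that command; the wrapper function show_fastreboot is a made-up name, and it simply prints a message on systems where svcprop is not installed.

```shell
#!/bin/sh
# Show both Fast Reboot properties of the boot-config service.
# svcprop is a Solaris command, so fall back to a message elsewhere.
show_fastreboot() {
    if command -v svcprop >/dev/null 2>&1; then
        for p in config/fastreboot_default config/fastreboot_onpanic; do
            printf '%s = ' "$p"
            svcprop -p "$p" svc:/system/boot-config:default
        done
    else
        echo "svcprop not found: not a Solaris system?"
    fi
}

show_fastreboot
```

Running it before and after the svccfg and svcadm refresh steps above is an easy way to confirm that a change has taken effect.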

Solaris: how to edit/vi huge files (GBs)


Every now and then you end up with the task of reading or editing a file that is huge. By huge I mean a file of at least 1GB, more likely several GBs.
You usually notice a huge file by the following message when you try to open it with the vi editor:

Not enough space in /var/tmp

The default directory /var/tmp for the vi editing buffer needs space equal to roughly twice the size of the file with which you are working, because vi uses extra lines for buffer manipulation.
This is how you choose another directory for vi's editing buffer:

$ cd /where_your_file_is_located
$ vi
:set directory=/tmp
:e filename

You can replace /tmp with any other directory that has enough space for the editing buffer. The filename is the name of your large file.
If you just need to read the file and do not plan to edit it, you can use one of the commands that need fewer resources, such as less, head or tail.
Another option is to split the file, either by size or by number of lines.

$ split -l 10000 xyz.txt

 
Each of the new files will have 10000 lines. Or you can try


$ split -b 250m xyz.txt xyz.txt.split

which will create 250MB chunks out of the original file.
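
After editing one of the pieces you will want the whole file back. The sketch below walks through the round trip on a small generated file standing in for the huge one: split by lines, concatenate the pieces in order, and compare against the original. The directory and file names are placeholders.

```shell
#!/bin/sh
# Split a file into line-based pieces, rebuild it, and confirm nothing was lost.
work=/tmp/split_demo.$$
mkdir -p "$work" || exit 1

# 25-line stand-in for the large file.
awk 'BEGIN { for (i = 1; i <= 25; i++) print "line " i }' > "$work/xyz.txt"

# 10 lines per piece: xyz.txt.part.aa, .ab and .ac
split -l 10 "$work/xyz.txt" "$work/xyz.txt.part."
pieces=`ls "$work"/xyz.txt.part.* | wc -l`
echo "pieces: $pieces"

# ... an individual piece could be edited with vi at this point ...

# The shell expands the glob in sorted order, which is the order split wrote.
cat "$work"/xyz.txt.part.* > "$work/xyz.txt.new"
if cmp -s "$work/xyz.txt" "$work/xyz.txt.new"; then
    result=identical
else
    result=different
fi
echo "rebuilt file is $result"
rm -rf "$work"
```

Line-based splitting keeps every piece a valid text file; with split -b a piece boundary can fall in the middle of a line.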

If you prefer to use vim over vi, you might try to press Ctrl-C when loading your file. Vim tries to read in the whole file initially to do things like syntax highlighting, number of lines in file and so on. Ctrl-C will cancel this and will only load what's needed to display on your screen.

Solaris 10: Unable to login into NonGlobal Zones (NGZ) after Patching


Yesterday I faced an issue where I was unable to log in to the NGZs after kernel patching, although zlogin was working perfectly.
Initially I thought the cause was an ssh key, but when I tried to log in to the NGZ over the network, ssh reported connection refused.
I checked the ssh service via zlogin. Many network-related services were not running, including ssh. All of them depended on the /system/sysidtool:net service, which was disabled. I tried to enable /system/sysidtool:net, but no luck.

# svcs -vx
# svcs -a | grep -i /system/sysidtool:net
# svcadm enable /system/sysidtool:net
# svcs -a | grep -i /system/sysidtool:net
# svcs -vx

Then I examined the log of the failing service and found:

[ Aug 27 09:15:49 Method "start" exited with status 0 ]
[ Aug 27 09:36:58 Enabled. ]
[ Aug 27 09:37:01 Executing start method ("/lib/svc/method/sysidtool-net") ]
/etc/.UNCONFIGURED not found. System already configured, /lib/svc/method/sysidtool-net exiting.
[ Aug 27 09:37:01 Method "start" exited with status 0 ]
[ Aug 27 14:58:06 Enabled. ]
[ Aug 27 14:58:12 Executing start method ("/lib/svc/method/sysidtool-net") ]
ifconfig: status: SIOCGLIFFLAGS: fjgi0: no such interface
ifconfig: setifflags: SIOCGLIFFLAGS: fjgi0: no such interface
ifconfig: status: SIOCGLIFFLAGS: fjgi7: no such interface
ifconfig: setifflags: SIOCGLIFFLAGS: fjgi7: no such interface
Terminated

At this point I checked the interfaces, but all were up and running fine in the global as well as in the non-global zone.
Then one particular line got my attention:

"/etc/.UNCONFIGURED not found. System already configured, /lib/svc/method/sysidtool-net exiting."

I checked /etc with ls -la in the NGZ and found a file named .UNCONFIGURE, which was of zero size.
I removed this file and rebooted the NGZ. After that, all services started successfully.

# cd /etc
# ls -la
# rm .UNCONFIGURE
# zoneadm -z zone-name reboot
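
With several zones on a host, the same check can be done from the global zone without logging in to each one. The following is only a sketch: the function name list_markers and the zonepath /zones/zone-name are hypothetical, and it merely lists candidate files and removes nothing.

```shell
#!/bin/sh
# List sysid marker files (.UNCONFIGURE*) in a zone's /etc directory and
# flag the empty ones. Intended to be pointed at <zonepath>/root/etc.
list_markers() {
    for f in "$1"/.UNCONFIGURE*; do
        [ -f "$f" ] || continue          # the glob matched nothing
        if [ -s "$f" ]; then
            echo "$f"
        else
            echo "$f (empty)"
        fi
    done
}

# /zones/zone-name is a placeholder; use the path that zoneadm list -cv reports.
list_markers /zones/zone-name/root/etc
```

An empty result means no marker file was found in that zone.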

Solaris 10: How to give Special Privileges to specific Zones


When a Solaris zone is installed, Solaris assigns it a set of privileges that restricts activities which could affect the overall system, that is, the zone itself as well as other zones. Solaris classifies the available privileges as:
1. Default Privileges
2. Optional Privileges
3. Prohibited Privileges
  • Default Privileges: assigned to every zone during installation
Example: file_chown
Allows a process to change a file's owner user ID.
Allows a process to change a file's group ID to one other than the process' effective group ID or one of the process' supplemental group IDs.
  • Optional Privileges: have to be assigned manually, using the limitpriv property
Example: sys_time
Allows a process to manipulate system time using any of the appropriate system calls: stime, adjtime, ntp_adjtime and the IA specific RTC calls.
Why is it optional?
A system with non-global zones shares one kernel among all zones (the global zone as well as all configured non-global zones). As a result there is only one date/time on the entire setup, and this time is usually controlled by the global zone only. By default the privilege to change the date and time is not available inside a non-global zone, and therefore the NTP service will fail to adjust the time.
The default configuration for non-global zones assumes that the time synchronization is done in the global zone and that there is no need to adjust the system time from inside a non-global zone.
If the administrator of a non-global zone is able to change the system time then these changes will affect all running zones (including the global zone) and this may be considered a security risk.
How do we assign an optional privilege to a zone?
global-zone# zonecfg -z myzone info limitpriv
limitpriv:

global-zone# zonecfg -z myzone set limitpriv="default,sys_time"
global-zone# zoneadm -z myzone reboot  
Once the sys_time privilege is available in the non-global zone you can continue to setup NTP as usual, i.e. configure the /etc/inet/ntp.conf file and enable the ntp service.
  • Prohibited Privileges: cannot currently be assigned to a zone
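
One thing to watch with the limitpriv setting shown earlier is that set replaces the whole value, so a privilege granted previously is lost if you forget to repeat it. The sketch below builds the new value from the current one; the function name with_priv is made up, and it treats an empty limitpriv as the default set, as in the zonecfg info output above.

```shell
#!/bin/sh
# Return a limitpriv value that contains the requested privilege while
# keeping whatever is already configured. Empty input means "default".
with_priv() {
    current=${1:-default}
    priv=$2
    case ",$current," in
        *",$priv,"*) echo "$current" ;;          # already present
        *)           echo "$current,$priv" ;;
    esac
}

with_priv "" sys_time                      # default,sys_time
with_priv "default,sys_time" sys_time      # default,sys_time
with_priv "default,sys_time" dtrace_proc   # default,sys_time,dtrace_proc
```

The resulting string is what you would pass to zonecfg -z myzone set limitpriv="...".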

 
How do we list the privileges currently available to a zone? Log in to the zone and list the "zone" privilege set:
myzone# ppriv -lv zone
contract_event
Allows a process to request critical events without limitation.
Allows a process to request reliable delivery of all events on
any event queue.
contract_observer
Allows a process to observe contract events generated by
contracts created and owned by users other than the process's
effective user ID.
Allows a process to open contract event endpoints belonging to
contracts created and owned by users other than the process's
effective user ID.
file_chown
Allows a process to change a file's owner user ID.
Allows a process to change a file's group ID to one other than
the process' effective group ID or one of the process'
supplemental group IDs.

-------------------------- SNIP ---------------------------------

 
Below is the list of privileges available in Solaris 10 and their status in zones:

 

Privilege           Status              Notes
---------           ------              -----
cpc_cpu             Optional            Access to certain cpc(3CPC) counters
dtrace_proc         Optional            fasttrap and pid providers; plockstat(1M)
dtrace_user         Optional            profile and syscall providers
graphics_access     Optional            ioctl(2) access to agpgart_io(7I)
graphics_map        Optional            mmap(2) access to agpgart_io(7I)
net_rawaccess       Optional in         Raw PF_INET/PF_INET6 packet access
                    shared-IP zones;
                    Default in
                    exclusive-IP zones
proc_clock_highres  Optional            Use of high resolution timers
proc_priocntl       Optional            Scheduling control; priocntl(1)
sys_ipc_config      Optional            Raising IPC message queue buffer size
sys_time            Optional            System time manipulation; xntp(1M)
dtrace_kernel       Prohibited          Currently unsupported
proc_zone           Prohibited          Currently unsupported
sys_config          Prohibited          Currently unsupported
sys_devices         Prohibited          Currently unsupported
sys_linkdir         Prohibited          Currently unsupported
sys_net_config      Prohibited          Currently unsupported
sys_res_config      Prohibited          Currently unsupported
sys_suser_compat    Prohibited          Currently unsupported
proc_exec           Required, Default   Used to start init(1M)
proc_fork           Required, Default   Used to start init(1M)
sys_mount           Required, Default   Needed to mount required file systems
sys_ip_config       Required, Default   Required to boot zone and initialize IP
                    in exclusive-IP     networking in exclusive-IP zone
                    zones; Prohibited
                    in shared-IP zones
contract_event      Default             Used by contract file system
contract_observer   Default             Contract observation regardless of UID
file_chown          Default             File ownership changes
file_chown_self     Default             Owner/group changes for own files
file_dac_execute    Default             Execute access regardless of mode/ACL
file_dac_read       Default             Read access regardless of mode/ACL
file_dac_search     Default             Search access regardless of mode/ACL
file_dac_write      Default             Write access regardless of mode/ACL
file_link_any       Default             Link access regardless of owner
file_owner          Default             Other access regardless of owner
file_setid          Default             Permission changes for setid, setgid, setuid files
ipc_dac_read        Default             IPC read access regardless of mode
ipc_dac_write       Default             IPC write access regardless of mode
ipc_owner           Default             IPC other access regardless of mode
net_icmpaccess      Default             ICMP packet access: ping(1M)
net_privaddr        Default             Binding to privileged ports
proc_audit          Default             Generation of audit records
proc_chroot         Default             Changing of root directory
proc_info           Default             Process examination
proc_lock_memory    Default             Locking memory; shmctl(2) and mlock(3C).
                                        If this privilege is assigned to a
                                        non-global zone by the system
                                        administrator, consider also setting the
                                        zone.max-locked-memory resource control
                                        to prevent the zone from locking all
                                        memory.
proc_owner          Default             Process control regardless of owner
proc_session        Default             Process control regardless of session
proc_setid          Default             Setting of user/group IDs at will
proc_taskid         Default             Assigning of task IDs to caller
sys_acct            Default             Management of accounting
sys_admin           Default             Simple system administration tasks
sys_audit           Default             Management of auditing
sys_nfs             Default             NFS client support
sys_resource        Default             Resource limit manipulation