UnixPedia : HPUX / LINUX / SOLARIS: November 2018

Wednesday, November 28, 2018

Linux : How to restart vmtoolsd cleanly.

At times the vmtoolsd process fails to start properly. We have to find out whether one of its libraries is still held open by another process.

Kill the offending process, then restart vmtoolsd.

#-> cat /var/log/messages| grep -i tool
Oct 14 14:27:43 SAS_SYSTEM ansible-command: Invoked with warn=True executable=None _uses_shell=True _raw_params=#/etc/init.d/vmware-tools status;date;uptime removes=None creates=None chdir=None
Oct 14 17:42:02 SAS_SYSTEM init: vmware-tools post-stop process (26407) terminated with status 1
Oct 14 23:15:02 SAS_SYSTEM init: vmware-tools post-stop process (30772) terminated with status 1


[root@SAS_SYSTEM:/var/adm/install-logs]#
#-> #/etc/init.d/vmware-tools status
Checking vmware-tools...
vmware-tools    start/running

[root@SAS_SYSTEM:/var/adm/install-logs]#
#-> ps -ef |grep -i vmtoolsd
root      7519  6939  0 18:00 pts/1    00:00:00 grep -i vmtoolsd
[root@SAS_SYSTEM:/var/adm/install-logs]#


[root@SAS_SYSTEM:/var/adm/install-logs]#
#-> ps -ef |grep -i vm
root       351     2  0 Jun01 ?        00:00:00 [vmw_pvscsi_wq_2]
root       916     2  0 Jun01 ?        00:04:00 [vmmemctl]
root      7642  6939  0 18:01 pts/1    00:00:00 grep -i vm

lsof |grep -i vmtoolsd

[root@SAS_SYSTEM:/var/adm/install-logs]#
#-> lsof |grep -i vmtoolsd
[root@SAS_SYSTEM:/var/adm/install-logs]#

Is this a physical or an SDDC server?

SDDC

#/etc/vmware-tools/services.sh  status

[root@SAS_SYSTEM:/var/adm/install-logs]#
#-> #/etc/vmware-tools/services.sh status
vmtoolsd is not running
[root@SAS_SYSTEM:/var/adm/install-logs]#

#/etc/vmware-tools/services.sh  start

[root@SAS_SYSTEM:/var/adm/install-logs]#
#-> #/etc/vmware-tools/services.sh start
[root@SAS_SYSTEM:/var/adm/install-logs]#
#-> #/etc/vmware-tools/services.sh status
vmtoolsd is not running
[root@SAS_SYSTEM:/var/adm/install-logs]#

#/etc/vmware-tools/services.sh  restart

#/etc/vmware-tools/services.sh  stop

#/etc/vmware-tools/services.sh  start

[root@SAS_SYSTEM:/var/adm/install-logs]#
#-> #/etc/vmware-tools/services.sh restart
Stopping VMware Tools services in the virtual machine:
   Guest operating system daemon:                          [  OK  ]
   VGAuthService:                                          [  OK  ]
   VMware User Agent (vmware-user):                        [  OK  ]
   Unmounting HGFS shares:                                 [  OK  ]
   Guest filesystem driver:                                [  OK  ]
   VM communication interface socket family:               [WARNING]
   VM communication interface:                             [WARNING]
[root@SAS_SYSTEM:/var/adm/install-logs]#
#-> #/etc/vmware-tools/services.sh status
vmtoolsd is not running
[root@SAS_SYSTEM:/var/adm/install-logs]#
#-> #/etc/vmware-tools/services.sh stop
Stopping VMware Tools services in the virtual machine:
   Guest operating system daemon:                          [  OK  ]
   VGAuthService:                                          [  OK  ]
   VMware User Agent (vmware-user):                        [  OK  ]
   Unmounting HGFS shares:                                 [  OK  ]
   Guest filesystem driver:                                [  OK  ]
   VM communication interface socket family:               [WARNING]
   VM communication interface:                             [WARNING]
[root@SAS_SYSTEM:/var/adm/install-logs]#
#-> #/etc/vmware-tools/services.sh start
[root@SAS_SYSTEM:/var/adm/install-logs]#
#-> #/etc/vmware-tools/services.sh status
vmtoolsd is not running


Check whether any process is holding the vmtools libraries open.

[root@SAS_SYSTEM:/var/adm/install-logs]#

lsof |grep -i tool

[root@SAS_SYSTEM:/var/adm/install-logs]#
#-> lsof |grep -i tool
nbdisco    2784      root  mem       REG             253,10  7876150     110784 /usr/openv/lib/libvCloudTools.so
nbdisco    2784      root  mem       REG             253,10 11624190     110746 /usr/openv/lib/libnbuVmwareTools.so
sapcimb   47221      root  DEL       REG              253,7              240710 /usr/lib/vmware-tools/lib64/libvmGuestLib.so/libvmGuestLib.so
[root@SAS_SYSTEM:/var/adm/install-logs]#


Here PID 47221 (sapcimb) is holding a deleted vmware-tools library open, which is blocking vmtoolsd from starting.

[root@SAS_SYSTEM:/var/adm/install-logs]#
#-> ps -ef |grep -i 47221
root      9544  6939  0 18:09 pts/1    00:00:00 grep -i 47221
root     47221  2962  0 Jun22 ?        00:00:00 /usr/sap/hostctrl/exe/sapcimb -format flat -tracelevel 1 -nonull -continue-on-error -metadata -enumi -namespace root/cimv2 -class SAP_MetricValue
[root@SAS_SYSTEM:/var/adm/install-logs]#

#kill -9 47221     

[root@SAS_SYSTEM:/var/adm/install-logs]#
#-> ps -ef |grep -i 47221
root      9739  6939  0 18:10 pts/1    00:00:00 grep -i 47221
[root@SAS_SYSTEM:/var/adm/install-logs]#

#/etc/vmware-tools/services.sh restart

[root@SAS_SYSTEM:/var/adm/install-logs]#
#-> #/etc/vmware-tools/services.sh status
vmtoolsd is running
[root@SAS_SYSTEM:/var/adm/install-logs]#

ps -ef |grep -i vmtoolsd

[root@SAS_SYSTEM:/var/adm/install-logs]#
#-> ps -ef| grep -i tool
root     11333     1  0 18:11 ?        00:00:00 /usr/sbin/vmtoolsd
root     11580  6939  0 18:11 pts/1    00:00:00 grep -i tool
[root@SAS_SYSTEM:/var/adm/install-logs]#
#-> ps -ef| grep -i tool
root     11333     1  0 18:11 ?        00:00:00 /usr/sbin/vmtoolsd
root     11580  6939  0 18:11 pts/1    00:00:00 grep -i tool
[root@SAS_SYSTEM:/var/adm/install-logs]#
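
The same checks can be scripted. Below is a minimal sketch of the flow used above; the service script path (/etc/vmware-tools/services.sh) matches this system, so adjust it for your VMware Tools version.

#!/bin/sh
# Sketch only: find a process holding a deleted vmware-tools library,
# kill it, then restart and verify vmtoolsd (paths as used on this system).

/etc/vmware-tools/services.sh status

# PIDs with a deleted (DEL) mapping of a vmware-tools library
PIDS=$(lsof 2>/dev/null | awk '/DEL/ && /vmware-tools/ {print $2}' | sort -u)

for p in $PIDS
do
    echo "Process holding a vmware-tools library:"
    ps -fp "$p"
    kill -9 "$p"        # review before killing in a real environment
done

/etc/vmware-tools/services.sh restart
/etc/vmware-tools/services.sh status
ps -ef | grep -i '[v]mtoolsd'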

Tuesday, November 27, 2018

How to check and collect logs for a Centrify issue.


This needs to be done while the issue is occurring; in this case the adclient agent was using about 40% of the CPU.

On the Centrify Unix server, as root or sudo, please run the following commands: 

1) Run: /usr/share/centrifydc/bin/addebug on
(Switch on debug log and watch for any errors) 

2) Run: /usr/share/centrifydc/bin/addebug clear
(Will clear any previous debug log /var/log/centrifydc.log) 

3) Make sure /var/log/centrifydc.log is growing in size. 

4) Reproduce the issue -- restart adclient, monitor until %cpu by adclient goes high, mark the %cpu for our reference, then go to step 5. 

5) Run: /usr/share/centrifydc/bin/addebug off
(Switch off debug log) 

6) Run: /usr/bin/adinfo -t
7) Reply to the email thread (including support@centrify.com) or attach the logs online, along with the debug log (/var/log/centrifydc.log) and the adinfo output.
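
For convenience, the collection steps above can be wrapped in a small script. A minimal sketch, assuming the standard Centrify paths shown above; the sleep duration and the /tmp output file are arbitrary placeholders.

#!/bin/sh
# Sketch: toggle Centrify debug logging around a reproduction window
# (paths as in the steps above).

/usr/share/centrifydc/bin/addebug on       # switch on debug logging
/usr/share/centrifydc/bin/addebug clear    # clear any previous /var/log/centrifydc.log

ls -l /var/log/centrifydc.log              # confirm the log exists and is growing

# Reproduce the issue here (e.g. restart adclient and watch its %CPU),
# then let the log collect for an arbitrary window.
sleep 600

/usr/share/centrifydc/bin/addebug off      # switch off debug logging
/usr/bin/adinfo -t > /tmp/adinfo_t.out     # hypothetical output file to attach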

Monday, November 26, 2018

How to check the open-file limit allowed for a user

The maximum number of open files allowed for this user had reached its limit.

[root@Heiniken:/root]#
#-> ulimit -u sascmxep -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1450297
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1450297
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

[root@Heiniken:/root]#
#-> lsof  -u sascmxep | wc -l
1602


[root@Heiniken:/root]#
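
Note that ulimit reports the limits of the current shell, so to check another user's open-file limit it is safer to run ulimit as that user and compare it against the lsof count. A minimal sketch, reusing the user name sascmxep from the transcript above:

#!/bin/sh
# Sketch: compare a user's open-file usage against the per-process limit.
USER_NAME=sascmxep

LIMIT=$(su - "$USER_NAME" -c 'ulimit -n')         # open-files soft limit for that user
OPEN=$(lsof -u "$USER_NAME" 2>/dev/null | wc -l)  # rough count of that user's open descriptors

echo "open files: $OPEN  (limit per process: $LIMIT)"

# Per-process view for a specific PID of that user:
# cat /proc/<PID>/limits | grep 'open files'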

Friday, November 23, 2018

HPUX : Printer Unix queue enablement and old job cleanup script

Unix : HP-UX
This script is applicable to the HP-UX print setup.

#cat >SS-Printer_Enabling_and_Defucnt_jobid_check.sh
# Find disabled printers and enable them.
#
SKDSB_Q=`lpstat -p | grep -i disable | awk '/printer/ {print $2}'`
for var in $SKDSB_Q
do
    SKPRNT_STATE=`lpstat -p$var | grep -i disabled | wc -l`

    # Remove job IDs that have no data associated with them; such jobs
    # leave the printer in a hung state and block the job flow.

    echo "# printer $var `date` : `lpstat -o$var | grep -i "???"`" >> /var/adm/lp/enable_printer_script.log
    lpstat -o$var | grep -i "???"
    if [ $? -eq 0 ]
    then
        echo "# printer $var `date` :`lpstat -o$var | head -10`" >> /var/adm/lp/enable_printer_script.log
        DEFUNCTJOBID=`lpstat -o$var | grep -i $var | head -1 | awk '{print $1}'`
        echo "# printer $var `date` :Canceling $DEFUNCTJOBID from printer queue $var" >> /var/adm/lp/enable_printer_script.log
        cancel $DEFUNCTJOBID
    fi

    # Enable the printer if its queue is disabled.

    if [ $SKPRNT_STATE -ne 0 ]
    then
        echo "# printer $var `date` : `lpstat -p$var | grep -i $var `" >> /var/adm/lp/enable_printer_script.log
        enable $var
        echo "# printer $var `date` : `lpstat -p$var | grep -i $var `" >> /var/adm/lp/enable_printer_script.log
    fi
done

# Remove print requests older than 30 days (excluding remote-sending,
# sending-status and cancel entries).
find /var/spool/lp/request/ -xdev -type f -mtime +30 -exec ls -ltr {} \; > /tmp/queuelistforremoval.txt
cat /tmp/queuelistforremoval.txt | grep -vE "remotesending|sendingstatus|cancel" | awk '{print $9}' | while read i
do
    rm $i
done

exit

--------------

Add the above script to crontab so that it runs every 15 minutes, as shown below.
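
A sample crontab entry (crontab -e as root), assuming the script is saved as /usr/local/bin/SS-Printer_Enabling_and_Defucnt_jobid_check.sh; adjust the path to wherever you keep it. HP-UX cron does not support */15, so the minutes are listed explicitly.

# Run the printer enable/cleanup script every 15 minutes
0,15,30,45 * * * * /usr/local/bin/SS-Printer_Enabling_and_Defucnt_jobid_check.sh >/dev/null 2>&1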

Linux : Sending mail in HTML format

A user was not able to send HTML-formatted content by mail; it arrived distorted. Below is the solution proposed to resolve it.

Add the header lines (the content highlighted in orange in the original post) at the top of the HTML-format file.

#-> cat body_text.txt
From: abct@bcs.com
To:  abct@bcs.com
Subject: MIME Test
Mime-Version: 1.0
Content-Type: text/html
<html><body><font color="#1F497D"> Hi Team,</br></br>Count of HIVE tables against Teradata staging tables is not same for validity date '''2018-11-18'''.</br></br></font><table border="1"  ><tr><td align="center" bgcolor="#2F75B5" valign="top"><font color="white">Hive Table Name</font></td><td align="left" bgcolor="#2F75B5" valign="top"><font color="white">Hive Count</font></td><td align="left" bgcolor="#2F75B5" valign="top"><font color="white">Teradata Table Name</font></td><td align="left" bgcolor="#2F75B5" valign="top"><font color="white">Teradata Count</font></td><td align="left" bgcolor="#2F75B5" valign="top"><font color="white">Status</font></td></tr><tr><tr><td align="left"  valign="top">GSGO_M09_TD_F42199_H</td><td align="left"  valign="top">317067288</td><td align="left"  valign="top">M09_TD_F42199_H</td><td align="left"  valign="top">308660931</td><td align="center" bgcolor="#C00000" valign="top"><font color="white">FAILED</font></td></tr><tr><td align="left"  valign="top">GSGO_M29_TD_BIC_OHZDP_HC045_H</td><td align="left"  valign="top">19682893</td><td align="left"  valign="top">M29_TD_BIC_OHZDP_HC045_H</td><td align="left"  valign="top">19954073</td><td align="center" bgcolor="#C00000" valign="top"><font color="white">FAILED</font></td></tr><tr><td align="left"  valign="top">GSGO_M02_TD_F3102_H</td><td align="left"  valign="top">62696104</td><td align="left"


#cat body_text.txt |sendmail -t

Linux : Storage migration plan on a Linux system

1. Verify the existing LUNs
# multipath -ll
2. Verify the LVM layout
# pvs;vgs;lvs
3. Save the above disk output for comparison after the migration.

4. Once the storage team assigns the new disk, scan for it and make sure it is available to the OS
# hp_rescan -a
# df -h
# multipath -ll
# fdisk -l                     ------------ To check the newly attached disk (suppose the disk is /dev/disk/by-id/scsi-mpathc)
# pvs;vgs;lvs
# pvcreate /dev/disk/by-id/scsi-mpathc
# pvscan
# pvdisplay /dev/disk/by-id/scsi-mpathb /dev/disk/by-id/scsi-mpathc
# vgs
# vgextend vg01 /dev/disk/by-id/scsi-mpathc
# lvs -a -o +devices,size
# pvmove --background /dev/disk/by-id/scsi-mpathb /dev/disk/by-id/scsi-mpathc
# lvs -a -o +devices,size      ------------ Verify the LVs are now on /dev/disk/by-id/scsi-mpathc



Remove the old LUN from the VG

# vgreduce vg01 /dev/disk/by-id/scsi-mpathb
# pvremove /dev/disk/by-id/scsi-mpathb

Remove the LUN from the system.

# dmsetup remove /dev/disk/by-id/scsi-mpathb
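
Before removing the old LUN it is worth confirming that no logical volume still references it. A minimal check sketch, reusing the device names from the plan above:

# Sketch: confirm the old PV is empty before vgreduce/pvremove.
OLD_PV=/dev/disk/by-id/scsi-mpathb

# No LV segments should list the old device any more
lvs -a -o +devices | grep mpathb

# Allocated space on the old PV should be 0
pvs -o pv_name,vg_name,pv_size,pv_used "$OLD_PV"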

Thursday, November 22, 2018

Data corruption due to an FS being active on multiple nodes.

On the earth system, data became corrupted after a data copy. Below is the output seen when running ll or du in the directory.

[root@earth:/root]#
#-> cd /sap_refresh02
[root@earth:/sap_refresh02]#
#-> ll
ls: cannot access RSE: No such device or address
ls: cannot access INCRBKP: No such device or address
total 24
?????????? ? ?      ?            ?            ? INCRBKP
drwxr-xr-x 2 root   root        96 Oct  8  2014 lost+found
drwxrwxrwx 2 oracle oinstall 24576 Nov 22 05:13 RMAN_RBE_RSE
?????????? ? ?      ?            ?            ? RSE
drwxrwxr-x 2 oracle oinstall    96 Nov 22 12:34 RSE_DB
drwxrwxr-x 2 oracle oinstall    96 Nov 22 12:34 RSE_DB_INCR
-rw------- 1 root   root         0 Nov 22 11:20 sifh
drwx------ 2 root   root        96 Nov 22 13:34 Test
[root@earth:/sap_refresh02]#
#-> ll RSE_DB
total 0
[root@earth:/sap_refresh02]#
#-> ll RSE_DB_INCR
total 0
[root@earth:/sap_refresh02]#
#-> ll RMAN_RBE_RSE
ls: cannot access RMAN_RBE_RSE/PSAPDAT_4.tf: No such device or address
ls: cannot access RMAN_RBE_RSE/PSAPDAT_11.tf: No such device or address
ls: cannot access RMAN_RBE_RSE/PSAPDAT_12.tf: No such device or address
ls: cannot access RMAN_RBE_RSE/PSAPDAT_13.tf: No such device or address
ls: cannot access RMAN_RBE_RSE/PSAPDAT_14.tf: No such device or address
ls: cannot access RMAN_RBE_RSE/PSAPDAT_18.tf: No such device or address
ls: cannot access RMAN_RBE_RSE/PSAPDAT_19.tf: No such device or address


Reason :
   The same set of LUNs was active and mounted on other nodes; active I/O from those nodes corrupted the LV.
 
Solution :
   Deactivate the volume and remove the LV from the other nodes.
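
On the other nodes, the volume can be taken offline roughly as follows. A minimal sketch; vg01/lv_sap_refresh02 is a hypothetical VG/LV name standing in for the volume behind /sap_refresh02.

# Sketch (run on the OTHER nodes, not the node that should own the data):
umount /sap_refresh02                      # unmount the filesystem if it is mounted
lvchange -an /dev/vg01/lv_sap_refresh02    # deactivate the logical volume
vgchange -an vg01                          # optionally deactivate the whole VG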

System is rebooting abnormally due to a block-size issue on one of the file systems.


A 1 KiB block size was used for the file system, and it was stated in a previous case, 01960120, that this block size should not be used.

------------------------------------------------------------ possible mitigation activities
o As the issue may be a function of the file system block size, refrain from using a file system block size of 1 KiB.
------------------------------------------------------------

crash> sys
KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/3.10.0-693.17.1.el7.x86_64/vmlinux
DUMPFILE: /cores/retrace/tasks/664521057/crash/vmcore [PARTIAL DUMP]
CPUS: 20
DATE: Sat Nov 10 16:17:01 2018
UPTIME: 11 days, 10:24:29
LOAD AVERAGE: 8.78, 9.25, 8.80
TASKS: 1093
NODENAME: ITSUSRALSP05403
RELEASE: 3.10.0-693.17.1.el7.x86_64
VERSION: #1 SMP Sun Jan 14 10:36:03 EST 2018
MACHINE: x86_64 (2397 Mhz)
MEMORY: 96 GB
PANIC: "kernel BUG at fs/jbd2/journal.c:766!"

crash> mod -t
NAME TAINTS
redirfs OE
gsch OE

o Existing file system errors
crash> log | grep -i ext | grep -v gsch
[778861.340894] EXT4-fs (dm-12): error count since last fsck: 115
[778861.340898] EXT4-fs (dm-12): initial error at time 1495053905: ext4_validate_block_bitmap:381
[778861.340901] EXT4-fs (dm-12): last error at time 1540908042: ext4_validate_block_bitmap:384
[865368.129522] EXT4-fs (dm-12): error count since last fsck: 115
[865368.129526] EXT4-fs (dm-12): initial error at time 1495053905: ext4_validate_block_bitmap:381
[865368.129528] EXT4-fs (dm-12): last error at time 1540908042: ext4_validate_block_bitmap:384
[951874.916775] EXT4-fs (dm-12): error count since last fsck: 115
[951874.916780] EXT4-fs (dm-12): initial error at time 1495053905: ext4_validate_block_bitmap:381
[951874.916782] EXT4-fs (dm-12): last error at time 1540908042: ext4_validate_block_bitmap:384
[987861.755674] RIP: 0010:[] [] jbd2_journal_next_log_block+0x79/0x80 [jbd2]
[987861.758359] RIP [] jbd2_journal_next_log_block+0x79/0x80 [jbd2]

o Several messages relating to the third-party kernel module and ext
crash> log | grep -i ext | grep gsch_flt | awk '{for (i=2;i<=NF;i++){printf "%s ",$i ; if (i==NF) print ""}}' | sort | uniq -c | sort -rn
243 gsch_flt_add_mnt(/var/tmp @ Unknown[ef53(ext3)]) done: 0
243 gsch_flt_add_mnt(/ @ Unknown[ef53(ext3)]) done: 0
243 gsch_flt_add_mnt(/tmp @ Unknown[ef53(ext3)]) done: 0
121 gsch_flt_add_mnt(/boot @ Unknown[ef53(ext3)]) done: 0

o Processes just started and were in an uninterruptible state.
crash> ps -m | grep UN
[ 0 00:00:00.000] [UN] PID: 4073 TASK: ffff880431bdcf10 CPU: 14 COMMAND: "oracle_4073_mra"
[ 0 00:00:00.000] [UN] PID: 29624 TASK: ffff8804ee2c0000 CPU: 12 COMMAND: "ora_j000_mraq04"
[ 0 00:00:00.005] [UN] PID: 3805 TASK: ffff8806cd771fa0 CPU: 1 COMMAND: "oracle_3805_mra"
[ 0 00:00:00.017] [UN] PID: 43209 TASK: ffff8807d3e09fa0 CPU: 6 COMMAND: "oracle_43209_mr"

o Crashing process
crash> bt
PID: 2296 TASK: ffff88115d290fd0 CPU: 4 COMMAND: "jbd2/dm-12-8"
#0 [ffff88115cbcf930] machine_kexec at ffffffff8105c63b
#1 [ffff88115cbcf990] __crash_kexec at ffffffff81106922
#2 [ffff88115cbcfa60] crash_kexec at ffffffff81106a10
#3 [ffff88115cbcfa78] oops_end at ffffffff816b0aa8
#4 [ffff88115cbcfaa0] die at ffffffff8102e87b
#5 [ffff88115cbcfad0] do_trap at ffffffff816b01f0
#6 [ffff88115cbcfb20] do_invalid_op at ffffffff8102b174
#7 [ffff88115cbcfbd0] invalid_op at ffffffff816bd1ae
[exception RIP: jbd2_journal_next_log_block+121]
RIP: ffffffffc014ad99 RSP: ffff88115cbcfc88 RFLAGS: 00010246
RAX: 0000000000000001 RBX: ffff88115b417800 RCX: 0000000000000008
RDX: 0000000000038818 RSI: ffff88115cbcfd38 RDI: ffff88115b41782c
RBP: ffff88115cbcfca0 R8: ffff8804464fbbc8 R9: 0000000000000000
R10: 0000000000000001 R11: 0000040000000400 R12: ffff88115b417828
R13: ffff88115cbcfd38 R14: ffff88115b417800 R15: 000000000000000b
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#8 [ffff88115cbcfc80] jbd2_journal_next_log_block at ffffffffc014ad40 [jbd2]
#9 [ffff88115cbcfca8] jbd2_journal_commit_transaction at ffffffffc01437c8 [jbd2]
#10 [ffff88115cbcfe48] kjournald2 at ffffffffc0149a79 [jbd2]
#11 [ffff88115cbcfec8] kthread at ffffffff810b270f
#12 [ffff88115cbcff50] ret_from_fork at ffffffff816b8798

crash> mount | awk 'NR == 1 || $0 ~ "vg_oraarch-lv_oraarch"'
MOUNT SUPERBLK TYPE DEVNAME DIRNAME
ffff881159887780 ffff88115b714000 ext3 /dev/mapper/vg_oraarch-lv_oraarch /u02/oraarch

o 1 KiB blocksize again.
crash> super_block.s_blocksize ffff88115b714000
s_blocksize = 1024

Is there a reason why the 1 KiB blocksize is still being used?

### Next Steps

o State why the 1 KiB block size is being used when it was expressed previously to avoid such a small blocksize.

Resolution :
  1. Recreate the file system with the recommended 4 KiB block size.


Linux : How to change the block size of the file system on a logical volume

Note: the block size cannot be changed in place; the file system must be recreated, which destroys its contents, so back up the data first and restore it afterwards.

1) Check the block size of the current device.
# tune2fs -l /dev/vg_oraarch/lv_oraarch | grep -i "Block size"
2) Unmount the filesystem before changing the block size.
# umount /u02/oraarch
3) Recreate the filesystem with the new block size.
# mkfs -t ext3 -b 4096 /dev/vg_oraarch/lv_oraarch
4) Mount it again and check the new block size.
# mount /u02/oraarch
# tune2fs -l /dev/vg_oraarch/lv_oraarch | grep -i "Block size"
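
Put together, the recreation could look roughly like the sketch below. It assumes the data under /u02/oraarch can be staged to /backup/oraarch.tar (a hypothetical location) while the file system is rebuilt.

# Sketch only: back up, recreate with 4 KiB blocks, restore.
tar -cf /backup/oraarch.tar -C /u02/oraarch .    # hypothetical backup target
umount /u02/oraarch
mkfs -t ext3 -b 4096 /dev/vg_oraarch/lv_oraarch
mount /u02/oraarch
tar -xf /backup/oraarch.tar -C /u02/oraarch
tune2fs -l /dev/vg_oraarch/lv_oraarch | grep -i "Block size"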

Latency issue in Bond0

Check the bond configuration on the linux server:

cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: transmit load balancing
Primary Slave: None
Currently Active Slave: eth10
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth9
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 6c:5b:e5:XX:1a:64
Slave queue ID: 0

Slave Interface: eth10
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 6c:Xb:X5:aa:1c:61
Slave queue ID: 0

#check bonding configuration:
#-> cat /etc/sysconfig/network/ifcfg-bond0
DEVICE='bond0'
BOOTPROTO=static
BROADCAST=
IPADDR=XX.XX.AA.VV/23
NETWORK=
STARTMODE=auto
USERCONTROL=no
#LLADDR=
#ETHTOOL_OPTIONS=
BONDING_MASTER=yes
BONDING_MODULE_OPTS='miimon=100 mode=5'
BONDING_SLAVE0='eth9'
BONDING_SLAVE1='eth10'

#check for packet dropped on NIC
 for x in $(seq 1 20); do ip -s link show dev eth9 | grep -A1 'RX.*dropped'; sleep 2; done
 for x in $(seq 1 20); do ip -s link show dev eth10 | grep -A1 'RX.*dropped'; sleep 2; done

#->  for x in $(seq 1 20); do ip -s link show dev eth9 | grep -A1 'RX.*dropped'; sleep 2; done
    RX: bytes  packets  errors  dropped overrun mcast
    1662600144 24960287 150     67145216 0       0
    RX: bytes  packets  errors  dropped overrun mcast
    1662601168 24960302 150     67145223 0       0
    RX: bytes  packets  errors  dropped overrun mcast
    1662601584 24960308 150     67145233 0       0
    RX: bytes  packets  errors  dropped overrun mcast
    1662602246 24960318 150     67145244 0       0
    RX: bytes  packets  errors  dropped overrun mcast
    1662603078 24960331 150     67145262 0       0

Resolution :
  1. Reset the card with ifenslave (see the sketch below).
  2. Reseat the blade in the enclosure.
  3. Bond0 can be broken to monitor the packet drops per slave.
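
A minimal sketch of item 1, resetting one slave with ifenslave; the bond0/eth9 names come from the output above, and connectivity over the reset NIC will drop briefly, so run this from the console or over the other slave.

# Sketch: detach and re-attach one slave of bond0 (interface names from above).
ifenslave -d bond0 eth9      # detach eth9 from the bond
ip link set eth9 down
ip link set eth9 up
ifenslave bond0 eth9         # re-attach eth9 to the bond

# Check the slave state and drop counters afterwards
cat /proc/net/bonding/bond0
for x in $(seq 1 5); do ip -s link show dev eth9 | grep -A1 'RX.*dropped'; sleep 2; done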

Saturday, November 10, 2018

What is the maximum number of groups (GIDs) a user can belong to when using NFS with AUTH_UNIX / AUTH_SYS on RHEL



Environment

  • Red Hat Enterprise Linux 5
  • Red Hat Enterprise Linux 6
  • Red Hat Enterprise Linux 7
  • NFS

Issue

  • GIDs of users in more than 16 groups are not recognized properly on NFS.
  • Cannot change ownership of a subdirectory of an NFS filesystem; the following error message appears:
$ chown USER:GROUP /nfs-mount-point/file
chown: changing ownership of `/nfs-mount-point/file': Operation not permitted
  • User getting "Permission Denied" error while creating a file on NFS share.
# su - testuser1
$ touch example
touch: cannot touch `example`: Permission denied 
  • What limits are in place for group settings on NFS ?

Resolution

  • If the NFS environment requires a user to belong to more than 16 groups then use RPCSEC_GSS (e.g. with Kerberos) instead of AUTH_UNIX / AUTH_SYS. How to configure NFSv4 with kerberos authentication?
  • On RHEL 6 and newer the NFS-server can be instructed to discard the groups given by the NFS-client. The --manage-gids option for rpc.mountd (see man rpc.mountd) needs to be set on the NFS-server in /etc/sysconfig/nfs. The flag tells the server to ignore the 16 groups sent by the client and resolve group membership locally (NFS-server side). This effectively bypasses the limit imposed by the RPC data structure and requires that the NFS-server see either the same or a superset of the groups available to the NFS-client. Note that AUTH_SYS with --manage-gids is less secure than switching to RPCSEC_GSS. If RPCSEC_GSS is an option for your environment, it is a better solution.
  • NOTE: If you intend to do file locking over NFS, there may be a limitation in NFSv3's file locking protocol (NLM) in that it is unable to use RPCSEC_GSS. In that situation one should use NFSv4 with RPCSEC_GSS (e.g. with Kerberos).

Root Cause

  • NFS uses the RPC protocol to authenticate users.
  • The RPC protocol's AUTH_UNIX / AUTH_SYS Credentials structure limits the number of groups to 16, as specified in RFC 5531.

         struct authsys_parms {
            unsigned int stamp;
            string machinename<255>;
            unsigned int uid;
            unsigned int gid;
            unsigned int gids<16>;
         }
    
  • NFS uses the AUTH_SYS protocol by default.

Diagnostic Steps

  1. Capture a tcpdump.
  2. Inspect the RPC request (e.g. the SETATTR Call).

    Credentials
        Flavor: AUTH_UNIX (1)
        Length: 92
        Stamp: STAMP
        Machine Name: NAME
            length: 7
            contents: NAME
            fill bytes: opaque data
        UID: 901
        GID: 901
        Auxiliary GIDs (16) [901, 951, 952, 953, 954, 955, 956, 957, 958, 959, 960, 961, 962, 963, 964, 965]
            GID: 901
            GID: 951
            GID: 952
            GID: 953
            GID: 954
            GID: 955
            GID: 956
            GID: 957
            GID: 958
            GID: 959
            GID: 960
            GID: 961
            GID: 962
            GID: 963
            GID: 964
            GID: 965
    Verifier
        Flavor: AUTH_NULL (0)
        Length: 0
    
  3. For the SETATTR example, note that the GID the file is being set to, 982, does not appear in the credentials above.

    Network File System, SETATTR Call FH:0x2713c62b
    [Program Version: 3]
    [V3 Procedure: SETATTR (2)]
    object
        length: 32
        [hash (CRC-32): 0x2713c62b]
        decode type as: unknown
        filehandle: 000000000000fe1b0009bdae6802ffffffffffffffffffff...
    new_attributes
        mode: no value
            set_it: no value (0)
        uid: value follows
            set_it: value follows (1)
            uid: 901
        gid: value follows
            set_it: value follows (1)
            gid: 982
        size: no value
            set_it: no value (0)
        atime: don't change
            set_it: don't change (0)
        mtime: don't change
            set_it: don't change (0)
    guard: no value
        check: no value (0)
    
The following demonstrates the second resolution provided above, i.e. discarding the groups given by the NFS client.
Here are the steps to reproduce the issue:
  • Created a user testuser1 with same uid and gid on NFS server and client.
  • Created 20 groups with same gid on NFS server and client
  • Made user testuser1 a member of these 20 groups
  • Created 20 directories and gave them respective ownerships
On NFS server :

[root@server-test test1]# id -a testuser1
uid=20362(testuser1) gid=20362(testuser1) groups=20362(testuser1),20363(group1),20364(group2),20365(group3),20366(group4),20367(group5),20368(group6),20369(group7),20370(group8),20371(group9),20372(group10),20373(group11),20374(group12),20375(group13),20376(group14),20377(group15),20378(group16),20379(group17),20380(group18),20381(group19),20382(group20)
On client :

server-test:/test1 /testnfs nfs rw,relatime,vers=3,rsize=16384,wsize=16384,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.65.210.132,mountvers=3,mountport=43813,mountproto=udp,local_lock=none,addr=10.65.210.132 0 0

[root@client-test testnfs]# usermod -G group1,group2,group3,group4,group5,group6,group7,group8,group9,group10,group11,group12,group13,group14,group15,group16,group17,group18,group19,group20 testuser1

[root@client-test testnfs]# id -a testuser1
uid=20362(testuser1) gid=20362(testuser1) groups=20362(testuser1),20363(group1),20364(group2),20365(group3),20366(group4),20367(group5),20368(group6),20369(group7),20370(group8),20371(group9),20372(group10),20373(group11),20374(group12),20375(group13),20376(group14),20377(group15),20378(group16),20379(group17),20380(group18),20381(group19),20382(group20)
  • Setup on server and client is as follows:

drwxrwxr-x. 2 testuser1 group1  4.0K Aug 19 15:15 1
drwxr-xr-x. 2 testuser1 group2  4.0K Aug 19 15:15 2
drwxr-xr-x. 2 testuser1 group3  4.0K Aug 19 15:15 3
drwxr-xr-x. 2 testuser1 group4  4.0K Aug 19 15:15 4
drwxr-xr-x. 2 testuser1 group5  4.0K Aug 19 15:15 5
drwxr-xr-x. 2 testuser1 group6  4.0K Aug 19 15:15 6
drwxr-xr-x. 2 testuser1 group7  4.0K Aug 19 15:15 7
drwxr-xr-x. 2 testuser1 group8  4.0K Aug 19 15:16 8
drwxr-xr-x. 2 testuser1 group9  4.0K Aug 19 15:16 9
drwxr-xr-x. 2 testuser1 group10 4.0K Aug 19 15:16 10
drwxr-xr-x. 2 testuser1 group11 4.0K Aug 19 15:16 11
drwxr-xr-x. 2 testuser1 group12 4.0K Aug 19 15:16 12
drwxr-xr-x. 2 testuser1 group13 4.0K Aug 19 15:16 13
drwxr-xr-x. 2 testuser1 group14 4.0K Aug 19 15:16 14
drwxr-xr-x. 2 testuser1 group15 4.0K Aug 19 15:16 15
drwxr-xr-x. 2 testuser1 group16 4.0K Aug 19 15:16 16
drwxr-xr-x. 2 testuser1 group17 4.0K Aug 19 15:16 17
drwxr-xr-x. 2 testuser1 group18 4.0K Aug 19 15:16 18
drwxr-xr-x. 2 testuser1 group19 4.0K Aug 19 15:16 19
drwxr-xr-x. 2 testuser1 group20 4.0K Aug 19 15:16 20
  • Tried changing ownership of one of the directories.

[root@client-test /]# su - testuser1
$ cd /testnfs
testnfs]$ chown testsuer1:group20 1
chown: changing ownership of `1': Operation not permitted <--------------Error reported

  • Whereas locally on the NFS server, this operation works without any problem :

[root@server-test test1]# su - testuser1
[testuser1@server-test ~]$ cd /test1
[testuser1@server-test test1]$ chown testuser1:group20 1
[testuser1@server-test test1]$ ls -ld 1
drwxrwxr-x. 2 testuser1 group20 4096 Aug 19 15:15 1
  • A tcpdump captured at the same time also shows the permission error.
    Inspect the RPC request (e.g. the SETATTR Call). For the SETATTR example, note that the GID the directory is being set to, 20382, does not appear in the credentials below.

Credentials
Flavor: AUTH_UNIX (1)
Length: 100
Stamp: 0x004f0cff
Machine Name: client-test
length: 14
contents: client-test
fill bytes: opaque data
UID: 20362
GID: 20362
Auxiliary GIDs (16) [20362, 20363, 20364, 20365, 20366, 20367, 20368, 20369, 20370, 20371, 20372, 20373, 20374, 20375, 20376, 20377]
GID: 20362
GID: 20363
GID: 20364
GID: 20365
GID: 20366
GID: 20367
GID: 20368
GID: 20369
GID: 20370
GID: 20371
GID: 20372
GID: 20373
GID: 20374
GID: 20375
GID: 20376
GID: 20377
Verifier
Flavor: AUTH_NULL (0)
Length: 0

Network File System, SETATTR Call FH:0x1622cb12
[Program Version: 3]
[V3 Procedure: SETATTR (2)]
object
length: 28
[hash (CRC-32): 0x1622cb12]
decode type as: unknown
filehandle: 01000601c8166eb71d244e24a1af18679aeb921c02000200...
new_attributes
mode: no value
set_it: no value (0)
uid: value follows
set_it: value follows (1)
uid: 20362
gid: value follows
set_it: value follows (1)
gid: 20382
size: no value
set_it: no value (0)
atime: don't change
set_it: don't change (0)
mtime: don't change
set_it: don't change (0)
guard: no value
check: no value (0)
  • Then on the NFS server, I made the following entry in /etc/sysconfig/nfs and restarted the NFS service:

RPCMOUNTDOPTS="--manage-gids"
  • Unmounted the NFS share from the client and re-mounted it.
  • Lastly, I tried to change the ownership once again as follows and this time it worked without any problems :

# mount -t nfs -o vers=3 10.65.210.132:/test1 /testnfs
# su - testuser1
$ cd /testnfs
$ chown testuser1:group18 1
$ ls -ld 1
drwxrwxr-x. 2 testsuer1 group18 4096 Aug 19 15:15 1
Thus we can conclude that setting rpc.mountd --manage-gids works around the 16-group limitation of AUTH_SYS NFS.
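
A quick client-side check for whether a user is over the 16-group AUTH_SYS limit (testuser1 reused from the example above):

# Sketch: count the user's groups; more than 16 will be truncated in the
# AUTH_SYS credentials unless --manage-gids (or RPCSEC_GSS) is used.
id -G testuser1 | wc -w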