[Pkg-iscsi-maintainers] Bug#629442: Bug#629442: iscsitarget: ietd gives "iscsi_trgt: Abort Task" errors on high disk load and iscsi connections are dropped

Thu Aug 4 21:08:54 UTC 2011

On 08/07/2011 09:50, Ritesh Raj Sarraf wrote:
> Any update on this issue ? I am lowering its severity to Important as it
> is a corner case.

Hello, sorry for not having updated you earlier.

I think the issue has nothing to do with an ietd bug.
I think that the server log, even with debug enabled, only showed that 
ietd was closing the connections, but the problem originates on the 
initiators: the connections time out and are then dropped.

I had designed the architecture to boot the initiators from USB keys 
(Debian on a JFFS2 file system) and had the very bad idea of putting swap 
on one of the iSCSI LVM volumes (one volume for each cluster node).
I think this was one source of the problems: any problem on the swap 
volume would cause processes to crash on the node.

Now we have bought disks for the nodes and the system is installed on 
the disks, and I see no more errors on the swap space.
I think the bug can be closed, thank you for the support.

We still have some problems with dropped connections; I have probably 
not set the iSCSI initiator timeouts correctly.
This is what I see in the initiator syslog:

/var/log/syslog.5.gz:Jul 30 14:40:27 xen002 qdiskd[3187]: qdisk cycle 
took more than 3 seconds to complete (3.470000)
/var/log/syslog.5.gz:Jul 30 14:44:39 xen002 kernel: [  537.074759] dlm: 
closing connection to node 2
/var/log/syslog.5.gz:Jul 30 14:44:39 xen002 kernel: [  537.074890] dlm: 
closing connection to node 1
/var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [  541.401455]  
connection12:0: detected conn error (1020)
/var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [  541.401494]  
connection7:0: detected conn error (1020)
/var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [  541.401526]  
connection11:0: detected conn error (1020)
/var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [  541.401639]  
connection6:0: detected conn error (1020)
/var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [  541.401690]  
connection10:0: detected conn error (1020)
/var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [  541.401722]  
connection5:0: detected conn error (1020)
/var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [  541.401750]  
connection3:0: detected conn error (1020)
/var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [  541.401773]  
connection9:0: detected conn error (1020)
/var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [  541.401809]  
connection1:0: detected conn error (1020)
/var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [  541.401850]  
connection8:0: detected conn error (1020)
/var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [  541.401879]  
connection2:0: detected conn error (1020)
/var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [  541.402096]  
connection4:0: detected conn error (1020)
/var/log/syslog.5.gz:Jul 30 14:47:59 xen002 nrpe[8427]: Listening for 
connections on port 5666
/var/log/syslog.5.gz:Jul 30 14:47:59 xen002 nrpe[8427]: Allowing 
connections from: 10.212.0.1,10.213.0.1
/var/log/syslog.5.gz:Jul 30 15:26:31 xen002 qdiskd[3241]: qdisk cycle 
took more than 3 seconds to complete (3.260000)
/var/log/syslog.5.gz:Jul 30 15:28:53 xen002 qdiskd[3241]: qdisk cycle 
took more than 3 seconds to complete (3.680000)
/var/log/syslog.5.gz:Jul 30 17:52:14 xen002 qdiskd[3241]: qdisk cycle 
took more than 3 seconds to complete (3.010000)
/var/log/syslog.5.gz:Jul 30 22:23:45 xen002 qdiskd[3241]: qdisk cycle 
took more than 3 seconds to complete (3.050000)
/var/log/syslog.5.gz:Jul 30 23:03:57 xen002 qdiskd[3241]: Assuming 
master role
/var/log/syslog.5.gz:Jul 30 23:04:00 xen002 qdiskd[3241]: Writing 
eviction notice for node 1
/var/log/syslog.5.gz:Jul 30 23:04:03 xen002 qdiskd[3241]: Node 1 evicted
/var/log/syslog.5.gz:Jul 30 23:04:58 xen002 kernel: [29926.308142]  
connection5:0: ping timeout of 5 secs expired, recv timeout 5, last rx 
4302371373, last ping 4302372623, now 4302373873
/var/log/syslog.5.gz:Jul 30 23:04:58 xen002 kernel: [29926.308300]  
connection5:0: detected conn error (1011)
/var/log/syslog.5.gz:Jul 30 23:04:58 xen002 kernel: [29926.308304]  
connection4:0: ping timeout of 5 secs expired, recv timeout 5, last rx 
4302371373, last ping 4302372623, now 4302373873
/var/log/syslog.5.gz:Jul 30 23:04:58 xen002 kernel: [29926.308407]  
connection4:0: detected conn error (1011)
/var/log/syslog.5.gz:Jul 30 23:04:58 xen002 kernel: [29926.308410]  
connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 
4302371373, last ping 4302372623, now 4302373873
/var/log/syslog.5.gz:Jul 30 23:04:58 xen002 kernel: [29926.308518]  
connection1:0: detected conn error (1011)
/var/log/syslog.5.gz:Jul 30 23:04:58 xen002 kernel: [29926.316053]  
connection12:0: ping timeout of 5 secs expired, recv timeout 5, last rx 
4302371374, last ping 4302372624, now 4302373875
/var/log/syslog.5.gz:Jul 30 23:04:58 xen002 kernel: [29926.316173]  
connection12:0: detected conn error (1011)
/var/log/syslog.5.gz:Jul 30 23:04:59 xen002 iscsid: Kernel reported 
iSCSI connection 5:0 error (1011) state (3)
/var/log/syslog.5.gz:Jul 30 23:04:59 xen002 iscsid: Kernel reported 
iSCSI connection 4:0 error (1011) state (3)
/var/log/syslog.5.gz:Jul 30 23:04:59 xen002 iscsid: Kernel reported 
iSCSI connection 1:0 error (1011) state (3)
/var/log/syslog.5.gz:Jul 30 23:04:59 xen002 iscsid: Kernel reported 
iSCSI connection 12:0 error (1011) state (3)
/var/log/syslog.5.gz:Jul 30 23:07:04 xen002 kernel: [30052.199198] dlm: 
closing connection to node 1

In this example the node was then killed by the other node because it 
could no longer write to the quorum disk (which is also on iSCSI).
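
As a rough cross-check of the "ping timeout" lines above, the three 
counters can be converted to seconds. This is only a sketch: it assumes 
the counters are kernel jiffies and that the kernel runs with HZ=250 
(an assumption, not something the log states directly; it is merely 
consistent with 1250 ticks being reported as a 5 second timeout). The 
function name analyse_ping_timeout is made up for illustration.

```python
import re

# One kernel message from the log above, joined back onto a single line.
LINE = ("connection5:0: ping timeout of 5 secs expired, recv timeout 5, "
        "last rx 4302371373, last ping 4302372623, now 4302373873")

PATTERN = re.compile(
    r"ping timeout of (?P<ping>\d+) secs expired, "
    r"recv timeout (?P<recv>\d+), "
    r"last rx (?P<last_rx>\d+), last ping (?P<last_ping>\d+), "
    r"now (?P<now>\d+)"
)

# Assumption: tick rate of the initiator kernel (not given in the log).
ASSUMED_HZ = 250


def analyse_ping_timeout(line, hz=ASSUMED_HZ):
    """Convert the tick counters of a 'ping timeout' message to seconds."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a ping timeout message")
    f = {name: int(value) for name, value in match.groupdict().items()}
    return {
        # Time with nothing received before the initiator sent a ping.
        "idle_before_ping_s": (f["last_ping"] - f["last_rx"]) / hz,
        # Time spent waiting for the ping reply before giving up.
        "wait_after_ping_s": (f["now"] - f["last_ping"]) / hz,
        # Total time with no data received from the target.
        "silent_total_s": (f["now"] - f["last_rx"]) / hz,
    }


result = analyse_ping_timeout(LINE)
print(result)
```

Under those assumptions the target was silent for about 10 seconds 
(5 idle plus 5 waiting for the ping reply) before error 1011 was raised, 
which matches a ping interval and ping timeout of 5 seconds each.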

Is there any good pointer to example settings for iSCSI with 
multipath?
At the moment on the initiator side I have

node.session.timeo.replacement_timeout = 120
node.conn[0].timeo.login_timeout = 15
node.conn[0].timeo.logout_timeout = 15
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5
node.session.err_timeo.abort_timeout = 15
node.session.err_timeo.lu_reset_timeout = 20

and in /etc/multipath.conf

defaults {
         udev_dir                /dev
         polling_interval        10
         selector                "round-robin 0"
         path_grouping_policy    multibus
         getuid_callout          "/lib/udev/scsi_id -g -u -d /dev/%n"
         prio_callout            /bin/true
         path_checker            directio
         rr_min_io               100
         rr_weight               priorities
         failback                immediate
         #no_path_retry          fail
         user_friendly_names     no
}
multipaths {
   multipath {
     wwid 149455400000000000359...
     alias lun1
   }
   multipath {
     wwid 1494554000000000024d1...
     alias lun2
   }
}

should I enable
no_path_retry          fail
?
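
To put the initiator timeouts listed above in perspective, they can be 
added up. The sketch below assumes the usual open-iscsi meaning of the 
settings: a ping is sent after noop_out_interval seconds of silence, the 
connection is declared failed noop_out_timeout seconds later, and 
outstanding I/O is only failed up to multipath after a further 
replacement_timeout seconds. The quorum-disk budget used for comparison 
is a purely hypothetical number, since the qdiskd interval and tko 
values are not given here.

```python
# Initiator settings exactly as quoted above.
SETTINGS_TEXT = """
node.session.timeo.replacement_timeout = 120
node.conn[0].timeo.login_timeout = 15
node.conn[0].timeo.logout_timeout = 15
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5
node.session.err_timeo.abort_timeout = 15
node.session.err_timeo.lu_reset_timeout = 20
"""

# Hypothetical: seconds qdiskd tolerates before a node is evicted.
# The real value depends on the qdisk interval and tko settings.
HYPOTHETICAL_QDISK_BUDGET_S = 30


def parse_settings(text):
    """Parse 'key = value' lines into a dict of integer settings."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = int(value.strip())
    return settings


def stall_estimate(settings):
    """Estimate how long I/O on a dead path can stall, in seconds."""
    interval = settings["node.conn[0].timeo.noop_out_interval"]
    timeout = settings["node.conn[0].timeo.noop_out_timeout"]
    replacement = settings["node.session.timeo.replacement_timeout"]
    detect = interval + timeout
    return {
        # Silence needed before the connection error is raised.
        "detect_s": detect,
        # Rough upper bound before the error reaches multipath.
        "fail_to_multipath_s": detect + replacement,
    }


estimate = stall_estimate(parse_settings(SETTINGS_TEXT))
fits = estimate["fail_to_multipath_s"] <= HYPOTHETICAL_QDISK_BUDGET_S
print(estimate, "fits hypothetical budget:", fits)
```

With the current values a dead path is detected after about 10 seconds, 
but I/O on it may stay queued for roughly 130 seconds before multipath 
can retry on another path. The open-iscsi documentation generally 
advises a much shorter replacement_timeout when dm-multipath sits on 
top, so that may be worth checking before changing no_path_retry.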

Thanks
Massimiliano

-- 

Massimiliano Ferrero
Midhgard s.r.l.
C/so Svizzera 185 bis
c/o centro Piero della Francesca
10149 - Torino
tel. +39-0117575375
fax  +39-0117768576
e-mail: m.ferrero at midhgard.it
web site: http://www.midhgard.it