[Pkg-iscsi-maintainers] Bug#629442: iscsitarget: ietd gives "iscsi_trgt: Abort Task" errors on high disk load and iscsi connections are dropped
Massimiliano Ferrero
m.ferrero at midhgard.it
Thu Aug 4 21:08:54 UTC 2011
On 08/07/2011 09:50, Ritesh Raj Sarraf wrote:
> Any update on this issue ? I am lowering its severity to Important as it
> is a corner case.
Hello, sorry for not updating you earlier.
I think the issue has nothing to do with an ietd bug.
I think that the server log, even with debug active, just showed that
ietd was closing the connections, but the problem originates from the
initiators: there are timeouts in the connections and these are dropped.
I had designed the architecture to boot the initiators from USB keys (Debian
on a jffs2 file system) and had the very bad idea of putting swap on one
of the iSCSI LVM volumes (one volume for each cluster node).
I think this was one source of problems: any problem on the swap volume
would result in processes crashing on the node.
Now we have bought disks for the nodes and the system is installed on
the disks, and I see no more errors on the swap space.
I think the bug can be closed, thank you for the support.
We still have some problems with dropped connections; I have probably not
set the iSCSI initiator timeouts correctly.
This is what I see in the initiator syslog:
/var/log/syslog.5.gz:Jul 30 14:40:27 xen002 qdiskd[3187]: qdisk cycle took more than 3 seconds to complete (3.470000)
/var/log/syslog.5.gz:Jul 30 14:44:39 xen002 kernel: [ 537.074759] dlm: closing connection to node 2
/var/log/syslog.5.gz:Jul 30 14:44:39 xen002 kernel: [ 537.074890] dlm: closing connection to node 1
/var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [ 541.401455] connection12:0: detected conn error (1020)
/var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [ 541.401494] connection7:0: detected conn error (1020)
/var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [ 541.401526] connection11:0: detected conn error (1020)
/var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [ 541.401639] connection6:0: detected conn error (1020)
/var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [ 541.401690] connection10:0: detected conn error (1020)
/var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [ 541.401722] connection5:0: detected conn error (1020)
/var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [ 541.401750] connection3:0: detected conn error (1020)
/var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [ 541.401773] connection9:0: detected conn error (1020)
/var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [ 541.401809] connection1:0: detected conn error (1020)
/var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [ 541.401850] connection8:0: detected conn error (1020)
/var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [ 541.401879] connection2:0: detected conn error (1020)
/var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [ 541.402096] connection4:0: detected conn error (1020)
/var/log/syslog.5.gz:Jul 30 14:47:59 xen002 nrpe[8427]: Listening for connections on port 5666
/var/log/syslog.5.gz:Jul 30 14:47:59 xen002 nrpe[8427]: Allowing connections from: 10.212.0.1,10.213.0.1
/var/log/syslog.5.gz:Jul 30 15:26:31 xen002 qdiskd[3241]: qdisk cycle took more than 3 seconds to complete (3.260000)
/var/log/syslog.5.gz:Jul 30 15:28:53 xen002 qdiskd[3241]: qdisk cycle took more than 3 seconds to complete (3.680000)
/var/log/syslog.5.gz:Jul 30 17:52:14 xen002 qdiskd[3241]: qdisk cycle took more than 3 seconds to complete (3.010000)
/var/log/syslog.5.gz:Jul 30 22:23:45 xen002 qdiskd[3241]: qdisk cycle took more than 3 seconds to complete (3.050000)
/var/log/syslog.5.gz:Jul 30 23:03:57 xen002 qdiskd[3241]: Assuming master role
/var/log/syslog.5.gz:Jul 30 23:04:00 xen002 qdiskd[3241]: Writing eviction notice for node 1
/var/log/syslog.5.gz:Jul 30 23:04:03 xen002 qdiskd[3241]: Node 1 evicted
/var/log/syslog.5.gz:Jul 30 23:04:58 xen002 kernel: [29926.308142] connection5:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4302371373, last ping 4302372623, now 4302373873
/var/log/syslog.5.gz:Jul 30 23:04:58 xen002 kernel: [29926.308300] connection5:0: detected conn error (1011)
/var/log/syslog.5.gz:Jul 30 23:04:58 xen002 kernel: [29926.308304] connection4:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4302371373, last ping 4302372623, now 4302373873
/var/log/syslog.5.gz:Jul 30 23:04:58 xen002 kernel: [29926.308407] connection4:0: detected conn error (1011)
/var/log/syslog.5.gz:Jul 30 23:04:58 xen002 kernel: [29926.308410] connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4302371373, last ping 4302372623, now 4302373873
/var/log/syslog.5.gz:Jul 30 23:04:58 xen002 kernel: [29926.308518] connection1:0: detected conn error (1011)
/var/log/syslog.5.gz:Jul 30 23:04:58 xen002 kernel: [29926.316053] connection12:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4302371374, last ping 4302372624, now 4302373875
/var/log/syslog.5.gz:Jul 30 23:04:58 xen002 kernel: [29926.316173] connection12:0: detected conn error (1011)
/var/log/syslog.5.gz:Jul 30 23:04:59 xen002 iscsid: Kernel reported iSCSI connection 5:0 error (1011) state (3)
/var/log/syslog.5.gz:Jul 30 23:04:59 xen002 iscsid: Kernel reported iSCSI connection 4:0 error (1011) state (3)
/var/log/syslog.5.gz:Jul 30 23:04:59 xen002 iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)
/var/log/syslog.5.gz:Jul 30 23:04:59 xen002 iscsid: Kernel reported iSCSI connection 12:0 error (1011) state (3)
/var/log/syslog.5.gz:Jul 30 23:07:04 xen002 kernel: [30052.199198] dlm: closing connection to node 1
In this example the node was then killed by the other node because it
could no longer write to the quorum disk (which is also on iSCSI).
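As a rough cross-check of the "ping timeout" lines in the log, the three
counters (last rx, last ping, now) can be converted from jiffies to seconds.
This is only a sketch: the kernel tick rate HZ = 250 is an assumption (it is
not stated anywhere in the log), and the helper name is made up for
illustration. Under that assumption the gaps line up with the 5-second
noop-out interval and timeout configured on the initiator.

```python
# Assumption: kernel tick rate (CONFIG_HZ) is 250; the log does not state it.
HZ = 250

# Jiffies values copied from the connection5:0 "ping timeout" log line.
last_rx = 4302371373
last_ping = 4302372623
now = 4302373873


def jiffies_to_seconds(jiffies, hz=HZ):
    """Convert a jiffies delta to seconds for the assumed tick rate."""
    return jiffies / hz


# Time the connection was idle before the initiator sent a ping.
idle_before_ping = jiffies_to_seconds(last_ping - last_rx)
# Time the initiator waited for the ping reply before giving up.
wait_after_ping = jiffies_to_seconds(now - last_ping)
# Total time without receiving anything from the target.
total_silence = jiffies_to_seconds(now - last_rx)

print(f"idle before ping: {idle_before_ping:.1f} s")
print(f"wait after ping:  {wait_after_ping:.1f} s")
print(f"total silence:    {total_silence:.1f} s")
```

If the assumed HZ is right, nothing was received from the target for about
10 seconds before the connection was declared failed.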
Is there any good pointer to examples of settings to be used with iscsi
and multipath?
At the moment, on the initiator side I have:
node.session.timeo.replacement_timeout = 120
node.conn[0].timeo.login_timeout = 15
node.conn[0].timeo.logout_timeout = 15
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5
node.session.err_timeo.abort_timeout = 15
node.session.err_timeo.lu_reset_timeout = 20
and in /etc/multipath.conf:
defaults {
        udev_dir                /dev
        polling_interval        10
        selector                "round-robin 0"
        path_grouping_policy    multibus
        getuid_callout          "/lib/udev/scsi_id -g -u -d /dev/%n"
        prio_callout            /bin/true
        path_checker            directio
        rr_min_io               100
        rr_weight               priorities
        failback                immediate
        #no_path_retry          fail
        user_friendly_names     no
}
multipaths {
        multipath {
                wwid    149455400000000000359...
                alias   lun1
        }
        multipath {
                wwid    1494554000000000024d1...
                alias   lun2
        }
}
Should I enable "no_path_retry fail"?
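To put the timeouts above in perspective, here is a back-of-the-envelope
sketch of how long I/O on a dead path might stay blocked before the error is
handed up to multipath. It assumes (this is my reading, not something I have
verified) that a path is only failed after the noop-out interval, the
noop-out timeout and the replacement timeout have all elapsed in sequence.
The function name and the alternative value of 15 seconds are hypothetical,
used only for comparison.

```python
# Values taken from the initiator settings quoted in this message.
current = {
    "noop_out_interval": 5,
    "noop_out_timeout": 5,
    "replacement_timeout": 120,
}


def worst_case_path_failure_seconds(timeouts):
    """Estimate seconds until a dead path is failed up to multipath.

    Assumption: the three timers run one after another (idle interval,
    then ping timeout, then session replacement timeout).
    """
    return (
        timeouts["noop_out_interval"]
        + timeouts["noop_out_timeout"]
        + timeouts["replacement_timeout"]
    )


# Hypothetical alternative: same settings with a much shorter
# replacement timeout, purely for comparison.
shorter = dict(current, replacement_timeout=15)

print("current settings:", worst_case_path_failure_seconds(current), "s")
print("shorter timeout: ", worst_case_path_failure_seconds(shorter), "s")
```

With the current values this estimate comes to roughly two minutes, which is
far longer than the few seconds qdiskd tolerates per cycle in the log above.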
Thanks
Massimiliano
--
Massimiliano Ferrero
Midhgard s.r.l.
C/so Svizzera 185 bis
c/o centro Piero della Francesca
10149 - Torino
tel. +39-0117575375
fax +39-0117768576
e-mail: m.ferrero at midhgard.it
website: http://www.midhgard.it