[Pkg-iscsi-maintainers] Bug#629442: Bug#629442: iscsitarget: ietd gives "iscsi_trgt: Abort Task" errors on high disk load and iscsi connections are dropped

Ritesh Raj Sarraf rrs at debian.org
Tue Jun 14 13:51:18 UTC 2011


On 06/14/2011 05:12 PM, Massimiliano Ferrero wrote:
> 
>> Next time, when you try to test/re-create the bug, capture dstat output.
>> The default dstat output is good enough to tell us what the system state
>> was during starvation.
> Hello, yesterday and tonight I performed some other tests; these are the
> results:
> 
> 1) it seems I am not able to reproduce the bug on a test system
> the test system (san01) has the same processor (E5220) and amount of RAM
> (12 GB), but a smaller I/O system: an 8-channel 3ware controller with an
> 8-disk raid 5 array
> the system that presents the problem (san00) has a 24-channel controller
> and a 23-disk raid 6 array (+ 1 hot spare)
> both systems are connected through the same gigabit switches
> 

Is the OS/kernel also the same?

> there is another hw difference between the two environments: the nodes
> connected to san00 are high-end hw, their network cards are able to
> generate nearly 1 Gb/s of iscsi traffic
> the nodes connected to san01 are low-end hw and their network cards do
> not exceed 300 Mb/s
> so the system that presents the problem has both a higher-performance
> I/O subsystem and machines generating iscsi traffic that are able to
> issue more than 3 times as many i/o operations
>

Unfortunately, dstat's default output doesn't capture VM statistics. Do
you have any idea what the VM consumption was when you saw the problem?
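If you get a chance to re-run the test, something along these lines
(only a sketch; adjust the interval and output file name as you like)
would record memory, paging and swap activity alongside the defaults:

  dstat -t -c -d -n -g -y -m -s --output dstat-vm.csv 1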

Assuming it is buffered I/O, your VM will soon be consumed (as I can see
from the dstat logs, the CPU spent a very long time in I/O wait) and will
then start paging. And that is still the scenario where Linux tends to
succumb.
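
One cheap way to check that theory while you reproduce the problem (just
a suggestion) is to watch the dirty/writeback counters:

  watch -n1 "grep -E 'Dirty|Writeback' /proc/meminfo"

If Dirty grows to hundreds of MB and then collapses right when the iscsi
connection drops, the writeback flush is the likely trigger.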


On your faulty setup, can you try "unbuffered direct I/O" and see if
that triggers the problem? My hunch is that it will not fail. To
generate direct I/O, you can use the fio tool; it is already packaged
in Debian.
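
Something along these lines should do (only a sketch: the job name, size
and the /dev/vgXX/lv-scratch path are placeholders; point it at a scratch
LV, not at a volume with data on it):

  fio --name=direct-write --filename=/dev/vgXX/lv-scratch \
      --rw=write --bs=1M --size=4G --direct=1 \
      --ioengine=libaio --iodepth=16

With --direct=1 the writes bypass the page cache, so if this survives
while a plain dd does not, it points back at writeback/VM rather than at
the hardware.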


> at the moment I am not able to tell which of these aspects, or the sum
> of them, creates the conditions for the problem: I suspect that it's a
> mix of all of these
> unfortunately at the moment I do not have hw similar to the one in
> production to perform a test under the same conditions.
> 
> 2) san00 presents the problem even with the deadline scheduler active on
> all logical volumes exported through iscsi or used by the heavy-load
> operation (dd)
> 
> 3) on san00 I was able to reproduce the problem under simpler conditions
> than the ones I described in the first mail: just one node connected
> through iscsi, the other node was restarting, no virtual machines were
> running on the node, and the node was performing one i/o-intensive
> operation on one of the LVs exported through iscsi/lvm (an fsck on one
> file system)
> during this operation I launched a dd on san00 and the iscsi connection
> was dropped after a few seconds
> 

I think it is the typical Linux I/O problem, which I believe is a
combination of the I/O scheduler and the VM subsystem.
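
One thing worth double-checking (just a hunch): the elevator applies only
to the real block device underneath; the LVM/device-mapper volumes
themselves do not run a scheduler, so setting deadline on the LVs alone
changes nothing. On the raid controller's disk device (sdX is a
placeholder):

  cat /sys/block/sdX/queue/scheduler
  echo deadline > /sys/block/sdX/queue/scheduler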

> I am attaching 3 files: the dstat output during the test and extracts of
> /var/log/messages and /var/log/syslog
> I have just filtered out information for non-relevant services (nagios,
> dhcp, snmp, postfix, etc.) both for readability and confidentiality
> ietd was running with the following command line:
> /usr/sbin/ietd --debug=255
> so in the log we have debug information
> the problem can be seen in syslog at Jun 14 01:28:53
> at Jun 14 01:34:06 I turned off the node for reboot and in the log there
> are some records regarding termination of iscsi sessions
> I do not see anything relevant in the ietd debug log, just a restart of
> the connections
> 
> in dstat output the dd operation was started around line 197 and was
> terminated at line 208 (I interrupted the operation as soon as I saw the
> problem)
> 
> what I see in the dstat output is the following: for some seconds (about
> 10) dd does not generate a lot of reads and writes
> 
> usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
>   7   2  56  35   0   0|  12M   14M|  22k   35k|   0     0 |4415    12k
> 
> 12M read and 14M written, and this could be from the dd operation or the
> fsck performed through iscsi
> 
> then there is a burst of writes, I guess using the full I/O capacity of
> the controller and of the disks
> 
> usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
> 35   7  35  22   0   1|8180k  325M|  38k   25k|   0     0 |6860    11k
>   2   3  59  36   0   1|3072B  541M|  20k   26k|   0     0 |5380  2747
>   3   4  64  30   0   0|5120B  473M|  21k   30k|   0     0 |4752    16k
> 
> writes of 325M, 541M, 473M
> and this is exactly the moment when the problem arises
> 
> could it be that the i/o operations are cached in memory and the problem
> appears when they are flushed to disk?
> 

Yes, that is my suspicion as well. The per-bdi writeback mechanism
improves this situation to a great extent, but I'm not sure whether it
is part of the Squeeze kernel.
http://lwn.net/Articles/326552/
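
In the meantime, one experiment worth trying (the values below are only a
guess; note the current settings and restore them afterwards) is to lower
the dirty-page thresholds so the flusher starts earlier and the bursts
stay smaller:

  sysctl -w vm.dirty_background_ratio=5
  sysctl -w vm.dirty_ratio=10

If the iscsi connections survive the same dd with these settings, that
strongly suggests the big cached flush is what starves ietd.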

> 
> If the logs do not yield any pointer to a potential solution, the only
> other test I can think of is upgrading to a newer kernel, but I

You can try this just to ascertain the cause.

> see this as a last resort for several reasons:
> - as I see it, putting a test kernel directly on a production system is
> not a wise move; I could (and in the past already have) run into
> several other unknown bugs
> - all our other systems are running on a standard lenny or squeeze kernel
> - I would lose support for kernel security updates from debian
> 


-- 
Ritesh Raj Sarraf | http://people.debian.org/~rrs
Debian - The Universal Operating System
