Bug#740701: multipath-tools: mkfs fails "Add. Sense: Incompatible medium installed"

Sun Jun 22 23:30:27 UTC 2014

Hi,

On 06/22/2014 10:19 AM, Martin George wrote:
>
> So firstly, the question arises why your kernel marked all paths as
> failed when you hit this error. This actually resembles the old Linux
> behavior where for a device error such as a MEDIUM ERROR, it gets
> retried on all paths available to the LUN, all which result in the same
> error, and hence all paths get marked as failed. This was addressed with
> the upstream patch at
> http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=63583cca745f440167bf27877182dc13e19d4bcf,
> where more fine-grained error handling is now available.

Yes, it retries on all paths. The kernel version (3.2.57) which is used 
in my case already includes the changes mentioned above.

> With this,
> device errors such as MEDIUM ERROR are no longer retried since it treats
> such errors as permanent errors. That makes me suspect your kernel is
> already missing some of the key patches from the upstream kernel in
> context with this error handling. And given that UNMAP has also been a
> relatively new feature which underwent several upstream revisions to get
> to the current stable state, it would be prudent for you to check if
> your kernel is up-to-date with its SCSI & UNMAP handling.

Currently I'm not able to reproduce the error (the iSCSI response I see 
in production) in a very similar test setup, rebuilt with the same 
hardware and software that is failing on me, which is a bit 
confusing. :||

So, even worse, I'm not yet convinced that the actual problem is a Linux 
kernel problem. Why is my NetApp filer sending a MEDIUM ERROR 
"Incompatible medium installed" to me in the production case anyway?

The latest kernel code only prevents (afaics) the retry in a small 
subset of cases, which does not include an asc of 0x30 INCOMPATIBLE 
MEDIUM INSTALLED.

   case MEDIUM_ERROR:
       if (sshdr.asc == 0x11 || /* UNRECOVERED READ ERR */
           sshdr.asc == 0x13 || /* AMNF DATA FIELD */
           sshdr.asc == 0x14) { /* RECORD NOT FOUND */
           set_host_byte(scmd,DID_MEDIUM_ERROR);
           return SUCCESS;
       }
       return NEEDS_RETRY;
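
To make the consequence for my case explicit, here is a minimal userspace 
sketch that mirrors only the branch quoted above. The enum and the 
function name are made up for illustration and are not kernel 
identifiers; the three ASC values are taken from the excerpt.

```c
#include <stdio.h>

/* Illustrative stand-ins for the kernel's SUCCESS / NEEDS_RETRY results. */
enum disposition { DISP_SUCCESS, DISP_NEEDS_RETRY };

/*
 * Mirrors the MEDIUM_ERROR branch quoted above: only ASC 0x11, 0x13 and
 * 0x14 are treated as permanent errors; every other ASC is retried.
 */
enum disposition medium_error_disposition(unsigned char asc)
{
    if (asc == 0x11 ||  /* UNRECOVERED READ ERR */
        asc == 0x13 ||  /* AMNF DATA FIELD */
        asc == 0x14)    /* RECORD NOT FOUND */
        return DISP_SUCCESS;
    return DISP_NEEDS_RETRY;
}

/* Small helper to print the outcome for a given ASC. */
void print_disposition(unsigned char asc)
{
    printf("asc 0x%02x -> %s\n", asc,
           medium_error_disposition(asc) == DISP_SUCCESS
               ? "not retried" : "NEEDS_RETRY");
}
```

Calling print_disposition(0x30) reports NEEDS_RETRY, which matches what 
I see: the command is retried on the other paths.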

> That said, it is indeed strange that you hit a MEDIUM ERROR in the first
> place, when using UNMAP. As described above, that's a device error. So
> does this fail even for other commands such as a regular write (you
> could try this with dd) or even a simple TUR command (like say using
> sg_turs -v /dev/mpathX)?

# sg_turs -v /dev/mapper/mpath_scylla0
     test unit ready cdb: 00 00 00 00 00 00

The UNMAP is the only command that causes the failure. As long as I do 
not cause an UNMAP to be sent (by running mkfs.ext4 without -E nodiscard, 
running mkfs.btrfs without disabling discard, or issuing an fstrim 
command), this multipathed LVM on iSCSI handles millions of iSCSI write 
and read ops every day in production just fine. If an UNMAP is sent, it 
makes all iSCSI storage on a physical server hang, as seen before.
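
For comparing the test and production hosts, one thing worth checking is 
whether the block layer advertises discard support for the device at 
all. A rough sketch, assuming the queue attribute discard_max_bytes in 
sysfs (a value of 0 means discards are not supported); the function name 
and the dm-0 device name below are only placeholders:

```c
#include <stdio.h>

/*
 * Reads a single integer from a sysfs attribute such as
 * /sys/block/dm-0/queue/discard_max_bytes ("dm-0" is a placeholder).
 * Returns the value, or -1 if the file cannot be opened or parsed.
 */
long long read_discard_max_bytes(const char *path)
{
    FILE *f = fopen(path, "r");
    long long value = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%lld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

A result greater than 0 would mean that mkfs or fstrim can end up 
sending discards to that device; it says nothing about how the target 
answers them.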

Today I played around a bit in my test environment (where the failure 
does not occur yet), also tcpdumping the iSCSI traffic, viewing it 
afterwards using wireshark, and reading about the SCSI specs. That's a 
very interesting way to learn more about what I'm talking about here. :-)

If there's no obvious way to trigger the same error in the test 
environment, I think I'm going to propose triggering it again with the 
test physical server attached to the production LUNs. From the past 
occurrence, I know that the only thing that breaks is the storage 
connection on the physical server that executes the UNMAP. It's still 
not the most reassuring choice, but a kind of calculated risk.

If that's possible, I can capture a couple of tcpdumps of the iSCSI 
traffic plus blktrace dumps to record what's going on and post them 
here. Doing so will prove whether the SCSI error was actually being sent 
by the NetApp device or not.
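
When going through such a capture, the sense key, ASC and ASCQ can be 
read straight from the raw sense bytes. A minimal sketch, assuming the 
standard fixed (response code 0x70/0x71) and descriptor (0x72/0x73) 
sense formats; the struct and function names are made up for this 
example:

```c
#include <stddef.h>
#include <stdint.h>

/* Sense key / ASC / ASCQ triple extracted from a sense buffer. */
struct sense_triple {
    uint8_t key;
    uint8_t asc;
    uint8_t ascq;
};

/*
 * Decodes raw SCSI sense data. Returns 0 on success, -1 if the buffer
 * is too short or the response code is not one of 0x70..0x73.
 */
int decode_sense(const uint8_t *buf, size_t len, struct sense_triple *out)
{
    uint8_t rc;

    if (buf == NULL || out == NULL || len < 1)
        return -1;

    rc = buf[0] & 0x7f;             /* strip the VALID bit */
    if (rc == 0x70 || rc == 0x71) { /* fixed format */
        if (len < 14)
            return -1;
        out->key  = buf[2] & 0x0f;
        out->asc  = buf[12];
        out->ascq = buf[13];
        return 0;
    }
    if (rc == 0x72 || rc == 0x73) { /* descriptor format */
        if (len < 4)
            return -1;
        out->key  = buf[1] & 0x0f;
        out->asc  = buf[2];
        out->ascq = buf[3];
        return 0;
    }
    return -1;
}
```

The error from production would decode to key 0x03 (MEDIUM ERROR) with 
ASC 0x30; if the capture shows those bytes coming from the filer, the 
kernel is only reacting to what it was sent.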

-- 
Hans van Kranenburg - System / Network Engineer
T +31 (0)10 2760434 | hans.van.kranenburg at mendix.com | www.mendix.com