[Virtual-pkg-base-maintainers] Bug#759706: Debian Bug

Tomas Pospisek tpo at sourcepole.ch
Tue Sep 16 06:11:08 UTC 2014


Hello Dominique,

there are a few points worth considering with respect to #759706:

1. I do respect your bug report

2. the main reason I'm closing the bug is that I very much suspect that,
    filed against the 'base' package, the report will never be seen by
    anyone willing or able to help you. Keeping it there, against 'base',
    is therefore useless for you: you will very probably get no help.

3. Given point 2, what you need to do to get help is to find someone, or
    some place, that can actually provide it. Again, I doubt 'base' is
    that place. Since I'm no expert in the matters you are discussing (SSD
    performance and support in the kernel, md-raid, lvm), I can't tell you
    exactly where to go - but I can make suggestions: see
    http://debian.org/support, ask on the debian-user mailing list, ask on
    #debian IRC, ask the mdadm maintainers, ask the relevant upstream
    kernel mailing lists (I guess there's one specialised in md or block
    devices etc.)

4. As far as I can tell, your report does not provide the full details
    clearly enough for (at least) me to understand how your system is set
    up and how you are running your tests. For example, at some point you
    write:

      I switched the disks (/dev/sda1 back to /dev/md0

    But as far as I understood, /dev/sda1 never left /dev/md0?
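
    For what it's worth - a suggestion only, with /dev/md0 and /dev/sda1
    taken from your report, so adjust the names to your actual setup -
    pasting the output of the following commands would make the layout
    unambiguous:

      cat /proc/mdstat            # which arrays exist and their state
      mdadm --detail /dev/md0     # members, layout, event count
      lsblk                       # the whole block device stack (md, lvm, fs)
      pvs; vgs; lvs               # how LVM sits on top of the array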

5. From your bug report I can't see why you are not testing without the
    lvm and ext4 layers. The problem seems to lie with the md-raid device,
    so drop those other layers to narrow the problem down. md being a
    block device, you should be able to test its performance with a
    simple:

      time dd if=random_data of=/dev/md0

    (Please double check before doing this)
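
    If you do test the raw array, here is a slightly more controlled
    variant - my sketch only, and the write test DESTROYS whatever is on
    /dev/md0, so run it only on an array you can afford to wipe:

      # read test, does not modify the array
      dd if=/dev/md0 of=/dev/null bs=1M count=1024 iflag=direct

      # write test - OVERWRITES /dev/md0, only on a disposable array!
      dd if=/dev/zero of=/dev/md0 bs=1M count=1024 oflag=direct

    Running the same commands against the second, well-behaved array gives
    you a direct comparison of the md layer alone.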

6. From my own experience: I have seen extremely weird behaviour with
    flash disks, which I couldn't pin down even with paid support from a
    Linux kernel filesystem expert. That was a few years ago, and you are
    not using flash disks per se but SSDs, but still, it was a pain in the
    ass, and we simply switched flash disk manufacturers and the problem
    magically disappeared.

7. Also taking point 6 into account: from your report:

        Events : 61833

    That should ring an alarm bell with you. What are those events?
    There are far too many of them. They should be logged - what do they say?
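
    For instance - again assuming /dev/md0 and /dev/sda1 from your report -
    the event counter and any resync/degraded history should be visible via:

      mdadm --detail /dev/md0       # array state plus the event count
      mdadm --examine /dev/sda1     # the same counters as seen by a member
      dmesg | grep -i -e md0 -e raid
      grep -i -e md0 -e raid /var/log/syslog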


To sum up, I hope you can see that I do want to help you, but that 'base' 
is not the right place to get help or get the bug fixed (in case it's in 
software). OK?

Best greets,
*t

On Mon, 15 Sep 2014, Dominique Barton wrote:

> Hi
>
> I’m writing you regarding the Debian bug at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=759706
>
> You closed it because "it is not really clear that there actually is a bug".
> As you can see from my tests, I used the exact same setup/stack to configure the 2nd RAID (same hardware, same config) and it has no issues.
> Tell me if I’m wrong, but IMHO this is pointing to a software issue/error and this is a bug.
>
> Yes, I’m not quite sure where the performance issues are coming from.
> But it has something to do with md or lvm, and it’s quite hard to find the issue if you’re not a developer of lvm/md.
>
> What do you mean by "Once you have narrowed down the problem”?
> You want a kernel/mdadm patch or something of the sort? There are plenty of other bugs w/ the same amount of information, the same context or even less, and they’ve been investigated.
> No offense, but the problem is already quite narrowed down and I’m looking for s/o w/ a little experience in md.
>
> Everything a “user” can do to narrow down this problem was already done, except for testing the md device directly instead of the LV on top of it.
> But I can’t do that, because the md device is in use as PV for my rootvg.
> Of course I can move my whole rootvg to a new MD, reboot my system and test against the “old” MD. But this might “fix” the problems temporarily, because you’ve seen what happened to the 2nd MD raid which is working perfectly w/o any changes.
>
> I’m trying to narrow down the issues right now w/ the current situation.
> So all I can do is run tests on my LV, and all I see is high CPU (I/O wait) usage, which means at least one layer is waiting for the I/O response.
> Now we’re digging at a depth I don’t know and can’t explain any further. Of course I can run my tests again w/ strace attached, to see which syscall is causing these “problems” or waiting for a response, but I’m pretty sure it’s a write() or fwrite().
>
> I think that’s quite “narrowed down”… No offense, but can you please reopen that bug?
>
> Cheers
> Domi
>

