[Pkg-ofed-devel] ofa-kernel: ib_query_gid() failed

Mario Lang mlang at debian.org
Mon Oct 12 16:07:41 UTC 2009


Hi.

We recently got InfiniBand hardware (mlx4 cards) for one
of our HPC cluster systems running Debian.

82:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe 2.0 2.5GT/s] (rev a0)

Installation of the user-space libraries/programs went fine.  The
pkg-ofed repository has been very valuable to us, thanks for your work.
ipoib works as expected, and the InfiniBand tools such as ibstat
and ibswitches produce the expected output.

node1:~# ibstat
CA 'mlx4_0'
        CA type: MT25418
        Number of ports: 2
        Firmware version: 2.6.900
        Hardware version: a0
        Node GUID: 0x0003ba0001007208
        System image GUID: 0x0003ba000100720b
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 20
                Base lid: 3
                LMC: 0
                SM lid: 3
                Capability mask: 0x0251086a
                Port GUID: 0x0003ba0001007209
        Port 2:
                State: Down
                Physical state: Polling
                Rate: 10
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x0251086a
                Port GUID: 0x0003ba000100720a

node1:~# ibswitches
Switch  : 0x00066a00d9000bac ports 24 "SilverStorm 9024 DDR GUID=0x00066a00d9000bac" enhanced port 0 lid 2 lmc 0

However, since XRC is not in the mainline kernel, openmpi and the
ib diag tools refuse to start (this is a well-known limitation).

This is why we built and installed the ofa-kernel modules.
On the 2.6.26-2-amd64 kernel from lenny, compilation and installation
seem to work fine.  The modules are loaded during boot and the card
seems to be recognized correctly.

Oct 12 12:52:27 node1 kernel: [   14.462221] mlx4_core: Mellanox ConnectX core driver v1.0 (April 4, 2008)
Oct 12 12:52:27 node1 kernel: [   14.771930] mlx4_core: Initializing 0000:82:00.0
Oct 12 12:52:27 node1 kernel: [   14.834436] ACPI: PCI Interrupt Link [LN2D] enabled at IRQ 43
Oct 12 12:52:27 node1 kernel: [   14.851617] ACPI: PCI Interrupt 0000:82:00.0[A] -> Link [LN2D] -> GSI 43 (level, low) -> IRQ 43
Oct 12 12:52:27 node1 kernel: [   14.876579] PCI: Setting latency timer of device 0000:82:00.0 to 64

However, opensm fails to start, which we traced down to
ibstat -p hanging.  The kernel log shows the following upon
/etc/init.d/opensm start:

Oct 12 12:52:48 node1 kernel: [   78.170077] ib0: ib_query_gid() failed
Oct 12 12:52:58 node1 kernel: [   89.272789] ib0: ib_query_port failed
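
To pick such messages out of a longer log, something like the following
could be used.  It is only a sketch: the regular expression is modelled
on the two lines above and may miss other message formats, and the name
find_ib_failures is made up for illustration.

```python
import re

# Matches syslog-style kernel lines such as:
#   Oct 12 12:52:48 node1 kernel: [   78.170077] ib0: ib_query_gid() failed
# The pattern is an assumption derived from the two lines quoted above.
IB_FAILURE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+"            # seconds since boot
    r"(?P<iface>ib\d+):\s+"                      # IPoIB interface name
    r"(?P<call>ib_query_\w+)(?:\(\))?\s+failed"  # failing verbs call
)


def find_ib_failures(lines):
    """Return (uptime_seconds, interface, call) for each ib_query_* failure."""
    failures = []
    for line in lines:
        match = IB_FAILURE.search(line)
        if match:
            failures.append(
                (float(match["uptime"]), match["iface"], match["call"])
            )
    return failures


# Sample input copied from the log excerpts in this mail.
sample = [
    "Oct 12 12:52:27 node1 kernel: [   14.771930] mlx4_core: Initializing 0000:82:00.0",
    "Oct 12 12:52:48 node1 kernel: [   78.170077] ib0: ib_query_gid() failed",
    "Oct 12 12:52:58 node1 kernel: [   89.272789] ib0: ib_query_port failed",
]

for uptime, iface, call in find_ib_failures(sample):
    print(f"{uptime:10.3f}s  {iface}  {call}")
```

Feeding it the lines of the kernel log instead of the sample list would
list every failing ib_query_* call together with its interface.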

We don't get any other obvious dmesg errors.
ipoib doesn't work since opensm can't be started.  We didn't even try
running openmpi in this mode since it seems pointless if ib_query_gid() fails.

Summary: Mainline kernels from Debian (2.6.26-2-amd64 and
2.6.31-trunk-amd64) work fine: opensm starts and ipoib works.
However, XRC is missing from these kernels, which basically means
most native InfiniBand applications (those not using ipoib) fail.

ofa-kernel with 2.6.26-2-amd64 produces the above-mentioned problems
with ib_query_gid() and leaves us with all IB-related services unusable.

We also tried to roll our own version of openmpi with
--disable-openib-connectx-xrc, which seemed a good way to get an
openib-based openmpi to work on mainline kernels (read: XRC missing).
However, openmpi still complains about XRC not being present, which
seems to indicate that the decision to use XRC is made somewhere
further down in the dependency chain.  I haven't figured out where, though.

Does anyone have any ideas how to proceed from here?
Is the ofa-kernel problem described above a known issue?  If so, is
there a fix, or are we doing something wrong here?
Is there anything we could try to narrow this down a little further?
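
One check we could still run is to call the tools under a time limit,
to tell a hang apart from an error exit.  The following is a rough
sketch, assuming Python 3.7 or later is available on the node; the
probe helper and the 10 second limit are arbitrary choices for
illustration and not part of any OFED tool.

```python
import subprocess


def probe(cmd, timeout_s=10.0):
    """Run cmd and return (status, output).

    status is "ok", "error", "hang" or "missing".
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising on timeout.
        return "hang", ""
    except FileNotFoundError:
        # The command is not installed or not in PATH.
        return "missing", ""
    status = "ok" if result.returncode == 0 else "error"
    return status, result.stdout + result.stderr


if __name__ == "__main__":
    # Compare the hanging invocation with the plain one.
    for cmd in (["ibstat", "-p"], ["ibstat"]):
        status, _output = probe(cmd)
        print(" ".join(cmd), "->", status)
```

A caveat: if the process is blocked inside the kernel in uninterruptible
sleep, the kill after the timeout may not take effect, and the probe
itself can then block as well.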

-- 
Thanks,
  ⡍⠁⠗⠊⠕ | Debian Developer <URL:http://debian.org/>
  .''`. | Get my public key via finger mlang/key at db.debian.org
 : :' : | 1024D/7FC1A0854909BCCDBE6C102DDFFC022A6B113E44
 `. `'
   `-      <URL:http://delysid.org/>  <URL:http://www.staff.tugraz.at/mlang/>