HP Moonshot/McDivitt: iSER compatibility issues between x86 OFED target and McDivitt ARM64 client

Bug #1401575 reported by Brian Fromme
Affects: open-iscsi (Ubuntu)
Status: Invalid
Importance: Undecided
Assigned to: Rafael David Tinoco

Bug Description

HP found this issue when working with Ubuntu 14.04.1, kernel: 3.13.0-24-generic

The Ubuntu open-iscsi package is missing upstream patches to make iSER function.

The latest open-iscsi is available on GitHub: https://github.com/mikechristie/open-iscsi

==========
Release: Ubuntu 14.04.1 LTS
open-iscsi:
  Installed: (none)
  Candidate: 2.0.873-3ubuntu9
  Version table:
     2.0.873-3ubuntu9 0
        500 http://us.archive.ubuntu.com/ubuntu/ trusty/main amd64 Packages

Changed in open-iscsi (Ubuntu):
status: New → In Progress
assignee: nobody → Rafael David Tinoco (inaddy)
Rafael David Tinoco (rafaeldtinoco) wrote :

I need more information on this. Since I don't have the hardware to test iSER, I need the people interested in this bug to describe which upstream patches are missing from Trusty's version and what kind of "compatibility issues" they are facing.

Thank you

Rafael Tinoco

Hadar Hen Zion (hadarh) wrote :

1. The compatibility issue is an error that occurs during discovery from the initiator.

The initiator logs the following messages:
root@mcdivitt-c32n1:/var/log# cat syslog|tail
Nov 13 01:55:14 mcdivitt-c32n1 kernel: [656442.898068] scsi10 : iSCSI Initiator over iSER
Nov 13 01:55:14 mcdivitt-c32n1 kernel: [656442.904675] iser: iser_drain_tx_cq:tx id ffffffc7b16d2c98 status 11 vend_err 89
Nov 13 01:55:14 mcdivitt-c32n1 kernel: [656442.993487] connection7:0: detected conn error (1011)
Nov 13 01:55:25 mcdivitt-c32n1 kernel: [656453.418098] scsi11 : iSCSI Initiator over iSER
Nov 13 01:55:25 mcdivitt-c32n1 kernel: [656453.430361] iser: iser_drain_tx_cq:tx id ffffffc7c7905498 status 11 vend_err 89
Nov 13 01:55:25 mcdivitt-c32n1 kernel: [656453.519171] connection8:0: detected conn error (1011)

2. To solve this issue, it would be better to update the open-iscsi package to the latest version rather than take specific patches.
I can't identify the specific patches. The kernel version was updated, so now the user space needs to be updated as well.

Thanks,
Hadar
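
For anyone triaging similar logs, the two fields worth pulling out of the `iser_drain_tx_cq` lines above are `status` and `vend_err`. Below is a minimal Python sketch, not part of the original report. The function name `parse_iser_tx_errors` is hypothetical, and the partial status table is an assumption transcribed from the `ib_wc_status` enum ordering in the kernel's `ib_verbs.h`, so it should be checked against the headers of the kernel actually in use. `vend_err` is vendor-specific and is left undecoded.

```python
import re

# Partial table of work-completion status codes. ASSUMPTION: this follows the
# ib_wc_status enum ordering in the kernel's ib_verbs.h; verify locally.
WC_STATUS = {
    0: "IB_WC_SUCCESS",
    4: "IB_WC_LOC_PROT_ERR",
    5: "IB_WC_WR_FLUSH_ERR",
    10: "IB_WC_REM_ACCESS_ERR",
    11: "IB_WC_REM_OP_ERR",
    12: "IB_WC_RETRY_EXC_ERR",
}

PATTERN = re.compile(
    r"iser_drain_tx_cq:tx id (?P<id>[0-9a-f]+) "
    r"status (?P<status>\d+) vend_err (?P<vend_err>[0-9a-f]+)"
)


def parse_iser_tx_errors(lines):
    """Return one dict per iser_drain_tx_cq line found in `lines`."""
    results = []
    for line in lines:
        match = PATTERN.search(line)
        if match is None:
            continue  # not an iSER TX completion error line
        status = int(match.group("status"))
        results.append({
            "tx_id": match.group("id"),
            "status": status,
            "status_name": WC_STATUS.get(status, "unknown"),
            "vend_err": match.group("vend_err"),  # vendor-specific, kept as text
        })
    return results


# Two lines copied from the syslog excerpt in this report.
SAMPLE = [
    "Nov 13 01:55:14 mcdivitt-c32n1 kernel: [656442.904675] iser: "
    "iser_drain_tx_cq:tx id ffffffc7b16d2c98 status 11 vend_err 89",
    "Nov 13 01:55:14 mcdivitt-c32n1 kernel: [656442.993487] "
    "connection7:0: detected conn error (1011)",
]

for entry in parse_iser_tx_errors(SAMPLE):
    print(entry["tx_id"], entry["status"], entry["status_name"], entry["vend_err"])
```

If the assumed table is right, status 11 is a remote operation error, which would point at the target side rather than the initiator.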

Rafael David Tinoco (rafaeldtinoco) wrote :

Hadar,

This is probably not going to happen.

Backporting entire upstream versions to stable releases in order to fix an issue is not accepted by the Ubuntu community.

Based on the following public discussion:

https://groups.google.com/forum/#!msg/open-iscsi/BP6Q33U3ZIY/HlZOO54XV1UJ

It looks like there is a patch/fix for userland code to avoid such problems:

iser_drain_tx_cq:tx id ffffffc7b16d2c98 status 11 vend_err 89

Brian,

Is it possible for me to get my hands on IB hardware so I can fix this?

Tks

Rafael Tinoco

Hadar Hen Zion (hadarh) wrote :

Rafael, Brian,

I would like to send you IB hardware.

Please post your address and other relevant details here and I'll send you a Mellanox HCA with IB support.

Thanks,
Hadar

Brian Fromme (brianfromme) wrote :

From: <email address hidden>
The issue you linked below is one of the problems we identified (iSER discovery issues running ARM initiator against x86 target).

The other issue is that we are getting very poor iSER Random I/O throughput and latency on an ARM cartridge (m400) compared to an x86 cartridge (m710). There’s also a gap in the vanilla iSCSI random and sequential I/O performance, which needs to be investigated.

We’ve been working both these issues through Mellanox.

Anders m710

Single Client – 4KiB IO Patterns on a logical volume through iSCSI from HP ProLiant SL4540 Server.
1 HP ProLiant m710 Node

Pattern                            Bandwidth (MB/s)   IOPS (thousands)   Avg. Latency (msec)
Random Read (100%)                 61                 16                 62.498
Random Read Write Mix (70:30)      52                 13                 72.855
Random Write (100%)                41                 11                 90.644
Sequential Read (100%)             803                206                4.656
Sequential Read Write Mix (70:30)  326                83                 11.755
Sequential Write (100%)            434                111                8.672

Single Client – 4KiB IO Patterns on one logical volume through iSER from HP ProLiant SL4540 Server.
1 HP ProLiant m710 Node

Pattern                            Bandwidth (MB/s)   IOPS (thousands)   Avg. Latency (msec)
Random Read (100%)                 168.178            43.0535            3.707
Random Read Write Mix (70:30)      85.077             21.779             29.418
Random Write (100%)                114.460            29.301             21.840
Sequential Read (100%)             1069.6             273.809            2.303
Sequential Read Write Mix (70:30)  828.85             212.184            3.111
Sequential Write (100%)            1066.6             273.032            2.328

McDivitt m400

Single Client – 4KiB IO Patterns on a logical volume of m400 through iSCSI from HP ProLiant SL4540 Server.
1 HP ProLiant m400 Node

Pattern                            Bandwidth (MB/s)   IOPS (thousands)   Avg. Latency (msec)
Random Read (100%)                 41.632             10.657             90.062
Random Read Write Mix (60:40)      22.595             5.783              203.204
Random Write (100%)                17.521             4.485              214
Sequential Read (100%)             585.663            149.929            6.401
Sequential Read Write Mix (60:40)  205.923            52.715             22.245
Sequential Write (100%)            307.691            78.769             12.184

Single Client – 4KiB IO Patterns on a logical volume of m400 through iSER from HP ProLiant SL4540 Server.
1 HP ProLiant m400 Node

Pattern                            Bandwidth (MB/s)   IOPS (thousands)   Avg. Latency (msec)
Random Read (100%)                 39.737             10.172             94.36
Random Read Write Mix (60:40)      25.265             6.466              181.659
Random Write (100%)                17.776             4.55               210.869
Sequential Read (100%)             1013.5             259.442            3.699
Sequential Read Write Mix (60:40)  775.645            198.564            5.846
Sequential Write (100%)            1053.8             269.756            3.366

Cheers,
Bharath
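
As a sanity check on the figures above, bandwidth and IOPS should be tied together by the 4KiB block size. The short Python sketch below is illustrative and not part of the original message; `bandwidth_mib_per_s` is a hypothetical helper name. It recomputes bandwidth from the reported IOPS for the iSER random-read rows and compares the two cartridges. It assumes the IOPS column is in decimal thousands and that the bandwidth column, although labelled MB/s, is actually in MiB/s, since that is what the iSER numbers appear to be consistent with.

```python
BLOCK_SIZE = 4 * 1024  # 4 KiB per I/O, as stated in the table captions


def bandwidth_mib_per_s(kiops):
    """Bandwidth in MiB/s implied by an IOPS figure given in thousands."""
    return kiops * 1000 * BLOCK_SIZE / 2**20


# iSER, Random Read (100%), values copied from the tables in this report.
ISER_RANDOM_READ_KIOPS = {"m710": 43.0535, "m400": 10.172}

for node, kiops in ISER_RANDOM_READ_KIOPS.items():
    print(f"{node}: {kiops} K IOPS -> {bandwidth_mib_per_s(kiops):.3f} MiB/s")

ratio = ISER_RANDOM_READ_KIOPS["m710"] / ISER_RANDOM_READ_KIOPS["m400"]
print(f"m710 delivers about {ratio:.1f}x the iSER random-read IOPS of m400")
```

The recomputed values land within rounding of the reported bandwidth column, so the gap between the two cartridges is not an artifact of inconsistent units.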

Raghuram Kota (rkota)
tags: added: arm64 hs-arm64
no longer affects: lomond
Rafael David Tinoco (rafaeldtinoco) wrote :

I have created the following PPA for Trusty:

https://launchpad.net/~inaddy/+archive/ubuntu/lp1401575

# add-apt-repository ppa:inaddy/lp1401575
# apt-get update
# apt-get install open-iscsi

Please test the latest bits to see if the issue is mitigated.
Meanwhile, I'm running all the tests in our Mellanox lab @ Canonical.

Thank you

Rafael Tinoco

Rafael David Tinoco (rafaeldtinoco) wrote :

Okay, I was able to reproduce the problem with the Trusty open-iscsi package:

------------------------------
Apr 13 15:38:21 hertz iscsid: iSCSI logger with pid=12263 started!
Apr 13 15:38:22 hertz iscsid: iSCSI daemon with pid=12264 started!
Apr 13 15:38:23 hertz tgtd: iser_alloc_pool(572) shmget rdma pool sz:1073741824 failed
Apr 13 15:38:23 hertz tgtd: iser_cm_conn_established(1615) conn:0x10af640 cm_id:0x10af080, 172.16.0.1 -> 172.16.0.1, established
Apr 13 15:38:23 hertz kernel: [259224.104000] scsi7 : iSCSI Initiator over iSER
Apr 13 15:38:23 hertz kernel: [259224.104000] tgtd[11429] segfault at 0 ip 00007f7808bb1abf sp 00007ffd0940fcf0 error 4 in librdmacm.so.1.0.0[7f7808bac000+d000]
Apr 13 15:38:23 hertz kernel: [259224.104000] iser: iser_drain_tx_cq:tx id ffff8807fccf9098 status 11 vend_err 89
Apr 13 15:38:23 hertz kernel: [259224.104000] connection1:0: detected conn error (1011)
Apr 13 15:38:23 hertz tgtd: handle_wc_error(2978) conn:0x10bbe40 task:0x10b1e00 tag:0x0000 wr_id:0x0x10b1e60 op:recv err:local protection error vendor_err:0x32
Apr 13 15:38:23 hertz tgtd: tgtd logger exits abnormally, pid:11430
------------------------------

And also with the open-iscsi package from the PPA:

------------------------------
Apr 13 15:42:32 hertz tgtd: tgtd daemon started, pid:16809
Apr 13 15:42:32 hertz tgtd: tgtd logger started, pid:16810 debug:0
Apr 13 15:42:33 hertz tgtd: work_timer_start(150) use signal based scheduler
Apr 13 15:42:33 hertz tgtd: bs_init(390) use signalfd notification
Apr 13 15:42:42 hertz tgtd: iser_alloc_pool(572) shmget rdma pool sz:1073741824 failed
Apr 13 15:42:43 hertz kernel: [259484.232000] scsi8 : iSCSI Initiator over iSER
Apr 13 15:42:43 hertz kernel: [259484.236000] iser: iser_drain_tx_cq:tx id ffff880804cfec98 status 11 vend_err 89
Apr 13 15:42:43 hertz kernel: [259484.236000] connection1:0: detected conn error (1011)
Apr 13 15:42:43 hertz tgtd: iser_cm_conn_established(1615) conn:0x159b530 cm_id:0x159af20, 172.16.0.1 -> 172.16.0.1, established
Apr 13 15:42:43 hertz tgtd: handle_wc_error(2978) conn:0x159b530 task:0x159bd60 tag:0x0000 wr_id:0x0x159bdc0 op:recv err:local protection error vendor_err:0x32
Apr 13 15:42:43 hertz tgtd: iser_conn_close(1294) conn:0x159b530 cm_id:0x0x159af20 state: CLOSE, refcnt:1
Apr 13 15:42:43 hertz tgtd: iser_cm_disconnected(1635) conn:0x159b530 cm_id:0x159af20 event:10, RDMA_CM_EVENT_DISCONNECTED
Apr 13 15:42:43 hertz tgtd: iser_cm_timewait_exit(1649) conn:0x159b530 cm_id:0x159af20
------------------------------

I'm also backporting "tgtd" to see if the latest tgtd bits mitigate the iSER incompatibility with Mellanox HBAs.
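
Both runs above show the same target-side failure, so it can help to extract just the `handle_wc_error` fields from the tgtd output. The Python sketch below is illustrative only; `parse_tgtd_wc_errors` is a hypothetical name and the regular expression is written against the two log lines quoted in this comment, so it is an assumption that other tgtd versions format the line the same way.

```python
import re

WC_ERROR = re.compile(
    r"handle_wc_error\(\d+\) conn:(?P<conn>0x[0-9a-f]+) .*? "
    r"op:(?P<op>\w+) err:(?P<err>.+?) vendor_err:(?P<vendor_err>0x[0-9a-f]+)"
)


def parse_tgtd_wc_errors(lines):
    """Return (conn, op, err, vendor_err) tuples for tgtd handle_wc_error lines."""
    found = []
    for line in lines:
        match = WC_ERROR.search(line)
        if match:
            found.append((
                match.group("conn"),
                match.group("op"),
                match.group("err"),
                int(match.group("vendor_err"), 16),  # logged in hex
            ))
    return found


# One line from each run (Trusty package, then PPA package), copied from above.
SAMPLE = [
    "Apr 13 15:38:23 hertz tgtd: handle_wc_error(2978) conn:0x10bbe40 "
    "task:0x10b1e00 tag:0x0000 wr_id:0x0x10b1e60 op:recv "
    "err:local protection error vendor_err:0x32",
    "Apr 13 15:42:43 hertz tgtd: handle_wc_error(2978) conn:0x159b530 "
    "task:0x159bd60 tag:0x0000 wr_id:0x0x159bdc0 op:recv "
    "err:local protection error vendor_err:0x32",
]

for conn, op, err, vendor_err in parse_tgtd_wc_errors(SAMPLE):
    print(conn, op, err, hex(vendor_err))
```

Seeing an identical `op:recv` local protection error with both open-iscsi builds is consistent with the fault being on the tgtd side.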

Rafael David Tinoco (rafaeldtinoco) wrote :

Good news,

I just backported "tgtd" from upstream (in the same PPA) and the following sequence:

root@hertz:~# tgtd

root@hertz:~# tgt-setup-lun -n lunit1 -d /disks/lunit1 -t iser
Using transport: iser
Creating new target (name=iqn.2001-04.com.hertz-lunit1, tid=1)
Adding a logical unit (/disks/lunit1) to target, tid=1
Accepting connections from all initiators

root@hertz:~# iscsiadm -m discovery --op=show --type sendtargets --portal 172.16.0.1 -I iser
172.16.0.1:3260,1 iqn.2001-04.com.hertz-lunit1

Worked!! Now I'm testing this from a remote node with the same packages... this might be a hotfix for the problem.
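
For scripting this check, the discovery output shown above can be split into portal address, port, target portal group tag and target name. This is a small Python sketch and not something from the original thread; `DiscoveredTarget` and `parse_sendtargets` are hypothetical names, and the parsing assumes the `address:port,tpgt iqn` line format seen in the output here.

```python
from typing import NamedTuple


class DiscoveredTarget(NamedTuple):
    address: str
    port: int
    tpgt: int
    iqn: str


def parse_sendtargets(output):
    """Parse `iscsiadm -m discovery` output lines into DiscoveredTarget records."""
    targets = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        portal, iqn = line.split(None, 1)        # "addr:port,tpgt" and target name
        hostport, tpgt = portal.rsplit(",", 1)   # tag follows the last comma
        address, port = hostport.rsplit(":", 1)  # port follows the last colon
        targets.append(DiscoveredTarget(address, int(port), int(tpgt), iqn))
    return targets


# Output line copied from the discovery run in this comment.
OUTPUT = "172.16.0.1:3260,1 iqn.2001-04.com.hertz-lunit1\n"

for target in parse_sendtargets(OUTPUT):
    print(target)
```

An empty result from such a parser would be the scripted equivalent of the failed discovery reported at the top of this bug.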

Rafael David Tinoco (rafaeldtinoco) wrote :

Voilà :D

root@dixie:~# iscsiadm -m discovery --op=new --op=delete --type sendtargets --portal 172.16.0.1 -I iser
172.16.0.1:3260,1 iqn.2001-04.com.hertz-lunit1
root@dixie:~# iscsiadm -m node -l
Logging in to [iface: iser, target: iqn.2001-04.com.hertz-lunit1, portal: 172.16.0.1,3260] (multiple)
Login to [iface: iser, target: iqn.2001-04.com.hertz-lunit1, portal: 172.16.0.1,3260] successful.

Now I have /dev/sdc (from iSER) set up in my environment, served by the other host over IB:

Apr 13 16:51:08 dixie iscsid: Connection7:0 to [target: iqn.2001-04.com.hertz-lunit1, portal: 172.16.0.1,3260] through [iface: iser] is operational now

root@dixie:~# dd if=/dev/zero of=/dev/sdc bs=1M count=1024 oflag=direct
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 0.84224 s, 1.3 GB/s

And it looks okay.
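
The rate that dd reports can be reproduced from the byte count and elapsed time in its output. A quick Python sketch follows, with `throughput_gb_per_s` as a hypothetical helper name and assuming dd's "GB" means 10^9 bytes:

```python
def throughput_gb_per_s(num_bytes, seconds):
    """Throughput in decimal gigabytes per second."""
    return num_bytes / seconds / 1e9


# bs=1M count=1024 writes 1024 blocks of 1 MiB each.
size = 1024 * 1024 * 1024
elapsed = 0.84224  # seconds, from the dd output above

print(f"{size} bytes in {elapsed} s = {throughput_gb_per_s(size, elapsed):.2f} GB/s")
```

This is a single sequential 1 MiB direct-I/O write, so it only confirms that the iSER path works; it is not comparable to the 4KiB random-I/O numbers reported earlier in the thread.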

Rafael David Tinoco (rafaeldtinoco) wrote :

Okay,

I just tested from a remote machine with open-iscsi from Trusty:

inaddy@dixie:~$ dpkg -l | grep iscsi
ii open-iscsi 2.0.873-3ubuntu9 amd64 High performance, transport independent iSCSI implementation

inaddy@dixie:~$ sudo iscsiadm -m discovery --op=new --op=delete --type sendtargets --portal 172.16.0.1 -I iser
172.16.0.1:3260,1 iqn.2001-04.com.hertz-lunit1

inaddy@dixie:~$ sudo iscsiadm -m node -l
Logging in to [iface: iser, target: iqn.2001-04.com.hertz-lunit1, portal: 172.16.0.1,3260] (multiple)
Login to [iface: iser, target: iqn.2001-04.com.hertz-lunit1, portal: 172.16.0.1,3260] successful.

inaddy@dixie:~$ sudo dd if=/dev/zero of=/dev/sdc bs=1M count=1024 oflag=direct
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 0.807006 s, 1.3 GB/s

And it worked, which indicates that the problem is actually in the TGT daemon and NOT in the open-iscsi package.

Rafael David Tinoco (rafaeldtinoco) wrote :

Okay, during the iSCSI logout with kernel 3.13.0-49 I'm getting kernel panics:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1443648

So now I have to check whether Utopic and Vivid suffer from the same problem, and fix the kernel, before backporting the needed TGT daemon fixes to Trusty. Working on this.

tags: added: cts
Rafael David Tinoco (rafaeldtinoco) wrote :

This is not an open-iscsi bug; the real bug is tracked here:

https://bugs.launchpad.net/tgt-project/+bug/1445038

Changed in open-iscsi (Ubuntu):
status: In Progress → Invalid