[00:28:22] *** Joins: ziyeyang_ (~ziyeyang@192.55.46.46)
[01:00:59] *** Quits: ziyeyang_ (~ziyeyang@192.55.46.46) (Quit: Leaving)
[01:35:26] *** Joins: tomzawadzki (uid327004@gateway/web/irccloud.com/x-xiiwsodeogcfgjfr)
[06:28:37] *** Quits: guerby (~guerby@april/board/guerby) (Ping timeout: 252 seconds)
[06:30:51] *** Joins: guerby (~guerby@april/board/guerby)
[10:15:57] *** Quits: tomzawadzki (uid327004@gateway/web/irccloud.com/x-xiiwsodeogcfgjfr) (Quit: Connection closed for inactivity)
[10:28:55] finally back in the office!
[10:29:03] and caught up on email
[10:37:58] *** Quits: zhouhui (~wzh@114.255.44.139) (Ping timeout: 245 seconds)
[11:49:28] *** Joins: bluebird (~bluebird@p5B0ACA94.dip0.t-ipconnect.de)
[12:12:41] *** Joins: gila (~gila@5ED4D979.cm-7-5d.dynamic.ziggo.nl)
[13:13:33] Would nvme-tcp have any issues running over a network bond? I'm running two hosts, one tgt and one running perf, but after 10 seconds I get errors in polling CQ (master) - or is something else known to be causing issues at the moment?
[13:42:58] polling cq? for nvme-tcp there are no completion queues
[13:52:41] oh - I might have the wording wrong. I assumed CQ meant completion queue.
[13:53:00] it does - an RDMA completion queue
[13:53:27] if the tcp transport has prints talking about cq then it could use an audit to change those
[13:53:44] ok.
[13:53:58] I'll fix them to have better messages if you post what it's printing now
[13:54:00] it says so in nvme_tcp:1587
[13:54:04] ok pulling it up
[13:54:30] ok - that should say "error reading from socket" or something along those lines
[13:55:27] that's just calling POSIX recv() on a socket
[13:55:29] and getting an error
[13:56:09] what's the error it's printing out in that message? it's printing the correct error message using strerror at least
[13:56:30] Getting error 11.
[13:57:56] And the proper message that goes along with it in text
[13:58:29] is that EAGAIN
[13:58:40] but before I send you on a wild goose chase @bwalker -- I just realised I was using an RC kernel (5.0), so rerunning the tests with a fresh deployment
[13:58:50] yes that's EAGAIN, temporarily unavailable
[14:02:58] ok - just looked through the code and I don't see how that's possible unless POSIX recv isn't correctly setting errno
[14:04:10] hmm so it just happened again using a fresh Ubuntu 18.04.
[14:05:37] ok on line 1587, you are seeing negative 11 right?
[14:05:45] or wait, positive 11
[14:06:00] nvme_tcp.c:1588:nvme_tcp_qpair_process_completions: *ERROR*: Error polling CQ! (11): Resource temporarily unavailable
[14:07:20] The first sign of trouble comes from the target:
[14:07:23] tcp.c:1238:spdk_nvmf_tcp_qpair_handle_timeout: *ERROR*: No pdu coming for tqpair=0x561aae9c5a20 within 30 seconds
[14:07:23] do you see a print that says spdk_sock_recv() failed?
[14:08:25] during the lifetime of a request, the target times the individual stages
[14:08:29] No I do not... but that might be because the screen gets flooded with those messages (and never stops); let me retry and redirect to a file
[14:08:38] so if the target responds to a request and the initiator never answers
[14:08:43] it gives it 30 seconds, then it disconnects
[14:08:55] so the target is killing the connection
[14:09:06] then presumably the initiator is seeing that through some path
[14:09:24] yes but -- because it does not get any data iirc
[14:09:42] because on recv of data it should reset the keep alive?
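For reference on the EAGAIN discussion above: on a non-blocking socket, recv() failing with errno 11 (EAGAIN/EWOULDBLOCK) normally just means "no data available right now" and is not a fatal error, which is why a persistent "Error polling CQ! (11)" is surprising. The sketch below is not SPDK code and not its spdk_sock_recv(); read_from_socket() is a hypothetical, minimal illustration of how that case is usually separated from real socket errors, assuming a plain POSIX non-blocking socket.

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>

    /* Hypothetical helper: reads from a non-blocking socket and separates
     * "no data yet" from genuine socket errors. */
    static ssize_t
    read_from_socket(int fd, void *buf, size_t len)
    {
        ssize_t rc = recv(fd, buf, len, MSG_DONTWAIT);

        if (rc > 0) {
            return rc; /* got some data */
        }
        if (rc == 0) {
            /* Orderly shutdown by the peer, e.g. after the target's
             * 30 second "no pdu coming" timeout closes the connection. */
            return -1;
        }
        if (errno == EAGAIN || errno == EWOULDBLOCK) {
            /* Non-blocking socket with nothing to read yet: not a
             * failure, the caller should simply poll again later. */
            return 0;
        }
        /* Any other errno is a real socket error worth reporting. */
        fprintf(stderr, "recv() failed: %s\n", strerror(errno));
        return -1;
    }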
[14:10:43] so if you bond the two devices on the initiator side
[14:10:52] it still just looks like a single connection to the target, right?
[14:11:29] Yes
[14:11:43] *** Quits: guerby (~guerby@april/board/guerby) (Ping timeout: 268 seconds)
[14:12:06] there is just one iface, bond:0 -- and it ties together two 10Gb networks, but the connection itself still goes over one -- or the other. It does not interleave over interfaces or anything like that.
[14:12:29] or at least AFAIK, but I might need to read into that
[14:12:48] can you add some prints to nvme_tcp_read_pdu in nvme_tcp.c
[14:12:48] reason I asked is because soft RoCE, for sure, did not work with bonds
[14:13:10] and figure out which path through that function is failing (i.e. returning < 0)
[14:15:02] so you want a BT when nvme_tcp_read_pdu() returns < 0?
[14:16:05] to come back to your first question (i.e. do I see a spdk_sock_recv() failed):
[14:16:19] EAL: Probing VFIO support...
[14:16:20] nvme_tcp.c:1588:nvme_tcp_qpair_process_completions: *ERROR*: Error polling CQ! (11): Resource temporarily unavailable
[14:16:43] So no... I don't see it.
[14:17:12] interesting
[14:17:30] yeah there are 4 or 5 ways that nvme_tcp_read_pdu can fail - it would help to know which path it's on in your case
[14:19:48] ok, let me get back to you on that
[14:20:29] also are you using the latest master branch of spdk or the 18.10 release?
[14:20:36] master.
[14:20:39] ok good
[14:27:14] *** Joins: travis-ci (~travis-ci@ec2-54-162-106-149.compute-1.amazonaws.com)
[14:27:15] (spdk/master) util/string: sprintf_append_realloc to concatenate strings with realloc (Shuhei Matsumoto)
[14:27:15] Diff URL: https://github.com/spdk/spdk/compare/bf6210b3c793...8adbd909911d
[14:27:15] *** Parts: travis-ci (~travis-ci@ec2-54-162-106-149.compute-1.amazonaws.com) ()
[14:31:14] *** Joins: travis-ci (~travis-ci@ec2-54-226-112-178.compute-1.amazonaws.com)
[14:31:15] (spdk/master) bdev/gpt: examine my_lba in primary header (lorneli)
[14:31:15] Diff URL: https://github.com/spdk/spdk/compare/8adbd909911d...94f6d54e2f1a
[14:31:15] *** Parts: travis-ci (~travis-ci@ec2-54-226-112-178.compute-1.amazonaws.com) ()
[15:15:11] *** Joins: guerby (~guerby@april/board/guerby)
[15:16:02] *** Quits: guerby (~guerby@april/board/guerby) (Remote host closed the connection)
[15:17:53] *** Joins: guerby (~guerby@april/board/guerby)
[15:18:44] *** Quits: guerby (~guerby@april/board/guerby) (Remote host closed the connection)
[15:21:16] *** Joins: guerby (~guerby@april/board/guerby)
[15:24:18] *** Quits: bluebird (~bluebird@p5B0ACA94.dip0.t-ipconnect.de) (Quit: Leaving)
[15:44:31] bwalker we are in the state NVME_TCP_PDU_RECV_STATE_AWAIT_PDU_CH when a call to nvme_tcp_read_data() fails. We then go into the loop once more and return NVME_TCP_PDU_FATAL (-2)
[15:52:03] spdk_nvme_wait_for_completion() does nothing with the return value of spdk_nvme_qpair_process_completions(), so we loop forever (it only checks status->done)
[15:54:00] So it seems that this is one bug -- and the second one is that the server appears to hang up for whatever reason
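To make the endless loop described at [15:52:03] concrete: the snippet below is a simplified, hypothetical stand-in for SPDK's internal completion-poll helper, not the actual source. process_completions() mimics the documented contract of spdk_nvme_qpair_process_completions() (number of completions reaped, or a negative value on a transport error) and always fails, simulating the torn-down connection; the rc < 0 check is the kind of change being suggested, since a loop that only tests status->done never exits once the transport hits a fatal error.

    #include <stdbool.h>
    #include <stdio.h>

    struct poll_status {
        bool done;
    };

    /* Hypothetical stand-in for spdk_nvme_qpair_process_completions():
     * returns the number of completions reaped, or a negative value on a
     * transport-level error.  Always fails here to simulate the target
     * having closed the connection. */
    static int
    process_completions(struct poll_status *status, unsigned int max_completions)
    {
        (void)status;
        (void)max_completions;
        return -2;
    }

    /* Hypothetical stand-in for spdk_nvme_wait_for_completion(): without the
     * rc < 0 check it would only test status->done and spin forever,
     * repeatedly printing the polling error seen in the logs above. */
    static int
    wait_for_completion(struct poll_status *status)
    {
        int rc;

        while (!status->done) {
            rc = process_completions(status, 0);
            if (rc < 0) {
                return rc;
            }
        }
        return 0;
    }

    int
    main(void)
    {
        struct poll_status status = { .done = false };

        printf("wait_for_completion() returned %d\n", wait_for_completion(&status));
        return 0;
    }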
[16:29:34] *** Quits: gila (~gila@5ED4D979.cm-7-5d.dynamic.ziggo.nl) (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
[17:45:47] *** Joins: zhouhui (~wzh@114.255.44.139)
[18:51:52] *** Joins: ziyeyang_ (~ziyeyang@192.102.204.38)
[18:59:26] \
[19:04:45] *** Quits: ziyeyang_ (~ziyeyang@192.102.204.38) (Ping timeout: 244 seconds)
[21:42:17] *** Joins: ziyeyang_ (~ziyeyang@192.102.204.38)
[21:42:17] *** ChanServ sets mode: +o ziyeyang_
[21:43:00] *** Joins: ziyeyang__ (~ziyeyang@192.102.204.38)
[21:43:00] *** ChanServ sets mode: +o ziyeyang__
[21:43:12] *** ziyeyang__ sets mode: +o ziyeyang__
[21:43:41] *** Parts: ziyeyang__ (~ziyeyang@192.102.204.38) ("Leaving")