[00:28:22] *** Joins: ziyeyang_ (~ziyeyang@192.55.46.46)
[01:00:59] *** Quits: ziyeyang_ (~ziyeyang@192.55.46.46) (Quit: Leaving)
[01:35:26] *** Joins: tomzawadzki (uid327004@gateway/web/irccloud.com/x-xiiwsodeogcfgjfr)
[06:28:37] *** Quits: guerby (~guerby@april/board/guerby) (Ping timeout: 252 seconds)
[06:30:51] *** Joins: guerby (~guerby@april/board/guerby)
[10:15:57] *** Quits: tomzawadzki (uid327004@gateway/web/irccloud.com/x-xiiwsodeogcfgjfr) (Quit: Connection closed for inactivity)
[10:28:55] finally back in the office!
[10:29:03] and caught up on email
[10:37:58] *** Quits: zhouhui (~wzh@114.255.44.139) (Ping timeout: 245 seconds)
[11:49:28] *** Joins: bluebird (~bluebird@p5B0ACA94.dip0.t-ipconnect.de)
[12:12:41] *** Joins: gila (~gila@5ED4D979.cm-7-5d.dynamic.ziggo.nl)
[13:13:33] Would nvme-tcp have any issues running over a network bond? I'm running two hosts, one tgt and one running perf, but after 10 seconds I get errors in polling CQ (master) - or is something else known to be causing issues at the moment?
[13:42:58] polling cq? for nvme-tcp there are no completion queues
[13:52:41] oh - I might have the wording wrong. I assumed CQ meant completion queue.
[13:53:00] it does - an RDMA completion queue
[13:53:27] if the tcp transport has prints talking about cq then it could use an audit to change those
[13:53:44] ok.
[13:53:58] I'll fix them to have better messages if you post what it's printing now
[13:54:00] it says so in nvme_tcp:1587
[13:54:04] ok pulling it up
[13:54:30] ok - that should say "error reading from socket" or something along those lines
[13:55:27] that's just calling POSIX recv() on a socket
[13:55:29] and getting an error
[13:56:09] what's the error it's printing out in that message? it's printing the correct error message using strerror at least
[13:56:30] Getting error 11.
[13:57:56] And the proper message that goes along with it in text
[13:58:29] is that EAGAIN
[13:58:40] but before I send you on a wild goose chase @bwalker -- I just realised I was using an RC kernel (5.0), so rerunning the tests with a fresh deployment
[13:58:50] yes that's EAGAIN, temporarily unavailable
[14:02:58] ok - just looked through the code and I don't see how that's possible unless POSIX recv isn't correctly setting errno
[14:04:10] hmm so it just happened again using a fresh Ubuntu 18.04.
[14:05:37] ok on line 1587, you are seeing negative 11 right?
[14:05:45] or wait, positive 11
[14:06:00] nvme_tcp.c:1588:nvme_tcp_qpair_process_completions: *ERROR*: Error polling CQ! (11): Resource temporarily unavailable
[14:07:20] The first sign of trouble comes from the target:
[14:07:23] tcp.c:1238:spdk_nvmf_tcp_qpair_handle_timeout: *ERROR*: No pdu coming for tqpair=0x561aae9c5a20 within 30 seconds
[14:07:23] do you see a print that says spdk_sock_recv() failed?
[14:08:25] during the lifetime of a request, the target times the individual stages
[14:08:29] No I do not... but that might be because the screen gets flooded with those messages (and never stops); let me retry and redirect to a file
[14:08:38] so if the target responds to a request and the initiator never answers
[14:08:43] it gives it 30 seconds, then it disconnects
[14:08:55] so the target is killing the connection
[14:09:06] then presumably the initiator is seeing that through some path
[14:09:24] yes but -- because it does not get any data iirc
[14:09:42] because on recv of data it should reset the keep alive?
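For reference on the EAGAIN discussion above: on a non-blocking socket, recv() failing with errno 11 (EAGAIN/EWOULDBLOCK) normally just means "no data available right now" and is not a fatal error, which is why a persistent "Error polling CQ! (11)" is surprising. The sketch below is not SPDK code and not its spdk_sock_recv(); read_from_socket() is a hypothetical, minimal illustration of how that case is usually separated from real socket errors, assuming a plain POSIX non-blocking socket.

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>

    /* Hypothetical helper: reads from a non-blocking socket and separates
     * "no data yet" from genuine socket errors. */
    static ssize_t
    read_from_socket(int fd, void *buf, size_t len)
    {
        ssize_t rc = recv(fd, buf, len, MSG_DONTWAIT);

        if (rc > 0) {
            return rc; /* got some data */
        }
        if (rc == 0) {
            /* Orderly shutdown by the peer, e.g. after the target's
             * 30 second "no pdu coming" timeout closes the connection. */
            return -1;
        }
        if (errno == EAGAIN || errno == EWOULDBLOCK) {
            /* Non-blocking socket with nothing to read yet: not a
             * failure, the caller should simply poll again later. */
            return 0;
        }
        /* Any other errno is a real socket error worth reporting. */
        fprintf(stderr, "recv() failed: %s\n", strerror(errno));
        return -1;
    }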
[14:10:43] so if you bond the two devices on the initiator side
[14:10:52] it still just looks like a single connection to the target, right?
[14:11:29] Yes
[14:11:43] *** Quits: guerby (~guerby@april/board/guerby) (Ping timeout: 268 seconds)
[14:12:06] there is just one iface, bond:0 -- and it ties together two 10Gb networks, but the connection itself still goes over one -- or the other. It does not interleave over interfaces or anything like that.
[14:12:29] or at least AFAIK, but I might need to read into that
[14:12:48] can you add some prints to nvme_tcp_read_pdu in nvme_tcp.c
[14:12:48] reason I asked is because soft RoCE, for sure, did not work with bonds
[14:13:10] and figure out which path through that function is failing (i.e. returning < 0)
[14:15:02] so you want a BT when nvme_tcp_read_pdu() returns < 0?
[14:16:05] to come back to your first question (i.e. do I see a spdk_sock_recv() failed):
[14:16:19] EAL: Probing VFIO support...
[14:16:20] nvme_tcp.c:1588:nvme_tcp_qpair_process_completions: *ERROR*: Error polling CQ! (11): Resource temporarily unavailable
[14:16:43] So no... I don't see it.
[14:17:12] interesting
[14:17:30] yeah there are 4 or 5 ways that nvme_tcp_read_pdu can fail - it would help to know which path it's on in your case
[14:19:48] ok, let me get back to you on that
[14:20:29] also are you using the latest master branch of spdk or the 18.10 release?
[14:20:36] master.
[14:20:39] ok good
[14:27:14] *** Joins: travis-ci (~travis-ci@ec2-54-162-106-149.compute-1.amazonaws.com)
[14:27:15] (spdk/master) util/string: sprintf_append_realloc to concatenate strings with realloc (Shuhei Matsumoto)
[14:27:15] Diff URL: https://github.com/spdk/spdk/compare/bf6210b3c793...8adbd909911d
[14:27:15] *** Parts: travis-ci (~travis-ci@ec2-54-162-106-149.compute-1.amazonaws.com) ()
[14:31:14] *** Joins: travis-ci (~travis-ci@ec2-54-226-112-178.compute-1.amazonaws.com)
[14:31:15] (spdk/master) bdev/gpt: examine my_lba in primary header (lorneli)
[14:31:15] Diff URL: https://github.com/spdk/spdk/compare/8adbd909911d...94f6d54e2f1a
[14:31:15] *** Parts: travis-ci (~travis-ci@ec2-54-226-112-178.compute-1.amazonaws.com) ()
[15:15:11] *** Joins: guerby (~guerby@april/board/guerby)
[15:16:02] *** Quits: guerby (~guerby@april/board/guerby) (Remote host closed the connection)
[15:17:53] *** Joins: guerby (~guerby@april/board/guerby)
[15:18:44] *** Quits: guerby (~guerby@april/board/guerby) (Remote host closed the connection)
[15:21:16] *** Joins: guerby (~guerby@april/board/guerby)
[15:24:18] *** Quits: bluebird (~bluebird@p5B0ACA94.dip0.t-ipconnect.de) (Quit: Leaving)
[15:44:31] bwalker we are in the state NVME_TCP_PDU_RECV_STATE_AWAIT_PDU_CH when a call to nvme_tcp_read_data() fails. We then go into the loop once more and return NVME_TCP_PDU_FATAL (-2)
[15:52:03] spdk_nvme_wait_for_completion() does nothing with the return value of spdk_nvme_qpair_process_completions(), so we loop forever (it only checks status->done)
[15:54:00] So it seems that this is one bug -- and the second one is that the server appears to hang up for whatever reason
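To make the endless loop described at [15:52:03] concrete: the snippet below is a simplified, hypothetical stand-in for SPDK's internal completion-poll helper, not the actual source. process_completions() mimics the documented contract of spdk_nvme_qpair_process_completions() (number of completions reaped, or a negative value on a transport error) and always fails, simulating the torn-down connection; the rc < 0 check is the kind of change being suggested, since a loop that only tests status->done never exits once the transport hits a fatal error.

    #include <stdbool.h>
    #include <stdio.h>

    struct poll_status {
        bool done;
    };

    /* Hypothetical stand-in for spdk_nvme_qpair_process_completions():
     * returns the number of completions reaped, or a negative value on a
     * transport-level error.  Always fails here to simulate the target
     * having closed the connection. */
    static int
    process_completions(struct poll_status *status, unsigned int max_completions)
    {
        (void)status;
        (void)max_completions;
        return -2;
    }

    /* Hypothetical stand-in for spdk_nvme_wait_for_completion(): without the
     * rc < 0 check it would only test status->done and spin forever,
     * repeatedly printing the polling error seen in the logs above. */
    static int
    wait_for_completion(struct poll_status *status)
    {
        int rc;

        while (!status->done) {
            rc = process_completions(status, 0);
            if (rc < 0) {
                return rc;
            }
        }
        return 0;
    }

    int
    main(void)
    {
        struct poll_status status = { .done = false };

        printf("wait_for_completion() returned %d\n", wait_for_completion(&status));
        return 0;
    }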
[16:29:34] *** Quits: gila (~gila@5ED4D979.cm-7-5d.dynamic.ziggo.nl) (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
[17:45:47] *** Joins: zhouhui (~wzh@114.255.44.139)
[18:51:52] *** Joins: ziyeyang_ (~ziyeyang@192.102.204.38)
[18:59:26] \
[19:04:45] *** Quits: ziyeyang_ (~ziyeyang@192.102.204.38) (Ping timeout: 244 seconds)
[21:42:17] *** Joins: ziyeyang_ (~ziyeyang@192.102.204.38)
[21:42:17] *** ChanServ sets mode: +o ziyeyang_
[21:43:00] *** Joins: ziyeyang__ (~ziyeyang@192.102.204.38)
[21:43:00] *** ChanServ sets mode: +o ziyeyang__
[21:43:12] *** ziyeyang__ sets mode: +o ziyeyang__
[21:43:41] *** Parts: ziyeyang__ (~ziyeyang@192.102.204.38) ("Leaving")