[00:51:24] *** guerby_ is now known as guerby
[00:52:13] *** Quits: guerby (~guerby@ip165.tetaneutral.net) (Changing host)
[00:52:13] *** Joins: guerby (~guerby@april/board/guerby)
[02:30:37] *** Joins: vermavis- (~vermavis@192.55.54.44)
[02:34:59] *** Quits: vermavis (~changpe1@192.55.54.44) (*.net *.split)
[02:35:01] *** vermavis- is now known as vermavis
[08:44:51] <peluse> jimharris, I can try back on env layer only.  Will look for that change and if I can find it I'll back it out and see
[09:37:14] <peluse> jimharris, hmm, I tried just grabbing 'event/env: remove dpdk_ prefix  …' and running with it as a complete set of functional code before your change and it fails in a way that seems like I missed something...
[09:38:29] <peluse> here's the error I get w/that patch: EAL: Could not open /dev/hugepages/spdk0map_3072
[09:50:56] <peluse> BTW I nabbed that one as it was the last change to stub.c before you updated to use the framework...
[10:01:15] <jimharris> could you try rm /dev/hugepages/* and then run the test again?
[10:01:29] <jimharris> although I'm not sure how much this will help on your system - IIRC, you were having these issues before my stub changes
[10:01:54] <jimharris> i was trying to think of changes we'd made that would cause wkb-fedora-04 to start failing so often
[10:02:11] <drv> jimharris: do we want to temporarily revert the rocksdb stub change to work around the wkb-fedora-04 failures?
[10:02:19] <jimharris> I think for now we should
[10:02:26] <drv> since that machine is only running rocksdb anyway, it probably doesn't have a big impact on test time
[10:05:03] <peluse> jimharris, rerunning now...
[10:05:48] <drv> here's the revert (hasn't run through the pool yet): https://review.gerrithub.io/#/c/367263/
[10:06:21] <bwalker> what about that --unlink flag?
[10:06:24] <bwalker> should we move on that?
[10:06:51] <jimharris> i'm planning to dig into this more tomorrow
[10:07:02] <jimharris> including --unlink
[10:07:22] <drv> I think the unlink thing won't work with multiprocess, but it's worth a try
[10:08:11] <peluse> what did you want to do with the --unlink?
[10:09:59] <jimharris> in lib/env_dpdk/init.c, funciton spdk_build_eal_cmdline(), you could try adding an argument --huge-unlink to the list of command line arguments we pass to DPDK for initialization
[10:10:14] <jimharris> do this off of latest master, and go back to using the stub with this option
[10:11:18] <peluse> will do... rebooted and will try what I have now first again (no app fw) after fresh boot then I'l do the unlink thing
[10:11:34] <jimharris> drv: i posted a reply to your patch - i notice now you just did a revert, but I'm thinking the original script was wrong on the trap lines
[10:19:00] <drv> ok, I will tweak it
[10:50:18] <jimharris> remove the verified -1 and push it?
[10:57:24] <drv> pushed
[12:26:04] <peluse> jimharris, hmmm.  This is what the parm list should look like right? The core dump is just a bonus I guess :)
[12:26:06] <peluse> Starting DPDK 17.05.0 initialization...
[12:26:06] <peluse> [ DPDK EAL parameters: bdevtest -c 0x3 --file-prefix=spdk_pid57985 --huge-unlink ]
[12:26:07] <peluse> test/lib/bdev/blockdev.sh: line 17: 57985 Segmentation fault      (core dumped) $testdir/bdevio/bdevio $testdir/bdev.conf
[12:27:17] <jimharris> yeah - that looks right
[12:27:49] <jimharris> i'd say hold off on further debug on this for now - i'm going to look at this tomorrow (have a couple of other things to wrap up today) and will ping you once i have something more for you to test on your magic system
[12:28:02] <peluse> cool
[14:46:45] <drv> hrm, we made bdev reset work for the NVMe bdev, but I think all of the other bdev modules still need to be updated to abort/complete all of their I/O when their I/O channel is destroyed
[14:47:37] <drv> (also, once that's complete, we should re-enable the bdevperf reset test that is currently disabled)
[14:55:06] <bwalker> yeah - none of the others have been updated
[14:56:54] *** Quits: gila (~gila@ec2-54-91-114-223.compute-1.amazonaws.com) (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
[15:11:30] <peluse> something maybe interesting wrt the whole stub thing.. in looking at the output from nvme.sh stub is correctly identified as PRIMARY and each subsequent app as SECONDARY up until the point where mine fails, at the last while loop.  There arb is identified as SECONDARY but the next two instances of perf and identify are auto-detected as PRIMARY which I don't believe is correct?
[15:14:31] <drv> yeah, that sounds wrong - arbitration should be primary, and the rest should be secondary
[15:15:04] <drv> we shut down the stub before the multi_process tests, so it shouldn't be the primary anymore
[15:17:12] <peluse> drv, cool, I'll dig some more
[15:17:39] <drv> first thing to check would be if stub is still kicking afer kill_stub
[15:18:06] <peluse> may not be exactly what you guys are seeing but likely related if I can figure out why it's doing that, yeah for sure on the make sure stub is dead since the first arb in that loop should be PRIMARY
[15:53:22] <peluse> need to double check but appears that immediately after kill_stub reactor_0 is still running. If I wait 10 sec (picked out the air) reactor_0 is done and then I march through the next tests with the expected PRI and SEC and vtophys passes
[15:54:22] <drv> hmm, that sounds like bad news
[15:54:53] <drv> can you try adding 'wait $stubpid' after 'kill $stubpid' in scripts/autotest_common.sh?
[15:55:01] <peluse> running a few more times to make sure its consistent..
[15:56:33] <peluse> will do that next
[15:56:58] <peluse> is reactor_0 something that gets fired up - at what point?
[15:58:02] <jimharris> reactor_0 is the name of the thread
[15:58:08] <jimharris> in the stub app
[15:58:16] <peluse> oh, LOL
[16:15:57] <vermavis> jimharris: we can get to 2.05M with a single core using SPDK FIO plugin
[16:20:13] <vermavis> This is a 3.2GHz CPU clock frequency
[16:26:46] <peluse> drv, well... helped but not as effective as the delay just before multi-proc. With the wait at kilstub, I do see that reactor_0 is stil running afer the kill command and gone after the waitpid *and* the PRI and SEC are as expected however something else creeped in on this last run and may or may note be rleated. During perf in multi-proc I got this and then the next instance of identify wouldnt' start
[16:26:47] <peluse> nvme_pcie.c: 966:nvme_pcie_qpair_insert_pending_admin_request: *ERROR*: The owning process (pid 21131) is not found. Drop the request.
[16:27:14] * peluse reboots and will see if it repeats
[17:04:15] <peluse> I'm fairly sure I'm not losing my mind but the things above only work intermittently, the only thing so far that gets vtophy to pass every time on my machins is a 5 sec delay after starting perf and identify in the last loop in the multi_proc tests. When I remove that and add a wait on stubpid I get the above error sometimes and sometimes I just get no huge pages on either perf or identify and one time it worked all the way though...
[17:42:51] <jimharris> peluse: can you also try adding this as an experiment (after the wait $stubpid)
[17:43:21] <jimharris> HUGETLBFS_MOUNT=`mount | grep '^hugetlbfs ' | awk '{print $3 }'`
[17:44:17] <jimharris> rm -f $HUGETLBFS_MOUNT/*
[17:56:02] <peluse> jimharris, will do
[18:07:35] <peluse> still fails on vtophys and the print is a way long "rm -rf /dev/hugepages/spdk0map_0" where it looks like the spdk0map_0 though spdk0_map1024 are all on the parm list
[18:36:39] *** Joins: ziyeyang_ (~ziyeyang@134.134.139.74)
[20:59:21] *** Joins: ziyeyang__ (ziyeyang@nat/intel/x-mbjjsosaqwivnagn)