[01:37:57] *** Joins: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl)
[03:30:20] *** Quits: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl) (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
[03:52:20] *** Joins: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl)
[05:36:23] *** Quits: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl) (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
[05:50:34] *** Joins: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl)
[06:28:55] *** Quits: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl) (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
[07:54:03] *** Joins: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl)
[09:04:13] <bwalker> jimharris: I have an idea for why this randread test gets much worse performance with the rocksdb cache vs. the kernel page cache
[09:04:31] <jimharris> do tell
[09:04:36] <bwalker> it looks to me like stuff is only inserted into cache (the compressed cache) on compaction
[09:04:51] <bwalker> so I'm seeing 100% miss rate on the compressed cache
[09:05:10] <bwalker> this is because we do the inserts, then shutdown
[09:05:14] <bwalker> then start fresh for the randread stuff
[09:05:17] <bwalker> so nothing is in our cache
[09:05:34] <bwalker> with the kernel, I bet stuff is still in the pagecache from the insert run
[09:06:33] <jimharris> hmmm
[09:06:43] <jimharris> that's surprising
[09:07:04] <jimharris> the cache never gets populated on reads?
[09:07:23] <bwalker> the regular block cache does
[09:07:32] <bwalker> but the compressed cache (which I think is really a cache of the .sst files)
[09:07:38] <bwalker> is only populated it appears on compaction
[09:09:56] <bwalker> direct/no cache: 109,144
[09:09:56] <bwalker> direct/16 GB: 110,937 (hit 6.5%)
[09:09:56] <bwalker> direct/38 GB:  112,062 (hit 14.7%)
[09:09:56] <bwalker> direct/16 GB/16 GB compressed: 111,058 (hit 6.5%, compressed hit: 0%)
[09:09:56] <bwalker> buffered/no cache: 120,233
[09:10:11] <bwalker> that's the number of Get ops per second on a 10 min randread run, all using the kernel
[09:11:45] <bwalker> note how the hit % lines up exactly as you'd expect for a random read test - the data is 250GB
[09:11:52] <bwalker> 16GB is 6.9% of 250GB
[09:11:58] <bwalker> 38GB is 16.3% of 250GB
[09:23:13] <bwalker> I'm changing the test to do fillseq followed by randread in the same db_bench run
[09:34:06] <drv> hmm, we've had a couple of failures on wkb-fedora-04 that look like what peluse was seeing - the vtophys test fails to find any hugepages
[09:34:08] <drv> http://spdk.intel.com/public/spdk/builds/release/master/2468/wkb-fedora-04/build.log
[09:35:16] <drv> that machine is using a local DPDK, not the submodule - I'm going to switch it to the submodule and see if it reproduces at all
[09:50:28] <jimharris> hmmm - I think I've only seen that vtophys failure with this DPDK submodule change though
[09:54:15] <drv> it happened on a previous master build before the DPDK change: http://spdk.intel.com/public/spdk/builds/release/master/2468/
[09:54:21] <drv> and that machine wasn't using the submodule anyway
[09:54:37] <drv> it was on a local copy of DPDK 17.02
[09:54:46] <jimharris> ok
[09:54:53] <jimharris> i've only seen it on wkb-fedora-04
[09:55:07] *** Quits: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl) (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
[09:55:19] <drv> yeah, we'll see if it repros now that I've switched it to the submodule
[10:01:54] *** Joins: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl)
[10:28:16] <drv> hmm, wkb-fedora-04 failed again just now, with the new DPDK 17.05 submodule
[10:30:44] <peluse> drv, so a few more points on that vtophys failure: * if I pause 5 sec after the perf and the identify both it works * if I run just the perf and not the ID or the other way around it fails * if I run just the ID, vtophys passes but I get and error about improer shutdown a bit later on (of NVMe). If I then icnrease the 5 sec shutdown TO in the NVMe driver I get better pass rate with just the identify, but still fails with both perf and identify. * no matter
[10:30:44] <peluse> what if I run vtophys on its own it passes.
[10:31:28] <peluse> that's about it for yesterday afternoon dorking with with.  Was going to look a bit more at NVMe driver cleanup next, or something in the identify program
[10:33:12] <peluse> one more, if I pause like 50 sec after the loop in nvme.sh but before vtophys is called, leaving in both perf and ID it still fails.  So it seems like a cleanup mess-up by running multiple ID or perf at the same time.  I'm pretty sure I took out the & background processing just assuming things would pass then and of course they did
[10:35:41] <jimharris> peluse: can you try this?
[10:35:50] <jimharris> 1) echo 0 > /proc/sys/vm/nr_hugepages
[10:36:02] <jimharris> 2) scripts/setup.sh && cat /proc/sys/vm/nr_hugepages
[10:36:28] <jimharris> my guess is some systems (for some TBD reason) take longer than others to actually get the huge pages registered
[10:36:41] <bwalker> also maybe "ls -l /dev/hugepages | wc -l" after a failure
[10:36:51] <jimharris> well I think that's too late
[10:36:51] <bwalker> and see if it says something other than 0
[10:37:13] <jimharris> based on the fact that if he waits 5 seconds, vtophys always passes
[10:38:13] <bwalker> so you think there is potentially some delay between our echo "$NR_HUGE" > /proc/sys/vm/nr_hugepages
[10:38:19] <jimharris> exactly
[10:38:20] <bwalker> and those hugepages actually being available
[10:38:22] <jimharris> yep
[10:38:31] <bwalker> probably possible
[10:38:37] <bwalker> we could make setup.sh more robust against that possibility
[10:38:45] <jimharris> on my system there is definitely a delay before all of them are available
[10:39:07] <jimharris> the weird thing though is that on wkb-fedora-04 and peluse's system, it seems like *none* are available
[10:40:17] <jimharris> i just pushed a test patch to see what this looks like on wkb-fedora-04
[10:40:31] <drv> it might be only hitting on wkb-fedora-04 since it runs the env tests right away
[10:40:39] <drv> the other machines run unittest.sh first, which doesn't need hugepages
[10:45:03] <jimharris> hmmm - well that test patch seems to show all of the hugepages being available immediately after running the script
[10:45:57] <bwalker> but that test patch is going to pass
[10:46:22] <jimharris> true - but i was expecting it to maybe show some number less than 4096
[10:56:38] <peluse> will try, one sec need to reboot
[11:00:39] <peluse> so there's  delay between steps 1 & 2 of 3-5 seconds I'd guess
[11:00:54] <peluse> sorry, after step 2 I mean
[11:01:18] <peluse> then I get back a repsonse of 1024
[11:11:33] *** Joins: johnmeneghini (~johnmeneg@pool-96-252-112-122.bstnma.fios.verizon.net)
[11:11:37] *** Quits: johnmeneghini (~johnmeneg@pool-96-252-112-122.bstnma.fios.verizon.net) (Client Quit)
[11:17:43] *** Joins: johnmeneghini (~johnmeneg@pool-96-252-112-122.bstnma.fios.verizon.net)
[11:17:44] *** Quits: johnmeneghini (~johnmeneg@pool-96-252-112-122.bstnma.fios.verizon.net) (Client Quit)
[11:23:42] *** Joins: johnmeneghini (~johnmeneg@pool-96-252-112-122.bstnma.fios.verizon.net)
[11:23:42] *** Quits: johnmeneghini (~johnmeneg@pool-96-252-112-122.bstnma.fios.verizon.net) (Client Quit)
[13:14:38] <jimharris> my theory is debunked: https://ci.spdk.io/builds/review/06fb9e58575e6ad69c47efb903494cc2397042a8.1498240515/
[13:14:55] <jimharris> nr_hugepages is 4096 before vtophys is run
[13:15:35] <drv> yeah, that's very weird
[13:41:56] <peluse> :(
[13:44:09] <peluse> now I can't remember exactly what the results were just simply up'ing the shutdown TO value in the driver, will run that again real quick
[13:48:01] <drv> I don't think the wkb-fedora-04 results are related to NVMe, since the env test is the first thing we run on that machine (no NVMe tests before that)
[13:48:09] <peluse> hmmm
[13:49:09] <peluse> mine appears to be, or DPDK maybe.  Just ID in the loop and a 50 sec delay before giving up on shutdown status and vtophys passes, however I exit eight after and I can see identify is still an active processs.  Wll attach to it and see what its doing...
[13:50:23] <peluse> OK, in the time it took me to type that I double checked ps and now identify is still listed but as defunct
[13:51:08] <peluse> so maybe vtophys was running while the nvme driver was spinning in the longer shutdown loop?
[14:02:50] <bwalker> jimharris: I hit a crash in the rocksdb spdk plugin after rebasing to 5.4.5. I've got it debugged down to the exact sequence
[14:03:05] <bwalker> it appears to do a write for 13 bytes to offset 0 of the manifest file
[14:03:10] <bwalker> then a read for 32k at offset 0
[14:03:14] <bwalker> then a read for 32k at offset 32k
[14:04:03] <bwalker> which makes the logic in blobfs.c on line 2046 result in a negative number, which results in an enormous length passed to memcpy
[14:04:22] <bwalker> so my first question is - is this write/read pattern something that is supposed to be handled?
[14:04:27] <bwalker> with the current code
[14:17:54] <peluse> so yeah what I suggested earlier is what is happening for sure. When I increase the delay in the NVMe driver shutdown from 5 sec to 50 sec vtophys passes because identify is still running, it finished up (timeous out) after vtophys runs successfully
[14:18:03] <peluse> (on my system at least)
[14:19:42] <peluse> will see if it's the same deal w/only perf running in the loop but assume that it is
[14:23:43] <jimharris> bwalker: looking at the code now...
[14:25:18] <bwalker> I'm checking now to see if rocksdb is doing a truncate to grow the file prior to those reads
[14:25:24] <bwalker> because when it does the reads, it thinks the file is 4MB
[14:25:28] <bwalker> but only 13 bytes have been written
[14:26:54] <bwalker> yep that's exactly what happens now
[14:27:00] <bwalker> the steps are:
[14:27:01] <bwalker> create file
[14:27:04] <bwalker> truncate file to 4MB
[14:27:11] <bwalker> write 13 bytes to offset 0
[14:27:14] <bwalker> read 32k from offset 0
[14:27:18] <bwalker> read 32k from offset 32k
[14:27:28] <bwalker> that second read hits some bad math
[14:27:44] <bwalker> because it has a buffer with 13 bytes in it
[14:28:34] <jimharris> so blobfs certainly needs to handle this - but I'm curious why rocksdb is trying to read 32k of data from a file its only written 13 bytes to?
[14:28:42] <bwalker> yeah no idea
[14:30:19] <jimharris> ok - got an idea on this - will come down to the lab in 5
[15:03:32] <jimharris> well that looks like a smoking gun: http://spdk.intel.com/public/spdk/builds/review/fba7e531af15087e66074cc45da8581ccc3a1fbe.1498254605/wkb-fedora-04/build.log
[15:03:45] <jimharris> + cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
[15:03:46] <jimharris> 4096
[15:03:46] <jimharris> + cat /sys/kernel/mm/hugepages/hugepages-2048kB/free_hugepages
[15:03:47] <jimharris> 0
[15:04:18] <bwalker> yep
[15:04:25] <bwalker> we can make setup.sh block on that
[15:04:55] <jimharris> we should echo > nr_hugepages before we do the device binding
[15:05:06] <jimharris> as well as blocking
[15:05:17] <jimharris> might save us a bit of time
[15:06:03] <jimharris> what about this one?  http://spdk.intel.com/public/spdk/builds/review/fba7e531af15087e66074cc45da8581ccc3a1fbe.1498254605/wkb-fedora-03/build.log
[15:06:19] <jimharris> we already ran some tests, but then a later test doesn't have any huge pages
[15:10:24] <peluse> jimharris, so why no hugepages in that case (the smoking gun log)
[15:10:26] <peluse> ?
[15:10:41] <bwalker> we're assuming that they just haven't been allocated by the kernel yet
[15:10:43] <bwalker> or made available
[15:10:54] <bwalker> we made a request to allocate pages, it noted that we requested it
[15:10:58] <bwalker> but it hasn't finished making them available
[15:11:17] <bwalker> I think the idea is that we just have to wait a bit, which explains why when you add sleeps it fixes it
[15:14:12] <drv> my theory is that they are left over from a previous SPDK app that didn't clean up after itself
[15:25:41] <drv> so I re-ran it a couple of times, and wkb-fedora-04 seems to always work on the first run after boot, then fail on the next run
[15:26:08] <drv> there's got to be some kind of left-over hugepage state that gets cleaned up when we reboot
[15:27:36] <drv> do we leave the stub running or something?
[15:42:43] <peluse> yeah, the 'fails on 2nd run' is true for me too but mine fails on the first without some mucking around as mentioned earlier
[15:42:45] <peluse> one other thing...
[15:43:30] <peluse> if I only run perf in nvme.sh then I get slightly different behavior, with the delay on shutdown instead of vtopy failing I get nvme_pcie.c: 966:nvme_pcie_qpair_insert_pending_admin_request: *ERROR*: The owning process (pid 23689) is not found. Drop the request.
[15:43:30] <peluse>  
[16:18:03] <jimharris> drv, bwalker: we could add --huge-unlink to our dpdk command line options
[16:18:09] <jimharris> this will unlink the files after mapping them
[16:18:28] <jimharris> we will still have them open, but when the process dies the kernel will then automatically clean them up
[16:19:03] <drv> sounds like a good idea to me
[16:19:12] <drv> why isn't that on by default? is there some downside?
[16:54:17] *** Quits: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl) (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
[17:22:51] *** Joins: nix_ (cf8c2b51@gateway/web/freenode/ip.207.140.43.81)
[17:23:35] *** Quits: nix_ (cf8c2b51@gateway/web/freenode/ip.207.140.43.81) (Client Quit)
[19:12:29] *** Joins: johnmeneghini (~johnmeneg@pool-96-252-112-122.bstnma.fios.verizon.net)
[19:12:32] *** Quits: johnmeneghini (~johnmeneg@pool-96-252-112-122.bstnma.fios.verizon.net) (Client Quit)
[20:16:39] *** Joins: johnmeneghini (~johnmeneg@pool-96-252-112-122.bstnma.fios.verizon.net)
[20:16:42] *** Quits: johnmeneghini (~johnmeneg@pool-96-252-112-122.bstnma.fios.verizon.net) (Client Quit)
[20:20:14] *** Joins: johnmeneghini (~johnmeneg@pool-96-252-112-122.bstnma.fios.verizon.net)
[20:20:15] *** Quits: johnmeneghini (~johnmeneg@pool-96-252-112-122.bstnma.fios.verizon.net) (Client Quit)