[01:37:57] *** Joins: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl)
[03:30:20] *** Quits: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl) (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
[03:52:20] *** Joins: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl)
[05:36:23] *** Quits: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl) (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
[05:50:34] *** Joins: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl)
[06:28:55] *** Quits: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl) (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
[07:54:03] *** Joins: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl)
[09:04:13] jimharris: I have an idea for why this randread test gets much worse performance with the rocksdb cache vs. the kernel page cache
[09:04:31] do tell
[09:04:36] it looks to me like stuff is only inserted into cache (the compressed cache) on compaction
[09:04:51] so I'm seeing 100% miss rate on the compressed cache
[09:05:10] this is because we do the inserts, then shutdown
[09:05:14] then start fresh for the randread stuff
[09:05:17] so nothing is in our cache
[09:05:34] with the kernel, I bet stuff is still in the pagecache from the insert run
[09:06:33] hmmm
[09:06:43] that's surprising
[09:07:04] the cache never gets populated on reads?
[09:07:23] the regular block cache does
[09:07:32] but the compressed cache (which I think is really a cache of the .sst files)
[09:07:38] is only populated, it appears, on compaction
[09:09:56] direct/no cache: 109,144
[09:09:56] direct/16 GB: 110,937 (hit 6.5%)
[09:09:56] direct/38 GB: 112,062 (hit 14.7%)
[09:09:56] direct/16 GB/16 GB compressed: 111,058 (hit 6.5%, compressed hit: 0%)
[09:09:56] buffered/no cache: 120,233
[09:10:11] that's the number of Get ops per second on a 10 min randread run, all using the kernel
[09:11:45] note how the hit % lines up exactly as you'd expect for a random read test - the data is 250GB
[09:11:52] 16GB is 6.9% of 250GB
[09:11:58] 38GB is 16.3% of 250GB
[09:23:13] I'm changing the test to do fillseq followed by randread in the same db_bench run
[09:34:06] hmm, we've had a couple of failures on wkb-fedora-04 that look like what peluse was seeing - the vtophys test fails to find any hugepages
[09:34:08] http://spdk.intel.com/public/spdk/builds/release/master/2468/wkb-fedora-04/build.log
[09:35:16] that machine is using a local DPDK, not the submodule - I'm going to switch it to the submodule and see if it reproduces at all
[09:50:28] hmmm - I think I've only seen that vtophys failure with this DPDK submodule change though
[09:54:15] it happened on a previous master build before the DPDK change: http://spdk.intel.com/public/spdk/builds/release/master/2468/
[09:54:21] and that machine wasn't using the submodule anyway
[09:54:37] it was on a local copy of DPDK 17.02
[09:54:46] ok
[09:54:53] i've only seen it on wkb-fedora-04
[09:55:07] *** Quits: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl) (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
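The 6.9% and 16.3% figures quoted at 09:11 line up with the 250GB data set once you notice the cache sizes are binary (GiB) while the data-set size is decimal GB. A quick sanity check (hypothetical one-liners, not from the log):

```shell
# Expected random-read hit rate = cache size / data-set size.
# 16 GiB cache over 250 GB (decimal) of data:
awk 'BEGIN { printf "%.1f%%\n", 16 * 2^30 / 250e9 * 100 }'   # 6.9%
# 38 GiB cache:
awk 'BEGIN { printf "%.1f%%\n", 38 * 2^30 / 250e9 * 100 }'   # 16.3%
```

The measured hit rates (6.5%, 14.7%) come in slightly under these uniform-random expectations, which is consistent with cache-metadata overhead eating part of the nominal capacity.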
[09:55:19] yeah, we'll see if it repros now that I've switched it to the submodule
[10:01:54] *** Joins: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl)
[10:28:16] hmm, wkb-fedora-04 failed again just now, with the new DPDK 17.05 submodule
[10:30:44] drv, so a few more points on that vtophys failure: * if I pause 5 sec after both the perf and the identify, it works * if I run just the perf and not the ID, or the other way around, it fails * if I run just the ID, vtophys passes but I get an error about improper shutdown a bit later on (of NVMe). If I then increase the 5 sec shutdown TO in the NVMe driver I get a better pass rate with just the identify, but it still fails with both perf and identify. * no matter
[10:30:44] what, if I run vtophys on its own it passes.
[10:31:28] that's about it for yesterday afternoon dorking with it. Was going to look a bit more at NVMe driver cleanup next, or something in the identify program
[10:33:12] one more: if I pause like 50 sec after the loop in nvme.sh but before vtophys is called, leaving in both perf and ID, it still fails. So it seems like a cleanup mess-up caused by running multiple ID or perf at the same time. I'm pretty sure I took out the & background processing just assuming things would pass then, and of course they did
[10:35:41] peluse: can you try this?
[10:35:50] 1) echo 0 > /proc/sys/vm/nr_hugepages
[10:36:02] 2) scripts/setup.sh && cat /proc/sys/vm/nr_hugepages
[10:36:28] my guess is some systems (for some TBD reason) take longer than others to actually get the huge pages registered
[10:36:41] also maybe "ls -l /dev/hugepages | wc -l" after a failure
[10:36:51] well I think that's too late
[10:36:51] and see if it says something other than 0
[10:37:13] based on the fact that if he waits 5 seconds, vtophys always passes
[10:38:13] so you think there is potentially some delay between our echo "$NR_HUGE" > /proc/sys/vm/nr_hugepages
[10:38:19] exactly
[10:38:20] and those hugepages actually being available
[10:38:22] yep
[10:38:31] probably possible
[10:38:37] we could make setup.sh more robust against that possibility
[10:38:45] on my system there is definitely a delay before all of them are available
[10:39:07] the weird thing though is that on wkb-fedora-04 and peluse's system, it seems like *none* are available
[10:40:17] i just pushed a test patch to see what this looks like on wkb-fedora-04
[10:40:31] it might be only hitting on wkb-fedora-04 since it runs the env tests right away
[10:40:39] the other machines run unittest.sh first, which doesn't need hugepages
[10:45:03] hmmm - well that test patch seems to show all of the hugepages being available immediately after running the script
[10:45:57] but that test patch is going to pass
[10:46:22] true - but i was expecting it to maybe show some number less than 4096
[10:56:38] will try, one sec, need to reboot
[11:00:39] so there's a delay between steps 1 & 2 of 3-5 seconds I'd guess
[11:00:54] sorry, after step 2 I mean
[11:01:18] then I get back a response of 1024
[11:11:33] *** Joins: johnmeneghini (~johnmeneg@pool-96-252-112-122.bstnma.fios.verizon.net)
[11:11:37] *** Quits: johnmeneghini (~johnmeneg@pool-96-252-112-122.bstnma.fios.verizon.net) (Client Quit)
[11:17:43] *** Joins: johnmeneghini (~johnmeneg@pool-96-252-112-122.bstnma.fios.verizon.net)
[11:17:44] *** Quits: johnmeneghini (~johnmeneg@pool-96-252-112-122.bstnma.fios.verizon.net) (Client Quit)
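The "make setup.sh more robust" idea from 10:38 could take the shape of polling the sysfs free-pages counter instead of a fixed sleep. A sketch (hypothetical helper, not actual setup.sh code; the counter path and counts are parameters so nothing is hard-coded to 2MB pages):

```shell
# Poll a hugepage counter file until it reports at least `want` pages,
# sleeping 0.1s between checks; give up after `tries` attempts.
wait_for_hugepages() {
    local counter_file=$1 want=$2 tries=${3:-50}
    while [ "$tries" -gt 0 ]; do
        if [ "$(cat "$counter_file")" -ge "$want" ]; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 0.1
    done
    return 1
}

# Intended use in setup.sh (requires root):
#   echo "$NRHUGE" > /proc/sys/vm/nr_hugepages
#   wait_for_hugepages /sys/kernel/mm/hugepages/hugepages-2048kB/free_hugepages "$NRHUGE"
```

Polling free_hugepages rather than nr_hugepages matters here: as the smoking-gun log later in this discussion shows, nr_hugepages can already read 4096 while free_hugepages is still 0.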
[11:23:42] *** Joins: johnmeneghini (~johnmeneg@pool-96-252-112-122.bstnma.fios.verizon.net)
[11:23:42] *** Quits: johnmeneghini (~johnmeneg@pool-96-252-112-122.bstnma.fios.verizon.net) (Client Quit)
[13:14:38] my theory is debunked: https://ci.spdk.io/builds/review/06fb9e58575e6ad69c47efb903494cc2397042a8.1498240515/
[13:14:55] nr_hugepages is 4096 before vtophys is run
[13:15:35] yeah, that's very weird
[13:41:56] :(
[13:44:09] now I can't remember exactly what the results were just simply up'ing the shutdown TO value in the driver, will run that again real quick
[13:48:01] I don't think the wkb-fedora-04 results are related to NVMe, since the env test is the first thing we run on that machine (no NVMe tests before that)
[13:48:09] hmmm
[13:49:09] mine appears to be, or DPDK maybe. Just ID in the loop and a 50 sec delay before giving up on shutdown status, and vtophys passes; however I exit right after and I can see identify is still an active process. Will attach to it and see what it's doing...
[13:50:23] OK, in the time it took me to type that I double checked ps and now identify is still listed but as defunct
[13:51:08] so maybe vtophys was running while the nvme driver was spinning in the longer shutdown loop?
[14:02:50] jimharris: I hit a crash in the rocksdb spdk plugin after rebasing to 5.4.5. I've got it debugged down to the exact sequence
[14:03:05] it appears to do a write for 13 bytes to offset 0 of the manifest file
[14:03:10] then a read for 32k at offset 0
[14:03:14] then a read for 32k at offset 32k
[14:04:03] which makes the logic in blobfs.c on line 2046 result in a negative number, which results in an enormous length passed to memcpy
[14:04:22] so my first question is - is this write/read pattern something that is supposed to be handled?
[14:04:27] with the current code
[14:17:54] so yeah what I suggested earlier is what is happening for sure.
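The "negative number ... enormous length passed to memcpy" failure described at 14:04 is the classic signed-difference-reinterpreted-as-size_t bug. In miniature, assuming (as the conversation suggests) a 13-byte cached buffer and a read at offset 32k on a 64-bit system:

```shell
# 13 bytes cached, read requested at offset 32768: the signed difference
# is negative, and viewed as an unsigned 64-bit length it wraps around.
echo $(( 13 - 32768 ))            # -32755
printf '%u\n' $(( 13 - 32768 ))   # 18446744073709518861 as uint64
```

A length like 18446744073709518861 handed to memcpy explains the crash immediately.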
When I increase the delay in the NVMe driver shutdown from 5 sec to 50 sec, vtophys passes because identify is still running; it finished up (timed out) after vtophys runs successfully
[14:18:03] (on my system at least)
[14:19:42] will see if it's the same deal w/only perf running in the loop but assume that it is
[14:23:43] bwalker: looking at the code now...
[14:25:18] I'm checking now to see if rocksdb is doing a truncate to grow the file prior to those reads
[14:25:24] because when it does the reads, it thinks the file is 4MB
[14:25:28] but only 13 bytes have been written
[14:26:54] yep that's exactly what happens now
[14:27:00] the steps are:
[14:27:01] create file
[14:27:04] truncate file to 4MB
[14:27:11] write 13 bytes to offset 0
[14:27:14] read 32k from offset 0
[14:27:18] read 32k from offset 32k
[14:27:28] that second read hits some bad math
[14:27:44] because it has a buffer with 13 bytes in it
[14:28:34] so blobfs certainly needs to handle this - but I'm curious why rocksdb is trying to read 32k of data from a file it's only written 13 bytes to?
[14:28:42] yeah no idea
[14:30:19] ok - got an idea on this - will come down to the lab in 5
[15:03:32] well that looks like a smoking gun: http://spdk.intel.com/public/spdk/builds/review/fba7e531af15087e66074cc45da8581ccc3a1fbe.1498254605/wkb-fedora-04/build.log
[15:03:45] + cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
[15:03:46] 4096
[15:03:46] + cat /sys/kernel/mm/hugepages/hugepages-2048kB/free_hugepages
[15:03:47] 0
[15:04:18] yep
[15:04:25] we can make setup.sh block on that
[15:04:55] we should echo > nr_hugepages before we do the device binding
[15:05:06] as well as blocking
[15:05:17] might save us a bit of time
[15:06:03] what about this one?
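The five steps listed at 14:27 can be replayed with coreutils against a regular file (this only illustrates the I/O pattern; blobfs is the layer that mishandles it, and the 13-byte payload here is a placeholder, not the real MANIFEST contents):

```shell
f=$(mktemp)                                                   # create file
truncate -s 4M "$f"                                           # truncate file to 4MB
printf 'MANIFEST-head' | dd of="$f" conv=notrunc status=none  # write 13 bytes to offset 0
dd if="$f" bs=32k skip=0 count=1 status=none | wc -c          # read 32k from offset 0
dd if="$f" bs=32k skip=1 count=1 status=none | wc -c          # read 32k from offset 32k
rm -f "$f"
```

Against a kernel filesystem both reads succeed and return the full 32768 bytes (zero-filled beyond byte 13), which is presumably the behavior rocksdb expects; the bad math is in how blobfs sizes the copy out of its 13-byte cache buffer.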
http://spdk.intel.com/public/spdk/builds/review/fba7e531af15087e66074cc45da8581ccc3a1fbe.1498254605/wkb-fedora-03/build.log
[15:06:19] we already ran some tests, but then a later test doesn't have any huge pages
[15:10:24] jimharris, so why no hugepages in that case (the smoking gun log)
[15:10:26] ?
[15:10:41] we're assuming that they just haven't been allocated by the kernel yet
[15:10:43] or made available
[15:10:54] we made a request to allocate pages, it noted that we requested it
[15:10:58] but it hasn't finished making them available
[15:11:17] I think the idea is that we just have to wait a bit, which explains why when you add sleeps it fixes it
[15:14:12] my theory is that they are left over from a previous SPDK app that didn't clean up after itself
[15:25:41] so I re-ran it a couple of times, and wkb-fedora-04 seems to always work on the first run after boot, then fail on the next run
[15:26:08] there's got to be some kind of left-over hugepage state that gets cleaned up when we reboot
[15:27:36] do we leave the stub running or something?
[15:42:43] yeah, the 'fails on 2nd run' is true for me too but mine fails on the first without some mucking around as mentioned earlier
[15:42:45] one other thing...
[15:43:30] if I only run perf in nvme.sh then I get slightly different behavior: with the delay on shutdown, instead of vtophys failing I get nvme_pcie.c: 966:nvme_pcie_qpair_insert_pending_admin_request: *ERROR*: The owning process (pid 23689) is not found. Drop the request.
[16:18:03] drv, bwalker: we could add --huge-unlink to our dpdk command line options
[16:18:09] this will unlink the files after mapping them
[16:18:28] we will still have them open, but when the process dies the kernel will then automatically clean them up
[16:19:03] sounds like a good idea to me
[16:19:12] why isn't that on by default? is there some downside?
[16:54:17] *** Quits: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl) (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
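The mechanism the DPDK --huge-unlink EAL flag relies on can be shown in miniature: unlink a file while a process still holds it open, and the data stays reachable through the fd until it is closed, with nothing left on disk if the process dies. (A likely answer to the 16:19 "is there some downside?" question: once the hugepage files are unlinked, a secondary DPDK process can no longer open the same mappings, so it trades multi-process support for automatic cleanup.)

```shell
tmp=$(mktemp)
exec 3<>"$tmp"    # hold the file open on fd 3
rm -f "$tmp"      # unlink it; the inode survives while fd 3 is open
echo "data" >&3   # reads/writes through the fd still work
[ -e "$tmp" ] || echo "no file left on disk"
exec 3>&-         # close: the kernel reclaims the storage
```

This is exactly why a crashed process can't leak unlinked hugepage files: the kernel drops the inode the moment the last fd goes away.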
[17:22:51] *** Joins: nix_ (cf8c2b51@gateway/web/freenode/ip.207.140.43.81)
[17:23:35] *** Quits: nix_ (cf8c2b51@gateway/web/freenode/ip.207.140.43.81) (Client Quit)
[19:12:29] *** Joins: johnmeneghini (~johnmeneg@pool-96-252-112-122.bstnma.fios.verizon.net)
[19:12:32] *** Quits: johnmeneghini (~johnmeneg@pool-96-252-112-122.bstnma.fios.verizon.net) (Client Quit)
[20:16:39] *** Joins: johnmeneghini (~johnmeneg@pool-96-252-112-122.bstnma.fios.verizon.net)
[20:16:42] *** Quits: johnmeneghini (~johnmeneg@pool-96-252-112-122.bstnma.fios.verizon.net) (Client Quit)
[20:20:14] *** Joins: johnmeneghini (~johnmeneg@pool-96-252-112-122.bstnma.fios.verizon.net)
[20:20:15] *** Quits: johnmeneghini (~johnmeneg@pool-96-252-112-122.bstnma.fios.verizon.net) (Client Quit)