[00:54:25] *** Joins: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl)
[01:28:49] *** Quits: whitepa (~whitepa@2601:601:1200:f23b:8054:7049:7905:d8f7) (Ping timeout: 246 seconds)
[03:58:10] *** Quits: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl) (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
[04:51:09] *** Joins: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl)
[07:00:45] *** Joins: lhodev (~Adium@inet-hqmc08-o.oracle.com)
[07:29:18] *** Joins: nKumar (uid239884@gateway/web/irccloud.com/x-hmsstpabaqkdtqua)
[07:37:18] within the hello_blob example, I have a question on the callback chains that are used to execute open/close/create/read/write. When these callbacks occur, are they occurring from the same thread that originally tried the open/close/create/read/write functions?
[07:38:24] as in inline? or are they actually instantiating new threads and executing the callback functions within the new threads?
[07:40:47] first question, yes
[07:41:09] all same thread for 2nd question
[07:41:38] you can break it at any time and look at the callstack to see this
[07:46:03] oh wait, one min
[07:50:01] let me clarify, the hello_blob example doesn't demonstrate the use of multiple threads in the app but uses the SPDK app framework which does create additional threads for things like the reactor, etc. Still learning this myself so maybe one of the other guys can jump in here soon and add some more info, I'll go look at the SPDK framework code a bit and see if I need to correct my answers to your questions as well
[08:41:45] nKumar: blobstore will call the callback on the same thread as the request was initiated on
[08:41:57] but it may not (probably won't) call it "inline"
[08:42:09] i.e. it won't be within the call stack of the request
[08:42:37] that's because the operation requested will probably require some disk I/O, which is done asynchronously
[08:42:52] and the callback will be called when the thread later polls for completions on the disk
[08:44:50] bwalker, so in the case where the callback is called later, truly async like, what thread will that be, the reactor? Or a thread created by the back end driver?
[08:45:09] it will be the same thread that the operation was initiated from
[08:45:26] if the user is using the spdk event framework (which isn't required), then yes - it's the reactor thread
[08:45:43] none of our libraries/drivers spawn threads
[08:45:54] just the event framework does, and the user can opt not to use that and do their own thing
[08:46:08] OK, that makes sense but wrt the first answer when the framework isn't in use,
[08:46:36] got it, thanks!
[08:46:51] if the app calls in for a read or something and it does turn out to be async at the disk, does the app not get control back from the spdk call that it made until spdk is done polling/completing the IO?
[08:47:43] the app gets control back as soon as the asynchronous operation is submitted
[08:47:52] and in general, it's the application's job to poll the disk for I/O
[08:48:01] oh, duh. OK
[08:48:16] when the application polls, it finds the completion which triggers the callback chain up
[08:48:25] eventually resulting in the blobstore calling the user callback
[08:48:33] *** Parts: lhodev (~Adium@inet-hqmc08-o.oracle.com) ()
[08:48:47] I was thinking SPDK was somehow being async to the app and at the same time somehow polling for completion and then unwinding
[08:50:21] bwalker, so in hello_blob since I'm using the app framework, if I wasn't using malloc backend, I would expect my callbacks for writes for example to come back to me on the reactor thread right?
[08:50:31] yes
[08:50:38] awesome, thanks
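To make the flow just described concrete, here is a minimal sketch of the pattern (not code from hello_blob itself): the blobstore call returns as soon as the request is queued, and the completion callback runs later on the same thread, out of whatever polling that thread does. spdk_bs_create_blob() and its spdk_blob_op_with_id_complete callback type are the public blobstore API; the surrounding names and state are illustrative.

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdio.h>

#include "spdk/blob.h"

/* Completion callback: runs later, on the same thread that called
 * spdk_bs_create_blob(), once that thread's polling reaps the disk I/O
 * behind the blobstore metadata update. It is NOT called from within the
 * spdk_bs_create_blob() call stack. */
static void
create_complete(void *cb_arg, spdk_blob_id blobid, int bserrno)
{
	bool *done = cb_arg;

	if (bserrno == 0) {
		printf("blob 0x%" PRIx64 " created\n", blobid);
	}
	*done = true;
}

static void
start_create(struct spdk_blob_store *bs, bool *done)
{
	*done = false;
	/* Control returns here as soon as the request is submitted; *done is
	 * still false at that point. The callback fires from the poller that
	 * drives completions on this thread (the reactor, when the app
	 * framework is in use). */
	spdk_bs_create_blob(bs, create_complete, done);
}
```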
[08:58:33] Is there something that we need to call on every loop iteration to get it to poll for any callbacks it needs to fire off?
[09:03:58] if you are using the spdk event framework, it all happens automatically
[09:04:16] if you aren't, then when you initialize the bdev layer you have to provide a function pointer that it will use to register a "poller"
[09:04:31] you can implement that however you want, but the bdev layer is going to request a poller on the current thread
[09:07:11] nKumar, are you using bdev or integrating directly to the nvme driver?
[09:08:52] I am currently trying to integrate directly to the NVMe driver, however it sounds like for the sake of simplicity I should switch over to use bdev like the hello_blob example.
[09:09:25] the bdev thing is going to be a lot simpler, but it's workable with NVMe
[09:09:39] the blobstore will request an I/O channel from the device
[09:09:45] which I assume you implement as creating an NVMe queue pair
[09:10:05] your application needs to poll that queue pair periodically on the thread it was created on
[09:10:08] then it will work
[09:10:42] yep, @peluse showed me how to switch over the hello_blob example to NVMe. I will try to switch over now and let you know if I have any issues. Thanks for the advice!
[09:12:24] we could probably write an example of blobstore directly on NVMe
[09:12:35] but we did a bunch of benchmarking and the bdev layer doesn't really add any overhead
[09:12:54] bwalker, I could switch the WIP CLI over to not use bdev and kill 2 birds there
[09:12:57] and it sets you up for a lot more flexible code base in the future
[09:13:10] got it. Let me try using bdev first, sounds like that's the way to go
[09:14:18] nKumar, yeah, just look in hello_blob at the calls/structs that have bdev in them starting with spdk_bdev_create_bs_dev()
[09:16:39] will do, thanks!
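For the "poll that queue pair periodically" point above, a rough sketch of what an application's main loop looks like when it drives an NVMe queue pair itself rather than letting the event framework's reactor do it. spdk_nvme_qpair_process_completions() is the real driver call; the loop flag and how the qpair was allocated are placeholders left to the application.

```c
#include <stdbool.h>

#include "spdk/nvme.h"

static volatile bool g_app_running = true;  /* placeholder application state */

static void
app_poll_loop(struct spdk_nvme_qpair *qpair)
{
	while (g_app_running) {
		/* Reap any completed commands on this queue pair. The NVMe
		 * completion callbacks - and any blobstore/bdev callbacks
		 * chained on top of them - run here, on this thread. A
		 * max_completions of 0 means "process everything ready". */
		spdk_nvme_qpair_process_completions(qpair, 0);

		/* ... submit new I/O, run other application work ... */
	}
}
```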
[09:51:13] so in the hello_blob example, both the app framework and the bdev layer are used. I'm not sure that using the app framework is going to work for our implementation, is it possible to configure the bdev without using the app framework?
[09:51:46] specifically, I'm referring to the conf file, which seems to be passed in using the app framework.
[10:13:58] so the conf file is specific to the bdev layer I'm fairly certain, I'll go take a closer look and in the meantime someone else may be able to chime in and explain how to use it w/o the app framework
[10:27:23] yeah, OK I dunno :) drv or someone will have to answer...
[10:41:16] *** Quits: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl) (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
[10:44:22] *** Joins: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl)
[11:03:08] nKumar - the conf file is part of the spdk event/app framework
[11:03:28] you can use the bdev layer without it - we were actually just recently working through some changes to get those separated
[11:04:11] I'm not sure if we got all the way to where it needs to be yet
[11:05:44] gotcha. Given our application, I'm not sure that the app framework is a viable solution for us so being able to use the bdev abstraction without the app framework would be awesome.
[11:07:02] yeah - we definitely are most of the way there in terms of separating the two
[11:07:12] it may just be the config file, if even that
[11:07:33] there is a new fio plugin that uses the bdev layer but does not use the event framework
[11:07:39] in examples/bdev/fio_plugin
[11:07:46] but it does still use a configuration file
[11:17:31] so it basically uses the spdk_conf* functions to parse the config file and spdk_bdev_initialize() to init, which I assume knows about the config because of spdk_conf_set_as_default()?
[11:25:01] you can also use the blobstore without using the bdev layer at all, if you want to
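A rough sketch of the approach being discussed, modeled loosely on that fio_plugin rather than copied from it: parse the config file with the legacy spdk_conf API, install it as the default so the bdev modules can find their sections, then initialize the bdev layer directly. The spdk_bdev_initialize() signature has varied across SPDK releases (and the conf API was later replaced by JSON configuration), so treat this strictly as an illustration.

```c
#include "spdk/bdev.h"
#include "spdk/conf.h"

/* Illustrative only - see examples/bdev/fio_plugin for a complete, working
 * version of bdev-without-the-event-framework in the SPDK tree. */
static int
load_bdev_config(const char *config_file)
{
	struct spdk_conf *config;

	config = spdk_conf_allocate();
	if (config == NULL) {
		return -1;
	}
	if (spdk_conf_read(config, config_file) != 0) {
		spdk_conf_free(config);
		return -1;
	}
	/* This is the piece the question above is getting at: the bdev
	 * modules read their sections from the *default* config, so it must
	 * be installed before the bdev layer initializes. */
	spdk_conf_set_as_default(config);

	/* Next steps (version-dependent, so not spelled out here): call
	 * spdk_bdev_initialize() and supply the poller-registration callback
	 * the bdev layer asks for, since there is no reactor to do that
	 * automatically. */
	return 0;
}
```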
[11:28:56] *** Quits: patrickmacarthur (~smuxi@2606:4100:3880:1240:39f2:4fd8:2ba:3d8) (Ping timeout: 255 seconds)
[11:31:32] *** Joins: patrickmacarthur (~smuxi@2606:4100:3880:1240:2977:4084:1d52:877)
[11:37:00] *** Quits: patrickmacarthur (~smuxi@2606:4100:3880:1240:2977:4084:1d52:877) (Remote host closed the connection)
[11:37:12] *** Joins: patrickmacarthur (~smuxi@2606:4100:3880:1240:2977:4084:1d52:877)
[11:38:21] *** Quits: patrickmacarthur (~smuxi@2606:4100:3880:1240:2977:4084:1d52:877) (Remote host closed the connection)
[11:38:37] *** Joins: patrickmacarthur (~smuxi@2606:4100:3880:1240:2977:4084:1d52:877)
[11:40:05] *** Parts: patrickmacarthur (~smuxi@2606:4100:3880:1240:2977:4084:1d52:877) ()
[11:40:08] *** Joins: patrickmacarthur (~smuxi@2606:4100:3880:1240:2977:4084:1d52:877)
[13:29:51] *** Quits: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl) (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
[13:58:49] *** Joins: lhodev (~Adium@inet-hqmc08-o.oracle.com)
[14:22:11] peluse: going through your nvme ut patches now - i've marked the first one as -1 but hold off on making any changes until i get through the rest
[14:22:25] hate to have you rebase the whole series for one small change in the first patch
[14:22:33] gracias
[14:22:51] although I've rebased this series like 100 times so what's a few more... LOL
[14:23:12] the main issue i have is picking 0x12345 for both the function pointer and the void argument
[14:23:29] note that 2 of them failed CI for unrelated reasons, was waiting til I either had to update or someone would remove the -1
[14:23:37] *** Parts: lhodev (~Adium@inet-hqmc08-o.oracle.com) ()
[14:23:44] oh yeah, that was somewhat arbitrary obviously
[14:23:52] that would mask someone stupidly changing the code to do req->cb_fn = cb_arg
[14:24:09] good point
[14:28:17] drv, so I added the "dump a blob to a regular file" capability to the CLI, that was a great idea. It's very cool, I'm gonna add convert file to blob for the other direction and then clean it up for review.
[14:28:23] awesome
[14:29:22] jimharris, let me know when you're done w/all the patches and I'll go update them all at once and see if I can't do so w/o breaking the chain. take your time though!
[15:08:34] bwalker, drv: i went through peluse's patches - could one of you go through as well before he revises and rebases?
[15:09:28] ok
[15:09:50] thanks
[15:49:32] *** Joins: lhodev (~Adium@inet-hqmc08-o.oracle.com)
[15:50:48] lhodev, howdy!
[15:50:58] Howdy
[15:54:57] I just sent out a mail to the list with hopes of getting a handle on the NVMe driver's use of PAGE_SIZE. Hope I'm not becoming a nuisance ;-). Just tryin' to put the pieces together and understand how it all works.
[15:58:17] not a nuisance at all - I'll respond on the mailing list but the short answer is that the NVMe page size does not need to be the same as the MMU page size
[15:58:41] as long as an NVMe page never crosses an MMU page boundary
[15:58:42] hi lhodev
[15:58:58] i'm not sure that's even possible :)
[15:59:04] it's not
[15:59:15] well actually
[15:59:21] if you set MPS to 8k
[15:59:25] and used 4k pages with vfio
[15:59:28] it would go boom
[15:59:39] BTW, lhodev = Lance Hartmann
[16:01:10] we don't even check MPSMIN and MPSMAX when we go to set MPS today
[16:01:11] thanks lance - glad to see you on irc :)
[16:01:16] because MPS is always 4k in reality
[16:01:34] so we could probably clean this up to support more exotic architectures
[16:08:47] bwalker: https://review.gerrithub.io/#/c/374209/4/lib/nvmf/rdma.c
[16:09:00] this new mempool - will it only get used from a single core?
[16:09:29] in this patch, yes
[16:09:35] but look at the two in the series right after it
[16:09:36] but in the future?
[16:09:50] I may have been overly aggressive in breaking these changes up
[16:10:02] i'll look at the rest of the series
[16:10:15] i'm concerned about using max_queue_depth/2 for the cache size
[16:10:37] I didn't know what values to put in there really
[16:10:44] it means you could get exhaustion if it's ever accessed from more than 2 cores
[16:10:49] Thx in advance for your reply to the email list Ben on the topic. I figured the page sizes — i.e. NVMe page size and MMU page size — didn't necessarily have to be the same. After all, it's running beautifully on my system ;-). The dissonance is my own cognitive one, lol. I suspect some other folks in the future might also wonder about the use of PAGE_SIZE and hugepages, and it wouldn't be a simple topic someone might search for in the IRC log.
[16:11:19] the spdk_mempool wrapper is actually smart enough to take care of that - it does num elements / 2 / num cores
[16:11:24] so I'm bailed out by that right now
[16:11:30] I agree I should probably think about those values a little bit harder
[16:12:17] hmmm - yeah, I forgot that spdk env will clamp that lower
[16:12:17] I'm pretty sure drv is going to beat me to the mailing list response
[16:12:47] and I think he's also going to push a patch out that stops using PAGE_SIZE and instead uses an MPS_SIZE or something
[16:12:51] so that should clear it up
[16:13:10] in practice, they're the same value
[16:13:22] I think PAGE_SIZE is OK - we just need to document it
[16:14:05] there are a lot of folks who already know that PAGE_SIZE == 4096, whereas MPS_SIZE requires looking up that value
[16:17:59] i think we should only change to an MPS-like name if we are really going to support MPS other than 4KB (which would require things like passing a ctrlr pointer to _is_page_aligned())
[16:19:00] there aren't any physical devices that appear to support it that I can find on my desk right now, but it is kind of interesting to think about using MPS=2MB
[16:19:10] I'm eager to learn if it would be possible — read: of a non-Herculean nature — for the NVMe driver to handle the actual hugepage sizes — i.e. NVMe page size = MMU page size. An application with large I/Os would, in theory, then be able to fire off an I/O using just a single PRP entry, right? That's, of course, assuming that the I/O also fit within the controller's reported max transfer size.
[16:19:34] exactly
[16:19:56] basically you'd need 2 PRP entries max, assuming the device's MDTS is less than or equal to MPS
[16:20:02] i.e. no PRP list at all
[16:20:13] yeah - for contiguous payloads this would work well
[16:20:16] I don't know if that buys anything
[16:20:32] performance-wise
[16:20:38] but for that size of I/O, the CPU overhead is pretty minimal already
[16:21:26] for large I/O you touch fewer cache lines, but for large I/O the cpu usage is irrelevant
[16:22:08] * jimharris running the overhead tool to measure 128K v. 4K
[16:23:19] So the thought is that the time spent iterating creating multiple PRP entries would be insignificant to the point of being virtually irrelevant?
[16:23:32] basically, yeah
[16:23:52] it doesn't save you much on small I/O, just on large I/O
[16:24:02] i.e. more than 3 or 4 PRP entries to hit the next cache line
[16:24:22] and when the I/O gets large enough, it bottlenecks at the SSD or the PCIe bus anyway
[16:26:50] Hmmm. I had spoken with someone whose target application intends to do 1MB I/Os. Hence, my interest on this topic ;-)
[16:26:57] lhodev: I have a mailing list post that will expand on this, but I went ahead and posted a patch to clean up our use of PAGE_SIZE: https://review.gerrithub.io/#/c/374371/
[16:27:15] (this still uses 4K in all cases for page size, but it sets us up to pick a different NVMe page size later)
[16:27:55] lhodev: very few SSDs actually support 1MB I/O, and for QoS reasons I'm not sure you'd want to issue 1MB I/O anyway
[16:28:34] either way - a 1MB I/O takes long enough at the SSD to make saving a few cycles generating the PRP list probably not matter much
[16:28:41] quick test on my system shows 294ns to submit a 4KB I/O and 531ns for a 128KB I/O
[16:29:03] hmm, that's actually more than I thought
[16:29:21] so you would potentially save ~240ns per 128KB I/O
[16:29:32] more, if your device supported 1MB I/O
[16:30:13] I don't have any devices that can do more than 128K or have MPSMAX greater than 4k, so we can't really do more testing
[16:30:17] so for a device that supports 3GB/s streaming throughput, that's 24K IO/s (at 128KB IO size)
[16:32:13] 24K * 240ns = 5.7ms
[16:34:07] I have a device that reports MPSMAX 134217728 bytes according to examples/nvme/identify, but the MDTS is 128KB.
[16:34:09] my guess is there are optimizations we could make in our PRP building routine - we haven't optimized that to the same level as the 4KB path
[16:34:55] you've been running perf all day, I guess we can just pile this on :)
[16:35:31] I'll add a card to the things to do trello board - "Rewrite I/O path in assembly"
[16:35:51] (chuckle)
[16:42:01] I need to head out for dinner shortly. Thx for the responses guys. Will look for the mail on the PAGE_SIZE topic later. Cheers!
[16:42:58] *** Quits: lhodev (~Adium@inet-hqmc08-o.oracle.com) (Quit: Leaving.)
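To put rough numbers on the PRP discussion above, a small stand-alone sketch of the entry counts involved. It assumes page-aligned transfers and a 4 KiB MPS; the values are illustrative and not taken from any driver code.

```c
#include <inttypes.h>
#include <stdio.h>

int main(void)
{
	const uint64_t mps = 4096;  /* NVMe memory page size, 4 KiB in practice */
	const uint64_t io_sizes[] = { 4096, 128 * 1024, 1024 * 1024 };

	for (size_t i = 0; i < sizeof(io_sizes) / sizeof(io_sizes[0]); i++) {
		uint64_t pages = io_sizes[i] / mps;
		/* PRP1 covers the first page; PRP2 either covers the second
		 * page (for a 2-page I/O) or points to a PRP list holding the
		 * remaining entries. */
		uint64_t prp_list_entries = pages > 2 ? pages - 1 : 0;

		printf("%8" PRIu64 "-byte I/O: %4" PRIu64 " pages, %4" PRIu64
		       " PRP list entries\n",
		       io_sizes[i], pages, prp_list_entries);
	}
	/* A 4 KiB I/O needs no list, 128 KiB needs 31 entries, 1 MiB needs
	 * 255. With MPS = 2 MiB (the hypothetical above), a 1 MiB I/O that
	 * does not cross a 2 MiB page boundary fits in PRP1 alone. */
	return 0;
}
```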
[16:51:45] bwalker: i'm curious what a squashed version of these first three patches would look like
[16:57:51] like this: https://review.gerrithub.io/#/c/374373/
[16:57:54] it's not bad really
[17:02:43] drv's patch only adds 10ns of overhead to a 128KB I/O submission
[17:24:35] cool
[17:33:40] drv: what do you have to do to get pahole to work?
[17:33:48] I tested overhead with 4KB I/O and couldn't see a difference on my machine
[17:34:50] do you specify a .o or an executable?
[17:34:51] jimharris: I think the one shipped with Fedora is really old - I think I have a locally-built version
[17:34:57] anything with debug info should work
[17:35:01] I usually use a .o
[17:35:18] https://git.kernel.org/pub/scm/devel/pahole/pahole.git seems to be the latest home - maybe Fedora is actually using that now, not sure
[17:35:51] what version is yours?
[17:35:57] my ubuntu package has v1.9
[17:36:03] I am not at my workstation right now
[17:36:07] ok - no worries
[17:36:10] it shouldn't matter as long as it doesn't spit errors
[17:36:17] mine spits errors
[17:36:18] older ones don't understand newer DWARF types, though
[17:36:30] pahole lib/nvme/nvme.o
[17:36:34] yeah, it's probably easiest to just build from source - not too much of a hassle
[17:36:37] that should work
[17:36:45] die__process_unit: DW_TAG_restrict_type (0x37) @ <0x576> not handled!
[17:39:16] apparently stands for "Poke-a-Hole" - I'll have to remember that if somebody thinks I'm being rude :)
[17:40:41] building from source worked like a charm
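For readers who have not used pahole: run against an object file built with debug info (as in the `pahole lib/nvme/nvme.o` example above), it prints each struct with member offsets and sizes and calls out padding holes. The struct below is hypothetical and the annotations only mimic the general shape of pahole's report.

```c
#include <stdint.h>

struct example_req {                        /* hypothetical, not from SPDK */
	void                    *cb_arg;    /*     0     8 */
	uint32_t                 flags;     /*     8     4 */

	/* XXX 4 bytes hole, try to pack */

	uint64_t                 lba;       /*    16     8 */

	/* size: 24, cachelines: 1, members: 3 */
	/* sum members: 20, holes: 1, sum holes: 4 */
};
```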
[18:44:52] *** Joins: dcostantino (~dcostanti@12.133.137.30)
[18:48:35] *** Quits: dcostantino (~dcostanti@12.133.137.30) (Client Quit)
[19:20:39] pahole.. I love it. Hey man, you're being a pahole this week :)
[20:34:33] drv, bwalker, jimharris - thanks for going through the lengthy chain. Most comments are clear enough, there's a few that I might want to discuss but first I want to make certain that I don't break the chain as I update these. I'll list the steps I plan to take here sometime tomorrow before I do anything else for sure...
[20:53:47] *** Joins: dcostantino (~dcostanti@c-24-23-244-114.hsd1.ca.comcast.net)
[21:01:32] hi, I was wondering if anyone has started development of an NVMe stream data placement API for SPDK?