[00:46:15] As I sent in a mail, my pass was fake. Please do not rush to queue your patches.
[00:46:21] *** Quits: ziyeyang_ (ziyeyang@nat/intel/x-ckwvagmdqxfxbvqe) (Quit: Leaving)
[01:12:00] *** Joins: baruch (~baruch@bzq-82-81-85-138.red.bezeqint.net)
[02:40:29] *** Quits: Shuhei (caf6fc61@gateway/web/freenode/ip.202.246.252.97) (Ping timeout: 260 seconds)
[02:53:16] *** Joins: gila (~gila@94.212.217.200)
[02:59:51] Hi Shuhei - this failure was caused by the configuration of ASAN on one of the test machines, which revealed a bug when executing an lvol tasting scenario. We are currently working on fixing the bug, and ASAN is temporarily disabled so that it doesn't block the test pool.
[04:02:43] *** Quits: gila (~gila@94.212.217.200) (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
[04:03:57] *** Joins: gila (~gila@5ED4D9C8.cm-7-5d.dynamic.ziggo.nl)
[06:11:54] *** Quits: gila (~gila@5ED4D9C8.cm-7-5d.dynamic.ziggo.nl) (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
[06:22:45] *** Joins: gila (~gila@5ED4D9C8.cm-7-5d.dynamic.ziggo.nl)
[07:34:08] *** Joins: boutcher (~boutcher@66.113.132.66)
[07:48:44] *** Joins: lhodev (~Adium@inet-hqmc01-o.oracle.com)
[07:51:59] *** Parts: lhodev (~Adium@inet-hqmc01-o.oracle.com) ()
[08:15:25] *** Joins: Vikas (9d301439@gateway/web/freenode/ip.157.48.20.57)
[08:29:19] *** Quits: Vikas (9d301439@gateway/web/freenode/ip.157.48.20.57) (Ping timeout: 260 seconds)
[08:42:21] *** Quits: baruch (~baruch@bzq-82-81-85-138.red.bezeqint.net) (Ping timeout: 264 seconds)
[08:46:19] *** Quits: gila (~gila@5ED4D9C8.cm-7-5d.dynamic.ziggo.nl) (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
[10:10:56] jimharris: I'm looking at a patch that adds comments to blobfs.h, and so I am looking at the API for spdk_fs_init
[10:11:05] and the user passes a function to send a request to the main dispatch thread as an argument
[10:11:14] but we could use spdk_thread_send_msg now that we have it, right?
[10:11:22] probably - yes
[10:22:02] once the blobstore API is all settled I think we can do some pretty neat things with blobfs
[10:22:06] to make it feel posix-y
[10:37:09] I have a question about running bdevperf with the kernel driver. Is this the correct configuration?
[10:37:20] [AIO]
[10:37:20] AIO /dev/nvme0n1 AIO0
[10:40:05] jkkariu: yes, that should work
[10:40:25] the section should be [Aio], though - I'm not sure if that's case sensitive
[10:40:38] never mind, it is actually AIO
[10:41:17] thanks, it is working
[10:46:26] drv, bwalker: don't we need to acquire the mutex around the TAILQ_FOREACH_SAFE here? https://review.gerrithub.io/#/c/391329/12/lib/nvme/nvme_pcie.c
[10:47:26] I think the hotplug monitor function is called under the lock, but let me check
[10:47:52] and then we recursively acquire it?
[10:49:08] spdk_nvme_probe_internal -> nvme_pcie_ctrlr_scan -> _nvme_pcie_hotplug_monitor
[10:49:12] probe_internal takes the lock
[10:49:18] we shouldn't be recursively acquiring it - where is that?
[10:50:45] it would be nice to have some kind of macro to assert that we are holding a lock (for documentation purposes)
[10:50:52] I think you can build that with trylock
[10:53:38] oh - I was misreading the code, it's an unlock then lock
[10:54:42] this still seems racy to me though - what if tmp gets removed by a different thread between the unlock/lock calls?
[10:56:02] hmm, that's a good question, but I think the existing code is broken in the same way
[10:56:23] you mean the existing uevent code?
[10:57:05] the uevent code isn't iterating over the shared_attached_ctrlrs list though
[10:58:49] all of the attach/scan code is only designed for use from one thread, so all of this code is fine - I think we just need to make sure it's explicitly documented
[10:59:14] but it does drop the lock while calling remove_cb
[10:59:44] the TAILQ_FOREACH_SAFE should be okay, since it's the _SAFE variant that grabs the next pointer before calling the body of the loop
[11:00:14] hmm, no, I see what you're saying - if the next one gets removed rather than the current one, that would break
[11:01:03] that's what I'm concerned about though - thread A grabs the next pointer, and drops the lock, thread B grabs the lock, removes the ctrlr in thread A's next pointer, and drops the lock, thread A grabs the lock and now tries to operate on an invalid next pointer
[11:05:05] *** Quits: tomzawadzki (~tomzawadz@134.134.139.83) (Remote host closed the connection)
[11:32:40] *** Joins: lhodev (~Adium@inet-hqmc02-o.oracle.com)
[11:32:55] *** Parts: lhodev (~Adium@inet-hqmc02-o.oracle.com) ()
[11:35:21] bwalker: could you look at this set of 4 patches from shuhei? https://review.gerrithub.io/#/c/391510/
[11:38:24] there's a patch out to support >64 lcores in SPDK which will need to be reworked a bit after the iSCSI load balancing changes from this series go in
[11:49:41] sure
[12:39:42] *** Joins: lhodev (~Adium@inet-hqmc02-o.oracle.com)
[12:53:13] jimharris: can you also look at this patch?
https://review.gerrithub.io/#/c/391898/
[12:53:51] Shuhei's definitely right that the cpumask handling for the RPC is wrong, but I'm not sure if we should change the RPC or just make spdk_iscsi_portal_create() interpret 0 == all cores
[12:54:24] (and this will also need to be updated for the new cpumask > 64 code, if that goes in first)
[12:54:39] i'd rather get shuhei's patch in first I think
[12:54:45] I haven't really looked at the >64 patch yet
[12:54:52] yeah, sounds fine to me
[12:58:22] i think we first get in shuhei's patch set that cleans up and simplifies the iscsi code that picks the reactor
[12:59:50] i think fixing the rpc code is fine
[13:19:42] I posted a big comment on the cpuset patch
[13:33:16] *** Joins: gila (~gila@5ED4D9C8.cm-7-5d.dynamic.ziggo.nl)
[13:39:27] do we want to even stick with "cpu" in the name for this?
[13:39:44] I'm open to alternatives if you have a better name :)
[13:39:53] we could theoretically even just use the bit array for this
[13:39:55] i'm just thinking of basing it around our spdk thread abstraction
[13:40:17] I'd prefer if all of our code moved away from "cores" and "cpus"
[13:40:20] to threads
[13:41:17] spdk_allocate_thread() can call spdk_env_get_current_core() to try to get an ID (i.e. an index to the bit array)
[13:41:58] so that covers most of the cases - but then when you're using something like fio plugin or rocksdb, spdk_env_get_current_core() will return -1
[13:42:05] if we can avoid even assuming that threads are pinned to cores it would be great
[13:42:22] except in the event framework - that can use knowledge of cores to pin threads all it wants
[13:42:32] one use for it is passing the core mask to DPDK, but I guess that could be just passed through as a string
[13:42:58] maybe - i guess i'm thinking that using core numbers as the thread ID reduces confusion
[13:43:09] when a pinned core number is available
[13:43:23] why now use the pthread id?
[13:43:28] s/now/not
[13:43:49] as an index into a bit array?
[13:44:22] I haven't actually looked at these patches, so I was just suggesting it as the thread id
[13:44:30] but if you need an index into a bit array that's a problem
[13:44:35] well, this patch set is more about defining a set of cores, not an individual core ID
[13:44:58] I think the first question is - do we even need a set of cores in our code?
[13:44:58] and finding out if a core is a member of a set, things like that
[13:45:02] it's getting rid of the current paradigm of a uint64_t as a bit mask for the available cores
[13:45:20] yes - load balancing/scheduling
[13:45:25] sure - but do you replace it with a set object that can hold more than 64, or can you rewrite those parts of the code to not make those assumptions
[13:46:28] this isn't about assumptions though - it's about allowing you to specify things like a thread/core mask for placing iSCSI or NVMe-oF connections from a specific portal
[13:47:27] sure it's about assumptions - you're assuming that threads are pinned to cores.
I think what you want to specify as a user is which threads the connection can be processed on
[13:47:28] not which cores
[13:47:40] i'm not assuming threads are pinned to cores
[13:48:20] so we could define a "thread mask"
[13:48:41] and arbitrarily assign thread id numbers, but that algorithm has to have a way to pick unique numbers that can be represented by a mask
[13:48:46] yes - that's what I suggested two pages up :)
[13:49:13] but you can't use spdk_env_get_current_core as the id
[13:49:21] you need some other algorithm
[13:49:26] except I don't want to arbitrarily assign the numbers if it's a pinned thread
[13:49:52] even if the thread is pinned to that core, that doesn't mean that another thread couldn't preempt it occasionally and run there
[13:50:01] and end up with the same id
[13:50:06] now you're just being difficult :)
[13:50:33] I'm just saying I'd like to try and move away from that assumption - from experience writing the fio plugin
[13:50:35] as a user, how do i specify a thread mask for the iscsi target?
[13:50:47] but where do you use thread masks in the fio plugin?
[13:50:53] or core masks
[13:50:59] I use thread ids
[13:51:14] but not masks
[13:51:33] sure, but this may impact the thread id algorithm
[13:51:47] which today is pthread_self
[13:52:16] I'm not sure how the user would specify a set of threads
[13:52:30] I don't have a solution in mind right now
[13:52:39] threads don't have nice predictable numbers like cores because they come and go
[13:52:45] is there a case where someone would need to specify a set of threads in the non-pinned case?
[13:53:02] just thinking out loud here...
[13:53:04] in our code?
no
[13:53:09] #define SPDK_MAX_THREADS 1024
[13:53:18] #define SPDK_MAX_LCORES 128 (or 256)
[13:53:42] if spdk_env_get_current_core() returns >= 0, then it's a pinned lcore and we use that value
[13:53:53] otherwise we use an unused value between 128 and 1023
[13:54:34] how does the user know what that value is, when they go to specify a mask somewhere?
[13:54:55] this may be an intractable problem - I'm not sure yet.
[13:55:19] we'd have to think of a use case where we'd want to use thread masks in the non-pinned case
[13:55:43] today for like NVMe-oF, the user specifies a global core mask
[13:55:51] and that results in one thread per core
[13:56:04] you could replace that by instead saying "number of threads"
[13:56:20] but then if you wanted to provide a sub-mask for a set of connections associated with a particular network interface
[13:56:23] like drv said though, we still need the lcore mask for DPDK
[13:56:25] how would you do that?
[13:56:49] we may not actually be able to get away from cores just yet - I'd like to but it causes all kinds of problems
[13:57:16] do you see the SPDK NVMe-oF target getting used outside of a polled mode framework? meaning with multiple threads possibly running on each lcore, preempting each other?
[13:57:34] in general, no - but besides this problem it would work just fine
[13:58:21] it would work but performance would be interesting
[13:58:41] it may make sense in some cases where you're very idle
[13:58:43] for long periods
[13:58:55] with RDMA you can't move connections between poll groups like you can with iSCSI
[13:59:13] but you could move the whole poll group - except poll groups are i/o channels so you can change their thread
[13:59:20] you can just schedule their thread onto the same core as another thread
[13:59:29] s/can/can't
[13:59:43] so you could make the poll groups more granular
[14:00:00] making it easier to move connections between threads
[14:00:06] (notice i didn't say cores!)
[14:00:29] but you can't move connections between poll groups with RDMA
[14:00:36] once you pick one, it's fixed forever
[14:00:53] and poll groups are strictly tied to threads - they are i/o channels
[14:01:10] right - I meant if you want to move connections between threads for load balancing purposes, you move a whole poll group of connections
[14:01:29] but you can't move a poll group to a different thread
[14:01:34] it's an i/o channel
[14:01:47] I mean, we could add that ability I guess
[14:01:52] to move an i/o channel to a different thread
[14:01:58] maybe that would be workable
[14:02:12] drv stopped listening I think
[14:02:19] I'm still here :)
[14:02:31] but I may have lost the plot a while ago
[14:02:57] Anyone here that happens to have a virtio-scsi example so that I can connect to a vhost-exported SPDK LU?
[14:03:32] hi gila
[14:03:50] lib/bdev/virtio has the SPDK virtio-scsi bdev module
[14:03:57] is that what you are looking for?
[14:04:05] Well, correct me if I'm wrong.
[14:04:43] But -- I wanted to use it to connect "an app" to an SPDK-exported LU. If I read the docs right, you can use those bits to do that.
[14:05:03] "Virtio SCSI driver is an initiator for SPDK vhost application. The driver allows any SPDK app to connect to another SPDK instance exposing a vhost-scsi device."
[14:05:46] is your app running in a VM or on bare metal?
[14:05:52] bare metal.
[14:06:35] and do you want SPDK to access the device from within the same process as your app, or do you want it to live in a separate process and expose block devices to a whole set of other processes like a service?
[14:07:14] Ideally the second approach
[14:07:49] so probably the best choice is to run the SPDK vhost-scsi target as your "storage service"
[14:08:03] and then you have to modify your app to route I/O using the SPDK bdev layer
[14:08:21] *** Parts: lhodev (~Adium@inet-hqmc02-o.oracle.com) ()
[14:08:25] one of the available bdevs is a virtio-scsi initiator, which can connect to the SPDK vhost-scsi target
[14:08:42] if your app was running in a VM, specifically QEMU, then that comes with a virtio-scsi initiator built in
[14:08:56] so in that case your app doesn't need to be modified - the disk just shows up as a block device in your kernel
[14:09:06] in the guest kernel that is
[14:09:27] in the VM, you could either leverage the in-kernel virtio driver - or use the same SPDK virtio-scsi initiator from guest VM userspace
[14:09:38] yes -- the VM part is clear.
[14:10:26] you'll want to use the SPDK bdev library as your programming interface from within the app. The API is in include/spdk/bdev.h
[14:10:53] and your app will need to link against the bdev and virtio-scsi libraries in SPDK
[14:11:52] This is where I have the gap...
[14:12:02] Let's call the SPDK process serving out the vhost LUs -- A
[14:12:08] and the "client" app B.
[14:12:25] in B, I can't just call spdk_bdev_open() -- right?
[14:12:45] yes - you can call spdk_bdev_open() in B
[14:12:54] you have to set up B to be able to connect to A first though
[14:13:04] Yes, that's the missing bit.
[14:13:16] doc/bdev.md shows how to set up a VirtioUserX section in your config file to do this
[14:13:41] oh I see, so then B parses that config and SPDK does the right thing in terms of connecting?
[14:13:42] basically the path to the UNIX domain socket for the vhost controller being exposed by A
[14:13:42] you pass that config file with the right section to the "client" when it starts up and loads the bdev library
[14:13:47] yes - exactly!
[14:13:50] good!
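The VirtioUserX section being discussed might look roughly like the fragment below. The socket path is a placeholder, and the exact key names should be checked against doc/bdev.md, which the chat points to as the authoritative reference:

```
# Config file for "client" process B (placeholder values)
[VirtioUser0]
  # Path to the UNIX domain socket of the vhost-scsi controller
  # exposed by process A
  Path /tmp/vhost.0
```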
[14:14:03] and then when you ask the bdev library about the bdevs it knows about, your bdev is just in the list
[14:14:05] and you can open it
[14:14:33] the spdk virtio initiator will do a scsi bus scan and find all of the LUs exposed by "A"
[14:14:50] that's exactly what I need.
[14:14:53] and expose each LU as a separate bdev
[14:15:29] and it turns out that you can modify the config file and tell it to make a bdev out of a local NVMe device, or a remote iSCSI device, or an NVMe-oF device, or a Ceph RBD volume, etc. and the code you write in your "client" app is identical
[14:15:34] you just pass it a different config file
[14:15:41] and it works
[14:16:13] Yes, I figured out that SPDK abstracts all of that nicely.
[14:17:58] Thanks for the help, I'll look into that tomorrow.
[14:20:17] BTW, I would have two SPDK processes running, both consuming a certain amount of hugepages and cores, right?
[14:23:46] yes
[14:23:54] the target is one process, your client is another
[14:24:01] both are SPDK things
[14:28:10] Is that also the case with a VM?
[14:28:59] you can run SPDK in the VM, but with QEMU it can connect to the SPDK vhost-scsi target without itself being an SPDK process
[14:29:24] QEMU has a separate virtio-scsi implementation that it uses, outside of SPDK
[14:29:58] Ok, that sounds like something I want as well.
[14:31:40] So without making the process an SPDK process; being able to consume storage from another process which *is* an SPDK process
[14:31:53] and without QEMU =)
[14:33:17] that's totally possible, we just don't have code for that
[14:34:03] some of our libraries are usable by themselves - without consuming hugepages and initializing DPDK and such
[14:34:11] like blobstore
[14:34:26] I'm not sure our virtio-scsi initiator library is set up to do that, but it probably could do that
[14:35:14] you can of course connect to our NVMe-oF target using the linux kernel initiator
[14:35:25] and you can connect to our iSCSI target using the linux kernel initiator or libiscsi
[14:35:31] so those have decent enough options
[14:36:41] you can run the whole bdev layer without initializing DPDK, but it takes some work
[14:37:19] you have to re-implement the subset of include/spdk/env.h that bdev happens to use
[14:37:32] and you have to implement the callbacks necessary to call spdk_allocate_thread
[14:38:18] I'm not familiar enough with SPDK and its concepts to even consider doing that right now.
[14:38:36] it's a non-trivial amount of work, for sure
[14:38:47] iSCSI looks like something to consider, but AFAIK it uses the kernel to transport the packets
[14:38:59] you can use libiscsi as your initiator
[14:39:09] https://github.com/sahlberg/libiscsi
[14:39:22] not written by us, but it's just a C library that implements an asynchronous iSCSI initiator
[14:39:51] Yes, I'm familiar with that library
[14:40:16] But it *looks* like it would add a lot of overhead to the mix to access SPDK storage resources.
[14:40:26] yes definitely [14:40:50] a virtio-scsi initiator as just a plain library would be ideal for you [14:40:58] exactly [14:41:05] our code can probably be turned into that [14:41:35] I didn't write that component so I'm not familiar with how much work that would take, if any [14:42:08] if I understood correctly, QEMU kinda has, embedded within it, this virtio-scsi initiator, right? [14:42:13] yep [14:42:34] you could either take that and try to turn it into a library, or take ours and do the same [14:42:40] perhaps this code can serve as a boiler plate to libify it and create something like libvirtio-scsi [14:42:41] my guess is that ours is probably more easily separable [14:42:42] darsto is working towards decoupling the bdev part from the virtio-scsi part [14:43:39] yours might be more separable yes, but its also a little more difficult to understand where to make the cuts and so forth (from a beginner standpoint that is) [14:48:36] @jimharris so that it becomes an initiator type like library of some sort? [14:49:01] so that it is not tied specifically to the SPDK bdev layer [14:49:35] there are still dependencies on DPDK though - the biggest challenge is how to pass the file descriptors for shared memory from the client to the SPDK target app [14:49:49] currently that's done by passing the huge page descriptors opened by DPDK [14:51:06] so you'll definitely want/need to use 1GB hugepages for this usage model - since vhost really only supports passing a small number of fds [14:54:01] I can live with the huge pages requirement -- but having N poll mode drivers for N "SPDK clients" is not a workable solution for our use case. [15:57:09] *** Joins: Shuhei (caf6fc61@gateway/web/freenode/ip.202.246.252.97)