[00:46:15] <Shuhei> As I sent a mail, my pass was fake. Please do not rush to queue your patches.
[00:46:21] *** Quits: ziyeyang_ (ziyeyang@nat/intel/x-ckwvagmdqxfxbvqe) (Quit: Leaving)
[01:12:00] *** Joins: baruch (~baruch@bzq-82-81-85-138.red.bezeqint.net)
[02:40:29] *** Quits: Shuhei (caf6fc61@gateway/web/freenode/ip.202.246.252.97) (Ping timeout: 260 seconds)
[02:53:16] *** Joins: gila (~gila@94.212.217.200)
[02:59:51] <klateck> Hi Shuhei - this fail was caused by configuration of ASAN on one of the test machines, which revealed a bug when executing a lvol tasting scenario. We are currently working on fixing the bug and ASAN is temporarily disabled so that it doesn't block the test pool.
[04:02:43] *** Quits: gila (~gila@94.212.217.200) (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
[04:03:57] *** Joins: gila (~gila@5ED4D9C8.cm-7-5d.dynamic.ziggo.nl)
[06:11:54] *** Quits: gila (~gila@5ED4D9C8.cm-7-5d.dynamic.ziggo.nl) (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
[06:22:45] *** Joins: gila (~gila@5ED4D9C8.cm-7-5d.dynamic.ziggo.nl)
[07:34:08] *** Joins: boutcher (~boutcher@66.113.132.66)
[07:48:44] *** Joins: lhodev (~Adium@inet-hqmc01-o.oracle.com)
[07:51:59] *** Parts: lhodev (~Adium@inet-hqmc01-o.oracle.com) ()
[08:15:25] *** Joins: Vikas (9d301439@gateway/web/freenode/ip.157.48.20.57)
[08:29:19] *** Quits: Vikas (9d301439@gateway/web/freenode/ip.157.48.20.57) (Ping timeout: 260 seconds)
[08:42:21] *** Quits: baruch (~baruch@bzq-82-81-85-138.red.bezeqint.net) (Ping timeout: 264 seconds)
[08:46:19] *** Quits: gila (~gila@5ED4D9C8.cm-7-5d.dynamic.ziggo.nl) (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
[10:10:56] <bwalker> jimharris: I'm looking at a patch that adds comments to blobfs.h, and so I am looking at the API for spdk_fs_init
[10:11:05] <bwalker> and the user passes a function to send a request to the main dispatch thread as an argument
[10:11:14] <bwalker> but we could use spdk_thread_send_msg now that we have it, right?
[10:11:22] <jimharris> probably - yes
[10:22:02] <bwalker> once the blobstore API is all settled I think we can do some pretty neat things with blobfs
[10:22:06] <bwalker> to make it feel posix-y
[10:37:09] <jkkariu> I have a question about running bdevperf with kernel driver. Is this the correct configuration?
[10:37:20] <jkkariu> [AIO]
[10:37:20] <jkkariu>   AIO /dev/nvme0n1 AIO0
[10:40:05] <drv> jkkariu: yes, that should work
[10:40:25] <drv> the section should be [Aio], though - I'm not sure if that's case sensitive
[10:40:38] <drv> never mind, it is actually AIO
[10:41:17] <jkkariu> thanks, it is working
[10:46:26] <jimharris> drv, bwalker: don't we need to acquire the mutex around the TAILQ_FOREACH_SAFE here?  https://review.gerrithub.io/#/c/391329/12/lib/nvme/nvme_pcie.c
[10:47:26] <drv> I think the hotplug monitor function is called under the lock, but let me check
[10:47:52] <jimharris> and then we recursively acquire it?
[10:49:08] <drv> spdk_nvme_probe_internal -> nvme_pcie_ctrlr_scan -> _nvme_pcie_hotplug_monitor
[10:49:12] <drv> probe_internal takes the lock
[10:49:18] <drv> we shouldn't be recursively acquiring it - where is that?
[10:50:45] <drv> it would be nice to have some kind of macro to assert that we are holding a lock (for documentation purposes)
[10:50:52] <drv> I think you can build that with trylock
[10:53:38] <jimharris> oh - I was misreading the code, it's an unlock then lock
[10:54:42] <jimharris> this still seems racy to me though - what if tmp gets removed by a different thread between the unlock/lock calls?
[10:56:02] <drv> hmm, that's a good question, but I think the existing code is broken in the same way
[10:56:23] <jimharris> you mean the existing uevent code?
[10:57:05] <jimharris> the uevent code isn't iterating over the shared_attached_ctrlrs list though
[10:58:49] <jimharris> all of the attach/scan code is only designed for use from one thread, so all of this code is fine - I think we just need to make sure it's explicitly documented
[10:59:14] <drv> but it does drop the lock while calling remove_cb
[10:59:44] <drv> the TAILQ_FOREACH_SAFE should be okay, since it's the _SAFE variant that grabs the next pointer before calling the body of the loop
[11:00:14] <drv> hmm, no, I see what you're saying - if the next one gets removed rather than the current one, that would break
[11:01:03] <jimharris> that's what I'm concerned about though - thread A grabs the next pointer, and drops the lock, thread B grabs the lock, removes the ctrlr in thread A's next pointer, and drops the lock, thread B grabs the lock and now tries to operate on an invalid next pointer
[11:05:05] *** Quits: tomzawadzki (~tomzawadz@134.134.139.83) (Remote host closed the connection)
[11:32:40] *** Joins: lhodev (~Adium@inet-hqmc02-o.oracle.com)
[11:32:55] *** Parts: lhodev (~Adium@inet-hqmc02-o.oracle.com) ()
[11:35:21] <jimharris> bwalker: could you look at this set of 4 patches from shuhei?  https://review.gerrithub.io/#/c/391510/
[11:38:24] <jimharris> there's a patch out to support >64 lcores in SPDK which will need to be reworked a bit after the iSCSI load balancing changes from this series goes in
[11:49:41] <bwalker> sure
[12:39:42] *** Joins: lhodev (~Adium@inet-hqmc02-o.oracle.com)
[12:53:13] <drv> jimharris: can you also look at this patch? https://review.gerrithub.io/#/c/391898/
[12:53:51] <drv> Shuhei's definitely right that the cpumask handling for the RPC is wrong, but I'm not sure if we should change the RPC or just make spdk_iscsi_portal_create() interpret 0 == all cores
[12:54:24] <drv> (and this will also need to be updated for the new cpumask > 64 code, if that goes in first)
[12:54:39] <jimharris> i'd rather get shuhei's patch in first I think
[12:54:45] <jimharris> I haven't really looked at the >64 patch yet
[12:54:52] <drv> yeah, sounds fine to me
[12:58:22] <jimharris> i think we first get in shuhei's patch set that cleans up and simplifies the iscsi code that picks the reactor
[12:59:50] <jimharris> i think fixing the rpc code is fine
[13:19:42] <drv> I posted a big comment on the cpuset patch
[13:33:16] *** Joins: gila (~gila@5ED4D9C8.cm-7-5d.dynamic.ziggo.nl)
[13:39:27] <jimharris> do we want to even stick with "cpu" in the name for this?
[13:39:44] <drv> I'm open to alternatives if you have a better name :)
[13:39:53] <drv> we could theoretically even just use the bit array for this
[13:39:55] <jimharris> i'm just thinking of basing it around our spdk thread abstraction
[13:40:17] <bwalker> I'd prefer if all of our code moved away from "cores" and "cpus"
[13:40:20] <bwalker> to threads
[13:41:17] <jimharris> spdk_allocate_thread() can call spdk_env_get_current_core() to try to get an ID (i.e. an index to the bit array)
[13:41:58] <jimharris> so that covers most of the cases - but then when you're using something like fio plugin or rocksdb, spdk_env_get_current_core() will return -1
[13:42:05] <bwalker> if we can avoid even assuming that threads are pinned to cores it would be great
[13:42:22] <bwalker> except in the event framework - that can use knowledge of cores to pin threads all it wants
[13:42:32] <drv> one use for it is passing the core mask to DPDK, but I guess that could be just passed through as a string
[13:42:58] <jimharris> maybe - i guess i'm thinking that using core numbers as the thread ID reduces confusion
[13:43:09] <jimharris> when a pinned core number is available
[13:43:23] <bwalker> why now use the pthread id?
[13:43:28] <bwalker> s/now/not
[13:43:49] <jimharris> as an index into a bit array?
[13:44:22] <bwalker> I haven't actually looked at these patches, so I was just suggested as the thread id
[13:44:30] <bwalker> but if you need an index into a bit array that's a problem
[13:44:35] <drv> well, this patch set is more about defining a set of cores, not an individual core ID
[13:44:58] <bwalker> I think the first question is - do we even need a set of cores in our code?
[13:44:58] <drv> and finding out if a core is a member of a set, things like that
[13:45:02] <jimharris> it's getting rid of the current paradigm of a uint64_t as a bit mask for the available cores
[13:45:20] <jimharris> yes - load balancing/scheduling
[13:45:25] <bwalker> sure - but do you replace it with a set object that can hold more than 64, or can you rewrite those parts of the code to not make those assumptions
[13:46:28] <jimharris> this isn't about assumptions though - it's about allowing you to specify things like a thread/core mask for placing iSCSI or NVMe-oF connections from a specific portal
[13:47:27] <bwalker> sure it's about assumptions - you're assuming that threads are pinned to cores. I think what you want to specify as a user is which threads the connection can be processed on
[13:47:28] <bwalker> not which cores
[13:47:40] <jimharris> i'm not assuming threads are pinned to cores
[13:48:20] <bwalker> so we could define a "thread mask"
[13:48:41] <bwalker> and arbitrarily assign threads id numbers, but that algorithm has to have a way to pick unique numbers that can be represented by a mask
[13:48:46] <jimharris> yes - that's what I suggested two pages up :)
[13:49:13] <bwalker> but you can't use spdk_env_get_current_core as the id
[13:49:21] <bwalker> you need some other algorithm
[13:49:26] <jimharris> except I don't want to arbitrarily assign the numbers if it's a pinned thread
[13:49:52] <bwalker> even if the thread is pinned to that core, that doesn't mean that another thread couldn't preempt it occasionally and run there
[13:50:01] <bwalker> and end up with the same id
[13:50:06] <jimharris> now you're just being difficult :)
[13:50:33] <bwalker> I'm just saying I'd like to try and move away from that assumption - from experience writing the fio plugin
[13:50:35] <jimharris> as a user, how do i specify a thread mask for the iscsi target?
[13:50:47] <jimharris> but where do you use threads masks in the fio plugin?
[13:50:53] <jimharris> or core masks
[13:50:59] <bwalker> I use thread ids
[13:51:14] <jimharris> but not masks
[13:51:33] <bwalker> sure, but this may impact the thread id algorithm
[13:51:47] <bwalker> which today is pthread_self
[13:52:16] <bwalker> I'm not sure how the user would specify a set of threads
[13:52:30] <bwalker> I don't have a solution in mind right now
[13:52:39] <bwalker> threads don't have nice predictable numbers like cores because they come and go
[13:52:45] <jimharris> is there a case where someone would need to specify a set of threads in the non-pinned case?
[13:53:02] <jimharris> just thinking out loud here...
[13:53:04] <bwalker> in our code? no
[13:53:09] <jimharris> #define SPDK_MAX_THREADS 1024
[13:53:18] <jimharris> #define SPDK_MAX_LCORES 128 (or 256)
[13:53:42] <jimharris> if spdk_env_get_current_core() returns > 0, then it's a pinned lcore and we use that value
[13:53:53] <jimharris> otherwise we use an unpicked value between 128 and 1023
[13:54:34] <bwalker> how does the user know what that value is, when they go to specify a mask somewhere?
[13:54:55] <bwalker> this may be an intractable problem - I'm not sure yet.
[13:55:19] <jimharris> we'd have to think of a use case where we'd want to use thread masks in the non-pinned case
[13:55:43] <bwalker> today for like NVMe-oF, the user specifies a global core mask
[13:55:51] <bwalker> and that results in one thread per core
[13:56:04] <bwalker> you could replace that by instead saying "number of threads"
[13:56:20] <bwalker> but then if you wanted to provide a sub-mask for a set of connections associated with a particular network interface
[13:56:23] <jimharris> like drv said though, we still need the lcore mask for DPDK
[13:56:25] <bwalker> how would you do that?
[13:56:49] <bwalker> we may not actually be able to get away from cores just yet - I'd like to but it causes all kinds of problems
[13:57:16] <jimharris> do you see the SPDK NVMe-oF target getting used outside of a polled mode framework?   meaning with multiple threads possibly running on each lcore, preempting each other?
[13:57:34] <bwalker> in general, no - but besides this problem it would work just fine
[13:58:21] <jimharris> it would work but performance would be interesting
[13:58:41] <bwalker> it may make sense in some cases where you're very idle
[13:58:43] <bwalker> for long periods
[13:58:55] <bwalker> with RDMA you can't move connections between poll groups like you can with iSCSI
[13:59:13] <bwalker> but you could move the whole poll group - except poll groups are i/o channels so you can change their thread
[13:59:20] <bwalker> you can just schedule their thread onto the same core as another thread
[13:59:29] <bwalker> s/can/can't
[13:59:43] <jimharris> so you could make the poll groups more granular
[14:00:00] <jimharris> making it easier to move connections between threads
[14:00:06] <jimharris> (notice i didn't say cores!)
[14:00:29] <bwalker> but you can't move connections between poll groups with RDMA
[14:00:36] <bwalker> once you pick one, it's fixed forever
[14:00:53] <bwalker> and poll groups are strictly tied to threads - they are i/o channels
[14:01:10] <jimharris> right - I meant if you want to move connections between threads for load balancing purposes, you move a whole poll group of connections
[14:01:29] <bwalker> but you can't move a poll group to a different thread
[14:01:34] <bwalker> it's an i/o channel
[14:01:47] <bwalker> I mean, we could add that ability I guess
[14:01:52] <bwalker> to move an i/o channel to a different thread
[14:01:58] <bwalker> maybe that would be workable
[14:02:12] <jimharris> drv stopped listening I think
[14:02:19] <drv> I'm still here :)
[14:02:31] <drv> but I may have lost the plot a while ago
[14:02:57] <gila> Anyone here that happens to have virtio scsi example so that i can connect to a vhost exported SPDK LU?
[14:03:32] <jimharris> hi gila
[14:03:50] <jimharris> lib/bdev/virtio has the SPDK virtio-scsi bdev module
[14:03:57] <jimharris> is that what you are looking for?
[14:04:05] <gila> Well, correct if im wrong.
[14:04:43] <gila> But -- i wanted to use to connect "an app" to an SPDK exported LU. If i read the docs right, you can use those bits to do that.
[14:05:03] <gila> "Virtio SCSI driver is an initiator for SPDK vhost application. The driver allows any SPDK app to connect to another SPDK instance exposing a vhost-scsi device."
[14:05:46] <bwalker> is your app running in a VM or on bare metal?
[14:05:52] <gila> baremetal.
[14:06:35] <bwalker> and do you want SPDK to access the device from within the same process as your app, or do you want it to live in a separate process and expose block devices to a whole set of other processes like a service?
[14:07:14] <gila> Ideally the second approach
[14:07:49] <bwalker> so probably the best choice is to run the SPDK vhost-scsi target as your "storage service"
[14:08:03] <bwalker> and then in your app, you have to modify it to route I/O using the SPDK bdev layer
[14:08:21] *** Parts: lhodev (~Adium@inet-hqmc02-o.oracle.com) ()
[14:08:25] <bwalker> one of the available bdev's is a virtio-scsi initiator, which can connect to the SPDK vhost-scis target
[14:08:42] <bwalker> if your app was running in a VM, specifically QEMU, then that comes with a virtio-scsi initiator built in
[14:08:56] <bwalker> so in that case your app doesn't need to be modified - the disk just shows up as a block device in your kernel
[14:09:06] <bwalker> in the guest kernel that is
[14:09:27] <jimharris> in the VM, you could either leverage the in-kernel virio driver - or use the same SPDK virtio-scsi initiator from guest VM userspace
[14:09:38] <gila> yes -- the VM part is clear.
[14:10:26] <bwalker> you'll want to use the SPDK bdev library as your programming interface from within the app. The API is in include/spdk/bdev.h
[14:10:53] <bwalker> and your app will need to link against the bdev and virtio-scsi libraries in SPDK
[14:11:52] <gila> This is where I have the gap...
[14:12:02] <gila> Lets call the SPDK process serving out the VHOST LU's -- A
[14:12:08] <gila> and the "client" app B.
[14:12:25] <gila> if B, i cant just call spdk_bdev_open() -- right?
[14:12:45] <jimharris> yes - you can call spdk_bdev_open() in B
[14:12:54] <jimharris> you have to set up B to be able to connect to A first though
[14:13:04] <gila> Yes, thats the missing bit.
[14:13:16] <jimharris> doc/bdev.md shows how to set up a VirtioUserX section in your config file to do this
[14:13:41] <gila> oh I see, so then B parses that config and SPDK does the right thing in terms of connecting?
[14:13:42] <jimharris> basically the path to the UNIX domain socket for the vhost controller being exposed by A
[14:13:42] <bwalker> you pass that config file with the right section to the "client" when it starts up and loads the bdev library
[14:13:47] <jimharris> yes - exactly!
[14:13:50] <gila> good!
[14:14:03] <bwalker> and then when you ask the bdev library about the bdevs it knows about, your bdev is just in the list
[14:14:05] <bwalker> and you can open it
[14:14:33] <jimharris> the spdk virtio initiator will do a scsi bus scan and find all of the LUs exposed by "A"
[14:14:50] <gila> thats exactly what I need.
[14:14:53] <jimharris> and expose each LU as a separate bdev
[14:15:29] <bwalker> and it turns out that you can modify the config file and tell it to make a bdev out of a local NVMe device, or a remote iSCSI device, or an NVMe-oF device, or a Ceph RBD volume, etc. and the code you write in your "client" app is identical
[14:15:34] <bwalker> you just pass it a different config file
[14:15:41] <bwalker> and it works
[14:16:13] <gila> Yes, i figured that out that SPDK abstracts all of that nicely.
[14:17:58] <gila> Thx for the help, i'lll look in to that tomorrow.
[14:20:17] <gila> BTW, I would have two SPDK processes running both consuming a certain amount of huge pages and cores right?
[14:23:46] <bwalker> yes
[14:23:54] <bwalker> the target is one process, your client is another
[14:24:01] <bwalker> both are SPDK things
[14:28:10] <gila> Is that also the case with a VM?
[14:28:59] <bwalker> you can run SPDK in the VM, but with QEMU it can connect to the SPDK vhost-scsi target without itself being an SPDK process
[14:29:24] <bwalker> QEMU has a separate virtio-scsi implementation that it uses, outside of SPDK
[14:29:58] <gila> Ok, that sounds like something I want as well.
[14:31:40] <gila> So without making it process an SPDK process; being able to consume storage from an other process which *is* an SPDK process
[14:31:53] <gila> and without QEMU =)
[14:33:17] <bwalker> that's totally possible, we just don't have code for that
[14:34:03] <bwalker> some of our libraries are usable by themselves - without consuming hugepages and initializing DPDK and such
[14:34:11] <bwalker> like blobstore
[14:34:26] <bwalker> I'm not sure our virtio-scsi initiator library is set up to do that, but it probably could do that
[14:35:14] <bwalker> you can of course connect to our NVMe-oF target using the linux kernel initiator
[14:35:25] <bwalker> and you can connect to our iSCSI target using the linux kernel initiator or libiscsi
[14:35:31] <bwalker> so those have decent enough options
[14:36:41] <bwalker> you can run the whole bdev layer without initializing DPDK, but it takes some work
[14:37:19] <bwalker> you have to re-implement the subset of include/spdk/env.h that bdev happens to use
[14:37:32] <bwalker> and you have to implement the callbacks necessary to call spdk_allocate_thread
[14:38:18] <gila> Im not familiar enough with SPDK and its concepts to even consider doing that right now.
[14:38:36] <bwalker> it's a non-trivial amount of work, for sure
[14:38:47] <gila> iSCSI looks something to consider but AFAIK, it uses the kernel to transport the packets
[14:38:59] <bwalker> you can use libiscsi as your initiator
[14:39:09] <bwalker> https://github.com/sahlberg/libiscsi
[14:39:22] <bwalker> not written by us, but it's just a C library that implements and asynchronous iscsi initiator
[14:39:51] <gila> Yes, I'm familiar with that library
[14:40:16] <gila> But it *looks* like adding a lot of overhead to the mix to access SPDK storage resources.
[14:40:26] <bwalker> yes definitely
[14:40:50] <bwalker> a virtio-scsi initiator as just a plain library would be ideal for you
[14:40:58] <gila> exactly
[14:41:05] <bwalker> our code can probably be turned into that
[14:41:35] <bwalker> I didn't write that component so I'm not familiar with how much work that would take, if any
[14:42:08] <gila> if I understood correctly, QEMU kinda has, embedded within it, this virtio-scsi  initiator, right?
[14:42:13] <bwalker> yep
[14:42:34] <bwalker> you could either take that and try to turn it into a library, or take ours and do the same
[14:42:40] <gila> perhaps this code can serve as a boiler plate to libify it and create something like libvirtio-scsi
[14:42:41] <bwalker> my guess is that ours is probably more easily separable
[14:42:42] <jimharris> darsto is working towards decoupling the bdev part from the virtio-scsi part
[14:43:39] <gila> yours might be more separable yes, but its also a little more difficult to understand where to make the cuts and so forth (from a beginner standpoint that is)
[14:48:36] <gila> @jimharris so that it becomes an initiator type like library of some sort?
[14:49:01] <jimharris> so that it is not tied specifically to the SPDK bdev layer
[14:49:35] <jimharris> there are still dependencies on DPDK though - the biggest challenge is how to pass the file descriptors for shared memory from the client to the SPDK target app
[14:49:49] <jimharris> currently that's done by passing the huge page descriptors opened by DPDK
[14:51:06] <jimharris> so you'll definitely want/need to use 1GB hugepages for this usage model - since vhost really only supports passing a small number of fds
[14:54:01] <gila> I can live with the huge pages requirement -- but having N poll mode drivers for N "SPDK clients" is not a workable solution for our use case.
[15:57:09] *** Joins: Shuhei (caf6fc61@gateway/web/freenode/ip.202.246.252.97)