[00:42:03] *** Quits: Shuhei (caf6fc61@gateway/web/freenode/ip.202.246.252.97) (Ping timeout: 252 seconds)
[00:46:12] *** Joins: tkulasek (~tkulasek@134.134.139.73)
[00:58:43] *** Quits: tomzawadzki (tomzawadzk@nat/intel/x-vycrwbmpoogxcaxx) (Remote host closed the connection)
[01:00:40] *** Joins: tzawadzki (~tomzawadz@192.55.54.42)
[01:06:05] *** Quits: tzawadzki (~tomzawadz@192.55.54.42) (Ping timeout: 240 seconds)
[01:06:39] *** Joins: tomzawadzki (tomzawadzk@nat/intel/x-mghkothqgiosbpsl)
[01:21:13] *** Quits: guerby (~guerby@april/board/guerby) (Read error: Connection reset by peer)
[01:21:25] *** Joins: guerby (~guerby@ip165.tetaneutral.net)
[01:21:25] *** Quits: guerby (~guerby@ip165.tetaneutral.net) (Changing host)
[01:21:25] *** Joins: guerby (~guerby@april/board/guerby)
[04:32:30] *** Joins: dlw1 (~Thunderbi@114.255.44.139)
[04:34:02] *** Quits: dlw (~Thunderbi@114.255.44.143) (Ping timeout: 256 seconds)
[04:34:02] *** dlw1 is now known as dlw
[04:46:38] *** Quits: dlw (~Thunderbi@114.255.44.139) (Ping timeout: 256 seconds)
[05:11:34] *** Joins: dlw (~Thunderbi@114.255.44.139)
[05:49:24] *** Quits: tomzawadzki (tomzawadzk@nat/intel/x-mghkothqgiosbpsl) (Ping timeout: 256 seconds)
[05:54:24] *** Joins: lyan (~lyan@2605:a000:160e:2dd:4a4d:7eff:fef2:eea3)
[05:54:47] *** lyan is now known as Guest27255
[06:09:58] *** Quits: dlw (~Thunderbi@114.255.44.139) (Ping timeout: 264 seconds)
[06:11:48] bwalker: spdk_tgt: blobstore.c:2737: _spdk_bs_load_used_clusters_cpl: Assertion `ctx->mask->length <= (ctx->super->used_cluster_mask_len * sizeof(struct spdk_blob_md_page) * 8)' failed.
[06:12:59] can those asserts be converted into failures or trigger some kind of recovery?
[06:26:18] *** Joins: tkulasek_ (~tkulasek@192.55.54.42)
[06:29:24] *** Quits: tkulasek (~tkulasek@134.134.139.73) (Ping timeout: 265 seconds)
[07:21:56] *** Joins: tomzawadzki (~tomzawadz@134.134.139.74)
[07:24:27] *** Joins: philipp-sk (~Philipp@ktnron0916w-lp130-02-76-66-162-159.dsl.bell.ca)
[07:26:17] @bwalker: Can I ask for a review of the delete_lvol_bdev patch series? I would really appreciate feedback on those: https://review.gerrithub.io/#/c/spdk/spdk/+/407402/
[07:55:46] *** Quits: tomzawadzki (~tomzawadz@134.134.139.74) (Ping timeout: 256 seconds)
[08:05:01] pwodkowx: any idea what steps led up to that assert?
[08:05:21] i agree these shouldn't assert and should just fail in some way - but we really shouldn't be hitting that condition
[08:07:21] could you try running blob_cli -D on that bdev?
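[editor's note: the assert pwodkowx hit guards the used-cluster mask read during blobstore load. Below is a minimal sketch of the non-fatal handling bwalker is asking about, reusing the condition from the quoted message; bs_load_ctx_fail() is a hypothetical stand-in for blobstore's internal load-error path, which may be named differently in the actual source:]

    /* Fragment sketched against _spdk_bs_load_used_clusters_cpl(); instead of
     * assert(), report corrupt metadata as a load failure the caller can handle. */
    if (ctx->mask->length >
        ctx->super->used_cluster_mask_len * sizeof(struct spdk_blob_md_page) * 8) {
            SPDK_ERRLOG("used_cluster mask length %" PRIu64 " overruns the region "
                        "described by the superblock - metadata is corrupt\n",
                        (uint64_t)ctx->mask->length);
            /* hypothetical helper: complete the load sequence with an error */
            bs_load_ctx_fail(seq, ctx, -EILSEQ);
            return;
    }

Failing the load callback would surface corruption as an error instead of aborting the whole spdk_tgt process.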
[08:52:17] *** Joins: peter_turschm (~peter_tur@66.193.132.66)
[09:11:53] *** Quits: philipp-sk (~Philipp@ktnron0916w-lp130-02-76-66-162-159.dsl.bell.ca) (Read error: Connection reset by peer)
[09:12:18] *** Joins: philipp-sk (~Philipp@ktnron0916w-lp130-02-76-66-162-159.dsl.bell.ca)
[09:30:44] peluse: so the new webex meetings will use both a meeting number/access code, as well as this password?
[09:31:09] yes
[09:31:32] well, I haven't tried calling in, but that sure looks like how it will work. We'll find out at the next one :)
[09:36:12] peluse: could write the webex password into an image and save it
[09:36:15] then put it up on the website
[09:37:12] Yup, I can do that - good idea!
[09:37:47] still need to address a few of your crypto comments first, so maybe later this week :)
[09:37:50] thanks for the review BTW
[09:38:50] bwalker: hmm, blobstore still leaks io_channels in the case of a failed lvol examine
[09:39:26] at the time bdev_register() finishes, blobstore still has an io_channel held
[09:41:06] bwalker, BTW by "use one of our pools" where I'm currently allocating a buffer for writes, you do mean use the callback-driven spdk_bdev_io_get_buf(), right?
[09:42:02] i'm not sure if that's the bug causing the dpdk 18.05 patch to fail, but it must be fixed as well
[09:42:06] peluse: yep exactly
[09:42:33] gracias
[09:42:48] darsto: which io_channel is it? metadata channel?
[09:42:49] i think my one and only patch for blobstore was addressing that back in the day - https://review.gerrithub.io/c/spdk/spdk/+/388424
[09:43:59] if there is a fairly simple way to reproduce it, I can take a look
[09:44:03] and see what's going wrong
[10:21:06] root cause (i think) on the fedora-02 failure with dpdk 18.05 - the aio bdev module never unregisters its io_device
[10:21:59] this manifests itself by aio freeing the memory that was the io_device; bdev/part does a calloc, gets this same buffer, and tries to register it as an io_device
[10:22:24] something about fedora-02 + dpdk 18.05 allowed these steps to occur - testing a patch now to confirm
[10:25:50] nice
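[editor's note: the bug jimharris describes is an spdk_io_device_register() without a matching spdk_io_device_unregister(), so the freed pointer can be recycled by a later calloc() and re-registered. A minimal sketch of the missing teardown, assuming the 2018-era "spdk/io_channel.h" header; g_example_io_device and the function names are illustrative, not the actual aio module symbols:]

    #include <stdlib.h>
    #include "spdk/io_channel.h"

    /* Hypothetical context that was passed to spdk_io_device_register() at init. */
    static void *g_example_io_device;

    static void
    example_io_device_unregistered(void *io_device)
    {
            /* Only now - after all channels are released and unregistration has
             * completed - is it safe to free the memory backing the io_device. */
            free(io_device);
    }

    static void
    example_module_fini(void)
    {
            spdk_io_device_unregister(g_example_io_device,
                                      example_io_device_unregistered);
    }

The unregister callback fires only once every channel for the io_device is gone, which is what makes the free() safe against the reuse-and-reregister scenario above.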
[10:26:28] *** Quits: peter_turschm (~peter_tur@66.193.132.66) (Remote host closed the connection)
[10:40:14] *** Joins: peter_turschm (~peter_tur@66.193.132.66)
[11:04:22] *** Joins: travis-ci (~travis-ci@ec2-54-147-133-39.compute-1.amazonaws.com)
[11:04:23] (spdk/master) blobfs: add the check for buf allocation (Ziye Yang)
[11:04:23] Diff URL: https://github.com/spdk/spdk/compare/0de89b361735...e152aa8e5179
[11:04:23] *** Parts: travis-ci (~travis-ci@ec2-54-147-133-39.compute-1.amazonaws.com) ()
[11:30:13] peluse: i've cherry-picked your two crypto dpdk 18.02 patches to dpdk 18.05: https://review.gerrithub.io/#/c/spdk/dpdk/+/420448/
[11:30:44] i removed the common_base changes though - i couldn't see why they were needed
[12:44:39] *** Quits: tkulasek_ (~tkulasek@192.55.54.42) (Ping timeout: 256 seconds)
[13:21:21] *** Joins: alekseymmm (bcf3adf1@gateway/web/freenode/ip.188.243.173.241)
[13:59:00] *** Quits: alekseymmm (bcf3adf1@gateway/web/freenode/ip.188.243.173.241) (Ping timeout: 252 seconds)
[14:05:06] *** Joins: alekseymmm (bcf3adf1@gateway/web/freenode/ip.188.243.173.241)
[14:33:17] @jimharris fyi - replied to your comment @ https://review.gerrithub.io/#/c/spdk/spdk/+/419510/
[14:35:22] klateck: looking now...
[14:36:35] for configuring subsystems, what are you thinking that would look like?
[14:36:47] would the subsystems show up in the ls/tree output, with the parameters that can be changed?
[14:36:58] or just rpc names that you execute from the command line?
[14:48:48] *** Joins: philipp_sk (~Philipp@ktnron0916w-lp130-02-76-66-162-159.dsl.bell.ca)
[14:48:48] *** Quits: philipp-sk (~Philipp@ktnron0916w-lp130-02-76-66-162-159.dsl.bell.ca) (Read error: Connection reset by peer)
[14:50:59] *** Quits: Guest27255 (~lyan@2605:a000:160e:2dd:4a4d:7eff:fef2:eea3) (Quit: Leaving)
[15:07:59] Hello
[15:08:17] in the bdev module functions there is a callback: void (*init_complete)(void);
[15:08:31] in its description it is said:
[15:08:33] * Optional callback for modules that require notification of when * the bdev subsystem has completed initialization.
[15:08:47] but what does it mean that the bdev subsystem has completed initialization?
[15:09:35] I mean, will it be called after all the .module_init callbacks of all the modules are called?
[15:10:17] or is it about this particular module?
[15:10:18] not only after they're called, but after they've reported they're done initializing
[15:10:36] it's global - that notification is called on each module after all modules have completed their initialization
[15:10:45] when do the modules notify that they are done initializing?
[15:11:11] i mean, which function notifies?
[15:11:41] if a bdev module's module_init function returns 1 (greater than 0, really), that means it is asynchronously initializing
[15:11:48] I have noticed that none of the modules use this .init_complete callback
[15:11:56] oooh
[15:12:07] when a module is done asynchronously initializing, it must call spdk_bdev_module_init_done
[15:12:37] the init_complete thing will be called on each module only after all bdev modules have finished initializing (accounting for asynchronous initialization)
[15:13:15] do any of the existing modules perform initialization asynchronously?
[15:14:18] I feel like we added that because some did
[15:14:23] but I am still looking for one
[15:14:59] i grepped for .init_complete in lib/bdev. It seems none do
[15:15:25] nothing uses init_complete I don't think - that was added for someone that wanted that particular notification for their own bdev module
[15:15:34] but I thought some module initialized asynchronously
[15:15:44] however, this is older than the examine callback mechanism we have now
[15:15:58] so now I'm thinking we may have moved everything to that examine callback and the initializations are all synchronous
[15:16:11] Is it better to use the examine mechanism instead of init_complete?
[15:16:51] yeah - the init_complete should have been removed from the bdev_module
[15:17:39] there are flags in the bdev module that indicate whether module_init and module_fini can be asynchronous
[15:17:50] and I see two modules set async_init to true
[15:17:50] We have spdk_bdev_module_examine_done(); I think it could implement the logic of init_complete
[15:18:01] the iscsi initiator and the virtio initiator
[15:19:09] So you are going to leave init_complete as is, but you suggest I not use it - use examine instead?
[15:19:14] one sec
[15:19:50] I think there is a lot of duplicated functionality here - the iscsi initiator is reading the configuration file and doing asynchronous connects in .module_init
[15:20:38] well, I guess that's the only time that can run
[15:20:48] the init_complete was added so that modules could know when all initial bdevs had been identified - then modules like RAID could make decisions about partially discovered volumes
[15:22:05] see a4a497d5b and cbb8f4657
[15:22:26] you mean init_complete is called after all the examines are done?
[15:22:45] correct - that's the idea (although nothing is using it yet, including the newly added raid bdev module)
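[editor's note: a sketch of the module lifecycle being described, written against the 2018-era bdev_module.h API (the registration macro and field names may differ in other releases). As bwalker clarifies further down, the asynchronous behavior is keyed off the .async_init flag rather than the module_init return code, and that is what this sketch follows:]

    #include "spdk/bdev_module.h"

    static struct spdk_bdev_module example_if;

    /* Placeholder for real asynchronous work - the iscsi initiator, for
     * example, kicks off its connects from .module_init. */
    static void
    example_start_async_setup(void)
    {
    }

    static int
    example_module_init(void)
    {
            example_start_async_setup();
            /* 0 means "no synchronous error"; with .async_init set below, the
             * bdev layer now waits for spdk_bdev_module_init_done(). */
            return 0;
    }

    /* To be invoked from the async path once setup has really finished. */
    static void
    example_async_setup_done(void)
    {
            spdk_bdev_module_init_done(&example_if);
    }

    static void
    example_init_complete(void)
    {
            /* Runs only after every module has initialized and the initial
             * examine pass is over - the point where a RAID-like module
             * could decide about partially discovered volumes. */
    }

    static struct spdk_bdev_module example_if = {
            .name = "example",
            .module_init = example_module_init,
            .init_complete = example_init_complete,
            .async_init = true,
    };
    SPDK_BDEV_MODULE_REGISTER(&example_if)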
[15:23:31] I think everything is in order with the API now that I look it over (not sure the async_init flag is actually required, but that's minor)
[15:23:35] except, of course, the documentation
[15:24:03] we just created the bdev_module.h header not long ago to try to formalize the bdev module API
[15:24:22] now that the API is in a single header, we need to write a bunch of documentation on what everything means
[15:26:33] a lot of the movement on this front is fallout from the lab we did at the SPDK summit about how to write a bdev module
[15:26:56] we'll have a full programming guide eventually
[15:27:12] unfortunately I missed this summit and workshop
[15:27:26] waiting for it
[15:28:46] if you have any more questions just ask and I'll add them to my notes about what the guide needs to include
[15:32:24] This question came from the fact that I was trying to create my own bdev and test it with fio. If I register my bdev (it is similar to passthru) in examine it works well, but if I register it in init_complete I get a coredump from fio when it runs with numjobs > 1. So I am trying to understand the difference
[15:32:58] thanks for the explanations though
[15:33:03] if your bdev claims another bdev
[15:33:09] it is
[15:33:18] you should be registering it in the examine callback for the bdev you are claiming
[15:33:48] why can't I register in init_complete of my module?
[15:34:11] I mean, if init_complete is called after all the examines, then it sounds possible
[15:34:13] I'd have to look at the crash - honestly I think that would work too
[15:34:39] the examine mechanism is specifically for this case, but init_complete is called late enough in the process that everything should be ready
[15:34:46] *** Quits: philipp_sk (~Philipp@ktnron0916w-lp130-02-76-66-162-159.dsl.bell.ca) (Ping timeout: 256 seconds)
[15:36:11] what does "late enough" mean? Is it called by a timer or something like that?
[15:36:28] I mean it is called after all the examines have run
[15:36:37] so I would think it is fine to claim a bdev there
[15:36:47] I'd have to see your specific crash - it could be a bug
[15:37:12] maybe it is a bug. I will try reading it more carefully
[15:37:42] thanks for all the advice. I think I'll just switch to using examine instead of init_complete like everyone else does
[15:42:43] In the code in spdk_bdev_modules_init there is the line rc = module->module_init(); And if, as you said, some module's init returns 1 (something about async init), then this rc will be returned to spdk_bdev_initialize()
[15:42:57] and the result will be SPDK_ERRLOG("bdev modules init failed\n");
[15:43:02] spdk_bdev_init_complete(-1);
[15:43:06] is that ok?
[15:44:51] it looks like I was wrong - we changed the behavior, I think
[15:45:02] it isn't based on the return code from module_init anymore
[15:45:16] probably changed it a year ago or something and I just don't remember
[15:45:33] there is another flag called "async_init" in the module structure
[15:45:43] if you set that to true, then module_init is treated as asynchronous
[15:46:06] so .module_init returns 0 on success, non-zero on failure
[15:46:14] and the asynchronous part is based on that other .async_init flag
[15:47:22] ok thanks
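[editor's note: the examine path recommended above, sketched in the style of the passthru module against the 2018-era API (a single .examine callback, wired into the module struct as .examine = example_examine); example_if is the module object from the previous sketch, and "Base0" is purely illustrative:]

    #include <string.h>
    #include "spdk/bdev.h"
    #include "spdk/bdev_module.h"
    #include "spdk/log.h"

    static struct spdk_bdev_module example_if;

    static void
    example_examine(struct spdk_bdev *bdev)
    {
            struct spdk_bdev_desc *desc;

            /* Only stack on the base bdev this module is configured for. */
            if (strcmp(spdk_bdev_get_name(bdev), "Base0") != 0) {
                    spdk_bdev_module_examine_done(&example_if);
                    return;
            }

            if (spdk_bdev_open(bdev, false, NULL, NULL, &desc) != 0) {
                    SPDK_ERRLOG("could not open %s\n", spdk_bdev_get_name(bdev));
                    spdk_bdev_module_examine_done(&example_if);
                    return;
            }

            if (spdk_bdev_module_claim_bdev(bdev, desc, &example_if) != 0) {
                    SPDK_ERRLOG("could not claim %s\n", spdk_bdev_get_name(bdev));
                    spdk_bdev_close(desc);
                    spdk_bdev_module_examine_done(&example_if);
                    return;
            }

            /* ... construct and register the new vbdev on top of 'bdev' here ... */

            spdk_bdev_module_examine_done(&example_if);
    }

Whichever branch is taken, spdk_bdev_module_examine_done() is reported exactly once per examined bdev - otherwise the bdev layer keeps waiting on the module.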
[15:52:52] *** Joins: tomzawadzki (~tomzawadz@134.134.139.73)
[15:54:40] *** Joins: Shuhei (caf6fc61@gateway/web/freenode/ip.202.246.252.97)
[15:54:56] been meaning to ask this for months. When running with gdb, an assert() that hits does a SIGABRT and I can't break at that point; however, if I temporarily hack in an if and raise(SIGINT) for the condition, I get a nice break in gdb and can step on through. Is there a better way to get this preferred behavior, or am I doing something wrong?
[16:04:18] Anyone done any tests with fio lately with numjobs > 4? I'm chasing down a report of a segfault (ioengine=spdk). I found that once I started bumping numjobs over 4, I started seeing segfaults. When I run with 8, I get the segfault every time.
[16:07:42] The I/Os appear to run ok, but as I watch the fio timer tick down to zero, it's at that point I consistently see the segfault.
[16:07:48] @lhodev i have similar issues, but they are probably due to bugs in my code. What bdevs do you test?
[16:08:21] alekseymmm: I'm not using any bdevs. Just fio straight to an nvme drive.
[16:09:36] How many cores on your system?
[16:10:48] alekseymmm: I've got 24 cores on this particular system, but the one on which it was reported had many more.
[16:11:21] Examining the core files with gdb, I consistently see the same backtrace. It's dying in spdk_fio_cleanup().
[16:11:59] The call to pthread_cancel() crashes.
[16:13:21] is there an errno that it prints out
[16:13:26] or that you can get to
[16:13:50] bwalker: no
[16:14:48] it actually crashes in pthread_cancel?
[16:14:59] is it trying to cancel itself or something?
[16:15:18] bwalker: yes, it dies in pthread_cancel(). I'm left to speculate that the passed-in g_ctrlr_thread_id is bad.
[16:15:49] sure - but the function is supposed to return ESRCH on a bad thread id
[16:16:04] so crashing is probably something much worse
[16:16:13] the only thing I can think of is cancelling self
[16:16:33] could add a check for whether g_ctrlr_thread_id is pthread_self right before the call
[16:16:35] *** Quits: tomzawadzki (~tomzawadz@134.134.139.73) (Ping timeout: 240 seconds)
[16:16:37] and see if it hits
[16:16:44] lhodev: this is the nvme fio_plugin? not the bdev fio_plugin?
[16:17:06] I've never written a custom fio engine, so I'm completely in the dark starting off with this.
[16:17:22] jimharris: it's the nvme fio_plugin -- not the bdev one.
[16:18:05] everything runs ok until the end of the test run?
[16:19:02] jimharris: yes, it appears to run fine up until the runtime expires, then it segfaults, but only (so far) if numjobs > 4.
[16:19:14] if numjobs is 6, then sometimes it completes fine, but sometimes not.
[16:19:24] if numjobs is 8, it segfaults consistently.
[16:26:39] bwalker: I tried your experiment testing pthread_self() against g_ctrlr_thread_id. In several instances, they do *not* match.
[16:32:14] *** Quits: alekseymmm (bcf3adf1@gateway/web/freenode/ip.188.243.173.241) (Quit: Page closed)
[16:36:57] Given that spdk_fio_cleanup() apparently is invoked multiple times -- once per thread? -- I tried adding the modifier "__thread" to the declaration of g_ctrlr_thread_id, but that doesn't change the behavior.
[17:05:55] bwalker: I've yet to see an instance where a thread attempted to cancel itself.
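[editor's note: the check bwalker suggested, sketched as a guard that could sit directly in front of the pthread_cancel() call in spdk_fio_cleanup(); g_ctrlr_thread_id is the plugin global from the discussion, and the logging is illustrative:]

    #include <pthread.h>
    #include <stdio.h>

    extern pthread_t g_ctrlr_thread_id;  /* fio plugin global, per the discussion */

    static void
    cancel_ctrlr_thread(void)
    {
            /* pthread_t is opaque - compare with pthread_equal(), not ==. */
            if (pthread_equal(pthread_self(), g_ctrlr_thread_id)) {
                    fprintf(stderr, "cleanup called from the controller thread "
                            "itself; skipping pthread_cancel()\n");
                    return;
            }
            if (pthread_cancel(g_ctrlr_thread_id) != 0) {
                    fprintf(stderr, "pthread_cancel failed\n");
            }
    }

Since spdk_fio_cleanup() evidently runs once per fio job thread, a once-only guard around the cancel (e.g. pthread_once) would also rule out cancelling a thread that an earlier invocation already tore down.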
[17:28:58] *** Joins: lyan (~lyan@2605:a000:160e:2dd:4a4d:7eff:fef2:eea3)
[17:29:30] *** lyan is now known as Guest25520
[17:38:08] *** Quits: peter_turschm (~peter_tur@66.193.132.66) (Remote host closed the connection)
[18:41:08] *** Joins: dlw (~Thunderbi@114.255.44.143)
[18:52:42] *** Quits: Shuhei (caf6fc61@gateway/web/freenode/ip.202.246.252.97) (Ping timeout: 252 seconds)
[19:05:28] *** Joins: Shuhei (caf6fc61@gateway/web/freenode/ip.202.246.252.97)
[20:01:29] *** Joins: philipp-sk (~Philipp@ktnron0916w-lp130-02-76-66-162-159.dsl.bell.ca)
[20:31:09] *** Quits: Shuhei (caf6fc61@gateway/web/freenode/ip.202.246.252.97) (Ping timeout: 252 seconds)
[20:44:28] *** Quits: philipp-sk (~Philipp@ktnron0916w-lp130-02-76-66-162-159.dsl.bell.ca) (Ping timeout: 244 seconds)
[20:52:19] *** Quits: Guest25520 (~lyan@2605:a000:160e:2dd:4a4d:7eff:fef2:eea3) (Quit: Leaving)
[21:52:04] *** Joins: tomzawadzki (~tomzawadz@192.55.54.42)
[22:43:18] *** Quits: tomzawadzki (~tomzawadz@192.55.54.42) (Ping timeout: 256 seconds)
[23:54:26] *** Joins: tomzawadzki (~tomzawadz@192.55.54.44)