[00:08:11] *** guerby_ is now known as guerby
[00:08:13] *** Quits: guerby (~guerby@ip165.tetaneutral.net) (Changing host)
[00:08:13] *** Joins: guerby (~guerby@april/board/guerby)
[00:25:38] *** Joins: tomzawadzki (~tomzawadz@134.134.139.83)
[00:25:46] *** Quits: tomzawadzki (~tomzawadz@134.134.139.83) (Client Quit)
[00:25:58] *** Joins: tomzawadzki (~tomzawadz@134.134.139.83)
[01:21:30] *** Quits: Shuhei (caf6fc61@gateway/web/freenode/ip.202.246.252.97) (Ping timeout: 260 seconds)
[01:26:09] *** Joins: Shuhei (caf6fc61@gateway/web/freenode/ip.202.246.252.97)
[02:08:44] *** Quits: Shuhei (caf6fc61@gateway/web/freenode/ip.202.246.252.97) (Ping timeout: 260 seconds)
[02:20:53] <param> spdk automated system has failed... not sure about the reason.. https://review.gerrithub.io/#/c/403211/
[02:31:06] *** Joins: tkulasek (~tkulasek@192.55.54.45)
[02:44:35] *** Joins: tkulasek_ (~tkulasek@134.134.139.76)
[02:44:35] *** Quits: tkulasek (~tkulasek@192.55.54.45) (Remote host closed the connection)
[02:54:21] <pwodkowx> changpe1: About https://review.gerrithub.io/#/c/402665/, can we fix this issue in QEMU
[02:54:22] <pwodkowx> ?
[03:07:52] *** Quits: tkulasek_ (~tkulasek@134.134.139.76) (Quit: Leaving)
[03:44:52] *** Joins: tkulasek (~tkulasek@192.55.54.45)
[07:39:10] *** Quits: tkulasek (~tkulasek@192.55.54.45) (Remote host closed the connection)
[07:39:17] *** Joins: tkulasek (~tkulasek@192.55.54.45)
[07:45:57] <peluse> FYI crypto virtual bdev discussion on 15 min at https://intel.webex.com/intel/j.php?MTID=m8209e3761c23b5f54e737a3c537ab550 (see email for details). Anyone interested is welcome to join
[08:02:36] <drv> peluse: is the WebEx site loading for you? seems to be stuck loading forever for me
[08:02:55] <drv> never mind, just loaded
[08:13:41] *** lhodev1 is now known as lhodev
[08:42:10] <param> drv : spdk automated system has failed... not sure about the reason.. https://review.gerrithub.io/#/c/403211/ what should be done?
[08:42:58] <drv> param: looks like an intermittent test failure - I will re-run it
[08:43:13] <param> sure.. thanks.. :)
[09:03:44] <peluse> thanks to everyone who was on the call!
[09:29:54] *** Quits: tomzawadzki (~tomzawadz@134.134.139.83) (Ping timeout: 260 seconds)
[10:09:54] *** Parts: lhodev (~Adium@66-90-218-190.dyn.grandenetworks.net) ()
[10:10:03] *** Joins: lhodev (~Adium@66-90-218-190.dyn.grandenetworks.net)
[10:14:57] *** Quits: tkulasek (~tkulasek@192.55.54.45) (Ping timeout: 256 seconds)
[10:33:48] <sethhowe> jimharris: ubuntu17.10 has replaced ubuntu17.04 in the test pool. I have installed libiscsi-dev on that machine.
[10:36:33] <jimharris> sethhowe: thanks!  i'll resubmit my iscsi initiator patch
[10:43:43] <lhodev> Good morning folks.  Once my outstanding patch, https://review.gerrithub.io/#/c/403357/, is approved and merged, that will have completed my pass through the lib code replacing exit()'s with return's and coding the error paths respectively.   My next effort is, potentially, a little more thorny; namely, the replacement of abort()'s with return's.  In particular, I can imagine scenario's where original/early developers intentionally wanted
[10:43:43] <lhodev> to trigger aborts, perhaps for new'ish, complex code to see how things shake out which could yield valuable core dumps for analysis.   I'm reluctant to embark willy-nilly replacing the abort()'s in the lib code until surfacing this concern.   So, is there some set of criteria I should employ like a review of existing bugs https://github.com/spdk/spdk/issues, checking last-change date of instrumented abort()'s, wholesale avoiding of particu
[10:43:43] <lhodev> lar libs, etc.?  Or, should I just jump in and start this replacement effort?  Or, perhaps, hold off to discussing this say at the next community call?
[10:46:54] <drv> hi Lance
[10:47:16] <drv> it looks like there's only 12 abort() calls in lib/
[10:47:41] <drv> some of these are supposed to be unreachable code, and if they're hit, there was some kind of programming error elsewhere inside the SPDK libs
[10:47:49] <drv> I think those can stay for now (but should probably be documented with a comment)
[10:48:11] <drv> the abort()s in case of memory allocation failure are the ones we should look at first, I think
[10:48:24] <drv> although those may be challenging, since it will mean finding all of the callers and making sure they do something sensible
[10:48:30] <drv> so it might be good to tackle them one at a time
[10:49:16] <drv> e.g. spdk_bdev_get_io() contains an abort, but it looks like all of the call sites already handle a NULL return value
[10:50:52] <lhodev> Hi Daniel.  Yup, that sounds about right.   I was scrutinizing that first one in bdev (on the spdk_mempool_get() failure in spdk_bdev_get_io()).   I've backtraced most of the callers and it appears they field a failed response, at least to some degree ;-)
[10:51:37] <drv> I think spdk_poller_register() will be much more complicated, since none of the callers are looking for a failure there
[10:51:55] <lhodev> And, then there's *that* one…
[10:53:51] <lhodev> I'm also aware that there have been a couple of intermittent failures that have appeared during TP runs resulting in core dumps (iscsi_tgt and nvmf_tgt).   Don't know yet if those are all truly bad memory ref's vs abort() calls.
[10:56:16] <sethhowe> jimharris: no problem! Let me know if anything else comes up.
[10:57:23] <lhodev> The ones I *did* see — i.e. core dumps during iscsi_tgt and nvmf_tgt — appeared to fail on a free() call and thus were not explicitly abort() calls.   However, I only have anecdotal samples of that.
[10:59:56] <sethhowe> jimharris: do you know if the test failures in the NVMe-oF fio tests are related to NVMF multicore? https://ci.spdk.io/spdk/builds/review/5e64c3846a83afffcfaa6c1c250231be9e228d81.1520835982/fedora-06/build.log
[11:12:13] <lhodev> drv:  Daniel, if you have the time, would you also mind please taking a look at https://review.gerrithub.io/#/c/403357/ ?  Jim provided a couple of comments on my first submission which I believe I addressed in the 2nd set.   It's pretty simple and straight-forward.   Thx.
[11:12:24] <drv> sure, I should be able to get to it this afternoon
[11:29:32] <peluse> lhodev, man, you're rockin' it lately!!
[12:02:08] *** Quits: gila (~gila@5ED4D9C8.cm-7-5d.dynamic.ziggo.nl) (Read error: Connection reset by peer)
[12:06:42] *** Joins: gila (~gila@5ED4D9C8.cm-7-5d.dynamic.ziggo.nl)
[14:01:00] <drv> jimharris: please take a look at this setup.sh fix if you get a chance: https://review.gerrithub.io/#/c/403370/
[14:01:58] <drv> without this, I get a random set of devices rebound on my system with 5 NVMe devices
[14:48:59] *** Quits: changpe1 (~changpe1@134.134.139.82) (Ping timeout: 260 seconds)
[14:49:34] *** Quits: ziyeyang (~ziyeyang@134.134.139.82) (Ping timeout: 260 seconds)
[14:51:19] *** Quits: destrudo (~destrudo@tomba.sonic.net) (Ping timeout: 260 seconds)
[14:51:50] *** Joins: destrudo (~destrudo@tomba.sonic.net)
[14:52:47] *** Joins: changpe1 (changpe1@nat/intel/x-zotprjhfsxxfhrxi)
[14:53:22] *** Joins: ziyeyang (ziyeyang@nat/intel/x-wgikbhnpdaiphseu)
[15:41:55] *** Quits: leospdk (42718442@gateway/web/freenode/ip.66.113.132.66) (Ping timeout: 260 seconds)
[17:10:32] <jimharris> drv: I think we should revert https://review.gerrithub.io/#/c/401257/
[17:10:53] <drv> ok - what for?
[17:11:28] <jimharris> I had marked this -1 earlier - I wasn't convinced it was OK due to some RocksDB failures I saw during testing - I re-ran a few times and it passed and after a rebase my -1 was removed but I still wasn't 100% convinced
[17:11:41] <drv> hmm, ok
[17:12:18] <drv> if you push a review for the revert, I'll merge it
[17:12:22] <jimharris> it's all around queuing the sync requests - and I need to test that more with RocksDB especially the shutdown path
[17:16:31] <jimharris> done
[18:08:53] *** Joins: ziyeyang_ (~ziyeyang@192.55.54.42)
[18:13:58] <peluse> jimharris, so wrt the comments I made this morn about a few bdevio cases not passing that I thought was because of a lack of encrypting write_zero cmds, it's actually the case that when I ran this morn I was on a malloc disk and all weekend I was on an NVMe disk
[18:14:39] <peluse> what did you say the malloc disk was doing, I started stepping through it but I figured one sentence from someone who knows what its suppsoed to do will save me a lot of time :)
[18:15:13] <jimharris> look in _bdev_malloc_submit_request() in lib/bdev/malloc/bdev_malloc.c
[18:15:26] <peluse> PS: 100% of bevio IO w/aserts on data comaprison pass with nvme back end
[18:15:31] <jimharris> for READ, we don't allocate a buffer - we point iov_base directly at the backing memory region
[18:15:40] <peluse> OK, thanks
[19:09:11] <lhodev> I submitted a simple patch replacing an abort() with a return error in nbd.c, but the TP indicated a failure, but looking in https://ci.spdk.io/spdk/builds/review/47b370c6c8d44f300923d3606f9bc732f89083aa.1520894593/fedora-05/build.log, it doesn't appear nbd-related.   Has this been seen elsewhere?   I sifted through the bug list looking in the titles for anything bdev and destroy_lvol_store related, but didn't see anything.
[19:16:26] <peluse> lhodev, that doesn't look familiar to me but does look real.  This might be one of the TP machines that recently had ASAN enabled, was off for a while because of a different real issue that jimharris fixed about a month ago...
[19:17:13] <peluse> FYI if you haven't seen it before you'll only get errors like this "ERROR: AddressSanitizer: heap-use-after-free on address 0x612000000380 at pc 0x0000005867b2 bp 0x7fffdf0dbd90 sp 0x7fffdf0dbd80" if you build with address sanatizer enabled
[19:17:56] <peluse> either by CONFIG_ASAN=y on the make cmd line or ./configure --enable-asan before you make
[19:17:58] <lhodev> peluse:  Do you share my conclusion that the point of failure in the log is *not* related to nbd ?
[19:18:09] <peluse> yeah, highly unlikely I would say
[19:18:30] <peluse> but ya never know :)
[19:19:24] <lhodev> Can you give the change a nudge to re-run on the TP?
[19:20:06] <peluse> could be something generic in bdev that is being 'exposed' because of a specific bdev module...
[19:20:46] <lhodev> I'd be curious to see if it fails on a re-run, and moreover, in exactly the same way.
[19:20:54] <peluse> I can't but a maintainer can or you can do a git commit --amend and change something in your commit message, that will run it again
[19:21:44] <peluse> yeah unless someone *just* enabled ASAN before your patch it's likely either intermittent or is somehow exposed by your change (even if not directly related)
[19:21:58] <peluse> running again seems very reasonable though
[19:22:14] <peluse> I gotta bolt... enjoy :)
[19:22:23] <lhodev> From my inspection of the log, it doesn't look (to me) like the thing that was running even is linked to the nbd lib.
[19:22:40] <lhodev> Thx Paul for the tip about the git commit —amend
[19:41:39] <param> drv : when can this be merged.. https://review.gerrithub.io/#/c/403211/ ?
[22:08:03] <jimharris> hi param - 403211 has just been merged - thanks!
[22:08:23] <jimharris> we still need changes in scripts/rpc.py to expose your new RPC - could you put together a patch for that?  it would also be great to modify one of the existing test scripts to make use of it
[22:54:30] *** Quits: ziyeyang_ (~ziyeyang@192.55.54.42) (Remote host closed the connection)
[22:55:22] *** Joins: ziyeyang_ (~ziyeyang@192.55.54.42)