[00:08:11] *** guerby_ is now known as guerby
[00:08:13] *** Quits: guerby (~guerby@ip165.tetaneutral.net) (Changing host)
[00:08:13] *** Joins: guerby (~guerby@april/board/guerby)
[00:25:38] *** Joins: tomzawadzki (~tomzawadz@134.134.139.83)
[00:25:46] *** Quits: tomzawadzki (~tomzawadz@134.134.139.83) (Client Quit)
[00:25:58] *** Joins: tomzawadzki (~tomzawadz@134.134.139.83)
[01:21:30] *** Quits: Shuhei (caf6fc61@gateway/web/freenode/ip.202.246.252.97) (Ping timeout: 260 seconds)
[01:26:09] *** Joins: Shuhei (caf6fc61@gateway/web/freenode/ip.202.246.252.97)
[02:08:44] *** Quits: Shuhei (caf6fc61@gateway/web/freenode/ip.202.246.252.97) (Ping timeout: 260 seconds)
[02:20:53] The SPDK automated test system has failed and I'm not sure about the reason: https://review.gerrithub.io/#/c/403211/
[02:31:06] *** Joins: tkulasek (~tkulasek@192.55.54.45)
[02:44:35] *** Joins: tkulasek_ (~tkulasek@134.134.139.76)
[02:44:35] *** Quits: tkulasek (~tkulasek@192.55.54.45) (Remote host closed the connection)
[02:54:21] changpe1: About https://review.gerrithub.io/#/c/402665/, can we fix this issue in QEMU?
[03:07:52] *** Quits: tkulasek_ (~tkulasek@134.134.139.76) (Quit: Leaving)
[03:44:52] *** Joins: tkulasek (~tkulasek@192.55.54.45)
[07:39:10] *** Quits: tkulasek (~tkulasek@192.55.54.45) (Remote host closed the connection)
[07:39:17] *** Joins: tkulasek (~tkulasek@192.55.54.45)
[07:45:57] FYI, crypto virtual bdev discussion in 15 min at https://intel.webex.com/intel/j.php?MTID=m8209e3761c23b5f54e737a3c537ab550 (see email for details). Anyone interested is welcome to join.
[08:02:36] peluse: is the WebEx site loading for you? seems to be stuck loading forever for me
[08:02:55] never mind, just loaded
[08:13:41] *** lhodev1 is now known as lhodev
[08:42:10] drv: the SPDK automated test system has failed and I'm not sure about the reason: https://review.gerrithub.io/#/c/403211/ - what should be done?
[08:42:58] param: looks like an intermittent test failure - I will re-run it
[08:43:13] sure.. thanks.. :)
[09:03:44] thanks to everyone who was on the call!
[09:29:54] *** Quits: tomzawadzki (~tomzawadz@134.134.139.83) (Ping timeout: 260 seconds)
[10:09:54] *** Parts: lhodev (~Adium@66-90-218-190.dyn.grandenetworks.net) ()
[10:10:03] *** Joins: lhodev (~Adium@66-90-218-190.dyn.grandenetworks.net)
[10:14:57] *** Quits: tkulasek (~tkulasek@192.55.54.45) (Ping timeout: 256 seconds)
[10:33:48] jimharris: ubuntu17.10 has replaced ubuntu17.04 in the test pool. I have installed libiscsi-dev on that machine.
[10:36:33] sethhowe: thanks! i'll resubmit my iscsi initiator patch
[10:43:43] Good morning folks. Once my outstanding patch, https://review.gerrithub.io/#/c/403357/, is approved and merged, I will have completed my pass through the lib code replacing exit() calls with returns and coding the corresponding error paths. My next effort is potentially a little more thorny: the replacement of abort() calls with returns. In particular, I can imagine scenarios where original/early developers intentionally wanted to trigger aborts, perhaps for newish, complex code, to see how things shake out, which could yield valuable core dumps for analysis. I'm reluctant to embark willy-nilly on replacing the abort() calls in the lib code without surfacing this concern first. So, is there some set of criteria I should employ, like a review of existing bugs (https://github.com/spdk/spdk/issues), checking the last-change date of the instrumented abort() calls, wholesale avoidance of particular libs, etc.? Or should I just jump in and start this replacement effort? Or perhaps hold off and discuss this, say, at the next community call?
[10:46:54] hi Lance
[10:47:16] it looks like there are only 12 abort() calls in lib/
[10:47:41] some of these are supposed to be unreachable code, and if they're hit, there was some kind of programming error elsewhere inside the SPDK libs
[10:47:49] I think those can stay for now (but should probably be documented with a comment)
[10:48:11] the abort()s in case of memory allocation failure are the ones we should look at first, I think
[10:48:24] although those may be challenging, since it will mean finding all of the callers and making sure they do something sensible
[10:48:30] so it might be good to tackle them one at a time
[10:49:16] e.g. spdk_bdev_get_io() contains an abort, but it looks like all of the call sites already handle a NULL return value
[10:50:52] Hi Daniel. Yup, that sounds about right. I was scrutinizing that first one in bdev (on the spdk_mempool_get() failure in spdk_bdev_get_io()). I've backtraced most of the callers and it appears they field a failed response, at least to some degree ;-)
[10:51:37] I think spdk_poller_register() will be much more complicated, since none of the callers are looking for a failure there
[10:51:55] And, then there's *that* one…
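A minimal sketch of the pattern discussed above for the allocation-failure abort()s (the struct and function names are illustrative stand-ins, not the actual spdk_bdev_get_io()/spdk_mempool_get() code): the getter returns NULL instead of aborting, and the call site turns that into an error the upper layer can handle or retry.

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct io_request {
        void   *payload;
        size_t  num_blocks;
    };

    /*
     * Hypothetical stand-in for a pool-backed getter like spdk_bdev_get_io():
     * on exhaustion it returns NULL rather than calling abort(), pushing the
     * failure-handling decision out to the caller.
     */
    static struct io_request *get_io_request(void)
    {
        return calloc(1, sizeof(struct io_request)); /* may return NULL */
    }

    /* The call site fields the NULL instead of crashing the whole target. */
    static int submit_read(size_t num_blocks)
    {
        struct io_request *req = get_io_request();

        if (req == NULL) {
            return -ENOMEM; /* upper layer can queue the I/O and retry later */
        }
        req->num_blocks = num_blocks;
        /* ... hand the request to the backing device here ... */
        free(req);
        return 0;
    }

    int main(void)
    {
        int rc = submit_read(8);

        if (rc != 0) {
            fprintf(stderr, "submit_read failed: %d\n", rc);
            return 1;
        }
        return 0;
    }

The spdk_poller_register() case is harder for exactly the reason noted above: its callers currently ignore failures, so each call site would need the equivalent of the NULL/-ENOMEM check shown here.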
[10:53:51] I'm also aware that there have been a couple of intermittent failures that have appeared during TP runs resulting in core dumps (iscsi_tgt and nvmf_tgt). Don't know yet if those are all truly bad memory refs vs. abort() calls.
[10:56:16] jimharris: no problem! Let me know if anything else comes up.
[10:57:23] The ones I *did* see (i.e. core dumps during iscsi_tgt and nvmf_tgt) appeared to fail on a free() call and thus were not explicitly abort() calls. However, I only have anecdotal samples of that.
[10:59:56] jimharris: do you know if the test failures in the NVMe-oF fio tests are related to NVMF multicore? https://ci.spdk.io/spdk/builds/review/5e64c3846a83afffcfaa6c1c250231be9e228d81.1520835982/fedora-06/build.log
[11:12:13] drv: Daniel, if you have the time, would you also mind taking a look at https://review.gerrithub.io/#/c/403357/ ? Jim provided a couple of comments on my first submission which I believe I addressed in the 2nd set. It's pretty simple and straightforward. Thx.
[11:12:24] sure, I should be able to get to it this afternoon
[11:29:32] lhodev, man, you're rockin' it lately!!
[12:02:08] *** Quits: gila (~gila@5ED4D9C8.cm-7-5d.dynamic.ziggo.nl) (Read error: Connection reset by peer)
[12:06:42] *** Joins: gila (~gila@5ED4D9C8.cm-7-5d.dynamic.ziggo.nl)
[14:01:00] jimharris: please take a look at this setup.sh fix if you get a chance: https://review.gerrithub.io/#/c/403370/
[14:01:58] without this, I get a random set of devices rebound on my system with 5 NVMe devices
[14:48:59] *** Quits: changpe1 (~changpe1@134.134.139.82) (Ping timeout: 260 seconds)
[14:49:34] *** Quits: ziyeyang (~ziyeyang@134.134.139.82) (Ping timeout: 260 seconds)
[14:51:19] *** Quits: destrudo (~destrudo@tomba.sonic.net) (Ping timeout: 260 seconds)
[14:51:50] *** Joins: destrudo (~destrudo@tomba.sonic.net)
[14:52:47] *** Joins: changpe1 (changpe1@nat/intel/x-zotprjhfsxxfhrxi)
[14:53:22] *** Joins: ziyeyang (ziyeyang@nat/intel/x-wgikbhnpdaiphseu)
[15:41:55] *** Quits: leospdk (42718442@gateway/web/freenode/ip.66.113.132.66) (Ping timeout: 260 seconds)
[17:10:32] drv: I think we should revert https://review.gerrithub.io/#/c/401257/
[17:10:53] ok - what for?
[17:11:28] I had marked this -1 earlier - I wasn't convinced it was OK due to some RocksDB failures I saw during testing. I re-ran a few times and it passed, and after a rebase my -1 was removed, but I still wasn't 100% convinced
[17:11:41] hmm, ok
[17:12:18] if you push a review for the revert, I'll merge it
[17:12:22] it's all around queuing the sync requests - and I need to test that more with RocksDB, especially the shutdown path
[17:16:31] done
[18:08:53] *** Joins: ziyeyang_ (~ziyeyang@192.55.54.42)
[18:13:58] jimharris, so wrt the comments I made this morning about a few bdevio cases not passing, which I thought was because of a lack of encrypting write_zero cmds: it's actually the case that when I ran this morning I was on a malloc disk, and all weekend I was on an NVMe disk
[18:14:39] what did you say the malloc disk was doing? I started stepping through it, but I figured one sentence from someone who knows what it's supposed to do will save me a lot of time :)
[18:15:13] look in _bdev_malloc_submit_request() in lib/bdev/malloc/bdev_malloc.c
[18:15:26] PS: 100% of bdevio IO w/ asserts on data comparison pass with the NVMe back end
[18:15:31] for READ, we don't allocate a buffer - we point iov_base directly at the backing memory region
[18:15:40] OK, thanks
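A stripped-down illustration of the zero-copy READ behavior jimharris describes (the struct and function here are simplified stand-ins, not the actual _bdev_malloc_submit_request() code): on READ the malloc bdev doesn't copy into a separate data buffer, it points iov_base straight at the backing memory region.

    #include <stdio.h>
    #include <string.h>
    #include <sys/uio.h>

    /* Simplified stand-in for a malloc-backed block device. */
    struct malloc_disk {
        char   buf[8 * 512];   /* backing memory region */
        size_t block_size;
    };

    /*
     * For READ, no data buffer is allocated and nothing is copied: iov_base is
     * pointed directly at the backing memory, so the "read" is zero-copy.
     */
    static void malloc_disk_read(struct malloc_disk *disk, struct iovec *iov,
                                 size_t offset_blocks, size_t num_blocks)
    {
        iov->iov_base = disk->buf + offset_blocks * disk->block_size;
        iov->iov_len = num_blocks * disk->block_size;
    }

    int main(void)
    {
        struct malloc_disk disk = { .block_size = 512 };
        struct iovec iov;

        memset(disk.buf, 0xA5, sizeof(disk.buf));
        malloc_disk_read(&disk, &iov, 2, 1);
        printf("read %zu bytes at %p, first byte 0x%02x\n",
               iov.iov_len, iov.iov_base, ((unsigned char *)iov.iov_base)[0]);
        return 0;
    }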
[19:09:11] I submitted a simple patch replacing an abort() with a returned error in nbd.c, but the TP indicated a failure; looking at https://ci.spdk.io/spdk/builds/review/47b370c6c8d44f300923d3606f9bc732f89083aa.1520894593/fedora-05/build.log, it doesn't appear nbd-related. Has this been seen elsewhere? I sifted through the bug list looking in the titles for anything bdev- and destroy_lvol_store-related, but didn't see anything.
[19:16:26] lhodev, that doesn't look familiar to me but does look real. This might be one of the TP machines that recently had ASAN enabled; it was off for a while because of a different real issue that jimharris fixed about a month ago...
[19:17:13] FYI, if you haven't seen it before, you'll only get errors like this - "ERROR: AddressSanitizer: heap-use-after-free on address 0x612000000380 at pc 0x0000005867b2 bp 0x7fffdf0dbd90 sp 0x7fffdf0dbd80" - if you build with the address sanitizer enabled
[19:17:56] either by CONFIG_ASAN=y on the make cmd line or ./configure --enable-asan before you make
[19:17:58] peluse: Do you share my conclusion that the point of failure in the log is *not* related to nbd?
[19:18:09] yeah, highly unlikely I would say
[19:18:30] but ya never know :)
[19:19:24] Can you give the change a nudge to re-run on the TP?
[19:20:06] could be something generic in bdev that is being 'exposed' because of a specific bdev module...
[19:20:46] I'd be curious to see if it fails on a re-run, and moreover, in exactly the same way.
[19:20:54] I can't, but a maintainer can, or you can do a git commit --amend and change something in your commit message; that will run it again
[19:21:44] yeah, unless someone *just* enabled ASAN before your patch, it's likely either intermittent or somehow exposed by your change (even if not directly related)
[19:21:58] running again seems very reasonable though
[19:22:14] I gotta bolt... enjoy :)
[19:22:23] From my inspection of the log, it doesn't look (to me) like the thing that was running is even linked to the nbd lib.
[19:22:40] Thx Paul for the tip about git commit --amend
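For anyone who hasn't built with ASAN before, a tiny self-contained example (unrelated to the actual nbd failure, just an illustration of the bug class): compile with -fsanitize=address, which is what CONFIG_ASAN=y / ./configure --enable-asan turn on for SPDK builds, and the read below is reported at runtime as a heap-use-after-free like the error quoted above.

    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        char *buf = malloc(32);

        if (buf == NULL) {
            return 1;
        }
        strcpy(buf, "spdk");
        free(buf);
        /* Read after free: under ASAN this aborts with a heap-use-after-free report. */
        return buf[0] == 's';
    }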
[19:41:39] drv: when can this be merged? https://review.gerrithub.io/#/c/403211/
[22:08:03] hi param - 403211 has just been merged - thanks!
[22:08:23] we still need changes in scripts/rpc.py to expose your new RPC - could you put together a patch for that? it would also be great to modify one of the existing test scripts to make use of it
[22:54:30] *** Quits: ziyeyang_ (~ziyeyang@192.55.54.42) (Remote host closed the connection)
[22:55:22] *** Joins: ziyeyang_ (~ziyeyang@192.55.54.42)