[00:54:03] *** Quits: ziyeyang_ (ziyeyang@nat/intel/x-kwwlwanytfhxsudp) (Quit: Leaving)
[01:32:23] *** Quits: Guest28167 (~James@c-67-180-75-111.hsd1.ca.comcast.net) (Remote host closed the connection)
[01:51:14] *** Joins: tomzawadzki (~tomzawadz@192.55.54.44)
[02:28:05] <darsto_> drv VItta: thanks, I produced a fix: https://review.gerrithub.io/c/393749/ (setup.sh: don't bind controllers with only *some* devices mounted)
[02:38:34] *** Quits: Shuhei (caf6fc61@gateway/web/freenode/ip.202.246.252.97) (Quit: Page closed)
[04:55:12] *** Quits: gila (~gila@5ED4D9C8.cm-7-5d.dynamic.ziggo.nl) (Ping timeout: 248 seconds)
[06:09:58] *** Joins: gila (~gila@5ED4D9C8.cm-7-5d.dynamic.ziggo.nl)
[08:09:05] *** Quits: gila (~gila@5ED4D9C8.cm-7-5d.dynamic.ziggo.nl) (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
[08:13:30] *** Quits: lhodev (~Adium@66-90-218-190.dyn.grandenetworks.net) (Quit: Leaving.)
[08:14:17] *** Joins: lhodev (~Adium@inet-hqmc07-o.oracle.com)
[08:14:55] *** Parts: lhodev (~Adium@inet-hqmc07-o.oracle.com) ()
[08:20:03] *** Joins: gila (~gila@5ED4D9C8.cm-7-5d.dynamic.ziggo.nl)
[08:53:23] *** Joins: mayank (2d70c844@gateway/web/freenode/ip.45.112.200.68)
[08:58:36] <mayank> Hi there, I have a question regarding NVME BDEV performace test with FIO. With iodepth=1, thread=1, fio results are good but if i increased iodepth to 32, then sequential read results are very poor compared to sequential writes. Any idea?
[09:00:25] <pwodkowx> pls share your setup, config etc
[09:01:28] <jimharris> include what kind of nvme ssd you are using as well as IO size (4KB?  512B?)
[09:12:05] <mayank> io size is 4KB, NVME is Intel® SSD DC P3700 Series, 2TB
[09:12:11] <mayank> fio config file
[09:12:12] <mayank> [global] ioengine=./examples/bdev/fio_plugin/fio_plugin spdk_conf=./examples/bdev/fio_plugin/bdev.conf.in thread=1 group_reporting=1 direct=1 verify=0 time_based=1 ramp_time=0 runtime=30 iodepth=32 rw=read bs=4k  [test] numjobs=1 filename=Nvme0n1
[09:12:25] <mayank> bdev config file
[09:12:33] <mayank> [Nvme]     TransportID "trtype:PCIe traddr:0000:18:00.0" Nvme0   RetryCount 4   Timeout 0   ActionOnTimeout None   AdminPollRate 100000
[09:17:23] <pwodkowx> and the results?
[09:19:30] <mayank>  Sequential read.... iops: 191K, throughput :781 MB/s, latency: 167
[09:20:31] <mayank>  Sequential write.... iops: 450K, throughput: 1845MB/s, latency: 70 usec
[09:21:13] <mayank> i performed same test more than 3 times, but results came same!
[09:24:16] <pwodkowx> did you precondition NVMe before tests?
[09:24:55] <mayank> yes, i ran write with perf for around 30 minutes
[09:26:39] <mayank> also executed benchmark script from spdk
[09:26:58] <mayank> scripts/prep_benchmarks.sh
[09:30:16] *** Quits: gila (~gila@5ED4D9C8.cm-7-5d.dynamic.ziggo.nl) (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
[09:32:27] *** Quits: sethhowe (~sethhowe@192.55.54.42) (Remote host closed the connection)
[09:32:55] <pwodkowx> hmmm... just checked vhost perf results, weird... seem that median latency for seq read increased almost twice 2017-10-25
[09:33:39] *** Quits: mayank (2d70c844@gateway/web/freenode/ip.45.112.200.68) (Ping timeout: 260 seconds)
[09:43:57] *** Quits: tomzawadzki (~tomzawadz@192.55.54.44) (Ping timeout: 240 seconds)
[09:45:30] *** Joins: sethhowe (~sethhowe@192.55.54.42)
[11:12:42] <jimharris> i've found another 10% bdevperf bump by avoiding the function call overhead of spdk_io_channel_get_ctx()
[11:12:50] <jimharris> there are two ways to do it
[11:13:06] <jimharris> 1) include the struct spdk_io_channel structure definition in the header file, and then make spdk_io_channel_get_ctx() inline
[11:14:04] <jimharris> 2) do a bunch of work in io_channel.c so that the struct spdk_io_channel is basically the context - change the current spdk_io_channel to spdk_io_channel_impl and do pointer arithmetic when needed
[11:14:23] <jimharris> the pointer arithmetic then is only needed in the relatively slow paths
[11:15:11] <jimharris> i'd prefer #1 even if it means we expose the contents of the structure through the header file
[11:15:27] <jimharris> thoughts?
[11:15:30] * jimharris looks at drv
[11:16:02] <bwalker> 3) compile with lto?
[11:16:17] <drv> just looking at the definition now
[11:16:29] <jimharris> i can try lto
[11:16:35] <bwalker> I hope it still works
[11:16:36] <drv> I'm OK with putting struct spdk_io_channel in the header
[11:16:53] <jimharris> i've done #2 - it's not terrible (60 lines)
[11:16:55] <drv> the only minor annoyance is that it adds another queue.h dependency in a public API header, which I was trying to clean up
[11:17:10] <drv> but this is definitely a key performance-path function, so it is probably worth it
[11:17:33] <drv> (also we should mark the contents of the struct as internal with Doxygen markup)
[11:17:55] <jimharris> ok - i'm trying lto now just for reference
[11:19:20] <jimharris> lto builds don't seem to work currently
[11:19:24] <drv> one other thing I was looking at a while ago is that we force -fno-omit-frame-pointer
[11:19:41] <drv> turning that off would reduce the function call overhead for these tiny leaf functions
[11:19:59] <drv> (and I think we should still be able to get good backtraces due to the way debug symbols work on x86-64)
[11:20:24] <drv> might be worth a try to just recompile without that option and see where that puts the overhead
[11:20:45] <drv> I think LTO is busted due to the UT wrap stuff - we should look into that
[11:21:01] <jimharris> it's actually complaining about aliasing with spdk_io_channel too
[11:21:29] <drv> hmm, yeah, that is also in the unit test code, I think
[11:21:36] <drv> because we define our own fake spdk_io_channel
[11:21:46] <drv> maybe we can filter out the LTO flags for UT builds - we don't really want it there anyway
[11:22:02] <drv> (would only need to be filtered for the final link command, I think)
[11:22:22] <drv> as long as we're generating fat LTO objects
[11:23:02] *** Joins: VItta (b7521395@gateway/web/freenode/ip.183.82.19.149)
[11:24:40] <jimharris> current: 18319400
[11:24:55] <jimharris> current + omit-frame-pointer: 20521512 (+12.0%)
[11:25:13] <jimharris> current + inline spdk_io_channel_get_ctx: 20472859 (+11.7%)
[11:25:26] <jimharris> current + omit + inline: 22531470 (+22.9%)
[11:25:43] <bwalker> what benchmark is this?
[11:26:06] <jimharris> bdevperf -c null.conf -q 16 -s 4096 -w read -t 30
[11:26:10] <jimharris> with 4 null bdevs
[11:26:34] <jimharris> using read instead of randread to avoid the divq operation when mod'ing the rand_r
[11:26:58] <bwalker> so we won't see 22.9% improvement in reality, but it is probably noticeable
[11:27:05] <jimharris> right
[11:27:14] <drv> ok, so omit-frame-pointer is probably a win regardless of what we do
[11:27:15] <bwalker> seems like it is worth doing then
[11:27:27] <drv> and putting the function inline is also probably the right thing to do
[11:27:30] <bwalker> we could fix lto and see if that is the same as inline
[11:27:52] <drv> I think it's still safe to assume that most people won't be using LTO, but we should fix it anyway
[11:28:15] <bwalker> I think the struct is small enough that it isn't a problem to expose the definition
[11:31:47] <jimharris> drv: you're not concerned at all about ability to debug if we omit frame pointers?
[11:32:04] <drv> I'm pretty sure the unwind info in x86-64 is guaranteed to be accurate
[11:32:22] <drv> we could try it out (intentionall abort() or something and verify that we can get a good backtrace)
[11:38:49] *** Joins: James (~James@208.185.211.6)
[11:39:13] *** James is now known as Guest33987
[11:57:44] *** Quits: VItta (b7521395@gateway/web/freenode/ip.183.82.19.149) (Ping timeout: 260 seconds)
[12:30:04] <drv> seems that linking objects generated with cc -flto doesn't work if you don't also specify -flto in the link command
[12:30:17] <drv> so that kills the idea of disabling LTO just for unit test linking
[13:31:15] *** Joins: fgwu (83d4fafe@gateway/web/freenode/ip.131.212.250.254)
[15:18:22] <jimharris> drv: could we get this patch in that removes nbd app?  https://review.gerrithub.io/#/c/391371/
[15:18:49] <jimharris> xiaodong keeps rebasing it because of the changelog entry - want to get that in before your changelog patch makes him need to do it again :)
[15:19:41] <drv> yeah, the removal of the nbd app looks fine, I just need to recheck the previous patch
[15:20:45] <jimharris> previous patch?
[15:21:41] <drv> https://review.gerrithub.io/#/c/391732/
[15:21:52] <drv> the nbd removal patch is based on this one
[15:23:03] <jimharris> oh - when i look at the nbd removal patch, it says 391732 is dependent on it
[15:23:16] <jimharris> one will need to be rebased either way
[15:23:30] <drv> hmm, maybe it got rebased the other way around
[15:34:03] <drv> jimharris: did you see https://review.gerrithub.io/#/c/393770/ (blobstore: fix serializing flags)? looks pretty important
[15:34:15] <jimharris> yes
[15:34:16] <drv> (it has a -1 to add some more tests)
[15:34:59] <jimharris> they found it during thin provisioning testing
[15:35:20] <jimharris> i suggested they break it out into a separate patch
[15:39:41] *** Joins: gila (~gila@5ED4D9C8.cm-7-5d.dynamic.ziggo.nl)
[15:49:26] <bwalker> jimharris: I threw together a quick patch to use an spdk_mem_map to do gpa to vva translation in vhost. The translation part is easy. However, it looks like several of the message definitions contain an array of file descriptors that is of the maximum size
[15:49:32] <bwalker> and these messages are placed on the stack
[15:49:39] <bwalker> today that max size is 8, so no big deal
[15:49:42] <bwalker> but I need to make it like 32k
[15:50:23] <bwalker> like this structure:
[15:50:25] <bwalker> struct vhost_user_msg {
[15:50:25] <bwalker> 	enum vhost_user_request request;
[15:50:25] <bwalker> #define VHOST_USER_VERSION_MASK     0x3
[15:50:25] <bwalker> #define VHOST_USER_REPLY_MASK       (0x1 << 2)
[15:50:25] <bwalker> 	uint32_t flags;
[15:50:26] <bwalker> 	uint32_t size; /* the following payload size */
[15:50:28] <bwalker> 	union {
[15:50:30] <bwalker> #define VHOST_USER_VRING_IDX_MASK   0xff
[15:50:32] <bwalker> #define VHOST_USER_VRING_NOFD_MASK  (0x1 << 8)
[15:50:34] <bwalker> 		uint64_t u64;
[15:50:38] <bwalker> 		struct vhost_vring_state state;
[15:50:40] <bwalker> 		struct vhost_vring_addr addr;
[15:50:42] <bwalker> 		struct vhost_memory_padded memory;
[15:50:44] <bwalker> 	} payload;
[15:50:46] <bwalker> 	int fds[VHOST_MEMORY_MAX_NREGIONS];
[15:50:48] <bwalker> } __attribute((packed));
[15:50:50] <bwalker> that's terrible
[15:51:12] <jimharris> yeah - you'll need to allocate it off the heap
[15:51:14] <bwalker> look at lib/vhost/rte_vhost/vhost_user.h - struct VhostUserMsg
[15:51:29] <bwalker> but what if the number of fds is less than the maximum? I presume that's supported
[15:51:37] <bwalker> so I should probably calculate the actual size it needs
[15:51:44] <bwalker> instead of just putting a maximum size struct on the stack
[15:52:08] <jimharris> you don't want to allocate the extra memory?
[15:52:23] <bwalker> in some cases it's a lot of memory
[15:52:48] <bwalker> like one of the messages contains an array of 4 uint64_t's per file descriptor
[15:53:01] <drv> is VHOST_MEMORY_MAX_NREGIONS part of the official vhost protocol or just a limitation of the DPDK implementation?
[15:53:03] <jimharris> oh - you were saying 32k memory regions
[15:53:03] <bwalker> and if you have 64 GB of hugepages, that's 32k entries in the array
[15:53:20] <drv> like, will it crash if we pass a larger count to an older target that doesn't dynamically allocate it?
[15:53:22] <bwalker> drv: I think, now that I look at it, it's just a DPDK thing
[15:53:34] <bwalker> there is a size in the message header that says how big it is
[15:54:02] <drv> is it checked before we access the later elements, or is that too much to hope for? :)
[15:54:16] <bwalker> I think I can actually lift this silly 8 limit independently of making it actually efficient to have more than 8 elements
[15:54:34] <bwalker> I'll give that a shot
[15:55:51] <jimharris> can't get to gerrithub
[15:56:03] <jimharris> oh - there it loads now
[16:14:24] *** Quits: fgwu (83d4fafe@gateway/web/freenode/ip.131.212.250.254) (Ping timeout: 260 seconds)
[16:23:44] <bwalker> ugh - read_fd_message in vhost needs to be redone
[16:24:01] <bwalker> it's making all kinds of implementation assumptions that the protocol doesn't guarantee to be true
[16:24:20] <bwalker> it will break if the client doesn't send an array of exactly 8 file descriptors
[16:24:34] <bwalker> even though the protocol tells you how many file descriptors are coming in the message header
[16:28:22] <bwalker> I see why they coded it this way - to do it right is a huge pain
[16:28:26] <bwalker> typedef struct VhostUserMemory {
[16:28:26] <bwalker> 	uint32_t nregions;
[16:28:26] <bwalker> 	uint32_t padding;
[16:28:26] <bwalker> 	VhostUserMemoryRegion regions[]; /* An array of 'nregions' elements */
[16:28:26] <bwalker> } VhostUserMemory;
[16:28:27] <bwalker> typedef struct VhostUserMsg {
[16:28:29] <bwalker> 	VhostUserRequest request;
[16:28:31] <bwalker> 	uint32_t flags;
[16:28:33] <bwalker> 	uint32_t size; /* the following payload size */
[16:28:35] <bwalker> 	union {
[16:28:39] <bwalker> 		... other stuff ...
[16:28:41] <bwalker> 		VhostUserMemory memory;
[16:28:43] <bwalker> 	} payload;
[16:28:45] <bwalker> 	int fds[]; /* An array of file descriptors */
[16:28:47] <bwalker> } __attribute((packed)) VhostUserMsg;
[16:28:49] <bwalker> this VhostUserMsg is the base message structure getting passed
[16:28:53] <bwalker> but it contains two arrays, potentially, of indeterminate length
[16:29:11] <bwalker> that VhostUserMemory thing is variable length, and so is the fds thing after it
[16:30:59] <bwalker> oh actually no - the way it comes over the wire doesn't match the struct definition
[16:31:27] <bwalker> it comes over the wire as <Header><Optional Array of FDs><Payload>
[16:32:12] <bwalker> yeah I can fix this up
[16:44:13] *** Quits: gila (~gila@5ED4D9C8.cm-7-5d.dynamic.ziggo.nl) (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
[16:55:43] <peluse> bwalker, not a big hurry but when you get a chance pls take a look at https://review.gerrithub.io/#/c/386186/.  The edits I made to the 'match' util in PMDK were just approved so once this lands I'll propose a 'match patch' to SPDK and use blobstore as the example on how it works (and will tweak test.bs as well as it could be a little more well rounded that what's being proposed here but I think OK to land as is)
[17:30:37] *** Quits: Guest33987 (~James@208.185.211.6) (Remote host closed the connection)
[19:22:09] *** Joins: sage____ (~quassel@2607:f298:5:101d:f816:3eff:fe21:1966)
[19:26:48] *** Quits: sage__ (~quassel@2607:f298:5:101d:f816:3eff:fe21:1966) (*.net *.split)
[19:54:36] *** Joins: James (~James@c-67-180-75-111.hsd1.ca.comcast.net)
[19:55:01] *** James is now known as Guest84211
[21:44:35] *** Quits: Guest84211 (~James@c-67-180-75-111.hsd1.ca.comcast.net) (Remote host closed the connection)
[21:50:01] *** Joins: James (~James@c-67-180-75-111.hsd1.ca.comcast.net)
[21:50:25] *** James is now known as Guest34719
[22:40:12] *** Quits: Guest34719 (~James@c-67-180-75-111.hsd1.ca.comcast.net) (Remote host closed the connection)