[01:16:06] *** Quits: ziyeyang_ (~ziyeyang@134.134.139.82) (Quit: Leaving)
[02:28:39] *** Quits: tomzawadzki (tzawadzk@nat/intel/x-ctzjocrsxofcpjus) (Remote host closed the connection)
[02:29:03] *** Joins: tomzawadzki (~tzawadzk@192.55.54.36)
[02:44:14] *** Quits: tsuyoshi (b42b2067@gateway/web/freenode/ip.180.43.32.103) (Ping timeout: 260 seconds)
[05:11:24] *** Quits: tomzawadzki (~tzawadzk@192.55.54.36) (Quit: Leaving)
[05:27:51] *** Joins: johnmeneghini (~johnmeneg@pool-96-252-112-122.bstnma.fios.verizon.net)
[08:28:20] i've been running the vhost tests locally since yesterday afternoon - i can disable verify from the fio job file, reduce the queue depth to 4 and i still get periodic failures
[09:24:06] FYI Jun 19 is NATIONAL FREEBSD DAY :)
[09:24:16] so let's plan for some beer or something...
[09:26:02] is that a federal holiday? ;)
[09:28:39] in someone's federation I'm sure... star trek maybe :)
[09:38:06] maybe LCARS is actually running on FreeBSD
[09:44:10] hmm, the hotplug test failed on the patch that was supposed to fix the hotplug failures
[09:44:21] sweet!
[10:29:38] jimharris, see UT patch (no hurry).. the global tweak you asked for is sort of a PITA and the globals are going away with the next patch anyway. Let me know if what I pushed up there is acceptable in the short term...
[10:31:58] peluse: can you keep the global definitions in the .c file, and just declare them extern in the mock.h file?
[10:33:13] meaning don't change spdk_mock.c at all (keep the int definitions there) but just add them as extern declarations in spdk_mock.h
[10:34:13] i'm still thinking those CU_ASSERTs should be comparing != 0, instead of comparing with ut_fake_pthread_mutex_init/mutexattr_init
[10:34:53] and should have a comment explaining that on freebsd, shared mutexes are stubbed out which is why we expect it to return 0
[10:37:25] jimharris, sure I can move where they are declared.
I don't understand why you want != 0 though; the idea with that scheme of setting the expected value and comparing the expected value is just that, to make sure you get what you expect. And sure, I will add a few comments
[10:38:16] but you are specifying the return value of some function that nvme_robust_mutex_init_shared() calls
[10:38:40] we should not assume that nvme_robust_mutex_init_shared() will return this same value
[10:39:42] for the mock that's what we're doing for sure. Look at either of these, I set the global to -1 and the mock returns -1. If I set the global to 99 the mock will return 99
[10:41:11] having an API that says mock(function_name, expected_return_value); will make that much more obvious I think
[10:41:48] hmmm - i'll need to take a look at this further - it looks like if pthread_mutexattr_init() returns 99, that nvme_robust_mutex_init_shared() will return -1
[10:43:07] the CU_ASSERT is not testing what the mocked pthread function returns, it is supposed to test what the function-under-test (nvme_robust_mutex_init_shared) returns
[10:44:27] oh yeah, with the API implementation it would have been obvious that this implementation isn't exactly what I was going after :) You're right, I'm mocking functions used by the FUT... I'll change to != 0 for sure
[10:45:19] cool
[10:48:41] back to the globals, I don't think I can get away from having extern decls in nvme_ut.c until they're rolled into the i/f - regardless of whether they're in spdk_mock.c or spdk_mock.h they need to be declared extern in nvme_ut.c or I can't set them there
[10:56:51] *** Quits: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl) (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
[11:03:10] we were suggesting declaring them extern in spdk_mock.h, and including spdk_mock.h from nvme_ut.c
[11:03:38] or are you seeing we cannot include spdk_mock.h from nvme_ut.c yet?
[11:03:43] seeing => saying
[12:10:14] *** Joins: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl)
[12:18:57] isn't that what I did, let me go look
[12:19:55] those messages were a bit old - from before we chatted on the phone
[12:20:12] yeah, that's what's up there now. extern in spdk_mock.h, defined in spdk_mock.c and the .h included in nvme_ut.c
[12:20:13] oh, OK
[12:20:46] looks like it just failed CI but the logs aren't copied yet :(
[12:26:38] looks like my UT patch failed hotplug test....
[12:53:58] bwalker, drv: https://review.gerrithub.io/#/c/365496/
[13:08:08] the kick is up... and it's good for an extra point!
[13:09:17] drv: is 1 second enough for that delay?
[13:09:40] not sure
[13:09:42] i think eventually we want to make it a smart wait, but for now i'm thinking we should do 5 seconds just to be safe
[13:09:49] ok
[13:09:51] I'll change it
[13:23:47] bwalker: https://review.gerrithub.io/#/c/365496/
[13:28:17] do you want to merge these workarounds immediately?
[13:28:29] should let the team in Poland know what we're doing
[13:28:47] i think we have to get this vhost workaround in immediately
[13:29:06] the reset testing there isn't providing any value currently - at least we keep the vhost tests running in general
[13:29:16] i'll be sending them an e-mail before i leave today
[13:31:27] that plus the hotplug workaround should eliminate the vast majority of the failures
[13:31:32] I think that will make things a lot smoother
[13:36:19] I merged the hotplug delay now too
[14:00:55] moment of truth on these workarounds - I just pushed a 6-patch series
[14:01:02] that took 11 total runs to get through before
[14:04:01] if none of them are mine they should work fine anyways
[14:08:10] haha
[14:08:22] I guess we should take out the if (peluse) { fail_build(); }
[14:08:41] (he'll find it as he digs through autorun.sh)
[14:09:01] awe, you guys :) I figured somebody had done something like that out of love :)
[14:09:30] bwalker, it's running now so...
[14:10:01] some of the logic in there will skip tests if you don't have things installed
[14:10:09] so once you have it working, you'll have to go back and see what it skipped
[14:10:29] yeah, I assume I can mostly compare against any recent passing build.log right?
[14:10:36] and then a developer really should be able to run that from the root of the spdk repo too, if that doesn't work already
[14:10:42] it would be great to have our vm image somewhere where people could pull it
[14:10:47] well, it's complicated
[14:10:57] to speed the tests up, autorun.sh can take as an argument a config file
[14:11:01] which tells it which tests to run
[14:11:09] so all of our agents only individually run a tiny subset of the tests
[14:11:17] and the aggregate, hopefully, is a full run
[14:11:25] ...in the cloud
[14:11:37] if you don't pass a config file to autorun.sh, it should run the full set
[14:11:38] ...with bluetooth somehow
[14:12:09] its still running, doing all sorts of horseshit
[14:12:20] creating a VM image with NVMe + soft RoCE and everything installed so that a full autorun.sh test run can be done via Vagrant would be amazing
[14:12:42] pull image, vagrant up, autorun.sh
[14:12:48] drink beer
[14:14:40] so question before I start trying to figure out why it's hung right now :) Should my on-screen output match (or close enough to see problems) a build.log on a recent +1?
[14:15:04] yeah - if you find the agent that is configured to run the test you are hung on
[14:15:04] I ask because you mentioned the tests being split up between machines, so not sure what the build.log represents or how well it's stitched together if that's what's happening
[14:15:15] what test did it hang on?
[14:15:32] the start and end of tests are blocked out with lots of ******
[14:16:05] cool
[14:16:32] if you tell me which test I can tell you which build agent
[14:16:58] bwalker: to start, just the VM image we are using for the hotplug and vhost testing
[14:18:15] I'm not seeing a bunch of **** anywhere
[14:18:30] here's the last few lines, you'll probably recognize it
[14:18:33] Starting thread on core 5
[14:18:33] Attached to NVMe Controller at 0000:06:00.0 [8086:0953]
[14:18:33] Associating INTEL SSDPEDMD400G4 (CVFT7203005M400LGN ) with lcore 4
[14:18:33] Initialization complete. Launching workers.
[14:18:34] Starting thread on core 4
[14:18:35] EAL: PCI device 0000:06:00.0 on NUMA socket 0
[14:18:36] EAL: probe driver: 8086:953 spdk_nvme
[14:18:38] EAL: PCI device 0000:06:00.0 on NUMA socket 0
[14:18:40] EAL: probe driver: 8086:953 spdk_nvme
[14:18:43]
[14:19:03] right now I'm just reading the scripts to get some sort of bearing
[14:19:21] you don't see things like this? :
[14:19:22] ************************************
[14:19:22] START TEST test/lib/env/env.sh
[14:19:22] ************************************
[14:19:31] scrolling up....
[14:20:01] yeah OK
[14:20:03] ************************************
[14:20:03] START TEST test/lib/nvme/nvme.sh
[14:20:03] ************************************
[14:20:38] are you able to run like our identify example on the system in question?
[14:20:45] and have it find your NVMe devices
[14:20:59] it looks like it sees one on slot 6
[14:21:01] no worries right now, let me poke around through the sourced scripts and autotest.sh to at least get an idea of what's happening. Then I'll try that and let ya know
[14:21:26] but it looks like the test is doing that w/ success as I see output on ctrl caps/functions, etc.
[14:23:22] *** Joins: konv81 (cf8c2b51@gateway/web/freenode/ip.207.140.43.81)
[15:35:10] *** Quits: sethhowe (~sethhowe@134.134.139.76) (Remote host closed the connection)
[15:39:14] FYI my autotest.sh is failing at nvme.sh in the multi_process test in the last loop where it sends out 2 perf and 2 identify. Looks like they get issued and then it just hangs... will poke some more at it later. Any quick suggestions for shortening time to failure or debugging, fire away...
[15:39:43] peluse: how many cores do you have on your test machine?
[15:40:17] the multiprocess test hard-codes some CPU masks right now, and I think it needs at least 8 logical cores
[15:41:56] like 70 something
[15:42:22] in autotest.sh you can comment out all of the tests you don't want to run - that's one way
[15:42:23] EAL: Detected 72 lcore(s)
[15:42:32] or make a config file and pass it to autotest.sh that only enables NVMe
[15:42:50] where's an example of what a config file looks like?
[15:43:00] you're just supposed to magically know
[15:43:01] FYI I gotta go pick up a kid - back a bit later. thanks guys!
[15:43:17] I don't think we have a switch to control multiprocess specifically yet
[15:43:22] but you could turn off all of nvme
[15:43:42] oh, never mind, we actually do have $SPDK_TEST_NVME_MULTIPROCESS
[15:44:21] the config file is ~/autorun-spdk.conf, and it should just be a shell script fragment that will get sourced by autorun.sh and friends
[15:44:44] so you could make one with just a single line that says 'SPDK_TEST_NVME_MULTIPROCESS=0' (no quotes)
[15:45:34] the whole list of flags you can tweak is at the top of scripts/autotest_common.sh
[15:46:11] it would be good to understand why the multiprocess test isn't working on your machine, though
[16:44:17] *** Quits: gila (~gila@5ED4FE92.cm-7-5d.dynamic.ziggo.nl) (Quit: My Mac Pro has gone to sleep. ZZZzzz…)
[16:55:41] yeah, I don't want to skip that test, I want to skip everything else so I can figure out why it doesn't work.
There's nothing odd about this machine and the NVMe drive is brand new and matches the one in the first test pool machine log I looked at except for FW version
[17:02:53] *** Joins: tsuyoshi (b42b2067@gateway/web/freenode/ip.180.43.32.103)
[17:06:26] peluse: you can just run that nvme.sh script manually
[17:06:55] (after running scripts/setup.sh)
[17:07:05] I tried that real quick and it didn't work, wasn't sure what might be missing (setup by other scripts) or even if that was supported
[17:07:44] but hey, I just realized I was running autotest.sh not autorun.sh, and when I do I get a problem with scan-build right off master
[17:08:11] also, separate question/topic - I don't see autorun-spdk.conf anywhere in any of the repos
[17:10:32] it's not checked in to the repo, just created manually on the test machines
[17:10:48] and if it's missing, the autotest stuff uses the defaults from scripts/autotest_common.sh
[17:11:28] ahhh
[17:15:20] we should probably check in an example, though
[17:16:00] yeah, hey what about scan-build failing for me doing nothing more than running autorun.sh off master?
[17:17:31] what errors are you seeing?
[17:18:13] might be hard to copy-paste, since the output is html, but at least which files is it reporting errors in?
[17:19:26] will be a min, running a machine that doesn't have a browser that can get to it...
[17:21:01] Logic error / Dereference of null pointer / vbdev_error.c / spdk_vbdev_inject_error / 83 / 4
[17:22:24] this is the error marked at line #83: "Access to field 'io_type_mask' results in a dereference of a null pointer (loaded from variable 'error_disk')"
[17:23:54] the test machines are running the static analysis too right? Why would I see something different?
[17:26:13] I don't see that on my machine; maybe you have a different version of scan-build?
[17:27:09] we have clang 3.9.1 from Fedora 25 on our machines
[17:27:35] it's possible that newer clang has more checkers that are finding bugs that older clang doesn't find
[17:28:02] clang version 3.8.0-2ubuntu4 (tags/RELEASE_380/final)
[17:29:36] well, then I would guess this has been fixed in a newer clang/scan-build version, but I'm not sure
[17:29:49] heh, I'll mess with it a bit :)
[17:29:59] you can turn off scan-build in autorun-spdk.conf (SPDK_RUN_SCANBUILD=0)
[17:30:06] it might be the flux capacitor...
[17:30:19] although that bit of code looks like it could really use a NULL check anyway
[17:30:35] just for sanity checking
[17:31:38] wow, when I run scan-build make from master on my ubuntu it says 9 bugs found
[17:32:16] we run scan-build with CONFIG_DEBUG=y, which will enable a bunch of asserts that make scan-build happier
[17:32:39] if you run make with CONFIG_DEBUG=n (the default) it will not see those asserts, so it'll report a lot of false positives
[17:33:02] running now with it set to y
[17:33:04] manually
[17:40:58] still that one error. will try getting 3.9 for ubuntu
[17:45:10] peluse: can you try building with this patch? https://review.gerrithub.io/#/c/365526/
[17:45:29] that should hopefully fix the error with the 3.8 scan-build, and it's a possible problem anyway
[17:46:10] yeah, well I installed it but the system is still pointing at/using 3.8. Do I need to do something other than dpkg -i blablabla.deb?
[17:46:28] no idea, I think scan-build is actually just a perl script, but I don't know how to point it at a different clang
[17:46:47] hmmm, OK will see what google has to say...
[17:54:31] that did the trick for the scan-build issue.
I'll update the readme patch to specify the correct package for ubuntu
[17:55:40] cool
[17:55:55] ubuntu is eternally out of date :)
[18:00:34] well, I had to apt-get install clang-3.9 instead of clang
[18:01:44] and I'm not positive that is sufficient because I had 3.8 on there and tried to install 3.9 via dpkg and that didn't work, then discovered clang-3.9 in apt so installed that and it worked. On my other machine I just removed clang and installed clang-3.9 via apt and it can't find scan-build. argh
[18:02:07] I think it's beer-thirty for me...
[18:02:44] BTW, why aren't we running ubuntu on one of the pool machines? We state we support it, we probably should test with it, no?
[18:13:02] yeah, we should probably add an ubuntu machine to the pool
[18:13:47] as well as at least the last two versions of CentOS
[18:14:09] those would be good candidates for VM-based agents
[18:14:45] we are mostly running Fedora because it is fairly up to date (e.g. we need a 4.8 or newer kernel to do NVMe-oF)
[18:20:29] cool, yet another test pool item :)
[18:22:58] wow, so not sure how I hacked my way into getting 3.9 working on the other one and invoking via "clang" and "scan-build", but without that hacky magic, on a fresh build after installing 3.9 I have to call "clang-3.9" and "scan-build-3.9"
[18:24:45] ha! on my other machine after rebooting I have to invoke with the version number as well...
[18:45:09] magic
[21:56:00] *** Joins: ziyeyang_ (~ziyeyang@134.134.139.75)
[21:57:52] *** Quits: ziyeyang_ (~ziyeyang@134.134.139.75) (Client Quit)
[22:02:26] *** Joins: johnmeneghini1 (~johnmeneg@pool-96-252-112-122.bstnma.fios.verizon.net)
[22:03:34] *** Quits: johnmeneghini (~johnmeneg@pool-96-252-112-122.bstnma.fios.verizon.net) (Ping timeout: 268 seconds)