Monitoring inconclusive GBT validations and harvesting data for testing

Richard Haselgrove (Volunteer tester; Joined: 4 Jul 99; Posts: 14671; Credit: 200,643,578; RAC: 874)
Message 1828540 - Posted: 5 Nov 2016, 15:40:10 UTC - in response to Message 1828501.
so what is wrong here? driver restart leads to context destruction. If Runtime would return last call to it app would do temporary exit. But there is no return from last call to runtime hence app never got control back. Nothing new, such behavior observed since first OpenCL app introduction many years ago.
OK, let's just call it a "heads up" to the owners of NVidia cards who may have preferred to use CUDA applications above OpenCL applications until the arrival of Breakthrough Listen VLAR tasks. It's a behavioural quirk they might wish to be aware of. As well as wasting computational time by making no progress, the stalled OpenCL application drew attention to itself by using 100% of a CPU core throughout (so far as I could observe) the hour and three quarters it was stalled.

Jeff Buck (Volunteer tester; Joined: 11 Feb 00; Posts: 1441; Credit: 148,764,870; RAC: 0)
Message 1828559 - Posted: 5 Nov 2016, 16:22:55 UTC - in response to Message 1828507.
Here https://cloud.mail.ru/public/2G3G/aj9aBpWaY is update for ATi and NV SoG builds that pass last overflow testcase versus Akv8 CPU.
Okay, I've got the r3556 NV build running on my host 8064262, replacing r3548.

Jeff Buck (Volunteer tester; Joined: 11 Feb 00; Posts: 1441; Credit: 148,764,870; RAC: 0)
Message 1828563 - Posted: 5 Nov 2016, 16:46:45 UTC - in response to Message 1828467.
One of my hosts has the potential tiebreaker, but that will run as Cuda50 so it won't provide any signal detail.
Signals are signals, and the cuda app will certainly provide them - that's the whole point of the exercise. It simply doesn't write a duplicate copy into stderr.
Yeah, I should've been more precise in my comment.
First, make that task run at a time when you're around to manage things. Then disable networking while it's running, and copy the result file somewhere else before it uploads. Then set BOINC back to normal and let it do its thing.
Running now.
Second, I once wrote a little tool which I called a 'Summariser' - it should be somewhere in the downloads area at Lunatics. It takes a result file in scientific (XML) format, and spits out the (limited subset of) values in a layout more easily comparable with Raistmer's stderr summary - using text manipulation only, so it doesn't introduce any additional maths errors (though that does mean that over-length values are truncated, not rounded). That certainly leaves enough data to spot the >1% validator-busting discrepancies.
Hmmmm, can't find it. Can you drop a few more breadcrumbs?
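
For readers who cannot locate the Summariser either, the idea described above (pull a few values out of the XML result file using text manipulation only, truncating rather than rounding) can be sketched roughly as follows. This is not Richard's tool. The tag names in `FIELDS`, the six-place cut-off, and the numbers in `SAMPLE` are all assumptions chosen for illustration, and the real result-file layout may differ.

```python
import re

# Hypothetical tag names; the real result-file schema may differ.
FIELDS = ("peak_power", "time", "period", "chirp_rate", "fft_len")


def truncate(text, places=6):
    """Cut a plain decimal string to `places` digits after the point.

    Pure string handling: nothing is converted to float, so nothing is
    rounded. Values in exponent notation are not handled by this sketch.
    """
    whole, dot, frac = text.strip().partition(".")
    return whole + dot + frac[:places] if dot else whole


def summarise(xml_text, signal="pulse"):
    """Return one summary line per <signal>...</signal> block."""
    rows = []
    for block in re.findall(rf"<{signal}>(.*?)</{signal}>", xml_text, flags=re.S):
        parts = []
        for name in FIELDS:
            m = re.search(rf"<{name}>(.*?)</{name}>", block, flags=re.S)
            if m:
                parts.append(f"{name}={truncate(m.group(1))}")
        rows.append(f"{signal.capitalize()}: " + ", ".join(parts))
    return rows


# Made-up sample values, only to exercise the truncation.
SAMPLE = """
<pulse>
  <peak_power>1.23456789</peak_power>
  <time>45.86</time>
  <period>2.9999999</period>
  <chirp_rate>-0.5</chirp_rate>
  <fft_len>1024</fft_len>
</pulse>
"""

for row in summarise(SAMPLE):
    print(row)
```

Working on strings keeps the printed digits identical to the ones in the file, which is the point of the "no additional maths errors" remark; the cost is that `2.9999999` comes out as `2.999999` rather than `3.000000`.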

TBar (Volunteer tester; Joined: 22 May 99; Posts: 5204; Credit: 840,779,836; RAC: 2,768)
Message 1828572 - Posted: 5 Nov 2016, 17:06:03 UTC - in response to Message 1828031.
OK, then either send mix you feel is good for beta testing directly to Eric or E-mail me for passing.
I think the current builds are going to be the best. The HD5 path is best on the ATI card and the Intel build is best on the nVidia cards. I still haven't heard how well the Intel build works on an Apple Intel although the download number is now up to three. Since starting the OpenCL tests a few days ago I have a total of 4 Inconclusive results on my Host from those days.

Richard Haselgrove (Volunteer tester; Joined: 4 Jul 99; Posts: 14671; Credit: 200,643,578; RAC: 874)
Message 1828574 - Posted: 5 Nov 2016, 17:11:51 UTC - in response to Message 1828563.
Hmmmm, can't find it. Can you drop a few more breadcrumbs?
Sorry, memory failure. It was attached to a message in the development area - now uploaded. Try Test and Benchmark Tools.

Jeff Buck (Volunteer tester; Joined: 11 Feb 00; Posts: 1441; Credit: 148,764,870; RAC: 0)
Message 1828577 - Posted: 5 Nov 2016, 17:19:22 UTC - in response to Message 1828540.
so what is wrong here? driver restart leads to context destruction. If Runtime would return last call to it app would do temporary exit. But there is no return from last call to runtime hence app never got control back. Nothing new, such behavior observed since first OpenCL app introduction many years ago.
OK, let's just call it a "heads up" to the owners of NVidia cards who may have preferred to use CUDA applications above OpenCL applications until the arrival of Breakthrough Listen VLAR tasks. It's a behavioural quirk they might wish to be aware of. As well as wasting computational time by making no progress, the stalled OpenCL application drew attention to itself by using 100% of a CPU core throughout (so far as I could observe) the hour and three quarters it was stalled.
That may not just be an OpenCL issue. Ever since I upgraded one of my old HP boxes from a GTX 550Ti to a GTX 750 several months ago, I've been plagued by driver restarts, among other problems. For the most part, the S@h apps recover just fine, but every so often I get one of these when I'm not paying attention to the machine. That "Maximum elapsed time exceeded" error usually gets triggered after anywhere from 11 to 13 or more hours of zero progress. The good news, such as it is, with the Cuda50 app, is that it's not also sucking up massive amounts of CPU time. In this case, it only used about 4 minutes during the 13+ hour time that it was "running".

Richard Haselgrove (Volunteer tester; Joined: 4 Jul 99; Posts: 14671; Credit: 200,643,578; RAC: 874)
Message 1828580 - Posted: 5 Nov 2016, 17:29:03 UTC - in response to Message 1828577.
That "Maximum elapsed time exceeded" error usually gets triggered after anywhere from 11 to 13 or more hours of zero progress. The good news, such as it is, with the Cuda50 app, is that it's not also sucking up massive amounts of CPU time. In this case, it only used about 4 minutes during the 13+ hour time that it was "running".
'EXIT_TIME_LIMIT_EXCEEDED' is a function of the initial runtime estimate for the task. Here at SETI, <rsc_fpops_bound> is set to 20x <rsc_fpops_est> (that seems to be the new BOINC default - it used to be 10x). I'm guessing the estimate for that task was around 33-40 minutes when it started?
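
The arithmetic behind that guess can be checked with a few lines. This is only a sketch: it assumes the elapsed-time limit scales linearly with the initial runtime estimate by the same 20x factor quoted above for `<rsc_fpops_bound>` versus `<rsc_fpops_est>`, and the function names are made up for the illustration.

```python
# Assumption: limit = (rsc_fpops_bound / rsc_fpops_est) * initial estimate,
# with the 20x ratio quoted in the post above.
BOUND_FACTOR = 20


def time_limit_hours(estimate_minutes, factor=BOUND_FACTOR):
    """Elapsed-time limit implied by an initial runtime estimate."""
    return estimate_minutes * factor / 60.0


def estimate_from_limit_minutes(limit_hours, factor=BOUND_FACTOR):
    """Initial estimate implied by an observed time-limit error."""
    return limit_hours * 60.0 / factor


for est in (33, 40, 60):
    print(f"{est} min estimate -> limit of about {time_limit_hours(est):.1f} h")

for limit in (11, 13):
    print(f"{limit} h limit -> estimate of about {estimate_from_limit_minutes(limit):.0f} min")
```

Under that assumption, errors appearing after 11 to 13 hours correspond to initial estimates of roughly 33 to 39 minutes, which is where the 33-40 minute guess comes from.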

Jeff Buck (Volunteer tester; Joined: 11 Feb 00; Posts: 1441; Credit: 148,764,870; RAC: 0)
Message 1828586 - Posted: 5 Nov 2016, 17:44:08 UTC - in response to Message 1828580.
That "Maximum elapsed time exceeded" error usually gets triggered after anywhere from 11 to 13 or more hours of zero progress. The good news, such as it is, with the Cuda50 app, is that it's not also sucking up massive amounts of CPU time. In this case, it only used about 4 minutes during the 13+ hour time that it was "running".
'EXIT_TIME_LIMIT_EXCEEDED' is a function of the initial runtime estimate for the task. Here at SETI, <rsc_fpops_bound> is set to 20x <rsc_fpops_est> (that seems to be the new BOINC default - it used to be 10x). I'm guessing the estimate for that task was around 33-40 minutes when it started?
Probably closer to an hour for that AR. Anyway, once one of those gets stuck after a driver restart, it's not really doing anything other than just spinning its wheels. A few times, I've caught one in the act and just suspended/resumed it, which gets it back on track. There have been a half dozen or so that I never noticed until they've shown up as errors in my task list. I don't usually monitor that machine very closely.

Raistmer (Volunteer developer, Volunteer tester; Joined: 16 Jun 01; Posts: 6325; Credit: 106,370,077; RAC: 121)
Message 1828587 - Posted: 5 Nov 2016, 17:45:13 UTC - in response to Message 1828572.
OK, then either send mix you feel is good for beta testing directly to Eric or E-mail me for passing.
I think the current builds are going to be the best. The HD5 path is best on the ATI card and the Intel build is best on the nVidia cards. I still haven't heard how well the Intel build works on an Apple Intel although the download number is now up to three. Since starting the OpenCL tests a few days ago I have a total of 4 Inconclusive results on my Host from those days.
Please send me binaries or direct links to them.
SETI apps news
We're not gonna fight them. We're gonna transcend them.

Jeff Buck (Volunteer tester; Joined: 11 Feb 00; Posts: 1441; Credit: 148,764,870; RAC: 0)
Message 1828590 - Posted: 5 Nov 2016, 17:53:36 UTC - in response to Message 1828574.
Hmmmm, can't find it. Can you drop a few more breadcrumbs?
Sorry, memory failure. It was attached to a message in the development area - now uploaded. Try Test and Benchmark Tools.
Got it, thanks. And here's the updated comparison on that one off-kilter Pulse from WU 2316205882:
x41p_zi3k: Pulse: peak=10.72607, time=45.86, period=27.2, d_freq=1647270240.2, score=1.08, chirp=84.573, fft_len=1024
opencl_ati5_nocal: Pulse: peak=10.78494, time=45.86, period=27.25, d_freq=1647270240.2, score=1.086, chirp=84.573, fft_len=1024
Cuda50 (x41zi): Pulse: peak=10.784947, time=45.86, period=27.246198, d_freq=1647270240.2024, score=0, chirp=84.573320, fft_len=1024
Your Summariser provides better precision than the Stderr, it appears. Does the score get calculated during validation?
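
To put numbers on how far apart those three Pulse lines are, a short sketch like the one below can parse them and print the relative difference of each field against the Cuda50 values. The choice of Cuda50 as the reference and of `peak` and `period` as the fields to compare is arbitrary, and the sketch makes no claim about which tolerance the project validator actually applies to each field.

```python
import re

# The three Pulse lines exactly as posted above.
LINES = {
    "x41p_zi3k": "Pulse: peak=10.72607, time=45.86, period=27.2, "
                 "d_freq=1647270240.2, score=1.08, chirp=84.573, fft_len=1024",
    "opencl_ati5_nocal": "Pulse: peak=10.78494, time=45.86, period=27.25, "
                         "d_freq=1647270240.2, score=1.086, chirp=84.573, fft_len=1024",
    "Cuda50 (x41zi)": "Pulse: peak=10.784947, time=45.86, period=27.246198, "
                      "d_freq=1647270240.2024, score=0, chirp=84.573320, fft_len=1024",
}


def parse(line):
    """Turn 'name=value, name=value, ...' into a dict of floats."""
    return {k: float(v) for k, v in re.findall(r"(\w+)=([-+0-9.eE]+)", line)}


def rel_diff(a, b):
    """Relative difference, scaled by the larger magnitude."""
    scale = max(abs(a), abs(b))
    return 0.0 if scale == 0 else abs(a - b) / scale


ref = parse(LINES["Cuda50 (x41zi)"])  # arbitrary choice of reference

for app, line in LINES.items():
    sig = parse(line)
    report = ", ".join(
        f"{name} differs by {100 * rel_diff(sig[name], ref[name]):.3f}%"
        for name in ("peak", "period")
    )
    print(f"{app}: {report}")
```

Bear in mind that the stderr values are printed with fewer digits than the result file holds, so part of any difference shown here is just display precision.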

Richard Haselgrove (Volunteer tester; Joined: 4 Jul 99; Posts: 14671; Credit: 200,643,578; RAC: 874)
Message 1828594 - Posted: 5 Nov 2016, 18:08:06 UTC - in response to Message 1828590.
Your Summariser provides better precision than the Stderr, it appears. Does the score get calculated during validation?
Well, that depends what's in the original XML file - open that up and take a peek. Most BOINC values are presented to six decimal places, whether that's meaningful or not (what's 0.000001 of a floating point operation?) - I think I just followed that precedent. Having a fixed width helps to line up the numbers for comparison. But it's certainly not "better precision" than is available in the official result - maybe Raistmer removes some precision to save space?
As to scoring - no score value is passed to the validator via the official result file. The implications of that are outside my competence.

Jeff Buck (Volunteer tester; Joined: 11 Feb 00; Posts: 1441; Credit: 148,764,870; RAC: 0)
Message 1828603 - Posted: 5 Nov 2016, 18:19:00 UTC - in response to Message 1828594. Last modified: 5 Nov 2016, 18:21:47 UTC
Your Summariser provides better precision than the Stderr, it appears. Does the score get calculated during validation?
Well, that depends what's in the original XML file - open that up and take a peek. Most BOINC values are presented to six decimal places, whether that's meaningful or not (what's 0.000001 of a floating point operation?) - I think I just followed that precedent. Having a fixed width helps to line up the numbers for comparison. But it's certainly not "better precision" than is available in the official result - maybe Raistmer removes some precision to save space?
As to scoring - no score value is passed to the validator via the official result file. The implications of that are outside my competence.
Most of the values in the actual result file appear to go out to 11 or 12 decimal places. I just meant that your Summariser presented more of that than usually appears in the Stderr. All the score values, though, appear to be 0, so I wasn't sure if that was being calculated someplace else, or what.
EDIT: I seem to have some vague recollection of Eric mentioning, in regard to one of the signal types, that the score value rather than the peak value was used to calculate the "best" signal for that type.

Richard Haselgrove (Volunteer tester; Joined: 4 Jul 99; Posts: 14671; Credit: 200,643,578; RAC: 874)
Message 1828612 - Posted: 5 Nov 2016, 18:36:09 UTC - in response to Message 1828603.
Looking at the validator code, say starting at https://setisvn.ssl.berkeley.edu/trac/browser/seti_boinc/validate/sah_result.cpp#L142 the value of <score> isn't even retrieved from the result file. Eric's comment probably means that the score is used (locally on your machine) to decide which of the available signals to report as 'best', but beyond that plays no part in the proceedings.

Jeff Buck (Volunteer tester; Joined: 11 Feb 00; Posts: 1441; Credit: 148,764,870; RAC: 0)
Message 1828616 - Posted: 5 Nov 2016, 18:46:49 UTC - in response to Message 1828612.
Looking at the validator code, say starting at https://setisvn.ssl.berkeley.edu/trac/browser/seti_boinc/validate/sah_result.cpp#L142 the value of <score> isn't even retrieved from the result file. Eric's comment probably means that the score is used (locally on your machine) to decide which of the available signals to report as 'best', but beyond that plays no part in the proceedings.
Ah, found it over on Beta, Message 59634.
Best for pulses is measured by score, not by peak_power.
So, I wonder how that works if the scores are all 0.

Richard Haselgrove (Volunteer tester; Joined: 4 Jul 99; Posts: 14671; Credit: 200,643,578; RAC: 874)
Message 1828618 - Posted: 5 Nov 2016, 18:52:10 UTC - in response to Message 1828616.
So, I wonder how that works if the scores are all 0.
There's only ever one signal of each type reported as 'best', so all the validator needs to check is that the 'best' from each host has the same values (within tolerances). The validator doesn't need to know how the application decided which one to report as best. 'score' has done its job and is finished with - discarded - before the final best values are written into the result file.
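
That division of labour can be sketched in a few lines. Everything here is hypothetical: the `Pulse` class, the field names, the candidate values and the 1% tolerance are stand-ins for illustration and are not taken from the application or the validator source.

```python
import math
from dataclasses import dataclass


@dataclass
class Pulse:
    peak_power: float
    period: float
    score: float  # used only on the host, to rank candidates


def pick_best(pulses):
    """Host side: choose the candidate with the highest score."""
    return max(pulses, key=lambda p: p.score)


def report(pulse):
    """Host side: what is written out for the 'best' signal.

    The score is deliberately left out, mirroring the description above.
    """
    return {"peak_power": pulse.peak_power, "period": pulse.period}


def bests_agree(a, b, rel_tol=0.01):
    """Validator side: compare reported values only (tolerance is assumed)."""
    return all(math.isclose(a[k], b[k], rel_tol=rel_tol) for k in a)


# Made-up candidates as two hosts might see them.
host_a = [Pulse(10.0, 27.0, 1.080), Pulse(9.0, 13.0, 1.050)]
host_b = [Pulse(10.001, 27.0, 1.0801), Pulse(9.0, 13.0, 1.0499)]

best_a = report(pick_best(host_a))
best_b = report(pick_best(host_b))
print(best_a, best_b, bests_agree(best_a, best_b))
```

In this arrangement the comparison never sees a score, so a stored score of 0 does not affect it. What does matter is that both hosts rank the same candidate first, since the reported values are compared as they stand.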

Raistmer (Volunteer developer, Volunteer tester; Joined: 16 Jun 01; Posts: 6325; Credit: 106,370,077; RAC: 121)
Message 1828619 - Posted: 5 Nov 2016, 18:54:28 UTC - in response to Message 1828590. Last modified: 5 Nov 2016, 18:59:07 UTC
x41p_zi3k: Pulse: peak=10.72607, time=45.86, period=27.2, d_freq=1647270240.2, score=1.08, chirp=84.573, fft_len=1024
opencl_ati5_nocal: Pulse: peak=10.78494, time=45.86, period=27.25, d_freq=1647270240.2, score=1.086, chirp=84.573, fft_len=1024
Cuda50 (x41zi): Pulse: peak=10.784947, time=45.86, period=27.246198, d_freq=1647270240.2024, score=0, chirp=84.573320, fft_len=1024
This is already the third case where I've seen the same issue: a difference in the reported period. I'm afraid it's a real bug, and it is still not fixed, because it's not just a difference in precision. A different signal was picked up.

Jeff Buck (Volunteer tester; Joined: 11 Feb 00; Posts: 1441; Credit: 148,764,870; RAC: 0)
Message 1828623 - Posted: 5 Nov 2016, 19:00:10 UTC - in response to Message 1828618.
So, I wonder how that works if the scores are all 0.
There's only ever one signal of each type reported as 'best', so all the validator needs to check is that the 'best' from each host has the same values (within tolerances). The validator doesn't need to know how the application decided which one to report as best. 'score' has done its job and is finished with - discarded - before the final best values are written into the result file.
Something else just belatedly occurred to me, and that's that the result file from Cuda50 for that task has no <best_xxxxxx> sections at all. It ends after the final Pulse section. Could that be because it's an overflow, or is something else going on there?

Raistmer (Volunteer developer, Volunteer tester; Joined: 16 Jun 01; Posts: 6325; Credit: 106,370,077; RAC: 121)
Message 1828625 - Posted: 5 Nov 2016, 19:00:58 UTC - in response to Message 1828623. Last modified: 5 Nov 2016, 19:02:42 UTC
Because of the overflow, IMHO.
And if the bests are discarded on overflow, that will help validation a lot, because making the parallel best match the serial best is a separate and quite complex task.

TBar (Volunteer tester; Joined: 22 May 99; Posts: 5204; Credit: 840,779,836; RAC: 2,768)
Message 1828626 - Posted: 5 Nov 2016, 19:01:36 UTC - in response to Message 1828587.
OK, then either send mix you feel is good for beta testing directly to Eric or E-mail me for passing.
I think the current builds are going to be the best. The HD5 path is best on the ATI card and the Intel build is best on the nVidia cards. I still haven't heard how well the Intel build works on an Apple Intel although the download number is now up to three. Since starting the OpenCL tests a few days ago I have a total of 4 Inconclusive results on my Host from those days.
Please send me binaries or direct links to them.
They are here: OSX OpenCL Apps.
All the Apps were compiled in Darwin 12.6. The ATI App works in Darwin 12.6 & 15.6. The NV App works in 15.6 but should work at least down to Darwin 13.4. I don't know if an Intel build will work below 13.4.

Raistmer (Volunteer developer, Volunteer tester; Joined: 16 Jun 01; Posts: 6325; Credit: 106,370,077; RAC: 121)
Message 1828627 - Posted: 5 Nov 2016, 19:03:46 UTC - in response to Message 1828626.
So what version restrictions should there be for the plan classes?
