Monitoring inconclusive GBT validations and harvesting data for testing
Author | Message |
---|---|
Richard Haselgrove Joined: 4 Jul 99 Posts: 14671 Credit: 200,643,578 RAC: 874 |
"so what is wrong here?" OK, let's just call it a "heads up" to the owners of NVidia cards who may have preferred to use CUDA applications above OpenCL applications until the arrival of Breakthrough Listen VLAR tasks. It's a behavioural quirk they might wish to be aware of. As well as wasting computational time by making no progress, the stalled OpenCL application drew attention to itself by using 100% of a CPU core throughout (so far as I could observe) the hour and three quarters it was stalled. |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
"Here https://cloud.mail.ru/public/2G3G/aj9aBpWaY is an update for the ATi and NV SoG builds that pass the last overflow test case versus Akv8 CPU." Okay, I've got the r3556 NV build running on my host 8064262, replacing r3548. |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
"One of my hosts has the potential tiebreaker, but that will run as Cuda50 so it won't provide any signal detail." Yeah, I should've been more precise in my comment. "First, make that task run at a time when you're around to manage things. Then disable networking while it's running, and copy the result file somewhere else before it uploads. Then set BOINC back to normal and let it do its thing." Running now. "Second, I once wrote a little tool which I called a 'Summariser' - it should be somewhere in the downloads area at Lunatics. It takes a result file in scientific (XML) format, and spits out the (limited subset of) values in a layout more easily comparable with Raistmer's stderr summary - using text manipulation only, so it doesn't introduce any additional maths errors (though that does mean that over-length values are truncated, not rounded). That certainly leaves enough data to spot the >1% validator-busting discrepancies." Hmmmm, can't find it. Can you drop a few more breadcrumbs? |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
"OK, then either send the mix you feel is good for beta testing directly to Eric, or e-mail it to me to pass on." I think the current builds are going to be the best. The HD5 path is best on the ATI card and the Intel build is best on the nVidia cards. I still haven't heard how well the Intel build works on an Apple Intel, although the download number is now up to three. Since starting the OpenCL tests a few days ago I have a total of 4 Inconclusive results on my Host from those days. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14671 Credit: 200,643,578 RAC: 874 |
"Hmmmm, can't find it. Can you drop a few more breadcrumbs?" Sorry, memory failure. It was attached to a message in the development area - now uploaded. Try Test and Benchmark Tools. |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
"so what is wrong here?" That may not just be an OpenCL issue. Ever since I upgraded one of my old HP boxes from a GTX 550Ti to a GTX 750 several months ago, I've been plagued by driver restarts, among other problems. For the most part, the S@h apps recover just fine, but every so often I get one of these when I'm not paying attention to the machine. That "Maximum elapsed time exceeded" error usually gets triggered after anywhere from 11 to 13 or more hours of zero progress. The good news, such as it is, with the Cuda50 app, is that it's not also sucking up massive amounts of CPU time. In this case, it only used about 4 minutes during the 13+ hour time that it was "running". |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14671 Credit: 200,643,578 RAC: 874 |
"That 'Maximum elapsed time exceeded' error usually gets triggered after anywhere from 11 to 13 or more hours of zero progress. The good news, such as it is, with the Cuda50 app, is that it's not also sucking up massive amounts of CPU time. In this case, it only used about 4 minutes during the 13+ hour time that it was 'running'." 'EXIT_TIME_LIMIT_EXCEEDED' is a function of the initial runtime estimate for the task. Here at SETI, <rsc_fpops_bound> is set to 20x <rsc_fpops_est> (that seems to be the new BOINC default - it used to be 10x). I'm guessing the estimate for that task was around 33-40 minutes when it started? |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
"That 'Maximum elapsed time exceeded' error usually gets triggered after anywhere from 11 to 13 or more hours of zero progress. The good news, such as it is, with the Cuda50 app, is that it's not also sucking up massive amounts of CPU time. In this case, it only used about 4 minutes during the 13+ hour time that it was 'running'." Probably closer to an hour for that AR. Anyway, once one of those gets stuck after a driver restart, it's not really doing anything other than just spinning its wheels. A few times, I've caught one in the act and just suspended/resumed it, which gets it back on track. There have been a half dozen or so that I never noticed until they've shown up as errors in my task list. I don't usually monitor that machine very closely. |
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
"OK, then either send the mix you feel is good for beta testing directly to Eric, or e-mail it to me to pass on." Please send me binaries or direct links to them. SETI apps news We're not gonna fight them. We're gonna transcend them. |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
"Hmmmm, can't find it. Can you drop a few more breadcrumbs?" Got it, thanks. And here's the updated comparison on that one off-kilter Pulse from WU 2316205882: x41p_zi3k: Pulse: peak=10.72607, time=45.86, period=27.2, d_freq=1647270240.2, score=1.08, chirp=84.573, fft_len=1024; opencl_ati5_nocal: Pulse: peak=10.78494, time=45.86, period=27.25, d_freq=1647270240.2, score=1.086, chirp=84.573, fft_len=1024; Cuda50 (x41zi): Pulse: peak=10.784947, time=45.86, period=27.246198, d_freq=1647270240.2024, score=0, chirp=84.573320, fft_len=1024. Your Summariser provides better precision than the Stderr, it appears. Does the score get calculated during validation? |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14671 Credit: 200,643,578 RAC: 874 |
"Your Summariser provides better precision than the Stderr, it appears. Does the score get calculated during validation?" Well, that depends what's in the original XML file - open that up and take a peek. Most BOINC values are presented to six decimal places, whether that's meaningful or not (what's 0.000001 of a floating point operation?) - I think I just followed that precedent. Having a fixed width helps to line up the numbers for comparison. But it's certainly not 'better precision' than is available in the official result - maybe Raistmer removes some precision to save space? As to scoring - no score value is passed to the validator via the official result file. The implications of that are outside my competence. |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
"Your Summariser provides better precision than the Stderr, it appears. Does the score get calculated during validation?" Most of the values in the actual result file appear to go out to 11 or 12 decimal places. I just meant that your Summariser presented more of that than usually appears in the Stderr. All the score values, though, appear to be 0, so I wasn't sure if that was being calculated someplace else, or what. EDIT: I seem to have some vague recollection of Eric mentioning, in regard to one of the signal types, that the score value rather than the peak value was used to calculate the "best" signal for that type. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14671 Credit: 200,643,578 RAC: 874 |
Looking at the validator code, say starting at https://setisvn.ssl.berkeley.edu/trac/browser/seti_boinc/validate/sah_result.cpp#L142, the value of <score> isn't even retrieved from the result file. Eric's comment probably means that the score is used (locally on your machine) to decide which of the available signals to report as 'best', but beyond that plays no part in the proceedings. |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
"Looking at the validator code, say starting at" Ah, found it over on Beta, Message 59634. Best for pulses is measured by score, not by peak_power. So, I wonder how that works if the scores are all 0. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14671 Credit: 200,643,578 RAC: 874 |
"So, I wonder how that works if the scores are all 0." There's only ever one signal of each type reported as 'best', so all the validator needs to check is that the 'best' from each host has the same values (within tolerances). The validator doesn't need to know how the application decided which one to report as best. 'score' has done its job and is finished with - discarded - before the final best values are written into the result file. |
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
It's already the third case where I see the same issue: a difference in the reported period. I'm afraid it's a real bug and it's still not fixed, because it's not just a difference in precision. A different signal was picked up. |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
"So, I wonder how that works if the scores are all 0." Something else just belatedly occurred to me, and that's that the result file from Cuda50 for that task has no <best_xxxxxx> sections at all. It ends after the final Pulse section. Could that be because it's an overflow, or is something else going on there? |
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Because of the overflow, IMHO. And if bests are discarded on overflow, that will help validation a lot, because making the parallel best match the serial best is a separate and quite complex task. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
"OK, then either send the mix you feel is good for beta testing directly to Eric, or e-mail it to me to pass on." They are here: OSX OpenCL Apps. All the Apps were compiled in Darwin 12.6. The ATI App works in Darwin 12.6 & 15.6. The NV App works in 15.6 but should work at least down to Darwin 13.4. I don't know if an Intel build will work below 13.4. |
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
So what version restrictions should there be for the plan classes? |
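
Richard Haselgrove's Summariser, as he describes it in the thread, uses text manipulation only, so over-length values are truncated rather than rounded. The snippet below is not the Summariser itself. It is a minimal sketch of that one point, with a hypothetical helper name (`truncate_decimal_text`), using the Cuda50 peak value Jeff Buck quoted as sample input.

```python
def truncate_decimal_text(value: str, places: int) -> str:
    """Cut a decimal string to `places` fractional digits using string slicing only."""
    whole, dot, frac = value.partition(".")
    if not dot or places <= 0:
        # No fractional part to keep: return the integer digits untouched.
        return whole
    return f"{whole}.{frac[:places]}"


peak = "10.784947"  # Cuda50 peak as quoted in the thread

truncated = truncate_decimal_text(peak, 5)  # pure text cut, no arithmetic
rounded = f"{float(peak):.5f}"              # numeric rounding, for contrast

print("truncated:", truncated)
print("rounded:  ", rounded)
```

The two outputs differ in the last digit, which is the kind of small presentation difference that should not be mistaken for a real discrepancy between applications.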
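
The arithmetic behind Richard Haselgrove's guess about the "Maximum elapsed time exceeded" error can be sketched as follows. This assumes, as his post says, that <rsc_fpops_bound> is 20x <rsc_fpops_est>, and further assumes that the elapsed-time limit scales with the initial runtime estimate by that same factor. The function name is hypothetical.

```python
# Multiplier between <rsc_fpops_bound> and <rsc_fpops_est>, as stated in the thread.
FPOPS_BOUND_MULTIPLIER = 20


def implied_estimate_minutes(hours_until_abort: float,
                             multiplier: int = FPOPS_BOUND_MULTIPLIER) -> float:
    """Initial runtime estimate (minutes) implied by the elapsed time at abort."""
    return hours_until_abort * 60 / multiplier


# Jeff Buck reported aborts after roughly 11 to 13 hours of zero progress.
for hours in (11, 13):
    minutes = implied_estimate_minutes(hours)
    print(f"abort after {hours} h -> initial estimate of about {minutes:.0f} min")
```

Run the other way, an initial estimate of about an hour (Jeff Buck's figure for that AR) would put the limit near 20 hours under the same assumption.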
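
The point that the validator compares reported values within tolerances and never looks at <score> can be illustrated with a small sketch. This is not the code in sah_result.cpp: the field list, the dictionary layout and the single 1% relative tolerance (taken from the ">1%" remark in the thread) are all assumptions made for illustration.

```python
# Assumed tolerance and field list; the real validator may use different ones.
REL_TOLERANCE = 0.01
COMPARED_FIELDS = ("peak", "time", "period", "d_freq", "chirp", "fft_len")  # no "score"


def signals_match(a: dict, b: dict, rel_tol: float = REL_TOLERANCE) -> bool:
    """Return True if every compared field agrees within the relative tolerance."""
    for field in COMPARED_FIELDS:
        x, y = a[field], b[field]
        scale = max(abs(x), abs(y))
        if scale and abs(x - y) / scale > rel_tol:
            return False
    return True


# Pulse values as quoted by Jeff Buck for WU 2316205882.
opencl_ati5 = {"peak": 10.78494, "time": 45.86, "period": 27.25,
               "d_freq": 1647270240.2, "score": 1.086,
               "chirp": 84.573, "fft_len": 1024}
cuda50 = {"peak": 10.784947, "time": 45.86, "period": 27.246198,
          "d_freq": 1647270240.2024, "score": 0,
          "chirp": 84.573320, "fft_len": 1024}

# The scores differ (1.086 versus 0) but are never compared.
print(signals_match(opencl_ati5, cuda50))

# A 5% shift in period is enough to break the match in this sketch.
shifted = dict(opencl_ati5, period=opencl_ati5["period"] * 1.05)
print(signals_match(opencl_ati5, shifted))
```

Leaving "score" out of the compared fields mirrors the observation that a Cuda50 result with every score at 0 can still agree with results from other applications.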