Message boards :
Number crunching :
Monitoring inconclusive GBT validations and harvesting data for testing
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
Agreed on the ordering issue. It needs to be formalised to serial order to be correct in numerical-methods/computer-science terms.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
But to solve this by doing a "find all and sort them afterwards" would mean that every task would have to run to full term, and we'd lose the efficiency of quitting early, after 10 seconds or so, for the really noisy WUs. This goes back to my 'more than one way to skin a cat' post: there are more efficient divide-and-conquer methods to serialise the results, which Raistmer claims already exist in the OpenCL builds (I haven't looked myself). [Baseline Cuda only gets away with its minor order difference because of the extreme rarity of its applicability, but it should ultimately be corrected.]
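The early-exit behaviour being traded off here can be sketched roughly like this (an illustration only; `find_signals`, the chunk structure, and the threshold are hypothetical stand-ins, not the actual MB code):

```python
def scan_task(chunks, limit=30):
    """Scan data chunks in serial order, quitting early on overflow.

    Returns (signals, overflowed). A really noisy WU trips the limit
    within the first few chunks, so almost all of the remaining work
    is skipped -- the efficiency a find-all-then-sort approach gives up.
    """
    found = []
    for chunk in chunks:
        found.extend(find_signals(chunk))
        if len(found) >= limit:
            return found[:limit], True   # -9 style overflow exit
    return found, False

def find_signals(chunk):
    # Hypothetical stand-in: treat every value above a fixed
    # threshold as a reportable signal.
    return [x for x in chunk if x > 1.0]
```

A noisy task such as `scan_task([[2.0] * 40, [0.0] * 40])` overflows on the first chunk and never touches the second.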
-= Vyper =- Joined: 5 Sep 99 Posts: 1652 Credit: 1,065,191,981 RAC: 2,537
"this doesn't guarantee other compilers, or hardware device manufacturers implemented their chips in the bit identical way suggested"

Yes, but if other compilers or hardware aren't bit identical, then there would be a flaw in their IEEE 754 implementation; you would all know that, and would need to tackle that platform or device differently and put the effort there! I'm only suggesting that IEEE 754 should be used so the majority of applications get to the Q100 mark! Then you all know that when compiling under Linux, Windows, etc., this works as intended, and when a new version breaks it, you would know it 100% for sure and could revert, or change the lines in the code required to get back to the Q100 mark.

I haven't mentioned validation, as it could validate non-Q100 results also; I'm proposing this as a baseline and a way of thinking to ease future headaches.
_________________________________________________________________________ Addicted to SETI crunching! Founder of GPU Users Group
Richard Haselgrove Joined: 4 Jul 99 Posts: 14672 Credit: 200,643,578 RAC: 874
"this doesn't guarantee other compilers, or hardware device manufacturers implemented their chips in the bit identical way suggested"

Note that the code comment I just posted says that each signal must be "roughly equal to a signal from the other result". 'Roughly' in this case (and it applies exactly the same to the strong similarity test) means within 1%, in general terms. https://setisvn.ssl.berkeley.edu/trac/browser/seti_boinc/validate/sah_result.cpp#L35

We need to distinguish between "same signal, different maths" and "different signal". They'll have different solutions.
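The 1% 'roughly equal' idea amounts to a relative-tolerance comparison, which can be sketched like this (an illustration only; the real sah_result.cpp compares whole signals field by field, and its exact tolerance logic may differ):

```python
def roughly_equal(a, b, tol=0.01):
    # True when a and b agree to within ~1% of the larger magnitude.
    return abs(a - b) <= tol * max(abs(a), abs(b))

# Two apps reporting slightly different power for the same signal still match:
print(roughly_equal(100.0, 100.9))   # within 1% of each other
print(roughly_equal(100.0, 102.0))   # outside 1%: a genuinely different value
```

The point is that "same signal, different maths" lands inside the tolerance band, while "different signal" should land outside it.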
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
"this doesn't guarantee other compilers, or hardware device manufacturers implemented their chips in the bit identical way suggested"

Then we need a separate discussion about this paper: What Every Computer Scientist Should Know About Floating-Point Arithmetic [pdf document link]. Because this common perception of floating point as deterministic leaves out rounding error, which is part of floating point. The only MB applications I know of that have had their components' rounding error measured (in ulps) are the stock CPU, the double-precision naive reference, and the Cuda implementations, because I did them myself.
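Measuring rounding error "in ulps" between two 32-bit results can be sketched as follows (illustrative only, using the common trick of reinterpreting float bit patterns as monotonically ordered integers):

```python
import struct

def ulp_distance(a, b):
    """Units-in-the-last-place distance between two 32-bit floats."""
    def ordered(x):
        # Reinterpret the float32 bits as an unsigned int, then remap
        # negative floats so the integers increase monotonically with x
        # (and -0.0 and +0.0 both map to 0).
        (u,) = struct.unpack("<I", struct.pack("<f", x))
        return u if u < 0x80000000 else 0x80000000 - u
    return abs(ordered(a) - ordered(b))

# 1.0f and the very next representable float32 are exactly 1 ulp apart:
next_up = struct.unpack("<f", struct.pack("<I", 0x3F800001))[0]
print(ulp_distance(1.0, next_up))
```

Two implementations of the same maths can then be compared result-by-result, and the worst-case ulp error recorded per component.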
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
Regarding fp:strict: before going further, estimate the performance penalty of enabling this option. Don't forget that doing the math in double precision generally gives more precision. Arbitrary-precision calculation is possible too... but it just doesn't suit our needs. Regarding fp:precise usage for CUDA: in host or device code?

SETI apps news We're not gonna fight them. We're gonna transcend them.
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
"Regarding fp:strict : before going further estimate performance penalty of this option enabled."

Host only. Fast math with the Cuda-compiled kernels proved to give the same results (probably because there's only one implementation for the sensitive parts, with hard intrinsics used in e.g. the FFTs, the chirp, etc., so nothing gets replaced). That could potentially change if nv decide to start substituting intrinsics with other instructions (unlikely, but possible). In the case of many of Petri's hand optimisations, we're talking hard PTX assembly instructions, so compiler fp optimisations don't apply at all.
Richard Haselgrove Joined: 4 Jul 99 Posts: 14672 Credit: 200,643,578 RAC: 874
"Then we need a separate discussion about this paper:"

And some of the links in http://mcintosh.web.cern.ch/mcintosh/
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
"Then we need a separate discussion about this paper:"

Very good! I'd forgotten about that recent work. Saving the link (again). [Edit:] from one link: Floating-Point Determinism. Love it :D haha
-= Vyper =- Joined: 5 Sep 99 Posts: 1652 Credit: 1,065,191,981 RAC: 2,537
Well, if we lose the efficiency of quitting early, why should the validator even "validate" -9 work, when the server code could just see: "Ohh geez, this is an overflow result! Thanks! Here are your credits!" when compared to other -9s. If the device sends a -9 result back but the other application sees it as a real result, then you should be awarded zero credits anyway.
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
That's an option/choice for the project scientists. My feeling is that, being scientists, the repeatability of the -9s plays a part somehow; for example, perhaps in identifying Earth-local sources of RFI for filtering and/or threshold calibration. That would potentially make overflow ordering relevant for specific uses, if only to distinguish genuine noise from faulty applications or hosts (effectively shielding the science). This 'feeling' (of mine) would seem to fall in line with some past comments Eric made regarding precision (a separate issue): that extra precision wasn't needed.

The main rub with the separate bit-exact concept, besides the massive CERN-level engineering effort it would require, is that the stock CPU application itself would require treatment as though it were many kinds of devices/compilers. That's because an x86 CPU could be using the x87 FPU with 80-bit intermediates for some or all parts (depending on CPU + OS + app), in addition to the 32- and 64-bit floats of SIMD vectorisation (i.e. SSE+, various implementations tested during bench, not to mention fftw's internal dispatch).

There's definitely been an inadequate match at the precision level in the past (and some builds/devices/platforms may still suffer from this). However, this has steadily been refined from Q's of ~60% in AKv8 days, along with ~20-30% inconclusives, down to the point where the project seems comfortable with initial replications of 2 and our expectation of 99%+ Q's. That's change we all contributed to over time, [in stock and third party], and now the focus switches from precision to reliability and performance (among other metrics).
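The 80-bit-intermediate point is the classic way identical source code yields different results on the same CPU. A minimal illustration of how intermediate width alone changes an answer (here simulated by rounding to float32 with struct, rather than actual x87 vs SSE code):

```python
import struct

def f32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

a, b = 1.0e8, 1.0
wide   = a + b          # 64-bit intermediate: the 1.0 survives the sum
narrow = f32(a + b)     # 32-bit intermediate: the 1.0 is rounded away
print(wide == a)        # False
print(narrow == a)      # True -- same source expression, different answer
```

An app built to keep intermediates in x87 80-bit registers and one built with 32-bit SSE arithmetic differ in exactly this way, which is why the stock CPU app alone would need treatment as several "devices" under a bit-exact scheme.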
Richard Haselgrove Joined: 4 Jul 99 Posts: 14672 Credit: 200,643,578 RAC: 874
"But to solve this by doing a 'find all and sort them afterwards' would mean that every task would have to run to full term, and we'd lose the efficiency of quitting early after 10 seconds or so for the really noisy WUs."

I think I'd draw a distinction between the tasks which overflow early in the run (sometimes almost immediately) and these 'late onset' cases. "Immediate overflow" is (probably - IMO) scientifically useless, and could be treated as you suggest. On the other hand, nothing is lost by sending an extra tie-breaker except some users' bandwidth - and that will be more of a problem for some users than for others.

But I don't accept that tasks which run for a substantial proportion of their intended runtime are necessarily scientifically useless - and if there is science in there, it should go through the validation process before credit is awarded. I'm increasingly coming to believe that credit should be substantially reduced for 'weakly similar' results, as an encouragement to users to pay attention to malfunctioning hardware or software.

As a quid pro quo for that suggestion, I think we would need to find a way of reporting the 'serial first 30' signals from a parallel application. That might involve choosing an intermediate point - 30%? 50%? - after which the parallel app would continue to the end, find all signals, and sort out the reportable ones. All of which is much easier to suggest than to implement...
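That proposal could be sketched as a reporting policy like the following (purely hypothetical; the cutoff fraction, the signal fields `icfft`/`bin`, and the structure are all assumptions for illustration, not MB code):

```python
def reportable(signals, overflow_progress, cutoff=0.5, limit=30):
    """Pick the reportable signal set under the proposed policy.

    overflow_progress: fraction of the task completed when the
    30-signal limit tripped, or None if it never overflowed.
    """
    if overflow_progress is not None and overflow_progress < cutoff:
        # Early overflow: probably noise; quit and report as today.
        return signals[:limit]
    # Late (or no) overflow: run to the end, find all signals, then
    # sort them into the order a serial scan would have visited them
    # and keep the 'serial first 30'.
    ordered = sorted(signals, key=lambda s: (s["icfft"], s["bin"]))
    return ordered[:limit]

found = [{"icfft": 9, "bin": 1}, {"icfft": 2, "bin": 5}, {"icfft": 2, "bin": 3}]
print(reportable(found, overflow_progress=0.8, limit=2))
```

The sort key is the crux: it makes the parallel app's report order-independent of the order in which its threads happened to find the signals.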
Richard Haselgrove Joined: 4 Jul 99 Posts: 14672 Credit: 200,643,578 RAC: 874
I hope the people contributing to this discussion are also reading the Beta forum. Recently (Beta message 59698) Raistmer wrote:

"EDIT2: recently I looked into the pulse signal selection algorithm. And it appeared to resemble the AstroPulse one more than I thought. It contains the same PoT signal replacement too. That is, if another, stronger signal is found inside the same PoT but on another fold level (another period), the old one will be replaced by the new one. The old one is not reported. It's one of the possible places where a bug with such a manifestation could hide."

That does indeed suggest that we ought to pay some attention (if we don't already) to where the '30 signal' breakpoint is invoked in both serial and parallel cases - and ensure they are compatible. Running on to the end of the current PoT - enabling replacement - before breaking and reporting would seem to be wise in both cases.
-= Vyper =- Joined: 5 Sep 99 Posts: 1652 Credit: 1,065,191,981 RAC: 2,537
Gentlemen! To sum it up so far: I'm happy that a mindset has been brought to daylight instead of pinpointing applications left and right. Something needed to be done, and Jeff highlighted with hard proof what I've seen, figured out needed to be addressed, and hadn't been able to express verbally. Now that we all seem to be on the same page, recognizing that we have issues across various platforms/apps/compilers, I think the solution to this and forthcoming issues might pop up later on from all of this, in how to think and act. My hat off to all of you, lads!

Now, if we go back to the idea of fp:strict (IEEE 754) versus anything else (double precision, etc.): have any of you an idea of the speed penalty of going strict instead of precise, or double precision? If going fp:strict single precision (more isn't needed, apparently) is a few percent slower, then so be it, for the sake of conformity! But if it is half the speed, then no, that is not the route to go "for now"; instead, focus on the validator/re-order-of-work-reported issue that seems to be apparent.
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
With MSVC builds (both CPU and OpenCL GPU), specifying /fp:precise for host code leads to differences in results versus stock. That's the topic of the initial 7.99/8.0 deployment precision issue, where stock disagreed between its own Linux/Windows x64/x86 builds. It was discussed in detail in e-mail conversations at the time.
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
"As a quid pro quo for that suggestion, I think we would need to find a way of reporting the 'serial first 30' signals from a parallel application. That might involve choosing an intermediate point - 30%? 50%? - after which the parallel app would continue to the end, find all signals, and sort out the reportable ones. All of which is much easier to suggest than to implement..."

Yes. Most likely, as Cuda Xbranch evolves into proper heterogeneous form (x42's longstanding intent), you may potentially end up with the extreme situation of different devices/threads/hosts processing portions all out of sequence, and potentially of different sizes. The special case of an overflow (late or early) is manageable, though of course not without waste. Balancing the potential waste against efficiency and throughput/utilisation/scaling is one of the key challenges.

Special partial to full-serial reprocessing modes are likely to form part of some optional solutions, addressing the overflow situation, but directed moreso at a completely different set of goals:
-- serialising/rationalising results to a standard form and the precision requirements matching the reference algorithm,
-- potential for 'other implementations' such as attached DSPs, FPGAs, AIs/NNs, or fast wavelet implementations, or
-- a mixture of those.

For example, it would seem pretty wasteful if large consumer-level GP10x-based GPUs with HBM, optimised for convolutional neural networks [and other deep learning approaches], came out next year and were forced to use Fourier analysis for whole tasks. Watch the Bladerunner Enhance scene (youtube 2m40s).
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
Modifications of the validator for overflows are currently under discussion with Berkeley's team (see the beta forums, for example).
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
Well, what can I say: the opposite was true for the Cuda builds, and I guess the host code is different. It was Richard who brought the flaky Gaussians to my attention; host fp:precise fixed them against 8.00, and no repeatable dissimilarity to 8.00 CPU has been reported to me since.
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
"I think we would need to find a way of reporting the 'serial first 30' signals from a parallel application. That might involve choosing an intermediate point - 30%? 50%? - after which the parallel app would continue to the end, find all signals, and sort out the reportable ones. All of which is much easier to suggest than to implement..."

Strongly disagree. The effort would be worthless (and very resource-costly to implement, by my current estimates).
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
"I hope the people contributing to this discussion are also reading the Beta forum. Recently (Beta message 59698)"

This means that the separate periods for the same PoT should be processed in whole, and the best single reportable signal chosen amongst all the results. This part will never result in overflow; it's about missing the correct reportable pulse (as Petri's build demonstrated for the beta task under discussion). Such bugs should be fixed, of course, because they result in wrong signal reports for all types of tasks.
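The selection rule being described can be sketched as follows (hypothetical structure for illustration; the real pulse-finding code operates on folded PoT arrays, not dicts):

```python
def best_reportable(candidates):
    """Among pulse candidates found at different fold levels (periods)
    of the same PoT, only the single strongest reportable one survives;
    a stronger signal found later at another period replaces any earlier
    one, which is then never reported.
    """
    best = None
    for sig in candidates:
        if sig["reportable"] and (best is None or sig["power"] > best["power"]):
            best = sig
    return best

candidates = [
    {"period": 8,  "power": 1.1, "reportable": True},
    {"period": 16, "power": 2.4, "reportable": True},   # replaces the period-8 one
    {"period": 32, "power": 3.0, "reportable": False},  # strong, but not reportable
]
print(best_reportable(candidates))
```

The bug class under discussion is when serial and parallel implementations cut this selection off at different points, so they disagree about which candidate "won".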
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.