Message boards :
Number crunching :
Monitoring inconclusive GBT validations and harvesting data for testing
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
Agreed on the ordering issue. It needs to be formalised to serial order to be correct in numerical-methods/computer-science terms.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
But to solve this by doing a "find all and sort them afterwards" would mean that every task would have to run to full term, and we'd lose the efficiency of quitting early, after 10 seconds or so, for the really noisy WUs. This goes back to my 'more than one way to skin a cat' post: there are more efficient divide-and-conquer methods to serialise the results, which Raistmer claims already exist in the OpenCL builds (I haven't looked myself). [Baseline Cuda only gets away with its minor order difference because of the extreme rarity of its applicability, but it should ultimately be corrected.]
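The early-exit behaviour being traded off here can be sketched roughly like this (an illustration only; `find_signals`, the chunk structure, and the threshold are hypothetical stand-ins, not the actual MB code):

```python
def scan_task(chunks, limit=30):
    """Scan data chunks in serial order, quitting early on overflow.

    Returns (signals, overflowed). A really noisy WU trips the limit
    within the first few chunks, so almost all of the remaining work
    is skipped -- the efficiency a find-all-then-sort approach gives up.
    """
    found = []
    for chunk in chunks:
        found.extend(find_signals(chunk))
        if len(found) >= limit:
            return found[:limit], True   # -9 style overflow exit
    return found, False

def find_signals(chunk):
    # Hypothetical stand-in: treat every value above a fixed
    # threshold as a reportable signal.
    return [x for x in chunk if x > 1.0]
```

A noisy task such as `scan_task([[2.0] * 40, [0.0] * 40])` overflows on the first chunk and never touches the second.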
-= Vyper =- Joined: 5 Sep 99 Posts: 1652 Credit: 1,065,191,981 RAC: 2,537
"this doesn't guarantee other compilers, or hardware device manufacturers implemented their chips in the bit identical way suggested"

Yes, but if other compilers or hardware aren't bit identical, then there would be a flaw in their IEEE 754 implementation; you would all know that, and would need to tackle that platform or device differently and put the effort there! I'm only suggesting that IEEE 754 should be used so the majority of applications get to the Q100 mark! Then you all know that when compiling under Linux, Windows, etc., this works as intended, and when a new version breaks it, you would know it 100% for sure and could revert, or change the lines in the code required to get back to the Q100 mark.

I haven't mentioned validation, as it could validate non-Q100 results also; I'm proposing this as a baseline and a way of thinking to ease future headaches.
_________________________________________________________________________ Addicted to SETI crunching! Founder of GPU Users Group
Richard Haselgrove Joined: 4 Jul 99 Posts: 14672 Credit: 200,643,578 RAC: 874
"this doesn't guarantee other compilers, or hardware device manufacturers implemented their chips in the bit identical way suggested"

Note that the code comment I just posted says that each signal must be "roughly equal to a signal from the other result". 'Roughly' in this case (and it applies exactly the same to the strong similarity test) means within 1%, in general terms. https://setisvn.ssl.berkeley.edu/trac/browser/seti_boinc/validate/sah_result.cpp#L35

We need to distinguish between "same signal, different maths" and "different signal". They'll have different solutions.
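The 1% 'roughly equal' idea amounts to a relative-tolerance comparison, which can be sketched like this (an illustration only; the real sah_result.cpp compares whole signals field by field, and its exact tolerance logic may differ):

```python
def roughly_equal(a, b, tol=0.01):
    # True when a and b agree to within ~1% of the larger magnitude.
    return abs(a - b) <= tol * max(abs(a), abs(b))

# Two apps reporting slightly different power for the same signal still match:
print(roughly_equal(100.0, 100.9))   # within 1% of each other
print(roughly_equal(100.0, 102.0))   # outside 1%: a genuinely different value
```

The point is that "same signal, different maths" lands inside the tolerance band, while "different signal" should land outside it.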
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
"this doesn't guarantee other compilers, or hardware device manufacturers implemented their chips in the bit identical way suggested"

Then we need a separate discussion about this paper: What Every Computer Scientist Should Know About Floating-Point Arithmetic [pdf document link]. Because this common perception of floating point as deterministic leaves out rounding error, which is part of floating point. The only MB applications I know of that have had their components' rounding error measured (in ulps) are the stock CPU, the double-precision naive reference, and the Cuda implementations, because I did them myself.
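Measuring rounding error "in ulps" between two 32-bit results can be sketched as follows (illustrative only, using the common trick of reinterpreting float bit patterns as monotonically ordered integers):

```python
import struct

def ulp_distance(a, b):
    """Units-in-the-last-place distance between two 32-bit floats."""
    def ordered(x):
        # Reinterpret the float32 bits as an unsigned int, then remap
        # negative floats so the integers increase monotonically with x
        # (and -0.0 and +0.0 both map to 0).
        (u,) = struct.unpack("<I", struct.pack("<f", x))
        return u if u < 0x80000000 else 0x80000000 - u
    return abs(ordered(a) - ordered(b))

# 1.0f and the very next representable float32 are exactly 1 ulp apart:
next_up = struct.unpack("<f", struct.pack("<I", 0x3F800001))[0]
print(ulp_distance(1.0, next_up))
```

Two implementations of the same maths can then be compared result-by-result, and the worst-case ulp error recorded per component.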
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
Regarding fp:strict: before going further, estimate the performance penalty of enabling this option. Don't forget that doing the math in double precision generally gives more precision. Arbitrary-precision calculation is possible too... but it just doesn't suit our needs. Regarding fp:precise usage for CUDA: in host or device code?

SETI apps news We're not gonna fight them. We're gonna transcend them.
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
"Regarding fp:strict : before going further estimate performance penalty of this option enabled."

Host only. Fast math with the Cuda-compiled kernels proved to give the same results (probably because there's only one implementation for the sensitive parts, with hard intrinsics used in e.g. the FFTs, the chirp, etc., so nothing gets replaced). That could potentially change if nv decide to start substituting intrinsics with other instructions (unlikely, but possible). In the case of many of Petri's hand optimisations, we're talking hard PTX assembly instructions, so compiler fp optimisations don't apply at all.
Richard Haselgrove Joined: 4 Jul 99 Posts: 14672 Credit: 200,643,578 RAC: 874
"Then we need a separate discussion about this paper:"

And some of the links in http://mcintosh.web.cern.ch/mcintosh/
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
"Then we need a separate discussion about this paper:"

Very good! I'd forgotten about that recent work. Saving the link (again). [Edit:] from one link: Floating-Point Determinism. Love it :D haha
-= Vyper =- Joined: 5 Sep 99 Posts: 1652 Credit: 1,065,191,981 RAC: 2,537
Well, if we lose the efficiency of quitting early, why should the validator even "validate" -9 work, when the server code could just see: "Ohh geez, this is an overflow result! Thanks! Here are your credits!" when compared to other -9s. If the device sends a -9 result back but the other application sees it as a real result, then you should be awarded zero credits anyway.
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
That's an option/choice for the project scientists. My feeling is that, being scientists, the repeatability of the -9s plays a part somehow; for example, perhaps in identifying Earth-local sources of RFI for filtering and/or threshold calibration. That would potentially make overflow ordering relevant for specific uses, if only to distinguish genuine noise from faulty applications or hosts (effectively shielding the science). This 'feeling' (of mine) would seem to fall in line with some past comments Eric made regarding precision (a separate issue): that extra precision wasn't needed.

The main rub with the separate bit-exact concept, besides the massive CERN-level engineering effort it would require, is that the stock CPU application itself would require treatment as though it were many kinds of devices/compilers. That's because an x86 CPU could be using the x87 FPU with 80-bit intermediates for some or all parts (depending on CPU + OS + app), in addition to the 32- and 64-bit floats of SIMD vectorisation (i.e. SSE+, various implementations tested during bench, not to mention fftw's internal dispatch).

There's definitely been an inadequate match at the precision level in the past (and some builds/devices/platforms may still suffer from this). However, this has steadily been refined from Q's of ~60% in AKv8 days, along with ~20-30% inconclusives, down to the point where the project seems comfortable with initial replications of 2 and our expectation of 99%+ Q's. That's change we all contributed to over time, [in stock and third party], and now the focus switches from precision to reliability and performance (among other metrics).
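The 80-bit-intermediate point is the classic way identical source code yields different results on the same CPU. A minimal illustration of how intermediate width alone changes an answer (here simulated by rounding to float32 with struct, rather than actual x87 vs SSE code):

```python
import struct

def f32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

a, b = 1.0e8, 1.0
wide   = a + b          # 64-bit intermediate: the 1.0 survives the sum
narrow = f32(a + b)     # 32-bit intermediate: the 1.0 is rounded away
print(wide == a)        # False
print(narrow == a)      # True -- same source expression, different answer
```

An app built to keep intermediates in x87 80-bit registers and one built with 32-bit SSE arithmetic differ in exactly this way, which is why the stock CPU app alone would need treatment as several "devices" under a bit-exact scheme.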
Richard Haselgrove Joined: 4 Jul 99 Posts: 14672 Credit: 200,643,578 RAC: 874
"But to solve this by doing a 'find all and sort them afterwards' would mean that every task would have to run to full term, and we'd lose the efficiency of quitting early after 10 seconds or so for the really noisy WUs."

I think I'd draw a distinction between the tasks which overflow early in the run (sometimes almost immediately) and these 'late onset' cases. "Immediate overflow" is (probably - IMO) scientifically useless, and could be treated as you suggest. On the other hand, nothing is lost by sending an extra tie-breaker except some users' bandwidth - and that will be more of a problem for some users than for others.

But I don't accept that tasks which run for a substantial proportion of their intended runtime are necessarily scientifically useless - and if there is science in there, it should go through the validation process before credit is awarded. I'm increasingly coming to believe that credit should be substantially reduced for 'weakly similar' results, as an encouragement to users to pay attention to malfunctioning hardware or software.

As a quid pro quo for that suggestion, I think we would need to find a way of reporting the 'serial first 30' signals from a parallel application. That might involve choosing an intermediate point - 30%? 50%? - after which the parallel app would continue to the end, find all signals, and sort out the reportable ones. All of which is much easier to suggest than to implement...
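That proposal could be sketched as a reporting policy like the following (purely hypothetical; the cutoff fraction, the signal fields `icfft`/`bin`, and the structure are all assumptions for illustration, not MB code):

```python
def reportable(signals, overflow_progress, cutoff=0.5, limit=30):
    """Pick the reportable signal set under the proposed policy.

    overflow_progress: fraction of the task completed when the
    30-signal limit tripped, or None if it never overflowed.
    """
    if overflow_progress is not None and overflow_progress < cutoff:
        # Early overflow: probably noise; quit and report as today.
        return signals[:limit]
    # Late (or no) overflow: run to the end, find all signals, then
    # sort them into the order a serial scan would have visited them
    # and keep the 'serial first 30'.
    ordered = sorted(signals, key=lambda s: (s["icfft"], s["bin"]))
    return ordered[:limit]

found = [{"icfft": 9, "bin": 1}, {"icfft": 2, "bin": 5}, {"icfft": 2, "bin": 3}]
print(reportable(found, overflow_progress=0.8, limit=2))
```

The sort key is the crux: it makes the parallel app's report order-independent of the order in which its threads happened to find the signals.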
Richard Haselgrove Joined: 4 Jul 99 Posts: 14672 Credit: 200,643,578 RAC: 874
I hope the people contributing to this discussion are also reading the Beta forum. Recently (Beta message 59698) Raistmer wrote:

"EDIT2: recently I looked into the pulse signal selection algorithm. And it appeared to resemble the AstroPulse one more than I thought. It contains the same PoT signal replacement too. That is, if another, stronger signal is found inside the same PoT but on another fold level (another period), the old one will be replaced by the new one. The old one is not reported. It's one of the possible places where a bug with such a manifestation could hide."

That does indeed suggest that we ought to pay some attention (if we don't already) to where the '30 signal' breakpoint is invoked in both serial and parallel cases - and ensure they are compatible. Running on to the end of the current PoT - enabling replacement - before breaking and reporting would seem to be wise in both cases.
-= Vyper =- Joined: 5 Sep 99 Posts: 1652 Credit: 1,065,191,981 RAC: 2,537
Gentlemen! To sum it up so far: I'm happy that a mindset has been brought to daylight instead of pinpointing applications left and right. Something needed to be done, and Jeff highlighted with hard proof what I've seen, figured out needed to be addressed, and hadn't been able to express verbally. Now that we all seem to be on the same page, recognizing that we have issues across various platforms/apps/compilers, I think the solution to this and forthcoming issues might pop up later on from all of this, in how to think and act. My hat off to all of you, lads!

Now, if we go back to the idea of fp:strict (IEEE 754) versus anything else (double precision, etc.): have any of you an idea of the speed penalty of going strict instead of precise, or double precision? If going fp:strict single precision (more isn't needed, apparently) is a few percent slower, then so be it, for the sake of conformity! But if it is half the speed, then no, that is not the route to go "for now"; instead, focus on the validator/re-order-of-work-reported issue that seems to be apparent.
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
With MSVC builds (both CPU and OpenCL GPU), specifying /fp:precise for host code leads to differences in results versus stock. That's the topic of the initial 7.99/8.0 deployment precision issue, where stock disagreed between its own Linux/Windows x64/x86 builds. It was discussed in detail in e-mail conversations at the time.
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
"As a quid pro quo for that suggestion, I think we would need to find a way of reporting the 'serial first 30' signals from a parallel application. That might involve choosing an intermediate point - 30%? 50%? - after which the parallel app would continue to the end, find all signals, and sort out the reportable ones. All of which is much easier to suggest than to implement..."

Yes. Most likely, as Cuda Xbranch evolves into proper heterogeneous form (x42's longstanding intent), you may potentially end up with the extreme situation of different devices/threads/hosts processing portions all out of sequence, and potentially of different sizes. The special case of an overflow (late or early) is manageable, though of course not without waste. Balancing the potential waste against efficiency and throughput/utilisation/scaling is one of the key challenges.

Special partial to full-serial reprocessing modes are likely to form part of some optional solutions, addressing the overflow situation, but directed moreso at a completely different set of goals:
-- serialising/rationalising results to a standard form and the precision requirements matching the reference algorithm,
-- potential for 'other implementations' such as attached DSPs, FPGAs, AIs/NNs, or fast wavelet implementations, or
-- a mixture of those.

For example, it would seem pretty wasteful if large consumer-level GP10x-based GPUs with HBM, optimised for convolutional neural networks [and other deep learning approaches], came out next year and were forced to use Fourier analysis for whole tasks. Watch the Bladerunner Enhance scene (youtube 2m40s).
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
Modifications of the validator for overflows are currently under discussion with Berkeley's team (see the beta forums, for example).
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
Well, what can I say: the opposite was true for the Cuda builds, and I guess the host code is different. It was Richard who brought the flaky Gaussians to my attention; host fp:precise fixed them against 8.00, and no repeatable dissimilarity to 8.00 CPU has been reported to me since.
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
"I think we would need to find a way of reporting the 'serial first 30' signals from a parallel application. That might involve choosing an intermediate point - 30%? 50%? - after which the parallel app would continue to the end, find all signals, and sort out the reportable ones. All of which is much easier to suggest than to implement..."

Strongly disagree. The effort would be worthless (and very resource-costly to implement, by my current estimates).
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
"I hope the people contributing to this discussion are also reading the Beta forum. Recently (Beta message 59698)"

This means that the separate periods for the same PoT should be processed in whole, and the best single reportable signal chosen amongst all the results. This part will never result in overflow; it's about missing the correct reportable pulse (as Petri's build demonstrated for the beta task under discussion). Such bugs should be fixed, of course, because they result in wrong signal reports for all types of tasks.
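The selection rule being described can be sketched as follows (hypothetical structure for illustration; the real pulse-finding code operates on folded PoT arrays, not dicts):

```python
def best_reportable(candidates):
    """Among pulse candidates found at different fold levels (periods)
    of the same PoT, only the single strongest reportable one survives;
    a stronger signal found later at another period replaces any earlier
    one, which is then never reported.
    """
    best = None
    for sig in candidates:
        if sig["reportable"] and (best is None or sig["power"] > best["power"]):
            best = sig
    return best

candidates = [
    {"period": 8,  "power": 1.1, "reportable": True},
    {"period": 16, "power": 2.4, "reportable": True},   # replaces the period-8 one
    {"period": 32, "power": 3.0, "reportable": False},  # strong, but not reportable
]
print(best_reportable(candidates))
```

The bug class under discussion is when serial and parallel implementations cut this selection off at different points, so they disagree about which candidate "won".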
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.