Monitoring inconclusive GBT validations and harvesting data for testing

Author	Message
Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1823256 - Posted: 10 Oct 2016, 11:25:43 UTC - in response to Message 1823254. Last modified: 10 Oct 2016, 11:31:31 UTC
Wiki's Advanced Vector Extensions page says that for x86, FMA only became available with AVX2 - as your Intel blog reply already told us.
But again, it is hardly x86 that is used on the iGPU.
Hehe... https://software.intel.com/en-us/forums/opencl/topic/277001
> Unfortunately the offline compiler only displays CPU asm and we do not currently expose the graphics ISA from this tool, even though the ISA is available as part of the linux graphics documentation (http://intellinuxgraphics.org/documentation.html).
https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-skl-vol02a-commandreference-instructions.pdf
SETI apps news
We're not gonna fight them. We're gonna transcend them.

Raistmer	Message 1823261 - Posted: 10 Oct 2016, 11:38:10 UTC - in response to Message 1823256. Last modified: 10 Oct 2016, 12:05:39 UTC
Multiply Add
mad - Multiply Add. Source: EuIsa. Length Bias: 4.
The mad instruction takes component-wise multiplication of src1 and src2, adds the results with the corresponding src0 values, and then stores the final results in dst. The conditional modifier and saturation (.sat) must not be used when src1 or src2 are dwords.
Format: [(pred)] mad[.cmod] (exec_size) dst src0 src1 src2
Restriction: No explicit accumulator access because this is a three-source instruction. AccWrEn is allowed for implicitly updating the accumulator. All three-source instructions have certain restrictions, described in Instruction Formats.
Multiply Accumulate
mac - Multiply Accumulate. Source: EuIsa. Length Bias: 4.
The mac instruction takes component-wise multiplication of src0 and src1, adds the results with the corresponding accumulator values, and then stores the final results in dst.
Format: [(pred)] mac[.cmod] (exec_size) dst src0 src1
Programming Notes: When source and destination datatypes are different, the implied datatype for the accumulator operand is always the destination datatype.
Restriction: Accumulator is an implicit source and thus cannot be an explicit source operand.
Syntax:
    [(pred)] mac[.cmod] (exec_size) reg reg reg
    [(pred)] mac[.cmod] (exec_size) reg reg imm32
Pseudocode:
    Evaluate(WrEn);
    for ( n = 0; n < exec_size; n++ ) {
        if ( WrEn.chan[n] ) {
            dst.chan[n] = src0.chan[n] * src1.chan[n] + acc0.chan[n];
        }
    }
Multiply Add for Macro
madm - Multiply Add for Macro. Source: EuIsa. Length Bias: 4.
The madm instruction takes component-wise multiplication of src1 and src2, adds the results with the corresponding src0 values, and then stores the final results in dst. The source and destination operands have a higher precision carried in the exponent for this operation. The madm instruction is used for macro operations, where precision is accumulated over several instructions. This accumulation requires the exponent to increase by 2 extra bits across multiple madm operations. Refer to Macros Defined in 'Math' Section for usage and restrictions of this operation.
Format: [(pred)] madm[.cmod] (exec_size) dst src0 src1 src2
Restriction: Accumulator access is restricted to the sp...
As the Russian saying goes, "the devil himself would break a leg here" :D
And it seems there is no FMA per se, BTW. And the definition of MAD doesn't discuss any precision considerations. The pseudocode is simple:
    dst.chan[n] = src1.chan[n] * src2.chan[n] + src0.chan[n];
How rounding occurs is not specified. Bravo, Intel's manual writers, decades of development did not vanish... :P
EDIT2: Well, my timeslice for the iGPU is finished. I found no discussion of the precision of the iGPU. If someone finds one, please give the reference.
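To illustrate why the unspecified rounding matters, here is a minimal C sketch, assuming IEEE 754 single precision with round-to-nearest-even on the host CPU. It compares a product that is rounded to float before the add with one that is kept exact until the end. The function names are illustrative only, the single-rounding variant is emulated through double, and none of this says what the iGPU mad actually does.

```c
#include <stdio.h>

/* MUL then ADD: the product is rounded to float before the add.
   volatile keeps the compiler from contracting the two steps into one. */
float mul_then_add(float a, float b, float c)
{
    volatile float product = a * b;   /* first rounding  */
    volatile float sum = product + c; /* second rounding */
    return sum;
}

/* Single-rounding style: a float*float product is exact in double
   (24 + 24 significand bits fit in 53), so the product is not rounded
   before the add. This is an emulation; in rare double-rounding cases
   it can differ from a true fused multiply-add such as fmaf(). */
float single_rounding(float a, float b, float c)
{
    volatile double exact_product = (double)a * (double)b;
    volatile double sum = exact_product + (double)c;
    return (float)sum;
}

/* Prints both results for an input chosen to sit on a rounding tie. */
void show_rounding_difference(void)
{
    float a = 1.0f + 0x1p-12f;
    float c = -(1.0f + 0x1p-11f);
    printf("mul then add:    %a\n", (double)mul_then_add(a, a, c));
    printf("single rounding: %a\n", (double)single_rounding(a, a, c));
}
```

With a = b = 1 + 2^-12 and c = -(1 + 2^-11), the exact product is 1 + 2^-11 + 2^-24. That is a tie in float and rounds to 1 + 2^-11, so mul_then_add returns 0 while single_rounding keeps the 2^-24 term. Differences of this size, accumulated through an FFT, are the kind of thing that could plausibly push results toward inconclusive.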

MarkJ Volunteer tester Joined: 17 Feb 08 Posts: 1139 Credit: 80,854,192 RAC: 5	Message 1823266 - Posted: 10 Oct 2016, 12:11:39 UTC - in response to Message 1823248.
> So try test iGPU build.
8 cooking on beta using the 8.19 app as we speak. It will take a couple of hours for them to be done. All are Arecibo at the moment, no sign of guppies.
Tasks
BOINC blog

Raistmer	Message 1823268 - Posted: 10 Oct 2016, 12:21:01 UTC - in response to Message 1823266.
> So try test iGPU build.
> 8 cooking on beta using 8.19 app as we speak. Will take a couple of hours for them to be done. All are Arecibo at the moment, no sign of guppis. Tasks
The beta app will produce inconclusives (but check this as a baseline). Then check the test app: https://cloud.mail.ru/public/2aUP/dborYAw9G

Raistmer	Message 1823269 - Posted: 10 Oct 2016, 12:22:42 UTC Last modified: 10 Oct 2016, 12:24:41 UTC
And for reference: https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-hsw-commandreference-instructions_0_0.pdf
~~seems Haswell has lower number of MADs (only accumulate one)~~

jason_gee Volunteer developer Volunteer tester Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1823271 - Posted: 10 Oct 2016, 12:24:40 UTC - in response to Message 1823242.
> ... I think I'm reading that as Intel saying that the 'mad' optimisation happens automatically in the newer compilers, without any option? ...
With Cuda code (for comparison only), the instructions resolve to either MADs, FMAs, or Adds and Muls depending on architecture, yielding different results, so there are some similarities in the situation. It isn't as sensitive though, because the CUFFT library is used instead of a self-compiled OCLFFT. The way around this in Cuda is relatively simple. Using the example
    Answer_mul = float0 * float1;
    Answer_add = Answer_mul + float2;
it gets wired by hand in sensitive places, either as the corresponding intrinsics that generate explicit instructions, or as inline PTX assembly, which the compiler cannot optimise or change.
The OpenCL situation may be murkier, with its wider range of hardware. I would have expected Intel to have provided some math.h or similar in their SDK, with either intrinsic functions, an override switch for MADs, or vendor extensions with assembly... but it's not something I've looked at directly for the Intel case, due to not actively running such a GPU.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.

Raistmer	Message 1823272 - Posted: 10 Oct 2016, 12:30:50 UTC - in response to Message 1823271.
And that oclFFT (which we share with Einstein, btw) uses mad() in its code generator.
Well, I think it's time to look at the Skylake results from the modded binary I posted.

Raistmer	Message 1823274 - Posted: 10 Oct 2016, 12:42:04 UTC
Just for comparison with Intel, here is how the MAD description reads for the HD6900: http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf
Floating-Point Multiply-Add
Instruction: MULADD
Description: Floating-point multiply-add (MAD). Gives same results as ADD after MUL.
    dst = src0 * src1 + src2;
Microcode Format: ALU_WORD0 (page 9-23) and ALU_WORD1_OP3 (page 9-32).
Instruction Field: ALU_INST == OP3_INST_MULADD, opcode 20 (0x14).

Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14672 Credit: 200,643,578 RAC: 874	Message 1823296 - Posted: 10 Oct 2016, 15:01:17 UTC - in response to Message 1823243.
> I would prefer to find original Intel's thread about this issue.
Asked on Einstein's site about that already. Christian has replied:
> It was a direct mail exchange with an Intel developer where I got the explanation from.

Raistmer	Message 1823341 - Posted: 10 Oct 2016, 17:17:45 UTC - in response to Message 1823296.
> I would prefer to find original Intel's thread about this issue.
> Asked on Einstein's site about that already. Christian has replied: It was a direct mail exchange with an Intel developer where I got the explanation from.
So awaiting results from the new build.

Juha Volunteer tester Joined: 7 Mar 04 Posts: 388 Credit: 1,857,738 RAC: 0	Message 1823415 - Posted: 10 Oct 2016, 21:20:02 UTC - in response to Message 1823246.
> https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/mad.html
> https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/fma.html
> So, OpenCL specification distinguishes between these instruction indeed. MAD and FMA are different. BUT(!) they should be specified directly in code to be used, a*b+c will not be replaced automatically.
There's another post in the Einstein forums, "Intel GPU brp app returns incorrect results with beignet 1.2 drivers". In Beignet 1.2, FP_CONTRACT was switched to ON and the code generated for x*y+z was changed from MUL+ADD to MAD. (commit)
If I'm reading the FP_CONTRACT documentation correctly, it seems that implementations are supposed to use fused instructions unless told otherwise. What it doesn't say is whether FMA or MAD should be used, but since there's the -cl-mad-enable compiler option I suppose FMA should be used.
I can imagine the Windows drivers made a similar change earlier.

Raistmer	Message 1823428 - Posted: 10 Oct 2016, 22:32:51 UTC - in response to Message 1823415. Last modified: 10 Oct 2016, 22:42:58 UTC
Not sure that contracting an expression means using mad/fma. It's a more general term than just substituting MAD for a*b+c. AMD docs directly state that such replacement will not occur. But (as usual) it's worth a try. When I get feedback from the build already provided, I could try to disable this pragma too.
Also, it would be interesting to add printing of FMA status to CLinfo. What will we see for different devices/platforms?...
EDIT: from the committed code it looks like they now map mad to hardware mad instead of emulating it via mul and add. Still, that doesn't imply silent replacement of a*b+c with mad(a,b,c), but it will also change the behavior of a mad(a,b,c) call and, as I said earlier, oclFFT uses mad heavily.
The question to Intel is: why is their mad so imprecise versus the 2 other vendors?? (BTW, iGPU imprecision in native trigonometry was demonstrated by Einstein's team before, in oclFFT.)
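The CLinfo idea could be sketched roughly as follows. OpenCL reports floating-point capabilities through clGetDeviceInfo with CL_DEVICE_SINGLE_FP_CONFIG, and the CL_FP_FMA bit in that bitfield indicates fused multiply-add support. The C sketch below only decodes such a bitfield. The bit values are copied by hand from the Khronos CL/cl.h header so that it compiles without an OpenCL SDK, and the names fp_config_has_fma and print_fp_config are hypothetical, not part of any existing CLinfo tool.

```c
#include <stdio.h>

/* Local copies of the cl_device_fp_config bits from CL/cl.h
   (CL_FP_DENORM ... CL_FP_FMA). Assumption: a real build would
   include the header instead of repeating the values. */
enum {
    SKETCH_FP_DENORM           = 1 << 0,
    SKETCH_FP_INF_NAN          = 1 << 1,
    SKETCH_FP_ROUND_TO_NEAREST = 1 << 2,
    SKETCH_FP_ROUND_TO_ZERO    = 1 << 3,
    SKETCH_FP_ROUND_TO_INF     = 1 << 4,
    SKETCH_FP_FMA              = 1 << 5
};

/* Returns 1 if the bitfield advertises fused multiply-add, else 0. */
int fp_config_has_fma(unsigned long long fp_config)
{
    return (fp_config & SKETCH_FP_FMA) != 0;
}

/* Prints one summary line per device, in the spirit of a CLinfo dump. */
void print_fp_config(const char *device_name, unsigned long long fp_config)
{
    printf("%s: FMA %s, round-to-nearest %s, denormals %s\n",
           device_name,
           fp_config_has_fma(fp_config) ? "yes" : "no",
           (fp_config & SKETCH_FP_ROUND_TO_NEAREST) ? "yes" : "no",
           (fp_config & SKETCH_FP_DENORM) ? "yes" : "no");
}
```

In a real tool the fp_config argument would come from clGetDeviceInfo for each device. Note that an advertised FMA bit describes the fma() built-in; it would not by itself settle how a given driver implements mad().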

jason_gee	Message 1823482 - Posted: 11 Oct 2016, 3:35:21 UTC - in response to Message 1823428. Last modified: 11 Oct 2016, 3:39:29 UTC
> The question to Intel is: why their mad so imprecise versus 2 other vendors?
It can be as simple as a float implemented as a double-half-float hardware emulation sequence, or some other shortcut. Maybe they even use something like x87 80-bit intermediate registers underneath, or blocks of Pentium circuits with FDIV bugs (j/k).
That aside, using FMAs etc. changes algorithms and error growth, so you'll see different codelets even in the FFTW CPU sources to compensate. We don't completely escape problems on Cuda either, especially with pre-compute-capability-1.3 devices having neither doubles nor IEEE 754 compliance, and FMA not coming until much later. We escape a lot though, because of CUFFT's hard-wired paths, and we already use a fair whack of intrinsics (and gradually more assembly).

MarkJ	Message 1823524 - Posted: 11 Oct 2016, 11:47:36 UTC
> When I will get feedback from already provided build I could try to disable this pragma too.
Sorry for the delay. Work intervened. Another set is running using the new app. Same hosts as before. I also snagged some guppies this time.

MarkJ	Message 1823698 - Posted: 12 Oct 2016, 11:13:49 UTC - in response to Message 1823524.
> When I will get feedback from already provided build I could try to disable this pragma too.
> Sorry for the delay. Work intervened. Another set running using new app. Same hosts as before. I also snagged some guppies this time.
Looking through the results, it would seem the v8.19 tasks supplied by beta validate most of the time. The r3525 tasks are almost all inconclusive or invalid.

Juha	Message 1823790 - Posted: 12 Oct 2016, 19:40:23 UTC - in response to Message 1823428.
> EDIT: from comitted code looks like they map mad to hardware mad now instead of emulating it via mul and add. Still it doesn't imply silent replacement of a*b+c to mad(a,b,c) but also it will change behavior of mad(a,b,c) call and as I said earlier oclFFT heavely using mad.
I could be mistaken, but I think that silent replacement is really what it now does. LLVM uses fmuladd to let the code generator decide between using mul+add or fma. (llvm.fmuladd)
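Whether a compiler silently contracts a plain a*b+c can be probed empirically. The following is a minimal host-side C sketch, assuming IEEE 754 single precision with round-to-nearest-even. It tests the CPU compiler it is built with, not an OpenCL device compiler, and the function name is hypothetical. The same input triple could in principle be fed to a small kernel to probe a GPU driver.

```c
#include <stdio.h>

/* Returns 0 if the plain expression a*b+c behaved like separate MUL and
   ADD (product rounded first), 1 if it behaved like a single-rounding
   operation (fused, or evaluated in higher precision), and -1 if the
   result matches neither reference value. */
int plain_expression_is_contracted(void)
{
    /* volatile inputs stop the compiler from folding the result at
       build time. (1 + 2^-12)^2 = 1 + 2^-11 + 2^-24 is a rounding tie
       in float and rounds to 1 + 2^-11. */
    volatile float a = 1.0f + 0x1p-12f;
    volatile float b = 1.0f + 0x1p-12f;
    volatile float c = -(1.0f + 0x1p-11f);

    float x = a, y = b, z = c;
    float plain = x * y + z;   /* the compiler may fuse this, or not */

    if (plain == 0.0f)
        return 0;              /* product was rounded before the add */
    if (plain == 0x1p-24f)
        return 1;              /* the 2^-24 term survived */
    return -1;
}

/* Prints the outcome of the probe in words. */
void report_contraction(void)
{
    int r = plain_expression_is_contracted();
    printf("a*b+c was %s\n",
           r == 0 ? "kept as MUL+ADD" :
           r == 1 ? "contracted (single rounding)" : "inconclusive");
}
```

The result depends on the compiler, its flags and the target, which is the point: the source expression alone does not determine the arithmetic.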

Raistmer	Message 1823795 - Posted: 12 Oct 2016, 20:04:17 UTC - in response to Message 1823790.
> EDIT: from comitted code looks like they map mad to hardware mad now instead of emulating it via mul and add. Still it doesn't imply silent replacement of a*b+c to mad(a,b,c) but also it will change behavior of mad(a,b,c) call and as I said earlier oclFFT heavely using mad.
> I could be mistaken but I think that is really what it now does. LLVM uses fmuladd to let code generator decide between using mul+add or fma. llvm.fmuladd
Thanks, perhaps you are right. That means a detailed definition of the precision behavior is required for iGPU MAD/MAC/"macro MAD".
I'll try to disable the corresponding macro in the code and see if it helps.

Raistmer	Message 1824740 - Posted: 16 Oct 2016, 19:49:42 UTC - in response to Message 1823795.
New build: https://cloud.mail.ru/public/EbPU/q7ZKhRnYV
More details on beta: https://setiweb.ssl.berkeley.edu/beta//forum_thread.php?id=2266&postid=59828

Jeff Buck Volunteer tester Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0	Message 1824975 - Posted: 17 Oct 2016, 18:27:20 UTC
FWIW, I just had an overflow task running SoG r3528 get ganged up on by a pair of x41p_zi3j Petri Specials. It was really an extreme case where my host found 30 Pulses while the two Special hosts found 30 Triplets. The WU is 2295032503, although it's now too late to grab the file since I didn't spot the Inconclusive before the second Special host reported.

petri33 Volunteer tester Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156	Message 1825013 - Posted: 17 Oct 2016, 21:10:30 UTC - in response to Message 1824975. Last modified: 17 Oct 2016, 21:12:10 UTC
> FWIW, I just had an overflow task running SoG r3528 get ganged up on by a pair of x41p_zi3j Petri Specials. It was really an extreme case where my host found 30 Pulses while the two Special hosts found 30 Triplets. The WU is 2295032503, although it's now too late to grab the file since I didn't spot the Inconclusive before the second Special host reported.
Someone else could say this: I think it is a 'bad' packet having noisy data. This time it was reported as 'bad' by a different version of software looking into something else before looking into something different but still something 'broken'.
EDIT: and each time it could still be something, although probably noise.
To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
