Marked as Invalid, anyone know why?
Author | Message |
---|---|
Grant (SSSF) Joined: 19 Aug 99 Posts: 13727 Credit: 208,696,464 RAC: 304 |
This WU was marked as Invalid for me, but Validated for 2 other systems even though we all got the same counts. My output, Stderr output <core_client_version>7.0.64</core_client_version> <![CDATA[ <stderr_txt> setiathome_CUDA: Found 2 CUDA device(s): Device 1: GeForce GTX 750 Ti, 2048 MiB, regsPerBlock 65536 computeCap 5.0, multiProcs 5 pciBusID = 2, pciSlotID = 0 Device 2: GeForce GTX 750 Ti, 2048 MiB, regsPerBlock 65536 computeCap 5.0, multiProcs 5 pciBusID = 1, pciSlotID = 0 In cudaAcc_initializeDevice(): Boinc passed DevPref 2 setiathome_CUDA: CUDA Device 2 specified, checking... Device 2: GeForce GTX 750 Ti is okay SETI@home using CUDA accelerated device GeForce GTX 750 Ti pulsefind: blocks per SM 4 (Fermi or newer default) pulsefind: periods per launch 100 (default) Priority of process set to BELOW_NORMAL (default) successfully Priority of worker thread set successfully setiathome enhanced x41zc, Cuda 5.00 Detected setiathome_enhanced_v7 task. Autocorrelations enabled, size 128k elements. Work Unit Info: ............... WU true angle range is : 1.612047 Kepler GPU current clockRate = 1163 MHz re-using dev_GaussFitResults array for dev_AutoCorrIn, 4194304 bytes re-using dev_GaussFitResults+524288x8 array for dev_AutoCorrOut, 4194304 bytes Thread call stack limit is: 1k cudaAcc_free() called... cudaAcc_free() running... cudaAcc_free() PulseFind freed... cudaAcc_free() Gaussfit freed... cudaAcc_free() AutoCorrelation freed... cudaAcc_free() DONE. 
Flopcounter: 16689105997964.898000 Spike count: 8 Autocorr count: 0 Pulse count: 0 Triplet count: 3 Gaussian count: 0 Worker preemptively acknowledging a normal exit.-> called boinc_finish Exit Status: 0 boinc_exit(): requesting safe worker shutdown -> boinc_exit(): received safe worker shutdown acknowledge -> Cuda threadsafe ExitProcess() initiated, rval 0 </stderr_txt> ]]> Other systems output, number 1 <core_client_version>7.2.33</core_client_version> <![CDATA[ <stderr_txt> setiathome_v7 7.00 DevC++/MinGW/g++ 4.5.2 libboinc: 7.1.0 Work Unit Info: ............... WU true angle range is : 1.612047 Optimal function choices: -------------------------------------------------------- name timing error -------------------------------------------------------- v_BaseLineSmooth (no other) v_avxGetPowerSpectrum 0.000058 0.00000 avx_ChirpData_d 0.003476 0.00000 v_avxTranspose4x16ntw 0.001686 0.00000 JS AVX_a folding 0.000408 0.00000 Flopcounter: 16690882215684.797000 Spike count: 8 Autocorr count: 0 Pulse count: 0 Triplet count: 3 Gaussian count: 0 12:39:20 (6188): called boinc_finish </stderr_txt> ]]> Other systems output, number 2 <core_client_version>7.0.65</core_client_version> <![CDATA[ <stderr_txt> setiathome_v7 7.00 XCode GCC 4.0.1 (Apple Inc. build 5494) i386 libboinc: 7.0.58 libboinc: 7.0.58 Work Unit Info: ............... WU true angle range is : 1.612047 Optimal function choices: -------------------------------------------------------- name timing error -------------------------------------------------------- v_BaseLineSmooth (no other) v_vGetPowerSpectrumUnrolled 0.000102 0.00000 sse3_ChirpData_ak8 0.008154 0.00000 v_vTranspose4x16ntw 0.001978 0.00000 AK SSE folding 0.000626 0.00000 Flopcounter: 16690374967034.802734 Spike count: 8 Autocorr count: 0 Pulse count: 0 Triplet count: 3 Gaussian count: 0 11:35:05 (43827): called boinc_finish </stderr_txt> ]]> Grant Darwin NT |
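A quick way to confirm that the summary counters really do match is to pull them out of each stderr dump and compare them. The following is a minimal sketch; the helper name `parse_counts` is made up for illustration, and the two input strings are the summary lines quoted in the post above. Note that identical counts only show that the same number of signals was reported, not that the signals themselves agree.

```python
import re

# Matches the summary counters printed near the end of each stderr block.
COUNT_RE = re.compile(r"(Spike|Autocorr|Pulse|Triplet|Gaussian) count:\s*(\d+)")

def parse_counts(stderr_text):
    """Return {signal_type: count} for the summary counters in a stderr dump."""
    return {name: int(value) for name, value in COUNT_RE.findall(stderr_text)}

# Summary lines copied from the post above (GPU host and first wingman).
mine = ("Flopcounter: 16689105997964.898000 Spike count: 8 Autocorr count: 0 "
        "Pulse count: 0 Triplet count: 3 Gaussian count: 0")
wingman = ("Flopcounter: 16690882215684.797000 Spike count: 8 Autocorr count: 0 "
           "Pulse count: 0 Triplet count: 3 Gaussian count: 0")

counts_match = parse_counts(mine) == parse_counts(wingman)
print(parse_counts(mine))
print("counts match:", counts_match)
```

The Flopcounter values differ slightly between hosts, so the sketch deliberately ignores them and compares only the five signal counters.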
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Hmm, I got there too late to see the task page. I can't see anything in particular amiss in the stderr outputs, or with other tasks on your hosts/750 Tis. There are a few possibilities, leaving aside a server malfunction, a comms glitch, damage in the result file, cosmic rays flipping bits, or any number of tinfoil-hat scenarios. Half or more of those signals might be particularly close to threshold, so one could get swapped for some other signal that is also close to threshold. That can happen because of slight differences in the way CPUs and GPUs process data: small numerical variations run afoul of the project's use of hard thresholds as a go/no-go test. I'd keep an eye out for more of them, since a pattern would indicate a problem. The best way to analyse this kind of event is to grab a copy of the WU (off the server) before it disappears, so that it can be put under the microscope. That can reveal whether it's a one-off event, a limitation of the validation system, or some underlying problem in hardware or software, including the app or even drivers. Without a pattern of failure, any of those situations is possible. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
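The point about hard thresholds can be illustrated with a small Python sketch. It is not the project's detection code: the threshold value and the `is_signal` helper are assumptions chosen only to show how two numerically valid ways of accumulating the same data can land on opposite sides of a go/no-go test.

```python
import math

THRESHOLD = 1.0  # hypothetical detection threshold, for illustration only

def is_signal(power, threshold=THRESHOLD):
    """Hard go/no-go test: a value either reaches the threshold or it does not."""
    return power >= threshold

samples = [0.1] * 10

power_a = sum(samples)        # naive left-to-right accumulation
power_b = math.fsum(samples)  # accumulation that tracks the lost low-order bits

print(power_a, is_signal(power_a))  # 0.9999999999999999 False
print(power_b, is_signal(power_b))  # 1.0 True
```

Both results are correct to within floating-point rounding, yet only one of them would be reported as a signal. A CPU app and a GPU app that sum or transform data in different orders can differ in the same way for a signal sitting right at the threshold.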
Grant (SSSF) Joined: 19 Aug 99 Posts: 13727 Credit: 208,696,464 RAC: 304 |
I'll keep a closer eye on Invalids. I've had a couple of the truncated stderr outputs, but this is the first time I've noticed an Invalid when all the Counts matched. Grant Darwin NT |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Grant (SSSF) wrote: "I'll keep a closer eye on Invalids. I've had a couple of the truncated stderr outputs, but this is the first time I've noticed an Invalid when all the Counts matched." Yeah, that gives the gist of one suspicion: the same mechanisms that cause truncated stderr could sometimes affect the actual result files too. (In theory, that is; it needs a much harder chain of events to set up, so it is likely rarer.) As well as looking for more like that, I'd suggest checking system DPC latency figures while crunching ( http://www.thesycon.de/eng/latency_check.shtml ), just to put local issues like chipset drivers in the clear. Any runs of DPC spikes have the potential to run into Boinc's aggressive timeouts, silently terminating the OS threads that are busily writing files. With that in the clear, if recurrence is about every fortnight or so with no signs of system issues, you could switch to the special commode build at http://jgopt.org/download.html, which disables the application's use of filesystem buffering. That's not a fix for Boinc's outdated file-handling and thread-safety issues, but a workaround that could further reduce the probability of occurrences. Getting more familiar with rare events is challenging, but will likely become more pressing as newer models continue to increase throughput (multiplying the failure rate of any holes). "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
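The general idea of committing writes instead of leaving them in buffers can be sketched in Python. This is not the code of the build linked above, and the file name is hypothetical; it only shows why data that is still sitting in a buffer can be lost if the writing thread is terminated, and how forcing it out narrows that window.

```python
import os
import tempfile

def write_committed(path, data):
    """Write bytes and ask the OS to commit them to disk before returning."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # push the user-space buffer down to the OS
        os.fsync(f.fileno())  # ask the OS to flush its own cache to the device

# Hypothetical result file in a scratch directory.
path = os.path.join(tempfile.mkdtemp(), "result_example.txt")
payload = b"Spike count: 8\nTriplet count: 3\n"
write_committed(path, payload)

with open(path, "rb") as f:
    print(f.read() == payload)
```

Without the `flush` and `fsync` calls, a process killed right after `write` may leave a truncated or empty file on disk, which matches the truncated stderr symptom described above.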
Grant (SSSF) Joined: 19 Aug 99 Posts: 13727 Credit: 208,696,464 RAC: 304 |
Had the DPC latency checker running for the last couple of hours, and generally it's around 250 or less, with the odd one or 2 around 450 or so. The absolute maximum in that time was 684. Grant Darwin NT |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Grant (SSSF) wrote: "Had the DPC latency checker running for the last couple of hours, and generally it's around 250 or less, with the odd one or 2 around 450 or so." That indicates no specific driver-quality, firmware or hardware-induced issues creating delays (putting a lot of layers of stuff in the clear). From here, it would help if you managed to characterise (get a feel for the frequency of): - truncated stderr but otherwise OK; - truncated stderr with instant invalid; - invalid with apparently intact and OK-looking stderr content / signal counts (like this one); - something other than the above or total success, excluding dodgy-looking wingmen. Then comparing with the workaround commit mode build would be worth a try. It doesn't address the Boinc client side of the issues, but it does at least disable the OS buffering mechanisms in the app that Boinc runs afoul of. While the numbers are low now, I feel these reliability-related events tend to increase in frequency as total project throughput increases, so I will be including the anomalies in a raft of recommendations for Boincapi+client that I'm working on. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
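One way to get a feel for the frequency of each category is a simple tally over a hand-kept log of tasks. The sketch below is only an illustration: the record fields (`stderr_truncated`, `valid`, `wingman_suspect`) are invented for this example and do not come from Boinc, and the catch-all "something other" category is left out for brevity.

```python
from collections import Counter

def classify(task):
    """Map one hand-recorded task to one of the categories listed above."""
    if task["wingman_suspect"]:
        return "excluded (dodgy wingman)"
    if task["stderr_truncated"] and task["valid"]:
        return "truncated stderr, otherwise OK"
    if task["stderr_truncated"]:
        return "truncated stderr, instant invalid"
    if not task["valid"]:
        return "invalid with intact stderr"
    return "success"

# Hypothetical records; in practice these would be noted down per task.
tasks = [
    {"stderr_truncated": False, "valid": True,  "wingman_suspect": False},
    {"stderr_truncated": True,  "valid": True,  "wingman_suspect": False},
    {"stderr_truncated": True,  "valid": False, "wingman_suspect": False},
    {"stderr_truncated": False, "valid": False, "wingman_suspect": False},
    {"stderr_truncated": False, "valid": False, "wingman_suspect": True},
]

tally = Counter(classify(t) for t in tasks)
for category, count in tally.items():
    print(f"{category}: {count}")
```

Comparing such tallies before and after switching builds would show whether the workaround changes the rate of any one category.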
David S Joined: 4 Oct 99 Posts: 18352 Credit: 27,761,924 RAC: 12 |
I got an invalid too, on http://setiathome.berkeley.edu/workunit.php?wuid=1478241118. The _0 host got a -9 on it, and invalids are very rare for me, so I'm not worried about it, but if you want to look, there it is. David Sitting on my butt while others boldly go, Waiting for a message from a small furry creature from Alpha Centauri. |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
David S wrote: "I got an invalid too, on http://setiathome.berkeley.edu/workunit.php?wuid=1478241118. The _0 host got a -9 on it, and invalids are very rare for me, so I'm not worried about it, but if you want to look, there it is." That one looks like a case of instant invalid overflow, with the truncated stderr symptom. If you see a lot of these, switching to the workaround build linked earlier would probably stop them, though since the losses are slight it is probably not worth the effort unless you need to dig deeper. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
David S Joined: 4 Oct 99 Posts: 18352 Credit: 27,761,924 RAC: 12 |
Like I said, invalids are very rare for me. So rare that I'm slightly shocked when I get one. I'll get over it. David Sitting on my butt while others boldly go, Waiting for a message from a small furry creature from Alpha Centauri. |
BilBg Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0 |
'Completed, marked as invalid' ("instant invalid", 3rd task is still 'In progress') http://setiathome.berkeley.edu/result.php?resultid=3483427597 http://setiathome.berkeley.edu/workunit.php?wuid=1473624680 Was run on my old CPU K6-2+ @ 524 MHz Run time 768,454.81 CPU time 754,116.50 Stderr output <core_client_version>6.6.38</core_client_version> <![CDATA[ <stderr_txt> setiathome_v7 7.00 DevC++/MinGW/g++ 4.5.2 libboinc: 7.1.0 Work Unit Info: ............... WU true angle range is : 0.010866 Optimal function choices: -------------------------------------------------------- name timing error -------------------------------------------------------- v_BaseLineSmooth (no other) v_GetPowerSpectrum 0.009774 0.00000 fpu_opt_ChirpData 0.346757 0.00000 v_Transpose4 0.152320 0.00000 FPU opt folding 0.124074 0.00000 Restarted at 36.21 percent. Restarted at 37.75 percent. Restarted at 38.32 percent. Restarted at 38.91 percent. Restarted at 48.98 percent. Restarted at 50.31 percent. Restarted at 50.44 percent. Restarted at 50.44 percent. Restarted at 50.44 percent. Restarted at 50.65 percent. Restarted at 50.65 percent. Flopcounter: 44709614712063.109000 Spike count: 8 Autocorr count: 0 Pulse count: 6 Triplet count: 1 Gaussian count: 0 03:05:55 (-918629): called boinc_finish </stderr_txt> ]]> _______________________ Wingmate task is 'Completed, validation inconclusive': ..... Build features: SETI7 Non-graphics FFTW USE_AVX x86 CPUID: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz Cache: L1=64K L2=256K CPU features: FPU TSC PAE CMPXCHG8B APIC SYSENTER MTRR CMOV/CCMP MMX FXSAVE/FXRSTOR SSE SSE2 HT SSE3 SSSE3 SSE4.1 SSE4.2 AVX ar=0.010866 NumCfft=146007 NumGauss= 0 NumPulse= 400488430592 NumTriplet= 28745021997056 In v_BaseLineSmooth: NumDataPoints=1048576, BoxCarLength=8192, NumPointsInChunk=32768 Restarted at 12.76 percent. 
Pulse: peak=3.393877, time=53.74, period=8.271, d_freq=1419346340.17, score=1.042, chirp=24.809, fft_len=1024 Spike: peak=24.26264, time=48.65, d_freq=1419347356.97, chirp=-35.078, fft_len=32k Spike: peak=24.44669, time=48.65, d_freq=1419347356.95, chirp=-35.151, fft_len=32k Spike: peak=24.07589, time=22.65, d_freq=1419349590.17, chirp=42.827, fft_len=16k Spike: peak=24.76583, time=22.65, d_freq=1419349590.16, chirp=43.063, fft_len=16k Spike: peak=25.78738, time=21.81, d_freq=1419349553.88, chirp=43.152, fft_len=32k Spike: peak=26.55867, time=21.81, d_freq=1419349553.91, chirp=43.167, fft_len=32k Spike: peak=25.93216, time=21.81, d_freq=1419349553.93, chirp=43.181, fft_len=32k Spike: peak=24.23698, time=22.65, d_freq=1419349590.15, chirp=43.3, fft_len=16k Pulse: peak=1.953763, time=53.79, period=3.382, d_freq=1419348684.76, score=1.019, chirp=54.72, fft_len=2k Triplet: peak=12.55167, time=57.09, period=47.42, d_freq=1419347482.32, chirp=57.887, fft_len=512 Pulse: peak=0.8888392, time=53.69, period=0.9809, d_freq=1419344315.59, score=1.042, chirp=73.627, fft_len=64 Pulse: peak=1.650835, time=53.69, period=2.316, d_freq=1419341130.93, score=1.079, chirp=85.364, fft_len=32 Pulse: peak=6.232743, time=53.9, period=19.08, d_freq=1419348123.24, score=1.012, chirp=-99.218, fft_len=4k Best spike: peak=26.55867, time=21.81, d_freq=1419349553.91, chirp=43.167, fft_len=32k Best autocorr: peak=16.90317, time=73.82, delay=3.479, d_freq=1419345484.25, chirp=-2.965, fft_len=128k Best gaussian: peak=0, mean=0, ChiSq=0, time=-2.121e+011, d_freq=0, score=-12, null_hyp=0, chirp=0, fft_len=0 Best pulse: peak=1.650835, time=53.69, period=2.316, d_freq=1419341130.93, score=1.079, chirp=85.364, fft_len=32 Best triplet: peak=12.55167, time=57.09, period=47.42, d_freq=1419347482.32, chirp=57.887, fft_len=512 Flopcounter: 44703276014020.000000 Spike count: 8 Autocorr count: 0 Pulse count: 6 Triplet count: 1 Gaussian count: 0 Wallclock time elapsed since last restart: 10709.6 seconds 16:22:45 
(5868): called boinc_finish </stderr_txt> ]]> - ALF - "Find out what you don't do well ..... then don't do it!" :) |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13727 Credit: 208,696,464 RAC: 304 |
Here's another Invalid WU, that was Validated on other systems, although this one is different from the first one. Portion of my stderr_txt Spike count: 0 Autocorr count: 0 Pulse count: 0 Triplet count: 30 Gaussian count: 0 Portion of Validated stderr_txt Spike count: 0 Autocorr count: 0 Pulse count: 30 Triplet count: 0 Gaussian count: 0 For some reason my crunching resulted in 30 Triplets, the others got them as 30 Pulses. Grant Darwin NT |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
All the computers that ran the task 'completed' it in very few seconds, so the data was obviously heavily contaminated with RFI, or otherwise unprocessable. Also, the three computers all used different processing resources - ATI/OpenCL, NVidia/cuda, and CPU/FPU. Each developer has taken their own optimisation route, and in particular GPU optimisation requires converting the original serial code into small chunks ('kernels') which can be processed in parallel. The different architectures of the different processors can lead to these parallel kernels being processed in a different order. For full length tasks with few signals, a great deal of testing has been done to minimise the risk of validation failing because the signals are reported in the wrong order, but sorting out a massive flurry of signals in the first few seconds of a task is more difficult and, frankly, not worth the effort. Sorry you wasted six seconds of computing time :( |
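The effect of processing order on an overflowing task can be sketched in a few lines of Python. This is a toy model, not any app's real search loop: the cap of 30 is taken from the 30-signal counts quoted in this thread, and the lists of detections are invented to stand in for a task full of RFI.

```python
MAX_SIGNALS = 30  # assumed cap, based on the 30-signal counts quoted above

def run_until_overflow(detections, max_signals=MAX_SIGNALS):
    """Report detections in the order they are found, stopping at the cap."""
    return detections[:max_signals]

def tally(reported):
    """Count reported signals by type."""
    counts = {}
    for kind in reported:
        counts[kind] = counts.get(kind, 0) + 1
    return counts

# Hypothetical noisy task: far more candidates of each kind than the cap allows.
pulses = ["pulse"] * 40
triplets = ["triplet"] * 40

order_a = pulses + triplets   # one app happens to reach the pulses first
order_b = triplets + pulses   # another app happens to reach the triplets first

print(tally(run_until_overflow(order_a)))  # {'pulse': 30}
print(tally(run_until_overflow(order_b)))  # {'triplet': 30}
```

Both runs see the same underlying candidates, but because reporting stops at the cap, the order of processing decides which 30 are written out. That mirrors the 30 Triplets versus 30 Pulses mismatch described above.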
Grant (SSSF) Joined: 19 Aug 99 Posts: 13727 Credit: 208,696,464 RAC: 304 |
6 seconds of my computing time i'll never get back again... Guess i'll survive it, and manage to overcome the grief. Eventually. Grant Darwin NT |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.