Marked as Invalid, anyone know why?

Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1505428 - Posted: 18 Apr 2014, 2:34:51 UTC

This WU was marked as Invalid for me but Validated for 2 other systems, even though we all got the same counts.


My output,
Stderr output
<core_client_version>7.0.64</core_client_version>
<![CDATA[
<stderr_txt>
setiathome_CUDA: Found 2 CUDA device(s):
Device 1: GeForce GTX 750 Ti, 2048 MiB, regsPerBlock 65536
computeCap 5.0, multiProcs 5
pciBusID = 2, pciSlotID = 0
Device 2: GeForce GTX 750 Ti, 2048 MiB, regsPerBlock 65536
computeCap 5.0, multiProcs 5
pciBusID = 1, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 2
setiathome_CUDA: CUDA Device 2 specified, checking...
Device 2: GeForce GTX 750 Ti is okay
SETI@home using CUDA accelerated device GeForce GTX 750 Ti
pulsefind: blocks per SM 4 (Fermi or newer default)
pulsefind: periods per launch 100 (default)
Priority of process set to BELOW_NORMAL (default) successfully
Priority of worker thread set successfully

setiathome enhanced x41zc, Cuda 5.00

Detected setiathome_enhanced_v7 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is : 1.612047

Kepler GPU current clockRate = 1163 MHz

re-using dev_GaussFitResults array for dev_AutoCorrIn, 4194304 bytes
re-using dev_GaussFitResults+524288x8 array for dev_AutoCorrOut, 4194304 bytes
Thread call stack limit is: 1k
cudaAcc_free() called...
cudaAcc_free() running...
cudaAcc_free() PulseFind freed...
cudaAcc_free() Gaussfit freed...
cudaAcc_free() AutoCorrelation freed...
cudaAcc_free() DONE.

Flopcounter: 16689105997964.898000

Spike count: 8
Autocorr count: 0
Pulse count: 0
Triplet count: 3
Gaussian count: 0
Worker preemptively acknowledging a normal exit.->
called boinc_finish
Exit Status: 0
boinc_exit(): requesting safe worker shutdown ->
boinc_exit(): received safe worker shutdown acknowledge ->
Cuda threadsafe ExitProcess() initiated, rval 0

</stderr_txt>
]]>


Other systems output, number 1

<core_client_version>7.2.33</core_client_version>
<![CDATA[
<stderr_txt>
setiathome_v7 7.00 DevC++/MinGW/g++ 4.5.2
libboinc: 7.1.0

Work Unit Info:
...............
WU true angle range is : 1.612047
Optimal function choices:
--------------------------------------------------------
name timing error
--------------------------------------------------------
v_BaseLineSmooth (no other)
v_avxGetPowerSpectrum 0.000058 0.00000
avx_ChirpData_d 0.003476 0.00000
v_avxTranspose4x16ntw 0.001686 0.00000
JS AVX_a folding 0.000408 0.00000

Flopcounter: 16690882215684.797000

Spike count: 8
Autocorr count: 0
Pulse count: 0
Triplet count: 3
Gaussian count: 0
12:39:20 (6188): called boinc_finish

</stderr_txt>
]]>


Other systems output, number 2

<core_client_version>7.0.65</core_client_version>
<![CDATA[
<stderr_txt>
setiathome_v7 7.00 XCode GCC 4.0.1 (Apple Inc. build 5494) i386

libboinc: 7.0.58
libboinc: 7.0.58

Work Unit Info:
...............
WU true angle range is : 1.612047
Optimal function choices:
--------------------------------------------------------
name timing error
--------------------------------------------------------
v_BaseLineSmooth (no other)
v_vGetPowerSpectrumUnrolled 0.000102 0.00000
sse3_ChirpData_ak8 0.008154 0.00000
v_vTranspose4x16ntw 0.001978 0.00000
AK SSE folding 0.000626 0.00000

Flopcounter: 16690374967034.802734

Spike count: 8
Autocorr count: 0
Pulse count: 0
Triplet count: 3
Gaussian count: 0
11:35:05 (43827): called boinc_finish

</stderr_txt>
]]>
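A quick way to confirm that the summary counts really do match across hosts is to parse the "count" lines out of each stderr output and compare them. The sketch below is only an illustration: the host labels and the `signal_counts` helper are made up for this example, and the summary text is copied from the three outputs above.

```python
import re

# Matches the summary lines that appear near the end of each stderr output.
COUNT_RE = re.compile(
    r"^(Spike|Autocorr|Pulse|Triplet|Gaussian) count:\s*(\d+)\s*$",
    re.MULTILINE,
)

def signal_counts(stderr_text):
    """Return {signal type: count} parsed from the 'X count: N' lines."""
    return {name: int(n) for name, n in COUNT_RE.findall(stderr_text)}

# Summary block as it appears in all three stderr outputs above.
summary = """Spike count: 8
Autocorr count: 0
Pulse count: 0
Triplet count: 3
Gaussian count: 0
"""

# Hypothetical labels for the three hosts.
hosts = {"mine (CUDA)": summary, "other 1 (AVX)": summary, "other 2 (SSE3)": summary}
parsed = {host: signal_counts(text) for host, text in hosts.items()}

# True when every host reported identical counts.
all_match = len({tuple(sorted(c.items())) for c in parsed.values()}) == 1
print(all_match)
```

Matching counts only show that each host reported the same number of signals of each type. They say nothing about whether the individual signals in the result files agree, which is presumably what decides validation.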
Grant
Darwin NT
ID: 1505428 · Report as offensive
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1505524 - Posted: 18 Apr 2014, 8:21:01 UTC
Last modified: 18 Apr 2014, 8:29:15 UTC

Hmmm, I got there too late to see the task page. I can't see anything particularly amiss in the stderr outputs, or with other tasks on your hosts/750 Tis.

There are a few possibilities. Leaving aside a server malfunction, a comms glitch or damage to the result file, cosmic rays flipping bits, or any number of tin-foil-hat scenarios: half or more of those signals might be particularly close to threshold, and so got swapped with some other signal that was also close to threshold. That can happen due to slight differences in the way CPU and GPU process data, namely small numerical variations that run afoul of the project's use of hard thresholds as go/no-go.
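To make the threshold effect concrete, here is a minimal sketch. The threshold value and the peak powers are invented for illustration, and `reported` is a hypothetical helper, not anything from the SETI@home apps. It shows how two code paths can report the same number of signals while disagreeing about which ones they are.

```python
THRESHOLD = 24.0  # hypothetical go/no-go threshold, chosen for illustration

def reported(peaks, threshold=THRESHOLD):
    """Indices of candidate signals whose peak power exceeds the threshold."""
    return [i for i, peak in enumerate(peaks) if peak > threshold]

# The same three candidates as computed by two hypothetical code paths.
# The last two differ only in the sixth decimal place.
cpu_peaks = [24.26264, 24.000004, 23.999997]
gpu_peaks = [24.26264, 23.999998, 24.000002]

cpu_hits = reported(cpu_peaks)
gpu_hits = reported(gpu_peaks)

print(len(cpu_hits) == len(gpu_hits))  # counts agree
print(cpu_hits == gpu_hits)            # but the reported signals do not
```

With a hard cutoff, a difference far smaller than any sensible comparison tolerance is enough to swap one reported signal for another.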

I'd keep an eye out for more of them; a pattern would indicate a problem. The best way to analyse these kinds of events is to grab a copy of the WU (off the server) before it disappears, so that it can be put under the microscope. That can reveal whether it's a one-off event, a limitation of the validation system, or some underlying problem in hardware or software, including the app or even drivers. Without a pattern of failure, any of those situations is possible.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1505524 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1505532 - Posted: 18 Apr 2014, 9:02:31 UTC - in response to Message 1505524.  

I'll keep a closer eye on Invalids. I've had a couple of the truncated stderr outputs, but this is the first time I've noticed an Invalid when all the Counts matched.
Grant
Darwin NT
ID: 1505532 · Report as offensive
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1505541 - Posted: 18 Apr 2014, 10:04:05 UTC - in response to Message 1505532.  
Last modified: 18 Apr 2014, 10:06:54 UTC

Grant (SSSF) wrote: "I'll keep a closer eye on Invalids. I've had a couple of the truncated stderr outputs, but this is the first time I've noticed an Invalid when all the Counts matched."


Yeah, that gives the gist of one suspicion, in that the same mechanisms that cause truncated stderr could sometimes affect actual result files too. (In theory, but a much harder chain of events to set up, so likely rarer).

As well as looking for more like that, I'd suggest checking system DPC latency figures while crunching ( http://www.thesycon.de/eng/latency_check.shtml ), just to put local issues like chipset drivers in the clear. Any runs of DPC spikes have the potential to run into Boinc's aggressive timeouts, silently terminating the OS threads busily writing files.

With that in the clear, if recurrence is about every fortnight or so, with no signs of system issues, then you could switch to the special commode build at http://jgopt.org/download.html, which disables the application's use of filesystem buffering. That's not a fix for Boinc's outdated file-handling and thread safety issues, but a workaround that could further reduce the probability of occurrences.
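As a rough analogy for what "not relying on filesystem buffering" buys you, the sketch below writes a file and explicitly commits it to disk before returning. This is a Python illustration only: the thread does not describe how the workaround build is implemented, and the file name and `write_committed` helper are hypothetical.

```python
import os
import tempfile

def write_committed(path, text):
    """Write text and ask the OS to commit it to disk before returning."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
        f.flush()             # push Python's user-space buffer to the OS
        os.fsync(f.fileno())  # ask the OS to flush its cache to the device

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "result.txt")  # hypothetical file name
    write_committed(path, "Spike count: 8\n")
    with open(path, encoding="utf-8") as f:
        content = f.read()

print(content, end="")
```

The point of committing eagerly is that a thread terminated shortly after the write leaves less data stranded in buffers, at the cost of slower writes.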

Getting more familiar with rare events is challenging, but it will likely become more pressing as newer models continue to increase throughput (multiplying the failure rates of any holes).
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1505541 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1505879 - Posted: 19 Apr 2014, 1:22:34 UTC - in response to Message 1505541.  

Had the DPC latency checker running for the last couple of hours, and generally it's around 250 or less, with the odd one or 2 around 450 or so.
The absolute maximum in that time was 684.
Grant
Darwin NT
ID: 1505879 · Report as offensive
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1506021 - Posted: 19 Apr 2014, 12:28:15 UTC - in response to Message 1505879.  
Last modified: 19 Apr 2014, 12:29:41 UTC

Grant (SSSF) wrote: "Had the DPC latency checker running for the last couple of hours, and generally it's around 250 or less, with the odd one or 2 around 450 or so. The absolute maximum in that time was 684."


That indicates no specific driver-quality, firmware- or hardware-induced issues creating delays (which puts a lot of layers of stuff in the clear).

From here, if you manage to characterise (get a feel for frequency):
- truncated stderr but otherwise OK,
- truncated stderr with instant invalid,
- invalid with apparently intact & OK looking stderr content / signal counts (like this one)
- Something other than above or total success, excluding dodgy looking wingmen.

Then comparing with the workaround commit mode build would be worth a try. It doesn't address the Boinc client side of the issues, but does at least disable the OS buffering mechanisms in the app that Boinc runs afoul of.
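One way to get a feel for those frequencies is to tally each finished task into the buckets above. The sketch below assumes you note two things per task by hand (whether stderr was truncated and whether the task was marked invalid); the `classify` helper and the sample observations are hypothetical. The fourth bucket in the list needs human judgement and is not modelled here.

```python
from collections import Counter

def classify(stderr_truncated, marked_invalid):
    """Map two observations about a task onto the buckets listed above."""
    if stderr_truncated and marked_invalid:
        return "truncated stderr, instant invalid"
    if stderr_truncated:
        return "truncated stderr, otherwise OK"
    if marked_invalid:
        return "invalid, stderr intact"
    return "valid"

# Hypothetical observations: (stderr_truncated, marked_invalid) per task.
observed = [(False, False), (True, False), (False, True), (True, True), (False, False)]

tally = Counter(classify(truncated, invalid) for truncated, invalid in observed)
for bucket, n in tally.most_common():
    print(f"{bucket}: {n}")
```

Running the same tally before and after switching builds gives a like-for-like comparison, provided enough tasks are counted for rare events to show up at all.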

While the numbers are low now, I feel these reliability-related events tend to increase in frequency as total project throughput increases, so I will be including the anomalies in a raft of recommendations for Boincapi+client that I'm working on.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1506021 · Report as offensive
David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1506926 - Posted: 21 Apr 2014, 14:23:25 UTC

I got an invalid too, on http://setiathome.berkeley.edu/workunit.php?wuid=1478241118. The _0 host got a -9 on it, and invalids are very rare for me, so I'm not worried about it, but if you want to look, there it is.
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1506926 · Report as offensive
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1506941 - Posted: 21 Apr 2014, 15:40:53 UTC - in response to Message 1506926.  

David S wrote: "I got an invalid too, on http://setiathome.berkeley.edu/workunit.php?wuid=1478241118. The _0 host got a -9 on it, and invalids are very rare for me, so I'm not worried about it, but if you want to look, there it is."


That one looks like a case of instant invalid overflow, with the truncated stderr content symptom. If you see a lot of these, then switching to the workaround build mentioned/linked earlier would probably stop them, though since the losses are slight it's probably not worth the effort unless you need to dig deeper.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1506941 · Report as offensive
David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1506949 - Posted: 21 Apr 2014, 16:27:11 UTC - in response to Message 1506941.  

David S wrote: "I got an invalid too, on http://setiathome.berkeley.edu/workunit.php?wuid=1478241118. The _0 host got a -9 on it, and invalids are very rare for me, so I'm not worried about it, but if you want to look, there it is."

jason_gee wrote: "That one looks like a case of instant invalid overflow, with the truncated stderr content symptom. If you see a lot of these, then switching to the workaround build mentioned/linked earlier would probably stop them, though since the losses are slight it's probably not worth the effort unless you need to dig deeper."

Like I said, invalids are very rare for me. So rare that I'm slightly shocked when I get one. I'll get over it.
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1506949 · Report as offensive
BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1510139 - Posted: 30 Apr 2014, 3:44:42 UTC
Last modified: 30 Apr 2014, 3:55:59 UTC


'Completed, marked as invalid' ("instant invalid", 3rd task is still 'In progress')
http://setiathome.berkeley.edu/result.php?resultid=3483427597
http://setiathome.berkeley.edu/workunit.php?wuid=1473624680

Was run on my old CPU K6-2+ @ 524 MHz
Run time 768,454.81
CPU time 754,116.50


Stderr output

<core_client_version>6.6.38</core_client_version>
<![CDATA[
<stderr_txt>
setiathome_v7 7.00 DevC++/MinGW/g++ 4.5.2
libboinc: 7.1.0

Work Unit Info:
...............
WU true angle range is : 0.010866
Optimal function choices:
--------------------------------------------------------
name timing error
--------------------------------------------------------
v_BaseLineSmooth (no other)
v_GetPowerSpectrum 0.009774 0.00000
fpu_opt_ChirpData 0.346757 0.00000
v_Transpose4 0.152320 0.00000
FPU opt folding 0.124074 0.00000
Restarted at 36.21 percent.
Restarted at 37.75 percent.
Restarted at 38.32 percent.
Restarted at 38.91 percent.
Restarted at 48.98 percent.
Restarted at 50.31 percent.
Restarted at 50.44 percent.
Restarted at 50.44 percent.
Restarted at 50.44 percent.
Restarted at 50.65 percent.
Restarted at 50.65 percent.

Flopcounter: 44709614712063.109000

Spike count: 8
Autocorr count: 0
Pulse count: 6
Triplet count: 1
Gaussian count: 0
03:05:55 (-918629): called boinc_finish

</stderr_txt>
]]>

_______________________


Wingmate task is 'Completed, validation inconclusive':

.....
Build features: SETI7 Non-graphics FFTW USE_AVX x86
CPUID: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz

Cache: L1=64K L2=256K

CPU features: FPU TSC PAE CMPXCHG8B APIC SYSENTER MTRR CMOV/CCMP MMX FXSAVE/FXRSTOR SSE SSE2 HT SSE3 SSSE3 SSE4.1 SSE4.2 AVX
ar=0.010866 NumCfft=146007 NumGauss= 0 NumPulse= 400488430592 NumTriplet= 28745021997056
In v_BaseLineSmooth: NumDataPoints=1048576, BoxCarLength=8192, NumPointsInChunk=32768
Restarted at 12.76 percent.
Pulse: peak=3.393877, time=53.74, period=8.271, d_freq=1419346340.17, score=1.042, chirp=24.809, fft_len=1024
Spike: peak=24.26264, time=48.65, d_freq=1419347356.97, chirp=-35.078, fft_len=32k
Spike: peak=24.44669, time=48.65, d_freq=1419347356.95, chirp=-35.151, fft_len=32k
Spike: peak=24.07589, time=22.65, d_freq=1419349590.17, chirp=42.827, fft_len=16k
Spike: peak=24.76583, time=22.65, d_freq=1419349590.16, chirp=43.063, fft_len=16k
Spike: peak=25.78738, time=21.81, d_freq=1419349553.88, chirp=43.152, fft_len=32k
Spike: peak=26.55867, time=21.81, d_freq=1419349553.91, chirp=43.167, fft_len=32k
Spike: peak=25.93216, time=21.81, d_freq=1419349553.93, chirp=43.181, fft_len=32k
Spike: peak=24.23698, time=22.65, d_freq=1419349590.15, chirp=43.3, fft_len=16k
Pulse: peak=1.953763, time=53.79, period=3.382, d_freq=1419348684.76, score=1.019, chirp=54.72, fft_len=2k
Triplet: peak=12.55167, time=57.09, period=47.42, d_freq=1419347482.32, chirp=57.887, fft_len=512
Pulse: peak=0.8888392, time=53.69, period=0.9809, d_freq=1419344315.59, score=1.042, chirp=73.627, fft_len=64
Pulse: peak=1.650835, time=53.69, period=2.316, d_freq=1419341130.93, score=1.079, chirp=85.364, fft_len=32
Pulse: peak=6.232743, time=53.9, period=19.08, d_freq=1419348123.24, score=1.012, chirp=-99.218, fft_len=4k

Best spike: peak=26.55867, time=21.81, d_freq=1419349553.91, chirp=43.167, fft_len=32k
Best autocorr: peak=16.90317, time=73.82, delay=3.479, d_freq=1419345484.25, chirp=-2.965, fft_len=128k
Best gaussian: peak=0, mean=0, ChiSq=0, time=-2.121e+011, d_freq=0,
score=-12, null_hyp=0, chirp=0, fft_len=0
Best pulse: peak=1.650835, time=53.69, period=2.316, d_freq=1419341130.93, score=1.079, chirp=85.364, fft_len=32
Best triplet: peak=12.55167, time=57.09, period=47.42, d_freq=1419347482.32, chirp=57.887, fft_len=512


Flopcounter: 44703276014020.000000

Spike count: 8
Autocorr count: 0
Pulse count: 6
Triplet count: 1
Gaussian count: 0
Wallclock time elapsed since last restart: 10709.6 seconds

16:22:45 (5868): called boinc_finish

</stderr_txt>
]]>


 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
ID: 1510139 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1510677 - Posted: 1 May 2014, 8:49:34 UTC

Here's another WU that was marked Invalid for me but Validated on other systems, although this one is different from the first one.


Portion of my stderr_txt

Spike count: 0
Autocorr count: 0
Pulse count: 0
Triplet count: 30
Gaussian count: 0


Portion of Validated stderr_txt
Spike count: 0
Autocorr count: 0
Pulse count: 30
Triplet count: 0
Gaussian count: 0

For some reason my crunching resulted in 30 Triplets, whereas the others reported 30 Pulses.
Grant
Darwin NT
ID: 1510677 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1510682 - Posted: 1 May 2014, 9:11:56 UTC - in response to Message 1510677.  

All the computers that ran the task 'completed' it in very few seconds, so the data was obviously heavily contaminated with RFI, or otherwise unprocessable.

Also, the three computers all used different processing resources - ATI/OpenCL, NVidia/cuda, and CPU/FPU.

Each developer has taken their own optimisation route, and in particular GPU optimisation requires converting the original serial code into small chunks ('kernels') which can be processed in parallel. The different architectures of the different processors can lead to these parallel kernels being processed in a different order.

For full length tasks with few signals, a great deal of testing has been done to minimise the risk of validation failing because the signals are reported in the wrong order, but sorting out a massive flurry of signals in the first few seconds of a task is more difficult and, frankly, not worth the effort. Sorry you wasted six seconds of computing time :(
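The effect of processing order on a noisy task can be sketched as follows. Everything here is an assumption for illustration: the cap of 30 reported signals is inferred from the counts Grant quoted, and the signal lists and `first_reported` helper are invented. It is not a model of how any of the real apps schedule their work.

```python
MAX_SIGNALS = 30  # assumed cap, inferred from the counts of 30 quoted above

def first_reported(found_in_order, cap=MAX_SIGNALS):
    """Keep only the first `cap` signals, in the order the app found them."""
    return found_in_order[:cap]

# Hypothetical noisy task: 40 pulses and 40 triplets all exceed threshold.
pulses = [("pulse", i) for i in range(40)]
triplets = [("triplet", i) for i in range(40)]

pulses_first = pulses + triplets    # one code path finds the pulses first
triplets_first = triplets + pulses  # another finds the triplets first

a = first_reported(pulses_first)
b = first_reported(triplets_first)

print(len(a), len(b))         # both report the same number of signals
print(len(set(a) & set(b)))   # number of signals the two reports share
```

When a task runs to completion without hitting the cap, sorting both signal lists before comparing removes the ordering problem. Once the cap cuts the list short, the two reports can contain entirely different signals, and no reordering can reconcile them.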
ID: 1510682 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1511154 - Posted: 2 May 2014, 7:35:32 UTC - in response to Message 1510682.  

6 seconds of my computing time I'll never get back again...
Guess I'll survive it, and manage to overcome the grief.
Eventually.
Grant
Darwin NT
ID: 1511154 · Report as offensive



 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.