Message boards :
Number crunching :
Monitoring inconclusive GBT validations and harvesting data for testing
petri33 Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
@TBar I did change the chirp and powerspectrum code. That may explain the autocorr error reappearance. To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
@TBar This is the oldest one I've found, Received: 17 Aug 2016, 16:07:35 UTC. That's about the time I started using r3516, and I thought it was a day or two before I started using the Blocking Sync. That's about all I can remember. Notice how the times are all the same? time=6.711 - the same time on all of them. Looks like more GUPPI Inconclusives now... |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
> Jason, could you email me the problematic WU and the correct result file to compare to so I can start debugging?

Jeff's WU comes up roses with the (before latest minor pulse commit) Windows build and your code with my blocking sync tweaks... (Strongly Similar, Q=99.74%, same as baseline). We'll need to find other culprits for the special Cuda builds. [Also very fast ;) ] Will probably attempt another live run with the minor update later, and collect some inconclusives (more relevant to us) then. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
petri33 Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
I may have found it. It could be the autocorrelation blocking sync added a few days ago. I thought it started before then, but, maybe not. I removed it from the AC section and for now it seems alright, but, it's working GUPPIs now, and the problem seems to occur on the Arecibo tasks AFAIK. I'll check that I don't use the same event in two places... To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14672 Credit: 200,643,578 RAC: 874 |
v. quick dump of Workunit 2239977895 before outage.

setiathome v8 enhanced x41p_zi3d, Cuda 7.50 special
Compiled with NVCC 8.0, using 6.5 libraries.
Modifications done by petri33.
Detected setiathome_enhanced_v8 task.
Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is : 0.008105
Sigma 664
Sigma > GaussTOffsetStop: 664 > -600
Thread call stack limit is: 1k
Pulse: peak=3.291499, time=45.86, period=5.973, d_freq=1135499141.05, score=1.016, chirp=9.0753, fft_len=1024
Spike: peak=24.2199, time=51.54, d_freq=1135495569.87, chirp=-20.571, fft_len=128k
Spike: peak=24.15853, time=51.54, d_freq=1135495569.87, chirp=-20.576, fft_len=128k
Pulse: peak=0.5418959, time=45.82, period=0.4743, d_freq=1135499375.6, score=1.002, chirp=-71.16, fft_len=256
Pulse: peak=0.8455893, time=45.82, period=0.8212, d_freq=1135493846.36, score=1.01, chirp=75.487, fft_len=128
Triplet: peak=12.39966, time=40.62, period=29.04, d_freq=1135497387.5, chirp=92.254, fft_len=1024
Pulse: peak=5.972482, time=45.86, period=14.52, d_freq=1135497870.4, score=1.068, chirp=92.254, fft_len=1024
Pulse: peak=7.48679, time=45.99, period=19.69, d_freq=1135491064.1, score=1.012, chirp=-97.618, fft_len=4k
Pulse: peak=4.357534, time=45.99, period=9.485, d_freq=1135495952.73, score=1.032, chirp=99.617, fft_len=4k
Pulse: peak=4.29636, time=45.99, period=9.485, d_freq=1135495951.97, score=1.018, chirp=99.661, fft_len=4k
Pulse: peak=4.288888, time=45.99, period=9.485, d_freq=1135495952.68, score=1.016, chirp=99.676, fft_len=4k
cudaAcc_free() called...
cudaAcc_free() running...
cudaAcc_free() PulseFind freed...
cudaAcc_free() Gaussfit freed...
cudaAcc_free() AutoCorrelation freed...
1,2,3,4,5,6,7,8,9,10,10,11,12,cudaAcc_free() DONE.
13
Best spike: peak=24.2199, time=51.54, d_freq=1135495569.87, chirp=-20.571, fft_len=128k
Best autocorr: peak=17.20775, time=74.45, delay=2.4269, d_freq=1135495546.76, chirp=17.658, fft_len=128k
Best gaussian: peak=0, mean=0, ChiSq=0, time=-2.123e+11, d_freq=0, score=-12, null_hyp=0, chirp=0, fft_len=0
Best pulse: peak=5.972482, time=45.86, period=14.52, d_freq=1135497870.4, score=1.068, chirp=92.254, fft_len=1024
Best triplet: peak=12.39966, time=40.62, period=29.04, d_freq=1135497387.5, chirp=92.254, fft_len=1024
Flopcounter: 43540820050625.507812
Spike count: 2
Autocorr count: 0
Pulse count: 8
Triplet count: 1
Gaussian count: 0

setiathome enhanced x41zi (baseline v8), Cuda 5.00
setiathome_v8 task detected
Detected Autocorrelations as enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is : 0.008105
GPU current clockRate = 927 MHz
re-using dev_GaussFitResults array for dev_AutoCorrIn, 4194304 bytes
re-using dev_GaussFitResults+524288x8 array for dev_AutoCorrOut, 4194304 bytes
Thread call stack limit is: 1k
cudaAcc_free() called...
cudaAcc_free() running...
cudaAcc_free() PulseFind freed...
cudaAcc_free() Gaussfit freed...
cudaAcc_free() AutoCorrelation freed...
cudaAcc_free() DONE.
Flopcounter: 43540820050625.508000
Spike count: 2
Autocorr count: 0
Pulse count: 9
Triplet count: 1
Gaussian count: 0

Build features: SETI8 Non-graphics OpenCL USE_OPENCL_NV OCL_ZERO_COPY SIGNALS_ON_GPU OCL_CHIRP3 FFTW USE_SSE3 x86
CPUID: Intel(R) Core(TM) i5-4690 CPU @ 3.50GHz
Cache: L1=64K L2=256K
Build features: SETI8 Non-graphics OpenCL USE_OPENCL_NV OCL_ZERO_COPY SIGNALS_ON_GPU OCL_CHIRP3 FFTW USE_SSE3 x86
CPUID: Intel(R) Core(TM) i5-4690 CPU @ 3.50GHz
Cache: L1=64K L2=256K
CPU features: FPU TSC PAE CMPXCHG8B APIC SYSENTER MTRR CMOV/CCMP MMX FXSAVE/FXRSTOR SSE SSE2 HT SSE3 SSSE3 FMA3 SSE4.1 SSE4.2 AVX
OpenCL-kernels filename : MultiBeam_Kernels_r3500.cl
ar=0.008105 NumCfft=124611 NumGauss=0 NumPulse=55685480320 NumTriplet=68668147872
Currently allocated 201 MB for GPU buffers
In v_BaseLineSmooth: NumDataPoints=1048576, BoxCarLength=8192, NumPointsInChunk=32768
Windows optimized setiathome_v8 application
Based on Intel, Core 2-optimized v8-nographics V5.13 by Alex Kan
SSE3xj Win32 Build 3500 , Ported by : Raistmer, JDWhale
SETI8 update by Raistmer
OpenCL version by Raistmer, r3500
Used GPU device parameters are:
Number of compute units: 13
Single buffer allocation size: 128MB
Total device global memory: 4096MB
max WG size: 1024
local mem type: Real
FERMI path used: yes
LotOfMem path: yes
LowPerformanceGPU path: no
HighPerformanceGPU path: no
period_iterations_num=50
Pulse: peak=0.5418961, time=45.82, period=0.4743, d_freq=1135499375.6, score=1.002, chirp=-71.16, fft_len=256
Pulse: peak=0.8455899, time=45.82, period=0.8212, d_freq=1135493846.36, score=1.01, chirp=75.487, fft_len=128
Pulse: peak=3.683052, time=45.9, period=8.881, d_freq=1135491251.54, score=1.006, chirp=92.104, fft_len=2k
Triplet: peak=12.39965, time=40.62, period=29.04, d_freq=1135497387.5, chirp=92.254, fft_len=1024
Pulse: peak=5.972462, time=45.86, period=14.52, d_freq=1135497870.4, score=1.068, chirp=92.254, fft_len=1024
Pulse: peak=7.486795, time=45.99, period=19.69, d_freq=1135491064.1, score=1.012, chirp=-97.618, fft_len=4k
Pulse: peak=4.357523, time=45.99, period=9.485, d_freq=1135495952.73, score=1.032, chirp=99.617, fft_len=4k
Pulse: peak=4.296352, time=45.99, period=9.485, d_freq=1135495951.97, score=1.018, chirp=99.661, fft_len=4k
Pulse: peak=4.288878, time=45.99, period=9.485, d_freq=1135495952.68, score=1.016, chirp=99.676, fft_len=4k
Best spike: peak=24.21985, time=51.54, d_freq=1135495569.87, chirp=-20.571, fft_len=128k
Best autocorr: peak=17.20776, time=74.44, delay=2.4269, d_freq=1135495546.76, chirp=17.658, fft_len=128k
Best gaussian: peak=0, mean=0, ChiSq=0, time=-2.123e+011, d_freq=0, score=-12, null_hyp=0, chirp=0, fft_len=0
Best pulse: peak=5.972462, time=45.86, period=14.52, d_freq=1135497870.4, score=1.068, chirp=92.254, fft_len=1024
Best triplet: peak=12.39965, time=40.62, period=29.04, d_freq=1135497387.5, chirp=92.254, fft_len=1024
Flopcounter: 4458567510473.013700
Spike count: 2
Autocorr count: 0
Pulse count: 9
Triplet count: 1
Gaussian count: 0 |
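For anyone harvesting inconclusives like the dumps above, one rough way to compare two results numerically, rather than eyeballing the stderr text, is to parse the key=value fields and allow a small relative tolerance. This is only an illustrative sketch (the function names and the tolerance are mine; the project's real validator applies its own signal-matching rules):

```python
import re

def parse_signal(line):
    """Parse 'key=value' fields from a stderr signal line.

    Values are converted to float where possible; tokens like
    fft_len=128k stay as strings."""
    fields = {}
    for key, val in re.findall(r"(\w+)=([^,\s]+)", line):
        try:
            fields[key] = float(val)
        except ValueError:
            fields[key] = val
    return fields

def similar(a, b, rel_tol=1e-4):
    """True if every field present in both signals agrees within rel_tol."""
    for key in a.keys() & b.keys():
        x, y = a[key], b[key]
        if isinstance(x, float) and isinstance(y, float):
            if abs(x - y) > rel_tol * max(abs(x), abs(y), 1.0):
                return False
        elif x != y:
            return False
    return True

# The two 'Best pulse' lines from the special and baseline dumps above:
special = parse_signal("Best pulse: peak=5.972482, time=45.86, period=14.52, "
                       "d_freq=1135497870.4, score=1.068, chirp=92.254, fft_len=1024")
baseline = parse_signal("Best pulse: peak=5.972462, time=45.86, period=14.52, "
                        "d_freq=1135497870.4, score=1.068, chirp=92.254, fft_len=1024")
print(similar(special, baseline))  # tiny GPU rounding differences pass
```

The small peak difference (5.972482 vs 5.972462) falls well inside the tolerance, which is why such pairs validate while a wildly wrong autocorrelation peak would not.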
Richard Haselgrove Joined: 4 Jul 99 Posts: 14672 Credit: 200,643,578 RAC: 874 |
Also v. quick before outage, a preliminary list of the reasons why tasks were sent to tie-breakers:

anon linux Petri v anon cuda50
Opti ATi_HD5 v mac intel_gpu
Petri v opt intel_gpu v opt nvidia SoG
Stock CPU v stock Apple CPU
stock ati5_SoG v mac intel_gpu
Stock CPU v intel_gpu
Stock CPU v stock cuda overflow
stock CPU v stock cuda32 overflow
Stock CPU v stock cuda50 overflow
Stock CPU v stock intel_gpu
Stock CPU v stock mac CPU overflow
Stock CPU v stock mac intel_gpu
Stock CPU v stock nvidia_mac
Stock CPU v stock nvidia_mac
Stock CPU v stock nvidia_mac
Stock CPU v stock nvidia_mac
Stock CPU v stock nvidia_mac v petri special
Stock CPU v stock nvidia_SoG
Stock CPU v stock nvidia_SoG
Stock CPU v stock SoG - SoG only overflow
Stock CPU v cuda50 late overflow
Stock linux CPU v cuda42
Stock linux CPU v stock nvidia_mac
Stock nvidia_mac v stock ati5_mac
stock nvidia_mac v stock mac intel_gpu
anon Sog v stock
Stock nvidia_sah v stock nvidia_mac
Stock nvidia_sah v stock_cuda42
Stock nvidia_SoG v stock cuda42
Stock nvidia_SoG v stock intel_gpu
Stock v stock nvidia_mac

I'll normalise those descriptions and count them while we're off. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
> I may have found it. It could be the autocorrelation blocking sync added a few days ago. I thought it started before then, but, maybe not. I removed it from the AC section and for now it seems alright, but, it's working GUPPIs now, and the problem seems to occur on the Arecibo tasks AFAIK.

I'm still receiving the Unreal autocorrelation peaks: http://setiathome.berkeley.edu/result.php?resultid=5111957533 That would indicate the problem isn't the 3515 change or the AC Blocking Sync. I'm open to suggestions. The latest version does seem to be working better with the GUPPIs, though. Concerning the Inconclusives with my ATI Linux machine against the Intel iGPUs, it seems the ATI running Raistmer's r3505 was given the canonical result in both cases: http://setiathome.berkeley.edu/workunit.php?wuid=2239946523 http://setiathome.berkeley.edu/workunit.php?wuid=2233419202 |
petri33 Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
> Also v. quick before outage, preliminary list of the reasons why tasks were sent to tie-breakers.

And the outcome %, or the deemed-invalid rate %, would be nice too. To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14672 Credit: 200,643,578 RAC: 874 |
I got this far before deciding to go out for a meal instead:

Reason | NumberOfDups
Stock CPU | 18
Error | 12
stock nvidia_mac | 10
Timed out | 9
Aborted | 5
Abandoned | 5
stock nvidia_SoG | 4
stock intel_gpu | 2
Stock nvidia_sah | 2
stock mac intel_gpu | 2
Stock linux CPU | 2
mac intel_gpu | 2
opt nvidia SoG | 1
opt intel_gpu | 1
intel_gpu | 1
Opti ATi_HD5 | 1
anon Sog | 1
Stock ati5_SoG | 1
anon linux Petri | 1
anon cuda50 | 1
cuda42 | 1
Petri | 1
petri special | 1
stock | 1
stock ati5_mac | 1
cuda50 late overflow | 1
stock cuda overflow | 1
stock cuda32 overflow | 1
stock cuda42 | 1
stock cuda50 overflow | 1
stock mac CPU overflow | 1
stock SoG - SoG only overflow | 1
stock_cuda42 | 1
stock Apple CPU | 1

'Reason' is the reason for the resend - so the biggest problem is hard error/timeout/abandoned/aborted. Inconclusives themselves aren't a huge volume. The next highest problem group is Stock CPU. I think we can agree that's because an inconclusive has to be one of a pair (at least). If stock is defined (and believed) to be the gold standard, then it's the paired apps which need attention - and nvidia_mac is the largest by far. As regards final outcomes - it's the cuda apps which most often get invalidated, usually because if they go wrong, they go badly wrong with false early overflows. Most of the others get awarded consolation credits for a near (weak) miss. I'll keep working on the data I have already, and try to collect more examples when I can, but it's harder now the resends are hiding among so many first-run tasks. I have kept local copies of the web pages for the workunit displays, so I can identify the host involved in each case - if we end up needing to go that far. |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
> As regards final outcomes - it's the cuda apps which most often get invalidated, usually because if they go wrong, they go badly wrong with false early overflows.

Yes, that's a characteristic I noticed seems to apply to some factory/aggressively overclocked, otherwise overheating, or broken-in-some-other-way Cuda hosts. I've been trying to maintain (or rather not break) that behaviour with stock Cuda, such that they either error/restart due to Cuda runtime failure (like my Linux/680 did for a bit, a couple of weeks back), or tend to a rather obvious (to the validator) bad similarity - so for the most part Boinc is doing its job. Raistmer's description of some sanity checks did spark off thinking about it again (toward x42); however, the potential issue there is a bit different, in that only spikes have a fairly predictable profile with regard to changing telescopes and possible upcoming dataset size changes (which will probably break any such hardcoded magic numbers or restraints lurking). Biting the bullet, and since x42 is to be a clean slate anyway, the upcoming performance improvements will allow for something a little more sophisticated. Namely, this would involve some level of aerospace-style internal redundancy (i.e. spot checks of results with different code on a different device, i.e. the CPU). This could potentially trigger optional warnings, alerts, or 'safe' operating modes, so that the user would not be required to babysit hidden boinc logs, or dig through stderrs online etc, just to know they need to clean their fans. That's where X-branch was headed anyway, although it's only now, with help on the optimisation side, that I'm freed up enough to consider all the angles to the degree necessary. That's why I consider x41 more or less closed, as the traditional boinc app-based infrastructure and client have none of the features it's going to need to cope with truly high-performance modern demands.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Kiska Joined: 31 Mar 12 Posts: 302 Credit: 3,067,762 RAC: 0 |
OK, since I don't have this data file anymore, I am going to ask the recipient of the 4th task sent out if he/she could perhaps upload it, so we may have some form of analysis. Workunit in question |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
> OK since I don't have this data file anymore I am going to ask the 4th task sent out if he/she could perhaps upload it so we may have some form of analysis.

I'd wait with that one. Cursory look says #1 = broken 8400GS, #2 = your stock CPU (which looks likely OK), then Petri special with some known issues to iron out. Your CPU result will likely become canonical. [Edit:] Might help if Richard has his looky uppy downloader thingy on hand, as Petri is looking for such variants at the moment. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Kiska Joined: 31 Mar 12 Posts: 302 Credit: 3,067,762 RAC: 0 |
I would download it if I knew what the url looks like. Maybe I can try some combination and get the file, but unlikely |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
> I would download it if I knew what the url looks like. Maybe I can try some combination and get the file, but unlikely

Richard has a spreadsheet thing, which I don't have on hand right now myself, but I believe it's in the Lunatics downloads. In any case, Richard's monitoring the thread, so he could possibly grab it and forward it to Petri if it looks suitable (as I suspect). I'd do it myself, but I'm dealing with some home stuff, so just checking in. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Kiska Joined: 31 Mar 12 Posts: 302 Credit: 3,067,762 RAC: 0 |
Unfortunately, I have just downloaded the spreadsheet and it gives an incorrect fanout address, namely this: http://boinc2.ssl.berkeley.edu/sah/download_fanout/35a/blc5 - which is not correct. |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
> Unfortunately I have just downloaded the spreadsheet and it gives an incorrect fanout address. that being this: http://boinc2.ssl.berkeley.edu/sah/download_fanout/35a/blc5

Yeah, something's missing there. I'll probably be able to poke around before Richard appears. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Kiska Joined: 31 Mar 12 Posts: 302 Credit: 3,067,762 RAC: 0 |
And it's not just this workunit having problems - all the guppi ones are. All blc units give the same fanout address. |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Yeah, it looks like the md5 hash is basing the fanout folder only on the text 'blc5' instead of the whole filename. [Edit:] abort, checking what I got "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Kiska Joined: 31 Mar 12 Posts: 302 Credit: 3,067,762 RAC: 0 |
Failure |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Must have pasted the wrong address: http://boinc2.ssl.berkeley.edu/sah/download_fanout/374/blc5_2bit_guppi_57451_69044_HIP117559_OFF_0022.7362.831.18.27.38.vlar What I get for having too many tabs open ;) [Edit] The subfolder is the 6th, 7th, and 8th digits of the md5 checksum of the filename: http://md5.gromweb.com/?string=blc5_2bit_guppi_57451_69044_HIP117559_OFF_0022.7362.831.18.27.38.vlar ... 374 "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
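The fanout rule described above is easy to reproduce. A small sketch, taking the rule from the post (subfolder = the 6th through 8th hex digits of the MD5 of the whole filename); the helper name and URL assembly are mine, illustrative rather than an official API:

```python
import hashlib

BASE = "http://boinc2.ssl.berkeley.edu/sah/download_fanout"

def fanout_url(filename):
    """Build the download URL for a SETI@home data file.

    The fanout subfolder is characters [5:8] (the 6th-8th hex digits)
    of the MD5 of the *whole* filename - hashing only the 'blc5'
    prefix is what produced the wrong '35a' folder earlier."""
    digest = hashlib.md5(filename.encode()).hexdigest()
    return "%s/%s/%s" % (BASE, digest[5:8], filename)

name = "blc5_2bit_guppi_57451_69044_HIP117559_OFF_0022.7362.831.18.27.38.vlar"
print(fanout_url(name))
```

For the workunit in this thread the computed subfolder should come out as 374, matching the working address jason posted.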
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.