Monitoring inconclusive GBT validations and harvesting data for testing
Author | Message |
---|---|
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
I opened SETI general fanouts.xls with gedit, searched for (. and it gave one hit on line 218; Aev*o\85V\92O)$Qptr)Inte\D8ger#y02Je\8F\B6 \C1(\E1blc\C2 \F1= In\00\C8+\90(.") I dunno.... |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14672 Credit: 200,643,578 RAC: 874 |
That's nothing like either the code I wrote or the MD5 code I borrowed off the internet. My own research (including updating LibreOffice from v5.0.4.2 to v5.1.5.2, and being forced to reboot afterwards) found it was actually objecting to the (), as in: `Private Function ConvertToWordArray(sMessage As String) As Long() ' RETURN VALUE:` If it doesn't recognise a Long array (an array of long integers) as a valid return type, I think we're stuffed. Later on, it objects to the internal function AscB(), as in: "The AscB function is used with byte data contained in a string. Instead of returning the character code for the first character, AscB returns the first byte." I think we're doubly stuffed: LibreOffice's VBA interpreter doesn't seem to be as compatible as OpenOffice's was. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14672 Credit: 200,643,578 RAC: 874 |
Developers - please be careful how you use punctuation in GUI error messages. That dratted period is nothing to do with the case. Just wanted to get that off my chest. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
As I suspected, that was a huge waste of time. Same as last time. When run a second time the Autocorrelation Extreme Peaks don't appear. Oh well. I've noticed the 750Ti is again producing Invalid Overflows just as it did with driver 7.5.27 in Darwin 14.5. Time to go back to driver 8.0.29 in Darwin 15.4 where the only worry is AC Peaks. It would appear x41p_zi3g succeeded in eliminating the frequent 750Ti stalls with driver 7.5.x only to replace them with infrequent 750Ti Overflows. KWSN-Darwin-MBbench v2.1.07 Running on TomsMacPro.local at Thu Sep 8 18:46:24 2016 --------------------------------------------------- Starting benchmark run... --------------------------------------------------- Listing wu-file(s) in /testWUs : 11au16aa.28481.85822.12.39.56.wu 21au10ag.26901.20545.14.41.41.wu blc5_2bit_guppi_57449_43932_HIP78775_0013.26700.831.18.27.53.vlar.wu blc5_2bit_guppi_57449_45355_HIP81348_0017.14390.416.17.26.21.vlar.wu blc5_2bit_guppi_57449_45695_HIP81348_OFF_0018.29291.831.17.26.236.vlar.wu Listing executable(s) in /APPS : setiathome_x41p_zi3g_x86_64-apple-darwin_cuda75 Listing executable in /REF_APPs : MBv8_8.05r3344_sse41_x86_64-apple-darwin --------------------------------------------------- Current WU: 11au16aa.28481.85822.12.39.56.wu --------------------------------------------------- Running default app with command : MBv8_8.05r3344_sse41_x86_64-apple-darwin 3629.37 real 3614.92 user 11.28 sys Elapsed Time: ………………………………… 3630 seconds --------------------------------------------------- Running app with command : setiathome_x41p_zi3g_x86_64-apple-darwin_cuda75 -bs -unroll 4 -device 0 155.96 real 34.94 user 15.46 sys Elapsed Time : ……………………………… 156 seconds Speed compared to default : 2326 % ----------------- Comparing results Result : Strongly similar, Q= 99.98% --------------------------------------------------- Done with 11au16aa.28481.85822.12.39.56.wu. 
Current WU: 21au10ag.26901.20545.14.41.41.wu --------------------------------------------------- Running default app with command : MBv8_8.05r3344_sse41_x86_64-apple-darwin 7672.14 real 7632.09 user 31.42 sys Elapsed Time: ………………………………… 7672 seconds --------------------------------------------------- Running app with command : setiathome_x41p_zi3g_x86_64-apple-darwin_cuda75 -bs -unroll 4 -device 0 605.34 real 42.15 user 24.66 sys Elapsed Time : ……………………………… 605 seconds Speed compared to default : 1268 % ----------------- Comparing results Result : Strongly similar, Q= 99.62% --------------------------------------------------- Done with 21au10ag.26901.20545.14.41.41.wu. Current WU: blc5_2bit_guppi_57449_43932_HIP78775_0013.26700.831.18.27.53.vlar.wu --------------------------------------------------- Running default app with command : MBv8_8.05r3344_sse41_x86_64-apple-darwin 8061.68 real 8024.16 user 30.86 sys Elapsed Time: ………………………………… 8062 seconds --------------------------------------------------- Running app with command : setiathome_x41p_zi3g_x86_64-apple-darwin_cuda75 -bs -unroll 4 -device 0 937.66 real 40.58 user 23.60 sys Elapsed Time : ……………………………… 938 seconds Speed compared to default : 859 % ----------------- Comparing results Result : Strongly similar, Q= 99.73% --------------------------------------------------- Done with blc5_2bit_guppi_57449_43932_HIP78775_0013.26700.831.18.27.53.vlar.wu. 
Current WU: blc5_2bit_guppi_57449_45355_HIP81348_0017.14390.416.17.26.21.vlar.wu --------------------------------------------------- Running default app with command : MBv8_8.05r3344_sse41_x86_64-apple-darwin 8288.84 real 8258.83 user 25.26 sys Elapsed Time: ………………………………… 8289 seconds --------------------------------------------------- Running app with command : setiathome_x41p_zi3g_x86_64-apple-darwin_cuda75 -bs -unroll 4 -device 0 991.76 real 40.50 user 23.07 sys Elapsed Time : ……………………………… 992 seconds Speed compared to default : 835 % ----------------- Comparing results Result : Strongly similar, Q= 99.73% --------------------------------------------------- Done with blc5_2bit_guppi_57449_45355_HIP81348_0017.14390.416.17.26.21.vlar.wu. blc5_2bit_guppi_57449_45695_HIP81348_OFF_0018.29291.831.17.26.236.vlar.wu does not exist. blc5_2bit_guppi_57449_45695_HIP81348_OFF_0018.29291.831.17.26.236.vlar.wu does not exist. Done with Benchmark run! Removing temporary files! Yes, I trashed the last task, no sense in wasting more than 6 hours. |
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Overflows that disappear in an offline run on the same hardware could originate either from a broken hardware/software setup (that is, the GPU itself or its drivers) or from a missing sync point in the app. That is the hardest kind of case to debug, so don't expect a "come, see, win" approach to work for such problems. Much more than 6 hours could be wasted... SETI apps news We're not gonna fight them. We're gonna transcend them. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
I wasn't testing for overflows on that last test. It says at the top of the post 'Autocorrelation'. The top two results from this post, http://setiathome.berkeley.edu/forum_thread.php?id=80158&postid=1815756 are from a different machine, running a different platform, with the same Autocorrelation problem. The Overflows weren't noticeable until the change from zi3f to zi3g. That change actually solved the targeted problem which was Very frequent Stalls with just the 750Ti using driver 7.5.x. It was So bad it was unusable. I even went back to the older x41p_zi for a while because....that App doesn't have any of the problems x41p_zi3 has. The Older App works very nicely, it's just much slower on the GUPPIs. In fact, None of the other Apps have these problems the zi3 version has. So, I really don't think it's the hardware. Yes, much more than 6 hours have been spent, the count is measured in Months at this point. If you go back to the original x41p_zi, the time is approaching a Year. Remember that old ATI 6850 I had that would crash after finishing an AstroPulse? Many people here kept telling me it was the card, until just about everyone here was having the same problem. *snicker* Remember that? Well, the same card is currently running in this host, http://setiathome.berkeley.edu/results.php?hostid=7769537&offset=200, still nothing wrong with that card. I have 3 750Ti, I've swapped out all three cards, All of them have the same problem with just this One App. I really don't think it's the card(s). Any suggestions? |
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Well, swapping cards in the same host rules out faulty hardware good enough. That leaves a faulty driver (driver/OS combo), just as you already managed to prove in the case of OS X and the OpenCL builds: they work under some OS X/driver combinations and produce invalids under others. Can this case be ruled out for the CUDA app? Do you see wrong autocorr signals across the whole range of OS/driver versions? If so, a faulty OS/driver combo is ruled out too, and closer attention should then be paid to the app itself. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
(Quoting Raistmer) "Well, swapping cards in the same host rules out faulty hardware good enough." The interesting thing is, the two 950s Never had the stalling problem, it Only happens with the 750Ti cards. The stalls were present, and very noticeable, with the first version of zi3; the AC problems were also present but very infrequent. Both problems became worse with newer versions. The stalls would only happen with driver 7.5, both problems happened in Darwin 14.5 and 15.x, and the AC problem happens with all combinations. With driver 8.0.29, which only works with 15.5 to 15.0, the only problems so far are the AC problems. Again, None of these problems exist with the other cuda Apps. I did just compile a new cuda80 zi3g App a few hours ago, and I'm still looking at it. I also just compiled a new zi3g App for the Linux machine, and I'm still looking at that as well. |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
(Quoting TBar) "Any suggestions?" Crashes *after* finish are a pure symptom of the standard boincapi shutdown procedure, which is not thread-safe (Cuda uses internal helper threads that should not be killed). If you're running into this with the current boincapi, or into similar symptoms such as 'finished file present too long', despite the multiple workarounds added recently, you'll probably see variable symptoms depending on the current driver and Cuda versions, until such time as I can generalise a custom boincapi patch. [Edit:] I won't know about the stalls for a bit. I would have to poke at Petri's stream event chains, and perhaps install some timeouts/restarts if some conditions can cause events to go AWOL. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
petri33 Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
Hi, I'll rest this weekend and test/debug next week. I downloaded the workunits that Richard linked to. How about running without the -bs switch? Removing it changes the synchronisation to non-blocking (which increases CPU time and makes the exe run faster on shorties). Too bad I have neither a 750Ti nor a Mac to test with. I'll be back. To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones |
petri33 Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
(Quoting Raistmer) "Well, swapping cards in the same host rules out faulty hardware good enough." There is now a zi3h version in your mailbox. Hope it cures the AC problem. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Thanks Petri. I was about to try it with Toolkit 8 and ran across this post while pondering the connection to cudaAutoCorr_plan: "cufft: use CUFFT_COMPATIBILITY_FFTW_PADDING instead of CUFFT_COMPATIBILITY_NATIVE". It still works in Toolkit 7.5, so I'm not sure if it will help. I'm going to replace the 3 CUFFT_COMPATIBILITY_NATIVE entries with CUFFT_COMPATIBILITY_FFTW_PADDING, unless someone comes up with a better plan...soon. |
petri33 Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
(Quoting TBar) "Thanks Petri." Native with 7.5: that is one reason I use the 7.5 include files and the 6.5 or 7.5 library for FFT. I haven't tried the 8.0 cufft since it doesn't support NATIVE. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
It appears that doesn't work either; zi3h was compiled with CUFFT_COMPATIBILITY_FFTW_PADDING. x41p_zi3h, Cuda 8.00 special, Best autocorr: peak=82137.73 x41p_zi3h, Cuda 8.00 special, Best autocorr: peak=30465.58 |
Kiska Joined: 31 Mar 12 Posts: 302 Credit: 3,067,762 RAC: 0 |
I have a task where one result reports 2 fewer pulses than the other. Mr. Kevvy's machine is using zi3d, so what we need to test is whether the issue is still present in the latest build. Links: Task, Datafile, CPU result file |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
The Pulsefind error wasn't corrected until zi3f, so, unless an earlier version was compiled with the Pulsefind fix it will still give the wrong Pulse count about half the time. That gives me an idea. I'll go back to zi3c, apply the recent fixes, use CUFFT_COMPATIBILITY_FFTW_PADDING, and see what happens. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
"Task blc5_2bit_guppi_57449_46749_HIP83043_0021.14243.831.18.27.242.vlar_2 exited with zero status but no 'finished' file" Yes, that is the task running on the 750Ti. It usually happens sometime before the 750Ti Stalls. Oh well, I was able to apply the Pulsefind fix and the Blocking Sync to zi3c, and it did pass the benchmark. I suppose I could boot back to Darwin 15.4 and try it with Driver 8.0.29, 'cause it looks like it's going to Stall in 14.5 with driver 7.5.27. |
petri33 Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
(Quoting TBar) "Task blc5_2bit_guppi_57449_46749_HIP83043_0021.14243.831.18.27.242.vlar_2 exited with zero status but no 'finished' file" Hi TBar, I had to make a small change to cudaAcceleration.cu to make the autocorr work with 8.0. I commented out `cu_errf = cufftPlan1d(&cudaAutoCorr_plan, ac_fftlen*2, CUFFT_R2C, 8);` (RFFT method, batch of 8) and replaced it with `int size = ac_fftlen*2; cu_errf = cufftPlanMany(&cudaAutoCorr_plan, 1, &size, NULL, 0, 0, 0, 0, 0, CUFFT_R2C, 8);`. The plan1d with batch is deprecated and does not work correctly in 8.0. The PlanMany works ok. EDIT: no, it does not. I have found one wu that still gives an AC error. That is good: now I can debug. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
I've compiled another cuda80 zi3h with the cufftPlanMany, we'll see how that goes. I checked the files used for the updated zi3c, alias zi3x, and noticed it also has cufftPlan1d. The updated zi3c was compiled in Darwin 14.5 with Toolkit 7.5. Running in 14.5 it produced 2 "exited with zero status but no 'finished' file" lines. Both lines indicate it choked on Autocorrelation. Everything was running normally and then it crashed after displaying: Spike: peak=25.33522, time=54.4, d_freq=1250818526.19, chirp=-4.3155, fft_len=64k Spike: peak=26.24246, time=87.24, d_freq=1419683300.07, chirp=21.469, fft_len=128k That's probably also what's causing the 750Ti to Stall with driver 7.5.x. I'll rebuild the updated zi3c, at some point, using cufftPlanMany and see how that works. During the time zi3x has run, I haven't found any Extreme AC Peaks. However, after 5 750Ti Stalls in Darwin 14.5 I gave it up and went back to 15.4 with driver 8.0.29. BTW, have you noticed that compiling zi3h & zi3x with CUFFT_COMPATIBILITY_FFTW_PADDING doesn't seem to have slowed it down? At least not on My machine, anyway. KWSN-Darwin-MBbench v2.1.07 Running on TomsMacPro.local at Sat Sep 10 11:12:22 2016 --------------------------------------------------- Starting benchmark run...
--------------------------------------------------- Listing wu-file(s) in /testWUs : 02se09ad.24663.11110.7.34.139.wu blc2_2bit_guppi_57403_HIP11048_0006.17091.831.22.45.71.wu Listing executable(s) in /APPS : setiathome_x41p_zi3h_x86_64-apple-darwin_cuda80 Listing executable in /REF_APPs : MBv8_8.05r3344_sse41_x86_64-apple-darwin --------------------------------------------------- Current WU: 02se09ad.24663.11110.7.34.139.wu --------------------------------------------------- Skipping default app MBv8_8.05r3344_sse41_x86_64-apple-darwin, displaying saved result(s) Elapsed Time: ………………………………… 2819 seconds --------------------------------------------------- Running app with command : setiathome_x41p_zi3h_x86_64-apple-darwin_cuda80 -bs -unroll 4 -device 0 148.77 real 17.29 user 11.73 sys Elapsed Time : ……………………………… 148 seconds Speed compared to default : 1904 % ----------------- Comparing results Result : Strongly similar, Q= 99.98% --------------------------------------------------- Done with 02se09ad.24663.11110.7.34.139.wu. Current WU: blc2_2bit_guppi_57403_HIP11048_0006.17091.831.22.45.71.wu --------------------------------------------------- Skipping default app MBv8_8.05r3344_sse41_x86_64-apple-darwin, displaying saved result(s) Elapsed Time: ………………………………… 5654 seconds --------------------------------------------------- Running app with command : setiathome_x41p_zi3h_x86_64-apple-darwin_cuda80 -bs -unroll 4 -device 0 546.94 real 23.93 user 14.92 sys Elapsed Time : ……………………………… 547 seconds Speed compared to default : 1033 % ----------------- Comparing results Result : Strongly similar, Q= 99.37% --------------------------------------------------- |
petri33 Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
Hi, Yes, I noticed it did not slow things down. The changes work with cufft 6.5 but give a lot of autocorr errors when run with cufft 8.0. To get the error number in autocorrelation you can change the code to: `if(fft_num == 0) { err = cudaEventSynchronize(autocorrelationDoneEvent); /* host (CPU) code waits for the specific GPU task to complete */ if(cudaSuccess != err) { fprintf(stderr, "GetAutocorr - sync fft_num = %d error = %d\r\n", fft_num, err); exit(0); } }` |
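
For anyone following the cufftPlan1d / cufftPlanMany discussion in this thread, here is a minimal host-only C++ sketch of the buffer sizes that a batched real-to-complex (R2C) plan implies when the default contiguous layout is used (the NULL embed arguments in the cufftPlanMany call quoted earlier). It makes no CUDA or cuFFT calls. The names R2CLayout and r2cLayout are hypothetical and do not come from cuFFT or the x41p sources. It assumes the usual R2C convention that a length-n real transform yields n/2 + 1 complex outputs, and that an in-place, FFTW-compatible layout pads each real row to 2*(n/2 + 1) floats.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type for illustration only; not part of cuFFT
// or the SETI@home application sources.
struct R2CLayout {
    std::size_t realsPerTransform;        // n real inputs per transform
    std::size_t complexPerTransform;      // n/2 + 1 complex outputs per transform
    std::size_t totalReals;               // batch * n (out-of-place, contiguous)
    std::size_t totalComplex;             // batch * (n/2 + 1)
    std::size_t paddedRealsPerTransform;  // 2 * (n/2 + 1) (in-place, FFTW-style padding)
};

// Compute element counts for `batch` 1-D real-to-complex transforms of length n.
R2CLayout r2cLayout(std::size_t n, std::size_t batch) {
    assert(n > 0 && batch > 0);  // a plan needs a non-empty size and batch
    R2CLayout l;
    l.realsPerTransform = n;
    l.complexPerTransform = n / 2 + 1;
    l.totalReals = batch * n;
    l.totalComplex = batch * l.complexPerTransform;
    l.paddedRealsPerTransform = 2 * l.complexPerTransform;
    return l;
}
```

As a usage example, r2cLayout(2 * 131072, 8) would describe a batch of 8 transforms of length ac_fftlen*2 if ac_fftlen were 131072 (an illustrative value, not one taken from the thread): each transform would then produce 131073 complex outputs. Comparing such counts against the buffers actually allocated may help when checking whether a padding or compatibility-mode change could affect the autocorrelation results.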