Monitoring inconclusive GBT validations and harvesting data for testing

Kiska
Volunteer tester

Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 1814331 - Posted: 1 Sep 2016, 13:17:28 UTC - in response to Message 1814329.  

I was going to ask if you could upload the result, however you don't have access to that
ID: 1814331
petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1814346 - Posted: 1 Sep 2016, 15:17:23 UTC - in response to Message 1814331.  

I was going to ask if you could upload the result, however you don't have access to that


Strongly similar 99.79%. No missing nor extra pulses.

I could email the result file.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1814346
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1814363 - Posted: 1 Sep 2016, 16:25:07 UTC - in response to Message 1814346.  

Since the first post I've received one more AC error on the Linux machine. I was not keeping backups of the workunits on the Linux machine, and I didn't start making backups on the Mac until after the first post, so I don't have any files. If someone running Windows wants to try to recover those workunits, fine. The links are in the last post and here: http://setiathome.berkeley.edu/workunit.php?wuid=2233603646.
If someone can use that script (the one that doesn't work on a Mac) to find the download URLs, that's also fine. Until then, I don't have any files.
ID: 1814363
petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1814370 - Posted: 1 Sep 2016, 16:46:09 UTC - in response to Message 1814363.  
Last modified: 1 Sep 2016, 16:46:36 UTC

A couple of years ago I was able to run the Excel macro that translates the WU name to a download address under OpenOffice/LibreOffice in Linux.

I could try to find that, but then I'd have to run for an hour or more to calculate the reference result on CPU. I find it easier to get a WU and a result and then run a comparison on my GPU. That makes my development cycle faster given my limited on-screen time. The summer (2½ months) and the holidays are my main development times now that I'm at work. Last year I was studying, so I had plenty of time.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1814370
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14672
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1814376 - Posted: 1 Sep 2016, 17:07:48 UTC - in response to Message 1814363.  

The URL for that data file is

http://boinc2.ssl.berkeley.edu/sah/download_fanout/21c/02se09ad.24663.11110.7.34.139
Grab it quick before the final replication is reported.

The original Excel/Open Office/Libre Office macro should still work for those Arecibo task names: I uploaded a patched one for the new guppi file name format to Lunatics a few days ago. If anyone can lend me a Mac, I can try and work out why it doesn't work in the Mac versions of the various office suites.
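
For readers without the spreadsheet macro, the name-to-URL translation can be sketched in a few lines of Python. This is a hedged sketch, not the macro itself: it assumes the download directory uses a BOINC-style hashed fanout (MD5 of the file name, reduced modulo 1024 and printed in hex), and the exact slice of the digest is an assumption that should be checked against a known-good URL such as the one above. `BASE_URL`, `FANOUT`, `fanout_dir` and `download_url` are illustrative names.

```python
import hashlib

# Assumptions (not confirmed by this thread): base URL taken from the link
# above, a fanout of 1024, and hex digits 1..7 of the MD5 digest as the hash.
BASE_URL = "http://boinc2.ssl.berkeley.edu/sah/download_fanout"
FANOUT = 1024


def fanout_dir(filename: str, fanout: int = FANOUT) -> str:
    """Return the hashed subdirectory name (lowercase hex) for a file name."""
    digest = hashlib.md5(filename.encode("ascii")).hexdigest()
    return format(int(digest[1:8], 16) % fanout, "x")


def download_url(wu_name: str) -> str:
    """Build a candidate download URL for a workunit data file."""
    return f"{BASE_URL}/{fanout_dir(wu_name)}/{wu_name}"


print(download_url("02se09ad.24663.11110.7.34.139"))
```

If the printed directory is not `21c`, the hashing assumption is wrong for this server and the macro remains the reference.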
ID: 1814376
petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1814377 - Posted: 1 Sep 2016, 17:16:18 UTC - in response to Message 1814376.  
Last modified: 1 Sep 2016, 17:21:34 UTC

Grabbed. Thanks. It is now running on CPU with the stock Linux app, then on GPU.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1814377
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1814378 - Posted: 1 Sep 2016, 17:35:39 UTC - in response to Message 1814377.  
Last modified: 1 Sep 2016, 17:37:42 UTC

I have a copy. I tried the xls in LibreOffice and it crashes when trying to open the xls file. That's after I set the security to Low and installed Java 7. Before that it did nothing.

I'll run the task on the Mac, but my suspicion is that the AC error is one of those random, non-recurring events, something like the old AP errors from years ago. We'll see.
ID: 1814378
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1814385 - Posted: 1 Sep 2016, 17:52:33 UTC

Saved the task here as well, pending weekend playing. If it's spurious, particularly on Mac, it could be some sort of buffer overrun. I ran into situations many years ago with CPU builds where such things would appear to run on Mac while creating rubbish, but crash and burn horribly on Windows. At the time it seemed to be a difference in how memory allocation was handled, such that the different OSes would react differently to an overrun. Naturally Murphy says it will turn out to be something completely different, though it's one potentially familiar angle for me to poke at anyway.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1814385
petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1814386 - Posted: 1 Sep 2016, 17:52:50 UTC - in response to Message 1814377.  
Last modified: 1 Sep 2016, 18:12:19 UTC

WU is 02se09ad.24663.11110.7.34.139


Elapsed Time : ...................... 90 seconds
Speed compared to default : ......... 1820 %
-----------------
Comparing results
Result      : Strongly similar,  Q= 99.98%



EDIT: the run time is so long because it was run alongside the normal SETI processing, thus doing two at a time.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1814386
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1814398 - Posted: 1 Sep 2016, 18:49:31 UTC - in response to Message 1814386.  

I get about the same on the Mac;

Current WU: 02se09ad.24663.11110.7.34.139.wu
---------------------------------------------------
Running default app with command : MBv8_8.05r3344_sse41_x86_64-apple-darwin
     2818.95 real      2811.01 user         6.18 sys
Elapsed Time: ………………………………… 2819 seconds
---------------------------------------------------
Running app with command : setiathome_x41p_zi3f_x86_64-apple-darwin_cuda80 -bs -unroll 4
      147.43 real        17.44 user        11.57 sys
Elapsed Time : ……………………………… 148 seconds
Speed compared to default : 1904 %
-----------------
Comparing results
Result      : Strongly similar,  Q= 99.98%
---------------------------------------------------

Random...Non-reoccurring...

Right now the Error score is;
Linux - 2
Mac - 1

There is also the point that ALL these AC Errors have occurred on a GTX 750Ti.
Not One has happened on a GTX 950. The Linux machine has 2 750Ti, the Mac has 1.
Then there's the old problem with Hangs that happen ONLY on the 750Ti with CUDA Driver 7.5. The 950s Don't hang with driver 7.5. The 950s Don't have AC Errors either.

I'm going to compile another Mac App with Toolkit 7.5 and see if x41zi3f still causes the 750Ti to Hang with driver 7.5. Right now a CUDA 8.0 App is pretty useless on a Mac anyway. You have to register as a Developer to even get the CUDA 8 driver.
ID: 1814398
petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1814401 - Posted: 1 Sep 2016, 19:17:38 UTC - in response to Message 1814398.  

Hi,
thanks.

So it seems to be somewhat random and also a driver version related issue/nonissue. The source code can be compiled to run on Kepler, Maxwell and Pascal. They all need some special tweaking to get the most out of them.

But since the error count has been dramatically decreased we can concentrate on building a stable app. The last 2% of speed is irrelevant.

Do we need a different app for Kepler, Maxwell and Pascal? For accuracy? Because of driver issues? For any other reason? Those with bigger goals in mind may decide. I reside. Not from developing though.

There's going to be a mid-term vacation and I may have time to put together a new rig with the 980s and 780s. I've got them idling on the shelf (2+4*).

Petri


* I will not send them, come and get them.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1814401
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1814435 - Posted: 1 Sep 2016, 21:21:16 UTC - in response to Message 1814401.  

I got the CUDA 75 App working. It seems to be about the same as the CUDA 80 App on the GUPPIs. I forgot the cmdline on the first run, so I ran it a second time with the settings:
Running on TomsMacPro.local at Thu Sep 1 20:36:41 2016
---------------------------------------------------
Starting benchmark run...
---------------------------------------------------
Listing wu-file(s) in /testWUs :
blc2_2bit_guppi_57403_HIP11048_0006.17091.831.22.45.71.wu

Listing executable(s) in /APPS :
setiathome_x41p_zi3f_x86_64-apple-darwin_cuda75

Listing executable in /REF_APPs :
MBv8_8.05r3344_sse41_x86_64-apple-darwin
---------------------------------------------------
Current WU: blc2_2bit_guppi_57403_HIP11048_0006.17091.831.22.45.71.wu
---------------------------------------------------
Skipping default app MBv8_8.05r3344_sse41_x86_64-apple-darwin, displaying saved result(s)
Elapsed Time: ………………………………… 4797 seconds
---------------------------------------------------
Running app with command : setiathome_x41p_zi3f_x86_64-apple-darwin_cuda75
      909.28 real       112.04 user        19.91 sys
Elapsed Time : ……………………………… 909 seconds
Speed compared to default : 527 %
-----------------
Comparing results
Result      : Strongly similar,  Q= 99.37%
---------------------------------------------------


Running on TomsMacPro.local at Thu Sep 1 20:55:53 2016
---------------------------------------------------
Starting benchmark run...
---------------------------------------------------
Listing wu-file(s) in /testWUs :
blc2_2bit_guppi_57403_HIP11048_0006.17091.831.22.45.71.wu

Listing executable(s) in /APPS :
setiathome_x41p_zi3f_x86_64-apple-darwin_cuda75

Listing executable in /REF_APPs :
MBv8_8.05r3344_sse41_x86_64-apple-darwin
---------------------------------------------------
Current WU: blc2_2bit_guppi_57403_HIP11048_0006.17091.831.22.45.71.wu
---------------------------------------------------
Skipping default app MBv8_8.05r3344_sse41_x86_64-apple-darwin, displaying saved result(s)
Elapsed Time: ………………………………… 4797 seconds
---------------------------------------------------
Running app with command : setiathome_x41p_zi3f_x86_64-apple-darwin_cuda75 -bs -unroll 4
      554.71 real        24.49 user        14.80 sys
Elapsed Time : ……………………………… 555 seconds
Speed compared to default : 864 %
-----------------
Comparing results
Result      : Strongly similar,  Q= 99.37%
---------------------------------------------------

The unroll setting makes a big difference. I suppose it's a little less accurate with the GUPPIs than the Arecibo tasks.
Still, compared to all those Intel iGPUs and Apple nVidia/Intel iGPUs I see in my Inconclusive list, the 'Special' App is much more accurate.

Now to see how the CUDA 75 App works in Darwin 15.6.
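
The "Speed compared to default" figures in the benchmark logs in this thread appear to be the reference app's elapsed time divided by the tested app's, using the rounded whole seconds and truncating the percentage. That is inferred from the numbers, not from the benchmark script's source, and `speed_percent` is an illustrative name. A minimal Python sketch under that assumption reproduces the quoted 1904 %, 527 % and 864 %:

```python
def speed_percent(reference_seconds: int, test_seconds: int) -> int:
    """Speed of the test app relative to the reference app, in whole percent.

    Truncates toward zero, which matches the figures quoted in the logs.
    """
    if test_seconds <= 0:
        raise ValueError("test_seconds must be positive")
    return (100 * reference_seconds) // test_seconds


# (label, reference app seconds, tested app seconds) taken from the logs
runs = [
    ("Arecibo, cuda80 -bs -unroll 4", 2819, 148),
    ("GUPPI, cuda75 defaults", 4797, 909),
    ("GUPPI, cuda75 -bs -unroll 4", 4797, 555),
]
for label, ref, test in runs:
    print(f"{label}: {speed_percent(ref, test)} %")
```

On the GUPPI task the same arithmetic puts the run with `-bs -unroll 4` at roughly 1.6 times the speed of the run without it (909 s versus 555 s).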
ID: 1814435
petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1814443 - Posted: 1 Sep 2016, 21:39:51 UTC

What effect would a release of a public special Windows version have on the credit (quite old)?
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1814443
petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1814444 - Posted: 1 Sep 2016, 21:40:26 UTC - in response to Message 1814435.  
Last modified: 1 Sep 2016, 21:41:02 UTC

@TBar @all,

The time difference reflects the architectural change from cuda50 to special. When the unroll is omitted, quite a number of pulse-find cycles use only one of the SMX units of the GTX xxx(x) card.

The time of 909 seconds is not too bad either. That reflects the influence of the previous optimization: using CUDA queues. That is implemented in the OpenCL version by now, I guess.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1814444
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1814449 - Posted: 1 Sep 2016, 21:53:13 UTC - in response to Message 1814444.  

That reflects the previous optimizations influence - using CUDA queues. That is implemented in the OpenCL version by now I guess.

Something similar to the unroll you have now implemented (at least inferring from your description) has also been present in OpenCL builds since the introduction of partial PulseFind.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1814449
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1814452 - Posted: 1 Sep 2016, 22:02:58 UTC
Last modified: 1 Sep 2016, 22:45:17 UTC

Oh well. The 750Ti has produced the infamous "Task 11ja09aa.31086.11524.14.41.214_0 exited with zero status but no 'finished' file". That line is present when the 750Ti Hangs or Stalls. It hasn't 'Stalled' yet, but it's probably just a matter of time. If you're not around when it stalls you get one of these, http://setiathome.berkeley.edu/workunit.php?wuid=2209368898
Exit status 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED

It will do Nothing for hours.

Checking the day's worth of stdoutdae.txt, that line didn't appear with the CUDA driver 8.0.29. It occurs about a dozen times a day with the 7.5.x drivers. Exactly the same thing happens running in Linux with the CUDA 7.5.x drivers.



That didn't take long. The 750Ti is Stalled on an Arecibo task. Strange that it never happens with the 950s. It also never happens with the older x41p_zi Apps.
A couple of tasks later and the 750Ti is Stalled again. This isn't going to work.
I suppose it's back to Darwin 15.4 and CUDA driver 8.
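
The check of stdoutdae.txt described in this post can be automated. The sketch below counts the "exited with zero status but no 'finished' file" message per task; it assumes only that the message text appears in the log as quoted above, and `count_unfinished_exits` is an illustrative name. The sample lines are made up for the demonstration.

```python
import re
from collections import Counter

# Pattern based on the client message quoted in this post.
PATTERN = re.compile(
    r"Task (?P<task>\S+) exited with zero status but no 'finished' file"
)


def count_unfinished_exits(lines):
    """Count 'no finished file' messages per task name."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group("task")] += 1
    return counts


# Made-up sample input: one matching message and one unrelated line.
sample = [
    "Task 11ja09aa.31086.11524.14.41.214_0 exited with zero status "
    "but no 'finished' file",
    "some unrelated log line",
]
print(count_unfinished_exits(sample))
```

To scan a real log, pass an open file instead of the sample list, for example `count_unfinished_exits(open("stdoutdae.txt"))`, and compare the totals between driver versions.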
ID: 1814452
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1814495 - Posted: 2 Sep 2016, 1:15:43 UTC - in response to Message 1814452.  

...
That didn't take long. The 750Ti is Stalled on an Arecibo task. Strange it never happens with the 950s. It also never happens with the older x41p_zi Apps.
A couple tasks later and the 750Ti is Stalled again, This isn't going to work.
I suppose it's back to Darwin 15.4 and CUDA driver 8.


Could you compare operation, using Activity Monitor and nvidia-smi, with the client's no priority change option set? It could be that some of CUDA's internal helper threads have different priorities depending on OS and CUDA version.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1814495
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1814507 - Posted: 2 Sep 2016, 1:39:26 UTC - in response to Message 1814443.  

What effect would a release of a public special windows version have to the credit (quite old)?


Not a lot on individual task credit initially, while running under anonymous platform, and I will be considering such a build this weekend. With the major deterministic pulsefinding issue out of the way, the risk of pollution of the science db is drastically reduced. The random issues still to be located are less problematic, since improper crossmatch is unlikely.

Once the refined/proven portions integrate into stock, credit for this will probably oscillate or look random for a bit, then stabilise on something similar to now. Stock CPU AVX will remain the normalisation reference by virtue of its effective operation underclaim by a factor of ~3.3x, so until that's eventually addressed credit will remain low here (compared to the cobblestone scale).

With the questions/temptation for numerous dedicated builds, I'd recommend withholding so we can work out better approaches for internal dispatch. A many-builds approach quickly becomes a maintenance nightmare, along with a major source of confusion (and error) for users. For one example, we'd need to make a Kepler+ class build simply hard error out if attempting to run on a <= compute capability 3.2 device, but preferably, longer term, just dispatch to legacy code internally, and so just work. That's difficult using the Cuda runtime and CUFFT directly, though simpler if we push the device-specific code to shared library wrappers, since then we'd be looking after a tidy set of compact libraries and one core app, rather than the possible 50 or so complete application builds.
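
The internal-dispatch idea can be illustrated with a small, language-neutral sketch, written here in Python for brevity. Everything in it is hypothetical: the backend names, the `LEGACY_MAX` threshold of 3.2 (taken from the example in this post) and the `allow_legacy` switch stand in for whatever a real CUDA app and its shared-library wrappers would use.

```python
from typing import Callable, Dict, Tuple

ComputeCapability = Tuple[int, int]

# Hypothetical threshold: devices at or below 3.2 take the legacy path.
LEGACY_MAX: ComputeCapability = (3, 2)


def run_legacy(task: str) -> str:
    """Stand-in for the legacy code path."""
    return f"legacy:{task}"


def run_kepler_plus(task: str) -> str:
    """Stand-in for the Kepler+ code path."""
    return f"kepler_plus:{task}"


# One core app, with device-specific code behind a small dispatch table.
BACKENDS: Dict[str, Callable[[str], str]] = {
    "legacy": run_legacy,
    "kepler_plus": run_kepler_plus,
}


def select_backend(cc: ComputeCapability, allow_legacy: bool = True) -> str:
    """Pick a backend name from a (major, minor) compute capability."""
    if cc <= LEGACY_MAX:
        if not allow_legacy:
            # The "hard error out" behaviour of a Kepler+ only build.
            raise RuntimeError(
                f"compute capability {cc[0]}.{cc[1]} is not supported"
            )
        return "legacy"
    return "kepler_plus"


def process(task: str, cc: ComputeCapability) -> str:
    """Run a task on whichever backend fits the device."""
    return BACKENDS[select_backend(cc)](task)


print(process("example_task", (3, 0)))
print(process("example_task", (5, 0)))
```

With `allow_legacy=False` the selector mimics the short-term option of refusing to run; the default mimics the longer-term goal of one app that just works on older devices.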
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1814507
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13835
Credit: 208,696,464
RAC: 304
Australia
Message 1814556 - Posted: 2 Sep 2016, 6:03:25 UTC - in response to Message 1814507.  
Last modified: 2 Sep 2016, 6:04:00 UTC

With the questions/temptation for numerous dedicated builds, I'd recommend withholding so we can work out better approaches for internal dispatch.

Personally, the fewer different stock versions we have the better.
It would be great if one application was the best for all, but failing that, two (or three if necessary) stock applications to continue supporting the current range of supported hardware.
One build for Kepler, Maxwell and Pascal, another for older hardware. Maybe a third for the oldest supported hardware, if it's necessary? E.g. CUDA32, CUDA42 & now CUDA 80 (or whatever it will be).

Ideally, leave the maximally tweaked & tuned applications that will only work on specific GPU families for the Lunatics installer / Anonymous platform.
Grant
Darwin NT
ID: 1814556
-= Vyper =-
Volunteer tester
Joined: 5 Sep 99
Posts: 1652
Credit: 1,065,191,981
RAC: 2,537
Sweden
Message 1814576 - Posted: 2 Sep 2016, 7:53:46 UTC - in response to Message 1814452.  
Last modified: 2 Sep 2016, 7:56:33 UTC

That line is present when the 750Ti Hangs or Stalls. It hasn't 'Stalled' yet, but it's probably just a matter of time.


My four Gigabyte Black Edition 750Tis (made for 24/7 use in a server park environment) just chug along fine with Petri's app.
They work like a charm for days and days.

http://setiathome.berkeley.edu/show_host_detail.php?hostid=8053171

Could you please try increasing the voltage to the GPU in your case? I've found several times that the manufacturer's BIOS is tweaked so that it doesn't give the GPU enough "juice" to sustain itself.
Please raise the GPU voltage and try again.

I experienced this on my 1080 and did just that, and on several other GPUs in the past too.

_________________________________________________________________________
Addicted to SETI crunching!
Founder of GPU Users Group
ID: 1814576


 