Tasks ending in errors

Rune Bjørge
Joined: 5 Feb 00
Posts: 45
Credit: 30,508,204
RAC: 5
Norway
Message 1806780 - Posted: 3 Aug 2016, 12:35:27 UTC

I've started to run into tasks ending with errors on two of my crunchers.
From time to time, a GPU task finds 30 pulses and is terminated. This usually happens within the first 30 seconds of crunching. Most of these tasks end normally with the "too many pulses" message, and they end up as valid.

Then I have some tasks that also find 30 pulses but seem to keep crunching. After around 30 seconds these tasks end with exit status 4294967295 (0xffffffff), "Unknown exit code".
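
For what it's worth, 4294967295 is the unsigned 32-bit way of writing -1, so the status may simply be a generic "failed" return value rather than a specific error number. A minimal Python sketch of that conversion (the helper name `as_signed_32` is made up for illustration; nothing here is taken from the SETI@home or BOINC code):

```python
def as_signed_32(status: int) -> int:
    """Reinterpret an unsigned 32-bit exit status as a signed 32-bit integer."""
    status &= 0xFFFFFFFF  # keep only the low 32 bits
    # If the sign bit is set, subtract 2**32 to get the negative value.
    return status - 0x1_0000_0000 if status & 0x8000_0000 else status


print(hex(4294967295), as_signed_32(4294967295))  # 0xffffffff -1
```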

When I look at the wingmen for these tasks, they often get the same result if the task was sent to a GPU. If my wingmates get a CPU version of the task, it ends with "too many pulses" or overflow.

Example of such a task with multiple failures:

http://setiathome.berkeley.edu/workunit.php?wuid=2224690101
Task | Computer | Sent | Reported | Status | Runtime GPU | Runtime CPU | Credits | Application
5070729185 | 8002450 | 31 Jul 2016, 17:11:35 UTC | 1 Aug 2016, 8:29:42 UTC | Error while computing | 311.34 | 9.70 | --- | SETI@home v8 v8.12 (opencl_nvidia_sah) windows_intelx86
5070729186 | 7234332 | 31 Jul 2016, 17:11:33 UTC | 31 Jul 2016, 22:34:31 UTC | Error while computing | 322.12 | 18.50 | --- | SETI@home v8 v8.12 (opencl_nvidia_SoG) windows_intelx86
5071672000 | 7968069 | 1 Aug 2016, 4:21:22 UTC | 1 Aug 2016, 14:09:32 UTC | Error while computing | 329.39 | 27.78 | --- | SETI@home v8 v8.12 (opencl_nvidia_SoG) windows_intelx86
5072554484 | 7236263 | 1 Aug 2016, 14:42:01 UTC | 2 Aug 2016, 15:25:09 UTC | Error while computing | 340.76 | 32.80 | --- | SETI@home v8, Anonymous platform (ATI GPU)
5073004275 | 3693878 | 1 Aug 2016, 20:18:03 UTC | 2 Aug 2016, 11:34:42 UTC | Completed, waiting for validation | 87.31 | 85.02 | pending | SETI@home v8 v8.00 windows_intelx86
5075045476 | 2301233 | 2 Aug 2016, 23:17:15 UTC | 25 Sep 2016, 4:16:57 UTC | In progress | --- | --- | --- | SETI@home v8 v8.00 (opencl_ati5_mac) x86_64-apple-darwin



A lot of the failing tasks end up with the same result on two of my rigs:
http://setiathome.berkeley.edu/show_host_detail.php?hostid=7968069
http://setiathome.berkeley.edu/show_host_detail.php?hostid=7992414

No changes were made to the rigs before the problem started, and the GPUs in the rigs are not overheating either (they run at around 50-55 degrees). I'm also running the cards at their default clocks. I've checked the cards for dust bunnies and cleaned the fans and cooling fins, without any noticeable improvement.

The cards run only one task at a time, and the machines are dedicated crunchers, used only for S@H.

Does anyone have a good idea how to improve the situation?
ID: 1806780
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1806784 - Posted: 3 Aug 2016, 13:21:42 UTC - in response to Message 1806780.  

I only see errors like that on host 7968069 at the moment.

The error report is

"ERROR: Possible wrong computation state on GPU, host needs reboot or maintenance"

and it only occurs with the opencl_nvidia_SoG application.

This is a known problem with the application - the watchdog which monitors the health of the GPU is a little over-zealous, and prone to false positives like this. We're working to get the application replaced as quickly as possible: in the meantime, you can ignore this particular error message if it only appears for a small proportion of the tasks you complete.
ID: 1806784
Rune Bjørge
Joined: 5 Feb 00
Posts: 45
Credit: 30,508,204
RAC: 5
Norway
Message 1806819 - Posted: 3 Aug 2016, 16:31:55 UTC - in response to Message 1806784.  

Thank you for the information.

Around 0.5% of my tasks end with this result, so it's not a large amount. I just wanted to check whether there was anything I should do to the hosts.
ID: 1806819
_heinz
Volunteer tester
Joined: 25 Feb 05
Posts: 744
Credit: 5,539,270
RAC: 0
France
Message 1806854 - Posted: 3 Aug 2016, 20:12:32 UTC

Could someone please have a quick look at my computer, hostid=6944847? About 50% of all its tasks end up inconclusive or in error.
I fired up the V8-Xeon again, but it's not easy to find the optimal overclock without losing any tasks. I'm now running the graphics cards at stock frequencies.
Any hints are appreciated.
_heinz
D5400XS V8-Xeon
ID: 1806854
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1806856 - Posted: 3 Aug 2016, 20:30:52 UTC - in response to Message 1806854.  

> SETI@home using CUDA accelerated device GeForce GTX TITAN
> setiathome enhanced x41zi (baseline v8), Cuda 3.20

I don't think cuda 3.2 is likely to be optimal for a GTX TITAN.

Not sure whether cuda 5.0 is either, but it's probably better.
ID: 1806856
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1806869 - Posted: 3 Aug 2016, 21:00:33 UTC - in response to Message 1806856.  
Last modified: 3 Aug 2016, 21:01:34 UTC

> SETI@home using CUDA accelerated device GeForce GTX TITAN
> setiathome enhanced x41zi (baseline v8), Cuda 3.20
>
> I don't think cuda 3.2 is likely to be optimal for a GTX TITAN.
>
> Not sure whether cuda 5.0 is either, but it's probably better.

For that 'Big K' generation of Titan on Windows, with baseline apps, either Cuda 5.0 or 6.5 will be optimal (probably splitting hairs, but 5.0 likely edges it out). Go too far ahead and the compiler and libraries tend to regress a little on older generations.

@heinz
Fortunately for that generation, Petri's code does target that compute capability (possibly one before, and onwards). Once it is working properly on Windows, at least enough to do some limited testing of very specific (non-generic) target builds, advanced-user-only testing can go on while integration into mainstream is perfected (a longer process). Until then I'd imagine the OpenCL builds may perform better on it (though I haven't verified that on my similar 780-class GPU).

Three more steps toward that mountaintop today, with the first Cuda-enabled small test pieces building under Gradle automation. Once small test pieces build, run regression tests, and package simultaneously on the 3 main platforms (times bitnesses) within the same build system, a lot of the headaches of updating the codebase and apps should ease. Finding and smoothing the alpha code's precision/search weaknesses should 'fall out' of the test pieces along the way, as long as I don't get too impatient with the whole process and rush it anyway.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1806869
_heinz
Volunteer tester
Joined: 25 Feb 05
Posts: 744
Credit: 5,539,270
RAC: 0
France
Message 1806888 - Posted: 3 Aug 2016, 21:58:22 UTC

Yesterday I installed W10 Pro on the V8-Xeon, then the latest graphics driver for the Titan, and now CUDA Toolkit 7.5.
Is that OK?

@Jason
Thanks for your hint.
I will likely test Petri's code and help to bring up a new and faster app.
In the next few days I will install my complete developer environment with the latest Intel compiler; I have a full production licence with support.
As soon as I can, I will help you, Jason, with compiling and testing the new stuff.
I must look in NVIDIA's developer area; it seems CUDA 8 is on the way.
Greetings...
ID: 1806888
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1806917 - Posted: 4 Aug 2016, 0:08:09 UTC - in response to Message 1806888.  
Last modified: 4 Aug 2016, 0:15:44 UTC

[Edit:] Cuda 7.5 will be OK, though you would probably have to go 64-bit only and duplicate the settings from the Cuda 5 or 6.5 solutions/projects. I've stopped maintaining the Visual Studio solutions, unless some urgent stock change becomes necessary (details below on what I'm moving to).

Along with Cuda8, you may want to have a look at, or even play with, the build system I am gradually adapting/moving to:
( https://gradle.org/ )

> Emerge from Build Hell today through Gradle Build Tool, the modern open source polyglot build automation system. Automate and integrate your DevOps toolchain with a concise and expressive build programming language. End long build times. End code freezes. End build script chaos. End deathmarches. End bug regressions. End broken release processes.

The purpose is to move from the mix of different broken Makefiles/projects to a single setup that's easier to maintain on Windows, Mac OS X, and Linux (possibly plus others later).

It's proving a huge undertaking to integrate building Cuda-enabled applications, but the first signs are appearing that simpler cross-platform building can work and make things better.

If you have or create a GitHub account: I cleaned out the stale copy of the Cuda Xbranch that was there, so as to create and refine this process and the alpha code.
You or anyone else can watch https://github.com/JasonGroothuis/XBranch/ if interested. It is under construction and likely to have only small test code pieces initially. Though it is the (painfully) long road, the signs so far are that a near-clean slate, injecting everything learned plus modern techniques (including Petri's streaming code), is going to end up being the way to go.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1806917
_heinz
Volunteer tester
Joined: 25 Feb 05
Posts: 744
Credit: 5,539,270
RAC: 0
France
Message 1807835 - Posted: 7 Aug 2016, 20:35:02 UTC

@Jason,
I worked hard to install CUDA 8, VS2012 and VS2015 on the V8-Xeon, and created a Microsoft account to activate VS2015 for free. Then I installed Intel's Parallel Studio XE Composer Edition, created a GitHub account and activated it.
I compiled the CUDA 8 samples to test the developer environment. With Intel's compiler I get an error in float.h at #include_next: invalid (unknown) preprocessor statement. That surprises me, since float.h itself comes from Intel. I remember this error already happened more than 3 years ago with CUDA 4.
The Microsoft compiler (MSC) worked through without errors.
I had a short look at your GitHub and tried to clone it, without success.
I've been looking at the Gradle tool and reading a lot of docs. I think I must install Gradle first...
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I installed TortoiseSVN to handle the old SETI projects.
Meanwhile some further MS updates are running.
Any hints are appreciated.
ID: 1807835
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1807840 - Posted: 7 Aug 2016, 21:28:05 UTC - in response to Message 1807835.  
Last modified: 7 Aug 2016, 21:33:28 UTC

Thinking back quite a way, I think there was some issue with making sure which float.h, and other headers, were actually getting included. Probably it was a matter of defining some specific architecture preprocessor flags.

Things may have changed a lot since I used those. I abandoned 1 personal and 2 professional Intel compiler licences (for different jobs) some time back due to licence incompatibilities with the GPL. I'm not sure whether the licensing terms have changed since then, or whether there are other ways to legitimately use it for open source distribution. No doubt for personal use it's open season ;)

With Gradle, for just native C++, getting started is pretty simple if you already have compilers installed. For playing around you'd probably just want the 'all' distribution .zip of Gradle, and follow the docs. Probably the only issue you might run into is telling the scripts where Visual Studio is located (if it doesn't find a working one by itself, which it can), and the Windows SDK if needed. After that it's a matter of following how they lay out native development using the cpp plugin.

There's still a lot to go through to set it up to work with Cuda, though I do already have it building and linking .cu files into Windows exes.

For the work-in-progress build system on GitHub (meant to do Cuda builds on Mac, Windows, and Linux from the same setup), I committed the 'Gradle Wrapper', which downloads and installs the exact version of Gradle I use, straight from the git clone checkout. That's just automation; down the line it will mean that someone wanting to build (even myself on other hosts) should just need whatever C++ compiler works with Gradle on the OS and the appropriate Cuda toolkit, and then run the Gradle wrapper.

Once the (currently custom) Cuda addons become 'mature', probably I would turn it into a Gradle plugin for others to use in other projects... but that's more as an offshoot down the line, if no-one makes a Cuda plugin for Gradle by then already. I imagine nVidia must be using some inhouse variant with their Tegra SDK, but didn't look deep into that, to see if they added support in any way to be compatible with the Android SDK (Which is Gradle + Java based).

Since it does seem to be building Cuda executable tests for now, if pretty roughly, on Win32 and Win64, and I'm a bit closer to making it work on Mac and Linux, it is starting to look like it will end up better than either putting up with different projects/solutions/Makefiles on every platform or alternatives like CMake.

Not sure if anyone makes an Intel compiler plugin for Gradle yet (I doubt it); however, I suspect support for that could be done in a similar way to how I managed to get Cuda building in it (custom/manual scripting for now).

Certainly a fun exercise, trying to make something that will build on just about anything. I can see why Google chose Gradle to sit underneath Android Studio.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1807840
_heinz
Volunteer tester
Joined: 25 Feb 05
Posts: 744
Credit: 5,539,270
RAC: 0
France
Message 1808400 - Posted: 10 Aug 2016, 22:49:13 UTC
Last modified: 10 Aug 2016, 22:50:06 UTC

@Jason,
My disaster:
I have a 3TB WD My Book, which is secured by software (directly on the drive) and a password. I wanted to use it to make a full backup of the V8-Xeon, so I looked at WD Support and downloaded Acronis WD Edition. I installed the software, successfully made a full backup to the WD My Book, then shut down the V8-Xeon. The next day I started the V8-Xeon, and startup failed with a damaged Windows Boot Manager. It took me 3 days to find this out and to learn how to repair it: boot from the Win10 CD and use the repair option, which is relatively easy. Once the system starts up again, uninstall this corrupting software.
So everyone should be warned never to use Acronis WD Edition with Win7 or Win10.
I had the same scenario with my W7 laptop too. This was pure stress for 3 days.
Now both machines are repaired and running smoothly again.
Work is going on.

_heinz
ID: 1808400
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1808404 - Posted: 10 Aug 2016, 23:27:10 UTC - in response to Message 1808400.  

Hmmm, I wonder why WD needed a 'special edition'. I have had to fight with a messed-up boot in the distant past. Yeah, most things these days seem to get 'cranky' if left for a while.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1808404
_heinz
Volunteer tester
Joined: 25 Feb 05
Posts: 744
Credit: 5,539,270
RAC: 0
France
Message 1808668 - Posted: 12 Aug 2016, 6:34:45 UTC

I have now installed Lunatics_x41zi_win32_cuda50 with 0.5 per GPU and will let it run for 24 hours to see whether the V8-Xeon gets more stable results.
With 0.33 per GPU it runs, but it is a little laggy and not usable for other work in parallel.
hostid=6944847
D5400XS V8-Xeon
ID: 1808668
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1808671 - Posted: 12 Aug 2016, 6:53:07 UTC - in response to Message 1808668.  

> I installed now Lunatics_x41zi_win32_cuda50 with 0.5 per GPU and let it run for 24 hours to see if V8-Xeon will get more stable results.
> With 0.33 per GPU it runs, but a little bit laggy, not to use for other work parallel.
> hostid=6944847


Might be worth setting
<cpu_usage>0.04</cpu_usage>

to
<cpu_usage>0.50</cpu_usage>
so you reserve 1 CPU Core for every 2 GPU WUs running.
CUDA50 generally doesn't need much CPU time, however the more powerful the video card, the greater the load on the CPU will be.
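
The arithmetic behind that suggestion, as a rough Python sketch. It assumes BOINC simply adds up the `<cpu_usage>` of the running GPU tasks to decide how much CPU to budget for them; the function name is hypothetical and this is not BOINC's actual scheduler code:

```python
def cpu_cores_budgeted(gpu_tasks_running: int, cpu_usage: float) -> float:
    """CPU cores set aside for GPU tasks, assuming the per-task values simply add up."""
    return gpu_tasks_running * cpu_usage


# Compare the old and the suggested setting for 2 GPU WUs running at once.
for usage in (0.04, 0.50):
    cores = cpu_cores_budgeted(2, usage)
    print(f"cpu_usage={usage:.2f}: 2 GPU WUs -> {cores:.2f} cores")
```

Under that assumption, 0.04 budgets almost no CPU for two WUs, while 0.50 adds up to one full core.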
Grant
Darwin NT
ID: 1808671
_heinz
Volunteer tester
Joined: 25 Feb 05
Posts: 744
Credit: 5,539,270
RAC: 0
France
Message 1808673 - Posted: 12 Aug 2016, 7:09:29 UTC

Thanks, Grant.
Changed now:
<avg_ncpus>0.040000</avg_ncpus>
<max_ncpus>0.500000</max_ncpus>

TDP is about 50-60% on the cards.
D5400XS V8-Xeon
ID: 1808673
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1808676 - Posted: 12 Aug 2016, 7:37:06 UTC - in response to Message 1808673.  
Last modified: 12 Aug 2016, 7:38:42 UTC

> Thanks Grant
> changed now:
> <avg_ncpus>0.040000</avg_ncpus>
> <max_ncpus>0.500000</max_ncpus>
>
> TDP is ca. 50-60% on the cards

With the CPU core reserved, it should run 3 (or even 4) WUs at a time without things being laggy (even 0.4 or 0.3 may be enough).

EDIT: I'm not sure whether <max_ncpus> will have the same effect as <cpu_usage>; I've never used the <max_ncpus> setting.
Grant
Darwin NT
ID: 1808676
_heinz
Volunteer tester
Joined: 25 Feb 05
Posts: 744
Credit: 5,539,270
RAC: 0
France
Message 1808680 - Posted: 12 Aug 2016, 8:34:45 UTC

I have now changed max_ncpus to 1.0 and added a cpu_usage of 0.50:
<avg_ncpus>0.040000</avg_ncpus>
<max_ncpus>1.000000</max_ncpus>
<cpu_usage>0.50</cpu_usage>

I stopped all workunits, stopped GPU usage, then shut down BOINC Manager and started BOINC again.
Could it be that the cache must be drained before the new settings show up on the WUs?
I will try it now and then check.
D5400XS V8-Xeon
ID: 1808680
rob smith, Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22189
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1808683 - Posted: 12 Aug 2016, 8:46:40 UTC

No need to drain the cache - there can be an issue with BOINC not displaying the correct text when you change these values.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1808683
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1808685 - Posted: 12 Aug 2016, 8:48:11 UTC - in response to Message 1808680.  

> Could it be that the Cache must drained out, before the new Settings are to see on the wu's ?

Nope.
When the settings take effect depends on which file they are in.
If they're in app_config.xml, then all you need to do is select Options, Read config files in the BOINC Manager.
If they are in app_info.xml, then you need to exit BOINC and restart.

Any new work that is downloaded will show the new settings in the Tasks, Status column; any current work will still display the earlier settings there, but will use the new ones.
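
For reference, a hedged sketch of what an app_config.xml carrying these settings might contain, generated here with Python's standard library. The element names follow the documented BOINC app_config.xml layout; the app name `setiathome_v8` and the 0.5/0.5 values are assumptions for illustration, so check the real app name in your own client_state.xml before using anything like this:

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, gpu_usage: float, cpu_usage: float) -> str:
    """Return the text of a minimal app_config.xml for one GPU application."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    gpu = ET.SubElement(app, "gpu_versions")
    ET.SubElement(gpu, "gpu_usage").text = f"{gpu_usage:g}"  # GPU fraction per task
    ET.SubElement(gpu, "cpu_usage").text = f"{cpu_usage:g}"  # CPU fraction per task
    return ET.tostring(root, encoding="unicode")


# Assumed values: two tasks per GPU, half a CPU core budgeted per task.
print(build_app_config("setiathome_v8", 0.5, 0.5))
```

Saved as app_config.xml in the project directory, a file like this would be picked up with Options, Read config files, with no BOINC restart needed.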
Grant
Darwin NT
ID: 1808685
_heinz
Volunteer tester
Joined: 25 Feb 05
Posts: 744
Credit: 5,539,270
RAC: 0
France
Message 1808708 - Posted: 12 Aug 2016, 12:07:00 UTC
Last modified: 12 Aug 2016, 12:22:47 UTC

<avg_ncpus>0.040000</avg_ncpus>
<max_ncpus>1.000000</max_ncpus>
<cpu_usage>0.5</cpu_usage>
~~~~~~~~~~~~~~~~~~~~
This way it can run 4 per GPU
~~~~~~~~~~~~~~~~~~~~
TDP is now 62-77%
Have a look: v8-xeon_0.25_gpu, crunching 12 WUs in parallel.
All 12 WUs are blc5_2bit_guppi's, so we must wait a little to see the results.
:-)
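
The numbers work out as follows, in a small Python sketch. It assumes each task claims a fixed fraction of one GPU (0.25 here, taken from the v8-xeon_0.25_gpu label) and that the client starts as many tasks as fit; the function name is made up for illustration:

```python
import math


def tasks_per_gpu(gpu_fraction: float) -> int:
    """How many tasks fit on one GPU if each claims gpu_fraction of it."""
    # The small epsilon guards against floating-point rounding just below a whole number.
    return math.floor(1.0 / gpu_fraction + 1e-9)


print(tasks_per_gpu(0.25))        # 4 per GPU
print(12 // tasks_per_gpu(0.25))  # 12 WUs in parallel implies 3 GPUs
print(tasks_per_gpu(0.33))        # 3 per GPU
print(tasks_per_gpu(0.5))         # 2 per GPU
```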
D5400XS V8-Xeon
ID: 1808708