garbage collect error causes GPU to hang

Joseph Stateson Project Donor
Volunteer tester
Joined: 27 May 99
Posts: 309
Credit: 70,759,933
RAC: 3
United States
Message 2006713 - Posted: 10 Aug 2019, 21:34:53 UTC
Last modified: 10 Aug 2019, 21:54:42 UTC

Two of the GPUs on my 10-GPU mining rig are stuck: 0% utilization, with the work unit showing 100% done.

Error messages from, I assume, each of the stuck GPUs:

7209	SETI@home	8/10/2019 3:34:45 PM	[error] garbage_collect(); still have active task for acked result blc32_2bit_guppi_58643_76143_HIP73005_0101.26078.409.23.46.97.vlar_0; state 5	
10233	SETI@home	8/10/2019 4:20:49 PM	[error] garbage_collect(); still have active task for acked result blc33_2bit_guppi_58643_86349_HIP33332_0131.3725.0.23.46.188.vlar_0; state 5	
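
For anyone pulling these messages out of a longer event log, here is a minimal Python sketch. The regular expression is modelled only on the two lines quoted above, so the pattern is an assumption and other client versions may word the message differently; `stuck_results` is a hypothetical helper name, not part of BOINC.

```python
import re

# Pattern inferred from the two quoted log lines only (assumption).
GC_ERROR = re.compile(
    r"\[error\] garbage_collect\(\); still have active task for acked result "
    r"(?P<result>\S+); state (?P<state>\d+)"
)


def stuck_results(log_text):
    """Return (result_name, state) pairs for every matching log line."""
    return [
        (match.group("result"), int(match.group("state")))
        for match in GC_ERROR.finditer(log_text)
    ]
```

Feeding it the text of the event log gives the result names to look for in the task list, which may help tie each message to a particular GPU.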


What's happening?

I am using client 7.16.7. Googling, I found a previous report from 2010, also on this project.

I have 8 GB of RAM. Maybe I need to add more?

[edit] sudo /etc/init.d/boinc-client stop didn't stop the client.

Neither did kill -9 or plain kill.

boinc still shows up in htop with the argument -detect-gpu.

I need to reboot.
ID: 2006713
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22160
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2006762 - Posted: 11 Aug 2019, 6:49:06 UTC

Sometimes, if you suspend processing on the affected tasks for a few minutes (during which time other tasks will start) and then resume them (the tasks will report as "waiting to run" at a lower % complete), they will run to completion when they next run. If this starts to happen fairly regularly, then it is time to shut down, evict the dust bunnies, re-seat all cables, then restart.
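
If you would rather script the suspend, wait, and resume steps than click through the Manager, a rough Python sketch follows. It relies on `boinccmd --task URL task_name suspend|resume`; the project URL must match what your client reports for the project, and the 300-second default wait, the function names, and the injectable `run`/`sleep` parameters are illustrative assumptions rather than anything BOINC prescribes.

```python
import subprocess
import time


def task_command(project_url, task_name, op):
    """Build the boinccmd argument list for one task operation."""
    if op not in ("suspend", "resume"):
        raise ValueError("op must be 'suspend' or 'resume'")
    return ["boinccmd", "--task", project_url, task_name, op]


def suspend_then_resume(project_url, task_name, wait_seconds=300,
                        run=subprocess.run, sleep=time.sleep):
    """Suspend a task, wait a while, then resume it."""
    # check=True raises if boinccmd exits non-zero (e.g. client unreachable).
    run(task_command(project_url, task_name, "suspend"), check=True)
    sleep(wait_seconds)  # "a few minutes"; the default is an assumed value
    run(task_command(project_url, task_name, "resume"), check=True)
```

Run it from the BOINC data directory (or as a user who can read the GUI RPC password) so that boinccmd can authenticate with the client.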
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2006762
Joseph Stateson Project Donor
Volunteer tester
Joined: 27 May 99
Posts: 309
Credit: 70,759,933
RAC: 3
United States
Message 2006792 - Posted: 11 Aug 2019, 14:41:30 UTC - in response to Message 2006762.  

Sometimes, if you suspend processing on the affected tasks for a few minutes (during which time other tasks will start) and then resume them (the tasks will report as "waiting to run" at a lower % complete), they will run to completion when they next run. If this starts to happen fairly regularly, then it is time to shut down, evict the dust bunnies, re-seat all cables, then restart.


Suspending does not help. Those tasks are stuck. I am not an expert on Linux. How can I kill those tasks so I can avoid rebooting? Is the reason I cannot kill them that they belong to BOINC? Or maybe they are just hung and cannot receive a term signal?

The task is (pardon the screen/text grab):
3376 jstateson	20	0 29696	-3808	3284 S	0.0	0.1	0:00.03 -bash
3407 boinc	30	10 78.8G	79014	36014 S	0.0 10.0	0:00.00 ../../projects/setiathome.berkeley.edu/setiathome_x41p_V0.98bl_x86_64-pc-linux-gnu_cuda90


So the owner of process 3407 is "boinc" and I own 3376.

Using sudo kill -9 3407 has no effect when that task is "hung".

Can I log in as "boinc" to terminate it? Is there a password?
Maybe it is hung so badly that it can't receive the terminate signal.
Is there another way to terminate it?
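
One thing worth checking, offered as a possibility rather than a diagnosis: on Linux, root can signal any user's process, so ownership by "boinc" should not be what blocks sudo kill -9. A thread blocked inside a kernel or driver call (state D, uninterruptible sleep) does not act on any signal, SIGKILL included, until that call returns. The grab above shows S on the 3407 row, so this would only apply if some other thread of the process is in D. The Python sketch below reads the state letter per thread from /proc; the function names are hypothetical.

```python
from pathlib import Path


def parse_state(stat_line):
    """Return the one-letter state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the last ')' instead of on whitespace.
    _, _, rest = stat_line.rpartition(")")
    return rest.split()[0]


def thread_states(pid, proc_root="/proc"):
    """Map each thread id of `pid` to its state letter (Linux only)."""
    states = {}
    for stat_file in Path(proc_root, str(pid), "task").glob("*/stat"):
        try:
            states[stat_file.parent.name] = parse_state(stat_file.read_text())
        except OSError:
            continue  # the thread exited while we were reading
    return states
```

Calling thread_states(3407) and finding any "D" entries would suggest the process is waiting on the GPU driver, in which case no signal will remove it until the driver call returns or the machine is rebooted.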
ID: 2006792
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22160
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2006796 - Posted: 11 Aug 2019, 15:19:07 UTC

Try using "sudo kill" instead of just "kill" - "sudo" gives you elevated privileges.
If that fails, you will have to shut down BOINC (client and manager) and restart it.
After that, it's a reboot of the computer - very much a last resort.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2006796
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22160
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2006799 - Posted: 11 Aug 2019, 15:25:56 UTC

Additional comment - your computer https://setiathome.berkeley.edu/results.php?hostid=8757016 is returning a high number of "time exceeded" faults on all GPUs - time to have a look at the hardware for dust, defective risers, etc.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2006799
Joseph Stateson Project Donor
Volunteer tester
Joined: 27 May 99
Posts: 309
Credit: 70,759,933
RAC: 3
United States
Message 2008965 - Posted: 23 Aug 2019, 18:27:23 UTC

This just happened again. I had doubled the memory, thinking the system was running out of memory. This system has the 7.16.1 client, but I use BoincTasks to access it. When this happens, the GPU is stuck and never gets another task.

sudo /etc/init.d/boinc-client restart

is a disaster: the tasks disappear and never show up again.

top shows 6 SETI tasks but no client.

boinccmd --quit says it cannot connect to the client.

sudo shutdown now

may work, but after about 5 minutes I cycle the power.

I wonder if this could be reported as a bug, or as a request to handle stuck tasks differently. I failed to make a note of which GPU was stuck. If it is the same one each time, then it is possibly a hardware problem.
ID: 2008965
Kissagogo27 Special Project $75 donor
Joined: 6 Nov 99
Posts: 715
Credit: 8,032,827
RAC: 62
France
Message 2009094 - Posted: 24 Aug 2019, 14:15:03 UTC

Same thing here twice this week on Windows: the AP GPU app hangs. I shut down BOINC from the BOINC Manager File menu (Exit), but boinc.exe and the ap.exe app still show up in the process manager. I kill the boinc.exe process tree to wipe them, but...

In boinc_data/slots/0, I try to wipe all the files and see a shortcut to the AP.exe app that I can't delete. Sometimes killing explorer.exe makes it go away (and restarting BOINC then preserves the elapsed time). Sometimes it doesn't, and I have to boot from a DOS floppy to wipe it and then reboot, after which the AP task restarts at 0; otherwise it ends as an errored AP work unit...
ID: 2009094
