Long-running work unit
Author | Message |
---|---|
Graeme Hewson Joined: 14 Jun 99 Posts: 19 Credit: 242,802 RAC: 0 |
I have a work unit where the estimated remaining time field is empty. It took about 40 hours to get to 99.999% complete, and was still running after 48 hours when I suspended it. The application is SETI@home v7 7.01. Most of my work units take about 2 hours to complete. What should I do? Abort it? |
Sutaru Tsureku Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
I would reboot the PC and see what the task does. It looks like the task couldn't finish correctly. If you open the Processes tab in Task Manager, how much CPU time is this task using? Is it an AMD CPU? AFAIK it can happen with stock apps that a task doesn't finish correctly. After the reboot I would wait some time; then, if the task still doesn't finish correctly, I would abort it. If you are an advanced user, you could install »opti apps«. |
Claggy Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
Try restarting Boinc, or restarting the host. With your computers hidden, and there only being a 7.01 app for Linux, I'm assuming you're running Linux, and that you've got Boinc from the repository. Setiathome Applications Claggy |
Graeme Hewson Joined: 14 Jun 99 Posts: 19 Credit: 242,802 RAC: 0 |
Yes, I'm running Linux. I restarted Boinc and the work unit, and the progress dropped to 0%. :-( At least it now has a Remaining estimate (rather longer than usual, about 3.5 hours). Looks good so far. It sounds as though this is a known problem, then? |
Graeme Hewson Joined: 14 Jun 99 Posts: 19 Credit: 242,802 RAC: 0 |
The original problem seems to be back. The elapsed time is about 3.5 hours now (which was the estimated remaining time at first), but the progress is only about 60%, while the estimated remaining time has gone to being shown as "---". Abort it, I guess, unless there's anything I can do to debug it? |
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
If I have a troubled task, I check whether any of the wingmates that were also assigned the task had issues. If not, I'll give it one more go before calling it quits. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the BP6/VP6 User Group (http://tinyurl.com/8y46zvu) |
Graeme Hewson Joined: 14 Jun 99 Posts: 19 Credit: 242,802 RAC: 0 |
How would I find/alert the wingmates? The WU's name is 20fe09af.20377.5815.438086664197.12.252.vlar_0. As far as I can see, it's doing now what it did the first time. What are the chances of it succeeding on the third try? |
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
Go to your list of computers: http://setiathome.berkeley.edu/hosts_user.php Select Tasks for that machine. In the Task column choose Show names. Locate the task and click on the Work Unit ID next to it in the list. Now you can look at the Task details for the wingmates that have completed it. |
BilBg Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0 |
"How would I find/alert the wingmates? The WU's name is 20fe09af.20377.5815.438086664197.12.252.vlar_0" It was completed OK by another (Mac OS) computer: http://setiathome.berkeley.edu/workunit.php?wuid=1589164773 And for reference of other readers, your computer is: http://setiathome.berkeley.edu/show_host_detail.php?hostid=7306664 You didn't answer an earlier question: do you see CPU load by this process? (And vlar tasks take more time.) - ALF - "Find out what you don't do well ..... then don't do it!" :) |
Claggy Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
"Yes, I'm running Linux. I restarted Boinc and the work unit, and the progress dropped to 0%. :-(" The latest Boinc versions estimate progress if the app doesn't report its progress; that can lead to progress being reported even if the app isn't making any actual progress. Claggy |
Graeme Hewson Joined: 14 Jun 99 Posts: 19 Credit: 242,802 RAC: 0 |
"Do you see CPU load by this process?" Yes, the "top" command shows a normal CPU load for the task. It's an AMD CPU. |
BilBg Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0 |
Then you can check whether it is really progressing (or is hung): go to <BOINC_Data>\slots\ and find the relevant slot # (the WU name is in work_unit.sah). Sort the file list by date and check whether the most recently updated files have a near-current time. Look in the following files for rows like these: boinc_task_state.xml (this file is written by BOINC, may not show the real progress): <fraction_done>0.850431</fraction_done>; state.sah (this file is written by the app, this is the checkpoint file): <prog>0.85040579</prog> (the above shows ~85% done) |
Graeme Hewson Joined: 14 Jun 99 Posts: 19 Credit: 242,802 RAC: 0 |
There's no state file. The only file being updated is boinc_setiathome_2, and this is updated whether or not the WU is running. stderr.txt says: setiathome_v7 7.00 Revision: 1772 g++ (GCC) 4.4.6 20110731 (Red Hat 4.4.6-3) libboinc: BOINC 7.1.0 Work Unit Info: ............... WU true angle range is : 0.014299 Optimal function choices: -------------------------------------------------------- name timing error -------------------------------------------------------- setiathome_v7 7.00 Revision: 1772 g++ (GCC) 4.4.6 20110731 (Red Hat 4.4.6-3) libboinc: BOINC 7.1.0 Work Unit Info: ............... WU true angle range is : 0.014299 Optimal function choices: -------------------------------------------------------- name timing error -------------------------------------------------------- |
BilBg Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0 |
The app is hung in the 'Optimal function choice' routine. 'Optimal function choice' does not depend on the task itself (so the hang may happen again with any task); this test is done before the app even looks at the task data. Suspend/Restart it (wait a few minutes after Restart). Look in stderr.txt to see the printed functions (Optimal function choices) in the table (do Suspend/Restart a few times if needed). A normal stderr.txt should start with: http://setiathome.berkeley.edu/result.php?resultid=3732593977 <stderr_txt> setiathome_v7 7.00 Revision: 1772 g++ (GCC) 4.4.6 20110731 (Red Hat 4.4.6-3) libboinc: BOINC 7.1.0 Work Unit Info: ............... WU true angle range is : 0.384554 Optimal function choices: -------------------------------------------------------- name timing error -------------------------------------------------------- v_BaseLineSmooth (no other) v_vGetPowerSpectrumUnrolled 0.000039 0.00000 avx_ChirpData_c 0.001676 0.00000 v_avxTranspose4x16ntw 0.000320 0.00000 AK SSE folding 0.000227 0.00000 |
BilBg Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0 |
Optimized apps do not have this problem (they do not do 'Optimal function choice') and are of course much faster. If you want to go for them: http://www.arkayn.us/forum/index.php?action=tpmod;dl=cat5 |
Graeme Hewson Joined: 14 Jun 99 Posts: 19 Credit: 242,802 RAC: 0 |
I've suspended and restarted the task several times, but stderr.txt doesn't change; its last write time is three days ago. I tried running gdb, but it doesn't look promising: Attaching to process 29675 Reading symbols from /var/lib/boinc-client/projects/setiathome.berkeley.edu/setiathome_7.01_x86_64-pc-linux-gnu...(no debugging symbols found)...done. 0x000000000040d708 in ?? () (gdb) bt #0 0x000000000040d708 in ?? () #1 0x0000000000000000 in ?? () (gdb) n Cannot find bounds of current function (gdb) s Cannot find bounds of current function (gdb) ni 0x00000000004a557c in ?? () (gdb) ni 0x00000000004a557d in ?? () (gdb) 0x00000000004a5580 in ?? () (gdb) 0x00000000004a5584 in ?? () (gdb) 0x00000000004a5587 in ?? () (gdb) 0x00000000004a558c in ?? () (gdb) 0x00000000004a5591 in ?? () (gdb) 0x000000000073c6c0 in ?? () (gdb) 0x000000000073c6c5 in ?? () (gdb) 0x000000000073c6c7 in ?? () (gdb) 0x000000000073c6cd in ?? () (gdb) 0x000000000073c6d3 in ?? () (gdb) 0x00000000004a5596 in ?? () (gdb) 0x00000000004a559d in ?? () (gdb) 0x00000000004a559f in ?? () (gdb) 0x00000000004a55ae in ?? () (gdb) 0x00000000004a55b4 in ?? () (gdb) 0x00000000004a55b6 in ?? () (gdb) 0x00000000004a55b8 in ?? () (gdb) 0x00000000004a55c4 in ?? () (gdb) 0x00000000004a55ca in ?? () (gdb) 0x00000000004a55cc in ?? () (gdb) 0x00000000004a55df in ?? () (gdb) 0x00000000004a55e4 in ?? () (gdb) 0x00000000004a55e6 in ?? () (gdb) 0x00000000004a55e8 in ?? () (gdb) 0x00000000004a55e9 in ?? () (gdb) <signal handler called> (gdb) <signal handler called> (gdb) det Detaching from program: /var/lib/boinc-client/projects/setiathome.berkeley.edu/setiathome_7.01_x86_64-pc-linux-gnu, process 29675 |
Graeme Hewson Joined: 14 Jun 99 Posts: 19 Credit: 242,802 RAC: 0 |
BOINC Manager wasn't downloading new WUs, saying: "Not requesting tasks: some task is suspended via Manager". So I aborted the task. Unfortunately, BOINC Manager still isn't downloading new WUs, issuing the same message. Any ideas? |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
"BOINC Manager wasn't downloading new WUs, saying: 'Not requesting tasks: some task is suspended via Manager'." What it says on the tin. BOINC will not fetch new work while any task from the project is suspended. Note that 'suspended' (by you) is different from 'waiting to run' (BOINC's task management). Open BOINC Manager in Advanced view, ensure that all tasks are displayed (not just active tasks), and look down the 'status' column. Highlight any task(s) that say 'Task suspended by user', and click the third button on the left, which will have changed to say 'Resume'. BTW, I answered a problem you posted at Einstein yesterday: your assumption was wrong, and your proposed course of action was inadvisable. |
Graeme Hewson Joined: 14 Jun 99 Posts: 19 Credit: 242,802 RAC: 0 |
There are no SETI@Home tasks shown in my manager. The one I aborted was the only one, and it's not there now. Thanks, I saw your reply about Einstein, but I haven't got around yet to doing the reading you suggested. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
"There are no SETI@Home tasks shown in my manager. The one I aborted was the only one, and it's not there now." Ensure that all tasks are displayed. The top button on the left should have the caption "Show active tasks": that would be the outcome if you clicked it. |
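
BilBg's slot-directory check earlier in the thread (compare fraction_done in boinc_task_state.xml with prog in state.sah, and see how recently anything in the slot was written) could be scripted roughly as follows. This is only a sketch: the function names read_value and slot_report are made up for illustration, the two tag names are taken from the posts above, and the exact layout of the files around those tags is assumed rather than checked.

```python
import re
import time
from pathlib import Path

# Tag names come from the thread; everything else about the files is assumed.
FRACTION_RE = re.compile(r"<fraction_done>\s*([0-9.]+)\s*</fraction_done>")
PROG_RE = re.compile(r"<prog>\s*([0-9.]+)\s*</prog>")


def read_value(path, pattern):
    """Return the first number matched by pattern in the file, or None."""
    p = Path(path)
    if not p.is_file():
        return None
    match = pattern.search(p.read_text(errors="replace"))
    return float(match.group(1)) if match else None


def slot_report(slot_dir, now=None):
    """Summarise what BOINC and the app each record about one slot."""
    slot = Path(slot_dir)
    now = time.time() if now is None else now
    files = [f for f in slot.iterdir() if f.is_file()]
    # Most recently modified file: a stale timestamp hints at a hung app.
    newest = max(files, key=lambda f: f.stat().st_mtime, default=None)
    return {
        # Written by BOINC; may be an estimate, not real progress.
        "boinc_fraction": read_value(slot / "boinc_task_state.xml", FRACTION_RE),
        # Written by the app at checkpoints; None if state.sah is missing.
        "app_progress": read_value(slot / "state.sah", PROG_RE),
        "newest_file": newest.name if newest else None,
        "seconds_since_write": (now - newest.stat().st_mtime) if newest else None,
    }
```

Something like slot_report("/var/lib/boinc-client/slots/0") would be a plausible call on the Linux host discussed here, but that slot path is a guess based on the project path in the gdb output. A result where boinc_fraction keeps rising while app_progress is None would match what Graeme saw: BOINC showing progress with no state file from the app.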