Suddenly BOINC Decides to Abandon 71 APs...WTH?

TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1695660 - Posted: 25 Jun 2015, 19:48:16 UTC
Last modified: 25 Jun 2015, 20:40:02 UTC

I've seen this with other hosts; now it hits me. One host on Beta is constantly having tasks abandoned: http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=71714&offset=40

It would be nice if this were fixed. Error tasks for computer 6796479:
Thu Jun 25 14:26:47 2015 | SETI@home | [sched_op] Starting scheduler request
Thu Jun 25 14:26:47 2015 | SETI@home | Sending scheduler request: To report completed tasks.
Thu Jun 25 14:26:47 2015 | SETI@home | Reporting 1 completed tasks
Thu Jun 25 14:26:47 2015 | SETI@home | Requesting new tasks for AMD/ATI GPU
Thu Jun 25 14:26:47 2015 | SETI@home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
Thu Jun 25 14:26:47 2015 | SETI@home | [sched_op] AMD/ATI GPU work request: 659269.91 seconds; 0.00 devices
Thu Jun 25 14:32:01 2015 | SETI@home | Scheduler request failed: Timeout was reached
Thu Jun 25 14:32:01 2015 | SETI@home | [sched_op] Deferring communication for 00:01:25
Thu Jun 25 14:32:01 2015 | SETI@home | [sched_op] Reason: Scheduler request failed

Thu Jun 25 14:33:31 2015 | SETI@home | [sched_op] Starting scheduler request
Thu Jun 25 14:33:31 2015 | SETI@home | Sending scheduler request: To report completed tasks.
Thu Jun 25 14:33:31 2015 | SETI@home | Reporting 1 completed tasks
Thu Jun 25 14:33:31 2015 | SETI@home | Requesting new tasks for AMD/ATI GPU
Thu Jun 25 14:33:31 2015 | SETI@home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
Thu Jun 25 14:33:31 2015 | SETI@home | [sched_op] AMD/ATI GPU work request: 660690.66 seconds; 0.00 devices
Thu Jun 25 14:33:34 2015 | SETI@home | Scheduler request completed: got 0 new tasks
Thu Jun 25 14:33:34 2015 | SETI@home | [sched_op] Server version 707
Thu Jun 25 14:33:34 2015 | SETI@home | No tasks sent
Thu Jun 25 14:33:34 2015 | SETI@home | No tasks are available for AstroPulse v7
Thu Jun 25 14:33:34 2015 | SETI@home | Tasks for CPU are available, but your preferences are set to not accept them
Thu Jun 25 14:33:34 2015 | SETI@home | Project requested delay of 303 seconds
Thu Jun 25 14:33:34 2015 | SETI@home | [sched_op] handle_scheduler_reply(): got ack for task ap_07fe15aa_B2_P1_00161_20150624_15414.wu_1
Thu Jun 25 14:33:34 2015 | SETI@home | [sched_op] Deferring communication for 00:05:03
Thu Jun 25 14:33:34 2015 | SETI@home | [sched_op] Reason: requested by project
Thu Jun 25 14:38:19 2015 | SETI@home | Message from task: 0
Thu Jun 25 14:38:19 2015 | SETI@home | Computation for task ap_07fe15aa_B3_P0_00092_20150624_17213.wu_1 finished
Thu Jun 25 14:38:19 2015 | SETI@home | Starting task ap_06fe15aa_B6_P1_00291_20150624_15160.wu_1
Thu Jun 25 14:38:22 2015 | SETI@home | Started upload of ap_07fe15aa_B3_P0_00092_20150624_17213.wu_1_0
Thu Jun 25 14:38:25 2015 | SETI@home | Finished upload of ap_07fe15aa_B3_P0_00092_20150624_17213.wu_1_0
Thu Jun 25 14:38:40 2015 | SETI@home | [sched_op] Starting scheduler request
Thu Jun 25 14:38:40 2015 | SETI@home | Sending scheduler request: To report completed tasks.
Thu Jun 25 14:38:40 2015 | SETI@home | Reporting 1 completed tasks
Thu Jun 25 14:38:40 2015 | SETI@home | Requesting new tasks for AMD/ATI GPU
Thu Jun 25 14:38:40 2015 | SETI@home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
Thu Jun 25 14:38:40 2015 | SETI@home | [sched_op] AMD/ATI GPU work request: 662485.94 seconds; 0.00 devices
Thu Jun 25 14:38:42 2015 | SETI@home | Scheduler request completed: got 0 new tasks
Thu Jun 25 14:38:42 2015 | SETI@home | [sched_op] Server version 707
Thu Jun 25 14:38:42 2015 | SETI@home | Not sending work - last request too recent: 149 sec
Thu Jun 25 14:38:42 2015 | SETI@home | Project requested delay of 303 seconds
Thu Jun 25 14:38:42 2015 | SETI@home | [sched_op] handle_scheduler_reply(): got ack for task ap_07fe15aa_B3_P0_00092_20150624_17213.wu_1
Thu Jun 25 14:38:42 2015 | SETI@home | [sched_op] Deferring communication for 00:05:03
Thu Jun 25 14:38:42 2015 | SETI@home | [sched_op] Reason: requested by project
Thu Jun 25 14:41:21 2015 | SETI@home | Message from task: 0
Thu Jun 25 14:41:21 2015 | SETI@home | Computation for task ap_07fe15aa_B2_P0_00226_20150624_28439.wu_1 finished
Thu Jun 25 14:41:21 2015 | SETI@home | Starting task ap_07ja15aa_B1_P0_00189_20150624_29454.wu_1
Thu Jun 25 14:41:23 2015 | SETI@home | Started upload of ap_07fe15aa_B2_P0_00226_20150624_28439.wu_1_0
Thu Jun 25 14:41:25 2015 | SETI@home | Finished upload of ap_07fe15aa_B2_P0_00226_20150624_28439.wu_1_0
Thu Jun 25 14:43:47 2015 | SETI@home | [sched_op] Starting scheduler request
Thu Jun 25 14:43:47 2015 | SETI@home | Sending scheduler request: To report completed tasks.
Thu Jun 25 14:43:47 2015 | SETI@home | Reporting 1 completed tasks
Thu Jun 25 14:43:47 2015 | SETI@home | Requesting new tasks for AMD/ATI GPU
Thu Jun 25 14:43:47 2015 | SETI@home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
Thu Jun 25 14:43:47 2015 | SETI@home | [sched_op] AMD/ATI GPU work request: 663512.99 seconds; 0.00 devices
Thu Jun 25 14:43:49 2015 | SETI@home | Scheduler request completed: got 0 new tasks
Thu Jun 25 14:43:49 2015 | SETI@home | [sched_op] Server version 707
Thu Jun 25 14:43:49 2015 | SETI@home | No tasks sent
Thu Jun 25 14:43:49 2015 | SETI@home | No tasks are available for AstroPulse v7
Thu Jun 25 14:43:49 2015 | SETI@home | Tasks for CPU are available, but your preferences are set to not accept them
Thu Jun 25 14:43:49 2015 | SETI@home | Project requested delay of 303 seconds

It would also be nice if you were notified when BOINC trashes all your work, so you at least have a clue about what just happened.
Oh, and you can forget about any 'flaky' internet connection on this end. This machine is connected to Verizon FiOS with newish Cat5 cable, and my junction box is about 70 feet from a state highway. Not to mention that my other 3 machines didn't have any problems around the same time.
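
For anyone who wants to pull the failed requests out of a longer event log, here is a minimal Python sketch. It assumes lines shaped exactly like the excerpt above (timestamp, project, message, separated by " | "); the function names are made up for illustration and are not part of BOINC.

```python
import re
from datetime import datetime

# Assumed line shape, taken from the excerpt above:
# "Thu Jun 25 14:32:01 2015 | SETI@home | Scheduler request failed: Timeout was reached"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}) \| "
    r"(?P<project>[^|]+) \| (?P<msg>.*)$"
)


def parse_line(line):
    """Return (timestamp, project, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%a %b %d %H:%M:%S %Y")
    return ts, m.group("project").strip(), m.group("msg")


def scheduler_outcomes(lines):
    """Pair each 'Starting scheduler request' with the line that ends it.

    Returns a list of (started, ended, seconds, message) tuples.
    """
    outcomes = []
    started = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, _project, msg = parsed
        if msg.endswith("Starting scheduler request"):
            started = ts
        elif msg.startswith(("Scheduler request failed", "Scheduler request completed")):
            if started is not None:
                outcomes.append((started, ts, (ts - started).total_seconds(), msg))
                started = None
    return outcomes


# Four lines copied from the log above.
SAMPLE = [
    "Thu Jun 25 14:26:47 2015 | SETI@home | [sched_op] Starting scheduler request",
    "Thu Jun 25 14:32:01 2015 | SETI@home | Scheduler request failed: Timeout was reached",
    "Thu Jun 25 14:33:31 2015 | SETI@home | [sched_op] Starting scheduler request",
    "Thu Jun 25 14:33:34 2015 | SETI@home | Scheduler request completed: got 0 new tasks",
]

for started, ended, seconds, msg in scheduler_outcomes(SAMPLE):
    print(started.time(), "->", ended.time(), f"({seconds:.0f} s)", msg)
```

On the sample, the first request hangs for a little over five minutes before the timeout, while the retry completes in a few seconds.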
ID: 1695660 · Report as offensive
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1695687 - Posted: 25 Jun 2015, 21:11:22 UTC - in response to Message 1695660.  
Last modified: 25 Jun 2015, 21:12:48 UTC

This host had contact with the server mere seconds before the server decided to trash the tasks on the other machine:
25 Jun 2015, 18:36:13 UTC Abandoned
Thu 25 Jun 2015 02:35:47 PM EDT | SETI@home | Computation for task ap_04fe15ab_B3_P0_00067_20150624_07836.wu_0 finished
Thu 25 Jun 2015 02:35:47 PM EDT | SETI@home | Starting task ap_05fe15aa_B4_P0_00122_20150624_01881.wu_0 using astropulse_v7 version 701 in slot 1
Thu 25 Jun 2015 02:35:49 PM EDT | SETI@home | Started upload of ap_04fe15ab_B3_P0_00067_20150624_07836.wu_0_0
Thu 25 Jun 2015 02:35:53 PM EDT | SETI@home | Finished upload of ap_04fe15ab_B3_P0_00067_20150624_07836.wu_0_0
Thu 25 Jun 2015 02:35:54 PM EDT | SETI@home | [sched_op] Starting scheduler request
Thu 25 Jun 2015 02:35:54 PM EDT | SETI@home | Sending scheduler request: To fetch work.
Thu 25 Jun 2015 02:35:54 PM EDT | SETI@home | Reporting 1 completed tasks
Thu 25 Jun 2015 02:35:54 PM EDT | SETI@home | Requesting new tasks for CPU
Thu 25 Jun 2015 02:35:54 PM EDT | SETI@home | [sched_op] CPU work request: 1600230.31 seconds; 0.00 devices
Thu 25 Jun 2015 02:35:54 PM EDT | SETI@home | [sched_op] ATI work request: 0.00 seconds; 0.00 devices
Thu 25 Jun 2015 02:35:56 PM EDT | SETI@home | Scheduler request completed: got 0 new tasks
Thu 25 Jun 2015 02:35:56 PM EDT | SETI@home | [sched_op] Server version 707
Thu 25 Jun 2015 02:35:56 PM EDT | SETI@home | No tasks sent
Thu 25 Jun 2015 02:35:56 PM EDT | SETI@home | No tasks are available for AstroPulse v7
Thu 25 Jun 2015 02:35:56 PM EDT | SETI@home | Project requested delay of 303 seconds
Thu 25 Jun 2015 02:35:56 PM EDT | SETI@home | [sched_op] handle_scheduler_reply(): got ack for task ap_04fe15ab_B3_P0_00067_20150624_07836.wu_0
Thu 25 Jun 2015 02:35:56 PM EDT | SETI@home | [sched_op] Deferring communication for 00:05:03
Thu 25 Jun 2015 02:35:56 PM EDT | SETI@home | [sched_op] Reason: requested by project
Thu 25 Jun 2015 02:46:02 PM EDT | SETI@home | [sched_op] Starting scheduler request
Thu 25 Jun 2015 02:46:02 PM EDT | SETI@home | Sending scheduler request: To fetch work.

Both machines are connected to the same router.
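
A quick timestamp check, as a rough Python sketch. It uses only times quoted in the logs above and assumes that all local times are EDT (UTC-4), that the client clocks and the server clock roughly agree, and that the "last request too recent: 149 sec" figure was measured at about the moment the reply was logged. The variable names are illustrative.

```python
from datetime import datetime, timedelta, timezone

EDT = timezone(timedelta(hours=-4))  # assumption: local log times are EDT
UTC = timezone.utc

# Time shown next to "Abandoned" on the task list for the affected host.
abandoned = datetime(2015, 6, 25, 18, 36, 13, tzinfo=UTC)

# This host's last scheduler reply before that moment (02:35:56 PM EDT).
other_host_reply = datetime(2015, 6, 25, 14, 35, 56, tzinfo=EDT)

# Affected host: reply carrying "last request too recent: 149 sec" (14:38:42).
too_recent_reply = datetime(2015, 6, 25, 14, 38, 42, tzinfo=EDT)

# How long before the abandonment did this host talk to the server?
gap_before_abandon = (abandoned - other_host_reply).total_seconds()

# When does the server appear to have last seen the affected host?
implied_last_request = (too_recent_reply - timedelta(seconds=149)).astimezone(UTC)

print("gap before abandonment:", gap_before_abandon, "s")
print("implied last request:  ", implied_last_request.isoformat())
print("matches abandon time:  ", implied_last_request == abandoned)
```

Under those assumptions the implied "last request" lands on the same second as the abandonment, at a time when the affected host's own log shows no request being sent. That is suggestive rather than conclusive; only the server logs could confirm what was processed at that moment.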
ID: 1695687 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1695698 - Posted: 25 Jun 2015, 21:52:02 UTC - in response to Message 1695687.  
Last modified: 25 Jun 2015, 21:52:45 UTC

Task scheduling (and aborts, etc.) is driven entirely by the task estimates, which are complete rubbish, and several additional failure modes are possible when resend lost tasks is inactive.

Q: Given that the bounds are usually set to 10x the estimate, would you say they might have kicked in (falsely), or that your scheduler request somehow missed a bunch of tasks (for example, over a dicey wifi connection)?
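
To make the "10x estimate" remark concrete, here is a minimal Python sketch. It assumes the runtime limit is simply the estimated runtime multiplied by the bound factor; the function name, parameter names, and numbers are hypothetical, and only the 10x factor comes from the question above.

```python
def runtime_limits(fpops_est, flops, bound_factor=10.0):
    """Return (estimated_seconds, limit_seconds) for a task.

    fpops_est    -- estimated floating-point operations for the task (hypothetical value)
    flops        -- assumed device speed in operations per second (hypothetical value)
    bound_factor -- how far past the estimate a task may run (10x per the post)
    """
    estimated = fpops_est / flops
    return estimated, estimated * bound_factor


# Hypothetical task: 3.6e13 operations on a device assumed to sustain 1e10 ops/s.
estimated, limit = runtime_limits(36 * 10**12, 10**10)
print(f"estimate {estimated:.0f} s, limit {limit:.0f} s")
```

The point of the sketch is that a bad estimate moves the limit with it: if the estimate is ten times too low, the limit sits right at the task's true runtime.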
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1695698 · Report as offensive
Profile Zalster Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1695699 - Posted: 25 Jun 2015, 21:55:02 UTC - in response to Message 1695698.  

I've seen this happen when there is some disruption in the download of the data. I had about 100 MB error out the other day. I just happened to check the manager and saw "some downloads have stalled" in the event log.
ID: 1695699 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1695700 - Posted: 25 Jun 2015, 21:56:21 UTC - in response to Message 1695698.  

Note that "such as by a dicey wifi connection etc" includes anything your client never received/downloaded.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1695700 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1695701 - Posted: 25 Jun 2015, 21:57:14 UTC - in response to Message 1695699.  
Last modified: 25 Jun 2015, 21:58:02 UTC

I've seen this happen when there is some disruption in the download of the data. I had about 100 MB error out the other day. I just happened to check the manager and saw "some downloads have stalled" in the event log.



That's what I'm thinking. That shouldn't happen [but obviously does..].
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1695701 · Report as offensive
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1695704 - Posted: 25 Jun 2015, 22:08:18 UTC - in response to Message 1695698.  

As noted previously, my machines are connected to FIOS with a straight connection to the main trunk on a State Highway.

This host had contact with the server while the other host was being timed out.
Can't happen that way...the hosts are connected to the same router.

http://setiathome.berkeley.edu/results.php?hostid=6797524
6/25/2015 2:23:48 PM | SETI@home | Project requested delay of 303 seconds
6/25/2015 2:23:48 PM | SETI@home | [sched_op] Deferring communication for 00:05:03
6/25/2015 2:23:48 PM | SETI@home | [sched_op] Reason: requested by project
6/25/2015 2:28:21 PM | SETI@home | Computation for task ap_04fe15ab_B6_P0_00266_20150624_23207.wu_1 finished
6/25/2015 2:28:21 PM | SETI@home | Starting task ap_06fe15aa_B4_P0_00012_20150624_06136.wu_1 using astropulse_v7 version 704 (opencl_ati_100) in slot 0
6/25/2015 2:28:23 PM | SETI@home | Started upload of ap_04fe15ab_B6_P0_00266_20150624_23207.wu_1_0
6/25/2015 2:28:27 PM | SETI@home | Finished upload of ap_04fe15ab_B6_P0_00266_20150624_23207.wu_1_0
6/25/2015 2:28:52 PM | SETI@home | [sched_op] Starting scheduler request
6/25/2015 2:28:52 PM | SETI@home | Sending scheduler request: To report completed tasks.
6/25/2015 2:28:52 PM | SETI@home | Reporting 1 completed tasks
6/25/2015 2:28:52 PM | SETI@home | Requesting new tasks for ATI
6/25/2015 2:28:52 PM | SETI@home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
6/25/2015 2:28:52 PM | SETI@home | [sched_op] ATI work request: 89559.22 seconds; 0.00 devices
6/25/2015 2:28:54 PM | SETI@home | Scheduler request completed: got 0 new tasks
6/25/2015 2:28:54 PM | SETI@home | [sched_op] Server version 707
6/25/2015 2:28:54 PM | SETI@home | No tasks sent
6/25/2015 2:28:54 PM | SETI@home | No tasks are available for AstroPulse v7
6/25/2015 2:28:54 PM | SETI@home | Tasks for CPU are available, but your preferences are set to not accept them
6/25/2015 2:28:54 PM | SETI@home | Project requested delay of 303 seconds
6/25/2015 2:28:54 PM | SETI@home | [sched_op] handle_scheduler_reply(): got ack for task ap_04fe15ab_B6_P0_00266_20150624_23207.wu_1
6/25/2015 2:28:54 PM | SETI@home | [sched_op] Deferring communication for 00:05:03
6/25/2015 2:28:54 PM | SETI@home | [sched_op] Reason: requested by project
6/25/2015 2:29:09 PM | SETI@home | Computation for task ap_07fe15aa_B1_P0_00348_20150624_24414.wu_2 finished
6/25/2015 2:29:09 PM | SETI@home | Starting task ap_04fe15ab_B6_P1_00320_20150624_24572.wu_0 using astropulse_v7 version 704 (opencl_ati_100) in slot 1
6/25/2015 2:29:11 PM | SETI@home | Started upload of ap_07fe15aa_B1_P0_00348_20150624_24414.wu_2_0
6/25/2015 2:29:15 PM | SETI@home | Finished upload of ap_07fe15aa_B1_P0_00348_20150624_24414.wu_2_0
6/25/2015 2:33:58 PM | SETI@home | [sched_op] Starting scheduler request
6/25/2015 2:33:58 PM | SETI@home | Sending scheduler request: To report completed tasks.
6/25/2015 2:33:58 PM | SETI@home | Reporting 1 completed tasks
6/25/2015 2:33:58 PM | SETI@home | Requesting new tasks for ATI
6/25/2015 2:33:58 PM | SETI@home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
6/25/2015 2:33:58 PM | SETI@home | [sched_op] ATI work request: 103828.18 seconds; 0.00 devices
6/25/2015 2:34:00 PM | SETI@home | Scheduler request completed: got 0 new tasks
6/25/2015 2:34:00 PM | SETI@home | [sched_op] Server version 707
6/25/2015 2:34:00 PM | SETI@home | No tasks sent
6/25/2015 2:34:00 PM | SETI@home | No tasks are available for AstroPulse v7
6/25/2015 2:34:00 PM | SETI@home | Tasks for CPU are available, but your preferences are set to not accept them
6/25/2015 2:34:00 PM | SETI@home | Project requested delay of 303 seconds
6/25/2015 2:34:00 PM | SETI@home | [sched_op] handle_scheduler_reply(): got ack for task ap_07fe15aa_B1_P0_00348_20150624_24414.wu_2
6/25/2015 2:34:00 PM | SETI@home | [sched_op] Deferring communication for 00:05:03
6/25/2015 2:34:00 PM | SETI@home | [sched_op] Reason: requested by project
6/25/2015 2:39:05 PM | SETI@home | [sched_op] Starting scheduler request
6/25/2015 2:39:05 PM | SETI@home | Sending scheduler request: To fetch work.
6/25/2015 2:39:05 PM | SETI@home | Requesting new tasks for ATI
6/25/2015 2:39:05 PM | SETI@home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
6/25/2015 2:39:05 PM | SETI@home | [sched_op] ATI work request: 114553.69 seconds; 0.00 devices
6/25/2015 2:39:07 PM | SETI@home | Scheduler request completed: got 0 new tasks
6/25/2015 2:39:07 PM | SETI@home | [sched_op] Server version 707
6/25/2015 2:39:07 PM | SETI@home | No tasks sent
6/25/2015 2:39:07 PM | SETI@home | No tasks are available for AstroPulse v7
6/25/2015 2:39:07 PM | SETI@home | Tasks for CPU are available, but your preferences are set to not accept them
6/25/2015 2:39:07 PM | SETI@home | Project requested delay of 303 seconds
6/25/2015 2:39:07 PM | SETI@home | [sched_op] Deferring communication for 00:05:03
6/25/2015 2:39:07 PM | SETI@home | [sched_op] Reason: requested by project
6/25/2015 2:48:14 PM | SETI@home | [sched_op] Starting scheduler request

The host that had the Abandoned tasks was being timed out between 2:26:47 EDT and 2:33:31 EDT.
This host was talking to the server during that time. Sorry, the bad internet ploy doesn't cut it this time. I think we can rule out 'bad internet'.
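
The overlap can be checked mechanically. A minimal Python sketch, using the window quoted above and the "Scheduler request completed" times from this host's log; the helper name is made up for illustration.

```python
from datetime import datetime


def at(hms):
    """Build a datetime on 25 Jun 2015 from an HH:MM:SS string (24-hour clock)."""
    return datetime.strptime("2015-06-25 " + hms, "%Y-%m-%d %H:%M:%S")


# Window during which the affected host could not get through.
outage_start, outage_end = at("14:26:47"), at("14:33:31")

# Completed scheduler requests on this host, from the log above.
other_host_replies = [at("14:28:54"), at("14:34:00"), at("14:39:07")]

inside = [r for r in other_host_replies if outage_start <= r <= outage_end]
print("replies received during the other host's outage:", [r.time() for r in inside])
```

One completed request (2:28:54 PM) falls inside the window, so the shared router and line were carrying traffic to the server while the other host was timing out.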
ID: 1695704 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1695709 - Posted: 25 Jun 2015, 22:15:35 UTC - in response to Message 1695704.  
Last modified: 25 Jun 2015, 22:17:27 UTC

While the 'bad internet ploy' might seem a reasonable first tackle at the situation, it's just that: a first stop. I'd have to agree there are more vagaries going on in there.

To be clear, I'm not a big fan of the BOINC codebase (client or server), the methodologies employed, or the attitudes of the people involved (toward users).

Ultimately I suspect you will see weird behaviour like this until the BOINC development team learns to speak to real developers in a civil manner.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1695709 · Report as offensive
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1695716 - Posted: 25 Jun 2015, 22:28:45 UTC - in response to Message 1695709.  
Last modified: 25 Jun 2015, 22:29:25 UTC

Well, there is a host at Beta suffering Abandoned tasks multiple times daily. I think it would be wise to put a watch on that machine:
http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=71714&offset=60
ID: 1695716 · Report as offensive
Rasputin42
Volunteer tester

Send message
Joined: 25 Jul 08
Posts: 412
Credit: 5,834,661
RAC: 0
United States
Message 1695718 - Posted: 25 Jun 2015, 22:30:13 UTC

Maybe a fault in the router?
ID: 1695718 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1695719 - Posted: 25 Jun 2015, 22:31:56 UTC - in response to Message 1695716.  
Last modified: 25 Jun 2015, 22:32:13 UTC

Well, there is a host at Beta suffering Abandoned tasks multiple times daily. I think it would be wise to put a watch on that machine:
http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=71714&offset=60



I'll get my team on this. Do you know any background or details that may help them?
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1695719 · Report as offensive
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1695723 - Posted: 25 Jun 2015, 22:42:13 UTC - in response to Message 1695719.  

Well, there is a host at Beta suffering Abandoned tasks multiple times daily. I think it would be wise to put a watch on that machine:
http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=71714&offset=60



I'll get my team on this. Do you know any background or details that may help them?

Go for it. I'll also mention it to Eric the next time I request that one of my faulty apps on Beta be replaced. The schedule on that, for the past few weeks, has been every Monday...
ID: 1695723 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1695726 - Posted: 25 Jun 2015, 22:50:19 UTC - in response to Message 1695719.  

"Abandonment" is done by the routine mark_results_over() - line 197 of handle_request.cpp (server code, in \sched\)

This is called from precisely two places - lines 403 and 426 of the same file. One or other of those two must have been triggered.

You might like to read the preceding comment:

// Called when there's evidence that the host has detached.
// Mark in-progress results for the given host
// as server state OVER, outcome CLIENT_DETACHED.
// This serves two purposes:
// 1) make sure we don't resend these results to the host
// (they may be the reason the user detached)
// 2) trigger the generation of new results for these WUs
//
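
Read literally, that comment describes a simple state change. Here is a toy Python model of it; this is not the BOINC server code, the class and function names are made up, and the state names are taken only from the comment above.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Toy stand-in for a server-side result record (hypothetical fields)."""
    name: str
    host_id: int
    server_state: str = "IN_PROGRESS"
    outcome: str = "INIT"


def mark_results_over_sketch(results, host_id):
    """Mark every in-progress result for host_id as OVER / CLIENT_DETACHED.

    Returns the number of results changed. Results that are already
    finished, or that belong to other hosts, are left untouched.
    """
    changed = 0
    for r in results:
        if r.host_id == host_id and r.server_state == "IN_PROGRESS":
            r.server_state = "OVER"
            r.outcome = "CLIENT_DETACHED"
            changed += 1
    return changed


# Hypothetical task names; the host IDs are the two mentioned in this thread.
results = [
    Result("task_a", 6796479),
    Result("task_b", 6796479),
    Result("task_c", 6796479, server_state="OVER", outcome="SUCCESS"),
    Result("task_d", 6797524),
]
print(mark_results_over_sketch(results, 6796479), "results marked as abandoned")
```

Note that nothing in this step checks whether the host is still working on the tasks; whatever "evidence" triggered the call is taken at face value.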
ID: 1695726 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1695727 - Posted: 25 Jun 2015, 22:51:29 UTC - in response to Message 1695723.  

lol, thanks for the schedule. Yeah I have a personal affinity for Eric, but not in a homo way.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1695727 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1695728 - Posted: 25 Jun 2015, 22:54:05 UTC - in response to Message 1695726.  

I find the stress that comment puts on the host having detached interesting. The mind naturally wanders to the old spontaneous detach problem.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1695728 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1695730 - Posted: 25 Jun 2015, 22:55:52 UTC - in response to Message 1695728.  

I find the stress that comment puts on the host having detached interesting. The mind naturally wanders to the old spontaneous detach problem.

I think the crucial (but probably false) word is 'evidence'.
ID: 1695730 · Report as offensive
Profile Zalster Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1695734 - Posted: 25 Jun 2015, 23:00:26 UTC - in response to Message 1695704.  

As noted previously, my machines are connected to FIOS with a straight connection to the main trunk on a State Highway.

This host had contact with the server while the other host was being timed out.
Can't happen that way...the hosts are connected to the same router.

The host that had the Abandoned tasks was being timed out between 2:26:47 EDT and 2:33:31 EDT.
This host was talking to the server during that time. Sorry, the bad internet ploy doesn't cut it this time. I think we can rule out 'bad internet'.



TBar, I never said the problem was on your end. I could say the same about my connection, but I know it happens. So if it's not on your end and it's not on my end... Isn't there another end???? just saying.....
ID: 1695734 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1695735 - Posted: 25 Jun 2015, 23:05:01 UTC - in response to Message 1695734.  

Isn't there another end???? just saying.....


Yeah there's lots of other ends.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1695735 · Report as offensive
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1695736 - Posted: 25 Jun 2015, 23:07:29 UTC - in response to Message 1695726.  

"Abandonment" is done by the routine mark_results_over() - line 197 of handle_request.cpp (server code, in \sched\)

This is called from precisely two places - lines 403 and 426 of the same file. One or other of those two must have been triggered.

You might like to read the preceding comment:

// Called when there's evidence that the host has detached.
// Mark in-progress results for the given host
// as server state OVER, outcome CLIENT_DETACHED.
// This serves two purposes:
// 1) make sure we don't resend these results to the host
// (they may be the reason the user detached)
// 2) trigger the generation of new results for these WUs
//

As I recall, you were the one investigating these Abandonments. I think you were having trouble because it was an intermittent problem. Well, the problem isn't intermittent for the host at Beta; he got whacked about an hour before I did. He will probably be whacked again before too long. As for me, this was the first time I suffered this problem, although I fear it will not be the last. The problem does seem to be getting worse, and I don't think it has anything to do with 'bad internet'. What the problem could be is a mystery; I just know what it isn't.
ID: 1695736 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1695741 - Posted: 25 Jun 2015, 23:31:10 UTC - in response to Message 1695736.  

"Abandonment" is done by the routine mark_results_over() - line 197 of handle_request.cpp (server code, in \sched\)

This is called from precisely two places - lines 403 and 426 of the same file. One or other of those two must have been triggered.

You might like to read the preceding comment:

// Called when there's evidence that the host has detached.
// Mark in-progress results for the given host
// as server state OVER, outcome CLIENT_DETACHED.
// This serves two purposes:
// 1) make sure we don't resend these results to the host
// (they may be the reason the user detached)
// 2) trigger the generation of new results for these WUs
//

As I recall, you were the one investigating these Abandonments. I think you were having trouble because it was an intermittent problem. Well, the problem isn't intermittent for the host at Beta; he got whacked about an hour before I did. He will probably be whacked again before too long. As for me, this was the first time I suffered this problem, although I fear it will not be the last. The problem does seem to be getting worse, and I don't think it has anything to do with 'bad internet'. What the problem could be is a mystery; I just know what it isn't.

Agreed. IIRC, one of the cases which triggers mark_results_over() is processing requests with non-monotonic RPC sequence numbers (RPCseqnos) - in other words, processing scheduler requests in the wrong order. If that happens once in a blue moon, it's Eddie in the space-time continuum. If it happens consistently, then there's a cause, and it will show up in the logs - the server logs, that is. One (highly speculative) suggestion as to a possible cause: a user has a working computer, buys a second similar machine, and shortcuts the installation/setup procedure by copying the existing BOINC data directory to the second machine. That creates two hosts with the same HostID, and of course the RPC sequence number gets de-synchronised. A scan of the server logs would reveal that by the alternating IP addresses.
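
As a rough illustration of what such a scan might look for, here is a Python sketch. The record format, a list of (rpc_seqno, ip) pairs in the order the server processed them for one HostID, is an assumption made for this example and is not the real server log format; the function names are made up, and the IP addresses are documentation-only placeholders.

```python
def find_seqno_anomalies(requests):
    """Flag requests whose sequence number is lower than one already seen.

    requests -- (rpc_seqno, ip) pairs for a single HostID, in processing order.
    Returns a list of (index, seqno, highest_seen_so_far) tuples.
    """
    anomalies = []
    highest = None
    for index, (seqno, _ip) in enumerate(requests):
        if highest is not None and seqno < highest:
            anomalies.append((index, seqno, highest))
        else:
            highest = seqno
    return anomalies


def count_ip_switches(requests):
    """Count how often consecutive requests come from different IP addresses."""
    ips = [ip for _seqno, ip in requests]
    return sum(1 for prev, cur in zip(ips, ips[1:]) if prev != cur)


# Hypothetical pattern for two machines sharing one cloned HostID.
cloned = [
    (101, "192.0.2.1"),
    (102, "192.0.2.1"),
    (57, "198.51.100.7"),
    (103, "192.0.2.1"),
    (58, "198.51.100.7"),
]
print("out-of-order requests:", find_seqno_anomalies(cloned))
print("IP switches:", count_ip_switches(cloned))
```

A single host whose requests merely arrive out of order would show anomalies without the IP switching, which is one way the two explanations could be told apart.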

Other theories are available...

We do have one member of the 'team' who has access to server logs, and I caught his attention with a possibly related case a couple of weeks ago, in the thread 'Immediate timeout? Missing deadline?'. But so far, as you can see, there is no diagnosis or resolution.
ID: 1695741 · Report as offensive


 