Why not in order of date?

Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2010174 - Posted: 31 Aug 2019, 17:40:25 UTC

Rob, Keith - we're going a bit off-topic here, but thanks for volunteering to try to investigate. Before you get too far into it, please compare notes on exactly what version of BOINC you're each using - and I mean EXACTLY.

And please set the <sched_op_debug> Event Log flag, because...
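
(For anyone who wants to follow along: the flag goes in cc_config.xml in the BOINC data directory, inside the log_flags section. A minimal sketch, with everything else left at defaults:

<cc_config>
  <log_flags>
    <sched_op_debug>1</sched_op_debug>
  </log_flags>
</cc_config>

Read it back in with Options -> Read config files in the Manager, or restart the client.)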

Here's a log from my new Linux host. By kind permission of the spoofer, I've been allowed to test the spoofed client: this machine has two real cards, spoofed to report 16 in total.

31/08/2019 17:34:36 | SETI@home | Requesting new tasks for NVIDIA GPU
31/08/2019 17:34:36 | SETI@home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
31/08/2019 17:34:36 | SETI@home | [sched_op] NVIDIA GPU work request: 28.35 seconds; 0.00 devices
31/08/2019 17:34:39 | SETI@home | Scheduler request completed: got 16 new tasks
31/08/2019 17:34:39 | SETI@home | [sched_op] estimated total CPU task duration: 0 seconds
31/08/2019 17:34:39 | SETI@home | [sched_op] estimated total NVIDIA GPU task duration: 1737 seconds
The machine is currently estimating around 110 seconds per task, so a 28-second request should have got precisely one task. I think the 1,737-second estimate follows from the 16-task allocation, and precisely 16 tasks have been allocated because of the 16-GPU spoofing. Discuss?
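
A rough back-of-the-envelope for that log, using the numbers quoted above (my own arithmetic, not the scheduler's actual code):

# figures as quoted in the post above
est_per_task = 110.0    # current estimate per GPU task, seconds
request_secs = 28.35    # NVIDIA GPU work request, seconds
tasks_sent   = 16       # one task per spoofed GPU

print(request_secs / est_per_task)   # ~0.26 of a task - a single task would have covered the request
print(tasks_sent * est_per_task)     # 1760 s, in line with the 1737 s total estimate reported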
ID: 2010174
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2010175 - Posted: 31 Aug 2019, 17:46:58 UTC - in response to Message 2010173.  

A question before I do the actual test - roughly how long does a "typical" Einstein job take to run on a CPU?
And were they assigned to be run on a CPU?

From Keith's previous post:

Einstein Gamma Ray Pulsar tasks on the first connection. An Einstein task on my host crunches in between 465 and 585 seconds depending on the card.
Einstein Gamma Ray Pulsar tasks have both CPU and GPU applications available. Noting Keith's use of the word 'card' at the end, and the expected runtime, I'd suggest that these tasks were slated to be run on GPUs - hence my request for sched_op_debug output, so we can see the numbers and the devices 'in the raw'.
ID: 2010175
rob smith Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22829
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2010177 - Posted: 31 Aug 2019, 17:52:47 UTC

Ignore my question - I found a rough answer of three hours from some old notes of mine....

OK, with a very similar set of settings to Keith (apart from using a secondary cache of 0.001),
I've just received 21 tasks. Given that the estimated run-time for these tasks is typically a bit under two hours, that's about 40 hours of total CPU time, which equates to about 13.5 hours of work in hand. That is a lot more than 0.01 days (which is about 15 minutes, and would round up to three tasks, one for each of the CPU cores in use).
Actually the situation is even worse than described above: a pile of the tasks are estimating at about 12 hours...
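
To put numbers on that mismatch (my restatement of the arithmetic above, using the round figures quoted):

tasks          = 21
hours_per_task = 2.0                 # "a bit under two hours" each
cores_in_use   = 3

total_cpu_hours = tasks * hours_per_task           # ~42 hours of CPU time
work_in_hand    = total_cpu_hours / cores_in_use   # ~14 hours wall-clock

cache_minutes = 0.01 * 24 * 60                     # 14.4 minutes actually asked for
print(total_cpu_hours, work_in_hand, cache_minutes)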

Second and third work requests have just gone through, getting 17 and 5 tasks respectively.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2010177
rob smith Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22829
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2010178 - Posted: 31 Aug 2019, 17:56:14 UTC

Sorry, I missed your question Richard - I'm using stock 7.14.2 on a Windows 7 machine.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2010178
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2010180 - Posted: 31 Aug 2019, 18:02:25 UTC - in response to Message 2010177.  

It would also help if, during these tests, you grabbed a sample of the Einstein server log (available through the 'last contact' link on the host page on the Einstein web site), so we can match your client Event Log figures (what we assume it asked for) against the same request as seen by the server. Einstein is useful here because it's the only project which routinely makes this data available.
ID: 2010180
rob smith Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22829
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2010182 - Posted: 31 Aug 2019, 18:03:35 UTC

Having missed Richard's post, I forced an update which, as far as I can see, correctly identified that no more work was needed:

31/08/2019 18:58:18 | Einstein@Home | update requested by user
31/08/2019 18:58:20 | Einstein@Home | sched RPC pending: Requested by user
31/08/2019 18:58:20 | Einstein@Home | [sched_op] Starting scheduler request
31/08/2019 18:58:20 | Einstein@Home | Sending scheduler request: Requested by user.
31/08/2019 18:58:20 | Einstein@Home | Not requesting tasks: don't need (job cache full)
31/08/2019 18:58:20 | Einstein@Home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
31/08/2019 18:58:21 | Einstein@Home | Scheduler request completed
31/08/2019 18:58:21 | Einstein@Home | [sched_op] Server version 611
31/08/2019 18:58:21 | Einstein@Home | Project requested delay of 60 seconds
31/08/2019 18:58:21 | Einstein@Home | [sched_op] Deferring communication for 00:01:00
31/08/2019 18:58:21 | Einstein@Home | [sched_op] Reason: requested by project



It looks as though I'm going to have a long wait for an "automagic" update - the tasks are about 0.5% done in about 20 minutes.....

I'll increase the cache a bit and see what happens.....
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2010182
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2010185 - Posted: 31 Aug 2019, 18:09:34 UTC - in response to Message 2010182.  

Now the WOW! event is over, I'll give Einstein a quick poke with the spoofed machine and 16-GPU requests.
ID: 2010185
rob smith Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22829
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2010188 - Posted: 31 Aug 2019, 18:14:06 UTC - in response to Message 2010180.  

Is this what you are looking for, Richard?

2019-08-31 18:05:18.6371 [PID=29271]   Request: [USER#xxxxx] [HOST#12788495] [IP xxx.xxx.xxx.154] client 7.14.2
2019-08-31 18:05:18.6973 [PID=29271] [debug]   have_master:1 have_working: 1 have_db: 1
2019-08-31 18:05:18.6974 [PID=29271] [debug]   using working prefs
2019-08-31 18:05:18.6974 [PID=29271] [debug]   have db 1; dbmod 1567274694.000000; global mod 1567272183.000000
2019-08-31 18:05:18.6974 [PID=29271] [debug]   sending db prefs in reply
2019-08-31 18:05:18.6974 [PID=29271]    [send] effective_ncpus 3 max_jobs_on_host_cpu 999999 max_jobs_on_host 999999
2019-08-31 18:05:18.6974 [PID=29271]    [send] effective_ngpus 0 max_jobs_on_host_gpu 999999
2019-08-31 18:05:18.6974 [PID=29271]    [send] Not using matchmaker scheduling; Not using EDF sim
2019-08-31 18:05:18.6974 [PID=29271]    [send] CPU: req 0.00 sec, 0.00 instances; est delay 0.00
2019-08-31 18:05:18.6974 [PID=29271]    [send] work_req_seconds: 0.00 secs
2019-08-31 18:05:18.6974 [PID=29271]    [send] available disk 99.64 GB, work_buf_min 864000
2019-08-31 18:05:18.6974 [PID=29271]    [send] active_frac 0.999365 on_frac 0.841301 DCF 1.000000
2019-08-31 18:05:18.7033 [PID=29271]    Sending reply to [HOST#12788495]: 0 results, delay req 60.00
2019-08-31 18:05:18.7034 [PID=29271]    Scheduler ran 0.069 seconds

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2010188
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2010189 - Posted: 31 Aug 2019, 18:18:08 UTC - in response to Message 2010188.  
Last modified: 31 Aug 2019, 18:43:41 UTC

Yes, that's the right place. Most helpful would be to catch an 'overfetch', such as the complained-about behaviour that started this conversation, and compare/contrast the client log and the server log for the same event.

Like this:

First effective work request for the spoofed Linux, after I got it attached and on the right venue.

Sat 31 Aug 2019 19:28:33 BST | Einstein@Home | Sending scheduler request: To fetch work.
Sat 31 Aug 2019 19:28:33 BST | Einstein@Home | Requesting new tasks for NVIDIA GPU
Sat 31 Aug 2019 19:28:33 BST | Einstein@Home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
Sat 31 Aug 2019 19:28:33 BST | Einstein@Home | [sched_op] NVIDIA GPU work request: 24192.00 seconds; 16.00 devices
Sat 31 Aug 2019 19:28:41 BST | Einstein@Home | Scheduler request completed: got 16 new tasks
Sat 31 Aug 2019 19:28:41 BST | Einstein@Home | [sched_op] estimated total NVIDIA GPU task duration: 37052 seconds
And the server said:

2019-08-31 18:28:33.8522 [PID=6591 ]    [send] effective_ncpus 6 max_jobs_on_host_cpu 999999 max_jobs_on_host 999999
2019-08-31 18:28:33.8522 [PID=6591 ]    [send] effective_ngpus 8 max_jobs_on_host_gpu 999999
2019-08-31 18:28:33.8522 [PID=6591 ]    [send] Not using matchmaker scheduling; Not using EDF sim
2019-08-31 18:28:33.8522 [PID=6591 ]    [send] CPU: req 0.00 sec, 0.00 instances; est delay 0.00
2019-08-31 18:28:33.8522 [PID=6591 ]    [send] CUDA: req 24192.00 sec, 16.00 instances; est delay 0.00
2019-08-31 18:28:33.8522 [PID=6591 ]    [send] work_req_seconds: 0.00 secs
2019-08-31 18:28:33.8522 [PID=6591 ]    [send] available disk 99.31 GB, work_buf_min 864
and then went on to pick out a number of both 'Gamma-ray pulsar binary search' and 'Continuous Gravitational Wave search' tasks. Both task types were given an initial runtime estimate of 38:41 (no host history, no DCF). Looks like the pulsars are going to take about 16:25 (halfway through the first one now); GW around 52 minutes, guessing from the 20% mark. I'll bodge preferences to allow Gamma-ray only from now on, so DCF doesn't go all over the place and confuse the issue.

Note that both client and server are consistent in showing this as a 16-GPU request.
ID: 2010189
rob smith Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22829
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2010194 - Posted: 31 Aug 2019, 18:44:42 UTC

There could be some wait for that (3.3% in 1 hour).
A passing comment - SETI reports server version 709, Beta 715 and Einstein 611. That might be a complete red herring, but it adds a bit of data to the picture.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2010194
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2010196 - Posted: 31 Aug 2019, 18:49:17 UTC - in response to Message 2010174.  

Well, my last experiment was on one of the temporary contest machines from the closet, which I pulled out to update after a year of non-use. So I was testing the AIO 7.14.2 stock client. No spoofing - I didn't spoof either of the temp machines. I tried to set 0.001 for the additional cache, but every time I updated, the value never changed. Then I found someone's reply to me (Wiggo??) that the additional days value WILL take 0.0. I tried that and it saved and updated. I don't think the value field can handle that many digits after the decimal point.
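
For what it's worth, the same settings can also be written straight into global_prefs_override.xml in the BOINC data directory if the Manager field won't take the value - a minimal sketch, values purely illustrative, read back in with Options -> Read local prefs file:

<global_preferences>
   <work_buf_min_days>0.01</work_buf_min_days>
   <work_buf_additional_days>0.001</work_buf_additional_days>
</global_preferences>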
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2010196
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2010199 - Posted: 31 Aug 2019, 18:52:08 UTC - in response to Message 2010194.  

A passing comment - SETI reports server version 709, Beta 715 and Einstein 611. That might be a complete red herring, but it adds a bit of data to the picture.
Einstein consciously and deliberately stopped applying Berkeley patches when CreditNew came out, so their server won't carry a Berkeley version number in the 7xx range. They've updated it themselves, of course, but not using the version-number script.
ID: 2010199
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2010200 - Posted: 31 Aug 2019, 18:57:08 UTC
Last modified: 31 Aug 2019, 18:59:46 UTC

2019-08-31 18:05:18.6974 [PID=29271] [send] available disk 99.64 GB, work_buf_min 864000


What does this reply in Rob's Einstein scheduler log mean? Does that mean it was setting the minimum request at ten days?
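
Assuming that figure is in seconds, as the other scheduler-log numbers read, the conversion is simply:

print(864000 / 86400)   # 10.0 days - compare the 864 (0.01 days) shown in Richard's server log above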
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2010200
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2010202 - Posted: 31 Aug 2019, 19:07:09 UTC - in response to Message 2010189.  

I've been wondering where my client got that "NVIDIA GPU work request: 24192.00 seconds" from. I'd deliberately set local preferences to 0.01 + 0.01 days, which equates to 1,728 seconds per device. There was plenty of SETI work on board (getting on for 6 hours worth), but it was only allowed to run on two of the GPUs - the real ones. (This is the spoofed version of the 7.16.1 client, which should have the benefit of the work that Keith and I did on max_concurrent since v7.14.2)

So, trying 1,728 seconds * 14 apparently idle resources, I get 24,192 - voila. QED.
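
Restating that in code, for anyone checking the arithmetic:

days_cache      = 0.01 + 0.01          # local preferences: 0.01 + 0.01 days
secs_per_device = days_cache * 86400   # 1728.0 seconds
idle_gpus       = 16 - 2               # 16 spoofed GPUs, the 2 real ones busy with SETI work
print(secs_per_device * idle_gpus)     # 24192.0 -> matches the NVIDIA GPU work request in the log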
ID: 2010202
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2010203 - Posted: 31 Aug 2019, 19:10:01 UTC

Yes, you are correct - I was referring to GPU work only; I only do CPU work for Seti. I had 3 1070's in the temp contest machine I tested on. I knew I shouldn't have deleted the logs from that test: I had logs for 0.1 days and for 0.01 days of primary cache. I even had the resource share ratio at 10,000:1 between Seti and Einstein at one point. It didn't make any difference, though I know resource share is a long-term average that I never let establish itself. I can pull the machine out of the closet again and run the tests again.

On another note, I just added the beta Continuous GW GPU application and received work along with the normal Gamma Ray Pulsar work. But for some reason I have yet to run one of them - still only running the Gamma Ray tasks, even though both types have the same deadline of Sept. 12. Curious how you were able to get your host to run the new GW app tasks.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2010203
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2010205 - Posted: 31 Aug 2019, 19:25:51 UTC - in response to Message 2010202.  

I've been wondering where my client got that "NVIDIA GPU work request: 24192.00 seconds" from. I'd deliberately set local preferences to 0.01 + 0.01 days, which equates to 1,728 seconds per device. There was plenty of SETI work on board (getting on for 6 hours worth), but it was only allowed to run on two of the GPUs - the real ones. (This is the spoofed version of the 7.16.1 client, which should have the benefit of the work that Keith and I did on max_concurrent since v7.14.2)

So, trying 1,728 seconds * 14 apparently idle resources, I get 24,192 - voila. QED.

Are you saying you can run a max_concurrent statement with the spoofed 7.16.1 client? I cannot and neither can Juan or Ian. On prior versions of the spoofed client, namely 7.14.2 and 7.15.0, it was mandatory to have a max_concurrent statement set to the number of spoofed gpus PLUS the number of desired cpu tasks running or the cpus would go idle.

That is no longer the case with the 7.16.1 client. If you run a max_concurrent statement now, you will run on all possible devices and also not request any cpu work. The only way to restrict resource usage now on the spoofed 7.16.1 client is through Local Preferences cpu% usage settings.

At least now with this new client without the max_concurrent statement it will ask for cpu work even when there are no more cpu tasks waiting to run. With the previous spoofed client I always had to revert to an older client that did not have the #3076 commit in it so I could request cpu work when I ran through all my cpu tasks on Tuesdays.
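
For reference, the sort of statement being discussed looks like this - an app_config.xml in the project's directory; the app name and numbers here are only illustrative, not a recommendation:

<app_config>
   <app>
      <name>setiathome_v8</name>
      <max_concurrent>16</max_concurrent>
   </app>
   <project_max_concurrent>18</project_max_concurrent>
</app_config>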
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2010205
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2010206 - Posted: 31 Aug 2019, 19:26:03 UTC - in response to Message 2010203.  

On another note, I just added the beta Continuous GW GPU application and received work along with the normal Gamma Ray Pulsar work. But for some reason I have yet to run one of them - still only running the Gamma Ray tasks, even though both types have the same deadline of Sept. 12. Curious how you were able to get your host to run the new GW app tasks.
Just the order they happened to arrive in. They arrived in the same work request (so deadlines identical to the second), but of course the FIFO buffer has a sequence, and there happened to be a GW at the top (or - just possibly - it was the first app to download the binary and all other required files*).

I think I'll wind down the research for tonight. Burn off that initial batch of Einstein tasks while I eat, drink, and sleep: let SETI get ahead again on resource share (but a tight cache), and see what else I can think of to test in the morning. Including that 'work_buf_min' question - no answer for that one yet.

* - no, not download order. Einstein has a fast pipe and server: all downloads had completed before the first GPU came free from SETI work, so it picked the first one off the queue.
ID: 2010206
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2010210 - Posted: 31 Aug 2019, 19:43:09 UTC - in response to Message 2010206.  

That makes sense, the FIFO. I already had the Gamma app and had to download the new CGW app, so the regular Gamma Ray tasks came down the pipe first.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2010210
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2010211 - Posted: 31 Aug 2019, 19:46:14 UTC - in response to Message 2010205.  

Are you saying you can run a max_concurrent statement with the spoofed 7.16.1 client? I cannot and neither can Juan or Ian. On prior versions of the spoofed client, namely 7.14.2 and 7.15.0, it was mandatory to have a max_concurrent statement set to the number of spoofed gpus PLUS the number of desired cpu tasks running or the cpus would go idle.
I'm running the 20 August version of the v7.16.1 client - I had a bug in the 2--24 version I tried at first. I have max_concurrent statements for both my GPU project (SETI) and my CPU project (not SETI), and each project fetches and runs the expected amount of work for its respective device. I don't try to run the same project on multiple device types, and I think I would expect that to fail under spoofing - or at least not to work as well as might be expected. That would be something the spoofer would have to explore: we can't expect David to work on that one!
ID: 2010211
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2010213 - Posted: 31 Aug 2019, 20:05:29 UTC - in response to Message 2010211.  

I am running the 7.16.1 client branch from the 10 August snapshot. I haven't checked the repo in a while to see whether any more commits have been added since.
Yes, running both CPU and GPU tasks on the same host certainly is a challenge for the client. Every client works a little bit differently when spoofed. I now take the view that max_concurrent, in its proper use, hews more closely to the original definition - the one that David espouses. So using the CPU % usage settings for task control is probably the proper way to achieve my goals now. Each client I have used required slight changes to my configuration to achieve my desired goal. This new 7.16.1 client is easy to get along with, and I think I understand how it interacts with projects fairly well now.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2010213