All CPU tasks not running. Now all are: - "Waiting to run"

Questions and Answers : Unix/Linux : All CPU tasks not running. Now all are: - "Waiting to run"
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1971328 - Posted: 21 Dec 2018, 10:14:56 UTC

Keith, have a look at "Client: fix job scheduling bug #2918". It'll take time to be reviewed and merged into the master code base - especially at this time of year. But it looks hopeful - you obviously have more pull with David than I do.

I'll have a go at testing it under Windows. If I can remember what configuration I was running two years ago...
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1971356 - Posted: 21 Dec 2018, 17:37:41 UTC - in response to Message 1971328.  

Thanks, Richard. So how do I know when the code is merged into the master? I assume I have to wait for that to happen before I git the master and attempt to compile the client. I don't really know how the structure of GitHub works.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
Profile Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 1971365 - Posted: 21 Dec 2018, 18:42:26 UTC - in response to Message 1971356.  

So how do I know when the code is merged into the master?
Fireworks will go off in every office in the world. :)

No... check the closed pull requests. Or the open ones. The code merge request has to go through there... well, unless David thinks he's better than the rest again and doesn't. Wouldn't be the first time.
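
If you'd rather ask Git than the website, once you have a clone you can check whether a given commit has reached master - the hash below is just a placeholder, take the real one from the pull request page:

git fetch origin
git branch -r --contains <commit-hash>

If that lists origin/master, the fix has been merged.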
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1971370 - Posted: 21 Dec 2018, 19:04:32 UTC - in response to Message 1971356.  

It's actually possible to compile and test the new code before it's merged. Assuming you have Git set up to retrieve the master BOINC code, the commands

git fetch origin
git checkout -b dpa_nconcurrent origin/dpa_nconcurrent
will bring your copy into line with David's current work-in-progress.
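
From there, building a test client on Linux is roughly the usual autotools sequence - a sketch only, assuming the normal build dependencies are already installed:

./_autosetup
./configure --disable-server --disable-manager
make

The freshly built client should end up under client/ as 'boinc'. Drop the --disable-manager flag if you want to build the Manager as well (it needs wxWidgets).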

But I've just reported a new bug, so this isn't finished yet. You could wait and see if he's able to fix that one too, or you could try a practice run now...
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1971372 - Posted: 21 Dec 2018, 19:36:16 UTC - in response to Message 1971365.  

So how do I know when the code is merged into the master?
Fireworks will go off in every office in the world. :)

No... check the closed pull requests. Or the open ones. The code merge request has to go through there... well, unless David thinks he's better than the rest again and doesn't. Wouldn't be the first time.

Ahh, so that is what the filter is useful for. Maybe a little glimmer of understanding of how GitHub works. So if the pull request is closed, then the master has been updated with the fix? Then it is viable to git the master and tackle compiling.

That will be the next roadblock.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1971373 - Posted: 21 Dec 2018, 19:43:29 UTC - in response to Message 1971370.  

Richard, does anyone have any insight into why using <project_max_concurrent> by itself never triggered a CPU scheduling problem on any of my hosts in the past? It was only when I added the <exclude_gpu> statements to cc_config.xml for the RTX 2080 card that things went sideways.

In reading through the proposed pull request, all the code referenced max_concurrent, and nothing about exclude_gpu was mentioned. Why is that not a component of the fix, since it seems to be part of the problem?
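
For anyone following along, the two settings in question look roughly like this - the values and the project URL are illustrative, not my exact files. <project_max_concurrent> lives in a per-project app_config.xml:

<app_config>
   <project_max_concurrent>16</project_max_concurrent>
</app_config>

and the GPU exclusion goes in the <options> section of cc_config.xml:

<exclude_gpu>
   <url>http://project.example.com/</url>
   <device_num>1</device_num>
   <type>NVIDIA</type>
</exclude_gpu>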
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1971380 - Posted: 21 Dec 2018, 20:18:25 UTC - in response to Message 1971373.  

That's an interesting question, and I don't immediately know the answer. But it's true that I did have GPU excludes in place two years ago, and I still do today. They could well be implicated, but I don't think I've seen them mentioned in the cpu_sched_debug logs I've looked at. (They do have a significant, and quite useful, effect on work fetch: I run a 6-hour cache most of the time, but with an exclude in place GPUGrid only fetches 3 hours before work is needed - that gives me a better chance of meeting their 24-hour turnround target.)
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1971383 - Posted: 21 Dec 2018, 20:43:12 UTC - in response to Message 1971380.  

OK, it is interesting that your original issue back in 2016 also involved an <exclude_gpu>, so that is something we both have in common. I assume that the exclude reduces the available resources, and that causes things to go out of whack with CPU scheduling - as if the CPU scheduler isn't aware of the reduced resources it is supposed to be working with.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1971386 - Posted: 21 Dec 2018, 21:04:31 UTC - in response to Message 1971383.  

I refer you to the second post in the discussion issue (Nov 24, 2016):

But disabling line 130 (alone) also causes starvation of another type: when max_concurrent is present, tasks that are needed for the same resource type (multi-core CPU) from a different project are restricted.

We need to populate the runnable list with sufficient jobs to satisfy all resources from all projects, even when restrictions (such as max_concurrent and exclude_gpu) are in operation.
It's all there for David to read - that's the bug I found in today's pull request test.
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1971460 - Posted: 22 Dec 2018, 10:04:40 UTC

Richard, is the latest post by David - the one about fixing the problem - from after you stated that the original fix didn't in fact fix it? And does his change in fact fix your failure in scenario 165?
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1971461 - Posted: 22 Dec 2018, 10:36:51 UTC - in response to Message 1971460.  

I (always) find David very inscrutable in cases like this - he is often terse to the point of obfuscation.

But in this case, I don't see any secondary code changes since the original fix for your problem, and I don't see a second confirmation run on scenario 165. So my thinking is that he thought he'd fixed it for good in the first PR (continuing to neglect my post of Nov 24, 2016) and was patting himself on the back before the holidays. He'll probably come back to #165 after he's had a rest, but I have no idea when that could be - he has a habit of going off for a week's hiking, and now would be an obvious time to do something like that.
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1971504 - Posted: 22 Dec 2018, 17:17:54 UTC

OK, I will wait for your post saying that everything is finally in order before attempting to build the client - and hopefully also until you have indicated that you have built the Windows version yourself and that it works correctly.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1971520 - Posted: 22 Dec 2018, 19:04:10 UTC - in response to Message 1971504.  

I'll be offline for a few days, back end of next week, but I'll grab it if I'm in a position to test when I see any movement on the threads.

For Windows, I can build, but we have - at long last - got automatic test builds accessible for every change. You can simply download the key files (client, Manager and so on - whatever you want to test) and drop them in place of the files you're using at the moment. There are a couple of wrinkles to watch out for, but that's basically it. I don't think there are any plans to do the same for Linux.
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1971848 - Posted: 24 Dec 2018, 17:40:52 UTC

David has enabled adding two separate app_config.xml files, and I've made a scenario (# 166) which exhibits the same behaviour as I'm seeing locally. We make progress, slowly.
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1971853 - Posted: 24 Dec 2018, 18:10:28 UTC - in response to Message 1971848.  

Thanks for the update Richard. I saw David's message about adding another app_config entry. I remember you said you used that in two projects. So your scenario now matches your local settings.

Won't help me as I have a total of four app_config files.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1972472 - Posted: 29 Dec 2018, 10:53:03 UTC

My body has got back home after a holiday break, but my brain is following some way behind. I see we have some extensive new code to test: I might get started this evening, but more likely tomorrow.
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1972797 - Posted: 31 Dec 2018, 13:15:57 UTC

I've been testing for a while now, and David has made one more cosmetic fix. I think we're ready for some proper testing, so we can be confident that the new code is ready for merging into master and hence for automatic inclusion in the next client release. Keith, are you willing to give it a go?
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1972825 - Posted: 31 Dec 2018, 16:58:19 UTC - in response to Message 1972797.  

I guess so. I assume the cosmetic fix has nothing to do with the bugfix for the max_concurrent problem. Tell me when the master is ready to git and I will attempt to compile.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1972842 - Posted: 31 Dec 2018, 19:03:12 UTC - in response to Message 1972825.  

I guess so. I assume the cosmetic fix has nothing to do with the bugfix for the max_concurrent problem. Tell me when the master is ready to git and I will attempt to compile.
Ah. And there we have a catch-22. The code only goes into master once it has been 'approved' by a 'reviewer'.

I have the authority to review and approve (and hence merge), but I have to use it sparingly and thoughtfully, otherwise I'll get thrown off the list. I don't have the in-depth technical knowledge of C++ coding to be able to 'read it off the page' and approve it that way (round here, probably only Juha has that skill, where the BOINC client is concerned). What I can do is to run the compiled client, throw wobblies at it, and see if it copes.

And I'm campaigning behind the scenes to get that sort of pre-alpha testing accepted as a proper part of the development process. At the moment, most code is reviewed and approved 'by eye'. Once that's done, it goes into master and sits there, untouched by human eye, until 'they' (someone unspecified) decides that a new client release is appropriate because of some completely unrelated new feature they've added. So, it gets built, put out to the punters in what is still known (for historical reasons) as alpha testing, and, if nothing too drastic is seen, it gets released. And by then it's too late, the bugs are in the wild, and everyone has forgotten what they wrote six months, two years, and in the case of the addition of app_config.xml files, probably five years ago.

We have a problem with that process. It goes from "Too early to test" (the position you've described) to "Too late to change" at the moment of 'approve and merge'.

It's getting better for Windows, because we now have automatically built replacement binaries which can be downloaded and dropped into place for instant testing - those are the ones I'm using now, and they're a great timesaver compared with the old way. But I think the working mindset in academic research labs is that anyone who uses Linux is thereby automatically qualified to write, compile, and test their own code at the drop of a hat. In their ivory towers, perhaps.

[/rant]
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1972879 - Posted: 1 Jan 2019, 0:53:56 UTC

Well, I scanned the files I got from that other repository and they came up clean, so I proceeded to try to run things. The client seemed to start up once I figured out how to run it - repository versions being foreign to me. Then I tried to start the Manager and ran into a missing dependency; I found some info on grabbing the library from obsoleted sources and got it to start.

But I goofed in copying over my existing folder, managed to make the client grab the stock stuff, and now I have 266 ghosts. Setting that aside for the minute: on startup I had errors in the Event Log complaining that the URLs for my GPU excludes for Einstein and GPUGrid were bad. Not sure why. And the main point, I guess, is that even though I was running with the app_config containing <project_max_concurrent>, I was running all 24 CPU threads instead of the 16 I defined. I re-read the config files from the Manager to be sure it was being picked up. It was.

I think I will wait a little longer before trying things out again. One thing I won't do next time is reuse the existing host ID; that was a big goof on my part. I just wasn't thinking.
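
When I do have another go, the plan is to keep the test client well away from the live BOINC data directory, so it registers as a new host instead of reusing the existing one. Something along these lines, if I have the flags right (paths are illustrative):

mkdir -p ~/boinc-test
~/boinc-source/client/boinc --dir ~/boinc-test --gui_rpc_port 31418 --allow_multiple_clients
# then, from another terminal, talk to that instance with boinccmd:
boinccmd --host localhost:31418 --passwd "$(cat ~/boinc-test/gui_rpc_auth.cfg)" --read_cc_config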
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)