Panic Mode On (94) Server Problems?

Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1623780 - Posted: 5 Jan 2015, 18:06:38 UTC - in response to Message 1623773.  

I don't know about you, but this is the most APs my computers have seen in over 2 months.

One has 118, the other 112.

Finally had to turn the heater down as all the GPUs are now revved up!!
Lionel
Joined: 25 Mar 00
Posts: 680
Credit: 563,640,304
RAC: 597
Australia
Message 1623950 - Posted: 5 Jan 2015, 21:19:02 UTC - in response to Message 1623780.  
Last modified: 5 Jan 2015, 21:19:20 UTC

Yes, I get that message too, even though I have only 4 CPU APs in the queue when the norm is 100. It raises an interesting question: how can you be at a limit when you are 96% below the maximum value?
JaundicedEye
Joined: 14 Mar 12
Posts: 5375
Credit: 30,870,693
RAC: 1
United States
Message 1623995 - Posted: 5 Jan 2015, 22:03:14 UTC

Almost back to pre-Bruno-crash RAC.........Keep 'em coming!

"Sour Grapes make a bitter Whine." <(0)>
Aurora Borealis
Volunteer tester
Joined: 14 Jan 01
Posts: 3075
Credit: 5,631,463
RAC: 0
Canada
Message 1624019 - Posted: 5 Jan 2015, 22:31:48 UTC

The AP splitters have started running on Beta again.
Lionel
Joined: 25 Mar 00
Posts: 680
Credit: 563,640,304
RAC: 597
Australia
Message 1624135 - Posted: 6 Jan 2015, 2:01:01 UTC - in response to Message 1623958.  
Last modified: 6 Jan 2015, 2:12:31 UTC

@Sten-Arne: I'm not worried about it at all.

The only thing that annoys me is the 4+ hours I have to sit in front of the puters clicking the update button to get to 100 CPU tasks. After getting to 100 CPU tasks, the caches glide down to circa 5 WUs again.

It's just poor software.
Dena Wiltsie
Volunteer tester
Joined: 19 Apr 01
Posts: 1628
Credit: 24,230,968
RAC: 26
United States
Message 1624141 - Posted: 6 Jan 2015, 2:31:24 UTC - in response to Message 1624135.  

@Sten-Arne: I'm not worried about it at all.

The only thing that annoys me is the 4+ hours I have to sit in front of the puters clicking the update button to get to 100 CPU tasks. After getting to 100 CPU tasks, the caches glide down to circa 5 WUs again.

It's just poor software.

I am running the current release, and after it picked up the first few AP units it filled my queue to 100 work units each and has maintained that level. I gave up and decided to maintain a 2-day queue, which works out to 100 work units each.
Lionel
Joined: 25 Mar 00
Posts: 680
Credit: 563,640,304
RAC: 597
Australia
Message 1624164 - Posted: 6 Jan 2015, 3:46:15 UTC - in response to Message 1624141.  

I don't have a problem maintaining work in the cache for GPU WUs; only CPU WUs are an issue, as the cache does not really get re-stocked as they are completed.

[side issue] I would like to see the limits raised. I know there are people out there with slower machines, but 100 MB CPU WUs is about 1 day's worth of work. 100 AP CPU WUs is about 3.5 days' worth of work (if you can get 100 WUs for your CPU, that is). 200 MB GPU WUs is about 8 hours of work (based on 10-minute completion times; I appreciate shorties take less than 6 minutes), and 200 AP GPU WUs is about 40 hours of work (based on circa 50-minute completion times).

The other thing that I would like to see is the 8-week deadline put on a slow glide path down to, say, 2 weeks. It was set at 8 weeks some 14+ years ago, when processors were much, much slower. I think it's time for a review here.

Probably got an ice cream's chance in hell of any of the above.
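
The arithmetic above is easy to check. A minimal sketch in Python: the per-WU runtimes and cache limits come from the post, but the four concurrent GPU tasks are an assumption (a multi-GPU rig) needed to make the 8-hour figure come out:

# Rough cache-duration arithmetic for the limits discussed above.
# Per-WU runtimes and cache limits are from the post; the concurrency
# figure is an assumed multi-GPU setup, not a measurement.
def cache_hours(wu_limit, minutes_per_wu, concurrent_tasks=1):
    """Hours of work represented by a full cache of wu_limit tasks."""
    return wu_limit * minutes_per_wu / 60.0 / concurrent_tasks

print(cache_hours(200, 10, 4))   # MB GPU: ~8.3 hours
print(cache_hours(200, 50, 4))   # AP GPU: ~41.7 hours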
OTS
Volunteer tester
Joined: 6 Jan 08
Posts: 369
Credit: 20,533,537
RAC: 0
United States
Message 1624181 - Posted: 6 Jan 2015, 5:07:34 UTC - in response to Message 1624164.  

It looks like there might be less AP work being issued again. If I am reading the SSP correctly, there are once again only three splitters working on a single file. For a short time my CPU cache was holding steady at the maximum of 100, but now the total in progress is slowly going downhill.
JanniCash
Joined: 17 Nov 03
Posts: 57
Credit: 1,276,920
RAC: 0
United States
Message 1624207 - Posted: 6 Jan 2015, 7:31:27 UTC - in response to Message 1624164.  

[side issue] I would like to see the limits raised. I know there are people out there with slower machines, but 100 MB CPU WUs is about 1 day's worth of work.

Not sure I understand your problem 100%, so ignore from here if my comment doesn't compute.

If all you want is to download more WUs in advance, to bridge a server outage for example, why not create multiple virtual machines with a total number of vCPUs equal to your physical cores (or threads)? With VT-x (or similar), virtualization has almost zero performance loss these days, and each VM would appear as a separate BOINC client maintaining its own cache.
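
A minimal sketch of scripting that setup against VirtualBox's command-line interface; the VM names, count, and sizes are illustrative assumptions, and installing an OS plus a BOINC client inside each VM is left out:

# Hypothetical helper: register several small VirtualBox VMs, each meant
# to run its own BOINC client (and therefore keep its own cache).
# Names and sizes are illustrative only; see the VBoxManage docs.
import subprocess

def make_boinc_vms(count, cpus_per_vm=1, mem_mb=1024):
    for i in range(count):
        name = "boinc%d" % i  # hypothetical naming scheme
        subprocess.check_call(["VBoxManage", "createvm",
                               "--name", name, "--register"])
        subprocess.check_call(["VBoxManage", "modifyvm", name,
                               "--cpus", str(cpus_per_vm),
                               "--memory", str(mem_mb)])

# make_boinc_vms(4)  # e.g. four single-vCPU VMs on a quad-core host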
David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1624395 - Posted: 6 Jan 2015, 14:37:48 UTC - in response to Message 1623995.  

Almost back to pre-Bruno-crash RAC.........Keep 'em coming!

I'm not. Last summer I was around 15K. I've only recovered about half of my drop.
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1624398 - Posted: 6 Jan 2015, 14:41:00 UTC

I'm surprised no one has posted PANIC!!! about the MB RTS being so low this morning.
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14653
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1624434 - Posted: 6 Jan 2015, 15:44:33 UTC - in response to Message 1622882.  

Here's my run-through of my job_log:

The first one's Unix timestamp is Aug 08, 2012:
1344411405.758627 ue 40443.513279 ct 44801.180000 fe 100366914056685.200000 nm ap_06my12ab_B6_P1_00397_20120728_06219.wu_0
1348411651.444180 ue 41462.286580 ct 41549.820000 fe 99312357715860.406000 nm ap_28jn12ab_B1_P1_00327_20120913_22283.wu_1
1353416891 ue 41154.808134 ct 44612.900000 fe 100748165223275 nm ap_27au12ab_B4_P1_00033_20121109_21887.wu_2 et 44755.875742
1354547709 ue 43096.733593 ct 41306.480000 fe 103747720196172 nm ap_27au12ab_B3_P0_00383_20121109_08471.wu_3 et 41408.402161
1354823446 ue 42485.311151 ct 36223.400000 fe 103782242333548 nm ap_01se12ab_B2_P1_00034_20121106_10784.wu_3 et 36360.880091
1365786085 ue 42041.303007 ct 43234.320000 fe 101221168606141 nm ap_30jn12ad_B0_P0_00323_20130331_14381.wu_2 et 43327.372242
1367166402 ue 41867.606117 ct 44128.500000 fe 102913816340793 nm ap_29ja13aa_B4_P0_00011_20130418_18429.wu_0 et 44210.891695
1367705446 ue 42201.410116 ct 44351.470000 fe 103734332406452 nm ap_26fe13ae_B6_P0_00170_20130424_29841.wu_0 et 44437.584740
1371698309 ue 43973.768278 ct 35040.840000 fe 107073863886764 nm ap_24no12aa_B3_P0_00177_20130610_00904.wu_0 et 35083.251799
1371782925 ue 43491.213005 ct 37562.650000 fe 105898866616610 nm ap_03ja12ai_B5_P0_00025_20130611_18331.wu_0 et 37606.444761

Last one's timestamp is June 21, 2013.

If you (or anybody else) feel you have *comprehensive* job logs with tasks from all AP tapes, I could set up a parallel database and compare them with my MB records.

With help from Cosmic_Ocean, I've been able to assemble a data distribution history for AP with records of 3,051 tapes split. But that's still a long way short of the 5,544 I know have been split for MB (as at 31 Dec); the difference is far more likely to be due to our incomplete job logs than to a vast stash of unprocessed tapes. So, if anyone would actually like to know how we're getting along, I'd need help with logs from some of you assiduous AP-hounds out there, please.
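
For anyone offering logs, a job_log in the format quoted above reduces easily to a set of tape names. A minimal Python sketch; the field layout is inferred from the sample lines, and the tape-name pattern (e.g. 06my12ab out of ap_06my12ab_B6_...) is an assumption:

# Minimal sketch: pull the set of AP tape names out of a BOINC job_log.
# Each line carries key/value fields; the workunit name follows "nm".
# The ap_<tape>_B... pattern is inferred from the samples above.
import re
import sys

TAPE_RE = re.compile(r"\bnm\s+ap_([0-9a-z]+)_B")

def tapes_in_log(path):
    tapes = set()
    with open(path) as log:
        for line in log:
            match = TAPE_RE.search(line)
            if match:
                tapes.add(match.group(1))
    return tapes

if __name__ == "__main__":
    for tape in sorted(tapes_in_log(sys.argv[1])):
        print(tape)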
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14653
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1624441 - Posted: 6 Jan 2015, 16:02:10 UTC - in response to Message 1624398.  

I'm surprised no one has posted PANIC!!! about the MB RTS being so low this morning.

The tapes which have been added to the splitter queue today and over the weekend have all been split for MB before, and it looks as if re-splitting for MB has been inhibited this time - the tapes are skimmed through very quickly without any new work appearing, as was happening for AP last week.
Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1624448 - Posted: 6 Jan 2015, 16:18:22 UTC - in response to Message 1624441.  
Last modified: 6 Jan 2015, 16:34:43 UTC

Yup, was my computer. Sorry. Forget what I said. Bad driver.....
betreger Project Donor
Joined: 29 Jun 99
Posts: 11361
Credit: 29,581,041
RAC: 66
United States
Message 1624556 - Posted: 6 Jan 2015, 23:11:19 UTC - in response to Message 1624508.  

One might wonder why they still haven't started the AP assimilators. There are as of now 477,099 AP workunits waiting for assimilation.

It's been several Tuesday outages since the DB was fixed, but still no assimilation of AP WUs. Maybe the DB isn't working as it should yet, and until it does, they will not allow the WUs to be assimilated.

I was thinking this was part of a stress test for the new database.
HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1624664 - Posted: 7 Jan 2015, 1:25:17 UTC - in response to Message 1624556.  

One might wonder why they still haven't started the AP assimilators. There are as of now 477,099 AP workunits waiting for assimilation.

It's been several Tuesday outages since the DB was fixed, but still no assimilation of AP WUs. Maybe the DB isn't working as it should yet, and until it does, they will not allow the WUs to be assimilated.

I was thinking this was part of a stress test for the new database.

Until the assimilators start, no data is going into the AP science DB.
sah_assimilator/ap_assimilator: takes scientific data from validated results and puts it in the SETI@home (or Astropulse) database for later analysis.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
Dena Wiltsie
Volunteer tester
Joined: 19 Apr 01
Posts: 1628
Credit: 24,230,968
RAC: 26
United States
Message 1624726 - Posted: 7 Jan 2015, 4:49:23 UTC - in response to Message 1624164.  

I don't have a problem maintaining work in the cache for GPU WUs; only CPU WUs are an issue, as the cache does not really get re-stocked as they are completed.

[side issue] I would like to see the limits raised. I know there are people out there with slower machines, but 100 MB CPU WUs is about 1 day's worth of work. 100 AP CPU WUs is about 3.5 days' worth of work (if you can get 100 WUs for your CPU, that is). 200 MB GPU WUs is about 8 hours of work (based on 10-minute completion times; I appreciate shorties take less than 6 minutes), and 200 AP GPU WUs is about 40 hours of work (based on circa 50-minute completion times).

The other thing that I would like to see is the 8-week deadline put on a slow glide path down to, say, 2 weeks. It was set at 8 weeks some 14+ years ago, when processors were much, much slower. I think it's time for a review here.

Probably got an ice cream's chance in hell of any of the above.

After the outage today I burned off a fair amount of work, and one work request returned 34 MB work units. I am still a bit low on AP work, but I suspect I will get more after some of the other people get theirs.
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1624758 - Posted: 7 Jan 2015, 6:37:33 UTC

I had an interesting occurrence on my slow Sempron machine.

It was crunching away on an AP, and at that time it had not had 11 "completed" APs yet, so the estimates were still astronomical (pun sort-of intended), like ~230 hours. Once it got to about 75% complete, the remaining time was low enough for the 2.5+1-day cache to ask for more work. By this time the 11th "completed task" had happened, so the estimates that came with new tasks were much more realistic. Sort of.

New tasks were coming in with an estimate of 31 hours, when it normally takes ~46 for a <10%-blanked AP. Then it got a few more APs, and then the one that was running finished. I thought that, since the duration of that one was more than 10% off the estimate of the others, the others would change their estimates to match what it actually took. But I was wrong: the estimates for all the ones in the cache at that time dropped to 18 hours... and then work fetch happened and got more.

I realized this and set it to NNT, then counted how many tasks that machine had and multiplied by 48 hours: I shouldn't run past the deadline on any of them. But then a thought occurred: there's a newfangled error these days that bails on a WU when it runs for more than 2x the estimated duration. So, to avoid basically losing everything after crunching most of the way through them first, it was time for some client_state editing.

I suspended network comms in BOINC, ran 'net stop boinc', made a copy of the data directory, then opened up client_state and added a 0 to the left of the decimal in the flops_est fields. Save, 'net start boinc', and check: now they all showed 188 hours, and of course they were running in high-priority mode. I didn't want crazy things to happen, so I did 'net stop boinc' again, opened client_state back up, and decided to cut the values down by approximately a third. The high-order digits were 28, so I changed them to 10. Save, 'net start boinc', and now they're 68 hours.
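
That hand edit is easy to script. A minimal sketch, assuming (as the post does) that the estimates live in flops_est elements of client_state.xml; in some client versions the element is rsc_fpops_est, so adjust FIELD to match your file, and only run this while the client is stopped:

# Hypothetical helper mirroring the hand edit above: scale every
# flops-estimate element in client_state.xml by a constant factor.
# FIELD is an assumption; check your own client_state.xml first.
import re
import shutil

FIELD = "flops_est"       # may be "rsc_fpops_est" in your client version
FACTOR = 10.0 / 28.0      # e.g. turn leading digits "28" into roughly "10"

def scale_estimates(path="client_state.xml"):
    shutil.copyfile(path, path + ".bak")        # safety copy first
    pattern = re.compile(r"<{0}>([-+0-9.eE]+)</{0}>".format(FIELD))
    with open(path) as f:
        text = f.read()
    def repl(match):
        scaled = float(match.group(1)) * FACTOR
        return "<{0}>{1:.6e}</{0}>".format(FIELD, scaled)
    with open(path, "w") as f:
        f.write(pattern.sub(repl, text))

# scale_estimates()  # run only while the BOINC client is stopped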

Close enough. It should be able to smooth itself out from here; of course, I'll keep an eye on it. I have already set that machine back to an MB+AP venue (I had it on AP-only just to try to get at least the 11th "completed task" done), so hopefully it doesn't gobble up too many more of those APs from the rest of you. I know some of you just cringe at the fact that your GPU does them in 30 minutes and then the wingmate ends up taking 47 hours with the Lunatics CPU app.

That little machine has been pretty good to me over the years. It's been crunching for just over 7 years and is edging closer to 1M credits.

I just wanted to share that weird scenario though. I'm pretty sure if I had set it to NNT and only gotten one AP assigned at a time, the whole debacle would have been avoided.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving up)
Lionel
Joined: 25 Mar 00
Posts: 680
Credit: 563,640,304
RAC: 597
Australia
Message 1624777 - Posted: 7 Jan 2015, 7:05:50 UTC - in response to Message 1624758.  

Still not getting any AP CPU WUs (and I have 1 box with none at the moment)...
JohnDK Crowdfunding Project Donor, Special Project $250 donor
Volunteer tester
Joined: 28 May 00
Posts: 1222
Credit: 451,243,443
RAC: 1,127
Denmark
Message 1624847 - Posted: 7 Jan 2015, 12:41:16 UTC

Seems we're back to problems getting work. Before Tuesday's outage I had a max GPU cache for the first time in about 2 months; now I'm down to half on the GPUs and it continues to go down.