Eric's biannual post #6: You can tuna fish, but you can't tune a TCP
Author | Message |
---|---|
Eric Korpela Send message Joined: 3 Apr 99 Posts: 1382 Credit: 54,506,847 RAC: 60 |
This one could probably go in the technical news, but since I haven't blogged in a while, I decided to jot it down here. Following the large outage, bruno's been having some problems keeping up. Lots of dropped connections. I guess most of you noticed that. It's not a lack of hardware this time, just an over-abundance of connection attempts. Some of the dropped connections were local file-server connections, which causes some of the http processes to wait around, which in turn causes more dropped connections. Changing some of the TCP tuning parameters helped, but didn't solve the problem. We did some brainstorming before the outage and have come up with some tactics to combat these issues. We're setting up our router to proxy the SYN/ACK handshakes. That way if we are flooded, the connections will be dropped before they get to bruno. That'll in turn prevent the NFS connections from getting dropped. We're also getting rid of some configuration remnants from earlier BOINC server code. Currently bruno handles all of the incoming connections and forwards them to other machines when appropriate for uploads and downloads. We can designate other machines as upload or download handlers so that bruno won't have to touch those connections at all. If that's not enough, we'll set up web servers on some of the other machines and get back to round robin DNS for the upload and download servers. Well, that's enough typing for now. This weekend, one of my fingers had an unfortunate meeting with the leading edge of a 120mm fan blade inside a server case. Fortunately the fan blade broke and it doesn't look like I'll lose the fingernail. I've learned my lesson: always approach case fans from the trailing edge. -- Eric @SETIEric@qoto.org (Mastodon) |
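For readers unfamiliar with the round robin DNS fallback mentioned above, here is a minimal sketch of the idea in Python. It is only an illustration, not the project's actual DNS setup: the class name and the two addresses (taken from the 192.0.2.0/24 documentation range) are made up. The server hands out the same set of addresses on every query but rotates their order, and since most clients simply try the first address, load tends to spread across the pool.

```python
from collections import deque


class RoundRobinRecordSet:
    """Toy model of a DNS name that resolves to several addresses."""

    def __init__(self, addresses):
        self._addresses = deque(addresses)

    def answer(self):
        """Return the full address list, then rotate it for the next query."""
        response = list(self._addresses)
        self._addresses.rotate(-1)  # the next query sees a different first entry
        return response


# Hypothetical upload pool with two hosts (addresses are placeholders).
upload_pool = RoundRobinRecordSet(["192.0.2.10", "192.0.2.11"])

# Most clients connect to the first address in each answer.
first_choices = [upload_pool.answer()[0] for _ in range(4)]
print(first_choices)
```

Note that this spreads connection attempts, not work: a client that caches an answer keeps hitting the same host until the cached record expires.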
Fuzzy Hollynoodles Send message Joined: 3 Apr 99 Posts: 9659 Credit: 251,998 RAC: 0 |
Thanks for the update, Eric, it's very much appreciated. We know you guys are doing all that you can. "I'm trying to maintain a shred of dignity in this world." - Me |
KWSN - Chicken of Angnor Send message Joined: 9 Jul 99 Posts: 1199 Credit: 6,615,780 RAC: 0 |
Yow. Just did that myself three months ago, and lost half the nail. It's grown back since, but damn was that annoying (I type a lot). On the up/download issue, good plan on dropping connections at the router vs. the host itself - hopefully that will have the desired effect and give NFS a kick in the pants. Thanks again for all your and your colleagues' hard work in resurrecting Thumper! Regards, Simon. Donate to SETI@Home via PayPal! Optimized SETI@Home apps + Information |
Eric Korpela Send message Joined: 3 Apr 99 Posts: 1382 Credit: 54,506,847 RAC: 60 |
Unfortunately the router couldn't handle the load so we're back to dropping connections at bruno. I spent the last few hours getting a bruno clone, which I have tentatively named Ptolemy, up and running. (It's not quite a clone: dual 3.06 GHz hyperthreaded processors rather than dual 2.8 GHz non-hyperthreaded. Where it came from is a story for another time.) I've got the OS installed and am at the point where Matt and/or Jeff need to work some apache magic in order to have it be usable in a round robin DNS with bruno. I'm going to go get some dinner, then I'll mail Matt and Jeff with a progress report. I think they'll be surprised how far I've gotten this evening. Eric @SETIEric@qoto.org (Mastodon) |
gomeyer Send message Joined: 21 May 99 Posts: 488 Credit: 50,370,425 RAC: 0 |
Then get some sleep. Thanks for the extraordinary effort! |
Labbie Send message Joined: 19 Jun 06 Posts: 4083 Credit: 5,930,102 RAC: 0 |
Good news and Great job Eric. We appreciate everything you and the rest of the gang are doing. Calm Chaos Forum...Join Calm Chaos Now |
Fuzzy Hollynoodles Send message Joined: 3 Apr 99 Posts: 9659 Credit: 251,998 RAC: 0 |
Yes You are the best. .oO(You nicked Angela's computer for this? ;-D) "I'm trying to maintain a shred of dignity in this world." - Me |
Brian Silvers Send message Joined: 11 Jun 99 Posts: 1681 Credit: 492,052 RAC: 0 |
Good try. Sorry it didn't pan out. :( Out of curiosity, does anyone know how far past max / peak capacity the router was? Would something like Packeteer PacketShaper help, or do you have something similar already in use? |
Eric Korpela Send message Joined: 3 Apr 99 Posts: 1382 Credit: 54,506,847 RAC: 60 |
Addendumb: I had a 'd'Oh!' moment this morning. Apparently we were running with the upload timeout set at 20 minutes (which I think is the apache default), so our connections were being dominated by machines that couldn't get through, but were hanging onto the connection. If you look at our network traffic, you can see what happened when I lowered that to 30 seconds..... We're sending about 4 times as much work as we were when I got in this morning. @SETIEric@qoto.org (Mastodon) |
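A back-of-the-envelope sketch of why the timeout matters so much: if a server has a fixed number of connection slots and a stalled client holds one until the timeout fires, the rate at which stalled slots are freed is at most slots divided by hold time. The slot count of 256 below is purely an assumed figure for illustration; the thread does not say how many httpd processes bruno runs.

```python
def max_turnover_per_second(worker_slots: int, hold_seconds: float) -> float:
    """Upper bound on connections freed per second when every slot is
    occupied by a stalled client for hold_seconds before timing out."""
    if worker_slots <= 0 or hold_seconds <= 0:
        raise ValueError("worker_slots and hold_seconds must be positive")
    return worker_slots / hold_seconds


SLOTS = 256  # assumed number of httpd worker slots, not from the thread

before = max_turnover_per_second(SLOTS, 20 * 60)  # 20-minute timeout
after = max_turnover_per_second(SLOTS, 30)        # 30-second timeout

print(f"20 min timeout: {before:.2f} stalled slots freed per second")
print(f"30 s timeout:   {after:.2f} stalled slots freed per second")
print(f"ratio: {after / before:.0f}x")
```

The ratio is a worst-case bound for slots held by stalled clients only. Real traffic is a mix of stalled and healthy connections, which is consistent with the observed gain (about 4 times as much work sent) being much smaller than the bound.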
Brian Silvers Send message Joined: 11 Jun 99 Posts: 1681 Credit: 492,052 RAC: 0 |
[quote]Addendumb: I had a 'd'Oh!' moment this morning. Apparently we were running with the upload timeout set at 20 minutes (which I think is the apache default), so our connections were being dominated by machines that couldn't get through, but were hanging onto the connection.[/quote] It's good to see the progress... Hopefully soon things will be better. For the time being, uploading is still an exercise in futility on my machine. Any comment on PacketShaper? |
Eric Korpela Send message Joined: 3 Apr 99 Posts: 1382 Credit: 54,506,847 RAC: 60 |
[quote]Any comment on PacketShaper?[/quote] The quick but unsatisfying answer is "I dunno." It's certainly worth looking into, so I'll mention it to Matt and Jeff. They're the experts... @SETIEric@qoto.org (Mastodon) |
Conrad Human Send message Joined: 17 Nov 00 Posts: 67 Credit: 2,009,224 RAC: 0 |
[quote]Addendumb: I had a 'd'Oh!' moment this morning. Apparently we were running with the upload timeout set at 20 minutes (which I think is the apache default), so our connections were being dominated by machines that couldn't get through, but were hanging onto the connection.[/quote] OOPS lol, you lot are just human. How is Ptolemy coming along? [edit]jipee just got a WU reported[/edit] |
Brian Silvers Send message Joined: 11 Jun 99 Posts: 1681 Credit: 492,052 RAC: 0 |
[quote]Any comment on PacketShaper?[/quote] In my former job, we used it for a brief test period on a Hughes satellite link. It performed admirably, even though the decision was made to go to 56K burst frame. While I know that slow link optimization isn't exactly the same goal as what you need, the product isn't just for slow links... It might help. It might not. Edit: Additionally, SkyX looks like another possible help for the TCP/XML/HTTP acceleration. Brian |
KenKLRC Send message Joined: 12 Jul 06 Posts: 27 Credit: 7,791,658 RAC: 0 |
Eric, would throwing add'l H/W (a dual core puppy - P4 PD 940 3.2 GHz 800 FSB 1GB DDR2 667MHz - that I have here in reserve) at it to help handle the comms load be of any use? |
Paul Hayslett Send message Joined: 3 Aug 00 Posts: 15 Credit: 14,207,862 RAC: 0 |
Eric, it looks like you hit the jackpot. Slowly but surely my upload queue is shrinking and WUs are trickling down too. Thanks for making it happen! |
Eric Korpela Send message Joined: 3 Apr 99 Posts: 1382 Credit: 54,506,847 RAC: 60 |
Eric, We're about to put "ptolemy" in the mix in the next few hours. I'll certainly let you know if we need more beyond that. Eric @SETIEric@qoto.org (Mastodon) |
Iztok s52d (and friends) Send message Joined: 12 Jan 01 Posts: 136 Credit: 393,469,375 RAC: 116 |
Hi! It might be too low: I've noticed several new WUs on the "Results for user" list, while nothing is on my PCs. It looks like WUs are allocated, but the connection is terminated before the client realizes there is something to be fetched. That means a long delay until the same WU is re-sent to another client due to timeout. BR, 73 Iztok [quote]Addendumb: I had a 'd'Oh!' moment this morning. Apparently we were running with the upload timeout set at 20 minutes (which I think is the apache default), so our connections were being dominated by machines that couldn't get through, but were hanging onto the connection.[/quote] |
KWSN - Chicken of Angnor Send message Joined: 9 Jul 99 Posts: 1199 Credit: 6,615,780 RAC: 0 |
Well, today some of my hosts managed to upload and report almost all their WUs, vs. an average of 1-2/day/host before. The timeout change certainly seems to have eased the situation somewhat. Still, what Iztok mentioned is worth looking into - unless there is a way for BOINC to recover that WU download, it'll put all low-bandwidth users at a disadvantage while reducing overall project efficiency. Regardless, for a measure in hard times, it's a good one IMO. Regards, Simon. Donate to SETI@Home via PayPal! Optimized SETI@Home apps + Information |
Eric Korpela Send message Joined: 3 Apr 99 Posts: 1382 Credit: 54,506,847 RAC: 60 |
We've moved the scheduler to bruno (from galileo) and both bruno and ptolemy are handling uploads. Only penguin is on download duty, but that may change if downloads start becoming a problem. We'll round-robin the scheduler once we can get round-robin capable feeders built. Matt wasn't able to do it before he left for vacation. Validators and assimilators are offline while Jeff tracks down a strange segfault. The std::vector<>::size() method is reporting an incorrect value, even though the pointers to the start and end of data are correct. IBTHOOM. Apache on bruno hung last night in a weird state. Lots of httpd processes running, but no connections getting through. We'll need to come up with a way to detect that state and fix it without human intervention. Eric @SETIEric@qoto.org (Mastodon) |
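One possible way to detect the "httpd processes running, but no connections getting through" state is an end-to-end probe: checking the process list is not enough, so the probe has to make a real request and insist on a reply within a deadline. The sketch below is an assumption about how such a check could look, not what the project runs; the function name and the timeout are made up for illustration.

```python
import socket


def http_responds(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True only if the server answers a HEAD request with an
    HTTP status line before the timeout expires.

    A hung server may still accept the TCP connection (the kernel
    completes the handshake) and then never reply, so a successful
    connect alone is not treated as healthy.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            sock.sendall(b"HEAD / HTTP/1.0\r\n\r\n")
            # Any status (200, 404, 501...) proves a worker handled the request.
            return sock.recv(16).startswith(b"HTTP/")
    except OSError:
        # Covers refused connections, resets and timeouts.
        return False
```

A scheduled job could call something like `http_responds("localhost", 80)` every minute or so and act only after several consecutive failures, to avoid reacting to a single slow reply. What to do on failure (for example restarting the web server) is deliberately left out here.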
Fuzzy Hollynoodles Send message Joined: 3 Apr 99 Posts: 9659 Credit: 251,998 RAC: 0 |
Thanks for the update, Eric. :-) Matt's on vacation? How lucky for him. And how bad for you who are left in the lab. I guess you both, Jeff and you, look forward to getting rid of this sign: "I'm trying to maintain a shred of dignity in this world." - Me |