Results of code re-write : 6.5 Million Page faults (Windows XP) and malloc() - a question or so to ponder

Message boards : Number crunching : Results of code re-write : 6.5 Million Page faults (Windows XP) and malloc() - a question or so to ponder

1 · 2 · Next

Profile Tigher
Volunteer tester

Send message
Joined: 18 Mar 04
Posts: 1547
Credit: 760,577
RAC: 0
United Kingdom
Message 110099 - Posted: 11 May 2005, 13:29:09 UTC
Last modified: 11 May 2005, 13:30:45 UTC

Hi. I was quite interested to see that one WU accounted for some 6.5M page faults. This was on an XP P4 2.8 GHz HT box with 512 MB RAM doing... well, nothing but SETI. I wonder if others see the same level.

It persuaded me to look at the SETI source code... there is a liberal use of malloc() (as you might expect), often within for loops. malloc() can of course be the cause of page faults, and I wondered if this might be an explanation for the high number I have been getting. OR... it's not high at all, perhaps. In any event, if a developer wanders past this thread... or anyone else knows... does he/you have any observations on this? What is the memory allocation strategy within SETI? Is it optimised to reduce OS overheads?
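[Editor's note: to make the malloc-in-a-loop concern concrete, here is a minimal, purely hypothetical sketch. It is not taken from the SETI sources; the function names are invented. It contrasts allocating scratch space inside a loop with hoisting the allocation out of it.]

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical illustration only - not the actual SETI code. Allocating a
// scratch buffer inside the loop asks the allocator (and potentially the OS)
// for fresh memory on every iteration.
double sum_naive(const double* data, int n, int iters) {
    double total = 0.0;
    for (int i = 0; i < iters; ++i) {
        double* work = (double*)std::malloc((size_t)n * sizeof(double));
        std::memcpy(work, data, (size_t)n * sizeof(double));
        for (int j = 0; j < n; ++j) total += work[j];
        std::free(work);                  // ...and frees it again every time
    }
    return total;
}

// Same computation, but the buffer is allocated once and reused.
double sum_hoisted(const double* data, int n, int iters) {
    double total = 0.0;
    double* work = (double*)std::malloc((size_t)n * sizeof(double));
    for (int i = 0; i < iters; ++i) {
        std::memcpy(work, data, (size_t)n * sizeof(double));
        for (int j = 0; j < n; ++j) total += work[j];
    }
    std::free(work);                      // one allocation, one free
    return total;
}
```

Both functions compute the same result; the only difference is the allocator traffic, which is where the soft page faults would come from.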

ID: 110099 · Report as offensive
Profile MikeSW17
Volunteer tester

Send message
Joined: 3 Apr 99
Posts: 1603
Credit: 2,700,523
RAC: 0
United Kingdom
Message 110115 - Posted: 11 May 2005, 14:34:10 UTC - in response to Message 110099.  

<blockquote>Hi. I was quite interested to see that one WU accounted for some 6.5M page faults. This was on an XP P4 2.8 GHz HT box with 512 MB RAM doing... well, nothing but SETI. I wonder if others see the same level.

It persuaded me to look at the SETI source code... there is a liberal use of malloc() (as you might expect), often within for loops. malloc() can of course be the cause of page faults, and I wondered if this might be an explanation for the high number I have been getting. OR... it's not high at all, perhaps. In any event, if a developer wanders past this thread... or anyone else knows... does he/you have any observations on this? What is the memory allocation strategy within SETI? Is it optimised to reduce OS overheads?</blockquote>

I noticed this too, Ian.
The data being worked on and the code are not large; it really shouldn't be necessary to have more page faults than are required to load the code and data initially.
(If this was VAX/VMS I'd be increasing the Process Working Set).

I don't think it really matters, as I don't think these are hard page faults (which require a disk access); otherwise the app wouldn't run at 99% CPU, there would be an element of I/O waiting.
Even so, soft faults require a non-trivial amount of OS code to satisfy, and it should be possible to pretty much eliminate them. Then again, few programmers understand the effect their code has on the hardware or OS...

Not knowing the Windows virtual memory code I couldn't say how much 'unnecessary' code gets executed for each fault, but at a few thousand instructions per fault it could be anywhere from 6 to 60 seconds of lost cycles per WU. (It can't be much more or SETI wouldn't get 99% CPU - the OS would be stealing some.)

What the heck! I've just cut all my WU times by around 30% using the optimized binaries. Now I want... no, demand... that extra 0.3% speed-up ;)


ID: 110115 · Report as offensive
Profile Tigher
Volunteer tester

Send message
Joined: 18 Mar 04
Posts: 1547
Credit: 760,577
RAC: 0
United Kingdom
Message 110121 - Posted: 11 May 2005, 14:43:33 UTC
Last modified: 11 May 2005, 14:45:45 UTC

30%? Nice! That's Windows binaries? Hmmmm, should I do this, I ask.

Anyway, page faults... well, they can have a measurable impact on the OS and hence on other applications. I looked at some of the code: there are malloc()s in loops, and there are also nested loops with malloc()s in the inner loops. I'm no maths man so I cannot comment on that aspect, but on the face of it, it looks like mallocs happen as and when needed, when they could be planned for, reducing the impact on the OS. I agree they probably aren't hard faults... I am not swapping at all... no memory crisis... but for some users they may well be. Just the same, having to process 6.5M system calls is work that may be reducible. As you say, small data set, small object code, but highly iterative perhaps. Meaning many malloc()s? I hope a developer stops by and can tell us their strategy.

Anyhow... where did you get the binaries, Mike?

Regards

ID: 110121 · Report as offensive
Profile MikeSW17
Volunteer tester

Send message
Joined: 3 Apr 99
Posts: 1603
Credit: 2,700,523
RAC: 0
United Kingdom
Message 110123 - Posted: 11 May 2005, 14:52:05 UTC

Ian, take a look at the two threads by Tetsuji Maverick Rai, entitled "Compiling faster Windows client with Intel C++ compiler (and fftw?)" and "2nd: Compiling faster Windows client with Intel C++ compiler (and fftw?)".

The download links are all there - a tad hard to find the right ones for the right CPU, but they're b***dy marvellous (well done, TJM) once you sort out which. It's tricky not to lose the work you've got in your cache, too.


ID: 110123 · Report as offensive
Profile Tigher
Volunteer tester

Send message
Joined: 18 Mar 04
Posts: 1547
Credit: 760,577
RAC: 0
United Kingdom
Message 110229 - Posted: 11 May 2005, 19:59:06 UTC
Last modified: 11 May 2005, 20:36:08 UTC

Hmmmm... going well so far... well, to help the discussion: if you look at chirpfft.cpp you will see the processing of chirp rates. In a for loop, inside a while loop, inside another for loop, there is a test to see whether a pair should be processed. If yes, the pair is added to a list. This list is built up iteratively through realloc()... which, because it extends an object's size, is in reality often a free() followed by a malloc() (growing an object frequently means relocation, i.e. dump the old block and allocate a new one, copying the contents across). Thus, and I cannot be sure of the numbers here, we have a nested loop with, at its busiest point, a series of mallocs, and the operating system has to intervene each time.
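[Editor's note: a purely illustrative sketch of the growth pattern described above - hypothetical code, not lifted from chirpfft.cpp. The list grows by exactly one element per append, so realloc() may have to relocate the whole block (in effect a free() plus a malloc() plus a copy) on any given iteration.]

```cpp
#include <cstdlib>

// Hypothetical sketch, not the actual SETI code: extend the list by one
// element per append. Each call gives realloc() a chance to move the block,
// and each move is allocator (and potentially OS) work.
int* append_grow_by_one(int* list, int* count, int value) {
    int* grown = (int*)std::realloc(list, (size_t)(*count + 1) * sizeof(int));
    if (grown == nullptr) return list;   // out of memory: keep the old block
    grown[(*count)++] = value;
    return grown;
}
```

Done inside a nested loop, this is exactly the kind of per-iteration allocator churn that would show up as millions of soft faults.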

So I guess that's why we see millions of page faults for each WU processed. Mine all seem to be around the 6.5M mark. So... that's a clue for folk to start this discussion, and for that passing developer to give us a view.

My initial guess is that the client would go faster with a modified malloc() and judicious realloc() strategy... let's see what comes back, eh?

ID: 110229 · Report as offensive
Ulrich Metzner
Volunteer tester
Avatar

Send message
Joined: 3 Jul 02
Posts: 1256
Credit: 13,565,513
RAC: 13
Germany
Message 110239 - Posted: 11 May 2005, 21:29:19 UTC - in response to Message 110099.  

<blockquote>Hi. I was quite interested to see that one WU accounted for some 6.5M page faults. ...
</blockquote>

That is exactly why the SETI cruncher benefits so much from the big L2 cache of the Pentium M. It's exactly the same on the P4EE, the server Xeons, and the Athlon 64 with the Clawhammer core and its 1024 KB L2 cache. The page faults cause the OS to move a huge amount of memory around inside main memory (hence no disk access). That's also why the CPU doesn't get as hot with SETI as it does with, for example, the Einstein cruncher: the CPU has to wait for the memory move to complete before continuing crunching. The Einstein cruncher has nearly no page faults, and my Athlon TB 1400 gets ~4-5°C hotter on it than on SETI.

Aloha, Uli

ID: 110239 · Report as offensive
Profile MikeSW17
Volunteer tester

Send message
Joined: 3 Apr 99
Posts: 1603
Credit: 2,700,523
RAC: 0
United Kingdom
Message 110259 - Posted: 11 May 2005, 22:38:00 UTC

Well blow me down!

Just checked processing on two of my systems: the page faults on both, for nearly-completed WUs, are ~100K.

I'd swear it used to go into the low millions (and I don't doubt Ian's figures either).

The only thing that has changed is that I'm running the optimized binaries. The SETI app is still using 18-19 MB of memory, which IIRC is about what it was before.

Around 50K soft faults per hour is pretty damn good.

ID: 110259 · Report as offensive
Profile Tigher
Volunteer tester

Send message
Joined: 18 Mar 04
Posts: 1547
Credit: 760,577
RAC: 0
United Kingdom
Message 110396 - Posted: 12 May 2005, 7:05:03 UTC
Last modified: 12 May 2005, 7:50:25 UTC

I'm intrigued now, for sure. I get the same page fault levels on a P4 2.66 as well, so it's not just a rogue system here, yet you get a massive reduction! Are different compilers handling memory requests differently? Is this where part of your gain from the optimised client comes from, Mike? Hmmmm... could the standard client then see a performance improvement through a different memory strategy, and the optimised client see even further gains? Interesting. I might have to take the sources, rewrite some of the memory handling parts, and recompile just to see. First I will check out my Linux system to see how it's behaving.

ID: 110396 · Report as offensive
Profile Tigher
Volunteer tester

Send message
Joined: 18 Mar 04
Posts: 1547
Credit: 760,577
RAC: 0
United Kingdom
Message 110399 - Posted: 12 May 2005, 7:14:04 UTC - in response to Message 110239.  

<blockquote><blockquote>Hi. I was quite interested to see that one WU accounted for some 6.5M page faults. ...
</blockquote>

That is exactly why the SETI cruncher benefits so much from the big L2 cache of the Pentium M. It's exactly the same on the P4EE, the server Xeons, and the Athlon 64 with the Clawhammer core and its 1024 KB L2 cache. The page faults cause the OS to move a huge amount of memory around inside main memory (hence no disk access). That's also why the CPU doesn't get as hot with SETI as it does with, for example, the Einstein cruncher: the CPU has to wait for the memory move to complete before continuing crunching. The Einstein cruncher has nearly no page faults, and my Athlon TB 1400 gets ~4-5°C hotter on it than on SETI.
</blockquote>

Interesting. Well, my P4s... some have 512 KB L2 and some have 1024 KB L2, but both perform the same in terms of faults... +/- 5%, they appear identical. I don't doubt what you say about memory movements either, and it's an interesting temperature observation re Einstein... I would have thought it would be very similar to SETI. Thanks.

ID: 110399 · Report as offensive
Profile MikeSW17
Volunteer tester

Send message
Joined: 3 Apr 99
Posts: 1603
Credit: 2,700,523
RAC: 0
United Kingdom
Message 110405 - Posted: 12 May 2005, 8:01:20 UTC

Just checked another WU (on the optimized binary) at 85% done: it has only 23K faults, nothing to worry about there.

Ian, are you still running the Berkeley releases and are you still getting multi-meg faults?

I find it hard to believe that a compiler can optimize across different function calls, i.e. look ahead, realise that a free-memory call is followed by an allocate-memory call, and kind-of cancel them out ;)


ID: 110405 · Report as offensive
Profile Tigher
Volunteer tester

Send message
Joined: 18 Mar 04
Posts: 1547
Credit: 760,577
RAC: 0
United Kingdom
Message 110407 - Posted: 12 May 2005, 8:55:08 UTC

Mike, yes, I'm running the standard Berkeley 4.09 Windows client. I have a unit here 67% done with 4.09M page faults so far... so it looks like it will end up at the full ~6.5M. Same story for my 2.8 P4 too. Just checked my laptop: the WU is 42% done with 1.89M page faults. That's a P3 1.133 GHz processor. I guess that WU will end up in the ~5-6M range too.

A lot will depend, I think, on how the malloc() scheme works; there are different implementations around. Perhaps some schemes make a larger initial allocation, so a realloc() simply takes some of the spare space until it runs out, and only then does a resize and relocation happen. That would reduce page faults for sure. It's worthy of further research, I think.
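[Editor's note: a sketch of the general technique being described - keeping capacity ahead of size with geometric growth - not of any particular C runtime's malloc. The names are invented.]

```cpp
#include <cstdlib>

// Hypothetical growable list: capacity runs ahead of size, so most appends
// are a plain store into spare space, and realloc() is only hit when the
// capacity is exhausted. Doubling gives amortised O(1) appends.
struct IntList {
    int* data;
    int  size;
    int  capacity;
};

void il_init(IntList* l) { l->data = nullptr; l->size = 0; l->capacity = 0; }

void il_append(IntList* l, int value) {
    if (l->size == l->capacity) {
        int newcap = l->capacity ? l->capacity * 2 : 16;   // geometric growth
        l->data = (int*)std::realloc(l->data, (size_t)newcap * sizeof(int));
        l->capacity = newcap;
    }
    l->data[l->size++] = value;   // usually no allocator call at all
}

void il_free(IntList* l) { std::free(l->data); il_init(l); }
```

With this scheme, appending N elements costs O(log N) realloc() calls instead of N of them, which is exactly the reduction in allocator traffic being speculated about above.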

ID: 110407 · Report as offensive
Profile MikeSW17
Volunteer tester

Send message
Joined: 3 Apr 99
Posts: 1603
Credit: 2,700,523
RAC: 0
United Kingdom
Message 110415 - Posted: 12 May 2005, 10:15:52 UTC - in response to Message 110407.  

<blockquote>Mike, yes, I'm running the standard Berkeley 4.09 Windows client. I have a unit here 67% done with 4.09M page faults so far... so it looks like it will end up at the full ~6.5M. Same story for my 2.8 P4 too. Just checked my laptop: the WU is 42% done with 1.89M page faults. That's a P3 1.133 GHz processor. I guess that WU will end up in the ~5-6M range too.

A lot will depend, I think, on how the malloc() scheme works; there are different implementations around. Perhaps some schemes make a larger initial allocation, so a realloc() simply takes some of the spare space until it runs out, and only then does a resize and relocation happen. That would reduce page faults for sure. It's worthy of further research, I think.</blockquote>

Well, I'll be very interested to see what happens when you try the optimized binaries (why not?). That will tell us whether it is the optimization, or some other effect of system specification or configuration.

As an aside, the optimized versions seem very sound; the results are validating and credit is being granted just fine ('cept you may get a bit less credit if you quorum against other optimized results, but that's made up for by doing more results per period). If you missed it: you don't need to change the BOINCMgr/BOINC EXEs if you have particular version requirements for scheduling or other BOINC features.

ID: 110415 · Report as offensive
Profile Tigher
Volunteer tester

Send message
Joined: 18 Mar 04
Posts: 1547
Credit: 760,577
RAC: 0
United Kingdom
Message 110459 - Posted: 12 May 2005, 13:15:55 UTC

OK, I looked at a Linux WU at 99.5%. Again, just over 6.5M page faults. None were maj_flt, so there was no swapping at all; all were min_flt. This is such different behaviour from MikeSW17's optimised client, which had only 100K or so. Any ideas, folks? Please post your thoughts.

ID: 110459 · Report as offensive
Profile Tigher
Volunteer tester

Send message
Joined: 18 Mar 04
Posts: 1547
Credit: 760,577
RAC: 0
United Kingdom
Message 110488 - Posted: 12 May 2005, 15:42:28 UTC
Last modified: 12 May 2005, 15:59:09 UTC

Ok I had a cursory look through the sources.

In analyzeFuncs.cpp, the main loop for processing the WU is in the seti_analyze function. On every iteration there is a malloc() and a free() for the creation and disposal of a workspace. This space is used for getting the power spectrum and searching for spikes. Precisely how many iterations there will be I cannot say; I have not got my head around fft8g well enough to work it out. My guess is lots... but it's a guess. So these malloc()/free() pairs would cause the OS page faults I am seeing on both Linux and Windows XP. I suppose the question to the coder would be: once you have the workspace created (first pass), can you not test whether it is big enough for re-use? If not, discard and re-create. Doing that always gives you a correctly sized workspace and avoids (if you start with a reasonable guess for the size, anyway) many expensive mallocs. Then, at the end of the analysis loop, discard the workspace 'cos you don't need it anymore.
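[Editor's note: a minimal sketch of the re-use idea suggested above. The names are hypothetical and the real seti_analyze code may need more care, e.g. around alignment or error handling; this just shows the "keep it unless it's too small" policy.]

```cpp
#include <cstdlib>

// Hypothetical reusable workspace: kept between passes, reallocated only
// when a pass asks for more than we currently have.
static void*  g_workspace      = nullptr;
static size_t g_workspace_size = 0;

void* get_workspace(size_t bytes) {
    if (bytes > g_workspace_size) {           // too small: grow it
        std::free(g_workspace);
        g_workspace      = std::malloc(bytes);
        g_workspace_size = (g_workspace != nullptr) ? bytes : 0;
    }
    return g_workspace;                       // big enough: reuse as-is
}

void free_workspace() {                       // call once, after the loop
    std::free(g_workspace);
    g_workspace      = nullptr;
    g_workspace_size = 0;
}
```

Inside the analysis loop every iteration would call get_workspace() instead of malloc(); only the first pass (and any pass needing a bigger buffer) actually touches the heap.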

Happy to hear from a developer........if not I might just amend the code myself and see what happens. Happy to hear from anyone else here in the forum on your views and thoughts. Could lead to a faster client!

ID: 110488 · Report as offensive
Profile ML1
Volunteer moderator
Volunteer tester

Send message
Joined: 25 Nov 01
Posts: 20332
Credit: 7,508,002
RAC: 20
United Kingdom
Message 110501 - Posted: 12 May 2005, 16:45:46 UTC - in response to Message 110488.  

<blockquote>...So the malloc()/free() pairs will cause OS page faults...</blockquote>
That one should be quite easy to fix. Just write a modified malloc/free optimised for the call pattern from the client.

setiMalloc()/setiFree() anyone?
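[Editor's note: purely for illustration, a one-slot cached allocator along those lines might look like this. The names and the deliberately simple policy are invented, and in this sketch the caller must pass the size to setiFree.]

```cpp
#include <cstdlib>

// Illustrative setiMalloc/setiFree: cache the most recently freed block and
// hand it back if the next request fits, so repeated same-size malloc/free
// cycles never touch the system heap.
static void*  cached_block = nullptr;
static size_t cached_size  = 0;

void* setiMalloc(size_t bytes) {
    if (cached_block != nullptr && cached_size >= bytes) {
        void* p = cached_block;   // serve the request from the cache
        cached_block = nullptr;
        return p;
    }
    return std::malloc(bytes);    // cache empty or too small: real malloc
}

void setiFree(void* p, size_t bytes) {
    if (cached_block == nullptr) {
        cached_block = p;         // keep this block around for next time
        cached_size  = bytes;
    } else {
        std::free(p);             // cache occupied: really free it
    }
}
```

A real implementation would want more slots and size classes, but even this one-slot version would absorb the tight malloc/free pairs in the inner loops described earlier in the thread.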

Regards,
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 110501 · Report as offensive
Profile Tigher
Volunteer tester

Send message
Joined: 18 Mar 04
Posts: 1547
Credit: 760,577
RAC: 0
United Kingdom
Message 110505 - Posted: 12 May 2005, 16:59:14 UTC
Last modified: 12 May 2005, 17:00:18 UTC

Yeah. I created a new .cpp file. Ha, but I can't compile it, LOL, 'cos I ain't got a C++ compiler. Any takers here? I think I will send it to the development stream to see what they think.


ID: 110505 · Report as offensive
Profile Tigher
Volunteer tester

Send message
Joined: 18 Mar 04
Posts: 1547
Credit: 760,577
RAC: 0
United Kingdom
Message 110866 - Posted: 13 May 2005, 20:05:40 UTC
Last modified: 13 May 2005, 20:09:38 UTC

Well, I have downloaded, compiled and run BOINC and SETI. Oh, and sorted out my development system too!

The reduction in OS overhead is staggering: the 6,484,954 OS page faults dropped to 6,655 on the test reference unit provided. But this translates into an insignificant improvement in WU processing time. So no speed-up for SETI WUs, but a huge load off the OS. This will be good for hard-pressed systems, and other applications will benefit from the removal of the 1,200+ faults per second that were taking place.

All work was Linux FC3 based: BOINC 4.32, SETI 4.70, GNU C++ compiler v3.4.3-22.FC3, GNU Make v3.80. NO optimisation was applied. I have offered the revised source code to the project team as I think it's a worthwhile improvement.

I am going to look further at the code to see if there is scope to rationalise further. Happy to discuss if anyone wants to.

ID: 110866 · Report as offensive
Profile Tigher
Volunteer tester

Send message
Joined: 18 Mar 04
Posts: 1547
Credit: 760,577
RAC: 0
United Kingdom
Message 110894 - Posted: 13 May 2005, 21:50:57 UTC
Last modified: 13 May 2005, 21:51:26 UTC

Can it really be that not a single person has anything to say?

ID: 110894 · Report as offensive
Janus
Volunteer developer

Send message
Joined: 4 Dec 01
Posts: 376
Credit: 967,976
RAC: 0
Denmark
Message 110896 - Posted: 13 May 2005, 21:54:06 UTC - in response to Message 110866.  

<blockquote>the 6,484,954 OS page faults have reduced to 6655 OS page faults on the test reference unit provided.</blockquote>

Sounds great! It still produces the same output right?
ID: 110896 · Report as offensive
Profile ML1
Volunteer moderator
Volunteer tester

Send message
Joined: 25 Nov 01
Posts: 20332
Credit: 7,508,002
RAC: 20
United Kingdom
Message 110897 - Posted: 13 May 2005, 21:56:04 UTC - in response to Message 110894.  
Last modified: 13 May 2005, 21:57:47 UTC

<blockquote>Can it really be that not a single person has anything to say?</blockquote>
Don't forget about the time shift around the world. Also note that it is FRIDAY! (;-))

Just seen your note on boinc_opt via Dr Anderson. How have you improved the free/calloc/malloc management?

Note that Linux is very efficient at disk caching and memory allocation. However, if you've streamlined the operation before hitting the OS, it must reduce the instruction count and CPU cache misses for the better.

Regards,
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 110897 · Report as offensive