Energy Efficiency of ARM-Based Systems over x86 or GPU-Based Systems

mavrrick
Joined: 12 Apr 00
Posts: 17
Credit: 1,894,993
RAC: 4
United States
Message 1552274 - Posted: 4 Aug 2014, 17:49:04 UTC

I brought this up a few months ago and didn't get very far with the analysis. For some reason my interest in the topic was sparked again recently.

Since the last time I brought this up I have been letting my Ouya run SETI@home full time, and it has racked up a fair amount of credit. I am not really interested in going crazy with it; I am just letting it run. The only thing I did was move the little box next to a system with an active cooling fan. It has run well, and for the little power it uses (4.5-5 watts) I am very impressed.

So now to the point of all this. While looking over the host information I found a figure labeled Average GFLOPS for each app that has run. I also found what is labeled as Device Peak FLOPS on the task information for my computer and the processed work units. Interestingly enough, the two aren't very close.

The one that seems to correlate best with the time a WU takes is the Average GFLOPS number. So my first question is: does anyone clearly understand what that number is, i.e. what is used to calculate it?

My math is pretty simple: take the Average GFLOPS value, multiply it by the number of cores the system has, then divide that by the approximate power usage of the device, measured as closely as I can.

I don't expect this to be exact, but it should be a fairly decent approximation of performance per watt, or more specifically GFLOPS per watt.

Unfortunately I was only able to get fairly precise power numbers for a couple of systems: my desktop and the Ouya. The desktop is a six-core Phenom II system with a Radeon 4870. Nothing cutting edge, but a big enough system to churn through some work if I want it to.

So the numbers:
The Ouya averaged 1.515 GFLOPS across the 4 apps it had used; with 4 cores running at about 5 watts, that works out to roughly 1.212 GFLOPS per watt.

My desktop averaged 8.06 GFLOPS across 1 app and has 6 cores. It ran at around 360 watts, which gives it 0.13433 GFLOPS per watt. The Radeon 4870 did an AstroPulse WU and achieved an average of 101.38 GFLOPS while using about 60 watts, giving it the best performance per watt at 1.689. The catch, though, is that you have to have the rest of the computer running. If you combine the performance values for the CPU and GPU, the performance per watt drops to 0.3565.
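To make the arithmetic explicit, here is a minimal sketch of the calculation using the numbers above (the watt figures are my own approximate meter readings, so treat the results as ballpark only):

# Performance per watt = (average GFLOPS per core * cores) / system watts.
# All inputs below are the approximate figures quoted in this post.

def gflops_per_watt(avg_gflops, cores, watts):
    return avg_gflops * cores / watts

ouya        = gflops_per_watt(1.515, 4, 5.0)     # ~1.212 GFLOPS/W
desktop_cpu = gflops_per_watt(8.06, 6, 360.0)    # ~0.134 GFLOPS/W
radeon_4870 = 101.38 / 60.0                      # ~1.689 GFLOPS/W (card alone)

# CPU + GPU combined: total GFLOPS over total draw (~360 W + ~60 W).
combined = (8.06 * 6 + 101.38) / (360.0 + 60.0)  # ~0.3565 GFLOPS/W

print(ouya, desktop_cpu, radeon_4870, combined)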

I suspect the most efficient option is a lower-power CPU with a really good GPU. As long as the CPU can feed the GPU enough data to keep it chugging away, that may produce the best performance-per-watt results.

Of course this doesn't account for RAC at all, just the amount of processing work a device can do compared to the power consumed.
mavrrick
Joined: 12 Apr 00
Posts: 17
Credit: 1,894,993
RAC: 4
United States
Message 1552292 - Posted: 4 Aug 2014, 18:54:39 UTC - in response to Message 1552274.  

A few things I thought of after posting.

1. Depending on the GPU, a significant hike in power usage can occur just by installing a dedicated card. The 4870 adds about 90-100 watts to the base system's power draw, so if the card were removed my desktop's power efficiency would increase a fair amount, up to around 0.193 GFLOPS per watt.

2. There are obviously more power-efficient CPUs now than my six-core Phenom II. It would be nice to get some comparable numbers from some newer higher-end systems.


Here is an interesting point to think about too. If you take the "Average GFLOPS" as a way to indicate the processing speed of the device, then in theory 8 Ouyas could generate the same SETI@home processing power as my desktop (8 x 6.06 GFLOPS is about 48.5 GFLOPS, versus 48.36 GFLOPS for the desktop CPU), and they would only need about 40 watts to do so. That is a fair amount of power savings, and possibly heat savings, and if they are doing the same work they should receive the same RAC.
HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1552293 - Posted: 4 Aug 2014, 19:01:09 UTC
Last modified: 4 Aug 2014, 19:10:41 UTC

Watts per FLOP, or FLOPS per watt, is the name of the game.
I think it might be best to calculate the performance of each app separately.

For one of my i5-4670K systems.
Application      GFLOPS   Cores   Total GFLOPS   System Watts   GFLOPS/Watt
SETI@home v7      42.81       4         171.24             90         1.903
AstroPulse v6    106.52       4         426.08             90         4.734


Compared to my Bay Trail-D system.
Application      GFLOPS   Cores   Total GFLOPS   System Watts   GFLOPS/Watt
SETI@home v7      10.25       4          41.00             25         2.050
AstroPulse v6     21.30       4          85.20             25         4.260


For an ARM system I would do each app version for comparison. One might perform better than the others, and that information could be fed back to home base.

EDIT:
I will also add that my low-powered system is drawing more than it should, as I have an oversized PSU that is not very efficient. Ideally it would be in the 10-20 W range for power consumption.
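For anyone who wants to reproduce the per-app numbers, here is a minimal sketch using the i5-4670K figures from the first table (the 90 W value is an approximate whole-system reading, so the results are ballpark only):

# Per-app GFLOPS-per-watt from (per-core GFLOPS, cores, system watts).
# Figures below are the i5-4670K numbers quoted above; swap in your own.

apps = {
    "SETI@home v7": 42.81,    # average GFLOPS per core for this app
    "AstroPulse v6": 106.52,
}
cores = 4
system_watts = 90.0

for app, gflops in apps.items():
    total = gflops * cores
    print(f"{app:15s} {total:7.2f} GFLOPS  {total / system_watts:.3f} GFLOPS/W")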
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1552368 - Posted: 4 Aug 2014, 22:12:37 UTC - in response to Message 1552292.  
Last modified: 4 Aug 2014, 22:16:13 UTC

If you take the "Average GFLOPS" as a way to indicate the processing speed of the device...

There's the problem. GFLOPS is an indicator, but a very, very poor one. Depending on the application being run, a card with a lower GFLOPS rating can process more work per hour than one with a much higher rating.
The number of WUs per hour is a better indicator, but the mix of WUs (VHARs, shorties) makes it difficult to compare things.
Average Processing Rate is a good one, as it directly relates to the work being done. Unfortunately it isn't accurate when processing more than one WU at a time, since that results in a lower APR even though the work done per hour is much higher than with a single WU at a time.
RAC is probably the best indicator; however, due to the nature of Credit New (almost completely borked) you can only compare MB to MB and AP to AP. And people who run a mix of the two can't really be compared to either (or even to each other, due to the different mixes).
mavrrick
Joined: 12 Apr 00
Posts: 17
Credit: 1,894,993
RAC: 4
United States
Message 1552384 - Posted: 4 Aug 2014, 23:11:45 UTC - in response to Message 1552368.  

So are you saying that field is a rating and not a measured value? The fact that it is labeled an average suggests some level of analysis is being done.

To me it looks more like an indication of how many GFLOPS that host is able to achieve when running that application.

WUs per hour is useless for what I am getting at here. I am looking at actual computational performance rather than relating it to WUs; we all know not all WUs are the same. There has to be some way to calculate the amount of work done per second, and I suspect GFLOPS is it.

Doesn't GFLOPS refer to billions of floating-point operations per second? That is by definition a measurement of computational work per second and should be usable to compare devices across a range of configurations. It of course isn't perfect, as each system build has its nuances.

Nothing is going to be exactly perfect. The math I presented assumed 100% use of each core. My expectation is that the reported value represents only 1 core, since each application runs as a single thread, so in the math I set up the Average GFLOPS value is multiplied by the number of cores in the system. If any processing within the app spills off that core, the number will fluctuate a little.

The point here is to evaluate the potential of low-powered systems compared to giant number crunchers. Can you make up for the lower CPU performance with raw numbers and still maintain a good electrical footprint?
ivan
Volunteer tester
Joined: 5 Mar 01
Posts: 783
Credit: 348,560,338
RAC: 223
United Kingdom
Message 1552389 - Posted: 4 Aug 2014, 23:20:45 UTC - in response to Message 1552293.  

Compared to my Bay Trail-D system.
Application      GFLOPS   Cores   Total GFLOPS   System Watts   GFLOPS/Watt
SETI@home v7      10.25       4          41.00             25         2.050
AstroPulse v6     21.30       4          85.20             25         4.260


Hmm, my machine is running somewhat fewer FLOPS than yours for both MB and AP. I haven't worked out how to enable the iGPU for crunching under Linux yet.

I took delivery of an Nvidia Tegra TK1 "Jetson" SDK tonight and should have all the bits needed to run it (HDMI->DVI cable, USB hub, Keyboard+mouse) on next-day delivery tomorrow. First plan is to work out how it runs (it's an ARM version of Ubuntu) and install the latest CUDA libraries. Then, after I've got my hologram reconstructions running on the 192-core Kepler, I'll see if there are all the resources needed to compile BOINC & S@H on it. Watch, as they say, this space.

Must take my Wattmeter back into work next time I have to power down this rig (which is running 143 W ATM, it's usually around 250 W when the GPUs have APs to crunch).
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1552395 - Posted: 4 Aug 2014, 23:48:07 UTC - in response to Message 1552384.  
Last modified: 4 Aug 2014, 23:49:09 UTC

Doesn't GFLOPS refer to billions of floating-point operations per second? That is by definition a measurement of computational work per second and should be usable to compare devices across a range of configurations. It of course isn't perfect, as each system build has its nuances.

Not all FLOPS are equal; different operations have different overheads.
That's why FLOPS, just like the number of WUs per hour (due to the different types of WUs), isn't a good indicator of actual performance.
My present video cards have a much higher FLOPS rating than the cards they replaced; however, the older cards can actually process more WUs per hour than the new ones, because the present applications aren't optimised for the new video cards.
However, I can run 3 of my new video cards for less power than one of the old ones used.


As badly screwed up as Credit New is, and even with the very laggy nature of RAC, RAC is the best indicator of work done that we have.
Unfortunately it's not as good as it once was for comparing between different types of WU, and it's of no use at all for comparing between MB & AP, but it is very good for comparing between similar types of WU.


Can you make up for the lower CPU performance with raw numbers and still maintain a good electrical footprint?

Without a doubt (as I mentioned with my new video cards). However, similar to games, there will be instances where more CPU power is required to keep the faster video cards busy.
AP WUs are a good example: many people running high-end (and multiple high-end) video cards have to leave 1, 2 or even more CPU cores free just to feed the GPUs.
HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1552453 - Posted: 5 Aug 2014, 3:05:19 UTC - in response to Message 1552368.  

If you take the "Average GFLOPS" as a way to indicate the processing speed of the device...

There's the problem. GFLOPS is an indicator, but a very, very poor one. Depending on the application being run, a card with a lower GFLOPS rating can process more work per hour than one with a much higher rating.
The number of WUs per hour is a better indicator, but the mix of WUs (VHARs, shorties) makes it difficult to compare things.
Average Processing Rate is a good one, as it directly relates to the work being done. Unfortunately it isn't accurate when processing more than one WU at a time, since that results in a lower APR even though the work done per hour is much higher than with a single WU at a time.
RAC is probably the best indicator; however, due to the nature of Credit New (almost completely borked) you can only compare MB to MB and AP to AP. And people who run a mix of the two can't really be compared to either (or even to each other, due to the different mixes).

The way they said "Average GFLOPS" I figured they were talking about APR, which is displayed in GFLOPS. While it may not be the most accurate, it is a measure of the application's output on that device, so it does seem a valid measure to use.
It will be lower when running more than one task at a time on a device; however, (GFLOPS * instances) should reflect the increased output.
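A tiny sketch of that point, with made-up APR figures purely for illustration:

# Illustrative only: the per-task APR drops when a device runs several
# WUs at once, but total throughput is roughly APR * instances.
apr_single = 100.0   # hypothetical APR (GFLOPS) running 1 WU at a time
apr_double = 65.0    # hypothetical lower per-task APR running 2 WUs at once

throughput_single = apr_single * 1   # 100 effective GFLOPS
throughput_double = apr_double * 2   # 130 effective GFLOPS despite lower APR
print(throughput_single, throughput_double)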
HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1552454 - Posted: 5 Aug 2014, 3:09:14 UTC - in response to Message 1552389.  

Compared to my Bay Trail-D system.
Application      GFLOPS   Cores   Total GFLOPS   System Watts   GFLOPS/Watt
SETI@home v7      10.25       4          41.00             25         2.050
AstroPulse v6     21.30       4          85.20             25         4.260


Hmm, my machine is running somewhat fewer FLOPS than yours for both MB and AP. I haven't worked out how to enable the iGPU for crunching under Linux yet.

I took delivery of an Nvidia Tegra TK1 "Jetson" SDK tonight and should have all the bits needed to run it (HDMI->DVI cable, USB hub, Keyboard+mouse) on next-day delivery tomorrow. First plan is to work out how it runs (it's an ARM version of Ubuntu) and install the latest CUDA libraries. Then, after I've got my hologram reconstructions running on the 192-core Kepler, I'll see if there are all the resources needed to compile BOINC & S@H on it. Watch, as they say, this space.

Must take my Wattmeter back into work next time I have to power down this rig (which is running 143 W ATM, it's usually around 250 W when the GPUs have APs to crunch).


I am running optimized apps, and my system seems to like to stay at its burst frequency all of the time. Either or both of those could be the reason for my system's higher numbers.
ivan
Volunteer tester
Joined: 5 Mar 01
Posts: 783
Credit: 348,560,338
RAC: 223
United Kingdom
Message 1552676 - Posted: 5 Aug 2014, 20:30:30 UTC - in response to Message 1552389.  

I took delivery of an Nvidia Tegra TK1 "Jetson" SDK tonight and should have all the bits needed to run it (HDMI->DVI cable, USB hub, Keyboard+mouse) on next-day delivery tomorrow. First plan is to work out how it runs (it's an ARM version of Ubuntu) and install the latest CUDA libraries. Then, after I've got my hologram reconstructions running on the 192-core Kepler, I'll see if there are all the resources needed to compile BOINC & S@H on it. Watch, as they say, this space.

Well, I've got this far:
05-Aug-2014 16:03:28 [---] cc_config.xml not found - using defaults
05-Aug-2014 16:03:28 [---] Starting BOINC client version 7.2.42 for armv7l-unknown-linux-gnueabihf
05-Aug-2014 16:03:28 [---] log flags: file_xfer, sched_ops, task
05-Aug-2014 16:03:28 [---] Libraries: libcurl/7.35.0 OpenSSL/1.0.1f zlib/1.2.8 libidn/1.28 librtmp/2.3
05-Aug-2014 16:03:28 [---] Data directory: /home/ubuntu/BOINC
05-Aug-2014 16:03:28 [---] CUDA: NVIDIA GPU 0: GK20A (driver version unknown, CUDA version 6.0, compute capability 3.2, 1746MB, 141MB available, 327 GFLOPS peak)
05-Aug-2014 16:03:28 [---] Host name: tegra-ubuntu
05-Aug-2014 16:03:28 [---] Processor: 1 ARM ARMv7 Processor rev 3 (v7l)
05-Aug-2014 16:03:28 [---] Processor features: swp half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt
05-Aug-2014 16:03:28 [---] OS: Linux: 3.10.24-g6a2d13a
05-Aug-2014 16:03:28 [---] Memory: 1.71 GB physical, 0 bytes virtual
05-Aug-2014 16:03:28 [---] Disk: 11.69 GB total, 5.63 GB free
05-Aug-2014 16:03:28 [---] Local time is UTC +0 hours
05-Aug-2014 16:03:28 [---] No general preferences found - using defaults
05-Aug-2014 16:03:28 [---] Preferences:
05-Aug-2014 16:03:28 [---]    max memory usage when active: 873.11MB
05-Aug-2014 16:03:28 [---]    max memory usage when idle: 1571.60MB
05-Aug-2014 16:03:28 [---]    max disk usage: 5.55GB
05-Aug-2014 16:03:28 [---]    don't use GPU while active
05-Aug-2014 16:03:28 [---]    suspend work if non-BOINC CPU load exceeds 25%
05-Aug-2014 16:03:28 [---]    (to change preferences, visit a project web site or select Preferences in the Manager)
05-Aug-2014 16:03:28 [---] Not using a proxy
05-Aug-2014 16:03:28 [---] This computer is not attached to any projects
05-Aug-2014 16:03:28 [---] Visit http://boinc.berkeley.edu for instructions
05-Aug-2014 16:03:29 Initialization completed
05-Aug-2014 16:03:29 [---] Suspending GPU computation - computer is in use
05-Aug-2014 16:04:00 [---] Received signal 2
05-Aug-2014 16:04:01 [---] Exit requested by user

As with the Celeron I bought recently, I had a lot of trouble with the graphics, especially finding the GL, GLU and GLUT libraries -- compounded by the fact that neither install (the Celeron is CentOS 7) had g++ by default, and ./configure doesn't really point that out to you. The big showstopper is wxWidgets. I need to compile it myself, and it looks like the BOINC code isn't compatible with anything past 2.8.3 -- but 2.8.3 won't compile with gcc 4.8.3, apparently. So I haven't got boincmgr running on either yet.
Now to try to attach to S@H since the project is up again!
ivan
Volunteer tester
Joined: 5 Mar 01
Posts: 783
Credit: 348,560,338
RAC: 223
United Kingdom
Message 1552687 - Posted: 5 Aug 2014, 20:52:09 UTC - in response to Message 1552676.  


05-Aug-2014 16:03:28 [---] This computer is not attached to any projects
05-Aug-2014 16:03:28 [---] Visit http://boinc.berkeley.edu for instructions
05-Aug-2014 16:03:29 Initialization completed
05-Aug-2014 16:03:29 [---] Suspending GPU computation - computer is in use
05-Aug-2014 16:04:00 [---] Received signal 2
05-Aug-2014 16:04:01 [---] Exit requested by user

Now to try to attach to S@H since the project is up again!

05-Aug-2014 20:31:55 [---] Suspending GPU computation - computer is in use
05-Aug-2014 20:39:33 [---] Running CPU benchmarks
05-Aug-2014 20:39:33 [---] Suspending computation - CPU benchmarks in progress
05-Aug-2014 20:39:33 [---] Running CPU benchmarks
05-Aug-2014 20:39:33 [---] Running CPU benchmarks
05-Aug-2014 20:39:33 [---] Running CPU benchmarks
05-Aug-2014 20:39:33 [---] Running CPU benchmarks
05-Aug-2014 20:40:05 [---] Benchmark results:
05-Aug-2014 20:40:05 [---]    Number of CPUs: 4
05-Aug-2014 20:40:05 [---]    966 floating point MIPS (Whetstone) per CPU
05-Aug-2014 20:40:05 [---]    6829 integer MIPS (Dhrystone) per CPU
05-Aug-2014 20:40:06 [---] Resuming computation
05-Aug-2014 20:40:12 [http://setiathome.berkeley.edu/] Master file download succeeded
05-Aug-2014 20:40:17 [---] Number of usable CPUs has changed from 4 to 1.
05-Aug-2014 20:40:17 [http://setiathome.berkeley.edu/] Sending scheduler request: Project initialization.
05-Aug-2014 20:40:17 [http://setiathome.berkeley.edu/] Requesting new tasks for CPU and NVIDIA
05-Aug-2014 20:40:22 [SETI@home] Scheduler request completed: got 0 new tasks
05-Aug-2014 20:40:22 [SETI@home] This project doesn't support computers of type armv7l-unknown-linux-gnueabihf
05-Aug-2014 20:40:24 [SETI@home] Started download of arecibo_181.png
05-Aug-2014 20:40:24 [SETI@home] Started download of sah_40.png
05-Aug-2014 20:40:27 [SETI@home] Finished download of arecibo_181.png
05-Aug-2014 20:40:27 [SETI@home] Finished download of sah_40.png
05-Aug-2014 20:40:27 [SETI@home] Started download of sah_banner_290.png
05-Aug-2014 20:40:27 [SETI@home] Started download of sah_ss_290.png
05-Aug-2014 20:40:29 [SETI@home] Finished download of sah_banner_290.png
05-Aug-2014 20:40:29 [SETI@home] Finished download of sah_ss_290.png
05-Aug-2014 20:43:41 [---] Resuming GPU computation
05-Aug-2014 20:44:27 [---] Suspending GPU computation - computer is in use
:-)
Ah, there it is!
ivan
Volunteer tester
Joined: 5 Mar 01
Posts: 783
Credit: 348,560,338
RAC: 223
United Kingdom
Message 1552955 - Posted: 6 Aug 2014, 16:53:29 UTC - in response to Message 1552687.  
Last modified: 6 Aug 2014, 16:54:28 UTC

Ah, there it is!
Well, I got both my hologram reconstruction and s@h compiled and running on the Jetson today. The holograms run about 10x slower than on my GTX 750 Ti (1.5 frames/sec for a 4Kx4K reconstruction). No real problems with the s@h, just the missing include I reported last January, and I had to edit the config file to remove the old compute capabilities that nvcc didn't like and put in 3.2 for the Tegra.
The first WU has just finished; Run time 50 min 57 sec, CPU time 21 min 32 sec. Not validated yet. Run time is just about twice what I'm currently achieving with the 750 Ti, but that's running two at once.
qbit
Volunteer tester
Joined: 19 Sep 04
Posts: 630
Credit: 6,868,528
RAC: 0
Austria
Message 1552966 - Posted: 6 Aug 2014, 17:19:47 UTC - in response to Message 1552955.  


The first WU has just finished; Run time 50 min 57 sec, CPU time 21 min 32 sec. Not validated yet. Run time is just about twice what I'm currently achieving with the 750 Ti, but that's running two at once.

From your stderr out:

setiathome enhanced x41zc, Cuda 6.00


Where did you get this version from?
mavrrick
Joined: 12 Apr 00
Posts: 17
Credit: 1,894,993
RAC: 4
United States
Message 1552975 - Posted: 6 Aug 2014, 17:32:51 UTC - in response to Message 1552966.  

I suspect he compiled it himself based on his previous posts.

I am really curious about how he set this up, and whether he will share directions later for those adventurous enough to attempt it after him :). That is pretty awesome performance for a system that Nvidia says uses 5 watts under real workloads. It would be nice to see that validated, but even at 10 watts that is some awesome crunching.

This topic is actually making me a little annoyed at my own gear now. I am seeing how bad my desktop really is, and I am at the point where I would rather not turn it on at all. I am considering just selling it off for parts to build something more energy efficient.

Now all you need to do is set up about 8 of them as a cluster and let them churn out some WUs.
ivan
Volunteer tester
Joined: 5 Mar 01
Posts: 783
Credit: 348,560,338
RAC: 223
United Kingdom
Message 1552992 - Posted: 6 Aug 2014, 18:57:13 UTC - in response to Message 1552966.  


The first WU has just finished; Run time 50 min 57 sec, CPU time 21 min 32 sec. Not validated yet. Run time is just about twice what I'm currently achieving with the 750 Ti, but that's running two at once.

From your stderr out:

setiathome enhanced x41zc, Cuda 6.00


Where did you get this version from?

As mavrrick says, I compiled it myself. However, the hard part isn't s@h; the hard part is compiling BOINC. It has so many prerequisites. The basic instructions are here.
git clone git://boinc.berkeley.edu/boinc-v2.git boinc
cd boinc
git tag [Note the version corresponding to the latest recommendation.]
git checkout client_release/<required release>; git status
./_autosetup
./configure --disable-server --enable-manager
make -j n [where n is the number of cores/threads at your disposal]

The problems you will have are first finding the libraries and utilities that _autosetup wants, then ensuring that you have g++ installed, and then finding all the libraries and development packages that configure wants (you need the -dev packages for the header definition files). The final hurdle, if you want to use the boincmgr graphical interface, is getting wxWidgets. It tends not to be included in the repositories of modern distributions now, so you have to try to compile it yourself -- which I haven't managed lately, as BOINC wants an old version which was (apparently) badly coded and gives lots of problems with the newest, smartest gcc/g++ compilers. You may need to just learn how to use the boinccmd command-line controller...
The simplest way to then compile s@h was detailed back in January, in this thread.
cd <directory your boinc directory is in>
svn checkout -r1921 https://setisvn.ssl.berkeley.edu/svn/branches/sah_v7_opt/Xbranch
cd Xbranch
[edit client/analyzeFuncs.h and add the line '#include <unistd.h>']
sh ./_autosetup
sh ./configure BOINCDIR=../boinc --enable-sse2 --enable-fast-math
make -j n

This assumes you've installed the CUDA SDK and added the appropriate locations to your PATH and LD_LIBRARY_PATH environment variables, but that's well-covered in the Nvidia documentation. As I alluded to above, you will probably have to edit the configure file too, to make sure obsolete gencode entries are removed and appropriate ones for your kit are included. Oh, and drop the --enable-sse2 if you're compiling for other than Intel/AMD CPUs.
HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1553003 - Posted: 6 Aug 2014, 19:32:19 UTC - in response to Message 1552975.  

I suspect he compiled it himself based on his previous posts.

I am really curious about how he set this up, and whether he will share directions later for those adventurous enough to attempt it after him :). That is pretty awesome performance for a system that Nvidia says uses 5 watts under real workloads. It would be nice to see that validated, but even at 10 watts that is some awesome crunching.

This topic is actually making me a little annoyed at my own gear now. I am seeing how bad my desktop really is, and I am at the point where I would rather not turn it on at all. I am considering just selling it off for parts to build something more energy efficient.

Now all you need to do is set up about 8 of them as a cluster and let them churn out some WUs.

A home-built cluster is not much of an advantage for SETI@home or BOINC, as you have to run the app on each node in the cluster.

Most of my computers are on for other reasons, so I run SETI@home on them. My past several system upgrades have been made to increase efficiency more than to increase performance.
mavrrick
Joined: 12 Apr 00
Posts: 17
Credit: 1,894,993
RAC: 4
United States
Message 1553006 - Posted: 6 Aug 2014, 19:49:17 UTC - in response to Message 1552975.  

That is pretty awesome performance for a system that Nvidia says uses 5 watts under real workloads.


The more I look at Nvidia's support page, the less sense this makes. I don't think that figure applies to pushing the CUDA cores to their limit. It would be interesting to get a power meter on it to see what its usage actually is.
HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1553011 - Posted: 6 Aug 2014, 20:19:34 UTC - in response to Message 1553006.  
Last modified: 6 Aug 2014, 20:21:17 UTC

That is pretty awesome performance for a system that Nvidia says uses 5 watts under real workloads.


The more I look at Nvidia's support page, the less sense this makes. I don't think that figure applies to pushing the CUDA cores to their limit. It would be interesting to get a power meter on it to see what its usage actually is.

"the Kepler GPU in Tegra K1 consists of 192 CUDA cores and consumes less than two watts*.
*Average power measured on GPU power rail while playing a collection of popular mobile games."

Under full load from SETI@home the GPU's average power consumption may be much higher, as the load will not be as varied as when playing a game.
I just did a quick test with an i5-3470. The iGPU averages 3.6 W in GPU-Z under full load from an app called HeavyLoad, while playing Flash games the average reading in GPU-Z is 0.4 W. I don't have much else game-wise to test with on this system, as it is my cubicle system.
ivan
Volunteer tester
Joined: 5 Mar 01
Posts: 783
Credit: 348,560,338
RAC: 223
United Kingdom
Message 1553058 - Posted: 6 Aug 2014, 21:49:55 UTC - in response to Message 1553006.  

That is pretty awesome performance for a system that Nvidia says uses 5 watts under real workloads.

The more I look at Nvidia's support page, the less sense this makes. I don't think that figure applies to pushing the CUDA cores to their limit. It would be interesting to get a power meter on it to see what its usage actually is.

The Jetson docs I was reading yesterday said that total at-the-wall consumption was (IIRC) 10.something W. Then it went through the chain describing the inefficiencies (20% loss in the power brick, etc.). It did stress that, because it is a development board, the peripherals hadn't been chosen for low power consumption. Next time I power down my home system I'll remove my power-meter and apply it to the Jetson instead.
Remember, though, that the Jetson runs its cores at a lower frequency than many PCI-e video cards and the memory bus is narrower, which drops power consumption.
/home/ubuntu/CUDA-SDK/NVIDIA_CUDA-6.0_Samples/bin/armv7l/linux/release/gnueabihf/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GK20A"
  CUDA Driver Version / Runtime Version          6.0 / 6.0
  CUDA Capability Major/Minor version number:    3.2
  Total amount of global memory:                 1746 MBytes (1831051264 bytes)
  ( 1) Multiprocessors, (192) CUDA Cores/MP:     192 CUDA Cores
  GPU Clock rate:                                852 MHz (0.85 GHz)
  Memory Clock rate:                             924 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 131072 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.0, CUDA Runtime Version = 6.0, NumDevs = 1, Device0 = GK20A
Result = PASS

mavrrick
Joined: 12 Apr 00
Posts: 17
Credit: 1,894,993
RAC: 4
United States
Message 1553085 - Posted: 6 Aug 2014, 22:43:06 UTC - in response to Message 1553058.  

Well, I was just looking over some docs and saw a peak power draw into the 40s of watts. I am not as much into the hardware as I used to be, and it got me thinking about what would probably be the biggest consumer of power. The peripherals are a given, as that was brought up in the doc I was looking over about power, but I was also wondering how they quantified a typical real workload.

I was just thinking that running a CUDA SETI@home app isn't a typical real-world load.

10 watts seems very reasonable to me.

The comment about a Jetson TK1 cluster was really about the two ways I see to increase efficiency. You either get faster CPUs/GPUs that do the work in a shorter amount of time, or you get many smaller CPUs that don't use as much energy and do more WUs at one time, with each one just taking longer.

It goes back to the question of what is more efficient. If it only takes 4 TK1s to complete the same amount of work as a regular high-end desktop, and they use 1/5 the power to run those WUs, then that would be the most energy-efficient way to go (a rough sketch of that comparison is below). I was being sarcastic, but that was my thought when I said it. Someone on the TK1 developer forums has a cluster of about 9 nodes set up.
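Here is a minimal sketch of what "more efficient" means in that comparison; all of the numbers in it are hypothetical placeholders, not measurements:

# Compare energy per work unit: watts * seconds per WU. Lower is better.
# All figures are hypothetical placeholders, not measured values.

def joules_per_wu(watts, wus_per_hour):
    return watts * 3600 / wus_per_hour

desktop = joules_per_wu(watts=360, wus_per_hour=12)    # hypothetical big box
tk1     = joules_per_wu(watts=10, wus_per_hour=1.5)    # hypothetical TK1 node

# A cluster of identical nodes has the same joules-per-WU as a single node:
# adding nodes scales throughput and power together, so the efficiency
# question is settled per node, regardless of cluster size.
print(f"desktop: {desktop:.0f} J/WU, TK1 node: {tk1:.0f} J/WU")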

I may be in the minority on this site, but I don't run my systems 24/7 anymore. I have one box at my house that is always on, and it is my server, so it has to be up. The rest are just clients, so I have all the power-saving goodness set up on them and let them sleep normally when not being used. The second most used system is an HTPC, which was built to be rather power efficient, although I am sure I could do better now. This is what drives this for me. I would love to still contribute, but I want to do it in a rather green manner. Low-power ARM devices seem to be the best option, and I am just hoping they can keep the power company from raiding my wallet. You could almost say this is research to see whether I can find a way to get back into contributing much again :).

I also like what you said about the Bay Trail-D system. If you don't mind me asking, which one do you have?