Sha 256 & Scrypt BItcoin Mining Optimisations

Author	Message
Darth Beaver Send message Joined: 20 Aug 99 Posts: 6728 Credit: 21,443,075 RAC: 3	Message 1741643 - Posted: 12 Nov 2015, 23:51:56 UTC This thread is for Optimisation Setting for Mining Bit Coin - Scrypt coin . Either for your own personal benefit or if like me your on limited funds and can't Donate any other way , doing it part time and small scale to generate money to donate to SETI@HOME This file is a Read me from GUIMiner-Scrypt-win_x64-v0.05 The latest 64 bit version , the 32 bit version does not contain this Read Me file so I put it here for future reference when using the 32 bit version . The setting contained here also work in the 32 bit version . Although it does not pertain to SETI@HOME there is some info to explain CUDA Class of cards and how many registers they have , all thou do not use the actual commands in SETI and you may see some correlation in the numbers used and how they are used in a cmdline in seti The file is quiet long so I apologize for that but better to read the whole thing than pick certain things out and have lot's of Questions I may not be able to answer CudaMiner release February 28th 2014 - Speedup release ------------------------------------------------------------- ************************************************************* If you find this tool useful and like to support its continued development, then consider a donation. LTC donation address: LKS1WDKGED647msBQfLBHV3Ls8sveGncnm BTC donation address: 16hJF5mceSojnTD3ZTUDqdRhDyPJzoRakM YAC donation address: Y87sptDEcpLkLeAuex6qZioDbvy1qXZEj4 VTC donation address: VrjeFzMgvteCGarLw85KivBzmsiH9fqp4a MAX donation address: mHrhQP9EFArechWxTFJ97s9D3jvcCvEEnt DOGE donation address: DT9ghsGmez6ojVdEZgvaZbT2Z3TruXG6yP PANDA donation address: PvgtxJ2ZKaudRogCXfUMLXVaWUMcKQgRed MRC donation address; 1Lxc4JPDpQRJB8BN4YwhmSQ3Rcu8gjj2Kd ************************************************************* >>> Introduction <<< This is a CUDA accelerated mining application for most of your AltCoin mining needs. Your nVidia cards can be very efficient miners - don't believe the AMD fanboy myth that nVidia cards suck at mining! We've recently had a watercooled 780Ti break 900 kHash/s at scrypt (N=1024) mining. This application is currently supporting 1) scrypt mining with N=1024 (LiteCoin and many, many clones) 2) scrypt-jane mining (Yacoin and several clones) 3) scrypt mining with larger N (VertCoin) 4) NEW: MaxCoin mining (SHA-3 i.e. Keccak256) <<< SEE MAXCOIN SPECIFICS SECTION You should see a notable speed-up compared to OpenCL based miners. We're not supporting Quark, ProtoShares (Momentum) or any other highly specialized "CPU-only" coin. And certainly no BitCoin: This train has left the station quite some time ago! >>> Command Line Interface <<< This code is based on the pooler cpuminer 2.3.2 release and inherits its command line interface and options. -a, --algo=ALGO specify the algorithm to use (default is scrypt) scrypt scrypt Salsa20/8(1024, 1, 1), PBKDF2(SHA2) scrypt:N scrypt Salsa20/8(N, 1, 1), PBKDF2(SHA2) scrypt-jane scrypt Chacha20/8(N, 1, 1), PBKDF2(Keccak) scrypt-jane:Coin Coin must be one of the supported coins. scrypt-jane:Nfactor scrypt-chacha20/8(22^Nfactor, 1, 1) scrypt-jane:StartTime,Nfmin,Nfmax like above nFactor derived from Unix time. sha256d SHA-256d (don't use this! No GPU acceleration) keccak Keccak256 as used in MaxCoin -o, --url=URL URL of mining server (default: " DEF_RPC_URL ") -O, --userpass=U:P username:password pair for mining server -u, --user=USERNAME username for mining server -p, --pass=PASSWORD password for mining server --cert=FILE certificate for mining server using SSL -x, --proxy=[PROTOCOL://]HOST[:PORT] connect through a proxy -t, --threads=N number of miner threads (default: number of processors) -r, --retries=N number of times to retry if a network call fails (default: retry indefinitely) -R, --retry-pause=N time to pause between retries, in seconds (default: 15) -T, --timeout=N network timeout, in seconds (default: 270) -s, --scantime=N upper bound on time spent scanning current work when long polling is unavailable, in seconds (default: 5) --no-longpoll disable X-Long-Polling support --no-stratum disable X-Stratum support -q, --quiet disable per-thread hashmeter output -D, --debug enable debug output -P, --protocol-dump verbose dump of protocol-level activities -B, --background run the miner in the background --benchmark run in offline benchmark mode -c, --config=FILE load a JSON-format configuration file -V, --version display version information and exit -h, --help display this help text and exit Additional cudaminer specific command line options are: --no-autotune disables the built-in autotuning feature for maximizing CUDA kernel efficiency and uses some heuristical guesswork, which might not be optimal. --devices [-d] gives a comma separated list of CUDA device IDs to operate on. Device IDs start counting from 0! Alternatively give string names of your card like gtx780ti or gt640#2 (matching 2nd gt640 in the PC). --launch-config [-l] specify the kernel launch configuration per device. This replaces autotune or heuristic selection. You can pass the strings "auto" or just a kernel prefix like F or K or T to autotune for a specific card generation or a kernel prefix plus a lauch configuration like F28x8 if you know what kernel runs best (from a previous autotune). --interactive [-i] list of flags (0 or 1) to enable interactive desktop performance on individual cards. Use this to remove lag at the cost of some hashing performance. Do not use large launch configs for devices that shall run in interactive mode - it's best to use autotune! --batchsize [-b] comma separated list of max. scrypt iterations that are run in one kernel invocation. Default is 1024. Best to use powers of 2 here. Increase for better performance in scrypt-jane with high N-factors. Lower for more interactivity of your video display especially when using the interactive mode. --texture-cache [-C] list of flags (0 or 1 or 2) to enable use of the texture cache for reading from the scrypt scratchpad. 1 uses a 1D cache, whereas 2 uses a 2D texture layout. Cached operation has proven to be slightly faster than noncached operation on most GPUs. --single-memory [-m] list of flags (0 or 1) to make the devices allocate their scrypt scratchpad in a single, consecutive memory block. On Windows Vista, 7/8 this may lead to a smaller memory size being used. When using the texture cache this option is implied. --hash-parallel [-H] scrypt also has a small SHA256 or Keccak component: 0 hashes this single threaded on the CPU. 1 to enable multithreaded hashing on the CPU. 2 offloads everything to the GPU (default) --lookup-gap [-L] values > 1 enable a tradeoff between memory savings and extra computation effort, in order to improve efficiency with high N-factor scrypt-jane coins. Defaults to 1. --time-limit Exit the miner after given number of seconds mining. Useful for doing some round robin pool (or worker) hopping from an external controls script. >>> Examples <<< Example for Litecoin Mining on coinotron pool with GTX 660 Ti cudaminer -d gtx660ti -l K28x32 -C 2 -i 0 -o stratum+tcp://coinotron.com:3334 -O workername:password Example for Yacoin Mining on yac.coinmine.pl pool with GTX 780 cudaminer -s 10 --algo=scrypt-jane -d gtx780 -L 3 -l T9x21 -b 4096 -C 0 -i 1 -m 0 -o stratum+tcp://yac.coinmine.pl:9088 -O workername:password Example for VertCoin mining on vtcpool.co.uk with GTX 660 Ti (assuming N=2048, changes over time) cudaminer --algo=scrypt:2048 -d gtx660ti -l K14x32 -C 2 -i 0 -o stratum+tcp://vtcpool.co.uk:3333 -O workername:password You will have to adjust the -d parameter. In most cases a -d 0 will work for you. Specifying video cards by name is best when you often swap your video cards. The device IDs tend to change a lot, whereas the names are much more consistent. If you are not sure what configuration your video card might need, then leave away the -l option and let cudaminer autotune. For scrypt-jane coins with high N factor using a lookup gap with values greater than 1 will likely boost your performance. Best to try -L 1 first and work your way up. For solo-mining you typically use -o 127.0.0.1:xxxx where xxxx represents the RPC portnumber specified in your wallet's .conf file and you have to pass the same username and password with -O as specified in the wallet's .conf file. The wallet must also be started with the -server option and the server flag in the wallet's .conf file set to 1 >>> About CUDA Kernels <<< CUDA kernels do all the computation. Which one we select and in which configuration it is run greatly affects performance. The CUDA kernel launch configurations are given as a character string, e.g. F27x3 Prefix Blocks x Warps per block Available kernel prefixes are: F or f - Fermi and Legacy cards (Compute 1.x and 2.x) K or k - Kepler cards (Compute 3.0) T or t - Titan, GTX 780, GK208 and GM107 based cards (Compute 3.5 or later) Upper case kernel prefixes mean high register count kernels. Lower case kernel prefixes mean low register count kernels. so F27x3 means: use Fermi kernel with high register count run 27 blocks in total each block consisting of 3 warps or 96 threads (a warp is a group of 32 threads) You will want to pick lower case letters for scrypt-jane based coins with a high N-factor (N being 12 and above...) because the performance can be much better. If you do not specify a kernel to use, autotune will pick a kernel that might be best for your hardware and selected algorithm. You can also override the autotune's automatic kernel selection, e.g. pass -l F or -l K or -l T in order to autotune the Fermi, Kepler or Titan kernels overriding the automatic selection. >>> Table of CUDA Kernels <<< Different CUDA kernels are identified by their Prefix letter. In some cases an alias letter is also accepted, to ensure the existing launch configs from the December 2013/January 2014 development versions are still running. Prefix Alias Compute Req. Registers use for F L 1.0 64 scrypt & low N-factor scrypt-jane K Y 3.0 63 scrypt & low N-factor scrypt-jane T Z 3.5 80 scrypt & low N-factor scrypt-jane f X 1.0 32 high N-factor scrypt-jane k 3.0 32 high N-factor scrypt-jane t 3.5 32 high N-factor scrypt-jane the old "Legacy" kernel has been replaced with the F kernel, which will also be faster on Compute 1.0 legacy devices in many cases. Therefore the F kernel has been compiled to require only Compute 1.0 capability. >>> scrypt-jane Specifics <<< scrypt-jane coins are designed to become more memory-hard over time. Some of the older coins like Yacoin already approach very high N-factors. Yacoin at the time of writing requires 4 MB per hash at N-factor 14. This means that a card with 4 GB video RAM can only compute 1024 hashes in parallel. This is a way too low number to fully occupy all the card's computational resources. Hence it is best on these cards to use kernels that run 4 threads per hash. These are the low register count kernels, all starting with lower case prefix letters (f, k, t). Additionally GPUs with enough computational reserves benefit from enabling the lookup-gap feature with the -L option and passing values > 1. This cuts the memory use per hash and allows us to run more hashes (threads) simultaneously. However additional computations have to be made to compensate for the reduced lookup tables: any missing intermediate values have to be recomputed "on the fly". Use -H 2 with any low N-factor (below 12) scrypt-jane coins, otherwise your CPU performance may be seriously limiting your hash rates. The following coin parameters have been hardcoded into cudaminer and can be given as a coin specifier with the --algo=scrypt-jane option [YBC] YBCoin, [ZZC] ZZCoin, [FEC] FreeCoin, [ONC] OneCoin, [QQC] QQCoin, [GPL] GoldPressedLatinum, [MRC] MicroCoin, [APC] AppleCoin, [CPR] Copperbars, [CACH] CacheCoinm, [UTC] UltraCoin, [VEL] VelocityCoin, [ITC] InternetCoin, [RAD] RadioactiveCoin e.g. --algo=scrypt-jane:YBC or --algo=scrypt-jane:YBCoin To mine new coins with different chain start time and minimum and maximum N-factors, you can pass the parameters to the --algo option like this: -algo=scrypt-jane:1389196388,4,30 >>> MaxCoin Specifics <<< MaxCoin support was possible on short notice only because the coin's launch was delayed by 24 hours and a cpuminer source code was posted to bitcointalk prematurely. The keccak hashing feature is not complete and user friendly yet, but it should be good enough to get you instamining AT COIN LAUNCH with several dozen megahashes at least. GTX 780 devices break 200 MHash/s. Yay! A GTX 660Ti can do nearly 100 MHash/s. The CudaMiner Windows binary release is made around 19:00 GMT on February 6th. You have about 30 minutes to find suitable launch configurations for all your cards before the coin actually launches. The MaxCoin wallet will be available from 19:30 GMT onwards on http://maxcoin.co.uk The following problems remain with the keccak algorithm: -The scrypt scratchpad is still allocated on the GPU. You must use a large -L (lookup gap) value in the order of 64 or 128 to allow the scratchpad to fit into your GPU memory when using large launch configurations. -The K kernel works better on Compute 3.5 devices than the T kernel. Duh? The F, K, T kernels support keccak mining. Maybe the K kernels run faster on Compute 3.5 devices. Best to try. Launch configs should look somewhat like this (do not exceed the warp figures given here) -l F1000x16 -l K1000x32 -l T1000x24 << now fastest on my GTX 780 Best to replace the 1000 blocks with the number of your card's CUDA cores, or even twice that value. It seems fastest that way, Pick a pool or solo-mine. Good luck! Example session on GTX 780Ti: cudaminer.exe --algo=keccak -d gtx780 -i 0 -m 1 -l T2304x24 -o stratum+tcp://maxpool.1gh.com:17333 -u WALLETADDRESS -p x CudaMiner for nVidia GPUs by Christian Buchner * This is version 2014-02-18 (beta) based on pooler-cpuminer 2.3.2 (c) 2010 Jeff Garzik, 2012 pooler Cuda additions Copyright 2013,2014 Christian Buchner LTC donation address: LKS1WDKGED647msBQfLBHV3Ls8sveGncnm BTC donation address: 16hJF5mceSojnTD3ZTUDqdRhDyPJzoRakM YAC donation address: Y87sptDEcpLkLeAuex6qZioDbvy1qXZEj4 [2014-02-28 22:56:00] Starting Stratum on stratum+tcp://maxpool.1gh.com:17333 [2014-02-28 22:56:00] 1 miner threads started, using 'keccak' algorithm. [2014-02-28 22:56:01] GPU #0: GeForce GTX 780 with compute capability 3.5 [2014-02-28 22:56:01] GPU #0: interactive: 0, tex-cache: 0 , single-alloc: 1 [2014-02-28 22:56:01] GPU #0: 32 hashes / 4.0 MB per warp. [2014-02-28 22:56:01] GPU #0: using launch configuration T2304x24 [2014-02-28 22:56:02] GPU #0: GeForce GTX 780, 196998 khash/s [2014-02-28 22:56:02] accepted: 1/1 (100.00%), 196998 khash/s (yay!!!) [2014-02-28 22:56:04] GPU #0: GeForce GTX 780, 200200 khash/s ... [2014-02-28 22:56:05] GPU #0: GeForce GTX 780, 202637 khash/s [2014-02-28 22:56:05] accepted: 4/4 (100.00%), 202637 khash/s (yay!!!) [2014-02-28 22:56:07] GPU #0: GeForce GTX 780, 203136 khash/s [2014-02-28 22:56:07] accepted: 5/5 (100.00%), 203136 khash/s (yay!!!) >>> Additional Notes <<< This code should be running on nVidia GPUs ranging from compute capability 1.0 up to compute capability 3.5. Just don't expect any hashing miracles from your old clunkers. Compute 1.0 through 1.3 devices seem to run faster on Windows XP or Linux because these OS'es use a more efficient driver model. Some coins mine a bit faster with the 32 bit cudaminer versions, other mine faster with the 64 bit cudaminer version. If your computer runs a 64 bit OS, try running both versions and compare the mining speeds! To see what autotuning does, enable the debug option (-D) switch. You will get a table of kHash/s for a variety of launch configurations. You may only want to do this when running on a single GPU, otherwise the autotuning output of multiple cards will get all mixed up. Note that mining through N-factor changes is risky and might fail. Typically a new N factor requires a different kernel launch configuration. I am trying to compensate for an N-factor increase by doubling the current lookup-gap value. This will allow the miner to keep working with the same memory buffers, but the additional lookup-gap requires more computations to be made. It is best to re-tune your kernel configuration after every N-factor change. >>> RELEASE HISTORY <<< The February 28th release adds speed increases on Yacoin (high N-factor scrypt-jane) and on keccak for Kepler and Maxwell devices. That is all ;) The February 18th release fixes a crash when encountering newer compute capabilities such as 3.7 or 5.0. As 5.0 is reported by Maxwell chips, this will improve the user experience. This release also contains some minor fixes, for example after a CTRL-C or after expiry of the time limit the program will terminate faster. I also made autotune work with Keccak, even though its usefulness is kind of limited. The February 9th release adds performance improvements and a fix for stratum rejects in maxcoin (duplicate submitted shares). The February 7th release adds fixes for stratum pool mining. The February 6th release added a first, but working keccak algorithm support to instamine at the MaxCoin launch. The miner worked, and we basically owned the coin. The February 4th release fixes a problem with very apparently incorrect autotune measurements and it also repairs the multi-GPU support. So you can again use a single cudaminer to drive all your GPUs. It wasn't the driver's fault after all, I was sloppy about some initialization of constant memory on the GPUs. The Febrary 2nd 2014 release supports scrypt-jane for the first time and includes faster scrypt kernels kindly submitted by nVidia. Most Dave Andersen and nVidia derived kernels now support -C 1 and -C 2 texture caching options providing a speed benefit in some cases. The December 18th 2013 milestone transitions cudaminer to CUDA 5.5, which makes it require newer nVidia drivers unfortunately. However users of Kepler devices will see a significant speed boost of 30% for Compute 3.0 devices and around 10% for Compute 3.5 devices. This was made possible by David Andersen who posted his more efficient miner code under Apache license. This release marks a first step of integrating his work. ... some history removed ... April 4th 2013 initial release. >>> TODO <<< Usability Improvements: - better CUDA error handling and recovery from errors - add failover support between different pools - smarter autotune algorithm - temperature and GPU utilization control features - an external API for system monitoring like that of CGMiner Further Optimization: - fix a problem in CUDA issue order that prevents overlapping of memory transfers and kernel launches (this could bring 5% more speed when fixed!) - check the hashes against target on the GPU, saving memory transfers over the PCI-x bus - reduce the thread divergence in the lookup-gap feature by sorting threads by loop trip count (requires a swapping of the thread state in the entire thread block) - allow for more optimized Keccak or SHA2 implementations for specific hardware (like Compute 3.5 using the funnel shifter) - make a direct port of the nv_kernel.cu to Fermi, using warp shuffle emulation with shared memory. >>> AUTHORS <<< Notable contributors to this application are: Christian Buchner (Germany): original CUDA implementation Alexis Provos (Greece): submitted a faster Salsa 20/8 round function. David Andersen (USA, Carnegie Mellon University): designed a low register count scrypt Kepler kernel with improved memory access. Currently best performing with high N-factor scrypt-jane. Alexey Panteleev (Moscow, nVidia): submitted a kernel with improved memory access functions for Kepler devices providing the fastest scrypt performance and provided further patches for other kernels. and also many thanks to anyone else who contributed to the original cpuminer application (Jeff Garzik, pooler) ! Source code is included to satisfy GNU GPL V2 requirements. With kind regards, Christian Buchner ( Christian.Buchner@gmail.com ) ID: 1741643 ·

©2024 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.