Houdini Pro version
Houdini Pro is intended for power users with high-end hardware. The main differences from the Standard version are:

Large Memory Pages
Houdini Pro will use so-called large memory pages if they are provided by the operating system. Depending on the hash table size the speed gain may be between 5% and 15%.
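To enable this feature in Windows, you need to modify the Group Policy for your account so that it holds the "Lock pages in memory" user right (in the Group Policy editor, under Computer Configuration, Windows Settings, Security Settings, Local Policies, User Rights Assignment).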
IMPORTANT: You'll also need to run your chess GUI with administrative rights ("Run as Administrator") or disable UAC in Windows. Very often large memory pages will only be available shortly after booting Windows. After a while the Windows memory becomes too fragmented for large page allocation, and Houdini will fall back to standard memory page usage.
You can test the availability of Large Pages with the lp command. Run Houdini in a command window (simply by double-clicking on the executable) and type lp followed by Enter. Houdini will try to create large page memory blocks of increasing size and show a summary of the results.

NUMA-awareness
Most motherboards with multiple CPU sockets employ the so-called "NUMA" architecture. Houdini Pro detects the NUMA configuration at start-up and will adapt its memory management and thread interaction based on the different NUMA nodes that are available. The speed gain depends on the number of cores, the motherboard and the CPU brand.

Running Multiple Houdini Pro instances
If you're simultaneously running multiple Houdini Pro instances, they will by default compete for the resources on the same NUMA nodes. To avoid this, you should set the NUMA Offset parameter to different values in the different Houdini instances. For example, if you want to run two Houdini instances with 6 threads each on 12-core hardware, you should use NUMA Offset 1 for the second instance so that it will allocate its 6 threads on the second NUMA node. See also the NUMA Offset configuration.

Some Real Performance Data

40-core dual Intel Xeon v4 at 2.3 GHz
This test system is a 40-core dual Intel Xeon v4 box running at 2.3 GHz under Windows 10, with 80 virtual processors (40 cores with hyper-threading). The system has two CPUs and two NUMA nodes; each 20-core CPU resides on its own NUMA node.
A two-minute benchmark was run on a number of positions using 1, 6, 20 and 40 threads with hash memory set to 2048 MB. The impact of Large Pages and NUMA-awareness on the measured average node speed was as follows:
When the engine can run fully on a single CPU, i.e. with up to 20 threads, Windows and the Intel Xeons provide excellent performance without any NUMA-awareness. Only with 40 threads spread across both CPUs of the system does NUMA-awareness become important.
For 2048 MB of hash memory the speed improvement from using Large Pages is about 6%. The impact grows as the hash memory becomes larger; repeating the same benchmark with 8192 MB of hash memory yields a speed increase from Large Pages of nearly 10%.
The numbers also show that Houdini Pro scales nearly perfectly with the number of threads: the 20-thread benchmark is nearly 20 times faster than the single-thread result, and the 40-thread run is nearly 40 times faster.
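
The percentages and scaling factors quoted above are simple ratios of average node speeds. As a minimal sketch of that arithmetic, the helper names below are hypothetical and the node speeds are placeholder values chosen for illustration, not measurements from this page:

```python
def percent_gain(baseline_nps: float, improved_nps: float) -> float:
    """Relative speed gain, in percent, of improved_nps over baseline_nps."""
    return (improved_nps / baseline_nps - 1.0) * 100.0


def scaling_efficiency(single_thread_nps: float, multi_thread_nps: float,
                       threads: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 means perfect scaling)."""
    return multi_thread_nps / (single_thread_nps * threads)


# Placeholder node speeds in nodes per second (assumed, NOT benchmark results).
standard_pages_nps = 1_000_000.0
large_pages_nps = 1_060_000.0
one_thread_nps = 1_000_000.0
twenty_thread_nps = 19_000_000.0

gain = percent_gain(standard_pages_nps, large_pages_nps)
efficiency = scaling_efficiency(one_thread_nps, twenty_thread_nps, 20)
print(f"Large Pages gain: {gain:.1f}%")
print(f"20-thread scaling efficiency: {efficiency:.2f}")
```

To compare your own hardware, replace the placeholder values with the node speeds reported by your benchmark runs.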

24-core dual AMD Opteron 6174 at 2.3 GHz
This is a 24-core dual Opteron box comprising 4 NUMA nodes of 6 cores each; each 12-core Opteron 6174 processor has 2 NUMA nodes.
A two-minute benchmark was run on a number of positions using 1, 6, 12 and 24 threads with hash memory set to 2048 MB. The impact of Large Pages and NUMA-awareness on the measured average node speed was as follows:
The AMD CPU benefits more clearly from Large Pages and NUMA support. With 24 threads the performance benefit provided by NUMA-awareness and Large Pages is close to 25%. As above, the scaling of the performance remains nearly perfect up to the maximum number of threads.
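
On a machine laid out like the Opteron box above (NUMA nodes of 6 cores each), the NUMA Offset advice from the section on running multiple instances could be scripted roughly as follows. This is only a sketch under stated assumptions: the helper functions are hypothetical, it assumes NUMA Offset is the index of the first NUMA node an instance uses (consistent with the two-instance example, where offset 1 selects the second node), and it assumes the engine accepts its parameters through the usual UCI "setoption" command with the option names Threads, Hash and NUMA Offset.

```python
import math


def numa_offsets(instances: int, threads_per_instance: int,
                 cores_per_node: int) -> list[int]:
    """One NUMA Offset per instance so that instances do not share nodes.

    Assumption: the offset is the index of the first NUMA node an
    instance should use, and each instance fills whole nodes.
    """
    nodes_per_instance = math.ceil(threads_per_instance / cores_per_node)
    return [i * nodes_per_instance for i in range(instances)]


def uci_setup_commands(threads: int, hash_mb: int, numa_offset: int) -> list[str]:
    """UCI commands a GUI or script would send to one engine instance."""
    return [
        f"setoption name Threads value {threads}",
        f"setoption name Hash value {hash_mb}",
        f"setoption name NUMA Offset value {numa_offset}",
    ]


# Two instances with 6 threads each on hardware with 6-core NUMA nodes.
for instance, offset in enumerate(numa_offsets(2, 6, 6), start=1):
    print(f"Instance {instance}:")
    for command in uci_setup_commands(6, 2048, offset):
        print("  " + command)
```

Check the NUMA Offset configuration page for the exact meaning of the parameter before relying on the offsets this sketch computes for instances that span more than one node.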