intro_hugepages

intro_hugepages - Introduction to using huge pages

IMPLEMENTATION

Cray Linux Environment (CLE)

DESCRIPTION

Huge pages are virtual memory pages which are bigger than the default base page size of 4Kbytes. Huge pages can improve memory performance for common access patterns on large data sets. Huge pages also increase the maximum size of data and text in a program accessible by the high speed network. Access to huge pages is provided through a virtual file system called hugetlbfs. Every file on this file system is backed by huge pages and is directly accessed with mmap() or read().

The libhugetlbfs library allows an application to use huge pages more easily than it could by directly accessing the hugetlbfs filesystem. A user may use libhugetlbfs to back application text and data segments.

For definitions of terms used in this man page, see Terms in the NOTES section.

Module Support

Module files set the necessary link options and run time environment variables to facilitate the usage of the huge page size indicated by the module name.

Gemini systems: craype-hugepages128K, craype-hugepages512K, craype-hugepages2M, craype-hugepages8M, craype-hugepages16M, craype-hugepages64M.

Aries systems: craype-hugepages2M, craype-hugepages4M, craype-hugepages8M, craype-hugepages16M, craype-hugepages32M, craype-hugepages64M, craype-hugepages128M, craype-hugepages256M, craype-hugepages512M, craype-hugepages1G, and craype-hugepages2G.

To compile a Unified Parallel C application that uses 2M huge pages:

module load PrgEnv-cray
module load craype-hugepages2M
cc -h upc -c array_upc.c
cc -h upc -o array_upc.x array_upc.o

To see the link options and run time environment variables set by these modules:

module show module_name

Note that the value of HUGETLB_DEFAULT_PAGE_SIZE varies between craype-hugepages modules. Also note that the name of the HUGETLB<size>_POST_LINK_OPTS variable varies between modules, but its value is the same.

setenv HUGETLB_DEFAULT_PAGE_SIZE <size>
setenv HUGETLB_MORECORE yes
setenv HUGETLB_ELFMAP W
setenv HUGETLB_FORCE_ELFMAP yes+
setenv HUGETLB<size>_POST_LINK_OPTS "-Wl,--whole-archive,-lhugetlbfs,--no-whole-archive -Wl,-Ttext-segment=address,-zmax-page-size=size"

The HUGETLB<size>_POST_LINK_OPTS value is relevant to the creation of the executable, while the others are run time environment variables. A user may choose to run an application with a different craype-hugepages module than was used at compile and link time. To make the most efficient use of available memory, use the smallest huge page size necessary for the application.

The link options -Wl,-Ttext-segment=address,-zmax-page-size=size enforce the alignment and starting addresses of segments so that there are separate read-execute (text) and read-write (data and bss) segments for all page sizes up to the maximum of 64M for Gemini and 512M for Aries. This causes libhugetlbfs to avoid overlapping read-execute text with read-write data/bss on huge pages, which would cause a segment to be both writable and executable.
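As a rough illustration of the cost of this alignment, the sketch below estimates the padding between the end of the text and the start of a separate read-write segment. It is a simplified model, not the linker's actual layout algorithm: it assumes the text starts at an address equal to the alignment and that the read-write segment starts at the next alignment boundary. The function name alignment_padding and the 100M text size are hypothetical.

```shell
# Simplified model, not the linker's actual algorithm: the read-write segment
# is assumed to start at the first max-page-size boundary after the text ends.
# Arguments: alignment and text size, both in Mbytes. Prints padding in Mbytes.
alignment_padding() (
    align=$1
    text=$2
    text_end=$(( align + text ))      # text assumed to start at address = align
    data_start=$(( (text_end + align - 1) / align * align ))
    echo $(( data_start - text_end ))
)

alignment_padding 512 100    # 512M alignment, 100M of text: prints 412
alignment_padding 8 100      # 8M alignment, 100M of text: prints 4
```

Under these assumptions, a smaller alignment leaves far less unused address space between segments.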

Note

The current versions of all the hugepages modules use a 512M alignment and max-page-size so that a statically linked executable may run using a variety of HUGETLB_DEFAULT_PAGE_SIZEs without having to relink; however, this may not be appropriate for certain situations.

Specifically, suppose the statically linked application allocates a large amount of static data (greater than 2GiB) in the form of initialized arrays and the 32M hugepage module sets -Ttext-segment=0x20000000,-zmax-page-size=0x20000000 (512M alignment). The combined static memory requirement (text+data), plus the memory padding that is added by the linker for 512M alignment, may cause relocation addresses to exceed 4GiB. If this occurs, the user will see “relocation truncated to fit” errors. To remedy this, select the smallest craype-hugepages module needed by the job, and then reset the alignment by resetting the HUGETLB<size>_POST_LINK_OPTS environment variable before linking. For example, if an 8M page size is sufficiently large for the application, load the craype-hugepages8M module and then set the text-segment and max-page-size to 8M before compiling and linking:

module load craype-hugepages8M
setenv HUGETLB8M_POST_LINK_OPTS "-Wl,--whole-archive,-lhugetlbfs,--no-whole-archive -Wl,-Ttext-segment=0x800000,-zmax-page-size=0x800000"

--------------------------------------------------------------
Page Size  text-segment/max-page-size settings
--------------------------------------------------------------
2M         -Ttext-segment=0x200000,-zmax-page-size=0x200000
4M         -Ttext-segment=0x400000,-zmax-page-size=0x400000
8M         -Ttext-segment=0x800000,-zmax-page-size=0x800000
16M        -Ttext-segment=0x1000000,-zmax-page-size=0x1000000
--------------------------------------------------------------
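
Both values in each row of the table above are the page size in bytes, written in hexadecimal. The following sketch derives the settings for other page sizes; the function name hugepage_link_opts is hypothetical and is not provided by the craype-hugepages modules.

```shell
# Hypothetical helper, not part of the craype-hugepages modules:
# print the text-segment/max-page-size settings for a page size in Mbytes.
hugepage_link_opts() (
    bytes=$(( $1 * 1024 * 1024 ))     # page size in bytes
    hex=$(printf '0x%x' "$bytes")     # same value in hexadecimal
    printf '%s\n' "-Ttext-segment=${hex},-zmax-page-size=${hex}"
)

hugepage_link_opts 8     # prints -Ttext-segment=0x800000,-zmax-page-size=0x800000
hugepage_link_opts 16    # prints -Ttext-segment=0x1000000,-zmax-page-size=0x1000000
```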

Note

The run time environment variables set by these modules are relevant on compute nodes, not on service nodes. If the user is running the application on a service node instead of a compute node, they should unload the hugepage module before execution.

When to Use Huge Pages

  • For SHMEM applications, map the static data and/or private heap onto huge pages.

  • For applications written in Unified Parallel C, Coarray Fortran, and other languages based on the PGAS programming model, map the static data and/or private heap onto huge pages.

  • For MPI applications, map the static data and/or heap onto huge pages.

  • For an application that uses shared memory that must be concurrently registered with the high speed network drivers for remote communication.

  • For an application doing heavy I/O.

  • To improve memory performance for common access patterns on large data sets.

When to Avoid Using Huge Pages

Applications sometimes consist of many steering programs in addition to the core application. Applying huge page behavior to all processes would not provide any benefit and would consume huge pages that would otherwise benefit the core application. See HUGETLB_RESTRICT_EXE described in ENVIRONMENT VARIABLES.

ENVIRONMENT VARIABLES

The following variables affect huge pages:

XT_SYMMETRIC_HEAP_SIZE

The symmetric heap always uses huge pages, regardless of whether or not a hugepage module is loaded.

For PGAS applications using UPC or Coarray Fortran, if XT_SYMMETRIC_HEAP_SIZE is not set, the default symmetric heap per PE for a PGAS application is 64M. Therefore, if a Coarray Fortran application requires 1000M per PE and the user does not set XT_SYMMETRIC_HEAP_SIZE, one of the coarray allocate statements will fail to find enough memory. The symmetric heap is reserved at program launch and its size does not change.

For PGAS applications using SHMEM, either XT_SYMMETRIC_HEAP_SIZE or SMA_SYMMETRIC_SIZE should be used to set the size of the symmetric heap. Cray XC series systems support a growable symmetric heap, so if XT_SYMMETRIC_HEAP_SIZE or SMA_SYMMETRIC_SIZE is not set, the symmetric heap grows dynamically as needed to a maximum of 2GB per PE. (Cray XE and Cray XK series systems do not support growable symmetric heap and have no default symmetric heap value.)

The aprun -m option does not change the size of the symmetric heap allocated by UPC or Fortran applications upon startup. The -m option refers to the total amount of memory available to a PE, which includes all memory and not just the symmetric heap. Use the -m option only if necessary.

The following variables affect libhugetlbfs:

HUGETLB_DEFAULT_PAGE_SIZE

Override the system default huge page size for all uses except the hugetlbfs-backed symmetric heap used by SHMEM and PGAS programming models. The default huge page size is 2M.

Additionally supported on Gemini systems: 128K, 512K, 8M, 16M, 64M.

Additionally supported on Aries systems: 4M, 8M, 16M, 32M, 64M, 128M, 256M, 512M, 1GB, 2GB.

HUGETLB_ELFMAP

Set to W to map the read-write sections (writable static data, bss) onto huge pages.

Set to R to map the read-execute segment (text, read-only static data) onto huge pages.

Set to RW to map both onto huge pages.

HUGETLB_FORCE_ELFMAP

If set to yes, and LD_PRELOAD contains libhugetlbfs.so, then libhugetlbfs will load all parts of the text, data and bss that fall on huge page boundaries onto huge pages. The parts of the text and data and bss sections that do not fall into whole huge pages (e.g. the “edges”) are left on 4K pages.

If set to yes+ (Cray extension), then all of the text and/or data and bss (per direction of HUGETLB_ELFMAP) will be mapped onto huge pages, including the “edges”. Note that the Cray extension works for both static and dynamic executables and does not depend on LD_PRELOAD having libhugetlbfs.so in it.

If there is an overlap of the read-execute and the read-write sections, then a new mapping for the overlap will be made with combined permissions (i.e. RWX). Using the link option specified in the craype-hugepages modules avoids this overlap.

HUGETLB_MORECORE

Set to yes to map the heap (also relates to the private heap in SHMEM applications) onto huge pages. Enables malloc() to use memory backed by huge pages automatically.

HUGETLB_RESTRICT_EXE=exe1[:exe2:exe3:…]

Selectively enables libhugetlbfs to map only the named executables onto huge pages. The executables are named by the last component of the pathname; use a colon to separate the names of multiple executables. For example, if your executable is /lus/home/user/bin/mytest.x, specify:

HUGETLB_RESTRICT_EXE=mytest.x
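
The naming rule can be sketched as follows. This illustrates only how a pathname relates to the colon-separated list; it is not libhugetlbfs code, and the function name restrict_exe_matches and the executable name solver.x are hypothetical.

```shell
# Sketch of the naming rule only; this is not libhugetlbfs code.
# Succeeds if the last pathname component of $1 appears in the
# colon-separated list $2 (the format used by HUGETLB_RESTRICT_EXE).
restrict_exe_matches() (
    name=${1##*/}                 # last component of the pathname
    case ":$2:" in
        *":$name:"*) exit 0 ;;    # whole-name match between colons
        *)           exit 1 ;;
    esac
)

if restrict_exe_matches /lus/home/user/bin/mytest.x "mytest.x:solver.x"; then
    echo "mytest.x would be mapped onto huge pages"
fi
```

Because only whole names between colons match, a steering script such as a launcher wrapper is left on base pages unless it is listed explicitly.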

HUGETLB_VERBOSE

The range of the value is from 0 to 99. Setting to a nonzero number causes libhugetlbfs to print out informational messages. A value of 99 prints out all available information.

NOTES

Gemini NIC

There are two hardware mechanisms used by the Gemini NIC to translate virtual to physical memory references on the Cray XE and Cray XK systems. GNI and DMAPP are low level libraries which provide communication services to user level software and implement a logically shared, distributed memory programming model.

  • GART is a feature of many AMD64 processors that allows the system to access virtually contiguous user pages that are backed by non-contiguous physical pages. The GART aggregates the Linux standard 4 Kbyte pages into larger virtually contiguous memory regions. The contiguous pages exist in a portion of the physical address space known as the Graphics Aperture. The GART’s graphics aperture size is 2GiB. Therefore, the total memory which can be referenced through GART is limited to 2GiB per compute node.

  • The Memory Relocation Table (MRT) on the Gemini NIC maps the memory references contained in incoming network packets to physical memory on the local node. Memory references through the MRT map to a much larger address range than they do through the GART. Each NIC has its own MRT. MRT page sizes range from 128 K to 1 Gbyte, but all the entries on a given node must have the same page size. The MRT entries are created by kGNI in response to requests from the application, usually the uGNI library. There are 16K MRT entries. The default MRT page size is 2Mbytes, which maps to 32Gbytes (16K*2M). HUGETLB_DEFAULT_PAGE_SIZE sets the MRT page size.
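
The 32Gbyte figure generalizes to the other Gemini page sizes as entries times page size. The sketch below repeats that arithmetic using the 16K entry count stated above; the function name mrt_reach_gbytes is hypothetical.

```shell
# MRT reach in Gbytes = number of MRT entries x MRT page size.
# The 16K entry count is taken from the text; $1 is the page size in Kbytes.
mrt_reach_gbytes() (
    entries=$(( 16 * 1024 ))
    echo $(( entries * $1 / 1024 / 1024 ))
)

mrt_reach_gbytes 128      # 128K pages: prints 2
mrt_reach_gbytes 2048     # 2M pages (default): prints 32
mrt_reach_gbytes 65536    # 64M pages: prints 1024
```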

Depending on the size of the allocated memory region and other default behavior, the memory registration function (of GNI/DMAPP) asks the kernel to create either GART entries on the AMD processor or, in the case of huge pages, entries in the Memory Relocation Table (MRT) on the NIC, to span the allocated memory region. User virtual memory that is to be read or written across nodes generally must first be registered on the node, with its physical location(s) and extent(s) loaded into the Gemini Memory Descriptor Table (MDD) and either the Opteron GART or the Gemini MRT.

Required GART Address Translation: Lustre I/O uses the GART. The Lustre Network Driver (LND) uses 1 Mbyte buffers, constructed out of smaller pages using the GART. DVS uses the GART.

Required MRT Address Translation: User virtual memory mapped by huge pages (via a hugetlbfs file system) will be registered in the MRT.

DMAPP mmaps the symmetric heap directly, regardless of its size, to the hugetlbfs file system if it is mounted, which it normally is on Cray XE systems. So, any application using DMAPP (e.g. SHMEM, PGAS programming models) will use MRT for memory references within the symmetric heap. The symmetric heap always uses huge pages, regardless of whether a hugepages module is loaded. Note that the libhugetlbfs library is not used in this case. The value of HUGETLB_DEFAULT_PAGE_SIZE determines the page size for the symmetric heap but the other HUGETLB environment variables have no effect.

When an application’s memory requirements (specifically, memory that is mapped through the HSN) exceed the GART aperture size (2GiB) on a single node, the application must be linked with the libhugetlbfs library to use the larger address range available with huge pages.

Default Behavior If Not Using craype-hugepages Modules: If there is no craype-hugepages module loaded and none of the HUGETLB environment variables are set, the symmetric heap (in the case of SHMEM or PGAS programming models) is mapped onto huge pages by default, but most other memory is mapped onto base pages, which use the GART. Considering the 2GiB per node GART limit, which is shared among the application PEs on a node, Lustre, and DVS, it is advisable to map the static data section and private heap onto huge pages. This can be selectively changed by using the proper link options and setting the environment variables HUGETLB_ELFMAP=W and HUGETLB_MORECORE=yes.

Aries NIC

In Cray systems which have the Aries NIC, the Aries IO Memory Management Unit (IOMMU) provides hardware support for memory protection and address translation. The Aries IOMMU uses an entirely different memory translation mechanism than Gemini uses:

  • The IOMMU is divided into 16 translation context registers (TCRs). Each translation context (TC) supports a single page size. The TCRs can independently address different page sizes and present that to the network as a contiguous memory domain. The TCR entries are used to set and clear the page table entries (PTEs) used by GNI. PTE entries are cached in Aries NIC memory in a page table. Up to 512 PTEs can be used by applications. 512MiB (largest hugepage size) x 512 PTEs = 256GiB of addressable memory per node on Aries systems.

Other Notes on Memory Usage

Huge pages benefit applications which have a huge working set size (hundreds of Mbytes or many Gbytes and above) since this would require many virtual to physical address translations if using the default 4K pages. By using huge pages, the number of required address translations is decreased which benefits application performance by removing the wait time to fill up the TLB caches with translation data. Larger pages increase memory reach but may also exhaust available memory quicker. Thus, the optimal page size may vary from application to application.
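
To give a sense of scale, the sketch below counts the pages, and hence the distinct address translations, needed to cover a working set at two page sizes. The 1 Gbyte working set is an arbitrary example value and the function name pages_needed is hypothetical.

```shell
# Number of pages needed to cover a working set, rounded up.
# $1 is the working set size and $2 the page size, both in Kbytes.
pages_needed() (
    echo $(( ($1 + $2 - 1) / $2 ))
)

pages_needed 1048576 4       # 1 Gbyte with 4K pages: prints 262144
pages_needed 1048576 2048    # 1 Gbyte with 2M pages: prints 512
```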

With hugepages, an application is still limited by the total memory on the node. Also, memory fragmentation can decrease available memory. See ISSUES.

The /proc/meminfo file does not give a complete picture of huge page usage and is deprecated for this purpose.

Running Independent Software Vendor (ISV) Applications

To enable a dynamically linked executable that was not originally linked with libhugetlbfs to use Cray’s libhugetlbfs library at run time, you must first load a hugepages module and set the environment variable LD_PRELOAD so that it contains the libhugetlbfs pathname:

module load craype-hugepages2M
export LD_PRELOAD=/usr/lib64/libhugetlbfs.so

If an ISV application is already using LD_PRELOAD to set dynamic library dependencies, then use a white-space separated list. For example:

export LD_PRELOAD="/usr/lib64/libhugetlbfs.so /directory_name/lib.so"

To confirm the usage of hugepages, one may set HUGETLB_VERBOSE to 3 or higher:

export HUGETLB_VERBOSE=3

Statically linked executables can only use Cray’s libhugetlbfs if they are linked with it. Statically linked executables do not process LD_PRELOAD; therefore statically linked ISV applications must be relinked with libhugetlbfs. See Module Support for compiling and linking.
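
When a wrapper script cannot know whether LD_PRELOAD is already set, the white-space separated list can be built conditionally. The following is a minimal sketch; the function name with_hugetlbfs_preload is hypothetical, and the library path is the one used in the examples in this section.

```shell
# Hypothetical helper: print an LD_PRELOAD value that contains libhugetlbfs
# exactly once, keeping any libraries already listed in $1.
with_hugetlbfs_preload() (
    lib=/usr/lib64/libhugetlbfs.so
    case " $1 " in
        *" $lib "*) printf '%s\n' "$1" ;;        # already present, keep as is
        "  ")       printf '%s\n' "$lib" ;;      # nothing preloaded yet
        *)          printf '%s\n' "$lib $1" ;;   # prepend to the existing list
    esac
)

# Typical use in a job script (shown as a comment so that running this sketch
# does not alter the current environment):
#   export LD_PRELOAD="$(with_hugetlbfs_preload "${LD_PRELOAD:-}")"
with_hugetlbfs_preload "/directory_name/lib.so"
```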

The nm and ldd commands are useful for determining the contents and dynamic dependencies of executables.

Selective Mapping

ISV applications sometimes consist of scripts which run several executables, only some of which need to run with huge pages. The environment variable HUGETLB_RESTRICT_EXE enables the libhugetlbfs library to selectively map only the named executables onto huge pages.

Terms

Text Segment - contains the actual instructions to be executed.

Data Segment - contains the program’s data part, which is further divided into data, bss, and heap sections.

  • Data - global, static initialized data.

  • BSS - global, static uninitialized data.

  • Heap - dynamically allocated memory.

Stack - used for local variables, stack frames.

Symmetric Heap - contains dynamically allocated memory for a PE, which is kept in sync by the programming model (e.g. SHMEM) with that of another PE. See intro_shmem(3) man page for additional information. The private heap contains dynamically allocated memory which is specific to a PE.

GART - Graphics Aperture Relocation Table

HSN - High Speed Network

IOMMU - I/O Memory Management Unit

ISV - Independent Software Vendor

MRT - Memory Relocation Table

TLB - Translation Lookaside Buffer; the cache that the memory management hardware uses to translate virtual addresses into physical addresses.

ISSUES

Huge pages are a per-node resource, not a per-job resource, nor a per-process resource. There is no guarantee that the requested number of huge pages will be available on the compute nodes. If the memory pool becomes fragmented, which it can over time, the number of free blocks that are equal to or larger than the huge page size can decrease below the number needed to service the request, even though there may be enough free memory in the pool when summing free blocks of all sizes. For this reason, use huge page sizes no larger than needed.
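
The effect of fragmentation can be illustrated with a toy model. The free block sizes below are made-up values, the function name usable_hugepages is hypothetical, and the model assumes each free block is suitably aligned, which a real memory pool does not guarantee.

```shell
# Toy model with made-up numbers: $1 is the huge page size and the remaining
# arguments are free block sizes, all in Mbytes.
# Prints "<total free Mbytes> <number of huge pages that fit>".
usable_hugepages() (
    size=$1
    shift
    total=0
    pages=0
    for block in "$@"; do
        total=$(( total + block ))
        pages=$(( pages + block / size ))   # blocks smaller than size add 0
    done
    echo "$total $pages"
)

usable_hugepages 16 4 4 8 8 8 32    # prints "64 2": 64M free, only two 16M pages
usable_hugepages 2 4 4 8 8 8 32     # prints "64 32": all 64M usable as 2M pages
```

In this model the same pool satisfies a 64M request with 2M pages but not with 16M pages, which is why smaller huge page sizes are more robust against fragmentation.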

If the heap is mapped to huge pages (by setting HUGETLB_MORECORE to yes), a malloc call requires that the heap be extended, and there are not enough free blocks in the memory pool large enough to support the required number of huge pages, libhugetlbfs will issue the following WARNING message and then glibc will fall back to allocating base pages:

libhugetlbfs [nid000xx:xxxxx]: WARNING: New heap segment map at
0x10000000 failed: Cannot allocate memory

Since this is a warning, jobs are able to continue running after this message occurs. However, because the allocated base pages use GART entries and, as described in the NOTES section, there are a limited number of GART entries, future memory requests may fail altogether due to a lack of available GART entries.

With craype-hugepages modules loaded, it is no longer necessary to include -lhugetlbfs on the link line. Doing so will result in messages indicating multiple definitions, such as:

//usr/lib64/libhugetlbfs.a(elflink.o): In function
`__libhugetlbfs_do_remap_segments':

/usr/src/packages/BUILD/cray-libhugetlbfs-2.11/elflink.c:2012:
multiple definition of `__libhugetlbfs_do_remap_segments'

//usr/lib64/libhugetlbfs.a(elflink.o):/usr/src/packages/BUILD/
cray-libhugetlbfs-2.11/elflink.c:2012: first defined here

Adjust makefiles or build scripts accordingly.

SEE ALSO

hugeadm, cc, CC, ftn, aprun, intro_mpi, intro_shmem, libhugetlbfs

/usr/share/doc/libhugetlbfs/HOWTO