intro_hugepages
intro_hugepages - Introduction to using huge pages
IMPLEMENTATION
Cray Linux Environment (CLE)
DESCRIPTION
Huge pages are virtual memory pages which are bigger than the default
base page size of 4Kbytes. Huge pages can improve memory performance for
common access patterns on large data sets. Huge pages also increase the
maximum size of data and text in a program accessible by the high speed
network. Access to huge pages is provided through a virtual file system
called hugetlbfs. Every file on this file system is backed by huge
pages and is directly accessed with mmap() or read().
The libhugetlbfs library allows an application to use huge pages
more easily than it could by directly accessing the hugetlbfs
filesystem. A user may use libhugetlbfs to back application text and
data segments.
For definitions of terms used in this man page, see Terms.
Module Support
Module files set the necessary link options and run time environment variables to facilitate the usage of the huge page size indicated by the module name.
Gemini systems: craype-hugepages128K, craype-hugepages512K, craype-hugepages2M, craype-hugepages8M, craype-hugepages16M, craype-hugepages64M.
Aries systems: craype-hugepages2M, craype-hugepages4M, craype-hugepages8M, craype-hugepages16M, craype-hugepages32M, craype-hugepages64M, craype-hugepages128M, craype-hugepages256M, craype-hugepages512M, craype-hugepages1G, and craype-hugepages2G.
To compile a Unified Parallel C application that uses 2M huge pages:
module load PrgEnv-cray
module load craype-hugepages2M
cc -h upc -c array_upc.c
cc -h upc -o array_upc.x array_upc.o
To see the link options and run time environment variables set by these modules:
module show module_name
Note that the value of HUGETLB_DEFAULT_PAGE_SIZE varies between
craype-hugepages modules. Also note that the name of the
HUGETLB<size>_POST_LINK_OPTS variable varies between modules, but its
value is the same.
setenv HUGETLB_DEFAULT_PAGE_SIZE <size>
setenv HUGETLB_MORECORE yes
setenv HUGETLB_ELFMAP W
setenv HUGETLB_FORCE_ELFMAP yes+
setenv HUGETLB<size>_POST_LINK_OPTS "-Wl,--whole-archive,-lhugetlbfs,--no-whole-archive \
-Wl,-Ttext-segment=address,-zmax-page-size=size"
The HUGETLB<size>_POST_LINK_OPTS value is relevant to the creation of
the executable, while the others are run time environment variables. A
user may choose to run an application with a different craype-hugepages
module than was used at compile and link time. To make most efficient
use of available memory, use the smallest huge page size necessary for
the application.
The link options -Wl,-Ttext-segment=address,-zmax-page-size=size
enforce the alignment and starting addresses of segments so that there
are separate read-execute (text) and read-write (data and bss) segments
for all page sizes up to the maximum of 64M for Gemini and 512M for
Aries. This causes libhugetlbfs to avoid overlapping read-execute
text with read-write data/bss on huge pages, which would cause a segment
to be both writable and executable.
Note
The current versions of all the hugepages modules use a 512M alignment and max-page-size so that a statically linked executable may run using a variety of HUGETLB_DEFAULT_PAGE_SIZEs without having to relink; however, this may not be appropriate for certain situations.
Specifically, suppose the statically linked application allocates a large amount of static data (greater than 2GiB) in the form of initialized arrays and the 32M hugepage module sets -Ttext-segment=0x20000000,-zmax-page-size=0x20000000 (512M alignment). The combined static memory requirement (text+data), plus the memory padding that is added by the linker for 512M alignment, may cause relocation addresses to exceed 4GiB. If this occurs, the user will see “relocation truncated to fit” errors. To remedy this, select the smallest craype-hugepages module needed by the job, and then reset the alignment by resetting the HUGETLB<size>_POST_LINK_OPTS environment variable before linking. For example, if an 8M page size is sufficiently large for the application, load the craype-hugepages8M module and then set the text-segment and max-page-size to 8M before compiling and linking:
module load craype-hugepages8M
setenv HUGETLB8M_POST_LINK_OPTS "-Wl,--whole-archive,-lhugetlbfs,--no-whole-archive \
-Wl,-Ttext-segment=0x800000,-zmax-page-size=0x800000"
--------------------------------------------------------------
Page Size   text-segment/max-page-size settings
--------------------------------------------------------------
2M          -Ttext-segment=0x200000,-zmax-page-size=0x200000
4M          -Ttext-segment=0x400000,-zmax-page-size=0x400000
8M          -Ttext-segment=0x800000,-zmax-page-size=0x800000
16M         -Ttext-segment=0x1000000,-zmax-page-size=0x1000000
--------------------------------------------------------------
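The hexadecimal values in the table are simply the page size expressed in bytes. As a minimal sketch, the following POSIX shell functions derive the alignment options for a given page size. The function names pagesize_to_hex and alignment_opts are illustrative only and are not part of any Cray module.

```shell
# Convert a page size such as "8M" into the hexadecimal byte count used by
# -Ttext-segment and -zmax-page-size. Only K, M and G suffixes are handled.
pagesize_to_hex() {
    num=${1%[KMG]}
    case $1 in
        *K) bytes=$((num * 1024)) ;;
        *M) bytes=$((num * 1024 * 1024)) ;;
        *G) bytes=$((num * 1024 * 1024 * 1024)) ;;
        *)  echo "unsupported page size: $1" >&2; return 1 ;;
    esac
    printf '0x%x\n' "$bytes"
}

# Build the alignment part of the link options for one page size
# (hypothetical helper; the modules set these values themselves).
alignment_opts() {
    hex=$(pagesize_to_hex "$1") || return 1
    printf '%s\n' "-Wl,-Ttext-segment=$hex,-zmax-page-size=$hex"
}

# Reproduce the rows of the table above.
for size in 2M 4M 8M 16M; do
    echo "$size $(alignment_opts "$size")"
done
```

The same function applied to 512M yields 0x20000000, the default alignment described in the note above.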
Note
The run time environment variables set by these modules are relevant on compute nodes, not on service nodes. If the user is running the application on a service node instead of a compute node, they should unload the hugepage module before execution.
When to Use Huge Pages
For SHMEM applications, map the static data and/or private heap onto huge pages.
For applications written in Unified Parallel C, Coarray Fortran, and other languages based on the PGAS programming model, map the static data and/or private heap onto huge pages.
For MPI applications, map the static data and/or heap onto huge pages.
For an application that uses shared memory which must be concurrently registered with the high speed network drivers for remote communication.
For an application doing heavy I/O.
To improve memory performance for common access patterns on large data sets.
When to Avoid Using Huge Pages
Applications sometimes consist of many steering programs in addition to the core application. Applying huge page behavior to all processes would not provide any benefit and would consume huge pages that would otherwise benefit the core application. See HUGETLB_RESTRICT_EXE described in ENVIRONMENT VARIABLES.
ENVIRONMENT VARIABLES
The following variables affect huge pages:
- XT_SYMMETRIC_HEAP_SIZE
The symmetric heap always uses huge pages, regardless of whether or not a hugepage module is loaded.
For PGAS applications using UPC or Coarray Fortran, if XT_SYMMETRIC_HEAP_SIZE is not set, the default symmetric heap per PE for a PGAS application is 64M. Therefore, if a Coarray Fortran application requires 1000M per PE and the user does not set XT_SYMMETRIC_HEAP_SIZE, one of the coarray allocate statements will fail to find enough memory. The symmetric heap is reserved at program launch and its size does not change.
For PGAS applications using SHMEM, either XT_SYMMETRIC_HEAP_SIZE or SMA_SYMMETRIC_SIZE should be used to set the size of the symmetric heap. Cray XC series systems support a growable symmetric heap, so if XT_SYMMETRIC_HEAP_SIZE or SMA_SYMMETRIC_SIZE is not set, the symmetric heap grows dynamically as needed to a maximum of 2GB per PE. (Cray XE and Cray XK series systems do not support growable symmetric heap and have no default symmetric heap value.)
The aprun -m option does not change the size of the symmetric heap allocated by UPC or Fortran applications upon startup. The -m option refers to the total amount of memory available to a PE, which includes all memory and not just the symmetric heap. Use the -m option only if necessary.
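For the Coarray Fortran case described above, the symmetric heap has to be sized before launch. The following is a sketch only: the value format 1000M is assumed from the notation used in this section, and the executable name caf_app.x and the PE count are placeholders.

```shell
# Reserve a 1000M symmetric heap per PE before the program starts.
# (Value format assumed from the 64M/1000M notation used in this section.)
export XT_SYMMETRIC_HEAP_SIZE=1000M

# Hypothetical launch line; caf_app.x and the PE count are placeholders.
# aprun -n 64 ./caf_app.x

echo "XT_SYMMETRIC_HEAP_SIZE=$XT_SYMMETRIC_HEAP_SIZE"
```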
The following variables affect libhugetlbfs:
- HUGETLB_DEFAULT_PAGE_SIZE
Override the system default huge page size for all uses except the hugetlbfs-backed symmetric heap used by SHMEM and PGAS programming models. The default huge page size is 2M.
Additionally supported on Gemini systems: 128K, 512K, 8M, 16M, 64M.
Additionally supported on Aries systems: 4M, 8M, 16M, 32M, 64M, 128M, 256M, 512M, 1GB, 2GB.
- HUGETLB_ELFMAP
Set to W to map the read-write sections (writable static data, bss) onto huge pages.
Set to R to map the read-execute segment (text, read-only static data) onto huge pages.
Set to RW to map both onto huge pages.
- HUGETLB_FORCE_ELFMAP
If set to yes, and LD_PRELOAD contains libhugetlbfs.so, then libhugetlbfs will load all parts of the text, data and bss that fall on huge page boundaries onto huge pages. The parts of the text and data and bss sections that do not fall into whole huge pages (e.g. the “edges”) are left on 4K pages.
If set to yes+ (Cray extension), then all of the text and/or data and bss (per direction of HUGETLB_ELFMAP) will be mapped onto huge pages, including the “edges”. Note that the Cray extension works for both static and dynamic executables and does not depend on LD_PRELOAD having libhugetlbfs.so in it.
If there is an overlap of the read-execute and the read-write sections, then a new mapping for the overlap will be made with combined permissions (i.e. RWX). Using the link option specified in the craype-hugepages modules avoids this overlap.
- HUGETLB_MORECORE
Set to yes to map the heap (also relates to the private heap in SHMEM applications) onto huge pages. Enables malloc() to use memory backed by huge pages automatically.
- HUGETLB_RESTRICT_EXE=exe1[:exe2:exe3:…]
Selectively enables libhugetlbfs to map only the named executables onto huge pages. The executables are named by the last component of the pathname; use a colon to separate the names of multiple executables. For example, if your executable is /lus/home/user/bin/mytest.x, specify:
HUGETLB_RESTRICT_EXE=mytest.x
- HUGETLB_VERBOSE
The range of the value is from 0 to 99. Setting to a nonzero number causes libhugetlbfs to print out informational messages. A value of 99 prints out all available information.
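Because HUGETLB_RESTRICT_EXE takes only the last pathname component of each executable, it can be convenient to build the value from full paths. The following is a minimal sketch: the function name restrict_exe_list and the second path (solver.x) are hypothetical.

```shell
# Join the last pathname component of each argument with colons, which is
# the format HUGETLB_RESTRICT_EXE expects.
restrict_exe_list() {
    list=""
    for path in "$@"; do
        name=$(basename "$path")
        list="${list:+$list:}$name"
    done
    printf '%s\n' "$list"
}

# solver.x is a made-up second executable used only for illustration.
HUGETLB_RESTRICT_EXE=$(restrict_exe_list /lus/home/user/bin/mytest.x /lus/home/user/bin/solver.x)
export HUGETLB_RESTRICT_EXE
echo "$HUGETLB_RESTRICT_EXE"
```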
NOTES
Gemini NIC
There are two hardware mechanisms used by the Gemini NIC to translate virtual to physical memory references on the Cray XE and Cray XK systems. GNI and DMAPP are low level libraries which provide communication services to user level software and implement a logically shared, distributed memory programming model.
GART is a feature of many AMD64 processors that allows the system to access virtually contiguous user pages that are backed by non-contiguous physical pages. The GART aggregates the Linux standard 4 Kbyte pages into larger virtually contiguous memory regions. The contiguous pages exist in a portion of the physical address space known as the Graphics Aperture. The GART’s graphics aperture size is 2GiB. Therefore, the total memory which can be referenced through GART is limited to 2GiB per compute node.
The Memory Relocation Table (MRT) on the Gemini NIC maps the memory references contained in incoming network packets to physical memory on the local node. Memory references through the MRT map to a much larger address range than they do through the GART. Each NIC has its own MRT. MRT page sizes range from 128 K to 1 Gbyte, but all the entries on a given node must have the same page size. The MRT entries are created by kGNI in response to requests from the application, usually the uGNI library. There are 16K MRT entries. The default MRT page size is 2Mbytes, which maps to 32Gbytes (16K*2M). HUGETLB_DEFAULT_PAGE_SIZE sets the MRT page size.
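The 16K*2M arithmetic above generalizes to the other Gemini page sizes. As a rough sketch, assuming the reach is simply the number of MRT entries multiplied by the page size, the following shell function reproduces the 32 Gbyte figure. The name mrt_reach_gib is illustrative.

```shell
# Memory reach of the Gemini MRT: number of entries times the page size.
MRT_ENTRIES=$((16 * 1024))   # 16K entries, as stated above

# $1 is the MRT page size in KiB; the result is printed in GiB.
mrt_reach_gib() {
    echo $(( MRT_ENTRIES * $1 / 1024 / 1024 ))
}

# Page sizes offered by the Gemini craype-hugepages modules, in KiB.
for kib in 128 512 2048 8192 16384 65536; do
    echo "page size ${kib}K -> $(mrt_reach_gib "$kib") GiB"
done
```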
Depending on the size of the allocated memory region and other default behavior, the memory registration function (of GNI/DMAPP) asks the kernel to create either GART entries on the AMD processor or, in the case of huge pages, entries in the Memory Relocation Table (MRT) on the NIC, to span the allocated memory region. User virtual memory that is to be read or written across nodes generally must first be registered on the node, with its physical location(s) and extent(s) loaded into the Gemini Memory Descriptor Table (MDD) and either the Opteron GART or the Gemini MRT.
Required GART Address Translation: Lustre I/O uses the GART. The Lustre Network Driver (LND) uses 1 Mbyte buffers, constructed out of smaller pages using the GART. DVS uses the GART.
Required MRT Address Translation: User virtual memory mapped by huge
pages (via a hugetlbfs file system) will be registered in the MRT.
DMAPP mmaps the symmetric heap directly, regardless of its size, to
the hugetlbfs file system if it is mounted, which it normally is on
Cray XE systems. So, any application using DMAPP (e.g. SHMEM, PGAS
programming models) will use MRT for memory references within the
symmetric heap. The symmetric heap always uses huge pages, regardless of
whether a hugepages module is loaded. Note that the libhugetlbfs
library is not used in this case. The value of HUGETLB_DEFAULT_PAGE_SIZE
determines the page size for the symmetric heap but the other HUGETLB
environment variables have no effect.
When an application’s memory requirements (specifically memory which is
mapped through the HSN) exceed the GART aperture size (2GiB) on a
single node, the application must be linked with the libhugetlbfs
library to use the larger address range available with huge pages.
Default Behavior If Not Using craype-hugepages Modules: If there
is no craype-hugepages module loaded and if none of the HUGETLB
environment variables are set, by default the symmetric heap (in the
case of SHMEM or PGAS programming models) is mapped onto huge pages but
most other memory is mapped onto base pages, which use GART. Considering
the 2GiB GART per node limit, which is shared between application PEs on
a node, Lustre, and DVS, it is advisable to map the static data section
and private heap onto huge pages. This can be selectively changed by
using the proper link options and setting the environment variables
HUGETLB_ELFMAP=W and HUGETLB_MORECORE=yes.
Aries NIC
In Cray systems which have the Aries NIC, the Aries IO Memory Management Unit (IOMMU) provides hardware support for memory protection and address translation. The Aries IOMMU uses an entirely different memory translation mechanism than Gemini uses:
The IOMMU is divided into 16 translation context registers (TCRs). Each translation context (TC) supports a single page size. The TCRs can independently address different page sizes and present that to the network as a contiguous memory domain. The TCR entries are used to set and clear the page table entries (PTEs) used by GNI. PTE entries are cached in Aries NIC memory in a page table. Up to 512 PTEs can be used by applications. 512MiB (largest hugepage size) x 512 PTEs = 256GiB of addressable memory per node on Aries systems.
Other Notes on Memory Usage
Huge pages benefit applications which have a very large working set (hundreds of Mbytes, or many Gbytes and above), since such a working set would require many virtual to physical address translations if using the default 4K pages. Using huge pages decreases the number of required address translations, which benefits application performance by removing the wait time to fill the TLB caches with translation data. Larger pages increase memory reach but may also exhaust available memory more quickly. Thus, the optimal page size may vary from application to application.
With hugepages, an application is still limited by the total memory on the node. Also, memory fragmentation can decrease available memory. See ISSUES.
The /proc/meminfo file does not give a complete picture of huge page
usage and is deprecated for this purpose.
Running Independent Software Vendor (ISV) Applications
To enable a dynamically linked executable that was not originally
linked with libhugetlbfs to use Cray’s libhugetlbfs library at
runtime, you must first load a hugepages module and set the environment
variable LD_PRELOAD so that it contains the libhugetlbfs pathname:
module load craype-hugepages2M
export LD_PRELOAD=/usr/lib64/libhugetlbfs.so
If an ISV application is already using LD_PRELOAD to set dynamic library
dependencies, then use a white-space separated list. For example:
export LD_PRELOAD="/usr/lib64/libhugetlbfs.so /directory_name/lib.so"
To confirm the usage of hugepages, one may set HUGETLB_VERBOSE to 3 or
higher:
export HUGETLB_VERBOSE=3
Statically linked executables can only use Cray’s libhugetlbfs if they
are linked with it. Statically linked executables do not process
LD_PRELOAD; therefore statically linked ISV applications must be
relinked with libhugetlbfs. See Module Support for compiling and linking.
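When LD_PRELOAD may already be set by an ISV launch script, it is safer to extend the existing value than to overwrite it. The following is a sketch under that assumption. The function name preload_with is hypothetical, and the value is printed rather than exported so that the sketch itself does not preload anything.

```shell
# Prepend a library to a white-space separated LD_PRELOAD value unless it
# is already listed. $1 is the library, $2 the current LD_PRELOAD value.
preload_with() {
    lib=$1
    current=$2
    case " $current " in
        *" $lib "*) printf '%s\n' "$current" ;;
        *)          printf '%s\n' "${lib}${current:+ $current}" ;;
    esac
}

# Build the new value from whatever LD_PRELOAD currently holds.
new_preload=$(preload_with /usr/lib64/libhugetlbfs.so "${LD_PRELOAD:-}")
echo "LD_PRELOAD=\"$new_preload\""
```

Exporting the result (export LD_PRELOAD="$new_preload") would then apply it to subsequently launched dynamically linked executables.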
The nm and ldd commands are useful for determining the contents
and dynamic dependencies of executables.
Selective Mapping
ISV applications sometimes consist of scripts which run several
executables, only some of which need to run with huge pages. The
environment variable HUGETLB_RESTRICT_EXE enables the libhugetlbfs
library to selectively map only the named executables onto huge pages.
Terms
Text Segment - contains the actual instructions to be executed.
Data Segment - contains the program’s data part, which is further divided into data, bss, and heap sections.
Data - global, static initialized data.
BSS - global, static uninitialized data.
Heap - dynamically allocated memory.
Stack - used for local variables, stack frames.
Symmetric Heap - contains dynamically allocated memory for a PE, which
is kept in sync by the programming model (e.g. SHMEM) with that of
another PE. See the intro_shmem(3) man page for additional information.
The private heap contains dynamically allocated memory which is specific
to a PE.
GART - Graphics Aperture Relocation Table
HSN - High Speed Network
IOMMU - I/O Memory Management Unit
ISV - Independent Software Vendor
MRT - Memory Relocation Table
TLB - Translation Lookaside Buffer, the memory management hardware used to translate virtual addresses into physical addresses.
ISSUES
Huge pages are a per-node resource, not a per-job resource, nor a per-process resource. There is no guarantee that the requested number of huge pages will be available on the compute nodes. If the memory pool becomes fragmented, which it can over time, the number of free blocks that are equal to or larger than the huge page size can decrease below the number needed to service the request, even though there may be enough free memory in the pool when summing free blocks of all sizes. For this reason, use huge page sizes no larger than needed.
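Part of the cost of an oversized page is plain rounding: a region is backed by a whole number of huge pages, each of which needs its own contiguous free block. As an illustrative sketch (the function names and the 100 MiB + 4 KiB region are made up for the example), the following shell functions compute how many pages a region needs and how much of the last page goes unused.

```shell
# Number of huge pages needed to back a region. $1 is the region size and
# $2 the page size, both in KiB; the result is rounded up.
pages_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Memory allocated but left unused in the last page, in KiB.
rounding_waste_kib() {
    echo $(( $(pages_needed "$1" "$2") * $2 - $1 ))
}

region_kib=$((100 * 1024 + 4))     # hypothetical 100 MiB + 4 KiB region
for page_kib in 2048 8192 65536; do
    echo "page ${page_kib}K: $(pages_needed "$region_kib" "$page_kib") pages," \
         "$(rounding_waste_kib "$region_kib" "$page_kib") KiB unused"
done
```

Larger pages need fewer free blocks, but each block must be larger, which is harder to satisfy once the memory pool is fragmented.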
If the heap is mapped to huge pages (by setting HUGETLB_MORECORE to yes)
and if a malloc call requires that the heap be extended, and if
there are not enough free blocks in the memory pool large enough to
support the required number of huge pages, libhugetlbfs will issue
the following WARNING message and then glibc will fall back to
allocating base pages.
libhugetlbfs [nid000xx:xxxxx]: WARNING: New heap segment map at
0x10000000 failed: Cannot allocate memory
Since this is a warning, jobs are able to continue running after this message occurs. However, because the allocated base pages use GART entries and, as described in the NOTES section, the number of GART entries is limited, future memory requests may fail altogether due to a lack of available GART entries.
With craype-hugepages modules loaded, it is no longer necessary to
include -lhugetlbfs on the link line. Doing so will result in
messages indicating multiple definitions, such as:
//usr/lib64/libhugetlbfs.a(elflink.o): In function
`__libhugetlbfs_do_remap_segments':
/usr/src/packages/BUILD/cray-libhugetlbfs-2.11/elflink.c:2012:
multiple definition of `__libhugetlbfs_do_remap_segments'
//usr/lib64/libhugetlbfs.a(elflink.o):/usr/src/packages/BUILD/
cray-libhugetlbfs-2.11/elflink.c:2012: first defined here
Adjust makefiles or build scripts accordingly.
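One way a build script might make that adjustment is to filter the flag out of its link flags when a craype-hugepages module is loaded. The following is a sketch only: the function name strip_hugetlbfs and the sample LDFLAGS value are hypothetical.

```shell
# Remove every stand-alone -lhugetlbfs token from a list of link flags.
# $1 is the flag string; the filtered string is printed.
strip_hugetlbfs() {
    out=""
    for flag in $1; do
        [ "$flag" = "-lhugetlbfs" ] && continue
        out="${out:+$out }$flag"
    done
    printf '%s\n' "$out"
}

LDFLAGS="-L/opt/example/lib -lhugetlbfs -lm"   # hypothetical flags
LDFLAGS=$(strip_hugetlbfs "$LDFLAGS")
echo "$LDFLAGS"
```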
SEE ALSO
hugeadm, cc, CC, ftn, aprun, intro_mpi, intro_shmem, libhugetlbfs
/usr/share/doc/libhugetlbfs/HOWTO