hwpc

predefined hardware performance counter groups

Author:: Hewlett Packard Enterprise Development LP.
Copyright:: Copyright 2019,2021,2023-2024 Hewlett Packard Enterprise Development LP.
Manual section:: 5

DESCRIPTION

CrayPat supports the use of hardware counter groups, which specify predefined sets of hardware counters that can be instrumented for performance analysis experiments. The groups and event sets supported vary depending on the types of CPUs present on the compute nodes.

If the environment variable PAT_RT_PERFCTR specifies a hardware counter group that is not supported on the compute node hardware, the runtime environment attempts to interpret the value as an event name. This produces "invalid HW performance counter event name" error messages at runtime and can cause program execution to abort. If you see these errors, verify that the group or events named in PAT_RT_PERFCTR are supported on your system's compute node processors.

PAT_RT_PERFCTR cannot be set to the name of a derived metric, only to the names of counter events or groups of counter events. A derived metric appears in the report if all of the events required for that metric are collected.

To list the available events, execute the command papi_native_avail on a compute node using the appropriate workload manager launcher. For example:

$ srun papi_native_avail

To list the available PAPI derived events, execute the command papi_avail on a compute node:

$ srun papi_avail

For complete lists of the currently supported hardware counter events, organized by processor family, run the pat_help utility and select the counters topic.
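The check recommended above (confirming that the requested events exist before setting PAT_RT_PERFCTR) can be sketched as a small shell helper. This is only an illustration: check_events, events.txt, EVENT_A, and EVENT_B are hypothetical names, and the sketch assumes that the saved listing contains only events that are actually available on the compute node and that PAT_RT_PERFCTR is given a comma-separated list of event names.

```shell
# Hypothetical helper, not part of CrayPat or PAPI.
# Usage: check_events LISTING_FILE EVENT...
# Returns 0 only if every EVENT appears as a whole word in LISTING_FILE.
check_events() {
    listing=$1
    shift
    missing=0
    for ev in "$@"; do
        # -F: fixed string, -w: whole word, -q: quiet
        if ! grep -qwF -- "$ev" "$listing"; then
            echo "not listed: $ev" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Possible use on a system with Slurm (EVENT_A and EVENT_B are placeholders):
#   srun papi_native_avail > events.txt
#   check_events events.txt EVENT_A EVENT_B &&
#       export PAT_RT_PERFCTR=EVENT_A,EVENT_B
```

Because the helper only searches a saved listing, it cannot tell a counter group name from an event name; group names still have to be checked against the pat_help counters topic.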

NOTES

About Intel Processors

The availability of floating-point performance counter events on Intel processors is processor dependent, and even when available, they may not provide the desired level of accuracy.

The values reported for floating-point operations may be significantly larger than the number of operations actually specified in the program. There are two reasons for this. First, operations must be calculated from instruction counts that include speculatively issued instructions. Second, for the general case, more counts are required than can be supported by the physical hardware counters, and so PAPI multiplexing is used for the CrayPat default event set. If it is known that, for example, only single precision operations are of interest, then a smaller set of events can be used, which can be counted without multiplexing.

Note the following details:

1. Floating-point operations cannot be counted directly, but the various types of floating-point instructions can be counted, and so an operation count can be calculated with a weighted sum, where each summand is an instruction count times the number of operations resulting from one instruction of that type.

2. For a weighted sum for all types of floating-point operations, it would suffice to get combined counts for all instructions that produce the same number of operations. This would reduce the number of events that must be counted.

3. The reduction in the number of events described in point 2 is limited by the facts that subevents of FP_COMP_OPS_EXE and SIMD_FP_256 cannot be combined, and that at least one combined event, FP_COMP_OPS_EXE:SSE_FP_SCALAR_SINGLE:SSE_SCALAR_DOUBLE, does not produce correct results. With hyper-threading enabled, the number of physical counters available for FP events is 4, and this is not enough to accommodate the events required for the weighted sum. So either multiplexing must be used or multiple runs must be made to count subsets of these events. In order to give at least approximate values from a single run, the CrayPat default event set uses multiplexing.
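The weighted sum described above can be illustrated with shell arithmetic. This is a minimal sketch, not the calculation CrayPat performs: flop_estimate and its six arguments are hypothetical, the weights assume 128-bit SSE and 256-bit AVX vectors (2 or 4 double-precision and 4 or 8 single-precision elements per packed instruction), and the sample counts are invented.

```shell
# Hypothetical helper: estimate floating-point operations from
# per-type instruction counts.
# Arguments, in order: scalar single, scalar double,
#   128-bit packed double, 128-bit packed single,
#   256-bit packed double, 256-bit packed single.
flop_estimate() {
    scalar_s=$1; scalar_d=$2
    sse_pd=$3;   sse_ps=$4
    avx_pd=$5;   avx_ps=$6
    # Each summand is an instruction count times the number of
    # operations one instruction of that type produces.
    echo $(( scalar_s + scalar_d \
             + 2 * sse_pd + 4 * sse_ps \
             + 4 * avx_pd + 8 * avx_ps ))
}

# Invented sample counts, for illustration only.
flop_estimate 100 200 50 25 10 5    # prints 580
```

Under these assumed weights, the two scalar counts share weight 1 and the 128-bit single and 256-bit double counts share weight 4, which is why combined counts for same-weight instructions would suffice if the hardware allowed those events to be combined.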

These details were discovered independently by CrayPat developers experimenting with simple computational kernels, but have been reported by other groups as well.

About AMD Processors

Depending on how a job is scheduled by the workload manager, programs running on AMD processors may not gain access to all of the requested performance counter events. This happens when one or more of the PMU registers that monitor performance counter events are borrowed for other purposes, for example by the NMI watchdog. This reduces the number of registers available for performance counting.
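To see whether the NMI watchdog is active on a node, the Linux sysctl file /proc/sys/kernel/nmi_watchdog can be read (1 means enabled, 0 means disabled). The helper below is a sketch: nmi_watchdog_state is a hypothetical name, and it assumes a Linux compute node where that file is readable by the job.

```shell
# Hypothetical helper: report the NMI watchdog state.
# Optional argument: path to the sysctl file (defaults to the Linux path).
nmi_watchdog_state() {
    file=${1:-/proc/sys/kernel/nmi_watchdog}
    if [ ! -r "$file" ]; then
        echo unknown
        return 0
    fi
    case $(cat "$file") in
        0) echo disabled ;;
        1) echo enabled ;;
        *) echo unknown ;;
    esac
}

# Possible use on a compute node with Slurm:
#   srun cat /proc/sys/kernel/nmi_watchdog
```

An "enabled" result suggests that one PMU register may be unavailable to the job; changing the setting normally requires administrator privileges.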