CPE Cassini Performance Counters User Guide
About HPE Cray Cassini hardware performance counters
The CPE Cray Cassini Performance Counters User Guide provides details on how to use counters built into the Cassini Network Interface Card (NIC) to collect and analyze performance data. This guide includes procedures for using HPE Performance Analysis Tools running on HPE Cray Supercomputing EX systems.
See the HPE Performance Analysis Tools User Guide (S-8014) for additional and related information.
Scope and audience
This publication is intended for users who need to collect and analyze performance data using the HPE Cray Cassini hardware performance counters.
What is Cassini?
HPE Cassini performance counters are a set of hardware counters provided by the HPE Cassini NIC. The HPE Cassini NIC, together with the Slingshot switch, makes up HPE Slingshot 11, a high-performance interconnect designed for HPC and AI clusters. HPE Cassini performance counters count hardware activity relative to the HPE Cassini NIC on HPE Slingshot. The HPE Cassini NIC is a 200 Gbps NIC that provides fast Message Passing Interface (MPI) message rates, increased bandwidth, and unlimited scalability.
Features and Benefits
The HPE Cassini NIC:
Uses an Ethernet wire protocol, which enhances cluster usage capabilities.
Connects to the network using a four-lane Ethernet link that operates at 56 Gbps per lane with PAM-4 signaling or 28 Gbps per lane with NRZ signaling, providing 200 Gbps or 100 Gbps.
Connects to a CPU or GPU using a x16 PCIe Gen4 host interface operating at 16 GT/s in each direction.
Accelerates HPC and AI performance by:
Freeing up host memory bandwidth and CPU resources, reducing compute overhead for GPU-initiated communications, and
Integrating technology to detect key packet information.
Provides high performance for MPI messaging, for remote memory access (RMA) in partitioned global address space (PGAS) programming models, and for Ethernet traffic.
Exposes remote direct memory access (RDMA) and HPC optimized features to software using the Libfabric software interface.
HPE Cassini performance counters that are integrated with the Slingshot NIC allow you to:
Analyze application performance, including various counters for:
Debugging application performance problems
Determining whether the network is causing low or fluctuating performance between runs
Identifying details regarding the impact of any lost packets
Count anomalous or erroneous events
Fine-tune applications in more efficient and effective ways.
HPE Cassini Performance Counter Access and Events
HPE Cassini performance counters are accessed through the PAT_RT_PERFCTR
environment variable and other related environment variables. HPE Cassini performance counter events are collected by processor zero on each socket upon which an application is scheduled and executing. See the HPE Performance Analysis Tools User Guide (S-8014) for additional and related information.
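For example, a counter group can be selected at runtime for a program instrumented with the HPE Performance Analysis Tools; the launcher options and the executable name below are placeholders:
# Select a Cassini counter group for an instrumented program (a.out+pat is a placeholder)
export PAT_RT_PERFCTR=CxiPerfStats
srun -n 128 ./a.out+pat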
Traffic, Commands, Payload, and Event Handling
Cassini implements a set of blocks, including command queue blocks, that process commands into packets, track their state, execute packets, and manage completion. Each Cassini block used to implement a path collects counter data on its operation. Request and response traffic follow a distinct process involving endpoints and data exchanges across a high-speed network (HSN). Generally, endpoints initiate a data exchange across the HSN by generating packets. Packets travel across the HSN to the targeted destination endpoint. The destination endpoint acts on a request and returns a response packet after successfully receiving the request packet.
Specifically, commands, along with headers that describe inline payload data, are written to a command queue in memory; depending on whether NIC space is available, the command is then executed immediately or read from the queue later. The HPE Cassini NIC:
Reads command queue data in blocks of up to 256 bytes, and
Supports 1024 transmit command queues for sending data and 512 target command queues for posting receive buffers.
The NIC periodically writes back queue status information and delivers status details on events, such as errors. On the transmit side, the:
System transfers commands from the host to the NIC processor interface and then, to the command queue.
Command queue validates one or multiple commands, passing them to an outbound transfer engine (OXE).
OXE packetizes commands.
OXE sends packets to the link, if required.
Packet connection and tracking (PCT) block manages the state and matches responses to their original requests.
Counters that provide related performance information are associated with each of these systematic steps.
On the receiver side, the:
Inbound transfer engine (IXE) processes incoming packets.
Resource management unit (RMU) validates packets, determining target endpoints.
Packet buffer holds packet data.
System sends headers to the list processing engine (LPE) for matching. Note that:
a. Write requests and atomic memory operations (AMOs) are passed to a write DMA engine if a header is matched to a target buffer.
b. Read requests are passed to an outbound transfer engine and processed for responses.
c. Initiator and target completion queues are managed by an event engine.
Counters that provide application performance information are associated with each receiver step.
HPE Cassini NIC connection states and the retry handler (RH) are key components that assist in tracking network errors and in detecting and recovering lost packets in the network. The Cassini NIC tracks information on network error handling so that applications can determine whether performance variations are due to errors. The NIC tracks the state necessary to detect and recover from packet loss in the network. The system invokes the retry handler whenever packets are dropped or a negative acknowledgement (NACK) is returned by the target NIC. The retry handler maintains a set of counters, and these counters are accessed through /run/cxi/cxi<device>/. HPE Slingshot links protect against errors using link-level retry (LLR), and counters show how often it is used.
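For example, the retry handler counter files for a device can be inspected directly on a node; the device name cxi0 and the flat file layout are assumptions for illustration:
# List and print retry handler counters for the first Cassini device (cxi0 is illustrative)
ls /run/cxi/cxi0/
grep . /run/cxi/cxi0/* 2>/dev/null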
Packet, Flit, and Stall Counters
At points within the HPE Slingshot NIC, separate counters are provided for packets, flits, and stalls.
Flit Counters
With HPE Slingshot, data is transferred between devices in packets that contain up to 4,096 bytes of payload data (or up to 9,000 bytes for jumbo Ethernet packets). Within the NIC, data is moved around in 64-byte flits at a rate of one flit per cycle of a 1 GHz clock.
Stall Counters
Stalls, in most cases, are an indication of flow control back-pressure, limiting the rate at which flits can be forwarded across an interface. A stall counter increments each flit-time that a ready-to-forward flit is prevented by back-pressure from doing so. For example, if interface flit and stall counts increment at equal rates, then, on average, flits cross the interface at half the rate they would if no stalls occurred. Situations that lead to stalls include, for example, rate adaptation to a lower speed downstream interface and arbitration, at a point downstream, for a contended resource. A high ratio of stalls-to-packets or stalls-to-flits is an indication of possible congestion. Blocked counters are included at some arbiter inputs within the HPE Slingshot NIC interconnect. Typically, these blocked counters would increment whenever a flit is available at the input, but arbitration, instead, selects a different input.
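As an illustration of the stalls-to-flits ratio described above, the following sketch computes it from two hypothetical counter samples (the values are invented for the example):
# Hypothetical flit and stall counts sampled from the same interface
flits=200000000
stalls=50000000
# A result of 0.25 means the interface lost one flit-time to back-pressure for every four flits forwarded
awk -v f="$flits" -v s="$stalls" 'BEGIN { printf "stalls per flit: %.2f\n", s/f }'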
Key NIC performance counter groups
There are 12 unique Cassini counter groups:
Name | Counter Group
---|---|
CxiPerfStats | Traffic congestion counter group
CxiErrStats | Cassini network error counter group
CxiOpCommands | Cassini operation (commands) counter group
CxiOpPackets | Cassini operation (packets) counter group
CxiDmaEngine | Cassini DMA engine counter group
CxiWritesToHost | Cassini writes to host counters
CxiMessageMatchingPooled | Cassini message matching of pooled counters
CxiTranslationUnit | Cassini translation unit counters
CxiLatencyHist | Cassini latency histogram counters
CxiPctReqRespTracking | Cassini PCT request and response tracking counters
CxiLinkReliability | Cassini link reliability counters
CxiCongestion | Cassini congestion counters
The groups listed above (which are accessed by setting PAT_RT_PERFCTR to the name of the group) each combine a set of counters optimized to provide a specific type of performance information. Some of these groups comprise multiple counter events that provide useful metrics when grouped together; this subset of groups serves as a starting point for investigating application performance with Cassini. These groups provide information on the amount of traffic generated by the NIC, whether that traffic is being slowed down by address translation and back pressure from the host, and whether that traffic is causing congestion.
There are 1,427 unique Cassini counter events that can be accessed individually by setting PAT_RT_PERFCTR to the name of one or more events. See the cassini(5) man page for additional information.
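For example, either a group name or individual event names can be assigned to PAT_RT_PERFCTR before launching an instrumented program; the comma-separated form follows the conventional PAT_RT_PERFCTR usage, and the event names and launch command are illustrative:
# Collect an entire counter group...
export PAT_RT_PERFCTR=CxiErrStats
# ...or a comma-separated list of individual Cassini events
export PAT_RT_PERFCTR=hni_tx_paused_1,parbs_tarb_pi_posted_blocked_cnt
srun -n 64 ./a.out+pat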
CxiPerfStats - traffic congestion counter group
Use CxiPerfStats counters as a starting point for investigating application performance details. CxiPerfStats counters provide information on the amount of traffic generated by the NIC, whether that traffic is being slowed down by address translation and back pressure from the host, and whether that traffic is causing congestion.
Counter Name |
Description |
---|---|
|
Number of packets sent in traffic class |
are 0 for request and 1 for response. |
|
|
Number of packets received in traffic class |
classes are 0 for request and 1 for response. |
|
|
Number of cycles in which the transmit path is paused for traffic class |
|
|
this endpoint is supplying data faster than the network can deliver it. |
|
Divide by 1E9 to determine the proportion of time paused. |
|
|
Number of cycles in which the pause is applied on the receive path for traffic |
class |
|
that the network is supplying data faster than this endpoint can consume it. |
|
Divide by 1E9 to determine the proportion of time paused. |
|
|
Number of PCIe packets transferred using the posted path (for example, writes), |
|
and the number of cycles in which this path is blocked. Compute |
the ratio |
|
indicate back pressure from the host. This endpoint is likely to be the cause |
|
of congestion. |
|
|
Number of PCIe packets transferred using the non-posted path (for example, |
|
reads), and the number of cycles in which this path is blocked. Compute |
the ratio |
|
per host performance (high read latencies). This endpoint is likely to be |
|
injecting at a low rate. |
|
|
Number of messages matched on the priority list (or receive was posted |
before the message arrived). Four counters of which 0 is the default. These |
|
messages incur lower cost because data is written directly to the user buffer. |
|
|
Number of messages where payload data was delivered to a buffer on the overflow |
list because there was no match on the priority list. Four counters of which 0 |
|
is the default. These messages incur higher cost because data must be copied |
|
from the overflow buffer. |
|
Compute the ratio |
|
proportion of messages for which receives were posted in advance. |
|
|
Number of misses in the NIC translation cache. Four counters of which counter 0 |
counts misses on 4K pages and counter 1 counts misses on 2M pages by default. |
|
|
Number of times a tag was evicted from the NIC translation cache to make room |
for a new tag. |
|
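For example, the proportion of receives posted in advance can be estimated from the priority and overflow match counts; the values below are the mean lpe_net_match_priority_0 and lpe_net_match_overflow_0 values from the example summary report later in this guide:
# Mean priority and overflow match counts taken from the example summary report
priority=406687
overflow=88050
awk -v p="$priority" -v o="$overflow" 'BEGIN { printf "receives posted in advance: %.1f%%\n", 100*p/(p+o) }'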
CxiErrStats - Cassini network error counter group
Cassini provides a reliable transport for libfabric. If packets are dropped in the network, they are automatically retried. The retry handler is also invoked when there is resource exhaustion in the target NIC. Retransmission can have an impact on performance, especially in a tightly coupled application.
When low performance or variation in performance between runs is observed, use the CxiErrStats
counter group to determine if the network is a factor. The counters in this group are:
Counter Name |
Description |
---|---|
|
Number of response timeouts (or packet loss in the network). |
|
Retry handler is invoked. |
|
Number of resource exhaustion NACKs. Retry handler is invoked. |
|
|
|
|
|
Number of retries. |
|
Number of NACKs dropped. Retry handler is invoked. |
|
Number of uncorrected code words received on the switch to NIC link. High |
rates (multiple errors per second) indicate a poor quality link. |
|
|
Number of LLR replays. High rates (multiple per second) indicate that the LLR |
|
mechanism is providing protection on a poor quality link. |
CxiOpCommands - Cassini operation (commands) counter group
Cassini counters categorized as operational counters (relative to commands and packets) include:
Counter Name |
Description |
---|---|
|
Number of commands of each type: |
- Put, Get, rendezvous, atomics, small message, Ethernet, etc. |
|
- Target commands |
|
- Triggered operations |
|
Excluding the |
|
sending of packets. Specifically, these commands include: |
|
- |
|
- |
|
- |
|
- |
|
- |
|
- |
|
- |
|
- |
|
- |
|
- |
|
- |
|
|
Both CQ commands (FENCE and LCID) and target commands: |
- |
|
- |
|
- |
|
- |
|
- |
|
delete it, if found. |
|
- |
|
- |
|
|
Number of DMA commands. |
|
Number of immediate data commands. |
CxiOpPackets - Cassini operation (packets) counter group
Operational counters in this group include:
Counter Name |
Description |
---|---|
|
Number of packets sent in each of 12 size bins: |
- Small packets 27, 35, 64 bytes |
|
- 65-128, 256-511, 512-1023, 1024-2047, 2048-4095, 4096-8191, 8192-Max |
|
|
Number of packets received in each of 12 size bins. |
|
Number of packets sent in each traffic class; default classes are 0 and 1. |
|
Number of packets received in each traffic class; default classes are 0 and 1. |
CxiDmaEngine - Cassini DMA engine counter group
CxiDmaEngine counters provide DMA engine information. Expect high values in the stall counters when the load is high. Compare against the clock to determine the percentage of time stalled.
Counter Name |
Description |
---|---|
|
Number of flits sent by each MCU (configurable to count packets or messages). |
|
Number of cycles in which available bandwidth is not used. |
|
Number of requests to DMA write controller. |
|
Number of stalls due to no posted credits (cycles). |
|
Number of stalls due to no non-posted credits (cycles). |
|
Number of memory read TLPs (all source). |
|
Number of memory write TLPs (all source). |
CxiWritesToHost - Cassini writes to host counters
The counters in this group are:
Counter Name |
Description |
---|---|
|
Number of PCIe packets transferred using the posted path (for example, writes), |
|
and the number of cycles in which this path is blocked. Compute the ratio |
|
|
pressure from the host. This endpoint is likely to be the cause of congestion. |
|
|
Number of PCIe packets transferred using the non-posted path (for example, |
|
reads),and the number of cycles in which this path is blocked. Compute the |
ratio |
|
per host performance (high read latencies). This endpoint is likely to be |
|
injecting at a low rate. |
Example
In this example, note that:
The ratio calculation is cycles blocked per packet.
Cassini is being blocked for 20 cycles per 512-byte PCIe packet.
Size          Bandwidth (MB/s)
...
16384             21070.06
32768             21752.80
65536             22896.48
131072            23461.53
262144            23764.63
524288            23917.26
1048576           23985.48
2097152           24024.32
4194304           24044.25
8388608           23972.08
(16777216)*      (20321.26)*
(33554432)*      (16550.01)*
(67108864)*      (16043.86)*
Name                              Samples  Min  Mean       Max
hni_rx_paused_0                   4        0    0          0
hni_rx_paused_1                   4        0    396666167  158664669
hni_tx_paused_0                   4        0    0          0
hni_tx_paused_1                   4        0    163578075  654312300
parbs_tarb_pi_non_posted_pkts     4        0    54988998   201652661
parbs_tarb_pi_posted_blocked_cnt  4        0    941671854  (3766687419)*
parbs_tarb_pi_posted_pkts         4        372  46139681   184554415
* - Text in parentheses followed by an asterisk is red text in actual output.
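The 20-cycles-per-packet figure can be reproduced from the mean values in the output above with a short calculation:
# Mean posted-path blocked cycles and posted packets from the sample output above
blocked=941671854
packets=46139681
awk -v b="$blocked" -v p="$packets" 'BEGIN { printf "blocked cycles per posted PCIe packet: %.1f\n", b/p }'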
CxiMessageMatchingPooled - Cassini message matching of pooled counters
Counters in this group provide information on:
Inbound match counters, including tries, success, fails, longest lookup (in the number of entries), and number of match attempts calculated by averages
Successful matches subdivided by used once matches, locally managed matches, and persistent matches
Local operation counters per command (tries, success, fails, longest lookup, number of match attempts calculated by averages)
The number of wildcard searches (NID==NID_ANY, PID==PID_ANY, RANK==RANK_ANY)
Total rendezvous and offloaded rendezvous
Fetch information for AMO commands received
Counters in this group include:
Counter Name |
Description |
---|---|
|
Number of requests matched. |
|
Number of messages matched on the priority list (receive was posted before the message |
arrived). Four counters of which 0 is the default. These messages incur lower cost |
|
because data is written directly to the user buffer. |
|
|
Number of messages where payload data was delivered to a buffer on the overflow list |
because there was no match on the priority list. Four counters of which 0 is the |
|
default. These messages incur higher cost because data must be copied from the |
|
overflow buffer. |
|
Compute the ratio |
|
messages for which receives were posted in advance. |
|
|
Number of requests matched on request list (software endpoints). |
|
Number of append commands. |
|
Number of wildcard searches using |
|
Number of wildcard searches using |
|
Number of wildcard searches using |
|
Number of rendezvous puts received. |
|
Number of non-fetching |
|
Number of fetching |
Cassini message matching statistics in events include:
Event |
Description |
---|---|
Completion events, |
Number of match attempts on priority list |
Completion events, |
Number of match attempts on other lists |
Link events (normally suppressed), |
Number of list entries allocated to pool |
Link events (normally suppressed), |
Number of list entries available to pool |
CxiTranslationUnit - Cassini translation unit counters
The counters in this group are:
Counter Name |
Description |
---|---|
|
Number of cache misses by counter pool. |
|
Number of translation requests by client (events, writes, reads). |
|
|
|
Note that EE is the event engine, IXE is the inbound transfer engine (writes), and OXE is the outbound transfer engine (reads). |
|
Number of cache misses by client (events, writes, reads). |
|
|
|
|
|
Number of times a tag was evicted from the NIC translation cache to make room for a new tag. |
|
Number of cache hits observed on the Base Page Size. |
|
Number of cache hits observed on the Derivative 1 Page Size. |
|
ATS translation latency in preconfigured bins. |
Example: Two Counter Pools
NIC | MISS_0 | MISS_1 | REQ_IXE | MISS_IXE | REQ_OXE | MISS_OXE
---|---|---|---|---|---|---|
0 | 8744 | 0 | 6758637 | 791 | 27033796 | 7951
1 | 9973 | 0 | 6758548 | 1957 | 27033657 | 8014
2 | 8702 | 0 | 6758435 | 802 | 27033634 | 7898
3 | 9970 | 0 | 6758405 | 1927 | 27033608 | 8041
4 | 8735 | 0 | 6758440 | 762 | 27033637 | 7971
5 | 10053 | 0 | 6758404 | 2021 | 27033604 | 8030
6 | 8649 | 0 | 6758435 | 800 | 27033634 | 7847
7 | 9441 | 0 | 6758404 | 1405 | 27033604 | 8034
CxiLatencyHist - Cassini latency histogram counters
PCT counters included in this counter group provide the:
Number, including the maximum number, of active connects (SCT and TCT)
Number, including the maximum number, of active messages (SMT and MST)
Configurable histogram for response latency
Counters in this group include:
Counter Name | Description | Base | Bin Width
---|---|---|---|
 | Request or response latency histogram, 32 bins. | 256 | 128
 | Host access latency histogram, 16 bins. | 2,048 | 256
Example
The example shows two histograms: one for host accesses (reads) and one for network requests and responses. The width of the bins (in clock cycles or nanoseconds) is controlled by the base and the M parameter; the bin width is 2^M. These values are set up in advance. Note also that users cannot configure the histograms without root permission.
CxiPctReqRespTracking - Cassini PCT request and response tracking counters
The counters in this group are:
Counter Name |
Description |
---|---|
|
Number of ordered requests. |
|
Number of unordered requests. |
|
Number of responses received (all unordered). |
|
Number of open requests. |
|
Number of requests that did not complete because |
the TRS was full. |
|
|
Number of retries. |
|
Number of requests that timed out before a |
response was received. |
|
|
Number of close requests that timed out before |
a response was received. |
The retry handler provides information on its activity in /run/cxi/cxi<device>.
CxiLinkReliability - Cassini link reliability counters
The counters in this group are:
Counter Name |
Description |
---|---|
|
Number of codewords received with no errors. |
|
Number of corrected codewords. |
|
Number of uncorrected codewords. |
|
Number of replays transmitted. |
|
Number of replays detected. |
Example
NIC | PCS_GOOD_CW | PCS_CORRECTED_CW | PCS_UNCORRECTED_CW | LLR_RX_REPLAY
---|---|---|---|---|
0 | 127888452 | 12 | 0 | 0
1 | 127917527 | 1 | 0 | 0
2 | 126855057 | 1 | 0 | 0
3 | 126843486 | 30 | 0 | 0
4 | 127061355 | 5 | 0 | 0
5 | 127052551 | 15 | 0 | 0
6 | 127680056 | 38 | 0 | 0
7 | 127706017 | 1 | 0 | 0
- | - | - | - | -
CxiCongestion - Cassini congestion counters
The counters in this group are:
Counter Name |
Description |
---|---|
|
|
|
Number of cycles in which the network asserts pause, per traffic class; default classes for DMA traffic are |
0 and 1. The network asserting pause indicates the application is causing congestion. |
|
|
Number of cycles in which at least one PCP pause occurred. |
|
Number of cycles in which the NIC asserts pause, per traffic class; default classes for DMA traffic are 0 |
and 1. The NIC asserting pause indicates that either the rate of writes or the rate of translations is higher |
|
than the host can support, and the NIC is applying back-pressure to the network. |
Example
Name Samples Min Mean Max
hni_rx_paused_0 4 0 0 0
hni_rx_paused_1 4 0 396666167 (1586664669)*
hni_tx_paused_0 4 0 0 0
hni_tx_paused_1 4 0 163578075 654312300
parbs_tarb_pi_non_posted_pkts 4 0 54988998 201652661
parbs_tarb_pi_posted_blocked_cnt 4 0 941671854 (3766687419)*
parbs_tarb_pi_posted_pkts 4 372 46139681 184554415
* - Text in parentheses followed by an asterisk is red text in actual output.
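Because the pause counters count cycles of the 1 GHz clock, dividing a value by 1E9 converts it to seconds paused; for example, using the mean hni_tx_paused_1 value from the output above:
# Mean hni_tx_paused_1 cycles from the sample output above (1 GHz clock)
paused_cycles=163578075
awk -v c="$paused_cycles" 'BEGIN { printf "time paused: %.3f seconds\n", c/1e9 }'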
Gathering Cassini performance counters in HPE Cray MPI
The HPE Cray Message Passing Interface (MPI) is HPE's high performance MPI library, a solution based on standards maintained by the MPI Forum. MPI allows for parallel programming across a network of computer systems by using message passing techniques. In support of this functionality, HPE Cray MPI uses libfabric (OFI), an open source project developed under the OpenFabrics Alliance, as the default network interface library.
HPE Cray MPI offers a feature to automatically gather and analyze Cassini counters for any MPI application run on the HPE Slingshot 11 network. This easy-to-use feature provides important feedback regarding application performance and requires no source code or linking changes. To gather CXI counters for an MPI job, use the following environment variables at runtime:
MPICH_OFI_CXI_COUNTER_REPORT
MPICH_OFI_CXI_COUNTER_FILE
MPICH_OFI_CXI_COUNTER_REPORT_FILE
MPICH_OFI_CXI_COUNTER_VERBOSE
Important: Avoid using the PAT_RT_PERFCTR
and MPICH_OFI_CXI_COUNTER_*
environment variables at the same time for an instrumented MPI program. Doing so can result in unexpected runtime issues.
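For example, to obtain the CXI counter summary report for an uninstrumented MPI job (the launcher options and executable name are placeholders):
# Request the counter summary report at MPI_Finalize (level 2)
export MPICH_OFI_CXI_COUNTER_REPORT=2
srun -n 256 ./my_mpi_app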
Environment Variable |
Description |
---|---|
|
Determines if CXI counters are collected during the application and |
verbosity of the counter data report displayed during |
|
By default, HPE Cray MPI tracks |
|
application. The counter data is collected during |
|
|
|
counters over the duration of the application. No interference exists |
|
while the application is running. To obtain a valid counter report, run |
|
times must be at least a few seconds. A |
|
as an event on the Slingshot 11 - network (such as a link flap) that |
|
causes a packet to be re-issued. Network timeouts are identified from a |
|
NIC perspective. A single link flap may affect multiple NICs, depending |
|
on the network traffic at the time, so it is likely a single flap |
|
generates multiple timeouts. MPI queries all NICs in use for the |
|
application. Depending on the application traffic pattern and time, |
|
network timeouts may or may not affect the application performance |
|
metric. If network timeouts affecting the application occur, a one-line |
|
message is sent to |
|
timeouts are detected, this line is suppressed unless additional |
|
verbosity is requested (for example, MPICH Slingshot Network Summary: |
|
3 network timeouts). Along with the network timeout counters, MPI can |
|
also collect and display an additional set of Cassini counters by setting |
|
|
|
a default set of counters it collects for each application, which may be |
|
useful in debugging performance problems. This set can be overridden by |
|
specifying a file with a list of alternate counters. Values 0 through 5 |
|
are options for the |
|
0 - no Cassini counters collected; feature is disabled |
|
1 - network timeout counters collected, one-line display (default) |
|
2 - option 1 + CXI counters summary report displayed |
|
3 - option 2 + display counter data for any NIC that hit a network |
|
timeout |
|
4 - option 2 + display counter data for all NICs, if any network |
|
timeout occurred |
|
5 - option 2 + display counter data for all NICs |
|
Default: 1 |
|
|
Specifies a file containing an alternate list of Cassini counter names |
to collect. |
|
If this file is present, instead of collecting the default set of counters, |
|
MPI collects data for the counters specified in the file. Counter names |
|
must be listed one per line. For retry handler counters, prefix the |
|
counter name with rh: (for example, rh:nacks). |
|
|
|
|
|
The default network timeout counters are collected in addition to the |
|
file contents. Only applicable to Slingshot 11. |
|
CXI counters that MPI collects by default, if |
|
is not used to specify an alternate set, include: |
|
- |
|
- |
|
- |
|
- |
|
- |
|
- |
|
- |
|
- |
|
- |
|
- |
|
- |
|
- |
|
- |
|
- |
|
- |
|
- |
|
- |
|
- |
|
- |
|
- |
|
- |
|
- |
|
- |
|
- |
|
- |
|
- |
|
- |
|
- |
|
- |
|
- |
|
- |
|
- |
|
- |
|
- |
|
Default: Not set. |
|
|
Specifies an optional output filename prefix for the counter report. |
By default, the counter report is written to |
|
variable is set to a filename, the detailed counter data produced with |
|
|
|
node-specific files with filenames of |
|
|
|
appropriate permission to create these files. This option is useful |
|
if you are running on hundreds or thousands of nodes, where |
|
can become jumbled or truncated by the launcher. If not specified, |
|
|
|
|
If set to a non-zero value, this option enables more verbose output |
about the Cassini counters being collected. This counter is helpful for |
|
debugging and identifying which counters are being collected. |
|
Default: 0 |
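As a sketch of how these variables fit together, the following creates an alternate counter list and collects it for a job; the counter names are taken from the example summary report below, and the launcher command is a placeholder:
# Create an alternate counter list, one Cassini counter name per line
# (retry handler counters use the rh: prefix)
cat > my_cxi_counters.txt << 'EOF'
hni_tx_paused_1
parbs_tarb_pi_posted_blocked_cnt
rh:nacks
EOF
# Point HPE Cray MPI at the list and request the summary report at MPI_Finalize
export MPICH_OFI_CXI_COUNTER_FILE=$PWD/my_cxi_counters.txt
export MPICH_OFI_CXI_COUNTER_REPORT=2
srun -n 256 ./my_mpi_app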
Example Summary Report
If MPICH_OFI_CXI_COUNTER_REPORT is set to 2 or higher, the Cassini counter summary report displayed during MPI_Finalize includes minimum, mean, and maximum values for each selected counter, along with computed rates. If it is set to 3 or higher, NIC-specific detailed counter data also appears. Counters that record zero values are suppressed. Recognized values are between 0 and 5; the value controls the verbosity of the counter data displayed.
MPICH Slingshot Network Summary: 5 network timeouts
MPICH Slingshot CXI Counter Summary:
Counter Samples Min (/s) Mean (/s) Max (/s)
rh:spt_timeouts 4 1 0.1 1 0.1 2 0.2
atu_cache_hit_base_
page_size_0 508 131 13.1 199 19.9 325 32.5
atu_cache_hit_derivative1_
page_size_0 508 267137026 26713702.6 267258596 26725859.6 267382810 26738281.0
lpe_net_match_priority_0 508 312194 31219.4 406687 40668.7 491374 49137.4
lpe_net_match_overflow_0 508 3227 322.7 88050 8805.0 182417 18241.7
lpe_net_match_request_0 508 20 2.0 22 2.2 25 2.5
lpe_rndzv_puts_0 508 182182 18218.2 182182 18218.2 182182 18218.2
lpe_rndzv_puts_offloaded_0 508 182182 18218.2 182182 18218.2 182182 18218.2
hni_rx_paused_1 506 405 40.5 150356 15035.6 931673 93167.3
hni_tx_paused_0 508 56056 5605.6 11170254 1117025.4 22860933 2286093.3
hni_tx_paused_1 508 36133827 3613382.7 1217571763 121757176.3 4243430751 424343075.1
parbs_tarb_pi_posted_pkts 508 214956606 21495660.6 215103291 21510329.1 215275997 21527599.7
parbs_tarb_pi_posted_
blocked_cnt 508 3301944 330194.4 8966220 896622.0 16794528 1679452.8
parbs_tarb_pi_non_posted_
pkts 508 213465644 21346564.4 213465928 21346592.8 213466358 21346635.8
parbs_tarb_pi_non_posted_
blocked_cnt 508 260 26.0 3123 312.3 21554 2155.4
rh:nack_resource_busy 56 1 0.1 1 0.1 3 0.3
rh:nacks 60 1 0.1 90 9.0 430 43.0
rh:nack_sequence_error 56 14 1.4 95 9.5 427 42.7
All user accessible NIC performance counters
This table includes all Cassini counter events available to users and supported by the HPE performance analysis tools through the conventional PAT_RT_PERFCTR environment variable. Run papi_native_avail -i cray_cassini to view these counters; a perftools module or the papi module must be loaded. Use the papi_component_avail utility to verify that the cray_cassini PAPI component is activated for the compute platform.
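For example (the module name perftools-base is typical but may differ by system):
# Load a perftools (or papi) module so the PAPI utilities are on the PATH
module load perftools-base
# Verify that the cray_cassini PAPI component is active on this platform
papi_component_avail
# List the Cassini native events available to user tools
papi_native_avail -i cray_cassini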
Counter Name |
Description |
---|---|
|
ATS Page Request Services On-Demand Paging latency histogram. Four bins |
defined in |
|
|
ATS Translation latency histogram. Four bins defined in |
|
|
|
Number of cache hits observed on the Base Page Size. |
|
Number of cache hits observed on the Derivative 1 Page Size. |
|
Number of cache hits observed on the Derivative 2 Page Size. |
|
Number of cache misses by counter pool. |
|
Number of times a tag was evicted from the NIC translation cache to make room |
for a new tag. |
|
|
Number of cache misses by client (events, writes, reads). |
|
|
|
|
|
Number of translation requests by client (events, writes, reads). |
|
|
|
|
|
Number of client responses that were not |
|
NIC Page Request Interface On-Demand Paging latency histogram. |
|
NTA Translation latency histogram. |
|
Number of On-Demand Paging requests. |
|
Number of commands of each type: |
|
|
- Put, Get, rendezvous, atomics, small message, Ethernet, and so forth. |
|
- Target commands |
|
- Triggered operations |
|
|
Number of cycles the pool had a command ready to send to OXE and could not |
make progress because another OCUSET won arbitration. |
|
|
Number of successfully parsed CQ commands processed by the CQ block. |
|
Number of successfully parsed DMA commands. |
|
Number of successfully parsed immediate data commands. |
|
Number of successfully parsed CQ commands processed by the CQ block. |
|
The number of error-free low-latency operations received. This counter |
increments whenever counters |
|
|
|
counter is not necessarily equal to the sum of these other three counters, |
|
because counters |
|
increment for the same low-latency operation. |
|
|
The number of low-latency operations received for which the data of the |
operation was not accepted because the corresponding transmit queue was not |
|
empty, or the transmit queues low-latency data buffer was not empty when the |
|
low-latency operation was received. |
|
|
The number of low-latency operations received for which the data of the |
operation was not accepted because delivery of that data was split into |
|
multiple writes to the command issue image, with some or all of those writes |
|
containing less than 64 bytes of data. |
|
|
The number of low-latency operations for which the data of the operation was |
accepted. Barring an uncorrectable error being subsequently detected in the |
|
low-latency data buffer (error flag |
|
accepted is forwarded to the CQ transmit command parser. Each 64-byte or |
|
128-byte block of data written to the command issue image is counted as one |
|
low-latency operation. As each such block of data may contain more than one CQ |
|
command, there is not a one-to-one correspondence between this |
|
|
|
|
Number of PCIe command reads issued for target prefetch queues. Four counters, |
one each for reads of 64, 128, 192, or 256 bytes. |
|
|
Number of successfully parsed CQ commands processed by target command |
queues. All target commands are single flit. Incremented as target commands |
|
are sent to LPE. |
|
|
Number of PCIe command reads issued for TOU prefetch queues. Four counters, |
one each for reads of 64, 128, 192, or 256 bytes. |
|
|
Number of PCIe command reads issued for transmit prefetch queues. Four |
counters, one each for reads of 64, 128, 192, or 256 bytes. |
|
|
Cycles on which target prefetch buffers are empty and pool has read requests |
pending. Note that this counter does not increment on cycles for which commands |
|
in another pool are being processed. |
|
|
Cycles in which transmit prefetch buffers are empty and pool has read requests |
pending. Note that this counter does not increment on cycles for which commands |
|
in another pool are being processed. CQ maintains a count of the number of |
|
command read requests pending for each of the four counter pools. The prefetch |
|
unit increments these counts as a PCIe read is issued and decrement them as they |
|
complete. In cycles for which there is no command to process, the head of the CQ |
|
pipeline increments the |
|
has read requests pending. |
|
|
Number of prefetched address translations. |
|
Number of combining buffers written to an event queue. |
|
Number of event queue buffer switches not performed as soon as requested due |
to insufficient old event queue buffer free space immediately available to |
|
enqueue the buffer switch event. The buffer switch is performed when space |
|
becomes available in the old event queue buffer. Deferred event queue buffer |
|
switches could be an indication that software is waiting too long before |
|
scheduling buffer switch requests. One or more other events are likely to be |
|
dropped when an event queue buffer switch is deferred. Pooled counter, pool |
|
determined by |
|
|
Number of event queue buffer switches performed. Pooled counter, pool |
determined by |
|
|
Number of status updates written to an event queue. Status updates are written |
to report an event queue fill level exceeding a configured threshold and to |
|
report dropped events. Pooled counter, pool determined by |
|
the event queue descriptor. |
|
|
Number of times the event queue software state is updated using a fast path |
write. The rate of increase of this counter relative to |
|
provide an indication of how frequently software is servicing event queues. |
|
Pooled counter, pool determined by |
|
descriptor. This counter does not count every event queue software state fast |
|
path write that occurs if the processing of writes received in quick |
|
succession and targeting the same event queue becomes coalesced in the EEs |
|
event queue state pipeline. This counter does not increment for fast path |
|
writes that target a disabled event queue. |
|
|
Number of flow control state-change full events that were not enqueued because |
the event queue was full. This count also includes events dropped because the |
|
targeted event queue buffer was disabled. This situation should only occur when |
|
software incorrectly configures the event queue descriptor. Pooled counter, pool |
|
determined by |
|
|
Number of full events that were not enqueued because the event queue was full. |
This count includes all dropped full events not included in either |
|
|
|
The sum of all three counters is the total number of dropped full events. |
|
This count also includes events dropped because the targeted event queue |
|
buffer was disabled. This situation should only occur when software incorrectly |
|
configures the event queue descriptor. Pooled counter, pool determined by |
|
|
|
|
Number of full events, subject to an event queue space reservation, which were |
not enqueued because the event queue was full. This count also includes events |
|
dropped because the targeted event queue buffer was disabled. This situation |
|
should only occur when software incorrectly configures the event queue |
|
descriptor. Pooled counter, pool determined by |
|
queue descriptor. |
|
|
Number of full events enqueued to an event queue. This count does not include |
null events ( |
|
code. The EE inserts error-free null events into event queues as needed to |
|
maintain alignment requirements for real (non-null) events. Pooled counter, pool |
|
determined by |
|
|
Number of partially full combining buffers written to their event queue because |
too much time elapsed without additional events arriving to fill the buffer. |
|
Pooled counter, pool determined by |
|
descriptor. |
|
|
Number of partially full combining buffers written to an event queue. A partially |
full combining buffer has room for one or more additional events to be added to |
|
it at the time it is released to be written to the event queue. Therefore, the |
|
end of the buffer is padded with one or more null events. A partially full |
|
combining buffer can be written to the event queue because either the size of |
|
the next event to enqueue is such that, to maintain alignment requirements, the |
|
next event needs to start a new combining buffer, or because too much time has |
|
elapsed without additional events arriving to fill the combining buffer. Note, |
|
this counter does not count combining buffers containing null events (between |
|
real events at the start and end of the buffer) as partially full buffers. |
|
|
Number of packets discarded due to a timeout for each traffic class as indicated |
by |
|
|
Number of LLR replays. High rates (multiple per second) indicate that the LLR |
|
mechanism is providing protection on a poor quality link. |
|
Number of multicast packets with good FCS received by TC. Multicast is indicated |
when DMAC bit 40 is set and DMAC is not all 1s. |
|
|
Number of multicast packets with good FCS sent by TC. Multicast is indicated |
when DMAC bit 40 is set and DMAC is not all 1s. |
|
|
Number of pause frames received for each enabled PCP (as identified by the PEV |
field of the pause frame) when PFC pause is enabled. Reception of a standard |
|
pause frame causes all counts to increment. |
|
|
Number of pause frames sent where |
pause is enabled. Transmission of a standard pause frame causes all counts to |
|
increment. |
|
|
Number of corrected codewords. |
|
Number of errors in each FECL. |
|
Number of codewords received with no errors. |
|
Number of uncorrected code words received on the switch to NIC link. High rates |
(multiple errors per second) indicate a poor quality link. |
|
|
Number of packets discarded at the tail of the PFC FIFO. This can happen when |
the MFS setting is exceeded for that traffic class or the FIFO is nearly full. |
|
This equates to each time the PFC_FIFO_OFLW error flag sets for a particular TC. |
|
|
Number of packets received by traffic class |
are 0 for request and 1 for response. |
|
|
Number of packets sent by traffic class |
classes are 0 for request and 1 for response. |
|
|
Number of packets received in each of 12 size bins. |
|
Number of cycles in which the pause is applied on the receive path for traffic |
class |
|
the network is supplying data faster than this endpoint can consume it. Divide by |
|
1E9 to determine the proportion of time paused. |
|
|
Number of system clocks for which the Rx path of the corresponding traffic class |
is stalled due to lack of space in the IXE Packet Buffer. |
|
|
Number of packets sent in each of 12 size bins: |
- Small packets 27, 35, 64 bytes |
|
- 65-128, 256-511, 512-1023, 1024-2047, 2048-4095, 4096-8191, 8192-Max |
|
|
Number of cycles in which the transmit path is paused for traffic class |
default classes are 0 for request and 1 for response. Indicates that this endpoint |
|
is supplying data faster than the network can deliver it. Divide by 1E9 to |
|
determine the proportion of time paused. |
|
|
Number of requests to DMA write controller. |
|
Number of stalls due to no posted credits (cycles). |
|
Number of stalls due to no non-posted credits (cycles). |
|
Number of packets with ECN set. One counter for each of the four pools of PtlTEs. |
|
Number of packets without ECN set. One counter for each of the four pools of |
PtlTEs. |
|
|
Number of memory read TLPs (all source). |
|
Number of memory write TLPs (all source). |
|
Number of request packets with ECN set, by TC. |
|
Number of request packets without ECN set, by TC. |
|
Number of response packets with ECN set, by TC. |
|
Number of response packets without ECN set, by TC. |
|
Number of |
|
Number of |
of the four pools of PtlTEs. |
|
|
Number of cycles in which a PE Match Request Queue dequeue was blocked |
because a Ready Request Queue was full. One counter for each of the four PEs. |
|
|
Number of network requests LPE successfully matched to locally managed |
buffers. One counter for each of the four pools of PtlTEs. |
|
|
Number of messages where payload data was delivered to a buffer on the |
overflow list because there was no match on the priority list. Four counters of |
|
which 0 is the default. These messages incur higher cost because data must be |
|
copied from the overflow buffer. |
|
Compute the ratio |
|
messages for which receives were posted in advance. |
|
|
Number of messages matched on the priority list (that is, receive was posted |
before the message arrived). Four counters of which 0 is the default. These |
|
messages incur lower cost because data is written directly to the user buffer. |
|
|
Number of requests matched on request list (software endpoints). |
|
Number of requests matched. |
|
Number of network requests LPE successfully completed. One counter for each of |
the four pools of PtlTEs. |
|
|
Number of network requests LPE successfully matched to use-once buffers. One |
counter for each of the four pools of PtlTEs. |
|
|
Number of truncated packets. One counter for each of the four pools of PtlTEs. |
|
Number of rendezvous puts received. |
|
Number of Rendezvous Puts that LPE was able to offload. One counter for each of |
the four pools of PtlTEs. |
|
|
Number of wildcard searches using |
|
Number of wildcard searches using |
|
Number of wildcard searches using |
|
Number of |
pools of PtlTEs. |
|
|
Number of successful |
pools of PtlTEs. |
|
|
Number of Get and AMO packets that match on Overflow or Request, resulting |
in |
|
|
Number of AXI read requests received at the CMC channels. |
|
Number of AXI write requests received at the CMC channels. |
|
Number of CRMC AXI read requests. |
|
Number of CRMC AXI write requests. |
|
Number of CRMC read errors. |
|
Number of CRMC ring in multi bit errors. |
|
Number of CRMC ring out read requests. |
|
Number of CRMC ring in single bit errors. |
|
Number of CRMC ring out write requests. |
|
Number of CRMC write errors. |
|
Number of cycles in which available bandwidth is not used. |
|
Number of flits sent by each MCU (configurable to count packets or messages). |
|
Number of portal Get messages sent by traffic shaping class. The following OXE |
pkt types ( |
|
To be counted in the MCUOBC. |
|
|
Number of clocks a TSC is not allowed to participate in traffic shaping |
arbitration because it is waiting for outbound shaping tokens. One counter for |
|
each TSC. |
|
|
Number of portal packets by traffic shaping class. Counted either in the MCUOBC |
after the MCU is granted or at the traffic shaper after packet is selected. |
|
|
Number of cycles blocked for FGFC (matching entry exists and Credit >=0). Index |
given by |
|
|
Number of cycles blocked for FGFC (matching entry exists). Index given by CSR |
|
|
|
Number of FGFC frames received with matching VNI that end FGFC (period == 0). |
Index given by CSR |
|
|
Number of FGFC frames received with matching VNI that start or continue FGFC |
(period != 0). Index given by CSR |
|
|
Number of cycles IDC command is enqueued (into ENQ FIFO) but cannot get |
buffer. Counted per BC. This is an OR of all MCUs belonging to the same BC |
|
with an IDC command but does not have a cell (buffer) allocated. |
|
|
Number of clocks a BC is not allowed to participate in MCUOBC arbitration |
because it is waiting for packet buffer resources. One counter for each BC. |
|
Count in |
|
|
Number of clocks a BC is not allowed to participate in MCUOBC arbitration |
because it is waiting for PCT resources. One counter for each BC. Count in |
|
|
|
|
Number of clocks a TSC is not allowed to participate in traffic shaping |
arbitration because it is waiting for inbound shaping tokens. One counter for |
|
each TSC. |
|
|
Number of clocks a TSC is not allowed to participate in traffic shaping |
arbitration because it is waiting for outbound shaping tokens. One counter for |
|
each TSC. |
|
|
Number of cycles header write or IDC write collides with PCIe data return. Per |
bank of Packet buffer. |
|
|
Number of PCIe packets transferred using the posted path (for example, writes), |
|
and the number of cycles in which this path is blocked. Compute the ratio |
|
|
from the host. This endpoint is likely to be the cause of congestion. |
|
|
Number of PCIe packets transferred using the non-posted path (for example, |
|
reads), and the number of cycles in which this path is blocked. Compute the |
ratio |
|
host performance (high read latencies). This endpoint is likely to be injecting |
|
at a low rate. |
|
|
Number of open requests. |
|
Request/response latency histogram, 32 bins. |
|
Number of resource exhaustion NACKs. Retry handler is invoked. |
|
|
|
|
|
Number of ordered requests. |
|
Number of unordered requests. |
|
Host access latency histogram, 16 bins. |
|
Number of responses received (all unordered). |
|
Number of retries. |
|
Number of response timeouts (or packet loss in the network). Retry handler is |
invoked. |
|
|
Number of response timeouts, (packet loss in the network). Retry handler is |
invoked. |
|
|
Number of NACKs dropped. Retry handler is invoked. |
|
Number of the successfully validated commands in |
counter array is equal to the OpCode. |
|
|
Number of CT list rebuilds. Pooled counter, pool is equal to the PFQ number. |
|
Number of triggered commands. Pooled counter, pool is equal to the PFQ number. |