CPE Cassini Performance Counters User Guide

About HPE Cray Cassini hardware performance counters

The CPE Cray Cassini Performance Counters User Guide provides details on how to use counters built into the Cassini Network Interface Card (NIC) to collect and analyze performance data. This guide includes procedures for using HPE Performance Analysis Tools running on HPE Cray Supercomputing EX systems.

See the HPE Performance Analysis Tools User Guide (S-8014) for additional and related information.

Scope and audience

This publication is intended for users who need to collect and analyze performance data using the HPE Cray Cassini hardware performance counters.

What is Cassini?

HPE Cassini performance counters are a set of hardware counters provided by the HPE Cassini NIC. The HPE Cassini NIC is part of HPE Slingshot-11, a high-performance interconnect designed for HPC and AI clusters that also includes the Slingshot switch. HPE Cassini performance counters count hardware activity on the HPE Cassini NIC in an HPE Slingshot network. The HPE Cassini NIC, a 200 Gbps NIC, provides fast Message Passing Interface (MPI) message rates, increased bandwidth, and unlimited scalability.

Features and Benefits

The HPE Cassini NIC:

  • Uses an Ethernet wire protocol which enhances cluster usage capabilities.

  • Connects to the network using a four-lane Ethernet link that operates at 56 Gbps per lane using PAM-4 signaling or 28 Gbps per lane using NRZ, providing 200 Gbps or 100 Gbps, respectively.

  • Connects to a CPU or GPU using an x16 PCIe Gen4 host interface operating at 16 GT/s in each direction.

  • Accelerates HPC and AI performance by:

    • Freeing up host memory bandwidth and CPU resources, reducing compute overhead for GPU-initiated communications, and

    • Integrating technology to detect key packet information.

  • Provides high performance for MPI messaging, for remote memory access (RMA) in partitioned global address space (PGAS) programming models, and for Ethernet traffic.

  • Exposes remote direct memory access (RDMA) and HPC optimized features to software using the Libfabric software interface.

HPE Cassini performance counters that are integrated with the Slingshot NIC allow you to:

  • Analyze application performance, using counters for:

    • Debugging application performance problems

    • Determining whether the network is causing low or fluctuating performance between runs

    • Identifying details regarding the impact of any lost packets

  • Count anomalous or erroneous events

  • Fine-tune applications in more efficient and effective ways.

HPE Cassini Performance Counter Access and Events

HPE Cassini performance counters are accessed through the PAT_RT_PERFCTR environment variable and other related environment variables. HPE Cassini performance counter events are collected by processor zero on each socket upon which an application is scheduled and executing. See the HPE Performance Analysis Tools User Guide (S-8014) for additional and related information.

Traffic, Commands, Payload, and Event Handling

Cassini implements a set of blocks, including command queue blocks, that process commands into packets, track their state, execute packets, and manage completion. Each Cassini block used to implement a path collects counter data on its operation. Request and response traffic follows a distinct process involving endpoints and data exchanges across a high-speed network (HSN). Generally, endpoints initiate data exchange across the HSN by generating packets. Packets travel across the HSN to the targeted destination endpoint. The destination endpoint acts on a request and returns a response packet after successfully receiving the request packet.

Specifically, commands, along with headers that describe the payload data, are written to the memory queue; afterwards, depending on whether NIC space is available, the command is either executed immediately or read from the queue later. The HPE Cassini NIC:

  • Reads command queue data in blocks of up to 256 bytes, and

  • Supports 1024 transmit command queues for sending data and 512 target command queues for posting receive buffers.

The NIC writes back queue status information periodically and delivers status details on events, such as errors. Systematically, the:

  1. System transfers commands from the host to the NIC processor interface and then, to the command queue.

  2. Command queue validates one or multiple commands, passing them to an outbound transfer engine (OXE).

  3. OXE packetizes commands.

  4. OXE sends packets to the link, if required.

  5. Packet connection and tracking (PCT) block manages the state and matches responses to their original requests.

Counters that provide related performance information are associated with each of these systematic steps.

On the receiver side, the:

  1. Inbound traffic engine (IXE) processes incoming packets.

  2. Resource management unit (RMU) validates packets, determining target endpoints.

  3. Packet buffer holds packet data.

  4. System sends headers to the list processing engine (LPE) for matching. Note that:

    a. Write requests and atomic memory operations (AMOs) are passed to a write DMA engine if a header is matched to a target buffer.

    b. Read requests are passed to an outbound transfer engine and processed for responses.

    c. Initiator and target complete queues are managed by an event engine.

Counters that provide application performance information are associated with each receiver step.

HPE Cassini NIC connection states and the retry handler (RH) are key components that assist in tracking network errors and detecting and recovering lost packets in the network. The Cassini NIC tracks information on network error handling so that applications can determine whether performance variations are due to errors. The NIC tracks the state necessary to detect and recover from packet loss in the network. The system invokes the retry handler whenever packets are dropped or a negative acknowledgement (NACK) is returned by the target NIC. The retry handler maintains a set of counters, and these counters are accessed through /run/cxi/cxi<d>/. HPE Slingshot links protect against errors using link level retry (LLR), and there are counters to show how often it is used.
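The retry handler counters under /run/cxi/cxi<d>/ can also be inspected directly on a compute node. The following is a minimal Python sketch, assuming each counter is exposed as a small text file containing an integer value; the exact file layout is release-specific, and the function and device number shown are illustrative only.

    # Minimal sketch: read retry handler counters for one NIC from /run/cxi/cxi<d>/.
    # Assumes each counter is a small text file whose first token is an integer.
    import pathlib

    def read_rh_counters(device: int = 0) -> dict:
        """Return {counter_name: value} for one NIC's retry handler directory."""
        root = pathlib.Path(f"/run/cxi/cxi{device}")
        counters = {}
        for entry in root.iterdir():
            if entry.is_file():
                try:
                    counters[entry.name] = int(entry.read_text().split()[0])
                except (ValueError, IndexError):
                    pass  # skip files that are not simple numeric counters
        return counters

    if __name__ == "__main__":
        for name, value in sorted(read_rh_counters(0).items()):
            print(f"{name:40s} {value}")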

Packet, Flit, and Stall Counters

At various points within the HPE Slingshot NIC, separate counters are provided for packets, flits, and stalls.

Flit Counters

With HPE Slingshot, data is transferred between devices in packets that contain up to 4,096 bytes of payload data (or up to 9,000 bytes for jumbo Ethernet packets). Within the NIC, data is moved in 64-byte flits at a rate of one flit per cycle of a 1 GHz clock.
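As a back-of-the-envelope illustration of the numbers above, a maximum-size 4,096-byte payload spans 64 flits of 64 bytes each, and at one flit per cycle of the 1 GHz clock those flits take roughly 64 ns to stream through an internal interface. The sketch below performs that arithmetic; it ignores packet headers and any stalls.

    # Back-of-the-envelope flit arithmetic (ignores packet headers and stalls).
    FLIT_BYTES = 64            # flit size used inside the NIC
    CLOCK_HZ = 1_000_000_000   # one flit per cycle of the 1 GHz clock

    def flits_for_payload(payload_bytes: int) -> int:
        """Number of 64-byte flits needed to carry a payload."""
        return -(-payload_bytes // FLIT_BYTES)  # ceiling division

    def streaming_time_ns(payload_bytes: int) -> float:
        """Cycles (at 1 GHz, one cycle is one nanosecond) to move the payload's flits."""
        return flits_for_payload(payload_bytes) / CLOCK_HZ * 1e9

    print(flits_for_payload(4096), "flits,", streaming_time_ns(4096), "ns")  # 64 flits, 64.0 ns
    print(flits_for_payload(9000), "flits,", streaming_time_ns(9000), "ns")  # 141 flits for a jumbo payload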

Counter Stalls

Stalls, in most cases, are an indication of flow control back-pressure, limiting the rate at which flits can be forwarded across an interface. A stall counter increments each flit-time that a ready-to-forward flit is prevented by back-pressure from doing so. For example, if interface flit and stall counts increment at equal rates, then, on average, flits cross the interface at half the rate they would if no stalls occurred. Situations that lead to stalls include, for example, rate adaptation to a lower speed downstream interface and arbitration, at a point downstream, for a contended resource. A high ratio of stalls-to-packets or stalls-to-flits is an indication of possible congestion. Blocked counters are included at some arbiter inputs within the HPE Slingshot NIC interconnect. Typically, these blocked counters would increment whenever a flit is available at the input, but arbitration, instead, selects a different input.
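The stall-to-flit reasoning above can be expressed as a simple calculation. The sketch below is illustrative only (the function names are not tool output fields): it computes the fraction of flit-times lost to stalls and the resulting forwarding rate relative to an interface that never stalls.

    def stall_fraction(flits: int, stalls: int) -> float:
        """Fraction of flit-times at an interface spent stalled."""
        total = flits + stalls
        return stalls / total if total else 0.0

    def effective_rate(flits: int, stalls: int) -> float:
        """Forwarding rate relative to an unstalled interface (1.0 means no stalls)."""
        return 1.0 - stall_fraction(flits, stalls)

    # Flit and stall counters incrementing at equal rates means flits cross
    # the interface at half the rate they would with no stalls.
    print(effective_rate(1_000_000, 1_000_000))  # 0.5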

Key NIC performance counter groups

There are 12 unique Cassini counter groups:

Name                        Counter Group

CxiPerfStats                Traffic Congestion Counter Group
CxiErrStats                 Network Error Counter Group
CxiOpCommands               Operation (Command) Counter Group
CxiOpPackets                Operation (Packet) Counter Group
CxiDmaEngine                DMA Engine Counter Group
CxiWritesToHost             Writes-to-Host Counter Group
CxiMessageMatchingPooled    Message Matching of Pooled Counters Group
CxiTranslationUnit          Translation Unit Counter Group
CxiLatencyHist              Latency Histogram Counter Group
CxiPctReqRespTracking       PCT Request and Response Tracking Counter Group
CxiLinkReliability          Link Reliability Counter Group
CxiCongestion               Congestion Counter Group

The groups listed above, which are accessed by setting PAT_RT_PERFCTR to the name of a group, each combine a set of counters optimized to provide a specific type of performance information. Some of these groups comprise multiple counter events that provide useful metrics when examined together; those groups serve as a starting point for investigating application performance with Cassini. They provide information on the amount of traffic generated by the NIC, whether that traffic is being slowed down by address translation and back pressure from the host, and whether that traffic is causing congestion.

There are 1,427 unique Cassini counter events that can be accessed individually by setting PAT_RT_PERFCTR to the name of one or more events. See the cassini(5) man page for additional information.
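For example, a job script or launch wrapper can export PAT_RT_PERFCTR before running a CrayPat-instrumented binary. The sketch below uses Python purely for illustration; the launcher arguments and executable name are placeholders, and the exact event-list syntax is documented in the HPE Performance Analysis Tools User Guide.

    # Illustrative launch wrapper: select Cassini counters for an instrumented run.
    # The "srun" arguments and the "./a.out+pat" executable are placeholders.
    import os
    import subprocess

    env = dict(os.environ)

    # Either name a counter group ...
    env["PAT_RT_PERFCTR"] = "CxiPerfStats"

    # ... or list individual events (names as shown in the cassini(5) man page):
    # env["PAT_RT_PERFCTR"] = "HNI_PKTS_SENT_BY_TC_0,HNI_PKTS_RECV_BY_TC_0"

    subprocess.run(["srun", "-n", "4", "./a.out+pat"], env=env, check=True)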

CxiPerfStats - traffic congestion counter group

Use CxiPerfStats counters as a starting point for investigating application performance details. CxiPerfStats counters provide information on the amount of traffic generated by the NIC, whether that traffic is being slowed down by address translation and back pressure from the host, and whether that traffic is causing congestion.

Counter Name

Description

HNI_PKTS_SENT_BY_TC_<n>

Number of packets sent in traffic class <n>; 8 counters, default classes are 0 for request and 1 for response.

HNI_PKTS_RECV_BY_TC_<n>

Number of packets received in traffic class <n>; 8 counters, default classes are 0 for request and 1 for response.

HNI_TX_PAUSED_<n>

Number of cycles in which the transmit path is paused for traffic class <n>; default classes are 0 for request and 1 for response. Indicates that this endpoint is supplying data faster than the network can deliver it. Divide by 1E9 to determine the proportion of time paused.

HNI_RX_PAUSED_<n>

Number of cycles in which pause is applied on the receive path for traffic class <n>; default classes are 0 for request and 1 for response. Indicates that the network is supplying data faster than this endpoint can consume it. Divide by 1E9 to determine the proportion of time paused.

PARBS_TARB_PI_POSTED_PKTS
PARBS_TARB_PI_POSTED_BLOCKED_CNT

Number of PCIe packets transferred using the posted path (for example, writes), and the number of cycles in which this path is blocked. Compute the ratio cycles/pkts. Values of more than a few cycles per packet indicate back pressure from the host. This endpoint is likely to be the cause of congestion.

PARBS_TARB_PI_NON_POSTED_PKTS
PARBS_TARB_PI_NON_POSTED_BLOCKED_CNT

Number of PCIe packets transferred using the non-posted path (for example, reads), and the number of cycles in which this path is blocked. Compute the ratio cycles/pkts. Values of more than a few cycles per packet indicate poor host performance (high read latencies). This endpoint is likely to be injecting at a low rate.

LPE_NET_MATCH_PRIORITY_<n>

Number of messages matched on the priority list (receive was posted before the message arrived). Four counters of which 0 is the default. These messages incur lower cost because data is written directly to the user buffer.

LPE_NET_MATCH_OVERFLOW_<n>

Number of messages where payload data was delivered to a buffer on the overflow list because there was no match on the priority list. Four counters of which 0 is the default. These messages incur higher cost because data must be copied from the overflow buffer. Compute the ratio priority/(priority + overflow) to determine the proportion of messages for which receives were posted in advance.

ATU_CACHE_MISS_<n>

Number of misses in the NIC translation cache. Four counters, of which counter 0 counts misses on 4K pages and counter 1 counts misses on 2M pages by default.

ATU_CACHE_EVICTIONS_<n>

Number of times a tag was evicted from the NIC translation cache to make room for a new tag.
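The derived metrics suggested in the descriptions above can be computed directly from the raw counter values. The following is a minimal sketch, assuming the counter values have already been collected into plain integers; the function names and the sample values (similar to the example reports later in this guide) are illustrative.

    # Derived metrics from CxiPerfStats counters (values assumed already collected).
    NIC_CLOCK_HZ = 1e9  # divide pause cycles by 1E9 to convert to seconds

    def paused_seconds(paused_cycles: int) -> float:
        """Convert an HNI_*_PAUSED_<n> cycle count into seconds paused."""
        return paused_cycles / NIC_CLOCK_HZ

    def blocked_cycles_per_packet(blocked_cnt: int, pkts: int) -> float:
        """PARBS_TARB_PI_*_BLOCKED_CNT / PARBS_TARB_PI_*_PKTS."""
        return blocked_cnt / pkts if pkts else 0.0

    def priority_match_fraction(priority: int, overflow: int) -> float:
        """LPE_NET_MATCH_PRIORITY / (PRIORITY + OVERFLOW)."""
        total = priority + overflow
        return priority / total if total else 0.0

    print(paused_seconds(396_666_167))                          # ~0.40 s paused
    print(blocked_cycles_per_packet(941_671_854, 46_139_681))   # ~20 cycles per packet
    print(priority_match_fraction(406_687, 88_050))             # ~0.82 posted in advance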

CxiErrStats - Cassini network error counter group

Cassini provides a reliable transport for libfabric. If packets are dropped in the network, they are automatically retried. The retry handler is also invoked when there is resource exhaustion in the target NIC. Retransmission can have an impact on performance, especially in a tightly coupled application.

When low performance or variation in performance between runs is observed, use the CxiErrStats counter group to determine if the network is a factor. The counters in this group are:

Counter Name

Description

PCT_SPT_TIMEOUTS
PCT_SCT_TIMEOUTS

Number of response timeouts (or packet loss in the network). Retry handler is invoked.

PCT_NO_TCT_NACKS
PCT_NO_TRS_NACKS
PCT_NO_MST_NACKS

Number of resource exhaustion NACKs. Retry handler is invoked.

PCT_RETRY_SRB_REQUESTS

Number of retries.

PCT_TRS_RSP_NACK_DROPS

Number of NACKs dropped. Retry handler is invoked.

HNI_PCS_UNCORRECTED_CW

Number of uncorrected code words received on the switch to NIC link. High rates (multiple errors per second) indicate a poor quality link.

HNI_LLR_TX_REPLAY_EVENT
HNI_LLR_RX_REPLAY_EVENT

Number of LLR replays. High rates (multiple per second) indicate that the LLR mechanism is providing protection on a poor quality link.

CxiOpCommands - Cassini operation (commands) counter group

Cassini counters categorized as operational counters (relative to commands and packets) include:

Counter Name

Description

CQ_DMA_CMD_COUNTS

Number of commands of each type:

- Put, Get, rendezvous, atomics, small message, Ethernet, etc.

- Target commands

- Triggered operations

With the exception of CMD_ETHERNET_TX, Cassini uses these commands to generate and

send packets. Specifically, the commands include:

- CMD_NOOP - (4’d0) - No operation

- CMD_PUT - (4’d1) - Put

- CMD_GET - (4’d2) - Get

- CMD_RENDEZVOUS_PUT - (4’d3) - Rendezvous Put

- CMD_ATOMIC - (4’d4) - Non-fetching atomic

- CMD_FETCHING_ATOMIC - (4’d5) - Fetching atomic

- CMD_ETHERNET_TX - (4’d6) - Send an Ethernet or optimized IP packet

- CMD_SMALL_MESSAGE - (4’d7) - Send a small message

- CMD_NOMATCH_PUT - (4’d8) - Non-matching DMA Put

- CMD_NOMATCH_GET - (4’d9) - Non-matching DMA Get

- CMD_CSTATE - (4’d10) - Update C_STATE for IDC commands

CQ_CQ_CMD_COUNTS

Both CQ commands (FENCE and LCID) and target commands:

- CMD_FENCE (1) - Suspend stream until pending commands have completed.

- CMD_LCID(2) - Change communication profile.

- CMD_TGT_APPEND(8) - Append a target buffer. Specifies how to receive a message.

- CMD_TGT_SEARCH(9) - Search the unexpected list for a buffer.

- CMD_TGT_SEARCH_AND_DELETE(10) - Search the unexpected list for a buffer, and

delete it, if found.

- CMD_TGT_UNLINK(11) - Unlink a buffer by ID.

- CMD_TGT_SETSTATE(12) - Change the state of an endpoint.

CQ_NUM_DMA_CMDS

Number of DMA commands.

CQ_NUM_IDC_CMDS

Number of immediate data commands.

Example

Command usage

CxiOpPackets - Cassini operation (packets) counter group

Operational counters in this group include:

Counter Name

Description

HNI_TX_OK_<min>_to_<max>

Number of packets sent in each of 12 size bins:

- Small packets 27, 35, 64 bytes

- 65-128, 256-511, 512-1023, 1024-2047, 2048-4095, 4096-8191, 8192-Max

HNI_RX_OK_<min>_to_<max>

Number of packets received in each of 12 size bins.

HNI_PKTS_SENT_BY_TC

Number of packets sent in each traffic class; default classes are 0 and 1.

HNI_PKTS_RECV_BY_TC

Number of packets received in each traffic class; default classes are 0 and 1.

Examples

Distribution of packet sizes bar graph - Transmit (TX) (Example 1)

Distribution of packet sizes bar graph - Received (RX) (Example 2)

CxiDmaEngine - Cassini DMA engine counter group

CxiDmaEngine counters provide DMA engine information. Expect high values in the stall counters when the load is high. Compare against the clock to compute the percentage of time stalled, as in the sketch below.
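This sketch assumes the stall counters increment once per 1 GHz clock cycle while stalled; the sample values are illustrative.

    # Percentage of a sample window spent stalled, for a cycle-based stall counter
    # such as IXE_DMAWR_STALL_P_CDT (assumes a 1 GHz counting clock).
    NIC_CLOCK_HZ = 1e9

    def stalled_percent(stall_cycles: int, elapsed_seconds: float) -> float:
        """Stall cycles expressed as a percentage of the elapsed sample window."""
        return 100.0 * stall_cycles / (elapsed_seconds * NIC_CLOCK_HZ)

    print(stalled_percent(250_000_000, 10.0))  # 2.5 (% of a 10 second window)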

Counter Name

Description

OXE_MCU_MEAS

Number of flits sent by each MCU (configurable to count packets or messages).

OXE_CHANNEL_IDLE

Number of cycles in which available bandwidth is not used.

IXE_DISP_DMAWR_REQS

Number of requests to DMA write controller.

IXE_DMAWR_STALL_P_CDT

Number of stalls due to no posted credits (cycles).

IXE_DMAWR_STALL_NP_CDT

Number of stalls due to no non-posted credits (cycles).

PI_PTI_TARB_MRD_PKTS

Number of memory read TLPs (all source).

PI_PTI_TARB_MWR_PKTS

Number of memory write TLPs (all source).

CxiWritesToHost - Cassini writes to host counters

The counters in this group are:

Counter Name

Description

PARBS_TARB_PI_POSTED_PKTS
PARBS_TARB_PI_POSTED_BLOCKED_CNT

Number of PCIe packets transferred using the posted path (for example, writes), and the number of cycles in which this path is blocked. Compute the ratio cycles/pkts. Values of more than a few cycles per packet indicate back pressure from the host. This endpoint is likely to be the cause of congestion.

PARBS_TARB_PI_NON_POSTED_PKTS
PARBS_TARB_PI_NON_POSTED_BLOCKED_CNT

Number of PCIe packets transferred using the non-posted path (for example, reads), and the number of cycles in which this path is blocked. Compute the ratio cycles/pkts. Values of more than a few cycles per packet indicate poor host performance (high read latencies). This endpoint is likely to be injecting at a low rate.

Example

In this example, note that:

  • The ratio calculation is cycles blocked per packet.

  • Cassini is being blocked for 20 cycles per 512-byte PCIe packet.

    Size         Bandwidth (MB/s)
    ...
    16384                  21070.06
    32768                  21752.80
    65536                  22896.48
    131072                 23461.53
    262144                 23764.63
    524288                 23917.26
    1048576                23985.48
    2097152                24024.32
    4194304                24044.25
    8388608                23972.08
    (16777216)*          (20321.26)*
    (33554432)*          (16550.01)*
    (67108864)*          (16043.86)*
    
    Name                               Samples   Min         Mean          Max
    
    hni_rx_paused_0                          4     0            0            0
    
    hni_rx_paused_1                          4     0    396666167 (1586664669)*
    
    hni_tx_paused_0                          4     0            0            0
    
    hni_tx_paused_1                          4     0    163578075    654312300
    
    parbs_tarb_pi_non_posted_pkts            4     0     54988998    201652661
    
    parbs_tarb_pi_posted_blocked_cnt         4     0    941671854 (3766687419)*
    
    parbs_tarb_pi_posted_pkts                4   372     46139681    184554415
    

* - Text in parentheses followed by an asterisk is red text in actual output.
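As a cross-check of the 20-cycles-per-packet figure above, the ratio can be recomputed from the Max column of the report:

    # Blocked cycles per posted PCIe packet, using the Max values shown above.
    posted_blocked_cnt = 3_766_687_419   # parbs_tarb_pi_posted_blocked_cnt (Max)
    posted_pkts        = 184_554_415     # parbs_tarb_pi_posted_pkts (Max)

    print(round(posted_blocked_cnt / posted_pkts, 1))  # ~20.4 cycles blocked per packet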

CxiMessageMatchingPooled - Cassini message matching of pooled counters

Counters in this group provide information on:

  • Inbound match counters, including tries, successes, fails, longest lookup (in number of entries), and the average number of match attempts

  • Successful matches subdivided into use-once matches, locally managed matches, and persistent matches

  • Local operation counters per command (tries, successes, fails, longest lookup, average number of match attempts)

  • The number of wildcard searches (NID==NID_ANY, PID==PID_ANY, RANK==RANK_ANY)

  • Total rendezvous and offloaded rendezvous

  • Fetch information for AMO commands received

Counters in this group include:

Counter Name

Description

LPE_NET_MATCH_REQUESTS

Number of requests matched.

LPE_NET_MATCH_PRIORITY

Number of messages matched on the priority list (receive was posted before the message arrived). Four counters of which 0 is the default. These messages incur lower cost because data is written directly to the user buffer.

LPE_NET_MATCH_OVERFLOW_<n>

Number of messages where payload data was delivered to a buffer on the overflow list because there was no match on the priority list. Four counters of which 0 is the default. These messages incur higher cost because data must be copied from the overflow buffer.

Compute the ratio priority/(priority + overflow) to determine the proportion of messages for which receives were posted in advance.

LPE_NET_MATCH_REQUEST

Number of requests matched on request list (software endpoints).

LPE_APPEND_CMDS

Number of append commands.

LPE_SEARCH_NID_ANY

Number of wildcard searches using NID_ANY, physical matching.

LPE_SEARCH_PID_ANY

Number of wildcard searches using PID_ANY, physical matching.

LPE_SEARCH_RANK_ANY

Number of wildcard searches using RANK_ANY, logical matching.

LPE_RNDZV_PUTS

Number of rendezvous puts received.

LPE_AMO_CMDS

Number of non-fetching AMO commands received by LPE.

LPE_FAMO_CMDS

Number of fetching AMO commands received by LPE.

Cassini message matching statistics in events include:

Event

Description

Completion events, lpe_STAT_1

Number of match attempts on priority list

Completion events, lpe_STAT_2

Number of match attempts on other lists

Link events (normally suppressed), lpe_STAT_1

Number of list entries allocated to pool

Link events (normally suppressed), lpe_STAT_2

Number of list entries available to pool

Example

Message matching statistics

CxiTranslationUnit - Cassini translation unit counters

The counters in this group are:

Counter Name

Description

ATU_CACHE_MISS

Number of cache misses by counter pool.

ATU_CLIENT_REQ_EE

Number of translation requests by client (events, writes, reads).

ATU_CLIENT_REQ_IXE

ATU_CLIENT_REQ_OXE

Note that EE is the event engine, IXE is the inbound transfer engine (writes), and OXE is the outbound transfer engine (reads).

ATU_CACHE_MISS_EE

Number of cache misses by client (events, writes, reads).

ATU_CACHE_MISS_IXE

ATU_CACHE_MISS_OXE

ATU_CACHE_EVICTIONS

Number of times a tag was evicted from the NIC translation cache to make room for a new tag.

ATU_CACHE_HIT_BASE_PAGE_SIZE

Number of cache hits observed on the Base Page Size.

ATU_CACHE_HIT_DERIVATIVE1_PAGE_SIZE

Number of cache hits observed on the Derivative 1 Page Size.

ATU_ATS_TRANS_LATENCY

ATS translation latency in preconfigured bins.

Example: Two Counter Pools

NIC      MISS_0    MISS_1    REQ_IXE    MISS_IXE     REQ_OXE    MISS_OXE
  0        8744         0    6758637         791    27033796        7951
  1        9973         0    6758548        1957    27033657        8014
  2        8702         0    6758435         802    27033634        7898
  3        9970         0    6758405        1927    27033608        8041
  4        8735         0    6758440         762    27033637        7971
  5       10053         0    6758404        2021    27033604        8030
  6        8649         0    6758435         800    27033634        7847
  7        9441         0    6758404        1405    27033604        8034

CxiLatencyHist - Cassini latency histogram counters

PCT counters included in this counter group provide the:

  • Number, including the maximum number, of active connects (SCT and TCT)

  • Number, including the maximum number, of active messages (SMT and MST)

  • Configurable histogram for response latency

Counters in this group include:

Counter Name               Description                                         Base     Bin Width

PCT_HOST_ACCESS_LATENCY    Request or response latency histogram, 32 bins.     256      128

PCT_REQ_RSP_LATENCY        Host access latency histogram, 16 bins.             2,048    256

Example

STAT prefix tree with driver controls setup of histograms

The example figure above shows two histograms: one for host accesses (reads) and one for network request/response. The width of the bins (in clock cycles or nanoseconds) is controlled by the base and the M parameter; the bin width is 2^M. The values were set up in advance. Note also that users cannot configure the histograms without root permission.
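For orientation, the sketch below converts a histogram bin index into the latency range it covers, using the Base and Bin Width values from the table above. This is an illustrative reading of those two parameters only, not a description of the driver interface, and the helper name is hypothetical.

    # Map a latency-histogram bin index to the latency range it covers,
    # assuming bin 0 starts at Base and each bin is Bin Width wide.
    def bin_range(index: int, base: int, width: int) -> tuple:
        """Return (low, high) bounds covered by the given bin."""
        low = base + index * width
        return low, low + width

    # PCT_REQ_RSP_LATENCY as listed above: base 2,048 and bin width 256 (16 bins).
    for i in (0, 1, 15):
        print(i, bin_range(i, base=2048, width=256))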

CxiPctReqRespTracking - Cassini PCT request and response tracking counters

The counters in this group are:

Counter Name

Description

PCT_REQ_ORDERED

Number of ordered requests.

PCT_REQ_UNORDERED

Number of unordered requests.

PCT_RESPONSES_RECEIVED

Number of responses received (all unordered).

PCT_CONN_SCT_OPEN

Number of open requests.

PCT_NO_TRS_NACKS

Number of requests that did not complete because the TRS was full.

PCT_RETRY_SRB_REQUESTS

Number of retries.

PCT_SPT_TIMEOUTS

Number of requests that timed out before a response was received.

PCT_SCT_TIMEOUTS

Number of close requests that timed out before a response was received.

The retry handler provides information on its activity in /run/cxi/cxi<device>.

CxiCongestion - Cassini congestion counters

The counters in this group are:

Counter Name

Description

CxiWritesToHost

See CxiWritesToHost - Cassini writes to host counters.

HNI_RX_PAUSED[8]

Number of cycles in which the network asserts pause, per traffic class; default classes for DMA traffic are 0 and 1. The network asserting pause indicates the application is causing congestion.

HNI_RX_PAUSED_STD

Number of cycles in which at least one PCP pause occurred.

HNI_TX_PAUSED[8]

Number of cycles in which the NIC asserts pause, per traffic class; default classes for DMA traffic are 0 and 1. The NIC asserting pause indicates that either the rate of writes or the rate of translations is higher than the host can support, and the NIC is applying back-pressure to the network.

Example

     Name                             Samples      Min          Mean            Max
     hni_rx_paused_0                        4        0             0              0
     hni_rx_paused_1                        4        0     396666167   (1586664669)*
     hni_tx_paused_0                        4        0             0              0
     hni_tx_paused_1                        4        0     163578075      654312300
     parbs_tarb_pi_non_posted_pkts          4        0      54988998      201652661
     parbs_tarb_pi_posted_blocked_cnt       4        0     941671854   (3766687419)*
     parbs_tarb_pi_posted_pkts              4      372      46139681      184554415

* - Text in parentheses followed by an asterisk is red text in actual output.

Gathering Cassini performance counters in HPE Cray MPI

The HPE Cray Message Passing Interface (MPI) is HPE’s high performance MPI library, a solution based on standards maintained by the MPI Forum. MPI allows for parallel programming across a network of computer systems by using message passing techniques. In support of this functionality, HPE Cray MPI uses libfabric (OFI), an open source project developed under the OpenFabrics Alliance, as the default network interface library.

HPE Cray MPI offers a feature to automatically gather and analyze Cassini counters for any MPI application run on the HPE Slingshot 11 network. This easy-to-use feature provides important feedback regarding application performance and requires no source code or linking changes. To gather CXI counters for an MPI job, use the following environment variables at runtime:

  • MPICH_OFI_CXI_COUNTER_REPORT

  • MPICH_OFI_CXI_COUNTER_FILE

  • MPICH_OFI_CXI_COUNTER_REPORT_FILE

  • MPICH_OFI_CXI_COUNTER_VERBOSE

Important: Avoid using the PAT_RT_PERFCTR and MPICH_OFI_CXI_COUNTER_* environment variables at the same time for an instrumented MPI program. Doing so can result in unexpected runtime issues.
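For example, a launch wrapper can request the fuller counter summary by exporting MPICH_OFI_CXI_COUNTER_REPORT before the job starts. The sketch below uses Python only for illustration; the launcher arguments and application name are placeholders.

    # Illustrative launch wrapper: request the CXI counter summary report (level 2).
    # The "srun" arguments and "./my_mpi_app" are placeholders for a real job.
    import os
    import subprocess

    env = dict(os.environ)
    env["MPICH_OFI_CXI_COUNTER_REPORT"] = "2"                    # summary at MPI_Finalize
    # env["MPICH_OFI_CXI_COUNTER_FILE"] = "my_counters.txt"      # optional alternate list
    # env["MPICH_OFI_CXI_COUNTER_REPORT_FILE"] = "cxi_report"    # optional per-node files

    subprocess.run(["srun", "-n", "128", "./my_mpi_app"], env=env, check=True)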

Environment Variable

Description

MPICH_OFI_CXI_COUNTER_REPORT

Determines whether CXI counters are collected during the application and the verbosity of the counter data report displayed during MPI_Finalize.

By default, HPE Cray MPI tracks network timeouts during the application. The counter data is collected during MPI_Init and MPI_Finalize, with a report indicating the change in the selected counters over the duration of the application. No interference occurs while the application is running. To obtain a valid counter report, run times must be at least a few seconds. A network timeout is defined as an event on the Slingshot 11 network (such as a link flap) that causes a packet to be re-issued. Network timeouts are identified from a NIC perspective. A single link flap may affect multiple NICs, depending on the network traffic at the time, so it is likely that a single flap generates multiple timeouts. MPI queries all NICs in use for the application. Depending on the application traffic pattern and timing, network timeouts may or may not affect the application performance metric. If network timeouts affecting the application occur, a one-line message is sent to stdout indicating the number of timeouts (for example, MPICH Slingshot Network Summary: 3 network timeouts). If no timeouts are detected, this line is suppressed unless additional verbosity is requested.

Along with the network timeout counters, MPI can also collect and display an additional set of Cassini counters by setting MPICH_OFI_CXI_COUNTER_REPORT to a value of 2 or higher. MPI has a default set of counters it collects for each application, which may be useful in debugging performance problems. This set can be overridden by specifying a file with a list of alternate counters. Values 0 through 5 are options for this environment variable:

0 - no Cassini counters collected; feature is disabled

1 - network timeout counters collected, one-line display (default)

2 - option 1 + CXI counters summary report displayed

3 - option 2 + display counter data for any NIC that hit a network timeout

4 - option 2 + display counter data for all NICs, if any network timeout occurred

5 - option 2 + display counter data for all NICs

Default: 1

MPICH_OFI_CXI_COUNTER_FILE

Specifies a file containing an alternate list of Cassini counter names to collect.

If this file is present, instead of collecting the default set of counters, MPI collects data for the counters specified in the file. Counter names must be listed one per line. For retry handler counters, prefix the counter name with rh:. When specifying this option, set MPICH_OFI_CXI_COUNTER_REPORT to 2 or higher. Setting MPICH_OFI_CXI_COUNTER_REPORT to 1 may be helpful for debugging. The default network timeout counters are collected in addition to the file contents. Only applicable to Slingshot 11.

CXI counters that MPI collects by default, if MPICH_OFI_CXI_COUNTER_FILE is not used to specify an alternate set, include:

- atu_cache_evictions

- atu_cache_hit_base_page_size_0

- atu_cache_hit_derivative1_page_size_0

- lpe_net_match_priority_0

- lpe_net_match_overflow_0

- lpe_net_match_request_0

- lpe_rndzv_puts_0

- lpe_rndzv_puts_offloaded_0

- hni_rx_paused_0

- hni_rx_paused_1

- hni_tx_paused_0

- hni_tx_paused_1

- parbs_tarb_pi_posted_pkts

- parbs_tarb_pi_posted_blocked_cnt

- parbs_tarb_pi_non_posted_pkts

- parbs_tarb_pi_non_posted_blocked_cnt

- pct_no_tct_nacks

- pct_trs_rsp_nack_drops

- pct_mst_hit_on_som

- rh:sct_timeouts

- rh:spt_timeouts

- rh:spt_timeouts_u

- rh:connections_cancelled

- rh:nack_no_matching_conn

- rh:nack_no_target_conn

- rh:nack_no_target_mst

- rh:nack_no_target_trs

- rh:nack_resource_busy

- rh:nacks

- rh:nack_sequence_error

- rh:pkts_cancelled_o

- rh:pkts_cancelled_u

- rh:sct_in_use

- rh:tct_timeouts

Default: Not set.

MPICH_OFI_CXI_COUNTER_REPORT_FILE

Specifies an optional output filename prefix for the counter report.

By default, the counter report is written to stdout. If this variable is set to a filename, the detailed counter data produced with MPICH_OFI_CXI_COUNTER_REPORT option 3, 4, and 5 is written to node-specific files with filenames of MPICH_OFI_CXI_COUNTER_REPORT.<hostname>. The user must have appropriate permission to create these files. This option is useful if you are running on hundreds or thousands of nodes, where stdout can become jumbled or truncated by the launcher. If not specified, stdout is used.

MPICH_OFI_CXI_COUNTER_VERBOSE

If set to a non-zero value, this option enables more verbose output about the Cassini counters being collected. This option is helpful for debugging and identifying which counters are being collected.

Default: 0

Example Summary Report

If MPICH_OFI_CXI_COUNTER_REPORT is set to 2 or higher, the Cassini counter summary report displayed during MPI_Finalize includes a minimum/mean/maximum value for each counter selected, along with computed rates. If it is set to 3 or higher, NIC-specific detailed counter data also appears. Counters that record zero values are suppressed. Recognized values are between 0 and 5; the value affects the verbosity of the counter data displayed.

MPICH Slingshot Network Summary: 5 network timeouts
MPICH Slingshot CXI Counter Summary:

Counter                 Samples        Min        (/s)       Mean        (/s)        Max        (/s)
rh:spt_timeouts               4          1         0.1          1         0.1          2         0.2
atu_cache_hit_base_
     page_size_0            508        131        13.1        199        19.9        325        32.5
atu_cache_hit_derivative1_
     page_size_0            508  267137026  26713702.6  267258596  26725859.6  267382810  26738281.0
lpe_net_match_priority_0    508     312194     31219.4     406687     40668.7     491374     49137.4
lpe_net_match_overflow_0    508       3227       322.7      88050      8805.0     182417     18241.7
lpe_net_match_request_0     508         20         2.0         22         2.2         25         2.5
lpe_rndzv_puts_0            508     182182     18218.2     182182     18218.2     182182     18218.2
lpe_rndzv_puts_offloaded_0  508     182182     18218.2     182182     18218.2     182182     18218.2
hni_rx_paused_1             506        405        40.5     150356     15035.6     931673     93167.3
hni_tx_paused_0             508      56056      5605.6   11170254   1117025.4   22860933   2286093.3
hni_tx_paused_1             508   36133827   3613382.7 1217571763 121757176.3 4243430751 424343075.1
parbs_tarb_pi_posted_pkts   508  214956606  21495660.6  215103291  21510329.1  215275997  21527599.7
parbs_tarb_pi_posted_
     blocked_cnt            508    3301944    330194.4    8966220    896622.0   16794528   1679452.8
parbs_tarb_pi_non_posted_
     pkts                   508  213465644  21346564.4  213465928  21346592.8  213466358  21346635.8
parbs_tarb_pi_non_posted_
     blocked_cnt            508        260        26.0       3123       312.3      21554      2155.4
rh:nack_resource_busy        56          1         0.1          1         0.1          3         0.3
rh:nacks                     60          1         0.1         90         9.0        430        43.0
rh:nack_sequence_error       56         14         1.4         95         9.5        427        42.7

All user accessible NIC performance counters

This table includes all Cassini counter events available to users and supported by HPE performance analysis tools using the conventional PAT_RT_PERFCTR environment variable. Run papi_native_avail -i cray_cassini to view these counters; the perftools modules or the papi module must be loaded. Use the papi_component_avail utility to verify that the cray_cassini PAPI component is activated for the compute platform.

Counter Name

Description

ATU_ATS_PRS_ODP_LATENCY

ATS Page Request Services On-Demand Paging latency histogram. Four bins

defined in C_ATU_CFG_ODP_HIST.

ATU_ATS_TRANS_LATENCY

ATS Translation latency histogram. Four bins defined in

C_ATU_CFG_XLATION_HIST.

ATU_CACHE_HIT_BASE_PAGE_SIZE

Number of cache hits observed on the Base Page Size.

ATU_CACHE_HIT_DERIVATIVE1_PAGE_SIZE

Number of cache hits observed on the Derivative 1 Page Size.

ATU_CACHE_HIT_DERIVATIVE2_PAGE_SIZE

Number of cache hits observed on the Derivative 2 Page Size.

ATU_CACHE_MISS

Number of cache misses by counter pool.

ATU_CACHE_EVICTIONS_<n>

Number of times a tag was evicted from the NIC translation cache to make room

for a new tag.

ATU_CACHE_MISS_EE

Number of cache misses by client (events, writes, reads).

ATU_CACHE_MISS_IXE

ATU_CACHE_MISS_OXE

ATU_CLIENT_REQ_EE

Number of translation requests by client (events, writes, reads).

ATU_CLIENT_REQ_IXE

ATU_CLIENT_REQ_OXE

ATU_CLIENT_RSP_NOT_OK

Number of client responses that were not RC_OK.

ATU_NIC_PRI_ODP_LATENCY

NIC Page Request Interface On-Demand Paging latency histogram.

ATU_NTA_TRANS_LATENCY

NTA Translation latency histogram.

ATU_ODP_REQUESTS

Number of On-Demand Paging requests.

CQ_CQ_CMD_COUNTS
CQ_DMA_CMD_COUNTS

Number of commands of each type:

- Put, Get, rendezvous, atomics, small message, Ethernet, and so forth.

- Target commands

- Triggered operations

CQ_CYCLES_BLOCKED

Number of cycles the pool had a command ready to send to OXE and could not

make progress because another OCUSET won arbitration.

CQ_NUM_CQ_CMDS

Number of successfully parsed CQ commands processed by the CQ block.

CQ_NUM_DMA_CMDS

Number of successfully parsed DMA commands.

CQ_NUM_IDC_CMDS

Number of successfully parsed immediate data commands.

CQ_NUM_LL_CMDS

Number of successfully parsed CQ commands processed by the CQ block.

CQ_NUM_LL_OPS_RECEIVED

The number of error-free low-latency operations received. This counter

increments whenever counters NUM_LL_OPS_SUCCESSFUL,

NUM_LL_OPS_REJECTED, or NUM_LL_OPS_SPLIT increment. However, this

counter is not necessarily equal to the sum of these other three counters,

because counters NUM_LL_OPS_REJECTED and NUM_LL_OPS_SPLIT can both

increment for the same low-latency operation.

CQ_NUM_LL_OPS_REJECTED

The number of low-latency operations received for which the data of the

operation was not accepted because the corresponding transmit queue was not

empty, or the transmit queue’s low-latency data buffer was not empty when the

low-latency operation was received.

CQ_NUM_LL_OPS_SPLIT

The number of low-latency operations received for which the data of the

operation was not accepted because delivery of that data was split into

multiple writes to the command issue image, with some or all of those writes

containing less than 64 bytes of data.

CQ_NUM_LL_OPS_SUCCESSFUL

The number of low-latency operations for which the data of the operation was

accepted. Barring an uncorrectable error being subsequently detected in the

low-latency data buffer (error flag LL_BUF_UCOR), low-latency data that is

accepted is forwarded to the CQ transmit command parser. Each 64-byte or

128-byte block of data written to the command issue image is counted as one

low-latency operation. As each such block of data may contain more than one CQ

command, there is not a one-to-one correspondence between this

NUM_LL_OPS_SUCCESSFUL counter and the NUM_LL_CMDS counter.

CQ_NUM_TGQ_CMD_READS

Number of PCIe command reads issued for target prefetch queues. Four counters,

one each for reads of 64, 128, 192, or 256 bytes.

CQ_NUM_TGT_CMDS

Number of successfully parsed CQ commands processed by target command

queues. All target commands are single flit. Incremented as target commands

are sent to LPE.

CQ_NUM_TOU_CMD_READS

Number of PCIe command reads issued for TOU prefetch queues. Four counters,

one each for reads of 64, 128, 192, or 256 bytes.

CQ_NUM_TXQ_CMD_READS

Number of PCIe command reads issued for transmit prefetch queues. Four

counters, one each for reads of 64, 128, 192, or 256 bytes.

CQ_TGT_WAITING_ON_READ

Cycles on which target prefetch buffers are empty and pool has read requests

pending. Note that this counter does not increment on cycles for which commands

in another pool are being processed.

CQ_TX_WAITING_ON_READ

Cycles in which transmit prefetch buffers are empty and pool has read requests

pending. Note that this counter does not increment on cycles for which commands

in another pool are being processed. CQ maintains a count of the number of

command read requests pending for each of the four counter pools. The prefetch

unit increments these counts as a PCIe read is issued and decrement them as they

complete. In cycles for which there is no command to process, the head of the CQ

pipeline increments the CQ_TX_WAITING_ON_READ counter for each pool that

has read requests pending.

EE_ADDR_TRANS_PREFETCH_CNTR

Number of prefetched address translations.

EE_CBS_WRITTEN_CNTR

Number of combining buffers written to an event queue.

EE_DEFERRED_EQ_SWITCH_CNTR

Number of event queue buffer switches not performed as soon as requested due

to insufficient old event queue buffer free space immediately available to

enqueue the buffer switch event. The buffer switch is performed when space

becomes available in the old event queue buffer. Deferred event queue buffer

switches could be an indication that software is waiting too long before

scheduling buffer switch requests. One or more other events are likely to be

dropped when an event queue buffer switch is deferred. Pooled counter, pool

determined by CNTR_POOL_ID in the event queue descriptor.

EE_EQ_BUFFER_SWITCH_CNTR

Number of event queue buffer switches performed. Pooled counter, pool

determined by CNTR_POOL_ID in the event queue descriptor.

EE_EQ_STATUS_UPDATE_CNTR

Number of status updates written to an event queue. Status updates are written

to report an event queue fill level exceeding a configured threshold and to

report dropped events. Pooled counter, pool determined by CNTR_POOL_ID in

the event queue descriptor.

EE_EQ_SW_STATE_WR_CNTR

Number of times the event queue software state is updated using a fast path

write. The rate of increase of this counter relative to CBS_WRITTEN_CNTR might

provide an indication of how frequently software is servicing event queues.

Pooled counter, pool determined by CNTR_POOL_ID in the event queue

descriptor. This counter does not count every event queue software state fast

path write that occurs if the processing of writes received in quick

succession and targeting the same event queue becomes coalesced in the EEs

event queue state pipeline. This counter does not increment for fast path

writes that target a disabled event queue.

EE_EVENTS_DROPPED_FC_SC_CNTR

Number of flow control state-change full events that were not enqueued because

the event queue was full. This count also includes events dropped because the

targeted event queue buffer was disabled. This situation should only occur when

software incorrectly configures the event queue descriptor. Pooled counter, pool

determined by CNTR_POOL_ID in the event queue descriptor.

EE_EVENTS_DROPPED_ORDINARY_CNTR

Number of full events that were not enqueued because the event queue was full.

This count includes all dropped full events not included in either

EE_EVENTS_DROPPED_RSRVN_CNTR or EE_EVENTS_DROPPED_FC_SC_CNTR.

The sum of all three counters is the total number of dropped full events.

This count also includes events dropped because the targeted event queue

buffer was disabled. This situation should only occur when software incorrectly

configures the event queue descriptor. Pooled counter, pool determined by

CNTR_POOL_ID in the event queue descriptor.

EE_EVENTS_DROPPED_RSRVN_CNTR

Number of full events, subject to an event queue space reservation, which were

not enqueued because the event queue was full. This count also includes events

dropped because the targeted event queue buffer was disabled. This situation

should only occur when software incorrectly configures the event queue

descriptor. Pooled counter, pool determined by CNTR_POOL_ID in the event

queue descriptor.

EE_EVENTS_ENQUEUED_CNTR

Number of full events enqueued to an event queue. This count does not include

null events (EVENT_NULL_EVENT) unless an error is reported in the events return

code. The EE inserts error-free null events into event queues as needed to

maintain alignment requirements for real (non-null) events. Pooled counter, pool

determined by CNTR_POOL_ID in the event queue descriptor.

EE_EXPIRED_CBS_WRITTEN_CNTR

Number of partially full combining buffers written to their event queue because

too much time elapsed without additional events arriving to fill the buffer.

Pooled counter, pool determined by CNTR_POOL_ID in the event queue

descriptor.

EE_PARTIAL_CBS_WRITTEN_CNTR

Number of partially full combining buffers written to an event queue. A partially

full combining buffer has room for one or more additional events to be added to

it at the time it is released to be written to the event queue. Therefore, the

end of the buffer is padded with one or more null events. A partially full

combining buffer can be written to the event queue because either the size of

the next event to enqueue is such that, to maintain alignment requirements, the

next event needs to start a new combining buffer, or because too much time has

elapsed without additional events arriving to fill the combining buffer. Note,

this counter does not count combining buffers containing null events (between

real events at the start and end of the buffer) as partially full buffers.

HNI_DISCARD_CNTR

Number of packets discarded due to a timeout for each traffic class as indicated

by DISCARD_ERR.

HNI_LLR_TX_REPLAY_EVENT
HNI_LLR_RX_REPLAY_EVENT

Number of LLR replays. High rates (multiple per second) indicate that the LLR mechanism is providing protection on a poor quality link.

HNI_MULTICAST_PKTS_RECV_BY_TC

Number of multicast packets with good FCS received by TC. Multicast is indicated

when DMAC bit 40 is set and DMAC is not all 1s.

HNI_MULTICAST_PKTS_SENT_BY_TC

Number of multicast packets with good FCS sent by TC. Multicast is indicated

when DMAC bit 40 is set and DMAC is not all 1s.

HNI_PAUSE_RECV

Number of pause frames received for each enabled PCP (as identified by the PEV

field of the pause frame) when PFC pause is enabled. Reception of a standard

pause frame causes all counts to increment.

HNI_PAUSE_XOFF_SENT

Number of pause frames sent where XOFF is indicated for each PCP when PFC

pause is enabled. Transmission of a standard pause frame causes all counts to

increment.

HNI_PCS_CORRECTED_CW

Number of corrected codewords.

HNI_PCS_FECL_ERRORS

Number of errors in each FECL.

HNI_PCS_GOOD_CW

Number of codewords received with no errors.

HNI_PCS_UNCORRECTED_CW

Number of uncorrected code words received on the switch to NIC link. High rates

(multiple errors per second) indicate a poor quality link.

HNI_PFC_FIFO_OFLW_CNTR

Number of packets discarded at the tail of the PFC FIFO. This can happen when

the MFS setting is exceeded for that traffic class or the FIFO is nearly full.

This equates to each time the PFC_FIFO_OFLW error flag sets for a particular TC.

HNI_PKTS_RECV_BY_TC

Number of packets received by traffic class <n>; 8 counters, default classes

are 0 for request and 1 for response.

HNI_PKTS_SENT_BY_TC

Number of packets sent by traffic class <n>; 8 counters, default

classes are 0 for request and 1 for response.

HNI_RX_OK_<min>_to_<max>

Number of packets received in each of 12 size bins.

HNI_RX_PAUSED_<n>

Number of cycles in which the pause is applied on the receive path for traffic

class <n>; default classes are 0 for request and 1 for response. Indicates that

the network is supplying data faster than this endpoint can consume it. Divide by

1E9 to determine the proportion of time paused.

HNI_RX_STALL_IXE_PKTBUF

Number of system clocks for which the Rx path of the corresponding traffic class

is stalled due to lack of space in the IXE Packet Buffer.

HNI_TX_OK_<min>_to_<max>

Number of packets sent in each of 12 size bins:

- Small packets 27, 35, 64 bytes

- 65-128, 256-511, 512-1023, 1024-2047, 2048-4095, 4096-8191, 8192-Max

HNI_TX_PAUSED_<n>

Number of cycles in which the transmit path is paused for traffic class <n>;

default classes are 0 for request and 1 for response. Indicates that this endpoint

is supplying data faster than the network can deliver it. Divide by 1E9 to

determine the proportion of time paused.

IXE_DISP_DMAWR_REQS

Number of requests to DMA write controller.

IXE_DMAWR_STALL_P_CDT

Number of stalls due to no posted credits (cycles).

IXE_DMAWR_STALL_NP_CDT

Number of stalls due to no non-posted credits (cycles).

IXE_POOL_ECN_PKTS

Number of packets with ECN set. One counter for each of the four pools of PtlTEs.

IXE_POOL_NO_ECN_PKTS

Number of packets without ECN set. One counter for each of the four pools of

PtlTEs.

PI_PTI_TARB_MRD_PKTS

Number of memory read TLPs (all source).

PI_PTI_TARB_MWR_PKTS

Number of memory write TLPs (all source).

IXE_TC_REQ_ECN_PKTS

Number of request packets with ECN set, by TC.

IXE_TC_REQ_NO_ECN_PKTS

Number of request packets without ECN set, by TC.

IXE_TC_RSP_ECN_PKTS

Number of response packets with ECN set, by TC.

IXE_TC_RSP_NO_ECN_PKTS

Number of response packets without ECN set, by TC.

LPE_APPEND_CMDS

Number of Append commands received.

LPE_APPEND_SUCCESS

Number of Append commands LPE successfully completed. One counter for each

of the four pools of PtlTEs.

LPE_CYC_RRQ_BLOCKED

Number of cycles in which a PE Match Request Queue dequeue was blocked

because a Ready Request Queue was full. One counter for each of the four PEs.

LPE_NET_MATCH_LOCAL

Number of network requests LPE successfully matched to locally managed

buffers. One counter for each of the four pools of PtlTEs.

LPE_NET_MATCH_OVERFLOW_<n>

Number of messages where payload data was delivered to a buffer on the

overflow list because there was no match on the priority list. Four counters of

which 0 is the default. These messages incur higher cost because data must be

copied from the overflow buffer.

Compute the ratio priority/(priority + overflow) to determine the proportion of

messages for which receives were posted in advance.

LPE_NET_MATCH_PRIORITY_<n>

Number of messages matched on the priority list (that is, the receive was posted

before the message arrived). Four counters of which 0 is the default. These

messages incur lower cost because data is written directly to the user buffer.

LPE_NET_MATCH_REQUEST

Number of requests matched on request list (software endpoints).

LPE_NET_MATCH_REQUESTS

Number of requests matched.

LPE_NET_MATCH_SUCCESS

Number of network requests LPE successfully completed. One counter for each of

the four pools of PtlTEs.

LPE_NET_MATCH_USEONCE

Number of network requests LPE successfully matched to use-once buffers. One

counter for each of the four pools of PtlTEs.

LPE_NUM_TRUNCATED

Number of truncated packets. One counter for each of the four pools of PtlTEs.

LPE_RNDZV_PUTS

Number of rendezvous puts received.

LPE_RNDZV_PUTS_OFFLOADED

Number of Rendezvous Puts that LPE was able to offload. One counter for each of

the four pools of PtlTEs.

LPE_SEARCH_NID_ANY

Number of wildcard searches using NID_ANY, physical matching.

LPE_SEARCH_PID_ANY

Number of wildcard searches using PID_ANY, physical matching.

LPE_SEARCH_RANK_ANY

Number of wildcard searches using RANK_ANY, logical matching.

LPE_SETSTATE_CMDS

Number of SetState commands LPE received. One counter for each of the four

pools of PtlTEs.

LPE_SETSTATE_SUCCESS

Number of successful SetState commands. One counter for each of the four

pools of PtlTEs.

LPE_UNEXPECTED_GET_AMO

Number of Get and AMO packets that match on Overflow or Request, resulting

in RC_DELAYED.

MB_CMC_AXI_RD_REQUESTS

Number of AXI read requests received at the CMC channels.

MB_CMC_AXI_WR_REQUESTS

Number of AXI write requests received at the CMC channels.

MB_CRMC_AXI_RD_REQUESTS

Number of CRMC AXI read requests.

MB_CRMC_AXI_WR_REQUESTS

Number of CRMC AXI write requests.

MB_CRMC_RD_ERROR

Number of CRMC read errors.

MB_CRMC_RING_MBE

Number of CRMC ring in multi bit errors.

MB_CRMC_RING_RD_REQUESTS

Number of CRMC ring out read requests.

MB_CRMC_RING_SBE

Number of CRMC ring in single bit errors.

MB_CRMC_RING_WR_REQUESTS

Number of CRMC ring out write requests.

MB_CRMC_WR_ERROR

Number of CRMC write errors.

OXE_CHANNEL_IDLE

Number of cycles in which available bandwidth is not used.

OXE_MCU_MEAS

Number of flits sent by each MCU (configurable to count packets or messages).

OXE_PTL_TX_GET_MSGS_TSC

Number of portal Get messages sent by traffic shaping class. The following OXE

pkt types (c_oxe_pkt_type) are included: PKT_GET_REQ where SOM is set.

To be counted in the MCUOBC.

OXE_PTL_TX_PUT_MSGS_TSC

Number of portal Put messages sent by traffic shaping class.

OXE_PTL_TX_PUT_PKTS_TSC

Number of portal packets by traffic shaping class. Counted either in the MCUOBC

after the MCU is granted or at the traffic shaper after packet is selected.

OXE_STALL_FGFC_BLK

Number of cycles blocked for FGFC (matching entry exists and Credit >=0). Index

given by CSR C_OXE_CFG_FGFC_CNT[4].

OXE_STALL_FGFC_CNTRL

Number of cycles blocked for FGFC (matching entry exists). Index given by CSR

C_OXE_CFG_FGFC_CNT[4].

OXE_STALL_FGFC_END

Number of FGFC frames received with matching VNI that end FGFC (period == 0).

Index given by CSR C_OXE_CFG_FGFC_CNT[4].

OXE_STALL_FGFC_START

Number of FGFC frames received with matching VNI that start or continue FGFC

(period != 0). Index given by CSR C_OXE_CFG_FGFC_CNT[4].

OXE_STALL_IDC_NO_BUFF_BC

Number of cycles IDC command is enqueued (into ENQ FIFO) but cannot get

buffer. Counted per BC. This is an OR across all MCUs belonging to the same BC

that have an IDC command but do not have a cell (buffer) allocated.

OXE_STALL_PBUF_BC

Number of clocks a BC is not allowed to participate in MCUOBC arbitration

because it is waiting for packet buffer resources. One counter for each BC.

Count in PKTBUFF_REQ state.

OXE_STALL_PCT_BC

Number of clocks a BC is not allowed to participate in MCUOBC arbitration

because it is waiting for PCT resources. One counter for each BC. Count in

PKTBUFF_REQ state.

OXE_STALL_TS_NO_IN_CRD_TSC

Number of clocks a TSC is not allowed to participate in traffic shaping

arbitration because it is waiting for inbound shaping tokens. One counter for

each TSC.

OXE_STALL_TS_NO_OUT_CRD_TSC

Number of clocks a TSC is not allowed to participate in traffic shaping

arbitration because it is waiting for outbound shaping tokens. One counter for

each TSC.

OXE_STALL_WR_CONFLICT_PKT_BUFF_BNK

Number of cycles header write or IDC write collides with PCIe data return. Per

bank of Packet buffer.

PARBS_TARB_PI_POSTED_PKTS
PARBS_TARB_PI_POSTED_BLOCKED_CNT

Number of PCIe packets transferred using the posted path (for example, writes), and the number of cycles in which this path is blocked. Compute the ratio cycles/pkts. Values of more than a few cycles per packet indicate back pressure from the host. This endpoint is likely to be the cause of congestion.

PARBS_TARB_PI_NON_POSTED_PKTS
PARBS_TARB_PI_NON_POSTED_BLOCKED_CNT

Number of PCIe packets transferred using the non-posted path (for example, reads), and the number of cycles in which this path is blocked. Compute the ratio cycles/pkts. Values of more than a few cycles per packet indicate poor host performance (high read latencies). This endpoint is likely to be injecting at a low rate.

PCT_CONN_SCT_OPEN

Number of open requests.

PCT_HOST_ACCESS_LATENCY

Request/response latency histogram, 32 bins.

PCT_NO_TCT_NACKS

Number of resource exhaustion NACKs. Retry handler is invoked.

PCT_NO_TRS_NACKS

PCT_NO_MST_NACKS

PCT_REQ_ORDERED

Number of ordered requests.

PCT_REQ_UNORDERED

Number of unordered requests.

PCT_REQ_RSP_LATENCY

Host access latency histogram, 16 bins.

PCT_RESPONSES_RECEIVED

Number of responses received (all unordered).

PCT_RETRY_SRB_REQUESTS

Number of retries.

PCT_SCT_TIMEOUTS

Number of response timeouts (or packet loss in the network). Retry handler is

invoked.

PCT_SPT_TIMEOUTS

Number of response timeouts (packet loss in the network). Retry handler is

invoked.

PCT_TRS_RSP_NACK_DROPS

Number of NACKs dropped. Retry handler is invoked.

TOU_CT_CMD_COUNTS

Number of the successfully validated commands in C_CT_OP_T. Offset into the

counter array is equal to the OpCode.

TOU_NUM_LIST_REBUILDS

Number of list CT list rebuilds. Pooled counter, pool is equal to the PFQ number.

TOU_NUM_TRIG_CMDS

Number of triggered commands. Pooled counter, pool is equal to the PFQ number.