likwid perfctr

likwid-perfctr: Measuring applications' interaction with the hardware using the hardware performance counters

While there are already many tools for measuring hardware performance counters, a lightweight command line tool for simple end-to-end measurements was still missing. The Linux MSR module, which provides an interface to access model specific registers from user space, allows us to read out hardware performance counters with an unmodified Linux kernel. Moreover, recent Intel systems provide Uncore hardware counters through PCI interfaces.

likwid-perfctr supports the following modes:

wrapper mode: Use likwid-perfctr as a wrapper to your application. You can measure without altering your code.
stethoscope mode: Measure performance counters for a variable time duration independent of any code running.
timeline mode: Output performance metrics at a specified interval (in ms or s).
marker API: Measure only selected regions of your code; likwid-perfctr still controls what to measure.

There are pre-configured event sets, called performance groups, with useful pre-selected events and derived metrics. Alternatively, you can specify a custom event set. In a single event set, you can measure as many events as there are physical counters on a given CPU or socket. See the architecture specific pages for more details. At startup, likwid-perfctr validates whether each event can be measured on the configured counter.
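As a rough illustration of a custom event set: LIKWID's -g switch takes comma-separated EVENT:COUNTER pairs, so every event occupies one counter register. The sketch below assembles such a string. The helper build_event_string and the EventSlot type are hypothetical and not part of LIKWID; the event and counter names are taken from the BRANCH output shown later on this page.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One event bound to one counter register, e.g. INSTR_RETIRED_ANY on FIXC0. */
typedef struct {
    const char *event;
    const char *counter;
} EventSlot;

/*
 * Hypothetical helper (not part of LIKWID): joins the slots into the
 * "EVENT:COUNTER,EVENT:COUNTER" form passed to the -g switch.
 * Returns 0 on success, -1 on a bad name or if the buffer is too small.
 */
int build_event_string(const EventSlot *slots, int n, char *out, size_t len)
{
    assert(slots != NULL && out != NULL && len > 0);
    size_t used = 0;
    out[0] = '\0';
    for (int i = 0; i < n; i++) {
        /* A comma inside a name would be read as a separator. */
        if (strchr(slots[i].event, ',') || strchr(slots[i].counter, ',')) {
            return -1;
        }
        int w = snprintf(out + used, len - used, "%s%s:%s",
                         i > 0 ? "," : "", slots[i].event, slots[i].counter);
        if (w < 0 || (size_t)w >= len - used) {
            return -1;
        }
        used += (size_t)w;
    }
    return 0;
}
```

The resulting string would be passed as the argument of -g. Whether a given event can actually run on the chosen counter is still decided by likwid-perfctr at startup.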

Because likwid-perfctr performs simple end-to-end measurements and knows nothing about the code being executed, it is crucial to pin your application. The only relation between the measurement and your code is the pinning. Because LIKWID works in user space, it cannot measure a single process; it always measures CPUs or sockets. likwid-perfctr has all the pinning functionality of likwid-pin built in, so you need no additional tool for pinning. You can still control affinity yourself if you prefer.

likwid-perfctr's performance groups are simple text files and can be easily changed or extended. It is simple to create your own performance groups with custom derived metrics. In contrast to previous versions of LIKWID, no recompilation is needed anymore after changing a performance group.

Supported architectures

See this page for supported architectures

Prerequisites

Depending on the selection of the access mode (direct or accessdaemon) the prerequisites are different.

Always required prerequisites

The MSR device files must be present. You can check this with ls /dev/cpu/*/msr, which should list one msr device file per available CPU. If the files are missing, load the msr kernel module with sudo modprobe msr and check for the MSR device files again. To load the module at startup, add a line containing msr to /etc/modules (the filename might be different for your distribution).
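The same check can be done from a program. The following is a minimal sketch that only tests whether the device files exist, not whether they are readable or writable. The function names are made up for this example, and the caller supplies the number of CPUs.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Writes the MSR device path of one CPU, e.g. /dev/cpu/3/msr, into buf. */
int msr_device_path(int cpu, char *buf, size_t len)
{
    assert(cpu >= 0 && buf != NULL);
    int w = snprintf(buf, len, "/dev/cpu/%d/msr", cpu);
    return (w > 0 && (size_t)w < len) ? 0 : -1;
}

/* Counts how many of the CPUs 0..ncpus-1 have an MSR device file. */
int count_msr_devices(int ncpus)
{
    int found = 0;
    char path[64];
    for (int cpu = 0; cpu < ncpus; cpu++) {
        if (msr_device_path(cpu, path, sizeof path) == 0 &&
            access(path, F_OK) == 0) {
            found++;
        }
    }
    return found;
}
```

If count_msr_devices(n) is smaller than the number of CPUs n, the msr module is probably not loaded.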

Prerequisites for direct access mode

The direct access mode has less overhead than the access daemon but requires higher privileges for the users. Set ACCESSMODE=direct in config.mk to use this mode.

Make sure your user has sufficient rights to read and write the MSR device files. You can grant read and write access like this: sudo chmod +rw /dev/cpu/*/msr
The MSR device files are strongly protected to avoid security vulnerabilities. To overcome these protections, do one of the following:
- Set capabilities on LIKWID's Lua interpreter: sudo setcap cap_sys_rawio+ep <PREFIX>/bin/likwid-lua, where <PREFIX> is the installation path. Because the capabilities system behaves differently across operating systems, this might not be enough on your system. It provides access only to core-local counters, not to Uncore counters.
- Make the Lua interpreter setuid root. This is not recommended, since it allows anybody who uses LIKWID's Lua interpreter to execute code with root privileges.
For Uncore support, read/write access to /dev/mem is required. For example, you can run sudo setfacl -m u:yourusername:rw /dev/mem. This allows access to ALL memory on the system. Do not do this unless you absolutely do not care about the machine's security.

Prerequisites for access daemon mode

In order to give ordinary users access to the hardware performance registers, you can use the access daemon. It is written with security in mind: it restricts accesses to hardware performance related registers, so users cannot read or write system related registers. When you select ACCESSMODE=accessdaemon in config.mk, you only need to install LIKWID with sudo make install. This sets the proper rights for the access daemon. Do not change the CHOWN variables in config.mk unless you want to use different permissions (group that is allowed to access the MSR device files, different name of the root user, etc.). If non-root users need access to Uncore counters, make sure they have permission to read and write /dev/mem. Because being able to read and write the entire memory allows you to take over the system, there is not much point in trying to run the access daemon as a non-root user.

Update for Linux kernel 5.9 and newer: With Linux 5.9, the msr kernel module got some security fixes. The major change for LIKWID is that all MSRs are now non-writable by default. To enable writes again, add msr.allow_writes=on to the boot options of your operating system. This affects only ACCESSMODE=direct and ACCESSMODE=accessdaemon. If you use the perf_event backend, you don't have to change anything.
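To verify that the boot option is active, one can look for it in the kernel command line, which Linux exposes in /proc/cmdline. The helper below is a hypothetical sketch that works on a string already read from that file. It only tests whether the option is present as a whole word and makes no claim about how the kernel treats repeated or conflicting options.

```c
#include <assert.h>
#include <string.h>

/*
 * Returns 1 if the boot option msr.allow_writes=on appears as a whole
 * word in the given kernel command line, else 0.
 */
int cmdline_allows_msr_writes(const char *cmdline)
{
    static const char opt[] = "msr.allow_writes=on";
    const size_t optlen = sizeof opt - 1;
    assert(cmdline != NULL);
    for (const char *p = strstr(cmdline, opt); p != NULL;
         p = strstr(p + 1, opt)) {
        int starts_word = (p == cmdline) || p[-1] == ' ';
        char end = p[optlen];
        int ends_word = (end == '\0' || end == ' ' || end == '\n');
        if (starts_word && ends_word) {
            return 1;
        }
    }
    return 0;
}
```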

Update for Linux kernel 5.10 and newer: We got reports that with Linux 5.10 the PCI accesses are also restricted by security mechanisms. To fix this, the access daemon requires an additional capabilities flag: sudo setcap cap_sys_admin,cap_sys_rawio=ep EXECUTABLE

See also the file INSTALL for further details. In security sensitive environments, such as multi-user systems or HPC clusters, uncontrolled access to all MSR registers is a security problem. For solutions to this issue, have a look at the build instructions and likwid-accessD.

Options

-h, --help		 Help message
-v, --version		 Version information
-V, --verbose <level>	 Verbose output, 0 (only errors), 1 (info), 2 (details), 3 (developer)
-c <list>		 Processor ids to measure (required), e.g. 1,2-4,8
-C <list>		 Processor ids to pin threads and measure, e.g. 1,2-4,8
			 For information about the <list> syntax, see likwid-pin
-G <list>		 GPU ids to measure
-g, --group <string>	 Performance group or custom event set string for CPUs
-W <string>		 Performance group or custom event set string for Nvidia GPUs
-H			 Get group help (together with -g switch)
-s, --skip <hex>	 Bitmask with threads to skip
-M <0|1>		 Set how MSR registers are accessed, 0=direct, 1=accessDaemon
-a			 List available performance groups
-e			 List available events and counter registers
-E <string>              List available events and corresponding counters that match <string> (case-insensitive)
-i, --info		 Print CPU info
-T <time>		 Switch eventsets with given frequency
Modes:
-S <time>		 Stethoscope mode with duration in s, ms or us, e.g. 20ms
-t <time>		 Timeline mode with frequency in s, ms or us, e.g. 300ms
-m, --marker		 Use Marker API inside code
Output options:
-o, --output <file>	 Store output to file. (Optional: Apply text filter according to filename suffix)
-O			 Output easily parseable CSV instead of fancy tables

Basic Usage (Wrapper mode)

Output help text with

$ likwid-perfctr -h

There are two required flags: -c to configure the cores for which the counters should be measured, and -g to specify which group or event set you want to measure. The core id list is a comma separated list which can also contain ranges, e.g. 1,2,4-7. This list can be specified in all variants supported by likwid-pin, from physical processor ids to different logical variants. To figure out the thread and cache topology you can use likwid-topology. As likwid-perfctr measures processors and has no knowledge about your process or threads, you have to ensure that the code you want to measure really runs on the processors you monitor with likwid-perfctr. likwid-perfctr includes all functionality of likwid-pin for pinning a threaded application. Alternatively, you can take care of the pinning yourself with another tool or from within the code.
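To make the list notation concrete, here is a small sketch that expands a list of physical processor ids with ranges into individual ids. It is an illustration only and not LIKWID's own parser: the function name is hypothetical, and the logical variants accepted by likwid-pin (such as S0:1 or N:0-7) are deliberately not handled.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Expands a physical CPU list such as "1,2,4-7" into cpus[].
 * Returns the number of ids written, or -1 on a syntax error or if more
 * than max ids would be produced.
 */
int expand_cpu_list(const char *list, int *cpus, int max)
{
    assert(list != NULL && cpus != NULL);
    int n = 0;
    const char *p = list;
    while (*p != '\0') {
        char *end;
        long first = strtol(p, &end, 10);
        if (end == p || first < 0) return -1;
        long last = first;
        p = end;
        if (*p == '-') {                 /* range "first-last" */
            p++;
            last = strtol(p, &end, 10);
            if (end == p || last < first) return -1;
            p = end;
        }
        for (long c = first; c <= last; c++) {
            if (n >= max) return -1;
            cpus[n++] = (int)c;
        }
        if (*p == ',') {
            p++;
            if (*p == '\0') return -1;   /* trailing comma */
        } else if (*p != '\0') {
            return -1;                   /* unexpected character */
        }
    }
    return n;
}
```

For the list 1,2,4-7 this yields the six ids 1, 2, 4, 5, 6 and 7.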

For gathering information about hardware performance capabilities and performance groups use the -a, -g and -H switches.

Print all supported groups on a processor to stdout:

$ likwid-perfctr -a

To get a list with all supported counter registers and events, call:

$ likwid-perfctr -e | less

To get a list with all supported events and corresponding counter registers that match a string (case insensitive), call:

$ likwid-perfctr -E <string>

A help text explaining a specific event group can be requested with -H together with the -g switch:

$ likwid-perfctr -H -g MEM

This prints the text below LONG in the performance group file. For custom performance groups, it is recommended to add a describing text and the formulas of the derived metrics.

To use likwid-perfctr for a serial application execute:

$ likwid-perfctr  -C S0:1  -g BRANCH  ./a.out

This will pin the application to the second core (index 1) on socket zero (S0) and measure the performance group BRANCH on this core. An explanation of the CPU string notation can be found on the page likwid-pin. The output for the serial application looks like this:

--------------------------------------------------------------------------------
CPU name:	Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
CPU type:	Intel Core Haswell processor
CPU clock:	3.39 GHz
--------------------------------------------------------------------------------
YOUR PROGRAM OUTPUT
--------------------------------------------------------------------------------
Group 1: BRANCH
+------------------------------+---------+---------+
|             Event            | Counter |  Core 1 |
+------------------------------+---------+---------+
|       INSTR_RETIRED_ANY      |  FIXC0  |  201137 |
|     CPU_CLK_UNHALTED_CORE    |  FIXC1  |  375590 |
|     CPU_CLK_UNHALTED_REF     |  FIXC2  | 1595994 |
| BR_INST_RETIRED_ALL_BRANCHES |   PMC0  |  44079  |
| BR_MISP_RETIRED_ALL_BRANCHES |   PMC1  |   3982  |
+------------------------------+---------+---------+
+----------------------------+--------------+
|           Metric           |    Core 1    |
+----------------------------+--------------+
|     Runtime (RDTSC) [s]    | 3.522605e-03 |
|    Runtime unhalted [s]    | 1.107221e-04 |
|         Clock [MHz]        | 7.982933e+02 |
|             CPI            | 1.867334e+00 |
|         Branch rate        | 2.191491e-01 |
|  Branch misprediction rate | 1.979745e-02 |
| Branch misprediction ratio | 9.033780e-02 |
|   Instructions per branch  | 4.563103e+00 |
+----------------------------+--------------+

The output will always consist of a table with the raw event counts and another table with derived metrics. The columns are the processor ids measured. If you measure more than one core, there is another table with statistical data like sum, minimum, maximum and average of all measured cores.
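The ratio metrics in the second table can be reproduced from the raw counts in the first one. The sketch below uses formulas inferred from the numbers printed above; the authoritative definitions are those printed by likwid-perfctr -H -g BRANCH. The type and function names are made up for this example.

```c
#include <assert.h>

/* Raw counts of one core, named after the events in the BRANCH table. */
typedef struct {
    double instr_retired;      /* INSTR_RETIRED_ANY */
    double clk_unhalted_core;  /* CPU_CLK_UNHALTED_CORE */
    double br_retired;         /* BR_INST_RETIRED_ALL_BRANCHES */
    double br_mispredicted;    /* BR_MISP_RETIRED_ALL_BRANCHES */
} BranchCounts;

typedef struct {
    double cpi;               /* cycles per instruction */
    double branch_rate;       /* branches per instruction */
    double mispred_rate;      /* mispredicted branches per instruction */
    double mispred_ratio;     /* mispredicted branches per branch */
    double instr_per_branch;  /* instructions per branch */
} BranchMetrics;

/* Formulas inferred from the example output, not copied from the group file. */
BranchMetrics derive_branch_metrics(BranchCounts c)
{
    assert(c.instr_retired > 0 && c.br_retired > 0);
    BranchMetrics m;
    m.cpi              = c.clk_unhalted_core / c.instr_retired;
    m.branch_rate      = c.br_retired / c.instr_retired;
    m.mispred_rate     = c.br_mispredicted / c.instr_retired;
    m.mispred_ratio    = c.br_mispredicted / c.br_retired;
    m.instr_per_branch = c.instr_retired / c.br_retired;
    return m;
}
```

With the counts of Core 1 from the table (201137, 375590, 44079, 3982), the CPI evaluates to about 1.8673 and the branch misprediction ratio to about 0.0903, in line with the printed metrics.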

In general, the events have the same names as in the official processor manuals (with "." replaced by "_"). The relevant manuals are the Intel Software Development Manual 3B Appendix A and, for AMD, the BIOS and Kernel Developers Guides (BKDG) of the appropriate processor. You can also have a look at the optimization manuals provided by the vendors for interesting event sets, or at Intel's performance monitoring database (https://github.com/intel/perfmon, https://github.com/TomTheBear/perfmondb). The OFFCORE_RESPONSE events on Intel systems do not follow the Intel notation. You have to specify the bits for the filter registers yourself using the OFFCORE_RESPONSE_0/1_OPTIONS event with the event options match0 (lower register part) and match1 (higher register part). LIKWID also introduces some events that cannot be found in the official documentation. They are commonly known events with pre-configured event options.

LIKWID counts all events in user-space by default. Kernel-space counting is deactivated, but on some architectures it can be enabled by adding the KERNEL counter option. See the description of counter options for more details. It is not possible to count only kernel-space.

Basic threaded usage

For threaded use, nothing changes apart from the -C command line argument. The application must be compiled with threading support. You do not need to set OMP_NUM_THREADS or CILK_WORKERS; likwid-perfctr sets them according to the given CPU list. If the environment variables are already set, likwid-perfctr does not overwrite them.

$ likwid-perfctr -C 0-3 -g BRANCH ./a.out
--------------------------------------------------------------------------------
CPU name:	Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
CPU type:	Intel Core Haswell processor
CPU clock:	3.39 GHz
--------------------------------------------------------------------------------
YOUR PROGRAM OUTPUT
--------------------------------------------------------------------------------
Group 1: BRANCH
+------------------------------+---------+----------+---------+----------+---------+
|             Event            | Counter |  Core 0  |  Core 1 |  Core 2  |  Core 3 |
+------------------------------+---------+----------+---------+----------+---------+
|       INSTR_RETIRED_ANY      |  FIXC0  | 15585960 | 5526616 |  7679943 | 4045942 |
|     CPU_CLK_UNHALTED_CORE    |  FIXC1  | 15025112 | 4660629 |  7745757 | 3406840 |
|     CPU_CLK_UNHALTED_REF     |  FIXC2  | 44696128 | 9473964 | 22825288 | 3762474 |
| BR_INST_RETIRED_ALL_BRANCHES |   PMC0  |  1470984 |  752872 |  1163894 |  345736 |
| BR_MISP_RETIRED_ALL_BRANCHES |   PMC1  |   9457   |   8238  |   25573  |   1025  |
+------------------------------+---------+----------+---------+----------+---------+
+-----------------------------------+---------+----------+---------+----------+------------+
|               Event               | Counter |    Sum   |   Min   |    Max   |     Avg    |
+-----------------------------------+---------+----------+---------+----------+------------+
|       INSTR_RETIRED_ANY STAT      |  FIXC0  | 32838461 | 4045942 | 15585960 | 8209615.25 |
|     CPU_CLK_UNHALTED_CORE STAT    |  FIXC1  | 30838338 | 3406840 | 15025112 |  7709584.5 |
|     CPU_CLK_UNHALTED_REF STAT     |  FIXC2  | 80757854 | 3762474 | 44696128 | 20189463.5 |
| BR_INST_RETIRED_ALL_BRANCHES STAT |   PMC0  |  3733486 |  345736 |  1470984 |  933371.5  |
| BR_MISP_RETIRED_ALL_BRANCHES STAT |   PMC1  |   44293  |   1025  |   25573  |  11073.25  |
+-----------------------------------+---------+----------+---------+----------+------------+
+----------------------------+--------------+--------------+--------------+--------------+
|           Metric           |    Core 0    |    Core 1    |    Core 2    |    Core 3    |
+----------------------------+--------------+--------------+--------------+--------------+
|     Runtime (RDTSC) [s]    | 6.292864e-02 | 6.292864e-02 | 6.292864e-02 | 6.292864e-02 |
|    Runtime unhalted [s]    | 4.429985e-03 | 1.374134e-03 | 2.283749e-03 | 1.004468e-03 |
|         Clock [MHz]        | 1.140153e+03 | 1.668508e+03 | 1.150968e+03 | 3.071098e+03 |
|             CPI            | 9.640158e-01 | 8.433061e-01 | 1.008570e+00 | 8.420388e-01 |
|         Branch rate        | 9.437879e-02 | 1.362266e-01 | 1.515498e-01 | 8.545253e-02 |
|  Branch misprediction rate | 6.067640e-04 | 1.490605e-03 | 3.329842e-03 | 2.533403e-04 |
| Branch misprediction ratio | 6.429030e-03 | 1.094210e-02 | 2.197193e-02 | 2.964690e-03 |
|   Instructions per branch  | 1.059560e+01 | 7.340711e+00 | 6.598490e+00 | 1.170240e+01 |
+----------------------------+--------------+--------------+--------------+--------------+
+---------------------------------+--------------+--------------+-------------+----------------+
|              Metric             |      Sum     |      Min     |     Max     |       Avg      |
+---------------------------------+--------------+--------------+-------------+----------------+
|     Runtime (RDTSC) [s] STAT    |  0.25171456  |  0.06292864  |  0.06292864 |   0.06292864   |
|    Runtime unhalted [s] STAT    |  0.009092336 |  0.001004468 | 0.004429985 |   0.002273084  |
|         Clock [MHz] STAT        |   7030.727   |   1140.153   |   3071.098  |   1757.68175   |
|             CPI STAT            |   3.6579307  |   0.8420388  |   1.00857   |   0.914482675  |
|         Branch rate STAT        |  0.46760772  |  0.08545253  |  0.1515498  |   0.11690193   |
|  Branch misprediction rate STAT | 0.0056805513 | 0.0002533403 | 0.003329842 | 0.001420137825 |
| Branch misprediction ratio STAT |  0.04230775  |  0.00296469  |  0.02197193 |  0.0105769375  |
|   Instructions per branch STAT  |   36.237201  |    6.59849   |   11.7024   |   9.05930025   |
+---------------------------------+--------------+--------------+-------------+----------------+

Please note that in previous versions of LIKWID you had to specify the threading implementation used. This is not necessary anymore. LIKWID uses a pinning library that overloads the call of pthread_create, the thread creation procedure used by many threading solutions (of course PThreads but also OpenMP, Cilk+, C++11 threads).

On newer processors there is one issue related to Uncore events. The Uncore counters measure per socket. Therefore likwid-perfctr has a socket lock which ensures that only one thread per socket starts the counters and only one thread per socket stops them. The first CPU initialized per socket gets and keeps the lock for the whole execution time. Be aware that in the statistics tables, the processors that haven't measured the Uncore event are included, so only the values MAX and SUM are usable.

Performance groups

For common tasks there exist pre-configured event sets. These groups provide useful event sets and compute common derived metrics. We try to provide a basic set of groups on all architectures. Due to the differing capabilities some groups may be processor specific. You can print available groups on an architecture with likwid-perfctr -a. For processor specific information about what events are chosen for the groups use the -H -g group switch. This gives you detailed documentation from which events the derived metrics are computed.

Using the Marker API

The Marker API allows you to measure named regions of your code. Overlap or nesting of the regions is allowed. You can also enter a region multiple times, e.g. in a loop. The counters for each region are accumulated. In the threaded case, you can have serial and threaded regions.

The Marker API only reads out the counters. The configuration of the counters is still handled via the wrapper application likwid-perfctr. In order to use the LIKWID Marker API, you must include the file likwid-marker.h and link your code against the LIKWID library. In some cases you also need Pthreads enabled during linking, commonly done by setting -pthread on the compiler's command line. To let you quickly toggle the Marker API, the LIKWID header contains a set of macros that are activated by defining LIKWID_PERFMON when building your software. Guard the include of the LIKWID header with LIKWID_PERFMON and provide empty fallback macros, so that your code also compiles if LIKWID is not available.

For gcc or icc, the build command looks like this, for example:

$ gcc -O3 -fopenmp -pthread -o test dofp.c -DLIKWID_PERFMON -I<PATH_TO_LIKWID>/include -L<PATH_TO_LIKWID>/lib -llikwid -lm

Below is an example showing the usage of the Marker API for a serial code:

// This block allows the code to compile with and without the LIKWID header in place
#ifdef LIKWID_PERFMON
#include <likwid-marker.h>
#else
#define LIKWID_MARKER_INIT
#define LIKWID_MARKER_THREADINIT
#define LIKWID_MARKER_SWITCH
#define LIKWID_MARKER_REGISTER(regionTag)
#define LIKWID_MARKER_START(regionTag)
#define LIKWID_MARKER_STOP(regionTag)
#define LIKWID_MARKER_CLOSE
#define LIKWID_MARKER_GET(regionTag, nevents, events, time, count)
#endif

int main(void)
{
    LIKWID_MARKER_INIT;        // set up the Marker API once per process
    LIKWID_MARKER_THREADINIT;  // set up the calling thread
    LIKWID_MARKER_START("Compute");
    // Your code to measure
    LIKWID_MARKER_STOP("Compute");
    LIKWID_MARKER_CLOSE;       // write out the results for likwid-perfctr
    return 0;
}

For a threaded code it is important to call the following sequence of function calls from the serial part of the program:

LIKWID_MARKER_INIT;
[...]
LIKWID_MARKER_CLOSE;

If you use the Marker API together with likwid-accessD, it is highly recommended to call

LIKWID_MARKER_REGISTER(string);

for each code region in every application thread that you want to measure, using the same identifier strings as in the start and stop calls. This creates basic structures and establishes the connection to the access daemon. If you don't do it and your code runs only for a short time, the values of the first region in the code will be too low.
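Put together, a threaded measurement could look like the following sketch. It assumes OpenMP as the threading model; the region name "Sum" and the function measured_sum are made up for illustration. For brevity the init and close calls sit inside the function, whereas a real program would issue them once from its serial part. Only the fallback macros needed here are defined.

```c
#include <assert.h>

#ifdef LIKWID_PERFMON
#include <likwid-marker.h>
#else
#define LIKWID_MARKER_INIT
#define LIKWID_MARKER_REGISTER(regionTag)
#define LIKWID_MARKER_START(regionTag)
#define LIKWID_MARKER_STOP(regionTag)
#define LIKWID_MARKER_CLOSE
#endif

/* Sums 0..n-1 in an OpenMP parallel region that is measured as "Sum". */
long measured_sum(int n)
{
    assert(n >= 0);
    long total = 0;

    LIKWID_MARKER_INIT;              /* serial part, before threads start */

    #pragma omp parallel
    {
        /* Each thread registers the region before its first start call. */
        LIKWID_MARKER_REGISTER("Sum");

        LIKWID_MARKER_START("Sum");
        #pragma omp for reduction(+:total)
        for (int i = 0; i < n; i++) {
            total += i;
        }
        LIKWID_MARKER_STOP("Sum");
    }

    LIKWID_MARKER_CLOSE;             /* serial part, after the parallel region */
    return total;
}
```

Built without -fopenmp and without -DLIKWID_PERFMON, the pragmas and macros are ignored and the function simply runs serially.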

For convenience there is also a simple API to pin your code or process or get the processor id.

likwid_pinProcess(int processorId);
likwid_pinThread(int processorId);
likwid_getProcessorId();

Starting with release 4.0.0, LIKWID introduces some more Marker API calls. To switch between multiple event sets (this causes much more overhead than the other API functions):

LIKWID_MARKER_SWITCH;

Moreover, if you want to reduce the overhead of LIKWID_MARKER_START, you can register the region names beforehand. This avoids creating the hash tables serially, which can cause timing problems. It is optional but highly recommended!

LIKWID_MARKER_REGISTER("Compute");

If you want to process the aggregated measurement values inside of your application:

LIKWID_MARKER_GET("Compute", nevents, events, time, count);

Here nevents is an int* that gives the length of the array events (type double*) on entry and contains the number of filled entries on return. time has type double* and count has type int*.
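A possible way to call it is sketched below. The struct RegionResult, the bound MAX_EVENTS and the function measure_and_get are assumptions of this example and not part of LIKWID. The get call is placed before the close call here.

```c
#include <assert.h>

#ifdef LIKWID_PERFMON
#include <likwid-marker.h>
#else
#define LIKWID_MARKER_INIT
#define LIKWID_MARKER_START(regionTag)
#define LIKWID_MARKER_STOP(regionTag)
#define LIKWID_MARKER_CLOSE
#define LIKWID_MARKER_GET(regionTag, nevents, events, time, count)
#endif

#define MAX_EVENTS 16   /* assumed upper bound for this sketch */

/* Results of one region for the calling thread. */
typedef struct {
    int    nevents;             /* in: array length, out: filled entries */
    double events[MAX_EVENTS];  /* aggregated raw counts */
    double time;                /* aggregated region runtime */
    int    count;               /* how often the region was entered */
} RegionResult;

RegionResult measure_and_get(int iterations)
{
    assert(iterations >= 0);
    RegionResult r = { .nevents = MAX_EVENTS };  /* rest is zero-initialized */
    volatile double x = 0.0;

    LIKWID_MARKER_INIT;
    for (int i = 0; i < iterations; i++) {
        LIKWID_MARKER_START("Compute");
        x += 1.0;                                /* stand-in for real work */
        LIKWID_MARKER_STOP("Compute");
    }
    /* Pass the addresses of the scalars; events is already a pointer. */
    LIKWID_MARKER_GET("Compute", &r.nevents, r.events, &r.time, &r.count);
    LIKWID_MARKER_CLOSE;
    return r;
}
```

When built without LIKWID_PERFMON the macros expand to nothing, so the struct keeps its initial values.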

Note: No whitespace characters are allowed in the region tag!

If you want to reset the counts for a region: (available in 4.3.3 and later)

LIKWID_MARKER_RESET("Compute");

The call has to be performed by each thread to reset its own values.

In order to run an executable with instrumentation, you have to activate the Marker API for likwid-perfctr using the -m switch:

$ likwid-perfctr -C 0-3 -g BRANCH -m ./a.out

Since the CPU list and the event set are given to likwid-perfctr and not programmed into the executable, you have full flexibility for measurements without further modifying the executable itself.

Notice: Each thread reads the counters individually, so a thread with less work will read the counters before the other threads. This can be crucial when the thread with little work is the one that reads the Uncore counters. The Uncore counters are socket-specific, and LIKWID commonly uses the first hardware thread in the affinity list of a socket to read them. For example, if you have two sockets with hardware threads 0,1,2,3 and 4,5,6,7 and you run with 1,2,3,5,6,7, hardware threads 1 and 5 will measure the Uncore counters. Consequently, their execution of user code is delayed, and the other threads might already perform some of their iterations before the Uncore threads start to execute the user code.

Example: You want the memory data volume of a loop executed by multiple threads. While the threads responsible for reading the Uncore counters are still reading the stopped counters, the other threads already load and store data from/to memory. This data volume is not counted, and consequently the memory data volume will be lower than expected.

Use barriers or conditional waits to synchronize the threads if you want exact measurements.

We are currently thinking about providing Marker API calls that include barriers and/or an environment variable that activates barriers in all Marker API calls.

Using the Marker API with Fortran 90

There is a native interface for using the LIKWID Marker API with Fortran 90 programs. It is not enabled by default, so you have to enable it in the config.mk file. If you enable it, the Intel Fortran compiler flags are set. To switch to gfortran, edit ./make/include_GCC.mk and set gfortran with the corresponding flags. Make sure that the Fortran interface module likwid.mod is in your module include path and that your program is linked against the LIKWID library.

For the Intel Fortran compiler this can look as follows:

$ ifort -I<PATH_TO_LIKWID>/include -O3 -o fortran chaos.F90 -L<PATH_TO_LIKWID>/lib -llikwid  -lpthread -lm -DLIKWID_PERFMON

There is an example of how to use the Marker API in Fortran in the test directory (chaos.F90) and the examples directory (F-markerAPI.F90). Code example:

program marker_example
  use likwid
  call likwid_markerInit()
  call likwid_markerThreadInit()
  call likwid_markerStartRegion("sub")
  ! Do stuff
  call likwid_markerStopRegion("sub")
  call likwid_markerClose()
end program marker_example

All functions that are available in the C Marker API are also available for Fortran 90, including likwid_markerRegisterRegion, likwid_markerNextGroup and likwid_markerGetRegion.

Syntax of the intermediate Marker API file

When the instrumented code closes the Marker API, it writes a file with the results to disk. By default, the /tmp directory is used and the common file name is likwid_<PID_OF_PERFCTR>.txt. The syntax of the file is:

<nrThreads> <nrRegions> <nrGroups>
<regionID_1>:<regionName_1>
[...]
<regionID_n>:<regionName_n>
<regionID_1> <groupID> <cpuID_1> <callCount> <regionTime> <nrEvents> <event1> <event2> ... <eventM_g>
<regionID_1> <groupID> <cpuID_2> <callCount> <regionTime> <nrEvents> <event1> <event2> ... <eventM_g>
[...]
<regionID_n> <groupID> <cpuID_L-1> <callCount> <regionTime> <nrEvents> <event1> <event2> ... <eventM_g>
<regionID_n> <groupID> <cpuID_L> <callCount> <regionTime> <nrEvents> <event1> <event2> ... <eventM_g>

where <regionName_1> is the actual user-provided string suffixed with -<groupId>.
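To illustrate the layout of the result lines, the following sketch parses a single one of them with sscanf. It assumes whitespace-separated fields exactly as in the syntax description above; the struct, the bound MAX_EVENTS and the function name are hypothetical. For real use, the LIKWID helper functions are the better choice.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_EVENTS 32   /* assumed upper bound for this sketch */

/* One result line: a region measured by one group on one CPU. */
typedef struct {
    int    region_id, group_id, cpu_id, call_count;
    double region_time;
    int    nevents;
    double events[MAX_EVENTS];
} RegionLine;

/* Parses one result line; returns 0 on success, -1 on malformed input. */
int parse_region_line(const char *line, RegionLine *out)
{
    assert(line != NULL && out != NULL);
    int pos = 0;
    if (sscanf(line, "%d %d %d %d %lf %d%n",
               &out->region_id, &out->group_id, &out->cpu_id,
               &out->call_count, &out->region_time, &out->nevents,
               &pos) != 6) {
        return -1;
    }
    if (out->nevents < 0 || out->nevents > MAX_EVENTS) {
        return -1;
    }
    for (int i = 0; i < out->nevents; i++) {
        int used = 0;
        if (sscanf(line + pos, "%lf%n", &out->events[i], &used) != 1) {
            return -1;   /* fewer values than announced by nrEvents */
        }
        pos += used;
    }
    return 0;
}
```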

For parsing this file the LIKWID library provides helper functions:

int perfmon_readMarkerFile(const char* filename): Read the Marker API result file
int perfmon_getNumberOfRegions(): Get number of regions
int perfmon_getGroupOfRegion(int region): If the groups are switched in the application (see LIKWID_MARKER_SWITCH), you can get the identifier of the group measured in the region. This only works if you called perfmon_init() and gid = perfmon_addEventSet(group) before. The group identifier is the same as gid.
char* perfmon_getTagOfRegion(int region): Get the name tag of the region
int perfmon_getEventsOfRegion(int region): Get number of events measured in a region
int perfmon_getMetricsOfRegion(int region): Get number of derived metrics measured in a region
int perfmon_getThreadsOfRegion(int region): Returns the number of threads that executed the region
int perfmon_getCpulistOfRegion(int region, int count, int* cpulist): Get the hardware thread IDs that executed the region.
double perfmon_getTimeOfRegion(int region, int thread): Get aggregated runtime of thread in region
int perfmon_getCountOfRegion(int region, int thread): Get the count of region executions by a thread
double perfmon_getResultOfRegionThread(int region, int event, int thread): Get the measurement for an event for a thread and region
double perfmon_getMetricOfRegionThread(int region, int metricId, int threadId): Get the measurement for a derived metric for a thread and region

Marker API in other programming languages

The LIKWID team currently has no plans to provide the Marker API for other programming languages. But since LIKWID is open-source, everybody is welcome to create a module for his/her favorite programming language. Here is a list of projects offering the API in other languages:

Java: https://github.com/jlewandowski/likwid-java-api
Python: https://github.com/RRZE-HPC/pylikwid
Julia: https://github.com/JuliaPerf/LIKWID.jl (documentation)

The timeline mode

likwid-perfctr can record a time-resolved profile. With

$ likwid-perfctr -c N:0-7 -g BRANCH -t 2s > out.txt

you can measure the branching behavior on CPU cores 0-7 with a measurement every 2 seconds. This means the running counters are read out every 2 seconds. This is implemented in a lightweight fashion. The output is written to stderr. The syntax of the timeline mode output lines with a custom event set is:

<groupID> <numberOfEvents> <numberOfThreads> <Timestamp> <Event1_Thread1> <Event1_Thread2> ... <EventN_ThreadN>

The output of the timeline mode differs between custom event sets and performance groups. For custom event sets, EventX refers to the raw count of event X. For performance groups, EventX means MetricX, so the line lists the derived metrics for the different threads, because with a performance group one is usually more interested in the derived metrics than in the raw counts. So for a performance group:

<groupID> <numberOfEvents> <numberOfThreads> <Timestamp> <Metric1_Thread1> <Metric1_Thread2> ... <MetricN_ThreadN>

You can set multiple event sets on the command line. After each measurement period, the event set is switched to the next one in a round-robin fashion. Please note that if you want a readout every 2s with multiple event sets, each group is read out only every 2s*<numberOfEventSets>.
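Because every timeline line starts with the group id, lines from different event sets can be told apart when post-processing the output. The sketch below parses one line, assuming whitespace-separated fields in the order given in the syntax lines above, with the threads of one event (or metric) stored next to each other. The CSV output produced with -O is not covered. All names in the code are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_VALUES 256   /* assumed upper bound for this sketch */

typedef struct {
    int    group_id, nevents, nthreads;
    double timestamp;
    double values[MAX_VALUES];  /* all threads of event 0, then event 1, ... */
} TimelineSample;

/* Parses one timeline line; returns 0 on success, -1 on malformed input. */
int parse_timeline_line(const char *line, TimelineSample *s)
{
    assert(line != NULL && s != NULL);
    int pos = 0;
    if (sscanf(line, "%d %d %d %lf%n", &s->group_id, &s->nevents,
               &s->nthreads, &s->timestamp, &pos) != 4) {
        return -1;
    }
    if (s->nevents <= 0 || s->nthreads <= 0 ||
        s->nevents > MAX_VALUES / s->nthreads) {
        return -1;
    }
    int total = s->nevents * s->nthreads;
    for (int i = 0; i < total; i++) {
        int used = 0;
        if (sscanf(line + pos, "%lf%n", &s->values[i], &used) != 1) {
            return -1;
        }
        pos += used;
    }
    return 0;
}

/* Value of one event (or metric) for one thread, both counted from 0. */
double timeline_value(const TimelineSample *s, int event, int thread)
{
    assert(event >= 0 && event < s->nevents);
    assert(thread >= 0 && thread < s->nthreads);
    return s->values[event * s->nthreads + thread];
}
```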

If you want to cancel the measurement, you can send a SIGINT signal to the likwid-perfctr process.

Notice: Although LIKWID allows measurements at microsecond granularity, there are some points to consider. Tests have shown that for measurement intervals below 100 milliseconds, the periodically printed values are no longer accurate (they are higher than expected), but their trend is still valid. E.g. if you try to resolve burst memory transfers, you need results for small intervals. The memory bandwidth for each measurement may be higher than expected (it could even be higher than the theoretical maximum of the machine), but the burst and non-burst traffic is clearly identifiable by highs and lows of the memory bandwidth results.

The stethoscope mode

likwid-perfctr allows you to listen for a specific time to what is happening on a node. This is useful if you want to see how a long running application is currently performing. We use it to profile MPI codes for which we may not have access to the source code. Stethoscope mode is also suited for monitoring. Be careful not to rely too much on these measurements: because you do not know what the code is actually doing, the result may vary depending on the time period you measured. Still, it can give you a first idea of what is going on with regard to basic performance properties.

Monitor branching behavior on the first eight processors for 10 seconds:

$ likwid-perfctr -c N:0-7 -g BRANCH  -S 10s

If you want to cancel the measurement before the specified duration, you can send a SIGINT signal to the likwid-perfctr process.
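The time arguments of -S and -t combine a number with one of the units s, ms or us. As an illustration of that notation, the sketch below converts such a string to microseconds. It is not LIKWID's own parser: the function name is hypothetical, and it rejects anything without one of the three suffixes, which may be stricter than the tool itself.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Converts "10s", "20ms" or "500us" to microseconds; returns -1 on error. */
long long parse_duration_us(const char *text)
{
    assert(text != NULL);
    char *end;
    long long value = strtoll(text, &end, 10);
    if (end == text || value < 0) {
        return -1;                       /* no digits or negative value */
    }
    if (strcmp(end, "s") == 0)  return value * 1000000LL;
    if (strcmp(end, "ms") == 0) return value * 1000LL;
    if (strcmp(end, "us") == 0) return value;
    return -1;                           /* missing or unknown unit */
}
```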

Using backends

For more information about the different available backends and how to use them, see likwid perfctr backends.
