Thread Pinning¶
StreamPU lets you select the processing units (PUs) on which the threads
actually run. This is called thread pinning, and it can significantly
improve performance, especially on modern heterogeneous architectures.
To do so, the runtime relies on the
hwloc library.
Warning
To use thread pinning, the hwloc library has to be installed on the system and
StreamPU needs to be compiled with the SPU_HWLOC preprocessor
definition. This can be achieved using the following CMake option:
If StreamPU is not linked with the hwloc library, then the thread
pinning interface will have no effect and the threads will not be pinned.
Info
Thread pinning relies on the OS, which needs to expose the appropriate system calls. Linux and Windows provide these syscalls, but macOS does not. Thus, thread pinning will have no effect on macOS :-(.
Portable Hardware Locality¶
Portable Hardware Locality (hwloc for short) is a library which provides a
portable abstraction of the hierarchical topology of modern
architectures (see the figure below).
hwloc gives the ability to pin threads at various levels of the hierarchy,
which is represented by a tree structure. The deepest/lowest nodes (the leaves) are the
PUs while higher nodes represent sets of PUs that are physically close. For
instance, a set of PUs can share the same NUMA node (in the case of a NUMA
architecture), the same last-level cache (LLC) or the same package.
On the Orange Pi 5 SBC, if we pin a thread to Package L#0, it can run
on any PU of the following set: PU L#0, PU L#1, PU L#2 and PU L#3.
The pinned thread can thus move inside the selected hwloc node during the
execution, and it is up to the OS to schedule the thread on the selected set
of PUs.
Warning
The indexes given by hwloc can be different from those given by the OS:
they are logical indexes that express the real locality. Consequently, in
StreamPU, it is important to use hwloc logical indexes. The
hwloc-ls command gives an overview of the current topology with these
logical indexes.
Sequence & Pipeline¶
In StreamPU, thread pinning can be set in runtime::Sequence and
runtime::Pipeline class constructors. In both cases, there is a dedicated
argument of std::string type named sequence_pinning_policy for
runtime::Sequence or pipeline_pinning_policy for runtime::Pipeline.
Info
For NUMA architectures, it is important to specify thread pinning at the
construction of the runtime::Sequence/runtime::Pipeline object to
guarantee that the data will be allocated and initialized on the right
memory banks (according to the first touch policy) during the replication
process.
To specify the pinning policy, we defined a syntax to express hwloc objects
with three different separators:
- Pipeline stage (does not concern runtime::Sequence): |
- Replicated stage (= replicated sequence = one thread): ;
- For one thread, the list of pinned hwloc objects (= logical or): ,
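As a rough illustration of how the three separators nest, the sketch below splits a policy string into stages, then threads, then hwloc object names. The helpers split, trim and parse_policy are hypothetical names introduced for this example. They are not part of the StreamPU API, and the real parser may behave differently (for instance on empty fields).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not StreamPU code): splits `text` on `sep` and keeps
// empty fields, so that an empty stage stays visible in the result.
std::vector<std::string> split(const std::string& text, char sep)
{
    std::vector<std::string> fields;
    std::string current;
    for (char c : text)
    {
        if (c == sep)
        {
            fields.push_back(current);
            current.clear();
        }
        else
            current += c;
    }
    fields.push_back(current);
    return fields;
}

// Hypothetical helper: removes leading and trailing spaces and tabs.
std::string trim(const std::string& text)
{
    const auto first = text.find_first_not_of(" \t");
    if (first == std::string::npos) return "";
    const auto last = text.find_last_not_of(" \t");
    return text.substr(first, last - first + 1);
}

// policy[stage][thread] is the list of object names of one thread (logical or).
using Policy = std::vector<std::vector<std::vector<std::string>>>;

Policy parse_policy(const std::string& text)
{
    Policy policy;
    for (const auto& stage : split(text, '|'))          // pipeline stages
    {
        std::vector<std::vector<std::string>> threads;
        for (const auto& thread : split(stage, ';'))    // replicated threads
        {
            std::vector<std::string> objects;
            for (const auto& object : split(thread, ',')) // alternatives
            {
                const std::string name = trim(object);
                if (!name.empty()) objects.push_back(name);
            }
            threads.push_back(objects);
        }
        policy.push_back(threads);
    }
    assert(!policy.empty()); // split() always returns at least one field
    return policy;
}
```

For instance, parse_policy("PU_0 | PU_1; PU_2, PU_3") yields two stages: the first has one thread pinned to PU_0, and the second has two threads, pinned to PU_1 and to PU_2 or PU_3 respectively.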
The pinning policy can then contain any of the available hwloc objects. Below
is the correspondence between the std::string and the hwloc object types:
std::map<std::string, hwloc_obj_type_t> str_to_hwloc_obj =
{
/* global containers */ /* data caches */ /* instruction caches */
{ "GROUP", HWLOC_OBJ_GROUP }, { "L5D", HWLOC_OBJ_L5CACHE }, { "L3I", HWLOC_OBJ_L3ICACHE },
{ "NUMA", HWLOC_OBJ_NUMANODE }, { "L4D", HWLOC_OBJ_L4CACHE }, { "L2I", HWLOC_OBJ_L2ICACHE },
{ "PACKAGE", HWLOC_OBJ_PACKAGE }, { "L3D", HWLOC_OBJ_L3CACHE }, { "L1I", HWLOC_OBJ_L1ICACHE },
{ "L2D", HWLOC_OBJ_L2CACHE }, /* compute units */
{ "L1D", HWLOC_OBJ_L1CACHE }, { "CORE", HWLOC_OBJ_CORE },
{ "PU", HWLOC_OBJ_PU },
};
To specify the index X of an hwloc object, the following syntax is used:
OBJECT_X (e.g., PU_5 refers to the logical PU n°5).
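The OBJECT_X form can be decomposed with a few lines of string handling. The following is only a sketch: parse_object is a hypothetical helper, and the choice of std::invalid_argument for malformed input is an assumption, not the exception type actually thrown by StreamPU.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical helper (not StreamPU code): splits "PU_5" into {"PU", 5}.
// The object name is everything before the last '_', the index is what follows.
std::pair<std::string, unsigned> parse_object(const std::string& token)
{
    const auto pos = token.rfind('_');
    if (pos == std::string::npos || pos == 0 || pos + 1 == token.size())
        throw std::invalid_argument("expected OBJECT_X, got '" + token + "'");

    const std::string type = token.substr(0, pos);
    const std::string digits = token.substr(pos + 1);
    if (digits.find_first_not_of("0123456789") != std::string::npos)
        throw std::invalid_argument("index is not a number in '" + token + "'");

    assert(!type.empty()); // guaranteed by the pos == 0 check above
    return { type, static_cast<unsigned>(std::stoul(digits)) };
}
```

With this helper, parse_object("PU_5") returns the pair {"PU", 5}. The returned name could then be looked up in a table such as str_to_hwloc_obj.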
Info
CORE and PU objects can be confusing. If the CPU cores do not support
SMT, then CORE and PU are the same. However, if the CPU cores support
SMT, then the PU is the hardware thread identifier inside a given CORE.
Illustrative Examples¶
This section gives some examples to show how the syntax works. We assume a CPU with 8 PUs and the same topology as the Orange Pi 5 Plus SBC presented before.
Example 1¶
Let's suppose we want to set up a 3-stage pipeline with the following characteristics:
- Stage 1 - No replication (= 1 thread):
  - Pinned to PU_0
- Stage 2 - 4 replications (= 4 threads):
  - Thread n°1 is pinned to PU_4 or PU_5
  - Thread n°2 is pinned to PU_4 or PU_5
  - Thread n°3 is pinned to PU_6 or PU_7
  - Thread n°4 is pinned to PU_6 or PU_7
- Stage 3 - No replication (= 1 thread):
  - Pinned to PU_0, PU_1, PU_2 or PU_3
graph LR;
S1T1(Stage 1, thread 1 - pin: PU_0)-->SYNC1;
SYNC1(Sync)-->S2T1;
SYNC1(Sync)-->S2T2;
SYNC1(Sync)-->S2T3;
SYNC1(Sync)-->S2T4;
S2T1(Stage 2, thread 1 - pin: PU_4 or PU_5)-->SYNC2;
S2T2(Stage 2, thread 2 - pin: PU_4 or PU_5)-->SYNC2;
S2T3(Stage 2, thread 3 - pin: PU_6 or PU_7)-->SYNC2;
S2T4(Stage 2, thread 4 - pin: PU_6 or PU_7)-->SYNC2;
SYNC2(Sync)-->S3T1(Stage 3, thread 1 - pin: PU_0, PU_1, PU_2 or PU_3);
In the previous configuration, 6 threads will execute simultaneously (even if the given architecture supports up to 8 executions in parallel).
To instantiate this runtime::Pipeline, here are the corresponding constructor
parameters:
- Number of replications (= threads) per stage: { 1, 4, 1 }
- Enabling pinning per stage: { true, true, true }
- Pinning policy: "PU_0 | PU_4, PU_5; PU_4, PU_5; PU_6, PU_7; PU_6, PU_7 | PU_0, PU_1, PU_2, PU_3"
The previous pinning policy can be written more compactly, since PACKAGE_1 contains PU_4 and PU_5, PACKAGE_2 contains PU_6 and PU_7, and PACKAGE_0 contains PU_0 to PU_3:
- Pinning policy: "PU_0 | PACKAGE_1; PACKAGE_1; PACKAGE_2; PACKAGE_2 | PACKAGE_0"
Example 2¶
Let's now consider that we want to pin every thread of stage 2 to
PU_4, PU_5, PU_6 or PU_7 (this is less restrictive than the previous
example). The pinning strategy for stages 1 and 3 is unchanged.
graph LR;
S1T1(Stage 1, thread 1 - pin: PU_0)-->SYNC1;
SYNC1(Sync)-->S2T1;
SYNC1(Sync)-->S2T2;
SYNC1(Sync)-->S2T3;
SYNC1(Sync)-->S2T4;
S2T1(Stage 2, thread 1 - pin: PU_4, PU_5, PU_6 or PU_7)-->SYNC2;
S2T2(Stage 2, thread 2 - pin: PU_4, PU_5, PU_6 or PU_7)-->SYNC2;
S2T3(Stage 2, thread 3 - pin: PU_4, PU_5, PU_6 or PU_7)-->SYNC2;
S2T4(Stage 2, thread 4 - pin: PU_4, PU_5, PU_6 or PU_7)-->SYNC2;
SYNC2(Sync)-->S3T1(Stage 3, thread 1 - pin: PU_0, PU_1, PU_2 or PU_3);
Here are the corresponding parameters:
- Number of replications (= threads) per stage: { 1, 4, 1 }
- Enabling pinning per stage: { true, true, true }
- Pinning policy: "PU_0 | PACKAGE_1, PACKAGE_2 | PACKAGE_0"
With the previous syntax, a single thread policy is given for stage 2, so its 4 threads all apply the
PACKAGE_1, PACKAGE_2 policy.
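To make the effect of such a shared policy concrete, the sketch below expands a list of objects into the set of logical PUs it covers and assigns that set to every thread of a replicated stage. The package contents are inferred from the examples above (Package L#0 holds PU L#0 to PU L#3, while PACKAGE_1 and PACKAGE_2 stand for PU_4, PU_5 and PU_6, PU_7 in Example 1). The helpers covered_pus and per_thread_pus are hypothetical and only handle PACKAGE_X and PU_X. In StreamPU, this resolution is delegated to hwloc.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// Assumed topology of the examples: logical PUs contained in each package.
const std::map<std::string, std::set<unsigned>> package_to_pus = {
    { "PACKAGE_0", { 0, 1, 2, 3 } },
    { "PACKAGE_1", { 4, 5 } },
    { "PACKAGE_2", { 6, 7 } },
};

// Hypothetical helper: union of the PUs covered by a list of objects
// (the ',' separator of the policy syntax, i.e. a logical or).
std::set<unsigned> covered_pus(const std::vector<std::string>& objects)
{
    std::set<unsigned> pus;
    for (const auto& object : objects)
    {
        const auto package = package_to_pus.find(object);
        if (package != package_to_pus.end())
            pus.insert(package->second.begin(), package->second.end());
        else if (object.size() > 3 && object.compare(0, 3, "PU_") == 0)
            pus.insert(static_cast<unsigned>(std::stoul(object.substr(3))));
        else
            throw std::invalid_argument("unknown object '" + object + "'");
    }
    return pus;
}

// Hypothetical helper: a single thread policy is shared by all the
// replicated threads of a stage.
std::vector<std::set<unsigned>> per_thread_pus(const std::vector<std::string>& objects,
                                               std::size_t n_threads)
{
    assert(n_threads > 0);
    return std::vector<std::set<unsigned>>(n_threads, covered_pus(objects));
}
```

Calling per_thread_pus({ "PACKAGE_1", "PACKAGE_2" }, 4) gives four identical sets containing PUs 4 to 7, which matches the diagram of this example. The same helper also shows why Example 1 can be compressed: covered_pus({ "PACKAGE_1" }) and covered_pus({ "PU_4", "PU_5" }) return the same set.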
Example 3¶
It is also possible to choose which stages are pinned using a vector of
booleans. Let's suppose we do not want to specify any pinning for stage 1.
graph LR;
S1T1(Stage 1, thread 1 - no pinning)-->SYNC1;
SYNC1(Sync)-->S2T1;
SYNC1(Sync)-->S2T2;
SYNC1(Sync)-->S2T3;
SYNC1(Sync)-->S2T4;
S2T1(Stage 2, thread 1 - pin: PU_4, PU_5, PU_6 or PU_7)-->SYNC2;
S2T2(Stage 2, thread 2 - pin: PU_4, PU_5, PU_6 or PU_7)-->SYNC2;
S2T3(Stage 2, thread 3 - pin: PU_4, PU_5, PU_6 or PU_7)-->SYNC2;
S2T4(Stage 2, thread 4 - pin: PU_4, PU_5, PU_6 or PU_7)-->SYNC2;
SYNC2(Sync)-->S3T1(Stage 3, thread 1 - pin: PU_0, PU_1, PU_2 or PU_3);
Here are the corresponding parameters:
- Number of replications (= threads) per stage: { 1, 4, 1 }
- Enabling pinning per stage: { false, true, true }
- Pinning policy: "| PACKAGE_1, PACKAGE_2 | PACKAGE_0"
In this case, the thread of the first stage is not pinned and the OS is free to schedule it on any PU.
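The relation between the boolean vector and the policy string can be sketched as follows. This is an illustration under assumptions, not StreamPU code: split_stages and effective_policies are hypothetical helpers, and both the "one field per stage" check and the choice to ignore the field of a disabled stage are guesses about a reasonable behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: splits the policy on '|' and keeps empty fields,
// so "| A | B" gives three fields, the first one being empty.
std::vector<std::string> split_stages(const std::string& policy)
{
    std::vector<std::string> stages;
    std::size_t begin = 0;
    while (true)
    {
        const std::size_t end = policy.find('|', begin);
        if (end == std::string::npos)
        {
            stages.push_back(policy.substr(begin));
            break;
        }
        stages.push_back(policy.substr(begin, end - begin));
        begin = end + 1;
    }
    return stages;
}

// Hypothetical helper: returns one policy field per stage, emptied when
// pinning is disabled for that stage (assumed behavior).
std::vector<std::string> effective_policies(const std::string& policy,
                                            const std::vector<bool>& pin_enabled)
{
    std::vector<std::string> stages = split_stages(policy);
    assert(!stages.empty()); // split_stages() returns at least one field
    if (stages.size() != pin_enabled.size())
        throw std::invalid_argument("one policy field is expected per stage");
    for (std::size_t s = 0; s < stages.size(); s++)
        if (!pin_enabled[s]) stages[s].clear(); // left to the OS scheduler
    return stages;
}
```

With the parameters of this example, effective_policies("| PACKAGE_1, PACKAGE_2 | PACKAGE_0", { false, true, true }) returns three fields, the first of which is empty.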
Unpin¶
An unpin function exists and can be called by each thread individually. Once
the unpin function has been called, the thread is free to be scheduled by the
OS over all the processing units.
Warning
We assume that the user is aware of the computer architecture, uses the
logical indexes of hwloc and follows the previously defined syntax rules.
Otherwise, the code will throw exceptions.