Multicore is a term associated with parallel processing and refers to the use of several processor cores that are simultaneously engaged in the execution of a program. Parallel processing divides computationally intensive operations into smaller parts and distributes these tasks among several cores. The goal is to speed up the processing of computationally intensive operations.
Characteristics of computationally-intensive applications:
Large amount of data to process
Complex algorithms requiring heavy computations
The control program of a system, which was previously executed on a single processor core, can be regarded as one complex algorithm. Dividing this complex program flow into several subprograms (task partitioning) that can be executed as independently as possible makes it possible to reduce cycle times. This results in a better reaction time of the system because the sampling rates can be increased.
Goals of task partitioning:
Evenly dividing the effort among all available cores
Minimization of contention for shared resources
Up to now, the processing of very large amounts of data (e.g. image processing in real time) has often been outsourced to specialized control components, because the computing capacity of the PLC's single processor core was not sufficient to handle this data processing in addition to the control tasks within the required time window.
The ability to use multiple processor cores directly from the PLC program now opens up a wide range of additional possibilities. The parallel processing of data, distributed over several processor cores, makes it possible to execute a complex processing sequence (e.g. image processing in real time) in the required time window. In many cases, it is no longer necessary to implement parallel data processing outside the PLC with special tools and programming languages. Parallel data processing becomes an integral part of the PLC program.
The new possibilities are illustrated with the example of industrial image processing, which is based on the evaluation of freshly captured images whose results then influence the sequence of further production steps.
In our example, the image processing consists of three steps:
Image capture triggered by an external event
Image analysis
Decision on rejection as scrap or forwarding to subsequent process steps
With a single-core PLC, the second step was outsourced to an external system, which was triggered by an external event. The result of the image analysis was provided through a physical input of the PLC.
With a multi-core PLC, the following structure is feasible: The image processing takes place in a library block, which is triggered by a local input variable and ensures a minimum processing time for image evaluation based on the number of currently available processor cores. The calculations are thus dynamically distributed to an optimal number of processor cores. Subsequently, the results are summarized in order to decide whether the current product meets the requirements. In addition to the computational effort for the image processing itself, there is also the effort for distributing the data and the effort for summarizing the results. Therefore, the image processing efficiency does not increase linearly with the number of processor cores. These different efforts should be in a good relationship to each other; only then will a shorter processing time per product be possible.
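The diminishing returns described above can be made concrete with Amdahl's law. The following C sketch estimates the achievable speedup when a fixed fraction of the work (data distribution plus result summarizing) stays sequential; the fractions used are illustrative assumptions, not measured values.

```c
/* Amdahl's law sketch: why doubling the cores does not halve the
 * processing time. serial_fraction models the share of the work
 * (data distribution + result summarizing) that cannot be
 * parallelized; cores is the number of processor cores used. */
static double estimated_speedup(double serial_fraction, int cores)
{
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores);
}
```

With a perfectly parallel job the speedup scales linearly, but already a 10 % sequential share caps four cores at roughly a 3.1x speedup instead of 4x.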
Below we present a list of procedures and tools to measure and compare the runtime of applications and libraries on single-core and multi-core systems.

TaskManager:
This tool allows you to view and evaluate basic data on the currently existing tasks. Based on the cycle time and the processing time of the task with the highest priority, the remaining computing time available for the other tasks can be determined very well. This is a first indication of how successful a multi-core distribution can be. If the remaining computing time which the task with the highest priority leaves for the other tasks is very small, it quickly becomes difficult to handle the other tasks adequately in the required time window.

TraceManager:
If the SysCpuMultiCore and CmpTraceMgr components are integrated in the runtime system, two system traces are configured which can be uploaded in the CODESYS environment for analyzing the CPU load and the IEC task load.

Profiler:

ExecutionTimeReporter:
Each function block from the CommonBehaviourModel family can be observed with an ExecutionTimeReporter. For each state of the BehaviourModel, it records the time intervals in which the state is active. With this data, the real manufacturing operations associated with these states can be precisely analyzed.
Atomic operations are mainly used to implement locking mechanisms for the synchronization of data and processes. Another variant is the non-blocking synchronization of data structures, whereby objects are only modified by atomic operations and explicit locking of the objects becomes unnecessary.
An operation acting on shared memory is atomic if it completes in a single step relative to other tasks. When an atomic store is performed on a shared variable, no other task can observe the modification half-complete. When an atomic load is performed on a shared variable, it reads the entire value as it appeared at a single moment in time. Non-atomic loads and stores do not make these guarantees.
Typical CPU instructions for implementing atomic operations on current processors are:
Atomic load and store of integral IEC data types like LINT, __XWORD, POINTER, INTERFACE, ...
Atomic read-modify-write operations like AtomicTestAndSet, AtomicAdd or AtomicCompareAndSwap.
Please take a look at the DemoMultiCore.project example project to see the executable methods.
Example
The following pseudo code snippet demonstrates the basic operation of AtomicCompareAndSwap:
The value at an address is compared with an expected value; if the two are equal, a new value is written to this address.
FUNCTION AtomicCompareAndSwap : BOOL
VAR_INPUT
    pxwAddress : POINTER TO __XWORD;
    xwExpected : __XWORD;
    xwNewValue : __XWORD;
END_VAR

IF pxwAddress <> 0 AND_THEN pxwAddress^ = xwExpected THEN
    pxwAddress^ := xwNewValue;
    AtomicCompareAndSwap := TRUE;
ELSE
    AtomicCompareAndSwap := FALSE;
END_IF
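For comparison, the same operation exists as a standard primitive in C11: atomic_compare_exchange_strong performs the compare and the conditional store as one indivisible operation. A minimal sketch (not the CODESYS implementation):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* C11 counterpart of the AtomicCompareAndSwap pseudo code above:
 * write new_value to *address only if it still holds expected,
 * and report whether the swap took place. */
static bool compare_and_swap(atomic_uintptr_t *address,
                             uintptr_t expected, uintptr_t new_value)
{
    /* compare and conditional store happen in one indivisible step */
    return atomic_compare_exchange_strong(address, &expected, new_value);
}
```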
With the help of AtomicTestAndSet, the operation AtomicCompareAndSwap can be implemented as follows:
// +-------------------------------------------------------------------------------------------------------------+
// | ``_xwSync`` is context variable of type ``__XWORD`` to synchronize all tasks/cores in the related context |
// +-------------------------------------------------------------------------------------------------------------+
WHILE AtomicTestAndSet(ADR(_xwSync)) = TRUE DO
    SchedWaitSleep(uliSleepOneUs);
END_WHILE

IF pxwAddress <> 0 AND_THEN pxwAddress^ = xwExpected THEN
    pxwAddress^ := xwNewValue;
    AtomicCompareAndSwap := TRUE;
ELSE
    AtomicCompareAndSwap := FALSE;
END_IF

_xwSync := 0;
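The lock-based variant above can be sketched in C11, where an atomic_flag takes the role of _xwSync and atomic_flag_test_and_set is the test-and-set primitive. Busy-waiting here stands in for the SchedWaitSleep call of the pseudo code; this is a sketch, not the runtime system's implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static atomic_flag sync_flag = ATOMIC_FLAG_INIT;  /* plays the role of _xwSync */

/* Compare-and-swap built from test-and-set: take the lock, do the
 * plain compare and store, release the lock. */
static bool locked_compare_and_swap(uintptr_t *address,
                                    uintptr_t expected, uintptr_t new_value)
{
    bool swapped = false;
    while (atomic_flag_test_and_set(&sync_flag)) {
        /* spin: another task holds the lock (the pseudo code sleeps here) */
    }
    if (address != NULL && *address == expected) {
        *address = new_value;
        swapped = true;
    }
    atomic_flag_clear(&sync_flag);  /* corresponds to _xwSync := 0 */
    return swapped;
}
```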
Example
The following pseudo code snippet demonstrates the basic operation of AtomicTestAndSet:
The value at an address is set to 1 and the old value is returned.
FUNCTION AtomicTestAndSet : BOOL
VAR_INPUT
    pxwAddress : POINTER TO __XWORD;
END_VAR

IF pxwAddress <> 0 THEN
    AtomicTestAndSet := pxwAddress^ <> 0;
    pxwAddress^ := 1;
ELSE
    AtomicTestAndSet := FALSE;
END_IF
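In C11, the same behavior is available through atomic_exchange, which stores the new value and returns the previous one in a single indivisible step. A sketch:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* C11 counterpart of the AtomicTestAndSet pseudo code above: the
 * flag is set to 1 and TRUE is returned if it was already set. */
static bool test_and_set(atomic_uint *flag)
{
    return atomic_exchange(flag, 1u) != 0u;
}
```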
With the help of AtomicCompareAndSwap, the operation AtomicTestAndSet can be implemented as follows:
IF AtomicCompareAndSwap(pxwAddress := pxwAddress, xwExpected := 0, xwNewValue := 1) THEN
    AtomicTestAndSet := FALSE;
ELSE
    AtomicTestAndSet := TRUE;
END_IF
Example
The following pseudo code snippet demonstrates the basic operation of AtomicAdd:
The sum of the value at an address and a given value is written back to the addressed location; the former value at the address is returned.
FUNCTION AtomicAdd : __XWORD
VAR_INPUT
    pxwAddress : POINTER TO __XWORD;
    xwValue : __XWORD;
END_VAR
VAR_OUTPUT
    xError : BOOL;
    xOk : BOOL;
END_VAR

IF pxwAddress <> 0 THEN
    AtomicAdd := pxwAddress^;
    pxwAddress^ := AtomicAdd + xwValue;
ELSE
    xError := TRUE;
END_IF
With the help of AtomicCompareAndSwap, the operation AtomicAdd can be implemented as follows:
IF pxwAddress <> 0 THEN
    REPEAT
        AtomicAdd := pxwAddress^;
        xOk := AtomicCompareAndSwap(pxwAddress := pxwAddress, xwExpected := AtomicAdd, xwNewValue := AtomicAdd + xwValue);
    UNTIL xOk
    END_REPEAT
ELSE
    xError := TRUE;
END_IF
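The CAS retry loop above maps directly to C11, where atomic_compare_exchange_weak conveniently refreshes the expected value on failure. A sketch:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomic add built from compare-and-swap: re-read the current value
 * and retry until the CAS succeeds, then return the value the
 * location held before the addition. */
static uintptr_t atomic_add_via_cas(atomic_uintptr_t *address, uintptr_t value)
{
    uintptr_t old = atomic_load(address);
    /* if another task modified *address between load and CAS, the CAS
     * fails and stores the fresh value into `old`, so we just retry */
    while (!atomic_compare_exchange_weak(address, &old, old + value)) {
        /* retry with the refreshed snapshot */
    }
    return old;
}
```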
IEC Operators (Some are not yet implemented!):
dwOldValue := TEST_AND_SET(dwValue); (CDS-58764)
xOk := COMPARE_AND_SWAP(pxwAddress, xwExpectedValue, xwNewValue); (CDS-58769)
diOldValue := __XADD(pdiAddress, diValueToAdd); (CDS-58770)
CODESYS runtime system support:
udiResult := SysCpuAtomicCompareAndSwap(pxwAddress, ADR(xwNewValue), ADR(xwExpected), SIZEOF(__XWORD)); (CDS-56968)
diNewValue := SysCpuAtomicAdd(pdiAddress, diValueToAdd, pResult);
udiOldValue := SysCpuTestAndSet(pudiAddress, 0);
The parameters of SysCpuAtomicAdd are based on the DINT data type, while the examples above are based on the __XWORD data type.
Memory reordering performed at run-time by the processor itself is invisible to a single-threaded program. It only becomes apparent when a shared memory area is manipulated without any mutual exclusion between the related tasks; the effects of processor reordering are therefore only visible in multi-core environments. When writing lock-free code, special care must often be taken to enforce correct memory ordering. Otherwise, surprising things can happen.
Here is one of the simplest examples. Suppose we have two global integers X and Y, both initially 0.
VAR_GLOBAL
    X : INT := 0;
    Y : INT := 0;
END_VAR
Two cores, running in parallel, execute the following code:
Core1   | Core2
--------|--------
X := 1; | Y := 1;
A := Y; | B := X;
Each processor stores 1 into one of the integer variables and then loads the other integer into a local variable. Now, no matter which processor writes 1 to memory first, it is natural to expect the other processor to read that value back, which means we should end up with A = 1, B = 1, or perhaps both. In practice, however, that is not necessarily the case: sometimes A and/or B equal 0 at the end of this example, a quite counterintuitive result, to say the least!
One way to understand this is that most processor families are allowed to reorder the memory interactions of machine instructions according to certain rules, as long as this never changes the execution of a single-threaded program. In particular, each processor is allowed to delay the effect of a store past a load from a different location. As a result, it might end up as though the instructions had been executed in this order:
Core1   | Core2
--------|--------
A := Y; | B := X;
X := 1; | Y := 1;
You can enforce correct memory ordering on the processor by issuing an instruction which acts as a memory barrier:
__MEMORYBARRIER();
Example
A code snippet to illustrate variables and types
VAR_GLOBAL
    g_xIsPublished : BOOL;
    g_datValue : DATA;
END_VAR
A code snippet running on core #1
IF g_xIsPublished THEN           // Load and check shared flag
    datResult := g_datValue;     // Load published value
    x := g_datValue.x;
    y := g_datValue.y;
END_IF
A code snippet running on core #2
g_datValue.x := x; // Publish some data
g_datValue.y := y; // Publish some data
__MEMORYBARRIER();
g_xIsPublished := TRUE; // Set shared flag to indicate availability of data
If, at a certain position in the program, the result of a sequence of write operations should become visible to a program running on another core, a __MEMORYBARRIER operation must be inserted at exactly this position.
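The publish pattern of this example can be sketched with C11 fences: the release fence on the writer side plays the role of the __MEMORYBARRIER() call, and the reader needs a matching acquire fence after checking the flag. The names data_t, publish, and try_read are invented for this sketch.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { int x; int y; } data_t;  /* stands in for DATA */

static data_t g_value;
static atomic_bool g_published;           /* stands in for g_xIsPublished */

/* writer side (core #2): make the payload visible before the flag */
static void publish(int x, int y)
{
    g_value.x = x;
    g_value.y = y;
    atomic_thread_fence(memory_order_release);  /* ~ __MEMORYBARRIER() */
    atomic_store_explicit(&g_published, true, memory_order_relaxed);
}

/* reader side (core #1): check the flag, fence, then read the payload */
static bool try_read(data_t *out)
{
    if (!atomic_load_explicit(&g_published, memory_order_relaxed))
        return false;
    atomic_thread_fence(memory_order_acquire);
    *out = g_value;
    return true;
}
```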
Example
Below, a pseudo code snippet of a queue implementation illustrates another fact: atomic operations like AtomicCompareAndSwap, AtomicAdd, and AtomicTestAndSet act like a memory barrier. Therefore, an extra __MEMORYBARRIER is not necessary in this use case.
TYPE NODE :
STRUCT
    pNext : POINTER TO NODE;
END_STRUCT
END_TYPE

VAR_GLOBAL
    g_pTail : POINTER TO NODE;
    g_pHead : POINTER TO NODE;
END_VAR

METHOD Enqueue
VAR_INPUT
    pNode : POINTER TO NODE;
END_VAR
VAR
    pTail : POINTER TO NODE;
END_VAR
// Keep trying until Enqueue is done
REPEAT
    pTail := g_pTail;
    IF AtomicCompareAndSwap(ADR(pTail^.pNext), 0, pNode) THEN
        // __MEMORYBARRIER();
        // Enqueue is done. Exit loop
        EXIT;
    ELSE
        // Tail was not pointing to the last node.
        // Try to swing Tail to the next node
        AtomicCompareAndSwap(ADR(g_pTail), pTail, pTail^.pNext);
        // __MEMORYBARRIER();
    END_IF
UNTIL FALSE
END_REPEAT

// Enqueue is done. Try to swing Tail to the inserted node
AtomicCompareAndSwap(ADR(g_pTail), pTail, pNode);
// __MEMORYBARRIER();
An interesting aspect of the Enqueue method is that it is lazy: it takes place in two distinct steps. To make this method lock-free, tasks may need to help one another. Normally, the Enqueue method locates the last node in the queue and performs the following two steps:
It calls AtomicCompareAndSwap to append the new node, and
it calls AtomicCompareAndSwap to change the queue's tail pointer from the prior last node to the new last node.
Because these two steps are not executed atomically, every other method call must be prepared to encounter a half-finished Enqueue call and to finish the job.
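The two-step Enqueue with its helping mechanism can be sketched with C11 atomics; all names here are invented for the sketch, and the test below exercises it single-threaded only.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct node {
    _Atomic(struct node *) next;
} node_t;

static node_t g_dummy;                       /* queue starts with a dummy node */
static _Atomic(node_t *) g_tail = &g_dummy;

static void enqueue(node_t *n)
{
    atomic_init(&n->next, NULL);
    for (;;) {
        node_t *tail = atomic_load(&g_tail);
        node_t *expected = NULL;
        /* Step 1: link the new node behind the current tail,
         * if the tail really is the last node */
        if (atomic_compare_exchange_strong(&tail->next, &expected, n)) {
            /* Step 2: swing the tail pointer; a concurrent helper may
             * already have done this, then the CAS fails harmlessly */
            atomic_compare_exchange_strong(&g_tail, &tail, n);
            return;
        }
        /* Tail was not the last node: help the half-finished enqueue
         * by swinging g_tail to tail->next (left in `expected`) */
        atomic_compare_exchange_strong(&g_tail, &tail, expected);
    }
}
```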
Example
The following pseudo code snippet shows how an interface reference can be exchanged atomically: a union converts between IObserver and __XWORD so that AtomicCompareAndSwap can operate on the reference.

TYPE IObserver_XWORD_Converter :
UNION
    itf : IObserver;
    xwd : __XWORD;
END_UNION
END_TYPE
METHOD SetObserver : IObserver
VAR_INPUT
    itfObserver : IObserver;
END_VAR
VAR_OUTPUT
    eErrorID : ERROR;
END_VAR
VAR
    xOk : BOOL;
    cvtOldValue : IObserver_XWORD_Converter;
    cvtNewValue : IObserver_XWORD_Converter;
END_VAR

cvtNewValue.itf := itfObserver;
REPEAT
    cvtOldValue.itf := _itfObserver;
    xOk := AtomicCompareAndSwap(ADR(_itfObserver), cvtOldValue.xwd, cvtNewValue.xwd);
UNTIL xOk
END_REPEAT
SetObserver := cvtOldValue.itf;
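For comparison, a C11 sketch of the same observer exchange: because C can declare the pointer itself as an atomic object, a single atomic_exchange replaces both the union conversion and the CAS retry loop. observer_t and set_observer are invented names standing in for IObserver and SetObserver.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct observer { int id; } observer_t;  /* stands in for IObserver */

static _Atomic(observer_t *) g_observer;         /* stands in for _itfObserver */

/* Replace the registered observer and return the previous one.
 * atomic_exchange stores the new pointer and returns the old one
 * in a single indivisible step. */
static observer_t *set_observer(observer_t *obs)
{
    return atomic_exchange(&g_observer, obs);
}
```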