Learning heterogeneous parallelism in C++ with AMP

At first, GPUs could be used for a very narrow range of tasks (try to guess what), but they looked very attractive, and software developers decided to harness their power by offloading part of the computation to graphics accelerators. Since a GPU cannot be used in the same way as a CPU, this required new tools, which did not take long to appear. This is how CUDA, OpenCL and DirectCompute originated. The new wave was named ‘GPGPU’ (general-purpose computing on graphics processing units), a term for the technique of using the GPU for general-purpose computation. As a result, people began using several completely different microprocessors to solve common tasks, which gave rise to the term ‘heterogeneous parallelism’, the actual topic of today’s discussion.

Quite a few tools for computing on a graphics adapter have already been introduced to the market. Let’s see which technology best fits our goals. First, we will define the requirements it should meet: it should be as platform-independent as possible (in terms of both software and hardware) and easy to interface with existing code and programming tools.

CUDA

CUDA is NVIDIA’s proprietary GPGPU technology, and there is really nothing wrong with that. CUDA code is written in a C-like language, albeit with some limitations. Overall, this is a widely used technology that deserves your attention; however, the dependence on a single vendor spoils the whole picture.

FireStream

FireStream is the same thing as CUDA, only from AMD, so we will not dwell on it.

OpenCL

OpenCL is an open computing language: a free, royalty-free technology that is very interesting to look at but, at its core, a language that differs considerably from C (although the developers claim the contrary). In that case, the programmer has to learn an almost entirely new language with non-standard functionality, which is not particularly encouraging. Moreover, since there is no standard for the binary code, a compiler from any vendor may generate incompatible code, which means that on each platform you will have to recompile the kernels, which, in turn, requires having their source code. OpenCL was originally developed at Apple, but it is now managed by the Khronos Group, just like OpenGL.

DirectCompute

DirectCompute is the new DirectX 11 module that allows you to perform GPGPU operations. Before DirectCompute was introduced, DirectX had a language for computing on the GPU, but such computing was related exclusively to graphics. The parts of an application that use DirectCompute are also programmed in HLSL, but now this code can serve a broader purpose. It is logical to assume that DirectCompute is Microsoft’s brainchild, but it is being developed further together with NVIDIA and AMD. Unlike OpenCL, DirectCompute has a standard and can be compiled into hardware-independent bytecode, which allows it to run on different hardware without recompiling. On non-Windows operating systems, which consequently have no DirectX support, DirectCompute code is executed via OpenCL. As I mentioned above, DirectCompute code is written in HLSL, a C-like language with its own features (non-standard data types, functions, and so on).

AMP

Fig. 1. A laptop used for testing it all

Apparently, none of these technologies promises to be user-friendly and sufficiently productive at the same time. However, Microsoft has prepared one more option: AMP (Accelerated Massive Parallelism), an extension to C++. It is supported by the Visual C++ compiler starting from the version included in Visual Studio 2012 (many of the tools for working with parallelism appeared only in the next version, Visual Studio 2013).

In general, there are two ways to parallelize a program: by data or by task. When working with a GPU, parallelization follows the first method, because a GPU has many cores, each independently computing its own set of data. While the units of work executed on a CPU are commonly called processes (subdivided into threads, etc.), the units executed on a GPU are called threads (Windows NT >= 5.1 also has threads, but let’s not quibble over definitions). So each core runs its own thread. AMP lets you use familiar programming tools to parallelize the execution of code across graphics adapters, even if there is more than one of them. AMP works with any modern GPU that supports DirectX 11. However, before running code on a GPU, it is better to check its compatibility with AMP in advance, and that is what we will do in the next section.

Unlike the GPGPU tools above, which use dialects of C, AMP uses C++ with all its benefits: type safety, exceptions, overloading, templates, and so on.

This means that developing heterogeneous applications has become easier and more productive. Especially important for the purity of the language, AMP adds only two new keywords to C++; everything else is provided by AMP’s library facilities (templated functions, data types, and so on). As a result, AMP is compatible with Intel TBB at the code level. In addition, Microsoft opened the AMP specification to anyone interested. This means that third-party developers can not only extend AMP but also port it to other software and hardware platforms: AMP was designed with a future in mind where code will be executed on more than just CPUs and graphics accelerators.
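A quick illustration of my own (a minimal sketch, not from any sample project): the two keywords are ‘restrict’ and ‘tile_static’. The first marks functions that may run on an accelerator; we will meet the second later, in the section on block algorithms.

// restrict(amp, cpu) makes the function callable both from ordinary
// CPU code and from an AMP kernel; restrict(amp) alone would limit
// it to accelerator code.
int add(int a, int b) restrict(amp, cpu)
{
    return a + b;
}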

AMP and Supported Accelerators

Let’s write a program that finds all devices supporting AMP and displays them as a list in the console, together with the values of several accelerator properties that matter to AMP.

Create a Win32 console project; no additional fiddling with compiler settings is needed to enable AMP, just include the header file ‘amp.h’ and use the ‘concurrency’ namespace. In addition, you will need the ‘iostream’ and ‘iomanip’ headers for I/O operations. In the ‘_tmain’ function, we only set the locale and call the ‘show_all_accelerators’ function, which does the real work of displaying the list of accelerators and their properties. This function returns nothing, and two operations are performed within it. In the first, we obtain a vector containing all available accelerators by calling the static method ‘get_all’ of the ‘accelerator’ class. The second operation uses the ‘for_each’ algorithm from the ‘concurrency’ namespace, which executes a lambda for each element of the vector: std::for_each(accs.cbegin(), accs.cend(), [=, &n](const accelerator& a). The lambda is passed as the third parameter; here you can see only its capture clause, which tells the compiler that the accelerator object taken from the vector is captured by value and the incremented variable ‘n’ is captured by reference.

Within the lambda, we simply print certain properties of the graphics adapter: the path to the device (the bus), the dedicated memory, whether a monitor is connected, whether the device is in debug mode, whether the functionality is emulated (on the CPU), whether double precision is supported, and whether limited double precision is supported (if the latter, the device cannot perform the complete set of computations: certain operations will not be supported). In my case (a laptop with two display adapters), the program produced the following output (Fig. 2).
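Assembled into a complete program, the description above looks roughly like this (a minimal sketch: the property names are the actual members of the ‘accelerator’ class, but I use a plain ‘main’ instead of ‘_tmain’, skip the locale setup, and the output formatting is illustrative):

#include <amp.h>
#include <iostream>
#include <iomanip>
#include <algorithm>
#include <vector>

using namespace concurrency;

// Print every accelerator visible to C++ AMP, with the properties
// discussed above.
void show_all_accelerators()
{
    std::vector<accelerator> accs = accelerator::get_all();
    int n = 0;
    std::for_each(accs.cbegin(), accs.cend(), [=, &n](const accelerator& a)
    {
        std::wcout << L"Accelerator " << ++n << L": " << a.description << std::endl
                   << L"  device path:              " << a.device_path << std::endl
                   << L"  dedicated memory:         " << a.dedicated_memory << L" KB" << std::endl
                   << std::boolalpha
                   << L"  has display:              " << a.has_display << std::endl
                   << L"  is debug:                 " << a.is_debug << std::endl
                   << L"  is emulated:              " << a.is_emulated << std::endl
                   << L"  double precision:         " << a.supports_double_precision << std::endl
                   << L"  limited double precision: " << a.supports_limited_double_precision << std::endl;
    });
}

int main()
{
    show_all_accelerators();
    return 0;
}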

Fig. 2. Available accelerators

As you can see, in addition to the two physical accelerators installed in my laptop, the program found three more. Let’s take a closer look at them. Software Adapter (REF) is a software adapter that emulates a GPU on the CPU; it is also called a software renderer. It runs much slower than a hardware GPU, is available only in Windows 8, and is used primarily for debugging applications. The CPU accelerator is available in both Windows 8 and Windows 7. It is also very slow, since it runs on the CPU, and is used for debugging. Microsoft Basic Renderer Driver is the best choice among the emulated accelerators; it also runs on the CPU and ships with Visual Studio 2012 and above. It is also known as WARP (Windows Advanced Rasterization Platform). Rendering is done via Direct3D, and its speed advantage over the other emulators comes from using SIMD instructions (SSE).

In addition, for developing and debugging C++ AMP applications, it is recommended to use Windows 8, and I am prepared to argue in favor of that assertion. As I mentioned above, there is support for debugging on an emulated accelerator, support for double-precision computing with WDDM 1.2, and an increased number of writable buffers (I am referring to DirectCompute buffers) with DirectX 11.1 support. Most importantly, since Windows 8, unlike Windows 7, does not take the global kernel lock when copying data from the accelerator to CPU memory, the copy operation runs faster, which increases overall performance.

AMP Elements

AMP consists of a very small set of basic elements, some of which are used in each AMP project. Let’s consider them briefly.

We have already seen the ‘accelerator’ class object, which represents a computing device. By default, it is initialized with the most suitable of the available accelerators. After obtaining the list of all available accelerators with ‘get_all’, you can make a different GPU the default with ‘set_default’, specifying the path to the device as the parameter. Each accelerator (‘accelerator’ class object) has one or more isolated logical views (residing in the display adapter’s memory) in which the threads related to that particular GPU perform their computations. An ‘accelerator_view’ class object is a kind of reference to the accelerator that lets you work with the device more flexibly. For example, it lets you handle TDR (Timeout Detection and Recovery) exceptions, which can occur, for instance, if the GPU computes for more than two seconds; moreover, unlike Windows 8, Windows 7 does not allow TDR to be disabled. If this exception is not handled and the computation is not moved to another ‘accelerator_view’, the only way to recover is to restart the application.

The templated ‘array’ type, as its name suggests, represents a data set for computing on the GPU; the collection is created in the GPU’s view. To create a collection of this type, you specify two template parameters, the element type and the rank, and pass the number of elements to the constructor. You can create arrays of different dimensions (up to 128), specified in the constructor or via the templated ‘extent< >’ type; there are overloaded constructors, and you can fill the array with values both at creation time (in the constructor) and afterwards (using ‘copy’).
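A small sketch of these pieces together (assuming ‘using namespace concurrency;’ as before; the choice of accs[1] is illustrative, since on your machine the interesting device may be at a different position):

std::vector<accelerator> accs = accelerator::get_all();
// Make another GPU the default; this must happen before the default
// accelerator is first used.
accelerator::set_default(accs[1].device_path);

std::vector<int> host_data(1024, 1);
// A one-dimensional array of 1024 ints, created on the default
// accelerator and filled from host_data at construction time.
array<int, 1> gpu_data(extent<1>(1024), host_data.begin());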

To specify the position of an element in the array, you use the special templated ‘index< >’ type. The ‘array_view’ type relates to the ‘array’ type just as ‘accelerator_view’ relates to ‘accelerator’: in other words, it is a reference. It is appropriate when you do not want to copy data from CPU memory to GPU memory and back. The ‘array’ collection always resides in GPU memory; that is, at initialization time, its data is copied from the CPU to the GPU. If, on the other hand, you declare an ‘array_view’ object based on a vector from the CPU domain, the vector’s data is not copied until a job directly related to the GPU is performed, and such a job is performed within the ‘parallel_for_each’ algorithm.

Accordingly, this is the only point in the application where code is parallelized to run on the accelerator; the code executes on the GPU that holds the array passed to the algorithm. As its first parameter, ‘parallel_for_each’ receives the ‘extent’ object of the array whose elements will be processed, and for that extent it executes the function passed (as the second parameter) as a functor or lambda. As many threads are launched as the first parameter specifies. Within the functor or lambda (also known as the kernel function), you may call other functions, but those functions must be marked with the keyword ‘restrict(amp)’; the functor or lambda passed to ‘parallel_for_each’ must be marked with this keyword as well. There are further restrictions on a kernel function: for example, ‘array’ objects from the enclosing code must be captured by reference.
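To make this concrete, here is a minimal sketch of my own (not from the article’s sample code): doubling every element of a vector through an ‘array_view’, with the copy back to CPU memory triggered explicitly by ‘synchronize’:

#include <amp.h>
#include <vector>
using namespace concurrency;

void double_in_place(std::vector<int>& v)
{
    // A view over the vector's data; nothing is copied yet.
    array_view<int, 1> av(static_cast<int>(v.size()), v);

    // One GPU thread per element; the lambda is the kernel function.
    parallel_for_each(av.extent, [=](index<1> idx) restrict(amp)
    {
        av[idx] *= 2;
    });

    av.synchronize();   // copy the results back into the vector
}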

In the end, the slowest part of any application that uses GPU computing is copying data from CPU memory to GPU memory and back. You need to take this into account: if there is not much computing to do, it is quite likely that doing it on the CPU will be faster.

Using AMP

As you may have noticed, AMP is inextricably linked with DirectX, but this does not mean that AMP can be used only for graphics computation. Still, graphics is the most resource-intensive kind of computing and demands high speed, so the most interesting and illustrative examples relate to graphics.


Let’s install the DirectX SDK from June 2010 or later (that version includes the DirectX 11 interfaces). We will look at an example of working with graphics: rotating a triangle built with Direct3D 11 tools. Open the ‘DXInterOp’ project. If you build and run the application, you will see the following image, but in motion (Fig. 3).

Fig. 3. The coordinates of vertices in the triangle are computed on the video adapter

The file ‘DXInterOpsPsVs.hlsl’ contains the vertex and pixel shaders. The file ‘DXInterOp.h’, in addition to the macros for safe deletion of objects, declares the 2D vertex structure (Vertex2D) used throughout the program. The file ‘DXInterOp.cpp’ contains the main application code: creating the window, initializing the graphics subsystem, creating and destroying the Direct3D device, loading and creating the shader objects, building the triangle, filling and redrawing the window, and so on. All this code uses Direct3D functionality and is therefore outside the scope of today’s discussion. The file ‘ComputeEngine.h’ contains the part of the application we are interested in: the ‘AMP_compute_engine’ class, which is responsible for transforming the coordinates of the vertices. Its constructor creates a reference to the accelerator (an ‘accelerator_view’) provided by the Direct3D device. The class then initializes the ‘m_data’ object, a unique pointer to a one-dimensional array of vertices (of the previously declared Vertex2D type). The ‘run’ function is the workhorse of the class: it wraps the ‘parallel_for_each’ algorithm around a lambda expression that computes the new coordinates to rotate the triangle:

parallel_for_each(m_data->extent, [=, &data_ref] (index<1> idx) restrict(amp)
{
    DirectX::XMFLOAT2 pos = data_ref[idx].Pos;
    // fast_math is required here: the plain cos/sin from <cmath>
    // cannot be called in restrict(amp) code.
    data_ref[idx].Pos.y = pos.y * fast_math::cos(THETA) - pos.x * fast_math::sin(THETA);
    data_ref[idx].Pos.x = pos.y * fast_math::sin(THETA) + pos.x * fast_math::cos(THETA);
});

Pay attention to the lambda’s capture clause: it indicates that ‘data_ref’, of type ‘array<Vertex2D, 1>’, is captured by reference, while the parameter receives the ‘idx’ object of type ‘index<1>’. This index is the number of the currently executing thread.

Actually, the ‘run’ function is called immediately before rendering, so the work must be done very quickly. In many ways this is a contrived example, as the rotation of an object could be implemented in the same thread that executes the Direct3D actions (initialization, rendering, and so on). However, the example clearly shows the division of responsibilities between the various parts of the application, and the new vertex coordinates are computed very quickly even on a software accelerator running in debug mode.

Block Algorithms

Unlike computing on the CPU, computing on the GPU does not automatically benefit from a cache, because a GPU rarely reuses data. On the other hand, as with the CPU, the GPU is very slow at fetching data from global memory, and the closer the target data lies, the faster the access. In cache memory, data can be accessed many times faster, and you can arrange the algorithm so that it uses the cache more often, storing and retrieving data there. To do this, divide the data into blocks. It is not an easy thing to do, but it can bring substantial gains in the speed of the algorithm. Unlike the CPU, where caching is in most cases automatic, the cache on the GPU is programmable, so the programmer must take care of it personally. We can define the blocks in which the threads execute. This requires two things: instead of the simple index used in a non-block program, use a block index; and use the accelerator’s programmable cache. Each block of threads is assigned a memory area on the accelerator, and to place a variable there you put the keyword ‘tile_static’ before its declaration, thereby specifying the use of block static memory. Variables marked with this keyword can be used only within the kernel function. Since block static memory is very small, it usually holds a small portion of an ‘array’ collection from global video memory:

tile_static int num[32][32];

The ‘parallel_for_each’ algorithm has an overloaded version that accepts, as its first parameter, a ‘tiled_extent’ class object: an ‘extent’ divided into blocks, in up to three dimensions. Here is an example:

parallel_for_each(extent<2>(size, size), [=, &input, &output] (index<2> idx) restrict(amp)

In this example, we have a ‘size*size’ array in 2D space. When a ‘tiled_extent’ is passed as the first parameter of ‘parallel_for_each’, a ‘tiled_index’ object of the same rank is passed to the lambda (the tile size is a compile-time template argument; the 256 below is just an example):

parallel_for_each(extent<1>(number_of_threads).tile<256>(), [=](tiled_index<256> tidx) restrict(amp)

Within the lambda, the ‘tiled_index’ object gives you access to both the global and the local index through its ‘global’ and ‘local’ properties:

const int tid = tidx.local[0];
const int globid = tidx.global[0];

The two modifications above are only the most obvious changes to make when migrating from a non-block to a block-based version of the code. The programmer’s main task in reworking the code is to rethink the solution and modify the kernel function and the code called from it.

One of the main problems that can arise when developing a block algorithm is a race condition. Consider this case: before the data are processed within the lambda, they are copied from the global collection into a collection in block static memory, after which an algorithm is called to process the collection in block static memory. But if the array is filled according to the thread index, it may not be completely filled before processing begins: the threads execute independently, and on reaching the point where the algorithm is called, no thread can know whether every other thread has run, that is, whether the array has been filled completely. In this case, before calling the algorithm, you need to insert a call to the ‘wait’ method of the ‘tile_barrier’ class object, which cannot be created independently but can be obtained from the ‘tiled_index’ object passed to the lambda: tidx.barrier.wait();

tile_static int num[32][32];
num[tidx.local[0]][tidx.local[1]] = arr[tidx.global];
tidx.barrier.wait();
if (tidx.local == index<2>(0, 0)) {
    num[0][0] = num[0][0] + num[0][1] + num[1][0] + num[1][1];
}
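Putting the pieces together, here is a self-contained sketch of my own (names and the 2×2 tile size are illustrative): each tile stages its values in block static memory, waits at the barrier, and then one thread per tile sums the four values into an output ‘array_view’:

#include <amp.h>
#include <vector>
using namespace concurrency;

// v holds 32*32 ints, result holds 16*16 ints: one sum per 2x2 tile.
void tiled_sums(const std::vector<int>& v, std::vector<int>& result)
{
    array_view<const int, 2> arr(32, 32, v);
    array_view<int, 2> out(16, 16, result);
    out.discard_data();   // we only write, so skip copying result to the GPU

    parallel_for_each(arr.extent.tile<2, 2>(),
        [=](tiled_index<2, 2> tidx) restrict(amp)
    {
        tile_static int num[2][2];
        num[tidx.local[0]][tidx.local[1]] = arr[tidx.global];
        tidx.barrier.wait();   // all four threads have filled num

        if (tidx.local == index<2>(0, 0)) {
            out[tidx.tile] = num[0][0] + num[0][1] + num[1][0] + num[1][1];
        }
    });
    out.synchronize();   // copy the sums back to result
}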

Conclusion

AMP can be used not only from C++ but also from managed code, such as C#. In addition, Windows Store apps can make full use of C++ AMP: heterogeneous parallelism on graphics accelerators, which today are found not only in PCs but also in tablets and smartphones.
Unfortunately, this article has shown only the tip of the iceberg known as Microsoft AMP, and a large part of this technology remains unreviewed. I have merely drawn your attention to it; reaching a deeper understanding of AMP is up to you. In conclusion, I would like to note that Visual Studio offers not only tools for writing parallel code but also tools for debugging it and visualizing the execution of parallel computations, which we had no time to discuss today.

Given its complexity, computing on accelerators was used only occasionally in the past, but AMP makes heterogeneous parallelism available to the broad masses of programmers.

