Learning heterogeneous parallelism in C++ with AMP

At first, GPUs could be used for only a very narrow range of tasks (try to guess which), but their power looked very attractive, and software developers decided to offload part of their computing to graphics accelerators. Since a GPU cannot be used in the same way as a CPU, this required new tools, which did not take long to appear. This is how CUDA, OpenCL and DirectCompute originated. The new wave was named ‘GPGPU’ (general-purpose computing on graphics processing units) to designate the technique of using a GPU for general-purpose computing. As a result, people began to use several completely different kinds of processors to solve very common tasks. This gave rise to the term “heterogeneous parallelism”, which is the topic of today’s discussion.

Many different tools for computing on a graphics adapter have already been brought to market. Let’s see which technology best suits our goals. First, we define the requirements it should meet: it should be as platform-independent as possible (in terms of both software and hardware) and easy to integrate with existing code and programming tools.

CUDA

CUDA is a proprietary GPGPU technology from NVIDIA, and there is really nothing wrong with that. CUDA programs are written in a language that is similar to C but has its own limitations. In general, this is a widely used technology that deserves your attention. However, the dependence on a single vendor spoils the whole picture.

FireStream

FireStream is AMD’s counterpart to CUDA, so we will not dwell on it.

OpenCL

OpenCL is an open language for computing: a free, non-proprietary technology, which makes it very interesting to look at. At its core, however, it is a language that differs considerably from C (although the developers claim the contrary), so the programmer has to learn what is almost an entirely new language with some non-standard functionality. This is not particularly encouraging. Moreover, since there is no standard for the binary code, compilers from different vendors may generate incompatible code. This means that you have to recompile the shaders on each platform, which in turn requires having their source code. OpenCL was originally developed at Apple, but it is now managed by the Khronos Group, just like OpenGL.

DirectCompute

DirectCompute is a new DirectX 11 module for performing GPGPU operations. Before DirectCompute was introduced, DirectX already had a language for computing on the GPU, but such computing was related exclusively to graphics. The parts of an application that use DirectCompute are also programmed in HLSL, but now this code can serve a broader purpose. As you would expect, DirectCompute is the brainchild of Microsoft, but it is being further developed by NVIDIA and AMD. Unlike OpenCL, DirectCompute has a standard and can be compiled into hardware-independent bytecode, which lets it run on different hardware without recompiling. On non-Windows operating systems, which consequently have no DirectX support, DirectCompute code is executed with OpenCL. As mentioned above, DirectCompute code is written in HLSL, a C-like language with its own peculiarities (non-standard data types, functions, etc.).

AMP

Fig. 1. A laptop used for testing it all


Apparently, none of these technologies promises to be both user-friendly and sufficiently productive. However, Microsoft has prepared one more option called AMP (Accelerated Massive Parallelism), an extension to C++. It is supported by the Visual C++ compiler, starting from the version included in Visual Studio 2012 (many tools for working with parallelism appeared only in the next version, Visual Studio 2013).

In general, there are two ways to parallelize a program: by data or by task. On a GPU, parallelization follows the first approach, because a GPU has many cores, each of which processes its own set of data independently. While the tasks executed on a CPU are commonly referred to as processes (which are divided into threads, and so on), the tasks executed on a GPU are called threads (Windows NT >= 5.1 also has threads, but let’s not quibble over definitions). So, each core has its own thread. AMP lets you use familiar programming tools to parallelize code across graphics adapters, including the case when there is more than one of them. AMP works with all modern GPUs that support DirectX 11. However, before running code on a GPU, it is better to check the device’s compatibility with AMP in advance, and this is what we will do in the next section.

Unlike the GPGPU tools above, which use dialects of C, AMP uses C++ with all its benefits, such as type safety, exceptions, overloading, templates, and so on.

This makes developing heterogeneous applications easier and more productive. What is especially important for the purity of the language, AMP adds only two new keywords to C++ (‘restrict’ and ‘tile_static’); everything else is provided by the AMP library (function templates, data types, and so on). As a result, AMP is compatible with Intel TBB at the code level. In addition, Microsoft has opened up the AMP specification to anyone who wants it. This means that third-party developers can not only extend AMP but also port it to other software and hardware platforms, because AMP was designed with a future in mind in which code can be executed on more than just CPUs and graphics accelerators.

AMP and Supported Accelerators

Let’s write a program that retrieves all devices supporting AMP and prints the list to the console. The program will also display the values of several accelerator properties that are important for AMP.

Create a Win32 console project. Enabling AMP requires no additional fiddling with compiler settings: just include the ‘amp.h’ header and use the ‘concurrency’ namespace. You will also need the ‘iostream’ and ‘iomanip’ headers for I/O operations. In the ‘_tmain’ function, we only set the locale and call the function ‘show_all_accelerators’, which does the actual work of displaying the list of accelerators and their properties. This function returns nothing and performs two operations. First, it obtains a vector containing all available accelerators by calling the static method ‘get_all’ of the ‘accelerator’ class. Second, it applies the standard ‘for_each’ algorithm, which executes a lambda for each element of the vector: std::for_each(accs.cbegin(), accs.cend(), [=, &n](const accelerator& a). The lambda is passed as the third argument; only its introducer and parameter list are shown here. The capture list tells the compiler that variables from the enclosing scope are captured by value, except for the counter ‘n’, which is captured by reference so that it can be incremented; the accelerator selected from the vector is passed to the lambda as a constant reference.

Within the lambda, we simply print certain properties of the graphics adapter: the path to the device (the bus), the dedicated memory, whether a monitor is connected, whether the device is in debug mode, whether the functionality is emulated (on the CPU), whether double precision is supported, and whether only limited double precision is supported (if so, the device cannot perform the full set of double-precision computations; certain operations are not supported). In my case (a laptop with two display adapters), the program produced the following output (Fig. 2).

Fig. 2. Available accelerators


As you can see, in addition to the two physical accelerators installed in my laptop, the program found three more. Let’s take a closer look at them.

Software Adapter (REF) is a software adapter that emulates a GPU on the CPU; it is also called a software rendering tool. It runs much slower than a hardware GPU. It is available only in Windows 8 and is used primarily for debugging applications.

CPU accelerator is available in both Windows 8 and Windows 7. It is also very slow, since it runs on the CPU, and it is used for debugging.

Microsoft Basic Renderer Driver is the best choice among the emulated accelerators. It also runs on the CPU and comes in a package with Visual Studio 2012 and above. It is also known as WARP (Windows Advanced Rasterization Platform). The rendering is provided by Direct3D. Its higher speed compared to the other emulators is achieved by using SIMD instructions (SSE).

In addition, Windows 8 is the recommended system for developing and debugging C++ AMP applications, and I am prepared to argue in favor of that assertion. First, as mentioned above, it supports debugging on an emulated accelerator. It also supports double-precision computing with WDDM 1.2 and, with DirectX 11.1, a larger number of writable buffers (I’m referring to DirectCompute buffers). Most importantly, since Windows 8 (unlike Windows 7) does not take the global kernel lock when copying data from the accelerator to CPU memory, the copy operation is faster, which increases overall performance.

AMP Elements

AMP consists of a very small set of basic elements, some of which are used in each AMP project. Let’s consider them briefly.

We have already seen the ‘accelerator’ class, whose objects represent a computing device. By default, an ‘accelerator’ object is initialized with the most appropriate of the available accelerators. After using the ‘get_all’ function to obtain the list of all available accelerators, you can call ‘set_default’ to select another GPU by passing its device path as the parameter.

Each accelerator (an object of the ‘accelerator’ class) has one or more isolated logical views (residing in the memory of the display adapter), in which the threads belonging to that particular GPU perform their computing. An object of the ‘accelerator_view’ class is a sort of reference to the accelerator that lets you work with it in more detail. For example, it allows you to handle TDR (Timeout Detection and Recovery) exceptions. Such an exception can occur, for example, if the GPU computes for more than two seconds; moreover, unlike Windows 8, Windows 7 does not allow TDR exceptions to be disabled. If the exception is not handled and the computing is not handed over to another ‘accelerator_view’, the only way to resume work is to restart the application.

The class template ‘array’, as its name suggests, represents a data set for computing on the GPU. This collection is created in a GPU view. To create one, you specify the element type as a template argument and pass the number of elements of this type to the constructor. Arrays can have different numbers of dimensions (up to 128); this is specified through the rank template argument and the ‘extent < >’ object passed to the constructor. There are overloaded constructors, and you can fill the array with values either when it is created (in the constructor) or afterwards (by using ‘copy’).

To determine the position of an element in the array, you can use the special class template ‘index < >’. The ‘array_view’ type relates to ‘array’ just as ‘accelerator_view’ relates to ‘accelerator’; in other words, it is a reference. It is useful when you don’t want to copy data from CPU memory to GPU memory and back unnecessarily. For example, an ‘array’ collection always resides in GPU memory, that is, the data are copied from the CPU to the GPU at the time of its initialization. On the other hand, if you declare an ‘array_view’ object over a vector in CPU memory, the vector data are not copied until the GPU actually needs them, which happens inside the ‘parallel_for_each’ algorithm.

This algorithm is the only point in the application where the code is parallelized to run on the accelerator. The code is executed on the GPU that holds the array passed to the algorithm. As its first parameter, ‘parallel_for_each’ takes an ‘extent’ object describing the array of elements; for each of these elements, the algorithm executes the function passed as the second parameter in the form of a functor or lambda. The number of threads launched is determined by the first parameter. The functor or lambda (also known as the kernel function) can call other functions, but they must be marked with the keyword ‘restrict(amp)’. The function call operator of the functor, or the lambda itself, must be marked with this keyword as well. There are some further restrictions on the kernel function; for example, a lambda must capture variables from the enclosing code by value, with the exception of ‘array’ objects, which must be captured by reference.

In the end, the slowest part of any application that uses GPU computing is copying data from CPU memory to GPU memory and back. You need to take this into account: if there is not much computing to do, the job will most likely finish faster on the CPU.

Using AMP

As you have noticed, AMP is inextricably linked with DirectX, but this doesn’t mean that you can use AMP only for graphics computing. Still, graphics is the most resource-intensive kind of computing and demands high speed, so the most interesting and illustrative examples relate to graphics.
