OpenCL 1.3: My Proposal For a Final 1.x Release

OpenCL 2.0 Feedback Series:

  1. OpenCL Standardization Issues
  2. My OpenCL Vision and Philosophy
  3. OpenCL 1.3: My Proposal For a Final 1.x Release (this article)
  4. OpenCL 2.0: SPIR Feedback and Vision

I propose that OpenCL 1.3 should be developed as the final 1.x version release. OpenCL 1.3 would provide minor enhancements and clarifications to OpenCL 1.2, and would defer major changes to OpenCL 2.0. My rationale for this proposal is fairly straightforward. First, there are already applications that use OpenCL and could benefit immediately from useful enhancements, without major conceptual changes. Second, I feel strongly that the OpenCL standard would greatly benefit from an open debate regarding its future. This debate may lead to major and disruptive changes in OpenCL 2.0 that would require major refactoring of existing software. OpenCL 1.3 would provide developers with an opportunity to develop useful software with the understanding that “something big” is coming.

I found it difficult to provide feedback on the OpenCL 2.0 specification without also providing feedback on OpenCL 1.2, due to the number of issues it inherits from the previous specification. I have specific feedback regarding OpenCL 1.2 features that could easily be accommodated by an OpenCL 1.3 specification. However, my feedback on OpenCL 2.0 is really “big picture” material that would result in major changes. Therefore, my OpenCL 1.3 proposal serves both as a vehicle for presenting that feedback and as a legitimate proposal for consideration. It is not detailed enough to actually become part of the specification, but it provides topics for debate.

OpenCL 1.3 should be designed to coexist with OpenCL 2.0. I have proposed that OpenCL should migrate to a capability-based programming model, and if this were adopted then OpenCL 1.3 would become an implementation capability. This would permit OpenCL 1.3 to peacefully coexist with OpenCL 2.0 indefinitely. To clarify, if OpenCL 2.0 supports capabilities then OpenCL 1.3 would be a capability that is available on some devices. This would permit developers to gradually migrate to OpenCL 2.0 features, or to restrict themselves to devices that support OpenCL 1.3.

Here is a list of OpenCL 1.3 features, in order of presentation:

  • Devices are uniquely identified across platforms, and are consistent.
  • Work-items may terminate themselves via OpenCL C.
  • Kernels may terminate normally or abnormally via OpenCL C.
  • The host can query kernel execution status.
  • The host has improved kernel argument type queries.
  • Device status may be queried, and integrity checks may be supported.
  • Memory access flags may be changed.
  • Standard integer types on host and device.
  • Clarification on clRetain and clRelease functions.
  • Improved host-device signed integer portability.
  • Revised error handling with asynchronous out-of-order events.
  • clSetKernelArg() supports const-correct code.

Device Identification

The OpenCL 1.2 standard provides no guarantees about the order or unique identity of devices. This is problematic if multiple platforms can manage the same physical device (e.g. AMD and Intel OpenCL implementations on a CPU). Developers who write load-balancing software must be able to uniquely identify devices. The identification of devices should, ideally, be preserved across processes to permit cooperating processes to agree on device allotment.

The standard could be modified to guarantee that cl_device_id is a unique identifier. The clCreateContext function requires a platform identifier and cl_device_id, which suggests that cl_device_id could be shared amongst platforms. This solution might also be problematic for calls to clGetDeviceInfo which may provide values based on both the platform and the device.

Alternatively, a property may be added to clGetDeviceInfo which is guaranteed to be unique for a particular device. This is my favored solution.
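
As an illustration of this second option, here is a minimal sketch of how a load balancer might use such a query. The CL_DEVICE_UNIQUE_ID parameter name and the device variable names are hypothetical; the proposal only requires that some property with these semantics exists.

/* Sketch: query a hypothetical CL_DEVICE_UNIQUE_ID property and use it to
 * decide whether two cl_device_ids from different platforms refer to the
 * same physical device. */
cl_ulong id_a, id_b;
clGetDeviceInfo(device_from_amd_platform,   CL_DEVICE_UNIQUE_ID, sizeof(id_a), &id_a, NULL);
clGetDeviceInfo(device_from_intel_platform, CL_DEVICE_UNIQUE_ID, sizeof(id_b), &id_b, NULL);
if ( id_a == id_b ) {
   /* Both platforms manage the same physical device; schedule work through
    * only one of them. */
}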

Work-Item Termination

The OpenCL 1.2 standard provides no mechanism for a work-item to terminate itself. Code within a kernel function body may call return, but termination is complicated for users to implement deep within nested function calls. A work-item should terminate if there is no useful work to be performed, but OpenCL C code can become very complex due to the lack of a termination function.

Work-items should be able to terminate themselves by calling a work_item_terminate() function. An OpenCL implementation should guarantee that after a work_item_terminate() call is made that the work-item will have no side-effects.
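
To illustrate, here is a minimal OpenCL C sketch, assuming the proposed work_item_terminate() function; the helper name and its arguments are hypothetical.

/*
 * Helper called deep inside a kernel. Without work_item_terminate(), the
 * "no useful work" condition would have to be propagated back up through
 * every caller's return value.
 */
void consume_next_task(__global const int* tasks, int task_count, int i)
{
   if ( i >= task_count )
      work_item_terminate();   /* Proposed: this work-item stops with no further side-effects. */

   /* ... process tasks[i] ... */
}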

Kernel Termination

A work-item should be capable of terminating the execution of its launching kernel, and hence all other work-items, via OpenCL C. If a work-item encounters some exceptional situation, it might need to terminate the entire kernel abnormally as a result. Developers could do this in OpenCL 1.2, but not in an elegant manner that enabled good software design. The OpenCL 1.3 specification could provide a defined behavior for work-item initiated kernel termination, along with relevant OpenCL C functions.

The prototypes for OpenCL C kernel termination functions are provided:

/*
 * Terminates the kernel with the given status.
 * Zero indicates successful execution, and any non-zero code indicates failure.
 */
void kernel_exit(int32_t status);

/*
 * Check if the kernel is currently exiting.  Non-zero if exiting, zero otherwise.
 */
int32_t is_kernel_exiting();

The work-item that calls kernel_exit() will terminate immediately. A kernel only has one exit status value. Due to the parallel execution model of OpenCL work-items, the standard must define what happens when multiple calls are made to kernel_exit() by different work-items, with potentially different values.

There are two cases to consider:

Case I: All calls to kernel_exit() are made with the identical status value x. In this case, the final kernel exit status is x.

Case II: More than one work-item calls kernel_exit(), and not all status values are identical. In this case, the kernel exit status can be any non-zero value that was supplied as an argument to kernel_exit().

The kernel exit status zero indicates success. The standard should reserve negative status values for particular common meanings, and leave positive non-zero values for application use. This permits OpenCL implementations to standardize common errors for use by application developers, while permitting developers to define their own application-specific error codes.

Termination of a kernel is not instantaneous. The work-item that calls kernel_exit() will terminate immediately. The work-group that contains the work-item which called kernel_exit() will complete without interruption. Work-groups that do not contain a work-item which has called kernel_exit(), but which have already had some side-effect, will complete. However, work-groups which have not yet had a side-effect may, or may not, complete.

This guarantee is sufficient for programmers to reason about the behavior of kernel_exit() and provides implementations with considerable flexibility. If hardware can prevent the launch of new work-groups after a call to kernel_exit(), it is free to do so. However, if hardware must launch all work-groups regardless of the call to kernel_exit(), it is also free to do so, and these work-items may have side-effects.

The kernel function is_kernel_exiting() is very simple, and just checks whether or not any calls to kernel_exit() have been made. There may be a delay between the time that a work-item calls kernel_exit() and the time that the function is_kernel_exiting() returns true.

Developers may use the is_kernel_exiting() function to effectively prevent new work-items from launching within a kernel, as shown:

/*
 * Prevent new work-items from launching.
 */
__kernel void sample(...)
{
   // This work-item will do nothing
   if ( is_kernel_exiting() ) return;

   /* Kernel Body */
}

Potential Implementation

Implementors may use a 32-bit signed integer atomic operation and a hidden kernel argument to implement these functions. Here is a potential sample implementation:

/* Sentinel value meaning that no exit status has been set yet. */
#define KERNEL_EXIT_STATUS_SENTINEL 0x80000000

/*
 * reserved_exit_status is the hidden kernel argument: a __global int*
 * initialized to KERNEL_EXIT_STATUS_SENTINEL before the kernel launches.
 */
void kernel_exit(int32_t status)
{
   /* Only the first status value written is kept; later calls have no effect. */
   atomic_cmpxchg(reserved_exit_status, KERNEL_EXIT_STATUS_SENTINEL, status);
   work_item_terminate();
}

int32_t is_kernel_exiting()
{
   /* The exchange is a no-op; this simply reads the current value atomically. */
   return atomic_cmpxchg(reserved_exit_status, KERNEL_EXIT_STATUS_SENTINEL,
                         KERNEL_EXIT_STATUS_SENTINEL) != KERNEL_EXIT_STATUS_SENTINEL;
}

Discussion

Note that developers can easily write their own assert functions using kernel_exit(), which permits developers to write better code. It would be best if developers were able to supply preprocessor information such as __FILE__ and __LINE__ for the purposes of an assert, and developers could do this themselves using extra global buffers. However, developers would again be limited by their inability to access kernel arguments within nested function calls. A cl_assert() function would be a helpful addition to OpenCL C, which would do the following:

  • Set kernel exit status to a defined ASSERT code
  • Use a hidden kernel argument to set __FILE__
  • Use a hidden kernel argument to set __LINE__

The developer could then use OpenCL C compiler options (for example, a preprocessor definition) to control whether assertions are enabled. Only one work-item's assertion failure can be recorded, but realistically, if any work-item at all triggers an assertion, there is a problem with the software.
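
Even without a built-in cl_assert(), developers could build a basic assertion on top of the proposed kernel_exit(). The following is a minimal sketch; the EXIT_ASSERT_FAILED code, the MY_NDEBUG macro, and the sample kernel are all hypothetical.

/* Hypothetical application-defined exit code for assertion failures (positive
 * codes are left to applications under this proposal). */
#define EXIT_ASSERT_FAILED 1

/* A minimal assertion built on kernel_exit(); compiled out when MY_NDEBUG is
 * supplied as a -D option to the OpenCL C compiler. */
#ifdef MY_NDEBUG
#define my_assert(cond) ((void)0)
#else
#define my_assert(cond) do { if (!(cond)) kernel_exit(EXIT_ASSERT_FAILED); } while (0)
#endif

__kernel void scale(__global float* data, float factor, int n)
{
   int i = get_global_id(0);
   my_assert(n > 0);   /* Terminates the entire kernel on failure. */
   if ( i < n ) data[i] *= factor;
}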

It is a common problem that work-item functions might need access to a few special buffers to do something interesting. It might be useful to add functions to OpenCL C that enable access to hidden buffers (hidden means that they don’t have a position index that interferes with clSetKernelArg). This would allow functions to be written that have some global effect, so long as the kernel has hidden buffers passed appropriately.

Kernel Execution Status

OpenCL 1.3 should provide the host with API functions to obtain kernel execution status information. Work-items are able to terminate the kernel via the kernel_exit() function; however, implementations may also terminate kernels, for instance in the case of a memory protection error. OpenCL libraries may use the kernel execution status to throw exceptions in languages that support them, which provides a primitive method for OpenCL C code to generate an exception that is thrown in the host application.

The prototype for the kernel execution status is simple:

/*
 * Get information on kernel execution represented as an event.
 */
cl_int clGetKernelExecutionStatus(cl_event event, cl_kernel_status_info param_name, size_t param_value_size, void* param_value, size_t* param_value_size_ret);

The API function is similar to all other OpenCL functions that query information. A table of parameters is supplied here:

cl_kernel_status_info | Return Type | Info. Returned in param_value
CL_KERNEL_STATUS_EXIT_CODE | int32_t | The execution status for the kernel.
CL_KERNEL_STATUS_EXIT_MSG | char[] | A user-friendly message that explains the execution status.

Developers may register their own messages for user-defined exit codes, which may be any positive integer. Messages can be program-wide or kernel-specific.

The prototypes for message registration are:

/*
 * Register a message for a specific error code, all programs share the message for the error code.
 */
cl_int clRegisterProgramErrorMessage(cl_program program, int32_t error_code, const char* msg);

/*
 * Register a message for a specific error code for a single kernel; the message applies only to that kernel.
 */
cl_int clRegisterKernelErrorMessage(cl_program program, const char* kernel_name, int32_t error_code, const char* msg);

Messages may be registered for an error code across the entire program, and all kernels within it, through clRegisterProgramErrorMessage(). This means that user-defined exit codes have the same meaning throughout the program. Users may also define exit codes per kernel through clRegisterKernelErrorMessage(), and kernel-specific meanings override program-wide meanings. If a particular error code is registered with a message by both a program and a kernel, the kernel message takes precedence regardless of registration order. There is no facility to unregister an error message.

OpenCL reserves all negative error codes and the error code zero. Negative error codes have specific meanings, and may be used by implementations or users. Developers should prefer the provided common codes, when applicable, in the interest of portability.

Here is a small sample of potential codes, and their meanings:

Name | Code | Meaning
EXIT_SEARCH_COMPLETE | -1 | A search algorithm has completed.
EXIT_OUT_OF_BOUNDS | -2 | A buffer access out-of-bounds has occurred.
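
To show how these pieces fit together on the host, here is a minimal sketch. The clRegisterKernelErrorMessage() and clGetKernelExecutionStatus() calls are the proposed functions; the kernel name, the EXIT_ASSERT_FAILED code, and the helper function are hypothetical.

#include <stdio.h>
#include <stdint.h>
#include <CL/cl.h>

#define EXIT_ASSERT_FAILED 1   /* Hypothetical application-defined exit code. */

/* Sketch: register a message, run the kernel, then inspect its exit status. */
cl_int run_and_check(cl_command_queue queue, cl_program program, cl_kernel kernel)
{
   cl_event event;
   int32_t exit_code = 0;
   char msg[256] = "";
   size_t global = 1024;

   clRegisterKernelErrorMessage(program, "scale", EXIT_ASSERT_FAILED,
                                "scale: assertion failed inside the kernel");

   cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
                                       0, NULL, &event);
   if (err != CL_SUCCESS) return err;
   clWaitForEvents(1, &event);

   /* Proposed query: how did the kernel itself exit? */
   clGetKernelExecutionStatus(event, CL_KERNEL_STATUS_EXIT_CODE,
                              sizeof(exit_code), &exit_code, NULL);
   if (exit_code != 0) {
      clGetKernelExecutionStatus(event, CL_KERNEL_STATUS_EXIT_MSG,
                                 sizeof(msg), msg, NULL);
      fprintf(stderr, "kernel failed (%d): %s\n", (int)exit_code, msg);
   }
   clReleaseEvent(event);
   return CL_SUCCESS;
}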

Kernel Argument Type Queries

Due to hardware alignment differences, the host and device may pad C structures differently, which makes binary compatibility difficult. OpenCL has clGetKernelArgInfo(), which provides basic type information, but it is of limited use to developers. In particular, the type name may be obtained, but it does not help the host create a valid structure to pass to the device. OpenCL 1.3 should introduce a new kernel argument type query system which helps developers transfer data between device and host. OpenCL middleware may also use this facility to ensure type-safety of kernel arguments supplied using clSetKernelArg(), a critical feature for developers.

The cl_entity type is introduced to represent an OpenCL variable, structure member variable, structure, union, union member variable, typedef, or fundamental type. An entity has a type, and may have a name. Entities may be compositions of other entities, for example a struct is an entity composed of other entities. The OpenCL host can use the cl_entity type to determine the complete binary structure of composite types.

The clGetKernelArgInfo() function is extended with the following parameter:

cl_kernel_arg_info | Return Type | Info. returned in param_value
CL_KERNEL_ARG_ENTITY | cl_entity | The entity associated with the kernel argument.

The clGetEntityInfo() function is introduced with the following prototype and behavior:

cl_int clGetEntityInfo(cl_program program, cl_entity entity, cl_entity_info param_name, size_t param_value_size, void* param_value, size_t* param_value_size_ret);
cl_entity_info | Return Type | Info. returned in param_value
CL_ENTITY_TYPE | cl_uint | Returns the type of the current entity. One of: CL_ENTITY_TYPE_UNION, CL_ENTITY_TYPE_STRUCT, CL_ENTITY_TYPE_TYPEDEF, CL_ENTITY_TYPE_INT, CL_ENTITY_TYPE_INT4, and so on (I’ve only shown a subset here).
CL_ENTITY_NAME | char[] | The declared name of an entity if it is a member of a union or struct. Anonymous unions will have an empty name. If the type is CL_ENTITY_TYPE_TYPEDEF, then this is the user-supplied name for the type. If this cl_entity is the result of a query to clGetKernelArgInfo(), it is the name of the parameter, equivalent to CL_KERNEL_ARG_NAME.
CL_ENTITY_QUALIFIER | cl_entity_qualifier_type | A combination of CL_ENTITY_QUALIFIER_CONST, CL_ENTITY_QUALIFIER_VOLATILE, CL_ENTITY_QUALIFIER_RESTRICT, and CL_ENTITY_QUALIFIER_NONE. If the type is CL_ENTITY_TYPE_TYPEDEF, this is always CL_ENTITY_QUALIFIER_NONE.
CL_ENTITY_SIZE | size_t | The size of the entity in bytes, including padding in the case of a struct or union.
CL_ENTITY_TYPEDEF_TYPE | cl_entity | If the entity is a typedef, the underlying type is returned.

The clNumMemberEntities() function is introduced, which allows the host to determine the number of member entities within a struct or union type. Its prototype is:

size_t clNumMemberEntities(cl_program program, cl_entity entity);

The clGetEntityMemberInfo() function is introduced with the following prototype and behavior:

cl_int clGetEntityMemberInfo(cl_program program, cl_entity entity, size_t member_index, cl_entity_member_info param_name, size_t param_value_size, void* param_value, size_t* param_value_size_ret);
cl_entity_member_info | Return Type | Info. returned in param_value
CL_ENTITY_MEMBER_ENTITY | cl_entity | The entity at the provided index.
CL_ENTITY_MEMBER_OFFSET | size_t | The offset of the entity relative to the parent entity that contains it, when the entity is a union or struct member. Otherwise, zero is returned.

The above functions are sufficient for developers to traverse a structure or union type and obtain all the information required for binary compatibility. They can also be used by middleware to ensure type-safety of low-level clSetKernelArg() calls. Due to the importance of this functionality, all programs and kernels should support these queries in order to conform to my OpenCL 1.3 proposal.

A few assurances are required for developers to find these calls useful. Entity types are always associated with a program, and the cl_entity variable identifies a type defined within that program. If two cl_entity objects are equal, and are from the same program, then they refer to the same type. This assurance permits developers to understand the structure of types in terms of types they have seen before, without checking memory layout for equivalence themselves.

Typedefs must be handled with care, because developers might be interested in both the underlying type, and the name of the typedef. For example, developers might use the typedef name to understand some meaning aside from its simple use as a type alias. Consider the typedef name “length_t” which is defined as an alias for int. The developer could care that a type is supposed to be a length which happens to be an int type for this application. For this reason, a cl_entity that refers to a typedef must have its aliased type accessible, which might be another typedef.
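
To make the intended usage concrete, here is a sketch of a host-side routine that recursively prints the layout of an entity. It assumes the proposed clGetEntityInfo(), clNumMemberEntities(), and clGetEntityMemberInfo() functions as described above; the helper name and output format are my own.

#include <stdio.h>
#include <CL/cl.h>

/* Sketch: recursively print the byte layout of an entity obtained from
 * clGetKernelArgInfo(..., CL_KERNEL_ARG_ENTITY, ...). */
static void print_entity_layout(cl_program program, cl_entity entity,
                                size_t base_offset, int depth)
{
   char name[256] = "";
   size_t size = 0;
   cl_uint type = 0;

   clGetEntityInfo(program, entity, CL_ENTITY_NAME, sizeof(name), name, NULL);
   clGetEntityInfo(program, entity, CL_ENTITY_SIZE, sizeof(size), &size, NULL);
   clGetEntityInfo(program, entity, CL_ENTITY_TYPE, sizeof(type), &type, NULL);

   printf("%*s%s: offset %zu, size %zu\n", depth * 2, "", name, base_offset, size);

   if (type != CL_ENTITY_TYPE_STRUCT && type != CL_ENTITY_TYPE_UNION)
      return;

   /* Walk the members of a struct or union, carrying the absolute offset. */
   size_t members = clNumMemberEntities(program, entity);
   for (size_t i = 0; i < members; ++i) {
      cl_entity member;
      size_t offset = 0;
      clGetEntityMemberInfo(program, entity, i, CL_ENTITY_MEMBER_ENTITY,
                            sizeof(member), &member, NULL);
      clGetEntityMemberInfo(program, entity, i, CL_ENTITY_MEMBER_OFFSET,
                            sizeof(offset), &offset, NULL);
      print_entity_layout(program, member, base_offset + offset, depth + 1);
   }
}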

Device Status and Integrity

OpenCL 1.3 should introduce the function clGetDeviceStatusInfo() for device status queries, which permit developers to check on the status and health of an OpenCL device. Due to the separation of the host and device, it is possible for a device to fail and produce incorrect results without affecting the execution of the host application. Devices may also provide interesting information, such as processor and memory utilization, which should be accessible in a standard manner. clGetDeviceStatusInfo() may also be used to issue requests that take time to complete, which enables developers to write system utilities.

The prototype and behavior for clGetDeviceStatusInfo() are below:

cl_int clGetDeviceStatusInfo(cl_device_id device, cl_device_status_info param_name, size_t param_value_size, void* param_value, size_t* param_value_size_ret);

Here is a table of parameters and behavior:

cl_device_status_info | Return Type | Info. Returned in param_value
CL_DEVICE_STATUS_COMMAND_PROGRESS | cl_uint | If a command was started, this is a value from 0..100 inclusive which indicates progress of the command, with zero indicating no progress. If no command is running, 100 is returned.
CL_DEVICE_STATUS_COMMAND_COMPLETE | cl_bool | Indicates whether a command has completed. CL_TRUE if the command is complete, CL_FALSE otherwise. If no command has been issued, CL_TRUE is returned.
CL_DEVICE_STATUS_MEMORY_AVAILABLE | cl_ulong | The amount of global memory free, in bytes. The value should be considered out-of-date as soon as it is returned. This does not initiate a command.
CL_DEVICE_STATUS_ALU_UTILIZATION | cl_uint | An estimate of ALU utilization on the device, from 0..100, with zero being no utilization. This is not to be used for profiling; it provides only an estimate.
CL_DEVICE_STATUS_MEMORY_UTILIZATION | cl_uint | An estimate of memory bandwidth utilization on the device, from 0..100, with zero being no utilization. This is not to be used for profiling; it provides only an estimate.
CL_DEVICE_STATUS_COMMAND_MEM_CHECK | cl_int | Check the memory integrity of the device. This will perform some check on global, local, and private memories to ensure memory hardware is operating correctly. This issues an asynchronous command which must be checked later. The return value is not used.
CL_DEVICE_STATUS_COMMAND_ALU_CHECK | cl_int | Check the ALU operation on the device. This will perform some check on compute units and processing elements to ensure proper operation. This issues an asynchronous command which must be checked later. The return value is not used.
CL_DEVICE_STATUS_COMMAND_STATUS | cl_bool | A report from the command that completed, indicating success or failure. Only valid if the command has completed, as queried by CL_DEVICE_STATUS_COMMAND_COMPLETE.

This interface mixes basic information that may be read from the device with specific commands that may be issued. While a status command is executing, commands from command-queues will not be dispatched, the device will be unavailable, and no further status commands may be issued through clGetDeviceStatusInfo(). Upon completion of the status command, command-queues will resume dispatching work to the device and the results will be made available through clGetDeviceStatusInfo().

This functionality is absolutely critical to enterprise software which demands data integrity. Developers may decide to perform some critical work and check the integrity of a device upon its completion. If the device fails its integrity checks, then the application may report an error rather than store corrupt data. For example, a compute node may report that it has failed and stop accepting future work, or a database server may scrap any pages which have been processed since the last integrity check.
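
As a sketch of how an application might use these queries (assuming the proposed clGetDeviceStatusInfo() function and parameter names; the polling loop is simplified and the helper name is mine):

/* Sketch: run an ALU integrity check and poll for its completion. */
cl_bool device_alu_ok(cl_device_id device)
{
   cl_int dummy = 0;
   cl_bool complete = CL_FALSE, passed = CL_FALSE;

   /* Issue the asynchronous check; the value returned in param_value is unused. */
   clGetDeviceStatusInfo(device, CL_DEVICE_STATUS_COMMAND_ALU_CHECK,
                         sizeof(dummy), &dummy, NULL);

   /* Poll until the command reports completion. */
   do {
      clGetDeviceStatusInfo(device, CL_DEVICE_STATUS_COMMAND_COMPLETE,
                            sizeof(complete), &complete, NULL);
   } while (complete != CL_TRUE);

   /* Read the pass/fail report from the completed command. */
   clGetDeviceStatusInfo(device, CL_DEVICE_STATUS_COMMAND_STATUS,
                         sizeof(passed), &passed, NULL);
   return passed;
}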

Memory Access Flags May Change

OpenCL 1.2 provides flags for host and kernel access permissions when a memory object, such as a buffer, is created. Unfortunately, there is no method by which these flags may be modified. Technically, a developer may use clCreateSubBuffer() on the entire buffer to adjust memory flags, but this is an abuse of sub-buffers rather than an elegant solution, and it is not available for image objects.

I propose that OpenCL 1.3 provide the following prototype for a function that modifies memory access flags:

cl_int clModifyMemoryFlags(cl_mem object, cl_mem_flags flags);
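
Usage would be straightforward. Here is a minimal sketch, assuming the proposed clModifyMemoryFlags() function; the surrounding variables (context, queue, lut_bytes, host_lut, err) are assumed to exist.

/* Sketch: create a buffer, initialize it once, then lock it down so that
 * kernels may only read it and the host may no longer access it. */
cl_mem lut = clCreateBuffer(context, CL_MEM_READ_WRITE, lut_bytes, NULL, &err);
clEnqueueWriteBuffer(queue, lut, CL_TRUE, 0, lut_bytes, host_lut, 0, NULL, NULL);
clModifyMemoryFlags(lut, CL_MEM_READ_ONLY | CL_MEM_HOST_NO_ACCESS);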

Standard Integer Types

The OpenCL C standard currently defines the size and binary representation of integral types. The standard should be amended to include sized integer types. In particular, the following typedefs should be provided to OpenCL C:

typedef char             int8_t;
typedef unsigned char   uint8_t;

typedef short           int16_t;
typedef unsigned short uint16_t;

typedef int             int32_t;
typedef unsigned int   uint32_t;

typedef long            int64_t;
typedef unsigned long  uint64_t;

Note that sized vector types should also be provided, but they have been omitted from the above list. Furthermore, the OpenCL host API should be amended to replace the use of cl_ types with sized integer types. This guarantees portability and provides clarity for programmers. Another amendment must be made to the OpenCL standard to ensure adherence to the C standard, and this is addressed later in this article.

Clarification on Retain and Release

OpenCL provides functions to retain and release objects obtained from the API, which provides reference counting. Unfortunately, the retain and release functions are unsafe as currently specified, which makes development of middleware and libraries problematic. The core problem is that clRelease functions can fail. OpenCL 1.3 should amend all of the clRelease functions to guarantee correct behavior; to minimize impact on existing programs, they should always return CL_SUCCESS.

To illustrate the nature of the problem, consider OpenCL objects x and y, and the following sequence of instructions:

clRetain(x);  // Success
clRetain(y);  // Success
clRelease(x); // FAILS!

In this simple code fragment, the release of object x has failed. Now, suppose that a library wants to undo its operations so that the user sees no effect on reference counting. The library has no reliable way to do this, because another call to release object x might fail again. The release of object y might also fail, which further complicates the situation.

Assuming that implementations have chosen to use reference counting for clRelease and clRetain functions, implementations should return errors only for clRetain. The clRelease function should delete an object when the count becomes zero, as normal, and future calls to clRelease should have no effect. The clRelease function should silently accept invalid objects, which provides implementations with flexibility without affecting user applications.
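
To see why the guarantee matters, consider the following rollback path in a library; it is only well-defined if clRelease calls cannot fail, as proposed above. The helper name is hypothetical; the retain and release functions are the standard ones.

/* Sketch: acquire references to two objects, rolling back cleanly on failure. */
cl_int acquire_pair(cl_mem x, cl_mem y)
{
   cl_int err = clRetainMemObject(x);
   if (err != CL_SUCCESS) return err;

   err = clRetainMemObject(y);
   if (err != CL_SUCCESS) {
      clReleaseMemObject(x);   /* Guaranteed to succeed under this proposal. */
      return err;
   }
   return CL_SUCCESS;
}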

Host Integer Representation

OpenCL has an incompatibility with the C standard, in that the C standard provides no guarantees on the representation of integers. OpenCL C states that integers are a prescribed size and are represented using 2’s complement. Given that the C standard provides no such guarantee, OpenCL must provide compiler-specific API types for device and host interoperability. This complicates the API and limits data manipulations that may occur on the host.

I have discussed this issue with colleagues, and the best suggestion for rectifying this situation comes from Thomas Jones. He suggested that OpenCL should simply state that OpenCL is compatible only with hosts and compilers that use 2’s complement representations. I agree with this solution, because it is simple and practical. Once developers are assured that OpenCL guarantees 2’s complement representation on both host and device, they can manipulate data on either side without fear of portability issues. The OpenCL API can also be simplified considerably by replacing cl_ types with sized integral types.

I will also note that there are other incompatibilities with the C standard which should be addressed, such as the number of bits in a byte. The C standard provides CHAR_BIT rather than defining a byte to be 8 bits. The OpenCL standard should provide a proper acknowledgement of its C standard assumptions, so that developers may ensure their compilers and platforms are compliant. It sounds silly that modern systems might define a byte as anything other than 8 bits, but given that OpenCL is a C API it must acknowledge its implicit assumptions about the C environment.
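
As a sketch of how a host application could verify these assumptions today (this uses the C11 _Static_assert facility and the standard <limits.h> macros; it is an illustration, not part of the proposal):

#include <limits.h>

/* Compile-time checks for the assumptions discussed above. */
_Static_assert(CHAR_BIT == 8, "OpenCL host code assumes 8-bit bytes");
_Static_assert(INT_MIN == -INT_MAX - 1,
               "OpenCL host code assumes 2's complement integers");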

Revised Event Error Handling

It is very difficult to write enterprise software with OpenCL 1.2, given this statement from the specification:

“If the execution of a command is terminated, the command-queue associated with this terminated command, and the associated context (and all other command-queues in this context) may no longer be available. The behavior of OpenCL API calls that use this context (and command-queues associated with this context) are now considered to be implementation-defined. The user registered callback function specified when context is created can be used to report appropriate error information.”

In other words, if something goes wrong, all bets are off, and you might not even be provided with error information. Since memory is associated with a context, developers might not be able to access it after a crash. This forces developers writing reliable software to frequently backup memory contents, and to overuse contexts. Although this is an acceptable strategy, the standard can be improved without undue burden on implementers. In this article, I will outline a new asynchronous execution model and error policy which may degrade to the above statement but with stronger guarantees.

Traditional and OpenCL Asynchronous Programming Models

Applications use asynchronous programming to handle long-duration operations, such as file or network I/O. The advantage of asynchronous programming is that the application is free to work on other tasks while a slow operation completes. High-performance server applications will use asynchronous programming to manage many simultaneous I/O operations. This type of asynchronous programming is enabled by functions that start work asynchronously, and functions that monitor work for completion. Asynchronous operations do not have direct dependencies upon each other, and the application will use callbacks (or other designs) to issue new asynchronous operations after one completes. For the purpose of this discussion, I will call this the traditional asynchronous programming model.

OpenCL applications orchestrate the cooperation of multiple devices available to the host to achieve some useful computational result. This necessitates an asynchronous programming model that facilitates general purpose programming, in contrast to the traditional model which is concerned with freeing the application to work on other tasks. The objective of an OpenCL application is to maximize the utilization of devices. To achieve peak utilization, developers must write efficient kernels and ensure that devices are never waiting idle for new asynchronous commands to arrive. If an OpenCL implementation has work pools in kernel space, it is possible that a device will continue to execute commands on behalf of a non-running process. Commands may have dependencies upon other commands, meaning that a particular command may be blocked until another has completed. In particular, commands form a directed acyclic graph (DAG) in terms of dependencies. This is a fundamental difference between the traditional and OpenCL asynchronous programming models. I have provided a simple diagram that illustrates an OpenCL command DAG.

A command graph.

OpenCL provides command-queues, events, and clEnqueue functions which enable asynchronous programming. OpenCL command-queues may be in-order, which means that commands are dispatched to the device in a FIFO manner, or out-of-order which means that the implementation is free to dispatch commands in any valid order (i.e. subject to dependencies). OpenCL devices may be capable of processing several commands simultaneously, for example a GPU may interleave kernel execution and host memory transfers. An OpenCL implementation that has command DAGs from several contexts is provided with considerable flexibility in what commands it executes. A good implementation should include a command scheduler which maximizes the utilization of the device automatically, by examining all command DAGs from all contexts.

My Proposed OpenCL Asynchronous Error Model

The current OpenCL error model is problematic for developers, due to its fragility. Consider an application which uses a single context and command-queue to execute two independent kernels, A and B. Suppose that these kernels are completely independent: they access different buffers and have no command dependencies. Suppose that A causes some error. Based on the current error policy, kernel B, which knows nothing of A, cannot run. Worse, the command-queue and context are now in an unknown state! I propose a modification to the OpenCL execution and error model that would permit kernel B to execute regardless of the execution status of kernel A.

Conceptually, if an OpenCL command fails, any command that depends upon it should also fail, whether or not it has been enqueued yet. OpenCL commands depend upon each other, but they are also dependent upon the non-const memory objects which they access. If an OpenCL command fails, then it may have corrupted any non-const memory which it can access. I propose that if an OpenCL command X fails, that any command that depends upon X should fail, and any commands which access the non-const memory arguments of X should also fail. It might help to imagine that X poisons any non-const memory that it can access, which means that all commands that use that memory after X will fail. I believe that this is a reasonable error model, but there are some details which must be addressed.

I define an error as an (event, error_code) pair, which indicates which event failed, and why. When a command fails with an error, future commands that depend upon the failed memory or command will also fail. Developers must be able to determine which command caused this cascading failure. In particular, developers want to know which command was the first to fail. Unfortunately, defining which command failed first is slightly complicated by the possibility of multiple failures occurring along multiple paths of a command DAG. Developers query events for execution status, but may be provided with an (event, error_code) pair from a previous command. Consider the diagram below which illustrates such a situation. Which error should be reported: X or Y?

Divergent error paths.

The best option is to report the error with the greatest distance (i.e. number of edges) from the node which is queried for its error status. In the event that many errors have equal distance from the queried node, any error from this set may be returned. Notice that only a single error is returned, but application developers can use this fact to develop their own error strategies, as I will outline momentarily. In the above diagram, the error X should be reported.

Developers should not check the execution status of each individual command. Instead, developers actually care about command status at application-specific points. I mentioned that the OpenCL asynchronous model requires a large set of commands to be available to the implementation for efficient scheduling of hardware. This means that developers should provide many commands, with their dependencies, at once, so that the implementation can sequence commands for maximum utilization. It also means that the implementation must be responsible for tracking which command has caused an error, and eventually discarding this information. To summarize: developers should check errors infrequently for best performance, and the implementation should track which command paths have failed and the error that caused each failure. My solution to these problems is motivated by the C++11 std::future object.

Developers are to be provided with a new API call: clCheckEventExecution(). This call is analogous to std::future<T>::get(), with some modifications for dependencies. It may only be called once on an event (extra calls are undefined behavior), and it will either indicate successful execution, or it will return an (event, error_code) pair containing the event that triggered the failure. The clCheckEventExecution() call indicates that an application has completed some desired operation, and all predecessor events (i.e. the set of dependencies, and recursively, dependencies of dependencies) will no longer be used as dependencies in future commands. Just as calling clCheckEventExecution() on the same event more than once is undefined, calling clCheckEventExecution() on any predecessor event, for which clCheckEventExecution() has been called, is also undefined. The call clCheckEventExecution() blocks until the target command has completed. There are corner cases that must be handled, but first I want to outline how developers will use this feature.

Developers and implementers are given considerable flexibility with this feature. Developers are able to enqueue some work in the form of a command graph, and check that there were no problems encountered in a particular event or its predecessors. It enables a rich variety of error handling strategies, because developers can decide when to call clCheckEventExecution(). Implementers know that when clCheckEventExecution() is called that predecessor events will no longer be used in future commands, so the implementation can clean up internal data structures as needed. Implementers know that when clCheckEventExecution() is called, the host application blocks to wait for the command and its predecessors to complete. The implementation can use this fact to prioritize commands to unblock the host application as quickly as possible, whereas command sequences that have not yet had clCheckEventExecution() called can be delayed without any user application noticing. In addition, the implementation knows that if clCheckEventExecution() has been called, that predecessors will not be used in future. This permits the implementation to have complete knowledge for scheduling purposes.
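
Here is a sketch of the intended usage pattern. The proposal does not pin down an exact signature for clCheckEventExecution(), so the one below, its output parameters, and the handle_failure() helper are hypothetical.

/* Hypothetical signature: on failure, the (event, error_code) pair that first
 * caused the cascading failure is returned through the output parameters. */
cl_int clCheckEventExecution(cl_event event, cl_event* failed_event, cl_int* failure_code);

/* Sketch: enqueue a small command graph, then check its status once, at an
 * application-specific point, rather than after every command. */
void run_graph(cl_command_queue queue, cl_kernel a, cl_kernel b, size_t n)
{
   cl_event ev_a, ev_b;
   clEnqueueNDRangeKernel(queue, a, 1, NULL, &n, NULL, 0, NULL, &ev_a);

   /* b depends on a; the implementation is free to schedule accordingly. */
   clEnqueueNDRangeKernel(queue, b, 1, NULL, &n, NULL, 1, &ev_a, &ev_b);

   cl_event failed;
   cl_int   code;
   if (clCheckEventExecution(ev_b, &failed, &code) != CL_SUCCESS) {
      /* 'failed' identifies the predecessor that triggered the cascade. */
      handle_failure(failed, code);   /* Hypothetical application handler. */
   }

   clReleaseEvent(ev_a);
   clReleaseEvent(ev_b);
}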

What happens if clCheckEventExecution() is called on multiple paths? Consider the illustration below:

Common error predecessor.

In this case, execution has split into multiple paths, but both paths share the X predecessor with an error E. Each path must have clCheckEventExecution() called, because clCheckEventExecution() only checks predecessors. The implementation will have to store the (X, E) pair until all paths have called clCheckEventExecution().

I mentioned earlier that the non-const memory objects of a command are poisoned by command failure, which means that commands that share non-const memory objects will fail. This introduces some complexity into error handling, because an error might not be caused by a predecessor. Instead, an error may be caused by the use of a common memory object. If a command C accesses a poisoned memory object, then it will fail with a special error code, such as CL_ERROR_ACCESS_BAD_MEMORY, and its dependencies will fail in the usual manner, with the error (C, CL_ERROR_ACCESS_BAD_MEMORY) reported. Developers do not know which event caused the bad memory access; it is up to them to write software that accommodates this reasonable behavior. Errors should be divided into two sets: errors that poison non-const memory, and errors that do not. This accommodates benign errors that do not affect memory, such as errors related to incorrect use of the API. It also provides an important guarantee for developers regarding the state of memory, and whether or not they may salvage memory contents after an error.

Middleware developers will require access to the set of non-const memory objects associated with a command. An API call should be provided that indicates which cl_mem objects, for a particular command, are understood to be non-const by the implementation. This permits middleware to determine which commands have memory dependencies, which may be important to the error policy adopted by that middleware.

Equivalence

I mentioned that this error strategy degrades to equivalence with the current error policy of OpenCL 1.2, but with stronger guarantees. Implementations might set the CL_ERROR_ACCESS_BAD_MEMORY error code globally on a device for a particular context, and any command enqueued without dependencies to any command-queue associated with that device will immediately encounter this error. The developer will call clCheckEventExecution() for some command, and since there must be a predecessor without dependencies, it will report this error. My proposed error strategy does not require that commands actually launch in the case of a predecessor with an error, therefore no enqueued command will have any effect. The device within the context is essentially useless at this point, but it has degraded in a defined manner. Note that middleware developers might have logic that attempts to find the command which triggered the CL_ERROR_ACCESS_BAD_MEMORY behavior, and therefore OpenCL should provide some method of determining that the error is actually due to poor error handling support on the part of the implementation.

OpenCL 2.0

I have proposed this strategy specifically for my OpenCL 1.3 proposal. I may outline a different strategy for OpenCL 2.0, which would be radically different.

clSetKernelArg() Supports Const

OpenCL provides a bridge between host and device programs. The host API manipulates the device, but it has another role: to ensure that the semantics of the host application are maintained. Experienced C++ developers write const-correct code, which assures users that const interfaces will not manipulate const data. This is an important aspect of safe C++ interface design, and OpenCL should provide some basic support for it.

Consider, at a high level, that a C++ object containing an OpenCL buffer may be const; logically, the contained buffer should be read-only for any OpenCL kernel. At present, a library or middleware has no guarantee that a buffer provided to OpenCL will be treated as read-only data by a kernel. It is possible that a kernel will violate the const guarantee without warning or error, which makes it impossible for developers to write const-correct code. A library might change the access permissions of a buffer for the duration of a kernel call, but this is potentially expensive and semantically incorrect. The clSetKernelArg() function should be adjusted to allow the host to indicate whether a cl_mem object should be treated as const or non-const. An error should be returned by clSetKernelArg() if the host has specified that a const cl_mem value will be passed, but the kernel prototype declares a non-const argument.
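
One possible shape for this is sketched below. The clSetKernelArgConst() name is purely illustrative; the proposal only asks that clSetKernelArg() (or a variant of it) let the host declare const intent.

/* Hypothetical variant of clSetKernelArg() that declares the argument const.
 * It would fail if the kernel's corresponding parameter is a non-const pointer. */
cl_int clSetKernelArgConst(cl_kernel kernel, cl_uint arg_index, size_t arg_size, const void* arg_value);

/* Usage sketch: 'lut' is a buffer the host promises kernels will only read. */
cl_int err = clSetKernelArgConst(kernel, 0, sizeof(cl_mem), &lut);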

Conclusion

I have carefully considered each feature that I suggested in this article. Each of the proposed features is useful for application and middleware developers, and should not require extensive effort on the part of implementers. I am currently writing my suggestions for OpenCL 2.0, and this article has provided me with an opportunity to address flaws in OpenCL 1.2. Please feel free to comment with any suggestions you may have regarding my proposal.

One thought on “OpenCL 1.3: My Proposal For a Final 1.x Release”

  1. Another point that needs clarification and standardization is memory management in case of overcommit and/or multiple device management within the same context.

    In contrast to some vendor-specific solutions, OpenCL, probably with inspiration from OpenGL, manages its memory in abstract terms: you have OpenCL buffers, which are more sophisticated objects than simple pointers to device memory. There are some good reasons for this (such as the fact that an OpenCL buffer is shared by all the devices in the same context, and therefore the concept of a ‘device pointer’ doesn’t really make sense), but at the same time this is not matched by adequate tools to manage them.

    A typical case in which this shows is the case of overcommit: allocating buffers for which the total amount of memory used is higher than the amount of memory available on a single device. This is allowed by the OpenCL specification, because an implementation can ‘juggle’ buffers, moving them from device to host to device depending on need.

    Say for example that three buffers A, B and C are allocated, each occupying 2/5ths of the device global memory, and that a kernel K uses two of them at a time (e.g. A and B on the first call, A and C on the second call, B and C on the third call). At any one time, only the two buffers referred to by the kernel need to reside on the device, and the third buffer (which would otherwise cause an out-of-memory error) may be swapped out of the device memory (into host memory) until it’s needed; so a conforming implementation may swap C out before the first call, swap B out and swap C in before the second call, and finally swap A out and swap B back in before the third call.

    To my knowledge none of the major OpenCL implementations actually does this. What’s worse, the OpenCL 1.2 specification has a clEnqueueMigrateMemObjects() call to allow the programmer explicit, finer control over object migration (from device to device, but also swap-in/swap-out by migrating to the host), but this is supposed to be only a hint, which can be ignored by the implementation (and which, to my knowledge, all implementations presently ignore).

    There are two issues here: one, the existing platforms do not migrate buffers off the devices automatically when they need room, and two, they don’t even do it when the programmer suggests they should. At the very least, the API for object migration should force the action.

    Another use-case where direct control over (sub)buffer content migration is useful is the case of multiple devices in the same context: a typical use case in such a scenario is to create a single huge buffer containing the data that will be processed by either device, and then some (usually overlapping) sub-buffers to be migrated to individual devices. (Enforced) memory object migration would allow the programmer fine-tuned control over when to move overlapping data from one device to the other. Combined with the const-correctness and dynamic memory flags that you propose, it would also reduce the amount of undefined behavior when using sub-buffers.

    While we’re talking about memory, another pet peeve of mine is how most if not all implementors misread the minimum requirement for the maximum buffer allocation property on devices: the OpenCL specification states that CL_DEVICE_MAX_MEM_ALLOC_SIZE should be at least 1/4th of the device global memory, yet all implementations report that value as the actual maximum allowed size for a buffer, even when this is false (which you can see in some cases because the limit is not actually enforced during buffer creation and such oversized buffers still work correctly, and in other cases because you can allocate larger memory areas when using proprietary or other APIs for device memory management).
