Adding New SIMD Instructions to the GCC Back-end.

Mauricio Alvarez: alvarez (at) ac (dot) upc (dot) edu
Created: 16/09/2005.
Modified: 13.03.2006.

1. Introduction

This guide shows how to add support to the gcc compiler for new instructions that extend an existing ISA. The idea is to extend an ISA with custom instructions for domain-specific processor acceleration and to support these instructions in the compiler by means of intrinsics.

This work is based on the PowerPC ISA with the Altivec multimedia extension. The new instructions to be added belong to the video coding/decoding domain. GCC version 4.0 is used.

In the first stage of this work, the new instructions are supported by means of intrinsics that allow the programmer to use them directly in C or C++ programs.


2. GCC structure and passes

To provide support for new instructions in gcc, they must be supported in some of the stages of the compiler:

front-end: parse tree --------> middle-end: generic tree ---------> back-end: RTL

GCC passes [1]

Modifications in each stage of GCC
2.1 front-end: specification of the new intrinsics, added to the existing Altivec intrinsics.
2.2 middle-end: because the instructions are not going to be generated automatically, there is no need to modify this stage.
2.3 back-end: creation of the machine description of the new instructions.

3. Front-end support for intrinsics in GCC

3.1 What are intrinsics?
An intrinsic is a function known to the compiler that maps directly to a sequence of one or more assembly language instructions.

Intrinsics make processor-specific enhancements easier to use because they provide a language interface (C, C++) to assembly instructions. In doing so, the compiler manages things that the user would normally have to be concerned with, such as register names, register allocation and the memory locations of data.

GCC has intrinsics for the SIMD extensions (SSE, Altivec) that are available in most modern processors.

3.2 Altivec Intrinsics in GCC
Altivec intrinsics are an interface for accessing Altivec instructions on PowerPC processors. The intrinsics specification also adds new types to the C and C++ languages for declaring packed variables, as described in the Altivec Programming Interface Manual [2].

The intrinsics interface is made available by adding #include <altivec.h> to the source program and the -maltivec and -mabi=altivec compiler flags to the compilation command. This applies only to the official FSF GCC. Mac computers (with PowerPC processors) use a special version of gcc that does not need the altivec.h header and requires the -faltivec compiler flag instead.

Altivec intrinsics are declared in altivec.h, which is located in the gcc source tree at gcc/config/rs6000/altivec.h

An example of the Altivec intrinsics is the vector average, which calculates the rounded average of two vectors.

- Compiler intrinsic: d = vec_avg(a,b)
- Assembly instructions: see the next table

d = vec_avg(a,b)

d                       a                       b                       maps to
vector unsigned char    vector unsigned char    vector unsigned char    vavgub d,a,b
vector signed char      vector signed char      vector signed char      vavgsb d,a,b
vector unsigned short   vector unsigned short   vector unsigned short   vavguh d,a,b
vector signed short     vector signed short     vector signed short     vavgsh d,a,b
vector unsigned int     vector unsigned int     vector unsigned int     vavguw d,a,b
vector signed int       vector signed int       vector signed int       vavgsw d,a,b
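
To make "rounded average" concrete, here is a minimal scalar sketch in C of what vavgub computes for each of the 16 byte elements of a vector, namely (a + b + 1) >> 1. The function name model_vavgub and the constant VEC_BYTES are illustrative assumptions; they are not part of GCC or of the Altivec interface.

```c
#include <stddef.h>
#include <stdint.h>

#define VEC_BYTES 16  /* a 128-bit Altivec register holds 16 bytes */

/* Hypothetical scalar model of vavgub: element-wise rounded average of
   unsigned bytes.  The sum is formed in a wider type so that the carry
   out of the 8-bit addition is not lost. */
void model_vavgub(uint8_t d[VEC_BYTES],
                  const uint8_t a[VEC_BYTES],
                  const uint8_t b[VEC_BYTES])
{
    for (size_t i = 0; i < VEC_BYTES; i++) {
        unsigned sum = (unsigned)a[i] + (unsigned)b[i] + 1u;
        d[i] = (uint8_t)(sum >> 1);
    }
}
```

With a[i] = 255 and b[i] = 0 the model yields 128 rather than 127, which shows the rounding step.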


3.3 Implementation of Altivec intrinsics
Intrinsics are implemented as functions, but their code is placed inline, so they do not generate a function call. As the example above shows, each intrinsic can map to several assembly instructions depending on the data type of the operands. In GCC this is implemented by means of overloaded functions: in C++ overloading is supported directly by the language, while in C it is emulated with macros.

Here is the C++ declaration of the vec_avg intrinsic for vector unsigned and signed char:

inline   __vector unsigned char
vec_avg (__vector unsigned char a1, __vector unsigned char a2)
{
  return (__vector unsigned char) __builtin_altivec_vavgub ((__vector signed char) a1, (__vector signed char) a2);
}

inline   __vector signed char
vec_avg (__vector signed char a1, __vector signed char a2)
{
  return (__vector signed char) __builtin_altivec_vavgsb ((__vector signed char) a1, (__vector signed char) a2);
}

In C, the same declaration is done with macros:
#define vec_avg(a1, a2) \
__ch (__bin_args_eq (__vector unsigned char, (a1), __vector unsigned char, (a2)), \
      ((__vector unsigned char) __builtin_altivec_vavgub ((__vector signed char) (a1), (__vector signed char) (a2))), \
__ch (__bin_args_eq (__vector signed char, (a1), __vector signed char, (a2)), \
      ((__vector signed char) __builtin_altivec_vavgsb ((__vector signed char) (a1), (__vector signed char) (a2))), \
__ch (__bin_args_eq (__vector unsigned short, (a1), __vector unsigned short, (a2)), \
      ((__vector unsigned short) __builtin_altivec_vavguh ((__vector signed short) (a1), (__vector signed short) (a2))), \
__ch (__bin_args_eq (__vector signed short, (a1), __vector signed short, (a2)), \
      ((__vector signed short) __builtin_altivec_vavgsh ((__vector signed short) (a1), (__vector signed short) (a2))), \
__ch (__bin_args_eq (__vector unsigned int, (a1), __vector unsigned int, (a2)), \
      ((__vector unsigned int) __builtin_altivec_vavguw ((__vector signed int) (a1), (__vector signed int) (a2))), \
__ch (__bin_args_eq (__vector signed int, (a1), __vector signed int, (a2)), \
      ((__vector signed int) __builtin_altivec_vavgsw ((__vector signed int) (a1), (__vector signed int) (a2))), \
    __builtin_altivec_compiletime_error ("vec_avg")))))))


__bin_args_eq is a macro that checks whether the data types of the operands are compatible with the given types.
__ch is a macro that chooses, at compile time, between the matching builtin expression and the next alternative; the last alternative is a data type error.
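
The selection mechanism can be tried without an Altivec target. The sketch below assumes that __ch behaves like GCC's __builtin_choose_expr and that __bin_args_eq is built on __builtin_types_compatible_p. The names ARGS_EQ, CHOOSE, scalar_avg, avg_u8 and avg_s8 are hypothetical and work on plain scalars instead of vectors. It requires GCC or a compatible compiler, since both builtins are extensions.

```c
/* Assumed equivalents of __bin_args_eq and __ch, built on GCC extensions. */
#define ARGS_EQ(xtype, x, ytype, y)                          \
    (__builtin_types_compatible_p(xtype, __typeof__(x)) &&   \
     __builtin_types_compatible_p(ytype, __typeof__(y)))

#define CHOOSE(cond, yes, no) __builtin_choose_expr(cond, yes, no)

/* Stand-ins for the type-specific builtins; inputs are assumed to be
   non-negative in the signed case to keep the shift portable. */
static unsigned char avg_u8(unsigned char a, unsigned char b)
{
    return (unsigned char)((a + b + 1) >> 1);
}

static signed char avg_s8(signed char a, signed char b)
{
    return (signed char)((a + b + 1) >> 1);
}

/* One "overloaded" name: the branch is picked from the operand types at
   compile time.  If no branch matches, the expression has type void, so
   any use of its value fails to compile. */
#define scalar_avg(a1, a2)                                        \
    CHOOSE(ARGS_EQ(unsigned char, (a1), unsigned char, (a2)),     \
           avg_u8((a1), (a2)),                                    \
    CHOOSE(ARGS_EQ(signed char, (a1), signed char, (a2)),         \
           avg_s8((a1), (a2)),                                    \
           (void)0))
```

The result type follows the selected branch, so scalar_avg applied to two unsigned char operands has type unsigned char, just as each vec_avg variant returns the vector type of its operands.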

3.4 An example of a new Altivec intrinsic for pixel interpolation
We are going to add support for a new instruction for pixel interpolation, a process that is common in video coding standards such as MPEG-4 or H.264.

The interface to the new instruction is d = vec_inter(a,b), and the assembly mapping is shown in the next table.


d = vec_inter(a,b)

d                       a                       b                       maps to
vector unsigned char    vector unsigned char    vector unsigned char    vinterub d,a,b
vector signed char      vector signed char      vector signed char      vintersb d,a,b
vector unsigned short   vector unsigned short   vector unsigned short   vinteruh d,a,b
vector signed short     vector signed short     vector signed short     vintersh d,a,b
vector unsigned int     vector unsigned int     vector unsigned int     vinteruw d,a,b
vector signed int       vector signed int       vector signed int       vintersw d,a,b

The C++ definitions for the unsigned and signed char versions look like this:

inline   __vector unsigned char
vec_inter (__vector unsigned char a1, __vector unsigned char a2)
{
  return (__vector unsigned char) __builtin_altivec_vinterub ((__vector signed char) a1, (__vector signed char) a2);
}

inline   __vector signed char
vec_inter (__vector signed char a1, __vector signed char a2)
{
  return (__vector signed char) __builtin_altivec_vintersb ((__vector signed char) a1, (__vector signed char) a2);
}

4. Back-end support for intrinsics in GCC

Intrinsics are implemented in the machine description of the back-end of the compiler. The back-end is implemented in several files under gcc/config/rs6000/; the ones that matter here are the machine description and the rs6000.c and rs6000.h sources discussed in section 4.3.
The machine description is used in the matching process that transforms RTL expressions into assembler instructions. An instruction description consists of template patterns that are used for both instruction generation and instruction matching.

4.1 Instruction patterns for altivec instructions
Here is an example of an instruction pattern for the vec_avg intrinsic using unsigned char operands:

(define_insn "altivec_vavgub"
  [(set (match_operand:V16QI 0 "register_operand" "=v")
        (unspec:V16QI [(match_operand:V16QI 1 "register_operand" "v")
                       (match_operand:V16QI 2 "register_operand" "v")] 44))]
  "TARGET_ALTIVEC"
  "vavgub %0,%1,%2"
  [(set_attr "type" "vecsimple")])


4.2 Instruction patterns for new instructions
Similar to the example presented above, we have defined patterns for the vector interpolation instruction vec_inter:

- vec_inter for the unsigned data types:

(define_insn "altivec_vinteru<VI_char>"
  [(set (match_operand:VI 0 "register_operand" "=v")
        (unspec:VI [(match_operand:VI 1 "register_operand" "v")
                    (match_operand:VI 2 "register_operand" "v")] 244))]
  "TARGET_ALTIVEC"
  "vinteru<VI_char> %0,%1,%2"
  [(set_attr "type" "vecsimple")])

- vec_inter for the signed data types:

(define_insn "altivec_vinters<VI_char>"
  [(set (match_operand:VI 0 "register_operand" "=v")
        (unspec:VI [(match_operand:VI 1 "register_operand" "v")
                    (match_operand:VI 2 "register_operand" "v")] 245))]
  "TARGET_ALTIVEC"
  "vinters<VI_char> %0,%1,%2"
  [(set_attr "type" "vecsimple")])
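
Each of the two patterns above stands for three instructions, one per element size. The following C sketch mimics that expansion. It assumes that VI ranges over the vector modes V16QI, V8HI and V4SI and that VI_char maps them to the suffixes b, h and w, which is what the instruction names in the table of section 3.4 suggest. The helper expand_vi_char and the vi_char table are hypothetical; inside GCC the substitution is performed by the machine description reader, not by code like this.

```c
#include <stdio.h>
#include <string.h>

/* Assumed mode-to-suffix mapping behind <VI_char>. */
struct mode_attr { const char *mode; const char *suffix; };

static const struct mode_attr vi_char[] = {
    { "V16QI", "b" },   /* 16 x 8-bit elements  */
    { "V8HI",  "h" },   /*  8 x 16-bit elements */
    { "V4SI",  "w" },   /*  4 x 32-bit elements */
};

/* Replace the first "<VI_char>" in tmpl with the suffix for mode.
   Returns 0 on success, -1 if the placeholder or the mode is unknown
   or if out is too small. */
int expand_vi_char(const char *tmpl, const char *mode,
                   char *out, size_t out_size)
{
    static const char placeholder[] = "<VI_char>";
    const char *hit = strstr(tmpl, placeholder);

    if (hit == NULL)
        return -1;

    for (size_t i = 0; i < sizeof vi_char / sizeof vi_char[0]; i++) {
        if (strcmp(vi_char[i].mode, mode) == 0) {
            int n = snprintf(out, out_size, "%.*s%s%s",
                             (int)(hit - tmpl), tmpl,
                             vi_char[i].suffix,
                             hit + strlen(placeholder));
            return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
        }
    }
    return -1;
}
```

Under these assumptions the name "altivec_vinteru<VI_char>" expands to altivec_vinterub, altivec_vinteruh and altivec_vinteruw, which are the names that the CODE_FOR_ entries of section 4.3 refer to.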

4.3 Builtin description of the intrinsics
It is necessary to add the intrinsics to the back-end of the compiler. In our case, the intrinsics are added to the RS6000 back-end, which covers all the Power and PowerPC processors. The subroutines for code generation are defined in gcc/gcc-4.0.0/gcc/config/rs6000/rs6000.c. This file has a special section for the definition of builtins.

For two-operand instructions there is a structure like this:

static struct builtin_description bdesc_2arg[] =
{
...
  { MASK_ALTIVEC, CODE_FOR_altivec_vinterub, "__builtin_altivec_vinterub", ALTIVEC_BUILTIN_VINTERUB },
  { MASK_ALTIVEC, CODE_FOR_altivec_vintersb, "__builtin_altivec_vintersb", ALTIVEC_BUILTIN_VINTERSB },
  { MASK_ALTIVEC, CODE_FOR_altivec_vinteruh, "__builtin_altivec_vinteruh", ALTIVEC_BUILTIN_VINTERUH },
  { MASK_ALTIVEC, CODE_FOR_altivec_vintersh, "__builtin_altivec_vintersh", ALTIVEC_BUILTIN_VINTERSH },
  { MASK_ALTIVEC, CODE_FOR_altivec_vinteruw, "__builtin_altivec_vinteruw", ALTIVEC_BUILTIN_VINTERUW },
  { MASK_ALTIVEC, CODE_FOR_altivec_vintersw, "__builtin_altivec_vintersw", ALTIVEC_BUILTIN_VINTERSW },
...
};

The ALTIVEC_BUILTIN_* codes are defined in gcc/gcc-4.0.0/gcc/config/rs6000/rs6000.h:

enum rs6000_builtins
{
  /* Altivec builtins.  */
...
  ALTIVEC_BUILTIN_VINTERUB,
  ALTIVEC_BUILTIN_VINTERSB,
  ALTIVEC_BUILTIN_VINTERUH,
  ALTIVEC_BUILTIN_VINTERSH,
  ALTIVEC_BUILTIN_VINTERUW,
  ALTIVEC_BUILTIN_VINTERSW, 
...
};
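
The two tables work together: the builtin code identifies the intrinsic, and the matching table entry supplies the instruction pattern to emit. The sketch below shows that lookup in isolation. It is a simplified mock, not the code in rs6000.c: the enumerations are cut down to two entries, the value of MASK_ALTIVEC is a placeholder, and find_2arg_builtin is a hypothetical helper name.

```c
#include <stddef.h>

/* Cut-down, hypothetical stand-ins for the GCC-internal definitions. */
enum insn_code {
    CODE_FOR_altivec_vinterub,
    CODE_FOR_altivec_vintersb
};

enum rs6000_builtins {
    ALTIVEC_BUILTIN_VINTERUB,
    ALTIVEC_BUILTIN_VINTERSB,
    ALTIVEC_BUILTIN_UNKNOWN     /* not in the table, used for the miss case */
};

#define MASK_ALTIVEC 0x1u       /* placeholder value */

struct builtin_description {
    unsigned int mask;          /* target flags required by the builtin */
    enum insn_code icode;       /* instruction pattern to generate */
    const char *name;           /* name visible to altivec.h */
    enum rs6000_builtins code;  /* identifier of the builtin */
};

static const struct builtin_description bdesc_2arg[] = {
    { MASK_ALTIVEC, CODE_FOR_altivec_vinterub,
      "__builtin_altivec_vinterub", ALTIVEC_BUILTIN_VINTERUB },
    { MASK_ALTIVEC, CODE_FOR_altivec_vintersb,
      "__builtin_altivec_vintersb", ALTIVEC_BUILTIN_VINTERSB },
};

/* Linear search of the two-operand table by builtin code;
   returns NULL when the code has no entry. */
const struct builtin_description *
find_2arg_builtin(enum rs6000_builtins fcode)
{
    for (size_t i = 0; i < sizeof bdesc_2arg / sizeof bdesc_2arg[0]; i++)
        if (bdesc_2arg[i].code == fcode)
            return &bdesc_2arg[i];
    return NULL;
}
```

A new intrinsic therefore needs three consistent pieces: the enum value, the table row, and a define_insn whose name matches the CODE_FOR_ entry.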

5. Extending the GNU Assembler

To support new instructions for a given ISA, the assembler must also be modified so that it can produce the object code. The natural choice of assembler to use with the gcc compiler is gas, the GNU assembler, which is part of the binutils collection of tools [3].

Gas is implemented in two parts: a front-end and a back-end.

5.1 Opcode List
The opcode list for PowerPC instructions is defined in the PowerPC back-end:
/binutils-2.16.1/opcodes/ppc-opc.c

const struct powerpc_opcode powerpc_opcodes[] = {
...
{ "vinterub",VX(4, 1900), VX_MASK,    PPCVEC,        { VD, VA, VB } },
{ "vinteruh",VX(4, 1901), VX_MASK,    PPCVEC,        { VD, VA, VB } },
{ "vinteruw",VX(4, 1902), VX_MASK,    PPCVEC,        { VD, VA, VB } },
{ "vintersb",VX(4, 1903), VX_MASK,    PPCVEC,        { VD, VA, VB } },
{ "vintersh",VX(4, 1904), VX_MASK,    PPCVEC,        { VD, VA, VB } },
{ "vintersw",VX(4, 1905), VX_MASK,    PPCVEC,        { VD, VA, VB } },
...
};


5.2 Adding new opcodes to the Altivec extension

PowerPC opcode format (VX form, 32 bits):
-----------------------------------------------------------------
| Main opcode  |   VD   |   VA   |   VB   |   Extended opcode   |
-----------------------------------------------------------------
|    6 bits    | 5 bits | 5 bits | 5 bits |       11 bits       |
-----------------------------------------------------------------

Free opcodes
From extended opcode 1900 upward there are free slots for new instructions. For the interpolation instructions these are the selected opcodes:
- vinterub: 1900
- vinteruh: 1901
- vinteruw: 1902
- vintersb: 1903
- vintersh: 1904
- vintersw: 1905
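
The following C sketch shows how the fields of the format above combine into a 32-bit instruction word. The OP and VX macros follow the shape of the helper macros of the same names in ppc-opc.c, written here with uint32_t. The function encode_vx and its field shifts (VD at bit 21, VA at bit 16, VB at bit 11, counting from the least significant bit) are an illustration of the VX layout, not code taken from binutils.

```c
#include <stdint.h>

/* Main opcode: 6 bits at the top of the word. */
#define OP(x)        ((((uint32_t)(x)) & 0x3fu) << 26)

/* VX form: main opcode plus an 11-bit extended opcode at the bottom. */
#define VX(op, xop)  (OP(op) | (((uint32_t)(xop)) & 0x7ffu))

/* Hypothetical helper: build the word for an Altivec VX-form instruction
   (main opcode 4) from its extended opcode and three register numbers. */
uint32_t encode_vx(uint32_t xop, unsigned vd, unsigned va, unsigned vb)
{
    return VX(4, xop)
         | ((uint32_t)(vd & 0x1fu) << 21)
         | ((uint32_t)(va & 0x1fu) << 16)
         | ((uint32_t)(vb & 0x1fu) << 11);
}
```

For example, encode_vx(1900, 1, 2, 3) gives 0x10221F6C, the word for vinterub v1,v2,v3 under the opcode assignment above. Every extended opcode must fit in 11 bits, that is, be below 2048, which holds for 1900 to 1905.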

Appendix 1. Notes on compilation of gcc

Adding new instructions does not change the compilation process of gcc at all. For our experiments, however, we use a Power4 machine running the AIX operating system together with a PowerPC+Altivec emulator and simulator. Neither the processor nor the OS supports Altivec instructions, so gcc must be told explicitly to include Altivec support:

#define TARGET_ALTIVEC 1
#define TARGET_ALTIVEC_ABI 1
#define TARGET_ALTIVEC_VRSAVE 1

#define READ_ONLY_DATA_SECTION_FUNCTION                         \
void                                                            \
read_only_data_section (void)                                   \
{                                                               \
  if (in_section != read_only_data)                             \
    {                                                           \
      /* ",4" is log2 of the alignment: 16 bytes = 128 bits.  */ \
      fprintf (asm_out_file, "\t.csect %s[RO],4\n",             \
               xcoff_read_only_section_name);                   \
      in_section = read_only_data;                              \
    }                                                           \
}

The same change needs to be applied to
#define READ_ONLY_PRIVATE_DATA_SECTION_FUNCTION

CONFIG_SHELL={bin_dir}/bash
export CONFIG_SHELL
$ ./configure --prefix=$BIN_DIR --enable-languages=c,c++ --enable-altivec --disable-nls --disable-multilib
$ gmake
$ gmake install

References

[1] GCC Internals. GNU Compiler Collection Internals
[2] ALTIVECPIM. AltiVec Technology Programming Interface Manual. Motorola/Freescale.
[3] BINUTILS. GNU binary utils: assembler, linker, loader and other utilities for dealing with binary files generated by gcc compilers.