Code Sample: Intel® Advanced Matrix Extensions (Intel® AMX) - Intrinsics Functions

This tutorial with code example shows how to use the new Intel® Advanced Matrix Extensions (Intel® AMX) on both Intel® Xeon® Scalable processor Max Series and 4th Gen Intel® Xeon® Scalable Processors.

Alberto Villarreal

Intel® AMX introduces new extensions to the x86 Instruction Set Architecture (ISA) that operate on matrices and may accelerate matrix multiplication in AI workloads. It consists of two components:

A set of 2-dimensional registers (tiles), which can hold sub-matrices from larger matrices in memory.
An accelerator called Tile Matrix Multiply (TMUL) which contains instructions that operate on tiles.

This code sample demonstrates the new instructions using intrinsic functions. It is simplified to highlight the use of the new Intel® AMX instructions and is intended for demonstration purposes only; it should not be used as a basis for production code.

The source code is available to download.

The following sections summarize Intel® AMX and the basic steps to start using it. More detailed information about Intel® AMX and TMUL is in the Intel® Architecture Instruction Set Extensions Programming Reference and the Intel® 64 and IA-32 Architectures Software Developer’s Manual.

Tile and TMUL architecture

Before TMUL instructions can be used, the tile architecture must be configured by specifying the tile configuration, including the number of tiles and the tile sizes (the palette). This configuration step is performed once, and the configuration remains in effect until the code either changes or releases it. The programmer is responsible for loading, changing, and releasing the tile configuration. The maximum number of tiles and the maximum tile size supported by the specific hardware in use can be obtained via CPUID.
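The CPUID query mentioned above can be sketched as follows. This is a minimal illustration, not part of the original sample: it assumes GCC or Clang (for `<cpuid.h>` and `__get_cpuid_count`), and the type and function names (`amx_palette_info`, `decode_palette`, `query_palette`) are hypothetical. It reads the tile information leaf (CPUID leaf 0x1D), where the sub-leaf number selects the palette.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Decoded palette description (names chosen for this sketch). */
typedef struct {
    uint16_t total_tile_bytes; /* storage across all tiles   */
    uint16_t bytes_per_tile;   /* storage of a single tile   */
    uint16_t bytes_per_row;    /* maximum bytes per tile row */
    uint16_t max_names;        /* number of tile registers   */
    uint16_t max_rows;         /* maximum rows per tile      */
} amx_palette_info;

/* Decode the registers returned by CPUID leaf 0x1D, sub-leaf = palette id. */
static amx_palette_info decode_palette(uint32_t eax, uint32_t ebx, uint32_t ecx)
{
    amx_palette_info p;
    p.total_tile_bytes = (uint16_t)(eax & 0xFFFFu);
    p.bytes_per_tile   = (uint16_t)(eax >> 16);
    p.bytes_per_row    = (uint16_t)(ebx & 0xFFFFu);
    p.max_names        = (uint16_t)(ebx >> 16);
    p.max_rows         = (uint16_t)(ecx & 0xFFFFu);
    return p;
}

/* Returns 1 and fills *out if the CPU enumerates the palette, 0 otherwise. */
static int query_palette(uint32_t palette, amx_palette_info *out)
{
    assert(out != NULL);
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    /* CPUID.(EAX=7,ECX=0):EDX bit 24 reports AMX-TILE support. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) || !(edx & (1u << 24)))
        return 0;
    /* Sub-leaf 0 of leaf 0x1D returns the highest palette id in EAX. */
    if (!__get_cpuid_count(0x1D, 0, &eax, &ebx, &ecx, &edx) || palette > eax)
        return 0;
    if (!__get_cpuid_count(0x1D, palette, &eax, &ebx, &ecx, &edx))
        return 0;
    *out = decode_palette(eax, ebx, ecx);
    return 1;
#else
    (void)palette; /* not an x86 build: nothing to query */
    return 0;
#endif
}
```

Calling `query_palette(1, &info)` on a processor without Intel® AMX simply returns 0, so the check can run before any tile instruction is issued.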

Once the tiles are configured, TMUL instructions can be used to perform matrix multiplications (currently, INT8 and BF16 types are supported). The TMUL instructions, when executed, will dynamically check the maximum sizes of the tile and the matrix sizes that allow a mathematically correct matrix multiplication.

Intel® AMX supports XSAVE, which defines processor registers that can be saved and restored using instructions in the XSAVE feature set. The state components associated with Intel® AMX are XTILECFG and XTILEDATA (see the Intel® Architecture Instruction Set Extensions Programming Reference for specifics). Because the XTILEDATA state component is large, the operating system kernel may not enable it automatically. It can, however, be enabled manually using XSTATE features, in which case the application needs to invoke a system call to request access to Intel® AMX features.

In the code sample walkthrough next, an INT8 matrix multiplication will demonstrate the above procedure step by step. Specifically, the code sample will multiply matrices A and B of size 16 x 64 containing INT8 values, and accumulate the result to a 16 x 16 matrix C containing INT32 values.

Code sample walkthrough

First, define constants declaring max number of elements per tile, maximum number of rows and columns to configure the tiles, and actual number of columns in the matrices:

#define MAX 1024
#define MAX_ROWS 16
#define MAX_COLS 64
#define STRIDE 64

1. The first step of the tile configuration described above is to declare the data structure that will hold the control register for tile configuration:

//Define tile config data structure
typedef struct __tile_config
{
  uint8_t palette_id;
  uint8_t start_row;
  uint8_t reserved_0[14];
  uint16_t colsb[16];
  uint8_t rows[16];
} __tilecfg;

This data structure is designed to match the tile configuration format, a 64-byte memory location, as defined in the Intel® Intrinsics Guide:

// format of memory payload. each field is a byte.
// 0: palette_id
// 1: startRow (8b)
// 2-15: reserved (must be zero)
// 16-17: tile0.colsb -- bytes_per_row
// 18-19: tile1.colsb
// 20-21: tile2.colsb
// ...
// 46-47: tile15.colsb
// 48: tile0.rows
// 49: tile1.rows
// 50: tile2.rows
// ...
// 63: tile15.rows
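One way to confirm that a C structure really matches this 64-byte layout is to check it at compile time. The sketch below is an illustration, not part of the original sample: it mirrors the fields of the structure above under a different, hypothetical name (`tilecfg_sketch`), and the two offset helpers are assumptions added for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same field order and types as the tile config structure in the walkthrough. */
typedef struct {
    uint8_t  palette_id;
    uint8_t  start_row;
    uint8_t  reserved_0[14];
    uint16_t colsb[16];
    uint8_t  rows[16];
} tilecfg_sketch;

/* Compile-time checks against the documented byte offsets. */
_Static_assert(sizeof(tilecfg_sketch) == 64, "tile config must be 64 bytes");
_Static_assert(offsetof(tilecfg_sketch, colsb) == 16, "colsb starts at byte 16");
_Static_assert(offsetof(tilecfg_sketch, rows) == 48, "rows starts at byte 48");

/* Byte offset of the colsb field for a given tile (2 bytes per entry). */
static size_t colsb_offset(int tile)
{
    assert(tile >= 0 && tile < 16);
    return offsetof(tilecfg_sketch, colsb) + 2u * (size_t)tile;
}

/* Byte offset of the rows field for a given tile (1 byte per entry). */
static size_t rows_offset(int tile)
{
    assert(tile >= 0 && tile < 16);
    return offsetof(tilecfg_sketch, rows) + (size_t)tile;
}
```

If a compiler inserted padding or a field were reordered, the build would fail instead of loading a malformed configuration.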

The next step in the tile configuration is to fill the tile configuration variable with the specific information given by the matrices A, B and C used in this example:

/* Initialize tile config */
static void init_tile_config (__tilecfg *tileinfo)
{
  int i;
  tileinfo->palette_id = 1;
  tileinfo->start_row = 0;

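  // Tile 0: 16 rows x 16 bytes per row (not used by the code shown in this walkthrough)
  for (i = 0; i < 1; ++i)
  {
    tileinfo->colsb[i] = MAX_ROWS;
    tileinfo->rows[i] =  MAX_ROWS;
  }

  // Tiles 1-3: 16 rows x 64 bytes per row (tile 1 holds C; tiles 2 and 3 hold A and B)
  for (i = 1; i < 4; ++i)
  {
    tileinfo->colsb[i] = MAX_COLS;
    tileinfo->rows[i] =  MAX_ROWS;
  }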

  _tile_loadconfig (tileinfo);
}

In the above function, the _tile_loadconfig() intrinsic function is used to load the tile configuration metadata from the 64-byte memory location specified by tileinfo.

Notice that the value of palette_id is set to 1. Intel® AMX offers programmers a palette of enumerated options for configuring the tiles. Currently two palettes are supported: palette 0 represents the initialized state, whereas palette 1 provides 8 KB of storage divided across 8 tile registers, with each tile having a maximum size of 16 rows by 64 bytes (1 KB). In this example, two tiles each hold a matrix of size 16 x 64 (INT8 values), and one tile holds a matrix of size 16 x 16 (INT32 values).
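The sizes quoted for palette 1 can be checked with a little arithmetic. The following sketch is only an illustration; the constant and function names (`PALETTE1_TILES`, `tile_colsb`, `fits_in_tile`) are hypothetical. It shows why a 16 x 64 INT8 matrix and a 16 x 16 INT32 matrix both need 64 bytes per row and therefore each fill exactly one tile.

```c
#include <assert.h>
#include <stddef.h>

/* Palette 1 limits as described in the text. */
enum {
    PALETTE1_TILES     = 8,
    PALETTE1_MAX_ROWS  = 16,
    PALETTE1_MAX_COLSB = 64
};

/* Bytes per tile row needed for `cols` elements of `elem_size` bytes each. */
static size_t tile_colsb(size_t cols, size_t elem_size)
{
    assert(elem_size > 0);
    return cols * elem_size;
}

/* Non-zero if a rows x colsb block fits into a single palette 1 tile. */
static int fits_in_tile(size_t rows, size_t colsb)
{
    return rows <= PALETTE1_MAX_ROWS && colsb <= PALETTE1_MAX_COLSB;
}
```

For instance, `tile_colsb(64, 1)` and `tile_colsb(16, 4)` both yield 64, the value stored in the colsb field for tiles 1 to 3.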

Notice also that the STRIDE value (used later in the code) is not part of the configuration, because the actual matrices can be smaller than the maximum sizes declared for the tiles. For example, suppose this tile configuration is used to multiply matrices of size 1000 x 1000. Most iterations of the multiplication would operate on 16 x 64 matrices, but one or more of the final operations, on the remainder matrices, would be of smaller size, and the STRIDE value can be modified for those cases.
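To make the 1000 x 1000 example concrete, the sketch below counts how many full 16-row and 64-byte blocks fit along each dimension and how large the leftover block is. It is a plain arithmetic illustration with hypothetical names (`tile_split`, `split_dim`, `block_extent`); how the smaller edge blocks are then handled (for example by padding or by a different tile configuration) is outside the scope of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Number of full blocks along one dimension and the size of the remainder. */
typedef struct {
    size_t full; /* count of full-size blocks             */
    size_t rem;  /* size of the last partial block, or 0  */
} tile_split;

static tile_split split_dim(size_t extent, size_t tile_extent)
{
    assert(tile_extent > 0);
    tile_split s;
    s.full = extent / tile_extent;
    s.rem  = extent % tile_extent;
    return s;
}

/* Size of block number `index` (0-based) along a dimension of `extent`. */
static size_t block_extent(size_t index, size_t extent, size_t tile_extent)
{
    assert(tile_extent > 0 && index * tile_extent < extent);
    size_t start = index * tile_extent;
    size_t left  = extent - start;
    return left < tile_extent ? left : tile_extent;
}
```

With 1000 rows and 16-row tiles this gives 62 full blocks and a final block of 8 rows; with 1000 INT8 columns and 64-byte tile rows it gives 15 full blocks and a final block of 40 bytes.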

2. The next section of the code invokes a Linux system call to request access to Intel® AMX features. This is done through the arch_prctl(2) based mechanism that applications use to request usage of the Intel® AMX features. Specific information is described in the Linux kernel documentation.

/* set_tiledata_use() - Invoke syscall with ARCH_REQ_XCOMP_PERM to request use of XTILEDATA */
static bool set_tiledata_use(void)
{
   if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA))
   {
      printf("\n Failed to enable XFEATURE_XTILEDATA \n\n");
      return false;
   }

   printf("\n TILE DATA USE SET - OK \n\n");
   return true;
}
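The function above relies on constants that this walkthrough does not show. The sketch below fills them in with the values listed in the Linux kernel documentation for the arch_prctl(2) XSTATE interface and additionally reads back the permission bitmask to confirm the request. The helper names (`xfeature_permitted`, `request_amx_permission`) are hypothetical, and the code assumes an x86-64 Linux build; elsewhere it simply reports failure.

```c
#include <assert.h>
#include <stdbool.h>

#if defined(__linux__) && defined(__x86_64__)
#include <sys/syscall.h>
#include <unistd.h>
#endif

/* Values from the Linux kernel documentation on XSTATE features. */
#ifndef ARCH_GET_XCOMP_PERM
#define ARCH_GET_XCOMP_PERM 0x1022
#endif
#ifndef ARCH_REQ_XCOMP_PERM
#define ARCH_REQ_XCOMP_PERM 0x1023
#endif
#define XFEATURE_XTILECFG  17
#define XFEATURE_XTILEDATA 18

/* True if bit `feature` is set in the permission bitmask. */
static bool xfeature_permitted(unsigned long bitmask, unsigned feature)
{
    assert(feature < 8 * sizeof bitmask);
    return (bitmask & (1UL << feature)) != 0;
}

/* Request XTILEDATA permission, then verify it via ARCH_GET_XCOMP_PERM. */
static bool request_amx_permission(void)
{
#if defined(__linux__) && defined(__x86_64__)
    unsigned long bitmask = 0;

    if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA) != 0)
        return false; /* kernel or hardware does not grant the feature */
    if (syscall(SYS_arch_prctl, ARCH_GET_XCOMP_PERM, &bitmask) != 0)
        return false;
    return xfeature_permitted(bitmask, XFEATURE_XTILEDATA);
#else
    return false;     /* mechanism only exists on x86-64 Linux */
#endif
}
```

Reading the bitmask back is optional; it is included here because it makes a failed or partial grant visible before the first tile instruction runs.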

3. Next, TMUL instructions can be used to load the sub-matrices into tiles and perform tile matrix multiplication operations, followed by storing the results from the tiles back to memory. As mentioned above, when TMUL instructions are executed, the metadata in the tile configuration (previously loaded using the _tile_loadconfig() intrinsic) is dynamically checked to verify that the TMUL instruction supports the data type and that the matrix sizes meet the requirements for a correct matrix multiplication.

3.1. Load data from memory, specified by a base address (src1, src2 and res) and a stride, into tiles (tiles 2, 3 and 1, respectively).

// Load tile rows from memory
_tile_loadd (2, src1, STRIDE);
_tile_loadd (3, src2, STRIDE);
_tile_loadd (1, res, STRIDE);

STRIDE (64 in this case) is the distance in bytes between the starts of consecutive rows in memory, so it tells each load where to find the next tile row (a row-major data layout is assumed).
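The effect of the stride can be modeled in plain C. The sketch below is a scalar stand-in for a tile load, not the hardware behavior in every detail: the function name `tile_load_model` is hypothetical, and the destination is a densely packed buffer rather than a tile register. Row r of the tile is read starting at base + r * stride.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy `rows` rows of `colsb` bytes each into a packed buffer.
 * Row r is read from base + r * stride (stride is given in bytes). */
static void tile_load_model(uint8_t *tile, size_t rows, size_t colsb,
                            const void *base, size_t stride)
{
    assert(tile != NULL && base != NULL);
    const uint8_t *src = (const uint8_t *)base;

    for (size_t r = 0; r < rows; ++r)
        memcpy(tile + r * colsb, src + r * stride, colsb);
}
```

When the source matrix is exactly as wide as the tile, as in this sample, the stride equals the tile row size (64 bytes). When the tile is a window into a wider matrix, the stride is the row size of the wider matrix.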

3.2. Perform matrix multiplication

// Compute dot-product of bytes in tiles
_tile_dpbssd (1, 2, 3);

The above instruction computes the dot product of bytes in tiles with a source/destination accumulator. Specifically, it multiplies groups of 4 adjacent signed 8-bit integers in tile 2 with the corresponding signed 8-bit integers in tile 3, producing 4 intermediate 32-bit results. These 4 results are added to the corresponding 32-bit integer in tile 1, and the 32-bit result is stored back to tile 1. Details can be found in the Intel® Intrinsics Guide.
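The computation described above can be written as a scalar reference loop, which is handy for checking results on a machine without Intel® AMX. This is a sketch under the assumption that all three operands are stored densely in row-major order; the function name `dpbssd_model` and its parameters are hypothetical. The second operand is indexed the way the instruction consumes it: each of its rows holds one group of 4 bytes per output column.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a signed-byte dot-product accumulate.
 *   c: m_rows x n_cols             int32 accumulators
 *   a: m_rows x k_bytes            int8 values
 *   b: (k_bytes / 4) x (4*n_cols)  int8 values, in groups of 4 */
static void dpbssd_model(int32_t *c, const int8_t *a, const int8_t *b,
                         size_t m_rows, size_t k_bytes, size_t n_cols)
{
    assert(k_bytes % 4 == 0);

    for (size_t m = 0; m < m_rows; ++m)
        for (size_t k = 0; k < k_bytes / 4; ++k)
            for (size_t n = 0; n < n_cols; ++n)
                for (size_t i = 0; i < 4; ++i)
                    c[m * n_cols + n] +=
                        (int32_t)a[m * k_bytes + 4 * k + i] *
                        (int32_t)b[k * (4 * n_cols) + 4 * n + i];
}
```

For the tile sizes used in this sample, the call would be `dpbssd_model(res, src1, src2, 16, 64, 16)`. With every input byte equal to 2 and a zeroed accumulator, each of the 256 results is 64 x (2 x 2) = 256.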

Besides _tile_dpbssd(), TMUL also supports instructions _tile_dpbsud(), _tile_dpbusd(), _tile_dpbuud() to cover all possible combinations of signed/unsigned operands.
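The four variants differ only in how each source byte is interpreted. The sketch below illustrates this on a single group of 4 bytes; the function `dot4` and its flag parameters are hypothetical and model the arithmetic only. In the intrinsic names, the first letter after "dpb" refers to the first source tile and the second letter to the second source tile (s for signed, u for unsigned).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Interpret a raw byte as signed (two's complement) or unsigned. */
static int32_t byte_value(uint8_t raw, int is_signed)
{
    return (is_signed && raw >= 128) ? (int32_t)raw - 256 : (int32_t)raw;
}

/* Dot product of one group of 4 bytes.
 * a_signed / b_signed select the interpretation of each operand:
 *   (1,1) ~ _tile_dpbssd   (1,0) ~ _tile_dpbsud
 *   (0,1) ~ _tile_dpbusd   (0,0) ~ _tile_dpbuud */
static int32_t dot4(const uint8_t a[4], const uint8_t b[4],
                    int a_signed, int b_signed)
{
    assert(a != NULL && b != NULL);
    int32_t sum = 0;

    for (int i = 0; i < 4; ++i)
        sum += byte_value(a[i], a_signed) * byte_value(b[i], b_signed);
    return sum;
}
```

The same bit pattern can therefore give very different results: the byte 0xFF counts as -1 when signed and as 255 when unsigned.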

3.3. Lastly, the result of the matrix multiplication operation from tile 1 is stored to memory specified by res.

// Store the tile data to memory
_tile_stored (1, res, STRIDE);

Putting everything together in main():

int main(){

   __tilecfg tile_data = {0};
   int8_t src1[MAX];
   int8_t src2[MAX];
   int32_t res[MAX/4];
   int rows  = MAX_ROWS;
   int colsb = MAX_COLS;

   // Request permission from the Linux kernel to use AMX
   if (!set_tiledata_use())
      exit(-1);

   // Load tile configuration
   init_tile_config (&tile_data);

   // Init src matrix buffers with data
   init_buffer (src1, 2);
   print_buffer(src1, rows, colsb);

   init_buffer (src2, 2);
   print_buffer(src2, rows, colsb);

   // Init dst matrix buffers with data
   init_buffer32 (res, 0);

   // Load tile rows from memory
   _tile_loadd (2, src1, STRIDE);
   _tile_loadd (3, src2, STRIDE);
   _tile_loadd (1, res, STRIDE);

   // Compute dot-product of bytes in tiles with a
   // source/destination accumulator
   _tile_dpbssd (1, 2, 3);

   // Store the tile data to memory
   _tile_stored (1, res, STRIDE);
   print_buffer32(res, rows, colsb/4);

   // Release the tile configuration to return to the init state,
   // which releases all storage it currently holds
   _tile_release ();
}
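main() calls the helpers init_buffer(), init_buffer32(), print_buffer() and print_buffer32(), which are not shown in this walkthrough. The sketch below gives one plausible version of them so the flow can be followed end to end. The bodies are assumptions based only on how main() calls these functions (fill a MAX-element buffer with a constant, print a rows x cols matrix) and may differ from the downloadable source.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define MAX 1024 /* elements per INT8 buffer, as defined in the walkthrough */

/* Fill an INT8 buffer of MAX elements with a constant (assumed behavior). */
static void init_buffer(int8_t *buf, int8_t value)
{
    assert(buf != NULL);
    for (int i = 0; i < MAX; ++i)
        buf[i] = value;
}

/* Fill an INT32 buffer of MAX/4 elements with a constant (assumed behavior). */
static void init_buffer32(int32_t *buf, int32_t value)
{
    assert(buf != NULL);
    for (int i = 0; i < MAX / 4; ++i)
        buf[i] = value;
}

/* Print a rows x cols INT8 matrix stored in row-major order. */
static void print_buffer(const int8_t *buf, int rows, int cols)
{
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c)
            printf("%d ", buf[r * cols + c]);
        printf("\n");
    }
    printf("\n");
}

/* Print a rows x cols INT32 matrix stored in row-major order. */
static void print_buffer32(const int32_t *buf, int rows, int cols)
{
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c)
            printf("%d ", (int)buf[r * cols + c]);
        printf("\n");
    }
    printf("\n");
}
```

The tile intrinsics themselves are declared in the immintrin.h header, which the full source also needs to include.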

Compile and Run the Code Sample

Use the provided Makefile to compile and run the code sample:

make
./test-amxtile

The test code initializes src1 and src2 with INT8 constants of value 2, so the input matrices A and B are equal (both of size 16 x 64).

This code has been tested on a system using a 4th Gen Intel® Xeon® Scalable processor, with Linux 8.6 (kernel version 5.17) and gcc version 12.1.

Summary and Conclusions

This tutorial describes how to use new Intel® AMX technology through a simple code sample. The example shows how to configure Intel® AMX tiles and compute the result of a single matrix multiplication with INT8 operands and INT32 accumulation on 16 x 64 matrices.

This code sample is simplified to highlight the use of the new Intel® AMX instructions. It shows how to configure the tiles, load data from memory into tiles, perform one matrix multiplication on the tile data, and copy the result from the tiles to memory. It is intended for demonstration purposes only and should not be used as a basis for production code.

This code sample can be easily extended to multiplying matrices of any size just by iterating on the load/compute cycle, because the initial tile configuration will be valid for the entire life of the thread (and sub-threads).
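The idea of iterating on the load/compute cycle can be sketched for the simplest case: a longer inner dimension. The code below is a scalar model with hypothetical names (`tile_mac_model`, `matmul_k_loop`), assuming the inner dimension is a multiple of 64 and that B is already arranged in memory as rows of 64 bytes made of 4-byte groups. With real tiles, each loop iteration would correspond to two tile loads and one _tile_dpbssd(), with a single store of the accumulator tile after the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { TILE_ROWS = 16, TILE_COLSB = 64 };

/* Scalar stand-in for one tile multiply-accumulate on full-size tiles.
 *   c: 16 x 16 int32 accumulators
 *   a: 16 rows of 64 int8 values, rows a_stride bytes apart
 *   b: 16 rows of 64 int8 values (16 groups of 4 per row) */
static void tile_mac_model(int32_t *c, const int8_t *a, size_t a_stride,
                           const int8_t *b)
{
    for (size_t m = 0; m < TILE_ROWS; ++m)
        for (size_t k = 0; k < TILE_COLSB / 4; ++k)
            for (size_t n = 0; n < 16; ++n)
                for (size_t i = 0; i < 4; ++i)
                    c[m * 16 + n] += (int32_t)a[m * a_stride + 4 * k + i] *
                                     (int32_t)b[k * TILE_COLSB + 4 * n + i];
}

/* C (16 x 16) += A (16 x k_total) * B, walking the inner dimension
 * in chunks of 64 bytes. B holds k_total/4 rows of 64 bytes. */
static void matmul_k_loop(int32_t *c, const int8_t *a, const int8_t *b,
                          size_t k_total)
{
    assert(k_total % TILE_COLSB == 0);

    for (size_t kc = 0; kc < k_total / TILE_COLSB; ++kc) {
        /* Next 64 columns of A; rows stay k_total bytes apart. */
        const int8_t *a_tile = a + kc * TILE_COLSB;
        /* Next 16 packed rows of B. */
        const int8_t *b_tile = b + kc * TILE_ROWS * TILE_COLSB;
        tile_mac_model(c, a_tile, k_total, b_tile);
    }
}
```

Note that the A tile is read with a stride of k_total bytes rather than 64, because it is now a window into a wider matrix. Extending the rows and columns of the result works the same way, with outer loops over 16 x 16 blocks of C.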

However, this information describes only part of what the new Intel® AMX technology provides. BF16 matrix multiplication is also supported.
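As a rough orientation for the BF16 case, the sketch below models the data type and one multiply-accumulate step in scalar C. It is an illustration only: BF16 is treated as the upper 16 bits of an IEEE single-precision value, the conversion truncates instead of rounding, and the function names are hypothetical. The corresponding intrinsic, _tile_dpbf16ps(), multiplies pairs of BF16 values and accumulates into single-precision elements; its exact rounding and special-value behavior is described in the Intel documentation and is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Convert float to BF16 by keeping the upper 16 bits (truncation, an
 * assumption of this sketch; hardware conversions may round instead). */
static uint16_t float_to_bf16_trunc(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (uint16_t)(bits >> 16);
}

/* Widen BF16 back to float by appending 16 zero bits. */
static float bf16_to_float(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* One accumulate step on a pair of BF16 values: acc + a0*b0 + a1*b1. */
static float dot2_bf16(const uint16_t a[2], const uint16_t b[2], float acc)
{
    assert(a != NULL && b != NULL);
    for (int i = 0; i < 2; ++i)
        acc += bf16_to_float(a[i]) * bf16_to_float(b[i]);
    return acc;
}
```

Because each BF16 element takes 2 bytes, a 64-byte tile row holds 32 of them, compared with 64 INT8 elements.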

If you are a developer writing code for AI applications, libraries or frameworks, you can use Intel’s code samples and documentation to write or update your existing code and take advantage of AI acceleration that Intel® AMX provides.

If you are a data scientist or Python developer, you can also directly use Intel optimized tools and libraries (such as Intel® oneAPI Math Kernel Library (oneMKL) and Intel® Optimization for TensorFlow*) to accelerate your workflows and applications. These tools and libraries already take advantage of Intel® AMX technology on 4th Gen Intel® Xeon® Scalable processors. A good starting point is to download Intel oneAPI, including the Intel® AI Analytics Toolkit, which includes Intel® Distribution for Python*, Intel® Optimization for TensorFlow*, and more.

To get immediate access to the latest Intel technology (including AMX-optimized software tools, libraries and hardware), you can also sign up for access to the Intel® DevCloud.

Notices/Disclaimers

Performance varies by use, configuration and other factors. Learn more on the Performance Index site.

Intel technologies may require enabled hardware, software or service activation.

No product or component can be absolutely secure.

Code names are used by Intel to identify products, technologies, or services that are in development and not publicly available. These are not "commercial" names and not intended to function as trademarks.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

Your costs and results may vary.
