Multimedia Instruction Sets for General Purpose Microprocessors: An ISA comparison


Created by: Mauricio Alvarez
Created: 22.10.2003
Revised
    09.11.2010. (fix SSE3 Error, thanks to Niels Froehling)
    10.02.2004. (SSE2 and SSE3 instructions)
    10.03.2006. Added some missing instructions in SSE and SSE2, also made a correction in the number of SSE instructions, thanks to Cesare Di Mauro.
    01.05.2006. Revised some information about unaligned data access in SSE2 and SSE3



Index
1. Introduction

2. Multimedia Extensions for GPPS

3. Multimedia ISAs

3.1 Integer Instructions
3.1.1 Integer Arithmetic Instructions
3.1.2 Integer Multiplication
3.1.3 Integer Reductions
3.1.4 Integer Compare Instructions
3.1.5 Integer Logical Instructions
3.1.6 Integer Rotate/Shift Instructions

3.2 Floating Point Instructions
3.2.1 Floating Point Arithmetic
3.2.2 Floating Point Logic
3.2.3 Floating Point Division, Square Root, Log and Exponentials
3.2.4 FP Rounding and Conversion
3.2.5 Floating Point Compare

3.3 Load and Store Instructions
3.3.1 Integer Load Instructions
3.3.2 Integer Store Instructions
3.3.3 Data Movement Instructions

3.4 Formatting Instructions
3.4.1 Pack
3.4.2 Unpack
3.4.3 Merge
3.4.4 Splat
3.4.5 Vector Shift
3.4.6 Permute and Shuffle

3.5 System Instructions
3.5.1 User level Cache instructions
3.5.2 Move From/To Vector Status Registers
4. References



1. Introduction

This document reviews the instructions available in the Intel MMX/SSE, Motorola Altivec and MOM multimedia extensions. Although all the multimedia extensions to current general purpose processors share the same idea of exploiting Data Level Parallelism, they differ in the kind of instructions supported, the type of operands and the memory management capability.
The objective of this guide is not to present a detailed reference of each instruction, which can be found in the respective reference manuals, but to compare the ISAs of those extensions so that the operations available in each architecture can be determined easily.

The list presented in this document includes the SSE2 and SSE3 extensions present in Pentium 4 processors, but not the IA-64 multimedia instructions included in Itanium processors.


2. Multimedia ISAs for GPPS


Developer         Extension                Base ISA     Instructions  Register file                       Processor
Intel             MMX                      x86          57            8x64b (FP)                          Pentium (1997)
Intel             SSE                      x86          70            8x128b (XMM)                        Pentium III (1999)
Intel             SSE2                     x86          116           8x128b (XMM)                        Pentium 4 (2000)
Intel             SSE3                     x86          13            8x128b (XMM)                        Pentium 4 Prescott (2004)
Motorola          Altivec                  PowerPC      162           32x128b (V)                         MPC74xx aka G4 (1999), IBM 970 aka G5 (2003)
Intel             Multimedia Instructions  IPF (IA-64)  47 (?)        128x64bit                           Merced
Jesús Corbal UPC  MOM                      Alpha        119           16 x 16 x 64b (VR), 2 x 192b (ACC)  not available (yet)


3. Multimedia ISA Comparison

The following tables present a list of operations that are common to all multimedia extensions, together with the corresponding instructions in each ISA extension that perform the actual operation. Note that the operations performed by the instructions in a given category are not bit-for-bit equivalent; they are presented together only for comparison purposes. Note also that different ISAs have different addressing modes, different register file sizes and structures (i.e.: vector, matrix) and different ways to access memory locations, and that these differences are not expressed in the tables below.

The notation used for describing operations was taken from Slingerland and Jouppi [1]. Its purpose is only to describe the kind of operation performed by each instruction, not to expose its full semantic content. At the end of the document there is a list of the symbols used, with their corresponding meanings.

Operations and data types.

Operation                    Altivec                                MMX                      SSE2                     MOM_64                                 MOM_128
Modulo Add/Sub               VADD (8,16,32)                         PADD (8,16,32)           PADD (8,16,32,64)        M_V_ADD, M_VS_ADD (U8,U16,S8,S16)
Saturating Add/Sub           VADD (U8,U16,U32,S8,S16,S32)           PADD (U8,U16,S8,S16)     PADD (U8,U16,S8,S16)     M_V_ADD, M_VS_ADD (U8,U16,S8,S16)
Average                      VAVG (U8,U16,U32,S8,S16,S32)           PAVG (U8,U16)            PAVG (U8,U16)            M_V_AVG (U8,U16,S8,S16)
Min/Max                      VMAX (U8,U16,U32,S8,S16,S32)           PMAX (U8,S16)            PMAX (U8,S16)            -
Multiplication               VMULE, VMULO (U8,U16,S8,S16)           PMULH, PMULL (U16,S16)   PMULH, PMULL (U16,S16)   M_V_MUL, M_VS_MUL (S8,S16,S32)
Multiply and accumulate      VMLADD, VMHADD (S16)                   PMADD (S16)              PMADD (S16)              M_V_MULA, M_VS_MULA (S8,S16)
Multiply and sum             VMSUM (U8,U16,S8,S16)                  -                        -                        M_V_MAD (S8,S16)
Sum Across                   VSUM (S8,S16,S32)                      -                        -                        M_V_HADDA (S8,S16)
Sum of Absolute Differences  -                                      PSAD (S8)                PSAD (S8)                M_V_AADDA (S8,S16)


3.1 Integer Instructions

3.1.1 Integer Arithmetic Instructions 

Operation
Altivec
MMX
SSE
SSE2
MOM_64
add/sub (modulo arithmetic)
VADD
PADD

PADD
M_V_ADD
(Vu8 + Vu8 ) → Vu8 VADDUBM PADDB

PADDB M_V_ADD_UW_B
M_VS_ADD_UW_B
(Vu16 + Vu16 ) → Vu16 VADDUHM PADDW
PADDW M_V_ADD_UW
M_VS_ADD_UW
(Vu32 + Vu32) → Vu32 VADDUWM PADDD
PADDD
(Vu64 + Vu64) → Vu64


PADDQ

(Vs8 + Vs8) → Vs8



M_V_ADD_SW_B
M_VS_ADD_SW_B
(Vs16 + Vs16) → Vs16



M_V_ADD_SW_W
M_VS_ADD_SW_W
add/sub (saturating arithmetic) VADD PADD

PADD
M_V_ADD
(Vu8 + Vu8) ⇒ Vu8 VADDUBS PADDUSB

PADDUSB M_V_ADD_US_B
M_VS_ADD_US_B
(Vu16 + Vu16) ⇒ Vu16 VADDUHS PADDUSW
PADDUSW M_V_ADD_US_W
M_VS_ADD_US_W
(Vu32 + Vu32) ⇒ Vu32 VADDUWS



(Vs8 + Vs8) ⇒ Vs8 VADDSBS PADDSB
PADDSB M_V_ADD_SS_B
M_VS_ADD_SS_B
(Vs16 + Vs16) ⇒ Vs16 VADDSHS PADDSW
PADDSW M_V_ADD_SS_W
M_VS_ADD_SS_W
(Vs32 + Vs32) ⇒ Vs32 VADDSWS



Add and carry out





C(Vu32 + Vu32) → Vu32
VADDCUW




average VAVG

PAVG PAVG
M_AVG
(Vu8 avg Vu8) → Vu8 VAVGUB
  PAVGB PAVGB M_V_AVG_U_B
(Vu16 avg Vu16) → Vu16 VAVGUH   PAVGW PAVGW M_V_AVG_U_W
(Vu32 avg Vu32) → Vu32 VAVGUW



(Vs8 avg Vs8) → Vs8 VAVGSB


M_V_AVG_S_B
(Vs16 avg Vs16) → Vs16 VAVGSH


M_V_AVG_S_W
(Vs32 avg Vs32) → Vs32
VAVGSW




max
VMAX

PMAX
PMAX

(Vu8 max Vu8) → Vu8
VMAXUB

PMAXUB
PMAXUB
(Vs8 max Vs8) → Vs8
VMAXSB




(Vu16 max Vu16) → Vu16 VMAXUH




(Vs16 max Vs16) → Vs16 VMAXSH

PMAXSW PMAXSW
(Vu32 max Vu32) → Vu32 VMAXUW




(Vs32 max Vs32) → Vs32 VMAXSW




min VMIN PMIN PMIN
(Vu8 min Vu8) → Vu8 VMINUB PMINUB PMINUB
(Vs8 min Vs8) → Vs8 VMINSB
(Vu16 min Vu16) → Vu16 VMINUH
(Vs16 min Vs16) → Vs16 VMINSH PMINSW PMINSW
(Vu32 min Vu32) → Vu32 VMINUW
(Vs32 min Vs32) → Vs32 VMINSW
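
The difference between the modulo (→) and saturating (⇒) rows of this table, and the rounding used by the average row, can be sketched with a scalar model in Python. This only illustrates the lane semantics on unsigned 8-bit values; the function names are invented for the example and do not belong to any of the ISAs.

```python
def add_modulo_u8(a, b):
    """Lane-wise wrap-around add on unsigned 8-bit lanes (PADDB / VADDUBM style)."""
    return [(x + y) & 0xFF for x, y in zip(a, b)]


def add_saturate_u8(a, b):
    """Lane-wise add clamped to 255 on unsigned 8-bit lanes (PADDUSB / VADDUBS style)."""
    return [min(x + y, 0xFF) for x, y in zip(a, b)]


def avg_u8(a, b):
    """Lane-wise rounded average, (x + y + 1) >> 1 (PAVGB / VAVGUB style)."""
    return [(x + y + 1) >> 1 for x, y in zip(a, b)]
```

For the lanes 250 and 10, the modulo add wraps to 4 while the saturating add clamps to 255.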


3.1.2 Integer Multiplication

Operation
Altivec
MMX
SSE
SSE2
MOM
Multiply





Vs8 x Vs8 → Vs8



M_V_MUL_SS_B
M_VS_MUL_SS_B
Vs16 x Vs16 → Vs16



M_V_MUL_SS_W
M_VS_MUL_SS_W
Vs32 x Vs32 → Vs32



M_V_MUL_SS_D
M_VS_MUL_SS_D
Vu32 x Vu32 → Vu64


PMULUDQ
Truncation Multiply




U16(Vu16 x Vu16) → Vu16

PMULHUW

U16(Vs16 xVs16) → Vs16
PMULHW
PMULHW
L16(Vs16 x Vs16) → Vs16
PMULLW

PMULLW
Even-Odd Multiply





E(Vu8) x E(Vu8) → Vu16 VMULEUB



E(Vu16) x E(Vu16) → Vu32 VMULEUH



E(Vs8) x E(Vs8) → Vs16 VMULESB



E(Vs16) x E(Vs16) → Vs32 VMULESH




O(Vu8) x O(Vu8) → Vu16 VMULOUB




O(Vu16) x O(Vu16) → Vu32 VMULOUH



O(Vs8) x O(Vs8) → Vs16 VMULOSB



O(Vs16) x O(Vs16) → Vs32 VMULOSH
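
As a rough illustration of the truncation and even-odd rows above, the sketch below models the upper half (U16) and lower half (L16) of a signed 16x16 product, and a full-width multiply of the even lanes. The helper names are hypothetical, and "even" here simply means list positions 0, 2, 4, ...; how those positions map onto register bytes differs between architectures and is not modelled.

```python
def to_s16(x):
    """Interpret the low 16 bits of x as a signed 16-bit integer."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def mul_high_s16(a, b):
    """Upper 16 bits of each signed 32-bit product (PMULHW style)."""
    return [to_s16((x * y) >> 16) for x, y in zip(a, b)]


def mul_low_s16(a, b):
    """Lower 16 bits of each product (PMULLW style)."""
    return [to_s16(x * y) for x, y in zip(a, b)]


def mul_even_u8(a, b):
    """Full 16-bit products of the even-positioned unsigned 8-bit lanes (VMULEUB style)."""
    return [x * y for x, y in zip(a[0::2], b[0::2])]
```

The two truncating forms together recover the whole product: high * 65536 + (low & 0xFFFF) equals x * y.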





3.1.3 Integer Reductions

Operation
Altivec
MMX
SSE SSE2
MOM
Multiply and Accumulate




L16(Vs16 x Vs16 + Vu16) → Vs16 VMLADDUHM



U16(Vs16 x Vs16 + Vs16) ⇒ Vs16 VMHADDSHS



R+(Fs16 x Fs16) → Fs32
PMADDWD
PMADDWD
R+(Vs8 x Vs8) ⇒ ACCs24



M_V_MULA_B
M_VS_MULA_B
R+(Vs16 x Vs16) ⇒ ACCs48



M_V_MULA_W
M_VS_MULA_W
Vector Multiply - Sum




R+(Vu8 x Vu8) + Vu32 → Vu32 VMSUMUBM



R+(Vu16 x Vu16) + Vu32 → Vu32 VMSUMUHM



R+(Vu8 x Vs8) + Vs32  → Vs32 VMSUMMBM



R+(Vs16 x Vs16) + Vs32 → Vs32 VMSUMSHM



R+(Vu16 x Vu16) + Vu32 ⇒ Vu32 VMSUMUHS



R+(Vs16 x Vs16) + Vs32 ⇒ Vs32 VMSUMSHS



R+(Vs8 x Vs8 +Vs8 x Vs8 ) → Vs16



M_V_MADD_S_B
R+(Vs16 x Vs16+ Vs16 x Vs16) → Vs32



M_V_MADD_S_W
Vector Sum Across




R+(Vs8) ⇒ Vs64




M_HADDA_S_B
R+(Vs16) ⇒ Vs64




M_HADDA_S_W
R+(Vs32) + Vs32 ⇒ Vs32 VSUMSWS



R+(Vs32) + E(Vs32) ⇒ E(Vs32) VSUM2SWS



R+(Vs8) + Vs32 ⇒ Vs32 VSUM4SBS



R+(Vs16) + Vs32 ⇒ Vs32 VSUM4SHS



Vector Sum of Absolute Differences




R+(Abs(Ms8)) ⇒ Ms64

PSADBW PSADBW M_AADDA_S_B
R+(Abs(Ms16)) ⇒ Ms64



M_AADDA_S_W
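
The reductions in this table combine several lanes into fewer, wider results. A minimal sketch of two of them, assuming plain Python integers for the lanes (so the 32-bit wrap-around that PMADDWD can show in one corner case is not modelled); the function names are illustrative only.

```python
def pmaddwd(a, b):
    """Multiply signed 16-bit lanes and add adjacent pairs of products (PMADDWD style)."""
    prods = [x * y for x, y in zip(a, b)]
    return [prods[i] + prods[i + 1] for i in range(0, len(prods), 2)]


def sad_u8(a, b):
    """Sum of absolute differences over 8-bit lanes (PSADBW style)."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

Four 16-bit lanes therefore yield two 32-bit sums, and a whole vector of bytes yields one scalar distance, which is the typical inner step of block-matching motion estimation.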

3.1.4 Integer Compare Instructions

Operation
Altivec
MMX SSE
SSE2
MOM
Compare Greater than unsigned





m(Vu8>Vu8) → V8
VCMPGTUB



M_CMPGT.m.u.b
m(Vu16>Vu16) → V16 VCMPGTUH


M_CMPGT.m.u.w
m(Vu32>Vu32) → V32 VCMPGTUW



Compare Greater than signed





m(Vs8>Vs8) → V8 VCMPGTSB PCMPGTB

PCMPGTB M_CMPGT.m.s.b
m(Vs16>Vs16) → V16 VCMPGTSH PCMPGTW

PCMPGTW M_CMPGT.m.s.w
m(Vs32>Vs32) → V32 VCMPGTSW PCMPGTD

PCMPGTD
Compare Equal to





m(Vu8==Vu8) → V8 VCMPEQUB
PCMPEQB

PCMPEQB M_CMPEQ.m.u.b
m(Vu16==Vu16) → V16 VCMPEQUH PCMPEQW

PCMPEQW M_CMPEQ.m.u.w
m(Vu32==Vu32) → V32 VCMPEQUW PCMPEQD

PCMPEQD


3.1.5 Integer Logical Instructions

Operation Altivec MMX SSE
SSE2
MOM
Logical AND





V & V → V
VAND
PAND

PAND M_AND.m.u.q
Logical OR




V | V → V
VOR
POR

POR M_OR.m.u.q
Logical XOR





V ⊕ V
VXOR
PXOR

PXOR M_XOR.m.u.q
Logical AND with complement





!V & V → V
VANDC
PANDN

PANDN M_NAND.m.u.q
Logical NOR





!(V | V) → V
VNOR
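
The compare instructions of the previous section produce masks (all ones or all zeros per lane), and the logical instructions listed here are what turn those masks into a branch-free select. The following is a sketch of that idiom on 8-bit lanes held as byte patterns 0..255; the names are invented for the example.

```python
def to_s8(x):
    """Interpret a byte pattern 0..255 as a signed 8-bit integer."""
    return x - 0x100 if x & 0x80 else x


def cmpgt_s8(a, b):
    """Signed compare: 0xFF where a > b, else 0x00 (PCMPGTB / VCMPGTSB style)."""
    return [0xFF if to_s8(x) > to_s8(y) else 0x00 for x, y in zip(a, b)]


def select_u8(mask, a, b):
    """Per lane (mask & a) | (~mask & b), built from AND, AND-with-complement and OR."""
    return [(m & x) | (~m & 0xFF & y) for m, x, y in zip(mask, a, b)]


def max_s8(a, b):
    """Signed maximum built from one compare and one select."""
    return select_u8(cmpgt_s8(a, b), a, b)
```

This is how a signed byte maximum can be composed on an extension that, according to the tables above, lacks a direct instruction for it.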





3.1.6 Integer Rotate/Shift Instructions

Operation Altivec MMX SSE
SSE2
MOM
Rotate Left






VRLB





VRLH





VRLW




Shift Left





V8 << V8 → V8
VSLB



M_SLL.ms.u.b
V16 << V16 → V16 VSLH
PSLLW

PSLLW M_SLL.ms.u.w
V32 << V32 → V32 VSLW
PSLLD

PSLLD
.V64 << .V64 → .V64
PSLLQ

PSLLQ M_SLL.ms.u.q
V64 << imm8 → V64


PSLLDQ

Shift Right





V8 >> V8 → V8 VSRB



V16 >> V16 → V16 VSRH
PSRLW
PSRLW
V32 >> V32 → V32 VSRW
PSRLD

PSRLD
.V64 >> .V64 → .V64
PSRLQ

PSRLQ
V64 >> imm8 → V64


PSRLDQ

Shift Right Arithmetic





V8 _>> V8 → V8 VSRAB



M_SRA.ms.u.b
V16 _>> V16 → V16 VSRAH PSRAW
PSRAW M_SRA.ms.u.w
V32 _>> V32 → V32 VSRAW PSRAD

PSRAD M_SRA.ms.u.d





M_SRA.ms.u.q
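
To make the three kinds of operation in this table concrete, here is a small scalar model of a logical right shift, an arithmetic right shift (the `_>>` symbol) and a rotate left. Lanes are held as unsigned bit patterns, the shift count is assumed to be smaller than the lane width, and the function names are hypothetical.

```python
def shift_right_logical_16(lanes, count):
    """Shift in zeros from the left (PSRLW / VSRH style)."""
    return [(x & 0xFFFF) >> count for x in lanes]


def shift_right_arith_16(lanes, count):
    """Shift in copies of the sign bit (PSRAW / VSRAH style)."""
    out = []
    for x in lanes:
        x &= 0xFFFF
        signed = x - 0x10000 if x & 0x8000 else x
        out.append((signed >> count) & 0xFFFF)
    return out


def rotate_left_8(lanes, count):
    """Bits shifted out on the left re-enter on the right (VRLB style)."""
    count %= 8
    return [((x << count) | (x >> (8 - count))) & 0xFF for x in lanes]
```

On the pattern 0x8000 a logical shift by 4 gives 0x0800, whereas the arithmetic shift gives 0xF800.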


3.2 Floating Point Instructions

* There are no FP instructions in MMX (so it is not shown here)
** SSE versions of SIMD-FP instructions also have a scalar mode

3.2.1 Floating Point Arithmetic

Operation Altivec
SSE
SSE2
MOM
Vector Add




Vfp32 + Vfp32  → Vfp32

VADDFP
ADDPS
ADDSS (scalar)


Vfp64 + Vfp64 → Vfp64


ADDPD
ADDSD (scalar)

Vector Sub




Vfp32 - Vfp32  → Vfp32 VSUBFP SUBPS
SUBSS (scalar)


Vfp64 - Vfp64  → Vfp64

SUBPD
SUBSD (scalar)

Vector Multiply




Vfp32 x Vfp32 → Vfp32


MULPS
MULSS (scalar)


Vfp64 x Vfp64 → Vfp64

MULPD
MULSD (scalar)

Multiply and Add




(Vfp32 x Vfp32 ) + Vfp32 → Vfp32
VMADDFP



Multiply and Sub




(Vfp32 x Vfp32 ) - Vfp32 → Vfp32 VNMSUBFP



Horizontal Add




R+(Vfp32) → Vfp32


HADDPS (SSE3)

R+(Vfp64) → Vfp64

HADDPD (SSE3)

Horizontal Sub




R-(Vfp32) → Vfp32

HSUBPS (SSE3)

R-(Vfp64) → Vfp64

HSUBPD (SSE3)
AddSub




Vfp[127..96] + Vfp[127..96], Vfp[95..64] - Vfp[95..64],
Vfp[63..32] + Vfp[63..32], Vfp[31..0] - Vfp[31..0] → Vfp32


ADDSUBPS (SSE3)

Vfp[127..64] + Vfp[127..64],
Vfp[ 63.. 0] - Vfp[ 63.. 0] → Vfp64


ADDSUBPD(SSE3)

Maximum




(Vfp32 max Vfp32) → Vfp32
VMAXFP
MAXPS
MAXSS(scalar)


(Vfp64 max Vfp64) → Vfp64

MAXPD
MAXSD (scalar)

Minimum



(Vfp32 min Vfp32) → Vfp32 VMINFP
MINPS
MINSS (scalar)


(Vfp64 min Vfp64) → Vfp64

MINPD
MINSD (scalar)
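
The SSE3 horizontal and add/sub rows are easier to read as lane arithmetic. The sketch below models the single-precision forms on Python lists where index 0 stands for bits 31..0; Python floats are double precision, so rounding behaviour is not reproduced, and the function names simply reuse the mnemonics for readability.

```python
def haddps(dst, src):
    """Add adjacent pairs: low two results come from dst, high two from src."""
    return [dst[0] + dst[1], dst[2] + dst[3], src[0] + src[1], src[2] + src[3]]


def addsubps(dst, src):
    """Subtract in even lanes (bits 31..0, 95..64), add in odd lanes (63..32, 127..96)."""
    return [d - s if i % 2 == 0 else d + s
            for i, (d, s) in enumerate(zip(dst, src))]
```

The alternating pattern of `addsubps` is the one needed for the real and imaginary parts of a complex multiplication.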



3.2.2 Floating Point Logic

Operation Altivec
SSE
SSE2
MOM
Vector AND




(Vfp32 & Vfp32)  → Vfp32

ANDPS


(Vfp64 & Vfp64 ) → Vfp64

ANDPD
Vector AND NOT




(Vfp32 ~& Vfp32)  → Vfp32
ANDNPS

(Vfp64 ~& Vfp64)  → Vfp64

ANDNPD
Vector OR




(Vfp32 | Vfp32) → Vfp32

ORPS


(Vfp64 | Vfp64) → Vfp64

ORPD
Vector XOR




(Vfp32 ^ Vfp32) → Vfp32
XORPS


(Vfp64 ^ Vfp64) → Vfp64

XORPD


3.2.3 Floating Point Division, Square Root, Log and Exponentials

Operation Altivec
SSE
SSE2
MOM
Vector Divide




Vfp32 / Vfp32 → Vfp32

DIVPS
DIVSS (scalar)


Vfp64 / Vfp64 → Vfp64

DIVPD
DIVSD (scalar)

Vector Reciprocal Estimate




1/Vfp32 → Vfp32
VREFP
RCPPS
RCPSS (Scalar)


Vector Square Root



Sqrt(Vfp32) → Vfp32
SQRTPS
SQRTSS (scalar)


Sqrt(Vfp64) → Vfp64

SQRTPD
SQRTSD (scalar)

Vector Reciprocal Square Root Estimate



1/Sqrt(Vfp32) → Vfp32
VRSQRTEFP
RSQRTPS
RSQRTSS (scalar)


Vector Log2 Estimate



Log2(Vfp32) → Vfp32
VLOGEFP



Vector 2 Raised to Exponent Estimate




Exp2(Vfp32) → Vfp32
VEXPTEFP





3.2.4 FP Rounding and Conversion

Operation Altivec
SSE
SSE2
MOM
Round to FP Integer Nearest




RoundN(Vfp32) → Vfp32
VRFIN



Round to FP Integer toward Zero



RoundZ(Vfp32) → Vfp32 VRFIZ



Round to FP Integer toward Positive Infinity




Round+I(Vfp32) → Vfp32 VRFIP


Round to FP Integer toward Minus Infinity



Round-I(Vfp32) → Vfp32 VRFIM



Vector Convert to FP from Unsigned Fixed Point




Vu32 → Vfp32
VCFUX



Vector Convert to FP from Signed Fixed Point



Vs32 → Vfp32
V.s32 → V.fp32
VCFSX CVTPI2PS
CVTSI2SS (scalar)


Vs64 → Vfp32
CVTDQ2PS
Vs64 → Vfp64


CVTDQ2PD

Vs32 → Vfp64


CVTPI2PD

R32 → .Vfp64 CVTSI2SD
Vector Convert to Unsigned Fixed Point Word Saturate




Vfp32 ⇒Vu32
VCTUXS



Vector Convert to Signed Fixed Point Word Saturate



Vfp32 ⇒Vs32 VCTSXS CVTPS2PI
CVTSS2SI (scalar)


Vfp32 ⇒Vs64 CVTPS2DQ
Vfp64 → Vs32
.Vfp64 → R32


CVTPD2PI
CVTSD2SI

Vector Convert to Signed Fixed Point Word Truncate



L64(Vfp32)→Vs32
.Vfp32→Rs32

CVTTPS2PI
CVTTSS2SI (scalar)
CVTTPS2DQ
CVTTSD2SI

Vfp64→ 0 || Vs32

CVTTPD2DQ

Vfp64→ Fs32

CVTTPD2PI

Vector Convert from FP to FP




Vfp64 → 0 || Vfp32
.Vfp64 → .Vfp32


CVTPD2PS
CVTSD2SS

Vfp64 → 0 || Vs64

CVTPD2DQ
Vfp32 → Vfp64
.Vfp32 → .Vfp64


CVTPS2PD
CVTSS2SD




3.2.5 Floating Point Compare
* The SSE compare instructions for packed single-precision FP are pseudo-operations of the single CMPPS instruction, selected by the value of its imm8 parameter.

Operation Altivec
SSE
SSE2
MOM
Vector Compare Greater than FP




m(Vfp32 > Vfp32) → V32
VCMPGTFP



Vector Compare Equal To FP




m(Vfp32 == Vfp32) → V32 VCMPEQFP
CMPEQPS *

m(Vfp64 == Vfp64) → V64

CMPEQPD *

Vector Compare Greater than or Equal To FP




m(Vfp32 >_ Vfp32) → V32 VCMPGEFP


Vector Compare Bounds FP




m(Vfp32 <> Vfp32) → V32 VCMPBFP



Vector Compare Less than FP




m(Vfp32 < Vfp32) → V32
CMPLTPS

m(Vfp64 < Vfp64) → V64

CMPLTPD
Vector Compare Less than or Equal To FP



m(Vfp32 <= Vfp32) → V32
CMPLEPS

m(Vfp64 <= Vfp64) → V64

CMPLEPD
Vector Compare Unordered FP



m(Vfp32 ? Vfp32) → V32
CMPUNORDPS


m(Vfp64 ? Vfp64) → V64

CMPUNORDPD
Vector Compare Not Equal




m(Vfp32 != Vfp32) → V32
CMPNEQPS

m(Vfp64 != Vfp64) → V64

CMPNEQPD
Vector Compare Not Less than FP



m(!(Vfp32 < Vfp32)) → V32
CMPNLTPS

m(!(Vfp64 < Vfp64)) → V64

CMPNLTPD
Vector Compare Not Less than or Equal FP



m(!(Vfp32 <= Vfp32)) → V32
CMPNLEPS

m(!(Vfp64 <= Vfp64)) → V64

CMPNLEPD
Vector Compare Ordered FP



m(!(Vfp32 ? Vfp32)) → V32
CMPORDPS

m(!(Vfp64 ? Vfp64)) → V64

CMPORDPD
Scalar Compare






CMPSS





CMPSD
Scalar Ordered Compare and Set Flags






COMISS





COMISD
Scalar Unordered Compare and Set Flags



single precision

UCOMISS





UCOMISD
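
The note at the top of this section says that the packed SSE compares are all encodings of one CMPPS instruction with different imm8 values. A sketch of that mapping for the eight original predicates (imm8 values 0 to 7) follows; the dictionary and function are illustrative and rely on Python's NaN comparison rules, which return False for every ordered comparison.

```python
import math

# imm8 value -> (pseudo-op suffix, predicate on one pair of lanes)
CMP_PREDICATES = {
    0: ("EQ", lambda a, b: a == b),
    1: ("LT", lambda a, b: a < b),
    2: ("LE", lambda a, b: a <= b),
    3: ("UNORD", lambda a, b: math.isnan(a) or math.isnan(b)),
    4: ("NEQ", lambda a, b: not (a == b)),
    5: ("NLT", lambda a, b: not (a < b)),
    6: ("NLE", lambda a, b: not (a <= b)),
    7: ("ORD", lambda a, b: not (math.isnan(a) or math.isnan(b))),
}


def cmpps(a, b, imm8):
    """Return an all-ones 32-bit mask per lane where the selected predicate holds."""
    _, pred = CMP_PREDICATES[imm8 & 7]
    return [0xFFFFFFFF if pred(x, y) else 0 for x, y in zip(a, b)]
```

Note that "not less than" (5) is not the same as "greater than or equal": it is also true when either operand is a NaN.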



3.3 Load and Store Instructions

3.3.1 Integer Load Instructions

Operation Altivec
MMX SSE
SSE2
MOM
Load Vector Element Indexed





mem8 → V8
LVEBX




mem16 → V16
LVEHX




mem32 → V32
LVEWX




Load Vector Indexed




mem128 → V (forced alignment)
LVX



M_LD.m.u.q
Load Vector Indexed LRU




mem128 → V (forced alignment), transient LVXL




Load Vector for Shift Left





f(R+R) → V
LVSL




Load Vector for Shift Right





f(R+R) → V LVSR
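
"Forced alignment" in the LVX rows means that the low four bits of the effective address are ignored, so an unaligned vector has to be assembled from two aligned loads. The sketch below models that idiom over a Python bytes object; the slice offset plays the role of the control vector that LVSL would produce for a permute. Function names are hypothetical.

```python
def lvx(memory, addr):
    """Load 16 bytes with the low four address bits ignored (forced alignment)."""
    base = addr & ~0xF
    return memory[base:base + 16]


def load_unaligned(memory, addr):
    """Emulate an unaligned 16-byte load from two aligned loads."""
    lo = lvx(memory, addr)        # aligned block containing the first byte
    hi = lvx(memory, addr + 15)   # aligned block containing the last byte
    shift = addr & 0xF            # offset that LVSL would encode
    return (lo + hi)[shift:shift + 16]
```

When the address is already aligned both loads hit the same block and the result is simply that block.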






3.3.2 Integer Store Instructions

Operation Altivec
MMX MOM
Store Vector Element Integer Indexed



V8 →mem8
STVEBX


V16 → mem16
STVEHX


V32 → mem32
STVEWX


Store Vector Indexed



V → mem128 (forced alignment)
STVX

M_ST.m.u.q
Store Vector Indexed LRU



V → mem128 (forced alignment), transient STVXL




3.3.3 Data Movement Instructions (loads and stores)

Operation Altivec
MMX
SSE
SSE2
SSE3 MOM
Move





MM[31..0] → R[31..0]
MM[31..0] → mem
R[31..0]→ MM[31..0]
mem → MM[31..0]

MOVD


MM[63..0] → MM[63..0]
MM[63..0] → mem
mem → MM[63..0]

MOVQ


XMM[127..0] → XMM[127..0]
XMM[127..0] → mem128
mem128 → XMM[127..0]



MOVDQA (aligned)
MOVDQU (unaligned)
LDDQU
(unaligned load, supports splits between cache lines)

MMX[63..0] → XMM[63..0]


MOVQ2DQ

XMM[63..0] → MMX[63..0] MOVDQ2Q
Move Aligned FP





XMM[127..0] → XMM[127..0]
XMM[127..0]→ mem ( exception if unaligned)
mem → XMM[127..0] ( exception if unaligned)


MOVAPS
(single precision)
MOVAPD
(double precision)

Move Unaligned FP





XMM[127..0] → XMM[127..0]
XMM[127..0] → mem128
mem128 → XMM[127..0] 


MOVUPS
(single precision)
MOVUPD
(double precision)

Move Aligned High FP




XMM[127..64] → mem64
mem64 → XMM[127..64]


MOVHPS
(single precision)
MOVHPD
(double precision)

Move High to Low FP
XMM1[127-64] → XMM1[127-64]
XMM2[127-64] → XMM1[63-0]
MOVHLPS
(single precision)
Move Aligned Low FP




XMM[63..0] → mem64
mem64 → XMM[63..0]


MOVLPS
(single precision)
MOVLPD
(double precision)

Move Low to High FP
XMM2[63-0] → XMM1[127-64]
XMM1[63-0] → XMM1[63-0]
MOVLHPS
(single precision)
Move and duplicate





XMM[63..0] → XMM[127..64], XMM[63..0]
mem64[63..0] → XMM[127..64], XMM[63..0]



MOVDDUP (SSE3)

XMM[127..96] →XMM[127..96], XMM[95..64]
XMM[63..32] → XMM[63..32], XMM[31..0]



MOVSHDUP (SSE3)

XMM[95..64] →XMM[127..96], XMM[95..64]
XMM[31..0] → XMM[63..32], XMM[31..0]



MOVSLDUP (SSE3)

Move Sign Mask To Integer FP





XMM[i*32-1] → R[i]


MOVMSKPS
(single precision)
MOVMSKPD
(double precision)

Move Scalar FP




V.fp32/64 → V.fp32/64
V.fp32/64 → mem ( exception if unaligned)
mem → V.fp32/64 ( exception if unaligned)


MOVSS
(single precision)
MOVSD
(double precision)

Extract Word
MMX_select_by_imm[15-0]→r[15-0]
0x0000 → r[31-16]
PEXTRW
Insert Word
r32[15-0]→ MMX_select_by_imm[15-0] PINSRW


3.4 Formatting Instructions

3.4.1 Pack

Operation Altivec
MMX
SSE
SSE2
MOM
Pack Unsigned Integer Unsigned Modulo





(Vu16 || Vu16) → Vu8
VPKUHUM



M_PCK.m.uw.b
(Vu32 || Vu32) → Vu16 VPKUWUM



M_PCK.m.uw.w
Pack Unsigned Integer Unsigned Saturate




(Vu16 || Vu16) ⇒ Vu8 VPKUHUS




(Vu32 || Vu32) ⇒ Vu16 VPKUWUS





Pack Signed Integer Unsigned Saturate




(Vs16 || Vs16 )  ⇒ Vu8
VPKSHUS
PACKUSWB
PACKUSWB M_PCK.m.us.b
(Vs32 || Vs32 )  ⇒ Vu16 VPKSWUS



M_PCK.m.us.w
Pack Signed Integer Signed Saturate




(Vs16 || Vs16 )  ⇒ Vs8 VPKSHSS
PACKSSWB

PACKSSWB M_PCK.m.ss.b
(Vs32 || Vs32 )  ⇒ Vs16 VPKSWSS
PACKSSDW

PACKSSDW M_PCK.m.ss.w
Pack Pixel





(V || V) → Vpixel
VPKPX
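
Pack instructions concatenate two source vectors (the `||` in the notation) and narrow every lane, either by keeping the low bits (→) or by clamping to the destination range (⇒). A scalar sketch, with illustrative function names; here the lanes of the first operand come first in the result, and the mapping onto register halves is not modelled.

```python
def pack_u16_to_u8_modulo(a, b):
    """Keep the low 8 bits of each 16-bit lane (VPKUHUM style)."""
    return [x & 0xFF for x in a + b]


def pack_s16_to_u8_saturate(a, b):
    """Clamp each signed 16-bit lane to 0..255 (PACKUSWB / VPKSHUS style)."""
    return [max(0, min(255, x)) for x in a + b]


def pack_s16_to_s8_saturate(a, b):
    """Clamp each signed 16-bit lane to -128..127 (PACKSSWB / VPKSHSS style)."""
    return [max(-128, min(127, x)) for x in a + b]
```

The signed-to-unsigned saturating form is the one commonly used to turn 16-bit intermediate pixel values back into bytes.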







3.4.2 Unpack

Operation Altivec
MMX
SSE
SSE2
MOM
Unpack High Signed Integer





U(Vs8) → Vs16
VUPKHSB




U(Vs16) → Vs32
VUPKHSH




Unpack Low Signed Integer




L(Vs8) → Vs16 VUPKLSB




L(Vs16) → Vs32 VUPKLSH



Unpack High Pixel




U(Vpixel) → V32
VUPKHPX




Unpack Low Pixel




L(Vpixel) → V32
VUPKLPX






3.4.3 Merge

Operation Altivec
MMX
SSE
SSE2
MOM
Vector Merge High





U(V8) ∧∨ U(V8) → V
VMRGHB
PUNPCKHBW
PUNPCKHBW M_UPCK.m.h.b
U(V16) ∧∨ U(V16) → V VMRGHH
PUNPCKHWD
PUNPCKHWD M_UPCK.m.h.w
U(V32) ∧∨ U(V32) → V VMRGHW
PUNPCKHDQ
PUNPCKHDQ
U(V64) ∧∨ U(V64) → V


PUNPCKHQDQ
U(Vfp32) ∧∨ U(Vfp32) → V

UNPCKHPS

U(Vfp64) ∧∨ U(Vfp64) → V


UNPCKHPD
Vector Merge Low Integer




L(V8) ∧∨ L(V8) → V VMRGLB
PUNPCKLBW
PUNPCKLBW M_UPCK.m.l.b
L(V16) ∧∨ L(V16) → V VMRGLH
PUNPCKLWD
PUNPCKLWD M_UPCK.m.l.w
L(V32) ∧∨ L(V32) → V VMRGLW
PUNPCKLDQ
PUNPCKLDQ
L(V64) ∧∨ L(V64) → V


PUNPCKLQDQ
L(Vfp32) ∧∨ L(Vfp32) → V

UNPCKLPS

L(Vfp64) ∧∨ L(Vfp64) → V


UNPCKLPD
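
The interleave symbol (∧∨) used in this table means that lanes from the two sources alternate in the result. A minimal sketch on Python lists, where index 0 stands for the least significant lane and "low" means the first half of the list; the function names are invented for the example.

```python
def merge_low(a, b):
    """Interleave the low halves of a and b: a0, b0, a1, b1, ... (PUNPCKL* style)."""
    half = len(a) // 2
    out = []
    for x, y in zip(a[:half], b[:half]):
        out += [x, y]
    return out


def merge_high(a, b):
    """Interleave the high halves of a and b (PUNPCKH* style)."""
    half = len(a) // 2
    out = []
    for x, y in zip(a[half:], b[half:]):
        out += [x, y]
    return out
```

Merging with a vector of zeros as the second operand is a common way to zero-extend lanes to twice their width on extensions without an unsigned unpack.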


3.4.4 Splat
Operation Altivec
MMX/SSE
MOM
Vector Splat Integer




VSPLTB



VSPLTH



VSPLTW


Vector Splat Immediate Signed Integer



VSPLTISB



VSPLTISH



VSPLTISW




3.4.5 Vector Shift
Operation Altivec
MMX/SSE
MOM
Vector Shift Left




VSL


Vector Shift Right




VSR

Vector Shift Left Double by Octet Immediate




VSLDOI


Vector Shift Left by Octet




VSLO


Vector Shift Right by Octet



VSRO





3.4.6 Permute, Shuffle and others
Operation Altivec
SSE
SSE2
MOM
Vector Permute




(V8 || V8) [Vu8] →V8[i] VPERM


Vector Shuffle






PSHUFW





PSHUFD




PSHUFHW



PSHUFLW

SHUFPS
(single precision)
Move Byte Mask
byte_mask → r32 PMOVMSKB
Matrix Transpose








M_TRANS.m.u.b




M_TRANS.m.u.w
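
The shuffle and byte-mask rows carry no operation description in the table, so the following sketch shows what two of them compute. It models PSHUFD (each 2-bit field of imm8 selects a source lane) and PMOVMSKB (the top bit of every byte is gathered into an integer); index 0 of a list stands for the least significant lane, and the functions reuse the mnemonics only for readability.

```python
def pshufd(src, imm8):
    """Result lane i takes the source lane named by bits 2i+1..2i of imm8."""
    return [src[(imm8 >> (2 * i)) & 3] for i in range(4)]


def pmovmskb(v):
    """Bit i of the result is the most significant bit of byte lane i."""
    return sum(((b >> 7) & 1) << i for i, b in enumerate(v))
```

With imm8 = 0x1B the four lanes are reversed, and with imm8 = 0x00 lane 0 is replicated to all positions, which gives a splat.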


3.5 System Instructions

3.5.1 User level Cache instructions

Operation Altivec
MMX
SSE SSE2
MOM
Data Stream Touch
DST



Data Stream Touch Transient
DSTT



Data Stream Touch for Store
DSTST



Data Stream Touch for Store Transient
DSTSTT



Data Stream Stop
DSS




DSSALL









Flush





Flush and invalidate memory operand in cache



CLFLUSH

Prefetch





Prefetch data into caches


PREFETCHh

Fence





serialize stores


SFENCE

serialize loads



LFENCE

serialize load and stores



MFENCE

Non Temporal byte Mask Store of Packed Integer





if(mask) V8[i] → mem8[i] (64 bits)

MASKMOVQ

if(mask) V8[i] → mem8[i] (128 bits)


MASKMOVDQU
Non temporal Store of Packed Integer





F → mem (no write allocate)

MOVNTQ

V → mem (no write allocate)


MOVNTDQ

R → mem (no write allocate) MOVNTI
Non temporal Store of Packed FP





F → mem (no write allocate)

MOVNTPS

V → mem (no write allocate)


MOVNTPD
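
The byte-mask store rows (`if(mask) V8[i] → mem8[i]`) can be modelled as below: a byte is written only when the most significant bit of the corresponding mask byte is set. The non-temporal cache hint has no equivalent in this model, and the function name is hypothetical.

```python
def masked_store(memory, addr, data, mask):
    """Write data[i] to memory[addr + i] only where mask[i] has its top bit set."""
    for i, (d, m) in enumerate(zip(data, mask)):
        if m & 0x80:
            memory[addr + i] = d
    return memory
```

Bytes whose mask bit is clear keep their previous contents in memory.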



3.5.2 Move from/to vector status registers

Operation Altivec
MMX
SSE SSE2 MOM
Restore FP, MMX and SSE State FXRSTOR
Save FP, MMX and SSE State FXSAVE
Load SIMD Extension Control Status LDMXCSR
Store SSE Control Status STMXCSR

Notation

Symbol  Operation
R+      Additive reduction
R-      Subtractive reduction
Rx      Multiplicative reduction
{       Round to nearest (even)
E       Even values
m       Mask
.       Scalar value
<>      Bounds
f       Partial permute
U       Upper part of bytes
L       Lower part of bytes
∧∨      Interleave
⇒       Saturate arithmetic
→       Modulo (wrap around) arithmetic
⊕       Exclusive or
_>>     Right arithmetic shift


 

4. References

SIMD

MMX/SSE/SSE2/SSE3
Altivec
MOM