Java 26: Vector API — SIMD for Java (JEP 529 — Incubator)

Want 4x–16x speedups on data-parallel workloads? JEP 529 brings the Vector API to its 11th incubator round in JDK 26, enabling explicit SIMD (Single Instruction, Multiple Data) computations that compile at runtime to optimal vector instructions on supported CPUs.

Status: Incubator (11th round) — requires --add-modules jdk.incubator.vector

What Is SIMD?

Instead of processing one element at a time, SIMD processes multiple elements simultaneously:

Scalar:  a[0]+b[0], a[1]+b[1], a[2]+b[2], a[3]+b[3]  → 4 operations
SIMD:    a[0..3] + b[0..3]                           → 1 operation!

Modern CPUs process multiple floats per instruction (SSE: 4, AVX/AVX2: 8, AVX-512: 16, ARM NEON: 4; ARM SVE scales with the implementation's vector width, up to 64).

Key API Concepts

  • VectorSpecies — Defines element type + vector length (e.g., FloatVector.SPECIES_PREFERRED picks the best for the current CPU)
  • Vector<E> — A fixed-size sequence of elements (FloatVector, IntVector, DoubleVector, ByteVector, etc.)
  • Operations: add, sub, mul, div, min, max, fma, abs, neg, and, or
  • Reductions: reduceLanes(ADD), reduceLanes(MAX), etc.
  • Masks: compare(), blend() for conditional operations without branching
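To make the mask idea concrete without the incubator module, here is a plain-Java sketch (a hypothetical helper, not from the demo) of what compare() followed by blend() computes per lane:

```java
// Scalar model of the Vector API's mask-and-blend pattern: for each lane,
// pick b[i] where the predicate holds, otherwise keep a[i] -- the result
// va.blend(vb, mask) produces after mask = va.compare(VectorOperators.GT, vb).
public class MaskSketch {
    public static float[] minByBlend(float[] a, float[] b) {
        float[] out = new float[a.length];
        for (int i = 0; i < a.length; i++) {
            boolean mask = a[i] > b[i];   // models va.compare(GT, vb)
            out[i] = mask ? b[i] : a[i];  // models va.blend(vb, mask)
        }
        return out;
    }

    public static void main(String[] args) {
        float[] r = minByBlend(new float[]{5f, 1f}, new float[]{3f, 2f});
        System.out.println(r[0] + " " + r[1]); // 3.0 1.0
    }
}
```

The vectorized version evaluates every lane unconditionally and selects results with the mask, so there are no data-dependent branches to mispredict.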

Why Not Auto-Vectorization?

The JIT compiler can sometimes auto-vectorize simple loops, but:

  • It’s unpredictable — small code changes can disable it
  • It can’t handle complex patterns (conditionals, gather/scatter)
  • There’s no developer control or visibility

The Vector API gives you explicit, portable SIMD that compiles to vector instructions where the hardware supports them and falls back gracefully to scalar code where it does not.

Demo Highlights

1. Element-wise Array Multiplication

Process SPECIES.length() elements per loop iteration using SIMD:

static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

float[] a = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };
float[] b = { 2, 2, 2, 2, 3, 3, 3, 3, 4,  4,  4,  4,  5,  5,  5,  5 };
float[] result = new float[a.length];

int i = 0;
int upperBound = SPECIES.loopBound(a.length);

// Main vectorized loop
for (; i < upperBound; i += SPECIES.length()) {
    FloatVector va = FloatVector.fromArray(SPECIES, a, i);
    FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
    FloatVector vr = va.mul(vb);  // SIMD multiply — N elements at once
    vr.intoArray(result, i);
}

// Scalar tail for remaining elements
for (; i < a.length; i++) {
    result[i] = a[i] * b[i];
}
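The loopBound calculation is easy to model in plain Java: it rounds the array length down to a multiple of the lane count (a sketch with a hypothetical helper, not the actual VectorSpecies implementation):

```java
// Scalar model of SPECIES.loopBound(n) for non-negative n: the largest
// multiple of the lane count that is <= n. Elements past this bound are
// handled by the scalar tail loop.
public class LoopBoundSketch {
    public static int loopBound(int n, int laneCount) {
        return n - (n % laneCount);
    }

    public static void main(String[] args) {
        System.out.println(loopBound(19, 8)); // 16: two full 8-lane iterations
        System.out.println(loopBound(16, 8)); // 16: no scalar tail needed
    }
}
```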

2. SIMD Reduction — Sum of Array

Accumulate partial sums lane-wise, then reduce the final vector to a scalar (note: this reorders floating-point additions, so the result can differ in the last few ulps from a strict sequential sum):

FloatVector vsum = FloatVector.zero(SPECIES);
for (int i = 0; i < upperBound; i += SPECIES.length()) {
    FloatVector v = FloatVector.fromArray(SPECIES, data, i);
    vsum = vsum.add(v);
}
float sum = vsum.reduceLanes(VectorOperators.ADD);

3. Conditional (Masked) Operations — Clamp to [0, 100]

SIMD uses max/min instead of branching:

for (int i = 0; i < upperBound; i += SPECIES.length()) {
    FloatVector v = FloatVector.fromArray(SPECIES, data, i);
    v = v.max(0.0f).min(100.0f);  // Clamp — no branches, no masks needed
    v.intoArray(result, i);
}

4. SIMD Dot Product

The core operation in neural network inference, done entirely with SIMD using fused multiply-add:

FloatVector vsum = FloatVector.zero(SPECIES);
for (int i = 0; i < upperBound; i += SPECIES.length()) {
    FloatVector va = FloatVector.fromArray(SPECIES, a, i);
    FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
    vsum = va.fma(vb, vsum);  // Fused multiply-add: vsum += va * vb
}
float dot = vsum.reduceLanes(VectorOperators.ADD);
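A scalar reference like the following (a hypothetical helper, not part of the demo) is useful for unit-testing the vectorized dot product; Math.fma mirrors the per-lane fused multiply-add:

```java
// Scalar reference for the SIMD dot product: sum += a[i] * b[i] with a
// fused multiply-add per element, matching the per-lane fma semantics.
public class DotRef {
    public static float dot(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum = Math.fma(a[i], b[i], sum);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(dot(new float[]{1, 2, 3}, new float[]{4, 5, 6})); // 32.0
    }
}
```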

5. Scalar vs SIMD Benchmark

The demo runs a simple comparison on 10 million float elements, showing the SIMD speedup over a plain scalar loop. Typical speedups are 2x–8x depending on CPU, memory bandwidth, and JIT warmup; memory-bound loops rarely reach the theoretical lane-count ceiling.

Real-World Use Cases

The companion VectorApiRealWorldExamples class demonstrates eight production scenarios:

Image Brightness Adjustment

Photo editors, medical imaging, and satellite imagery pipelines need to adjust every pixel’s RGB channels. SIMD processes 8–16 channel values per instruction:

FloatVector offsetVec = FloatVector.broadcast(SPECIES, 40.0f);
FloatVector maxVec    = FloatVector.broadcast(SPECIES, 255.0f);

for (int i = 0; i < bound; i += SPECIES.length()) {
    FloatVector v = FloatVector.fromArray(SPECIES, pixels, i);
    v = v.add(offsetVec).max(0.0f).min(maxVec); // Add & clamp in one pass
    v.intoArray(result, i);
}

Cosine Similarity (Vector Search)

Search engines and recommendation systems compare embedding vectors using cosine similarity — the core operation in RAG pipelines, Elasticsearch kNN, and pgvector:

FloatVector vDot   = FloatVector.zero(SPECIES);
FloatVector vNormA = FloatVector.zero(SPECIES);
FloatVector vNormB = FloatVector.zero(SPECIES);

for (int i = 0; i < bound; i += SPECIES.length()) {
    FloatVector va = FloatVector.fromArray(SPECIES, query, i);
    FloatVector vb = FloatVector.fromArray(SPECIES, document, i);
    vDot   = va.fma(vb, vDot);       // dot   += a * b
    vNormA = va.fma(va, vNormA);      // normA += a²
    vNormB = vb.fma(vb, vNormB);      // normB += b²
}
double similarity = vDot.reduceLanes(ADD)
    / (Math.sqrt(vNormA.reduceLanes(ADD)) * Math.sqrt(vNormB.reduceLanes(ADD)));
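A plain-Java reference (a hypothetical helper, not from the demo) is handy for verifying the vectorized similarity on small inputs:

```java
// Scalar reference for cosine similarity: dot(a, b) / (|a| * |b|),
// accumulated in double to match the final double-precision result.
public class CosineRef {
    public static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot   += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        System.out.println(cosine(new float[]{1, 0}, new float[]{1, 0})); // 1.0
    }
}
```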

Audio Amplification (Gain + Clamp)

Music players, VoIP apps (Zoom, Discord), and game engines multiply audio samples by a gain factor. At 44.1 kHz stereo that’s 88,200 samples per second; SIMD keeps the per-buffer cost low, leaving headroom for the rest of the real-time effects chain:

FloatVector gainVec = FloatVector.broadcast(SPECIES, gain);

for (int i = 0; i < bound; i += SPECIES.length()) {
    FloatVector v = FloatVector.fromArray(SPECIES, samples, i);
    v = v.mul(gainVec).max(-1.0f).min(1.0f); // Amplify & clamp to prevent distortion
    v.intoArray(amplified, i);
}

RGB to Grayscale Conversion

The standard ITU-R BT.601 luma formula applied to entire batches of pixels in parallel:

FloatVector wR = FloatVector.broadcast(SPECIES, 0.299f);
FloatVector wG = FloatVector.broadcast(SPECIES, 0.587f);
FloatVector wB = FloatVector.broadcast(SPECIES, 0.114f);

for (int i = 0; i < bound; i += SPECIES.length()) {
    FloatVector vr = FloatVector.fromArray(SPECIES, r, i);
    FloatVector vg = FloatVector.fromArray(SPECIES, g, i);
    FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
    // gray = 0.299*R + 0.587*G + 0.114*B — all lanes in parallel
    vr.mul(wR).add(vg.mul(wG)).add(vb.mul(wB)).intoArray(gray, i);
}
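As a sanity check, a scalar version of the same luma formula (a hypothetical helper, not from the demo) should agree with the SIMD output lane for lane:

```java
// Scalar reference for the ITU-R BT.601 luma formula. The three weights
// sum to 1.0, so a gray input (r == g == b) maps to itself.
public class GrayRef {
    public static float luma(float r, float g, float b) {
        return 0.299f * r + 0.587f * g + 0.114f * b;
    }

    public static void main(String[] args) {
        System.out.println(luma(255f, 255f, 255f)); // pure white stays (near) 255
    }
}
```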

Hamming Distance (Bioinformatics / Error Detection)

Compare byte sequences (e.g., DNA bases) lane by lane and count differences using SIMD masks:

static final VectorSpecies<Byte> B_SPECIES = ByteVector.SPECIES_PREFERRED;

int distance = 0;
for (int i = 0; i < bound; i += B_SPECIES.length()) {
    ByteVector v1 = ByteVector.fromArray(B_SPECIES, seq1, i);
    ByteVector v2 = ByteVector.fromArray(B_SPECIES, seq2, i);
    // NE mask: true in every lane where sequences differ
    VectorMask<Byte> diff = v1.compare(VectorOperators.NE, v2);
    distance += diff.trueCount();
}
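The mask-based count is equivalent to this scalar loop (a hypothetical helper), which makes a convenient cross-check:

```java
// Scalar reference for Hamming distance over byte sequences: count the
// positions where the two arrays differ (what the NE mask's trueCount()
// accumulates per vector).
public class HammingRef {
    public static int distance(byte[] s1, byte[] s2) {
        int d = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] != s2[i]) d++;
        }
        return d;
    }

    public static void main(String[] args) {
        System.out.println(distance("ACGT".getBytes(), "ACGA".getBytes())); // 1
    }
}
```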

Other Use Cases in the Demo

  • Simple moving average — financial time-series and IoT sensor data using SIMD window summation
  • Min-max normalization — ML feature scaling with (x − min) / (max − min) in bulk
  • Euclidean distance — geospatial nearest-neighbor search and kNN classification
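As one example from the list, min-max normalization is two SIMD reductions (MIN and MAX) followed by one element-wise pass; this scalar sketch (a hypothetical helper, assuming a non-empty array with max > min) shows the math the vectorized version must reproduce:

```java
// Min-max feature scaling: x' = (x - min) / (max - min). With the Vector
// API this becomes two reduceLanes passes (MIN, MAX) plus one element-wise
// sub/div loop over the data.
public class MinMaxSketch {
    public static float[] normalize(float[] x) {
        float min = x[0], max = x[0];
        for (float v : x) { min = Math.min(min, v); max = Math.max(max, v); }
        float range = max - min;
        float[] out = new float[x.length];
        for (int i = 0; i < x.length; i++) {
            out[i] = (x[i] - min) / range;
        }
        return out;
    }

    public static void main(String[] args) {
        float[] n = normalize(new float[]{0f, 5f, 10f});
        System.out.println(n[0] + " " + n[1] + " " + n[2]); // 0.0 0.5 1.0
    }
}
```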

Best For

  • Scientific computing (matrix ops, physics simulations)
  • Image and audio processing (pixel operations, FFT, gain)
  • Machine learning inference (dot products, activations, embeddings)
  • Financial calculations (bulk pricing, risk analysis, time-series)
  • Compression and hashing algorithms

Running the Demo

mvn compile exec:exec \
  -Dexec.mainClass=org.example.incubator.VectorApiDemo

Check out the full source on GitHub.