Lazy Parallel Processing and Visualization of Large Data with ImgLib2, BigDataViewer, the N5-API, and Spark

Setting up …

As usual, we start by importing dependencies and registering RandomAcccessibleInterval notebook renderer.

Code

%mavenRepo scijava.public https://maven.scijava.org/content/groups/public
%maven org.scijava:scijava-common:2.99.0
%maven net.imglib2:imglib2-algorithm:0.17.1-SNAPSHOT
%maven net.imglib2:imglib2:7.1.2
%maven net.imglib2:imglib2-ij:2.0.3
%maven sc.fiji:bigdataviewer-core:10.6.2
%maven org.janelia.saalfeldlab:n5:3.3.0

Code

import io.github.spencerpark.jupyter.kernel.display.common.*;
import io.github.spencerpark.jupyter.kernel.display.mime.*;
import net.imglib2.img.display.imagej.ImageJFunctions;
import net.imglib2.RandomAccessibleInterval;

getKernelInstance().getRenderer().createRegistration(RandomAccessibleInterval.class)
    .preferring(MIMEType.IMAGE_PNG)
    .supporting(MIMEType.IMAGE_JPEG, MIMEType.IMAGE_GIF)
    .register((rai, context) -> Image.renderImage(
        ImageJFunctions.wrap(rai, rai.toString()).getBufferedImage(),
        context));

Introduction

The ImgLib2 “blocks” API is about performing computations on blocks of (image) data more efficiently than going pixel-by-pixel using RandomAccess, Type, etc.

A fundamental limitation of this framework is that it only works with NativeType, and (so far) only with those that map 1:1 to primitives. For example, UnsignedByteType works, but ComplexDoubleType does not.

Idea

The idea is simple:

Efficiently extract a region from a RandomAccessible<T> into a flat primitive array.
Implement small, efficient functions that compute the output flat primitive array from the input flat primitive array. (Importantly, these functions don’t do anything “clever”: out-of-bounds extension is handled outside; if the input data requires padding, assume the input array is padded. No per-element type checks, etc. If the function should work on float[] and int[] inputs, we have two versions.)
Wrap and assemble the computed flat primitive arrays back into RandomAccessibleInterval.

We also provide some infrastructure for chaining intermediate ‘block’ operations.

Outline

There are two levels of API:

The low-level PrimitiveBlocks and BlockProcessor deal with primitive java arrays.
The high-level BlockSupplier and UnaryBlockOperator wrap this in a layer of syntactic sugar and type-safety.

We look into the low-level API first. You’ll never have to use it unless you want to implement a new “blocks” algorithm, but it is useful to illustrate whats happening.

PrimitiveBlocks

Load and show 2D image that we will to work with…

Code

import ij.IJ;
import net.imglib2.*;
import net.imglib2.util.*;
import net.imglib2.position.*;
import net.imglib2.img.array.*;
import net.imglib2.type.numeric.real.*;
import net.imglib2.type.numeric.integer.*;
import net.imglib2.converter.RealTypeConverters;
import net.imglib2.img.display.imagej.ImageJFunctions;

// var fn = "/Users/pietzsch/workspace/data/DrosophilaWing.tif";
// var fn = "https://mirror.imagej.net/ij/images/boats.gif";
var fn = "boats.gif";
RandomAccessibleInterval<UnsignedByteType> img = ImageJFunctions.wrap(IJ.openImage(fn));

display(img);

89b8d9d7-72b6-4998-a81a-0e71ffe5c055

Code

var crop = Intervals.createMinSize(350, 220, 64, 64);
var cropped = img.view().interval(crop).zeroMin();
display(cropped);

ee567bcf-71fa-4f9a-9df4-ea7c80a932e1

For any RandomAccessible<T> we can get a “block copier” using PrimitiveBlocks.of(...). As mentioned above, this only works with NativeType that map 1:1 to primitives.

So let’s get access to the blocks of our img:

Code

import net.imglib2.blocks.PrimitiveBlocks;

PrimitiveBlocks<UnsignedByteType> blocks = PrimitiveBlocks.of(img);

We can now extract data into a primitive array (byte[] for our UnsignedByteType image). We extract the top 4x4 pixels of the cropped area:

Code

int[] pos = {350, 220};
int[] size = {4, 4};
byte[] array = new byte[4 * 4];
blocks.copy(pos, array, size);

Arrays.toString(array)

[119, -76, -55, -50, 83, -119, -69, -51, 67, 86, -116, -58, 63, 72, 95, -81]

Why are there negative numbers? Because we use Java’s (signed) byte to represent UnsignedByteType values.

Let’s convert to IntType and extract a int[] block instead.

Code

RandomAccessibleInterval<IntType> converted = RealTypeConverters.convert(img, new IntType());

PrimitiveBlocks<IntType> blocks = PrimitiveBlocks.of(converted);
int[] array = new int[4 * 4];
blocks.copy(pos, array, size);

Arrays.toString(array)

[119, 180, 201, 206, 83, 137, 187, 205, 67, 86, 140, 198, 63, 72, 95, 175]

Note that PrimitiveBlocks.of(...) understood the virtual type conversion.

It also understands many virtual coordinate transforms.

Code

var croppedAndConverted = img.view()
    .interval(crop)
    .zeroMin()
    .convert(IntType::new, (b,i) -> i.set(b.get()));

PrimitiveBlocks<IntType> blocks = PrimitiveBlocks.of(croppedAndConverted);
int[] pos = {0, 0};
int[] array = new int[4 * 4];
blocks.copy(pos, array, size);

Arrays.toString(array)

[119, 180, 201, 206, 83, 137, 187, 205, 67, 86, 140, 198, 63, 72, 95, 175]

(Note that we copy starting from (0, 0) now, as the shift to (350, 220) happens in the View transform.)

When PrimitiveBlocks.of(...) “understands” a View construction that ultimately end in CellImg, ArrayImg, etc., it and will create an optimized copier: Instead of using RandomAccess that checks for every pixel whether it enters a new Cell, whether it is out-of-bounds, etc., all these checks are precomputed and then relevant data from each Cell is copied in one go.

Details, details…

The speedup can be dramatic, in particular if the underlying source data is in a CellImg.
Here is for example results of CopyBenchmarkViewPrimitiveBlocks

# JMH version: 1.35
# VM version: JDK 17.0.3, OpenJDK 64-Bit Server VM, 17.0.3+7-LTS
...

Benchmark                                                  (oob)  (permute)  Mode  Cnt   Score   Error  Units
CopyBenchmarkViewPrimitiveBlocks.benchmarkLoopBuilder       true       true  avgt    5  12,789 ± 0,285  ms/op
CopyBenchmarkViewPrimitiveBlocks.benchmarkLoopBuilder       true      false  avgt    5   9,682 ± 0,152  ms/op
CopyBenchmarkViewPrimitiveBlocks.benchmarkLoopBuilder      false       true  avgt    5  14,333 ± 0,099  ms/op
CopyBenchmarkViewPrimitiveBlocks.benchmarkLoopBuilder      false      false  avgt    5  12,721 ± 0,123  ms/op
CopyBenchmarkViewPrimitiveBlocks.benchmarkPrimitiveBlocks   true       true  avgt    5   0,541 ± 0,010  ms/op
CopyBenchmarkViewPrimitiveBlocks.benchmarkPrimitiveBlocks   true      false  avgt    5   0,315 ± 0,024  ms/op
CopyBenchmarkViewPrimitiveBlocks.benchmarkPrimitiveBlocks  false       true  avgt    5   0,570 ± 0,013  ms/op
CopyBenchmarkViewPrimitiveBlocks.benchmarkPrimitiveBlocks  false      false  avgt    5   0,322 ± 0,008  ms/op

If a source RandomAccessible cannot be understood, PrimitiveBlocks.of(...) will return a fall-back implementation (based on LoopBuilder).
With the optional OnFallback argument of PrimitiveBlocks.of(...) it can be configured whether fall-back should be * silently accepted (ACCEPT), * a warning should be printed (WARN) – the default, * or an IllegalArgumentException thrown (FAIL). The warning/exception message explains why the source RandomAccessible requires fall-back.

Code

RandomAccessible<UnsignedByteType> func = new FunctionRandomAccessible<>(
        2,
        (pos, value) -> value.set((int) (pos.getIntPosition(0) * pos.getIntPosition(1))),
        UnsignedByteType::new);

PrimitiveBlocks<UnsignedByteType> blocks = PrimitiveBlocks.of(func);
int[] pos = {1, 1};
byte[] array = new byte[4 * 4];
blocks.copy(pos, array, size);

Arrays.toString(array)

The RandomAccessible net.imglib2.position.FunctionRandomAccessible@6e4e6a73 is only be supported through the fall-back implementation of PrimitiveBlocks. 
Cannot analyze view net.imglib2.position.FunctionRandomAccessible@6e4e6a73 of class FunctionRandomAccessible

[1, 2, 3, 4, 2, 4, 6, 8, 3, 6, 9, 12, 4, 8, 12, 16]

BlockProcessor

Once we can extract primitive arrays from RandomAccessibleInterval, the actual computation happens in BlockProcessor.

This happens in two steps: 1. Backward: Inverse-transform the desired target interval to determine which source data we need to provide to the BlockProcessor. 2. Forward: Push the source data through the BlockProcessor to compute the target data (both primitive arrays).

To try that, we’ll use a BlockProcessor for 2x downsampling. Normally we don’t use BlockProcessor directly, so the constructor of that one is not public. We need to extract it from a high-level operator (explained later):

Code

import net.imglib2.algorithm.blocks.*;
import net.imglib2.algorithm.blocks.downsample.Downsample;

BlockProcessor<byte[], byte[]> processor = Downsample
                .createOperator(new UnsignedByteType(), ComputationType.FLOAT, Downsample.Offset.HALF_PIXEL, 2)
                .blockProcessor();

1. Backward: Inverse-transform the desired target interval to determine which source data we need to provide to the `BlockProcessor`.

Now we can ask the processor which source interval we will need to produce a given target interval. (In this case, the source interval is 2x larger and appropriately shifted for downsampling).

Code

Interval interval = Intervals.createMinSize(200, 100, 64, 64);
System.out.println("target interval = " + Intervals.toString(interval));

processor.setTargetInterval(interval);
System.out.println("source interval = " + Intervals.toString(processor.getSourceInterval()));

target interval = [(200, 100) -- (263, 163) = 64x64]
source interval = [(400, 200) -- (527, 327) = 128x128]

2. Forward: Push the source data through the BlockProcessor to compute the target data (both primitive arrays).

We first need to provide some source data for processing. We get that via PrimitiveBlocks.copy(...), where we specify the source pos/size as the source interval obtained from processor. (We can also ask processor to allocate a primitive array for holding the input data.)

Code

byte[] sourceData = new byte[128 * 128]; // processor.getSourceBuffer();
PrimitiveBlocks.of(img).copy(processor.getSourcePos(), sourceData, processor.getSourceSize());

We allocate a primitive array to hold the result, then call BlockProcessor.compute(...).

Code

byte[] targetData = new byte[64 * 64];
processor.compute(sourceData, targetData);

Finally, we show the result wrapped as an ArrayImg, as well as the source area it was obtained from.

Code

var downsampled = ArrayImgs.unsignedBytes(targetData, 64, 64);
display(downsampled);
display(img.view().interval(processor.getSourceInterval()));

5858c8c3-5e67-431f-9c81-a9c59d335974

As you can imagine, it is straightforward to chain BlockProcessors, sending the target interval backwards through the chain, and the source data forwards via approriate temporary intermediate arrays. In fact, our BlockProcessor is already such a concatenation (convert byte[] to float[], downsample float[] to float[], convert float[] to byte[]).

Code

System.out.println(processor.getClass());

class net.imglib2.algorithm.blocks.ConcatenatedBlockProcessor

BlockSupplier and UnaryBlockOperator

With the details out of the way, let’s look at how this is wrapped into a more ImgLib2-like API.

The BlockSupplier interface is more or less equivalent to PrimitiveBlocks. The above code works fine if you replace PrimitiveBlocks with BlockSupplier.

BlockSupplier provides additional default methods for chaining operations, and support for wrapping into a CachedCellImg, etc. This is not available in imglib2 core, where PrimitiveBlocks lives, therefore the duplication… (BlockSupplier lives in imglib2-algorithm.)

Code

BlockSupplier<UnsignedByteType> blocks = BlockSupplier.of(img);

The UnaryBlockOperator<S,T> interface wraps BlockProcessor<I,O>. The generic parameters <S,T> are the ImgLib2 Types corresponding to the primitive array types <I,O>.

Here is the downsampling operator we already used above:

Code

UnaryBlockOperator<UnsignedByteType, UnsignedByteType> operator =
        Downsample.createOperator(new UnsignedByteType(), ComputationType.FLOAT, Downsample.Offset.HALF_PIXEL, 2);

Chaining operators

UnaryBlockOperator knows the dimensionality of its source and target spaces as well as its source and target Types. So does BlockSupplier.

Code

System.out.println("blocks.numDimensions   = " + blocks.numDimensions());
System.out.println("blocks.type            = " + blocks.getType().getClass().getSimpleName());
System.out.println("op.numSourceDimensions = " + operator.numSourceDimensions());
System.out.println("op.sourceType          = " + operator.getSourceType().getClass().getSimpleName());
System.out.println("op.numTargetDimensions = " + operator.numTargetDimensions());
System.out.println("op.targetType          = " + operator.getTargetType().getClass().getSimpleName());

blocks.numDimensions   = 2
blocks.type            = UnsignedByteType
op.numSourceDimensions = 2
op.sourceType          = UnsignedByteType
op.numTargetDimensions = 2
op.targetType          = UnsignedByteType

This allows for type-safe (at compile-time) and “dimensionality-safe” (at runtime) concatenation.

Concatenating two UnaryBlockOperators yields a new UnaryBlockOperator.
Concatenating a BlockSupplier and a UnaryBlockOperators yields a new BlockSupplier.

Code

BlockSupplier<UnsignedByteType> blocks = BlockSupplier.of(img);
UnaryBlockOperator<UnsignedByteType, UnsignedByteType> operator =
        Downsample.createOperator(new UnsignedByteType(), ComputationType.FLOAT, Downsample.Offset.HALF_PIXEL, 2);

BlockSupplier<UnsignedByteType> concatenation = blocks.andThen(operator);

Note, that creating the operator we had to specify source type (new UnsignedByteType() and number of source dimensions (2). These must match those of blocks, so we could also directly get them from the thing we are concatenating to.

To avoid this duplication, BlockSupplier.andThen(...) also accepts operator factory functions.

Code

BlockSupplier<UnsignedByteType> blocks = BlockSupplier.of(img);
long[] size = img.dimensionsAsLongArray();

Try to run the following cell multiple times…

Code

blocks = blocks.andThen(
    Downsample.downsample(ComputationType.FLOAT, Downsample.Offset.HALF_PIXEL));
size = Downsample.getDownsampledDimensions(size);

byte[] data = new byte[(int)(size[0] * size[1])];
blocks.copy(new long[] {0,0}, data, Util.long2int(size));
display(ArrayImgs.unsignedBytes(data, size));

f37d4e03-cfa7-4716-8af5-cc44b5682821

Wrap as CellImg

Code

RandomAccessibleInterval<UnsignedByteType> cellImg = blocks.toCellImg(size, 64);
display(cellImg);

8ea39058-2206-4aca-a995-7779e2925ad7

Process in Tiles

BlockSupplier.tile(...) splits the computation of a requested target interval into computations of sub-intervals of a specified (maximum) size. The results are stored into temporary buffers and then copied into the target primitive array.

Applications for this are: * Computing large outputs (e. g. to write to N5 or wrap as ArrayImg) with operators that have better performance with smaller block sizes. * Avoiding excessively large blocks (e. g. when chaining multiple downsampling operators).

Code

blocks = blocks.tile(16);
blocks.copy(new long[] {0,0}, data, Util.long2int(size));
display(ArrayImgs.unsignedBytes(data, size));

4ff02cba-3d98-47a9-b848-a8bbd1a0c774

Advanced examples

Works with VolatileViews

Load and show a bigger 3D image (834x388x357).

Code

import net.imglib2.algorithm.blocks.convert.Convert;
import net.imglib2.algorithm.blocks.convolve.Convolve;
import net.imglib2.algorithm.blocks.downsample.Downsample;
import net.imglib2.algorithm.blocks.transform.Transform;
import static net.imglib2.view.fluent.RandomAccessibleIntervalView.Extension.*;
import bdv.util.*;
import bdv.viewer.DisplayMode;

String fn = "e002_stack_fused-8bit.tif";
RandomAccessibleInterval<UnsignedByteType> img = ImageJFunctions.wrap(IJ.openImage(fn));
long[] size = img.dimensionsAsLongArray();

Bdv bdv = BdvFunctions.show(img, fn);
bdv.getBdvHandle().getViewerPanel().setDisplayMode(DisplayMode.SINGLE);

Extend the image, make a BlockSupplier, convert to FloatType, apply Gaussian smoothing, and cache in a CellImg.
For showing it in BigDataViewer, we wrap it VolatileViews.wrapAsVolatile(...) for asynchronous (lazy) loading, so that we can see it working.

Code

import net.imglib2.type.numeric.ARGBType;
import bdv.util.volatiles.VolatileViews;
import bdv.cache.SharedQueue;

RandomAccessibleInterval<FloatType> downsampledAndSmoothed = BlockSupplier
        .of(img.view().extend(mirrorDouble()))
        .andThen(Convert.convert(new FloatType()))
        .andThen(Convolve.gauss(3))
        .toCellImg(size, 32);

BdvSource source = BdvFunctions.show(
        VolatileViews.wrapAsVolatile(downsampledAndSmoothed, new SharedQueue(8)),
        "destImg",
        Bdv.options().addTo(bdv));

source.setColor(new ARGBType(0x00ff00));
source.setDisplayRange(0, 255);

Influence of tiling on memory and performance

Code

import net.imglib2.realtransform.AffineTransform3D;

AffineTransform3D affine = new AffineTransform3D();
affine.rotate(2, 0.3);
affine.rotate(1, 0.1);
affine.rotate(0, 0.6);
affine.scale(0.9);

long[] min = {0, 0, 0};
long[] size = {512, 512, 1};
byte[] dest = new byte[(int) Intervals.numElements(size)];

BlockSupplier<UnsignedByteType> blocks = BlockSupplier
        .of(img.view().extend(mirrorDouble()))
        .andThen(Convert.convert(new FloatType()))
        .andThen(Downsample.downsample())
        .andThen(Transform.affine(affine, Transform.Interpolation.NLINEAR))
        //.tile(32)
        .andThen(Convert.convert(new UnsignedByteType()));

blocks.copy(min, dest, Util.long2int(size));
display(ArrayImgs.unsignedBytes(dest, size));

EvalException: Java heap space

What is a good tile size? Not too small, not too big…
Let’s look at performance. Play around a bit with which operations to chain.

Code

public static long timeIt(int numIterations, Runnable r) {
    long start = System.currentTimeMillis();
    for (int i = 0; i < numIterations; i++) {
        r.run();
    }
    return (System.currentTimeMillis() - start) / numIterations;
}

BlockSupplier<UnsignedByteType> blocks = BlockSupplier
        .of(img.view().extend(mirrorDouble()))
        .andThen(Convert.convert(new FloatType()))
        .andThen(Downsample.downsample())
        .andThen(Transform.affine(affine, Transform.Interpolation.NLINEAR))
        .andThen(Convert.convert(new UnsignedByteType()));

int[] tileSizes = {4, 8, 16, 32};

for (int j = 0; j < 3; j++) {
    for (int tileSize : tileSizes) {
        BlockSupplier<UnsignedByteType> tiledBlocks = blocks.tile(tileSize);
        long millis = timeIt(5, () -> tiledBlocks.copy(min, dest, Util.long2int(size)));
        System.out.println("tile size = " + tileSize + ": " + millis + " ms");
    }
    System.out.println();
}

tile size = 4: 123 ms
tile size = 8: 98 ms
tile size = 16: 111 ms
tile size = 32: 167 ms

tile size = 4: 117 ms
tile size = 8: 102 ms
tile size = 16: 109 ms
tile size = 32: 174 ms

tile size = 4: 123 ms
tile size = 8: 95 ms
tile size = 16: 125 ms
tile size = 32: 180 ms

Setting up …

Introduction

Idea

Outline

PrimitiveBlocks

Details, details…

BlockProcessor

1. Backward: Inverse-transform the desired target interval to determine which source data we need to provide to the BlockProcessor.

2. Forward: Push the source data through the BlockProcessor to compute the target data (both primitive arrays).

BlockSupplier and UnaryBlockOperator

Chaining operators

Wrap as CellImg

Process in Tiles

Advanced examples

Works with VolatileViews

Influence of tiling on memory and performance

1. Backward: Inverse-transform the desired target interval to determine which source data we need to provide to the `BlockProcessor`.