I blogged a while back about the basic engine, so let's start there :-

Arithmetic Processing using Associative memory

Bit-serial engine

The SIMD engine hangs off all RAM blocks in the FPGA - of which there are 50, arranged as 1k x 36 - a possible total of 1800 bits wide. However, about four are needed elsewhere.

The SIMD engine has simple bit-serial capabilities - and, or, xor, adc, not. It has a stack, so it can also be thought of as a Forth machine, and is programmed in the same file as the sequencer.

You can individually enable the serial CPUs. Each has a stack, a top-of-stack where all the action happens, and RAM in/out, with the address generated by the sequencer.

I will call the stack elements TopOfStack and NextOnStack (TOS, NOS).

The columns of the RAM array are addressed by three 16-bit address registers, X, Y and Z - selectable in the instruction. You could, for instance, be adding two 16-bit numbers, serially addressed by X and Y, and putting the result in Z.
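To make the X + Y = Z example concrete, here is a rough Python model of it. It is only a sketch under my own assumptions: each address selects one row holding a single bit per processor, numbers are stored least significant bit first, and the names `store`, `load` and `bit_serial_add` are made up for illustration rather than taken from the real design.

```python
def store(ram, pe, addr, value, width):
    """Write `value` LSB-first into one processor's bit column, starting at `addr`."""
    for i in range(width):
        ram[addr + i][pe] = (value >> i) & 1


def load(ram, pe, addr, width):
    """Read a `width`-bit number back out of one processor's bit column."""
    return sum(ram[addr + i][pe] << i for i in range(width))


def bit_serial_add(ram, x, y, z, width):
    """Add the numbers at X and Y into Z, one bit per step, on every processor at once."""
    n_pes = len(ram[0])
    carry = [0] * n_pes                      # one carry flag per processor
    for i in range(width):                   # the sequencer steps X, Y and Z together
        row_x, row_y, row_z = ram[x + i], ram[y + i], ram[z + i]
        for pe in range(n_pes):              # in hardware this loop is the parallel part
            a, b, c = row_x[pe], row_y[pe], carry[pe]
            row_z[pe] = a ^ b ^ c            # sum bit
            carry[pe] = (a & b) | (c & (a ^ b))  # carry into the next bit
    return carry                             # final carry-out of each processor


N_PES, WIDTH = 4, 16
X, Y, Z = 0, 16, 32                          # hypothetical start addresses
ram = [[0] * N_PES for _ in range(3 * WIDTH)]

operands = [(1, 2), (1000, 2345), (65535, 1), (12345, 54321)]
for pe, (a, b) in enumerate(operands):
    store(ram, pe, X, a, WIDTH)
    store(ram, pe, Y, b, WIDTH)

carry_out = bit_serial_add(ram, X, Y, Z, WIDTH)
sums = [load(ram, pe, Z, WIDTH) for pe in range(N_PES)]
```

The inner loop over `pe` is what the hardware does in a single step: every processor sees the same addresses and the same operation, just different data.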

Instructions (broadcast to all processing elements, or PEs) are just a Karnaugh map of all inputs :-

  1. an input from RAM address X, Y or Z
  2. TOS
  3. NOS
  4. a constant, 0 or 1

and can write the result to 2 places - TOS and address Z, and probably also the PE enable. In addition to that there is X,Y,Z increment, and stack push/pop.
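One way to picture an instruction that is "just a Karnaugh map" is as a small lookup table. The sketch below is my own simplification, not the actual encoding: it assumes three one-bit inputs (the RAM bit, TOS and NOS) packed into an 8-entry table, with the constants 0 and 1 falling out as the all-zeros and all-ones tables. `make_lut` and `apply_lut` are hypothetical helper names.

```python
def make_lut(fn):
    """Build an 8-bit truth table from a function of (ram, tos, nos)."""
    lut = 0
    for idx in range(8):
        ram, tos, nos = (idx >> 2) & 1, (idx >> 1) & 1, idx & 1
        lut |= (fn(ram, tos, nos) & 1) << idx
    return lut


def apply_lut(lut, ram, tos, nos):
    """Evaluate the instruction: the three inputs select one bit of the table."""
    return (lut >> ((ram << 2) | (tos << 1) | nos)) & 1


# A few instructions expressed as truth tables.
SUM = make_lut(lambda r, t, n: r ^ t ^ n)                      # full-adder sum bit
CARRY = make_lut(lambda r, t, n: (r & t) | (r & n) | (t & n))  # full-adder carry bit
AND = make_lut(lambda r, t, n: t & n)                          # TOS and NOS, RAM ignored
ONE = make_lut(lambda r, t, n: 1)                              # the constant 1
```

Under these assumptions a whole logic operation fits in one byte of the instruction word, and the adder's sum and carry are just two more table values.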

So - a super simple PE - basically enough to do bit-serial addition (about the most complicated thing it will do) - but also logic operations.

The point is that the data is processed in situ - not pulled across a bus one at a time and stored back.

An example of a dedicated RAM chip

A 4 Gbit RAM chip today is 64k squared - the array itself is 64k rows by 64k columns. Let us modify this: on one side of the array, by the read amplifiers, we put these serial processors.

If we take a clock speed of 1 GHz for easy math, addition (X+Y = Z) would take about 3 clocks per bit (read X, read Y, write Z). So - 32-bit additions at about 10 MHz. A multiply? 32 of those, so about 300 kHz.

We can do 300k multiplies per second, on 64k fields simultaneously - about 20 Gig multiplies / second, on a single chip. Obviously, you can multiply that up by the number of chips you have.
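For anyone who wants to check the back-of-envelope numbers, here they are as a few lines of Python. The constants are the same assumptions as in the text (1 GHz clock, 3 clocks per bit, 32-bit words, a multiply costing 32 additions, 64k columns); the variable names are mine.

```python
CLOCK_HZ = 1_000_000_000   # 1 GHz, chosen for easy math
CLOCKS_PER_BIT = 3         # read X, read Y, write Z
WORD_BITS = 32
COLUMNS = 64 * 1024        # one serial processor per column

# One 32-bit addition takes 3 * 32 = 96 clocks.
adds_per_sec = CLOCK_HZ / (CLOCKS_PER_BIT * WORD_BITS)

# A multiply is treated as 32 additions.
muls_per_sec = adds_per_sec / WORD_BITS

# Every column works at the same time.
chip_muls_per_sec = muls_per_sec * COLUMNS

print(f"adds per second per column:      {adds_per_sec:,.0f}")
print(f"multiplies per second per column: {muls_per_sec:,.0f}")
print(f"multiplies per second per chip:   {chip_muls_per_sec:,.0f}")
```

Without the rounding along the way, this comes out at roughly 10.4 million additions and 326 thousand multiplies per second per column, or about 21 Gig multiplies per second per chip, which agrees with the rounded figures above.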

Put a heatsink on the chips as well.

So - that is the fish on the end of the line. Distributed data processing, where the data does not travel across the bus to a CPU and back.