Let us count the problems…
Data I/O - we have to keep the beast fed. This will be as big a job as the compute, and will probably need its own PE. I haven’t thought a lot about this, other than having dual-port RAM or something. If the data is read out the same way as the PEs read it in, it will also need to be turned through 90 degrees - you read in 32 columns, and then you have rows of data to be read out.
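To make the 90 degree turn concrete, here is a rough Python sketch of a corner turn on a 32 x 32 block of bits. The name `corner_turn`, the width of 32 and the list-of-integers layout are my assumptions for illustration, not a description of any real hardware.

```python
WIDTH = 32  # assumed: 32 PEs and 32-bit words


def corner_turn(words, width=WIDTH):
    """Transpose a width x width bit matrix held as a list of integers.

    words[i] is the word belonging to PE i. The result is a list of
    bit planes: bit i of planes[b] is set when bit b of words[i] is set.
    """
    planes = [0] * width
    for i, word in enumerate(words):
        for b in range(width):
            if (word >> b) & 1:
                planes[b] |= 1 << i
    return planes
```

Turning the block twice gives back the original data, which is a handy sanity check for whatever ends up doing this job.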
Communication. It will need a butterfly network or something, working via TOS. You could shift TOS up or down by one, two or four places (up, down, 2up, 2down, 4up, 4down) - or use some other tree network with its own storage and PE.
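As a sketch of why power-of-two shifts might be enough, the Python below moves data any distance along the array using only shifts of 1, 2, 4 and so on. It models the array as a list with one TOS value per PE. The names `shift` and `shift_by`, and the choice to fill vacated PEs with zero rather than wrap around, are assumptions on my part.

```python
def shift(tos, k, fill=0):
    """Move every PE's value k places along the array.

    Positive k moves values 'up' (to higher PE numbers), negative k
    moves them 'down'. Values shifted off the end are lost, and
    vacated PEs receive `fill`.
    """
    n = len(tos)
    out = [fill] * n
    for i, value in enumerate(tos):
        j = i + k
        if 0 <= j < n:
            out[j] = value
    return out


def shift_by(tos, distance):
    """Shift by any distance using only power-of-two shift steps."""
    sign = 1 if distance >= 0 else -1
    remaining = abs(distance)
    step = 1
    while remaining:
        if remaining & 1:
            tos = shift(tos, sign * step)
        remaining >>= 1
        step <<= 1
    return tos
```

A shift of 3 becomes a shift of 1 followed by a shift of 2, so with N PEs no move needs more than about log2(N) steps.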
Reduction. Perhaps a test for TOS == 00000…00000, and a propagate down to the next set bit (using NOS). Count the bits set? This looks expensive, and has latency, so it will also need its own PE.
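Here is a rough Python sketch of the three reductions, treating TOS as a 32-bit integer with one bit per PE. That layout is an assumption, as are the function names, and `next_set_bit_down` is only one possible reading of "propagate down to next set bit". The bit count uses the well-known pairwise-sum trick, which shows the latency: five dependent steps for 32 bits.

```python
WIDTH = 32                # assumed: one bit of TOS per PE
MASK = (1 << WIDTH) - 1


def is_zero(tos):
    """True when no PE has its bit set."""
    return (tos & MASK) == 0


def next_set_bit_down(tos, start):
    """Index of the nearest set bit at or below `start`, or -1 if none."""
    below = tos & MASK & ((1 << (start + 1)) - 1)
    return below.bit_length() - 1


def count_bits(tos):
    """Count set bits with a tree of pairwise sums (log2(32) = 5 steps)."""
    x = tos & MASK
    x = (x & 0x55555555) + ((x >> 1) & 0x55555555)   # sums of 2 bits
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333)   # sums of 4 bits
    x = (x & 0x0F0F0F0F) + ((x >> 4) & 0x0F0F0F0F)   # sums of 8 bits
    x = (x & 0x00FF00FF) + ((x >> 8) & 0x00FF00FF)   # sums of 16 bits
    x = (x & 0x0000FFFF) + (x >> 16)                 # final sum
    return x
```

Each step of `count_bits` depends on the one before it, which is the latency problem in miniature.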
Algorithms. What types of problems lend themselves to this SIMD processing? I think here we can wave at AI and Neural Networks, and say that if we build the processor, they will come.