To get the benefit of today’s computer hardware, parallel processing is a must. The hardware can do it - the problem is how programmers convey their wishes through a programming language.
I have spent 35 years in the area of parallel processing.
I started in the late 1970s with the massively-parallel, very low-level systems built around Associative Parallel Processing - intelligent memory.
This required micro-code level understanding, and specialist hardware, but delivered parallelism limited only by the number of chips you could bolt together - a SIMD (Single Instruction, Multiple Data) offering that died through lack of hardware, among other things.
The downside (there always is one) was a very limited problem domain, and the need for specialist-level understanding of both the problem and the hardware.
No programming language involved.
Coincidentally, the university was fortunate to have as a visiting lecturer the author of the Cray FORTRAN compiler - who on a memorable afternoon explained the philosophy behind it. The Cray at that time was a 64-way SIMD machine, and FORTRAN, a sequential language, had only the DO loop (its counterpart of the FOR loop) to express iteration. So what the Cray compiler did was to look carefully at each loop and decide whether the loop index allowed each array element in the loop to reside on a separate processor, thus parallelising the loop. Many constraints had to be observed by the programmer for this magic to happen.
In effect, the programmer wrote a sequential FORTRAN program, with hints to the compiler that it could be parallelised. The compiler looked at the loop, and if it could not spread it across the computer hardware, it could report why it failed.
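To make that kind of constraint concrete, here is a minimal sketch in Python (not FORTRAN, and not a description of how the Cray compiler was implemented). The two loops, `vector_add` and `running_sum`, are hypothetical examples. Each is run with its iterations in forward and in reverse order: if the order makes no difference, every iteration could in principle be handed to its own processor; if it does, the loop carries a dependence from one iteration to the next and has to stay sequential.

```python
def vector_add(a, b, order):
    """c[i] = a[i] + b[i]: each iteration touches only element i."""
    c = [0] * len(a)
    for i in order:
        c[i] = a[i] + b[i]
    return c


def running_sum(a, order):
    """s[i] = s[i-1] + a[i]: iteration i reads what iteration i-1 wrote."""
    s = list(a)
    for i in order:
        if i > 0:
            s[i] = s[i - 1] + a[i]
    return s


if __name__ == "__main__":
    n = 8
    a = list(range(n))
    b = [10 * i for i in range(n)]
    forward = list(range(n))
    backward = forward[::-1]

    # Independent iterations: any execution order gives the same answer.
    print(vector_add(a, b, forward) == vector_add(a, b, backward))

    # Loop-carried dependence: changing the order changes the answer.
    print(running_sum(a, forward) == running_sum(a, backward))
```

A vectorising compiler has to prove the first property from the loop index expressions alone, which is why the programmer had to keep those expressions simple.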
Believe it or not, not much has changed in 40 years.
My next foray into Parallel Processing was the Inmos Transputer - a microprocessor built to run a parallel programming language, occam. The hardware consisted of a fairly conventional stack-based execution engine, with a micro-coded scheduler that allowed very fast task-switching between the execution threads on a list maintained in micro-code.
It also had four serial links that allowed communication with other transputers, and the occam language allowed you to view all these computers as components of one parallel program.
Summary: capable hardware for the time, excellent opportunity to add performance by simply adding transputers.
The programming language was masterfully simple. Crucially, it allowed the programmer to explicitly express parallelism. In fact, it required you to express sequential execution explicitly as well - most programs started with the keyword SEQ - though a program on many transputers had to start with PAR - reflecting the parallelism of the underlying hardware.
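For readers who have never seen occam, the idea of spelling out both kinds of composition can be imitated in Python. This is only a rough sketch: the helpers `seq`, `par` and `note` are hypothetical names of my own, and Python threads stand in for occam processes. `seq` runs its steps one after another; `par` starts all its branches and waits until every one has finished, which is the behaviour of an occam PAR block.

```python
import threading


def seq(*steps):
    """Run the steps one after another, like an occam SEQ block."""
    for step in steps:
        step()


def par(*branches):
    """Run the branches concurrently and wait for all of them, like PAR."""
    threads = [threading.Thread(target=branch) for branch in branches]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


log = []
log_lock = threading.Lock()


def note(tag):
    """Return a step that records its tag in the shared log."""
    def step():
        with log_lock:
            log.append(tag)
    return step


# Sequential outer structure with one parallel section in the middle.
seq(
    note("setup"),
    lambda: par(note("left"), note("right")),
    note("teardown"),
)
print(log)
```

The order of "left" and "right" in the log is not fixed, but "setup" always comes first and "teardown" always last, because the parallel section only ends when both of its branches have ended.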
If some setup code had to be executed by one master transputer first, the others could be programmed to wait for a message sent by the master before continuing. As an example, a simple double-buffered computing process could be set up as follows:
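PROC compute(CHAN in, CHAN out):
  FLOAT dataX[1000], dataY[1000], dataZ[1000]:
  SEQ
    -- prologue: fill the first buffer, then start overlapping
    in ? dataX
    PAR
      compute(dataX)
      in ? dataY
    -- steady state: output, compute and input on three rotating buffers
    WHILE len > 0:
      SEQ
        PAR
          out ! dataX
          compute(dataY)
          in ? dataZ
        PAR
          out ! dataY
          compute(dataZ)
          in ? dataX
        PAR
          out ! dataZ
          compute(dataX)
          in ? dataY
    -- epilogue: drain the last two buffers
    PAR
      out ! dataX
      compute(dataY)
    out ! dataY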
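The same overlap of output, compute and input can be sketched in Python. This is an illustration under assumptions, not a translation of the occam: `par`, `run_pipeline`, `read_block`, `compute` and `write_block` are hypothetical names, blocks and results are assumed never to be `None` (which marks the end of the input), and the three named buffers are replaced by rebinding two variables on each pass. In standard CPython, threads will not speed up pure-Python arithmetic, so the sketch shows the structure rather than the performance.

```python
import threading


def par(*branches):
    """Run the branches concurrently and wait for all of them."""
    threads = [threading.Thread(target=branch) for branch in branches]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def run_pipeline(read_block, compute, write_block):
    """Overlap writing result k-1, computing block k and reading block k+1."""
    to_compute = read_block()   # prologue: fetch the first block
    to_send = None              # nothing computed yet
    while to_compute is not None or to_send is not None:
        slot = {"result": None, "next": None}

        def do_send(result=to_send):
            if result is not None:
                write_block(result)

        def do_compute(block=to_compute):
            if block is not None:
                slot["result"] = compute(block)

        def do_read(more=to_compute is not None):
            if more:  # stop reading once the input has run dry
                slot["next"] = read_block()

        par(do_send, do_compute, do_read)
        # Rotate the buffers for the next pass.
        to_send, to_compute = slot["result"], slot["next"]


if __name__ == "__main__":
    source = iter([[1, 2], [3, 4], [5, 6]])
    results = []
    run_pipeline(lambda: next(source, None),
                 lambda block: [2 * x for x in block],
                 results.append)
    print(results)
```

The last passes of the loop play the part of the occam epilogue: once `read_block` returns `None`, the remaining computed block is still sent before the loop ends.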