OK, here are some alternative ideas that might help:
1) Try looking at transactions and master/slave style channel usage for the data transfer. This approach would break the processing into chunks.
2) Can the hardware data ports be split across more than one core? Would that help?
3) Could you reduce the number of worker threads but have them operate leaner and faster, perhaps combining this with some sort of chunk pipelining using (1)?
4) Could you use a streaming handler on each core that acts as a data switch/buffer to the other threads, perhaps creating a FIFO shared with the other threads on that core (same memory map)?
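
To make the shared FIFO idea a bit more concrete, here is a rough sketch in plain C11 of a single-producer/single-consumer ring buffer that two threads on the same core could share. It is only an illustration under assumptions: the names (`fifo_t`, `fifo_push`, `fifo_pop`, `FIFO_CAPACITY`) are made up, it assumes exactly one writer thread (the handler) and one reader thread (a worker) per FIFO, and it assumes your toolchain supports `<stdatomic.h>`. It is not tied to any particular channel or port API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capacity; must be a power of two so indices wrap with a mask. */
#define FIFO_CAPACITY 8u
static_assert((FIFO_CAPACITY & (FIFO_CAPACITY - 1u)) == 0u,
              "FIFO_CAPACITY must be a power of two");

typedef struct {
    uint32_t slots[FIFO_CAPACITY];
    atomic_uint head; /* count of words written; only the producer stores to it */
    atomic_uint tail; /* count of words read; only the consumer stores to it */
} fifo_t;

static void fifo_init(fifo_t *f)
{
    atomic_init(&f->head, 0u);
    atomic_init(&f->tail, 0u);
}

/* Producer side (e.g. the handler thread). Returns false if the FIFO is full. */
static bool fifo_push(fifo_t *f, uint32_t word)
{
    unsigned head = atomic_load_explicit(&f->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&f->tail, memory_order_acquire);
    if (head - tail == FIFO_CAPACITY) {
        return false; /* full: caller decides whether to retry or drop */
    }
    f->slots[head & (FIFO_CAPACITY - 1u)] = word;
    /* Release so the consumer sees the slot contents before the new head. */
    atomic_store_explicit(&f->head, head + 1u, memory_order_release);
    return true;
}

/* Consumer side (e.g. a worker thread). Returns false if the FIFO is empty. */
static bool fifo_pop(fifo_t *f, uint32_t *out)
{
    unsigned tail = atomic_load_explicit(&f->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&f->head, memory_order_acquire);
    if (head == tail) {
        return false; /* empty */
    }
    *out = f->slots[tail & (FIFO_CAPACITY - 1u)];
    /* Release so the producer only reuses the slot after it has been read. */
    atomic_store_explicit(&f->tail, tail + 1u, memory_order_release);
    return true;
}
```

The handler thread would call `fifo_push` for each word it receives, and a worker would poll `fifo_pop`. Because each index is written by only one side, no lock is needed, but that only holds for one writer and one reader per FIFO. With several workers you would want one FIFO per worker, or some other arbitration.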