complex_s16_t FFT on XU316

Technical questions regarding the XTC tools and programming with XMOS.
alex4d1
Member++
Posts: 18
Joined: Mon May 13, 2024 11:38 pm

complex_s16_t FFT on XU316

Post by alex4d1 »

Hello everyone,

We are working on a project based on the XU316,
and I’m looking for some guidance on how to move forward.

The project requires us to run lots of very long (8192) complex FFTs.
Our plan was to have many instances of the same task running in parallel, one on each core of each tile. These tasks run the complex FFTs.

As each of the 2 tiles only has 512KB of RAM, and each task instance needs its own “working buffer” the FFTs can run on (in-place) we quickly run out of memory (one 8K complex FFT working buffer is 65KB in size).
One way around this is to attach an external SDRAM, but this only extends the RAM of one of the two tiles, the second tile still has the memory bottleneck. We also looked into multi device systems (XLINK), but it rather seems like a sledge hammer approach. Both options don’t really meet our BOM cost/power budget/board real estate target.

So we decided to look into half precision FFTs (complex_s16_t). We quickly wrote a complex_s16_t FFT function (C code) based on the complex_s32_t variant (C code) to confirm performance would be still appropriate.

To really speed this up we realized we’d need to make use of the vector unit. So we started looking into writing our own complex_s16_t FFT assembly code. Unfortunately, most vector unit assembly instructions are only available for the complex_s32_t type.

We thought we could try to create temporary arrays of 4 to convert between complex_s16_t and complex_s32_t before and after using the same complex_s32_t vector assembly instructions as in the complex_s32_t FFT functions. Knowing that there will be a performance penalty we decided to give it a go.

We were making quick progress, storing the complex_s16_t data in complex_s32_t and running “vftff” on it, however we ran into a wall when trying to do the same with “vladsb”.

Our assembly programming skills are limited, so before we sink more days into this I’d like to get some feedback on whether this endeavour is even feasible.
Any feedback/pointers on how to achieve this are much appreciated.

Also, if anyone has a different idea of overcoming our memory bottleneck please let me know.

Thanks
Alex
User avatar
andrew
Verified
Experienced Member
Posts: 117
Joined: Fri Dec 11, 2009 10:22 am

Post by andrew »

I estimate you can get 12 complex_s16_t 8k arrays on one tile leaving 64kB as a scratch for performing a complex_s32_t 8kB FFT. This would leave 64kB for code on that tile. If you chose to write an in-place complex_s16_t FFT (which would be fairly slow by comparison) you would only get 14 complex_s16_t 8k arrays which seems a fairly small gain.

Are the FFTs related to one another? i.e. are there any other DSP tricks one could play?
alex4d1
Member++
Posts: 18
Joined: Mon May 13, 2024 11:38 pm

Post by alex4d1 »

Hi Andrew,

You suggestion of using 1 complex_s32_t as a temporary buffer does make sense, but it would limit us to run one FFT at a time and tile.
We want to run FFTs in parallel, one on each core.

Unfortunately the FFTs are not related to each other.

You stated that in-place complex_s16_t FFT would be fairly slow by comparison. We are aware that using temporary 4 element buffers (in assembly) to convert to complex_s32_t before and back to complex_s16_t after will cause a time penalty,
but since we are using the same hardware vector unit function as the complex_s32_t FFT functions there shouldn't be anything else slowing us down?

Thanks
Alex
User avatar
andrew
Verified
Experienced Member
Posts: 117
Joined: Fri Dec 11, 2009 10:22 am

Post by andrew »

Have you looked into a "sliding FFT"? Would that work for your use case?
User avatar
andrew
Verified
Experienced Member
Posts: 117
Joined: Fri Dec 11, 2009 10:22 am

Post by andrew »

I've worked out how to add images now:
fft.png
The idea here is 3 x 16bit x 8k point buffers per two 8k point 16 bit FFTs.
FFTs performed at 32 bits as that is the fastest way of doing it.
You do not have the required permissions to view the files attached to this post.
alex4d1
Member++
Posts: 18
Joined: Mon May 13, 2024 11:38 pm

Post by alex4d1 »

Thanks Andrew for the suggestion. This would save some memory and would not be very difficult to implement.