16 _megabyte_ Quad SPI RAM for your next project

Gravis
Experienced Member
Posts: 75
Joined: Thu Feb 02, 2012 3:32 pm

16 _megabyte_ Quad SPI RAM for your next project

Post by Gravis »

I figured some of you might also like to use some RAM in your next project without sacrificing all your pins just to do so. Here you go: 128 Mbit (16M x 8) Quad SPI @ 66MHz for about $6 (link for UK about £4)

I know I'll be using it for a double buffer in my PSP LCD project.

Yeah, you know you have reached full nerd level when you are excited about an IC. ;)
Last edited by Gravis on Sat Sep 01, 2012 11:34 am, edited 1 time in total.


yzoer
XCore Addict
Posts: 133
Joined: Tue Dec 15, 2009 10:23 pm

Post by yzoer »

*VERY* interesting. Thanks for posting that!

-Yvo
SpacedCowboy
Experienced Member
Posts: 67
Joined: Fri Aug 24, 2012 9:37 pm

Post by SpacedCowboy »

Gravis wrote:I figured some of you might also like to use some RAM in your next project without sacrificing all your pins just to do so. Here you go: 128 Mbit (16M x 8) Quad SPI @ 66MHz for about $6 (link for UK about £4)

I know I'll be using it for a double buffer in my PSP LCD project.

Yeah, you know you have reached full nerd level when you are excited about an IC. ;)
It's 50MHz at quad data-rate, giving you 200MHz effectively. Can the XMOS chips even do quad-data manipulation? I thought the maximum cycle speed for output on a pin was 100MHz (input being about 60MHz), and I'm not sure how you'd do the quad output per cycle. Possibly 50MHz at dual data-rate might be better...

Even then, though, it's a serial interface, so there's some small overhead in setting up a read (32 clocks) and then 8 clocks per byte read - so assuming you're reading large enough chunks you can effectively divide that clock rate by 8. If you can manage 50MHz dual data-rate, that gives you ~12.5 MB/sec, with one thread probably pretty much maxed out at 100MHz.

I can see it being useful for a lot of things given the memory size, but doesn't a video buffer need fairly high bandwidth? 480 x 272 x 60fps x 2 (read from buffer and write to buffer) x 2 (16-bit pixels) => ~31.3 MB/sec.
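
Quick back-of-the-envelope in plain C, just lining those two numbers up (the 50MHz dual data-rate figure is my assumption from above, and command overhead is ignored):

Code:
/* Back-of-the-envelope check of the figures above (assumes the quoted
 * clock rates; exact command overhead depends on the datasheet). */
#include <stdio.h>

int main(void)
{
    /* Dual data-rate on one data line: 50 MHz clock, 2 bits per clock. */
    double spi_bits_per_sec  = 50e6 * 2.0;
    double spi_bytes_per_sec = spi_bits_per_sec / 8.0;          /* ~12.5 MB/s */

    /* PSP LCD double buffer: 480x272, 60 fps, 16-bit pixels,
     * one read plus one write of every pixel per frame. */
    double lcd_bytes_per_sec = 480.0 * 272.0 * 60.0 * 2.0 * 2.0; /* ~31.3 MB/s */

    printf("SPI (50 MHz DDR): %.1f MB/s\n", spi_bytes_per_sec / 1e6);
    printf("LCD buffer needs: %.1f MB/s\n", lcd_bytes_per_sec / 1e6);
    return 0;
}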

I'm still looking to see if I can make a G4-512-BGA board that brings out most of the i/o - I can dedicate most of one core's pins to a local-to-the-chip parallel memory bus (looking at http://www.digikey.com/product-detail/e ... ND/1831376 at the moment) to give 100MHz x 2 => ~200MB/sec, and bring out the other three cores' ports to an external interface - probably a DIMM form factor.

I want to do real-time JPEG encoding of full-size incoming video streams, so I'm going to need 720 x 576 x 25fps x 2 (YUYV encoding) x 2 (read & write ops) => ~41.5 MB/sec per stream. I've found it's always useful to have some significant headroom over the "official" figures, because a single extra cycle per read really adds up when you're doing several million reads per second...

Simon.
yzoer
XCore Addict
Posts: 133
Joined: Tue Dec 15, 2009 10:23 pm

Post by yzoer »

As far as I can tell, the quad mode uses 4 data pins (page 7 of the datasheet), which would be more than adequate to get a basic 640x480 VGA display up and running, especially if you palettize the data.

Write access would be limited to the blanking intervals, which constrains the application somewhat. Having said that, you could do bursts and cache the data in a FIFO, depending on how much space you want to dedicate / have left over.
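
Something along these lines is what I mean by caching writes in a FIFO and flushing them during blanking - plain C sketch, qspi_write() is a hypothetical driver call and the sizes are arbitrary:

Code:
/* Minimal sketch of the write FIFO idea above: pixel writes are queued
 * during active video and flushed to the QSPI RAM during blanking.
 * qspi_write() is a hypothetical driver call; sizes are arbitrary. */
#include <stdint.h>

#define FIFO_ENTRIES 256

typedef struct {
    uint32_t addr;   /* target address in the QSPI RAM       */
    uint8_t  data;   /* byte to write (one palettized pixel) */
} write_req_t;

static write_req_t fifo[FIFO_ENTRIES];
static unsigned head, tail;

/* Called from the drawing code at any time. Returns 0 if the FIFO is full. */
int queue_write(uint32_t addr, uint8_t data)
{
    unsigned next = (head + 1) % FIFO_ENTRIES;
    if (next == tail)
        return 0;                 /* full: caller must retry later */
    fifo[head].addr = addr;
    fifo[head].data = data;
    head = next;
    return 1;
}

extern void qspi_write(uint32_t addr, const uint8_t *buf, unsigned len); /* hypothetical */

/* Called once per blanking interval: drain whatever has accumulated. */
void flush_writes_during_blanking(void)
{
    while (tail != head) {
        qspi_write(fifo[tail].addr, &fifo[tail].data, 1);
        tail = (tail + 1) % FIFO_ENTRIES;
    }
}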
SpacedCowboy
Experienced Member
Posts: 67
Joined: Fri Aug 24, 2012 9:37 pm

Post by SpacedCowboy »

yzoer wrote:As far as I can tell, the quad mode uses 4 data pins (page 7 of the datasheet), which would be more than adequate to get a basic 640x480 VGA display up and running, especially if you palettize the data.

Write access would be limited to the blanking intervals, which constrains the application somewhat. Having said that, you could do bursts and cache the data in a FIFO, depending on how much space you want to dedicate / have left over.
Wow! You're right. Teach me to read the manual :)

I guess if you palettize the colors you'd have less RAM bandwidth to worry about (you'd still have to transmit the 24-bit data to the display, but lookup tables are cool :), and you still have to run at 60Hz - that brings the RAM bandwidth required down to: 480 x 272 x 60fps x 2 (read/write from/to framestore) => ~15.7MB/sec for an 8-bit palette.

... whereas the RAM can offer 66MHz * 0.5 bytes/clock (4 bits per clock in quad mode), or ~33 MB/sec.

So yeah, I guess you really can have a frame buffer accessed over SPI - 640x480@60Hz is beyond the limit at ~37MB/sec though; you'd have to have a very small palette (4-bit, 6-bit?) or use multiple SPI chips :)

Writing is going to be ... interesting though, given the write sequence is [5 bytes][data, length=1..64]. You could maybe implement a tiled display engine, where you read and write tiles of 64 bytes at a time. That'd optimise for the architecture.
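
A rough sketch of what a tile write might look like, assuming the [5 bytes][1..64 data] framing above - the opcode and address layout are guesses (check the datasheet) and the qspi_* helpers are hypothetical:

Code:
/* Sketch of the 64-byte-tile write described above. The [5-byte header]
 * [1..64 data bytes] framing follows the post; the actual opcode and
 * address layout must come from the datasheet, so qspi_select(),
 * qspi_send() and qspi_deselect() are hypothetical driver calls. */
#include <stdint.h>

#define TILE_BYTES 64u
#define CMD_WRITE  0x02   /* hypothetical write opcode - check the datasheet */

extern void qspi_select(void);
extern void qspi_send(const uint8_t *buf, unsigned len);
extern void qspi_deselect(void);

void write_tile(uint32_t addr, const uint8_t tile[TILE_BYTES])
{
    uint8_t header[5] = {
        CMD_WRITE,
        (uint8_t)(addr >> 24),   /* 32-bit address: 4 bytes after the opcode */
        (uint8_t)(addr >> 16),
        (uint8_t)(addr >> 8),
        (uint8_t)addr,
    };
    qspi_select();
    qspi_send(header, sizeof header);
    qspi_send(tile, TILE_BYTES);   /* maximum burst per the post: 64 bytes */
    qspi_deselect();
}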


Um, hang on - I've just noticed this *isn't* SRAM, it's PCM. They only guarantee a million writes, and at 60fps that gives you less than five hours of frame buffer, assuming you update the frame buffer once per frame...

Simon.
segher
XCore Expert
Posts: 844
Joined: Sun Jul 11, 2010 1:31 am

Post by segher »

SpacedCowboy wrote:I want to do real-time JPEG encoding of full-size incoming video streams, so I'm going to need 720 x 576 x 25fps x 2 (YUYV encoding) x 2 (read & write ops) => ~41.5 MB/sec per stream.
Why would you buffer full frames? You don't need that for JPEG.

Assuming your data comes in in row-major order, you need to buffer eight such rows; the writer can do the YUYV downsampling for you, so you need 8*2*720 bytes buffer, or somewhat more for elasticity; heck, do twice that, full double buffering. That's only 22.5kB, so it will fit on a core just fine, with plenty of space left to do other stuff (buffer the output a bit, for example).
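
Roughly this, in C - the sizes follow the figures above, the names are just for illustration:

Code:
/* Sketch of the strip buffer described above: eight YUYV rows of a
 * 720-pixel-wide frame, double buffered so the capture side can fill
 * one strip while the encoder reads the other. */
#include <stdint.h>
#include <assert.h>

#define WIDTH        720
#define ROWS_PER_MCU 8    /* one row of 8x8 DCT blocks */
#define BYTES_PER_PX 2    /* YUYV 4:2:2 */

typedef struct {
    uint8_t rows[ROWS_PER_MCU][WIDTH * BYTES_PER_PX];
} strip_t;

static strip_t strips[2];  /* double buffered */

/* 2 * 8 * 720 * 2 = 23040 bytes, i.e. the 22.5kB mentioned above */
static_assert(sizeof strips == 23040, "strip buffer size");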
SpacedCowboy
Experienced Member
Posts: 67
Joined: Fri Aug 24, 2012 9:37 pm

Post by SpacedCowboy »

segher wrote:
SpacedCowboy wrote:I want to do real-time JPEG encoding of full-size incoming video streams, so I'm going to need 720 x 576 x 25fps x 2 (YUYV encoding) x 2 (read & write ops) => ~41.5 MB/sec per stream.
Why would you buffer full frames? You don't need that for JPEG.

Assuming your data comes in in row-major order, you need to buffer eight such rows; the writer can do the YUYV downsampling for you, so you need 8*2*720 bytes buffer, or somewhat more for elasticity; heck, do twice that, full double buffering. That's only 22.5kB, so it will fit on a core just fine, with plenty of space left to do other stuff (buffer the output a bit, for example).
SD video is transmitted in fields, not frames, so unless you buffer two fields to make a frame you're stuck with producing "motion jpeg" images. From what I can gather, different platforms have different ways to encode MJPEG and it was never a standardized format (Apple documents it one way in their QuickTime docs, Microsoft another in their AVI container docs).

So, buffer two fields into a frame and you can create a standard jpeg file that everyone can read.

Also, given that an xmos chip can't handle the workload of realtime full-size SD video compression, I'm either going to have multiple G4's or (more likely) an FPGA attached via an xlink or two to handle the heavy lifting (DCT and (probably) entropy coding). It's already going to be a fairly "up there" PCB is what I'm saying, so adding the RAM isn't really an issue :)

Simon
segher
XCore Expert
Posts: 844
Joined: Sun Jul 11, 2010 1:31 am

Post by segher »

SpacedCowboy wrote:SD video is transmitted in fields, not frames, so unless you buffer two fields to make a frame you're stuck with producing "motion jpeg" images.
What is wrong with that? You cannot make a frame out of just two fields, anyway (it will look slightly worse than a donkey's behind). Best to just keep whatever format the source has.
Also, given that an xmos chip can't handle the workload of realtime full-size SD video compression,
I'm not convinced about that; my back-of-the-envelope calculations show you can do it on one thread of a 400MHz chip, even. It's tight, but it can be done. Encoding JPEG is not hard.
I'm either going to have multiple G4's or (more likely) an FPGA attached via an xlink or two to handle the heavy lifting (DCT and (probably) entropy coding).
The standard integer DCT will do fine, although that is tuned for CPUs with a slow (or no) multiplier. It's a few cycles per pixel.

The variable length code is just a table lookup (you use a fixed codebook, right? Like everyone else does?)
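
Something like this is all the entropy coder amounts to - the codebook contents here are placeholders, not the real JPEG tables, and real JPEG output also needs 0xFF byte stuffing:

Code:
/* Sketch of the "just a table lookup" variable-length coding step:
 * a fixed codebook maps each symbol to a (length, bits) pair, and the
 * bits are packed MSB-first into an output buffer.  The table contents
 * here are placeholders, not the real JPEG Huffman codebook. */
#include <stdint.h>
#include <stddef.h>

typedef struct { uint8_t len; uint16_t code; } vlc_t;

/* placeholder codebook: real values come from the JPEG standard tables */
static const vlc_t codebook[256] = {
    [0] = { 2, 0x0 }, [1] = { 3, 0x4 }, /* ... */
};

static uint8_t  out[4096];
static size_t   out_pos;
static uint32_t bit_acc;
static unsigned bit_cnt;

static void put_bits(uint16_t code, unsigned len)
{
    bit_acc = (bit_acc << len) | code;   /* append MSB-first */
    bit_cnt += len;
    while (bit_cnt >= 8) {               /* emit whole bytes as they fill */
        bit_cnt -= 8;
        out[out_pos++] = (uint8_t)(bit_acc >> bit_cnt);
    }
}

void encode_symbol(uint8_t sym)
{
    put_bits(codebook[sym].code, codebook[sym].len);  /* one lookup, one shift */
}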
SpacedCowboy
Experienced Member
Posts: 67
Joined: Fri Aug 24, 2012 9:37 pm

Post by SpacedCowboy »

segher wrote:
SpacedCowboy wrote:SD video is transmitted in fields, not frames, so unless you buffer two fields to make a frame you're stuck with producing "motion jpeg" images.
What is wrong with that? You cannot make a frame out of just two fields, anyway (it will look slightly worse than a donkey's behind). Best to just keep whatever format the source has.
Actually it's pretty simple to combine the two fields together, and it looks just fine as long as your video decoder has a decent comb filter. Getting the *correct* even/odd fields is crucial - combining the odd field from NTSC frame N-1 with the even field of frame N will of course not be so great, or vice versa for PAL :)
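
Combining them is just interleaving lines - something like this (plain C, YUYV assumed, the field-order checking I mentioned left out):

Code:
/* Sketch of weaving two fields into one frame, as described above:
 * one field supplies lines 0,2,4,... and the other lines 1,3,5,...
 * YUYV, 2 bytes per pixel; getting the field order right is the
 * caller's problem. */
#include <stdint.h>
#include <string.h>

#define WIDTH  720
#define LINES  576                 /* full frame; each field has 288 lines */
#define STRIDE (WIDTH * 2)         /* YUYV: 2 bytes per pixel */

void weave_fields(const uint8_t *top_field,    /* 288 lines */
                  const uint8_t *bottom_field, /* 288 lines */
                  uint8_t *frame)              /* 576 lines */
{
    for (int l = 0; l < LINES / 2; l++) {
        memcpy(frame + (2 * l)     * STRIDE, top_field    + l * STRIDE, STRIDE);
        memcpy(frame + (2 * l + 1) * STRIDE, bottom_field + l * STRIDE, STRIDE);
    }
}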
segher wrote:
Also, given that an xmos chip can't handle the workload of realtime full-size SD video compression,
I'm not convinced about that; my back-of-the-envelope calculations show you can do it on one thread of a 400MHz chip, even. It's tight, but it can be done. Encoding JPEG is not hard.
I was getting that from Henk Muller's sc_dsp_transforms code ( http://github.xcore.com/repo_index/sc_d ... eadme.html ), where the docs say:
module_dct_jpeg
===============

This module is a proof-of-concept 2D DCT for JPEG compression.

Performance (provided you can stream data in and out):

* three 50 MIPS threads compress approx 1.7 Msamples/s.
* QVGA greyscale: 22 fps in three 50 MIPS threads.
* VGA greyscale: 5 fps in three 50 MIPS threads.
* VGA greyscale: 14 fps in three 125 MIPS threads. With an extra DCT thread
this may go up to 20 fps.
* Colour-images: approx two thirds of the speed?
segher wrote:
I'm either going to have multiple G4's or (more likely) an FPGA attached via an xlink or two to handle the heavy lifting (DCT and (probably) entropy coding).
The standard integer DCT will do fine, although that is tuned for CPUs with a slow (or no) multiplier. It's a few cycles per pixel.

The variable length code is just a table lookup (you use a fixed codebook, right? Like everyone else does?)
Again, from the above docs:
The performance depends on the compression ratio. When compressing
marginally, Huffman encoding starts to take a serious amount of time.
XMOS programming is new to me, so I'm going by what someone else has managed to do. If it turns out that you're right and Henk's code isn't very optimal (you're effectively saying it could be done more than 12x faster, given that a single 100MHz thread could do it where he needed three 125 MIPS threads to get 14fps at VGA, which isn't even SD video), then that's great. I'll learn a lot along the way :)
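
For what it's worth, here's how I'm working that ratio out, using only the figures quoted from Henk's docs above - treat it as a rough check, not a benchmark:

Code:
/* Rough sanity check of the "more than 12x" estimate, using only the
 * quoted figures (3 x 125 MIPS threads doing VGA greyscale at 14 fps). */
#include <stdio.h>

int main(void)
{
    double henk_mips      = 3.0 * 125e6;               /* 375 MIPS          */
    double henk_samples   = 640.0 * 480.0 * 14.0;      /* ~4.3 Msamples/s   */
    double cycles_per_smp = henk_mips / henk_samples;  /* ~87 cycles/sample */

    double sd_samples  = 720.0 * 576.0 * 25.0 * 2.0;   /* YUYV 4:2:2, ~20.7 Msamples/s */
    double needed_mips = sd_samples * cycles_per_smp;

    printf("cycles/sample from Henk's figures: %.0f\n", cycles_per_smp);
    printf("MIPS needed for SD colour at that rate: %.0f M\n", needed_mips / 1e6);
    printf("ratio to a single 100 MIPS thread: %.1fx\n", needed_mips / 100e6);
    return 0;
}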

I still like the idea of a multiprocessor board with a large number of exposed i/o ports and local RAM though, so even then I'm coming out ahead :)

Simon
Bianco
XCore Expert
Posts: 754
Joined: Thu Dec 10, 2009 6:56 pm

Post by Bianco »

It's phase-change memory, not SRAM. According to the datasheet it can do about 1 million write cycles, which in most situations is not enough to replace SRAM (for longer periods :p). Using it as a display buffer at 60Hz would last only about 4.6 hours.
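
The arithmetic, for reference (assuming one buffer update per frame):

Code:
/* Write-endurance arithmetic behind the ~4.6 hour figure above. */
#include <stdio.h>

int main(void)
{
    double write_cycles = 1e6;    /* guaranteed writes per cell (datasheet) */
    double updates_hz   = 60.0;   /* one frame-buffer update per frame      */
    double seconds      = write_cycles / updates_hz;
    printf("%.1f hours\n", seconds / 3600.0);   /* ~4.6 hours */
    return 0;
}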