Posted: Thu May 03, 2012 1:57 am
XCore Expert
Joined: Sun Jul 11, 2010 1:31 am
Posts: 675
LyleHaze wrote:
I think it depends on what your definition of "long" is. (that's what SHE said! :lol: )

It's probably just my compiler being extra-picky. But to print a 32 it likes %ld instead of %d.

Either you're misunderstanding something, or doing something wrong,
or your compiler is. %d requires an int, %ld requires a long int. On most
32-bit platforms both are 32 bits.
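
To illustrate the point, here is a minimal sketch of two ways to print a 32-bit value without arguing with the compiler. The helper names `format_u32` and `format_as_long` are made up for this example; `PRIu32` comes from the standard `<inttypes.h>` and expands to whatever conversion matches `uint32_t` on the compiler in use.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: format a 32-bit unsigned value portably.
   PRIu32 picks the conversion that matches uint32_t, whether that
   type is an int or a long on this platform. */
static int format_u32(char *buf, size_t len, uint32_t value)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%" PRIu32, value);
}

/* The explicit alternative: cast to the type the conversion requires.
   unsigned long is always at least 32 bits, so nothing is lost. */
static int format_as_long(char *buf, size_t len, uint32_t value)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%lu", (unsigned long)value);
}
```

Either form keeps the argument type and the conversion in agreement, which is all the compiler is checking for.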

Quote:
If that's my biggest problem today I'll call it a great day and move on.

Heh yeah :-)


I have no answer to any of your questions about XDE: I have never run XDE.


Quote:
The compiled output looks like the *.xe file. But I am afraid to pass it to the run command because:
The program runs on both cores, and I don't know which one to load/run it on.
The run program has a maximum file size of 64K, and this .xe file is about 86 K in size.

To run XE files, use runxe, not run. Your XE file most likely contains two images
for each core (so four in total); runxe will run all of them, in the order indicated
in the image.

Quote:
So, best guess is that I need a different output format from XDE for using with your "Run" command.

"run" simply uploads a file to RAM, on a single core, and starts it at 10000. It is
not something you want to do with programs created with the normal toolchain.

Quote:
If I need to compile from a command line, Great! If we ever port tools over that's where we will be working from anyway.

It does not matter how you compile, all that matters is the executable file that
results, right :-)
Using XDE without a supported debug adapter will not work in some obvious
places, and might not work for some non-obvious things either, I hear.
If you want to use a stepping debugger (GDB) you'll need to teach GDB how
to talk via your adapter. That should not be so hard; all the code should be available.

My tools are quite command-line oriented, since I never use an IDE (or, "xterm is
my IDE").

Quote:
Any suggestions will be appreciated.

./runxe blinkenlights.xe should do the trick. Get on IRC if you want live
help from me ;-)

Quote:
You should see the code in a few days or less.

Excellent, looking forward to it!


Segher


Posted: Thu May 03, 2012 9:41 am
Experienced Member
Joined: Wed Apr 11, 2012 6:21 am
Posts: 64
Good News.
I "adapted" (nice word!) the runxe code to optionally compile without sys/mman.h

I hijacked the jtag write outputs so I could verify that it wasn't damaged too badly. It looked
like 4 separate 64K writes, each from 0x10000 to 0x1FFFF. Exactly what I was told to expect.

Mind you, the example program I wrote is painfully small, and 95% of the "payload" was zeros.

It works! Flashy lights make me happy. So we now have both run and runxe, as well as the assortment of
regs, sregs, pregs, psregs, reset, resources... Am I leaving anything out? Oh yeah, a couple of dumps too.

I am happy to hear that some part of the XMOS community prefers command line tools, as these would be the most likely candidates for porting over to the Amiga.

Since I hacked up more of your beautiful code, I have more to clean up before I send it back to you.
I have now reached my initial goal, I will be releasing the project to the community sometime soon, possibly as soon as late this week. If you notice a sudden increase in newcomers, you'll know the code is out there.

I'm not finished. I'd like to combine all the tools into a single command line driven executable, probably similar to your "term". I'd like to make a GUI display window with colorful displays of internal activity. There's a lot I'd like to do, but I'll probably take a few days off first; my other projects need time too.

One question: How long does it take your JTAG tools to write a complete .xe file (4X 64K in this case)?
I am truly curious how fast I'm going now, and how much faster I might be able to push it.

Thanks again for the friendly support and great "community" attitude!

LyleHaze


Posted: Thu May 03, 2012 12:16 pm
XCore Expert
Joined: Sun Jul 11, 2010 1:31 am
Posts: 675
LyleHaze wrote:
One question: How long does it take your JTAG tools to write a complete .xe file (4X 64K in this case)?
I am truly curious how fast I'm going now, and how much faster I might be able to push it.


An executable just like that, on L2:
Code:
$ time ./runxe -n1 -s tt
6393089 bits jtag written
2049436 bytes usb written
490 bits jtag read
100 bytes usb read

real    0m5.423s
user    0m0.254s
sys     0m0.081s


That's with an FT2232D, JTAG running at 6MHz; ideally it would take close
to 1s, but USB gets in the way (as usual). On the FT, reading is much worse
still (it has a smaller buffer for reading, and I'm not handling it in a particularly
smart way: you have to make sure you read the return data over USB before the
buffer fills up, so I just read it immediately always):

Code:
$ time ./dump -n1 -s >/dev/null
1597933 bits jtag written
659694 bytes usb written
524386 bits jtag read
81940 bytes usb read

real    0m10.718s
user    0m0.366s
sys     0m0.855s

That is one 64kB dump; ideally that would be 1/3rd of a second.

All the other tools just take a second or so, so I never felt pressed to improve
it much. With your JTAG adapter, you should get much better numbers: just
divide "bits jtag written" by the JTAG bus freq, and that's how long it should
take (every bit "read" is also a bit "written" at the same time).
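
That rule of thumb as a small sketch; the function name `jtag_ideal_seconds` is made up for illustration, and it deliberately ignores USB and host overhead.

```c
#include <assert.h>

/* Ideal transfer time in seconds: one TCK cycle per JTAG bit written,
   with no allowance for USB latency or host-side overhead. */
static double jtag_ideal_seconds(unsigned long bits_written, double tck_hz)
{
    assert(tck_hz > 0.0);
    return (double)bits_written / tck_hz;
}
```

Plugging in the numbers above: 6393089 bits at 6 MHz is about 1.07 s for the runxe case, and 1597933 bits at 6 MHz is about 0.27 s for the 64 kB dump.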


Posted: Thu May 03, 2012 1:00 pm
Experienced Member
Joined: Wed Apr 11, 2012 6:21 am
Posts: 64
Wow!
It would appear that I am "a little bit slow". :?
(I have many friends that will support that!)
I'm writing 64KBytes in ten seconds, all four (full runxe) in forty.

Well, it WAS my intent to keep it "slow and safe" for testing. I guess I exceeded my own expectations there.

I'm using one of the OS timers twice per cycle. After a bit of sleep (Now 8AM, none yet) I'll try dropping the timer by a factor of ten to see if it's still reliable.

Maybe even make it a #define for easy access later.

Thanks again!
LyleHaze


Posted: Thu May 03, 2012 5:36 pm
XCore Expert
Joined: Sun Jul 11, 2010 1:31 am
Posts: 675
LyleHaze wrote:
I'm writing 64KBytes in ten seconds, all four (full runxe) in forty.

That is a bit slow, alright.

I use 6MHz because that is the fastest the FT2232 can do; the XTAG2 code (see
https://github.com/xcore/sc_jtag) uses 25MHz (except for very long chains).

Quote:
I'm using one of the OS timers twice per cycle. After a bit of sleep (Now 8AM, none yet) I'll try dropping the timer by a factor of ten to see if it's still reliable.

You said you access the JTAG via some PLD? Maybe it can do some of the grunt
timing work for you.


Posted: Thu May 03, 2012 6:58 pm
Experienced Member
Joined: Wed Apr 11, 2012 6:21 am
Posts: 64
OK, A few more details as we go..
All the XTAG pins I'm twiddling are coming off a Xilinx XC2C128 PLD.
To be quite honest, it's all alphabet soup right now.
It IS re-programmable by a separate JTAG port, but I'm not anxious to get quite that deep.

In the code I have been going in and out of an OS Timer twice per BIT of XTAG operation.
I fully understand that this was probably overkill, so after reading your transfer times I reduced the time constant. This made absolutely NO improvement in cycle time.
Second attempt: I removed the Timer call completely. Now I'm getting the full runxe cycle in about eight seconds. Still not as good as it might be, but an 80% drop from the first two attempts.
:)
Now after a few hours sleep, I'm ready to complete the software package so I can get it out the door.
So I'm setting my sights on completing the resource and getting it out there. There will be plenty of opportunities to fine tune it, by myself and others as well. I need to keep my eye on the goal of creating a starting place, not a finish line.
And again, I am grateful for your code and your assistance in porting it. It has been as solid as anyone could hope for, even after whatever changes I have made.

LyleHaze


Posted: Thu May 03, 2012 8:21 pm
XCore Expert
Joined: Sun Jul 11, 2010 1:31 am
Posts: 675
LyleHaze wrote:
OK, A few more details as we go..
All the XTAG pins I'm twiddling are coming off a Xilinx XC2C128 PLD.
To be quite honest, it's all alphabet soup right now.
It IS re-programmable by a separate JTAG port, but I'm not anxious to get quite that deep.

Pro-tip: get someone else to do it! :-)

Quote:
In the code I have been going in and out of an OS Timer twice per BIT of XTAG operation.

Yeah I guessed that. Not a recipe for performance, heh.

Quote:
There will be plenty of opportunities to fine tune it, by myself and others as well. I need to keep my eye on the goal of creating a starting place, not a finish line.

Do you see a way to make it acceptably fast though?

Quote:
And again, I am grateful for your code and your assistance in porting it. It has been as solid as anyone could hope for, even after whatever changes I have made.

Happy to hear that :-)


Posted: Fri May 04, 2012 12:16 am
Experienced Member
Joined: Wed Apr 11, 2012 6:21 am
Posts: 64
Quote:
Do you see a way to make it acceptably fast though?

I'm sure that I can make my code "match" yours better, so there's less bit masking and translation going on. I also tend to get ideas for improvement when I least expect them, and those usually improve throughput as well.

One possibility would be to translate the object file into an Amiga-native format that has all the bits arranged in the same position as the actual output bits on the PGA. This would put all the translation into a post-processor and make transmitting the file ridiculously easy: write 1 byte to the port, raise the clock bit, capture TDO, lower the clock bit, and then store TDO only if requested by some unused bit in that byte. I could probably get 90% of that advantage just by using a local LUT to translate "live" in the current tools; just a more efficient way to get from there to here. Another idea is to write a tight loop for writing long runs of zeros, which is what 95% of the current project is: a kind of run-length encoding for that specific value.
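
A rough sketch of the LUT idea, under stated assumptions: the input encoding (bit 0 = TDI, bit 1 = TMS) and the port bit positions (TCK = 0, TMS = 1, TDI = 2) are taken as given for this example, and the names `build_port_lut` and `port_bits` are hypothetical. The table is built once, so the per-bit work shrinks to a single indexed load.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit positions on the output port. */
enum { PORT_TCK = 0, PORT_TMS = 1, PORT_TDI = 2 };

/* Four possible inputs: bit 0 = TDI, bit 1 = TMS (assumed encoding). */
static uint8_t port_lut[4];
static int lut_ready;

/* Build the table once at startup; TCK is left low in every entry. */
static void build_port_lut(void)
{
    for (unsigned in = 0; in < 4; in++) {
        uint8_t out = 0;
        if (in & 1u) out |= (uint8_t)(1u << PORT_TDI);
        if (in & 2u) out |= (uint8_t)(1u << PORT_TMS);
        port_lut[in] = out;
    }
    lut_ready = 1;
}

/* Translate an input mask to port bits with one table lookup. */
static uint8_t port_bits(uint8_t bitmask)
{
    assert(lut_ready);
    return port_lut[bitmask & 3u];
}
```

Whether this buys anything measurable depends on where the time actually goes, so it is worth timing before and after.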

I'm not done with this code yet. I'd like to roll all these into a single executable, and change a few details as I do. The "Run" now loads from Stdin.. While we DO have pipes available, that's not the most common way of loading a file. I'm sure a few other ideas will pop up as well.

If I do the GUI display for xmos activity, it will give me a chance to break away from cross-compatible code and write it all in Amiga native code. I don't suspect that the ReAction user interface stuff would be recognizable to other platforms (though I might be wrong).

All that comes _after_ the first landmark of providing working load tools to the community. That is very close to being in reach, possibly even today.

Question: What is the file extension for object files coming from your command line compiler/linker? Apparently .xe is from XDE only?

Back to the coding.. :)
LyleHaze


Posted: Fri May 04, 2012 2:03 am
XCore Expert
Joined: Sun Jul 11, 2010 1:31 am
Posts: 675
LyleHaze wrote:
Quote:
Do you see a way to make it acceptably fast though?
I'm sure that I can make my code "match" yours better, so there's less bit masking and translation going on.

I would not worry about the bit-munging; what is killing you is doing the
actual I/O. You probably are going through some system call interface
(or other task switch thing); minimise those transitions, they are _expensive_.

Use a full-system profiler if you have one; if not, use whatever profiler you
_do_ have. Don't guess, measure.

Complicating the code for no appreciable speed gain is the opposite of
optimisation ;-)
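
If no profiler is at hand, even a crude timing harness beats guessing. The sketch below uses only the standard `clock()`; the names `seconds_per_call` and `dummy_io` are made up, and `dummy_io` is just a stand-in for whatever per-bit routine is under suspicion. Note that `clock()` reports CPU time, so it is a stopgap and not a replacement for a real profiler.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical helper: call fn() `iterations` times and return the
   average CPU time per call in seconds, as reported by clock(). */
static double seconds_per_call(void (*fn)(void), unsigned long iterations)
{
    assert(fn != NULL && iterations > 0);
    clock_t start = clock();
    for (unsigned long i = 0; i < iterations; i++)
        fn();
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC / (double)iterations;
}

/* Stand-in workload; replace with the routine being measured. */
static volatile unsigned long sink;
static void dummy_io(void)
{
    sink++;
}
```

Measure the real I/O routine with a large iteration count, change one thing, and measure again.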

Quote:
The "Run" now loads from Stdin.. While we DO have pipes available, that's not the most common way of loading a file.

Yeah, I should make it optionally take a filename argument, good idea.

Quote:
Question: What is the file extension for object files coming from your command line compiler/linker? Apparently .xe is from XDE only.?

I have no idea what XDE does. The XMOS compiler outputs to whatever
you put after "-o"; the convention (from unix) is that executables do
not have an extension. Object files have extension ".o".


Posted: Fri May 04, 2012 2:36 am
Experienced Member
Joined: Wed Apr 11, 2012 6:21 am
Posts: 64
segher wrote:
I would not worry about the bit-munging; what is killing you is doing the
actual I/O. You probably are going through some system call interface
(or other task switch thing); minimise those transitions, they are _expensive_.

Use a full-system profiler if you have one; if not, use whatever profiler you
_do_ have. Don't guess, measure.

Complicating the code for no appreciable speed gain is the opposite of
optimisation ;-)

Interesting.. And the straight answer is, I do not know.
I'm NOT using any system libraries or other "apparent" translations here. As far as I can tell, I'm banging the hardware directly. If there is some interface between my code and the actual output, it is well hidden.

Allow me to demonstrate:

Code:
#define XJTAG     0xf500000D  // address of the JTAG output port
#define jtag_TDI  2
#define jtag_TDO  3           // TDO is READ_ONLY!
#define jtag_TMS  1
#define jtag_TCK  0
#define jtag_SRST 0           // This is at a different port address.. but rarely used
#define jtag_TRST 4

// TMS (0x02, per the comment below) and readjbit() are not shown in this excerpt.

static void xjtag_write(uint8 value)
{
    // mask will prevent writing to unused or read-only bits;
    // volatile stops the compiler from merging or dropping port writes
    *((volatile uint8 *)XJTAG) = (value & 0x17);
}

void short_wait(void)
{
    return;                   // timer call removed; currently a no-op
}

// Sets TDI if bitmask & 0x01,
// sets TMS if bitmask & 0x02, (|TMS)
// then cycles TCK, while reading return bit on TDO
// returns 0 or 1 based on TDO value
static uint8 XTAG_Clock(uint8 bitmask)
{
    uint8 result = 0;
    uint8 outmask = 0;

    if (bitmask & 1)
    {
        outmask |= (1 << jtag_TDI);
    }
    if (bitmask & TMS)
    {
        outmask |= (1 << jtag_TMS);
    }
    xjtag_write(outmask);          // bits as set, clock LOW

    short_wait();                  // signal setup time

    outmask |= (1 << jtag_TCK);    // we will raise the clock
    xjtag_write(outmask);          // bits as set, clock high

    short_wait();

    result = readjbit(jtag_TDO);   // read TDO just before the clock falls
    outmask &= ~(1 << jtag_TCK);   // the clock falls again
    xjtag_write(outmask);

    return result;                 // the state of TDO for this bit
}


That's probably enough to give you an idea.
LOTS of room for optimization; my goal was to get running, not to win the race.
As I read this, I could create the OR masks once instead of shifting the bits every time (wasteful!).
And this code is "adapted" from the code you provided, making two translations instead of one (or none).
But as far as I can tell, I am banging the hardware directly at the end of the chain, unless the compiler is very good at concealing the interface. :)

