Memory limitations for instructions per thread

Technical discussions around xCORE processors (e.g. General Purpose (L/G), xCORE-USB, xCORE-Analog, xCORE-XA).
MarcAntigny
Active Member
Posts: 52
Joined: Tue Feb 13, 2018 2:50 pm

Re: Memory limitations for instructions per thread

Postby MarcAntigny » Mon Oct 01, 2018 1:25 pm

Hello mon,
I tried to fix the issue by removing the const type qualifier (as suggested in the post). Unfortunately, it didn't change anything. I also have no __attribute__ qualifiers in the code.
I opened a ticket on the website and contacted my FAE. I hope the tools team will investigate the error.
Thank you for your involvement.
Marc
MarcAntigny
Active Member
Posts: 52
Joined: Tue Feb 13, 2018 2:50 pm

Postby MarcAntigny » Mon Oct 01, 2018 3:30 pm

Here is a reduced version of the code that raises the errors.
If anyone has an idea how to fix it, please try this project:
https://drive.google.com/open?id=1qRXd5lfndNWLNkINY3cdV6xbiuHyT0FJ
Marc
johned
XCore Addict
Posts: 165
Joined: Tue Mar 26, 2013 12:10 pm

Postby johned » Tue Oct 02, 2018 12:51 pm

Hi Marc,
Our observant tools team have spotted that you are unrolling loops 2000 times, in multiple functions. This will chew up a lot of memory.
If you reduce the unrolling I am sure you will see that the code will fit.
Best regards,
John
MarcAntigny
Active Member
Posts: 52
Joined: Tue Feb 13, 2018 2:50 pm

Postby MarcAntigny » Tue Oct 02, 2018 1:17 pm

Hi John,
In the code I provided, the loop is unrolled 2000 times to reproduce what I see in my application. In my real code there are fewer steps to unroll (about 11), but each step contains many more instructions, and I get the same issue. Of course, if I reduce the unrolling, there is a point at which it works. However, that point is not the memory limit; there is plenty of memory left.
For example, try it with only 1900 steps for the first loop (the others remain at 2000). Everything works fine and the compiler reports that only 1872 B are used. Then retry with 2000 steps for the first loop (after a clean/build) and you get the errors. So with only 100 more steps, the memory suddenly fills up (more than 255 kB for just 100 steps). That is the issue.
As I mentioned, I tried splitting the loop into two smaller loops, but it didn't change anything.
Marc
PS: My application needs the unrolling to reduce the time spent on the calculations.
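Taking the figures above at face value, a quick check shows that the jump cannot come from 100 extra unrolled steps alone. It assumes the 255 kB is the extra memory the 2000-step build demands, and uses the fact that every xCORE instruction is at least 16 bits (2 B):

$$\frac{255 \times 1024\ \text{B}}{100\ \text{steps}} \approx 2611\ \text{B per extra step}, \qquad \frac{1872\ \text{B}}{1900\ \text{steps}} \approx 0.99\ \text{B per step} < 2\ \text{B}$$

Even counting only the first loop's 1900 steps, the 1872 B build has less than one instruction per step, so that loop cannot be fully unrolled in it. This is consistent with Nigel's analysis later in the thread.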
johned
XCore Addict
Posts: 165
Joined: Tue Mar 26, 2013 12:10 pm

Postby johned » Tue Oct 02, 2018 1:39 pm

Thanks Marc,

What optimization level are you using in your application?
-O3 leads to large code space so we always recommend -Os.

Also, you might like to take a look at the generated assembly code using "-S" to see if this helps.

Best regards,
John
MarcAntigny
Active Member
Posts: 52
Joined: Tue Feb 13, 2018 2:50 pm

Postby MarcAntigny » Tue Oct 02, 2018 1:53 pm

I am using -O3, but I also tried -Os and it didn't change anything; I got the same errors.
As I mentioned to mon2, I have already looked at the assembly, and everything is fine when compilation succeeds (no problem with smaller loops). But when the errors occur, the compiler fails, so I can't see the assembly code (I suspect the problem comes from the linking step).
MarcAntigny
Active Member
Posts: 52
Joined: Tue Feb 13, 2018 2:50 pm

Postby MarcAntigny » Wed Oct 03, 2018 12:52 pm

Hi,
Any other ideas about the origin of the issue? Unfortunately, if it is related to the compiler, I don't think I can investigate it much myself.
Thanks,
Marc
NigelPerks
Posts: 4
Joined: Thu Oct 04, 2018 2:09 pm

Postby NigelPerks » Thu Oct 04, 2018 2:14 pm

I have confirmed the following using both a test program and Marc’s own sample code, compiling mixer.xc with -S to see the assembly code that is generated.

Marc is right that, for whatever reason, -Os behaves the same as -O3 here (it does not seem to optimise for space).

I found the default optimisation (on the xcc command line, at least) to be equivalent to -O0.

Under -O0, the calculation function is not inlined and its loop is not unrolled, whether the number of iterations is 1900, 2000 or mixed.

Under -O3 or -Os:

If the calculation function is called with only one number of iterations, whether 1900 or 2000, it is compiled as a separate function (not inlined), specialised to that number of iterations, and the loop is unrolled.

If the calculation function is called with different numbers of iterations from different functions, it is inlined into those functions, but the loop is not unrolled.

So the reason for the observed leap in memory usage, and the occurrence of errors, is not that an extra 100 iterations take up a large amount of memory. When 1900 and 2000 iterations are both used, the loop is not unrolled at all, which is why that build is so small; when only 2000 is used, the loop is unrolled.

Using -O3, but not using inline or unroll, also produces the smaller (not unrolled) assembly.

I will start looking into how these decisions are made in the compiler, but I don’t know how long it will take to find out. In the meantime, to get the application working with the current tools, the only solution I have found is not to use inline and unroll in this case.

Nigel
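For readers without the linked project, here is a minimal C sketch of the shape Nigel describes. Everything in it is hypothetical: the names, the buffer size and the per-step arithmetic are stand-ins, not Marc's mixer.xc, and C is used instead of XC so it builds with any C compiler. `#pragma loop unroll` is the XMOS pragma John mentions at the end of the thread; its placement and its no-count form here are assumptions, and compilers that don't recognise it ignore it.

```c
#include <assert.h>

#define LEN 2000 /* hypothetical buffer size matching the 2000-step loops */

static int buf_a[LEN], buf_b[LEN];

/* Shape of the problem case (hypothetical, not Marc's code):
 * an inline helper whose loop is marked for unrolling. */
static inline void calc_unrolled(int data[], int n)
{
    assert(n >= 0 && n <= LEN);
    /* XMOS loop pragma; assumed here to request a full unroll when no
     * count is given. Other compilers (e.g. GCC) ignore it. */
#pragma loop unroll
    for (int i = 0; i < n; i++) {
        data[i] = data[i] * 2 + 1; /* stand-in for the real per-step work */
    }
}

/* Workaround Nigel reports: same arithmetic, no inline keyword and no
 * unroll pragma; under -O3 this gave him the smaller, rolled code. */
void calc_plain(int data[], int n)
{
    assert(n >= 0 && n <= LEN);
    for (int i = 0; i < n; i++) {
        data[i] = data[i] * 2 + 1;
    }
}

/* Two callers with different constant trip counts: the mixed case in which
 * Nigel saw the helper inlined with its loop left rolled under -O3/-Os. */
void mix_first(void)  { calc_unrolled(buf_a, 1900); }
void mix_second(void) { calc_unrolled(buf_b, 2000); }
```

Because the pragma only affects code generation, both helpers must produce identical results, which makes `calc_plain` a convenient reference when experimenting with unroll settings.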
MarcAntigny
Active Member
Posts: 52
Joined: Tue Feb 13, 2018 2:50 pm

Postby MarcAntigny » Thu Oct 04, 2018 4:09 pm

Hi Nigel,

Thank you for trying to help me solve this problem. As you noticed, with no unrolling and no inlining, everything works fine. However, my real code is more complex than the memory read/write I wrote for the example, so unrolling and inlining by hand would be error-prone and would make firmware development chaotic.
Keep me informed of your progress.

Marc
johned
XCore Addict
Posts: 165
Joined: Tue Mar 26, 2013 12:10 pm

Postby johned » Thu Oct 04, 2018 5:37 pm

Hi Marc,
Have you tried specifying how many times to unroll the loop:
#pragma loop unroll (n)
Best regards,
John
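A minimal C sketch of John's suggestion follows. It assumes `#pragma loop unroll (n)` goes directly before the loop it controls and that xcc accepts it in C as well as XC (worth checking in the tools documentation for your version); the function names and the factor 4 are hypothetical. Asking for a small factor instead of a full unroll should keep code size roughly proportional to the factor rather than to the trip count, while still removing most of the loop overhead.

```c
#include <assert.h>

/* Hypothetical per-step work, standing in for the real mixer calculation. */
static int step(int x)
{
    return x * 2 + 1;
}

/* Process n samples, asking for the loop body to be replicated 4 times
 * rather than n times. The pragma is XMOS-specific (assumed valid in C
 * under xcc); other compilers ignore it, so results are identical. */
void calc_partial(int data[], int n)
{
    assert(n >= 0);
#pragma loop unroll (4)
    for (int i = 0; i < n; i++) {
        data[i] = step(data[i]);
    }
}
```

Letting the compiler do the partial unroll also means it handles trip counts that are not a multiple of the factor, which addresses Marc's concern about manual unrolling being error-prone.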