Title: High private bytes memory usage with async def coroutines
Priority: bug
Status: resolved
Superseder:
Nosy List: kayhayen, marcel
Assigned To: kayhayen
Keywords: optimization

Created on 2017-07-02.08:48:42 by marcel, last changed by kayhayen.

msg2500 (view) Author: kayhayen Date: 2018-08-23.23:19:54
Finally, this is a reality on the develop branch. The change was very complex in nature,
because it involved going from strings for identifiers to objects that also
include the type and where they are stored.

Also, the performance of creating and using generators has increased a lot.

msg2181 (view) Author: kayhayen Date: 2017-07-19.09:25:21
I was making experimental changes, and it looks like letting a yield exit the function and then
resuming it later will work, if all local variables live in allocated memory attached to the
generator object, reached through a pointer, instead of on the stack.

It will take time though, and so far nothing runs. But basically this uses a "yield index" and a
bunch of gotos at function entry to resume the code. Because we now generate C, and no longer C++,
the gotos seem to cause no issues. Previously, when C++ objects were local, gotos couldn't jump
past them, so this couldn't be done in early Nuitka.

If this holds, the whole "fiber" dependency goes away. The small extra stacks will need a malloc, but
they could be cached per function for performance (which might waste run time memory, though), or
come from a free list, or similar. Accessing variables on the stack or via such a pointer may perform
about the same, but since it's a pointer, the C compiler won't trust the values to be unchanged
after calls, where the stack would be more trusted.
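
The free list option could look roughly like the following Python sketch. It is purely illustrative: `StorageFreeList`, its parameters, and the use of `bytearray` blocks in place of malloc'ed memory are assumptions, not anything Nuitka does. The cap on cached blocks reflects the trade-off mentioned above between speed and wasted memory.

```python
class StorageFreeList:
    """Reuses fixed-size storage blocks instead of allocating a new one each time."""

    def __init__(self, block_size, max_cached):
        self.block_size = block_size
        self.max_cached = max_cached  # upper bound on memory kept around idle
        self._free = []

    def acquire(self):
        # Prefer a cached block; allocate only when the free list is empty.
        if self._free:
            return self._free.pop()
        return bytearray(self.block_size)

    def release(self, block):
        # Keep the block for reuse unless the cache is already full.
        if len(self._free) < self.max_cached:
            block[:] = bytes(self.block_size)  # zero it, keeping the same object
            self._free.append(block)


pool = StorageFreeList(block_size=64, max_cached=2)
first = pool.acquire()
pool.release(first)
second = pool.acquire()
print(second is first)  # True
```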

On the other hand, using the stack is not going to work for many things. Imagine, e.g., a small
list or tuples of limited size that only live locally; they could use that extra storage too, so it
would even be used for normal functions.

I imagine I need a detection for what must be on that extra-stack and what not. For
"a + b + c + yield", the result of "a + b + c", which is a temp variable, must be saved, but the
"a + b" temp need not be. So some kind of detection of which variables hold values at the time a
"yield" occurs needs to make that decision; then stack usage and extra-stack usage would be mixed,
decided per variable.
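
One way to picture that detection is a tiny liveness check over a straight-line form of the expression. This is a sketch under simplifying assumptions: no branches or loops, each temporary assigned exactly once, and the names `PROGRAM`, `tmp1` to `tmp3` and `live_across_yield` are hypothetical. The named locals a, b and c are left out because the fragment never assigns them.

```python
# Straight-line form of:  result = a + b + c + (yield)
# Each entry is (target, operands); the operand "yield" marks a suspension point.
PROGRAM = [
    ("tmp1", ("a", "b")),           # tmp1 = a + b
    ("tmp2", ("tmp1", "c")),        # tmp2 = tmp1 + c
    ("tmp3", ("yield",)),           # tmp3 = yield   (function exits and resumes here)
    ("result", ("tmp2", "tmp3")),   # result = tmp2 + tmp3
]


def live_across_yield(program):
    """Return the temporaries assigned before a yield and still read after it."""
    needs_extra_stack = set()
    for index, (_target, operands) in enumerate(program):
        if "yield" not in operands:
            continue
        assigned_before = {target for target, _ in program[:index]}
        read_after = {name for _, ops in program[index + 1:] for name in ops}
        needs_extra_stack |= assigned_before & read_after
    return needs_extra_stack


print(live_across_yield(PROGRAM))  # {'tmp2'}
```

Only `tmp2` (the "a + b + c" result) has to live on the extra stack; `tmp1` (the "a + b" temp) is dead by the time the yield happens and could stay on the normal stack.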

I kind of think, I can delay that though, until correctness of the "extra-stack" only approach has 
been shown.

Also, design questions are in my mind now, e.g. every "+" operation allocates a temp variable for the 
left side to be assigned to. Maybe they could be re-used to minimize size and become more local 
anyway. Right now, on the stack, hope is the C compiler eliminates them, on the extra stack, it never 

This also ties in with C types: using alternatives for variable types, "C int or PyObject *"
with an indicator of which one is valid, could easily double or triple memory usage on the stack,
and even for temp objects from the "+" above.
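
The size effect can be checked with ctypes. The layout below is only an assumed illustration of a "C integer or object pointer plus indicator" slot, not Nuitka's actual representation; `ValueUnion`, `TaggedValue` and the field names are made up, and `c_long` is one possible choice for the C integer.

```python
import ctypes


class ValueUnion(ctypes.Union):
    # Either a C integer or a pointer to a Python object shares the same bytes.
    _fields_ = [("as_c_long", ctypes.c_long), ("as_object", ctypes.c_void_p)]


class TaggedValue(ctypes.Structure):
    # The indicator says which member of the union is currently valid.
    _fields_ = [("kind", ctypes.c_int), ("value", ValueUnion)]


pointer_size = ctypes.sizeof(ctypes.c_void_p)
tagged_size = ctypes.sizeof(TaggedValue)
print(pointer_size, tagged_size)
```

On common 32-bit and 64-bit platforms the tagged slot comes out at twice the size of a plain pointer, because the indicator is padded up to the alignment of the union.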

msg2177 (view) Author: kayhayen Date: 2017-07-10.06:11:57
I think Nuitka allocates 1MB per thread, and uses a thread for generator or coroutine, or asyncgen 
(3.6 mixture). 
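
As a rough plausibility check of the 1MB guess against the numbers marcel reported in msg2175, assuming that build used the same 500-coroutine reproducer as msg2160:

```python
MIB = 1024 * 1024

coroutine_count = 500          # from the reproducer in msg2160
stack_per_coroutine = 1 * MIB  # the "1MB per thread" guess above

estimated_extra = coroutine_count * stack_per_coroutine
print(estimated_extra // MIB)  # 500

# msg2175: ~520MB with "async def" versus ~18MB with "@asyncio.coroutine"
observed_extra_mb = 520 - 18
print(observed_extra_mb)       # 502
```

The estimate and the observed difference are close, which fits one fixed-size stack per coroutine.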

Using a thread to make the context switch is an old decision that may or may not be good. CPython 
just switches frame pointers and uses the same stack for all, which is lower memory of course. 
When using @asyncio.coroutine that is what is happening to you, and likely it behaves equally 

Delayed allocation of the stack doesn't seem to work. The necessary size seems to be unknown.

For generators, which are few, this was deemed acceptable, but it definitely poses a scalability
problem. I recall unit tests that ran into 2GB limit issues when working with a lot of
generators, but that was only unit tests. For real coroutines, the point is to have many of them,
I figure.

Making every yield a function call to the continuation won't scale. So every yield has to be a 
"return", which on entry then resumes. This would mean more C functions and all work on the same 
shared space. Kind of a big change and not happening immediately but I will look into it next 
release cycle.

msg2175 (view) Author: marcel Date: 2017-07-09.10:33:57

I have the same behaviour on my Linux VM with Python 3.6.1 x32 and Nuitka 0.5.26.
Htop shows me a virtual memory usage of ~520MB when building with "async def
mycoro", but only ~18MB when building with "@asyncio.coroutine" or running

I'm not that deep into memory management, and the htop help adds "Most of the
time, this is not a useful number" to the VIRT/VSZ column, so is this something
I have to be concerned about?
I noticed this behaviour when switching to async def coroutines lately, because
my program uses a lot of coroutines and the virtual memory usage is now ~1.6GB.

Best Regards,

msg2160 (view) Author: marcel Date: 2017-07-02.08:48:42

The following code seems to allocate about 1GB of memory (Private Bytes) when
built with Nuitka (0.5.27rc4 and below) on Python 3.6.1 x64 on Windows 7 x64.

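import asyncio

async def main():

    async def mycoro():
        # The coroutine body is missing in the source; this sleep is an assumed placeholder.
        await asyncio.sleep(1)

    await asyncio.gather(*[mycoro() for _ in range(500)])

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    # The call that runs main() is missing in the source; this line is assumed.
    loop.run_until_complete(main())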

When changing

async def mycoro()

to

@asyncio.coroutine
def mycoro()

the memory consumption is only about 16MB (monitored with Process Explorer).

Date                 User      Action  Args
2018-08-23 23:19:54  kayhayen  set     status: chatting -> resolved; messages: + msg2500
2017-07-19 09:25:22  kayhayen  set     messages: + msg2181
2017-07-11 04:59:21  kayhayen  set     nosy: + kayhayen; keyword: + optimization; assignedto: kayhayen
2017-07-10 06:11:57  kayhayen  set     messages: + msg2177
2017-07-09 10:33:57  marcel    set     status: unread -> chatting; messages: + msg2175
2017-07-02 08:48:42  marcel    create