Title: Simple code twice as slow in Nuitka as CPython
Priority: bug    Status: in-progress
Superseder:    Nosy List: kayhayen, lesshaste
Assigned To: kayhayen    Keywords: optimization

Created on 2016-10-21.15:40:41 by lesshaste, last changed by kayhayen.

File name Uploaded Type
(name not preserved) kayhayen, 2016-10-29.07:10:46 text/x-python
(name not preserved) lesshaste, 2016-10-21.15:41:54 text/x-python
msg2263 (view) Author: kayhayen Date: 2017-10-21.09:27:10
The "goto generators" are coming. They will not switch "context", i.e. they will use no fibers at all. I 
suspect your change triggers the use of context stack allocation, and is therefore slower.

My suspicion is that the JIT code and the Nuitka code flow ought to be really close once goto 
generators are done. We will see.

msg2262 (view) Author: lesshaste Date: 2017-10-21.09:01:26
An update for Nuitka version 0.5.28. I modified one line to read "for i in xrange(19,20):", 
just to make the test run a little longer.

Nuitka: 13.6 seconds
CPython: 14.2 seconds
PyPy: 2.2 seconds
msg2058 (view) Author: kayhayen Date: 2016-11-14.08:16:15
The buildbots found only something unimportant, and the code seems fit, so I am going to push the qiter 
work to factory. The other stuff will be released real soon now; it apparently is pretty much 
good. On the shortened version, Nuitka is now faster than CPython, although not by as much as 
I would like.

I have added constructs that test the built-in sum specifically with generators, lists, and tuples, 
one for each. The results for generators are not yet good: only a 72% gain right now, up from only 
a 32% gain before. However, that's of course *still* a loss. It is being made up in other parts 
of the code, leading to generally faster execution. But Nuitka is still doing too much work to 
switch back and forth between generators.

The qiter work for tuples and lists brought the construct tests for those up from 101% (it 
basically wasn't using Nuitka at all, just sum with a tuple object as normal; the only difference 
was my slightly better sum implementation, which avoids creating and unreferencing a "0" Python 
object, hence the really small gain). With qiter, those cases are up to 150% and 170%; one iterator 
type seems more expensive than the other. Nuitka avoids those iterators and their usage with qiter, 
so that's much faster. I am not sure, though, how a list iterator should react to a list mutated 
during iteration; I need to test that. We wouldn't catch that.
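As a quick sanity check of what CPython's list iterator actually does under mutation (plain Python semantics, nothing Nuitka-specific), this is the behavior a qiter-style specialization would have to reproduce:

```python
# CPython's list iterator does not cache the length: it re-checks it
# on every step, so elements appended during iteration are visited.
items = [1, 2, 3]
seen = []
for x in items:
    seen.append(x)
    if x == 1:
        items.append(4)  # mutate the list mid-iteration
assert seen == [1, 2, 3, 4]  # the appended element was visited
```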

I was contemplating a variant of generators that uses multiple C functions, one for each 
phase, where the code works on a structure to store and access things, and then we could 
e.g. let those decide themselves how to work with the frame stack. Currently "yield 1" takes a 
frame stack just in case you throw into it. Having an entry point for each "yield" up to the 
next yield (or generator exit) would avoid the need to work with multiple C stacks, for 
which I don't really know how large to make them anyway.
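A Python-level sketch of that idea (my own illustration, not Nuitka's actual scheme): the generator becomes an object whose __next__ dispatches to the code between one yield and the next, so no separate C stack has to be suspended and resumed:

```python
# Hypothetical sketch of "one entry point per yield": the state lives
# in the object, and __next__ runs the phase between two yields.
class CountdownStateMachine:
    """Equivalent of: def countdown(n): yield n; yield n - 1"""
    def __init__(self, n):
        self.n = n
        self.state = 0  # which "phase" runs next

    def __iter__(self):
        return self

    def __next__(self):
        if self.state == 0:       # code up to the first yield
            self.state = 1
            return self.n
        if self.state == 1:       # code between first and second yield
            self.state = 2
            return self.n - 1
        raise StopIteration       # generator exit
```

Driving it with list() behaves like the equivalent generator would.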

This will mean that the C compiler won't see all the code working on that context anymore, so it 
won't be able to optimize as much. So I am not so sure if that will really help. I surely need 
to look for more ways to trim things down there.

Also, depending on the use case, only one cached C stack will not be enough, and there needs to 
be a free list for those too. A generator with a nested generator e.g. will right now be much 

msg2054 (view) Author: kayhayen Date: 2016-11-12.10:56:38
So, I removed setting the StopIteration at generator exit and turned it into a mere NULL 
return, only creating the StopIteration when needed (for the send and throw methods). The 
"PyIter_Next" API used everywhere is designed to clear it away should it exist, so this 
is even the right thing to do. The tp_iternext slot is right to return NULL. And doing it for 
qiter too, of course, is proper.
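At the Python level, the contract described here looks like this (plain CPython semantics; the C level merely avoids materializing the exception object until someone asks for it):

```python
# An exhausted generator signals exit via StopIteration; iteration
# consumers like for loops and sum() absorb that signal silently.
def g():
    yield 1

it = g()
assert next(it) == 1
exhausted = False
try:
    next(it)
except StopIteration:
    exhausted = True
assert exhausted
assert sum(g()) == 1  # sum() consumes the StopIteration itself
```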

This shoves things down to 1018.. from 1031.., so that by itself is a relatively huge gain of 
1.2%, and the construct that checks generator close will be much better. Binaries will also be 
smaller, dropping per-generator code, so that's great. 

Looking at the code, I noticed however that the next big gain to be had on the code generation 
level would be to use qiter inside the actual generator's iteration, when going over the 
tuples from "itertools.combinations". But for that, I would first like to add sum 
constructs to measure, which will be my next step.

msg2053 (view) Author: kayhayen Date: 2016-11-12.08:34:26
So I came up with a "qiter" structure that has the create/next operations, and no 
release yet. Not having a release is a mistake, but not relevant for measurements.

So far only "generator" and "generic" are implemented, and I am scrubbing off 
unnecessary stuff.

The valgrind ticks went down from 1033.. to 1031.., so that's 0.2% right there, just for 
saving having a generator iterator and going through its tp_iternext. The "qiter" 
slot of generators does more work than necessary. I am going to scrub off 
operations from it.

One example, which should also benefit non-qiter usage: it checks whether the 
"send" value is "None", and raises an error if it's not. But that check is only 
ever useful for the send method, never for iternext, which sends None into the 
generator, as does "throw".

The basic error here is that throw, send, and iternext all share one method, 
where they could be specialized. I am not sure how far I want to go there; this is 
going to give only very low percentiles. Moving the None check out of there and 
making the others use a reduced function is probably all right.
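For reference, the Python-level behavior that shared method implements (standard generator semantics, not Nuitka-specific code):

```python
# next(gen) is equivalent to gen.send(None); sending a non-None value
# into a not-yet-started generator is a TypeError — the check that the
# shared throw/send/iternext path performs on every call.
def g():
    x = yield 1
    yield x

gen = g()
assert gen.send(None) == 1     # same as next(gen): starts the generator
assert gen.send(42) == 42      # the sent value appears at the paused yield

gen2 = g()
failed = False
try:
    gen2.send(42)              # can't send a value before starting
except TypeError:
    failed = True
assert failed
```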

Also, I might want to tune how GeneratorExit and StopIteration are communicated to the 
outside world. Creating them as real exceptions is no benefit for the quick 
iteration. Avoiding that may also streamline code generation. I need to look 
at how this actually works.

The actual benefit of qiter for generators, in my mind, should become large 
enough that the currently failing tests for the other quick cases, lists, tuples, etc., are 
compensated for, because those will really become way faster. For generators, 
there is not as much to be had, relatively speaking.

msg2052 (view) Author: kayhayen Date: 2016-11-10.07:02:04
While untangling the code that handles "int" or "long" so it can do both, I noticed that the boolean 
values False and True also do not match on Python2. So this would be sped up too:

def g():
    yield True
    yield False


These are now also fully computed in C, as opposed to falling back to int/long objects. That 
optimization might not be too important, but summing up how many conditions are true is 
actually a thing I stumbled over in the CPython test suite (when the fallback was still buggy), 
and I liked the idea of it. The use of small longs on Python2 is unclear, but could happen for 
some people too. Adding extra cases before falling back to objects is always nearly free, 
so that's good stuff still.
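The pattern in question looks like this (example data is mine; the semantics are plain Python):

```python
# Summing booleans counts how many conditions hold: bool is a subclass
# of int, with True == 1 and False == 0, so sum() over a generator of
# comparisons is just integer addition.
words = ["alpha", "beta", "gamma", "delta"]
count = sum(w.startswith(("a", "g")) for w in words)
assert count == 2              # "alpha" and "gamma" match
assert isinstance(True, int)   # why this works at all
```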

Now I will try to make dedicated iteration code for tuples, lists, and generators, avoiding the 
use of iterator objects entirely. This should give a good extra acceleration too. And it might be 
prototypical for what we can do in Nuitka with iterator loops in terms of code specialization, 
even when there is still a bit of iterator tracing missing to actually use that in real-world 
code outside of sum. 
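A rough Python-level picture of what "avoiding the iterator object" means for a list (my sketch; the real work happens in generated C code with no iterator object and no tp_iternext calls):

```python
# When the iterated object is statically known to be a list, the
# generic iterator protocol can be replaced by direct indexing.
def sum_list_direct(lst):
    total = 0
    i = 0
    n = len(lst)
    while i < n:
        total += lst[i]  # direct element access, no iterator object
        i += 1
    return total

assert sum_list_direct([1, 2, 3]) == sum([1, 2, 3]) == 6
```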

msg2051 (view) Author: kayhayen Date: 2016-11-08.06:16:27
So the free list stuff seems bug-free now, and will soon be part of a pre-release, and sum 
starts to look good. Getting part of it wrong is quite easy, and hard to debug except in 
full test suite runs.

For the sum in the reduced case I uploaded as "", the new sum() implementation, 
without any actual optimization towards handling the fact that we deal with a generator, 
already gives a reduction to 77% of the run time. Or, in rough measurements (tests running 
in the background), from 30ms to 24ms.

I have noticed that "long" is breaking sum optimization on Python2, although that 
strictly wouldn't be needed. The same goes for "bool" on Python3, and I have seen pretty 
plausible code that sums booleans. Since they are really ints on Python2, those should work. 
I have yet to add construct-based tests for sum, but those all appear to be huge opportunities 
to outperform CPython for that C built-in.

After adding those, and seeing what it does to various inputs, specializing for 
generators seems the likely next step. Avoiding going through the iterator, and the 
exception save/restore considerations, should give an extra boost.

After that, I think, the things that can be done on that level have been exhausted, and 
statically inlining the generator will be next.

msg2030 (view) Author: kayhayen Date: 2016-11-01.18:17:34
So the compiled cell implementation is becoming workable; I am still fixing findings of the 
buildbots, and just pushed needed fixups.

The free list implementation shaved 2% off the whole program, but I decided to push 
all the work into the cell allocation, not just free list usage, and now it's only 1% of 
the whole program.

That's still something. But the C function call overhead ought to be spent better, e.g. by 
allocating multiple cells at once. Your function code e.g. needs 2 cells, which 
could be created all at once, or right when the generator is created, instead of 
doing it afterwards. That might well shave off a bit more.

For now I want to stabilize this code some more. If I am right, that other change might 
mean about 2% by itself. Which is good, but not dramatic, and can follow later.

I will now look into the sum() built-in itself. We use the C function from the normal built-
in. As a dedicated helper function, it will be faster to call (no argument parsing), and 
the two variants, with 2 and 1 arguments, might be specialized.

And I am pretty sure that sum on a compiled generator might be optimized too. Driving it 
via an iterator and sending a next to it could easily be shortcut. And the exception save 
and restore mechanism, which now runs per "next", could be pushed to the outside and 
done in sum. That would be huge savings too. But also pretty specific code, of 
course, so I am not so sure I want to go there. I shall see.

msg2025 (view) Author: kayhayen Date: 2016-11-01.06:41:57
I just pushed another improvement to factory, which is to use the free list with variable-sized 
generators that have the closure storage attached to them.

That extra avoided malloc/free pair shaves another ms off your test case, putting it at 27ms.

The graphs are updated; unfortunately my buildbot currently has no idea of priorities after 
updating to 0.9 this week, so they lag behind, and this is not yet what I pushed, but it will update.

We are about on par with CPython for generator expression usage in factory. One probably has to be 
happy with that, because Nuitka does a switcheroo with the frame stack, exceptions, and so on, which 
CPython does not. I do however wonder if it wouldn't be possible to treat generators as if they were 
their own threads, for which that wouldn't be necessary. That of course would be a drastic change, 
but most likely one that gets us somewhere. More thought on that will be needed.

For generator creation, Nuitka was already better than CPython, but not really by all that much. Even 
before my most recent change to make it even faster, the free list alone was already more than twice 
as fast. I hope to see the attached storage for cells take off even more.

Obviously, once released, other versions will go lower too.

This was a pretty ugly change, which led to a bit of segfaulting. I hate to go into those, 
because they are never easy to resolve when the corruption only shows later. The buildbots are 
currently pounding on it; let's hope it's good.

Obviously that change, storing the cells inside the generator object, also needs to be done for 
coroutine objects and, even more importantly, normal function objects. The latter especially share 
the same kind of approach, but were not migrated to the new stuff. I am using generators to check 
things out here. If they are good, functions will follow after them. And compiled coroutines are 
merely a more modern clone of generators anyway. 

So, on to compiled cells it is. The first step obviously is for them to become our own extension 
type, before making any other changes, like adding the free-list-controlled allocation/release, 
which will then be faster.

msg2020 (view) Author: lesshaste Date: 2016-10-31.13:10:08
Thank you so much for looking at this seriously. It's exciting to see the progress.
msg2019 (view) Author: kayhayen Date: 2016-10-31.12:56:42
So, looking at things in Valgrind, I noticed that the "dealloc" of cell objects is now the top user.

The free list that I added for generators would be perfect for this too. For CPython, the frame 
objects are cached and allocated with the cells already there, or so I think, but for Nuitka that 
is not the case.

I went to the point where I checked with grep that PyCell_Check is not used anywhere in CPython 
where that would be a problem. And therefore I think we can safely say we can have one or multiple 
compiled PyCell implementations and nobody will notice.

These could have a free list implementation too, saving the interaction with memory allocation and 
release. According to Valgrind this should be the top CPU waste right now; although the allocation 
is not visible, as it's done in inlined code, it should also not be small.

Then, creating the generator object from a free list per size, putting the cells in there, and 
creating them in two steps, might streamline the creation of generators even further. Plus, as a 
bonus, access will be faster, as there is one pointer less to follow. Now it's 
"generator->closure[1]" for the second cell; then it would be "generator.closure[1]", which is one 
C-level pointer lookup less. It then also needs no malloc/free, and the creator can copy things 
there more directly.

I will try to get these in and see what it does, but I am very optimistic these will reduce the 
relative overhead to CPython to the point where we might surpass it.
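For reference, the cell objects being discussed are visible from Python itself (standard CPython behavior, nothing Nuitka-specific):

```python
# Closure variables live in cell objects; each nested function that
# captures a variable holds a reference to its cell via __closure__.
def outer():
    st = (1, 2, 3)
    def gen():
        for j in st:   # 'st' is a closure variable, stored in a cell
            yield 7
    return gen

g = outer()
assert g.__closure__ is not None
assert g.__closure__[0].cell_contents == (1, 2, 3)
assert sum(g()) == 21  # 7 for each of the 3 elements
```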

msg2013 (view) Author: kayhayen Date: 2016-10-30.19:18:06
More slightly good news. I looked at the performance of creating generator 
objects, and for your test case the improvement I made shaves off another 10%, 
giving a 20ms to 28ms (was 31ms) race, better than before.

Nuitka now has a free list of generator objects, which makes the use case of 
creating and deleting them frequently avoid going through memory allocation, 
and instead reuse what's already there. Capped to 100 instances, this works 
rather nicely.
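A minimal Python sketch of the free-list pattern being described (all names here are mine, not Nuitka's; the real implementation is C):

```python
# Released objects go on a capped stack and are handed out again on the
# next allocation, instead of going through malloc/free every time.
FREE_LIST_MAX = 100
_free_list = []

class GenState:
    """Stand-in for a compiled generator object."""
    def reset(self):
        self.closure = None

def alloc_gen():
    if _free_list:
        return _free_list.pop()   # reuse a cached object
    return GenState()             # fall back to a fresh allocation

def release_gen(gen):
    if len(_free_list) < FREE_LIST_MAX:
        gen.reset()
        _free_list.append(gen)    # cache for reuse; beyond the cap, drop it

g1 = alloc_gen()
release_gen(g1)
g2 = alloc_gen()
assert g1 is g2  # the released object was reused
```

The cap keeps a burst of generator creation from pinning memory forever.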

I am now contemplating changing Nuitka to allocate the space for closure 
variables inside the generator object, avoiding malloc/free pairs for those.

Creating and deleting generators really seems to be where the remaining work of 
the test case is. So tuning that will be my goal.

msg2011 (view) Author: kayhayen Date: 2016-10-30.10:35:46
So, for a reduced version, I managed to get a substantial speedup. The issue was that for generators, 
on Python2 only, exceptions are saved and restored, but this was done wastefully.

a) The exception type "None", which is very frequent of course, meaning no exception context is 
active right now, was still saving and restoring the traceback and value, where in fact it should 
disable doing anything.

b) Worse, when restoring, Nuitka was also setting the sys.exc_type, sys.exc_value, and 
sys.exc_traceback values to "None", and that's 3 dictionary updates for no good reason. 
Painfully slow of course.

So the numbers for the generator usage construct are way better now: the gain used to be 66.8%, 
which means 2/3 as slow as CPython, or 243155834 ticks, and now it is just 106755678 ticks, which is 
about a 250% speedup of Nuitka compared to itself, or a gain of 152.2%, which is near acceptable.

However, the program is still slower, so I need to investigate more. My reduced version is still 
0.3s for Nuitka and 0.2s for CPython. So there still must be a construct that is more expensive for 
Nuitka than it is for CPython. But that's good; it's rare to find constructs this isolated. Your 
program's performance (comparing CPython and Nuitka) purely hinges on the generator performance.

When I looked at things, I found that taking the closure variables for the generator might be what 
is more expensive, although it is entirely unnecessary.

This is the bastard version of your program that I currently look at:

from itertools import chain, combinations

def perm_ryser(a):
    maxn = len(a)
    n_list = range(1,maxn+1)
    s_list = chain.from_iterable(combinations(n_list,i) for i in range(maxn+1))

    for st in s_list:
        sum(7 for j in st)

When I talk about closure variables, I am talking about "st" here, and maybe "j", which is local to 
the generator, but for persistence reasons might be stored the same way, I forgot.

Passing "st" by value to the generator would be more efficient. But that takes better analysis 
of the generator object's lifetime, which is not there yet. Also, this should not be a difference 
from CPython, which does this absolutely the same way.

I need to look more closely; there has to be something more low-hanging too.

msg2001 (view) Author: kayhayen Date: 2016-10-29.12:01:55
The approach for me is as follows:

a) Optimize generic Python constructs to all be faster. I thought Nuitka was there, but 
for generators I now realize it is not. Other cases might still exist; it is hard to cover 
everything, and then to not make mistakes in doing so.

b) Then do real optimization. For the generator expression and sum, it would be sweet to 
inline generators. No generator is needed for that sum really; it could be 
inlined. Nuitka doesn't yet inline functions that might raise, and doesn't inline generators 
at all. This shall be attacked. If you follow the mailing list, you will notice that I am 
working on iteration and type inference primarily right now.

The bit that is holding back inlining of raising code is the one that limits local 
variables to existing within one function. So a frame is too tied to a function for code 
generation; I need to resolve that eventually. Inlining the generator so its yields become 
mere assignments of the iteration result would be sweet.

In that case, we would be speaking of a dramatic speedup, I suppose. Even though 
generators are fast, not using them will be way faster of course.
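Concretely, for the sum in this test case, inlining would amount to something like this (my illustration of the intended transformation, not Nuitka output):

```python
# The yield becomes a mere assignment inside the consuming loop, and
# no generator object exists at all.
st = (1, 2, 3)

# original form: creates and drives a generator object
original = sum(7 for j in st)

# inlined form: plain loop, no generator
inlined = 0
for j in st:
    inlined += 7

assert original == inlined == 21
```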

c) Then type inference. For sum to know the result type, the operations on it might 
become C level. We would have to have type inference across from_iterable and 
combinations, but that feels doable. Nuitka should become able to create C-level 
integer addition code with overflow fallback to Python "long" code, which wouldn't fire 
for you.

With this in place, Nuitka indeed would be as fast as reasonably possible on this code, 
but it's a lot of work.

For this issue, I am doing the "yield" performance stuff. But I will also think about how 
to inline a generator that gets used with only one "next", like the one your sum is 
using, and then, for "sum" or even "list", how to support alternative Python-level code 
to use for optimization. 

That's most likely not rocket science. My expectation would be that PyPy also has no 
generator object in use. I do however assume it does not use C-level integers all that 
much; otherwise it would converge to C performance more closely, which you say it does 
not. Generally however, PyPy is not of much concern when thinking of Nuitka, due to the 
very different level of design.

One day I hope to have profile-guided optimization. Then it will be possible to make 
speculative code with C-level performance for the cases where it applies.

I really like your example code, as it's clean code with no hand optimization applied 
that harms readability. I will try to work a while on making it run as smoothly as 
possible. Also on other things, e.g. recognizing "sum" as the built-in at all.

But first steps first. I still lack an idea of what exactly is slow about generator usage 
right now.

msg2000 (view) Author: kayhayen Date: 2016-10-29.08:07:19
Nah, it's not the creation. While there is only a 150% gain there, the generator usage test 
was making only one "next(gen)" call per generator, and probably the better "next()" 
built-in handling offset that. But increasing the usage of the generator to 
"list(gen)" shows the issue.

I now just need to convince myself whether closing the generator is the issue, or whether 
it's more expensive to switch stacks for them. Either way, the issue is found.

I remember that on my arch, x86_64, there is assembler code for the context switch 
that should be faster. So maybe that explains why my numbers are a bit better 
than yours. But that would also be hiding the symptom, which would be slower yield 
handling or generator exit.

msg1999 (view) Author: kayhayen Date: 2016-10-29.07:54:52
I added a few tests for constructs that had no comparisons, always a good idea, but I think I 
drilled it down to this:

from itertools import chain, combinations

def perm_ryser(a): # Ryser's formula, using matrix entries
    maxn = len(a)
    n_list = range(1,maxn+1)
    s_list = chain.from_iterable(combinations(n_list,i) for i in range(maxn+1))

    for st in s_list:
        sum(7 for j in st)

Notice how this now just repeats creating and using a generator object a lot, and that 
basically is all the effort. I do have construct tests for the usage of generators, and they 
are fast, but I will add one for their creation; the suspect is that it's more expensive for 
Nuitka right now than it is for CPython.
msg1998 (view) Author: lesshaste Date: 2016-10-29.07:17:36
It might be worth mentioning that almost all the time taken by PyPy is in the 
warm-up, which Nuitka shouldn't suffer from. If you run it in a loop in PyPy, 
the time is more like 0.2 seconds per iteration. In C it takes around 0.005 seconds.
msg1997 (view) Author: kayhayen Date: 2016-10-29.07:13:15
I have slightly better results like this, 6.7s to 9.5s, but still worse, and 
definitely not where it should be; I would hope for 3.x s, i.e. at least a 2-fold 
speedup, as the minimum.

msg1996 (view) Author: kayhayen Date: 2016-10-29.07:10:46
I am attaching the version I am using now, and removing the other; I just 
changed away from star imports, to be more explicit. I will try to further 
reduce the test case, and e.g. get rid of "random", which just creates noise for 
performance measurement.

Thanks a lot for providing the test case. I was too busy with newborn 
bureaucracy this week, but I really love getting this kind of report.

msg1989 (view) Author: lesshaste Date: 2016-10-21.15:40:41
The attached simple code takes 15.9 seconds in Nuitka 0.5.22 and 8.8 seconds in CPython.

By way of comparison, it takes 1.4 seconds in PyPy 5.4.1.
Date User Action Args
2017-10-21 09:27:10kayhayensetmessages: + msg2263
2017-10-21 09:01:26lesshastesetmessages: + msg2262
2016-11-14 08:16:15kayhayensetmessages: + msg2058
2016-11-12 10:56:38kayhayensetmessages: + msg2054
2016-11-12 08:34:26kayhayensetmessages: + msg2053
2016-11-10 07:02:04kayhayensetmessages: + msg2052
2016-11-08 06:16:27kayhayensetstatus: chatting -> in-progress
messages: + msg2051
2016-11-01 18:17:34kayhayensetmessages: + msg2030
2016-11-01 06:41:57kayhayensetmessages: + msg2025
2016-10-31 13:10:08lesshastesetmessages: + msg2020
2016-10-31 12:56:42kayhayensetmessages: + msg2019
2016-10-30 19:18:06kayhayensetmessages: + msg2013
2016-10-30 10:35:46kayhayensetmessages: + msg2011
2016-10-29 12:01:56kayhayensetmessages: + msg2001
2016-10-29 08:07:19kayhayensetmessages: + msg2000
2016-10-29 07:54:52kayhayensetmessages: + msg1999
2016-10-29 07:17:36lesshastesetmessages: + msg1998
2016-10-29 07:13:15kayhayensetmessages: + msg1997
2016-10-29 07:10:46kayhayensetfiles: +
status: unread -> chatting
keyword: + optimization
nosy: + kayhayen
messages: + msg1996
assignedto: kayhayen
2016-10-21 15:41:54lesshastesetfiles: +
2016-10-21 15:40:41lesshastecreate