nskybytskyi's blog

By nskybytskyi, history, 3 years ago, In English

I was upsolving AtCoder Beginner Contest 195 problem F, aka "Coprime Present", where you need to count subsets (of a special set of size < 72) with pairwise coprime elements.

My solution is a bitmask dp over the prime factors used so far and a prefix of the set. My first submission.
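Here is a minimal sketch of that dp, reconstructed from the description above rather than copied from the submission. The set is the range of consecutive integers from $$$A$$$ to $$$B$$$, so any prime shared by two of its elements divides their difference and is therefore below 72; a 20-bit mask over those primes is enough.

```cpp
#include <bits/stdc++.h>
using namespace std;

int main() {
    // Any prime shared by two numbers in [A, B] divides their difference,
    // which is below 72, so only these 20 primes matter for coprimality.
    const vector<int> primes = {2,  3,  5,  7,  11, 13, 17, 19, 23, 29,
                                31, 37, 41, 43, 47, 53, 59, 61, 67, 71};
    const int P = (int)primes.size();  // 20 primes -> 2^20 masks

    long long a, b;
    cin >> a >> b;
    const int n = (int)(b - a + 1);  // small by the constraints

    // dp[i][mask]: ways to pick a pairwise coprime subset of the first i
    // numbers such that the set of small primes they use is exactly mask.
    vector<vector<long long>> dp(n + 1, vector<long long>(1 << P, 0));
    dp[0][0] = 1;
    for (int i = 0; i < n; ++i) {
        int m = 0;  // mask of the small primes dividing a + i
        for (int j = 0; j < P; ++j)
            if ((a + i) % primes[j] == 0) m |= 1 << j;
        for (int mask = 0; mask < (1 << P); ++mask) {
            dp[i + 1][mask] += dp[i][mask];          // skip a + i
            if ((mask & m) == 0)
                dp[i + 1][mask | m] += dp[i][mask];  // take a + i
        }
    }
    cout << accumulate(dp[n].begin(), dp[n].end(), 0LL) << '\n';
    return 0;
}
```

Storing all $$$n + 1$$$ layers of $$$2^{20}$$$ 64-bit counters is what eats the 600+ MB.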

Fairly straightforward, nothing smart: $$$O(N \cdot 2^n)$$$ time and memory, where $$$N$$$ is the size of the set and $$$n = 20$$$ is the number of primes below 72. However, I was disappointed by the memory usage of 600+ MB. Therefore, I implemented a two-layer version of the same dp. Here is my second submission.

The difference is minimal: I just index the layers by layer_number & 1 and make sure to clear the previous layer before proceeding to the next one; see the sketch below.
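Concretely, the dp portion of the sketch above changes to something like this (again my reconstruction, not the literal submission):

```cpp
// Only the dp storage and indexing change: keep two layers and alternate.
vector<vector<long long>> dp(2, vector<long long>(1 << P, 0));
dp[0][0] = 1;
for (int i = 0; i < n; ++i) {
    const int cur = i & 1, nxt = cur ^ 1;  // layer index = layer_number & 1
    // Clear the layer we are about to write into; it still holds
    // stale values from two iterations ago.
    fill(dp[nxt].begin(), dp[nxt].end(), 0);
    int m = 0;
    for (int j = 0; j < P; ++j)
        if ((a + i) % primes[j] == 0) m |= 1 << j;
    for (int mask = 0; mask < (1 << P); ++mask) {
        dp[nxt][mask] += dp[cur][mask];
        if ((mask & m) == 0) dp[nxt][mask | m] += dp[cur][mask];
    }
}
cout << accumulate(dp[n & 1].begin(), dp[n & 1].end(), 0LL) << '\n';
```

Two layers of $$$2^{20}$$$ 64-bit counters take 16 MB, which is consistent with the observed 30 MB total.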

Obviously, it uses way less memory, only 30 MB. What surprised me is that the running time has also decreased, from 472 ms to just 147 ms, an improvement by a factor of 3!

I have two hypotheses on why this happens:

  • two layers fit into a better cache level, facilitating memory lookups;

  • memory allocation itself takes some time (there must be a reason why people write custom allocators, right?).

Can someone more experienced please tell me what actually happens there?

  • Vote: I like it
  • +106
  • Vote: I do not like it

»
3 years ago, # |
  Vote: I like it +38 Vote: I do not like it

Memory allocation

  • »
    »
    3 years ago, # ^ |
    Rev. 2   Vote: I like it +9 Vote: I do not like it

    Yeah, 30 MB is too much to fit into cache on any level, so it's purely a memory-allocation-related speedup.

    • »
      »
      »
      3 years ago, # ^ |
        Vote: I like it +93 Vote: I do not like it

      meme

    • »
      »
      »
      3 years ago, # ^ |
      Rev. 2   Vote: I like it +1 Vote: I do not like it

      It's not about cache, it's about memory allocation. It's obvious that one layer wouldn't fit into cache, so there is no difference in that regard.

      UPD: I didn't understand your message at first, but it seems that you meant the same thing as I did :) Hope this comment will clarify our statements.

      • »
        »
        »
        »
        3 years ago, # ^ |
          Vote: I like it +8 Vote: I do not like it

        I just edited my comment to make it more clear :)

      • »
        »
        »
        »
        3 years ago, # ^ |
        Rev. 3   Vote: I like it +10 Vote: I do not like it

        I don't know how accurate it is, but "sudo perf record -e cache-misses solution" shows 5 times more cache misses for the first solution than for the second on my machine (these counters may work differently on different CPUs, and obviously different CPUs may have different caches).

        PS. Hm, the cache-miss percentage is almost the same: 75% for both solutions. That is because the second solution has not only 5 times fewer cache misses, but also 5 times fewer cache references.

  • »
    »
    3 years ago, # ^ |
    Rev. 2   Vote: I like it +24 Vote: I do not like it

    The problem is not memory allocation itself, but rather high memory utilization. So it seems to me that it is actually closer to a caching issue than to expensive allocations.

    Two submission examples:

    First: a huge amount of memory is allocated once at the start of the run, but since it stays unused, the running time is low (17 ms on the smallest test).

    Second: a huge amount of memory is allocated, but gradually, and the previous dp layers are freed, so only two layers are allocated at any given moment. The maximum running time is the same as in the two-layer solution (170 ms), but the total allocated memory is still huge.
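    Presumably the second test did something along these lines (a hypothetical reconstruction, not the actual submission; the dp transition is replaced by a dummy pass over the layer):

    ```cpp
    #include <chrono>
    #include <cstdint>
    #include <cstdio>

    // Allocate a fresh 8 MB layer on every dp step and free the previous
    // one. The total memory requested over the run is ~600 MB, but only
    // two layers are ever live, so the allocator can keep handing back
    // the same couple of blocks.
    int main() {
        constexpr int kMasks = 1 << 20;  // 2^20 masks, 8 MB per layer
        constexpr int kLayers = 73;
        auto t0 = std::chrono::steady_clock::now();
        int64_t* prev = new int64_t[kMasks]();  // () value-initializes to 0
        for (int i = 0; i < kLayers; ++i) {
            int64_t* cur = new int64_t[kMasks]();
            for (int mask = 0; mask < kMasks; ++mask)
                cur[mask] = prev[mask] + 1;  // stand-in for the dp transition
            delete[] prev;  // freed block is the obvious candidate for next new[]
            prev = cur;
        }
        delete[] prev;
        auto elapsed = std::chrono::steady_clock::now() - t0;
        std::printf("%lld ms\n", (long long)std::chrono::duration_cast<
                                     std::chrono::milliseconds>(elapsed).count());
        return 0;
    }
    ```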

    • »
      »
      »
      3 years ago, # ^ |
        Vote: I like it +1 Vote: I do not like it

      I think I just don't understand how this works: why is allocating huge memory and not using it fast? Godbolt says the allocation does happen, so there is no compiler help on that side.

      Maybe the compiler optimized something in the second example too? However, I'm not sure that's even possible, as far as I know. On the other hand, maybe the system decides to hand out the same memory block on each allocation, so it is indeed equivalent to the two-layer dp?

      • »
        »
        »
        »
        3 years ago, # ^ |
        Rev. 2   Vote: I like it +31 Vote: I do not like it

        C++ new int64_t[...] and delete[] operate on heap memory. Since the requested memory blocks are all of the same size, it indeed works almost exactly like a small fixed pool of arrays.

        As for the difference in time, it is simply due to the difference in the number of active memory pages. The more pages your process touches, the more frequent page faults become. Allocating a huge unused array is fast because the actual page allocation happens lazily.
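        A toy program that makes the laziness visible; it assumes typical Linux behavior, where a large allocation only reserves address space and physical pages are mapped on first write:

        ```cpp
        #include <chrono>
        #include <cstdint>
        #include <cstdio>

        // new[] without value-initialization reserves ~600 MB of address
        // space but touches no pages; the first pass of writes then pays
        // for one page fault per 4 KB page.
        int main() {
            constexpr size_t kCount = 75'000'000;  // ~600 MB of int64_t
            using clk = std::chrono::steady_clock;
            auto ms = [](clk::time_point s) {
                return (long long)std::chrono::duration_cast<
                    std::chrono::milliseconds>(clk::now() - s).count();
            };

            auto t = clk::now();
            int64_t* a = new int64_t[kCount];  // uninitialized: no pages touched
            std::printf("allocate only: %lld ms\n", ms(t));

            t = clk::now();
            for (size_t i = 0; i < kCount; i += 512)  // one write per 4 KB page
                a[i] = 1;
            std::printf("first touch:   %lld ms (a[0] = %lld)\n",
                        ms(t), (long long)a[0]);
            delete[] a;
            return 0;
        }
        ```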

»
3 years ago, # |
  Vote: I like it -53 Vote: I do not like it

Obviously, it uses way less memory, only 30 MB. What surprised me is that the running time has also decreased, from 472 ms to just 147 ms, an improvement by a factor of 3!

Improved by a factor of 3, not 3!

»
3 years ago, # |
Rev. 3   Vote: I like it -8 Vote: I do not like it

upd: deleted

»
3 years ago, # |
Rev. 2   Vote: I like it +11 Vote: I do not like it

memory allocation itself takes some time (there must be a reason why people write custom allocators, right?).

That's more important when you have a huge number of allocations (which is not the same as a large total allocated memory), for example, allocating individual tree nodes. A sketch of the idea is below.
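For illustration, this is the kind of fixed pool people use when per-node allocations dominate; the names and sizes here are made up:

```cpp
#include <cstddef>

// A minimal fixed-pool allocator of the kind used for tree nodes in
// competitive programming: one upfront array, pointer-bump "allocation",
// no per-node heap traffic.
struct Node {
    int key = 0;
    Node* left = nullptr;
    Node* right = nullptr;
};

constexpr std::size_t kPoolSize = 1 << 20;
Node pool[kPoolSize];
std::size_t pool_top = 0;

Node* new_node(int key) {
    Node* n = &pool[pool_top++];  // no bounds check: a sketch, not a library
    n->key = key;
    return n;
}

int main() {
    Node* root = new_node(1);
    root->left = new_node(2);   // each "allocation" is just a pointer bump
    root->right = new_node(3);
    return 0;
}
```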