I see a lot about source codes being leaked and I’m wondering how it that you could make something like an exact replica of Super Mario Bros without the source code or how you can’t take the finished product and run it back through the compilation software?

  • Dark Arc@social.packetloss.gg
    link
    fedilink
    English
    arrow-up
    66
    ·
    11 months ago

    I actually work on a C++ compiler… I think I should weigh in. The general consensus here that things are lossy is correct but perhaps non-obvious if you’re not familiar with the domain.

    When you compile a program you’re taking the source, turning into a graph that represents every aspect of the program, and then generating some kind of IR that then gets turned into machine code.

    You lose things like code comments because the machine doesn’t care about the comments right off the bat.

    Then you lose local variable and function parameter names because the machine doesn’t care about those things.

    Then you lose your class structure … because the machine really just cares about the total size of the thing it’s passing around. You can recover some of this information by looking at the functions but it’s not always going to be straight forward because not every constructor initializes everything and things like unions add further complexity … and not every memory allocation uses a constructor. You won’t get any names of any data members/fields though because … again the machine doesn’t care.

    So what you’re left with is basically the mangled names of functions and what you can derive from how instructions access memory.

    The mangled names normally tell you a lot, the namespace, the class (if any), and the argument count and types. Of course that’s not guaranteed either, it’s just because that’s how we come up with unique stable names for the various things in your program. It could function with a bunch of UUIDs if you setup a table on the compilers side to associate everything.

    But wait! There’s more! The optimizer can do some really wild things in the name of speed… Including combining functions. Those constructors? Gone, now they’re just some more operations in the function bodies. That function you wrote to help improve readability of your code? Gone. That function you wrote to deduplicate code? Gone. That eloquent recursive logic you wrote? Gone, now it’s the moral equivalent of a giant mess of goto statements. That template code that makes use of dozens of instantiated functions? Those functions are gone now too; instead it’s all the instantiated logic puked out into one giant function. That piece of logic computing a value? Well the compiler figured out it’s always 27, so the logic to compute it? Gone.

    Now all of that stuff doesn’t happen every time, particularly not all of those things are always possible optimizations or good optimizations … But you can see how incredibly difficult it is to reconstruct a program once it’s been compiled and gone through optimization. There’s a very low chance if you do reconstruct it, that it will look anything like what you started with.

    • Treczoks@lemmy.world
      link
      fedilink
      arrow-up
      14
      arrow-down
      1
      ·
      11 months ago

      Just wait until you see the crazy optimizers for embedded systems. They take the complete code of a system into consideration, and, in a number of compile passes, reuses code snippets from app, libraries, and OS layer to create one big tangled mess that is hard to follow even if you have the source code…

      • noli@programming.dev
        link
        fedilink
        arrow-up
        4
        ·
        11 months ago

        Isn’t that still the same exact process as a normal compiler except in the case of embedded systems your OS is like a couple kilobytes large and just compiled along with the rest of your code?

        As in, are those “crazy optimizations” not just standard compiler techniques, except applied to the entire OS+applications?

        • Treczoks@lemmy.world
          link
          fedilink
          arrow-up
          5
          arrow-down
          1
          ·
          11 months ago

          In a way, yes. But it really creates a mess when the linker starts sharing code between your code of which you have sources, and then jumps in the middle of system code for which you don’t have sources. And a pain in the whatever to debug.

          • noli@programming.dev
            link
            fedilink
            arrow-up
            2
            ·
            11 months ago

            Don’t you have the code in most cases? Like with e.g. freeRTOS? That’s fully open source

              • noli@programming.dev
                link
                fedilink
                arrow-up
                1
                ·
                11 months ago

                Does commercial mean closed source in this context though? It seems like a waste of resources not to provide the source code for an rtos.

                Considering how small in size they tend to be + with their power/computational constraints I can’t imagine they have very effective DRM in place so it shouldn’t take that much to reverse engineer.

                May as well just provide the source under some very restrictive license.

                • Treczoks@lemmy.world
                  link
                  fedilink
                  arrow-up
                  1
                  ·
                  11 months ago

                  Yes, it is closed source, but you can buy a “source license”. Which is painfully expensive.

        • morhp@lemmynsfw.com
          link
          fedilink
          English
          arrow-up
          4
          ·
          11 months ago

          The main difference is that when you compile a program for Windows, Linux etc., you have an operating system and kernel with their exposed functions/interfaces so even in a compiled program it’s pretty easy to find the function calls for opening a file, moving a window, etc. (as long as the developer doesn’t add specific steps hiding these calls). But in an embedded system, it’s one large mess without any interfaces apart from those directly on the hardware level.