[HN Gopher] Data-Oriented Programming in Python
___________________________________________________________________
Data-Oriented Programming in Python

Author : brilee
Score  : 119 points
Date   : 2022-11-26 17:45 UTC (1 days ago)

(HTM) web link (www.moderndescartes.com)
(TXT) w3m dump (www.moderndescartes.com)

| hgibbs wrote:
| I'd like to plug riptable (https://github.com/rtosholdings/riptable), which is (more-or-less) a performance upgrade to pandas.

| anigbrowl wrote:
| Looks nice!

| duped wrote:
| I'm curious how you would do data oriented programming in a language with no type system and no control over memory layout. And I guess the answer is "you can't, but JITs might exist someday that do it for you"
|
| But you can't wave your hands around and say compiler optimizations will fix performance problems - they can, but they're not magic, and the arrow in the proverbial knee for optimization passes is language semantics that make them impossible to realize (forcing the authors to either abandon the passes, or rely on things like dynamic deoptimization, which is not free).

| jessermeyer wrote:
| That's basically a contradiction in terms. Orienting the program structure around the data necessarily requires control over memory layout and how it is interpreted.

| mumblemumble wrote:
| > I'm curious how you would do data oriented programming in a language with no type system and no control over memory layout.
|
| I'm not sure what language you're referring to? Neither of those is true of Python.

| _aavaa_ wrote:
| I'm not sure what either you or the parent consider a proper type system, but the difference between Python's and (say) Java's type systems is night and day.
|
| If you use the typing module you might be able to get a linter to check things for you, but that's a far cry from what other typed languages have.

| roflyear wrote:
| How does that have anything to do with data?

| mumblemumble wrote:
| It is night and day.
| More specifically:
|
| Python's type discipline is stronger, but dynamic.
|
| Java's type discipline is static, but weaker.
|
| Both are most certainly typed. An _un_typed language would be one like Forth or most assembly languages.

| sirwhinesalot wrote:
| By using only coding patterns that are known to JIT well, plus lower-level primitive types and containers if the language provides them. Maximizing the use of packages written in native code also helps.
|
| The resulting code is even more annoying to write than using a lower-level typed language in the first place, but ecosystem access sometimes makes up for it.
|
| Hopefully tools like mypyc get better, letting well-typed Python code with reasonable usage patterns be compiled to reasonably efficient native code.
|
| Last time I used it I was pleased with the performance benefits, but it couldn't even compile all files in a module to a single shared library, despite this being mentioned as possible (and recommended) in the docs. Maybe I was doing something wrong, but they don't answer their GitHub issues often, alas.
|
| Any little thing helps though. It's one thing for throwaway scripts to be inefficient, but applications? At a large scale it is a monstrous waste of time and literal energy.

| gnuvince wrote:
| Unfortunately, the terms "data-oriented _design_" and "data-oriented _programming_" refer to two different styles of programming. Data-oriented design is the approach to programming made popular by Mike Acton's CppCon keynote--as you say, it focuses on the layout of objects in memory to make the processing of data take advantage of the underlying hardware.
|
| Data-oriented programming is a style of programming that, as far as I know, originates in the Clojure community. It emphasizes using general data structures (vectors, dictionaries) to store all data and make code more re-usable.
| It has nothing to do with good cache utilization, pre-fetching, or avoiding branch mispredictions.
|
| It's a shame that two styles of programming which are almost diametrical opposites share such similar names.
|
| From the look of the article, it's discussing data-oriented design, but in Python, and I agree that it's kind of a weird match.

| tvb12 wrote:
| Wait, what? Thanks for pointing this out. I was pretty close to submitting an order with a Data-oriented Programming book in my cart.

| typon wrote:
| There is no book on data oriented design as far as I know. It would be great if someone could take Mike Acton's talk and similar talks and condense the ideas into a book filled with real world examples.

| Jtsummers wrote:
| https://www.dataorienteddesign.com/dodmain/
|
| There is this one. It has been posted here a few times.

| wheelerof4te wrote:
| To spare you a couple minutes of your life, the article is saying this:
|
| Python + C modules = Speed
|
| Nothing new here, move along.

| brilee wrote:
| It actually isn't saying that. What do you think Python is made of, under the hood? It's C modules.
|
| The argument is not that NumPy is written in C, but that it amortizes the cost of Python overhead over multiple data, rather than incurring it on each datum.

| cauch wrote:
| It's a detail, but I keep seeing it:
|
| > Yet, [the scientists] struggle to move away from Python, because of network effects, and because Python's beginner-friendliness is appealing to scientists for whom programming is not a first language.
|
| I don't believe it's the whole story.
|
| In my case, during my 13 years in academia, I saw my field going away from C++ and towards Python.
| Not because of network effects (it was the opposite: it was more difficult to not use what everybody was using), or because scientists were not able to program (the entry language of the whole field was C++, and Python arrived only because scientists with a deep knowledge of C++ started to themselves switch the core library to be usable with Python).
|
| I think something computer scientists forget when they consider the subject is that the way computer scientists write software just doesn't work when you do science.
|
| In science, you use coding as an exploratory tool. You are lucky if 10% of your code ends up being used in your final publication, because the other 90% was only there to understand and to progress in the proper direction. For this reason, things like declaring variables, which is very important when one makes professional software, are too costly to be useful when you need to write down a piece of code that you will only ever run once to check a small hypothesis, especially when you have another language that doesn't require it. Another aspect is that you will present your scientific results to your colleagues, not your code (they are not interested in that), and they will come up with questions or good ideas, all very good for science, but rarely compatible with the way your algorithm was built in the first place, and you will need to shoe-horn them into your code (to test them) without taking 3 weeks. In this case, Python's flexibility and hackability are very useful.
|
| It's also visible in the popularity of things like Jupyter notebooks (I have to acknowledge it even if I personally don't like working with such tools), which reuse a working approach similar to what was done in Mathematica and MATLAB, which were created with the scientific workflow in mind.
|
| I'm sure Python's simplicity has played a role.
| But I have the feeling that some people are totally oblivious to the fact that there may be other reasons.

| analog31 wrote:
| All good points. I think the network effects are more important than they were in the past, because the network has gotten bigger. Today, new graduates make sure to put Python on their resumes, even if they've barely touched it. A colleague who left for another job thanked me for encouraging him to learn Python. Contrast that with 13 years ago, roughly when I switched to Python. My colleagues thought it was some weird toy, and encouraged me to learn C# instead. The past 13 years have also seen most programmers get over their aversion to open-source tools.
|
| It's true, as mentioned in another comment, that repeatability is important. But at least in experimental science, the repeatability of an experiment is a bigger hurdle than making the same code run twice. I use Python code to control my experiments, and the ability to read my way through a complex workflow is quite valuable.
|
| One thing I like is being able to bodge together a huge blob of data and metadata into a dict that's easy to store and unpack. That encourages me to keep more experimental data and metadata, and use it later.
|
| Python and its libraries have sprawled to the point where it's anything but simple.

| whatever1 wrote:
| Computer scientists also assume that you know what inputs your program needs and what the range of the outputs is.
|
| That is out of touch with scientific research. We may completely change the inputs, the core logic and the outputs overnight.
|
| Having to babysit function signatures, manage memory and types throughout these activities is just draining.

| duped wrote:
| That is also how "computer scientists" and software engineers work.
| Our time is just valued a lot higher so we've come up with techniques to make our work more efficient and faster, like structuring our code well using types and function signatures.
|
| The added bonus is you get science that's, you know, repeatable. Because the difference between industrial code and prototype code is that it gets run so often there can't be glaring mistakes; it must be repeatable by default. We have different techniques for dealing with these problems, but writing good code is orthogonal to that (I don't think scientists need to be running static analysis, doing deep reviews, and having extensive integration/unit/mock testing throughout their code).

| mumblemumble wrote:
| I used to be a software engineer, and now I'm a data scientist. Not exactly the same thing as a real scientist, but I suspect that we have common cause in this area.
|
| One of the hard lessons in the transition was realizing that things that allowed me to work more efficiently when I was a software engineer instead _reduce_ my efficiency in my new career.
|
| You might get a decent analogy of the difference by comparing photographs of the first transistor with pictures of every subsequent transistor. The first transistor's clearly going to be terrible in any production application. But the same characteristics that make it so terrible for practical use were also, to varying degrees, essential to or characteristic of the exploratory process that led to its creation.
|
| It's similar for my R&D code. In order to do my R&D work more efficiently and effectively, I need to do things that would be unholy in production code. This is why there's a separate and essential productionizing step where my output is heavily revamped and possibly even completely rewritten in a different programming language.
|
| re: repeatability, I've discovered that it, too, means something slightly different in a science context than it does in an engineering context.

| roflyear wrote:
| It's nice to be able to collect your data and then do analysis on it in one language. Data work is mostly about getting your data in a place where you can work with it. Rarely are you handed a dataset that is shiny and ready for analysis or regression ...
|
| Python also has the benefit of tooling, where you can use tools you're familiar with and still work on most codebases.

| tomrod wrote:
| This is a wonderfully technical article. I'd love to learn more about Python internals as a scientific coder.

| pedrovhb wrote:
| The official Python documentation is excellent, and in many ways goes beyond providing just a list of existing modules and what they do. Sometimes if I'm bored I'll actually just pull up documentation for something I'm not 100% familiar with and have a look around, and I almost always find something new and useful. A couple of interesting ones are [0][1], and [2] is a nice starting point for discovering more. Not everyone's cup of tea, but I also found it enjoyable to dive into asyncio with the docs.
|
| [0] https://docs.python.org/3/howto/descriptor.html
| [1] https://docs.python.org/3/library/collections.html
| [2] https://docs.python.org/3/

| barefeg wrote:
| I recommend any of the talks by James Powell at PyData. For example this one https://youtu.be/cKPlPJyQrt4
|
| Edit: maybe this one on NumPy may be more relevant: https://youtu.be/u2yvNw49AX4

| _visgean wrote:
| Hmm nice article but imho skips over the biggest optimization: numpy uses BLAS libraries so stuff like
|
| > >>> multiply_by_two = homogenous_array * 2
|
| will be calculated most of the time using a BLAS library - whichever you are using (https://numpy.org/devdocs/user/building.html)

| cdavid wrote:
| That article talks about DL, where BLAS is much less relevant.
| The kernels are mostly CUDA (for GPU) and similar stuff for other accelerators.

| college_physics wrote:
| > In practice, scientific computing users rely on the NumPy family of libraries e.g. NumPy, SciPy, TensorFlow, PyTorch, CuPy, JAX, etc.
|
| this is a somewhat confusing statement. most of these libraries actually don't rely on NumPy. e.g. TensorFlow ultimately wraps C++/Eigen tensors [0] and NumPy enters somewhere higher up in their Python integration
|
| [0] https://github.com/tensorflow/tensorflow/blob/master/tensorf...
___________________________________________________________________
(page generated 2022-11-28 05:00 UTC)