[HN Gopher] Data-Oriented Programming in Python
       ___________________________________________________________________
        
       Data-Oriented Programming in Python
        
       Author : brilee
       Score  : 119 points
       Date   : 2022-11-26 17:45 UTC (1 days ago)
        
 (HTM) web link (www.moderndescartes.com)
 (TXT) w3m dump (www.moderndescartes.com)
        
       | hgibbs wrote:
       | I'd like to plug riptable
       | (https://github.com/rtosholdings/riptable), which is
       | (more-or-less) a performance upgrade to pandas.
        
         | anigbrowl wrote:
         | Looks nice!
        
       | duped wrote:
       | I'm curious how you would do data oriented programming in a
       | language with no type system and no control over memory layout.
       | And I guess the answer is "you can't, but JITs might exist
       | someday that do it for you"
       | 
       | But you can't wave your hands around and say compiler
       | optimizations will fix performance problems - they can, but
       | they're not magic, and the arrow in the proverbial knee for
       | optimization passes is language semantics that make them
       | impossible to realize (forcing the authors to either abandon the
       | passes, or rely on things like dynamic deoptimization, which is
       | not free).
        
         | jessermeyer wrote:
         | Those are basically a contradiction in terms. Orienting the
         | program structure around the data necessarily requires control
         | over memory layout and how it is interpreted.
        
         | mumblemumble wrote:
         | > I'm curious how you would do data oriented programming in a
         | language with no type system and no control over memory layout.
         | 
         | I'm not sure what language you're referring to? Neither of
         | those is true of Python.
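
As a rough illustration of the "control over memory layout" point, here is a minimal sketch using only the standard library's `array` and `struct` modules. The field choices and values are made up for the example; it is not taken from the article.

```python
import struct
from array import array

# array('d') stores raw C doubles back to back, not boxed Python floats.
samples = array("d", [1.0, 2.0, 3.0])
view = memoryview(samples)

print(samples.itemsize)   # bytes per element
print(view.nbytes)        # total bytes in the underlying buffer
print(view.contiguous)    # the buffer is one contiguous block

# struct lets you spell out an explicit binary layout:
# '<' = little-endian, no padding; 'i' = 4-byte int; 'f' = 4-byte float.
record_format = "<if"
packed = struct.pack(record_format, 7, 2.5)
print(struct.calcsize(record_format), len(packed))
print(struct.unpack(record_format, packed))
```

This only covers flat numeric buffers; ordinary Python objects (lists, dicts, class instances) are still laid out by the interpreter.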
        
           | _aavaa_ wrote:
           | I'm not sure what either you or the parent consider a proper
           | type system, but the difference between Python and (say)
           | Java's type system is night and day.
           | 
           | If you use the typing module, you might be able to get a
           | linter to check things for you, but that's a far cry from
           | what other typed languages have.
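
A small sketch of the gap being described: annotations are recorded but not enforced when the code runs. The function name `double` is hypothetical; an external checker such as mypy would be expected to flag the second call, but the interpreter itself does not.

```python
def double(x: int) -> int:
    """Annotated as int -> int, but nothing checks this at runtime."""
    return x * 2

# Matches the annotation.
print(double(21))

# Violates the annotation, yet runs: str * 2 is a valid operation.
print(double("ab"))

# The hints are only stored as metadata on the function object.
print(double.__annotations__)
```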
        
             | roflyear wrote:
             | How does that have anything to do with data?
        
             | mumblemumble wrote:
             | It is night and day. More specifically:
             | 
             | Python's type discipline is stronger, but dynamic.
             | 
             | Java's type discipline is static, but weaker.
             | 
             | Both are most certainly typed. An _un_typed language would
             | be one like Forth or most assembly languages.
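
The "stronger, but dynamic" distinction can be sketched in a few lines. The names here are illustrative only. For comparison, Java's `"1" + 1` compiles and evaluates to `"11"` because the int is converted implicitly.

```python
def add(a, b):
    return a + b

# Dynamic: a name can be rebound to values of different types.
value = 1
value = "one"

# Strong: mixing str and int is rejected instead of silently coerced.
error = None
try:
    add("1", 1)
except TypeError as exc:
    error = str(exc)

print(add(1, 2))
print(error)
```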
        
         | sirwhinesalot wrote:
         | By using only coding patterns that are known to JIT well and
         | lower level primitive types and containers if provided by the
         | language. Maximizing the use of packages written in native code
         | also helps.
         | 
         | The resulting code is even more annoying to write than using a
         | lower-level typed language in the first place, but
         | ecosystem access sometimes makes up for it.
         | 
         | Hopefully tools like mypyc get better, letting well-typed
         | python code with reasonable usage patterns be compiled to
         | reasonably efficient native code.
         | 
         | Last time I used it I was pleased with the performance benefits
         | but it couldn't even compile all files in a module to a single
         | shared library, despite this being mentioned as possible (and
         | recommended) in the docs. Maybe I was doing something wrong,
         | but they don't answer their github issues often, alas.
         | 
         | Any little thing helps though, it's one thing for throwaway
         | scripts to be inefficient, but applications? At a large scale
         | it is a monstrous waste of time and literal energy.
        
         | gnuvince wrote:
         | Unfortunately, the terms "data-oriented _design_" and
         | "data-oriented _programming_" refer to two different styles of
         | programming. Data-oriented design is the approach to
         | programming made popular by Mike Acton's CppCon keynote--as you
         | say, it focuses on the layout of objects in memory to make the
         | processing of data take advantage of the underlying hardware.
         | 
         | Data-oriented programming is a style of programming that, as
         | far as I know, originates in the Clojure community. It
         | emphasizes using general data structures (vectors,
         | dictionaries) to store all data and make code more re-usable.
         | It has nothing to do with good cache utilization, pre-fetching,
         | or avoiding branch mispredictions.
         | 
         | It's a shame that two styles of programming which are almost
         | diametrical opposites share such similar names.
         | 
         | From the look of the article, it's discussing data-oriented
         | design, but in Python, and I agree that it's kind of a weird
         | match.
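
To make the contrast concrete, here is a hedged sketch of the same update written both ways. The particle fields (`x`, `v`) and function names are invented for illustration. In pure Python the column version still loops per element, so this shows only the difference in data layout, not a speedup.

```python
from array import array

# "Data-oriented programming" flavour: generic dicts in a list.
particles = [
    {"x": 0.0, "v": 1.0},
    {"x": 1.0, "v": 0.5},
    {"x": 2.0, "v": -1.0},
]

def step_records(records, dt):
    """Return new records with x advanced by v * dt."""
    return [{**p, "x": p["x"] + p["v"] * dt} for p in records]

# "Data-oriented design" flavour: one contiguous array per field
# (struct-of-arrays), so each field sits together in memory.
xs = array("d", [0.0, 1.0, 2.0])
vs = array("d", [1.0, 0.5, -1.0])

def step_columns(xs, vs, dt):
    """Return a new array of positions advanced by v * dt."""
    return array("d", (x + v * dt for x, v in zip(xs, vs)))

print([p["x"] for p in step_records(particles, 2.0)])
print(list(step_columns(xs, vs, 2.0)))
```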
        
           | tvb12 wrote:
           | Wait, what? Thanks for pointing this out. I was pretty close
           | to submitting an order with a Data-oriented Programming book
           | in my cart.
        
             | typon wrote:
             | There is no book on data-oriented design as far as I know.
             | It would be great if someone could take Mike Acton's talk
             | and similar talks and condense the ideas into a book filled
             | with real-world examples.
        
               | Jtsummers wrote:
               | https://www.dataorienteddesign.com/dodmain/
               | 
               | There is this one. It has been posted here a few times.
        
       | wheelerof4te wrote:
       | To spare you a couple minutes of your life, the article is saying
       | this:
       | 
       | Python + C modules = Speed
       | 
       | Nothing new here, move along.
        
         | brilee wrote:
         | It actually isn't saying that. What do you think Python is made
         | of, under the hood? It's C modules.
         | 
         | The argument is not that NumPy is written in C, but that it
         | amortizes the cost of Python overhead over multiple data,
         | rather than incur it on each datum.
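
The amortization argument can be sketched without NumPy by using the built-in `sum` as a stand-in: one Python-level call covers the whole batch, instead of one trip through the bytecode loop per element. This is an analogy, not the article's own benchmark, and timings will vary by machine.

```python
import timeit

data = list(range(100_000))

def per_datum_total(values):
    """Pay interpreter overhead on every element."""
    total = 0
    for v in values:
        total += v
    return total

def batched_total(values):
    """Pay Python call overhead once; the loop runs in C inside sum()."""
    return sum(values)

loop_time = timeit.timeit(lambda: per_datum_total(data), number=5)
batch_time = timeit.timeit(lambda: batched_total(data), number=5)
print(f"per-datum loop: {loop_time:.4f}s, batched call: {batch_time:.4f}s")
```

NumPy goes further than this by also storing the elements unboxed, so the inner loop does not have to handle a Python object per element.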
        
       | cauch wrote:
       | It's a detail, but I keep seeing it:
       | 
       | > Yet, [the scientists] struggle to move away from Python,
       | because of network effects, and because Python's beginner-
       | friendliness is appealing to scientists for whom programming is
       | not a first language.
       | 
       | I don't believe it's the whole story.
       | 
       | In my case, during my 13 years in academia, I saw my field going
       | away from C++ and towards python. Not because of network effects
       | (it was the opposite: it was more difficult to not use what
       | everybody was using), or because scientists were not able to
       | program (the entry language of the whole field was C++, and
       | python arrived only because scientists with a deep knowledge of
       | C++ started to themselves switch the core library to be usable
       | with python).
       | 
       | I think something that computer scientists forget when they
       | consider the subject is that the way computer scientists do
       | software just doesn't work when you do science.
       | 
       | In science, you use coding as an exploratory tool. You are lucky
       | if 10% of your code ends up being used in your final publication.
       | Because the 90% was only there to understand and to progress
       | towards the proper direction. For this reason, things like
       | declaring variables, which are very important when one writes
       | professional software, are too costly to be useful when you need
       | to write down a piece of code that you will only ever run once to
       | check a small hypothesis, especially when you have another
       | language not requiring it. Another aspect is that you will
       | present your scientific results to your colleagues, not your code
       | (they are not interested in that), and they will come up with
       | questions or good ideas, all very good for science, but rarely
       | compatible with the way your algorithm was built in the first
       | place, and you will need to shoe-horn it into your code (to test
       | it) without taking 3 weeks. In this case, python's flexibility
       | and hackability are very useful.
       | 
       | It's also visible in the popularity of things like Jupyter
       | notebooks (I have to acknowledge it even if I personally don't
       | like working with such tools), which reuse a working approach
       | similar to what was done in Mathematica and MATLAB, which were
       | created with the scientific workflow in mind.
       | 
       | I'm sure python's simplicity has played a role. But I have the
       | feeling that some people are totally oblivious to the fact that
       | there may be other reasons.
        
         | analog31 wrote:
         | All good points. I think the network effects are more important
         | than they were in the past, because the network has gotten
         | bigger. Today, new graduates make sure to put Python on their
         | resumes, even if they've barely touched it. A colleague who
         | left for another job thanked me for encouraging him to learn
         | Python. Contrast that with 13 years ago, roughly when I
         | switched to Python. My colleagues thought it was some weird
         | toy, and encouraged me to learn C# instead. The past 13 years
         | have also seen most programmers get over their aversion to
         | open-source tools.
         | 
         | It's true, as mentioned in another comment, that repeatability
         | is important. But at least in experimental science, the
         | repeatability of an experiment is a bigger hurdle than making
         | the same code run twice. I use Python code to control my
         | experiments, and the ability to read my way through a complex
         | workflow is quite valuable.
         | 
         | One thing I like is being able to bodge together a huge blob of
         | data and metadata into a dict that's easy to store and unpack.
         | That encourages me to keep more experimental data and metadata,
         | and use it later.
         | 
         | Python and its libraries have sprawled to the point where it's
         | anything but simple.
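
A minimal sketch of the "blob of data and metadata in a dict" workflow, assuming JSON as the storage format (the comment does not say which format is used). The sample name, field names, and readings are hypothetical.

```python
import json
import os
import tempfile

# Hypothetical experiment record: metadata and data kept in one dict.
record = {
    "metadata": {"sample": "A1", "temperature_K": 293.15, "notes": "test run"},
    "data": {"time_s": [0.0, 0.1, 0.2], "voltage_V": [1.0, 0.98, 0.97]},
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "run_001.json")

    # Store the whole record in one call.
    with open(path, "w") as f:
        json.dump(record, f, indent=2)

    # Unpack it later, metadata included.
    with open(path) as f:
        restored = json.load(f)

print(restored["metadata"]["sample"], restored["data"]["voltage_V"])
```

JSON only handles plain types (dicts, lists, strings, numbers, booleans, None), so arrays or timestamps would need converting first.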
        
         | whatever1 wrote:
         | Computer scientists also assume that you know what inputs your
         | program needs and what is the range of the outputs.
         | 
         | That is out of touch with scientific research. We may
         | completely change the inputs, the core logic and the outputs
         | overnight.
         | 
         | Having to babysit function signatures, manage memory and types
         | throughout these activities is just draining.
        
           | duped wrote:
           | That is also how "computer scientists" and software engineers
           | work. Our time is just valued a lot higher so we've come up
           | with techniques to make our work more efficient and faster,
           | like structuring our code well using types and function
           | signatures.
           | 
           | The added bonus is you get science that's, you know,
           | repeatable. Because the difference between industrial code
           | and prototype code is that it gets run so often there can't
           | be glaring mistakes; it must be repeatable by default. We
           | have different techniques for dealing with these problems,
           | but writing good code is orthogonal to that (I don't think
           | scientists need to be running static analysis, doing deep
           | reviews, and having extensive integration/unit/mock testing
           | throughout their code).
        
             | mumblemumble wrote:
             | I used to be a software engineer, and now I'm a data
             | scientist. Not exactly the same thing as a real scientist,
             | but I suspect that we have common cause in this area.
             | 
             | One of the hard lessons in the transition was realizing
             | that things that allowed me to work more efficiently when I
             | was a software engineer instead _reduce_ my efficiency in
             | my new career.
             | 
             | You might get a decent analogy of the difference by
             | comparing photographs of the first transistor with pictures
             | of every subsequent transistor. The first transistor's
             | clearly going to be terrible in any production application.
             | But the same characteristics that make it so terrible for
             | practical use were also, to varying degrees, essential to
             | or characteristic of the exploratory process that led to
             | its creation.
             | 
             | It's similar for my R&D code. In order to do my R&D work
             | more efficiently and effectively, I need to do things that
             | would be unholy in production code. This is why there's a
             | separate and essential productionizing step where my output
             | is heavily revamped and possibly even completely rewritten
             | in a different programming language.
             | 
             | re: repeatability, I've discovered that it, too, means
             | something slightly different in a science context than it
             | does in an engineering context.
        
         | roflyear wrote:
         | It's nice to be able to collect your data and then do analysis
         | on it in one language. Data work is mostly about getting your
         | data in a place where you can work with it. Rarely are you
         | handed a dataset that is shiny and ready for analysis or
         | regression ...
         | 
         | Python also has the benefit of tooling, where you can use tools
         | you're familiar with and still work on most codebases.
        
       | tomrod wrote:
       | This is a wonderfully technical article. I'd love to learn more
       | about Python internals as a scientific coder.
        
         | pedrovhb wrote:
         | The official Python documentation is excellent, and in many
         | ways goes beyond providing just a list of existing modules and
         | what they do. Sometimes if I'm bored I'll actually just pull up
         | documentation for something I'm not 100% familiar with and have
         | a look around, and I almost always find something new and
         | useful. A couple of interesting ones are [0][1], and [2] is a
         | nice starting point for discovering more. Not everyone's cup of
         | tea, but I also found it enjoyable to dive into asyncio with
         | the docs.
         | 
         | [0] https://docs.python.org/3/howto/descriptor.html
         | [1] https://docs.python.org/3/library/collections.html
         | [2] https://docs.python.org/3/
        
         | barefeg wrote:
         | I recommend any of the talks by James Powell at PyData. For
         | example this one https://youtu.be/cKPlPJyQrt4
         | 
         | Edit: this one on NumPy may be more relevant:
         | https://youtu.be/u2yvNw49AX4
        
       | _visgean wrote:
       | Hmm nice article but imho skips over the biggest optimization:
       | numpy uses BLAS libraries so stuff like
       | 
       | > >>> multiply_by_two = homogenous_array * 2
       | 
       | will be calculated most of the time using a BLAS library -
       | whichever you are using
       | (https://numpy.org/devdocs/user/building.html)
        
         | cdavid wrote:
         | That article talks about DL, where blas is much less relevant.
         | The kernels are mostly CUDA (for GPU) and similar stuff for
         | other accelerators.
        
       | college_physics wrote:
       | > In practice, scientific computing users rely on the NumPy
       | family of libraries e.g. NumPy, SciPy, TensorFlow, PyTorch, CuPy,
       | JAX, etc..
       | 
       | this is a somewhat confusing statement. most of these libraries
       | actually don't rely on numpy. e.g. tensorflow ultimately wraps
       | c++/eigen tensors [0] and numpy enters somewhere higher up in
       | their python integration
       | 
       | [0]
       | https://github.com/tensorflow/tensorflow/blob/master/tensorf...
        
       ___________________________________________________________________
       (page generated 2022-11-28 05:00 UTC)