[HN Gopher] How Python List Works
       ___________________________________________________________________
        
       How Python List Works
        
       Author : tosh
       Score  : 61 points
       Date   : 2021-11-14 13:27 UTC (9 hours ago)
        
 (HTM) web link (antonz.org)
 (TXT) w3m dump (antonz.org)
        
       | kgm wrote:
       | One thing to note, in the gritty details of Python's actual
       | implementation of the list type, is that there's a little more to
       | say about how list resizing works than is immediately obvious.
       | The relevant source is here:
       | https://github.com/python/cpython/blob/main/Objects/listobje...
       | 
       | As the article points out, the size progression, as detailed in
       | the comment in that source, is relatively modest. Your "classic"
       | dynamic array will simply double the size as needed. This
       | progression does less than that: 16, 24, 32, 40, and so on.
       | _But:_ The underlying implementation is using the standard C
       | library function realloc(). This will perform some amount of in-
       | place resizing on its own, and will lend a lot to the performance
       | of a Python list, albeit in a way that might be difficult to nail
       | down, since it depends on the overall usage of the C library 's
       | heap.
        
       | Zababa wrote:
       | Related to that, do heterogenous lists make sense as a default? I
       | can see the argument for dictionaries easily, you might want to
       | put a lot of different stuff inside one. However, for lists, I
       | fail to see the advantages of heterogenous lists as a default.
       | Especially since they are going to be way slower, and nobody seem
       | to use arrays in Python.
        
         | kzrdude wrote:
         | If you use the static typing eyes, the Python list is
         | _homogenous_ : every element is of the type reference to python
         | object. There's no funny business with types, just one
         | reference after another.
        
           | Zababa wrote:
           | That's a fair point, but if static typing was like that it
           | would be useless. "Object" is a type, but it's pretty useless
           | most of the time.
        
         | pdonis wrote:
         | _> do heterogenous lists make sense as a default?_
         | 
         | Yes, because then the list object doesn't have to do any extra
         | work to check the types of objects. If your application needs
         | that extra checking, you can add it in your application's code;
         | but there's no reason to burden the basic list object with
         | extra code that not everyone needs.
         | 
         |  _> Especially since they are going to be way slower_
         | 
         | Way slower than what? They're slower than Python arrays, of
         | course, but Python arrays can only store numeric values (since
         | the amount of space each value takes up is constant and known
         | when the array is created).
         | 
         | For a container that needs to be able to store objects of any
         | type (including instances of user-defined classes), a
         | heterogeneous list is _faster_ than a homogeneous one, because,
         | as above, it doesn 't have to do extra work to check the types
         | of objects. All the other extra work is there because the size
         | in memory of the objects isn't known in advance; and that will
         | be true even if all of the objects are of the same type, since
         | instances still won't necessarily all have the same in-memory
         | size (for example, consider instances of user-defined classes
         | with different attributes).
        
           | Zababa wrote:
           | > For a container that needs to be able to store objects of
           | any type (including instances of user-defined classes), a
           | heterogeneous list is faster than a homogeneous one, because,
           | as above, it doesn't have to do extra work to check the types
           | of objects.
           | 
           | My question is about that. Is storing objects of any type a
           | common use case, compared to storing objects of one type?
        
             | sseagull wrote:
             | I have found, at least in my own use of python, lists are
             | for "homogeneous" collections of indeterminate length. But
             | these are not strictly homogeneous - there may be None
             | entries, for example. So I do use the heterogeneous
             | capabilities.
             | 
             | For heterogeneous collections with a fixed length, I tend
             | to use a tuple. That makes more sense to me, conceptually
             | (coming from C++).
        
             | pdonis wrote:
             | _> Is storing objects of any type a common use case,
             | compared to storing objects of one type?_
             | 
             | I'm not sure it's possible to know a general answer to
             | that, since Python is used in such a wide variety of
             | contexts.
             | 
             | In my own programming, I use lists that can contain objects
             | of different types often enough for that use case to be
             | significant for me.
             | 
             | One other thing to keep in mind is that Python has duck
             | typing, so, for example, if you are iterating over a list
             | and retrieving the same attribute or calling the same
             | method on each object, that doesn't require each object to
             | be of the same type, just to have an attribute with the
             | required name, or a method with the required name and
             | calling signature.
        
         | duckerude wrote:
         | Python's semantics make homogeneity less useful and hard to
         | enforce.
         | 
         | Everything's passed by reference, so the list has to contain
         | pointers even if you know all its elements have the same type.
         | And some objects can change their type by assigning to
         | __class__, which can happen from a distance (because everything
         | is by reference).
         | 
         | A careful implementation can solve this for some builtin types.
         | PyPy does it for ints: if a list only contains ints it's
         | represented as a proper numeric array. But then as soon as you
         | append some other type the whole list is converted to a more
         | expensive more traditional representation.
         | 
         | PyPy's ints still have to pretend to be heap-allocated objects
         | at all times. In particular they have to do something tricky
         | with the id() function, which on CPython just returns an
         | object's memory address. The return value has to be unique and
         | consistent for as long as an object lives.
        
       | nayuki wrote:
       | How a Python list really works? See the underlying C
       | implementation:
       | https://github.com/python/cpython/blob/main/Objects/listobje... ;
       | https://github.com/python/cpython/blob/main/Include/listobje...
       | 
       | For comparison, java.util.ArrayList is implemented in Java
       | itself, and is much shorter:
       | http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/...
        
         | azinman2 wrote:
         | Ya I was reading this thinking none of this is telling me how
         | CPython implements it. It's just a poor-man's version.
        
       | matthias509 wrote:
       | One point which I think bears mentioning with respect to
       | performance is that even with O(1) access, the performance in the
       | real world of a list implemented by a list of pointers vs a
       | contiguous block of memory will be very different. This is
       | because programs often iterate over lists and a list implemented
       | by a contiguous block of ram will have MUCH better cache
       | locality.
       | 
       | AFAIK, this unfortunately wouldn't be possible to implement in
       | Python though because I don't think there would be any way to
       | know how much space to allocate for each item in the list.
        
         | seiferteric wrote:
         | Does it have to be one or the other? You could allocate blocks
         | of memory for the list then when it will be filled up you
         | allocate more and add a pointer to it.
        
           | dmurray wrote:
           | This breaks O(1) access by index, because the objects in the
           | list may have different sizes.
        
         | Znafon wrote:
         | If you have numeric value you can use
         | https://docs.python.org/3/library/array.html
        
           | kzrdude wrote:
           | But if you're in the real world you use numpy.ndarray instead
           | (this underpins half of the Python ecosystem, more or less).
        
         | killingtime74 wrote:
         | For more efficient "lists" there are Numpy arrays
         | https://numpy.org/doc/stable/reference/generated/numpy.array...
         | which do have cache locality
        
           | wodenokoto wrote:
           | Python does come with built-in arrays, but I've never seen
           | them used, or heard anyone recommend them for anything.
           | 
           | https://docs.python.org/3/library/array.html
        
             | cozzyd wrote:
             | These are useful when you need a pointer to an integer for
             | a C or C++ API.
        
             | edgyquant wrote:
             | I had never heard of them and when I try to Google Python
             | array v set speed it just returns list v set. I did get the
             | official documentation but it doesn't go into speed.
        
             | uranusjr wrote:
             | They are recommended when you want speed, but most people
             | writing Python probably never need that much speed (or
             | would probably want numpy anyway), and the few who do
             | likely have more private sources to recommend it for
             | specific occasions. It should probably be used for more
             | things (instead of reaching for numpy at first chance) but
             | is still a niche tool.
        
               | abecedarius wrote:
               | The times I've used array, it was a common data structure
               | for communicating with a C extension. You could use a
               | Python list, but that's not just slower but more of a
               | pain to deal with on the C side. (This was before numpy
               | was everywhere; I don't know how much fun that might be
               | to work with from C.)
               | 
               | If you're going pure Python then space is a better reason
               | than speed. I'm not sure an array is even as fast as a
               | list.
        
               | uranusjr wrote:
               | I think I use struct much more when interfacing with C,
               | but I can see array being useful for arrays specifically
               | when you expect more than a couple of elements. Both are
               | still quite niche though.
        
             | kgm wrote:
             | The standard array module was more relevant in the distant
             | past, before version 2.6 added the bytearray type.
             | Otherwise, numpy arrays are considerably more useful in
             | virtually every circumstance.
        
         | mark-r wrote:
         | In CPython at least, there's no other way. Everything is an
         | object, even simple types like integers. And references to
         | objects are implemented with pointers. Even a Numpy array which
         | stores a contiguous array of simple values requires conversion
         | to an object when you actually try to _use_ those values.
        
       | arthurcolle wrote:
       | I always wondered why there weren't more 'hyper-collection'
       | abstractions that just implement multiple collection types in one
       | interface, so that you can get the respective speedups of the
       | different collection types while still having the flexibility to
       | use them as you wish
        
         | karmakaze wrote:
         | This is the field of datastructures. There isn't 'just' one be-
         | all to end-all. There are so many kinds of hashing/tables,
         | trees/balancing, hybrid (e.g. skiplists), priority queues, etc.
         | Because there are so many possible combinations, if the choices
         | matter it's useful to use a language that lets you make your
         | own. This is one of the main reasons why generics for Go is a
         | big deal. Go always had generics, but it's only been for the
         | built-ins array/slice, map, and channel. With user-defined
         | generics any datastructures/interfaces can be conveniently made
         | and used without language/library/compiler support.
         | 
         | Its curious that for Python this is blog post material, but for
         | Go it's intrinsic to understanding the performance and usage of
         | slices or maps, or in Java (and other langs) would be
         | referenced merely by the class name ArrayList. The fact that so
         | much can be done effectively in Python without always having to
         | think about such details is an achievement of the language and
         | ecosystem of libraries and packages.
        
         | ketralnis wrote:
         | C arrays are basically that. Also python dicts (most python
         | types including all user-defined classes are actually dicts
         | underneath), Lua tables, Javascript objects, and more. In most
         | dynamically typed languages, "the" type is actually what you're
         | calling a hyper-collection and not just conceptually, but also
         | in implementation
        
         | beervirus wrote:
         | How would that work? Every time you create an object,
         | internally it creates one of everything?
        
       ___________________________________________________________________
       (page generated 2021-11-14 23:01 UTC)