[HN Gopher] Prevent DoS by large int-str conversions
       ___________________________________________________________________
        
       Prevent DoS by large int-str conversions
        
       Author : genericlemon24
       Score  : 78 points
       Date   : 2022-09-07 17:03 UTC (5 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | wmichelin wrote:
       | Can anyone TL;DR why? Why wouldn't it just return that long
       | integer of all 1s?
        
         | sp332 wrote:
         | Yeah it's right at the top of the linked page?
        
         | schoen wrote:
         | It's stated to be CVE-2020-10735, which is apparently about a
         | denial of service by forcing Python to inefficiently convert a
         | very large string to an integer, using a potentially ridiculous
         | amount of CPU time.
         | 
         | The CVE hasn't been published, but for example there's an
         | explanation at
         | 
         | https://bugzilla.redhat.com/show_bug.cgi?id=1834423
        
           | klyrs wrote:
           | Looks to me like the actual problem is in str.__mul__ --
           | that one's got arbitrary memory usage. Better limit those
           | arguments...
        
             | masklinn wrote:
             | str.__mul__ is just a conveniently short way to demonstrate
             | the issue, the target is pretty much any parsing routine
             | exposed to outside users e.g. any JSON API.
        
               | klyrs wrote:
               | Apologies, my comment is snark. The algorithm in question
               | is soft-linear, faster implementations exist, this seems
               | like an incredibly myopic fix. Just make a bigger JSON
               | blob and it will take longer to parse.
        
           | [deleted]
        
           | adgjlsfhk1 wrote:
           | this seems like a dumb fix to the cve to me. why not just use
           | a faster algorithm?
        
             | lifthrasiir wrote:
             | Because there is no linear-time algorithm for decimal-to-
             | binary conversion. If we are to expose the bignum-aware
             | `int` function to untrusted input there should be some
             | limit anyway. I do think the current limit of 4300 digits
             | seems too low though---if it were something like 1 million
             | digits I would be okay.
        
               | schoen wrote:
               | It looks like there is some discussion of the algorithmic
               | options at
               | 
               | https://github.com/python/cpython/issues/95778
               | 
               | https://github.com/python/cpython/issues/90716
               | 
               | Is there something bad going on with Python's internal
               | representation of big integers, too? I thought I might
               | have understood Tim Peters to be saying that in the
               | latter thread.
               | 
               | It does look like gmpy2.mpz() is like 100 times faster
               | than int() or something. Is this just because it's doing
               | it all in assembly rather than in Python bytecodes, or
               | are the Python data structures here also not so hot?
        
               | thehappypm wrote:
               | One of the comments showed the incredibly naive approach
               | of just building the integer digit-by-digit:
               | 
               | '1234' => 1x1000 + 2x100 + 3x10 + 4x1
               | 
               | Is faster and has room to improve
        
               | tylerhou wrote:
               | This takes (worse than) quadratic time.
        
               | thehappypm wrote:
               | I'm not sure it does, in the best case.
               | 
               | There are d additions, so the addition is linear time.
               | 
               | Each multiplication is potentially quadratic, but it
               | seems optimizable since it's never multiplication of two
               | large numbers--always one large and one small number.
        
               | singron wrote:
               | Each addition is linear in d, but there are d additions,
               | so it's already quadratic before you even consider the
               | multiplications.
               | 
               | In a power-of-2 base, the result of the multiplication is
               | a constant number of digits (because the multiplication
               | is just a shift of a single digit), so the additions
               | could each be constant time in that case.
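
A minimal sketch of the digit-by-digit approach discussed above, written in Horner form (equivalent to summing digit x power of ten). The function name is hypothetical and this is not CPython's actual routine; it only illustrates why the cost is quadratic in the number of digits d: the accumulator grows to O(d) machine words, and each of the d steps has to touch all of it.

```python
def parse_decimal_naive(text: str) -> int:
    """Convert an ASCII decimal string to an int one digit at a time."""
    if not text or not text.isascii() or not text.isdigit():
        raise ValueError(f"not a decimal string: {text[:20]!r}")
    value = 0
    for ch in text:
        # `value` grows with every step, so this multiply-add costs time
        # proportional to its current length; d such steps give O(d^2).
        value = value * 10 + (ord(ch) - ord("0"))
    return value
```

Multiplying by the small constant 10 is cheap per word, but it still walks every word of the accumulator, which is the point singron makes about the additions.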
        
               | klodolph wrote:
               | > It does look like gmpy2.mpz() is like 100 times faster
               | than int() or something. Is this just because it's doing
               | it all in assembly rather than in Python bytecodes, or
               | are the Python data structures here also not so hot?
               | 
               | It's not the data structures. The data structures are
               | really more or less the same: you have some array of
               | words, with a length and a sign. The only real
               | differences are in the particular length of word that you
               | choose, which is not a very interesting difference.
               | 
               | Assembly language optimizations do tend to matter here,
               | because you're working with the carry bit for lots of
               | these operations, and each architecture also has some
               | different way of multiplying numbers. Multiplying numbers
               | is "funny" because it produces two words of output from
               | two one-word inputs.
               | 
               | There are also sometimes some different algorithms in
               | use, and GMP uses some different algorithms depending on
               | the size. Here's a page describing the algorithms used by
               | GMP:
               | 
               | https://gmplib.org/manual/Multiplication-Algorithms
               | 
               | Here's a description of how carries are propagated:
               | 
               | https://gmplib.org/manual/Assembly-Carry-Propagation
               | 
               | IMO, I wouldn't expect my language's built-in bigint type
               | to use the best, most cutting-edge algorithms and lots of
               | hand-tuned assembly. GMP is a specialized library for
               | doing special things.
        
               | tylerhou wrote:
               | There is no practical linear time algorithm for
               | multiplication; should Python disable multiplication for
               | numbers greater than 10^4301?
               | 
               | Even a naive divide and conquer decimal to binary
               | algorithm is only logarithmically slower than
               | multiplication.
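
A rough sketch of the naive divide-and-conquer conversion mentioned above, assuming pure Python on top of the built-in int. The function name and the `cutoff` parameter are illustrative, not anything CPython or GMP exposes. The string is split in half, each half is converted recursively, and the halves are combined with one big multiplication, so the total cost is about a logarithmic factor above the cost of that multiplication.

```python
def parse_decimal_dc(text: str, cutoff: int = 512) -> int:
    """Convert an ASCII decimal string to an int by recursive halving."""
    if not text or not text.isascii() or not text.isdigit():
        raise ValueError(f"not a decimal string: {text[:20]!r}")

    def convert(lo: int, hi: int) -> int:
        length = hi - lo
        if length <= cutoff:
            # Small chunks go to the built-in conversion; the default
            # cutoff stays well below the 4300-digit limit.
            return int(text[lo:hi])
        mid = lo + length // 2
        low_digits = hi - mid
        # value = high_part * 10**len(low_part) + low_part
        # (powers of ten are recomputed here; caching them is omitted)
        return convert(lo, mid) * 10 ** low_digits + convert(mid, hi)

    return convert(0, len(text))
```

How much this helps depends on the multiplication underneath: CPython switches to Karatsuba for large operands, which is subquadratic but slower than the algorithms GMP uses.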
        
               | adgjlsfhk1 wrote:
               | there isn't a linear time algorithm, but there is an
               | algorithm in O(n*log(n)^2)
               | http://maths-people.anu.edu.au/~brent/pd/rpb032.pdf which is pretty
               | close. it also seems weird to have a CVE for "some
               | algorithms don't run in linear time". should there be a
               | 4000 element maximum for the size of list passed to sort?
        
               | lifthrasiir wrote:
               | > should there be a 4000 element maximum for the size of
               | list passed to sort?
               | 
               | Technically speaking, yes, there should be some limit if
               | you are accepting an untrusted input. But there is a good
               | argument for making this limit built-in for integers but
               | not lists: integers are expected to be atomic while lists
               | are widely understood as aggregates, therefore large
               | integers can more easily propagate throughout
               | unsuspecting code base than large lists.
               | 
               | (Or, if you are just saying that once you have sub-
               | quadratic algorithms you don't need language-imposed
               | limits anymore, maybe you are right.)
        
               | bjourne wrote:
               | But why convert it to binary? If you store the number as
               | an array of digits the parsing process should be O(n).
        
               | lifthrasiir wrote:
               | That means every limb operation should be done modulo
               | 10^k, which would be pretty expensive and only makes
               | sense if you don't do much computation with them so the
               | base conversion will dominate the computation.
        
             | wyldfire wrote:
             | But the multiplier is unbounded, though. Faster wouldn't help
             | in that case.
        
               | klyrs wrote:
               | Maybe we should limit the lengths of strings altogether.
               | 512k should be enough for anybody.
        
       | eugenekolo wrote:
       | Could they not have modified the `int` function to `int(thingy,
       | i_really_want_to_do_this=False)`?
       | 
       | Edit: Looks like they added a python argument to increase the
       | limit. So if you really need this, I suppose you can search
       | around until you figure out why it's not working and pass the
       | correct argument to the python bin.
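
Besides the interpreter argument, the limit can also be changed from inside a running program with `sys.set_int_max_str_digits()`. The sketch below wraps that in a hypothetical helper, `int_with_limit`, that raises the limit for a single conversion and then restores it. It assumes an interpreter that ships the feature and falls back to plain `int()` otherwise.

```python
import sys


def int_with_limit(text: str, max_digits: int) -> int:
    """Parse `text` as a decimal int under a temporary digit limit.

    `max_digits` must be 0 (no limit) or at least
    sys.int_info.str_digits_check_threshold; smaller values are rejected.
    """
    if not hasattr(sys, "set_int_max_str_digits"):
        # Older interpreters have no limit to adjust.
        return int(text)
    previous = sys.get_int_max_str_digits()
    sys.set_int_max_str_digits(max_digits)
    try:
        return int(text)
    finally:
        # The setting applies to the whole interpreter, so put it back.
        sys.set_int_max_str_digits(previous)
```

Because the setting is interpreter-wide, other threads see the raised limit while the conversion runs, so this pattern is only reasonable where that is acceptable.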
        
       | qbane wrote:
       | Yeah, we must prevent DoS at all costs. It seems that Python
       | should not have integers at arbitrary size for "performance"
       | reason in the beginning. Aren't int32/int64/int128 nice? Number
       | of operations are all bounded. We should stick to them.
        
         | kragen wrote:
         | This was Python's behavior until Python 2; `long`, the
         | arbitrary-precision integer, was a separate type, and `int`
         | arithmetic overflow caused an OverflowError. One of the big changes
         | in Python 2 was to imitate the behavior of Smalltalk and (most)
         | Lisp by transparently overflowing `int` arithmetic to `long`
         | instead of requiring an explicit `long()` cast. Python 3
         | eliminated the separate `long` type altogether.
         | 
         | Having been bitten by the Smalltalk behavior, I am skeptical
         | that the Python 2 change was a good idea.
        
       | justinsaccount wrote:
       | From the linked bug..
       | 
       | > It takes about 50ms to parse an int string with 100,000 digits
       | and about 5sec for 1,000,000 digits. The float type, decimal
       | type, int.from_bytes(), and int() for binary bases 2, 4, 8, 16,
       | and 32 are not affected.
       | 
       | Sure seems strange to set the limit to 4300. 50ms is not a DoS.
        
         | xani_ wrote:
         | ballooning 2ms request to 50ms is absolutely a DoS
         | 
         | that's only 20req/sec to fill a core of execution
        
       | schoen wrote:
       | If you need to make integers this big from decimal
       | representations, I guess you could still use gmpy2.mpz(), and
       | then either leave the result as an mpz object (which is generally
       | drop-in compatible with Python's int type, with the addition of
       | some optimized assembly implementations of arithmetic operations
       | and some additional methods), or convert it to a Python int by
       | calling int() on it.
        
       | blibble wrote:
       | new interpreter argument:
       | 
       |     -X int_max_str_digits=number
       |         limit the size of int<->str conversions.
       |         This helps avoid denial of service attacks when parsing
       |         untrusted data.
       |         The default is sys.int_info.default_max_str_digits.  0 disables.
       | 
       | this should not be a runtime configuration setting, fix the
       | sodding algorithm to not be quadratic
       | 
       | will we be getting PHP style magic quotes soon? that also
       | protects developers against untrusted input (bonus! this could be
       | configured too!)
       | 
       | or an inability to pass strings into the regular expression
       | module? that can also cause DoS
       | 
       | (what happened to Python?)
        
         | simonw wrote:
         | My understanding is that there is no algorithm for this that
         | isn't quadratic.
         | 
         | Update: I may have understood incorrectly, see
         | https://github.com/python/cpython/issues/90716
        
           | blibble wrote:
           | > My understanding is that there is no algorithm for this
           | that isn't quadratic.
           | 
           | > If you know of one, the Python core development team would
           | love to hear about it!
           | 
           | it's mentioned on the issue page that makes up the article...
           | 
           | (before they closed it due to the "code of conduct")
        
             | [deleted]
        
       | jwilk wrote:
       | https://github.com/python/cpython/issues/95778 has more
       | information.
        
         | dang wrote:
         | Ok, we'll change to that from
         | https://pythoninsider.blogspot.com/2022/09/python-
         | releases-3.... Thanks!
         | 
         | All: submitted title was "`int('1' * 4301)` will raise
         | ValueError starting with Python 3.10.7" and comments reference
         | that, so you might want to take a look at both URLs.
        
       | svet_0 wrote:
       | So now an unreasonable user input will crash my server instead of
       | slowing it down by 50ms. Great DoS mitigation!
        
         | Ukv wrote:
         | In addition to omnicognate's point, calling `int` on user input
         | would generally already expect a possible ValueError.
        
         | omnicognate wrote:
         | Your server crashes if a request fails?
        
           | xani_ wrote:
           | it does with this change where it didn't before. At the very
           | best you're still restarting the whole process instead of
           | just wasting a bit of time
        
             | fuckstick wrote:
             | Who uses a process per request for serving Python apps?
             | That must be very uncommon. Even if you use a worker pool
             | that isn't going to restart a whole process just because of
             | an errant exception in a request handler.
             | 
             | Also as noted if your whole process crashes because of
             | errant input to int() you are beyond fucked in other ways.
        
             | aYsY4dDQ2NrcNzA wrote:
             | Then don't upgrade Python in your container?
        
             | progval wrote:
             | You should always catch ValueError when using int() on user
             | input, because that input may not be a valid number.
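
A small sketch of the pattern described above: treat `int()` on user input as something that can fail, and reject absurdly long inputs before converting. `MAX_DIGITS` and `parse_user_int` are made-up names for illustration, and the limit of 20 digits is an arbitrary application-level choice, not anything Python prescribes.

```python
from typing import Optional

MAX_DIGITS = 20  # hypothetical application limit, pick what your data needs


def parse_user_int(raw: str) -> Optional[int]:
    """Return the parsed integer, or None if the input is not acceptable."""
    text = raw.strip()
    # Length guard first: allow one extra character for a sign.
    if len(text) > MAX_DIGITS + 1:
        return None
    try:
        return int(text)
    except ValueError:
        # Covers non-numeric input and, on interpreters with the new
        # limit, strings that exceed the int<->str digit limit.
        return None
```

A request handler written this way turns oversized or malformed numbers into an ordinary validation failure rather than an unhandled exception.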
        
               | [deleted]
        
       | ridiculous_fish wrote:
       | Why is base 10 string -> int a quadratic algorithm? Are there no
       | faster ones that could be implemented?
        
         | blahedo wrote:
         | No, because 10 is not a power of 2, so any digit in the source
         | (base 10) can affect any digit in the result (base 2).
         | Converting from e.g. base 16 to base 2 is linear, because 16 is
         | a power of 2.
        
       | saghm wrote:
       | I was surprised to see this in a bugfix release since it seems
       | like a breaking change, but from reading, it seems that this was
       | considered a security vulnerability (specifically a DOS
       | opportunity) given the CVE status, so I imagine that
       | compatibility concerns were secondary here. This seems in line
       | with how other languages seem to do things from what I've seen;
       | semver is important, but in a sense not every change is equally
       | "breaking" to users, and breaking code that's unlikely to be
       | common and potentially is not behaving correctly in the first
       | place is not going to cause as much friction as most other types
       | of breaking changes. Put another way, if there's a valid security
       | concern, breaking things loudly for users forces them to double
       | check their usage of this sort of code and ensure that nothing
       | risky is going on. (I don't personally have enough domain
       | knowledge here to know if the security concern is actually valid
       | or not, but the decision to make this change in a patch release
       | seems like a reasonable conclusion to come to for people who
       | determine that it is a security concern).
        
       | bo1024 wrote:
       | From the link:
       | 
       | > Everyone auditing all existing code for this, adding length
       | guards, and maintaining that practice everywhere is not feasible
       | nor is it what we deem the vast majority of our users want to do.
       | 
       | It's hard not to read this as "we want to use untrusted input
       | everywhere with no consequences". Seems like we'll be kicking as
       | many issues under the rug as we're fixing with this change,
       | right?
        
         | bostik wrote:
         | I read it the other way round - untrusted input is used in
         | various places where doing such inline checks is prohibitively
         | tricky. The examples given are quite telling: json, xmlrpc,
         | logging. First two are everywhere in APIs. The third is just
         | ... everywhere.
         | 
         | Are you really going to use a JSON or XML stream parser _first_
         | before feeding it to the stdlib module? And one that does not
         | try to expand the read values to native types? As for logging,
         | that is certainly the place where you are not only expected,
         | but often required to use untrusted input.
         | 
         | The fix feels like a heuristic and a compromise. None of the
         | [easily available] solutions are robust, solid or performant,
         | so someone picked an arbitrary threshold that should never be
         | hit in sane code.
         | 
         | The linked issue mentions that GMP remains fast even in face of
         | absurdly big numbers. No surprise, the library is _literally_
         | designed for it: MP stands for multi-precision (ie. big int and
         | friends).
        
           | adgjlsfhk1 wrote:
           | this would all make more sense if python was using a
           | reasonably fast string to int routine, but the one they are
           | using is asymptotically bad, and the limit they chose is
           | roughly a million times lower than it should have been.
        
         | rwmj wrote:
         | Did they consider doing tainting (like Perl)? Input strings are
         | marked as tainted and anything derived from them, except for
         | some specific operations that untaint strings. If you use a
         | tainted string for a security-sensitive operation then it
         | fails. http://perlmeme.org/howtos/secure_code/taint.html
        
         | Dylan16807 wrote:
         | It's easy for me not to read it that way! Converting to an
         | integer is a very good start for validating many kinds of
         | input.
        
       | machina_ex_deus wrote:
       | This is way too low, I've used RSA keys in base 10 with half the
       | size of this string. It corresponds to only 14,000 bit numbers,
       | there are 8192 bit keys. I'm pretty sure this will break some CTF
       | challenges. The limit should be in the millions at the very
       | least.
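
The bit counts above can be checked with a little arithmetic, since each decimal digit carries log2(10) (about 3.32) bits. The sketch below uses illustrative names and only the standard library.

```python
import math

DIGIT_LIMIT = 4300  # default limit discussed in this thread

# Bits needed to hold any number with DIGIT_LIMIT decimal digits.
bits_at_limit = math.ceil(DIGIT_LIMIT * math.log2(10))


def decimal_digits_for_bits(bits: int) -> int:
    """Decimal digits in the largest `bits`-bit number, 2**bits - 1."""
    return math.floor(bits * math.log10(2)) + 1


# Digits in an 8192-bit value, for comparison with the 4300-digit limit.
rsa_8192_digits = decimal_digits_for_bits(8192)
```

Here `bits_at_limit` comes out a little above 14,000, matching the figure in the comment, and an 8192-bit value needs roughly 2,500 decimal digits, which is more than half the limit.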
        
         | munch117 wrote:
         | It does seem very low.
         | 
         | However, you shouldn't be passing million-digit numbers around
         | as (decimal) text. Even if you're not at risk of DOS attacks,
         | there's still the issue that it's very, very slow:
         | 
         |     $ python3 -m timeit -s "s='1'*1000000" "i=int(s)"
         |     1 loop, best of 5: 5.77 sec per loop
         | 
         | A ValueError alerting you to that fact could be considered a
         | service.
         | 
         | Contrast and compare:
         | 
         |     $ python3 -m timeit -s "s='1'*1000000" "i=int(s,16)"
         |     200 loops, best of 5: 1.45 msec per loop
        
           | adgjlsfhk1 wrote:
           | python being slow isn't news. that's not a reason for an
           | error.
        
           | nomel wrote:
           | > However, you shouldn't be passing million-digit numbers
           | around as (decimal) text
           | 
           | This is about numbers that are thousands of digits, not
           | millions. Regardless, why not? What's the alternative that
           | supports easy exchange? If you stick it in some hexified
           | representation, you still have to parse text, and put it into
           | some non-machine-native number container. It's going to be
           | slow no matter what.
        
             | blibble wrote:
             | you can convert hex into binary directly without any
             | multiplications
        
             | munch117 wrote:
             | No, it's not going to be slow no matter what. Didn't you
             | see my example? The hexadecimal non-machine-native textual
             | representation was 4000 times faster than the decimal
             | ditto. On a number that was much larger, I might add.
             | 
             | Hex number parsing is linear time.
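
To illustrate why hex is the easy case: each pair of hex digits maps to exactly one byte, so the text can be turned into raw bytes and handed to `int.from_bytes()`, which the quoted bug report lists as unaffected by the limit. This is a sketch with a hypothetical function name; it handles unsigned values only, and plain `int(text, 16)` does the same job.

```python
def parse_hex_linear(text: str) -> int:
    """Parse an unsigned hex string without any decimal-style multiplication."""
    if len(text) % 2:
        # bytes.fromhex needs an even number of hex digits.
        text = "0" + text
    # Two hex digits become one byte; no digit affects any other byte.
    raw = bytes.fromhex(text)
    return int.from_bytes(raw, "big")
```

Decimal has no such digit-to-bits alignment, which is why the base-10 path needs real arithmetic on the whole number.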
        
               | schoen wrote:
               | I could imagine people overlooking that little "m" in
               | your example's output!
        
               | nomel wrote:
               | Indeed I did!
        
       | im3w1l wrote:
       | This will break correct code for a fairly small benefit. I don't
       | think they should do this in a patch release.
        
         | [deleted]
        
         | [deleted]
        
       | gfd wrote:
       | Why did they close the discussion due to code of conduct? I
       | didn't see anything wrong with the previous comments before that
       | point.
        
         | klodolph wrote:
         | > As a reminder to everybody the Python Community Code Of
         | Conduct applies here.
         | 
         | > Closing. This is fixed. We'll open new issues for any follow
         | up work necessary.
         | 
         | The issue was marked closed, because the associated work was
         | completed and the PR was merged. The same comment happened to
         | mention the code of conduct, but the code of conduct wasn't why
         | the issue was closed--it was just because the work was done.
         | 
         | I think the comment mentioned the CoC because the previous
         | comment, "This is appalling" was a bit rude.
        
           | Delk wrote:
           | > I think the comment mentioned the CoC because the previous
           | comment, "This is appalling" was a bit rude.
           | 
           | The previous comment was indeed a bit rude. I personally
           | wouldn't think it was rude enough to invoke a code of
           | conduct.
           | 
           | Even just referring to a code of conduct has, IMO, a rather
           | strong vibe of policing and perhaps even an implication of
           | wrongdoing, more so than merely a suggestion to keep it calm.
           | 
           | I don't know the culture or context of Python development
           | (either the language or CPython), but I'm inclined to agree
           | with gfd that it's a bit weird to start reminding people of a
           | CoC because of a slightly rude sentence or two, especially
           | since the rest of the comment was reasonable technical
           | argumentation even if unapologetic.
           | 
           | Even if closing the issue were entirely because of other
           | reasons and benign (someone did still reference the issue in
           | a commit later, though), it's all too easy to see the issue-
           | closing comment as shutting out dissenting opinions, either
           | because of a somewhat unpleasantly expressed argument or
           | simply because "this is fixed, no further discussion needed".
           | 
           | The "this is appalling" comment may have been a bit rude but
           | the closing one wasn't exactly a triumph in communication
           | either.
        
             | Guthur wrote:
             | "This is appalling" is not even remotely rude, honestly are
             | we all children now?
        
               | blibble wrote:
               | your new comment violates the PSF "code of conduct" too!
               | 
               | this particular wording could be used to ban any
               | criticism of contributions (regardless of the criticism's
               | correctness):
               | 
               | > Being respectful. We're respectful of others, their
               | positions, their skills, their commitments, and their
               | efforts.
               | 
               | in this sort of environment I guess it's far from
               | surprising that the technical decisions are suffering (to
               | put it politely)
        
             | klodolph wrote:
             | > Even just referring to a code of conduct has, IMO, a
             | rather strong vibe of policing and perhaps even an
             | implication of wrongdoing, more so than merely a suggestion
             | to keep it calm.
             | 
             | I'd say the opposite. A suggestion to "keep it calm" is
             | inappropriate, because it carries the implication that
             | someone is not calm. This is inappropriate because it is a
             | comment on a person's emotional state rather than on what
             | they say or how they say it.
             | 
             | In fact, if someone on my team said to "keep it calm", I'd
             | take that person aside and explain, in private, the reasons
             | why not to say that.
             | 
             | > Even if closing the issue were entirely because of other
             | reasons and benign (someone did still reference the issue
             | in a commit later, though), it's all too easy to see the
             | issue-closing comment as shutting out dissenting opinions,
             | [...]
             | 
             | If somebody thought that closing the issue shut out
             | dissenting opinions, then that person has forgotten how
             | GitHub issues work or how bug trackers work in general.
             | Closing an issue just means that someone thinks that the
             | work on it is done; it does not stop discussion on the
             | issue. I can see why someone might forget and not realize
             | that the issue was closed and _not_ the discussion, but I
             | don't think that it's a problem that someone visiting the
             | bug from HN would forget how GitHub issues work for a
             | minute.
             | 
             | With any online community above a certain size, there's a
             | certain amount of policing not just of what is said, but
             | where people have discussions. Anyone who regularly uses a
             | forum, Subreddit, Discord server, IRC, Slack, etc. will see
             | this pattern of behavior everywhere. For example--the
             | discussion about whether this is the right way to fix a bug
             | is a discussion which should be held elsewhere, where
             | people can see the context and interested parties can
             | respond to it.
             | 
             | Which is why there is a comment at the bottom,
             | 
             | > Please redirect further discussion to discuss.python.org.
             | 
             | It's crystal clear to me that this is not about shutting
             | out dissenting voices, but just saying that this GitHub
             | issue is the wrong place for this discussion.
             | 
             | You can see that there is a related issue which was closed,
             | but there was a lot of discussion afterwards--but because
             | the discussion was on-topic, the issue was not locked.
             | 
             | https://github.com/python/cpython/issues/90716
        
               | Delk wrote:
               | > I'd say the opposite. A suggestion to "keep it calm" is
               | inappropriate, because it carries the implication that
               | someone is not calm.
               | 
               | Perhaps a suggestion to "keep it calm" wouldn't be the
               | best. English isn't my first language and my verbal
               | expression isn't always the greatest. But referring to a
               | code of conduct does also carry the implication that
               | someone isn't minding that code, and I don't see how that
               | would necessarily be better.
               | 
               | In my view, suggesting that someone isn't calm is less of
               | a reprimand than suggesting they might be in breach of a
               | code of conduct which, among other things, includes rules
               | against outright harassment and other clearly
               | reprehensible behaviour. It's normal to not be calm at
               | times; it's another thing if someone needs to be reminded
               | of the rules of a community. Perhaps it's a cultural
               | thing but to me the latter is stronger judgement.
               | 
               | There may well be reasons for not saying to keep it calm
               | (it sometimes simply doesn't work), but I can equally
               | well see how people might see a reference to a CoC as
               | strong-armed.
               | 
               | > If somebody thought that closing the issue shut out
               | dissenting opinions, then that person has forgotten how
               | GitHub issues work or how bug trackers work in general.
               | Closing an issue just means that someone thinks that the
               | work on it is done; it does not stop discussion on the
               | issue.
               | 
               | That's fair enough. Perhaps the intention is clear enough
               | within the community that it would indeed be deemed as
               | simply closing that rather specific GitHub issue without
               | implying that the matter is closed.
               | 
               | Human communication isn't always quite that simple,
               | though. People get impressions from the way things are
               | expressed. "This is fixed." makes it feel that there is
               | nothing to be discussed about that particular change and
               | that it is final.
               | 
               | I don't know the particular community well enough to know
               | how it would be interpreted, though.
               | 
               | > Which is why there is a comment at the bottom,
               | 
               | >> Please redirect further discussion to
               | discuss.python.org.
               | 
               | That's after the comment that closed the issue. Had it
               | been in the issue-closing comment, that would have left a
               | different taste to the closing.
        
       | googlryas wrote:
       | For anyone wondering, '1' * 4301 creates a string of '11111....'
       | 4301 characters long. It doesn't result in an integer value of
       | 4301 like in some other languages.
       | 
       | I find this a strange modification to the language, though
       | probably not a particularly painful one. Has python saved you
       | from yourself when dealing with non-linear built-in algorithms
       | before? IIRC it is also possible to have the regex engine take an
       | inordinate amount of time for certain patterns (I think
       | Stack Overflow was affected by this?), but the engine wasn't
       | hobbled to throw in those cases; it is merely up to the user to
       | write efficient regexes that aren't subject to those problems.
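
A minimal sketch of the behaviour described above, assuming a CPython build that includes the int/str digit limit and has it at its default of 4300 digits. The `getattr` guard is only there so the snippet also runs on older interpreters, where the conversion simply succeeds.

```python
import sys

digits = '1' * 4301          # string repetition: 4301 copies of '1'
assert len(digits) == 4301

# sys.get_int_max_str_digits only exists on interpreters that carry the fix.
get_limit = getattr(sys, "get_int_max_str_digits", None)
limit = get_limit() if get_limit else 0   # 0 means "no limit"

try:
    value = int(digits)      # decimal str -> int is what the limit guards
    outcome = "converted"
except ValueError:
    outcome = "rejected"

print(f"limit={limit}, int('1' * 4301) was {outcome}")
```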
        
         | ffhhj wrote:
         | They should have made the analogous inverse operation: '1234' /
         | 2 = ['12', '34']
        
           | bsdz wrote:
           | I was more expecting '1111' / 4 = '1'. This would be the
           | inverse operation. However, it opens up even more questions
           | like what to do if your string has mixed values etc
        
             | ffhhj wrote:
             | The string multiplication is about _joining_ strings, the
             | inverse is about _splitting_ them in several parts. It's
             | only confusing because the * appends the string to itself,
             | the / is actually very clear.
        
               | dekhn wrote:
               | Disagree. The inverse "string" * value is logically
               | splitting, _and then collapsing the repeated values_. The
               | logical split can be omitted, but the collapsing cannot.
        
               | [deleted]
        
           | tremon wrote:
           | That's not the inverse of the multiplication though. The
           | inverse would be '33' / 2 = '3', and '1234'/2 should then
           | probably raise a ValueError.
        
         | hyperpape wrote:
         | Backtracking regular expressions as an intentional or
         | accidental DOS vector are a moderately well-known issue, and
         | while I prefer that a standard library implementation be robust
         | against them, I can see the POV that it's buyer beware.
         | 
         | Converting a string to an integer is somewhat less well known
         | as a DOS vector, more painful to avoid as an application
         | creator, and easier to fix in code.
         | 
         | So there's a cost-benefit argument that you should just do this
         | before you rewrite your regex engine.
        
           | masklinn wrote:
           | > I can see the POV that it's buyer beware.
           | 
           | On the other hand, lots of buyers are not aware that it's an
           | issue, and more frustratingly there are regex engines which
           | are very resilient to it... but are not widely used.
           | 
           | Python's stdlib will fall over on any exponential
           | backtracking pattern, but last time I tried to make postgres
           | fall over I didn't succeed. Even though it does have
           | lookahead, lookbehind, and backrefs, so should be susceptible to
           | the issue (aka it's not a pure DFA).
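
As an illustration of the exponential backtracking mentioned above, here is a small sketch using the textbook pattern `(a+)+$`. The pattern and input sizes are chosen for illustration and do not come from the thread. The sizes are kept small so the loop finishes quickly; the time per step tends to grow sharply as `n` increases.

```python
import re
import time

# Nested quantifiers: each 'a' can belong to the inner or the outer repeat,
# so a failing match may explore exponentially many ways to split the input.
pattern = re.compile(r'(a+)+$')

for n in (14, 16, 18):
    subject = 'a' * n + 'b'      # the trailing 'b' forces the match to fail
    start = time.perf_counter()
    result = pattern.match(subject)
    elapsed = time.perf_counter() - start
    print(f"n={n}: match={result}, {elapsed:.4f}s")
```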
        
         | bo1024 wrote:
         | This does seem like a strange level of handholding, even if the
         | motivation makes lots of sense. If you start going down the
         | road of protecting people who don't sanitize user input, you
         | may have quite a long journey ahead...
        
         | mjevans wrote:
         | Operator overloading sure seems to increase the prevalence of
         | foot-guns, security issues, and other gotchas.
         | 
         | str.ccClone(4301) # ConCatenate Clones of the source string N
         | times.
         | 
         | Would even an abbreviated, named, function not be more self
         | documenting and better for human and machine reviews?
        
           | proto_lambda wrote:
           | Other than that being a terrible name (it's almost impossible
           | to be sure what it does without consulting documentation), I
           | personally do prefer fewer implicit/overloaded operations.
        
             | mjevans wrote:
             | What name would you suggest? That was my 5 min of thought
             | version.
             | 
             | cc prefix for concatenate because that word is very long
             | and it seemed likely that strings may have a large number
             | of different concatenation focused functions that could all
             | share the prefix.
             | 
             | Clone as the type of concatenation operation to perform.
        
               | proto_lambda wrote:
               | Rust uses `repeat()`, which sounds much more descriptive
               | to me. The types in the function signature make the
               | "clone" part of the name redundant.
        
               | mjevans wrote:
               | Offhand, is repeat(0) an empty string, repeat(1) the
               | input string, etc? If so that's a great name for the
               | function.
        
               | pezezin wrote:
               | Repeat is an iterator, so you can apply it to any type
               | you want, not just strings. You can chain it with other
               | iterators, or collect it into some data structure. But
               | yes, repeat(0) returns an empty iterator.
               | 
               | https://doc.rust-lang.org/std/iter/fn.repeat.html
        
           | slaymaker1907 wrote:
           | I think how Rust does it is fine, but I agree operators are
           | often a mess. Yesterday I was looking at a memory dump where
           | there was a problem in a destructor (a double free was
           | detected) and it was an absolute mess trying to figure out
           | the exact execution location in source code since it was
           | setting the value of a smart pointer which triggered a
           | decrement of a reference counted value in turn triggering a
           | free. It's junk like that which starts to convince me that
           | Linus was right to avoid C++. Rust obviously also has
           | destructors, but it doesn't have the nightmare that is
           | inheritance+function overloading+implicit casting.
        
             | cma wrote:
             | > and it was an absolute mess trying to figure out the
             | exact execution location in source code since it was
             | setting the value of a smart pointer which triggered a
             | decrement of a reference counted value in turn triggering a
             | free.
             | 
             | Isn't all that context there in the stack trace?
        
               | jlarocco wrote:
               | Yes, probably. Depends on the compiler settings. Stuff
               | can get optimized out and stripped.
               | 
               | When writing the code in the first place, though, it's
               | difficult to see problems like that because it's all
               | hidden behind magic calls to copy constructors, move
               | semantics, and destructor calls. Out of sight, out of
               | mind.
        
               | DSMan195276 wrote:
               | I think it's separate from his point but some of those
               | things could potentially be tail calls, meaning the
               | functions actually leading to the free/delete might not
               | be in the stacktrace even if they were called.
        
           | UncleEntity wrote:
           | It is really useful sugar for:
           | 
           |     for _ in range(4301):
           |         llama.append('1')
           | 
           | (there's probably an easier way to do that but you get the
           | point)
           | 
           | where python can see both sides of the operation and optimize
           | it on the C side of things.
           | 
           | The issue really has nothing to do with that though, it is
           | converting a string to an int which is the whole point of the
           | security update.
        
           | Gordonjcp wrote:
           | > Operator overloading sure seems to increase the prevalence
           | of foot-guns, security issues, and other gotchas.
           | 
           | How exactly? What would you expect an expression like ('1' *
           | 4301) to give you, and why would you think it would be
           | different from ('caterpillar' * 4301)?
        
             | qayxc wrote:
             | Well, let's assume that the "expected" behaviour holds,
             | shall we? Let's open up a python REPL and try
             | 
             |     >>> 'caterpillar' * 2
             |     'caterpillarcaterpillar'
             | 
             | OK, now for something different:
             | 
             |     >>> [1, 2, 3] * 2
             |     [1, 2, 3, 1, 2, 3]
             | 
             | Marvellous! How about this then:
             | 
             |     >>> True * 2
             |     2
             | 
             | Wait, what? Hm.
             | 
             |     >>> False * 2
             |     0
             | 
             | Whoops! Implicit type conversion takes place... Even worse:
             | 
             |     >>> 'abc' + 'efg'
             |     'abcefg'
             |     >>> 'efg' + 'abc'
             |     'efgabc'
             | 
             | Now I'm stumped. Isn't addition supposed to be commutative?
             | 
             | So yeah, without contracts in place, operator overloading
             | is BAD. You can never know what the operator does, or what
             | its properties are by just looking at how it's used.
             | There are simply no enforced rules and so no-one's stopping
             | you from doing
             | 
             |     >>> class Complex:
             |             def __init__(self, real, imag):
             |                 self.real = real
             |                 self.imag = imag
             |             def __add__(self, other):
             |                 return Complex(self.real - other.real, self.imag - other.imag)
             |             def __repr__(self):
             |                 return f'Complex({self.real}+{self.imag}j)'
             |     >>> x = Complex(1, 2)
             |     >>> y = Complex(1, 2)
             |     >>> x + y
             |     Complex(0+0j)
             | 
             | Now this is intentionally malicious, of course, but
             | plenty of libraries overload operators in non-intuitive
             | ways so that the operator's properties and behaviour aren't
             | obvious. This is especially true if commutative operators
             | are implemented as being non-commutative (e.g. abusing '+'
             | for concatenation instead of using another symbol like '&'
             | for example) or if the behaviour changes depending on the
             | order of operands.
        
             | samatman wrote:
             | In Lua, the first is 4301 and the second is a runtime
             | error. ('1' .. 4301) is 14301, the equivalent of the weird
             | thing Python is fixing would be spelled
             | `tonumber(('1'):rep(4301))` which is obviously wrong.
             | 
             | To my taste operator overloading is fine, but concatenation
             | isn't addition, so they shouldn't be overloaded because...
             | [gestures vaguely at a half dozen languages]
        
           | im3w1l wrote:
           | Succinct string operations is honestly like half of what I
           | use python for and the great numeric support with bignum by
           | default and powerful libraries with overloads like numpy and
           | tensorflow is the other half.
        
         | jejones3141 wrote:
         | In Algol 68, you can do that; it's part of the standard
         | prelude. I think that some people who'd worked on Algol 68 in
         | the Netherlands also worked on the ABC language, where it's "1"
         | ^^ 4301, and Guido worked on ABC before Python.
        
         | gsliepen wrote:
         | Well, in C++ int('1' * 4301) is a perfectly valid expression,
         | but it evaluates to 210749, not 4301.
        
           | oldgradstudent wrote:
           | Or some other value.
           | 
           | If sizeof(int)=2, the result is undefined.
        
             | gsliepen wrote:
             | Not if CHAR_BIT is 10 or more!
        
               | oldgradstudent wrote:
               | I wonder how much software will fail on platforms where
               | CHAR_BIT is not 8.
        
             | eMSF wrote:
             | Whether evaluating that expression results in undefined
             | behaviour also depends on the basic execution character set
             | and the bit width of the machine byte.
        
           | dark-star wrote:
           | it doesn't evaluate to 4301 in Python either ;-)
        
       | Phil_Latio wrote:
       | What's next? A default socket timeout of X seconds for security
       | reasons? What a joke and rather scary that apparently everyone or
       | the majority on the internal side agrees with this change.
        
         | linspace wrote:
         | I find it completely unpythonic. Python has become too
         | important to do the right thing, there is money on the table.
        
         | LtWorf wrote:
         | I think python is now completely owned by a couple big
         | companies that decide everything.
         | 
         | By this logic they should also block me from running benchmarks
         | on too big lists, because I'm dossing myself.
        
         | krick wrote:
         | This. I don't really understand CPython decision-making
         | process, but it just seems like a common sense that anybody who
         | would find this a good idea surely must be a very junior
         | developer who shouldn't be allowed to commit directly to the
         | master branch of your local corporate project just yet... But
         | basically breaking a perfectly logical behaviour just like that
         | in a language used by millions of people... To me it's
         | absolutely shocking.
        
       | loeg wrote:
       | Will Python's relentless campaign to break backwards
       | compatibility never end? (80% sarcastic.)
        
         | klyrs wrote:
         | Don't worry, it's a minor release. (110% sarcastic)
        
           | tremon wrote:
           | It's a patch release, not even minor (100% serious).
        
       | mywittyname wrote:
       | What should you use instead if you want the original
       | functionality?
        
         | Veedrac wrote:
         | https://docs.python.org/3/library/stdtypes.html#configuring-...
        
           | mywittyname wrote:
           | If I'm understanding this correctly: the only way to convert
           | an extremely large base10 string to an integer using the
           | standard library is to muck with global interpreter settings?
           | 
           | It seems short sighted to not provide some function that
           | mimics legacy functionality exactly. Even if it is something
           | like int.parse_string_unlimited(). Especially since a random
           | library can just set the cap to 0 and side-step the problem
           | entirely.
        
             | Someone wrote:
             | > Especially since a random library can just set the cap to
             | 0 and side-step the problem entirely.
             | 
             | Until another random library sets it to its preferred value
             | (see https://news.ycombinator.com/item?id=32738206 for a
             | similar issue with a CPU flag for supporting IEEE
             | subnormals)
             | 
             | We might end up with libraries that keep setting that
             | global to the value they need on every call into them.
        
               | mywittyname wrote:
               | Oh fun. Just what Python needs more of, this...
               | 
               |     try:
               |         value = int(value_to_parse)
               |     except ValueError:
               |         import sys
               |         __old_int_max_str_digits = sys.get_int_max_str_digits()
               |         sys.set_int_max_str_digits(0)
               |         value = int(value_to_parse)
               |         sys.set_int_max_str_digits(__old_int_max_str_digits)
               | 
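               | Or maybe just this:
               | 
               |     import sys
               | 
               |     class UnboundedIntParsing:
               |         def __enter__(self):
               |             # Remember the current limit, then lift it (0 = no limit).
               |             self.__old_int_max_str_digits = sys.get_int_max_str_digits()
               |             sys.set_int_max_str_digits(0)
               |             return self
               |         def __exit__(self, *args):
               |             # Restore the previous limit, even if int() raised.
               |             sys.set_int_max_str_digits(self.__old_int_max_str_digits)
               | 
               |     with UnboundedIntParsing() as uip:
               |         value = int(str_value)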
        
               | dmurray wrote:
               | Needs to be made thread safe!
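
One way to sidestep both the global setting and the thread-safety concern is to never feed `int()` more digits than the limit allows. The sketch below is an illustration, not something proposed in the thread: `parse_decimal_unlimited` and `chunk_digits` are hypothetical names, and the chunk size of 4000 assumes the limit is at its default of 4300 or higher. It is not tuned for speed, since combining the chunks still does big-integer arithmetic on every step.

```python
def parse_decimal_unlimited(text, chunk_digits=4000):
    """Parse an arbitrarily long base-10 string without touching sys settings.

    Hypothetical helper: each int() call sees at most chunk_digits digits,
    so it stays under the interpreter-wide limit (assumed >= chunk_digits).
    """
    text = text.strip()
    sign = 1
    if text and text[0] in "+-":
        sign = -1 if text[0] == "-" else 1
        text = text[1:]
    # Accept ASCII digits only; reject empty input and anything else.
    if not (text.isascii() and text.isdigit()):
        raise ValueError("invalid literal for a base-10 integer")

    value = 0
    for start in range(0, len(text), chunk_digits):
        chunk = text[start:start + chunk_digits]
        # Shift the digits parsed so far left, then append this chunk.
        value = value * 10 ** len(chunk) + int(chunk)
    return sign * value


big = parse_decimal_unlimited('1' * 10000)
print(big.bit_length())
```

Because no interpreter-wide state is changed, concurrent callers in other threads keep whatever limit is configured.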
        
       | js2 wrote:
       | 4300 digits?
       | 
       | > Chosen such that this isn't wildly slow on modern hardware and
       | so that everyone's existing deployed numpy test suite passes
       | before https://github.com/numpy/numpy/issues/22098 is widely
       | available.
       | 
       | https://github.com/python/cpython/blob/511ca9452033ef95bc7d7...
        
       ___________________________________________________________________
       (page generated 2022-09-07 23:01 UTC)