hngopher.com

       [HN Gopher] "To be safe allocate 5 bytes more than you need"
       ___________________________________________________________________
        
       "To be safe allocate 5 bytes more than you need"
        
       Author : luu
       Score  : 165 points
       Date   : 2020-11-23 05:16 UTC (17 hours ago)
        
 (HTM) web link (twitter.com)
 (TXT) w3m dump (twitter.com)
        
       | SilasX wrote:
       | Related: There was that story about how axle count on Swiss
       | trains can't be a multiple of 256 (2^8) [0], and a comment
       | mentioned that German has a limit of 252 "just to be safe":
       | 
       | https://news.ycombinator.com/item?id=12071686
       | 
       | [0] Because there are only 8 bits to store the axle counter for
       | whether the train has passed, and you don't want overflow to
       | zero.
        
         | jbay808 wrote:
         | That's interesting! But a safety factor does make sense here if
         | you're worried about a small probability of double-counting an
         | axle.
        
       | ajarmst wrote:
       | Were it my team, this would get bounced with a request for a link
       | to the bug-report/documentation in cJSON that necessitates this
       | wart and at least a FIXME flag so that we periodically re-visit
       | the 'why does cJSON still not correctly estimate its memory and
       | if it can't then why the hell is it not returning the minimum
       | safe value rather than a guess?' unsightly hack in our beautiful
       | code.
        
       | asdfasgasdgasdg wrote:
       | We have a kind of varint-based delta decoder at work that can
       | read up to 15 bytes past the end of its buffer, because it uses
       | uint64s to read the buffer faster, than the buffer doesn't encode
       | byte-length information. Since it doesn't know how many bits wide
       | the last entry will be, and since it reads ahead a little, it can
       | end up 16 bytes starting at the last byte and proceeding 15 bytes
       | onward. It doesn't matter that these bytes are filled with
       | garbage, because a well formed program will only use the bits
       | from one of the valid bytes. The outer code knows how many
       | entries to look for, so it just calls the advance function the
       | correct number of times, and those bits past the end don't get
       | used.
       | 
       | This can lead to a pretty subtle crash because, except in msan
       | mode, reading those fifteen bytes won't cause any problem except
       | at the page boundary. This is because you don't corrupt the
       | sentinel bytes, since they're never written. So to detect the bug
       | you either have to run the code in debug mode (which we rarely
       | do, because the code is deep inside a very expensive database
       | system), or you have to get unlucky.
       | 
       | I tried fixing this, but adding a "past the end" test really hurt
       | performance. Instead, we continued with the preexisting solution:
       | copy the data into a buffer fifteen bytes longer than the data,
       | then decode it. T_T
        
         | Negitivefrags wrote:
         | We ran into the same problem, only in the AMD GPU driver.
         | 
         | It would very occasionally crash during a draw that reads the
         | last few bytes of a vertex buffer on AMD cards only.
         | 
         | Very annoying to find because it only triggered rarely.
         | 
         | Our solution was the same, just allocate vertex buffers a
         | little longer on AMD cards and don't use the extra space.
        
           | huhtenberg wrote:
           | Looks suspiciously like an alignment issue, with driver
           | expecting something somewhere to be aligned to 4/8 bytes or
           | be a multiple of dword/qword?
        
           | ajarmst wrote:
           | And, of course, you carefully documented the reason for the
           | larger buffer and encoded the value into a constant or other
           | documentable source. One hopes you _didn 't_ hardcode '5
           | bytes' with a comment that says 'more bytes safer' and
           | nothing else :-).
        
         | nsajko wrote:
         | If the uint64s are aligned, wouldn't it be better to use
         | alignas (to make the buffer aligned as soon as it's allocated)?
         | As far as I see that way you get a longer buffer only when
         | necessary (you never waste memory), plus you could get
         | autovectorized code if the compiler decides that makes sense.
        
         | pedrocr wrote:
         | The same is done for decoding camera raw formats. Variable
         | encoding of the number of bits in the stream means you never
         | know when the decode will end. Reading all the file into a
         | buffer longer than needed is fast, whereas the bit pump is
         | performance critical and having to check for end of buffer
         | every time you want to read would be painful.
        
           | asdfasgasdgasdg wrote:
           | Saddest thing is we're reading these values from a database,
           | and the database SDK does not have a mechanism for reading
           | values into user-controlled buffers without copying. That is
           | a conceptually neat solution to the problem, but since it's a
           | somewhat strange thing to want to do, we can't do it in our
           | case. Storing the extended values is something I also
           | considered, but there are tens or hundreds of billions of
           | them in production. Couldn't really ask for an extra PB of
           | space for the relatively small performance and elegance
           | benefit you get from not copying these things.
        
             | pedrocr wrote:
             | Extra copying is the downside if you don't control what
             | you're reading from. Thankfully memcpy is fast but if your
             | input is very big it's still painful. In rust I just used
             | an API that allows reading from both an input stream and
             | memory so that if you are reading directly from a file the
             | copy is avoided:
             | 
             | https://github.com/pedrocr/rawloader/blob/c793a132fa03e336f
             | 8...
        
         | choeger wrote:
         | That's kind of spooky, it sounds like we work for the same
         | company, because we have a very similar thing. I am pretty sure
         | that no one at our company knows its details as well as you
         | just described yours...
         | 
         | Maybe someone should open source an efficient yet safe varint
         | writer?
        
           | asdfasgasdgasdg wrote:
           | As far as I could tell the problem was somewhat inherent in
           | the encoding. One thing that could have been done is
           | something like                   for (int i = 0; i<n-15; ++I)
           | {           dec.FastUnsafeRead(result);         }         for
           | (int j = n-15; j < n; ++j) {
           | dec.SlowSafeRead(result);         }
        
         | woofie11 wrote:
         | The concept of a preprocessor is helpful here.
         | 
         | Most modern languages no longer have preprocessors, which is a
         | pity.
         | 
         | In cases like these, you want complex computed constants. You
         | want those computed at build time, and not at runtime. I can:
         | 
         | * Run the math on a whiteboard, compute +5 or +15, hardcode
         | that, and have the world explode when the code changes
         | 
         | * Run the math in code, at runtime, and have the system be
         | slow.
         | 
         | * Code up the math, have it run at build time, and have a clear
         | documentation of what the constant is, why, how it's computed,
         | and have it change automatically if the code changes.
         | 
         | There were awesome things done in C++ templates back in the
         | day, with some pretty magical performance. The preprocessor
         | would unroll loops to the appropriate level for a given
         | processor, align things to cache boundaries, and similar.
         | 
         | I think the reason this might have gone away is that with GHz
         | CPUs, few people care as much about performance anymore. Most
         | code is no longer CPU-bound, but programmer-bound, and lower
         | cognitive complexity is more important than better performance.
        
           | ben509 wrote:
           | Option 4:
           | 
           | * Code up a unit test that replicates end of buffer scenarios
           | and verifies that X bytes is enough and X-1 bytes fails.
        
             | zabzonk wrote:
             | Reading or writing past the end of a buffer in C is
             | undefined behaviour - you cannot therefore write a test in
             | C to check this, as you have no idea how such a test would
             | behave.
        
           | Ar-Curunir wrote:
           | As the sibling comment noted; you don't preprocessors for
           | these; what you want is compile-time computation. It's much
           | more principled than text-based preprocessing
        
           | asdfasgasdgasdg wrote:
           | I don't think a preprocessor can help in this situation,
           | since what's ultimately being read is the data, and the
           | offsets to be read can't be computed without looking at it.
        
           | scott_s wrote:
           | This sort of requirement is met by compile time computations.
           | I know C++11, D and Rust have them, and I imagine many more
           | languages do as well. The need for this sort of thing clearly
           | has not gone away, we're just moving away from a free-form,
           | textual pre-processing stage.
        
             | kenniskrag wrote:
             | constexpr in c++ is pretty magic. Loops, try catch etc. is
             | possible
             | (https://en.cppreference.com/w/cpp/language/constexpr)
        
               | filleokus wrote:
               | One of the few things I was truly amazed by when doing
               | some C++ courses during university. Contrived OOP or
               | extremely cumbersome string manipulation wasn't really my
               | thing. But being able to have compile time evaluation of
               | pretty complicated functions without using macros or
               | anything was so cool. Actually thought about wanting it
               | yesterday in Go
        
             | jdc wrote:
             | Nim too
        
         | benlivengood wrote:
         | How about reading N-15 fast (assuming varints can be as small
         | as a byte) and the rest with bounds-checking? Is N generally
         | too small for that to be useful?
         | 
         | You might also be able to heuristically compare buffer address
         | modulo the page size to N to know whether you need to be safe
         | or not.
        
         | kazinator wrote:
         | > _It doesn 't matter that these bytes are filled with
         | garbage,_
         | 
         | That's only true if the strategy avoids hitting an invalid
         | page.
         | 
         | Reading a single uint64_t which is aligned on an 8 byte
         | boundary will be fine. If any byte of that is within a valid
         | page, the other bytes have to be.
         | 
         | If two uint64_t's are read, then as a whole, they have to be a
         | 16-byte-aligned block for that same assurance to hold.
         | 
         | If the code is using misaligned reads, that itself will crash
         | on some processor architectures, or possibly trigger expensive
         | handling.
        
           | vesinisa wrote:
           | Did you actually read the comment you replied to to the end?
        
         | haberman wrote:
         | The latest optimized version of the protobuf parser does a
         | remarkably similar thing:
         | 
         | https://github.com/protocolbuffers/protobuf/blob/master/src/...
         | 
         | Protobufs are full of varints, and we want to perform only one
         | bounds-check per field. So we ensure that the input buffer
         | always has at least 16 bytes available to parse per field. If
         | there are fewer than 16 bytes remaining, we copy these last few
         | bytes of data into a patch buffer and parse from that instead.
        
           | asdfasgasdgasdg wrote:
           | Possible indication of common ancestry.
        
       | PixelOfDeath wrote:
       | x86_64 malloc already uses 16 byte alignment and portions to
       | support SSE commands. So why not go all the way and allocate full
       | cache lines? This also prevents accidental false sharing between
       | different malloc pointers.
        
       | SamReidHughes wrote:
       | I've used cJSON. Its problems have been far deeper and more
       | serious than this.
        
         | pantalaimon wrote:
         | Did you try jsmn?
        
           | SamReidHughes wrote:
           | No, but I haven't personally done a search of C/C++ JSON
           | libraries.
        
       | quelsolaar wrote:
       | Over allocating is often a very useful strategy.
       | 
       | I wrote a fast Json parser a while ago and one of the largest
       | optimizations was to look at the size of the file, compute the
       | worst case of how much memory it would need, and then do one
       | allocation and then use the memory.
       | 
       | Another example would be a dynamic list of values. If you use a
       | linked list, and have 64bit pointer and a 64bit value, each link
       | costs 16 bytes. If you have a list of say 8 values then that's 8
       | allocations and 16*8 = 128 bytes. If you instead allocate
       | speculatively an array of 16 values also taking 128 bytes, and
       | then only use 8 of the values, you only need one allocation. If
       | the number of values you need grows beyond 16, you need to
       | reallocate, and expensive operation, but compared to making one
       | allocation each time a value is added its much faster. Also all
       | values are close to each other in memory so cache coherence will
       | be much better, and you may end up with an order of magnitude
       | faster access. This can be true even if you do expensive
       | operations like insert. Yes, you can improve the performance of a
       | liked list by allocating lots of links in one allocation, but the
       | added space of the pointers, and the out of order access will
       | still cost you.
       | 
       | Obviously, when you do this you should provide functions that
       | always allocate the right number for the user of the code so that
       | there is no magic numbers visible to the end user....
        
         | hzhou321 wrote:
         | > Over allocating is often a very useful strategy.
         | 
         | It is not!
         | 
         | * If you know exact reasons and exact how much space need
         | (over) allocate, then you should document the reason, and it is
         | no long _over_ allocating.
         | 
         | * If you don't know the exact reasons and only know once a
         | while it requires additional space, then you are not solving
         | the problem. You are kicking the can down the road, and it is
         | likely to explode later, possibly with inappropriate timing.
         | 
         | * If you know exactly how much the space you need, then the
         | over allocating is just misleading code.
         | 
         | Fix Your Code!
         | 
         | EDIT: Your comment apparently is for something else. Anyway,
         | the headline needs some counter correction.
        
           | quelsolaar wrote:
           | I'm talking about internal code that is not exposed to the
           | user. There are loads of situations where something will
           | fluctuate in size rapidly, you dont know beforehand to what
           | extent, and allocating memory every-time it changes is very
           | inefficient. In these cases speculatively over allocating is
           | a very good strategy.
        
       | dls2016 wrote:
       | Haha I just used this code. My little ESP32 data acquisition
       | project has been running two days since switching to the
       | PrintPreallocate function with essentially constant memory usage
       | so -\\_(tsu)_/-. Way too many other things to worry about.
        
         | elcritch wrote:
         | Exactly, if their inputs are fairly well known and tested,
         | adding 5 magic bytes is hackish but effective. It could take
         | days or weeks to debug that properly, if at all.
         | 
         | Nice you were able to get constant memory. I was getting 40byte
         | max memory usage difference between identical RPC calls on the
         | esp32, but that seems due to Nim's ARC garbage collector.
         | Sometimes +20 bytes, sometimes -20 bytes but not leaking.
         | Though maybe it's an esp-idf thing. Took me a bit to just call
         | it good and focus on more important things. -\\_(tsu)_/-
        
       | [deleted]
        
       | 29athrowaway wrote:
       | Looks like a workaround rather than a fix, for whatever problem
       | they were having.
       | 
       | This happens in science too. Many constants are placeholders for
       | stuff we do not understand.
        
         | pantalaimon wrote:
         | But science is a bit like reverse engineering the universe. In
         | this case however _they wrote the source code_.
        
         | alerighi wrote:
         | Despite the (horrible) name computer science, informatics is
         | not science, is math. There is a big difference: in science you
         | observe and try to understand something that already exists,
         | the universe.
         | 
         | In math (and informatics) you build your own universe with your
         | set of rules. It means that you can and should understand
         | everything (well, of course if you assume the hardware works
         | perfectly as specified).
        
           | MaxBarraclough wrote:
           | Computer science is not entirely math. Compilers, for
           | instance, care about real-world performance, not theory.
           | 
           | Physics is more mathematical than computer science, but
           | there's no doubt that physics is a science.
        
           | 29athrowaway wrote:
           | Discrete math, computer science, etc. are sciences (although
           | some people claim math is not a science... that's a
           | discussion for another day). Software engineering, like all
           | forms of engineering, is applied science plus processes based
           | on empirical evidence.
        
       | DannyB2 wrote:
       | 5 bytes? What about 5 characters?
       | 
       | When I started almost four decades ago it was then true that
       | bytes and characters were the same thing.
        
         | aYsY4dDQ2NrcNzA wrote:
         | I don't think that was guaranteed, at least not in C. Wasn't
         | that implementation-dependent?
        
           | MereInterest wrote:
           | Sort of, yeah. In C, the size of the integer primitives isn't
           | defined. What is defined is that sizeof(char) will always be
           | 1 (one). However, it is entirely implementation defined what
           | that 1 refers to. On most systems, it is the number of 8-bit
           | bytes, but it could refer to a larger or smaller amount of
           | memory.
        
             | wahern wrote:
             | > but it could refer to a larger or smaller amount of
             | memory.
             | 
             | Never smaller as the minimum limits for char require _at_
             | _least_ 8 bits. But definitely can be wider than 8 bits, as
             | is common on DSPs or some old mainframes. Though, even on
             | systems that don 't permit direct addressing of 8-bit char,
             | C implementations will sometimes emulate it.
             | 
             | FWIW, POSIX mandates a fixed, 8-bit char.
        
         | lambda_obrien wrote:
         | That's no longer true, in case you were wondering. People still
         | say `bytes <-"instead of"-> chars` sometimes but we should stop
         | that.
        
       | alphagrep12345 wrote:
       | This happened at my previous work place. They added 1 free byte
       | every time they saved some data to disk. When I asked my manager
       | about it, he said - ' There was a crash a while ago and this
       | fixed it. No one knows why, but this is what we all do after
       | that'
        
         | comboy wrote:
         | There are so many IT companies craving for coders, just saying.
        
       | cbsks wrote:
       | I abhor comments like these. It says that it needs to allocate 5
       | more bytes, and I can see that in the code, but it doesn't say
       | why the magic number is 5. The comment explains the what but not
       | the _why_. The commit message is equally as useless:
       | https://github.com/DaveGamble/cJSON/commit/65541b900c740e1d5...
       | 
       | What needs to be fixed so that this "+5" can be removed? What
       | happens if something changes and the magic number needs to be 6?
       | Also, where's the unit test?
        
         | Znafon wrote:
         | The Pull Request is at
         | https://github.com/DaveGamble/cJSON/pull/146
        
         | [deleted]
        
         | Arnavion wrote:
         | Based on the original message that did + 64 and that this
         | revised to + 5, it might be that a float64 can be serialized to
         | at most 13 ASCII characters (not true in general but might be
         | true of cjson's serializer; I didn't check). The original
         | buffer would only have 8 bytes since that's the width of the
         | f64, so 5 more bytes are needed.
        
         | mark-r wrote:
         | If you've debugged the problem to the point where you know +5
         | will be sufficient, you should know enough to fix the bug that
         | required it in the first place.
        
           | Arnavion wrote:
           | It's not a bug. It's a public API and the buffer is an input
           | from the caller. Nothing goes wrong if the buffer is smaller
           | than it ought to be. The doc comment is just warning the
           | caller they might need a larger buffer than they think they
           | do.
        
           | kfichter wrote:
           | Or, at the very least, know enough about the problem to leave
           | a comment that properly explains why you're unable to fix the
           | bug.
        
           | SkyPuncher wrote:
           | I disagree.
           | 
           | You can certainly fix known knowns, but you can't fix unknown
           | unknowns. Sometimes its safer to simply apply a rule that you
           | have confidence will work everywhere than fix that may
           | introduce unsuspecting problems.
        
         | thaumasiotes wrote:
         | > Also, where's the unit test?
         | 
         | What would you think if the following were true for a fairly
         | generalized problem:
         | 
         | - It is easy to prove a particular upper bound on memory
         | requirements when calling a function.
         | 
         | - It is very difficult to know exactly what the memory
         | requirements will be for particular input.
         | 
         | - No known input achieves the known upper bound.
         | 
         | Is it a problem if we allocate enough memory to satisfy the
         | bound we can prove, and then don't have a unit test showing we
         | can't do with less?
        
           | pmiller2 wrote:
           | If that's all the case, then, where's the proof? It might be
           | "easy to prove," but it's probably not an "I can prove this
           | trivially, on the fly" type of task.
        
           | cbsks wrote:
           | I'd be satisfied with that if I was a reviewer in the code
           | review. I'd want that (and the proofs) explicitly mentioned
           | in the comment, or at the very least in the commit message.
           | 
           | At first glance the example seems contrived to me. Assuming
           | that the code is deterministic, wouldn't it be trivial to
           | measure the memory requirement for a particular input?
        
             | thaumasiotes wrote:
             | There are a lot of potential inputs.
        
               | pantalaimon wrote:
               | so why is 5 enough? could there be an input where the
               | number needs to be 15? or 1005?
        
               | cbsks wrote:
               | But surely it would be possible to measure the memory
               | requirements for one of those potential inputs...
        
             | [deleted]
        
           | a1369209993 wrote:
           | > What would you think if the following were true for a
           | fairly generalized problem:
           | 
           | > - It is easy to prove a particular upper bound on memory
           | requirements
           | 
           | That they should have included said proof in the source code
           | (in a comment if the type system isn't expressive enough to
           | run it).
        
           | bena wrote:
           | I think the real resolution would be change the function that
           | estimates the amount of memory needed would return it with
           | the additional 5 added already.
           | 
           | That way you can trust the estimator to always tell you at
           | most how much memory you'd need.
        
         | raldi wrote:
         | In this case, the unit test would serve as excellent
         | documentation of how to trigger the bug this padding fixes.
        
         | [deleted]
        
         | [deleted]
        
       | WalterBright wrote:
       | This screams "time for a serious code review" to me. Also use an
       | array type that does overflow checking.
        
       | asveikau wrote:
       | I don't recall where, but I remember once encountering an API
       | that took a buffer and a length, and _required_ the buffer to
       | exceed the specified length by some n. Unlike this case, I think
       | the requirement was documented. Still seemed like a recipe for
       | bugs. If you need to implicitly adjust the length of the buffer
       | upwards, it 's not really the length of the buffer.
        
         | Someone wrote:
         | Recipe for bugs but possibly also a recipe for performance.
         | 
         | I'm not making any claims for this particular case, but if a
         | function taking a large buffer to modify needs some extra
         | space, and the code can get faster by storing that at the end
         | of the buffer, requiring the caller to reserve that space can
         | mean not having to copy the buffer twice, increasing memory use
         | and execution time.
         | 
         | A fairly common example is room for adding extra zero string
         | terminators, so that an vectorized algorithm walking the input
         | always hits one of them.
        
           | asveikau wrote:
           | I think there are better ways to encode that in API design
           | that are less ambiguous.
           | 
           | For example perhaps having a function or some other means to
           | compute required buffer size, in a way that makes such
           | padding requirements relatively opaque to the caller.
        
       | spaetzleesser wrote:
       | Wouldn't you wrap the allocation with an alloc/free function that
       | does the allocation with whatever extra bytes that are needed?
       | Letting the API users deal with such details seems a recipe for
       | disaster.
        
         | zelly wrote:
         | Because then you're forcing clients to use malloc rather than
         | their own allocator
        
       | gfodor wrote:
       | I was going to trash this, but it actually made me think of
       | something non-obvious: where is the line between hacks like this
       | and smart behavior? For example, we have code in systems that
       | performs network retries, and I have code in my own software that
       | performs polling operations when a request/response "should"
       | work, but in practice, a variety of reasons cause it to be less
       | robust. It seems right that something like "allocating extra
       | memory just to be safe" falls on the wrong side of defensive
       | programming vs obvious kludge, but where is the line, exactly? It
       | doesn't seem to be first vs third party code. Is it network hops?
       | That seems extremely arbitrary, especially given that modern
       | networking has characteristics closer to a bus than say,
       | accessing a disk, where such defensiveness is less sane. If
       | network boundaries are the norm where we start assuming failure
       | is likely, maybe that's why microservices are considered
       | beneficial since they result in more defensive programming in
       | practice even though in principle there's no reason that needs to
       | be the boundary? Maybe I've had too much coffee this morning.
        
         | [deleted]
        
         | zamadatix wrote:
         | Depends what you call smart behaviour I suppose. For the vast
         | majority of projects though I'd assume the following to be
         | smart behaviour:
         | 
         | - Is it robust
         | 
         | - Is it fast (enough)
         | 
         | - Is it the simplest way to achieve those levels of robustness
         | and speed
         | 
         | If the answer to the first 2 is largely "yes" and the 3rd is
         | "no" then I'd call it a bad hack. OTOH if the answer to 3 is
         | also "yes" then it's a good hack. If the answer to 1 or 2 is
         | "no" then it's not ready to ship.
         | 
         | In either case a comment saying "this is the simple/safe way,
         | performance opportunity via..." Or "this is the complicated way
         | because <reasons complicated way was needed, usually perf> and
         | it this is why it works..." is good to leave in case
         | performance requirements ever change or someone is trying to
         | figure out why you went the way you did.
        
         | kansface wrote:
         | > For example, we have code in systems that performs network
         | retries, and I have code in my own software that performs
         | polling operations when a request/response "should" work, but
         | in practice, a variety of reasons cause it to be less robust.
         | 
         | Apples and oranges. The network is unreliable and all code
         | dealing with the network should treat it as such. The file
         | system and disks are also unreliable, but are probably more
         | reliable by a factor of 10K than the network. Engineers mostly
         | choose to ignore these potential errors via unintentional
         | decisions or do stuff like let the process crash because
         | restarting an API server once a year doesn't matter. Of course,
         | not everyone has that luxury. Determining how much memory to
         | allocate ought to be perfectly deterministic which is why this
         | is such a code smell.
         | 
         | > where is the line, exactly?
         | 
         | The line to me is pretty clear in this case! There are all
         | sorts of reasons why this code would be acceptable: if fixing
         | the code to prevent the bug[s] is sufficiently onerous as to
         | cause more bugs than the fix would prevent, or is so much work
         | that it would never be undertaken, or so complex that the code
         | is un-mergable, an emergency hot fix... all acceptable so long
         | as the code is documented as such. In other words, the line
         | crossed here was in the comment. Where did the number 5 come
         | from? Where's the link to an issue tracking memory
         | allocation/estimation?
         | 
         | I believe most comments in code are worse than useless - that
         | we shouldn't merely document bad code, but write code that is
         | so blindingly obvious and idiomatic that comments detract from
         | its perfection. Comments exist when we deviate from the
         | platonic ideal -- the real world with leaky abstractions,
         | deadlines, bugs and dollars.
        
           | gfodor wrote:
           | I don't think you understood my comment - it was raising the
           | question where the line actually is, and if the line is drawn
           | more upon cultural norms than solid engineering principles.
           | That isn't an argument the code of the OP is a good idea, it
           | isn't, the question is if there are large categories of code
           | we write that ought to be defensive but aren't, or vice
           | versa.
        
         | WalterBright wrote:
         | > where is the line between hacks like this and smart behavior?
         | 
         | Understanding exactly what the cause of the problem. Hacks
         | imply a fix that "seems" to work but not understanding.
        
       | ExcavateGrandMa wrote:
       | hmmmm... That's kind of a bean counter. krkRKRkrkrkrkRKR
        
       | floatingatoll wrote:
       | UTF-8 characters can be composed of at most 4 bytes, plus a NUL
       | terminator makes 5. I have no idea if this is true but it's
       | plausible enough for a Unicode parser!
        
         | SAI_Peregrinus wrote:
         | Not UTF-8 characters, UTF-8 code points. UTF-8 characters (more
         | properly "extended grapheme clusters") can be of any length. "
         | " is one character, 25 bytes long (0xf0,0x9f,0x91,0xa8,0xe2,0x8
         | 0,0x8d,0xf0,0x9f,0x91,0xa8,0xe2,0x80,0x8d,0xf0,0x9f,0x91,0xa7,0
         | xe2,0x80,0x8d,0xf0,0x9f,0x91,0xa7) (it won't show correctly on
         | HN, since HN strips some Unicode). If you assume that code
         | point = character, your parser is buggy and doesn't comply with
         | unicode specifications.
        
           | floatingatoll wrote:
           | I don't presume to know enough about Unicode to code parsers
           | it, but I'm happy to accept the correction!
        
           | zepearl wrote:
           | Uuuhh, this is an extremely important remark/correction for
           | me, thank you.
           | 
           | So far I was thinking as well that any "character" could be
           | saved ("encoded"?) as UTF-8 by using max 4 bytes... .
        
         | rurban wrote:
         | On the character level yes, but this is for strings, so looks
         | totally broken to me.
        
           | rusk wrote:
           | The len is being passed to malloc which I recall takes number
           | of bytes, so this explanation seems plausible. I very much
           | doubt memory allocation is _"innacurate"_ , as per the
           | comment ...
        
       | avian wrote:
       | Some more clues where this comes from are in this issue:
       | https://github.com/DaveGamble/cJSON/issues/255
        
         | IshKebab wrote:
         | Pretty good advert for not writing code in C.
        
           | GuB-42 wrote:
           | More like writing C properly.
           | 
           | When you program in C, you should know exactly what each byte
           | of memory is for. If you don't do that, not only you will
           | eventually get crashes and leaks and overflows, but you also
           | miss the point of programming in C in the first place.
           | 
           | C is fast, but C is not fast because it overclocks your CPU
           | or whatever magic you can think of. An good reason why is is
           | fast is that it gives you tight control over your memory. You
           | can reuse buffers, size them precisely, allocate them exactly
           | when you need them and free them when you are finished,
           | nothing funny happening behind your back. It generally
           | results in lower memory usage, better use of cache and
           | therefore better performance.
           | 
           | If you don't manage your memory precisely, then sure, C is
           | probably not the language for you.
        
       ___________________________________________________________________
       (page generated 2020-11-23 23:00 UTC)