[HN Gopher] Designing a better strcpy
       ___________________________________________________________________
        
       Designing a better strcpy
        
       Author : signa11
       Score  : 61 points
       Date   : 2021-06-17 09:50 UTC (2 days ago)
        
 (HTM) web link (saagarjha.com)
 (TXT) w3m dump (saagarjha.com)
        
       | Tempest1981 wrote:
       | Something feels wrong about using memccpy to copy strings... ever
       | since I saw bugs where people used memcpy incorrectly.
       | 
       | And is there a wchar_t version of memccpy?
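On the wchar_t question: the standard `<wchar.h>` offers `wmemcpy`, `wmemchr` and friends, but no wide counterpart to `memccpy`. A minimal sketch of what one could look like follows; the name `wmemccpy_sketch` is hypothetical and not part of any library.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper, not a standard function: copy up to n wide
 * characters from src to dst, stopping after the first occurrence of stop.
 * Returns a pointer just past the copied stop character, or NULL if stop
 * was not found in the first n characters (the same contract as memccpy).
 * The buffers must not overlap. */
wchar_t *wmemccpy_sketch(wchar_t *restrict dst, const wchar_t *restrict src,
                         wchar_t stop, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i];
        if (src[i] == stop)
            return dst + i + 1;
    }
    return NULL;
}
```

Like `memccpy`, this does not terminate the destination when it returns NULL; the caller decides what truncation means.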
        
       | WalterBright wrote:
       | An even better solution:
       | 
       | https://www.digitalmars.com/articles/C-biggest-mistake.html
       | 
       | With the lengths of strings known, copying becomes fast, safe,
       | and trivial.
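As a rough illustration of copying with known lengths, here is a sketch of a pointer-plus-length pair and a bounds-checked copy. The `struct span` type and `span_copy` are invented for this example; they are not taken from the linked article.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical pointer-plus-length view of a byte buffer. */
struct span {
    char  *ptr;
    size_t len;
};

/* Copy src into dst if it fits. No scan for a terminator is needed
 * because both lengths are already known. Returns false, and copies
 * nothing, when dst is too small. */
bool span_copy(struct span dst, struct span src)
{
    if (src.len > dst.len)
        return false;
    if (src.len != 0)
        memcpy(dst.ptr, src.ptr, src.len);
    return true;
}
```

The whole safety check is one comparison, and the copy itself is a plain `memcpy`.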
        
         | Hackbraten wrote:
         | Not everyone is going to be happy about 8 bytes overhead for
         | every buffer.
         | 
         | Also, not every C-style API will migrate. If you consume one of
         | those, you may still have to count the length, which costs
         | cycles.
        
           | scottlamb wrote:
           | > Not everyone is going to be happy about 8 bytes overhead
           | for every buffer.
           | 
           | Then prefix it with four bytes (which, if you omit the
           | trailing NUL, adds three bytes in total). If your C strings
           | are longer than 2^32 bytes, you're probably doing something
           | wrong.
           | 
           | I also should point out that memory is typically cheap. Cache
           | is more precious, and I'd expect on balance adding a length
           | would pollute the cache less than scanning through the whole
           | string unnecessarily.
           | 
           | But I agree with your second point:
           | 
           | > Also, not every C-style API will migrate. If you consume
           | one of those, you may still have to count the length, which
           | costs cycles.
           | 
           | Similarly, the article started with this text:
           | 
           | > Like them or not, null-terminated strings are essential to
           | C, and working with them is necessary in all but the most
           | trivial programs.
           | 
           | You can do things nicely in your own code, but you still need
           | NUL-terminated strings when dealing with existing code. And
           | dealing with existing code is often the reason to pick C...
        
           | WalterBright wrote:
           | The proposal does not take away any C features.
        
           | loup-vaillant wrote:
           | Actually, it's 7 bytes (you can remove `\0` at the end), and
           | that's if you insist on being able to exceed 4GB with your
           | strings.
           | 
           | Personally, depending on the use case, I'd be willing to have
           | strings limited to 256 bytes (no overhead), 64K bytes (1 byte
           | of overhead), and 4G bytes (3 bytes of overhead). Though
           | combining them all might be quite the nightmare...
        
             | codesnik wrote:
             | you can also have something akin to MIDI encoding with
             | higher bit reserved to mark that next byte is a part of
             | length field, too. up to 128 bytes with no overhead, up to
             | 16kb with one byte overhead, and so on. Such a detail would
             | be easy to hide in the library. But I wouldn't be surprised
             | that alignment and other troubles would eat any possible
             | gain.
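A sketch of such a variable-width length prefix, with the high bit of each byte marking "more bytes follow". This version stores the low 7 bits first (LEB128-style); MIDI's variable-length quantities store the high bits first, so treat the byte order here as an assumption. The function names are hypothetical, and the decoder trusts its input.

```c
#include <stddef.h>

/* Encode n with 7 payload bits per byte, low bits first. The high bit is
 * set on every byte except the last. Returns the number of bytes written.
 * out needs room for up to 10 bytes (enough for a 64-bit size_t). */
size_t len_encode(size_t n, unsigned char *out)
{
    size_t i = 0;
    while (n >= 0x80) {
        out[i++] = (unsigned char)((n & 0x7F) | 0x80);
        n >>= 7;
    }
    out[i++] = (unsigned char)n;
    return i;
}

/* Decode a prefix written by len_encode. Stores the value in *n and
 * returns the number of prefix bytes consumed. No validation is done:
 * the input is assumed to be well formed. */
size_t len_decode(const unsigned char *in, size_t *n)
{
    size_t value = 0;
    size_t i = 0;
    unsigned shift = 0;
    for (;;) {
        unsigned char b = in[i++];
        value |= (size_t)(b & 0x7F) << shift;
        if ((b & 0x80) == 0)
            break;
        shift += 7;
    }
    *n = value;
    return i;
}
```

Lengths up to 127 take one prefix byte and lengths up to 16383 take two, which matches the "no overhead" and "one byte overhead" figures if the prefix replaces the trailing NUL.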
        
               | WalterBright wrote:
               | Won't work with UTF-8.
        
       | tedunangst wrote:
       | The concern with performance seems a little overwrought. strlcpy
       | is only slow in the bad case where it truncates, which is ideally
       | not the common case. I've never heard of or seen a performance
       | bottleneck traced to a strlcpy in the hot path.
       | 
       | If you really cared about performance, you'd be using nothing but
       | memcpy with careful length tracking. Regardless of algorithmic
       | runtime, any function that examines bytes as it copies will be
       | slower than a length based copy.
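To make the truncation cost concrete, here is a sketch that follows strlcpy's documented contract (copy at most size - 1 bytes, always terminate when size > 0, return the length of src). It is named `sketch_strlcpy` to avoid clashing with a libc version and is not any particular libc's implementation.

```c
#include <stddef.h>

/* Sketch of strlcpy semantics. Returns strlen(src); the caller detects
 * truncation by checking whether the return value is >= size. */
size_t sketch_strlcpy(char *dst, const char *src, size_t size)
{
    size_t i = 0;

    /* Copy while there is room for this byte plus a terminator. */
    while (i + 1 < size && src[i] != '\0') {
        dst[i] = src[i];
        i++;
    }
    if (size != 0)
        dst[i] = '\0';

    /* Only reached with work left to do when src was truncated: keep
     * scanning the rest of src so the full length can be returned. */
    while (src[i] != '\0')
        i++;
    return i;
}
```

When the source fits, the second loop exits immediately; the extra scan over the uncopied tail happens only on truncation.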
        
       | dataflow wrote:
       | I hate to nitpick on something so mundane and superficial, but
       | why in the world are people still writing code like this in 2020?
       | 
       |     while (--len && (*dst++ = *src++))
       |         ;
       | 
       | Dereferenced post-increments are already confusing enough as-is,
       | and yet here we have 1 pre-decrement and 2 _dereferenced_
       | _post_-increments happening on _top_ of an assignment in a
       | _conditional_, all in a single expression. Even as someone who
       | _does_ put an assignment in a conditional once in a while, this
       | still feels 100% unjustifiable to me. It's especially ironic
       | given the premise is that C code has security bugs... if the goal
       | is to avoid that, shouldn't there be even _more_ care taken to
       | avoid this kind of code?
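For comparison, here is the terse loop wrapped in a function next to a spelled-out equivalent. Both function names are hypothetical, and returning the final `dst` is an assumption made only so the two versions can be compared. Both assume len >= 1, since with len == 0 the pre-decrement wraps around.

```c
#include <stddef.h>

/* The terse loop, wrapped in a function. Copies at most len - 1 bytes
 * and stops after copying a NUL. Assumes len >= 1. */
char *copy_terse(char *dst, const char *src, size_t len)
{
    while (--len && (*dst++ = *src++))
        ;
    return dst;
}

/* The same behavior with one step per statement. */
char *copy_spelled_out(char *dst, const char *src, size_t len)
{
    for (;;) {
        len--;
        if (len == 0)
            break;              /* budget of len - 1 bytes is used up */
        char c = *src;
        *dst = c;
        dst++;
        src++;
        if (c == '\0')
            break;              /* the terminator has been copied */
    }
    return dst;
}
```

The longer form makes two details visible that the one-liner hides: the loop never writes the last byte of the budget, and the pointers advance even after the NUL is copied.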
        
         | abhishekjha wrote:
         | I guess this is so that something like this can be asked in
         | interviews to test how lucky the candidate is on that
         | particular day and hour. Otherwise everybody would pass the
         | interviews.
        
       | stefan_ wrote:
       | I don't want to use any of the str*cpy functions, all of them are
       | either braindead or missing in most libcs. At this point I'm all
       | in on snprintf(%s, foo).
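A minimal sketch of the snprintf approach with truncation detection. The wrapper name `copy_snprintf` is hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

/* Copy src into dst (capacity size) via snprintf.
 * Returns 0 if the whole string fit, -1 on truncation or output error.
 * dst is NUL-terminated whenever size > 0. */
int copy_snprintf(char *dst, size_t size, const char *src)
{
    int n = snprintf(dst, size, "%s", src);
    if (n < 0)
        return -1;                      /* output error */
    return (size_t)n < size ? 0 : -1;   /* n excludes the terminator */
}
```

Note that snprintf reports the would-be length as an `int` and has to interpret the format string on every call.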
        
         | saagarjha wrote:
         | I hate to do this, but: read the post. I mention snprintf
         | specifically: https://saagarjha.com/blog/2020/04/12/designing-
         | a-better-str.... It is neither efficient nor general-purpose as
         | a string copying routine.
        
       | [deleted]
        
       | okareaman wrote:
       | I expected to read about SIMD and copying 8 bytes at a time in 64
       | bit registers, but I guess that is a compiler optimization
        
       | rbanffy wrote:
       | Would it make sense to extend the C standard to have a sensible
       | string type?
       | 
       | /me ducks
        
         | pjmlp wrote:
         | It would, just like a sensible array type and proper
         | enumerations as well.
         | 
         | Obviously WG14 doesn't care about it.
        
       | zabzonk wrote:
       | Referring to strcpy:
       | 
       | > we can only use it if we know our destination buffer is smaller
       | than our source buffer
       | 
       | Should be the other way around, surely?
        
         | Randor wrote:
         | Absolutely,
         | 
         | That's not the only discrepancy.
         | 
         | https://pubs.opengroup.org/onlinepubs/9699919799/functions/m...
         | 
         | Note the following: "The memccpy() function does not check for
         | the overflow of the receiving memory area." "If copying takes
         | place between objects that overlap, the behavior is undefined."
         | 
         | The strxcpy he provides at the bottom doesn't look better at
         | all. I'm not sure where the author got that function. I found
         | some better variants of the proposed strxcpy function with
         | bounds checks and that provides overflow detection.
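For reference, a sketch of using memccpy as a bounded string copy. The wrapper name `copy_str_memccpy` is hypothetical and this is not the article's strxcpy; it only shows the memccpy contract plus explicit termination on truncation. As the quoted specification says, the caller must pass the true size of the destination and the buffers must not overlap.

```c
#define _XOPEN_SOURCE 700   /* memccpy is a POSIX function */
#include <stddef.h>
#include <string.h>

/* Copy the string at src into dst, which has room for size bytes.
 * Returns a pointer just past the copied NUL if src fit, or NULL if it
 * was truncated. On truncation dst is still NUL-terminated when
 * size > 0. */
char *copy_str_memccpy(char *dst, const char *src, size_t size)
{
    char *past_nul = memccpy(dst, src, '\0', size);
    if (past_nul == NULL && size != 0)
        dst[size - 1] = '\0';   /* memccpy itself does not terminate */
    return past_nul;
}
```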
        
           | saagarjha wrote:
           | I wrote strxcpy as an example of a function that satisfies
           | the requirements that I posted at the top. In doing so, it
           | becomes a useful routine for other kinds of needs, such as
           | this one: https://news.ycombinator.com/item?id=27564004
           | 
           | The reason I omitted the specific verbiage you quoted is that
           | they apply equally to all the functions. My function has an
           | important bounds check, but (like all of C) trusts that the
           | parameters you provide to it are correct. There is no
           | additional checking done because doing so in C is not
           | tenable. If implemented in a standard library it may be
           | useful to add additional heuristics to detect invalid cases,
           | but they are fundamentally best-effort and not worth showing
           | in an example.
        
         | saagarjha wrote:
         | LOL yes, thanks for catching that. I'll fix it when I'm at my
         | computer :)
         | 
         | Edit: edited. I also had a message from past me that I had
         | forgotten about:
         | 
         |     <!-- I made a mistake somewhere, didn't I? -->
        
       | saurik wrote:
       | > In the case where src fits in dst, it will return a pointer
       | past the NUL byte it placed; otherwise it returns NULL to
       | indicate a truncation.
       | 
       | It is amazing to me how personal these preferences are ;P...
       | like, I'd be much happier with an API that always returns the
       | location of the NUL byte on success; and, if the string gets
       | truncated, then it instead returns dst+len (the address of the
       | byte past the end of the buffer). This allows for chained
       | constructions that provide efficient strcat-style semantics with
       | easy error propagation, such as this example which concatenates
       | three strings (which I honestly hope I got right... I'm giving
       | reasonable odds to Saagar telling me I've coded a buffer overflow
       | by accident somewhere ;P):
       | 
       |     char buf[X]; // for any X, even 0!
       |     char *cur = buf;
       |     char *end = buf + sizeof(buf);
       |     cur = strjcpy(cur, str1, end - cur);
       |     cur = strjcpy(cur, str2, end - cur);
       |     cur = strjcpy(cur, str3, end - cur);
       |     if (cur == end) goto fail;
        
         | kolbusa wrote:
         | Did you mean to check for (cur > end)?
        
           | saurik wrote:
           | No: if the function succeeds, the address of the NUL byte
           | will always be !=end (as it must be inside of the buffer,
           | which end is not); whereas, in the case of string truncation,
           | cur will be equal to end, as the function returns dst+len
           | (which is end); and, if that error had "already" happened
           | during a previous call, it will get propagated through the
           | next call as end-cur will be 0, causing the next call to
           | immediately fail (even for a 0-length string, which is an
           | important corner-case) and return cur+0 (which is still end).
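The propagation argument can be checked with a small self-contained sketch. The `strjcpy` below is one possible implementation of the return convention described here (address of the NUL on success, dst + len on truncation), and `join3` is a hypothetical wrapper around the three-call chain.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns the address of the NUL it wrote on success, or dst + len if
 * it ran out of room first. With len == 0 it writes nothing. */
static char *strjcpy(char *restrict dst, const char *restrict src, size_t len)
{
    while (len != 0) {
        *dst = *src;
        if (*src == '\0')
            break;              /* terminator copied: dst points at it */
        dst++;
        src++;
        len--;
    }
    return dst;
}

/* Concatenate a, b and c into buf (capacity size). Returns false if the
 * result, including its terminator, did not fit. */
bool join3(char *buf, size_t size,
           const char *a, const char *b, const char *c)
{
    char *cur = buf;
    char *end = buf + size;
    cur = strjcpy(cur, a, (size_t)(end - cur));
    cur = strjcpy(cur, b, (size_t)(end - cur));
    cur = strjcpy(cur, c, (size_t)(end - cur));
    return cur != end;          /* one check covers all three calls */
}
```

Once any call returns `end`, every later call receives a length of 0 and returns `end` again, so the single comparison at the bottom is enough.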
        
         | saagarjha wrote:
         | Not seeing anything wrong, but it's C so who knows ;)
         | Thankfully, the primitive I (and memccpy) provide makes writing
         | your wrapper easy and efficient, as opposed to all the other
         | functions which don't compose at all. (From my phone) I think
         | this might work?
         | 
         |     char *strjcpy(char *dst, const char *src, size_t len) {
         |         char *result = strxcpy(dst, src, len);
         |         return result ? result - 1 : dst + len;
         |     }
        
           | saurik wrote:
           | FWIW, part of the fun of strjcpy is that it is also much
           | easier to write than strxcpy (and watch as I go further and
           | further out on a limb with sketchy C that is likely wrong,
           | lol... I _did_ test it, at least! ;P):
           | 
           |     char *strjcpy(char *restrict dst, const char *restrict src,
           |                   size_t len) {
           |         for (;; ++dst, --len)
           |             if (!len || !(*dst = *src++))
           |                 return dst;
           |     }
           | 
           | (edit) Oh, I have an even cuter implementation (which might
           | look "too clever" but actually demonstrates something
           | important about the function)! Essentially, what makes
           | strjcpy so "pure" is that the NULL return is really a
           | "special case" in strxcpy that you're having to "undo" in
           | that wrapper, whereas the semantics of strjcpy--which may
           | _sound_ a bit weird--are mapping directly to what naturally
           | terminates the loop: running out of space on one of the two
           | inputs. This purity then gets taken advantage of by the
           | caller to get such easy call chaining and error propagation,
           | as one of these loops can "continue through" into the next
           | loop without any adaptation logic.
           | 
           |     char *strjcpy(char *restrict dst, const char *restrict src,
           |                   size_t len) {
           |         for (; len && (*dst = *src++); ++dst, --len);
           |         return dst;
           |     }
           | 
           | (edit) Ok: one difference is that this implementation of
           | strjcpy (this isn't intrinsic to the return value surface I
           | described: just to the gloriously simple versions in this
           | comment; your adapted version, for example, is "fine")
           | doesn't put a NUL byte in case of truncation... though, I
           | personally am not at all sold on doing that: I want to firmly
           | fail the operation, rather than try to "use" the truncated
           | data :/. Adding that special case would still result in
           | strjcpy being simpler than strxcpy (and doesn't break its
           | semantics advantages: just make sure to return the address of
           | dst+len, not that extra NUL), but it isn't quite so amazingly
           | simpler at that point ;P.
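A sketch of the variant described in that last edit: terminate the buffer on truncation, but still return dst + len so the failure propagates through a chain. The name `strjcpy_nul` is hypothetical.

```c
#include <stddef.h>

/* Like strjcpy, but a truncated result is still NUL-terminated.
 * Returns the address of the NUL on success, or dst + len on truncation
 * (not the address of the terminator that was forced in). */
char *strjcpy_nul(char *restrict dst, const char *restrict src, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        dst[i] = src[i];
        if (src[i] == '\0')
            return dst + i;     /* success: points at the NUL */
    }
    if (len != 0)
        dst[len - 1] = '\0';    /* truncated: overwrite the last byte */
    return dst + len;           /* still reports failure to the caller */
}
```

In a chain, a call that receives len == 0 writes nothing, so the terminator placed by the call that truncated stays intact.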
        
         | [deleted]
        
       | vasama wrote:
       | Migrating to a pointer-size-pair string representation would be a
       | better use of one's time.
        
         | Gibbon1 wrote:
         | I swear part of the problem is with C there is a cargo cult
         | prohibition against passing small structs by value.
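As an illustration of what passing small structs by value looks like, here is a sketch of a two-word string view with a slicing function that takes and returns it by value. `struct str_view` and `view_sub` are hypothetical names; how such a struct is actually passed (registers or stack) depends on the platform ABI.

```c
#include <stddef.h>

/* Hypothetical read-only view: a pointer and a length, nothing else. */
struct str_view {
    const char *ptr;
    size_t      len;
};

/* Return the sub-view starting at start with at most count bytes,
 * clamped to the bounds of s. Nothing is allocated or copied; only the
 * two-word struct travels in and out. */
struct str_view view_sub(struct str_view s, size_t start, size_t count)
{
    if (start > s.len)
        start = s.len;
    if (count > s.len - start)
        count = s.len - start;
    return (struct str_view){ s.ptr + start, count };
}
```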
        
           | kzrdude wrote:
           | In other languages we create types for just about anything,
           | and like you say, it's strange that we don't do this more in
           | C.
        
             | Gibbon1 wrote:
             | Big problem with C is the standards committee just flat out
             | refuses to add an array type to C. It's deranged because if
             | you had first class arrays it'd be a lot easier to generate
             | code that takes advantage of SIMD instructions.
        
               | klodolph wrote:
               | You mean "add a second array type?"
               | 
               | C has an array type, it's just a bit wacky, and it
               | doesn't play by the same rules as other types.
        
           | b5n wrote:
           | It's not a cult, it's just that the cases where the risks of
           | passing by value would be worth any perceived advantage are
           | so few that it just doesn't make sense to even consider it.
           | 
           | It's not like it's a flimsy tribal based claim, the guidance
           | is solid.
        
           | saurik wrote:
           | It does suck that the usual 64-bit calling convention limits
           | the size of structs passed by value to 64 bits :/.
        
             | thysultan wrote:
             | You could store two 32 bit fields and cast the second to a
             | pointer when needed, and the first treat as a length.
        
           | pjmlp wrote:
           | The cargo cult goes beyond that.
           | 
           | The belief that it was created alongside UNIX from the start,
           | when it was used to port UNIX V4 into a high-level language.
           | 
           | Micro-optimizing each line of code as it is written, "because
           | it is fast", without even bothering to use a profiler.
           | 
           | The belief that only bad programmers need that kind of
           | tooling, even though lint was created alongside C, in 1979, to
           | catch already known programmer faults in using the language.
        
             | PaulDavisThe1st wrote:
             | > Micro-optimizing each line of code as it is written,
             | "because it is fast", without even bothering to use a
             | profiler.
             | 
             | With the CPU, MMU and OS architectures of that period, it
             | wasn't particularly hard to infer what was fast without
             | profiling it. The slow rise in complexity at all 3 levels
             | now makes it hard for even extremely experienced close-to-
             | the-metal programmers to understand what will be fast or
             | slow without a profiler. Times do change, in fact.
        
         | radicalcentrist wrote:
         | Are there any recommended libraries for doing this, if I'd like
         | to migrate my C codebase?
        
           | macintux wrote:
           | You might look at SDS.
           | 
           | https://github.com/antirez/sds
        
           | [deleted]
        
         | compiler-guy wrote:
         | And only works if you don't rely on any third-party libraries
         | that take normal C strings.
         | 
         | Which is to say, is unrealistic for many programs.
        
           | pjmlp wrote:
           | Depends on how much they value security; even std::string has
           | an extra null for _c_str()_ calls.
        
           | tgv wrote:
           | You can use a library that also adds a trailing \0. I used it
           | to interface C++.
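A sketch of that idea: a length-tracked string that also keeps a trailing NUL, so its data can be handed to APIs expecting ordinary C strings. `struct lstr` and `lstr_new` are hypothetical and are not the API of SDS or any other library.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical length-tracked string that stays NUL-terminated. */
struct lstr {
    size_t len;     /* bytes in data, not counting the terminator */
    char   data[];  /* len bytes followed by a '\0' */
};

/* Allocate an lstr holding a copy of the first n bytes at src.
 * Returns NULL if the size would overflow or allocation fails.
 * Release with free(). */
struct lstr *lstr_new(const char *src, size_t n)
{
    if (n > SIZE_MAX - sizeof(struct lstr) - 1)
        return NULL;
    struct lstr *s = malloc(sizeof *s + n + 1);
    if (s == NULL)
        return NULL;
    s->len = n;
    if (n != 0)
        memcpy(s->data, src, n);
    s->data[n] = '\0';          /* keeps data usable as a C string */
    return s;
}
```

Code that knows about the struct reads `len` directly, while legacy callers can be given `s->data`, at the cost of one extra byte per string.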
        
       | [deleted]
        
       ___________________________________________________________________
       (page generated 2021-06-19 23:01 UTC)